Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add parquet upload #14449

Merged
merged 32 commits into from
Aug 31, 2021
Merged

Conversation

exemplary-citizen
Copy link
Contributor

@exemplary-citizen exemplary-citizen commented May 3, 2021

SUMMARY

Allow CSV upload form to accept parquet file. Went in this direction so as not to exacerbate what was brought up in #13834 by adding a new form specifically for parquet files. I believe small modifications can be made to this PR to accommodate feather and orc files.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

image

TEST PLAN

Added a test similar to the ones already in csv_upload_tests.py

ADDITIONAL INFORMATION

Fixes #14020

  • Has associated issue: 'Upload parquet' option #14020
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@junlincc junlincc added the data:csv Related to import/export of CSVs label May 3, 2021
@exemplary-citizen
Copy link
Contributor Author

@john-bodley @villebro just wanted to follow up on this in case it got lost in the shuffle. How should we proceed on this PR?

@villebro
Copy link
Member

@exemplary-citizen sorry for dropping the ball on this - I'll have this reviewed within the next 24 hours

@robdiciuccio
Copy link
Member

/testenv up

"iterator": True,
"keep_default_na": not form.null_values.data,
"mangle_dupe_cols": form.mangle_dupe_cols.data,
"usecols": form.usecols.data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to change the behavior of the existing CSV upload functionality by specifying columns. Can you add some tests around this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a scenario to test_import_csv that tests uploading a CSV with specific columns

"If not None, only these columns will be read from the file."
),
validators=[Optional()],
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a screenshot of the updated form UI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a screenshot to the summary above

@github-actions
Copy link
Contributor

@robdiciuccio Ephemeral environment spinning up at http://34.214.127.48:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  • While it's convenient to add this functionality to the CSV upload form, I feel this should be in a form of its own, as the majority of the current fields are specific to CSV only (IIUC only usecols is needed for Parquet upload)
  • If the CSV upload form will also handle Parquet, the title needs to be updated to reflect this. However, I'd personally prefer moving this into a form of its own.
  • it would be nice if the form could handle directories/zip files, as it's fairly common to have partitioned data that is split up into multiple Parquet files. As pandas also supports uploading from a directory path, this would be a great feature to avoid having to manually append upload each file.

Comment on lines 98 to 100
config["ALLOWED_EXTENSIONS"].intersection(config["CSV_EXTENSIONS"]),
config["ALLOWED_EXTENSIONS"].intersection(
config["CSV_EXTENSIONS"].union(config["OTHER_EXTENSIONS"])
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add the union to the error message below.

@exemplary-citizen
Copy link
Contributor Author

Yeah I agree that this probably belongs in a separate form. Just went in this direction because creating a new form would mean that we'd effectively be abandoning #13834. I'll go ahead and get started working on a new form for parquet/directory/zip unless @junlincc @srinify @Steejay have any ideas on how to better consolidate these forms. I believe wtforms has a MultipleFileField class that should allow us to upload entire directories. We can also introduce this change to the CsvToDatabaseForm so users can upload multiple CSVs at a time.

@exemplary-citizen
Copy link
Contributor Author

@villebro can you restart CI?

@exemplary-citizen
Copy link
Contributor Author

@villebro should come back green now

@villebro
Copy link
Member

@exemplary-citizen there's a linting error in one of the files. You can either setup pre-commit hooks or apply the diff below to fix the problem:

diff --git a/superset/views/database/views.py b/superset/views/database/views.py
index 8d0a92f6c..3863b165c 100644
--- a/superset/views/database/views.py
+++ b/superset/views/database/views.py
@@ -406,17 +406,23 @@ class ColumnarToDatabaseView(SimpleFormView):
     def form_get(self, form: ColumnarToDatabaseForm) -> None:
         form.if_exists.data = "fail"

-    def form_post(self, form: ColumnarToDatabaseForm) -> Response:  # pylint: disable=too-many-locals
+    def form_post(
+        self, form: ColumnarToDatabaseForm
+    ) -> Response:  # pylint: disable=too-many-locals
         database = form.con.data
         columnar_table = Table(table=form.name.data, schema=form.schema.data)
         files = form.columnar_file.data
         file_type = {file.filename.split(".")[-1] for file in files}

         if file_type == {"zip"}:
-            zipfile_ob = zipfile.ZipFile(form.columnar_file.data[0])  # pylint: disable=consider-using-with
+            zipfile_ob = zipfile.ZipFile(
+                form.columnar_file.data[0]
+            )  # pylint: disable=consider-using-with
             file_type = {filename.split(".")[-1] for filename in zipfile_ob.namelist()}
             files = [
-                io.BytesIO((zipfile_ob.open(filename).read(), filename)[0])  # pylint: disable=consider-using-with
+                io.BytesIO(
+                    (zipfile_ob.open(filename).read(), filename)[0]
+                )  # pylint: disable=consider-using-with
                 for filename in zipfile_ob.namelist()
             ]

@exemplary-citizen
Copy link
Contributor Author

@villebro took care of the code formatting

@villebro
Copy link
Member

@exemplary-citizen sorry to bother you again, but we've recently updated the version of pylint which appears to be causing some new linting issues. Do you want to fix this or is it ok if I jump in and resolve the remaining ones?

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! While testing I found some edge cases that caused trouble, but those can be improved upon later (I'll try to open up a PR for some of them; will tag you for a review when I do).

@villebro villebro merged commit d25b096 into apache:master Aug 31, 2021
@github-actions
Copy link
Contributor

Ephemeral environment shutdown and build artifacts deleted.

@exemplary-citizen
Copy link
Contributor Author

@villebro thx for pushing this through. yeah do tag me, i'll be sure to review. in the meantime i'll get started on #16046

@villebro
Copy link
Member

@villebro thx for pushing this through. yeah do tag me, i'll be sure to review. in the meantime i'll get started on #16046

That sounds great @exemplary-citizen! Oh, and I forgot; thanks so much for your patience with the review process!

opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
* allow csv upload to accept parquet file

* fix mypy

* fix if statement

* add test for specificying columns in CSV upload

* clean up test

* change order in test

* fix failures

* upload parquet to seperate table in test

* fix error message

* fix mypy again

* rename other extensions to columnar

* add new form for columnar upload

* add support for zip files

* undo csv form changes except usecols

* add more tests for zip

* isort & black

* pylint

* fix trailing space

* address more review comments

* pylint

* black

* resolve remaining issues
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 28, 2021
* allow csv upload to accept parquet file

* fix mypy

* fix if statement

* add test for specificying columns in CSV upload

* clean up test

* change order in test

* fix failures

* upload parquet to seperate table in test

* fix error message

* fix mypy again

* rename other extensions to columnar

* add new form for columnar upload

* add support for zip files

* undo csv form changes except usecols

* add more tests for zip

* isort & black

* pylint

* fix trailing space

* address more review comments

* pylint

* black

* resolve remaining issues
@LeoDiep
Copy link

LeoDiep commented Aug 16, 2023

@villebro thx for pushing this through. yeah do tag me, i'll be sure to review. in the meantime i'll get started on #16046

That sounds great @exemplary-citizen! Oh, and I forgot; thanks so much for your patience with the review process!

Hi @villebro and @exemplary-citizen , I am using Superset v2.1.0 docker compose and I couldn't upload a parquet file to Superset. Is this request get deprecated in new version?
image

@LeoDiep
Copy link

LeoDiep commented Aug 16, 2023

@villebro thx for pushing this through. yeah do tag me, i'll be sure to review. in the meantime i'll get started on #16046

That sounds great @exemplary-citizen! Oh, and I forgot; thanks so much for your patience with the review process!

Hi @villebro and @exemplary-citizen , I am using Superset v2.1.0 docker compose and I couldn't upload a parquet file to Superset. Is this request get deprecated in new version? image

I am sorry for the question above, I see the Upload columnar file to database now.
I am trying to programatic upload file from S3 to Superset db because the performance of slice query to Athena via pyAthena are so slow compare to postgresql query directly to superset db.
As I can see Athena doesn't receive 25 requests at the same time (my dashboard has 25 charts in 1 page) but doing it one by one and send the result back to Superset. I have tried domain sharding, increase gunicorn workers + threads but my Superset on docker compose didn't improve performance.

Is there a way to programatically import parquet files to superset db? Thanks

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.4.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels data:csv Related to import/export of CSVs size/XL 🚢 1.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'Upload parquet' option
7 participants