Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support nulls in the csv uploads #10208

Merged
merged 2 commits into from Jul 6, 2020

Conversation

bkyryliuk
Copy link
Member

@bkyryliuk bkyryliuk commented Jun 30, 2020

SUMMARY

There are use cases when users would prefer to treat empty strings in the csv as nulls in the database.
This PR makes the field to be configurable and bring more transparency to how it is handled.
A default value is set to be the same as pandas, it doesn't look pretty and probably has too many values, that's why I also made it configurable.

Default values:
image

In addition to the feature, I've refactored hive_tests.py to follow pytest standards, now they can be run directly via pytest command.

TEST PLAN

  • pytest -s tests/db_engine_specs/hive_tests.py - for hive changes
  • tox -e py36 -- tests/core_tests.py::TestCore::test_import_csv - for general changes
  • tested locally for different uses case on mysql
  • tested on our staging env for hive changes

Hive testing screenshots

image
image

ADDITIONAL INFORMATION

@bkyryliuk bkyryliuk marked this pull request as draft June 30, 2020 18:29
@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch 6 times, most recently from 6de52fa to 3e3927c Compare July 1, 2020 05:42
@codecov-commenter
Copy link

codecov-commenter commented Jul 1, 2020

Codecov Report

Merging #10208 into master will decrease coverage by 6.25%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10208      +/-   ##
==========================================
- Coverage   65.70%   59.44%   -6.26%     
==========================================
  Files         594      404     -190     
  Lines       31501    13107   -18394     
  Branches     3221     3221              
==========================================
- Hits        20697     7792   -12905     
+ Misses      10623     5134    -5489     
  Partials      181      181              
Flag Coverage Δ
#javascript 59.44% <ø> (ø)
#python ?
Impacted Files Coverage Δ
superset/dao/base.py
superset/exceptions.py
superset/db_engine_specs/redshift.py
superset/dao/exceptions.py
superset/tasks/celery_app.py
superset/db_engine_specs/pinot.py
superset/views/dashboard/mixin.py
superset/examples/multi_line.py
superset/tasks/slack_util.py
superset/views/chart/filters.py
... and 175 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4281ad5...dba2d9b. Read the comment docs.

@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch 2 times, most recently from 4d3b949 to 0033d1a Compare July 1, 2020 19:08
@bkyryliuk bkyryliuk changed the title feat: support nulls in the csv uploads [WIP] feat: support nulls in the csv uploads Jul 1, 2020
@bkyryliuk bkyryliuk marked this pull request as ready for review July 1, 2020 20:06
@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch from 0033d1a to d6c0a4b Compare July 1, 2020 20:06
{},
{"if_exists": "append"},
)
def test_0_progress():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored to be pytest friendly

@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch from d6c0a4b to 3201703 Compare July 1, 2020 22:29
Copy link
Contributor

@serenajiang serenajiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly lgtm

Are integer columns affected by this change at all? I'm not sure how we deal with null integers in csv upload.

tblproperties.append(f"'serialization.null.format'='{null_values[0]}'")
tblproperties_stmt = ""
if tblproperties:
tblproperties_stmt = f"tblproperties ({', '.join(tblproperties)})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write this in some way that lets us use :params for the table properties?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue is that tblproperties is optional - and it makes it quite tricky to implement, I did not figure it out. Open to ideas here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could make this function return Tuple[text, Dict[str, str]]. The code in the outer function would be something like

sql, params = cls.get_create_table_stmt(...)
...
engine.execute(sql, **params)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _value(self) -> str:
return json.dumps(self.data)

def process_formdata(self, valuelist: List[str]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - what happens if the user passes in a malformed input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will not allow to submit the form, error message is quite cryptic, that's why description has examples

@bkyryliuk
Copy link
Member Author

mostly lgtm

Are integer columns affected by this change at all? I'm not sure how we deal with null integers in csv upload.

Thanks @serenajiang for the review
Mysql recognized the column as double and wrote null
image

Hive:
int + null value
image

@pull-request-size pull-request-size bot added size/XL and removed size/L labels Jul 2, 2020
@@ -15,7 +15,6 @@
# limitations under the License.
#
[pytest]
addopts = -ra -q
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaned it up to be able to use -vv flag for the verbose output

@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch 2 times, most recently from cdde17c to 08143e0 Compare July 2, 2020 15:50
@etr2460
Copy link
Member

etr2460 commented Jul 2, 2020

Maybe this isn't easy/possible with WTForms, but is there a way we can structure this new input so that instead of passing in a JSON encoded array, we can add individual items and they'll render as individual inputs that then get parsed correctly by the backend? Something similar to the owner fields in the crud view:

image

@bkyryliuk
Copy link
Member Author

Maybe this isn't easy/possible with WTForms, but is there a way we can structure this new input so that instead of passing in a JSON encoded array, we can add individual items and they'll render as individual inputs that then get parsed correctly by the backend? Something similar to the owner fields in the crud view:

image

@etr2460 good suggestion, I've explored FieldList but wasn't able to get it work. Owners are built using many to many ORM mapping, it wouldn't fit this use case. Hopefully this form will be shortlived and migrated to react in some foreseeable future.

@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch from 08143e0 to a225116 Compare July 2, 2020 17:40
@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch from a225116 to 63d502a Compare July 2, 2020 17:46
@bkyryliuk bkyryliuk added the v0.37 label Jul 2, 2020
@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch from 63d502a to dba2d9b Compare July 2, 2020 17:56
@@ -206,3 +210,13 @@ def at_least_one_schema_is_allowed(database: Database) -> bool:
validators=[Optional()],
widget=BS3TextFieldWidget(),
)
null_values = JsonListField(
_("Null values"),
default=config.get("CSV_DEFAULT_NA_NAMES", []),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be config["CSV_DEFAULT_NA_NAMES"]

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit, other than that LGTM. Also needs a rebase.

Refactor

Add tests, and refactor them to be pytest friendly

Use lowercase table names

Ignore isort
@bkyryliuk bkyryliuk force-pushed the bogdan/support_nulls_in_hive branch from dba2d9b to 3881fac Compare July 6, 2020 17:53
@bkyryliuk
Copy link
Member Author

Small nit, other than that LGTM. Also needs a rebase.

thanks for the review!

@bkyryliuk bkyryliuk merged commit 84f8a51 into apache:master Jul 6, 2020
@bkyryliuk bkyryliuk deleted the bogdan/support_nulls_in_hive branch July 6, 2020 20:26
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* Support more table properties for the hive upload

Refactor

Add tests, and refactor them to be pytest friendly

Use lowercase table names

Ignore isort

* Use sql params

Co-authored-by: bogdan kyryliuk <bogdankyryliuk@dropbox.com>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.37.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/XL v0.37 🚢 0.37.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants