Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hive-csv] Infer schema from csv #5267

Merged
merged 1 commit into from Jun 25, 2018

Conversation

timifasubaa
Copy link
Contributor

@timifasubaa timifasubaa commented Jun 22, 2018

Previously all column types were string. This PR leverages the tableschema package to get a more precise column type and uses it in the table creation.

@john-bodley @mistercrunch

@timifasubaa timifasubaa force-pushed the infer_types_for_csv_upload branch 2 times, most recently from 7b72ba4 to adb194b Compare June 22, 2018 01:51
filename = form.csv_file.data.filename
def convert_to_hive_type(col_type):
"""maps tableschema's types to hive types"""
if col_type == "number":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a dict for the mapping would be cleaner also would allow for people to easily add more types if need be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Added it to config

@@ -34,6 +34,7 @@ six==1.11.0
sqlalchemy==1.2.2
sqlalchemy-utils==0.32.21
sqlparse==0.2.4
tableschema==1.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’ll have to add this to setup.py as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@timifasubaa timifasubaa force-pushed the infer_types_for_csv_upload branch 4 times, most recently from 71b3dcf to ec72456 Compare June 22, 2018 18:06
@codecov-io
Copy link

codecov-io commented Jun 22, 2018

Codecov Report

Merging #5267 into master will decrease coverage by 0.01%.
The diff coverage is 13.33%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #5267      +/-   ##
=========================================
- Coverage   61.31%   61.3%   -0.02%     
=========================================
  Files         368     368              
  Lines       23449   23453       +4     
  Branches     2713    2713              
=========================================
  Hits        14378   14378              
- Misses       9059    9063       +4     
  Partials       12      12
Impacted Files Coverage Δ
superset/db_engine_specs.py 53.17% <13.33%> (-0.35%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd24f85...f6ac85b. Read the comment docs.

setup.py Outdated
@@ -90,6 +90,7 @@ def get_git_sha():
'sqlalchemy',
'sqlalchemy-utils',
'sqlparse',
'tableschema==1.1.0',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't pin in setup.py unless it's strictly required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOne

@@ -314,6 +314,14 @@ class CeleryConfig(object):
# contain all the external tables
CSV_TO_HIVE_UPLOAD_DIRECTORY = 'EXTERNAL_HIVE_TABLES/'

# This contains the mappings from the tableschema types to HIVE types
TABLESCHEMA_TO_HIVE_TYPES = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this probably belongs in db_engine_spec. I agree this isn't cut-and-dry as different versions of Hive support different types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

bucket_path = app.config['CSV_TO_HIVE_UPLOAD_S3_BUCKET']
def convert_to_hive_type(col_type):
"""maps tableschema's types to hive types"""
TABLESCHEMA_TO_HIVE_TYPES = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry given this no longer resides in the config this should be a lower case variable name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mzangie
Copy link

mzangie commented Jun 23, 2018

Hm

@timifasubaa timifasubaa merged commit b0eee12 into apache:master Jun 25, 2018
mistercrunch pushed a commit that referenced this pull request Jun 28, 2018
timifasubaa added a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018
timifasubaa added a commit to airbnb/superset-fork that referenced this pull request Sep 20, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
@mistercrunch mistercrunch added 🍒 0.26.0 🍒 0.26.1 🍒 0.26.2 🍒 0.26.3 🍒 0.27.0 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🍒 0.26.0 🍒 0.26.1 🍒 0.26.2 🍒 0.26.3 🍒 0.27.0 🚢 0.28.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants