Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGFIX]: Check datatype of results before converting to DataFrame #4108

Merged
merged 2 commits into from
Jan 24, 2018

Conversation

marcusianlevine
Copy link
Contributor

@marcusianlevine marcusianlevine commented Dec 22, 2017

The merged hotfix #2412 resolved an issue with Presto/Hive columns #2398 which are lists of dictionaries.

However, this introduced issue #3934 with traditional RDB datasources like MS SQL Server, which only return lists of lists. Wrapping such results in an outer list causes pandas to interpret the list as a single column.

This PR attempts to resolve that issue by introspecting the datatypes of the columns in the first row of the returned data. The type check may not be checking for exactly the right thing, but we likely need some sort of conditional to avoid wrapping a list of lists in an outer list

@marcusianlevine marcusianlevine changed the title Bugfix: Check datatype of results before converting to DataFrame [BUGFIX]: Check datatype of results before converting to DataFrame Dec 23, 2017
@xrmx
Copy link
Contributor

xrmx commented Dec 23, 2017

Could you please clean the branch from other people commits?

Copy link
Contributor

@xrmx xrmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add a test so we don't regress.

@@ -231,11 +231,12 @@ def handle_error(msg):

# check whether the result set is comprised of lists or dict
if data and len(data) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if data:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you're right the len check is redundant

df_data = np.array(data)
else:
first_row = data[0]
first_row_types = set([type(c) for c in first_row])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this more readable?

any([isinstance(c, dict) for c in first_row])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I just took a stab at refactoring for clarity

@marcusianlevine marcusianlevine force-pushed the df-np-conversion branch 2 times, most recently from 6959b35 to ea14efe Compare December 23, 2017 15:05
@marcusianlevine
Copy link
Contributor Author

@xrmx how would you recommend testing this? AFAIK the only reason for using list(data) instead of np.array(data) would be when reading certain nested column types from Hive or Presto

The existing sqllab_tests.py has queries in it and those tests seem to be passing, but we only test against MySQL, SQLite, and Postgres from what I can tell.

Any suggestions how we might simulate an embedded dict column from the test suite?

@xrmx
Copy link
Contributor

xrmx commented Dec 23, 2017

The easiest way to test that code would be to move it to an helper and unit test it.

fix type checking

fix conditional checks

remove trailing whitespace and fix df_data fallback def

actually remove trailing whitespace

generalized type check to check all columns for dict

refactor dict col check
@marcusianlevine marcusianlevine force-pushed the df-np-conversion branch 2 times, most recently from 5db8d3e to 29be9c2 Compare December 24, 2017 00:52
@marcusianlevine
Copy link
Contributor Author

Ok moved the df conversion logic to helper and wrote two simple unit tests.

Not sure the tests actually prevent regression because I'm mocking the query results...

add missing newlines

another missing newline

fix quotes

more quote fixes
@marcusianlevine
Copy link
Contributor Author

@xrmx do you think this is ready to merge?

@mistercrunch mistercrunch merged commit 4bc5fe5 into apache:master Jan 24, 2018
mistercrunch added a commit that referenced this pull request Jan 27, 2018
@marcusianlevine marcusianlevine deleted the df-np-conversion branch February 3, 2018 03:10
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
…pache#4108)

* conditional check on datatype of results before converting to df

fix type checking

fix conditional checks

remove trailing whitespace and fix df_data fallback def

actually remove trailing whitespace

generalized type check to check all columns for dict

refactor dict col check

* move df conversion to helper and add unit test

add missing newlines

another missing newline

fix quotes

more quote fixes
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
…pache#4108)

* conditional check on datatype of results before converting to df

fix type checking

fix conditional checks

remove trailing whitespace and fix df_data fallback def

actually remove trailing whitespace

generalized type check to check all columns for dict

refactor dict col check

* move df conversion to helper and add unit test

add missing newlines

another missing newline

fix quotes

more quote fixes
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants