Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve database type inference #4724

Merged
merged 8 commits into from
Jun 28, 2018
Merged

Conversation

mistercrunch
Copy link
Member

@mistercrunch mistercrunch commented Mar 30, 2018

Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.

This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.

Note that the role of SupersetDataFrame isn't super clear here. I think the original intent was to derive pandas.DataFrame but this isn't allowed by the pandas lib. So it became more of a wrapper around pandas.DataFrame.

Now its role becomes even less clear here after the PR.

The longer term vision is probably to make this class a pandas DataFrame-related utility wrapper, meaning it would know how to get a dataframe give a query & cursor and will be able to do things like type-inference or other operation that are dataframe related.

@codecov-io
Copy link

codecov-io commented Mar 30, 2018

Codecov Report

Merging #4724 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4724      +/-   ##
==========================================
+ Coverage   61.31%   61.32%   +<.01%     
==========================================
  Files         369      369              
  Lines       23468    23483      +15     
  Branches     2717     2717              
==========================================
+ Hits        14390    14401      +11     
- Misses       9066     9070       +4     
  Partials       12       12
Impacted Files Coverage Δ
superset/dataframe.py 94.49% <100%> (-3.21%) ⬇️
superset/db_engine_specs.py 54.28% <100%> (+1.11%) ⬆️
superset/sql_lab.py 71.6% <100%> (-3.4%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04fc1d1...437bbf4. Read the comment docs.

column_names = [col[0] for col in cursor_description]
self.column_names = dedup(column_names)

# check whether the result set has any nested dict columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment confused me... we're not looking for nested dicts here, just dicts, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved that code from sql_lab.convert_results_to_df, but I think you're right about the comment. I didn't write that line originally.

# check whether the result set has any nested dict columns
if data:
has_dict_col = any([isinstance(c, dict) for c in data[0]])
df_data = list(data) if has_dict_col else np.array(data, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you doing this because if a column has dicts all columns will have the object type in the dataframe when initialized from the numpy array? If so, why not use infer_objects here?

df_data = np.array(data or [], dtype=object)
df = pd.DataFrame(df_data, columns=column_names).infer_objects()

Copy link
Member Author

@mistercrunch mistercrunch Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just relocating the code here. I agree that it's opaque.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the behavior with [(1, 2, {'a': 'b'})] as data, trying to understand the code. If we build the dataframe using np.array on data with a dict we get the following types for the columns:

>>> data = [(1, 2, {'a': 'b'})]
>>> df = pd.DataFrame(np.array(data, dtype=object))
>>> print(df.dtypes)
0    object
1    object
2    object
dtype: object

Here Pandas doesn't know that the first two columns are integers. OTOH, using list(data):

>>> data = [(1, 2, {'a': 'b'})]
>>> df = pd.DataFrame(list(data))
>>> print(df.dtypes)
0     int64
1     int64
2    object
dtype: object

Now we have correct type information for all columns. So it seems to me that the list(data) if has_dict_col else np.array(data, dtype=object) is trying to preserve the type information.

We can get the same result using infer_objects():

>>> data = [(1, 2, {'a': 'b'})]
>>> df = pd.DataFrame(np.array(data, dtype=object)).infer_objects()
>>> print(df.dtypes)
0     int64
1     int64
2    object
dtype: object

Copy link
Member Author

@mistercrunch mistercrunch Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the dtype=object was recently added here:
#4629
@michellethomas

It could be one of those where by fixing a bug another one popped up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added dtype=object just to solve an issue because np.array was erroring with data with different types in a tuple ([('str', 1, [1], 1.0)]). The list(data) if... logic was there already. infer_objects sounds like a good option though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, looks like infer_objects does much of what Bogdan was trying to do later on in dataframe.py, maybe we can get ride of some of that code, though I hate to touch this as it's impossible to fully understand the implications... I'll read the source for it and decide if I can remove more code.

@@ -95,7 +141,7 @@ def agg_func(cls, dtype, column_name):
# consider checking for key substring too.
if cls.is_id(column_name):
return 'count_distinct'
if (issubclass(dtype.type, np.generic) and
if (hasattr(dtype, 'tupe') and issubclass(dtype.type, np.generic) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a typo? tuple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, the intent was to check that dtype.type actually exists. I'm not 100% sure whether I actually hit a missing attr or if it was preventative.

for k in dir(ft)
if not k.startswith('_')
}
print(cls.type_code_map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

@@ -550,6 +575,11 @@ def adjust_database_uri(cls, uri, selected_schema=None):
uri.database = database
return uri

@classmethod
def get_datatype(cls, type_code):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is identical to the base class and can be removed, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the case, I know that Presto should be this method, but I'm unclear on what the base class should do to capture the most common use cases.

@mistercrunch
Copy link
Member Author

@betodealmeida I rebased and I'm pretty sure I addressed all comments, what dyou think?

@mistercrunch
Copy link
Member Author

@vylc I think this should do the trick for the infamous TIMESTAMP-related bug where the type gets lost through the visualize flow.

@mistercrunch mistercrunch force-pushed the data_types branch 2 times, most recently from a1336f2 to 6e8050b Compare June 22, 2018 18:55
Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.

This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.
@mistercrunch mistercrunch merged commit 777d876 into apache:master Jun 28, 2018
@mistercrunch mistercrunch deleted the data_types branch June 28, 2018 04:35
mistercrunch added a commit that referenced this pull request Jun 28, 2018
* Improve database type inference

Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.

This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.

* Fixing tests

* Adressing comments

* Using infer_objects

* Removing faulty line

* Addressing PrestoSpec redundant method comment

* Fix rebase issue

* Fix tests

(cherry picked from commit 777d876)
timifasubaa pushed a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018
* Improve database type inference

Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.

This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.

* Fixing tests

* Adressing comments

* Using infer_objects

* Removing faulty line

* Addressing PrestoSpec redundant method comment

* Fix rebase issue

* Fix tests
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* Improve database type inference

Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.

This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.

* Fixing tests

* Adressing comments

* Using infer_objects

* Removing faulty line

* Addressing PrestoSpec redundant method comment

* Fix rebase issue

* Fix tests
@mistercrunch mistercrunch added 🍒 0.27.0 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🍒 0.26.0 🍒 0.26.1 🍒 0.26.2 🍒 0.26.3 🍒 0.27.0 🚢 0.28.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants