Improve database type inference #4724

mistercrunch · 2018-03-30T20:53:57Z

Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.

This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.

Note that the role of SupersetDataFrame isn't super clear here. I think the original intent was to derive pandas.DataFrame but this isn't allowed by the pandas lib. So it became more of a wrapper around pandas.DataFrame.

Now its role becomes even less clear here after the PR.

The longer term vision is probably to make this class a pandas DataFrame-related utility wrapper, meaning it would know how to get a dataframe give a query & cursor and will be able to do things like type-inference or other operation that are dataframe related.

codecov-io · 2018-03-30T21:55:53Z

Codecov Report

Merging #4724 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4724      +/-   ##
==========================================
+ Coverage   61.31%   61.32%   +<.01%     
==========================================
  Files         369      369              
  Lines       23468    23483      +15     
  Branches     2717     2717              
==========================================
+ Hits        14390    14401      +11     
- Misses       9066     9070       +4     
  Partials       12       12

Impacted Files	Coverage Δ
superset/dataframe.py	`94.49% <100%> (-3.21%)`	⬇️
superset/db_engine_specs.py	`54.28% <100%> (+1.11%)`	⬆️
superset/sql_lab.py	`71.6% <100%> (-3.4%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04fc1d1...437bbf4. Read the comment docs.

betodealmeida · 2018-03-31T15:16:37Z

superset/dataframe.py

+            column_names = [col[0] for col in cursor_description]
+        self.column_names = dedup(column_names)
+
+        # check whether the result set has any nested dict columns


This comment confused me... we're not looking for nested dicts here, just dicts, right?

I just moved that code from sql_lab.convert_results_to_df, but I think you're right about the comment. I didn't write that line originally.

betodealmeida · 2018-03-31T15:52:22Z

superset/dataframe.py

+        # check whether the result set has any nested dict columns
+        if data:
+            has_dict_col = any([isinstance(c, dict) for c in data[0]])
+            df_data = list(data) if has_dict_col else np.array(data, dtype=object)


Are you doing this because if a column has dicts all columns will have the object type in the dataframe when initialized from the numpy array? If so, why not use infer_objects here?

df_data = np.array(data or [], dtype=object) df = pd.DataFrame(df_data, columns=column_names).infer_objects()

Just relocating the code here. I agree that it's opaque.

I tested the behavior with [(1, 2, {'a': 'b'})] as data, trying to understand the code. If we build the dataframe using np.array on data with a dict we get the following types for the columns:

>>> data = [(1, 2, {'a': 'b'})] >>> df = pd.DataFrame(np.array(data, dtype=object)) >>> print(df.dtypes) 0 object 1 object 2 object dtype: object

Here Pandas doesn't know that the first two columns are integers. OTOH, using list(data):

>>> data = [(1, 2, {'a': 'b'})] >>> df = pd.DataFrame(list(data)) >>> print(df.dtypes) 0 int64 1 int64 2 object dtype: object

Now we have correct type information for all columns. So it seems to me that the list(data) if has_dict_col else np.array(data, dtype=object) is trying to preserve the type information.

We can get the same result using infer_objects():

>>> data = [(1, 2, {'a': 'b'})] >>> df = pd.DataFrame(np.array(data, dtype=object)).infer_objects() >>> print(df.dtypes) 0 int64 1 int64 2 object dtype: object

Looks like the dtype=object was recently added here:
#4629
@michellethomas

It could be one of those where by fixing a bug another one popped up.

I added dtype=object just to solve an issue because np.array was erroring with data with different types in a tuple ([('str', 1, [1], 1.0)]). The list(data) if... logic was there already. infer_objects sounds like a good option though.

Wow, looks like infer_objects does much of what Bogdan was trying to do later on in dataframe.py, maybe we can get ride of some of that code, though I hate to touch this as it's impossible to fully understand the implications... I'll read the source for it and decide if I can remove more code.

betodealmeida · 2018-03-31T15:58:27Z

superset/dataframe.py

@@ -95,7 +141,7 @@ def agg_func(cls, dtype, column_name):
        # consider checking for key substring too.
        if cls.is_id(column_name):
            return 'count_distinct'
-        if (issubclass(dtype.type, np.generic) and
+        if (hasattr(dtype, 'tupe') and issubclass(dtype.type, np.generic) and


Is this a typo? tuple?

Typo, the intent was to check that dtype.type actually exists. I'm not 100% sure whether I actually hit a missing attr or if it was preventative.

betodealmeida · 2018-03-31T16:00:35Z

superset/db_engine_specs.py

+                for k in dir(ft)
+                if not k.startswith('_')
+            }
+            print(cls.type_code_map)


Remove line.

betodealmeida · 2018-03-31T16:02:15Z

superset/db_engine_specs.py

@@ -550,6 +575,11 @@ def adjust_database_uri(cls, uri, selected_schema=None):
            uri.database = database
        return uri

+    @classmethod
+    def get_datatype(cls, type_code):


This method is identical to the base class and can be removed, no?

Yes that's the case, I know that Presto should be this method, but I'm unclear on what the base class should do to capture the most common use cases.

mistercrunch · 2018-06-22T05:46:48Z

@betodealmeida I rebased and I'm pretty sure I addressed all comments, what dyou think?

mistercrunch · 2018-06-22T05:48:22Z

@vylc I think this should do the trick for the infamous TIMESTAMP-related bug where the type gets lost through the visualize flow.

Python's DBAPI isn't super clear and homogeneous on the cursor.description specification, and this PR attempts to improve inferring the datatypes returned in the cursor. This work started around Presto's TIMESTAMP type being mishandled as string as the database driver (pyhive) returns it as a string. The work here fixes this bug and does a better job at inferring MySQL and Presto types. It also creates a new method in db_engine_specs allowing for other databases engines to implement and become more precise on type-inference as needed.

* Improve database type inference Python's DBAPI isn't super clear and homogeneous on the cursor.description specification, and this PR attempts to improve inferring the datatypes returned in the cursor. This work started around Presto's TIMESTAMP type being mishandled as string as the database driver (pyhive) returns it as a string. The work here fixes this bug and does a better job at inferring MySQL and Presto types. It also creates a new method in db_engine_specs allowing for other databases engines to implement and become more precise on type-inference as needed. * Fixing tests * Adressing comments * Using infer_objects * Removing faulty line * Addressing PrestoSpec redundant method comment * Fix rebase issue * Fix tests (cherry picked from commit 777d876)

* Improve database type inference Python's DBAPI isn't super clear and homogeneous on the cursor.description specification, and this PR attempts to improve inferring the datatypes returned in the cursor. This work started around Presto's TIMESTAMP type being mishandled as string as the database driver (pyhive) returns it as a string. The work here fixes this bug and does a better job at inferring MySQL and Presto types. It also creates a new method in db_engine_specs allowing for other databases engines to implement and become more precise on type-inference as needed. * Fixing tests * Adressing comments * Using infer_objects * Removing faulty line * Addressing PrestoSpec redundant method comment * Fix rebase issue * Fix tests

mistercrunch force-pushed the data_types branch from 247dff7 to 0e6b775 Compare March 30, 2018 21:38

betodealmeida reviewed Mar 31, 2018

View reviewed changes

mistercrunch mentioned this pull request Apr 6, 2018

Cannot display Time Series Line Chart #4770

Closed

3 tasks

mistercrunch mentioned this pull request Jun 14, 2018

Introduce class attr BaseViz.enforce_numerical_metrics #5176

Merged

mistercrunch force-pushed the data_types branch from 819f52e to af8e65c Compare June 22, 2018 05:41

mistercrunch force-pushed the data_types branch 2 times, most recently from a1336f2 to 6e8050b Compare June 22, 2018 18:55

mistercrunch added 8 commits June 27, 2018 09:08

Fixing tests

381d774

Adressing comments

717dec6

Using infer_objects

2d29bb0

Removing faulty line

b630aef

Addressing PrestoSpec redundant method comment

c1c9f3f

Fix rebase issue

5127fdd

Fix tests

437bbf4

mistercrunch force-pushed the data_types branch from 6e8050b to 437bbf4 Compare June 27, 2018 23:16

mistercrunch merged commit 777d876 into apache:master Jun 28, 2018

mistercrunch deleted the data_types branch June 28, 2018 04:35

villebro mentioned this pull request Jul 23, 2018

Fix broken dedup and remove redundant db_spec logic #5467

Merged

mistercrunch added 🍒 0.26.0 🍒 0.26.1 🍒 0.26.2 🍒 0.26.3 labels Feb 27, 2024

mistercrunch added 🍒 0.27.0 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve database type inference #4724

Improve database type inference #4724

mistercrunch commented Mar 30, 2018 •

edited

Loading

codecov-io commented Mar 30, 2018 •

edited

Loading

betodealmeida Mar 31, 2018

mistercrunch Apr 2, 2018

betodealmeida Mar 31, 2018

mistercrunch Apr 2, 2018 •

edited

Loading

betodealmeida Apr 2, 2018

mistercrunch Apr 2, 2018 •

edited

Loading

michellethomas Apr 2, 2018

mistercrunch Apr 3, 2018

betodealmeida Mar 31, 2018

mistercrunch Apr 2, 2018

betodealmeida Mar 31, 2018

mistercrunch Apr 2, 2018

betodealmeida Mar 31, 2018

mistercrunch Apr 2, 2018

mistercrunch commented Jun 22, 2018

mistercrunch commented Jun 22, 2018

Improve database type inference #4724

Improve database type inference #4724

Conversation

mistercrunch commented Mar 30, 2018 • edited Loading

codecov-io commented Mar 30, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mistercrunch Apr 2, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mistercrunch Apr 2, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mistercrunch commented Jun 22, 2018

mistercrunch commented Jun 22, 2018

mistercrunch commented Mar 30, 2018 •

edited

Loading

codecov-io commented Mar 30, 2018 •

edited

Loading

mistercrunch Apr 2, 2018 •

edited

Loading

mistercrunch Apr 2, 2018 •

edited

Loading