Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sql_lab): Add custom timestamp type for literal casting for presto timestamps #13082

Merged
merged 16 commits into from
Apr 20, 2022

Conversation

kekwan
Copy link
Contributor

@kekwan kekwan commented Feb 11, 2021

SUMMARY

Described in detail in: #13009 but described here again just for clarity.

When previewing a table in SQL Lab, a SELECT * statement is generated based on table data and metadata. This particular bug occur whens previewing tables with TIMESTAMP partition for Presto/Trino data sources. Unlike some other databases, Presto doesn't automatically convert between varchars and other types. In particular, TIMESTAMP needs to be explicitly casted. The simplest way is to use the type constructor TIMESTAMP before the value.

Therefore, where_latest_partition in presto.py needs to cast the TIMESTAMP when it generates the WHERE block in the preview table statement. Accomplish this by creating new SQLAlchemy Type which subclasses TypeDecorator adding additional functionality to existing TIMESTAMP type. For each partition column, I check if the type is TIMESTAMP, and if it is I instantiate the custom TimeStamp class in the Column object which will call process_bind_params when the query compiles.

Invalid syntax (generated by /<int:pk>/table/<table_name>/<schema_name>/ -> get_table_metadata -> select_star -> where_latest_partition)

presto:events> SELECT "event_type" AS "event_type",
            ->        "functional_domain" AS "functional_domain",
            ->        "falcon_instance" AS "falcon_instance",
            ->        "trace_id" AS "trace_id",
            ->        "tenant" AS "tenant",
            ->        "core_tenant" AS "core_tenant",
            ->        "event_timestamp" AS "event_timestamp",
            ->        "table_id" AS "table_id",
            ->        "dtr_workspace_id" AS "dtr_workspace_id",
            ->        "connectors" AS "connectors",
            ->        "table_type" AS "table_type",
            ->        "row_count" AS "row_count",
            ->        "field_count" AS "field_count",
            ->        "partition_count" AS "partition_count",
            ->        "data_size_bytes" AS "data_size_bytes",
            ->        "datetimepartition" AS "datetimepartition",
            ->        "etl_ts" AS "etl_ts",
            ->        "insert_try_number" AS "insert_try_number"
            -> FROM "events"."data_tables"
            -> WHERE "datetimepartition" = '2021-02-11 18:00:00.000'
            ->   AND "etl_ts" = '2021-02-11 18:50:00.000'
            ->   AND "insert_try_number" = 1
            -> LIMIT 100;
Query 20210211_191554_05354_4nnfj failed: line 20:27: Cannot apply operator: timestamp(3) = varchar(23)

Proper syntax

presto:events> SELECT "event_type" AS "event_type",
            ->        "functional_domain" AS "functional_domain",
            ->        "falcon_instance" AS "falcon_instance",
            ->        "trace_id" AS "trace_id",
            ->        "tenant" AS "tenant",
            ->        "core_tenant" AS "core_tenant",
            ->        "event_timestamp" AS "event_timestamp",
            ->        "table_id" AS "table_id",
            ->        "dtr_workspace_id" AS "dtr_workspace_id",
            ->        "connectors" AS "connectors",
            ->        "table_type" AS "table_type",
            ->        "row_count" AS "row_count",
            ->        "field_count" AS "field_count",
            ->        "partition_count" AS "partition_count",
            ->        "data_size_bytes" AS "data_size_bytes",
            ->        "datetimepartition" AS "datetimepartition",
            ->        "etl_ts" AS "etl_ts",
            ->        "insert_try_number" AS "insert_try_number"
            -> FROM "events"."data_tables"
            -> WHERE "datetimepartition" = TIMESTAMP '2021-02-11 18:00:00.000'
            ->   AND "etl_ts" = TIMESTAMP '2021-02-11 18:50:00.000'
            ->   AND "insert_try_number" = 1
            -> LIMIT 100

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before
Screen Shot 2021-02-11 at 10 34 46 AM

After
Screen Shot 2021-02-11 at 10 53 39 AM

TEST PLAN

ADDITIONAL INFORMATION

@codecov-io
Copy link

codecov-io commented Feb 11, 2021

Codecov Report

Merging #13082 (a06038b) into master (2ce7982) will increase coverage by 15.74%.
The diff coverage is 42.85%.

@@             Coverage Diff             @@
##           master   #13082       +/-   ##
===========================================
+ Coverage   53.06%   68.81%   +15.74%     
===========================================
  Files         489      697      +208     
  Lines       17314    38411    +21097     
  Branches     4482        0     -4482     
===========================================
+ Hits         9187    26431    +17244     
- Misses       8127    11980     +3853     
Flag Coverage Δ
cypress ?
hive 44.03% <26.42%> (?)
mysql 68.42% <42.85%> (?)
postgres 68.45% <42.85%> (?)
presto 43.91% <26.42%> (?)
python 68.81% <42.85%> (?)
sqlite 68.26% <42.85%> (?)
unit 39.75% <25.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/examples/birth_names.py 71.29% <ø> (ø)
...s/260bf0649a77_migrate_x_dateunit_in_time_range.py 0.00% <0.00%> (ø)
...ons/versions/41ce8799acc3_rename_pie_label_type.py 0.00% <0.00%> (ø)
superset/utils/core.py 89.83% <52.00%> (ø)
superset/views/core.py 77.02% <60.00%> (ø)
superset/utils/date_parser.py 96.63% <75.00%> (ø)
superset/db_engine_specs/presto.py 88.70% <80.00%> (ø)
superset/models/sql_types/presto_sql_types.py 80.39% <84.61%> (ø)
superset/charts/commands/exceptions.py 91.48% <87.50%> (ø)
superset/db_engine_specs/elasticsearch.py 92.85% <94.11%> (ø)
... and 1187 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad380fd...a06038b. Read the comment docs.

Used for in-line rendering of TIMESTAMP data type
as Presto does not support automatic casting.
"""
return "TIMESTAMP '%s'" % value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing EOF (CI will fail)

@etr2460
Copy link
Member

etr2460 commented Feb 11, 2021

@john-bodley this looks like something you'd be very interested in

@pull-request-size pull-request-size bot added size/M and removed size/S labels Feb 12, 2021
@@ -911,10 +912,14 @@ def where_latest_partition( # pylint: disable=too-many-arguments
if values is None:
return None

column_names = {column.get("name") for column in columns or []}
column_names = {column.get("name") : column.get('type') for column in columns or []}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be column_type_by_name and thus line 918 will be easier to grok. In that case you could roll lines 918 and 919 into one.

for col_name, value in zip(col_names, values):
if col_name in column_names:
query = query.where(Column(col_name) == value)
col_type = column_names.get(col_name)
if col_type == 'TIMESTAMP':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the existing logic is wrong but I wonder if SQLAlchemy has a mechanism for auto-casting between the type of value and the column type for the comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be ideal solution or getting Presto/Trino to support auto-casting.

I've looked into it a bit and tried a few things but couldn't get any auto-casting. I've tried to pass in the default SQLAlchemy data types e.g. types.TIMESTAMP to the Column object as well as using a literal clause but the default data types doesn't seem to know how to auto-cast the type of value. I'm guessing because each DB dialect casts types differently so its up to the user to cast it yourself using one of these approaches to apply SQL-level binding

I have only done this casting for 2 Trino data types TIMESTAMP and DATE in this PR since these are the most popular partition column data types for our use case. But all of these data types besides the basic string, integer, boolean types require explicit casting so this is still an issue for other data types.

@junlincc junlincc added the new:contributor The author is a new contributor label Feb 12, 2021
if col_name in column_names:
query = query.where(Column(col_name) == value)
if col_name in column_type_by_name:
col_type = column_type_by_name.get(col_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
col_type = column_type_by_name.get(col_name)

query = query.where(Column(col_name) == value)
if col_name in column_type_by_name:
col_type = column_type_by_name.get(col_name)
if col_type == 'TIMESTAMP':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if col_type == 'TIMESTAMP':
if column_type_by_name.get(col_name) == 'TIMESTAMP':

@john-bodley
Copy link
Member

@kekwan you'll need to fix a couple of linting issues.

@kekwan
Copy link
Contributor Author

kekwan commented Feb 17, 2021

thanks @john-bodley fixed the linting issues now 😅

@kekwan
Copy link
Contributor Author

kekwan commented Feb 24, 2021

@john-bodley can i get another look at this when u get a chance? thanks!

Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
Comment on lines -916 to +927
if col_name in column_names:
query = query.where(Column(col_name) == value)
if col_name in column_type_by_name:
if column_type_by_name.get(col_name) == "TIMESTAMP":
query = query.where(Column(col_name, TimeStamp()) == value)
elif column_type_by_name.get(col_name) == "DATE":
query = query.where(Column(col_name, Date()) == value)
else:
query = query.where(Column(col_name) == value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we couldn't just use the existing convert_dttm here, as we should already have type casting logic in place for these types of conversions?

@classmethod
def convert_dttm(cls, target_type: str, dttm: datetime) -> Optional[str]:
tt = target_type.upper()
if tt == utils.TemporalType.DATE:
return f"""from_iso8601_date('{dttm.date().isoformat()}')"""
if tt == utils.TemporalType.TIMESTAMP:
return f"""from_iso8601_timestamp('{dttm.isoformat(timespec="microseconds")}')""" # pylint: disable=line-too-long
return None

Something like query = query.where(Column(col_name) == cls.convert_dttm(column_type_by_name.get(col_name), value) (not tested! thinking out loud)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly possible but the drawbacks I see to re-using this existing method is that:

  1. We have to explicitly cast the string value to a python datetime to call the method only to get it converted back to a string with .isoformat.
  2. from_iso8601_x doesn't seem to be a recommended function anymore in the latest versions of Presto/Trino https://trino.io/docs/current/language/types.html#date-and-time
  3. This method only works if your date/timestamps are in ISO 8601 format.
  4. In the future, if more Trino types are to be supported in previewing data with partitions, we will have to create new types like this anyways. Also, I think this is more the sqlalchemy friendly way.

@andreas-eberle
Copy link
Contributor

What is the status of this PR? This makes it impossible to work with timeseries data on Trino. :(

@stale
Copy link

stale bot commented Jun 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Jun 11, 2021
@junlincc
Copy link
Member

@villebro what do we need to do to push this PR forward?

@stale stale bot removed the inactive Inactive for >= 30 days label Oct 19, 2021
@stale
Copy link

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 16, 2022
@stale stale bot removed the inactive Inactive for >= 30 days label Apr 20, 2022
@john-bodley john-bodley merged commit 1b55778 into apache:master Apr 20, 2022
hughhhh pushed a commit to hve-labs/superset that referenced this pull request May 11, 2022
…o timestamps (apache#13082)

* Add custom timestamp type for literal casting for presto timestamps

* Remove typo in comment

* Use process_bind_params as in sqla docs

* Uncommit local superset config

* Add DATE literal casting

* Fix lint errors and change var name

* Get rid of col_type and whitespace

* Fix linting

* Fix arg type

* Fix isort lint error

* ran black and isort locally..

* accidentally removed EOF

* Dont need eof

* Use newer string formatting style from comments

Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>

* Trigger notification

* Trigger notification

Co-authored-by: Kenny Kwan <kennykwan@salesforce.com>
Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
…o timestamps (apache#13082)

* Add custom timestamp type for literal casting for presto timestamps

* Remove typo in comment

* Use process_bind_params as in sqla docs

* Uncommit local superset config

* Add DATE literal casting

* Fix lint errors and change var name

* Get rid of col_type and whitespace

* Fix linting

* Fix arg type

* Fix isort lint error

* ran black and isort locally..

* accidentally removed EOF

* Dont need eof

* Use newer string formatting style from comments

Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>

* Trigger notification

* Trigger notification

Co-authored-by: Kenny Kwan <kennykwan@salesforce.com>
Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels new:contributor The author is a new contributor size/M 🚢 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants