Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-15A] Proposal for inferring temporal formatting and parsing #7656

Closed
john-bodley opened this issue Jun 5, 2019 · 1 comment
Closed
Assignees
Labels
enhancement:request Enhancement request submitted by anyone from the community sip Superset Improvement Proposal
Projects

Comments

@john-bodley
Copy link
Member

john-bodley commented Jun 5, 2019

[SIP-15A] Proposal for inferring temporal formatting and parsing

Motivation

In SIP-15 we surfaced examples were lexicographical sorting could result in incorrect time intervals when the optional configuration fields weren't specified. Initially the proposal was to ensure all temporal fields were cast to a timestamp and filter comparisons were between the appropriate timestamp (or similar) types, i.e., for Presto:

DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00'

Additionally if all temporal fields were transformed to a timestamp the time grains would also work correctly (the current logic assumes that said type can be successfully cast to a timestamp which is not always the case).

@mistercrunch correctly pointed out out that not all query planners would be able to rewrite these filters to take advantage of indexes (if present) resulting in sub-performant queries and thus the status quo of ensuring the left-hand-side (LHS) of the filter comparison remains unchanged and that the right-hand-side (RHS) formats the Python datetime object appropriately, i.e.,

ds >= '2018-01-01'

There are two problems with the current functionality:

  1. One must explicitly define the format of the temporal column for non date/date-time like types, i.e., strings and numbers as there are multiple temporal encodings, e.g. %Y-%m-%d, %Y-%d-%m, epoch timestamp (in seconds), epoch timestamp (in milliseconds), etc.

  2. The time grains incorrectly assume that the temporal column can simply be cast to timestamp (or equivalent) type.

Really this can be seen as two conversions:

  1. Converting a Python datetime object into the appropriate database type for filtering.
  2. Converting a database type into a timestamp (or equivalent) which is necessary for the time grain transformations for grouping.

Note that the convert_dttm handles both of these already for date/date-time like types so the problem really lies with string like and numeric types which have temporal encoding.

Proposed Change

We propose the following to address the two problems outlined above:

Format Inferencing

Rather than having to explicitly define the format for all non date/date-time temporal columns it would be great to infer this from a sample. There are a few Python libraries (arrow, datetutil, maya, etc.) which can parse non a priori defined date-time formats, i.e.,

>>> from dateutil.parser import parse
>>> parse('2018-01-01')
datetime.datetime(2018, 1, 1, 0, 0)

Sadly none of these libraries will provide the underlying format though there are ways of inferring it. It's also worth pointing out that a single value could be expressed by multiple formats, e.g. 2018-01-01 could be either %Y-%m-%d or %Y-%d-%m. Taking a sample of values should help further restrict the set of plausible formats.

Regarding epoch timestamps which can be defined via integers (representing seconds or milliseconds) or floats, to differentiate between these one could use basis logic like www.epochconverter.com where if there are less than 12 digits the timestamp is assumed to be in seconds, 12 - 14 as milliseconds, and 15+ as microseconds (see here for detail and the rare occurrences where this fails).

We propose the following solution:

  1. Add a database specific column name/type to format mapping for non-explicit temporal columns, i.e., we use the ds column of type VARCHAR with the %Y-%m-%d format to represent a date-stamp.

  2. Whenever a SQLAlchemy column is marked as temporal and the column type is not explicitly a temporal type and no formatting is present then:

    • If it exists in the mapping apply the format.
    • Otherwise sample the column (using say 100 values) and try to infer the best format.
  3. Rather than using a free-form text box the python_date_format field should represent a selector with the various types where either the mapped or best format is selected. This allows users to override the format if the inferencing was incorrect. Why a drop-down and not pre-populated free-form text? Mostly because some formats are not overly human readable and have an example/more detailed description would help.

  4. Deprecate the database_expression field. This logic should be obtainable via i) using the python_date_format field, ii) using a custom expression, or iii) ensuring the type mapping exists in the engine spec.

Time Grains

Given that the format of the type is already inferred we simply need to provide at the engine level in db_engine_specs.py functionality (by ways of a SQL expression) to map from a string or numerical type to a timestamp which will then be wrapped inside of the SQL expressions representing the time_grain_functions, i.e., for the example of a date encoded as a string, for the Presto engine we would use the DATE_PARSE(string, format) UDF. Note there is already some logic here regarding converting various types to a timestamp.

For reference here's a few patterns for date-time formats:

It seems SQLAlchemy doesn't provide any abstraction and thus the only viable solution may be to explicitly define a mapping of the various date-time formats similar to this.

ISO 8601

For string like temporal columns they must adhere to the ISO 8601. The reason being is strings use lexicographical ordering thus we need to ensure the representation coincides with the chronological ordering which is the case for the ISO 8601 format.

For example say you had dates of the form %m/%d/%Y (MM/DD/YYYY in ISO 8601 syntax) and we were fixed on not converting types to timestamps then the date filtering would fail,

>>> '12/31/2018' < '01/01/2019'
False

as opposed to dates of the form %Y-%m-%d (YYYY-MM-DD in ISO 8601 syntax),

>>> '2018-12-31' < '2019-01-01'
True

Note if a string column doesn't adhere to the ISO 8601 format one will have to use a SQL expression in order to convert the column to either a date or timestamp (and possibly forgo leveraging the index), i.e., for Presto this would be defined as:

  • type: TIMESTAMP
  • expression: DATE_PARSE(ds, '%m/%d/%Y')

New or Changed Public Interfaces

N/A.

New dependencies

Depends on what Python package we use for inferring the date-time format.

Migration Plan and Compatibility

A migration would be needed to:

  1. Add support for the database specific column name/type mapping.
  2. Re-encoding the table_columns.python_date_format column.
  3. Bulk inferring temporal formats.

Note the migration and change to type interfering should be rolled out in conjunction with SIP-15 given that by remedying the lexicographical sorting issue outlined in SIP-15 we would fundamentally be changing the time filters for existing charts. Please refer to here for more details.

Rejected Alternatives

See SIP-15.

to: @agrawaldevesh @betodealmeida @mistercrunch @michellethomas @villebro

@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the label #enhancement to this issue, with a confidence of 0.69. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@issue-label-bot issue-label-bot bot added the enhancement:request Enhancement request submitted by anyone from the community label Jun 5, 2019
@john-bodley john-bodley added the sip Superset Improvement Proposal label Jun 5, 2019
@john-bodley john-bodley changed the title [SIP-21] Proposal for inferring temporal formatting and parsing [SIP-15A] Proposal for inferring temporal formatting and parsing Jun 6, 2019
@mistercrunch mistercrunch added this to pre-DISCUSS in SIPs Feb 24, 2020
@apache apache locked and limited conversation to collaborators Feb 2, 2022
@geido geido converted this issue into discussion #18385 Feb 2, 2022
@srinify srinify moved this from Inactive Discussion to Active Discussion in SIPs Mar 9, 2022
@michael-s-molina michael-s-molina moved this from Active Discussion to Inactive Discussion in SIPs Mar 9, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
enhancement:request Enhancement request submitted by anyone from the community sip Superset Improvement Proposal
Projects
Status: Denied / Closed / Discarded
SIPs
INACTIVE DISCUSSION
Development

No branches or pull requests

1 participant