fix: change the validation logic for python_date_format #25510

mapledan · 2023-10-04T10:03:16Z

SUMMARY

The PUT /api/v1/dataset/{id} endpoint includes the python_date_format validation.
A PR #24113 optimizes the validation, but there's still an issue with the '%Y%m%d' format.

According to the DATETIME FORMAT description, "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp.". Therefore, the validation should be more flexible.

In this PR, I have modified the logic to use the strftime function for validating the format string.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Edit the dataest.
Go to the COLUMNS tab.
Choose the date string column.
Edit the DATETIME FORMAT to '%Y%m%d'.
Save the dataset.
The column can serve as the temporal filter, and the SQL expression can format the datetime to a date string using the python_date_format.

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

sfirke · 2023-10-04T15:12:30Z

Thanks for submitting this! ping @Antonio-RiveroMartnez and @jfrag1 for review, they worked on that PR you linked.

john-bodley · 2023-10-04T17:30:48Z

Thanks @mapledan for the PR. Would you mind completing the PR description to include testing instructions? Additionally adding unit tests would be highly beneficial.

cc @betodealmeida @jfrag1

mapledan · 2023-10-05T11:49:11Z

Sure, I have added the testing instructions and included unit tests as well.

john-bodley · 2023-11-16T20:37:14Z

superset/datasets/schemas.py

+    if value in ("epoch_s", "epoch_ms"):
+        return
+    try:
+        datetime.now().strftime(value or "")


@mapledan would you mind using the dateutil.parser.isoparse method instead?

Sure, I can use the dateutil.parser.isoparse method to make it support a more accurate ISO 8601 format than the original regular expression validation.
However, let me confirm again, does python_date_format only support the ISO 8601 format?
Because based on the description I read, it is written as follows: "If the timestamp format does not adhere to the ISO 8601 standard, you will need to define an expression and type for transforming the string into a date or timestamp."
That's why I chose to use strftime for validation to handle the user's defined expression.

@mapledan it should only support ISO 8601—in addition to epoch_s and epoch_ms. The reason being is ISO 8601 guarantees lexigraphical ordering when comparing strings.

This is mentioned throughout the code.

Sorry, I missed the preceding sentence: 'which needs to adhere to the ISO 8601 standard to ensure that the lexicographical ordering coincides with the chronological ordering.'

In my test case, the dateutil.parser.isoparse strictly follows the ISO 8601 standard, so it does not allow YYYY/MM/DD format.
This will impact PR (#24113) as it allows the slash format.
And I traced the python_date_format.
It is using by pandas.to_datetime in superset.utils.core.normalize_dttm_col.

Which should we use as the validation, thepandas.to_datetime method, regex like (?P<date>%Y([-/]?%m([-/]?%d)?)?)([\sT](?P<time>%H(:%M(:%S(\.%f)?)?)?))?, or the dateutil.parser.isoparse method?

Thanks @mapledan for flagging the other PR. Per the tooltip (see attached) it clearly states that the format should be ISO 8601, though the placeholder text uses /. Granted the / (in addition to -) does adhere to lexicographical ordering, but we should strictly adhere to the ISO 8601 standard for historical reasons.

@jfrag1 this PR reverts the logic you defined in #24113. I've also authored #26017 which updates the placeholder text which likely lead people astray.

@mapledan feel free to merge the logic from #26017 into your PR if you think it makes sense to have both changes combined.

I think if we want to move towards strict ISO, we definitely need a migration to update existing python_date_format's that currently use /. Without a migration, any existing datasets with columns that use / will not be able to be updated without changing the python_date_format since they'll fail the validation.

@jfrag1 a migration isn't possible as the string format is specified by the underly column in the dataset. Granted this change has the potential to churn users who were exposed to the relaxation in #24113, i.e., since 3.0 and thus they would need to update there dataset definition to use a SQL expression instead.

@michael-s-molina this is likely another great example of a breaking change and thus we likely need to wait until 4.0 to re-restrict the eligible formats.

@john-bodley #24113 didn't expose a new relaxation to users; that PR was a follow-up to #23678, which changed the endpoint used to update dataset definitions. The old endpoint did no validation at all for python_date_format, so there were people using slashes before that too (since the placeholder indicated to do so).

Also I'm not sure what you mean by "a migration isn't possible", can't we have a migration that updates the table that stores column definitions?

@jfrag1 the column in question isn't in the metadata database it's in the underlying analytical database(s), i.e., Trino, Hive, etc. where the temporal format is defined and thus Superset merely reflects how the data is defined. It's difficult to estimate the impact of non-strict ISO 8601 validation, though since #24113 it only impacts newly registered datasets where the format is %Y/%m/%d[...]. Organizations likely have many registered temporal columns where the format is complete nonsense pre-#23678.

I think there's two option:

Relax requiring the ISO 8601 standard to include /, i.e., what is currently implemented (for right or wrong)*. This is potentially risky as the ISO 8601 standard guarantees lexicographical ordering** whereas custom formatters don't.

Wait until 4.0 and mention in UPDATING.md the (re)restriction that only ISO 8601 formats are acceptable (enforced by way of validation at the API and database level) and that dataset owners will need to use a SQL expression instead to convert their string columns of the form %Y/%m/%d etc. to a DATE, DATETIME, etc. type.

* Note due to lack of validation prior to #23678 (when we switched to the new RESTful API) there are likely organizations which have datasets which have a slew of formats in table_columns.python_datetime_format column which don't adhere to the current format and thus wont work as expected. I was able to confirm this in Airbnb's Superset metadata database. As part of 4.0 we should add a database level CHECK constraint for the table_columns.python_datetime_format column which will force admins and dataset owners to correct any violating temporal columns.

** Lexicographical ordering is essential when we're filtering temporal string columns via the >, >=, <, and <= operators.

This is a classic "shift left" problem where the underlying validation should also reside in the database layer. This is mentioned in [SIP-99C] Proposal for model and business validation and something that has been somewhat of an achilles heal for community over the years.

tests/unit_tests/datasets/schema_tests.py

superset/datasets/schemas.py

codecov · 2023-11-18T07:17:02Z

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (04445a3) 69.48% compared to head (daf90cb) 59.59%.

Files	Patch %	Lines
superset/daos/dataset.py	29.41%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #25510      +/-   ##
==========================================
- Coverage   69.48%   59.59%   -9.90%     
==========================================
  Files        1894     1894              
  Lines       74151    74173      +22     
  Branches     8243     8243              
==========================================
- Hits        51527    44200    -7327     
- Misses      20555    27904    +7349     
  Partials     2069     2069

Flag	Coverage Δ
hive	`53.78% <31.03%> (-0.02%)`	⬇️
javascript	`56.86% <ø> (ø)`
mysql	`?`
postgres	`?`
presto	`53.73% <31.03%> (-0.02%)`	⬇️
python	`62.50% <58.62%> (-20.52%)`	⬇️
sqlite	`?`
unit	`56.40% <58.62%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

john-bodley · 2024-01-18T23:34:49Z

@mapledan and @michael-s-molina I sense this PR should be resurrected and merged as part of the 4.0 breaking window. I realize that the proposal window has closed, but this is actually a (breaking) fix and thus I gather it's exempt from those rules.

@mapledan in addition to the current logic you've outlined here I sense the PR should be updated per (2) in #25510 (comment) to include:

Updating UPDATING.md. [required]
Adding a CHECK constraint (or similar) to the table_columns.python_datetime_format column to ensure that the specified format is either ISO 8601 compliant, epoch_ms, or epoch_s. [optional]

…dataset_datetime_format

michael-s-molina · 2024-01-19T11:08:40Z

I sense this PR should be resurrected and merged as part of the 4.0 breaking window. I realize that the proposal window has closed, but this is actually a (breaking) fix and thus I gather it's exempt from those rules.

This missed the 4.0 breaking window and was not submitted for lazy consensus. Someone might have disagreed with the proposed change if it was submitted for lazy consensus. I think the correct process here is to add a card to Punted to 5.0 column so we don't miss this again in the next breaking window.

sfirke · 2024-01-19T16:22:19Z

Maybe we could discuss at the townhall today? I feel like a bug fix should not be subject to as rigorous a community approval as the changes that were passed via lazy consensus and think perhaps we should be able to merge breaking bug fixes during this window.

…dataset_datetime_format

john-bodley

Woot! LGTM. Thanks @mapledan for the fix and addressing the various requests.

Co-authored-by: John Bodley <john.bodley@gmail.com>

fix: change the validation logic for python_date_format

c4d70e4

pull-request-size bot added the size/S label Oct 4, 2023

john-bodley requested review from betodealmeida and john-bodley October 4, 2023 17:27

mapledan added 2 commits October 5, 2023 19:38

fix: strftime may receive None.

0108456

add unit tests

0cb09cc

pull-request-size bot added size/M and removed size/S labels Oct 5, 2023

re-run pre-commit

7c7210f

john-bodley reviewed Nov 16, 2023

View reviewed changes

mapledan added 3 commits November 17, 2023 16:08

rewrite the unit test

2f05b7f

change the validate method to dateutil.parser.isoparse

4c849e2

Adjust function return type

dfb55e1

john-bodley mentioned this pull request Nov 17, 2023

fix(dataset): Ensure DATETIME FORMAT is ISO 8601 compliant #26017

Closed

9 tasks

john-bodley reviewed Nov 17, 2023

View reviewed changes

tests/unit_tests/datasets/schema_tests.py Outdated Show resolved Hide resolved

john-bodley reviewed Nov 17, 2023

View reviewed changes

superset/datasets/schemas.py Show resolved Hide resolved

superset/datasets/schemas.py Outdated Show resolved Hide resolved

superset/datasets/schemas.py Outdated Show resolved Hide resolved

john-bodley added the hold! On hold label Nov 17, 2023

mapledan added 2 commits November 18, 2023 15:10

%Y/%m/%d should be the raises case

ff6ca17

Per the type hint value cannot be None

baae852

john-bodley requested a review from michael-s-molina January 18, 2024 23:28

john-bodley added the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label Jan 18, 2024

Merge branch 'master' of https://github.com/apache/superset into fix/…

a6fa404

…dataset_datetime_format

michael-s-molina removed the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label Jan 19, 2024

mapledan requested review from rusackas, nytai, mistercrunch, craig-rueda, kgabryje, jinghua-qa, dpgaspar and a team as code owners January 23, 2024 09:52

pull-request-size bot added size/XXL and removed size/L labels Jan 23, 2024

mapledan force-pushed the fix/dataset_datetime_format branch from 5c4b82f to 7f3055a Compare January 23, 2024 09:59

pull-request-size bot added size/L and removed size/XXL labels Jan 23, 2024

mapledan and others added 6 commits January 23, 2024 18:08

refactor: validation at the DAOs level

20e484e

refactor: validation at the DAOs level

6aee27c

fix(dataset): Ensure DATETIME FORMAT is ISO 8601 compliant

0269e95

remove unnecessary test case

94b1482

pylint: fix invalid-name

b5ca468

Merge branch 'master' of https://github.com/apache/superset into fix/…

daf90cb

…dataset_datetime_format

john-bodley approved these changes Jan 23, 2024

View reviewed changes

john-bodley merged commit c2ab9bb into apache:master Jan 23, 2024
33 checks passed

john-bodley removed the hold! On hold label Jan 23, 2024

mapledan deleted the fix/dataset_datetime_format branch January 24, 2024 02:17

rusackas pushed a commit that referenced this pull request Jan 24, 2024

fix: change the validation logic for python_date_format (#25510)

e8862f2

Co-authored-by: John Bodley <john.bodley@gmail.com>

eschutho pushed a commit to preset-io/superset that referenced this pull request Jan 31, 2024

fix: change the validation logic for python_date_format (apache#25510)

cc6670f

Co-authored-by: John Bodley <john.bodley@gmail.com>

sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024

fix: change the validation logic for python_date_format (apache#25510)

8bffe0a

Co-authored-by: John Bodley <john.bodley@gmail.com>

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 4.0.0 labels Apr 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: change the validation logic for python_date_format #25510

fix: change the validation logic for python_date_format #25510

mapledan commented Oct 4, 2023 •

edited

Loading

sfirke commented Oct 4, 2023

john-bodley commented Oct 4, 2023

mapledan commented Oct 5, 2023

john-bodley Nov 16, 2023

mapledan Nov 17, 2023 •

edited

Loading

john-bodley Nov 17, 2023 •

edited

Loading

mapledan Nov 17, 2023 •

edited

Loading

john-bodley Nov 17, 2023 •

edited

Loading

jfrag1 Nov 17, 2023

john-bodley Nov 17, 2023

jfrag1 Nov 17, 2023

jfrag1 Nov 17, 2023

john-bodley Nov 17, 2023 •

edited

Loading

codecov bot commented Nov 18, 2023 •

edited

Loading

john-bodley commented Jan 18, 2024 •

edited

Loading

michael-s-molina commented Jan 19, 2024

sfirke commented Jan 19, 2024

john-bodley left a comment

fix: change the validation logic for python_date_format #25510

fix: change the validation logic for python_date_format #25510

Conversation

mapledan commented Oct 4, 2023 • edited Loading

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

sfirke commented Oct 4, 2023

john-bodley commented Oct 4, 2023

mapledan commented Oct 5, 2023

john-bodley Nov 16, 2023

Choose a reason for hiding this comment

mapledan Nov 17, 2023 • edited Loading

Choose a reason for hiding this comment

john-bodley Nov 17, 2023 • edited Loading

Choose a reason for hiding this comment

mapledan Nov 17, 2023 • edited Loading

Choose a reason for hiding this comment

john-bodley Nov 17, 2023 • edited Loading

Choose a reason for hiding this comment

jfrag1 Nov 17, 2023

Choose a reason for hiding this comment

john-bodley Nov 17, 2023

Choose a reason for hiding this comment

jfrag1 Nov 17, 2023

Choose a reason for hiding this comment

jfrag1 Nov 17, 2023

Choose a reason for hiding this comment

john-bodley Nov 17, 2023 • edited Loading

Choose a reason for hiding this comment

codecov bot commented Nov 18, 2023 • edited Loading

Codecov Report

john-bodley commented Jan 18, 2024 • edited Loading

michael-s-molina commented Jan 19, 2024

sfirke commented Jan 19, 2024

john-bodley left a comment

Choose a reason for hiding this comment

mapledan commented Oct 4, 2023 •

edited

Loading

mapledan Nov 17, 2023 •

edited

Loading

john-bodley Nov 17, 2023 •

edited

Loading

mapledan Nov 17, 2023 •

edited

Loading

john-bodley Nov 17, 2023 •

edited

Loading

john-bodley Nov 17, 2023 •

edited

Loading

codecov bot commented Nov 18, 2023 •

edited

Loading

john-bodley commented Jan 18, 2024 •

edited

Loading