feat: run extra query on QueryObject and add compare operator for post_processing #15279
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master   #15279      +/-   ##
==========================================
- Coverage   76.92%   76.91%   -0.01%
==========================================
  Files         987      988       +1
  Lines       52000    52167     +167
  Branches     7090     7090
==========================================
+ Hits        40000    40126     +126
- Misses      11775    11816      +41
  Partials      225      225
Flags with carried forward coverage won't be shown.
Continue to review the full report at Codecov.
force-pushed from 9a210cd to 5804d65
force-pushed from 029f37c to 6efc065
force-pushed from eae050a to ecc7af0
force-pushed from ecc7af0 to f36268f
force-pushed from f36268f to 2957344
Posting a partial first pass as I have limited time to do a more thorough review. I asked @betodealmeida to help review this too.
def get_past_or_future(
    human_readable: Optional[str], source_time: Optional[datetime] = None,
) -> datetime:
    cal = parsedatetime.Calendar()
The next few lines have a lot in common with parse_human_timedelta; should we refactor them both to use a common, simple helper method?
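For context, a minimal sketch of what such a shared helper could look like (illustrative only, not the actual Superset implementation; it assumes both functions ultimately wrap parsedatetime.Calendar().parseDT):

from datetime import datetime, timedelta
from typing import Optional

import parsedatetime

cal = parsedatetime.Calendar()

def _parse_human_datetime(
    human_readable: Optional[str], source_time: Optional[datetime] = None
) -> datetime:
    # parseDT returns a (datetime, parse_status) tuple
    return cal.parseDT(human_readable or "", source_time)[0]

def get_past_or_future(
    human_readable: Optional[str], source_time: Optional[datetime] = None
) -> datetime:
    return _parse_human_datetime(human_readable, source_time)

def parse_human_timedelta(
    human_readable: Optional[str], source_time: Optional[datetime] = None
) -> timedelta:
    # returns a timedelta relative to source_time (or "now")
    source_time = source_time or datetime.now()
    return _parse_human_datetime(human_readable, source_time) - source_time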
This function returns datetime, while the following ones return timedelta.
I will open a separate PR to fix these problems.
I have some reservations about running multiple queries, one for each offset period: it basically means an additional 1x query time for each new offset, plus the overhead of transferring potentially duplicate rows across overlapping periods.
@@ -115,6 +115,8 @@

DTTM_ALIAS = "__timestamp"

TIME_COMPARISION = "__"
Since there is no need to reverse the column name construction, maybe we can make this a function:
def get_time_comparison_column_name(col: str, period: str) -> str:
    return f"{col} ({period})"
(I think parentheses would look nicer than __, too.)
This separator is reserved for now because it has to match the frontend.
> I have some reservations about running multiple queries, one for each offset period: it basically means an additional 1x query time for each new offset, plus the overhead of transferring potentially duplicate rows across overlapping periods.
TL;DR: if we want to maintain feature parity with the current time offset functionality in viz.py (which issues multiple queries), we need to do that here, too.
@ktmud This was my initial reaction, too, and I assumed we could just add the offsets based on the initial dataframe. However, after studying how this currently works, I noticed that we need to issue separate queries for each offset, as they retrieve data for different time ranges. Take this screenshot from @zhaoyongjie's comment above:
Here you can see that the "1 year ago" offset in year 1980 is in fact the data for 1979, which isn't visible in the original series. If we wanted to support arbitrary offsets based on just one query response, we would need to know the maximum offset, query based on that, and then truncate the original series to its original time range, etc.
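A small illustration of the point (not the PR's code; dates and offsets are made up): each offset shifts the query window itself, so the rows it needs are generally outside the main query's window.

from datetime import datetime
from dateutil.relativedelta import relativedelta

from_dttm = datetime(1980, 1, 1)
to_dttm = datetime(1981, 1, 1)

offsets = {"1 year ago": relativedelta(years=1), "28 days ago": relativedelta(days=28)}

for label, delta in offsets.items():
    # Each offset query covers a different window than the main query,
    # so a single main-query response cannot supply its rows.
    print(label, from_dttm - delta, "->", to_dttm - delta)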
time_offsets = query_object.time_offsets
outer_from_dttm = query_object.from_dttm
outer_to_dttm = query_object.to_dttm
for offset in time_offsets:
I'm not sure you need to run and cache a completely new query for each offset.
Can we somehow compute the final time periods and generate a proper WHERE condition with or filters instead?
def get_time_periods_for_offsets(time_range, offsets):
    start, end = time_range
    periods = [time_range]
    for offset in offsets:
        periods.append([start + offset, end + offset])
    return periods
Then change
superset/superset/connectors/sqla/models.py
Lines 1370 to 1375 in bee386e
inner_time_filter = dttm_col.get_time_filter(
    inner_from_dttm or from_dttm,
    inner_to_dttm or to_dttm,
    time_range_endpoints,
)
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))
to something like
inner_time_filter = or_(*[dttm_col.between(start, end) for start, end in periods])
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))
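For reference, a standalone sketch of how the OR-of-ranges filter would compose (SQLAlchemy Core with made-up table and column names, assuming SQLAlchemy 1.4+ style; this is not Superset's code):

from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, MetaData, Table, and_, or_, select

metadata = MetaData()
rows = Table("rows", metadata, Column("id", Integer), Column("ds", DateTime))

periods = [
    (datetime(1980, 1, 1), datetime(1981, 1, 1)),  # main window
    (datetime(1979, 1, 1), datetime(1980, 1, 1)),  # "1 year ago" window
]
inner_time_filter = or_(*[rows.c.ds.between(start, end) for start, end in periods])
# Emits a single SELECT whose WHERE clause ORs the two BETWEEN ranges together.
print(select(rows).where(and_(inner_time_filter)))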
For a where clause combined with the or operator, I estimate that the system cost is approximately equal to running multiple queries, because the or operator does not reduce the rows scanned by the database engine, and we lose the opportunity to cache each time offset. Let me explain.
Using the or operator in the where clause:
- unable to cache each time-offset slice
- unable to easily generate the final dataframe; when it contains null values, it is difficult to join with the main query
Using an extra query:
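(A hypothetical sketch of the caching point: with one query per offset, the offset can be folded into the cache key so each slice is cached and reused independently. The key fields here are illustrative, not Superset's actual cache-key logic.)

import hashlib
import json

def offset_cache_key(base_query: dict, offset: str) -> str:
    # Fold the time offset into the key so each shifted query result
    # gets its own cache entry, independent of the main query.
    payload = {**base_query, "time_offset": offset}
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

key = offset_cache_key({"datasource": "random_time_series", "granularity": "P1M"}, "1 year ago")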
> the or operator does not reduce the rows scanned by the database engine
It would reduce rows scanned if there are significant overlaps among the offset time periods, e.g. you query for two years of data and offset by 1 year.
> unable to easily generate the final dataframe
Isn't each sub-dataframe a between(start_time, end_time) filter on the query result dataframe? We should probably use pandas to handle the time periods and join by datetime index anyway, if we are not already doing that, so the split & join by time process shouldn't be that hard, either.
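A small pandas sketch of the split-and-join idea being suggested (column names and data are made up; not Superset code):

import pandas as pd

# One result frame covering both windows (what a single OR'd query might return).
df = pd.DataFrame(
    {"value": range(24)},
    index=pd.date_range("1979-01-01", periods=24, freq="MS", name="__timestamp"),
)

main = df.loc["1980":]                 # main window (1980)
offset = df.loc["1979":"1979"].copy()  # "1 year ago" slice (1979)
offset.index = offset.index + pd.DateOffset(years=1)  # shift so it lines up with 1980

joined = main.join(offset, rsuffix=" (1 year ago)")  # join on the datetime index
print(joined.head())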
I believe moving this type of logic into Superset would be a slippery slope to introducing logic that's usually best handled by the analytical database and cause major maintainability overhead. I'm open to considering this down the road, but it would require some more discussion to ensure we don't end up building a pseudo-database engine inside Superset. Maybe we can revisit this if/when we start working on adding the semantic layer for table joins?
force-pushed from d0f8af3 to b95e39d
/testenv up
@rusackas Ephemeral environment spinning up at http://35.163.144.171:8080. Credentials are
force-pushed from 31a8695 to e28ac64
LGTM! 🚀
Ephemeral environment shutdown and build artifacts deleted.
…t_processing (apache#15279) * rebase master and resolve conflicts * pylint to makefile * fix crash when pivot operator * fix comments * add precision argument * query test * wip * fix ut * rename * set time_offsets to cache key wip * refactor get_df_payload wip * extra query cache * cache ut * normalize df * fix timeoffset * fix ut * make cache key logging sense * resolve conflicts * backend follow up iteration 1 wip * rolling window type * rebase master * py lint and minor follow ups * pylintrc
SUMMARY
This PR is part of the advanced analytics work. This PR introduces:
- run extra query on the v1 data API
- compare operator for time compare (see the sketch below)
- refactoring of get_df_payload
- Makefile, diff operator and pylint changes
- client-side code at: apache-superset/superset-ui#1170
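As a rough illustration of what the time-compare post-processing boils down to (column names and the exact operator semantics are illustrative, not the PR's implementation): the extra query contributes a shifted metric column, and the compare operator derives difference/percentage columns from it.

import pandas as pd

df = pd.DataFrame(
    {
        "__timestamp": pd.date_range("1980-01-01", periods=3, freq="MS"),
        "count": [100, 120, 90],
        "count__1 year ago": [80, 100, 100],  # column produced by the extra (offset) query
    }
)

# "difference" style compare between the metric and its shifted counterpart
df["count, difference"] = df["count"] - df["count__1 year ago"]
# "percentage" style compare
df["count, percentage"] = df["count"] / df["count__1 year ago"] - 1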
Preparing test data
Please use the random_time_series dataset for this test.
Calculation changes (there are 3 changes)
Now the time offset calculation has some changes from before.
1. where clause offset calculation: use time units (x years / x months / x weeks) as the calculation offset instead of a day-based timedelta. This avoids incorrect calculations around leap years.
   - Before: main query (line chart) [screenshot]
   - Before: 1 year ago offset query (line chart) [screenshot]
   - After: main query (time-series chart) [screenshot]
   - After: 1 year ago offset query [screenshot]
2. join on __timestamp instead of concat to merge the main dataframe and the extra dataframe (see the sketch after this list). This ensures that the main query and the extra query use the same time-granularity calculation on metrics. Consider this scenario (time-series chart): the main query is [query omitted]; with the time shift "28 days ago", the query is [query omitted]. It is obvious that the extra query is missing the data for 2016-02-01 (the main query grain is month, but the extra query is missing a day, so the metric is incorrect). When the main dataframe left-joins the extra dataframe on __timestamp, extra data that does not "align" with the main dataframe is removed automatically.
   - extra query dataframe [screenshot]
   - merged dataframe [screenshot]
3. x time unit format for the time offset: it must be specified as "x timeunit ago" or "x timeunit later".
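To make change 2 concrete, a small pandas sketch of the left join on __timestamp (dates and numbers are made up; this is not the PR's code): rows of the extra dataframe that don't align with the main dataframe's timestamps simply drop out of the merge.

import pandas as pd

main = pd.DataFrame(
    {"__timestamp": pd.to_datetime(["2016-02-01", "2016-03-01"]), "count": [29, 31]}
)
# Offset frame: one bucket aligns with the main grain, one does not.
extra = pd.DataFrame(
    {
        "__timestamp": pd.to_datetime(["2016-02-02", "2016-03-01"]),
        "count__28 days ago": [27, 31],
    }
)

merged = main.merge(extra, on="__timestamp", how="left")
# The 2016-02-02 row of `extra` has no matching main timestamp, so it is dropped;
# 2016-02-01 gets NaN for the offset column rather than a misaligned value.
print(merged)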
How to test it
A. Prepare code
- superset-ui: feat: advanced analytics for timeseries in echart viz apache-superset/superset-ui#1170
- superset
$ tree -L 1 .
├── superset-ui
└── superset
B. Build superset-ui
Run the following commands in a terminal.
C. Build superset-frontend
D. Hack core module (package.json in superset-ui-core)
E. Run the dev server in Superset
TESTING INSTRUCTIONS
Added unit tests in the Python codebase.
ADDITIONAL INFORMATION