
feat: run extra query on QueryObject and add compare operator for post_processing #15279

Merged

merged 23 commits into apache:master on Jul 28, 2021

Conversation

Member

@zhaoyongjie zhaoyongjie commented Jun 21, 2021

SUMMARY

This PR is part of the advanced analytics work. It introduces:

  • run extra queries on the v1 chart data API
  • add a compare operator for time comparison in post_processing (illustrated below)
  • refactor get_df_payload
  • persist extra query results (per time offset)
  • unit tests for these new functions
  • minor improvements to the Makefile, the diff operator, and pylint

Client-side code: apache-superset/superset-ui#1170
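
For context, the compare operator boils down to deriving a new column from a metric and its time-shifted counterpart. A minimal pandas sketch of the idea (the column names and the difference/percentage variants here are illustrative assumptions, not the exact operator schema):

import pandas as pd

# Hypothetical frame: a metric plus its "1 year ago" counterpart.
df = pd.DataFrame({"count": [10.0, 12.0, 9.0], "count__1 year ago": [8.0, 11.0, 7.0]})

# Absolute difference between the metric and its offset series.
df["difference"] = df["count"] - df["count__1 year ago"]

# Relative (percentage) change versus the offset series.
df["percentage"] = (df["count"] - df["count__1 year ago"]) / df["count__1 year ago"]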

Preparing test data

Please use the random_time_series dataset for this test.

Calculation changes (there are 3 changes)

The time offset calculation now differs from the previous behaviour in three ways.

  1. The extra query's WHERE clause offset is now calculated in time units (x years / x months / x weeks) instead of a day-based timedelta. This avoids incorrect calculations around leap years; see the sketch after the example queries below.

Before: main query (line chart)

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
ORDER BY count DESC
LIMIT 50000

Before: 1 year ago offset query (line chart)

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2015-03-02 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
ORDER BY count DESC
LIMIT 50000

After: main query (time-series chart)

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000

After: 1 year ago offset query

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2015-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000
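
A minimal Python sketch of this difference (illustration only, not the Superset implementation; dateutil's relativedelta stands in for the calendar-unit arithmetic):

from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

start = datetime(2016, 3, 1)

# Day-based offset: the range 2015-03-01..2016-03-01 contains the leap day
# 2016-02-29, so "365 days ago" lands on 2015-03-02 (the old behaviour above).
print(start - timedelta(days=365))     # 2015-03-02 00:00:00

# Calendar-unit offset: "1 year ago" lands exactly on 2015-03-01 (the new behaviour).
print(start - relativedelta(years=1))  # 2015-03-01 00:00:00
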
  2. Use a join on __timestamp instead of concatenation to merge the main dataframe and the extra dataframe. This ensures the main query and the extra query use the same time-grain calculation for metrics; see the pandas sketch below.

Consider this scenario (time-series chart):
Screen Shot 2021-07-01 at 1 10 50 PM

The main query is:

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000

With a time shift of "28 days ago", the offset query is:

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-02-02 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-02-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000

It is obvious that the extra query is missing the data for 2016-02-01 (the main query's grain is a month, but the extra query is missing one day, so the metric would be incorrect). When the main dataframe is left-joined with the extra dataframe on __timestamp, extra-query rows that do not align with the main dataframe are dropped automatically.

extra query dataframe
image

merged dataframe
image
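
A minimal pandas sketch of the join-on-__timestamp idea (made-up data; assume the offset series has already been shifted back onto the main query's timestamps):

import pandas as pd

df_main = pd.DataFrame({
    "__timestamp": pd.to_datetime(["2016-03-01", "2016-04-01", "2016-05-01"]),
    "count": [10, 12, 9],
})
df_offset = pd.DataFrame({
    "__timestamp": pd.to_datetime(["2016-04-01", "2016-05-01"]),  # 2016-03-01 missing
    "count__28 days ago": [11, 7],
})

# A left join on __timestamp keeps only offset rows that align with the main
# query's grain; missing periods become NaN instead of silently shifting values
# the way a positional concat would.
merged = df_main.merge(df_offset, on="__timestamp", how="left")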

  3. The bare "x time unit" format is no longer allowed for a time offset; it must be specified as "x timeunit ago" or "x timeunit later". A usage sketch follows.
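
A hypothetical usage sketch of the accepted offset format (the get_past_or_future helper appears in the review excerpt further down; its module path and exact return values are assumptions here):

from datetime import datetime
from superset.utils.date_parser import get_past_or_future  # assumed location

anchor = datetime(2017, 3, 1)

get_past_or_future("1 year ago", anchor)    # roughly datetime(2016, 3, 1)
get_past_or_future("1 year later", anchor)  # roughly datetime(2018, 3, 1)

# A bare "1 year" (with no "ago"/"later") is no longer accepted as a time offset.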

How to test it.

A. Prepare code

  1. Pull the associated PR in superset-ui: feat: advanced analytics for timeseries in echart viz apache-superset/superset-ui#1170
  2. Pull the current PR in superset
  3. The directory layout should look like this:
$ tree -L 1
.
├── superset-ui
└── superset

B. Build superset-ui

Run the following commands in a terminal:

$ cd superset-ui
superset-ui$ yarn clean
superset-ui$ rm -rf ./{packages,plugins}/*/node_modules ./node_modules
superset-ui$ yarn clean-npm-lock

superset-ui$ yarn
superset-ui$ yarn build

C. Build superset-frontend

$ cd superset/superset-frontend
superset-frontend$ npm ci
superset-frontend$ npm link --legacy-peer-deps ../../superset-ui/plugins/plugin-chart-echarts/ ../../superset-ui/packages/superset-ui-chart-controls/ ../../superset-ui/packages/superset-ui-core/

D. Hack core module

  1. Change package.json in superset-ui-core:
diff --git a/packages/superset-ui-core/package.json b/packages/superset-ui-core/package.json
index 30a78f1e..a67dfd40 100644
--- a/packages/superset-ui-core/package.json
+++ b/packages/superset-ui-core/package.json
@@ -32,8 +32,6 @@
   },
   "dependencies": {
     "@babel/runtime": "^7.1.2",
-    "@emotion/cache": "^11.1.3",
-    "@emotion/react": "^11.1.5",
     "@emotion/styled": "^11.3.0",
     "@types/d3-format": "^1.3.0",
     "@types/d3-interpolate": "^1.3.1",
@@ -65,6 +63,8 @@
     "@types/react": "*",
     "@types/react-loadable": "*",
     "react": "^16.13.1",
-    "react-loadable": "^5.5.0"
+    "react-loadable": "^5.5.0",
+    "@emotion/cache": "^11.1.3",
+    "@emotion/react": "^11.1.5"
   }
 }
  2. Build each package of superset-ui:
$ cd superset-ui/packages/superset-ui-core/
superset-ui-core$ yarn

$ cd superset-ui/packages/superset-ui-chart-controls/
superset-ui-chart-controls$ yarn

$ cd superset-ui/plugins/plugin-chart-echarts/
plugin-chart-echarts$ yarn

E. Run the dev server in Superset

$ cd superset/superset-frontend
superset-frontend$ npm run dev-server

TESTING INSTRUCTIONS

Added unit tests in the Python codebase.

ADDITIONAL INFORMATION

@zhaoyongjie zhaoyongjie marked this pull request as ready for review June 22, 2021 03:25
@codecov

codecov bot commented Jun 22, 2021

Codecov Report

Merging #15279 (57f01ce) into master (ea49aa3) will decrease coverage by 0.00%.
The diff coverage is 91.91%.

❗ Current head 57f01ce differs from pull request most recent head ecb606e. Consider uploading reports for the commit ecb606e to get more accurate results

@@            Coverage Diff             @@
##           master   #15279      +/-   ##
==========================================
- Coverage   76.92%   76.91%   -0.01%     
==========================================
  Files         987      988       +1     
  Lines       52000    52167     +167     
  Branches     7090     7090              
==========================================
+ Hits        40000    40126     +126     
- Misses      11775    11816      +41     
  Partials      225      225              
Flag Coverage Δ
hive ?
mysql 81.52% <91.88%> (+0.11%) ⬆️
postgres ?
presto 81.33% <87.23%> (?)
python 81.76% <91.91%> (-0.04%) ⬇️
sqlite 81.18% <91.88%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/examples/birth_names.py 73.78% <ø> (ø)
superset/views/api.py 71.42% <0.00%> (ø)
superset/charts/commands/exceptions.py 91.11% <75.00%> (-1.75%) ⬇️
superset/utils/pandas_postprocessing.py 84.24% <84.61%> (-0.22%) ⬇️
superset/common/utils.py 88.75% <88.75%> (ø)
superset/utils/date_parser.py 96.56% <93.75%> (-0.33%) ⬇️
superset/common/query_context.py 91.20% <96.51%> (+9.38%) ⬆️
superset/charts/schemas.py 100.00% <100.00%> (ø)
superset/common/query_object.py 90.66% <100.00%> (+0.52%) ⬆️
superset/constants.py 100.00% <100.00%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ea49aa3...ecb606e.

Member

@villebro villebro left a comment

Some comments. Also, I'm unable to get the offsets to work. When I try to do a 1 year offset, I get the following results (notice that the offset label is correct, but the data is the same as the original query):
image

The same parameters on the NVD3 Line chart:
image

@zhaoyongjie zhaoyongjie force-pushed the run_extra_query branch 2 times, most recently from 9a210cd to 5804d65 Compare June 26, 2021 02:54
@zhaoyongjie zhaoyongjie force-pushed the run_extra_query branch 2 times, most recently from 029f37c to 6efc065 Compare June 29, 2021 13:31
@zhaoyongjie
Member Author

> Some comments. Also, I'm unable to get the offsets to work. When I try to do a 1 year offset, I get the following results (notice that the offset label is correct, but the data is the same as the original query):
> image
>
> The same parameters on the NVD3 Line chart:
> image

This bug has been fixed.

image

Member

@mistercrunch mistercrunch left a comment

Posting a partial first pass as I have limited time to do a more thorough review. I asked @betodealmeida to help review this too.

def get_past_or_future(
    human_readable: Optional[str], source_time: Optional[datetime] = None,
) -> datetime:
    cal = parsedatetime.Calendar()
Member

The next few lines have a lot in common with parse_human_timedelta; should we refactor them both to use a common helper method?

Member Author

This function returns a datetime, while the following ones return a timedelta.

Member Author

I will open a separate PR to fix these problems.

Member

@ktmud ktmud left a comment

I have some reservations about running multiple queries, one for each offset period: it basically means an additional 1x query time for each new offset, plus the overhead of transferring potentially duplicate rows when the time periods overlap.

@@ -115,6 +115,8 @@

DTTM_ALIAS = "__timestamp"

TIME_COMPARISION = "__"
Member

Since there is no need to revert the column name construction, maybe we can make this a function:

def get_time_comparison_column_name(col: str, period: str):
    return f"{col} ({period})"

(I think parentheses would look nicer than __, too.)

Member Author

@zhaoyongjie zhaoyongjie Jul 12, 2021

This separator is reserved for now because it has to match the frontend.

Member

> I have some reservations about running multiple queries, one for each offset period: it basically means an additional 1x query time for each new offset, plus the overhead of transferring potentially duplicate rows when the time periods overlap.

TL;DR: if we want to maintain feature parity with the current time offset functionality in viz.py (which issues multiple queries), we need to do that here, too.

@ktmud This was my initial reaction, too, and I assumed we could just add the offsets based on the initial dataframe. However, after studying how this currently works, I noticed that we need to issue separate queries for each offset, as they retrieve data for different time ranges. Take this screenshot from @zhaoyongjie's comment above:
image
Here you can see that the "1 year ago" offset in the year 1980 is in fact the data for 1979, which isn't visible in the original series. If we wanted to support arbitrary offsets based on just one query response, we would need to know the maximum offset, query based on that, and then truncate the original series to its original time range, etc.

time_offsets = query_object.time_offsets
outer_from_dttm = query_object.from_dttm
outer_to_dttm = query_object.to_dttm
for offset in time_offsets:
Member

I'm not sure you need to run and cache a completely new query for each offset.

Can we somehow compute the final time periods and generate proper WHERE conditions with or filters instead?

def get_time_periods_for_offsets(time_range, offsets):
    [start, end] = time_range
    periods = [time_range]
    for offset in offsets:
        periods.append([start + offset, end + offset])
    return periods

Then change

inner_time_filter = dttm_col.get_time_filter(
    inner_from_dttm or from_dttm,
    inner_to_dttm or to_dttm,
    time_range_endpoints,
)
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))

to something like

inner_time_filter = or_(*[dttm_col.between(start, end) for start, end in periods])
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))

Member Author

@zhaoyongjie zhaoyongjie Jul 12, 2021

For a WHERE clause combined with the OR operator, I estimate the cost is approximately equal to running multiple queries. This is because the OR operator does not reduce the rows scanned by the database engine, and we lose the opportunity to cache each time offset. Let me explain.

Using the OR operator in the WHERE clause:

  • unable to cache each time-offset slice
  • unable to easily generate the final dataframe; when null values appear, it is difficult to join with the main query

image

Using an extra query:

image

image

Member

> This is because the OR operator does not reduce the rows scanned by the database engine.

It would reduce rows scanned if there are significant overlaps among the offset time periods, e.g. you query for two years of data and offset by 1 year.

> unable to easily generate the final dataframe

Isn't each sub-dataframe a between(start_time, end_time) filter on the query result dataframe? We should probably use pandas to handle the time periods and join by datetime index anyway, if we are not already doing that, so the split & join by time process shouldn't be that hard, either.
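
Illustratively, a minimal pandas sketch of that split-and-join idea (made-up data and column names):

import pandas as pd

# Hypothetical combined result covering both the main range and the 1-year offset range.
df = pd.DataFrame({
    "__timestamp": pd.date_range("2015-03-01", "2017-02-01", freq="MS"),
    "count": range(24),
})

main_start, main_end = pd.Timestamp("2016-03-01"), pd.Timestamp("2017-03-01")
shift = pd.DateOffset(years=1)

# Slice each period out of the single result set ...
main_df = df[(df["__timestamp"] >= main_start) & (df["__timestamp"] < main_end)]
offset_df = df[(df["__timestamp"] >= main_start - shift) & (df["__timestamp"] < main_end - shift)]

# ... then shift the offset slice forward and join on the datetime column.
offset_df = offset_df.assign(__timestamp=offset_df["__timestamp"] + shift)
offset_df = offset_df.rename(columns={"count": "count__1 year ago"})
merged = main_df.merge(offset_df, on="__timestamp", how="left")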

Member

> > This is because the OR operator does not reduce the rows scanned by the database engine.
>
> It would reduce rows scanned if there are significant overlaps among the offset time periods, e.g. you query for two years of data and offset by 1 year.
>
> > unable to easily generate the final dataframe
>
> Isn't each sub-dataframe a between(start_time, end_time) filter on the query result dataframe? We should probably use pandas to handle the time periods and join by datetime index anyway, if we are not already doing that, so the split & join by time process shouldn't be that hard, either.

I believe moving this type of logic into Superset would be a slippery slope toward introducing logic that's usually best handled by the analytical database, and it would cause major maintainability overhead. I'm open to considering this down the road, but it would require some more discussion to ensure we don't end up building a pseudo-database engine inside Superset. Maybe we can revisit this if/when we start working on adding the semantic layer for table joins?

@betodealmeida betodealmeida self-requested a review July 8, 2021 22:19
@zhaoyongjie zhaoyongjie force-pushed the run_extra_query branch 5 times, most recently from d0f8af3 to b95e39d Compare July 13, 2021 01:50
@rusackas
Member

/testenv up

@github-actions
Contributor

@rusackas Ephemeral environment spinning up at http://35.163.144.171:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

Member

@villebro villebro left a comment

LGTM! 🚀

@zhaoyongjie zhaoyongjie merged commit 32d2aa0 into apache:master Jul 28, 2021
@github-actions
Contributor

Ephemeral environment shutdown and build artifacts deleted.

opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
…t_processing (apache#15279)

* rebase master and resolve conflicts

* pylint to makefile

* fix crash when pivot operator

* fix comments

* add precision argument

* query test

* wip

* fix ut

* rename

* set time_offsets to cache key

wip

* refactor get_df_payload

wip

* extra query cache

* cache ut

* normalize df

* fix timeoffset

* fix ut

* make cache key logging sense

* resolve conflicts

* backend follow up iteration 1

wip

* rolling window type

* rebase master

* py lint and minor follow ups

* pylintrc
cccs-RyanS pushed a commit to CybercentreCanada/superset that referenced this pull request Dec 17, 2021
…t_processing (apache#15279)

QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
…t_processing (apache#15279)

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
…t_processing (apache#15279)

@mistercrunch mistercrunch added the 🏷️ bot and 🚢 1.3.0 labels on Mar 12, 2024