
feat: run extra query on QueryObject and add compare operator for post_processing #15279

Merged

merged 23 commits into apache:master on Jul 28, 2021

Conversation

Member

@zhaoyongjie zhaoyongjie commented Jun 21, 2021

SUMMARY

This PR is part of the advanced analytics work. It introduces:

  • run extra queries on the v1 chart data API
  • add a compare operator for time comparison in post_processing (illustrated below)
  • refactor get_df_payload
  • persist extra query results (per time offset)
  • unit tests for these new functions
  • minor improvements to the Makefile, the diff operator, and pylint

Client-side code: apache-superset/superset-ui#1170
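
For context, the compare operator boils down to deriving a new column from a metric and its time-shifted counterpart. A minimal pandas sketch of the idea (the column names and the difference/percentage variants here are illustrative assumptions, not the exact operator schema):

import pandas as pd

# Hypothetical frame: a metric plus its "1 year ago" counterpart.
df = pd.DataFrame({"count": [10.0, 12.0, 9.0], "count__1 year ago": [8.0, 11.0, 7.0]})

# Absolute difference between the metric and its offset series.
df["difference"] = df["count"] - df["count__1 year ago"]

# Relative (percentage) change versus the offset series.
df["percentage"] = (df["count"] - df["count__1 year ago"]) / df["count__1 year ago"]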

Preparing test data

Please use the random_time_series dataset for this test.

Calculation changes (there are 3 changes)

The time offset calculation now differs from the previous behaviour in three ways.

  1. The extra query's WHERE clause offset is now calculated in time units (x years / x months / x weeks) instead of a day-based timedelta. This avoids incorrect calculations around leap years; see the sketch after the example queries below.

Before: main query (line chart)

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
ORDER BY count DESC
LIMIT 50000

Before: 1 year ago offset query (line chart)

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2015-03-02 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
ORDER BY count DESC
LIMIT 50000

After: main query (time-series chart)

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000

After: 1 year ago offset query

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2015-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000
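
A minimal Python sketch of this difference (illustration only, not the Superset implementation; dateutil's relativedelta stands in for the calendar-unit arithmetic):

from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

start = datetime(2016, 3, 1)

# Day-based offset: the range 2015-03-01..2016-03-01 contains the leap day
# 2016-02-29, so "365 days ago" lands on 2015-03-02 (the old behaviour above).
print(start - timedelta(days=365))     # 2015-03-02 00:00:00

# Calendar-unit offset: "1 year ago" lands exactly on 2015-03-01 (the new behaviour).
print(start - relativedelta(years=1))  # 2015-03-01 00:00:00
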
  2. Use a join on __timestamp instead of concatenation to merge the main dataframe and the extra dataframe. This ensures the main query and the extra query use the same time-grain calculation for metrics; see the pandas sketch below.

Consider this scenario (time-series chart):
Screen Shot 2021-07-01 at 1 10 50 PM

The main query is:

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-03-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000

With a time shift of "28 days ago", the offset query is:

SELECT DATE_TRUNC('month', ds) AS __timestamp,
       COUNT(*) AS count
FROM random_time_series
WHERE ds >= TO_TIMESTAMP('2016-02-02 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
  AND ds < TO_TIMESTAMP('2017-02-01 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
GROUP BY DATE_TRUNC('month', ds)
LIMIT 50000

It is obvious that the extra query is missing the data for 2016-02-01 (the main query's grain is a month, but the extra query is missing one day, so the metric would be incorrect). When the main dataframe is left-joined with the extra dataframe on __timestamp, extra-query rows that do not align with the main dataframe are dropped automatically.

extra query dataframe
image

merged dataframe
image
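
A minimal pandas sketch of the join-on-__timestamp idea (made-up data; assume the offset series has already been shifted back onto the main query's timestamps):

import pandas as pd

df_main = pd.DataFrame({
    "__timestamp": pd.to_datetime(["2016-03-01", "2016-04-01", "2016-05-01"]),
    "count": [10, 12, 9],
})
df_offset = pd.DataFrame({
    "__timestamp": pd.to_datetime(["2016-04-01", "2016-05-01"]),  # 2016-03-01 missing
    "count__28 days ago": [11, 7],
})

# A left join on __timestamp keeps only offset rows that align with the main
# query's grain; missing periods become NaN instead of silently shifting values
# the way a positional concat would.
merged = df_main.merge(df_offset, on="__timestamp", how="left")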

  3. The bare "x time unit" format is no longer allowed for a time offset; it must be specified as "x timeunit ago" or "x timeunit later". A usage sketch follows.
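
A hypothetical usage sketch of the accepted offset format (the get_past_or_future helper appears in the review excerpt further down; its module path and exact return values are assumptions here):

from datetime import datetime
from superset.utils.date_parser import get_past_or_future  # assumed location

anchor = datetime(2017, 3, 1)

get_past_or_future("1 year ago", anchor)    # roughly datetime(2016, 3, 1)
get_past_or_future("1 year later", anchor)  # roughly datetime(2018, 3, 1)

# A bare "1 year" (with no "ago"/"later") is no longer accepted as a time offset.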

How to test it.

A. Prepare code

  1. Pull the associated PR in superset-ui: feat: advanced analytics for timeseries in echart viz apache-superset/superset-ui#1170
  2. Pull the current PR in superset
  3. The directory layout should look like this:
$ tree -L 1
.
├── superset-ui
└── superset

B. Build superset-ui

Run the following commands in a terminal:

$ cd superset-ui
superset-ui$ yarn clean
superset-ui$ rm -rf ./{packages,plugins}/*/node_modules ./node_modules
superset-ui$ yarn clean-npm-lock

superset-ui$ yarn
superset-ui$ yarn build

C. Build superset-frontend

$ cd superset/superset-frontend
superset-frontend$ npm ci
superset-frontend$ npm link --legacy-peer-deps ../../superset-ui/plugins/plugin-chart-echarts/ ../../superset-ui/packages/superset-ui-chart-controls/ ../../superset-ui/packages/superset-ui-core/

D. Hack core module

  1. Change package.json in superset-ui-core:
diff --git a/packages/superset-ui-core/package.json b/packages/superset-ui-core/package.json
index 30a78f1e..a67dfd40 100644
--- a/packages/superset-ui-core/package.json
+++ b/packages/superset-ui-core/package.json
@@ -32,8 +32,6 @@
   },
   "dependencies": {
     "@babel/runtime": "^7.1.2",
-    "@emotion/cache": "^11.1.3",
-    "@emotion/react": "^11.1.5",
     "@emotion/styled": "^11.3.0",
     "@types/d3-format": "^1.3.0",
     "@types/d3-interpolate": "^1.3.1",
@@ -65,6 +63,8 @@
     "@types/react": "*",
     "@types/react-loadable": "*",
     "react": "^16.13.1",
-    "react-loadable": "^5.5.0"
+    "react-loadable": "^5.5.0",
+    "@emotion/cache": "^11.1.3",
+    "@emotion/react": "^11.1.5"
   }
 }
  2. Build each package of superset-ui:
$ cd superset-ui/packages/superset-ui-core/
superset-ui-core$ yarn

$ cd superset-ui/packages/superset-ui-chart-controls/
superset-ui-chart-controls$ yarn

$ cd superset-ui/plugins/plugin-chart-echarts/
plugin-chart-echarts$ yarn

E. Run the dev server in Superset

$ cd superset/superset-frontend
superset-frontend$ npm run dev-server

TESTING INSTRUCTIONS

Added unit tests in the Python codebase.

ADDITIONAL INFORMATION

@zhaoyongjie zhaoyongjie marked this pull request as ready for review June 22, 2021 03:25
@codecov

codecov bot commented Jun 22, 2021

Codecov Report

Merging #15279 (57f01ce) into master (ea49aa3) will decrease coverage by 0.00%.
The diff coverage is 91.91%.

❗ Current head 57f01ce differs from pull request most recent head ecb606e. Consider uploading reports for the commit ecb606e to get more accurate results

@@            Coverage Diff             @@
##           master   #15279      +/-   ##
==========================================
- Coverage   76.92%   76.91%   -0.01%     
==========================================
  Files         987      988       +1     
  Lines       52000    52167     +167     
  Branches     7090     7090              
==========================================
+ Hits        40000    40126     +126     
- Misses      11775    11816      +41     
  Partials      225      225              
Flag Coverage Δ
hive ?
mysql 81.52% <91.88%> (+0.11%) ⬆️
postgres ?
presto 81.33% <87.23%> (?)
python 81.76% <91.91%> (-0.04%) ⬇️
sqlite 81.18% <91.88%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/examples/birth_names.py 73.78% <ø> (ø)
superset/views/api.py 71.42% <0.00%> (ø)
superset/charts/commands/exceptions.py 91.11% <75.00%> (-1.75%) ⬇️
superset/utils/pandas_postprocessing.py 84.24% <84.61%> (-0.22%) ⬇️
superset/common/utils.py 88.75% <88.75%> (ø)
superset/utils/date_parser.py 96.56% <93.75%> (-0.33%) ⬇️
superset/common/query_context.py 91.20% <96.51%> (+9.38%) ⬆️
superset/charts/schemas.py 100.00% <100.00%> (ø)
superset/common/query_object.py 90.66% <100.00%> (+0.52%) ⬆️
superset/constants.py 100.00% <100.00%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ea49aa3...ecb606e.

Member

@villebro villebro left a comment

Some comments. Also, I'm unable to get the offsets to work. When I try to do a 1 year offset, I get the following results (notice that the offset label is correct, but the data is the same as the original query):
image

The same parameters on the NVD3 Line chart:
image

@zhaoyongjie zhaoyongjie force-pushed the run_extra_query branch 2 times, most recently from 9a210cd to 5804d65 Compare June 26, 2021 02:54
@zhaoyongjie zhaoyongjie force-pushed the run_extra_query branch 2 times, most recently from 029f37c to 6efc065 Compare June 29, 2021 13:31
@zhaoyongjie
Member Author

> Some comments. Also, I'm unable to get the offsets to work. When I try to do a 1 year offset, I get the following results (notice that the offset label is correct, but the data is the same as the original query):
> image
>
> The same parameters on the NVD3 Line chart:
> image

This bug has been fixed.

image

Member

@mistercrunch mistercrunch left a comment

Posting a partial first pass as I have limited time to do a more thorough review. I asked @betodealmeida to help review this too.

def get_past_or_future(
    human_readable: Optional[str], source_time: Optional[datetime] = None,
) -> datetime:
    cal = parsedatetime.Calendar()
Member

The next few lines have a lot in common with parse_human_timedelta; should we refactor them both to use a common helper method?

Member Author

This function returns a datetime, while the following ones return a timedelta.

Member Author

I will open a separate PR to fix these problems.

Member

@ktmud ktmud left a comment

I have some reservations about running multiple queries, one for each offset period: it basically means an additional 1x query time for each new offset, plus the overhead of transferring potentially duplicate rows when the time periods overlap.

@@ -115,6 +115,8 @@

DTTM_ALIAS = "__timestamp"

TIME_COMPARISION = "__"
Member

Since there is no need to revert the column name construction, maybe we can make this a function:

def get_time_comparison_column_name(col: str, period: str):
    return f"{col} ({period})"

(I think parentheses would look nicer than __, too.)

Member Author

@zhaoyongjie zhaoyongjie Jul 12, 2021

This separator is reserved for now because it has to match the frontend.

Member

> I have some reservations about running multiple queries, one for each offset period: it basically means an additional 1x query time for each new offset, plus the overhead of transferring potentially duplicate rows when the time periods overlap.

TL;DR: if we want to maintain feature parity with the current time offset functionality in viz.py (which issues multiple queries), we need to do that here, too.

@ktmud This was my initial reaction, too, and I assumed we could just add the offsets based on the initial dataframe. However, after studying how this currently works, I noticed that we need to issue separate queries for each offset, as they retrieve data for different time ranges. Take this screenshot from @zhaoyongjie's comment above:
image
Here you can see that the "1 year ago" offset in the year 1980 is in fact the data for 1979, which isn't visible in the original series. If we wanted to support arbitrary offsets based on just one query response, we would need to know the maximum offset, query based on that, and then truncate the original series to its original time range, etc.

time_offsets = query_object.time_offsets
outer_from_dttm = query_object.from_dttm
outer_to_dttm = query_object.to_dttm
for offset in time_offsets:
Member

I'm not sure you need to run and cache a completely new query for each offset.

Can we somehow compute the final time periods and generate proper WHERE conditions with or filters instead?

def get_time_periods_for_offsets(time_range, offsets):
    [start, end] = time_range
    periods = [time_range]
    for offset in offsets:
        periods.append([start + offset, end + offset])
    return periods

Then change

inner_time_filter = dttm_col.get_time_filter(
    inner_from_dttm or from_dttm,
    inner_to_dttm or to_dttm,
    time_range_endpoints,
)
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))

to something like

inner_time_filter = or_(*[dttm_col.between(start, end) for start, end in periods])
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))

Member Author

@zhaoyongjie zhaoyongjie Jul 12, 2021

For a WHERE clause combined with the OR operator, I estimate the cost is approximately equal to running multiple queries. This is because the OR operator does not reduce the rows scanned by the database engine, and we lose the opportunity to cache each time offset. Let me explain.

Using the OR operator in the WHERE clause:

  • unable to cache each time-offset slice
  • unable to easily generate the final dataframe; when null values appear, it is difficult to join with the main query

image

Using an extra query:

image

image

Member

> This is because the OR operator does not reduce the rows scanned by the database engine.

It would reduce rows scanned if there are significant overlaps among the offset time periods, e.g. you query for two years of data and offset by 1 year.

> unable to easily generate the final dataframe

Isn't each sub-dataframe a between(start_time, end_time) filter on the query result dataframe? We should probably use pandas to handle the time periods and join by datetime index anyway, if we are not already doing that, so the split & join by time process shouldn't be that hard, either.
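
Illustratively, a minimal pandas sketch of that split-and-join idea (made-up data and column names):

import pandas as pd

# Hypothetical combined result covering both the main range and the 1-year offset range.
df = pd.DataFrame({
    "__timestamp": pd.date_range("2015-03-01", "2017-02-01", freq="MS"),
    "count": range(24),
})

main_start, main_end = pd.Timestamp("2016-03-01"), pd.Timestamp("2017-03-01")
shift = pd.DateOffset(years=1)

# Slice each period out of the single result set ...
main_df = df[(df["__timestamp"] >= main_start) & (df["__timestamp"] < main_end)]
offset_df = df[(df["__timestamp"] >= main_start - shift) & (df["__timestamp"] < main_end - shift)]

# ... then shift the offset slice forward and join on the datetime column.
offset_df = offset_df.assign(__timestamp=offset_df["__timestamp"] + shift)
offset_df = offset_df.rename(columns={"count": "count__1 year ago"})
merged = main_df.merge(offset_df, on="__timestamp", how="left")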

Member

> > This is because the OR operator does not reduce the rows scanned by the database engine.
>
> It would reduce rows scanned if there are significant overlaps among the offset time periods, e.g. you query for two years of data and offset by 1 year.
>
> > unable to easily generate the final dataframe
>
> Isn't each sub-dataframe a between(start_time, end_time) filter on the query result dataframe? We should probably use pandas to handle the time periods and join by datetime index anyway, if we are not already doing that, so the split & join by time process shouldn't be that hard, either.

I believe moving this type of logic into Superset would be a slippery slope toward introducing logic that's usually best handled by the analytical database, and it would cause major maintainability overhead. I'm open to considering this down the road, but it would require some more discussion to ensure we don't end up building a pseudo-database engine inside Superset. Maybe we can revisit this if/when we start working on adding the semantic layer for table joins?

@betodealmeida betodealmeida self-requested a review July 8, 2021 22:19
@zhaoyongjie zhaoyongjie force-pushed the run_extra_query branch 5 times, most recently from d0f8af3 to b95e39d Compare July 13, 2021 01:50
@rusackas
Member

/testenv up

@github-actions
Contributor

@rusackas Ephemeral environment spinning up at http://35.163.144.171:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

Member

@villebro villebro left a comment

LGTM! 🚀

@zhaoyongjie zhaoyongjie merged commit 32d2aa0 into apache:master Jul 28, 2021
@github-actions
Contributor

Ephemeral environment shutdown and build artifacts deleted.

opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
…t_processing (apache#15279)

* rebase master and resolve conflicts

* pylint to makefile

* fix crash when pivot operator

* fix comments

* add precision argument

* query test

* wip

* fix ut

* rename

* set time_offsets to cache key

wip

* refactor get_df_payload

wip

* extra query cache

* cache ut

* normalize df

* fix timeoffset

* fix ut

* make cache key logging sense

* resolve conflicts

* backend follow up iteration 1

wip

* rolling window type

* rebase master

* py lint and minor follow ups

* pylintrc
cccs-RyanS pushed a commit to CybercentreCanada/superset that referenced this pull request Dec 17, 2021
…t_processing (apache#15279)

QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
…t_processing (apache#15279)

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
…t_processing (apache#15279)

@mistercrunch mistercrunch added the 🏷️ bot and 🚢 1.3.0 labels on Mar 12, 2024