feat(advanced analysis): support MultiIndex column in post processing stage #19116

zhaoyongjie · 2022-03-11T09:19:21Z

Summary

Superset uses pandas DataFrames to process query results. Currently, the DataFrames passed between post-processing operators must have single-level (flat) columns, which has caused confusion in some calculations. This PR lets operators work on DataFrames with MultiIndex columns, which simplifies those calculations.

Current PostProcessing

Post-processing operators are typically chained in the queries for time-series charts, e.g.:

        post_processing: [
          resampleOperator(formData, baseQueryObject),
          timeCompareOperator(formData, baseQueryObject),
          sortOperator(formData, { ...baseQueryObject, is_timeseries: true }),
          pivotOperatorInRuntime,
          rollingWindowOperator(formData, baseQueryObject),
          contributionOperator(formData, baseQueryObject),
          prophetOperator(formData, baseQueryObject),
        ],

The existing design poses these challenges:

This operator order cannot serve both the rolling calculation and the time comparison. In other words, rollingWindowOperator should run before timeCompareOperator.
We also need to consider how each operator handles series in the DataFrame. For example, resample.py needs a multidimensional DataFrame to be flattened before it can work on it.

After the Change

The new operators can calculate on multidimensional DataFrames, so the operators in a QueryObject can be reordered freely, e.g.:

        post_processing: [
          pivotOperatorInRuntime,
          resampleOperator(formData, baseQueryObject),
          rollingWindowOperator(formData, baseQueryObject),
          timeCompareOperator(formData, baseQueryObject),
          flatOperator(formData, baseQueryObject),
          contributionOperator(formData, baseQueryObject),
          prophetOperator(formData, baseQueryObject),
        ],

For example, the compare() operator supports a DataFrame shaped like the following:

                   count_metric    sum_metric
    country              UK US         UK US
    dttm
    2019-01-01            1  2          5  6
    2019-01-02            3  4          7  8
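A frame with this shape can be reproduced in plain pandas for experimentation. The following is a minimal sketch, not Superset's pivot operator: the long-format input `long_df` is a hypothetical query result, and `pivot_table` is used only to build the same two-level columns.

```python
import pandas as pd

# Hypothetical long-format query result: one row per (dttm, country).
long_df = pd.DataFrame(
    {
        "dttm": pd.to_datetime(
            ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"]
        ),
        "country": ["UK", "US", "UK", "US"],
        "count_metric": [1, 2, 3, 4],
        "sum_metric": [5, 6, 7, 8],
    }
)

# Pivot so that the metric name is the outer column level and
# "country" is the inner column level.
pivoted = long_df.pivot_table(
    index="dttm",
    columns="country",
    values=["count_metric", "sum_metric"],
    aggfunc="sum",
)

print(pivoted)
print(pivoted.columns.nlevels)  # number of column levels
```

Selecting a single metric, for example `pivoted["sum_metric"]`, returns a regular frame with one column per country, which is what makes per-metric calculations straightforward.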

This PR also adds a new operator, flat.py, which transforms a multidimensional DataFrame into a flat one.

A full example:

    pivot_df = pp.pivot(
        df=multiple_metrics_df,
        index=["dttm"],
        columns=["country"],
        aggregates={
            "sum_metric": {"operator": "sum"},
            "count_metric": {"operator": "sum"},
        },
        flatten_columns=False,
        reset_index=False,
    )
    """
                   count_metric    sum_metric
    country              UK US         UK US
    dttm
    2019-01-01            1  2          5  6
    2019-01-02            3  4          7  8
    """
    compared_df = pp.compare(
        pivot_df,
        source_columns=["count_metric"],
        compare_columns=["sum_metric"],
        compare_type=PPC.DIFF,
        drop_original_columns=True,
    )
    """
               difference__count_metric__sum_metric
    country                                      UK US
    dttm
    2019-01-01                                    4  4
    2019-01-02                                    4  4
    """
    flat_df = pp.flat(compared_df)
    """
            dttm  difference__count_metric__sum_metric, UK  difference__count_metric__sum_metric, US
    0 2019-01-01                                         4                                         4
    1 2019-01-02                                         4                                         4
    """

TESTING INSTRUCTIONS

Let's test with a new dataset.

1. Download DailyDelhiClimateTrain.csv to your local machine.
2. Import DailyDelhiClimateTrain into Superset:
   a) Click "Upload CSV to database" under the blue plus sign at the top of Superset.
   b) Enter DailyDelhiClimateTrain as the table name.
   c) Select DailyDelhiClimateTrain.csv.
   d) Enter date in the Parse Dates field.
   e) Click Save.
3. Open the Explore page and select the DailyDelhiClimateTrain dataset.
4. Change the viz type to Line Chart; make sure to pick the ECharts version.
5. Drag date to X-Axis.
6. Drag meantemp to Metrics and select max as the aggregate.
7. Select a time range.
8. Select day as the time granularity.
9. In the AA section, select a time shift of 1 year ago and Actual values.
10. Validate that the 2015 data and the 2014 data are correct (use SQL Lab to validate).
11. Select different calculation types in the time comparison and validate the data.
12. Select different rolling types in the rolling window and validate the data.
13. Select different time granularities and validate the data.
14. Find some "time holes" in the data and validate Resample.
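When validating the resample and rolling steps by hand, it can help to reproduce them on a tiny frame first. The sketch below uses plain pandas rather than Superset's operators; the data, the zero fill value, and the window size are all illustrative assumptions.

```python
import pandas as pd

# Small frame with MultiIndex columns and a "time hole" on 2019-01-03.
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=pd.DatetimeIndex(
        ["2019-01-01", "2019-01-02", "2019-01-04"], name="dttm"
    ),
    columns=pd.MultiIndex.from_product(
        [["count_metric"], ["UK", "US"]], names=[None, "country"]
    ),
)

# Resample to daily frequency; the missing day is filled with zeros.
resampled = df.resample("D").asfreq(fill_value=0)

# Rolling sum over two rows, computed independently for each column.
rolled = resampled.rolling(window=2, min_periods=1).sum()

print(resampled)
print(rolled)
```

Both calls keep the two-level columns intact, so no flattening is needed between them.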

Daily PCT with cumsum

Monthly comparison with quarter sum rolling

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

codecov · 2022-03-14T14:02:20Z

Codecov Report

Merging #19116 (b7bc340) into master (6083545) will decrease coverage by 0.00%.
The diff coverage is 84.67%.

@@            Coverage Diff             @@
##           master   #19116      +/-   ##
==========================================
- Coverage   66.65%   66.64%   -0.01%     
==========================================
  Files        1672     1674       +2     
  Lines       64611    64602       -9     
  Branches     6505     6498       -7     
==========================================
- Hits        43066    43057       -9     
  Misses      19862    19862              
  Partials     1683     1683

Flag	Coverage Δ
hive	`52.67% <32.29%> (+0.02%)`	⬆️
javascript	`51.31% <96.42%> (-0.02%)`	⬇️
mysql	`81.65% <81.25%> (-0.01%)`	⬇️
postgres	`81.70% <81.25%> (-0.01%)`	⬇️
presto	`52.51% <32.29%> (+0.02%)`	⬆️
python	`82.12% <81.25%> (-0.01%)`	⬇️
sqlite	`81.47% <81.25%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...superset-ui-core/src/query/types/PostProcessing.ts	`100.00% <ø> (ø)`
...src/BigNumber/BigNumberWithTrendline/buildQuery.ts	`11.11% <ø> (+2.02%)`	⬆️
...in-chart-echarts/src/MixedTimeseries/buildQuery.ts	`0.00% <ø> (ø)`
.../plugin-chart-echarts/src/Timeseries/buildQuery.ts	`66.66% <0.00%> (ø)`
superset/charts/schemas.py	`99.33% <ø> (ø)`
superset/utils/pandas_postprocessing/aggregate.py	`90.90% <ø> (ø)`
superset/utils/pandas_postprocessing/diff.py	`100.00% <ø> (ø)`
superset/utils/pandas_postprocessing/select.py	`100.00% <ø> (ø)`
superset/utils/pandas_postprocessing/sort.py	`100.00% <ø> (ø)`
superset/utils/pandas_postprocessing/geography.py	`82.85% <25.00%> (ø)`
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6083545...b7bc340. Read the comment docs.

villebro

Really nice refactor! Code LGTM and makes the AA/post processing functionality more uniform. My main concern is the assumption that before we've run the pivot operation, the data is not indexed, but after the pivot operation the df is indexed. This could be slightly confusing for viz developers, as they need to be aware of this. But we can keep iterating on this later (maybe do a big breaking change on the post processing API on 3.0)

superset-frontend/packages/superset-ui-chart-controls/src/operators/flatOperator.ts

villebro · 2022-03-21T08:05:07Z

superset/utils/pandas_postprocessing/flat.py

+)
+
+
+def flat(df: pd.DataFrame, reset_index: bool = True,) -> pd.DataFrame:


Just a thought (doesn't need to be done here): Should we introduce a separate operation index that sets the index without needing to do a full pivot along with aggregations on the metrics? Then we could perhaps call them set_index and flatten_index or something that clearly communicates that we're specifically changing the index.

These sound like the stack and unstack functions in pandas. I will add them in the future.
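To make the idea concrete, here is a rough pandas sketch of setting an index without aggregating, followed by `unstack`. It is not part of this PR; the frame `long_df` is hypothetical, and `set_index` and `unstack` are the pandas methods, not Superset operators.

```python
import pandas as pd

# Hypothetical un-indexed query result.
long_df = pd.DataFrame(
    {
        "dttm": pd.to_datetime(
            ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"]
        ),
        "country": ["UK", "US", "UK", "US"],
        "count_metric": [1, 2, 3, 4],
        "sum_metric": [5, 6, 7, 8],
    }
)

# Set the index only; no aggregation is performed on the metrics.
indexed = long_df.set_index(["dttm", "country"])

# Move the "country" index level into the columns, producing
# two-level columns similar to a pivot.
wide = indexed.unstack("country")

print(wide)
```

Unlike a pivot with aggregates, `unstack` raises an error if the index contains duplicate entries, so it only suits data that is already unique per (dttm, country). `DataFrame.stack` performs the reverse move from columns back to the index.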

jinghua-qa · 2022-03-21T23:57:15Z

/testenv up

github-actions · 2022-03-21T23:59:03Z

@jinghua-qa Ephemeral environment spinning up at http://18.237.75.154:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

jinghua-qa

Thank you for the detailed test steps~ LGTM

zhaoyongjie · 2022-03-23T03:58:22Z

Really nice refactor! Code LGTM and makes the AA/post processing functionality more uniform. My main concern is the assumption that before we've run the pivot operation, the data is not indexed, but after the pivot operation the df is indexed. This could be slightly confusing for viz developers, as they need to be aware of this. But we can keep iterating on this later (maybe do a big breaking change on the post processing API on 3.0)

I think the sort operator could be used in this case (maybe). If you have related examples, I'd love to continue iterating.

github-actions · 2022-03-23T05:46:54Z

Ephemeral environment shutdown and build artifacts deleted.

pull-request-size bot added the size/XXL label Mar 11, 2022

zhaoyongjie marked this pull request as draft March 11, 2022 09:20

zhaoyongjie changed the title ~~[WIP] Support MultiIndex column in post processing stage~~ feat(advanced analysis): Support MultiIndex column in post processing stage Mar 14, 2022

zhaoyongjie marked this pull request as ready for review March 14, 2022 09:20

zhaoyongjie changed the title ~~feat(advanced analysis): Support MultiIndex column in post processing stage~~ feat(advanced analysis): support MultiIndex column in post processing stage Mar 14, 2022

zhaoyongjie force-pushed the support_mulipleindex branch from 9d36bfb to 04fad0e Compare March 14, 2022 13:32

zhaoyongjie force-pushed the support_mulipleindex branch from 04fad0e to f218b89 Compare March 15, 2022 05:05

zhaoyongjie requested review from jinghua-qa, ktmud, villebro and a team March 15, 2022 12:22

zhaoyongjie mentioned this pull request Mar 16, 2022

[Aera/line chart] Zero imputation resample returns error when combined with Group By #19157

Closed

3 tasks

jinghua-qa added the need:qa-review Requires QA review label Mar 16, 2022

zhaoyongjie force-pushed the support_mulipleindex branch 3 times, most recently from 74e592d to 8479dc8 Compare March 20, 2022 05:43

villebro approved these changes Mar 21, 2022

View reviewed changes

jinghua-qa approved these changes Mar 22, 2022

View reviewed changes

zhaoyongjie added 9 commits March 23, 2022 10:38

operator refactor

b98ab2e

refine exception

ddbdd44

add cr EOF

17a0021

add cr EOF

b49db14

add schemas

cad94b6

convert cell

4902a2a

fe operator

38de775

fe codes

3bddbd2

remove unused ut

785b92d

zhaoyongjie added 13 commits March 23, 2022 10:38

refine type

4cfbe40

lint

0a840a8

fix lint

8cb9485

fix py ut

b9edb28

fix IT

3e1ad46

fix UT

0f537ef

clean up

9603127

add new ut

941d717

fix Decimal metric in mysql

fe7b9aa

refine compare result

712efc2

resample should after timecompare

e656aa4

flat -> flatten

56e41f1

change schema

b7bc340

zhaoyongjie force-pushed the support_mulipleindex branch from 8479dc8 to b7bc340 Compare March 23, 2022 03:44

zhaoyongjie merged commit 375c03e into apache:master Mar 23, 2022

michael-hoffman-26 pushed a commit to nielsen-oss/superset that referenced this pull request Mar 23, 2022

feat(advanced analysis): support MultiIndex column in post processing…

7169d69

… stage (apache#19116)

villebro added lts-v1 and removed need:qa-review Requires QA review labels Mar 29, 2022

villebro pushed a commit that referenced this pull request Apr 3, 2022

feat(advanced analysis): support MultiIndex column in post processing…

9bc7633

… stage (#19116)

cwegener mentioned this pull request Apr 11, 2022

Regression - Echarts Bar and Line chart label values changed unexpectectly when using Group By #19654

Closed

3 tasks

john-bodley mentioned this pull request Apr 13, 2022

fix: time comparision #19659

Merged

9 tasks

philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022

feat(advanced analysis): support MultiIndex column in post processing…

2e6bf73

… stage (apache#19116)

mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
