feat(advanced analysis): support MultiIndex column in post processing stage #19116

Merged
merged 22 commits into from
Mar 23, 2022

@zhaoyongjie zhaoyongjie commented Mar 11, 2022

Summary

Superset uses pandas DataFrames to process query results. Currently, the DataFrames passed between operators only support single-level (flat) columns, which makes some calculations confusing. This PR lets operators work on DataFrames with MultiIndex columns, which simplifies those calculations.

Current PostProcessing

Post-processing operators are typically used when building the queries for time-series charts, e.g.:

        post_processing: [
          resampleOperator(formData, baseQueryObject),
          timeCompareOperator(formData, baseQueryObject),
          sortOperator(formData, { ...baseQueryObject, is_timeseries: true }),
          pivotOperatorInRuntime,
          rollingWindowOperator(formData, baseQueryObject),
          contributionOperator(formData, baseQueryObject),
          prophetOperator(formData, baseQueryObject),
        ],

The existing design has these problems:

  1. Such a query can't be adapted to both the rolling calculation and the time comparison calculation. In other words, rollingWindowOperator should run before timeCompareOperator.
  2. We also need to consider how each operator handles the series in a DataFrame. For example, in resample.py we need to flatten a multidimensional DataFrame before the operator can work on it.

After the Change

The new operators support calculations on multidimensional DataFrames, so the operators in a QueryObject can be reordered freely, e.g.:

        post_processing: [
          pivotOperatorInRuntime,
          resampleOperator(formData, baseQueryObject),
          rollingWindowOperator(formData, baseQueryObject),
          timeCompareOperator(formData, baseQueryObject),
          flatOperator(formData, baseQueryObject),
          contributionOperator(formData, baseQueryObject),
          prophetOperator(formData, baseQueryObject),
        ],

For example, the compare() operator supports a DataFrame shaped like the following:

                   count_metric    sum_metric
    country              UK US         UK US
    dttm
    2019-01-01            1  2          5  6
    2019-01-02            3  4          7  8
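
A frame of this shape can be reproduced outside Superset with plain pandas. The following is a minimal sketch, not the PR's pivot operator: the long-format input rows are an assumption, reconstructed from the values in the table above.

```python
import pandas as pd

# Long-format rows (assumed input, rebuilt from the table above).
long_df = pd.DataFrame(
    {
        "dttm": pd.to_datetime(
            ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"]
        ),
        "country": ["UK", "US", "UK", "US"],
        "count_metric": [1, 2, 3, 4],
        "sum_metric": [5, 6, 7, 8],
    }
)

# Pivoting several value columns at once yields two-level columns:
# the metric name on the outer level, the country on the inner level.
wide_df = long_df.pivot_table(
    index="dttm",
    columns="country",
    values=["count_metric", "sum_metric"],
    aggfunc="sum",
)

print(wide_df)
# Selecting an outer-level key returns one metric with a column per country.
print(wide_df["count_metric"])
```

Selecting by the outer level is what lets an operator treat each metric as a block of series without flattening the frame first.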

This PR also adds a new operator, flat.py, which transforms a multidimensional DataFrame into a flat DataFrame.
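
The idea of flattening can be sketched in plain pandas. This is an illustration only, not the code in flat.py: `flatten_columns` is a hypothetical helper, and the ", " separator is an assumption based on the column names shown in this PR's example output.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame, sep: str = ", ") -> pd.DataFrame:
    """Hypothetical helper: join MultiIndex column levels into single names."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Each column key is a tuple such as ("count_metric", "UK").
        out.columns = [sep.join(str(level) for level in col) for col in out.columns]
    # Move the (named) index back into a regular column.
    return out.reset_index()


columns = pd.MultiIndex.from_product(
    [["count_metric", "sum_metric"], ["UK", "US"]], names=[None, "country"]
)
wide_df = pd.DataFrame(
    [[1, 2, 5, 6], [3, 4, 7, 8]],
    index=pd.DatetimeIndex(["2019-01-01", "2019-01-02"], name="dttm"),
    columns=columns,
)

flat_df = flatten_columns(wide_df)
print(list(flat_df.columns))
```

The helper works on a copy, so the multidimensional frame stays available for any operator that still needs it.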

A full example:

    pivot_df = pp.pivot(
        df=multiple_metrics_df,
        index=["dttm"],
        columns=["country"],
        aggregates={
            "sum_metric": {"operator": "sum"},
            "count_metric": {"operator": "sum"},
        },
        flatten_columns=False,
        reset_index=False,
    )
    """
                   count_metric    sum_metric
    country              UK US         UK US
    dttm
    2019-01-01            1  2          5  6
    2019-01-02            3  4          7  8
    """
    compared_df = pp.compare(
        pivot_df,
        source_columns=["count_metric"],
        compare_columns=["sum_metric"],
        compare_type=PPC.DIFF,
        drop_original_columns=True,
    )
    """
               difference__count_metric__sum_metric
    country                                      UK US
    dttm
    2019-01-01                                    4  4
    2019-01-02                                    4  4
    """
    flat_df = pp.flat(compared_df)
    """
            dttm  difference__count_metric__sum_metric, UK  difference__count_metric__sum_metric, US
    0 2019-01-01                                         4                                         4
    1 2019-01-02                                         4                                         4
    """
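
The difference step can be approximated in plain pandas to see why MultiIndex columns simplify it. This is a sketch under assumptions, not the implementation of pp.compare: the subtraction direction (sum_metric minus count_metric) is chosen only to match the output shown above, and the label is copied from that output.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["count_metric", "sum_metric"], ["UK", "US"]], names=[None, "country"]
)
pivot_df = pd.DataFrame(
    [[1, 2, 5, 6], [3, 4, 7, 8]],
    index=pd.DatetimeIndex(["2019-01-01", "2019-01-02"], name="dttm"),
    columns=columns,
)

# Each selection is a frame with one column per country, so the subtraction
# aligns UK with UK and US with US without any column-name parsing.
diff = pivot_df["sum_metric"] - pivot_df["count_metric"]

# Re-attach an outer level carrying the derived metric's label.
compared_df = pd.concat({"difference__count_metric__sum_metric": diff}, axis=1)
print(compared_df)
```

With flat columns the same step would need to match names such as "count_metric, UK" and "sum_metric, UK" by string manipulation.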

TESTING INSTRUCTIONS

Let's use a new dataset to test.

  1. Download DailyDelhiClimateTrain.csv to your local machine.
  2. Import DailyDelhiClimateTrain into Superset:
    a) Click "Upload CSV to database" under the blue plus sign at the top of Superset.
    b) Enter DailyDelhiClimateTrain as the table name.
    c) Select DailyDelhiClimateTrain.csv.
    d) Enter date in the Parse Dates field.
    e) Click Save.
  3. Open the Explore page and select the DailyDelhiClimateTrain dataset.
  4. Change the viz type to Line Chart; make sure you pick the ECharts version.
  5. Drag date to X-Axis.
  6. Drag meantemp to Metrics and select max as the aggregate.
  7. Select a time range.
  8. Select day as the time granularity.
  9. In the AA section, select a time shift of 1 year ago and "Actual values".
  10. Validate that the data for 2015 and the data for 2014 are correct (use SQL Lab to validate).
  11. Select different calculation types in time comparison and validate the data.
  12. Select different rolling types in rolling window and validate the data.
  13. Select different time granularities and validate the data.
  14. Find some "time holes" in the data and validate Resample.
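
The "time hole" check in the last step can be pictured with plain pandas. This is a minimal sketch with made-up data (a frame missing 2019-01-02), not Superset's resample operator; it only shows that pandas resampling works directly on a frame with MultiIndex columns.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["count_metric", "sum_metric"], ["UK", "US"]], names=[None, "country"]
)
# Hypothetical data with a one-day hole: 2019-01-02 is missing.
wide_df = pd.DataFrame(
    [[1, 2, 5, 6], [3, 4, 7, 8]],
    index=pd.DatetimeIndex(["2019-01-01", "2019-01-03"], name="dttm"),
    columns=columns,
)

# Daily resampling inserts the missing day; asfreq() leaves it as NaN.
resampled = wide_df.resample("D").asfreq()

# Forward fill copies the previous day's values into the hole instead.
forward_filled = wide_df.resample("D").ffill()

print(resampled)
print(forward_filled)
```

Both results keep the two-level columns, so no flattening is needed before or after the resample.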

Screenshot: Daily PCT with cumsum

Screenshot: Monthly comparison with quarter sum rolling

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@zhaoyongjie zhaoyongjie marked this pull request as draft March 11, 2022 09:20
@zhaoyongjie zhaoyongjie changed the title [WIP] Support MultiIndex column in post processing stage feat(advanced analysis): Support MultiIndex column in post processing stage Mar 14, 2022
@zhaoyongjie zhaoyongjie marked this pull request as ready for review March 14, 2022 09:20
@zhaoyongjie zhaoyongjie changed the title feat(advanced analysis): Support MultiIndex column in post processing stage feat(advanced analysis): support MultiIndex column in post processing stage Mar 14, 2022

codecov bot commented Mar 14, 2022

Codecov Report

Merging #19116 (b7bc340) into master (6083545) will decrease coverage by 0.00%.
The diff coverage is 84.67%.

@@            Coverage Diff             @@
##           master   #19116      +/-   ##
==========================================
- Coverage   66.65%   66.64%   -0.01%     
==========================================
  Files        1672     1674       +2     
  Lines       64611    64602       -9     
  Branches     6505     6498       -7     
==========================================
- Hits        43066    43057       -9     
  Misses      19862    19862              
  Partials     1683     1683              
Flag         Coverage Δ
hive         52.67% <32.29%> (+0.02%) ⬆️
javascript   51.31% <96.42%> (-0.02%) ⬇️
mysql        81.65% <81.25%> (-0.01%) ⬇️
postgres     81.70% <81.25%> (-0.01%) ⬇️
presto       52.51% <32.29%> (+0.02%) ⬆️
python       82.12% <81.25%> (-0.01%) ⬇️
sqlite       81.47% <81.25%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                           Coverage Δ
...superset-ui-core/src/query/types/PostProcessing.ts    100.00% <ø> (ø)
...src/BigNumber/BigNumberWithTrendline/buildQuery.ts    11.11% <ø> (+2.02%) ⬆️
...in-chart-echarts/src/MixedTimeseries/buildQuery.ts    0.00% <ø> (ø)
.../plugin-chart-echarts/src/Timeseries/buildQuery.ts    66.66% <0.00%> (ø)
superset/charts/schemas.py                               99.33% <ø> (ø)
superset/utils/pandas_postprocessing/aggregate.py        90.90% <ø> (ø)
superset/utils/pandas_postprocessing/diff.py             100.00% <ø> (ø)
superset/utils/pandas_postprocessing/select.py           100.00% <ø> (ø)
superset/utils/pandas_postprocessing/sort.py             100.00% <ø> (ø)
superset/utils/pandas_postprocessing/geography.py        82.85% <25.00%> (ø)
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@villebro villebro left a comment

Really nice refactor! Code LGTM and makes the AA/post processing functionality more uniform. My main concern is the assumption that before we've run the pivot operation, the data is not indexed, but after the pivot operation the df is indexed. This could be slightly confusing for viz developers, as they need to be aware of this. But we can keep iterating on this later (maybe do a big breaking change on the post processing API on 3.0)

)


def flat(df: pd.DataFrame, reset_index: bool = True,) -> pd.DataFrame:

Just a thought (doesn't need to be done here): Should we introduce a separate operation index that sets the index without needing to do a full pivot along with aggregations on the metrics? Then we could perhaps call them set_index and flatten_index or something that clearly communicates that we're specifically changing the index.

@zhaoyongjie (Member, Author) replied:

That sounds like the stack and unstack functions in pandas. I will add those in the future.

@jinghua-qa

/testenv up

@github-actions

@jinghua-qa Ephemeral environment spinning up at http://18.237.75.154:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

@jinghua-qa jinghua-qa left a comment

Thank you for the detailed test steps. LGTM

@zhaoyongjie

Quoting @villebro: "Really nice refactor! Code LGTM and makes the AA/post processing functionality more uniform. My main concern is the assumption that before we've run the pivot operation, the data is not indexed, but after the pivot operation the df is indexed. This could be slightly confusing for viz developers, as they need to be aware of this. But we can keep iterating on this later (maybe do a big breaking change on the post processing API on 3.0)"

I think the sort operator could perhaps be used in this case, and if you have related examples I'd love to continue iterating.

@zhaoyongjie zhaoyongjie merged commit 375c03e into apache:master Mar 23, 2022
@github-actions

Ephemeral environment shutdown and build artifacts deleted.

michael-hoffman-26 pushed a commit to nielsen-oss/superset that referenced this pull request Mar 23, 2022
@villebro villebro added lts-v1 and removed need:qa-review Requires QA review labels Mar 29, 2022
@john-bodley john-bodley mentioned this pull request Apr 13, 2022
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
@mistercrunch mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels lts-v1 size/XXL 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🚢 2.0.0
4 participants