
perf(postprocessing): improve pivot postprocessing operation #23465

Merged: 1 commit merged into apache:master on Mar 24, 2023

Conversation

@Usiel (Contributor) commented Mar 23, 2023

SUMMARY

See the linked issue for a description of how this can affect a Superset user.

Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets. The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and bring the postprocessing time down from minutes to seconds for large datasets (thousands or millions of columns).
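
To illustrate the pattern described above (a minimal sketch, not the actual Superset implementation; the helper names are made up): dropping columns one at a time inside a loop forces pandas to rebuild the DataFrame on every iteration, whereas collecting the unwanted columns first and dropping them in a single call rebuilds the frame once.

```python
import pandas as pd


def drop_unwanted_slow(df: pd.DataFrame, wanted: set) -> pd.DataFrame:
    # Each drop() returns a new DataFrame, so the frame is copied once per
    # unwanted column, which dominates runtime when there are thousands of columns.
    for col in list(df.columns):
        if col not in wanted:
            df = df.drop(col, axis=1)
    return df


def drop_unwanted_fast(df: pd.DataFrame, wanted: set) -> pd.DataFrame:
    # Same result, but the frame is rebuilt only once.
    unwanted = [col for col in df.columns if col not in wanted]
    return df.drop(columns=unwanted)
```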

TESTING INSTRUCTIONS

Luckily, this feature is already covered by unit tests.

Manual test to see performance improvement:

  1. Explore the example dataset cleaned_sales_data
  2. Add multiple dimensions (e.g. contact_first_name, contact_last_name, phone)
  3. Add any metric
  4. Select the Time-series Line Chart
  5. Click on "Update Chart"

Results on this branch
Chart loads within a few seconds

Negative results (e.g. on master)
Chart will time out or take a very long time

ADDITIONAL INFORMATION

Related to #15975: this PR keeps the workaround and its tests, and only changes the implementation slightly.

codecov bot commented Mar 23, 2023

Codecov Report

Merging #23465 (e4edc55) into master (7ef06b0) will increase coverage by 1.85%.
The diff coverage is 82.14%.

❗ Current head e4edc55 differs from pull request most recent head 20bc99a. Consider uploading reports for the commit 20bc99a to get more accurate results

@@            Coverage Diff             @@
##           master   #23465      +/-   ##
==========================================
+ Coverage   65.76%   67.62%   +1.85%     
==========================================
  Files        1908     1908              
  Lines       73726    73736      +10     
  Branches     7989     7988       -1     
==========================================
+ Hits        48489    49861    +1372     
+ Misses      23189    21829    -1360     
+ Partials     2048     2046       -2     
Flag Coverage Δ
hive 52.73% <45.00%> (?)
postgres 78.51% <70.00%> (+<0.01%) ⬆️
presto 52.66% <45.00%> (?)
python 82.32% <80.00%> (+3.81%) ⬆️
sqlite 77.01% <55.00%> (?)
unit 52.58% <55.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...chart-echarts/src/Timeseries/Area/controlPanel.tsx 40.00% <ø> (ø)
...harts/src/Timeseries/Regular/Line/controlPanel.tsx 33.33% <ø> (ø)
...ts/src/Timeseries/Regular/Scatter/controlPanel.tsx 40.00% <ø> (ø)
...src/Timeseries/Regular/SmoothLine/controlPanel.tsx 40.00% <ø> (ø)
...chart-echarts/src/Timeseries/Step/controlPanel.tsx 33.33% <ø> (ø)
...s/plugin-chart-pivot-table/src/PivotTableChart.tsx 0.00% <ø> (ø)
...n-chart-pivot-table/src/react-pivottable/Styles.js 0.00% <ø> (ø)
superset-frontend/src/SqlLab/actions/sqlLab.js 64.32% <ø> (-0.18%) ⬇️
...frontend/src/SqlLab/components/SaveQuery/index.tsx 75.55% <ø> (ø)
...frontend/src/SqlLab/components/SqlEditor/index.jsx 56.68% <0.00%> (ø)
... and 14 more

... and 90 files with indirect coverage changes


@villebro (Member) left a comment

Awesome perf improvement! One comment about the PR description and the comment added in the diff, but other than that I feel like this is good to go if CI passes.

@@ -100,11 +100,9 @@ def pivot( # pylint: disable=too-many-arguments,too-many-locals
margins_name=marginal_distribution_name,
)

# removes additional columns from pandas issue #18030 described above
Member

In general I don't think we should place comments like this in the codebase, as over time it will cause a significant amount of clutter. People can always check git blame and take a look at the underlying PR if they need more context, and that's where info like this should be placed IMO (btw, the issue referenced here doesn't appear to be mentioned in the PR description). In addition, it was unclear to me what is meant by "described above" - there is no description directly above this block.

@Usiel (Contributor, Author)

Sure, I can remove this if you all prefer to keep comments to a minimum. The intent was to make it more obvious that the workaround for the pandas issue is a two-step operation.

I'll add a ref to the original PR for better context, in case this ever gets fixed in pandas.
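
To spell out the two-step workaround (a rough sketch with illustrative names and a simplified signature, not the actual Superset `pivot` implementation): step 1 records which (metric, dimension, ...) combinations actually occur in the data; step 2 drops, in a single call, the extra cartesian-product columns that `pivot_table(dropna=False)` introduces.

```python
import pandas as pd


def pivot_without_dropping_observed_columns(df, index, columns, aggfunc):
    # Step 1: before pivoting, remember which (metric, dimension, ...)
    # combinations actually occur in the data.
    observed = set()
    for row in df[columns].itertuples():
        for metric in aggfunc.keys():
            observed.add((metric,) + tuple(row[1:]))

    # pivot_table(dropna=False) expands the columns to the full cartesian
    # product of category values (the pandas issue referenced in the diff).
    pivoted = df.pivot_table(
        index=index, columns=columns, aggfunc=aggfunc, dropna=False
    )

    # Step 2: drop every never-observed column in a single call instead of
    # one drop() per column.
    return pivoted.drop(columns=pivoted.columns.difference(list(observed)))
```

The exact dropna and margins handling in Superset differs; the sketch only shows the record-then-trim shape of the workaround.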

Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets.
The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and bring the postprocessing time
down from minutes to seconds for large datasets (millions of columns).

Fixes apache#23464
@Usiel force-pushed the usiel/perf-pivot-post-process branch from 8641834 to 20bc99a on March 23, 2023 08:44
@villebro (Member) left a comment

LGTM but I'd be curious to hear @zhaoyongjie 's thoughts, too

@@ -87,7 +87,7 @@ def pivot( # pylint: disable=too-many-arguments,too-many-locals
if not drop_missing_columns and columns:
for row in df[columns].itertuples():
for metric in aggfunc.keys():
series_set.add(str(tuple([metric]) + tuple(row[1:])))
series_set.add(tuple([metric]) + tuple(row[1:]))
Member

@zhaoyongjie do you have any reservations about this change? IMO we have pretty good test coverage, so if this change doesn't break existing tests I feel it's good to go (if it does cause a regression, that's just a sign that we needed more/better tests). So I feel the change is warranted, as it also simplifies the code.

@zhaoyongjie (Member)

@villebro from the line of code, the previous version serialized the combined tuple to a string; this change (L104) uses a vectorized difference on the columns instead, so I think this PR performs better than before. Thanks @Usiel for this PR.
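
A minimal sketch of the point above (illustrative metric and dimension values, not the real dataset or the exact diff): because a pivoted frame's columns form a MultiIndex of tuples, a set of observed tuples can be compared against it with one vectorized `Index.difference` call, whereas string keys would require stringifying and checking every column one by one.

```python
import pandas as pd

# Columns as they might look after a pivot: (metric, dim1, dim2) tuples.
columns = pd.MultiIndex.from_tuples(
    [
        ("sum__num", "Alice", "NY"),
        ("sum__num", "Bob", "CA"),
        ("sum__num", "Bob", "NY"),
    ]
)
df = pd.DataFrame([[1, 2, 3]], columns=columns)

# Combinations that actually occur in the source data, kept as tuples
# (not str(...)-serialized), so they are directly comparable to the MultiIndex.
observed = {("sum__num", "Alice", "NY"), ("sum__num", "Bob", "CA")}

# One vectorized call finds the columns that were never observed...
unobserved = df.columns.difference(list(observed))
# ...and one call drops them.
trimmed = df.drop(columns=unobserved)
print(trimmed.columns.tolist())
# [('sum__num', 'Alice', 'NY'), ('sum__num', 'Bob', 'CA')]
```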

@villebro merged commit be2eb31 into apache:master on Mar 24, 2023
@mistercrunch added the 🏷️ bot and 🚢 3.0.0 labels on Mar 13, 2024