fix: eliminate cartesian product columns in pivot operator #15975

zhaoyongjie · 2021-07-30T10:55:38Z

SUMMARY

closes: #15956

eliminate cartesian product columns in pivot operator

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

After

TESTING INSTRUCTIONS

Added a test to the integration tests.

ADDITIONAL INFORMATION

Has associated issue: Time Series viz shows incorrect legend for multiple group by columns #15956
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

codecov · 2021-07-30T11:12:50Z

Codecov Report

Merging #15975 (2ac07f1) into master (cc704dd) will decrease coverage by 0.22%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #15975      +/-   ##
==========================================
- Coverage   77.06%   76.83%   -0.23%     
==========================================
  Files         988      988              
  Lines       52387    52397      +10     
  Branches     6626     6626              
==========================================
- Hits        40370    40259     -111     
- Misses      11793    11914     +121     
  Partials      224      224

Flag	Coverage Δ
hive	`?`
mysql	`81.58% <100.00%> (-0.01%)`	⬇️
postgres	`81.65% <100.00%> (-0.01%)`	⬇️
presto	`?`
python	`81.74% <100.00%> (-0.44%)`	⬇️
sqlite	`81.29% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
superset/utils/pandas_postprocessing.py	`84.80% <100.00%> (+0.55%)`	⬆️
superset/db_engines/hive.py	`0.00% <0.00%> (-82.15%)`	⬇️
superset/db_engine_specs/hive.py	`69.80% <0.00%> (-16.87%)`	⬇️
superset/db_engine_specs/presto.py	`83.47% <0.00%> (-6.91%)`	⬇️
superset/reports/notifications/slack.py	`86.95% <0.00%> (-3.21%)`	⬇️
superset/charts/post_processing.py	`77.01% <0.00%> (-2.99%)`	⬇️
superset/views/database/mixins.py	`81.03% <0.00%> (-1.73%)`	⬇️
superset/connectors/sqla/models.py	`88.20% <0.00%> (-1.64%)`	⬇️
superset/db_engine_specs/base.py	`87.98% <0.00%> (-0.39%)`	⬇️
superset/models/core.py	`89.61% <0.00%> (-0.26%)`	⬇️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

villebro

Great solution! A few small comments, other than that LGTM!

superset/utils/pandas_postprocessing.py

tests/integration_tests/pandas_postprocessing_tests.py

junlincc · 2021-07-30T13:17:12Z

/testenv up

github-actions · 2021-07-30T13:19:31Z

@junlincc Ephemeral environment spinning up at http://34.219.46.208:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

junlincc · 2021-07-30T13:58:41Z

After in testenv

Tested on a virtual dataset, LGTM ✅ unless I missed something.

villebro · 2021-07-30T13:58:52Z

superset/utils/pandas_postprocessing.py

+    # https://github.com/apache/superset/issues/15956
+    # https://github.com/pandas-dev/pandas/issues/18030
+    series_set = set()
+    to_string_list: Callable[[List[Any]], List[str]] = lambda lst: [str(_) for _ in lst]


Could we even wrap the "_".join inside the lambda?

Sure! thanks!

villebro

LGTM!

serenajiang

Thanks for addressing this! ❤️

One minor comment. Also just wondering - would dropna(how="all", ...) on the final result work as a simpler solution?

serenajiang · 2021-07-30T16:50:30Z

superset/utils/pandas_postprocessing.py

+    # https://github.com/apache/superset/issues/15956
+    # https://github.com/pandas-dev/pandas/issues/18030
+    series_set = set()
+    lst_to_str: Callable[[List[Any]], str] = lambda lst: "_".join(str(_) for _ in lst)


Minor nit: I think you could just replace calls to this with str().

The column values might be integers or floats, so they need to be converted to strings before joining. For instance:

>>> "".join([1,0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected str instance, int found

Yeah, but the join is not necessary - you can just call str() on a list to get a (deterministic) string representation of the list.

In [1]: str([0,"x"])
Out[1]: "[0, 'x']"

tuple() might be better (I assume the intention is to make the list hashable)

nice tips! changes done.
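
A small sketch of the point made in this thread, assuming the only goal is to store column labels in a set: a list is unhashable, `tuple()` gives a hashable key that keeps the original values, and `str()` gives a hashable but purely textual key. The names `labels`, `as_tuple`, `as_str`, and `seen` are illustrative and not taken from the PR.

```python
labels = [0, "x"]  # a mixed-type column label, as in the example above

# A list cannot be a set member because it is unhashable.
try:
    {labels}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# tuple() keeps the element types; str() flattens them into text.
as_tuple = tuple(labels)
as_str = str(labels)

seen = {as_tuple}
print(list_is_hashable)  # False
print((0, "x") in seen)  # True
print(as_str)            # [0, 'x']
```

With tuples, a later membership check can compare directly against the tuples that a pandas MultiIndex yields when iterated, without formatting either side as a string.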

github-actions · 2021-07-31T08:02:26Z

Ephemeral environment shutdown and build artifacts deleted.

zhaoyongjie · 2021-07-31T08:40:55Z

Thanks for addressing this! ❤️

One minor comment. Also just wondering - would dropna(how="all", ...) on the final result work as a simpler solution?

Sorry, I missed this review comment. dropna with how="all" might also drop a column whose metric values are legitimately NULL, for instance:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({
    "ds": [datetime(2012, 11, 1), datetime(2012, 11, 1)],
    "col1": ['a', 'b'],
    "col2": ['a', 'b'],
    "metric": [np.nan, 9],  # <--- metric values; (a, a) is a real series whose value is NULL
})
df

  | ds | col1 | col2 | metric
-- | -- | -- | -- | --
0 | 2012-11-01 | a | a | NaN
1 | 2012-11-01 | b | b | 9.0

df = df.pivot_table(
    index="ds",
    columns=["col1", "col2"],
    values=["metric"],
    aggfunc={
     "metric": np.mean
    },
    dropna=False
)

df.dropna(how="all", axis=1)



  | metric
-- | --
9.0
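
The alternative discussed in this PR can be sketched as follows: record which column combinations actually occur in the source rows, pivot with `dropna=False`, then keep only the recorded combinations. This is a minimal illustration on the data above, not the code merged in `superset/utils/pandas_postprocessing.py`; the names `observed`, `mask`, and `filtered` are hypothetical.

```python
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ds": [datetime(2012, 11, 1), datetime(2012, 11, 1)],
    "col1": ["a", "b"],
    "col2": ["a", "b"],
    "metric": [np.nan, 9],
})
columns = ["col1", "col2"]
metrics = ["metric"]

# Record every (metric, col1, col2) combination present in the source rows.
observed = {
    (metric, *row)
    for row in df[columns].itertuples(index=False, name=None)
    for metric in metrics
}

pivoted = df.pivot_table(
    index="ds",
    columns=columns,
    values=metrics,
    aggfunc="mean",
    dropna=False,  # keeps all-NaN series, but also adds unobserved combinations
)

# Keep only the pivoted columns whose label was seen in the source data.
mask = [col in observed for col in pivoted.columns]
filtered = pivoted.loc[:, mask]

print(list(pivoted.columns))
print(list(filtered.columns))
```

Here the `(a, a)` series stays in `filtered` even though its only metric value is NaN, whereas `dropna(how="all", axis=1)` removes it along with the combinations that never appeared in the data.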

jinghua-qa · 2021-11-01T15:23:23Z

test cases added

pull-request-size bot added the size/M label Jul 30, 2021

zhaoyongjie requested a review from villebro July 30, 2021 10:55

zhaoyongjie force-pushed the pivot_columns branch from 2aebd16 to cd2056e Compare July 30, 2021 11:02

villebro reviewed Jul 30, 2021

zhaoyongjie force-pushed the pivot_columns branch from 493457c to 2dc41f8 Compare July 30, 2021 12:51

zhaoyongjie requested a review from villebro July 30, 2021 12:52

junlincc requested a review from serenajiang July 30, 2021 13:24

junlincc added and removed the rush! label Jul 30, 2021

villebro reviewed Jul 30, 2021

junlincc added the test:case label Jul 30, 2021

villebro approved these changes Jul 30, 2021

zhaoyongjie added 3 commits July 31, 2021 00:25

fix: eliminate cartesian product columns in pivot operator

0b73e84

wip

b42060d

wip

424b047

zhaoyongjie force-pushed the pivot_columns branch from 338761c to 424b047 Compare July 30, 2021 16:26

serenajiang reviewed Jul 30, 2021

minor tip

2ac07f1

serenajiang approved these changes Jul 31, 2021

zhaoyongjie merged commit c01d42f into apache:master Jul 31, 2021

junlincc mentioned this pull request Aug 2, 2021

[time-series]fail to add more than one metric in time-series chart #16023

Closed

opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021

fix: eliminate cartesian product columns in pivot operator (apache#15975)

315feb8

* fix: eliminate cartesian product columns in pivot operator
* wip
* wip
* minor tip

cccs-RyanS pushed a commit to CybercentreCanada/superset that referenced this pull request Dec 17, 2021

fix: eliminate cartesian product columns in pivot operator (apache#15975)

06cb3bb

* fix: eliminate cartesian product columns in pivot operator
* wip
* wip
* minor tip

QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021

fix: eliminate cartesian product columns in pivot operator (apache#15975)

d932058

* fix: eliminate cartesian product columns in pivot operator
* wip
* wip
* minor tip

Usiel mentioned this pull request Mar 23, 2023

perf(postprocessing): improve pivot postprocessing operation #23465

Merged

1 task

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024

fix: eliminate cartesian product columns in pivot operator (apache#15975)

02f6f12

* fix: eliminate cartesian product columns in pivot operator
* wip
* wip
* minor tip

mistercrunch added the 🏷️ bot and 🚢 1.3.0 labels Mar 13, 2024
