fix: eliminate cartesian product columns in pivot operator #15975

Merged
merged 4 commits into from
Jul 31, 2021

Conversation

zhaoyongjie
Member

@zhaoyongjie zhaoyongjie commented Jul 30, 2021

SUMMARY

closes: #15956

Eliminate the Cartesian-product columns that the pivot operator generates for column combinations that never occur in the data.
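
For context, a minimal sketch of the behaviour being fixed, assuming a recent pandas release. The sample data is made up and the variable names (`observed_pairs`, `pivoted_pairs`, `spurious_pairs`) are illustrative, not taken from Superset. With `dropna=False`, `pivot_table` emits a column for every combination of the `columns` levels, including pairs that never occur in the data:

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data: only the pairs (a, a) and (b, b) occur.
df = pd.DataFrame(
    {
        "ds": [datetime(2012, 11, 1), datetime(2012, 11, 1)],
        "col1": ["a", "b"],
        "col2": ["a", "b"],
        "metric": [1.0, 9.0],
    }
)

pivoted = df.pivot_table(
    index="ds",
    columns=["col1", "col2"],
    values=["metric"],
    aggfunc="mean",
    dropna=False,
)

# Pairs that actually occur in the source rows.
observed_pairs = set(zip(df["col1"], df["col2"]))

# Pairs present in the pivoted columns (level 0 holds the metric name).
pivoted_pairs = {col[1:] for col in pivoted.columns}

# Whatever is left over was introduced by the Cartesian product.
spurious_pairs = pivoted_pairs - observed_pairs
print(sorted(spurious_pairs))
```

The pairs in `spurious_pairs` are the ones that would show up as extra legend entries in a chart built from `pivoted`.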

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

After

image

TESTING INSTRUCTIONS

Added a test to the integration tests.

ADDITIONAL INFORMATION

@codecov

codecov bot commented Jul 30, 2021

Codecov Report

Merging #15975 (2ac07f1) into master (cc704dd) will decrease coverage by 0.22%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master   #15975      +/-   ##
==========================================
- Coverage   77.06%   76.83%   -0.23%     
==========================================
  Files         988      988              
  Lines       52387    52397      +10     
  Branches     6626     6626              
==========================================
- Hits        40370    40259     -111     
- Misses      11793    11914     +121     
  Partials      224      224              
Flag | Coverage Δ
-- | --
hive | ?
mysql | 81.58% <100.00%> (-0.01%) ⬇️
postgres | 81.65% <100.00%> (-0.01%) ⬇️
presto | ?
python | 81.74% <100.00%> (-0.44%) ⬇️
sqlite | 81.29% <100.00%> (-0.01%) ⬇️

Impacted Files | Coverage Δ
-- | --
superset/utils/pandas_postprocessing.py | 84.80% <100.00%> (+0.55%) ⬆️
superset/db_engines/hive.py | 0.00% <0.00%> (-82.15%) ⬇️
superset/db_engine_specs/hive.py | 69.80% <0.00%> (-16.87%) ⬇️
superset/db_engine_specs/presto.py | 83.47% <0.00%> (-6.91%) ⬇️
superset/reports/notifications/slack.py | 86.95% <0.00%> (-3.21%) ⬇️
superset/charts/post_processing.py | 77.01% <0.00%> (-2.99%) ⬇️
superset/views/database/mixins.py | 81.03% <0.00%> (-1.73%) ⬇️
superset/connectors/sqla/models.py | 88.20% <0.00%> (-1.64%) ⬇️
superset/db_engine_specs/base.py | 87.98% <0.00%> (-0.39%) ⬇️
superset/models/core.py | 89.61% <0.00%> (-0.26%) ⬇️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc704dd...2ac07f1. Read the comment docs.

Member

@villebro villebro left a comment

Great solution! A few small comments, other than that LGTM!

superset/utils/pandas_postprocessing.py (outdated, resolved)
superset/utils/pandas_postprocessing.py (outdated, resolved)
tests/integration_tests/pandas_postprocessing_tests.py (outdated, resolved)
tests/integration_tests/pandas_postprocessing_tests.py (outdated, resolved)
@junlincc
Member

/testenv up

@github-actions
Contributor

@junlincc Ephemeral environment spinning up at http://34.219.46.208:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

@junlincc junlincc added rush! Requires immediate attention and removed rush! Requires immediate attention labels Jul 30, 2021
@junlincc
Member

junlincc commented Jul 30, 2021

After in testenv
Screen Shot 2021-07-30 at 3 52 05 AM

Tested on a virtual dataset. LGTM ✅, unless I missed something.

# https://github.com/apache/superset/issues/15956
# https://github.com/pandas-dev/pandas/issues/18030
series_set = set()
to_string_list: Callable[[List[Any]], List[str]] = lambda lst: [str(_) for _ in lst]
Member

Could we even wrap the "_".join inside the lambda?

Member Author

Sure! thanks!

Member

@villebro villebro left a comment

LGTM!

Contributor

@serenajiang serenajiang left a comment

Thanks for addressing this! ❤️

One minor comment. Also just wondering - would dropna(how="all", ...) on the final result work as a simpler solution?

# https://github.com/apache/superset/issues/15956
# https://github.com/pandas-dev/pandas/issues/18030
series_set = set()
lst_to_str: Callable[[List[Any]], str] = lambda lst: "_".join(str(_) for _ in lst)
Contributor

Minor nit: I think you could just replace calls to this with str().

Member Author

@zhaoyongjie zhaoyongjie Jul 30, 2021

The column values might be integers or floats, so they need to be converted to strings before joining. For instance:

>>> "".join([1,0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected str instance, int found

Contributor

Yeah, but the join is not necessary - you can just call str() on a list to get a (deterministic) string representation of the list.

In [1]: str([0,"x"])
Out[1]: "[0, 'x']"

tuple() might be better (I assume the intention is to make the list hashable)
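
To illustrate the trade-off between the options discussed in this thread, here is a minimal sketch. The sample values in `combos` are made up for illustration and do not come from Superset. It shows that a list cannot be put into a set directly, and that joining with `"_"` can map two different combinations to the same key, whereas `tuple()` and `str()` keep them distinct:

```python
# Hypothetical column-value combinations, mixing ints and strings.
combos = [[1, 0], ["a_b", "c"], ["a", "b_c"]]

# A list is unhashable, so it cannot be stored in a set as-is.
try:
    {combos[0]}
    error = ""
except TypeError as exc:
    error = str(exc)

# Joining with "_" needs an explicit str() per item, and
# ["a_b", "c"] and ["a", "b_c"] both collapse to "a_b_c".
joined = {"_".join(str(v) for v in combo) for combo in combos}

# tuple() and str() each give a hashable key without collisions here.
as_tuples = {tuple(combo) for combo in combos}
as_strings = {str(combo) for combo in combos}

print(len(joined), len(as_tuples), len(as_strings))
```

Whether such a collision matters in practice depends on the column values, so treat this as a reason to prefer `tuple()` or `str()` rather than as evidence of an actual bug.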

Member Author

Nice tips! Changes done.

@zhaoyongjie zhaoyongjie merged commit c01d42f into apache:master Jul 31, 2021
@github-actions
Contributor

Ephemeral environment shutdown and build artifacts deleted.

@zhaoyongjie
Member Author

> Thanks for addressing this! ❤️
>
> One minor comment. Also just wondering - would dropna(how="all", ...) on the final result work as a simpler solution?

Sorry, I missed this review. dropna with how="all" might also drop columns whose metric value is genuinely NULL. For instance:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({
    "ds": [datetime(2012, 11, 1), datetime(2012, 11, 1)],
    "col1": ['a', 'b'],
    "col2": ['a', 'b'],
    "metric": [np.nan, 9],  # <--- metric values (np.NaN was removed in NumPy 2.0)
})
df

  | ds | col1 | col2 | metric
-- | -- | -- | -- | --
2012-11-01 | a | a | NaN
2012-11-01 | b | b | 9.0

df = df.pivot_table(
    index="ds",
    columns=["col1", "col2"],
    values=["metric"],
    aggfunc={
     "metric": np.mean
    },
    dropna=False
)

df.dropna(how="all", axis=1)



  | metric
-- | --
9.0
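
A runnable sketch of the comparison above, assuming recent pandas and NumPy releases. The name `observed_pairs` and the filtering step are illustrative and are not the code merged in this PR. It contrasts `dropna(how="all", axis=1)` with keeping only the column pairs that occur in the source rows:

```python
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ds": [datetime(2012, 11, 1), datetime(2012, 11, 1)],
        "col1": ["a", "b"],
        "col2": ["a", "b"],
        "metric": [np.nan, 9],  # the (a, a) series has a NULL metric
    }
)

pivoted = df.pivot_table(
    index="ds",
    columns=["col1", "col2"],
    values=["metric"],
    aggfunc="mean",
    dropna=False,
)

# Option 1: drop every all-NaN column. This also removes the real
# (a, a) series, because its only value is NULL.
dropped = pivoted.dropna(how="all", axis=1)

# Option 2: keep only the pairs that occur in the source rows
# (level 0 of each column label is the metric name).
observed_pairs = set(zip(df["col1"], df["col2"]))
mask = [col[1:] in observed_pairs for col in pivoted.columns]
filtered = pivoted.loc[:, mask]

print(list(dropped.columns))
print(list(filtered.columns))
```

Under these assumptions, `dropped` loses the `(a, a)` series while `filtered` keeps it with a NaN value, which is the distinction the comment above is pointing at.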

@jinghua-qa
Member

test cases added

opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
)

* fix: eliminate cartesian product columns in pivot operator

* wip

* wip

* minor tip
cccs-RyanS pushed a commit to CybercentreCanada/superset that referenced this pull request Dec 17, 2021
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.3.0 labels Mar 13, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/M test:case 🚢 1.3.0
Development

Successfully merging this pull request may close these issues.

Time Series viz shows incorrect legend for multiple group by columns
6 participants