@zhengruifeng zhengruifeng commented Aug 26, 2024

What changes were proposed in this pull request?

Fliers/outliers were ignored in the initial implementation (#36317).

Why are the changes needed?

Feature parity with pandas and with the Series box plot.

Does this PR introduce any user-facing change?

import pyspark.pandas as ps
df = ps.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1], [6.4, 3.2, 1], [5.9, 3.0, 2], [100, 200, 300]], columns=['length', 'width', 'species'])
df.boxplot()

df.length.plot.box()
[image: Series box plot produced by df.length.plot.box()]

before:
df.boxplot()
[image: df.boxplot() output before this change]

after:
df.boxplot()
[image: df.boxplot() output after this change]

How was this patch tested?

CI and manual checks.

Was this patch authored or co-authored using generative AI tooling?

No

@zhengruifeng zhengruifeng changed the title from "[SPARK-49382][PS] Make frame box plot properly render the fliers/outlier" to "[SPARK-49382][PS] Make frame box plot properly render the fliers/outliers" on Aug 26, 2024
for i, colname in enumerate(colnames):
    formated_colname = "`{}`".format(colname)
    outlier_colname = "__{}_outlier".format(colname)
    min_val = multicol_whiskers[colname]["min"]

It feels odd to select the outliers by the distance |value - lower_whisker|, which is what series.boxplot does.

It should probably be something like |value - median| or |value - mean|; I will revisit this later.
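
To make the concern above concrete, here is a minimal plain-Python sketch, not the pandas-on-Spark code. It assumes the common 1.5 * IQR whisker rule and a hypothetical cap that keeps only the `limit` fliers farthest from a reference point. The helper names `whisker_bounds` and `top_fliers` and the sample data are made up for illustration.

```python
from statistics import median, quantiles


def whisker_bounds(values, k=1.5):
    """Return (lower_whisker, upper_whisker) under the k * IQR rule (assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers are the most extreme data points that still lie inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return min(inside), max(inside)


def top_fliers(values, reference, limit):
    """Keep the `limit` fliers farthest from `reference` (hypothetical cap)."""
    lower, upper = whisker_bounds(values)
    fliers = [v for v in values if v < lower or v > upper]
    return sorted(fliers, key=lambda v: abs(v - reference), reverse=True)[:limit]


data = [1, 10, 11, 12, 13, 14, 15, 20]  # made-up sample, not from the PR
lower, upper = whisker_bounds(data)

# Same fliers (1 and 20), but the two reference points rank them differently.
by_lower_whisker = top_fliers(data, reference=lower, limit=1)
by_median = top_fliers(data, reference=median(data), limit=1)
print(by_lower_whisker, by_median)
```

With this sample, measuring from the lower whisker keeps 20 while measuring from the median keeps 1. Measuring from the lower whisker adds the width of the box to every high-side distance, which is the asymmetry the comment points at.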

@zhengruifeng zhengruifeng deleted the plot_hist_fly branch August 26, 2024 05:13
@zhengruifeng

merged to master
