[SPARK-40579][PS] GroupBy.first should skip NULLs #38017

Closed
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:ps_first_skip_na

@zhengruifeng (Contributor)

What changes were proposed in this pull request?

Make GroupBy.first in pandas API on Spark skip NULLs.

Why are the changes needed?

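To fix a behavior difference between pandas and pandas API on Spark. In pandas, GroupBy.first returns the first non-null value in each group. Before this change, pandas API on Spark returned the first value even when it was null, as group 2 shows in the session below: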

In [1]: 
   ...: import pandas as pd
   ...: import numpy as np
   ...: import pyspark.pandas as ps
   ...: 
   ...: pdf = pd.DataFrame({"A": [1, 2, 1, 2], "B": [-1.5, np.nan, -3.2, 0.1]})
   ...: psdf = ps.from_pandas(pdf)
   ...: 

In [2]: pdf.groupby("A").first()
Out[2]: 
     B
A     
1 -1.5
2  0.1

In [3]: psdf.groupby("A").first()
Out[3]:
     B
A     
1 -1.5
2  NaN
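
The two results above can be reproduced with pandas alone, which may help when reasoning about the expected semantics. This is only an illustrative sketch, not the code in this patch: the names `skipping` and `positional` are made up here, and the positional aggregation merely mimics what the pre-fix pandas-on-Spark output looked like.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"A": [1, 2, 1, 2], "B": [-1.5, np.nan, -3.2, 0.1]})

# pandas semantics: first() returns the first non-null value in each group.
skipping = pdf.groupby("A")["B"].first()

# Hypothetical stand-in for the old behavior: take the first row of each
# group by position, whether or not it is null.
positional = pdf.groupby("A")["B"].agg(lambda s: s.iloc[0])

print(skipping.to_dict())
print(positional.to_dict())
```

For group 2, `skipping` holds 0.1 while `positional` holds NaN, which is the difference this PR removes.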

Does this PR introduce any user-facing change?

Yes. After this change, GroupBy.first skips NULLs, matching pandas.

How was this patch tested?

Added unit tests.

@itholic (Contributor) left a comment

Good catch!

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@HyukjinKwon (Member)

Merged to master.

@zhengruifeng (Contributor, Author)

Thank you @HyukjinKwon @dongjoon-hyun @itholic

@zhengruifeng zhengruifeng deleted the ps_first_skip_na branch September 28, 2022 01:26