Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41066][CONNECT][PYTHON] Implement DataFrame.sampleBy and DataFrame.stat.sampleBy #39328

Closed
wants to merge 1 commit into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Implement DataFrame.sampleBy and DataFrame.stat.sampleBy

Why are the changes needed?

For API coverage

Does this PR introduce any user-facing change?

yes

How was this patch tested?

added UT

@zhengruifeng
Copy link
Contributor Author

cc @HyukjinKwon @beliefer

@@ -546,6 +547,34 @@ message StatFreqItems {
optional double support = 3;
}


// Returns a stratified sample without replacement based on the fraction
// given on each stratum.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will invoke 'Dataset.stat.sampleBy' (same as 'StatFunctions.sampleBy') to compute the results. should be added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to keep the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, will update

repeated Fraction fractions = 3;

// (Optional) The random seed.
optional int64 seed = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I want to keep in line with other method which generate a random seed in server if not provided in the proto

@@ -419,6 +421,26 @@ class SparkConnectPlanner(session: SparkSession) {
}
}

private def transformStatSampleBy(rel: proto.StatSampleBy): LogicalPlan = {
val fractions = mutable.Map.empty[Any, Double]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

val fractions = rel.getFractionsList.asScala.toSeq.map { protoFraction =>
...
}

def sampleBy(
self, col: "ColumnOrName", fractions: Dict[Any, float], seed: Optional[int] = None
) -> "DataFrame":
if not isinstance(col, (Column, str)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is changed from pyspark sql

if isinstance(col, str):
    col = Column(col)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no behavior change, since the underlying plan can accept a ColumnOrName

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

for k, v in fractions.items():
assert v is not None and isinstance(v, float)

assert seed is None or isinstance(seed, int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need add these check here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer adding assertion in the plan layer to make sure all parameters are expected

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can let server side require that the client must pass seed.

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the connect_df_sampleby branch January 2, 2023 02:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants