[SPARK-41066][CONNECT][PYTHON] Implement DataFrame.sampleBy and DataFrame.stat.sampleBy #39328
Conversation
@@ -546,6 +547,34 @@ message StatFreqItems {
  optional double support = 3;
}

// Returns a stratified sample without replacement based on the fraction
// given on each stratum.
"It will invoke 'Dataset.stat.sampleBy' (same as 'StatFunctions.sampleBy') to compute the results."
should be added.
It would be better to keep the comment
nice, will update
repeated Fraction fractions = 3;

// (Optional) The random seed.
optional int64 seed = 5;
It seems it should be required.
Here I want to keep in line with other methods, which generate a random seed on the server if one is not provided in the proto.
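A minimal pure-Python sketch of that convention, assuming the server simply fills in a random 63-bit seed when the client omits it (the helper name and seed range are assumptions, not Spark code):

```python
import random

def resolve_seed(seed=None):
    # Hypothetical illustration of the convention discussed above:
    # if the client omits the seed, the server generates one rather
    # than rejecting the request.
    if seed is not None:
        return seed
    return random.randint(0, 2**63 - 1)
```

A client-supplied seed passes through unchanged, so results stay reproducible when the caller wants them to be.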
@@ -419,6 +421,26 @@ class SparkConnectPlanner(session: SparkSession) {
  }
}

private def transformStatSampleBy(rel: proto.StatSampleBy): LogicalPlan = {
  val fractions = mutable.Map.empty[Any, Double]
How about
val fractions = rel.getFractionsList.asScala.toSeq.map { protoFraction =>
...
}
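In Python terms, the difference between the mutable-map accumulation and the suggested functional construction can be sketched like this (the fraction pairs are made-up stand-ins for the proto messages):

```python
# Hypothetical stand-ins for the proto Fraction messages: (stratum, fraction).
proto_fractions = [("a", 0.1), ("b", 0.5)]

# Mutable-accumulator style (what the original code does).
fractions_mut = {}
for stratum, frac in proto_fractions:
    fractions_mut[stratum] = frac

# Comprehension style (the spirit of the suggested rewrite):
# build the mapping in one expression, no mutation.
fractions = {stratum: frac for stratum, frac in proto_fractions}
```

Both produce the same mapping; the functional form just avoids an intermediate mutable variable.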
def sampleBy(
    self, col: "ColumnOrName", fractions: Dict[Any, float], seed: Optional[int] = None
) -> "DataFrame":
    if not isinstance(col, (Column, str)):
The behavior is changed from PySpark SQL, which does:

if isinstance(col, str):
    col = Column(col)
There is no behavior change, since the underlying plan can accept a ColumnOrName.
Got it.
for k, v in fractions.items():
    assert v is not None and isinstance(v, float)

assert seed is None or isinstance(seed, int)
Do we need to add these checks here?
I prefer adding assertions in the plan layer to make sure all parameters are as expected.
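As a standalone Python sketch, the plan-layer validation being discussed might look like this (the function name and error messages are hypothetical; the checks mirror the assertions shown above):

```python
from typing import Any, Dict, Optional

def validate_sample_by_args(fractions: Dict[Any, float], seed: Optional[int]) -> None:
    # Every fraction must be a real float (not None, not an int),
    # and the seed must be an int when it is given at all.
    for stratum, frac in fractions.items():
        if frac is None or not isinstance(frac, float):
            raise TypeError(f"fraction for stratum {stratum!r} must be a float, got {frac!r}")
    if seed is not None and not isinstance(seed, int):
        raise TypeError("seed must be an int or None")
```

Raising a typed error instead of a bare assert gives the caller an actionable message, at the cost of slightly more code.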
I think we can let the server side require that the client must pass a seed.
Merged to master.
What changes were proposed in this pull request?
Implement DataFrame.sampleBy and DataFrame.stat.sampleBy.

Why are the changes needed?
For API coverage.

Does this PR introduce any user-facing change?
Yes.

How was this patch tested?
Added unit tests.
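For context, the stratified-sampling semantics behind sampleBy (keep each row with the probability assigned to its stratum, without replacement) can be illustrated with a small pure-Python sketch. This is not Spark's implementation; the function name and row layout are made up for illustration:

```python
import random
from typing import Any, Dict, List, Optional, Sequence, Tuple

def sample_by(
    rows: Sequence[Tuple],
    key_index: int,
    fractions: Dict[Any, float],
    seed: Optional[int] = None,
) -> List[Tuple]:
    # Keep each row independently with the probability assigned to
    # its stratum; strata missing from `fractions` default to 0.0,
    # i.e. they are dropped entirely.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[key_index], 0.0)]
```

With a fraction of 1.0 every row of that stratum is kept, and with 0.0 (or an absent key) none are, regardless of the seed.

```python
rows = [(0, "x"), (1, "y"), (0, "z")]
sample_by(rows, 0, {0: 1.0}, seed=1)  # → [(0, "x"), (0, "z")]
```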