SPARK-1240: handle the case of empty RDD when takeSample #135

CodingCat · 2014-03-13T18:22:36Z

https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle an empty RDD when takeSample is run.

In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; in sample(), I also add a check for an invalid fraction value.

I also add several lines to the test case to cover this.
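The guard described above can be sketched without Spark. This is only an illustration under stated assumptions: `TakeSampleSketch` and its `takeSample` are hypothetical names, a local `Seq` stands in for the RDD, and a seeded shuffle stands in for the real sampling logic.

```scala
import scala.util.Random

object TakeSampleSketch {
  // Hypothetical local stand-in for takeSample: `data` plays the role of the RDD.
  def takeSample[T](data: Seq[T], num: Int, seed: Int): Seq[T] = {
    require(num >= 0, s"Negative number of elements requested: $num")
    if (num == 0 || data.isEmpty) {
      // Guard described in the patch: nothing to sample, so return an empty result.
      Seq.empty[T]
    } else {
      // Seeded shuffle, then keep at most `num` elements (sampling without replacement).
      new Random(seed).shuffle(data).take(num)
    }
  }
}
```

With this guard, `TakeSampleSketch.takeSample(Seq.empty[Int], 10, 42)` returns an empty sequence rather than reaching the sampling step.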

AmplabJenkins · 2014-03-13T18:40:24Z

Merged build triggered.

AmplabJenkins · 2014-03-13T18:40:24Z

Merged build started.

rxin · 2014-03-13T18:48:14Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -310,6 +310,9 @@ abstract class RDD[T: ClassTag](
   * Return a sampled subset of this RDD.
   */
  def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
+    if (fraction < Double.MinValue  || fraction > Double.MaxValue) {


Use require, i.e.:

require(fraction > Double.MinValue && fraction < Double.MaxValue, "...")

Shouldn't you just check for fraction > 0 but < 1?

The lower bound should be >= 0.0. Sampling with replacement can have a fraction greater than 1.0.

Hi @rxin, I'm also a bit confused here; I think the name of the argument is misleading.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L357

The line above applies a multiplier so that sampling returns enough sample points in most cases (I think), so the fraction value can actually be larger than 1.

Also, this value actually determines the mean of the Poisson/Bernoulli distribution:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L314
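To illustrate why an oversampling multiplier can push the fraction above 1, here is a minimal sketch. The formula `multiplier * (num + 1) / count` and the value 3.0 are assumptions chosen for illustration; neither is quoted in this thread, and `OversampleSketch` is a hypothetical name.

```scala
object OversampleSketch {
  // Hypothetical multiplier; the value used in RDD.scala is not quoted in this thread.
  val multiplier = 3.0

  // Assumed form of the oversampled fraction passed on to sample().
  def oversampledFraction(num: Int, count: Long): Double = {
    require(num >= 0, s"Negative sample size: $num")
    require(count > 0, "This sketch assumes a non-empty data set")
    multiplier * (num + 1) / count
  }
}
```

Under these assumptions, asking for 50 of 100 elements gives a fraction of about 1.53, so an upper bound of 1.0 on the fraction would reject a legitimate call. The `count > 0` requirement exists only in this sketch; with a count of zero the division would produce an infinite fraction.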

mengxr · 2014-03-13T19:09:33Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

@@ -310,6 +310,8 @@ abstract class RDD[T: ClassTag](
   * Return a sampled subset of this RDD.
   */
  def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
+    require(fraction >= 0 && fraction <= Double.MaxValue,


require(fraction >= 0.0) should be sufficient here.
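A small sketch of the difference between the two checks discussed in this review. In Scala, `Double.MinValue` is the most negative finite double, so a comparison against it does not reject ordinary negative fractions. `FractionCheckSketch`, `originalCheckRejects`, and `validate` are hypothetical names used only for this illustration.

```scala
object FractionCheckSketch {
  // Check from the first revision: true only for values outside the finite Double range.
  def originalCheckRejects(fraction: Double): Boolean =
    fraction < Double.MinValue || fraction > Double.MaxValue

  // Check suggested in review: throws IllegalArgumentException for negative (or NaN) input.
  def validate(fraction: Double): Double = {
    require(fraction >= 0.0, s"Invalid fraction value: $fraction")
    fraction
  }
}
```

Here `originalCheckRejects(-1.0)` is false, so a negative fraction would slip through the first revision, while `validate(-1.0)` throws. Both 0.0 and values above 1.0 pass `validate`.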

AmplabJenkins · 2014-03-13T19:37:53Z

Merged build finished.

AmplabJenkins · 2014-03-13T19:37:53Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13164/

AmplabJenkins · 2014-03-13T19:40:30Z

Merged build triggered.

AmplabJenkins · 2014-03-13T19:40:30Z

Merged build started.

mengxr · 2014-03-13T20:11:44Z

core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala

@@ -457,6 +457,10 @@ class RDDSuite extends FunSuite with SharedSparkContext {

  test("takeSample") {
    val data = sc.parallelize(1 to 100, 2)
+    val emptySet = data.mapPartitions { iter => Iterator.empty }


Let us create a separate test "takeSample from an empty rdd" and construct an empty rdd directly:

val emptyRdd = sc.parallelize(Seq.empty[Int], 2)

AmplabJenkins · 2014-03-13T20:38:02Z

Merged build finished.

AmplabJenkins · 2014-03-13T20:38:03Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13167/

AmplabJenkins · 2014-03-13T20:42:14Z

Merged build triggered.

AmplabJenkins · 2014-03-13T20:42:14Z

Merged build started.

mengxr · 2014-03-13T21:08:24Z

LGTM. Waiting for Jenkins.

AmplabJenkins · 2014-03-13T21:37:54Z

Merged build finished.

AmplabJenkins · 2014-03-13T21:37:54Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13169/

CodingCat · 2014-03-14T00:26:00Z

Ah, good, thank you very much for the comments @rxin @mengxr

mateiz · 2014-03-14T18:20:01Z

Can you check whether this is broken in Python too, and fix it there as well?

CodingCat · 2014-03-14T18:28:30Z

sure, will do that this evening~

AmplabJenkins · 2014-03-14T22:13:08Z

Merged build triggered.

AmplabJenkins · 2014-03-14T22:13:08Z

Merged build started.

AmplabJenkins · 2014-03-14T22:40:43Z

Merged build finished.

AmplabJenkins · 2014-03-14T22:40:44Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13186/

CodingCat · 2014-03-14T22:41:55Z

@mateiz, done~

mateiz · 2014-03-17T00:42:40Z

Actually sorry, one other thought -- instead of throwing an error when fraction == 0, just return the empty array there too. Some code might pass 0 for a valid reason (e.g. if the sampling rate is coming from a user, or whatever).

Anyway both this and the empty RDD case are good catches.

CodingCat · 2014-03-17T01:31:19Z

Hi @mateiz, I think the current implementation already allows the fraction == 0 case, or did I misunderstand something?

mateiz · 2014-03-17T05:17:48Z

Ah, thanks, I missed that. Merged this in.

https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty RDD case when run takeSample

In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value

In the test case, I also add several lines for this case

Author: CodingCat <zhunansjtu@gmail.com>

Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:

fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty red
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample

Conflicts:
core/src/main/scala/org/apache/spark/rdd/RDD.scala
core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala

Previous merge with upstream broke the build. Author: Herman van Hovell <hvanhovell@databricks.com> Closes apache#135 from hvanhovell/fix-hook-calling-external-catalog.

[ESPARK-135] Fix the problem where the temporary directory is still swapped even when no merge is needed, which leaves the final result empty. resolve apache#135 See merge request !82

(cherry picked from commit b27ea82)

* Support k8s E2E test against specified k8s version
  - Create install-k8s role
  - Support to specify version of k8s and etcd
  - Add new job cloud-provider-openstack-acceptance-test-e2e-conformance-latest-release
  - Skip to copy 0 size test_results.html
  - Add post.yaml in cloud-provider-openstack-acceptance-test-e2e-conformance as a placeholder to upload test result to testgrid, but upload_e2e.py can not be found, implement it in following patchset.

Partial-Bug: kubernetes/cloud-provider-openstack#103

* apache#135 bump jackson version to 2.10.4 * apache#135 update spark version r41 Co-authored-by: Yu Gan <yu.gan@kyligence.io>

Co-authored-by: Yu Gan <yu.gan@kyligence.io>

### What changes were proposed in this pull request?
The pr aims to upgrade commons-codec from 1.15 to 1.16.0.

### Why are the changes needed?
1. The new version brings some bug fixes, e.g.:
- Fix byte-skipping in Base16 decoding #135. Fixes CODEC-305.
- BaseNCodecOutputStream.eof() should not throw IOException.
- Add support for Blake3 family of hashes. Fixes [CODEC-296](https://issues.apache.org/jira/browse/CODEC-296).

2. The full release notes: https://commons.apache.org/proper/commons-codec/changes-report.html#a1.16.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #41707 from panbingkun/SPARK-44151.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

handle the case with empty RDD when take sample

ad483fd

rxin reviewed Mar 13, 2014

replace if with require

a40e8fb

mengxr reviewed Mar 13, 2014

further fix

810948d

mengxr reviewed Mar 13, 2014

create new test cases for takeSample from an empty red

36db06b

fix the same problem in PySpark

fef57d4

asfgit closed this in dc96546 Mar 17, 2014

CodingCat deleted the SPARK-1240 branch March 17, 2014 17:21

vanzin mentioned this pull request Jul 23, 2015

[SPARK-9270] [PySpark] allow --name option in pyspark #7610

Closed

jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018

Fixed "Single message comes late" (apache#135)

f338920

(cherry picked from commit b27ea82)

fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020

apache#135 bump jackson version to 2.10.4 (apache#136)

a26aa4b

* apache#135 bump jackson version to 2.10.4 * apache#135 update spark version r41 Co-authored-by: Yu Gan <yu.gan@kyligence.io>

fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020

apache#135 update spark version r37 (apache#141)

a2ce93f

Co-authored-by: Yu Gan <yu.gan@kyligence.io>

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

Fixed "Single message comes late" (apache#135)

448e758

microbearz pushed a commit to microbearz/spark that referenced this pull request Dec 15, 2020

apache#135 bump jackson version to 2.10.4 (apache#136)

6de2ac7

* apache#135 bump jackson version to 2.10.4 * apache#135 update spark version r41 Co-authored-by: Yu Gan <yu.gan@kyligence.io>

microbearz pushed a commit to microbearz/spark that referenced this pull request Dec 15, 2020

apache#135 update spark version r37 (apache#141)

058169b

Co-authored-by: Yu Gan <yu.gan@kyligence.io>

panbingkun mentioned this pull request Jun 22, 2023

[SPARK-44151][BUILD] Upgrade commons-codec to 1.16.0 #41707

Closed
