Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1240: handle the case of empty RDD when takeSample #135

Closed
wants to merge 5 commits into from

Conversation

CodingCat
Copy link
Contributor

https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty RDD case when run takeSample

In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value

In the test case, I also add several lines for this case

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@@ -310,6 +310,9 @@ abstract class RDD[T: ClassTag](
* Return a sampled subset of this RDD.
*/
def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
if (fraction < Double.MinValue || fraction > Double.MaxValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use require. i.e.

require(fraction > Double.MinValue && fraction < Double.MaxValue, "...")

Shouldn't you just check for fraction > 0 but < 1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lower bound should be >= 0.0. Sample with replacement can have a faction greater than 1.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @rxin , I'm also a bit confused here, I think the name of the argument is a bit confusing

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L357

The above line contains a multiplier to ensure that the sampling can return enough sample points in most of cases..(I think so), so the fraction value can actually be larger than 1

also, this value actually determines the mean value of Poisson/Bernoulli distribution

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L314

@@ -310,6 +310,8 @@ abstract class RDD[T: ClassTag](
* Return a sampled subset of this RDD.
*/
def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
require(fraction >= 0 && fraction <= Double.MaxValue,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require(fraction >= 0.0) should be sufficient here.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13164/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@@ -457,6 +457,10 @@ class RDDSuite extends FunSuite with SharedSparkContext {

test("takeSample") {
val data = sc.parallelize(1 to 100, 2)
val emptySet = data.mapPartitions { iter => Iterator.empty }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us create a separate test "takeSample from an empty rdd" and construct an empty rdd directly:

val emptyRdd = sc.parallelize(Seq.empty[Int], 2)

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13167/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@mengxr
Copy link
Contributor

mengxr commented Mar 13, 2014

LGTM. Waiting for Jenkins.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13169/

@CodingCat
Copy link
Contributor Author

Ah, good, thank you very much for the comments @rxin @mengxr

@mateiz
Copy link
Contributor

mateiz commented Mar 14, 2014

Can you check whether this is broken in Python too, and fix it there as well?

@CodingCat
Copy link
Contributor Author

sure, will do that this evening~

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13186/

@CodingCat
Copy link
Contributor Author

@mateiz, done~

@mateiz
Copy link
Contributor

mateiz commented Mar 17, 2014

Actually sorry, one other thought -- instead of throwing an error when fraction == 0, just return the empty array there too. Some code might pass 0 for a valid reason (e.g. if the sampling rate is coming from a user, or whatever).

Anyway both this and the empty RDD case are good catches.

@CodingCat
Copy link
Contributor Author

Hi, @mateiz , I think the current implementation allows fraction == 0 case, or I misunderstood something?

@mateiz
Copy link
Contributor

mateiz commented Mar 17, 2014

Ah, thanks, I missed that. Merged this in.

@asfgit asfgit closed this in dc96546 Mar 17, 2014
asfgit pushed a commit that referenced this pull request Mar 17, 2014
https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty RDD case when run takeSample

In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value

In the test case, I also add several lines for this case

Author: CodingCat <zhunansjtu@gmail.com>

Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:

fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty red
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
@CodingCat CodingCat deleted the SPARK-1240 branch March 17, 2014 17:21
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty RDD case when run takeSample

In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value

In the test case, I also add several lines for this case

Author: CodingCat <zhunansjtu@gmail.com>

Closes apache#135 from CodingCat/SPARK-1240 and squashes the following commits:

fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty red
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
ericl pushed a commit to ericl/spark that referenced this pull request Dec 21, 2016
Previous merge with upstream broke the build.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#135 from hvanhovell/fix-hook-calling-external-catalog.
cenyuhai added a commit to cenyuhai/spark that referenced this pull request Nov 14, 2017
[ESPARK-135] 解决当不需要合并的时候仍然更换临时目录,导致最终结束为空的问题

解决当不需要合并的时候仍然更换临时目录,导致最终结束为空的问题 . 
resolve apache#135

See merge request !82
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
* Support k8s E2E test against specified k8s version

- Create install-k8s role
- Support to specify version of k8s and etcd
- Add new job
cloud-provider-openstack-acceptance-test-e2e-conformance-latest-release
- Skip to copy 0 size test_results.html
- Add post.yaml in
cloud-provider-openstack-acceptance-test-e2e-conformance
as a placeholder to upload test result to testgrid, but
upload_e2e.py can not be found, implement it in following
patchset.

Partial-Bug: kubernetes/cloud-provider-openstack#103
fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020
* apache#135 bump jackson version to 2.10.4

* apache#135 update spark version r41

Co-authored-by: Yu Gan <yu.gan@kyligence.io>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020
Co-authored-by: Yu Gan <yu.gan@kyligence.io>
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
microbearz pushed a commit to microbearz/spark that referenced this pull request Dec 15, 2020
* apache#135 bump jackson version to 2.10.4

* apache#135 update spark version r41

Co-authored-by: Yu Gan <yu.gan@kyligence.io>
microbearz pushed a commit to microbearz/spark that referenced this pull request Dec 15, 2020
Co-authored-by: Yu Gan <yu.gan@kyligence.io>
dongjoon-hyun pushed a commit that referenced this pull request Jun 23, 2023
### What changes were proposed in this pull request?
The pr aims to upgrade commons-codec from 1.15 to 1.16.0.

### Why are the changes needed?
1.The new version brings some bug fixed, eg:
- Fix byte-skipping in Base16 decoding #135. Fixes CODEC-305.
- BaseNCodecOutputStream.eof() should not throw IOException.
- Add support for Blake3 family of hashes. Fixes [CODEC-296](https://issues.apache.org/jira/browse/CODEC-296).

2.The full release notes:
https://commons.apache.org/proper/commons-codec/changes-report.html#a1.16.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #41707 from panbingkun/SPARK-44151.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants