[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource #20963
Conversation
Test build #88828 has finished for PR 20963 at commit
}
writer.close()

val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
@MaxGekk, wouldn't this test be flaky?
It could be if the partitionIndex is flaky here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala, line 320 (commit 2ce37b5):
$v.setSeed(${seed}L + partitionIndex);
Yes. The seed is also given.
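For context, a small standalone sketch (not the PR's code) of why a fixed seed makes the sample deterministic only while the partition layout is stable: each partition's generator is seeded with seed + partitionIndex, as in the line referenced above.

```scala
import org.apache.spark.sql.SparkSession

object SeededSamplingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("seeded-sampling-sketch").getOrCreate()
    val seed = 0L

    val sampled = spark.sparkContext
      .parallelize(0 until 100, numSlices = 2)
      .mapPartitionsWithIndex { (partitionIndex, rows) =>
        // Mirrors setSeed(seed + partitionIndex): the same rows come back on
        // every run as long as the seed and the partitioning do not change.
        val rng = new java.util.Random(seed + partitionIndex)
        rows.filter(_ => rng.nextDouble() < 0.1)
      }

    println(sampled.collect().mkString(", "))
    spark.stop()
  }
}
```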
OK, but I don't think we are guaranteed to always have one partition here. Shall we at least explicitly set spark.sql.files.maxPartitionBytes big enough, with some comments?
I think we shouldn't encourage this approach because it is likely easy to break, IMHO. I am fine with it anyway, as I can't think of a better way on the other hand.
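A minimal sketch of what pinning that config could look like, assuming a suite that mixes in Spark's SQLTestUtils (as JsonSuite does) so that withSQLConf, spark, and path are in scope; the exact value is illustrative:

```scala
import org.apache.spark.sql.internal.SQLConf

// Pin the split size well above the generated file's size so the whole file
// should land in a single partition, keeping the seeded sample stable.
withSQLConf(SQLConf.FILES_MAX_PARTITION_BYTES.key -> (128 * 1024 * 1024).toString) {
  val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
}
```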
It seems specifying only spark.sql.files.maxPartitionBytes is not enough. Please look at the formula for slicing input files:
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
Is it ok if I just check that the file size is less than maxSplitBytes?
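A hypothetical helper (not part of the PR) that mirrors the formula above, which a test could use to assert that its input file fits into a single split; the parameters correspond to spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes, the total input size, and the default parallelism:

```scala
// Mirrors how Spark derives the maximum split size for non-bucketed file scans.
def maxSplitBytes(
    defaultMaxSplitBytes: Long,  // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long,       // spark.sql.files.openCostInBytes
    totalBytes: Long,            // sum of file lengths (plus the open cost per file)
    defaultParallelism: Int): Long = {
  val bytesPerCore = totalBytes / defaultParallelism
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
}

// The check proposed above: the generated JSON file must be smaller than the
// computed split size, otherwise it may be sliced across several partitions.
// assert(jsonFile.length() <= maxSplitBytes(...))
```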
yup, please set the appropriate numbers. I think it's fine if it has some comments so that we can read and fix the tests if they get broken.
^ It's based upon actual experience. There was a similar case where a test was broken due to the number of partitions, and it took me a while to debug it: https://issues.apache.org/jira/browse/SPARK-13728
test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") { | ||
val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46, | ||
57, 62, 68, 72) |
No need to have so many elements in this set. Please combine the tests in your CSV PR. Instead of calling json(), we can do it using format("json"). Then you can combine the test cases for both CSV and JSON.
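A rough sketch of that suggestion (names and data are illustrative, not from the PR): parameterize the test over the data source format so CSV and JSON share the same body. It assumes a suite with Spark's SQLTestUtils mixed in, so that test, withTempPath, and spark are available:

```scala
Seq("json", "csv").foreach { format =>
  test(s"samplingRatio touches less data during schema inference ($format)") {
    withTempPath { path =>
      // ... write the format-specific sample files under `path` here ...
      val inferred = spark.read
        .option("samplingRatio", 0.1)
        .format(format)
        .load(path.getCanonicalPath)
      // ... assert on inferred.schema ...
    }
  }
}
```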
LGTM. Thanks! Merged to master.
What changes were proposed in this pull request?
The proposed test checks that only a subset of the input dataset is touched during schema inference.
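A rough, self-contained sketch of that idea (not the PR's exact test; the row count, sample indexes, and field name are made up): rows expected to be hit by the seeded sample carry an integer value for f1, all other rows carry a string value, so the inferred type reveals whether inference looked beyond the sample.

```scala
import org.apache.spark.sql.SparkSession

object SamplingRatioSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sampling-ratio-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical indexes of the rows the sampler is expected to pick.
    val expectedSample = Set(2, 8, 15, 27, 30, 34, 35, 37, 44, 46)
    val rows = (0 until 100).map { i =>
      if (expectedSample.contains(i)) s"""{"f1":$i}"""   // integer-valued rows
      else s"""{"f1":"$i"}"""                            // string values would widen f1 to StringType
    }

    val dir = java.nio.file.Files.createTempDirectory("sampling").resolve("data").toString
    rows.toDS().coalesce(1).write.text(dir)

    // If inference touched every row, f1 would be inferred as StringType; with
    // sampling it stays LongType whenever only the integer-valued rows are read.
    val inferred = spark.read.option("samplingRatio", 0.1).json(dir).schema
    println(inferred)

    spark.stop()
  }
}
```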