[SPARK-23849][SQL] Tests for samplingRatio of json datasource #21056
Conversation
Test build #89285 has finished for PR 21056 at commit
Test build #89289 has finished for PR 21056 at commit
jenkins, retest this, please
Test build #89298 has finished for PR 21056 at commit
test("SPARK-23849: samplingRatio is out of the range (0, 1.0]") { | ||
val dstr = spark.sparkContext.parallelize(0 until 100, 1).map(_.toString).toDS() |
Can you just use spark.range?
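For reference, a minimal sketch of what the suggested rewrite might look like (keeping the single partition of the original parallelize call; this is an assumption, not code from the PR):

import spark.implicits._

// Hypothetical rewrite using spark.range: produces the values 0..99
// in one partition and converts them to strings.
val dstr = spark.range(0, 100, step = 1, numPartitions = 1).map(_.toString)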
Test build #89366 has finished for PR 21056 at commit
@rxin May I ask you to look at the PR again?
Test build #89674 has finished for PR 21056 at commit
Test build #89678 has finished for PR 21056 at commit
test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") { | ||
val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46, | ||
val sampledTestData = (value: java.lang.Long) => { |
@MaxGekk, can we have the data in TestJsonData, for example:

def sampledTestData: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    ...
  ))(Encoders.STRING)

and use it, for example, as sampledTestData.coalesce(1) in JsonSuite?
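A self-contained sketch of such a fixture, with placeholder JSON strings (the real sample data lives in the PR, not here):

import org.apache.spark.sql.{Dataset, Encoders}

// Hypothetical fixture for TestJsonData; the JSON lines are illustrative only.
def sampledTestData: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    Seq("""{"f1": 0}""", """{"f1": 1}""", """{"f1": 2}""")
  ))(Encoders.STRING)

// In JsonSuite it could then be consumed as a single partition:
val ds = sampledTestData.coalesce(1)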
test("SPARK-23849: sampling files for schema inferring in the multiLine mode") { | ||
withTempDir { dir => | ||
Files.write(Paths.get(dir.getAbsolutePath, "0.json"), """{"a":"a"}""".getBytes, |
Maybe getBytes with an explicit UTF-8 encoding?
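That would look roughly like this (a sketch; the trailing arguments of the original Files.write call are elided):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write the fixture file with an explicit UTF-8 charset instead of
// relying on the platform default encoding.
Files.write(Paths.get(dir.getAbsolutePath, "0.json"),
  """{"a":"a"}""".getBytes(StandardCharsets.UTF_8))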
Test build #89692 has finished for PR 21056 at commit
python/pyspark/sql/readwriter.py (outdated)
    including tab and line feed characters) or not.
:param lineSep: defines the line separator that should be used for parsing. If None is
    set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
:param samplingRatio: defines fraction of rows (when ``multiLine`` is ``false``) or fraction
@MaxGekk, can we just not say it's for files when multiLine is enabled? At a high level, I think JSON sampling applies to each JSON document (or an array of JSONs) regardless of whether there is one record per file or multiple records in a file.
Would it be OK if I write something like:

samplingRatio defines the fraction of input json objects used for schema inferring

Please give me your variants if that doesn't work for you.
+1 (json -> JSON)
LGTM otherwise
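For context, a small usage sketch of the option under discussion (the paths and the ratio value are illustrative):

// multiLine disabled: samplingRatio is the fraction of input rows
// (one JSON object per line) used for schema inference.
val lineSchema = spark.read
  .option("samplingRatio", 0.1)
  .json("/path/to/data.jsonl")
  .schema

// multiLine enabled: sampling applies per JSON document, which in this
// mode corresponds to whole input files.
val docSchema = spark.read
  .option("multiLine", true)
  .option("samplingRatio", 0.1)
  .json("/path/to/json-dir")
  .schema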
// The test uses the internal method because the public API cannot guarantee the order of
// files passed to the infer method. The order changes between runs because the temporary
// folder has a different path, which leads to a different order of file statuses returned
> the temporary folder has different path which leads to different order of file statuses

Can we just read dir since we explicitly wrote the files above?
Do you mean to pass a dir into the infer() method instead of a sequence of files?
Maybe I missed something, but if ordering by different paths was the problem, that's fixed by the explicit file names above. Then I thought we could just use spark.read?
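If I read the suggestion right, a sketch of the test body going through the public API instead (assuming the files written into dir above) would be roughly:

// Hypothetical rewrite: infer the schema via spark.read over the whole
// temp directory instead of calling the internal infer() method.
val inferredSchema = spark.read
  .option("multiLine", true)
  .option("samplingRatio", 0.1)
  .json(dir.getAbsolutePath)
  .schema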
Ah, do you mean it's not guaranteed whether we get the file statuses in that order or not via, for example, the Hadoop API? I took a look at this before and found it's designed to be in alphabetical order, although we shouldn't rely on this order if I remember correctly.
Even if so, we can't guarantee RDD[PortableDataStream] is a single partition within MultiLineJsonDataSource.infer. @MaxGekk, it's okay if it's difficult to write a test. A manual test and updating the PR description is a-okay.
> Ah, do you mean it's not guaranteed whether we get the file statuses in that order or not via, for example, the Hadoop API?

Yes, I am not sure we can guarantee that, but frankly speaking I am not sure that a sequence of files can guarantee that either. ;-)

> I took a look at this before and found it's designed to be in alphabetical order, although we shouldn't rely on this order if I remember correctly.

Probably you are right. I ran the test 100 times by passing a dir, successfully, but that says nothing. In another environment it could be flaky.

> Even if so, we can't guarantee RDD[PortableDataStream] is a single partition within MultiLineJsonDataSource.infer.

Right. On the other hand, it is pretty bad that we cannot control the number of partitions and their sizes. Let's imagine the sampled input is big enough and isn't evenly distributed across partitions: most of the schema inferring job would then be performed by one task. Maybe it is better to repartition the sampled RDD/Dataset before doing schema inferring (see the sketch at the end of this thread). What do you think of it?

> it's okay if it's difficult to write a test. A manual test and updating the PR description is a-okay.

Would you propose to delete the test?
Yup. I think it's fine to delete.
For repartitioning in schema inference, I don't feel strongly. Let's monitor JIRAs and the mailing list and see if there are requests for it. It sounds like too detailed a control to me for now.
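A rough sketch of the repartitioning idea discussed above (the helper name and the parallelism source are assumptions, not the actual MultiLineJsonDataSource code):

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical helper: sample the input and rebalance it across tasks so
// schema inference is not skewed onto a single partition.
def sampleForInference(spark: SparkSession, ds: Dataset[String],
    samplingRatio: Double): Dataset[String] = {
  val sampled =
    if (samplingRatio < 1.0) {
      ds.sample(withReplacement = false, fraction = samplingRatio, seed = 1)
    } else {
      ds
    }
  sampled.repartition(spark.sparkContext.defaultParallelism)
}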
Test build #89861 has finished for PR 21056 at commit
Merged to master.
What changes were proposed in this pull request?

Added the samplingRatio option to the json() method of the PySpark DataFrame reader. Improved existing tests for the Scala API according to the review of PR #20959.

How was this patch tested?

Added a new test for PySpark, updated 2 existing tests according to reviews of #20959, and added a new negative test.
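A hedged sketch of what the negative test checks, in the style of the suite's ScalaTest code (path is a placeholder and the exact exception type is an assumption):

// Hypothetical negative test: a samplingRatio outside (0, 1.0] should be
// rejected when schema inference runs, i.e. when json() is called.
val e = intercept[IllegalArgumentException] {
  spark.read.option("samplingRatio", -1.0).json(path)
}
assert(e.getMessage.contains("samplingRatio"))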