Skip to content

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Apr 12, 2018

What changes were proposed in this pull request?

Added the samplingRatio option to the json() method of PySpark DataFrame Reader. Improving existing tests for Scala API according to review of the PR: #20959

How was this patch tested?

Added new test for PySpark, updated 2 existing tests according to reviews of #20959 and added new negative test

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89285 has finished for PR 21056 at commit a31b2e2.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89289 has finished for PR 21056 at commit 6aa36d4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 12, 2018

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Apr 13, 2018

Test build #89298 has finished for PR 21056 at commit 6aa36d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test("SPARK-23849: samplingRatio is out of the range (0, 1.0]") {
val dstr = spark.sparkContext.parallelize(0 until 100, 1).map(_.toString).toDS()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just use spark.range?

@SparkQA
Copy link

SparkQA commented Apr 14, 2018

Test build #89366 has finished for PR 21056 at commit 5c1d2b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Apr 18, 2018

@rxin May I ask you to look at the PR again.

@SparkQA
Copy link

SparkQA commented Apr 21, 2018

Test build #89674 has finished for PR 21056 at commit f96134c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 21, 2018

Test build #89678 has finished for PR 21056 at commit fdeac84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
val sampledTestData = (value: java.lang.Long) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, can we have the data in TestJsonData, for example,

def sampledTestData: Dataset[String] =
    spark.createDataset(spark.sparkContext.parallelize(
      ...
    )(Encoders.STRING)

and use it, for example, sampledTestData.coalesce(1) in JsonSuite?


test("SPARK-23849: sampling files for schema inferring in the multiLine mode") {
withTempDir { dir =>
Files.write(Paths.get(dir.getAbsolutePath, "0.json"), """{"a":"a"}""".getBytes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe getBytes with explicit encoding UTF-8.

@SparkQA
Copy link

SparkQA commented Apr 23, 2018

Test build #89692 has finished for PR 21056 at commit 1b86df3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

including tab and line feed characters) or not.
:param lineSep: defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
:param samplingRatio: defines fraction of rows (when ``multiLine`` is ``false``) or fraction
Copy link
Member

@HyukjinKwon HyukjinKwon Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, can we just don't say it's for files when multiLine is enabled? I think JSON makes a sample for each JSON (or an array of JSONs) regardless of record per file or multiple records in a file at high level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be ok if I write something like:

samplingRation defines fraction of input json objects used for schema inferring

Please, give me your variants if it doesn't work for you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (json -> JSON)

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

}
// The test uses the internal method because public API cannot guarantee order of files
// passed to the infer method. The order is changed between runs because the temporary
// folder has different path which leads to different order of file statuses returned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the temporary folder has different path which leads to different order of file statuses

Can we just read dir since we explicitly wrote the files above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to pass a dir into the infer() method instead of sequence of files?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed something but if the order by different paths were problem, that's fixed by explicit file name above. Then, I thought we could just use spark.read?

Copy link
Member

@HyukjinKwon HyukjinKwon Apr 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, do you mean it's not guaranteed if we get the file statues in such order or not, via, for example Hadoop API? I took a look about this before and found it's designed to be in an alphabetical order although we shouldn't rely on this order if I remember this correctly.

Even if so, then we can't gurantee RDD[PortableDataStream] is a single partition within MultiLineJsonDataSource.infer .. @MaxGekk, it's okay if it's difficult to write a test. Manual test and updating PR description is a-okay.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, do you mean it's not guaranteed if we get the file statues in such order or not, via, for example Hadoop API?

Yes, I am not sure we can guarantee that but frankly speaking I am not sure that sequence of files can guarantee that too. ;-)

I took a look about this before and found it's designed to be in an alphabetical order although we shouldn't rely on this order if I remember this correctly.

Probably you are right. I ran the test 100 times by passing dir successfully but it says nothing. In another environment it can be flaky.

Even if so, then we can't gurantee RDD[PortableDataStream] is a single partition within MultiLineJsonDataSource.infer.

Right. From another hand it is pretty bad that we cannot control number of partitions and its sizes. Let's image sampled input is big enough and isn't evenly distributed across partitions. We will have the situation when most part of schema inferring job is performed by one task. Maybe it is better to repartition sampled RDD/Dataset before doing schema inferring. What do you think of it?

it's okay if it's difficult to write a test. Manual test and updating PR description is a-okay.

Would you propose to delete the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. I think it's fine to delete.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for repartition in schena inference, I dont feel strongly. lets monitor JIRAs and mailing list and see if there are some requests for it. it sounds too detailed control to me for now.

@SparkQA
Copy link

SparkQA commented Apr 26, 2018

Test build #89861 has finished for PR 21056 at commit 2bd14e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@asfgit asfgit closed this in 3f1e999 Apr 26, 2018
@MaxGekk MaxGekk deleted the json-sampling branch August 17, 2019 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants