Conversation

@MaxGekk (Member) commented Apr 2, 2018

What changes were proposed in this pull request?

The proposed tests check that only a subset of the input dataset is touched during schema inference.
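
For context, a minimal sketch of the option under test (the path is illustrative): with samplingRatio < 1.0, schema inference should read only a sample of the input rows rather than the whole dataset.

    val ds = spark.read
      .option("samplingRatio", 0.1)  // infer the schema from ~10% of rows
      .json("/path/to/input.json")   // illustrative path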

@SparkQA commented Apr 2, 2018

Test build #88828 has finished for PR 20963 at commit a664465.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
writer.close()

val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
Member commented:

@MaxGekk, wouldn't this test be flaky?

Member Author commented:

It could be if the partitionIndex were unstable here, but in both tests there is only one partition with the stable index 0.

Member commented:

Yes. The seed is also given.
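
For readers following the thread, a minimal sketch of the determinism argument (assumption: Dataset.sample draws rows per partition with an RNG derived from the given seed and the partition index, so a fixed seed plus a single partition yields a stable sample):

    import spark.implicits._

    // 100 single-column rows forced into one partition; with seed = 1 the
    // sampled subset is the same on every run.
    val lines = spark.range(0, 100, 1, 1).map(_.toString)
    val sampled = lines.sample(withReplacement = false, fraction = 0.1, seed = 1)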

Member commented:

OK, but I don't think we are guaranteed to always have one partition here. Shall we at least explicitly set spark.sql.files.maxPartitionBytes big enough, with some comments?

I think we shouldn't encourage this approach because it is likely to be easily broken, IMHO. I am fine with it anyway, as I can't think of a better way on the other hand.
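
A hedged sketch of that suggestion, assuming the suite mixes in Spark's SQLTestUtils (which provides withSQLConf); the exact values are illustrative:

    // Make the split size comfortably larger than the input file so the
    // whole file lands in a single partition.
    withSQLConf(
        "spark.sql.files.maxPartitionBytes" -> (128 * 1024 * 1024).toString,
        "spark.sql.files.openCostInBytes" -> (1024 * 1024).toString) {
      val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
    }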

Member Author commented:

It seems specifying only spark.sql.files.maxPartitionBytes is not enough. Please look at the formula for slicing input files:

val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

Is it OK if I just check that the file size is less than maxSplitBytes?
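
For reference, a self-contained sketch of that formula (paraphrased from Spark's FileSourceScanExec of that era; the real inputs come from SQLConf and the file listing, so the parameter names here are illustrative):

    // A file is read as a single partition only if its length <= maxSplitBytes.
    def maxSplitBytes(
        defaultMaxSplitBytes: Long, // spark.sql.files.maxPartitionBytes
        openCostInBytes: Long,      // spark.sql.files.openCostInBytes
        totalBytes: Long,           // sum of file lengths plus open costs
        defaultParallelism: Int): Long = {
      val bytesPerCore = totalBytes / defaultParallelism
      math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
    }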

Member commented:

Yup, please set the appropriate numbers. I think it's fine if it has some comments, so that we can read and fix the tests if they break.

Member commented:

^ It's based on actual experience. There was a similar case where a test was broken due to the number of partitions, and it took me a while to debug it: https://issues.apache.org/jira/browse/SPARK-13728


test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") {
val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46,
57, 62, 68, 72)
Member commented:

No need to have so many elements in this set. Please combine the tests in your CSV PR.

Instead of calling json(), we can do it using format("json"). Then you can combine the test cases for both CSV and JSON.
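
A sketch of the suggested refactoring (illustrative only; the assertion body is elided):

    // Parameterize the test over the data source name so CSV and JSON
    // share one test body.
    Seq("json", "csv").foreach { format =>
      val ds = spark.read
        .option("samplingRatio", 0.1)
        .format(format)
        .load(path.getCanonicalPath)
      // ... assert that only the sampled subset was touched ...
    }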

@gatorsmile (Member) commented:

LGTM. Thanks! Merged to master.

@asfgit closed this in 6a73457 on Apr 8, 2018
@MaxGekk deleted the json-sampling-tests branch on August 17, 2019