[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource #20963
Conversation
Test build #88828 has finished for PR 20963 at commit
}
writer.close()

val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
@MaxGekk, wouldn't this test be flaky?
It could be if the partitionIndex is flaky here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala, line 320 (commit 2ce37b5):
$v.setSeed(${seed}L + partitionIndex);
Yes. The seed is also given.
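For context, a small standalone sketch (not the PR's code) of why a fixed seed makes the sample deterministic only while the partition layout is stable: each partition's generator is seeded with seed + partitionIndex, as in the line referenced above.

```scala
import org.apache.spark.sql.SparkSession

object SeededSamplingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("seeded-sampling-sketch").getOrCreate()
    val seed = 0L

    val sampled = spark.sparkContext
      .parallelize(0 until 100, numSlices = 2)
      .mapPartitionsWithIndex { (partitionIndex, rows) =>
        // Mirrors setSeed(seed + partitionIndex): the same rows come back on
        // every run as long as the seed and the partitioning do not change.
        val rng = new java.util.Random(seed + partitionIndex)
        rows.filter(_ => rng.nextDouble() < 0.1)
      }

    println(sampled.collect().mkString(", "))
    spark.stop()
  }
}
```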
OK, but I don't think we are guaranteed to always have one partition here. Shall we at least explicitly set spark.sql.files.maxPartitionBytes big enough, with some comments?
I think we shouldn't encourage this approach because it is likely easy to break, IMHO. I am fine with it anyway, as I can't think of a better way on the other hand.
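A minimal sketch of what pinning that config could look like, assuming a suite that mixes in Spark's SQLTestUtils (as JsonSuite does) so that withSQLConf, spark, and path are in scope; the exact value is illustrative:

```scala
import org.apache.spark.sql.internal.SQLConf

// Pin the split size well above the generated file's size so the whole file
// should land in a single partition, keeping the seeded sample stable.
withSQLConf(SQLConf.FILES_MAX_PARTITION_BYTES.key -> (128 * 1024 * 1024).toString) {
  val ds = spark.read.option("samplingRatio", 0.1).json(path.getCanonicalPath)
}
```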
It seems specifying only spark.sql.files.maxPartitionBytes is not enough. Please look at the formula for slicing input files:
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
Is it ok if I just check that the file size is less than maxSplitBytes?
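A hypothetical helper (not part of the PR) that mirrors the formula above, which a test could use to assert that its input file fits into a single split; the parameters correspond to spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes, the total input size, and the default parallelism:

```scala
// Mirrors how Spark derives the maximum split size for non-bucketed file scans.
def maxSplitBytes(
    defaultMaxSplitBytes: Long,  // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long,       // spark.sql.files.openCostInBytes
    totalBytes: Long,            // sum of file lengths (plus the open cost per file)
    defaultParallelism: Int): Long = {
  val bytesPerCore = totalBytes / defaultParallelism
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
}

// The check proposed above: the generated JSON file must be smaller than the
// computed split size, otherwise it may be sliced across several partitions.
// assert(jsonFile.length() <= maxSplitBytes(...))
```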
yup, please set the appropriate numbers. I think it's fine if it has some comments so that we can read and fix the tests if they get broken.
^ It's based upon actual experience. There was a similar case where a test was broken due to the number of partitions, and it took me a while to debug it: https://issues.apache.org/jira/browse/SPARK-13728
test("SPARK-23849: schema inferring touches less data if samplingRation < 1.0") { | ||
val predefinedSample = Set[Int](2, 8, 15, 27, 30, 34, 35, 37, 44, 46, | ||
57, 62, 68, 72) |
No need to have so many elements in this set. Please combine the tests in your CSV PR. Instead of calling json(), we can do it using format("json"). Then you can combine the test cases for both CSV and JSON.
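A rough sketch of that suggestion (names and data are illustrative, not from the PR): parameterize the test over the data source format so CSV and JSON share the same body. It assumes a suite with Spark's SQLTestUtils mixed in, so that test, withTempPath, and spark are available:

```scala
Seq("json", "csv").foreach { format =>
  test(s"samplingRatio touches less data during schema inference ($format)") {
    withTempPath { path =>
      // ... write the format-specific sample files under `path` here ...
      val inferred = spark.read
        .option("samplingRatio", 0.1)
        .format(format)
        .load(path.getCanonicalPath)
      // ... assert on inferred.schema ...
    }
  }
}
```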
LGTM. Thanks! Merged to master.
What changes were proposed in this pull request?
The proposed test checks that only a subset of the input dataset is touched during schema inference.
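A rough, self-contained sketch of that idea (not the PR's exact test; the row count, sample indexes, and field name are made up): rows expected to be hit by the seeded sample carry an integer value for f1, all other rows carry a string value, so the inferred type reveals whether inference looked beyond the sample.

```scala
import org.apache.spark.sql.SparkSession

object SamplingRatioSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sampling-ratio-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical indexes of the rows the sampler is expected to pick.
    val expectedSample = Set(2, 8, 15, 27, 30, 34, 35, 37, 44, 46)
    val rows = (0 until 100).map { i =>
      if (expectedSample.contains(i)) s"""{"f1":$i}"""   // integer-valued rows
      else s"""{"f1":"$i"}"""                            // string values would widen f1 to StringType
    }

    val dir = java.nio.file.Files.createTempDirectory("sampling").resolve("data").toString
    rows.toDS().coalesce(1).write.text(dir)

    // If inference touched every row, f1 would be inferred as StringType; with
    // sampling it stays LongType whenever only the integer-valued rows are read.
    val inferred = spark.read.option("samplingRatio", 0.1).json(dir).schema
    println(inferred)

    spark.stop()
  }
}
```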