[SPARK-32025][SQL] Csv schema inference problems with different types in the same column #28896
If StringType is not the last type in every type inference chain, this could break existing users' apps, I guess. Couldn't it?
@MaxGekk Yes, StringType is the last type in all inference chains. I have changed the description to show this more clearly.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
Does the JSON source have the same problem?
@HyukjinKwon It's exactly as you say: it only happens when the incompatibility is inside one partition. I will change the PR to use compatibleType and run some performance tests. Thanks for your help! @cloud-fan I tested the same situation with JSON and it works fine; we don't have the problem there.
I have run some performance tests on my local machine.
Results without changes:
Results after changes:
The impact does not seem very significant, but local tests are not the best way to be sure.
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
ok to test
Test build #124510 has finished for PR 28896 at commit
retest this please
Test build #124512 has finished for PR 28896 at commit
Merged to master.
Is it possible to document the type inference rules, as traditional databases do? For example: https://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008477.html
@gatorsmile I will take a look.
Thanks @huaxingao. If you can't do it for any reason, let me know.
What changes were proposed in this pull request?
This pull request fixes a bug in CSV type inference that occurs when a single column contains values of different types.
Previously:
Now:
Previously, the type inference hierarchy was the following:
So when, for example, a column contains integers and its last element is a boolean, the whole column is incorrectly inferred as a boolean column, and all the numbers are shown as null when you view the data.
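The failure mode described above can be reproduced with a small, simplified model. This is only a sketch and not Spark's actual code: the chain `integer -> double -> boolean -> string` and the names `infer_field_old` and `infer_column_old` are assumptions made for illustration. The point it shows is that starting the parse attempt from the type seen so far, and only ever moving down the chain, lets a trailing boolean overwrite an integer column.

```python
# Simplified model of the pre-fix behaviour (illustrative, not Spark's code).
# Assumed inference chain: integer -> double -> boolean -> string.

def _is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def _is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

CHAIN = ["integer", "double", "boolean", "string"]
PARSERS = {
    "integer": _is_int,
    "double": _is_float,
    "boolean": lambda s: s.lower() in ("true", "false"),
    "string": lambda s: True,
}

def infer_field_old(type_so_far, field):
    """Start at the type seen so far and walk down the chain
    until some parser accepts the field."""
    start = 0 if type_so_far is None else CHAIN.index(type_so_far)
    for t in CHAIN[start:]:
        if PARSERS[t](field):
            return t
    return "string"

def infer_column_old(values):
    """Fold the per-field inference over all values of one column."""
    t = None
    for v in values:
        t = infer_field_old(t, v)
    return t

# Integers followed by a boolean: the column ends up typed as boolean,
# so the integers can no longer be parsed and would show up as null.
print(infer_column_old(["1", "2", "true"]))  # boolean
```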
We need the following hierarchy. With it, a column that mixes different numeric types is resolved to a numeric type that fits all of its values, and a column that mixes any other types is resolved as a String type column.
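A minimal sketch of the intended rule, under the same simplifying assumptions (only integer, double, boolean and string are modelled). The function `compatible_type` is a hypothetical stand-in named after the `compatibleType` helper mentioned in the review thread; it is not that helper's real implementation, and Spark handles more types than shown here. Each field is inferred on its own and then merged with the type seen so far: numeric types widen, and any other mismatch falls back to string.

```python
# Simplified model of the intended behaviour (illustrative, not Spark's code).

def _parses(convert, s):
    try:
        convert(s)
        return True
    except ValueError:
        return False

def infer_single(field):
    """Infer the narrowest type for one field, ignoring earlier rows."""
    if _parses(int, field):
        return "integer"
    if _parses(float, field):
        return "double"
    if field.lower() in ("true", "false"):
        return "boolean"
    return "string"

# Widening order among the numeric types modelled here (an assumption).
NUMERIC_RANK = {"integer": 0, "double": 1}

def compatible_type(a, b):
    """Hypothetical merge rule: same type stays, numerics widen,
    anything else becomes string."""
    if a == b:
        return a
    if a in NUMERIC_RANK and b in NUMERIC_RANK:
        return a if NUMERIC_RANK[a] >= NUMERIC_RANK[b] else b
    return "string"

def infer_column_new(values):
    """Merge the independently inferred type of each field."""
    t = None
    for v in values:
        inferred = infer_single(v)
        t = inferred if t is None else compatible_type(t, inferred)
    return t

print(infer_column_new(["1", "2", "true"]))  # string
print(infer_column_new(["1", "2.5"]))        # double
print(infer_column_new(["true", "false"]))   # boolean
```

Because string is the fallback of every merge, no value in a mixed column is lost: it is simply read as text.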
Why are the changes needed?
To fix the bug described above.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests and manual tests.