[SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities #15751
override def checkInputDataTypes(): TypeCheckResult = {
  if (StringType.acceptsType(child.dataType)) {
    try {
      JacksonUtils.verifySchema(schema)
In this case, we don't have to worry about the parsing mode, `mode`, because `from_json` produces `null` with the default parse mode, `FAILFAST`.
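As a rough illustration of the check in this hunk, the sketch below turns a schema-verification failure into a type-check result instead of letting it surface at execution time. Everything in it is a simplified stand-in defined locally (the `DataType` hierarchy, `TypeCheckResult`, the object name `FromJsonCheckSketch` and the error messages are assumptions for illustration), not Spark's actual classes.

```scala
// Simplified, hypothetical stand-ins for illustration; these are NOT Spark's classes.
object FromJsonCheckSketch {
  sealed trait DataType { def simpleString: String }
  case object StringType extends DataType { val simpleString = "string" }
  case object IntegerType extends DataType { val simpleString = "int" }
  case object CalendarIntervalType extends DataType { val simpleString = "calendarinterval" }
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType {
    def simpleString: String =
      fields.map(f => s"${f.name}:${f.dataType.simpleString}").mkString("struct<", ",", ">")
  }

  sealed trait TypeCheckResult
  case object TypeCheckSuccess extends TypeCheckResult
  final case class TypeCheckFailure(message: String) extends TypeCheckResult

  // Walks the schema and throws on the first type this sketch treats as unsupported.
  def verifySchema(schema: StructType): Unit =
    schema.fields.foreach(f => verifyType(f.name, f.dataType))

  private def verifyType(name: String, dataType: DataType): Unit = dataType match {
    case StringType | IntegerType => ()
    case st: StructType => verifySchema(st)
    case other =>
      throw new UnsupportedOperationException(
        s"Unable to convert column $name of type ${other.simpleString} to JSON.")
  }

  // Reports an unsupported schema as a check failure, before any row is parsed.
  def checkInputDataTypes(childType: DataType, schema: StructType): TypeCheckResult =
    if (childType == StringType) {
      try {
        verifySchema(schema)
        TypeCheckSuccess
      } catch {
        case e: UnsupportedOperationException => TypeCheckFailure(e.getMessage)
      }
    } else {
      TypeCheckFailure(s"Input must be a string, not ${childType.simpleString}.")
    }
}
```

With this shape, a schema containing `CalendarIntervalType` yields a `TypeCheckFailure` at check time, while a schema of supported types yields `TypeCheckSuccess`.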
// mode, it drops records only containing non-null values in unsupported types. We should use
// `requiredSchema` instead of the whole schema `dataSchema` here so as not to break the original
// behaviour.
verifySchema(requiredSchema)
Here, it only checks the projected columns so that the existing behaviour does not change (the other columns are already not checked when parsing).
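To make the difference between the two schemas concrete, here is a minimal sketch under simplified assumptions. The types, the `project` helper and the object name `ProjectionCheckSketch` are hypothetical and defined locally; only the names `requiredSchema` and `dataSchema` come from the diff.

```scala
// Hypothetical, self-contained model; not Spark's API.
object ProjectionCheckSketch {
  sealed trait DataType { def simpleString: String }
  case object IntegerType extends DataType { val simpleString = "int" }
  case object CalendarIntervalType extends DataType { val simpleString = "calendarinterval" }
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField])

  // Throws if any field in the given schema has a type this sketch treats as unsupported.
  def verifySchema(schema: StructType): Unit = schema.fields.foreach { f =>
    f.dataType match {
      case CalendarIntervalType =>
        throw new UnsupportedOperationException(
          s"Data source does not support ${f.dataType.simpleString} data type.")
      case _ => ()
    }
  }

  // Keeps only the columns a query actually selects (the "required" schema).
  def project(dataSchema: StructType, columns: Seq[String]): StructType =
    StructType(dataSchema.fields.filter(f => columns.contains(f.name)))

  val dataSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("span", CalendarIntervalType)))

  // Selecting only `id` leaves the unsupported `span` column out of the check.
  val requiredSchema: StructType = project(dataSchema, Seq("id"))
}
```

In this model `verifySchema(requiredSchema)` passes while `verifySchema(dataSchema)` throws, which is why checking the whole schema would reject queries that currently work.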
cc @marmbrus, could you please take a look? This is the one about schema verification we talked about when adding
Test build #68067 has finished for PR 15751 at commit
Let me try to explain in more detail and add some tests for review.
Test build #68106 has finished for PR 15751 at commit
Test build #68104 has finished for PR 15751 at commit
Test build #68105 has finished for PR 15751 at commit
      s"CSV data source does not support ${dataType.simpleString} data type.")
  case _ =>
    throw new UnsupportedOperationException(
      s"CSV data source does not support ${dataType.simpleString} data type.")
CSV currently throws `UnsupportedOperationException` but the text datasource throws `AnalysisException`. I just matched this to `UnsupportedOperationException`. I am happy to match this to `AnalysisException` instead if anyone prefers that.
Test build #68111 has finished for PR 15751 at commit
Test build #68119 has finished for PR 15751 at commit
Test build #68121 has finished for PR 15751 at commit
retest this please
Test build #68123 has finished for PR 15751 at commit
Thanks for working on this! My meta question, however, is: why don't we just support this instead? We can parse the interval type from strings, and I think it would be really easy to write a converter that produces a string for write.
@marmbrus Sure! Makes sense. One thing I immediately worry about, though: IIUC, it seems we should unset the check[1] for
Anyway, this PR only makes the datasources throw the exceptions earlier and adds some test cases that did not exist before. So I hope this is okay as it is, if you are all fine with it.
Sure, we could possibly merge this first, and you are right, we'd need to remove the checks. I'll try to find some time to look this over, as it's a larger patch and I'm focusing on 2.1 bugs right now. If you have time you might beat me with the better solution :)
Test build #68316 has finished for PR 15751 at commit
retest this please
Test build #68326 has finished for PR 15751 at commit
Hi @marmbrus, is there anything you are worried about that I should take a careful look at?
Do we want this change? If it is approved, I will rebase. It seems to run into conflicts easily.
I will close this for now and make a new one soon.
What changes were proposed in this pull request?
This PR includes several fixes as below:
Case 1: `read.json(rdd)` - throws an exception before execution for unsupported types.
Case 2: `read.json(path)` - throws an exception before execution for unsupported types.
Case 3: `read.csv(path)` - throws an exception before execution for unsupported types.
Case 4: `read.text(path)` - throws an exception before execution for unsupported types rather than printing incorrect values. Currently this prints as below:
whereas the actual content is
Case 5: `from_json(...)` - throws an analysis exception for unsupported types (`to_json` already throws an analysis exception).
Case 6: `write.json(path)` (this is a potential issue) - adds the schema check when writing. Currently, the only type that JSON conversion does not support seems to be `CalendarIntervalType`, but this case seems to be already covered as below:
However, it would be safer to add this check, as the other text-based datasources do. We might add more types in the future.
In more detail, for Case 1, Case 2 and Case 3 there are three parsing modes: `FAILFAST`, `DROPMALFORMED` and `PERMISSIVE`. The original behaviour is:
`FAILFAST` - fails if it meets the unsupported type when parsing.
`DROPMALFORMED` - drops records having non-null values in the unsupported types. Otherwise, it reads the values as `null`.
`PERMISSIVE` - reads the values as `null` for unsupported types.
In the case of `FAILFAST`, we can fail right after checking only the schema.
How was this patch tested?
Unit tests in `JsonSuite`, `CSVSuite`, `TextSuite` and `JsonFunctionsSuite`.
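For reviewers who want the original parsing-mode behaviour described for Case 1 to Case 3 in executable form, the following is a minimal model of it. It is only a reading of the description above: the `Column` and `ParseMode` types and the `parseRecord` function are hypothetical, a missing key stands for a null value, and the sketch assumes `FAILFAST` fails only when it meets a non-null value in an unsupported column.

```scala
// Hypothetical model of the described behaviour; not Spark's parser.
object ParseModeSketch {
  sealed trait ParseMode
  case object FailFast extends ParseMode
  case object DropMalformed extends ParseMode
  case object Permissive extends ParseMode

  // A column is a name plus a flag saying whether its type is supported.
  final case class Column(name: String, supported: Boolean)

  // Returns None when the record is dropped; otherwise one value per column,
  // where an inner None stands for null. A key missing from `record` is null.
  def parseRecord(
      mode: ParseMode,
      columns: Seq[Column],
      record: Map[String, String]): Option[Seq[Option[String]]] = {
    // True when the record carries a non-null value in an unsupported column.
    val hitsUnsupported = columns.exists(c => !c.supported && record.contains(c.name))
    // Unsupported columns are always read as null in this model.
    def values: Seq[Option[String]] =
      columns.map(c => if (c.supported) record.get(c.name) else None)

    mode match {
      case FailFast if hitsUnsupported =>
        throw new UnsupportedOperationException("Unsupported type met while parsing.")
      case DropMalformed if hitsUnsupported => None
      case _ => Some(values)
    }
  }
}
```

For example, with columns `id` (supported) and `span` (unsupported), a record holding both values is dropped under `DropMalformed`, read with `span` as null under `Permissive`, and rejected with an exception under `FailFast`.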