[SPARK-8072] [SQL] Better AnalysisException for writing DataFrame with identically named columns #7013
Conversation
Can one of the admins verify this patch?
ok to test
Merged build triggered.
Merged build started.
LGTM
Test build #35826 has started for PR 7013 at commit
Test build #35826 has finished for PR 7013 at commit
Merged build finished. Test PASSed.
the best place to do this might be looking at a logical plan that contains an output operator, rather than putting it in the writer itself.
I could not figure out how to check if a logical plan has an output operator... Any guidance would help a lot.
After talking more offline with @rxin, I think we want to make this check specific to Parquet. For other data sources (like CSV) it's actually not a problem to have duplicate column names.
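The duplicate-name situation this check targets can be sketched in plain Scala, with no Spark dependency. The object and method names here are hypothetical illustrations, not Spark's actual API; the comparison is case-sensitive as written.

```scala
// Sketch: how a Parquet-specific pre-write check could detect duplicate
// column names. `DuplicateColumns` and `find` are hypothetical names.
object DuplicateColumns {
  // Return each column name that occurs more than once in the schema.
  def find(fieldNames: Seq[String]): Set[String] =
    fieldNames.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }.toSet
}
```

For a schema whose field names are `Seq("id", "value", "id")`, `find` returns `Set("id")`; for unique names it returns the empty set.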
Merged build triggered.
Merged build started.
Test build #36131 has started for PR 7013 at commit
Test build #36131 has finished for PR 7013 at commit
Merged build finished. Test FAILed.
Merged build triggered.
Merged build started.
Test build #36136 has started for PR 7013 at commit
Test build #36136 has finished for PR 7013 at commit
Merged build finished. Test PASSed.
@animeshbaranawal Do any other data sources also have this problem? I'm thinking of ORC and JSON; will JSON silently overwrite a duplicated column?
Yes, I tried with JSON and it overwrites the data. Michael Armbrust also said that he wants the check for Parquet only.
Let me clarify: I want the error on a per-datasource basis, contingent on whether it makes sense given the limitations of the format.
@marmbrus I didn't follow; am I missing something?
We should also do it for JSON.
And we should throw the error inside of Parquet if possible. That way we don't have tons of special-case code inside the generic data source handler.
Ideally, this would serve as an example so that other data source implementers could throw errors when people try to write out invalid data (e.g. consider a data source that only allows alphanumeric characters in its column names).
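The alphanumeric-only example from the comment above could look roughly like this. Everything here is a hypothetical illustration: the object name, the rule, and the use of IllegalArgumentException as a stand-in for Spark's AnalysisException (which lives in the spark-sql module), so the snippet compiles without Spark.

```scala
// Sketch of datasource-side validation for a hypothetical format that
// only accepts alphanumeric column names.
object ColumnNameValidator {
  private val Alphanumeric = "^[A-Za-z0-9]+$".r

  // Throw on the first column name the (hypothetical) format cannot store.
  def validate(fieldNames: Seq[String]): Unit =
    fieldNames.find(n => Alphanumeric.findFirstIn(n).isEmpty).foreach { bad =>
      throw new IllegalArgumentException(
        s"Column name '$bad' contains non-alphanumeric characters; " +
          "this datasource cannot store it")
    }
}
```

Putting the check in the datasource keeps format-specific rules out of the generic write path, which is the design the reviewers are suggesting.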
Got it! What about JDBC?
Merged build triggered.
Merged build started.
Test build #36257 has started for PR 7013 at commit
Test build #36257 has finished for PR 7013 at commit
Merged build finished. Test FAILed.
Merged build triggered.
Merged build started.
Test build #36263 has started for PR 7013 at commit
Test build #36263 has finished for PR 7013 at commit
Merged build finished. Test PASSed.
This should probably be an AnalysisException, which is the exception we throw when users try to run an invalid query.
Build triggered.
Build started.
Test build #36345 has started for PR 7013 at commit
Test build #36345 has finished for PR 7013 at commit
Build finished. Test PASSed.
Why is it not merging cleanly?
Someone else has added tests to DataFrameSuite.
Fixed the conflict manually. Merged to master. Thanks!
@animeshbaranawal I think you want to add the email address you used in your commit to your GitHub profile, so the commit will show up properly as yours.
Added!
This adds a function checkConstraints that checks the constraints to be applied on the DataFrame schema. It is called before storing the DataFrame to external storage, and is added in the corresponding datasource API.
cc @rxin @marmbrus
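The checkConstraints idea described above can be sketched in plain Scala. This is a minimal illustration, not the PR's actual code: the real function would live in the datasource implementation and throw org.apache.spark.sql.AnalysisException; IllegalArgumentException stands in so the snippet compiles without Spark.

```scala
// Minimal sketch of a checkConstraints-style pre-write validation:
// reject a schema whose column names are not unique, before writing.
object SchemaConstraints {
  def checkConstraints(fieldNames: Seq[String]): Unit = {
    if (fieldNames.distinct.length != fieldNames.length) {
      val duplicates = fieldNames.groupBy(identity).collect {
        case (name, occurrences) if occurrences.size > 1 => s""""$name""""
      }.mkString(", ")
      throw new IllegalArgumentException(
        s"Duplicate column(s) $duplicates found; cannot save to file")
    }
  }
}
```

Calling this with unique names is a no-op; with a duplicated name it fails fast with a message listing the offending columns, which is the per-datasource behavior discussed in the review.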