[SPARK-8072] [SQL] Better AnalysisException for writing DataFrame with identically named columns #7013
Conversation
Can one of the admins verify this patch?
ok to test
Merged build triggered.
Merged build started.
LGTM
Test build #35826 has started for PR 7013 at commit
Test build #35826 has finished for PR 7013 at commit
Merged build finished. Test PASSed.
the best place to do this might be looking at a logical plan that contains an output operator, rather than putting it in the writer itself.
I could not figure out how to check if a logical plan has an output operator... Any guidance would help a lot.
After talking more offline with @rxin, I think we want to make this check specific to Parquet. For other data sources (like CSV) it's actually not a problem to have duplicate column names.
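The duplicate-name situation this check targets can be sketched in plain Scala, with no Spark dependency. The object and method names here are hypothetical illustrations, not Spark's actual API; the comparison is case-sensitive as written.

```scala
// Sketch: how a Parquet-specific pre-write check could detect duplicate
// column names. `DuplicateColumns` and `find` are hypothetical names.
object DuplicateColumns {
  // Return each column name that occurs more than once in the schema.
  def find(fieldNames: Seq[String]): Set[String] =
    fieldNames.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }.toSet
}
```

For a schema whose field names are `Seq("id", "value", "id")`, `find` returns `Set("id")`; for unique names it returns the empty set.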
Merged build triggered.
Merged build started.
Test build #36131 has started for PR 7013 at commit
Test build #36131 has finished for PR 7013 at commit
Merged build finished. Test FAILed.
Merged build triggered.
Merged build started.
Test build #36136 has started for PR 7013 at commit
Test build #36136 has finished for PR 7013 at commit
Merged build finished. Test PASSed.
@animeshbaranawal Do any other data sources also have this problem? I'm thinking of ORC and JSON; will JSON silently overwrite a duplicated column?
Yes, I tried with JSON and it overwrites the data. Michael Armbrust also said that he wants the check for Parquet only.
Let me clarify: I want the error on a per-datasource basis, contingent on whether it makes sense given the limitations of the format.
@marmbrus I didn't follow; am I missing something?
We should also do it for JSON.
And we should throw the error inside of Parquet if possible. That way we don't have tons of special-case code inside the generic data source handler.
Ideally, this would serve as an example so that other data source implementers could throw errors when people try to write out invalid data (e.g. consider a data source that only allows alphanumeric characters in its column names).
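The alphanumeric-only example from the comment above could look roughly like this. Everything here is a hypothetical illustration: the object name, the rule, and the use of IllegalArgumentException as a stand-in for Spark's AnalysisException (which lives in the spark-sql module), so the snippet compiles without Spark.

```scala
// Sketch of datasource-side validation for a hypothetical format that
// only accepts alphanumeric column names.
object ColumnNameValidator {
  private val Alphanumeric = "^[A-Za-z0-9]+$".r

  // Throw on the first column name the (hypothetical) format cannot store.
  def validate(fieldNames: Seq[String]): Unit =
    fieldNames.find(n => Alphanumeric.findFirstIn(n).isEmpty).foreach { bad =>
      throw new IllegalArgumentException(
        s"Column name '$bad' contains non-alphanumeric characters; " +
          "this datasource cannot store it")
    }
}
```

Putting the check in the datasource keeps format-specific rules out of the generic write path, which is the design the reviewers are suggesting.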
Got it! What about JDBC?
Merged build triggered.
Merged build started.
Test build #36257 has started for PR 7013 at commit
Test build #36257 has finished for PR 7013 at commit
Merged build finished. Test FAILed.
Merged build triggered.
Merged build started.
Test build #36263 has started for PR 7013 at commit
Test build #36263 has finished for PR 7013 at commit
Merged build finished. Test PASSed.
This should probably be an AnalysisException, which is the exception we throw when users try to run an invalid query.
Build triggered.
Build started.
Test build #36345 has started for PR 7013 at commit
Test build #36345 has finished for PR 7013 at commit
Build finished. Test PASSed.
Why is it not merging cleanly?
Someone else has added tests to DataFrameSuite.
Fixed the conflict manually. Merged to master. Thanks!
@animeshbaranawal I think you want to add the email address you used in your commit to your GitHub profile, so the commit will show up properly as yours.
Added!
This adds a function checkConstraints that checks the constraints to be applied on the DataFrame schema. It is called before storing the DataFrame to external storage, and is added in the corresponding datasource API.
cc @rxin @marmbrus
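The checkConstraints idea described above can be sketched in plain Scala. This is a minimal illustration, not the PR's actual code: the real function would live in the datasource implementation and throw org.apache.spark.sql.AnalysisException; IllegalArgumentException stands in so the snippet compiles without Spark.

```scala
// Minimal sketch of a checkConstraints-style pre-write validation:
// reject a schema whose column names are not unique, before writing.
object SchemaConstraints {
  def checkConstraints(fieldNames: Seq[String]): Unit = {
    if (fieldNames.distinct.length != fieldNames.length) {
      val duplicates = fieldNames.groupBy(identity).collect {
        case (name, occurrences) if occurrences.size > 1 => s""""$name""""
      }.mkString(", ")
      throw new IllegalArgumentException(
        s"Duplicate column(s) $duplicates found; cannot save to file")
    }
  }
}
```

Calling this with unique names is a no-op; with a duplicated name it fails fast with a message listing the offending columns, which is the per-datasource behavior discussed in the review.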