Skip to content

Conversation

@animeshbaranawal
Copy link

Adding a function checkConstraints which will check for the constraints to be applied on the dataframe / dataframe schema. Function called before storing the dataframe to an external storage. Function added in the corresponding datasource API.
cc @rxin @marmbrus

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@davies
Copy link
Contributor

davies commented Jun 26, 2015

ok to test

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@davies
Copy link
Contributor

davies commented Jun 26, 2015

LGTM

@SparkQA
Copy link

SparkQA commented Jun 26, 2015

Test build #35826 has started for PR 7013 at commit 7c3d928.

@SparkQA
Copy link

SparkQA commented Jun 26, 2015

Test build #35826 has finished for PR 7013 at commit 7c3d928.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the best place to do this might be looking at a logical plan that contains an output operator, rather than putting it in the writer itself.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not figure out how to check if a logical plan has an output operator... Any guidance would help a lot.

@marmbrus
Copy link
Contributor

After talking more off-line with @rxin I think we want to make this check specific to parquet. For other data sources (like CSV) its actually not a problem to have duplicate column names.

@animeshbaranawal
Copy link
Author

cc @rxin @marmbrus
Can you review this?

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Jun 30, 2015

Test build #36131 has started for PR 7013 at commit 98b4399.

@SparkQA
Copy link

SparkQA commented Jun 30, 2015

Test build #36131 has finished for PR 7013 at commit 98b4399.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Jun 30, 2015

Test build #36136 has started for PR 7013 at commit 3cc4d2c.

@SparkQA
Copy link

SparkQA commented Jun 30, 2015

Test build #36136 has finished for PR 7013 at commit 3cc4d2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@davies
Copy link
Contributor

davies commented Jun 30, 2015

@animeshbaranawal Do any other data sources also have this problem? I'm thinking of Orc and JSON, will JSON overwrite duplicated column silently?

@animeshbaranawal
Copy link
Author

Yes. I tried with JSON and it overwrites the data. Michael Armbrust also said that he want the rule for parquet only.

@marmbrus
Copy link
Contributor

Let me clarify: I want the error on a per datasource basis, contingent upon whether it makes sense given the limitations of the format.

@animeshbaranawal
Copy link
Author

@marmbrus Didn't get you? Am I missing something?

@marmbrus
Copy link
Contributor

We should also do it for JSON.

@marmbrus
Copy link
Contributor

And we should throw the error inside of parquet if possible. That way we don't have tons of special case code inside of the generic data source handler.

@marmbrus
Copy link
Contributor

Ideally, this would serve as an example so that other data source implementors could throw errors when people try to write out invalid data (i.e. consider a datasource that only allows alpha numeric characters in its column names).

@animeshbaranawal
Copy link
Author

Got it ! What about jdbc ?

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Jul 1, 2015

Test build #36257 has started for PR 7013 at commit a8a964f.

@SparkQA
Copy link

SparkQA commented Jul 1, 2015

Test build #36257 has finished for PR 7013 at commit a8a964f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Jul 1, 2015

Test build #36263 has started for PR 7013 at commit fd45e1b.

@SparkQA
Copy link

SparkQA commented Jul 1, 2015

Test build #36263 has finished for PR 7013 at commit fd45e1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an AnalysisException probably. This is the exception we throw when users try to run an invalid query.

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build started.

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36345 has started for PR 7013 at commit f70dd0e.

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36345 has finished for PR 7013 at commit f70dd0e.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Build finished. Test PASSed.

@animeshbaranawal
Copy link
Author

Why is it not merging cleanly?

@marmbrus
Copy link
Contributor

marmbrus commented Jul 6, 2015

[marmbrus@Michaels-MacBook-Pro-2 spark ((14e4bf8...))]$ git checkout pr/7013
Previous HEAD position was 14e4bf8... Use CanBroadcast in broadcast outer join planning
Branch pr/7013 set up to track remote branch pr/7013 from origin.
Switched to a new branch 'pr/7013'

[marmbrus@Michaels-MacBook-Pro-2 spark (pr/7013)]$ git merge origin/master
Auto-merging sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
Removing sql/hive/src/test/resources/golden/Date comparison test 2-0-dc1b267f1d79d49e6675afe4fd2a34a5
Removing sql/hive/src/test/resources/golden/Date cast-0-a7cd69b80c77a771a2c955db666be53d
Auto-merging sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala
Auto-merging sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
CONFLICT (content): Merge conflict in sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Auto-merging sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
Auto-merging sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUDFs.scala
Auto-merging sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
Removing sql/catalyst/src/main/java/org/apache/spark/sql/BaseRow.java
Removing sql/catalyst/src/main/java/org/apache/spark/sql/BaseMutableRow.java
Automatic merge failed; fix conflicts and then commit the result.

Someone else has added tests to DataFrameSuite.

@marmbrus
Copy link
Contributor

marmbrus commented Jul 6, 2015

Fixed the conflict manually. Merged to master. Thanks!

@asfgit asfgit closed this in 09a0641 Jul 6, 2015
@rxin
Copy link
Contributor

rxin commented Jul 6, 2015

@animeshbaranawal I think you want to add the email address you used in your commit to your github profile, so the commit will show up properly as yours.

@animeshbaranawal
Copy link
Author

added !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants