[SPARK-3713][SQL] Uses JSON to serialize DataType objects #2563

Closed
wants to merge 8 commits into from

Conversation

liancheng
Contributor

This PR uses JSON instead of toString to serialize DataTypes. The latter is not only hard to parse but also flaky in many cases.

Since schema information has already been written to Parquet metadata in the old style, we have to keep the old DataType parser to ensure backward compatibility. The old parser is now renamed to CaseClassStringParser and moved into object DataType.
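
To illustrate the approach described above, here is a minimal Python sketch. It is not the PR's actual code: the names `to_json`, `parse_legacy`, and `from_json_or_legacy` are hypothetical, and the legacy format handled here is a simplified stand-in for the old toString-style output. The reader tries JSON first and falls back to the legacy parser for strings written in the old style.

```python
import json
import re

# Hypothetical, simplified model: a type is either a primitive name
# (e.g. "integer") or an array described by a dict.
def to_json(data_type):
    """Serialize a type description to a JSON string."""
    return json.dumps(data_type, sort_keys=True)

def parse_legacy(text):
    """Parse a simplified old-style string such as 'ArrayType(IntegerType,true)'."""
    match = re.fullmatch(r"ArrayType\((\w+)Type,(true|false)\)", text)
    if match:
        return {
            "type": "array",
            "elementType": match.group(1).lower(),
            "containsNull": match.group(2) == "true",
        }
    match = re.fullmatch(r"(\w+)Type", text)
    if match:
        return match.group(1).lower()
    raise ValueError("unrecognized legacy type string: %r" % text)

def from_json_or_legacy(text):
    """Try the JSON format first; fall back to the legacy parser."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return parse_legacy(text)
```

Trying JSON first keeps the common path simple, while metadata written by older versions remains readable through the fallback.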

@JoshRosen @davies Please help review PySpark related changes, thanks!

@SparkQA

SparkQA commented Sep 28, 2014

QA tests have started for PR 2563 at commit 26c6563.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 28, 2014

QA tests have finished for PR 2563 at commit 26c6563.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20938/

@SparkQA

SparkQA commented Sep 28, 2014

QA tests have started for PR 2563 at commit 03da3ec.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 28, 2014

Tests timed out after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20939/

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have started for PR 2563 at commit 03da3ec.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have finished for PR 2563 at commit 03da3ec.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Parses a string representation of a DataType.
*
* TODO: Generate parser as pickler...
* Utility functions for working with DataTypes.
Contributor

I think this comment is in the wrong place. We should probably note that this parser is deprecated and is only here for backwards compatibility. We might even print a warning when it is used so we can get rid of it eventually.

Contributor Author

Ah, this comment is a mistake. Instead of printing a warning, I made fromCaseClassString() private. It's only referenced by CaseClassStringParser, which has already been marked as deprecated.
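
For reference, the warning-on-use alternative suggested in the review could look roughly like the following Python sketch. This is only an illustration of the idea, not what the PR does (the PR makes the method private instead); `parse_legacy_type_string` is a hypothetical name, and the parsing is a simplified stand-in.

```python
import warnings

def parse_legacy_type_string(text):
    """Hypothetical legacy parser kept only for backward compatibility."""
    # Emit a deprecation warning every time the old code path is used,
    # so remaining callers are visible before the parser is removed.
    warnings.warn(
        "the old string format for data types is deprecated",
        DeprecationWarning,
        stacklevel=2,
    )
    # Simplified stand-in for real parsing: strip the 'Type' suffix.
    if not text.endswith("Type"):
        raise ValueError("unrecognized legacy type string: %r" % text)
    return text[:-4].lower()
```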

@marmbrus
Contributor

marmbrus commented Oct 1, 2014

One minor comment; otherwise this LGTM.

@@ -62,6 +63,12 @@ def __eq__(self, other):
     def __ne__(self, other):
         return not self.__eq__(other)

     def jsonValue(self):
         return self.simpleString
Contributor

You can have a default implementation such as:

self.__class__.__name__[:-4].lower()

Contributor Author

Thanks for this, it saved lots of boilerplate code! Removed all simpleString() methods in subclasses.
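
A minimal sketch of the default implementation discussed in this thread, assuming every subclass name ends in "Type" (the class bodies here are illustrative placeholders, not the PR's code):

```python
class DataType(object):
    """Illustrative base class; subclass names are assumed to end in 'Type'."""

    def simpleString(self):
        # Strip the trailing "Type" (4 characters) and lowercase the rest,
        # e.g. IntegerType -> "integer".
        return self.__class__.__name__[:-4].lower()

class IntegerType(DataType):
    pass

class StringType(DataType):
    pass
```

Because the name is derived from the class, subclasses no longer need their own simpleString(); a subclass whose name does not follow the convention would still have to override it.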

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2563 at commit 5169238.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2563 at commit 5169238.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21205/

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2563 at commit 81e28fb.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 3, 2014

QA tests have finished for PR 2563 at commit 81e28fb.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class GetPeers(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21227/

@@ -62,6 +67,17 @@ def __eq__(self, other):
     def __ne__(self, other):
         return not self.__eq__(other)

     def simpleString(self):
         return _get_simple_string(self.__class__)
Contributor

Why not just put _get_simple_string here? It doesn't need to be a separate function, and it will be harder to understand without this context.

To make it available on the class, it could be a classmethod:

@classmethod
def simpleString(cls):
    return cls.__name__[:-4].lower()

@liancheng
Contributor Author

@davies Thanks for all the suggestions, they really make things a lot cleaner!

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2563 at commit 54c46ce.

  • This patch does not merge cleanly!

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2563 at commit 785b683.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have finished for PR 2563 at commit 785b683.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21291/

@SparkQA

SparkQA commented Oct 4, 2014

Tests timed out after a configured wait of 120m.

        return cls.__name__[:-4].lower()

    def jsonValue(self):
        return {"type": self.typeName()}
Contributor

If you'd like to use a single string for primitive types, that's still doable; use a one-level dict only for the other types.

Either one is fine with me.
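
The option described above might be sketched as follows. This is an illustration only: the field names `elementType` and `containsNull` and the helper `to_json` are assumptions made for the example, not taken from the diff shown here. Primitive types serialize to a bare string, while a complex type returns a dict.

```python
import json

class DataType(object):
    @classmethod
    def typeName(cls):
        return cls.__name__[:-4].lower()

    def jsonValue(self):
        # Primitive types: a bare string is enough.
        return self.typeName()

class IntegerType(DataType):
    pass

class ArrayType(DataType):
    def __init__(self, elementType, containsNull=True):
        self.elementType = elementType
        self.containsNull = containsNull

    def jsonValue(self):
        # Complex types: a dict carrying the extra fields.
        return {
            "type": self.typeName(),
            "elementType": self.elementType.jsonValue(),
            "containsNull": self.containsNull,
        }

def to_json(data_type):
    """Hypothetical helper: render any type as a JSON string."""
    return json.dumps(data_type.jsonValue(), sort_keys=True)
```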

@davies
Contributor

davies commented Oct 5, 2014

This looks good to me; you just forgot to roll back the changes in run-tests after debugging.

@liancheng
Contributor Author

@davies Sorry for my carelessness... And thanks again for all the great advice!

@SparkQA

SparkQA commented Oct 5, 2014

QA tests have started for PR 2563 at commit de18dea.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 5, 2014

QA tests have finished for PR 2563 at commit de18dea.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21305/

@davies
Contributor

davies commented Oct 5, 2014

LGTM now, thanks!

@davies
Contributor

davies commented Oct 7, 2014

Could you rebase this to master?

@liancheng
Contributor Author

Finished rebasing.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2563 at commit fc92eb3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2563 at commit fc92eb3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2563 at commit fc92eb3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21437/

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2563 at commit fc92eb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(inputFile: String = null, threshold: Double = 0.1)
    • class Word2VecModel(object):
    • class Word2Vec(object):

@davies
Contributor

davies commented Oct 8, 2014

@marmbrus I think this is ready to go.

@marmbrus
Contributor

marmbrus commented Oct 9, 2014

Thanks! I've merged this.

@asfgit asfgit closed this in a42cc08 Oct 9, 2014
@liancheng liancheng deleted the datatype-to-json branch October 10, 2014 13:45