Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3569][SQL] Add metadata field to StructField #2701

Closed
wants to merge 29 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Oct 8, 2014

Add metadata: Metadata to StructField to store extra information of columns. Metadata is a simple wrapper over Map[String, Any] with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.

Metadata is preserved through simple operations like SELECT.

@marmbrus @liancheng

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2701 at commit d8af0ed.

  • This patch does not merge cleanly!

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2701 at commit c41a664.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2701 at commit c41a664.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21435/Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2701 at commit d8af0ed.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21429/Test FAILed.

nameToField.get(name).getOrElse(
throw new IllegalArgumentException(s"Field ${name} does not exist."))
nameToField.getOrElse(name,
throw new IllegalArgumentException(s"Field $name does not exist."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no need to wrap

@liancheng
Copy link
Contributor

I think using immutable Map for metadata is enough. We can add an API like .transformMetadata(f: Map[String, Any] => Map[String, Any]) to alter metadata as needed.

@liancheng
Copy link
Contributor

PySpark also need to be updated.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2701 at commit 7e5a322.

  • This patch merges cleanly.

@mengxr
Copy link
Contributor Author

mengxr commented Oct 8, 2014

@liancheng The Python API is hard to update at this time because the schema SerDe is via str(..):

https://github.com/apache/spark/blob/master/python/pyspark/sql.py#L1131

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2701 at commit 7e5a322.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21440/Test PASSed.

@liancheng
Copy link
Contributor

#2563 has already replaced str(...) with .json(). You can add Python metadata ser/de by modifying .jsonValue().

@mengxr mengxr changed the title [WIP][SPARK-3569][SQL] Add metadata field to StructField [SPARK-3569][SQL] Add metadata field to StructField Oct 9, 2014
@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have started for PR 2701 at commit 93518fb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2701 at commit 93518fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21529/Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 21, 2014

QA tests have started for PR 2701 at commit 611d3c2.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2701 at commit 611d3c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21967/
Test PASSed.

@@ -305,12 +305,15 @@ class StructField(DataType):

"""

def __init__(self, name, dataType, nullable):
def __init__(self, name, dataType, nullable, metadata={}):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use {} as default value will have side effects, such as:

>>> a = StructField('a', StringType(), True)
>>> b = StructField('b', StringType(), True)
>>> a.metadata['name'] = 'a'
>>> b.metadata['name']
'a'

So if the metadata could be modified somewhere, here you should use None as default value.

def xxx(xxx, metadata=None):
    .... 
    self.metadata = metadata or {}

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

QA tests have started for PR 2701 at commit 589f314.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

QA tests have finished for PR 2701 at commit 589f314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22098/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #443 has started for PR 2701 at commit 589f314.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #443 has finished for PR 2701 at commit 589f314.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #458 has started for PR 2701 at commit 589f314.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #465 has started for PR 2701 at commit 589f314.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #458 has finished for PR 2701 at commit 589f314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder

@marmbrus
Copy link
Contributor

@mengxr, thanks for working on this! Overall LGTM. One minor thing: I think we should expose Metadata as a type variable in the org.apache.spark.sql package and as a subclass in org.apache.spark.sql.api.java. Catalyst is considered a Private API so we explicitly expose the parts of it in sql that we expect users to use.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #465 has finished for PR 2701 at commit 589f314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder

@marmbrus
Copy link
Contributor

Here's a PR to fix the package visibility. If that looks good to you I think this is ready to merge: mengxr#1

Expose Metadata and MetadataBuilder through the public scala and java packages.
@SparkQA
Copy link

SparkQA commented Oct 30, 2014

QA tests have started for PR 2701 at commit c35203f.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

QA tests have finished for PR 2701 at commit c35203f.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder
    • class Metadata extends org.apache.spark.sql.catalyst.util.Metadata
    • public class MetadataBuilder extends org.apache.spark.sql.catalyst.util.MetadataBuilder

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22543/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22671 has started for PR 2701 at commit dedda56.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22671 has finished for PR 2701 at commit dedda56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AttributeReference(
    • case class StructField(
    • class MetadataBuilder
    • class Metadata extends org.apache.spark.sql.catalyst.util.Metadata
    • public class MetadataBuilder extends org.apache.spark.sql.catalyst.util.MetadataBuilder

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22671/
Test PASSed.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 1, 2014

Thanks! Merged to master.

@asfgit asfgit closed this in 1d4f355 Nov 1, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants