
[SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json #17406

Closed
wants to merge 9 commits

Conversation

maropu (Member) commented Mar 24, 2017

What changes were proposed in this pull request?

This PR adds StructType.fromDDL to convert a DDL-formatted string into a StructType, for defining schemas in functions.from_json.

How was this patch tested?

Added tests in JsonFunctionsSuite.
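
For context, a rough usage sketch of what the change enables; the DataFrame df and its value column here are hypothetical:

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.functions.from_json

    // Hypothetical input: a DataFrame `df` with a string column `value` holding JSON records.
    val noOptions = Map.empty[String, String].asJava

    // Previously, the schema string had to be a JSON-formatted DataType:
    df.select(from_json(df("value"),
      """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}""",
      noOptions))

    // With this change, a DDL-formatted string is accepted as well:
    df.select(from_json(df("value"), "a INT, b STRING", noOptions))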

// Since we add a user-friendly API in the DDL parser, we employ DDL formats for the case.
// To keep backward compatibility, we use `fromJson` first, and then try the new API.
def fromString(text: String): DataType = {
  try { fromJson(text) } catch { case _: Throwable => CatalystSqlParser.parseTableSchema(text) }
}
Member:

Throwable -> NonFatal

Member Author:

okay
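
The catch clause as it ends up later in the diff, with the suggested NonFatal guard:

    import scala.util.control.NonFatal

    try { fromJson(text) } catch {
      case NonFatal(_) => CatalystSqlParser.parseTableSchema(text)
    }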

// Until Spark 2.1, we used JSON strings for defining schemas in user-facing APIs.
// Since we add a user-friendly API in the DDL parser, we employ DDL formats for the case.
// To keep backward compatibility, we use `fromJson` first, and then try the new API.
def fromString(text: String): DataType = {
Member:

If this is a new API, why do we need to keep backward compatibility? For the use of JSON there is fromJson, and I think you don't change it.

Member Author:

In this PR, I just propose that we support the new DDL format in the existing APIs that already use JSON formats for defining schemas (e.g., functions.from_json). So we need to support both JSON and DDL formats there. If I misunderstood you, could you correct me? Thanks for your comment!

viirya (Member) commented Mar 24, 2017:

I mean users should use fromString just for the DDL format.

It is pretty weird that an API named from_json is used to parse both JSON and the DDL format.

How about adding a new from_ddl or from_string?

Member Author:

Aha, I see. WDYT? cc: @gatorsmile

HyukjinKwon (Member) commented Mar 24, 2017:

Sorry for interrupting. Actually, my point is a bit different. I thought DataType.fromString should take only the SQL type string, with proper documentation, unless we are going to deprecate DataType.fromJson for this purpose, and both cases should be handled separately, whether within functions.from_json or somewhere else.

I guess functions.from_json refers to the data itself being in JSON format, not the schema. So it is probably fine if we document this well.

To cut it short, my point was that DataType.fromString looks like it is going to be exposed as separate functionality, not as a private function, and therefore we should not fold another existing functionality into it.

viirya (Member) commented Mar 24, 2017:

DataType.fromString should only take the DDL format.
DataType.fromJson should only take JSON.

I don't think functions.from_json should accept both the DDL format and JSON, as the current change does.

Maybe we can add a new from_ddl or from_string using DataType.fromString, like from_json uses DataType.fromJson?

That is basically my point here. :-)

Member:

Yup, to cut my comment even shorter, for me,

+1 only for ...

DataType.fromString should only take DDL format.
DataType.fromJson should only take json.

Member:

Oh. Got it. I got confused by the name of from_json...

Right. It is used to parse a column of JSON data...

Member:

If we are going to support both the JSON schema string and the DDL format in from_json, shall we add a DataType.fromStringOrJson that does exactly what the current DataType.fromString does? And let DataType.fromString accept only the DDL format.

maropu (Member Author) commented Mar 24, 2017:

Thanks for the valuable comments!
Yea, I basically agree with @HyukjinKwon's idea. Also, fromDdl sounds better than fromString to me, as @viirya suggested. I'll update soon.


SparkQA commented Mar 24, 2017

Test build #75141 has finished for PR 17406 at commit 87abf7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 24, 2017

Test build #75143 has finished for PR 17406 at commit 2afa6f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu (Member Author) commented Mar 24, 2017

@gatorsmile could you check?


SparkQA commented Mar 24, 2017

Test build #75159 has finished for PR 17406 at commit fb61535.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -103,6 +104,8 @@ object DataType {

def fromJson(json: String): DataType = parseDataType(parse(json))

def fromDdl(ddl: String): DataType = CatalystSqlParser.parseTableSchema(ddl)
Member:

fromDdl -> fromDDL

Member:

Also, please add comments above the function to explain the purpose of these functions.

} catch {
  case NonFatal(_) => DataType.fromDdl(schemaAsString)
}
Some(schema.asInstanceOf[StructType])
gatorsmile (Member) commented Mar 24, 2017:

SimpleTextSource is just a fake source. What is the reason you made these changes?

Member Author:

Aha, I missed the point. Okay, I'll revert this.

gatorsmile (Member):

How about only focusing on the from_json function in this PR?

def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
// Until Spark 2.1, we used JSON strings for defining schemas here. Since we add a user-friendly
// API in the DDL parser, we employ DDL formats for the case. To keep backward compatibility,
// we use `fromJson` first, and then try the new API.
gatorsmile (Member) commented Mar 24, 2017:

How about?

In Spark 2.1, the user-provided schema has to be in JSON format. Since Spark 2.2, the DDL format is also supported for schema.

Member:

Please move the new description to @param schema?
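
Applied to the scaladoc, the suggestion would read roughly as follows (a sketch, not the exact merged wording):

    /**
     * @param schema the schema to use when parsing the json string. In Spark 2.1, the
     *               user-provided schema has to be in JSON format. Since Spark 2.2, the
     *               DDL format is also supported for the schema.
     */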

// Test DDL formats
test(s"from ddl - $dataType") {
  assert(DataType.fromDdl(s"a ${dataType.sql}") === new StructType().add("a", dataType))
}
Member:

How about defining a function like checkDataTypeFromJson? That would look more consistent.
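
A sketch of what such a helper could look like, mirroring the inline test above:

    def checkDataTypeFromDDL(dataType: DataType): Unit = {
      test(s"from ddl - $dataType") {
        assert(DataType.fromDdl(s"a ${dataType.sql}") === new StructType().add("a", dataType))
      }
    }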

gatorsmile (Member):

Also please update the PR title and description. Thanks for working on it!

maropu (Member Author) commented Mar 25, 2017

okay, I'll update soon! Thanks!


SparkQA commented Mar 25, 2017

Test build #75196 has finished for PR 17406 at commit ec5452f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member):

retest this please

maropu (Member Author) commented Mar 25, 2017

oh, it seems we hit weird errors...

gatorsmile (Member):

Just submitted a fix. NVM


SparkQA commented Mar 25, 2017

Test build #75202 has finished for PR 17406 at commit ec5452f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu (Member Author) commented Mar 25, 2017

Jenkins, retest this please.


SparkQA commented Mar 25, 2017

Test build #75207 has finished for PR 17406 at commit ec5452f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu (Member Author) commented Mar 25, 2017

The issue was fixed in a2ce0a2, so I'll retest.

maropu (Member Author) commented Mar 25, 2017

Jenkins, retest this please.


SparkQA commented Mar 25, 2017

Test build #75213 has finished for PR 17406 at commit ec5452f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkDataTypeFromDDL(dataType)
}

// For some types, check the JSON format only, because these types do not support the DDL format.
gatorsmile (Member) commented Mar 25, 2017:

Actually, except NullType, all the following data types should be supported. The only issue is nullability. Our DDL does not recognize something like

CREATE TABLE TAB (COL1  CHAR(3) NOT NULL)

Member Author:

Yea, I think so. Should we file a JIRA for that?

Member:

In Spark SQL, we do not enforce nullability. It is like a hint for optimization. Not supporting such a syntax is reasonable.

checkDataTypeFromJson(ArrayType(StringType, false))

checkDataTypeFromText(MapType(IntegerType, StringType, true))
checkDataTypeFromJson(MapType(IntegerType, ArrayType(DoubleType), false))
Member:

You can keep both, but add an extra one for checkDataTypeFromJson when the nullability is false.

checkDataTypeFromText(MapType(IntegerType, ArrayType(DoubleType), true))
checkDataTypeFromJson(MapType(IntegerType, ArrayType(DoubleType), false))

Member Author:

okay

@@ -201,7 +216,7 @@ class DataTypeSuite extends SparkFunSuite {
StructField("a", IntegerType, nullable = true),
StructField("b", ArrayType(DoubleType), nullable = false),
StructField("c", DoubleType, nullable = false, metadata)))
checkDataTypeJsonRepr(structType)
checkDataTypeFromJson(structType)
Member:

The same here


SparkQA commented Mar 26, 2017

Test build #75231 has started for PR 17406 at commit e2d121f.

maropu (Member Author) commented Mar 26, 2017

Jenkins, retest this please.


SparkQA commented Mar 26, 2017

Test build #75236 has finished for PR 17406 at commit e2d121f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -103,6 +104,12 @@ object DataType {

def fromJson(json: String): DataType = parseDataType(parse(json))

/**
* Creates DataType for a given DDL-formatted string, which is a comma separated list of field
* definitions, e.g., a INT, b STRING.
Member:

Creates DataType -> Creates StructType.

Because parseTableSchema always returns StructType.

Member Author:

Thanks for the comments! Fixed.

* Creates DataType for a given DDL-formatted string, which is a comma separated list of field
* definitions, e.g., a INT, b STRING.
*/
def fromDDL(ddl: String): DataType = CatalystSqlParser.parseTableSchema(ddl)
Member:

Shall we use StructType as return type?


SparkQA commented Mar 27, 2017

Test build #75252 has finished for PR 17406 at commit 842ca77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member):

retest this please

gatorsmile (Member):

Also cc @cloud-fan @hvanhovell

If no further comment, maybe we can merge it to master tomorrow?


SparkQA commented Mar 28, 2017

Test build #75296 has started for PR 17406 at commit 842ca77.

// Before:
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
  from_json(e, DataType.fromJson(schema), options)

// After:
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
  val dataType = try {
Member:

A little concern: Won't the error message from parsing json be shadowed?

Member:

That is fine, right? cc @cloud-fan

Contributor:

yea I think it's fine

* Creates StructType for a given DDL-formatted string, which is a comma separated list of field
* definitions, e.g., a INT, b STRING.
*/
def fromDDL(ddl: String): StructType = CatalystSqlParser.parseTableSchema(ddl)
Contributor:

shall we move it to object StructType?

Member Author:

okay, I'll do soon
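
With the move, the entry point becomes the public StructType.fromDDL (see the merged snippet further below); a minimal usage sketch:

    import org.apache.spark.sql.types._

    val schema = StructType.fromDDL("a INT, b STRING")
    // StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))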

viirya (Member) commented Mar 28, 2017

Ok. LGTM.


SparkQA commented Mar 28, 2017

Test build #75303 has finished for PR 17406 at commit 83f5d6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 28, 2017

Test build #75304 has finished for PR 17406 at commit 9ddd515.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkDataTypeJsonRepr(ArrayType(StringType, false))
checkDataTypeJsonRepr(MapType(IntegerType, StringType, true))
checkDataTypeJsonRepr(MapType(IntegerType, ArrayType(DoubleType), false))
def checkDataTypeFromDDL(dataType: DataType, ignoreNullability: Boolean = false): Unit = {
Contributor:

Why do we have the ignoreNullability parameter? DataType.sql doesn't contain nullability information, so we should not expect to retain the nullability here. We should always compare the types with sameType.
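
For illustration, a sketch of why === is too strict here; it assumes the suite's context, where the package-private sameType is accessible:

    // `sql` drops nullability, so a non-nullable type cannot round-trip through DDL exactly:
    val mapType = MapType(IntegerType, StringType, valueContainsNull = false)
    // mapType.sql == "MAP<INT, STRING>" -- no NOT NULL marker survives
    val parsed = StructType.fromDDL(s"a ${mapType.sql}").head.dataType
    // parsed is MapType(IntegerType, StringType, valueContainsNull = true)
    assert(parsed.sameType(mapType)) // true: equal up to nullability, while parsed == mapType is false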

Member Author:

Aha, I got the point. I'll fix now

cloud-fan (Contributor):

LGTM


SparkQA commented Mar 28, 2017

Test build #75313 has finished for PR 17406 at commit 8448d7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member):

@maropu Could you update the PR title? Then, I can merge it once it is done. Thanks!

maropu (Member Author) commented Mar 28, 2017

@gatorsmile okay, please. Thanks!

gatorsmile (Member):

Could you update the PR title?

[SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json

@maropu maropu changed the title [SPARK-20009][SQL] Use DDL strings for defining schema in user-facing APIs [SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json Mar 29, 2017
maropu (Member Author) commented Mar 29, 2017

oh, my bad. Fixed.

gatorsmile (Member):

Thanks! Merging to master.

@asfgit asfgit closed this in c400848 Mar 29, 2017
Comment on lines +3067 to +3071
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
Member:

I am wondering why parsing the schema from a string is implemented here and not inside the JsonToStructs expression? So calling from_json in SQL and in the Scala API has different behaviour, right? Did you do that intentionally?

HyukjinKwon (Member) commented Oct 30, 2020:

I think you came to the wrong PR. The schema parsing was started on the Scala side at maropu@fe33121#diff-b5e6d03d9c9afbfa925e039c48e31078608ea749c193e6af3087b79eb701bc7cR2877. I guess the reason is that the schema parameter was not intended to be used as an expression before. Now we take it in SQL as well.

Member Author:

I think I just missed it, so making the code common seems fine.

Member Author:

Thanks, @HyukjinKwon. Ah, I see, I totally forgot it...

Member:

I moved parsing to a common place. Please, take a look at #30201
