Add support of custom column infer strategy #205 #393

yjhyjhyjh0 · 2019-06-13T05:22:26Z

Background
Discussion of supporting a column inference strategy that treats every column as a string.
Issue: #205

Use case
I want to rely on column inference so that no schema has to be predefined,
but have every value treated as plain text rather than as a parsed value.

Example: the value “00010” is inferred as long type, so the data becomes 10.
The original “00010” cannot be recovered from 10 during later processing.
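
To make the data loss concrete, here is a minimal sketch in plain Scala (no Spark involved). The object and method names are hypothetical and exist only for illustration; they are not part of spark-xml.

```scala
object LeadingZeroDemo {
  // Parse the raw text the way a long-typed column would store it.
  def asLong(raw: String): Long = raw.trim.toLong

  // Turning the parsed number back into text cannot restore the zero padding.
  def roundTrip(raw: String): String = asLong(raw).toString
}
```

Once the value has gone through `asLong`, the padding is gone, which is why keeping the column as a string is the only way to preserve it.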

Proposed Solution
Add an interface for custom column data type inference strategies.

Modification
Add an interface for the column infer strategy.
Add a new column infer strategy that treats every column as a string.
Add unit tests for the default and all-string infer strategies.

Test result
Passes scalastyle, all unit tests, and the local integration test.

Please let me know if there is any feedback on this pull request.
Thanks.

codecov-io · 2019-06-13T05:44:16Z

Codecov Report

Merging #393 into master will decrease coverage by 0.1%.
The diff coverage is 88.88%.

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   87.71%   87.61%   -0.11%     
==========================================
  Files          14       15       +1     
  Lines         741      751      +10     
  Branches       94       66      -28     
==========================================
+ Hits          650      658       +8     
- Misses         91       93       +2

Impacted Files	Coverage Δ
...la/com/databricks/spark/xml/util/InferSchema.scala	`87.59% <100%> (-0.61%)`	⬇️
...in/scala/com/databricks/spark/xml/XmlOptions.scala	`100% <100%> (ø)`	⬆️
.../com/databricks/spark/xml/util/InferStrategy.scala	`86.66% <86.66%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da10bc...b7da2bf. Read the comment docs.

codecov-io · 2019-06-13T05:44:17Z

Codecov Report

Merging #393 into master will increase coverage by 0.06%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
+ Coverage   87.71%   87.78%   +0.06%     
==========================================
  Files          14       14              
  Lines         741      745       +4     
  Branches       94       64      -30     
==========================================
+ Hits          650      654       +4     
  Misses         91       91

Impacted Files	Coverage Δ
...in/scala/com/databricks/spark/xml/XmlOptions.scala	`100% <100%> (ø)`	⬆️
...la/com/databricks/spark/xml/util/InferSchema.scala	`88.43% <100%> (+0.24%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da10bc...308829c. Read the comment docs.

Wei-1 · 2019-06-13T05:44:09Z

README.md

@@ -62,6 +62,9 @@ When reading files the API accepts several options:
    * When it encounters a field of the wrong datatype, it sets the offending field to `null`.
  * `DROPMALFORMED` : ignores the whole corrupted records.
  * `FAILFAST` : throws an exception when it meets corrupted records.
+* `inferStrategy`: The mode for dealing with corrupt records during parsing. Default is `PERMISSIVE`.


I think this should be:

Default is `DEFAULT`.

Wow, you're right.
I'll fix it now.
Thanks.

HyukjinKwon

I think we can have an option called inferSchema, which the CSV datasource has.

HyukjinKwon · 2019-06-13T07:04:19Z

src/main/scala/com/databricks/spark/xml/util/InferSchema.scala

-      case v if isTimestamp(v) => TimestampType
-      case _ => StringType
-    }
+    options.inferStrategy.getDataType(value)


I think it doesn't need another object. Let's just do an if-else.

I was thinking about the same thing.
Since we only have two options for now, an if-else would be enough.
I added the interface only to better illustrate the use case in this first draft.
I'm fine with either solution.
Please let me know if a modification is required.

Let's make it a boolean and add an if-else. I have observed the use cases in the Spark CSV datasource so far, and I think two cases are good enough.

Yeah, I can't think of another use case:

User specifies schema -> use that

User asks for inferred schema -> infer types

Otherwise use the string rep that's already in the XML

That's how CSV works too, so being consistent and using a boolean here may be fine and less surprising.
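
The three cases listed above can be sketched as a small decision function. This is only an illustration under assumed names (`SchemaChoice`, `choose`, and the three case objects are hypothetical), not the code in this PR.

```scala
object SchemaChoice {
  // Where the column types of the resulting DataFrame come from.
  sealed trait ColumnTypes
  case object UserSupplied extends ColumnTypes // user passed an explicit schema
  case object Inferred extends ColumnTypes     // types guessed from the data
  case object AllStrings extends ColumnTypes   // keep the string representation from the XML

  // An explicit schema always wins; otherwise the boolean flag decides.
  def choose(hasUserSchema: Boolean, inferSchema: Boolean): ColumnTypes =
    if (hasUserSchema) UserSupplied
    else if (inferSchema) Inferred
    else AllStrings
}
```

With a single boolean there is no fourth state to handle, which is the argument for preferring it over a pluggable strategy.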

So, what happens right now if you don't specify a schema? Is it always inferred?

Yes, it always infers.

Thanks for organizing the use cases.
I've pushed a version that uses if-else.

HyukjinKwon · 2019-06-13T07:05:34Z

OK, let's add this option but I would name it inferSchema

yjhyjhyjh0 · 2019-06-13T09:02:31Z

Hi @HyukjinKwon,
Thanks for the review.
I've changed the option name to “inferSchema”.

yjhyjhyjh0 · 2019-06-14T03:20:34Z

After some discussion of the design and implementation,
I just pushed a new version that uses a boolean and an if-else to indicate whether to infer or not.

Passes scalastyle, unit tests, and the local integration test.

HyukjinKwon · 2019-06-14T05:05:09Z

Looks good. I'll merge this if there are no more comments.

srowen · 2019-06-16T19:45:32Z

src/main/scala/com/databricks/spark/xml/util/InferSchema.scala

-      case v if isBoolean(v) => BooleanType
-      case v if isTimestamp(v) => TimestampType
-      case _ => StringType
+    if(options.inferSchema){


Not a big deal, but can we fix style here? spaces around parens.

Thanks for the reminder.
I'll fix it.
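
For readers without the full diff, the shape of the boolean-gated if-else (with the requested spaces around parens) might look roughly like the sketch below. It uses a stand-in type hierarchy instead of Spark's `DataType`, and every name in it (`InferSketch`, `SimpleType`, `inferType`) is an assumption made for illustration, not the merged code. It also omits the timestamp case from the real matcher.

```scala
import scala.util.Try

object InferSketch {
  // Stand-ins for Spark SQL types, so the example runs without Spark.
  sealed trait SimpleType
  case object LongT extends SimpleType
  case object DoubleT extends SimpleType
  case object BooleanT extends SimpleType
  case object StringT extends SimpleType

  def inferType(value: String, inferSchema: Boolean): SimpleType = {
    if (inferSchema) {
      // Try the narrowest type first and fall back to string.
      value.trim match {
        case v if Try(v.toLong).isSuccess => LongT
        case v if Try(v.toDouble).isSuccess => DoubleT
        case v if Try(v.toBoolean).isSuccess => BooleanT
        case _ => StringT
      }
    } else {
      // Inference disabled: every value stays a string.
      StringT
    }
  }
}
```

In this sketch, “00010” is classified as a long only when `inferSchema` is `true`; with `false` it stays a string and keeps its leading zeros.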

srowen · 2019-06-16T19:47:19Z

README.md

@@ -62,6 +62,9 @@ When reading files the API accepts several options:
    * When it encounters a field of the wrong datatype, it sets the offending field to `null`.
  * `DROPMALFORMED` : ignores the whole corrupted records.
  * `FAILFAST` : throws an exception when it meets corrupted records.
+* `inferSchema`: Trying to infer column data type or not. Default is `true`.


I would just write...

`inferSchema`: if `true`, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If `false`, all resulting columns are of string type.

That's indeed more concise.
I'll update the description.

yjhyjhyjh0 · 2019-06-17T03:05:39Z

Just pushed a new version with:
1- coding style correction (spaces around parens)
2- updated description in README.md

Passes scalastyle, unit tests, and the local integration test.

HyukjinKwon · 2019-06-17T03:16:18Z

Merged. Thanks.

yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch from 53192a6 to b7da2bf Compare June 13, 2019 05:40

Wei-1 suggested changes Jun 13, 2019

yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch from b7da2bf to 30d45a4 Compare June 13, 2019 06:40

HyukjinKwon reviewed Jun 13, 2019

yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch 2 times, most recently from cd77248 to fe9fc19 Compare June 14, 2019 03:16

HyukjinKwon approved these changes Jun 14, 2019

srowen reviewed Jun 16, 2019

Add support of inferring column data type optionally (databricks#393)

308829c

yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch from fe9fc19 to 308829c Compare June 17, 2019 03:00

HyukjinKwon merged commit 0ff88df into databricks:master Jun 17, 2019
