Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support of custom column infer strategy #205 #393

Merged

Conversation

yjhyjhyjh0
Copy link
Contributor

@yjhyjhyjh0 yjhyjhyjh0 commented Jun 13, 2019

Background
Discussion of support all string type column inferring strategy.
Issues : #205

Use case
Want to use column inferring to take advantage of no predefined schema.
But want to treat every data as plain text instead of parsed one.

Ex: value “00010” will be inferred as long type and data becomes 10.
It’s not possible to recover data from 10 to “00010” during processing.

Proposed Solution
Add interface for custom column data type inferring strategy.

Modification
Add interface for column infer strategy.
Add new column infer strategy to treat every column as string.
Add unit test for default, all string infer strategy.

Test result
Pass scalastyle, all unit tests, local integration test.

Please let me know if there is any feedback on this pull request.
Thanks.

@yjhyjhyjh0 yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch from 53192a6 to b7da2bf Compare June 13, 2019 05:40
@codecov-io
Copy link

Codecov Report

Merging #393 into master will decrease coverage by 0.1%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   87.71%   87.61%   -0.11%     
==========================================
  Files          14       15       +1     
  Lines         741      751      +10     
  Branches       94       66      -28     
==========================================
+ Hits          650      658       +8     
- Misses         91       93       +2
Impacted Files Coverage Δ
...la/com/databricks/spark/xml/util/InferSchema.scala 87.59% <100%> (-0.61%) ⬇️
...in/scala/com/databricks/spark/xml/XmlOptions.scala 100% <100%> (ø) ⬆️
.../com/databricks/spark/xml/util/InferStrategy.scala 86.66% <86.66%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da10bc...b7da2bf. Read the comment docs.

@codecov-io
Copy link

codecov-io commented Jun 13, 2019

Codecov Report

Merging #393 into master will increase coverage by 0.06%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
+ Coverage   87.71%   87.78%   +0.06%     
==========================================
  Files          14       14              
  Lines         741      745       +4     
  Branches       94       64      -30     
==========================================
+ Hits          650      654       +4     
  Misses         91       91
Impacted Files Coverage Δ
...in/scala/com/databricks/spark/xml/XmlOptions.scala 100% <100%> (ø) ⬆️
...la/com/databricks/spark/xml/util/InferSchema.scala 88.43% <100%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6da10bc...308829c. Read the comment docs.

README.md Outdated
@@ -62,6 +62,9 @@ When reading files the API accepts several options:
* When it encounters a field of the wrong datatype, it sets the offending field to `null`.
* `DROPMALFORMED` : ignores the whole corrupted records.
* `FAILFAST` : throws an exception when it meets corrupted records.
* `inferStrategy`: The mode for dealing with corrupt records during parsing. Default is `PERMISSIVE`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be:

    Default is `DEFAULT`.

Copy link
Contributor Author

@yjhyjhyjh0 yjhyjhyjh0 Jun 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, you're right.
I'll fix it now.
Thanks.

@yjhyjhyjh0 yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch from b7da2bf to 30d45a4 Compare June 13, 2019 06:40
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have an option called inferSchema which CSV datasource has.

case v if isTimestamp(v) => TimestampType
case _ => StringType
}
options.inferStrategy.getDataType(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't need another object. Let's just do an if-else.

Copy link
Contributor Author

@yjhyjhyjh0 yjhyjhyjh0 Jun 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about the same thing.
Since we only have two option for now, if-else would be enough.
Just add the interface to better explain the use case as the first draft solution.
I'm fine for both solutions.
Please let me know if modification is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it boolean and add an if-else. I have observed the usecases in Spark CSV datasource so far, and I think two cases are good enough.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can't think of another use case:

  • User specifies schema -> use that
  • User asks for inferred schema -> infer types
  • Otherwise use the string rep that's already in the XML

That's how CSV works too, so being consistent and using a boolean here may be fine and less surprising.

So, what happens right now if you don't specify a schema? always inferred?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it always infers ..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for organizing the use cases.
I've pushed a version that uses if-else.

@HyukjinKwon
Copy link
Member

OK, let's add this option but I would name it inferSchema

@yjhyjhyjh0
Copy link
Contributor Author

Hi @HyukjinKwon,
Thanks for the review.
I've modify the option name to “inferSchema”.

@yjhyjhyjh0 yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch 2 times, most recently from cd77248 to fe9fc19 Compare June 14, 2019 03:16
@yjhyjhyjh0
Copy link
Contributor Author

After some discussion of design and implementation.
Just push new version that uses boolean and an if-else to indicate inferring or not.

Pass scalastyle, unit tests and local integration test.

@HyukjinKwon
Copy link
Member

Looks good. I'll merge this if there are no more comments.

case v if isBoolean(v) => BooleanType
case v if isTimestamp(v) => TimestampType
case _ => StringType
if(options.inferSchema){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but can we fix style here? spaces around parens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reminder.
I'll fix it.

README.md Outdated
@@ -62,6 +62,9 @@ When reading files the API accepts several options:
* When it encounters a field of the wrong datatype, it sets the offending field to `null`.
* `DROPMALFORMED` : ignores the whole corrupted records.
* `FAILFAST` : throws an exception when it meets corrupted records.
* `inferSchema`: Trying to infer column data type or not. Default is `true`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just write...

`inferSchema`: if `true`, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If `false`, all resulting columns are of string type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed more concise.
I'll update the description.

@yjhyjhyjh0 yjhyjhyjh0 force-pushed the support_custom_infer_strategy branch from fe9fc19 to 308829c Compare June 17, 2019 03:00
@yjhyjhyjh0
Copy link
Contributor Author

yjhyjhyjh0 commented Jun 17, 2019

Just pushed a new version for
1- coding style correction. (spaces around parens)
2- update description on README.MD

Pass scalastyle, unit tests and local integration test.

@HyukjinKwon HyukjinKwon merged commit 0ff88df into databricks:master Jun 17, 2019
@HyukjinKwon
Copy link
Member

Merged. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants