Add support of custom column infer strategy #205 #393
Conversation
Force-pushed from 53192a6 to b7da2bf.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
- Coverage   87.71%   87.61%    -0.11%
==========================================
  Files          14       15        +1
  Lines         741      751       +10
  Branches       94       66       -28
==========================================
+ Hits          650      658        +8
- Misses         91       93        +2
```
Continue to review full report at Codecov.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
+ Coverage   87.71%   87.78%    +0.06%
==========================================
  Files          14       14
  Lines         741      745        +4
  Branches       94       64       -30
==========================================
+ Hits          650      654        +4
  Misses         91       91
```
Continue to review full report at Codecov.
README.md (Outdated)
```diff
@@ -62,6 +62,9 @@ When reading files the API accepts several options:
 * When it encounters a field of the wrong datatype, it sets the offending field to `null`.
 * `DROPMALFORMED` : ignores the whole corrupted records.
 * `FAILFAST` : throws an exception when it meets corrupted records.
+* `inferStrategy`: The mode for dealing with corrupt records during parsing. Default is `PERMISSIVE`.
```
I think this should be:
Default is `DEFAULT`.
Wow, you're right.
I'll fix it now.
Thanks.
b7da2bf
to
30d45a4
Compare
I think we can have an option called `inferSchema`, which the CSV datasource has.
```diff
-      case v if isTimestamp(v) => TimestampType
-      case _ => StringType
-    }
+    options.inferStrategy.getDataType(value)
```
I think it doesn't need another object. Let's just do an if-else.
I was thinking the same thing.
Since we only have two options for now, an if-else would be enough.
I just added the interface to better explain the use case as a first-draft solution.
I'm fine with either solution.
Please let me know if modification is required.
Let's make it a boolean and add an if-else. I have observed the use cases in the Spark CSV datasource so far, and I think two cases are good enough.
Yeah, I can't think of another use case:
- User specifies schema -> use that
- User asks for inferred schema -> infer types
- Otherwise use the string rep that's already in the XML
That's how CSV works too, so being consistent and using a boolean here may be fine and less surprising.
So, what happens right now if you don't specify a schema? always inferred?
Yeah, it always infers.
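The boolean-driven behavior being discussed can be sketched in plain Scala. This is a hypothetical, simplified stand-in, not spark-xml's actual implementation: `InferSketch`, `inferType`, and the two `is*` helpers are illustrative names, and the real inference covers more types (doubles, timestamps, and so on).

```scala
// Minimal sketch of "inferSchema as a boolean + if-else" (hypothetical names).
object InferSketch {
  // Simplified stand-ins for spark-xml's real type tests.
  private def isLong(v: String): Boolean = v.nonEmpty && v.forall(_.isDigit)
  private def isBoolean(v: String): Boolean = v == "true" || v == "false"

  // If inferSchema is false, every column stays a string type.
  def inferType(value: String, inferSchema: Boolean): String =
    if (inferSchema) {
      value match {
        case v if isLong(v)    => "LongType"
        case v if isBoolean(v) => "BooleanType"
        case _                 => "StringType"
      }
    } else {
      "StringType"
    }
}
```

With `inferSchema = false` every value keeps its textual form, which is the all-string behavior the PR originally proposed via a strategy object.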
Thanks for organizing the use cases.
I've pushed a version that uses if-else.
OK, let's add this option but I would name it
Hi @HyukjinKwon,
Force-pushed from cd77248 to fe9fc19.
After some discussion of design and implementation, this passes scalastyle, unit tests, and a local integration test.
Looks good. I'll merge this if there are no more comments.
```diff
-      case v if isBoolean(v) => BooleanType
-      case v if isTimestamp(v) => TimestampType
-      case _ => StringType
+    if(options.inferSchema){
```
Not a big deal, but can we fix style here? spaces around parens.
Thanks for the reminder.
I'll fix it.
README.md (Outdated)
```diff
@@ -62,6 +62,9 @@ When reading files the API accepts several options:
 * When it encounters a field of the wrong datatype, it sets the offending field to `null`.
 * `DROPMALFORMED` : ignores the whole corrupted records.
 * `FAILFAST` : throws an exception when it meets corrupted records.
+* `inferSchema`: Trying to infer column data type or not. Default is `true`.
```
I would just write...
`inferSchema`: if `true`, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If `false`, all resulting columns are of string type.
That's indeed more concise.
I'll update the description.
Force-pushed from fe9fc19 to 308829c.
Just pushed a new version. It passes scalastyle, unit tests, and a local integration test.
Merged. Thanks.
Background
Discussion of supporting an all-string column inferring strategy.
Issues : #205
Use case
We want to use column inferring to take advantage of not needing a predefined schema,
but treat every value as plain text instead of a parsed one.
Ex: the value “00010” would be inferred as a long type, and the data becomes 10.
It’s not possible to recover the data from 10 back to “00010” during processing.
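The loss described above can be reproduced in plain Scala (a minimal sketch; the value names are just illustrative):

```scala
// Once "00010" is inferred as a long, the leading zeros are lost for good.
val raw = "00010"
val parsed = raw.toLong          // parsed == 10
val rendered = parsed.toString   // "10" -- the original "00010" is unrecoverable
```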
Proposed Solution
Add an interface for a custom column data type inferring strategy.
Modification
Add an interface for the column infer strategy.
Add a new column infer strategy that treats every column as a string.
Add unit tests for the default and all-string infer strategies.
Test result
Passes scalastyle, all unit tests, and a local integration test.
Please let me know if there is any feedback on this pull request.
Thanks.