ARROW-3962: [Go] Accept null values while reading CSV #3129
959b194
to
fe031bb
Compare
}

want := `rec[0]["bool"]: [true]
for _, tc := range []struct {
Changed the tests to a table-driven style to make adding test cases easier.
Codecov Report
@@            Coverage Diff            @@
##           master    #3129     +/-   ##
=========================================
+ Coverage   87.02%   87.02%   +<.01%
=========================================
  Files         495      495
  Lines       69679    69686       +7
=========================================
+ Hits        60640    60647       +7
  Misses       8942     8942
  Partials       97       97
}{
	{
		name: "including various values which doesn't contain null values",
		csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
I think this is better than using csv files in csv/testdata directory for maintainability reasons.
If you don't agree, I'll add csv/testdata/types_with_null.csv and use it. How do you feel about this?
The Godoc example also uses CSV files in the csv/testdata directory. I refactored it in #3131.
fe031bb
to
e36fbe4
Compare
@sbinet Could you review this?
52fc763
to
e0892fa
Compare
sure. I'll give it a go tomorrow (Paris time.)
I've always disliked the automatic, by-default handling of missing values in pandas, and always preferred the "in your face" error handling of Go.
so I am personally a bit reluctant to have the same behaviour in Go-Arrow, although, here, it's easier to figure out this was a missing value and not a valid zero value.
this behaviour could be activated with a WithNulls (or some other name) option, of course.
what do others think? @stuartcarnie @alexandreyc ? @wesm ?
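A minimal sketch of the opt-in idea floated above, assuming a functional-option style. The reader type, newReader, parseFloat, and the WithNulls option are all hypothetical stand-ins for illustration, not the Go-Arrow API: by default an empty cell is an error, and nulls are only accepted when the caller asks for them.

```go
package main

import (
	"fmt"
	"strconv"
)

// reader is a hypothetical, stripped-down stand-in for a CSV reader.
type reader struct {
	acceptNulls bool // false by default: empty cells are errors
}

// Option configures a reader (hypothetical type for this sketch).
type Option func(*reader)

// WithNulls opts in to treating empty cells as null values.
// The name comes from the review comment; it is an assumption, not a real API.
func WithNulls() Option {
	return func(r *reader) { r.acceptNulls = true }
}

func newReader(opts ...Option) *reader {
	r := &reader{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// parseFloat returns the parsed value, whether the cell is null, and any error.
func (r *reader) parseFloat(cell string) (float64, bool, error) {
	if cell == "" && r.acceptNulls {
		return 0, true, nil
	}
	v, err := strconv.ParseFloat(cell, 64)
	return v, false, err
}

func main() {
	strict := newReader()
	_, _, err := strict.parseFloat("")
	fmt.Println("strict reader, empty cell, error:", err)

	lenient := newReader(WithNulls())
	_, isNull, err := lenient.parseFloat("")
	fmt.Println("lenient reader, empty cell, null:", isNull, "error:", err)
}
```

With this shape the "fail early and loudly" behaviour stays the default, and dataset cleaning is an explicit choice at construction time.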
go/arrow/csv/csv_test.go
Outdated
	want string
}{
	{
		name: "including various values which doesn't contain null values",
that's a bit of a mouthful.
also, it makes it harder to selectively run or disable a given set of sub-tests.
could we come up with a shorter yet descriptive name?
perhaps something along the lines of without-null-values?
go/arrow/csv/csv_test.go
Outdated
	{
		name: "including various values which doesn't contain null values",
		csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string
"supported types" is also mentioned on the line above.
perhaps rephrase to remove the stuttering?
go/arrow/csv/csv_test.go
Outdated
`,
	},
	{
		name: "including null values",
s/including null values/with-null-values/ ?
go/arrow/csv/csv_test.go
Outdated
	{
		name: "including null values",
		csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string
ditto.
@sbinet are you asking about the default behavior for handling
yes. I think I am asking for the default behaviour of the CSV reader to be to fail early and loudly when encountering missing values. I'd argue that dataset cleaning shouldn't be coupled to, nor baked into, the CSV reader.
go/arrow/csv/csv.go
Outdated
@@ -237,6 +238,11 @@ func (r *Reader) validate(recs []string) {

func (r *Reader) read(recs []string) {
	for i, str := range recs {
		if str == "" {
			r.bld.Field(i).AppendNull()
actually, there's an issue here: how do we make the distinction b/w a null value for a string column and a valid "" value for a string column?
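One possible way to keep the two cases apart, sketched under assumptions: string columns only become null when the cell matches an explicit marker, so an empty cell stays a valid "" value. The parseStringCell helper and the "NULL" marker are hypothetical and are not necessarily what the follow-up fix in this PR did.

```go
package main

import "fmt"

// parseStringCell returns the cell's value and whether it should be stored as null.
// nullMarker is a hypothetical explicit marker (for example "NULL"); an empty
// cell is deliberately NOT treated as null, so "" remains a valid string value.
func parseStringCell(cell, nullMarker string) (value string, isNull bool) {
	if cell == nullMarker {
		return "", true
	}
	return cell, false
}

func main() {
	for _, cell := range []string{"abc", "", "NULL"} {
		v, isNull := parseStringCell(cell, "NULL")
		fmt.Printf("cell %q -> value %q, null=%v\n", cell, v, isNull)
	}
}
```

The trade-off is that a string column can then never contain the literal marker text as a real value, which is one reason the marker would ideally be configurable.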
thanks. I fixed this.
11c7ca9
to
9724028
Compare
It seems fine to not recognize null values by default, as long as you have the option to provide a global list (in case the same types of markers are used in many columns) or a per column list.
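The global-list versus per-column-list idea could look roughly like the sketch below. The nullConfig type, its fields, and the column names are hypothetical; the rule that a per-column list replaces the global list for that column is an assumption of this sketch, not something settled in the discussion.

```go
package main

import "fmt"

// nullConfig is a hypothetical configuration holding null markers.
type nullConfig struct {
	global    map[string]bool            // markers applied to every column
	perColumn map[string]map[string]bool // markers that override global for one column
}

// isNull reports whether cell should be read as null in the given column.
// Assumption: when a column has its own list, the global list is ignored.
func (c nullConfig) isNull(column, cell string) bool {
	if markers, ok := c.perColumn[column]; ok {
		return markers[cell]
	}
	return c.global[cell]
}

func main() {
	cfg := nullConfig{
		global:    map[string]bool{"": true, "NULL": true},
		perColumn: map[string]map[string]bool{"name": {"N/A": true}},
	}
	fmt.Println(cfg.isNull("price", ""))   // global list applies
	fmt.Println(cfg.isNull("name", ""))    // per-column list applies; "" is a valid name
	fmt.Println(cfg.isNull("name", "N/A")) // per-column marker
}
```

A zero-value nullConfig recognizes no nulls at all, which matches the "do not recognize null values by default" position.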
9724028
to
40fc29b
Compare
@wesm I'd prefer this modus operandi, indeed.
For the record, the C++ CSV reader automatically recognizes null values in most data types. That is, empty values, but also a bunch of conventional "null" notations listed here: It may be better to have similar characteristics from one Arrow implementation to another.
@pitrou I'd be in favor of reducing the defaults in the C++ implementation to a much more conservative set (e.g. "null", "NULL", and empty strings)
Making the set of null values configurable will require some work to keep the implementation performant, though. The current solution covers a wide range of values with a simple and performant implementation.
so... it seems the consensus is:
right?
In my opinion, it is not strange for the CSV reader to accept null values by default. At the least, I feel it's better to have the same behavior across implementations.
It would not be unreasonable to recognize a conservative list of null markers by default, like "null" and "NULL". There is also the question of empty cells in columns that contain numeric data (with string columns, you probably want to distinguish empty string vs. null)
ping @c-bata what's the status on this?
Hi @sbinet. Thank you for the reminder. I'm giving up on this one as it does not seem to be reaching a consensus. Thank you to the reviewers 🙇 (P.S. Good job implementing an IPC protocol in Go! That's really needed.)
This patch-set extends the support for nulls in Arrow's CSV reader and writer. The string used for nulls is added as an option for the Reader and Writer structs.

Also included in the PR is a commit that refactors the test code to avoid repetition between with-header and non-header tests. If preferred, I can create a separate PR and corresponding JIRA.

See #3129 for a prior (abandoned) attempt at this, but with useful discussion. This current PR explicitly passes the string for null values to the reader or writer. We could generalize the reader further by using a slice of valid NULLs; this is roughly what the C++ interface provides from what I can tell.

Closes #4346 from briangold/csvnull and squashes the following commits:

e9f4b2c <Micah Kornfield> Address PR feedback on naming
1722d26 <Brian Gold> Improved null handling for CSV reading
3323672 <Brian Gold> Added CSV support for reading/writing NULLs
3c593c9 <Brian Gold> Refactored CSV tests

Lead-authored-by: Brian Gold <bgold@purestorage.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Sebastien Binet <binet@cern.ch>
Summary
Fix a bug where the CSV reader could not accept null values.
How to reproduce
Create the following CSV file:
Then run the example function in csv_test.go, which gives the following results.
The reader stops because csv.Reader gets an error while parsing an empty string as a float64 (the error message is strconv.ParseFloat: parsing "": invalid syntax).
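The error itself can be reproduced with the standard library alone. The sketch below is not the Arrow reader; parseFloatField is a hypothetical helper that passes the raw cell text to strconv.ParseFloat, which is the call named in the error message.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFloatField mimics a strict float64 column: the raw cell text is
// handed to strconv.ParseFloat with no special case for empty cells.
func parseFloatField(cell string) (float64, error) {
	return strconv.ParseFloat(cell, 64)
}

func main() {
	for _, cell := range []string{"1.5", ""} {
		v, err := parseFloatField(cell)
		if err != nil {
			// For the empty cell this prints:
			// strconv.ParseFloat: parsing "": invalid syntax
			fmt.Printf("cell %q: error: %v\n", cell, err)
			continue
		}
		fmt.Printf("cell %q: %v\n", cell, v)
	}
}
```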