ARROW-3962: [Go] Accept null values while reading CSV #3129

c-bata · 2018-12-08T11:33:13Z

Summary

Fix the bug of CSV reader couldn't accept null values.

How to reproduce

Create a following CSV file:

0;0;str-0
1;1;str-1
;2;str-2
3;3;str-3
4;;str-4
5;5;str-5
6;6;
7;7;str-7
8;8;str-8
9;9;str-9

After that run example function in csv_test.go, got following results.

rec[0]["i64"]: [0]
rec[1]["f64"]: [0]
rec[2]["str"]: ["str-0"]
rec[0]["i64"]: [1]
rec[1]["f64"]: [0]
rec[2]["str"]: ["str-1"]

The reason why stopping is csv.Reader got error while parsing empty string as a float64 (the error message is strconv.ParseFloat: parsing "": invalid syntax).

Link

Jira: https://issues.apache.org/jira/browse/ARROW-3962

c-bata · 2018-12-08T12:54:48Z

go/arrow/csv/csv_test.go

-	}
-
-	want := `rec[0]["bool"]: [true]
+	for _, tc := range []struct {


Change table-driven-tests style for adding tests.

codecov-io · 2018-12-08T12:56:12Z

Codecov Report

Merging #3129 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3129      +/-   ##
==========================================
+ Coverage   87.02%   87.02%   +<.01%     
==========================================
  Files         495      495              
  Lines       69679    69686       +7     
==========================================
+ Hits        60640    60647       +7     
  Misses       8942     8942              
  Partials       97       97

Impacted Files	Coverage Δ
go/arrow/csv/csv.go	`82.11% <100%> (+0.59%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0ac97f...40fc29b. Read the comment docs.

c-bata · 2018-12-08T13:03:22Z

go/arrow/csv/csv_test.go

+	}{
+		{
+			name: "including various values which doesn't contain null values",
+			csv: bytes.NewBufferString(`## a simple set of data which contains all supported types


I think this is better than using csv files in csv/testdata directory for maintainability reasons.

If you don't agree, I'll add csv/testdata/types_with_null.csv and use it. How do you feel about this?

Godoc example also uses csv files in csv/testdata directory. I refactor it at #3131 .

c-bata · 2018-12-10T09:27:13Z

@sbinet Could you review this?

sbinet · 2018-12-10T17:38:31Z

sure. I'll give it a go tomorrow (Paris time.)

sbinet

I've always disliked the automatic, by default, handling of missing values of pandas.
and always prefered the "in your face" handling errors of Go.

so I am personally a bit reluctant to have the same behaviour in Go-Arrow, although, here, it's easier to figure out this was a missing value and not a valid zero value.

this behaviour could be activated with a WithNulls (or some other name) option, of course.

what do others think? @stuartcarnie @alexandreyc ? @wesm ?

sbinet · 2018-12-11T09:00:41Z

go/arrow/csv/csv_test.go

+		want    string
+	}{
+		{
+			name: "including various values which doesn't contain null values",


that's a bit of a mouthful.
also, it makes it harder to selectively select or disable a given set of sub-tests.

could we come up with a shorter yet descriptive name?
perhaps something along the lines of without-null-values?

sbinet · 2018-12-11T09:01:56Z

go/arrow/csv/csv_test.go

+		{
+			name: "including various values which doesn't contain null values",
+			csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
+## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string


supported types is also mentioned on the line above.
perhaps rephrase to remove the stuttering?

sbinet · 2018-12-11T09:02:34Z

go/arrow/csv/csv_test.go

+`,
+		},
+		{
+			name: "including null values",


s/including null values/with-null-values/ ?

sbinet · 2018-12-11T09:02:53Z

go/arrow/csv/csv_test.go

+		{
+			name: "including null values",
+			csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
+## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string


stuartcarnie · 2018-12-11T21:44:14Z

@sbinet are you asking about the default behavior for handling null in the CSV reader?

sbinet · 2018-12-12T08:33:49Z

yes. I think I am asking for the default behaviour of the CSV reader to be to fail early and loudly when encountering missing values.
but still have an optional config-func (say, WithNulls() or something) to enable adding null values when encountering missing values.

I'd argue that dataset cleaning shouldn't be coupled to nor baked in the CSV reader.

sbinet · 2018-12-12T08:34:14Z

go/arrow/csv/csv.go

@@ -237,6 +238,11 @@ func (r *Reader) validate(recs []string) {

 func (r *Reader) read(recs []string) {
 	for i, str := range recs {
+		if str == "" {
+			r.bld.Field(i).AppendNull()


actually, there's an issue here: how do we make the distinction b/w a null value for a string column and a valid "" value for a string column?

thanks. I fixed this.

wesm · 2018-12-12T14:04:20Z

It seems fine to not recognize null values by default, as long as you have the option to provide a global list (in case the same types of markers are used in many columns) or a per column list

sbinet · 2018-12-12T16:26:01Z

@wesm I'd prefer this modus operandi, indeed.

pitrou · 2018-12-12T18:35:59Z

For the record, the C++ CSV reader automatically recognizes null values in most data types. That is empty values, but also a bunch of conventional "null" notations listed here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter-test.cc#L39

It may be better to have similar characteristics from one Arrow implementation to another.

wesm · 2018-12-12T19:35:32Z

@pitrou I'd be in favor of reducing the defaults in the C++ implementation to a much more conservative set (eg "null", "NULL", and empty strings)

pitrou · 2018-12-12T19:43:57Z

Making the set of null values configurable will require some work to keep the implementation performant, though. The current solution covers a wide range of values with a simple and performant implementation.

sbinet · 2018-12-19T09:52:44Z

so... it seems the consensus is:

as a default behaviour, fail when encountering null or missing values
have an opt-in switch to enable the handling of missing values (which could itself be enabled for a (sub)set of columns.)

right?

c-bata · 2018-12-19T12:21:59Z

In my opinion, the behavior that CSV reader accepting null values as default is not weird. At least, I feel it's better to take the same behavior between each implementations.

wesm · 2018-12-19T16:02:39Z

It would not be unreasonable to recognize a conservative list of null markers by default, like "null" and "NULL". There is also the question of empty cells in columns that contain numeric data (with string columns, you probably want to distinguish empty string vs. null)

sbinet · 2019-05-17T07:38:41Z

ping @c-bata

what's the status on this?

c-bata · 2019-05-17T07:49:59Z

Hi @sbinet . Thank you for your remind. I give up on this one as it does not seem to reach a consensus. Thank you for reviewers 🙇

(P.S. Good job to implement an IPC protocol in Go! That's really needed.)

This patch-set extends the support for nulls in Arrow's CSV reader and writer. The string used for nulls is added as an option for the Reader and Writer structs. Also included in the PR is a commit that refactors the test code to avoid repetition between with-header and non-header tests. If preferred, I can create a separate PR and corresponding JIRA. See #3129 for a prior (abandoned) attempt at this, but with useful discussion. This current PR explicitly passes the string for null values to the reader or writer. We could generalize the reader further by using a slice of valid NULLs; this is roughly what the C++ interface provides from what I can tell. Closes #4346 from briangold/csvnull and squashes the following commits: e9f4b2c <Micah Kornfield> Address PR feedback on naming 1722d26 <Brian Gold> Improved null handling for CSV reading 3323672 <Brian Gold> Added CSV support for reading/writing NULLs 3c593c9 <Brian Gold> Refactored CSV tests Lead-authored-by: Brian Gold <bgold@purestorage.com> Co-authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Sebastien Binet <binet@cern.ch>

c-bata changed the title ~~ARROW-3962(Go): Consider null value while reading CSV~~ ARROW-3962: [Go] Consider null value while reading CSV Dec 8, 2018

c-bata force-pushed the nullable-csv-reader branch from 959b194 to fe031bb Compare December 8, 2018 12:54

c-bata commented Dec 8, 2018

View reviewed changes

c-bata changed the title ~~ARROW-3962: [Go] Consider null value while reading CSV~~ ARROW-3962: [Go] Support null value while reading CSV Dec 8, 2018

c-bata force-pushed the nullable-csv-reader branch from fe031bb to e36fbe4 Compare December 10, 2018 03:57

c-bata force-pushed the nullable-csv-reader branch 2 times, most recently from 52fc763 to e0892fa Compare December 10, 2018 09:31

c-bata changed the title ~~ARROW-3962: [Go] Support null value while reading CSV~~ ARROW-3962: [Go] Accept null values while reading CSV Dec 10, 2018

sbinet requested changes Dec 11, 2018

View reviewed changes

sbinet reviewed Dec 12, 2018

View reviewed changes

c-bata force-pushed the nullable-csv-reader branch from 11c7ca9 to 9724028 Compare December 12, 2018 10:47

Accept null values while reading CSV

40fc29b

c-bata force-pushed the nullable-csv-reader branch from 9724028 to 40fc29b Compare December 12, 2018 14:08

wesm force-pushed the master branch from 3088183 to 0c6b2d2 Compare February 18, 2019 19:34

c-bata closed this May 17, 2019

c-bata deleted the nullable-csv-reader branch May 17, 2019 07:50

briangold mentioned this pull request May 20, 2019

ARROW-3962: [Go] Handle null values in CSV #4346

Closed

asfimport mentioned this pull request Nov 27, 2019

[Go] Support null values while reading a CSV file. #20566

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-3962: [Go] Accept null values while reading CSV #3129

ARROW-3962: [Go] Accept null values while reading CSV #3129

c-bata commented Dec 8, 2018 •

edited

Loading

c-bata Dec 8, 2018

codecov-io commented Dec 8, 2018 •

edited

Loading

c-bata Dec 8, 2018 •

edited

Loading

c-bata Dec 8, 2018

c-bata commented Dec 10, 2018

sbinet commented Dec 10, 2018

sbinet left a comment

sbinet Dec 11, 2018

sbinet Dec 11, 2018

sbinet Dec 11, 2018

sbinet Dec 11, 2018

stuartcarnie commented Dec 11, 2018

sbinet commented Dec 12, 2018

sbinet Dec 12, 2018

c-bata Dec 12, 2018

wesm commented Dec 12, 2018

sbinet commented Dec 12, 2018

pitrou commented Dec 12, 2018

wesm commented Dec 12, 2018

pitrou commented Dec 12, 2018

sbinet commented Dec 19, 2018

c-bata commented Dec 19, 2018 •

edited

Loading

wesm commented Dec 19, 2018

sbinet commented May 17, 2019

c-bata commented May 17, 2019 •

edited

Loading

ARROW-3962: [Go] Accept null values while reading CSV #3129

ARROW-3962: [Go] Accept null values while reading CSV #3129

Conversation

c-bata commented Dec 8, 2018 • edited Loading

Summary

How to reproduce

Link

Choose a reason for hiding this comment

codecov-io commented Dec 8, 2018 • edited Loading

Codecov Report

c-bata Dec 8, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

c-bata commented Dec 10, 2018

sbinet commented Dec 10, 2018

sbinet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stuartcarnie commented Dec 11, 2018

sbinet commented Dec 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wesm commented Dec 12, 2018

sbinet commented Dec 12, 2018

pitrou commented Dec 12, 2018

wesm commented Dec 12, 2018

pitrou commented Dec 12, 2018

sbinet commented Dec 19, 2018

c-bata commented Dec 19, 2018 • edited Loading

wesm commented Dec 19, 2018

sbinet commented May 17, 2019

c-bata commented May 17, 2019 • edited Loading

c-bata commented Dec 8, 2018 •

edited

Loading

codecov-io commented Dec 8, 2018 •

edited

Loading

c-bata Dec 8, 2018 •

edited

Loading

c-bata commented Dec 19, 2018 •

edited

Loading

c-bata commented May 17, 2019 •

edited

Loading