New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3962: [Go] Accept null values while reading CSV #3129

Open
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
6 participants
@c-bata
Copy link
Contributor

c-bata commented Dec 8, 2018

Summary

Fix the bug of CSV reader couldn't accept null values.

How to reproduce

Create a following CSV file:

0;0;str-0
1;1;str-1
;2;str-2
3;3;str-3
4;;str-4
5;5;str-5
6;6;
7;7;str-7
8;8;str-8
9;9;str-9

After that run example function in csv_test.go, got following results.

rec[0]["i64"]: [0]
rec[1]["f64"]: [0]
rec[2]["str"]: ["str-0"]
rec[0]["i64"]: [1]
rec[1]["f64"]: [0]
rec[2]["str"]: ["str-1"]

The reason why stopping is csv.Reader got error while parsing empty string as a float64 (the error message is strconv.ParseFloat: parsing "": invalid syntax).

Link

@c-bata c-bata changed the title ARROW-3962(Go): Consider null value while reading CSV ARROW-3962: [Go] Consider null value while reading CSV Dec 8, 2018

@c-bata c-bata force-pushed the c-bata:nullable-csv-reader branch from 959b194 to fe031bb Dec 8, 2018

}

want := `rec[0]["bool"]: [true]
for _, tc := range []struct {

This comment has been minimized.

@c-bata

c-bata Dec 8, 2018

Contributor

Change table-driven-tests style for adding tests.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Dec 8, 2018

Codecov Report

Merging #3129 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3129      +/-   ##
==========================================
+ Coverage   87.02%   87.02%   +<.01%     
==========================================
  Files         495      495              
  Lines       69679    69686       +7     
==========================================
+ Hits        60640    60647       +7     
  Misses       8942     8942              
  Partials       97       97
Impacted Files Coverage Δ
go/arrow/csv/csv.go 82.11% <100%> (+0.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0ac97f...40fc29b. Read the comment docs.

}{
{
name: "including various values which doesn't contain null values",
csv: bytes.NewBufferString(`## a simple set of data which contains all supported types

This comment has been minimized.

@c-bata

c-bata Dec 8, 2018

Contributor

I think this is better than using csv files in csv/testdata directory for maintainability reasons.

If you don't agree, I'll add csv/testdata/types_with_null.csv and use it. How do you feel about this?

This comment has been minimized.

@c-bata

c-bata Dec 8, 2018

Contributor

Godoc example also uses csv files in csv/testdata directory. I refactor it at #3131 .

@c-bata c-bata changed the title ARROW-3962: [Go] Consider null value while reading CSV ARROW-3962: [Go] Support null value while reading CSV Dec 8, 2018

@c-bata c-bata force-pushed the c-bata:nullable-csv-reader branch from fe031bb to e36fbe4 Dec 10, 2018

@c-bata

This comment has been minimized.

Copy link
Contributor

c-bata commented Dec 10, 2018

@sbinet Could you review this?

@c-bata c-bata force-pushed the c-bata:nullable-csv-reader branch 2 times, most recently from 52fc763 to e0892fa Dec 10, 2018

@c-bata c-bata changed the title ARROW-3962: [Go] Support null value while reading CSV ARROW-3962: [Go] Accept null values while reading CSV Dec 10, 2018

@sbinet

This comment has been minimized.

Copy link
Contributor

sbinet commented Dec 10, 2018

sure. I'll give it a go tomorrow (Paris time.)

@sbinet
Copy link
Contributor

sbinet left a comment

I've always disliked the automatic, by default, handling of missing values of pandas.
and always prefered the "in your face" handling errors of Go.

so I am personally a bit reluctant to have the same behaviour in Go-Arrow, although, here, it's easier to figure out this was a missing value and not a valid zero value.

this behaviour could be activated with a WithNulls (or some other name) option, of course.

what do others think? @stuartcarnie @alexandreyc ? @wesm ?

want string
}{
{
name: "including various values which doesn't contain null values",

This comment has been minimized.

@sbinet

sbinet Dec 11, 2018

Contributor

that's a bit of a mouthful.
also, it makes it harder to selectively select or disable a given set of sub-tests.

could we come up with a shorter yet descriptive name?
perhaps something along the lines of without-null-values?

{
name: "including various values which doesn't contain null values",
csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string

This comment has been minimized.

@sbinet

sbinet Dec 11, 2018

Contributor

supported types is also mentioned on the line above.
perhaps rephrase to remove the stuttering?

`,
},
{
name: "including null values",

This comment has been minimized.

@sbinet

sbinet Dec 11, 2018

Contributor

s/including null values/with-null-values/ ?

{
name: "including null values",
csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string

This comment has been minimized.

@sbinet

sbinet Dec 11, 2018

Contributor

ditto.

@stuartcarnie

This comment has been minimized.

Copy link
Contributor

stuartcarnie commented Dec 11, 2018

@sbinet are you asking about the default behavior for handling null in the CSV reader?

@sbinet

This comment has been minimized.

Copy link
Contributor

sbinet commented Dec 12, 2018

yes. I think I am asking for the default behaviour of the CSV reader to be to fail early and loudly when encountering missing values.
but still have an optional config-func (say, WithNulls() or something) to enable adding null values when encountering missing values.

I'd argue that dataset cleaning shouldn't be coupled to nor baked in the CSV reader.

@@ -237,6 +238,11 @@ func (r *Reader) validate(recs []string) {

func (r *Reader) read(recs []string) {
for i, str := range recs {
if str == "" {
r.bld.Field(i).AppendNull()

This comment has been minimized.

@sbinet

sbinet Dec 12, 2018

Contributor

actually, there's an issue here: how do we make the distinction b/w a null value for a string column and a valid "" value for a string column?

This comment has been minimized.

@c-bata

c-bata Dec 12, 2018

Contributor

thanks. I fixed this.

@c-bata c-bata force-pushed the c-bata:nullable-csv-reader branch from 11c7ca9 to 9724028 Dec 12, 2018

@wesm

This comment has been minimized.

Copy link
Member

wesm commented Dec 12, 2018

It seems fine to not recognize null values by default, as long as you have the option to provide a global list (in case the same types of markers are used in many columns) or a per column list

@c-bata c-bata force-pushed the c-bata:nullable-csv-reader branch from 9724028 to 40fc29b Dec 12, 2018

@sbinet

This comment has been minimized.

Copy link
Contributor

sbinet commented Dec 12, 2018

@wesm I'd prefer this modus operandi, indeed.

@pitrou

This comment has been minimized.

Copy link
Contributor

pitrou commented Dec 12, 2018

For the record, the C++ CSV reader automatically recognizes null values in most data types. That is empty values, but also a bunch of conventional "null" notations listed here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter-test.cc#L39

It may be better to have similar characteristics from one Arrow implementation to another.

@wesm

This comment has been minimized.

Copy link
Member

wesm commented Dec 12, 2018

@pitrou I'd be in favor of reducing the defaults in the C++ implementation to a much more conservative set (eg "null", "NULL", and empty strings)

@pitrou

This comment has been minimized.

Copy link
Contributor

pitrou commented Dec 12, 2018

Making the set of null values configurable will require some work to keep the implementation performant, though. The current solution covers a wide range of values with a simple and performant implementation.

@sbinet

This comment has been minimized.

Copy link
Contributor

sbinet commented Dec 19, 2018

so... it seems the consensus is:

  • as a default behaviour, fail when encountering null or missing values
  • have an opt-in switch to enable the handling of missing values (which could itself be enabled for a (sub)set of columns.)

right?

@c-bata

This comment has been minimized.

Copy link
Contributor

c-bata commented Dec 19, 2018

In my opinion, the behavior that CSV reader accepting null values as default is not weird. At least, I feel it's better to take the same behavior between each implementations.

@wesm

This comment has been minimized.

Copy link
Member

wesm commented Dec 19, 2018

It would not be unreasonable to recognize a conservative list of null markers by default, like "null" and "NULL". There is also the question of empty cells in columns that contain numeric data (with string columns, you probably want to distinguish empty string vs. null)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment