Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3962: [Go] Accept null values while reading CSV #3129

Closed
wants to merge 1 commit into from

Conversation

c-bata
Copy link
Contributor

@c-bata c-bata commented Dec 8, 2018

Summary

Fix the bug of CSV reader couldn't accept null values.

How to reproduce

Create a following CSV file:

0;0;str-0
1;1;str-1
;2;str-2
3;3;str-3
4;;str-4
5;5;str-5
6;6;
7;7;str-7
8;8;str-8
9;9;str-9

After that run example function in csv_test.go, got following results.

rec[0]["i64"]: [0]
rec[1]["f64"]: [0]
rec[2]["str"]: ["str-0"]
rec[0]["i64"]: [1]
rec[1]["f64"]: [0]
rec[2]["str"]: ["str-1"]

The reason why stopping is csv.Reader got error while parsing empty string as a float64 (the error message is strconv.ParseFloat: parsing "": invalid syntax).

Link

@c-bata c-bata changed the title ARROW-3962(Go): Consider null value while reading CSV ARROW-3962: [Go] Consider null value while reading CSV Dec 8, 2018
}

want := `rec[0]["bool"]: [true]
for _, tc := range []struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change table-driven-tests style for adding tests.

@codecov-io
Copy link

codecov-io commented Dec 8, 2018

Codecov Report

Merging #3129 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3129      +/-   ##
==========================================
+ Coverage   87.02%   87.02%   +<.01%     
==========================================
  Files         495      495              
  Lines       69679    69686       +7     
==========================================
+ Hits        60640    60647       +7     
  Misses       8942     8942              
  Partials       97       97
Impacted Files Coverage Δ
go/arrow/csv/csv.go 82.11% <100%> (+0.59%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0ac97f...40fc29b. Read the comment docs.

}{
{
name: "including various values which doesn't contain null values",
csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
Copy link
Contributor Author

@c-bata c-bata Dec 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is better than using csv files in csv/testdata directory for maintainability reasons.

If you don't agree, I'll add csv/testdata/types_with_null.csv and use it. How do you feel about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godoc example also uses csv files in csv/testdata directory. I refactor it at #3131 .

@c-bata c-bata changed the title ARROW-3962: [Go] Consider null value while reading CSV ARROW-3962: [Go] Support null value while reading CSV Dec 8, 2018
@c-bata
Copy link
Contributor Author

c-bata commented Dec 10, 2018

@sbinet Could you review this?

@c-bata c-bata force-pushed the nullable-csv-reader branch 2 times, most recently from 52fc763 to e0892fa Compare December 10, 2018 09:31
@c-bata c-bata changed the title ARROW-3962: [Go] Support null value while reading CSV ARROW-3962: [Go] Accept null values while reading CSV Dec 10, 2018
@sbinet
Copy link
Contributor

sbinet commented Dec 10, 2018

sure. I'll give it a go tomorrow (Paris time.)

Copy link
Contributor

@sbinet sbinet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always disliked the automatic, by default, handling of missing values of pandas.
and always prefered the "in your face" handling errors of Go.

so I am personally a bit reluctant to have the same behaviour in Go-Arrow, although, here, it's easier to figure out this was a missing value and not a valid zero value.

this behaviour could be activated with a WithNulls (or some other name) option, of course.

what do others think? @stuartcarnie @alexandreyc ? @wesm ?

want string
}{
{
name: "including various values which doesn't contain null values",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a bit of a mouthful.
also, it makes it harder to selectively select or disable a given set of sub-tests.

could we come up with a shorter yet descriptive name?
perhaps something along the lines of without-null-values?

{
name: "including various values which doesn't contain null values",
csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supported types is also mentioned on the line above.
perhaps rephrase to remove the stuttering?

`,
},
{
name: "including null values",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/including null values/with-null-values/ ?

{
name: "including null values",
csv: bytes.NewBufferString(`## a simple set of data which contains all supported types
## supported types: bool;int8;int16;int32;int64;uint8;uint16;uint32;uint64;float32;float64;string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@stuartcarnie
Copy link
Contributor

@sbinet are you asking about the default behavior for handling null in the CSV reader?

@sbinet
Copy link
Contributor

sbinet commented Dec 12, 2018

yes. I think I am asking for the default behaviour of the CSV reader to be to fail early and loudly when encountering missing values.
but still have an optional config-func (say, WithNulls() or something) to enable adding null values when encountering missing values.

I'd argue that dataset cleaning shouldn't be coupled to nor baked in the CSV reader.

@@ -237,6 +238,11 @@ func (r *Reader) validate(recs []string) {

func (r *Reader) read(recs []string) {
for i, str := range recs {
if str == "" {
r.bld.Field(i).AppendNull()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, there's an issue here: how do we make the distinction b/w a null value for a string column and a valid "" value for a string column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I fixed this.

@wesm
Copy link
Member

wesm commented Dec 12, 2018

It seems fine to not recognize null values by default, as long as you have the option to provide a global list (in case the same types of markers are used in many columns) or a per column list

@sbinet
Copy link
Contributor

sbinet commented Dec 12, 2018

@wesm I'd prefer this modus operandi, indeed.

@pitrou
Copy link
Member

pitrou commented Dec 12, 2018

For the record, the C++ CSV reader automatically recognizes null values in most data types. That is empty values, but also a bunch of conventional "null" notations listed here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter-test.cc#L39

It may be better to have similar characteristics from one Arrow implementation to another.

@wesm
Copy link
Member

wesm commented Dec 12, 2018

@pitrou I'd be in favor of reducing the defaults in the C++ implementation to a much more conservative set (eg "null", "NULL", and empty strings)

@pitrou
Copy link
Member

pitrou commented Dec 12, 2018

Making the set of null values configurable will require some work to keep the implementation performant, though. The current solution covers a wide range of values with a simple and performant implementation.

@sbinet
Copy link
Contributor

sbinet commented Dec 19, 2018

so... it seems the consensus is:

  • as a default behaviour, fail when encountering null or missing values
  • have an opt-in switch to enable the handling of missing values (which could itself be enabled for a (sub)set of columns.)

right?

@c-bata
Copy link
Contributor Author

c-bata commented Dec 19, 2018

In my opinion, the behavior that CSV reader accepting null values as default is not weird. At least, I feel it's better to take the same behavior between each implementations.

@wesm
Copy link
Member

wesm commented Dec 19, 2018

It would not be unreasonable to recognize a conservative list of null markers by default, like "null" and "NULL". There is also the question of empty cells in columns that contain numeric data (with string columns, you probably want to distinguish empty string vs. null)

@sbinet
Copy link
Contributor

sbinet commented May 17, 2019

ping @c-bata

what's the status on this?

@c-bata
Copy link
Contributor Author

c-bata commented May 17, 2019

Hi @sbinet . Thank you for your remind. I give up on this one as it does not seem to reach a consensus. Thank you for reviewers 🙇

(P.S. Good job to implement an IPC protocol in Go! That's really needed.)

@c-bata c-bata closed this May 17, 2019
@c-bata c-bata deleted the nullable-csv-reader branch May 17, 2019 07:50
sbinet pushed a commit that referenced this pull request Nov 27, 2019
This patch-set extends the support for nulls in Arrow's CSV reader and writer. The string used for nulls is added as an option for the Reader and Writer structs.

Also included in the PR is a commit that refactors the test code to avoid repetition between with-header and non-header tests. If preferred, I can create a separate PR and corresponding JIRA.

See #3129 for a prior (abandoned) attempt at this, but with useful discussion. This current PR explicitly passes the string for null values to the reader or writer. We could generalize the reader further by using a slice of valid NULLs; this is roughly what the C++ interface provides from what I can tell.

Closes #4346 from briangold/csvnull and squashes the following commits:

e9f4b2c <Micah Kornfield> Address PR feedback on naming
1722d26 <Brian Gold> Improved null handling for CSV reading
3323672 <Brian Gold> Added CSV support for reading/writing NULLs
3c593c9 <Brian Gold> Refactored CSV tests

Lead-authored-by: Brian Gold <bgold@purestorage.com>
Co-authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Sebastien Binet <binet@cern.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants