ARROW-3726: [Rust] Add CSV reader with example by andygrove · Pull Request #2992 · apache/arrow

andygrove · 2018-11-19T15:09:29Z

This adds a CSV reader and an example that accesses the loaded data by downcasting arrays to specific types. The CSV reader supports all primitive types + string (List<u8>).

andygrove · 2018-11-19T15:58:57Z

@paddyhoran @sunchao I would appreciate a review of this. The code in the example for reading strings is ugly and requires calling unsafe code. I don't know if I'm missing something or whether we need to improve the API here?

paddyhoran · 2018-11-19T17:00:19Z

rust/examples/read_csv.rs

+    let lng = batch
+        .column(2)
+        .as_any()
+        .downcast_ref::<PrimitiveArray<f64>>()


Are we sure that the alignment requirements are met if we simply downcast from Any? Alignment is checked by our existing constructors but would not be checked if we downcast, right?

I have no idea to be honest. I am definitely looking for guidance here on how to access these arrays after they are built.

Actually, I think it's fine as you are using our builders internally. I'll take a closer look when I get a chance.

Thanks @andygrove
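For readers following this thread, here is a minimal std-only sketch of the downcasting pattern under discussion. `Array`, `Float64Column` and `Int64Column` are hypothetical stand-ins, not the crate's real types. The point it illustrates is that `downcast_ref` only checks the dynamic type and hands back a reference to the value that was already constructed, so it builds no new buffers.

```rust
use std::any::Any;

/// Hypothetical stand-in for the crate's `Array` trait; only `as_any` matters here.
trait Array {
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical typed column, standing in for `PrimitiveArray<f64>`.
struct Float64Column {
    values: Vec<f64>,
}

/// A second hypothetical column type, used to show a failed downcast.
struct Int64Column {
    values: Vec<i64>,
}

impl Array for Float64Column {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for Int64Column {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Returns the typed column, or `None` when the dynamic type does not match.
/// No data is copied: the result borrows the existing value.
fn as_f64_column(column: &dyn Array) -> Option<&Float64Column> {
    column.as_any().downcast_ref::<Float64Column>()
}
```

Handling the `None` case explicitly, rather than calling `unwrap()`, keeps a schema mismatch from turning into a panic.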

paddyhoran

We should also update the documentation of csvreader and provide more prose around the features it provides. A working CSV reader really makes this library useful for a lot of users, and the online docs will be the first place they look.

If not as part of this PR, I'd be happy to work on the docs as a follow up PR.

Thanks @andygrove!

paddyhoran · 2018-11-20T02:29:55Z

rust/src/array.rs

                unsafe { *(self.raw_values().offset(i as isize)) }
            }

+            pub fn value_slice(&self, offset: i64, len: i64) -> &[$native_ty] {


We should probably add a doc-comment for value_slice
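A rough sketch of what such a doc-comment could look like, written against a hypothetical `Int32Values` type backed by a `Vec`. The PR's method takes `i64` arguments and is generated inside a macro; `usize` is used here only to keep the example self-contained.

```rust
/// Hypothetical stand-in for a primitive array backed by a plain `Vec`.
struct Int32Values {
    values: Vec<i32>,
}

impl Int32Values {
    /// Returns a slice of `len` values starting at index `offset`.
    ///
    /// The slice borrows from the array, so no values are copied.
    ///
    /// # Panics
    ///
    /// Panics if `offset + len` exceeds the number of values in the array.
    fn value_slice(&self, offset: usize, len: usize) -> &[i32] {
        &self.values[offset..offset + len]
    }
}
```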

paddyhoran · 2018-11-20T02:31:17Z

rust/src/csvreader.rs

+        // read a batch of rows into memory
+        let mut rows: Vec<StringRecord> = Vec::with_capacity(self.batch_size);
+        for _ in 0..self.batch_size {
+            match self.record_iter.next() {


Should we expose record_iter to the user? This would allow them to skip records at the beginning of a file for example before reading the rest of the file.

I will think about this. There is already the constructor param to indicate whether there is a header row or not that needs to be skipped. Maybe we could do this as a separate enhancement PR.

I have a use case where there is meta-data inserted after the header but before the data and I have no way to control this. Perhaps this is a rather narrow use case. In any case, happy to revisit this.
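To make the use case concrete, here is a small sketch of skipping leading records before batching, assuming the caller can get at the underlying row iterator. `read_batch` and its parameters are hypothetical names for illustration and are not part of this PR.

```rust
/// Reads up to `batch_size` rows from `rows`, first discarding `skip` leading rows
/// (for example, metadata lines inserted between the header and the data).
fn read_batch<I>(rows: &mut I, skip: usize, batch_size: usize) -> Vec<String>
where
    I: Iterator<Item = String>,
{
    // Discard the unwanted leading rows, stopping early if the input runs out.
    for _ in 0..skip {
        if rows.next().is_none() {
            break;
        }
    }
    // `by_ref` borrows the iterator, so later calls continue where this one stopped.
    rows.by_ref().take(batch_size).collect()
}
```

Because the iterator is borrowed rather than consumed, the skip only needs to be applied on the first call; later calls can pass `skip = 0`.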

paddyhoran · 2018-11-20T02:32:01Z

rust/src/csvreader.rs

+                                }
+                                _ => {
+                                    list_builder.append(false).unwrap();
+                                }


I think DataType::Utf8 should create a BinaryArray not a PrimitiveArray<u8> (there currently is a difference, PrimitiveArray<u8> has child data but BinaryArray has 2 top level buffers). Once ARROW-3787 is merged you can use from to make this conversion and once ARROW-3713 is merged the builder can be simplified further.

I pulled in 3787 and changed this to create BinaryArray
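The two-buffer layout mentioned above can be sketched as follows: one buffer of offsets and one buffer of concatenated bytes. `Utf8Column` is a hypothetical illustration, not the crate's `BinaryArray`, and it omits the validity bitmap that a real Arrow array also carries.

```rust
/// Hypothetical string column using an offsets buffer plus a data buffer.
struct Utf8Column {
    /// `offsets[i]..offsets[i + 1]` is the byte range of value `i`; length is `len + 1`.
    offsets: Vec<i32>,
    /// All values concatenated back to back.
    data: Vec<u8>,
}

impl Utf8Column {
    /// Builds the two buffers from a list of strings.
    fn from_strs(values: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut data = Vec::new();
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len() as i32);
        }
        Utf8Column { offsets, data }
    }

    /// Number of values in the column.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns value `i` without copying and without `unsafe`.
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.data[start..end]).expect("column was built from valid UTF-8")
    }
}
```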

andygrove · 2018-11-20T16:33:22Z

@paddyhoran I added documentation as requested. I also renamed CsvFile to CsvReader. I will be adding CsvWriter in a separate PR once this one is merged (I already have working code).

sunchao · 2018-11-20T17:20:28Z

Thanks @andygrove . I'll take a look at this today too.

paddyhoran · 2018-11-20T17:30:49Z

@andygrove the windows CI needs to be updated to run the new example also.

sunchao · 2018-11-20T17:18:44Z

rust/src/csvreader.rs

@@ -0,0 +1,263 @@
+// Licensed to the Apache Software Foundation (ASF) under one


Can we have a separate mod for csv, i.e., rust/src/csv and rust/src/csv/reader.rs?

sunchao · 2018-11-20T17:40:51Z

rust/src/csvreader.rs

+
+impl CsvReader {
+    /// Read the next batch of rows
+    pub fn next(&mut self) -> Option<Result<Arc<RecordBatch>, ArrowError>> {


Two points:

Can we return RecordBatch instead of Arc<RecordBatch>? We are creating a unique reference to each record batch here, so we can let the caller decide whether to wrap it in an Arc.

We should use error::Result<T> instead of Result<T, ArrowError>.
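Both points can be sketched together. The types below (`ArrowError`, `RecordBatch`, `Reader`) are simplified, hypothetical stand-ins that parse one integer per line; they show only the shape of the signature, with a crate-local `Result<T>` alias and an owned batch that the caller may wrap in an `Arc` if it needs sharing.

```rust
use std::sync::Arc;

#[derive(Debug)]
enum ArrowError {
    ParseError(String),
}

/// Crate-local alias so signatures read `Result<T>` rather than `Result<T, ArrowError>`.
type Result<T> = std::result::Result<T, ArrowError>;

/// Hypothetical minimal batch holding a single integer column.
#[derive(Debug, PartialEq)]
struct RecordBatch {
    rows: Vec<i64>,
}

/// Hypothetical reader over in-memory lines.
struct Reader {
    lines: std::vec::IntoIter<String>,
    batch_size: usize,
}

impl Reader {
    fn new(lines: Vec<String>, batch_size: usize) -> Self {
        Reader { lines: lines.into_iter(), batch_size }
    }

    /// Reads the next batch; `None` signals end of input.
    /// The batch is returned by value, so the caller owns it outright.
    fn next(&mut self) -> Option<Result<RecordBatch>> {
        let raw: Vec<String> = self.lines.by_ref().take(self.batch_size).collect();
        if raw.is_empty() {
            return None;
        }
        let mut rows = Vec::with_capacity(raw.len());
        for line in &raw {
            match line.trim().parse::<i64>() {
                Ok(v) => rows.push(v),
                Err(e) => {
                    let msg = format!("invalid integer {:?}: {}", line, e);
                    return Some(Err(ArrowError::ParseError(msg)));
                }
            }
        }
        Some(Ok(RecordBatch { rows }))
    }
}

/// Shared ownership is opt-in on the caller's side.
fn share(batch: RecordBatch) -> Arc<RecordBatch> {
    Arc::new(batch)
}
```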

sunchao · 2018-11-20T17:43:11Z

rust/src/csvreader.rs

+                    rows.push(r);
+                }
+                Some(Err(_)) => {
+                    return Some(Err(ArrowError::ParseError(


Perhaps we should surface this error into the ParseError.
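One way to surface the underlying error is to fold its message into the `ParseError` text instead of discarding it with `Err(_)`. The sketch below uses a hypothetical helper, `parse_f64_field`, and a simplified `ArrowError`; the exact message format is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum ArrowError {
    ParseError(String),
}

/// Parses one field, keeping the offending text, the row number, and the
/// underlying error's message in the returned `ParseError`.
fn parse_f64_field(raw: &str, row: usize) -> Result<f64, ArrowError> {
    raw.trim().parse::<f64>().map_err(|e| {
        ArrowError::ParseError(format!("Error parsing {:?} at row {}: {}", raw, row, e))
    })
}
```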

sunchao · 2018-11-20T17:44:10Z

rust/src/csvreader.rs

+        }
+
+        // return early if no data was loaded
+        if rows.len() == 0 {


nit: use is_empty() instead of len() == 0? It is more explicit.

sunchao · 2018-11-20T17:45:42Z

rust/src/csvreader.rs

+                .collect(),
+        };
+
+        let arrays: Result<Vec<ArrayRef>, ArrowError> = projection


Here too: we can use error::Result<T>.

sunchao · 2018-11-20T17:48:23Z

rust/src/csvreader.rs

+                    &DataType::UInt16 => build_primitive_array!(rows, i, u16),
+                    &DataType::UInt32 => build_primitive_array!(rows, i, u32),
+                    &DataType::UInt64 => build_primitive_array!(rows, i, u64),
+                    &DataType::Float16 => build_primitive_array!(rows, i, f32),


This doesn't look right - half float is stored with 2 bytes but here we are making them f32 (4 bytes). Maybe we should leave it as a TODO for now.

I removed support for Float16 so this would now fail with an unsupported data type error, which seems sensible
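The resulting behavior can be sketched as a match with a catch-all arm that rejects unsupported types. `DataType`, `Column` and `build_column` below are simplified, hypothetical stand-ins, and reporting the failure through `ParseError` is an assumption about which error variant is used.

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
enum DataType {
    Int32,
    Float16,
    Float32,
    Float64,
}

#[derive(Debug, PartialEq)]
enum ArrowError {
    ParseError(String),
}

/// Hypothetical typed column produced by the reader.
#[derive(Debug, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    Float32(Vec<f32>),
    Float64(Vec<f64>),
}

/// Parses every field as `T`, failing on the first field that does not parse.
fn parse_all<T: FromStr>(rows: &[&str]) -> Result<Vec<T>, ArrowError> {
    rows.iter()
        .map(|s| {
            s.trim()
                .parse::<T>()
                .map_err(|_| ArrowError::ParseError(format!("Error parsing {:?}", s)))
        })
        .collect()
}

fn build_column(rows: &[&str], data_type: &DataType) -> Result<Column, ArrowError> {
    match data_type {
        DataType::Int32 => parse_all(rows).map(Column::Int32),
        DataType::Float32 => parse_all(rows).map(Column::Float32),
        DataType::Float64 => parse_all(rows).map(Column::Float64),
        // Float16 values occupy 2 bytes; widening them to f32 would produce a
        // 4-byte buffer under a 2-byte type, so the type is rejected instead.
        other => Err(ArrowError::ParseError(format!(
            "Unsupported data type {:?}",
            other
        ))),
    }
}
```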

paddyhoran

+1 LGTM, thanks @andygrove

sunchao · 2018-11-21T03:39:49Z

rust/src/csv/mod.rs

@@ -0,0 +1 @@
+pub mod reader;


License header in this file too. Also, I wonder if it's possible to rename CsvReader to Reader and export it on the module level, so people can import it with use arrow::csv::Reader instead of the current use arrow::csv::reader::CsvReader, which seems a little redundant.
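The suggested module-level export can be sketched with inline modules. The `Reader` struct here is a hypothetical placeholder; only the `pub use` line is the point.

```rust
mod csv {
    pub mod reader {
        /// Hypothetical placeholder for the CSV reader type.
        pub struct Reader {
            pub batch_size: usize,
        }

        impl Reader {
            pub fn new(batch_size: usize) -> Self {
                Reader { batch_size }
            }
        }
    }

    // Re-export so callers can write `csv::Reader` instead of `csv::reader::Reader`.
    pub use self::reader::Reader;
}
```

Both paths keep working after the re-export, so existing imports of `csv::reader::Reader` would not break.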

andygrove · 2018-11-21T16:27:25Z

@sunchao All PR feedback has been addressed and CI is happy. Could you give this your blessing? I will then go ahead and merge. Thanks.

sunchao

LGTM. Thanks @andygrove !

andygrove added 5 commits November 19, 2018 07:58

Implement csv reader

517da28

Update CI script

9e88791

fix test

638159d

Example displays data

8974c60

update example to print city names to demonstrate usage of List<u8>

e539814

andygrove added 2 commits November 19, 2018 09:35

add value_slice method, clean up example code

aae53aa

cargo fmt

247092d

paddyhoran reviewed Nov 19, 2018


paddyhoran reviewed Nov 20, 2018


Add documentation, rename CsvFile to CsvReader

3928d6d

andygrove added 2 commits November 20, 2018 09:48

Merge branch 'master' into ARROW-3726

f726559

Use BinaryArray instead of List<u8>

80c44ca

sunchao reviewed Nov 20, 2018


andygrove added 7 commits November 20, 2018 13:33

Remove support for Float16

6167223

Remove support for Float16

2b9d9e1

use isEmpty() instead of len() == 0

ab1b20f

Remove Arc<>

5d43b1f

Update Windows CI to run new example

26857a3

create module for csv::reader

3ab0a47

add missing file after rename

4674651

paddyhoran approved these changes Nov 21, 2018


sunchao reviewed Nov 21, 2018


andygrove added 2 commits November 21, 2018 06:57

re-export csv::Reader

70140c6

Exclude Rust test data csv files from rat

4d1bf98

sunchao approved these changes Nov 21, 2018


andygrove closed this in c04a62b Nov 21, 2018

andygrove deleted the ARROW-3726 branch March 30, 2019 22:33

andygrove restored the ARROW-3726 branch March 30, 2019 22:33

andygrove deleted the ARROW-3726 branch March 30, 2019 22:34

asfimport mentioned this pull request Nov 21, 2018

[Rust] CSV Reader & Writer #15862

Closed
