ARROW-3726: [Rust] Add CSV reader with example #2992
andygrove wants to merge 19 commits into apache:master
@paddyhoran @sunchao I would appreciate a review of this. The code in the example for reading strings is ugly and required calling
    let lng = batch
        .column(2)
        .as_any()
        .downcast_ref::<PrimitiveArray<f64>>()
Are we sure that the alignment requirements are met if we simply downcast from any? Alignments are checked through our existing constructors but would not be checked if we downcast, right?
I have no idea to be honest. I am definitely looking for guidance here on how to access these arrays after they are built.
Actually, I think it's fine as you are using our builders internally. I'll take a closer look when I get a chance.
Thanks @andygrove
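To illustrate the downcast question above, here is a minimal std-only sketch of the `as_any()` plus `downcast_ref` pattern. The `Array` trait, `Float64Array`, `Int32Array`, and `f64_at` are hypothetical stand-ins, not the arrow crate's types. `downcast_ref` only compares type IDs and hands back a reference to a value that was already constructed, so any checks performed at construction time are not repeated, and a mismatched type yields `None` rather than a reinterpretation of the bytes.

```rust
use std::any::Any;

/// Hypothetical stand-in for an Arrow-style array trait (not the real arrow API).
trait Array {
    /// Exposes the concrete value as `Any` so callers can downcast it.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical typed column of f64 values.
struct Float64Array {
    values: Vec<f64>,
}

/// Hypothetical typed column of i32 values.
struct Int32Array {
    values: Vec<i32>,
}

impl Array for Float64Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for Int32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Returns the f64 at `row`, or `None` if the column is not a
/// `Float64Array` or the row is out of bounds.
fn f64_at(column: &dyn Array, row: usize) -> Option<f64> {
    column
        .as_any()
        .downcast_ref::<Float64Array>()? // type-ID check only, no re-validation
        .values
        .get(row)
        .copied()
}
```

In this sketch the caller never gets a typed view of a column whose concrete type differs from the one requested, which is the property the example in the PR relies on.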
paddyhoran left a comment
We should update the documentation of csvreader also and provide more prose around the features it provides. A working csv reader really makes this library useful for a lot of users and the docs online will be the first place they will look.
If not as part of this PR, I'd be happy to work on the docs as a follow-up PR.
Thanks @andygrove!
            unsafe { *(self.raw_values().offset(i as isize)) }
        }

        pub fn value_slice(&self, offset: i64, len: i64) -> &[$native_ty] {
We should probably add a doc-comment for value_slice
    // read a batch of rows into memory
    let mut rows: Vec<StringRecord> = Vec::with_capacity(self.batch_size);
    for _ in 0..self.batch_size {
        match self.record_iter.next() {
Should we expose record_iter to the user? This would allow them to skip records at the beginning of a file for example before reading the rest of the file.
I will think about this. There is already the constructor param to indicate whether there is a header row or not that needs to be skipped. Maybe we could do this as a separate enhancement PR.
I have a use case where there is meta-data inserted after the header but before the data and I have no way to control this. Perhaps this is a rather narrow use case. In any case, happy to revisit this.
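One way the use case above could be served without exposing the iterator itself is a skip method on the reader. The following is a rough std-only sketch under that assumption; `BatchReader`, `skip_records`, and `next_batch` are hypothetical names, and plain `String` lines stand in for parsed CSV records.

```rust
/// Hypothetical reader that hands out records in batches of at most `batch_size`.
struct BatchReader<I: Iterator<Item = String>> {
    records: I,
    batch_size: usize,
}

impl<I: Iterator<Item = String>> BatchReader<I> {
    fn new(records: I, batch_size: usize) -> Self {
        BatchReader { records, batch_size }
    }

    /// Discards up to `n` records (for example a header plus metadata lines)
    /// and returns how many were actually skipped.
    fn skip_records(&mut self, n: usize) -> usize {
        let mut skipped = 0;
        while skipped < n && self.records.next().is_some() {
            skipped += 1;
        }
        skipped
    }

    /// Reads the next batch, or `None` once the input is exhausted.
    fn next_batch(&mut self) -> Option<Vec<String>> {
        let batch: Vec<String> = self.records.by_ref().take(self.batch_size).collect();
        if batch.is_empty() {
            None
        } else {
            Some(batch)
        }
    }
}
```

A caller with one header line and one metadata line would call `skip_records(2)` once before the first `next_batch()`. Whether this belongs in the reader or in a separate enhancement is the open question in the thread.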
        }
        _ => {
            list_builder.append(false).unwrap();
        }
I think DataType::Utf8 should create a BinaryArray not a PrimitiveArray<u8> (there currently is a difference, PrimitiveArray<u8> has child data but BinaryArray has 2 top level buffers). Once ARROW-3787 is merged you can use from to make this conversion and once ARROW-3713 is merged the builder can be simplified further.
I pulled in 3787 and changed this to create BinaryArray
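As a rough picture of the "two top-level buffers" layout mentioned above, the sketch below stores strings as one offsets buffer and one contiguous values buffer. `StringColumn` and its methods are hypothetical and simplified (there is no validity bitmap and no alignment handling); it is not the crate's `BinaryArray`.

```rust
/// Hypothetical, simplified variable-length string column:
/// `offsets` has one more entry than there are values, and value `i`
/// occupies `values[offsets[i]..offsets[i + 1]]`.
struct StringColumn {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl StringColumn {
    /// Packs the given strings into the two buffers.
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        StringColumn { offsets, values }
    }

    /// Number of strings stored.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns the string at index `i` (panics if `i` is out of bounds).
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end])
            .expect("buffer was built from valid UTF-8 strings")
    }
}
```

Reading a value needs only two offset lookups and a slice, so no per-string allocation is involved.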
@paddyhoran I added documentation as requested. I also renamed CsvFile to CsvReader. I will be adding CsvWriter in a separate PR once this one is merged (I already have working code).
Thanks @andygrove. I'll take a look at this today too.
@andygrove the Windows CI needs to be updated to run the new example also.
    @@ -0,0 +1,263 @@
    // Licensed to the Apache Software Foundation (ASF) under one
Can we have a separate mod for csv, i.e., rust/src/csv and rust/src/csv/reader.rs?
rust/src/csvreader.rs
Outdated
    impl CsvReader {
        /// Read the next batch of rows
        pub fn next(&mut self) -> Option<Result<Arc<RecordBatch>, ArrowError>> {
Two points:
- Can we return RecordBatch instead of Arc<RecordBatch>? We are creating a unique reference for the record batches here and can let the caller decide whether to make it an Arc.
- We should use error.Result instead of Result<T, ArrowError>.
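Both points can be sketched together. The following is a std-only illustration with hypothetical stand-ins (`error::ArrowError`, `error::Result`, `RecordBatch`, `Reader`, `share`), not the PR's actual code: the reader returns an owned batch through a crate-level `Result` alias, and wrapping it in an `Arc` is left to the caller.

```rust
use std::sync::Arc;

mod error {
    /// Hypothetical error type, reduced to the one variant used here.
    #[derive(Debug, PartialEq)]
    pub enum ArrowError {
        ParseError(String),
    }

    /// Alias so signatures can say `error::Result<T>`.
    pub type Result<T> = ::std::result::Result<T, ArrowError>;
}

/// Hypothetical batch: raw record strings stand in for typed columns.
#[derive(Debug, PartialEq)]
struct RecordBatch {
    rows: Vec<String>,
}

struct Reader {
    records: std::vec::IntoIter<String>,
    batch_size: usize,
}

impl Reader {
    fn new(records: Vec<String>, batch_size: usize) -> Self {
        Reader { records: records.into_iter(), batch_size }
    }

    /// Returns an owned batch; `None` signals end of input.
    fn next(&mut self) -> Option<error::Result<RecordBatch>> {
        let rows: Vec<String> = self.records.by_ref().take(self.batch_size).collect();
        if rows.is_empty() {
            return None;
        }
        if rows.iter().any(|r| r.trim().is_empty()) {
            let msg = "empty record in batch".to_string();
            return Some(Err(error::ArrowError::ParseError(msg)));
        }
        Some(Ok(RecordBatch { rows }))
    }
}

/// The caller opts into shared ownership only when it needs it.
fn share(batch: RecordBatch) -> Arc<RecordBatch> {
    Arc::new(batch)
}
```

Returning the owned value keeps the cheap case cheap: a caller that processes each batch once never pays for reference counting.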
            rows.push(r);
        }
        Some(Err(_)) => {
            return Some(Err(ArrowError::ParseError(
Perhaps we should surface this error into the ParseError.
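A minimal sketch of what surfacing the underlying error could look like, assuming a `ParseError(String)` variant as in the snippet above. `parse_f64_field` and its message format are hypothetical, and a float parse failure stands in for the CSV library's error. Instead of discarding the cause with `Err(_)`, the closure folds its `Display` text into the message.

```rust
/// Hypothetical error type with the variant used in the PR snippet.
#[derive(Debug, PartialEq)]
enum ArrowError {
    ParseError(String),
}

/// Parses one field as f64, keeping the underlying error text and the
/// line number in the resulting `ParseError`.
fn parse_f64_field(line: usize, field: &str) -> Result<f64, ArrowError> {
    field.trim().parse::<f64>().map_err(|e| {
        ArrowError::ParseError(format!(
            "line {}: could not parse {:?} as f64: {}",
            line, field, e
        ))
    })
}
```

With this shape a user sees which line and which value failed, rather than a generic parse failure.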
rust/src/csvreader.rs
Outdated
    }

    // return early if no data was loaded
    if rows.len() == 0 {
nit: use is_empty() instead of len() == 0? it is more explicit.
rust/src/csvreader.rs
Outdated
        .collect(),
    };

    let arrays: Result<Vec<ArrayRef>, ArrowError> = projection
here too: we can use error.Result.
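The `let arrays: Result<Vec<...>, ...> = projection ...` line relies on collecting an iterator of `Result`s into a single `Result`. A small std-only sketch of that idiom follows; `parse_row` is a hypothetical helper and a `String` stands in for the error type.

```rust
/// Parses every field as i64. Collecting into `Result<Vec<_>, _>` stops at
/// the first failing field and returns its error.
fn parse_row(fields: &[&str]) -> Result<Vec<i64>, String> {
    fields
        .iter()
        .map(|f| f.trim().parse::<i64>().map_err(|e| format!("{:?}: {}", f, e)))
        .collect()
}
```

With a crate-level alias in place, the return type would shrink to `Result<Vec<i64>>`, which is the readability gain the review comment is after.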
rust/src/csvreader.rs
Outdated
    &DataType::UInt16 => build_primitive_array!(rows, i, u16),
    &DataType::UInt32 => build_primitive_array!(rows, i, u32),
    &DataType::UInt64 => build_primitive_array!(rows, i, u64),
    &DataType::Float16 => build_primitive_array!(rows, i, f32),
This doesn't look right - half float is stored with 2 bytes but here we are making them f32 (4 bytes). Maybe we should leave it as a TODO for now.
I removed support for Float16 so this would now fail with an unsupported data type error, which seems sensible
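The resolution above amounts to matching only the types that have a faithful in-memory representation and rejecting the rest. A rough std-only sketch, where `DataType`, `Column`, `parse_all`, `build_column`, and the error text are all hypothetical:

```rust
use std::str::FromStr;

/// Hypothetical subset of data types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Float16,
    Float32,
    Float64,
}

/// Hypothetical typed column produced by the builder.
#[derive(Debug, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    Float32(Vec<f32>),
    Float64(Vec<f64>),
}

/// Parses every field as `T`, failing on the first field that does not parse.
fn parse_all<T: FromStr>(fields: &[&str]) -> Result<Vec<T>, String> {
    fields
        .iter()
        .map(|f| f.trim().parse::<T>().map_err(|_| format!("could not parse {:?}", f)))
        .collect()
}

/// Builds a column for supported types and reports an error for the rest,
/// rather than silently storing a 2-byte half float in a 4-byte f32.
fn build_column(data_type: &DataType, fields: &[&str]) -> Result<Column, String> {
    match data_type {
        DataType::Int32 => parse_all::<i32>(fields).map(Column::Int32),
        DataType::Float32 => parse_all::<f32>(fields).map(Column::Float32),
        DataType::Float64 => parse_all::<f64>(fields).map(Column::Float64),
        other => Err(format!("Unsupported data type {:?}", other)),
    }
}
```

Failing loudly here means a schema that declares `Float16` is rejected up front instead of producing arrays whose element width disagrees with the declared type.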
paddyhoran left a comment
+1 LGTM, thanks @andygrove
    @@ -0,0 +1 @@
    pub mod reader;
License header in this file too. Also, I wonder if it's possible to rename CsvReader to Reader and export it on the module level, so people can import it with use arrow::csv::Reader instead of the current use arrow::csv::reader::CsvReader, which seems a little redundant.
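The module-level export suggested above would typically be a `pub use` in the `csv` module. Below is a self-contained sketch with inline modules standing in for `rust/src/csv/mod.rs` and `rust/src/csv/reader.rs`; the `Reader` fields and `open_with_short_path` are hypothetical.

```rust
mod csv {
    pub mod reader {
        /// Hypothetical reader type, renamed from `CsvReader` to `Reader`.
        pub struct Reader {
            pub batch_size: usize,
        }

        impl Reader {
            pub fn new(batch_size: usize) -> Self {
                Reader { batch_size }
            }
        }
    }

    // Re-export so callers can write `csv::Reader` instead of
    // `csv::reader::Reader`.
    pub use self::reader::Reader;
}

/// Uses the shorter, re-exported path.
fn open_with_short_path() -> csv::Reader {
    csv::Reader::new(1024)
}
```

Both paths name the same type, so the longer path keeps working for code that already imports it.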
@sunchao All PR feedback has been addressed and CI is happy. Could you give this your blessing and I will go ahead and merge. Thanks.
This adds a CSV reader and an example that accesses the loaded data by downcasting arrays to specific types. The CSV reader supports all primitive types plus string (List<u8>).