Skip to content

ARROW-3726: [Rust] Add CSV reader with example#2992

Closed
andygrove wants to merge 19 commits intoapache:masterfrom
andygrove:ARROW-3726
Closed

ARROW-3726: [Rust] Add CSV reader with example#2992
andygrove wants to merge 19 commits intoapache:masterfrom
andygrove:ARROW-3726

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Nov 19, 2018

This adds a CSV reader and an example that accessed the loaded data through the use of downcasting arrays to specific types. The CSV reader supports all primitive types + string (List<u8>).

@andygrove
Copy link
Copy Markdown
Member Author

@paddyhoran @sunchao I would appreciate a review of this. The code in the example for reading strings is ugly and required calling unsafe. I don't know if I'm missing something or whether we need to improve the API here?

let lng = batch
.column(2)
.as_any()
.downcast_ref::<PrimitiveArray<f64>>()
Copy link
Copy Markdown
Contributor

@paddyhoran paddyhoran Nov 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that the alignment requirements are met if we simple downcast from any? Alignments are checked through our existing constructors but would not be checked if we downcast, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea to be honest. I am definitely looking for guidance here on how to access these arrays after they are built.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it's fine as you are using our builders internally. I'll take a closer look when I get a chance.

Thanks @andygrove

Copy link
Copy Markdown
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the documentation of csvreader also and provide more prose around the features it provides. A working csv reader really makes this library useful for a lot of users and the docs online will be the first place they will look.

If not as part of this PR, I'd be happy to work on the docs as a follow up PR.

Thanks @andygrove!

unsafe { *(self.raw_values().offset(i as isize)) }
}

pub fn value_slice(&self, offset: i64, len: i64) -> &[$native_ty] {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a doc-comment for value_slice

// read a batch of rows into memory
let mut rows: Vec<StringRecord> = Vec::with_capacity(self.batch_size);
for _ in 0..self.batch_size {
match self.record_iter.next() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose record_iter to the user? This would allow them to skip records at the beginning of a file for example before reading the rest of the file.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will think about this. There is already the constructor param to indicate whether there is a header row or not that needs to be skipped. Maybe we could do this as a separate enhancement PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a use case where there is meta-data inserted after the header but before the data and I have no way to control this. Perhaps this is a rather narrow use case. In any case, happy to revisit this.

}
_ => {
list_builder.append(false).unwrap();
}
Copy link
Copy Markdown
Contributor

@paddyhoran paddyhoran Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think DataType::Utf8 should create a BinaryArray not a PrimitiveArray<u8> (there currently is a difference, PrimitiveArray<u8> has child data but BinaryArray has 2 top level buffers). Once ARROW-3787 is merged you can use from to make this conversion and once ARROW-3713 is merged the builder can be simplified further.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled in 3787 and changed this to create BinaryArray

@andygrove
Copy link
Copy Markdown
Member Author

@paddyhoran I added documentation as requested. I also renamed CsvFile to CsvReader. I will be adding CsvWriter in separate PR once this one is merged (I already have working code).

@sunchao
Copy link
Copy Markdown
Member

sunchao commented Nov 20, 2018

Thanks @andygrove . I'll take a look at this today too.

@paddyhoran
Copy link
Copy Markdown
Contributor

@andygrove the windows CI needs to be updated to run the new example also.

@@ -0,0 +1,263 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a separate mod for csv, i.e., rust/src/csv and rust/src/csv/reader.rs?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


impl CsvReader {
/// Read the next batch of rows
pub fn next(&mut self) -> Option<Result<Arc<RecordBatch>, ArrowError>> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two points:

  1. Can we return RecordBatch instead of Arc<RecordBatch>? we are creating unique reference for the record batches here and can let the caller to decide whether to make it a Arc.
  2. We should use error.Result instead of Result<T, ArrowError>.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

rows.push(r);
}
Some(Err(_)) => {
return Some(Err(ArrowError::ParseError(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should surface this error into the ParseError.

}

// return early if no data was loaded
if rows.len() == 0 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use is_empty() instead of len() == 0? it is more explicit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.collect(),
};

let arrays: Result<Vec<ArrayRef>, ArrowError> = projection
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too: we can use error.Result.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

&DataType::UInt16 => build_primitive_array!(rows, i, u16),
&DataType::UInt32 => build_primitive_array!(rows, i, u32),
&DataType::UInt64 => build_primitive_array!(rows, i, u64),
&DataType::Float16 => build_primitive_array!(rows, i, f32),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right - half float is stored with 2 bytes but here we are making them f32 (4 bytes). Maybe we should leave it as a TODO for now.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed support for Float16 so this would now fail with an unsupported data type error, which seems sensible

Copy link
Copy Markdown
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM, thanks @andygrove

@@ -0,0 +1 @@
pub mod reader;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

License header in this file too. Also, I wonder if it's possible to rename CsvReader to Reader and export it on the module level, so people can import it with use arrow::csv::Reader instead of the current use arrow::csv::reader::CsvReader, which seems a little redundant.

@andygrove
Copy link
Copy Markdown
Member Author

@sunchao All PR feedback has been addressed and CI is happy. Could you give this your blessing and I will go ahead and merge. Thanks.

Copy link
Copy Markdown
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @andygrove !

@andygrove andygrove closed this in c04a62b Nov 21, 2018
@andygrove andygrove deleted the ARROW-3726 branch March 30, 2019 22:33
@andygrove andygrove restored the ARROW-3726 branch March 30, 2019 22:33
@andygrove andygrove deleted the ARROW-3726 branch March 30, 2019 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants