ARROW-25: [C++] Implement CSV reader #2576
Conversation
(force-pushed from 917e0b9 to 5f4f55e)
(force-pushed from a1757df to 30a4d68)
By the way, initial performance testing on a CSV of string/binary columns gives the following ballpark numbers (on an 8-core 3.2 GHz AMD Ryzen CPU):
(I didn't bother measuring with numeric columns, as the performance of our number-parsing routines is likely to be very bad.) Clearly the main thread's chunking routine is the bottleneck in the multi-threaded scenario. Improving this will require adding an option to signal that values can't have newlines in them (as paratext does), and perhaps SIMD-accelerating that special case (as... paratext does ;-)). Overall I have three directions in mind to improve performance:
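The quoting-aware chunking being discussed can be sketched in Python. This is a hypothetical reference helper (`chunk_boundary` is a made-up name, and the real implementation is C++ and also handles escaping and doubled quotes); it only illustrates why the scan must track quote state before it can split a block at a newline:

```python
def chunk_boundary(block: bytes, quote: int = ord('"')) -> int:
    """Return the offset just past the last newline in `block` that is
    not inside a quoted value, i.e. the size of the largest prefix made
    of complete CSV rows. Returns 0 if no complete row fits.
    Simplified sketch: escaping and embedded doubled quotes are ignored."""
    in_quotes = False
    last_end = 0
    for i, ch in enumerate(block):
        if ch == quote:
            in_quotes = not in_quotes
        elif ch == ord("\n") and not in_quotes:
            last_end = i + 1
    return last_end
```

If values are guaranteed newline-free, this whole stateful scan collapses to a single `rfind(b"\n")`, which is what makes that option (and a SIMD newline search) attractive.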
python/pyarrow/_csv.pyx (outdated)

return <unsigned char> val

cdef class CSVReadOptions:
Question: since pyarrow.csv is a distinct module, I have in mind to rename this simply ReadOptions (same for ParseOptions). What do you think?
By the way, it's perhaps a bit late, but I thought this might fit as an experimental feature in Arrow 0.11 :-)
That sounds fine to me. I think we'll be able to merge it before then. It's my goal to review this tomorrow or Saturday so we can merge sometime next week.
I have started to look through this. I think we're going to need to do some work on the design of the tokenizer hot path (I wrote the tokenizer that pandas uses, for example -- I probably wouldn't use the same design again -- so we have other data points to compare with). Luckily we have benchmarks and tests, so we can refactor at will to try out different things and analyze that part in more depth.
Working on this review. There are quite a lot of -Wconversion issues with gcc; I'm going to push a fix for some of these so I have a clean build with gcc 4.8.x.
Yes, I find
Oops, I was compiling the wrong branch :S
Ahah :-) Sorry, I may have left a "csv_reader" branch lying around...
(force-pushed from dbe2b5c to f489b2e)
There's a failure in the Java Flight test here: (by the way, I thought we had reduced the Java jobs' verbosity?)
I opened a JIRA about the Java failure.
I'm going to work on getting a review up for this patch, but I will likely merge it as is (with a passing build) and leave additional work to follow-up patches, because it's so large.
Perhaps you'd like to give an opinion on the small naming question I asked above ;-)
Ah, you can remove the
(force-pushed from f489b2e to 9fa4f3f)
Ok, done.
Codecov Report

@@            Coverage Diff             @@
##           master    #2576      +/-  ##
=========================================
+ Coverage   87.23%    88.5%   +1.26%
=========================================
  Files         380      342      -38
  Lines       59463    57461    -2002
=========================================
- Hits        51872    50854    -1018
+ Misses       7521     6607     -914
+ Partials       70        0      -70

Continue to review full report at Codecov.
Can we merge this for 0.11.0? Or should we keep this for 0.12.0?
I'm ok with merging myself.
This includes:
- a CSV table reader written in C++
- a Python wrapper around the CSV table reader
- simple type inference for CSV values (null -> int64 -> float64 -> binary)
- generic null parsing using Pandas defaults as a baseline ("NA", "N/A", "NaN"...)
- some simple syntax parameters for CSV parsing

Not included:
- conversion and typing options
- performance tuning
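The null -> int64 -> float64 -> binary fallback chain can be illustrated with a small pure-Python sketch. This is not Arrow's converter (`infer_column` and the `NULL_VALUES` subset are hypothetical); it only shows the shape of the inference:

```python
def infer_column(values):
    """Infer a column type via the null -> int64 -> float64 -> binary
    fallback chain. NULL_VALUES is a subset of the pandas-style null
    spellings mentioned above ("NA", "N/A", "NaN"...)."""
    NULL_VALUES = {"", "NA", "N/A", "NaN", "null"}
    non_null = [v for v in values if v not in NULL_VALUES]
    if not non_null:
        return "null"
    # Try each narrower type; the first conversion failure demotes
    # the column to the next type in the chain.
    for typ, conv in (("int64", int), ("float64", float)):
        try:
            for v in non_null:
                conv(v)
            return typ
        except ValueError:
            continue
    return "binary"
```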
(force-pushed from 9fa4f3f to 4ae93b2)
Rebased to make sure CI passes.
Thanks!
I had planned to merge it today. I'm working on the code review over the next few hours.
+1. Lots of work to do on this problem, but this is a fantastic start! I left a number of comments, either about design/functionality or code cleaning, but in the interest of unblocking the 0.11 release I'm going to merge this and we'll leave further work to follow-up patches.
#include "arrow/util/logging.h"

#include <sstream>
#include <string>
Standard library headers should be included after the "chunker.h" include but before the other local project headers.
cdef _get_reader(input_file, shared_ptr[InputStream]* out):
    cdef shared_ptr[RandomAccessFile] result
    use_memory_map = False
    get_reader(input_file, use_memory_map, &result)
The fastest CSV readers I'm familiar with all use memory mapping, FWIW. We can do our own experiments on large files. This is a good one: https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
The problem is that ReadaheadSpooler will make a copy anyway when reading from a memory-mapped file (because of padding).
More generally, I don't think it is really relevant here. Parsing CSV data will always be much slower than the speed of raw memory copies (which should be on the order of 30 GB/s from main memory).
By the way, another motivation for not focusing on memory-mapped files is that it wouldn't help with compressed CSV files.
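The compressed-file point can be made concrete with a small Python sketch using the stdlib gzip module (not Arrow's actual I/O classes; `stream_csv_blocks` is a made-up name). The parser only ever sees freshly decompressed buffers, so a memory map of the file on disk buys nothing:

```python
import gzip
import io

def stream_csv_blocks(raw: bytes, block_size: int = 4):
    """Yield fixed-size blocks of decompressed CSV data from a
    gzip-compressed byte string. Each block is a fresh buffer produced
    by the decompressor, never a view into the compressed file."""
    with gzip.open(io.BytesIO(raw), "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block
```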
}

template <bool quoting, bool escaping>
Status Chunker::ProcessSpecialized(const char* start, uint32_t size, uint32_t* out_size) {
Probably want to standardize on using int64_t for sizes, etc.
///
/// Process a block of CSV data, reading up to max_num_rows rows.
/// The number of bytes in the chunk is returned in out_size.
Status Process(const char* data, uint32_t size, uint32_t* out_size);
TODO: make virtual. We will need to define chunkers that use whitespace, or multi-character / regex delimiters eventually
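The virtual-`Process` idea can be sketched like this (a Python shape with made-up class names; the real code is C++ and the base class would dispatch on delimiter style):

```python
from abc import ABC, abstractmethod

class Chunker(ABC):
    """Sketch of a polymorphic chunker: each subclass decides how much
    of a block consists of complete rows."""

    @abstractmethod
    def process(self, data: bytes) -> int:
        """Return the number of leading bytes covering complete rows."""

class NewlineChunker(Chunker):
    """Hypothetical subclass: rows end at newlines (quoting ignored
    for brevity). A whitespace- or regex-delimited chunker would
    override process() with its own boundary rule."""

    def process(self, data: bytes) -> int:
        pos = data.rfind(b"\n")
        return pos + 1 if pos >= 0 else 0
```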
// Detect a single line from the data pointer. Return the line end,
// or nullptr if the remaining line is truncated.
template <bool quoting, bool escaping>
inline const char* ReadLine(const char* data, const char* data_end);
If you make Process virtual, then these methods can be tucked into the private implementation.
}

// Make a BlockParser from a vector of lines representing a CSV file
void MakeCSVParser(std::vector<std::string> lines, std::shared_ptr<BlockParser>* out) {
const vector<string>&
}

// Make a BlockParser from a vector of strings representing a single CSV column
void MakeColumnParser(std::vector<std::string> items, std::shared_ptr<BlockParser>* out) {
const&
static void sleep_for(double seconds) {
  std::this_thread::sleep_for(
      std::chrono::nanoseconds(static_cast<int64_t>(seconds * 1e9)));
}
I think this function has appeared in several places now; it should be put in a utility header like arrow/test-util.h.
Yeah, perhaps.
/// be careful to call Finish() on subgroups before calling it
/// on the main group).
// XXX if a subgroup errors out, should it propagate immediately to the parent
// and to children?
Bailing out early seems like a useful workflow to harden
Note that subgroups aren't actually used, so perhaps we should simply rip them out. They are a source of complication in the implementation (initially, I thought I'd need them).
cdef extern from "arrow/csv/api.h" namespace "arrow::csv" nogil:

    cdef cppclass CCSVParseOptions" arrow::csv::ParseOptions":
        unsigned char delimiter
We should think ahead about how we will handle multi-char delimiters
Well... do those appear in the wild?
Yep, they do, unfortunately. And regular expressions. This was implemented in Continuum's IOPro, for example: https://github.com/ContinuumIO/TextAdapter/blob/53138c2277cdfcf32e127251313d4f77f81050aa/textadapter/core/text_adapter.c#L1575. In pandas, multi-char/regex delimiters are implemented using an extremely slow Python parser.
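For reference, the slow regex-delimiter path amounts to something like this (a hypothetical helper using the stdlib re module, not pandas' or Arrow's actual code). It shows the flexibility such delimiters buy, at the cost of a regex match per split instead of a single-byte comparison in the hot loop:

```python
import re

def split_rows(text: str, delim_pattern: str):
    """Split each line of `text` into fields using a delimiter regex.
    Quoting is deliberately ignored: this is only a sketch of the
    delimiter-matching cost, not a full CSV parser."""
    pat = re.compile(delim_pattern)
    return [pat.split(line) for line in text.splitlines()]
```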
If I understand correctly, the pyarrow CSV reader already reads large files in chunks? Is there a way to control this behaviour?