
ARROW-25: [C++] Implement CSV reader #2576

Closed

Conversation

@pitrou (Member) commented Sep 17, 2018

This includes:

  • a CSV table reader written in C++
  • a Python wrapper around the CSV table reader
  • simple type inference for CSV values (null -> int64 -> float64 -> binary; see the sketch below)
  • generic null parsing using Pandas defaults as a baseline
    ("NA", "N/A", "NaN"...)
  • some simple syntax parameters for CSV parsing

Not included:

  • conversion and typing options
  • performance tuning
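
For illustration, the type-inference ladder could be sketched as follows (a minimal sketch with hypothetical names, not the PR's actual converter code):

// Hedged sketch of the inference ladder described above; the converter
// code in this PR may be structured differently.
enum class InferKind { Null, Int64, Float64, Binary };

// When a value fails to parse at the current level, promote to the
// next kind (e.g. "1.5" fails int64 parsing, so promote to Float64).
inline InferKind Promote(InferKind current) {
  switch (current) {
    case InferKind::Null:  return InferKind::Int64;
    case InferKind::Int64: return InferKind::Float64;
    default:               return InferKind::Binary;
  }
}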

@wesm (Member) commented Sep 17, 2018

Awesome! I will plan to spend some quality time reviewing this to give feedback. Having spent a lot of time writing code to parse CSV files, I have a lot of opinions.

cc also @cpcloud @jreback

@pitrou force-pushed the ARROW-25-csv-reader branch 3 times, most recently from a1757df to 30a4d68, on September 18, 2018 16:32
@pitrou (Member, Author) commented Sep 20, 2018

By the way, initial performance testing on a CSV of string/binary columns gives the following ballpark numbers here (on an 8-core 3.2 GHz AMD Ryzen CPU):

  • single-threaded: 150 MB/s
  • multi-threaded: 600 MB/s

(I didn't bother measuring with numeric columns, as the performance of our number parsing routines is likely to be very bad)

Clearly the main thread's chunking routine is the bottleneck in the multi-threaded scenario. Improving this will require adding an option to signal that values can't have newlines in them (as paratext does), and perhaps SIMD-accelerating that special case (as... paratext does ;-)).

Overall I have three directions in mind to improve performance:

  • improve the main thread's chunking routine as described above (see the sketch below)
  • pre-allocate parsing scratch spaces (and perhaps recycle them with a special MemoryPool)
  • optimize number parsing routines
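
As a rough illustration of the first direction: if values are guaranteed newline-free, the chunk boundary is simply the position after the last newline in the block. This is a minimal sketch, not the PR's Chunker code (which also handles quoting and escaping); a SIMD version would vectorize this scan:

#include <cstdint>

// Minimal sketch of the "values cannot contain newlines" fast path:
// return the number of leading bytes that form whole CSV rows.
static int64_t FindChunkSize(const char* data, int64_t size) {
  for (int64_t i = size; i-- > 0;) {
    if (data[i] == '\n') return i + 1;  // chunk ends just past the last newline
  }
  return 0;  // no complete row in this block
}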

return <unsigned char> val


cdef class CSVReadOptions:
@pitrou (Author) commented on the diff:

Question: since pyarrow.csv is a distinct module, I have in mind to rename this to simply ReadOptions... (same for ParseOptions). What do you think?

@pitrou (Member, Author) commented Sep 20, 2018

By the way, it's perhaps a bit late, but I thought this might fit as an experimental feature in Arrow 0.11 :-)

@wesm (Member) commented Sep 20, 2018

That sounds fine to me. I think we'll be able to merge it before then. It's my goal to review this tomorrow or Saturday so we can merge sometime next week.

@wesm (Member) commented Sep 21, 2018

I have started to look through this. I think we're going to need to do some work on the design of the tokenizer hot path (I wrote the tokenizer that pandas uses, for example -- I probably wouldn't use the same design again -- so we have other data points to compare with). Luckily we have benchmarks and tests so we can refactor at will to try out different things and analyze that part in more depth.

@wesm (Member) commented Sep 24, 2018

Working on this review. There are quite a lot of -Wconversion issues with gcc; I'm going to push a fix for some of these so I have a clean build with gcc 4.8.x.

@pitrou (Member, Author) commented Sep 24, 2018

Yes, I find -Wconversion to be quite annoying. For the record, gbenchmark fails to build with this flag.

@wesm (Member) commented Sep 24, 2018

Oops, I was compiling the wrong branch :S

@pitrou (Member, Author) commented Sep 24, 2018

Ahah :-) Sorry, I may have left a "csv_reader" branch lying around...

@pitrou force-pushed the ARROW-25-csv-reader branch 2 times, most recently from dbe2b5c to f489b2e, on September 27, 2018 13:13
@pitrou (Member, Author) commented Sep 27, 2018

There's a failure in the Java Flight test here:
https://travis-ci.org/apache/arrow/jobs/434090025

(by the way, I thought we had reduced the Java jobs' verbosity?)

@wesm (Member) commented Sep 27, 2018

I opened a JIRA about the Java failure.

@wesm (Member) commented Sep 27, 2018

I'm going to work on getting a review up for this patch, but I will likely merge it as-is (with a passing build) and leave additional work to follow-up patches, because it's so large.

@pitrou (Member, Author) commented Sep 27, 2018

Perhaps you'd like to give an opinion on the small naming question I asked above ;-)

@wesm (Member) commented Sep 27, 2018

Ah, you can remove the CSV part of CSVParseOptions

@pitrou (Member, Author) commented Sep 27, 2018

Ok, done.

@codecov-io commented Sep 27, 2018

Codecov Report

Merging #2576 into master will increase coverage by 1.26%.
The diff coverage is 95.73%.


@@            Coverage Diff            @@
##           master   #2576      +/-   ##
=========================================
+ Coverage   87.23%   88.5%   +1.26%     
=========================================
  Files         380     342      -38     
  Lines       59463   57461    -2002     
=========================================
- Hits        51872   50854    -1018     
+ Misses       7521    6607     -914     
+ Partials       70       0      -70
Impacted Files Coverage Δ
cpp/src/arrow/csv/chunker.h 100% <100%> (ø)
cpp/src/arrow/csv/column-builder.h 100% <100%> (ø)
cpp/src/arrow/csv/csv-converter-test.cc 100% <100%> (ø)
cpp/src/arrow/util/task-group.h 100% <100%> (ø)
cpp/src/arrow/csv/csv-column-builder-test.cc 100% <100%> (ø)
cpp/src/arrow/util/task-group-test.cc 100% <100%> (ø)
cpp/src/arrow/csv/parser.h 100% <100%> (ø)
cpp/src/arrow/util/task-group.cc 100% <100%> (ø)
cpp/src/arrow/csv/reader.h 100% <100%> (ø)
python/pyarrow/csv.py 100% <100%> (ø)
... and 100 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d38cf86...4ae93b2.

@kou (Member) commented Oct 1, 2018

Can we merge this for 0.11.0? Or should we keep this for 0.12.0?

@pitrou (Member, Author) commented Oct 1, 2018

I'm ok with merging myself.

@pitrou (Member, Author) commented Oct 1, 2018

Rebased to make sure CI passes.

@kou (Member) commented Oct 1, 2018

Thanks!

@wesm (Member) commented Oct 1, 2018

I had planned to merge it today. I'm working on the code review over the next few hours.

@wesm (Member) left a review:

+1. Lots of work to do on this problem, but this is a fantastic start! I left a number of comments, either about design/functionality or code cleanup, but in the interest of unblocking the 0.11 release I'm going to merge this and we'll leave the remaining work to follow-up patches.

#include "arrow/util/logging.h"

#include <sstream>
#include <string>
@wesm commented:

Standard library headers should be included after the "chunker.h" but before the other local project headers
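
Assuming this is the top of chunker.cc, the suggested ordering would look like:

#include "arrow/csv/chunker.h"

#include <sstream>
#include <string>

#include "arrow/util/logging.h"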

cdef _get_reader(input_file, shared_ptr[InputStream]* out):
    cdef shared_ptr[RandomAccessFile] result
    use_memory_map = False
    get_reader(input_file, use_memory_map, &result)
@wesm commented:

The fastest CSV readers I'm familiar with all use memory mapping, FWIW. We can do our own experiments on large files. This is a good one https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

@pitrou (Author) replied:

The problem is that ReadaheadSpooler will make a copy anyway when reading from a memory-mapped file (because of padding).

More generally, I don't think it is really relevant here. Parsing CSV data will always be much slower than the speed of raw memory copies (which should be on the order of 30 GB/s from main memory).

@pitrou (Author) replied:

By the way, another motivation for not focusing on memory-mapped files is that it wouldn't help with compressed CSV files.
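
For reference, a benchmark harness could feed the reader from a memory-mapped file roughly like this (a sketch assuming the Status-returning arrow::io::MemoryMappedFile API of this era, not code from this PR):

#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/status.h"

// Sketch: open a CSV file via memory mapping, for comparison against
// buffered reads in the experiments discussed above.
arrow::Status OpenMappedCsv(const std::string& path,
                            std::shared_ptr<arrow::io::MemoryMappedFile>* out) {
  return arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ, out);
}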

}

template <bool quoting, bool escaping>
Status Chunker::ProcessSpecialized(const char* start, uint32_t size, uint32_t* out_size) {
@wesm commented:

Probably want to standardize on using int64_t for sizes, etc.

///
/// Process a block of CSV data, reading up to max_num_rows rows.
/// The number of bytes in the chunk is returned in out_size.
Status Process(const char* data, uint32_t size, uint32_t* out_size);
@wesm commented:

TODO: make virtual. We will need to define chunkers that use whitespace, or multi-character / regex delimiters eventually
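
A hedged sketch of what that virtual interface might look like (illustrative names, not the actual refactor, using int64_t sizes per the earlier comment):

#include <cstdint>

#include "arrow/status.h"

// Illustrative only: a virtual chunker interface as suggested above,
// so whitespace- or regex-delimited chunkers can be swapped in later.
class Chunker {
 public:
  virtual ~Chunker() = default;
  // Determine how many leading bytes of `data` form whole CSV rows;
  // the byte count is returned in *out_size.
  virtual arrow::Status Process(const char* data, int64_t size,
                                int64_t* out_size) = 0;
};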

// Detect a single line from the data pointer. Return the line end,
// or nullptr if the remaining line is truncated.
template <bool quoting, bool escaping>
inline const char* ReadLine(const char* data, const char* data_end);
@wesm commented:

If you make Process virtual then these methods can be tucked into the private implementation

}

// Make a BlockParser from a vector of lines representing a CSV file
void MakeCSVParser(std::vector<std::string> lines, std::shared_ptr<BlockParser>* out) {
@wesm commented:

const vector<string>&

}

// Make a BlockParser from a vector of strings representing a single CSV column
void MakeColumnParser(std::vector<std::string> items, std::shared_ptr<BlockParser>* out) {
@wesm commented:

const&
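
Concretely, the signatures suggested in the two comments above would take const references instead of by-value vectors (declaration sketch):

#include <memory>
#include <string>
#include <vector>

#include "arrow/csv/parser.h"  // for BlockParser

void MakeCSVParser(const std::vector<std::string>& lines,
                   std::shared_ptr<arrow::csv::BlockParser>* out);
void MakeColumnParser(const std::vector<std::string>& items,
                      std::shared_ptr<arrow::csv::BlockParser>* out);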

static void sleep_for(double seconds) {
  std::this_thread::sleep_for(
      std::chrono::nanoseconds(static_cast<int64_t>(seconds * 1e9)));
}
@wesm commented:

I think this function has appeared in several places now; it should be put in a utility header like arrow/test-util.h.

@pitrou (Author) replied:

Yeah, perhaps.

/// be careful to call Finish() on subgroups before calling it
/// on the main group).
// XXX if a subgroup errors out, should it propagate immediately to the parent
// and to children?
@wesm commented:

Bailing out early seems like a useful workflow to harden

@pitrou (Author) replied:

Note that subgroups aren't actually used, so perhaps we should simply rip them out. They are a source of complication in the implementation (initially, I thought I'd need them).

cdef extern from "arrow/csv/api.h" namespace "arrow::csv" nogil:

    cdef cppclass CCSVParseOptions" arrow::csv::ParseOptions":
        unsigned char delimiter
@wesm commented:

We should think ahead about how we will handle multi-char delimiters

@pitrou (Author) replied:

Well... do those appear in the wild?

@wesm replied:

Yep, they do, unfortunately. And regular expressions. This was implemented in Continuum's IOPro, for example: https://github.com/ContinuumIO/TextAdapter/blob/53138c2277cdfcf32e127251313d4f77f81050aa/textadapter/core/text_adapter.c#L1575. In pandas, multi-character/regex delimiters are implemented using an extremely slow Python parser.
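
For context, matching a multi-character delimiter in a parser hot path could look roughly like this (a sketch, not a committed design for Arrow):

#include <cstring>

// Illustrative only: test whether a multi-character delimiter starts
// at position `p`; a future parser could call this per candidate byte.
inline bool MatchDelimiter(const char* p, const char* end,
                           const char* delim, std::size_t delim_len) {
  return static_cast<std::size_t>(end - p) >= delim_len &&
         std::memcmp(p, delim, delim_len) == 0;
}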

@wesm closed this in 5ebab5a on Oct 1, 2018
@pitrou deleted the ARROW-25-csv-reader branch on October 1, 2018 11:40
@saschahofmann commented:
If I understand it correctly, the pyarrow CSV reader already reads large files in chunks? Is there a way to control this behaviour?
I would like to create a loading bar for large CSVs and thought I could simply track the chunks already read.
