
ARROW-25: [C++] Implement CSV reader #2576

Closed

Conversation

@pitrou (Member) commented Sep 17, 2018

This includes:

  • a CSV table reader written in C++
  • a Python wrapper around the CSV table reader
  • simple type inference for CSV values (null -> int64 -> float64 -> binary; see the sketch below)
  • generic null parsing using Pandas defaults as a baseline
    ("NA", "N/A", "NaN"...)
  • some simple syntax parameters for CSV parsing

Not included:

  • conversion and typing options
  • performance tuning
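
For illustration, the type-inference ladder could be sketched as follows (a minimal sketch with hypothetical names, not the PR's actual converter code):

// Hedged sketch of the inference ladder described above; the converter
// code in this PR may be structured differently.
enum class InferKind { Null, Int64, Float64, Binary };

// When a value fails to parse at the current level, promote to the
// next kind (e.g. "1.5" fails int64 parsing, so promote to Float64).
inline InferKind Promote(InferKind current) {
  switch (current) {
    case InferKind::Null:  return InferKind::Int64;
    case InferKind::Int64: return InferKind::Float64;
    default:               return InferKind::Binary;
  }
}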

@wesm (Member) commented Sep 17, 2018

Awesome! I will plan to spend some quality time reviewing this to give feedback. Having spent a lot of time writing code to parse CSV files, I have a lot of opinions.

cc also @cpcloud @jreback

@pitrou force-pushed the ARROW-25-csv-reader branch 3 times, most recently from a1757df to 30a4d68, on September 18, 2018 16:32
@pitrou (Member, Author) commented Sep 20, 2018

By the way, initial performance testing on a CSV of string/binary columns gives the following ballpark numbers here (on an 8-core 3.2 GHz AMD Ryzen CPU):

  • single-threaded: 150 MB/s
  • multi-threaded: 600 MB/s

(I didn't bother measuring with numeric columns, as the performance of our number parsing routines is likely to be very bad)

Clearly the main thread's chunking routine is the bottleneck in the multi-threaded scenario. Improving this will require adding an option to signal that values can't have newlines in them (as paratext does), and perhaps SIMD-accelerating that special case (as... paratext does ;-)).

Overall I have three directions in mind to improve performance:

  • improve the main thread's chunking routine as described above (see the sketch below)
  • pre-allocate parsing scratch spaces (and perhaps recycle them with a special MemoryPool)
  • optimize number parsing routines
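
As a rough illustration of the first direction: if values are guaranteed newline-free, the chunk boundary is simply the position after the last newline in the block. This is a minimal sketch, not the PR's Chunker code (which also handles quoting and escaping); a SIMD version would vectorize this scan:

#include <cstdint>

// Minimal sketch of the "values cannot contain newlines" fast path:
// return the number of leading bytes that form whole CSV rows.
static int64_t FindChunkSize(const char* data, int64_t size) {
  for (int64_t i = size; i-- > 0;) {
    if (data[i] == '\n') return i + 1;  // chunk ends just past the last newline
  }
  return 0;  // no complete row in this block
}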

return <unsigned char> val


cdef class CSVReadOptions:
@pitrou (Author) commented on the diff:

Question: since pyarrow.csv is a distinct module, I have in mind to rename this to simply ReadOptions... (same for ParseOptions). What do you think?

@pitrou (Member, Author) commented Sep 20, 2018

By the way, it's perhaps a bit late, but I thought this might fit as an experimental feature in Arrow 0.11 :-)

@wesm (Member) commented Sep 20, 2018

That sounds fine to me. I think we'll be able to merge it before then. It's my goal to review this tomorrow or Saturday so we can merge sometime next week.

@wesm (Member) commented Sep 21, 2018

I have started to look through this. I think we're going to need to do some work on the design of the tokenizer hot path (I wrote the tokenizer that pandas uses, for example -- I probably wouldn't use the same design again -- so we have other data points to compare with). Luckily we have benchmarks and tests so we can refactor at will to try out different things and analyze that part in more depth.

@wesm (Member) commented Sep 24, 2018

Working on this review. There are quite a lot of -Wconversion issues with gcc; I'm going to push a fix for some of these so I have a clean build with gcc 4.8.x.

@pitrou (Member, Author) commented Sep 24, 2018

Yes, I find -Wconversion to be quite annoying. For the record, gbenchmark fails to build with this flag.

@wesm (Member) commented Sep 24, 2018

Oops, I was compiling the wrong branch :S

@pitrou (Member, Author) commented Sep 24, 2018

Ahah :-) Sorry, I may have left a "csv_reader" branch lying around...

@pitrou force-pushed the ARROW-25-csv-reader branch 2 times, most recently from dbe2b5c to f489b2e, on September 27, 2018 13:13
@pitrou (Member, Author) commented Sep 27, 2018

There's a failure in the Java Flight test here:
https://travis-ci.org/apache/arrow/jobs/434090025

(by the way, I thought we had reduced the Java jobs' verbosity?)

@wesm (Member) commented Sep 27, 2018

I opened a JIRA about the Java failure.

@wesm (Member) commented Sep 27, 2018

I'm going to work on getting a review up for this patch, but I will likely merge it as-is (with a passing build) and leave additional work to follow-up patches, because it's so large.

@pitrou (Member, Author) commented Sep 27, 2018

Perhaps you'd like to give an opinion on the small naming question I asked above ;-)

@wesm (Member) commented Sep 27, 2018

Ah, you can remove the CSV part of CSVParseOptions

@pitrou (Member, Author) commented Sep 27, 2018

Ok, done.

@codecov-io commented Sep 27, 2018

Codecov Report

Merging #2576 into master will increase coverage by 1.26%.
The diff coverage is 95.73%.


@@            Coverage Diff            @@
##           master   #2576      +/-   ##
=========================================
+ Coverage   87.23%   88.5%   +1.26%     
=========================================
  Files         380     342      -38     
  Lines       59463   57461    -2002     
=========================================
- Hits        51872   50854    -1018     
+ Misses       7521    6607     -914     
+ Partials       70       0      -70
Impacted Files Coverage Δ
cpp/src/arrow/csv/chunker.h 100% <100%> (ø)
cpp/src/arrow/csv/column-builder.h 100% <100%> (ø)
cpp/src/arrow/csv/csv-converter-test.cc 100% <100%> (ø)
cpp/src/arrow/util/task-group.h 100% <100%> (ø)
cpp/src/arrow/csv/csv-column-builder-test.cc 100% <100%> (ø)
cpp/src/arrow/util/task-group-test.cc 100% <100%> (ø)
cpp/src/arrow/csv/parser.h 100% <100%> (ø)
cpp/src/arrow/util/task-group.cc 100% <100%> (ø)
cpp/src/arrow/csv/reader.h 100% <100%> (ø)
python/pyarrow/csv.py 100% <100%> (ø)
... and 100 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d38cf86...4ae93b2.

@kou (Member) commented Oct 1, 2018

Can we merge this for 0.11.0? Or should we keep this for 0.12.0?

@pitrou (Member, Author) commented Oct 1, 2018

I'm ok with merging myself.

@pitrou (Member, Author) commented Oct 1, 2018

Rebased to make sure CI passes.

@kou (Member) commented Oct 1, 2018

Thanks!

@wesm (Member) commented Oct 1, 2018

I had planned to merge it today. I'm working on the code review over the next few hours.

@wesm (Member) left a review:

+1. Lots of work to do on this problem, but this is a fantastic start! I left a number of comments, either about design/functionality or code cleanup, but in the interest of unblocking the 0.11 release I'm going to merge this and we'll leave the remaining work to follow-up patches.

#include "arrow/util/logging.h"

#include <sstream>
#include <string>
@wesm commented:

Standard library headers should be included after the "chunker.h" but before the other local project headers
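
Assuming this is the top of chunker.cc, the suggested ordering would look like:

#include "arrow/csv/chunker.h"

#include <sstream>
#include <string>

#include "arrow/util/logging.h"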

cdef _get_reader(input_file, shared_ptr[InputStream]* out):
    cdef shared_ptr[RandomAccessFile] result
    use_memory_map = False
    get_reader(input_file, use_memory_map, &result)
@wesm commented:

The fastest CSV readers I'm familiar with all use memory mapping, FWIW. We can do our own experiments on large files. This is a good one https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

@pitrou (Author) replied:

The problem is that ReadaheadSpooler will make a copy anyway when reading from a memory-mapped file (because of padding).

More generally, I don't think it is really relevant here. Parsing CSV data will always be much slower than the speed of raw memory copies (which should be on the order of 30 GB/s from main memory).

@pitrou (Author) replied:

By the way, another motivation for not focusing on memory-mapped files is that it wouldn't help with compressed CSV files.
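
For reference, a benchmark harness could feed the reader from a memory-mapped file roughly like this (a sketch assuming the Status-returning arrow::io::MemoryMappedFile API of this era, not code from this PR):

#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/status.h"

// Sketch: open a CSV file via memory mapping, for comparison against
// buffered reads in the experiments discussed above.
arrow::Status OpenMappedCsv(const std::string& path,
                            std::shared_ptr<arrow::io::MemoryMappedFile>* out) {
  return arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ, out);
}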

}

template <bool quoting, bool escaping>
Status Chunker::ProcessSpecialized(const char* start, uint32_t size, uint32_t* out_size) {
@wesm commented:

Probably want to standardize on using int64_t for sizes, etc.

///
/// Process a block of CSV data, reading up to max_num_rows rows.
/// The number of bytes in the chunk is returned in out_size.
Status Process(const char* data, uint32_t size, uint32_t* out_size);
@wesm commented:

TODO: make virtual. We will need to define chunkers that use whitespace, or multi-character / regex delimiters eventually
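
A hedged sketch of what that virtual interface might look like (illustrative names, not the actual refactor, using int64_t sizes per the earlier comment):

#include <cstdint>

#include "arrow/status.h"

// Illustrative only: a virtual chunker interface as suggested above,
// so whitespace- or regex-delimited chunkers can be swapped in later.
class Chunker {
 public:
  virtual ~Chunker() = default;
  // Determine how many leading bytes of `data` form whole CSV rows;
  // the byte count is returned in *out_size.
  virtual arrow::Status Process(const char* data, int64_t size,
                                int64_t* out_size) = 0;
};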

// Detect a single line from the data pointer. Return the line end,
// or nullptr if the remaining line is truncated.
template <bool quoting, bool escaping>
inline const char* ReadLine(const char* data, const char* data_end);
@wesm commented:

If you make Process virtual then these methods can be tucked into the private implementation

}

// Make a BlockParser from a vector of lines representing a CSV file
void MakeCSVParser(std::vector<std::string> lines, std::shared_ptr<BlockParser>* out) {
@wesm commented:

const vector<string>&

}

// Make a BlockParser from a vector of strings representing a single CSV column
void MakeColumnParser(std::vector<std::string> items, std::shared_ptr<BlockParser>* out) {
@wesm commented:

const&
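
Concretely, the signatures suggested in the two comments above would take const references instead of by-value vectors (declaration sketch):

#include <memory>
#include <string>
#include <vector>

#include "arrow/csv/parser.h"  // for BlockParser

void MakeCSVParser(const std::vector<std::string>& lines,
                   std::shared_ptr<arrow::csv::BlockParser>* out);
void MakeColumnParser(const std::vector<std::string>& items,
                      std::shared_ptr<arrow::csv::BlockParser>* out);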

static void sleep_for(double seconds) {
  std::this_thread::sleep_for(
      std::chrono::nanoseconds(static_cast<int64_t>(seconds * 1e9)));
}
@wesm commented:

I think this function has appeared in several places now; it should be put in a utility header like arrow/test-util.h.

@pitrou (Author) replied:

Yeah, perhaps.

/// be careful to call Finish() on subgroups before calling it
/// on the main group).
// XXX if a subgroup errors out, should it propagate immediately to the parent
// and to children?
@wesm commented:

Bailing out early seems like a useful workflow to harden

@pitrou (Author) replied:

Note that subgroups aren't actually used, so perhaps we should simply rip them out. They are a source of complication in the implementation (initially, I thought I'd need them).

cdef extern from "arrow/csv/api.h" namespace "arrow::csv" nogil:

    cdef cppclass CCSVParseOptions" arrow::csv::ParseOptions":
        unsigned char delimiter
@wesm commented:

We should think ahead about how we will handle multi-char delimiters

@pitrou (Author) replied:

Well... do those appear in the wild?

@wesm replied:

Yep, they do, unfortunately. And regular expressions. This was implemented in Continuum's IOPro, for example: https://github.com/ContinuumIO/TextAdapter/blob/53138c2277cdfcf32e127251313d4f77f81050aa/textadapter/core/text_adapter.c#L1575. In pandas, multi-character/regex delimiters are implemented using an extremely slow Python parser.
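
For context, matching a multi-character delimiter in a parser hot path could look roughly like this (a sketch, not a committed design for Arrow):

#include <cstring>

// Illustrative only: test whether a multi-character delimiter starts
// at position `p`; a future parser could call this per candidate byte.
inline bool MatchDelimiter(const char* p, const char* end,
                           const char* delim, std::size_t delim_len) {
  return static_cast<std::size_t>(end - p) >= delim_len &&
         std::memcmp(p, delim, delim_len) == 0;
}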

@wesm closed this in 5ebab5a on Oct 1, 2018
@pitrou deleted the ARROW-25-csv-reader branch on October 1, 2018 11:40
@saschahofmann commented:
If I understand it correctly, the pyarrow CSV reader already reads large files in chunks? Is there a way to control this behaviour?
I would like to create a loading bar for large CSVs and thought I could simply track the chunks already read.
