ARROW-3404: [C++] Make CSV chunker faster #2684

Closed
wants to merge 1 commit into from

pitrou
Member

@pitrou pitrou commented Oct 2, 2018

This speeds up multi-threaded CSV reads by removing the serial bottleneck of detecting newlines when no newlines can be present in CSV values.

On this machine (8-core AMD processor), the speed of reading a CSV file of text column data goes from 600 MB/s (before this patch) to 1.2 GB/s.
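
As a rough illustration of why the chunking step becomes cheap under this assumption, here is a minimal sketch. It is not Arrow's actual chunker: `CompleteLinesLength` is a hypothetical helper. When no value can contain CR or LF, every line-ending byte is a row boundary, so the largest prefix of a block made of complete rows can be found with a single backward search instead of a quote-tracking scan over every byte.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of Arrow's API.
// Returns the length of the longest prefix of `block` that consists of
// complete lines, ASSUMING that no CSV value contains CR or LF.
// Simplification: a CRLF pair that straddles two blocks is not handled.
std::size_t CompleteLinesLength(std::string_view block) {
  // Every CR or LF is a row boundary under the assumption above, so the
  // last one in the block marks the end of the last complete row.
  const std::size_t pos = block.find_last_of("\r\n");
  return pos == std::string_view::npos ? 0 : pos + 1;
}
```

The bytes after the returned offset would be carried over and prepended to the next block, so each chunk handed to a parsing thread starts and ends on a row boundary.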

Member

@wesm wesm left a comment

+1

@@ -40,6 +40,8 @@ struct ARROW_EXPORT ParseOptions {
   bool escaping = false;
   // Escaping character (if `escaping` is true)
   char escape_char = '\\';
+  // Whether values are allowed to contain CR (0x0d) and LF (0x0a) characters
+  bool newlines_in_values = false;
Member

I initially thought the default for this should be "true", but I would wager that more than 90% of CSV files do not have embedded newlines, so someone who has such a "special" file can opt in to this additional logic.

Member Author

We can change the default anyway; I just thought it would be nice to have good numbers out of the box :-)
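
For contrast, the "additional logic" that a file with embedded newlines opts in to can be sketched as follows. This is an assumption-laden illustration, not Arrow's parser: `LastRowEndQuoteAware` is a hypothetical name, escaping is ignored, and the block is assumed to start on a row boundary (the dependency on earlier data that makes this step serial).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of Arrow's API.
// Returns the offset just past the last row boundary in `block`, treating
// CR and LF inside quoted values as data rather than as row terminators.
// Assumes `block` starts at a row boundary and that escaping is disabled.
std::size_t LastRowEndQuoteAware(std::string_view block, char quote = '"') {
  bool in_quotes = false;
  std::size_t end = 0;
  for (std::size_t i = 0; i < block.size(); ++i) {
    const char c = block[i];
    if (c == quote) {
      // A doubled quote ("") toggles twice, leaving the state unchanged.
      in_quotes = !in_quotes;
    } else if (!in_quotes && (c == '\n' || c == '\r')) {
      end = i + 1;  // a real row boundary
    }
  }
  return end;
}
```

For the input `a,"x\ny"\nb,"p\nq` (where `\n` denotes a line feed), this scan reports a boundary at offset 8, after the first complete row. A plain search for the last line feed would instead report offset 13, cutting the second row in the middle of a quoted value, which is why the faster path is only safe when values cannot contain newlines.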

@wesm
Member

wesm commented Oct 2, 2018

Very nice speedup!

@wesm wesm closed this in a978786 Oct 2, 2018
@pitrou pitrou deleted the ARROW-3404-faster-csv-chunker branch October 2, 2018 21:02