Fastest way to read a csv file? #44

Closed

patricebellan opened this issue Jun 14, 2016 · 7 comments

patricebellan commented Jun 14, 2016

Hi,

I've been "playing" with xsv for a couple of days and was pleased to see it's the fastest CSV "kit" I've tried.
So I decided to give Rust and the rust-csv crate a try.

Picking code from the examples, I came up with a simple function that iterates over a file and counts the rows.

It takes about 25 seconds to run on a 7.2M-row file, which is fine.
But I was surprised that it's exactly the time it takes xsv to run the following "select" on the same file (no prior "index" was done):

time xsv select -d';' field myfile.csv > field.txt
real    0m24.803s
user    0m9.176s
sys 0m9.336s

I was expecting my loop to run faster, since it does pretty much nothing, so I'm thinking I probably missed something (once again, I'm a total Rust beginner).

Here's the code:

extern crate csv;

fn main() {
    let mut cnt = 0;
    // Read myfile.csv using ';' as the field delimiter.
    let mut rdr = csv::Reader::from_file("myfile.csv").unwrap().delimiter(b';');

    // Count records without looking at their contents.
    for _row in rdr.records() {
        cnt += 1;
    }

    println!("{}", cnt);
}

Is there any faster way to read the file?

Note: I'm not complaining about speed, I'm just surprised xsv can do much more than my simple piece of code in the same amount of time ;)

Eh2406 commented Jun 21, 2016

Did you compile in release? I don't know the real answer; this is just a knee-jerk response to "Rust beginner" and "expecting to run faster" :-)

BurntSushi (Owner) commented

@patricebellan Have you read the section in the docs about iterating over records?

I would somewhat expect xsv to run quite a bit faster than 25 seconds on a mere 7 million rows. I share @Eh2406's concerns. Instead of cargo build you might try cargo build --release.

xsv select barely does anything either. It should run within spitting distance of a simple count loop.

patricebellan (Author) commented

Nice try @Eh2406, I did compile in release ;)

I'm running it on a VM, so overall performance may not be the best.
But I was mostly concerned about comparing both, not pure performance per se.

BurntSushi added a commit that referenced this issue May 23, 2017
This commit contains a ground-up rewrite of the CSV crate. Nothing
survived. This rewrite was long overdue. Namely, the API of the previous
version was initially designed 3 years ago, which was 1 year before Rust
1.0 was released.

The big changes:

1. Use a DFA to get nearly a factor of 2 speed improvement across the board.
2. Serde support, including matching headers with names in structs.
3. A new crate, csv-core, for parsing CSV without the standard library.

The performance improvements (benchmark, old ns/iter with throughput, new ns/iter with throughput, diff, % change, speedup):

    count_game_deserialize_owned_bytes  30,404,805 (85 MB/s)   23,878,089 (108 MB/s)    -6,526,716  -21.47%   x 1.27
    count_game_deserialize_owned_str    30,431,169 (85 MB/s)   22,861,276 (113 MB/s)    -7,569,893  -24.88%   x 1.33
    count_game_iter_bytes               21,751,711 (119 MB/s)  11,873,257 (218 MB/s)    -9,878,454  -45.41%   x 1.83
    count_game_iter_str                 25,609,184 (101 MB/s)  13,769,390 (188 MB/s)   -11,839,794  -46.23%   x 1.86
    count_game_read_bytes               12,110,082 (214 MB/s)  6,686,121 (388 MB/s)     -5,423,961  -44.79%   x 1.81
    count_game_read_str                 15,497,249 (167 MB/s)  8,269,207 (314 MB/s)     -7,228,042  -46.64%   x 1.87
    count_mbta_deserialize_owned_bytes  5,779,138 (125 MB/s)   3,775,874 (191 MB/s)     -2,003,264  -34.66%   x 1.53
    count_mbta_deserialize_owned_str    5,777,055 (125 MB/s)   4,353,921 (166 MB/s)     -1,423,134  -24.63%   x 1.33
    count_mbta_iter_bytes               3,991,047 (181 MB/s)   1,805,387 (400 MB/s)     -2,185,660  -54.76%   x 2.21
    count_mbta_iter_str                 4,726,647 (153 MB/s)   2,354,842 (307 MB/s)     -2,371,805  -50.18%   x 2.01
    count_mbta_read_bytes               2,690,641 (268 MB/s)   1,253,111 (577 MB/s)     -1,437,530  -53.43%   x 2.15
    count_mbta_read_str                 3,399,631 (212 MB/s)   1,743,035 (415 MB/s)     -1,656,596  -48.73%   x 1.95
    count_nfl_deserialize_owned_bytes   10,608,513 (128 MB/s)  5,828,747 (234 MB/s)     -4,779,766  -45.06%   x 1.82
    count_nfl_deserialize_owned_str     10,612,366 (128 MB/s)  6,814,770 (200 MB/s)     -3,797,596  -35.78%   x 1.56
    count_nfl_iter_bytes                6,798,767 (200 MB/s)   2,564,448 (532 MB/s)     -4,234,319  -62.28%   x 2.65
    count_nfl_iter_str                  7,888,662 (172 MB/s)   3,579,865 (381 MB/s)     -4,308,797  -54.62%   x 2.20
    count_nfl_read_bytes                4,588,369 (297 MB/s)   1,911,120 (714 MB/s)     -2,677,249  -58.35%   x 2.40
    count_nfl_read_str                  5,755,926 (237 MB/s)   2,847,833 (479 MB/s)     -2,908,093  -50.52%   x 2.02
    count_pop_deserialize_owned_bytes   11,052,436 (86 MB/s)   8,848,364 (108 MB/s)     -2,204,072  -19.94%   x 1.25
    count_pop_deserialize_owned_str     11,054,638 (86 MB/s)   9,184,678 (104 MB/s)     -1,869,960  -16.92%   x 1.20
    count_pop_iter_bytes                6,190,345 (154 MB/s)   3,110,704 (307 MB/s)     -3,079,641  -49.75%   x 1.99
    count_pop_iter_str                  7,679,804 (124 MB/s)   4,274,842 (223 MB/s)     -3,404,962  -44.34%   x 1.80
    count_pop_read_bytes                3,898,119 (245 MB/s)   2,218,535 (430 MB/s)     -1,679,584  -43.09%   x 1.76
    count_pop_read_str                  5,195,237 (183 MB/s)   3,209,998 (297 MB/s)     -1,985,239  -38.21%   x 1.62

The rewrite/redesign was largely fueled by two things:

1. Reorganizing the API to permit performance improvements. For example,
   the lower level APIs now operate on entire records instead of
   one-field-at-a-time.
2. Fixing a large number of outstanding issues.

Fixes #16, Fixes #28, Fixes #29, Fixes #32, Fixes #33, Fixes #36,
Fixes #39, Fixes #42, Fixes #44, Fixes #46, Fixes #49, Fixes #52,
Fixes #56, Fixes #59, Fixes #67
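
For illustration, here is a minimal sketch of the new Serde support from the 1.x rewrite. The Row struct, its fields, and the file name are made-up examples rather than anything from this thread, and it assumes the serde crate with the derive feature alongside csv 1.x:

use std::error::Error;

use serde::Deserialize;

// Hypothetical record type; struct field names are matched against the CSV header row.
#[derive(Debug, Deserialize)]
struct Row {
    city: String,
    population: u64,
}

fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_path("cities.csv")?;
    for result in rdr.deserialize() {
        // The target type is inferred from the annotation on `row`.
        let row: Row = result?;
        println!("{:?}", row);
    }
    Ok(())
}

This is the header-to-struct-field matching mentioned in change 2 above.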
hakunin commented Oct 18, 2019

Just ran into this myself - debug mode reads a mere 2M rows in 18 seconds, a release build does it in 630 ms.

phiresky commented

Just in case someone else finds this, here's an overview of things to make it faster (tested on a file with four columns and one billion lines):

  1. compile with --release (huge difference, >10x perf)

  2. wrap your input in a BufReader::with_capacity(1_000_000, file). No difference for me; probably depends on where the data comes from.

  3. use .byte_records() instead of .records() if you don't need your fields as UTF-8 strings (only a minor difference for me)

  4. enable these settings in Cargo.toml (opt-level 3 is not much faster than level 2; lto = "fat" improves perf by 15%!):

     [profile.release]
     opt-level = 3
     debug = true
     lto = "fat"
    
  5. compile with RUSTFLAGS="-C target-cpu=native" (only minor difference)

  6. Instead of for result in reader.into_byte_records(), use:

    let mut record = csv::ByteRecord::new();
    while reader.read_byte_record(&mut record)? {
        // process `record` here
    }

    This doubles the performance! This is also what xsv does: https://github.com/BurntSushi/xsv/blob/3de6c04269a7d315f7e9864b9013451cd9580a08/src/cmd/select.rs#L77

With a release build and read_byte_record, the performance of using the library is the same as xsv select for me (see the combined sketch below).
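
Putting these tips together, here is a minimal sketch against the csv 1.x API; the file name, delimiter, and buffer size are placeholders, not anything specific to this thread:

use std::error::Error;
use std::fs::File;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn Error>> {
    // Placeholder input path and delimiter; the BufReader wrapper is tip 2.
    let file = File::open("myfile.csv")?;
    let buf = BufReader::with_capacity(1_000_000, file);

    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(buf);

    // Reuse one ByteRecord so no allocation happens per row (tips 3 and 6).
    let mut record = csv::ByteRecord::new();
    let mut cnt: u64 = 0;
    while rdr.read_byte_record(&mut record)? {
        cnt += 1;
    }
    println!("{}", cnt);
    Ok(())
}

Build it with cargo build --release (tip 1), optionally with the release-profile settings from tip 4. Note that csv's Reader already buffers its input internally, which is probably why the extra BufReader made no measurable difference in the list above.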

BurntSushi (Owner) commented

Thanks for adding your tip here! There is more explanation of the technique here: https://docs.rs/csv/1.1.3/csv/tutorial/index.html#amortizing-allocations

phiresky commented Jun 19, 2020

Thanks for the link. Funny, I didn't actually see that. I just searched for "performance" on the docs index page but it had no results, and I assumed the "tutorial" was more about handling different types of files etc., not about improving performance. And the docs search (obviously, I guess) didn't yield it either :)
