Confusing memory usage with CSV reader #623

Closed
dbr opened this issue Jul 27, 2021 · 3 comments
dbr commented Jul 27, 2021

Describe the bug
I'm using arrow::csv::ReaderBuilder with something like the worldcitiespop_mil.csv file mentioned on this page.

I was experimenting with the batch size setting in a standalone script, and it impacted the RAM usage in a surprising way:

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    dbg!(&fname);
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    // Count rows, letting each batch drop as soon as it has been consumed
    let mut total = 0;
    for r in reader {
        total += r.unwrap().num_rows();
    }
    dbg!(total);
    // Delay so process RAM usage can be measured externally:
    // let mut input = String::new(); std::io::stdin().read_line(&mut input);
}

If I run it like so:

cargo +1.53 run --release -- ./worldcitiespop_mil.csv 10

..according to top | grep arrcsv the RAM usage is something like 5MB.

If I increase 10 to 100,000 the RAM usage goes to maybe 30MB. Add another zero and the RAM usage is 255MB.

Not being too familiar with arrow, I would have expected:

  1. A larger batch size might take more RAM while parsing, but give more efficient storage
  2. A smaller batch size would reduce RAM usage while parsing, but add some overhead (10% more wouldn't have surprised me)

However the opposite seems to be true, and the usage seems oddly high and, above all, unpredictable.

While making this minimal example, it occurred to me that maybe the arrow::csv::Reader itself was still being kept around and using the memory, rather than the Vec<RecordBatch> - so I refactored the reading into a function that returns the collected batches, meaning the reader itself should have been dropped..

..but even more surprisingly, the memory usage drastically increased:

use arrow::record_batch::RecordBatch;
use arrow::error::ArrowError;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f).unwrap();
    // Collect every batch into a Vec; the reader is dropped on return
    reader.collect()
}

fn main() {
    let batches = hmm();
    let mut total = 0;
    for r in batches {
        total += r.unwrap().num_rows();
    }
    dbg!(total);
    //let mut input = String::new(); std::io::stdin().read_line(&mut input);
}

With this change:

  • With batch size of 1000, the RAM usage is now about 80MB (much higher than the ~5MB before)
  • With batch size of 1,000,000 the RAM usage is slightly higher (255MB -> 300MB)
  • With very small batch size of 10, the RAM usage is about 630MB?!

To Reproduce

  1. Create an empty project with main.rs as one of my terrible lumps of code above. The only dependency is arrow = "5.0.0" (see the Cargo.toml sketch below)
  2. Run the example with cargo +1.53 run --release -- ./worldcitiespop_mil.csv 1000 etc.
  3. Monitor RAM usage somehow (I was using the output from top | grep ... - hence the stdin-reading line in the code)
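
For reference, a minimal Cargo.toml matching the setup described above (the package name arrcsv is taken from the top | grep arrcsv command earlier; the edition line is an assumption):

[package]
name = "arrcsv"        # matches the binary name grepped for above
version = "0.1.0"
edition = "2018"       # assumed; any edition Rust 1.51+ supports works

[dependencies]
arrow = "5.0.0"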

Expected behavior
Mostly covered above - but basically I'd expect the memory usage with all of these combinations to be "quite similar"

Additional context
I've not used arrow much, so it's very much possible I'm doing something strange or incorrect!

Versions of stuff:

  • Linux (Debian Buster)
  • arrow 4.2 with Rust 1.51
  • Also: arrow 5.0 with Rust 1.53
dbr added the bug label Jul 27, 2021
@Dandandan

The CSV parser reuses some allocations from one batch to the next, to reduce allocation churn and time. In general this can increase memory usage a bit, since allocations from previous batches are kept around.

However, with a very small batch size of 10 that isn't what causes the high memory usage - the data and metadata around each individual RecordBatch is: every batch carries a schema with field names, various pointers to the data, etc., and at a very small batch size that bookkeeping makes up most of the memory. And when you store them in a Vec instead of iterating over them (where they are dropped), you keep them all in memory, which I expect consumes the most.

So generally:

  • Use a batch size of some 1000s, so you have less metadata overhead and make use of the columnar Arrow format
  • If you don't have to store the batches in a Vec, don't - iterate over them as in your first example (see the sketch below)
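
A minimal sketch of the two access patterns, reusing the ReaderBuilder setup from the examples above (the open_reader helper, the path, and the 4,096 batch size are illustrative assumptions, not part of the issue):

use arrow::csv;
use std::fs::File;

// Hypothetical helper mirroring the ReaderBuilder setup from the
// examples above.
fn open_reader(path: &str, batch_size: usize) -> csv::Reader<File> {
    csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(File::open(path).unwrap())
        .unwrap()
}

fn main() {
    // Streaming: each RecordBatch is dropped at the end of its loop
    // iteration, so peak memory is roughly one batch plus parser buffers.
    let mut total = 0;
    for batch in open_reader("./worldcitiespop_mil.csv", 4_096) {
        total += batch.unwrap().num_rows();
    }
    dbg!(total);

    // Collecting: every batch (column data plus per-batch schema and
    // pointers) stays alive until the Vec is dropped, so memory grows
    // with the whole file, and a tiny batch size multiplies the
    // per-batch metadata overhead.
    let batches: Vec<_> = open_reader("./worldcitiespop_mil.csv", 4_096).collect();
    let total2: usize = batches
        .iter()
        .map(|b| b.as_ref().unwrap().num_rows())
        .sum();
    dbg!(total2);
}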

dbr commented Jul 28, 2021

When you store them in a Vec instead of iterating over them (where they are dropped), you keep them all in memory

Ahh, I think this is where most of my confusion was coming from - I should have had something after the read_line that re-iterated over the batches, to be sure they hadn't been dropped yet.

The only bit that remains a mystery to me is: why does a giant batch size cause the process to use so much RAM?

With the tweaked example:

use arrow::record_batch::RecordBatch;
use arrow::error::ArrowError;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(5_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f).unwrap();
    reader.collect()
}

fn main() {
    let batches = hmm();
    let mut total = 0;
    let mut total_bytes = 0;
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(total);
    dbg!(total_bytes);

    // Delay to measure process RAM usage
    let mut input = String::new();
    std::io::stdin().read_line(&mut input);

    // Iterate again, purely to keep `batches` alive past the read_line
    // (note this double-counts into the running totals)
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(2, total);
}

..I get the following results:

batch size    process RAM    sum of get_array_memory_size
1,000,000     357.7m         96,186,112
500,000       232.1m         96,187,648
50,000        93.4m          83,143,168
5,000         93.8m          97,396,736
500           124.3m         102,904,832

The size reported by get_array_memory_size matches what you say exactly - an overly small batch size starts to introduce some overhead from the duplicated references and so on - but it's a pretty small difference (it varies by <10%, which seems perfectly reasonable).

However the process RAM does the inverse of what I'd expect - it's as if something is leaking from the parser, or an array is being over-allocated, or something like that?

dbr commented Jul 29, 2021

I ran the example code under the memory-profiler tool, with a huge batch size:

[graphs of memory allocation/fragmentation]

If I understand right, this explains the remaining mystery (why a giant batch size causes the process to use lots of memory). In my "ELI5-level" knowledge of memory allocators:

  • With a huge batch size, the reader allocates large contiguous chunks of memory as part of parsing
  • The RecordBatches are allocated while the parsing is happening, so they have to "fit around" those allocations
  • When the buffers for the reader are deallocated, they leave big "holes" in the allocated memory
  • Since Rust won't magically rearrange items in memory, the process ends up holding all that memory, holes and all
  • The example script is especially bad, since there are no subsequent allocations that might reuse some of those holes (one way to test this is sketched after the list)
  • With a smaller batch size, the "holes" created by the parser are much smaller, so the overhead is insignificant

I might try to make a PR adding some basic docs to the with_batch_size methods when I have time, to incorporate some of the advice above (a sketch of what that note could say is below) - but otherwise I think this issue can be closed, as it seems to be working "as intended".
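
For what it's worth, a sketch of the kind of note such a doc comment could carry (the wording and the surrounding builder skeleton are mine, not from any actual PR; arrow's real ReaderBuilder has more fields than shown):

pub struct ReaderBuilder {
    batch_size: usize,
    // ...other builder fields elided...
}

impl ReaderBuilder {
    /// Set the number of rows read per RecordBatch.
    ///
    /// Very small batch sizes multiply the per-batch schema/pointer
    /// overhead, while very large ones make the parser hold big transient
    /// buffers that can fragment the heap even after they are freed.
    /// A few thousand rows per batch is usually a good starting point.
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }
}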

Thanks @Dandandan !

dbr closed this as completed Jul 29, 2021