Confusing memory usage with CSV reader #623

Closed
dbr opened this issue Jul 27, 2021 · 3 comments
dbr commented Jul 27, 2021

Describe the bug
I'm using arrow::csv::ReaderBuilder with something like the worldcitiespop_mil.csv file mentioned on this page.

I was experimenting with the batch size setting in a standalone script, and it impacted the RAM usage in a surprising way:

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    dbg!(&fname);
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    // Count rows, letting each batch drop as soon as it has been consumed
    let mut total = 0;
    for r in reader {
        total += r.unwrap().num_rows();
    }
    dbg!(total);
    // Delay so process RAM usage can be measured externally:
    // let mut input = String::new(); std::io::stdin().read_line(&mut input);
}

If I run it like so:

cargo +1.53 run --release -- ./worldcitiespop_mil.csv 10

..according to top | grep arrcsv the RAM usage is something like 5MB.

If I increase 10 to 100,000 the RAM usage goes to maybe 30MB. Add another zero and the RAM usage is 255MB.

Not being too familiar with arrow, I would have expected:

  1. A larger batch size might take more RAM while parsing, but give more efficient storage
  2. A smaller batch size would reduce RAM usage while parsing, but add some overhead (10% more wouldn't have surprised me)

However the opposite seems to be true, and the usage seems oddly high and, above all, unpredictable.

While making this minimal example, it occurred to me that maybe the arrow::csv::Reader itself was still being kept around and using the memory, rather than the Vec<RecordBatch> - so I refactored the reading into a function that returns the collected batches, meaning the reader itself should have been dropped..

..but even more surprisingly, the memory usage drastically increased:

use arrow::record_batch::RecordBatch;
use arrow::error::ArrowError;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f).unwrap();
    // Collect every batch into a Vec; the reader is dropped on return
    reader.collect()
}

fn main() {
    let batches = hmm();
    let mut total = 0;
    for r in batches {
        total += r.unwrap().num_rows();
    }
    dbg!(total);
    //let mut input = String::new(); std::io::stdin().read_line(&mut input);
}

With this change:

  • With batch size of 1000, the RAM usage is now about 80MB (much higher than the ~5MB before)
  • With batch size of 1,000,000 the RAM usage is slightly higher (255MB -> 300MB)
  • With very small batch size of 10, the RAM usage is about 630MB?!

To Reproduce

  1. Create an empty project with main.rs as one of my terrible lumps of code above. The only dependency is arrow = "5.0.0" (see the Cargo.toml sketch below)
  2. Run the example with cargo +1.53 run --release -- ./worldcitiespop_mil.csv 1000 etc.
  3. Monitor RAM usage somehow (I was using the output from top | grep ... - hence the stdin-reading line in the code)
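
For reference, a minimal Cargo.toml matching the setup described above (the package name arrcsv is taken from the top | grep arrcsv command earlier; the edition line is an assumption):

[package]
name = "arrcsv"        # matches the binary name grepped for above
version = "0.1.0"
edition = "2018"       # assumed; any edition Rust 1.51+ supports works

[dependencies]
arrow = "5.0.0"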

Expected behavior
Mostly covered above - but basically I'd expect the memory usage with all of these combinations to be "quite similar"

Additional context
I've not used arrow much, so it's very much possible I'm doing something strange or incorrect!

Versions of stuff:

  • Linux (Debian Buster)
  • arrow 4.2 with Rust 1.51
  • Also: arrow 5.0 with Rust 1.53
dbr added the bug label Jul 27, 2021
@Dandandan

The CSV parser reuses some allocations from one batch to the next, to reduce allocation churn and time. In general this can increase memory usage a bit, since allocations from previous batches are kept around.

However, with a very small batch size of 10 that isn't what causes the high memory usage - the data and metadata around each individual RecordBatch is: every batch carries a schema with field names, various pointers to the data, etc., and at a very small batch size that bookkeeping makes up most of the memory. And when you store them in a Vec instead of iterating over them (where they are dropped), you keep them all in memory, which I expect consumes the most.

So generally:

  • Use a batch size of some 1000s, so you have less metadata overhead and make use of the columnar Arrow format
  • If you don't have to store the batches in a Vec, don't - iterate over them as in your first example (see the sketch below)
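
A minimal sketch of the two access patterns, reusing the ReaderBuilder setup from the examples above (the open_reader helper, the path, and the 4,096 batch size are illustrative assumptions, not part of the issue):

use arrow::csv;
use std::fs::File;

// Hypothetical helper mirroring the ReaderBuilder setup from the
// examples above.
fn open_reader(path: &str, batch_size: usize) -> csv::Reader<File> {
    csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(File::open(path).unwrap())
        .unwrap()
}

fn main() {
    // Streaming: each RecordBatch is dropped at the end of its loop
    // iteration, so peak memory is roughly one batch plus parser buffers.
    let mut total = 0;
    for batch in open_reader("./worldcitiespop_mil.csv", 4_096) {
        total += batch.unwrap().num_rows();
    }
    dbg!(total);

    // Collecting: every batch (column data plus per-batch schema and
    // pointers) stays alive until the Vec is dropped, so memory grows
    // with the whole file, and a tiny batch size multiplies the
    // per-batch metadata overhead.
    let batches: Vec<_> = open_reader("./worldcitiespop_mil.csv", 4_096).collect();
    let total2: usize = batches
        .iter()
        .map(|b| b.as_ref().unwrap().num_rows())
        .sum();
    dbg!(total2);
}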

dbr commented Jul 28, 2021

When you store them in a Vec instead of iterating over them (where they are dropped), you keep them all in memory

Ahh, I think this is where most of my confusion was coming from - I should have had something after the read_line that re-iterated over the batches, to be sure they hadn't been dropped yet.

The only bit that remains a mystery to me is: why does a giant batch size cause the process to use so much RAM?

With the tweaked example:

use arrow::record_batch::RecordBatch;
use arrow::error::ArrowError;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(5_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f).unwrap();
    reader.collect()
}

fn main() {
    let batches = hmm();
    let mut total = 0;
    let mut total_bytes = 0;
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(total);
    dbg!(total_bytes);

    // Delay to measure process RAM usage
    let mut input = String::new();
    std::io::stdin().read_line(&mut input);

    // Iterate again, purely to keep `batches` alive past the read_line
    // (note this double-counts into the running totals)
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(2, total);
}

..I get the following results:

batch size    process RAM    sum of get_array_memory_size
1,000,000     357.7m         96,186,112
500,000       232.1m         96,187,648
50,000        93.4m          83,143,168
5,000         93.8m          97,396,736
500           124.3m         102,904,832

The size reported by get_array_memory_size matches what you say exactly - an overly small batch size starts to introduce some overhead from the duplicated references and so on - but it's a pretty small difference (it varies by <10%, which seems perfectly reasonable).

However the process RAM does the inverse of what I'd expect - it's as if something is leaking from the parser, or an array is being over-allocated, or something like that?

dbr commented Jul 29, 2021

I ran the example code under the memory-profiler tool, with a huge batch size:

[graphs of memory allocation/fragmentation]

If I understand right, this explains the remaining mystery (why a giant batch size causes the process to use lots of memory). In my "ELI5-level" knowledge of memory allocators:

  • With a huge batch size, the reader allocates large contiguous chunks of memory as part of parsing
  • The RecordBatches are allocated while the parsing is happening, so they have to "fit around" those allocations
  • When the buffers for the reader are deallocated, they leave big "holes" in the allocated memory
  • Since Rust won't magically rearrange items in memory, the process ends up holding all that memory, holes and all
  • The example script is especially bad, since there are no subsequent allocations that might reuse some of those holes (one way to test this is sketched after the list)
  • With a smaller batch size, the "holes" created by the parser are much smaller, so the overhead is insignificant

I might try to make a PR adding some basic docs to the with_batch_size methods when I have time, to incorporate some of the advice above (a sketch of what that note could say is below) - but otherwise I think this issue can be closed, as it seems to be working "as intended".
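
For what it's worth, a sketch of the kind of note such a doc comment could carry (the wording and the surrounding builder skeleton are mine, not from any actual PR; arrow's real ReaderBuilder has more fields than shown):

pub struct ReaderBuilder {
    batch_size: usize,
    // ...other builder fields elided...
}

impl ReaderBuilder {
    /// Set the number of rows read per RecordBatch.
    ///
    /// Very small batch sizes multiply the per-batch schema/pointer
    /// overhead, while very large ones make the parser hold big transient
    /// buffers that can fragment the heap even after they are freed.
    /// A few thousand rows per batch is usually a good starting point.
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }
}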

Thanks @Dandandan !

dbr closed this as completed Jul 29, 2021