Conversion from one dataset to another that will not fit in memory? #12653

Open
eitsupi opened this issue Mar 17, 2022 · 6 comments
eitsupi commented Mar 17, 2022

After finding the following description in the documentation, I tried scanning a dataset larger than memory and writing it to another dataset.

https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data

The above examples wrote data from a table. If you are writing a large amount of data you may not be able to load everything into a single in-memory table. Fortunately, the write_dataset() method also accepts an iterable of record batches. This makes it really simple, for example, to repartition a large dataset without loading the entire dataset into memory:

# Python
import pyarrow.dataset as ds

input_dataset = ds.dataset("input")
ds.write_dataset(input_dataset.scanner(), "output", format="parquet")

# R
arrow::open_dataset("input") |>
  arrow::write_dataset("output")

But both the Python and R versions crashed on Windows due to lack of memory. Am I missing something?
Is there a recommended way to convert one dataset to another without running out of computer memory?

@wjones127
Member

Hi @eitsupi ,

Depending on your memory restrictions, you may need to control the batch size (how many rows are loaded at once) on the scanner:

import pyarrow.dataset as ds

input_dataset = ds.dataset("input")
scanner = input_dataset.scanner(batch_size=100_000)  # default is 1_000_000
ds.write_dataset(scanner.to_reader(), "output", format="parquet")

Does that help in your use case?

@westonpace
Member

At the moment we generally use too much memory when scanning parquet. This is because the scanner's readahead is unfortunately based on the row group size and not the batch size. Using smaller row groups in your source files will help. #12228 changes the readahead to be based on the batch size but it's been on my back burner for a bit. I'm still optimistic I will get to it for the 8.0.0 release.
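For anyone who wants to try the smaller-row-group workaround in the meantime, here is a minimal sketch (the paths are placeholders, and it assumes each individual source file still fits in memory) that rewrites one Parquet file with smaller row groups:

import pyarrow.parquet as pq

# Hypothetical paths; repeat for each file in the source dataset.
# row_group_size caps the rows per row group, so the scanner's
# row-group-based readahead holds less data in memory at once.
table = pq.read_table("input/part-0.parquet")
pq.write_table(table, "input_small_row_groups/part-0.parquet", row_group_size=100_000)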

eitsupi commented Mar 18, 2022

Thank you both.
I tried lowering the batch size to 1000 in Python, but it still consumed over 3GB of memory and crashed.

I will wait for the 8.0.0 release to try this again.

eitsupi commented May 10, 2022

I tried pyarrow 8.0.0 and unfortunately it still crashes.

@willbowditch

I'm seeing the same thing in pyarrow 8.0.0 when converting from CSV to Parquet - I've tried various batch sizes on the scanner and various min/max rows-per-group settings on the writer.

Running in a container, memory usage climbs to the limit and the process eventually crashes.

from pathlib import Path

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
tsv_directory_path = Path("/dir/with/tsv")

read_schema = pa.schema([...])


input_tsv_dataset = ds.dataset(
    tsv_directory_path,
    read_schema,
    format=ds.CsvFileFormat(
        parse_options=csv.ParseOptions(delimiter="\t", quote_char=False)
    ),
)


scanner = input_tsv_dataset.scanner(batch_size=100)

ds.write_dataset(
    scanner,
    "output_directory.parquet",
    format="parquet",
    max_rows_per_file=10000,
    max_rows_per_group=10000,
    min_rows_per_group=10,
)

Using csv.open_csv and pq.ParquetWriter to write batches works fine, but it results in a single large output file.
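For reference, a minimal sketch of that single-file streaming approach (paths are placeholders, and the parse options mirror the TSV settings above): csv.open_csv yields record batches incrementally, and pq.ParquetWriter appends them to one output file.

import pyarrow.csv as csv
import pyarrow.parquet as pq

# Hypothetical input path; parse options mirror the TSV settings above.
reader = csv.open_csv(
    "/dir/with/tsv/data.tsv",
    parse_options=csv.ParseOptions(delimiter="\t", quote_char=False),
)

# Stream record batches straight into a single Parquet file.
with pq.ParquetWriter("output.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)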

@VHellendoorn

I am noticing the same issue with pyarrow 8.0.0. Memory usage steadily increases to over 10GB while reading batches from a 15GB Parquet file, even with batch size 1. The rows vary a fair bit in size in this dataset, but not enough to require that much RAM.

For what it's worth, I've found that passing use_threads=False as an argument to scanner keeps the memory footprint from growing as large (it stays under ~3GB in this case, though it still fluctuates a fair bit); I noticed that this implicitly disables both batch and fragment readahead here. The performance penalty isn't particularly large, especially with bigger batch sizes, so this may be a temporary workaround for those wishing to keep memory usage low.
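For reference, a minimal sketch of that workaround (dataset paths are placeholders); it is the same scanner-to-writer pipeline as above, just with use_threads=False:

import pyarrow.dataset as ds

input_dataset = ds.dataset("input")
# use_threads=False implicitly disables batch and fragment readahead,
# trading some throughput for a smaller peak memory footprint.
scanner = input_dataset.scanner(batch_size=100_000, use_threads=False)
ds.write_dataset(scanner.to_reader(), "output", format="parquet")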
