Conversion from one dataset to another that will not fit in memory? #12653
Comments
Hi @eitsupi, depending on your memory restrictions, you may need to control the batch size (how many rows are loaded at once) on the scanner:

```python
import pyarrow.dataset as ds

input_dataset = ds.dataset("input")
scanner = input_dataset.scanner(batch_size=100_000)  # default is 1_000_000
ds.write_dataset(scanner.to_reader(), "output", format="parquet")
```

Does that help in your use case?
At the moment we generally use too much memory when scanning Parquet. This is because the scanner's readahead is unfortunately based on the row group size and not the batch size. Using smaller row groups in your source files will help. #12228 changes the readahead to be based on the batch size, but it's been on my back burner for a bit. I'm still optimistic I will get to it for the 8.0.0 release.
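As a stopgap, here is a minimal sketch of rewriting one source Parquet file with smaller row groups before scanning, so the per-row-group readahead touches less data at a time. The file paths and the 50,000-row group size below are only illustrative assumptions:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical paths: rewrite a single input file so each row group holds
# roughly 50,000 rows instead of one very large group.
source = pq.ParquetFile("input/part-0.parquet")
with pq.ParquetWriter("input_small_groups/part-0.parquet", source.schema_arrow) as writer:
    for batch in source.iter_batches(batch_size=50_000):
        # Each chunk is written as its own (smaller) row group.
        writer.write_table(pa.Table.from_batches([batch]))
```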
Thank you both. I will wait for the 8.0.0 release to try this again.
I tried pyarrow 8.0.0 and unfortunately it still crashes. |
Finding the same thing. Running in a container, the memory usage increases to the maximum and the process eventually crashes.

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

tsv_directory_path = Path("/dir/with/tsv")
read_schema = pa.schema([...])

input_tsv_dataset = ds.dataset(
    tsv_directory_path,
    read_schema,
    format=ds.CsvFileFormat(
        parse_options=csv.ParseOptions(delimiter="\t", quote_char=False)
    ),
)
scanner = input_tsv_dataset.scanner(batch_size=100)
ds.write_dataset(
    scanner,
    "output_directory.parquet",
    format="parquet",
    max_rows_per_file=10000,
    max_rows_per_group=10000,
    min_rows_per_group=10,
)
```

Using the …
I am noticing the same issue with pyarrow 8.0.0. Memory usage steadily increases to over 10 GB while reading batches from a 15 GB Parquet file, even with batch size 1. The rows vary a fair bit in size in this dataset, but not enough to require that much RAM. For what it's worth, I've found that passing …
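For reference, a minimal sketch of the kind of batch-wise read described above, printing the Arrow allocator's usage as it grows; the file name is a placeholder and the reporting interval is arbitrary:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder path: scan a large Parquet file batch by batch and watch
# how much memory the Arrow allocator is holding.
dataset = ds.dataset("large_file.parquet", format="parquet")
for i, batch in enumerate(dataset.to_batches(batch_size=1)):
    if i % 100_000 == 0:
        print(f"batch {i}: {pa.total_allocated_bytes() / 1024**2:.0f} MiB allocated")
```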
After finding the following description in the documentation, I tried scanning a dataset larger than memory and writing it to another dataset.
https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data
But both Python and R on Windows crashed due to lack of memory. Am I missing something?
Is there a recommended way to convert one dataset to another without running out of computer memory?
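Roughly, the attempted conversion follows the pattern shown in that documentation section; the paths below are placeholders, not the ones actually used:

```python
import pyarrow.dataset as ds

# Placeholder paths: scan the source dataset and stream it into a new one,
# as described in the "Writing large amounts of data" documentation section.
input_dataset = ds.dataset("source_dataset_dir", format="parquet")
ds.write_dataset(input_dataset, "destination_dataset_dir", format="parquet")
```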