
[Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits #23234

Closed
asfimport opened this issue Oct 16, 2019 · 21 comments



I've realized that when I read a lot of Parquet files using pyarrow.parquet.read_table(...), my program's memory usage becomes very bloated, even though I don't keep the table objects after converting them to pandas DataFrames.

You can try this in an interactive Python shell to reproduce this problem:

from tqdm import tqdm
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'

for _ in tqdm(range(10)):
    read_table(PATH, use_threads=False, memory_map=False)
    # note: the read_table(...) result is not assigned to anything,
    # so no new objects are kept alive across iterations

During the for loop above, if you watch the memory usage (e.g. using htop), you'll see that it keeps creeping up. Either the program crashes during the 10 iterations, or, if the 10 iterations complete, the program still occupies a huge amount of memory even though no objects are kept. That memory is only released when you exit() from Python.
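For reference, a minimal sketch for watching the process's RSS from inside the loop itself, assuming the third-party psutil package is installed (not part of the original report):

import os

import psutil  # third-party: pip install psutil
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'
proc = psutil.Process(os.getpid())

for i in range(10):
    read_table(PATH, use_threads=False, memory_map=False)
    # memory_info().rss is the resident set size in bytes,
    # the same number htop reports in its RES column
    print(i, proc.memory_info().rss // 2**20, 'MiB')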

This problem means that my compute jobs using PyArrow currently need to use bigger server instances than I think is necessary, which translates to significant extra cost.

Reporter: V Luong
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-6910. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
How do you measure memory usage? "RSS"? It may very well be a false positive.


Wes McKinney / @wesm:
Likely duplicate of ARROW-6874


V Luong:
@wesm @pitrou ARROW-6874's title states that Table.to_pandas() causes the problem. But it seems from my code above that the problem starts with read_table(...) itself, even before converting to pandas DataFrames. So I'm not sure this can be called a duplicate of ARROW-6874.


Wes McKinney / @wesm:
I see. Is there something you can do to make the issue more reproducible, like one or more example files?


Joris Van den Bossche / @jorisvandenbossche:
[~MBALearnsToCode] If it might not be a duplicate, could you try to provide a reproducible example?


Wes McKinney / @wesm:
I don't think this is a bug. I wrote a script that creates a ~1 GB file, reads it in a loop, and watches the process's RSS. Here is a plot of the process's RSS over the course of 100 iterations:

[Attachment: arrow6910.png — plot of process RSS over 100 read iterations]

This suggests the behavior is internal to our allocator (jemalloc here), which retains unused heap memory to speed up future in-process allocations rather than releasing it to the operating system. I'm not an expert on these kinds of system matters; @pitrou or others would know more.
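One way to see the allocator-retention effect is to compare Arrow's own live-allocation counter with the process RSS; a minimal sketch using pyarrow.total_allocated_bytes(), which reports bytes currently allocated through the default memory pool:

import pyarrow as pa
from pyarrow.parquet import read_table

# The result is not kept, so Arrow frees its buffers immediately.
read_table('/tmp/big.snappy.parquet', use_threads=False)

# The pool's live-allocation counter is back near zero here, even
# though the process RSS (as seen in htop) remains high: the gap is
# memory the allocator retains for reuse rather than a leak.
print(pa.total_allocated_bytes())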


V Luong:
@wesm @jorisvandenbossche @pitrou I've made a Parquet data set available at s3://public-parquet-test-data/big.snappy.parquet for testing (you can do "wget http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet" if not using the AWS CLI). It's only moderately big. I repeatedly load various files thousands of times during iterative model-training jobs that last for days. In 0.14.1 my long-running jobs succeeded, but in 0.15.0 the same jobs crashed after 30 minutes to an hour. My inspection shared above indicates that memory usage grows with the number of read_table(...) calls and is not released, so long-running jobs inevitably die.


Wes McKinney / @wesm:
Can you give me an HTTPS link to download that file? I tried wget https://public-parquet-test-data.s3.amazonaws.com/big.parquet and it didn't work


Wes McKinney / @wesm:

$ aws s3 cp s3://public-parquet-test-data/big.parquet . --recursive
fatal error: Unable to locate credentials

Not a regular S3 user, sorry


V Luong:
OK, let me check again on another machine, @wesm, and let you know.


V Luong:
@wesm @jorisvandenbossche @pitrou can you try "wget http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet" now? I'll also edit the code in the description to reproduce the problem.


V Luong:
Using the code above, after just 10 iterations of reading the file with 1 thread, the program has grown to occupy 15-18 GB of memory and does not release it.


Wes McKinney / @wesm:
I can access it. I'll try to have a closer look in the next couple of days to see if I can determine what is going on.


Wes McKinney / @wesm:
I can confirm that setting the "dirty_decay_ms" jemalloc option to 0 causes memory to be released to the OS right away. This is likely to reduce application performance, but it may make sense to make this the default while allowing it to be configured at runtime. I'm working on a patch.

see http://jemalloc.net/jemalloc.3.html
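The patch that came out of this exposes the knob from Python; a minimal sketch, assuming pyarrow 0.15.1 or later built with the bundled jemalloc:

import pyarrow as pa

# Ask jemalloc to return dirty pages to the OS immediately instead of
# retaining them for future allocations (0 may cost some performance).
pa.jemalloc_set_decay_ms(0)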


V Luong:
Great, thank you a great deal @wesm!


Antoine Pitrou / @pitrou:
Issue resolved by pull request #5701.


V Luong:
@pitrou @wesm @jorisvandenbossche I'm re-testing this issue using the newly-released 0.15.1, with the following code, in an interactive Python 3.7 shell:


from pyarrow.parquet import read_table
import os
from tqdm import tqdm

PARQUET_S3_PATH = 's3://public-parquet-test-data/big.snappy.parquet'
PARQUET_HTTP_PATH = 'http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet'
PARQUET_TMP_PATH = '/tmp/big.snappy.parquet'

os.system('wget --output-document={} {}'.format(PARQUET_TMP_PATH, PARQUET_HTTP_PATH))

for _ in tqdm(range(10)):
    read_table(
        source=PARQUET_TMP_PATH,
        columns=None,
        use_threads=False,
        metadata=None,
        use_pandas_metadata=False,
        memory_map=False,
        filesystem=None,
        filters=None)

I observe the following mysterious behavior:

  • If I don't do anything after the above for loop, the program still occupies 8-10 GB of memory and does not release it. I keep it in this idle state for a good 10-15 minutes and confirm that the memory is still occupied.

  • Then, I do something random, like "import pyarrow; print(pyarrow.__version__)" in the interactive shell, and the memory is immediately released.

    This behavior remains unintuitive to me, and it seems users still don't have firm control over the memory used by PyArrow. Each read_table(...) call does not yet seem memory-neutral by default as of 0.15.1. This means long-running iterative programs, especially ML training jobs that repeatedly load these files, will inevitably OOM.


Wes McKinney / @wesm:
What platform are you on? It's possible that background thread reclamation is not enabled in your build.

Adding import gc; gc.collect() to your scripts may not be a bad idea.
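Concretely, something like the following in the read loop; a sketch, noting that gc.collect() only helps if Python objects holding Arrow buffers linger in reference cycles:

import gc

from tqdm import tqdm
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'

for _ in tqdm(range(10)):
    read_table(PATH, use_threads=False, memory_map=False)
    # Force a collection each iteration so any cyclic garbage holding
    # Arrow buffers is reclaimed promptly.
    gc.collect()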


Wes McKinney / @wesm:
If you can open a new JIRA for further investigation, that would be helpful; the original issue you reported is no longer present, as chronicled in the linked pull request.


V Luong:
OK @wesm, let me create a new JIRA ticket for 0.15.1.


Wes McKinney / @wesm:
The place to start will be twiddling with the jemalloc conf settings here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L48
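An alternative to tuning jemalloc is to side-step it by switching Arrow's default memory pool to the system allocator; a minimal sketch, assuming a pyarrow build that ships both pools (trading jemalloc's speed for malloc's release-to-OS behavior):

import pyarrow as pa

# Route subsequent Arrow allocations through the system allocator
# (malloc/free) instead of the bundled jemalloc.
pa.set_memory_pool(pa.system_memory_pool())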

asfimport added this to the 0.15.1 milestone on Jan 11, 2023.