
[Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits #23234

Closed
asfimport opened this issue Oct 16, 2019 · 21 comments



I've realized that when I read a lot of Parquet files using pyarrow.parquet.read_table(...), my program's memory usage becomes very bloated, even though I don't keep the table objects after converting them to pandas DataFrames.

You can try this in an interactive Python shell to reproduce this problem:

from tqdm import tqdm
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'

for _ in tqdm(range(10)):
    read_table(PATH, use_threads=False, memory_map=False)
    # note: the read_table(...) result is not assigned to anything,
    # so no new objects are kept alive across iterations

During the for loop above, if you watch the memory usage (e.g. using htop), you'll see that it keeps creeping up. Either the program crashes during the 10 iterations, or, if the 10 iterations complete, the program still occupies a huge amount of memory even though no objects are kept. That memory is only released when you exit() from Python.
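For reference, a minimal sketch for watching the process's RSS from inside the loop itself, assuming the third-party psutil package is installed (not part of the original report):

import os

import psutil  # third-party: pip install psutil
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'
proc = psutil.Process(os.getpid())

for i in range(10):
    read_table(PATH, use_threads=False, memory_map=False)
    # memory_info().rss is the resident set size in bytes,
    # the same number htop reports in its RES column
    print(i, proc.memory_info().rss // 2**20, 'MiB')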

This problem means that my compute jobs using PyArrow currently need to use bigger server instances than I think is necessary, which translates to significant extra cost.

Reporter: V Luong
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-6910. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
How do you measure memory usage? "RSS"? It may very well be a false positive.


Wes McKinney / @wesm:
Likely duplicate of ARROW-6874


V Luong:
@wesm @pitrou ARROW-6874's title states that Table.to_pandas() causes the problem. But it seems from my code above that the problem starts with read_table(...) itself, even before converting to pandas DataFrames. So I'm not sure this can be called a duplicate of ARROW-6874.


Wes McKinney / @wesm:
I see. Is there something you can do to make the issue more reproducible, like one or more example files?


Joris Van den Bossche / @jorisvandenbossche:
[~MBALearnsToCode] If it might not be a duplicate, could you try to provide a reproducible example?


Wes McKinney / @wesm:
I don't think this is a bug. I wrote a script that creates a ~1 GB file, reads it in a loop, and watches the process's RSS. Here is a plot of the process's RSS over the course of 100 iterations:

[Attachment: arrow6910.png — plot of process RSS over 100 read iterations]

This suggests the behavior is internal to our allocator (jemalloc here), which retains unused heap memory to speed up future in-process allocations rather than releasing it to the operating system. I'm not an expert on these kinds of system matters; @pitrou or others would know more.
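One way to see the allocator-retention effect is to compare Arrow's own live-allocation counter with the process RSS; a minimal sketch using pyarrow.total_allocated_bytes(), which reports bytes currently allocated through the default memory pool:

import pyarrow as pa
from pyarrow.parquet import read_table

# The result is not kept, so Arrow frees its buffers immediately.
read_table('/tmp/big.snappy.parquet', use_threads=False)

# The pool's live-allocation counter is back near zero here, even
# though the process RSS (as seen in htop) remains high: the gap is
# memory the allocator retains for reuse rather than a leak.
print(pa.total_allocated_bytes())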


V Luong:
@wesm @jorisvandenbossche @pitrou I've made a Parquet data set available at s3://public-parquet-test-data/big.snappy.parquet for testing (you can do "wget http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet" if not using the AWS CLI). It's only moderately big. I repeatedly load various files thousands of times during iterative model-training jobs that last for days. In 0.14.1 my long-running jobs succeeded, but in 0.15.0 the same jobs crashed after 30 minutes to an hour. My inspection shared above indicates that memory usage grows with the number of read_table(...) calls and is not released, so long-running jobs inevitably die.


Wes McKinney / @wesm:
Can you give me an HTTPS link to download that file? I tried wget https://public-parquet-test-data.s3.amazonaws.com/big.parquet and it didn't work


Wes McKinney / @wesm:

$ aws s3 cp s3://public-parquet-test-data/big.parquet . --recursive
fatal error: Unable to locate credentials

Not a regular S3 user, sorry


V Luong:
OK, let me check again on another machine, @wesm, and let you know.


V Luong:
@wesm @jorisvandenbossche @pitrou can you try "wget http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet" now? I'll also edit the code in the description to reproduce the problem.


V Luong:
Using the code above, after just 10 iterations of reading the file with 1 thread, the program has grown to occupy 15-18 GB of memory and does not release it.


Wes McKinney / @wesm:
I can access it. I'll try to have a closer look in the next couple of days to see if I can determine what is going on.


Wes McKinney / @wesm:
I can confirm that setting the "dirty_decay_ms" jemalloc option to 0 causes memory to be released to the OS right away. This is likely to reduce application performance, but it may make sense to make this the default while allowing it to be configured at runtime. I'm working on a patch.

see http://jemalloc.net/jemalloc.3.html
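The patch that came out of this exposes the knob from Python; a minimal sketch, assuming pyarrow 0.15.1 or later built with the bundled jemalloc:

import pyarrow as pa

# Ask jemalloc to return dirty pages to the OS immediately instead of
# retaining them for future allocations (0 may cost some performance).
pa.jemalloc_set_decay_ms(0)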


V Luong:
Great, thank you a great deal @wesm!


Antoine Pitrou / @pitrou:
Issue resolved by pull request #5701.


V Luong:
@pitrou @wesm @jorisvandenbossche I'm re-testing this issue using the newly-released 0.15.1, with the following code, in an interactive Python 3.7 shell:


from pyarrow.parquet import read_table
import os
from tqdm import tqdm

PARQUET_S3_PATH = 's3://public-parquet-test-data/big.snappy.parquet'
PARQUET_HTTP_PATH = 'http://public-parquet-test-data.s3.amazonaws.com/big.snappy.parquet'
PARQUET_TMP_PATH = '/tmp/big.snappy.parquet'

os.system('wget --output-document={} {}'.format(PARQUET_TMP_PATH, PARQUET_HTTP_PATH))

for _ in tqdm(range(10)):
    read_table(
        source=PARQUET_TMP_PATH,
        columns=None,
        use_threads=False,
        metadata=None,
        use_pandas_metadata=False,
        memory_map=False,
        filesystem=None,
        filters=None)

I observe the following mysterious behavior:

  • If I don't do anything after the above for loop, the program still occupies 8-10 GB of memory and does not release it. I keep it in this idle state for a good 10-15 minutes and confirm that the memory is still occupied.

  • Then, I do something random, like "import pyarrow; print(pyarrow.__version__)" in the interactive shell, and the memory is immediately released.

    This behavior remains unintuitive to me, and it seems users still don't have firm control over the memory used by PyArrow. Each read_table(...) call does not yet seem memory-neutral by default as of 0.15.1. This means long-running iterative programs, especially ML training jobs that repeatedly load these files, will inevitably OOM.


Wes McKinney / @wesm:
What platform are you on? It's possible that background thread reclamation is not enabled in your build.

Adding import gc; gc.collect() to your scripts may not be a bad idea.
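Concretely, something like the following in the read loop; a sketch, noting that gc.collect() only helps if Python objects holding Arrow buffers linger in reference cycles:

import gc

from tqdm import tqdm
from pyarrow.parquet import read_table

PATH = '/tmp/big.snappy.parquet'

for _ in tqdm(range(10)):
    read_table(PATH, use_threads=False, memory_map=False)
    # Force a collection each iteration so any cyclic garbage holding
    # Arrow buffers is reclaimed promptly.
    gc.collect()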


Wes McKinney / @wesm:
If you can open a new JIRA for further investigation, that would be helpful; the original issue you reported is no longer present, as chronicled in the linked pull request.


V Luong:
OK @wesm, let me create a new JIRA ticket for 0.15.1.


Wes McKinney / @wesm:
The place to start will be twiddling with the jemalloc conf settings here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L48
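An alternative to tuning jemalloc is to side-step it by switching Arrow's default memory pool to the system allocator; a minimal sketch, assuming a pyarrow build that ships both pools (trading jemalloc's speed for malloc's release-to-OS behavior):

import pyarrow as pa

# Route subsequent Arrow allocations through the system allocator
# (malloc/free) instead of the bundled jemalloc.
pa.set_memory_pool(pa.system_memory_pool())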

asfimport added this to the 0.15.1 milestone on Jan 11, 2023.