[Python] pyarrow.parquet.read_table(...) takes up lots of memory which is not released until program exits #23234
Comments
Antoine Pitrou / @pitrou:
Wes McKinney / @wesm:
V Luong:
Wes McKinney / @wesm:
Joris Van den Bossche / @jorisvandenbossche:
Wes McKinney / @wesm: This suggests it's related to the internal behavior of our allocator (jemalloc here), which retains unused heap memory to speed up future in-process allocations rather than releasing it back to the operating system. I'm not an expert on these system-level matters; @pitrou or others would know more.
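A minimal way to check this from Python, assuming a recent pyarrow in which MemoryPool exposes bytes_allocated(), max_memory() and backend_name (the local file path below is a placeholder): if the pool's allocated bytes drop back to roughly zero after the table and DataFrame are deleted while the process RSS stays high, the memory is being retained by the allocator rather than leaked by Arrow.

import pyarrow as pa
from pyarrow.parquet import read_table

pool = pa.default_memory_pool()
table = read_table('/tmp/big.snappy.parquet')  # placeholder local copy of the test file
df = table.to_pandas()
del table, df
# Bytes the Arrow pool still considers live, versus the peak it reached:
print(pool.backend_name, pool.bytes_allocated(), pool.max_memory())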
V Luong:
Wes McKinney / @wesm:
$ aws s3 cp s3://public-parquet-test-data/big.parquet . --recursive
fatal error: Unable to locate credentials
Not a regular S3 user, sorry
V Luong:
V Luong:
Wes McKinney / @wesm:
Wes McKinney / @wesm:
Antoine Pitrou / @pitrou:
V Luong:
import os
from tqdm import tqdm
from pyarrow.parquet import read_table

PARQUET_S3_PATH = 's3://public-parquet-test-data/big.snappy.parquet'
# PARQUET_HTTP_PATH and PARQUET_TMP_PATH (an HTTP URL and a local path for the
# same file) are referenced below but their definitions were lost in migration.
os.system('wget --output-document={} {}'.format(PARQUET_TMP_PATH, PARQUET_HTTP_PATH))
for _ in tqdm(range(10)):
    # loop body lost in migration; per the issue description it reads the file
    # and converts it to pandas without keeping a reference:
    read_table(PARQUET_TMP_PATH).to_pandas()
Wes McKinney / @wesm: Adding
Wes McKinney / @wesm:
Wes McKinney / @wesm: https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L48
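The linked file contains Arrow's memory pool implementations. A hedged sketch of one way to check whether the retention is specific to the jemalloc backend, assuming a pyarrow build that also ships the system-allocator pool (the file path is a placeholder): switch the default pool and compare RSS behavior. Recent pyarrow versions also expose pa.jemalloc_set_decay_ms(), which asks jemalloc to return dirty pages to the OS more aggressively.

import pyarrow as pa
from pyarrow.parquet import read_table

# Route Arrow allocations through the system allocator instead of jemalloc.
pa.set_memory_pool(pa.system_memory_pool())

for _ in range(10):
    # If RSS now falls back after each iteration, the growth seen with the
    # default pool is jemalloc keeping freed pages around for reuse.
    read_table('/tmp/big.snappy.parquet').to_pandas()  # placeholder local copy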
I've noticed that when I read a lot of Parquet files using pyarrow.parquet.read_table(...), my program's memory usage becomes very bloated, even though I don't keep the Table objects around after converting them to pandas DataFrames.
You can try this in an interactive Python shell to reproduce this problem:
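The original snippet is not included in the migrated text; the following is a reconstruction based on the repro code posted in the comments above, with a placeholder local path:

from pyarrow.parquet import read_table

for _ in range(10):
    # Read a large Parquet file and convert it to pandas; nothing is kept
    # between iterations.
    read_table('/path/to/big.snappy.parquet').to_pandas()  # placeholder path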
During the for loop above, if you watch the memory usage (e.g. with htop), you'll see it keep creeping up. Either the program crashes during the 10 iterations, or, if they complete, it still occupies a huge amount of memory even though no objects are kept. That memory is released only when you exit() Python.
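A small hedged sketch for measuring the same growth from inside the process rather than with htop, assuming the third-party psutil package is installed (the path is again a placeholder):

import os
import psutil
from pyarrow.parquet import read_table

proc = psutil.Process(os.getpid())
for i in range(10):
    read_table('/path/to/big.snappy.parquet').to_pandas()
    # Resident set size after each iteration; this keeps climbing even though
    # no Table or DataFrame is retained.
    print(i, proc.memory_info().rss // 2**20, 'MiB')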
This problem means that my compute jobs using PyArrow currently need to use bigger server instances than I think is necessary, which translates to significant extra cost.
Reporter: V Luong
Assignee: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-6910. Please see the migration documentation for further details.