[Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True #22462

Closed
asfimport opened this issue Jul 29, 2019 · 9 comments


I tried to load a Parquet file of about 1.8 GB using the following code. It crashed due to an out-of-memory error.

import pyarrow.parquet as pq
pq.read_table('/tmp/test.parquet')

However, it worked fine with use_threads=False, as follows:

pq.read_table('/tmp/test.parquet', use_threads=False)

If pyarrow is downgraded to 0.12.1, there is no such problem.
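
As an aside, the on-disk size can be much smaller than the decoded size, since Parquet pages are compressed and often dictionary-encoded. A minimal sketch (using the same hypothetical path as above) for comparing the compressed and uncompressed sizes recorded in the file metadata:

import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/test.parquet')
meta = pf.metadata
compressed = uncompressed = 0
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size
print(meta.num_rows, meta.num_row_groups)
print(compressed, uncompressed)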

Reporter: Kun Liu
Assignee: Ben Kietzman / @bkietz


Note: This issue was originally created as ARROW-6060. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Can you provide an example file that we can use to try to find what's wrong?


Kun Liu:
Thanks for the response, @wesm.

I am trying to generate a sample file that reproduces the error, since the original file cannot be disclosed. The pandas column types in the parquet file are just unicode, bytes, and int64.
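
For context, here is a minimal sketch (illustrative only, not the original data) of a frame with those three pandas column types and the Arrow types they map to:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'text': ['foo', 'bar'],                    # unicode -> Arrow string
    'blob': [b'\x00\x01', b'\x02'],            # bytes   -> Arrow binary
    'num': pd.Series([1, 2], dtype='int64'),   # int64   -> Arrow int64
})
print(pa.Table.from_pandas(df).schema)
df.to_parquet('/tmp/sample_types.parquet')     # hypothetical path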


Kun Liu:
@wesm I used the following code to generate a sample parquet file.

import pandas as pd
from pandas.util.testing import rands

def generate_strings(length, nunique, string_length=10):
    unique_values = [rands(string_length) for i in range(nunique)]
    values = unique_values * (length // nunique)
    return values

df = pd.DataFrame()
df['a'] = generate_strings(100000000, 10000)
df['b'] = generate_strings(100000000, 10000)
df.to_parquet('/tmp/test.parquet')

And then ran the following:

import pyarrow.parquet as pq
pq.read_table('/tmp/test.parquet') # crash
pq.read_table('/tmp/test.parquet', use_threads=False) # works

Btw, my machine has 16 GB of RAM.
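
Back-of-the-envelope, and assuming Arrow's standard string layout (a contiguous character buffer plus int32 offsets), each column decodes to roughly 100,000,000 x 10 bytes ≈ 1.0 GB of character data plus about 0.4 GB of offsets, so the two-column table alone needs around 2.8 GB of Arrow memory once loaded, before counting any temporary buffers the reader allocates.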


Robin Kåveland:
I've had to downgrade our VMs to 0.13.0 today: parquet files that we could previously load just fine with 16 GB of RAM were failing to load on VMs with 28 GB of RAM. Unfortunately, I can't disclose any of the data either. We are using parquet.ParquetDataset.read(), but we observe the problem even if we read single pieces of the parquet datasets (the pieces are between 100 MB and 200 MB). Most of our columns are unicode and would probably be friendly to dictionary encoding. The files have been written by Spark. Normally these datasets take a while to load, so memory consumption would grow steadily for ~10 seconds, but now we invoke the OOM killer within only a few seconds, so allocation seems very spiky.
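
As a possible stopgap (a sketch that is not from the thread, assuming a single non-partitioned file at a hypothetical path), reading one row group at a time through the single-threaded code path keeps each individual read closer to the size of a row group:

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile('/data/piece-0.parquet')   # hypothetical path
batches = []
for i in range(pf.num_row_groups):
    # read a single row group without the threaded reader
    batches.append(pf.read_row_group(i, use_threads=False))
table = pa.concat_tables(batches)

Concatenating still materializes the full table at the end, so this only helps if the excessive allocation happens inside a single large read rather than in the final table.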


Wes McKinney / @wesm:
I confirmed the peak memory use problem with the following code (thanks for the help reproducing!):

import pandas as pd
from pandas.util.testing import rands

import pyarrow as pa
import pyarrow.parquet as pq

import gc


class memory_use:
    def __init__(self):
        self.start_use = pa.total_allocated_bytes()        
        self.pool = pa.default_memory_pool()
        self.start_peak_use = self.pool.max_memory()
        
    def __enter__(self):
        return
    
    def __exit__(self, type, value, traceback):
        gc.collect()
        print("Change in memory use: {}"
              .format(pa.total_allocated_bytes() - self.start_use))
        print("Change in peak use: {}"
              .format(self.pool.max_memory() - self.start_peak_use))

def generate_strings(length, nunique, string_length=10):
    unique_values = [rands(string_length) for i in range(nunique)]
    values = unique_values * (length // nunique)
    return values

df = pd.DataFrame()
df['a'] = generate_strings(100000000, 10000)
df['b'] = generate_strings(100000000, 10000)
df.to_parquet('/tmp/test.parquet')

with memory_use():
    table = pq.read_table('/tmp/test.parquet')

With 0.13.0 I have:

Change in memory use: 2825000192
Change in peak use: 3827684608

and with 0.14.1 and master:

Change in memory use: 2825000192
Change in peak use: 20585786752

So peak memory use is now about 20 GB where it was less than 4 GB before. I'm not sure which patch caused this, but there have been a lot of patches related to builders in the last several months, so my guess is that one of the builders has a bug in its memory allocation logic.

cc @bkietz @pitrou @nealrichardson
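
For comparison, the same harness can be pointed at the single-threaded read that the reporter found to work (a sketch reusing the memory_use class and file from above; the numbers will vary by machine and version):

with memory_use():
    table = pq.read_table('/tmp/test.parquet', use_threads=False)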


Wes McKinney / @wesm:
Note that with my patch for ARROW-3325 the memory use is comparatively very low:

with memory_use():
    table = pq.read_table('/tmp/test.parquet', read_dictionary=['a', 'b'])

Change in memory use: 825560448
Change in peak use: 1484772224
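
Roughly speaking, that is expected: with only 10,000 distinct 10-character strings per column, a dictionary-encoded column is about 100,000,000 int32 indices ≈ 0.4 GB plus a tiny dictionary, so roughly 0.8 GB for both columns, which lines up with the change in memory use shown above.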


Wes McKinney / @wesm:
git bisect reveals that this issue was introduced by ARROW-3762 (commit a634f92).

@bkietz, since you last worked on this builder, would you mind taking a look?


Ben Kietzman / @bkietz:
Will do


Antoine Pitrou / @pitrou:
Issue resolved by pull request #5016.
