[Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True #22462
Comments
Wes McKinney / @wesm:
Kun Liu:

    import pandas as pd
    from pandas.util.testing import rands

    def generate_strings(length, nunique, string_length=10):
        unique_values = [rands(string_length) for i in range(nunique)]
        values = unique_values * (length // nunique)
        return values

    df = pd.DataFrame()
    df['a'] = generate_strings(100000000, 10000)
    df['b'] = generate_strings(100000000, 10000)
    df.to_parquet('/tmp/test.parquet')

And then run the following:

    import pyarrow.parquet as pq

    pq.read_table('/tmp/test.parquet')                      # crash
    pq.read_table('/tmp/test.parquet', use_threads=False)   # works

Btw, my machine has 16GB RAM.
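Not mentioned in the thread, but a possible middle ground between the two calls above, if some parallelism is still wanted, is to shrink Arrow's CPU thread pool before reading. This is only a sketch: pa.set_cpu_count exists in pyarrow, but whether it actually lowers peak memory for this file was not verified here.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumption: fewer decoder threads means fewer column chunks decoded
    # concurrently, so peak allocation should land somewhere between the
    # use_threads=True and use_threads=False cases (not verified here).
    pa.set_cpu_count(2)

    table = pq.read_table('/tmp/test.parquet')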
Robin Kåveland:
Wes McKinney / @wesm:

    import pandas as pd
    from pandas.util.testing import rands
    import pyarrow as pa
    import pyarrow.parquet as pq
    import gc

    class memory_use:
        def __init__(self):
            self.start_use = pa.total_allocated_bytes()
            self.pool = pa.default_memory_pool()
            self.start_peak_use = self.pool.max_memory()

        def __enter__(self):
            return

        def __exit__(self, type, value, traceback):
            gc.collect()
            print("Change in memory use: {}"
                  .format(pa.total_allocated_bytes() - self.start_use))
            print("Change in peak use: {}"
                  .format(self.pool.max_memory() - self.start_peak_use))

    def generate_strings(length, nunique, string_length=10):
        unique_values = [rands(string_length) for i in range(nunique)]
        values = unique_values * (length // nunique)
        return values

    df = pd.DataFrame()
    df['a'] = generate_strings(100000000, 10000)
    df['b'] = generate_strings(100000000, 10000)
    df.to_parquet('/tmp/test.parquet')

    with memory_use():
        table = pq.read_table('/tmp/test.parquet')

With 0.13.0 I have:

    Change in memory use: 2825000192
    Change in peak use: 3827684608

and with 0.14.1 and master:

    Change in memory use: 2825000192
    Change in peak use: 20585786752

So peak memory use is about 20GB now, where it was less than 4GB before. I'm not sure which patch caused this, but there have been a lot of patches related to builders in the last several months, so my guess is that one of the builders has a bug in its memory allocation logic.
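To separate the effect of threading from the version regression, the same memory_use helper above could also be run with threads disabled; this is only a sketch for completeness, since numbers for that combination are not reported in this thread.

    # Hypothetical follow-up measurement: same file, single-threaded read,
    # reusing the memory_use context manager defined above.
    with memory_use():
        table = pq.read_table('/tmp/test.parquet', use_threads=False)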
Wes McKinney / @wesm:

    with memory_use():
        table = pq.read_table('/tmp/test.parquet', read_dictionary=['a', 'b'])

    Change in memory use: 825560448
    Change in peak use: 1484772224
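If downstream code needs plain (non-dictionary) string columns, the dictionary-encoded result can be cast back after the read; the benefit is the lower peak during reading, while the cast itself allocates the dense strings afterwards. A minimal sketch, assuming a pyarrow version where ChunkedArray.cast is available:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read the heavily repeated string columns as dictionary-encoded
    # arrays, which keeps them deduplicated in memory during the read.
    table = pq.read_table('/tmp/test.parquet', read_dictionary=['a', 'b'])

    # Cast the first column back to plain strings only if needed.
    # (Assumes ChunkedArray.cast exists in the installed pyarrow.)
    plain_a = table.column(0).cast(pa.string())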
Wes McKinney / @wesm: @bkietz, since you last worked on this builder, would you mind taking a look?
Ben Kietzman / @bkietz:
Antoine Pitrou / @pitrou:
Original issue description: I tried to load a Parquet file of about 1.8 GB using the reproduction code shown above; it crashed due to an out-of-memory error. However, it worked well with use_threads=False. If pyarrow is downgraded to 0.12.1, there is no such problem.
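Independent of the threading question, one way to keep peak memory bounded for a file like this is to read it one row group at a time instead of materializing everything in a single read_table call. A sketch using the pq.ParquetFile API; it assumes the file was written with multiple row groups and that each piece can be processed and released before the next one is read:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile('/tmp/test.parquet')

    # Peak memory is then bounded by the largest row group rather than
    # by the whole table.
    for i in range(pf.num_row_groups):
        piece = pf.read_row_group(i, use_threads=False)
        # ... process `piece`, then let it go out of scope before the
        # next iteration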
Reporter: Kun Liu
Assignee: Ben Kietzman / @bkietz
Note: This issue was originally created as ARROW-6060. Please see the migration documentation for further details.