[Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True #22462

Closed
asfimport opened this issue Jul 29, 2019 · 9 comments


I tried to load a Parquet file of about 1.8 GB using the following code. It crashed due to an out-of-memory error.

import pyarrow.parquet as pq
pq.read_table('/tmp/test.parquet')

However, it worked fine with use_threads=False, as follows:

pq.read_table('/tmp/test.parquet', use_threads=False)

If pyarrow is downgraded to 0.12.1, there is no such problem.
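
As an aside, the on-disk size can be much smaller than the decoded size, since Parquet pages are compressed and often dictionary-encoded. A minimal sketch (using the same hypothetical path as above) for comparing the compressed and uncompressed sizes recorded in the file metadata:

import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/test.parquet')
meta = pf.metadata
compressed = uncompressed = 0
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size
print(meta.num_rows, meta.num_row_groups)
print(compressed, uncompressed)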

Reporter: Kun Liu
Assignee: Ben Kietzman / @bkietz


Note: This issue was originally created as ARROW-6060. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Can you provide an example file that we can use to try to find what's wrong?


Kun Liu:
Thanks for the response, @wesm.

I am trying to generate a sample file that reproduces the error, since the original file cannot be disclosed. The pandas column types in the parquet file are just unicode, bytes, and int64.
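
For context, here is a minimal sketch (illustrative only, not the original data) of a frame with those three pandas column types and the Arrow types they map to:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'text': ['foo', 'bar'],                    # unicode -> Arrow string
    'blob': [b'\x00\x01', b'\x02'],            # bytes   -> Arrow binary
    'num': pd.Series([1, 2], dtype='int64'),   # int64   -> Arrow int64
})
print(pa.Table.from_pandas(df).schema)
df.to_parquet('/tmp/sample_types.parquet')     # hypothetical path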


Kun Liu:
@wesm I used the following code to generate a sample parquet file.

import pandas as pd
from pandas.util.testing import rands

def generate_strings(length, nunique, string_length=10):
    unique_values = [rands(string_length) for i in range(nunique)]
    values = unique_values * (length // nunique)
    return values

df = pd.DataFrame()
df['a'] = generate_strings(100000000, 10000)
df['b'] = generate_strings(100000000, 10000)
df.to_parquet('/tmp/test.parquet')

And then ran the following:

import pyarrow.parquet as pq
pq.read_table('/tmp/test.parquet') # crash
pq.read_table('/tmp/test.parquet', use_threads=False) # works

Btw, my machine has 16 GB of RAM.
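
Back-of-the-envelope, and assuming Arrow's standard string layout (a contiguous character buffer plus int32 offsets), each column decodes to roughly 100,000,000 x 10 bytes ≈ 1.0 GB of character data plus about 0.4 GB of offsets, so the two-column table alone needs around 2.8 GB of Arrow memory once loaded, before counting any temporary buffers the reader allocates.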


Robin Kåveland:
I've had to downgrade our VMs to 0.13.0 today: parquet files that we could previously load just fine with 16 GB of RAM were failing to load on VMs with 28 GB of RAM. Unfortunately, I can't disclose any of the data either. We are using parquet.ParquetDataset.read(), but we observe the problem even if we read single pieces of the parquet datasets (the pieces are between 100 MB and 200 MB). Most of our columns are unicode and would probably be friendly to dictionary encoding. The files have been written by Spark. Normally these datasets take a while to load, so memory consumption would grow steadily for ~10 seconds, but now we invoke the OOM killer within only a few seconds, so allocation seems very spiky.
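
As a possible stopgap (a sketch that is not from the thread, assuming a single non-partitioned file at a hypothetical path), reading one row group at a time through the single-threaded code path keeps each individual read closer to the size of a row group:

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile('/data/piece-0.parquet')   # hypothetical path
batches = []
for i in range(pf.num_row_groups):
    # read a single row group without the threaded reader
    batches.append(pf.read_row_group(i, use_threads=False))
table = pa.concat_tables(batches)

Concatenating still materializes the full table at the end, so this only helps if the excessive allocation happens inside a single large read rather than in the final table.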


Wes McKinney / @wesm:
I confirmed the peak memory use problem with the following code (thanks for the help reproducing!):

import pandas as pd
from pandas.util.testing import rands

import pyarrow as pa
import pyarrow.parquet as pq

import gc


class memory_use:
    def __init__(self):
        self.start_use = pa.total_allocated_bytes()        
        self.pool = pa.default_memory_pool()
        self.start_peak_use = self.pool.max_memory()
        
    def __enter__(self):
        return
    
    def __exit__(self, type, value, traceback):
        gc.collect()
        print("Change in memory use: {}"
              .format(pa.total_allocated_bytes() - self.start_use))
        print("Change in peak use: {}"
              .format(self.pool.max_memory() - self.start_peak_use))

def generate_strings(length, nunique, string_length=10):
    unique_values = [rands(string_length) for i in range(nunique)]
    values = unique_values * (length // nunique)
    return values

df = pd.DataFrame()
df['a'] = generate_strings(100000000, 10000)
df['b'] = generate_strings(100000000, 10000)
df.to_parquet('/tmp/test.parquet')

with memory_use():
    table = pq.read_table('/tmp/test.parquet')

With 0.13.0 I have:

Change in memory use: 2825000192
Change in peak use: 3827684608

and with 0.14.1 and master:

Change in memory use: 2825000192
Change in peak use: 20585786752

So peak memory use is now about 20 GB where it was less than 4 GB before. I'm not sure which patch caused this, but there have been a lot of patches related to builders in the last several months, so my guess is that one of the builders has a bug in its memory allocation logic.

cc @bkietz @pitrou @nealrichardson
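
For comparison, the same harness can be pointed at the single-threaded read that the reporter found to work (a sketch reusing the memory_use class and file from above; the numbers will vary by machine and version):

with memory_use():
    table = pq.read_table('/tmp/test.parquet', use_threads=False)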


Wes McKinney / @wesm:
Note that with my patch for ARROW-3325 the memory use is comparatively very low:

with memory_use():
    table = pq.read_table('/tmp/test.parquet', read_dictionary=['a', 'b'])

Change in memory use: 825560448
Change in peak use: 1484772224
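
Roughly speaking, that is expected: with only 10,000 distinct 10-character strings per column, a dictionary-encoded column is about 100,000,000 int32 indices ≈ 0.4 GB plus a tiny dictionary, so roughly 0.8 GB for both columns, which lines up with the change in memory use shown above.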


Wes McKinney / @wesm:
git bisect reveals that this issue was introduced by ARROW-3762 (commit a634f92).

@bkietz, since you last worked on this builder, would you mind taking a look?


Ben Kietzman / @bkietz:
Will do


Antoine Pitrou / @pitrou:
Issue resolved by pull request #5016.
