[Python] Memory leak in pq.read_table and table.to_pandas #18431
Comments
Weston Pace / @westonpace:
Michael Peleshenko: Traceback (most recent call last):
File "C:/Users/mipelesh/Workspace/Git/Lynx/src-pyLynx/pyLynx/run_pyarrow_memoy_leak_sample.py", line 35, in <module>
main()
File "C:/Users/mipelesh/Workspace/Git/Lynx/src-pyLynx/pyLynx/run_pyarrow_memoy_leak_sample.py", line 18, in main
pa.jemalloc_set_decay_ms(0)
File "pyarrow\memory.pxi", line 171, in pyarrow.lib.jemalloc_set_decay_ms
File "pyarrow\error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: jemalloc support is not built
Wes McKinney / @wesm:
Antoine Pitrou / @pitrou: If you're merely worried about a potential memory leak, the way to check for it is to run your function in a loop and see whether memory occupation keeps increasing or quickly reaches a stable plateau.
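A minimal sketch of that check, assuming psutil is installed; the loader function name is a placeholder, not code from this thread:
import gc
import psutil

def check_for_leak(load_and_convert, iterations=20):
    # Call the suspect function repeatedly and print the process RSS each time;
    # a real leak shows up as steadily growing numbers, while allocator caching
    # shows up as an early plateau.
    proc = psutil.Process()
    for i in range(iterations):
        result = load_and_convert()
        del result
        gc.collect()
        print(f"iteration {i}: rss = {proc.memory_info().rss / 2**20:.1f} MiB")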
Joris Van den Bossche / @jorisvandenbossche:
Dmitry Kashtanov: This behavior also remains if I use … Also, after that I drop a referenced (not copied) column from that pandas DataFrame; this results in a copy of the DataFrame data, and the memory from the original DataFrame is likewise not released to the OS. Subsequent transformations of the DataFrame release memory as expected. The exact same code with the exact same Python (3.8.7) and package versions on macOS releases memory to the OS as expected (with all kinds of memory pools).
The very first lines of the script are: import pyarrow
pyarrow.jemalloc_set_decay_ms(0)
macOS:
Line # Mem usage Increment Occurences Line Contents
============================================================
460 141.5 MiB 141.5 MiB 1 @profile
461 def bqs_stream_to_pandas(session, stream_name):
463 142.2 MiB 0.7 MiB 1 client = bqs.BigQueryReadClient()
464 158.7 MiB 16.5 MiB 1 reader = client.read_rows(name=stream_name, offset=0)
465 1092.2 MiB 933.5 MiB 1 table = reader.to_arrow(session)
470 2725.1 MiB 1632.5 MiB 2 dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False, self_destruct=False,
471 1092.6 MiB 0.0 MiB 1 strings_to_categorical=True,)
472 1405.0 MiB -1320.1 MiB 1 del table
473 1405.0 MiB 0.0 MiB 1 del reader
474 1396.1 MiB -8.9 MiB 1 del client
475 1396.1 MiB 0.0 MiB 1 time.sleep(1)
476 1396.1 MiB 0.0 MiB 1 if MEM_PROFILING:
477 1396.1 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
478 1396.1 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
479 f"{mem_pool.max_memory()} max allocated, ")
480 1396.1 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
481 1402.4 MiB 6.3 MiB 1 mem_usage = dataset.memory_usage(index=True, deep=True)
485 1404.2 MiB 0.0 MiB 1 return dataset
# Output
PyArrow mem pool info: jemalloc backend, 1313930816 allocated, 1340417472 max allocated,
PyArrow total allocated bytes: 1313930816
Line # Mem usage Increment Occurences Line Contents
============================================================
...
139 1477.7 MiB 0.4 MiB 1 dataset_label = dataset[label_column].astype(np.int8)
140
141 1474.2 MiB -3.5 MiB 1 dataset.drop(columns=label_column, inplace=True)
142 1474.2 MiB 0.0 MiB 1 gc.collect()
143
144 1474.2 MiB 0.0 MiB 1 if MEM_PROFILING:
145 1474.2 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
146 1474.2 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
147 f"{mem_pool.max_memory()} max allocated, ")
148 1474.2 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
# Output
PyArrow mem pool info: jemalloc backend, 0 allocated, 1340417472 max allocated,
PyArrow total allocated bytes: 0
Line # Mem usage Increment Occurences Line Contents
============================================================
460 153.0 MiB 153.0 MiB 1 @profile
461 def bqs_stream_to_pandas(session, stream_name):
463 153.5 MiB 0.6 MiB 1 client = bqs.BigQueryReadClient()
464 166.9 MiB 13.4 MiB 1 reader = client.read_rows(name=stream_name, offset=0)
465 1567.5 MiB 1400.6 MiB 1 table = reader.to_arrow(session)
469 1567.5 MiB 0.0 MiB 1 report_metric('piano.ml.preproc.pyarrow.table.bytes', table.nbytes)
470 2843.7 MiB 1276.2 MiB 2 dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False, self_destruct=False,
471 1567.5 MiB 0.0 MiB 1 strings_to_categorical=True,)
472 2843.7 MiB 0.0 MiB 1 del table
473 2843.7 MiB 0.0 MiB 1 del reader
474 2843.9 MiB 0.2 MiB 1 del client
475 2842.2 MiB -1.8 MiB 1 time.sleep(1)
476 2842.2 MiB 0.0 MiB 1 if MEM_PROFILING:
477 2842.2 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
478 2842.2 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
479 f"{mem_pool.max_memory()} max allocated, ")
480 2842.2 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
481 2838.9 MiB -3.3 MiB 1 mem_usage = dataset.memory_usage(index=True, deep=True)
485 2839.1 MiB 0.0 MiB 1 return dataset
# Output
PyArrow mem pool info: jemalloc backend, 1313930816 allocated, 1338112064 max allocated,
PyArrow total allocated bytes: 1313930816
Line # Mem usage Increment Occurences Line Contents
============================================================
...
139 2839.1 MiB 0.0 MiB 1 dataset_label = dataset[label_column].astype(np.int8)
140
141 2836.6 MiB -2.6 MiB 1 dataset.drop(columns=label_column, inplace=True)
142 2836.6 MiB 0.0 MiB 1 gc.collect()
143
144 2836.6 MiB 0.0 MiB 1 if MEM_PROFILING:
145 2836.6 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
146 2836.6 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
147 f"{mem_pool.max_memory()} max allocated, ")
148 2836.6 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
# Output
PyArrow mem pool info: jemalloc backend, 0 allocated, 1338112064 max allocated,
PyArrow total allocated bytes: 0
A case with dropping a referenced (not copied) column:
Line # Mem usage Increment Occurences Line Contents
============================================================
...
134 2872.0 MiB 0.0 MiB 1 dataset_label = dataset[label_column]
135
136 4039.4 MiB 1167.4 MiB 1 dataset.drop(columns=label_column, inplace=True)
137 4035.9 MiB -3.6 MiB 1 gc.collect()
138
139 4035.9 MiB 0.0 MiB 1 if MEM_PROFILING:
140 4035.9 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
141 4035.9 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
142 f"{mem_pool.max_memory()} max allocated, ")
# Output
PyArrow mem pool info: jemalloc backend, 90227904 allocated, 1340299200 max allocated,
Package versions:
boto3==1.17.1
botocore==1.20.1
cachetools==4.2.1
certifi==2020.12.5
cffi==1.14.4
chardet==4.0.0
google-api-core[grpc]==1.25.1
google-auth==1.25.0
google-cloud-bigquery-storage==2.2.1
google-cloud-bigquery==2.7.0
google-cloud-core==1.5.0
google-crc32c==1.1.2
google-resumable-media==1.2.0
googleapis-common-protos==1.52.0
grpcio==1.35.0
idna==2.10
jmespath==0.10.0
joblib==1.0.0
libcst==0.3.16
memory-profiler==0.58.0
mypy-extensions==0.4.3
numpy==1.20.0
pandas==1.2.1
proto-plus==1.13.0
protobuf==3.14.0
psutil==5.8.0
pyarrow==3.0.0
pyasn1-modules==0.2.8
pyasn1==0.4.8
pycparser==2.20
python-dateutil==2.8.1
pytz==2021.1
pyyaml==5.4.1
requests==2.25.1
rsa==4.7
s3transfer==0.3.4
scikit-learn==0.24.1
scipy==1.6.0
setuptools-scm==5.0.1
six==1.15.0
smart-open==4.1.2
threadpoolctl==2.1.0
typing-extensions==3.7.4.3
typing-inspect==0.6.0
unidecode==1.1.2
urllib3==1.26.3
Antoine Pitrou / @pitrou: Also, how is "Mem usage" measured in your script?
Dmitry Kashtanov: And as we can see, the following line doesn't help: …
Antoine Pitrou / @pitrou: That doesn't really answer the question: what does it measure? RSS? Virtual memory size?
Perhaps, but I still don't see what Arrow could do, or even whether there is an actual problem. Can you run bqs_stream_to_pandas in a loop and see whether memory usage increases, or whether it stays stable at its initial peak value?
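For reference, a small illustrative snippet (not from the thread) that reports both numbers with psutil; memory_profiler, used in the outputs above, reports RSS by default:
import psutil

# RSS is the physical memory currently resident; VMS is the total mapped
# virtual address space. An allocator that holds on to freed pages keeps RSS
# high even when the Arrow pool reports zero bytes allocated.
mem = psutil.Process().memory_info()
print(f"rss = {mem.rss / 2**20:.1f} MiB, vms = {mem.vms / 2**20:.1f} MiB")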
Dmitry Kashtanov:
MALLOC_CONF="background_thread:true,narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0"
Specifying the above environment variable also doesn't help for jemalloc. The suspicious things are that everything works on macOS and that all allocators behave similarly.
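For completeness, the allocator can also be switched from Python rather than through MALLOC_CONF; this is only a sketch (which pools are available depends on how the wheel was built), and setting the ARROW_DEFAULT_MEMORY_POOL environment variable before importing pyarrow should have the same effect:
import pyarrow as pa

# Pick one of the pools compiled into the wheel and make it the default for
# subsequent Arrow allocations.
pool = pa.system_memory_pool()        # plain malloc/free
# pool = pa.mimalloc_memory_pool()    # if mimalloc support was built
# pool = pa.jemalloc_memory_pool()    # if jemalloc support was built
pa.set_memory_pool(pool)
print(pool.backend_name, pool.bytes_allocated(), pool.max_memory())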
It looks like …
PSB (please see below). It doesn't increase (almost):
Line # Mem usage Increment Occurences Line Contents
============================================================
...
117 2866.0 MiB 2713.1 MiB 1 dataset = bqs_stream_to_pandas(session, stream_name)
118 2865.6 MiB -0.4 MiB 1 del dataset
119 2874.6 MiB 9.0 MiB 1 dataset = bqs_stream_to_pandas(session, stream_name)
120 2874.6 MiB 0.0 MiB 1 del dataset
121 2887.0 MiB 12.4 MiB 1 dataset = bqs_stream_to_pandas(session, stream_name)
122 2878.2 MiB -8.8 MiB 1 del dataset
123 2903.2 MiB 25.1 MiB 1 dataset = bqs_stream_to_pandas(session, stream_name)
124 2903.2 MiB 0.0 MiB 1 del dataset
125 2899.2 MiB -4.1 MiB 1 dataset = bqs_stream_to_pandas(session, stream_name)
126 2899.2 MiB 0.0 MiB 1 del dataset
127 2887.9 MiB -11.3 MiB 1 dataset = bqs_stream_to_pandas(session, stream_name)
128 2887.9 MiB 0.0 MiB 1 del dataset
Interestingly, the first chunk of memory is freed when the gRPC connection/session (I may be naming it incorrectly) is reset:
Line # Mem usage Increment Occurences Line Contents
============================================================
471 2898.9 MiB 2898.9 MiB 1 @profile
472 def bqs_stream_to_pandas(session, stream_name, row_limit=3660000):
474 2898.9 MiB 0.0 MiB 1 client = bqs.BigQueryReadClient()
475 1628.4 MiB -1270.5 MiB 1 reader = client.read_rows(name=stream_name, offset=0)
476 1628.4 MiB 0.0 MiB 1 rows = reader.rows(session)
... If a …
pyarrow.ipc.read_record_batch(
    pyarrow.py_buffer(message.arrow_record_batch.serialized_record_batch),
    self._schema,
)
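A self-contained sketch of that deserialization step (the message and schema objects in the real client are stand-ins here); as far as I understand, read_record_batch is zero-copy over the given buffer, so the resulting batch keeps that memory alive as long as it is referenced:
import pyarrow as pa

# Build a batch, serialize it to an IPC message, then read it back the way
# the BigQuery Storage client does.
original = pa.RecordBatch.from_arrays([pa.array(range(1_000_000))], names=["x"])
serialized = original.serialize()     # pyarrow Buffer holding the IPC message
restored = pa.ipc.read_record_batch(pa.py_buffer(serialized), original.schema)
print("rows:", restored.num_rows,
      "allocated bytes:", pa.total_allocated_bytes())
del restored, serialized, original
print("allocated bytes after releasing references:", pa.total_allocated_bytes())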
Antoine Pitrou / @pitrou:
Hmm... I guess it probably should? But I think you may find more expertise about this by asking the BigQuery developers / community.
Dmitry Kashtanov:
Before pandas DataFrame creation:
PyArrow mem pool info: jemalloc backend, 0 allocated, 0 max allocated,
PyArrow total allocated bytes: 0
So with this, it looks like we have the following container sequence: …
shadowdsp:
import io
import pandas as pd
import pyarrow as pa
pa.jemalloc_set_decay_ms(0)
import pyarrow.parquet as pq
from memory_profiler import profile

@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df

def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": [{"test": [1, 2], "test1": [3, 4]}] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)

if __name__ == '__main__':
    main()

Output:
Filename: memory_leak.py
Line # Mem usage Increment Occurences Line Contents
============================================================
14 329.5 MiB 329.5 MiB 1 @profile
15 def read_file(f):
16 424.4 MiB 94.9 MiB 1 table = pq.read_table(f)
17 1356.6 MiB 932.2 MiB 1 df = table.to_pandas(strings_to_categorical=True)
18 1310.5 MiB -46.1 MiB 1 del table
19 606.7 MiB -703.8 MiB 1 del df
Filename: memory_leak.py
Line # Mem usage Increment Occurences Line Contents
============================================================
14 606.7 MiB 606.7 MiB 1 @profile
15 def read_file(f):
16 714.9 MiB 108.3 MiB 1 table = pq.read_table(f)
17 1720.8 MiB 1005.9 MiB 1 df = table.to_pandas(strings_to_categorical=True)
18 1674.5 MiB -46.3 MiB 1 del table
19 970.6 MiB -703.8 MiB 1 del df
Filename: memory_leak.py
Line # Mem usage Increment Occurences Line Contents
============================================================
14 970.6 MiB 970.6 MiB 1 @profile
15 def read_file(f):
16 1079.6 MiB 109.0 MiB 1 table = pq.read_table(f)
17 2085.5 MiB 1005.9 MiB 1 df = table.to_pandas(strings_to_categorical=True)
18 2039.2 MiB -46.3 MiB 1 del table
19 1335.3 MiB -703.8 MiB 1 del df
▶ pip show pyarrow
Name: pyarrow
Version: 3.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: None
Author-email: None
License: Apache License, Version 2.0
Location:
Requires: numpy
Required-by: utify
▶ pip show pandas
Name: pandas
Version: 1.2.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location:
Requires: python-dateutil, pytz, numpy
Required-by: utify, seaborn, fastparquet
Weston Pace / @westonpace:
shadowdsp:
Peter Gaultney: I think this bug still exists in 6.0.0 of pyarrow. I'm attaching a script (benchmark-pandas-parquet.py) that requires fastparquet, pyarrow, and psutil to be installed. It allows switching between fastparquet and pyarrow to compare memory usage across iterations; the number of calls to read_table is also parameterizable, but defaults to 5. There seems to be a large memory leak, followed by smaller ones on every iteration. Even with …, I've been able to reproduce this with many different kinds of parquet files, but I don't know about nested vs. non-nested data.
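For readers without the attachment, a rough sketch of that kind of benchmark (this is not the attached script, and 'data.parquet' is a placeholder path):
import gc
import pandas as pd
import psutil

# Read the same file several times with each engine and print RSS after each
# call, so per-iteration growth is easy to compare between pyarrow and
# fastparquet.
proc = psutil.Process()
for engine in ("pyarrow", "fastparquet"):
    for i in range(5):
        df = pd.read_parquet("data.parquet", engine=engine)
        del df
        gc.collect()
        rss = proc.memory_info().rss / 2**20
        print(f"{engine} iteration {i}: rss = {rss:.1f} MiB")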
Cory Nezin:
I have been struggling with a memory leak in …
import pandas as pd
from pyarrow import dataset as ds
import pyarrow as pa

def create_parquet(path: str):
    pd.DataFrame({'range': [x for x in range(1000000)]}).to_parquet(path)

def load_parquet_to_table(path: str):
    dataset = ds.dataset(path, format='parquet')
    dataset.to_table()

if __name__ == '__main__':
    PATH = 'test.parquet'
    pa.jemalloc_set_decay_ms(0)
    create_parquet(PATH)
    for x in range(100):
        load_parquet_to_table(PATH)

I tested on version 9.0.0 with Python 3.8 on macOS. And Memory Usage: …
Even though the memory usage doesn't grow linearly here, when I used this in a more complex example in a long-running process, it ended up increasing linearly until it exceeded the memory limit.
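One way to narrow this down (a sketch only, not code from this thread; it reuses the test.parquet file created by the snippet above) is to print the Arrow pool statistics inside the loop, which separates memory still held by the Arrow pool from memory retained by the allocator or the process:
import pyarrow as pa
from pyarrow import dataset as ds

pool = pa.default_memory_pool()
for x in range(100):
    dataset = ds.dataset('test.parquet', format='parquet')
    table = dataset.to_table()
    del table, dataset
    # bytes_allocated() falling back to ~0 while RSS keeps climbing points at
    # allocator behaviour rather than Arrow objects being kept alive.
    print(x, pool.backend_name, pool.bytes_allocated(), pool.max_memory())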
Antoine Pitrou / @pitrou:
Jan Skorepa:
Ninh Chu: I'm running on Ubuntu 20.04 / WSL2.

import pyarrow.dataset as ds
import pyarrow as pa

pa.jemalloc_set_decay_ms(0)
delta_ds = ds.dataset("delta")
row_count = delta_ds.count_rows()
print("row_count = ", row_count)
reader = delta_ds.scanner(batch_size=10000).to_reader()
batch = reader.read_next_batch()
print("first batch row count = ", batch.num_rows)
print("Total allocated mem for pyarrow = ", pa.total_allocated_bytes() // 1024**2)

Small dataset:
dataset row_count = 66651
first batch row count = 10000
Total allocated mem for pyarrow = 103

Big dataset, created by duplicating the same file 4 times:
dataset row_count = 333255
first batch row count = 10000
Total allocated mem for pyarrow = 412

If I load all the data in the dataset into a Table:

import pyarrow.dataset as ds
import pyarrow as pa

pa.jemalloc_set_decay_ms(0)
delta_ds = ds.dataset("delta")
row_count = delta_ds.count_rows()
print("dataset row_count = ", row_count)
pa_table = delta_ds.to_table()
print("Total allocated mem for pyarrow = ", pa.total_allocated_bytes() // 1024**2)

dataset row_count = 333255
Total allocated mem for pyarrow = 512
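If the goal is to keep the footprint bounded, a variant of the snippet above (a sketch, not tested against this dataset) is to consume the scanner batch by batch instead of materializing everything with to_table():
import pyarrow as pa
import pyarrow.dataset as ds

delta_ds = ds.dataset("delta")
reader = delta_ds.scanner(batch_size=10000).to_reader()
rows = 0
for batch in reader:          # RecordBatchReader yields one batch at a time
    rows += batch.num_rows    # process the batch, then drop the reference
print("rows processed = ", rows)
print("Total allocated mem for pyarrow = ", pa.total_allocated_bytes() // 1024**2)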
Julius Uotila: I am having the exact same issue as Jan Skorepa, but on Windows/Windows Server. I have a process that builds a dataset overnight from an SQL database to .parquet with a predefined save interval (30+ saves a night) and limited memory. Each save slowly creeps memory up until the process crashes. Python 3.9.12, Windows Server 2019 / Windows 10. Many thanks, Julius
wondertx:
dxe4: Hi, I was profiling this and spotted that in arrow/python/pyarrow/array.pxi, line 724 in fc1f9eb and line 690 in fc1f9eb (the deduplicate_objects handling), the memory goes down when it is false.
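As an illustration of what that flag does (a hypothetical measurement, not taken from this thread): deduplicate_objects controls whether identical strings become a single shared Python object during to_pandas(), so the two settings can be compared directly, for example:
import psutil
import pyarrow as pa

table = pa.table({"s": ["repeated-value"] * 1_000_000})
proc = psutil.Process()
for dedup in (True, False):
    before = proc.memory_info().rss
    df = table.to_pandas(deduplicate_objects=dedup)
    after = proc.memory_info().rss
    # The RSS delta is a rough proxy for the Python-object overhead of the column.
    print(f"deduplicate_objects={dedup}: +{(after - before) / 2**20:.1f} MiB")
    del df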
@dxe4 this is a rather old issue (perhaps we should close it) and not necessarily getting much attention. It's also not clear this issue is related to deduplicate_objects. Can you open a new issue, specific to your deduplicate_objects question? |
ales-vilchytski: I encountered what is very likely the same issue.
I can't provide our data or code, but I created a repository with the smallest possible scripts to reproduce the issue; it can be found here: https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo. The repository includes scripts to generate a parquet file and to reproduce the OOM, a Dockerfile, and instructions on how to run it. The issue reproduces on pyarrow 13 and 14 with pandas 2+, on different Docker images, on native macOS ARM 13, and with different Python versions (3.10, 3.11, 3.12). The core thing (https://github.com/ales-vilchytski/pyarrow-parquet-memory-leak-demo/blob/main/src/mem_leak.py#L10): …
As an example: … I also experimented with jemalloc settings and found that … The parquet file in the example is written with … Any attempt to fix things by triggering GC, clearing memory pools, or switching to the system memory allocator failed; it still gets OOM, just earlier or later.
kyle-ip: Hi @ales-vilchytski, have you fixed this yet? I encountered a similar issue and was wondering if you have a solution. Thanks.
ales-vilchytski: Hello @kyle-ip. Unfortunately we've found only workarounds: use explicit … It makes things harder to implement, but it works in our case (at least for now).
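One workaround pattern that is often used for this kind of problem (an assumption on my part, not necessarily the approach described above) is to do the Parquet read in a short-lived child process, so that all memory is returned to the OS when the child exits:
import multiprocessing as mp

import pyarrow.parquet as pq

def _read(path, conn):
    # Runs in the child process: read, convert, ship the DataFrame back, exit.
    df = pq.read_table(path).to_pandas()
    conn.send(df)       # pickled through the pipe; costly for very large frames
    conn.close()

def read_parquet_in_subprocess(path):
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=_read, args=(path, child_conn))
    proc.start()
    df = parent_conn.recv()
    proc.join()
    return df

if __name__ == "__main__":
    df = read_parquet_in_subprocess("data.parquet")  # placeholder path
    print(df.shape)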
Are people still working on this? It is preventing us from even considering parquet as a file format, even though in theory it'd be perfect for our needs. |
Michael Peleshenko (original issue description): While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
Sample Code: …
Logs (Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip)): …
Logs (Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip)): …
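A minimal sketch of the kind of reproduction described (hypothetical; the reporter's actual sample code and logs are attached to the original JIRA issue, and 'data.parquet' is a placeholder path):
import gc

import psutil
import pyarrow.parquet as pq

proc = psutil.Process()
for i in range(5):
    table = pq.read_table("data.parquet")
    df = table.to_pandas()
    del table, df
    gc.collect()
    # Per the report, pyarrow 0.12.1 returned RSS to roughly its starting value
    # here, while 2.0.0 leaves it elevated.
    print(f"iteration {i}: rss = {proc.memory_info().rss / 2**20:.1f} MiB")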
Reporter: Michael Peleshenko
Assignee: Weston Pace / @westonpace
Related issues:
Original Issue Attachments:
Note: This issue was originally created as ARROW-11007. Please see the migration documentation for further details.