
[Python] High memory usage writing pyarrow.Table with large strings to parquet #23592

Open · asfimport opened this issue Dec 4, 2019 · 8 comments
asfimport commented Dec 4, 2019

My dataset use case is specific: I have large strings (1-100 MB each).

Let's take, for example, a single row.

43mb.csv is a 1-row CSV with 10 columns; one column is a 43 MB string.

When I read this CSV with pandas and then dump it to Parquet, my script consumes roughly 10x the 43 MB.

As the number of such rows grows, the relative memory overhead diminishes, but I want to focus on this specific case.

Here's the footprint after running it under memory_profiler:

Line #    Mem usage    Increment   Line Contents
================================================
     4     48.9 MiB     48.9 MiB   @profile
     5                             def test():
     6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
     7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
 

Is this typical for Parquet with big strings?
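
For reference, a minimal sketch of how this line-by-line measurement can be reproduced with memory_profiler; the file names are the ones mentioned above and are assumed to exist locally:

# Reproduction sketch (not taken verbatim from the issue).
# Assumes pandas, pyarrow and memory_profiler are installed and that
# '43mb.csv' (a 1-row CSV whose largest column is a ~43 MB string) exists.
import pandas as pd
from memory_profiler import profile


@profile
def test():
    data = pd.read_csv('43mb.csv')
    data.to_parquet('out.parquet')  # uses pyarrow by default


if __name__ == '__main__':
    test()  # running the script prints the line-by-line memory report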

Environment: Mac OSX
Reporter: Bogdan Klichuk


Note: This issue was originally created as ARROW-7305. Please see the migration documentation for further details.


Wes McKinney / @wesm:
There may be some things we could do about this. Do you have an example file we could use to help with profiling the internal memory allocations during the write process?


Bogdan Klichuk:
Sorry for the delay; I'm attaching a gzipped 50 MB CSV file with sample text that shows the same behavior. Thanks.


Wes McKinney / @wesm:
Note that because you are on macOS, jemalloc's background-thread memory reclamation is disabled. cc @pitrou
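
As an aside, you can check from Python which allocator a given pyarrow build is actually using; a small sketch (backend_name is only exposed in newer pyarrow releases, so treat its availability as an assumption about your version):

# Inspect the active Arrow memory pool and how much it currently holds.
import pyarrow as pa

pool = pa.default_memory_pool()
print("allocator:", pool.backend_name)            # e.g. 'jemalloc', 'mimalloc', 'system'
print("pool bytes allocated:", pool.bytes_allocated())
print("total allocated bytes:", pa.total_allocated_bytes())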


Bogdan Klichuk:
I have tried this in an Ubuntu Docker container, and the results for 0.14.1 vs. 0.15.1 are pretty interesting.

0.14.1:

Line #    Mem usage    Increment   Line Contents
================================================
     4     50.5 MiB     50.5 MiB   @profile
     5                             def do():
     6     99.9 MiB     49.4 MiB       df = pd.read_csv('50mb.csv')
     7    112.1 MiB     12.1 MiB       df.to_parquet('test.parquet')

0.15.1:

Line #    Mem usage    Increment   Line Contents
================================================
     4     50.5 MiB     50.5 MiB   @profile
     5                             def do():
     6    100.0 MiB     49.4 MiB       df = pd.read_csv('50mb.csv')
     7    401.4 MiB    301.4 MiB       df.to_parquet('test.parquet') 

Besides confirming that 0.14.1 does indeed behave better on non-macOS, this also shows that 0.15.1 requires much more memory to write.


Bogdan Klichuk:
Looking at a bigger example:

df = pd.concat([df] * 20)

This resolves into about 1.2 GB of real memory usage on 0.14.1 and about 2 GB on 0.15.1.


Wes McKinney / @wesm:
Thanks for the additional information. Someone (which could be you) will need to investigate; these issues are very time-consuming to diagnose.


Wes McKinney / @wesm:
See the following script

https://gist.github.com/wesm/193f644d10b5aee8c258b8f4f81c5161
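
The gist itself is not reproduced here; roughly, the measurement loop looks like the following sketch (the psutil-based RSS reporting, the file name, and the iteration counts are assumptions):

# Rough sketch of an RSS-tracking loop like the one in the gist.
import gc
import time

import pandas as pd
import psutil

proc = psutil.Process()


def print_rss(label):
    # Resident set size of the current process, in bytes.
    print(f"{label} RSS: {proc.memory_info().rss}")


print_rss("Starting")
for _ in range(10):
    df = pd.read_csv('50mb.csv')
    print_rss("Read CSV")
    df.to_parquet('test.parquet')
    print_rss("Wrote Parquet")
    del df
    gc.collect()
    time.sleep(1)
    print_rss("Waited 1 second")

# Keep watching after the work is done to see delayed memory reclamation.
for _ in range(10):
    time.sleep(1)
    print_rss("Waited 1 second")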

Here is the output for me (off the master branch; I assume 0.15.1 is the same):

$ python arrow7305.py 
Starting RSS: 102367232
Read CSV RSS: 154279936
Wrote Parquet RSS: 522485760
Waited 1 second RSS: 161763328
Read CSV RSS: 164732928
Wrote Parquet RSS: 528371712
Waited 1 second RSS: 226361344
Read CSV RSS: 167698432
Wrote Parquet RSS: 528502784
Waited 1 second RSS: 226492416
Read CSV RSS: 172175360
Wrote Parquet RSS: 532971520
Waited 1 second RSS: 230961152
Read CSV RSS: 172093440
Wrote Parquet RSS: 532889600
Waited 1 second RSS: 230879232
Read CSV RSS: 230940672
Wrote Parquet RSS: 532992000
Waited 1 second RSS: 230981632
Read CSV RSS: 232812544
Wrote Parquet RSS: 534822912
Waited 1 second RSS: 232812544
Read CSV RSS: 235274240
Wrote Parquet RSS: 537608192
Waited 1 second RSS: 235577344
Read CSV RSS: 236883968
Wrote Parquet RSS: 531349504
Waited 1 second RSS: 229318656
Read CSV RSS: 231157760
Wrote Parquet RSS: 533168128
Waited 1 second RSS: 231157760
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408
Waited 1 second RSS: 172433408

Here is the output from 0.14.1:

$ python arrow7305.py 
Starting RSS: 74477568
Read CSV RSS: 126550016
Wrote Parquet RSS: 129470464
Waited 1 second RSS: 129470464
Read CSV RSS: 132321280
Wrote Parquet RSS: 135151616
Waited 1 second RSS: 135151616
Read CSV RSS: 135155712
Wrote Parquet RSS: 133169152
Waited 1 second RSS: 133169152
Read CSV RSS: 135159808
Wrote Parquet RSS: 133230592
Waited 1 second RSS: 133230592
Read CSV RSS: 135217152
Wrote Parquet RSS: 135217152
Waited 1 second RSS: 135217152
Read CSV RSS: 139567104
Wrote Parquet RSS: 139567104
Waited 1 second RSS: 139567104
Read CSV RSS: 141398016
Wrote Parquet RSS: 133378048
Waited 1 second RSS: 133378048
Read CSV RSS: 137068544
Wrote Parquet RSS: 133234688
Waited 1 second RSS: 133234688
Read CSV RSS: 135221248
Wrote Parquet RSS: 135221248
Waited 1 second RSS: 135221248
Read CSV RSS: 139567104
Wrote Parquet RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688
Waited 1 second RSS: 133234688

I've only begun to investigate, but these changes have to do with the jemalloc version upgrade and the changes we made to its configuration options. I don't know what is causing the ~30-40 MB difference in baseline memory usage, though (it could be differences in aggregate shared library sizes). We changed memory page management to be performed in the background, which means that memory is not released to the OS immediately, as it was before, but rather on a short time delay, as you can see.

The basic idea is that requesting memory from the operating system is expensive, so jemalloc is a bit greedy about holding on to memory for a short period: applications that use a lot of memory often continue to use a lot of memory, and keeping pages around improves performance.

An alternative to our current configuration would be to disable the background_thread option and set decay_ms to 0. This would likely yield worse performance in some applications.
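
For reference, a minimal sketch of shortening the decay from Python, assuming a pyarrow build that exposes jemalloc_set_decay_ms and was compiled with jemalloc; the background_thread toggle itself is a jemalloc configuration detail not covered here:

# Ask jemalloc to return dirty pages to the OS as soon as possible instead of
# on a delayed decay. This trades some allocation performance for a lower and
# more predictable RSS. Has no effect if pyarrow was built without jemalloc.
import pyarrow as pa

pa.jemalloc_set_decay_ms(0)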

We have to strike a delicate balance between software that performs well in real-world scenarios and predictable resource utilization. It is hard to satisfy everyone.


Wes McKinney / @wesm:
I found a different but also concerning problem on macOS, relating to ARROW-6994, and have just put up this patch:

#6100
