[R] Reading in Parquet files are 20x slower than reading fst files in R #22617
Comments
Wes McKinney / @wesm: There's some work going on for the current release (see ARROW-3772, ARROW-3325, ARROW-3246) that will enable direct writing of R factors to and from Parquet, so that could be a (no pun intended) factor in the results |
Wes McKinney / @wesm: In [28]: %time table = pq.read_table('2016Q4.parquet')
CPU times: user 6.57 s, sys: 4.2 s, total: 10.8 s
Wall time: 2.05 s
In [29]: %time df = table.to_pandas()
CPU times: user 2.37 s, sys: 2.11 s, total: 4.48 s
Wall time: 2.04 s
So the performance issue is probably R-specific. I'll build the R package tomorrow and see if I can diagnose the problem. |
Zhuo Jia Dai: import pandas as pd
pd.read_parquet("~/a.parquet")
# Process finished with exit code 137 (interrupted by signal 9: SIGKILL) |
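The exit code 137 reported above follows the common convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL). A minimal sketch of that decoding in Python; the helper name `decode_exit_code` is hypothetical and is not part of pandas or Arrow, and the signal name lookup assumes a POSIX-like platform:

```python
import signal

def decode_exit_code(code):
    """Return (signal_number, signal_name) when the exit code follows the
    128 + N convention for a process killed by signal N, else None."""
    if code <= 128:
        return None  # ordinary exit status, not a signal-based one
    signum = code - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown"  # signal number not defined on this platform
    return signum, name

print(decode_exit_code(137))
```

A SIGKILL during a large read is often, though not always, the operating system's out-of-memory killer, which would be consistent with the memory problem discussed elsewhere in this thread.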
Wes McKinney / @wesm: |
Wes McKinney / @wesm: > a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
user system elapsed
8.174 2.866 2.969
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
user system elapsed
9.330 3.681 3.353
This is on a true 16-core system. This suggests that your performance problem is being caused by memory thrashing related to ARROW-6060. Sorry about that; I would guess we'll have the 0.15.0 release out with that fixed within 6 weeks. The perf report suggests there is certainly some optimization opportunity: https://gist.github.com/wesm/7b577f0ce7dfdf96fddfd91943c162e5 |
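To put the two `system.time()` results above side by side, here is a small sketch that computes the Parquet-to-fst ratios from the quoted numbers. The dictionary layout and the `ratio` helper are illustrative assumptions; only the timing values come from the thread. On these numbers the elapsed-time ratio comes out at roughly 1.13x, far from the 20x in the original report.

```python
# Timings in seconds, copied from the system.time() output quoted above.
fst_times = {"user": 8.174, "system": 2.866, "elapsed": 2.969}
parquet_times = {"user": 9.330, "system": 3.681, "elapsed": 3.353}

def ratio(slower, faster):
    """How many times larger `slower` is than `faster`."""
    return slower / faster

# Wall-clock comparison.
elapsed_ratio = ratio(parquet_times["elapsed"], fst_times["elapsed"])

# Total CPU time (user + system) comparison.
cpu_ratio = ratio(
    parquet_times["user"] + parquet_times["system"],
    fst_times["user"] + fst_times["system"],
)

print(f"elapsed: {elapsed_ratio:.2f}x, cpu: {cpu_ratio:.2f}x")
```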
Wes McKinney / @wesm: -rw-r--r-- 1 wesm wesm 527777454 Aug 14 10:25 2016Q4.fst
-rw-r--r-- 1 wesm wesm 119175882 Aug 13 22:03 2016Q4.parquet |
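The directory listing above also allows a quick on-disk size comparison. A minimal sketch using the byte counts from the listing; the variable names are illustrative. The fst file is roughly 4.4 times the size of the Parquet file.

```python
# File sizes in bytes, copied from the `ls -l` output above.
fst_bytes = 527_777_454
parquet_bytes = 119_175_882

# How much larger the fst file is than the Parquet file.
size_ratio = fst_bytes / parquet_bytes

MIB = 1024 * 1024
print(f"fst: {fst_bytes / MIB:.1f} MiB")
print(f"parquet: {parquet_bytes / MIB:.1f} MiB")
print(f"fst / parquet: {size_ratio:.2f}x")
```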
Wes McKinney / @wesm: |
Problem
Loading any of the data described below from Parquet is 20x slower than loading the same data from the fst format in R.
How to get the data
https://loanperformancedata.fanniemae.com/lppub/index.html
Register and download any of these. I can't provide the data to you, and I think it's best you register.
Code
Environment: Windows 10 Pro and Ubuntu
Reporter: Zhuo Jia Dai
Assignee: Wes McKinney / @wesm
Related issues:
Original Issue Attachments:
Note: This issue was originally created as ARROW-6230. Please see the migration documentation for further details.