
[R] Reading Parquet files is 20x slower than reading fst files in R #22617

Closed
asfimport opened this issue Aug 14, 2019 · 8 comments

asfimport commented Aug 14, 2019

Problem

Loading any of the data mentioned below from Parquet is about 20x slower than loading it from the fst format in R.

 

How to get the data

https://loanperformancedata.fanniemae.com/lppub/index.html

Register and download any of these. I can't provide the data to you, and I think it's best you register.

 

[attachment: image-2019-08-14-10-04-56-834.png]

 

Code

library(data.table)
library(arrow)

path <- "data/Performance_2016Q4.txt"

a <- data.table::fread(path, header = FALSE)

fst::write_fst(a, "data/a.fst")
arrow::write_parquet(a, "data/a.parquet")

rm(a); gc()

# read-in test: fst
system.time(a <- fst::read_fst("data/a.fst"))            # 4.61 seconds

rm(a); gc()

# read-in test: Parquet
system.time(a <- arrow::read_parquet("data/a.parquet"))  # 99.19 seconds
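
One way to narrow down where that time goes is to separate the Parquet decode from the conversion to an R data.frame. A minimal sketch, assuming a recent arrow release in which read_parquet() exposes an as_data_frame argument; the paths reuse the ones above:

# Sketch: time the Parquet decode separately from the Arrow-to-data.frame
# conversion (assumes read_parquet() offers as_data_frame, as recent
# arrow releases do).
system.time(tbl <- arrow::read_parquet("data/a.parquet", as_data_frame = FALSE))
system.time(a   <- as.data.frame(tbl))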

Environment: Windows 10 Pro and Ubuntu
Reporter: Zhuo Jia Dai
Assignee: Wes McKinney / @wesm

Original Issue Attachments: image-2019-08-14-10-04-56-834.png

Note: This issue was originally created as ARROW-6230. Please see the migration documentation for further details.

Wes McKinney / @wesm:
Thanks for the example. I'm interested to see where the time is being spent. Reading Parquet files is quite fast in Python so I'll see what the performance is there also.

There's some work going on for the current release (see ARROW-3772, ARROW-3325, ARROW-3246) that will enable R factors to be written to and read from Parquet directly, so that could be a (no pun intended) factor in the results.
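
For context, tuning the data toward factors would look roughly like the sketch below; it reuses the data.table a from the reporter's script, and the output path is illustrative:

# Sketch: convert character columns to factors before writing, so the
# repetitive string columns round-trip as factor (dictionary) columns.
# The output path "data/a_factors.parquet" is illustrative.
chr_cols <- names(a)[vapply(a, is.character, logical(1))]
a[, (chr_cols) := lapply(.SD, factor), .SDcols = chr_cols]
arrow::write_parquet(a, "data/a_factors.parquet")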

Wes McKinney / @wesm:
For the record, loading this file and converting it to pandas takes time on the same order of magnitude as fst, without any tuning of data types (e.g. converting columns to factor/categorical):

In [28]: %time table = pq.read_table('2016Q4.parquet')                                                                                                          
CPU times: user 6.57 s, sys: 4.2 s, total: 10.8 s
Wall time: 2.05 s

In [29]: %time df = table.to_pandas()                                                                                                                           
CPU times: user 2.37 s, sys: 2.11 s, total: 4.48 s
Wall time: 2.04 s

So the performance issue is probably R-specific. I'll build the R package tomorrow and see if I can diagnose the problem.


Zhuo Jia Dai:
Actually, I can't even read it in Python on the same machine:

import pandas as pd
pd.read_parquet("~/a.parquet")
# Process finished with exit code 137 (interrupted by signal 9: SIGKILL)


Wes McKinney / @wesm:
Oh, you're running into https://issues.apache.org/jira/browse/ARROW-6060. If you downgrade to pyarrow==0.13.0 it should work.


Wes McKinney / @wesm:
On the master branch I have

> a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
   user  system elapsed 
  8.174   2.866   2.969 
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
   user  system elapsed 
  9.330   3.681   3.353 

This is on a true 16-core system.

This suggests that your performance problem is being caused by memory thrashing related to ARROW-6060. Sorry about that; I would guess we'll have the 0.15.0 release out with that fixed within 6 weeks.

perf report suggests there is certainly some optimization opportunity.

https://gist.github.com/wesm/7b577f0ce7dfdf96fddfd91943c162e5


Wes McKinney / @wesm:
Note that the Parquet file is about 5x smaller than the fst file:

-rw-r--r-- 1 wesm wesm  527777454 Aug 14 10:25 2016Q4.fst
-rw-r--r-- 1 wesm wesm  119175882 Aug 13 22:03 2016Q4.parquet
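
For reference, that size comparison can be reproduced from R with base file.size(), reusing the paths from the reporter's script (the exact ratio depends on the input file):

# Sketch: compare on-disk sizes of the two files written earlier
# (paths reuse the reporter's "data/" layout; adjust as needed).
sizes <- file.size(c("data/a.fst", "data/a.parquet"))
names(sizes) <- c("fst", "parquet")
round(sizes["fst"] / sizes["parquet"], 1)  # how many times smaller Parquet is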


Wes McKinney / @wesm:
Resolving for 0.15.0. If there are still performance or memory-use problems after 0.15.0 comes out, please reopen this issue or open a new one. Thanks!
