
[R] Reading Parquet files is 20x slower than reading fst files in R #22617

Closed
asfimport opened this issue Aug 14, 2019 · 8 comments

asfimport commented Aug 14, 2019

Problem

Loading any of the data mentioned below from Parquet is about 20x slower than loading it from the fst format in R.

 

How to get the data

https://loanperformancedata.fanniemae.com/lppub/index.html

Register and download any of these. I can't provide the data to you, and I think it's best you register.

 

[attachment: image-2019-08-14-10-04-56-834.png]

 

Code

library(data.table)
library(arrow)

path <- "data/Performance_2016Q4.txt"

a <- data.table::fread(path, header = FALSE)

fst::write_fst(a, "data/a.fst")
arrow::write_parquet(a, "data/a.parquet")

rm(a); gc()

# read-in test: fst
system.time(a <- fst::read_fst("data/a.fst"))            # 4.61 seconds

rm(a); gc()

# read-in test: Parquet
system.time(a <- arrow::read_parquet("data/a.parquet"))  # 99.19 seconds
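
One way to narrow down where that time goes is to separate the Parquet decode from the conversion to an R data.frame. A minimal sketch, assuming a recent arrow release in which read_parquet() exposes an as_data_frame argument; the paths reuse the ones above:

# Sketch: time the Parquet decode separately from the Arrow-to-data.frame
# conversion (assumes read_parquet() offers as_data_frame, as recent
# arrow releases do).
system.time(tbl <- arrow::read_parquet("data/a.parquet", as_data_frame = FALSE))
system.time(a   <- as.data.frame(tbl))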

Environment: Windows 10 Pro and Ubuntu
Reporter: Zhuo Jia Dai
Assignee: Wes McKinney / @wesm

Original Issue Attachments: image-2019-08-14-10-04-56-834.png

Note: This issue was originally created as ARROW-6230. Please see the migration documentation for further details.

Wes McKinney / @wesm:
Thanks for the example. I'm interested to see where the time is being spent. Reading Parquet files is quite fast in Python so I'll see what the performance is there also.

There's some work going on for the current release (see ARROW-3772, ARROW-3325, ARROW-3246) that will enable R factors to be written to and read from Parquet directly, so that could be a (no pun intended) factor in the results.
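
For context, tuning the data toward factors would look roughly like the sketch below; it reuses the data.table a from the reporter's script, and the output path is illustrative:

# Sketch: convert character columns to factors before writing, so the
# repetitive string columns round-trip as factor (dictionary) columns.
# The output path "data/a_factors.parquet" is illustrative.
chr_cols <- names(a)[vapply(a, is.character, logical(1))]
a[, (chr_cols) := lapply(.SD, factor), .SDcols = chr_cols]
arrow::write_parquet(a, "data/a_factors.parquet")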

Wes McKinney / @wesm:
For the record, loading this file and converting it to pandas takes time on the same order of magnitude as fst, without any tuning of data types (e.g. converting columns to factor/categorical):

In [28]: %time table = pq.read_table('2016Q4.parquet')                                                                                                          
CPU times: user 6.57 s, sys: 4.2 s, total: 10.8 s
Wall time: 2.05 s

In [29]: %time df = table.to_pandas()                                                                                                                           
CPU times: user 2.37 s, sys: 2.11 s, total: 4.48 s
Wall time: 2.04 s

So the performance issue is probably R-specific. I'll build the R package tomorrow and see if I can diagnose the problem.


Zhuo Jia Dai:
Actually, I can't even read it in Python on the same machine:

import pandas as pd
pd.read_parquet("~/a.parquet")
# Process finished with exit code 137 (interrupted by signal 9: SIGKILL)


Wes McKinney / @wesm:
Oh, you're running into https://issues.apache.org/jira/browse/ARROW-6060. If you downgrade to pyarrow==0.13.0 it should work.


Wes McKinney / @wesm:
On the master branch I have

> a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
   user  system elapsed 
  8.174   2.866   2.969 
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
   user  system elapsed 
  9.330   3.681   3.353 

This is on a true 16-core system.

This suggests that your performance problem is being caused by memory thrashing related to ARROW-6060. Sorry about that; I would guess we'll have the 0.15.0 release out with that fixed within 6 weeks.

perf report suggests there is certainly some optimization opportunity.

https://gist.github.com/wesm/7b577f0ce7dfdf96fddfd91943c162e5


Wes McKinney / @wesm:
Note that the Parquet file is about 5x smaller than the fst file:

-rw-r--r-- 1 wesm wesm  527777454 Aug 14 10:25 2016Q4.fst
-rw-r--r-- 1 wesm wesm  119175882 Aug 13 22:03 2016Q4.parquet
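
For reference, that size comparison can be reproduced from R with base file.size(), reusing the paths from the reporter's script (the exact ratio depends on the input file):

# Sketch: compare on-disk sizes of the two files written earlier
# (paths reuse the reporter's "data/" layout; adjust as needed).
sizes <- file.size(c("data/a.fst", "data/a.parquet"))
names(sizes) <- c("fst", "parquet")
round(sizes["fst"] / sizes["parquet"], 1)  # how many times smaller Parquet is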


Wes McKinney / @wesm:
Resolving for 0.15.0. If there are still performance or memory-use problems after 0.15.0 comes out, please reopen this issue or open a new one. Thanks!
