chunked

R is a great tool, but processing data in large text files is cumbersome. chunked helps you to process large text files with dplyr while loading only a part of the data in memory. It builds on the excellent R package LaF.

Processing commands are written in dplyr syntax, and chunked (using LaF) takes care of processing the file chunk by chunk, using far less memory than reading it in whole. chunked is useful for select-ing and mutate-ing columns and filter-ing rows; it is less suited to group-ing and summariz-ing large text files. It is typically used as a data pre-processing step.

Install

chunked can be installed from CRAN with:

install.packages('chunked')

the beta version with:

install.packages('chunked', repos=c('https://cran.rstudio.com', 'http://edwindj.github.io/drat'))

and the development version with:

devtools::install_github('edwindj/chunked')

Enjoy! Feedback is welcome...

Usage

Text file -> process -> text file

The most common case is processing a large text file: select or add columns, filter rows, and write the result back to a text file:

read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise("./large_file_out.csv")

chunked will process and write the above statement in chunks of 5000 records. This differs from, for example, read.csv, which reads all data into memory before processing it.

Text file -> process -> database

Another option is to use chunked as a preprocessing step before loading the data into a database:

db <- src_sqlite('test.db', create=TRUE)

tbl <-
  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise(db, 'my_large_table')
  
# tbl now points to the table in sqlite.

Database -> process -> text file

chunked can also be used to export a database table chunkwise to a text file. Note, however, that in that case the processing takes place in the database and the chunkwise restrictions apply only to the writing.
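
A minimal sketch, continuing from the tbl object of the previous example and assuming read_chunkwise also accepts a dplyr database table (the output file name is illustrative):

read_chunkwise(tbl, chunk_size=5000) %>%      # fetch the table in chunks of 5000 rows
  write_chunkwise("./large_file_out2.csv")    # only this write is chunkwise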

Lazy processing

chunked will not start processing until collect or write_chunkwise is called.

data_chunks <-
  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
  select(col1, col3)
  
# won't start processing until
collect(data_chunks)
# or
write_chunkwise(data_chunks, "test.csv")
# or
write_chunkwise(data_chunks, db, "test")

Syntax completion of variables of a chunkwise file in RStudio works like a charm...

dplyr verbs

chunked implements the following dplyr verbs (a join example follows the list):

  • filter
  • select
  • rename
  • mutate
  • mutate_each
  • transmute
  • do
  • tbl_vars
  • inner_join
  • left_join
  • semi_join
  • anti_join
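
For example, a minimal sketch of a chunkwise join, assuming the join verbs accept a small in-memory data frame as the right-hand side (the labels lookup table below is hypothetical):

labels <- data.frame(col1 = 1:3, label = c("a", "b", "c"))  # hypothetical in-memory lookup table

read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
  left_join(labels, by = "col1") %>%             # each chunk is joined against labels
  write_chunkwise("./large_file_labeled.csv")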

Since data is processed in chunks, some dplyr verbs are not implemented:

  • arrange
  • right_join
  • full_join

summarize and group_by are implemented but generate a warning: they operate on each chunk, not on the whole data set. However, this still makes it easy to process a large file, by aggregating twice: once within each chunk, and once more on the combined chunk results.

  • summarize
  • group_by

For example, the mean Sepal.Width of each Species can be computed in two passes:

tmp <- tempfile()
write.csv(iris, tmp, row.names=FALSE, quote=FALSE)
iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example

iris_cw %>% 
  group_by(Species) %>%            # group in each chunk
  summarise( m = mean(Sepal.Width) # and summarize in each chunk
           , w = n()
           ) %>% 
  as.data.frame %>%                  # since each Species has 50 records, results will be in multiple chunks
  group_by(Species) %>%              # group the results from the chunk
  summarise(m = weighted.mean(m, w)) # and summarize it again