
Commit

cleaning up
edwindj committed Jul 7, 2016
2 parents 40416dd + 2e9bd99 commit ea660ec
Showing 2 changed files with 20 additions and 51 deletions.
71 changes: 20 additions & 51 deletions useR/lightning.Rmd
@@ -1,7 +1,7 @@
---
title: "Chunked"
author: "Edwin de Jonge"
date: "Statistisc Netherlands / UseR! 2016"
date: "Statistics Netherlands / UseR! 2016"
output:
beamer_presentation:
keep_tex: false
@@ -40,7 +40,23 @@ Short answer:
- Another text file
- A database

## Option 1: Use unix tools
## Option 1: Read data with R

### Use:

- ~~read.csv~~ uh, `readr::read_csv`
- `data.table::fread`
- Fast reading of data into memory!

### However...

- You will need a lot of RAM!
- Text files tend to be 1 to 100 GB.
- **Even though these procedures use memory mapping, the resulting `data.frame`
does not!**
- The development cycle of a processing script is long...
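
To make the trade-off concrete, here is a minimal sketch of this in-memory route (the file name `my_data.csv` and the columns `col1`, `col2` are assumptions, not part of the slides):

```{r}
# the whole file is parsed into RAM: fast, but the full data.frame must fit in memory
library(data.table)

dt <- fread("my_data.csv")              # or: readr::read_csv("my_data.csv")

# every subsequent step also operates on the complete in-memory table
dt_small <- dt[col1 > 1, .(col1, col2)]
```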
## Option 2: Use unix tools
### Good choice!
@@ -58,7 +74,7 @@ It is nice to stay in `R`-universe (one data-processing tool)
- Does it work on my OS/shell?
- I want to use dplyr verbs! (dplyr-deprivation...)
## Option 2: Import data in DB
## Option 3: Import data in DB
### Import data into DB
@@ -67,25 +83,9 @@ It is nice to stay in `R`-universe (one data-processing tool)
### However
- That is LET (Load, Extract, Transform) instead of ELT (Extract, Load, Transform)
- It is not really an R solution, but a DB solution
- It may not be efficient.
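
For illustration, a minimal sketch of this route with SQLite (the database file, table, and column names are assumptions): bulk-load the text file with the database's own loader, then query it from R.

```{r}
# bulk load outside R, e.g. in the sqlite3 shell:
#   .mode csv
#   .import my_data.csv my_large_table
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "my_db.sqlite3")

# the transformation now runs in the database, not in R
res <- dbGetQuery(con, "SELECT col1, col2 FROM my_large_table WHERE col1 > 1")
dbDisconnect(con)
```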
## Option 3: Read data with R

### Use:

- ~~read.csv~~ uh, `readr::read_csv`
- `data.table::fread`
- Fast reading of data into memory!

### However...

- You will need a lot of RAM!
- Text files tend to be 1 to 100 GB.
- **Even though these procedures use memory mapping, the resulting `data.frame`
does not!**
- The development cycle of a processing script is long...
## Process in chunks?
\begin{center}
@@ -107,37 +107,6 @@
- All `dplyr` verbs on `chunk_wise` objects are recorded and replayed when
writing.
## Option 4: Use chunked!
### Idea:
- Process data chunk by chunk using `dplyr` verbs
- Memory efficient, only one chunk at a time in memory
- Lazy processing
- Development cycle is short: test on first chunk.
###
- Read (and write) one chunk at a time using the R package `LaF`.
- All `dplyr` verbs on `chunk_wise` objects are recorded and replayed when
writing.
## Scenario 1: TXT -> TXT
### Preprocess a text file with data
```{r}
read_chunkwise("my_data.csv", chunk_size = 5000) %>%
select(col1, col2) %>%
filter(col1 > 1) %>%
mutate(col3 = col1 + 1) %>%
write_chunkwise("output.csv")
```
This code:
- evaluates chunk by chunk
- allows for column name completion in RStudio!
## Scenario 1: TXT -> TXT
### Preprocess a text file with data
@@ -168,7 +137,7 @@ tbl <-
mutate(col6 = col1 + col2) %>%
write_chunkwise(db, 'my_large_table')
```
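
The top of this chunk is collapsed in the diff view above; a self-contained sketch of the TXT -> DB scenario (the database file, input file, and column names are assumptions) could look like:

```{r}
library(chunked)
library(dplyr)

# an SQLite file serves as the target database
db <- src_sqlite("my_db.sqlite3", create = TRUE)

tbl <-
  read_chunkwise("my_data.csv", chunk_size = 5000) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise(db, "my_large_table")
```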
## Scenario 2: DB -> TXT
## Scenario 3: DB -> TXT
### Extract a large table from a DB to a text file
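
The slide body is collapsed in this view; a minimal sketch for the DB -> TXT direction (the connection, table, and column names are assumptions) might be:

```{r}
library(chunked)
library(dplyr)

db <- src_sqlite("my_db.sqlite3")

# read the database table chunk by chunk and stream it into a csv file
tbl(db, "my_large_table") %>%
  read_chunkwise(chunk_size = 5000) %>%
  select(col1, col5) %>%
  write_chunkwise("my_large_table.csv")
```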
Binary file modified useR/lightning.pdf
Binary file not shown.
