# sparklyr — R interface for Apache Spark

- docker run -d -p 8787:8787 rocker/rstudio
- http://211.249.50.37:8787/


- Installation is quite difficult.
- It does not work in the spark-all Docker image, nor in the rocker/rstudio Docker image.
- For this exercise, RStudio was installed in VirtualBox instead.


- Feature summary
    - Connect to Spark from R => the sparklyr package provides a complete dplyr backend.
    - Filter and aggregate Spark datasets from R.
    - Use Spark's distributed machine learning library from R.
    - Extend functionality by integrating with the Spark API and Spark packages.

## Installation

- sparklyr is installed using the devtools package.

In [27]:
install.packages("devtools", repos = "http://cran.ism.ac.jp/", lib = "/opt/conda/lib/R/library")
devtools::install_github("rstudio/sparklyr")

also installing the dependency ‘git2r’

In install.packages("devtools", repos = "http://cran.ism.ac.jp/", : installation of package ‘devtools’ had non-zero exit status
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


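- Many problems occur when installing devtools and sparklyr.
- On Ubuntu, the build fails with errors about missing external libraries. The following packages need to be installed first:
    - sudo apt-get -y build-dep libcurl4-gnutls-dev
    - sudo apt-get -y install libcurl4-gnutls-dev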

In [None]:
library(sparklyr)
spark_install(version = "1.6.1")

## Connecting to Spark

```
spark_connect(master, spark_home = Sys.getenv("SPARK_HOME"),
  app_name = "sparklyr", version = NULL, hadoop_version = NULL,
  config = spark_config(), extensions = sparklyr::registered_extensions())
```

In [None]:
library(sparklyr)
sc <- spark_connect(master = "local")
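
A connection opened with `spark_connect()` stays open until it is closed with `spark_disconnect()`. The sketch below wraps both calls in a small helper so the connection is released even when the work fails. `with_local_spark` is a hypothetical name used for illustration here and is not part of sparklyr.

```r
# Hypothetical helper (not part of sparklyr): run f(sc) against a Spark
# connection and always disconnect afterwards.
with_local_spark <- function(f, master = "local") {
  sc <- sparklyr::spark_connect(master = master)
  # on.exit() runs even if f() raises an error
  on.exit(sparklyr::spark_disconnect(sc), add = TRUE)
  f(sc)
}
```

A call could look like `with_local_spark(function(sc) sparklyr::spark_version(sc))`. Anything returned from `f` should already be collected into R, because references to Spark tables stop working once the connection is closed.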

## Reading Data

- R data can be copied into Spark with dplyr's copy_to() function.

In [None]:
install.packages("nycflights13")
install.packages("Lahman")

library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")

In [None]:
src_tbls(sc)

## Using dplyr

In [None]:
# filter by departure delay
flights_tbl %>% filter(dep_delay == 2)

In [None]:
delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
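
To make the aggregation above easier to follow, here is a minimal local sketch on a made-up data frame, using plain dplyr without Spark. `toy_flights` and its values are invented for illustration, and the `count` threshold is lowered to 2 so the tiny data set yields a result. One assumption worth stating: on the Spark side, averages are computed in SQL, where missing values are ignored, so the local version passes `na.rm = TRUE` to behave the same way.

```r
library(dplyr)

# Hypothetical toy data standing in for flights_tbl (values are made up)
toy_flights <- data.frame(
  tailnum   = c("A1", "A1", "A1", "B2", "B2", "C3"),
  distance  = c(500, 700, 600, 2500, 2500, 300),
  arr_delay = c(10, NA, 20, 5, 15, NA)
)

toy_delay <- toy_flights %>%
  group_by(tailnum) %>%
  summarise(
    count = n(),                            # flights per plane
    dist  = mean(distance),                 # average distance flown
    delay = mean(arr_delay, na.rm = TRUE)   # average delay, skipping NA like SQL AVG
  ) %>%
  # keep planes with enough flights, short average distance, and a known delay
  filter(count >= 2, dist < 2000, !is.na(delay))

toy_delay
```

Only plane `A1` survives: `B2` is removed by the distance filter, and `C3` has a single flight with no recorded delay.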

## Window Functions

In [None]:
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
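
The filter above ranks each player's seasons by hits and keeps the top two, excluding seasons with no hits. A small local sketch with plain dplyr shows the same logic on invented data; `toy_batting` and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data standing in for batting_tbl (values are made up)
toy_batting <- data.frame(
  playerID = c("p1", "p1", "p1", "p2", "p2"),
  yearID   = c(2001, 2002, 2003, 2001, 2002),
  H        = c(100, 150, 120, 0, 30)
)

top_seasons <- toy_batting %>%
  group_by(playerID) %>%
  # min_rank(desc(H)) gives rank 1 to the season with the most hits per player
  filter(min_rank(desc(H)) <= 2 & H > 0) %>%
  ungroup() %>%
  arrange(playerID, yearID)

top_seasons
```

Player `p1` keeps the 2002 and 2003 seasons, and `p2` keeps only 2002 because the 2001 season has zero hits.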

## Using SQL

In [None]:
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview

## Machine Learning

- Within the sparklyr package, you can use the machine learning algorithms included in Spark's MLlib.
- These functions work with the high-level APIs built on top of DataFrames.

In [None]:
# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

In [None]:
summary(fit)
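
To see what the Spark pipeline above computes, here is a rough local analogue in base R. It is a sketch, not the sparklyr implementation: it assumes an exact half/half split via `sample()`, whereas `sdf_partition()` assigns rows at random by weight, so the rows in each partition and the fitted coefficients will differ from the Spark run. The names `cars_local`, `train_local`, `test_local` and `fit_local` are illustrative.

```r
# Local, non-Spark analogue of the pipeline above (base R only)
set.seed(1099)  # same seed value as above; the random streams are unrelated

# keep cars with hp >= 100 and add the 8-cylinder indicator
cars_local <- subset(mtcars, hp >= 100)
cars_local$cyl8 <- cars_local$cyl == 8

# split the rows into two halves at random
train_idx   <- sample(nrow(cars_local), size = floor(nrow(cars_local) / 2))
train_local <- cars_local[train_idx, ]
test_local  <- cars_local[-train_idx, ]

# fit mpg ~ wt + cyl on the training half
fit_local <- lm(mpg ~ wt + cyl, data = train_local)

# root mean squared error on the held-out half
pred <- predict(fit_local, newdata = test_local)
rmse <- sqrt(mean((test_local$mpg - pred)^2))
```

Evaluating on the held-out half, rather than on the training rows, gives a fairer picture of how well the model generalizes.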