# Chapter 1. Secondary Sort: Introduction

## Spark Solution to Secondary Sort

### Step 3: Connect to the Spark master

- The sparkPackages="com.databricks:spark-csv_2.10:1.0.3" option names the package used to read CSV files.

In [1]:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sc <- sparkR.init(master="local[*]", sparkPackages="com.databricks:spark-csv_2.10:1.0.3" )
sqlContext <- sparkRSQL.init(sc)


Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, endsWith, intersect, rank, rbind, sample,
    startsWith, subset, summary, table, transform



Launching java with spark-submit command /usr/local/spark/bin/spark-submit  --packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell /tmp/RtmpRBcXAZ/backend_port229367227291 


### Step 4: Use the JavaSparkContext to create a JavaRDD

In [18]:
lines <- SparkR:::textFile(sc, 'timeseries.csv')

In [19]:
SparkR:::take(lines, 10)

In [20]:
text_df <- read.df(sqlContext=sqlContext, source="text", path="timeseries.csv", delimiter=",", header="true")

In [21]:
head(text_df)

| | value |
|---|---|
| 1 | x,2,9 |
| 2 | y,2,5 |
| 3 | x,1,3 |
| 4 | y,1,7 |
| 5 | y,3,1 |
| 6 | x,3,6 |


In [22]:
customSchema <- structType(
    structField("name", "string"),    
    structField("time1", "integer"),
    structField("time2", "integer")
)
csv_df <- read.df(sqlContext=sqlContext, source="com.databricks.spark.csv", path="timeseries.csv", header="true", schema = customSchema)

In [23]:
head(csv_df)

| | name | time1 | time2 |
|---|---|---|---|


### Step 5: Create key-value pairs from the JavaRDD

- In R, a tuple is represented as a list.
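
As a minimal sketch in base R (no Spark needed), a (name, (time, value)) tuple can be written as a two-element list whose second element is itself a list. The variable names below are illustrative only and do not come from the notebook.

```r
# A (name, (time, value)) tuple written as a nested R list.
pair <- list("x", list(2, 9))

key <- pair[[1]]          # first element: the key (name)
timevalue <- pair[[2]]    # second element: the (time, value) tuple

time <- timevalue[[1]]    # 2
value <- timevalue[[2]]   # 9
```

Double brackets (`[[ ]]`) extract the element itself, whereas single brackets would return a one-element list.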

In [24]:
make_pairs <- function(line) {
    # Split a "name,time,value" line into its three fields.
    tokens <- strsplit(line, ",")[[1]]
    # Convert time and value from strings to numbers.
    time <- as.numeric(tokens[2])
    value <- as.numeric(tokens[3])
    # Return list(key, list(time, value)), with the name as the key.
    timevalue <- list(time, value)
    result <- list(tokens[1], timevalue)
    return(result)
}

In [25]:
make_pairs( 'z,4,0' )

In [26]:
# Use map rather than flatMap: flatMap would flatten each list(key, value) into separate elements.
pairs <- SparkR:::map( lines, make_pairs )

- Let's check that the result is correct.

In [27]:
SparkR:::take(pairs, 35)
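
One thing worth checking here is the shape of each element, since the grouping step expects list(key, value) pairs. The base R sketch below only imitates the RDD operations locally. The sample data and the helper name `parse_line` are assumptions for illustration. It contrasts element-wise mapping with a map followed by one level of flattening.

```r
# Local stand-in for the RDD of lines (assumed sample data).
lines_local <- list("x,2,9", "y,2,5")

# Same parsing logic as make_pairs, repeated so this block is self-contained.
parse_line <- function(line) {
    tokens <- strsplit(line, ",")[[1]]
    list(tokens[1], list(as.numeric(tokens[2]), as.numeric(tokens[3])))
}

# Element-wise mapping: one list(key, value) pair per input line.
mapped <- lapply(lines_local, parse_line)

# Mapping followed by one level of flattening: keys and values become
# separate top-level elements, so the pair structure is lost.
flattened <- unlist(lapply(lines_local, parse_line), recursive = FALSE)

length(mapped)     # 2
length(flattened)  # 4
```

If the number of elements returned is twice the number of input lines, the pairs have been flattened and grouping by key will not work as intended.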

### Step 7: Group JavaPairRDD elements by the key (name)

- groupByKey(rdd, numPartitions)
    - rdd: The RDD to group. Should be an RDD where each element is list(K, V) or c(K, V).
    - numPartitions: Number of partitions to create.
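
To make the grouping semantics concrete, here is a rough local imitation in base R. It is not how SparkR implements groupByKey, and it does not model numPartitions. The sample data and variable names are assumptions.

```r
# Assumed sample pairs, each shaped as list(key, list(time, value)).
pairs_local <- list(
    list("x", list(2, 9)),
    list("y", list(2, 5)),
    list("x", list(1, 3))
)

# Pull out the keys and the values, then split the values by key.
keys <- vapply(pairs_local, function(p) p[[1]], character(1))
values <- lapply(pairs_local, function(p) p[[2]])
by_key <- split(values, keys)

# Rebuild one list(key, list_of_values) element per distinct key.
grouped_local <- lapply(names(by_key), function(k) list(k, by_key[[k]]))
```

Each element of `grouped_local` has the key in position 1 and the list of all (time, value) tuples for that key in position 2.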

In [28]:
groups <- SparkR:::groupByKey( pairs, 1L )

In [29]:
SparkR:::take(groups, 30)

### Verification

In [30]:
pairs2 <- list( 
    list('p', list(2, 6) ), 
    list('p', list(4, 7) ), 
    list('p', list(1, 9) ), 
    list('p', list(6, 0) ),
    list('p', list(7, 3) )
)
pairs2

In [31]:
rdd2 <- SparkR:::parallelize(sc, pairs2)
parts2 <- SparkR:::groupByKey(rdd2, 2L)
grouped2 <- collect(parts2)
grouped2
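
Grouping alone does not order the values within a key. As a minimal base R sketch of the per-key sort that a secondary sort would then apply, the block below orders the (time, value) tuples for key 'p' by time. The helper `sort_by_time` is hypothetical and runs locally on a plain list, not on an RDD.

```r
# Values for key 'p', in the same order as in pairs2 above.
values_p <- list(list(2, 6), list(4, 7), list(1, 9), list(6, 0), list(7, 3))

# Sort a list of (time, value) tuples by the first element (time).
sort_by_time <- function(values) {
    times <- vapply(values, function(tv) tv[[1]], numeric(1))
    values[order(times)]
}

sorted_p <- sort_by_time(values_p)

# Times and values after sorting, as plain numeric vectors.
sorted_times <- vapply(sorted_p, function(tv) tv[[1]], numeric(1))
sorted_values <- vapply(sorted_p, function(tv) tv[[2]], numeric(1))
```

Here `sorted_times` is 1, 2, 4, 6, 7 and `sorted_values` is 9, 6, 7, 0, 3. On an RDD, the same function could be applied to the second element of each grouped pair.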