# CHAPTER 1  - Secondary Sort: Introduction

- Secondary sort is the problem of sorting records by key and also sorting the values attached to each key.
- Let's solve it with the MapReduce framework, a parallel processing model.

- Basic structure of MapReduce

> map(key1, value1) → list(key2, value2)

> reduce(key2, list(value2)) → list(key3, value3)
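
The two signatures above can be illustrated with a toy, single-process sketch. The helper `run_mapreduce` and the sample `mapper` and `reducer` are hypothetical names made up for this illustration, not part of Hadoop or Spark; the point is only that the framework orders keys, while the values of each key reach the reducer unsorted.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, shuffle (group by key), reduce."""
    shuffled = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            shuffled[key2].append(value2)
    output = []
    for key2 in sorted(shuffled):  # keys are sorted, values are not
        output.extend(reducer(key2, shuffled[key2]))
    return output

def mapper(offset, line):
    # map(key1, value1) -> list(key2, value2)
    name, time, value = line.split(",")
    yield name, (int(time), int(value))

def reducer(name, timevalues):
    # reduce(key2, list(value2)) -> list(key3, value3)
    yield name, list(timevalues)  # values keep their arrival order here

records = list(enumerate(["x,2,9", "y,2,5", "x,1,3", "y,1,7"]))
result = run_mapreduce(records, mapper, reducer)
print(result)
# [('x', [(2, 9), (1, 3)]), ('y', [(2, 5), (1, 7)])]
```

Note that `x` comes out with time 2 before time 1: ordering the values is exactly what secondary sort has to add.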

## Solutions to the Secondary Sort Problem


First approach

- In the reduce phase, load all the values of a key into memory and sort them there.
- A key with too many values can cause an out-of-memory error.

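Second approach

- Use the sorting that the MapReduce framework already performs on keys before they reach the reducer.
- This requires building a composite key that combines the key and the value.
- These notes apply this approach only with Hadoop's MapReduce framework; the Spark walkthrough below uses the first approach.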
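
The composite-key idea can be sketched in plain Python, with no Hadoop or Spark involved. This is a minimal illustration under stated assumptions: the list sort stands in for the framework's shuffle sort, and `groupby` stands in for grouping on the natural key; the variable names are made up for this example.

```python
from itertools import groupby

records = [("x", 2, 9), ("y", 2, 5), ("x", 1, 3),
           ("y", 1, 7), ("y", 3, 1), ("x", 3, 6)]

# Map: promote the time into the key, giving ((name, time), value).
composite = [((name, time), value) for name, time, value in records]

# Shuffle: stand-in for the framework's sort on the full composite key.
composite.sort(key=lambda kv: kv[0])

# Reduce: group on the natural key (name) only; the values of each
# group are already ordered by time, so no in-memory sort is needed.
result = [
    (name, [value for _, value in group])
    for name, group in groupby(composite, key=lambda kv: kv[0][0])
]
print(result)
# [('x', [3, 9, 6]), ('y', [7, 5, 1])]
```

Because the ordering happens during the sort on keys, the reducer never has to hold all the values of one key in memory at once.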


![](sparkr_01_01.jpg)

![](sparkr_01_02.jpg)

## Spark Solution to Secondary Sort

### Time Series as Input

```
name time value
x 2 9
y 2 5
x 1 3
y 1 7
y 3 1
x 3 6
z 1 4
z 2 8
z 3 7
z 4 0
p 2 6
p 4 7
p 1 9
p 6 0
p 7 3
```

### Expected Output

```
name t1 t2 t3 t4 t5 ...
x => [3, 9, 6]
y => [7, 5, 1]
z => [4, 8, 7, 0]
p => [9, 6, 7, 0, 3]
```

### Step 3: Connect to the Spark master

In [1]:
```
from pyspark import SparkContext

# Create the SparkContext, the entry point for building RDDs.
sc = SparkContext()
sc
```

```
<pyspark.context.SparkContext at 0x7fd68418ea50>
```

### Step 4: Use the SparkContext to create an RDD

- Contents of timeseries.csv:
```
x,2,9
y,2,5
x,1,3
y,1,7
y,3,1
x,3,6
z,1,4
z,2,8
z,3,7
z,4,0
p,2,6
p,4,7
p,1,9
p,6,0
p,7,3
```

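In [2]:
```
# Read the file as an RDD of lines; the second argument is the minimum number of partitions.
lines = sc.textFile('timeseries.csv', 1)
```

In [3]:
```
lines.take(10)
```

```
[u'x,2,9',
 u'y,2,5',
 u'x,1,3',
 u'y,1,7',
 u'y,3,1',
 u'x,3,6',
 u'z,1,4',
 u'z,2,8',
 u'z,3,7',
 u'z,4,0']
```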

### Step 5: Create key-value pairs from the RDD

In [59]:
```
def make_pairs(s):
    """Parse a 'name,time,value' line into (name, (time, value))."""
    tokens = s.split(",")
    time = int(tokens[1])
    value = int(tokens[2])
    return (tokens[0], (time, value))
```

In [60]:
```
pairs = lines.map(make_pairs)
```

In [61]:
```
pairs.collect()
```

```
[(u'x', (2, 9)),
 (u'y', (2, 5)),
 (u'x', (1, 3)),
 (u'y', (1, 7)),
 (u'y', (3, 1)),
 (u'x', (3, 6)),
 (u'z', (1, 4)),
 (u'z', (2, 8)),
 (u'z', (3, 7)),
 (u'z', (4, 0)),
 (u'p', (2, 6)),
 (u'p', (4, 7)),
 (u'p', (1, 9)),
 (u'p', (6, 0)),
 (u'p', (7, 3))]
```

### Step 7: Group the pair RDD elements by key (name)

In [62]:
```
groups = pairs.groupByKey()
```

In [63]:
```
groups.collect()
```

```
[(u'y', <pyspark.resultiterable.ResultIterable at 0x7fd65e5fed10>),
 (u'x', <pyspark.resultiterable.ResultIterable at 0x7fd65e5fed50>),
 (u'z', <pyspark.resultiterable.ResultIterable at 0x7fd65e5fe850>),
 (u'p', <pyspark.resultiterable.ResultIterable at 0x7fd65e5fe690>)]
```

In [64]:
```
# Each value is a lazy ResultIterable, so iterate over it to see its contents.
debug2 = groups.collect()
for t2 in debug2:
    print("debug2 key={}\t value={}".format(t2[0], "".join([str(x) for x in t2[1]])))
```

```
debug2 key=y	 value=(2, 5)(1, 7)(3, 1)
debug2 key=x	 value=(2, 9)(1, 3)(3, 6)
debug2 key=z	 value=(1, 4)(2, 8)(3, 7)(4, 0)
debug2 key=p	 value=(2, 6)(4, 7)(1, 9)(6, 0)(7, 3)
```
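
What the grouping step does to the data can be mimicked in plain Python. The helper `group_by_key` below is a hypothetical stand-in written for this illustration, not PySpark's implementation; it only shows that grouping collects the `(time, value)` tuples per name without ordering them by time.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect every value that shares a key, keeping arrival order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

sample_pairs = [("x", (2, 9)), ("y", (2, 5)), ("x", (1, 3)),
                ("y", (1, 7)), ("y", (3, 1)), ("x", (3, 6))]
grouped = group_by_key(sample_pairs)
print(grouped["x"])  # [(2, 9), (1, 3), (3, 6)]
print(grouped["y"])  # [(2, 5), (1, 7), (3, 1)]
```

The values are still in arrival order (time 2 before time 1), which is why a separate sorting step follows.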


### Step 9: Sort the reducer’s values in memory

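In [78]:
```
def secondary_sort(s):
    """Return the (time, value) tuples of one group sorted by time."""
    # Tuples compare element by element, so a plain sort orders by time
    # first and by value on ties, the same order the decorate-sort-undecorate
    # idiom would give.
    values = list(s)
    values.sort()
    return values
```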

In [79]:
```
# Note: this name shadows Python's built-in sorted() for the rest of the session.
sorted = groups.mapValues(secondary_sort)
```

In [80]:
```
sorted.collect()
```

```
[(u'y', [(1, 7), (2, 5), (3, 1)]),
 (u'x', [(1, 3), (2, 9), (3, 6)]),
 (u'z', [(1, 4), (2, 8), (3, 7), (4, 0)]),
 (u'p', [(1, 9), (2, 6), (4, 7), (6, 0), (7, 3)])]
```

### Step 10: Sort by key

In [85]:
```
# True means ascending order.
sorted2 = sorted.sortByKey(True)
```

In [86]:
```
sorted2.collect()
```

```
[(u'p', [(1, 9), (2, 6), (4, 7), (6, 0), (7, 3)]),
 (u'x', [(1, 3), (2, 9), (3, 6)]),
 (u'y', [(1, 7), (2, 5), (3, 1)]),
 (u'z', [(1, 4), (2, 8), (3, 7), (4, 0)])]
```
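
The collected result still carries the time next to each value, while the Expected Output section lists only the values per name. A minimal sketch of that last formatting step follows; `to_series` is a hypothetical helper introduced here, and the input is the collected list above copied into a plain Python list.

```python
def to_series(name, timevalues):
    """Drop the time component and format one line as 'name => [values]'."""
    values = [value for _, value in timevalues]
    return "{} => {}".format(name, values)

sorted_groups = [
    ("p", [(1, 9), (2, 6), (4, 7), (6, 0), (7, 3)]),
    ("x", [(1, 3), (2, 9), (3, 6)]),
    ("y", [(1, 7), (2, 5), (3, 1)]),
    ("z", [(1, 4), (2, 8), (3, 7), (4, 0)]),
]
series_lines = [to_series(name, tv) for name, tv in sorted_groups]
for line in series_lines:
    print(line)
# p => [9, 6, 7, 0, 3]
# x => [3, 9, 6]
# y => [7, 5, 1]
# z => [4, 8, 7, 0]
```

On the RDD itself the same helper could presumably be applied with `sorted2.map(lambda kv: to_series(kv[0], kv[1]))`. The lines come out in key order (p, x, y, z), whereas the Expected Output section lists the names in input order.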