# Key-Value RDD Operations

In [1]:
spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.19:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1588687919267)
SparkSession available as 'spark'


res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@af33f8d


# Types of Spark operations

1. **Transformations**: RDD $\rightarrow$ RDD
  * Examples: `map`, `filter`, `sample`, and [More](http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)
  * No communication needed
 

2. **Actions**: RDD $\rightarrow$ local object (a Scala value or collection) on the driver (head node).
  * Examples: `reduce`, `collect`, `count`, `take`, and [More](http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions).
  * *Some* communication needed.
  
  
3. **Shuffles:** RDD $\to$ RDD, **shuffle** needed
  * Examples: `sortBy`, `distinct`, `repartition`, `sortByKey`, `reduceByKey`, `join`, and [More](http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations)
  * *A LOT* of communication might be needed.
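
As a rough illustration of the three categories, the sketch below runs one operation of each kind on a small RDD. The data, the variable names and the `appName` are made up for this example; `getOrCreate()` reuses the notebook's session when one is already running and otherwise starts a local one.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("ops-sketch").getOrCreate().sparkContext

val numbers = sc.parallelize(1 to 10)

// Transformation: returns a new RDD right away; nothing is computed yet.
val evens = numbers.filter(_ % 2 == 0)

// Actions: trigger the computation and return a local value to the driver.
val evenCount = evens.count()       // a Long
val evenSum = evens.reduce(_ + _)   // an Int

// Shuffle: pairs with equal keys must be brought together before they are summed.
val byParity = numbers.map(x => (x % 2, x)).reduceByKey(_ + _).collectAsMap()
```

Only the last step moves data between partitions, which is why shuffles tend to dominate the cost of a job.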

# Key-value pairs

* Each pair is a Scala tuple, written `(key, value)`.
* The **key** is used to find the set of pairs that share a particular key.
* The **value** can be anything.
* Spark has a set of special operations for *(key, value)* RDDs.


Spark provides specific functions to deal with RDDs in which each element is a key/value pair. Key/value RDDs expose new operations, e.g. aggregating or grouping data with the same key, and combining two different RDDs by key. Such RDDs are also called pair RDDs.
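
A minimal sketch of a few pair-only operations (`keys`, `values`, `mapValues`) may make this concrete. The sample data and names are invented for illustration; `getOrCreate()` reuses an existing session if there is one.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("pair-rdd-sketch").getOrCreate().sparkContext

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

// These methods are only available when the elements are (key, value) tuples.
val ks = pairs.keys.collect()                    // the keys, duplicates included
val vs = pairs.values.collect()                  // the values
val doubled = pairs.mapValues(_ * 2).collect()   // transform values, leave keys untouched
```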

# Creating `(key,value)` RDDs

**Method 1**: `parallelize` a list of pairs.

In [8]:
val pair_rdd = sc.parallelize(List((1,2),(3,4)))
pair_rdd.collect().foreach(println)

(1,2)
(3,4)


pair_rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:27


**Method 2**: `map` a function that maps elements to key-value pairs

In [9]:
val reg_rdd = sc.parallelize(List(1,2,3,4,2,5,6))
val pair_rdd = reg_rdd.map(x => (x, x*x))
pair_rdd.collect().foreach(println)

(1,1)
(2,4)
(3,9)
(4,16)
(2,4)
(5,25)
(6,36)


reg_rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:27
pair_rdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[6] at map at <console>:28


# Transformations on (key, value) RDDs

## **`reduceByKey(func)`**

Apply the reduce function on the values with the same key.

In [13]:
val rdd = sc.parallelize(List((1,2), (2,4), (2,6)))

println(f"Original RDD: [${rdd.collect.mkString(", ")}]")
println(f"After transformation: [${rdd.reduceByKey((a,b)=> a+b).collect().mkString(", ")}]")

Original RDD: [(1,2), (2,4), (2,6)]
After transformation: [(1,2), (2,10)]


rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27


Note that although `reduceByKey` is similar to `reduce`, it is implemented as a transformation and not as an action, because the dataset can have a very large number of keys. It therefore does not return values to the driver program; it returns a new RDD.
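
The difference can be sketched side by side, assuming the same three pairs as above (variable names are illustrative, and `getOrCreate()` reuses an existing session if there is one):

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("reduce-sketch").getOrCreate().sparkContext

val rdd = sc.parallelize(List((1, 2), (2, 4), (2, 6)))

// reduce is an action: it returns a single plain value to the driver.
val total: Int = rdd.map(_._2).reduce(_ + _)

// reduceByKey is a transformation: the result is still a distributed RDD.
val perKey = rdd.reduceByKey(_ + _)

// An explicit action is needed to bring the per-key sums to the driver.
val perKeyLocal = perKey.collectAsMap()
```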

## **`sortByKey()`**

Sort RDD by keys in ascending order.

In [15]:
val rdd = sc.parallelize(List((2,2),(1,4),(3,6)))

println(f"Original RDD: [${rdd.collect().mkString(", ")}]")
println(f"After transformation: [${rdd.sortByKey().collect().mkString(", ")}]")

Original RDD: [(2,2), (1,4), (3,6)]
After transformation: [(1,4), (2,2), (3,6)]


rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:27


**Note:** The output of `sortByKey()` is an RDD. This means that RDDs do have a meaningful order, which extends across partitions.
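
`sortByKey` also takes an `ascending` flag. A small sketch of a descending sort on the same data (names are illustrative; `getOrCreate()` reuses an existing session if there is one):

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate().sparkContext

val rdd = sc.parallelize(List((2, 2), (1, 4), (3, 6)))

// ascending = false reverses the key order; collect() returns the pairs in sorted order.
val descending = rdd.sortByKey(ascending = false).collect()
// descending holds (3,6), (2,2), (1,4)
```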

## `groupByKey()`

Returns a new RDD of `(key, <iterable>)` pairs, where the iterable (`Iterable[V]` in Scala) ranges over the values associated with the key.

[Iterators](http://anandology.com/python-practice-book/iterators.html) are objects that generate a sequence of values one at a time (the linked introduction uses Python, but the idea carries over to Scala). Writing a loop over `n` elements as 
```scala
for (w <- List.range(0, n)) {
    // do something
}
```
is inefficient because it first allocates a list of `n` elements and then iterates over it.
Using the iterator `Iterator.range(0, n)` achieves the same result without materializing the list. Instead, elements are generated on the fly.

To copy the values of an iterable into a new collection we will use a `for` comprehension:
```scala
for (a <- <iterable>) yield a
```
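
The contrast between a materialized list and an iterator can be tried in plain Scala, without Spark. This is only a sketch; `n` and the variable names are arbitrary.

```scala
val n = 5

// Strict: the whole list exists in memory before any loop starts.
val asList: List[Int] = List.range(0, n)

// Lazy: elements are produced one at a time, as the loop asks for them.
val asIterator: Iterator[Int] = Iterator.range(0, n)

var sumFromIterator = 0
for (w <- asIterator) sumFromIterator += w

// An iterator can be traversed only once; after the loop it is exhausted.
val exhausted = !asIterator.hasNext

// Copying the values of an Iterable into a concrete List.
val values: Iterable[Int] = Seq(4, 6)
val copied: List[Int] = (for (a <- values) yield a).toList
```

Calling `values.toList` directly would give the same list; the `for` comprehension is shown because it is the form used in this section.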

In [19]:
val rdd = sc.parallelize(List((1,2), (2,4), (2,6)))

println(f"Original RDD: [${rdd.collect().mkString(", ")}]")

Original RDD: [(1,2), (2,4), (2,6)]


rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[21] at parallelize at <console>:27


In [18]:
//After transformation
rdd.groupByKey().mapValues(x => for (a <-x) yield a).collect()

res14: Array[(Int, Iterable[Int])] = Array((1,List(2)), (2,List(4, 6)))
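
When the grouped values are only needed to compute an aggregate, `groupByKey` followed by a per-key reduction gives the same answer as `reduceByKey`. The sketch below checks this on the same three pairs (names are illustrative; `getOrCreate()` reuses an existing session if there is one). `reduceByKey` is usually preferred in that case because it can combine values within each partition before the shuffle.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("group-sketch").getOrCreate().sparkContext

val rdd = sc.parallelize(List((1, 2), (2, 4), (2, 6)))

// Group, then materialize each Iterable as a sorted List for easy inspection.
val grouped = rdd.groupByKey().mapValues(_.toList.sorted).collectAsMap()

// Two routes to the per-key sum.
val summedViaGroup = rdd.groupByKey().mapValues(_.sum).collectAsMap()
val summedViaReduce = rdd.reduceByKey(_ + _).collectAsMap()
```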


##  `flatMapValues(func)` 

Similar to `flatMap()`: creates a separate key/value pair for each element of the list generated by the map operation.


`func` is a function that takes a single value as input and returns a collection (or iterator) of values.
`flatMapValues` operates on a key/value RDD. It applies `func` to each value and gets a list of values. It then combines each of these values with the original key to produce a list of key-value pairs. These lists are concatenated, as in `flatMap`.

In [22]:
val rdd = sc.parallelize(List((1,2),(2,4),(2,6)))

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:25


In [23]:
println(f"Original RDD :[${rdd.collect().mkString(", ")}]")


Original RDD :[(1,2), (2,4), (2,6)]


The anonymous function here generates, for each number `i`, a list containing `i, i+1`.

In [25]:

println(f"After transformation : [${rdd.flatMapValues(x => List.range(x,x+2)).collect().mkString(", ")}]")

After transformation : [(1,2), (1,3), (2,4), (2,5), (2,6), (2,7)]
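
One way to read `flatMapValues` is as a `flatMap` that re-attaches the key to every generated value. The sketch below writes both forms for the same data and function (names are illustrative; `getOrCreate()` reuses an existing session if there is one).

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("flatmapvalues-sketch").getOrCreate().sparkContext

val rdd = sc.parallelize(List((1, 2), (2, 4), (2, 6)))

// Built-in form: func sees only the value; the key is carried along automatically.
val viaFlatMapValues = rdd.flatMapValues(x => List.range(x, x + 2)).collect().toList

// Hand-written equivalent: generate the values, then pair each one with the key.
val viaFlatMap = rdd.flatMap { case (k, v) =>
  List.range(v, v + 2).map(x => (k, x))
}.collect().toList
```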


## (Advanced)  `combineByKey(createCombiner, mergeValue, mergeCombiner)` 
Combine values with the same key using a different result type.

This is the most general of the per-key aggregation functions. Most of the other per-key combiners are implemented using it. 


Spark's `combineByKey` method requires 3 functions:
* `createCombiner`
* `mergeValue`
* `mergeCombiner`

The elements of the original RDD are considered here *values*.

Values are converted into *combiners* which we will refer to here as "accumulators". An example of such a mapping is the mapping of the value *word* to the accumulator (*word*,1) that is done in WordCount.

An accumulator is then merged with further values of the same key and with other accumulators for that key to generate a result for each key. For example, we can use it to calculate per-activity average durations as follows. Consider an RDD of key/value pairs where keys correspond to different activities and values correspond to durations.

In [39]:
val rdd = sc.parallelize(List(("Sleep", 7), ("Work",5), ("Play", 3), 
                      ("Sleep", 6), ("Work",4), ("Play", 4),
                      ("Sleep", 8), ("Work",5), ("Play", 5)))

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[27] at parallelize at <console>:25


### Create a Combiner

```scala
value => (value, 1)
```

The first required argument in the `combineByKey` method is a function to be used as the very first aggregation step for each key. The argument of this function corresponds to the value in a key-value pair. If we want to compute the sum and count using `combineByKey`, then we create this "combiner" to be a tuple in the form of `(sum, count)`. The very first step in this aggregation is then `(value, 1)`, where `value` is the first RDD value that `combineByKey` comes across and `1` initializes the count.

In [65]:
def createCombiner = {(value:Int) =>
  (value.toDouble, 1)
}

createCombiner: Int => (Double, Int)
mergeValue: ((Double, Int), Int) => (Double, Int)
mergeCombiner: ((Double, Int), (Double, Int)) => (Double, Int)


### Merge a Value
```scala
(accumulator, value) => (accumulator._1 + value, accumulator._2 + 1)
```

The next required function tells `combineByKey` what to do when a combiner is given a new value. The arguments to this function are a combiner and a new value. The structure of the combiner is defined above as a tuple in the form of `(sum, count)`, so we merge the new value by adding it to the first element of the tuple and adding `1` to the second element.


In [None]:

def mergeValue = {(accumulator: (Double, Int), element:Int) =>
  (accumulator._1 + element, accumulator._2 + 1)
}



### Merge Combiners
```scala
(x, y) => (x._1 + y._1, x._2 + y._2)
```

The final required function tells `combineByKey` how to merge two combiners. In this example with tuples as combiners in the form of `(sum, count)`, all we need to do is add the two sums together and the two counts together.

In [None]:
def mergeCombiner = {(accumulator1: (Double, Int), accumulator2:(Double, Int)) =>
  (accumulator1._1 + accumulator2._1, accumulator1._2 + accumulator2._2)
}

In [66]:
val sum_counts = rdd.combineByKey(createCombiner, mergeValue, mergeCombiner)

sum_counts: org.apache.spark.rdd.RDD[(String, (Double, Int))] = ShuffledRDD[30] at combineByKey at <console>:30


In [67]:
sum_counts.collect()

res25: Array[(String, (Double, Int))] = Array((Work,(14.0,3)), (Play,(12.0,3)), (Sleep,(21.0,3)))


In [69]:
val duration_means_by_activity = sum_counts.mapValues(value=> value._1*1.0/value._2).collect()

duration_means_by_activity: Array[(String, Double)] = Array((Work,4.666666666666667), (Play,4.0), (Sleep,7.0))
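
To see when each of the three functions is called, the following plain-Scala sketch imitates the process on ordinary collections, without Spark. The split of the data into two partitions is hypothetical (Spark chooses the real partitioning), and `combineWithinPartition` is a helper invented for this illustration. The third function is named `mergeCombiners` here, which is the parameter name in Spark's own `combineByKey` signature.

```scala
type Acc = (Double, Int)   // (sum, count)

val createCombiner: Int => Acc = value => (value.toDouble, 1)
val mergeValue: (Acc, Int) => Acc = (acc, value) => (acc._1 + value, acc._2 + 1)
val mergeCombiners: (Acc, Acc) => Acc = (x, y) => (x._1 + y._1, x._2 + y._2)

// Hypothetical split of the activity data into two partitions.
val partitions: List[List[(String, Int)]] = List(
  List(("Sleep", 7), ("Work", 5), ("Play", 3), ("Sleep", 6), ("Work", 4)),
  List(("Play", 4), ("Sleep", 8), ("Work", 5), ("Play", 5))
)

// Step 1, inside one partition: the first value seen for a key goes through
// createCombiner; every later value for that key goes through mergeValue.
def combineWithinPartition(part: List[(String, Int)]): Map[String, Acc] =
  part.foldLeft(Map.empty[String, Acc]) { case (accs, (key, value)) =>
    val updated = accs.get(key) match {
      case Some(acc) => mergeValue(acc, value)
      case None      => createCombiner(value)
    }
    accs + (key -> updated)
  }

// Step 2, across partitions: accumulators for the same key are merged
// with mergeCombiners.
val merged: Map[String, Acc] =
  partitions.map(combineWithinPartition).reduce { (left, right) =>
    right.foldLeft(left) { case (accs, (key, acc)) =>
      accs + (key -> accs.get(key).map(mergeCombiners(_, acc)).getOrElse(acc))
    }
  }

// Final step: turn each (sum, count) into a mean.
val means: Map[String, Double] = merged.map { case (key, (sum, count)) => (key, sum / count) }
```

The sums, counts and means agree with the Spark output above, whichever way the data happens to be partitioned.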


# Transformations on two (key,value) RDDs

In [70]:
val rdd1 = sc.parallelize(List((1,2),(2,1),(2,2)))
val rdd2 = sc.parallelize(List((2,5),(3,1)))

rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:25
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:26


## `subtractByKey()`

Remove from RDD1 all elements whose key is present in RDD2.


In [74]:
println(f"rdd1 = [${rdd1.collect().mkString(", ")}]")
println(f"rdd2 = [${rdd2.collect().mkString(", ")}]")
println(f"Result = [${rdd1.subtractByKey(rdd2).collect().mkString(", ")}]")

rdd1 = [(1,2), (2,1), (2,2)]
rdd2 = [(2,5), (3,1)]
Result = [(1,2)]


##  `join()` 
* A fundamental operation in relational databases.
* Assumes two tables have a *key* column in common.
* Merges rows with the same key.

Suppose we have two `(key,value)` datasets:

**dataset 1**

| **key=name** | **(gender,occupation,age)** |
|--------------|-----------------------------|
| John         | (male,cook,21)              |
| Jill         | (female,programmer,19)      |
| John         | (male, kid, 2)              |
| Kate         | (female, wrestler, 54)      |

**dataset 2**

| **key=name** | **hair color** |
|--------------|----------------|
| Jill         | blond          |
| Grace        | brown          |
| John         | black          |


When `join` is called on datasets of type `(Key, V)` and `(Key, W)`, it returns a dataset of `(Key, (V, W))` pairs with all pairs of elements for each key. Joining the 2 datasets above yields:


|   key = name | (gender,occupation,age),haircolor |
|--------------|-----------------------------------|
| John         | ((male,cook,21),black)             |
| John         | ((male, kid, 2),black)             |
| Jill         | ((female,programmer,19),blond)     |

In [75]:
println(f"rdd1 = [${rdd1.collect().mkString(", ")}]")
println(f"rdd2 = [${rdd2.collect().mkString(", ")}]")
println(f"Result = [${rdd1.join(rdd2).collect().mkString(", ")}]")

rdd1 = [(1,2), (2,1), (2,2)]
rdd2 = [(2,5), (3,1)]
Result = [(2,(1,5)), (2,(2,5))]


### Variants of join.
There are four variants of `join`, which differ in how they treat keys that appear in one dataset but not the other.
* `join` is an *inner* join, which means that keys that appear in only one dataset are eliminated.
* `leftOuterJoin` keeps all keys from the left dataset even if they don't appear in the right dataset. The result of `leftOuterJoin` in our example will contain the keys `John, Jill, Kate`.
* `rightOuterJoin` keeps all keys from the right dataset even if they don't appear in the left dataset. The result of `rightOuterJoin` in our example will contain the keys `Jill, Grace, John`.
* `fullOuterJoin` keeps all keys from both datasets. The result of `fullOuterJoin` in our example will contain the keys `Jill, Grace, John, Kate`.

In outer joins, the side that may be missing is wrapped in an `Option`: if a key appears in only one dataset, the element of `(K,(V,W))` that would come from the other dataset is represented by `None`, and a value that is present is wrapped in `Some`.
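
The three outer variants can be sketched on the small numeric RDDs used in this section (redefined here so the block runs on its own; `getOrCreate()` reuses an existing session if there is one). Results are collected into sets because the order of elements after a join is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate().sparkContext

val rdd1 = sc.parallelize(List((1, 2), (2, 1), (2, 2)))
val rdd2 = sc.parallelize(List((2, 5), (3, 1)))

// Left outer: every key of rdd1 survives; the right side is an Option.
val left = rdd1.leftOuterJoin(rdd2).collect().toSet
// (1,(2,None)), (2,(1,Some(5))), (2,(2,Some(5)))

// Right outer: every key of rdd2 survives; the left side is an Option.
val right = rdd1.rightOuterJoin(rdd2).collect().toSet
// (2,(Some(1),5)), (2,(Some(2),5)), (3,(None,1))

// Full outer: every key of either RDD survives; both sides are Options.
val full = rdd1.fullOuterJoin(rdd2).collect().toSet
// (1,(Some(2),None)), (2,(Some(1),Some(5))), (2,(Some(2),Some(5))), (3,(None,Some(1)))
```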

# Actions on (key, val) RDDs

In [76]:
val rdd = sc.parallelize(List((1,2), (2,4), (2,6)))

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:25


## `countByKey()`
Count the number of elements for each key. Returns a local `Map` for easy access by key.

In [77]:
println(f"rdd = [${rdd.collect().mkString(", ")}]")
val result = rdd.countByKey()
println(f"Result = [${result.mkString(", ")}]")

rdd = [(1,2), (2,4), (2,6)]
Result = [1 -> 1, 2 -> 2]


result: scala.collection.Map[Int,Long] = Map(1 -> 1, 2 -> 2)


## `collectAsMap()` 
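Collect the result as a local `Map` to provide easy lookup. If a key occurs more than once, only one of its values is kept (in the output below, key `2` maps to `6` and the pair `(2,4)` is dropped).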

In [78]:
println(f"rdd = [${rdd.collect().mkString(", ")}]")
val result = rdd.collectAsMap()
println(f"Result = [${result.mkString(", ")}]")

rdd = [(1,2), (2,4), (2,6)]
Result = [2 -> 6, 1 -> 2]


result: scala.collection.Map[Int,Int] = Map(2 -> 6, 1 -> 2)


## `lookup(key)` 
Return all values associated with the provided key.
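
A short sketch on the same data (names are illustrative; `getOrCreate()` reuses an existing session if there is one). `lookup` returns a local `Seq`, which is empty when the key is absent.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one.
val sc = SparkSession.builder().master("local[*]").appName("lookup-sketch").getOrCreate().sparkContext

val rdd = sc.parallelize(List((1, 2), (2, 4), (2, 6)))

// All values stored under key 2 (here 4 and 6).
val valuesFor2: Seq[Int] = rdd.lookup(2)

// A key that does not occur gives an empty sequence rather than an error.
val valuesFor9: Seq[Int] = rdd.lookup(9)
```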