# PySpark Tutorial - Joins
<div>
 <h2> CSCI 4253 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

In [None]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

In [None]:
conf=SparkConf().setAppName("pyspark tutorial").setMaster("local[*]")
sc = SparkContext(conf=conf)

## Key Value Data, Grouping and Joins

The Key-Value or (k,v) datatype is fundamental to many operations, including grouping and joins. We need to be able to:
* Create keys from non-KV data
* Group or organize data according to keys
* Operate on data according to keys
* Form standard joins

### Creating Key-Value Pairs

We can directly create K-V pairs using lists of pairs:

In [None]:
visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1") ])
pageNames = sc.parallelize([ ("index.html", "Home"),
                            ("about.html", "About") ])

When we `collect()` the K-V pairs, we'll get the full list of (key, value) pairs -- this pulls the data to the front-end machine:

In [None]:
visits.collect()

The first position of the K-V pair is the key. We'll cover grouping in more detail later, but let's take a look at grouping visits by the keys:

In [None]:
visits.groupByKey().collect()

The KV pairs have now been *grouped*, meaning that we have an RDD of all the keys, and the value for each key is a "list" of the values corresponding to that key in the original KV RDD. The values for each key are returned as a `ResultIterable`, an iterable wrapper PySpark uses for grouped values, rather than as a plain Python list.

We reify the results by converting each one to a list. You wouldn't collect the result in practice because `collect()` brings all the values back to the front-end machine across the network, but let's see what this produces to understand what `groupByKey` is doing.

We'll `map` a lambda function that simply "flattens" the items by converting the `ResultIterable` into a list.

In [None]:
visits.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
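
To make the grouping step concrete, here is a minimal plain-Python sketch of what `groupByKey` computes. It runs without Spark; the names `group_by_key` and `visits_local` are made up for this illustration and this is not how Spark implements grouping internally.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Local model of groupByKey: gather the values of each key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Same data as the `visits` RDD, held in an ordinary Python list.
visits_local = [("index.html", "1.2.3.4"),
                ("about.html", "3.4.5.6"),
                ("index.html", "1.3.3.1")]

grouped = group_by_key(visits_local)
print(grouped)
```

Each key appears once in the result, paired with every value that shared that key in the input.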

When we `map` across a grouped KV RDD, the mapped function takes an argument which is the pair "(key, value)" -- that's why we're referring to `x[0]` for the key and `x[1]` for the value in the sample `map`.

It's more typical that you just want to `map` across the values, and there is a corresponding `mapValues` function that does precisely this. In the example below, we are mapping `list` across the values in the groups. The results should be the same as in the previous cell.

In [None]:
visits.groupByKey().mapValues(list).collect()

`keyBy` is a function to efficiently create a key-value pair from an RDD. The elements of the original RDD are the "values" and a function provides the associated key.

For example, assume we want $K = V^2$ for values $V$ in an RDD:

In [None]:
r = sc.parallelize( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3] )
r.collect()

In [None]:
rsq = r.keyBy(lambda x : x*x)
rsq.collect()

For example, we might want to create keys for the **passwd** data using the shell name

In [None]:
passwd = sc.textFile("/etc/passwd")

The shell is the 7th field (index 6) in the `/etc/passwd` file:

In [None]:
passwd.take(1)[0].split(':')

Thus, if we want the `key` to be the element at index 6, we provide a function that extracts that from each entry:

In [None]:
byShell = passwd.keyBy( lambda x : x.split(':')[6] )
byShell.take(3)

This is more or less equivalent to a `map` that returns pairs of values. For example:

In [None]:
passwd.map( lambda x : (x.split(':')[6], x) ).take(3)
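
The relationship between `keyBy` and `map` can be sketched in plain Python. The helper `key_by` and the sample line (a made-up user in `/etc/passwd` format) are assumptions for illustration only; the sketch runs without Spark.

```python
def key_by(values, key_func):
    """Local model of keyBy: pair each element with the key computed from it."""
    return [(key_func(v), v) for v in values]

# Keys are the squares of the values, as in the K = V^2 example.
squares = key_by([0, 1, 2, 3], lambda x: x * x)

# A hypothetical line in /etc/passwd format; the shell is at index 6.
line = "alice:x:1000:1000:Alice:/home/alice:/bin/bash"
by_shell_local = key_by([line], lambda s: s.split(":")[6])

print(squares)
print(by_shell_local)
```

The original element is kept whole as the value; only the key is derived from it.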

### Using Key-Value Pairs

We can get a dictionary containing the number of logins for each shell using the `countByKey` function:

In [None]:
byShell.countByKey()
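
As a rough local model, `countByKey` behaves like counting keys with `collections.Counter`. The sample entries below are invented users, and `count_by_key` is a hypothetical helper, not a Spark function.

```python
from collections import Counter

def count_by_key(pairs):
    """Local model of countByKey: count how many pairs share each key."""
    return dict(Counter(key for key, _ in pairs))

# Hypothetical (shell, passwd line) pairs standing in for `byShell`.
by_shell_local = [
    ("/bin/bash", "alice:x:1000:1000::/home/alice:/bin/bash"),
    ("/bin/sh", "dave:x:1001:1001::/home/dave:/bin/sh"),
    ("/bin/bash", "bob:x:1002:1002::/home/bob:/bin/bash"),
]

counts = count_by_key(by_shell_local)
print(counts)
```

Only the keys matter here; the values are ignored entirely.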

And we can extract keys and values in parallel across the cluster:

In [None]:
shellKeys = byShell.keys()
shellKeys.take(3)

In [None]:
byShell.values().take(3)

## Grouping and Joins

Once you have (k,v) pairs, you can group by the key -- the values are iterables of the values for that key. Earlier, we showed that you can `map` a function over the `ResultIterable` -- it's unlikely you want to collect them as native `list`s because that will pull all the values back to the front-end machine.

In [None]:
byShell.groupByKey().take(3)

We can also group KV pairs by other attributes using the `groupBy()` method. For example, here we're going to group the password data by the user name (the 0th field of the values).

In [None]:
byShell.take(3)

In [None]:
def getLogin(x):
    return x.split(':')[0]

In [None]:
byShell.groupBy (lambda x : getLogin(x[1]) ).take(3)

Once items are grouped, you can iterate over the values associated with a key.

Let's group the `/etc/passwd` entries by the shell.

In [None]:
grpdShell = byShell.groupByKey()
grpdShell.take(3)

Now the keys are different Unix shells (*e.g.* `/bin/bash`) and the values are `ResultIterable`s of the values having that key in the original KV list.

Again, we typically wouldn't want to pull in all the values as lists, because collecting them brings everything to the front-end machine in the cluster. But let's convert the `ResultIterable` to a list just to see the structure:

In [None]:
grpdShell.map(lambda x: (x[0], list(x[1])) ).take(3)

A better way to process this is to `map` a function over each KV pair in the grouped RDD. The key is `x[0]` and the `ResultIterable` is `x[1]`. We can then apply a function to each element of the `ResultIterable` to extract some field, such as the login name.

In [None]:
grpdShell.mapValues(list).take(3)

In [None]:
shellAndLogins = grpdShell.map( 
    lambda x: (x[0], ",".join( [ getLogin(y) for y in x[1] ]) ))

In [None]:
shellAndLogins.take(3)

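Rather than first grouping by key and then combining the values into a string as above, we can use **foldByKey** to do more or less the same thing -- this folds in the mapping phase that was implicit in the list comprehension above. Note that `foldByKey` also uses the same function to merge per-partition results, so the output can pick up an extra comma each time two partial results are merged.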

In [None]:
byShell.foldByKey( "", lambda x,y: x + getLogin(y) + ',' ).collect()
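
To see how the fold function is applied, here is a small local model of `foldByKey` over explicit partitions. It is a sketch under the assumption that the same function is used both within a partition and when merging partition results; `fold_by_key`, `get_login`, and the sample users are made up for illustration.

```python
def fold_by_key(partitions, zero, func):
    """Local model of foldByKey; `partitions` is a list of lists of (key, value)."""
    merged = {}
    for part in partitions:
        # Fold the values of each key within one partition, starting from `zero`.
        local = {}
        for key, value in part:
            local[key] = func(local.get(key, zero), value)
        # Merge this partition's results into the total using the same function.
        for key, acc in local.items():
            merged[key] = func(merged[key], acc) if key in merged else acc
    return merged

def get_login(line):
    return line.split(":")[0]

def fold(acc, line):
    return acc + get_login(line) + ","

# Hypothetical (shell, passwd line) pairs.
pairs = [("/bin/bash", "alice:x:1000:1000::/home/alice:/bin/bash"),
         ("/bin/bash", "bob:x:1001:1001::/home/bob:/bin/bash"),
         ("/bin/bash", "carol:x:1002:1002::/home/carol:/bin/bash")]

one_partition = fold_by_key([pairs], "", fold)
two_partitions = fold_by_key([pairs[:2], pairs[2:]], "", fold)
print(one_partition)   # {'/bin/bash': 'alice,bob,carol,'}
print(two_partitions)  # {'/bin/bash': 'alice,bob,carol,,'}
```

With two partitions the merge step calls the function on two already-folded strings, which is where the extra comma comes from. A fold function whose accumulator and value types match (such as addition on numbers) does not have this problem.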

Internally, this is done using `combineByKey(createCombiner, mergeValue, mergeCombiners)`, which turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C.  Note that V and C can be different -- for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).

This example takes the `byShell` (K,V) RDD and constructs a new V that holds the names of the users of that shell. Entries within the same RDD partition are joined by a comma, and results from different partitions are joined by " AND ".

In [None]:
byShell.combineByKey( getLogin,
                        lambda xs, x: xs + ',' + getLogin(x),
                        lambda xs, ys: xs + ' AND ' + ys ).collect()
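
The three functions play distinct roles, which a local model over explicit partitions makes visible. This is a simplified sketch, not Spark's implementation; `combine_by_key`, `get_login`, and the sample users are assumptions for illustration.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Local model of combineByKey; `partitions` is a list of lists of (key, value)."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                # Later values in the same partition are merged into the combiner.
                local[key] = merge_value(local[key], value)
            else:
                # The first value seen for a key in a partition creates the combiner.
                local[key] = create_combiner(value)
        for key, comb in local.items():
            # Combiners from different partitions are merged with merge_combiners.
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged

def get_login(line):
    return line.split(":")[0]

# Hypothetical (shell, passwd line) pairs split across two partitions.
part1 = [("/bin/bash", "alice:x:1000:1000::/home/alice:/bin/bash"),
         ("/bin/bash", "bob:x:1001:1001::/home/bob:/bin/bash")]
part2 = [("/bin/bash", "carol:x:1002:1002::/home/carol:/bin/bash"),
         ("/bin/sh", "dave:x:1003:1003::/home/dave:/bin/sh")]

logins = combine_by_key([part1, part2],
                        get_login,
                        lambda xs, x: xs + "," + get_login(x),
                        lambda xs, ys: xs + " AND " + ys)
print(logins)  # {'/bin/bash': 'alice,bob AND carol', '/bin/sh': 'dave'}
```

Because `merge_combiners` only ever sees two combined strings, it never needs to call `get_login`, unlike the fold function in the previous example.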

**combineByKey** is used to build "reduceByKey" functions that can, e.g., sum up the items associated with a key. This is effectively doing a **groupByKey** followed by a **reduce** on each list of values.

In [None]:
c = sc.parallelize([ (1,2), (2,3), (1, 99), (3, 44), (2, 1), (4,5), (3, 19) ] )

The `groupByKey` operator gives us `iterable` items.

In [None]:
c.groupByKey().collect()

We can reify (make concrete) the iterable item by converting it to a list:

In [None]:
c.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

`reduceByKey` groups identical keys and then reduces the values of each key with the given function.

In [None]:
c.reduceByKey( operator.add ).collect()
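
The claim that `reduceByKey` behaves like a group followed by a reduce can be checked locally. The sketch below uses the same data as `c` in a plain list; `reduce_by_key` is a hypothetical helper and the code runs without Spark.

```python
import operator
from functools import reduce

def reduce_by_key(pairs, func):
    """Local model of reduceByKey: combine the values of each key with `func`."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

c_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]

# Direct reduction per key.
sums = reduce_by_key(c_local, operator.add)

# The same result via an explicit group step followed by a reduce per group.
groups = {}
for key, value in c_local:
    groups.setdefault(key, []).append(value)
via_groups = {key: reduce(operator.add, values) for key, values in groups.items()}

print(sums)  # {1: 101, 2: 4, 3: 63, 4: 5}
```

The direct version never materializes the per-key lists, which is one reason to prefer `reduceByKey` over `groupByKey` when only the reduced value is needed.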

## CoGroup - the basis for joins

As in Pig, joins are done by performing "co-groups", where multiple data sets are grouped by the same key.

In [None]:
s1 = c
s1.collect()

In [None]:
s2 = sc.parallelize( [ (2, -99), ( 4, 199), (19, 23) ] )
s2.collect()

In [None]:
co = s1.cogroup(s2)
co.collect()

Let's use our "convert it to a list" trick to see what's in each cogroup:

In [None]:
co.map(lambda x: (x[0], list(x[1][0]), list(x[1][1])) ).collect()
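
The shape of a cogroup can also be modelled in plain Python: each key maps to a pair of lists, one per input data set. The helper `cogroup` below is a local stand-in written for this illustration, using the same data as `s1` and `s2`.

```python
def cogroup(left, right):
    """Local model of cogroup: key -> (values from left, values from right)."""
    groups = {}
    for key, value in left:
        groups.setdefault(key, ([], []))[0].append(value)
    for key, value in right:
        groups.setdefault(key, ([], []))[1].append(value)
    return groups

s1_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]
s2_local = [(2, -99), (4, 199), (19, 23)]

co_local = cogroup(s1_local, s2_local)
for key, (left_values, right_values) in co_local.items():
    print(key, left_values, right_values)
```

A key that is missing from one side still appears, with an empty list on that side; key 19, for example, has no values from the left input.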

Cogrouping is the building block for the different kinds of joins:

In [None]:
s1.join(s2).collect()

In [None]:
s1.leftOuterJoin(s2).collect()

In [None]:
s1.rightOuterJoin(s2).collect()
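
To show how the three joins relate to a cogroup, here is a plain-Python sketch that derives each one from the per-key pair of lists. The helper names are made up for this example, and the model only mirrors the output shape of the Spark calls above, not their implementation.

```python
def cogroup(left, right):
    """Local model of cogroup: key -> (values from left, values from right)."""
    groups = {}
    for key, value in left:
        groups.setdefault(key, ([], []))[0].append(value)
    for key, value in right:
        groups.setdefault(key, ([], []))[1].append(value)
    return groups

def inner_join(left, right):
    # Keep only keys with values on both sides; pair every left with every right.
    return [(k, (v, w)) for k, (vs, ws) in cogroup(left, right).items()
            for v in vs for w in ws]

def left_outer_join(left, right):
    # Keep every left value; use None when the right side has no match.
    return [(k, (v, w)) for k, (vs, ws) in cogroup(left, right).items()
            for v in vs for w in (ws or [None])]

def right_outer_join(left, right):
    # Keep every right value; use None when the left side has no match.
    return [(k, (v, w)) for k, (vs, ws) in cogroup(left, right).items()
            for v in (vs or [None]) for w in ws]

s1_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]
s2_local = [(2, -99), (4, 199), (19, 23)]

inner = inner_join(s1_local, s2_local)
left = left_outer_join(s1_local, s2_local)
right = right_outer_join(s1_local, s2_local)
print(inner)  # [(2, (3, -99)), (2, (1, -99)), (4, (5, 199))]
```

The only difference between the three functions is which empty side gets replaced by `[None]`, so the join type is decided entirely after the cogroup step.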

In [None]:
x = sc.parallelize( [ ("NY", 10), ("OH", 20), ("OH", 99), ("CO", 88) ] )
y = sc.parallelize( [ ("NY", 30), ("CO", 40), ("NY", 22 )] )

In [None]:
x.join(y).collect()

In [None]:
x.join(y).flatMap(lambda kv: ((kv[0],x) for x in kv[1])).collect()

In [None]:
x.cogroup(y).collect()

In [None]:
x.cogroup(y).flatMap(lambda kv: ((kv[0],x,y) for x in kv[1][0] for y in kv[1][1])).collect()