# PySpark Tutorial - Joins
<div>
 <h2> CSCI 4283 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

In [None]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

In [None]:
conf=SparkConf().setAppName("pyspark tutorial").setMaster("local[*]")
sc = SparkContext(conf=conf)

## Key Value Data, Grouping and Joins

The key-value, or (k,v), datatype is fundamental to many operations, including grouping and joins. We need to be able to:
* Create keys from non-KV data
* Group or organize data according to keys
* Operate on data according to keys
* Form standard joins

### Creating keys

We can directly create K-V pairs using lists of pairs:

In [None]:
visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1") ])
pageNames = sc.parallelize([ ("index.html", "Home"),
                            ("about.html", "About") ])

In [None]:
visits.join(pageNames).collect()

In [None]:
visits.cogroup(pageNames).collect()

`keyBy` creates a key for each element of an RDD by applying a function to it; the result is an RDD of (key, element) pairs:

In [None]:
r = sc.parallelize( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3] )
r.collect()

In [None]:
rsq = r.keyBy(lambda x : x*x)
rsq.collect()
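
As a rough mental model (plain Python, no Spark required), `keyBy(f)` pairs each element with the key `f(element)`. The helper name `key_by` below is hypothetical; it only illustrates the semantics on a local list and is not how Spark implements the operation.

```python
def key_by(items, f):
    """Local stand-in for RDD.keyBy: pair each item with the key f(item)."""
    return [(f(x), x) for x in items]

r_local = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3]
rsq_local = key_by(r_local, lambda x: x * x)
print(rsq_local[:4])  # [(0, 0), (1, 1), (4, 2), (9, 3)]
```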

For example, we might want to create keys for the **passwd** data using the login shell (the last of the seven colon-separated fields) as the key:

In [None]:
passwd = sc.textFile("/etc/passwd")

In [None]:
passwd.take(1)[0].split(':')

In [None]:
byShell = passwd.keyBy(lambda x : x.split(':')[6] )
byShell.take(3)

This is more or less equivalent to mapping each line to a (key, value) pair ourselves:

In [None]:
passwd.map( lambda x : (x.split(':')[6], x) ).take(3)

In [None]:
byShell.take(3)

`countByKey` returns a dictionary containing the number of logins that use each shell:

In [None]:
byShell.countByKey()
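
The counting step can be pictured locally with `collections.Counter`. This is only a sketch of the result's shape: the three passwd-style lines are made up for illustration, and the names `sample_lines`, `by_shell_local` and `shell_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical passwd-style lines, made up for illustration only.
sample_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
    "alice:x:1000:1000:Alice:/home/alice:/bin/bash",
]

# Same key function as the keyBy call above: field 6 is the login shell.
by_shell_local = [(line.split(":")[6], line) for line in sample_lines]

# Count how many (key, value) pairs share each key.
shell_counts = Counter(key for key, _ in by_shell_local)
print(shell_counts["/bin/bash"])  # 2
```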

We can also extract the keys and the values separately (each as an RDD, computed in parallel):

In [None]:
shellKeys = byShell.keys()
shellKeys.take(3)

In [None]:
byShell.values().take(3)

## Grouping and Joins

Once you have (k,v) pairs, you can group by key; each key is then paired with an iterable of the values for that key.

In [None]:
byShell.groupByKey().take(3)

Alternatively, group them by other attributes using the `groupBy()` method. For example, here we group by the user name (field 0 of each value):

In [None]:
byShell.take(3)

In [None]:
def getLogin(x):
    return x.split(':')[0]

In [None]:
byShell.groupBy(lambda x: getLogin(x[1])).take(3)

Once items are grouped, you can iterate over the values associated with a key.

In [None]:
grpdShell = byShell.groupByKey()

In [None]:
shellAndLogins = grpdShell.map( 
    lambda x: (x[0], ",".join( [ getLogin(y) for y in x[1] ]) ))

In [None]:
shellAndLogins.take(3)

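Rather than first grouping by key and then combining the values into a string as above, we can use **foldByKey** to do more or less the same thing in one step; it folds in the mapping phase that the list comprehension above performs. Each login is followed by a comma, so every result ends with a trailing comma. Note that `foldByKey` also uses the same function to merge the partial results of different partitions; that works here only because `getLogin` returns a string without a ':' unchanged.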

In [None]:
byShell.foldByKey( "", lambda x,y: x + getLogin(y) + ',' ).collect()
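
A local sketch of the fold within a single partition may help: each key starts from the zero value `""`, and the function is applied to the running result and each value in turn. The helper `fold_by_key`, the snake-case `get_login`, and the sample lines are hypothetical stand-ins, not Spark code.

```python
def get_login(line):
    """Field 0 of a passwd-style line is the login name."""
    return line.split(":")[0]

def fold_by_key(pairs, zero, func):
    """Local model of a per-key fold over (key, value) pairs in one partition."""
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc.get(k, zero), v)
    return acc

# Hypothetical (shell, passwd-line) pairs, made up for illustration only.
pairs = [
    ("/bin/bash", "root:x:0:0:root:/root:/bin/bash"),
    ("/usr/sbin/nologin", "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin"),
    ("/bin/bash", "alice:x:1000:1000:Alice:/home/alice:/bin/bash"),
]

folded = fold_by_key(pairs, "", lambda acc, line: acc + get_login(line) + ",")
print(folded["/bin/bash"])  # root,alice,
```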

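Internally, `foldByKey` is implemented with `combineByKey(createCombiner, mergeValue, mergeCombiners)`, which turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. `createCombiner` turns the first value seen for a key into a C, `mergeValue` adds another V to an existing C, and `mergeCombiners` merges two C's that were built in different partitions. Note that V and C can be different; for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).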

This example takes the byShell (K,V) pairs and constructs a new value for each key: the login names of the users of that shell. Entries within the same RDD partition are joined by a comma, and the results from different partitions are joined by " AND ".

In [None]:
byShell.combineByKey( getLogin,
                        lambda xs, x: xs + ',' + getLogin(x),
                        lambda xs, ys: xs + ' AND ' + ys ).collect()
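
To make the roles of the three functions concrete, here is a plain-Python model that treats an RDD as a list of partitions. The function `combine_by_key` and the two sample partitions are hypothetical; this is a sketch of the semantics under that assumption, not Spark's implementation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Local model of combineByKey over a list of partitions of (k, v) pairs."""
    per_partition = []
    for part in partitions:
        combined = {}
        for k, v in part:
            if k in combined:
                # Later value for k in this partition: fold it into the combiner.
                combined[k] = merge_value(combined[k], v)
            else:
                # First value for k in this partition: start a new combiner.
                combined[k] = create_combiner(v)
        per_partition.append(combined)

    # Merge the per-partition combiners key by key.
    result = {}
    for combined in per_partition:
        for k, c in combined.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

# Two hypothetical partitions of (shell, login) pairs.
parts = [
    [("/bin/bash", "root"), ("/bin/sh", "daemon"), ("/bin/bash", "alice")],
    [("/bin/bash", "bob")],
]
logins = combine_by_key(
    parts,
    lambda v: v,
    lambda xs, v: xs + "," + v,
    lambda xs, ys: xs + " AND " + ys,
)
print(logins["/bin/bash"])  # root,alice AND bob
```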

**combineByKey** is also the building block for "reduceByKey"-style functions that can, e.g., sum up the items associated with a key. The result is effectively that of a **groupByKey** followed by a **reduce** on each list of values.

In [None]:
c = sc.parallelize([ (1,2), (2,3), (1, 99), (3, 44), (2, 1), (4,5), (3, 19) ] )

The `groupByKey` operator gives us an iterable of values for each key.

In [None]:
c.groupByKey().collect()

We can reify (make concrete) the iterable by converting it to a list:

In [None]:
c.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

`reduceByKey` merges the values of each key by applying a function (here, addition) to them pairwise.

In [None]:
c.reduceByKey( operator.add ).collect()
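
The "group, then reduce" reading of `reduceByKey` can be checked locally on the same pairs. This is a plain-Python sketch with hypothetical names (`c_local`, `groups`, `sums`); it reproduces the result, not the way Spark computes it.

```python
from collections import defaultdict
from functools import reduce
import operator

c_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]

# Step 1: group values by key (what groupByKey exposes as iterables).
groups = defaultdict(list)
for k, v in c_local:
    groups[k].append(v)

# Step 2: reduce each group's values with the function given to reduceByKey.
sums = {k: reduce(operator.add, vs) for k, vs in groups.items()}
print(sums)  # {1: 101, 2: 4, 3: 63, 4: 5}
```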

## CoGroup - the basis for joins

As in Pig, joins are done by performing "co-groups", where multiple data sets are grouped by the same key.

In [None]:
s1 = c
s1.collect()

In [None]:
s2 = sc.parallelize( [ (2, -99), ( 4, 199), (19, 23) ] )
s2.collect()

In [None]:
co = s1.cogroup(s2)
co.collect()

Let's use our "convert it to a list" trick to see what's in each cogroup:

In [None]:
co.map(lambda x: (x[0], list(x[1][0]), list(x[1][1]) ) ).collect()
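
The shape of a cogroup result can be modelled in plain Python: every key from either side maps to a pair of lists, one per input. The helper name `cogroup` and the `_local` variables are hypothetical and only mirror the data used above.

```python
def cogroup(left, right):
    """Local model of cogroup: key -> (values from left, values from right)."""
    out = {}
    for k, v in left:
        out.setdefault(k, ([], []))[0].append(v)
    for k, v in right:
        out.setdefault(k, ([], []))[1].append(v)
    return out

s1_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]
s2_local = [(2, -99), (4, 199), (19, 23)]

co_local = cogroup(s1_local, s2_local)
print(co_local[2])   # ([3, 1], [-99])
print(co_local[19])  # ([], [23])
```

Keys that appear on only one side keep an empty list for the other side, which is exactly the information the outer joins need.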

Cogroups are used to build the different kinds of joins:

In [None]:
s1.join(s2).collect()

In [None]:
s1.leftOuterJoin(s2).collect()

In [None]:
s1.rightOuterJoin(s2).collect()
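
One way to see how the three joins relate to cogroup is the following plain-Python sketch. The helpers `cogroup` and `join` and the `how` parameter are hypothetical; the sketch assumes an unmatched side is padded with `None`, as the outer joins above do.

```python
def cogroup(left, right):
    """Local model of cogroup: key -> (values from left, values from right)."""
    out = {}
    for k, v in left:
        out.setdefault(k, ([], []))[0].append(v)
    for k, v in right:
        out.setdefault(k, ([], []))[1].append(v)
    return out

def join(left, right, how="inner"):
    """Build inner / left outer / right outer joins from the cogroup result."""
    rows = []
    for k, (ls, rs) in cogroup(left, right).items():
        if how == "left" and not rs:
            rs = [None]   # keep unmatched left rows, pad the right side
        if how == "right" and not ls:
            ls = [None]   # keep unmatched right rows, pad the left side
        # Cross product of the two value lists; empty if either side is empty.
        rows.extend((k, (lv, rv)) for lv in ls for rv in rs)
    return rows

s1_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]
s2_local = [(2, -99), (4, 199), (19, 23)]

print(sorted(join(s1_local, s2_local)))
# [(2, (1, -99)), (2, (3, -99)), (4, (5, 199))]
```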

In [None]:
x = sc.parallelize( [ ("NY", 10), ("OH", 20), ("OH", 99), ("CO", 88) ] )
y = sc.parallelize( [ ("NY", 30), ("CO", 40), ("NY", 22 )] )

In [None]:
x.join(y).collect()

In [None]:
x.join(y).flatMap(lambda kv: ((kv[0], v) for v in kv[1])).collect()

In [None]:
x.cogroup(y).collect()

In [None]:
x.cogroup(y).flatMap(lambda kv: ((kv[0], a, b) for a in kv[1][0] for b in kv[1][1])).collect()