<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

*The cell below generates the table of contents. Run it, but don't change it.*

In [34]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

# Chapter 4: Working with Key/Value Pairs

## Pair RDDs
* Spark provides special operations on RDDs containing key/value pairs - these RDDs are called pair RDDs
* We can use the <code>map()</code> function to create a pair RDD as shown below
* Pair RDDs are still RDDs (of Tuple2 objects in Java/Scala or of Python tuples) 

### Example: Creating a pair RDD using the first word as the key

In [1]:
from __future__ import print_function
lines = sc.textFile('name_example.txt')
[print('line',i,a) for i,a in enumerate(lines.collect())]
print('\nresult from map:')
[print(a) for a in lines.map(lambda x: (x.split(' ')[0],x)).collect()]

rddpair = lines.map(lambda x: (x.split(' ')[0],x))
rddpair.reduceByKey(lambda x,y: x+';'+y).collect()
print('\nlookup result: ', rddpair.lookup('her'))


line 0 his name is pat
line 1 his name is peter
line 2 his name is olaf
line 3 her name is Joanne
line 4 her name is Therese
line 5 they work at algebraix in Encintas 
line 6 they like to eat yogurt 

result from map:
(u'his', u'his name is pat')
(u'his', u'his name is peter')
(u'his', u'his name is olaf')
(u'her', u'her name is Joanne')
(u'her', u'her name is Therese')
(u'they', u'they work at algebraix in Encintas ')
(u'they', u'they like to eat yogurt ')

lookup result:  [u'her name is Joanne', u'her name is Therese']


## Transformations on Pair RDDs

### Transformation on one pair RDD
* Pair RDDs are allowed to use all the transformations available to standard RDDs
* The same rules apply from "Passing Functions to Spark" from Chapter 3
    - Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements
* <code>reduceByKey(func)</code>: Combine values with the same key 
    - Note that calling <code>reduceByKey()</code> and <code>foldByKey()</code> will automatically perform combining locally on each machine before computing global totals for each key, so the user does not need to specify a combiner
    - The more general <code>combineByKey</code> interface allows you to customize combining behavior
    
* <code>groupByKey()</code>: Group values with the same key
* <code>combineByKey(createCombiner,mergeValue,mergeCombiners,partitioner)</code>: Combine values with the same key using a different result type
    - <code>combineByKey()</code> is the most general of the per-key aggregation functions
    - Most of the other per-key combiners are implemented using it
    - Like <code>aggregate</code>, <code>combineByKey()</code> allows the user to return values that are not the same type as the input data
    - Understanding <code>combineByKey()</code> in depth:<br>
    As <code>combineByKey()</code> goes through the elements in a partition, each element either has a key it hasn't seen before or has the same key as a previous element:
        - If it's a new key, <code>combineByKey()</code> uses a function we provide, called <code>createCombiner()</code>, to create the initial value for the accumulator on that key (**Note: This happens the first time a key is found in each partition rather than only the first time the key is found in the RDD**)
        - If it is a key we have seen before while processing that partition, it will instead use the provided function, <code>mergeValue()</code>, with the current value for the accumulator for that key and the new value
        - <code>mergeCombiners()</code>: merges the accumulators of the same key across all partitions
* <code>mapValues(func)</code>: Apply a function to each value of a pair RDD without changing the key
* <code>flatMapValues(func)</code>: Apply a function that returns an iterator to each value of a pair RDD, and for each element returned produce a key/value entry with the old key. Often used for tokenization
* <code>keys()</code>: Return an RDD of just the keys
* <code>values()</code>: Return an RDD of just the values
* <code>sortByKey()</code>: Return an RDD sorted by the key
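
The per-partition behavior of <code>combineByKey()</code> described above can be sketched in plain Python, without Spark. This is only a model of the semantics, not Spark's implementation: the helper name `simulate_combine_by_key` and the two hand-made partitions are assumptions made for illustration.

```python
# Pure-Python model of combineByKey(); no Spark required.
# `simulate_combine_by_key` and `partitions` are hypothetical names for illustration.

def simulate_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Step 1: build one accumulator dict per partition.
    per_partition = []
    for part in partitions:
        accumulators = {}
        for key, value in part:
            if key not in accumulators:
                # First time this key is seen in THIS partition.
                accumulators[key] = create_combiner(value)
            else:
                # Key already seen in this partition.
                accumulators[key] = merge_value(accumulators[key], value)
        per_partition.append(accumulators)

    # Step 2: merge the per-partition accumulators for each key.
    merged = {}
    for accumulators in per_partition:
        for key, acc in accumulators.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], acc)
            else:
                merged[key] = acc
    return merged


# Two made-up partitions; 'car' appears in both, so createCombiner runs twice for it.
partitions = [
    [('car', 1), ('car', 2), ('taxi', 1)],
    [('car', 100), ('taxi', 47)],
]

sum_count = simulate_combine_by_key(
    partitions,
    lambda v: (v, 1),                          # createCombiner: value -> (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold a value into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners: add two (sum, count) pairs
)
print(sorted(sum_count.items()))
```

Because `'car'` occurs in both partitions, it gets two separate accumulators that are only combined in the final merge step, which is the point of the bolded note above.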

### Transformation on two pair RDDs
* <code>rdd.subtractByKey(other)</code>: Remove elements with a key present in the other RDD
* <code>rdd.join(other)</code>: Perform an inner join between two RDDs
* <code>rdd.rightOuterJoin(other)</code>: Perform a join between two RDDs where every key of the other RDD is kept; a value missing from the first RDD shows up as <code>None</code> in Python
* <code>rdd.leftOuterJoin(other)</code>: Perform a join between two RDDs where every key of the first RDD is kept; a value missing from the other RDD shows up as <code>None</code> in Python
* <code>rdd.cogroup(other)</code>: Group data from both RDDs sharing the same key
    - <code>cogroup()</code> over two RDDs sharing the same key type, K, with the respective value types V and W returns <code>RDD[(K, (Iterable[V], Iterable[W]))]</code>
    - If one of the RDDs doesn't have elements for a given key that is present in the other RDD, the corresponding <code>Iterable</code> is simply empty
    - can be used to implement intersection by key
    - can work on three or more RDDs at once
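
To make the relationship between <code>cogroup()</code>, intersection by key, and the joins concrete, here is a small pure-Python model that works on plain lists of pairs. The function name `cogroup` below is a stand-in written for this sketch, not the Spark method, and it uses the same sample data as the two-RDD example in these notes.

```python
from collections import defaultdict

def cogroup(left, right):
    """Model of rdd.cogroup(other) on plain lists of (key, value) pairs."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]
groups = cogroup(rdd, other)

# Intersection by key: keep keys whose groups are non-empty on both sides.
common_keys = sorted(k for k, (vs, ws) in groups.items() if vs and ws)

# The joins fall out of the same grouping.
inner = [(k, (v, w)) for k, (vs, ws) in sorted(groups.items())
         for v in vs for w in ws]
# Left outer: every key of the first dataset survives, padded with None.
left_outer = [(k, (v, w)) for k, (vs, ws) in sorted(groups.items())
              for v in vs for w in (ws or [None])]
# Right outer: every key of the other dataset survives, padded with None.
right_outer = [(k, (v, w)) for k, (vs, ws) in sorted(groups.items())
               for v in (vs or [None]) for w in ws]

print(groups)
print(common_keys)
print(left_outer)
```

Key `1` exists only in the first dataset, so it appears in `left_outer` (with `None`) but not in `inner` or `right_outer`.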


### Example on one pair RDD

In [2]:
# examples - one pair RDD 
rdd = sc.parallelize({(1,2),(3,4),(3,6),(3,1)})
print('rdd: ',rdd.collect())

# add values with the same key
print('reduceByKey example: ', rdd.reduceByKey(lambda x,y: x+y).collect())

# groupByKey() returns an iterable in the values, the mapValues function below converts the iterable to a list
print('groupByKey example: ', rdd.groupByKey().mapValues(lambda x: list(x)).collect())

# mapValues example
print('mapValues example: ',rdd.mapValues(lambda x: x+1).collect())

# Apply a function that returns an iterator to each value of a pair RDD
print('flatMapValues example: ', rdd.flatMapValues(lambda x: xrange(x,6)).collect())

# keys example
print('keys example: ',rdd.keys().collect())

# values example
print('values example: ', rdd.values().collect())

# sortByKey example:
print('sortByKey example: ',rdd.sortByKey().collect())


rdd:  [(1, 2), (3, 1), (3, 4), (3, 6)]
reduceByKey example:  [(1, 2), (3, 11)]
groupByKey example:  [(1, [2]), (3, [1, 4, 6])]
mapValues example:  [(1, 3), (3, 2), (3, 5), (3, 7)]
flatMapValues example:  [(1, 2), (1, 3), (1, 4), (1, 5), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 4), (3, 5)]
keys example:  [1, 3, 3, 3]
values example:  [2, 1, 4, 6]
sortByKey example:  [(1, 2), (3, 1), (3, 4), (3, 6)]


### Example on two pair RDDs


In [19]:
from __future__ import print_function
rdd=sc.parallelize([(1,2),(3,4),(3,6)])
other=sc.parallelize([(3,9)])

# [print(a,list(b)) for a,b in rdd.cogroup(other).collect()]
# for a in rdd.cogroup(other).collect():
#     print(a,)


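def print_results(obj):
    # cogroup() yields (key, (iterable, iterable)); convert the iterables to lists so they print readably
    for a, b in obj:
        print(a, (list(b[0]), list(b[1])))
    print()  # with print_function imported, a bare `print` does nothing, so call it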
    

rdd.cogroup(other).collect()[0][1][1]
# print (rdd.cogroup(other).collect())
print('set 1: ', rdd.collect())
print('set 2:', other.collect())

#subtractByKey example
print('subtractByKey results: ', rdd.subtractByKey(other).collect())

# join example
print('join results: ',rdd.join(other).collect())

# rightOuterJoin example
print('rightOuterJoin results: ',rdd.rightOuterJoin(other).collect())

# leftOuterJoin example
print('leftOuterJoin results: ',rdd.leftOuterJoin(other).collect())

# cogroup example
print('cogroup results: ')
print_results(rdd.cogroup(other).collect())

set 1:  [(1, 2), (3, 4), (3, 6)]
set 2: [(3, 9)]
subtractByKey results:  [(1, 2)]
join results:  [(3, (4, 9)), (3, (6, 9))]
rightOuterJoin results:  [(3, (4, 9)), (3, (6, 9))]
leftOuterJoin results:  [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
cogroup results: 
1 ([2], [])
3 ([4, 6], [9])


## Actions Available on Pair RDDs
* <code>countByKey()</code>: Count the number of elements for each key
* <code>collectAsMap()</code>: Collect the results as a map to provide easy lookup
* <code>lookup(key)</code>: Return all values associated with the provided key

### Examples of Actions

In [31]:
rdd = sc.parallelize([(1,2),(3,4),(3,6)])
print('example rdd: ',rdd.collect())

# countByKey example
print('countByKey result: ',dict(rdd.countByKey()))

# collectAsMap example:
print('collectAsMap result: ',rdd.collectAsMap())

# lookup(key) example:
print('lookup(key) result: ', rdd.lookup(3))

example rdd:  [(1, 2), (3, 4), (3, 6)]
countByKey result:  {1: 1, 3: 2}
collectAsMap result:  {1: 2, 3: 6}
lookup(key) result:  [4, 6]


### combineByKey Example
<code>combineByKey(createCombiner,mergeValue,mergeCombiners,partitioner)</code>: Combine values with the same key using a different result type


In [84]:
lines = sc.textFile('rddpair_vehicles.txt')
print('rddpair_vehicles.txt file content: ')
[print ('line',i,':',a) for i,a in enumerate(lines.collect())]

lines = lines.map(lambda x: (x.split('\t')[0],int(x.split('\t')[1])))
print('\nrdd pair: ')
[print(a) for a in lines.collect()]
sumCount = lines.combineByKey((lambda x: (x,1)),                        # createCombiner: value -> (sum, count)
                              (lambda x,y: (x[0] + y, x[1] + 1)),       # mergeValue: add a value with the same key to (sum, count)
                              (lambda x,y: (x[0] + y[0], x[1] +y[1])))  # mergeCombiners: add the (sum, count) pairs across partitions

sumCount.collectAsMap()

rddpair_vehicles.txt file content: 
line 0 : car	1
line 1 : car	2
line 2 : car	100
line 3 : taxi	1
line 4 : taxi	47
line 5 : bus	250
line 6 : bus	26
line 7 : uber	10
line 8 : uber	17

rdd pair: 
(u'car', 1)
(u'car', 2)
(u'car', 100)
(u'taxi', 1)
(u'taxi', 47)
(u'bus', 250)
(u'bus', 26)
(u'uber', 10)
(u'uber', 17)


{u'bus': (276, 2), u'car': (103, 3), u'taxi': (48, 2), u'uber': (27, 2)}
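
The `(sum, count)` pairs above are one step away from a per-key average. The sketch below does that last step on a plain dict copied from the output shown above, so it runs without Spark; on the RDD itself the equivalent step would be a <code>mapValues()</code> call, as noted in the comment.

```python
# Copy of the collectAsMap() result shown above: key -> (sum, count).
sum_count = {'bus': (276, 2), 'car': (103, 3), 'taxi': (48, 2), 'uber': (27, 2)}

# On the RDD this would be: sumCount.mapValues(lambda s: float(s[0]) / s[1])
averages = {key: float(total) / count for key, (total, count) in sum_count.items()}

for key in sorted(averages):
    print(key + ': ' + str(round(averages[key], 2)))
```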

### reduceByKey Example

In [110]:
lines2 = sc.textFile('rddpair_vehicles2.txt')
[print('line',i,a) for i,a in enumerate(lines2.collect())]
lines2 = lines2.map(lambda x: (x.split('\t')[0],int(x.split('\t')[1])))
lines2.reduceByKey(lambda x,y:(x,y)).collectAsMap()

line 0 car	1
line 1 car	2
line 2 car	3
line 3 car	4
line 4 taxi	1
line 5 taxi	47
line 6 bus	250
line 7 uber	10
line 8 uber	17


{u'bus': 250, u'car': (((1, 2), 3), 4), u'taxi': (1, 47), u'uber': (10, 17)}
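
The nested tuples in this result come from the reduce function itself: `lambda x,y: (x,y)` returns a tuple, which then becomes the left-hand input of the next call. The same effect can be reproduced with Python's built-in `functools.reduce`, used here only as a single-machine stand-in for what <code>reduceByKey()</code> does to the values of one key.

```python
from functools import reduce

values = [1, 2, 3, 4]   # the values for key 'car' above

# Pairing instead of combining: each result is fed back in as x.
nested = reduce(lambda x, y: (x, y), values)

# A function that returns the same type it receives gives a flat result.
total = reduce(lambda x, y: x + y, values)

# A key with a single value never triggers the function at all.
single = reduce(lambda x, y: (x, y), [250])

print(nested)
print(total)
print(single)
```

In Spark the values of a key may be combined in a different order across partitions, so the exact nesting shape is not guaranteed; this is one reason to pass <code>reduceByKey()</code> a function whose result does not depend on the order or grouping of its inputs.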

### Per-key average with <code>reduceByKey()</code> and <code>mapValues()</code> Example
Note - in chapter 3 it was pointed out that <code>map()</code>'s return type does not have to be the same as the input type. 

In [127]:
rdd=sc.parallelize({('panda',0),('pink',3),('pirate',3),('panda',1),('pink',4)})
print('mapValues -> ',rdd.mapValues(lambda x:(x,1)).collect())
rdd = rdd.mapValues(lambda x: (x,1))
print('reduceByKey -> ',rdd.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])).collect() )
rdd = rdd.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1]))
print('per key average: ',rdd.mapValues(lambda x: 1.*x[0]/x[1]).collect())


mapValues ->  [('panda', (1, 1)), ('pink', (3, 1)), ('pirate', (3, 1)), ('panda', (0, 1)), ('pink', (4, 1))]
reduceByKey ->  [('pink', (7, 2)), ('panda', (1, 2)), ('pirate', (3, 1))]
per key average:  [('pink', 3.5), ('panda', 0.5), ('pirate', 3.0)]


### Word Count Example

The word count script below is quite simple. It takes the following steps:

1. Split each line from the file into words (<code>flatMapstep</code>)
2. Map each word to a tuple containing the word and an initial count of 1 (<code>mapStep</code>)
3. Sum up the count for each word (<code>reduceStep</code>)



In [169]:
text_file = sc.textFile('README_spark.md')
word_counts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
    
word_counts.collect()


flatMapstep = text_file.flatMap(lambda line: line.split())
print('snippet of flatMap step: ',flatMapstep.collect()[:10])
mapStep = flatMapstep.map(lambda word: (word,1))
print('snippet of map step: ', mapStep.collect()[:10])
reduceStep = mapStep.reduceByKey(lambda x,y:x+y) 
print('reduce step: ')
# [print(a) for a in reduceStep.collect()[:10]]
[print(a) for a in reduceStep.collect()]
print('')

snippet of flatMap step:  [u'#', u'Apache', u'Spark', u'Spark', u'is', u'a', u'fast', u'and', u'general', u'cluster']
snippet of map step:  [(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Spark', 1), (u'is', 1), (u'a', 1), (u'fast', 1), (u'and', 1), (u'general', 1), (u'cluster', 1)]
reduce step: 
(u'when', 1)
(u'R,', 1)
(u'including', 3)
(u'computation', 1)
(u'using:', 1)
(u'guidance', 2)
(u'Scala,', 1)
(u'environment', 1)
(u'only', 1)
(u'rich', 1)
(u'Apache', 1)
(u'sc.parallelize(range(1000)).count()', 1)
(u'Building', 1)
(u'guide,', 1)
(u'return', 2)
(u'Please', 3)
(u'Try', 1)
(u'not', 1)
(u'Spark', 13)
(u'scala>', 1)
(u'Note', 1)
(u'cluster.', 1)
(u'./bin/pyspark', 1)
(u'params', 1)
(u'through', 1)
(u'GraphX', 1)
(u'[run', 1)
(u'abbreviated', 1)
(u'[project', 2)
(u'##', 8)
(u'library', 1)
(u'see', 1)
(u'"local"', 1)
(u'[Apache', 1)
(u'will', 1)
(u'#', 1)
(u'processing,', 1)
(u'for', 11)
(u'[building', 1)
(u'provides', 1)
(u'print', 1)
(u'supports', 2)
(u'built,', 1)
(u'[params]`.', 1)

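Note: We can implement word count more quickly by calling the <code>countByValue()</code> function on the first RDD (the output of the <code>flatMap</code> step), which returns the counts directly as a dictionary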

In [170]:
# [print(a) for a in flatMapstep.countByValue()]
result = flatMapstep.countByValue()
[print((a,result[a])) for a in result]
print('')

(u'help', 1)
(u'storage', 1)
(u'Hadoop', 3)
(u'not', 1)
(u'including', 3)
(u'computation', 1)
(u'high-level', 1)
(u'find', 1)
(u'web', 1)
(u'Shell', 2)
(u'how', 2)
(u'using:', 1)
(u'graph', 1)
(u'guidance', 2)
(u'run:', 1)
(u'Scala,', 1)
(u'should', 2)
(u'environment', 1)
(u'to', 14)
(u'only', 1)
(u'other', 1)
(u'scala>', 1)
(u'rich', 1)
(u'directory.', 1)
(u'Apache', 1)
(u'Once', 1)
(u'sc.parallelize(range(1000)).count()', 1)
(u'Building', 1)
(u'do', 2)
(u'guide,', 1)
(u'return', 2)
(u'Programs', 1)
(u'Many', 1)
(u'Try', 1)
(u'built,', 1)
(u'YARN,', 1)
(u'R,', 1)
(u'using', 2)
(u'Example', 1)
(u'For', 2)
(u'Spark', 13)
(u'Spark"](http://spark.apache.org/docs/latest/building-spark.html).', 1)
(u'Because', 1)
(u'cluster.', 1)
(u'name', 1)
(u'Testing', 1)
(u'refer', 2)
(u'Streaming', 1)
(u'./bin/pyspark', 1)
(u'have', 1)
(u'SQL', 2)
(u'through', 1)
(u'GraphX', 1)
(u'them,', 1)
(u'[run', 1)
(u'analysis.', 1)
(u'abbreviated', 1)
(u'set', 2)
(u'[project', 2)
(u'Scala', 2)
(u'##', 8)
(u'thre

## Data Partitioning (Advanced)
* In a distributed program, communication is very expensive, so laying out data to minimize network traffic can greatly improve performance
* Partitioning is only useful when a dataset is reused **multiple** times in key-oriented operations such as joins
* Spark's partitioning is available on all RDDs of key/value pairs and causes the system to group elements based on a function of each key
* Ensures that a set of keys will appear together on some node
* Example: 
    - You might choose to hash-partition an RDD to 100 partitions so that keys that have the same hash value modulo 100 appear on the same node
    - You might range-partition the RDD into sorted ranges of keys so that the elements with keys in the same range appear on the same node
* Many other Spark operations automatically result in an RDD with known partitioning information, and many operations other than <code>join()</code> will take advantage of this information
    - For example, <code>sortByKey()</code> and <code>groupByKey()</code> will result in range-partitioned and hash-partitioned RDDs, respectively
    - On the other hand, operations like <code>map()</code> cause the new RDD to forget the parent's partitioning information, because such operations could theoretically modify the key of each record 
    
* **<font color="red">Partitioning in Java and Python</font>**
    - Spark's Java and Python APIs benefit from partitioning in the same way as the Scala API. However, in Python you pass the number of partitions desired (e.g. <code>rdd.partitionBy(100)</code>)
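
The two partitioning schemes in the example above can be sketched as functions from a key to a partition index. This is a plain-Python illustration of the idea only; the function names and the range boundaries are assumptions, and Spark's own partitioners choose hashes and range boundaries differently.

```python
from bisect import bisect_left

def hash_partition(key, num_partitions):
    # Keys with the same hash modulo num_partitions land in the same partition.
    return hash(key) % num_partitions

def range_partition(key, upper_bounds):
    # upper_bounds is a sorted list of inclusive upper bounds, one per partition
    # except the last; keys above the final bound go to the last partition.
    return bisect_left(upper_bounds, key)

keys = [3, 103, 203, 42]

# Integer keys hash to themselves in CPython, so 3, 103 and 203 collide modulo 100.
hash_ids = [hash_partition(k, 100) for k in keys]

# Hypothetical boundaries: partition 0 holds keys <= 50, partition 1 holds 51..150, partition 2 the rest.
range_ids = [range_partition(k, [50, 150]) for k in keys]

print(hash_ids)
print(range_ids)
```

Either way, the partition index depends only on the key, which is what lets Spark know where every record for a given key lives.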

### Scala Simple Application Example
* Consider an application that keeps a large table of user information in memory - say, an RDD of <code>(UserID, UserInfo)</code> pairs, where <code>UserInfo</code> contains a list of topics the user is subscribed to
* The application periodically combines this table with a smaller file representing events that happened in the past five minutes - say, a table of <code>(UserID, LinkInfo)</code> pairs for users who have clicked a link on a website in those five minutes
* We may want to count how many users visited a link that was **not** to one of their subscribed topics

* Initialization code; we load the user info from a Hadoop SequenceFile on HDFS
* This distributes the elements of <code>userData</code> by the HDFS block where they are found and **doesn't provide Spark with any way of knowing which partition a particular <code>UserID</code> is located**

        val sc = new SparkContext(...)
        val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()
        
* Function called periodically to process a logfile of events in the past 5 minutes
* We assume that this is a SequenceFile containing <code>(userID, LinkInfo)</code> pairs

        def processNewLogs(logFileName: String) {
            val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
            val joined = userData.join(events) //RDD of (UserID, (UserInfo,LinkInfo)) pairs
            val offTopicVisits = joined.filter {
                case (userID, (userInfo, linkInfo)) => // Expand the tuple into its components 
                    !userInfo.topics.contains(linkInfo.topic)
            }.count()
            println("Number of Visits to non-subscribed topics: " + offTopicVisits)
        }
* This is inefficient because the <code>join()</code> operation, which is called each time <code>processNewLogs()</code> is invoked, does not know anything about how the keys are partitioned in the datasets
* By default, this operation will hash all the keys of both datasets, send elements with the same key hash across the network to the same machine, and then join together the elements with the same key on that machine, as shown below

<img src='figures/fig4_4.png' alt="Drawing" style="width: 600px;"/>

* Fixing this is simple: just use the <code>partitionBy()</code> transformation on <code>userData</code> to hash-partition it at the start of the program 
* Do this by passing a <code>spark.HashPartitioner</code> object to <code>partitionBy</code>:

        val sc = new SparkContext(...)
        val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                            .partitionBy(new HashPartitioner(100)) //create 100 Partitions
                            .persist()
                            
* The <code>processNewLogs()</code> method can remain unchanged
* Because we called <code>partitionBy()</code> when building <code>userData</code>, Spark will now know that it is hash-partitioned and calls to <code>join()</code> on it will take advantage of this information 
    - When we call <code>userData.join(events)</code> Spark will shuffle only the <code>events</code> RDD, sending events with each particular <code>UserID</code> to the machine that contains the corresponding hash partition of <code>userData</code> as shown below:
    <img src="figures/fig4_5.png" alt="Drawing" style="width: 600px;"/>
* The result is that a lot less data is communicated over the network, and the program runs significantly faster 
* Note: <code>partitionBy()</code> is a transformation, so it always returns a new RDD - it does not change the original RDD in place. Therefore it is important to persist and save as <code>userData</code> the result of <code>partitionBy()</code>, not the original <code>sequenceFile()</code>
* In general make the number of partitions at least as large as the number of cores in your cluster
* <font color="red">**Warning:**</font> 
    - Failure to persist an RDD after it has been transformed with <code>partitionBy()</code> will cause subsequent uses of the RDD to repeat the partitioning of the data
    - This would negate the advantage of <code>partitionBy()</code> resulting in repeated partitioning and shuffling of data across the network, similar to what occurs without any specified partitioner 
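
A back-of-the-envelope model helps show why pre-partitioning pays off when the join is repeated. The sketch below is a toy count, not a measurement: it assumes every record that takes part in a shuffle crosses the network, that the partitioned <code>userData</code> is persisted, and the record counts and number of rounds are made up.

```python
def records_shuffled(user_count, event_count, rounds, user_data_partitioned):
    """Toy count of records sent over the network across `rounds` joins."""
    if user_data_partitioned:
        one_time = user_count      # the initial partitionBy() shuffles userData once
        per_round = event_count    # each join then shuffles only the events
    else:
        one_time = 0
        per_round = user_count + event_count   # each join reshuffles both datasets
    return one_time + rounds * per_round

# Hypothetical sizes: 1,000,000 users, 1,000 events per batch, 12 batches.
naive = records_shuffled(1000000, 1000, 12, False)
tuned = records_shuffled(1000000, 1000, 12, True)

print(naive)
print(tuned)
```

Under these assumptions the one-time cost of <code>partitionBy()</code> is recovered after the first join, and every later join moves only the small events dataset. If the partitioned RDD were not persisted, the one-time cost would be paid again on every use, which is the situation the warning above describes.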
    

## Determining an RDD's Partitioner 
* In Scala and Java you can determine how an RDD is partitioned using its <code>partitioner</code> property (or <code>partitioner()</code> method in Java)
    - Returns a <code>scala.Option</code> object which is a Scala class for a container that may or may not contain one item 
    - Call <code>isDefined()</code> on the <code>Option</code> to check whether it has a value
    - Call <code>get()</code> to get this value
    - If present, the value will be a <code>spark.Partitioner</code> object 
    - This is essentially a function telling the RDD which partition each key goes into
* Example: Determining the partitioner of an RDD:

        scala> val pairs = sc.parallelize(List((1,1), (2,2), (3,3))) // create an RDD with no partitioning
        pairs: spark.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12
        
        scala> pairs.partitioner
        res0: Option[spark.Partitioner] = None // confirms that no partitioner is set
        
        //create new RDD w/ partitioning
        scala> val partitioned = pairs.partitionBy(new spark.HashPartitioner(2)) 
                                        .persist() // to actually use partitioned in further operations 
        partitioned: spark.RDD[(Int, Int)] = ShuffledRDD[1] at partitionBy at <console>:14
        
        scala> partitioned.partitioner 
        res1: Option[spark.Partitioner] = Some(spark.HashPartitioner@5147788d) // confirms that the hash partitioner is set


        

## Tuning the Level of Parallelism
* Spark will always try to infer a sensible default number of partitions based on the size of your cluster, but in some cases you will want to tune the level of parallelism for better performance
* <code>repartition()</code>: shuffles the data across the network to create a new set of partitions
    - Note: repartitioning your data is a fairly expensive operation 
* <code>coalesce()</code>: an optimized version of <code>repartition()</code> that allows avoiding data movement, but only if you are decreasing the number of RDD partitions
    - To know whether you can safely call <code>coalesce()</code>, check the number of partitions in the RDD using <code>rdd.partitions.size()</code> (Java/Scala) or <code>rdd.getNumPartitions()</code> (Python) to make sure you are coalescing it to fewer partitions than it currently has
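
The reason <code>coalesce()</code> can avoid data movement when shrinking is that whole partitions can be merged where they already sit, without splitting any of them. The sketch below models that on nested Python lists; the helper `coalesce` here is a simplified stand-in, and the even grouping it uses is an assumption (Spark also takes data locality into account).

```python
def coalesce(partitions, n):
    """Merge existing partitions into at most n groups without splitting any of them."""
    if n >= len(partitions):
        # Nothing to merge; growing the partition count would need a shuffle.
        return partitions
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assign neighbouring source partitions to the same target partition.
        merged[i * n // len(partitions)].extend(part)
    return merged

parts = [[1, 2], [3], [4, 5], [6]]

fewer = coalesce(parts, 2)   # 4 partitions merged down to 2
same = coalesce(parts, 8)    # asking for more leaves the 4 partitions as they are

print(fewer)
print(len(same))
```

This is why the partition count is checked first: asking for more partitions than the RDD currently has gains nothing without a shuffle, which is what <code>repartition()</code> is for.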

## Operations That Benefit from Partitioning
* Operations that benefit from partitioning: <code>cogroup()</code>, <code>groupWith()</code>, <code>join()</code>, <code>leftOuterJoin()</code>, <code>rightOuterJoin()</code>, <code>groupByKey()</code>, <code>reduceByKey()</code>, <code>combineByKey()</code>, <code>lookup()</code>
    - All of these involve shuffling data by key across the network, so a known partitioner lets Spark avoid some or all of that shuffle
* For operations that act on a single RDD, such as <code>reduceByKey()</code>, running on a prepartitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master

## Operations That Affect Partitioning
* Spark knows internally how each of its operators affects partitioning and automatically sets the <code>partitioner</code> on RDDs created by operations that partition the data 
* Suppose you call <code>join()</code> to join two RDDs; because the elements with the same key have been hashed to the same machine, Spark knows that the result is hash-partitioned, and operations like <code>reduceByKey()</code> on the join result are going to be significantly faster
* Flipside: for transformations that cannot be guaranteed to produce a known partitioning, the output RDD will not have a partitioner set
    - Example: if you call <code>map()</code> on a hash-partitioned RDD of key/value pairs, the function passed to <code>map()</code> can in theory change the key of each element, so the result will not have a <code>partitioner</code> set
* Operations such as <code>mapValues()</code> and <code>flatMapValues()</code> guarantee that each tuple's key remains the same, so use them instead of <code>map()</code> if you don't intend to change the key
* Operations that result in a partitioner being set on the output RDD:
    - <code>cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort()</code>
    - If the parent has a partitioner: <code>mapValues(), flatMapValues(), filter()</code>
* For binary operations <i>which</i> partitioner is set on the output depends on the parent RDDs' partitioners
* By default, it is a hash partitioner, with the number of partitions set to the level of parallelism of the operation
    - This default applies when neither parent has a <code>partitioner</code> set
* If one of the parents has a <code>partitioner</code> set, it will be that partitioner
* If both parents have a <code>partitioner</code> set, it will be the partitioner of the first parent
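
The rules in this section can be written down as two small decision functions. This is a model of the rules as stated in these notes, not Spark's source code; the function names are hypothetical and plain strings stand in for partitioner objects, with `None` meaning "no partitioner set".

```python
def output_partitioner(first, second, default):
    """Partitioner chosen for a binary operation such as join(), per the rules above."""
    if first is not None:
        return first      # both parents set, or only the first: the first parent wins
    if second is not None:
        return second     # only the second parent has a partitioner
    return default        # neither has one: fall back to a default hash partitioner

def partitioner_after(parent, preserves_keys):
    """Partitioner kept by a one-parent transformation."""
    # mapValues(), flatMapValues() and filter() keep keys intact; map() might not.
    return parent if preserves_keys else None

both = output_partitioner('hash(100)', 'hash(20)', 'hash(8)')
only_second = output_partitioner(None, 'hash(20)', 'hash(8)')
neither = output_partitioner(None, None, 'hash(8)')

after_map_values = partitioner_after('hash(100)', preserves_keys=True)
after_map = partitioner_after('hash(100)', preserves_keys=False)

print(both, only_second, neither)
print(after_map_values, after_map)
```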

In [32]:
# page rank example skipped - come back to this later