In [1]:
from pyspark import SparkContext

In [2]:
sc = SparkContext("local", "pyspark")

## Creating Pair RDDs

In [3]:
lines = sc.textFile("README.md")

Creating a pair RDD using the first word as the key.

In [4]:
pairs = lines.map(lambda x: (x.split(" ")[0], x))

In Python, the functions on keyed data only work if the RDD is composed of tuples, so the function passed to **map()** must return a (key, value) tuple.

## Transformations on Pair RDDs

Pair RDDs are allowed to use all the transformations available to standard RDDs.

Filter out lines longer than 20 characters

In [5]:
result = pairs.filter(lambda keyValue: len(keyValue[1]) <= 20)

Another example on filtering:

In [6]:
pairs = sc.parallelize([("holden", "likes coffee"), ("panda", "likes long strings and coffee")])

In [7]:
pairs.filter(lambda keyValue: len(keyValue[1]) <= 20).collect()

[('holden', 'likes coffee')]

If you only want to work with the value part of a key-value tuple, use **mapValues()**

In [8]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

In [9]:
rdd.mapValues(lambda x: x + 1).collect()

[(1, 3), (3, 5), (3, 7)]

### Aggregations

**reduceByKey()** runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. It is a transformation, not an action: it returns a new RDD consisting of each key and the reduced value for that key.

In [10]:
rdd = sc.parallelize([("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)])

In [11]:
rdd.collect()

[('panda', 0), ('pink', 3), ('pirate', 3), ('panda', 1), ('pink', 4)]

In [12]:
rdd.mapValues(lambda x: (x, 1)).collect()

[('panda', (0, 1)),
 ('pink', (3, 1)),
 ('pirate', (3, 1)),
 ('panda', (1, 1)),
 ('pink', (4, 1))]

In [13]:
rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).collect()

[('pink', (7, 2)), ('panda', (1, 2)), ('pirate', (3, 1))]

**word count**

In [14]:
rdd = sc.textFile("README.md")

In [15]:
words = rdd.flatMap(lambda x: x.split(" "))

Two ways: 1) using **map()** and **reduceByKey()**; 2) using **countByValue()**

In [16]:
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

In [17]:
result.count()

275

In [18]:
mydict = words.countByValue()

len(mydict)

275

**combineByKey()** is the most general of the per-key aggregation functions.

**combineByKey()** lets the result for each key have a different type from the input values.


1. As **combineByKey()** goes through the elements in a partition, each element either has a key it hasn't seen before or has the same key as a previous element.

2. If it's a new key, **combineByKey()** uses a function we provide, called **createCombiner()**, to create the initial value for the accumulator on that key. This happens the first time a key is found **in each partition**, rather than only the first time the key is found in the RDD.

3. If it is a key we have seen before while processing that partition, it will instead use the provided function, **mergeValue()**, with the current value for the accumulator for that key and the new value.

4. Since each partition is processed independently, we can have multiple accumulators for the same key. When we are merging the results from each partition, if two or more partitions have an accumulator for the same key we merge the accumulators using the user-supplied **mergeCombiners()** function.

**combineByKey(createCombiner, mergeValue, mergeCombiners)**

Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C
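
The four steps above can be sketched in plain Python, without Spark, to make the per-partition behaviour concrete. The helper name `combine_by_key` and the split of the panda/pink/pirate data into two partitions are assumptions made for illustration; this is a toy model, not how Spark implements the operation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Toy model of combineByKey() over a list of partitions (lists of (k, v))."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key not in acc:
                # First time this key is seen in this partition
                acc[key] = create_combiner(value)
            else:
                # Key already seen in this partition
                acc[key] = merge_value(acc[key], value)
        per_partition.append(acc)

    # Merge the accumulators that different partitions built for the same key
    result = {}
    for acc in per_partition:
        for key, combiner in acc.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


# Hypothetical split of the data into two partitions
partitions = [[("panda", 0), ("pink", 3), ("pirate", 3)],
              [("panda", 1), ("pink", 4)]]

sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                         # createCombiner
    lambda c, v: (c[0] + v, c[1] + 1),        # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]))  # mergeCombiners

print(sorted(sum_count.items()))
# [('panda', (1, 2)), ('pink', (7, 2)), ('pirate', (3, 1))]
```

With this split, "panda" and "pink" each get one accumulator per partition, so `mergeCombiners` runs for them, while "pirate" appears in only one partition and never needs merging.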

**Example 1**:

In [19]:
nums = sc.parallelize([("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)])

In [20]:
nums.collect()

[('panda', 0), ('pink', 3), ('pirate', 3), ('panda', 1), ('pink', 4)]

In [21]:
sumCount = nums.combineByKey((lambda x: (x, 1)),
                             # createCombiner()
                             (lambda x, y: (x[0] + y, x[1] + 1)),
                             # mergeValue() Map-side aggregation
                             (lambda x, y: (x[0] + y[0], x[1] + y[1])))
                             # mergeCombiners() Aggregation across partitions

In [22]:
sumCount.collect()

[('pink', (7, 2)), ('panda', (1, 2)), ('pirate', (3, 1))]
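
The (sum, count) pairs are usually an intermediate step toward a per-key average. Below is a minimal sketch of that last step, applied in plain Python to the collected output; on the RDD itself the same division could be applied with **mapValues()** before collecting.

```python
# Output of sumCount.collect(), copied from the cell above
collected = [("pink", (7, 2)), ("panda", (1, 2)), ("pirate", (3, 1))]

# Divide each sum by its count; float() forces floating-point division on Python 2 too
averages = {key: total / float(count) for key, (total, count) in collected}

print(sorted(averages.items()))
# [('panda', 0.5), ('pink', 3.5), ('pirate', 3.0)]
```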

**Example 2**:

In [23]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 91)])
def add(a, b): return a + str(b)
sorted(rdd.combineByKey(str,
                        # createCombiner, which turns a V into a C (e.g., creates a one-element list)
                        add,
                        # mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
                        add
                        # mergeCombiners, to combine two C’s into a single one.
                       ).collect())

[('a', '191'), ('b', '1')]

**Example 3**:

Use **combineByKey()** to implement **reduceByKey()**

In [24]:
rdd = sc.parallelize([("a", 2), ("b", 9), ("a", 1)])
rdd.combineByKey(lambda x: x,
                 lambda x, y: x + y,
                 lambda x, y: x + y).collect()

[('a', 3), ('b', 9)]

In [25]:
rdd = sc.parallelize([("a", 2), ("b", 9), ("a", 1)])
rdd.reduceByKey(lambda x, y: x + y).collect()

[('a', 3), ('b', 9)]

#### Tuning the level of parallelism

Every RDD has a fixed number of *partitions* that determine the degree of parallelism to use when executing operations on the RDD.

Spark always tries to infer a sensible default number of partitions based on the size of your cluster, but we can ask Spark to use a specific value.

In [26]:
data = [("a", 3), ("b", 4), ("a", 1)]

In [27]:
sc.parallelize(data).reduceByKey(lambda x, y: x + y)
# Default parallelism

PythonRDD[50] at RDD at PythonRDD.scala:48

In [28]:
sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)
# Custom parallelism

PythonRDD[56] at RDD at PythonRDD.scala:48

**repartition()** shuffles the data across the network to create a new set of partitions (VERY EXPENSIVE).

**coalesce()** decreases the number of RDD partitions by merging existing partitions, which avoids a full shuffle. **repartition()** is equivalent to calling **coalesce()** with shuffling enabled.

Use **rdd.getNumPartitions()** to get the number of partitions
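
The difference between merging partitions and reshuffling records can be pictured with a toy model in plain Python. Both helper functions below are hypothetical and deliberately simplistic; neither reproduces Spark's actual placement strategy. They only show that merging keeps records grouped with their original neighbours, while a shuffle moves every record individually.

```python
def coalesce(partitions, n):
    """Toy model: merge whole partitions into n groups; records stay together."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Toy model: redistribute every record individually (a full shuffle)."""
    shuffled = [[] for _ in range(n)]
    records = [rec for part in partitions for rec in part]
    for i, rec in enumerate(records):
        shuffled[i % n].append(rec)
    return shuffled


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(repartition(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6, 8]]
```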

### Grouping Data

**groupByKey()** on pair RDD

In [29]:
rdd = sc.parallelize([("a", 3), ("b", 4), ("a", 1)])
rdd.groupByKey().collect()

[('a', <pyspark.resultiterable.ResultIterable at 0x7f203c21a690>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7f203c21ad10>)]

**groupBy()** on unpaired RDD

It takes a function that it applies to every element in the source RDD and uses the result to determine the key

In [30]:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
result = rdd.groupBy(lambda x: x % 2).collect()

[(x, list(y)) for (x, y) in result]

[(0, [2, 8]), (1, [1, 1, 3, 5])]

Use **groupByKey()** and **mapValues()** to implement **reduceByKey()**

In [31]:
rdd = sc.parallelize([("a", 3), ("b", 4), ("a", 1)])

In [32]:
from functools import reduce  # built in on Python 2, must be imported on Python 3
rdd.groupByKey().mapValues(lambda x: reduce(lambda a, b: a + b, x)).collect()

[('a', 4), ('b', 4)]

In [33]:
rdd.reduceByKey(lambda x, y: x + y).collect()

[('a', 4), ('b', 4)]

**cogroup()** over two RDDs sharing the same key type, K, with the respective value types V and W gives us back RDD[(K, (Iterable[V], Iterable[W]))]. If one of the RDDs doesn't have elements for a given key that is present in the other RDD, the corresponding Iterable is simply empty.

In [34]:
x = sc.parallelize([("a", 1), ("b", 4), ("a", 2)])
y = sc.parallelize([("a", 2), ("b", 6), ("c", 3)])

In [35]:
result = x.cogroup(y).collect()

In [36]:
result

[('a',
  (<pyspark.resultiterable.ResultIterable at 0x7f203c23b710>,
   <pyspark.resultiterable.ResultIterable at 0x7f203c22c550>)),
 ('c',
  (<pyspark.resultiterable.ResultIterable at 0x7f203c22c150>,
   <pyspark.resultiterable.ResultIterable at 0x7f203c22c610>)),
 ('b',
  (<pyspark.resultiterable.ResultIterable at 0x7f203c22ca90>,
   <pyspark.resultiterable.ResultIterable at 0x7f203c22ca50>))]

In [37]:
[(item[0], tuple([list(i) for i in item[1]])) for item in result]

[('a', ([1, 2], [2])), ('c', ([], [3])), ('b', ([4], [6]))]
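
To make the shape of the **cogroup()** result concrete, here is a plain-Python sketch of the same grouping on ordinary lists. The function name `cogroup` is reused for readability only; it is a toy model, not Spark's implementation. The last lines show how an inner join can be derived from the grouped form by pairing up the two value lists of each key.

```python
from collections import defaultdict


def cogroup(left, right):
    """Toy model of cogroup() on two lists of (key, value) pairs."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


x = [("a", 1), ("b", 4), ("a", 2)]
y = [("a", 2), ("b", 6), ("c", 3)]

result = cogroup(x, y)
print(sorted(result.items()))
# [('a', ([1, 2], [2])), ('b', ([4], [6])), ('c', ([], [3]))]

# An inner join pairs every left value with every right value of the same key
joined = [(k, (v, w)) for k, (vs, ws) in sorted(result.items()) for v in vs for w in ws]
print(joined)
# [('a', (1, 2)), ('a', (2, 2)), ('b', (4, 6))]
```

Key "c" has an empty left-hand list, so it contributes nothing to the inner join.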

### Joins

In [38]:
storeAddress = sc.parallelize([("Ritual", "1026 Valencia St"),
                               ("Philz", "748 Van Ness Ave"),
                               ("Philz", "3101 24th St"),
                               ("Starbucks", "Seattle")])

In [39]:
storeRating = sc.parallelize([("Ritual", 4.9), ("Philz", 4.8)])

In [40]:
storeAddress.join(storeRating).collect()

[('Philz', ('748 Van Ness Ave', 4.8)),
 ('Philz', ('3101 24th St', 4.8)),
 ('Ritual', ('1026 Valencia St', 4.9))]

In [41]:
storeAddress.leftOuterJoin(storeRating).collect()

[('Philz', ('748 Van Ness Ave', 4.8)),
 ('Philz', ('3101 24th St', 4.8)),
 ('Ritual', ('1026 Valencia St', 4.9)),
 ('Starbucks', ('Seattle', None))]

In [42]:
storeAddress.rightOuterJoin(storeRating).collect()

[('Philz', ('748 Van Ness Ave', 4.8)),
 ('Philz', ('3101 24th St', 4.8)),
 ('Ritual', ('1026 Valencia St', 4.9))]

### Sorting Data

In [43]:
rdd = sc.parallelize([("3", 312), ("1", 394), ("4", 903),
                      ("1", 394), ("5", 105), ("9", 967),
                      ("2", 234), ("6", 831)])

In [44]:
rdd.sortByKey(ascending=True,
              numPartitions=None,
              keyfunc=lambda x: str(x)).collect()

[('1', 394),
 ('1', 394),
 ('2', 234),
 ('3', 312),
 ('4', 903),
 ('5', 105),
 ('6', 831),
 ('9', 967)]
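
The keys in this example are strings, so they are compared lexicographically; with single-digit keys that happens to match numeric order. The sketch below uses Python's `sorted()` on hypothetical data containing a two-digit key to show where the two orders diverge. In Spark, passing a `keyfunc` that converts the key to `int` to **sortByKey()** would play the same role as the second sort.

```python
pairs = [("3", 312), ("10", 100), ("9", 967), ("1", 394)]  # hypothetical data

as_strings = sorted(pairs, key=lambda kv: str(kv[0]))
as_numbers = sorted(pairs, key=lambda kv: int(kv[0]))

print([k for k, _ in as_strings])  # ['1', '10', '3', '9']
print([k for k, _ in as_numbers])  # ['1', '3', '9', '10']
```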

## Actions Available on Pair RDDs

## Data Partitioning (Advanced)

1. Use **partitionBy()** (a transformation) to hash-partition the RDD prior to join operations.
2. Use **persist()** to take advantage of the partitioning from the previous step; without it, the partitioning is recomputed each time the RDD is used.


1. **sortByKey()** results in range-partitioned RDDs
2. **groupByKey()** results in hash-partitioned RDDs
3. **map()** causes the new RDD to forget the parent's partitioning info


### Operations That Benefit from Partitioning

Many of Spark's operations involve shuffling data by key across the network. All of these will benefit from partitioning. As of Spark 1.0, the operations that benefit from partitioning are **cogroup()**, **groupWith()**, **join()**, **leftOuterJoin()**, **rightOuterJoin()**, **groupByKey()**, **reduceByKey()**, **combineByKey()**, and **lookup()**.

1. For operations that act on a single RDD, such as **reduceByKey()**, running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master. 

2. For binary operations, such as **cogroup()** and **join()**, pre-partitioning will cause at least one of the RDDs (the one with the known partitioner) to not be shuffled.

3. If both RDDs have the same partitioner, and if they are cached on the same machines (e.g., one was created using **mapValues()** on the other, which preserves keys and partitioning) or if one of them has not yet been computed, then no shuffling across the network will occur.
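
The reason co-partitioned RDDs can be joined without a shuffle can be sketched in plain Python: if both sides use the same partitioner, every key lives at the same partition index on both sides, so each partition can be joined locally. The helpers `partition_by` and `local_join` and the integer store IDs are assumptions for illustration.

```python
def partition_by(pairs, num_partitions):
    """Toy hash partitioner: a key always maps to the same partition index."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def local_join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


# Hypothetical integer store IDs as keys
addresses = [(1, "1026 Valencia St"), (2, "748 Van Ness Ave"),
             (2, "3101 24th St"), (3, "Seattle")]
ratings = [(1, 4.9), (2, 4.8)]

n = 2
left_parts = partition_by(addresses, n)
right_parts = partition_by(ratings, n)

# Join partition i of one side only with partition i of the other side
copartitioned = [row
                 for i in range(n)
                 for row in local_join(left_parts[i], right_parts[i])]

print(sorted(copartitioned) == sorted(local_join(addresses, ratings)))  # True
```

No record had to be compared with a record from a different partition index, which is the work a shuffle would otherwise do.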

### Operations That Affect Partitioning

1. Spark knows internally how each of its operations affects partitioning, and automatically sets the partitioner on RDDs created by operations that partition the data. For example, suppose you called **join()** to join two RDDs; because the elements with the same key have been hashed to the same machine, Spark knows that the result is hash-partitioned, and operations like **reduceByKey()** on the join result are going to be significantly faster.

2. The flipside, however, is that for transformations that cannot be guaranteed to produce a known partitioning, the output RDD will not have a partitioner set. For example, if you call **map()** on a hash-partitioned RDD of key/value pairs, the function passed to **map()** can in theory change the key of each element, so the result will not have a partitioner. Spark does not analyze your functions to check whether they retain the key. Instead, it provides two other operations, **mapValues()** and **flatMapValues()**, which guarantee that each tuple's key remains the same.

3. All that said, here are all the operations that result in a partitioner being set on the output RDD: **cogroup()**, **groupWith()**, **join()**, **leftOuterJoin()**, **rightOuterJoin()**, **groupByKey()**, **reduceByKey()**, **combineByKey()**, **partitionBy()**, **sort()**, **mapValues()** (if the parent RDD has a partitioner), **flatMapValues()** (if parent has a partitioner), and **filter()** (if parent has a partitioner). All other operations will produce a result with no partitioner.

### Example: PageRank

In [45]:
links = sc.textFile("links").map(lambda x: x.split("\t")).map(lambda x: (x[0], x[1:])).partitionBy(2).persist()
links.collect()

[(u'a', [u'b', u'd']),
 (u'c', [u'd']),
 (u'e', [u'a', u'b', u'c']),
 (u'b', [u'c', u'e']),
 (u'd', [u'e'])]

In [46]:
ranks = sc.textFile("ranks").map(lambda x: x.split("\t")).map(lambda x: (x[0], float(x[1])))
ranks.collect()

[(u'a', 0.2), (u'b', 0.2), (u'c', 0.2), (u'e', 0.2), (u'd', 0.2)]

In [47]:
MAX = 5
for c in range(MAX):
    contributions = links.join(ranks).flatMap(lambda x: [(i, x[1][1]/len(x[1][0])) for i in x[1][0]])
    ranks = contributions.reduceByKey(lambda x, y: x + y)
    
ranks.collect()    

[(u'e', 0.33472222222222225),
 (u'b', 0.1518518518518519),
 (u'a', 0.10740740740740744),
 (u'd', 0.22222222222222227),
 (u'c', 0.18379629629629635)]
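
The loop above is a simplified PageRank: each page splits its current rank evenly among its neighbours, and no damping factor is applied. A plain-Python sketch of the same update on the same link structure can serve as a cross-check of the Spark result. It assumes, as is true for this graph, that every page has at least one incoming and one outgoing link.

```python
links = {"a": ["b", "d"], "c": ["d"], "e": ["a", "b", "c"],
         "b": ["c", "e"], "d": ["e"]}
ranks = {page: 0.2 for page in links}

for _ in range(5):
    contributions = {}
    for page, neighbours in links.items():
        # Each page sends an equal share of its rank to every neighbour
        share = ranks[page] / len(neighbours)
        for dest in neighbours:
            # Same role as reduceByKey(lambda x, y: x + y)
            contributions[dest] = contributions.get(dest, 0.0) + share
    ranks = contributions

for page, rank in sorted(ranks.items()):
    print(page, round(rank, 6))

# The total stays close to 1.0 because rank is only moved, never created or lost
print(round(sum(ranks.values()), 6))
```

The values agree with the Spark output above up to floating-point rounding, for example about 0.334722 for page `e`.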

### Custom Partitioners

In [48]:
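try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def hash_domain(url):
    # Hash only the domain so that all pages of one site share a partition
    return hash(urlparse(url).netloc)

rdd = sc.parallelize(["http://www.cnn.com/world",
                      "http://www.cnn.com/us",
                      "https://www.google.com/preferences"])
# partitionBy() needs a pair RDD, so key each record by its URL first
rdd.keyBy(lambda url: url).partitionBy(2, hash_domain)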

MapPartitionsRDD[187] at mapPartitions at PythonRDD.scala:422
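
The partitioner above relies on Python's built-in `hash()`. On Python 3, string hashes are randomized per process unless `PYTHONHASHSEED` is set, so the sketch below swaps in `zlib.crc32` as a deterministic alternative. That choice, and the helper name `domain_partition`, are assumptions for illustration rather than something the original notebook uses. The sketch runs without Spark and only shows which partition index each URL would receive.

```python
import zlib
from urllib.parse import urlparse  # Python 3; on Python 2 the module is urlparse


def domain_partition(url, num_partitions=2):
    """Partition index derived from the URL's domain with a deterministic hash."""
    domain = urlparse(url).netloc
    return zlib.crc32(domain.encode("utf-8")) % num_partitions


urls = ["http://www.cnn.com/world",
        "http://www.cnn.com/us",
        "https://www.google.com/preferences"]

for url in urls:
    print(url, "->", domain_partition(url))

# Both cnn.com pages share a domain, so they always get the same index
print(domain_partition(urls[0]) == domain_partition(urls[1]))  # True
```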

# Synopsis

In [49]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

## Transformations on one pair RDD

1 **reduceByKey(func)**: Combine values with the same key

In [50]:
rdd.reduceByKey(lambda x, y: x + y).collect()

[(1, 2), (3, 10)]

2 **groupByKey()**: Group values with the same key

In [51]:
rdd.groupByKey().collect()

[(1, <pyspark.resultiterable.ResultIterable at 0x7f203c26f7d0>),
 (3, <pyspark.resultiterable.ResultIterable at 0x7f203c1d75d0>)]

3 **combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)**: Combine values with the same key using a different result type

In [52]:
rdd.combineByKey((lambda x: (x, 1)),
                 (lambda x, y: (x[0] + y, x[1] + 1)),
                 (lambda x, y: (x[0] + y[0], x[1] + y[1]))).collect()

[(1, (2, 1)), (3, (10, 2))]

4 **mapValues(func)**: Apply a function to each value of a pair RDD without changing the key

In [53]:
rdd.mapValues(lambda x: x + 1).collect()

[(1, 3), (3, 5), (3, 7)]

5 **flatMapValues(func)**: Apply a function that returns an iterator to each value of a pair RDD, and for each element returned produce a key/value entry with the old key. Often used for tokenization.

In [54]:
rdd.flatMapValues(lambda x: range(x, 6)).collect()

[(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)]

6 **keys()**: Return an RDD of just the keys

In [55]:
rdd.keys().collect()

[1, 3, 3]

7 **values()**: Return an RDD of just the values

In [56]:
rdd.values().collect()

[2, 4, 6]

8 **sortByKey()**: Return an RDD sorted by the key

In [57]:
rdd.sortByKey().collect()

[(1, 2), (3, 4), (3, 6)]

9 **combineByKey(createCombiner, mergeValue, mergeCombiners)**

Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C

In [58]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 91)])
def add(a, b): return a + str(b)
sorted(rdd.combineByKey(str,
                        # createCombiner, which turns a V into a C (e.g., creates a one-element list)
                        add,
                        # mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
                        add
                        # mergeCombiners, to combine two C’s into a single one.
                       ).collect())

[('a', '191'), ('b', '1')]

## Transformations on two pair RDDs

In [59]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

In [60]:
other = sc.parallelize([(3, 9)])

1 **subtractByKey()**: Remove elements with a key present in the other RDD

In [61]:
rdd.subtractByKey(other).collect()

[(1, 2)]

2 **join()**: Perform an inner join between two RDDs

In [62]:
rdd.join(other).collect()

[(3, (4, 9)), (3, (6, 9))]

3 **rightOuterJoin()**: Perform a right-outer join: rdd(left), other(right)

In [63]:
rdd.rightOuterJoin(other).collect()

[(3, (4, 9)), (3, (6, 9))]

4 **leftOuterJoin()**: Perform a left-outer join: rdd(left), other(right)

In [64]:
rdd.leftOuterJoin(other).collect()

[(1, (2, None)), (3, (4, 9)), (3, (6, 9))]

5 **cogroup()**: Group data from both RDDs sharing the same key

In [65]:
rdd.cogroup(other).collect()

[(1,
  (<pyspark.resultiterable.ResultIterable at 0x7f203c1d7bd0>,
   <pyspark.resultiterable.ResultIterable at 0x7f203c1d7890>)),
 (3,
  (<pyspark.resultiterable.ResultIterable at 0x7f203c1e6f90>,
   <pyspark.resultiterable.ResultIterable at 0x7f203c1e6fd0>))]

## Actions on pair RDDs

In [66]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

1 **countByKey()**: Count the number of elements for each key

In [67]:
rdd.countByKey()

defaultdict(int, {1: 1, 3: 2})

2 **collectAsMap()**: Collect the result as a map to provide easy lookup. When a key occurs more than once, only one of its values is kept (here, 6 for key 3)

In [68]:
rdd.collectAsMap()

{1: 2, 3: 6}

3 **lookup()**: Return all values associated with the provided key

In [69]:
rdd.lookup(3)

[4, 6]