In [1]:
from pyspark import SparkContext

In [2]:
sc = SparkContext("local", "pyspark")

## RDD Basics

In [3]:
lines = sc.textFile("README.md")

In [4]:
pythonLines = lines.filter(lambda line: "Python" in line)

In [5]:
pythonLines.first()

u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'

RDDs are computed in a lazy fashion.

Spark's RDDs are by default recomputed each time you run an action on them.

Use **persist()** to keep them around in memory.

In [6]:
pythonLines.persist()

PythonRDD[3] at RDD at PythonRDD.scala:48

In [7]:
pythonLines.count()

3

In [8]:
pythonLines.first()

u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'

To summarize, every Spark program and shell session will work as follows:
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like **filter()**.
3. Ask Spark to **persist()** any intermediate RDDs that will need to be reused.
4. Launch actions such as **count()** and **first()** to kick off a parallel computation, which is then optimized and executed by Spark.

**cache()** is the same as calling **persist()** with the default storage level.
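The workflow above can be modeled in plain Python without a Spark installation. The sketch below is only an illustration: `LazyDataset` and its `evaluations` counter are hypothetical names, not part of PySpark, and the toy `first()` materializes everything, which real Spark avoids. It shows that transformations record work, actions trigger it, and `persist()` stops a second action from recomputing the data.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: remembers how to compute data instead of computing it."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning a list
        self._cached = None       # filled in by the first action after persist()
        self._persist = False
        self.evaluations = 0      # how many times the data was actually computed

    def filter(self, predicate):
        # Transformation: returns a new lazy dataset and runs nothing yet.
        return LazyDataset(lambda: [x for x in self._materialize() if predicate(x)])

    def persist(self):
        # Only sets a flag; the data is stored when the first action runs.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    # Actions: these force the computation.
    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]


lines = LazyDataset(lambda: ["Spark has a Python API",
                             "and a Scala API",
                             "Python is popular"])

# Without persist(): every action recomputes the filter.
unpersisted = lines.filter(lambda line: "Python" in line)
unpersisted.count()
unpersisted.first()
recomputed = unpersisted.evaluations     # one evaluation per action

# With persist(): the second action reuses the stored result.
persisted = lines.filter(lambda line: "Python" in line).persist()
persisted.count()
persisted.first()
computed_once = persisted.evaluations
```

Here `recomputed` ends up as 2 and `computed_once` as 1, mirroring the difference between cells like In [5] and In [6] to In [8].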

## Creating RDDs

Spark provides two ways to create RDDs: loading an external dataset and parallelizing a collection in your driver program.

In [9]:
lines = sc.parallelize(["pandas", "i like pandas"])

In [10]:
lines = sc.textFile("README.md")

## RDD Operations

### Transformations

In [11]:
inputRDD = sc.textFile("log.txt")

In [12]:
inputRDD.collect()



In [13]:
errorsRDD = inputRDD.filter(lambda x: "error" in x)

In [14]:
warningsRDD = inputRDD.filter(lambda x: "warning" in x)

In [15]:
badLinesRDD = errorsRDD.union(warningsRDD)

**union()** differs from **filter()** in that it operates on two RDDs instead of one.

Finally, as you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the *lineage graph*. It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost.
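As a rough picture of what a lineage graph records, the following plain-Python sketch tracks only parent links between datasets. The `Node` class and its `lineage()` method are hypothetical illustration names and do not reflect how Spark stores this information internally.

```python
class Node(object):
    """A dataset that remembers which datasets it was derived from."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)

    def lineage(self):
        """Return the names of all ancestors, nearest first, without repeats."""
        seen = []
        queue = list(self.parents)
        while queue:
            node = queue.pop(0)
            if node.name not in seen:
                seen.append(node.name)
                queue.extend(node.parents)
        return seen


# Same shape as the log example: two filters over one input, then a union.
input_node = Node("inputRDD")
errors_node = Node("errorsRDD", [input_node])
warnings_node = Node("warningsRDD", [input_node])
bad_lines_node = Node("badLinesRDD", [errors_node, warnings_node])

bad_lines_lineage = bad_lines_node.lineage()
# ['errorsRDD', 'warningsRDD', 'inputRDD']
```

Because every node knows its parents, any lost piece of `badLinesRDD` can in principle be rebuilt by walking back to `inputRDD` and reapplying the steps. In a live session, `badLinesRDD.toDebugString()` prints Spark's own description of the lineage.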

### Actions

In [16]:
print("Input had " + str(badLinesRDD.count()) + " concerning lines")

Input had 3 concerning lines


In [17]:
print("Here are 10 examples")

Here are 10 examples


In [18]:
for line in badLinesRDD.take(10):
    print(line)

error 1
error 2


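In this example, we used **take()** to bring a small number of elements from the RDD back to the driver program.

RDDs also have a **collect()** function to retrieve the entire RDD. Because it returns every element to the driver, use it only when the whole dataset fits in memory on a single machine.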

You can save the contents of an RDD using the **saveAsTextFile()** action, **saveAsSequenceFile()**, or any of a number of actions for various built-in formats.

### Lazy Evaluation

In systems like Hadoop MapReduce, developers often have to spend a lot of time considering how to group together operations to minimize the number of MapReduce passes. In Spark, there is no substantial benefit to writing a single complex map instead of chaining together many simple operations. Thus, users are free to organize their program into smaller, more manageable operations.
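One way to see why chaining simple operations costs little is to model each transformation as a Python generator. This is an analogy in plain Python rather than Spark code, and the names `parse`, `square`, and `trace` are made up for the illustration: each element flows through all chained steps in a single pass, and nothing runs until a result is requested.

```python
trace = []  # records the order in which work actually happens


def parse(records):
    for r in records:
        trace.append("parse %s" % r)
        yield int(r)


def square(numbers):
    for n in numbers:
        trace.append("square %d" % n)
        yield n * n


# Chaining two simple steps only builds the pipeline; no element is touched.
pipeline = square(parse(["1", "2", "3"]))
ran_before_action = list(trace)   # still empty at this point

# Consuming the pipeline plays the role of an action.
result = list(pipeline)
```

After the last line, `trace` alternates between `parse` and `square` for each element, so the two steps behave like one fused pass rather than two separate sweeps over the data.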

## Passing Functions to Spark
### Python

In [19]:
rdd = sc.textFile("log.txt")

In [20]:
word = rdd.filter(lambda s: "error" in s)

In [21]:
def containsError(s):
    return 'error' in s

In [22]:
word = rdd.filter(containsError)

One issue to watch out for when passing functions is inadvertently serializing the object containing the function. When you pass a function that is a member of an object, or that contains references to fields in an object, Spark sends the *entire* object to worker nodes, which can be much larger than the bit of information you need.
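The difference can be inspected in plain Python by looking at what a function carries in its closure, since whatever a function references has to travel with it. The class and helper names below (`SearchFunctions`, `big_table`, `captured_values`) are hypothetical and exist only for this sketch.

```python
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
        self.big_table = list(range(100000))  # stands in for unrelated heavy state

    def make_filter_capturing_self(self):
        # The lambda reads self.query, so it closes over the whole object.
        return lambda line: self.query in line

    def make_filter_capturing_field(self):
        # Copy the field into a local variable first; the lambda then
        # closes over the string only.
        query = self.query
        return lambda line: query in line


def captured_values(func):
    """Return the objects a function holds in its closure."""
    return [cell.cell_contents for cell in (func.__closure__ or ())]


search = SearchFunctions("error")
heavy = captured_values(search.make_filter_capturing_self())
light = captured_values(search.make_filter_capturing_field())
```

`heavy` contains the `SearchFunctions` instance itself (including `big_table`), while `light` contains just the string `"error"`. Passing the second kind of function to **filter()** keeps the object on the driver.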

## Common Transformations and Actions
### Basic RDDs
#### Element-wise transformations

The two most common transformations you will likely be using are **map()** and **filter()**.

**map()**'s return type does not have to be the same as its input type.

In [23]:
nums = sc.parallelize([1, 2, 3, 4])

In [24]:
squared = nums.map(lambda x: x * x).collect()

In [25]:
for num in squared:
    print("%i " % num)

1 
4 
9 
16 


Use **flatMap()** to produce multiple output elements for each input element. A simple use of **flatMap()** is splitting an input string into words.

In [26]:
lines = sc.parallelize(["hello world", "hi"])

In [27]:
words = lines.flatMap(lambda line: line.split(" "))

In [28]:
words.first()

'hello'

In [29]:
words.collect()

['hello', 'world', 'hi']

**Difference between flatMap() and map() on an RDD**

In [30]:
rdd = sc.parallelize(["coffee panda",
                      "happy panda",
                      "happiest panda party"])

In [31]:
rdd.map(lambda x: x.split(" ")).collect()

[['coffee', 'panda'], ['happy', 'panda'], ['happiest', 'panda', 'party']]

In [32]:
rdd.flatMap(lambda x: x.split(" ")).collect()

['coffee', 'panda', 'happy', 'panda', 'happiest', 'panda', 'party']
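For readers more familiar with plain Python, the same contrast can be reproduced without Spark. This is only an analogy: a list comprehension plays the role of **map()**, and `itertools.chain.from_iterable` plays the role of the flattening step in **flatMap()**.

```python
from itertools import chain

lines = ["coffee panda", "happy panda", "happiest panda party"]

# map-like: one output per input, so the result is a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
```

`mapped` has three elements (one list per line), while `flat_mapped` has seven (one per word), matching the outputs of In [31] and In [32].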

#### Pseudo set operations

RDDs support many of the operations of mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets. It’s important to note that all of these operations require that the RDDs being operated on are of the **same type**.

Some simple set operations

In [33]:
rdd1 = sc.parallelize(["coffee", "coffee", "panda", "monkey", "tea"])

In [34]:
rdd2 = sc.parallelize(["coffee", "monkey", "kitty"])

In [35]:
rdd1.distinct().collect()

['tea', 'coffee', 'panda', 'monkey']

**distinct()** requires shuffling.

In [36]:
rdd1.union(rdd2).collect()

['coffee', 'coffee', 'panda', 'monkey', 'tea', 'coffee', 'monkey', 'kitty']

**union()** does not require shuffling, and the result may contain duplicates.

In [37]:
rdd1.intersection(rdd2).collect()

['coffee', 'monkey']

**intersection()** requires shuffling.

In [38]:
rdd1.subtract(rdd2).collect()

['tea', 'panda']

**subtract()** requires shuffling.

Cartesian product between two RDDs

In [39]:
rdd1 = sc.parallelize(["User1", "User2", "User3"])

In [40]:
rdd2 = sc.parallelize(["Venue Betabrand",
                       "Venue Asha Tea House",
                       "Venue Ritual"])

In [41]:
rdd1.cartesian(rdd2).collect()

[('User1', 'Venue Betabrand'),
 ('User1', 'Venue Asha Tea House'),
 ('User1', 'Venue Ritual'),
 ('User2', 'Venue Betabrand'),
 ('User2', 'Venue Asha Tea House'),
 ('User2', 'Venue Ritual'),
 ('User3', 'Venue Betabrand'),
 ('User3', 'Venue Asha Tea House'),
 ('User3', 'Venue Ritual')]

#### Actions

**reduce()** takes a function that operates on two elements of the type in your RDD and returns a new element of the same type.

In [42]:
nums = sc.parallelize(range(10))

In [43]:
nums.reduce(lambda x, y: x + y)

45

Similar to **reduce()** is **fold()**, which also takes a function with the same signature as needed for **reduce()**, but in addition takes a "zero value" to be used for the initial call on each partition.

In [44]:
nums.fold(0, lambda x, y: x + y)

45
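Because the zero value is applied once per partition, it should be the identity for the operation (0 for `+`, 1 for `*`, an empty list for concatenation). The plain-Python sketch below simulates this under two assumptions: the data is split into two partitions chosen by hand, and the zero value is used once more when the per-partition results are merged. `simulate_fold` is a hypothetical helper, not a Spark function.

```python
from functools import reduce


def simulate_fold(partitions, zero, op):
    # Fold each partition separately, starting from the zero value...
    per_partition = [reduce(op, part, zero) for part in partitions]
    # ...then merge the partial results, again starting from the zero value.
    return reduce(op, per_partition, zero)


def add(x, y):
    return x + y


# range(10) split by hand into two partitions (an assumption of this sketch).
partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

with_identity = simulate_fold(partitions, 0, add)      # zero value 0 leaves the sum alone
with_non_identity = simulate_fold(partitions, 1, add)  # zero value 1 is added three times
```

With 0 as the zero value the result is 45, as in the cell above. With 1 it becomes 48, and the exact offset would change with the number of partitions, which is why a non-identity zero value gives results that depend on how the data happens to be split.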

Both **fold()** and **reduce()** require that the return type of our result be the **same type** as that of the elements in the RDD we are operating over.

The **aggregate()** function frees us from the constraint of having the return be the same type as the RDD we are working on.

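1. First argument: the zero value, which is the initial accumulator.
2. Second argument: a function that merges one element into the accumulator locally, within a partition on a single node.
3. Third argument: a function that merges two accumulators coming from different partitions across the cluster.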

The following example uses one function **aggregate()** instead of two functions **map()** and **reduce()**:

In [45]:
sumCount = nums.aggregate((0, 0),
                          # acc = (0, 0)
                          lambda acc, value: (acc[0] + value, acc[1] + 1),
                          # like local combiner in mapreduce
                          lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
                          # like reducer in mapreduce

In [46]:
sumCount

(45, 10)

In [47]:
nums.map(lambda x: (x, 1.0)).reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]))

(45, 10.0)
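To make the roles of the two functions passed to **aggregate()** more concrete, the following plain-Python sketch simulates the computation over hand-picked partitions. `simulate_aggregate` is a hypothetical helper and the three-way split is an assumption; Spark decides the real partitioning.

```python
from functools import reduce


def simulate_aggregate(partitions, zero, seq_op, comb_op):
    # seq_op folds each element into a per-partition accumulator.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # comb_op merges the per-partition accumulators into one result.
    return partials, reduce(comb_op, partials, zero)


def seq_op(acc, value):
    return (acc[0] + value, acc[1] + 1)


def comb_op(acc1, acc2):
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


# range(10) split by hand into three partitions.
partitions = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

partials, sum_count = simulate_aggregate(partitions, (0, 0), seq_op, comb_op)
mean = sum_count[0] / float(sum_count[1])
```

Each entry of `partials` is a `(sum, count)` pair for one partition: `(6, 4)`, `(15, 3)`, and `(24, 3)`. Merging them gives `(45, 10)`, the same pair as `sumCount`, from which the mean 4.5 follows. The accumulator is a tuple even though the elements are integers, which is exactly the freedom **aggregate()** adds over **reduce()** and **fold()**.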

### Converting Between RDD Types

Some functions are available only on certain types of RDDs, such as **mean()** and **variance()** on numeric RDDs or **join()** on key/value pair RDDs. In Python all of the functions are implemented on the base RDD class but will fail at runtime if the type of data in the RDD is incorrect.

## Persistence (Caching)

**Double execution**:

In [48]:
rdd = sc.parallelize(range(10))

result = rdd.map(lambda x: x*x)

In [49]:
result.count() # action triggers execution implicitly

10

In [50]:
result.collect()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

To avoid computing an RDD multiple times, we can ask Spark to persist the data. When we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions. If a node that has data persisted on it fails, Spark will **recompute the lost partitions** of the data when needed.

Example:

**from pyspark import StorageLevel**

**rdd.persist(StorageLevel.DISK_ONLY)**

If you attempt to cache too much data to fit in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed, while for the memory-and-disk ones, it will write them out to disk. In either case, this means that you don’t have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can lead to eviction of useful data and more recomputation time.
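The LRU policy and its cost can be illustrated with a small plain-Python model of a memory-only cache. `LruPartitionCache` and its fields are hypothetical names for this sketch, and the model ignores everything about real Spark memory management except the eviction order.

```python
from collections import OrderedDict


class LruPartitionCache(object):
    """Keeps at most `capacity` partitions; evicts the least recently used one."""

    def __init__(self, capacity, compute):
        self.capacity = capacity
        self.compute = compute        # function: partition id -> partition data
        self.store = OrderedDict()    # oldest entry first, most recent last
        self.recomputations = 0

    def get(self, partition_id):
        if partition_id in self.store:
            # Hit: remove and re-insert below to mark it as most recently used.
            value = self.store.pop(partition_id)
        else:
            # Miss: a memory-only level has to compute the partition again.
            self.recomputations += 1
            value = self.compute(partition_id)
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # drop the least recently used
        self.store[partition_id] = value
        return value


cache = LruPartitionCache(capacity=2,
                          compute=lambda pid: [pid * 10 + i for i in range(3)])

# Access pattern over three partitions with room for only two.
for pid in [0, 1, 0, 2, 1]:
    cache.get(pid)

still_cached = list(cache.store)
```

With this access pattern only the second request for partition 0 is a hit; partition 1 is evicted to make room for partition 2 and has to be recomputed on its next access, so `cache.recomputations` ends at 4 and `still_cached` is `[2, 1]`. Caching data that is never reused has the same effect on partitions that are.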

Finally, RDDs come with a method called **unpersist()** that lets you manually remove them from the cache.

# Synopsis

In [51]:
rdd = sc.parallelize([1, 2, 3, 3])

## Basic RDD transformations on an RDD containing [1, 2, 3, 3]

1 **map()**: Apply a function to each element in the RDD and return an RDD of the result

In [52]:
rdd.map(lambda x: x + 1).collect()

[2, 3, 4, 4]

2 **flatMap()**: Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned. Often used to extract words.

In [53]:
rdd.flatMap(lambda x: range(x, 4)).collect()

[1, 2, 3, 2, 3, 3, 3]

3 **filter()**: Return an RDD consisting of only elements that pass the condition passed to **filter()**

In [54]:
rdd.filter(lambda x: x != 1).collect()

[2, 3, 3]

4 **distinct()**: Remove duplicates

In [55]:
rdd.distinct().collect()

[1, 2, 3]

5 **sample(withReplacement, fraction, [seed])**: Sample an RDD, with or without replacement 

In [56]:
rdd.sample(False, 0.5, 1203).collect()

[1, 3]

## Two-RDD transformations on RDDs containing [1, 2, 3] and [3, 4, 5]

In [57]:
rdd = sc.parallelize([1, 2, 3])

In [58]:
other = sc.parallelize([3, 4, 5])

1 **union()**: Produce an RDD containing elements from both RDDs

In [59]:
rdd.union(other).collect()

[1, 2, 3, 3, 4, 5]

2 **intersection()**: RDD containing only elements found in both RDDs

In [60]:
rdd.intersection(other).collect()

[3]

3 **subtract()**: Remove the contents of the other RDD from this one (e.g., remove training data)

In [61]:
rdd.subtract(other).collect()

[2, 1]

4 **cartesian()**: Cartesian product with the other RDD

In [62]:
rdd.cartesian(other).collect()

[(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 3), (3, 4), (3, 5)]

## Basic actions on an RDD containing [1, 2, 3, 3]

In [63]:
rdd = sc.parallelize([1, 2, 3, 3])

1 **collect()**: Return all elements from the RDD

In [64]:
rdd.collect()

[1, 2, 3, 3]

2 **count()**: Number of elements in the RDD.

In [65]:
rdd.count()

4

3 **countByValue()**: Number of times each element occurs in the RDD

In [66]:
rdd.countByValue()

defaultdict(int, {1: 1, 2: 1, 3: 2})

4 **take(num)**: Return num elements from the RDD

In [67]:
rdd.take(2)

[1, 2]

5 **top(num)**: Return the top num elements from the RDD

In [68]:
rdd.top(2)

[3, 3]

6 **takeOrdered(num, ordering)**: Return num elements based on provided ordering

In [69]:
rdd.takeOrdered(3, key=lambda x: -x)

[3, 3, 2]

7 **takeSample(withReplacement, num, [seed])**: Return num elements at random

In [70]:
rdd.takeSample(False, 3, 1998)

[3, 2, 1]

8 **reduce(func)**: Combine the elements of the RDD together in parallel (e.g., sum)

In [71]:
rdd.reduce(lambda x, y: x + y)

9

9 **fold(zero, func)**: Same as **reduce()** but with the provided zero value

In [72]:
rdd.fold(0, lambda x, y: x + y)

9

10 **aggregate(zeroValue, seqOp, combOp)**: Similar to **reduce()** but used to return a different type

In [73]:
rdd.aggregate((0, 0),
              lambda acc, value: (acc[0] + value, acc[1] + 1),
              lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))

(9, 4)

11 **foreach(func)**: Apply the provided function to each element of the RDD

In [74]:
def f(x): print(x)
    
rdd.foreach(f)