## Chaining
We can **chain** transformations and actions to create a computation **pipeline**.
Suppose we want to compute the sum of the squares
$$ \sum_{i=1}^n x_i^2 $$
where the elements $x_i$ are stored in an RDD.

### Start the `SparkContext`

In [39]:
import numpy as np
from pyspark import SparkConf, SparkContext
# getOrCreate() reuses an already-running SparkContext instead of raising
# "ValueError: Cannot run multiple SparkContexts at once" when this cell is re-run
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[4]"))

In [None]:
B = sc.parallelize(np.random.randint(0, 10, size=1000))  # create RDD B with 1000 random integers from 0 to 9
lst = B.collect()  # bring all elements of B back to the driver as a list
for i in lst:
    print(i, end=', ')  # print the elements separated by commas

7, 8, 3, 6, 2, 3, 5, 5, 8, 4, 4, 9, 3, 1, 7, 5, 0, 0, 0, 8, 9, 6, 5, 4, 5, 2, 5, 4, 2, 0, 0, 2, 0, 4, 2, 0, 7, 7, 3, 7, 5, 2, 1, 5, 6, 3, 3, 9, 3, 5, 7, 4, 6, 3, 2, 5, 7, 6, 6, 5, 1, 6, 8, 3, 4, 8, 8, 6, 6, 5, 4, 1, 1, 5, 4, 8, 0, 4, 3, 0, 7, 1, 7, 4, 0, 7, 0, 2, 6, 2, 4, 4, 1, 6, 5, 6, 3, 1, 1, 8, 2, 1, 5, 3, 0, 7, 1, 6, 9, 3, 5, 3, 6, 8, 4, 3, 9, 6, 5, 3, 3, 9, 1, 3, 4, 4, 6, 4, 4, 0, 0, 8, 1, 2, 2, 6, 1, 0, 8, 3, 3, 7, 2, 2, 4, 5, 2, 7, 5, 2, 1, 8, 9, 0, 6, 7, 0, 8, 9, 0, 2, 7, 6, 4, 9, 4, 1, 4, 9, 8, 3, 2, 9, 1, 0, 5, 8, 5, 5, 5, 6, 6, 5, 9, 1, 3, 4, 3, 7, 9, 2, 3, 2, 1, 6, 7, 5, 2, 1, 9, 0, 8, 9, 8, 2, 0, 4, 9, 9, 6, 4, 5, 9, 4, 9, 1, 9, 6, 8, 1, 9, 8, 9, 2, 3, 0, 0, 0, 8, 1, 5, 9, 2, 8, 3, 9, 6, 8, 3, 4, 3, 0, 4, 9, 9, 7, 9, 8, 4, 7, 9, 0, 6, 7, 3, 7, 4, 2, 4, 9, 1, 6, 5, 1, 1, 3, 3, 1, 0, 6, 6, 0, 8, 2, 3, 4, 8, 2, 4, 8, 7, 0, 8, 4, 5, 0, 2, 1, 3, 9, 4, 9, 5, 5, 9, 2, 6, 4, 0, 6, 1, 0, 0, 2, 9, 9, 3, 1, 7, 4, 4, 0, 7, 4, 0, 3, 1, 3, 5, 2, 7, 1, 8, 3, 9, 8, 1, 3, 3, 0, 6, 9, 1, 6

### Sequential syntax for chaining
Perform an assignment after each computation.

In [None]:
%%time
Squares = B.map(lambda x: x*x)  # create RDD Squares by applying the lambda function to each element of B
summation = Squares.reduce(lambda x, y: x+y)  # reduce Squares to a single value by repeatedly adding pairs of elements

CPU times: user 640 µs, sys: 7.04 ms, total: 7.68 ms
Wall time: 118 ms


In [None]:
print(summation)

28625


### Cascaded syntax for chaining
Combine the computations into a single cascaded command.

In [None]:
%%time
B.map(lambda x: x*x).reduce(lambda x, y: x+y)  # same computation as above, in one line of code
# the squares are computed first, then passed as input to reduce

CPU times: user 4.34 ms, sys: 4.56 ms, total: 8.91 ms
Wall time: 89.3 ms


28625

### Both syntaxes mean exactly the same thing
The only difference:
* In the sequential syntax, the intermediate RDD has a name: `Squares`.
* In the cascaded syntax, the intermediate RDD is *anonymous*.

The execution is identical!
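
The equivalence of the two syntaxes can be illustrated with a plain-Python analogy. This is a minimal sketch, not Spark code: the built-in `map` and `functools.reduce` stand in for the RDD methods, and the list `xs` and the variable names are assumptions made for the illustration.

```python
from functools import reduce

xs = [3, 1, 4, 1, 5]  # hypothetical stand-in for the contents of B

# Sequential syntax: the intermediate result gets a name.
squares = map(lambda x: x * x, xs)
total_sequential = reduce(lambda x, y: x + y, squares)

# Cascaded syntax: the intermediate result is anonymous.
total_cascaded = reduce(lambda x, y: x + y, map(lambda x: x * x, xs))

print(total_sequential, total_cascaded)  # both are 52
```

In both forms the same two functions are applied in the same order; only the naming of the intermediate value differs.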

### Sequential execution
The standard way to execute a map followed by a reduce is:
* perform the map
* store the resulting RDD in memory
* perform the reduce

### Disadvantages of sequential execution

1. The intermediate result (`Squares`) requires memory space.
2. Two scans of memory (first of `B`, then of `Squares`), which doubles the cache misses.
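
The two-pass pattern can be sketched in plain Python. This is an illustrative analogy only, not how Spark is implemented; the function name `sum_of_squares_sequential` is hypothetical, and a Python list plays the role of the stored intermediate RDD.

```python
def sum_of_squares_sequential(xs):
    """Two-pass sketch: store every square, then reduce the stored values."""
    squares = [x * x for x in xs]  # pass 1: scan xs and store all squares
    total = 0
    for s in squares:              # pass 2: scan the stored squares
        total += s
    return total, len(squares)     # also report how many values were stored

total, stored = sum_of_squares_sequential(range(1, 1001))
print(total, stored)  # 333833500 1000
```

The intermediate list holds as many values as the input, which is the extra memory cost named in item 1.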

### Pipelined execution
Perform the whole computation in a single pass. For each element of **`B`**:
1. Compute the square
2. Enter the square as input to the `reduce` operation.

### Advantages of pipelined execution

1. Less memory required: the intermediate result is not stored.
2. Faster: only one pass through the input RDD.
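
A single-pass version can be sketched in plain Python as follows. Again this is an analogy under stated assumptions rather than Spark's actual implementation; the function name `sum_of_squares_pipelined` is hypothetical.

```python
def sum_of_squares_pipelined(xs):
    """Single-pass sketch: each square goes straight into the running total."""
    total = 0
    for x in xs:
        square = x * x   # step 1: compute the square
        total += square  # step 2: feed it to the reduction immediately
    return total

print(sum_of_squares_pipelined(range(1, 1001)))  # 333833500
```

Only one square exists at any moment, so the memory used does not grow with the input size. The generator expression `sum(x * x for x in xs)` expresses the same idea more compactly.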

In [None]:
sc.stop()