## Chaining
We can **chain** transformations and actions to create a computation **pipeline**.
Suppose we want to compute the sum of the squares
$$ \sum_{i=1}^n x_i^2 $$
where the elements $x_i$ are stored in an RDD.
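
As a quick sanity check of the formula, independent of Spark, here is a minimal plain-Python sketch. The list `xs` is a made-up example, not the data used later in this notebook.

```python
# Hypothetical example values (not taken from the RDD used in this notebook).
xs = [1, 2, 3, 4]

# Sum of squares: 1 + 4 + 9 + 16
sum_of_squares = sum(x * x for x in xs)
print(sum_of_squares)  # 30
```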

### Start the `SparkContext`

In [None]:
import numpy as np
from pyspark import SparkContext
sc = SparkContext(master="local[4]")

In [None]:
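# Create an RDD of 1000 random integers in the range [0, 10)
B = sc.parallelize(np.random.randint(0, 10, size=1000))
# Bring all elements back to the driver and print them
lst = B.collect()
for i in lst:
    print(i, end=', ')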

### Sequential syntax for chaining
Assign the result of each computation to a variable before using it in the next step.

In [3]:
%%time
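# map is a transformation: it defines the RDD of squares
Squares = B.map(lambda x: x*x)
# reduce is an action: it adds up the squares and returns the result
summation = Squares.reduce(lambda x, y: x+y)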

CPU times: total: 15.6 ms
Wall time: 6.88 s


In [4]:
print(summation)

27727


### Cascaded syntax for chaining
Combine the computations into a single cascaded command.

In [5]:
%%time
B.map(lambda x:x*x).reduce(lambda x,y:x+y)

CPU times: total: 0 ns
Wall time: 4.34 s


27727

### Both syntaxes mean exactly the same thing
The only difference:
* In the sequential syntax, the intermediate RDD has a name: `Squares`.
* In the cascaded syntax, the intermediate RDD is *anonymous*.

The execution is identical!

### Sequential execution
The standard way to execute a map followed by a reduce is:
* perform the map
* store the resulting RDD in memory
* perform the reduce

### Disadvantages of Sequential execution

1. The intermediate result (`Squares`) requires memory space.
2. Two scans of memory are needed (of `B`, then of `Squares`).

### Pipelined execution
Perform the whole computation in a single pass. For each element of **`B`**:
1. Compute the square
2. Enter the square as input to the `reduce` operation.
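
The single-pass idea above can be illustrated with a plain-Python sketch. Again this is an analogy, not Spark's implementation: `data` is a hypothetical stand-in for `B`, and a generator expression stands in for the intermediate result that is never stored.

```python
from functools import reduce

# Hypothetical stand-in for the RDD `B`.
data = [3, 1, 4, 1, 5]

# Explicit single pass: square each element and feed it straight
# into the running sum, without storing the squares.
total = 0
for x in data:
    square = x * x   # step 1: compute the square
    total += square  # step 2: enter the square into the reduction

# The same single pass written with a lazy generator expression:
# each square is produced only when reduce asks for it.
total_lazy = reduce(lambda x, y: x + y, (x * x for x in data))

print(total, total_lazy)  # 52 52
```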

### Advantages of Pipelined execution

1. Less memory required: the intermediate result is not stored.
2. Faster: only one pass through the input RDD.
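
To give a rough feel for the memory advantage, the following plain-Python sketch compares a stored list of squares with a lazy generator of squares. It is an illustration under assumptions: `n` and `data` are hypothetical, and `sys.getsizeof` reports only the size of the container object itself (not the integers it refers to), so the numbers are indicative at best and say nothing about Spark's actual memory use.

```python
import sys

n = 100_000
data = range(n)  # hypothetical input, standing in for the RDD `B`

stored = [x * x for x in data]    # intermediate result kept in memory
streamed = (x * x for x in data)  # intermediate result produced on demand

# Size of the container objects only, in bytes.
list_bytes = sys.getsizeof(stored)
gen_bytes = sys.getsizeof(streamed)

# Both approaches yield the same sum of squares.
total_stored = sum(stored)
total_streamed = sum(streamed)

print(list_bytes > gen_bytes, total_stored == total_streamed)  # True True
```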

In [6]:
sc.stop()