# Spark Basics

In a Jupyter notebook running a Spark Scala kernel, evaluating `spark` in a cell initializes the session.

In [1]:
spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.19:4041
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1588616857471)
SparkSession available as 'spark'


res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@76bdbbaa


# Spark Context

In compiled Spark jobs (or in a script), we start by creating a `SparkSession` and extracting its **Spark Context** object, conventionally named `sc`. The logic is as follows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext

val spark = SparkSession.builder.                // Use the builder pattern
            master("local[*]").                  // Run locally on all cores
            appName("MyApp").                    // Name of your app
            config("spark.app.id", "Console").   // Silence Metric Warnings
            getOrCreate()                        // Create it

val sc = spark.sparkContext                      // Extract SparkContext
```

**Only one SparkContext at a time!**
* Spark is designed for a single user.
* Only one SparkContext may run per program/notebook.
* Before starting a new SparkContext, stop the one currently running with `sc.stop()`.

# Some basic RDD commands

## Parallelize

* The simplest way to create an RDD.
* The call `val A = sc.parallelize(L)` creates an RDD named `A` from the list `L`.
* `A` is an RDD of type `ParallelCollectionRDD` (see the output below).

In [5]:
List(1 to 3)

res4: List[scala.collection.immutable.Range.Inclusive] = List(Range(1, 2, 3))


In [14]:
val A = sc.parallelize(List.range(0, 3))
println(A)

ParallelCollectionRDD[2] at parallelize at <console>:27


A: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:27


## Collect

* The content of an RDD is distributed among all executors.
* `collect()` is the inverse of `parallelize()`.
* It collects the elements of the RDD on the head node.
* It returns an `Array`.

In [17]:
val L = A.collect()
println(L)

[I@26678e55


L: Array[Int] = Array(0, 1, 2)
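
The `[I@26678e55` printed above is not the content of the array: `println` on a JVM array shows its default `toString` (a type tag plus an identity hash). A minimal sketch of printing the elements instead, using a plain local array as a stand-in for the result of `A.collect()` (the names `collected` and `rendered` are illustrative):

```scala
// Stand-in for the result of A.collect(); no Spark is needed for this sketch.
val collected: Array[Int] = Array(0, 1, 2)

// mkString(start, sep, end) joins the elements into a readable string.
val rendered: String = collected.mkString("[", ", ", "]")

println(rendered) // prints [0, 1, 2]
```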


**Using `.collect()` eliminates the benefits of parallelism**

It is often tempting to `.collect()` an RDD, turn it into a local array, and then process it using standard Scala. However, this means that only the head node performs the computation, so you get no benefit from Spark.

Using RDD operations, as described below, **will** make use of all of the computers at your disposal.

## Map
* Applies a given operation to each element of an RDD.
* Parameter is the function defining the operation.
* Returns a new RDD.
* Operation performed in parallel on all executors.
* Each executor operates on the data **local** to it.

In [19]:
A.map(x => x*x).collect()

res16: Array[Int] = Array(0, 1, 4)


**Note:** Here we are using **anonymous** functions (i.e., lambda functions). Later we will see that regular functions can also be used.

For more on anonymous functions, see [here](https://www.scala-lang.org/old/node/133.html)
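
As a small illustration of the note above, the same squaring operation can be written as an anonymous function or as a named one. The sketch below uses a plain Scala `List` as a stand-in for the RDD (an assumption made so that it runs without Spark); `RDD.map` accepts a function in the same way. The name `square` is illustrative.

```scala
val xs = List(0, 1, 2)

// Anonymous function, as in the cell above.
val viaLambda = xs.map(x => x * x)

// The same operation as a named (regular) function.
def square(x: Int): Int = x * x
val viaNamed = xs.map(square)

println(viaLambda) // prints List(0, 1, 4)
println(viaNamed)  // prints List(0, 1, 4)
```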

## Reduce

* Takes RDD as input, returns a single value.
* The **reduce operator** takes **two** elements as input and returns **one** as output.
* Repeatedly applies a **reduce operator**
* Each executor reduces the data local to it.
* The results from all executors are combined.

For example, summation is a 2-to-1 operation:

In [20]:
A.reduce((x,y)=> x+y)

res17: Int = 3
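
The two-stage behaviour described in the bullets (each executor reduces its local data, then the partial results are combined) can be sketched locally. This is only a simulation under assumptions: the hypothetical `partitions` list stands in for one way Spark might split the data, and plain Scala collections replace the RDD.

```scala
// Hypothetical split of the data 0..7 into three partitions.
val partitions: List[List[Int]] = List(List(0, 1, 2), List(3, 4, 5), List(6, 7))

val add = (x: Int, y: Int) => x + y

// Stage 1: each "executor" reduces the data local to it.
val partials: List[Int] = partitions.map(_.reduce(add))

// Stage 2: the partial results are combined with the same operator.
val total: Int = partials.reduce(add)

println(partials) // prints List(3, 12, 13)
println(total)    // prints 28
```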


Another example: a reduce operation that finds the shortest string in an RDD of strings.

In [23]:
val words = List("this", "is", "the", "best", "mac", "ever")
val wordRDD = sc.parallelize(words)
wordRDD.reduce((w,v)=> if (w.length()< v.length) w else v)

words: List[String] = List(this, is, the, best, mac, ever)
wordRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:26
res18: String = is


## Properties of reduce operations

* Reduce operations **must not depend on the order**
  * Order of operands should not matter
  * Order of application of reduce operator should not matter

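* Multiplication and summation are good: any ordering or grouping gives the same result.

```
                1 + 3 + 5 + 2 = 11                 5 + 3 + 1 + 2 = 11
```

* Division and subtraction are bad: the result depends on how the operations are grouped.

```
                ((1 - 3) - 5) - 2 = -9             (1 - 3) - (5 - 2) = -5
```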

### Why must reordering not change the result?

You can think about the reduce operation as a binary tree where the leaves are the elements of the list and the root is the final result. Each triplet of the form (parent, child1, child2) corresponds to a single application of the reduce function. 

The order in which the reduce operation is applied is **determined at run time** and depends on how the RDD is partitioned across the cluster.
There are many different orders to apply the reduce operation. 

If we want the input RDD to uniquely determine the reduced value, **all evaluation orders must yield the same final result**. In addition, the order of the elements in the list must not change the result. In particular, reversing the order of the operands in a reduce function must not change the outcome.

For example, the arithmetic operations multiply `*` and add `+` can be used in a reduce, but the operations subtract `-` and divide `/` should not.

Using them will not raise an error, but the result is unpredictable.

In [25]:
val B = sc.parallelize(List(1,3,5,2))
B.reduce((x,y)=> x-y)

B: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:25
res19: Int = -7


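Which order was executed? Neither of the two obvious candidates produces $-7$:
* $$((1-3)-5)-2 = -9$$
* $$(1-3)-(5-2) = -5$$

One order that does yield $-7$ is $((2-1)-3)-5$, in which the partial results are combined in a different order than the elements appear in the list.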
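
To see why the grouping matters, the sketch below simulates a reduce over different partitionings using plain Scala collections. It is not how Spark is implemented; `partitionedReduce` is a hypothetical helper that splits the list into consecutive chunks, reduces each chunk, and then reduces the partial results.

```scala
// Reduce each chunk of `size` consecutive elements, then reduce the partial results.
def partitionedReduce(data: List[Int], size: Int)(op: (Int, Int) => Int): Int =
  data.grouped(size).map(_.reduce(op)).reduce(op)

val data = List(1, 3, 5, 2)

// Subtraction: the result depends on the partitioning.
val subOnePartition  = partitionedReduce(data, 4)(_ - _) // ((1-3)-5)-2 = -9
val subTwoPartitions = partitionedReduce(data, 2)(_ - _) // (1-3)-(5-2) = -5

// Addition: every partitioning gives the same result.
val addOnePartition  = partitionedReduce(data, 4)(_ + _) // 11
val addTwoPartitions = partitionedReduce(data, 2)(_ + _) // 11

println(s"subtract: $subOnePartition vs $subTwoPartitions")
println(s"add:      $addOnePartition vs $addTwoPartitions")
```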

# Using regular functions instead of anonymous functions

Anonymous functions are short and sweet, but sometimes it is hard to fit the logic into one line. We can use full-fledged functions instead.

Suppose we want to find the last word in lexicographical order among the longest words in the list.

We could achieve that as follows:

In [31]:
def largerThan(x: String, y: String): String = {
    if (x.length > y.length) x        // x is longer
    else if (x.length < y.length) y   // y is longer
    else if (x > y) x                 // Lengths are equal, compare lexicographically
    else y
}

largerThan: (x: String, y: String)String


In [32]:
wordRDD.reduce(largerThan)

res22: String = this
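
Because `largerThan` is used inside a `reduce`, its result should not depend on the order of the elements. One way to gain some confidence in that, sketched here with plain Scala collections rather than an RDD, is to reduce every permutation of the word list and check that the answer never changes. The function is repeated so that the block is self-contained; `results` is an illustrative name. Note that this checks the order of the elements only, not every possible grouping.

```scala
def largerThan(x: String, y: String): String =
  if (x.length > y.length) x
  else if (x.length < y.length) y
  else if (x > y) x // equal lengths: compare lexicographically
  else y

val words = List("this", "is", "the", "best", "mac", "ever")

// Reduce all 720 orderings of the six words and collect the distinct answers.
val results: Set[String] = words.permutations.map(_.reduce(largerThan)).toSet

println(results) // prints Set(this)
```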


# Summary

We saw how to:
* Start a SparkContext
* Create an RDD
* Perform Map and Reduce operations on an RDD
* Collect the final results back to the head node.

# Chaining

We can **chain** transformations and actions to create a computation **pipeline**.

Suppose we want to compute the sum of the squares
$$ \sum_{i=1}^n x_i^2 $$
where the elements $x_i$ are stored in an RDD.

In [34]:
val B = sc.parallelize(List.range(0, 4))
B.collect()

B: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:27
res24: Array[Int] = Array(0, 1, 2, 3)


## Sequential syntax for chaining

Perform an assignment after each computation:

In [37]:
val squares = B.map(x=>x*x)
squares.reduce((x,y)=>(x+y))

squares: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at map at <console>:26
res27: Int = 14


## Cascade syntax for chaining

Combine the computations into a single cascaded command:

In [36]:
B.map(x=>x*x).reduce((x,y)=>x+y)

res26: Int = 14


**Both syntaxes mean exactly the same thing**

The only difference:
* In the sequential syntax, the intermediate RDD has a name: `squares`.
* In the cascaded syntax, the intermediate RDD is *anonymous*.

The execution is identical!

### Sequential execution
The standard way to execute the map and the reduce is to:
* perform the map
* store the resulting RDD in memory
* perform the reduce

### Disadvantages of Sequential execution

1. The intermediate result (`squares`) requires memory space.
2. Two scans of memory (of `B`, then of `squares`), which doubles the cache misses.

### Pipelined execution
Perform the whole computation in a single pass. For each element of **`B`**
1. Compute the square
2. Enter the square as input to the `reduce` operation.

### Advantages of Pipelined execution

1. Less memory required - intermediate result is not stored.
2. Faster - only one pass through the Input RDD.

### Lazy Evaluation
This type of pipelined evaluation is related to **Lazy Evaluation**. The word **Lazy** is used because the first command (computing the square) is not executed immediately. Instead, the execution is delayed as long as possible so that several commands are executed in a single pass.

The delayed commands are organized in an **Execution plan**.

For more on Pipelined execution, Lazy evaluation and Execution Plans see [spark programming guide/RDD operations](http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations)
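
A rough local analogy (not Spark itself) is a Scala `Iterator`, whose `map` is also lazy: nothing is computed until a terminal operation such as `reduce` pulls the elements through, one at a time. The counter `calls` is a hypothetical instrument added only to observe when the squaring function runs.

```scala
var calls = 0 // counts how many times the squaring function has run

// Declaring the map does not execute it.
val squares = Iterator(0, 1, 2, 3).map { x => calls += 1; x * x }
val callsBeforeReduce = calls // still 0

// The reduce pulls each element through the map in a single pass.
val sumOfSquares = squares.reduce(_ + _)
val callsAfterReduce = calls // now 4

println(s"calls before reduce = $callsBeforeReduce") // prints calls before reduce = 0
println(s"sum = $sumOfSquares, calls after reduce = $callsAfterReduce") // prints sum = 14, calls after reduce = 4
```

No intermediate collection of squares is ever materialized here, which mirrors the memory advantage of pipelined execution.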

### An instructive mistake
Here is another way to compute the sum of the squares using a single reduce command. Can you figure out how it comes up with this unexpected result?

In [39]:
val C = sc.parallelize(List(1,1,2))
C.reduce((x,y)=> x*x+y*y)

C: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:27
res29: Int = 8
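
One evaluation order consistent with the result above can be traced by hand. The sketch uses a plain Scala `List`, whose `reduce` works left to right; Spark does not guarantee that order, so this is an assumption made for illustration.

```scala
val c = List(1, 1, 2)

// Left to right: (1*1 + 1*1) = 2, then (2*2 + 2*2) = 8.
// The partial result 2 is squared again, which is the mistake.
val wrong = c.reduce((x, y) => x * x + y * y)

// Squaring belongs in a map; the reduce should only add.
val right = c.map(x => x * x).reduce(_ + _)

println(wrong) // prints 8
println(right) // prints 6
```

The operator `(x, y) => x*x + y*y` is also not associative, so a different grouping of the same elements can produce yet another value.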


# Getting information about an RDD
RDDs typically have hundreds of thousands of elements, so it usually makes no sense to print out the whole content of an RDD. Here are some ways to get manageable amounts of information about an RDD.

Create an RDD of length **`n`** that repeats the pattern `1,2,3,4`:

In [84]:
List.fill(3)(List(1,2,3,4)).flatten

res61: List[Int] = List(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)


In [97]:
val n:Int = 1000000
val B = sc.parallelize(List.fill(n/4)(List(1,2,3,4)).flatten)

n: Int = 1000000
B: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[29] at parallelize at <console>:28


In [98]:
B.count()

res72: Long = 1000000


In [99]:
// Get first few elements of an RDD
println(f"First element = ${B.first()}")
println(f"First 5 elements = [${B.take(5).mkString(", ")}]")

First element = 1
First 5 elements = [1, 2, 3, 4, 1]


In [102]:
B.take(15)

res76: Array[Int] = Array(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3)


# Sampling an RDD
* RDDs are often very large.
* Aggregates, such as averages, can be approximated efficiently by using a sample.
* Sampling is done in parallel and requires limited computation.

The method `RDD.sample(withReplacement, p)` generates a sample of the elements of the RDD, where
- `withReplacement` is a boolean flag indicating whether an element in the RDD can be sampled more than once.
- `p` is the probability of accepting each element into the sample. Because each element is accepted or rejected independently, within each partition, the number of elements in the sample changes from sample to sample.

In [119]:
val m:Int = 5
val p:Float = (m.toFloat/n.toFloat)
//B.sample(false, p).collect()

println(f"sample1 = [${B.sample(false,p).collect().mkString(", ")}]")
println(f"sample2 = [${B.sample(false,p).collect().mkString(", ")}]")

sample1 = [1, 1, 3]
sample2 = [3, 1, 2, 4, 4, 3, 3, 3]


m: Int = 5
p: Float = 5.0E-6



**Things to note and think about**

* Each time you run the previous cell, you get a different sample, and therefore a different estimate of any aggregate computed from it.
* The accuracy of the estimate is determined by the expected size of the sample, $n*p$.
* See how the error changes as you vary $p$.
* Can you give a formula that relates the variance of the estimate to $(p*n)$? (The answer is in the Probability and Statistics course.)
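
To experiment with these points without a cluster, the sketch below simulates the sampling locally. It follows the description in the text (each element is accepted independently with probability $p$) and is not necessarily identical to Spark's sampler; `bernoulliSample`, `estimates`, and the seed `42` are illustrative choices.

```scala
import scala.util.Random

val n = 1000000
// The pattern 1,2,3,4 repeated, as in B above; its true mean is 2.5.
val data: Vector[Int] = Vector.tabulate(n)(i => i % 4 + 1)

// Accept each element independently with probability p.
def bernoulliSample(xs: Vector[Int], p: Double, rng: Random): Vector[Int] =
  xs.filter(_ => rng.nextDouble() < p)

val rng = new Random(42) // fixed seed so that the run is repeatable

// For each p, record (p, sample size, estimated mean).
val estimates = List(1e-4, 1e-3, 1e-2).map { p =>
  val sample = bernoulliSample(data, p, rng)
  (p, sample.size, sample.sum.toDouble / sample.size)
}

estimates.foreach { case (p, size, mean) =>
  println(s"p = $p, sample size = $size, estimated mean = $mean")
}
```

Comparing the estimated means with the true value 2.5 for each `p` gives a feel for how the error shrinks as the expected sample size grows.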

# Filtering an RDD
The method `RDD.filter(func)` returns a new dataset formed by selecting those elements of the source on which `func` returns true.


In [122]:
val n_elements:Long=B.filter(n=> n>3).count()
println(f"The number of elements in B that are > 3 =${n_elements}")

The number of elements in B that are > 3 =250000


n_elements: Long = 250000


# Removing duplicate elements from an RDD
The method `RDD.distinct()` returns a new dataset that contains the distinct elements of the source dataset.

This operation requires a **shuffle** in order to detect duplication across partitions.

In [133]:
// Removing the duplicate elements of duplicateRDD yields an RDD of distinct elements
val duplicateRDD = sc.parallelize(List(1,1,2,2,3,3))
println(f"duplicateRDD = [${duplicateRDD.collect().mkString(", ")}]")
println(f"distinctRDD = [${duplicateRDD.distinct().collect().mkString(", ")}]")

duplicateRDD = [1, 1, 2, 2, 3, 3]
distinctRDD = [1, 2, 3]


duplicateRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:28
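
Why a shuffle is needed can be sketched locally. In this simulation (plain Scala, with hypothetical partition contents and bucket count), copies of the same value start in different partitions; routing every value to a bucket chosen by its hash guarantees that all copies meet in one place, where duplicates can be dropped. This illustrates the idea only and is not Spark's actual implementation.

```scala
// Hypothetical layout: the value 2 appears in both partitions.
val partitions: List[List[Int]] = List(List(1, 1, 2), List(2, 3, 3))
val numBuckets = 2

// "Shuffle": send each value to the bucket given by its hash.
val buckets: Map[Int, List[Int]] =
  partitions.flatten.groupBy(x => math.abs(x.hashCode) % numBuckets)

// All copies of a value are now in one bucket, so a local distinct suffices.
val distinctValues: List[Int] = buckets.values.flatMap(_.distinct).toList.sorted

println(distinctValues) // prints List(1, 2, 3)
```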


# Flat-mapping an RDD
The method `RDD.flatMap(func)` is similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

In [150]:
val text = List("you are my sunshine", "my only sunshine")
val text_file = sc.parallelize(text)

text: List[String] = List(you are my sunshine, my only sunshine)
text_file: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[81] at parallelize at <console>:29


In [151]:
// Map
text_file.map(line=>line.split(" ")).collect()

res115: Array[Array[String]] = Array(Array(you, are, my, sunshine), Array(my, only, sunshine))


In [152]:
// Flatmap
text_file.flatMap(line=>line.split(" ")).collect()

res116: Array[String] = Array(you, are, my, sunshine, my, only, sunshine)
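
The "0 or more" part of the definition matters when some inputs produce nothing. A minimal sketch with a plain Scala `List` standing in for the RDD (the empty line in `lines` and the helper `wordsOf` are made up for illustration):

```scala
val lines = List("you are my sunshine", "", "my only sunshine")

// Split a line into words, dropping empty strings; an empty line yields no words.
def wordsOf(line: String): List[String] = line.split(" ").filter(_.nonEmpty).toList

val mapped: List[List[String]] = lines.map(wordsOf)  // one list per line, possibly empty
val flattened: List[String] = lines.flatMap(wordsOf) // all words in a single list

println(mapped)    // prints List(List(you, are, my, sunshine), List(), List(my, only, sunshine))
println(flattened) // prints List(you, are, my, sunshine, my, only, sunshine)
```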


# Set operations
In this part, we explore the set operations **union**, **intersection**, **subtract**, and **cartesian** in Spark.

In [163]:
val rdd1 = sc.parallelize(List[Any](1,1,2,3))
val rdd2 = sc.parallelize(List[Any](1,3,4,5))

rdd1: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[94] at parallelize at <console>:25
rdd2: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[95] at parallelize at <console>:26


## 1. `union(other)`
 * Return the union of this RDD and another one.
 * Note that repetitions are allowed. The RDDs are **bags**, not **sets**.
 * To make the result a set, use `.distinct`

In [165]:
val rdd2 = sc.parallelize(List[Any]("a","b", 1))

println(f"rdd1 = [${rdd1.collect().mkString(", ")}]")
println(f"rdd2 = [${rdd2.collect().mkString(", ")}]")

println(f"Union as bags = ${rdd1.union(rdd2).collect().mkString(", ")}")
println(f"Union as sets = ${rdd1.union(rdd2).distinct().collect().mkString(", ")}")

rdd1 = [1, 1, 2, 3]
rdd2 = [a, b, 1]
Union as bags = 1, 1, 2, 3, a, b, 1
Union as sets = a, 1, b, 2, 3


rdd2: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[102] at parallelize at <console>:29


## 2. `intersection(other)`
 * Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.

In [166]:
val rdd2 = sc.parallelize(List[Any](1,1,2,5))

println(f"rdd1 = [${rdd1.collect().mkString(", ")}]")
println(f"rdd2 = [${rdd2.collect().mkString(", ")}]")

println(f"Intersection = [${rdd1.intersection(rdd2).collect().mkString(", ")}]")

rdd1 = [1, 1, 2, 3]
rdd2 = [1, 1, 2, 5]
Intersection = [1, 2]


rdd2: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[108] at parallelize at <console>:29


## 3. `subtract(other)`
 * Return each value in this RDD that is not contained in `other`.

In [168]:
println(f"rdd1 = [${rdd1.collect().mkString(", ")}]")
println(f"rdd2 = [${rdd2.collect().mkString(", ")}]")

println(f"Subtract = [${rdd1.subtract(rdd2).collect().mkString(", ")}]")

rdd1 = [1, 1, 2, 3]
rdd2 = [1, 1, 2, 5]
Subtract = [3]


## 4. `cartesian(other)`
 * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where **a** is in this RDD and **b** is in **other**.

In [170]:
val rdd1 = sc.parallelize(List[Any](1,1,2))
val rdd2 = sc.parallelize(List[Any]("a","b"))

println(f"rdd1 = [${rdd1.collect().mkString(", ")}]")
println(f"rdd2 = [${rdd2.collect().mkString(", ")}]")
println(f"Cartesian = [${rdd1.cartesian(rdd2).collect().mkString(", ")}]")

rdd1 = [1, 1, 2]
rdd2 = [a, b]
Cartesian = [(1,a), (1,b), (1,a), (1,b), (2,a), (2,b)]


rdd1: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[123] at parallelize at <console>:29
rdd2: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[124] at parallelize at <console>:30
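
For intuition, the four operations behave like the following plain Scala expressions on local lists. This is a sketch of the semantics shown in the outputs above, not of how Spark computes them; `Int` and `String` lists are used instead of `List[Any]` for simplicity, and the value names are illustrative. The element order of a real RDD result may differ.

```scala
val a = List(1, 1, 2, 3)
val b = List(1, 1, 2, 5)

val unionAsBag   = a ++ b                                // duplicates kept
val unionAsSet   = (a ++ b).distinct                     // duplicates removed
val intersection = a.filter(x => b.contains(x)).distinct // common values, no duplicates
val difference   = a.filterNot(x => b.contains(x))       // values of a that do not occur in b

// Cartesian product: every pair (x, y) with x from the first list and y from the second.
val cartesian = for (x <- List(1, 1, 2); y <- List("a", "b")) yield (x, y)

println(unionAsBag)   // prints List(1, 1, 2, 3, 1, 1, 2, 5)
println(unionAsSet)   // prints List(1, 2, 3, 5)
println(intersection) // prints List(1, 2)
println(difference)   // prints List(3)
println(cartesian)    // prints List((1,a), (1,b), (1,a), (1,b), (2,a), (2,b))
```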


# Summary
* Chaining: creating a pipeline of RDD operations.
* Counting, taking, and sampling an RDD
* More transformations: `filter, distinct, flatMap`
* Set transformations: `union, intersection, subtract, cartesian`