### Programming With RDDs

In this notebook we will be experimenting with RDDs. 

RDD stands for Resilient Distributed Dataset. RDDs form the core abstraction of Spark and allow us to perform distributed operations on data. Every transformation gives us a new RDD, and we invoke an action to trigger evaluation of the transformations on the data.

Let us first create an RDD from an online page and filter some text from its lines.

In [1]:
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/apache/spark/master/README.md")
val strContent = content.mkString
val lines = strContent.split("\n")
val linesRDD = sc.parallelize(lines)
println("Class of linesRDD is " + linesRDD.getClass)
val filtered = linesRDD.filter(_.toLowerCase.contains("scala"))
println("Class of filtered is " + filtered.getClass)

Class of linesRDD is class org.apache.spark.rdd.ParallelCollectionRDD
Class of filtered is class org.apache.spark.rdd.MapPartitionsRDD


As we see above, ``filtered`` isn't the filtered values but another RDD. The ``filter`` operation is a transformation, and transformations are lazy: invoked on an RDD, they return a new RDD that keeps track of all the transformations applied so far. We will now invoke an action to execute these transformations.

In [2]:
filtered.first()

high-level APIs in Scala, Java, Python, and R, and an optimized engine that

Invoking ``first`` on the filtered RDD gives us the first line in the RDD that contains the word "scala". Evaluating transformations lazily and deferring them until we perform an action (the method ``first`` in this case) gives the framework a chance to optimize the execution. In this case, since the action is ``first``, Spark can stop as soon as the first match is found. Transformations like ``filter`` may also reduce the amount of data passed to the next transformation or action, which lets Spark load data from the underlying source more efficiently.
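
One way to observe this laziness is to count how often the filter predicate actually runs. The following is a minimal sketch, assuming the notebook's SparkContext ``sc``; the names ``calls``, ``numbers`` and ``evens`` are made up for this illustration, and the accumulator serves only as a counter.

```scala
// Assumes `sc` is the SparkContext provided by the notebook.
// `calls`, `numbers` and `evens` are hypothetical names for this sketch.
val calls = sc.longAccumulator("predicate calls")
val numbers = sc.parallelize(1 to 100)

// Defining the transformation does not run the predicate.
val evens = numbers.filter { n => calls.add(1L); n % 2 == 0 }
val callsBeforeAction = calls.sum   // still 0: nothing has been evaluated yet

// The action triggers evaluation of the recorded transformation.
val evenCount = evens.count()
val callsAfterAction = calls.sum    // now greater than 0

println(s"calls before action: $callsBeforeAction, after action: $callsAfterAction, evens: $evenCount")
```

The counter stays at zero until ``count`` is invoked, which is the behaviour the paragraph above describes.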

It is important to note that each time an action is performed on an RDD, all of its transformations are applied again. For our trivial case that is probably fine. But when loading the data from a data source and transforming it is time consuming, caching the RDD might be a good idea. Caching is not done by default because it is wasteful if the RDD is not needed again. However, caching frequently used RDDs gives better performance. Caching can be done in memory or on disk, using the ``cache`` or ``persist`` method (we will see them in use later).

As we see below, ``cache`` (or ``persist``) is not an action; like a transformation, it is lazy. We simply express our interest in caching the contents, and the caching happens the first time an action triggers evaluation of the transformations. Once the value of the RDD (or a part of it) has been cached by that first evaluation, subsequent actions no longer need to recompute it.

In [3]:
val cachedRDD = filtered.cache
println("Class of cachedRDD is " + cachedRDD.getClass)

Class of cachedRDD is class org.apache.spark.rdd.MapPartitionsRDD
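
For RDDs, ``cache`` is shorthand for ``persist`` with the default memory-only storage level, while ``persist`` also accepts an explicit level. The sketch below assumes the notebook's ``sc`` and uses a fresh RDD with the made-up name ``evenNumbers``, since the storage level of an RDD that is already marked for caching cannot be changed.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes `sc` is the SparkContext provided by the notebook.
// `evenNumbers` is a hypothetical RDD created only for this sketch.
val evenNumbers = sc.parallelize(1 to 1000).filter(_ % 2 == 0)

// Mark the RDD for persistence; nothing is computed or stored yet.
evenNumbers.persist(StorageLevel.MEMORY_AND_DISK)
val level = evenNumbers.getStorageLevel

val firstCount = evenNumbers.count()    // the first action computes and stores the partitions
val secondCount = evenNumbers.count()   // later actions can reuse the stored partitions

println("Storage level is " + level + ", count is " + secondCount)

// Release the stored data once it is no longer needed.
evenNumbers.unpersist()
```

Whether spilling to disk is worthwhile depends on how expensive the RDD is to recompute compared with reading it back.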


The following code counts the number of words in the README.md file we read from the source. Note that we treat a single space as the delimiter in this example.

In [4]:
linesRDD.map(_.split(" ") match {
    case Array("") => 0
    case x  => x.length
}).reduce(_ + _)

527
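
The ``Array("")`` case is needed because splitting an empty line still yields one (empty) string, which would otherwise be counted as a word. Here is a minimal sketch of the same logic on a tiny hand-made input, assuming the notebook's ``sc``; the names ``tinyLines`` and ``tinyWordCount`` are made up for this illustration.

```scala
// Assumes `sc` is the SparkContext provided by the notebook.
// `tinyLines` and `tinyWordCount` are hypothetical names for this sketch.
val tinyLines = sc.parallelize(List("to be or", "", "not to be"))

val tinyWordCount = tinyLines.map(_.split(" ") match {
    case Array("") => 0            // "".split(" ") is Array(""), so an empty line counts as zero words
    case words     => words.length // otherwise count the space-separated tokens
}).reduce(_ + _)

println("Word count is " + tinyWordCount)   // prints: Word count is 6
```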

---

We will now look at some of the most common transformations and actions.

**map** : This transformation takes a function that accepts a value and returns a new value. The type of the returned value need not be the same as the type of the input. Applying ``map`` to an RDD applies the function to every element in the RDD. Note that ``map`` gives us the same number of elements as the input. For example, to apply a squaring function to all elements of an RDD, we would do the following.

In [5]:
val inputRDD = sc.parallelize(List(1, 2, 3, 4, 5))
val squared = inputRDD.map(x => x * x)

print("Squared RDD has " + squared.collect.toList)

Squared RDD has List(1, 4, 9, 16, 25)

In the above example we invoked **``collect``** on the ``squared`` RDD. This is an action which gets all the elements of the RDD into an array in memory. Before invoking such an action, we need to ensure that the RDD is small enough to fit in memory, or that we have reduced it enough to fit. The ``map`` operation doesn't reduce the number of elements, so we may need another transformation such as ``filter``, which we will see next.

The following ``filter`` operation keeps only the numbers that are multiples of 2.

In [6]:
val multOfTwo = inputRDD.filter(_ % 2 == 0)
print("Multiple of two are " + multOfTwo.take(3).toList)

Multiple of two are List(2, 4)


We demonstrated another action, ``take``, along with the ``filter`` transformation above. The action ``take`` accepts a numeric parameter, the maximum number of values to take from the RDD, in this case 3. This is better than ``collect`` in that we control the number of elements brought into the driver program's memory. The driver program is the one which initializes the RDD, performs transformations on it and then performs some action. In this case, this notebook is the driver program.

Yet another transformation is ``union``, which, as the name suggests, merges the contents of two RDDs. Let us now merge the ``squared`` and ``multOfTwo`` RDDs into a single RDD.

In [7]:
val unionRDD = multOfTwo.union(squared)
println("unionRDD contains " + unionRDD.collect.toList)

unionRDD contains List(2, 4, 1, 4, 9, 16, 25)



Note that the ``union`` transformation doesn't remove duplicates across the RDDs. It simply concatenates the contents of the two RDDs, as we see above.

We will now look at another type of transformation, ``flatMap``. For this example, we will use a different input RDD.

In [8]:
val pandaRDD = sc.parallelize(List("coffee panda", "happy panda", "happiest panda party"))
val mapped = pandaRDD.map(_.split(" ").toList)
val flatMapped = pandaRDD.flatMap(_.split(" "))
println(mapped.collect.toList)
println(flatMapped.collect.toList)

List(List(coffee, panda), List(happy, panda), List(happiest, panda, party))
List(coffee, panda, happy, panda, happiest, panda, party)



As we see above, the ``map`` transformation gives us a list of lists. The ``flatMap`` transformation, however, flattens these lists into a single list of strings.
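
The difference also shows up in the element counts: ``map`` keeps one element per input string, whereas ``flatMap`` yields one element per word. This is a small sketch, assuming the notebook's ``sc``; ``phrases``, ``perPhrase`` and ``perWord`` are names made up for this illustration.

```scala
// Assumes `sc` is the SparkContext provided by the notebook.
// `phrases`, `perPhrase` and `perWord` are hypothetical names for this sketch.
val phrases = sc.parallelize(List("coffee panda", "happy panda", "happiest panda party"))

// map: one output element (an array of words) per input string.
val perPhrase = phrases.map(_.split(" ")).count()

// flatMap: the arrays are flattened, so there is one output element per word.
val perWord = phrases.flatMap(_.split(" ")).count()

println("map gives " + perPhrase + " elements, flatMap gives " + perWord)   // 3 and 7
```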

The following are a few set operations that can be done on RDDs. We have already seen one of them, ``union``, and how its result contains duplicates. We will now see a few more, namely ``distinct``, ``intersection`` and ``subtract``. Let's find the unique elements in ``unionRDD`` using the ``distinct`` transformation.

In [9]:
val distinctRDD = unionRDD.distinct
println("Distinct elements are " + distinctRDD.collect.toList)

Distinct elements are List(4, 16, 25, 1, 9, 2)



Note that there are no guarantees on the ordering of the elements; we simply get the unique elements. We also see that there is one common element, 4, between the ``squared`` and ``multOfTwo`` RDDs. Let's find the intersection (we know the expected answer) and also remove from ``squared`` all the elements that also appear in ``multOfTwo``. The order of the operands to ``subtract`` matters, as this operation is not commutative. Also, the elements in the returned result are not necessarily in the same order as in the original RDD from which elements are subtracted.

In [10]:
val common = squared.intersection(multOfTwo)
println("Common elements are " + common.collect.toList)
val subtracted = squared subtract multOfTwo
println("After subtraction,  elements are " + subtracted.collect.toList)

Common elements are List(4)
After subtraction,  elements are List(16, 1, 9, 25)
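
To make the point about operand order concrete, the sketch below applies ``subtract`` in both directions. It assumes the notebook's ``sc`` and rebuilds the two inputs under the made-up names ``squaresRDD`` and ``evensRDD`` so that it stands alone; the results are sorted only to make the printed output predictable.

```scala
// Assumes `sc` is the SparkContext provided by the notebook.
// `squaresRDD` and `evensRDD` are hypothetical names for this sketch.
val squaresRDD = sc.parallelize(List(1, 4, 9, 16, 25))
val evensRDD = sc.parallelize(List(2, 4))

// Elements of squaresRDD that do not appear in evensRDD.
val squaresMinusEvens = squaresRDD.subtract(evensRDD).collect.toList.sorted

// Elements of evensRDD that do not appear in squaresRDD.
val evensMinusSquares = evensRDD.subtract(squaresRDD).collect.toList.sorted

println("squares - evens = " + squaresMinusEvens)   // List(1, 9, 16, 25)
println("evens - squares = " + evensMinusSquares)   // List(2)
```

Note that 16 survives the first subtraction even though it is even, because ``subtract`` removes only the elements actually present in the other RDD.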



Another interesting transformation is ``cartesian``. Given two RDDs, it generates their cross product. The following example demonstrates it. Note that this is not something we should do on large RDDs.

In [11]:
val first = sc.parallelize(List(1, 2, 3, 4))
val second = sc.parallelize(List('a', 'b', 'c', 'd', 'e'))
val cross = first cartesian second
println("cross contains " + cross.collect.toList)
println("length of cross is " + cross.collect.length)

cross contains List((1,a), (1,b), (2,a), (2,b), (1,c), (1,d), (1,e), (2,c), (2,d), (2,e), (3,a), (3,b), (4,a), (4,b), (3,c), (3,d), (3,e), (4,c), (4,d), (4,e))
length of cross is 20



The ``cross`` RDD contains tuples covering all possible combinations of elements from the two RDDs. The length of the result is the product of the lengths of the individual RDDs, in this case $4 \times 5 = 20$.

There is also a transformation which lets us randomly choose some elements from an RDD. The method is called ``sample``.

In [12]:
println((1 to 10).map( _ => "Sample with replacement gives " + first.sample(true, 0.5).collect.toList).mkString("\n"))
println("\n\n")
println((1 to 10).map( _ => "Sample without replacement gives " + first.sample(false, 0.5).collect.toList).mkString("\n"))

Sample with replacement gives List(1, 1)
Sample with replacement gives List(1, 1, 3, 4)
Sample with replacement gives List(1, 3, 4)
Sample with replacement gives List(2, 2, 2)
Sample with replacement gives List()
Sample with replacement gives List(1, 2)
Sample with replacement gives List(3, 4)
Sample with replacement gives List(1, 3, 4)
Sample with replacement gives List(1, 2, 2)
Sample with replacement gives List(1, 2, 3)
Sample without replacement gives List(3, 4)
Sample without replacement gives List(2, 4)
Sample without replacement gives List(3, 4)
Sample without replacement gives List()
Sample without replacement gives List(1, 2, 3)
Sample without replacement gives List(1, 2, 3)
Sample without replacement gives List(1, 2, 4)
Sample without replacement gives List(1, 3)
Sample without replacement gives List(1, 2, 3, 4)
Sample without replacement gives List(1, 3)



Sampling the contents of an RDD is a non-deterministic operation unless a seed is supplied. The signature of the ``sample`` method is ``sample(withReplacement, fraction, [seed])``:

- ``withReplacement`` determines whether an element of the RDD can be chosen only once or multiple times.
- ``fraction`` is the expected size of the sample as a fraction of the total size of the RDD. Note that it is not a hard bound that guarantees an exact number of selected elements.
- ``seed`` is the seed for the random number generator.

As we see above, when ``withReplacement`` is false, no element is sampled more than once.
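
The effect of the optional ``seed`` can be sketched as follows: sampling the same RDD twice with the same seed should give the same elements. The sketch assumes the notebook's ``sc``; the RDD ``hundred`` and the seed value ``42L`` are arbitrary choices made for this illustration.

```scala
// Assumes `sc` is the SparkContext provided by the notebook.
// `hundred`, `sampleA` and `sampleB` are hypothetical names; 42L is an arbitrary seed.
val hundred = sc.parallelize(1 to 100)

// Sample without replacement, with an expected fraction of 10% and a fixed seed.
val sampleA = hundred.sample(false, 0.1, 42L).collect.toList
val sampleB = hundred.sample(false, 0.1, 42L).collect.toList

println("Same seed gives the same sample: " + (sampleA == sampleB))
println("Sample size is " + sampleA.length + ", close to but not necessarily exactly 10")
```

Fixing the seed is mainly useful for reproducible experiments and tests; omitting it gives a different sample on each run.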

---

**Actions**

We will look at a few common actions on RDD.

We have already seen ``collect`` and ``take``: ``collect`` brings the entire RDD into memory, and ``take`` selects up to the given number of elements from the RDD.
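
As a small recap, the sketch below runs the actions that have appeared so far in this notebook (``collect``, ``take``, ``first`` and ``reduce``) together with ``count`` on one RDD. It assumes the notebook's ``sc``, the name ``oneToTen`` is made up for this illustration, and it is not meant as a complete list of actions.

```scala
// Assumes `sc` is the SparkContext provided by the notebook.
// `oneToTen` is a hypothetical RDD created only for this sketch.
val oneToTen = sc.parallelize(1 to 10)

val everything = oneToTen.collect.toList   // all 10 elements, brought into the driver
val firstThree = oneToTen.take(3).toList   // at most 3 elements: List(1, 2, 3)
val head = oneToTen.first()                // the first element: 1
val size = oneToTen.count()                // number of elements: 10
val total = oneToTen.reduce(_ + _)         // sum of the elements: 55

println(s"collect: $everything, take(3): $firstThree, first: $head, count: $size, reduce: $total")
```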

TODO: Finish actions
