### Programming With RDDs

In this notebook we will be experimenting with RDDs. 

RDD stands for Resilient Distributed Dataset. RDDs are the core abstraction of Spark and let us perform distributed operations on data. Every transformation returns a new RDD, and we invoke an action to trigger evaluation of the accumulated transformations on the data.

Let us first create an RDD from an online page and filter some of its lines.

In [6]:
val content = scala.io.Source.fromURL("https://raw.githubusercontent.com/apache/spark/master/README.md")
val strContent = content.mkString
val lines = strContent.split("\n")
val linesRDD = sc.parallelize(lines)
println("Class of linesRDD is " + linesRDD.getClass)
val filtered = linesRDD.filter(_.toLowerCase.contains("scala"))
println("Class of filtered is " + filtered.getClass)

Class of linesRDD is class org.apache.spark.rdd.ParallelCollectionRDD
Class of filtered is class org.apache.spark.rdd.MapPartitionsRDD


As we see above, ``filtered`` does not hold the filtered values; it is another RDD. The ``filter`` operation is a transformation, and transformations are lazy: invoked on an RDD, they return a new RDD that keeps track of the transformations applied so far. We will now invoke an action to execute these transformations.

In [8]:
filtered.first()

high-level APIs in Scala, Java, Python, and R, and an optimized engine that

Invoking ``first`` on the filtered RDD gives us the first line that contains the word "scala". Evaluating transformations lazily and deferring them until we perform an action (the method ``first`` in this case) gives the framework a chance to optimize the execution. In this case, since the action is ``first``, Spark can stop execution as soon as the first match is found. Transformations like ``filter`` may reduce the amount of data passed to the next transformation or action, so Spark can load data from the underlying source more efficiently.
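The effect of laziness can be sketched without a cluster. The snippet below is an analogy in plain Scala, not Spark's implementation: a Scala ``Iterator`` is also lazy, so its ``filter`` does no work until an element is requested. The sample lines and the names ``sampleLines``, ``inspected`` and ``matches`` are made up for illustration.

```scala
// Plain-Scala analogy (no Spark needed): Iterator.filter is lazy,
// much like an RDD transformation.
val sampleLines = Seq(
  "# Apache Spark",
  "high-level APIs in Scala, Java, Python, and R",
  "## Building Spark",
  "Try the Scala shell"
)

// Counts how many lines the predicate has actually looked at.
var inspected = 0

// Declaring the filter does not evaluate the predicate.
val matches = sampleLines.iterator.filter { line =>
  inspected += 1
  line.toLowerCase.contains("scala")
}
val inspectedBeforeAction = inspected // still 0: nothing has run yet

// Asking for the first element plays the role of the `first` action:
// evaluation stops at the first matching line.
val firstMatch = matches.next()
val inspectedAfterAction = inspected // 2: the remaining lines were never read

println(s"before: $inspectedBeforeAction, after: $inspectedAfterAction, match: $firstMatch")
```

Only two of the four lines are inspected, which mirrors why ``first`` on a filtered RDD does not need to scan the whole dataset.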

It is important to note that each time an action is performed on an RDD, all of its transformations are applied again. For our trivial case that is probably fine. But when loading data from a data source and transforming it is time-consuming, caching the RDD might be a good idea. Caching is not done by default because it is wasteful if the RDD is not needed again. However, caching frequently used RDDs gives better performance. Caching can be done in memory or on disk, using the ``cache`` or ``persist`` method (we will see them in use later).

As we see below, ``cache`` (or ``persist``) is not an action. Like a transformation, it is lazy: it returns the same RDD, marked for caching. We simply express our interest in caching the contents; they are stored the first time an action triggers evaluation of the transformations. Once the RDD (or a part of it) is cached, subsequent actions no longer need to recompute it.

In [14]:
val cachedRDD = filtered.cache()
println("Class of cachedRDD is " + cachedRDD.getClass)

Class of cachedRDD is class org.apache.spark.rdd.MapPartitionsRDD
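For RDDs, ``cache`` is shorthand for ``persist`` with the default ``MEMORY_ONLY`` storage level. To keep partitions that do not fit in memory on disk instead, ``persist`` accepts an explicit level. The following is a minimal sketch, assuming Spark is on the classpath; the sample data, the application name, the local master and the names ``ctx``, ``sample`` and ``scalaLines`` are illustrative. In a notebook, ``getOrCreate`` should simply return the existing context.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the running context if there is one; otherwise starts a local one.
// The master and app name below are assumptions for a standalone run.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("persist-sketch"))

// Hypothetical sample data standing in for the README lines.
val sample = ctx.parallelize(Seq(
  "Spark is built using Scala",
  "There is also a Python API",
  "Try the Scala shell"
))

// persist only marks the RDD; nothing is computed or stored yet.
val scalaLines = sample
  .filter(_.toLowerCase.contains("scala"))
  .persist(StorageLevel.MEMORY_AND_DISK)

// The first action evaluates the filter and stores the resulting partitions.
val firstCount = scalaLines.count()
// Later actions can read the stored partitions instead of re-filtering.
val secondCount = scalaLines.count()

println(s"counts: $firstCount, $secondCount")
println("storage level: " + scalaLines.getStorageLevel.description)
```

A new RDD is used here because the storage level of an RDD cannot be changed once it has been assigned. Calling ``unpersist()`` releases the stored partitions when they are no longer needed.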


The following code counts the number of words in the README.md file we read from the source. Note that we treat a single space as the delimiter in this example.

In [36]:
linesRDD.map(_.split(" ") match {
    case Array("") => 0
    case x  => x.length
}).reduce(_ + _)

527
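The ``Array("")`` case above is needed because splitting an empty string yields an array holding one empty string, which would otherwise count as a word. Splitting on a single space also counts the empty tokens produced by consecutive spaces. The sketch below shows both effects in plain Scala on made-up lines and compares them with a whitespace-based variant; ``countBySpace`` and ``countByWhitespace`` are hypothetical helper names, and the second is an alternative, not what the cell above does.

```scala
// Made-up lines: a normal line, an empty line, and a line with a double space.
val sampleLines = Seq("Spark is fast", "", "two  spaces here")

// Same logic as the notebook cell: split on a single space,
// treating an empty line as zero words.
def countBySpace(line: String): Int = line.split(" ") match {
  case Array("") => 0
  case tokens    => tokens.length
}

// Alternative: split on runs of whitespace and drop empty tokens.
def countByWhitespace(line: String): Int =
  line.split("\\s+").count(_.nonEmpty)

val bySpace = sampleLines.map(countBySpace).sum           // 3 + 0 + 4 = 7
val byWhitespace = sampleLines.map(countByWhitespace).sum // 3 + 0 + 3 = 6

println(s"single-space split: $bySpace, whitespace split: $byWhitespace")
```

Either function could be passed to ``linesRDD.map(...)`` before the ``reduce``; which one is appropriate depends on whether empty tokens should count as words.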