<img src="images/intro-logo-scala-eng.png" align="left" width=600px/>
<!--- ![alt text](heading.png "Heading with Scala logo") --->

---

# Index

### [4. There's nothing left but... fly!](#seccion-4. Theres nothing left but... fly!)


---

<!---
# 4. There's nothing left but... fly! --->
<a name="seccion-4. Theres nothing left but... fly!"></a>
<a name="seccion-4. There's nothing left but... fly!"></a>
<table align="left" style="border-collapse: collapse; width: 100%; border: 5px double black">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 100px;">
<img src="icons/volar-m.png" align="left" width=85px/>
        </td>
        <td style="border:none !important; text-align:left;">
<h1>4. There's nothing left but... fly!</h1>
<br>

Having studied the syntax and the main features of Scala, there is nothing left but to put this knowledge into practice to solve more advanced and interesting tasks.
        </td>
    </tr>
</table>

---

<a name="subseccion-Introduccion a Spark"></a>
## Introduction to Spark

<!--<img src="images/intro_spark.png" width="60%" align="left">-->

In this notebook we present the basics of Spark with Scala.

#### Up to now:
Data science and analytics have been done “in the small”, in R/Python/MATLAB, etc.


#### Nowadays:
Datasets do not fit into memory anymore, so...

* These languages/frameworks won’t allow you to scale. 

* You have to reimplement everything in some other language or system.



#### Moreover:

* Industry is shifting towards Business Intelligence based on data-oriented decision making, relying on huge datasets.

* Spark's API is almost 1-to-1 with Scala's collections, but distributed!

### Spark + Scala


* More expressive. APIs modeled after Scala collections. Looks like functional lists! 


* Richer, more composable operations possible than in MapReduce (Hadoop).


* Performant: in terms of running time... AND also in terms of developer productivity! 


* Good for data science. Not just because of performance, but because it enables (efficient) iteration, required by most algorithms in a data scientist’s toolbox.


* High demand for Spark and Scala developers and data scientists!


### Spark vs Hadoop




* Hadoop is an open source implementation of Google's MapReduce.


* Simple API for map and reduce operations on distributed datasets.


* Fault tolerance: between each map and reduce operation, it writes intermediate data to disk so that it can recover from failures.


* Spark's fault tolerance is far more efficient because:
    - It keeps all data immutable and in memory
    - Operations are functional transformations
    - To recover from a failure, it re-applies the transformations to the original data
    
    
* Spark is compatible with HDFS (Hadoop Distributed FileSystem)
<br><br>



<a name="subseccion-Key concepts in Spark"></a>
## Key concepts in Spark

* Spark Session: connection to Spark's API


* Hardware Structure: 
    - Cluster of driver + workers
    - Workflow: shuffling


* Logical Data Structures:
    - RDDs
    - PairRDDs
    
    
* Basic Operations:
    - Transformations
    - Actions
    
    
* Interesting Libraries:
    - Spark SQL: DataFrames and Datasets
    - Spark Streaming API
    - MLlib
    - GraphX
    - ...

### Spark Session

Connection to the Spark cluster. 

Usually we "talk to" the master node of the cluster, and it sends the jobs to the worker nodes.

SparkSession is the object that we will use to perform the configuration and input operations against the cluster.

#### Configuration


In [2]:
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

val spark = SparkSession
      .builder()
      .appName("Spark basic example")     // Name for the session
      .master("local[2]")                 // Master URL: run locally with 2 worker threads
      .getOrCreate()

// Optional: set the logging level if log4j is not configured
Logger.getRootLogger.setLevel(Level.ERROR)

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/notepad.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left;">
            <ul>
                <li>Versions of Spark before 2.0 use <strong>SparkContext</strong> instead of <strong>SparkSession</strong></li>
                <li><strong>SparkContext</strong> is still in use, but it is transparent to the developer</li>
                <li>It can be accessed through the <strong>SparkSession</strong>: <strong>spark.sparkContext</strong></li>
                <li>In this notebook, we already have both a <strong>SparkSession</strong> and its corresponding <strong>SparkContext</strong> in the immutable variables <strong>spark</strong> and <strong>sc</strong>, respectively, which will be used throughout the notebook</li>
            </ul>
        </td>
    </tr>
</table>

### Spark Hello World!


In [1]:
sc.parallelize(1 to 100).reduce(_ + _)

5050

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/notepad.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important;text-align:left;">
            <ul>
                <li>We use <strong>sc</strong> to build a parallel collection in the Spark cluster.</li>
                <li><strong>parallelize</strong> is a function that transforms a collection into its corresponding parallel version.</li>
                <li><strong>(1 to 100)</strong> is the definition of a range collection (a collection formed by the values within the given range)</li>
                <li><strong>reduce</strong> has the same meaning as in Scala's Collection API</li>
            </ul>
        </td>
    </tr>
</table>


### Similarities between Spark and Scala



<br>
The Spark API has an almost 1-to-1 correspondence with Scala's collections API. Let's see an example:
<br>

In [2]:
val lista = List("Juan", "María", "Pedro", "Elisa")            // We build a Scala List[String]

val paresLista = lista.map(nombre => (nombre, nombre.length))  // Associate the length of each string

paresLista.sortBy(-_._2).foreach(t => println(t._1 + " => " + t._2))  // Print it out

paresLista.map(_._2).reduce(_ + _)                             // Sum up the lengths of the strings

María => 5
Pedro => 5
Elisa => 5
Juan => 4


19

The equivalent in Spark, in a distributed fashion, could look like this:

In [3]:
val parlista = sc.parallelize[String](lista)                      // Create the equivalent: ParallelCollectionRDD[String]

val pares = parlista.map(nombre => (nombre, nombre.length))       // pares: MapPartitionsRDD[(String, Int)]

pares.sortBy(-_._2).collect.foreach(t => println(t._1 + " => " + t._2))

pares.map(_._2).reduce(_ + _)                 

María => 5
Pedro => 5
Elisa => 5
Juan => 4


19

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/notepad.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important;text-align:left;">
            <ul>
                <li>1-to-1 correspondence between Scala and Spark</li>
                <li><strong>collect</strong> brings the blocks of data distributed over the worker nodes to the master node, so that the whole RDD can be processed there</li>
                <li>Spark evaluates some functions lazily (map, sortBy) and others eagerly (reduce, collect). We will study this in depth below.</li>
            </ul>
        </td>
    </tr>
</table>

<a name="subseccion-Hardware Structure"></a>
## Hardware Structure in Spark

<img src="images/spark_structure.png" width="80%"/>

### Workflow

* The master node distributes the data in blocks over the worker nodes, sends the tasks and integrates the results of the computation.


* Worker nodes receive the data chunks and the tasks, and perform the transformations and actions on their blocks of data.


* Every time our process requires the whole dataset to perform an action, the master node retrieves the data blocks from the workers and reconstructs the data in memory.

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/warning.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important;text-align:left;">
            <ul>
                <li>Moving data between nodes across the network is called a <em>shuffle</em>, and it is <strong>really expensive</strong></li>
                <li>We must minimize the number of times that a <em>shuffle</em> is required by our application.</li>
                <li>But take it easy for now: we need to learn more Spark concepts before studying this issue thoroughly.</li>
            </ul>
        </td>
    </tr>
</table>

<a name="subseccion-Logical Data Structures RDDs"></a>
## Logical Data Structures: RDDs


* Resilient Distributed Dataset


* Parallel collections for distributed computation in a functional programming style


* A collection of (typed) data, easily distributable over the worker nodes, so that each node takes charge of a chunk of the whole dataset to be processed.


* An RDD is a logical reference to a dataset that is partitioned across many server machines in the Spark cluster.


* RDDs are partitioned and distributed over the workers in the Spark cluster automatically (without programmer intervention). See the previous section on the hardware structure of a Spark cluster.


* The partitioning scheme can be changed, but by default Spark tries to minimize the network traffic among nodes when processing the RDDs. Example: in a local environment, there is usually one partition per CPU core available to Spark.
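
As a minimal sketch of how the partitioning of an RDD can be inspected and changed, the cell below builds a small in-memory RDD. The local session, the variable names and the partition counts are assumptions made for illustration; in this notebook the predefined `sc` could be used instead.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session with 2 worker threads; getOrCreate reuses an existing session if there is one
val spark = SparkSession.builder().appName("partitions-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Ask for 4 partitions explicitly instead of relying on the default
val numbers = sc.parallelize(1 to 100, 4)
println("Initial partitions: " + numbers.getNumPartitions)

// repartition redistributes the data (this involves a shuffle) and can increase the number of partitions
val morePartitions = numbers.repartition(8)
println("After repartition: " + morePartitions.getNumPartitions)

// coalesce reduces the number of partitions and, by default, avoids a full shuffle
val fewerPartitions = numbers.coalesce(2)
println("After coalesce: " + fewerPartitions.getNumPartitions)
```

Since `repartition` shuffles the data, it is usually worth calling only when the default partitioning clearly does not fit the workload.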

#### Example: Reading a JSON file into an RDD


Input JSON file:

Instructions in Spark to read the file and retrieve an RDD:

In [4]:
val testRDD = sc.textFile("test.txt")            // Reads the file into an RDD (one element per line)
val nRecords = testRDD.count                     // Returns the number of records read from the JSON file
val nPartitions = testRDD.partitions.size         // Returns the number of partitions of the testRDD


println("Number of records in the file: " + nRecords)
println("Number of partitions in the RDD: " + nPartitions)

Number of records in the file: 10
Number of partitions in the RDD: 2


### Let's play a bit with our new RDD

#### Print the first 5 elements

In [5]:
testRDD.take(5).foreach(elem => println("\nRow: " + elem ))    // Take 5 elems from the RDD and print them out on console


Row: {"idTweet":"915831976929714177","text":"RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… ","date":"Thu Oct 05 08:52:13 CEST 2017","authorId":"2885455811","idOriginal":"915523419281739776"}

Row: {"idTweet":"915831940745441280","text":"Yo ya he escogido mediador. https://t.co/D7xS4MHbDG","date":"Thu Oct 05 08:52:04 CEST 2017","authorId":"2099361","idOriginal":""}

Row: {"idTweet":"915831968301973504","text":"RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB","date":"Thu Oct 05 08:52:11 CEST 2017","authorId":"799792832","idOriginal":"915830958443687936"}

Row: {"idTweet":"915831985582612480","text":"RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… ","date":"Thu Oct 05 08:52:15 CEST 2017","authorId":"105157939","idOriginal":"915523419281739776"}

Row: {"idTweet":"91583200465

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/warning.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important;text-align:left;">
            <ul>
                <li>This is just a part of the input file used in the upcoming examples.</li>
                <li>With this instruction we read a JSON file as a plain text file.</li>
            </ul>
        </td>
    </tr>
</table>

#### Keep only the tweets that contain a hashtag

In [6]:
val hashtagTweets = testRDD.filter(t => t.contains("#"))
hashtagTweets.collect.foreach(elem => println("Row: " + elem ))

Row: {"idTweet":"915831968301973504","text":"RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB","date":"Thu Oct 05 08:52:11 CEST 2017","authorId":"799792832","idOriginal":"915830958443687936"}
Row: {"idTweet":"915832004658286593","text":"RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB","date":"Thu Oct 05 08:52:19 CEST 2017","authorId":"124248712","idOriginal":"915830958443687936"}
Row: {"idTweet":"915830958443687936","text":"#AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB","date":"Thu Oct 05 08:48:10 CEST 2017","authorId":"110117638","idOriginal":""}
Row: {"idTweet":"915832008936509440","text":"RT @gsemprunmdg: el desbroce    x      Davila\n\n#FelizJueves\n#AmicsAmigos\n#LaCafeteraPARLEM\n#DíaMundialDeLosDocentes https://t.co/trDDTvrjgr","date":"Thu Oct 05 08:52:20 CEST 2017","authorId":"150587014","idOriginal":"915830945785237504"}
Row: {"idTweet":"915832057288433664","text"

#### Extract the id of each tweet, the id of its author and the id of the original tweet (if it is a retweet)

In [9]:
val ids = testRDD.map(t => t.split("\",\"")).map(fields => (fields(0), fields(3), fields(4)))

ids.collect.foreach(elem => println("[" + elem._1 + ", " + elem._2 + ", " + elem._3 + "]" ))

[{"idTweet":"915831976929714177, authorId":"2885455811, idOriginal":"915523419281739776"}]
[{"idTweet":"915831940745441280, authorId":"2099361, idOriginal":""}]
[{"idTweet":"915831968301973504, authorId":"799792832, idOriginal":"915830958443687936"}]
[{"idTweet":"915831985582612480, authorId":"105157939, idOriginal":"915523419281739776"}]
[{"idTweet":"915832004658286593, authorId":"124248712, idOriginal":"915830958443687936"}]
[{"idTweet":"915830958443687936, authorId":"110117638, idOriginal":""}]
[{"idTweet":"915832008936509440, authorId":"150587014, idOriginal":"915830945785237504"}]
[{"idTweet":"915832057288433664, authorId":"273360453, idOriginal":"915808416639143936"}]
[{"idTweet":"915808416639143936, authorId":"184865048, idOriginal":""}]
[{"idTweet":"915836526789046273, authorId":"142775869, idOriginal":""}]


#### Clean the elements of the RDD

In [10]:
val cleanIds = ids.map(tuple => {
    (tuple._1.replace("{\"idTweet\":\"", ""), 
    tuple._2.replace("authorId\":\"",""), 
    tuple._3.replace("idOriginal\":\"","").replace("\"}",""))
    })

cleanIds.collect.foreach(elem => println("[" + elem._1 + ", " + elem._2 + ", " + elem._3 + "]" ))

[915831976929714177, 2885455811, 915523419281739776]
[915831940745441280, 2099361, ]
[915831968301973504, 799792832, 915830958443687936]
[915831985582612480, 105157939, 915523419281739776]
[915832004658286593, 124248712, 915830958443687936]
[915830958443687936, 110117638, ]
[915832008936509440, 150587014, 915830945785237504]
[915832057288433664, 273360453, 915808416639143936]
[915808416639143936, 184865048, ]
[915836526789046273, 142775869, ]
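
The split-and-replace approach above depends on the exact position of each field. As a hedged alternative, the sketch below extracts a field by name with a regular expression, using plain Scala on one sample line copied from the output shown earlier. The helper `field` is hypothetical (it is not part of Spark or of the original cells), and it assumes that values contain no escaped double quotes, which holds for the id fields but not necessarily for `text`.

```scala
// One record copied from the sample data shown earlier
val line = """{"idTweet":"915831940745441280","text":"Yo ya he escogido mediador. https://t.co/D7xS4MHbDG","date":"Thu Oct 05 08:52:04 CEST 2017","authorId":"2099361","idOriginal":""}"""

// Hypothetical helper: returns the string value of the field `name`, if present.
// Assumption: the value contains no escaped double quotes.
def field(json: String, name: String): Option[String] = {
  val pattern = ("\"" + name + "\":\"([^\"]*)\"").r
  pattern.findFirstMatchIn(json).map(_.group(1))
}

// Extract the three ids by name instead of by position
val extracted = (field(line, "idTweet"), field(line, "authorId"), field(line, "idOriginal"))
println(extracted)
```

The same helper could be applied to every line with something like `testRDD.map(l => field(l, "idTweet"))`, since it is an ordinary function.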


#### Convert the RDD[(String, String, String)] into an RDD[(Long, Long, Long)]

In [11]:
val longIds = cleanIds.map(tuple => {
    (tuple._1.toLong, 
    tuple._2.toLong, 
    tuple._3.toLong)
    })

longIds.collect.foreach(elem => println("[" + elem._1 + ", " + elem._2 + ", " + elem._3 + "]" ))

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 45, localhost, executor driver): java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Long.parseLong(Long.java:601)
	at java.lang.Long.parseLong(Long.java:631)
	at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
	at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
	at $line44.$read$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:28)
	at $line44.$read$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:25)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mut

#### Try again?

In [12]:
val longIds = cleanIds.map(tuple => {
    (tuple._1.toLong, 
    tuple._2.toLong, 
    if (tuple._3.isEmpty) 0L else tuple._3.toLong)   // Use 0 as a placeholder when the tweet is not a retweet
    })

longIds.collect.foreach(elem => println("[" + elem._1 + ", " + elem._2 + ", " + elem._3 + "]" ))

[915831976929714177, 2885455811, 915523419281739776]
[915831940745441280, 2099361, 0]
[915831968301973504, 799792832, 915830958443687936]
[915831985582612480, 105157939, 915523419281739776]
[915832004658286593, 124248712, 915830958443687936]
[915830958443687936, 110117638, 0]
[915832008936509440, 150587014, 915830945785237504]
[915832057288433664, 273360453, 915808416639143936]
[915808416639143936, 184865048, 0]
[915836526789046273, 142775869, 0]
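
Using 0 for "no original tweet" works here, but it mixes a sentinel value with real ids. A minimal sketch of an alternative, in plain Scala, is to model the missing value with `Option`. The helper `parseId` is hypothetical, and the two sample tuples are copied from the cleaned output above.

```scala
import scala.util.Try

// Hypothetical helper: None when the string is empty or is not a valid Long
def parseId(s: String): Option[Long] = Try(s.toLong).toOption

// Two cleaned records taken from the output above
val cleaned = Seq(
  ("915831940745441280", "2099361", ""),
  ("915831968301973504", "799792832", "915830958443687936")
)

// The third component becomes Option[Long] instead of a Long with a magic 0
val parsed = cleaned.map { case (id, author, original) =>
  (id.toLong, author.toLong, parseId(original))
}

parsed.foreach(println)
```

With this shape, "is a retweet" becomes `_._3.isDefined` rather than a comparison against 0.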


#### Retrieve tweets that have been retweeted

In [13]:
val retweetedIDS = longIds.filter(t => t._3!=0).map(_._3).distinct.collect

val original = longIds.filter(t => retweetedIDS.contains(t._1))


original.collect.foreach(println _)
retweetedIDS

(915830958443687936,110117638,0)
(915808416639143936,184865048,0)


Array(915523419281739776, 915830945785237504, 915830958443687936, 915808416639143936)
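
The cell above collects the retweeted ids to the driver and then uses the resulting array inside the filter, which is fine for small data. As a sketch of a fully distributed alternative, the same lookup can be expressed as a join between pair RDDs. The sample tuples and the names below are made up for illustration, and the local session is an assumption.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in the notebook the predefined `sc` could be used instead
val spark = SparkSession.builder().appName("retweets-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical (idTweet, authorId, idOriginal) tuples; 0 means "not a retweet"
val sampleIds = sc.parallelize(Seq(
  (1L, 10L, 0L), (2L, 11L, 1L), (3L, 12L, 1L), (4L, 13L, 0L), (5L, 14L, 9L)
))

// Distinct ids of the tweets that were retweeted, keyed so that they can be joined
val retweetedKeys = sampleIds.filter(_._3 != 0L).map(t => (t._3, ())).distinct()

// Key every tweet by its own id and keep only those whose id appears in retweetedKeys
val originals = sampleIds
  .map(t => (t._1, t))
  .join(retweetedKeys)
  .map { case (_, (tweet, _)) => tweet }

originals.collect().foreach(println)
```

Note that `join` is a wide transformation, so this version trades the driver-side array for a shuffle.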

<a name="subseccion-Logical Data Structures PairRDDs"></a>
## Logical Data Structures: PairRDDs


* Intuition: a parallel, distributed version of a `Map`.


* An RDD containing tuples of (key, value)


* Useful because maps (key-value collections) are one of the most widely used data abstractions.

### Use case of PairRDDs: Counting words in an RDD


1.- First: let's split the content of the RDD into words: using <strong>flatMap</strong>

2.- Then, produce a PairRDD with: (Word, 1): using <strong>map</strong>

3.- Finally, group each pair according to their first component (the word) and sum up the second components (occurrences of the words): using <strong>reduceByKey</strong> function of PairRDDs

In [14]:
// Common pattern for counting elements: map an RDD to a PairRDD, then reduce by key
val countWords = testRDD.flatMap(line => line.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

println("Rows read: " + countWords.count)

// Printing it out
countWords.take(15).foreach(t => println("Word: " + t._1 + "\tOccurrences: " + t._2))

Rows read: 101
Word: #ranciofacts	Occurrences: 3
Word: recodo	Occurrences: 1
Word: 2017","authorId":"124248712","idOriginal":"915830958443687936"}	Occurrences: 1
Word: arreglan	Occurrences: 2
Word: https://t.co/trDDTvrjgr","date":"Thu	Occurrences: 1
Word: Si	Occurrences: 1
Word: donde	Occurrences: 1
Word: ya	Occurrences: 1
Word: los	Occurrences: 4
Word: {"idTweet":"915830958443687936","text":"#AmicsAmigos	Occurrences: 1
Word: 2017","authorId":"142775869","idOriginal":""}	Occurrences: 1
Word: x	Occurrences: 1
Word: @pedroveraOyP:	Occurrences: 2
Word: #AmicsAmigos	Occurrences: 2
Word: pero...	Occurrences: 2


### Let's think twice...


* The first step goes from an `RDD[String]` to an `RDD[String]`: **flatMap** splits each <em>Tweet</em> into a Collection[*words*], and then flattens them, obtaining an RDD[<em>words</em>].


* The second step goes from an `RDD[String]`, where each String is a word, to an `RDD[(String, Int)]`, that is a `PairRDD[(String, Int)]`.

* Finally, **reduceByKey** groups all the tuples with the same word, summing up their values, producing a `PairRDD[(String, Int)]` that represents an RDD[*(word, occurrencesOfWord)*]
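
The three steps and their types can be checked on a tiny in-memory RDD. This is only a sketch: the input lines and the local session are assumptions. Note that a pair RDD is an ordinary `RDD[(K, V)]`; operations such as `reduceByKey` become available on it through an implicit conversion.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

// Assumption: a local session; in the notebook the predefined `sc` could be used instead
val spark = SparkSession.builder().appName("wordcount-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Made-up input standing in for the tweets
val lines: RDD[String] = sc.parallelize(Seq("to be or not", "to be"))

val words: RDD[String] = lines.flatMap(line => line.split(" "))   // Step 1: one element per word
val pairs: RDD[(String, Int)] = words.map(w => (w, 1))            // Step 2: (word, 1)
val counts: RDD[(String, Int)] = pairs.reduceByKey(_ + _)         // Step 3: (word, occurrences)

val result = counts.collect().toMap
println(result)
```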

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/optimizar.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>In the example above, we take each line of the JSON file as a String.</li>
                <li>That means that we are counting all the words in the lines, including the names of the JSON fields and other things that are not part of the tweet texts.</li>
                <li>How can we fix it? We will see how in the next sections.</li>
            </ul>
        </td>
    </tr>
</table>

### What are we really doing?


When we read a file with <strong>textFile</strong>, Spark creates an RDD[String]. It does not infer the structure of the JSON, nor the (field, value) pairs.


The first idea to overcome this could be something like the following:

In [15]:
val texts = testRDD.map(row => row.split("\",")).map(row => row(1).replace("\"text\":\"", ""))

#### What does the instruction above do with the original RDD[String]?

Let's print it out and see...

In [16]:
texts.take(5).foreach(println(_))

RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… 
Yo ya he escogido mediador. https://t.co/D7xS4MHbDG
RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB
RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… 
RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB


#### Now, we can count the words of each text following the same procedure as before...

In [17]:
val countWordsTexts = texts.flatMap(line => line.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

In [18]:
// Printing it out, sorted by count (descending)
countWordsTexts.sortBy(-_._2).take(15).foreach(t => println("Word: " + t._1 + "\tOccurrences: " + t._2))

Word: 	Occurrences: 8
Word: a	Occurrences: 8
Word: no	Occurrences: 7
Word: RT	Occurrences: 6
Word: que	Occurrences: 5
Word: los	Occurrences: 4
Word: tenemos	Occurrences: 4
Word: lo	Occurrences: 4
Word: todos	Occurrences: 4
Word: #ranciofacts	Occurrences: 3
Word: https://t.co/mjMhHQfHuB	Occurrences: 3
Word: #AmicsAmigos	Occurrences: 3
Word: es	Occurrences: 3
Word: pelearsen	Occurrences: 3
Word: muy	Occurrences: 3


### Pros of RDDs and PairRDDs


1.- Easy-to-use API, based on Scala's collections API (map, reduce, filter, flatMap...)

2.- Optimized for use in a distributed Spark cluster

3.- Typed collections: relying on Scala's type inference
    


### But, really... it is a bit tedious, isn't it?


1.- Not good for processing structured or semi-structured data:

- In the example, we tried to read a <strong>structured</strong> file, in JSON format
- So we lost all that <strong>precious information</strong> (fields, values, etc.) by transforming it into a (resilient and distributable) collection of strings.
- And then we had to use the same old, boring <strong>split-get-replace</strong> methods of the String class to extract the interesting parts of each string.


2.- Shuffling can become the bottleneck of our application, and sometimes it is not easy to avoid.
    

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/optimizar.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>In relation to <strong>data shuffle</strong>, in the next section we will study the basic operations in Spark (<em>Transformations</em> and <em>Actions</em>), their effects and the way they are managed within the Spark cluster.</li>
                <li>Regarding the processing of <strong>structured and semi-structured information</strong>: Spark offers a much better way of dealing with this kind of data through the <em>Spark SQL</em> library and its relational data structures: <em>DataFrames</em> and <em>Datasets</em>. We will study them later.</li>
            </ul>
        </td>
    </tr>
</table>

<a name="subseccion-Transformations and actions"></a>
## Spark Basic Operations: Transformations and Actions



Apache Spark RDDs support two types of operations: Transformations and Actions.



### Transformations


* They are functions that produce new RDDs from the existing ones. Examples: map(), filter().


* Since the input RDDs cannot be changed (they are immutable in nature), every time we apply a transformation new RDDs are created. 


* Transformations are lazily evaluated, which means that they are not executed immediately. A transformation is effectively executed when we call an action.


* So, applying one or more transformations does not produce any immediate effect. Instead, an RDD lineage is built up, going from the original RDD (on which the first transformation is invoked) to the final RDD (the result of all the transformations). The RDD lineage, represented by a <strong>DAG</strong> (Directed Acyclic Graph), is a logical execution plan of all the transformations.

Example of transformations and their DAG:

In [19]:
// Remember the code in cells [15] and [17]: 
    // val texts = testRDD.map(row => row.split("\",")).map(row => row(1).replace("\"text\":\"", ""))
    // val countWordsTexts = texts.flatMap(line => line.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

countWordsTexts.toDebugString            // print out the execution plan (DAG) of the transformations

(2) ShuffledRDD[30] at reduceByKey at <console>:23 []
 +-(2) MapPartitionsRDD[29] at map at <console>:23 []
    |  MapPartitionsRDD[28] at flatMap at <console>:23 []
    |  MapPartitionsRDD[27] at map at <console>:21 []
    |  MapPartitionsRDD[26] at map at <console>:21 []
    |  test.txt MapPartitionsRDD[10] at textFile at <console>:19 []
    |  test.txt HadoopRDD[9] at textFile at <console>:19 []

### Types of transformations: 



* Narrow transformations: they do not imply a shuffle of data. They can be computed by each worker node using only its own data partitions.
    - Examples: map, filter, flatMap, union, sample...



* Wide transformations: the processing logic depends on data from multiple partitions, so data shuffling is needed to bring them together in one place.
    - Examples: distinct, join, reduceByKey, groupByKey...
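
One way to see the difference between the two kinds of transformation is to look at the dependencies that each RDD records on its parent. The sketch below is illustrative only: the data and the local session are assumptions, and `NarrowDependency` and `ShuffleDependency` are developer-level classes in `org.apache.spark`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.{NarrowDependency, ShuffleDependency}

// Assumption: a local session; in the notebook the predefined `sc` could be used instead
val spark = SparkSession.builder().appName("dependencies-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// map is a narrow transformation: each output partition depends on one parent partition
val pairs = sc.parallelize(Seq("a", "b", "a"), 2).map(w => (w, 1))

// reduceByKey is a wide transformation: equal keys may live in different partitions
val counts = pairs.reduceByKey(_ + _)

val mapIsNarrow = pairs.dependencies.forall(_.isInstanceOf[NarrowDependency[_]])
val reduceIsWide = counts.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

println("map has only narrow dependencies: " + mapIsNarrow)
println("reduceByKey has a shuffle dependency: " + reduceIsWide)
```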
   

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/notepad.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>Spark implements a mechanism to optimize the execution plan of transformations in order to minimize the data shuffling. For example: it groups narrow transformations into one <em>stage</em>.</li>
                <li>Remember that transformations are <strong>lazy</strong>: they are not executed when they are declared.</li>
                <li>One way to actually execute a set of transformations is to apply an action to the output RDD.</li>
            </ul>
        </td>
    </tr>
</table>

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/warning.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>The DAG is the mechanism that allows Spark to be fault-tolerant <strong>without</strong> having to write data to disk as a backup</li>
                <li>Spark recovers from failures by recomputing the lost partitions, following the <strong>DAG</strong></li>
                <li>Recovering data is really <strong>fast</strong> for <strong>narrow</strong> transformations, but <strong>slow</strong> for <strong>wide</strong> transformations, because a lost partition may depend on many parent partitions.</li>
            </ul>
        </td>
    </tr>
</table>

### Actions: 



* They are Spark RDD operations that produce non-RDD values. 


* The results of actions are returned to the driver node or written to an external storage system. So an action is one of the ways of sending data from the worker nodes to the driver.


* Actions set the laziness of RDDs in motion: calling an action triggers the execution of the transformations the RDD depends on.

* Examples: count, collect, first, take...
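
Laziness can be observed directly by counting how many times the function passed to `map` actually runs. The following is a sketch under stated assumptions: the data, the names and the local session are illustrative, and an accumulator is used only as a simple counter.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in the notebook the predefined `sc` could be used instead
val spark = SparkSession.builder().appName("laziness-sketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Counter incremented every time the function inside map is applied to an element
val calls = sc.longAccumulator("map calls")

// Declaring the transformation does not run it
val doubled = sc.parallelize(1 to 10, 2).map { x => calls.add(1); x * 2 }
val callsBeforeAction = calls.sum

// reduce is an action: it triggers the execution of the map above
val total = doubled.reduce(_ + _)
val callsAfterAction = calls.sum

println("Calls before the action: " + callsBeforeAction)
println("Calls after the action: " + callsAfterAction)
println("Sum of the doubled values: " + total)
```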

### Let's review the example RDD from a text file:

In [1]:
val originalRDD = sc.textFile("test.txt")           // Read plain text file

val firstTransformation = originalRDD.map(row => row.split("\","))

val secondTransformation = firstTransformation.map(row => row(1).replace("\"text\":\"", ""))

val thirdTransformation = secondTransformation.filter(text => text.contains("@"))

val fourthTransformation = secondTransformation.flatMap(text => text.split(" "))

val fifthTransformation = fourthTransformation.filter(word => word.startsWith("#"))

val sixthTransformation = fifthTransformation.map(_.toLowerCase).distinct

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/question.jpg" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>What have we done, up to now?</li>
                <li>What's the content of each RDD?</li>
            </ul>
        </td>
    </tr>
</table>


In [2]:
thirdTransformation.collect.foreach(println)         // Transformations computed: 3, 2 and 1
println
fifthTransformation.collect.foreach(println)         // Transformations computed: 5, 4, 2 and 1
println
sixthTransformation.collect.foreach(println)         // Transformations computed: 6, 5, 4, 2 and 1

RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… 
RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB
RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… 
RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB
RT @gsemprunmdg: el desbroce    x      Davila\n\n#FelizJueves\n#AmicsAmigos\n#LaCafeteraPARLEM\n#DíaMundialDeLosDocentes https://t.co/trDDTvrjgr
RT @carmouna: Si no lo arreglan los que mandan, lo haremos todos nosotros. Juntos. Envía tu canción a #amicsamigos @radio3_rne… 
Si no lo arreglan los que mandan, lo haremos todos nosotros. Juntos. Envía tu canción a #amicsamigos @radio3_rne… https://t.co/ayQQEgCvVz
#AmicsAmigos
#ranciofacts
#AmicsAmigos
#ranciofacts
#AmicsAmigos
#ranciofacts
#amicsamigos
#amicsamigos
#ranciofacts
#amicsamigos


<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/optimizar.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>Remember that transformations are lazily evaluated, so... </li>
                <li>Notice that transformations 2 and 1 are evaluated three times!!</li>
                <li>Spark provides a mechanism to help programmers to prevent this situation: <strong>caching</strong>. Let's rewrite our code:</li>
            </ul>
        </td>
    </tr>
</table>

In [3]:
val originalRDD2 = sc.textFile("test.txt")           // Read plain text file

val firstT = originalRDD2.map(row => row.split("\","))

val secondT = firstT.map(row => row(1).replace("\"text\":\"", "")).cache    // Cache the result!!

val thirdT = secondT.filter(text => text.contains("@"))

val fourthT = secondT.flatMap(text => text.split(" "))

val fifthT = fourthT.filter(word => word.startsWith("#"))

val sixthT = fifthT.map(_.toLowerCase).distinct

In [4]:
thirdT.collect.foreach(println)         // Transformations computed: 3, 2 and 1; the result of 2 is cached
println
fifthT.collect.foreach(println)         // Transformations computed: 5 and 4, over the already evaluated and cached 2
println
sixthT.collect.foreach(println)         // Transformations computed: 6, 5 and 4, over the already evaluated and cached 2

RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… 
RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB
RT @Societatcc: Ayúdanos a difundir, necesitamos llegar a todos los rincones, no tenemos TV3 pero... ¡¡os tenemos a vosotros!!\n¿Com… 
RT @pedroveraOyP: #AmicsAmigos no pelearsen que es muy #ranciofacts https://t.co/mjMhHQfHuB
RT @gsemprunmdg: el desbroce    x      Davila\n\n#FelizJueves\n#AmicsAmigos\n#LaCafeteraPARLEM\n#DíaMundialDeLosDocentes https://t.co/trDDTvrjgr
RT @carmouna: Si no lo arreglan los que mandan, lo haremos todos nosotros. Juntos. Envía tu canción a #amicsamigos @radio3_rne… 
Si no lo arreglan los que mandan, lo haremos todos nosotros. Juntos. Envía tu canción a #amicsamigos @radio3_rne… https://t.co/ayQQEgCvVz
#AmicsAmigos
#ranciofacts
#AmicsAmigos
#ranciofacts
#AmicsAmigos
#ranciofacts
#amicsamigos
#amicsamigos
#ranciofacts
#amicsamigos
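
The effect of caching can also be measured. The following is a minimal sketch, assuming a local session and a hypothetical three-line toy dataset instead of <em>test.txt</em>. The names <em>CacheSketch</em> and <em>calls</em> are illustrative only. It counts how many times the function inside <em>map</em> runs over two actions, with and without <em>cache</em>:

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  // Returns (map calls without cache, map calls with cache) for two actions over 3 lines.
  def run(): (Long, Long) = {
    // Assumption: local mode; in a notebook, getOrCreate reuses the existing session.
    val spark = SparkSession.builder.master("local[*]").appName("cache-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical toy data standing in for the tweets file.
    val lines = sc.parallelize(Seq("RT @a: #x", "hello #y", "no tags"))

    // Counts invocations of the map function (reliable here because no task is retried).
    val calls = sc.longAccumulator("mapCalls")

    val plain = lines.map { l => calls.add(1); l.split(" ") }
    plain.count()
    plain.count()                       // recomputes the map from scratch
    val withoutCache: Long = calls.value

    calls.reset()
    val cached = lines.map { l => calls.add(1); l.split(" ") }.cache()
    cached.count()                      // computes the map and stores the partitions
    cached.count()                      // served from the cached partitions
    val withCache: Long = calls.value

    cached.unpersist()                  // release the cached partitions
    (withoutCache, withCache)
  }
}
```

Under these assumptions <em>CacheSketch.run()</em> should return <em>(6, 3)</em>: without caching each action re-runs the map on all three lines, while with caching only the first action does. Accumulators updated inside transformations may be incremented more than once if a task is re-executed, so treat this as a teaching aid rather than a measurement tool.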


<a name="subseccion-Spark SQL"></a>
## Spark SQL: DataFrames and Datasets




### Spark SQL features


* Spark library that integrates SQL-based syntax to perform operations on distributed data.


* Defines data structures to ease the implementation of relational operations (select, group-by, order-by, max, min, average, count, etc.): DataFrames and Datasets.


* These data structures integrate performance optimizations from SQL relational algebra.


<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/optimizar.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>To use the optimized syntax of Spark SQL we must include the following line of code: <em>import spark.implicits._</em></li>
                <li>This import is also needed to convert RDDs to DataFrames.</li>
            </ul>
        </td>
    </tr>
</table>

In [24]:
// Jupyter notebooks need these two lines; elsewhere use: import spark.implicits._
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC.implicits._

<a name="subsubseccion-DataFrames"></a>
### DataFrames


* Conceptually equivalent to a SQL table


* DataFrames are <strong>untyped</strong>: the Scala compiler cannot check the type of their elements, because DataFrames are composed of generic <strong>Rows</strong> (without a static type)


* We lose the flexibility of RDDs and programmer-defined types and functions, in exchange for a set of pre-defined types (<em>Int, Long, String...</em>) and relational functions (<em>SELECT, COUNT, WHERE...</em>)


* On the other hand, we get enormous <strong>optimizations</strong> in execution time thanks to these strong constraints.


* Catalyst is the Spark SQL component in charge of optimizing these operations.
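
To make the points about untyped rows and Catalyst concrete, here is a hedged sketch on a hypothetical two-row DataFrame (the object name <em>UntypedSketch</em> and the data are illustrative only). It prints the plans Catalyst produces for a simple query, and shows that a misspelled column name compiles but fails when Spark analyzes the query:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object UntypedSketch {
  // Returns true if selecting a misspelled column fails at analysis time.
  def run(): Boolean = {
    // Assumption: local mode; in a notebook, getOrCreate reuses the existing session.
    val spark = SparkSession.builder.master("local[*]").appName("untyped-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical toy DataFrame with two of the tweet columns.
    val df = Seq((1L, "hello"), (2L, "world")).toDF("idTweet", "text")

    // Prints the parsed, analyzed, optimized and physical plans built by Catalyst.
    df.filter($"idTweet" > 1).select($"text").explain(true)

    // "txt" does not exist: the compiler accepts it, Spark rejects it
    // with an AnalysisException as soon as the query is analyzed.
    Try(df.select($"txt")).isFailure
  }
}
```

The same mistake on an RDD of case-class instances (for example <em>t.txt</em> instead of <em>t.text</em>) would be rejected by the compiler.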

#### Creating DataFrames

* DataFrames can be created by reading directly from a file (here, a text file with one JSON object per line), using the SparkSession variable:

In [2]:
val df = spark.read.json("test.txt")
df

[authorId: string, date: string ... 3 more fields]

In [26]:
df.printSchema                                // Prints the schema of the DataFrame, inferred from the JSON file

root
 |-- authorId: string (nullable = true)
 |-- date: string (nullable = true)
 |-- idOriginal: string (nullable = true)
 |-- idTweet: string (nullable = true)
 |-- text: string (nullable = true)



In [27]:
df.show                                       // Prints the first 20 rows of the DataFrame

+----------+--------------------+------------------+------------------+--------------------+
|  authorId|                date|        idOriginal|           idTweet|                text|
+----------+--------------------+------------------+------------------+--------------------+
|2885455811|Thu Oct 05 08:52:...|915523419281739776|915831976929714177|RT @Societatcc: A...|
|   2099361|Thu Oct 05 08:52:...|                  |915831940745441280|Yo ya he escogido...|
| 799792832|Thu Oct 05 08:52:...|915830958443687936|915831968301973504|RT @pedroveraOyP:...|
| 105157939|Thu Oct 05 08:52:...|915523419281739776|915831985582612480|RT @Societatcc: A...|
| 124248712|Thu Oct 05 08:52:...|915830958443687936|915832004658286593|RT @pedroveraOyP:...|
| 110117638|Thu Oct 05 08:48:...|                  |915830958443687936|#AmicsAmigos no p...|
| 150587014|Thu Oct 05 08:52:...|915830945785237504|915832008936509440|RT @gsemprunmdg: ...|
| 273360453|Thu Oct 05 08:52:...|915808416639143936|915832057288433664

* DataFrames can be created from an existing RDD:

In [28]:
val dfFromRawRDD = originalRDD.toDF                // Function imported from spark.implicits._
dfFromRawRDD.printSchema
dfFromRawRDD.show

root
 |-- value: string (nullable = true)

+--------------------+
|               value|
+--------------------+
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91583...|
|{"idTweet":"91580...|
|{"idTweet":"91583...|
+--------------------+



<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/question.jpg" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>What is the difference between <em>df</em> and <em>dfFromRawRDD</em>?</li>
                <li>Why is that so?</li>
                <li>Let's solve it in the first Spark practice class</li>
            </ul>
        </td>
    </tr>
</table>


In [143]:
case class Tweet(idTweet:Long, text:String, date:String, AuthorId:Long, idOriginal:Long)

val formattedRDD = originalRDD.map(line => line.replace("{", "").replace("}", "")).
    map(line => line.split("\",\"")).
    map(columns => columns.map(e => e.split("\":\"")(1).replace("\"",""))).
    map(attributes => {
        // Warning: idOriginal can be empty!!
        val idorig = if(attributes(4)=="")  0 else attributes(4).toLong
        
        Tweet(attributes(0).toLong, attributes(1), attributes(2), attributes(3).toLong, idorig)
    })
formattedRDD

MapPartitionsRDD[487] at map at <console>:102

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/warning.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>This method does not work properly in <strong>Jupyter Notebook</strong>.</li>
                <li>Check it in the lab.</li>
            </ul>
        </td>
    </tr>
</table>

* DataFrames can be created from an RDD, specifying a schema:

In [1]:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Read plain text file
val originalRDD = spark.sparkContext.textFile("test.txt")

// Generate the schema specifying each field and its type
val fields = List(
  StructField("idTweet", LongType, nullable = true),
  StructField("text", StringType, nullable = true),
  StructField("date", StringType, nullable = true),
  StructField("authorId", LongType, nullable = true),
  StructField("idOriginal", LongType, nullable = true))

val schema = StructType(fields)

// Parse each line of the RDD into a Row that matches the schema
val rowRDD = originalRDD.map(line => line.replace("{", "").replace("}", "")).
  map(line => line.split("\",\"")).
  map(columns => columns.map(e => e.split("\":\"")(1).replace("\"",""))).
  map(attributes => {
    // Warning: idOriginal can be empty!!
    val idOrig = if(attributes(4)=="") 0 else attributes(4).toLong

    Row(attributes(0).toLong, attributes(1), attributes(2), attributes(3).toLong, idOrig)
  })

// Apply the schema to the RDD
val tweetDF = spark.createDataFrame(rowRDD, schema)

Intitializing Scala interpreter ...

Spark Web UI available at http://172.17.0.2:4041
SparkContext available as 'sc' (version = 2.2.0, master = local[*], app id = local-1516870159654)
SparkSession available as 'spark'


import org.apache.spark.sql._
import org.apache.spark.sql.types._
originalRDD: org.apache.spark.rdd.RDD[String] = test.txt MapPartitionsRDD[1] at textFile at <console>:28
fields: List[org.apache.spark.sql.types.StructField] = List(StructField(idTweet,LongType,true), StructField(text,StringType,true), StructField(date,StringType,true), StructField(authorId,LongType,true), StructField(idOriginal,LongType,true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(idTweet,LongType,true), StructField(text,StringType,true), StructField(date,StringType,true), StructField(authorId,LongType,true), StructField(idOriginal,LongType,true))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[5] at map at <console>:44
tweetDF: org.apache.spark.sql.DataFrame...
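
As an alternative to writing the <em>StructType</em> by hand, the schema can be derived from a case class through <em>Encoders.product</em>. This is a minimal sketch. <em>TweetRecord</em> and <em>SchemaSketch</em> are hypothetical names that mirror the fields used above:

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical case class with the same fields as the hand-written schema.
case class TweetRecord(idTweet: Long, text: String, date: String, authorId: Long, idOriginal: Long)

object SchemaSketch {
  // Derives the StructType from the case class fields; no SparkSession is needed.
  def run(): StructType = Encoders.product[TweetRecord].schema
}
```

Note one difference from the hand-written version: fields of primitive type such as <em>Long</em> are derived as <em>nullable = false</em>, whereas the schema above declares every field as nullable. The derived schema can be inspected with <em>SchemaSketch.run().printTreeString()</em>.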

#### Playing with DataFrames: Transformations and Actions

DataFrames can be used almost like a SQL relational database:

In [10]:
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("tweets")

// Select tweets that are NOT retweets
val originalTweetsQueryDF = spark.sql("SELECT * FROM tweets WHERE idOriginal LIKE ''")

originalTweetsQueryDF.show

+---------+--------------------+----------+------------------+--------------------+
| authorId|                date|idOriginal|           idTweet|                text|
+---------+--------------------+----------+------------------+--------------------+
|  2099361|Thu Oct 05 08:52:...|          |915831940745441280|Yo ya he escogido...|
|110117638|Thu Oct 05 08:48:...|          |915830958443687936|#AmicsAmigos no p...|
|184865048|Thu Oct 05 07:18:...|          |915808416639143936|Si no lo arreglan...|
|142775869|Thu Oct 05 09:10:...|          |915836526789046273|La elegancia del ...|
+---------+--------------------+----------+------------------+--------------------+



* The equivalent, using Spark SQL functions and $-notation:

In [12]:
val sqlC = new org.apache.spark.sql.SQLContext(sc)           // Jupyter notebooks require these two statements
import sqlC.implicits._                                      // in order to use $-notation

// Select tweets that are NOT retweets
// $"colName" => access to the colName of the DataFrame
df.select($"idOriginal", $"date", $"authorId", $"idTweet", $"text").where("idOriginal LIKE ''").show

// Alternative way:
df.select($"idOriginal", $"date", $"authorId", $"idTweet", $"text").filter($"idOriginal".like("")).show

+----------+--------------------+---------+------------------+--------------------+
|idOriginal|                date| authorId|           idTweet|                text|
+----------+--------------------+---------+------------------+--------------------+
|          |Thu Oct 05 08:52:...|  2099361|915831940745441280|Yo ya he escogido...|
|          |Thu Oct 05 08:48:...|110117638|915830958443687936|#AmicsAmigos no p...|
|          |Thu Oct 05 07:18:...|184865048|915808416639143936|Si no lo arreglan...|
|          |Thu Oct 05 09:10:...|142775869|915836526789046273|La elegancia del ...|
+----------+--------------------+---------+------------------+--------------------+

+----------+--------------------+---------+------------------+--------------------+
|idOriginal|                date| authorId|           idTweet|                text|
+----------+--------------------+---------+------------------+--------------------+
|          |Thu Oct 05 08:52:...|  2099361|915831940745441280|Yo ya he esco

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/warning.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>Outside <strong>Jupyter notebooks</strong> we must replace the first two lines with: <em>import spark.implicits._</em></li>
                <li><em>spark</em> is the SparkSession instance.</li>
                <li>Check it in the lab.</li>
            </ul>
        </td>
    </tr>
</table>

<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/notepad.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>Spark SQL provides functions equivalent to SQL clauses: <em>where, like, select, count</em>...</li>
                <li><strong>$-notation</strong> allows us to access the columns of a DataFrame by name.</li>
                <li>Functions from the Spark API, like <em>filter</em>, are also overridden in the Spark SQL API, so that the optimizations are applied whenever possible.</li>
            </ul>
        </td>
    </tr>
</table>
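
The correspondence between the SQL string and the function-based syntax can be checked on a small example. This is a sketch under stated assumptions: a hypothetical three-row DataFrame, a temporary view named <em>tweets_sketch</em>, and the equality operator <em>===</em> in place of <em>like</em>, which selects the same rows here because the pattern has no wildcards:

```scala
import org.apache.spark.sql.SparkSession

object SqlVsDslSketch {
  // Returns the number of non-retweets found through SQL and through the column API.
  def run(): (Long, Long) = {
    // Assumption: local mode; in a notebook, getOrCreate reuses the existing session.
    val spark = SparkSession.builder.master("local[*]").appName("sql-vs-dsl-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical toy data: an empty idOriginal marks an original tweet.
    val tweets = Seq(("1", ""), ("2", "1"), ("3", "")).toDF("idTweet", "idOriginal")
    tweets.createOrReplaceTempView("tweets_sketch")

    // SQL string, resolved against the temporary view.
    val viaSql = spark.sql("SELECT * FROM tweets_sketch WHERE idOriginal = ''").count()

    // Column-based API: $"idOriginal" is a Column, === builds an equality expression.
    val viaDsl = tweets.filter($"idOriginal" === "").count()

    (viaSql, viaDsl)
  }
}
```

Both counts should be 2 for this toy data, since both forms describe the same relational query.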

#### Aggregations:

* One of the most common tasks with relational databases is grouping rows by one or more attributes and aggregating each group, for example by counting, summing or averaging.



* Spark SQL provides the function <strong>groupBy</strong>, which returns a <em>RelationalGroupedDataset</em>



* This type has a number of relational aggregation functions: sum, count, avg, max, min.

In [78]:
import org.apache.spark.sql.functions._                   // Needed for aggregate functions such as avg, max and min

// Example of grouping:
val grouped = df.groupBy($"idOriginal")
grouped

org.apache.spark.sql.RelationalGroupedDataset@222e7463

In [42]:
// Counting by idOriginal:
val groupedCount = grouped.count
groupedCount.printSchema

root
 |-- idOriginal: string (nullable = true)
 |-- count: long (nullable = false)



In [43]:
groupedCount.show

+------------------+-----+
|        idOriginal|count|
+------------------+-----+
|915830945785237504|    1|
|915808416639143936|    1|
|915830958443687936|    2|
|                  |    4|
|915523419281739776|    2|
+------------------+-----+



In [90]:
// Sorting the results
groupedCount.orderBy($"count".desc).show


+------------------+-----+
|        idOriginal|count|
+------------------+-----+
|                  |    4|
|915523419281739776|    2|
|915830958443687936|    2|
|915808416639143936|    1|
|915830945785237504|    1|
+------------------+-----+



In [91]:
// Average, max, min...
groupedCount.agg(avg($"count")).show
groupedCount.agg(max($"count")).show
groupedCount.agg(min($"count")).show

+----------+
|avg(count)|
+----------+
|       2.0|
+----------+

+----------+
|max(count)|
+----------+
|         4|
+----------+

+----------+
|min(count)|
+----------+
|         1|
+----------+
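
The three <em>agg</em> calls above each run a separate query. As a sketch, several aggregates can also be requested in a single <em>agg</em> call. The toy DataFrame below is hypothetical, and its counts mirror the table shown earlier:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

object AggSketch {
  // Returns (average, maximum, minimum) of the "count" column, computed in one query.
  def run(): (Double, Long, Long) = {
    // Assumption: local mode; in a notebook, getOrCreate reuses the existing session.
    val spark = SparkSession.builder.master("local[*]").appName("agg-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for groupedCount, with made-up keys.
    val counts = Seq(("a", 1L), ("b", 1L), ("c", 2L), ("", 4L), ("d", 2L))
      .toDF("idOriginal", "count")

    // One agg call, three result columns, a single row.
    val row = counts.agg(avg($"count"), max($"count"), min($"count")).head

    (row.getDouble(0), row.getLong(1), row.getLong(2))
  }
}
```

For this data the result is <em>(2.0, 4, 1)</em>, the same values as in the three separate tables above.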



<a name="subsubseccion-DataSets"></a>
### DataSets


* In short: Typed DataFrames


* DataSets are a <strong>typed</strong> version of DataFrames: we have to specify the type of each column in a DataSet.


* Actually: DataFrame = DataSet\[Row\]


* We recover the <strong>flexibility</strong> of RDDs and programmer-defined types and functions, while preserving the <strong>SparkSQL</strong> pre-defined types (<em>Int, Long, String...</em>) and relational functions (<em>SELECT, COUNT, WHERE...</em>)


* In exchange, we only get <strong>part</strong> of the optimizations of DataFrames.


* DataSets can be seen as a compromise between RDDs and DataFrames.
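
A minimal sketch of what "typed" means in practice, assuming a hypothetical case class <em>MiniTweet</em> and a two-row toy DataSet (all names here are illustrative). Fields are accessed as ordinary Scala members inside lambdas, so the compiler checks both their names and their types:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical, reduced version of the Tweet case class.
case class MiniTweet(idTweet: Long, text: String, idOriginal: Long)

object DatasetSketch {
  // Returns the text of the tweets that are not retweets (idOriginal == 0).
  def run(): Seq[String] = {
    // Assumption: local mode; in a notebook, getOrCreate reuses the existing session.
    val spark = SparkSession.builder.master("local[*]").appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    val ds: Dataset[MiniTweet] =
      Seq(MiniTweet(1L, "RT @a: hi", 7L), MiniTweet(2L, "hello", 0L)).toDS()

    // Typed operations: t is a MiniTweet, so t.idOriginal is checked at compile time.
    ds.filter(t => t.idOriginal == 0L).map(_.text).collect().toSeq
  }
}
```

For this toy data the result contains only <em>"hello"</em>. Because the lambdas are opaque Scala functions, Catalyst cannot analyze them the way it analyzes column expressions, which is one reason DataSets keep only part of the DataFrame optimizations.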

#### Creating DataSets

DataSets can be created from an existing RDD:

In [31]:
val ds = spark.createDataset(originalRDD)
ds

[value: string]

DataSets can be created from an existing DataFrame by <strong>hand-made type conversion</strong>:

In [127]:
val tweetDs = df.map(row => {
    // Warning: idOriginal can be empty!!
    val idorig = if(row.getAs[String]("idOriginal")=="") 0 else row.getAs[String]("idOriginal").toLong
    
    Tweet(row.getAs[String]("idTweet").toLong, 
          row.getAs[String]("text"), 
          row.getAs[String]("date"), 
          row.getAs[String]("authorId").toLong, 
          idorig)
})

tweetDs.printSchema

root
 |-- idTweet: long (nullable = false)
 |-- text: string (nullable = true)
 |-- date: string (nullable = true)
 |-- AuthorId: long (nullable = false)
 |-- idOriginal: long (nullable = false)



<table align="left" style="border-collapse: collapse; border: none !important; width: 100%;">
    <tr style="border:none !important;">
        <td style="border:none !important; width: 60px;">
<img src="icons/warning.png" align="left" width="50px"> 
        </td>
        <td style="border:none !important; text-align:left">
            <ul>
                <li>This method does not work properly in <strong>Jupyter Notebook</strong>.</li>
                <li>Check it in the lab.</li>
            </ul>
        </td>
    </tr>
</table>

DataSets can be created from an existing DataFrame by <strong>implicit type conversion</strong>:

In [139]:
val ds = tweetDF.as[Tweet]
ds.printSchema
ds.show

root
 |-- idTweet: long (nullable = true)
 |-- text: string (nullable = true)
 |-- date: string (nullable = true)
 |-- authorId: long (nullable = true)
 |-- idOriginal: long (nullable = true)

+------------------+--------------------+--------------------+----------+------------------+
|           idTweet|                text|                date|  authorId|        idOriginal|
+------------------+--------------------+--------------------+----------+------------------+
|915831976929714177|RT @Societatcc: A...|Thu Oct 05 08:52:...|2885455811|915523419281739776|
|915831940745441280|Yo ya he escogido...|Thu Oct 05 08:52:...|   2099361|                 0|
|915831968301973504|RT @pedroveraOyP:...|Thu Oct 05 08:52:...| 799792832|915830958443687936|
|915831985582612480|RT @Societatcc: A...|Thu Oct 05 08:52:...| 105157939|915523419281739776|
|915832004658286593|RT @pedroveraOyP:...|Thu Oct 05 08:52:...| 124248712|915830958443687936|
|915830958443687936|#AmicsAmigos no p...|Thu Oct 05 08:48:...| 

#### Playing with DataSets

In [153]:
ds.groupByKey(t => t.idOriginal).                // Typed, RDD-style API (takes a Scala function)
    count.show                                   // Relational, DataFrame-style API

+------------------+--------+
|             value|count(1)|
+------------------+--------+
|                 0|       4|
|915808416639143936|       1|
|915523419281739776|       2|
|915830958443687936|       2|
|915830945785237504|       1|
+------------------+--------+



In [196]:
val mentions = ds.flatMap(t => t.text.split(" ").map(w => w.replaceAll(":$",""))).filter(text => text.startsWith("@"))

mentions.distinct.show
mentions.groupBy($"value").count.show

+-------------+
|        value|
+-------------+
| @gsemprunmdg|
|@pedroveraOyP|
|    @carmouna|
| @radio3_rne…|
|  @Societatcc|
+-------------+

+-------------+-----+
|        value|count|
+-------------+-----+
| @gsemprunmdg|    1|
|@pedroveraOyP|    2|
|    @carmouna|    1|
| @radio3_rne…|    2|
|  @Societatcc|    2|
+-------------+-----+



In [184]:
mentions.groupBy($"value").agg(count($"value").as[Long]).show

+-------------+------------+
|        value|count(value)|
+-------------+------------+
| @gsemprunmdg|           1|
|@pedroveraOyP|           2|
|    @carmouna|           1|
| @radio3_rne…|           2|
|  @Societatcc|           2|
+-------------+------------+



In [217]:
val groupedMentions = ds.groupByKey(tweet => tweet.idOriginal)

groupedMentions.agg(count($"idTweet").as[Long]).show

+------------------+--------------+
|             value|count(idTweet)|
+------------------+--------------+
|                 0|             4|
|915808416639143936|             1|
|915523419281739776|             2|
|915830958443687936|             2|
|915830945785237504|             1|
+------------------+--------------+



<a name="subsubseccion-Use of RDDs DataFrames and Datasets"></a>
### Use of RDDs, DataFrames and Datasets

So, when should I use RDDs, Datasets or DataFrames in my application? Let's summarize the characteristics of each data structure. You should use...


* RDDs when...

    - your data is unstructured, for example, binary (media) streams or text streams
    - you want to control your dataset and use low-level transformations and actions
    - you can do without the out-of-the-box optimizations that DataFrames and Datasets provide for structured and semi-structured data
    - you don’t need a schema or a columnar format, and you are ready to use functional programming constructs


* DataFrames when...

    - your data is structured (RDBMS input) or semi-structured (json, csv)
    - you want the best performance, provided by Spark SQL’s optimized execution engine
    - you need to run Hive queries
    - you appreciate a domain-specific language API (.groupBy, .agg, .orderBy)
    - you are using R or Python 


* Datasets when...

    - your data is structured or semi-structured
    - you appreciate compile-time type safety and a strongly typed API
    - you need good performance (usually better than RDDs), but not the best one (usually lower than DataFrames)
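
To compare the three APIs side by side, the following sketch counts tweets per <em>idOriginal</em> with each of them. It assumes a hypothetical case class <em>Retweet</em> and five made-up rows. Names such as <em>ThreeApisSketch</em> are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced tweet: idOriginal == 0 marks an original tweet.
case class Retweet(idTweet: Long, idOriginal: Long)

object ThreeApisSketch {
  // Returns the count per idOriginal computed with RDDs, DataFrames and Datasets.
  def run(): (Map[Long, Long], Map[Long, Long], Map[Long, Long]) = {
    // Assumption: local mode; in a notebook, getOrCreate reuses the existing session.
    val spark = SparkSession.builder.master("local[*]").appName("three-apis-sketch").getOrCreate()
    import spark.implicits._

    val data = Seq(Retweet(1L, 0L), Retweet(2L, 7L), Retweet(3L, 7L), Retweet(4L, 0L), Retweet(5L, 9L))

    // RDD: low-level key/value pairs and an explicit reduce function.
    val viaRdd = spark.sparkContext.parallelize(data)
      .map(t => (t.idOriginal, 1L))
      .reduceByKey(_ + _)
      .collect().toMap

    // DataFrame: relational operators on named columns, results come back as Rows.
    val viaDf = data.toDF()
      .groupBy($"idOriginal").count()
      .collect()
      .map(r => (r.getLong(0), r.getLong(1))).toMap

    // Dataset: typed key function, results come back as typed tuples.
    val viaDs = data.toDS()
      .groupByKey(_.idOriginal).count()
      .collect().toMap

    (viaRdd, viaDf, viaDs)
  }
}
```

All three maps should be equal for this data (two tweets for key 0, two for key 7 and one for key 9). What changes is how much of the query Spark can see and optimize, and how much type information the compiler can check.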