# 102 Spark basics

The goal of this lab is to get familiar with Spark programming.

- [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
- [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)

In [1]:
import org.apache.spark

Intitializing Scala interpreter ...

Spark Web UI available at http://BigData:4040
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1729585476417)
SparkSession available as 'spark'


import org.apache.spark


## 102-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following operations:
- Show their content (```collect```)
- Count their rows (```count```)
- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)
- Try the ```toDebugString``` function to check the execution plan
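
To make the ```map``` versus ```flatMap``` question concrete, here is a minimal sketch. It assumes the notebook's ```sc``` SparkContext and inlines two lines that mirror the capra dataset, so the names ```sampleLines```, ```mapped``` and ```flattened``` are illustrative only.

```scala
// Two sample lines mirroring the capra dataset (inlined so no file path is needed).
// Assumption: `sc` is the SparkContext provided by the notebook.
val sampleLines = sc.parallelize(Seq(
  "sopra la panca la capra campa",
  "sotto la panca la capra crepa"
))

// map: exactly one output element per input line, so each element is an array of words.
val mapped = sampleLines.map(_.split(" "))        // RDD[Array[String]]

// flatMap: the per-line arrays are flattened into a single RDD of words.
val flattened = sampleLines.flatMap(_.split(" ")) // RDD[String]

val mappedCount = mapped.count       // one element per line
val flattenedCount = flattened.count // one element per word
```

With these two lines, ```mappedCount``` is 2 and ```flattenedCount``` is 12, which is why word-level jobs usually start from ```flatMap```.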

In [13]:
val rddCapra = sc.textFile("/home/bambinim/projects/lab-cse-24-25/datasets/capra.txt")
val capraCollected = rddCapra.collect
val capraCount = rddCapra.count

rddCapra: org.apache.spark.rdd.RDD[String] = /home/bambinim/projects/lab-cse-24-25/datasets/capra.txt MapPartitionsRDD[11] at textFile at <console>:26
capraCollected: Array[String] = Array(sopra la panca la capra campa, sotto la panca la capra crepa)
capraCount: Long = 2


In [15]:
val rddDivina = sc.textFile("/home/bambinim/projects/lab-cse-24-25/datasets/divinacommedia.txt")
val divinaCollected = rddDivina.collect
val divinaCount = rddDivina.count

rddDivina: org.apache.spark.rdd.RDD[String] = /home/bambinim/projects/lab-cse-24-25/datasets/divinacommedia.txt MapPartitionsRDD[15] at textFile at <console>:26
divinaCollected: Array[String] = Array(LA DIVINA COMMEDIA, di Dante Alighieri, INFERNO, "", "", "", Inferno: Canto I, "", "  Nel mezzo del cammin di nostra vita", mi ritrovai per una selva oscura, ché la diritta via era smarrita., "  Ahi quanto a dir qual era è cosa dura", esta selva selvaggia e aspra e forte, che nel pensier rinova la paura!, "  Tant'è amara che poco è più morte;", ma per trattar del ben ch'i' vi trovai,, dirò de l'altre cose ch'i' v'ho scorte., "  Io non so ben ridir com'i' v'intrai,", tant'era pien di sonno a quel punto, che la verace via abbandonai., "  Ma poi ch'i' fui al piè d'un colle giunto,", là dove te...


In [18]:
rddCapra.flatMap(_.split(" ")).collect

res11: Array[String] = Array(sopra, la, panca, la, capra, campa, sotto, la, panca, la, capra, crepa)
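
The remaining warm-up points (lazy evaluation and ```toDebugString```) can be sketched as follows. This is an illustration under assumptions: it uses the notebook's ```sc``` and an inlined copy of the two capra lines, and the names ```capraLines```, ```wordPairs``` and ```wordCounts``` are hypothetical.

```scala
// Assumption: `sc` is the SparkContext provided by the notebook.
val capraLines = sc.parallelize(Seq(
  "sopra la panca la capra campa",
  "sotto la panca la capra crepa"
))

// Transformations are only recorded here; no job runs yet.
val wordPairs = capraLines.flatMap(_.split(" ")).map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)

// toDebugString returns the lineage (the chain of RDDs and shuffle boundaries)
// as a String, without executing the computation.
println(wordCounts.toDebugString)

// collect is an action: only now does Spark execute the lineage.
val countsByWord = wordCounts.collect.toMap
```

The exact text printed by ```toDebugString``` depends on the Spark version and partitioning, so it is not reproduced here; the indentation in that output marks where a shuffle separates stages.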


## 102-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the capra and divinacommedia datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count the number of occurrences of words of given lengths
  - Result: (2, 4), (5, 8)
- **Average word length by first letter**: compute the average length of words grouped by their first letter (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- **Inverted index**: for each word, list the numbers of the lines in which it appears
  - Result: (sopra, (0)), (la, (0, 1)), …

Also, check how sorting works and try to sort key-value RDDs by descending values.
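
As a starting point for the sorting exercise, the sketch below shows two common ways to order a key-value RDD. It assumes the notebook's ```sc```; the sample pairs and the names ```sampleCounts```, ```byKey```, ```byValueDesc``` and ```byValueDescSwap``` are made up for illustration.

```scala
// Assumption: `sc` is the SparkContext provided by the notebook.
val sampleCounts = sc.parallelize(Seq(("sopra", 1), ("la", 4), ("panca", 2), ("capra", 2)))

// sortByKey orders by key, ascending by default.
val byKey = sampleCounts.sortByKey().collect

// sortBy takes a function that extracts the sort key;
// ascending = false yields descending values.
val byValueDesc = sampleCounts.sortBy(_._2, ascending = false).collect

// Alternative: swap key and value, sort by the (former) value, then swap back.
val byValueDescSwap = sampleCounts.map(_.swap).sortByKey(ascending = false).map(_.swap).collect
```

Both descending variants put ```(la, 4)``` first; the relative order of pairs with equal values (here ```panca``` and ```capra```) is not guaranteed.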

In [79]:
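// Word count: split lines on spaces, drop empty tokens, then sum one per occurrence.
def wordCount(rdd: org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[(String, Int)] =
    rdd.flatMap(_.split(" ")).filter(_.length > 0).map((_, 1)).reduceByKey(_ + _)

// Word length count: same pipeline, keyed by word length instead of the word itself.
def wordLengthCount(rdd: org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[(Int, Int)] =
    rdd.flatMap(_.split(" ")).filter(_.length > 0).map(x => (x.length, 1)).reduceByKey(_ + _)

// Average word length by first letter: accumulate (total length, word count) per letter, then divide.
// Note: the final division is on Ints, so the average is truncated (e.g. 4.7 becomes 4).
def averageWordLength(rdd: org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[(String, Int)] =
    rdd.flatMap(_.split(" ")).filter(_.length > 0).map(x => (x.take(1), (x.length, 1)))
        .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).map(x => (x._1, x._2._1 / x._2._2))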

wordCount: (rdd: org.apache.spark.rdd.RDD[String])org.apache.spark.rdd.RDD[(String, Int)]
wordLengthCount: (rdd: org.apache.spark.rdd.RDD[String])org.apache.spark.rdd.RDD[(Int, Int)]
averageWordLength: (rdd: org.apache.spark.rdd.RDD[String])org.apache.spark.rdd.RDD[(String, Int)]
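
The cell above does not cover the inverted index job. One possible sketch is shown below. It is not the only way to solve the exercise, and it assumes the notebook's ```sc```, 0-based line numbers as in the expected result, and a hypothetical function name ```invertedIndex```.

```scala
import org.apache.spark.rdd.RDD

// Sketch: pair each line with its 0-based index, emit (word, lineIndex) pairs,
// drop duplicates (a word may repeat within a line), then group the indices by word.
def invertedIndex(rdd: RDD[String]): RDD[(String, Seq[Long])] =
  rdd.zipWithIndex()
    .flatMap { case (line, idx) => line.split(" ").filter(_.nonEmpty).map(word => (word, idx)) }
    .distinct()
    .groupByKey()
    .mapValues(_.toSeq.sorted)

// Assumption: `sc` is the SparkContext provided by the notebook.
val indexSample = sc.parallelize(Seq(
  "sopra la panca la capra campa",
  "sotto la panca la capra crepa"
))
val index = invertedIndex(indexSample).collect.toMap
```

On this sample, ```index("sopra")``` contains only line 0 and ```index("la")``` contains lines 0 and 1. Note that ```zipWithIndex``` itself triggers a Spark job when the RDD has more than one partition, and ```groupByKey``` shuffles every pair, which may matter on larger inputs.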
