# 102 Spark basics

The goal of this lab is to get familiar with Spark programming.

- [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
- [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)

In [None]:
import org.apache.spark.sql.SparkSession

In [None]:
// DO NOT EXECUTE - this is needed just to avoid showing errors in the following cells
val sc = SparkSession.builder().getOrCreate().sparkContext

## 102-1 Spark warm-up

Load the ```capra``` and ```divinacommedia``` datasets and try the following operations:
- Show their content (```collect```)
- Count their rows (```count```)
- Split phrases into words (```map``` or ```flatMap```; what’s the difference?)
- Check the results (remember: evaluation is lazy)
- Try the ```toDebugString``` function to check the execution plan
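
As a minimal sketch of the ```map``` versus ```flatMap``` question in the list above, the cell below uses two in-memory lines in place of the dataset file. The names `demoSpark`, `demoSc` and `lines`, the `local[*]` master and the application name are assumptions made only so that the cell can also run outside the lab environment; inside the notebook, `getOrCreate()` should simply return the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption)
val demoSpark = SparkSession.builder().master("local[*]").appName("map-vs-flatMap").getOrCreate()
val demoSc = demoSpark.sparkContext

// Two in-memory lines standing in for capra.txt
val lines = demoSc.parallelize(Seq(
  "sopra la panca la capra campa",
  "sotto la panca la capra crepa"
))

// map: exactly one output element per input line, here an Array of words
val mapped: Array[Array[String]] = lines.map(line => line.split(" ")).collect()

// flatMap: the per-line arrays are flattened into a single collection of words
val flattened: Array[String] = lines.flatMap(line => line.split(" ")).collect()

println(mapped.length)     // 2 (one array per line)
println(flattened.length)  // 12 (one element per word)
```

With ```map``` the result keeps the line structure (an RDD of arrays), while ```flatMap``` produces an RDD of words, which is usually the shape needed for per-word jobs.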

In [None]:
val rdd_capra = sc.textFile("../../../../datasets/capra.txt")
val rdd_divinacommedia = sc.textFile("../../../../datasets/divinacommedia.txt")

In [3]:
rdd_capra.collect()

res0: Array[String] = Array(sopra la panca la capra campa, sotto la panca la capra crepa)


In [4]:
rdd_divinacommedia.collect()

res1: Array[String] = Array(LA DIVINA COMMEDIA, di Dante Alighieri, INFERNO, "", "", "", Inferno: Canto I, "", "  Nel mezzo del cammin di nostra vita", mi ritrovai per una selva oscura, ch? la diritta via era smarrita., "  Ahi quanto a dir qual era ? cosa dura", esta selva selvaggia e aspra e forte, che nel pensier rinova la paura!, "  Tant'? amara che poco ? pi? morte;", ma per trattar del ben ch'i' vi trovai,, dir? de l'altre cose ch'i' v'ho scorte., "  Io non so ben ridir com'i' v'intrai,", tant'era pien di sonno a quel punto, che la verace via abbandonai., "  Ma poi ch'i' fui al pi? d'un colle giunto,", l? dove terminava quella valle, che m'avea di paura il cor compunto,, "  guardai in alto, e vidi le sue spalle", vestite gi? de' raggi del pianeta, che mena dritto altrui per ogne ca...


In [5]:
rdd_capra.count()

res2: Long = 2


In [6]:
rdd_divinacommedia.count()

res3: Long = 14753


In [7]:
rdd_capra.flatMap(line => line.split(" ")).collect()

res4: Array[String] = Array(sopra, la, panca, la, capra, campa, sotto, la, panca, la, capra, crepa)


In [8]:
rdd_divinacommedia.flatMap(line => line.split(" ")).collect()

res5: Array[String] = Array(LA, DIVINA, COMMEDIA, di, Dante, Alighieri, INFERNO, "", "", "", Inferno:, Canto, I, "", "", "", Nel, mezzo, del, cammin, di, nostra, vita, mi, ritrovai, per, una, selva, oscura, ch?, la, diritta, via, era, smarrita., "", "", Ahi, quanto, a, dir, qual, era, ?, cosa, dura, esta, selva, selvaggia, e, aspra, e, forte, che, nel, pensier, rinova, la, paura!, "", "", Tant'?, amara, che, poco, ?, pi?, morte;, ma, per, trattar, del, ben, ch'i', vi, trovai,, dir?, de, l'altre, cose, ch'i', v'ho, scorte., "", "", Io, non, so, ben, ridir, com'i', v'intrai,, tant'era, pien, di, sonno, a, quel, punto, che, la, verace, via, abbandonai., "", "", Ma, poi, ch'i', fui, al, pi?, d'un, colle, giunto,, l?, dove, terminava, quella, valle, che, m'avea, di, paura, il, cor, compunto,...


In [9]:
rdd_capra.toDebugString

res6: String =
(2) ../../../../datasets/capra.txt MapPartitionsRDD[1] at textFile at <console>:25 []
 |  ../../../../datasets/capra.txt HadoopRDD[0] at textFile at <console>:25 []


In [10]:
rdd_divinacommedia.toDebugString

res7: String =
(2) ../../../../datasets/divinacommedia.txt MapPartitionsRDD[3] at textFile at <console>:26 []
 |  ../../../../datasets/divinacommedia.txt HadoopRDD[2] at textFile at <console>:26 []
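
The lineage printed by ```toDebugString``` is only a plan: transformations are recorded, and nothing is computed until an action runs. The sketch below tries to make this visible with an accumulator that is incremented inside ```flatMap```. The names `demoSpark`, `demoSc` and `linesSeen` and the `local[*]` master are assumptions for illustration; accumulator updates made inside transformations can be applied more than once if a task is retried, so this is an observation trick rather than a reliable counter.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption)
val demoSpark = SparkSession.builder().master("local[*]").appName("lazy-evaluation").getOrCreate()
val demoSc = demoSpark.sparkContext

val lines = demoSc.parallelize(Seq(
  "sopra la panca la capra campa",
  "sotto la panca la capra crepa"
))

// Accumulator used only to observe when the function passed to flatMap actually runs
val linesSeen = demoSc.longAccumulator("linesSeen")

// Transformation: only recorded in the lineage, nothing is computed yet
val words = lines.flatMap { line =>
  linesSeen.add(1)
  line.split(" ")
}
val seenBeforeAction = linesSeen.sum  // no job has run so far

// The lineage now contains the flatMap step on top of the parallelized collection
println(words.toDebugString)

// Action: triggers a job, which runs the flatMap function on every line
val wordCount = words.count()
val seenAfterAction = linesSeen.sum

println(s"before: $seenBeforeAction, after: $seenAfterAction, words: $wordCount")
```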


## 102-2 Basic Spark jobs

Implement the following jobs in Spark and test them on both the capra and divinacommedia datasets.

- **Word count**: count the number of occurrences of each word
  - Result: (sopra, 1), (la, 4), …
- **Word length count**: count the number of occurrences of words of given lengths
  - Result: (2, 4), (5, 8)
- **Average word length by first letter**: compute the average length of the words that share the same first letter (e.g., words that begin with "s" have an average length of 5)
  - Result: (s, 5), (l, 2), …
- **Inverted index**: for each word, list the numbers of the lines in which it appears
  - Result: (sopra, (0)), (la, (0, 1)), ...

Also, check how sorting works and try to sort key-value RDDs by descending values.
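
One possible way to explore sorting is sketched below on a small hand-written key-value RDD. ```sortByKey``` orders by key only, so sorting by value needs either ```sortBy``` with a function that extracts the value or a swap of key and value. The names `demoSpark`, `demoSc` and `counts` and the `local[*]` master are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption)
val demoSpark = SparkSession.builder().master("local[*]").appName("sorting").getOrCreate()
val demoSc = demoSpark.sparkContext

// A few (word, count) pairs written by hand
val counts = demoSc.parallelize(Seq(("sopra", 1), ("la", 4), ("panca", 2), ("campa", 1)))

// sortByKey orders by key only (ascending by default)
val byKey = counts.sortByKey().collect()

// sortBy accepts any function of the element; here the value, in descending order
val byValueDesc = counts.sortBy(_._2, ascending = false).collect()

// Same idea with sortByKey: swap to (value, key), sort descending, swap back
val byValueDescSwap = counts.map(_.swap).sortByKey(ascending = false).map(_.swap).collect()

println(byKey.mkString(", "))
println(byValueDesc.mkString(", "))
```

Pairs with equal values (here the two words with count 1) are not ordered relative to each other by either approach.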

In [30]:
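// Count the occurrences of each word; results are sorted by word (natural String ordering).
// Splitting on a single space keeps empty tokens (blank lines, leading spaces).
def word_count(rdd: org.apache.spark.rdd.RDD[String]) = {
    rdd.flatMap(line => line.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .sortByKey()
       .collect()
}

// Count how many words have each length (length 0 corresponds to empty tokens).
def word_length_count(rdd: org.apache.spark.rdd.RDD[String]) = {
    rdd.flatMap(line => line.split(" "))
       .map(word => (word.length, 1))
       .reduceByKey(_ + _)
       .collect()
}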

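// Average word length per first character.
// Sum and count are both Int, so the final division truncates the average.
def average_word_length(rdd: org.apache.spark.rdd.RDD[String]) = {
    rdd.flatMap(line => line.split(" "))
       .filter(word => word.nonEmpty)  // Skip empty tokens: word(0) would throw on them
       .map(word => (word(0), (word.length, 1)))  // (first char, (length, 1))
       .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // (sum of lengths, word count)
       .mapValues(pair => pair._1 / pair._2)  // Integer division
       .collect()
}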

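// Map each word to the 0-based indices of the lines that contain it.
// A line index is repeated when the word occurs more than once on that line.
def inverted_index(rdd: org.apache.spark.rdd.RDD[String]) = {
    rdd.zipWithIndex()
       .flatMap(pair => pair._1.split(" ").map(word => (word, pair._2)))
       .groupByKey()
       .collect()
}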

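// Count pairs of characters that occur together within a word.
// word.combinations(2) yields two-character combinations of each word,
// so this counts character pairs inside words rather than pairs of words.
def co_occurrence(rdd: org.apache.spark.rdd.RDD[String]) = {
    rdd.flatMap(line => line.split(" "))
       .flatMap(word => word.combinations(2))
       .map(pair => (pair.mkString(" "), 1))
       .reduceByKey(_ + _)
       .collect()
}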

word_count: (rdd: org.apache.spark.rdd.RDD[String])Array[(String, Int)]
word_length_count: (rdd: org.apache.spark.rdd.RDD[String])Array[(Int, Int)]
average_word_length: (rdd: org.apache.spark.rdd.RDD[String])Array[(Char, Int)]
inverted_index: (rdd: org.apache.spark.rdd.RDD[String])Array[(String, Iterable[Long])]
co_occurrence: (rdd: org.apache.spark.rdd.RDD[String])Array[(String, Int)]


In [12]:
word_count(rdd_capra)

res8: Array[(String, Int)] = Array((campa,1), (capra,2), (crepa,1), (la,4), (panca,2), (sopra,1), (sotto,1))


In [13]:
word_count(rdd_divinacommedia)

res9: Array[(String, Int)] = Array(("",10684), (!,1), (!".,2), (!',1), (!',,1), (!'.,4), (!';,1), (",9), (",,4), (".,1), ("A,9), ("Acci?,3), ("Adamo";,1), ("Adima,1), ("Al,2), ("Alcun,1), ("Alma,1), ("Almen,1), ("Altra,1), ("Altri,1), ("Ambo,1), ("Amme!",,1), ("Amore,,1), ("Anastasio,1), ("Ancor,3), ("Andate,1), ("Andiamo,1), ("Anima,1), ("Anzi,2), ("Apri,1), ("Aspetta,1), ("Assai,1), ("Attienti,1), ("Attienti,,1), ("Avaccio,1), ("Avante,1), ("Baldezza,1), ("Beati,1), ("Beato,2), ("Belacqua,,1), ("Ben,3), ("Bene,1), ("Benedetto,1), ("Brievemente,1), ("Buon,2), ("Capo,1), ("Caron,,1), ("Casella,1), ("Certo,2), ("Certo,,1), ("Cesare,1), ("Che,15), ("Chi,15), ("Chiedi,1), ("Chiedi",,1), ("Chiunque,1), ("Ciacco,,1), ("Cianfa,1), ("Ciascun,1), ("Ci?,1), ("Colui,3), ("Col?",,1), ("Com'?,1), (...
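
The `("",10684)` entry above comes from splitting on a single space: blank lines and leading spaces produce empty tokens. A possible variant, sketched here on three lines taken from the beginning of the poem, splits on runs of whitespace and drops empty tokens. The names `demoSpark`, `demoSc`, `rawTokens` and `cleanTokens` and the `local[*]` master are assumptions; punctuation attached to words is left untouched by this sketch.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption)
val demoSpark = SparkSession.builder().master("local[*]").appName("tokenization").getOrCreate()
val demoSc = demoSpark.sparkContext

// A blank line, an indented line, and a plain line
val lines = demoSc.parallelize(Seq(
  "",
  "  Nel mezzo del cammin di nostra vita",
  "mi ritrovai per una selva oscura"
))

// Splitting on a single space: the blank line and the two leading spaces yield "" tokens
val rawTokens = lines.flatMap(_.split(" ")).collect()

// Splitting on runs of whitespace, then removing the empty tokens that are left
val cleanTokens = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).collect()

println(rawTokens.count(_.isEmpty))  // 3 empty tokens
println(cleanTokens.length)          // 13 words
```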


In [14]:
word_length_count(rdd_capra)

res10: Array[(Int, Int)] = Array((2,4), (5,8))


In [15]:
word_length_count(rdd_divinacommedia)

res11: Array[(Int, Int)] = Array((4,9111), (16,3), (14,50), (0,10684), (6,11775), (8,5363), (12,370), (10,1741), (2,19258), (13,154), (15,18), (11,933), (1,6992), (17,1), (3,16887), (7,7379), (9,3231), (5,13504))


In [16]:
average_word_length(rdd_capra)

res12: Array[(Char, Int)] = Array((p,5), (l,2), (s,5), (c,5))


In [17]:
average_word_length(rdd_divinacommedia)

res13: Array[(Char, Int)] = Array((T,5), (d,4), (z,5), (",4), (L,3), (p,5), (R,6), (B,7), (P,5), (t,4), (b,5), (.,1), (h,2), (n,3), (f,4), (j,4), (v,5), ((,4), (Z,6), (F,6), (V,5), (:,1), (,,1), (X,3), (N,3), (r,7), (l,3), (D,4), (',2), (s,4), (e,1), (Q,5), (G,6), (M,4), (a,3), (O,3), (;,1), (A,5), (u,3), (I,4), (o,4), (i,3), (!,2), (q,5), (-,1), (S,5), (?,1), (C,5), (E,1), (?,3), (U,5), (g,5), (m,4), (c,4))
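
The averages above are integers because the sum of lengths and the word count are both `Int`, so the division truncates. If fractional averages are wanted, one option is to convert to `Double` before dividing, as in the sketch below. The four sample words and the names `demoSpark`, `demoSc`, `words` and `avg` are hypothetical and chosen only so that truncation would be visible.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption)
val demoSpark = SparkSession.builder().master("local[*]").appName("average-length").getOrCreate()
val demoSc = demoSpark.sparkContext

// Hypothetical words: three start with 'a' (lengths 1, 2, 4), one with 'b' (length 2)
val words = demoSc.parallelize(Seq("a", "ab", "abcd", "be"))

val avg = words
  .map(w => (w.head, (w.length, 1)))                                // (first char, (length, 1))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }  // (sum of lengths, count)
  .mapValues { case (sum, count) => sum.toDouble / count }          // floating-point division
  .collectAsMap()

println(avg('a'))  // 7 / 3 as a Double, where Int division would give 2
println(avg('b'))  // 2.0
```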


In [18]:
inverted_index(rdd_capra)

res14: Array[(String, Iterable[Long])] = Array((campa,CompactBuffer(0)), (la,CompactBuffer(0, 0, 1, 1)), (panca,CompactBuffer(0, 1)), (sotto,CompactBuffer(1)), (crepa,CompactBuffer(1)), (sopra,CompactBuffer(0)), (capra,CompactBuffer(0, 1)))
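
In the output above `la` is listed with `CompactBuffer(0, 0, 1, 1)` because it occurs twice on each line. If each line number should appear once per word, as in the expected result `(la, (0, 1))`, one possible variant removes duplicate `(word, line)` pairs before grouping. This is a sketch under assumptions: the names `demoSpark`, `demoSc` and `index` and the `local[*]` master are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption)
val demoSpark = SparkSession.builder().master("local[*]").appName("inverted-index").getOrCreate()
val demoSc = demoSpark.sparkContext

val lines = demoSc.parallelize(Seq(
  "sopra la panca la capra campa",
  "sotto la panca la capra crepa"
))

val index = lines.zipWithIndex()
  .flatMap { case (line, lineNo) => line.split(" ").map(word => (word, lineNo)) }
  .distinct()                  // drop repeated (word, line) pairs
  .groupByKey()
  .mapValues(_.toSeq.sorted)   // line numbers in ascending order
  .collectAsMap()

println(index("la"))     // line numbers 0 and 1, each once
println(index("sopra"))  // line number 0 only
```

Calling ```distinct``` adds a shuffle; deduplicating the words of each line before emitting the pairs (for example with `line.split(" ").distinct`) would be an alternative worth comparing.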


In [19]:
inverted_index(rdd_divinacommedia)

res15: Array[(String, Iterable[Long])] = Array((grand'avello,,CompactBuffer(1415)), (diseta,CompactBuffer(10716)), (vane.,CompactBuffer(8496)), (tonda,CompactBuffer(4337, 8276, 12431)), (blandimenti;,CompactBuffer(12079)), (sapore,CompactBuffer(7817)), (dando,CompactBuffer(4189, 7532, 8723)), (Verrucchio,,CompactBuffer(3758)), (Mantua,CompactBuffer(2758)), (m'apparvero,CompactBuffer(11847)), (disiderate,CompactBuffer(10196)), (dole,CompactBuffer(5454)), (moventi,CompactBuffer(10267)), (rincalzi,CompactBuffer(12933)), (freni,,CompactBuffer(2352)), (Voglia,CompactBuffer(10375)), (tormento,CompactBuffer(629, 1240, 2409, 6323, 6777)), (focina,CompactBuffer(1884)), (s?:,CompactBuffer(3305, 14101)), (marino,,CompactBuffer(5058)), (scalz?,CompactBuffer(11391)), (pensassi,CompactBuffer(8479)), ...


In [31]:
co_occurrence(rdd_capra)

res21: Array[(String, Int)] = Array((s o,2), (r p,1), (p n,2), (a a,5), (p r,3), (a c,2), (e a,1), (a m,1), (o o,1), (c m,1), (o a,1), (c a,4), (s a,1), (t t,1), (c e,1), (r e,1), (e p,1), (s r,1), (c r,3), (n c,2), (p c,2), (l a,4), (a n,2), (o p,1), (p a,4), (c p,4), (s t,1), (s p,1), (m p,1), (r a,2), (a p,3), (a r,2), (o r,1), (o t,1))


In [None]:
co_occurrence(rdd_divinacommedia)