# Prerequisites

Installing Spark

---



In [None]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!pip -q install findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()

Start a Spark session and print the version


---


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# create the session
spark = SparkSession \
        .builder \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.2.0'

Creating a tunnel<br>
**To check the Spark UI, open the URL printed by the command below: https://######/jobs/ or https://######/SQL/**


In [None]:
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(4040)") + "jobs/")

https://s77zum7b0ro-496ff2e9c6d22116-4040-colab.googleusercontent.com/jobs/


# Download Datasets

In [None]:
!mkdir -p /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/frankenstein.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/el_quijote.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/characters.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/planets.csv -P /dataset
!ls /dataset

characters.csv	el_quijote.txt	frankenstein.txt  planets.csv


# RDD

---



## Example 1

In [None]:
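# Create an RDD from a text file (one element per line).
# Named textFile because later cells refer to it by that name.
textFile = spark.sparkContext.textFile("/dataset/frankenstein.txt")
textFile.first()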

'FRANKENSTEIN'

In [None]:
textFile

/dataset/frankenstein.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0


Creating a parallelized collection
This is a quick way to create an RDD:

## Example 2

In [None]:
distData = spark.sparkContext.parallelize([25, 20, 15, 10, 5])
distData.reduce(lambda x, y: x + y)

75
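
Spark applies the function passed to `reduce` inside each partition first and then combines the partial results, which is why that function should be commutative and associative. The plain-Python sketch below imitates this without Spark. `partitioned_reduce` and the two partition layouts are hypothetical and only illustrate the idea; they do not show how Spark actually splits the data.

```python
from functools import reduce


def partitioned_reduce(partitions, op):
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


add = lambda x, y: x + y
sub = lambda x, y: x - y

# Two hypothetical ways the same five numbers could be split across partitions.
split_a = [[25, 20], [15, 10, 5]]
split_b = [[25], [20, 15, 10, 5]]

# Addition is associative and commutative: the split does not matter.
sum_a = partitioned_reduce(split_a, add)
sum_b = partitioned_reduce(split_b, add)

# Subtraction is neither: the result depends on the split.
diff_a = partitioned_reduce(split_a, sub)
diff_b = partitioned_reduce(split_b, sub)

print(sum_a, sum_b, diff_a, diff_b)
```

With addition both layouts give the same total, while subtraction gives a different value per layout, so an operator like that would produce unstable results in a real `reduce`.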

## Exercise 1
Count the number of lines in the `el_quijote.txt` file

---



In [None]:
quijote = spark.sparkContext.textFile("/dataset/el_quijote.txt")
quijote.count()

2186

## Exercise 2
Print the first line of the file `el_quijote.txt`

---



In [None]:
quijote.first()



'DON QUIJOTE DE LA MANCHA'

## Transformations and Actions in RDDs 

### Actions

### Example 3

In [None]:
print(quijote.count()) # Number of elements in RDD
print(quijote.first()) # First element in RDD

2186
DON QUIJOTE DE LA MANCHA


### Transformations

### Example 4

In [None]:
# ReduceByKey
lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b).cache()
counts.count()
counts.collect()
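
To make the semantics of `reduceByKey` concrete, here is a minimal plain-Python sketch that merges the values of equal keys with a two-argument function. The helper `reduce_by_key` and the sample lines are hypothetical. Spark performs the same merge in a distributed way, so this only models the result, not the execution.

```python
def reduce_by_key(pairs, func):
    """Merge the values of identical keys using func, like a local reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Hypothetical input: three lines, one of them repeated.
sample_lines = ["to be", "or not", "to be"]

# Same shape as in the cell above: every line becomes a (line, 1) pair.
sample_pairs = [(s, 1) for s in sample_lines]
sample_counts = reduce_by_key(sample_pairs, lambda a, b: a + b)

print(sample_counts)
```

Because each element is a whole line, the result counts repeated lines, not words.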

In [None]:
# SortByKey
sortedCounts = counts.sortByKey()  # avoid shadowing Python's built-in sorted()
sortedCounts.collect()

In [None]:
# Count the occurrences of each word
lines = spark.sparkContext.textFile("/dataset/frankenstein.txt")
contarPalabras = lines.flatMap(lambda a: a.split(' ')).countByValue()

for palabra, contador in contarPalabras.items():
  print("{} : {}".format(palabra, contador))
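
One detail worth checking in the word count above: `split(' ')` produces empty strings for blank lines and for repeated spaces, and those empty strings are counted like any other word. The sketch below, on made-up lines, compares it with the argument-less `split()`, using `collections.Counter` as a local stand-in for `countByValue`.

```python
from collections import Counter
from itertools import chain

# Hypothetical lines: one has a double space, one is blank.
sample = ["the  monster", "", "the creator"]

# Flatten the per-line word lists, as flatMap does.
tokens_space = list(chain.from_iterable(line.split(" ") for line in sample))
tokens_default = list(chain.from_iterable(line.split() for line in sample))

count_space = Counter(tokens_space)
count_default = Counter(tokens_default)

print(count_space)
print(count_default)
```

If the empty-string entry is unwanted, switching to `split()` or filtering out empty words before counting are two possible options.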

In [None]:
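# Get the top 10 words with more than 4 characters
Words = spark.sparkContext.textFile("/dataset/frankenstein.txt")

# Split lines into words, keep words longer than 4 characters, pair each with 1
WordsCount = Words.flatMap(lambda line: line.split(" ")) \
                  .filter(lambda word: len(word) > 4) \
                  .map(lambda word: (word, 1))
WordsCount.count()

# Sum the 1s per word: one (word, count) pair per distinct word
DistinctWordsCount = WordsCount.reduceByKey(lambda a, b: a + b)
DistinctWordsCount.count()

# Swap to (count, word) so that the ordering is by count
SortedWordsCount = DistinctWordsCount.map(lambda a: (a[1], a[0])).sortByKey()
# Print the 10 most frequent words; top() returns the largest elements in descending order
SortedWordsCount.top(10)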


[(540, 'which'),
 (187, 'could'),
 (177, 'would'),
 (174, 'their'),
 (152, 'should'),
 (130, 'these'),
 (122, 'before'),
 (107, 'might'),
 (105, 'myself'),
 (103, 'every')]
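
The swap to `(count, word)` matters because `top` compares whole elements: tuples are ordered by their first field, and ties fall back to the second. A rough local analogue, using `heapq.nlargest` on made-up counts, looks like this:

```python
import heapq

# Hypothetical (word, count) pairs, not taken from the real file.
word_counts = [("which", 3), ("could", 5), ("would", 5), ("their", 1)]

# Swap to (count, word) so the count drives the comparison.
swapped = [(count, word) for word, count in word_counts]

# Largest elements first; equal counts are ordered by the word itself.
top2 = heapq.nlargest(2, swapped)

print(top2)
```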

### Example 5

In [None]:
# Filter: keep only the lines that contain "the"

linesWithSpark = textFile.filter(lambda line: "the" in line)
linesWithSpark.count()

### Exercise 3
Get the word count for the file `frankenstein.txt`

---

### Exercise 4
Get the top 10 words with more than 4 characters

---



## Key/Value Pair RDD

---



### Example 6


---



In [None]:
charac_sw = spark.sparkContext.textFile("/dataset/characters.csv")
planets_sw = spark.sparkContext.textFile("/dataset/planets.csv")
charac_sw.take(10)

In [None]:
planets_sw.take(10)

In [None]:
from itertools import islice

charac_sw_noheader = charac_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)

planets_sw_noheader = planets_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)
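
The lambda used above removes the first element of partition 0 only, on the assumption that the CSV header is the first line of the first partition. The sketch below replays that logic on plain Python lists standing in for partitions. The rows and the `HEADER` placeholder are hypothetical.

```python
from itertools import islice


def drop_header(idx, it):
    """Skip the first element of partition 0; leave other partitions untouched."""
    return islice(it, 1, None) if idx == 0 else it


# Hypothetical partitions of a small file; only the first one starts with the header.
partitions = [["HEADER", "row-a", "row-b"], ["row-c", "row-d"]]

no_header = [list(drop_header(idx, iter(part))) for idx, part in enumerate(partitions)]

print(no_header)
```

Applying `islice` to every partition instead would silently drop one data row per partition, which is why the index check is needed.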

### Exercise 5
For each Star Wars character, get the population of the planet the character belongs to

---
