# Download Datasets

Use the curl command to download files from specified URLs and save them in the current directory:

In [0]:
%sh 
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/frankenstein.txt'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/el_quijote.txt'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/characters.csv'
curl -O 'https://raw.githubusercontent.com/masfworld/datahack_docker/master/zeppelin/data/planets.csv'

List the contents of the `/databricks/driver/` directory to confirm that the files were downloaded:

In [0]:
%sh
ls /databricks/driver/

# RDD

---



## Basics

### Example 1 - Create an RDD

Reads the contents of frankenstein.txt into an RDD and displays the first line of the file:

In [0]:
textFile = spark.sparkContext.textFile('file:/databricks/driver/frankenstein.txt')
display(textFile.first())


### Creating a parallelized collection
Parallelizing an in-memory collection is a quick way to create an RDD:

### Example 2 - Parallelize

1. **Creates an RDD**: It initializes an RDD named `distData` with the list of integers `[25, 20, 15, 10, 5]` using the `parallelize` method of Spark's `SparkContext`.

2. **Reduces the RDD**: It applies the `reduce` method to sum all the elements in the RDD. The lambda function `lambda x, y: x + y` specifies that the reduction operation should be summing the elements.

3. **Displays the result**: It uses the `display` function to show the result of the reduction, which is the sum of the elements in the list.

In [0]:
distData = spark.sparkContext.parallelize([25, 20, 15, 10, 5])
display(distData.reduce(lambda x, y: x + y))
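
To make the reduction step more concrete, here is a minimal plain-Python sketch (no Spark required) of how a distributed `reduce` can be thought of: each partition is reduced on its own, and the partial results are then combined. The two-partition split below is an assumption made for illustration; Spark decides the actual layout.

```python
from functools import reduce

# Hypothetical split of the list into two partitions; Spark chooses its own layout.
partitions = [[25, 20, 15], [10, 5]]

add = lambda x, y: x + y

# Step 1: each partition is reduced independently (in Spark, on the executors).
partial_sums = [reduce(add, part) for part in partitions]

# Step 2: the partial results are combined into the final value.
total = reduce(add, partial_sums)

print(partial_sums)  # [60, 15]
print(total)         # 75
```

Because the grouping of elements is not fixed, the function passed to `reduce` should be commutative and associative. Addition qualifies; subtraction would not, since its result would depend on how the data happens to be partitioned.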

### Exercise 1 - Count the number of lines
Count the number of lines in the `el_quijote.txt` file.


In [0]:
# Load the text file 'el_quijote.txt' into an RDD named 'textfile_quijote'
textfile_quijote = spark.sparkContext.textFile("file:/databricks/driver/el_quijote.txt")

# Count the number of lines in the RDD and print the result
print("Number of lines: " + str(textfile_quijote.count()))

### Exercise 2 - Print the first line
Print the first line of the file `el_quijote.txt`

In [0]:
display(textfile_quijote.first())

## Transformations and Actions in RDDs 

### Actions
Actions trigger the execution of transformations to produce a result. They perform computations and send the results back to the driver program or save them to an external storage system. Examples include:
  - `count()`, which returns the number of elements in the RDD.
  - `collect()`, which returns all the elements of the RDD to the driver.
  - `saveAsTextFile()`, which writes the data to a text file.
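  - `reduce()`, which aggregates the elements of the RDD with a function that should be commutative and associative.

**Usage**: Actions are used to either save a result to some location or display it.
> Use actions with care: `collect()` brings every element of the RDD into the driver's memory, so avoid it on large datasets in production applications, where it can cause an out-of-memory error. Prefer `take(n)` when a sample is enough.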

#### Example 3 - Count and First

In [0]:
print(textFile.count()) # Number of elements in the RDD
print(textFile.first()) # First element of the RDD

### Transformations
- **Operations over RDDs that return a new RDD**: Transformations are operations that create a new RDD from an existing one. 
- **Lazy Evaluation**: They are lazily evaluated, meaning they only define a new RDD without immediately computing it. 
  - Only computed when an action requires a result to be returned to the driver program.
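  - Note: Some transformations, such as `sortByKey`, still return a new RDD but can launch a job as soon as they are called, because Spark samples the data to compute the range partition boundaries.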
- Examples include:
  - `map()`, which applies a function to each element in the RDD.
  - `filter()`, which returns a new RDD containing only the elements that satisfy a given condition.
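
Lazy evaluation can be illustrated without Spark by using a Python generator, which behaves in a loosely similar way: building the pipeline does no work, and consuming it plays the role of an action. This is only an analogy; the helper `traced_upper` and the sample lines are made up for the sketch.

```python
log = []  # records each time the mapping function actually runs

def traced_upper(line):
    log.append(line)
    return line.upper()

lines = ["the monster", "a letter", "the ice"]

# Defining the pipeline does no work yet, much like chaining filter() and map() on an RDD.
pipeline = (traced_upper(l) for l in lines if "the" in l)
work_before = len(log)

# Consuming the generator plays the role of an action such as collect().
result = list(pipeline)
work_after = len(log)

print(work_before)  # 0
print(work_after)   # 2
print(result)       # ['THE MONSTER', 'THE ICE']
```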

> Note: Spark SQL transformations are a different kind of transformation from the RDD transformations described here.


#### Example 4 - ReduceByKey and SortByKey

In [0]:
# ReduceByKey

# Load the text file 'frankenstein.txt' into an RDD named 'lines'
lines = spark.sparkContext.textFile("file:/databricks/driver/frankenstein.txt")

# Map each line in the RDD to a pair (line, 1), creating an RDD of pairs
pairs = lines.map(lambda s: (s, 1))

# Reduce the pairs by key (the lines), summing the counts for each unique line
# Cache the resulting RDD to optimize subsequent actions
counts = pairs.reduceByKey(lambda a, b: a + b).cache()

# Count the number of unique lines (keys) in the RDD
counts.count()

display(counts.collect()) # Collect the RDD to the driver and display the result

In [0]:
# SortByKey

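# Sort the RDD 'counts' by key (the lines) and store the result in 'sorted_counts'
# (the name 'sorted' is avoided because it would shadow Python's built-in sorted())
sorted_counts = counts.sortByKey()

# Collect the sorted RDD to the driver and display the result
display(sorted_counts.collect())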

#### Example 5 - Filter

In [0]:
# Filter

# Filter the RDD 'textFile' to keep only lines that contain the substring "the"
# (this is a substring match, so words such as "then" or "other" also match)
linesWithThe = textFile.filter(lambda line: "the" in line)

# Count the number of lines that contain "the" and display the result
display(linesWithThe.count())

#### Exercise 3 - Word count
Get the word count for the file `frankenstein.txt`

In [0]:
# Load the text file 'frankenstein.txt' into an RDD named 'words'
# (at this point each element is still a full line of text)
words = spark.sparkContext.textFile("file:/databricks/driver/frankenstein.txt")

# Split each line into words, count each word, and sort by count in descending order
words.flatMap(lambda x: x.split(" ")) \
.map(lambda s: (s, 1)) \
.reduceByKey(lambda a, b: a + b) \
.map(lambda x: (x[1], x[0])) \
.sortByKey(False) \
.collect()

1. **Flattening and Mapping**:
   - `.flatMap(lambda x: x.split(" "))`: This operation splits each line of text into words based on spaces, creating a new RDD where each element is a word.
   - `.map(lambda s: (s, 1))`: This maps each word to a tuple `(word, 1)`. This transformation prepares the data for the next step, where we will count the occurrences of each word.

2. **Reducing by Key**:
   - `.reduceByKey(lambda a, b: a + b)`: This reduces the tuples by key (the word), summing up the counts (values) for each word. After this operation, each unique word will have a count representing how many times it appeared in the text.

3. **Swapping and Sorting**:
   - `.map(lambda x: (x[1], x[0]))`: This swaps the position of each tuple so that the count is the key and the word is the value. This transformation is done to prepare for sorting by the word count.
   - `.sortByKey(False)`: This sorts the RDD by the key (the counts) in descending order (`False` indicates descending order). Now, the RDD elements are sorted such that words with higher counts appear first.

4. **Collecting the Result**:
   - `.collect()`: This action collects all the elements of the RDD (now sorted by word count) to the driver node. The result is returned as a list of tuples, where each tuple contains the count of occurrences and the corresponding word.
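
The same counting logic can be sketched in plain Python to check it on a small input. The sample lines are made up; the sketch also shows a side effect of splitting on a single space, which produces empty strings when a line contains consecutive spaces. Calling `split()` with no argument avoids this.

```python
from collections import Counter

lines = ["the cat  sat", "the dog"]  # made-up sample; note the double space

# Same logic as the Spark chain: split on a single space and count each token.
words_single_space = [w for line in lines for w in line.split(" ")]
counts = Counter(words_single_space)

# Swap to (count, word) and sort in descending order.
# Python sorts whole tuples here, so ties are broken by the word;
# sortByKey compares only the key.
by_count = sorted(((n, w) for w, n in counts.items()), reverse=True)

# Splitting on any run of whitespace drops the empty tokens.
words_whitespace = [w for line in lines for w in line.split()]

print(by_count[0])               # (2, 'the')
print("" in words_single_space)  # True
print("" in words_whitespace)    # False
```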


#### Exercise 4 - Get top 10 words
Get the 10 most frequent words that have more than 4 characters.


In [0]:
words \
.flatMap(lambda line: line.split(" ")) \
.filter(lambda word: len(word) > 4) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b) \
.map(lambda x: (x[1], x[0])) \
.sortByKey(False) \
.take(10)

- `.flatMap(lambda line: line.split(" "))`: This operation splits each line into words based on spaces and creates a new RDD where each element is a word.
- `.filter(lambda word: len(word) > 4)`: This filters out words whose length is less than or equal to 4 characters. Only words longer than 4 characters will pass through to the next step.
- `.map(lambda word: (word, 1))`: This maps each word to a tuple `(word, 1)`, where `1` represents the count of occurrences of that word.
- `.reduceByKey(lambda a, b: a + b)`: This reduces the tuples by key (the word), summing up the counts (values) for each word. After this operation, each unique word will have a count representing how many times it appeared in the text, but only for words longer than 4 characters.
- `.map(lambda x: (x[1], x[0]))`: This swaps the position of each tuple so that the count is the key and the word is the value. This transformation is done to prepare for sorting by the word count.
- `.sortByKey(False)`: This sorts the RDD by the key (the counts) in descending order (`False` indicates descending order). Now, the RDD elements are sorted such that words with higher counts appear first.
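- `.take(10)`: This action returns the first 10 elements of the sorted RDD. These elements represent the 10 most frequent words longer than 4 characters, ordered by frequency.

The same result can be obtained with `top()`, which removes the need for the swap and `sortByKey` steps: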



In [0]:
words \
.flatMap(lambda line: line.split(" ")) \
.filter(lambda word: len(word) > 4) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b) \
.top(10, key=lambda x: x[1])

- `flatMap(lambda line: line.split(" "))`: Splits each line of text into words based on spaces and creates a new RDD where each element is a word.

- `filter(lambda word: len(word) > 4)`: Filters out words whose length is less than or equal to 4 characters. Only words longer than 4 characters are retained.

- `map(lambda word: (word, 1))`: Maps each word to a tuple `(word, 1)`, where `1` represents the count of occurrences of that word.

- `reduceByKey(lambda a, b: a + b)`: Reduces the tuples by key (the word), summing up the counts (values) for each word. After this operation, each unique word will have a count representing how many times it appeared in the text.

- `.top(10, key=lambda x: x[1])`: Retrieves the top 10 elements from the RDD based on the specified key function `lambda x: x[1]`. Here, `x[1]` denotes the count associated with each word tuple `(word, count)`. The elements are retrieved in descending order of their counts.
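
One way to picture why a top-N query does not need a full sort is the following plain-Python sketch: each partition keeps only its own N largest elements, and the small candidate lists are then merged. The partitions and counts below are made up, and the sketch shows the general idea rather than Spark's exact internals.

```python
import heapq

# Hypothetical (word, count) pairs spread over two partitions.
partitions = [
    [("which", 5), ("their", 9), ("could", 2)],
    [("would", 7), ("should", 1), ("there", 8)],
]
by_count = lambda pair: pair[1]
n = 2

# Each partition keeps only its own n largest elements...
local_tops = [heapq.nlargest(n, part, key=by_count) for part in partitions]

# ...and the small candidate lists are merged into the global top n.
candidates = [pair for top in local_tops for pair in top]
merged = heapq.nlargest(n, candidates, key=by_count)

print(merged)  # [('their', 9), ('there', 8)]
```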


## Key/Value Pair RDD

---



- Spark provides specialized operations for RDDs that store data as (key, value) pairs, often referred to as Pair RDDs.
- These operations process the values of each key in parallel and aggregate them efficiently across the network. For example, `reduceByKey()` aggregates data locally on each partition before shuffling it across the network, which improves performance for tasks such as counting or aggregating values by key.
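
The benefit of aggregating locally before the shuffle can be sketched in plain Python. The partitions below are made up and `combine_locally` is a hypothetical helper; the point is only that fewer records need to cross the network once each partition has been pre-aggregated.

```python
from collections import defaultdict

# Hypothetical partitions of (word, 1) pairs.
partitions = [
    [("the", 1), ("a", 1), ("the", 1)],
    [("the", 1), ("a", 1)],
]

def combine_locally(pairs):
    """Sum the values per key within a single partition."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

local = [combine_locally(p) for p in partitions]

# Only the pre-aggregated pairs need to be shuffled and merged.
final = defaultdict(int)
for partial in local:
    for key, value in partial.items():
        final[key] += value
final = dict(final)

records_before = sum(len(p) for p in partitions)
records_shuffled = sum(len(d) for d in local)

print(final)             # {'the': 3, 'a': 2}
print(records_before)    # 5
print(records_shuffled)  # 4
```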

### Example 6 - Create the RDD and remove the header



In [0]:
# Load the text file 'characters.csv' into an RDD named 'charac_sw'
charac_sw = spark.sparkContext.textFile("file:/databricks/driver/characters.csv")

# Load the text file 'planets.csv' into an RDD named 'planets_sw'
planets_sw = spark.sparkContext.textFile("file:/databricks/driver/planets.csv")

# Take the first 10 elements from the RDD 'charac_sw' and display them
charac_sw.take(10)

In [0]:
# Take the first 10 elements from the RDD 'planets_sw' and display them
planets_sw.take(10)

In [0]:
from itertools import islice

# Remove the header from 'charac_sw' RDD using mapPartitionsWithIndex
charac_sw_noheader = charac_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)

# Remove the header from 'planets_sw' RDD using mapPartitionsWithIndex
planets_sw_noheader = planets_sw.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it)

This operation removes the header from the RDD by skipping the first element in the first partition:

- `mapPartitionsWithIndex` allows processing each partition of the RDD with an index.
- `lambda idx, it: islice(it, 1, None) if idx == 0 else it`: 
  - For the partition with index `idx == 0` (first partition), `islice(it, 1, None)` skips the first element (header) and returns the rest of the elements.
  - For other partitions (`else it`), it returns all elements unchanged.
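
The behavior of the lambda can be checked locally by treating a list of lists as the partitions. The rows below are made up; the sketch assumes, as the code above does, that the header is the first element of partition 0.

```python
from itertools import islice

# Hypothetical partitions of a CSV file; the header sits at the start of partition 0.
partitions = [
    ["name,height", "row1,10"],
    ["row2,20", "row3,30"],
]

# Same function as the one passed to mapPartitionsWithIndex above.
drop_header = lambda idx, it: islice(it, 1, None) if idx == 0 else it

no_header = [list(drop_header(i, iter(p))) for i, p in enumerate(partitions)]

print(no_header)  # [['row1,10'], ['row2,20', 'row3,30']]
```

This approach removes exactly one line. If several files with their own headers were loaded into the same RDD, the remaining headers would stay in the data.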

### Exercise 5 - Join Pair RDDs
For each Star Wars character, get the population of the character's home planet.

In [0]:
# Create pairs of (planet_name, population) from 'planets_sw_noheader' RDD
# (column 8 is taken to be the population, as the exercise asks for it)
planets_sw_pair = planets_sw_noheader \
.map(lambda line: line.split(";")) \
.map(lambda x: (x[0], x[8]))

# Create pairs of (planet_name, character_name) from 'charac_sw_noheader' RDD
# (the planet name is the key so that both RDDs can be joined on it)
characters_sw_pair = charac_sw_noheader \
.map(lambda line: line.split(",")) \
.map(lambda x: (x[8], x[0]))

# Join 'characters_sw_pair' and 'planets_sw_pair' RDDs on planet_name
# Retain only distinct records and select the first 10 records
characters_sw_pair\
.join(planets_sw_pair)\
.map(lambda x: (x[0], x[1][0], x[1][1]))\
.distinct()\
.take(10)

Create pairs of (planet_name, population) from the `planets_sw_noheader` RDD:
- `.map(lambda line: line.split(";"))`: splits each line into a list of strings using `";"` as the delimiter.
- `.map(lambda x: (x[0], x[8]))`: turns each list `x` into the key-value tuple `(x[0], x[8])`.
     - `x[0]` is the first column of the CSV data, taken here to be the planet name.
     - `x[8]` is the ninth column of the CSV data, taken here to be the planet's population, which is what the exercise asks for.

Create pairs of (planet_name, character_name) from the `charac_sw_noheader` RDD: the same pattern, with `","` as the delimiter. `x[8]` (taken here to be the character's home planet) becomes the key and `x[0]` (the character name) becomes the value, so both RDDs are keyed by planet name.

Join the `characters_sw_pair` and `planets_sw_pair` RDDs on planet_name:
- `characters_sw_pair.join(planets_sw_pair)`: performs an inner join of the two RDDs on their keys. Both RDDs are keyed by planet name: `characters_sw_pair` contains tuples of `(planet_name, character_name)` and `planets_sw_pair` contains tuples of `(planet_name, population)`.
- `.map(lambda x: (x[0], x[1][0], x[1][1]))`: after the join, each element `x` is a tuple `(planet_name, (character_name, population))`. This mapping flattens it to `(planet_name, character_name, population)`.
- `.distinct()`: removes duplicate tuples from the RDD. Two tuples are duplicates only if all three fields match.
- `.take(10)`: returns up to 10 elements of the RDD to the driver.
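
The shape of the joined data can be explored with a small plain-Python sketch of an inner join on the key. All names and values below are placeholders invented for the illustration, not rows from the CSV files.

```python
from collections import defaultdict

# Made-up rows, keyed by planet name as in the exercise.
characters = [("planet_a", "char_1"), ("planet_b", "char_2"), ("planet_x", "char_3")]
planets = [("planet_a", "1000"), ("planet_b", "unknown")]

# Index the planets by key so that matches can be looked up quickly.
planet_index = defaultdict(list)
for planet, value in planets:
    planet_index[planet].append(value)

# Inner join: a key that is missing on either side produces no output.
joined = [
    (planet, (character, value))
    for planet, character in characters
    for value in planet_index.get(planet, [])
]

# Flatten (key, (left, right)) into a three-field tuple, as the map step does.
flattened = [(k, v[0], v[1]) for k, v in joined]

print(flattened)
# [('planet_a', 'char_1', '1000'), ('planet_b', 'char_2', 'unknown')]
```

Note that `char_3` is dropped because `planet_x` has no match. If unmatched characters should be kept, `leftOuterJoin` is the RDD operation to consider; it pairs them with `None`.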
