# Lab 02 - Spark RDD
Processing large text files using Spark

-------

# Ex 1

## Part 1.0 - creating RDD
Given the file containing words and respective frequencies, create the input RDD and separate the elements to obtain a pair RDD.
(The file is tab-separated: each line has the form `word\tfrequency\n`.)

In [4]:
path = '/data/students/bigdata_internet/lab2/word_frequency.tsv'

inRDD = sc.textFile(path)
sepRDD = inRDD.map(lambda l: (l.split('\t')[0], int(l.split('\t')[1])))
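
The mapping above splits each line twice. A minimal plain-Python sketch of an equivalent parsing step that splits only once is shown here; `parse_line` is a hypothetical helper name and the sample lines are toy values, not taken from the file. A function like this could be passed to `map()` in place of the lambda.

```python
def parse_line(line):
    """Turn a 'word<TAB>frequency' line into a (word, frequency) pair."""
    # Split once and unpack the two fields
    word, freq = line.split('\t')
    return word, int(freq)

# Toy input lines (hypothetical values)
sample_lines = ["hello\t3", "world\t7"]
pairs = [parse_line(line) for line in sample_lines]
print(pairs)  # [('hello', 3), ('world', 7)]
```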

### 1.0.1 - Draw 5 (random) samples from the RDD.
The `takeSample()` method is called with `False` as a parameter to take the elements without replacement.

In [5]:
# Drawing 5 random samples
out_five = sepRDD.takeSample(False, 5)
print(out_five)

[('Roast;', 30), ('soft-its', 1), ('Olives"', 2), ('most"', 2), ('oil-suspension', 1)]


### 1.0.2 - Pick the first 5 words in order of frequency (use top).
`top()` allows a function to be passed to choose the ordering to be used: in this case, since we want the most frequent words (words with highest value), the function only needs to extract the value from each pair.

In [6]:
# Drawing the 5 most recurring words
out_top = sepRDD.top(5, lambda couple: couple[1])
print(out_top)

[('the', 1630750), ('I', 1448619), ('and', 1237250), ('a', 1164419), ('to', 997979)]
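
`top(n, key)` returns the `n` largest elements according to the key, in descending order. On a small in-memory list, the same selection can be reproduced with `heapq.nlargest` from the Python standard library. The sketch below is plain Python, not Spark code, and the pairs are copied from the outputs shown above.

```python
import heapq

# Pairs taken from the sample and top-5 outputs above, in shuffled order
pairs = [('Roast;', 30), ('to', 997979), ('the', 1630750), ('soft-its', 1),
         ('I', 1448619), ('a', 1164419), ('and', 1237250)]

# Select the 3 pairs with the highest frequency (second element of each pair)
top3 = heapq.nlargest(3, pairs, key=lambda couple: couple[1])
print(top3)  # [('the', 1630750), ('I', 1448619), ('and', 1237250)]
```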


### 1.0.3 - Count how many elements the file contains.
It is possible to use the `count()` method on the RDD.

In [7]:
# Count words
n_elem = sepRDD.count()
print(f"There are {n_elem} elements inside the list")

There are 339819 elements inside the list


In [8]:
# Total number of word occurrences (for comparison in part 3.1)
numbersRDD = sepRDD.map(lambda couple: couple[1])
totalwords_1 = numbersRDD.reduce(lambda v1, v2: v1 + v2)
print(totalwords_1)

45444841


### 1.0.4 - Observe `word_frequency.tsv`.
It is actually a folder containing `_SUCCESS` (an empty marker file written when the job completes successfully), `part-00000` and `part-00001`, the typical output of a `saveAsTextFile()` operation. We can infer that the RDD from which this folder was written had 2 partitions, since each partition is saved as a separate part file.

## Part 1.1 - Filtering words starting with a specified prefix

Define the prefix (`'ho'`) and filter the RDD to keep only the elements whose key starts with the specified string.

In [9]:
prefix = 'ho'

In [10]:
filteredRDD = sepRDD.filter(lambda couple: couple[0].startswith(prefix))

### 1.1.1
Having filtered the RDD, count how many elements are left - using the `count()` method

In [11]:
n_filtered = filteredRDD.count()
print(f"The filtered RDD contains {n_filtered} words")

The filtered RDD contains 1519 words


### 1.1.2
Find the frequency of the most frequent word in the filtered RDD.
(This is one of the possible ways; see part 1.1.3.)

In [12]:
most_freq_1 = filteredRDD.max(lambda couple: couple[1])
print(f"The most frequent word in the filtered RDD (words beginning with '{prefix}') has frequency {most_freq_1[1]}")

The most frequent word in the filtered RDD (words beginning with 'ho') has frequency 36264


### 1.1.3
Two other ways to obtain the same value as in point 1.1.2:
* `top()` method
* `first()` method, after sorting the RDD with the `sortBy()` transformation. Sorting is performed according to `-1*value`, so that the highest frequency comes first

In [13]:
most_freq_2 = filteredRDD.top(1, lambda couple: couple[1])[0]
print(most_freq_2[1])

36264


In [14]:
sortedRDD = filteredRDD.sortBy(lambda couple: -1*couple[1])
most_freq_3 = sortedRDD.first()
print(most_freq_3[1])

36264


In [15]:
# Take one of the three values just found
maxfreq = most_freq_1[1]

## Part 1.2 - Filter most frequent words
Set the frequency threshold (`freq`) to 70% of the highest frequency (`maxfreq`) found in the previous point.

In [16]:
freq = .7*maxfreq
print(f"The threshold value has been set to {freq}")

The threshold value has been set to 25384.8


In [17]:
topFreqRDD = filteredRDD.filter(lambda line: line[1] >= freq)
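
The whole chain of Parts 1.1 and 1.2 (prefix filter, maximum, 70% threshold, second filter) can be followed on a small list. This is a plain-Python sketch rather than Spark code, and the word frequencies are hypothetical toy values.

```python
# Toy (word, frequency) pairs; the values are hypothetical
pairs = [('home', 100), ('hot', 80), ('hold', 69), ('apple', 500)]
prefix = 'ho'

# Keep only the words starting with the prefix
filtered = [c for c in pairs if c[0].startswith(prefix)]

# Highest frequency among the filtered words
maxfreq = max(filtered, key=lambda c: c[1])[1]

# Threshold at 70% of the maximum, then keep the words at or above it
freq = 0.7 * maxfreq
top_freq = [c for c in filtered if c[1] >= freq]
print(top_freq)  # [('home', 100), ('hot', 80)]
```

Note that `'apple'` is discarded by the prefix filter even though it has the highest frequency overall: the threshold is relative to the maximum of the filtered words only.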

## Part 1.3 - Count the remaining words and save the output

### 1.3.1
Count how many elements are left after both filtering operations (`count()` method)

In [18]:
n_remaining_words = topFreqRDD.count()
print(f"The remaining elements are {n_remaining_words}")

The remaining elements are 2


### 1.3.2
Isolate the keys and store the words in an output folder (`/user/s315054/lab02/results_01.txt`)

In [19]:
# Remove any previous output folder: saveAsTextFile() fails if the output path already exists
!hdfs dfs -rm -r /user/s315054/lab02/results_01.txt

23/01/11 16:33:38 INFO fs.TrashPolicyDefault: Moved: 'hdfs://BigDataHA/user/s315054/lab02/results_01.txt' to trash at: hdfs://BigDataHA/user/s315054/.Trash/Current/user/s315054/lab02/results_01.txt


In [20]:
outRDD = topFreqRDD.map(lambda c: c[0]+';')
print(outRDD.collect())

['hot;', 'how;']


In [21]:
outPath = '/user/s315054/lab02/results_01.txt'
outRDD.saveAsTextFile(outPath)

-----

# Ex 2

The program `lab02_ex02.py` is run from the terminal using `spark-submit`.
The program code is reported below:

    from pyspark import SparkContext, SparkConf
    import sys
    import time

    start = time.time()
    # The prefix is passed as a command line argument (first one)
    prefix = sys.argv[1]
    # The output path is the second command line argument 
    outputPath = sys.argv[2]
    # Input file path (HDFS)
    path = '/data/students/bigdata_internet/lab2/word_frequency.tsv'
    # Create SparkContext object
    conf = SparkConf().setAppName('Exercise 02, lab 02')
    sc = SparkContext(conf=conf)
    # Create pair RDD (sepRDD)
    inRDD = sc.textFile(path)
    sepRDD = inRDD.map(lambda l: (l.split('\t')[0], int(l.split('\t')[1])))
    # Isolate elements whose key starts with the specified prefix
    filteredRDD = sepRDD.filter(lambda couple: couple[0].startswith(prefix))
    # Produce output file
    outRDD = filteredRDD.map(lambda c: c[0]+' - '+str(c[1])+',')
    outRDD.saveAsTextFile(outputPath)
    stop = time.time()
    print(f"The program takes {stop-start} seconds to run")
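
The script reads `sys.argv` directly, so a missing argument surfaces as an `IndexError`. A possible alternative, sketched here with the standard `argparse` module, prints a usage message instead. `parse_cli` and the argument names are illustrative choices, not part of the original program.

```python
import argparse

def parse_cli(argv):
    """Parse the prefix and the output path from a list of arguments."""
    parser = argparse.ArgumentParser(description="Filter words by prefix")
    parser.add_argument("prefix", help="prefix the words must start with")
    parser.add_argument("output_path", help="HDFS folder for the results")
    return parser.parse_args(argv)

# In the script this would be parse_cli(sys.argv[1:]); a fixed list is used here
args = parse_cli(["ho", "/user/s315054/lab02/results_02.txt"])
print(args.prefix, args.output_path)
```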


The following cells contain terminal commands (prefixed with the `!` character).

In [25]:
# Remove any previous output folder: saveAsTextFile() fails if the output path already exists
!hdfs dfs -rm -r /user/s315054/lab02/results_02.txt

23/01/11 16:36:14 INFO fs.TrashPolicyDefault: Moved: 'hdfs://BigDataHA/user/s315054/lab02/results_02.txt' to trash at: hdfs://BigDataHA/user/s315054/.Trash/Current/user/s315054/lab02/results_02.txt1673454974466


In [23]:
# Running locally (on jupyter.polito.it)
!spark-submit --master local --deploy-mode client lab02_ex02.py ho /user/s315054/lab02/results_02.txt 

23/01/11 16:35:16 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/01/11 16:35:16 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/01/11 16:35:16 WARN util.Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
The program takes 4.231187343597412 seconds to run                              


In [26]:
# Running on the cluster
!spark-submit --master yarn lab02_ex02.py ho /user/s315054/lab02/results_02.txt

23/01/11 16:36:18 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/01/11 16:36:18 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/01/11 16:36:18 WARN util.Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
23/01/11 16:36:27 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
The program takes 18.167657136917114 seconds to run                             


## 2.1
When run locally, the program takes approximately 4.2 seconds, while on the cluster (`--master yarn`) it takes about 18.2 seconds. The difference is probably due to the overhead of requesting resources from YARN and starting the executors on the cluster nodes. Since the dataset is not particularly large, this overhead outweighs the benefit of parallel execution, and running the tasks locally is still more convenient.

## 2.2

In this case the program executes a single action (`saveAsTextFile()`), so caching would not improve performance: each transformation is evaluated only once, when that action is triggered.
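
Caching pays off only when the same RDD is consumed by more than one action. The plain-Python analogy below counts how many times the parsing step runs; it is not Spark code: a generator stands in for a chain of lazy transformations and a list stands in for a cached RDD. All names and data are hypothetical.

```python
parse_calls = 0

def parse(line):
    """Parse one line and count how many times parsing is performed."""
    global parse_calls
    parse_calls += 1
    word, freq = line.split('\t')
    return word, int(freq)

lines = ["hot\t3", "how\t5", "the\t9"]  # toy data

def pipeline():
    # Lazy, like a chain of transformations: nothing runs until it is consumed
    return (parse(line) for line in lines)

# One "action": every line is parsed exactly once, so caching has nothing to save
saved = list(pipeline())
calls_after_one_action = parse_calls

# A second "action" on the uncached pipeline repeats all the parsing
count = sum(1 for _ in pipeline())
calls_after_two_actions = parse_calls

# Reusing the materialized result (the analogue of a cached RDD) adds no work
count_cached = len(saved)
calls_after_cached_reuse = parse_calls

print(calls_after_one_action, calls_after_two_actions, calls_after_cached_reuse)  # 3 6 6
```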

-----

# Ex 3 - Bonus Task

Analyze the full file from which the pairs (word, frequency) were obtained.

In [29]:
filepath = "/data/students/bigdata_internet/lab2/finefoods_text.txt"

In [30]:
# Open file
inFullRDD = sc.textFile(filepath)

In [31]:
# Remove character '\x0c'
procRDD = inFullRDD.map(lambda line: line.replace('\x0c', ' '))

In [32]:
# Separate the words
allwordsRDD = procRDD.flatMap(lambda line: line.strip().split(' '))

# Issue: when multiple consecutive blank spaces are present,
# empty strings are added to the set (there are more than 5 million of them)
# Remove empty strings:
actualWordsRDD = allwordsRDD.filter(lambda word: word != '')

# Count the elements
totalWords = actualWordsRDD.count()
print(f"The total number of words is {totalWords} (including duplicates)")

The total number of words is 45444841 (including duplicates)


                                                                                

In [33]:
nrows = inFullRDD.count()
print(f"The number of rows of the input file is {nrows}")

The number of rows of the input file is 568454


                                                                                

## 3.1

As shown above, the total number of words is 45444841, which matches the total obtained by summing all the frequencies in the frequency file (part 1.0.3).

Note, however, that this file required some care: punctuation marks were probably removed by simply replacing them with blank spaces, so multiple consecutive blank spaces often appear. When splitting the lines with `line.split(' ')`, this causes empty strings (`''`) to be wrongly included as words, so it was necessary to filter them out.
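
The effect of consecutive blank spaces can be reproduced on a toy line (hypothetical text, not taken from the file). The last statement shows `split()` with no argument, which splits on runs of any whitespace; it is only an alternative under the assumption that no other whitespace character needs to be preserved, and it may not be exactly equivalent on the real file.

```python
line = "great  taste   and smell"   # toy line with repeated blank spaces

# Splitting on a single space yields empty strings between consecutive spaces
with_empty = line.split(' ')
print(with_empty)  # ['great', '', 'taste', '', '', 'and', 'smell']

# Filtering them out, as done in the notebook
words = [w for w in with_empty if w != '']
print(words)  # ['great', 'taste', 'and', 'smell']

# split() with no argument splits on runs of whitespace, so no empty strings appear
print(line.split())  # ['great', 'taste', 'and', 'smell']
```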

Another step was removing the form feed character `\x0c`, which apparently had not been removed beforehand. The reason will become clear in point 3.2: this character caused the program to misinterpret one word.

## 3.2
The following cells contain the steps used to rebuild the frequency file from the isolated words. A `subtract()` call then shows that no pair of the rebuilt RDD is missing from the original one; together with the equal totals found in point 3.1, this indicates that the two files are in fact identical.

(Notice that `sepRDD` was created at the beginning of this lab)
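
`subtract()` is one-directional: `a.subtract(b)` returns the elements of `a` that do not appear in `b`, so an empty result alone does not rule out extra elements in `b`. The plain-Python sketch below illustrates this with sets of toy pairs (hypothetical values). On RDDs the same idea would amount to checking the subtraction in both directions, or comparing the element counts as well.

```python
# Toy pair sets (hypothetical values): 'rebuilt' lacks one pair of 'original'
rebuilt = {('hot', 3), ('how', 5)}
original = {('hot', 3), ('how', 5), ('home', 2)}

# One-directional difference, analogous to rebuilt.subtract(original) on RDDs
print(rebuilt - original)   # set(): nothing in 'rebuilt' is missing from 'original'

# The opposite direction reveals the extra pair
print(original - rebuilt)   # {('home', 2)}

# The two collections are identical only if both differences are empty
identical = not (rebuilt - original) and not (original - rebuilt)
print(identical)            # False
```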

In [34]:
tmpRDD = actualWordsRDD.map(lambda word: (word, 1))
frequencyRDD = tmpRDD.reduceByKey(lambda v1, v2: v1+v2)

In [38]:
# Check that the elements are actually the same
diffRDD = frequencyRDD.subtract(sepRDD)
check = diffRDD.count()
print(f"The number of different elements is: {check}")

The number of different elements is: 0


                                                                                

While developing the program, one element was found to differ between the frequency file and the version obtained from the full review file. Upon further inspection, it turned out that one word contained the form feed character `\x0c`, which prevented it from being merged with the other occurrences of the same word by the `reduceByKey()` operation.
For this reason the character was removed, as shown in point 3.1.
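
The effect of a stray form feed on the word count can be reproduced with `collections.Counter` on a toy word list (hypothetical values). This is a plain-Python analogue of the `reduceByKey()` step, not Spark code.

```python
from collections import Counter

# Toy word list (hypothetical): one occurrence carries a trailing form feed
words = ["tea", "tea\x0c", "tea", "cup"]

# Counting as-is: 'tea' and 'tea\x0c' are treated as different keys
raw_counts = Counter(words)
print(raw_counts['tea'], raw_counts['tea\x0c'])  # 2 1

# Replace the form feed with a space, re-split and drop empty strings,
# mirroring the preprocessing used in the notebook
cleaned = [w for word in words
           for w in word.replace('\x0c', ' ').split(' ') if w != '']
clean_counts = Counter(cleaned)
print(clean_counts['tea'], clean_counts['cup'])  # 3 1
```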