# Lab 01

## Exercise 1

Store the content of a text file in an RDD, then split each element into its fields in a new RDD.
Finally, sum all the numbers.

### 1.1 Execute in Jupyter notebook

In [1]:
rdd = sc.textFile("/data/students/bigdata_internet/lab1/lab1_dataset.txt")
fields_rdd = rdd.map(lambda line: line.split(",")) 
value_rdd = fields_rdd.map(lambda l: int(l[1]))
value_sum = value_rdd.reduce(lambda v1, v2: v1+v2) 
print("The sum is:", value_sum)

The sum is: 46


                                                                                

In [2]:
fieldsList = fields_rdd.collect()
print(fieldsList)



[['alice', '4'], ['bob', '5'], ['john', '4'], ['alice', '3'], ['john', '8'], ['bob', '3'], ['alice', '7'], ['john', '9'], ['bob', '3']]


                                                                                

### Answers

1. Printed value: 46, i.e. the sum of the numeric values of all the pairs.


2. Lines:

    * Line 1: store the content of the file `lab1_dataset.txt` in a new RDD (one element per line)
    * Line 2: create a new RDD by splitting each line of the file at ","
    * Line 3: create a new RDD containing only the numbers (they are read as strings, so they must be cast to int)
    * Line 4: apply the `reduce()` method to sum all the numbers and store the result in a local Python variable

3. No. The driver program ran locally (on pyspark.polito.it) and not on the cluster nodes, so it was not stored on the distributed file system.

4. Changing the kernel to YARN makes the cluster nodes execute the program. The run takes longer, because the work must first be split among the available servers (the YARN scheduler does this automatically).

5. The job is now visible in Hue, since it was executed by the cluster nodes.
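
The four steps described in answer 2 can be reproduced without a cluster. The following is a minimal sketch in plain Python, assuming the file content matches the pairs printed by `collect()` above. The `lines` list is only a stand-in for the RDD, and `functools.reduce` plays the role of the RDD `reduce()` action.

```python
from functools import reduce

# Stand-in for the content of lab1_dataset.txt (taken from the collect() output)
lines = ["alice,4", "bob,5", "john,4", "alice,3", "john,8",
         "bob,3", "alice,7", "john,9", "bob,3"]

# Step 2: split every line at "," -> [name, value], both still strings
fields = [line.split(",") for line in lines]

# Step 3: keep only the numeric field and cast it to int
values = [int(f[1]) for f in fields]

# Step 4: combine the values pairwise, as reduce() does on an RDD
value_sum = reduce(lambda v1, v2: v1 + v2, values)
print("The sum is:", value_sum)
```

On the cluster the same lambda is applied inside each partition first and then across partitions, which is why the function passed to `reduce()` should give the same result regardless of how the values are grouped.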

---

### 1.2 Execute in a pyspark shell

With `%%bash`, the cell is interpreted as a sequence of terminal commands. The first line launches the PySpark interactive shell, and the lines up to `EOF` are passed to it as input and executed in that environment.
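
The mechanism of feeding lines to an interpreter through its standard input can be tried in isolation. The sketch below is only an analogy: it uses the plain Python interpreter as a stand-in for `pyspark`, and the `script` string is a made-up example playing the role of the text between `<<EOF` and `EOF`.

```python
import subprocess
import sys

# Text that a here-document would send to the program's standard input
script = 'x = 40 + 6\nprint("The sum is:", x)\n'

# The text is sent on stdin, just as the shell sends the here-document body;
# the "-" argument makes this interpreter read its program from stdin
result = subprocess.run([sys.executable, "-"], input=script,
                        capture_output=True, text=True, check=True)
shell_output = result.stdout
print(shell_output, end="")
```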

In [2]:
%%bash
pyspark --master local --deploy-mode client <<EOF
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Ex01_2")
sc = SparkContext(conf = conf)
rdd = sc.textFile("/data/students/bigdata_internet/lab1/lab1_dataset.txt")
fields_rdd = rdd.map(lambda line: line.split(",")) 
value_rdd = fields_rdd.map(lambda l: int(l[1]))
value_sum = value_rdd.reduce(lambda v1, v2: v1+v2) 
print("The sum is:", value_sum)
EOF

The sum is: 46


23/01/21 10:01:02 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/01/21 10:01:02 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
23/01/21 10:01:02 WARN util.Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
23/01/21 10:01:02 WARN util.Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
23/01/21 10:01:02 WARN util.Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
23/01/21 10:01:02 WARN util.Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
                                                                                

### Answers

1. `--master local` selects the local scheduler, so the local server performs the computation. It is equivalent to using PySpark (local) as the kernel.
2. `--deploy-mode client` means that the driver program is executed locally (on jupyter.polito.it). In this case, both the driver and the computation are hosted locally.


### 1.3 Create a Spark script and run it from the command line

The script is stored on the local file system and is run with `spark-submit`.

In [20]:
!spark-submit --master local --deploy-mode client 'lab01_ex1_1.py'

23/01/11 15:11:10 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/01/11 15:11:10 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
The sum is: 46                                                                  


### Answers

1. The `.txt` file is located in HDFS, while the script is stored on the file system of the local server (jupyter.polito.it).

---

## Exercise 2 - Manipulating HDFS

In [21]:
!hdfs dfs -ls /data/students/bigdata_internet/lab1

Found 1 items
-rwxrwx---+  3 trevisan students         62 2019-09-06 10:15 /data/students/bigdata_internet/lab1/lab1_dataset.txt


In [3]:
!hdfs dfs -get /data/students/bigdata_internet/lab1/lab1_dataset.txt /home/students/s315054/labs/lab01

get: `/home/students/s315054/labs/lab01/lab1_dataset.txt': File exists


In [31]:
!touch emptyFile; hdfs dfs -put emptyFile /user/s315054

put: `/user/s315054/emptyFile': File exists


### Answers 
1. No, since the file was copied: the local copy is not a mirror of the original stored in HDFS.
2. Path in HDFS: `hdfs://BigDataHA/user/s315054/`; Path on gateway local file system: `/home/students/s315054/`

---

## Exercise 3 - Running a job



In [32]:
# Before running again clear file location (saveAsTextFile does not overwrite)
!hdfs dfs -rm -r /user/s315054/lab01/results_3

rdd = sc.textFile("/data/students/bigdata_internet/lab1/lab1_dataset.txt")
fields_rdd = rdd.map(lambda line: line.split(','))   # Isolate each element of the row
# Sum all values related to the same key; the values are strings, so they are cast to int
reduced_rdd = fields_rdd.reduceByKey(lambda v1, v2: int(v1)+int(v2))
correct_rdd = reduced_rdd.map(lambda l: l[0]+','+str(l[1]))   # Create lines
sample_loc = correct_rdd.take(2)  # Take 2 elements from the RDD to check that the code works
for line in sample_loc:
    print(line)


# Same path as the one removed at the top of the cell
correct_rdd.saveAsTextFile('/user/s315054/lab01/results_3')

23/01/11 15:52:08 INFO fs.TrashPolicyDefault: Moved: 'hdfs://BigDataHA/user/s315054/lab01/results_3' to trash at: hdfs://BigDataHA/user/s315054/.Trash/Current/user/s315054/lab01/results_3


                                                                                

bob,11
john,21


### Answers

1. The code performs the following:
    * Creation of an RDD by reading the file `lab1_dataset.txt`
    * Creation of another RDD by isolating, for each row of the initial file, the name and the associated number
    * `reduceByKey` operation to sum the values associated with the same name (after casting the values to integer)
    * Creation of a new RDD in which each key-value pair is converted to a single string
    * Creation of an output folder in HDFS whose files contain each element of the final RDD as a line
    * The method `take(2)` is also used to print a couple of the strings produced (in order to check that the code works)

2. The output folder contains 2 text files (named `part-0000X`), one for each partition of the final RDD, plus an empty file named `_SUCCESS`, which only marks that `saveAsTextFile()` completed successfully.
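
The per-key sum can be checked locally. This is a minimal sketch in plain Python, assuming the file content shown in Exercise 1. The helper `reduce_by_key` is a hypothetical stand-in written for this illustration and is not part of the Spark API.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key pairwise with func (local stand-in for reduceByKey)."""
    merged = {}
    for key, value in pairs:
        # The first value of a key is stored as-is; func is only called from the second on
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Stand-in for the content of lab1_dataset.txt
lines = ["alice,4", "bob,5", "john,4", "alice,3", "john,8",
         "bob,3", "alice,7", "john,9", "bob,3"]

# Build (name, int(value)) pairs, so the reducer is a plain sum
pairs = [(name, int(value)) for name, value in (line.split(",") for line in lines)]
totals = reduce_by_key(pairs, lambda v1, v2: v1 + v2)

# One "name,total" string per key, like the lines written by saveAsTextFile
output_lines = [f"{name},{total}" for name, total in totals.items()]
print(output_lines)
```

Unlike the notebook cell, this sketch casts the values before reducing. The reducer is never called for a key that appears on a single line, so a cast placed inside the reducer would leave such a value as a string. With this dataset every name appears three times, so both variants give the same totals.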

---


## Exercise 4 (Bonus task)

In [40]:
!hdfs dfs -rm -r /user/s315054/lab01/results_4

rdd = sc.textFile("/data/students/bigdata_internet/lab1/lab1_dataset.txt")   # Create initial RDD from HDFS file (1 element per row)
fields_rdd = rdd.map(lambda line: line.split(','))   # Isolate each element of the row
reduced_rdd = fields_rdd.reduceByKey(lambda v1, v2: v1 + '-' + v2)   # Append all values
correct_rdd = reduced_rdd.map(lambda l: l[0]+','+str(l[1]))   # Create lines
sample_loc = correct_rdd.take(3)  # Take 3 elements from the RDD to check that the code works
for line in sample_loc:
    print(line)


correct_rdd.saveAsTextFile('/user/s315054/lab01/results_4')

23/01/11 16:01:09 INFO fs.TrashPolicyDefault: Moved: 'hdfs://BigDataHA/user/s315054/lab01/results_4' to trash at: hdfs://BigDataHA/user/s315054/.Trash/Current/user/s315054/lab01/results_4
bob,5-3-3
john,4-8-9
alice,4-3-7


### Comments
The program works much like the one in exercise 3. The difference is that `reduceByKey()` is not used to sum the values associated with the same key, but to assemble them into a single string by concatenating (`+` operator in Python) `'-'` and the next value. In this case the values need no cast to int, since concatenation operates on strings.

An alternative would have been `foldByKey`, but since the exercise did not specify the order of the values, `reduceByKey` was the more efficient choice.
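
The remark about ordering can be illustrated locally. String concatenation is associative but not commutative, so the final string depends on the order in which the values of a key are merged, and Spark gives no guarantee on that order across partitions. The sketch below is plain Python, with `functools.reduce` standing in for the per-key reduction and the values of `john` taken from the dataset.

```python
from functools import reduce

def join_values(v1, v2):
    # Same function passed to reduceByKey in the cell above
    return v1 + "-" + v2

# Values of the key "john" in file order
in_order = reduce(join_values, ["4", "8", "9"])

# The same values merged in a different order, as could happen across partitions
other_order = reduce(join_values, ["9", "4", "8"])

print(in_order, other_order)
```

Both strings contain the same values, which is acceptable here because the exercise does not prescribe an order.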