### Lab0 : Spark Word Count

### Topics : 

* RDD Creation
* RDD Transformations and Actions

### Example objective :

Given an input file, compute the number of occurrences of a particular word in the file.

### Reference :

* SPARK Reference Documentation: https://spark.apache.org/docs/2.3.1/programming-guide.html#rdd-operations


In [18]:
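import timeit
from operator import add
from pyspark.sql import SparkSession

# `sc` is used in the cells below: reuse the session the kernel provides, or create one
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext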

In [2]:
inputFile="data/server.log"

### Creating an RDD : by loading from a data source

In [3]:
lines = sc.textFile(inputFile)

In [21]:
# get the number of lines 
lines.count()

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 156 ms


105

In [5]:
# get the number of partitions for this RDD 
lines.getNumPartitions()

2

In [6]:
# Specify the word you want to search for
search_word='error'

### Apply Transformations and actions to compute the result

Transformations : 
    
1. flatMap() transformation : split each line on spaces and flatten the results into a single RDD of words
2. filter() transformation : keep only the words that contain the search word (the test is `search_word in x`, a substring match, so `[error]` matches `error`)
3. map() transformation : turn each remaining word into a (word, 1) tuple
4. reduceByKey() transformation : sum the counters (with add) of all tuples that share the same key (= distinct matching word), across all lines
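
To make the four steps concrete, here is a minimal plain-Python model of what each transformation does to the data. It is only an analogy that runs on an in-memory list, not Spark code, and the sample lines are hypothetical stand-ins for the contents of the log file.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample lines standing in for the contents of the log file
sample_lines = [
    "[info] server started",
    "[error] disk full",
    "[info] retrying",
    "[error] disk full again",
]
search_word = "error"

# 1. flatMap: split every line and flatten the results into one list of words
words = list(chain.from_iterable(line.split(" ") for line in sample_lines))

# 2. filter: keep the words that contain the search word
matches = [w for w in words if search_word in w]

# 3. map: pair each remaining word with a counter of 1
pairs = [(w, 1) for w in matches]

# 4. reduceByKey: sum the counters of identical keys
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

print(dict(counts))  # {'[error]': 2}
```

Note that the key is the full matching word (`[error]`), not the search word itself, which is why the notebook output below shows brackets.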

Action : 
    
1. collect() : return all elements of the computed RDD to the driver

Lazy Evaluation :

* Nothing is computed until an action such as collect() is called; transformations only record the lineage
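
Lazy evaluation can be illustrated without Spark by using Python generators, which also defer work until they are consumed. The sketch below is an analogy under that assumption; `split_line` and `log` are hypothetical helpers used only to record when work happens.

```python
log = []  # records every time a line is actually split


def split_line(line):
    log.append("split")
    return line.split(" ")


# Hypothetical sample lines
sample_lines = ["[error] disk full", "[info] ok"]

# Building the pipeline: generator expressions are lazy, like transformations
words = (w for line in sample_lines for w in split_line(line))
matches = (w for w in words if "error" in w)
calls_before = len(log)  # 0: no line has been split yet

# Consuming the pipeline plays the role of the action
result = list(matches)
calls_after = len(log)   # 2: one split per line, triggered by list()
```

One difference to keep in mind: a generator can be consumed only once, whereas an RDD is recomputed from its lineage on every action unless it is cached.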

In [7]:
counts_rdd = lines.flatMap(lambda x: x.split(' ')) \
        .filter(lambda x : search_word in x) \
        .map(lambda word : (word, 1)) \
        .reduceByKey(add)

### Inspect Job Execution

In [8]:
# See the RDD lineage
print(counts_rdd.toDebugString().decode("utf-8"))

(2) PythonRDD[7] at RDD at PythonRDD.scala:52 []
 |  MapPartitionsRDD[6] at mapPartitions at PythonRDD.scala:132 []
 |  ShuffledRDD[5] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(2) PairwiseRDD[4] at reduceByKey at <ipython-input-7-33f738cb430a>:1 []
    |  PythonRDD[3] at reduceByKey at <ipython-input-7-33f738cb430a>:1 []
    |  data/app.log MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
    |  data/app.log HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []


In [9]:
# The lineage shows that this Spark job will run in 2 stages.
# reduceByKey needs all items with the same key to be on the same partition,
# which requires shuffling data between partitions.
# The shuffle marks the boundary between the two stages.
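
The effect of the shuffle can be sketched with a simplified plain-Python model: each (key, value) pair is routed to a target partition chosen by hashing its key, so identical keys always meet in the same place. This is an illustration only. Python's built-in `hash` stands in for Spark's partitioner, the `shuffle` helper and the sample pairs are hypothetical, and the per-partition pre-aggregation that Spark performs before shuffling is skipped.

```python
from collections import defaultdict

NUM_PARTITIONS = 2  # same value as getNumPartitions() returned above


def shuffle(map_side_partitions, num_partitions):
    """Route each (key, value) pair to the partition chosen by hashing its key."""
    reduce_side = [defaultdict(int) for _ in range(num_partitions)]
    for partition in map_side_partitions:
        for key, value in partition:
            target = hash(key) % num_partitions
            reduce_side[target][key] += value
    return [dict(p) for p in reduce_side]


# Hypothetical pairs as they might sit in two partitions at the end of stage 1
stage_1_output = [
    [("[error]", 1), ("[error]", 1)],
    [("[error]", 1), ("[info]", 1)],
]

# After the shuffle, every key lives in exactly one partition with its total
stage_2_input = shuffle(stage_1_output, NUM_PARTITIONS)
```

Which partition a given key lands in depends on the hash, but the pairs for `[error]` that started in two different partitions always end up summed in a single one.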

In [10]:
errors = counts_rdd.collect()

In [11]:
for word, count in errors:
    print("%s: %i" % (word, count))

[error]: 5


### Room for optimization

In [12]:
# Now imagine we want to search for a set of words...
# Without caching, every search reloads the file from disk.
# cache() keeps the loaded lines in memory once the first action has run;
# the split is still recomputed per search, since only `lines` is cached.
cached_lines = lines.cache()
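
Because `lines.cache()` stores the raw lines, the `flatMap` split still runs for every search. One possible variation, sketched here with the hypothetical helper names `build_cached_words` and `count_matches`, is to cache the RDD of words instead and pass the search word as a parameter. It uses only the RDD methods already shown in this lab.

```python
from operator import add


def build_cached_words(lines_rdd):
    """Split the lines into words once and mark the resulting RDD for caching."""
    return lines_rdd.flatMap(lambda line: line.split(' ')).cache()


def count_matches(words_rdd, search_word):
    """Return (word, count) pairs for the words that contain search_word."""
    return (words_rdd
            .filter(lambda word: search_word in word)
            .map(lambda word: (word, 1))
            .reduceByKey(add)
            .collect())
```

In this notebook it could be used as `cached_words = build_cached_words(lines)` followed by `count_matches(cached_words, 'error')` and `count_matches(cached_words, 'info')`. The first action populates the cache, and later searches start from the cached words.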

In [13]:
# Now search for other words
search_word='info'
counts_rdd = cached_lines.flatMap(lambda x: x.split(' ')) \
        .filter(lambda x : search_word in x) \
        .map(lambda word : (word, 1)) \
        .reduceByKey(add)

In [14]:
infos = counts_rdd.collect()

In [37]:
for word, count in infos:
    print("%s: %i" % (word, count))

[info]: 96


### Further analysis

* In the Spark Web UI, inspect the Storage tab.
* You should see that the RDD has been cached in memory.
* Now run another operation, such as count().
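
The cache status can also be checked from the notebook itself. The sketch below wraps the `is_cached` attribute and the `getStorageLevel()` method of a PySpark RDD; the helper name `cache_status` is hypothetical.

```python
def cache_status(rdd):
    """Return (is_cached, storage level description) for an RDD."""
    return rdd.is_cached, str(rdd.getStorageLevel())
```

Calling `cache_status(cached_lines)` tells you whether the RDD is marked for caching and at which storage level. Being marked is not the same as being materialized: the data only appears in the Storage tab after an action has run on the RDD.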

In [17]:
cached_lines.count()

105
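
To compare how long an action takes before and after the cache is populated, a small timing helper can be used. This is a sketch built on the standard `timeit` module; the helper name `time_action` and its `repeat` parameter are assumptions, not part of the original lab.

```python
from timeit import default_timer


def time_action(action, repeat=3):
    """Run a zero-argument callable `repeat` times; return wall-clock seconds per run."""
    timings = []
    for _ in range(repeat):
        start = default_timer()
        action()
        timings.append(default_timer() - start)
    return timings
```

For example, `time_action(cached_lines.count)` returns one timing per run, so the first run (which may have to read the file) can be compared with the following ones.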

In [15]:
# Now search for other words
search_word='dummy'
counts_rdd = cached_lines.flatMap(lambda x: x.split(' ')) \
        .filter(lambda x : search_word in x) \
        .map(lambda word : (word, 1)) \
        .reduceByKey(add)

In [16]:
dummys = counts_rdd.collect()