## Cache and Persistence in storage

Caching and persistence are fundamental for speeding up Spark executions: an RDD used by several actions is otherwise recomputed each time. Several **Storage Levels** are available, such as:
1. MEMORY_ONLY
2. MEMORY_AND_DISK
3. DISK_ONLY
4. MEMORY_ONLY_2 or DISK_ONLY_2 (which replicate each partition on two cluster nodes, so that if one node fails the other copy is ready to serve the actions on the RDD)
5. OFF_HEAP (experimental)

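**NOTE:** both cache() and persist() are lazy: they only mark the RDD to be persisted and return a reference to it. The content is actually stored when the first action is executed on that RDD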

**NOTE 2:** using cache() is the same as using inRDD.persist(pyspark.StorageLevel.MEMORY_ONLY)

You can remove an RDD from the cache by calling **unpersist()**

In [None]:
# Example of exam question
inputRDD = sc.textFile("words.txt")

print("Num of words: ", inputRDD.count()) # first action
print("Num of distinct words: ", inputRDD.distinct().count()) # second action

**How many times will the system read the content of words.txt?**
    Two times: each action recomputes inputRDD from the file, since nothing is cached

In [None]:
# Example of exam question
inputRDD = sc.textFile("words.txt").cache()

print("Num of words: ", inputRDD.count()) # first action
print("Num of distinct words: ", inputRDD.distinct().count()) # second action

**And now?** Only one time: the first action reads the file and caches the RDD, the second action reuses the cached content

**NOTE:** If the file is too large to fit in main memory, only some of its partitions are cached. The file is then read one time, plus the extra reads needed to recompute the partitions that could not be stored in main memory
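The two exam questions above can be reproduced without a cluster by using a plain-Python model of lazy evaluation. This is only a sketch: `LazyLines` is a hypothetical class written for this illustration (it is not a Spark API), and it simply counts how many times the "file" is read.

```python
class LazyLines:
    """Toy stand-in for an RDD built from a text file (hypothetical, not Spark)."""

    def __init__(self, lines):
        self._source = list(lines)  # plays the role of words.txt
        self.reads = 0              # number of times the "file" has been read
        self._use_cache = False
        self._cached = None

    def cache(self):
        # As in Spark, only mark the data for caching: nothing is read yet
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached     # reuse the cached content
        self.reads += 1             # simulate reading the input file
        data = list(self._source)
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        # An action: it triggers the computation
        return len(self._compute())


words = ["spark", "cache", "spark"]

no_cache = LazyLines(words)
no_cache.count()
no_cache.count()
print("Reads without cache:", no_cache.reads)  # 2

with_cache = LazyLines(words).cache()
with_cache.count()
with_cache.count()
print("Reads with cache:", with_cache.reads)   # 1
```

The model ignores partitions and memory limits, so it does not cover the case in which only part of the file fits in main memory.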

## Accumulators
Accumulators are shared variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They are mainly used to compute simple statistics like counts or sums. The advantage is that you can compute the statistic while transforming your data, without an additional action.

In [1]:
# Accumulator example
# We want to select the lines containing a valid email (valid if it contains @)
# We also want to print on the standard output the number of invalid emails

# Two possible approaches: a filter plus a count (two actions), or an accumulator
sc
emails = ['arcangelo@frigiola.it', 'leonardo@maggio.com', 'giuseppe#esposito', 'mario@rossi.it', 'ayeye+brazorf']
invalidEmails = sc.accumulator(0)
emailsRDD = sc.parallelize(emails)

# Define filtering functions
def validEmailFunc(line):
    if (line.find('@')<0): 
        invalidEmails.add(1)
        return False
    else:
        return True
# Select only valid emails
# Count also the number of invalid emails 
validEmailsRDD = emailsRDD.filter(validEmailFunc).cache()

**Question: if we switch A and B, is the result the same?**

In [2]:
# A: Store valid emails in the output file 
validEmailsRDD.saveAsTextFile('res_accumulators/')

# B: Print the number of invalid emails
print("Invalid email addresses: ", invalidEmails.value)

Invalid email addresses:  2


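No! If we run B before A, the printed value is 0. Transformations are lazy, so validEmailFunc (and with it the accumulator update) is executed only when an action such as saveAsTextFile is called. If we print first, the filter has not been computed yet.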

In [3]:
print("Valid email addresses: ", validEmailsRDD.count())
print("Invalid email addresses: ", invalidEmails.value)

Valid email addresses:  3
Invalid email addresses:  2


In [4]:
print("Invalid email addresses: ", invalidEmails.value)
print("Valid email addresses: ", validEmailsRDD.count())

Invalid email addresses:  2
Valid email addresses:  3


**If we do not use cache(), the last prints produce wrong outputs: every action re-executes the filter, and the accumulator is incremented again each time**
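The interaction between lazy evaluation, caching and accumulators can be traced with a plain-Python model. This is a sketch under stated assumptions: `Counter` and `run_filter` are hypothetical helpers that imitate an accumulator and one execution of the filter, they are not Spark calls.

```python
class Counter:
    """Toy accumulator: a value that can only be added to (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def run_filter(records, predicate):
    """Simulate one action that executes the filter over all the records."""
    return [r for r in records if predicate(r)]


emails = ['arcangelo@frigiola.it', 'leonardo@maggio.com', 'giuseppe#esposito',
          'mario@rossi.it', 'ayeye+brazorf']
invalid = Counter()


def valid_email(line):
    if '@' not in line:
        invalid.add(1)
        return False
    return True


# Defining the filter executes nothing: the counter is still 0
before_action = invalid.value

# First action: the filter runs once and its result is kept (as with cache())
cached = run_filter(emails, valid_email)
after_first_action = invalid.value

# Second action on the cached result: the filter is not executed again
valid_count = len(cached)
after_cached_action = invalid.value

# Second action WITHOUT cache: the filter runs again and counts twice
run_filter(emails, valid_email)
after_uncached_action = invalid.value

print(before_action, after_first_action, after_cached_action, after_uncached_action)
# 0 2 2 4
```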

In [5]:
# We do the same thing but using foreach+accumulator

In [6]:
emails = ['arcangelo@frigiola.it', 'leonardo@maggio.com', 'giuseppe#esposito', 'mario@rossi.it', 'ayeye+brazorf']
invalidEmails = sc.accumulator(0)
emailsRDD = sc.parallelize(emails)

# Define filtering function
def invalidEmailFunc(line):
    if (line.find('@')<0):
        invalidEmails.add(1)

In [7]:
emailsRDD.foreach(invalidEmailFunc)

In [8]:
print("Invalid email addresses: ", invalidEmails.value)

Invalid email addresses:  2


**NOTE:** Pay attention to the number of actions executed: every action that re-executes the transformation updates the accumulator again, so its value may be incorrect

In [9]:
# Define an accumulator and initialize it to 0
discardedQuestion = sc.accumulator(0)

In [10]:
RDD1 = sc.parallelize([(1,'Q1'),(2,'Q2')])

In [11]:
# Define filtering function
def selectQuestion(pair):
    if(pair[0]==1):
        return True
    else:
        discardedQuestion.add(1)
        return False

In [13]:
RDD2 = RDD1.filter(selectQuestion)
RDD2.join(RDD2).collect()

[(1, ('Q1', 'Q1'))]

In [14]:
discardedQuestion.value

2

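How many times does the filter appear in the code? Only once... but the filter transformation is executed two times. RDD2 is used twice as input of the join and it is not cached, so it is computed once for each input. The pair (2,'Q2') is therefore discarded twice and the accumulator is equal to 2. **PAY ATTENTION, THIS IS A TYPICAL EXAM QUESTION**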

In [None]:
RDD2 = RDD1.filter(selectQuestion)
RDD1.join(RDD2).collect()

In this case the value of discardedQuestion will be 1 (starting from an accumulator initialized to 0).

We have RDD1,
we apply the **filter** and obtain RDD2,
then we apply the **join** between RDD1 and RDD2.

**In this case, how many paths are associated with the filter?**
The join has two input paths, but the filter transformation is included in only one of them!
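The two join cases can be compared with a plain-Python model in which every use of the filtered data recomputes it from scratch, as happens with an RDD that is not cached. The functions `source`, `filtered` and `join` are hypothetical helpers for this sketch, not Spark calls.

```python
discarded = 0  # plays the role of the discardedQuestion accumulator


def source():
    """Plays the role of RDD1."""
    return [(1, 'Q1'), (2, 'Q2')]


def filtered():
    """Plays the role of RDD2: recomputed on every call, like an uncached RDD."""
    global discarded
    kept = []
    for pair in source():
        if pair[0] == 1:
            kept.append(pair)
        else:
            discarded += 1
    return kept


def join(left, right):
    """Local inner join on the key of (key, value) pairs."""
    return [(k1, (v1, v2)) for k1, v1 in left for k2, v2 in right if k1 == k2]


# RDD2.join(RDD2): both input paths go through the filter
discarded = 0
result_self = join(filtered(), filtered())
discarded_self_join = discarded

# RDD1.join(RDD2): only one input path goes through the filter
discarded = 0
result_mixed = join(source(), filtered())
discarded_mixed_join = discarded

print(discarded_self_join, discarded_mixed_join)  # 2 1
```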

## Broadcast Variables
Broadcast variables are read-only shared variables instantiated in the driver and sent to all the worker nodes that use them in one or more Spark operations. A copy is stored in the main memory of each executor, so the content must fit there. Broadcast variables limit the amount of data sent on the network.

In [16]:
# Example

# Create an RDD from a txt file containing a dictionary of pairs (word, int value), one pair for each line.
# Suppose the content of this file is large but can be stored in main-memory

# Create an RDD from a txt file containing a set of words (a sentence for each line)
# Transform the content of the second file mapping each word to an integer based on the dictionary of the first file

In [None]:
# Read the content of the dictionary
dictRDD = sc.textFile('dictionary.txt')\
            .map(lambda line: (line.split(' ')[0], line.split(' ')[1]))

# Create a broadcast variable based on the content of dictRDD
# Note: broadcast can be instantiated only passing a local variable, not an RDD
dictionary = dictRDD.collectAsMap()

# Broadcast dictionary
dictionaryBroadcast = sc.broadcast(dictionary)

In [None]:
# Read the content of the second file 
textRDD = sc.textFile("document.txt")

# Define the function that is used to map strings to integers 
def myMapFunc(line):
    transformedLine=''
    
    for word in line.split(' '):
        intValue = dictionaryBroadcast.value[word] 
        transformedLine = transformedLine+intValue+' '

    return transformedLine.strip()

# Map words in textRDD to the corresponding integers and concatenate them
mappedTextRDD= textRDD.map(myMapFunc)

In [None]:
# Store the result in the output folder
# (outputPath is assumed to contain the path of the output folder)
mappedTextRDD.saveAsTextFile(outputPath)

Do you think the broadcast variable was really needed here? Not really, since the dictionary is used in only one transformation
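The mapping function above indexes the dictionary directly, so a word that does not appear in the dictionary would raise a `KeyError`. A possible variant of the lookup, shown here on a local dictionary so that it can be tried without Spark, is sketched below. The sample data, the name `map_line` and the `unknown` placeholder are assumptions made for this illustration.

```python
# Made-up sample content of the dictionary (word -> integer stored as string)
dictionary = {'hello': '1', 'world': '2'}


def map_line(line, lookup, unknown='?'):
    """Replace each word with its code, or with `unknown` if it is missing."""
    # split() without arguments also ignores repeated spaces
    return ' '.join(lookup.get(word, unknown) for word in line.split())


mapped_known = map_line('hello world', dictionary)
mapped_missing = map_line('hello big world', dictionary)

print(mapped_known)    # 1 2
print(mapped_missing)  # 1 ? 2
```

With a broadcast variable, the same function could receive `dictionaryBroadcast.value` as its `lookup` argument.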

## Broadcast Join
Broadcast join is similar to the join implemented in MapReduce for Hadoop. There we said that, if you want to perform the join operation on two input files and one of them is small, you can use a map-only join (one copy of the small file is shared with all the servers).

The join transformation is expensive in terms of execution time and amount of data sent on the network. If one of the two input RDDs of key-value pairs is small enough to be stored in main memory, we can use a more efficient solution based on a broadcast variable.

**ATTENTION:** this solution assumes that each key of the big file is associated with at most one line of the small file, since collectAsMap() keeps a single value for each key.

In [None]:
# Read the first input file
largeRDD = sc.textFile("post.txt").map(lambda line: (int(line.split(',')[0]), line.split(',')[1]) )

# Read the second input file
smallRDD = sc.textFile("profiles.txt").map(lambda line: (int(line.split(',')[0]), line.split(',')[1]) )

# Broadcast join version
# Store the "small" RDD in a local python variable in the driver
# and broadcast it
localSmallTable = smallRDD.collectAsMap()
localSmallTableBroadcast = sc.broadcast(localSmallTable)

In [None]:
# Function for joining a record of the large RDD with the matching
# record of the small one

def joinRecords(largeTableRecord):
    returnedRecords = []
    key = largeTableRecord[0]
    valueLargeRecord = largeTableRecord[1]
    
    if key in localSmallTableBroadcast.value:
        returnedRecords.append((key, (valueLargeRecord,
                                      localSmallTableBroadcast.value[key])))
    
    return returnedRecords

# Execute the broadcast join operation by using a flatMap
# transformation on the "large" RDD
userPostProfileRDDBroadcatJoin = largeRDD.flatMap(joinRecords) 

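**NOTE:** We use flatMap instead of map because the function **joinRecords** returns a list containing zero or one records: the empty list is returned when the key has no match in the small table, and flatMap emits nothing for it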
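The join logic can be checked locally, without Spark, by applying the same kind of function to a Python list and flattening the results. This is a sketch with made-up sample data: `join_record`, `small_table` and `large_records` are hypothetical names, and `itertools.chain.from_iterable` plays the role of flatMap.

```python
from itertools import chain

# Plays the role of the broadcast value (small table: key -> profile)
small_table = {1: 'profile_A', 2: 'profile_B'}

# Plays the role of the large RDD (key, post)
large_records = [(1, 'post_x'), (3, 'post_y'), (2, 'post_z'), (1, 'post_w')]


def join_record(large_record, lookup):
    """Return a list with the joined record, or an empty list if no match."""
    key, value = large_record
    if key in lookup:
        return [(key, (value, lookup[key]))]
    return []


# Flatten the per-record lists: records without a match simply disappear
joined = list(chain.from_iterable(
    join_record(record, small_table) for record in large_records))

for record in joined:
    print(record)
# (1, ('post_x', 'profile_A'))
# (2, ('post_z', 'profile_B'))
# (1, ('post_w', 'profile_A'))
```

The record with key 3 has no profile, so it produces an empty list and does not appear in the output, which is the behavior of an inner join.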