# Chapter 6 - Advanced Spark Programming

## Basics - Shared Variables

* Functions passed to Spark (e.g. <code>map</code>, or the condition for <code>filter</code>)
    - Can use variables defined outside them in the driver program
    - Each task running on the cluster gets a new copy of each variable
    - Updates to these copies are not propagated back to the driver
    - **Shared variables relax this restriction for two common types of communication patterns: aggregation of results and broadcasts**
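
The copy semantics described in this list can be imitated in plain Python, without Spark. The sketch below is only an analogy: it uses <code>pickle</code> to stand in for shipping a driver variable to tasks, and the names <code>run_task</code>, <code>partitions</code>, and <code>counter</code> are hypothetical, not part of any Spark API.

```python
import pickle

# A variable defined in the "driver" (hypothetical stand-in, no Spark involved)
counter = {"blank": 0}

def run_task(serialized_state, lines):
    # Each "task" deserializes its own private copy of the driver's variable
    local = pickle.loads(serialized_state)
    for line in lines:
        if line == "":
            local["blank"] += 1
    # The updated copy lives only inside this task
    return local["blank"]

# Two made-up partitions of input lines
partitions = [["W8PAL", ""], ["", "VE3UOW", ""]]
shipped = pickle.dumps(counter)
task_counts = [run_task(shipped, part) for part in partitions]

print(task_counts)  # each copy was updated independently: [1, 2]
print(counter)      # the driver's variable is unchanged: {'blank': 0}
```

Each task sees its own updates, but the driver never does, which is the gap that accumulators are meant to close.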

* **Accumulators**: aggregate values from worker nodes back to the driver program
    - One of the most common uses of accumulators: counting events that occur during job execution, for debugging purposes
    - It is possible to aggregate values from an entire RDD back to the driver program using actions like <code>reduce</code>, but sometimes we need a simple way to aggregate values that are generated, in the process of transforming the RDD, at a different scale or granularity than that of the RDD itself
    - Created in the driver by calling the <code>SparkContext.accumulator(initialValue)</code> method
        - Produces an accumulator holding an initial value
    - Worker code in Spark closures can add to the accumulator with its <code>+=</code> method (or <code>add</code> in Java)
    - The driver program can call <code>value</code> on the accumulator to access its value (or call <code>value()</code> and <code>setValue()</code> in Java)
    - Tasks on worker nodes cannot access the accumulator's <code>value</code>; from the point of view of these tasks, accumulators are <i>write-only</i> variables
    - For accumulators used in actions, Spark applies each task's update to each accumulator only once. Thus, if we want a reliable absolute value counter, regardless of failures or multiple evaluations, we must put it inside an action like <code>foreach</code>
    - For accumulators used in RDD transformations instead of actions, this guarantee does not exist; an accumulator update within a transformation can occur more than once
        - Example: an unintended multiple update occurs when a cached but infrequently used RDD is first evicted from the LRU cache and is subsequently needed again. This forces the RDD to be recalculated from its lineage, with the unintended side effect that the accumulator updates inside the transformations in the lineage are sent to the driver again
        - Only use accumulators inside transformations for debugging purposes
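
To see why an update inside a transformation can be applied more than once, the following plain-Python sketch imitates a lazy transformation that is recomputed each time its result is needed. It does not use Spark: <code>FakeAccumulator</code> and <code>lazy_filter</code> are hypothetical stand-ins written for this illustration only.

```python
class FakeAccumulator(object):
    """Hypothetical stand-in for an accumulator: callers only add to it."""
    def __init__(self, initial):
        self.value = initial

    def add(self, amount):
        self.value += amount

def lazy_filter(predicate, data):
    # Nothing runs yet; the returned function recomputes the result on every call,
    # like an uncached RDD being rebuilt from its lineage
    def compute():
        return [x for x in data if predicate(x)]
    return compute

calls = FakeAccumulator(0)

def has_a(sign):
    calls.add(1)  # side effect inside the "transformation"
    return "A" in sign

valid = lazy_filter(has_a, ["W8PAL", "W6BB", "K2AMH"])
print(calls.value)  # 0: the filter has not been evaluated yet

first = valid()
print(calls.value)  # 3: one update per element

second = valid()    # recomputation, e.g. after a cache eviction
print(calls.value)  # 6: every update was applied a second time
```

The filtered result is identical both times, but the counter has doubled, which is why such counts are best treated as debugging aids.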
        
**finish custom accum notes**
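* Integers with addition are one of Spark's built-in accumulator types; accumulators of type Double, Long, and Float are also supported out of the box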

* **Broadcast variables**: efficiently distribute large, read-only values to all worker nodes

### Accumulator Example - Empty Line Count 
Say we are loading a list of all the call signs we want to retrieve logs for from a file, but we are also interested in how many lines of the input file were blank (perhaps we do not expect to see many such lines in valid input).
* Create an <code>Accumulator[Int]</code> called <code>blankLines</code>
* Add 1 to it whenever we see a blank line in the input
* After evaluating the transformation, print the value of the counter
* Note that we only see the right count after we run an action (<code>saveAsTextFile</code> or <code>collect</code>) because the transformation is lazy



In [31]:
from __future__ import print_function

In [120]:
inputFile = 'callsigns'
f = sc.textFile(inputFile)
# print('text file contains: ', f.collect())
# Create Accumulator[Int] initialized to 0
blankLines = sc.accumulator(0)

def extractCallSigns(line):
    global blankLines  # make the global variable accessible
    if line == "":
        blankLines += 1
    # Note: a blank line still yields a single empty-string element
    return line.split(" ")

callSigns = f.flatMap(extractCallSigns)
# collect() is an action, so the lazy flatMap runs here and the accumulator is updated
print('callSigns:', callSigns.collect())
# f.flatMap(extractCallSigns).collect()
print('number of blank lines: ', blankLines.value)

callSigns: [u'W8PAL', u'stuff', u'', u'W6BB', u'VE3UOW', u'VE2CUA', u'VE2UN', u'OH2TI', u'GB1MIR', u'K2AMH', u'UA1LO', u'N7ICE', u'W8PAL']
number of blank lines:  1


### Accumulator Example - Error Count
For simplicity, we treat a call sign as valid if it contains the letter "A", and use two accumulators to count the valid and invalid call signs.

In [118]:
# create accumulators for validating error count example
validSignCount = sc.accumulator(0)
invalidSignCount = sc.accumulator(0)

def validateSign(sign):
    global validSignCount, invalidSignCount
    # A stricter check could use a regular expression instead:
    # if re.match(r"\A\d?[a-zA-Z]{1,2}\d{1,4}[a-zA-Z]{1,3}\Z", sign):
    if 'A' in sign:
        validSignCount += 1
        return True
    else:
        invalidSignCount += 1
        return False

# count the number of times we contacted each call sign
validSigns = callSigns.filter(validateSign)
contactCount = validSigns.map(lambda sign: (sign, 1)).reduceByKey(lambda x, y: x + y)
# Run an action to force evaluation so the accumulators are populated
validSigns.collect()
print(validSignCount.value)
# validSigns is not cached, so the actions below re-run validateSign
# and add to both accumulators again
print('validSigns: ', validSigns.collect())
print('contactCount: ', contactCount.collect())

5
validSigns:  [u'W8PAL', u'VE2CUA', u'K2AMH', u'UA1LO', u'W8PAL']
contactCount:  [(u'VE2CUA', 1), (u'K2AMH', 1), (u'W8PAL', 2), (u'UA1LO', 1)]
