# Accumulators

Create them by calling <code>sc.accumulator(initialValue)</code>:

In [45]:
testRDD = sc.parallelize((1,2,3,4),2)
print(testRDD.collect())
acc = sc.accumulator(0)
type(acc)

[1, 2, 3, 4]


pyspark.accumulators.Accumulator

In [46]:
def doStuff(x):
    global acc  # important keyword: "acc += 1" rebinds the name acc
    acc += 1
    return x

result = testRDD.map(doStuff).collect()
print(acc)

4


You can add to them by calling the <code>add()</code> method (Java, also available in Python) or by using the <code>+=</code> syntax (Scala and Python).

For the driver to access the value of the accumulator, call <code>acc.value()</code> in Java and <code>acc.value</code> in Scala and Python.

Remember that transformations are lazy: an action (e.g. <code>collect()</code> above) needs to be called first, otherwise the accumulator has not been updated yet.

Accumulators can be thought of as WRITE-ONLY variables from the workers' point of view. Results are written back to the driver, but the value cannot be read by tasks on the worker nodes.

### Fault Tolerance
Accumulators updated inside TRANSFORMATIONS are affected by dying or slow nodes: Spark may re-run or speculatively duplicate a task, so the same update can be applied more than once.

Accumulators updated inside ACTIONS are only updated once per task and are therefore accurate.

==> Accumulators in transformations should be used for debugging purposes only!
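
The double-counting effect can be imitated without a cluster. The following is a plain-Python sketch, not Spark code: `run_task` is a made-up helper, and the "retry" is simulated by simply running the same partition a second time and throwing away the first output.

```python
def run_task(partition, counter):
    """Process one partition and bump the shared counter once per element."""
    out = []
    for x in partition:
        counter["value"] += 1  # side effect, comparable to acc += 1 inside map()
        out.append(x)
    return out

partitions = [[1, 2], [3, 4]]
counter = {"value": 0}

results = []
for i, part in enumerate(partitions):
    if i == 1:
        # Pretend the first attempt on this partition was lost: its output is
        # discarded, but its side effects on the counter remain.
        run_task(part, counter)
    results.extend(run_task(part, counter))

print(results)           # [1, 2, 3, 4]
print(counter["value"])  # 6, although there are only 4 elements
```

The data itself comes out correct, only the counter is inflated, which is why such counters are fine for debugging but not for results you rely on.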

### Custom Accumulators

Besides integers, Spark supports Double, Float and Long types for accumulators. Custom accumulators must extend <code>AccumulatorParam</code>. In particular, the operation needs to be commutative and associative (i.e. a OP b = b OP a and (a OP b) OP c = a OP (b OP c)), because partial results are merged in no guaranteed order. Sum and max, for example, satisfy both properties.
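
To see why these two properties matter, here is a small sketch in plain Python. The class `MaxParam` only mirrors the <code>zero()</code> / <code>addInPlace()</code> method names of <code>AccumulatorParam</code>; it deliberately does not import PySpark, and the list `partials` is a made-up set of per-task results. In a real job the class would presumably subclass <code>pyspark.accumulators.AccumulatorParam</code> and be passed as the second argument to <code>sc.accumulator()</code>.

```python
from functools import reduce
from itertools import permutations

class MaxParam:
    """Mirrors the zero()/addInPlace() interface of AccumulatorParam."""

    def zero(self, value):
        # Neutral starting value for max
        return float("-inf")

    def addInPlace(self, v1, v2):
        # Merge two partial results; max is commutative and associative
        return max(v1, v2)

param = MaxParam()
partials = [3, 9, 4]  # hypothetical per-task partial results

# Every merge order gives the same answer.
merged = {reduce(param.addInPlace, order, param.zero(0))
          for order in permutations(partials)}
print(merged)  # {9}

# Subtraction is neither commutative nor associative: the result depends on order.
diffs = {reduce(lambda a, b: a - b, order) for order in permutations(partials)}
print(len(diffs) > 1)  # True
```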

# Broadcast Variables

Broadcast variables can be thought of as READ-ONLY variables. A typical use case would be if your application needs to send a large, read-only lookup table to all the nodes (or a large feature vector in an ML program).

Broadcast variables must be serializable. The variable will be sent to each node only once. Updates made on one node will not be propagated to other nodes (==> treat them as read-only).
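
Why local updates do not propagate can be illustrated with a plain-Python sketch. This is not Spark code: `worker_a` and `worker_b` are just local copies standing in for two nodes, and <code>pickle</code> is used here only as an example serializer.

```python
import pickle

lookup = {"+1": "USA", "+49": "Germany"}

# Shipping a value to a node means serializing it and rebuilding a copy there.
payload = pickle.dumps(lookup)
worker_a = pickle.loads(payload)
worker_b = pickle.loads(payload)

# A modification on one "worker" only touches that worker's copy.
worker_a["+44"] = "UK"

print("+44" in worker_a)  # True
print("+44" in worker_b)  # False
print("+44" in lookup)    # False
```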

In [4]:
contactCounts = sc.parallelize((('+1',5), ('+49',3),('+1',2),('+49',31)))
print(contactCounts.collect())

# WITHOUT BROADCASTING
signPrefixes = sc.parallelize([['+1','USA'],['+49','Germany']]).collectAsMap() 
print(signPrefixes)

def processSignCount(sign_count):
    country = signPrefixes[sign_count[0]]
    count = sign_count[1]
    return (country,count)
countryContactCounts = (contactCounts.map(processSignCount).reduceByKey(lambda x,y:x+y))
print(countryContactCounts.collect())


[('+1', 5), ('+49', 3), ('+1', 2), ('+49', 31)]
{'+49': 'Germany', '+1': 'USA'}
[('USA', 7), ('Germany', 34)]


In [5]:
# WITH BROADCASTING
signPrefixes = sc.parallelize([['+1','USA'],['+49','Germany']]).collectAsMap() 
signPrefixes = sc.broadcast(signPrefixes) # NEW: wrap the lookup table in a broadcast variable
print(type(signPrefixes))

def processSignCount(sign_count):
    country = signPrefixes.value[sign_count[0]] # NEW: read the table through .value
    count = sign_count[1]
    return (country,count)
countryContactCounts = (contactCounts.map(processSignCount).reduceByKey(lambda x,y:x+y))
print(countryContactCounts.collect())

<class 'pyspark.broadcast.Broadcast'>
[('USA', 7), ('Germany', 34)]


Broadcast variables are particularly useful when the variable (lookup table etc.) is LARGE (Spark's default task launching mechanism is optimized for small tasks!) and used in MULTIPLE PARALLEL OPERATIONS, because otherwise it is shipped again for each operation.

Spark uses Java serialization by default when sending variables over the network (in the Scala and Java APIs; PySpark pickles Python objects). This can be slow for anything other than arrays of primitive types. Alternatively, you can select a different serializer, such as Kryo, by changing the <code>spark.serializer</code> property, or build your own custom serializer.

# Working on a Per-Partition Basis 
## with <code>mapPartitions()</code>, <code>mapPartitionsWithIndex()</code>, <code>foreachPartition()</code>

<code>mapPartitions()</code> and <code>foreachPartition()</code> can be used as alternatives to <code>map()</code> & <code>foreach()</code>. They are called ONCE FOR EACH PARTITION with an iterator over that partition's elements, unlike <code>map()</code> & <code>foreach()</code>, which are called for each element in the RDD. The main advantage is that initialization (e.g. opening a database connection) can be done once per partition instead of once per element.
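
The saving in initialization work can be sketched in plain Python. Nothing here is Spark API: `expensive_setup`, `per_element` and `per_partition` are hypothetical names, and the nested list `partitions` only imitates how five elements might be split over three partitions.

```python
calls = {"setup": 0}

def expensive_setup():
    """Stand-in for costly initialization, e.g. opening a connection."""
    calls["setup"] += 1
    return lambda x: x * 10

def per_element(x):
    # map()-style: the setup runs again for every single element
    handler = expensive_setup()
    return handler(x)

def per_partition(iterator):
    # mapPartitions()-style: the setup runs once, then the iterator is consumed
    handler = expensive_setup()
    for x in iterator:
        yield handler(x)

partitions = [[1], [2, 3], [4, 5]]

calls["setup"] = 0
mapped = [per_element(x) for part in partitions for x in part]
calls_per_element = calls["setup"]

calls["setup"] = 0
mapped_partitions = [y for part in partitions for y in per_partition(iter(part))]
calls_per_partition = calls["setup"]

print(mapped == mapped_partitions)             # True
print(calls_per_element, calls_per_partition)  # 5 3
```

Both variants produce the same values; only the number of setup calls differs (one per element versus one per partition).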

In [48]:
numPartitions = 3
rdd = sc.parallelize([1,2,3,4,5],numPartitions)
print(rdd.collect())

def mapToPartitionIndex(index,iterator):
    return [str(x) + " -> " + str(index) for x in iterator]

mapped =   rdd.mapPartitionsWithIndex(mapToPartitionIndex)

print(mapped.collect())


[1, 2, 3, 4, 5]
['1 -> 0', '2 -> 1', '3 -> 1', '4 -> 2', '5 -> 2']


In [49]:
#Averaging without mapPartitions()
def combineCtrs(c1,c2):
    return (c1[0] +c2[0],c1[1] + c2[1])
def basicAvg(nums):
    '''compute the average'''
    result = nums.map(lambda num: (num,1)).reduce(combineCtrs)
    return result[0]/float(result[1])

print(rdd.map(lambda num: (num,1)).collect())
print(rdd.map(lambda num: (num,1)).reduce(combineCtrs))
print("result: ",basicAvg(rdd))
print(" ")

#Averaging with mapPartitions()
def partitionCtr(nums):
    '''compute [sum, count] for one partition'''
    sumCount = [0,0]
    for num in nums:
        sumCount[0] += num
        sumCount[1] += 1
    return [sumCount]
def fastAvg(nums):
    '''compute the average'''
    sumCount = nums.mapPartitions(partitionCtr).reduce(combineCtrs)
    return sumCount[0] / float(sumCount[1])

print('without creating (num,1) tuples:')
print("results on partitions: ",rdd.mapPartitions(partitionCtr).collect())
print("accumulated result: ",rdd.mapPartitions(partitionCtr).reduce(combineCtrs)) #compare to result from mapToPartitionIndex
print("result: ",fastAvg(rdd))

[(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
(15, 5)
result:  3.0
 
without creating (num,1) tuples:
results on partitions:  [[1, 1], [5, 2], [9, 2]]
accumulated result:  (15, 5)
result:  3.0


# Piping to External Programs

<code>pipe()</code> returns an RDD created by piping elements to a forked external process. The process needs to read from and write to the Unix standard streams. ==> <code>pipe()</code> is a transformation of an RDD: the external process reads each element from standard input as a <code>String</code>, and each line it writes to standard output becomes a <code>String</code> element of the resulting RDD.


echo.bat: 
<code>
@echo off
echo "stupid batch"
</code>

In [74]:
print(sc.parallelize(['1', '2', '', '3'],2).pipe('echo.bat').collect())
print(sc.parallelize(['1', '2', '', '3'],3).pipe('echo.bat').collect())

['"stupid batch"\r', '"stupid batch"\r']
['"stupid batch"\r', '"stupid batch"\r', '"stupid batch"\r']


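The script gets executed once per partition, on the executor that processes that partition, which is why the output above has 2 and 3 lines respectively. The trailing <code>\r</code> comes from the Windows line endings of the batch output.

What about R files? How does Spark know how to execute R (Perl, Fortran, ...) code? As far as I can tell it does not need to: <code>pipe()</code> only launches the given command through the operating system, so the command (e.g. an interpreter plus a script, or a compiled binary) has to be runnable on every worker node.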
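
What happens for a single partition can be imitated with the standard library. This is a rough sketch, not Spark's implementation: `pipe_partition` is a hypothetical helper, and the external program is a one-line Python script chosen only because a Python interpreter is guaranteed to be available here. Any executable that reads standard input and writes standard output would be handled the same way.

```python
import subprocess
import sys

def pipe_partition(elements, command):
    """Feed elements to `command` line by line and return its output lines."""
    stdin_text = "".join(str(e) + "\n" for e in elements)
    proc = subprocess.run(command, input=stdin_text,
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

# External program: upper-cases every line it reads from stdin.
script = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"
command = [sys.executable, "-c", script]

partitions = [["a", "b"], ["c"]]  # made-up data, one process per partition
result = [line for part in partitions for line in pipe_partition(part, command)]
print(result)  # ['A', 'B', 'C']
```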

# Numeric RDD Operations

Spark provides descriptive statistics on numeric RDDs. They are implemented with a streaming algorithm that builds up the result one element at a time. Calling <code>stats()</code> computes them in a single pass over the data and returns them as a <code>StatCounter</code> object. Available are:

<code>count(), mean(), sum(), max(), min(), variance(), sampleVariance(), stdev(), sampleStdev()</code>
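
A one-pass computation of this kind can be sketched with a Welford-style update. The class `RunningStats` below is an illustration in plain Python under that assumption; it is not claimed to be the exact code behind <code>stats()</code>.

```python
import math

class RunningStats:
    """One-pass (streaming) count, mean and variance, Welford-style."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)
        return self

    def variance(self):
        return self.m2 / self.n  # population variance

    def sample_variance(self):
        return self.m2 / (self.n - 1)  # unbiased estimate

    def stdev(self):
        return math.sqrt(self.variance())

running = RunningStats()
for x in [1, 2, 3, 4, 5]:
    running.add(x)

print(running.n, running.mu)       # 5 3.0
print(round(running.stdev(), 6))   # 1.414214
print(running.sample_variance())   # 2.5
```

Each element is looked at once and only three numbers are kept, so the data never has to be held in memory as a whole.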

In [50]:
print(rdd.collect())
print("max: ", rdd.max())
print("mean: ", rdd.mean())

stats = rdd.stats()
print("stats: ",stats)
print("count from StatsCounter: ", stats.count())

[1, 2, 3, 4, 5]
max:  5
mean:  3.0
stats:  (count: 5, mean: 3.0, stdev: 1.41421356237, max: 5.0, min: 1.0)
count from StatsCounter:  5
