# Accumulators

## Spark Set Up

In [1]:
## Imports
import re
import json
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession

app_name = "week2_demo"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .config("spark.ui.port","42229")\
        .getOrCreate()
sc = spark.sparkContext

## Change the working directory
%cd /media

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/22 13:50:42 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/04/22 13:50:43 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/04/22 13:50:43 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/04/22 13:50:43 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
22/04/22 13:50:43 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 42229. Attempting port 42230.
22/04/22 13:50:43 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 42230. Attempting port 42231.


/media


## What are Accumulators?

So far we have covered variables that live inside a function closure. When we call `map` in a Spark job, the data is distributed and each task runs the computation on its own copy of those variables, independently of the others. Sometimes we want a variable that is shared across the entire job. These variables are called **Accumulators**. Accumulators aggregate the updates made by the tasks and bring the result back to the driver, where it can be used for other purposes. Tasks can only add to an accumulator; only the driver can read its value.

They can be used either for counting events that are incidental to the main Spark job or for debugging. Let's see some examples.
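Before the Spark examples, here is a minimal plain-Python sketch of the idea. It is a toy model, not Spark's implementation: the `partitions` data and the `run_task` helper are hypothetical names made up for illustration. Each simulated task counts blank lines in its own local copy, and the "driver" merges the partial counts by addition.

```python
# Toy model only: plain Python, no Spark involved.

def run_task(partition):
    """Simulate one task: count blank lines in a local accumulator copy."""
    local_count = 0  # each task starts from the accumulator's zero value
    for line in partition:
        if line == "":
            local_count += 1
    return local_count

# Hypothetical data, split into three partitions
partitions = [
    ["Alice was beginning", "", "to get very tired"],
    ["", ""],
    ["of sitting by her sister"],
]

# Tasks work independently; the "driver" merges their partial counts by addition
partial_counts = [run_task(p) for p in partitions]
blank_lines = sum(partial_counts)
print(partial_counts, blank_lines)
```

Because addition is commutative and associative, the merged total does not depend on the order in which the tasks finish.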

In [3]:
## Simple example, let's count the number of empty lines in the Alice RDD
ALICE_TXT = 'file:///media' + "/data/alice.txt"
aliceRDD = sc.textFile(ALICE_TXT)

## Let's define and initialize our accumulator. Built-in accumulators add values;
## any other operation (e.g. multiplying) needs a custom AccumulatorParam
blank_lines = sc.accumulator(0)

## Let's create a function that counts the blank lines
def count_blank(line):
    if line == "":
        blank_lines.add(1)

## Let's run our accumulator using the foreach action
aliceRDD.foreach(count_blank)

## Let's print the result
print(f"Number of blank lines in Alice is {blank_lines}")

Number of blank lines in Alice is 951


                                                                                

In [4]:
## Now let's see what could go wrong with accumulators
## We'll use the grades data that we created before
## Helper function to count the number of failing grades
def parse_grades(line, accumulator):
    """Helper function to parse input & track failing grades."""
    student,course,grade = line.split(',')
    grade = int(grade)
    if grade < 60:
        accumulator.add(1)
    return (student, course, grade)

## Accumulator to keep track of fails
nFail = sc.accumulator(0)

# Compute averages
csv_path = 'file:///media' + "/data/grades.csv"
gradesRDD = sc.textFile(csv_path)\
              .map(lambda x: parse_grades(x, nFail))

studentAvgs = gradesRDD.map(lambda x: (x[0], (x[2], 1)))\
                       .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))\
                       .mapValues(lambda x: x[0]/x[1])

courseAvgs = gradesRDD.map(lambda x: (x[1], (x[2], 1)))\
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))\
                      .mapValues(lambda x: x[0]/x[1])

In [6]:
## Take a look at the averages
print("===== average by student =====")
print(studentAvgs.collect())
print("===== average by course =====")
print(courseAvgs.collect())
print("===== number of failing grades awarded =====")
print(nFail)

===== average by student =====
[('10001', 92.5), ('10004', 90.0), ('10002', 70.0), ('10003', 60.0), ('10005', 72.5)]
===== average by course =====
[('102', 62.333333333333336), ('101', 87.0), ('103', 71.66666666666667)]
===== number of failing grades awarded =====
4


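What went wrong? The accumulator is updated inside `map`, which is a lazy transformation. `gradesRDD` is not cached, so each of the two `collect()` actions re-reads the file and re-runs `parse_grades`, and every failing grade is counted once per action. The reported total is therefore double the true number of failing grades.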
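One way to see the problem is a plain-Python toy model of lazy evaluation. This is only a sketch: `LazyDataset`, `fail_counter`, and the three sample rows are hypothetical stand-ins, not Spark classes or the contents of `grades.csv`. As with an uncached RDD, `map` only records the function, and every `collect` re-runs the whole pipeline, side effects included.

```python
class LazyDataset:
    """Toy stand-in for an uncached RDD: map() is lazy, collect() recomputes."""

    def __init__(self, rows, funcs=()):
        self.rows = rows
        self.funcs = tuple(funcs)

    def map(self, func):
        # Record the function; nothing is computed yet
        return LazyDataset(self.rows, self.funcs + (func,))

    def collect(self):
        # Re-run every recorded function on every row, each time it is called
        out = []
        for row in self.rows:
            for func in self.funcs:
                row = func(row)
            out.append(row)
        return out


fail_counter = {"n": 0}  # plays the role of the accumulator

def parse(line):
    student, course, grade = line.split(",")
    grade = int(grade)
    if grade < 60:
        fail_counter["n"] += 1  # side effect inside a "transformation"
    return (student, course, grade)

lines = ["s1,c1,95", "s2,c1,55", "s3,c2,40"]  # hypothetical rows, two failing
grades = LazyDataset(lines).map(parse)
by_student = grades.map(lambda r: (r[0], r[2]))
by_course = grades.map(lambda r: (r[1], r[2]))

by_student.collect()
after_first = fail_counter["n"]
by_course.collect()
after_second = fail_counter["n"]
print(after_first, after_second)
```

The counter is correct after the first `collect` and doubled after the second, because both derived datasets share the same `parse` step and each one recomputes it.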

In [10]:
## Better way: update the accumulator inside an action (foreach),
## so that each record is counted exactly once

## Accumulator to keep track of fails
nFail = sc.accumulator(0)

def parse_grades(line):
    """Helper function to parse input & track failing grades."""
    student,course,grade = line.split(',')
    grade = int(grade)
    return (student, course, grade)

def count_fails(line):
    grade = line[2]
    if grade < 60:
        nFail.add(1)
        
## Now let's run our count
csv_path = 'file:///media' + "/data/grades.csv"
gradesRDD = sc.textFile(csv_path)\
              .map(parse_grades)

gradesRDD.foreach(count_fails)
print("===== number of failing grades awarded =====")
print(nFail)

===== number of failing grades awarded =====
2
