### Distributed Shared Variables
   1. Broadcast Variables:
     - Save a large value on all worker nodes and reuse it across many Spark actions without re-sending it to the cluster.
     - A broadcast variable is immutable.
     - The value is cached on every machine in the cluster instead of being serialized and deserialized with every single task.
   2. Accumulators:
     - Add together data from all tasks into a shared result. Spark guarantees that each task's update is applied only once when the accumulator is updated inside an action.
     - An accumulator is mutable: tasks can only add to it, and only the driver can read its value.
     - Lets you update a value inside a variety of transformations and propagate that value to the driver node in an efficient and fault-tolerant way.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession\
   .builder\
   .master("local[2]")\
   .appName("SDG_Chapter14")\
   .getOrCreate()

In [2]:
sc = spark.sparkContext

In [3]:
# Note: "Definitve" is misspelled here, so it will not match the "Definitive" key used below.
my_collection = "Spark The Definitve Guide : Big Data Processing Made Simple"\
   .split(" ")
words = sc.parallelize(my_collection, 2)

In [4]:
supplementData = {
    "Spark":1000, "Definitive":200,
    "Big":-300, "Simple":100
}
suppBroadcast = sc.broadcast(supplementData)

In [5]:
suppBroadcast.value

{'Spark': 1000, 'Definitive': 200, 'Big': -300, 'Simple': 100}

In [10]:
suppBroadcast.value.get("Spark", 0)

1000

In [13]:
words.map(lambda word: (word, suppBroadcast.value.get(word,  0)))\
  .sortBy(lambda wordPair: wordPair[1])\
  .collect()

[('Big', -300),
 ('The', 0),
 ('Definitve', 0),
 ('Guide', 0),
 (':', 0),
 ('Data', 0),
 ('Processing', 0),
 ('Made', 0),
 ('Simple', 100),
 ('Spark', 1000)]
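
To make the lookup step easier to follow, here is a minimal plain-Python sketch of what the `map` and `sortBy` calls compute. It runs without Spark, so it shows only the per-element logic, not how the broadcast value is distributed. The names `supplement_data`, `pairs_sorted` and `unmatched_keys` are introduced here for illustration. It also shows why the misspelled token `Definitve` falls back to 0: it never matches the `Definitive` key.

```python
# Plain-Python model of the broadcast lookup above; no Spark required.
words = "Spark The Definitve Guide : Big Data Processing Made Simple".split(" ")
supplement_data = {"Spark": 1000, "Definitive": 200, "Big": -300, "Simple": 100}

# Same lookup as the map step: words missing from the dict fall back to 0.
pairs = [(word, supplement_data.get(word, 0)) for word in words]

# Same ordering as the sortBy step: ascending by the looked-up value.
# sorted() is stable, so tied pairs keep their input order.
pairs_sorted = sorted(pairs, key=lambda pair: pair[1])

# Keys in the lookup table that no word matched (illustrative check).
unmatched_keys = sorted(set(supplement_data) - set(words))

print(pairs_sorted)
print(unmatched_keys)
```

In this sketch `unmatched_keys` contains only `Definitive`, which is the key the misspelled word was meant to hit.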

#### Accumulator

In [14]:
flights = spark.read\
  .parquet("/home/jagadeesh/git/Spark-The-Definitive-Guide/data/flight-data/parquet/2010-summary.parquet/")

In [15]:
accChina = sc.accumulator(0)

In [16]:
accChina

Accumulator<id=0, value=0>

In [17]:
def accChinaFunc(flight_row):
    destination = flight_row["DEST_COUNTRY_NAME"]
    origin = flight_row["ORIGIN_COUNTRY_NAME"]
    if destination == "China":
        accChina.add(flight_row["count"])
    if origin == "China":
        accChina.add(flight_row["count"])

In [18]:
# foreach is an action, so each task's accumulator update is applied only once.
flights.foreach(accChinaFunc)

In [19]:
accChina.value

953
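
The accumulator pattern above can be pictured as per-task partial sums that are merged on the driver. The following is a simplified plain-Python model of that idea, not Spark's actual implementation. The rows and the two-partition split are hypothetical and are not taken from `2010-summary.parquet`. The helper `china_count` is an illustrative name that mirrors the logic of `accChinaFunc`.

```python
# Hypothetical rows split into two partitions (one per task); not real flight data.
partitions = [
    [
        {"DEST_COUNTRY_NAME": "China", "ORIGIN_COUNTRY_NAME": "United States", "count": 5},
        {"DEST_COUNTRY_NAME": "Canada", "ORIGIN_COUNTRY_NAME": "United States", "count": 7},
    ],
    [
        {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "China", "count": 3},
    ],
]


def china_count(row):
    """Return the amount accChinaFunc would add to the accumulator for one row."""
    total = 0
    if row["DEST_COUNTRY_NAME"] == "China":
        total += row["count"]
    if row["ORIGIN_COUNTRY_NAME"] == "China":
        total += row["count"]
    return total


# Each task folds its own partition into a local partial sum.
partials = [sum(china_count(row) for row in partition) for partition in partitions]

# The driver merges the partial sums into the final accumulator value.
merged = sum(partials)

print(partials)
print(merged)
```

Because addition is associative and commutative, the merged result does not depend on how rows are split across partitions or on the order in which tasks finish.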