# PySpark Shared Variables Tutorial
In this section of the PySpark RDD tutorial, we look at the different types of PySpark shared variables and how they are used in PySpark transformations.

When PySpark runs an operation such as map() or reduce(), it executes the function on remote nodes using copies of the variables that are shipped with each task. Changes to these copies are not sent back to the PySpark driver, so by default variables cannot be reused or shared across tasks. PySpark solves this problem with two types of shared variables:

- Broadcast variables (read-only shared variable)
- Accumulator variables (updatable shared variables)

## Broadcast read-only Variables
Broadcast variables are read-only shared variables that are cached on every node in the cluster so that tasks can access them. Instead of sending this data along with every task, PySpark distributes broadcast variables to each machine using efficient broadcast algorithms to reduce communication costs.

One of the best use cases for PySpark RDD broadcast variables is lookup data, for example zip code, state, or country lookups.

When you run a PySpark RDD job that has the Broadcast variables defined and used, PySpark does the following.

- PySpark executes each action through a set of stages, which are separated by distributed shuffle operations.
- Each stage is then broken into tasks.
- PySpark broadcasts the common (reusable) data needed by the tasks within each stage.
- The broadcast data is cached in serialized form and deserialized before each task runs.

A PySpark broadcast variable is created using the broadcast(v) method of the SparkContext class, where v is the value you want to broadcast.

In [1]:
broadcastVar = sc.broadcast([0, 1, 2, 3])
broadcastVar.value

[0, 1, 2, 3]

Note that broadcast variables are not sent to executors by the sc.broadcast(variable) call itself; instead, they are sent to executors when they are first used.

[Further reading](https://sparkbyexamples.com/pyspark/pyspark-broadcast-variables/)

## Accumulators
PySpark accumulators are another type of shared variable. They can only be “added” to through an associative and commutative operation, and are used to implement counters (similar to MapReduce counters) or sums.

By default, PySpark supports accumulators of numeric types such as int and float, and lets you add custom accumulator types by subclassing `AccumulatorParam`. Spark distinguishes two kinds of accumulators:

- named accumulators
- unnamed accumulators

When you create a named accumulator, it appears in the Spark web UI under the “Accumulator” tab. This tab shows two tables: the first, “accumulable”, lists all named accumulator variables and their values; the second, “Tasks”, shows the value of each accumulator modified by a task.

Unnamed accumulators are not shown in the web UI, so named accumulators are recommended wherever the API supports them. Note that the Python `sc.accumulator()` call does not take a name, so accumulators created from PySpark are unnamed; named accumulators are a feature of the Scala and Java APIs.

In PySpark, accumulator variables are created using `SparkContext.accumulator(value)`. (`SparkContext.longAccumulator()` belongs to the Scala and Java APIs.)

In [4]:
sc

In [10]:
accum = sc.accumulator(0)
# foreach() is an action that returns None, so read the total from the accumulator itself
sc.parallelize([1, 2, 3]).foreach(lambda x: accum.add(x))
print(accum.value)

6


For the built-in accumulator types of the Scala and Java APIs, see:

- [Long Accumulator](https://sparkbyexamples.com/spark/spark-accumulators/#LongAccumulator)
- [Double Accumulator](https://sparkbyexamples.com/spark/spark-accumulators/#DoubleAccumulator)
- [Collection Accumulator](https://sparkbyexamples.com/spark/spark-accumulators/#CollectionAccumulator)

## Example

In [11]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James", "Smith", "USA", "CA"), ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"), ("Maria", "Jones", "USA", "FL")]

rdd = spark.sparkContext.parallelize(data)

def state_convert(code):
    # Look up the full state name in the broadcast dictionary
    return broadcastStates.value[code]


result = rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).collect()
print(result)

[('James', 'Smith', 'USA', 'California'), ('Michael', 'Rose', 'USA', 'New York'), ('Robert', 'Williams', 'USA', 'California'), ('Maria', 'Jones', 'USA', 'Florida')]
