# Spark Shuffle

## Spark Set Up

In [1]:
## Imports
import re
import json
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession

app_name = "week2_demo"
master = "local[*]"
spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .config("spark.ui.port","42229")\
        .getOrCreate()
sc = spark.sparkContext

## Change the working directory
%cd /media

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/20 15:05:52 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/04/20 15:05:52 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/04/20 15:05:52 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/04/20 15:05:52 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
22/04/20 15:05:53 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 42229. Attempting port 42230.


/media


## What is the Spark Shuffle?

Shuffling is an important mechanism of the MapReduce framework, and it is key that we understand when and how it happens. A shuffle is when Spark redistributes data across different executors and workers (and across machines, on a large enough cluster). Wide transformations such as `reduceByKey()` typically trigger a shuffle, because values that share a key can start out in different partitions and must be brought together before they can be combined.
But why is the shuffle expensive? Because it involves the following operations:

* Disk Input/Output (I/O)
* Serialization and deserialization of data
* Network I/O
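
To make the serialization cost a little more concrete, here is a minimal sketch that uses Python's `pickle` module as a stand-in for the serializer. The `records` list is a hypothetical sample, and this only illustrates the round trip that every shuffled record goes through, not Spark's actual shuffle code path.

```python
import pickle

# Hypothetical (word, count) records, like the ones a word count would shuffle
records = [("alice", 1), ("rabbit", 1), ("queen", 1)]

# Before records can be written to disk or sent over the network,
# they have to be turned into bytes (serialization)...
payload = pickle.dumps(records)

# ...and turned back into Python objects on the receiving side (deserialization)
restored = pickle.loads(payload)

print(f"{len(records)} records became {len(payload)} bytes")
print(f"Round trip preserved the data: {restored == records}")
```

Every record that crosses a partition boundary pays this cost twice, once on each side of the shuffle.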

## Steps of the Spark Shuffle

Let's suppose we are running a `reduceByKey()` operation:

1. Spark first runs the map tasks on every partition and combines the values for each key within that partition
2. If the results of the map do not fit in memory, Spark will store the data on disk (disk I/O)
3. Spark shuffles the data across the partitions so that all values for a given key end up in the same partition (network I/O)
4. Finally, it runs the reduce tasks on each partition, merging the values for each key
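
The steps above can be sketched in plain Python, without Spark. This is only a toy simulation under stated assumptions: the helper names (`stable_partition`, `map_side_combine`, `shuffle`, `reduce_side`, `simulate_reduce_by_key`) are hypothetical, `zlib.crc32` stands in for Spark's partitioner, and the spill to disk in step 2 is not modeled.

```python
import zlib
from functools import reduce

def stable_partition(key, num_partitions):
    # Stand-in for a hash partitioner: the same key always maps to the same partition
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def map_side_combine(partition, func):
    # Step 1: combine the values for each key inside a single input partition
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

def shuffle(map_outputs, num_partitions):
    # Step 3: route every (key, value) pair to the partition that owns its key
    buckets = [{} for _ in range(num_partitions)]
    for output in map_outputs:
        for key, value in output.items():
            target = stable_partition(key, num_partitions)
            buckets[target].setdefault(key, []).append(value)
    return buckets

def reduce_side(bucket, func):
    # Step 4: merge the partial results for each key within one output partition
    return {key: reduce(func, values) for key, values in bucket.items()}

def simulate_reduce_by_key(partitions, func, num_partitions):
    map_outputs = [map_side_combine(p, func) for p in partitions]
    buckets = shuffle(map_outputs, num_partitions)
    return [reduce_side(b, func) for b in buckets]

# Two hypothetical input partitions of (word, 1) pairs
input_partitions = [
    [("alice", 1), ("rabbit", 1), ("alice", 1)],
    [("rabbit", 1), ("alice", 1), ("queen", 1)],
]

output_partitions = simulate_reduce_by_key(input_partitions, lambda a, b: a + b, 2)
print(output_partitions)
```

Notice that each word ends up in exactly one output partition, which is what lets the reduce step produce a final count without any further communication.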

In [4]:
## Let's see some examples
ALICE_TXT = 'file:///media' + "/data/alice.txt"
aliceRDD = sc.textFile(ALICE_TXT)

## Let's print the number of partitions
print(f"Number of partitions in the Alice RDD is {aliceRDD.getNumPartitions()}")

Number of partitions in the Alice RDD is 2


In [5]:
## Now let's run the same word count that we did before
## Perform a word count
result = aliceRDD.flatMap(lambda line: re.findall('[a-z]+', line.lower())) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)

## Let's print the number of partitions
print(f"Number of partitions in the result RDD is {result.getNumPartitions()}")

Number of partitions in the result RDD is 2


## Shuffle Partition Size

You can control the number of partitions used by the shuffle; we will cover the options later in this module. When you are dealing with smaller datasets, you should reduce the number of shuffle partitions so that Spark does not schedule many tasks that each handle almost no data.

If the dataset is very large, too few partitions make each partition big enough to cause memory errors, so you would want to increase the number.
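
To get a feel for this trade-off, here is a small plain-Python sketch that counts how many records land in each partition under hash partitioning. The `records_per_partition` helper and the ten-key dataset are hypothetical, and `zlib.crc32` is only a stand-in for Spark's partitioner.

```python
import zlib
from collections import Counter

def records_per_partition(keys, num_partitions):
    # Count how many keys hash into each partition index
    counts = Counter(zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]

# Hypothetical tiny dataset with 10 distinct keys
keys = [f"word{i}" for i in range(10)]

few = records_per_partition(keys, 4)
many = records_per_partition(keys, 200)
empty = sum(1 for n in many if n == 0)

print(f"With 4 partitions, records per partition: {few}")
print(f"With 200 partitions, {empty} partitions are empty")
```

With only ten keys, at least 190 of the 200 partitions cannot receive any data, yet each one would still need a task. With a very large dataset the picture flips: a handful of partitions would each have to hold a large share of the records.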

Finding the sweet spot is tricky, and it is worth revisiting whenever you run into performance issues with your Spark job.
