# Spark UI and Data Skew

To interpret the Spark UI, you first need to understand how Spark executes data transformations.

### Partitions

Data is separated into partitions so that it can be distributed across the machines of a cluster. Transformations are usually executed at the partition level. Some transformations, such as `map` and `filter`, do not change the partition structure, whereas others, such as `reduceByKey`, `groupByKey` and `join`, change the partition structure through a `shuffle`.

In [None]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 
spark

In [None]:
num = 5

partition_data = []

for i in range(0, num):
    for j in range(0, i):
        partition_data.append((i, j))

print(partition_data)

In [None]:
partition_rdd = spark.sparkContext.parallelize(partition_data, 2)
partition_rdd.getNumPartitions()

We use the following Python function to print the contents of each partition, so that we can see which records each partition holds.

In [None]:
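def to_human_readable(rdd_obj):
    """Print the contents of an RDD, one partition per line."""
    # glom() turns every partition into a list; collect() brings the lists to the driver.
    partition_view = rdd_obj.glom().collect()

    for partition in partition_view:
        print(partition)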

In [None]:
to_human_readable(partition_rdd)

#### Maps and filters

In [None]:
filtered_rdd = partition_rdd.filter(lambda pair: pair[0] % 2 == 0)
# filtered_rdd.count()
to_human_readable(filtered_rdd)
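
Because `filter` is a narrow transformation, each partition can be processed on its own and no record moves to another partition. The pure-Python sketch below models that behaviour without Spark. The two-partition layout and the helper name `filter_partitions` are assumptions made for illustration; the real layout is whatever `to_human_readable(partition_rdd)` prints on your machine.

```python
# Hypothetical layout: an assumed split of the ten pairs into two partitions.
partitions = [
    [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1)],
    [(3, 2), (4, 0), (4, 1), (4, 2), (4, 3)],
]


def filter_partitions(parts, predicate):
    """Apply the predicate inside each partition; records never change partition."""
    return [[record for record in part if predicate(record)] for part in parts]


filtered = filter_partitions(partitions, lambda pair: pair[0] % 2 == 0)

for part in filtered:
    print(part)
```

The number of partitions is unchanged after the filter; only the number of records inside each partition shrinks.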

### Shuffles

To access the Spark UI, go to http://localhost:4040. By default the UI is served on port 4040 while the Spark application is running.

In [None]:
reduced_rdd = partition_rdd.reduceByKey(lambda a, b: a + b, numPartitions=5)
reduced_rdd.count()
to_human_readable(reduced_rdd)
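
A shuffle moves every record to a partition chosen from its key, so that all values of one key end up together before they are reduced. The sketch below models this in plain Python under the assumption that the target partition is `hash(key) % num_partitions`. Python's built-in `hash` stands in for Spark's partitioner here, and `simulate_reduce_by_key` is a hypothetical helper, not a Spark API.

```python
from collections import defaultdict
from functools import reduce


def simulate_reduce_by_key(records, func, num_partitions):
    """Model of a shuffle: route records by key hash, then reduce the values per key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        # Assumption: the target partition is hash(key) modulo the partition count.
        partitions[hash(key) % num_partitions][key].append(value)

    # Reduce the values of each key inside its partition.
    return [
        [(key, reduce(func, values)) for key, values in part.items()]
        for part in partitions
    ]


# Same pairs as partition_data above.
records = [(i, j) for i in range(5) for j in range(i)]
shuffled = simulate_reduce_by_key(records, lambda a, b: a + b, num_partitions=5)

for part in shuffled:
    print(part)
```

In this model each key lands in its own partition and one partition stays empty, because key `0` has no records. Compare this with the output of `to_human_readable(reduced_rdd)`.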

### Data Skew

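Data skew is a common problem in big-data workloads. When a few keys hold most of the records, the partitions that receive those keys take much longer to process and need much more memory than the rest, which affects both the computation cost and the stability of the cluster.

To understand skew, let's first create a baseline dataset in which no single key dominates: key `i` has `i` records.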

In [None]:
num = 1000

data = []

for i in range(0, num):
    for j in range(0, i):
        data.append((i, j))

len(data)

Now we add a single skewed key that holds one million records, roughly twice as many as all the other keys combined.

In [None]:
big_num = 1000000

# Copy the baseline so that appending the skewed key does not also modify `data`.
skew_data = list(data)

for i in range(0, big_num):
    skew_data.append((big_num, i))

len(skew_data)

We can now load this data into a Spark RDD and run a shuffle (`groupByKey()`) to see how the skew affects compute resources.

In [None]:
dataset = skew_data

rdd = spark.sparkContext.parallelize(dataset)
rdd.getNumPartitions()

In [None]:
grouped_rdd = rdd.groupByKey().cache()
mapped_rdd = grouped_rdd.map(lambda pair: (pair[0], [(i + 10) for i in pair[1]]))
mapped_rdd.count()
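
To make the effect of the skewed key concrete, it helps to count how many records each partition would receive after the shuffle. The sketch below is a scaled-down, pure-Python model. It assumes hash partitioning of the form `hash(key) % num_partitions`, and the values of `num`, `big_num` and `num_partitions` are illustrative choices rather than what Spark picks for the dataset above.

```python
from collections import Counter

num = 100            # scaled-down stand-in for the 1000 regular keys
big_num = 10_000     # scaled-down skewed key, which also holds this many records
num_partitions = 8   # assumed partition count for the illustration

records = [(i, j) for i in range(num) for j in range(i)]
records.extend((big_num, i) for i in range(big_num))

# Assumption: a record goes to partition hash(key) % num_partitions.
partition_sizes = Counter(hash(key) % num_partitions for key, _ in records)

hot_partition = hash(big_num) % num_partitions
for index in range(num_partitions):
    marker = "  <- holds the skewed key" if index == hot_partition else ""
    print(f"partition {index}: {partition_sizes[index]} records{marker}")
```

One partition ends up with more than ten thousand records while every other partition holds a few hundred. In the Spark UI, this kind of imbalance typically shows up as one task in the shuffle stage that runs much longer and reads much more shuffle data than its siblings.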