**Distributed Processing Challenges: Handling Data Skew in RDD PySpark**
=================================================================================================================================================================================

**`Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark`**

**`Author: Amin Karami (PhD, FHEA)`**

In \[1\]:

    # Load Spark engine
    !pip3 install -q findspark
    import findspark
    findspark.init()

    from pyspark import SparkContext, SparkConf

    # Initializing Spark
    conf = SparkConf().setAppName("Skew").setMaster("local[*]")
    sc = SparkContext(conf=conf)

Loading Data Skew
-----------------------------------------------------------------------

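To understand skew, we create a random dataset whose keys are heavily
skewed: key `b` accounts for 6,000,000 of the 6,010,810 records, while
`a`, `c`, and `d` together account for only 10,810.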

In \[9\]:

    import numpy as np
    import random

    key_1 = ['a'] * 10
    key_2 = ['b'] * 6000000
    key_3 = ['c'] * 800
    key_4 = ['d'] * 10000
    keys = key_1 + key_2 + key_3 + key_4
    random.shuffle(keys)


    values_1 = list(np.random.randint(low = 1, high = 100, size = len(key_1)))
    values_2 = list(np.random.randint(low = 1, high = 100, size = len(key_2)))
    values_3 = list(np.random.randint(low = 1, high = 100, size = len(key_3)))
    values_4 = list(np.random.randint(low = 1, high = 100, size = len(key_4)))
    values = values_1 + values_2 + values_3 + values_4


    pair_skew = list(zip(keys,values))

In \[10\]:

    # load data into RDD
    rdd = sc.parallelize(pair_skew, 8)
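
Before shuffling, it can help to quantify how lopsided the keys are. The following is a minimal sketch in plain Python (no Spark session needed) that rebuilds the same key counts and reports each key's share of the records. The helper name `key_shares` is hypothetical and not part of the course code; on the RDD itself, `rdd.countByKey()` should give comparable per-key counts.

```python
from collections import Counter

def key_shares(keys):
    """Return {key: fraction of all records} for an iterable of keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

# Same key counts as the skewed dataset above.
keys = ['a'] * 10 + ['b'] * 6000000 + ['c'] * 800 + ['d'] * 10000
shares = key_shares(keys)
for key, share in sorted(shares.items()):
    print(f"{key}: {share:.4%}")
```

A single key holding more than 99% of the records is the situation the rest of this notebook tries to mitigate.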

(1) Run a shuffle `groupByKey()` to see how the skew affects computation resources.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[2\]:

    data_sample = [(1,4), (2,2), (2,1), (3,5), (2,5), (2,10), (2,7), (3,4), (2,1), (2,4), (4,4)]
    rdd_sample = sc.parallelize(data_sample, 3)

    rdd_sample.glom().collect()

Out\[2\]:

    [[(1, 4), (2, 2), (2, 1)],
     [(3, 5), (2, 5), (2, 10)],
     [(2, 7), (3, 4), (2, 1), (2, 4), (4, 4)]]

In \[4\]:

    rdd_sample_grouped = rdd_sample.groupByKey()

    # show groupby results
    for item in rdd_sample_grouped.collect():
        print(item[0], [value for value in item[1]])
        
    # show partitions:
    rdd_sample_grouped.glom().collect()

    3 [5, 4]
    1 [4]
    4 [4]
    2 [2, 1, 5, 10, 7, 1, 4]

Out\[4\]:

    [[(3, <pyspark.resultiterable.ResultIterable at 0x7fb5eeb56100>)],
     [(1, <pyspark.resultiterable.ResultIterable at 0x7fb5eeb47610>),
      (4, <pyspark.resultiterable.ResultIterable at 0x7fb5eeb47be0>)],
     [(2, <pyspark.resultiterable.ResultIterable at 0x7fb5eeb47eb0>)]]

In \[11\]:

    grouped_rdd = rdd.groupByKey().cache()

    # run a simple data transformation (using map()) on the skewed data
    grouped_rdd.map(lambda pair:(pair[0], [(i+10) for i in pair[1]])).count()

Out\[11\]:

    4
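
The result of 4 is simply the number of distinct keys. The cost is hidden in the shuffle: `groupByKey()` sends every record with the same key to the same partition, so the partition that receives `b` does almost all of the work. Below is a rough model of that behaviour in plain Python. It assumes `zlib.crc32` as a deterministic stand-in for Spark's key hash, and `partition_loads` is a hypothetical helper, so the exact partition numbers will differ from a real run.

```python
import zlib
from collections import Counter

def partition_loads(key_counts, num_partitions):
    """Model hash partitioning: all records of a key go to hash(key) % num_partitions."""
    loads = Counter()
    for key, count in key_counts.items():
        # crc32 is a deterministic stand-in for Spark's key hash (assumption).
        pid = zlib.crc32(key.encode("utf-8")) % num_partitions
        loads[pid] += count
    return [loads.get(p, 0) for p in range(num_partitions)]

key_counts = {'a': 10, 'b': 6000000, 'c': 800, 'd': 10000}
loads = partition_loads(key_counts, num_partitions=8)
print(loads)
print("largest partition share:", max(loads) / sum(loads))
```

Whatever hash function is used, one partition always holds at least the 6,000,000 `b` records, and with only four distinct keys at least four of the eight partitions stay empty.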

Mitigate data skewness: SALTING
---------------------------------------------------------------------------------------------------

In \[12\]:

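    # `random` is already imported above

    def salting(val):
        # Append a random suffix 0..5 (randint is inclusive), so each key
        # is split into up to 6 salted sub-keys such as "b_3"
        return val + "_" + str(random.randint(0, 5))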

In \[13\]:

    # salting method: replace each key with its salted version
    rdd_salting = rdd.map(lambda x: (salting(x[0]), x[1]))


    # same groupByKey() as before, now on the salted keys
    grouped_rdd = rdd_salting.groupByKey().cache()
    # run the same simple data transformation (using map()) on the salted data
    grouped_rdd.map(lambda pair:(pair[0], [(i+10) for i in pair[1]])).count()

Out\[13\]:

    24
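
The count of 24 reflects the salted keys (4 original keys times 6 salt values), so the groups are per salted key, not per original key. When a per-key result is needed, a common follow-up is a second, much cheaper aggregation after stripping the salt. The sketch below shows that two-stage pattern in plain Python, assuming the aggregate is associative (a sum here); `add_salt`, `strip_salt`, and `two_stage_sum` are hypothetical helpers, not part of the course code.

```python
import random
from collections import defaultdict

NUM_SALTS = 6  # matches random.randint(0, 5) in salting()

def add_salt(key, rng):
    return f"{key}_{rng.randrange(NUM_SALTS)}"

def strip_salt(salted_key):
    # Split on the last underscore so keys that contain "_" survive.
    return salted_key.rsplit("_", 1)[0]

def two_stage_sum(pairs, rng):
    # Stage 1: partial sums per salted key (the heavy key is spread over NUM_SALTS groups).
    partial = defaultdict(int)
    for key, value in pairs:
        partial[add_salt(key, rng)] += value
    # Stage 2: strip the salt and combine the few partial sums per original key.
    final = defaultdict(int)
    for salted_key, subtotal in partial.items():
        final[strip_salt(salted_key)] += subtotal
    return dict(final)

rng = random.Random(0)
pairs = [('a', 1)] * 3 + [('b', 2)] * 1000 + [('c', 5)] * 10
print(two_stage_sum(pairs, rng))
```

In PySpark the same shape would roughly be a `map` that salts the key, a `reduceByKey`, a `map` that strips the salt, and a second `reduceByKey`; treat that as an outline rather than a tuned implementation.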

(2) Run a shuffle `sortByKey()` to see how the skew affects computation resources.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[14\]:

    rdd_sort = rdd.sortByKey(ascending=False, numPartitions=4)
    rdd_sort.count()

Out\[14\]:

    6010810

Mitigate data skewness: SALTING
---------------------------------------------------------------------------------------------------

In \[15\]:

    rdd_sort = rdd_salting.sortByKey(ascending=False, numPartitions=4)
    rdd_sort.count()

Out\[15\]:

    6010810
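
Both runs return the same count, so the difference lies in how evenly the sorted ranges are filled. `sortByKey()` assigns keys to partitions by range, and a single dominant key cannot be split across ranges, whereas its salted variants (`b_0` to `b_5`) can. The following plain-Python model illustrates this on a scaled-down dataset. It is a simplification: it picks range boundaries at exact quantiles, while Spark estimates them from a sample, and `range_partition_loads` is a hypothetical helper.

```python
import bisect
import random

def range_partition_loads(keys, num_partitions):
    """Split sorted keys at equally spaced quantiles and count records per range."""
    ordered = sorted(keys)
    step = len(ordered) / num_partitions
    bounds = [ordered[int(step * i)] for i in range(1, num_partitions)]
    loads = [0] * num_partitions
    for key in keys:
        # Keys less than or equal to a boundary fall into that boundary's range.
        loads[bisect.bisect_left(bounds, key)] += 1
    return loads

rng = random.Random(0)
keys = ['a'] * 10 + ['b'] * 60000 + ['c'] * 80 + ['d'] * 1000  # scaled-down skew
salted = [f"{k}_{rng.randint(0, 5)}" for k in keys]

plain_loads = range_partition_loads(keys, 4)
salted_loads = range_partition_loads(salted, 4)
print("plain :", plain_loads)
print("salted:", salted_loads)
```

Because the salt is a suffix, the salted keys still sort by their original key first, so the overall order of `a`, `b`, `c`, `d` is preserved.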

(3) Run a shuffle `join()` to see how the skew affects computation resources.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[17\]:

    # example of join

    small_rdd1 = sc.parallelize([(2,3), (1,3), (1,4), (3,1), (5,1)], 3)
    small_rdd2 = sc.parallelize([(4,3), (0,1), (1,2), (2,1)], 2)


    print(small_rdd1.collect())
    print(small_rdd2.collect())

    [(2, 3), (1, 3), (1, 4), (3, 1), (5, 1)]
    [(4, 3), (0, 1), (1, 2), (2, 1)]

In \[18\]:

    join1 = small_rdd1.join(small_rdd2)
    join1.collect()

Out\[18\]:

    [(1, (3, 2)), (1, (4, 2)), (2, (3, 1))]

In \[19\]:

    join1.getNumPartitions()

Out\[19\]:

    5

In \[20\]:

    join1.glom().collect()

Out\[20\]:

    [[], [(1, (3, 2)), (1, (4, 2))], [(2, (3, 1))], [], []]
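
To make the join semantics explicit, here is a minimal plain-Python sketch of an inner join on `(key, value)` pairs, applied to the same two small lists. The function `inner_join` is a hypothetical helper for illustration; it models only the result of `join()`, not how Spark distributes the work.

```python
from collections import defaultdict

def inner_join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one output pair for every matching (left value, right value) combination.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

left = [(2, 3), (1, 3), (1, 4), (3, 1), (5, 1)]
right = [(4, 3), (0, 1), (1, 2), (2, 1)]
print(sorted(inner_join(left, right)))
```

Only keys 1 and 2 appear on both sides, so the sorted result contains the same three pairs as the Spark output above.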

In \[21\]:

    # Generate a small dataset to join against

    key_1 = ['a'] * 5
    key_2 = ['b'] * 60
    key_3 = ['c'] * 100

    keys = key_1 + key_2 + key_3
    random.shuffle(keys)


    values_1 = list(np.random.randint(low = 1, high = 100, size = len(key_1)))
    values_2 = list(np.random.randint(low = 1, high = 100, size = len(key_2)))
    values_3 = list(np.random.randint(low = 1, high = 100, size = len(key_3)))
    values = values_1 + values_2 + values_3

    pair_data = list(zip(keys,values))

In \[22\]:

    small_rdd = sc.parallelize(pair_data, 2)

In \[23\]:

    # Join without salting

    rdd_join = rdd.join(small_rdd)
    rdd_join.map(lambda x: int(x[1][0] + x[1][1])).reduce(lambda a,b: a+b)

Out\[23\]:

    37848130700
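
The size of this join explains why it is expensive: an inner join emits one pair for every combination of matching records, so each key contributes the product of its counts on both sides. The short calculation below works this out from the key counts used in this notebook; the dictionary names are illustrative only.

```python
# Records per key in the large skewed RDD and in the small RDD.
big_counts = {'a': 10, 'b': 6000000, 'c': 800, 'd': 10000}
small_counts = {'a': 5, 'b': 60, 'c': 100}

# Each shared key yields (count on the big side) x (count on the small side) pairs.
pairs_per_key = {k: big_counts[k] * small_counts[k]
                 for k in big_counts if k in small_counts}
total_pairs = sum(pairs_per_key.values())
print(pairs_per_key)
print(total_pairs)
```

Key `b` alone produces 360,000,000 of the 360,080,050 joined pairs, and without salting all of them are built from the records of a single key.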

Mitigate data skewness: SALTING
---------------------------------------------------------------------------------------------------

In \[26\]:

    # `random` is already imported above
    # add a random salt 0..10 to the key --> ((key, salt), value)
    rdd_new = rdd.map(lambda x: ((x[0], random.randint(0, 10)), x[1])).cache()

    # replicate the small data once per salt value (0..10) so that every salted key finds its match
    small_rdd_new = small_rdd.cartesian(sc.parallelize(range(0, 11))).map(lambda x: ((x[0][0], x[1]), x[0][1])).cache()

In \[27\]:

    # Join with salting

    rdd_join = rdd_new.join(small_rdd_new)
    rdd_join.map(lambda x: int(x[1][0] + x[1][1])).reduce(lambda a,b: a+b)

Out\[27\]:

    37848130700
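
The salted join returns the same total as the plain join because each record on the large side carries exactly one salt, and the small side has been replicated for every possible salt, so each original match occurs exactly once. The sketch below checks that equivalence on a small dataset in plain Python; `join_sum` and the sample data are hypothetical and only mirror the structure of the cells above.

```python
import random
from collections import defaultdict

NUM_SALTS = 11  # salts 0..10, as in random.randint(0, 10) and range(0, 11)

def join_sum(left, right):
    """Sum of (left value + right value) over all inner-join matches."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return sum(lv + rv for key, lv in left for rv in right_by_key.get(key, []))

rng = random.Random(0)
big = [('b', rng.randint(1, 99)) for _ in range(2000)] + [('a', 7), ('d', 3)]
small = [('a', 1), ('b', 2), ('b', 5), ('c', 9)]

# Salt the big side with one random bucket per record.
big_salted = [((k, rng.randrange(NUM_SALTS)), v) for k, v in big]
# Replicate the small side once per bucket so every salted key finds its match.
small_replicated = [((k, s), v) for k, v in small for s in range(NUM_SALTS)]

plain = join_sum(big, small)
salted = join_sum(big_salted, small_replicated)
print(plain, salted)
```

The trade-off is that the small side grows by a factor equal to the number of salts, which is usually acceptable only when that side is small to begin with.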