<H3> reduceByKey() vs groupByKey() </H3>

<p> <b> reduceByKey() </b>: The data is first aggregated locally within each partition. Only these partial results are then shuffled and colocated by key, and after the shuffle they are aggregated once more to produce the final result. Because the first aggregation runs wherever the data already sits, the work is spread across the nodes and far less data is shuffled.</p>

<p><b> groupByKey() </b>: The data is first colocated based on the key, and only then aggregated. This approach has the following disadvantages:
    <ol><li> Since the data is shuffled before any aggregation, every record crosses the network, leading to high network I/O.</li>
        <li> After shuffling, all values for a given key are aggregated on a single machine, so the cluster's compute capacity is used inefficiently. For example, if there are 9 distinct keys (statuses, regions, etc.), they are shuffled to 9 or fewer machines, depending on the hashing algorithm, and processed only on those machines. If the cluster has 100 nodes, the compute capacity of at least 91 machines goes unused.</li>
    </ol></p>
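<p>The difference in shuffle volume described above can be sketched in plain Python, without Spark. This is only a toy model under stated assumptions: the records are the ten sample pairs shown later in this notebook, the split into two partitions is invented for illustration, and the function names <code>group_by_key_style</code> and <code>reduce_by_key_style</code> are hypothetical helpers, not Spark APIs.</p>

```python
from collections import Counter, defaultdict

# Assumption: ten (status, value) records split arbitrarily over two partitions.
partitions = [
    [("CLOSED", "11599"), ("PENDING_PAYMENT", "256"), ("COMPLETE", "12111"),
     ("CLOSED", "8827"), ("COMPLETE", "11318")],
    [("COMPLETE", "7130"), ("COMPLETE", "4530"), ("PROCESSING", "2911"),
     ("PENDING_PAYMENT", "5657"), ("PENDING_PAYMENT", "5648")],
]


def group_by_key_style(parts):
    """Shuffle every record first, then count per key on the receiving side."""
    shuffled = [record for part in parts for record in part]  # all records move
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    counts = {key: len(values) for key, values in grouped.items()}
    return counts, len(shuffled)


def reduce_by_key_style(parts):
    """Count per key inside each partition, then shuffle only the partial counts."""
    partials = [Counter(key for key, _ in part) for part in parts]  # local aggregation
    shuffled = [(key, n) for partial in partials for key, n in partial.items()]
    counts = Counter()
    for key, n in shuffled:  # final aggregation after the shuffle
        counts[key] += n
    return dict(counts), len(shuffled)


group_counts, group_shuffled = group_by_key_style(partitions)
reduce_counts, reduce_shuffled = reduce_by_key_style(partitions)

print(group_counts == reduce_counts)    # both strategies agree on the counts
print(group_shuffled, reduce_shuffled)  # records moved by each strategy
```

<p>In this toy run both strategies return the same counts, but the group-first version moves all ten records while the reduce-first version moves only six partial counts. In PySpark, a count of this kind on a pair RDD such as the <code>mapped_rdd</code> built later in this notebook could be written as <code>mapped_rdd.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)</code>, which would be expected to give the same per-status totals as the <code>groupByKey()</code> cells.</p>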

In [1]:
import getpass

from pyspark.sql import SparkSession

username = getpass.getuser()

# Build a Hive-enabled session on YARN; UI port '0' lets Spark bind any free port.
spark = (
    SparkSession.builder
    .config("spark.ui.port", "0")
    .config("spark.sql.warehouse.dir", f"/user/{username}/warehouse")
    .enableHiveSupport()
    .master("yarn")
    .getOrCreate()
)

In [2]:
orders_rdd = spark.sparkContext.textFile("/public/trendytech/retail_db/orders/part-00000")

In [3]:
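# Split each line once, then key the order by its status (field 3) with field 2 as the value.
mapped_rdd = orders_rdd.map(lambda line: line.split(",")).map(lambda fields: (fields[3], fields[2]))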

In [4]:
mapped_rdd.take(10)

[('CLOSED', '11599'),
 ('PENDING_PAYMENT', '256'),
 ('COMPLETE', '12111'),
 ('CLOSED', '8827'),
 ('COMPLETE', '11318'),
 ('COMPLETE', '7130'),
 ('COMPLETE', '4530'),
 ('PROCESSING', '2911'),
 ('PENDING_PAYMENT', '5657'),
 ('PENDING_PAYMENT', '5648')]

In [7]:
grouped_rdd = mapped_rdd.groupByKey()

In [8]:
grouped_rdd.take(5)

[('CLOSED', <pyspark.resultiterable.ResultIterable at 0x7f7a855f8f60>),
 ('CANCELED', <pyspark.resultiterable.ResultIterable at 0x7f7a855f81d0>),
 ('PENDING_PAYMENT',
  <pyspark.resultiterable.ResultIterable at 0x7f7a85149160>),
 ('COMPLETE', <pyspark.resultiterable.ResultIterable at 0x7f7a850e9e80>),
 ('PROCESSING', <pyspark.resultiterable.ResultIterable at 0x7f7a851491d0>)]

In [11]:
# Count the orders per status by taking the size of each group.
result = grouped_rdd.map(lambda x: (x[0], len(x[1])))

In [12]:
result.collect()

[('CLOSED', 7556),
 ('CANCELED', 1428),
 ('PENDING_PAYMENT', 15030),
 ('COMPLETE', 22899),
 ('PROCESSING', 8275),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7610),
 ('ON_HOLD', 3798),
 ('SUSPECTED_FRAUD', 1558)]

In [13]:
spark.stop()