# We will learn: 

- Difference between: 
    1) `reduceByKey()` and `reduce()`
    2) `reduceByKey()` and `groupByKey()`
- DAG 

# SparkSession

In [1]:
spark

## Get the `SparkContext`

In [2]:
sc = spark.sparkContext
sc

# `reduceByKey()`

- Works on a pair RDD, i.e. an RDD of `(key, value)` tuples:
    - ('hello', 1)
    - ('world', 1)
- It's a transformation, so it is evaluated lazily and returns a new RDD
- Local aggregation takes place on every worker node first (like a combiner in MapReduce)
    - As a result, less data is shuffled, because most of the aggregation is already done on each worker before records move across the network
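The local aggregation described above can be modelled in plain Python. This is only a sketch of the idea, not Spark's implementation: the `partitions` list, the sample records, and the helper names `combine_locally` and `merge_partials` are all assumptions made for illustration.

```python
# Hypothetical data: each inner list stands in for one partition on one worker.
partitions = [
    [("CLOSED", 1), ("COMPLETE", 1), ("CLOSED", 1)],
    [("COMPLETE", 1), ("COMPLETE", 1), ("PENDING", 1)],
]


def combine_locally(partition, func):
    """Merge values that share a key inside a single partition."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def merge_partials(partials, func):
    """Merge the per-partition results, as would happen after the shuffle."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


def add(x, y):
    return x + y


# Step 1: every partition aggregates its own records first.
partials = [combine_locally(p, add) for p in partitions]

# Only one record per key per partition has to cross the network.
shuffled_records = sum(len(p) for p in partials)

# Step 2: the partial results are merged into the final counts.
totals = merge_partials(partials, add)

print(totals)            # final count per key
print(shuffled_records)  # fewer than the 6 input records
```

In this toy model six input records shrink to four before the merge step, which is the effect the combiner comparison refers to.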

In [3]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.txt'

In [4]:
# Transformations 

rdd1 = sc.textFile(data_set)
rdd2 = rdd1.map(lambda line: (line.split(',')[-1], 1)) 


In [5]:
rdd2.take(5)

                                                                                

[('CLOSED', 1),
 ('PENDING_PAYMENT', 1),
 ('COMPLETE', 1),
 ('CLOSED', 1),
 ('COMPLETE', 1)]

In [6]:
rdd3 = rdd2.reduceByKey(lambda x, y: x+y)

In [7]:
rdd3.collect()

                                                                                

[('CLOSED', 7556),
 ('CANCELED', 1428),
 ('PENDING_PAYMENT', 15030),
 ('COMPLETE', 22899),
 ('PROCESSING', 8274),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7609),
 ('ON_HOLD', 3798),
 ('SUSPECTED_FRAUD', 1558)]

# `reduce()`

- It's an action
- Works on any RDD; the elements do not need to be key-value pairs
- It returns a single value to the driver machine (that's why it's an action)
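Because `reduce()` combines values partition by partition before producing the single answer, the function passed to it should be commutative and associative. The plain-Python sketch below models that behaviour; the three-way split of the data and the helper `partitioned_reduce` are assumptions for illustration, not how Spark actually lays out the RDD.

```python
from functools import reduce

# Hypothetical split of range(10) into three partitions.
partitions = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]


def partitioned_reduce(parts, func):
    """Reduce each partition, then reduce the partial results into one value."""
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)


# Addition and max are commutative and associative, so partitioning is invisible.
total = partitioned_reduce(partitions, lambda x, y: x + y)
largest = partitioned_reduce(partitions, lambda x, y: max(x, y))

# Subtraction is neither, so the answer changes with the partitioning.
sub_three_parts = partitioned_reduce(partitions, lambda x, y: x - y)
sub_one_part = partitioned_reduce([list(range(10))], lambda x, y: x - y)

print(total, largest)
print(sub_three_parts, sub_one_part)
```

The two subtraction results differ, which is why operations such as `x - y` are a poor fit for `reduce()`.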

In [10]:
some_data = list(range(10))
some_data

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [11]:
rdd1 = sc.parallelize(some_data)
rdd1.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [12]:
rdd1.reduce(lambda x, y: x + y)



45

In [13]:
rdd1.reduce(lambda x, y: max(x,y))

9

In [14]:
rdd1.reduce(lambda x, y: min(x,y))

0

# `reduceByKey()` vs `groupByKey()`

### Let's take a large dataset
- `aws s3 ls s3://amazon-reviews-pds/tsv/` 
- Reference : [here](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
- We will use only books data :
    - amazon_reviews_us_Books_v1_00.tsv.gz 
    - amazon_reviews_us_Books_v1_01.tsv.gz
    - amazon_reviews_us_Books_v1_02.tsv.gz


# `reduceByKey()` 

In [15]:
dataset = 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Books*'

In [16]:
rdd1 = sc.textFile(dataset)

In [17]:
rdd1.take(5)

                                                                                

['marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date',
 'US\t25933450\tRJOVP071AVAJO\t0439873800\t84656342\tThere Was an Old Lady Who Swallowed a Shell!\tBooks\t5\t0\t0\tN\tY\tFive Stars\tI love it and so does my students!\t2015-08-31',
 'US\t1801372\tR1ORGBETCDW3AI\t1623953553\t729938122\tI Saw a Friend\tBooks\t5\t0\t0\tN\tY\tPlease buy "I Saw a Friend"! Your children will be delighted!\tMy wife and I ordered 2 books and gave them as presents...one to a friend\'s daughter and the other to our grandson! Both children were so happy with the story, by author Katrina Streza, and they were overjoyed with the absolutely adorable artwork, by artist Michele Katz, throughout the book! We highly recommend &#34;I Saw a Friend&#34; to all your little ones!!!\t2015-08-31',
 'US\t5782091\tR7TNRFQAOUTX5\t142151981X\t678139048\tBlack Lagoon, Vol. 6

In [18]:
header = rdd1.first()

                                                                                

In [20]:
header.split('\t')

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

In [21]:
rdd2 = rdd1.filter(lambda x: x != header)

In [22]:
rdd3 = rdd2.map(lambda x: x.split('\t'))
#rdd3.take(1)

In [23]:
# Keep only the product ID (4th column, index 3) and pair it with a count of 1
rdd4 = rdd3.map(lambda x: (x[3], 1))
#rdd4.take(5)

In [24]:
result = rdd4.reduceByKey(lambda x, y : x + y)

In [25]:
result.take(5)

                                                                                

[('142151981X', 4),
 ('1604600527', 4),
 ('0399170863', 254),
 ('0671728725', 3),
 ('1570913722', 9)]

![Alt Text](../img/DAG_1.png)


In [26]:
result.getNumPartitions()

3

![Alt Text](../img/DAG_2.png)

# `groupByKey()`

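- Works on a pair RDD (like `reduceByKey()`)
- It's a transformation
- Local aggregation does NOT take place on the worker nodes
- A lot of shuffling is involved, because every `(key, value)` record is sent across the network
- Can lead to an OOM error, because all values for a key have to be held in memory together
- Parallelism is affected: with few distinct keys, most of the work lands on a few tasks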

`groupByKey()` is not recommended when the grouped values are only going to be aggregated; prefer `reduceByKey()` in that case.
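To make the shuffle and memory cost concrete, here is a small plain-Python model that counts how many records each approach would move and how large the biggest group becomes. It is a sketch under stated assumptions: the keys `book_a`, `book_b`, `book_c`, the two partitions, and both `*_model` functions are hypothetical and do not reproduce Spark internals.

```python
from collections import defaultdict

# Hypothetical skewed data: two partitions, one very popular key.
partitions = [
    [("book_a", 1)] * 5 + [("book_b", 1)] * 2,
    [("book_a", 1)] * 4 + [("book_c", 1)],
]


def group_by_key_model(parts):
    """Ship every record unchanged, then collect all values per key."""
    shuffled = [record for part in parts for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    counts = {key: len(values) for key, values in groups.items()}
    largest_group = max(len(values) for values in groups.values())
    return counts, len(shuffled), largest_group


def reduce_by_key_model(parts):
    """Pre-aggregate inside each partition, then ship one record per key."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    counts = defaultdict(int)
    for key, value in shuffled:
        counts[key] += value
    return dict(counts), len(shuffled)


grouped_counts, grouped_shuffled, largest_group = group_by_key_model(partitions)
reduced_counts, reduced_shuffled = reduce_by_key_model(partitions)

print(grouped_counts == reduced_counts)    # same answer either way
print(grouped_shuffled, reduced_shuffled)  # records moved by each approach
print(largest_group)                       # values held together for one key
```

Both models give the same counts, but the grouping model moves every input record and has to hold all values of the most popular key at once, while the reducing model moves only one record per key per partition.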

In [27]:
dataset = 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Books*'

In [28]:
rdd1 = sc.textFile(dataset)
header = rdd1.first()

rdd2 = rdd1.filter(lambda x: x != header) 
rdd3 = rdd2.map(lambda x: x.split('\t'))
rdd4 = rdd3.map(lambda x: (x[3], 1))


                                                                                

In [31]:
result = rdd4.groupByKey() \
             .map(lambda x: (x[0], len(x[1])))

In [33]:
result.take(5)

                                                                                

[('1604600527', 4),
 ('1434708632', 8),
 ('0071472657', 22),
 ('B0007FWGQY', 1),
 ('0375859349', 97)]

# Summary 

<!-- - `map()` 

    - 100 ROWs =======> 100 ROWs (one output row per input row)
    
    
- `reduceByKey()`

    - 100 ROWs =======> 10 ROWs (if there are 10 distinct Keys)
    
    
- `reduce()`

    - 100 ROWs =======> 1 ROW 
    
- `filter()` 

    - 100 ROWs =======> 0 <= No. of ROWs <= 100 (depending on the filter) -->
    


We learnt:
- Difference between: 
    1) `reduceByKey()` and `reduce()`
    2) `reduceByKey()` and `groupByKey()`
- DAG   
- Spark History Server, and what the following are:
    - Jobs
    - Stages 
    - Tasks 