# We will learn: 

- The difference between:
    1) `reduceByKey()` and `reduce()`
    2) `reduceByKey()` and `groupByKey()`


# SparkSession

In [1]:
spark

## Create `SparkContext`

In [3]:
sc = spark.sparkContext
sc

# `reduceByKey()`

- Works on a pair RDD, i.e. an RDD of `(key, value)` tuples:
    - ('hello', 1)
    - ('world', 1)
- It's a transformation
- Local aggregation takes place on every worker node (like a combiner in MapReduce)
    - As a result, less data is shuffled, because most of the aggregation has already been done on each worker machine
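
To make the local-aggregation point concrete, here is a minimal sketch in plain Python (no Spark involved). The `partitions` data and the helper `combine_locally` are hypothetical, invented for illustration; the sketch only models the idea of combining per partition before anything is shuffled, not Spark's actual implementation.

```python
# Hypothetical partitions of a pair RDD, one list per worker.
partitions = [
    [("CLOSED", 1), ("COMPLETE", 1), ("CLOSED", 1)],
    [("COMPLETE", 1), ("PENDING", 1), ("COMPLETE", 1)],
]


def combine_locally(pairs, func):
    """Merge the values that share a key within one collection of pairs."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def add(x, y):
    return x + y


# Step 1: each worker aggregates its own partition before any data moves.
local_results = [combine_locally(part, add) for part in partitions]

# Step 2: only the per-partition partial results are shuffled and merged.
shuffled = [pair for result in local_results for pair in result.items()]
final = combine_locally(shuffled, add)

records_before = sum(len(part) for part in partitions)
print("records before local aggregation:", records_before)
print("records shuffled after local aggregation:", len(shuffled))
print(final)
```

In this toy input, six records shrink to four before the shuffle; on a real dataset with few distinct keys the reduction is far larger.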

In [13]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.txt'

In [23]:
# Transformations 

rdd1 = sc.textFile(data_set)
rdd2 = rdd1.map(lambda line: (line.split(',')[-1], 1)) 


In [20]:
rdd2.take(5)

[('CLOSED', 1),
 ('PENDING_PAYMENT', 1),
 ('COMPLETE', 1),
 ('CLOSED', 1),
 ('COMPLETE', 1)]

In [24]:
rdd3 = rdd2.reduceByKey(lambda x, y: x+y)

In [25]:
rdd3.collect()

[('CLOSED', 7556),
 ('CANCELED', 1428),
 ('PENDING_PAYMENT', 15030),
 ('COMPLETE', 22899),
 ('PROCESSING', 8274),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7609),
 ('ON_HOLD', 3798),
 ('SUSPECTED_FRAUD', 1558)]

# `reduce()`

- It's an action
- Works on a normal RDD (the elements do not have to be key-value pairs)
- It returns a single value to the driver machine (that's why it's an action)
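
The sketch below models, in plain Python, how a `reduce()`-style action can arrive at one value: each partition is reduced on its own and the partial results are reduced again on the driver. The partition layout and the helper `rdd_reduce` are assumptions made for illustration. It also shows why the function passed to `reduce()` should be associative and commutative: subtraction is neither, so its result depends on how the data happens to be partitioned.

```python
from functools import reduce

# Hypothetical split of range(10) into three partitions.
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]


def rdd_reduce(parts, func):
    """Reduce each partition independently, then reduce the partial results."""
    partials = [reduce(func, part) for part in parts if part]
    return reduce(func, partials)


total = rdd_reduce(partitions, lambda x, y: x + y)
largest = rdd_reduce(partitions, lambda x, y: max(x, y))

# Subtraction is not associative: the partitioned result differs from
# the result of reducing the same numbers in one sequential pass.
unsafe = rdd_reduce(partitions, lambda x, y: x - y)
sequential = reduce(lambda x, y: x - y, range(10))

print(total, largest)
print(unsafe, sequential)
```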

In [6]:
some_data = list(range(10))
some_data

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [7]:
rdd1 = sc.parallelize(some_data)
rdd1.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [8]:
rdd1.reduce(lambda x, y: x + y)

45

In [11]:
rdd1.reduce(lambda x, y: max(x,y))

9

In [12]:
rdd1.reduce(lambda x, y: min(x,y))

0

# `reduceByKey()` vs `groupByKey()`

### Let's take a large dataset
- `aws s3 ls s3://amazon-reviews-pds/tsv/`
- Reference: [here](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)

In [29]:
dataset = 's3://amazon-reviews-pds/tsv/*'

In [30]:
rdd1 = sc.textFile(dataset)

In [31]:
rdd1.take(5)

['marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date',
 'DE\t10133\tRVOG49N0H1FB6\tB004TACMZ8\t569741360\tBosch GMS120\xa0Ortungsgerät digital multi-Scanner\tHome Improvement\t5\t0\t0\tN\tY\tSuper\tDelivery took a little bit more then i expected(2 days more) but package was in good condition and the device i fully functional.\t2014-08-01',
 "DE\t19612\tRNCMD6OLTP4HM\t1846071224\t785505948\tThe Wheels On The Bus: Favourite Nursery Rhymes (BBC Audio Children's)\tBooks\t5\t1\t1\tN\tY\tGreat compilation\tWe enjoy listening to the song as preparation for our baby. Some of the songs were quite new even for us. We hope our baby will enjoy the songs too as much as we do.\t2014-12-04",
 "DE\t19612\tR4AUOBI8YC0R8\t0375851569\t516548029\tDr. Seuss's  Beginner Book Collection\tBooks\t5\t0\t0\tN\tY\tGreat Collection\tVery great compilation. Inter

In [32]:
header = rdd1.first()

In [33]:
rdd2 = rdd1.filter(lambda x: x != header)

In [34]:
rdd2.take(5)

['DE\t10133\tRVOG49N0H1FB6\tB004TACMZ8\t569741360\tBosch GMS120\xa0Ortungsgerät digital multi-Scanner\tHome Improvement\t5\t0\t0\tN\tY\tSuper\tDelivery took a little bit more then i expected(2 days more) but package was in good condition and the device i fully functional.\t2014-08-01',
 "DE\t19612\tRNCMD6OLTP4HM\t1846071224\t785505948\tThe Wheels On The Bus: Favourite Nursery Rhymes (BBC Audio Children's)\tBooks\t5\t1\t1\tN\tY\tGreat compilation\tWe enjoy listening to the song as preparation for our baby. Some of the songs were quite new even for us. We hope our baby will enjoy the songs too as much as we do.\t2014-12-04",
 "DE\t19612\tR4AUOBI8YC0R8\t0375851569\t516548029\tDr. Seuss's  Beginner Book Collection\tBooks\t5\t0\t0\tN\tY\tGreat Collection\tVery great compilation. Interesting story and rhymes even for adults. The tongue twister is amazing. This complete my collection of Dr. Seuss's.\t2014-12-04",
 'DE\t19677\tR1VSHIJ1RHIBTE\tB0060SVG54\t302116447\tZwei an einem Tag\tVideo DVD

In [None]:
# Split the header-free rows (rdd2) on tabs; mapping rdd1 would bring the header back
rdd3 = rdd2.map(lambda x: x.split('\t'))
rdd3.take(5)

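
As a rough illustration of the comparison this section sets up, the plain-Python sketch below counts how many records would cross the shuffle under each approach. The category pairs and both helper functions are hypothetical and only model the behaviour: a `groupByKey()`-style operation moves every record and aggregates afterwards, while a `reduceByKey()`-style operation combines within each partition first.

```python
from collections import defaultdict

# Hypothetical (product_category, 1) pairs spread over two partitions.
partitions = [
    [("Books", 1), ("Books", 1), ("Video DVD", 1), ("Books", 1)],
    [("Books", 1), ("Home Improvement", 1), ("Books", 1)],
]


def group_by_key(parts):
    """groupByKey-style: every record crosses the shuffle unchanged."""
    shuffled = [pair for part in parts for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


def reduce_by_key(parts, func):
    """reduceByKey-style: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


groups, grouped_shuffle = group_by_key(partitions)
counts_from_groups = {key: sum(values) for key, values in groups.items()}
counts, reduced_shuffle = reduce_by_key(partitions, lambda x, y: x + y)

print("shuffled with groupByKey-style:", grouped_shuffle)
print("shuffled with reduceByKey-style:", reduced_shuffle)
print(counts)
```

Both approaches give the same counts; only the amount of data moved differs, which is why `reduceByKey()` is usually preferred when the goal is an aggregate per key.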

# Summary 

<!-- - `map()` 

    - 100 ROWs =======> 100 ROWs
    
    
- `reduceByKey()`

    - 100 ROWs =======> 10 ROWs (if there are 10 distinct Keys)
    
    
- `reduce()`

    - 100 ROWs =======> 1 ROW 
    
- `filter()` 

    - 100 ROWs =======> 0 <= No. of ROWs <= 100 (depending on the filter) -->
    


We learnt:
- Types of Transformation: 
    1) Narrow Transformation 
        - `map()`
        - `flatMap()`
        - `filter()`
    2) Wide Transformation 
        - `reduceByKey()`
        - `groupByKey()`
        
- We should always:
    - try to minimise `wide` transformations
    - use `wide` transformations as late as possible
    
- Spark History Server, and what these are:
    - Jobs
    - Stages 
    - Tasks 
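
The narrow versus wide distinction in the summary can be sketched in plain Python as follows. The sample statuses and the helper `target_partition` are hypothetical, and the CRC-based partitioner is only a stand-in chosen to be deterministic; it is not a claim about how Spark assigns keys to partitions.

```python
import zlib

# Hypothetical order statuses spread over two partitions.
partitions = [
    ["CLOSED", "COMPLETE"],
    ["COMPLETE", "PENDING", "CLOSED"],
]

# Narrow: each output partition depends on exactly one input partition,
# so no data has to move between workers.
mapped = [[(status, 1) for status in part] for part in partitions]


def target_partition(key, num_partitions):
    """Pick a partition from the key alone (illustrative partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Wide: a record's destination depends on its key, so an output partition
# may need records from every input partition (a shuffle).
shuffled = [[] for _ in partitions]
for part in mapped:
    for key, value in part:
        shuffled[target_partition(key, len(shuffled))].append((key, value))

print(mapped)
print(shuffled)
```

After the wide step, all records with the same key sit in the same partition, which is what makes a per-key aggregation possible and also what makes the step expensive.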