# We will learn: 

- `parallelize()`
- chaining/chained transformation 
- partition 
- `countByValue()`


# SparkSession

In [4]:
spark

## Load the `dataset`

In [1]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.txt'

## Create `SparkContext`

In [2]:
sc = spark.sparkContext

In [3]:
sc

## Create an `RDD`

In [11]:
words = ['ram', 'Alex', 'alex', 'hello', 'word', 'US', 'India', 'India']

In [12]:
rdd1 = sc.parallelize(words)

## Word Count

In [13]:
rdd1.collect()

['ram', 'Alex', 'alex', 'hello', 'word', 'US', 'India', 'India']

In [15]:
rdd2 = rdd1.map(lambda x: (x.lower(), 1))

In [16]:
rdd2.collect()

[('ram', 1),
 ('alex', 1),
 ('alex', 1),
 ('hello', 1),
 ('word', 1),
 ('us', 1),
 ('india', 1),
 ('india', 1)]

In [18]:
rdd3 = rdd2.reduceByKey(lambda x, y: x+y)

In [19]:
rdd3.collect()

[('alex', 2), ('us', 1), ('ram', 1), ('hello', 1), ('word', 1), ('india', 2)]

In [22]:
rdd4 = rdd3.sortBy(lambda x: x[1], ascending=False)

In [23]:
rdd4.collect()

[('alex', 2), ('india', 2), ('us', 1), ('ram', 1), ('hello', 1), ('word', 1)]
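To sanity-check the Spark output, the same three steps (`map`, `reduceByKey`, `sortBy`) can be mimicked in plain Python on the driver. This is only a local sketch using `collections.Counter`, not Spark code, and the variable names `lowered`, `counts`, and `local_result` are illustrative. The order of words with equal counts may differ from what Spark returns.

```python
from collections import Counter

words = ['ram', 'Alex', 'alex', 'hello', 'word', 'US', 'India', 'India']

# map step: normalise every word to lower case
lowered = [w.lower() for w in words]

# reduceByKey step: add up the occurrences of each key
counts = Counter(lowered)

# sortBy step: highest count first (ties keep first-seen order)
local_result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(local_result)
```

The counts should match `rdd4.collect()`; only the ordering of ties is not guaranteed to agree.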

## Let's `Chain` the whole thing

In [26]:
words = ['ram', 'Alex', 'alex', 'hello', 'word', 'US', 'India', 'India']

result_rdd = (sc.parallelize(words)
                .map(lambda x: (x.lower(), 1)) 
                .reduceByKey(lambda x, y: x+y)
                .sortBy(lambda x: x[1], ascending=False)
             )

In [27]:
result_rdd.collect()

[('alex', 2), ('india', 2), ('us', 1), ('ram', 1), ('hello', 1), ('word', 1)]

## Partition 

### a) Create an RDD using `parallelize`

In [28]:
words = ['ram', 'Alex', 'alex', 'hello', 'word', 'US', 'India', 'India']

rdd1= sc.parallelize(words)

In [32]:
rdd1.getNumPartitions()  # Why are there 2 partitions? Shouldn't it be 1?

2

#### `defaultParallelism`

`defaultParallelism` is the default number of partitions Spark uses when an RDD is created with `parallelize()` and no explicit partition count is passed. Transformations such as `map()` do not consult it; they keep the partitioning of their parent RDD.

Its value depends on the cluster manager and on the `spark.default.parallelism` setting. In this session it is 2, which is why `rdd1` above has 2 partitions rather than 1.


In [36]:
sc.defaultParallelism  

2
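To make the idea of partitions concrete, here is a plain-Python sketch of how a list could be cut into contiguous, roughly equal slices, one per partition. The helper `split_into_slices` is hypothetical and is not part of PySpark; the exact placement of elements inside a real RDD may differ.

```python
def split_into_slices(items, num_slices):
    """Cut items into num_slices contiguous, roughly equal slices."""
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

words = ['ram', 'Alex', 'alex', 'hello', 'word', 'US', 'India', 'India']

two = split_into_slices(words, 2)      # like the default of 2 seen above
three = split_into_slices(words, 3)    # an explicit partition count

print([len(s) for s in two])
print([len(s) for s in three])
```

In Spark itself, the partition count can be set explicitly with the second argument of `sc.parallelize(words, numSlices)`, and `rdd.glom().collect()` shows which elements ended up in which partition.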

### b) Create an RDD by `reading a file from S3`

In [37]:
rdd = sc.textFile(data_set)

In [38]:
rdd.getNumPartitions()

2

#### `defaultMinPartitions`

`defaultMinPartitions` is the default value of the `minPartitions` argument used when an `RDD (Resilient Distributed Dataset)` is created by reading from external storage such as `HDFS (Hadoop Distributed File System)`, `S3 (Amazon Simple Storage Service)`, or the `local` file system. Spark computes it as `min(defaultParallelism, 2)`, so it is never greater than 2. It is a lower-bound hint rather than an exact count: a large file is usually split into more partitions, depending on the input split size.

In [35]:
sc.defaultMinPartitions

2
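Spark derives `defaultMinPartitions` as `min(defaultParallelism, 2)`. The plain-Python sketch below only reproduces that rule so the effect of different parallelism values is easy to see; the function name `default_min_partitions` is hypothetical.

```python
def default_min_partitions(default_parallelism):
    # Same rule as Spark: never more than 2, hypothetical helper name
    return min(default_parallelism, 2)

for parallelism in (1, 2, 8):
    print(parallelism, default_min_partitions(parallelism))
```

To ask for more partitions when reading a file, pass the hint explicitly, for example `sc.textFile(data_set, minPartitions=8)`.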

### `countByValue()`

#### Previously, we counted the number of orders by `status` like this:

In [43]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.txt'
result = (sc
           .textFile(data_set)
           .map(lambda line: (line.split(',')[-1], 1)) 
           .reduceByKey(lambda x, y: x+y)
           .sortBy(lambda x: x[-1], ascending=False)
         )

In [44]:
result.collect()

[('COMPLETE', 22899),
 ('PENDING_PAYMENT', 15030),
 ('PROCESSING', 8274),
 ('PENDING', 7609),
 ('CLOSED', 7556),
 ('ON_HOLD', 3798),
 ('SUSPECTED_FRAUD', 1558),
 ('CANCELED', 1428),
 ('PAYMENT_REVIEW', 729)]

#### Let's try another way using `countByValue()`

In [49]:
data_set = 's3://fcc-spark-example/dataset/2023/orders.txt'
result = (sc
           .textFile(data_set)
           .map(lambda line: (line.split(',')[-1])) 
           .countByValue()  # an action: returns a local dict to the driver
         )

In [50]:
result

defaultdict(int,
            {'CLOSED': 7556,
             'PENDING_PAYMENT': 15030,
             'COMPLETE': 22899,
             'PROCESSING': 8274,
             'PAYMENT_REVIEW': 729,
             'PENDING': 7609,
             'ON_HOLD': 3798,
             'CANCELED': 1428,
             'SUSPECTED_FRAUD': 1558})

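`countByValue()` is an RDD action that counts how often each distinct element occurs in an `RDD (Resilient Distributed Dataset)`. It returns a dictionary-like object to the driver (a `defaultdict` in PySpark, as shown above), where each distinct element is a key and its value is the number of occurrences. Because the full result is brought to the driver, it suits cases with few distinct values. `reduceByKey()`, in contrast, is a transformation and returns another RDD, so further transformations such as `sortBy()` can be chained after it.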
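The two approaches compute the same counts. The plain-Python sketch below mirrors both on a handful of made-up lines; the sample `lines` and their field layout are hypothetical (only "status is the last comma-separated field" comes from the code above), and no Spark is involved.

```python
from collections import Counter, defaultdict

# Hypothetical sample lines; only the last field (status) matters here
lines = [
    '1,2023-01-01,101,COMPLETE',
    '2,2023-01-01,102,PENDING',
    '3,2023-01-02,103,COMPLETE',
    '4,2023-01-02,104,CLOSED',
]

statuses = [line.split(',')[-1] for line in lines]

# countByValue style: count each distinct value directly
by_value = Counter(statuses)

# map + reduceByKey style: emit (status, 1) pairs, then sum per key
by_key = defaultdict(int)
for status, one in ((s, 1) for s in statuses):
    by_key[status] += one

print(dict(by_value))
print(sorted(by_key.items(), key=lambda kv: kv[1], reverse=True))
```

Both dictionaries hold identical counts; the difference in Spark is where the result lives (on the driver for `countByValue()`, in an RDD for `reduceByKey()`).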

# Summary 

<!-- - `map()` 

    - 100 ROWs =======> 100 ROWs
    
    
- `reduceByKey()`

    - 100 ROWs =======> 10 ROWs (if there are 10 distinct Keys)
    
    
- `reduce()`

    - 100 ROWs =======> 1 ROW 
    
- `filter()` 

    - 100 ROWs =======> 0 <= No. of ROWs <= 100 (depending on the filter) -->
    


We learned:
- `parallelize()`
- chaining/chained transformation 
- partition 
- `countByValue()` vs `reduceByKey()`