# 30-Day Journey with Apache Spark⚡Day 6 🚩

## **Day 6: Transformations and Actions in Spark** 🔄🎬
Welcome to Day 6! Today, we’ll explore **two core operations in Spark**: **Transformations and Actions**. Understanding how these operations work will help you process data more efficiently in the Spark environment.



---



### **1️⃣ Transformations**
Definition:
**Transformations** are operations applied to RDDs, DataFrames, or Datasets that return a new dataset but do not execute immediately. Transformations are lazy, meaning they are only executed when an Action is called.

**Common Types of Transformations:**

+ **Map**: Applies a function to each element of the dataset and returns a new dataset.

    `rdd.map(lambda x: x * 2)`

+ **Filter**: Filters elements that satisfy a condition.

    `rdd.filter(lambda x: x > 10)`

+ **FlatMap**: Similar to map, but each input element can produce zero or more output elements, which are flattened into a single dataset.

    `rdd.flatMap(lambda x: x.split(" "))`

+ **Union**: Combines two datasets into one, keeping any duplicates.

    `rdd1.union(rdd2)`

+ **Distinct**: Removes duplicate elements from the dataset.

    `rdd.distinct()`









---



### **2️⃣ Actions**

Definition: **Actions** **trigger the execution of Transformations** and return the result to the driver program or save the data to external storage.

**Common Types of Actions:**






+ **Collect**: Retrieves all elements of the dataset and returns them to the driver as a list. Use it only when the result fits in the driver's memory.

    `rdd.collect()`

+ **Count**: Counts the number of elements in the dataset.

    `rdd.count()`

+ **Take**: Retrieves the first `n` elements of the dataset.

    `rdd.take(5)`

---

### **Lazy Evaluation** ✅
Spark uses lazy evaluation, meaning *Transformations are not immediately computed*. **Instead, Spark builds an Execution Plan as a DAG** (Directed Acyclic Graph).

Because Spark sees the whole plan before running anything, it can optimize it, for example by pipelining consecutive transformations such as `map` and `filter` into a single stage and by avoiding unnecessary data redistribution (shuffling).
Execution occurs only when an Action is called, at which point the plan is executed.
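
The idea of "record the plan first, run it on an action" can be sketched with a toy model in plain Python. This is only an analogy under simplifying assumptions, not Spark's actual implementation: the `LazyPipeline` class and its methods are hypothetical names invented for this illustration, and there is no DAG optimization or distribution involved.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # the recorded "plan"

    def map(self, fn):
        # Transformation: return a new pipeline with one more recorded step
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, not executed
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data
        result = self._data
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


plan = LazyPipeline([1, 2, 3, 4, 5]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has executed yet
result = plan.collect()           # the action triggers execution
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

Building `plan` never calls `double`; the function runs only once `collect()` is invoked, which mirrors how Spark defers transformations until an action.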

### **🔑 Key Takeaways:**

+ **Transformations**: Lazy operations that build the execution plan.

+ **Actions**: Trigger execution and return results or save data.

+ **Lazy Evaluation**: Lets Spark optimize the data processing workflow, use resources efficiently, and avoid unnecessary computations.

---

# 🎯 Practice Example: Getting started with **Transformations and Actions** in Apache Spark

In [1]:
# Installing the Java JDK and Apache Spark:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install pyspark



In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

In [3]:
# Create the first Spark session:
from pyspark.sql import SparkSession
# Create a Spark session (or reuse the existing one)
spark = SparkSession.builder.appName("PySpark in Colab").getOrCreate()
# Print Spark Version
print(spark.version)

3.5.3


## Transformations

In [8]:
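rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 4, 5])
# Transformations only describe the work; nothing runs yet
mapped_rdd = rdd.map(lambda x: x * 2)       # double each element
filtered_rdd = rdd.filter(lambda x: x > 2)  # keep elements greater than 2
distinct_rdd = rdd.distinct()               # drop duplicates (output order is not guaranteed)
# collect() is an action, so each call below triggers execution
print(mapped_rdd.collect())
print(filtered_rdd.collect())
print(distinct_rdd.collect())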

[2, 4, 6, 8, 8, 10]
[3, 4, 4, 5]
[2, 4, 1, 3, 5]


## Actions

In [9]:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
total_count = rdd.count()    # number of elements
first_element = rdd.first()  # first element of the RDD
# Equivalent: rdd.reduce(lambda a, b: a + b)
sum_elements = rdd.sum()
print(f'{total_count} {first_element} {sum_elements}')

5 1 15
