### **RDD Transformations**

**NARROW TRANSFORMATIONS (No Shuffle):**

With these transformations, each output partition is computed from a single input partition, so no data is moved across the network.

**WIDE TRANSFORMATIONS (Cause a Shuffle):**

With these transformations, each output partition may need records from many input partitions, so Spark redistributes (shuffles) the data, often across the cluster, which involves network I/O.
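
To make the distinction concrete, here is a minimal pure-Python sketch (no Spark involved) that models partitions as plain lists. The helper names `narrow_map` and `wide_reduce_by_key` and the toy partitioner are illustrative assumptions, not Spark APIs. Real Spark also pre-combines values inside each partition before shuffling, which this sketch leaves out.

```python
from collections import defaultdict
from functools import reduce

# Two "partitions" of (key, value) records, modeled as plain lists.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def narrow_map(parts, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(record) for record in part] for part in parts]


def wide_reduce_by_key(parts, fn, num_partitions):
    """Wide: records sharing a key must end up in the same output partition,
    so any input partition may send records to any output partition."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Toy deterministic partitioner (hypothetical, not Spark's).
            target = sum(key.encode()) % num_partitions
            buckets[target][key].append(value)  # this step is the "shuffle"
    return [
        [(key, reduce(fn, values)) for key, values in sorted(bucket.items())]
        for bucket in buckets
    ]


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
summed = wide_reduce_by_key(partitions, lambda x, y: x + y, num_partitions=2)

print(doubled)  # [[('a', 2), ('b', 4)], [('a', 6), ('c', 8)]]
print(summed)   # [[('b', 2)], [('a', 4), ('c', 4)]]
```

Note that the two `"a"` records start in different partitions and only meet after the redistribution step. That movement is what a shuffle pays for.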

| **Transformation** | **Type** | **Causes Shuffle?** |
| ------------------ | -------- | ------------------- |
| `map()`            | Narrow   | ❌ No                |
| `filter()`         | Narrow   | ❌ No                |
| `flatMap()`        | Narrow   | ❌ No                |
| `reduceByKey()`    | Wide     | ✅ Yes               |
| `groupByKey()`     | Wide     | ✅ Yes               |
| `join()`           | Wide     | ✅ Yes               |
| `distinct()`       | Wide     | ✅ Yes               |

**What About DataFrame Transformations?**

DataFrames in PySpark have their own methods (e.g., `select()`, `withColumn()`, `groupBy()`, `join()`, `dropDuplicates()`). Under the hood, the Catalyst optimizer turns them into an optimized execution plan that ultimately runs as RDD operations, so the same narrow/wide distinction applies. The rough equivalents are:

| **RDD Operation** | **DataFrame Equivalent**          |
| ----------------- | --------------------------------- |
| `map()`           | `select()`, `withColumn()`        |
| `filter()`        | `filter()` / `where()`            |
| `flatMap()`       | `explode()`                       |
| `reduceByKey()`   | `groupBy().agg()` with reduction  |
| `groupByKey()`    | `groupBy()`                       |
| `join()`          | `join()`                          |
| `distinct()`      | `dropDuplicates()` / `distinct()` |



In [0]:
# Create an RDD from the SparkContext (`spark` is the notebook's SparkSession)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

In [0]:
# map()
rdd.map(lambda x: x * 2).collect()  # [2, 4, 6, 8]

# filter()
rdd.filter(lambda x: x % 2 == 0).collect()  # [2, 4]

# flatMap() (uses a separate variable so `rdd` still holds [1, 2, 3, 4] below)
words_rdd = spark.sparkContext.parallelize(["hello world", "spark rdd"])
words_rdd.flatMap(lambda x: x.split()).collect()  # ['hello', 'world', 'spark', 'rdd']

# Actions: collect(), count(), first(), take(n)
rdd.collect()  # [1, 2, 3, 4]
rdd.count()  # 4
rdd.first()  # 1
rdd.take(2)  # [1, 2]