
---

# Spark Transformations for Data Engineers: Clean, Sort, Deduplicate

As a Data Engineer, mastering Spark transformations is fundamental for efficient data cleaning, organization, and preparation. Pay close attention to the performance implications, especially the difference between wide and narrow transformations.

---

## 1. Sorting Data (`sort`, `orderBy`)

Sorting orders DataFrame rows by one or more columns.

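*   **Aliases**: `sort(*cols, ascending=True)` and `orderBy(*cols, ascending=True)` are interchangeable. The keyword is `ascending`, not `asc`.
*   **Direction**:
    *   **Ascending (default)**: `df.sort("Age")`
    *   **Descending**: Use `col("column_name").desc()`, e.g., `df.sort(col("Age").desc())`, or pass `ascending=False`, e.g., `df.sort("Age", ascending=False)`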
    *   **Multiple Columns**: Specify order for each, e.g., `df.orderBy("Age", col("Name").desc())`
*   **Transformation Type**: **Wide Transformation** (requires data shuffling).

**Example (Python):**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder.appName("SortFilterDistinct").getOrCreate()

# Sample Data
data = [("Alice", 30, "NY"),
        ("Bob", 25, "LD"),
        ("Charlie", 35, "NY"),
        ("Alice", 30, "NY"), # Duplicate row
        ("David", 22, "SF"),
        ("Charlie", 35, "LA")] # Charlie is duplicated on Name/Age, but not Name/Age/City
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
df.show()
# +-------+---+----+
# |   Name|Age|City|
# +-------+---+----+
# |  Alice| 30|  NY|
# |    Bob| 25|  LD|
# |Charlie| 35|  NY|
# |  Alice| 30|  NY|
# |  David| 22|  SF|
# |Charlie| 35|  LA|
# +-------+---+----+

print("\nSorted by Age (ascending):")
df.sort("Age").show()

print("\nSorted by Age (asc), then Name (desc):")
df.orderBy("Age", col("Name").desc()).show()

+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|  Alice| 30|  NY|
|    Bob| 25|  LD|
|Charlie| 35|  NY|
|  Alice| 30|  NY|
|  David| 22|  SF|
|Charlie| 35|  LA|
+-------+---+----+


Sorted by Age (ascending):
+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|  David| 22|  SF|
|    Bob| 25|  LD|
|  Alice| 30|  NY|
|  Alice| 30|  NY|
|Charlie| 35|  LA|
|Charlie| 35|  NY|
+-------+---+----+


Sorted by Age (asc), then Name (desc):
+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|  David| 22|  SF|
|    Bob| 25|  LD|
|  Alice| 30|  NY|
|  Alice| 30|  NY|
|Charlie| 35|  LA|
|Charlie| 35|  NY|
+-------+---+----+



---

## 2. Filtering Data (`filter` / `where`)

Filtering selects rows based on a specified condition.

*   **Aliases**: `filter(condition)` and `where(condition)` are interchangeable.
*   **Transformation Type**: **Narrow Transformation** (no shuffling required).

**Example (Python):**

In [2]:
print("\nFiltered for Age > 28:")
df.filter(col("Age") > 28).show()


Filtered for Age > 28:
+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|  Alice| 30|  NY|
|Charlie| 35|  NY|
|  Alice| 30|  NY|
|Charlie| 35|  LA|
+-------+---+----+



---

## 3. Handling Duplicates (`distinct`, `dropDuplicates`)

Removing duplicate rows is a critical data cleaning step.

### `distinct()`

*   Returns a new DataFrame with only unique rows, considering **all columns**.
*   **Transformation Type**: **Wide Transformation** (requires global comparison and shuffling).

### `dropDuplicates(subset=None)`

*   Removes duplicate rows.
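*   **`subset=None`**: Considers **all columns** for duplication (equivalent to `distinct()`).
*   **`subset=[column_names]`**: Considers *only* the specified columns for duplication and keeps one row per key combination. On a distributed DataFrame Spark does not guarantee which of the duplicate rows survives, so "first occurrence" holds only in the sense of "first row Spark happens to encounter".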
*   **Transformation Type**: **Wide Transformation** (requires shuffling).

### 🚀 **Performance Preference for Data Engineers**

`dropDuplicates(subset=...)` is usually the **better choice for targeted de-duplication**, mainly for control rather than raw speed:

*   **Same Engine Path**: `distinct()` and `dropDuplicates()` with no `subset` are equivalent, so neither is faster than the other.
*   **Smaller Comparison Key**: When `subset` is specified, Spark hashes and compares only those columns. The surviving rows still pass through the shuffle with all of their columns, so any savings come from cheaper keys, not from moving fewer columns.
*   **Targeted Control**: Provides fine-grained control over which columns define uniqueness.

**Example (Python):**

In [3]:
print("\nDistinct rows (considering all columns):")
df.distinct().show()

print("\nDrop duplicates (on all columns):")
df.dropDuplicates().show()

print("\nDrop duplicates on 'Name' and 'Age' (keep first occurrence):")
df.dropDuplicates(subset=["Name", "Age"]).show()
# One row survives per (Name, Age) pair. Row order and the surviving City
# for Charlie are not guaranteed on a distributed DataFrame; a typical run:
# +-------+---+----+
# |   Name|Age|City|
# +-------+---+----+
# |  Alice| 30|  NY|
# |    Bob| 25|  LD|
# |Charlie| 35|  NY| # could be LA on another run or cluster
# |  David| 22|  SF|
# +-------+---+----+

# Stop SparkSession
spark.stop()


Distinct rows (considering all columns):
+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|    Bob| 25|  LD|
|Charlie| 35|  NY|
|  Alice| 30|  NY|
|  David| 22|  SF|
|Charlie| 35|  LA|
+-------+---+----+


Drop duplicates (on all columns):
+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|    Bob| 25|  LD|
|Charlie| 35|  NY|
|  Alice| 30|  NY|
|  David| 22|  SF|
|Charlie| 35|  LA|
+-------+---+----+


Drop duplicates on 'Name' and 'Age' (keep first occurrence):
+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|  Alice| 30|  NY|
|    Bob| 25|  LD|
|Charlie| 35|  NY|
|  David| 22|  SF|
+-------+---+----+



---

## 4. Understanding Wide vs. Narrow Transformations (KEY CONCEPT for DEs)

Understanding these types is vital for writing performant Spark applications.

| Feature             | Narrow Transformations                                      | Wide Transformations (Shuffles)                                      |
| :------------------ | :---------------------------------------------------------- | :------------------------------------------------------------------- |
| **Input to Output** | Each input partition contributes to **at most one** output partition. | Each input partition can contribute to **multiple** output partitions.   |
| **Data Shuffling**  | **NO data shuffling** across the network.                   | **REQUIRES data shuffling** across the network (expensive).           |
| **Performance**     | Generally **faster** (no network I/O).                    | Can be significantly **slower** (network I/O, serialization/deserialization, disk I/O). |
| **Examples**        | `filter()`, `map()` (RDD API in PySpark), `withColumn()`, `select()`, `union()` | `groupBy()`, `orderBy()`, `sort()`, `distinct()`, `dropDuplicates()`, `repartition()`, `join()` (unless broadcast). |
| **DE Imperative**   | **Prefer these** when possible.                           | **Optimize them** when unavoidable.                                      |
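
The partition-level difference in the table can be made concrete with a toy model in plain Python. This is only a sketch of the idea, not how Spark is implemented: partitions are modelled as lists, and the character-sum "hash" is a deterministic stand-in for Spark's hash partitioner. The function names are invented for this illustration.

```python
# Toy model: a "DataFrame" is a list of partitions, each a list of (name, age) rows.
partitions = [
    [("Alice", 30), ("Bob", 25)],
    [("Charlie", 35), ("Alice", 30)],
    [("David", 22), ("Charlie", 35)],
]

def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]

def wide_dedup(parts, num_out):
    """Wide: rows are re-bucketed by key, so any input partition may feed any output."""
    buckets = [[] for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(parts):
        for row in part:
            # Deterministic stand-in for a hash partitioner (assumption of this sketch).
            dest = sum(map(ord, row[0])) % num_out
            if dest != src:
                moved += 1  # this row would cross the network in a real cluster
            buckets[dest].append(row)
    # After the shuffle, equal rows share a bucket, so each bucket dedupes locally.
    return [sorted(set(bucket)) for bucket in buckets], moved

filtered = narrow_filter(partitions, lambda row: row[1] > 28)
deduped, moved = wide_dedup(partitions, num_out=3)

print("filtered:", filtered)
print("deduped:", deduped)
print("rows that changed partition:", moved)
```

The filter never looks outside its own partition, while the de-duplication has to bring equal rows together first, and that regrouping is the shuffle.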

**Why it matters for Data Engineers:**

*   **Cost of Shuffles**: Shuffles combine network I/O, disk I/O, and serialization CPU overhead, which makes them among the most expensive operations in a Spark job.
*   **Data Skew**: Shuffles can lead to imbalanced partitions (data skew), causing some tasks to run significantly longer and creating bottlenecks.
*   **Optimization Strategies**:
    *   **Minimize Shuffles**: Restructure logic to perform narrow ops first.
    *   **Proper Partitioning**: Ensure data is optimally partitioned before wide operations.
    *   **`spark.sql.shuffle.partitions`**: Tune this configuration (default 200) to balance partition size and task overhead.
    *   **Broadcast Joins**: Use for small lookup tables to avoid a shuffle during `join` operations.
