
## Handling Nulls in Spark DataFrames for Beginner Data Engineers

### 1. Introduction: Why Handle Nulls?

Missing data is ubiquitous in real-world datasets. If left unaddressed, nulls can:
*   Lead to incorrect analysis and aggregations (e.g., `avg` ignores nulls, and `count(column)` skips them while `count(*)` counts every row).
*   Cause errors in downstream processing or machine learning models.
*   Impact data integrity and quality.

Spark offers several methods to effectively manage nulls, allowing you to clean and prepare your data for further processing.

### 2. Core `df.na` Methods

These methods are accessed through `df.na` (e.g., `my_dataframe.na.fill(...)`).

#### a. `na.fill()`: Replacing Null Values

*   **Purpose**: Replaces `null` values in specified columns with a given value.
*   **Parameters**:
    *   `value`: The replacement value. Can be:
        *   A single value (e.g., `0`, `"Unknown"`) to apply to all compatible columns.
        *   A dictionary `{column_name: replacement_value}` to specify different values for different columns.
    *   `subset`: (Optional) A list of column names to apply the fill operation to. If `None`, it applies to all columns of compatible type.

*   **Key Points**:
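    *   `value` type must be compatible with the column's data type. Columns whose type does not match are silently skipped rather than raising an error.
    *   Using a dictionary for `value` provides fine-grained control for different column types (e.g., `0` for numeric, `"N/A"` for string). When `value` is a dictionary, `subset` is ignored.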

**Example (Python):**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# Initialize SparkSession
spark = SparkSession.builder.appName("HandlingNulls").getOrCreate()

# Sample Data
data = [("Alice", None, "New York", 100.0),
        ("Bob", 25, "London", None),
        ("Charlie", 35, None, 120.5),
        ("David", None, None, None),
        (None, 40, "Paris", 90.0),
        ("Eve", 28, "Berlin", 110.0),
        ("Frank", None, None, None)] # Added for more examples later

columns = ["Name", "Age", "City", "Score"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

print("\n--- 1. Using na.fill() ---")

# Fill nulls in 'Age' with 0 and nulls in 'City' with "Unknown"
print("\nAfter filling 'Age' nulls with 0 and 'City' nulls with 'Unknown':")
df.na.fill(0, subset=["Age"]) \
  .na.fill("Unknown", subset=["City"]) \
  .show()

# You can also use a dictionary for multiple types/columns in one go
print("\nAfter filling 'Age' with 99 and 'City' with 'N/A' using a dictionary:")
fill_values = {"Age": 99, "City": "N/A"}
df.na.fill(fill_values).show()

# Fill 'Score' nulls with an average (common practice for numeric data)
avg_score = df.select(avg("Score")).collect()[0][0] # Calculate average score from non-nulls
print(f"\nAverage Score for filling: {avg_score}")

df_filled_with_avg = df.na.fill(avg_score, subset=["Score"])
print("\nAfter filling 'Score' nulls with calculated average:")
df_filled_with_avg.show()

Original DataFrame:
+-------+----+--------+-----+
|   Name| Age|    City|Score|
+-------+----+--------+-----+
|  Alice|NULL|New York|100.0|
|    Bob|  25|  London| NULL|
|Charlie|  35|    NULL|120.5|
|  David|NULL|    NULL| NULL|
|   NULL|  40|   Paris| 90.0|
|    Eve|  28|  Berlin|110.0|
|  Frank|NULL|    NULL| NULL|
+-------+----+--------+-----+


--- 1. Using na.fill() ---

After filling 'Age' nulls with 0 and 'City' nulls with 'Unknown':
+-------+---+--------+-----+
|   Name|Age|    City|Score|
+-------+---+--------+-----+
|  Alice|  0|New York|100.0|
|    Bob| 25|  London| NULL|
|Charlie| 35| Unknown|120.5|
|  David|  0| Unknown| NULL|
|   NULL| 40|   Paris| 90.0|
|    Eve| 28|  Berlin|110.0|
|  Frank|  0| Unknown| NULL|
+-------+---+--------+-----+


After filling 'Age' with 99 and 'City' with 'N/A' using a dictionary:
+-------+---+--------+-----+
|   Name|Age|    City|Score|
+-------+---+--------+-----+
|  Alice| 99|New York|100.0|
|    Bob| 25|  London| NULL|
|Charlie| 35|     

#### b. `na.drop()`: Dropping Rows with Null Values

*   **Purpose**: Removes rows from the DataFrame that contain `null` values based on specified conditions.
*   **Parameters**:
    *   `how`: Specifies the condition for dropping rows.
        *   `'any'` (default): Drops a row if it contains *any* `null` values in the considered columns.
        *   `'all'`: Drops a row if *all* its values in the considered columns are `null`.
    *   `thresh`: (Optional) An integer. Drops a row if it has *fewer than `thresh`* non-null values in the considered columns. When set, it overrides `how`.
    *   `subset`: (Optional) A list of column names to consider for dropping. If `None`, all columns are considered.

*   **Key Points**:
    *   `na.drop()` is useful for removing incomplete records, but can lead to significant data loss if not used carefully.
    *   `thresh` provides a more flexible way to keep partially complete rows.

**Example (Python):**

In [2]:
# Continue from the previous SparkSession and df

print("\n--- 2. Using na.drop() ---")

print("\nOriginal DataFrame:")
df.show() # Showing df again for context

# Drop rows if they contain any null value in any column
print("\nAfter dropping rows with ANY null value:")
df.na.drop(how='any').show()

# Drop rows if 'Age' or 'City' is null
print("\nAfter dropping rows where 'Age' OR 'City' is null:")
df.na.drop(subset=["Age", "City"]).show()

# Drop rows if they have fewer than 2 non-null values (threshold)
print("\nAfter dropping rows with less than 2 non-null values across ALL columns:")
df.na.drop(thresh=2).show()

# With subset, thresh=1 drops a row only if it has fewer than 1 non-null value
# among 'Age' and 'City', i.e., only if BOTH are null
print("\nAfter dropping rows where BOTH 'Age' and 'City' are null (thresh=1):")
df.na.drop(thresh=1, subset=["Age", "City"]).show()


--- 2. Using na.drop() ---

Original DataFrame:
+-------+----+--------+-----+
|   Name| Age|    City|Score|
+-------+----+--------+-----+
|  Alice|NULL|New York|100.0|
|    Bob|  25|  London| NULL|
|Charlie|  35|    NULL|120.5|
|  David|NULL|    NULL| NULL|
|   NULL|  40|   Paris| 90.0|
|    Eve|  28|  Berlin|110.0|
|  Frank|NULL|    NULL| NULL|
+-------+----+--------+-----+


After dropping rows with ANY null value:
+----+---+------+-----+
|Name|Age|  City|Score|
+----+---+------+-----+
| Eve| 28|Berlin|110.0|
+----+---+------+-----+


After dropping rows where 'Age' OR 'City' is null:
+----+---+------+-----+
|Name|Age|  City|Score|
+----+---+------+-----+
| Bob| 25|London| NULL|
|NULL| 40| Paris| 90.0|
| Eve| 28|Berlin|110.0|
+----+---+------+-----+


After dropping rows with less than 2 non-null values across ALL columns:
+-------+----+--------+-----+
|   Name| Age|    City|Score|
+-------+----+--------+-----+
|  Alice|NULL|New York|100.0|
|    Bob|  25|  London| NULL|
|Charlie|  3

#### c. `na.replace()`: Replacing Specific Values

*   **Purpose**: Replaces specific existing values in specified columns with another value. It does not match `null` itself (use `na.fill()` for that). This is useful for cleaning up inconsistent data entries like `""`, `"N/A"`, or placeholder numbers.
*   **Parameters**:
    *   `to_replace`: The value(s) to look for. Can be:
        *   A single value (bool, int, float, or string), paired with `value`.
        *   A list of values, paired with either a single `value` or a list of the same length.
        *   A dictionary `{old_value: new_value, old_value_2: new_value_2}`, in which case `value` is omitted.
    *   `value`: The replacement value (or list of values), of the same kind as `to_replace` (string, numeric, or boolean). May be `None` to turn a placeholder into a real `null`.
    *   `subset`: (Optional) A list of column names to apply the replacement to. Column-specific maps such as `{col_name: {old_value: new_value}}` are not supported; call `na.replace()` once per column with `subset` instead.

*   **Key Points**:
    *   Allows for replacing non-null values, unlike `na.fill()`.
    *   Handy for standardizing entries or correcting data entry mistakes.

**Example (Python):**

In [3]:
# Continue from the previous SparkSession

# Sample data for replace
df_replace_example = spark.createDataFrame([("Apple", "Red"), ("Banana", "Yellow"),
                                            ("Grape", "Red"), ("Orange", "N/A")],
                                           ["Fruit", "Color"])
print("\n--- 3. Using na.replace() ---")
print("\nOriginal DataFrame for replace example:")
df_replace_example.show()

# Replace "Red" with "Crimson" in the "Color" column
print("\nAfter replacing 'Red' with 'Crimson' in 'Color' column:")
df_replace_example.na.replace("Red", "Crimson", "Color").show()

# Replace multiple values using a dictionary
print("\nAfter replacing 'Red' with 'Crimson' and 'N/A' with 'Unknown' in 'Color':")
replacement_map = {"Red": "Crimson", "N/A": "Unknown"}
df_replace_example.na.replace(replacement_map, subset=["Color"]).show()

# Using the original df, let's replace Age 25 with 100
print("\nOriginal DataFrame (again for Age replacement context):")
df.show()
print("\nAfter replacing Age 25 with 100:")
df.na.replace(25, 100, "Age").show() # Note: The original df has Bob, Age 25.


--- 3. Using na.replace() ---

Original DataFrame for replace example:
+------+------+
| Fruit| Color|
+------+------+
| Apple|   Red|
|Banana|Yellow|
| Grape|   Red|
|Orange|   N/A|
+------+------+


After replacing 'Red' with 'Crimson' in 'Color' column:
+------+-------+
| Fruit|  Color|
+------+-------+
| Apple|Crimson|
|Banana| Yellow|
| Grape|Crimson|
|Orange|    N/A|
+------+-------+


After replacing 'Red' with 'Crimson' and 'N/A' with 'Unknown' in 'Color':
+------+-------+
| Fruit|  Color|
+------+-------+
| Apple|Crimson|
|Banana| Yellow|
| Grape|Crimson|
|Orange|Unknown|
+------+-------+


Original DataFrame (again for Age replacement context):
+-------+----+--------+-----+
|   Name| Age|    City|Score|
+-------+----+--------+-----+
|  Alice|NULL|New York|100.0|
|    Bob|  25|  London| NULL|
|Charlie|  35|    NULL|120.5|
|  David|NULL|    NULL| NULL|
|   NULL|  40|   Paris| 90.0|
|    Eve|  28|  Berlin|110.0|
|  Frank|NULL|    NULL| NULL|
+-------+----+--------+-----+


Afte

### 3. Advanced Conditional Handling with `when().otherwise()`

For more complex null handling logic, especially when you need to derive new values based on conditions involving other columns or intricate rules, `when().otherwise()` from `pyspark.sql.functions` is highly effective.

*   **Purpose**: Allows you to create new columns or modify existing ones based on a series of conditions. It's powerful for implementing custom null-filling strategies.
*   **Key Concepts**:
    *   **`when(condition, value)`**: If `condition` is true, assign `value`.
    *   **`.otherwise(value)`**: If none of the preceding `when` conditions are true, assign this `value`.
    *   **Chaining**: You can chain multiple `when` clauses (`.when(condition2, value2).when(condition3, value3)...`) for multi-step logic.
    *   **`col().isNull()`**: Used to check if a column's value is null.
    *   **`lit(value)`**: Used to create a literal value (constant) in Spark expressions.

**Example (Python):**

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

# Initialize SparkSession (if not already running)
# spark = SparkSession.builder.appName("ConditionalNullHandling").getOrCreate()

# Sample Data
data = [("Alice", 30, "M", None),
        ("Bob", 25, "M", 60000),
        ("Charlie", None, "F", 80000), # Age is null
        ("David", 40, "M", None), # Salary is null
        ("Eve", 22, "F", 55000),
        ("Frank", None, None, None)] # Age, Gender, Salary are null
columns = ["Name", "Age", "Gender", "Salary"]
df_conditional = spark.createDataFrame(data, columns)

print("\n--- Conditional Handling with when().otherwise() ---")
print("\nOriginal DataFrame for conditional handling:")
df_conditional.show()

# Conditional filling for 'Salary':
# - If Salary is null AND Age is less than 30, fill with 50000.
# - If Salary is null AND Age is 30 or more, fill with 70000.
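# - Otherwise, keep original Salary. A row with a null Age matches neither
#   comparison, so a null Salary stays null there (see Frank in the output).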
df_salary_filled = df_conditional.withColumn("Salary_Filled",
    when(col("Salary").isNull() & (col("Age") < 30), lit(50000))
    .when(col("Salary").isNull() & (col("Age") >= 30), lit(70000))
    .otherwise(col("Salary")))

print("\nConditional filling for Salary:")
df_salary_filled.show()

# Conditional filling for 'Age' and 'Gender' (nested when statements):
# - If Age is null:
#   - If Gender is 'M', fill Age with 30.
#   - If Gender is 'F', fill Age with 28.
#   - Otherwise (Gender is also null or unknown), fill Age with 0.
# - Otherwise, keep original Age.
# For Gender:
# - If Gender is null, fill with "Unknown".
# - Otherwise, keep original Gender.

df_age_gender_filled = df_conditional.withColumn("Age_Filled",
    when(col("Age").isNull(), # Outer condition: if Age is null
        when(col("Gender") == "M", lit(30)) # Inner condition 1: if Gender is M
        .when(col("Gender") == "F", lit(28)) # Inner condition 2: if Gender is F
        .otherwise(lit(0)) # Default if Gender is also null or neither M/F
    )
    .otherwise(col("Age")) # If Age is not null, keep original Age
).withColumn("Gender_Filled",
    when(col("Gender").isNull(), lit("Unknown"))
    .otherwise(col("Gender"))
)

print("\nConditional filling for Age and Gender:")
df_age_gender_filled.show()


--- Conditional Handling with when().otherwise() ---

Original DataFrame for conditional handling:
+-------+----+------+------+
|   Name| Age|Gender|Salary|
+-------+----+------+------+
|  Alice|  30|     M|  NULL|
|    Bob|  25|     M| 60000|
|Charlie|NULL|     F| 80000|
|  David|  40|     M|  NULL|
|    Eve|  22|     F| 55000|
|  Frank|NULL|  NULL|  NULL|
+-------+----+------+------+


Conditional filling for Salary:
+-------+----+------+------+-------------+
|   Name| Age|Gender|Salary|Salary_Filled|
+-------+----+------+------+-------------+
|  Alice|  30|     M|  NULL|        70000|
|    Bob|  25|     M| 60000|        60000|
|Charlie|NULL|     F| 80000|        80000|
|  David|  40|     M|  NULL|        70000|
|    Eve|  22|     F| 55000|        55000|
|  Frank|NULL|  NULL|  NULL|         NULL|
+-------+----+------+------+-------------+


Conditional filling for Age and Gender:
+-------+----+------+------+----------+-------------+
|   Name| Age|Gender|Salary|Age_Filled|Gender_Fill

### 4. Summary and Best Practices

*   **`na.fill()`**: Best for simple, direct replacements of nulls with a constant value (numeric, string, etc.) or a dictionary of column-specific values.
*   **`na.drop()`**: Use when incomplete rows are undesirable or when a large percentage of nulls makes a record useless. Be cautious of data loss.
*   **`na.replace()`**: Ideal for standardizing non-null values, fixing data entry errors, or handling specific placeholder values that aren't technically `null`.
*   **`when().otherwise()`**: The most flexible method for complex, rule-based null handling, conditional logic across columns, or deriving new values. It's often used when simple `fill` or `drop` isn't sufficient.

Always understand your data and the implications of your null-handling strategy. Incorrect handling can lead to biased analyses or data loss.

In [5]:
# Stop the SparkSession
spark.stop()