<a href="https://colab.research.google.com/github/anjli01/PySpark-Notes/blob/main/05_Column_Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Column Operations

Column operations are fundamental for data cleaning, transformation, and feature engineering in PySpark. They allow you to manipulate, create, and modify data within DataFrame columns efficiently.

### Key Functions for Column Expressions

PySpark provides the following building blocks for column expressions, most of them in `pyspark.sql.functions`:

| Function           | Description                                                                                                                                                                                                                                                         | Interview Tip                                                                                                                                                                                                                                                                                                                                                                                                   |
| :----------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `col(column_name)` | **References a column by its name** and returns a `Column` expression that is resolved when it is applied to a DataFrame. | `col("name")`, `df["name"]`, and `df.name` all produce a `Column`. `col()` is not tied to a particular DataFrame variable, so it composes cleanly in chained transformations. Attribute access (`df.name`) breaks for names that contain spaces or clash with DataFrame attributes, while `df["name"]` is a common way to disambiguate same-named columns after a join. |
| `lit(value)`       | **Creates a literal column with the given value.** Useful for adding constant values to a DataFrame.                                                                                                                                                                | Use `lit()` when you need to introduce a fixed value into your DataFrame, such as a default category, a timestamp of processing, or a placeholder.                                                                                                                                                                                                                                                             |
| `expr(sql_expression_string)` | **Parses a SQL expression string into a `Column`**, so SQL syntax can be used directly within DataFrame operations. Handy for logic that is awkward to express with other PySpark functions. | Treat `expr()` as an escape hatch: if a SQL expression already does exactly what you need, `expr()` is often the quickest way to bring it into the DataFrame API. Be cautious with readability when the SQL string gets long. |
| `when(condition, value)` | **Implements conditional logic** (similar to `IF` or `CASE` in SQL). If `condition` is true, it returns `value`. Often chained and terminated with `otherwise()`.                                                                                                                      | `when()` is crucial for creating new categorical features, flagging data, or applying different calculations based on specific conditions. It's a cornerstone of data transformation, enabling dynamic logic.                                                                                                                                                                                           |
| `otherwise(value)` | **A `Column` method chained after `when()`** (it is not imported from `pyspark.sql.functions`). It specifies the value returned when none of the preceding `when()` conditions are met. | `otherwise()` acts as the `ELSE` clause of a SQL `CASE` expression. If you omit it, rows that match no condition get `null`, which is easy to miss when you expected a default. |

### Examples: Conditional Logic with `when()` and `otherwise()`

`when()` and `otherwise()` are powerful for creating new columns based on complex conditions. You can chain multiple `when()` clauses, and the first condition that evaluates to true will have its corresponding value returned.

**Setup for Conditional Logic Examples:**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr, when, concat_ws

spark = SparkSession.builder.appName("CalculatedColumns").getOrCreate()

data_cond = [("Alice", 18, 50000),
             ("Bob", 25, 60000),
             ("Charlie", 32, 80000),
             ("David", 40, 95000),
             ("Eve", 16, 40000)]
columns_cond = ["Name", "Age", "Salary"]
df_cond = spark.createDataFrame(data_cond, columns_cond)
print("Original DataFrame (for conditional logic):")
df_cond.show()

Original DataFrame (for conditional logic):
+-------+---+------+
|   Name|Age|Salary|
+-------+---+------+
|  Alice| 18| 50000|
|    Bob| 25| 60000|
|Charlie| 32| 80000|
|  David| 40| 95000|
|    Eve| 16| 40000|
+-------+---+------+



#### 1. Creating an "AgeGroup" Column


In [2]:
df_with_age_group = df_cond.withColumn("AgeGroup",
                                       when(col("Age") < 20, "Teenager")
                                       .when(col("Age") < 30, "Young Adult")  # checked only if Age < 20 did not match
                                       .otherwise("Adult"))
print("DataFrame with 'AgeGroup':")
df_with_age_group.show()

DataFrame with 'AgeGroup':
+-------+---+------+-----------+
|   Name|Age|Salary|   AgeGroup|
+-------+---+------+-----------+
|  Alice| 18| 50000|   Teenager|
|    Bob| 25| 60000|Young Adult|
|Charlie| 32| 80000|      Adult|
|  David| 40| 95000|      Adult|
|    Eve| 16| 40000|   Teenager|
+-------+---+------+-----------+



#### 2. Creating a "TaxBracket" Column (with compound conditions)


In [3]:
df_with_tax_bracket = df_cond.withColumn("TaxBracket",
                                         when(col("Salary") < 50000, "Low")
                                         # Parenthesize each comparison: & binds tighter than >= and <
                                         .when((col("Salary") >= 50000) & (col("Salary") < 80000), "Medium")
                                         .otherwise("High"))
print("DataFrame with 'TaxBracket':")
df_with_tax_bracket.show()

DataFrame with 'TaxBracket':
+-------+---+------+----------+
|   Name|Age|Salary|TaxBracket|
+-------+---+------+----------+
|  Alice| 18| 50000|    Medium|
|    Bob| 25| 60000|    Medium|
|Charlie| 32| 80000|      High|
|  David| 40| 95000|      High|
|    Eve| 16| 40000|       Low|
+-------+---+------+----------+



### Chaining Multiple `when()` Clauses (CASE WHEN equivalent)

Chained `when()` clauses behave like `CASE WHEN ... THEN ... WHEN ... THEN ... ELSE ... END` in SQL. The conditions are evaluated in order, and the first one that evaluates to `true` supplies the value, so list the most specific conditions first.

**Setup for Chained `when()` Examples:**

In [4]:
data_nested = [("Alice", 85),
               ("Bob", 72),
               ("Charlie", 91),
               ("David", 60),
               ("Eve", 45)]
columns_nested = ["Student", "Score"]
df_nested = spark.createDataFrame(data_nested, columns_nested)
print("Original DataFrame (for nested conditions):")
df_nested.show()

Original DataFrame (for nested conditions):
+-------+-----+
|Student|Score|
+-------+-----+
|  Alice|   85|
|    Bob|   72|
|Charlie|   91|
|  David|   60|
|    Eve|   45|
+-------+-----+



#### 1. Assigning Grades based on Score


In [5]:
df_with_grades = df_nested.withColumn("Grade",
                                      when(col("Score") >= 90, "A")
                                      .when(col("Score") >= 80, "B")
                                      .when(col("Score") >= 70, "C")
                                      .when(col("Score") >= 60, "D")
                                      .otherwise("F")) # Default if no other conditions met
print("DataFrame with assigned Grades:")
df_with_grades.show()

DataFrame with assigned Grades:
+-------+-----+-----+
|Student|Score|Grade|
+-------+-----+-----+
|  Alice|   85|    B|
|    Bob|   72|    C|
|Charlie|   91|    A|
|  David|   60|    D|
|    Eve|   45|    F|
+-------+-----+-----+



#### 2. Complex Scholarship Eligibility Example


In [6]:
df_with_scholarship = df_nested.withColumn("ScholarshipStatus",
                                           # lit() is optional here: when() and otherwise() also accept plain Python values
                                           when((col("Score") >= 90) & (col("Student") == "Charlie"), lit("Full Scholarship"))
                                           .when(col("Score") >= 85, lit("Partial Scholarship"))
                                           .when(col("Score") >= 70, lit("Eligibility Review"))
                                           .otherwise(lit("Not Eligible")))
print("DataFrame with Scholarship Status:")
df_with_scholarship.show()

DataFrame with Scholarship Status:
+-------+-----+-------------------+
|Student|Score|  ScholarshipStatus|
+-------+-----+-------------------+
|  Alice|   85|Partial Scholarship|
|    Bob|   72| Eligibility Review|
|Charlie|   91|   Full Scholarship|
|  David|   60|       Not Eligible|
|    Eve|   45|       Not Eligible|
+-------+-----+-------------------+

