
### DataFrame Transformations

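Transformations are operations that return a new DataFrame; the original DataFrame is never modified. They are lazy: Spark records them in a query plan and runs nothing until an action such as `show()` or `count()` is called.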

#### 1. `withColumn()`

Used to add a new column or replace an existing column in a DataFrame.

*   **Add a new column with a literal value:**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col

spark = SparkSession.builder.appName("DFTransformations").getOrCreate()
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+



In [2]:
# Add a new column 'City' with a literal value "Unknown"
df_with_city = df.withColumn("City", lit("Unknown"))
print("After adding 'City' column:")
df_with_city.show()

After adding 'City' column:
+-------+---+-------+
|   Name|Age|   City|
+-------+---+-------+
|  Alice| 30|Unknown|
|    Bob| 25|Unknown|
|Charlie| 35|Unknown|
+-------+---+-------+



*   **Add a new column based on an existing column:**

In [3]:
# Add a new column 'Age_in_Months' based on 'Age'
df_with_months = df.withColumn("Age_in_Months", col("Age") * 12)
print("After adding 'Age_in_Months' column:")
df_with_months.show()

After adding 'Age_in_Months' column:
+-------+---+-------------+
|   Name|Age|Age_in_Months|
+-------+---+-------------+
|  Alice| 30|          360|
|    Bob| 25|          300|
|Charlie| 35|          420|
+-------+---+-------------+



*   **Replace an existing column (e.g., change data type):**

In [4]:
# Replace 'Age' column, converting its type to String
df_replaced_age = df.withColumn("Age", col("Age").cast("string"))
print("After replacing 'Age' column type:")
df_replaced_age.printSchema()
df_replaced_age.show()

After replacing 'Age' column type:
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+



*(Remember: passing an existing column name to `withColumn` replaces that column instead of adding a new one.)*

#### 2. `drop()`

Used to remove one or more columns from a DataFrame.

*   **Remove a single column:**

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DFTransformations").getOrCreate()
data = [("Alice", 30, "New York"), ("Bob", 25, "London"), ("Charlie", 35, "Paris")]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
df.show()

df_no_city = df.drop("City")
print("After dropping 'City' column:")
df_no_city.show()

+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 30|New York|
|    Bob| 25|  London|
|Charlie| 35|   Paris|
+-------+---+--------+

After dropping 'City' column:
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+



*   **Remove multiple columns:**

In [6]:
df_no_age_city = df.drop("Age", "City")
print("After dropping 'Age' and 'City' columns:")
df_no_age_city.show()

After dropping 'Age' and 'City' columns:
+-------+
|   Name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+



#### 3. `alias()`

`alias()` is a `Column` method that gives a column or expression a new name in the result. It is most often used inside `select()`.

*   **Using `alias()` in `select()` to rename a column:**

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DFTransformations").getOrCreate()
data = [("Alice", 30), ("Bob", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

df.select(col("Name").alias("Full_Name"), col("Age")).show()

+-----+---+
| Name|Age|
+-----+---+
|Alice| 30|
|  Bob| 25|
+-----+---+

+---------+---+
|Full_Name|Age|
+---------+---+
|    Alice| 30|
|      Bob| 25|
+---------+---+



*   **Using `alias()` with an expression in `select()`:**

In [8]:
df.select((col("Age") * 2).alias("Double_Age")).show()

+----------+
|Double_Age|
+----------+
|        60|
|        50|
+----------+



#### 4. `selectExpr()`

Accepts SQL expression strings, so you can select, rename, and transform columns in one call. It is a convenient shorthand when an expression is easier to write in SQL than with column functions.

*   **Select specific columns and rename:**

In [9]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DFTransformations").getOrCreate()
data = [("Alice", 30, "New York"), ("Bob", 25, "London")]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
df.show()

df.selectExpr("Name", "Age as Person_Age", "City").show()

+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 30|New York|
|  Bob| 25|  London|
+-----+---+--------+

+-----+----------+--------+
| Name|Person_Age|    City|
+-----+----------+--------+
|Alice|        30|New York|
|  Bob|        25|  London|
+-----+----------+--------+



*   **Perform calculations using SQL expressions:**

In [10]:
df.selectExpr("Name", "Age * 12 as Age_in_Months", "upper(City) as City_Upper").show()

+-----+-------------+----------+
| Name|Age_in_Months|City_Upper|
+-----+-------------+----------+
|Alice|          360|  NEW YORK|
|  Bob|          300|    LONDON|
+-----+-------------+----------+



*   **Apply conditional logic (CASE WHEN):**

In [11]:
df.selectExpr("Name", "CASE WHEN Age > 28 THEN 'Adult' ELSE 'Young' END as Category").show()

+-----+--------+
| Name|Category|
+-----+--------+
|Alice|   Adult|
|  Bob|   Young|
+-----+--------+



#### 5. Renaming Columns

PySpark DataFrames have no `rename()` method like pandas. Use `withColumnRenamed()` or `select()` with `alias()` instead.

*   **`withColumnRenamed()`**: The most common and direct way to rename a *single* column.

In [12]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DFTransformations").getOrCreate()
data = [("Alice", 30), ("Bob", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

+-----+---+
| Name|Age|
+-----+---+
|Alice| 30|
|  Bob| 25|
+-----+---+



In [13]:
df_renamed = df.withColumnRenamed("Name", "Full_Name")
print("After renaming 'Name' to 'Full_Name':")
df_renamed.show()

After renaming 'Name' to 'Full_Name':
+---------+---+
|Full_Name|Age|
+---------+---+
|    Alice| 30|
|      Bob| 25|
+---------+---+



*   **Renaming multiple columns:**
    *   **Chaining `withColumnRenamed()`**:

In [14]:
df_renamed_multiple = df.withColumnRenamed("Name", "PersonName").withColumnRenamed("Age", "PersonAge")
print("After renaming multiple columns (chaining):")
df_renamed_multiple.show()

After renaming multiple columns (chaining):
+----------+---------+
|PersonName|PersonAge|
+----------+---------+
|     Alice|       30|
|       Bob|       25|
+----------+---------+



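    *   **Using `select()` with `alias()`**: Renames every column in a single projection. Each `withColumnRenamed()` call adds its own projection to the query plan, so a single `select()` keeps the plan smaller when there are many renames.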

In [15]:
from pyspark.sql.functions import col
df_renamed_select = df.select(
    col("Name").alias("Person_Name"),
    col("Age").alias("Person_Age")
)
print("After renaming multiple columns (using select with alias):")
df_renamed_select.show()

After renaming multiple columns (using select with alias):
+-----------+----------+
|Person_Name|Person_Age|
+-----------+----------+
|      Alice|        30|
|        Bob|        25|
+-----------+----------+



---

### Chaining Transformations

Because each transformation returns a new DataFrame, transformations can be chained. Chaining keeps the code concise and readable. Because evaluation is lazy, Spark sees the whole sequence before running anything and can optimize it as a single plan, whether the steps are chained or assigned to intermediate variables.

**Example (Python):**

In [16]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("ChainingTransformations").getOrCreate()

data = [("Alice", 30, "NY", 75000),
        ("Bob", 25, "LD", 60000),
        ("Charlie", 35, "NY", 90000),
        ("David", 22, "SF", 55000),
        ("Eve", 40, "LD", 100000)]
columns = ["Name", "Age", "City_Code", "Salary"]
df = spark.createDataFrame(data, columns)
df.show()

+-------+---+---------+------+
|   Name|Age|City_Code|Salary|
+-------+---+---------+------+
|  Alice| 30|       NY| 75000|
|    Bob| 25|       LD| 60000|
|Charlie| 35|       NY| 90000|
|  David| 22|       SF| 55000|
|    Eve| 40|       LD|100000|
+-------+---+---------+------+



In [17]:
# Chain multiple transformations:
# 1. Filter for age > 25
# 2. Add a new column 'Bonus' (10% of salary)
# 3. Select specific columns and rename 'Age' to 'YearsOld'
# 4. Add a literal 'Status' column
# 5. Sort by Salary in descending order

df_processed = (
    df.filter(col("Age") > 25)
    .withColumn("Bonus", col("Salary") * 0.10)
    .select(
        col("Name"),
        col("Age").alias("YearsOld"),
        col("Salary"),
        col("Bonus"),
        lit("Processed").alias("Status"),
    )
    .sort(col("Salary").desc())
)

print("Chained transformations result:")
df_processed.show()

spark.stop()

Chained transformations result:
+-------+--------+------+-------+---------+
|   Name|YearsOld|Salary|  Bonus|   Status|
+-------+--------+------+-------+---------+
|    Eve|      40|100000|10000.0|Processed|
|Charlie|      35| 90000| 9000.0|Processed|
|  Alice|      30| 75000| 7500.0|Processed|
+-------+--------+------+-------+---------+




---

### Creating Calculated Columns

Calculated columns are new columns derived from existing columns using expressions or functions.

**Key functions for calculations:**

| Function | Description |
| :--- | :--- |
| `col("column_name")` | References a column by name so it can be used in expressions. |
| `lit(value)` | Creates a column holding a constant value, for use on its own or inside an expression. |
| `when(condition, value).otherwise(default_value)` | Implements conditional logic (like SQL's `CASE WHEN`). Chain `.when()` calls for multiple conditions; `otherwise` supplies the default when none match. Without `otherwise`, unmatched rows are null. |
| `concat_ws("separator", col1, col2, ...)` | Concatenates string columns into a single string column, using the specified separator. |
| `col("column_name").cast("new_type")` | Converts a column to another data type. `cast` is a method on a column, not a standalone function. |

**Example (Python):**


In [18]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, lit, when

spark = SparkSession.builder.appName("CalculatedColumns").getOrCreate()

data = [("Alice", "Smith", 30, 75000),
        ("Bob", "Johnson", 25, 60000),
        ("Charlie", "Brown", 35, 90000)]
columns = ["FirstName", "LastName", "Age", "Salary"]
df = spark.createDataFrame(data, columns)
df.show()

# 1. Calculate 'FullName'
df_full_name = df.withColumn("FullName", concat_ws(" ", col("FirstName"), col("LastName")))
print("After adding 'FullName':")
df_full_name.show()

+---------+--------+---+------+
|FirstName|LastName|Age|Salary|
+---------+--------+---+------+
|    Alice|   Smith| 30| 75000|
|      Bob| Johnson| 25| 60000|
|  Charlie|   Brown| 35| 90000|
+---------+--------+---+------+

After adding 'FullName':
+---------+--------+---+------+-------------+
|FirstName|LastName|Age|Salary|     FullName|
+---------+--------+---+------+-------------+
|    Alice|   Smith| 30| 75000|  Alice Smith|
|      Bob| Johnson| 25| 60000|  Bob Johnson|
|  Charlie|   Brown| 35| 90000|Charlie Brown|
+---------+--------+---+------+-------------+



In [19]:
# 2. Calculate 'AnnualBonus' with conditional logic (using when/otherwise)
# 10% of salary if age > 30, else 5%
df_bonus = df.withColumn("AnnualBonus",
                         when(col("Age") > 30, col("Salary") * 0.10)
                         .otherwise(col("Salary") * 0.05))
print("After adding 'AnnualBonus':")
df_bonus.show()

After adding 'AnnualBonus':
+---------+--------+---+------+-----------+
|FirstName|LastName|Age|Salary|AnnualBonus|
+---------+--------+---+------+-----------+
|    Alice|   Smith| 30| 75000|     3750.0|
|      Bob| Johnson| 25| 60000|     3000.0|
|  Charlie|   Brown| 35| 90000|     9000.0|
+---------+--------+---+------+-----------+



In [20]:
# 3. Combine multiple calculations and selections (chained withColumn and when)
df_combined = df.withColumn("FullName", concat_ws(" ", col("FirstName"), col("LastName"))) \
                .withColumn("ExperienceCategory",
                            when(col("Age") < 28, "Junior")
                            .when(col("Age") < 35, "Mid-level") # This condition is checked if the first one is false
                            .otherwise("Senior")) \
                .select("FullName", "Age", "Salary", "ExperienceCategory")

print("After combining multiple calculations:")
df_combined.show()

spark.stop()

After combining multiple calculations:
+-------------+---+------+------------------+
|     FullName|Age|Salary|ExperienceCategory|
+-------------+---+------+------------------+
|  Alice Smith| 30| 75000|         Mid-level|
|  Bob Johnson| 25| 60000|            Junior|
|Charlie Brown| 35| 90000|            Senior|
+-------------+---+------+------------------+

