<a href="https://colab.research.google.com/github/amrit6878/Learning-PySpark/blob/main/TypeCasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting a column’s data type to another type.

Commonly required when:

1. Reading data from CSV/JSON where numbers are read as strings.
2. Preparing data for joins, aggregations, or mathematical operations.
3. Ensuring schema consistency in ETL pipelines.

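`.cast()` → Convert a column to the requested type, given as a type name such as `"int"` or as a `DataType` object.

`withColumn()` → Create a new column, or replace an existing one when the same name is reused, holding the cast values.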

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("TypeCastingExample").getOrCreate()

data = [
    ("101", "Alice", "25", "50000.50", "2025-01-10 12:30:00"),
    ("102", "Bob", "30", "60000.00", "2025-02-15 09:15:00"),
    ("103", "Charlie", None, "55000.75", None)
]
columns = ["ID", "Name", "Age", "Salary", "JoiningDate"]

df = spark.createDataFrame(data, columns)
df.printSchema()
df.show(truncate=False)

root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Salary: string (nullable = true)
 |-- JoiningDate: string (nullable = true)

+---+-------+----+--------+-------------------+
|ID |Name   |Age |Salary  |JoiningDate        |
+---+-------+----+--------+-------------------+
|101|Alice  |25  |50000.50|2025-01-10 12:30:00|
|102|Bob    |30  |60000.00|2025-02-15 09:15:00|
|103|Charlie|NULL|55000.75|NULL               |
+---+-------+----+--------+-------------------+



Convert Age to IntegerType:

In [2]:
df_cast = df.withColumn("Age_Int", col("Age").cast("int"))
df_cast.show()
df_cast.printSchema()

+---+-------+----+--------+-------------------+-------+
| ID|   Name| Age|  Salary|        JoiningDate|Age_Int|
+---+-------+----+--------+-------------------+-------+
|101|  Alice|  25|50000.50|2025-01-10 12:30:00|     25|
|102|    Bob|  30|60000.00|2025-02-15 09:15:00|     30|
|103|Charlie|NULL|55000.75|               NULL|   NULL|
+---+-------+----+--------+-------------------+-------+

root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Salary: string (nullable = true)
 |-- JoiningDate: string (nullable = true)
 |-- Age_Int: integer (nullable = true)



Convert ID → Integer, Age → Integer, Salary → Double

In [3]:
df_cast = df \
    .withColumn("ID_Int", col("ID").cast("int")) \
    .withColumn("Age_Int", col("Age").cast("int")) \
    .withColumn("Salary_Double", col("Salary").cast("double"))

df_cast.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Salary: string (nullable = true)
 |-- JoiningDate: string (nullable = true)
 |-- ID_Int: integer (nullable = true)
 |-- Age_Int: integer (nullable = true)
 |-- Salary_Double: double (nullable = true)



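# **Using selectExpr() for Casting**

`selectExpr()` takes SQL expression strings, so each cast is written as `cast(column as type)` followed by an alias. Only the listed columns are returned, which is why `JoiningDate` does not appear in the result below.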

In [4]:
df_cast = df.selectExpr(
    "cast(ID as int) ID_Int",
    "Name",
    "cast(Age as int) Age_Int",
    "cast(Salary as double) Salary_Double"
)
df_cast.printSchema()

root
 |-- ID_Int: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age_Int: integer (nullable = true)
 |-- Salary_Double: double (nullable = true)
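
When many columns need casting, the expression strings can be generated instead of typed by hand. The sketch below is one possible way to do that. `build_cast_exprs` is a hypothetical helper (not a PySpark function), and it assumes column names are plain identifiers that need no backtick quoting.

```python
def build_cast_exprs(columns, casts):
    """Build selectExpr() strings, casting only the columns listed in `casts`.

    Hypothetical helper for illustration; not part of PySpark.
    Assumes column names are plain identifiers (no spaces or special characters).
    """
    exprs = []
    for name in columns:
        if name in casts:
            target = casts[name]
            # Alias follows the <Column>_<Type> naming used in this notebook.
            exprs.append(f"cast({name} as {target}) {name}_{target.capitalize()}")
        else:
            # Columns without a cast are passed through unchanged.
            exprs.append(name)
    return exprs


exprs = build_cast_exprs(
    ["ID", "Name", "Age", "Salary"],
    {"ID": "int", "Age": "int", "Salary": "double"},
)
for e in exprs:
    print(e)
```

The resulting list can be passed on as `df.selectExpr(*exprs)`, which should give the same four columns as the hand-written version.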



Handling invalid values: `na.fill()` replaces NULLs with defaults, and `when()`/`otherwise()` performs a conditional conversion. Useful for:

- Conditional cleaning
- Handling invalid data formats
- Setting defaults

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("HandleInvalidData").getOrCreate()

data = [
    ("101", "Alice", "25", "50000.50"),
    ("102", "Bob", None, "abc"),        # Invalid Salary
    ("103", "Charlie", "thirty", "55000.75"), # Invalid Age
    ("104", "David", None, None),       # Null Age and Salary
    ("105", "Eva", "40", "80000.00")
]
columns = ["ID", "Name", "Age", "Salary"]

df = spark.createDataFrame(data, columns)
df.show(truncate=False)

+---+-------+------+--------+
|ID |Name   |Age   |Salary  |
+---+-------+------+--------+
|101|Alice  |25    |50000.50|
|102|Bob    |NULL  |abc     |
|103|Charlie|thirty|55000.75|
|104|David  |NULL  |NULL    |
|105|Eva    |40    |80000.00|
+---+-------+------+--------+



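`na.fill()` replaces only NULL values with the given defaults; invalid strings such as `abc` and `thirty` are left unchanged. The defaults are passed as strings here because `Age` and `Salary` are still string columns.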


In [6]:
df_filled = df.na.fill({"Age": "0", "Salary": "0.0"})
df_filled.show()

+---+-------+------+--------+
| ID|   Name|   Age|  Salary|
+---+-------+------+--------+
|101|  Alice|    25|50000.50|
|102|    Bob|     0|     abc|
|103|Charlie|thirty|55000.75|
|104|  David|     0|     0.0|
|105|    Eva|    40|80000.00|
+---+-------+------+--------+



In [7]:
# Cast Age to int when the whole string is digits; otherwise default to 0
df_age_clean = df_filled.withColumn(
    "Age_Int",
    when(col("Age").rlike("^[0-9]+$"), col("Age").cast("int")).otherwise(0)
)
df_age_clean.show()

+---+-------+------+--------+-------+
| ID|   Name|   Age|  Salary|Age_Int|
+---+-------+------+--------+-------+
|101|  Alice|    25|50000.50|     25|
|102|    Bob|     0|     abc|      0|
|103|Charlie|thirty|55000.75|      0|
|104|  David|     0|     0.0|      0|
|105|    Eva|    40|80000.00|     40|
+---+-------+------+--------+-------+

