##### PySpark cast column
- In PySpark, you can **change the data type** of a column using the **cast()** function of a `Column`.

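Below are **subclasses** of the **DataType** class in Spark; DataFrame columns can be cast only to types from this hierarchy. Note that `NumericType` is an abstract parent of concrete types such as `IntegerType`, `LongType` and `DoubleType`, and that `ObjectType` and `HiveStringType` come from Spark's Scala/JVM side and do not appear in `pyspark.sql.types`.

- `NumericType`
- `StringType`
- `DateType`
- `TimestampType`
- `ArrayType`
- `StructType`
- `ObjectType`
- `MapType`
- `BinaryType`
- `BooleanType`
- `CalendarIntervalType`
- `HiveStringType`
- `NullType`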

**Common Functionality**

✅ Converts a **string or timestamp** into a **date**; strings are expected in **yyyy-MM-dd** format.

✅ If the string **does not match** that format, the result is **NULL** (with ANSI mode disabled; when `spark.sql.ansi.enabled` is `true`, the invalid cast raises an error instead).

✅ **Cannot** take a **custom format**, unlike **to_date()**.

✅ **Removes** the **time** portion from **TimestampType**, keeping **only the date**.

##### .cast("date")

- In Spark, **.cast("date")** is a **shorthand** way to **cast** a column to **DateType**.

- **No** need to **import DateType**.

- Internally, **Spark converts "date" to DateType()**.

##### cast(DateType())

✅ Equivalent to **.cast("date")**, but more explicit in code.

| col_name              | After col_name.cast(DateType()) | After col_name.cast("date") |
|-----------------------|---------------------------------|-----------------------------|
| "2024-03-06"          | 2024-03-06 (Date)               | 2024-03-06 (Date)           |
| "06-03-2024"          | NULL (Format mismatch)          | NULL (Format mismatch)      |
| "2024-03-06 12:30:00" | 2024-03-06 (Time removed)       | 2024-03-06 (Time removed)   |

**✔ When does cast() work?**

- **String format** must be:

      yyyy-MM-dd HH:mm:ss -> default timestamp format
      yyyy-MM-dd          -> default date format

**❌ When does cast() return NULL?**

      "15-07-2024 10:30:00"   ❌  -> custom timestamp format
      "2024/07/15 10:30"     ❌   -> custom timestamp format

In [0]:
from pyspark.sql.functions import col
from pyspark.sql.types import DateType, StringType, DoubleType, BooleanType

In [0]:
data = [(101, 'Hitesh', 'baleno', '2507623', 25.1, "3000.6089", 'FALSE', "2006-01-01", "2024-07-15 10:30:00"),
        (102, 'Kiran', 'alto', '2012345', 28.6, "3300.8067", 'TRUE', "1980-01-10", "2025-04-01 08:15:00"),
        (103, 'Adarsh', 'swift', '3045893', 32.4, "5000.5034", 'FALSE', "1985-11-19", "2024-06-11 18:25:55"),
        (104, 'Kamal', 'city', '3512678', 43.8, "4550.5034", 'TRUE', "2025-05-28", "2023-08-17 22:55:35"),
        (105, 'Prakash', 'dzire', '2267934', 62.5, "6780.5034", 'FALSE', "2025-09-16", "2025-04-01 20:45:22")]

columns = ['SNo', 'name', 'product', 'amount', 'avg', 'Salary', 'is_discounted', "jobStartDate", "timestamp_str"]

df_initial = spark.createDataFrame(data, columns)
display(df_initial)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15 10:30:00
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01 08:15:00
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11 18:25:55
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17 22:55:35
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01 20:45:22


##### Method 1
- Using **col().cast()**

In [0]:
# Attribute-style column access with a string type name
df_initial.withColumn("Salary", df_initial.Salary.cast('double')).printSchema()
# Equivalent alternatives:
# df_initial.withColumn("Salary", df_initial.Salary.cast(DoubleType())).printSchema()
# df_initial.withColumn("Salary", col("Salary").cast('double')).printSchema()

root
 |-- SNo: long (nullable = true)
 |-- name: string (nullable = true)
 |-- product: string (nullable = true)
 |-- amount: string (nullable = true)
 |-- avg: double (nullable = true)
 |-- Salary: double (nullable = true)
 |-- is_discounted: string (nullable = true)
 |-- jobStartDate: string (nullable = true)
 |-- timestamp_str: string (nullable = true)



In [0]:
df_cast_type01 = df_initial\
  .select(
    col('SNo').cast('integer'),
    col('name'),
    col('product'),
    col('amount').cast('long'),
    col('avg'),
    col('Salary').cast('double'),
    col('is_discounted').cast('boolean'),
    col("jobStartDate").cast('date'),
    col("timestamp_str").cast('timestamp'),
    col("timestamp_str").cast('date').alias('ts_date_format')
  )
display(df_cast_type01)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str,ts_date_format
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z,2024-07-15
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z,2025-04-01
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z,2024-06-11
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z,2023-08-17
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z,2025-04-01


In [0]:
from pyspark.sql.types import IntegerType, StringType, DoubleType, BooleanType, DateType, TimestampType, LongType

df_cast_type02 = df_initial\
  .select(
    col('SNo').cast(IntegerType()),
    col('name'),
    col('product'),
    col('amount').cast(LongType()),
    col('avg'),
    col('Salary').cast(DoubleType()),
    col('is_discounted').cast(BooleanType()),
    col("jobStartDate").cast(DateType()),
    col("timestamp_str").cast(TimestampType()),
    col("timestamp_str").cast(DateType()).alias('ts_date_format')
  )
display(df_cast_type02)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str,ts_date_format
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z,2024-07-15
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z,2025-04-01
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z,2024-06-11
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z,2023-08-17
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z,2025-04-01


- I prefer the **first option** (string type names) of these **two**, since I **don’t** have to **import the types**, but either works.

##### Method 2
- Using **.withColumn()**

In [0]:
df_cast_with_type01 = df_initial \
  .withColumn('SNo', col('SNo').cast(IntegerType())) \
  .withColumn('amount', col('amount').cast(LongType())) \
  .withColumn('Salary', col('Salary').cast(DoubleType())) \
  .withColumn('is_discounted', col('is_discounted').cast(BooleanType())) \
  .withColumn('jobStartDate', col('jobStartDate').cast(DateType())) \
  .withColumn('timestamp_str', col('timestamp_str').cast(TimestampType())) \
  .withColumn('ts_date_format', col("timestamp_str").cast(DateType()))

display(df_cast_with_type01)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str,ts_date_format
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z,2024-07-15
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z,2025-04-01
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z,2024-06-11
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z,2023-08-17
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z,2025-04-01


##### Method 3
- Using a Python **dictionary and .withColumn()**

In [0]:
data_types = {
    'SNo': 'integer',
    'amount': 'long',
    'Salary': 'double',
    'is_discounted': 'boolean',
    'jobStartDate': 'date',
    'timestamp_str': 'timestamp'
}

df_dict = df_initial
for column_name, data_type in data_types.items():
    df_dict = df_dict.withColumn(column_name, col(column_name).cast(data_type))

display(df_dict)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z


In [0]:
data_types = {
    'SNo': IntegerType(),
    'amount': LongType(),
    'Salary': DoubleType(),
    'is_discounted': BooleanType(),
    'jobStartDate': DateType(),
    'timestamp_str': TimestampType()
}

# Start again from df_initial so the casts apply to the original string columns
df_dict = df_initial
for column_name, data_type in data_types.items():
    df_dict = df_dict.withColumn(column_name, col(column_name).cast(data_type))

display(df_dict)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z


##### Method 4
- Using **col().cast()** with a Python **dictionary** and Python **list comprehension**

In [0]:
# This returns a list of tuples
df_initial.dtypes

[('SNo', 'bigint'),
 ('name', 'string'),
 ('product', 'string'),
 ('amount', 'string'),
 ('avg', 'double'),
 ('Salary', 'string'),
 ('is_discounted', 'string'),
 ('jobStartDate', 'string'),
 ('timestamp_str', 'string')]

- This returns a **list of tuples**.
- **Each tuple** has:
      
      (column_name, current_data_type)

In [0]:
# Avoid `col` as the loop variable: it would hide the imported col() function inside the comprehension
[column_schema for column_schema in df_initial.dtypes]

[('SNo', 'bigint'),
 ('name', 'string'),
 ('product', 'string'),
 ('amount', 'string'),
 ('avg', 'double'),
 ('Salary', 'string'),
 ('is_discounted', 'string'),
 ('jobStartDate', 'string'),
 ('timestamp_str', 'string')]

- Each **column_schema** looks like:

      ('SNo', 'bigint')

      column_schema[0] → column name
      column_schema[1] → data type
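The same tuples can also be read with tuple unpacking instead of index positions. The sketch below uses a hard-coded copy of the `df_initial.dtypes` output shown above, so it runs without a Spark session; the variable names are chosen for this example.

```python
# Hard-coded copy of the df_initial.dtypes output, so no Spark session is needed.
dtypes = [
    ("SNo", "bigint"), ("name", "string"), ("product", "string"),
    ("amount", "string"), ("avg", "double"), ("Salary", "string"),
    ("is_discounted", "string"), ("jobStartDate", "string"),
    ("timestamp_str", "string"),
]

# Index-based access, as used in the notebook cells.
names_by_index = [column_schema[0] for column_schema in dtypes]

# Tuple unpacking gives each part a readable name.
names_by_unpacking = [name for name, _ in dtypes]

# Unpacking also makes filters easy to read, e.g. all string columns.
string_columns = [name for name, dtype in dtypes if dtype == "string"]
print(string_columns)
```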

In [0]:
[col(column_schema[0]) for column_schema in df_initial.dtypes]

[Column<'SNo'>,
 Column<'name'>,
 Column<'product'>,
 Column<'amount'>,
 Column<'avg'>,
 Column<'Salary'>,
 Column<'is_discounted'>,
 Column<'jobStartDate'>,
 Column<'timestamp_str'>]

In [0]:
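# Illustration only: column_schema[1] is the type name, so these Column objects refer to columns that do not exist
[col(column_schema[1]) for column_schema in df_initial.dtypes]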

[Column<'bigint'>,
 Column<'string'>,
 Column<'string'>,
 Column<'string'>,
 Column<'double'>,
 Column<'string'>,
 Column<'string'>,
 Column<'string'>,
 Column<'string'>]

In [0]:
data_type_map = {
    'SNo': 'integer',
    'amount': 'long',
    'Salary': 'double',
    'is_discounted': 'boolean',
    'jobStartDate': 'date',
    'timestamp_str': 'timestamp'
  }

In [0]:
[col(column_schema[0]).cast(data_type_map.get(column_schema[0], column_schema[1])) for column_schema in df_initial.dtypes]

[Column<'CAST(SNo AS INTEGER)'>,
 Column<'CAST(name AS STRING)'>,
 Column<'CAST(product AS STRING)'>,
 Column<'CAST(amount AS LONG)'>,
 Column<'CAST(avg AS DOUBLE)'>,
 Column<'CAST(Salary AS DOUBLE)'>,
 Column<'CAST(is_discounted AS BOOLEAN)'>,
 Column<'CAST(jobStartDate AS DATE)'>,
 Column<'CAST(timestamp_str AS TIMESTAMP)'>]

     data_type_map.get(column_schema[0], column_schema[1])

- If the column name **exists** in **data_type_map**, use the type from the map. Otherwise, fall back to the column's **original type**, so the cast changes nothing.

|     Column        | Found in map? |    Before    |     After     |
| ----------------- | ------------- | -------------| --------------|
|   SNo             |      ✅       | long         |   integer     |
|   name            |      ❌       | string       |   string      |
|   product         |      ❌       | string       |   string      |
|   amount          |      ✅       | string       |   long        |
|   avg             |      ❌       | double       |   double      |
|   Salary          |      ✅       | string       |   double      |
|   is_discounted   |      ✅       | string       |  boolean      |
|   jobStartDate    |      ✅       | string       |  date         |
|   timestamp_str   |      ✅       | string       |  timestamp    |

     col(column_schema[0]).cast(target_type)

- **col(column_schema[0])** selects the column by name.
- **.cast(target_type)** changes its data type.

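**What does the list comprehension produce?**

One cast expression per column. Columns that are missing from **data_type_map** are cast to their current type, which leaves them unchanged:

      [
        col("SNo").cast("integer"),
        col("name").cast("string"),
        col("product").cast("string"),
        col("amount").cast("long"),
        col("avg").cast("double"),
        col("Salary").cast("double"),
        col("is_discounted").cast("boolean"),
        col("jobStartDate").cast("date"),
        col("timestamp_str").cast("timestamp")
      ]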

In [0]:
data_type_map = {
    'SNo': 'integer',
    'amount': 'long',
    'Salary': 'double',
    'is_discounted': 'boolean',
    'jobStartDate': 'date',
    'timestamp_str': 'timestamp'
  }

df_dict_list_01 = df_initial.select([
    col(column_schema[0]).cast(data_type_map.get(column_schema[0], column_schema[1]))
    for column_schema in df_initial.dtypes
])

display(df_dict_list_01)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z


In [0]:
data_type_map = {
    'SNo': IntegerType(),
    'amount': LongType(),
    'Salary': DoubleType(),
    'is_discounted': BooleanType(),
    'jobStartDate': DateType(),
    'timestamp_str': TimestampType()
  }

df_dict_list_02 = df_initial.select([
    col(column_schema[0]).cast(data_type_map.get(column_schema[0], column_schema[1]))
    for column_schema in df_initial.dtypes
])

display(df_dict_list_02)

SNo,name,product,amount,avg,Salary,is_discounted,jobStartDate,timestamp_str
101,Hitesh,baleno,2507623,25.1,3000.6089,False,2006-01-01,2024-07-15T10:30:00.000Z
102,Kiran,alto,2012345,28.6,3300.8067,True,1980-01-10,2025-04-01T08:15:00.000Z
103,Adarsh,swift,3045893,32.4,5000.5034,False,1985-11-19,2024-06-11T18:25:55.000Z
104,Kamal,city,3512678,43.8,4550.5034,True,2025-05-28,2023-08-17T22:55:35.000Z
105,Prakash,dzire,2267934,62.5,6780.5034,False,2025-09-16,2025-04-01T20:45:22.000Z
