#### How to convert a timestamp to a date?

|             source                |          bronze                  |     silver      |     gold       |
|-----------------------------------|----------------------------------|-----------------|----------------|
|  2025-08-25T00:00:00.000+00:00    |   2025-08-25T00:00:00.000+00:00  |    2025-08-25   |   2025-08-25   |

In [0]:
from pyspark.sql.functions import to_date, col

**Ex 01**

In [0]:
df_ts = spark.read.csv("/Volumes/@azureadb/pyspark/timestamp/timestamptodate.csv", inferSchema=True, header=True)
display(df_ts.limit(10))

start_date,description,product_category,product_group,cloud_flatform,work_id,product_feedback,product_type,records,product_id
2025-08-25T00:00:00.000Z,Not Available,AWS,saint-gobain,azure / aws / gcc,9876543,first_visit,Not Available,1,409516064
2025-08-25T00:00:00.000Z,Azure Web Analytics,AWS,saint-gobain,azure / aws / gcc,9876544,purchase,Not Available,1,409516064
2025-08-25T00:00:00.000Z,GCC,AWS,saint-gobain,azure / aws / gcc,9876545,search,Not Available,1,409516064
2025-08-25T00:00:00.000Z,community edition,AWS,saint-gobain,azure / aws / gcc,9876546,search,Not Available,1,409516064
2025-08-25T00:00:00.000Z,data center,AWS,saint-gobain,azure / aws / gcc,9876547,search,Not Available,1,409516064
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876548,add_to_cart,Not Available,1,409516064
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876549,add_to_cart,Not Available,1,409516064
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876550,add_to_cart,Not Available,1,409516064
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876551,add_to_cart,Not Available,1,409516064
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876552,add_to_cart,Not Available,1,409516064


In [0]:
# Add a new date-typed column, date_parsed, derived from start_date
df_with_date = df_ts.withColumn("date_parsed", to_date(col("start_date"), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
display(df_with_date.limit(15))

start_date,description,product_category,product_group,cloud_flatform,work_id,product_feedback,product_type,records,product_id,date_parsed
2025-08-25T00:00:00.000Z,Not Available,AWS,saint-gobain,azure / aws / gcc,9876543,first_visit,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,Azure Web Analytics,AWS,saint-gobain,azure / aws / gcc,9876544,purchase,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,GCC,AWS,saint-gobain,azure / aws / gcc,9876545,search,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,community edition,AWS,saint-gobain,azure / aws / gcc,9876546,search,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,data center,AWS,saint-gobain,azure / aws / gcc,9876547,search,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876548,add_to_cart,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876549,add_to_cart,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876550,add_to_cart,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876551,add_to_cart,Not Available,1,409516064,2025-08-25
2025-08-25T00:00:00.000Z,work in progress,AWS,saint-gobain,azure / aws / gcc,9876552,add_to_cart,Not Available,1,409516064,2025-08-25


In [0]:
# Overwrite start_date in place with its date-typed value
df_silver = df_ts.withColumn("start_date", to_date(col("start_date"), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
display(df_silver.limit(10))

start_date,description,product_category,product_group,cloud_flatform,work_id,product_feedback,product_type,records,product_id
2025-08-25,Not Available,AWS,saint-gobain,azure / aws / gcc,9876543,first_visit,Not Available,1,409516064
2025-08-25,Azure Web Analytics,AWS,saint-gobain,azure / aws / gcc,9876544,purchase,Not Available,1,409516064
2025-08-25,GCC,AWS,saint-gobain,azure / aws / gcc,9876545,search,Not Available,1,409516064
2025-08-25,community edition,AWS,saint-gobain,azure / aws / gcc,9876546,search,Not Available,1,409516064
2025-08-25,data center,AWS,saint-gobain,azure / aws / gcc,9876547,search,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876548,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876549,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876550,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876551,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876552,add_to_cart,Not Available,1,409516064


✅ Format explanation:

|      Format      |     Description   |
|------------------|-------------------|
|  **yyyy-MM-dd**  | year, month, day of month  |
|  **'T'**         | literal T in the string |
|  **HH:mm:ss**    | hour (00-23), minute, second  |
|  **.SSS**        | fraction of a second, three digits (milliseconds)  |
|  **XXX**         | zone offset such as +00:00, or Z for a zero offset |

#### to_date()

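✅ **to_date()** converts a **"date string" (or) "timestamp string" column** into a **"Date" Type column**, optionally using a **specified format**.

✅ If the **format is not provided**, to_date() follows Spark's **casting rules for the Date type** (equivalent to `col.cast("date")`), which accept strings that start with **yyyy-MM-dd**.

✅ Extracts only the **date** portion **(removes time part if present)**.

✅ Returns **NULL** if the string does **not match** the format. This applies when ANSI mode is off; with `spark.sql.ansi.enabled` set to `true`, invalid input raises an error instead.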

- **to_date():** extracts **only the date** part (ignores time).
- **to_timestamp():** parses **both date and time**.

#### Syntax:

     to_date(col, format=None)

**Parameters:**

- **col** → Column name or expression containing the **date string/timestamp**.

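- **format (optional)** → A string specifying the format of the input date, written with Spark's datetime patterns (since Spark 3.0 these are based on Java's `DateTimeFormatter` rather than `SimpleDateFormat`). If **not provided**, the column is converted using the casting rules for the Date type, so strings that start with **yyyy-MM-dd** are parsed.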

In [0]:
%run ./config

In [0]:
convert = convert_timestamp_date(df_ts, "start_date", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
display(convert.limit(10))

start_date,description,product_category,product_group,cloud_flatform,work_id,product_feedback,product_type,records,product_id
2025-08-25,Not Available,AWS,saint-gobain,azure / aws / gcc,9876543,first_visit,Not Available,1,409516064
2025-08-25,Azure Web Analytics,AWS,saint-gobain,azure / aws / gcc,9876544,purchase,Not Available,1,409516064
2025-08-25,GCC,AWS,saint-gobain,azure / aws / gcc,9876545,search,Not Available,1,409516064
2025-08-25,community edition,AWS,saint-gobain,azure / aws / gcc,9876546,search,Not Available,1,409516064
2025-08-25,data center,AWS,saint-gobain,azure / aws / gcc,9876547,search,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876548,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876549,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876550,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876551,add_to_cart,Not Available,1,409516064
2025-08-25,work in progress,AWS,saint-gobain,azure / aws / gcc,9876552,add_to_cart,Not Available,1,409516064


**How to cast multiple columns of a DataFrame to given types?**

In [0]:
import pyspark.sql.functions as F

# Sample data
data = [
    ("Albert", "25", "55000.50", "true", "2025-08-25", "2025-08-25T15:30:00.000+00:00"),
    ("Baskar", "30", "72000.00", "false", "2024-12-31", "2024-12-31T08:45:15.000+00:00"),
    ("Chetan", "27", "63000.75", "true", "2023-01-15", "2023-01-15T20:00:00.000+00:00"),
    ("Dravid", "26", "66000.50", "true", "2024-06-20", "2024-05-22T15:40:55.000+00:00"),
    ("Nishant", "32", "98700.00", "false", "2023-10-21", "2020-10-31T06:45:45.000+00:00"),
    ("David", "29", "34512.75", "true", "2023-03-15", "2021-08-28T20:15:55.000+00:00"),
    ("Mohan", "33", "34908.50", "true", "2022-04-15", "2019-09-29T19:35:55.000+00:00"),
    ("Niroop", "35", "49654.98", "false", "2021-02-18", "2024-05-22T08:45:25.000+00:00"),
    ("Pushpa", "23", "44111.99", "true", "2020-07-19", "2023-09-25T20:00:00.000+00:00")
]

# Column names only; every value starts out as a string
columns = ["name", "age", "salary", "is_active", "join_date", "last_login"]

df = spark.createDataFrame(data, columns)
display(df)

name,age,salary,is_active,join_date,last_login
Albert,25,55000.5,True,2025-08-25,2025-08-25T15:30:00.000+00:00
Baskar,30,72000.0,False,2024-12-31,2024-12-31T08:45:15.000+00:00
Chetan,27,63000.75,True,2023-01-15,2023-01-15T20:00:00.000+00:00
Dravid,26,66000.5,True,2024-06-20,2024-05-22T15:40:55.000+00:00
Nishant,32,98700.0,False,2023-10-21,2020-10-31T06:45:45.000+00:00
David,29,34512.75,True,2023-03-15,2021-08-28T20:15:55.000+00:00
Mohan,33,34908.5,True,2022-04-15,2019-09-29T19:35:55.000+00:00
Niroop,35,49654.98,False,2021-02-18,2024-05-22T08:45:25.000+00:00
Pushpa,23,44111.99,True,2020-07-19,2023-09-25T20:00:00.000+00:00


In [0]:
col_type_map = {
    "age": "int",
    "salary": "double",
    "is_active": "boolean",
    "join_date": "date",
    "last_login": "timestamp"
}

In [0]:
# Apply casting with correct timestamp format
df_casted = cast_dataframe_columns(df, col_type_map, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
display(df_casted)

name,age,salary,is_active,join_date,last_login
Albert,25,55000.5,True,2025-08-25,2025-08-25
Baskar,30,72000.0,False,2024-12-31,2024-12-31
Chetan,27,63000.75,True,2023-01-15,2023-01-15
Dravid,26,66000.5,True,2024-06-20,2024-05-22
Nishant,32,98700.0,False,2023-10-21,2020-10-31
David,29,34512.75,True,2023-03-15,2021-08-28
Mohan,33,34908.5,True,2022-04-15,2019-09-29
Niroop,35,49654.98,False,2021-02-18,2024-05-22
Pushpa,23,44111.99,True,2020-07-19,2023-09-25
