##### Why does the function take the arguments format=None and is_timestamp=False?

In [0]:
import pyspark.sql.functions as F

In [0]:
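def data_fun_ts(df, column, format=None, is_timestamp=False):
    """Cast a string column to a timestamp (is_timestamp=True) or a date (default)."""
    if is_timestamp:
        if format:
            # Parse as a timestamp using the caller's pattern
            df = df.withColumn(column, F.to_timestamp(F.col(column), format))
        else:
            # No pattern given: rely on Spark's default timestamp parsing
            df = df.withColumn(column, F.to_timestamp(F.col(column)))
    else:
        if format:
            # Parse as a date using the caller's pattern
            df = df.withColumn(column, F.to_date(F.col(column), format))
        else:
            # No pattern given: rely on Spark's default date parsing
            df = df.withColumn(column, F.to_date(F.col(column)))
    return df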

- **format=None => optional**
  - A **date/timestamp** format string (e.g., **"dd-MM-yyyy"**).
  - If **None**, no pattern is passed and Spark falls back to its **default parsing** (the same rules it uses when casting a string to a date or timestamp).

  - Leave it as **None** when your **data** is already in Spark’s **default expected format**:
    - For **dates** → **yyyy-MM-dd**
    - For **timestamps** → **yyyy-MM-dd HH:mm:ss**

- **is_timestamp (boolean):**
  - **False** → treat the column as a **date**
  - **True** → treat the column as a **timestamp**

**If is_timestamp=True**

      if is_timestamp:
          if format:
              df = df.withColumn(column, F.to_timestamp(F.col(column), format))
          else:
              df = df.withColumn(column, F.to_timestamp(F.col(column)))

- Uses **F.to_timestamp()** to convert the column into a **timestamp type**.
- If **format** is provided → Spark parses the strings using that **format**.
- If **format=None** → Spark expects the **default timestamp format**:
  - yyyy-MM-dd HH:mm:ss


**If is_timestamp=False (default case)**

      else:
          if format:
              df = df.withColumn(column, F.to_date(F.col(column), format))
          else:
              df = df.withColumn(column, F.to_date(F.col(column)))

- Uses **F.to_date()** to convert the column into a **date type**.
- If **format** is provided → Spark parses the strings using that **format**.
- If **format=None** → Spark expects the **default date format**:
  - yyyy-MM-dd
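
The two branches above boil down to two independent choices: which Spark function to call, and whether to pass a pattern. The sketch below only builds a text description of the call that would be made; `planned_spark_call` is a hypothetical helper introduced for illustration, and it does not touch Spark at all.

```python
def planned_spark_call(column, format=None, is_timestamp=False):
    """Describe the Spark call data_fun_ts would make (illustrative helper, not part of the notebook)."""
    # Choice 1: timestamp parser or date parser
    func = "to_timestamp" if is_timestamp else "to_date"
    # Choice 2: pass the pattern only when one was supplied
    if format:
        return f'F.{func}(F.col("{column}"), "{format}")'
    return f'F.{func}(F.col("{column}"))'


# Walk through all four combinations of the two optional arguments
for is_ts in (True, False):
    for fmt in ("dd-MM-yyyy", None):
        print(is_ts, fmt, "->", planned_spark_call("my_ts", fmt, is_ts))
```

Reading the four printed lines side by side makes it easy to see that **format** and **is_timestamp** do not interact: each one controls exactly one part of the call.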

**Case A**
- **format=None**
- **default timestamp format**

     df_casted = data_fun_ts(df, "my_ts", is_timestamp=True)
                          (or)
     df_casted = data_fun_ts(df, "my_ts", None, is_timestamp=True)
                          (or)
     df_casted = data_fun_ts(df, "my_ts", format=None, is_timestamp=True)
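
The three call forms above are interchangeable because of how Python binds default arguments. A minimal sketch, assuming a signature-only stand-in named `data_fun_ts_stub` (hypothetical, it does no Spark work) and a placeholder object in place of a real DataFrame:

```python
import inspect


def data_fun_ts_stub(df, column, format=None, is_timestamp=False):
    """Signature-only stand-in for data_fun_ts (hypothetical; performs no conversion)."""


def bound_args(*args, **kwargs):
    """Return the final argument values Python would bind for a call."""
    bound = inspect.signature(data_fun_ts_stub).bind(*args, **kwargs)
    bound.apply_defaults()  # fill in format=None / is_timestamp=False where omitted
    return dict(bound.arguments)


df = object()  # placeholder for a DataFrame

a = bound_args(df, "my_ts", is_timestamp=True)               # format omitted
b = bound_args(df, "my_ts", None, is_timestamp=True)         # format passed positionally
c = bound_args(df, "my_ts", format=None, is_timestamp=True)  # format passed by keyword

print(a == b == c)
print(a["format"], a["is_timestamp"])
```

All three calls end up with the same bound values, so which form to use is purely a readability choice.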

In [0]:
data = [("2025-09-24 12:30:45",),
        ("2025-01-01 11:43:55",),
        ("2025-08-20 14:35:35",),
        ("2025-05-08 09:12:45",),
        ("2025-07-18 06:18:25",)]

df_ts_nn = spark.createDataFrame(data, ["my_ts"])

df_ts_nn_cast = data_fun_ts(df_ts_nn, "my_ts", None, is_timestamp=True)
display(df_ts_nn_cast)

my_ts
2025-09-24T12:30:45.000Z
2025-01-01T11:43:55.000Z
2025-08-20T14:35:35.000Z
2025-05-08T09:12:45.000Z
2025-07-18T06:18:25.000Z


- Works without a format because the strings already match Spark’s default timestamp layout, **yyyy-MM-dd HH:mm:ss**.

**Case B**
- **Custom timestamp format** (ISO **T** separator, milliseconds, time zone offset)

In [0]:
data = [("2025-09-24T12:30:45.123+05:30",),
        ("2025-10-14T15:35:55.163+05:30",),
        ("2025-11-11T16:39:55.184+05:30",),
        ("2025-05-04T23:55:25.193+05:30",),
        ("2024-02-16T21:26:38.143+05:30",)]

df_mls = spark.createDataFrame(data, ["my_ts"])

df_ts_cust_frmt_cast = data_fun_ts(df_mls, "my_ts", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX", is_timestamp=True)
display(df_ts_cust_frmt_cast)

my_ts
2025-09-24T07:00:45.123Z
2025-10-14T10:05:55.163Z
2025-11-11T11:09:55.184Z
2025-05-04T18:25:25.193Z
2024-02-16T15:56:38.143Z


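Spark applies the **+05:30** offset while parsing and stores the resulting instant; the output above is shown in **UTC**.

      2025-09-24T12:30:45.123+05:30 -> local time is 5 hours 30 minutes ahead of UTC.

      Local time = 12:30:45.123
      Time zone  = +05:30 (India Standard Time)

      # convert to UTC by subtracting the offset
      12:30:45.123 - 05:30 = 07:00:45.123 (+00:00)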

| Input (IST)                   | Stored in Spark (UTC)         |
| ----------------------------- | ----------------------------- |
| 2025-09-24T12:30:45.123+05:30 | 2025-09-24T07:00:45.123+00:00 |
| 2025-10-14T15:35:55.163+05:30 | 2025-10-14T10:05:55.163+00:00 |
| 2025-11-11T16:39:55.184+05:30 | 2025-11-11T11:09:55.184+00:00 |
| 2025-05-04T23:55:25.193+05:30 | 2025-05-04T18:25:25.193+00:00 |
| 2024-02-16T21:26:38.143+05:30 | 2024-02-16T15:56:38.143+00:00 |
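
The offset arithmetic in the table can be double-checked outside Spark. The sketch below uses only Python's standard `datetime` module, so it verifies the arithmetic rather than Spark's parser; `to_utc_iso` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc_iso(ts_with_offset):
    """Parse an ISO 8601 string that carries an offset and re-express it in UTC."""
    local = datetime.fromisoformat(ts_with_offset)  # offset-aware datetime
    return local.astimezone(timezone.utc).isoformat(timespec="milliseconds")


inputs = [
    "2025-09-24T12:30:45.123+05:30",
    "2025-10-14T15:35:55.163+05:30",
    "2025-11-11T16:39:55.184+05:30",
    "2025-05-04T23:55:25.193+05:30",
    "2024-02-16T21:26:38.143+05:30",
]

for value in inputs:
    print(value, "->", to_utc_iso(value))
```

Each printed pair should match the corresponding row of the table.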

In [0]:
data = [
    ("2025-09-24T12:30:45",),
    ("2025-10-14T15:35:55",),
    ("2025-11-11T16:39:55",),
    ("2025-05-04T23:55:25",),
    ("2024-02-16T21:26:38",)
]

df_ts_cu = spark.createDataFrame(data, ["my_ts"])

# ISO 8601 timestamp format; the literal T is quoted in the pattern
df_ts_cu_frmt_cast = data_fun_ts(df_ts_cu, "my_ts", format="yyyy-MM-dd'T'HH:mm:ss", is_timestamp=True)

display(df_ts_cu_frmt_cast)

my_ts
2025-09-24T12:30:45.000Z
2025-10-14T15:35:55.000Z
2025-11-11T16:39:55.000Z
2025-05-04T23:55:25.000Z
2024-02-16T21:26:38.000Z


In [0]:
data = [
    ("2025-09-24 12:30:45",),
    ("2025-10-14 15:35:55",),
    ("2025-11-11 16:39:55",),
    ("2025-05-04 23:55:25",),
    ("2024-02-16 21:26:38",)
]

df_ts_cu_t = spark.createDataFrame(data, ["my_ts"])

# Space-separated timestamp format (same layout as Spark's default)
df_ts_cu_t_frmt_cast = data_fun_ts(df_ts_cu_t, "my_ts", format="yyyy-MM-dd HH:mm:ss", is_timestamp=True)

display(df_ts_cu_t_frmt_cast)

my_ts
2025-09-24T12:30:45.000Z
2025-10-14T15:35:55.000Z
2025-11-11T16:39:55.000Z
2025-05-04T23:55:25.000Z
2024-02-16T21:26:38.000Z


**Case C**
- **format="yyyy-MM-dd"**
- **is_timestamp=False**, then **is_timestamp=True**

In [0]:
data = [("2025-09-24",),
        ("2025-01-01",),
        ("2025-07-20",),
        ("2025-06-11",),
        ("2025-04-28",)]

df_frmt_dt = spark.createDataFrame(data, ["my_ts"])

df_frmt_ts_casted = data_fun_ts(df_frmt_dt, "my_ts", "yyyy-MM-dd", is_timestamp=False)
display(df_frmt_ts_casted)

my_ts
2025-09-24
2025-01-01
2025-07-20
2025-06-11
2025-04-28


In [0]:
data = [("2025-09-24",),
        ("2025-01-01",),
        ("2025-07-20",),
        ("2025-06-11",),
        ("2025-04-28",)]

df_frmt_dt = spark.createDataFrame(data, ["my_ts"])

df_frmt_dt_casted = data_fun_ts(df_frmt_dt, "my_ts", "yyyy-MM-dd", is_timestamp=True)
display(df_frmt_dt_casted)

my_ts
2025-09-24T00:00:00.000Z
2025-01-01T00:00:00.000Z
2025-07-20T00:00:00.000Z
2025-06-11T00:00:00.000Z
2025-04-28T00:00:00.000Z


- With **is_timestamp=True**, Spark fills in a **default 00:00:00** time because the input holds **only the date**.

**Case D**
- **format=None** with **only a date** (missing time)

In [0]:
data = [("2025-09-24",),
        ("2025-01-01",),
        ("2025-07-20",),
        ("2025-06-11",),
        ("2025-04-28",)]

df_ts_dt = spark.createDataFrame(data, ["my_ts"])

df_ts_dt_f_cast = data_fun_ts(df_ts_dt, "my_ts", None, is_timestamp=False)
display(df_ts_dt_f_cast)

my_ts
2025-09-24
2025-01-01
2025-07-20
2025-06-11
2025-04-28


In [0]:
data = [("2025-09-24",),
        ("2025-01-01",),
        ("2025-07-20",),
        ("2025-06-11",),
        ("2025-04-28",)]

df_ts_dt = spark.createDataFrame(data, ["my_ts"])

df_ts_dt_cast = data_fun_ts(df_ts_dt, "my_ts", None, is_timestamp=True)
display(df_ts_dt_cast)

my_ts
2025-09-24T00:00:00.000Z
2025-01-01T00:00:00.000Z
2025-07-20T00:00:00.000Z
2025-06-11T00:00:00.000Z
2025-04-28T00:00:00.000Z


- If your **date** is in the **default format yyyy-MM-dd**, you can call **F.to_timestamp()** directly, without a format:

      df_ts = df.withColumn("my_timestamp", F.to_timestamp(F.col("my_date")))
      df_ts.show(truncate=False)

- Spark assumes **midnight (00:00:00)** because a date carries no time information.

      +----------+-------------------+
      |my_date   |my_timestamp       |
      +----------+-------------------+
      |2025-11-07|2025-11-07 00:00:00|
      |2025-12-25|2025-12-25 00:00:00|
      |2024-02-29|2024-02-29 00:00:00|
      +----------+-------------------+
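
The midnight default can be mimicked with Python's standard library. This is only an analogy for the behaviour shown in the table, not Spark's own parser, and `date_to_midnight` is a hypothetical helper name.

```python
from datetime import date, datetime, time


def date_to_midnight(date_string):
    """Parse a yyyy-MM-dd string and attach 00:00:00 as the time part."""
    parsed = date.fromisoformat(date_string)
    return datetime.combine(parsed, time.min)  # time.min is 00:00:00


for value in ["2025-11-07", "2025-12-25", "2024-02-29"]:
    print(value, "->", date_to_midnight(value).strftime("%Y-%m-%d %H:%M:%S"))
```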

**Summary for timestamps**
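- **format=None** → Works when the column already matches Spark’s default layout: **yyyy-MM-dd HH:mm:ss** (Case A), or date-only **yyyy-MM-dd**, where the time defaults to **midnight** (Case D).
- **format="yyyy-MM-dd"** → Parses date-only strings; the time defaults to **midnight** (Case C).
- **Custom formats** (e.g., ISO with T, milliseconds, timezone) → pass an **explicit format="..."** so parsing does not depend on Spark’s defaults (Case B).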