##### Why pass an optional argument (format=None) to a function?

In [0]:
import pyspark.sql.functions as F

In [0]:
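def data_fun(df, column, format=None):
    """Convert a string column to DateType, optionally with an explicit pattern."""
    if format:
        # Custom pattern supplied, e.g. "dd-MM-yyyy"
        df = df.withColumn(column, F.to_date(F.col(column), format))
    else:
        # No pattern: Spark falls back to its default parsing (yyyy-MM-dd)
        df = df.withColumn(column, F.to_date(F.col(column)))
    return df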
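To see the optional-argument pattern on its own, without a Spark session, here is a minimal plain-Python analogue. The name `parse_date` and the constant `ISO_DATE` are hypothetical, and `datetime.strptime` uses `%`-style directives rather than Spark's `yyyy-MM-dd` patterns, so treat it as a sketch of the idea, not of Spark's parser.

```python
from datetime import date, datetime
from typing import Optional

# Stand-in for Spark's default yyyy-MM-dd (assumption for this sketch).
ISO_DATE = "%Y-%m-%d"


def parse_date(text: str, fmt: Optional[str] = None) -> date:
    """Parse text with fmt if given, otherwise with the ISO default."""
    # fmt=None means the caller did not choose a format, so use the default.
    return datetime.strptime(text, fmt or ISO_DATE).date()


print(parse_date("2025-09-24"))              # default format is enough
print(parse_date("24-09-2025", "%d-%m-%Y"))  # explicit format required
```

Calling `parse_date("24-09-2025")` without a format raises `ValueError`, which mirrors what happens when `data_fun` receives non-default data with `format=None`.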

**1) When format=None**

- **format=None => optional**
  - If **None**, Spark will use its **default formats**.
- Use this when your **data** is already in Spark’s **default expected format**:
  - For **dates** → **yyyy-MM-dd**
  - For **timestamps** → **yyyy-MM-dd HH:mm:ss**
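Before relying on `format=None`, it can help to check a few sample values against the two default shapes listed above. The following is a rough sketch in plain Python; `looks_default` is a hypothetical helper, and it only checks the shape of the string (digits and separators), not whether the date is valid. Spark's own parser may accept more variants than this check does.

```python
import re

# Assumed shapes, taken from the two defaults listed above.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")
TIMESTAMP_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def looks_default(samples):
    """Return True if every sample has the yyyy-MM-dd or yyyy-MM-dd HH:mm:ss shape."""
    return all(
        DATE_SHAPE.fullmatch(s) or TIMESTAMP_SHAPE.fullmatch(s)
        for s in samples
    )


print(looks_default(["2025-09-24", "2025-01-01 19:47:53"]))  # True
print(looks_default(["24-09-2025"]))                          # False
```

If the check returns `False`, pass an explicit format instead of `None`.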

In [0]:
data = [("2025-09-24",),
        ("2025-01-01",),
        ("2025-03-14",),
        ("2025-04-11",),
        ("2024-03-25",)]
        
df_dt = spark.createDataFrame(data, ["my_date"])

# No format needed (already yyyy-MM-dd)
df_dt_casted = data_fun(df_dt, "my_date", None)
display(df_dt_casted)

my_date
2025-09-24
2025-01-01
2025-03-14
2025-04-11
2024-03-25


- Works fine because Spark knows how to **parse 2025-09-24 without a custom format**.

In [0]:
data = [("2025-09-24 12:30:45",),
        ("2025-01-01 19:47:53",),
        ("2025-09-24 17:30:45",),
        ("2025-04-21 23:55:45",),
        ("2025-05-18 15:35:45",)]

df_ts = spark.createDataFrame(data, ["my_timestamp"])
display(df_ts)

# Timestamps in the default "yyyy-MM-dd HH:mm:ss" shape; to_date keeps only the date part
df_ts_casted = data_fun(df_ts, "my_timestamp", None)
display(df_ts_casted)

my_timestamp
2025-09-24 12:30:45
2025-01-01 19:47:53
2025-09-24 17:30:45
2025-04-21 23:55:45
2025-05-18 15:35:45


my_timestamp
2025-09-24
2025-01-01
2025-09-24
2025-04-21
2025-05-18


**2) When format="..."**

- Use this when your data is **not** in Spark’s **default format**.
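One way to sanity-check a custom pattern against a sample value before running it on a cluster is to translate it to a Python `strptime` format and try it locally. This is only a sketch: the mapping `SPARK_TO_STRPTIME` and the helpers `to_strptime` and `pattern_fits` are hypothetical, they cover just the six tokens used in this notebook, and Python's parser is not identical to Spark's (for example, it is more lenient about leading zeros).

```python
import re
from datetime import datetime

# Hypothetical mapping covering only the tokens used in this notebook.
SPARK_TO_STRPTIME = {
    "yyyy": "%Y", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S",
}
TOKEN = re.compile("|".join(SPARK_TO_STRPTIME))


def to_strptime(spark_pattern: str) -> str:
    """Translate a simple Spark datetime pattern into a strptime format."""
    return TOKEN.sub(lambda m: SPARK_TO_STRPTIME[m.group()], spark_pattern)


def pattern_fits(sample: str, spark_pattern: str) -> bool:
    """Return True if sample parses under the translated pattern."""
    try:
        datetime.strptime(sample, to_strptime(spark_pattern))
        return True
    except ValueError:
        return False


print(to_strptime("dd-MM-yyyy"))                 # %d-%m-%Y
print(pattern_fits("24-09-2025", "dd-MM-yyyy"))  # True
print(pattern_fits("24-09-2025", "yyyy-MM-dd"))  # False
print(pattern_fits("2025/09/24 10:15:30", "yyyy/MM/dd HH:mm:ss"))  # True
```

The token matching is done in a single regex pass so that `MM` (month) and `mm` (minute) cannot be confused with each other.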

In [0]:
data = [("24-09-2025",),
        ("01-01-2025",),
        ("20-08-2025",),
        ("21-11-2025",),
        ("15-04-2025",)]

df_dt_cust = spark.createDataFrame(data, ["my_date"])

# Needs custom format
df_dt_cust_casted = data_fun(df_dt_cust, "my_date", "dd-MM-yyyy")
display(df_dt_cust_casted)

my_date
2025-09-24
2025-01-01
2025-08-20
2025-11-21
2025-04-15


In [0]:
data = [("2025/09/24 10:15:30",),
        ("2025/01/01 23:59:59",),
        ("2024/11/14 15:25:39",),
        ("2023/05/23 20:49:44",),
        ("2022/07/18 23:59:24",)]

df_ts_cust = spark.createDataFrame(data, ["my_timestamp"])

# Needs custom format
df_ts_cust_casted = data_fun(df_ts_cust, "my_timestamp", "yyyy/MM/dd HH:mm:ss")
display(df_ts_cust_casted)

my_timestamp
2025-09-24
2025-01-01
2024-11-14
2023-05-23
2022-07-18


**Summary:**
- **format=None** → when data is already in **standard** Spark format.
  - yyyy-MM-dd
  - yyyy-MM-dd HH:mm:ss
- **format="..."** → when data uses a **custom format**.
  - dd-MM-yyyy
  - MM/dd/yyyy
  - yyyy/MM/dd HH:mm:ss, etc.

**Case A**
- Using **format=None** with **correct default format**

In [0]:
data = [("2025-09-24",),
        ("2025-01-01",),
        ("2025-03-14",),
        ("2025-04-11",),
        ("2024-03-21",)]

df_dt = spark.createDataFrame(data, ["my_date"])

# No format needed (already yyyy-MM-dd)
df_dt_casted = data_fun(df_dt, "my_date", None)
display(df_dt_casted)

my_date
2025-09-24
2025-01-01
2025-03-14
2025-04-11
2024-03-21


- Works fine, because **input** matches Spark’s **default yyyy-MM-dd**

**Case B**
- Using **format=None** with **wrong format**.

In [0]:
data = [("24-09-2025",),
        ("01-01-2025",),
        ("20-08-2025",),
        ("21-11-2025",),
        ("15-04-2025",)]

df_dt_cust = spark.createDataFrame(data, ["my_date"])

# Needs custom format
df_dt_cust_cast = data_fun(df_dt_cust, "my_date", None)
display(df_dt_cust_cast)

---------------------------------------------------------------------------
DateTimeException                         Traceback (most recent call last)
File <command-7297568476460635>, line 11
      9 # Needs custom format
     10 df_dt_cust_cast = data_fun(df_dt_cust, "my_date", None)
---> 11 display(df_dt_cust_cast)

File /databricks/python_shell/lib/dbruntime/display.py:133, in Display.display(self, input, *args, **kwargs)
    131     pass
    132 elif self._cf_helper is not None and isinstance(input, ConnectDataFrame):
--> 133     self.

- Spark expected **yyyy-MM-dd**, but got **dd-MM-yyyy**.
- Since **24-09-2025 cannot be parsed** as yyyy-MM-dd, Spark throws an **error** (the **DateTimeException** above).

**Case C**
- Using **format="dd-MM-yyyy"**

In [0]:
df_cust_casted = data_fun(df_dt_cust, "my_date", "dd-MM-yyyy")
display(df_cust_casted)

my_date
2025-09-24
2025-01-01
2025-08-20
2025-11-21
2025-04-15


**Takeaway:**
- If you use **format=None** with data in **default format**, it works.
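- If you use **format=None** with data in a **non-standard format**, Spark cannot parse it. In Case B it raised a **DateTimeException**; when ANSI mode (**spark.sql.ansi.enabled**) is turned off, the same call returns **NULL** instead of failing.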
- To avoid this, always provide the **right format="..."** when your input doesn’t match the default.