
#### Question
**How do you convert a custom date string (for example "0240312") into a proper date type?**

In [0]:
data = [("0240312", "0231225", "0221120"),
        ("0231225", "0211225", "0251225"),
        ("0980312", "0991225", "0971225"),
        ("0961225", "0951225", "0940921"),
        ("0240312", "0231225", "0970618"),
        ("0850911", "0880713", "0820219"),
        ("0", "0991225", "0221120")
       ]
columns = ["d1", "d2", "d3"]

df_samp = spark.createDataFrame(data, columns)
display(df_samp)

d1,d2,d3
240312,231225,221120
231225,211225,251225
980312,991225,971225
961225,951225,940921
240312,231225,970618
850911,880713,820219
0,991225,221120


#### 1) Using substring and concat (Manual Parsing)

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.functions import col, lit, concat, concat_ws, to_date, lpad

In [0]:
df_substr = df_samp.withColumn("parsed_date", to_date(
    concat_ws("-", 
        concat(lit("20"), col("d1").substr(2, 2)),  # Year
        col("d1").substr(4, 2),                     # Month
        col("d1").substr(6, 2)                      # Day
    ), "yyyy-MM-dd"
))

display(df_substr)

d1,d2,d3,parsed_date
240312,231225,221120,2024-03-12
231225,211225,251225,2023-12-25
980312,991225,971225,2098-03-12
961225,951225,940921,2096-12-25
240312,231225,970618,2024-03-12
850911,880713,820219,2085-09-11
0,991225,221120,


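Positions in `substr(pos, len)` are **1-based**. In the table, **1** and **2** mark the first and second character that each call extracts from **"0240312"**.

| Position     | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------------|---|---|---|---|---|---|---|
| String_date  | 0 | 2 | 4 | 0 | 3 | 1 | 2 |
| substr(2, 2) |   | 1 | 2 |   |   |   |   |
| substr(4, 2) |   |   |   | 1 | 2 |   |   |
| substr(6, 2) |   |   |   |   |   | 1 | 2 |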
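
The slicing in the table can be checked outside Spark with a plain-Python sketch. `spark_substr` is a hypothetical helper written only for this illustration (it is not a PySpark function); it assumes positive, 1-based positions, as used in this notebook.

```python
def spark_substr(value: str, pos: int, length: int) -> str:
    """Mimic Column.substr(pos, length) for positive, 1-based positions."""
    start = pos - 1  # Python slices are 0-based
    return value[start:start + length]


raw = "0240312"
year = "20" + spark_substr(raw, 2, 2)   # characters 2-3
month = spark_substr(raw, 4, 2)         # characters 4-5
day = spark_substr(raw, 6, 2)           # characters 6-7

iso_date = "-".join([year, month, day])
print(iso_date)  # 2024-03-12
```

For the short value "0" every slice comes back empty, so no valid date string can be built, which is consistent with the empty `parsed_date` in the last output row.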

#### 2) Using unix_timestamp

In [0]:
from pyspark.sql.functions import col, concat, lit, unix_timestamp, from_unixtime

df1_ut = df_samp.withColumn("date_str", concat(lit("20"), col("d1").substr(2, 2), col("d1").substr(4, 2), col("d1").substr(6, 2))) \
                .withColumn("parsed_date_ut", unix_timestamp(col("date_str"), "yyyyMMdd")) \
                .withColumn("parsed_date_fu_ut", from_unixtime(unix_timestamp(col("date_str"), "yyyyMMdd")).cast("timestamp"))  # swap in .cast("date") to get a date column instead

display(df1_ut)

d1,d2,d3,date_str,parsed_date_ut,parsed_date_fu_ut
240312,231225,221120,20240312,1710201600,2024-03-12T00:00:00Z
231225,211225,251225,20231225,1703462400,2023-12-25T00:00:00Z
980312,991225,971225,20980312,4045420800,2098-03-12T00:00:00Z
961225,951225,940921,20961225,4007232000,2096-12-25T00:00:00Z
240312,231225,970618,20240312,1710201600,2024-03-12T00:00:00Z
850911,880713,820219,20850911,3651004800,2085-09-11T00:00:00Z
0,991225,221120,20,,


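**from_unixtime:**

- Converts **Unix time** (seconds since 1970-01-01 00:00:00 UTC) into a **human-readable timestamp string**.

| unix_time   |      timestamp        |
|-------------|-----------------------|
| 1648974310  |  2022-04-03 08:25:10  |

- **Returns:** a **string** in the default format **yyyy-MM-dd HH:mm:ss**, rendered in the session time zone (UTC in the example above).

**unix_timestamp:**

- Parses a date string with the given pattern and returns **Unix time in seconds**.

| string_date  |    unix_timestamp  |
|--------------|--------------------|
| 20140228     |   1393545600       |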
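
The round trip between a date string and Unix time can be reproduced with the standard library. This is a minimal sketch that assumes a UTC session time zone; `unix_timestamp_utc` and `from_unixtime_utc` are hypothetical helper names, not Spark functions.

```python
from datetime import datetime, timezone


def unix_timestamp_utc(date_str: str, fmt: str = "%Y%m%d") -> int:
    """Parse a date string and return seconds since the epoch (UTC assumed)."""
    parsed = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_unixtime_utc(seconds: int) -> str:
    """Format epoch seconds as 'yyyy-MM-dd HH:mm:ss' in UTC."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


seconds = unix_timestamp_utc("20140228")
print(seconds)                     # 1393545600
print(from_unixtime_utc(seconds))  # 2014-02-28 00:00:00
```

With a different session time zone the epoch value for the same date string shifts by the zone offset, so the numbers above hold only under the UTC assumption.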

#### 3) Using to_date with format directly
- This method works best when you're sure the column only contains **valid date-like strings**.
- **Non-date values (like "0")** will be returned as **null**.

In [0]:
from pyspark.sql.functions import col, lit, concat, to_date, when

df_todate = df_samp.withColumn("date_str", when(col("d1") != "0", concat(lit("20"), col("d1").substr(2, 2), col("d1").substr(4, 2), col("d1").substr(6, 2)))) \
                   .withColumn("parsed_date", to_date(col("date_str"), "yyyyMMdd"))

display(df_todate)

d1,d2,d3,date_str,parsed_date
240312,231225,221120,20240312,2024-03-12
231225,211225,251225,20231225,2023-12-25
980312,991225,971225,20980312,2098-03-12
961225,951225,940921,20961225,2096-12-25
240312,231225,970618,20240312,2024-03-12
850911,880713,820219,20850911,2085-09-11
0,991225,221120,,


**col("d1") != "0"**

- This checks **if the value** in column **d1** is **not equal to "0"**.
- This condition ensures that the transformation is applied **only to rows** where the **column does not contain "0"**.

**When the condition is True (col("d1") != "0"):**

- **substr(2, 2)**, **substr(4, 2)** and **substr(6, 2)** extract the year, month and day digits, and **concat** prefixes **"20"** to build a **yyyyMMdd** string in **date_str**.
- **to_date(col("date_str"), "yyyyMMdd")** parses that string into a proper date.

**Otherwise (col("d1") == "0"):**

- There is no **otherwise()** branch, so **when()** returns **null** for **date_str**, and **to_date** of null is also **null**.

**No .cast("date") needed:**

- **to_date** already returns a column of **date type**, so no extra cast is required.

**Processing Step-by-Step:**

- For **"0240312"** (Not "0") => Extract **24**, **03**, **12** => Build **"20240312"** => Convert to **2024-03-12**.

- For **"0"** (Matches "0") => **date_str** is null => **parsed_date** is **null**.

**Final Output (parsed_date, abridged):**
     
     parsed_date
     2024-03-12
     2023-12-25
     null

#### 4) Handling Non-Date Values (e.g., "0") Safely

In [0]:
from pyspark.sql.functions import when, col, lit, concat, to_date

df_nondate = df_samp.withColumn(
    "parsed_date",
    when(
        col("d1").rlike("^[0-9]{7}$"),  # Only 7-digit numbers
        to_date(
            concat(lit("20"), col("d1").substr(2, 2), col("d1").substr(4, 2), col("d1").substr(6, 2)),
            "yyyyMMdd"
        )
    ).otherwise(None)
)

display(df_nondate)

d1,d2,d3,parsed_date
240312,231225,221120,2024-03-12
231225,211225,251225,2023-12-25
980312,991225,971225,2098-03-12
961225,951225,940921,2096-12-25
240312,231225,970618,2024-03-12
850911,880713,820219,2085-09-11
0,991225,221120,


**Regex check**
- **^[0-9]{7}$** → ensures the value is **exactly 7 digits** (so **"0" will be skipped**).

**String transformation**
- col("d1").substr(2, 2) → year (last two digits)
- col("d1").substr(4, 2) → month
- col("d1").substr(6, 2) → day
- Then prepend **"20"** to get a proper **yyyyMMdd**.

**to_date(..., "yyyyMMdd")**
- Converts the new string to a real date type.
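
The same validate-then-parse logic can be sketched in plain Python, which is handy for working out expected values before running the Spark job. `parse_custom_date` is a hypothetical helper, and the fixed "20" century prefix is carried over from the notebook as an assumption.

```python
import re
from datetime import date, datetime
from typing import Optional

SEVEN_DIGITS = re.compile(r"[0-9]{7}")


def parse_custom_date(raw: Optional[str]) -> Optional[date]:
    """Return a date for a 7-digit '0yyMMdd' string, else None."""
    if raw is None or not SEVEN_DIGITS.fullmatch(raw):
        return None  # mirrors the when(...).otherwise(None) branch
    candidate = "20" + raw[1:7]  # drop the leading digit, assume century 20
    try:
        return datetime.strptime(candidate, "%Y%m%d").date()
    except ValueError:
        return None  # 7 digits, but not a real calendar date


print(parse_custom_date("0240312"))  # 2024-03-12
print(parse_custom_date("0"))        # None
print(parse_custom_date("0240230"))  # None
```

The last call shows a case the regex alone does not cover: "0240230" has seven digits but February 30 does not exist. In Spark, `to_date` is expected to return null for such a value when ANSI mode is off; that behaviour is an assumption worth verifying on your cluster.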

#### 5) Reusable function with to_date & substr (multiple columns)

In [0]:
from pyspark.sql import functions as F

# function to convert the date fields into required format
def convert_date_fields(df, col_names):
    for col_name in col_names:
        df = df.withColumn(
            col_name,
            F.when(
                (F.col(col_name) != "0") & (F.col(col_name).rlike("^[0-9]{7}$")),
                F.to_date(
                    F.concat(
                        F.lit("20"),
                        F.col(col_name).substr(2, 2),
                        F.col(col_name).substr(4, 2),
                        F.col(col_name).substr(6, 2)
                    ),
                    "yyyyMMdd"
                )
            ).otherwise(F.lit(None).cast("date"))
        )
    return df

In [0]:
q = convert_date_fields(df_samp, columns)
display(q)

d1,d2,d3
2024-03-12,2023-12-25,2022-11-20
2023-12-25,2021-12-25,2025-12-25
2098-03-12,2099-12-25,2097-12-25
2096-12-25,2095-12-25,2094-09-21
2024-03-12,2023-12-25,2097-06-18
2085-09-11,2088-07-13,2082-02-19
,2099-12-25,2022-11-20


- Regex check to ensure only 7-digit numbers are processed.
- Proper substr() slicing:
  - .substr(2, 2) → **last two digits of year**
  - .substr(4, 2) → **month**
  - .substr(6, 2) → **day**
- concat() with **"20"** to make a **yyyyMMdd** string.
- otherwise() returns a real date-typed **null** for **invalid entries**.