
Mastering date and timestamp manipulation in PySpark is fundamental for Data Engineers, especially when dealing with time-series data, ETL processes, and analytical reporting. PySpark offers a robust set of functions in `pyspark.sql.functions` to handle these operations efficiently.

---

### Key Concepts for Dates and Timestamps

*   **DateType vs. TimestampType**:
    *   `DateType`: Represents a date (year, month, day) without time or time zone information.
    *   `TimestampType`: Represents an instant in time with microsecond precision (year, month, day, hour, minute, second, microsecond). No time zone is stored with the value; Spark keeps it internally as microseconds since the Unix epoch in UTC.
*   **Time Zones**:
    *   Spark stores `TimestampType` values internally as **UTC (Coordinated Universal Time)**.
    *   When Spark parses a timestamp string that has no offset, or renders a timestamp as text (for example in `show()`), it converts from/to the session's configured time zone (`spark.sql.session.timeZone`).
    *   Always be explicit with time zones during ingestion and presentation to avoid ambiguity.
    *   **Best Practice**: Store all timestamp data in UTC in your data lake/warehouse. Convert to local time zones only at the presentation layer (e.g., during reporting).
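
The "store in UTC, convert at the presentation layer" idea can be illustrated outside Spark with a small plain-Python sketch. It uses only the standard library (`datetime`, `zoneinfo`); the variable names and the sample instant are illustrative assumptions, not part of the original notes.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One instant, stored the way the text recommends: in UTC.
stored_utc = datetime(2023, 7, 19, 17, 0, 0, tzinfo=timezone.utc)

# Presentation layer: render the same instant for two audiences.
la_view = stored_utc.astimezone(ZoneInfo("America/Los_Angeles"))
london_view = stored_utc.astimezone(ZoneInfo("Europe/London"))

print(la_view.strftime("%Y-%m-%d %H:%M:%S %Z"))      # 2023-07-19 10:00:00 PDT
print(london_view.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2023-07-19 18:00:00 BST

# The wall-clock strings differ, but the underlying instant is identical.
print(la_view == london_view == stored_utc)          # True
```

Spark's `TimestampType` behaves like `stored_utc` here: the value is a single instant, and only its textual rendering depends on a time zone.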

---

### Essential PySpark Date & Timestamp Functions

These functions are available in `pyspark.sql.functions`.

| Function                      | Description                                                                                              | Notes for Beginners                                                                                                                                                                             |
| :---------------------------- | :------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `to_date(col, format=None)`   | Converts a `StringType` column to `DateType`. An optional `format` string can be provided.               | Always specify a `format` string if your input isn't in the standard `yyyy-MM-dd` form. If `format` is omitted, Spark applies its casting rules (equivalent to `col.cast("date")`) rather than inferring a pattern, so explicit is better. |
| `to_timestamp(col, format=None)` | Converts a `StringType` to `TimestampType`.                                                        | Similar to `to_date`: specify `format`. There is no time zone argument; a string without an offset is interpreted in the session time zone (`spark.sql.session.timeZone`) before being stored as a UTC instant. |
| `cast("timestamp")`           | Converts a `StringType` (e.g., `yyyy-MM-dd HH:mm:ss`) to `TimestampType`.                                | A simpler way for standard formats. If the string format doesn't match, `to_timestamp` with an explicit format is more reliable.                                                              |
| `current_date()`              | Returns the current date as a `DateType` literal.                                                        | Useful for adding a "snapshot date" to your data.                                                                                                                                               |
| `current_timestamp()`         | Returns the current timestamp as a `TimestampType` literal.                                              | Useful for auditing or "last updated" fields. Reflects the session's configured time zone when displayed, but internally is UTC.                                                               |
| `datediff(end_date, start_date)` | Returns the number of days between two `DateType` columns.                                               | `end_date - start_date`. The result is an `IntegerType`.                                                                                                                                       |
| `months_between(ts1, ts2, roundOff=True)` | Returns the number of months between two timestamps.                                                     | Can handle fractional months. `roundOff=True` rounds to 8 decimal places.                                                                                                                       |
| `add_months(start_date, num_months)` | Returns the date that is `num_months` after `start_date`.                                                | `num_months` can be negative to subtract months.                                                                                                                                                |
| `date_add(start_date, num_days)` | Adds `num_days` to `start_date`.                                                                         | `num_days` can be negative.                                                                                                                                                                     |
| `date_sub(start_date, num_days)` | Subtracts `num_days` from `start_date`.                                                                  | Equivalent to `date_add(start_date, -num_days)`.                                                                                                                                                |
| `date_format(date, format)`   | Formats a `DateType` or `TimestampType` column to a `StringType` according to the specified format.        | Essential for presenting dates/times in a user-friendly or specific reporting format.                                                                                                           |
| `year(col)`, `month(col)`, `dayofmonth(col)`, `dayofweek(col)` | Extracts year, month, day of month, day of week (1=Sunday, 7=Saturday) from a date/timestamp. | Useful for time-based aggregations (e.g., sales by month, weekly trends). `dayofweek` is important for calendrical operations.                                                                   |
| `hour(col)`, `minute(col)`, `second(col)` | Extracts hour, minute, second from a timestamp.                                                          | Useful for detailed time-series analysis or grouping by time of day.                                                                                                                            |
| `from_unixtime(unixtime_col, format)` | Converts Unix timestamp (seconds since epoch) to a formatted string.                                     | Unix timestamps are common in system logs. Use this to make them human-readable.                                                                                                                |
| `unix_timestamp(ts_col, format)` | Converts a timestamp string (with optional format) to a Unix timestamp (seconds since epoch).            | Useful when you need to store timestamps as numerical values or for compatibility with other systems that use Unix timestamps.                                                                  |

---
### PySpark Date & Timestamp Manipulation Examples


In [1]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType, TimestampType

# Initialize Spark Session
spark = SparkSession.builder.appName("DateTimestampFunctions").getOrCreate()

# Sample Data
data = [
    ("2023-01-01", "2023-01-31 10:30:00"),
    ("2023-02-15", "2023-02-15 14:00:00"),
    ("2022-12-25", "2022-12-25 23:59:59")
]
columns = ["event_date_str", "event_ts_str"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show(truncate=False)
df.printSchema()

# --- 1. Convert string to DateType and TimestampType ---
print("\n--- After converting to DateType and TimestampType ---")
df_converted = df.withColumn("event_date", F.to_date(F.col("event_date_str"))) \
                 .withColumn("event_ts", F.col("event_ts_str").cast("timestamp"))

df_converted.show(truncate=False)
df_converted.printSchema()

# --- 2. Current Date and Timestamp ---
print("\n--- Current Date and Timestamp ---")
spark.range(1).select(
    F.current_date().alias("Today"),
    F.current_timestamp().alias("Now")
).show(truncate=False)

# --- 3. Date Difference ---
print("\n--- Date difference (days between current date and event_date) ---")
df_converted.withColumn(
    "days_since_event",
    F.datediff(F.current_date(), F.col("event_date"))
).show()

# --- 4. Add/Subtract Months and Days ---
print("\n--- Date after adding 3 months and 5 days ---")
df_converted.withColumn("date_plus_3_months", F.add_months(F.col("event_date"), 3)) \
            .withColumn("date_plus_5_days", F.date_add(F.col("event_date"), 5)) \
            .withColumn("date_minus_5_days", F.date_sub(F.col("event_date"), 5)) \
            .show()

# --- 5. Date and Timestamp Formatting ---
print("\n--- Formatted Date and Timestamp ---")
df_converted.withColumn("formatted_date", F.date_format(F.col("event_date"), "yyyy/MM/dd")) \
            .withColumn("formatted_ts", F.date_format(F.col("event_ts"), "MM-dd-yyyy HH:mm:ss")) \
            .show(truncate=False)

# --- 6. Extracting Components ---
print("\n--- Date/Time Components ---")
df_converted.select(
    F.col("event_date"),
    F.year(F.col("event_date")).alias("Year"),
    F.month(F.col("event_date")).alias("Month"),
    F.dayofmonth(F.col("event_date")).alias("DayOfMonth"),
    F.dayofweek(F.col("event_date")).alias("DayOfWeek"), # 1=Sunday, 7=Saturday
    F.hour(F.col("event_ts")).alias("Hour"),
    F.minute(F.col("event_ts")).alias("Minute"),
    F.second(F.col("event_ts")).alias("Second")
).show()

# --- 7. Filtering by Date Range ---
print("\n--- Filtering for tasks completed in February 2023 ---")
tasks_data = [
    ("TaskA", "2023-01-10"),
    ("TaskB", "2023-02-20"),
    ("TaskC", "2023-01-05"),
    ("TaskD", "2023-03-01"),
    ("TaskE", "2023-02-28")
]
tasks_cols = ["Task", "CompletionDate"]
df_tasks = spark.createDataFrame(tasks_data, tasks_cols) \
                .withColumn("CompletionDate", F.to_date(F.col("CompletionDate")))

df_tasks.show()
df_tasks.printSchema()

# Compare against DateType literals so both sides of the comparison have the same type
df_tasks.filter(
    F.col("CompletionDate").between(
        F.to_date(F.lit("2023-02-01")), F.to_date(F.lit("2023-02-28"))
    )
).show()

# Filtering for tasks completed within 30 days of a specific date
target_date = F.to_date(F.lit("2023-02-15"))
print("\n--- Filtering for tasks completed within 30 days of 2023-02-15 ---")
df_tasks.filter(
    F.datediff(F.col("CompletionDate"), target_date).between(-30, 30)
).show()

spark.stop()


Original DataFrame:
+--------------+-------------------+
|event_date_str|event_ts_str       |
+--------------+-------------------+
|2023-01-01    |2023-01-31 10:30:00|
|2023-02-15    |2023-02-15 14:00:00|
|2022-12-25    |2022-12-25 23:59:59|
+--------------+-------------------+

root
 |-- event_date_str: string (nullable = true)
 |-- event_ts_str: string (nullable = true)


--- After converting to DateType and TimestampType ---
+--------------+-------------------+----------+-------------------+
|event_date_str|event_ts_str       |event_date|event_ts           |
+--------------+-------------------+----------+-------------------+
|2023-01-01    |2023-01-31 10:30:00|2023-01-01|2023-01-31 10:30:00|
|2023-02-15    |2023-02-15 14:00:00|2023-02-15|2023-02-15 14:00:00|
|2022-12-25    |2022-12-25 23:59:59|2022-12-25|2022-12-25 23:59:59|
+--------------+-------------------+----------+-------------------+

root
 |-- event_date_str: string (nullable = true)
 |-- event_ts_str: string (nullable = true)
 |-- event_date: date (nullable = true)
 |-- event_ts: timestamp (nullable = true)

---

### Handling Time Zones and Formats

Time zone handling is critical for data consistency, especially with distributed data.

*   **`spark.sql.session.timeZone`**: This configuration property sets the default time zone for the Spark session. All timestamp operations that don't explicitly specify a time zone will use this. The default is typically the JVM's local time zone.
*   **Internal Storage is UTC**: Remember, Spark stores `TimestampType` values internally as UTC. Any display or conversion to a string format will apply the session's time zone or an explicitly provided one.

#### Example (Python):

In [2]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TimezoneHandling").getOrCreate()

# --- Initial Setup ---
# Set session time zone for demonstration
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
print(f"Session Time Zone: {spark.conf.get('spark.sql.session.timeZone')}")

data = [("2023-07-19 10:00:00",), ("2023-07-19 15:00:00",)]
columns = ["timestamp_str"]
df = spark.createDataFrame(data, columns)
print("\nOriginal string DataFrame:")
df.show(truncate=False)

# --- 1. Convert string to TimestampType (assuming input is in session timezone or no TZ specified) ---
# Spark will interpret 'timestamp_str' as being in 'America/Los_Angeles' (session TZ)
# and convert it to UTC internally. When shown, it's converted back to 'America/Los_Angeles'.
df_ts = df.withColumn("timestamp", F.col("timestamp_str").cast("timestamp"))
print("\nOriginal timestamps (converted to session timezone display):")
df_ts.show(truncate=False)
df_ts.printSchema()

# --- 2. Parse with an explicit format ---
# to_timestamp() has no time zone argument. A string without an offset is parsed
# in the session time zone, so this yields the same instants as the cast above.
df_parsed = df.withColumn(
    "ts_parsed",
    F.to_timestamp(F.col("timestamp_str"), "yyyy-MM-dd HH:mm:ss")
)
print("\nTimestamp parsed with an explicit format (session TZ: America/Los_Angeles):")
df_parsed.show(truncate=False)

# --- 3. Revert session time zone to UTC for comparison ---
spark.conf.set("spark.sql.session.timeZone", "UTC")
print(f"\nSession Time Zone changed to: {spark.conf.get('spark.sql.session.timeZone')}")

# Now observe how the same internal timestamp value is displayed differently
print("\nOriginal timestamps (displayed in new session TZ - UTC):")
df_ts.show(truncate=False) # The internal value of 'timestamp' column hasn't changed, only its display

# --- 4. Format timestamps as wall-clock strings for specific time zones ---
# date_format() takes only (column, pattern) and renders in the session time zone.
# from_utc_timestamp() first shifts the value to the target zone's wall-clock time;
# this gives the expected result here because the session time zone is now UTC.
print("\nFormatting timestamp for specific time zones:")
df_ts.withColumn("formatted_utc", F.date_format(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss z")) \
     .withColumn("formatted_la", F.date_format(F.from_utc_timestamp(F.col("timestamp"), "America/Los_Angeles"), "yyyy-MM-dd HH:mm:ss")) \
     .withColumn("formatted_london", F.date_format(F.from_utc_timestamp(F.col("timestamp"), "Europe/London"), "yyyy-MM-dd HH:mm:ss")) \
     .show(truncate=False)

spark.stop()

Session Time Zone: America/Los_Angeles

Original string DataFrame:
+-------------------+
|timestamp_str      |
+-------------------+
|2023-07-19 10:00:00|
|2023-07-19 15:00:00|
+-------------------+


Original timestamps (converted to session timezone display):
+-------------------+-------------------+
|timestamp_str      |timestamp          |
+-------------------+-------------------+
|2023-07-19 10:00:00|2023-07-19 10:00:00|
|2023-07-19 15:00:00|2023-07-19 15:00:00|
+-------------------+-------------------+

root
 |-- timestamp_str: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)


Timestamp parsed with an explicit format (session TZ: America/Los_Angeles):
+-------------------+-------------------+
|timestamp_str      |ts_parsed          |
+-------------------+-------------------+
|2023-07-19 10:00:00|2023-07-19 10:00:00|
|2023-07-19 15:00:00|2023-07-19 15:00:00|
+-------------------+-------------------+


Session Time Zone changed to: UTC

Original timestamps (displayed in new session TZ - UTC):
[output truncated]

### Key Principles for Beginner Data Engineers

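*   **Be Explicit**: Always specify date/timestamp formats when converting strings. Without a format, Spark falls back to its default casting rules, and strings that do not match come back as `null` (or raise an error when ANSI mode is enabled) instead of the value you expected.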
*   **UTC for Storage**: Store all `TimestampType` data in your underlying data storage (Parquet, Delta, etc.) in UTC. This standardizes your data and avoids ambiguities.
*   **Local Time Zones at Edges**: Only convert to a specific local time zone when ingesting data from a source that provides local timestamps, or when presenting data to users for reporting.
*   **Leverage Built-in Functions**: PySpark's `pyspark.sql.functions` module is optimized for performance and is the preferred way to handle date and time operations, rather than UDFs (User Defined Functions) which can be slower.