# 5 - Date and Time Challenges
Working with date and time data is always a struggle, because there are many different notations and formats, as well as the timezone issue. So today we want to check how the `pyspark.sql.functions` module can help us here.

In [1]:
from pyspark_start import *

There are just two kinds of objects regarding date and time known to Spark:
1. `DateType`: which refers to a calendar date
1. `TimestampType`: which is a date extended by a time related to a timezone

By default, Spark derives the current date and timestamp from the local machine settings, unless this is overruled by `SparkConf` settings.

In [2]:
# import pyspark.sql.functions as F
from pyspark.sql.functions import *

df = spark\
    .range(1)\
    .withColumn("Today", current_date())\
    .withColumn("Now", current_timestamp())

df.show(truncate=False)

+---+----------+-----------------------+
|id |Today     |Now                    |
+---+----------+-----------------------+
|0  |2023-01-20|2023-01-20 18:23:32.583|
+---+----------+-----------------------+



## Decomposing Date and Timestamp Objects
If we need to check for a particular part of a date/timestamp, maybe to get all data records created in Apr-2020, we can decompose these objects with one of the following functions.

In [3]:
df.select(
    "now",
    year("Now").alias("year"),
    quarter("Now").alias("quarter"),
    month("now").alias("month"),
    dayofmonth("now").alias("day of month"),
    hour("now").alias("hours"),
    minute("now").alias("minutes"),
    second("now").alias("seconds")
).show(vertical=True, truncate=False)

-RECORD 0-------------------------------
 now          | 2023-01-20 18:23:37.073 
 year         | 2023                    
 quarter      | 1                       
 month        | 1                       
 day of month | 20                      
 hours        | 18                      
 minutes      | 23                      
 seconds      | 37                      



We can even derive some further attributes from a given date. In particular, the `last_day()` function can help us implement month-end reports.

In [4]:
df.select(
    "now",
    weekofyear("now").alias("week of year"),
    dayofweek("now").alias("day of week"),
    last_day("now").alias("month end"),
    next_day("today", "Sun").alias("next Sunday")
).show(vertical=True, truncate=False)

-RECORD 0------------------------------
 now          | 2023-01-20 18:23:37.86 
 week of year | 3                      
 day of week  | 6                      
 month end    | 2023-01-31             
 next Sunday  | 2023-01-22             
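
As a cross-check of the conventions behind this output, here is a minimal plain-Python sketch (standard library only, not Spark code). It assumes that `dayofweek()` counts from Sunday = 1 to Saturday = 7, which matches the value 6 for Friday, 2023-01-20, and that `next_day()` returns the first matching weekday strictly after the given date. The helper names are made up for this illustration.

```python
import calendar
from datetime import date, timedelta

def spark_style_dayofweek(d: date) -> int:
    """Sunday = 1 ... Saturday = 7 (assumed Spark convention)."""
    return d.isoweekday() % 7 + 1

def month_end(d: date) -> date:
    """Last calendar day of the month that contains d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def next_sunday(d: date) -> date:
    """First Sunday strictly after d."""
    # weekday(): Monday = 0 ... Sunday = 6; a Sunday itself moves on by 7 days
    days_ahead = (6 - d.weekday()) % 7 or 7
    return d + timedelta(days=days_ahead)

today = date(2023, 1, 20)            # the date used in the output above
print(today.isocalendar()[1])        # ISO week of year -> 3
print(spark_style_dayofweek(today))  # 6
print(month_end(today))              # 2023-01-31
print(next_sunday(today))            # 2023-01-22
```

The values agree with the Spark output, which suggests that `weekofyear()` follows the ISO week numbering for this date.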



## Date Shifts and Date Period Calculation

Spark provides native support for calculations of time periods on a daily or monthly basis.

In [5]:
df.select(
    "today",
    date_add("today", 2).alias("day after tomorrow"),
    date_add("today", -1).alias("yesterday"),
    date_sub("today", 1).alias("yesterday"),
    add_months("today", 1).alias("same day next month"),
    add_months("today", -1).alias("same day prev. month"),
    "now",
    date_add("now", 1).alias("tomorrow at same time"),
    date_sub("now", 1).alias("yesterday at same time"),
    add_months("now", 1).alias("next month"),
    datediff(date_add("today", 1), date_sub("today", 1)).alias("# of days between yesterday and tomorrow"),
    months_between("today", date_add("today", 77)).alias("# of months between two dates"),
    months_between("today", date_sub("today", 77)).alias("# of months between two dates")
).show(vertical=True, truncate=False)

-RECORD 0-----------------------------------------------------------
 today                                    | 2023-01-20              
 day after tomorrow                       | 2023-01-22              
 yesterday                                | 2023-01-19              
 yesterday                                | 2023-01-19              
 same day next month                      | 2023-02-20              
 same day prev. month                     | 2022-12-20              
 now                                      | 2023-01-20 18:23:38.456 
 tomorrow at same time                    | 2023-01-21              
 yesterday at same time                   | 2023-01-19              
 next month                               | 2023-02-20              
 # of days between yesterday and tomorrow | 2                       
 # of months between two dates            | -2.58064516             
 # of months between two dates            | 2.51612903              



At first glance it seems counter-intuitive that the month difference is negative when the second date is later than the first, and positive when it is earlier. The calculation is simply date1 - date2, and since earlier dates are smaller than later ones, the sign follows directly from the subtraction.
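
To make the fractional results concrete, the following plain-Python sketch (standard library only, not Spark code) reproduces the two values from the output. It assumes the documented behaviour of `months_between()` for dates: whole-month difference plus the day difference divided by 31, rounded to 8 digits, with a whole number returned when both dates fall on the same day of month or both on a month end. The function name `months_between_sketch` is made up for this illustration.

```python
import calendar
from datetime import date, timedelta

def months_between_sketch(date1: date, date2: date) -> float:
    """Approximate months_between(date1, date2) for plain dates."""
    month_diff = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    last1 = calendar.monthrange(date1.year, date1.month)[1]
    last2 = calendar.monthrange(date2.year, date2.month)[1]
    # Same day of month, or both on month end: whole number of months
    if date1.day == date2.day or (date1.day == last1 and date2.day == last2):
        return float(month_diff)
    # Otherwise the remaining days count as a fraction of a 31-day month
    return round(month_diff + (date1.day - date2.day) / 31, 8)

today = date(2023, 1, 20)
later = today + timedelta(days=77)    # 2023-04-07
earlier = today - timedelta(days=77)  # 2022-11-04

print(months_between_sketch(today, later))    # -2.58064516
print(months_between_sketch(today, earlier))  # 2.51612903
```

The first argument is the minuend, so putting the later date first yields a positive result.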

We should also keep in mind that when these functions are applied to timestamps, the results are truncated to dates, so the time-of-day part is lost.

The `date_sub()` function is redundant, because `date_add()` also accepts negative day shifts.

In [6]:
df.select(
    "today",
    date_trunc("yyyy", "today").alias("yyyy"),
    date_trunc("year", "today").alias("year"),
    date_trunc("yy", "today").alias("yy"),
    date_trunc("quarter", "today").alias("quarter"),
    date_trunc("month", "today").alias("month"),
    date_trunc("mon", "today").alias("mon"),
    date_trunc("week", "today").alias("week"),
    date_trunc("mm", "today").alias("mm"),
    "now",
    date_trunc("yyyy", "now").alias("yyyy"),
    date_trunc("year", "now").alias("year"),
    date_trunc("yy", "now").alias("yy"),
    date_trunc("quarter", "now").alias("quarter"),
    date_trunc("month", "now").alias("month"),
    date_trunc("mon", "now").alias("mon"),
    date_trunc("week", "now").alias("week"),
    date_trunc("mm", "now").alias("mm"),
    date_trunc("day", "now").alias("day"),
    date_trunc("dd", "now").alias("dd"),
    date_trunc("hour", "now").alias("hour"),
    date_trunc("minute", "now").alias("minute"),
    date_trunc("second", "now").alias("second"),
    date_trunc("quarter", "today").alias("quarter"),
).show(vertical=True, truncate=False)

-RECORD 0--------------------------
 today   | 2023-01-20              
 yyyy    | 2023-01-01 00:00:00     
 year    | 2023-01-01 00:00:00     
 yy      | 2023-01-01 00:00:00     
 quarter | 2023-01-01 00:00:00     
 month   | 2023-01-01 00:00:00     
 mon     | 2023-01-01 00:00:00     
 week    | 2023-01-16 00:00:00     
 mm      | 2023-01-01 00:00:00     
 now     | 2023-01-20 18:23:39.272 
 yyyy    | 2023-01-01 00:00:00     
 year    | 2023-01-01 00:00:00     
 yy      | 2023-01-01 00:00:00     
 quarter | 2023-01-01 00:00:00     
 month   | 2023-01-01 00:00:00     
 mon     | 2023-01-01 00:00:00     
 week    | 2023-01-16 00:00:00     
 mm      | 2023-01-01 00:00:00     
 day     | 2023-01-20 00:00:00     
 dd      | 2023-01-20 00:00:00     
 hour    | 2023-01-20 18:00:00     
 minute  | 2023-01-20 18:23:00     
 second  | 2023-01-20 18:23:39     
 quarter | 2023-01-01 00:00:00     
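
The less obvious truncation levels can be mirrored in a few lines of plain Python (standard library only, not Spark code). This is a sketch under the assumption, taken from the output above, that `"week"` truncates to the Monday of the week and `"quarter"` to the first day of the quarter. The helper names are made up for this illustration.

```python
from datetime import datetime, timedelta

def trunc_to_week(ts: datetime) -> datetime:
    """Midnight of the Monday of the week that contains ts."""
    monday = ts.date() - timedelta(days=ts.weekday())  # weekday(): Monday = 0
    return datetime(monday.year, monday.month, monday.day)

def trunc_to_quarter(ts: datetime) -> datetime:
    """Midnight of the first day of the quarter that contains ts."""
    first_month = 3 * ((ts.month - 1) // 3) + 1  # 1, 4, 7 or 10
    return datetime(ts.year, first_month, 1)

def trunc_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds and fractions of a second."""
    return ts.replace(minute=0, second=0, microsecond=0)

now = datetime(2023, 1, 20, 18, 23, 39, 272000)  # the timestamp from the output above
print(trunc_to_week(now))     # 2023-01-16 00:00:00
print(trunc_to_quarter(now))  # 2023-01-01 00:00:00
print(trunc_to_hour(now))     # 2023-01-20 18:00:00
```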



## Type Conversion: String vs. Date/Timestamps
Especially during data import or export of dates/timestamps, we often have to read from strings or write to strings.

### Reading Strings

In [30]:
stringDF = spark.range(1).withColumn("ts string", lit("2020-04-16 14:16:23"))
stringDF.show(truncate=False)

+---+-------------------+
|id |ts string          |
+---+-------------------+
|0  |2020-04-16 14:16:23|
+---+-------------------+



In [33]:
stringDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- ts string: string (nullable = false)



When we read from a file with `option("inferSchema", "true")`, Spark can identify most of the common date and datetime notations and derive the corresponding column type. If the notation is unknown to Spark, it will set the column to `StringType`. In that case we can convert these strings to dates or timestamps during a subsequent transformation. This requires a format string in the notation of the Java <a href="https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html">SimpleDateFormats</a>. For Spark 3.x, the exact list of supported pattern letters is the datetime pattern reference linked in the `to_date()` help output further down.

**Note:** Internally Spark works with Java dates and timestamps.

In [31]:
stringDF.select(
        "ts string",
        to_timestamp("ts string", "yyyy-MM-dd HH:mm:ss").alias("converted to ts"),
        to_date("ts string", "yyyy-MM-dd HH:mm:ss").alias("converted to date")
).show(truncate=False)

+-------------------+-------------------+-----------------+
|ts string          |converted to ts    |converted to date|
+-------------------+-------------------+-----------------+
|2020-04-16 14:16:23|2020-04-16 14:16:23|2020-04-16       |
+-------------------+-------------------+-----------------+



In [16]:
help(to_date)

Help on function to_date in module pyspark.sql.functions:

to_date(col: 'ColumnOrName', format: Optional[str] = None) -> pyspark.sql.column.Column
    Converts a :class:`~pyspark.sql.Column` into :class:`pyspark.sql.types.DateType`
    using the optionally specified format. Specify formats according to `datetime pattern`_.
    By default, it follows casting rules to :class:`pyspark.sql.types.DateType` if the format
    is omitted. Equivalent to ``col.cast("date")``.
    
    .. _datetime pattern: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
    
    .. versionadded:: 2.2.0
    
    Examples
    --------
    >>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
    >>> df.select(to_date(df.t).alias('date')).collect()
    [Row(date=datetime.date(1997, 2, 28))]
    
    >>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
    >>> df.select(to_date(df.t, 'yyyy-MM-dd HH:mm:ss').alias('date')).collect()
    [Row(date=datetime.date(1997, 2, 28))]

### Writing as Strings
To convert either a date or a timestamp into a string, we again have to specify the format according to the Java [SimpleDateFormats](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) notation.

In [32]:
df.printSchema()

root
 |-- id: long (nullable = false)
 |-- Today: date (nullable = false)
 |-- Now: timestamp (nullable = false)



In [27]:
df.select(
    date_format("today", "dd.MM.yyyy").alias("date string"),
    date_format("now", "dd.MM:yyyy HH:mm.ss").alias("ts string")
).show(truncate=False)

+-----------+-------------------+
|date string|ts string          |
+-----------+-------------------+
|20.01.2023 |20.01:2023 19:21.25|
+-----------+-------------------+



## Timezone Conversion
Processing timestamps gets even more complicated when we need to load data from source systems that are located in different timezones. When we want to analyse flight durations, we have to consider the timezone offset between departure time and arrival time. Otherwise a flight from Hamburg to London might appear in our results to take just 5 minutes.

So to ensure proper time period calculations, we should normalize all timestamps to a common timezone, usually *Coordinated Universal Time (UTC)*.

In [34]:
utcDF = df.select(
    "now",
    to_utc_timestamp("now", "CET").alias("CET ts in UTC"),
    to_utc_timestamp("now", "CET").alias("CEST ts ts in UTC"),
    to_utc_timestamp("now", "Europe/Berlin").alias("Europe/Berlin ts in UTC"),
)

utcDF.show(truncate=False)

+-----------------------+-----------------------+-----------------------+-----------------------+
|now                    |CET ts in UTC          |CEST ts ts in UTC      |Europe/Berlin ts in UTC|
+-----------------------+-----------------------+-----------------------+-----------------------+
|2023-01-20 19:23:05.322|2023-01-20 18:23:05.322|2023-01-20 18:23:05.322|2023-01-20 18:23:05.322|
+-----------------------+-----------------------+-----------------------+-----------------------+



All three columns show the same result: the timestamp is shifted back by one hour. January 20 lies outside the daylight-saving period, so the standard offset CET = UTC+1 applies. Note that the second column also passes `"CET"`, despite its label. Spark does not apply a fixed offset here but the rules of the given timezone: with `Europe/Berlin`, a timestamp in the daylight-saving period is shifted by two hours instead (CEST = UTC+2).

If we need to localize UTC timestamps, we just need to apply the reverse function, `from_utc_timestamp()`.



In [35]:
utcDF.select(
    from_utc_timestamp("Europe/Berlin ts in UTC", "Europe/Berlin").alias("local ts")
).show(truncate=False)

+-----------------------+
|local ts               |
+-----------------------+
|2023-01-20 19:23:39.474|
+-----------------------+
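
Coming back to the flight example from the start of this section, the effect of normalizing to UTC can be sketched in plain Python (standard library only, not Spark code). The flight times are invented for illustration, and the fixed offsets assume a winter date, where Hamburg is at UTC+1 and London at UTC+0.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical flight, values invented for illustration only
CET = timezone(timedelta(hours=1))  # Hamburg in winter, UTC+1
GMT = timezone.utc                  # London in winter, UTC+0

departure_local = datetime(2023, 1, 20, 10, 0, tzinfo=CET)
arrival_local = datetime(2023, 1, 20, 10, 35, tzinfo=GMT)

# Naive approach: compare the wall-clock values and ignore the offsets
naive_duration = (arrival_local.replace(tzinfo=None)
                  - departure_local.replace(tzinfo=None))

# Normalized approach: convert both timestamps to UTC first
departure_utc = departure_local.astimezone(timezone.utc)
arrival_utc = arrival_local.astimezone(timezone.utc)
true_duration = arrival_utc - departure_utc

print(naive_duration)  # 0:35:00
print(true_duration)   # 1:35:00
```

The one-hour gap between the two results is exactly the offset difference between the two timezones.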



The Unix timestamp is another alternative for timestamp harmonization:

**from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')**
Converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system timezone, in the given format.


**unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')**
Converts a time string with the given pattern (`yyyy-MM-dd HH:mm:ss` by default) to a Unix timestamp (in seconds), using the default timezone and the default locale. Returns null if the conversion fails.
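
The round trip between a timestamp and its epoch seconds can be illustrated in plain Python (standard library only, not Spark code). Unlike the two Spark functions, which use the default timezone, this sketch pins the timezone to UTC explicitly so that the result does not depend on the machine settings.

```python
from datetime import datetime, timezone

# The sample timestamp from the string conversion section, read as UTC
ts = datetime(2020, 4, 16, 14, 16, 23, tzinfo=timezone.utc)

# Timestamp -> seconds since 1970-01-01 00:00:00 UTC
epoch_seconds = int(ts.timestamp())
print(epoch_seconds)  # 1587046583

# Seconds -> formatted string, again pinned to UTC
restored = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(restored.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-04-16 14:16:23
```

Because the epoch value carries no timezone of its own, two systems in different timezones get the same number for the same moment.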

In [None]:
String Manipulation Functions

Case Conversion - lower, upper

Getting Length - length

Extracting substrings - substring, split

Trimming - trim, ltrim, rtrim

Padding - lpad, rpad

Concatenating string - concat, concat_ws