- Author: Ben Du
- Date: 2020-09-05 18:19:33
- Title: Date Functions in Spark
- Slug: spark-dataframe-func-date
- Category: Computer Science
- Tags: programming, PySpark, Spark, DataFrame, date, Spark SQL, function, SQL

## Comment

1. Most date functions work on a string of the format `yyyy-MM-dd`,
    which is automatically cast to a date object.
    
2. Functions `second`, `minute`, `hour`, `day`/`dayofmonth`, `weekofyear`, `month`, `quarter` 
    and `year` extract the corresponding part from a date object/string.
    
3. `date_add`, `date_sub`, `datediff` and `add_months` perform arithmetic operations on dates.

4. `to_date`, `to_timestamp`, `to_utc_timestamp`, `to_unix_timestamp` and `timestamp` 
    convert date objects/strings to dates, timestamps or Unix time.

In [6]:
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("PySpark_Str_Func") \
    .enableHiveSupport().getOrCreate()

## +/- Operators

The `+` and `-` operators are supported on date/time columns in Spark 3.
For a date, adding or subtracting an integer shifts it by that many days.

In [3]:
spark.sql("""
    select 
        current_date as today
    """).show()

+----------+
|     today|
+----------+
|2020-09-05|
+----------+



In [4]:
spark.sql("""
    select 
        current_date + 10 as today
    """).show()

+----------+
|     today|
+----------+
|2020-09-15|
+----------+



In [5]:
spark.sql("""
    select 
        current_date - 1 as today
    """).show()

+----------+
|     today|
+----------+
|2020-09-04|
+----------+



## add_months
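
`add_months(start_date, num_months)` shifts a date by a whole number of months. The sketch below is a plain-Python model of that arithmetic, not Spark itself. It assumes the Spark 3 behavior in which the day of month is kept when it exists in the target month and is otherwise clamped to that month's last day. The helper name `add_months_model` is hypothetical.

```python
import calendar
from datetime import date


def add_months_model(start: date, num_months: int) -> date:
    """Hypothetical plain-Python model of add_months(start, num_months)."""
    # Count months from year 0 so negative offsets and year rollovers work.
    month_index = start.year * 12 + (start.month - 1) + num_months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    # Clamp the day when the target month is shorter (e.g. 31 Aug -> 30 Sep).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# Spark SQL equivalent: add_months('2016-08-31', 1)
print(add_months_model(date(2016, 8, 31), 1))   # 2016-09-30
# Spark SQL equivalent: add_months('2017-01-15', -2)
print(add_months_model(date(2017, 1, 15), -2))  # 2016-11-15
```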

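## date_add

The examples in this and several of the following sections are written in Scala rather than PySpark.
The example below also demonstrates `date_sub`.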

In [8]:
val df = Seq(
    ("2017-01-01", "2017-01-07"),
    ("2017-02-01", "2019-02-10")
).toDF("d1", "d2")
df.show

+----------+----------+
|        d1|        d2|
+----------+----------+
|2017-01-01|2017-01-07|
|2017-02-01|2019-02-10|
+----------+----------+




In [9]:
import org.apache.spark.sql.functions._

val df1 = df.withColumn("d3", date_sub($"d1", 30))
    .withColumn("d4", date_add($"d1", 30))
    .withColumn("check", $"d2".between($"d3", $"d4"))
df1.show
df1.schema

+----------+----------+----------+----------+-----+
|        d1|        d2|        d3|        d4|check|
+----------+----------+----------+----------+-----+
|2017-01-01|2017-01-07|2016-12-02|2017-01-31| true|
|2017-02-01|2019-02-10|2017-01-02|2017-03-03|false|
+----------+----------+----------+----------+-----+



[[StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(d3,DateType,true), StructField(d4,DateType,true), StructField(check,BooleanType,true)]]

## date_sub

See the `date_add` section above, whose example computes the column `d3` with `date_sub`.

## datediff

In [15]:
val df2 = df.withColumn("diff", datediff($"d2", $"d1"))
df2.show
df2.schema

+----------+----------+----+
|        d1|        d2|diff|
+----------+----------+----+
|2017-01-01|2017-01-07|   6|
|2017-02-01|2019-02-10| 739|
+----------+----------+----+



[[StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(diff,IntegerType,true)]]
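
The argument order matters: `datediff(end, start)` counts days from `start` to `end`, so swapping the arguments flips the sign. The numbers above can be cross-checked with a plain-Python model. This is only a sketch using the standard library, and `datediff_model` is a hypothetical name.

```python
from datetime import date


def datediff_model(end: date, start: date) -> int:
    """Hypothetical plain-Python model of datediff(end, start)."""
    return (end - start).days


print(datediff_model(date(2017, 1, 7), date(2017, 1, 1)))   # 6
print(datediff_model(date(2019, 2, 10), date(2017, 2, 1)))  # 739
# Swapping the arguments gives a negative difference.
print(datediff_model(date(2017, 1, 1), date(2017, 1, 7)))   # -6
```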

## current_date

In [17]:
val df3 = df.withColumn("current", current_date())
df3.show
df3.schema

+----------+----------+----------+
|        d1|        d2|   current|
+----------+----------+----------+
|2017-01-01|2017-01-07|2018-05-02|
|2017-02-01|2019-02-10|2018-05-02|
+----------+----------+----------+



[[StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(current,DateType,false)]]

## current_timestamp / now

Both `current_timestamp` and `now` return the current timestamp.
The difference is that `now` must be called with parentheses,
while `current_timestamp` can be called with or without them.
If you are not sure,
always call functions with parentheses.

In [17]:
spark.sql("""
    select
        current_timestamp
    """).show(n=1, truncate=False)

+-----------------------+
|current_timestamp()    |
+-----------------------+
|2020-09-07 10:58:58.381|
+-----------------------+



In [20]:
spark.sql("""
    select
        current_timestamp()
    """).show(n=1, truncate=False)

+----------------------+
|current_timestamp()   |
+----------------------+
|2020-09-07 11:00:53.47|
+----------------------+



In [19]:
spark.sql("""
    select
        now()
    """).show(n=1, truncate=False)

+-----------------------+
|now()                  |
+-----------------------+
|2020-09-07 10:59:37.629|
+-----------------------+



## dayofmonth / day

Returns the day of the month from a given date or timestamp.
`dayofmonth` and `day` are synonyms.

In [21]:
val df4 = df.withColumn("day_of_d2", dayofmonth($"d2"))
df4.show
df4.schema

+----------+----------+---------+
|        d1|        d2|day_of_d2|
+----------+----------+---------+
|2017-01-01|2017-01-07|        7|
|2017-02-01|2019-02-10|       10|
+----------+----------+---------+



[[StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(day_of_d2,IntegerType,true)]]

In [11]:
spark.sql("""
    select 
        dayofmonth("2017-01-07") 
    """).show()

+------------------------------------+
|dayofmonth(CAST(2017-01-07 AS DATE))|
+------------------------------------+
|                                   7|
+------------------------------------+



## dayofyear

In [22]:
val df5 = df.withColumn("day_of_year_d1", dayofyear($"d1")).withColumn("day_of_year_d2", dayofyear($"d2"))
df5.show
df5.schema

+----------+----------+--------------+--------------+
|        d1|        d2|day_of_year_d1|day_of_year_d2|
+----------+----------+--------------+--------------+
|2017-01-01|2017-01-07|             1|             7|
|2017-02-01|2019-02-10|            32|            41|
+----------+----------+--------------+--------------+



[[StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(day_of_year_d1,IntegerType,true), StructField(day_of_year_d2,IntegerType,true)]]

## date_format

In [25]:
val df6 = df.withColumn("format_d1", date_format($"d1", "dd/MM/yyyy"))
df6.show
df6.schema

+----------+----------+----------+
|        d1|        d2| format_d1|
+----------+----------+----------+
|2017-01-01|2017-01-07|01/01/2017|
|2017-02-01|2019-02-10|01/02/2017|
+----------+----------+----------+



[[StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(format_d1,StringType,true)]]
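
In the pattern `dd/MM/yyyy`, `dd` is the day of month, `MM` the month and `yyyy` the year, so `2017-02-01` becomes `01/02/2017` (day first). As a rough cross-check outside Spark, the same layout corresponds to the `strftime` pattern `%d/%m/%Y` in plain Python. The variable names below are only illustrative.

```python
from datetime import date

# The two d1 values from the example above.
d1_values = [date(2017, 1, 1), date(2017, 2, 1)]

# Spark pattern "dd/MM/yyyy" corresponds to strftime "%d/%m/%Y".
formatted = [d.strftime("%d/%m/%Y") for d in d1_values]
print(formatted)  # ['01/01/2017', '01/02/2017']
```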

## date_trunc

`date_trunc(fmt, ts)` returns the timestamp `ts` truncated to the unit specified by the format `fmt`.
Supported values of `fmt` include `YEAR`, `YYYY`, `YY`, `MON`, `MONTH`, `MM`, `DAY`, `DD`, `HOUR`, `MINUTE`, `SECOND`, `WEEK` and `QUARTER`.
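
To make the truncation units concrete, here is a plain-Python sketch of what truncation to each unit means. It is a model under stated assumptions, not Spark's implementation: it assumes `WEEK` truncates to the Monday of the week, and it raises on an unrecognized unit. The helper name `date_trunc_model` is hypothetical.

```python
from datetime import datetime, timedelta


def date_trunc_model(fmt: str, ts: datetime) -> datetime:
    """Hypothetical plain-Python model of date_trunc(fmt, ts)."""
    unit = fmt.upper()
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit in ("YEAR", "YYYY", "YY"):
        return midnight.replace(month=1, day=1)
    if unit == "QUARTER":
        # First month of the quarter: 1, 4, 7 or 10.
        return midnight.replace(month=3 * ((ts.month - 1) // 3) + 1, day=1)
    if unit in ("MONTH", "MON", "MM"):
        return midnight.replace(day=1)
    if unit == "WEEK":
        # Assumption: weeks start on Monday (weekday() == 0).
        return midnight - timedelta(days=ts.weekday())
    if unit in ("DAY", "DD"):
        return midnight
    if unit == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "MINUTE":
        return ts.replace(second=0, microsecond=0)
    if unit == "SECOND":
        return ts.replace(microsecond=0)
    raise ValueError(f"unsupported unit: {fmt}")


ts = datetime(2020, 9, 10, 10, 58, 58, 381000)  # a Thursday
print(date_trunc_model("QUARTER", ts))  # 2020-07-01 00:00:00
print(date_trunc_model("WEEK", ts))     # 2020-09-07 00:00:00
print(date_trunc_model("HOUR", ts))     # 2020-09-10 10:00:00
```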

## minute

## month

In [9]:
spark.sql("""
    select
        month("2018-01-01") as month
    """).show()

+-----+
|month|
+-----+
|    1|
+-----+



## now

Returns the current timestamp; see the `current_timestamp / now` section above for examples.

## next_day

`next_day(start_date, day_of_week)` returns the first date after `start_date` that falls on the given day of the week.
The day of week can be specified as `MON`, `TUE`, `WED`, `THU`, `FRI`, `SAT`, `SUN`
or as `MO`, `TU`, `WE`, `TH`, `FR`, `SA`, `SU`.
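
The key detail is that the result is strictly later than `start_date`: if `start_date` already falls on the requested weekday, the result is one week later. The plain-Python sketch below models this. `next_day_model` and `DAY_INDEX` are hypothetical names, and the model only looks at the first two letters of the day name, which is more lenient than a real parser would be.

```python
from datetime import date, timedelta

# Two-letter day codes mapped to Python's weekday() numbering (Monday == 0).
DAY_INDEX = {"MO": 0, "TU": 1, "WE": 2, "TH": 3, "FR": 4, "SA": 5, "SU": 6}


def next_day_model(start: date, day_of_week: str) -> date:
    """Hypothetical plain-Python model of next_day(start, day_of_week)."""
    target = DAY_INDEX[day_of_week.upper()[:2]]
    # Always move forward by 1 to 7 days, so start itself is never returned.
    days_ahead = (target - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=days_ahead)


# 2015-01-14 is a Wednesday.
print(next_day_model(date(2015, 1, 14), "TU"))   # 2015-01-20
print(next_day_model(date(2015, 1, 14), "WED"))  # 2015-01-21
```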

## quarter
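
`quarter` returns the quarter of the year (1 to 4) for a date. The mapping from month to quarter can be sketched in plain Python as follows. This is a model of the arithmetic rather than Spark code, and `quarter_model` is a hypothetical name.

```python
from datetime import date


def quarter_model(d: date) -> int:
    """Hypothetical plain-Python model of quarter(d)."""
    # Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
    return (d.month - 1) // 3 + 1


quarters = [quarter_model(date(2017, month, 1)) for month in (1, 3, 4, 12)]
print(quarters)  # [1, 1, 2, 4]
```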

## second

## timestamp

## to_date

## to_utc_timestamp

## to_unix_timestamp

## to_timestamp

## unix_timestamp

## weekofyear
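
`weekofyear` returns the week number of a date. Assuming it follows ISO 8601 numbering (weeks start on Monday and week 1 is the first week with more than 3 days in the new year), the result can be modeled with `date.isocalendar()` in plain Python. Note that dates in early January can belong to the last week of the previous year. `weekofyear_model` is a hypothetical name.

```python
from datetime import date


def weekofyear_model(d: date) -> int:
    """Hypothetical plain-Python model of weekofyear(d), assuming ISO 8601 weeks."""
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    return d.isocalendar()[1]


print(weekofyear_model(date(2008, 2, 20)))  # 8
# 2017-01-01 is a Sunday, so it still belongs to week 52 of 2016.
print(weekofyear_model(date(2017, 1, 1)))   # 52
```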

## year

In [8]:
spark.sql("""
    select
        year("2018-01-01") as year
    """).show()

+----+
|year|
+----+
|2018|
+----+



## References

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://obstkel.com/spark-sql-functions

https://obstkel.com/spark-sql-date-functions