#PySpark SQL Date and Timestamp Functions



---


**PySpark date and timestamp functions are supported in both DataFrame and SQL queries, and they work much like their traditional SQL counterparts. Date and time handling is very important if you use PySpark for ETL. Most of these functions accept a Date type, a Timestamp type, or a String as input. If a String is used, it should be in a default format that can be cast to a date.**


- DateType default format is yyyy-MM-dd 

- TimestampType default format is yyyy-MM-dd HH:mm:ss.SSSS

- These functions return null if the input is a string that cannot be cast to Date or Timestamp.


**PySpark SQL provides many date and timestamp functions, so it is worth getting familiar with them. Prefer these built-in functions over writing your own user-defined functions (UDFs): they are compile-time safe, handle null, and perform better than PySpark UDFs. If your PySpark application is performance-critical, avoid custom UDFs wherever you can, since their performance is not guaranteed.**

##PySpark SQL Date and Timestamp Functions Examples


**The following are the most commonly used PySpark SQL date and timestamp functions, with examples. You can use them in both DataFrame and SQL expressions.**

In [0]:
from pyspark.sql.functions import *

In [0]:
data = [
    ["1","2020-02-01"],["2","2019-03-02"],["3","2021-03-01"]
]

df = spark.createDataFrame(data=data, schema=["id", "input"])
df.printSchema()
df.show(truncate=False)

root
 |-- id: string (nullable = true)
 |-- input: string (nullable = true)

+---+----------+
|id |input     |
+---+----------+
|1  |2020-02-01|
|2  |2019-03-02|
|3  |2021-03-01|
+---+----------+



##current_date()


**Use current_date() to get the current system date. By default, the date is returned in yyyy-MM-dd format.**

In [0]:
df.select(current_date().alias("current_date")).show(1)

+------------+
|current_date|
+------------+
|  2023-01-08|
+------------+
only showing top 1 row



##date_format()


**The example below uses date_format() to parse the date and convert it from yyyy-MM-dd to MM-dd-yyyy format. The result is a string column.**

In [0]:
#date_format()

df.select( col("input"), date_format( col("input"), "MM-dd-yyyy").alias("date_format") ).show()

+----------+-----------+
|     input|date_format|
+----------+-----------+
|2020-02-01| 02-01-2020|
|2019-03-02| 03-02-2019|
|2021-03-01| 03-01-2021|
+----------+-----------+



##to_date()


**The example below uses to_date() to convert a string in yyyy-MM-dd format to a DateType column. The second argument describes the format of the input string, so you can also use it to parse dates written in other formats. The supported pattern letters closely follow Java's DateTimeFormatter.**

In [0]:
#to_date()

df2 = df.select(col("input"), to_date(col("input"), "yyyy-MM-dd").alias("to_date"))
df.printSchema()
df2.printSchema()
df2.show()


root
 |-- id: string (nullable = true)
 |-- input: string (nullable = true)

root
 |-- input: string (nullable = true)
 |-- to_date: date (nullable = true)

+----------+----------+
|     input|   to_date|
+----------+----------+
|2020-02-01|2020-02-01|
|2019-03-02|2019-03-02|
|2021-03-01|2021-03-01|
+----------+----------+



##datediff()


**The example below uses datediff() to return the number of days between two dates. The first argument is the end date and the second is the start date.**

In [0]:
#datediff()
df.select(col("input"), datediff(current_date(), col("input")).alias("datediff")).show()

+----------+--------+
|     input|datediff|
+----------+--------+
|2020-02-01|    1072|
|2019-03-02|    1408|
|2021-03-01|     678|
+----------+--------+
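
**The day counts above can be cross-checked without a Spark session. The sketch below uses plain Python dates, not Spark itself, and assumes the reference date 2023-01-08 taken from the current_date() output shown earlier. The names `today`, `inputs`, and `day_diffs` are illustrative.**

```python
from datetime import date

# Reference date copied from the current_date() output shown earlier.
today = date(2023, 1, 8)
inputs = ["2020-02-01", "2019-03-02", "2021-03-01"]

# Same argument order as datediff(end, start): end minus start, in whole days.
day_diffs = {s: (today - date.fromisoformat(s)).days for s in inputs}

for s, days in day_diffs.items():
    print(s, days)
```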



##months_between()


**The example below uses months_between() to return the number of months between two dates. The result can be fractional, as the output shows.**

In [0]:
#months_between()

df.select(col("input"), months_between(current_date(), col("input")).alias("months_between")).show()

+----------+--------------+
|     input|months_between|
+----------+--------------+
|2020-02-01|   35.22580645|
|2019-03-02|   46.19354839|
|2021-03-01|   22.22580645|
+----------+--------------+
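
**The fractional values above come from treating leftover days as a fraction of a 31-day month. The plain-Python sketch below reproduces that arithmetic for date-only inputs. It is an approximation written for this illustration (the helper name `months_between_sketch` is hypothetical), not Spark's implementation, and it ignores time of day.**

```python
import calendar
from datetime import date


def months_between_sketch(end: date, start: date) -> float:
    """Approximate months_between() for date-only inputs (no time of day)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_last = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_last = start.day == calendar.monthrange(start.year, start.month)[1]
    # Same day of month, or both on the last day of their month: whole months.
    if end.day == start.day or (end_is_last and start_is_last):
        return float(months)
    # Otherwise the leftover days count as a fraction of a 31-day month,
    # rounded to 8 decimal places.
    return round(months + (end.day - start.day) / 31, 8)


# Reference date copied from the current_date() output shown earlier.
today = date(2023, 1, 8)
for s in ["2020-02-01", "2019-03-02", "2021-03-01"]:
    print(s, months_between_sketch(today, date.fromisoformat(s)))
```

**For the first row, 35 whole months plus 7/31 of a month gives 35.22580645, which matches the Spark output.**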



##trunc()


**The example below uses trunc() to truncate a date to the specified unit, here the first day of the month and the first day of the year.**

In [0]:
#trunc()

df.select(col("input"),\
         trunc(col("input"), "Month").alias("month_trunc"),\
         trunc(col("input"), "Year").alias("year_trunc")\
         ).show()

+----------+-----------+----------+
|     input|month_trunc|year_trunc|
+----------+-----------+----------+
|2020-02-01| 2020-02-01|2020-01-01|
|2019-03-02| 2019-03-01|2019-01-01|
|2021-03-01| 2021-03-01|2021-01-01|
+----------+-----------+----------+



##add_months() , date_add(), date_sub()


**Here we add and subtract months and days from a given input date.**

In [0]:
#add_months(), date_add(), date_sub()

df.select(col("input"),\
         add_months(col("input"), 3).alias("add_months"),\
         add_months(col("input"), -3).alias("sub_months"),\
         date_add(col("input"), 5).alias("date_add"),\
         date_sub(col("input"), 5).alias("date_sub")\
         ).show(truncate=False)

+----------+----------+----------+----------+----------+
|input     |add_months|sub_months|date_add  |date_sub  |
+----------+----------+----------+----------+----------+
|2020-02-01|2020-05-01|2019-11-01|2020-02-06|2020-01-27|
|2019-03-02|2019-06-02|2018-12-02|2019-03-07|2019-02-25|
|2021-03-01|2021-06-01|2020-12-01|2021-03-06|2021-02-24|
+----------+----------+----------+----------+----------+
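
**One detail the rows above do not exercise is what happens when the target month is shorter than the input day, for example adding one month to January 31. The plain-Python sketch below clamps the day to the last day of the target month, which is how add_months() is expected to behave in that case. The helper name `add_months_sketch` is hypothetical and the code is not Spark's implementation.**

```python
import calendar
from datetime import date, timedelta


def add_months_sketch(d: date, n: int) -> date:
    """Shift d by n months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + n  # months counted from year 0
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


d = date(2020, 2, 1)
print(add_months_sketch(d, 3), add_months_sketch(d, -3))  # month arithmetic
print(d + timedelta(days=5), d - timedelta(days=5))       # day arithmetic

# Day 31 does not exist in February, so the result is clamped to month end.
print(add_months_sketch(date(2020, 1, 31), 1))
```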



##year(), month(), next_day(), weekofyear()

In [0]:
df.select(col("input"),\
         year(col("input")).alias("year"),\
         month(col("input")).alias("month"),\
         next_day(col("input"), "Sunday").alias("next_days"),\
         weekofyear(col("input")).alias("weekofyear")\
         ).show(truncate=False)

+----------+----+-----+----------+----------+
|input     |year|month|next_days |weekofyear|
+----------+----+-----+----------+----------+
|2020-02-01|2020|2    |2020-02-02|5         |
|2019-03-02|2019|3    |2019-03-03|9         |
|2021-03-01|2021|3    |2021-03-07|9         |
+----------+----+-----+----------+----------+
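
**Two of these columns are easy to misread: next_day() returns the first matching weekday strictly after the input date, and weekofyear() follows ISO 8601 week numbering (weeks start on Monday). The plain-Python sketch below reproduces the table above under those assumptions. The helper `next_day_sketch` is hypothetical and takes an ISO weekday number rather than a day name.**

```python
from datetime import date, timedelta


def next_day_sketch(d: date, iso_weekday: int) -> date:
    """First date strictly after d that falls on iso_weekday (Mon=1 .. Sun=7)."""
    ahead = (iso_weekday - d.isoweekday()) % 7
    # If d already falls on that weekday, move a full week forward.
    return d + timedelta(days=ahead or 7)


SUNDAY = 7
results = []
for s in ["2020-02-01", "2019-03-02", "2021-03-01"]:
    d = date.fromisoformat(s)
    iso_week = d.isocalendar()[1]
    results.append((s, d.year, d.month, next_day_sketch(d, SUNDAY).isoformat(), iso_week))

for row in results:
    print(row)
```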



##dayofweek(), dayofmonth(), dayofyear()

In [0]:
df.select(col("input"),\
         dayofweek(col("input")).alias("dayofweek"),\
         dayofmonth(col("input")).alias("dayofmonth"),\
         dayofyear(col("input")).alias("dayofyear")\
         ).show(truncate=False)

+----------+---------+----------+---------+
|input     |dayofweek|dayofmonth|dayofyear|
+----------+---------+----------+---------+
|2020-02-01|7        |1         |32       |
|2019-03-02|7        |2         |61       |
|2021-03-01|2        |1         |60       |
+----------+---------+----------+---------+
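
**Note the weekday numbering in the output: dayofweek() counts from Sunday = 1 through Saturday = 7, which differs from Python's own conventions. The plain-Python sketch below converts between the two and reproduces the table above. The helper name `spark_style_dayofweek` is hypothetical.**

```python
from datetime import date


def spark_style_dayofweek(d: date) -> int:
    """Map Python's ISO weekday (Mon=1 .. Sun=7) to Sunday=1 .. Saturday=7."""
    return d.isoweekday() % 7 + 1


day_parts = []
for s in ["2020-02-01", "2019-03-02", "2021-03-01"]:
    d = date.fromisoformat(s)
    # (input, day of week, day of month, day of year)
    day_parts.append((s, spark_style_dayofweek(d), d.day, d.timetuple().tm_yday))

for row in day_parts:
    print(row)
```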



##current_timestamp()

**The following timestamp functions can be used in both SQL and DataFrame code. Let’s learn them with examples.**

**Let’s create some test data.**

In [0]:
data = [["1","02-01-2020 11 01 19 06"],["2","03-01-2019 12 01 19 406"],["3","03-01-2021 12 01 19 406"]]

df2 = spark.createDataFrame(data=data, schema=["id", "input"])
df2.printSchema()
df2.show(truncate=False)

root
 |-- id: string (nullable = true)
 |-- input: string (nullable = true)

+---+-----------------------+
|id |input                  |
+---+-----------------------+
|1  |02-01-2020 11 01 19 06 |
|2  |03-01-2019 12 01 19 406|
|3  |03-01-2021 12 01 19 406|
+---+-----------------------+



**The example below returns the current timestamp in Spark's default display format, yyyy-MM-dd HH:mm:ss followed by fractional seconds.**

In [0]:
#current_timestamp()

df2.select(current_timestamp().alias("current_timestamp")).show(1,truncate=False)

+-----------------------+
|current_timestamp      |
+-----------------------+
|2023-01-08 14:17:19.959|
+-----------------------+
only showing top 1 row



##to_timestamp()


**to_timestamp() converts a string to a TimestampType column, using the given pattern to parse the input.**

In [0]:
#to_timestamp()

df2.select(col("input"),\
         to_timestamp(col("input"), "MM-dd-yyyy HH mm ss SSS").alias("to_timestamp")\
         ).show(truncate=False)

+-----------------------+-----------------------+
|input                  |to_timestamp           |
+-----------------------+-----------------------+
|02-01-2020 11 01 19 06 |2020-02-01 11:01:19.06 |
|03-01-2019 12 01 19 406|2019-03-01 12:01:19.406|
|03-01-2021 12 01 19 406|2021-03-01 12:01:19.406|
+-----------------------+-----------------------+



##hour(), minute() and second()

In [0]:
#hour, minute, second

data = [["1","2020-02-01 11:01:19.06"],["2","2019-03-01 12:01:19.406"],["3","2021-03-01 12:01:19.406"]]

df3 = spark.createDataFrame(data=data, schema=["id", "input"])

df3.select(col("input"),\
          hour(col("input")).alias("hour"),\
          minute(col("input")).alias("minute"),\
          second(col("input")).alias("second")\
          ).show(truncate=False)

+-----------------------+----+------+------+
|input                  |hour|minute|second|
+-----------------------+----+------+------+
|2020-02-01 11:01:19.06 |11  |1     |19    |
|2019-03-01 12:01:19.406|12  |1     |19    |
|2021-03-01 12:01:19.406|12  |1     |19    |
+-----------------------+----+------+------+



##Conclusion:

**In this post, I’ve covered the most commonly used date and timestamp functions, with descriptions and examples. You can find the complete list on the <a href="https://www.databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html">Blog</a>.**