#### datediff() and months_between()

##### 1) datediff
- `datediff` returns the **difference** between **two dates** as a whole number of **DAYS**.
- For differences in months or years, use `months_between()` (see section 2).
- The inputs can be **date or timestamp values**; any time-of-day part is ignored.

##### **Syntax**
     datediff(endDate, startDate)

- This function takes the **end date** as the **first argument** and the **start date** as the **second argument** and returns the **number of days** in between them.
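
To make the argument order concrete, here is a minimal plain-Python sketch (not Spark code) that mirrors the same day arithmetic with `datetime.date`. The helper name `datediff_sketch` is hypothetical and exists only for illustration. It shows that the result is "end minus start" in days, so swapping the arguments flips the sign.

```python
from datetime import date


def datediff_sketch(end_date: date, start_date: date) -> int:
    """Hypothetical helper: whole days from start_date to end_date."""
    return (end_date - start_date).days


# End date first, start date second (same order as datediff)
forward = datediff_sketch(date(2023, 1, 15), date(2020, 10, 25))

# Swapped arguments: same magnitude, negative sign
backward = datediff_sketch(date(2020, 10, 25), date(2023, 1, 15))

print(forward, backward)  # 812 -812
```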

In [0]:
# define data
data = [['Sowmya', 29, 'Chennai', '2020-10-25', '2023-01-15', '2021-11-20', '2022-05-11'],
        ['Bole', 32, 'Bangalore', '2013-10-11', '2029-01-18', '2012-08-17', '2028-04-28'],
        ['Chandini', 35, 'Hyderabad', '2015-10-17', '2022-04-15', '2017-12-27', '2023-04-05'],
        ['Deepthi', 40, 'Nasik', '2022-12-21', '2023-04-23', '2023-02-26', '2024-08-20'],
        ['Swapna', 37, 'Mumbai', '2021-04-14', '2023-07-25', '2022-09-24', '2025-04-22'],
        ['Tharun', 25, 'Delhi', '2021-06-26', '2021-07-12', '2023-06-17', '2025-09-22']] 
  
# define column names
columns = ['emp_name', 'Age', 'City', 'start_date', 'end_date', 'purchase_date', 'delivery_date'] 
  
# create dataframe using data and column names
df_diff = spark.createDataFrame(data, columns)
  
# view dataframe
display(df_diff)

df_diff.printSchema()

emp_name,Age,City,start_date,end_date,purchase_date,delivery_date
Sowmya,29,Chennai,2020-10-25,2023-01-15,2021-11-20,2022-05-11
Bole,32,Bangalore,2013-10-11,2029-01-18,2012-08-17,2028-04-28
Chandini,35,Hyderabad,2015-10-17,2022-04-15,2017-12-27,2023-04-05
Deepthi,40,Nasik,2022-12-21,2023-04-23,2023-02-26,2024-08-20
Swapna,37,Mumbai,2021-04-14,2023-07-25,2022-09-24,2025-04-22
Tharun,25,Delhi,2021-06-26,2021-07-12,2023-06-17,2025-09-22


root
 |-- emp_name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- purchase_date: string (nullable = true)
 |-- delivery_date: string (nullable = true)



In [0]:
from pyspark.sql import functions as F
# `round` is not imported by name because it would shadow Python's built-in round(); use F.round instead
from pyspark.sql.functions import col, floor, to_date, to_timestamp, current_date, datediff, months_between, lit, unix_timestamp

In [0]:
df_diff = df_diff.withColumn("start_date", to_date("start_date")) \
                 .withColumn("end_date", to_date("end_date"))

df_diff.printSchema()

root
 |-- emp_name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
 |-- start_date: date (nullable = true)
 |-- end_date: date (nullable = true)
 |-- purchase_date: string (nullable = true)
 |-- delivery_date: string (nullable = true)



In [0]:
# create a new DataFrame with a date-difference column (in days)
df_diff_01 = df_diff.withColumn('diff_days', F.datediff(col('end_date'), col('start_date')))
display(df_diff_01)

emp_name,Age,City,start_date,end_date,purchase_date,delivery_date,diff_days
Sowmya,29,Chennai,2020-10-25,2023-01-15,2021-11-20,2022-05-11,812
Bole,32,Bangalore,2013-10-11,2029-01-18,2012-08-17,2028-04-28,5578
Chandini,35,Hyderabad,2015-10-17,2022-04-15,2017-12-27,2023-04-05,2372
Deepthi,40,Nasik,2022-12-21,2023-04-23,2023-02-26,2024-08-20,123
Swapna,37,Mumbai,2021-04-14,2023-07-25,2022-09-24,2025-04-22,832
Tharun,25,Delhi,2021-06-26,2021-07-12,2023-06-17,2025-09-22,16


- There are **812 days** between `2020-10-25` and `2023-01-15`.
- There are **5578 days** between `2013-10-11` and `2029-01-18`.
- There are **2372 days** between `2015-10-17` and `2022-04-15`.
- There are **123 days** between `2022-12-21` and `2023-04-23`.
- There are **832 days** between `2021-04-14` and `2023-07-25`.
- There are **16 days** between `2021-06-26` and `2021-07-12`.

In [0]:
# Days elapsed from start_date to today (the output depends on the date the cell is run)
df_diff_02 = df_diff.select(current_date().alias("current_date"),
                            col("start_date"),
                            datediff(current_date(), col("start_date")).alias("datediff_in_days")
                    )
display(df_diff_02)

df_diff_02.printSchema()

current_date,start_date,datediff_in_days
2026-01-30,2020-10-25,1923
2026-01-30,2013-10-11,4494
2026-01-30,2015-10-17,3758
2026-01-30,2022-12-21,1136
2026-01-30,2021-04-14,1752
2026-01-30,2021-06-26,1679


root
 |-- current_date: date (nullable = false)
 |-- start_date: date (nullable = true)
 |-- datediff_in_days: integer (nullable = true)



In [0]:
# Calculate the difference in days; purchase_date and delivery_date are still strings, so convert them inline with to_date()
df_diff_02 = df_diff.withColumn('diff_days', F.datediff(F.to_date('delivery_date'), F.to_date('purchase_date')))
display(df_diff_02)
df_diff_02.printSchema()

emp_name,Age,City,start_date,end_date,purchase_date,delivery_date,diff_days
Sowmya,29,Chennai,2020-10-25,2023-01-15,2021-11-20,2022-05-11,172
Bole,32,Bangalore,2013-10-11,2029-01-18,2012-08-17,2028-04-28,5733
Chandini,35,Hyderabad,2015-10-17,2022-04-15,2017-12-27,2023-04-05,1925
Deepthi,40,Nasik,2022-12-21,2023-04-23,2023-02-26,2024-08-20,541
Swapna,37,Mumbai,2021-04-14,2023-07-25,2022-09-24,2025-04-22,941
Tharun,25,Delhi,2021-06-26,2021-07-12,2023-06-17,2025-09-22,828


root
 |-- emp_name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)
 |-- start_date: date (nullable = true)
 |-- end_date: date (nullable = true)
 |-- purchase_date: string (nullable = true)
 |-- delivery_date: string (nullable = true)
 |-- diff_days: integer (nullable = true)



##### 2) months_between()

- a) Date Difference in Months
- b) Date Difference in Years
- c) Date Difference in Hours / Minutes
- d) Date Difference in Seconds
  - Using unix_timestamp()
  - Using to_timestamp()
- e) Differences when Dates are in a Custom Format

##### a) Date Difference in Months

##### months_between():
  - Returns the **number of months** between **date1 and date2**.
  - If `date1` is later than `date2`, the result is `positive`; otherwise it is `negative`.
  - A whole number is returned if both inputs have the same day of month or both are the last day of their respective months; the time of day is ignored in that case.
  - Otherwise, the fractional part is calculated assuming **31 days per month**.
  - The result is **rounded** to **8 decimal digits** unless **roundOff** is set to **False**.
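
The rules above can be sketched in plain Python. This is a hypothetical re-implementation for illustration only, not Spark's code: the function name `months_between_sketch` and its helpers are assumptions, and it ignores time zones. It may still help to see where the fractional part comes from.

```python
import calendar
from datetime import datetime

SECONDS_PER_DAY = 86_400
SECONDS_PER_MONTH = 31 * SECONDS_PER_DAY  # the rule assumes 31-day months


def _is_last_day_of_month(ts: datetime) -> bool:
    return ts.day == calendar.monthrange(ts.year, ts.month)[1]


def _seconds_in_day(ts: datetime) -> int:
    return ts.hour * 3600 + ts.minute * 60 + ts.second


def months_between_sketch(ts1: datetime, ts2: datetime, round_off: bool = True) -> float:
    """Hypothetical sketch of the months_between rules (ts1 = end, ts2 = start)."""
    month_diff = (ts1.year * 12 + ts1.month) - (ts2.year * 12 + ts2.month)

    # Same day of month, or both month-ends: whole number, time of day ignored
    if ts1.day == ts2.day or (_is_last_day_of_month(ts1) and _is_last_day_of_month(ts2)):
        return float(month_diff)

    # Otherwise add the day/time remainder as a fraction of a 31-day month
    seconds_diff = ((ts1.day - ts2.day) * SECONDS_PER_DAY
                    + _seconds_in_day(ts1) - _seconds_in_day(ts2))
    diff = month_diff + seconds_diff / SECONDS_PER_MONTH
    return round(diff, 8) if round_off else diff


one_day_over = months_between_sketch(datetime(2021, 4, 12), datetime(2019, 1, 11))
both_month_ends = months_between_sketch(datetime(2021, 2, 28), datetime(2020, 11, 30))
reversed_order = months_between_sketch(datetime(2019, 1, 11), datetime(2021, 4, 12))
print(one_day_over, both_month_ends, reversed_order)
```

In the first call the dates are 27 whole months plus 1 day apart, so the fraction is 1/31, which is about 0.03225806.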

##### **Syntax**

      months_between(date1, date2, roundOff=True)
      
         Returns => number of months between two dates.

| Parameter Name      | Required | Description                                                |
|---------------------|----------|------------------------------------------------------------|
| date1 (str, Column) | Yes      | The ending date (or timestamp).                            |
| date2 (str, Column) | Yes      | The starting date (or timestamp).                          |
| roundOff (bool)     | Optional | Whether to round the result to 8 digits (default: `True`). |

In [0]:
data = [("2019-01-11", "2021-04-12", "2019-09-17 12:02:21", "2021-07-12 18:29:29"),
        ("2019-08-04", "2021-04-15", "2018-11-11 14:17:05", "2021-08-03 16:21:40"),
        ("2019-03-24", "2021-02-08", "2019-02-07 04:26:49", "2020-11-28 05:20:33"),
        ("2019-04-13", "2021-06-05", "2019-07-08 20:04:09", "2021-05-18 08:21:12"),
        ("2019-02-22", "2021-10-01", "2018-11-28 05:46:54", "2021-06-17 21:39:42")
       ]
 
columns = ["from_date", "to_date", "from_datetime", "to_datetime"]

df_samp01 = spark.createDataFrame(data, schema=columns)
display(df_samp01)

df_samp01.printSchema()

from_date,to_date,from_datetime,to_datetime
2019-01-11,2021-04-12,2019-09-17 12:02:21,2021-07-12 18:29:29
2019-08-04,2021-04-15,2018-11-11 14:17:05,2021-08-03 16:21:40
2019-03-24,2021-02-08,2019-02-07 04:26:49,2020-11-28 05:20:33
2019-04-13,2021-06-05,2019-07-08 20:04:09,2021-05-18 08:21:12
2019-02-22,2021-10-01,2018-11-28 05:46:54,2021-06-17 21:39:42


root
 |-- from_date: string (nullable = true)
 |-- to_date: string (nullable = true)
 |-- from_datetime: string (nullable = true)
 |-- to_datetime: string (nullable = true)



In [0]:
df_samp01 = df_samp01.withColumn("from_date", to_date("from_date")) \
                     .withColumn("to_date", to_date("to_date")) \
                     .withColumn("from_datetime", to_timestamp("from_datetime", "yyyy-MM-dd HH:mm:ss")) \
                     .withColumn("to_datetime", to_timestamp("to_datetime", "yyyy-MM-dd HH:mm:ss"))

display(df_samp01)

df_samp01.printSchema()

from_date,to_date,from_datetime,to_datetime
2019-01-11,2021-04-12,2019-09-17T12:02:21.000Z,2021-07-12T18:29:29.000Z
2019-08-04,2021-04-15,2018-11-11T14:17:05.000Z,2021-08-03T16:21:40.000Z
2019-03-24,2021-02-08,2019-02-07T04:26:49.000Z,2020-11-28T05:20:33.000Z
2019-04-13,2021-06-05,2019-07-08T20:04:09.000Z,2021-05-18T08:21:12.000Z
2019-02-22,2021-10-01,2018-11-28T05:46:54.000Z,2021-06-17T21:39:42.000Z


root
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)
 |-- from_datetime: timestamp (nullable = true)
 |-- to_datetime: timestamp (nullable = true)



**How to find the difference in months between two dates?**

In [0]:
df1 = df_samp01.select("from_date",
                       months_between("to_date", "from_date").alias("months_between"),
                       floor(months_between("to_date", "from_date")).alias("months_between_floor"),
                       "to_date")
                       
display(df1)

from_date,months_between,months_between_floor,to_date
2019-01-11,27.03225806,27,2021-04-12
2019-08-04,20.35483871,20,2021-04-15
2019-03-24,22.48387097,22,2021-02-08
2019-04-13,25.74193548,25,2021-06-05
2019-02-22,31.32258065,31,2021-10-01


In [0]:
df2 = df_samp01.withColumn("months_between", floor(months_between("to_datetime", "from_datetime"))) \
               .select("to_datetime", "months_between", "from_datetime")
               
display(df2)

to_datetime,months_between,from_datetime
2021-07-12T18:29:29.000Z,21,2019-09-17T12:02:21.000Z
2021-08-03T16:21:40.000Z,32,2018-11-11T14:17:05.000Z
2020-11-28T05:20:33.000Z,21,2019-02-07T04:26:49.000Z
2021-05-18T08:21:12.000Z,22,2019-07-08T20:04:09.000Z
2021-06-17T21:39:42.000Z,30,2018-11-28T05:46:54.000Z


In [0]:
df22 = df_samp01.withColumn("months_between", months_between("to_datetime", "to_date")) \
                .withColumn("months_between_False", months_between("to_datetime", "to_date", False)) \
                .withColumn("months_between_True", months_between("to_datetime", "to_date", True)) \
                .withColumn("months_between_floor", floor(months_between("to_datetime", "to_date"))) \
                .select("to_datetime", "months_between", "months_between_False", "months_between_True", "months_between_floor", "to_date")
               
display(df22)

to_datetime,months_between,months_between_False,months_between_True,months_between_floor,to_date
2021-07-12T18:29:29.000Z,3.0,3.0,3.0,3,2021-04-12
2021-08-03T16:21:40.000Z,3.63489397,3.634893966547192,3.63489397,3,2021-04-15
2020-11-28T05:20:33.000Z,-2.34765793,-2.347657930107527,-2.34765793,-3,2021-02-08
2021-05-18T08:21:12.000Z,-0.56941756,-0.5694175627240143,-0.56941756,-1,2021-06-05
2021-06-17T21:39:42.000Z,-3.45475582,-3.45475582437276,-3.45475582,-4,2021-10-01
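
Note that `floor` rounds toward negative infinity, which is why the negative differences above become -3, -1 and -4 rather than -2, 0 and -3. A minimal plain-Python sketch of the distinction follows; `math.floor` and `math.trunc` are used here as stand-ins for the Spark behaviour, which is an assumption made for illustration.

```python
import math

value = -2.34765793  # months_between result from the third row above

floored = math.floor(value)    # rounds toward negative infinity
truncated = math.trunc(value)  # drops the fraction, i.e. rounds toward zero

print(floored, truncated)  # -3 -2
```

If the goal is "whole months elapsed" regardless of direction, truncating (or flooring the absolute value) may be a better fit than `floor`.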


In [0]:
# Create DataFrame
data = [("1", "2019-07-01"),
        ("2", "2019-06-24"),
        ("3", "2019-08-24"),
        ("4", "2019-09-24"),
        ("5", "2019-10-24")
        ]

df_samp02 = spark.createDataFrame(data=data, schema=["id", "date"])
df_samp02.printSchema()

df_samp02 = df_samp02.withColumn("date", to_date("date"))
display(df_samp02)
df_samp02.printSchema()

root
 |-- id: string (nullable = true)
 |-- date: string (nullable = true)



id,date
1,2019-07-01
2,2019-06-24
3,2019-08-24
4,2019-09-24
5,2019-10-24


root
 |-- id: string (nullable = true)
 |-- date: date (nullable = true)



In [0]:
# Calculate the difference between two dates in months
df3 = (df_samp02.withColumn("diff_months_def", months_between(current_date(), col("date")))
                .withColumn("diff_months_True", months_between(current_date(), col("date"), True))     # round (to 8 digits)
                .withColumn("diff_months_False", months_between(current_date(), col("date"), False))
                .withColumn("diff_months_round", F.round(months_between(current_date(), col("date")), 1))
                .withColumn("diff_months_floor", floor(months_between(current_date(), col("date")))))
  
display(df3)

id,date,diff_months_def,diff_months_True,diff_months_False,diff_months_round,diff_months_floor
1,2019-07-01,78.93548387,78.93548387,78.93548387096774,78.9,78
2,2019-06-24,79.19354839,79.19354839,79.19354838709677,79.2,79
3,2019-08-24,77.19354839,77.19354839,77.19354838709677,77.2,77
4,2019-09-24,76.19354839,76.19354839,76.19354838709677,76.2,76
5,2019-10-24,75.19354839,75.19354839,75.19354838709677,75.2,75


**Spark SQL**

In [0]:
df_samp01.createOrReplaceTempView("days")

In [0]:
spark.sql("""
SELECT
    from_date,
    floor(months_between(to_date, from_date)) AS months_between,
    to_date
FROM days
""").display()

from_date,months_between,to_date
2019-01-11,27,2021-04-12
2019-08-04,20,2021-04-15
2019-03-24,22,2021-02-08
2019-04-13,25,2021-06-05
2019-02-22,31,2021-10-01


In [0]:
%sql
SELECT months_between('1997-02-28 10:30:00', '1996-10-30') AS Months_True;

Months_True
3.94959677


In [0]:
%sql
SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false) AS Months_False;

Months_False
3.949596774193549


##### b) Get Differences Between Dates in Years

In [0]:
# Calculate the difference between two dates in years
df4 = df_samp02.withColumn("diff_years", F.round(months_between(current_date(), col("date"))/12, 1)) \
               .withColumn("diff_years_roundoff", F.round(months_between(current_date(), col("date")) / lit(12), 1))
display(df4)

id,date,diff_years,diff_years_roundoff
1,2019-07-01,6.6,6.6
2,2019-06-24,6.6,6.6
3,2019-08-24,6.4,6.4
4,2019-09-24,6.3,6.3
5,2019-10-24,6.3,6.3


##### c) Date Difference in Hours / Minutes

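     # when the DataFrame already has a "current_date" column
     df.withColumn("ux_current_date", unix_timestamp(col("current_date")))
                                     (or)
     # when it does not, call the function directly
     df.withColumn("ux_current_date", unix_timestamp(current_date()))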

In [0]:
# df_samp02 has no "current_date" column, so call current_date() directly
df5 = df_samp02.withColumn("ux_current_date", unix_timestamp(current_date())) \
               .withColumn("ux_date", unix_timestamp(col("date"))) \
               .withColumn("seconds_between", unix_timestamp(current_date()) - unix_timestamp(col("date"))) \
               .withColumn("minutes_between", col("seconds_between")/60) \
               .withColumn("hours_between", col("minutes_between")/60)

display(df5)

id,date,ux_current_date,ux_date,seconds_between,minutes_between,hours_between
1,2019-07-01,1769731200,1561939200,207792000,3463200.0,57720.0
2,2019-06-24,1769731200,1561334400,208396800,3473280.0,57888.0
3,2019-08-24,1769731200,1566604800,203126400,3385440.0,56424.0
4,2019-09-24,1769731200,1569283200,200448000,3340800.0,55680.0
5,2019-10-24,1769731200,1571875200,197856000,3297600.0,54960.0


##### d) Date Difference in Seconds

**i) Using unix_timestamp()**

##### **Syntax:**
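     unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

**Note:** `unix_timestamp()` converts a timestamp into the **number of seconds since 1970-01-01 00:00:00 UTC** (the Unix epoch). The `format` argument is only needed when the input is a string that does not follow the default pattern.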
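
As a quick sanity check of the epoch arithmetic, here is a plain-Python sketch (not Spark code). It assumes midnight in UTC for each date, which matches a Spark session whose time zone is UTC; the helper `epoch_seconds` is hypothetical. The two dates are taken from the sample data and output in this notebook.

```python
from datetime import datetime, timezone


def epoch_seconds(year: int, month: int, day: int) -> int:
    """Hypothetical helper: seconds from the Unix epoch to midnight UTC of the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


ux_date = epoch_seconds(2019, 7, 1)
ux_later = epoch_seconds(2026, 1, 30)

seconds_between = ux_later - ux_date
minutes_between = seconds_between / 60
hours_between = minutes_between / 60

print(ux_date, seconds_between, minutes_between, hours_between)
# 1561939200 207792000 3463200.0 57720.0
```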

In [0]:
# df_samp02 has no "current_date" column, so call current_date() directly
df6 = df_samp02.withColumn("ux_current_date", unix_timestamp(current_date())) \
               .withColumn("ux_date", unix_timestamp(col("date"))) \
               .withColumn("seconds_between", unix_timestamp(current_date()) - unix_timestamp(col("date")))

display(df6)

id,date,ux_current_date,ux_date,seconds_between
1,2019-07-01,1739145600,1561939200,177206400
2,2019-06-24,1739145600,1561334400,177811200
3,2019-08-24,1739145600,1566604800,172540800
4,2019-09-24,1739145600,1569283200,169862400
5,2019-10-24,1739145600,1571875200,167270400


**ii) Using to_timestamp()**

##### Syntax:
     to_timestamp(col, format=None)

In [0]:
# unix_timestamp() function to convert timestamps to seconds
# to_timestamp(col("current_date")) => Ensures the column is treated as a timestamp
# unix_timestamp() => Converts the timestamp to seconds
# time difference in "seconds" between "current_date" and "date"

df7 = df_samp02.withColumn("date_ts", to_timestamp(col("date"))) \
               .withColumn("current_date", current_date()) \
               .withColumn("current_date_ts", to_timestamp(current_date())) \
               .withColumn("seconds_between_to_timestamp",
                           unix_timestamp(to_timestamp(col("current_date"))) - unix_timestamp(to_timestamp(col("date"))))

display(df7)

id,date,date_ts,current_date,current_date_ts,seconds_between_to_timestamp
1,2019-07-01,2019-07-01T00:00:00Z,2025-02-11,2025-02-11T00:00:00Z,177292800
2,2019-06-24,2019-06-24T00:00:00Z,2025-02-11,2025-02-11T00:00:00Z,177897600
3,2019-08-24,2019-08-24T00:00:00Z,2025-02-11,2025-02-11T00:00:00Z,172627200
4,2019-09-24,2019-09-24T00:00:00Z,2025-02-11,2025-02-11T00:00:00Z,169948800
5,2019-10-24,2019-10-24T00:00:00Z,2025-02-11,2025-02-11T00:00:00Z,167356800


##### e) Calculating Differences when Dates are in Custom Format

- This case applies when date strings are `not in the default yyyy-MM-dd pattern`.
- Such strings cannot be cast to DateType implicitly, so date functions typically return `null` for them (or raise an error when ANSI mode is enabled).
- Hence, first `convert` the input to DateType using the `to_date()` function with a matching format pattern, and then calculate the `differences`.
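
The "parse with an explicit pattern first, then subtract" idea can be sketched in plain Python. This is an analogue, not Spark code: the `dd-MM-yyyy` input pattern and the helper `parse_custom` are assumptions chosen for illustration. In PySpark the corresponding step would be `to_date(col, "dd-MM-yyyy")`.

```python
from datetime import date, datetime

CUSTOM_FORMAT = "%d-%m-%Y"  # assumed input pattern, e.g. "25-10-2020"


def parse_custom(text: str) -> date:
    """Hypothetical helper: parse a dd-MM-yyyy string into a date."""
    return datetime.strptime(text, CUSTOM_FORMAT).date()


start = parse_custom("25-10-2020")
end = parse_custom("15-01-2023")

# Only after parsing can the difference be computed
diff_days = (end - start).days
print(diff_days)  # 812
```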

In [0]:
# Calculate Difference Between Dates in Months
df8 = df_samp01.withColumn('diff_months', F.round(F.months_between(F.to_date('to_date'), F.to_date('from_date')),2))
display(df8)

from_date,to_date,from_datetime,to_datetime,diff_months
2019-01-11,2021-04-12,2019-09-17 12:02:21,2021-07-12 18:29:29,27.03
2019-08-04,2021-04-15,2018-11-11 14:17:05,2021-08-03 16:21:40,20.35
2019-03-24,2021-02-08,2019-02-07 04:26:49,2020-11-28 05:20:33,22.48
2019-04-13,2021-06-05,2019-07-08 20:04:09,2021-05-18 08:21:12,25.74
2019-02-22,2021-10-01,2018-11-28 05:46:54,2021-06-17 21:39:42,31.32


In [0]:
# Calculate Difference Between Dates in Years
df9 = df_samp01.withColumn('diff_years', F.round(F.months_between(F.to_date('to_date'), F.to_date('from_date'))/12,2))
display(df9)

from_date,to_date,from_datetime,to_datetime,diff_years
2019-01-11,2021-04-12,2019-09-17 12:02:21,2021-07-12 18:29:29,2.25
2019-08-04,2021-04-15,2018-11-11 14:17:05,2021-08-03 16:21:40,1.7
2019-03-24,2021-02-08,2019-02-07 04:26:49,2020-11-28 05:20:33,1.87
2019-04-13,2021-06-05,2019-07-08 20:04:09,2021-05-18 08:21:12,2.15
2019-02-22,2021-10-01,2018-11-28 05:46:54,2021-06-17 21:39:42,2.61
