## Importing Libraries

In [0]:
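# Note: the wildcard import shadows Python built-ins such as sum, max, and min
# with their Spark column-function counterparts; the cells below rely on that.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window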

**3268. Find Overlapping Shifts II (Hard)**

**Table: EmployeeShifts**

| Column Name      | Type     |
|------------------|----------|
| employee_id      | int      |
| start_time       | datetime |
| end_time         | datetime |

(employee_id, start_time) is the unique key for this table.
This table contains the shifts worked by employees, including each shift's start time and end time.

**Write a solution to analyze overlapping shifts for each employee. Two shifts are considered overlapping if they occur on the same date and one shift's end_time is later than another shift's start_time.**

For each employee, calculate the following:
- The maximum number of shifts that overlap at any given time.
- The total duration of all overlaps in minutes.

Return the result table ordered by employee_id in ascending order.

The query result format is in the following example.

**Example:**

**Input:**

**EmployeeShifts table:**

| employee_id | start_time          | end_time            |
|-------------|---------------------|---------------------|
| 1           | 2023-10-01 09:00:00 | 2023-10-01 17:00:00 |
| 1           | 2023-10-01 15:00:00 | 2023-10-01 23:00:00 |
| 1           | 2023-10-01 16:00:00 | 2023-10-02 00:00:00 |
| 2           | 2023-10-01 09:00:00 | 2023-10-01 17:00:00 |
| 2           | 2023-10-01 11:00:00 | 2023-10-01 19:00:00 |
| 3           | 2023-10-01 09:00:00 | 2023-10-01 17:00:00 |

**Output:**

| employee_id | max_overlapping_shifts    | total_overlap_duration |
|-------------|---------------------------|------------------------|
| 1           | 3                         | 600                    |
| 2           | 2                         | 360                    |
| 3           | 1                         | 0                      |


**Explanation:**
- Employee 1 has 3 shifts:
  - 2023-10-01 09:00:00 to 2023-10-01 17:00:00
  - 2023-10-01 15:00:00 to 2023-10-01 23:00:00
  - 2023-10-01 16:00:00 to 2023-10-02 00:00:00

The maximum number of overlapping shifts is 3 (from 16:00 to 17:00). The total overlap duration is:
  - 2 hours (15:00-17:00) between the 1st and 2nd shifts
  - 1 hour (16:00-17:00) between the 1st and 3rd shifts
  - 7 hours (16:00-23:00) between the 2nd and 3rd shifts

Total: 10 hours = 600 minutes

- Employee 2 has 2 shifts:
  - 2023-10-01 09:00:00 to 2023-10-01 17:00:00
  - 2023-10-01 11:00:00 to 2023-10-01 19:00:00

The maximum number of overlapping shifts is 2. The total overlap duration is 6 hours (11:00-17:00) = 360 minutes.

- Employee 3 has only 1 shift, so there are no overlaps.

The output table contains the employee_id, the maximum number of simultaneous overlaps, and the total overlap duration in minutes for each employee, ordered by employee_id in ascending order.
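As a sanity check on the expected output, the pairwise definition can be evaluated directly in plain Python, without Spark. This is a minimal sketch and not part of the solution that follows: `brute_force` and `SAMPLE` are hypothetical names, and the sketch assumes whole-interval overlap, ignoring the same-date condition (which makes no difference for the sample data).

```python
from datetime import datetime
from itertools import combinations

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the problem statement (hypothetical variable name).
SAMPLE = [
    (1, "2023-10-01 09:00:00", "2023-10-01 17:00:00"),
    (1, "2023-10-01 15:00:00", "2023-10-01 23:00:00"),
    (1, "2023-10-01 16:00:00", "2023-10-02 00:00:00"),
    (2, "2023-10-01 09:00:00", "2023-10-01 17:00:00"),
    (2, "2023-10-01 11:00:00", "2023-10-01 19:00:00"),
    (3, "2023-10-01 09:00:00", "2023-10-01 17:00:00"),
]


def brute_force(shifts):
    """Return {employee_id: (max_overlapping_shifts, total_overlap_minutes)}."""
    by_emp = {}
    for emp, start, end in shifts:
        span = (datetime.strptime(start, FMT), datetime.strptime(end, FMT))
        by_emp.setdefault(emp, []).append(span)

    result = {}
    for emp, spans in sorted(by_emp.items()):
        # Sum the overlap of every pair of shifts, in minutes.
        total = 0
        for (s1, e1), (s2, e2) in combinations(spans, 2):
            overlap_sec = (min(e1, e2) - max(s1, s2)).total_seconds()
            if overlap_sec > 0:
                total += int(overlap_sec // 60)
        # Concurrency can only reach a new peak at the start of some shift,
        # so it is enough to count the active shifts at each start time.
        peak = max(sum(1 for s, e in spans if s <= t < e) for t, _ in spans)
        result[emp] = (peak, total)
    return result


print(brute_force(SAMPLE))
```

The brute-force version is quadratic in the number of shifts per employee, which is fine for checking small inputs but is the reason a sweep over sorted events is attractive for larger data.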

In [0]:
shifts_data_3268 = [
    (1, "2023-10-01 09:00:00", "2023-10-01 17:00:00"),
    (1, "2023-10-01 15:00:00", "2023-10-01 23:00:00"),
    (1, "2023-10-01 16:00:00", "2023-10-02 00:00:00"),
    (2, "2023-10-01 09:00:00", "2023-10-01 17:00:00"),
    (2, "2023-10-01 11:00:00", "2023-10-01 19:00:00"),
    (3, "2023-10-01 09:00:00", "2023-10-01 17:00:00")
]

shifts_columns_3268 = ["employee_id", "start_time", "end_time"]
shifts_df_3268 = spark.createDataFrame(shifts_data_3268, shifts_columns_3268)
shifts_df_3268.show()

+-----------+-------------------+-------------------+
|employee_id|         start_time|           end_time|
+-----------+-------------------+-------------------+
|          1|2023-10-01 09:00:00|2023-10-01 17:00:00|
|          1|2023-10-01 15:00:00|2023-10-01 23:00:00|
|          1|2023-10-01 16:00:00|2023-10-02 00:00:00|
|          2|2023-10-01 09:00:00|2023-10-01 17:00:00|
|          2|2023-10-01 11:00:00|2023-10-01 19:00:00|
|          3|2023-10-01 09:00:00|2023-10-01 17:00:00|
+-----------+-------------------+-------------------+



In [0]:
# Cast the string columns to timestamps so they can be compared and subtracted.
shifts_df_3268 = shifts_df_3268\
    .withColumn("start_time", col("start_time").cast("timestamp"))\
    .withColumn("end_time", col("end_time").cast("timestamp"))

In [0]:
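# Emit one row per calendar date the shift touches, so a shift that crosses
# midnight can be split into per-day segments in the next step.
shifts_df_3268 = shifts_df_3268\
    .withColumn("start_date", to_date("start_time"))\
    .withColumn("end_date", to_date("end_time"))\
    .withColumn("date", explode(sequence(col("start_date"), col("end_date"))))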

In [0]:
# Midnight boundaries of each row's `date`, used to clip a shift to that day.
start_of_day = to_timestamp(concat(col("date").cast("string"), lit(" 00:00:00")))
start_of_next_day = to_timestamp(concat(date_add(col("date"), 1).cast("string"), lit(" 00:00:00")))

In [0]:
# Clip each shift to its day; the filter drops empty pieces, such as the
# zero-length segment produced when a shift ends exactly at midnight.
segments_3268 = shifts_df_3268\
    .withColumn("seg_start", greatest(col("start_time"), start_of_day))\
    .withColumn("seg_end", least(col("end_time"), start_of_next_day))\
    .where(col("seg_start") < col("seg_end"))\
    .select(
        col("employee_id"),
        col("date"),
        col("seg_start").alias("start_ts"),
        col("seg_end").alias("end_ts")
    )
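
The per-day clipping in the cell above can be illustrated on a single shift in plain Python. This is a sketch under assumptions: `split_by_day` is a hypothetical helper that mirrors the `sequence` / `greatest` / `least` / filter logic, not code from the solution itself.

```python
from datetime import datetime, timedelta


def split_by_day(start, end):
    """Split [start, end) into (date, seg_start, seg_end) pieces, one per day."""
    pieces = []
    day_start = datetime(start.year, start.month, start.day)
    # Visit every date from start's date to end's date, inclusive,
    # like sequence(start_date, end_date) in the Spark version.
    while day_start <= end:
        next_day = day_start + timedelta(days=1)
        seg_start = max(start, day_start)   # greatest(start_time, start_of_day)
        seg_end = min(end, next_day)        # least(end_time, start_of_next_day)
        if seg_start < seg_end:             # drop zero-length pieces
            pieces.append((day_start.date(), seg_start, seg_end))
        day_start = next_day
    return pieces


# A shift ending exactly at midnight stays a single piece.
one = split_by_day(datetime(2023, 10, 1, 16), datetime(2023, 10, 2, 0))
# A shift crossing midnight is split into two pieces.
two = split_by_day(datetime(2023, 10, 1, 22), datetime(2023, 10, 2, 6))
print(len(one), len(two))
```

Splitting first means the later window can partition by `(employee_id, date)`, so overlaps are only ever measured between segments that fall on the same date.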

In [0]:
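# Turn every segment into two sweep-line events: +1 at its start, -1 at its end.
# event_order sorts ends (0) before starts (1) at the same instant, so
# back-to-back shifts are not counted as overlapping.
start_events_3268 = segments_3268\
    .select("employee_id", "date", col("start_ts").alias("event_time"))\
    .withColumn("change", lit(1))\
    .withColumn("event_order", lit(1))
end_events_3268 = segments_3268\
    .select("employee_id", "date", col("end_ts").alias("event_time"))\
    .withColumn("change", lit(-1))\
    .withColumn("event_order", lit(0))

events_3268 = start_events_3268.unionByName(end_events_3268)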

In [0]:
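# Order events chronologically within each employee and date; the cumulative
# frame covers every row from the start of the partition up to the current one.
win_order = Window.partitionBy("employee_id", "date").orderBy(col("event_time"), col("event_order"))
win_cumulative = win_order.rowsBetween(Window.unboundedPreceding, Window.currentRow)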

In [0]:
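# The running sum of +1/-1 gives the number of active shifts after each event;
# duration_sec is how long that count holds (null on the last event).
events_3268 = events_3268\
    .withColumn("active_shifts", sum("change").over(win_cumulative))\
    .withColumn("next_event_time", lead("event_time").over(win_order))\
    .withColumn("duration_sec", unix_timestamp(col("next_event_time")) - unix_timestamp(col("event_time")))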
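
The running-sum step is a sweep over sorted events, which can be traced by hand in plain Python. The following is a minimal sketch, assuming times are given as plain numbers (hours) for readability; `sweep` is a hypothetical name and stands in for the two window expressions above.

```python
def sweep(segments):
    """Return (event_time, active_shifts, duration_to_next) for each event."""
    events = []
    for start, end in segments:
        events.append((start, 1, +1))  # (time, event_order, change)
        events.append((end, 0, -1))    # order 0 sorts ends before starts on ties
    events.sort()

    rows = []
    active = 0
    for i, (time, _, change) in enumerate(events):
        active += change  # cumulative sum, like sum("change").over(win_cumulative)
        # lead("event_time"): the next event's time, or None on the last row
        next_time = events[i + 1][0] if i + 1 < len(events) else None
        duration = None if next_time is None else next_time - time
        rows.append((time, active, duration))
    return rows


# Employee 1 from the example, in hours of 2023-10-01.
for row in sweep([(9, 17), (15, 23), (16, 24)]):
    print(row)
```

For this input the active count rises to 3 between hours 16 and 17, matching the explanation in the problem statement. Because ends sort before starts at equal times, two shifts that merely touch (one ends at 12, the next starts at 12) never produce a count of 2.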

In [0]:
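# With k shifts active at once there are k * (k - 1) / 2 overlapping pairs,
# so each interval contributes its duration once per pair.
comb = ((col("active_shifts") * (col("active_shifts") - 1)) / 2).cast("long")

events_3268 = events_3268\
    .withColumn("pairwise_overlap_sec", (col("duration_sec") * comb).cast("long"))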
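
The pair-count weighting can be checked by hand against the example. This sketch assumes the per-interval rows for employee 1 (active count and duration in minutes) have already been derived from the sweep; `pairwise_overlap` and `employee_1` are hypothetical names.

```python
def pairwise_overlap(rows):
    """rows: iterable of (active_shifts, duration) for consecutive intervals."""
    total = 0
    for active, duration in rows:
        pairs = active * (active - 1) // 2  # number of shift pairs active together
        total += pairs * duration
    return total


# Employee 1: 09-15 (1 active), 15-16 (2), 16-17 (3), 17-23 (2), 23-24 (1).
employee_1 = [(1, 360), (2, 60), (3, 60), (2, 360), (1, 60)]
print(pairwise_overlap(employee_1))  # 600
```

The hour from 16:00 to 17:00 is counted three times because three pairs of shifts overlap there, which is how the 2 + 1 + 7 hour breakdown in the explanation adds up to 600 minutes.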

In [0]:
# Aggregate per employee across all dates: peak concurrency and total
# pairwise overlap in seconds.
max_overlap = events_3268\
    .groupBy("employee_id")\
    .agg(max("active_shifts").alias("max_overlapping_shifts"))
total_pairwise = events_3268\
    .groupBy("employee_id")\
    .agg(sum("pairwise_overlap_sec").alias("total_pairwise_overlap_sec"))

In [0]:
# Combine both aggregates and convert seconds to minutes.
# Note: display() is available in Databricks notebooks; show() is the portable equivalent.
max_overlap\
    .join(total_pairwise, "employee_id", "left")\
    .withColumn("total_pairwise_overlap_sec", coalesce(col("total_pairwise_overlap_sec"), lit(0)))\
    .withColumn("total_overlap_duration", (col("total_pairwise_overlap_sec") / 60).cast("long"))\
    .select("employee_id", "max_overlapping_shifts", "total_overlap_duration")\
    .orderBy("employee_id")\
    .display()

| employee_id | max_overlapping_shifts | total_overlap_duration |
|-------------|------------------------|------------------------|
| 1           | 3                      | 600                    |
| 2           | 2                      | 360                    |
| 3           | 1                      | 0                      |
