## Importing Libraries

In [0]:
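# Note: this wildcard import shadows Python's built-in sum, min, and max;
# the aggregation cells further down rely on the Spark versions.
from pyspark.sql.functions import *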
from pyspark.sql.types import *
from pyspark.sql.window import Window
from datetime import datetime, timedelta

**1225. Report Contiguous Dates (Hard)**

**Table: Failed**

| Column Name  | Type    |
|--------------|---------|
| fail_date    | date    |

fail_date is the primary key (column with unique values) for this table.
This table contains the days of failed tasks.
 

**Table: Succeeded**

| Column Name  | Type    |
|--------------|---------|
| success_date | date    |

success_date is the primary key (column with unique values) for this table.
This table contains the days of succeeded tasks.
 
A system is running one task every day. Every task is independent of the previous tasks. The tasks can fail or succeed.

**Write a solution to report the period_state for each continuous interval of days in the period from 2019-01-01 to 2019-12-31.**

period_state is 'failed' if the tasks in the interval failed, or 'succeeded' if they succeeded. Each interval of days is reported as start_date and end_date.

Return the result table ordered by start_date.

The result format is in the following example.

**Example 1:**

**Input:** 

**Failed table:**

| fail_date         |
|-------------------|
| 2018-12-28        |
| 2018-12-29        |
| 2019-01-04        |
| 2019-01-05        |

**Succeeded table:**

| success_date      |
|-------------------|
| 2018-12-30        |
| 2018-12-31        |
| 2019-01-01        |
| 2019-01-02        |
| 2019-01-03        |
| 2019-01-06        |

**Output:**

| period_state | start_date   | end_date     |
|--------------|--------------|--------------|
| succeeded    | 2019-01-01   | 2019-01-03   |
| failed       | 2019-01-04   | 2019-01-05   |
| succeeded    | 2019-01-06   | 2019-01-06   |

**Explanation:** 
- The report ignored the system state in 2018 as we care about the system in the period 2019-01-01 to 2019-12-31.
- From 2019-01-01 to 2019-01-03 all tasks succeeded and the system state was "succeeded".
- From 2019-01-04 to 2019-01-05 all tasks failed and the system state was "failed".
- From 2019-01-06 to 2019-01-06 all tasks succeeded and the system state was "succeeded".
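
The expected result above can be cross-checked without Spark. The following is a minimal pure-Python sketch, not the notebook's solution: the helper name `contiguous_periods` and its parameters are hypothetical, and it simply labels each recorded day inside the reporting window, then splits the days into runs of equal state and consecutive dates.

```python
from datetime import date, timedelta
from itertools import groupby


def contiguous_periods(failed, succeeded,
                       start=date(2019, 1, 1), end=date(2019, 12, 31)):
    """Return (period_state, start_date, end_date) tuples ordered by start_date.

    Hypothetical helper for illustration only.
    """
    # Label every recorded day inside the reporting window; other days are ignored.
    labelled = sorted(
        [(d, "failed") for d in failed if start <= d <= end]
        + [(d, "succeeded") for d in succeeded if start <= d <= end]
    )
    periods = []
    # groupby yields consecutive runs of rows that share the same state.
    for state, run in groupby(labelled, key=lambda item: item[1]):
        days = [d for d, _ in run]
        run_start = prev = days[0]
        for d in days[1:]:
            # Split the run if two recorded days are not adjacent on the calendar.
            if d - prev != timedelta(days=1):
                periods.append((state, run_start, prev))
                run_start = d
            prev = d
        periods.append((state, run_start, prev))
    return periods


failed = [date(2018, 12, 28), date(2018, 12, 29), date(2019, 1, 4), date(2019, 1, 5)]
succeeded = [date(2018, 12, 30), date(2018, 12, 31), date(2019, 1, 1),
             date(2019, 1, 2), date(2019, 1, 3), date(2019, 1, 6)]

for row in contiguous_periods(failed, succeeded):
    print(row)
```

For the example input this yields three periods: succeeded from 2019-01-01 to 2019-01-03, failed from 2019-01-04 to 2019-01-05, and succeeded on 2019-01-06 alone.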

In [0]:
failed_data_1225 = [
    ("2018-12-28",), ("2018-12-29",), ("2019-01-04",), ("2019-01-05",)
]
failed_df_1225 = spark.createDataFrame(failed_data_1225, ["date"])
failed_df_1225.show()

succeeded_data_1225 = [
    ("2018-12-30",), ("2018-12-31",), ("2019-01-01",), 
    ("2019-01-02",), ("2019-01-03",), ("2019-01-06",)
]

succeeded_df_1225 = spark.createDataFrame(succeeded_data_1225, ["date"])
succeeded_df_1225.show()

+----------+
|      date|
+----------+
|2018-12-28|
|2018-12-29|
|2019-01-04|
|2019-01-05|
+----------+

+----------+
|      date|
+----------+
|2018-12-30|
|2018-12-31|
|2019-01-01|
|2019-01-02|
|2019-01-03|
|2019-01-06|
+----------+



In [0]:
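# Build one row per day of the reporting period (2019-01-01 to 2019-12-31).
# The values are datetime objects, so Spark infers dt as a timestamp column.
start = datetime(2019, 1, 1)
end = datetime(2019, 12, 31)
calendar = [(start + timedelta(days=i),) for i in range((end - start).days + 1)]
calendar_df_1225 = spark.createDataFrame(calendar, ["dt"])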
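
Because the calendar rows are built from `datetime` objects, Spark infers a timestamp column for `dt`. A minimal sketch, assuming a date-typed column is preferred, builds the same one-tuple rows from `datetime.date` instead. The names `start_day`, `end_day`, and `calendar_dates` are illustrative and not part of the notebook.

```python
from datetime import date, timedelta

# Same reporting period as above, expressed as plain dates (no time component).
start_day = date(2019, 1, 1)
end_day = date(2019, 12, 31)

# One single-element tuple per day, inclusive of both endpoints.
calendar_dates = [
    (start_day + timedelta(days=i),)
    for i in range((end_day - start_day).days + 1)
]

print(len(calendar_dates))  # 365, since 2019 is not a leap year
```

Passing `calendar_dates` to `spark.createDataFrame` in place of `calendar` should give a date column rather than a timestamp column.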

In [0]:
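# Use the column names from the problem statement so the two tables
# stay distinguishable after they are joined onto the calendar.
failed_df_1225 = failed_df_1225.withColumnRenamed("date", "fail_date")
succeeded_df_1225 = succeeded_df_1225.withColumnRenamed("date", "success_date")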

In [0]:
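# Left-join both tables onto the calendar so every day of 2019 gets a row.
# Caveat: a day that appears in neither table falls through to "succeeded" here.
annotated_df_1225 = calendar_df_1225 \
    .join(failed_df_1225, calendar_df_1225.dt == failed_df_1225.fail_date, "left") \
    .join(succeeded_df_1225, calendar_df_1225.dt == succeeded_df_1225.success_date, "left") \
    .withColumn("period_state", when(col("fail_date").isNotNull(), "failed").otherwise("succeeded"))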


In [0]:
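# A window without partitionBy moves all rows to a single partition;
# that is acceptable for a 365-row calendar.
window_spec = Window.orderBy("dt")

# State of the previous day (null on the first day).
annotated_df_1225 = annotated_df_1225.withColumn(
    "prev_state", lag("period_state").over(window_spec)
)

# Flag the first day of each run: the very first row, or any row whose
# state differs from the previous day's state.
annotated_df_1225 = annotated_df_1225.withColumn(
    "is_new_group",
    when(
        col("prev_state").isNull() | (col("period_state") != col("prev_state")),
        lit(1),
    ).otherwise(lit(0)),
)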



In [0]:
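# Running sum of the flags: every row in the same run ends up with the same group_id.
group_window = Window.orderBy("dt").rowsBetween(Window.unboundedPreceding, Window.currentRow)

annotated_df_1225 = annotated_df_1225.withColumn(
    "group_id", sum("is_new_group").over(group_window)
)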
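
The two steps above (flag each change of state, then take a running sum of the flags) can be traced on a plain list. This is a small sketch in pure Python, using the six states of the example's first six days; the variable names mirror the notebook's columns but the lists themselves are illustrative.

```python
from itertools import accumulate

# period_state for 2019-01-01 through 2019-01-06 in the example.
states = ["succeeded", "succeeded", "succeeded", "failed", "failed", "succeeded"]

# Equivalent of lag(): the previous row's state, with None for the first row.
prev_states = [None] + states[:-1]

# 1 where a new run starts (first row, or state differs from the previous row).
is_new_group = [
    1 if prev is None or cur != prev else 0
    for cur, prev in zip(states, prev_states)
]

# Running sum of the flags gives one id per run.
group_id = list(accumulate(is_new_group))

print(is_new_group)  # [1, 0, 0, 1, 0, 1]
print(group_id)      # [1, 1, 1, 2, 2, 3]
```

Rows that share a `group_id` belong to the same contiguous run, so the earliest and latest date within each id give that run's start and end.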



In [0]:
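# Keep only days that actually appear in one of the tables; the calendar join
# labelled every other day "succeeded" by default. This relies on the problem's
# guarantee that a task runs every day, so recorded days have no gaps between them.
# The cast turns the timestamp values into plain dates, as in the expected output.
annotated_df_1225 \
    .filter(col("fail_date").isNotNull() | col("success_date").isNotNull()) \
    .groupBy("group_id", "period_state") \
    .agg(
        min("dt").cast("date").alias("start_date"),
        max("dt").cast("date").alias("end_date")
    ) \
    .orderBy("start_date") \
    .select("period_state", "start_date", "end_date") \
    .show()

+------------+----------+----------+
|period_state|start_date|  end_date|
+------------+----------+----------+
|   succeeded|2019-01-01|2019-01-03|
|      failed|2019-01-04|2019-01-05|
|   succeeded|2019-01-06|2019-01-06|
+------------+----------+----------+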

