## Importing Libraries

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

**2474. Customers With Strictly Increasing Purchases (Hard)**

**Table: Orders**

| Column Name  | Type |
|--------------|------|
| order_id     | int  |
| customer_id  | int  |
| order_date   | date |
| price        | int  |

order_id is the column with unique values for this table.
Each row contains the id of an order, the id of the customer who placed it, the date of the order, and its price.
 
**Write a solution to report the IDs of the customers with the total purchases strictly increasing yearly.**

The total purchases of a customer in one year is the sum of the prices of their orders in that year. If for some year the customer did not make any order, we consider the total purchases 0.
The first year to consider for each customer is the year of their first order.
The last year to consider for each customer is the year of their last order.
Return the result table in any order.

The result format is in the following example.

**Example 1:**

**Input:** 

**Orders table:**

| order_id | customer_id | order_date | price |
|----------|-------------|------------|-------|
| 1        | 1           | 2019-07-01 | 1100  |
| 2        | 1           | 2019-11-01 | 1200  |
| 3        | 1           | 2020-05-26 | 3000  |
| 4        | 1           | 2021-08-31 | 3100  |
| 5        | 1           | 2022-12-07 | 4700  |
| 6        | 2           | 2015-01-01 | 700   |
| 7        | 2           | 2017-11-07 | 1000  |
| 8        | 3           | 2017-01-01 | 900   |
| 9        | 3           | 2018-11-07 | 900   |

**Output:** 
| customer_id |
|-------------|
| 1           |

**Explanation:** 
- Customer 1: The first year is 2019 and the last year is 2022
  - 2019: 1100 + 1200 = 2300
  - 2020: 3000
  - 2021: 3100
  - 2022: 4700

  We can see that the total purchases are strictly increasing yearly, so we include customer 1 in the answer.

- Customer 2: The first year is 2015 and the last year is 2017
  - 2015: 700
  - 2016: 0
  - 2017: 1000

  We do not include customer 2 in the answer because the total purchases are not strictly increasing. Note that customer 2 did not make any purchases in 2016.

- Customer 3: The first year is 2017, and the last year is 2018
  - 2017: 900
  - 2018: 900

  We do not include customer 3 in the answer because the total purchases are not strictly increasing.
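
The walkthrough above can be cross-checked with a small plain-Python sketch that does not depend on Spark. The function name `strictly_increasing_customers` and the tuple layout are assumptions made for illustration only. It builds each customer's yearly totals, fills missing years with 0, and keeps customers whose series strictly increases.

```python
from collections import defaultdict

def strictly_increasing_customers(orders):
    """Return sorted ids of customers whose yearly totals strictly increase.

    `orders` is an iterable of (order_id, customer_id, order_date, price)
    tuples, with order_date as an ISO 'YYYY-MM-DD' string (assumed layout).
    """
    totals = defaultdict(lambda: defaultdict(int))  # customer -> year -> total
    for _order_id, customer_id, order_date, price in orders:
        totals[customer_id][int(order_date[:4])] += price

    result = []
    for customer_id, by_year in totals.items():
        first, last = min(by_year), max(by_year)
        # Years without orders count as 0, as the problem statement requires.
        series = [by_year.get(y, 0) for y in range(first, last + 1)]
        if all(prev < cur for prev, cur in zip(series, series[1:])):
            result.append(customer_id)
    return sorted(result)

orders = [
    (1, 1, "2019-07-01", 1100), (2, 1, "2019-11-01", 1200),
    (3, 1, "2020-05-26", 3000), (4, 1, "2021-08-31", 3100),
    (5, 1, "2022-12-07", 4700), (6, 2, "2015-01-01", 700),
    (7, 2, "2017-11-07", 1000), (8, 3, "2017-01-01", 900),
    (9, 3, "2018-11-07", 900),
]
print(strictly_increasing_customers(orders))  # [1]
```

Note that a customer with orders in a single year has nothing to compare against, so this sketch includes them in the result.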

In [0]:
orders_data_2474 = [
    (1, 1, "2019-07-01", 1100),
    (2, 1, "2019-11-01", 1200),
    (3, 1, "2020-05-26", 3000),
    (4, 1, "2021-08-31", 3100),
    (5, 1, "2022-12-07", 4700),
    (6, 2, "2015-01-01", 700),
    (7, 2, "2017-11-07", 1000),
    (8, 3, "2017-01-01", 900),
    (9, 3, "2018-11-07", 900)
]

orders_columns_2474 = ["order_id", "customer_id", "order_date", "price"]
orders_df_2474 = spark.createDataFrame(orders_data_2474, orders_columns_2474)
orders_df_2474.show()


+--------+-----------+----------+-----+
|order_id|customer_id|order_date|price|
+--------+-----------+----------+-----+
|       1|          1|2019-07-01| 1100|
|       2|          1|2019-11-01| 1200|
|       3|          1|2020-05-26| 3000|
|       4|          1|2021-08-31| 3100|
|       5|          1|2022-12-07| 4700|
|       6|          2|2015-01-01|  700|
|       7|          2|2017-11-07| 1000|
|       8|          3|2017-01-01|  900|
|       9|          3|2018-11-07|  900|
+--------+-----------+----------+-----+



In [0]:
# Cast the string column to a real date, then extract the calendar year.
orders_df_2474 = orders_df_2474\
    .withColumn("order_date", col("order_date").cast("date"))\
    .withColumn("year", year("order_date"))

In [0]:
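# Total purchases per customer per year.
# `sum` here is pyspark.sql.functions.sum (the wildcard import shadows the built-in).
yearly_df_2474 = orders_df_2474\
    .groupBy("customer_id", "year")\
    .agg(sum("price").alias("total_price"))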

In [0]:
# First and last order year for each customer.
years_df_2474 = yearly_df_2474\
    .groupBy("customer_id")\
    .agg(
        min("year").alias("start_year"),
        max("year").alias("end_year")
    )

In [0]:
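# One row per customer for every year from start_year to end_year (inclusive),
# so that years without any order are represented too.
full_years_df_2474 = years_df_2474\
    .withColumn("year", explode(sequence(col("start_year"), col("end_year"))))\
    .select("customer_id", "year")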
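
As a rough illustration of what the `explode(sequence(...))` step produces, the plain-Python sketch below expands an inclusive year range into one row per year. The helper name `expand_years` is hypothetical and used only for this example.

```python
def expand_years(ranges):
    """Expand (customer_id, start_year, end_year) into one row per year.

    Both ends are inclusive, mirroring Spark's `sequence(start, stop)`.
    """
    return [
        (customer_id, year)
        for customer_id, start_year, end_year in ranges
        for year in range(start_year, end_year + 1)
    ]

# Customer 2 ordered in 2015 and 2017; 2016 appears even though it has no orders.
print(expand_years([(2, 2015, 2017)]))  # [(2, 2015), (2, 2016), (2, 2017)]
```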

In [0]:
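# Attach the yearly totals; years with no orders get no match, so fill them with 0.
# The final orderBy only makes the intermediate result easier to inspect.
full_df_2474 = full_years_df_2474\
    .join(yearly_df_2474, ["customer_id", "year"], "left")\
    .fillna(0, subset=["total_price"])\
    .orderBy("customer_id", "year")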

In [0]:
window_spec = Window.partitionBy("customer_id").orderBy("year")

In [0]:
# Previous year's total for the same customer (null for the first year).
full_df_2474 = full_df_2474\
    .withColumn("prev_total", lag("total_price").over(window_spec))

In [0]:
# Customers with at least one year whose total did not exceed the previous year's.
violations_df_2474 = full_df_2474\
    .filter(col("prev_total").isNotNull() & (col("total_price") <= col("prev_total")))\
    .select("customer_id").distinct()
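
The `lag` comparison used for the violation filter can be pictured with a short plain-Python sketch. The helper names `with_prev` and `has_violation` are assumptions for illustration; they pair each yearly total with the previous one and flag any year that fails to exceed it.

```python
def with_prev(totals):
    """Pair each yearly total with the previous one (None for the first year)."""
    return list(zip(totals, [None] + totals[:-1]))

def has_violation(totals):
    """True if some year's total is less than or equal to the previous year's."""
    return any(prev is not None and cur <= prev for cur, prev in with_prev(totals))

# Customer 2 from the example: 2015 -> 700, 2016 -> 0, 2017 -> 1000.
print(with_prev([700, 0, 1000]))             # [(700, None), (0, 700), (1000, 0)]
print(has_violation([700, 0, 1000]))         # True
print(has_violation([2300, 3000, 3100, 4700]))  # False
```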

In [0]:
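# Keep every customer that has no violation (anti-join against the violators).
# `.display()` is Databricks-specific; use `.show()` in plain PySpark.
full_df_2474\
    .select("customer_id").distinct()\
    .join(violations_df_2474, "customer_id", "left_anti").display()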

customer_id
1
