## Importing Libraries

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

**1384. Total Sales Amount by Year (Hard)**

**Table: Product**

| Column Name   | Type    |
|---------------|---------|
| product_id    | int     |
| product_name  | varchar |

product_id is the primary key (column with unique values) for this table.
product_name is the name of the product.
 
**Table: Sales**

| Column Name         | Type    |
|---------------------|---------|
| product_id          | int     |
| period_start        | date    |
| period_end          | date    |
| average_daily_sales | int     |

product_id is the primary key (column with unique values) for this table. 
period_start and period_end are the start and end dates of the sales period, and both dates are inclusive.
The average_daily_sales column holds the average daily sales amount of the items for the period.
The dates of the sales periods fall between 2018 and 2020.
 
**Write a solution to report the total sales amount of each item for each year, with corresponding product_name, product_id, report_year, and total_amount.**

Return the result table ordered by product_id and report_year.

The result format is in the following example.

**Example 1:**

**Input:** 

**Product table:**

| product_id | product_name |
|------------|--------------|
| 1          | LC Phone     |
| 2          | LC T-Shirt   |
| 3          | LC Keychain  |

**Sales table:**

| product_id | period_start | period_end  | average_daily_sales |
|------------|--------------|-------------|---------------------|
| 1          | 2019-01-25   | 2019-02-28  | 100                 |
| 2          | 2018-12-01   | 2020-01-01  | 10                  |
| 3          | 2019-12-01   | 2020-01-31  | 1                   |

**Output:**

| product_id | product_name | report_year | total_amount |
|------------|--------------|-------------|--------------|
| 1          | LC Phone     |    2019     | 3500         |
| 2          | LC T-Shirt   |    2018     | 310          |
| 2          | LC T-Shirt   |    2019     | 3650         |
| 2          | LC T-Shirt   |    2020     | 10           |
| 3          | LC Keychain  |    2019     | 31           |
| 3          | LC Keychain  |    2020     | 31           |

**Explanation:** 
- LC Phone was sold from 2019-01-25 to 2019-02-28, a period of 35 days. Total amount: 35*100 = 3500.
- LC T-Shirt was sold from 2018-12-01 to 2020-01-01, which covers 31, 365, and 1 days in 2018, 2019, and 2020 respectively. Total amounts: 31*10 = 310, 365*10 = 3650, and 1*10 = 10.
- LC Keychain was sold from 2019-12-01 to 2020-01-31, which covers 31 and 31 days in 2019 and 2020 respectively. Total amounts: 31*1 = 31 and 31*1 = 31.
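
The day counts in the explanation can be cross-checked without Spark. The sketch below is a brute-force check in plain Python, not the notebook's solution: it walks every day of each period and tallies it under its calendar year. The helper name `days_per_year` and the `sales` list are illustrative assumptions; the rows are copied from Example 1.

```python
from collections import Counter
from datetime import date, timedelta


def days_per_year(period_start, period_end):
    """Count the days of an inclusive period that fall in each calendar year."""
    counts = Counter()
    day = period_start
    while day <= period_end:
        counts[day.year] += 1
        day += timedelta(days=1)
    return dict(counts)


# Rows from Example 1: (product_id, period_start, period_end, average_daily_sales)
sales = [
    (1, date(2019, 1, 25), date(2019, 2, 28), 100),
    (2, date(2018, 12, 1), date(2020, 1, 1), 10),
    (3, date(2019, 12, 1), date(2020, 1, 31), 1),
]

# total_amount per (product_id, report_year) = days in that year * daily average
totals = {
    (product_id, year): days * daily
    for product_id, start, end, daily in sales
    for year, days in days_per_year(start, end).items()
}

for key in sorted(totals):
    print(key, totals[key])
```

Enumerating days is too slow for large inputs, but it makes a convenient reference when validating a faster solution.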

In [0]:
product_data_1384 = [
    (1, "LC Phone"),
    (2, "LC T-Shirt"),
    (3, "LC Keychain")
]
product_columns_1384 = ["product_id", "product_name"]
product_df_1384 = spark.createDataFrame(product_data_1384, product_columns_1384)
product_df_1384.show()

sales_data_1384 = [
    (1, "2019-01-25", "2019-02-28", 100),
    (2, "2018-12-01", "2020-01-01", 10),
    (3, "2019-12-01", "2020-01-31", 1)
]
sales_columns_1384 = ["product_id", "period_start", "period_end", "average_daily_sales"]
# Cast the date strings to DateType, then list every calendar year each period touches.
sales_df_1384 = spark.createDataFrame(sales_data_1384, sales_columns_1384)\
                    .withColumn("period_start", col("period_start").cast(DateType()))\
                    .withColumn("period_end", col("period_end").cast(DateType()))\
                    .withColumn("year_list", sequence(year(col("period_start")), year(col("period_end"))))

sales_df_1384.show()

+----------+------------+
|product_id|product_name|
+----------+------------+
|         1|    LC Phone|
|         2|  LC T-Shirt|
|         3| LC Keychain|
+----------+------------+

+----------+------------+----------+-------------------+------------------+
|product_id|period_start|period_end|average_daily_sales|         year_list|
+----------+------------+----------+-------------------+------------------+
|         1|  2019-01-25|2019-02-28|                100|            [2019]|
|         2|  2018-12-01|2020-01-01|                 10|[2018, 2019, 2020]|
|         3|  2019-12-01|2020-01-31|                  1|      [2019, 2020]|
+----------+------------+----------+-------------------+------------------+



In [0]:
# explode() turns each year in year_list into its own row, giving one row per (product, year).
sales_exploded_df_1384 = sales_df_1384.withColumn("report_year", explode(col("year_list")))
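
As a rough mental model of the `sequence` and `explode` steps, the plain-Python sketch below builds an inclusive list of years per row and then flattens it. The `rows` list is an illustrative stand-in for the start and end years of the Example 1 periods; it is not part of the notebook.

```python
# (product_id, start_year, end_year) taken from the Example 1 sales periods
rows = [(1, 2019, 2019), (2, 2018, 2020), (3, 2019, 2020)]

# sequence(start, stop) includes both ends, hence the +1 for range()
with_year_list = [(pid, list(range(start, end + 1))) for pid, start, end in rows]

# explode() emits one output row per array element, repeating the other columns
exploded = [(pid, yr) for pid, years in with_year_list for yr in years]

for row in exploded:
    print(row)
```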

In [0]:
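# Clip each period to its report year: start_date is the later of period_start and January 1,
# end_date is the earlier of period_end and December 31 (January 1 of the next year minus one day).
sales_split_df_1384 = sales_exploded_df_1384\
                    .withColumn("start_date", greatest(col("period_start"), col("report_year").cast("string").cast(DateType())))\
                    .withColumn("end_date", least(col("period_end"), (col("report_year") + lit(1)).cast("string").cast(DateType()) - expr("INTERVAL 1 DAY")))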


In [0]:
# Both endpoints are inclusive, so add 1 to datediff before multiplying by the daily average.
sales_split_df_1384 = sales_split_df_1384\
                        .withColumn("days", datediff(col("end_date"), col("start_date")) + lit(1))\
                        .withColumn("total_amount", col("days") * col("average_daily_sales"))
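
The clipping and inclusive day count can be mirrored in plain Python, which may help when checking edge cases by hand. This is a sketch under the assumption that `max` and `min` on `date` objects stand in for `greatest` and `least`; the helper name `clip_to_year` is hypothetical.

```python
from datetime import date


def clip_to_year(period_start, period_end, report_year):
    """Return the part of an inclusive period that lies inside report_year."""
    year_start = date(report_year, 1, 1)
    year_end = date(report_year, 12, 31)
    # max/min play the role of greatest/least in the Spark version
    return max(period_start, year_start), min(period_end, year_end)


# LC T-Shirt period from Example 1, clipped to 2019
start, end = clip_to_year(date(2018, 12, 1), date(2020, 1, 1), 2019)

# Subtracting dates excludes the start day, so add 1 for an inclusive count
days = (end - start).days + 1
print(start, end, days)
```

The `+ 1` matters most for single-day slices: the 2020 slice of the same period starts and ends on 2020-01-01, and without it the day count would be 0 instead of 1.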

In [0]:
# Attach product names, keep the requested columns, and sort by product_id and report_year.
sales_split_df_1384\
    .join(product_df_1384, on="product_id", how="left") \
    .select("product_id", "product_name", "report_year", "total_amount") \
    .orderBy("product_id", "report_year").show()

+----------+------------+-----------+------------+
|product_id|product_name|report_year|total_amount|
+----------+------------+-----------+------------+
|         1|    LC Phone|       2019|        3500|
|         2|  LC T-Shirt|       2018|         310|
|         2|  LC T-Shirt|       2019|        3650|
|         2|  LC T-Shirt|       2020|          10|
|         3| LC Keychain|       2019|          31|
|         3| LC Keychain|       2020|          31|
+----------+------------+-----------+------------+

