## Importing Libraries

In [0]:
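# Note: this wildcard import shadows Python's built-in sum, max, and min
# with the Spark column functions of the same name, which the cells below rely on.
from pyspark.sql.functions import *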
from pyspark.sql.types import *
from pyspark.sql.window import Window

**2752. Customers with Maximum Number of Transactions on Consecutive Days (Hard)**

**Table: Transactions**

| Column Name      | Type |
|------------------|------|
| transaction_id   | int  |
| customer_id      | int  |
| transaction_date | date |
| amount           | int  |

transaction_id is the column with unique values of this table.
Each row describes one transaction: its customer_id, transaction_date, and amount. Each (customer_id, transaction_date) pair is unique.

**Write a solution to find every customer_id that made the maximum number of transactions on consecutive days.**

Return every customer_id with the maximum number of consecutive transactions. Order the result table by customer_id in ascending order.

The result format is in the following example.

**Example 1:**

**Input:** 

**Transactions table:**
| transaction_id | customer_id | transaction_date | amount |
|----------------|-------------|------------------|--------|
| 1              | 101         | 2023-05-01       | 100    |
| 2              | 101         | 2023-05-02       | 150    |
| 3              | 101         | 2023-05-03       | 200    |
| 4              | 102         | 2023-05-01       | 50     |
| 5              | 102         | 2023-05-03       | 100    |
| 6              | 102         | 2023-05-04       | 200    |
| 7              | 105         | 2023-05-01       | 100    |
| 8              | 105         | 2023-05-02       | 150    |
| 9              | 105         | 2023-05-03       | 200    |

**Output:** 
| customer_id | 
|-------------|
| 101         | 
| 105         | 

**Explanation:** 
- customer_id 101 has a total of 3 transactions, and all of them are consecutive.
- customer_id 102 has a total of 3 transactions, but only 2 of them are consecutive. 
- customer_id 105 has a total of 3 transactions, and all of them are consecutive.

In total, the highest number of consecutive transactions is 3, achieved by customer_id 101 and 105. The customer_id values are sorted in ascending order.
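
The explanation above can be checked with a small plain-Python sketch, independent of Spark. It is only a reference check on the Example 1 rows, not the notebook's solution; the names `rows`, `longest_streak`, `streaks`, and `result` are made up for this illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Example 1 rows: (transaction_id, customer_id, transaction_date, amount)
rows = [
    (1, 101, "2023-05-01", 100),
    (2, 101, "2023-05-02", 150),
    (3, 101, "2023-05-03", 200),
    (4, 102, "2023-05-01", 50),
    (5, 102, "2023-05-03", 100),
    (6, 102, "2023-05-04", 200),
    (7, 105, "2023-05-01", 100),
    (8, 105, "2023-05-02", 150),
    (9, 105, "2023-05-03", 200),
]


def longest_streak(dates):
    """Return the length of the longest run of calendar-consecutive dates."""
    best = run = 0
    prev = None
    for d in sorted(dates):
        # Extend the run only if this date is exactly one day after the previous one.
        if prev is not None and d - prev == timedelta(days=1):
            run += 1
        else:
            run = 1
        best = max(best, run)
        prev = d
    return best


# Collect each customer's transaction dates.
by_customer = defaultdict(set)
for _, customer_id, tx_date, _ in rows:
    by_customer[customer_id].add(date.fromisoformat(tx_date))

streaks = {c: longest_streak(ds) for c, ds in by_customer.items()}
top = max(streaks.values())
result = sorted(c for c, s in streaks.items() if s == top)

print(streaks)  # {101: 3, 102: 2, 105: 3}
print(result)   # [101, 105]
```

Customer 102 is excluded because the gap between 2023-05-01 and 2023-05-03 splits its three transactions into streaks of length 1 and 2.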

In [0]:
transactions_data_2752 = [
    (1, 101, "2023-05-01", 100),
    (2, 101, "2023-05-02", 150),
    (3, 101, "2023-05-03", 200),
    (4, 102, "2023-05-01", 50),
    (5, 102, "2023-05-03", 100),
    (6, 102, "2023-05-04", 200),
    (7, 105, "2023-05-01", 100),
    (8, 105, "2023-05-02", 150),
    (9, 105, "2023-05-03", 200),
]

transactions_columns_2752 = ["transaction_id", "customer_id", "transaction_date", "amount"]
transactions_df_2752 = spark.createDataFrame(transactions_data_2752, transactions_columns_2752)
transactions_df_2752.show()


+--------------+-----------+----------------+------+
|transaction_id|customer_id|transaction_date|amount|
+--------------+-----------+----------------+------+
|             1|        101|      2023-05-01|   100|
|             2|        101|      2023-05-02|   150|
|             3|        101|      2023-05-03|   200|
|             4|        102|      2023-05-01|    50|
|             5|        102|      2023-05-03|   100|
|             6|        102|      2023-05-04|   200|
|             7|        105|      2023-05-01|   100|
|             8|        105|      2023-05-02|   150|
|             9|        105|      2023-05-03|   200|
+--------------+-----------+----------------+------+



In [0]:
transactions_df_2752 = transactions_df_2752.withColumn("transaction_date", col("transaction_date").cast("date"))

In [0]:
windowSpec = Window.partitionBy("customer_id").orderBy("transaction_date")

In [0]:
transactions_df_2752 = transactions_df_2752\
                            .withColumn("prev_date", lag("transaction_date").over(windowSpec))

In [0]:
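# break_flag is 1 when a row starts a new streak: it is either the customer's first
# transaction (prev_date is null) or not exactly 1 day after the previous transaction.
# grp is the running sum of break_flag, so all rows of one streak share the same grp value.
transactions_df_2752 = transactions_df_2752\
                            .withColumn(
                                    "break_flag",
                                        ((datediff(col("transaction_date"), col("prev_date")) != 1) | col("prev_date").isNull()).cast("int")
                                        )\
                            .withColumn(
                                    "grp",
                                    sum("break_flag").over(Window.partitionBy("customer_id").orderBy("transaction_date").rowsBetween(Window.unboundedPreceding, 0))
                                    )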
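
The `break_flag` and `grp` columns computed in the cell above can be traced by hand for a single customer. The following plain-Python sketch mirrors that logic for customer 102 from Example 1; it is an illustration only, and the list names (`dates`, `prev_dates`, `break_flags`, `grps`) are invented here rather than taken from the notebook.

```python
from datetime import date
from itertools import accumulate

# Customer 102's transaction dates from Example 1, already sorted.
dates = [date(2023, 5, 1), date(2023, 5, 3), date(2023, 5, 4)]

# prev_date: the previous row's date, None for the first row (the role lag() plays).
prev_dates = [None] + dates[:-1]

# break_flag: 1 when the row starts a new streak (no previous row, or gap != 1 day).
break_flags = [
    int(prev is None or (cur - prev).days != 1)
    for cur, prev in zip(dates, prev_dates)
]

# grp: running sum of break_flag, so rows in the same streak share one value.
grps = list(accumulate(break_flags))

print(break_flags)  # [1, 1, 0]
print(grps)         # [1, 2, 2]
```

The first row gets its own group, and the last two rows share group 2, which corresponds to the two-day streak on 2023-05-03 and 2023-05-04.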

In [0]:
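# Within one grp the dates are consecutive and unique per customer,
# so the streak length equals (last date - first date) + 1.
streak_df_2752 = transactions_df_2752\
                        .groupBy("customer_id", "grp")\
                            .agg(
                                (datediff(max("transaction_date"), min("transaction_date")) + 1).alias("streak_length")
                                )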

In [0]:
max_streak_df_2752 = streak_df_2752\
                        .groupBy("customer_id")\
                            .agg(
                                max("streak_length").alias("max_streak")
                                )

In [0]:
overall_max_streak = max_streak_df_2752.agg(max("max_streak")).collect()[0][0]

In [0]:
max_streak_df_2752\
    .filter(col("max_streak") == overall_max_streak)\
        .select("customer_id").orderBy("customer_id").display()

customer_id
101
105
