## Importing Libraries

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

**3089. Find Bursty Behavior (Medium)**

**Table: Posts**

| Column Name | Type    |
|-------------|---------|
| post_id     | int     |
| user_id     | int     |
| post_date   | date    |

post_id is the primary key (column with unique values) for this table.
Each row of this table contains post_id, user_id, and post_date.

**Write a solution to find users who demonstrate bursty behavior in their posting patterns during February 2024. Bursty behavior is defined as any period of 7 consecutive days where a user's posting frequency is at least twice their average weekly posting frequency for February 2024.**

**Note:** Only include the dates from February 1 to February 28 in your analysis, which means you should count February as having exactly 4 weeks.

Return the result table ordered by user_id in ascending order.

The result format is in the following example.

**Example:**

**Input:**

**Posts table:**

| post_id | user_id | post_date  |
|---------|---------|------------|
| 1       | 1       | 2024-02-27 |
| 2       | 5       | 2024-02-06 |
| 3       | 3       | 2024-02-25 |
| 4       | 3       | 2024-02-14 |
| 5       | 3       | 2024-02-06 |
| 6       | 2       | 2024-02-25 |

**Output:**

| user_id | max_7day_posts | avg_weekly_posts |
|---------|----------------|------------------|
| 1       | 1              | 0.2500           |
| 2       | 1              | 0.2500           |
| 5       | 1              | 0.2500           |

**Explanation:**
- **User 1:** Made only 1 post in February, resulting in an average of 0.25 posts per week and a max of 1 post in any 7-day period.
- **User 2:** Also made just 1 post, with the same average and max 7-day posting frequency as User 1.
- **User 5:** Like Users 1 and 2, User 5 made only 1 post throughout February, leading to the same average and max 7-day posting metrics.
- **User 3:** Made 3 posts, for an average of 0.75 posts per week and a threshold of 1.5. The posts fall on February 6, 14, and 25, so no 7-day window contains more than 1 of them. Since 1 is below 1.5, User 3 is not listed in the output.

**Note:** Output table is ordered by user_id in ascending order.
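
The definition and the example above can be cross-checked without Spark. The following is a minimal plain-Python sketch, not the notebook's solution: the function name `bursty_users` and the constants `FEB_START` and `FEB_END` are introduced here for illustration only. It only evaluates 7-day windows that end on a day with a post, which is enough to find the maximum.

```python
from collections import Counter
from datetime import date, timedelta

# Example rows from the Posts table: (post_id, user_id, post_date).
posts = [
    (1, 1, date(2024, 2, 27)),
    (2, 5, date(2024, 2, 6)),
    (3, 3, date(2024, 2, 25)),
    (4, 3, date(2024, 2, 14)),
    (5, 3, date(2024, 2, 6)),
    (6, 2, date(2024, 2, 25)),
]

# Hypothetical helper constants: the analysis period stated in the problem.
FEB_START, FEB_END = date(2024, 2, 1), date(2024, 2, 28)


def bursty_users(rows):
    """Return (user_id, max_7day_posts, avg_weekly_posts) for bursty users."""
    per_user = {}
    for _, user_id, day in rows:
        if FEB_START <= day <= FEB_END:
            per_user.setdefault(user_id, Counter())[day] += 1

    result = []
    for user_id in sorted(per_user):
        daily = per_user[user_id]
        avg_weekly = sum(daily.values()) / 4  # February counted as 4 weeks
        # Largest post count in any 7-day window ending on a posting day.
        max_7day = max(
            sum(n for d, n in daily.items() if end - timedelta(days=6) <= d <= end)
            for end in daily
        )
        if max_7day >= 2 * avg_weekly:
            result.append((user_id, max_7day, avg_weekly))
    return result


print(bursty_users(posts))  # [(1, 1, 0.25), (2, 1, 0.25), (5, 1, 0.25)]
```

User 3 is dropped because its maximum 7-day count of 1 is below the threshold of 2 × 0.75 = 1.5.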

In [0]:
posts_data_3089 = [
    (1, 1, "2024-02-27"),
    (2, 5, "2024-02-06"),
    (3, 3, "2024-02-25"),
    (4, 3, "2024-02-14"),
    (5, 3, "2024-02-06"),
    (6, 2, "2024-02-25"),
]

post_columns_3089 = ["post_id", "user_id", "post_date"]
post_df_3089 = spark.createDataFrame(posts_data_3089, post_columns_3089)
post_df_3089.show()

+-------+-------+----------+
|post_id|user_id| post_date|
+-------+-------+----------+
|      1|      1|2024-02-27|
|      2|      5|2024-02-06|
|      3|      3|2024-02-25|
|      4|      3|2024-02-14|
|      5|      3|2024-02-06|
|      6|      2|2024-02-25|
+-------+-------+----------+



In [0]:
post_df_3089 = post_df_3089\
                    .withColumn("post_date", to_timestamp("post_date"))

In [0]:
post_df_3089 = post_df_3089\
                    .filter(
                        (col("post_date") >= "2024-02-01") & 
                        (col("post_date") <= "2024-02-28")
                        )

In [0]:
daily_posts_3089 = post_df_3089\
                        .groupBy("user_id", "post_date")\
                            .agg(count("*").alias("daily_posts"))

In [0]:
daily_posts_3089 = daily_posts_3089\
                        .withColumn("post_ts", unix_timestamp("post_date"))

In [0]:
seven_days = 7 * 24 * 3600  # seven days in seconds, the same unit as post_ts

In [0]:
# Range frame over epoch seconds: the current day plus the 6 days before it.
windowSpec = Window.partitionBy("user_id").orderBy("post_ts").rangeBetween(-seven_days+1, 0)
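
The `+1` in `rangeBetween(-seven_days+1, 0)` is what keeps the frame at exactly 7 calendar days. The sketch below checks this in plain Python, assuming every timestamp sits at midnight with no daylight-saving shift inside the window. The helper `in_window` is hypothetical and only mirrors the frame bounds; it is not a Spark API.

```python
from datetime import datetime, timedelta, timezone

SEVEN_DAYS = 7 * 24 * 3600  # same value as seven_days in the notebook


def in_window(current, other):
    """True if `other` lies in the frame [current - SEVEN_DAYS + 1, current]."""
    cur_ts = int(current.timestamp())
    other_ts = int(other.timestamp())
    return cur_ts - SEVEN_DAYS + 1 <= other_ts <= cur_ts


current = datetime(2024, 2, 14, tzinfo=timezone.utc)
six_days_back = current - timedelta(days=6)
seven_days_back = current - timedelta(days=7)

print(in_window(current, six_days_back))    # True: 6 days earlier is inside
print(in_window(current, seven_days_back))  # False: 7 days earlier is outside
```

Without the `+1`, a post made exactly 7 days earlier would be included and the frame would span 8 calendar days.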

In [0]:
daily_posts_3089 = daily_posts_3089\
                        .withColumn("rolling_7day", sum("daily_posts").over(windowSpec))

In [0]:
max_7day_3089 = daily_posts_3089\
                    .groupBy("user_id")\
                        .agg(max("rolling_7day").alias("max_7day_posts"))

In [0]:
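# Total February posts divided by 4 weeks. daily_posts_3089 has one row per
# (user_id, post_date), so sum daily_posts instead of counting rows.
avg_weekly_3089 = daily_posts_3089\
                    .groupBy("user_id")\
                        .agg((sum("daily_posts") / lit(4)).alias("avg_weekly_posts"))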
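
The aggregation above runs on `daily_posts_3089`, which has one row per `(user_id, post_date)`. A small plain-Python sketch with made-up data shows how the row count and the sum of `daily_posts` diverge when a user posts more than once on the same day, which matters for the weekly average. The values in `daily_rows` are hypothetical and not taken from the example input.

```python
# Hypothetical per-day rows for one user: (post_date, daily_posts).
daily_rows = [
    ("2024-02-06", 2),  # two posts on the same day
    ("2024-02-07", 1),
]

active_days = len(daily_rows)                # what counting rows yields
total_posts = sum(n for _, n in daily_rows)  # what summing daily_posts yields

avg_from_rows = active_days / 4   # average based on active days
avg_from_posts = total_posts / 4  # average based on actual posts

print(avg_from_rows, avg_from_posts)  # 0.5 0.75
```

In the example input every user posts at most once per day, so both approaches return the same result there.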

In [0]:
max_7day_3089\
    .join(avg_weekly_3089, "user_id")\
        .filter(col("max_7day_posts") >= 2 * col("avg_weekly_posts"))\
            .orderBy("user_id").display()

user_id,max_7day_posts,avg_weekly_posts
1,1,0.25
2,1,0.25
5,1,0.25
