## Importing Libraries

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

**1454. Active Users (Medium)**

**Table: Accounts**

| Column Name   | Type    |
|---------------|---------|
| id            | int     |
| name          | varchar |

id is the primary key (column with unique values) for this table.
This table contains the account id and the user name of each account.
 
**Table: Logins**

| Column Name   | Type    |
|---------------|---------|
| id            | int     |
| login_date    | date    |

This table may contain duplicate rows.
This table contains the account id of the user who logged in and the login date. A user may log in multiple times on the same day.
 
Active users are those who logged in to their accounts for five or more consecutive days.

**Write a solution to find the id and the name of active users.**

Return the result table ordered by id.

The result format is in the following example.

**Example 1:**

**Input:** 

**Accounts table:**
| id | name     |
|----|----------|
| 1  | Winston  |
| 7  | Jonathan |

**Logins table:**
| id | login_date |
|----|------------|
| 7  | 2020-05-30 |
| 1  | 2020-05-30 |
| 7  | 2020-05-31 |
| 7  | 2020-06-01 |
| 7  | 2020-06-02 |
| 7  | 2020-06-02 |
| 7  | 2020-06-03 |
| 1  | 2020-06-07 |
| 7  | 2020-06-10 |

**Output:** 
| id | name     |
|----|----------|
| 7  | Jonathan |

**Explanation:** 
- User Winston (id = 1) logged in only 2 times, on 2 different days, so Winston is not an active user.
- User Jonathan (id = 7) logged in 7 times on 6 different days, five of which were consecutive, so Jonathan is an active user.
 
**Follow up:** Could you write a general solution if the active users are those who logged in to their accounts for n or more consecutive days?
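
To make the definition concrete before moving to Spark, here is a minimal plain-Python sketch of the "n or more consecutive days" rule. The helper names `longest_streak` and `is_active` are illustrative and not part of the original notebook; the dates are the ones from Example 1.

```python
from datetime import date, timedelta

def longest_streak(login_dates):
    """Length of the longest run of consecutive calendar days in login_dates."""
    days = sorted(set(login_dates))  # repeated logins on one day count once
    best = current = 0
    previous = None
    for day in days:
        if previous is not None and day - previous == timedelta(days=1):
            current += 1   # continues the current run
        else:
            current = 1    # first day, or a gap: start a new run
        best = max(best, current)
        previous = day
    return best

def is_active(login_dates, n=5):
    """True if the user logged in on n or more consecutive days."""
    return longest_streak(login_dates) >= n

jonathan = [date(2020, 5, 30), date(2020, 5, 31), date(2020, 6, 1),
            date(2020, 6, 2), date(2020, 6, 2), date(2020, 6, 3),
            date(2020, 6, 10)]
winston = [date(2020, 5, 30), date(2020, 6, 7)]

print(longest_streak(jonathan), is_active(jonathan))  # 5 True
print(longest_streak(winston), is_active(winston))    # 1 False
```

Because `n` is a parameter, the same check covers the follow-up question; only the threshold changes.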

In [0]:
accounts_data_1454 = [
    (1, "Winston"),
    (7, "Jonathan")
]

accounts_columns_1454 = ["id", "name"]
accounts_df_1454 = spark.createDataFrame(accounts_data_1454, accounts_columns_1454)
accounts_df_1454.show()

logins_data_1454 = [
    (7, "2020-05-30"),
    (1, "2020-05-30"),
    (7, "2020-05-31"),
    (7, "2020-06-01"),
    (7, "2020-06-02"),
    (7, "2020-06-02"),
    (7, "2020-06-03"),
    (1, "2020-06-07"),
    (7, "2020-06-10")
]

logins_columns_1454 = ["id", "login_date"]
logins_df_1454 = spark.createDataFrame(logins_data_1454, logins_columns_1454)
logins_df_1454.show()

+---+--------+
| id|    name|
+---+--------+
|  1| Winston|
|  7|Jonathan|
+---+--------+

+---+----------+
| id|login_date|
+---+----------+
|  7|2020-05-30|
|  1|2020-05-30|
|  7|2020-05-31|
|  7|2020-06-01|
|  7|2020-06-02|
|  7|2020-06-02|
|  7|2020-06-03|
|  1|2020-06-07|
|  7|2020-06-10|
+---+----------+



In [0]:
logins_df_1454 = logins_df_1454\
                    .withColumn("login_date", to_date("login_date"))

In [0]:
# Minimum number of consecutive login days; change n to answer the follow-up.
n = 5

In [0]:
logins_df_1454 = logins_df_1454.dropDuplicates(["id", "login_date"])
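
The deduplication step matters because the later logic treats any day gap other than 1 as the start of a new streak. A plain-Python sketch (the helper `day_gaps` is hypothetical, used only for illustration) shows what the duplicate 2020-06-02 row for id 7 would do to the gaps:

```python
from datetime import date

def day_gaps(days):
    """Gap in days between each date and the one before it, after sorting."""
    ordered = sorted(days)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]

# Logins for id 7 from Example 1, including the repeated 2020-06-02.
raw = [date(2020, 5, 30), date(2020, 5, 31), date(2020, 6, 1),
       date(2020, 6, 2), date(2020, 6, 2), date(2020, 6, 3),
       date(2020, 6, 10)]

gaps_raw = day_gaps(raw)            # duplicates kept
gaps_distinct = day_gaps(set(raw))  # duplicates removed

print(gaps_raw)       # [1, 1, 1, 0, 1, 7]
print(gaps_distinct)  # [1, 1, 1, 1, 7]
```

With the duplicate kept, the gap of 0 would split the five-day run into pieces of 4 and 2 rows, and Jonathan would be missed.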

In [0]:
windowSpec = Window.partitionBy("id").orderBy("login_date")

In [0]:
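# grp: 1 when this login is exactly one day after the user's previous login,
#      0 when it starts a new streak (first login, or any other gap).
# streak_grp: running count of streak starts, so rows in one streak share a label.
logins_df_1454 = logins_df_1454\
                    .withColumn("grp",
                        datediff("login_date", lag("login_date", 1).over(windowSpec))
                    )\
                    .withColumn("grp",
                        when(col("grp").isNull() | (col("grp") != 1), 0).otherwise(1)
                    )\
                    .withColumn("streak_grp",
                        sum(when(col("grp") == 0, 1).otherwise(0)).over(windowSpec)
                    )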
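
As a sanity check, the two columns built in this cell can be traced by hand. The sketch below mirrors the same logic in plain Python for id 7 only, assuming the dates are already deduplicated and sorted; it is an illustration, not a replacement for the Spark code.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Distinct, sorted login dates for id 7.
days = [date(2020, 5, 30), date(2020, 5, 31), date(2020, 6, 1),
        date(2020, 6, 2), date(2020, 6, 3), date(2020, 6, 10)]

# grp: 1 when the previous login was exactly one day earlier, else 0.
# The first row has no predecessor, which mirrors the null returned by lag.
grp = [0] + [1 if (b - a).days == 1 else 0 for a, b in zip(days, days[1:])]

# streak_grp: running count of rows with grp == 0 (streak starts so far).
streak_grp = list(accumulate(1 if g == 0 else 0 for g in grp))

# Rows per streak label, the quantity the next cell computes as streak_days.
streak_days = Counter(streak_grp)

for d, g, s in zip(days, grp, streak_grp):
    print(d, g, s)
print(dict(streak_days))  # {1: 5, 2: 1}
```

Streak 1 covers 2020-05-30 through 2020-06-03 (5 rows) and streak 2 is the isolated 2020-06-10 login, so only streak 1 passes the `n = 5` filter.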


In [0]:
streak_counts_1454 = logins_df_1454\
                        .groupBy("id", "streak_grp")\
                        .agg(count("*").alias("streak_days"))

In [0]:
active_users_df_1454 = streak_counts_1454\
                                .filter(col("streak_days") >= n)\
                                    .select("id").distinct()

In [0]:
active_users_df_1454\
            .join(accounts_df_1454, "id").orderBy("id").show()

+---+--------+
| id|    name|
+---+--------+
|  7|Jonathan|
+---+--------+
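
For comparison, a different and commonly used way to label streaks is the "date minus row number" trick: within a run of consecutive days, subtracting each row's position from its date yields the same anchor date. This is not the approach used in the cells above; the sketch below shows the idea in plain Python, and `streak_lengths` is a hypothetical helper name.

```python
from datetime import date, timedelta
from itertools import groupby

def streak_lengths(login_dates):
    """Lengths of each run of consecutive days, in chronological order."""
    days = sorted(set(login_dates))
    # Pair every day with its anchor: the day minus its position in the list.
    keyed = [(day - timedelta(days=i), day) for i, day in enumerate(days)]
    # Adjacent rows with the same anchor belong to the same streak.
    return [len(list(run)) for _, run in groupby(keyed, key=lambda pair: pair[0])]

jonathan_days = [date(2020, 5, 30), date(2020, 5, 31), date(2020, 6, 1),
                 date(2020, 6, 2), date(2020, 6, 2), date(2020, 6, 3),
                 date(2020, 6, 10)]

lengths = streak_lengths(jonathan_days)
print(lengths)              # [5, 1]
print(max(lengths) >= 5)    # True
```

In Spark the same idea would amount to grouping by the login date shifted back by a per-user row number over the existing window, which would replace the two-step `grp` / `streak_grp` construction with a single derived key.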

