## Importing Libraries

In [0]:
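# Note: this wildcard import shadows Python built-ins such as min, max, sum and round
# with their Spark column-function counterparts for the rest of the notebook.
from pyspark.sql.functions import *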
from pyspark.sql.types import *
from pyspark.sql.window import Window

**1097. Game Play Analysis V (Hard)**

**Table: Activity**

| Column Name  | Type    |
|--------------|---------|
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |

(player_id, event_date) is the primary key (combination of columns with unique values) of this table.
This table shows the activity of players of some games.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.
 
The install date of a player is the first login day of that player.

We define the day one retention of some date x as the number of players whose install date is x and who logged back in on the day right after x, divided by the number of players whose install date is x, rounded to 2 decimal places.
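
The definition can also be written as a formula. The symbols here are introduced only for illustration and are not part of the problem statement: $I(x)$ is the set of players whose install date is $x$, and $R(x) \subseteq I(x)$ is the subset of those players who also logged in on day $x + 1$.

$$
\text{Day1\_retention}(x) = \mathrm{round}\!\left(\frac{|R(x)|}{|I(x)|},\ 2\right)
$$

The denominator is never zero, because a date only counts as an install date if at least one player first logged in on it.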

**Write a solution to report for each install date, the number of players that installed the game on that day, and the day one retention.**

Return the result table in any order.

The result format is in the following example.

**Example 1:**

**Input:** 

**Activity table:**

| player_id | device_id | event_date | games_played |
|-----------|-----------|------------|--------------|
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-03-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-01 | 0            |
| 3         | 4         | 2016-07-03 | 5            |

**Output:**

| install_dt | installs | Day1_retention |
|------------|----------|----------------|
| 2016-03-01 | 2        | 0.50           |
| 2017-06-25 | 1        | 0.00           |

**Explanation:** 
- Players 1 and 3 installed the game on 2016-03-01, but only player 1 logged back in on 2016-03-02, so the day 1 retention of 2016-03-01 is 1 / 2 = 0.50.
- Player 2 installed the game on 2017-06-25 but didn't log back in on 2017-06-26, so the day 1 retention of 2017-06-25 is 0 / 1 = 0.00.
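
As a sanity check on the explanation above, the same arithmetic can be sketched in plain Python without Spark. This is only an illustrative cross-check, not the notebook's solution: the helper name `day_one_retention` is hypothetical, and `builtins.min` / `builtins.round` are used explicitly on the assumption that the wildcard import from `pyspark.sql.functions` has replaced the bare names in this notebook.

```python
import builtins
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the Activity table:
# (player_id, device_id, event_date, games_played)
rows = [
    (1, 2, date(2016, 3, 1), 5),
    (1, 2, date(2016, 3, 2), 6),
    (2, 3, date(2017, 6, 25), 1),
    (3, 1, date(2016, 3, 1), 0),
    (3, 4, date(2016, 7, 3), 5),
]


def day_one_retention(rows):
    """Return {install_dt: (installs, day1_retention)} (hypothetical helper)."""
    # Collect every login date per player.
    logins = defaultdict(set)
    for player_id, _device_id, event_date, _games_played in rows:
        logins[player_id].add(event_date)

    installs = defaultdict(int)  # players per install date
    returned = defaultdict(int)  # of those, players seen again the next day
    for dates in logins.values():
        install_dt = builtins.min(dates)  # first login day = install date
        installs[install_dt] += 1
        if install_dt + timedelta(days=1) in dates:
            returned[install_dt] += 1

    return {
        d: (n, builtins.round(returned[d] / n, 2))
        for d, n in installs.items()
    }


result = day_one_retention(rows)
for install_dt, (n, retention) in sorted(result.items()):
    print(install_dt, n, f"{retention:.2f}")
```

On the sample rows this yields two install dates, 2016-03-01 with 2 installs and a retention of 0.50, and 2017-06-25 with 1 install and a retention of 0.00, which matches the expected output table.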

In [0]:
activity_data_1097 = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-03-02", 6),
    (2, 3, "2017-06-25", 1),
    (3, 1, "2016-03-01", 0),
    (3, 4, "2016-07-03", 5),
]
activity_columns_1097 = ["player_id", "device_id", "event_date", "games_played"]
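# Build the DataFrame, then cast event_date from string to a real date column
# so that it matches the table schema and the date arithmetic below is explicit.
activity_df_1097 = spark.createDataFrame(activity_data_1097, activity_columns_1097) \
                        .withColumn("event_date", to_date(col("event_date")))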
activity_df_1097.show()

+---------+---------+----------+------------+
|player_id|device_id|event_date|games_played|
+---------+---------+----------+------------+
|        1|        2|2016-03-01|           5|
|        1|        2|2016-03-02|           6|
|        2|        3|2017-06-25|           1|
|        3|        1|2016-03-01|           0|
|        3|        4|2016-07-03|           5|
+---------+---------+----------+------------+



In [0]:
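# Install date = each player's earliest event_date (min here is Spark's aggregate, not the built-in)
install_df_1097 = activity_df_1097.groupBy("player_id") \
                        .agg(min("event_date").alias("install_dt"))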

In [0]:
joined_df_1097 = activity_df_1097.join(install_df_1097, on="player_id")

In [0]:
joined_df_1097.show()

+---------+---------+----------+------------+----------+
|player_id|device_id|event_date|games_played|install_dt|
+---------+---------+----------+------------+----------+
|        1|        2|2016-03-02|           6|2016-03-01|
|        2|        3|2017-06-25|           1|2017-06-25|
|        3|        4|2016-07-03|           5|2016-03-01|
|        1|        2|2016-03-01|           5|2016-03-01|
|        3|        1|2016-03-01|           0|2016-03-01|
+---------+---------+----------+------------+----------+



In [0]:
# next_day is the date on which a retained player must have logged back in
day_after_df_1097 = joined_df_1097.withColumn("next_day", date_add(col("install_dt"), 1))

In [0]:
day_after_df_1097.show()

+---------+---------+----------+------------+----------+----------+
|player_id|device_id|event_date|games_played|install_dt|  next_day|
+---------+---------+----------+------------+----------+----------+
|        1|        2|2016-03-02|           6|2016-03-01|2016-03-02|
|        2|        3|2017-06-25|           1|2017-06-25|2017-06-26|
|        3|        4|2016-07-03|           5|2016-03-01|2016-03-02|
|        1|        2|2016-03-01|           5|2016-03-01|2016-03-02|
|        3|        1|2016-03-01|           0|2016-03-01|2016-03-02|
+---------+---------+----------+------------+----------+----------+



In [0]:
# Keep only the players who have a login exactly one day after their install date
retention_df_1097 = day_after_df_1097.filter(col("event_date") == col("next_day")) \
                           .select("player_id", "install_dt").distinct()

In [0]:
retention_df_1097.show()

+---------+----------+
|player_id|install_dt|
+---------+----------+
|        1|2016-03-01|
+---------+----------+



In [0]:
# Players per install date, and retained players per install date
installs_df_1097 = install_df_1097.groupBy("install_dt").agg(countDistinct("player_id").alias("installs"))
returned_df_1097 = retention_df_1097.groupBy("install_dt").agg(countDistinct("player_id").alias("returns"))


In [0]:
# Left join keeps install dates with no returning players; their null count becomes 0
installs_df_1097.join(returned_df_1097, on="install_dt", how="left") \
                    .na.fill(0, subset=["returns"]) \
                    .withColumn("Day1_retention", round(col("returns") / col("installs"), 2)) \
                    .select("install_dt", "installs", "Day1_retention").show()

+----------+--------+--------------+
|install_dt|installs|Day1_retention|
+----------+--------+--------------+
|2016-03-01|       2|           0.5|
|2017-06-25|       1|           0.0|
+----------+--------+--------------+

