# Assignment 4
## Econ 8310 - Business Forecasting

This assignment uses the Bayesian statistical models covered in Lessons 10 to 12.

A/B Testing is a critical concept in data science, and for many companies one of the most relevant applications of data-driven decision-making. In order to improve product offerings, marketing campaigns, user interfaces, and many other user-facing interactions, scientists and engineers create experiments to determine the efficacy of proposed changes. Users are then randomly assigned to either the treatment or control group, and their behavior is recorded.
If the changes shown to the treatment group produce a measurable benefit in the metric of interest, those changes are scaled up and rolled out across all interactions.
Below is a short video detailing the A/B Testing process, in case you want to learn a bit more:
[https://youtu.be/DUNk4GPZ9bw](https://youtu.be/DUNk4GPZ9bw)

For this assignment, you will use an A/B test data set, which was pulled from the Kaggle website (https://www.kaggle.com/datasets/yufengsui/mobile-games-ab-testing). I have added the data from the page into Codio for you. It can be found in the cookie_cats.csv file in the file tree. It can also be found at [https://github.com/dustywhite7/Econ8310/raw/master/AssignmentData/cookie_cats.csv](https://github.com/dustywhite7/Econ8310/raw/master/AssignmentData/cookie_cats.csv)

The variables are defined as follows:

| Variable Name  | Definition |
|----------------|----|
| userid         | A unique number that identifies each player  |
| version        | Whether the player was put in the control group (gate_30 - a gate at level 30) or the group with the moved gate (gate_40 - a gate at level 40) |
| sum_gamerounds | The number of game rounds played by the player during the first 14 days after install.  |
| retention_1    | Did the player come back and play 1 day after installing?     |
| retention_7    | Did the player come back and play 7 days after installing?    |

### The questions

You will be asked to answer the following questions in a small quiz on Canvas:
1. What was the effect of moving the gate from level 30 to level 40 on 1-day retention rates?
2. What was the effect of moving the gate from level 30 to level 40 on 7-day retention rates?
3. What was the biggest challenge for you in completing this assignment?

You will also be asked to submit a URL to your forked GitHub repository containing your code used to answer these questions.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = "https://github.com/dustywhite7/Econ8310/raw/master/AssignmentData/cookie_cats.csv"
df = pd.read_csv(url)
print(df.head())


In [None]:
# Count and display the number of unique players
print("Number of players: \n", df.userid.nunique(), '\n',
        "Number of records: \n", len(df.userid),'\n')

In [None]:
# Inspect the data type of each column
df.dtypes

In [None]:
# Convert boolean to integer (True → 1, False → 0)
df['retention_1'] = df['retention_1'].astype(int)
df['retention_7'] = df['retention_7'].astype(int)

# Check if it's correct
df[['retention_1', 'retention_7']].dtypes


In [None]:
# Control group (gate_30)
control = df[df['version'] == 'gate_30']

# Treatment group (gate_40)
treatment = df[df['version'] == 'gate_40']


In [None]:
print(df.info())
print(df['version'].value_counts())
print(df[['retention_1', 'retention_7']].mean())


In [None]:
# 1-day retention rate by group
retention_1_rates = df.groupby('version')['retention_1'].mean() * 100
# 7-day retention rate by group
retention_7_rates = df.groupby('version')['retention_7'].mean() * 100

print("1-day retention rates (%):\n", retention_1_rates)
print("7-day retention rates (%):\n", retention_7_rates)


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Get successes and trials for each group
for group in ['gate_30', 'gate_40']:
    successes = df[df['version'] == group]['retention_1'].sum()
    trials = df[df['version'] == group]['retention_1'].count()
    print(f"{group}: {successes} successes out of {trials} trials")

# Example for 1-day retention
successes_30 = df[df['version'] == 'gate_30']['retention_1'].sum()
trials_30 = df[df['version'] == 'gate_30']['retention_1'].count()
successes_40 = df[df['version'] == 'gate_40']['retention_1'].sum()
trials_40 = df[df['version'] == 'gate_40']['retention_1'].count()

# Posterior distributions
x = np.linspace(0, 0.6, 1000)
posterior_30 = beta(successes_30 + 1, trials_30 - successes_30 + 1)
posterior_40 = beta(successes_40 + 1, trials_40 - successes_40 + 1)

plt.plot(x, posterior_30.pdf(x), label='gate_30')
plt.plot(x, posterior_40.pdf(x), label='gate_40')
plt.xlabel('Retention Rate')
plt.ylabel('Density')
plt.title('Posterior Distributions for 1-Day Retention')
plt.legend()
plt.show()
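
The posteriors plotted above have the form Beta(successes + 1, failures + 1), which is what a uniform Beta(1, 1) prior combined with a binomial likelihood gives. To put a number on how uncertain each retention rate is, one option is an equal-tailed credible interval. The sketch below is one possible way to compute it, not a required part of the assignment. The helper name `beta_credible_interval` and its parameters are hypothetical, and the counts in the usage line are made-up placeholders rather than values from the Cookie Cats data.

```python
from scipy.stats import beta


def beta_credible_interval(successes, trials, level=0.95, prior_a=1.0, prior_b=1.0):
    """Equal-tailed credible interval for a retention rate.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood, so the
    posterior is Beta(prior_a + successes, prior_b + failures).
    """
    failures = trials - successes
    posterior = beta(prior_a + successes, prior_b + failures)
    # interval() returns the (lower, upper) bounds that leave equal
    # probability mass in each tail.
    return posterior.interval(level)


# Hypothetical counts for illustration only; not the Cookie Cats results.
low, high = beta_credible_interval(successes=450, trials=1000)
print(f"95% credible interval: [{low:.3f}, {high:.3f}]")
```

In the notebook, the same helper could be called with `successes_30, trials_30` and `successes_40, trials_40` to see whether the two intervals overlap.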


In [None]:
# Calculating 1-day retention for each AB-group
# (uses the `control` and `treatment` frames defined in the split cell above)

# CONTROL GROUP (gate_30)
prop_gate30 = control['retention_1'].mean() * 100

# TREATMENT GROUP (gate_40)
prop_gate40 = treatment['retention_1'].mean() * 100

print('Group 30 at 1 day retention: ', str(round(prop_gate30, 2)), "%", "\n",
      'Group 40 at 1 day retention: ', str(round(prop_gate40, 2)), "%")

In [None]:
# Calculating 7-day retention for each AB-group
# (uses the `control` and `treatment` frames defined in the split cell above)

# CONTROL GROUP (gate_30)
prop_gate30 = control['retention_7'].mean() * 100

# TREATMENT GROUP (gate_40)
prop_gate40 = treatment['retention_7'].mean() * 100

print('Group 30 at 7 day retention: ', str(round(prop_gate30, 2)), "%", "\n",
      'Group 40 at 7 day retention: ', str(round(prop_gate40, 2)), "%")

In [None]:
# The % of users that came back the day after they installed
prop = len(df[df['retention_1'] == True]) / len(df['retention_1']) * 100

print("The overall retention for 1 day is: ", str(round(prop,2)),"%")

In [None]:
# The % of users that came back 7 days after they installed
prop = len(df[df['retention_7'] == True]) / len(df['retention_7']) * 100

print("The overall retention for 7 days is: ", str(round(prop,2)),"%")

In [None]:
import matplotlib.pyplot as plt

# Calculate means
means = df.groupby('version')[['retention_1', 'retention_7']].mean().reset_index()

# Bar plot for 1-day and 7-day retention
fig, ax = plt.subplots(1, 2, figsize=(10,5))

# 1-day retention
ax[0].bar(means['version'], means['retention_1'], color=['skyblue', 'orange'])
ax[0].set_title('1-Day Retention Rate')
ax[0].set_ylabel('Retention Rate')
ax[0].set_ylim(0, 0.5)

# 7-day retention
ax[1].bar(means['version'], means['retention_7'], color=['skyblue', 'orange'])
ax[1].set_title('7-Day Retention Rate')
ax[1].set_ylim(0, 0.25)

plt.tight_layout()
plt.show()


In [None]:
import numpy as np

# Simulate samples from the Beta posteriors
samples_30 = np.random.beta(successes_30 + 1, trials_30 - successes_30 + 1, 100000)
samples_40 = np.random.beta(successes_40 + 1, trials_40 - successes_40 + 1, 100000)
diff = samples_30 - samples_40

plt.hist(diff, bins=50, color='purple', alpha=0.7)
plt.title('Posterior Distribution of the Difference (gate_30 - gate_40)\n1-Day Retention')
plt.xlabel('Difference in Retention Rate')
plt.ylabel('Frequency')
plt.axvline(0, color='black', linestyle='--')
plt.show()
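
The histogram shows the shape of the difference, but the quiz questions ask for the effect itself. One way to summarize the posterior samples numerically is to report the probability that one group retains better than the other, together with the mean difference and a 95% interval. The following is a minimal sketch under the same uniform Beta(1, 1) prior used above. The function name `compare_posteriors`, the seed, and the example counts are assumptions made for illustration and do not come from the data set.

```python
import numpy as np


def compare_posteriors(successes_a, trials_a, successes_b, trials_b,
                       draws=100_000, seed=0):
    """Monte Carlo summary of rate_a - rate_b under uniform Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)  # seeded so results are reproducible
    samples_a = rng.beta(successes_a + 1, trials_a - successes_a + 1, draws)
    samples_b = rng.beta(successes_b + 1, trials_b - successes_b + 1, draws)
    diff = samples_a - samples_b
    lower, upper = np.percentile(diff, [2.5, 97.5])
    return {
        "prob_a_greater": float((diff > 0).mean()),  # share of draws where a > b
        "mean_diff": float(diff.mean()),
        "interval_95": (float(lower), float(upper)),
    }


# Hypothetical counts for illustration only; not the Cookie Cats results.
summary = compare_posteriors(480, 1000, 400, 1000)
print(summary)
```

In the notebook this could be called as `compare_posteriors(successes_30, trials_30, successes_40, trials_40)` for 1-day retention, and again with counts taken from `retention_7` for the 7-day question.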
