### Mobile App retention

A mobile app development company has collected data on its users. \
The file 'App_Retention.csv' contains the results of an experiment where two versions of a mobile app are compared. \
Here are the main attributes of this data:
- UserID - ID of the user
- Treat - Type of group the user belongs to (control/treatment)
- Retention - Behaviour of user in terms of app usage


This experiment aims to test if there is an increase in mobile app retention rates by reducing the frequency of mobile app notifications from 5 or more times per day to exactly two times per day. 

Users using this mobile app are randomly assigned to a control (old set-up) or treatment (new set-up) group.

The variable 'Treat' identifies the group each user is assigned to. A user counts as retained if they keep the app installed, which is coded as follows:

- Retention = 1 (not uninstalled app)
- Retention = 0 (uninstalled app)

You wish to build an A/B test that determines whether reducing the frequency of mobile app notifications helps retain app users.

### Importing Important libraries

In [53]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.stats.api as sms
import math

In [54]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [55]:
df = pd.read_csv('App_Retention.csv')

In [56]:
df.head()

Unnamed: 0,UserID,Treat,Retention
0,1,control,0
1,2,control,0
2,3,treatment,0
3,4,treatment,0
4,5,control,0


### Creating Control and Treatment group dataframes.

- For control group

In [57]:
dfc = df[df['Treat'] == 'control'].copy()  # copy so new columns can be added without SettingWithCopyWarning

In [58]:
(dfc['Treat'] == 'control').all()

True

#### Creating cumulative retention.

In [59]:
Nc = len(dfc.index)
Sc = dfc['Retention'].sum()

In [60]:
dfc['cumu_ret'] = np.around(np.cumsum(dfc['Retention']))

In [61]:
dfc['ret_rate'] = dfc['cumu_ret'] / dfc['UserID']

In [62]:
dfc.describe()

Unnamed: 0,UserID,Retention,cumu_ret,ret_rate
count,2500.0,2500.0,2500.0,2500.0
mean,2496.5164,0.04,48.3248,0.018374
std,1443.491456,0.195998,30.118801,0.003363
min,1.0,0.0,0.0,0.0
25%,1250.25,0.0,23.0,0.017485
50%,2496.5,0.0,43.0,0.019189
75%,3734.25,0.0,78.0,0.020141
max,5000.0,1.0,100.0,0.022346
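Note that dividing `cumu_ret` by `UserID` puts users from both groups in the denominator, which is why `ret_rate` in the summary above sits near 0.02 rather than the group rate of 0.04. A minimal sketch of one alternative, dividing by a running count of users within the group, is shown here on made-up data; the names `toy`, `ctrl` and `n_seen` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up data for illustration only (not taken from App_Retention.csv).
toy = pd.DataFrame({
    "UserID": [1, 2, 3, 4, 5, 6],
    "Treat": ["control", "treatment", "control", "control", "treatment", "control"],
    "Retention": [0, 1, 1, 0, 0, 1],
})

ctrl = toy[toy["Treat"] == "control"].copy()
ctrl["cumu_ret"] = ctrl["Retention"].cumsum()      # retained users seen so far
ctrl["n_seen"] = list(range(1, len(ctrl) + 1))     # control users seen so far
ctrl["ret_rate"] = ctrl["cumu_ret"] / ctrl["n_seen"]

# The last running rate equals the overall retention rate of the group.
final_rate = ctrl["ret_rate"].iloc[-1]
print(final_rate, ctrl["Retention"].mean())
```

With this denominator the final value of the running rate coincides with the group's overall retention rate.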


- For treatment group

In [63]:
dft = df[df['Treat'] == 'treatment'].copy()  # copy so new columns can be added without SettingWithCopyWarning

In [64]:
Nt = len(dft.index)
St = dft['Retention'].sum()

In [65]:
dft['cumu_ret'] = np.around(np.cumsum(dft['Retention']))

In [66]:
dft['ret_rate'] = dft['cumu_ret'] / dft['UserID']

In [67]:
dft.describe()

Unnamed: 0,UserID,Retention,cumu_ret,ret_rate
count,2500.0,2500.0,2500.0,2500.0
mean,2504.4836,0.06,76.1392,0.030788
std,1443.82634,0.237534,41.565261,0.004883
min,3.0,0.0,0.0,0.0
25%,1251.5,0.0,44.0,0.029518
50%,2509.5,0.0,80.0,0.030871
75%,3771.5,0.0,107.0,0.033155
max,4999.0,1.0,150.0,0.044872


### Question 1

Does the given data have enough samples to determine the differences in retention rates for the control & treatment groups?

##### We will use the `proportion_effectsize` function from the statsmodels package.
The inputs will be the following:

- Pc (Proportion of the control group)
- Pt (Proportion of the treatment group)
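For reference, the usual effect size for two proportions is Cohen's h, the difference of the arcsine-transformed proportions. The sketch below computes it by hand with the standard library, on the assumption that this is what `proportion_effectsize` returns with its default settings; the helper name `cohens_h` is illustrative only.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-square-root transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Retention rates seen in this notebook: control 0.04, treatment 0.06.
h = cohens_h(0.04, 0.06)
print(f"Cohen's h = {h:.4f}")
```

The sign is negative because the control rate is the lower of the two; only the magnitude matters for the power calculation.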

In [68]:
Pc = Sc / Nc
Pt = St / Nt

In [69]:
(Pc, Pt)

(0.04, 0.06)

In [70]:
p_effect_size = sms.proportion_effectsize(Pc, Pt)

##### We will now use the `sms.NormalIndPower.solve_power` function from the statsmodels package.

The inputs will be the following:
- effect_size
- alpha
- power
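To make the role of these three inputs concrete, the required size per group can be approximated in closed form as $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{\text{power}}}{h}\right)^2$. The sketch below evaluates this with the standard library only. It is a normal approximation for two equal-sized groups, not the solver statsmodels uses, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

alpha, power = 0.05, 0.8

# Cohen's h for the rates seen in this notebook (control 0.04, treatment 0.06).
h = 2 * math.asin(math.sqrt(0.04)) - 2 * math.asin(math.sqrt(0.06))

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = NormalDist().inv_cdf(power)          # quantile matching the target power

# Normal approximation for two equal-sized groups; it ignores the negligible
# probability of rejecting in the wrong tail.
n_per_group = 2 * ((z_alpha + z_power) / h) ** 2
print(math.ceil(n_per_group))
```

A smaller effect size, a smaller alpha, or a higher power all increase the required sample size.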

In [71]:
alpha = 0.05
power = 0.8

In [72]:
required_n = sms.NormalIndPower().solve_power(effect_size=p_effect_size, alpha=alpha, power=power)

In [73]:
print(f"The required sample size is: {math.ceil(required_n)}")

The required sample size is: 1846


In [74]:
print(f"The treatment group has sufficient sample size: {Nt} (required is {math.ceil(required_n)})")

The treatment group has sufficient sample size: 2500 (required is 1846)


In [75]:
print(f"The control group has sufficient sample size: {Nc} (required is {math.ceil(required_n)})")

The control group has sufficient sample size: 2500 (required is 1846)


### Question 2

What is the retention rate for the control group?

In [78]:
pd.crosstab(df['Treat'], df['Retention'], normalize='index')

Retention,0,1
Treat,Unnamed: 1_level_1,Unnamed: 2_level_1
control,0.96,0.04
treatment,0.94,0.06
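Because Retention is coded 0/1, the same per-group rates can also be read off as group means. A minimal sketch on made-up data (the frame `toy` is illustrative, not the experiment data):

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "Treat": ["control", "control", "control", "control",
              "treatment", "treatment", "treatment", "treatment"],
    "Retention": [0, 0, 0, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the retention rate.
rates = toy.groupby("Treat")["Retention"].mean()
print(rates)
```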


In [77]:
Pc

0.04

### Question 3

What is the value of the z-statistic that you obtained from the hypothesis test?

In [83]:
Nc
Nt
Sc
St

alpha
power

0.8

In [84]:
from statsmodels.stats.proportion import confint_proportions_2indep

##### We will be using `confint_proportions_2indep` for getting CI of proportions.

In [89]:
?confint_proportions_2indep

Signature:
confint_proportions_2indep(
    count1,
    nobs1,
    count2,
    nobs2,
    method=None,
    compare='diff',
    alpha=0.05,
    correction=True,
)
Docstring:
Confidence intervals for comparing two independent proportions.

This assumes that we have two independent binomial samples.

Parameters
----------
count1, nobs1 : float
    Count and sample size for first sample.
count2, nobs2 : float
    Count and sample size for the second sample.
method : str
    Method for computing confidence interval. If method is None, then a
    default method is used. The default m

In [90]:
Nc

2500

In [91]:
(lower, upper) = confint_proportions_2indep(Sc, Nc, St, Nt, compare='diff', alpha=0.05)

In [92]:
print(f"The 95% confidence interval for the difference in proportions (control - treatment) is: [{lower:.3f}, {upper:.3f}]")

The 95% confidence interval for the difference in proportions (control - treatment) is: [-0.032, -0.008]
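As a rough cross-check, a simple Wald interval can be computed by hand from the counts in this notebook (100 of 2500 retained in control, 150 of 2500 in treatment). This is a swapped-in textbook method, not necessarily the method `confint_proportions_2indep` uses by default, so small differences from the library result are to be expected in general.

```python
import math
from statistics import NormalDist

# Counts seen in this notebook.
s_c, n_c = 100, 2500   # control: retained users, group size
s_t, n_t = 150, 2500   # treatment: retained users, group size

p_c, p_t = s_c / n_c, s_t / n_t
diff = p_c - p_t       # control minus treatment, same order as the call above

# Unpooled standard error of the difference (Wald interval).
se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z = NormalDist().inv_cdf(0.975)  # critical value for a 95% interval

lower, upper = diff - z * se, diff + z * se
print(f"[{lower:.3f}, {upper:.3f}]")
```

The interval lies entirely below zero, meaning the control retention rate is lower than the treatment rate.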


### Hypothesis Test for Difference in Proportions



Let $p_1$ and $p_2$ be the true retention proportions of the control and treatment groups, respectively.

Our null and alternate hypotheses are as follows:

$$H_0: p_1-p_2 = 0$$
$$H_a: p_1-p_2 \neq 0$$

> This will be a two-sided test, with $\alpha$ = 0.05

We'll be performing this test directly using the methods from `statsmodels` library

In [94]:
from statsmodels.stats.proportion import test_proportions_2indep

In [95]:
?test_proportions_2indep

Signature:
test_proportions_2indep(
    count1,
    nobs1,
    count2,
    nobs2,
    value=None,
    method=None,
    compare='diff',
    alternative='two-sided',
    correction=True,
    return_results=True,
)
Docstring:
Hypothesis test for comparing two independent proportions

This assumes that we have two independent binomial samples.

The Null and alternative hypothesis are

for compare = 'diff'

- H0: prop1 - prop2 - value = 0
- H1: prop1 - prop2 - value != 0  if alte

##### The inputs will be:

- Sc (Number of retained users in the control group)
- Nc (Sample size of control group)
- St (Number of retained users in the treatment group)
- Nt (Sample size of treatment group)
- alternative = 'two-sided' (We want to detect a difference in either direction)
- compare = 'diff' (We want to test the difference in proportions)

In [98]:
z_stat, pval = test_proportions_2indep(Sc, Nc, St, Nt, compare='diff', alternative='two-sided')

In [103]:
print(f"The z-statistic is: {z_stat:.2f}")

The z-statistic is: -3.24


In [106]:
print(f"The p-val is: {pval:.3f}")

The p-val is: 0.001


In [107]:
pval < alpha

True
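As a sanity check on the statistic above, the classic pooled two-proportion z-test can be worked out by hand. This is a textbook variant and may not be the exact method `test_proportions_2indep` applies by default, so agreement is expected only approximately; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts seen in this notebook.
s_c, n_c = 100, 2500   # control: retained users, group size
s_t, n_t = 150, 2500   # treatment: retained users, group size

p_c, p_t = s_c / n_c, s_t / n_t

# Under H0 both groups share one proportion, estimated by pooling the counts.
p_pool = (s_c + s_t) / (n_c + n_t)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))

z = (p_c - p_t) / se
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Since the p-value is below $\alpha$ = 0.05, the null hypothesis of equal retention rates is rejected.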

### Question 4

What is the 95% confidence interval for the difference in the retention rates for the control and treatment groups?

