# Module 1: Data Science Fundamentals

## Sprint 2: Statistical Tests and Experiments

## Let's analyse a Udacity A/B test!

<div><img style="height: 350px;" src="https://upload.wikimedia.org/wikipedia/commons/3/3b/Udacity_logo.png" /></div>

## Background

For the last day of this sprint, we analyze the results of an online A/B test that Udacity once ran. The dataset is part of the Udacity/Google A/B testing course, from which we watched a couple of lessons in Subproject 2 of this sprint.

Data is available at https://docs.google.com/spreadsheets/d/1Mu5u9GrybDdska-ljPXyBjTpdZIUev_6i7t4LRDfXM8/edit#gid=0

---------

## How to start?

The data consists of two sheets, one for each group of the test. Download the sheets as CSVs and union them to form one dataset.

## Concepts to explore

You will mostly focus on statistical hypothesis testing in this project on real-life data.

## Requirements

* Visualize the 95% confidence interval for the difference between the control and experiment groups on the [click-through rate](https://en.wikipedia.org/wiki/Click-through_rate) metric. Explain what the confidence interval means and how it relates to the population of a group.
* Verify whether the difference in each of the metrics between the control and experiment groups is statistically significant using a z-test at the 95% confidence level.
* Verify whether the difference in each of the metrics between the control and experiment groups is statistically significant using a t-test at the 95% confidence level.
* Compare the results of both test methods. Explain why they differ or do not differ much.
* Choose one method (either z or t) and explore the statistical significance of any metric under different confidence levels: 60%, 90%, 95%, 99%. If conclusions about significance differ under different confidence levels, explain why.
* Calculate p-values.

## Evaluation Criteria

- Correctness of used test methods.
- Soundness of explanations given.
- Adherence to the requirements.


## Sample correction questions

During a correction, you may get asked questions that test your understanding of covered topics.

- Why collect data from a sample rather than from a population?
- What is the Central Limit Theorem and why is it important?
- Explain confidence intervals and significance in statistics.
- Explain what a p-value is.
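The Central Limit Theorem question can be explored with a small simulation. The sketch below is only an illustration and is not part of the Udacity data: it assumes an exponential population with mean 1 and standard deviation 1, and the function name `sample_means`, the sample size, and the seed are arbitrary choices. It draws many samples, averages each one, and compares the spread of those averages with the value the theorem predicts, `sigma / sqrt(n)`.

```python
import math
import random
import statistics


def sample_means(sample_size, n_samples, seed=0):
    """Draw n_samples samples from a skewed (exponential) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


sample_size = 50
means = sample_means(sample_size, n_samples=2000)

mean_of_means = statistics.fmean(means)          # should be close to the population mean (1)
spread_of_means = statistics.stdev(means)        # observed standard error of the mean
predicted_spread = 1.0 / math.sqrt(sample_size)  # CLT prediction: sigma / sqrt(n)

print(f"mean of sample means:  {mean_of_means:.3f}")
print(f"spread of sample means: {spread_of_means:.3f}")
print(f"predicted by the CLT:   {predicted_spread:.3f}")
```

Although the population is strongly skewed, the sample means cluster symmetrically around the population mean, which is what lets the z-test and the confidence intervals below rely on the normal distribution.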

# **Introduction**

## Experiment Overview: Free Trial Screener

Udacity tested a change on their website: if a student clicked the "start free trial" button on the course overview page, they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they were taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message appeared saying that Udacity courses usually require a greater time commitment for successful completion and suggesting that the student might like to access the course materials for free. At this point, the student could either continue enrolling in the free trial or access the course materials for free instead.

The hypothesis is that the free trial screener will reduce the number of students who leave the free trial for lack of time and increase the likelihood that students who continue past the free trial eventually make payments and complete the course.

The overall business objective is to increase the likelihood that students who continue past the free trial will make payments and complete their courses.


H0: For a given metric, there is no difference between students who went through the free trial screener and those who did not.

Ha: For a given metric, there is a difference between students who went through the free trial screener and those who did not.
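For a rate metric, the two hypotheses can be written in symbols. The notation here is an assumption introduced for illustration: $p_{\text{ctl}}$ and $p_{\text{exp}}$ stand for the true (population) rates in the control and experiment groups.

```latex
H_0:\; p_{\text{exp}} - p_{\text{ctl}} = 0
\qquad\text{versus}\qquad
H_a:\; p_{\text{exp}} - p_{\text{ctl}} \neq 0
```

The alternative is two-sided, so a difference in either direction counts as evidence against $H_0$.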

Metrics:

* Click-through rate
* Retention
* Gross conversion
* Net conversion
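The four metrics are ratios of the daily counts, using the same definitions as the notebook cells further down. As a minimal sketch, the helper below (the name `daily_metrics` is made up for this example) computes them for one day; the numbers are the control counts for Sat, Oct 11 from the data preview.

```python
def daily_metrics(pageviews, clicks, enrollments, payments):
    """Return the four experiment metrics for one day of one group."""
    return {
        "click_through_rate": clicks / pageviews,  # clicks per pageview
        "retention": payments / enrollments,       # payments per enrollment
        "gross_conversion": enrollments / clicks,  # enrollments per click
        "net_conversion": payments / clicks,       # payments per click
    }


# Control group, Sat, Oct 11 (first row of the data preview)
metrics = daily_metrics(pageviews=7723, clicks=687, enrollments=134, payments=70)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The printed values should match the `ctr_control`, `retention_control`, `GrossConversion_control` and `NetConversion_control` columns for that row.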

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [20]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [3]:
control =  pd.read_csv('/content/drive/My Drive/125/Control.csv')

experiment = pd.read_csv('/content/drive/My Drive/125/Experiment.csv')


In [4]:
data = pd.merge(control, experiment, on='Date', how='inner', suffixes=('_ctl', '_exp'))  # one row per date; suffixes mark the group

In [5]:
data.head()

Unnamed: 0,Date,Pageviews_ctl,Clicks_ctl,Enrollments_ctl,Payments_ctl,Pageviews_exp,Clicks_exp,Enrollments_exp,Payments_exp
0,"Sat, Oct 11",7723,687,134.0,70.0,7716,686,105.0,34.0
1,"Sun, Oct 12",9102,779,147.0,70.0,9288,785,116.0,91.0
2,"Mon, Oct 13",10511,909,167.0,95.0,10480,884,145.0,79.0
3,"Tue, Oct 14",9871,836,156.0,105.0,9867,827,138.0,92.0
4,"Wed, Oct 15",10014,837,163.0,64.0,9793,832,140.0,94.0


## Sanity Check

In [6]:
total_control = data.Pageviews_ctl.sum()
total_experiment = data.Pageviews_exp.sum()
observed_fraction_control = total_control / (total_control + total_experiment)

In [7]:
standard_deviation = ((0.5*0.5) / (total_control + total_experiment)) ** 0.5
round(standard_deviation, 6)

margin_of_error = standard_deviation * 1.96

lower_bound = 0.5 - margin_of_error
upper_bound = 0.5 + margin_of_error
print(f'The confidence interval is between {lower_bound:.4f} and {upper_bound:.4f}')

print(f'Observed fraction of Control is {observed_fraction_control:.4f}')

The confidence interval is between 0.4988 and 0.5012
Observed fraction of Control is 0.5006


The observed fraction of pageviews in the control group lies within the 95% interval expected under an even 50/50 split, so the sanity check passes.
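The same check can be wrapped in a reusable function. This is a sketch using only the standard library; the function name `split_sanity_check` is hypothetical, and the counts are the pageview totals of just the first five days shown in the data preview, not the full experiment, so the interval is wider than the one printed above.

```python
import math
from statistics import NormalDist


def split_sanity_check(count_control, count_experiment,
                       expected_share=0.5, confidence=0.95):
    """Check whether the control share of traffic is consistent with the expected split."""
    total = count_control + count_experiment
    # Standard error of a proportion under the expected split
    standard_error = math.sqrt(expected_share * (1 - expected_share) / total)
    # Two-sided critical value, about 1.96 for 95% confidence
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = expected_share - z_critical * standard_error
    upper = expected_share + z_critical * standard_error
    observed = count_control / total
    return lower, upper, observed, lower <= observed <= upper


# Pageviews summed over the first five days in the preview (Oct 11 to Oct 15)
lower, upper, observed, passed = split_sanity_check(47221, 47144)
print(f"expected interval: [{lower:.4f}, {upper:.4f}]")
print(f"observed control share: {observed:.4f}, passed: {passed}")
```

The same function could be applied to clicks, which should also be unaffected by the screener because the change happens only after the click.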

## 95% Confidence Interval for Click-Through Rate

Task: compute the 95% confidence interval for the difference in click-through rate between the control and experiment groups, and explain what the confidence interval means and how it relates to the population of a group.

Click-through rate = number of clicks / number of pageviews

In [27]:
p1 = data.Clicks_ctl.sum() / data.Pageviews_ctl.sum() #proportion of clicks in control
n1 = data.Pageviews_ctl.sum()                         #total pageviews in control

p2 = data.Clicks_exp.sum() / data.Pageviews_exp.sum() #proportion of click in experiment
n2 = data.Pageviews_exp.sum()                         #total pageviews in experiment

p_diff = p2 - p1                                       #difference between the two proportions

standard_error = np.sqrt(((p1 * (1- p1))/n1 + (p2 * (1- p2))/n2 ))  #unpooled standard error of the difference
z_critical = stats.norm.ppf(q = 0.975)                              #two-sided critical value at 95% confidence level
margin_of_error = standard_error * z_critical                       #half-width of the 95% confidence interval

upperbound = p_diff + margin_of_error
lowerbound = p_diff - margin_of_error

print(f'The confidence interval is between {lowerbound:.{4}f} and {upperbound:.{4}f}')

The confidence interval is between -0.0012 and 0.0014
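The interval above contains 0, so at the 95% level the data are consistent with no difference in click-through rate between the groups. What "95% confidence" means can be illustrated with a simulation: if the experiment were repeated many times, about 95% of the intervals built this way would contain the true population difference. The sketch below is an illustration under assumed values, not a result from the Udacity data: both groups are given a hypothetical true rate of 0.085, the group size and number of repetitions are arbitrary, and the function names are made up.

```python
import math
import random
from statistics import NormalDist


def diff_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p2 - p1 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z_critical * standard_error, diff + z_critical * standard_error


def coverage(true_p1, true_p2, n, repetitions, seed=0):
    """Share of simulated experiments whose interval contains the true difference."""
    rng = random.Random(seed)
    true_diff = true_p2 - true_p1
    hits = 0
    for _ in range(repetitions):
        # Simulate the number of clicks among n pageviews in each group
        x1 = sum(rng.random() < true_p1 for _ in range(n))
        x2 = sum(rng.random() < true_p2 for _ in range(n))
        low, high = diff_ci(x1, n, x2, n)
        hits += low <= true_diff <= high
    return hits / repetitions


share = coverage(true_p1=0.085, true_p2=0.085, n=1000, repetitions=400)
print(f"share of intervals containing the true difference: {share:.3f}")
```

The printed share should land near 0.95. The population difference is a fixed number; it is the interval that varies from sample to sample.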


## T-Test for Metrics


In [40]:
data['ctr_control']= data['Clicks_ctl'] / data['Pageviews_ctl']
data['ctr_experiment']= data['Clicks_exp'] / data['Pageviews_exp']

In [60]:
df = data[data.Enrollments_ctl.notnull()]
data2 = df.copy()

In [61]:
data2['retention_control']= data2['Payments_ctl'] / data2['Enrollments_ctl']
data2['retention_experiment']= data2['Payments_exp'] / data2['Enrollments_exp']

data2['GrossConversion_control']= data2['Enrollments_ctl'] / data2['Clicks_ctl']
data2['GrossConversion_experiment']= data2['Enrollments_exp'] / data2['Clicks_exp']

data2['NetConversion_control']= data2['Payments_ctl'] / data2['Clicks_ctl']
data2['NetConversion_experiment']= data2['Payments_exp'] / data2['Clicks_exp']

In [62]:
data2.head()

Unnamed: 0,Date,Pageviews_ctl,Clicks_ctl,Enrollments_ctl,Payments_ctl,Pageviews_exp,Clicks_exp,Enrollments_exp,Payments_exp,ctr_control,ctr_experiment,retention_control,retention_experiment,GrossConversion_control,GrossConversion_experiment,NetConversion_control,NetConversion_experiment
0,"Sat, Oct 11",7723,687,134.0,70.0,7716,686,105.0,34.0,0.088955,0.088906,0.522388,0.32381,0.195051,0.153061,0.101892,0.049563
1,"Sun, Oct 12",9102,779,147.0,70.0,9288,785,116.0,91.0,0.085586,0.084518,0.47619,0.784483,0.188703,0.147771,0.089859,0.115924
2,"Mon, Oct 13",10511,909,167.0,95.0,10480,884,145.0,79.0,0.086481,0.084351,0.568862,0.544828,0.183718,0.164027,0.10451,0.089367
3,"Tue, Oct 14",9871,836,156.0,105.0,9867,827,138.0,92.0,0.084693,0.083815,0.673077,0.666667,0.186603,0.166868,0.125598,0.111245
4,"Wed, Oct 15",10014,837,163.0,64.0,9793,832,140.0,94.0,0.083583,0.084959,0.392638,0.671429,0.194743,0.168269,0.076464,0.112981


1. Test of difference in clickthrough rates

In [31]:
t_stat, p_val= stats.ttest_ind(data.ctr_control, data.ctr_experiment)
t_stat , p_val

(-0.08302608446561602, 0.9340611962072063)

2. Test of difference in retention rates

In [64]:
t_stat, p_val= stats.ttest_ind(data2.retention_control, data2.retention_experiment)
t_stat , p_val

(-1.0081408912731535, 0.3188989662061842)

3. Test of difference in gross conversion

In [65]:
t_stat, p_val= stats.ttest_ind(data2.GrossConversion_control, data2.GrossConversion_experiment)
t_stat , p_val

(1.5396752696188791, 0.13080105104323278)

4. Test of difference in net conversion

In [67]:
t_stat, p_val= stats.ttest_ind(data2.NetConversion_control, data2.NetConversion_experiment)
t_stat , p_val

(0.5387777625331603, 0.5927558614268024)
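The requirements also ask for a z-test. The t-tests above compare the daily rates, whereas a two-proportion z-test works on the totals and uses a pooled proportion for the standard error. The sketch below uses only the standard library; the function name `two_proportion_z_test` is hypothetical, and the counts are the clicks and pageviews of just the first five days in the data preview, so the result is illustrative and not the result for the full experiment.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate under H0
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z_stat = (p2 - p1) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value


# Click-through rate: clicks and pageviews summed over Oct 11 to Oct 15
z_stat, p_value = two_proportion_z_test(x1=4048, n1=47221, x2=4014, n2=47144)
print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")
print("significant at 95%" if p_value < 0.05 else "not significant at 95%")
```

With group sizes in the tens of thousands, the t distribution is practically indistinguishable from the normal distribution. Any remaining gap between z-test and t-test results comes mainly from what is being compared: pooled totals here, a few dozen daily rates in the t-tests.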

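Whether a result counts as significant depends on the chosen confidence level, because the p-value is compared against `alpha = 1 - confidence`. The sketch below takes the four t-test p-values reported above (rounded to four decimals) and checks them against the 60%, 90%, 95% and 99% levels; the helper name `significant_at` is made up for this example.

```python
# p-values from the t-tests above, rounded to four decimals
p_values = {
    "click-through rate": 0.9341,
    "retention": 0.3189,
    "gross conversion": 0.1308,
    "net conversion": 0.5928,
}
confidence_levels = [0.60, 0.90, 0.95, 0.99]


def significant_at(p_value, confidence):
    """Reject H0 when the p-value is below alpha = 1 - confidence."""
    return p_value < 1 - confidence


# For each metric, list the confidence levels at which H0 is rejected
results = {
    metric: [level for level in confidence_levels if significant_at(p, level)]
    for metric, p in p_values.items()
}

for metric, levels in results.items():
    label = ", ".join(f"{level:.0%}" for level in levels) or "none"
    print(f"{metric}: significant at {label}")
```

Retention and gross conversion are significant only at the 60% level, where a 40% chance of a false positive is tolerated; at 90% and above no metric is. A higher confidence level demands stronger evidence, so the same data can lead to different conclusions.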

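In [None]:
# NOTE: this cell has not been run. The columns `group`, `landing_page`, `converted` and `date`
# do not exist in the Udacity data loaded above, so it does not work on `data` as is.
# Intent: daily conversion rate of treatment-group users who saw the new page.
treatment = data.loc[(data.group == 'treatment') & (data.landing_page == 'new_page')]
treatment_converted= treatment.loc[treatment.converted==1]
treatment_converted_group= treatment_converted.groupby('date')['converted'].count().reset_index()  # conversions per day
treatment_total_group= treatment.groupby('date')['landing_page'].count().reset_index()             # visitors per day
treatment_converted_group['treatment_conversion_rate'] = treatment_converted_group['converted']/treatment_total_group['landing_page']
treatment_converted_group.head()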