# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The scenario you'll investigate uses data collected from the homepage of Audacity, a music app.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies with the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [1]:
#Your code here
import numpy as np
import pandas as pd

df = pd.read_csv('homepage_actions.csv')
df.head(10)

#### Investigating the id column

In [2]:
action_view = df.loc[(df['action'] == 'view')]
len(action_view)

6328

In [3]:
action_click = df.loc[(df['action'] == 'click')]
len(action_click)

1860

##### Are there any anomalies with the data; did anyone click who didn't view?

In [4]:
viewSet = set(action_view.id)
clickSet = set(action_click.id)
x = viewSet.intersection(clickSet)
len(x)

1860
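
The intersection above counts ids that appear in both sets; because it has the same size as the set of clickers (1860), every clicker also has a view. A more direct check for the anomaly is a set difference. The following is a minimal sketch on a small made-up frame (the ids and rows are hypothetical, not taken from homepage_actions.csv):

```python
import pandas as pd

# Hypothetical toy data: id 3 clicked without a recorded view
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "action": ["view", "click", "view", "click"],
})

view_ids = set(toy.loc[toy["action"] == "view", "id"])
click_ids = set(toy.loc[toy["action"] == "click", "id"])

# ids that clicked but never viewed; an empty set means no anomaly
clicked_without_view = click_ids - view_ids
print(clicked_without_view)
```

On the lab data, the equivalent expression would be `set(action_click.id) - set(action_view.id)`, which should come back empty if no one clicked without viewing.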

In [5]:
# Count rows whose id has already appeared earlier in the frame
print(f"id column has {df['id'].duplicated().sum()} duplicate values")

id column has 1860 duplicate values


#### Dropping the duplicates

In [6]:
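# Rows are in timestamp order, so keeping the last row per id
# retains a user's click (if any) instead of their earlier view
df.drop_duplicates(subset='id', keep='last', inplace=True)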
len(df)

6328
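
Keeping the last row per id works here because each user's click follows their view in the file. A sketch that makes that ordering explicit, on made-up rows (the data and the name `deduped` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: user 1 viewed and then clicked, user 2 only viewed
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-09-24 17:00:00", "2016-09-24 17:05:00", "2016-09-24 18:00:00",
    ]),
    "id": [1, 1, 2],
    "action": ["view", "click", "view"],
})

# Sort explicitly so "last" means "latest", then keep one row per id
deduped = (
    toy.sort_values("timestamp")
       .drop_duplicates(subset="id", keep="last")
       .reset_index(drop=True)
)
print(deduped[["id", "action"]])
```

Returning a new frame rather than using `inplace=True` leaves the raw rows available for later checks that need every action.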

#### How many viewers also clicked?

In [7]:
print(f"{len(x)} viewers also clicked")

1860 viewers also clicked


#### Is there any overlap between the control and experiment groups?

In [8]:
expGroup = df.loc[(df['group'] == 'experiment')]
len(expGroup)

2996

In [9]:
ctrGroup = df.loc[(df['group'] == 'control')]
len(ctrGroup)

3332

In [10]:
len(ctrGroup) - len(expGroup)

336

In [11]:
expGroup['action'].value_counts()

view     2068
click     928
Name: action, dtype: int64

In [12]:
ctrGroup['action'].value_counts()

view     2400
click     932
Name: action, dtype: int64
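
The group sizes and action counts above do not by themselves show whether any id was assigned to both groups. One way to check is to count the distinct groups per id, sketched here on made-up rows (hypothetical data; on the lab data this would need to run on the frame as loaded, since after `drop_duplicates` each id has only one row):

```python
import pandas as pd

# Hypothetical toy data: id 2 appears in both groups
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "group": ["control", "control", "control", "experiment", "experiment"],
})

# Number of distinct groups each id was assigned to
groups_per_id = toy.groupby("id")["group"].nunique()

# ids seen in more than one group
overlap_ids = groups_per_id[groups_per_id > 1].index.tolist()
print(overlap_ids)
```

An empty list would indicate that the control and experiment groups are disjoint.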

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

#### Hypothesis

_null hypothesis_ : the experimental homepage was no more effective than the control homepage (its click-through rate is the same or lower)

_alternative hypothesis_ : the experimental homepage was more effective than the control homepage (its click-through rate is higher)

In [13]:
#Your code here
import scipy.stats as stats


In [14]:
dframe = pd.crosstab(df['group'], df['action'])
dframe

| group | click | view |
|---|---|---|
| control | 932 | 2400 |
| experiment | 928 | 2068 |


#### Testing our null hypothesis

In [15]:
result = stats.contingency.chi2_contingency(dframe)
chi, p, dof, exp = result
result

(6.712921132285344,
 0.009571680497042247,
 1,
 array([[ 979.38053097, 2352.61946903],
        [ 880.61946903, 2115.38053097]]))

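Because the p-value (approximately 0.0096) is less than the significance level (alpha = 0.05), we reject the null hypothesis. The chi-square test is two-sided, so the direction of the difference has to be read from the click counts themselves.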
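
The chi-square result says the groups differ but not which one performed better. A quick look at the click-through rates, using the counts from the crosstab above, fills that in (a sketch; the variable names are illustrative):

```python
# Counts copied from the crosstab output above
control_clicks, control_views_only = 932, 2400
experiment_clicks, experiment_views_only = 928, 2068

control_n = control_clicks + control_views_only            # users in control
experiment_n = experiment_clicks + experiment_views_only   # users in experiment

# Click-through rate = users who clicked / users in the group
control_ctr = control_clicks / control_n
experiment_ctr = experiment_clicks / experiment_n

print(f"control CTR:    {control_ctr:.4f}")
print(f"experiment CTR: {experiment_ctr:.4f}")
```

The experiment group's rate is the higher of the two, which is the direction the alternative hypothesis describes.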

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [16]:
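#Your code here
# Note: exp[1][0] is the chi-square expected click count for the experiment
# group under the pooled click-through rate, not the control group's own rate
exp[1][0]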

880.6194690265487
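
Taken literally, this step uses the control group's own click-through rate rather than the pooled rate behind the chi-square expected counts. A sketch of that calculation, using the counts from the earlier outputs (variable names are illustrative):

```python
# Counts taken from the value_counts() outputs earlier in the lab
control_clicks, control_n = 932, 3332
experiment_n = 2996

# Control click-through rate applied to the experiment group's size
control_rate = control_clicks / control_n
expected_experiment_clicks = experiment_n * control_rate
print(round(expected_experiment_clicks, 2))
```

This comes out lower than the pooled figure because the pooled rate is pulled upward by the experiment group's own clicks.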

In [17]:
df['conversion'] = np.where(df['action'] == 'click', 1,0)
df

| | timestamp | id | group | action | conversion |
|---|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view | 0 |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view | 0 |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view | 0 |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view | 0 |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view | 0 |
| ... | ... | ... | ... | ... | ... |
| 8183 | 2017-01-18 09:11:41.984113 | 192060 | experiment | view | 0 |
| 8184 | 2017-01-18 09:42:12.844575 | 755912 | experiment | view | 0 |
| 8185 | 2017-01-18 10:01:09.026482 | 458115 | experiment | view | 0 |
| 8186 | 2017-01-18 10:08:51.588469 | 505451 | control | view | 0 |


### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [37]:
n = expGroup.shape[0]


In [29]:
expGroup.action.value_counts()
#expGroup.action.value_counts()[1]

view     2068
click     928
Name: action, dtype: int64

In [36]:
# Observed click-through rate of the experiment group (label-based lookup)
p = expGroup.action.value_counts()['click']/expGroup.shape[0]
p

0.3097463284379172

In [41]:
#Your code here
# Binomial variance n * p * (1 - p), here with the experiment group's observed rate as p
var = n * p * (1 - p)
var

640.5554072096129
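
The cells above compute the variance from the experiment group's observed rate and stop before the z-score. A sketch of the full step under the lab's framing, where the experiment group is assumed to share the control group's rate (counts copied from earlier outputs; names are illustrative):

```python
import math

# Counts taken from the value_counts() outputs earlier in the lab
control_clicks, control_n = 932, 3332
experiment_clicks, experiment_n = 928, 2996

p_control = control_clicks / control_n
expected_clicks = experiment_n * p_control

# Binomial variance n * p * (1 - p) with the control rate as p
variance = experiment_n * p_control * (1 - p_control)
std = math.sqrt(variance)

# Number of standard deviations between actual and expected clicks
z_score = (experiment_clicks - expected_clicks) / std
print(round(z_score, 2))
```

Using the control rate for both the expectation and the variance keeps the calculation consistent with the assumption that the new homepage changed nothing.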

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [19]:
#Your code here
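
One way to complete this step is to take the upper-tail probability of the standard normal distribution at the z-score. The sketch below rebuilds the z-score from the counts in the earlier outputs so that it runs on its own, and uses only the standard library (names are illustrative):

```python
import math

# Same counts and z-score construction as in Step 2
control_clicks, control_n = 932, 3332
experiment_clicks, experiment_n = 928, 2996

p_control = control_clicks / control_n
std = math.sqrt(experiment_n * p_control * (1 - p_control))
z_score = (experiment_clicks - experiment_n * p_control) / std

# One-sided upper-tail probability of a standard normal: P(Z > z)
p_value = 0.5 * math.erfc(z_score / math.sqrt(2))
print(p_value)
```

With SciPy already imported as `stats`, `stats.norm.sf(z_score)` gives the same upper-tail probability. This is a one-sided p-value, so it is not expected to equal the two-sided chi-square p-value exactly.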

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the initial problem in a suitable manner. You also saw how to verify results, which strengthened your knowledge of binomial variables and reviewed the central limit theorem, standard deviation, z-scores, and their accompanying p-values.