### **Analyze A/B Test Results**

### Dain Russell, 2020

### Udacity Data Analyst Nanodegree Project 3






## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)





<a id='intro'></a>
### Introduction

A/B tests are commonly run by data analysts and data scientists, so it is worth getting some practice with the difficulties they present.

For this project, I worked to understand the results of an A/B test run by an e-commerce website. The company has developed a new web page to try to increase the number of users who "convert," meaning the number of users who decide to pay for the company's product. My goal in this notebook was to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before making a decision.

<a id='probability'></a>
#### Part I - Probability

To get started, we set up the import statements for all of the packages we plan to use.



In [4]:
# Import statements for all of the packages used in this notebook
import pandas as pd
import numpy as np
import random
import csv
import seaborn as sns
from scipy.stats import norm
import matplotlib.pyplot as plt

# Magic command so that visualizations are plotted inline
%matplotlib inline

# Set the seed so that results are reproducible.
# Note: this seeds Python's random module only, not NumPy's generator.
random.seed(42)
print("Set up complete")




Set up complete


**1. Let's read in the ab_data.csv data and store it in df.**

**a. Read in the dataset and take a look at the top few rows here:**

In [6]:
# Load the CSV file with the pandas read_csv function
df = pd.read_csv('ab_data.csv')
# Display the first five rows of the dataframe
df.head()

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


**b. Use the below cell to find the number of rows in the dataset.**

In [7]:
# df.shape returns (number of rows, number of columns)
print("There are {} rows and {} columns in the dataset.".format(df.shape[0], df.shape[1]))


There are 294478 rows and 5 columns in the dataset.


**c. The number of unique users in the dataset.**

In [8]:
# Count the unique users in the dataset
unique_users = df.user_id.nunique()
print("There are {} unique users in the dataset".format(unique_users))


There are 290584 unique users in the dataset


**d. The proportion of users converted.**

In [9]:
# Percentage of rows for each unique value in the converted column
df['converted'].value_counts(normalize=True) * 100

0    88.034081
1    11.965919
Name: converted, dtype: float64

About 12% of the rows (11.97%) record a conversion, meaning the user decided to pay for the company's product.

**e. The number of times the new_page and treatment don't match.**

In [6]:
# Count the rows where membership in the treatment group does not correspond with the new landing page
df[(df['group'] == 'treatment') != (df['landing_page'] == 'new_page')].shape[0]

3893
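One way to see where these mismatches come from is to cross-tabulate `group` against `landing_page`: the off-diagonal cells are the mismatched rows. The sketch below is only an illustration on a small made-up frame (`toy` is hypothetical and is not the project data), so the counts it prints are not the counts for `ab_data.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for ab_data.csv
toy = pd.DataFrame({
    'group':        ['control', 'control', 'treatment', 'treatment', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page', 'new_page'],
})

# Rows are groups, columns are landing pages
table = pd.crosstab(toy['group'], toy['landing_page'])
print(table)

# Mismatches: control rows that saw new_page plus treatment rows that saw old_page
mismatches = table.loc['control', 'new_page'] + table.loc['treatment', 'old_page']
print(mismatches)
```

The total of the two off-diagonal cells should agree with the boolean-mask count used in the cell above when both are run on the same frame.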

**f. Do any of the rows have missing values?**

In [7]:
# Raw dataset summary, followed by the number of missing values in each column
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

There are no missing values in any column.

**`2.` For the rows where treatment does not match with new_page or control does not match with old_page, we cannot be sure if this row truly received the new or old page.**

**a. Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz.  Store your new dataframe in df2**.

In [22]:
# Reload the CSV file with the pandas read_csv function
df = pd.read_csv('ab_data.csv')

# Keep in df2 only the rows where treatment is aligned with new_page
# or control is aligned with old_page. The .copy() makes df2 independent of df:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
df2 = df[((df.group == 'treatment') & (df.landing_page == 'new_page')) |
         ((df.group == 'control') & (df.landing_page == 'old_page'))].copy()

In [16]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

0

**`3.` Use df2 and the cells below to answer questions for Quiz 3 in the classroom.**

a. How many unique **user_id**s are in **df2**?

In [23]:
# Count the unique users in df2 (not the raw df)
unique_users = df2.user_id.nunique()
print("There are" + " " + str(unique_users) + " " + "unique users in df2")


There are 290584 unique users in df2


b. There is one **user_id** repeated in **df2**.  What is it?

In [11]:
# Display the user_id of the duplicated rows
df2[df2.duplicated(['user_id'], keep=False)]['user_id']

1899    773192
2893    773192
Name: user_id, dtype: int64

c. What is the row information for the repeat user_id?

In [12]:
# displaying row information for duplicate user_ids
df2[df2.duplicated(['user_id'], keep = False)]

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


In [24]:
#using the shape to see how many rows we have before dropping the duplicate row.
df2.shape

(290585, 5)

d. Remove one of the rows with a duplicate user_id, but keep your dataframe as df2.

In [25]:
# Drop one of the rows that belongs to the repeated user_id
# (drop_duplicates keeps the first occurrence by default)
df2 = df2.drop_duplicates(subset='user_id')

In [15]:
#using shape to confirm the row has been dropped
df2.shape

(290584, 5)
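Which of the two duplicate rows survives depends on the `keep` argument of `drop_duplicates`. The sketch below illustrates this on a small made-up frame (`toy` and its second user_id are hypothetical). In this project both duplicate rows have `converted` equal to 0, so the choice does not affect the conversion rates.

```python
import pandas as pd

# Hypothetical toy data: user 773192 appears twice, as in df2
toy = pd.DataFrame({
    'user_id':   [773192, 630000, 773192],
    'timestamp': ['2017-01-09', '2017-01-10', '2017-01-14'],
})

# keep='first' (the default) retains the first occurrence in frame order
first = toy.drop_duplicates(subset='user_id', keep='first')
# keep='last' retains the last occurrence in frame order instead
last = toy.drop_duplicates(subset='user_id', keep='last')

print(first['timestamp'].tolist())
print(last['timestamp'].tolist())
```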

**`4.` Use df2 in the cells below to answer the quiz questions related to Quiz 4 in the classroom.**

a. What is the probability of an individual converting regardless of the page they receive?

In [26]:
# The rounded probability of an individual converting regardless of the page they receive

round(float(df2['converted'].mean()),4)

0.1196

b. Given that an individual was in the `control` group, what is the probability they converted?

In [27]:
# Conversion rate within the control group, rounded to four decimals
control_probability = round(float((df2.query('group == "control"')['converted'] == 1).mean()),4)
control_probability

0.1204

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [28]:
# Conversion rate within the treatment group, rounded to four decimals
treatment_probability = round(float((df2.query('group == "treatment"')['converted'] == 1).mean()),4)
treatment_probability

0.1188
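Because `converted` is coded as 0 or 1, the two conditional probabilities in (b) and (c) are simply per-group means, and they can be computed in one step with `groupby`. The sketch below is an illustration on made-up data (`toy` is hypothetical), not a rerun of the project cells.

```python
import pandas as pd

# Hypothetical toy data: four users per group
toy = pd.DataFrame({
    'group':     ['control', 'control', 'control', 'control',
                  'treatment', 'treatment', 'treatment', 'treatment'],
    'converted': [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column within each group is that group's conversion rate
rates = toy.groupby('group')['converted'].mean()
print(rates)
```

Applied to `df2`, the same call would be expected to reproduce the control and treatment rates before rounding.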

d. What is the probability that an individual received the new page?
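No cell is shown for part (d). A minimal sketch of one way to compute it, assuming the probability is taken as the share of rows whose `landing_page` equals `'new_page'`, is given below on made-up data (`toy` is hypothetical); the value for `df2` is not reproduced here.

```python
import pandas as pd

# Hypothetical toy data: three of four users received the new page
toy = pd.DataFrame({
    'landing_page': ['new_page', 'old_page', 'new_page', 'new_page'],
})

# The mean of a boolean mask is the proportion of True values
p_new_page = (toy['landing_page'] == 'new_page').mean()
print(p_new_page)
```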

e. Consider your results from parts (a) through (d) above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions.
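As a starting point for part (e), the rates reported in (b) and (c) can be compared directly. The sketch below only performs that subtraction, using the rounded values shown above; a raw difference says nothing on its own about statistical significance, which is the subject of Part II.

```python
# Rounded conversion rates reported in parts (b) and (c)
control_probability = 0.1204
treatment_probability = 0.1188

# Observed difference in conversion rate (treatment minus control)
observed_diff = round(treatment_probability - control_probability, 4)
print(observed_diff)
```

In this sample the treatment rate is 0.0016 lower than the control rate, so these probabilities alone do not show the new page leading to more conversions.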