## Analyze A/B Test Results


## Table of Contents
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='probability'></a>
### Part I - Probability


In [38]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

random.seed(42)

`1.` First, read the data and get a basic overview of it.

In [2]:
df=pd.read_csv('ab_data.csv')
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


b. Number of rows in the data

In [3]:
df.shape[0]

294478

c. Number of unique users in the dataset. `nunique()` counts the distinct `user_id` values.

In [4]:
df.user_id.nunique()

290584

d. Proportion of users converted. We filter the rows where `converted` is 1, count them, and divide by the total number of rows.

In [5]:
df[df['converted']==1].user_id.count()/df.user_id.count()

0.11965919355605512
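Because `converted` only holds 0 and 1, the same proportion can also be read as the mean of the column. A minimal sketch on a hypothetical toy frame (the values are invented for illustration and are not taken from `ab_data.csv`):

```python
import pandas as pd

# Hypothetical toy data: 1 = converted, 0 = not converted
toy = pd.DataFrame({"user_id": [1, 2, 3, 4, 5],
                    "converted": [0, 1, 0, 0, 1]})

# Count-based proportion, same pattern as the cell above
by_count = toy[toy["converted"] == 1].user_id.count() / toy.user_id.count()

# Mean-based proportion: the mean of a 0/1 column is the share of ones
by_mean = toy["converted"].mean()

print(by_count, by_mean)
```

Both expressions give the same number; the mean is simply shorter to write.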

e. Number of times the `new_page` and `treatment` don't match.

A mismatch is a row where "new_page" appears with "control", or "old_page" appears with "treatment". Here, `user_id.count()` returns the number of such rows.

In [6]:
df[((df["landing_page"])=='new_page')&(df['group']=='control')|(df["landing_page"]=='old_page')&(df['group']=='treatment')].user_id.count()

3893
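The same mismatch count can be expressed as a comparison of two boolean tests. The sketch below uses a hypothetical four-row frame (invented values) to show the idea; the names `is_treatment`, `is_new_page` and `mismatched` are illustrative only:

```python
import pandas as pd

# Hypothetical toy data with two aligned rows and two mismatched rows
toy = pd.DataFrame({
    "group": ["control", "treatment", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
})

is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"

# A row is mismatched when exactly one of the two tests is True
mismatched = is_treatment != is_new_page
n_mismatched = int(mismatched.sum())

# The aligned rows are the complement of the mismatched ones
aligned = toy[~mismatched].copy()

print(n_mismatched)
```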

f. Number of rows containing missing values

To check for missing values, we can use `info()`, which reports the number of non-null entries for each column. Since every count equals the total number of rows (294478), there are no null values in the data.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


`2.` Storing the rows that match in a new DataFrame.

We filter `df` to keep only the rows where group and landing page agree. The expression compares two boolean tests, `group == 'treatment'` and `landing_page == 'new_page'`: they are equal when both are true (treatment with new page) or both are false (control with old page). `df2` keeps only the rows where this comparison is True.

In [8]:
df2=df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == True]

In [9]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
df2[df2['group']=='control'].head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1
5,936923,2017-01-10 15:20:49.083499,control,old_page,0
7,719014,2017-01-17 01:48:29.539573,control,old_page,0


`3.` Users and repeat users in the new database

a. Number of users in df2

`nunique()` counts the distinct `user_id` values, which gives the number of users.

In [10]:
df2['user_id'].nunique()

290584

b. Duplicate users

We want to find the duplicated `user_id`. Filtering with `duplicated()` shows the repeated row. There is only one repeat user.

In [11]:
df2[df2['user_id'].duplicated() == True]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


c. User ID of the duplicate user

We filter the data on this `user_id` to see both of its rows.

In [12]:
df2[df2['user_id']==773192]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


d. Duplicate user removal

`drop_duplicates` is the way to get rid of duplicated values. We pass 'user_id' so that each user appears only once in the dataset, and `inplace=True` applies the change to `df2` itself.

In [13]:
df2.drop_duplicates('user_id',inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

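The warning above typically appears when a filtered frame is modified in place. One possible way to avoid it, sketched on a hypothetical toy frame (the names `toy`, `subset` and `deduped` are illustrative), is to take an explicit copy when filtering and to reassign the result of `drop_duplicates` instead of using `inplace=True`:

```python
import pandas as pd

# Hypothetical toy data with one repeated user_id
toy = pd.DataFrame({
    "user_id": [10, 11, 11, 12],
    "group": ["control", "treatment", "treatment", "control"],
})

# .copy() makes the filtered frame independent of the original one
subset = toy[toy["group"].isin(["control", "treatment"])].copy()

# Reassign instead of inplace=True; keep="first" is the pandas default
deduped = subset.drop_duplicates(subset="user_id", keep="first")

print(deduped["user_id"].tolist())
```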

`4.` Probability

a. Conversion rate

I've named this variable `df2_CR` because it's the conversion rate. As before, we filter the rows that converted, then divide their number by the total number of rows, using `shape[0]` to get each row count.

In [14]:
df2_CR=df2[df2['converted']==1].shape[0]/df2.shape[0]
df2_CR

0.11959708724499628

b. Conversion rate in the control group

We do the same operation, but filter on the group to keep only the "control" segment.

In [15]:
# Parentheses around each comparison: & binds tighter than ==
df2_CR_control=df2[(df2['group']=='control')&(df2['converted']==1)].shape[0]/df2[df2['group']=='control'].shape[0]
df2_CR_control

0.1203863045004612

c. Conversion rate in the treatment group

Same thing in order to get the "treatment" segment.

In [16]:
# Parentheses around each comparison: & binds tighter than ==
df2_CR_treatment= df2[(df2['group']=='treatment')&(df2['converted']==1)].shape[0]/df2[df2['group']=='treatment'].shape[0]
df2_CR_treatment

0.11880806551510564

d. Probability that a user receives the new page

The operation is slightly different: we only filter the "group" variable to keep the "treatment" rows, and `shape[0]` again gives the number of rows.

In [17]:
df2[df2['group']=='treatment'].shape[0]/df2.shape[0]

0.5000619442226688

In part « d », we saw that observations are split about 50/50 between "control" and "treatment". We cannot say that the new page leads to more conversions: the treatment rate (11.88%) is in fact slightly lower than the control rate (12.04%), a gap of about 0.16 percentage points. This suggests that the new page doesn't have the impact we might have expected.
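The per-group conversion rates can also be computed in one step with `groupby`. A minimal sketch on hypothetical toy data (invented values, not the experiment's figures):

```python
import pandas as pd

# Hypothetical toy data: four users per group
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 column per group = conversion rate per group
rates = toy.groupby("group")["converted"].mean()

# Treatment rate minus control rate
observed_diff = rates["treatment"] - rates["control"]

print(rates)
print(observed_diff)
```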

<a id='ab_test'></a>
### Part II - A/B Test

`1.` Hypothesis

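$H_0: p_{old} \geq p_{new}$

$H_1: p_{old} < p_{new}$

where $p_{old}$ and $p_{new}$ are the conversion rates of the old and new pages.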

α=0.05

`2.` A/B test calculation

a. Conversion Rate for $p_{new}$ under the null? 

Under the null hypothesis, p_new and p_old are assumed to be the same, so both are set to the overall conversion rate: the number of conversions divided by the number of observations.

In [18]:
p_new=df2[df2['converted']==1].shape[0]/df2.shape[0]

b. Conversion Rate for $p_{old}$ under the null?

In [19]:
p_old=df2[df2['converted']==1].shape[0]/df2.shape[0]

c. Number of individuals in the treatment group?

As we did it before, we slice the df2 dataset only to get the "treatment" group and we use shape[0] to get the number of rows.

In [20]:
n_new=df2[df2['group']=='treatment'].shape[0]
n_new

145310

d. Number of individuals in the control group?

Same thing but with the "control" group

In [21]:
n_old=df2[df2['group']=='control'].shape[0]
n_old

145274

e. Transactions simulation in the treatment page with hypothesis probability

We draw random conversions for the new page with probability p_new, so that the simulation stays under the null hypothesis.

In [22]:
new_page_converted=np.random.choice(2,size=n_new,p=[(1-p_new),p_new])
new_page_converted.mean()

0.11902828435758035

f. Transactions simulation in the control page with hypothesis probability

In [23]:
old_page_converted=np.random.choice(2,size=n_old,p=[(1-p_old),p_old])
old_page_converted.mean()

0.11775679061635255

g. Difference between control and treatment page with hypothesis probability 

`new_page_converted` and `old_page_converted` hold '1' when there is a conversion and '0' when there isn't, so the mean of each array is its simulated conversion rate. We subtract the `old_page_converted` mean from the `new_page_converted` mean.

In [24]:
pages_difference=new_page_converted.mean()-old_page_converted.mean()

h. Loop

In [25]:
p_diffs=[]
for _ in range(10000):
    # One simulated experiment per iteration: binomial conversion counts turned into rates
    new_converted_simulation = np.random.binomial(n_new, p_new) / n_new
    old_converted_simulation = np.random.binomial(n_old, p_old) / n_old
    page_difference = new_converted_simulation - old_converted_simulation
    p_diffs.append(page_difference)

i. Histogram

In [26]:
p_diffs=np.array(p_diffs)
plt.hist(p_diffs);


j. P-value and significance

I had some doubts because of the exercises I did before, but those exercises didn't simulate from the null hypothesis, so I followed the method as it was explained. The observed statistic is the new page conversion rate minus the old page conversion rate. I calculated the p-value as the proportion of the simulated p_diffs that are greater than this observed difference.

In [27]:
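pages_difference_mean=df2_CR_treatment-df2_CR_control
# Compare the simulated differences with the observed difference, not with the single simulated draw
p_value=(p_diffs>pages_difference_mean).mean()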

The p-value is the probability, assuming the null hypothesis is true, of observing a difference at least as extreme as the one in our data. We simulated differences in conversion rates under the null to estimate it. As the p-value is above alpha (0.05), we cannot reject the null hypothesis. So the new page conversion rate isn't considered greater than the old page conversion rate.
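A simulation under the null can also be written without a Python loop. The sketch below is one possible vectorized version, not the notebook's own cell: it reuses the group sizes and conversion rates reported earlier, and the generator `rng`, its seed, and the names `sim_diffs` and `p_value_sim` are assumptions made for illustration.

```python
import numpy as np

# Figures reported earlier in this notebook
n_new, n_old = 145310, 145274
p_null = 0.11959708724499628  # overall conversion rate (df2_CR)
observed_diff = 0.11880806551510564 - 0.1203863045004612  # treatment - control

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# 10,000 simulated experiments: binomial conversion counts turned into rates
new_rates = rng.binomial(n_new, p_null, size=10_000) / n_new
old_rates = rng.binomial(n_old, p_null, size=10_000) / n_old
sim_diffs = new_rates - old_rates

# One-sided p-value: share of simulated differences above the observed one
p_value_sim = (sim_diffs > observed_diff).mean()

print(p_value_sim)
```

Because the observed difference is negative, most simulated differences lie above it, which is why the one-sided p-value is large.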

l. Stats Proportion Test

I want to know the number of observations and of conversions for control and treatment group

In [28]:
import statsmodels.api as sm

convert_old = df2[(df2['group']=='control')&(df2['converted']==1)].shape[0]
convert_new = df2[(df2['group']=='treatment')&(df2['converted']==1)].shape[0]
n_old = n_old
n_new = n_new

m. Calculation

In [29]:
from statsmodels.stats.proportion import proportions_ztest
z_stats,p_value = proportions_ztest(convert_new, n_new,value=convert_old/n_old)
print("z Stats:",str(z_stats),". P-value:",str(p_value),".")

z Stats: -1.859354929150913 . P-value: 0.06297684585982084 .


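The z statistic measures how far the observed treatment conversion rate lies from the reference value (here the control conversion rate), in units of standard error: the difference is multiplied by the square root of the number of observations and divided by the standard deviation. The p-value is then read from the normal distribution.
This p-value is not directly comparable to the simulated one, because this call runs a one-sample, two-sided test against the control rate treated as a fixed value. At 0.063 it is still above alpha = 0.05, so the null hypothesis is not rejected; it would only be rejected if alpha were 0.1 or greater.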
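For comparison, a two-sample version of the z-test, matching the one-sided hypothesis of Part II, can be worked out by hand. This is a sketch under stated assumptions: the conversion counts are reconstructed from the rates and group sizes reported above, and the pooled standard error is the usual textbook choice, not something taken from the notebook.

```python
import math

# Group sizes and conversion rates reported earlier in this notebook
n_new, n_old = 145310, 145274
rate_new, rate_old = 0.11880806551510564, 0.1203863045004612

# Assumption of this sketch: counts reconstructed from the rates
convert_new = round(rate_new * n_new)
convert_old = round(rate_old * n_old)

# Pooled proportion and standard error under H0 (one common rate)
p_pool = (convert_new + convert_old) / (n_new + n_old)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))

# z statistic for the difference new - old
z = (convert_new / n_new - convert_old / n_old) / se

# One-sided p-value for H1: new rate > old rate, i.e. P(Z > z)
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

print(z, p_one_sided)
```

With statsmodels, a comparable call would probably be `proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')`, which passes both groups instead of a single count and a reference `value`.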

<a id='regression'></a>
### Part III - A regression approach


I will use a logistic regression because the dependent variable has only two possible outcomes (conversion or no conversion); logistic regression is designed for a binary dependent variable, whereas linear regression is not.


b. Dummy variables

In [30]:
df2['intercept']=1
df2[['control','ab_page']]=pd.get_dummies(df2['group'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


c. After that, I will use the logistic regression to get the p-value. The "control" dummy is left out: together with the intercept, the "control" and "ab_page" columns would be perfectly collinear, so only "ab_page" (the treatment indicator) is used. `fit()` estimates the model and `summary()` gives us the p-value.

In [31]:
df2.head()
lm=sm.Logit(df2['converted'],df2[['intercept','ab_page']] )
fit=lm.fit()
fit.summary()


Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6


0,1,2,3
Dep. Variable:,converted,No. Observations:,290584.0
Model:,Logit,Df Residuals:,290582.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 20 Jan 2020",Pseudo R-squ.:,8.077e-06
Time:,21:44:09,Log-Likelihood:,-106390.0
converged:,True,LL-Null:,-106390.0
Covariance Type:,nonrobust,LLR p-value:,0.1899

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-1.9888,0.008,-246.669,0.000,-2.005,-1.973
ab_page,-0.0150,0.011,-1.311,0.190,-0.037,0.007


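d. The p-value for `ab_page` (0.190) is still above alpha = 0.05, so we again fail to reject the null hypothesis. It differs from the p-value of Part II because the regression tests whether the coefficient is different from zero, which is a two-sided test.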

e. Why we need an intercept

The intercept is needed as the baseline: it is the log-odds of conversion for the control group (`ab_page` = 0). The `ab_page` coefficient is then the change in log-odds for the new page relative to that baseline, and its p-value tests whether this coefficient differs from zero, rather than whether the new page converts better than the old one.
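To make the intercept concrete, the coefficients from the summary table above can be converted back into probabilities with the inverse logit. A small sketch (the helper `inverse_logit` is defined here for illustration):

```python
import math

# Coefficients taken from the logistic regression summary above
intercept = -1.9888
ab_page = -0.0150

def inverse_logit(x):
    """Convert log-odds into a probability."""
    return 1 / (1 + math.exp(-x))

# Baseline (control group) and treatment group conversion probabilities
p_control = inverse_logit(intercept)
p_treatment = inverse_logit(intercept + ab_page)

# exp(coefficient) is the odds ratio of treatment versus control
odds_ratio = math.exp(ab_page)

print(p_control, p_treatment, odds_ratio)
```

The two probabilities land close to the control and treatment conversion rates computed in Part I (about 0.1204 and 0.1188), which is what the intercept and the `ab_page` coefficient encode.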

f. Need of various variables in a logistic regression

Adding other variables can reveal effects that a single predictor hides. For example, the new page might raise conversions in some countries and lower them in others. If we ignore country, these opposite effects could cancel out, and we would keep the null hypothesis even though the page's effect depends on the visitor's country.

g. Getting a new dataset

In [33]:
countries=pd.read_csv('countries.csv')
countries.head()

Unnamed: 0,user_id,country
0,834778,UK
1,928468,US
2,822059,UK
3,711597,UK
4,710616,UK


We have to join the two tables. Since `user_id` identifies the rows in both, I used `set_index` so that both tables share the same index, then `join` did the rest.

In [34]:
df_compiled= df2.set_index('user_id').join(countries.set_index('user_id'))
df_compiled['country'].value_counts()

US    203619
UK     72466
CA     14499
Name: country, dtype: int64
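After a join like this, it can be worth checking that every user found a country. A minimal sketch on hypothetical toy tables (the frames `users` and `lookup` and their values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy tables sharing a user_id key, in different row orders
users = pd.DataFrame({"user_id": [1, 2, 3], "converted": [0, 1, 0]})
lookup = pd.DataFrame({"user_id": [3, 1, 2], "country": ["UK", "US", "CA"]})

# Same pattern as above: align both tables on user_id, then join
joined = users.set_index("user_id").join(lookup.set_index("user_id"))

# Users without a matching country would show up as NaN
missing = int(joined["country"].isna().sum())

print(joined)
print(missing)
```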

The logistic regression needs dummy variables to test the country effect, so I used `get_dummies` and assigned the resulting columns to 'CA', 'UK' and 'US' (the alphabetical order in which `get_dummies` returns them).

In [35]:
df_compiled[['CA','UK','US']]=pd.get_dummies(df_compiled['country'])
df_compiled[df_compiled['country']=='CA'].head()

user_id,timestamp,group,landing_page,converted,intercept,control,ab_page,country,CA,UK,US
679687,2017-01-19 03:26:46.940749,treatment,new_page,1,1,0,1,CA,1,0,0
839785,2017-01-15 18:11:06.610965,treatment,new_page,1,1,0,1,CA,1,0,0
650559,2017-01-24 11:55:51.084801,control,old_page,0,1,1,0,CA,1,0,0
640693,2017-01-19 20:22:19.970560,treatment,new_page,0,1,0,1,CA,1,0,0
698331,2017-01-22 21:14:30.175017,control,old_page,0,1,1,0,CA,1,0,0
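A sketch of the dummy encoding on a hypothetical toy column, with the baseline category dropped explicitly. `dtype=int` is passed so the dummies are 0/1 integers regardless of the pandas version's default:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"country": ["US", "UK", "CA", "US"]})

# get_dummies returns one column per category, sorted alphabetically
dummies = pd.get_dummies(toy["country"], dtype=int)

# Drop 'US' so that it acts as the baseline category in a regression
toy = toy.join(dummies.drop(columns="US"))

print(toy)
```

Dropping one category avoids perfect collinearity with the intercept, which is the reason one country column is left out of the regression.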


First, I checked whether country alone has an effect. The 'US' column is left out so that it serves as the baseline for comparison. None of the p-values in this logistic regression is below 0.05, so we cannot conclude that country has an effect on the conversion rate.

In [36]:
lm=sm.Logit(df_compiled['converted'],df_compiled[['intercept','UK','CA']] )
fit=lm.fit()
fit.summary()

Optimization terminated successfully.
         Current function value: 0.366116
         Iterations 6


0,1,2,3
Dep. Variable:,converted,No. Observations:,290584.0
Model:,Logit,Df Residuals:,290581.0
Method:,MLE,Df Model:,2.0
Date:,"Mon, 20 Jan 2020",Pseudo R-squ.:,1.521e-05
Time:,21:45:08,Log-Likelihood:,-106390.0
converged:,True,LL-Null:,-106390.0
Covariance Type:,nonrobust,LLR p-value:,0.1984

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-1.9967,0.007,-292.314,0.000,-2.010,-1.983
UK,0.0099,0.013,0.746,0.456,-0.016,0.036
CA,-0.0408,0.027,-1.518,0.129,-0.093,0.012


h. Result of multiple variables

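Combining the page and country variables changes nothing: every p-value (0.191 for `ab_page`, 0.457 for UK, 0.130 for CA) stays above 0.05. Contrary to what I expected, there is no sign of a country-dependent effect, although this model only contains main effects and no explicit interaction term.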

In [37]:
lm=sm.Logit(df_compiled['converted'],df_compiled[['intercept','ab_page','UK','CA']] )
fit=lm.fit()
fit.summary()

Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6


0,1,2,3
Dep. Variable:,converted,No. Observations:,290584.0
Model:,Logit,Df Residuals:,290580.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 20 Jan 2020",Pseudo R-squ.:,2.323e-05
Time:,21:45:10,Log-Likelihood:,-106390.0
converged:,True,LL-Null:,-106390.0
Covariance Type:,nonrobust,LLR p-value:,0.176

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-1.9893,0.009,-223.763,0.000,-2.007,-1.972
ab_page,-0.0149,0.011,-1.307,0.191,-0.037,0.007
UK,0.0099,0.013,0.743,0.457,-0.016,0.036
CA,-0.0408,0.027,-1.516,0.130,-0.093,0.012
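The model above contains only main effects. To test whether the page effect differs by country, interaction columns could be added as products of the existing dummies. A sketch on hypothetical toy data; the column names `ab_UK` and `ab_CA` are illustrative and do not appear in the notebook:

```python
import pandas as pd

# Hypothetical toy data using the same dummy columns as above
toy = pd.DataFrame({
    "ab_page": [1, 0, 1, 0],
    "UK": [1, 1, 0, 0],
    "CA": [0, 0, 1, 0],
})
toy["intercept"] = 1

# Interaction terms: 1 only when the user saw the new page AND is in that country
toy["ab_UK"] = toy["ab_page"] * toy["UK"]
toy["ab_CA"] = toy["ab_page"] * toy["CA"]

# Feature list that could then be passed to sm.Logit, e.g.
# sm.Logit(df_compiled['converted'], df_compiled[features])
features = ["intercept", "ab_page", "UK", "CA", "ab_UK", "ab_CA"]

print(toy[features])
```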


<a id='conclusions'></a>
## Conclusion

> Neither the A/B test nor the logistic regressions showed a significant difference between the control and treatment groups. Although I've done my best, I’m waiting for feedback to improve my work.