# Analyze A/B Test Results 

This project will assure you have mastered the subjects covered in the statistics lessons. We have organized the current notebook into the following sections: 

- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)
- [Final Check](#finalcheck)
- [Submission](#submission)

Specific programming tasks are marked with a **ToDo** tag. 

<a id='intro'></a>
## Introduction

A/B tests are very commonly performed by data analysts and data scientists. For this project, you will be working to understand the results of an A/B test run by an e-commerce website.  Your goal is to work through this notebook to help the company understand if they should:
- Implement the new webpage, 
- Keep the old webpage, or 
- Perhaps run the experiment longer to make their decision.

Each **ToDo** task below has an associated quiz present in the classroom.  Though the classroom quizzes are **not necessary** to complete the project, they help ensure you are on the right track as you work through the project, and you can feel more confident in your final submission meeting the [rubric](https://review.udacity.com/#!/rubrics/1214/view) specification. 

>**Tip**: Though it's not a mandate, students can attempt the classroom quizzes to ensure statistical numeric values are calculated correctly in many cases.

<a id='probability'></a>
## Part I - Probability

To get started, let's import our libraries.

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
#We are setting the seed to assure you get the same answers on quizzes as we set up
random.seed(42)

### ToDo 1.1
Now, read in the `ab_data.csv` data. Store it in `df`. Below is the description of the data, there are a total of 5 columns:

<center>

|Data columns|Purpose|Valid values|
| ------------- |:-------------| -----:|
|user_id|Unique ID|Int64 values|
|timestamp|Time stamp when the user visited the webpage|-|
|group|In the current A/B experiment, the users are categorized into two broad groups. <br>The `control` group users are expected to be served with `old_page`; and `treatment` group users are matched with the `new_page`. <br>However, **some inaccurate rows** are present in the initial data, such as a `control` group user is matched with a `new_page`. |`['control', 'treatment']`|
|landing_page|It denotes whether the user visited the old or new webpage.|`['old_page', 'new_page']`|
|converted|It denotes whether the user decided to pay for the company's product. Here, `1` means yes, the user bought the product.|`[0, 1]`|
</center>
Use your dataframe to answer the questions in Quiz 1 of the classroom.


>**Tip**: Please save your work regularly.

**a.** Read in the dataset from the `ab_data.csv` file and take a look at the top few rows here:

In [3]:
import numpy as np
import pandas as pd
df=pd.read_csv('ab_data.csv')
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


**b.** Use the cell below to find the number of rows in the dataset.

In [7]:
number_of_rows=len(df)
print(number_of_rows)



294478


**c.** The number of unique users in the dataset.

In [13]:
import pandas as pd
df=pd.read_csv('ab_data.csv')
#number_of_unique_users=df['user_id'].value_counts()
#print(number_of_unique_users)
df.nunique()
#df.index.is_unique

user_id         290584
timestamp       294478
group                2
landing_page         2
converted            2
dtype: int64

**d.** The proportion of users converted.

In [20]:
import numpy as np
import pandas as pd
df=pd.read_csv('ab_data.csv')
proportion_of_users_converted=df.query('converted=="1"').count()[0]/df.shape[0]

print(proportion_of_users_converted)

0.119659193556


**e.** The number of times when the "group" is `treatment` but "landing_page" is not a `new_page`.

In [21]:
df.query('group=="treatment" and landing_page=="old_page"').count()[0]
#df.groupby('group').nunique()

1965

**f.** Do any of the rows have missing values?

In [1]:
import numpy as np
import pandas as pd
df=pd.read_csv('ab_data.csv')
df.isnull()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


### ToDo 1.2  
In a particular row, the **group** and **landing_page** columns should have either of the following acceptable values:

|user_id| timestamp|group|landing_page|converted|
|---|---|---|---|---|
|XXXX|XXXX|`control`| `old_page`|X |
|XXXX|XXXX|`treatment`|`new_page`|X |


It means, the `control` group users should match with `old_page`; and `treatment` group users should matched with the `new_page`. 

However, for the rows where `treatment` does not match with `new_page` or `control` does not match with `old_page`, we cannot be sure if such rows truly received the new or old wepage.  


Use **Quiz 2** in the classroom to figure out how should we handle the rows where the group and landing_page columns don't match?

**a.** Now use the answer to the quiz to create a new dataset that meets the specifications from the quiz.  Store your new dataframe in **df2**.

In [24]:
# Remove the inaccurate rows, and store the result in a new dataframe df2
df2=df.drop_duplicates()
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [None]:
# Double Check all of the incorrect rows were removed from df2 - 
# Output of the statement below should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

### ToDo 1.3  
Use **df2** and the cells below to answer questions for **Quiz 3** in the classroom.

**a.** How many unique **user_id**s are in **df2**?

In [29]:
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
#df3=df[df['user_id'].duplicated()]
#print(df3)
df2['user_id'].is_unique

False

In [23]:
df2['user_id'].nunique()

290584

**b.** There is one **user_id** repeated in **df2**.  What is it?

In [2]:
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
o=df2.loc[df['user_id'].duplicated()]


**c.** Display the rows for the duplicate **user_id**? 

In [23]:
import pandas as pd
df=pd.read_csv('ab_data.csv')

df2=df.drop_duplicates()

df2[df2.duplicated('user_id')]

3894

In [18]:
import pandas as pd
df=pd.read_csv('ab_data.csv')
d=df[df.duplicated(subset=None,keep='last')]
d

Unnamed: 0,user_id,timestamp,group,landing_page,converted


**d.** Remove **one** of the rows with a duplicate **user_id**, from the **df2** dataframe.

In [15]:
import pandas as pd
df=pd.read_csv('ab_data.csv')


df2.drop_duplicates()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1
5,936923,2017-01-10 15:20:49.083499,control,old_page,0
6,679687,2017-01-19 03:26:46.940749,treatment,new_page,1
7,719014,2017-01-17 01:48:29.539573,control,old_page,0
8,817355,2017-01-04 17:58:08.979471,treatment,new_page,1
9,839785,2017-01-15 18:11:06.610965,treatment,new_page,1


In [11]:
# Remove one of the rows with a duplicate user_id..
# Hint: The dataframe.drop_duplicates() may not work in this case because the rows with duplicate user_id are not entirely identical. 
#df2.drop_duplicates()
# Check again if the row with a duplicate user_id is deleted or not


### ToDo 1.4  
Use **df2** in the cells below to answer the quiz questions related to **Quiz 4** in the classroom.

**a.** What is the probability of an individual converting regardless of the page they receive?<br><br>

>**Tip**: The probability  you'll compute represents the overall "converted" success rate in the population and you may call it $p_{population}$.



In [13]:
import numpy as np
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
#users[old+new]1/total[old+new]
p_population=df2.query('converted=="1"').count()[0]/df2['converted'].count()
#p_population=df2.converted.mean()
p_population

0.11965919355605512

**b.** Given that an individual was in the `control` group, what is the probability they converted?

In [38]:
#old users 1/old users total
p_old=df.query('group=="control" and converted=="1"').count()[0]/len(df[df['group']=='control'])
#p_old=df2[df2['group']=='control']['converted'].mean()
p_old

0.12039917935897611

**c.** Given that an individual was in the `treatment` group, what is the probability they converted?

In [3]:
#new users 1/new users total
p_new=df.query('group=="treatment" and converted=="1"').count()[0]/len(df[df['group']=='treatment'])

p_new

0.11891957956489856

>**Tip**: The probabilities you've computed in the points (b). and (c). above can also be treated as conversion rate. 
Calculate the actual difference  (`obs_diff`) between the conversion rates for the two groups. You will need that later.  

In [8]:
# Calculate the actual difference (obs_diff) between the conversion rates for the two groups.
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
p_old=df2.query('group=="control" and converted=="1"').count()[0]/len(df[df['group']=='control'])
p_new=df2.query('group=="treatment" and converted=="1"').count()[0]/len(df[df['group']=='treatment'])
obs_diff=p_new-p_old
obs_diff

-0.0014795997940775518

**d.** What is the probability that an individual received the new page?

In [9]:
import numpy as np
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
#n_p=df2.query('landing_page=="new_page"').mean()
new_page=df2.query('landing_page=="new_page"').count()[0]/df2['landing_page'].count()
print(new_page)

0.5


**e.** Consider your results from parts (a) through (d) above, and explain below whether the new `treatment` group users lead to more conversions.

>**Your answer goes here.**
<br>
<strong style="color:blue">As we have seen ,the results show that the old page lead to more conversions than the new page</strong>

<a id='ab_test'></a>
## Part II - A/B Test

Since a timestamp is associated with each event, you could run a hypothesis test continuously as long as you observe the events. 

However, then the hard questions would be: 
- Do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time?  
- How long do you run to render a decision that neither page is better than another?  

These questions are the difficult parts associated with A/B tests in general.  


### ToDo 2.1
For now, consider you need to make the decision just based on all the data provided.  

> Recall that you just calculated that the "converted" probability (or rate) for the old page is *slightly* higher than that of the new page (ToDo 1.4.c). 

If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should be your null and alternative hypotheses (**$H_0$** and **$H_1$**)?  

You can state your hypothesis in terms of words or in terms of **$p_{old}$** and **$p_{new}$**, which are the "converted" probability (or rate) for the old and new pages respectively.

>**Put your answer here.**
<p><strong>p_old=0.1203</strong></p>
<p><strong>p_new=0.1189</strong.</p>

### ToDo 2.2 - Null Hypothesis $H_0$ Testing
Under the null hypothesis $H_0$, assume that $p_{new}$ and $p_{old}$ are equal. Furthermore, assume that $p_{new}$ and $p_{old}$ both are equal to the **converted** success rate in the `df2` data regardless of the page. So, our assumption is: <br><br>
<center>
$p_{new}$ = $p_{old}$ = $p_{population}$
</center>

In this section, you will: 

- Simulate (bootstrap) sample data set for both groups, and compute the  "converted" probability $p$ for those samples. 


- Use a sample size for each group equal to the ones in the `df2` data.


- Compute the difference in the "converted" probability for the two samples above. 


- Perform the sampling distribution for the "difference in the converted probability" between the two simulated-samples over 10,000 iterations; and calculate an estimate. 



Use the cells below to provide the necessary parts of this simulation.  You can use **Quiz 5** in the classroom to make sure you are on the right track.

**a.** What is the **conversion rate** for $p_{new}$ under the null hypothesis? 

In [17]:


p_new_r=df2.query('landing_page=="new_page" and converted=="1"').count()[0]/len(df[df['landing_page']=='new_page'])
p_new_r

0.11884079625642663

**b.** What is the **conversion rate** for $p_{old}$ under the null hypothesis? 

In [18]:
p_old_r=df2.query('landing_page=="old_page" and converted=="1"').count()[0]/len(df[df['landing_page']=='old_page'])
p_old_r

0.12047759085568362

**c.** What is $n_{new}$, the number of individuals in the treatment group? <br><br>
*Hint*: The treatment group users are shown the new page.

In [12]:
n_new=df2.query('group=="treatment" and landing_page=="new_page"').count()[0]
n_new

145311

**d.** What is $n_{old}$, the number of individuals in the control group?

In [13]:
n_old=df2.query('group=="control" and landing_page=="old_page"').count()[0]
n_old

145274

**e. Simulate Sample for the `treatment` Group**<br> 
Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null hypothesis.  <br><br>
*Hint*: Use `numpy.random.choice()` method to randomly generate $n_{new}$ number of values. <br>
Store these $n_{new}$ 1's and 0's in the `new_page_converted` numpy array.


In [14]:
# Simulate a Sample for the treatment Group

sim_new=np.random.choice(2,145311,p=[0.1196,1-0.1196])
sim_new


array([0, 1, 1, ..., 1, 0, 0])

In [53]:
sim_new=np.random.choice(2,145311,p=[0.1196,1-0.1196])
sim_new.mean()

0.88039446428694312

**f. Simulate Sample for the `control` Group** <br>
Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null hypothesis. <br> Store these $n_{old}$ 1's and 0's in the `old_page_converted` numpy array.

In [47]:
# Simulate a Sample for the control Group

sim_old

array([1, 1, 1, ..., 0, 1, 0])

In [57]:
sim_old=np.random.choice(2,145274,p=[0.1196,1-0.1196])
sim_old.mean()

0.87920068284758457

**g.** Find the difference in the "converted" probability $(p{'}_{new}$ - $p{'}_{old})$ for your simulated samples from the parts (e) and (f) above. 

In [54]:
sim_new.mean()-sim_old.mean()


-0.00083686409941641227


**h. Sampling distribution** <br>
Re-create `new_page_converted` and `old_page_converted` and find the $(p{'}_{new}$ - $p{'}_{old})$ value 10,000 times using the same simulation process you used in parts (a) through (g) above. 

<br>
Store all  $(p{'}_{new}$ - $p{'}_{old})$  values in a NumPy array called `p_diffs`.

In [None]:
# Sampling distribution 
import pandas as pd
import numpy as np
import random
p_diffs = []
for _ in range (10000):
    sim_new=np.random.choice(2,145311,p=[0.1196,1-0.1196])
    sim_old=np.random.choice(2,145274,p=[0.1196,1-0.1196])
    p_diff=sim_new.mean()-sim_old.mean()
    p_diffs.append(p_diff)
print(p_diffs)    

**i. Histogram**<br> 
Plot a histogram of the **p_diffs**.  Does this plot look like what you expected?  Use the matching problem in the classroom to assure you fully understand what was computed here.<br><br>

Also, use `plt.axvline()` method to mark the actual difference observed  in the `df2` data (recall `obs_diff`), in the chart.  

>**Tip**: Display title, x-label, and y-label in the chart.

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
p_diffs = []
for _ in range (10000):
    sim_new=np.random.choice(2,145311,p=[0.1196,1-0.1196])
    sim_old=np.random.choice(2,145274,p=[0.1196,1-0.1196])
    p_diff=sim_new.mean()-sim_old.mean()
    p_diffs.append(p_diff)
    
plt.hist(p_diffs)
#plt.avxline(p_diffs,color='red')
plt.show()


**j.** What proportion of the **p_diffs** are greater than the actual difference observed in the `df2` data?

In [None]:
import numpy as np
p_diffs = []
for _ in range (10000):
    sim_new=np.random.choice(2,145311,p=[0.1196,1-0.1196])
    sim_old=np.random.choice(2,145274,p=[0.1196,1-0.1196])
    p_diff=sim_new.mean()-sim_old.mean()
    p_diffs.append(p_diff)
   
    
for f in p_diffs:
    sim_new=np.random.choice(2,145311,p=[0.1196,1-0.1196])
    sim_old=np.random.choice(2,145274,p=[0.1196,1-0.1196])
    p_diff=sim_new.mean()-sim_old.mean()
           
    if f>-0.000836:
        d=f.mean()
        print(d)
         
    
print(d)                         

In [None]:
#y=[8,9,5]
#for i in y:
   # if i>3:
       # print('yes')

**k.** Please explain in words what you have just computed in part **j** above.  
 - What is this value called in scientific studies?  
 - What does this value signify in terms of whether or not there is a difference between the new and old pages? *Hint*: Compare the value above with the "Type I error rate (0.05)". 

>**Put your answer here.**
<br>
<p style='color:blue'>the diff value does not equal diffs value because in diffs we are sampling numbers with replacement.</p>



**l. Using Built-in Methods for Hypothesis Testing**<br>
We could also use a built-in to achieve similar results.  Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. 

Fill in the statements below to calculate the:
- `convert_old`: number of conversions with the old_page
- `convert_new`: number of conversions with the new_page
- `n_old`: number of individuals who were shown the old_page
- `n_new`: number of individuals who were shown the new_page


In [None]:
import statsmodels.api as sm
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()

# number of conversions with the old_page
convert_old =df2.query('landing_page=="old_page" and converted=="1"').count()[0]
print(convert_old)
# number of conversions with the new_page
convert_new =df2.query('landing_page=="new_page" and converted=="1"').count()[0]
print(convert_new)
# number of individuals who were shown the old_page
n_old = df2.query('landing_page=="old_page"').count()[0]
print(n_old)
# number of individuals who received new_page
n_new = df2.query('landing_page=="new_page"').count()[0]
print(n_new)

**m.** Now use `sm.stats.proportions_ztest()` to compute your test statistic and p-value.  [Here](https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html) is a helpful link on using the built in.

The syntax is: 
```bash
proportions_ztest(count_array, nobs_array, alternative='larger')
```
where, 
- `count_array` = represents the number of "converted" for each group
- `nobs_array` = represents the total number of observations (rows) in each group
- `alternative` = choose one of the values from `[‘two-sided’, ‘smaller’, ‘larger’]` depending upon two-tailed, left-tailed, or right-tailed respectively. 
>**Hint**: <br>
It's a two-tailed if you defined $H_1$ as $(p_{new} = p_{old})$. <br>
It's a left-tailed if you defined $H_1$ as $(p_{new} < p_{old})$. <br>
It's a right-tailed if you defined $H_1$ as $(p_{new} > p_{old})$. 

The built-in function above will return the z_score, p_value. 

---
### About the two-sample z-test
Recall that you have plotted a distribution `p_diffs` representing the
difference in the "converted" probability  $(p{'}_{new}-p{'}_{old})$  for your two simulated samples 10,000 times. 

Another way for comparing the mean of two independent and normal distribution is a **two-sample z-test**. You can perform the Z-test to calculate the Z_score, as shown in the equation below:

$$
Z_{score} = \frac{ (p{'}_{new}-p{'}_{old}) - (p_{new}  -  p_{old})}{ \sqrt{ \frac{\sigma^{2}_{new} }{n_{new}} + \frac{\sigma^{2}_{old} }{n_{old}}  } }
$$

where,
- $p{'}$ is the "converted" success rate in the sample
- $p_{new}$ and $p_{old}$ are the "converted" success rate for the two groups in the population. 
- $\sigma_{new}$ and $\sigma_{new}$ are the standard deviation for the two groups in the population. 
- $n_{new}$ and $n_{old}$ represent the size of the two groups or samples (it's same in our case)


>Z-test is performed when the sample size is large, and the population variance is known. The z-score represents the distance between the two "converted" success rates in terms of the standard error. 

Next step is to make a decision to reject or fail to reject the null hypothesis based on comparing these two values: 
- $Z_{score}$
- $Z_{\alpha}$ or $Z_{0.05}$, also known as critical value at 95% confidence interval.  $Z_{0.05}$ is 1.645 for one-tailed tests,  and 1.960 for two-tailed test. You can determine the $Z_{\alpha}$ from the z-table manually. 

Decide if your hypothesis is either a two-tailed, left-tailed, or right-tailed test. Accordingly, reject OR fail to reject the  null based on the comparison between $Z_{score}$ and $Z_{\alpha}$. 
>Hint:<br>
For a right-tailed test, reject null if $Z_{score}$ > $Z_{\alpha}$. <br>
For a left-tailed test, reject null if $Z_{score}$ < $Z_{\alpha}$. 


In other words, we determine whether or not the $Z_{score}$ lies in the "rejection region" in the distribution. A "rejection region" is an interval where the null hypothesis is rejected iff the $Z_{score}$ lies in that region.



Reference: 
- Example 9.1.2 on this [page](https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Book%3A_Introductory_Statistics_(Shafer_and_Zhang)/09%3A_Two-Sample_Problems/9.01%3A_Comparison_of_Two_Population_Means-_Large_Independent_Samples), courtesy www.stats.libretexts.org

---

>**Tip**: You don't have to dive deeper into z-test for this exercise. **Try having an overview of what does z-score signify in general.** 

In [18]:
import statsmodels.api as sm
# ToDo: Complete the sm.stats.proportions_ztest() method arguments
convert_new =df2.query('landing_page=="new_page" and converted=="1"').count()[0]
convert_old =df2.query('landing_page=="old_page" and converted=="1"').count()[0]
n_old = df2.query('landing_page=="old_page"').count()[0]
n_new = df2.query('landing_page=="new_page"').count()[0]
z_score, p_value = sm.stats.proportions_ztest(count=[convert_new, convert_old], 
                                              nobs=[n_new, n_old])
print(z_score, p_value)

-1.36833414 0.171207509093


**n.** What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages?  Do they agree with the findings in parts **j.** and **k.**?<br><br>

>**Tip**: Notice whether the p-value is similar to the one computed earlier. Accordingly, can you reject/fail to reject the null hypothesis? It is important to correctly interpret the test statistic and p-value.

>**Put your answer here.**
<br>
<p> z-score=standard deviations

<a id='regression'></a>
### Part III - A regression approach

### ToDo 3.1 
In this final part, you will see that the result you achieved in the A/B test in Part II above can also be achieved by performing regression.<br><br> 

**a.** Since each row in the `df2` data is either a conversion or no conversion, what type of regression should you be performing in this case?

>**Put your answer here.**
<br>
<strong>logistic regression</strong>

**b.** The goal is to use **statsmodels** library to fit the regression model you specified in part **a.** above to see if there is a significant difference in conversion based on the page-type a customer receives. However, you first need to create the following two columns in the `df2` dataframe:
 1. `intercept` - It should be `1` in the entire column. 
 2. `ab_page` - It's a dummy variable column, having a value `1` when an individual receives the **treatment**, otherwise `0`.  

In [1]:
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
df2.insert(loc=5,column='intercept',value=1)
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1


In [5]:
import numpy as np
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
df2['ab_page']=np.where(df2['group']=="treatment",1,0)
df2.insert(loc=5,column='intercept',value=1)
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept,ab_page
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1,0


**c.** Use **statsmodels** to instantiate your regression model on the two columns you created in part (b). above, then fit the model to predict whether or not an individual converts. 


In [11]:
import statsmodels.api as sm
m=sm.OLS(df2['converted'],df2[['intercept','ab_page']])
xm=m.fit()
xm

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f1a15cfe518>

**d.** Provide the summary of your model below, and use it as necessary to answer the following questions.

In [12]:
xm.summary()

0,1,2,3
Dep. Variable:,converted,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,1.53
Date:,"Thu, 04 Nov 2021",Prob (F-statistic):,0.216
Time:,13:13:45,Log-Likelihood:,-86476.0
No. Observations:,294478,AIC:,173000.0
Df Residuals:,294476,BIC:,173000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,0.1204,0.001,142.325,0.000,0.119,0.122
ab_page,-0.0015,0.001,-1.237,0.216,-0.004,0.001

0,1,2,3
Omnibus:,127167.1,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,419292.014
Skew:,2.344,Prob(JB):,0.0
Kurtosis:,6.493,Cond. No.,2.62


**e.** What is the p-value associated with **ab_page**? Why does it differ from the value you found in **Part II**?<br><br>  

**Hints**: 
- What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in **Part II**? 
- You may comment on if these hypothesis (Part II vs. Part III) are one-sided or two-sided. 
- You may also compare the current p-value with the Type I error rate (0.05).


>**Put your answer here.**
<br>
<p> p_value that we have calculated  by using the z-test is bigger than (p).
    the reason for this difference is due to adding intercept,ab_page columns .

**f.** Now, you are considering other things that might influence whether or not an individual converts.  Discuss why it is a good idea to consider other factors to add into your regression model.  Are there any disadvantages to adding additional terms into your regression model?

>**Put your answer here.**
<br>
<p>I think adding other factors is a good thing that can help us to take the right decision.

**g. Adding countries**<br> 
Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. 

1. You will need to read in the **countries.csv** dataset and merge together your `df2` datasets on the appropriate rows. You call the resulting dataframe `df_merged`. [Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) are the docs for joining tables. 

2. Does it appear that country had an impact on conversion?  To answer this question, consider the three unique values, `['UK', 'US', 'CA']`, in the `country` column. Create dummy variables for these country columns. 
>**Hint:** Use `pandas.get_dummies()` to create dummy variables. **You will utilize two columns for the three dummy variables.** 

 Provide the statistical output as well as a written response to answer this question.

In [6]:
# Read the countries.csv
import pandas as pd
dfc=pd.read_csv('countries.csv')
dfc.head()

Unnamed: 0,user_id,country
0,834778,UK
1,928468,US
2,822059,UK
3,711597,UK
4,710616,UK


In [14]:
# Join with the df2 dataframe
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
dfc=pd.read_csv('countries.csv')
mix=df2.join(dfc.set_index('user_id'),on='user_id')
mix.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,country
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,US
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,US
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,US
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,US
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,US


In [10]:
# Create the necessary dummy variables
import pandas as pd
df=pd.read_csv('ab_data.csv')
df2=df.drop_duplicates()
dfc=pd.read_csv('countries.csv')
mix=df2.join(dfc.set_index('user_id'),on='user_id')
dum=pd.get_dummies(mix['country'])
dum.head()

Unnamed: 0,CA,UK,US
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1


In [15]:
mix[['UK','US','CA']]=pd.get_dummies(mix['country'])
mix.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,country,UK,US,CA
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,US,0,0,1
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,US,0,0,1
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,US,0,0,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,US,0,0,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,US,0,0,1


**h. Fit your model and obtain the results**<br> 
Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if are there significant effects on conversion.  **Create the necessary additional columns, and fit the new model.** 


Provide the summary results (statistical output), and your conclusions (written response) based on the results. 

>**Tip**: Conclusions should include both statistical reasoning, and practical reasoning for the situation. 

>**Hints**: 
- Look at all of p-values in the summary, and compare against the Type I error rate (0.05). 
- Can you reject/fail to reject the null hypotheses (regression model)?
- Comment on the effect of page and country to predict the conversion.


In [19]:
# Fit your model, and summarize the results
import statsmodels.api as sm
mix['intercept']=1
m2=sm.OLS(mix['converted'],mix[['intercept','US','CA']])

  from pandas.core import datetools


In [20]:
rm2=m2.fit()
rm2.summary()

0,1,2,3
Dep. Variable:,converted,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,1.291
Date:,"Thu, 04 Nov 2021",Prob (F-statistic):,0.275
Time:,20:23:54,Log-Likelihood:,-86476.0
No. Observations:,294478,AIC:,173000.0
Df Residuals:,294475,BIC:,173000.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,0.1159,0.003,43.284,0.000,0.111,0.121
US,0.0047,0.003,1.600,0.110,-0.001,0.010
CA,0.0037,0.003,1.339,0.181,-0.002,0.009

0,1,2,3
Omnibus:,127166.345,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,419286.862
Skew:,2.344,Prob(JB):,0.0
Kurtosis:,6.493,Cond. No.,9.94


>**Put your conclusion answer here.**
<br>
<p style='color:blue'><strong>After studying the previous results, we extracted the following:
    <br>
we can not reject the null hypothesis , therefore we recommend that keeping on using the old web site.</strong></P>


## Final Check!



In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])