# Coding Temple's Data Analytics Program
---
## Python for Data Analytics: Stats Assignment

For today's homework, you will be working with the AB Testing dataset. This data was scraped from a study conducted by Udemy. The study set out to determine whether there was a relationship between the number of clickthroughs and the webtext that was displayed.

Your goal is to analyze the data that was collected during this study and determine whether or not there is a relationship between the two variables.

### Task 1: Imports
Import your data and your libraries needed to complete this assignment:

In [41]:
import pandas as pd
from scipy.stats import chi2_contingency

In [6]:
df = pd.read_csv('data/ab_testing.csv')
df

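| | Unnamed: 0 | Pageview | Group | Click |
|---|---|---|---|---|
| 0 | 0 | 0 | Control | 1 |
| 1 | 1 | 1 | Control | 1 |
| 2 | 2 | 2 | Control | 1 |
| 3 | 3 | 3 | Control | 1 |
| 4 | 4 | 4 | Control | 1 |
| ... | ... | ... | ... | ... |
| 690198 | 690198 | 344655 | Experiment | 0 |
| 690199 | 690199 | 344656 | Experiment | 0 |
| 690200 | 690200 | 344657 | Experiment | 0 |
| 690201 | 690201 | 344658 | Experiment | 0 |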


In [7]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [9]:
lst_cols = ['pageview','group','click']
df.rename(columns=dict(zip(df.columns, lst_cols)), inplace=True)
df.head()

| | pageview | group | click |
|---|---|---|---|
| 0 | 0 | Control | 1 |
| 1 | 1 | Control | 1 |
| 2 | 2 | Control | 1 |
| 3 | 3 | Control | 1 |
| 4 | 4 | Control | 1 |


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690203 entries, 0 to 690202
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   pageview  690203 non-null  int64 
 1   group     690203 non-null  object
 2   click     690203 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 15.8+ MB


In [57]:
'''
    Brute Force Split Out to Separate DataFrames Per Group
    1) determine that the pageview column is just the unique identifier for each group
    2) get the number of positive clicks and row counts for each group
    3) have two separate group tables to work with later
'''
df_control = df[df['group'] == 'Control']
control_tot = len(df_control)
control_page = df_control['pageview'].nunique()
control_clicks = len(df_control[df_control['click']== 1])

df_exper = df[df['group'] == 'Experiment']
exper_tot = len(df_exper)
exper_page = df_exper['pageview'].nunique()
exper_clicks = len(df_exper[df_exper['click']== 1])

print(f'For Control group of {control_tot} rows, there are {control_page} pageview uniques and {control_clicks} clicks')
print(f'For Experiment group of {exper_tot} rows, there are {exper_page} pageview uniques and {exper_clicks} clicks')

For Control group of 345543 rows, there are 345543 pageview uniques and 28378 clicks
For Experiment group of 344660 rows, there are 344660 pageview uniques and 28325 clicks
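
The same per-group row and click counts can also come from a single `groupby`, without splitting into separate frames. Below is a minimal sketch, assuming a small made-up frame named `toy` (hypothetical, not the assignment data) that uses the same `group` and `click` column names:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the cleaned df
toy = pd.DataFrame({
    'group': ['Control', 'Control', 'Control', 'Experiment', 'Experiment'],
    'click': [1, 0, 0, 1, 0],
})

# One row per group: row count, positive clicks, and within-group click rate
summary = toy.groupby('group')['click'].agg(
    rows='size', clicks='sum', click_rate='mean'
)
print(summary)
```

Because `click` is coded 0/1, the sum is the number of positive clicks and the mean is the click rate within each group. Swapping `toy` for `df` should give the same counts as the brute force split.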


### Task 2: Creating Hypothesis

Before you begin working with the data, formulate your null and alternative hypothesis in the markdown cell below:

H0: There is no relationship between the addition of text to the page and subsequent user clickthroughs.  The percentage of clicks in the Experiment group will not be significantly different from the percentage of clicks in the Control group. 
<br>Ha: The addition of text on the page leads to a change in the incidence of user clickthroughs.  There is a significantly different percentage of clicks in the Experiment group as compared to the click percentage in the Control group.
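
Stated symbolically, the two hypotheses can be written as below. The symbols $p_C$ and $p_E$ are assumed names (they do not appear in the assignment) for the true clickthrough probabilities of the Control and Experiment groups:

```latex
H_0:\ p_C = p_E
\qquad\text{versus}\qquad
H_a:\ p_C \neq p_E
```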



### Task 3: Calculating Frequencies

In this task, you will calculate your frequency and your relative frequency of the `group` column

In [22]:
df['group'].value_counts()

Control       345543
Experiment    344660
Name: group, dtype: int64

In [25]:
df['group'].value_counts(normalize=True)*100

Control       50.063967
Experiment    49.936033
Name: group, dtype: float64

Next, let's calculate the frequency and relative frequency of the `click` column:

In [27]:
df['click'].value_counts()

0    633500
1     56703
Name: click, dtype: int64

In [52]:
df['click'].value_counts(normalize=True)*100 # converted to percentages

0    91.784591
1     8.215409
Name: click, dtype: float64

### Task 4a: 

Look at the joint distribution of the `group` and `click` columns and make an inference on if there is a relationship between the group and clicks

In [51]:
# data exploration
# this is the sort of work I would usually do in SQL 
# but the 'normalize=True' statement is very powerful and no equivalent seems to exist in SQL that I know about
df[['group','click']].value_counts(normalize=True).to_frame().sort_values(by=['group','click'],ascending=False)*100

| group | click | 0 |
|---|---|---|
| Experiment | 1 | 4.103865 |
| Experiment | 0 | 45.832168 |
| Control | 1 | 4.111544 |
| Control | 0 | 45.952423 |


In [31]:
# look at the click percentages constrained to the Control group from the brute force split
df_control[['group','click']].value_counts(normalize=True)*100

group    click
Control  0        91.787419
         1         8.212581
dtype: float64

In [32]:
# look at the click percentages constrained to the Experiment group from the brute force split
df_exper[['group','click']].value_counts(normalize=True)*100

group       click
Experiment  0        91.781756
            1         8.218244
dtype: float64

There appears to be very little difference in the share of follow-on clicks between the Control group and the Experiment group.  

The top query using the full DataFrame, `df`, yields the same numbers as the crosstab in Task 4c.  Both divide each cell by the total row count (690,203), so they describe the joint distribution rather than the click rate within each group.  This points to a potential error of this approach versus splitting via brute force and working that way.  In this case, the number of rows in each group is quite close, so the error is small.  With a larger difference in row counts between the groups, the crosstab could mislead.  The crosstab suggests to the uninitiated that the Control group did ever so slightly better than the Experiment group at 4.1115% clicks versus 4.1039% clicks.  This is not the correct interpretation: the larger number arises because there were 345,543 rows in the Control group versus 344,660 rows in the Experiment group.  The click percentages for each group are a percentage of all rows in the study, not of that group's own rows.  The 883-row difference inflates the Control group's positive click percentage.

A safer way to envision how results break out is via a brute force split to see the percentage of positive clicks in each group.  In this case, the Experiment group was ever so slightly better at 8.218% positive versus 8.213% positive for the Control group.  

Still no meaningful difference but the perils of a crosstab are evident.

Back to the AB test: putting the rows and positive clicks for each group into <a href="https://neilpatel.com/ab-testing-calculator/">Neil Patel's AB Calculator</a>, the Experiment group did work 1% better, but only at a 54% confidence level, so the difference is not statistically significant.
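
The calculator's exact formula is not documented here, so the following is only a sketch of one common approach: a pooled two-proportion z-test with a one-sided confidence read from the standard normal CDF. The function name `two_proportion_z` is hypothetical, and the counts are the ones printed in the brute force split.

```python
import math

def two_proportion_z(clicks_a, rows_a, clicks_b, rows_b):
    """Return (z, one-sided confidence) that group B's click rate exceeds group A's."""
    rate_a = clicks_a / rows_a
    rate_b = clicks_b / rows_b
    # Pooled click rate under the null hypothesis of equal rates
    pooled = (clicks_a + clicks_b) / (rows_a + rows_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / rows_a + 1 / rows_b))
    z = (rate_b - rate_a) / se
    # Standard normal CDF evaluated at z
    confidence = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, confidence

# Control: 28378 clicks in 345543 rows; Experiment: 28325 clicks in 344660 rows
z, confidence = two_proportion_z(28378, 345543, 28325, 344660)
print(f'z = {z:.4f}, one-sided confidence = {confidence:.1%}')
```

A confidence only slightly above 50% means the observed difference is about what chance alone would produce. Whether this matches the calculator's figure exactly depends on details (pooled versus unpooled standard error, rounding) that are assumptions here.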

### Task 4b:

Calculate the marginal distribution of the `group` and `click` variables:

In [39]:
pd.crosstab(df['click'], df['group']).sort_values(by='click', ascending=False)

| click \ group | Control | Experiment |
|---|---|---|
| 1 | 28378 | 28325 |
| 0 | 317165 | 316335 |
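
The table above holds the joint counts. The marginal distributions are its row and column totals divided by the grand total. Here is a minimal sketch that rebuilds the table from the printed counts (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the crosstab above (rows: click, columns: group)
counts = pd.DataFrame(
    {'Control': [28378, 317165], 'Experiment': [28325, 316335]},
    index=pd.Index([1, 0], name='click'),
)
counts.columns.name = 'group'

total = counts.to_numpy().sum()

# Marginal of group: sum over click values, as a percentage of all rows
group_marginal = counts.sum(axis=0) / total * 100
# Marginal of click: sum over groups, as a percentage of all rows
click_marginal = counts.sum(axis=1) / total * 100

print(group_marginal)
print(click_marginal)
```

These should agree with the `value_counts(normalize=True)` results from Task 3. On the row-level frame, passing `margins=True` to `pd.crosstab` appends the same totals as an `All` row and column.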


### Task 4c:
Calculate the conditional distribution of the `group` and `click` variables:

In [40]:
pd.crosstab(df['click'], df['group'], normalize=True).sort_values(by='click', ascending=False)*100

| click \ group | Control | Experiment |
|---|---|---|
| 1 | 4.111544 | 4.103865 |
| 0 | 45.952423 | 45.832168 |
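
With `normalize=True`, every cell is divided by the total row count, which gives the joint distribution. To condition on the group, `pd.crosstab` accepts `normalize='columns'`, which divides each column by its own total. The sketch below rebuilds row-level data from the Task 4b counts (the frame names `cells` and `rows` are hypothetical) and compares the two:

```python
import pandas as pd

# Cell counts taken from the Task 4b crosstab
cells = pd.DataFrame({
    'group': ['Control', 'Control', 'Experiment', 'Experiment'],
    'click': [1, 0, 1, 0],
    'n': [28378, 317165, 28325, 316335],
})

# Rebuild one row per pageview so pd.crosstab can be applied as in the notebook
repeated = cells.index.repeat(cells['n'].to_numpy())
rows = cells.loc[repeated, ['group', 'click']].reset_index(drop=True)

# Joint distribution: each cell as a percentage of all rows
joint = pd.crosstab(rows['click'], rows['group'], normalize=True) * 100
# Conditional distribution of click given group: each column sums to 100
conditional = pd.crosstab(rows['click'], rows['group'], normalize='columns') * 100

print(conditional.sort_values(by='click', ascending=False))
```

The conditional table should reproduce the within-group percentages from the brute force split in Task 4a, without depending on how many rows each group has.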


### Task 5:

Perform a chi-squared test on your data. Interpret and communicate the results in a markdown cell below your code

In [43]:
_,p_val, ___,__ = chi2_contingency(pd.crosstab(df['group'], df['click']))
print(p_val)

0.9352212452988706


OK, given my text about crosstabs in the interpretation markdown above, the fact that a crosstab is used in the Chi Squared calculation worries me about the potential for error in other studies. (The crosstab passed to `chi2_contingency` here holds raw counts, not the normalized percentages from Task 4c.) This one is safe because the relatively equal number of records in each group will not lead to the error I am concerned about.  To wrap up, the large p-value of about 0.935 shows that the difference in positive clicks between the groups is not statistically significant.  We fail to reject H0.

My question: in a situation where the percentage of positive clicks is identical for the Experiment group, but the Experiment group has 10,000 fewer rows than the Control group, will the Chi Squared test, using a crosstab, yield the same result?

### Independent Exploration

So my issue with using a crosstab in a Chi Squared study is that the number of rows in each group appears to influence the result even when the positive click percentages are almost identical.  

So, I have created a second dataframe of results where the Experiment group is down by 10,000 records but has the <b>IDENTICAL</b> positive click percentage that was previously shown to be basically equal to that percentage in the Control group.

In [58]:
import numpy as np

In [72]:
num_rows = (control_tot - 10000)
percentage_of_ones = 0.08218244
num_ones = int(num_rows * percentage_of_ones)
num_zeros = num_rows - num_ones
sequence = range(1,num_rows+1)

click = np.concatenate([np.ones(num_ones), np.zeros(num_zeros)])
np.random.shuffle(click)
df_exper2 = pd.DataFrame({'pageview': sequence, 'group': 'Experiment','click':click})
# percentage breakout of positive clicks is identical in this dataframe
df_exper2['click'].value_counts(normalize=True)*100


0.0    91.781977
1.0     8.218023
Name: click, dtype: float64

In [73]:
# merge the new (lower # record) experiment dataframe with the control 
# dataframe separated above to make a new project dataframe
df2 = pd.concat([df_control, df_exper2], ignore_index=True)

In [74]:
# a new joint distribution: normalize=True divides every cell by the total row count
# note that chi2_contingency is fed the raw counts, not these percentages
pd.crosstab(df2['click'], df2['group'], normalize=True).sort_values(by='click', ascending=False)*100

| click \ group | Control | Experiment |
|---|---|---|
| 1.0 | 4.166581 | 4.048681 |
| 0.0 | 46.567541 | 45.217197 |


In [75]:
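# a new Chi Squared test
# and bingo
# caution: this call pairs df2['group'] with df['click'] (the original frame), not df2['click'].
# pd.crosstab aligns the two Series on their index labels, so the Experiment rows here take
# their click values from the original data rather than from df_exper2.
_,p_val, ___,__ = chi2_contingency(pd.crosstab(df2['group'], df['click']))
print(p_val)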

0.0006376660871901818


In [77]:
# get total number of records in new experiment dataframe
len(df_exper2)

335543

In [76]:
# get number of positive clicks from new experiment dataframe
df_exper2['click'].value_counts()

0.0    307968
1.0     27575
Name: click, dtype: int64
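
The counts printed above can be fed to a Chi Squared test directly. The sketch below is a hand-rolled Pearson test for a 2x2 table, written in plain Python to show where group size enters the calculation. The function name `chi2_2x2` is hypothetical, and no continuity correction is applied, so its p-values can differ slightly from the `chi2_contingency` default for 2x2 tables.

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence scales with the group's size
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows are groups (Control, Experiment); columns are (no click, click)
original = [[317165, 28378], [316335, 28325]]
reduced = [[317165, 28378], [307968, 27575]]

for name, table in [('original', original), ('reduced', reduced)]:
    stat, p_value = chi2_2x2(table)
    print(f'{name}: chi2 = {stat:.5f}, p = {p_value:.4f}')
```

Because each expected count is built from the row and column totals, a group with fewer rows also gets proportionally smaller expected counts. Two groups with the same click rate therefore give a statistic near zero regardless of their sizes.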

Back to Neil Patel's AB Testing Calculator for the new row numbers and click counts for the reduced record Experiment group, and the results are the same: a 1% improvement at 54% confidence level. 

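At first glance, the small p-value from In [75] suggests that a crosstab-based Chi Squared test is falsely sensitive to the row counts in each category. However, that call builds its crosstab from `df2['group']` and `df['click']`, which come from two different frames, so the result cannot be attributed to the row-count difference. The test needs to be rerun with `df2['click']` before drawing that conclusion.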

Previous stats courses showed the Chi Squared test to be useful.  I will explore how to use it in Python without crosstabs so that it can be depended on.  I will also review my stats notes to see how to calculate the confidence level and p-value as done in Neil Patel's calculator.