In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/loan-csv/contributors.md
/kaggle/input/loan-csv/install.R
/kaggle/input/loan-csv/runtime.txt
/kaggle/input/loan-csv/LICENSE
/kaggle/input/loan-csv/.gitignore
/kaggle/input/loan-csv/Makefile
/kaggle/input/loan-csv/README.md
/kaggle/input/loan-csv/environment.yml
/kaggle/input/loan-csv/.Rhistory
/kaggle/input/loan-csv/requirements.txt
/kaggle/input/loan-csv/practical-statistics-for-data-scientists.Rproj
/kaggle/input/loan-csv/_config.yml
/kaggle/input/loan-csv/R/README.md
/kaggle/input/loan-csv/R/code/Chapter 3 - Statistical Experiments and Significance Testing.R
/kaggle/input/loan-csv/R/code/Chapter 5 - Classification.R
/kaggle/input/loan-csv/R/code/Chapter 2 - Data and sampling distributions.R
/kaggle/input/loan-csv/R/code/Chapter 1 - Exploratory Data Analysis.R
/kaggle/input/loan-csv/R/code/Chapter 4 - Regression and Prediction.R
/kaggle/input/loan-csv/R/code/Chapter 6 - Statistical Machine Learning.R
/kaggle/input/loan-csv/R/code/Chapter 7 - Unsupervised Learning.R
/kag

# Web stickiness

A company selling a relatively high-value service wants to test which of two web presentations does a better selling job. One potential proxy variable for the company is the number of clicks on the detailed landing page. A better one is how long people spend on the page. It is reasonable to think that a web presentation (page) that holds people’s attention longer will lead to more sales. Hence, our metric is average session time, comparing page A to page B.

# Chi square test

Web testing often goes beyond A/B testing and tests multiple treatments at once. The
chi-square test is used with count data to test how well it fits some expected distribution.
The most common use of the chi-square statistic in statistical practice is with
r × c contingency tables, to assess whether the null hypothesis of independence
among variables is reasonable.

## Chi-Square Test: A Resampling Approach

Suppose you are testing three different headlines (A, B, and C) and you run them
each on 1,000 visitors. The headlines certainly appear to differ: headline A returns nearly
twice the click rate of B. The actual numbers are small, though. A resampling procedure can
test whether the click rates differ to an extent greater than chance might cause. For this
test, we need to have the “expected” distribution of clicks, and in this case, that would be
under the null hypothesis assumption that all three headlines share the same click
rate, for an overall click rate of 34/3,000.
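
As a quick check of these numbers, the sketch below computes each headline's click rate and the pooled rate assumed under the null hypothesis. The click counts (14, 8, and 12 out of 1,000 visitors each) come from the click_rates.csv data loaded further down in this notebook; the variable names are illustrative only.

```python
# Clicks per headline out of 1,000 visitors each (counts from click_rates.csv)
clicks_per_headline = {"Headline A": 14, "Headline B": 8, "Headline C": 12}
visitors_per_headline = 1000

# Click rate for each headline
click_rates = {name: n / visitors_per_headline for name, n in clicks_per_headline.items()}

# Pooled click rate under the null hypothesis that all headlines perform alike
total_clicks = sum(clicks_per_headline.values())
total_visitors = visitors_per_headline * len(clicks_per_headline)
overall_rate = total_clicks / total_visitors

for name, rate in click_rates.items():
    print(f"{name}: {rate:.2%}")
print(f"Overall: {total_clicks}/{total_visitors} = {overall_rate:.2%}")
```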

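For each cell of the table, the Pearson residual is defined as:

$$R = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}$$

The chi-square statistic is defined as the sum of the squared Pearson residuals:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2$$

where $R_{ij}$ is the Pearson residual of the cell in row $i$ and column $j$, and $r$ and $c$ are the numbers of rows and columns.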
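
A minimal sketch of these two definitions, applied to the headline counts used in this notebook. The array names `observed_counts` and `expected_counts` are illustrative, and splitting each row total evenly is only valid here because every headline was shown to the same number of visitors.

```python
import numpy as np

# Observed counts: rows are Click / No-click, columns are Headlines A, B, C
observed_counts = np.array([[14, 8, 12],
                            [986, 992, 988]])

# Expected counts under the null: each row total split evenly across headlines
expected_counts = observed_counts.mean(axis=1, keepdims=True)

# Pearson residual for every cell: (observed - expected) / sqrt(expected)
residuals = (observed_counts - expected_counts) / np.sqrt(expected_counts)

# Chi-square statistic: sum of the squared Pearson residuals
chi2_stat = (residuals ** 2).sum()

print(residuals.round(3))
print(f"chi-square statistic: {chi2_stat:.4f}")
```

The largest residuals sit in the Click row, because the expected click counts are small relative to the deviations.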

In [3]:
clicks = pd.read_csv("/kaggle/input/loan-csv/data/click_rates.csv")

In [4]:
clicks.head()

Unnamed: 0,Headline,Click,Rate
0,Headline A,Click,14
1,Headline A,No-click,986
2,Headline B,Click,8
3,Headline B,No-click,992
4,Headline C,Click,12


In [5]:
clicks.describe()

Unnamed: 0,Rate
count,6.0
mean,500.0
std,535.314487
min,8.0
25%,12.5
50%,500.0
75%,987.5
max,992.0


In [6]:
clicks = clicks.pivot(index = 'Click',columns = 'Headline',values = 'Rate')

In [7]:
clicks

Headline,Headline A,Headline B,Headline C
Click,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Click,14,8,12
No-click,986,992,988


### Expected distribution

Expected clicks per headline: 34/3 ≈ 11.33

Expected no-clicks per headline: (986 + 992 + 988)/3 ≈ 988.67

In [8]:
row_avrg = clicks.mean(axis = 1)
row_avrg

Click
Click        11.333333
No-click    988.666667
dtype: float64

In [9]:
exp = pd.DataFrame({'Headline A':row_avrg,'Headline B':row_avrg,'Headline C':row_avrg})
exp

Unnamed: 0_level_0,Headline A,Headline B,Headline C
Click,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Click,11.333333,11.333333,11.333333
No-click,988.666667,988.666667,988.666667


# Resampling approach

We can test with this resampling algorithm:
1. Constitute a box with 34 ones (clicks) and 2,966 zeros (no clicks).
2. Shuffle, take three separate samples of 1,000, and count the clicks in each.
3. Find the squared differences between the shuffled counts and the expected
counts and sum them.
4. Repeat steps 2 and 3, say, 1,000 times.
5. How often does the resampled sum of squared deviations exceed the observed?
That’s the p-value.
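
The steps above can be sketched as one self-contained function. This is an illustrative variant, not the notebook's own implementation: it uses NumPy's `default_rng` with a fixed seed so the run is repeatable, scales each squared deviation by its expected count (the chi-square statistic), and the function names are hypothetical.

```python
import numpy as np

def chi2_statistic(click_counts, group_size):
    """Sum of squared deviations from the expected counts, scaled by the expected counts."""
    click_counts = np.asarray(click_counts, dtype=float)
    table = np.array([click_counts, group_size - click_counts])
    expected = table.mean(axis=1, keepdims=True)  # equal split across groups
    return ((table - expected) ** 2 / expected).sum()

def resampled_p_value(click_counts, group_size=1000, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)  # fixed seed is an illustrative choice
    n_groups = len(click_counts)
    total_clicks = int(sum(click_counts))

    # Step 1: a box with ones (clicks) and zeros (no clicks)
    box = np.zeros(n_groups * group_size, dtype=int)
    box[:total_clicks] = 1

    observed_stat = chi2_statistic(click_counts, group_size)

    exceed = 0
    for _ in range(n_iter):  # Step 4: repeat the shuffle many times
        # Step 2: shuffle and count the clicks in each group
        shuffled = rng.permutation(box)
        resampled_counts = shuffled.reshape(n_groups, group_size).sum(axis=1)
        # Step 3: statistic for the shuffled counts
        if chi2_statistic(resampled_counts, group_size) > observed_stat:
            exceed += 1

    # Step 5: share of resampled statistics that exceed the observed one
    return exceed / n_iter

p_value = resampled_p_value([14, 8, 12])
print(f"Resampled p-value: {p_value:.3f}")
```

Because the result depends on random shuffles, the p-value changes slightly with the seed and the number of iterations.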

In [11]:
box = [1]*34

In [12]:
box.extend([0]*2966)

In [15]:
import random
random.shuffle(box)

# Observed chi square

In [16]:
expected_clicks = 34/3 
expected_noclicks = 1000 - expected_clicks 
expected = [expected_clicks,expected_noclicks]
expected

[11.333333333333334, 988.6666666666666]

## Function for the chi-square statistic (sum of squared Pearson residuals)

In [17]:
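def chi2(observed,expected):
    # observed: one row of counts per outcome (clicks, no-clicks)
    # expected: the expected count for each of those rows
    squared_residuals = []

    for row,expect in zip(observed,expected):
        # squared Pearson residual: (observed - expected)^2 / expected
        squared_residuals.append([((observe - expect)**2)/expect for observe in row])

    return np.sum(squared_residuals)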

In [19]:
chi2observed = chi2(clicks.values,expected)
chi2observed

1.6659394708658917

## Permutation function

In [20]:
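def perm_func(box):
    # shuffle the box in place, then split it into three groups of 1,000 visitors
    random.shuffle(box)

    sample_clicks = [sum(box[0:1000]),sum(box[1000:2000]),sum(box[2000:3000])]

    sample_noclicks = [1000 - n for n in sample_clicks]

    # chi-square statistic of the shuffled table against the expected counts
    return chi2([sample_clicks,sample_noclicks],expected)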

In [21]:
# Run the permutation test function 2,000 times to build the resampled chi-square distribution

perm_chi2 = [perm_func(box) for _ in range(2000)]

# P-value

The p-value is the proportion of resampled chi-square statistics that exceed the observed statistic.

In [26]:
# convert the list to an array so the comparison is explicitly element-wise
resampled_pvalue = np.mean(np.array(perm_chi2) > chi2observed)

print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p value: {resampled_pvalue:.4f}')

Observed chi2: 1.6659
Resampled p value: 0.4720


**Here the p value = 0.47, which means we fail to reject the null hypothesis. The test shows that this result could easily have been obtained by randomness.**

# Chi-Square Test: Statistical Theory

Asymptotic statistical theory shows that the distribution of the chi-square statistic
can be approximated by a chi-square distribution. The appropriate standard chi-square
distribution is determined by the degrees of freedom. For a contingency table,
the degrees of freedom are related to the number of rows (r) and columns (c) as
follows:

degrees of freedom = (r − 1) × (c − 1)

For the 2 × 3 click table, this gives (2 − 1) × (3 − 1) = 2 degrees of freedom.

The chi-square distribution is typically skewed, with a long tail to the right. The further
out on the chi-square distribution the observed statistic is, the lower the p-value.
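
To make the link between the statistic, the degrees of freedom, and the p-value concrete, here is a small sketch that reads the p-value off the right tail of the chi-square distribution with `scipy.stats.chi2.sf`. The observed statistic 1.6659 is the value computed earlier in this notebook; the variable names are illustrative.

```python
from scipy import stats

n_rows, n_cols = 2, 3      # Click / No-click by Headlines A, B, C
chi2_observed = 1.6659     # observed statistic computed earlier in this notebook

# Degrees of freedom for an r x c contingency table
dof = (n_rows - 1) * (n_cols - 1)

# Right-tail area of the chi-square distribution beyond the observed statistic
p_value = stats.chi2.sf(chi2_observed, dof)

print(f"degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")
```

The survival function `sf` is 1 minus the cumulative distribution function, so it returns the probability of a statistic at least as large as the observed one under the null hypothesis.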

**In Python, the function `scipy.stats.chi2_contingency` returns the chi-square statistic, the p-value, the degrees of freedom, and the expected counts.**

In [28]:
import scipy.stats as stats



In [29]:
chisq,pvalue,df,expected = stats.chi2_contingency(clicks)

In [30]:
df

2

In [31]:
expected

array([[ 11.33333333,  11.33333333,  11.33333333],
       [988.66666667, 988.66666667, 988.66666667]])

In [32]:
print(f'Observed chi square : {chisq:.4f}')
print(f'P value: {pvalue:.4f}')

Observed chi square : 1.6659
P value: 0.4348


The p-value is a little less than the resampling p-value; this is because the chi-square
distribution is only an approximation of the actual distribution of the statistic.