In [28]:
import numpy as np
from scipy.stats import chi2_contingency
import pandas as pd

In [1]:
nums = [3, 16, 20, 4, 2, 5, 10, 9, 13, 7, 14, 8]

In [8]:
nums = sorted(nums)
print(nums)
print('Number of instances = ' + str(len(nums)))

[2, 3, 4, 5, 7, 8, 9, 10, 13, 14, 16, 20]
Number of instances = 12


For equal-frequency binning with 3 bins, we divide the number of values by the number of bins: 12/3 = 4 values per bin. \
This leaves us with the following bins and corresponding frequencies: \
[2,5] : 4 \
[7,10] : 4 \
[13,20] : 4
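
The split above can be sketched in plain Python. This is a minimal sketch that assumes the number of values divides evenly by the number of bins; the names `n_bins`, `bin_size` and `bins` are illustrative only.

```python
# Equal-frequency binning: every bin receives the same number of values.
nums = sorted([3, 16, 20, 4, 2, 5, 10, 9, 13, 7, 14, 8])
n_bins = 3
bin_size = len(nums) // n_bins  # assumes len(nums) is a multiple of n_bins

# Slice the sorted list into consecutive chunks of bin_size values.
bins = [nums[i:i + bin_size] for i in range(0, len(nums), bin_size)]

for b in bins:
    # Report each bin by its smallest and largest member and its frequency.
    print(f"[{b[0]},{b[-1]}] : {len(b)}")
```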

For binning by smoothing the boundaries, we need to find an interval width that fits the spread of our values. In this case, a width of close to 5 appears to fit. \
This leaves us with the following bins and frequencies: \
[2,7] : 5 \
[8,13] : 4 \
[14,20] : 3
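
To check these frequencies, the sketch below counts how many values fall inside each interval. It treats the three intervals as closed on both ends, which is an assumption about how the bins are meant to be read; `intervals` and `counts` are illustrative names.

```python
nums = sorted([3, 16, 20, 4, 2, 5, 10, 9, 13, 7, 14, 8])

# Interval endpoints taken from the bins listed above (assumed inclusive).
intervals = [(2, 7), (8, 13), (14, 20)]

# Count the values that lie within each interval.
counts = {
    (low, high): sum(low <= x <= high for x in nums)
    for low, high in intervals
}

for (low, high), n in counts.items():
    print(f"[{low},{high}] : {n}")
```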

In [10]:
nums2 = [10,5,25,50,35]

The formula for min-max normalisation is : (value - min)/(max - min)

With min = 5 and max = 50, applying this to our data results in:

(10 - 5)/(50-5) = 5/45 = 0.11 \
(5 - 5)/(50-5) = 0/45 = 0 \
(25 - 5)/(50-5) = 20/45 = 0.44 \
(50 - 5)/(50-5) = 45/45 = 1 \
(35 - 5)/(50-5) = 30/45 = 0.67
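
The same arithmetic can be written as a short sketch in plain Python. The variable names `lo`, `hi` and `scaled` are illustrative, and the code assumes the data contain at least two distinct values so the denominator is not zero.

```python
nums2 = [10, 5, 25, 50, 35]

lo, hi = min(nums2), max(nums2)

# Min-max normalisation: (value - min) / (max - min), mapping the data onto [0, 1].
scaled = [(x - lo) / (hi - lo) for x in nums2]

print([round(s, 2) for s in scaled])
```

The smallest value always maps to 0 and the largest to 1, which is a quick sanity check on any hand calculation.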

The formula for Z-score normalisation is (value - mean)/standard deviation

The mean of our dataset is (10 + 5 + 25 + 50 + 35)/5 = 25

To calculate the standard deviation we first need the variance. This involves calculating the difference between each point and the mean, squaring these differences, summing them, then dividing this sum by the number of values in our data (the population variance).

Our differences = [-15,-20,0,25,10] \
Our squared differences = [225,400,0,625,100] \
This sums to: 1350 \
For our final variance score, we calculate 1350/5 = 270. 

To calculate the standard deviation, we take the square root of the variance: √270 = 16.43

Now we can perform Z-score normalisation. 

(10-25)/16.43 = -0.91 \
(5-25)/16.43 = -1.22 \
(25-25)/16.43 = 0 \
(50-25)/16.43 = 1.52 \
(35-25)/16.43 = 0.61
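
As a cross-check, here is a minimal sketch using the standard library. It uses `statistics.pstdev`, the population standard deviation (dividing by n), to match the hand calculation; a sample standard deviation (dividing by n - 1) would give different scores.

```python
from statistics import mean, pstdev

nums2 = [10, 5, 25, 50, 35]

mu = mean(nums2)       # arithmetic mean of the data
sigma = pstdev(nums2)  # population standard deviation (divides by n)

# Z-score normalisation: (value - mean) / standard deviation.
z_scores = [(x - mu) / sigma for x in nums2]

print(round(sigma, 2))
print([round(z, 2) for z in z_scores])
```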

For the chi-square test, first recreate the table and sum both the rows and columns.

| Rating/University | University A | University B | Total |
| --------------- | ------------ | ----------- | ---------- |
|Satisfied | 71 | 129 | 200 |
|Dissatisfied| 37 | 73 | 110 |
|Total | 108 | 202 | 310 |

Then calculate the expected values. For each cell, multiply its row total by its column total, then divide this figure by the overall total.

The expected value for University A and Satisfied is : \
(108 * 200)/310 = 69.7 \
The expected value for University A and Dissatisfied is : \
(108 * 110)/310 = 38.3 \
The expected value for University B and Satisfied is : \
(202 * 200)/310 = 130.3 \
The expected value for University B and Dissatisfied is : \
(202 * 110)/310 = 71.7 
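
The four expected values can be computed in one step as an outer product of the row and column totals. This is a sketch with NumPy; the array layout (rows = Satisfied/Dissatisfied, columns = University A/B) is chosen here to mirror the table.

```python
import numpy as np

# Observed counts: rows are Satisfied / Dissatisfied, columns are University A / B.
observed = np.array([[71, 129],
                     [37, 73]])

row_totals = observed.sum(axis=1)   # totals per rating
col_totals = observed.sum(axis=0)   # totals per university
grand_total = observed.sum()

# Expected count for each cell = (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

print(expected.round(1))
```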

Update the table with these figures in parentheses. 

| Rating/University | University A | University B | Total |
| --------------- | ------------ | ----------- | ---------- |
|Satisfied | 71 (69.7) | 129 (130.3)| 200 |
|Dissatisfied| 37 (38.3) | 73 (71.7) | 110 |
|Total | 108 | 202 | 310 |

Calculating the ${\chi^2}$ statistic uses this formula : ${\chi^2 = \sum\frac{(O-E)^2}{E}}$

This means finding the difference between each observed value (O) and its expected value (E), squaring this difference, dividing it by the expected value, then summing the results over all four cells.

Here are the calculations: \
(71-69.7)^2/69.7 = 1.3^2/69.7 = 1.69/69.7 = 0.024 \
(37-38.3)^2/38.3 = (-1.3)^2/38.3 = 1.69/38.3 = 0.044 \
(129-130.3)^2/130.3 = (-1.3)^2/130.3 = 1.69/130.3 = 0.013 \
(73-71.7)^2/71.7 = 1.3^2/71.7 = 1.69/71.7 = 0.024

${\chi^2}$ = 0.024 + 0.044 + 0.013 + 0.024 = 0.105
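
The sum above can be reproduced with a short NumPy sketch. It deliberately uses the expected values rounded to one decimal place, as in the table, so that it matches the hand calculation; using unrounded expected values would shift the result slightly.

```python
import numpy as np

observed = np.array([[71, 129],
                     [37, 73]], dtype=float)

# Expected values rounded to one decimal place, copied from the table above.
expected_rounded = np.array([[69.7, 130.3],
                             [38.3, 71.7]])

# Sum of (O - E)^2 / E over all four cells.
chi2_stat = ((observed - expected_rounded) ** 2 / expected_rounded).sum()

print(round(chi2_stat, 3))
```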

The null hypothesis is that satisfaction and university are independent. With 1 degree of freedom, the critical value at a significance level of 0.001 is 10.828. Our statistic of 0.105 is well below this, so we fail to reject the null hypothesis: the data give no evidence that satisfaction depends on university.
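
The critical value quoted above can be looked up with `scipy.stats.chi2` rather than a printed table. This is a sketch; `alpha`, `dof` and `critical_value` are illustrative names, and the degrees of freedom are computed as (rows - 1) * (columns - 1) for the 2x2 table.

```python
from scipy.stats import chi2

alpha = 0.001                 # significance level
dof = (2 - 1) * (2 - 1)       # (rows - 1) * (columns - 1) for a 2x2 table

# Critical value: the point with upper-tail probability alpha.
critical_value = chi2.ppf(1 - alpha, dof)

# p-value for the hand-calculated statistic (upper-tail probability).
chi2_stat = 0.105
p_value = chi2.sf(chi2_stat, dof)

print(round(critical_value, 3))
print(chi2_stat < critical_value)  # True means we fail to reject independence
```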

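Note to self - My chi-square statistic (0.105) differs from the one produced by the scipy module (0.042), although the expected values are the same. This is because `chi2_contingency` applies Yates' continuity correction by default when there is 1 degree of freedom, as in a 2x2 table. Passing `correction=False` gives approximately 0.109, which agrees with the hand calculation up to the rounding of the expected values.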

In [24]:
df = pd.DataFrame({
    'Rating': ['satisfied', 'dissatisfied'],
    'A': [71,37],
    'B': [129,73]
})
df = df.set_index('Rating')
df

               A    B
Rating
satisfied     71  129
dissatisfied  37   73


In [29]:
chi2_contingency(df)

(0.041999512451245155,
 0.8376207016450208,
 1,
 array([[ 69.67741935, 130.32258065],
        [ 38.32258065,  71.67741935]]))
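
The statistic above includes Yates' continuity correction, which scipy applies by default for tables with one degree of freedom. A minimal sketch comparing the default call with `correction=False`, using a plain nested list in place of the DataFrame:

```python
from scipy.stats import chi2_contingency

observed = [[71, 129],
            [37, 73]]

# Default behaviour: Yates' continuity correction is applied for a 2x2 table.
stat_corrected, p_corrected, dof, expected = chi2_contingency(observed)

# Without the correction: the plain sum of (O - E)^2 / E.
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print(stat_corrected, stat_plain)
```

The uncorrected statistic is computed from unrounded expected values, so it is close to, but not exactly equal to, a hand calculation that rounds them to one decimal place.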

In [32]:
(((71-69.7)**2)/69.7) + (((37-38.3)**2)/38.3) + (((129-130.3)**2)/130.3) + (((73-71.7)**2)/71.7)


0.1049125996786575