### Problem Statement 1: 
 
Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table: 


| | High School | Bachelors | Master | Ph.D. | Total |
| --- | --- | --- | --- | --- | --- |
| Female | 60 | 54 | 46 | 41 | 201 |
| Male | 40 | 44 | 53 | 57 | 194 |
| Total | 100 | 98 | 99 | 98 | 395 |


Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between an individual's gender and the level of education they have obtained? 

In [1]:
import pandas as pd

# creating view for Observed table

edu = {'High School': [60, 40], 'Bachelors': [54, 44], 'Master': [46, 53], 'Ph.d': [41, 57]}
df_frq = pd.DataFrame(edu, columns = ['High School', 'Bachelors', 'Master', 'Ph.d'], index = ['Female', 'Male'])
print('Observed table: ')
print('*' * 50 )
print(df_frq)

Observed table: 
**************************************************
        High School  Bachelors  Master  Ph.d
Female           60         54      46    41
Male             40         44      53    57


In [2]:
# H0: There is no relationship between gender and the level of education obtained.
# H1: There is a relationship between gender and the level of education obtained.

# Now we create the expected table under the null hypothesis. Formula: E = (row total * column total) / grand total

# Observed counts listed column by column: (Female, Male) for each education level
Obsv_lst = [60, 40, 54, 44, 46, 53, 41, 57]

# Total for all combinations
grand_Total = sum(Obsv_lst)

# Total level of education column wise
To_cls = df_frq.sum(axis=0)

# Total per gender (Female / Male), row-wise
To_rows = df_frq.sum(axis=1)

In [3]:
# Now creating the expected list corresponding to the observed list

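def exp_list(To_cls, To_rows):
    # Build the expected counts in the same column-by-column order as Obsv_lst
    Expected_lst = []
    for i in range(len(To_cls)):       # education levels
        for j in range(len(To_rows)):  # genders
            # .iloc gives positional access; plain [i] on a label-indexed Series is deprecated in recent pandas
            res = (To_cls.iloc[i] * To_rows.iloc[j]) / grand_Total
            Expected_lst.append(res)
            
    return Expected_lst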

Expects_lst = exp_list(To_cls, To_rows)
print('The expected list is: ', Expects_lst)

The expected list is:  [50.88607594936709, 49.11392405063291, 49.868354430379746, 48.131645569620254, 50.37721518987342, 48.62278481012658, 49.868354430379746, 48.131645569620254]
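
The same expected counts can be obtained without explicit loops, as the outer product of the row and column totals divided by the grand total. The following is a minimal sketch, assuming NumPy is available. The names `observed`, `row_totals`, `col_totals`, and `expected` are illustrative and do not appear in the cells above. The result is laid out as a 2 x 4 table rather than as the column-by-column list used in the notebook.

```python
import numpy as np

# Observed counts: rows = Female, Male; columns = High School, Bachelors, Master, Ph.d
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1)   # totals per gender
col_totals = observed.sum(axis=0)   # totals per education level
grand_total = observed.sum()        # total number of respondents

# E[i, j] = (row total i * column total j) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

print(expected.round(2))
```

Reading the array column by column reproduces the order of `Expects_lst`.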


In [4]:
import numpy as np

# Now applying Chi-Square independence formula on Observed and Expected list

Ob_vals = np.array(Obsv_lst) # observed value
Ex_vals = np.array(Expects_lst) # expected value

def chi_sqr(Ob_vals, Ex_vals):
    results = (Ob_vals - Ex_vals)**2 # (O-E)**2
    sum_sqr = 0
    for ele in range(len(results)):
        sum_sqr += (results[ele] / Ex_vals[ele])
    return sum_sqr

X_sqr = chi_sqr(Ob_vals, Ex_vals)

print('Calculated Chi-Square value is: ', X_sqr)

Calculated Chi-Square value is:  8.006066246262538


In [5]:
from scipy.stats import chi2

# Given significance level α = 5% = 0.05
# degrees of freedom = (rows - 1) * (columns - 1) = (2-1) * (4-1) = 3

# cumulative probability used to look up the critical value = 1 - 0.05 = 0.95
p = 0.95
dof = 3

critical = chi2.ppf(p, dof)
print('Critical value of Chi-Square is: ', critical)

Critical value of Chi-Square is:  7.814727903251179
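
As a cross-check on the manual calculation, SciPy's `chi2_contingency` computes the statistic, the degrees of freedom, the p-value, and the expected table in one call. This is a sketch that re-enters the observed table by hand; the variable names are illustrative. `correction=False` is passed explicitly because Yates' continuity correction only applies when there is one degree of freedom, which is not the case here.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Female, Male; columns = High School, Bachelors, Master, Ph.d
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

chi_sq, p_value, dof, expected = chi2_contingency(observed, correction=False)

print('Chi-square statistic:', round(chi_sq, 3))
print('Degrees of freedom:', dof)
print('p-value:', round(p_value, 4))
print('Reject H0 at the 5% level:', p_value < 0.05)
```

Comparing the p-value with 0.05 is equivalent to comparing the statistic with the critical value computed above.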


<font color='blue'>
    
### Conclusions
    
The critical value of Chi-Square is Χ^2 = 7.81 at the 5% level of significance (3 degrees of freedom). <br>
Our computed value of Chi-Square is Χ^2 = 8.01. <br>

The computed value exceeds the critical value, so it falls in the rejection region and we reject the null hypothesis. <br>
Hence there is a relationship between gender and the level of education obtained at the 5% level of significance.

    
</font>

### Problem Statement 2: 
 
Using the following data, perform a one-way analysis of variance using α = 0.05. Write up the results in APA format. 
 
 
[Group1: 51, 45, 33, 45, 67]  [Group2: 23, 43, 23, 43, 45]  [Group3: 56, 76, 74, 87, 56] 

In [6]:
import math

# H0: The mean is the same for all groups
# H1: The mean is not the same for all groups (at least one group mean differs)

# Group1: 
    
group1 = [51, 45, 33, 45, 67]

# Calculate mean

def mean_gr1(group1):
    gr1_mean = sum(group1) / len(group1)
    return gr1_mean
mean1 = mean_gr1(group1)

# Calculate deviation for each number

def dev_gr1(group1, mean1):
    dev1_lst = []
    for num in group1:
        dev1 = (num - mean1)
        dev1_lst.append(dev1)
    return dev1_lst
deviatoin_gr1 = dev_gr1(group1, mean1)

# Calculate square deviation for each number

def sqrDev_gr1(deviatoin_gr1):
    sqrDev = []
    for num in deviatoin_gr1:
        sqr = num **2
        sqrDev.append(sqr)
    return sqrDev

square_dev_gr1 = sqrDev_gr1(deviatoin_gr1)

print('Calculated square deviation for the numbers in group1 is: ', square_dev_gr1)

Calculated square deviation for the numbers in group1 is:  [7.839999999999984, 10.240000000000018, 231.04000000000008, 10.240000000000018, 353.4399999999999]


In [7]:
import math
# Group2: 
    
group2 = [23, 43, 23, 43, 45]

# Calculate mean

def mean_gr2(group2):
    gr2_mean = sum(group2) / len(group2)
    return gr2_mean
mean2 = mean_gr2(group2)

# Calculate deviation for each number

def dev_gr2(group2, mean2):
    dev2_lst = []
    for num in group2:
        dev2 = (num - mean2)
        dev2_lst.append(dev2)
    return dev2_lst
deviatoin_gr2 = dev_gr2(group2, mean2)

# Calculate square deviation for each number

def sqrDev_gr2(deviatoin_gr2):
    sqrDev = []
    for num in deviatoin_gr2:
        sqr = num **2
        sqrDev.append(sqr)
    return sqrDev

square_dev_gr2 = sqrDev_gr2(deviatoin_gr2)

print('Calculated square deviation for the numbers in group2 is: ', square_dev_gr2)


Calculated square deviation for the numbers in group2 is:  [153.75999999999996, 57.76000000000002, 153.75999999999996, 57.76000000000002, 92.16000000000003]


In [8]:
import math
# Group3: 
    
group3 = [56, 76, 74, 87, 56]

# Calculate mean

def mean_gr3(group3):
    gr3_mean = sum(group3) / len(group3)
    return gr3_mean
mean3 = mean_gr3(group3)

# Calculate deviation for each number

def dev_gr3(group3, mean3):
    dev3_lst = []
    for num in group3:
        dev3 = (num - mean3)
        dev3_lst.append(dev3)
    return dev3_lst
deviatoin_gr3 = dev_gr3(group3, mean3)

# Calculate square deviation for each number

def sqrDev_gr3(deviatoin_gr3):
    sqrDev = []
    for num in deviatoin_gr3:
        sqr = num **2
        sqrDev.append(sqr)
    return sqrDev

square_dev_gr3 = sqrDev_gr3(deviatoin_gr3)

print('Calculated square deviation for the numbers in group3 is: ', square_dev_gr3)


Calculated square deviation for the numbers in group3 is:  [190.4399999999999, 38.44000000000003, 17.640000000000025, 295.8400000000001, 190.4399999999999]
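
The three cells above repeat the same steps for each group. As an alternative, a single helper can return the mean and the sum of squared deviations for any group. This is a minimal sketch; `sum_sq_dev`, `groups`, and `summary` are hypothetical names that are not used elsewhere in the notebook.

```python
def sum_sq_dev(values):
    """Return (mean, sum of squared deviations from the mean) for one group."""
    mean = sum(values) / len(values)
    return mean, sum((x - mean) ** 2 for x in values)

groups = {
    'group1': [51, 45, 33, 45, 67],
    'group2': [23, 43, 23, 43, 45],
    'group3': [56, 76, 74, 87, 56],
}

# Map each group name to its (mean, sum of squares) pair
summary = {name: sum_sq_dev(vals) for name, vals in groups.items()}

for name, (mean, ss) in summary.items():
    print(name, 'mean =', round(mean, 2), 'sum of squares =', round(ss, 2))
```

The per-group sums of squares are the quantities that are later added together to form the within-group sum of squares.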


In [9]:
# Now we calculate the sum of squares between groups (SS_between) ---
# calculate the grand mean (averaging the group means is valid here because all groups have the same size):
grand_mean = (mean1+mean2+mean3) / 3

# no of dataset for each group
n1 = len(group1)
n2 = len(group2)
n3 = len(group3)

SS_btw = (n1 * ((mean1 - grand_mean)**2)) + (n2 * ((mean2 - grand_mean)**2)) + (n3 * ((mean3 - grand_mean)**2))

# now we calculate the mean square between groups ---
mean_btw = (SS_btw / 2) # DOF between = (number of groups - 1) = 3-1 = 2
print('So calculated mean between value is: ', mean_btw)

So calculated mean between value is:  1511.4666666666665


In [10]:
# Now we calculate the sum of squares within groups (SS_within) ---
# Group1
ss_gr1 = sum(square_dev_gr1) # sum of the square deviation for the group
# Group2
ss_gr2 = sum(square_dev_gr2)
# Group3
ss_gr3 = sum(square_dev_gr3)

SS_within = (ss_gr1 + ss_gr2 + ss_gr3)

# now we calculate the mean square within groups ---
mean_within = (SS_within / 12) # total observations = 5*3 = 15, number of groups = 3, DOF within = 15-3 = 12
print('So calculated mean within value is: ', mean_within)

So calculated mean within value is:  155.06666666666666


In [11]:
# Conduct the F-test
F_test = (mean_btw / mean_within) # F = mean square between / mean square within
print('Calculated value of F-test is: ', F_test)

Calculated value of F-test is:  9.747205503009457


In [12]:
import scipy.stats

# find critical F value for α = 0.05
α = 0.05
q = 1-α
dof_w = 12 # degree of freedom within
dof_b = 2 # degree of freedom between

F_critical = scipy.stats.f.ppf(q, dof_b, dof_w)
print('Critical value of F-test is: ', F_critical)

Critical value of F-test is:  3.8852938346523933


<font color='blue'>
    
### Conclusions
    
The critical value of the F-test is F_crit = 3.89 at the 5% level of significance. <br>
Our computed value of the F-test is F_comp = 9.75. <br>

The computed value exceeds the critical value, so it falls in the rejection region and we reject the null hypothesis. <br>
Hence the mean is not the same for all the groups at the 5% level of significance.

    
</font>

In [13]:
import pandas as pd

# creating view for Anova table

anno = {'Source_Variation': ['Between', 'Within', 'Total'], 'Summation_Square(SS)': [SS_btw, SS_within, (SS_btw+SS_within)], 'DOF': [dof_b, dof_w,' '], 'Mean_Square(MS)': [mean_btw, mean_within, ' '], 'F_Score': [F_test, ' ', ' ']}
df_frq = pd.DataFrame(anno, columns = ['Source_Variation', 'Summation_Square(SS)', 'DOF', 'Mean_Square(MS)', 'F_Score'])
print('ANOVA table printed: ')
print('*' * 70 )
print(df_frq)

ANOVA table printed: 
**********************************************************************
  Source_Variation  Summation_Square(SS) DOF Mean_Square(MS)  F_Score
0          Between           3022.933333   2         1511.47  9.74721
1           Within           1860.800000  12         155.067         
2            Total           4883.733333                             


In [14]:
# Effect size (eta squared) = SS_between / SS_total

η2 = (SS_btw / (SS_btw + SS_within))
print('Effect size is: ', round(η2, 2))

Effect size is:  0.62


<font color='blue'>

### APA writeup

A one-way ANOVA showed a significant difference among the three group means, F(2, 12) = 9.75, p < .05, η2 = .62.

</font>

### Problem Statement 3: 
 
Calculate the F-test statistic (the ratio of the sample variances) for the two data sets 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25. 
 
For 10, 20, 30, 40, 50:

In [15]:
import math

lst_one = [10,20,30,40,50]

# calculate mean
def mean_cal(lst_one):
    val = (sum(lst_one) / len(lst_one))
    return val

miu_mean = mean_cal(lst_one)

# calculate standard deviation
def stdDev_cal(lst_one, miu_mean):
    sqr_sum = 0
    for num in lst_one:
        sqr_sum += ( num - miu_mean)**2
        
    stdDev = math.sqrt(sqr_sum * (1/(len(lst_one)-1)))
    return stdDev
        
sig_stdDev = stdDev_cal(lst_one, miu_mean)

var_one = sig_stdDev **2

print('Calculated variance for the first set is: ',var_one )
    

Calculated variance for the first set is:  250.0


In [16]:
lst_two = [5,10,15,20,25]

# calculate mean
def mean_cal(lst_two):
    val = (sum(lst_two) / len(lst_two))
    return val

miu_mean = mean_cal(lst_two)

# calculate standard deviation
def stdDev_cal(lst_two, miu_mean):
    sqr_sum = 0
    for num in lst_two:
        sqr_sum += ( num - miu_mean)**2
        
    stdDev = math.sqrt(sqr_sum * (1/(len(lst_two)-1)))
    return stdDev
        
sig_stdDev = stdDev_cal(lst_two, miu_mean)

var_two = sig_stdDev **2

print('Calculated variance for the second set is: ',var_two )

Calculated variance for the second set is:  62.5


In [17]:
# Calculate F Test

F_test = (var_one / var_two)
print('The F test value is: ', F_test)

The F test value is:  4.0
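
The problem statement asks only for the F value. If a decision is also wanted, the statistic can be compared against the F distribution with (n1 - 1, n2 - 1) degrees of freedom. The sketch below assumes a one-sided test at α = 0.05; neither the significance level nor the direction of the test is given in the problem, so both are assumptions, and the variable names are illustrative.

```python
import statistics
from scipy.stats import f

sample_one = [10, 20, 30, 40, 50]
sample_two = [5, 10, 15, 20, 25]

# Sample variances (n - 1 in the denominator)
var_one = statistics.variance(sample_one)
var_two = statistics.variance(sample_two)

f_stat = var_one / var_two
dof_num = len(sample_one) - 1   # numerator degrees of freedom
dof_den = len(sample_two) - 1   # denominator degrees of freedom

alpha = 0.05                    # assumed significance level, not given in the problem
f_critical = f.ppf(1 - alpha, dof_num, dof_den)
p_value = f.sf(f_stat, dof_num, dof_den)   # one-sided p-value

print('F statistic:', f_stat)
print('Critical value:', round(f_critical, 2))
print('One-sided p-value:', round(p_value, 3))
print('Reject H0 (equal variances):', f_stat > f_critical)
```

Under these assumptions the statistic is below the critical value, so the hypothesis of equal variances would not be rejected.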
