## Problem Statement 1:

Is gender independent of education level? A random sample of 395 people was
surveyed, and each person was asked to report the highest education level they had
obtained. The survey data are summarized in the following table:

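|        | High_School | Bachelors | Masters | Ph.d. | Total |
|--------|-------------|-----------|---------|-------|-------|
| Female | 60          | 54        | 46      | 41    | 201   |
| Male   | 40          | 44        | 53      | 57    | 194   |
| Total  | 100         | 98        | 99      | 98    | 395   |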

Question: Are gender and education level dependent at the 5% level of significance? In
other words, given the data collected above, is there a relationship between the
gender of an individual and the level of education that they have obtained?

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
High_School = [60,40]
Bachelors = [54,44]
Masters = [46,53]
Ph_d = [41,57]
Female = [60,54,46,41]
Male = [40,44,53,57]
marks = Male + Female
print(marks)

[40, 44, 53, 57, 60, 54, 46, 41]


In [3]:
Sex = ["M","M","M","M","F","F","F","F"]
Education = ["High_School","Bachelors","Masters","Ph.d.","High_School","Bachelors","Masters","Ph.d."]
df = pd.DataFrame({"Education" : Education,"Sex" : Sex,"Marks" : marks})
df

Unnamed: 0,Education,Sex,Marks
0,High_School,M,40
1,Bachelors,M,44
2,Masters,M,53
3,Ph.d.,M,57
4,High_School,F,60
5,Bachelors,F,54
6,Masters,F,46
7,Ph.d.,F,41


In [4]:
dataset_table = pd.crosstab(df["Sex"],df["Education"],df["Marks"],aggfunc = "sum")

In [5]:
print(dataset_table)

Education  Bachelors  High_School  Masters  Ph.d.
Sex                                              
F                 54           60       46     41
M                 44           40       53     57


In [6]:
contingency_table = dataset_table.values
print(contingency_table)

[[54 60 46 41]
 [44 40 53 57]]


Null Hypothesis : The gender of an individual and the level of education are independent of each other.

Alternative Hypothesis : The gender of an individual and the level of education are dependent on each other.

In [7]:
#observed frequencies
observed_value = contingency_table

In [8]:
#Expected Frequencies
val = stats.chi2_contingency(dataset_table)
val

(8.006066246262538,
 0.045886500891747214,
 3,
 array([[49.86835443, 50.88607595, 50.37721519, 49.86835443],
        [48.13164557, 49.11392405, 48.62278481, 48.13164557]]))
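The expected frequencies returned by `chi2_contingency` can be reproduced by hand, which may help show where they come from. The sketch below assumes the usual independence formula, expected count = (row total × column total) / grand total; the names `observed`, `row_totals`, `col_totals`, and `expected_by_hand` are illustrative and not part of the notebook. The columns follow the order of the survey table (High_School first), so they appear in a different order from the alphabetically sorted crosstab output.

```python
import numpy as np

# Observed counts from the survey table
# (rows: Female, Male; columns: High_School, Bachelors, Masters, Ph.d.).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1)  # one total per gender
col_totals = observed.sum(axis=0)  # one total per education level
n = observed.sum()                 # grand total of respondents

# Expected count under independence: E_ij = row_total_i * col_total_j / n
expected_by_hand = np.outer(row_totals, col_totals) / n
print(expected_by_hand.round(2))
```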

In [9]:
expected_value = val[3]
dof = val[2] #degrees of freedom

In [10]:
#perform chi2 test
chi2 = sum(sum([((O - E)**2)/E for O,E in zip(observed_value,expected_value)]))
print("Chi2 value :",chi2)

Chi2 value : 8.006066246262538


In [11]:
#Critical Value
alpha = 0.05
critical_value = stats.chi2.ppf(1-alpha,dof)
print("Critical Value : ",critical_value)

Critical Value :  7.814727903251179


In [12]:
#P Value
p_value = 1 - stats.chi2.cdf(chi2,dof)
print("P Value :",p_value)

P Value : 0.04588650089174717
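As a cross-check on the p-value above, the same upper-tail probability can be obtained from the survival function `stats.chi2.sf`, which is equivalent to `1 - cdf` and is generally the more accurate choice when the p-value is very small. This is only a sketch: the statistic and degrees of freedom are copied from the outputs above, and the names `chi2_stat` and `p_from_sf` are illustrative.

```python
from scipy import stats

chi2_stat = 8.006066246262538  # test statistic printed earlier
dof = 3                        # (2 - 1) * (4 - 1) degrees of freedom

# Survival function = 1 - CDF, i.e. P(X >= chi2_stat) under the null hypothesis
p_from_sf = stats.chi2.sf(chi2_stat, dof)
print("P Value (sf):", p_from_sf)
```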


In [13]:
#Comparing the test statistic with the critical value
if chi2 >= critical_value :
    print("The gender of an individual and the level of education are dependent on each other")
else :
    print("The gender of an individual and the level of education are independent of each other")
#Comparing the p value with alpha
if p_value <= alpha :
    print("The gender of an individual and the level of education are dependent on each other")
else :
    print("The gender of an individual and the level of education are independent of each other")

The gender of an individual and the level of education are dependent on each other
The gender of an individual and the level of education are dependent on each other


## Problem Statement 2:

Using the following data, perform a one-way analysis of variance using α = .05. Write
up the results in APA format.

[Group1: 51, 45, 33, 45, 67]

[Group2: 23, 43, 23, 43, 45]

[Group3: 56, 76, 74, 87, 56]

In [14]:
import pandas as pd
import numpy as np
from scipy import stats 

In [15]:
d = {
    "Group1" : pd.Series([51, 45, 33, 45, 67]),
    "Group2" : pd.Series([23, 43, 23, 43, 45]),
    "Group3" : pd.Series([56, 76, 74, 87, 56])
}
df = pd.DataFrame(d)

In [16]:
df

Unnamed: 0,Group1,Group2,Group3
0,51,23,56
1,45,43,76
2,33,23,74
3,45,43,87
4,67,45,56


Null Hypothesis : The means of all groups are equal.

Alternative Hypothesis : At least one group mean differs from the others.

In [17]:
anova_value , p_value = stats.f_oneway(df["Group1"] ,df["Group2"] ,df["Group3"])

In [18]:
print("P Value :",p_value)

P Value : 0.0030597541434430556


In [19]:
if p_value < 0.05 :
    print("Reject Null Hypothesis")
else :
    print("Fail to Reject Null Hypothesis")

Reject Null Hypothesis
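The problem asks for an APA-style write-up, which needs the F statistic and both degrees of freedom in addition to the p-value. The following is a minimal sketch of one way to assemble that line. It assumes the common APA convention of reporting `F(df_between, df_within)` with the leading zero dropped from p, and the names `groups`, `df_between`, `df_within`, and `apa_line` are illustrative.

```python
from scipy import stats

group1 = [51, 45, 33, 45, 67]
group2 = [23, 43, 23, 43, 45]
group3 = [56, 76, 74, 87, 56]
groups = [group1, group2, group3]

f_stat, p = stats.f_oneway(*groups)

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
df_between = k - 1                     # between-groups degrees of freedom
df_within = n_total - k                # within-groups degrees of freedom

# APA style drops the leading zero from p because p cannot exceed 1
p_text = f"{p:.3f}".lstrip("0")
apa_line = f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_text}"
print(apa_line)
```

Because the p-value is below .05, the write-up would state that there was a statistically significant difference between the group means.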


## Problem Statement 3:

Calculate the F-test for the two samples 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.


Null Hypothesis : The population variances are equal.

Alternative Hypothesis : The population variances are unequal.

In [20]:
X = [10, 20, 30, 40, 50]
Y = [5,10,15, 20, 25]

In [21]:
import numpy as np
from scipy import stats

#define F-test function (two-tailed, larger sample variance in the numerator)
def f_test(x, y):
    x = np.array(x)
    y = np.array(y)
    var_x = np.var(x, ddof=1) #sample variance of x
    var_y = np.var(y, ddof=1) #sample variance of y
    dof_x = x.size-1 #degrees of freedom of x sample
    dof_y = y.size-1 #degrees of freedom of y sample
    if var_x >= var_y :
        f, dof_num, dof_den = var_x/var_y, dof_x, dof_y
    else :
        f, dof_num, dof_den = var_y/var_x, dof_y, dof_x
    #the alternative hypothesis is two-sided, so double the upper-tail probability
    p = min(2*stats.f.sf(f, dof_num, dof_den), 1.0)
    return f, p

#perform F-test
f_value , p_value = f_test(X, Y)

In [22]:
print("F Test Value for given 10, 20, 30, 40, 50 and 5,10,15, 20, 25 :",f_value)

F Test Value for given 10, 20, 30, 40, 50 and 5,10,15, 20, 25 : 4.0


In [23]:
if p_value < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

fail to reject null hypothesis
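The same decision can be reached by comparing the F statistic with a critical value instead of using the p-value. The sketch below assumes a two-sided test at α = 0.05 with the larger sample variance in the numerator, so the upper critical value is taken at 1 − α/2; the names `var_x`, `var_y`, `f_stat`, and `critical` are illustrative.

```python
import numpy as np
from scipy import stats

X = [10, 20, 30, 40, 50]
Y = [5, 10, 15, 20, 25]

var_x = np.var(X, ddof=1)  # sample variance of X
var_y = np.var(Y, ddof=1)  # sample variance of Y

# Larger variance goes in the numerator so that F >= 1
f_stat = max(var_x, var_y) / min(var_x, var_y)

alpha = 0.05
dof_num = len(X) - 1  # X has the larger variance here
dof_den = len(Y) - 1

# Upper critical value for a two-sided test
critical = stats.f.ppf(1 - alpha / 2, dof_num, dof_den)

print("F statistic    :", f_stat)
print("Critical value :", critical)
print("Reject null hypothesis" if f_stat > critical else "Fail to reject null hypothesis")
```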
