# Second Activity

In this activity you will use several hypothesis tests to draw some conclusions from the `diabetes dataset`. The libraries you will need for this activity are imported in the following cell.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

Now we import the data.

In [2]:
data = pd.read_csv('diabetes-dataset.csv')
data.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,2,138,62,35,0,33.6,0.127,47,1
1,0,84,82,31,125,38.2,0.233,23,0
2,0,145,0,0,0,44.2,0.63,31,1
3,0,135,68,42,250,42.3,0.365,24,1
4,1,139,62,41,480,40.7,0.536,21,0


## Difference of Population Means

In the `diabetes dataset` there are two populations: people who have diabetes and people who do not. One way to see whether these two populations differ significantly is to compute the mean of, say, `Insulin` for each population and check whether the difference between the two means is statistically significant. Let us do that with each variable of the dataset.

In [3]:
sample_means = data.groupby('Outcome').mean()
sample_means

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.168693,110.586626,68.094985,20.052432,70.56383,30.567477,0.434676,31.081307
1,4.732456,141.568713,71.166667,22.633041,98.897661,35.320468,0.540681,36.95614


The means of each variable clearly differ between these two populations. For instance, the mean of `Glucose` is about 141.6 for people with diabetes and approximately 110.6 for people without it. The gap is considerable, but is it statistically significant? To answer this we have to perform a hypothesis test. The function in the following cell runs a two-sample t-test that does not assume equal variances (Welch's t-test) on each variable and returns the variables whose p-value is at most `alpha`.

In [None]:
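def significant_difference(df, alpha=0.05):
    """Return the variables whose group means differ significantly.

    Runs Welch's t-test (unequal variances) on every column except the
    last one, which is assumed to be 'Outcome'.
    """
    group_0 = df[df['Outcome'] == 0]
    group_1 = df[df['Outcome'] == 1]
    variables = df.columns[0:-1]
    relevant = []

    for variable in variables:
        _, p_value = st.ttest_ind(group_0[variable], group_1[variable], equal_var=False)
        # Reject the null hypothesis of equal means at significance level alpha
        if p_value <= alpha:
            relevant.append(variable)

    return relevant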

In [None]:
significant_difference(data, 0.001)
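
To make explicit what `st.ttest_ind(..., equal_var=False)` computes inside `significant_difference`, here is a minimal sketch that derives the Welch t statistic and its two-sided p-value from the textbook formulas and compares them with SciPy's result. The two samples are synthetic stand-ins generated for illustration, not values from the diabetes dataset, and `welch_t_test` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for one variable in each group (NOT the diabetes data).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=110.0, scale=25.0, size=200)
sample_b = rng.normal(loc=140.0, scale=30.0, size=120)

def welch_t_test(x, y):
    """Two-sided Welch t-test computed from the textbook formulas."""
    n_x, n_y = len(x), len(y)
    # Squared standard error of each sample mean (unbiased variances).
    se2_x = np.var(x, ddof=1) / n_x
    se2_y = np.var(y, ddof=1) / n_y
    t_stat = (np.mean(x) - np.mean(y)) / np.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (n_x - 1) + se2_y ** 2 / (n_y - 1))
    # Two-sided p-value from the Student t distribution.
    p_value = 2 * st.t.sf(abs(t_stat), dof)
    return t_stat, p_value

manual_t, manual_p = welch_t_test(sample_a, sample_b)
scipy_t, scipy_p = st.ttest_ind(sample_a, sample_b, equal_var=False)

print(manual_t, manual_p)
print(scipy_t, scipy_p)
```

Both lines should print the same statistic and p-value up to floating-point error. A variable is reported by `significant_difference` exactly when this p-value is at most `alpha`.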

In [None]:
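def discretize(df1):
    """Replace each variable by its quartile bin (0, 1, 2 or 3).

    The quartiles of a column are computed from its positive values only,
    so zeros do not influence the cut points and always fall in bin 0.
    The columns 'Overweight' (if present) and 'Outcome' are left untouched.
    """
    df = df1.copy()

    for column in df1.columns:
        if column not in ['Overweight', 'Outcome']:
            statistics = df1.loc[df1[column] > 0, column].describe()
            q1 = statistics['25%']
            q2 = statistics['50%']
            q3 = statistics['75%']
            # Compare against the original values in df1, write the bins into df
            df.loc[df1[column] < q1, column] = 0
            df.loc[(df1[column] >= q1) & (df1[column] < q2), column] = 1
            df.loc[(df1[column] >= q2) & (df1[column] < q3), column] = 2
            df.loc[df1[column] >= q3, column] = 3

    # These two columns are floats in the raw data; store their bins as integers
    df = df.astype({'BMI': 'int32', 'DiabetesPedigreeFunction': 'int32'})

    return df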

In [None]:
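def independence_test(df, alpha=0.05):
    """Return the variables that are not independent of 'Outcome'.

    Each variable is discretized into quartile bins and tested against
    'Outcome' with a chi-square test of independence.
    """
    df_discrete = discretize(df)
    columns = df.columns[0:-1]  # every column except the last one ('Outcome')
    dependent = []

    for column in columns:
        contingency_table = pd.crosstab(df_discrete['Outcome'], df_discrete[column])
        _, p_value, _, _ = st.chi2_contingency(contingency_table)
        # Reject the null hypothesis of independence at significance level alpha
        if p_value <= alpha:
            dependent.append(column)

    return dependent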

In [None]:
independence_test(data, 0.01)
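
As a rough illustration of what `st.chi2_contingency` does with each table built in `independence_test`, the sketch below computes the expected counts, the chi-square statistic, and the p-value by hand for a small table and checks them against SciPy. The counts in `observed` are made up for the example (two outcome rows, four quartile-bin columns) and do not come from the diabetes dataset.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x4 contingency table: rows = outcome (0/1), columns = quartile bin.
observed = np.array([[40, 30, 20, 10],
                     [10, 20, 30, 40]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = st.chi2.sf(chi2_manual, dof_manual)

# SciPy applies Yates' correction only when there is one degree of freedom,
# so for this 2x4 table the results should match the manual computation.
chi2, p_value, dof, expected_scipy = st.chi2_contingency(observed)

print(chi2_manual, dof_manual, p_manual)
print(chi2, dof, p_value)
```

Every expected count in this table is 25, which gives a statistic of 40 with 3 degrees of freedom. A variable is reported as dependent when the resulting p-value is at most `alpha`.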

In [None]:
methods = np.array([[58, 58, 48], 
                    [64, 69, 57], 
                    [55, 71, 59], 
                    [66, 64, 47], 
                    [67, 68, 49]])
methods

In [None]:
f_statistic, p_value = st.f_oneway(methods[:,0], methods[:,1], methods[:,2])

In [None]:
f_statistic, p_value
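
The one-way ANOVA behind `st.f_oneway` can also be reproduced by hand. The sketch below assumes, as the column slicing in the cells above suggests, that each column of `methods` holds the observations of one group; the notebook does not state this explicitly. It splits the total variation into between-group and within-group sums of squares and forms the F statistic from their mean squares.

```python
import numpy as np
import scipy.stats as st

# Same array as in the cells above; each column is assumed to be one group.
methods = np.array([[58, 58, 48],
                    [64, 69, 57],
                    [55, 71, 59],
                    [66, 64, 47],
                    [67, 68, 49]])

n_obs, n_groups = methods.shape
grand_mean = methods.mean()
group_means = methods.mean(axis=0)

# Variation of the group means around the grand mean.
ss_between = n_obs * ((group_means - grand_mean) ** 2).sum()
# Variation of the observations around their own group mean.
ss_within = ((methods - group_means) ** 2).sum()

df_between = n_groups - 1
df_within = methods.size - n_groups

# F statistic: ratio of the between-group and within-group mean squares.
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = st.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = st.f_oneway(methods[:, 0], methods[:, 1], methods[:, 2])

print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

The two printed pairs should agree up to floating-point error. A small p-value indicates that at least one group mean differs from the others, but not which one.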