# Second Activity

In this activity you will use several hypothesis tests to draw some conclusions from the `diabetes dataset`. The libraries you will need for this activity are imported in the following cell.

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as st

Now we import the data.

In [None]:
data = pd.read_csv('diabetes-dataset.csv')
data.head()

## Difference of Population Means

In the `diabetes dataset`, there are two populations: the people who have diabetes and the people who do not have diabetes. One way to see if there is a significant difference between these two populations is to compute the means of, say, `Insulin` for both populations and check if the difference between these two means is statistically significant. Let us do this for each variable in the data set.

In [None]:
sample_means = data.groupby('Outcome').mean()
sample_means

The sample means of each variable generally differ between the two populations. For example, the mean of `Glucose` for people with diabetes is about 141.6, while for people without diabetes it is about 110.6. That is a considerable gap, but is it statistically significant? To answer this, we need to perform a hypothesis test like this one:

$$H_0: \mu_0-\mu_1=0$$
$$H_a: \mu_0-\mu_1\neq0$$

Here, for a given variable, say `Age`, $\mu_0$ and $\mu_1$ are the population means of people without diabetes and people with diabetes, respectively.
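
As a rough illustration of how such a test can be run, the sketch below applies Welch's two-sample t-test from `scipy.stats` to synthetic data. The toy frame `toy`, its values, and the choice of Welch's variant (which does not assume equal variances) are assumptions made for illustration; the activity does not prescribe a particular test statistic.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in for one variable; these numbers are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Outcome': np.repeat([0, 1], [200, 100]),
    'Glucose': np.concatenate([rng.normal(110, 25, 200),
                               rng.normal(140, 30, 100)]),
})

# Split the variable by population.
group_0 = toy.loc[toy['Outcome'] == 0, 'Glucose']
group_1 = toy.loc[toy['Outcome'] == 1, 'Glucose']

# Two-sided Welch t-test for H0: mu_0 - mu_1 = 0.
t_stat, p_value = st.ttest_ind(group_0, group_1, equal_var=False)

alpha = 0.05
reject_h0 = p_value <= alpha  # True means the difference is significant
print(t_stat, p_value, reject_h0)
```

The same comparison of the p-value against `alpha` applies to any other column; only the variable name changes.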

In the following cell you will write a function that checks, for each variable, whether the difference between the two population means is statistically significant, and returns a list of the variables for which it is. You must do this with the hypothesis test for the difference of means.

In [None]:
def significant_difference(df, alpha=0.05):
    
    """
    This function receives a dataframe df and a significance level alpha. The function
    performs a two-tailed hypothesis test for the difference of population means and returns
    a list with the variables for which said difference was significant.
    """
    
    relevant = []
            
    "INSERT YOUR CODE HERE"
    
    return relevant        

In [None]:
significant_variables = significant_difference(data)
significant_variables

Now run your function with a significance level equal to $\alpha=0.001$. Do you obtain the same list of variables?

In [None]:
significant_variables_1 = significant_difference(data, 0.001)
significant_variables_1

## Independence of Two Variables

We can also use hypothesis testing to check whether two variables are independent. Our variable of interest is once again `Outcome`, so we want to know whether there are any variables that do not affect the behavior of `Outcome`. If we consider the variable `BMI`, for example, the hypothesis test to be employed is the following:

$$H_0: \text{BMI and Outcome are independent}$$
$$H_a: \text{BMI and Outcome are not independent}$$
 
However, `BMI` is a continuous variable, and the independence test is valid for categorical variables, so if we want to use this test, we need to discretize the variable `BMI`. We will do this using quartiles: if a value is less than Q1, it is replaced by a **zero**; if it is greater than or equal to Q1 but less than Q2, it is replaced by a **one**; if it is greater than or equal to Q2 but less than Q3, it is replaced by a **two**; finally, if it is greater than or equal to Q3, it is replaced by a **three**.
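
The binning rule above can be sketched on a small array. This is only an illustration with made-up values, using `np.quantile` and `np.digitize`; it is not the implementation used by the function provided in this activity.

```python
import numpy as np

# Made-up values, chosen so the quartiles are easy to check by hand.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# Q1, Q2 and Q3 of the sample (here 3, 5 and 7).
q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])

# np.digitize returns 0 for x < Q1, 1 for Q1 <= x < Q2,
# 2 for Q2 <= x < Q3 and 3 for x >= Q3.
codes = np.digitize(values, bins=[q1, q2, q3])
print(codes)  # [0 0 1 1 2 2 3 3 3]
```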

The function `discretize` below applies this procedure to every column of the diabetes data set except `Outcome`, computing the quartiles of each column from its strictly positive values. Simply run it and work with the discrete data set stored in `data_discrete`.

In [None]:
def discretize(df1):
    
    """
    This function discretizes the continuous variables of the diabetes dataset using quartiles.
    You are not to modify it.
    """
    
    df = df1.copy()
    
    for column in df1.columns:
        if column != 'Outcome':
            statistics = df1.loc[df1[column] > 0, column].describe()
            q1 = statistics['25%']
            q2 = statistics['50%']
            q3 = statistics['75%']        
            df.loc[df1[column] < q1, column] = 0
            df.loc[(df1[column] >= q1) & (df1[column] < q2), column] = 1
            df.loc[(df1[column] >= q2) & (df1[column] < q3), column] = 2
            df.loc[df1[column] >= q3, column] = 3
    
    df = df.astype({'BMI': 'int32', 'DiabetesPedigreeFunction': 'int32'})
    
    return df  

In [None]:
data_discrete = discretize(data)
data_discrete.head()
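
Before working with the discretized data, it may help to see an independence test on a small example. The sketch below builds a contingency table with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`. The frame `toy`, the column name `bmi_bin`, and all counts are made up for illustration; the chi-square test is one common choice for this kind of hypothesis, not necessarily the only acceptable one.

```python
import pandas as pd
import scipy.stats as st

# Made-up counts per (bin, outcome) pair: bin -> (n with Outcome 0, n with Outcome 1).
counts = {0: (40, 10), 1: (35, 15), 2: (25, 25), 3: (15, 35)}

rows = []
for bin_code, (n0, n1) in counts.items():
    rows += [(bin_code, 0)] * n0 + [(bin_code, 1)] * n1
toy = pd.DataFrame(rows, columns=['bmi_bin', 'Outcome'])

# Contingency table: one row per bin, one column per outcome.
table = pd.crosstab(toy['bmi_bin'], toy['Outcome'])

# Chi-square test of independence.
chi2, p_value, dof, expected = st.chi2_contingency(table)

alpha = 0.01
not_independent = p_value <= alpha  # True means H0 (independence) is rejected
print(chi2, p_value, dof, not_independent)
```

With four bins and two outcomes, the test has (4 - 1)(2 - 1) = 3 degrees of freedom.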

In the following cell, you will write a function that checks whether a given variable and the variable `Outcome` are independent. The function must do this for each variable and return a list containing the variables that are not independent of `Outcome`. As expected, you will need to do this with the hypothesis test for independence of two variables.

In [None]:
def dependence_test(df, alpha=0.05):
    
    """
    This function receives a dataframe df and a significance level alpha. The function
    performs a hypothesis test for the independence of two variables and returns
    a list with the variables that are not independent of the target variable Outcome.
    """
    
    dependent = []
            
    "INSERT YOUR CODE HERE"
            
    return dependent   

Run your function and use a significance level of $\alpha=0.01$. Is there any variable that is independent of `Outcome`?

In [None]:
dependent_variables = dependence_test(data_discrete, 0.01)
dependent_variables

The list you get contains the variables that have a statistically significant relationship with the variable of interest, `Outcome`.

## ANOVA

An example of an experimental statistical study is as follows: Chemitech, Inc. developed a filtration system for municipal water supplies. The components of the filtration system were purchased from various suppliers and Chemitech assembled the filtration system at its factory in Columbia, South Carolina. The industrial engineering group is charged with determining the best method for assembling the filtration system. After considering several methods, only three alternatives remained: **Method A**, **Method B** and **Method C**. The difference between these methods was the order of the steps to assemble the system. Chemitech wants to know which method can produce the most filtration systems in a week.

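The data collected by Chemitech is stored in a `NumPy` array. Each column corresponds to one assembly method (A, B and C, in that order) and each row to one observation.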

In [None]:
methods = np.array([[58, 58, 48], 
                    [64, 69, 57], 
                    [55, 71, 59], 
                    [66, 64, 47], 
                    [67, 68, 49]])
methods
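
For orientation, a one-way ANOVA tests the null hypothesis that all group means are equal against the alternative that at least one differs. The sketch below runs such a test with `scipy.stats.f_oneway` on three made-up groups (the names `group_a`, `group_b`, `group_c` and their values are hypothetical and unrelated to the Chemitech data).

```python
import scipy.stats as st

# Three made-up samples with clearly different means.
group_a = [10, 12, 11, 13, 12]
group_b = [20, 22, 21, 19, 23]
group_c = [30, 29, 31, 32, 28]

# f_oneway takes one sequence per group and returns the F statistic and p-value.
f_stat, p_value = st.f_oneway(group_a, group_b, group_c)

alpha = 0.05
reject_h0 = p_value <= alpha  # True means at least one group mean differs
print(f_stat, p_value, reject_h0)
```

When the groups are stored as columns of a 2D array, each column has to be passed as a separate argument.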

You will carry out an **analysis of variance**, also known as **ANOVA**, on the data collected by Chemitech. Use the following cell to write code that performs an ANOVA on this data with a significance level of $\alpha=0.05$. Store the resulting p-value in a variable named `p_value`, since the final cell uses it to draw the conclusion.

In [None]:
alpha = 0.05
"INSERT YOUR CODE HERE"

Now you are ready to draw a conclusion. What did you get?

In [None]:
if p_value <= alpha:
    print("Reject the null hypothesis (There are significant differences between groups)")
else:
    print("Fail to reject the null hypothesis (No significant differences between groups)")