# <font color='#d35400'> Lab 5 | Hypothesis Testing </font>
Welcome to Lab 5! In this lab we practice hypothesis testing by implementing several hypothesis tests from scratch and then comparing our results with the output of built-in Python functions. The lab uses World Health Organization (WHO) data, which tracks a number of useful health-related metrics aggregated at the country level.

<p align="center">
  <img src="dog_loves_sausages.jpeg" alt="Alt Text" width="300"/>
</p>

### <font color='#FF8C00'> About the Dataset </font>
We start off this Jupyter Notebook by examining the dataset from `who2009.csv`. We look at the features that are in the dataset, as well as the data type of each feature. To get an idea of what we're dealing with, we start by displaying the dataset in Pandas.

In [None]:
# importing the pandas library
import pandas as pd

# importing the ipython library
from IPython.display import display

# read the .csv file using the pandas library
who_df = pd.read_csv("who2009.csv")

# displaying the pandas dataframe
who_df

In [None]:
# printing the column names, non-null counts, and data types
# (.info() prints directly and returns None, so it is not wrapped in display)
who_df.info()

# displaying the summary statistics of the numeric columns
display("Summary Statistics", who_df.describe())

### <font color='#FF8C00'> Dataset Information & Summary </font>
According to `.info()` and `.describe()`, we can see that there are 193 rows & 267 columns. With regard to the column headings in the dataset, the ones we use are as follows:
- **Country:** Member State of the World Health Organization
- **v9:** life expectancy at birth for both sexes
- **v22:** infant mortality rate of both sexes
- **v159:** health workforce
- **v168:** hospital beds
- **v174:** total expenditure on health as % of GDP
- **v186:** out of pocket expenditure as % of private expenditure on health
- **v192:** per capita total expenditure on health
- **v249:** total fertility rate
- **v259:** gross national income per capita 
- **regionname:** alphabetic region name

## <font color = '#FF8C00'> Section 1 | Data Preparation </font>
In this section, we work on reading a `.csv` file in Pandas and then perform some analysis, such as looking at the size of the dataset and understanding what the features are by referring to the codebook. We then rename the columns according to the codebook. The instructions are arranged in the bullet points as follows:
- [x] First read the who2009.csv file into your python development environment using pandas. You’ll need to import pandas and then apply the `read_csv` function.
- [x] Assess the size of the data set using the `.shape` attribute. You should notice that there are a large number of columns in the data. We will only need a very small number of them. I have provided a codebook to go with this data (WHO2009SubsetCodebook.pdf). Drop columns from the data, keeping only those identified in the codebook.
- [x] Rename the columns in the reduced data set to names that are appropriate descriptors of the information contained in each variable according to the codebook. You may find the `.rename` method to be useful. Use the following variable names: `country`, `life_exp`, `infant_mortality`, `phys_density`, `hospital_bed`, `health_exp_percent_GDP`, `OOP_percent_exp`, `health_exp_PC`, `fertility_rate`, `GNI_PC`, `regionname`.

### <font color = '#FF8C00'> Finding the Shape of the Dataframe </font>
We start off by finding the shape of the data frame, which means we get information as to how many rows and columns are in the dataset. Using `.shape` returns a tuple, where the first element is the number of rows and the second element is the number of columns.

In [None]:
# printing out the shape of the data frame and the rows and columns
print("The Shape of the Dataframe: ", "\n")
print("Number of Rows: ", who_df.shape[0], "\n")
print("Number of Columns: ", who_df.shape[1])

### <font color = '#FF8C00'> The Codebook & Dropping the Data Columns </font>
It can be seen that the dataset has a large number of columns. We need only a small number of them, so we keep the columns that are mentioned in the codebook and drop the rest.

In [None]:
# the list of columns in the data set that are included in the code book
col_list = ["country", "v9", "v22", "v159", "v168", "v174", "v186", "v192", "v249", "v259", "regionname"]

# we take these columns of interest and create a new data frame
new_who_df = who_df[col_list]

# displaying the first 5 rows of the new data frame
new_who_df.head(5)
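
Selecting with a list of names raises a `KeyError` if any name is absent from the frame, so it can help to check the codebook names first. The following is a minimal sketch on an invented frame; `demo_df`, `wanted`, and the column `v1` are hypothetical and only stand in for the real data.

```python
import pandas as pd

# invented frame standing in for the full WHO data
demo_df = pd.DataFrame({
    "country": ["A", "B"],
    "v1": [1, 2],
    "v9": [70.0, 80.0],
    "regionname": ["Asia", "Europe"],
})

# hypothetical list of columns to keep
wanted = ["country", "v9", "regionname"]

# names that were requested but are not present in the frame
missing = [name for name in wanted if name not in demo_df.columns]
print(missing)

# keep only the wanted columns; .copy() makes the result independent of demo_df
subset_df = demo_df[wanted].copy()
print(subset_df.shape)
```

If `missing` is not empty, the names in it point to a typo or a case mismatch in the list of columns.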


### <font color = '#FF8C00'> Renaming the Data Columns </font>
We can now work on renaming the data columns. We rename the data columns as follows:
- `country` as `country`
- `v9` as `life_exp`
- `v22` as `infant_mortality`
- `v159` as `phys_density`
- `v168` as `hospital_bed`
- `v174` as `health_exp_percent_GDP`
- `v186` as `OOP_percent_exp`
- `v192` as `health_exp_PC`
- `v249` as `fertility_rate`
- `v259` as `GNI_PC`
- `regionname` as `regionname`


In [None]:
# renaming the columns as mentioned in the code book
new_who_df = new_who_df.rename(columns={'v9' : 'life_exp', 'v22' : 'infant_mortality', 'v159' : 'phys_density'})
new_who_df = new_who_df.rename(columns={'v168' : 'hospital_bed', 'v174' : 'health_exp_percent_GDP'})
new_who_df = new_who_df.rename(columns={'v186' : 'OOP_percent_exp', 'v192' : 'health_exp_PC'})
new_who_df = new_who_df.rename(columns={'v249' : 'fertility_rate', 'v259' : 'GNI_PC'})

# displaying the new data frame with renamed columns
new_who_df.head(5)
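
The renaming can also be driven by a single mapping passed to one `.rename` call. This is only a sketch on a small stand-in frame: `demo_df` and its values are invented, and `column_names` is a hypothetical name; only the code-to-name pairs come from the list above.

```python
import pandas as pd

# invented stand-in for the reduced WHO frame
demo_df = pd.DataFrame({
    "country": ["A", "B"],
    "v9": [70.0, 80.0],
    "v22": [30.0, 4.0],
    "regionname": ["Asia", "Europe"],
})

# one mapping from codebook code to descriptive name
column_names = {
    "v9": "life_exp",
    "v22": "infant_mortality",
    "v159": "phys_density",
    "v168": "hospital_bed",
    "v174": "health_exp_percent_GDP",
    "v186": "OOP_percent_exp",
    "v192": "health_exp_PC",
    "v249": "fertility_rate",
    "v259": "GNI_PC",
}

# .rename returns a new frame; keys that are not in the frame are ignored by default
renamed_df = demo_df.rename(columns=column_names)
print(renamed_df.columns.tolist())
```

Keeping the mapping in one dictionary makes it easier to compare the names against the codebook in a single place.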

### <font color='#FF8C00'> Sources Used For Section One </font>
- https://stackoverflow.com/questions/16616141/keep-certain-columns-in-a-pandas-dataframe-deleting-everything-else
- https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas

## <font color = '#FF8C00'> Section 2 | One Sample T-test </font>
A one-sample t-test compares the mean of a numerical variable against a fixed number. In this section, we assess the life expectancy in Europe by creating a new data set. We then write a function that takes in data and a value and returns the test statistic and p-value as outputs. We use that function to assess whether the life expectancy in Europe is different from 70 and from 76 years, and then compare our results with `scipy.stats`. The steps are listed in the bullet points below:

- [x] Begin by creating a second data set that includes only rows in which the regionname is "Europe"
- [x] The test statistic for a one-sample t-test is given in the Formulas section below, where $s$ is the sample standard deviation, $\mu$ is the sample mean, $n$ is the number of observations in the sample, $x_i$ is the value of the variable for the i-th observation, and $M$ is the fixed number we test against. The p-value is found using the last formula in that section.
- [x] Write a function to assess whether the life expectancy in Europe is significantly different from 70 and 76 years
- [x] Validate the code by comparing the results to the built-in one-sample t-test. We achieve this by using the `ttest_1samp` method from `scipy.stats`.

### <font color = '#FF8C00'> Formulas </font>
$$
t = \frac{\mu - M}{s / \sqrt{n}}
$$

$$
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2}
$$

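$$
p = 2(1 - P(|t|))
$$

Here $P$ is the cumulative distribution function (CDF) of the t-distribution with $n - 1$ degrees of freedom.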
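
As a quick check of the formulas, consider a made-up sample of five values, 72, 75, 78, 74 and 76, tested against $M = 70$ (these numbers are purely illustrative and are not taken from the WHO data):

$$
\mu = \frac{72 + 75 + 78 + 74 + 76}{5} = 75
$$

$$
s = \sqrt{\frac{(-3)^2 + 0^2 + 3^2 + (-1)^2 + 1^2}{5 - 1}} = \sqrt{\frac{20}{4}} = \sqrt{5} \approx 2.236
$$

$$
t = \frac{75 - 70}{\sqrt{5} / \sqrt{5}} = 5
$$

With $n - 1 = 4$ degrees of freedom, the two-sided p-value for $t = 5$ falls between 0.005 and 0.01, so this sample mean would be judged different from 70 at the 1% level.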

### <font color = '#FF8C00'> Filtering Out Europe </font>
We start off by focusing on the `regionname` column. We keep only the rows for countries that belong to the Europe region, so that we can look at the selected features for those countries.


In [None]:
# creating the second dataset where the region name is only "Europe"
europe_df = new_who_df[new_who_df["regionname"] == "Europe"]

# displaying the first 5 rows of the europe data frame
europe_df.head(5)

### <font color = '#FF8C00'> Building the Function </font>
Next, we move on to building our function using the formulas mentioned above. The function is designed to take the data and the value M as inputs and return the test statistic and p-value as outputs.

We work with the CDF of the t-distribution. From `scipy.stats`, we import `t` to load the t-distribution and then use `t.cdf` with the degrees of freedom set to `n - 1`. The `.mean` and `.std` functions from `numpy` are helpful, but be sure to pass `ddof=1` when using `.std` so that it computes the sample standard deviation.

In [None]:
# importing the required libraries
import numpy as np
import math
from scipy.stats import t
from scipy.stats import ttest_1samp

In [None]:
# creating the t-test function
def one_smaple_t_test(feature_col, input_value):

    # converting the column into a list
    feature_list = feature_col.tolist()

    # calculating the mean
    feature_mean = np.mean(feature_col)

    # finding number of observations and square root it
    observation = len(feature_col)
    sqrt_observation = math.sqrt(observation)

    # (observation - mean) ^ 2
    difference_list = []
    for row in range(0, len(feature_list)):
        difference = (feature_list[row] - feature_mean)**2
        difference_list.append(difference)
    
    # finding the sum of the differences
    sum_of_differences = sum(difference_list)

    # calculating the variance
    variance = sum_of_differences * (1 / (observation - 1))

    # calculating the standard deviation
    standard_deviation = math.sqrt(variance)

    # calculating the t-value
    t_value = (feature_mean - input_value) / (standard_deviation / sqrt_observation)

    # calculating the p-value
    p_value = 2 * (1 - t.cdf(abs(t_value), observation - 1))

    # returning t value and p value
    return t_value, p_value
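
For comparison, the same calculation can be written with the `numpy` helpers mentioned above. This is only a sketch: the name `one_sample_t_test_np` and the sample values are invented for illustration, and dropping missing values before the test is an assumption about how gaps in the data should be handled.

```python
import numpy as np
from scipy.stats import t


def one_sample_t_test_np(values, input_value):
    # convert to a float array and drop missing values (an assumption, see above)
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]

    n = data.size
    sample_mean = data.mean()
    # ddof=1 gives the sample standard deviation (divides by n - 1)
    sample_std = data.std(ddof=1)

    t_value = (sample_mean - input_value) / (sample_std / np.sqrt(n))
    # two-sided p-value from the t-distribution with n - 1 degrees of freedom
    p_value = 2 * (1 - t.cdf(abs(t_value), n - 1))
    return t_value, p_value


# made-up sample: mean 75 and sample variance 5, tested against 70
demo_t, demo_p = one_sample_t_test_np([72, 75, 78, 74, 76], 70)
print(demo_t, demo_p)
```

The loop-based version above makes each step of the formula visible, while this variant trades that for brevity.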

### <font color = '#FF8C00'> Life Expectancy </font>
We now use the function we built to test whether the life expectancy in Europe is different from 70 years and from 76 years. For each value, the function returns the t-value and the p-value.


In [None]:
# running the function where input_value is 70 as in 70 years
t_value_seventy, p_value_seventy = one_smaple_t_test(europe_df['life_exp'], 70)
print("The t value for Life Expectancy of 70: ", t_value_seventy, "\n")
print("The p value for Life Expectancy of 70: ", p_value_seventy, "\n" )

# running the function where input_value is 76 as in 76 years
t_value_seventy_six, p_value_seventy_six = one_smaple_t_test(europe_df['life_exp'], 76)
print("The t value for Life Expectancy of 76: ", t_value_seventy_six, "\n")
print("The p value for Life Expectancy of 76: ", p_value_seventy_six, "\n" )


### <font color = '#FF8C00'> Validating the Function </font>
We now work on validating the function. We compare the results obtained in the previous Python cell with those returned by the `ttest_1samp` function from the `scipy.stats` library.

In [None]:
tt_value_seventy, pp_value_seventy = ttest_1samp(europe_df['life_exp'], 70)
print("The t value for Life Expectancy of 70: ", tt_value_seventy, "\n")
print("The p value for Life Expectancy of 70: ", pp_value_seventy, "\n")

tt_value_seventy_six, pp_value_seventy_six = ttest_1samp(europe_df['life_exp'], 76)
print("The t value for Life Expectancy of 76: ", tt_value_seventy_six, "\n")
print("The p value for Life Expectancy of 76: ", pp_value_seventy_six, "\n")
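
Comparing the printed numbers by eye works, but a tolerance check makes the comparison explicit. The following is a minimal sketch, assuming a hypothetical helper `results_match` and illustrative tolerances; the tuples are placeholder values, not results from the WHO data.

```python
import math


def results_match(manual, reference, rel_tol=1e-9, abs_tol=1e-12):
    # manual and reference are (t_value, p_value) pairs
    return all(
        math.isclose(m, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for m, r in zip(manual, reference)
    )


# placeholder numbers that differ only by floating-point noise
same = results_match((5.0, 0.0075), (5.0 + 1e-13, 0.0075))
# a genuinely different statistic fails the check
different = results_match((5.0, 0.0075), (4.9, 0.0075))
print(same, different)
```

In the notebook this could be called as `results_match((t_value_seventy, p_value_seventy), (tt_value_seventy, pp_value_seventy))`. The absolute tolerance is there because `1 - t.cdf(...)` loses precision when the p-value is very close to zero.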

### <font color='#FF8C00'> Sources Used For Section Two </font>
- https://www.geeksforgeeks.org/ways-to-filter-pandas-dataframe-by-column-values/
- https://sparkbyexamples.com/pandas/pandas-filter-rows-by-conditions/

## <font color = '#FF8C00'> Section 3 | Two Sample T-test </font>
A two-sample t-test compares a numerical variable across the two groups of a categorical variable. The goal is to assess whether the mean of the numerical variable differs between the groups. In this section, we compare the life expectancy in Europe against the life expectancy in Asia. We achieve this in the steps below:
- [x] Create a third data set that includes only rows in which regionname is “Asia.”
- [x] The test statistic is shown in the formula section below. The p-values are also computed using the degrees of freedom, which is calculated using the formula below. 
- [x] Write a function that takes two data set inputs and returns the test statistic and p-values as outputs
- [ ] We then use the function we built to assess whether the life expectancy in Europe is different from the life expectancy in Asia
- [ ] We then validate the function by comparing the results to the built-in two sample t-test using the `ttest_ind` function from `scipy.stats`.


### <font color = '#FF8C00'> Formulas </font>
$$
t = \frac{\mu_1 - \mu_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

$$
\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}
{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}
$$
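
To see the two formulas in action, take two made-up samples (illustrative numbers, not WHO data): sample 1 with $n_1 = 5$, $\mu_1 = 75$, $s_1^2 = 5$, and sample 2 with $n_2 = 5$, $\mu_2 = 70$, $s_2^2 = 2$:

$$
t = \frac{75 - 70}{\sqrt{\frac{5}{5} + \frac{2}{5}}} = \frac{5}{\sqrt{1.4}} \approx 4.226
$$

$$
\nu = \frac{\left(\frac{5}{5} + \frac{2}{5}\right)^2}{\frac{25}{25 \cdot 4} + \frac{4}{25 \cdot 4}} = \frac{1.96}{0.29} \approx 6.76
$$

The degrees of freedom need not be an integer. Because these formulas do not assume equal variances in the two groups, this form of the test is commonly called Welch's t-test.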

### <font color = '#FF8C00'> Filtering Out Asia </font>
We start off by using the `new_who_df` data frame to select the rows whose `regionname` is Asia. We keep all the features for each of these rows so that we can perform the two-sample t-test.

In [None]:
# filtering out the rows that have the regionname set as Asia
asia_df = new_who_df[new_who_df['regionname'] == "Asia"]

# displaying the first 5 rows of the data frame
asia_df.head(5)

### <font color = '#FF8C00'> Building the Function </font>
We build this function so that it takes two data sets as inputs and returns the test statistic and p-value as outputs. We implement the formulas mentioned above to help calculate the degrees of freedom, the test statistic and the p-value.

In [None]:
# the function for the two sample t-test
def two_sample_t_test(feature_col_one, feature_col_two):

    # converting the feature into a list
    feature_list_one = feature_col_one.tolist()
    feature_list_two = feature_col_two.tolist()

    # calculating the mean from the two features
    mean_one = sum(feature_list_one) / len(feature_list_one)
    mean_two = sum(feature_list_two) / len(feature_list_two)

    # number of observations in each feature
    observation_one = len(feature_list_one)
    observation_two = len(feature_list_two)

    # (observation - mean) ^ 2 for dataset one
    difference_list_one = []
    for row_one in range(0, len(feature_list_one)):
        difference_one = (feature_list_one[row_one] - mean_one)**2
        difference_list_one.append(difference_one)
    
    # (observation - mean) ^ 2 for dataset two
    difference_list_two = []
    for row_two in range(0, len(feature_list_two)):
        difference_two = (feature_list_two[row_two] - mean_two)**2
        difference_list_two.append(difference_two)
    
    # finding the sum of the different list one and two
    sum_one = sum(difference_list_one)
    sum_two = sum(difference_list_two)

    # calculating the variance one and two
    variance_one = sum_one / (observation_one - 1)
    variance_two = sum_two / (observation_two - 1)

    # calculating the standard deviation one and two
    standard_deviation_one = math.sqrt(variance_one)
    standard_deviation_two = math.sqrt(variance_two)

    # calculating the numerator of the degrees of freedom
    numerator = ((standard_deviation_one**2 / observation_one) + (standard_deviation_two**2 / observation_two))**2

    # calculating the two parts of the denominator
    denominator_part_one = (standard_deviation_one**4) / ((observation_one**2) * (observation_one - 1))
    denominator_part_two = (standard_deviation_two**4) / ((observation_two**2) * (observation_two - 1))
    denominator = denominator_part_one + denominator_part_two

    # calculating the degree of freedom
    degree_of_freedom = numerator / denominator

    # calculating the test statistic
    t_value = (mean_one - mean_two) / math.sqrt((standard_deviation_one**2 / observation_one) + 
                                                (standard_deviation_two**2 / observation_two))
    
    # calculating the p value
    p_value = 2 * (1 - t.cdf(abs(t_value), degree_of_freedom))

    # returning the t_value and p_value
    return t_value, p_value
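
As a cross-check, the same unequal-variance test can be sketched with `numpy` and compared against `scipy.stats.ttest_ind` on toy data. The name `welch_t_test_np` and both samples are invented for illustration, and dropping missing values is an assumption about how gaps in the data should be handled.

```python
import numpy as np
from scipy.stats import t, ttest_ind


def welch_t_test_np(sample_one, sample_two):
    # float arrays with missing values dropped (an assumption about the data)
    x = np.asarray(sample_one, dtype=float)
    y = np.asarray(sample_two, dtype=float)
    x, y = x[~np.isnan(x)], y[~np.isnan(y)]

    # squared standard error contributed by each sample: s^2 / n
    se_x = x.var(ddof=1) / x.size
    se_y = y.var(ddof=1) / y.size

    t_value = (x.mean() - y.mean()) / np.sqrt(se_x + se_y)
    # degrees of freedom from the formula above
    dof = (se_x + se_y) ** 2 / (se_x**2 / (x.size - 1) + se_y**2 / (y.size - 1))
    # two-sided p-value
    p_value = 2 * (1 - t.cdf(abs(t_value), dof))
    return t_value, p_value, dof


# made-up samples for illustration only
first = [72, 75, 78, 74, 76]
second = [68, 70, 72, 70, 70]
demo_t, demo_p, demo_dof = welch_t_test_np(first, second)

# equal_var=False asks scipy for the unequal-variance version of the test
reference = ttest_ind(first, second, equal_var=False)
print(demo_t, demo_p, demo_dof)
print(reference.statistic, reference.pvalue)
```

Returning the degrees of freedom alongside the statistic is a choice made here so that the intermediate value can be inspected; the lab's function returns only the t-value and p-value.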

### <font color = '#FF8C00'> Life Expectancy in Europe vs Asia </font>
We now compare the life expectancy in Europe with that in Asia. We do this by accessing the `life_exp` feature from both the Asia data frame `asia_df` and the Europe data frame `europe_df`. We then pass these features to the function to get the t-value and p-value.

In [None]:
# using the function to retrieve the t_value and p_value
# (Asia is passed first, so the sign of t reflects the Asia mean minus the Europe mean)
t_value, p_value = two_sample_t_test(asia_df["life_exp"], europe_df["life_exp"])
print("The test statistic value of Life Expectancy in Europe vs Asia: ", t_value, "\n")
print("The p value of Life Expectancy of Europe vs Asia: ", p_value)

### <font color = '#FF8C00'> Validating the Function </font>
We now work on validating the function. We achieve this by using the `ttest_ind` function from the `scipy.stats` library, with `equal_var=False` so that it uses the same unequal-variance formulas as our function. We then compare the values it returns against those from the function we created.


In [None]:
# importing the scipy.stats library
from scipy import stats

# running the ttest_ind function from the scipy.stats library
# (stored under new names so the results of our own function are not overwritten)
ref_t_value, ref_p_value = stats.ttest_ind(a=asia_df["life_exp"], b=europe_df["life_exp"], equal_var=False)

# printing out the test statistic and p value
print("The test statistic of Life Expectancy in Europe vs Asia: ", ref_t_value, "\n")
print("The p value of Life Expectancy in Europe vs Asia: ", ref_p_value, "\n")

### <font color='#FF8C00'> Sources Used For Section Three </font>
- https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/