# COGS 108 - Final Project

## Group Members: 

- A14743301 - Nadya Audrey Salim
- A14233594 - Tyler Paulo
- A13856476 - Joycelyn Peng
- A11634379 - Adam Zhang
- A12327392 - Debbie Vo

### Member Contribution:
- Nadya:
- Tyler:
- Joycelyn:
- Adam:
- Debbie: 

### Introduction and Background

Initially, we were interested only in how sex, lifestyle, and environment affect life expectancy. However, after consulting with a TA about our research question, we decided to be more specific and look at the effects that the percentage of GDP spent on healthcare and the prevalence of sanitation services have on life expectancy. We think this is an important question given the growing number of people experiencing homelessness and even more severe situations such as the water crisis in Flint, Michigan.

We hypothesize that countries that spend more of their GDP on health have higher life expectancy. However, according to a Reuters article, the United States spends about twice as much on health as other high-income countries yet has a relatively low life expectancy. We are interested in whether the United States is the only country with this unusual relationship between health spending and life expectancy.

We also want to see how strongly sanitation affects health. According to Duncan Mara et al., lack of sanitation can lead to the spread of disease, is often associated with poverty, and accounts for about 10% of the global burden of disease. Given these findings, we expect sanitation to have a large impact on a country’s life expectancy. Overall, we are interested in the effect of health spending and sanitation services on people’s health.
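One way the hypothesis could be checked is with a simple correlation between health spending and life expectancy. The snippet below is only a minimal sketch of that idea: the numbers are made-up toy values, the variable names are hypothetical, and Pearson correlation is just one possible choice of test, not necessarily the analysis this project ends up using.

```python
import numpy as np
from scipy import stats

# Toy, made-up values for five imaginary countries (NOT real WHO data).
health_pct_gdp = np.array([3.0, 5.0, 7.0, 9.0, 11.0])       # % of GDP spent on health
life_expectancy = np.array([60.0, 66.0, 70.0, 75.0, 79.0])  # years

# Pearson correlation: r close to +1 would be consistent with the hypothesis
# that higher spending goes along with higher life expectancy.
r, p_value = stats.pearsonr(health_pct_gdp, life_expectancy)

print("Pearson r:", round(r, 3))
print("p-value:", p_value)
```

A positive `r` alone would not show that spending causes longer lives; it only describes how the two columns move together.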


References:
- 1) GDP on health: https://www.reuters.com/article/us-health-spending/u-s-health-spending-twice-other-countries-with-worse-results-idUSKCN1GP2YN
- 2) Sanitation and Health: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2981586/

### Data Description

- __Dataset Name:__ Life expectancy and Healthy life expectancy data by country
- __Link to the dataset:__ http://apps.who.int/gho/data/view.main.SDG2016LEXREGv?lang=en
- __Number of observations:__ 3111
- __Description:__ Gives average life expectancy by region and by country, reported both at birth and at age 60 and broken down by sex. The dataset covers 2000 to 2016, but we will use only the 2015 data.


- __Dataset Name:__ Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%) data by country
- __Link to the dataset:__ http://apps.who.int/gho/data/view.main.GHEDCHEGDPSHA2011v?lang=en 
- __Number of observations:__ 195
- __Description:__ The share of a country's economy channeled to the health sector, measured as Current Health Expenditure expressed as a percentage of GDP. We will use only the 2015 data, which will be added as an additional column to the main dataset.


- __Dataset Name:__ Basic and safely managed sanitation services data by country
- __Link to the dataset:__ http://apps.who.int/gho/data/view.main.WSHSANITATIONv?lang=en
- __Number of observations:__ 194
- __Description:__ Sanitation in this dataset refers to the provision of facilities and services for the safe disposal of human urine and feces. Sanitation facilities include flush/pour flush toilets connected to piped sewer systems, septic tanks or pit latrines, and composting toilets. We will use only the 2015 total for the population using at least basic services, which will be added as an additional column to the main dataset.
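The plan of adding the health-expenditure and sanitation values as extra columns on the main life-expectancy dataset could be sketched with `DataFrame.merge`. This is a minimal illustration under assumptions: the column names (`Country`, `LifeExpectancy2015`, `HealthPctGDP2015`, `BasicSanitation2015`) and all values are hypothetical placeholders, not the actual WHO column names or data.

```python
import pandas as pd

# Toy stand-ins for the three 2015 tables; names and values are made up.
life_exp = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "LifeExpectancy2015": [70.0, 80.0, 65.0],
})
health_gdp = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "HealthPctGDP2015": [5.0, 10.0],
})
sanitation = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "BasicSanitation2015": [60.0, 99.0, 40.0],
})

# Left-join on country so every row of the main (life expectancy) table is kept;
# a country missing from a secondary table gets NaN in that column.
main = (
    life_exp
    .merge(health_gdp, on="Country", how="left")
    .merge(sanitation, on="Country", how="left")
)

print(main)
```

A left join keeps the main dataset's row count unchanged, which makes it easy to see afterwards which countries lack a value in one of the added columns.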


### Data Cleaning/Pre-processing

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as skl
import scipy as sp
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

In [11]:
improvedSani = pd.read_csv('ImprovedSanitation.csv')

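# Drop the first 3 rows, which hold information unrelated to the data being used
improvedSani = improvedSani.drop(improvedSani.index[[0, 1, 2]])

# Use the values of the first remaining row as the column labels
improvedSani.columns = improvedSani.iloc[0]
# Drop that row now that it has been promoted to the header
improvedSani = improvedSani.drop(improvedSani.index[[0]])
# Clear the leftover columns name; assigning None is used here because
# `del improvedSani.columns.name` may raise an error in recent pandas versions
improvedSani.columns.name = None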

# Reset the index count and drop unnecessary columns
improvedSani.reset_index(level=0, inplace=True)
improvedSani = improvedSani.drop(columns=['index', 'Indicator Code', 'Indicator Name'])

# The year labels were read as floats (2000.0, ..., 2015.0); rename them to ints
improvedSani = improvedSani.rename(columns={float(year): year for year in range(2000, 2016)})


improvedSani

Unnamed: 0,Country Name,Country Code,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Aruba,ABW,,,,,,,,,,,,,,,,
1,Afghanistan,AFG,,,,,,,,,,,,,,,,
2,Angola,AGO,,,,,,,,,,,,,,,,
3,Albania,ALB,54.666995,55.488220,56.346663,57.183598,57.998835,58.791754,59.562484,60.310811,61.036563,61.739280,62.419417,63.076686,63.711247,64.305582,64.802296,64.817573
4,Andorra,AND,6.462857,15.314725,24.166593,33.018462,41.870330,50.722198,59.574066,68.425934,77.277802,86.129670,94.981538,100.000000,100.000000,100.000000,100.000000,100.000000
5,Arab World,ARB,42.188745,42.522566,42.984574,43.559851,44.170434,44.878761,45.629301,46.410351,47.195856,47.954341,48.665990,49.316262,49.927409,50.286417,50.506438,50.700261
6,United Arab Emirates,ARE,92.607653,92.667871,92.727084,92.785290,92.842635,92.898830,92.954306,93.007627,93.059080,93.108807,93.156523,93.202515,93.246925,93.289611,93.330572,93.370096
7,Argentina,ARG,28.776303,28.613687,28.455409,28.301871,28.153645,28.011092,27.874005,27.742370,27.615994,27.488010,27.351241,27.219957,27.094306,27.001127,26.742501,26.484001
8,Armenia,ARM,,,,,,,,,,,,,,,,
9,American Samoa,ASM,,,,,,,,,,,,,,,,
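Since only the 2015 values are needed, the wide table above could be reduced to the identifying columns plus the 2015 column. The following is a rough sketch on a toy frame shaped like that output: the values are made up, and the new column name `Sanitation2015` is a hypothetical label chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame shaped like the cleaned table (values are made up).
wide = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "Country Code": ["AAA", "BBB", "CCC"],
    2014: [50.0, np.nan, 90.0],
    2015: [51.0, np.nan, 91.0],
})

# Keep the identifying columns plus the 2015 column only,
# and give the year column a descriptive (hypothetical) name.
sani_2015 = wide[["Country Name", "Country Code", 2015]].rename(
    columns={2015: "Sanitation2015"}
)

# Count rows with no 2015 value before deciding how to handle them.
n_missing = int(sani_2015["Sanitation2015"].isna().sum())

print(sani_2015)
print("rows missing a 2015 value:", n_missing)
```

Counting the missing rows first gives a sense of how much data would be lost if rows without a 2015 value are dropped later.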


### Data Visualization

### Data Analysis and Results

### Privacy/Ethics Considerations

The data we use is publicly available, but we do not have explicit permission from the WHO to use it. There are no privacy concerns regarding our datasets: the data is aggregated by country, broken down only by sex and age, and contains no individual personal information, so it is consistent with the Safe Harbor de-identification method. There may be biases in the data because all of it comes from the same website, and it is not entirely clear how the data was collected.

Our data consists of country-level statistics, and the only exclusions arise when a country did not provide (or prevented the release of) information. For instance, the Democratic People’s Republic of Korea has not disclosed its health expenditure as a share of GDP, or the WHO was not able to acquire it. We will handle such cases by removing the rows with null cells from the data. Overall, our topic area and analysis are not problematic in terms of data privacy, but an inequitable impact is possible. For example, if we approach the dataset assuming that less well-off countries will automatically have a worse life expectancy, that assumption may skew our collective perspective and results. A neutral approach is therefore necessary to minimize bias, even unconscious bias.
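Removing rows with null cells could look like the sketch below, which also records which countries were excluded so the exclusion stays visible in the write-up. The frame, its column names, and its values are hypothetical toy data, not the project's actual merged dataset.

```python
import numpy as np
import pandas as pd

# Toy merged frame; "Country B" is missing its health-expenditure value.
merged = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "LifeExpectancy2015": [70.0, 80.0, 65.0],
    "HealthPctGDP2015": [5.0, np.nan, 7.5],
})

value_cols = ["LifeExpectancy2015", "HealthPctGDP2015"]

# Flag rows where any analysis column is null, and remember which countries they are.
incomplete = merged[value_cols].isna().any(axis=1)
dropped_countries = merged.loc[incomplete, "Country"].tolist()

# Drop those rows and renumber the index.
cleaned = merged.dropna(subset=value_cols).reset_index(drop=True)

print("dropped:", dropped_countries)
print(cleaned)
```

Keeping the list of dropped countries makes it possible to report the exclusions alongside the results instead of silently losing them.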

### Conclusions and Discussion