# Analysis of WHO life expectancy data 2000-2015: which countries have the highest life expectancy consistently over these 15 years, and why?

In [None]:
#import the pandas, numpy libraries as pd, and np 
import pandas as pd
import numpy as np

# Load the pyplot collection of functions from matplotlib, as plt 
import matplotlib.pyplot as plt

import seaborn as sns
import plotly.express as px
import os

import pandas_profiling

In [None]:
# Put this data into a variable  
WHO_data = pd.read_csv(r'C:\Users\Sarah\Documents\Springboard\DATASETS\Capstone 2\Life Expectancy (WHO) from Kaggle\Life Expectancy Data.csv')

# Using the head() pandas method, observe the first three entries.
WHO_data.head(3)

In [None]:
WHO_data.info()

In [None]:
list(WHO_data.columns)

In [None]:
#I see that some of these column names have spaces at the end - this isn't helpful!
WHO_data.columns = WHO_data.columns.str.strip()

In [None]:
#fixed? yes
list(WHO_data.columns)

In [None]:
#view report
report = WHO_data.profile_report(sort='None', html={'style':{'full_width': True}}, progress_bar=False)
report

In [None]:
#Rename messy columns we think will be useful
#df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
WHO_data.rename(columns = {'Life expectancy':'Life_Expectancy', 'percentage expenditure': 'Percentage_Expenditure', "Income composition of resources": "Income_Composition_Resources"}, inplace = True)
WHO_data.head(5)

In [None]:
#look more closely at what columns have NaNs and zeros
WHO_data.isna().any()

In [None]:
WHO_data.isna().sum().plot(kind="bar")
plt.show()

In [None]:
#what about a large "developed" country - what is missing from data?
WHO_data[WHO_data.Country == "United States of America"].T

Taking the US as an example: data is missing for some years for a few variables, while other variables that seem potentially significant (GDP, Population, etc.) have no entries for any year. Something needs to be done about this - we can't throw away the US entirely.

In [None]:
#Try to fill missing values
#Interpolate backward down each column:
WHO_data.interpolate(method='linear', limit_direction='backward', inplace=True)
#Interpolate forward down each column:
WHO_data.interpolate(method='linear', limit_direction='forward', inplace=True)
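One thing to keep in mind: interpolating down a whole column treats the table as one long series, so a gap at the edge of one country's rows can be filled using the neighbouring country's values. A minimal sketch of interpolating within each country instead, using made-up toy data; the helper name `interpolate_within_country` is hypothetical, and a country with no values at all for a variable (as described above for the US) stays empty rather than borrowing from its neighbours:

```python
import numpy as np
import pandas as pd


def interpolate_within_country(df, country_col="Country", year_col="Year"):
    """Linearly interpolate numeric columns separately for each country."""
    out = df.sort_values([country_col, year_col]).copy()
    # Every numeric column except the year itself.
    numeric_cols = out.select_dtypes(include="number").columns.drop(year_col, errors="ignore")
    out[numeric_cols] = out.groupby(country_col)[numeric_cols].transform(
        lambda s: s.interpolate(method="linear", limit_direction="both")
    )
    return out


# Toy data (made up for illustration, not WHO values).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "GDP": [100.0, np.nan, 300.0, np.nan, np.nan, np.nan],
})
filled = interpolate_within_country(toy)
print(filled)
```

Country A's gap is filled from its own neighbours, while country B, which has no GDP values at all, is left as NaN and would need an external data source.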

In [None]:
#check with US as example
WHO_data[WHO_data.Country == "United States of America"].T

In [None]:
#now deal with all the zeros - replace with NaNs then replace as done above
WHO_data.replace(0,np.nan, inplace = True)
WHO_data.interpolate(inplace=True)
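Replacing every zero in the table also affects columns where zero can be a legitimate value (a count column, for instance). A minimal sketch, assuming we only want to treat zeros as missing in a chosen list of columns; the helper `zeros_to_nan` and the toy column `infant_deaths` are hypothetical names used for illustration:

```python
import numpy as np
import pandas as pd


def zeros_to_nan(df, columns):
    """Return a copy of df with exact zeros turned into NaN in the listed columns only."""
    out = df.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out


# Toy data (made up): zero years of schooling is implausible, zero deaths is not.
toy = pd.DataFrame({
    "Schooling": [0.0, 12.0, 13.0],
    "infant_deaths": [0, 3, 0],
})
cleaned = zeros_to_nan(toy, ["Schooling"])
print(cleaned)
```

Which columns belong in the list is a judgment call that depends on what each variable measures.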

In [None]:
#check with US as example
WHO_data[WHO_data.Country == "United States of America"].T

In [None]:
#got all the zeros?
WHO_data.all()

In [None]:
#column Status has few issues with NaNs and needs to be explored - what are the categories?
WHO_data["Status"].unique()

In [None]:
#this seems to be the first big delineation that could affect life expectancy - do developing countries have higher life expectancy?

developing_countries = WHO_data[(WHO_data["Status"].str.contains("Developing"))]
developed_countries = WHO_data[(WHO_data["Status"].str.contains("Developed"))]

In [None]:
#visualize
#set the style before plotting so that it applies to this figure
sns.set(font_scale=1.5)
sns.set_style("white")
sns.relplot(x="Year", y="Life_Expectancy", kind="line", hue="Status", data=WHO_data)
plt.title("Life Expectancy: Developed versus Developing Countries", fontsize=18)
plt.xlabel("Year")
plt.ylabel("Life Expectancy (in years)")
plt.show()

CLEARLY developed countries have much higher life expectancy than developing countries. It also looks like life expectancy went up for each category over the 15 years; why that happened deserves deeper consideration.
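To put numbers on the gap seen in the line plot, the yearly means can be tabulated per status. A minimal sketch on made-up toy data, assuming the `Status` column still holds the strings "Developed" and "Developing"; the helper name `mean_by_status` is hypothetical:

```python
import pandas as pd


def mean_by_status(df, value_col="Life_Expectancy"):
    """Mean of value_col with one row per Year and one column per Status."""
    return df.pivot_table(index="Year", columns="Status", values=value_col, aggfunc="mean")


# Toy data (made up for illustration, not WHO values).
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2015, 2015, 2015],
    "Status": ["Developed", "Developing", "Developing"] * 2,
    "Life_Expectancy": [76.0, 60.0, 64.0, 80.0, 66.0, 70.0],
})
table = mean_by_status(toy)
table["Gap"] = table["Developed"] - table["Developing"]
print(table)
```

Comparing the `Gap` column across years shows whether the two groups are converging or drifting apart.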

In [None]:
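#change Status from 'Developing' to 0 and 'Developed' to 1 so that we don't have to deal with strings
#assign the result back: replace(..., inplace=True) on a selected column may not update WHO_data itself
WHO_data['Status'] = WHO_data['Status'].map({'Developing': 0, 'Developed': 1})
WHO_data.head()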

In [None]:
#try out some other visualizations! a heat map to tease out correlations
plt.figure(figsize=(20,15))
#correlate the numeric columns only; Country is a string column
sns.heatmap(WHO_data.select_dtypes(include='number').corr(),annot=True,cmap='Blues')
plt.show()

It's already looking like there are some strong correlations between some factors and life expectancy. Our previous plot clearly showed that "developed" status goes with high life expectancy, and the heat map agrees. I'm also seeing: income composition of resources (.83 - the highest), schooling (.73), BMI (.56), GDP (.44), and Alcohol (.4) as being potentially related. Immunization coverage: Polio (.46), Diphtheria (.47).

I'm fascinated to see what doesn't seem to correlate at all with life expectancy - HIV/AIDS, thinness, and adult mortality.
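One caveat when reading the heat map: a sequential palette such as `'Blues'` gives the palest cells to the most negative coefficients, so a variable that looks unrelated may actually be inversely related. A minimal sketch for listing the signed coefficients, ordered by absolute strength; the helper name `correlations_with` is hypothetical and the toy values are made up:

```python
import pandas as pd


def correlations_with(df, target="Life_Expectancy"):
    """Pearson correlation of each numeric column with target, strongest first by absolute value."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data (made up): one positive, one negative and one unrelated variable.
toy = pd.DataFrame({
    "Life_Expectancy": [50.0, 60.0, 70.0, 80.0],
    "Schooling": [8.0, 10.0, 12.0, 14.0],
    "Adult Mortality": [400.0, 300.0, 200.0, 100.0],
    "Noise": [1.0, -1.0, -1.0, 1.0],
})
ranked = correlations_with(toy)
print(ranked)
```

Running the same function on `WHO_data` would show whether HIV/AIDS, thinness, and adult mortality are truly near zero or simply negative.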

In [None]:
#schooling
px.scatter(WHO_data,x='Life_Expectancy',y='Schooling',color='Country',size='Year',template='plotly',title='Life Expectancy in Comparison to Schooling')

I'm seeing very high life expectancy in countries where citizens spend a high number of years in school (15.5-17.5 years). New Zealand seems the outlier, with high life expectancy and over 20 years of schooling. The countries to take a closer look at, where many years of schooling go together with high life expectancy, are: Norway, Finland, Portugal, Spain, Belgium, Italy, France, Sweden.

But I'm also looking at the Democratic Republic of the Congo - a high number of years of schooling, but much lower life expectancy (around 50 years) - so other factors must be at play to lower its life expectancy.

In [None]:
# income composition of resources?
px.scatter(WHO_data,x='Life_Expectancy',y='Income_Composition_Resources',color='Country',size='Year',template='plotly',title='Life Expectancy in Comparison to Income Composition of Resources')

A fascinating plot! Income composition of resources is the Human Development Index expressed in terms of income composition of resources (an index ranging from 0 to 1); a high value means a country utilizes its resources productively. I'm seeing basically the same countries with a high income composition of resources and high life expectancy: Norway, Germany, New Zealand, Sweden, Belgium, France, Spain, Italy, and Portugal. This list overlaps greatly with the list for high life expectancy <-> number of years of schooling.
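The overlap between these lists can also be checked programmatically rather than by eye. A minimal sketch, assuming "high" means being among the top `n` countries by mean value over all years (a simplifying assumption; the helper name `top_countries` is hypothetical and the toy values are made up):

```python
import pandas as pd


def top_countries(df, column, n=10):
    """Set of the n countries with the highest mean of column over all years."""
    means = df.groupby("Country")[column].mean()
    return set(means.nlargest(n).index)


# Toy data (made up for illustration, not WHO values).
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Life_Expectancy": [81.0, 82.0, 80.0, 81.0, 60.0, 62.0, 79.0, 80.0],
    "Schooling": [17.0, 17.5, 16.0, 16.5, 15.0, 19.0, 9.0, 10.0],
    "Income_Composition_Resources": [0.90, 0.92, 0.88, 0.90, 0.50, 0.52, 0.85, 0.86],
})
columns = ["Life_Expectancy", "Schooling", "Income_Composition_Resources"]
overlap = set.intersection(*(top_countries(toy, c, n=2) for c in columns))
print(sorted(overlap))
```

Ranking by the mean is only one way to define "consistently high"; requiring a country to be in the top group in every single year would be a stricter alternative.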

In [None]:
px.scatter(WHO_data,x='Life_Expectancy',y='BMI',color='Country',size='Year',template='plotly',title='Life Expectancy in Comparison to BMI')

BMI here is the average body mass index of a population: the higher the number, the more body mass. This one can be misleading, because a higher number is not better - high BMI is usually an indicator of poor health, and therefore of lower life expectancy. Yet we are seeing that countries with high BMI (populations that are overweight on average) have high life expectancy, so low BMI doesn't appear to make people live longer. Countries with high BMI that are living the longest: New Zealand, Belgium, Germany, Finland, Norway, Sweden, Spain, France, Italy.

Looking back at the heat map, this isn't actually surprising, considering that thinness seems to have no relation with life expectancy.

There is one weird outlier: Portugal, with a high life expectancy but a very low BMI of 10, which is quite underweight and not healthy. This may need investigation if Portugal remains of interest.
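A quick way to see whether Portugal is alone in this would be to flag every country whose average BMI is implausibly low. A minimal sketch; the cut-off of 15 is an illustrative assumption rather than a clinical standard, the helper name `flag_low_bmi` is hypothetical, and the toy values are made up:

```python
import pandas as pd


def flag_low_bmi(df, threshold=15.0):
    """Countries whose mean BMI over all years falls below threshold (assumed cut-off)."""
    means = df.groupby("Country")["BMI"].mean()
    return means[means < threshold].sort_values()


# Toy data (made up for illustration, not WHO values).
toy = pd.DataFrame({
    "Country": ["P", "P", "Q", "Q"],
    "BMI": [10.0, 11.0, 25.0, 27.0],
})
flagged = flag_low_bmi(toy)
print(flagged)
```

If many countries are flagged, that would point to a data-quality issue in the column rather than something specific to one country.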

In [None]:
#check relationship with GDP
px.scatter(WHO_data,x='Life_Expectancy',y='GDP',color='Country',size='Year',template='plotly',title='Life Expectancy in Relationship to GDP')

Not really seeing the relationship I thought I'd see - the same countries ahead of the pack with high life expectancy actually have low GDP in comparison with many countries. So high GDP doesn't seem to be a determining factor by itself. Countries with relatively low GDP but high life expectancy: Germany, Spain, Portugal, France.

Interesting outlier: Luxembourg - high GDP, very high life expectancy. But it hasn't been showing up with regard to the other correlating factors.

In [None]:
#checking on alcohol just in case
px.scatter(WHO_data,x='Life_Expectancy',y='Alcohol',color='Country',size='Year',template='plotly',title='Life Expectancy in Relationship to Alcohol')

Alcohol is recorded as per capita (15+) consumption, in litres of pure alcohol. The same countries with high life expectancy are consuming a fair amount of alcohol, comparatively, so alcohol appears to matter less to life expectancy than schooling or income composition of resources. In fact, people in the countries that drink the most (Estonia, Belarus) are still living into their 70s at least, which suggests alcohol is not killing people at a young age to a degree that matters for this question. There are also many countries where alcohol consumption is minimal and yet life expectancy is still very low. I don't think this will be a feature in the analysis moving forward.

For further analysis based on this graph: Germany, Spain, France, New Zealand, Italy, Finland, Belgium, and Portugal (although its very low BMI needs double-checking).

Now to check polio and diphtheria immunization rates:

In [None]:
px.scatter(WHO_data,x='Life_Expectancy',y='Polio',color='Country',size='Year',template='plotly',title='Life Expectancy in Relationship to Polio Immunization')

Polio (Pol3) immunization coverage among 1-year-olds (%)

In [None]:
px.scatter(WHO_data,x='Life_Expectancy',y='Diphtheria',color='Country',size='Year',template='plotly',title='Life Expectancy in Relationship to Diphtheria Immunization')

Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

Looking at the distribution here, high immunization rates are evident among countries with high life expectancy, but rates are also very high in some countries where life expectancy is very low. There are also countries with very low rates of immunization where people are still living into their 70s and 80s. I don't think that this is a feature to focus on moving forward.

In [None]:
#I wanted to look at percentage expenditure
px.scatter(WHO_data,x='Life_Expectancy',y='Percentage_Expenditure',color='Country',size='Year',template='plotly',title='Life Expectancy in Relationship to Percentage Expenditure')

Not sure if some of the data is off here - the Democratic Republic of the Congo behaves oddly, with very low life expectancy but high expenditure on health as a percentage of Gross Domestic Product per capita (%). I also still see countries with low health expenditure and pretty high life expectancy. So this is a bit counterintuitive.

In [None]:
#checking the histograms
WHO_data.hist(figsize=(20, 15))
plt.subplots_adjust(hspace=0.5);

Even just looking at the histograms, Schooling and Income Composition of Resources are the most similar in shape to life expectancy.

The features to model will be SCHOOLING, and INCOME COMPOSITION OF RESOURCES.
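As a starting point, a plain least-squares fit on these two features could look like the following. This is only a minimal sketch on made-up toy data, not the model the analysis will end up using; the helper name `fit_linear` is hypothetical, and rows with missing values are simply dropped:

```python
import numpy as np
import pandas as pd


def fit_linear(df, features, target="Life_Expectancy"):
    """Ordinary least squares via numpy; returns the intercept and one coefficient per feature."""
    clean = df.dropna(subset=features + [target])
    # Design matrix: a column of ones for the intercept, then the features.
    X = np.column_stack([np.ones(len(clean))] + [clean[f].to_numpy(dtype=float) for f in features])
    y = clean[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=["intercept"] + features)


# Toy data built so that Life_Expectancy = 40 + 2*Schooling + 10*Income (made up).
toy = pd.DataFrame({
    "Schooling": [8.0, 10.0, 12.0, 14.0, 16.0, np.nan],
    "Income_Composition_Resources": [0.4, 0.7, 0.5, 0.9, 0.8, 0.6],
    "Life_Expectancy": [60.0, 67.0, 69.0, 77.0, 80.0, 70.0],
})
features = ["Schooling", "Income_Composition_Resources"]
params = fit_linear(toy, features)
print(params)
```

Because the two features are themselves strongly related in the plots above, the individual coefficients from a fit like this should be interpreted with care.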

STATUS also correlates, but it has only 2 categories. Developed countries clearly have the highest life expectancy, but as the histogram shows, developing countries far outnumber developed ones. These categories and how they are defined are also subjective in comparison to the other 2 most strongly correlated features.

What needs to be remembered, though, is that before missing values and zeros were filled in:

1. Income composition of resources had 167 (5.7%) missing values
2. Schooling had 163 (5.5%) missing values
3. Income composition of resources had 130 (4.4%) zeros

That's not a deal breaker, but if real, correct data could be added, the model would be even more sound.
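Counts like the ones listed above can be reproduced from the raw table, as long as this is run before any interpolation or zero replacement. A minimal sketch on made-up toy data; the helper name `missing_and_zero_summary` is hypothetical:

```python
import numpy as np
import pandas as pd


def missing_and_zero_summary(df, columns):
    """Count and percentage of NaNs and exact zeros for the given columns."""
    n = len(df)
    sub = df[columns]
    summary = pd.DataFrame({
        "missing": sub.isna().sum(),
        "zeros": (sub == 0).sum(),
    })
    summary["missing_pct"] = (100 * summary["missing"] / n).round(1)
    summary["zeros_pct"] = (100 * summary["zeros"] / n).round(1)
    return summary


# Toy data (made up): 10 rows with a few gaps and one zero.
toy = pd.DataFrame({
    "Schooling": [12.0, np.nan, 13.0, 14.0, 15.0, 11.0, 10.0, 9.0, 16.0, 12.5],
    "Income_Composition_Resources": [0.0, np.nan, np.nan, 0.7, 0.8, 0.6, 0.5, 0.9, 0.4, 0.3],
})
summary = missing_and_zero_summary(toy, list(toy.columns))
print(summary)
```

Keeping this table alongside the model results makes it easy to see how much of each feature was imputed.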