# <font color = 'blue'> Life Expectancy and GDP

## 1. Project Goal

In this project, I will analyze data on GDP and life expectancy from the World Health Organization and the World Bank to identify the relationship between GDP and life expectancy in six countries.

In [1]:
#import modules 
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats

## 2. The Dataset
I will explore the dataset to look for clues that can lead to interesting analysis.

In [2]:
#import the dataset into a dataframe
life_gdp = pd.read_csv('all_data.csv')

#display the dataset
life_gdp

Unnamed: 0,Country,Year,Life expectancy at birth (years),GDP
0,Chile,2000,77.3,7.786093e+10
1,Chile,2001,77.3,7.097992e+10
2,Chile,2002,77.8,6.973681e+10
3,Chile,2003,77.9,7.564346e+10
4,Chile,2004,78.0,9.921039e+10
...,...,...,...,...
91,Zimbabwe,2011,54.9,1.209845e+10
92,Zimbabwe,2012,56.6,1.424249e+10
93,Zimbabwe,2013,58.0,1.545177e+10
94,Zimbabwe,2014,59.2,1.589105e+10


Let's look into each column 

### a. Country

In [3]:
#find unique values of the column: Country
countries = life_gdp['Country'].unique()
print('There are six countries: {}'.format(countries))

There are six countries: ['Chile' 'China' 'Germany' 'Mexico' 'United States of America' 'Zimbabwe']


### b. Year

In [4]:
#find unique values of the column: Year
year = life_gdp['Year'].unique()
print('The dataset ranges from {} to {}'.format(year[0],year[-1]))

The dataset ranges from 2000 to 2015
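
Reading the first and last entries of `unique()` gives the right range only when the rows are already sorted by year. A minimal sketch of an order-independent alternative, using a small hypothetical dataframe (`sample` is an assumption for illustration, not the real `all_data.csv`):

```python
import pandas as pd

# Hypothetical mini-dataset; the real notebook loads all_data.csv instead.
sample = pd.DataFrame({
    "Country": ["Chile", "Chile", "China", "China"],
    "Year": [2015, 2000, 2001, 2014],
})

# min() and max() do not depend on row order, unlike unique()[0] and unique()[-1].
first_year = sample["Year"].min()
last_year = sample["Year"].max()
year_range = f"The dataset ranges from {first_year} to {last_year}"
print(year_range)
```

Here the years are deliberately shuffled, so `unique()[0]` would have returned 2015 rather than the earliest year.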


### c. Life expectancy at birth (years)

In [5]:
#display the life expectancy column
print(life_gdp['Life expectancy at birth (years)'])
print('It is pretty hard to type the column name correctly. I will rename it.')

In [None]:
#rename the column for life expectancy
life_gdp.rename(columns = {"Life expectancy at birth (years)" : 'Life_expectancy'},inplace = True)

In [None]:
#display the column to make sure the change was made correctly
life_gdp['Life_expectancy']

### d. GDP

In [None]:
#display GDP
life_gdp['GDP']

I would like to convert GDP from floats shown in scientific notation to integers.

In [None]:
#cast the GDP column to 64-bit integers (plain int can be 32-bit on some platforms, too small for these values)
life_gdp['GDP'] = life_gdp['GDP'].astype('int64')
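
A minimal sketch of the conversion, assuming GDP values of the same magnitude as the dataset (the `gdp` series below is hypothetical). It also shows an alternative that keeps the floats and changes only how they are printed:

```python
import pandas as pd

# Hypothetical GDP values in the same magnitude as the dataset (not the real figures).
gdp = pd.Series([7.786093e10, 1.209845e10, 1.81e13])

# int64 is requested explicitly: plain int can map to a 32-bit type on some
# platform and NumPy version combinations, which cannot hold values this large.
gdp_int = gdp.astype("int64")

# Alternative: leave the data as floats and only format it for display.
gdp_text = gdp.map("{:,.0f}".format)

print(gdp_int.tolist())
print(gdp_text.tolist())
```

Formatting for display is often enough when the goal is readability, since the underlying numbers stay untouched for later arithmetic.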

## 3. Exploratory Analysis

### a. Line charts for Chile: is there a relationship between life expectancy and GDP?

Before plotting or analyzing all six countries' data, I would like to work with one country's data first to get a sense of what I will be dealing with. I chose Chile and will plot Year vs GDP and Year vs Life expectancy for it.

In [None]:
#select the rows where the country is Chile (copy so that later edits do not touch life_gdp)
chile = life_gdp[life_gdp['Country'] == 'Chile'].copy()

In [None]:
#print chile
chile

In [None]:
# get GDP in billions
chile['GDP'] = chile['GDP']/1e9

In [None]:
chile
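
One thing to watch when rescaling a column on a filtered subset: pandas may warn that the assignment is being made on a slice of another dataframe. A minimal sketch of the usual workaround, taking an explicit copy first (`data` and `subset` are hypothetical stand-ins for `life_gdp` and `chile`):

```python
import pandas as pd

# Hypothetical stand-in for life_gdp.
data = pd.DataFrame({
    "Country": ["Chile", "Chile", "China"],
    "GDP": [7.0e10, 8.0e10, 1.2e12],
})

# .copy() makes the subset an independent dataframe.
subset = data[data["Country"] == "Chile"].copy()

# Rescale to billions on the copy only.
subset["GDP"] = subset["GDP"] / 1e9

print(subset["GDP"].tolist())
print(data["GDP"].tolist())
```

The original dataframe keeps its raw GDP values, which matters later in the notebook when the full dataset is grouped by country.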

In [None]:
#plot Year against Life Expectancy for Chile
#set the context before creating the figure so that it applies to this plot
sns.set_context('talk')
f,ax = plt.subplots(figsize = (10,7))
sns.lineplot(x = 'Year',y = 'Life_expectancy',data=chile,color='blue',marker='o',ax=ax)
ax.set_ylabel('Life Expectancy',fontweight='semibold',color='black')
ax.set_title('Year vs Life Expectancy - Chile',fontsize=30,fontweight='semibold',color='black')
ax.set_xlabel('Year',fontweight='semibold',color='black')



In [None]:
#plot Year against GDP (in billions) for Chile
#set the context before creating the figure so that it applies to this plot
sns.set_context('talk')
f,ax = plt.subplots(figsize = (10,7))
sns.lineplot(x= 'Year',y='GDP',data=chile,color='red',marker='o',ax=ax)
ax.set_ylabel('GDP (billions)',fontweight='semibold',color='black')
ax.set_title('Year vs GDP - Chile',fontsize=30,fontweight='semibold',color='black')
ax.set_xlabel('Year',fontweight='semibold',color='black')


### I would like to create a combo chart that displays the two line graphs on the same x-axis. Then we will see if there are any patterns.

In [None]:
#create combo chart
#set style and context before creating the figure so that they apply to it
sns.set_style('ticks')
sns.set_context('talk')
fig,ax1 = plt.subplots(figsize = (10,7))
color = 'tab:blue'
#create the first line graph (life expectancy, left y-axis)
ax1.set_title('Chile: Life Expectancy and GDP by Year',fontsize =25)
sns.lineplot(x='Year',y='Life_expectancy',data = chile,color = 'blue',marker = 'o',ax=ax1)
ax1.set_xlabel('Year',fontsize=20)
ax1.set_ylabel('Life Expectancy',fontsize=20)
ax1.tick_params(axis='y',color=color)
#label the two lines by hand, in the data coordinates of the left axis
ax1.text(2015.3,80.45,'Life Expectancy',color='blue')
ax1.text(2015.3,79.88,'GDP',color = 'red')
#specify we want to share the x-axis
ax2 = ax1.twinx()
color = 'tab:red'
#create the second line graph (GDP in billions, right y-axis)
sns.lineplot(x='Year',y='GDP',data=chile,color='red',marker = 'o',ax=ax2)
ax2.set_ylabel('GDP',fontsize=20)
ax2.tick_params(axis='y', color=color)

sns.despine(right=True,left = True)



Conclusion: <br>
The combo graph indicates that GDP and life expectancy move upward at a similar rate. As GDP increases, life expectancy increases.
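
The visual impression that the two series move together can be backed by a single number, the Pearson correlation coefficient. A minimal sketch using `Series.corr`, with hypothetical values that only mimic the shape of the Chile data:

```python
import pandas as pd

# Hypothetical yearly values (not the real Chile figures).
trend = pd.DataFrame({
    "Life_expectancy": [77.3, 77.8, 78.0, 78.9, 79.3, 80.5],
    "GDP": [78.0, 70.0, 99.0, 155.0, 172.0, 243.0],  # in billions
})

# Pearson correlation: +1 means a perfect increasing linear relationship.
r = trend["GDP"].corr(trend["Life_expectancy"])
print(round(r, 3))
```

A value close to 1 supports the reading of the chart, although it says nothing about which variable drives the other.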

### b. Scatter plot with a regression line for Chile

As can be seen in the graph above, there is a positive correlation between GDP and life expectancy. To put that linearity on a statistical footing, I will fit a linear regression between them and compute $r^2$ and the p-value to see whether the linear relationship is statistically significant.

In [None]:
#linear regression equation between GDP and Life expectancy for Chile
slope, intercept, r_value,p_value,std_err = stats.linregress(chile.GDP,chile.Life_expectancy)

#create text variables for linear equation, r squared and p-value
line_equation = f'y= {slope:.4f}x + {intercept:.0f}'
r_p_value = f"""P value = {p_value:.2e}
R squared = {r_value**2:.4f}"""

#textbox properties
box1 = dict(boxstyle='round', facecolor='white', alpha=1)
box2 = dict(boxstyle='round', facecolor='whitesmoke', alpha=.5)


graph = sns.lmplot(data=chile,x='GDP',y='Life_expectancy',height = 8,aspect = 1.2,scatter_kws = {'color':'black'},line_kws = {'color':'pink'})
plt.ylabel('Life Expectancy',fontweight='semibold',color='black')
plt.title('GDP vs Life expectancy - Chile',fontsize=30,fontweight='semibold',color='black')
plt.xlabel('GDP',fontweight='semibold',color='black')
plt.text(180,79.5,line_equation,fontsize=20,color='darkblue',bbox=box1)
plt.text(73,80.5,r_p_value,fontsize=20,color='black',bbox=box2)

Conclusion: <br>
As expected, there's a pretty strong linear relationship between GDP and life expectancy for Chile. $r^2$ is 90.2%, indicating that 90.2% of the variation in life expectancy may be explained by GDP. If GDP increases by 1.0e+11, life expectancy increases by around 1.3 years on average. Does this mean we all have to work harder to live longer?
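
To make the slope interpretation concrete: because GDP was rescaled to billions, the fitted slope is in years per billion, and an increase of 1.0e+11 corresponds to 100 of those units. A minimal sketch on hypothetical, exactly linear data whose slope is chosen to match the figure quoted above:

```python
import numpy as np
from scipy import stats

# Hypothetical, exactly linear data: GDP in billions and life expectancy in years.
gdp_billions = np.array([70.0, 100.0, 150.0, 200.0, 250.0])
life = 76.0 + 0.013 * gdp_billions

result = stats.linregress(gdp_billions, life)

# The slope is in years per billion, so 1.0e+11 (100 billion) scales it by 100.
gain_per_1e11 = result.slope * 100
r_squared = result.rvalue ** 2
print(round(gain_per_1e11, 2), round(r_squared, 3))
```

With real data the points scatter around the line, so $r^2$ falls below 1 and the slope becomes an average effect rather than an exact one.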

<font size = 3.5> Now, I would like to extend my analysis to all six countries and their GDP and life expectancy relationships

### c. Scatter plot with a regression line of GDP vs life expectancy for the six countries

In [None]:
#find average life expectancy and GDP by country
life_GDP_by_country = life_gdp.groupby('Country',as_index = False).agg({'GDP':"mean",'Life_expectancy': "mean"})
life_GDP_by_country['GDP'] = life_GDP_by_country['GDP']/1e9

In [None]:
life_GDP_by_country

In [None]:
#linear regression equation between GDP and life expectancy for all six countries
slope, intercept, r_value,p_value,std_err = stats.linregress(life_GDP_by_country.GDP,life_GDP_by_country.Life_expectancy)

#create text variables for linear equation, r squared and p-value
line_equation = f'y= {slope:.2e}x + {intercept:.0f}'
r_p_value = f"""P value = {p_value:.4f}
R squared = {r_value**2:.4f}"""

#create x and y values for the regression line
x = [*range(0,20000,1000)]
y = [x1*slope + intercept for x1 in x]


#textbox properties
box1 = dict(boxstyle='round', facecolor='white', alpha=1)
box2 = dict(boxstyle='round', facecolor='whitesmoke', alpha=.5)

fig,ax = plt.subplots(figsize = (15,10))
pl = sns.scatterplot(data = life_GDP_by_country, x='GDP', y='Life_expectancy',hue='Country',s=1000)
#x and y must be passed as keywords in recent seaborn versions
sns.lineplot(x=x,y=y,color='black')
#label each point with its country name
for line in range(0,life_GDP_by_country.shape[0]):
    pl.text(life_GDP_by_country.GDP[line]+0.2,life_GDP_by_country.Life_expectancy[line],life_GDP_by_country.Country[line],horizontalalignment='left',size='medium',color='black',weight='semibold')
plt.text(2250,70,line_equation,fontsize=20,color='darkblue',bbox=box1)
plt.text(13800,60,r_p_value,fontsize=20,color='black',bbox=box2)
    

Conclusion: <br>
The scatter plot above does not show a strong linear relationship ($r^2$ is only 12.4%). My hypothesis is that two factors contributed to this result. <br><br> 1. GDP (gross domestic product) might not be a good variable to use, since it is a total for the whole country (the sum of everyone's income or production), whereas life expectancy is an average over individuals. This is not an apples-to-apples comparison. <br><br> 2. Zimbabwe looks like an outlier. <br><br> I will address these two factors in turn to see whether the linearity improves.

### d. Scatter plot of GDP per capita against average life expectancy of the six countries
<font size = 3.5> I would like to use GDP per capita (GDP/population) instead of GDP

In [None]:
#convert GDP from billions back to its original units
life_GDP_by_country['GDP'] = life_GDP_by_country['GDP']*1e9

In [None]:
life_GDP_by_country

In [None]:
#create a column: population (hard-coded, listed in the same alphabetical order as the Country column)
life_GDP_by_country['population'] = [18730000,1393000000,83020000,126000000,328200000,14000000]

In [None]:
#create a column: GDP per capita
life_GDP_by_country['GDP_per_capita'] = life_GDP_by_country['GDP']/life_GDP_by_country['population'] 
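
Assigning populations as a plain list works only while the list order matches the row order. A minimal sketch of an order-independent variant that looks each population up by country name (`by_country` and its GDP values are hypothetical; the two population figures are the ones used in this notebook):

```python
import pandas as pd

# Hypothetical average GDP values, deliberately not in alphabetical order.
by_country = pd.DataFrame({
    "Country": ["Zimbabwe", "Chile"],
    "GDP": [9.0e9, 1.7e11],
})
population = {"Chile": 18_730_000, "Zimbabwe": 14_000_000}

# map() looks each population up by country name, so row order does not matter.
by_country["population"] = by_country["Country"].map(population)
by_country["GDP_per_capita"] = by_country["GDP"] / by_country["population"]
print(by_country)
```

A country missing from the dictionary would show up as `NaN`, which is easier to spot than a silently misaligned list.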

In [None]:
#plot GDP per capita against life_expectancy
slope, intercept, r_value,p_value,std_err = stats.linregress(life_GDP_by_country['GDP_per_capita'],life_GDP_by_country['Life_expectancy'])

#create text variables for linear equation, r squared and p-value
line_equation = f'y= {slope:.2e}x + {intercept:.0f}'
r_p_value = f"""P value = {p_value:.4f}
R squared = {r_value**2:.4f}"""

#create x and y values for the regression line
x = [*range(0,50000,1000)]
y = [x1*slope + intercept for x1 in x]


fig,ax = plt.subplots(figsize = (15,10))
pl = sns.scatterplot(data = life_GDP_by_country, x='GDP_per_capita', y='Life_expectancy',hue='Country', size = 'population',sizes = (200,700),palette = 'pastel')
#x and y must be passed as keywords in recent seaborn versions
sns.lineplot(x=x,y=y,color='black')
#label each point with its country name
for line in range(0,life_GDP_by_country.shape[0]):
    pl.text(life_GDP_by_country.GDP_per_capita[line]+0.2,life_GDP_by_country.Life_expectancy[line],life_GDP_by_country.Country[line],horizontalalignment='left',size='medium',color='black',weight='semibold')
plt.text(24300,73.5,line_equation,fontsize=20,color='darkblue',bbox=box1)
plt.text(22000,50,r_p_value,fontsize=20,color='black',bbox=box2)


Conclusion: <br> Visually, there's not much improvement over the previous graph. However, $r^2$ increased considerably, and the p-value dropped accordingly. What about removing Zimbabwe from the dataset?
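
For a simple linear regression, $r^2$ and the p-value are tied together: with the same number of points, a larger $r^2$ always gives a smaller p-value. A minimal sketch of that relationship using the standard t-statistic $t = r\sqrt{(n-2)/(1-r^2)}$; the helper `p_value_from_r` is a hypothetical name, 0.124 is the $r^2$ reported earlier, and 0.60 is an assumed value for comparison only:

```python
import numpy as np
from scipy import stats

def p_value_from_r(r, n):
    """Two-sided p-value for a Pearson correlation r computed from n points."""
    df = n - 2
    t = r * np.sqrt(df / (1.0 - r ** 2))
    return 2 * stats.t.sf(abs(t), df)

n = 6  # six countries
p_weak = p_value_from_r(np.sqrt(0.124), n)   # r squared from the total-GDP fit
p_strong = p_value_from_r(np.sqrt(0.60), n)  # assumed larger r squared
print(p_weak, p_strong)

# Cross-check against linregress on arbitrary hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 1.9, 3.5, 3.2, 5.1, 4.8])
check = stats.linregress(x, y)
p_manual = p_value_from_r(check.rvalue, len(x))
```

With only six points, even a fairly large $r^2$ can leave the p-value above the usual 0.05 threshold, so both numbers are worth reading together.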

### e. Scatter plot of GDP per capita against average life expectancy of the six countries minus Zimbabwe

In [None]:
#select data without Zimbabwe
life_GDP_by_country2 = life_GDP_by_country[life_GDP_by_country['Country'] != 'Zimbabwe']

In [None]:
#display data
life_GDP_by_country2

In [None]:
#plot GDP per capita against life_expectancy, without Zimbabwe
slope, intercept, r_value,p_value,std_err = stats.linregress(life_GDP_by_country2.GDP_per_capita,life_GDP_by_country2.Life_expectancy)

#create text variables for line equation, r squared and p-value
line_equation = f'y= {slope:.2e}x + {intercept:.0f}'
r_p_value = f"""P value = {p_value:.4f}
R squared = {r_value**2:.3f}"""

#create x and y values for the regression line
x = [*range(0,50000,10)]
y = [x1*slope + intercept for x1 in x]
fig,ax = plt.subplots(figsize = (15,10))
pl = sns.scatterplot(data = life_GDP_by_country2, x='GDP_per_capita', y='Life_expectancy',hue='Country', size = 'population',sizes = (20,700),palette = 'pastel')
#x and y must be passed as keywords in recent seaborn versions
sns.lineplot(x=x,y=y,color='black')
#label each point, reading every value from the filtered dataframe
for row in life_GDP_by_country2.itertuples():
    pl.text(row.GDP_per_capita+0.2,row.Life_expectancy,row.Country,horizontalalignment='left',size='medium',color='black',weight='semibold')

plt.text(15000,76.6,line_equation,fontsize=20,color='darkblue',bbox=box1)
plt.text(23500,74.3,r_p_value,fontsize=20,color='black',bbox=box2)

Conclusion: <br>
Finally, a stronger positive linear relationship is apparent both visually and statistically! In my analysis, Zimbabwe sits far from the other five countries in terms of life expectancy relative to its GDP per capita. We can extrapolate that Zimbabwe's life expectancy will be relatively lower than that of countries elsewhere in the world with a similar GDP. This suggests that Zimbabwe needs more aid and attention from non-profit organizations to build a better world.
    
Further research comparing continents would be interesting and informative.
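
The outlier claim in the conclusion can be quantified with a residual: fit the line on the other countries, predict the outlier's life expectancy from its GDP per capita, and compare with the observed value. A minimal sketch with hypothetical numbers (the arrays below are illustrative, not the notebook's computed averages):

```python
import numpy as np
from scipy import stats

# Hypothetical values (GDP per capita, life expectancy); the last entry plays the outlier.
gdp_pc = np.array([9000.0, 3500.0, 37000.0, 7700.0, 43000.0, 650.0])
life = np.array([78.9, 74.3, 79.7, 75.7, 78.1, 50.1])

# Fit on every point except the suspected outlier.
fit = stats.linregress(gdp_pc[:-1], life[:-1])

# Residual = observed value minus the value the fitted line predicts.
predicted = fit.intercept + fit.slope * gdp_pc[-1]
residual = life[-1] - predicted
print(round(predicted, 1), round(residual, 1))
```

A large negative residual means the country's life expectancy is well below what the trend among the other countries would suggest for its income level.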