# <h1><center>Assessmet 5 on Advanced Data Analysis using Pandas</center></h1>

# **Project 3: GDP Rate and Unemployment Rate**

In [83]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pandas as pd

In [84]:
pip install pandas_datareader




In [95]:
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
gdpReset= pd.read_excel("API_NY.GDP.MKTP.CD.xls")
gdpReset.head()

Unnamed: 0,Country Name,year,NY.GDP.MKTP.CD
0,Aruba,2019,
1,Afghanistan,2019,19291100000.0
2,Angola,2019,88815700000.0
3,Albania,2019,15279180000.0
4,Andorra,2019,3154058000.0


In [96]:
gdpReset.tail()

Unnamed: 0,Country Name,year,NY.GDP.MKTP.CD
259,Kosovo,2019,7926134000.0
260,"Yemen, Rep.",2019,22581080000.0
261,South Africa,2019,351431600000.0
262,Zambia,2019,23309770000.0
263,Zimbabwe,2019,21440760000.0


In [98]:
UNEMPLOYMENT_INDICATOR = 'SL.UEM.TOTL.NE.ZS'
UnemployReset= pd.read_excel('API_SL.UEM.TOTL.NE.ZS.xls')
UnemployReset.head()

Unnamed: 0,Country Name,year,SL.UEM.TOTL.NE.ZS
0,Aruba,2019,
1,Afghanistan,2019,
2,Angola,2019,
3,Albania,2019,11.47
4,Andorra,2019,


In [99]:
UnemployReset.tail()

Unnamed: 0,Country Name,year,SL.UEM.TOTL.NE.ZS
259,Kosovo,2019,25.559999
260,"Yemen, Rep.",2019,
261,South Africa,2019,28.469999
262,Zambia,2019,
263,Zimbabwe,2019,16.860001


Cleaning the data
Inspecting the data with head() and tail() shows that:
Some countries are missing for in the two data sets: GDP and Unemployment Rates.
The data is therefore cleaned by removing the rows with unavailable values using the drop() method.

In [100]:
gdpCountries = gdpReset[2:].dropna()
gdpCountries

Unnamed: 0,Country Name,year,NY.GDP.MKTP.CD
2,Angola,2019,8.881570e+10
3,Albania,2019,1.527918e+10
4,Andorra,2019,3.154058e+09
5,Arab World,2019,2.817415e+12
6,United Arab Emirates,2019,4.211423e+11
...,...,...,...
259,Kosovo,2019,7.926134e+09
260,"Yemen, Rep.",2019,2.258108e+10
261,South Africa,2019,3.514316e+11
262,Zambia,2019,2.330977e+10


In [101]:
lifeCountries = UnemployReset[3:].dropna()
lifeCountries

Unnamed: 0,Country Name,year,SL.UEM.TOTL.NE.ZS
3,Albania,2019,11.470000
6,United Arab Emirates,2019,2.230000
7,Argentina,2019,9.840000
8,Armenia,2019,18.299999
11,Australia,2019,5.160000
...,...,...,...
255,Vietnam,2019,2.040000
257,World,2019,5.685293
259,Kosovo,2019,25.559999
261,South Africa,2019,28.469999


Transforming the data
The World Bank reports GDP in US dollars and cents. To make the data easier to read, the GDP is converted to millions of British pounds (the author's local currency) with the following auxiliary functions, using the average 2013 dollar-to-pound conversion rate provided by http://www.ukforex.co.uk/forex-tools/historical-rate-tools/yearly-average-rates.

In [5]:
def roundToMillions (value):
    return round(value / 1000000)

def usdToGBP (usd):
    return usd / 1.564768

GDP = 'GDP (£m)'
gdpCountries[GDP] = gdpCountries[GDP_INDICATOR].apply(usdToGBP).apply(roundToMillions)
gdpCountries.head()

Unnamed: 0,country,year,NY.GDP.MKTP.CD,GDP (£m)
34,Afghanistan,2013,20458940000.0,13075
35,Albania,2013,12781030000.0,8168
36,Algeria,2013,209703500000.0,134016
38,Andorra,2013,3249101000.0,2076
39,Angola,2013,138356800000.0,88420


The unnecessary columns can be dropped.

In [6]:
COUNTRY = 'country'
headings = [COUNTRY, GDP]
gdpClean = gdpCountries[headings]
gdpClean.head()

Unnamed: 0,country,GDP (£m)
34,Afghanistan,13075
35,Albania,8168
36,Algeria,134016
38,Andorra,2076
39,Angola,88420


The World Bank reports the life expectancy with several decimal places. After rounding, the original column is discarded.

In [7]:
LIFE = 'Life expectancy (years)'
lifeCountries[LIFE] = lifeCountries[LIFE_INDICATOR].apply(round)
headings = [COUNTRY, LIFE]
lifeClean = lifeCountries[headings]
lifeClean.head()

Unnamed: 0,country,Life expectancy (years)
34,Afghanistan,60
35,Albania,78
36,Algeria,75
39,Angola,52
40,Antigua and Barbuda,76


Combining the data
The tables are combined through an inner join on the common 'country' column.

In [8]:
gdpVsLife = merge(gdpClean, lifeClean, on=COUNTRY, how='inner')
gdpVsLife.head()

NameError: name 'merge' is not defined

Calculating the correlation
To measure if the life expectancy and the GDP grow together, the Spearman rank correlation coefficient is used. It is a number from -1 (perfect inverse rank correlation: if one indicator increases, the other decreases) to 1 (perfect direct rank correlation: if one indicator increases, so does the other), with 0 meaning there is no rank correlation. A perfect correlation doesn't imply any cause-effect relation between the two indicators. A p-value below 0.05 means the correlation is statistically significant.

In [None]:
from scipy.stats import spearmanr

gdpColumn = gdpVsLife[GDP]
lifeColumn = gdpVsLife[LIFE]
(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')

The value shows a direct correlation, i.e. richer countries tend to have longer life expectancy, but it is not very strong.

Showing the data
Measures of correlation can be misleading, so it is best to see the overall picture with a scatterplot. The GDP axis uses a logarithmic scale to better display the vast range of GDP values, from a few million to several billion (million of million) pounds.

In [None]:
%matplotlib inline
gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True, logx=True, figsize=(10, 4))

The plot shows there is no clear correlation: there are rich countries with low life expectancy, poor countries with high expectancy, and countries with around 10 thousand (104) million pounds GDP have almost the full range of values, from below 50 to over 80 years. Towards the lower and higher end of GDP, the variation diminishes. Above 40 thousand million pounds of GDP (3rd tick mark to the right of 104), most countries have an expectancy of 70 years or more, whilst below that threshold most countries' life expectancy is below 70 years.

Comparing the 10 poorest countries and the 10 countries with the lowest life expectancy shows that total GDP is a rather crude measure. The population size should be taken into account for a more precise definiton of what 'poor' and 'rich' means. Furthermore, looking at the countries below, droughts and internal conflicts may also play a role in life expectancy.

In [None]:
# the 10 countries with lowest GDP
gdpVsLife.sort_values(GDP).head(10)

In [None]:
# the 10 countries with lowest life expectancy
gdpVsLife.sort_values(LIFE).head(10)

Conclusions
To sum up, there is no strong correlation between a country's wealth and the life expectancy of its inhabitants: there is often a wide variation of life expectancy for countries with similar GDP, countries with the lowest life expectancy are not the poorest countries, and countries with the highest expectancy are not the richest countries. Nevertheless there is some relationship, because the vast majority of countries with a life expectancy below 70 years is on the left half of the scatterplot.

Using the NY.GDP.PCAP.PP.CD indicator, GDP per capita in current 'international dollars', would make for a better like-for-like comparison between countries, because it would take population and purchasing power into account. Using more specific data, like expediture on health, could also lead to a better analysis.