## Life Expectancy and Income
_by Mohammad Fazlan bin Harun_

About the dataset:

Data source : https://databank.worldbank.org/source/world-development-indicators

Database: World Development Indicators  
Countries: check all "Countries" (not All or Aggregates)  
Series: Life expectancy at birth, total (years) [SP.DYN.LE00.IN]  
Population, total [SP.POP.TOTL]  
GDP per capita, PPP (current international $) [NY.GDP.PCAP.PP.CD]  
Time:   check all years

layout > orientation > popular > Orientation 4 (Series on page, Time on column, Country on row)

Life expectancy at birth, total (years)   
-the average number of years a newborn would be expected to live if the mortality patterns prevailing at the time of its birth stayed the same throughout its life. It reflects the overall mortality level of a population and summarizes the mortality pattern across all age groups in a given year.

Population   
-total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are mid-year estimates.

GDP per capita, PPP (current international $)	
-this indicator provides per capita values for gross domestic product (GDP), expressed in current international dollars converted using the purchasing power parity (PPP) conversion factor. 
GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products.

In [1]:
#importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import pprint as pp

import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False,
world_readable=True)

warnings.filterwarnings("ignore")
%matplotlib inline
sns.set()

In [2]:
#loading data

data = pd.read_csv('Data_WB_Life_Expectancy.csv')
data.head(10)

Unnamed: 0,Time,Time Code,Country Name,Country Code,"Life expectancy at birth, total (years) [SP.DYN.LE00.IN]","Population, total [SP.POP.TOTL]","GDP per capita, PPP (current international $) [NY.GDP.PCAP.PP.CD]"
0,1960,YR1960,Afghanistan,AFG,32.446,8996967,..
1,1960,YR1960,Albania,ALB,62.283,1608800,..
2,1960,YR1960,Algeria,DZA,46.141,11057864,..
3,1960,YR1960,American Samoa,ASM,..,20127,..
4,1960,YR1960,Andorra,AND,..,13410,..
5,1960,YR1960,Angola,AGO,37.524,5454938,..
6,1960,YR1960,Antigua and Barbuda,ATG,61.968,54132,..
7,1960,YR1960,Argentina,ARG,65.055,20481781,..
8,1960,YR1960,Armenia,ARM,65.972,1874119,..
9,1960,YR1960,Aruba,ABW,65.662,54208,..


The GDP per capita column shows '..' where values are missing; we will handle these placeholders later on.

In [3]:
print('List of columns :')
data.columns.to_list()

List of columns :


['Time',
 'Time Code',
 'Country Name',
 'Country Code',
 'Life expectancy at birth, total (years) [SP.DYN.LE00.IN]',
 'Population, total [SP.POP.TOTL]',
 'GDP per capita, PPP (current international $) [NY.GDP.PCAP.PP.CD]']

In [4]:
print(f'There are {data.shape[0]} rows and {data.shape[1]} columns in this dataset.')

There are 13454 rows and 7 columns in this dataset.


In [5]:
#renaming column 

data.rename(
    columns={
        'Time':'Year',
        'GDP per capita, PPP (current international $) [NY.GDP.PCAP.PP.CD]':'GDP_per_capita',
        'Population, total [SP.POP.TOTL]':'Population',
        'Life expectancy at birth, total (years) [SP.DYN.LE00.IN]':'Life_expectancy'
    },inplace=True)

data.head(2)

Unnamed: 0,Year,Time Code,Country Name,Country Code,Life_expectancy,Population,GDP_per_capita
0,1960,YR1960,Afghanistan,AFG,32.446,8996967,..
1,1960,YR1960,Albania,ALB,62.283,1608800,..


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13454 entries, 0 to 13453
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Year             13454 non-null  int64 
 1   Time Code        13454 non-null  object
 2   Country Name     13454 non-null  object
 3   Country Code     13454 non-null  object
 4   Life_expectancy  13454 non-null  object
 5   Population       13454 non-null  object
 6   GDP_per_capita   13454 non-null  object
dtypes: int64(1), object(6)
memory usage: 735.9+ KB


The life expectancy, population, and GDP per capita columns should be numeric rather than object. We will convert them to a numeric data type later.

In [7]:
#check for missing values

data.isna().sum()

Year               0
Time Code          0
Country Name       0
Country Code       0
Life_expectancy    0
Population         0
GDP_per_capita     0
dtype: int64

It is odd to see no missing values when we know the GDP per capita column has gaps. The missing values may be denoted by '..'; let's check.

In [8]:
((data['GDP_per_capita'])=='..').sum()

7684

In [9]:
(data=='..').sum()

Year                  0
Time Code             0
Country Name          0
Country Code          0
Life_expectancy    1524
Population          259
GDP_per_capita     7684
dtype: int64

In [10]:
# missing values are represented by '..'; replace them with NaN

data= data.replace("..", np.nan)
(data=='..').sum()

Year               0
Time Code          0
Country Name       0
Country Code       0
Life_expectancy    0
Population         0
GDP_per_capita     0
dtype: int64
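
The placeholder can also be handled when the file is read. The sketch below uses a small in-memory CSV with illustrative values (not the real export) and the `na_values` parameter of `pd.read_csv`, so that `..` is parsed as missing and the affected columns come out numeric straight away.

```python
import io
import pandas as pd

# Illustrative three-row extract in a layout similar to the World Bank export.
# The numbers are placeholders for the sketch, not real observations.
raw = io.StringIO(
    "Time,Country Name,Life_expectancy,Population,GDP_per_capita\n"
    "1960,Afghanistan,32.4,8996967,..\n"
    "1960,American Samoa,..,20127,..\n"
    "1990,Albania,72.0,3290000,2550.0\n"
)

# na_values tells the parser to treat ".." as missing in every column.
sample = pd.read_csv(raw, na_values="..")

missing_counts = sample.isna().sum()
print(missing_counts)
print(sample.dtypes)
```

Applied to the real file, this would likely make the separate replace and type-conversion steps unnecessary, assuming `..` is the only placeholder in the export.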

In [11]:
#change object datatype to numeric

data= data.assign(
    Population = pd.to_numeric(data.Population),
    GDP_per_capita = pd.to_numeric(data.GDP_per_capita),
    Life_expectancy = pd.to_numeric(data.Life_expectancy)
)
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13454 entries, 0 to 13453
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Year             13454 non-null  int64  
 1   Time Code        13454 non-null  object 
 2   Country Name     13454 non-null  object 
 3   Country Code     13454 non-null  object 
 4   Life_expectancy  11930 non-null  float64
 5   Population       13195 non-null  float64
 6   GDP_per_capita   5770 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 735.9+ KB
None
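
If the `..` placeholders had not been replaced first, `pd.to_numeric` would raise an error on them. A minimal sketch on a toy frame (made-up values) shows the `errors="coerce"` option, which turns anything unparseable into NaN during the conversion.

```python
import pandas as pd

# Toy frame: numbers stored as strings, with ".." marking the gaps.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Life_expectancy": ["62.3", "..", "70.1"],
    "Population": ["1608800", "20127", ".."],
})

numeric_cols = ["Life_expectancy", "Population"]

# errors="coerce" converts unparseable entries (here "..") to NaN
# instead of raising a ValueError.
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

Coercion is convenient but silent: it also hides typos or unexpected text, so counting the NaN values afterwards is a sensible check.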


In [12]:
data.isnull().sum()

Year                  0
Time Code             0
Country Name          0
Country Code          0
Life_expectancy    1524
Population          259
GDP_per_capita     7684
dtype: int64

In [13]:
data.dropna(inplace=True)
data.isnull().sum()

Year               0
Time Code          0
Country Name       0
Country Code       0
Life_expectancy    0
Population         0
GDP_per_capita     0
dtype: int64
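
Calling `dropna()` without arguments removes a row as soon as any column is missing, so every year without a GDP figure disappears along with its life expectancy value. The sketch below, on a made-up four-row frame, contrasts that default with the `subset` parameter, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Year": [1960, 1990, 2020, 2020],
    "Life_expectancy": [62.3, 72.0, np.nan, 76.0],
    "Population": [1.6e6, 3.3e6, 5.0e4, 3.2e7],
    "GDP_per_capita": [np.nan, 2550.0, 30000.0, 27000.0],
})

# Default behaviour: drop a row if any column is missing.
complete_rows = toy.dropna()

# Drop a row only when life expectancy is missing; GDP gaps are kept.
life_rows = toy.dropna(subset=["Life_expectancy"])

print(len(toy), len(complete_rows), len(life_rows))
```

For this analysis both variables are needed in the same row, so dropping incomplete rows is reasonable; `subset` matters only if one of the series is studied on its own.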

In [15]:
data['Population (million)'] = round(data['Population']/1000000,2)
data['GDP_per_capita (thousand)'] = round(data['GDP_per_capita']/1000,2)
data['Life_expectancy'] = round((data['Life_expectancy']),0)
data.drop(columns=['GDP_per_capita','Population','Time Code'],inplace=True)
data.set_index('Country Code',inplace = True)
data.head()

Unnamed: 0_level_0,Year,Country Name,Life_expectancy,Population (million),GDP_per_capita (thousand)
Country Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ALB,1990,Albania,72.0,3.29,2.55
DZA,1990,Algeria,67.0,25.76,6.91
AGO,1990,Angola,45.0,11.85,3.28
ATG,1990,Antigua and Barbuda,72.0,0.06,11.59
ARG,1990,Argentina,72.0,32.62,7.19


In [None]:
# select the latest year of data to analyze

df = data[data['Year'] == 2020].copy()  # copy so that new columns are not added to a slice of `data`
df.head()

In [None]:
df.info()

In [None]:
# Take the natural logarithm; small differences in natural logs approximate percentage differences.
df["log_gdp"] = round(np.log(df["GDP_per_capita (thousand)"]),2)
df.head(10)
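
Why log differences read as percentage differences can be checked with a few numbers. The sketch below uses hypothetical GDP per capita values (in thousand dollars) and compares the exact relative difference with the difference in natural logs. The two are close for small gaps (a 10% gap gives a log difference of about 0.095) and drift apart for large ones (a doubling gives about 0.693).

```python
import numpy as np

# Hypothetical GDP per capita values in thousand dollars, for illustration.
base = 10.0
others = np.array([10.5, 11.0, 15.0, 20.0])

pct_diff = (others - base) / base            # exact relative difference
log_diff = np.log(others) - np.log(base)     # difference in natural logs

for value, pct, ld in zip(others, pct_diff, log_diff):
    print(f"{value:5.1f}: relative {pct:.3f}, log difference {ld:.3f}")
```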

In [None]:
# descriptive statistics of the dataset

round(df.describe(),2)

In [None]:
life_expectancy_Malaysia2020 = df[(df['Country Name'] == 'Malaysia')]
life_expectancy_Malaysia2020

<div class="alert alert-block alert-info">
    <b>Average life expectancy across the 181 countries in 2020 is 73 years, with a range of 54 to 85 years. There is thus substantial variation in the average length of life across countries. For comparison, Malaysia (my country) is in the upper half of the range: Malaysian residents are expected to live to around 76 years.</b>
    </div>
    

In [None]:
sorted_by_life_expectancy = df.sort_values(by=['Life_expectancy'])

print('Countries with the lowest life expectancy:')
sorted_by_life_expectancy[['Country Name', 'Life_expectancy']].head()

In [None]:
print('Countries with the highest life expectancy are in East Asia and Southeast Asia (Singapore):')
sorted_by_life_expectancy[['Country Name', 'Life_expectancy']].tail()

### Figure 1: The distribution of GDP per capita

(a) Histogram of GDP per capita

In [None]:
df['GDP_per_capita (thousand)'].hist(figsize=(10, 5), bins=42)
plt.title("Figure 1a: The distribution of GDP per capita", fontsize=15) 
plt.xlabel("GDP per capita (thousand)", fontsize=12) 
plt.ylabel("Frequency", fontsize=12)  

<div class="alert alert-block alert-info">
    <b>The distribution of GDP per capita is skewed with a long right tail, as the histogram in Figure 1a shows. Six countries have extremely high GDP per capita values. Let's see which countries they are.
</b>
    </div>

In [None]:
sorted_by_gdppc = df.sort_values(by=['GDP_per_capita (thousand)'],ascending=False)
sorted_by_gdppc.head(6)


<div class="alert alert-block alert-info">
    <b>The six countries with extremely high GDP per capita are Luxembourg, Singapore, Ireland, Qatar, Bermuda, and Switzerland.</b> 
    </div>

In [None]:
round(sorted_by_gdppc.describe(),2)

<div class="alert alert-block alert-info">
    <b>Average GDP per capita in this data is about 21 thousand international dollars (PPP), while the median is 13 thousand; the gap between the two also reflects the skewness of the distribution. The standard deviation is 22 thousand, suggesting substantial spread.</b> 
    </div>
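
A mean well above the median is the usual signature of a long right tail. The sketch below reproduces that pattern on ten made-up GDP per capita values (thousand dollars) and adds the sample skewness from `Series.skew()`, which is positive for a right-skewed distribution.

```python
import pandas as pd

# Made-up values with a couple of very rich outliers, for illustration only.
gdp = pd.Series([1, 2, 3, 5, 8, 12, 15, 20, 45, 110], dtype=float)

mean_gdp = gdp.mean()
median_gdp = gdp.median()
skewness = gdp.skew()   # sample skewness; positive means a longer right tail

print(f"mean={mean_gdp:.1f}, median={median_gdp:.1f}, skewness={skewness:.2f}")
```

The two largest values pull the mean to 22.1 while the median stays at 10.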

(b) Histogram of log (GDP per capita)

In [None]:
df['log_gdp'].hist(figsize=(10,5),bins=30)
plt.title("Figure 1b: The distribution of ln(GDP per capita)", fontsize=15) 
plt.xlabel("ln GDP per capita (thousand)", fontsize=12) 
plt.ylabel("Frequency", fontsize=12)

<div class="alert alert-block alert-info">
    <b>When we take the natural log of GDP per capita, the distribution is close to symmetric (Figure 1b). Two countries have a GDP per capita below one thousand dollars (Burundi and the Central African Republic), so their log GDP per capita is negative. The average of log GDP per capita is 2.49 in this data and its median is 2.57, suggesting a fairly symmetric distribution. The standard deviation is 1.15.</b>
    </div>
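
The effect of the log transformation on skewness can be illustrated with synthetic data. The sketch below draws right-skewed values from a lognormal distribution (an assumption made purely for illustration, not a claim about the World Bank data) and compares the sample skewness before and after taking logs. It also confirms that a value below 1, here standing for less than one thousand dollars, has a negative log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "income" values. They are lognormal by construction,
# so their log is normally distributed. Not the notebook's data.
income = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=2000))
log_income = np.log(income)

skew_before = income.skew()
skew_after = log_income.skew()
print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")

# Values below 1 (below one thousand dollars in the notebook's units)
# have a negative natural log.
negative_log = np.log(0.8) < 0
print(negative_log)
```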

In [None]:
sorted_by_gdppc.tail()

### Figure 2: Life expectancy and GDP per capita

(a) Life expectancy and GDP per capita

In [None]:
plt.figure(figsize=(7, 5))
sns.regplot(x=df['GDP_per_capita (thousand)'], y=df['Life_expectancy'])  # pass x and y by keyword (required by seaborn 0.12+)
plt.xlabel("GDP per capita (thousand dollars)", fontsize=15) 
plt.ylabel("Life expectancy (years)", fontsize=15) 
plt.axis([0, 120, 50, 100])

The preceding plot shows a strong positive correlation between these two variables, although a few outliers are easy to spot in the graph.

<div class="alert alert-block alert-info">
    <b>Figure 2a shows the results of a linear regression with life expectancy as the y variable, and GDP per capita as the x variable. This regression shows a positive slope, but the scatterplot implies that there are strong nonlinearities in the association that the linear regression doesn’t capture. In particular, the slope appears a lot steeper at lower levels of GDP per capita and a lot flatter at higher levels. Thus, the linear regression does not capture the shape of the pattern, and because of the skewed distribution of the x variable, it is not a very good approximation of the average of that nonlinear pattern either.</b> 
    </div>
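
One rough way to quantify how much better a log specification fits is to compare the R-squared of the two simple regressions. The sketch below does this on synthetic data generated so that life expectancy is linear in ln(GDP); the better fit of the log version is therefore true by construction and only illustrates the comparison. `r_squared` is a helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: life expectancy is linear in ln(GDP) plus noise.
gdp = rng.uniform(1, 100, size=200)                       # thousand dollars
life = 60 + 5 * np.log(gdp) + rng.normal(0, 1.5, size=200)


def r_squared(x, y):
    """R-squared of a simple least-squares line of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    return 1 - residuals.var() / y.var()


r2_level = r_squared(gdp, life)          # x in levels
r2_log = r_squared(np.log(gdp), life)    # x in natural logs
print(f"R-squared, level: {r2_level:.2f}; log: {r2_log:.2f}")
```

The same helper could be applied to `df['GDP_per_capita (thousand)']` and `df['log_gdp']` to make the comparison on the actual 2020 data.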

(b) Life expectancy and ln (GDP per capita)

In [None]:
plt.figure(figsize=(8, 5))
sns.regplot(x=df['log_gdp'], y=df['Life_expectancy'])  # pass x and y by keyword (required by seaborn 0.12+)
plt.xlabel("ln GDP per capita (thousand dollars)", fontsize=15) 
plt.ylabel("Life expectancy (years)", fontsize=15) 
plt.axis([-0.5, 5, 50, 85])

<div class="alert alert-block alert-info">
    <b>Figure 2b shows the same regression with the natural log of GDP per capita as the x variable. In contrast with the 
regression without logs, the scatterplot with log GDP per capita suggests a remarkably linear pattern.
For most of the range of log GDP per capita, the linear fit appears to be a good approximation
of the conditional expectation.</b>
    </div>
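
With life expectancy in years and GDP per capita in logs, the slope has a direct reading: countries with 10% higher GDP per capita have, on average, a life expectancy higher by about slope × ln(1.10) years. The sketch below fits such a line with `np.polyfit` on synthetic data whose true slope is set to 5; the numbers are assumptions for illustration and say nothing about the slope in the actual data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with a known slope of 5 on ln(GDP); not the notebook's data.
log_gdp = rng.uniform(0, 4.5, size=300)
life = 60 + 5 * log_gdp + rng.normal(0, 1.0, size=300)

# Least-squares line of life expectancy on log GDP per capita.
slope, intercept = np.polyfit(log_gdp, life, deg=1)

# In a level-log regression, a 10% higher x goes with slope * ln(1.10) more y.
diff_10pct = slope * np.log(1.10)
print(f"slope={slope:.2f}, years associated with 10% higher GDP={diff_10pct:.2f}")
```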

In [None]:
df.iplot(x='log_gdp',
         y='Life_expectancy',
         text='Country Name',
         bestfit=True,
         bestfit_colors=['black'],
         mode='markers',
         xTitle='ln GDP per capita (thousand dollars)',
         yTitle='Life expectancy (years)',
         colors = 'blue',
        xrange=[-0.5, 5],
        yrange=[ 50, 90],
         dash='solid',
         interpolation='linear',
         symbol='circle-dot',
         size=8)

In [None]:
#!jupyter nbconvert --to html Life_expectancy_and_Income.ipynb