## Life Expectancy dataset Analysis and Visualization with Altair

###### original source from Kaggle: https://www.kaggle.com/datasets/augustus0498/life-expectancy-who 
###### (Background: data collected from WHO and UN website)

### Goals of Analysis:

1. What is the life expectancy trend in Germany from year 2000 - 2015? 
2. Does life expectancy have positive or negative relationship with drinking alcohol?
3. How does Infant and Adult mortality rates affect life expectancy? 
4. How many countries with life expectancy in ranges <50, 50-60, 61-75, and >75?

In [2]:
import pandas as pd
import numpy as np
import altair as alt
alt.renderers.enable('html')

RendererRegistry.enable('html')

In [3]:
df = pd.read_csv("led.csv")
df.head()

Unnamed: 0,Country,Year,Status,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,...,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [4]:
df.columns

Index(['Country', 'Year', 'Status', 'Lifeexpectancy', 'AdultMortality',
       'infantdeaths', 'Alcohol', 'percentageexpenditure', 'HepatitisB',
       'Measles', 'BMI', 'under-fivedeaths', 'Polio', 'Totalexpenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness1-19years',
       'thinness5-9years', 'Incomecompositionofresources', 'Schooling'],
      dtype='object')

In [5]:
# check for null values
df.isnull().sum()

Country                           0
Year                              0
Status                            0
Lifeexpectancy                   10
AdultMortality                   10
infantdeaths                      0
Alcohol                         194
percentageexpenditure             0
HepatitisB                      553
Measles                           0
BMI                              34
under-fivedeaths                  0
Polio                            19
Totalexpenditure                226
Diphtheria                       19
HIV/AIDS                          0
GDP                             448
Population                      652
thinness1-19years                34
thinness5-9years                 34
Incomecompositionofresources    167
Schooling                       163
dtype: int64

In [6]:
# Null values can be filled in with the mean value of the respective variable based on countries.
fill_df = df.drop(columns='Year')
fill_df = fill_df.groupby(by=['Country', 'Status']).apply(lambda col: col.fillna(col.mean()))

fill_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,BMI,under-fivedeaths,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
Country,Status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,Developing,0,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,83,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
Afghanistan,Developing,1,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,86,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
Afghanistan,Developing,2,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,89,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
Afghanistan,Developing,3,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,93,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
Afghanistan,Developing,4,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,97,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,Developing,2933,44.3,723.0,27,4.36,0.000000,68.0,31,27.1,42,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
Zimbabwe,Developing,2934,44.5,715.0,26,4.06,0.000000,7.0,998,26.7,41,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
Zimbabwe,Developing,2935,44.8,73.0,25,4.43,0.000000,73.0,304,26.3,40,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
Zimbabwe,Developing,2936,45.3,686.0,25,1.72,0.000000,76.0,529,25.9,39,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [7]:
# checking for null values again
# there are still null values, probably from groupings that all null values and does not have a mean.

fill_df.isnull().sum()

Lifeexpectancy                   10
AdultMortality                   10
infantdeaths                      0
Alcohol                          17
percentageexpenditure             0
HepatitisB                      144
Measles                           0
BMI                              34
under-fivedeaths                  0
Polio                             0
Totalexpenditure                 32
Diphtheria                        0
HIV/AIDS                          0
GDP                             405
Population                      648
thinness1-19years                34
thinness5-9years                 34
Incomecompositionofresources    167
Schooling                       163
dtype: int64

In [8]:
# drop the remaining null values
fill_df = fill_df.dropna()
fill_df.isnull().sum()

# aggregate the countries by mean values
fill_df = fill_df.groupby(by=['Country', 'Status']).mean()
fill_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,BMI,under-fivedeaths,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
Country,Status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
Afghanistan,Developing,58.19375,269.0625,78.2500,0.014375,34.960110,64.562500,2362.2500,15.51875,107.5625,48.3750,8.252500,52.3125,0.10000,340.015425,9.972260e+06,16.58125,15.58125,0.415375,8.21250
Albania,Developing,75.15625,45.0625,0.6875,4.848750,193.259091,98.000000,53.3750,49.06875,0.9375,98.1250,5.945625,98.0625,0.10000,2119.726679,6.969116e+05,1.61875,1.70000,0.709875,12.13750
Algeria,Developing,73.61875,108.1875,20.3125,0.406667,236.185241,78.000000,1943.8750,48.74375,23.5000,91.7500,4.604000,91.8750,0.10000,2847.853392,2.164983e+07,6.09375,5.97500,0.694875,12.71250
Angola,Developing,49.01875,328.5625,83.7500,5.740667,102.100268,70.222222,3561.3125,18.01875,132.6250,46.1250,3.919333,47.6875,2.36875,1975.143045,1.014710e+07,6.19375,6.66875,0.458375,8.04375
Argentina,Developing,75.15625,106.0000,10.1250,7.966667,773.038981,81.285714,2.0000,54.98125,11.3750,93.3750,6.912667,92.3750,0.10000,6998.575103,2.012120e+07,1.07500,0.95000,0.794125,16.50625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,Developing,76.07500,119.9375,0.5625,6.172667,621.838919,94.312500,0.0000,52.92500,0.7500,94.2500,8.750000,89.1250,0.10000,7192.584875,2.396771e+06,1.60000,1.54375,0.765625,15.23125
Uzbekistan,Developing,68.03125,184.8125,21.9375,1.608667,44.373450,95.642857,208.4375,34.80625,25.6875,98.5625,5.638000,98.4375,0.20625,651.092359,9.036317e+05,3.14375,3.17500,0.603000,11.64375
Vanuatu,Developing,71.38750,137.8750,0.0000,0.806667,282.325746,56.125000,20.8750,44.25625,0.0000,66.1875,3.928667,59.0625,0.10000,2000.245518,1.230962e+05,1.56875,1.49375,0.367500,10.56875
Zambia,Developing,53.90625,354.3125,33.4375,2.239333,89.650407,69.818182,6563.8125,17.45000,52.3750,64.3750,5.824000,74.2500,11.93125,811.811841,6.260246e+06,6.88125,6.76250,0.498437,11.21250


In [9]:
# check for correlation between variables
fill_df.corr()

Unnamed: 0,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,BMI,under-fivedeaths,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
Lifeexpectancy,1.0,-0.877687,-0.16358,0.456047,0.540505,0.333752,-0.158951,0.754873,-0.190052,0.640548,0.274914,0.636588,-0.600072,0.584426,-0.023562,-0.506943,-0.49523,0.817821,0.782516
AdultMortality,-0.877687,1.0,0.052508,-0.276433,-0.416904,-0.190935,0.045912,-0.614093,0.072298,-0.443953,-0.155017,-0.417253,0.73109,-0.448948,-0.023573,0.390085,0.391755,-0.610973,-0.554374
infantdeaths,-0.16358,0.052508,1.0,-0.108648,-0.116911,-0.298115,0.730379,-0.28142,0.996922,-0.224021,-0.200922,-0.216586,0.000143,-0.127314,0.905898,0.53608,0.538977,-0.151371,-0.198182
Alcohol,0.456047,-0.276433,-0.108648,1.0,0.588493,0.1643,-0.055363,0.504825,-0.104216,0.399188,0.356797,0.379951,-0.071651,0.631811,-0.038389,-0.477945,-0.462651,0.651273,0.672403
percentageexpenditure,0.540505,-0.416904,-0.116911,0.588493,1.0,0.01515,-0.119944,0.419319,-0.120538,0.313223,0.331822,0.30687,-0.151574,0.980601,-0.04173,-0.3742,-0.373701,0.581997,0.573334
HepatitisB,0.333752,-0.190935,-0.298115,0.1643,0.01515,1.0,-0.244663,0.248804,-0.311966,0.733854,0.21788,0.750264,-0.147364,0.038848,-0.222068,-0.210622,-0.219602,0.282503,0.34425
Measles,-0.158951,0.045912,0.730379,-0.055363,-0.119944,-0.244663,1.0,-0.288211,0.739258,-0.201307,-0.183938,-0.197878,0.017269,-0.126008,0.523922,0.324725,0.325463,-0.142484,-0.171563
BMI,0.754873,-0.614093,-0.28142,0.504825,0.419319,0.248804,-0.288211,1.0,-0.294589,0.4916,0.413732,0.473657,-0.329246,0.473125,-0.142385,-0.746248,-0.749584,0.718803,0.755723
under-fivedeaths,-0.190052,0.072298,0.996922,-0.104216,-0.120538,-0.311966,0.739258,-0.294589,1.0,-0.250993,-0.201483,-0.24438,0.012154,-0.132356,0.890828,0.537496,0.53875,-0.172836,-0.215238
Polio,0.640548,-0.443953,-0.224021,0.399188,0.313223,0.733854,-0.201307,0.4916,-0.250993,1.0,0.294446,0.950831,-0.214078,0.357642,-0.096655,-0.353144,-0.342677,0.601943,0.632187


#### 1. What is the life expectancy trend in Germany from year 2000 - 2015? 

In [10]:
de_life = df.loc[df['Country'] == 'Germany', ['Year', 'Lifeexpectancy']]
de_life = de_life.set_index('Year').sort_index()


In [11]:
alt.Chart(de_life.reset_index(),title = alt.Title("Life expectancy in Germany between Year 200 - 2015")
    ).mark_line().encode(
    alt.X('Year').title('Year'),
    alt.Y('Lifeexpectancy', scale=alt.Scale(domain=[50, 100])).title('Life expectancy'),
).properties(
    width = 400,
    height = 400
)

##### Life expectancy has shown improvement starting from year 2010, reaching a peak of around 89 years old on average in the year 2014.

#### 2. Does life expectancy have positive or negative relationship with alcohol consumption?

In [11]:
alcohol = fill_df.loc[:, ['Alcohol', 'Lifeexpectancy']].sort_values(by=['Lifeexpectancy'])


In [12]:
alt.Chart(alcohol.reset_index(), title = alt.Title("Life expectancy vs Alcohol consumption")
    ).mark_point(color='green',).encode(
    alt.X('Alcohol', scale=alt.Scale(domain=[-1,14 ])).title('pure Alcohol (litres)'),
    alt.Y('Lifeexpectancy', scale=alt.Scale(domain=[30, 100])).title('Life expectancy'),
).properties(
    width = 400,
    height = 400
)

#### Alcohol consumption is not a major factor in life expectancy. Low alcohol consumption does not necessarily lead to a higher life expectancy. 

### 3. How does Infant and Adult mortality rates affect life expectancy? 

In [13]:
# Adult mortality rate = Percentage of deaths between 15 and 60 years old per 1000 population
# Infant mortality rate = Percentage of Infant Deaths per 1000 population

In [13]:
new_df = fill_df.loc[:, ['Lifeexpectancy', 'AdultMortality', 'infantdeaths']].reset_index()
new_df = new_df.drop(columns='Status')
new_df = new_df.sort_values(by=['Lifeexpectancy'])

new_df['AdultMortality'] = (df['AdultMortality'] / 1000) * 100
new_df['infantdeaths'] = (df['infantdeaths'] / 1000) * 100

new_df

Unnamed: 0,Country,Lifeexpectancy,AdultMortality,infantdeaths
108,SierraLeone,46.11250,13.4,0.1
25,CentralAfricanRepublic,48.51250,9.9,0.1
68,Lesotho,48.78125,13.6,0.0
3,Angola,49.01875,27.2,6.9
73,Malawi,49.89375,14.5,0.0
...,...,...,...,...
6,Australia,81.81250,28.1,7.7
111,Spain,82.06875,14.2,0.1
60,Italy,82.18750,38.8,9.5
43,France,82.21875,1.4,1.9


In [14]:
alt.Chart(new_df, title = alt.Title("Relationship of Adult Mortality Rate and Infant Death Rate to Life Expectancy")
).mark_circle(size=60).encode(
    alt.X('AdultMortality', scale=alt.Scale(domain=[0, 40])).title('Adult Mortality Rate (%)'),
    alt.Y('infantdeaths', scale=alt.Scale(domain=[0, 14])).title('Infant Death Rate (%)'),
    color='Lifeexpectancy',
    tooltip=['Country', 'Lifeexpectancy', 'AdultMortality', 'infantdeaths']
).properties(
    width=600,
    height=400
)


#### Countries with low adult mortality rate and low infant death rate have a better life expectancy.

### 4. How many countries with life expectancy in ranges <50, 50-60, 61-75, and >75?

In [15]:
total_number_of_countries = 133

below50 = fill_df.loc[fill_df['Lifeexpectancy'] < 50].shape[0]
between50_60 = fill_df.loc[(fill_df['Lifeexpectancy'] >= 50) & (fill_df['Lifeexpectancy'] <= 60)].shape[0]
between61_75 = fill_df.loc[(fill_df['Lifeexpectancy'] > 60) & (fill_df['Lifeexpectancy'] <= 75)].shape[0]
above75 = fill_df.loc[fill_df['Lifeexpectancy'] > 75].shape[0]

category = pd.DataFrame({
    "category": ["<50", "50-60", "61-75", ">75"],
    "value": [below50, between50_60, between61_75, above75]
})

alt.Chart(category, title = alt.Title("Number of countries with Life Expectancy ranges")
).mark_arc(innerRadius=50).encode(
    theta="value",
    color="category:N",
)

#### Out of 133 countries, the majority has life expectancy between 61-75 years.