## CA1: Part (ii): Cluster — Life Expectancy Data

**Course:** HDip in Data Analytics

**Lecturer:** Muhammad Iqbal

**Module:** Machine Learning for Business  

# Are there country profiles (clusters) with distinct health and public spending patterns?

### Project Objective
The objective of this project is to identify country profiles with distinct health and public spending patterns through the application of clustering techniques. Using a dataset that includes socio-economic and health-related indicators such as *Life expectancy, Adult mortality, Alcohol consumption, Health expenditure, GDP, Population, Education level, and Immunization coverage (Hepatitis B, Polio, Diphtheria)*, the goal is to uncover groups of countries that share similar characteristics.

Clustering will be performed with K-Means and hierarchical clustering, preceded by PCA to reduce dimensionality and improve the interpretability of the cluster structures. The resulting solutions will be evaluated with the Silhouette Score and the Davies–Bouldin Index, which allows a quantitative comparison of the two algorithms.

The insights generated can help understand how economic and social factors align with population health outcomes globally.
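
The workflow described above can be sketched end to end on synthetic data. This is only a minimal illustration, not the analysis itself: the data comes from scikit-learn's `make_blobs`, and the choice of three clusters, two principal components, and Ward linkage are placeholder assumptions rather than values taken from this project.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the indicator matrix (assumption: 3 groups, 8 features)
X, _ = make_blobs(n_samples=300, n_features=8, centers=3, random_state=42)

# Standardize so that no indicator dominates because of its scale
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 principal components (placeholder choice)
X_pca = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Hierarchical": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_pca)
    scores[name] = {
        "silhouette": silhouette_score(X_pca, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X_pca, labels),  # lower is better
    }

for name, s in scores.items():
    print(f"{name}: silhouette={s['silhouette']:.3f}, "
          f"davies_bouldin={s['davies_bouldin']:.3f}")
```

The two metrics point in opposite directions: a better solution has a higher Silhouette Score and a lower Davies–Bouldin Index, so both should be read together when comparing the algorithms.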

### Data Dictionary
**Country** Name of the country observed

**Year** Year of data record

**Status** Country classification according to World Bank criteria 

**Life expectancy** Average number of years a newborn is expected to live under current mortality conditions

**Adult Mortality** Probability of dying between ages 15 and 60 years per 1,000 population 

**infant deaths** Number of infant deaths (under 1 year of age) per 1,000 live births

**Alcohol** Estimated average alcohol consumption (in litres of pure alcohol per capita, age 15+)

**percentage expenditure** Health expenditure as a percentage of Gross Domestic Product 

**Hepatitis B** Immunization coverage for Hepatitis B 

**Measles** Number of reported cases of Measles per 1,000 population

**BMI** Average Body Mass Index of the population (kg/m²)

**under-five deaths** Number of deaths of children under five years of age 

**Polio** Immunization coverage for Polio 

**Total expenditure** Total government expenditure on health as a percentage of total government expenditure

**Diphtheria** Immunization coverage for Diphtheria, Pertussis and Tetanus 

**HIV/AIDS** Deaths per 1,000 live births due to HIV/AIDS among children under 5 years old

**GDP** Gross Domestic Product per capita

**Population** Total population of the country for the corresponding year

**thinness 1-19 years** Prevalence of thinness among individuals aged 10 to 19 years (%)

**thinness 5-9 years** Prevalence of thinness among individuals aged 5 to 9 years (%)

**Income composition of resources** Index reflecting human development based on income (0–1 scale, higher = better)

**Schooling** Average number of years of schooling expected for children

In [5]:
# Import Libraries & Load Dataset

In [7]:
import warnings
warnings.filterwarnings('ignore') # Suppress warning messages in the output

In [9]:
import pandas as pd # Read, organize, and analyze the dataset

# Load the data into the DataFrame df
df = pd.read_csv("Life Expectancy Data.csv")

# Display first 5 records
df.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [11]:
# Number of rows and columns
df.shape

(2938, 22)

In [13]:
# Dataset overview 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

In [15]:
# Descriptive statistics
df.describe()

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


In [17]:
# Total missing values in the dataset
df.isnull().sum().sum()

2563

In [19]:
# Missing values per column
df.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

In [21]:
# Percentage of missing values per column
df.isnull().mean() * 100

Country                             0.000000
Year                                0.000000
Status                              0.000000
Life expectancy                     0.340368
Adult Mortality                     0.340368
infant deaths                       0.000000
Alcohol                             6.603131
percentage expenditure              0.000000
Hepatitis B                        18.822328
Measles                             0.000000
 BMI                                1.157250
under-five deaths                   0.000000
Polio                               0.646698
Total expenditure                   7.692308
Diphtheria                          0.646698
 HIV/AIDS                           0.000000
GDP                                15.248468
Population                         22.191967
 thinness  1-19 years               1.157250
 thinness 5-9 years                 1.157250
Income composition of resources     5.684139
Schooling                           5.547992
dtype: flo
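
The two cells above report counts and percentages separately. As a convenience, both can be combined into one table sorted by severity. The helper name `missing_summary` and the toy frame are hypothetical and used only for illustration; in the notebook the function would be called on `df`.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    summary = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "percent": frame.isnull().mean() * 100,
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    "GDP": [1.0, np.nan, np.nan, 4.0],
    "Alcohol": [0.5, 0.7, np.nan, 0.9],
    "Year": [2012, 2013, 2014, 2015],
})
print(missing_summary(toy))
```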

In [23]:
# Duplicate records
df.duplicated().sum()

0
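
`df.duplicated()` only flags rows that are identical in every column. Because the data holds one record per country and year, it may also be worth checking that each (Country, Year) pair appears only once. The sketch below uses a made-up toy frame; the column names `Country` and `Year` follow the data dictionary.

```python
import pandas as pd

# Toy panel with a repeated (Country, Year) key but different values
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2015, 2015],
    "GDP": [10.0, 11.0, 20.0, 21.0],
})

# Rows identical across all columns
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an earlier (Country, Year) combination
key_dupes = toy.duplicated(subset=["Country", "Year"]).sum()

print(full_row_dupes, key_dupes)
```

In this toy example the full-row check finds nothing, while the key-based check flags the second record for country B in 2015.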

In [25]:
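# Remove columns that will not be used.
# Names must match the CSV headers exactly, including stray whitespace
# (hence 'Life expectancy ' with a trailing space). With errors='ignore',
# names that do not match are skipped silently: 'Measles' and the thinness
# columns are still present in use_df, so their headers evidently differ
# from the names listed here (the thinness headers start with a space).
drop_cols = ['Life expectancy ','Country','Year','Status',
    'infant deaths','Measles','Population','percentage expenditure',
    'thinness 5-9 years', 'thinness  1-19 years']

use_df = df.drop(columns=drop_cols, errors='ignore').copy()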

In [27]:
# Missing values per column after dropping the unused columns
use_df.isnull().sum()

Adult Mortality                     10
Alcohol                            194
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
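
The output above still lists `Measles` and both thinness columns even though they appear in `drop_cols`. Earlier outputs show leading spaces in some headers (for example ` BMI`), and `errors='ignore'` skips any name that does not match exactly. One way to surface and avoid such mismatches is sketched below on a toy frame. The values are made up, and the padded headers (including the trailing space on `Measles `) are assumptions that mimic the pattern rather than a verified copy of the CSV headers.

```python
import pandas as pd

# Toy frame whose headers mimic the stray whitespace (values are made up)
toy = pd.DataFrame({
    "Life expectancy ": [65.0, 59.9],
    "Measles ": [1154, 492],
    " thinness 5-9 years": [17.3, 17.5],
    "Schooling": [10.1, 10.0],
})

drop_cols = ["Life expectancy ", "Measles", "thinness 5-9 years"]

# Names that drop(..., errors='ignore') would silently skip
unmatched = [c for c in drop_cols if c not in toy.columns]
print(unmatched)

# Strip padding on both sides before dropping so that every name matches
clean = toy.rename(columns=str.strip)
clean_drop = [c.strip() for c in drop_cols]
result = clean.drop(columns=clean_drop)
print(list(result.columns))
```

Dropping without `errors='ignore'` after normalizing the headers has the added benefit that a misspelled name raises a `KeyError` instead of passing unnoticed.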

In [29]:
import numpy as np # used for numerical calculations and array manipulation

# Treat zeros as missing data.
# In this dataset, some fields use 0 to indicate that the value is missing.
# These zeros are replaced with NaN so that pandas and the cleaning/imputation
# routines treat them as missing values.
zero_as_missing = ['Alcohol','Hepatitis B','Polio','Diphtheria',
                   'BMI','Total expenditure','GDP','HIV/AIDS']

# Keep only the columns that exist in use_df. Headers are compared after
# stripping whitespace, because some carry stray spaces (e.g. ' BMI').
cols_zero = [c for c in use_df.columns if c.strip() in zero_as_missing]
if cols_zero:
    use_df[cols_zero] = use_df[cols_zero].replace(0, np.nan)
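
Before replacing zeros, it can help to count them per column. In the descriptive statistics above, every column listed in `zero_as_missing` has a minimum above zero, whereas `Schooling` and `Income composition of resources` have a minimum of 0.0, so a check along these lines shows where the replacement actually has an effect. The toy values are made up, and treating `Schooling` zeros as missing is a hypothetical choice for the example only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for use_df (values are made up for illustration)
toy = pd.DataFrame({
    "Alcohol": [0.01, 3.2, np.nan],
    "Schooling": [0.0, 12.3, 0.0],
    "GDP": [584.3, np.nan, 631.7],
})

# Count exact zeros per column (NaN never compares equal to 0)
zero_counts = (toy == 0).sum()
print(zero_counts[zero_counts > 0])

# Replace zeros only in the columns where 0 is judged to mean "missing"
cols = ["Schooling"]  # hypothetical choice for this toy example
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)
print(cleaned.isnull().sum())
```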