<a href="https://colab.research.google.com/github/aayushbhurtel/ml-basics/blob/master/week_5_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Which Public Health Factors have the Greatest Impact on Life Expectancy?
Life expectancy is a crucial metric for evaluating population health. It gives the average number of years that people in a population are expected to live, and it is estimated from a range of public health factors. The task of this project is to determine which of these factors help predict life expectancy. <br>

Data Source:
The raw data was extracted from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO), which keeps track of health status. The features of the dataset include:

* Country
* HIV/AIDS
* Measles
* Year
* Hepatitis B
* Body Mass Index (BMI)
* Life expectancy
* Polio
* Status
* Adult mortality
* Diphtheria
* Prevalence for malnutrition 5-9
* Infant mortality
* Gross Domestic Product (GDP)
* Education
* Alcohol consumption
* Population
* Total expenditure on health
* Expenditure on health (%)
* Prevalence for malnutrition 1-19


# Task 1:
Read the raw data from the source file in Python.
First, let's import the necessary libraries. We will add more in later stages as needed.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [19]:
df = pd.read_csv('https://raw.githubusercontent.com/aayushbhurtel/MachineLearning/refs/heads/main/Life_Expectancy_Data.csv')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               
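
The `info()` listing suggests that some column names carry stray whitespace (for example ` BMI `, which a later cell indexes as `df[' BMI ']`). A minimal sketch of normalizing such headers is shown here on a small made-up frame rather than the real file; the values and the `raw` and `clean` names are illustrative assumptions:

```python
import pandas as pd

# Toy frame with made-up values; only the header style mimics the dataset.
raw = pd.DataFrame({"Country": ["A", "B"], " BMI ": [19.1, 58.0], "Alcohol": [0.01, 4.6]})

# Work on a copy so cells that rely on the original names keep working.
clean = raw.copy()
clean.columns = clean.columns.str.strip()

print(list(clean.columns))
```

The cells in this notebook keep the original names, so this step is optional and only worth applying if every later reference is updated to match.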

Perform feature engineering:

1. Population size: Create a population range that includes three categories:
  1. Small -  a population between 1,000 and 29,999,
  2. Medium -  a population between 30,000 and 99,999, and
  3. Large -  a population of 100,000 or more.
  

In [6]:
small_population = df[(df['Population'] >= 1000) & (df['Population'] <= 29999)]
medium_population = df[(df['Population'] >= 30000) & (df['Population'] <= 99999)]
large_population = df[df['Population'] >= 100000]
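
The filters above produce three separate DataFrames. If a single categorical column is preferred, one possible alternative is `pd.cut`, sketched here on made-up populations. The `Population_size` column name is a hypothetical choice, and the bin edges assume whole-number populations so that they match the three ranges listed above:

```python
import numpy as np
import pandas as pd

# Toy populations (made-up values) covering each range, plus edge cases.
demo = pd.DataFrame({"Population": [500, 1000, 29999, 30000, 99999, 100000, np.nan]})

# Left-inclusive bins: [1000, 30000), [30000, 100000), [100000, inf).
bins = [1_000, 30_000, 100_000, np.inf]
labels = ["Small", "Medium", "Large"]
demo["Population_size"] = pd.cut(demo["Population"], bins=bins, labels=labels, right=False)

print(demo)
```

Populations below 1,000 and missing populations fall outside every bin and are left as `NaN`, so they would need separate handling.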

Lifestyle – Create a lifestyle feature that combines alcohol consumption and BMI.

I am going to create a composite score by multiplying the alcohol and BMI features.

In [16]:
alcohol_and_bmi_feature = df['Alcohol'] * df[' BMI ']

alcohol_and_bmi_feature

Unnamed: 0,0
0,0.191
1,0.186
2,0.181
3,0.176
4,0.172
...,...
2933,118.156
2934,108.402
2935,116.509
2936,44.548
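
The product above is held in a standalone Series. A small sketch of storing it as a column instead, on made-up values, also shows how missing inputs behave. The `Lifestyle` column name is a hypothetical choice for this illustration:

```python
import numpy as np
import pandas as pd

# Made-up values; the column names follow the dataset, including ' BMI '.
demo = pd.DataFrame({"Alcohol": [0.01, 4.5, np.nan], " BMI ": [19.1, 25.0, 30.2]})

# 'Lifestyle' is a hypothetical column name chosen for this sketch.
demo["Lifestyle"] = demo["Alcohol"] * demo[" BMI "]

# A missing value in either input propagates to the product.
print(demo["Lifestyle"].isna().sum())
```

Because `Alcohol` and ` BMI ` both contain nulls in the real data, the combined feature will be missing wherever either input is missing.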


Economy – Create an economy feature that combines population and GDP. <br>
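One simple way to combine these is to multiply GDP by population, as the cell below does. Note that this product is not GDP per capita, which would be GDP divided by population. If the GDP column is already reported per person, the product approximates total economic output instead: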

In [17]:
GDP_population_index = df['GDP'] * df['Population']
GDP_population_index

Unnamed: 0,0
0,1.971086e+10
1,2.007083e+08
2,2.004633e+10
3,2.476810e+09
4,1.892519e+08
...,...
2933,5.805675e+09
2934,5.727592e+09
2935,7.198650e+06
2936,6.783921e+09
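
The product and the per-person ratio are two different ways of combining GDP and population, and a toy frame makes the difference visible. The figures are made up, and the `gdp_times_population` and `gdp_per_person` names are hypothetical. Which of the two is meaningful depends on whether the `GDP` column holds a national total or a per-capita value, which should be checked against the dataset's documentation:

```python
import pandas as pd

# Made-up figures for illustration only.
demo = pd.DataFrame({"GDP": [500.0, 40_000.0], "Population": [2_000_000.0, 50_000.0]})

# Product, as computed in the cell above.
demo["gdp_times_population"] = demo["GDP"] * demo["Population"]

# Ratio: a per-person value, meaningful only if GDP is a national total.
demo["gdp_per_person"] = demo["GDP"] / demo["Population"]

print(demo)
```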


Death Ratio – Determine the death ratio between adult and infant mortality. <br>
The death ratio is a metric indicating the relationship between adult and infant mortality, computed here as adult mortality divided by infant deaths.

In [18]:
death_ratio_feature = df['Adult Mortality'] / df['infant deaths']
death_ratio_feature

Unnamed: 0,0
0,4.241935
1,4.234375
2,4.060606
3,3.942029
4,3.873239
...,...
2933,26.777778
2934,27.500000
2935,2.920000
2936,27.440000
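
If any row records zero infant deaths, the division above yields `inf` rather than a usable ratio. A hedged sketch of one way to guard against that is to treat a zero denominator as missing. The values are made up, and the `death_ratio` column name is a hypothetical choice:

```python
import numpy as np
import pandas as pd

# Made-up rows, including one with zero infant deaths.
demo = pd.DataFrame({"Adult Mortality": [263.0, 74.0, 12.0], "infant deaths": [62, 0, 4]})

# Mask zeros so the ratio becomes NaN instead of inf for those rows.
denominator = demo["infant deaths"].where(demo["infant deaths"] != 0)
demo["death_ratio"] = demo["Adult Mortality"] / denominator

print(demo)
```

The resulting `NaN` values can then be handled together with the other missing data during cleaning.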


# Task 2:
Perform data cleaning by either removing fragmented observations or imputing missing values as necessary. Generate scatter plots between each predictor and the target variable to check for a linear relationship, and apply data transformations such as a log transform if necessary.<br>
First, we will use the `isnull()` function to find the missing data in the dataset.

In [20]:
df.isnull()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2934,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2935,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2936,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


The boolean table alone does not show how many values are missing, since only the first and last rows are displayed. Let's sum the null values in each column with `isna().sum()` to find out exactly how many there are.

In [22]:
df.isna().sum()

Unnamed: 0,0
Country,0
Year,0
Status,0
Life expectancy,10
Adult Mortality,10
infant deaths,0
Alcohol,194
percentage expenditure,0
Hepatitis B,553
Measles,0
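
Once the counts are known, the share of missing values per column helps decide between dropping and imputing. The following is a minimal sketch on made-up data, using median imputation as one simple option rather than the method this notebook necessarily adopts. The `missing_pct` and `imputed` names are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up values with gaps, using two column names from the dataset.
demo = pd.DataFrame({
    "Alcohol": [0.01, np.nan, 4.5, 2.0],
    "Hepatitis B": [65.0, 62.0, np.nan, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = demo.isna().mean() * 100

# Median imputation: fill each numeric column with its own median.
imputed = demo.fillna(demo.median(numeric_only=True))

print(missing_pct)
print(imputed)
```

The median is less sensitive to outliers than the mean, but for panel data like this a per-country fill may be more appropriate; that choice is left open here.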
