## Problem Definition 

Life expectancy differs significantly across countries, reflecting a complex interplay of various socioeconomic, environmental, and healthcare factors that influence how long people live. Understanding these factors and their impact on life expectancy is crucial for developing policies and interventions aimed at improving global health outcomes.

## Objective

This project seeks to identify and analyze the key determinants of life expectancy, exploring how variables such as access to healthcare, education, and environmental conditions contribute to disparities in longevity across different regions.

## Research Questions

1. **Which country currently has the lowest life expectancy, and what factors contribute to this?**  

2. **Which country has the highest life expectancy, and what are the underlying reasons for its longevity?**  

3. **Is there a gender disparity in life expectancy? If so, how significant is the difference between men and women?**  
   *(Straightforward with gender-disaggregated life expectancy data, though quantifying the size of the gap takes some additional analysis.)*

4. **Has life expectancy been increasing over time, and is this trend consistent across all countries?**  
   *(Requires longitudinal data analysis across multiple countries, which adds some complexity.)*

5. **What are the key factors influencing life expectancy across different countries?**  
   *(This requires statistical analysis to identify and rank the most influential factors.)*

6. **Is the observed gender gap in life expectancy consistent globally, or does it vary by region?**  
   *(Hardest: This requires not only gender-based data but also regional comparison, which involves deeper cross-country and cross-region analysis.)*
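
As a rough starting point for questions 1 and 2, the sketch below picks the countries with the lowest and highest life expectancy in the most recent year of a table. The toy frame and its values are invented purely for illustration, and `extremes_in_latest_year` is a hypothetical helper name; only the column names `Country Name`, `Year`, and `Life Expectancy World Bank` follow the dataset used in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration only (not values from the real dataset).
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "A", "B", "C"],
    "Year": [2018, 2018, 2018, 2019, 2019, 2019],
    "Life Expectancy World Bank": [60.0, 75.0, 82.0, 61.0, 76.0, None],
})


def extremes_in_latest_year(df, value_col="Life Expectancy World Bank"):
    """Return the (lowest, highest) rows for the most recent year.

    Hypothetical helper: rows with a missing value are ignored rather than imputed.
    """
    latest = df[df["Year"] == df["Year"].max()].dropna(subset=[value_col])
    lowest = latest.loc[latest[value_col].idxmin()]
    highest = latest.loc[latest[value_col].idxmax()]
    return lowest, highest


lowest, highest = extremes_in_latest_year(toy)
print("Lowest:", lowest["Country Name"])
print("Highest:", highest["Country Name"])
```

Whether missing values should be skipped, as here, or imputed first is a design choice that can change which country comes out at either extreme.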


In [4]:
#import libraries 
import pandas as pd
import numpy as np
import seaborn as sns

In [17]:
data = pd.read_csv('life_expectancy.csv')
data.head()

| | Country Name | Country Code | Region | IncomeGroup | Year | Life Expectancy World Bank | Prevelance of Undernourishment | CO2 | Health Expenditure % | Education Expenditure % | Unemployment | Corruption | Sanitation | Injuries | Communicable | NonCommunicable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | South Asia | Low income | 2001 | 56.308 | 47.8 | 730.0 | NaN | NaN | 10.809 | NaN | NaN | 2179727.1 | 9689193.7 | 5795426.38 |
| 1 | Angola | AGO | Sub-Saharan Africa | Lower middle income | 2001 | 47.059 | 67.5 | 15960.0 | 4.483516 | NaN | 4.004 | NaN | NaN | 1392080.71 | 11190210.53 | 2663516.34 |
| 2 | Albania | ALB | Europe & Central Asia | Upper middle income | 2001 | 74.288 | 4.9 | 3230.0 | 7.139524 | 3.4587 | 18.575001 | NaN | 40.520895 | 117081.67 | 140894.78 | 532324.75 |
| 3 | Andorra | AND | Europe & Central Asia | High income | 2001 | NaN | NaN | 520.0 | 5.865939 | NaN | NaN | NaN | 21.78866 | 1697.99 | 695.56 | 13636.64 |
| 4 | United Arab Emirates | ARE | Middle East & North Africa | High income | 2001 | 74.544 | 2.8 | 97200.0 | 2.48437 | NaN | 2.493 | NaN | NaN | 144678.14 | 65271.91 | 481740.7 |


### Data cleaning 

In [19]:
#rename columns 
data = data.rename(columns={'Life Expectancy World Bank':'Life_expectancy', 'Prevelance of Undernourishment': 'Undernourishment'})
#check data shape 
print(data.shape)

(3306, 16)


In [20]:
#check for missing values 
missing_values = data.isnull().sum()
missing_values

Country Name                  0
Country Code                  0
Region                        0
IncomeGroup                   0
Year                          0
Life_expectancy             188
Undernourishment            684
CO2                         152
Health Expenditure %        180
Education Expenditure %    1090
Unemployment                304
Corruption                 2331
Sanitation                 1247
Injuries                      0
Communicable                  0
NonCommunicable               0
dtype: int64

### Handle missing values 

For life expectancy, we will fill each missing value with the mean life expectancy of the corresponding region. The same approach applies to Undernourishment, CO2, and the other columns with missing values.
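
To make the region-mean idea concrete, here is a minimal sketch on a toy frame. The regions `X` and `Y` and all values are invented for illustration; only the column names `Region` and `Life_expectancy` match this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Region": ["X", "X", "X", "Y", "Y"],
    "Life_expectancy": [70.0, np.nan, 74.0, 60.0, np.nan],
})

# Per-row regional mean; missing values are skipped when averaging.
region_mean = toy.groupby("Region")["Life_expectancy"].transform("mean")

# Replace only the missing entries with their region's mean.
toy["Life_expectancy"] = toy["Life_expectancy"].fillna(region_mean)
print(toy)
```

Two caveats with this approach: the regional mean pools all years together, so it ignores any time trend, and a region with no observed values for a column would keep its missing values.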

We will drop the Corruption column: 2,331 of its 3,306 values (more than 70%) are missing, so it would be of little use.
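
The "more than 70%" figure can be checked directly from the counts printed earlier. The sketch below rebuilds those counts by hand so that it runs on its own; the variable names `n_rows`, `missing_counts`, and `missing_pct` are illustrative.

```python
import pandas as pd

n_rows = 3306  # row count reported by data.shape

# Missing-value counts copied from the isnull().sum() output above.
missing_counts = pd.Series({
    "Life_expectancy": 188,
    "Undernourishment": 684,
    "CO2": 152,
    "Health Expenditure %": 180,
    "Education Expenditure %": 1090,
    "Unemployment": 304,
    "Corruption": 2331,
    "Sanitation": 1247,
})

# Share of rows missing in each column, as a percentage.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

On the loaded frame itself, `data.isnull().mean() * 100` should give the same percentages without copying any counts.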

In [35]:
#select columns with missing values 
columns_with_missing_values = ['Life_expectancy', 'Undernourishment', 'CO2', 'Health Expenditure %', 
                               'Education Expenditure %', 'Unemployment', 'Sanitation']

#fill missing values in each column with the mean of the corresponding region 
for column in columns_with_missing_values:
    data[column] = data.groupby('Region')[column].transform(lambda x: x.fillna(x.mean()))

#drop Corruption; errors='ignore' keeps this cell re-runnable, since a second run 
#would otherwise raise KeyError: "['Corruption'] not found in axis" 
data = data.drop(columns=['Corruption'], errors='ignore')

print(data.isnull().sum())