# Life Expectancy Classifier 
#### Problem: Use a random forest to predict whether a country will have a high, medium, or low life expectancy based on its health and economic factors.

- Label creation: Define the target variable by binning life expectancy (x, in years) into three groups:
  - High Life Expectancy (x > 80)
  - Medium Life Expectancy (60 ≤ x ≤ 80)
  - Low Life Expectancy (x < 60)

- Impact: The classification could help policymakers identify which countries most urgently need improvements in healthcare and economic resources, so that aid and funding can be allocated more effectively.
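
The label creation step above can be sketched in a few lines of pandas. This is a minimal sketch, not the notebook's implementation: the function name `label_life_expectancy` is hypothetical, and assigning the boundary values (exactly 60 and 80 years) to the Medium group is an assumption.

```python
import numpy as np
import pandas as pd


def label_life_expectancy(years: pd.Series) -> pd.Series:
    """Map life expectancy in years to "High", "Medium", or "Low".

    Assumption: values of exactly 60 or 80 fall into "Medium".
    Missing values also fall through to the default, so drop them first.
    """
    conditions = [years > 80, years < 60]
    choices = ["High", "Low"]
    labels = np.select(conditions, choices, default="Medium")
    return pd.Series(labels, index=years.index, name="life_expectancy_class")


# Toy values chosen to exercise each group and both boundaries
example_years = pd.Series([85.0, 80.0, 60.0, 59.9])
example_labels = label_life_expectancy(example_years)
print(example_labels.tolist())
```

Using explicit conditions rather than fixed bin edges makes it easy to see, and to change, which group the boundary values belong to.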


# Introducing the data
#### The dataset contains life expectancy data along with various health and economic factors for 193 countries. It is sourced from two reputable international organizations:
- World Health Organization (WHO): The WHO data repository, which provides health statistics including life expectancy and mortality rates.
- United Nations (UN): The UN's data on economic and demographic factors, including schooling, alcohol consumption, and population figures.
- Link to dataset: https://www.kaggle.com/code/varunsaikanuri/life-expectancy-visualization/input


#### What Is the Data About?
- This dataset focuses on life expectancy and the various factors that affect it. Life expectancy is an important global health indicator, often used to assess the overall well-being and healthcare access of a population.

- The dataset has 22 columns in total. This project focuses on the following 10, which provide a mix of health and economic data points (Life Expectancy is the column the target labels are derived from):

  1. Country
  2. Year
  3. Status
  4. Life Expectancy
  5. Adult Mortality
  6. Infant Deaths
  7. Alcohol
  8. BMI (Body Mass Index)
  9. Population
  10. Schooling


- These 10 columns combine health-related and economic factors that are known to influence life expectancy. Focusing on them lets us explore how mortality rates, alcohol consumption, education, and development status correlate with life expectancy, which is the basis for the insights and predictions that follow.
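
Selecting these 10 columns could look like the sketch below. The column listing later in the notebook shows that some names carry stray spaces (for example ` BMI`), so the sketch normalizes whitespace before selecting. The helper `select_features` and the toy row are hypothetical and only illustrate the idea.

```python
import pandas as pd

# The 10 columns named above, spelled as in the dataset once whitespace is normalized
SELECTED_COLUMNS = [
    "Country", "Year", "Status", "Life expectancy", "Adult Mortality",
    "infant deaths", "Alcohol", "BMI", "Population", "Schooling",
]


def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize whitespace in column names, then keep only the selected columns."""
    cleaned = df.rename(columns=lambda name: " ".join(str(name).split()))
    return cleaned[SELECTED_COLUMNS]


# Toy one-row frame standing in for the real CSV (illustrative values only)
toy = pd.DataFrame({
    "Country": ["Exampleland"], "Year": [2015], "Status": ["Developing"],
    "Life expectancy ": [65.0], "Adult Mortality": [263.0], "infant deaths": [62],
    "Alcohol": [0.01], " BMI ": [19.1], "Population": [33736494.0],
    "Schooling": [10.1], "GDP": [584.26],
})
features = select_features(toy)
print(features.columns.tolist())
```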
  

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

In [24]:
life_expect = pd.read_csv("/Users/bitaghaffari/Desktop/Data Mining/Life Expectancy Data.csv")

- The next cell displays the full dataset. To keep the problem focused and reduce the risk of overfitting, we will use only 10 of these columns.

In [30]:
life_expect 

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [35]:
life_expect.dtypes

Country                             object
Year                                 int64
Status                              object
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
 BMI                               float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
 HIV/AIDS                          float64
GDP                                float64
Population                         float64
 thinness  1-19 years              float64
 thinness 5-9 years                float64
Income composition of resources    float64
Schooling                          float64
dtype: object

In [37]:
life_expect.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


# Pre-Processing the Data
- The pre-processing steps are:
  - life_expect.isnull().sum(): Counts the missing values in each column of the dataset.
  - life_expect.dropna(inplace=True): Drops every row that has at least one missing value.
  - life_expect = life_expect.drop_duplicates(): Removes duplicate rows. The result is assigned back because drop_duplicates() returns a new DataFrame by default.
  - life_expect.info(): Shows the structure of the cleaned dataset, including the number of entries and the data type of each column.

In [40]:
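# Count the missing values in each column before cleaning
missing_values = life_expect.isnull().sum()
print(missing_values)
# Drop rows with any missing value, then remove duplicate rows
life_expect.dropna(inplace=True)
life_expect = life_expect.drop_duplicates()
# Summarize the cleaned dataset
life_expect.info()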


Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 1649 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                

## What do these results mean? 

#### The first section shows the number of missing values in each column of the dataset. 

1. Life expectancy: 10 missing values
2. Adult Mortality: 10 missing values
3. Alcohol: 194 missing values
4. Hepatitis B: 553 missing values
5. BMI: 34 missing values
6. Polio: 19 missing values
7. Total expenditure: 226 missing values
8. GDP: 448 missing values
9. Population: 652 missing values
10. Thinness 1-19 years: 34 missing values
11. Thinness 5-9 years: 34 missing values
12. Income composition of resources: 167 missing values
13. Schooling: 163 missing values

#### In conclusion, 1649 rows remain in the dataset after dropping rows with missing values and removing duplicates.
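
Because only 10 columns are used downstream, one option is to drop a row only when a value is missing in one of those columns, which may keep more data than dropping on all 22. The sketch below shows that idea under stated assumptions; it is not the notebook's method. `drop_missing` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def drop_missing(df: pd.DataFrame, columns=None):
    """Drop rows with missing values (optionally only in `columns`) and duplicates.

    Returns the cleaned frame and the number of rows removed.
    """
    before = len(df)
    cleaned = df.dropna(subset=columns).drop_duplicates()
    return cleaned, before - len(cleaned)


# Toy frame: one row is missing the target, another is missing only GDP
toy = pd.DataFrame({
    "Life expectancy": [65.0, np.nan, 59.9, 59.5],
    "Schooling": [10.1, 9.0, 10.0, 9.8],
    "GDP": [584.2, 600.0, np.nan, 669.9],
})

all_columns, removed_all = drop_missing(toy)
selected_only, removed_selected = drop_missing(toy, ["Life expectancy", "Schooling"])
print(removed_all, removed_selected)
```

On the toy frame, restricting the check to the columns of interest keeps the row whose only gap is in GDP. Whether the same trade-off is worthwhile on the real dataset depends on which columns the model ends up using.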

In [47]:
life_expect.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1649 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          1649 non-null   object 
 1   Year                             1649 non-null   int64  
 2   Status                           1649 non-null   object 
 3   Life expectancy                  1649 non-null   float64
 4   Adult Mortality                  1649 non-null   float64
 5   infant deaths                    1649 non-null   int64  
 6   Alcohol                          1649 non-null   float64
 7   percentage expenditure           1649 non-null   float64
 8   Hepatitis B                      1649 non-null   float64
 9   Measles                          1649 non-null   int64  
 10   BMI                             1649 non-null   float64
 11  under-five deaths                1649 non-null   int64  
 12  Polio                    

### Why use a Random Forest?
The model I will use for this classification problem is a random forest. It captures non-linear relationships well and reduces the risk of overfitting by combining the predictions of many decision trees, each trained on a random sample of the data.

- Handles complex relationships and interactions between features.
- Reduces overfitting and typically improves accuracy compared with a single decision tree.
- Automatically computes the importance of each feature.
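
To illustrate these points, here is a minimal sketch that fits scikit-learn's `RandomForestClassifier` and reads its feature importances. The data is synthetic and generated only for this example: the column names mirror the dataset, but the values, the formula for the synthetic life expectancy, and the hyperparameters are all assumptions rather than results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not drawn from the real dataset)
rng = np.random.default_rng(0)
n_rows = 300
X = pd.DataFrame({
    "Adult Mortality": rng.uniform(50, 500, n_rows),
    "Schooling": rng.uniform(4, 18, n_rows),
    "Alcohol": rng.uniform(0, 12, n_rows),
})

# Made-up relationship used only to produce labels for the demo
years = 95 - 0.1 * X["Adult Mortality"] + 0.5 * X["Schooling"] + rng.normal(0, 2, n_rows)
y = np.select([years > 80, years < 60], ["High", "Low"], default="Medium")

# Hold out a test set, keeping class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the forest; each tree sees a bootstrap sample of the training rows
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(f"Test accuracy: {accuracy:.2f}")
print(importances)
```

The `feature_importances_` attribute is what the last bullet refers to: it is computed during fitting and sums to 1 across features, so it gives a quick relative ranking of the inputs.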