# **Life Expectancy (WHO)**

The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website.



Among all categories of health-related factors, only the most representative critical factors were selected. Over the past 15 years the health sector has developed considerably, lowering human mortality rates, especially in developing countries, compared with the past 30 years. This project therefore uses data from 2000 to 2015 for 193 countries for further analysis.

# **Column Descriptions**


* Country - Country
* Year - Year
* Status - Developed or developing country status
* Life expectancy - Life expectancy in years
* Adult Mortality - Adult mortality rate for both sexes (probability of dying between 15 and 60 years per 1000 population)
* infant deaths - Number of infant deaths per 1000 population
* Alcohol - Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
* percentage expenditure - Expenditure on health as a percentage of Gross Domestic Product per capita (%)
* Hepatitis B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
* Measles - Measles, number of reported cases per 1000 population
* BMI - Average Body Mass Index of the entire population
* under-five deaths - Number of under-five deaths per 1000 population
* Polio - Polio (Pol3) immunization coverage among 1-year-olds (%)
* Total expenditure - General government expenditure on health as a percentage of total government expenditure (%)
* Diphtheria - Diphtheria, tetanus, and pertussis (DTP3) immunization coverage among 1-year-olds (%)
* HIV/AIDS - Deaths per 1000 live births due to HIV/AIDS (0-4 years)
* GDP - Gross Domestic Product per capita (in USD)
* Population - Population of the country
* thinness  1-19 years - Prevalence of thinness among children and adolescents aged 10 to 19 (%); the description covers ages 10 to 19 even though the column name says 1-19
* thinness 5-9 years - Prevalence of thinness among children aged 5 to 9 (%)
* Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
* Schooling - Number of years of schooling (years)




# **4. Data Processing**

In this step, you need to:
* Clean the incorrect columns;
* Handle null values in whatever way you consider valid;
* Handle duplicates;
* Convert categorical columns to numeric;
* Scale the numeric variables.
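
Before applying these steps, it can help to measure how much cleaning is actually needed. The sketch below runs on a small made-up DataFrame; the values and the `summarize_issues` helper are hypothetical and are not part of the dataset or of this notebook. It counts nulls per column, duplicate rows, and non-numeric columns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file (hypothetical values)
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "GDP": [100.0, 100.0, np.nan, 300.0],
})

def summarize_issues(frame: pd.DataFrame) -> dict:
    """Count the problems that the processing steps are meant to fix."""
    return {
        # Nulls per column
        "null_counts": frame.isnull().sum().to_dict(),
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns that still need an encoding step
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }

report = summarize_issues(toy)
print(report)
```

Running the same helper on `df` before cleaning and on `df_clean` afterwards would give a quick before/after comparison.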

In [390]:
import pandas as pd
import numpy as np

df = pd.read_csv("./expectancy.csv")

# Display floats without scientific notation (the '%.0f' format also hides all decimal places in the output)
pd.set_option('display.float_format', lambda x: '%.0f' % x)


In [391]:
# Drop the Unnamed: 0 index column and remove duplicate rows
df_clean = df.drop(columns=['Unnamed: 0'])\
             .drop_duplicates()

# Check that no duplicates remain
df_clean[df_clean.duplicated()]

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling


In [392]:
# Standardize column names to snake_case (lowercase; runs of '/', '-', and spaces replaced with "_")
df_clean.columns = df_clean.columns.str.lower().str.replace(r'[ /-]+', '_', regex=True)

df_clean.tail(5)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling
2933,Zimbabwe,2OO4,Developing,44,723,27,4,0,68,31,...,67,7,65,34,454,12777511,9,9,0,9
2934,Zimbabwe,2OO3,Developing,44,715,26,4,0,7,998,...,7,7,68,37,453,12633897,10,10,0,10
2935,Zimbabwe,2OO2,Developing,45,73,25,4,0,73,304,...,73,7,71,40,57,125525,1,1,0,10
2936,Zimbabwe,2OO1,Developing,45,686,25,2,0,76,529,...,76,6,75,42,549,12366165,2,2,0,10
2937,Zimbabwe,2OOO,Developing,46,665,24,2,0,79,1483,...,78,7,78,44,547,12222251,11,11,0,10


In [393]:
# Fix the letter 'O' typed in place of the digit 0 in the year column
df_clean['year'] = df_clean['year'].astype(str).str.replace('O', '0')

df_clean.tail(5)


Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling
2933,Zimbabwe,2004,Developing,44,723,27,4,0,68,31,...,67,7,65,34,454,12777511,9,9,0,9
2934,Zimbabwe,2003,Developing,44,715,26,4,0,7,998,...,7,7,68,37,453,12633897,10,10,0,10
2935,Zimbabwe,2002,Developing,45,73,25,4,0,73,304,...,73,7,71,40,57,125525,1,1,0,10
2936,Zimbabwe,2001,Developing,45,686,25,2,0,76,529,...,76,6,75,42,549,12366165,2,2,0,10
2937,Zimbabwe,2000,Developing,46,665,24,2,0,79,1483,...,78,7,78,44,547,12222251,11,11,0,10


In [394]:
# Convert the year column to integer
df_clean['year'] = df_clean['year'].astype(int)

df_clean.tail(5)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling
2933,Zimbabwe,2004,Developing,44,723,27,4,0,68,31,...,67,7,65,34,454,12777511,9,9,0,9
2934,Zimbabwe,2003,Developing,44,715,26,4,0,7,998,...,7,7,68,37,453,12633897,10,10,0,10
2935,Zimbabwe,2002,Developing,45,73,25,4,0,73,304,...,73,7,71,40,57,125525,1,1,0,10
2936,Zimbabwe,2001,Developing,45,686,25,2,0,76,529,...,76,6,75,42,549,12366165,2,2,0,10
2937,Zimbabwe,2000,Developing,46,665,24,2,0,79,1483,...,78,7,78,44,547,12222251,11,11,0,10


In [395]:
# Example: replace NaN with zero
df_clean['population'] = df_clean['population'].fillna(0)
df_clean['population'] = df_clean['population'].astype(int)

df_clean.tail(5)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling
2933,Zimbabwe,2004,Developing,44,723,27,4,0,68,31,...,67,7,65,34,454,12777511,9,9,0,9
2934,Zimbabwe,2003,Developing,44,715,26,4,0,7,998,...,7,7,68,37,453,12633897,10,10,0,10
2935,Zimbabwe,2002,Developing,45,73,25,4,0,73,304,...,73,7,71,40,57,125525,1,1,0,10
2936,Zimbabwe,2001,Developing,45,686,25,2,0,76,529,...,76,6,75,42,549,12366165,2,2,0,10
2937,Zimbabwe,2000,Developing,46,665,24,2,0,79,1483,...,78,7,78,44,547,12222251,11,11,0,10


In [396]:
# Standardize decimal places (round to whole numbers)
df_clean['percentage_expenditure'] = df_clean['percentage_expenditure'].round()
df_clean['gdp'] = df_clean['gdp'].round()

df_clean.tail(5)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling
2933,Zimbabwe,2004,Developing,44,723,27,4,0,68,31,...,67,7,65,34,454,12777511,9,9,0,9
2934,Zimbabwe,2003,Developing,44,715,26,4,0,7,998,...,7,7,68,37,453,12633897,10,10,0,10
2935,Zimbabwe,2002,Developing,45,73,25,4,0,73,304,...,73,7,71,40,57,125525,1,1,0,10
2936,Zimbabwe,2001,Developing,45,686,25,2,0,76,529,...,76,6,75,42,549,12366165,2,2,0,10
2937,Zimbabwe,2000,Developing,46,665,24,2,0,79,1483,...,78,7,78,44,547,12222251,11,11,0,10


In [397]:
# Replace NaN with 0 in numeric columns
for col in df_clean.columns:
    if df_clean[col].dtype in ['float64', 'int64']:
        df_clean[col] = df_clean[col].fillna(0)


#df_clean.info()
#df_clean.isnull().sum()

df_clean.tail(5)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,polio,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling
2933,Zimbabwe,2004,Developing,44,723,27,4,0,68,31,...,67,7,65,34,454,12777511,9,9,0,9
2934,Zimbabwe,2003,Developing,44,715,26,4,0,7,998,...,7,7,68,37,453,12633897,10,10,0,10
2935,Zimbabwe,2002,Developing,45,73,25,4,0,73,304,...,73,7,71,40,57,125525,1,1,0,10
2936,Zimbabwe,2001,Developing,45,686,25,2,0,76,529,...,76,6,75,42,549,12366165,2,2,0,10
2937,Zimbabwe,2000,Developing,46,665,24,2,0,79,1483,...,78,7,78,44,547,12222251,11,11,0,10
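
Filling every numeric gap with 0 is one valid choice, but a zero can read as a real measurement (a GDP of 0, for example). As an alternative sketch, which is not the approach used in this notebook, missing values can be filled with each country's median, falling back to the overall median when a country has no observed value. The toy data and the `gdp_filled` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "gdp": [100.0, np.nan, 300.0, np.nan, np.nan, 50.0],
})

# Median of each country's observed values, aligned to the original rows
country_median = toy.groupby("country")["gdp"].transform("median")

# Fill with the country median first, then with the overall median
# for countries that have no observed value at all (country B here)
toy["gdp_filled"] = toy["gdp"].fillna(country_median).fillna(toy["gdp"].median())

print(toy)
```

Whether this is preferable to zero-filling depends on the column and on the model, so it is worth comparing both.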


In [398]:
# Encode the status column manually

mapa_status = {
    'Developing': 0,
    'Developed': 1
}

df_clean['categoria_status'] = df_clean['status'].map(mapa_status)

df_clean.tail(5)


Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,total_expenditure,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling,categoria_status
2933,Zimbabwe,2004,Developing,44,723,27,4,0,68,31,...,7,65,34,454,12777511,9,9,0,9,0
2934,Zimbabwe,2003,Developing,44,715,26,4,0,7,998,...,7,68,37,453,12633897,10,10,0,10,0
2935,Zimbabwe,2002,Developing,45,73,25,4,0,73,304,...,7,71,40,57,125525,1,1,0,10,0
2936,Zimbabwe,2001,Developing,45,686,25,2,0,76,529,...,6,75,42,549,12366165,2,2,0,10,0
2937,Zimbabwe,2000,Developing,46,665,24,2,0,79,1483,...,7,78,44,547,12222251,11,11,0,10,0
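
One thing to watch with a manual dictionary and `Series.map`: any label that is not a key in the dictionary becomes NaN without raising an error. The sketch below uses a made-up series with a deliberately dirty label to show how to detect that case; the dirty value is hypothetical and may not occur in the real data.

```python
import pandas as pd

# Hypothetical labels, including one with different case and trailing space
status = pd.Series(["Developing", "Developed", "developed ", "Developing"])
mapa_status = {"Developing": 0, "Developed": 1}

encoded = status.map(mapa_status)

# Labels missing from the dictionary become NaN instead of raising an error
unmapped = status[encoded.isna()].unique().tolist()
print(unmapped)

# Normalizing whitespace and capitalization before mapping closes the gap
encoded_clean = status.str.strip().str.capitalize().map(mapa_status)
print(encoded_clean.tolist())
```

A quick `df_clean['categoria_status'].isna().sum()` after the mapping confirms whether every row was encoded.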


# **5. Feature Engineering**

This is where your imagination needs to flow. Create two or three new variables that relate existing ones, and explain how they could improve the model's ability to predict life expectancy.

In [399]:
# Feature: diphtheria vaccination coverage relative to GDP

# Interpretation:
# a country with high GDP and low vaccination coverage has low policy efficiency
# a country with low GDP and high vaccination coverage has high policy efficiency

df_clean['eficiencia_investimento'] = (
    df_clean['diphtheria'] / (df_clean['gdp'] + 1)
)
df_clean.tail(20)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,diphtheria,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling,categoria_status,eficiencia_investimento
2918,Zambia,2003,Developing,46,64,39,2,66,0,881,...,83,18,429,11421984,7,7,0,10,0,0
2919,Zambia,2002,Developing,46,69,41,2,54,0,25036,...,84,18,377,111249,7,7,0,10,0,0
2920,Zambia,2001,Developing,45,611,43,3,47,0,16997,...,85,19,378,1824125,7,7,0,10,0,0
2921,Zambia,2000,Developing,44,614,44,3,46,0,30930,...,85,19,342,1531221,8,8,0,10,0,0
2922,Zimbabwe,2015,Developing,67,336,22,0,0,87,0,...,87,6,119,15777451,6,6,1,10,0,1
2923,Zimbabwe,2014,Developing,59,371,23,6,11,91,0,...,91,6,127,15411675,6,6,0,10,0,1
2924,Zimbabwe,2013,Developing,58,399,25,6,11,95,0,...,95,7,111,155456,6,6,0,10,0,1
2925,Zimbabwe,2012,Developing,57,429,26,6,93,97,0,...,95,9,956,1471826,6,6,0,10,0,0
2926,Zimbabwe,2011,Developing,55,464,28,6,64,94,0,...,93,13,840,14386649,7,7,0,10,0,0
2927,Zimbabwe,2010,Developing,52,527,29,5,53,9,9696,...,89,16,714,1486317,7,7,0,10,0,0


In [400]:
# Feature: simplified HDI-style quality-of-life score combining three indicators: life_expectancy, schooling, and income_composition_of_resources.

# Unweighted average of the three main development indicators
df_clean['fator_bem_estar'] = (
    df_clean['income_composition_of_resources'] + # Income
    df_clean['schooling'] / df_clean['schooling'].max() + # Education (normalized by dividing by the column maximum)
    (df_clean['life_expectancy'] / 100) # Longevity (scaled to roughly 0-1)
) / 3
df_clean.tail(20)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,hiv_aids,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling,categoria_status,eficiencia_investimento,fator_bem_estar
2918,Zambia,2003,Developing,46,64,39,2,66,0,881,...,18,429,11421984,7,7,0,10,0,0,0
2919,Zambia,2002,Developing,46,69,41,2,54,0,25036,...,18,377,111249,7,7,0,10,0,0,0
2920,Zambia,2001,Developing,45,611,43,3,47,0,16997,...,19,378,1824125,7,7,0,10,0,0,0
2921,Zambia,2000,Developing,44,614,44,3,46,0,30930,...,19,342,1531221,8,8,0,10,0,0,0
2922,Zimbabwe,2015,Developing,67,336,22,0,0,87,0,...,6,119,15777451,6,6,1,10,0,1,1
2923,Zimbabwe,2014,Developing,59,371,23,6,11,91,0,...,6,127,15411675,6,6,0,10,0,1,1
2924,Zimbabwe,2013,Developing,58,399,25,6,11,95,0,...,7,111,155456,6,6,0,10,0,1,1
2925,Zimbabwe,2012,Developing,57,429,26,6,93,97,0,...,9,956,1471826,6,6,0,10,0,0,1
2926,Zimbabwe,2011,Developing,55,464,28,6,64,94,0,...,13,840,14386649,7,7,0,10,0,0,0
2927,Zimbabwe,2010,Developing,52,527,29,5,53,9,9696,...,16,714,1486317,7,7,0,10,0,0,0


In [401]:
# Feature: mean life_expectancy per country across the whole dataset
# Gives a per-country baseline against which each year's life expectancy can be compared.
media_do_pais = df_clean.groupby('country')['life_expectancy'].transform('mean')

df_clean['media_expectativa_vida_pais'] = media_do_pais
df_clean.tail(20)

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling,categoria_status,eficiencia_investimento,fator_bem_estar,media_expectativa_vida_pais
2918,Zambia,2003,Developing,46,64,39,2,66,0,881,...,429,11421984,7,7,0,10,0,0,0,54
2919,Zambia,2002,Developing,46,69,41,2,54,0,25036,...,377,111249,7,7,0,10,0,0,0,54
2920,Zambia,2001,Developing,45,611,43,3,47,0,16997,...,378,1824125,7,7,0,10,0,0,0,54
2921,Zambia,2000,Developing,44,614,44,3,46,0,30930,...,342,1531221,8,8,0,10,0,0,0,54
2922,Zimbabwe,2015,Developing,67,336,22,0,0,87,0,...,119,15777451,6,6,1,10,0,1,1,50
2923,Zimbabwe,2014,Developing,59,371,23,6,11,91,0,...,127,15411675,6,6,0,10,0,1,1,50
2924,Zimbabwe,2013,Developing,58,399,25,6,11,95,0,...,111,155456,6,6,0,10,0,1,1,50
2925,Zimbabwe,2012,Developing,57,429,26,6,93,97,0,...,956,1471826,6,6,0,10,0,0,1,50
2926,Zimbabwe,2011,Developing,55,464,28,6,64,94,0,...,840,14386649,7,7,0,10,0,0,0,50
2927,Zimbabwe,2010,Developing,52,527,29,5,53,9,9696,...,714,1486317,7,7,0,10,0,0,0,50
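
`transform('mean')` returns one value per row, so every row of a country carries the same average. To express whether a given year sits above or below that average, one option is to subtract the two columns. The sketch below uses made-up values, and the `desvio_da_media` column name is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "life_expectancy": [50.0, 54.0, 70.0, 74.0],
})

# Same mean repeated on each row of the country
toy["media_expectativa_vida_pais"] = (
    toy.groupby("country")["life_expectancy"].transform("mean")
)

# Positive: the year is above the country's average; negative: below it
toy["desvio_da_media"] = toy["life_expectancy"] - toy["media_expectativa_vida_pais"]

print(toy)
```

Both columns are derived from `life_expectancy`, which is the target, so in a modeling setup they would likely need to be computed from training rows only to avoid leaking the answer into the predictors.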


In [402]:
# Scaling is done after feature creation so that it does not interfere with the features

from sklearn.preprocessing import StandardScaler

# CONTINUOUS numeric columns to scale,
# EXCLUDING the categoria_status column created above, the year column, and life_expectancy, which is the target column.

colunas_para_escalar = [
    'adult_mortality', 'infant_deaths', 'alcohol', 
    'percentage_expenditure', 'hepatitis_b', 'measles', 'bmi', 
    'under_five_deaths', 'polio', 'total_expenditure', 'diphtheria', 
    'hiv_aids', 'gdp', 'population', 'thinness_1_19_years', 
    'thinness_5_9_years', 'income_composition_of_resources', 'schooling',
    'media_expectativa_vida_pais', 'eficiencia_investimento','fator_bem_estar'
]

# 1. Instantiate the scaler
scaler = StandardScaler()  # for a regression model

# 2. FIT (learn each column's mean and standard deviation) and TRANSFORM (apply the scaling)
# Work on a copy so that the unscaled values in df_clean are preserved
df_escalonado = df_clean.copy()
dados_escalonados = scaler.fit_transform(df_escalonado[colunas_para_escalar])

# 3. Overwrite the selected columns of the copy with the scaled values
df_escalonado[colunas_para_escalar] = dados_escalonados

print("Standardization (StandardScaler) applied.")
print("Mean and standard deviation of the scaled columns should be ~0 and ~1.")
print(df_escalonado[colunas_para_escalar].describe().loc[['mean', 'std']])
# describe() returns all summary statistics; loc keeps only the mean and the standard deviation
df_escalonado

Standardization (StandardScaler) applied.
Mean and standard deviation of the scaled columns should be ~0 and ~1.
      adult_mortality  infant_deaths  alcohol  percentage_expenditure  \
mean                0             -0        0                       0   
std                 1              1        1                       1   

      hepatitis_b  measles  bmi  under_five_deaths  polio  total_expenditure  \
mean           -0       -0   -0                  0     -0                 -0   
std             1        1    1                  1      1                  1   

      ...  hiv_aids  gdp  population  thinness_1_19_years  thinness_5_9_years  \
mean  ...        -0   -0           0                    0                  -0   
std   ...         1    1           1                    1                   1   

      income_composition_of_resources  schooling  media_expectativa_vida_pais  \
mean                                0          0                            0   
std                         

Unnamed: 0,country,year,status,life_expectancy,adult_mortality,infant_deaths,alcohol,percentage_expenditure,hepatitis_b,measles,...,gdp,population,thinness_1_19_years,thinness_5_9_years,income_composition_of_resources,schooling,categoria_status,eficiencia_investimento,fator_bem_estar,media_expectativa_vida_pais
0,Afghanistan,2015,Developing,65,1,0,-1,-0,-0,-0,...,-0,0,3,3,-0,-0,0,-0,-0,-1
1,Afghanistan,2014,Developing,60,1,0,-1,-0,-0,-0,...,-0,-0,3,3,-0,-0,0,-0,-1,-1
2,Afghanistan,2013,Developing,60,1,0,-1,-0,-0,-0,...,-0,0,3,3,-0,-0,0,-0,-1,-1
3,Afghanistan,2012,Developing,60,1,0,-1,-0,0,0,...,-0,-0,3,3,-1,-0,0,-0,-1,-1
4,Afghanistan,2011,Developing,59,1,0,-1,-0,0,0,...,-0,-0,3,3,-1,-0,0,-0,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44,4,-0,0,-0,0,-0,...,-0,0,1,1,-1,-0,0,-0,-1,-2
2934,Zimbabwe,2003,Developing,44,4,-0,-0,-0,-2,-0,...,-0,0,1,1,-1,-0,0,-0,-1,-2
2935,Zimbabwe,2002,Developing,45,-1,-0,0,-0,0,-0,...,-0,-0,-1,-1,-1,-0,0,-0,-1,-2
2936,Zimbabwe,2001,Developing,45,4,-0,-1,-0,0,-0,...,-0,0,-1,-1,-1,-0,0,-0,-1,-2
