# Census Dataset
**Predict whether income exceeds US$ 50K/year based on census data. Also known as the "Census Income" dataset.**

The extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`

The prediction task is to determine whether a person makes more than 50K a year.

https://archive.ics.uci.edu/ml/datasets/adult

# #3 - Feature Selection

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

import pickle as pkl
import pandas as pd
import numpy as np
import math
path = '/content/drive/MyDrive/Machine Learning e Data Science com Python/Projeto - Census/'

In [4]:
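# Use the fully qualified option names; the short forms can be ambiguous in recent pandas versions
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 150)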

## Loading the preprocessed dataset

In [5]:
with open(path+'census_data.pkl', 'rb') as f:
  X_train, X_test, y_train, y_test = pkl.load(f)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((22792, 17), (9769, 17), (22792,), (9769,))

In [6]:
X_train.head()

Unnamed: 0,workclass,marital-status,occupation,native-country,age,final-weight,education-num,capital-gain,capital-loos,hour-per-week,race_1,race_2,race_3,race_4,race_5,sex_1,sex_2
0,0.218703,0.045307,0.062436,0.247893,0.054795,0.291932,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.297915,0.099617,0.139279,0.247893,0.520548,0.104189,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,0.218703,0.045307,0.062436,0.247893,0.136986,0.142369,0.533333,0.0,0.0,0.397959,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.218703,0.451339,0.225772,0.247893,0.465753,0.164305,0.8,0.0,0.0,0.193878,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,0.218703,0.045307,0.265051,0.247893,0.013699,0.172506,0.466667,0.0,0.0,0.346939,0.0,0.0,0.0,0.0,1.0,1.0,0.0


## RFE
Recursive Feature Elimination (RFE) wraps an estimator that must provide a way to compute feature importance scores, as a decision tree does. The algorithm used inside RFE does not have to be the algorithm that is later fit on the selected features; different algorithms can be used.  

Once configured, the class must be fit on a training dataset to select the features by calling the `fit()` function. After fitting, the chosen input variables can be inspected through the `support_` attribute, which provides a True or False for each input variable.  
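
To make the `fit()` and `support_` description concrete, here is a minimal sketch on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, and `rfe_demo` are assumptions for illustration only; they are not part of this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the census data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Keep 3 of the 8 features, ranked by the tree's feature_importances_
rfe_demo = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)  # boolean mask, True for each kept feature
print(rfe_demo.ranking_)  # kept features are assigned rank 1
```

With the default `step=1`, one feature is dropped per iteration, so the eliminated features receive ranks 2 and above in reverse order of elimination.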

## RFECV
It is also possible to select the number of features chosen by RFE automatically.

This can be achieved by cross-validating different numbers of features and automatically selecting the number of features that gave the best mean score.

The RFECV class implements this for us.

*Source: https://machinelearningmastery.com/rfe-feature-selection-in-python/*
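
As a rough sketch of what RFECV does, the example below lets cross-validation pick the number of features on synthetic data. The data, the 5-fold `StratifiedKFold`, and the `*_demo` names are assumptions for illustration; this notebook itself uses plain RFE with a manual loop over feature counts.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data (an assumption for illustration, not the census data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

rfecv_demo = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                          # drop one feature per iteration
    cv=StratifiedKFold(n_splits=5),  # folds used to score each feature count
    scoring='f1',
    min_features_to_select=1,
)
rfecv_demo.fit(X_demo, y_demo)

print(rfecv_demo.n_features_)  # number of features chosen by cross-validation
print(rfecv_demo.support_)     # boolean mask of the chosen features
```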

## Getting the list of models to evaluate
**LIMITATIONS**: KNeighborsClassifier, MLPClassifier, and SVC (except with a linear kernel) do not expose the `coef_` or `feature_importances_` attributes. As a result, feature selection with RFE is not possible with these algorithms.
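
The limitation can be checked directly by fitting each estimator and looking for either attribute. This is a small sketch on synthetic data; the helper `exposes_importance` and the dictionary names are hypothetical and exist only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the census data)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

def exposes_importance(estimator):
    """Fit the estimator and report whether it exposes coef_ or feature_importances_."""
    fitted = estimator.fit(X_demo, y_demo)
    return hasattr(fitted, 'coef_') or hasattr(fitted, 'feature_importances_')

candidates = {
    'svm-rbf': SVC(kernel='rbf'),
    'svm-lin': SVC(kernel='linear'),
    'knn': KNeighborsClassifier(),
    'tree': DecisionTreeClassifier(random_state=0),
    'log-reg': LogisticRegression(max_iter=1000),
}
compat = {name: exposes_importance(est) for name, est in candidates.items()}
print(compat)
```

Estimators that report `False` here are the ones that RFE cannot rank features with by default.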

In [7]:
list_models = [
      ('svm-lin', SVC(kernel = 'linear', class_weight = 'balanced'))
    , ('tree', DecisionTreeClassifier(class_weight = 'balanced'))
    , ('forest', RandomForestClassifier(class_weight = 'balanced'))
    , ('log-reg', LogisticRegression(class_weight = 'balanced'))
]

Let's evaluate the performance of the models using from 8 to 17 features.

In [8]:
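models = {}
# One pipeline per (model, number of features): RFE selects i features,
# then the same kind of model is fit on the selected features
for i in range(8, X_train.shape[1]+1):
  for mdl_nm, mdl in list_models:
    rfe = RFE(estimator = mdl, n_features_to_select = i)
    models[f'{mdl_nm}_{i}'] = Pipeline(steps=[('s',rfe),('m',mdl)])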

- Using **cross-validation**, we split the training data into training and validation folds 10 times
- We use the F1 metric, which is better suited to an imbalanced dataset. It is the harmonic mean of precision and recall
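
To illustrate the harmonic mean mentioned above, the sketch below computes F1 by hand and compares it with scikit-learn's `f1_score`. The labels in `y_true` and `y_pred` are made-up toy values, not results from this dataset.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (an assumption for illustration): 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP=2, FP=2 -> 2/4
recall = recall_score(y_true, y_pred)        # TP=2, FN=1 -> 2/3

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```

Both values equal 4/7 (about 0.571), which is lower than the arithmetic mean of the two metrics because the harmonic mean penalizes the weaker of precision and recall.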

In [9]:
seed = 14
model_scores = []

for mdl_nm, mdl_pipeline in models.items():
  kfold = KFold(n_splits = 10, shuffle = True, random_state = seed)
  score = cross_val_score(mdl_pipeline, X_train, y_train, cv = kfold, n_jobs=4, error_score='raise', scoring = 'f1')
  model_scores.append((mdl_nm, score.mean()))

In [11]:
data_df = []
for m_f, f1 in model_scores:
  # Keys look like '<model>_<num_features>'; split on the last underscore only
  m, f = m_f.rsplit('_', 1)
  data_df.append((m, int(f), f1))
rfe_results = pd.DataFrame(data_df, columns = ['model', 'num_features', 'f1_score'])

In [14]:
rfe_results.sort_values(by=['f1_score'], ascending = False, inplace = True)
rfe_results

Unnamed: 0,model,num_features,f1_score
26,forest,14,0.687561
38,forest,17,0.687246
22,forest,13,0.687043
14,forest,11,0.686904
34,forest,16,0.685145
18,forest,12,0.684771
30,forest,15,0.683101
10,forest,10,0.68227
6,forest,9,0.679127
7,log-reg,9,0.675951


## Checking the features selected in the best result of each algorithm
The dataset will be adjusted according to each model's performance. If feature selection is not possible with a specific algorithm, that algorithm receives the original dataset as transformed in step #2.

Best feature-selection result for each algorithm

In [64]:
pd.concat([rfe_results[ rfe_results['model'] == 'forest'].head(1),
           rfe_results[ rfe_results['model'] == 'log-reg'].head(1),
           rfe_results[ rfe_results['model'] == 'svm-lin'].head(1),
           rfe_results[ rfe_results['model'] == 'tree'].head(1)], ignore_index=True)

Unnamed: 0,model,num_features,f1_score
0,forest,14,0.687561
1,log-reg,9,0.675951
2,svm-lin,10,0.659894
3,tree,13,0.623232
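
As an alternative sketch, the best row per model can also be picked with `groupby` and `idxmax` instead of concatenating one filter per model. The small `sample` frame below reuses a few rows from the results shown above; the names `sample`, `best_idx`, and `best_per_model` are hypothetical.

```python
import pandas as pd

# A few rows copied from the results table above, for illustration
sample = pd.DataFrame({
    'model': ['forest', 'forest', 'log-reg', 'svm-lin', 'tree'],
    'num_features': [14, 17, 9, 10, 13],
    'f1_score': [0.687561, 0.687246, 0.675951, 0.659894, 0.623232],
})

# Index label of the highest f1_score within each model group
best_idx = sample.groupby('model')['f1_score'].idxmax()
best_per_model = sample.loc[best_idx].reset_index(drop=True)
print(best_per_model)
```

This form does not need to be edited when a model is added to or removed from the list.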


In [28]:
rfe_forest  = RFE(estimator=RandomForestClassifier(class_weight = 'balanced'), n_features_to_select=14)
rfe_log_reg = RFE(estimator=LogisticRegression(class_weight = 'balanced')    , n_features_to_select=9)
rfe_svm_lin = RFE(estimator=SVC(kernel = 'linear', class_weight = 'balanced'), n_features_to_select=10)
rfe_tree    = RFE(estimator=DecisionTreeClassifier(class_weight = 'balanced'), n_features_to_select=13)

In [None]:
rfe_forest.fit(X_train, y_train.values)
rfe_log_reg.fit(X_train, y_train.values)
rfe_svm_lin.fit(X_train, y_train.values)
rfe_tree.fit(X_train, y_train.values)

**Support:** The mask of selected features.  
**Ranking**: The feature ranking; selected features are assigned rank 1.
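
A minimal sketch of how the two attributes map back to column names, assuming a DataFrame with plain string column labels. The synthetic data and the `feat_*` column names are hypothetical and only serve this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical column names (not the census data)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

# One row per feature: kept or not, and its elimination rank
summary = pd.DataFrame({
    'feature': X_demo.columns,
    'support': rfe_demo.support_,
    'ranking': rfe_demo.ranking_,
}).sort_values('ranking')

# The boolean mask can index the column labels directly
selected = X_demo.columns[rfe_demo.support_].tolist()
print(summary)
print(selected)
```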

In [47]:
sup_rank = []
for i in range(X_train.shape[1]):
  sup_rank.append(('forest', X_train.columns[i][0], rfe_forest.support_[i], rfe_forest.ranking_[i]))
  sup_rank.append(('log-reg', X_train.columns[i][0], rfe_log_reg.support_[i], rfe_log_reg.ranking_[i]))
  sup_rank.append(('svm-lin', X_train.columns[i][0], rfe_svm_lin.support_[i], rfe_svm_lin.ranking_[i]))
  sup_rank.append(('tree', X_train.columns[i][0], rfe_tree.support_[i], rfe_tree.ranking_[i]))

In [48]:
support_ranking = pd.DataFrame(sup_rank, columns = ['model', 'feature', 'support', 'ranking'])

Viewing the Support and Ranking results for the Random Forest algorithm

In [49]:
support_ranking[ support_ranking['model'] == 'forest' ]

Unnamed: 0,model,feature,support,ranking
0,forest,workclass,True,1
4,forest,marital-status,True,1
8,forest,occupation,True,1
12,forest,native-country,True,1
16,forest,age,True,1
20,forest,final-weight,True,1
24,forest,education-num,True,1
28,forest,capital-gain,True,1
32,forest,capital-loos,True,1
36,forest,hour-per-week,True,1


Viewing the Support and Ranking results for the Logistic Regression algorithm

In [50]:
support_ranking[ support_ranking['model'] == 'log-reg' ]

Unnamed: 0,model,feature,support,ranking
1,log-reg,workclass,False,2
5,log-reg,marital-status,True,1
9,log-reg,occupation,True,1
13,log-reg,native-country,True,1
17,log-reg,age,True,1
21,log-reg,final-weight,True,1
25,log-reg,education-num,True,1
29,log-reg,capital-gain,True,1
33,log-reg,capital-loos,True,1
37,log-reg,hour-per-week,True,1


Viewing the Support and Ranking results for the Linear SVM algorithm

In [51]:
support_ranking[ support_ranking['model'] == 'svm-lin' ]

Unnamed: 0,model,feature,support,ranking
2,svm-lin,workclass,True,1
6,svm-lin,marital-status,True,1
10,svm-lin,occupation,True,1
14,svm-lin,native-country,True,1
18,svm-lin,age,True,1
22,svm-lin,final-weight,True,1
26,svm-lin,education-num,True,1
30,svm-lin,capital-gain,True,1
34,svm-lin,capital-loos,True,1
38,svm-lin,hour-per-week,True,1


Viewing the Support and Ranking results for the Decision Tree algorithm

In [52]:
support_ranking[ support_ranking['model'] == 'tree' ]

Unnamed: 0,model,feature,support,ranking
3,tree,workclass,True,1
7,tree,marital-status,True,1
11,tree,occupation,True,1
15,tree,native-country,True,1
19,tree,age,True,1
23,tree,final-weight,True,1
27,tree,education-num,True,1
31,tree,capital-gain,True,1
35,tree,capital-loos,True,1
39,tree,hour-per-week,True,1


## Adjusting the predictor-feature dataset for each classifier

In [54]:
columns_forest  = support_ranking[ (support_ranking['model'] == 'forest') & (support_ranking['support'] == True) ]['feature'].values
columns_log_reg = support_ranking[ (support_ranking['model'] == 'log-reg') & (support_ranking['support'] == True) ]['feature'].values
columns_svm_lin = support_ranking[ (support_ranking['model'] == 'svm-lin') & (support_ranking['support'] == True) ]['feature'].values
columns_tree    = support_ranking[ (support_ranking['model'] == 'tree') & (support_ranking['support'] == True) ]['feature'].values

In [62]:
X_train_forest, X_test_forest   = X_train[columns_forest], X_test[columns_forest]
X_train_log_reg, X_test_log_reg = X_train[columns_log_reg], X_test[columns_log_reg]
X_train_svm_lin, X_test_svm_lin = X_train[columns_svm_lin], X_test[columns_svm_lin]
X_train_tree, X_test_tree       = X_train[columns_tree], X_test[columns_tree]

In [68]:
# Save next to the input file (same Drive folder used when loading census_data.pkl)
with open(path+'census_data_2.pkl', 'wb') as f:
  pkl.dump([X_train, X_test,
            X_train_forest, X_test_forest,
            X_train_log_reg, X_test_log_reg,
            X_train_svm_lin, X_test_svm_lin,
            X_train_tree, X_test_tree,
            y_train, y_test], f)