# Predicting income

The [Adult Census Income](https://archive.ics.uci.edu/ml/datasets/Adult) dataset is a public dataset drawn from the 1994 US census that is widely used for studying machine learning models.

This dataset contains attributes such as education, sex, country of origin, and age. The task is to predict, from an individual's characteristics, whether their annual income is greater than $50,000 or at most $50,000.

The model developed here is not intended to achieve the highest possible accuracy, and we do not encourage its use for any purpose other than study. It serves only to illustrate how to test a machine learning model.

In [1]:
from pathlib import Path
import joblib
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

In [2]:
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 
            'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 
            'hours_per_week', 'native_country', 'income']
dataset_train = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", 
                            header=None, names=columns)
dataset_test = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", 
                            header=None, names=columns, skiprows=1)
dataset_train.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [3]:
dataset_train.shape

(32561, 15)

In [4]:
dataset_test.shape

(16281, 15)
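
The raw file separates fields with a comma followed by a space, so with the default `read_csv` settings every string value keeps a leading space. That is why the preprocessing below compares against values such as `' United-States'`. The sketch below illustrates the effect with the standard-library `csv` module, using the first training row as sample text. `skipinitialspace` is an option of both `csv.reader` and `pandas.read_csv`, but the notebook itself does not use it, so treat this as an illustration rather than part of the pipeline.

```python
import csv
import io

# First row of adult.data as it appears in the raw file (fields separated by ", ").
raw_line = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

# Default parsing keeps the space that follows each comma.
default_row = next(csv.reader(io.StringIO(raw_line)))

# skipinitialspace=True drops whitespace immediately after the delimiter.
stripped_row = next(csv.reader(io.StringIO(raw_line), skipinitialspace=True))

print(repr(default_row[13]))   # ' United-States'
print(repr(stripped_row[13]))  # 'United-States'
```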

## Preprocessing

In [5]:
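# Country groups used to build the native_* indicator columns.
# Values keep the leading space present in the raw file.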
latin_american_countries = [' Cuba', ' Jamaica', ' Mexico',
       ' Puerto-Rico', ' Honduras', ' Haiti', 
       ' El-Salvador', ' Guatemala', ' Nicaragua',
       ' Columbia', ' Ecuador', ' Peru', ' Dominican-Republic']

# Note: ' Canada' is grouped with the European countries here.
european_countries = [' England', ' Canada',
       ' Germany', ' Italy', ' Poland',' Portugal', 
       ' France', ' Yugoslavia', ' Scotland',
       ' Greece', ' Ireland', ' Hungary', ' Holand-Netherlands']

# Note: ' Yugoslavia' also appears in european_countries, so those rows set both flags.
asian_countries = [' India', ' Iran', ' Philippines', 
       ' Cambodia', ' Thailand', ' Laos',
       ' Taiwan', ' China', ' Japan', ' Yugoslavia', ' Vietnam', ' Hong']

input_columns = ['native_usa', 'native_latin_america', 'native_europe', 'native_asia',
                'is_male', 'is_married', 'hours_per_week', 'age', 'education_num',
                'capital_gain', 'capital_loss']
X_train = (
    dataset_train.assign(native_usa=lambda _df: (_df['native_country']==' United-States').astype(int),
                            native_latin_america=lambda _df: (_df['native_country'].isin(latin_american_countries)).astype(int),
                            native_europe=lambda _df: (_df['native_country'].isin(european_countries)).astype(int),
                            native_asia=lambda _df: (_df['native_country'].isin(asian_countries)).astype(int))
                     .assign(is_married=lambda _df: _df['marital_status'].fillna('?').str.startswith(' Married').astype(int))
                     .assign(is_male=lambda _df: (_df['sex'] == " Male").astype(int))
                     .loc[:,input_columns]
)

y_train = (
    dataset_train['income'].dropna()
                            .map({' <=50K': 0, ' >50K': 1, ' <=50K.': 0, ' >50K.': 1})
                            .astype(int)
                            .rename("income>50k")
)

X_test = (
    dataset_test.assign(native_usa=lambda _df: (_df['native_country']==' United-States').astype(int),
                            native_latin_america=lambda _df: (_df['native_country'].isin(latin_american_countries)).astype(int),
                            native_europe=lambda _df: (_df['native_country'].isin(european_countries)).astype(int),
                            native_asia=lambda _df: (_df['native_country'].isin(asian_countries)).astype(int))
                     .assign(is_married=lambda _df: _df['marital_status'].fillna('?').str.startswith(' Married').astype(int))
                     .assign(is_male=lambda _df: (_df['sex'] == " Male").astype(int))
                     .loc[:,input_columns]
)

y_test = (
    dataset_test['income'].dropna()
                            .map({' <=50K': 0, ' >50K': 1, ' <=50K.': 0, ' >50K.': 1})
                            .astype(int)
                            .rename("income>50k")
)

X_train.head()

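| | native_usa | native_latin_america | native_europe | native_asia | is_male | is_married | hours_per_week | age | education_num | capital_gain | capital_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 40 | 39 | 13 | 2174 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 13 | 50 | 13 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 40 | 38 | 9 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 1 | 40 | 53 | 7 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 40 | 28 | 13 | 0 | 0 |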
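
When writing test cases it can help to encode one raw record at a time. The sketch below mirrors the column logic of the cell above in plain Python. `encode_record` is a hypothetical helper that is not part of the notebook, and the country sets are copied as they appear in the cell above, leading spaces included.

```python
# Country groups copied from the preprocessing cell (values keep the leading space).
LATIN_AMERICAN = {' Cuba', ' Jamaica', ' Mexico', ' Puerto-Rico', ' Honduras', ' Haiti',
                  ' El-Salvador', ' Guatemala', ' Nicaragua', ' Columbia', ' Ecuador',
                  ' Peru', ' Dominican-Republic'}
EUROPEAN = {' England', ' Canada', ' Germany', ' Italy', ' Poland', ' Portugal',
            ' France', ' Yugoslavia', ' Scotland', ' Greece', ' Ireland', ' Hungary',
            ' Holand-Netherlands'}
ASIAN = {' India', ' Iran', ' Philippines', ' Cambodia', ' Thailand', ' Laos',
         ' Taiwan', ' China', ' Japan', ' Yugoslavia', ' Vietnam', ' Hong'}


def encode_record(record):
    """Encode one raw record (a dict) into the 11 model features, in model order."""
    country = record['native_country']
    return {
        'native_usa': int(country == ' United-States'),
        'native_latin_america': int(country in LATIN_AMERICAN),
        'native_europe': int(country in EUROPEAN),
        'native_asia': int(country in ASIAN),
        'is_male': int(record['sex'] == ' Male'),
        'is_married': int(record['marital_status'].startswith(' Married')),
        'hours_per_week': record['hours_per_week'],
        'age': record['age'],
        'education_num': record['education_num'],
        'capital_gain': record['capital_gain'],
        'capital_loss': record['capital_loss'],
    }


# Training row 4, taken from the dataset preview earlier in the notebook.
row_4 = {'native_country': ' Cuba', 'marital_status': ' Married-civ-spouse',
         'sex': ' Female', 'hours_per_week': 40, 'age': 28, 'education_num': 13,
         'capital_gain': 0, 'capital_loss': 0}

encoded = encode_record(row_4)
print(list(encoded.values()))  # [0, 1, 0, 0, 0, 1, 40, 28, 13, 0, 0]
```

The printed values match row 4 of `X_train.head()`.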


In [6]:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

print(f"\ny_train (% positive): {y_train.mean()*100:.2f} %")
print(f"y_test (% positive): {y_test.mean()*100:.2f} %")

X_train shape: (32561, 11)
X_test shape: (16281, 11)

y_train (% positive): 24.08 %
y_test (% positive): 23.62 %


In [7]:
RANDOM_STATE = 42

params = {
    'n_estimators': [60, 80, 100, 120],
    'max_depth': [5, 6, 7, 8],
    'min_samples_leaf': [10, 20, 30, 40, 50]
}

rf = RandomForestClassifier(random_state=RANDOM_STATE)

gs = GridSearchCV(rf, param_grid=params, scoring='roc_auc', n_jobs=-1, cv=5, verbose=3)
gs.fit(X_train, y_train)


Fitting 5 folds for each of 80 candidates, totalling 400 fits


GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [5, 6, 7, 8],
                         'min_samples_leaf': [10, 20, 30, 40, 50],
                         'n_estimators': [60, 80, 100, 120]},
             scoring='roc_auc', verbose=3)
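
The "80 candidates, totalling 400 fits" line in the log follows directly from the grid: every combination of the three parameter lists is one candidate, and each candidate is fitted once per fold. A small sketch in plain Python that reproduces the count (the variable names `cv_folds` and `candidates` are only illustrative):

```python
from itertools import product

# Same grid as in the cell above.
params = {
    'n_estimators': [60, 80, 100, 120],
    'max_depth': [5, 6, 7, 8],
    'min_samples_leaf': [10, 20, 30, 40, 50],
}
cv_folds = 5

# One candidate per combination of parameter values.
candidates = list(product(*params.values()))

print(len(candidates))             # 80
print(len(candidates) * cv_folds)  # 400
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set, which is the estimator exposed as `best_estimator_`.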

In [8]:
best_rf = gs.best_estimator_
best_rf

RandomForestClassifier(max_depth=8, min_samples_leaf=10, random_state=42)

In [9]:
gs.best_score_

0.9091742074898939

In [10]:
pd.DataFrame(gs.cv_results_).sort_values('rank_test_score', ascending=True).head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_leaf | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 62 | 2.334368 | 0.146033 | 0.151308 | 0.00659 | 8 | 10 | 100 | {'max_depth': 8, 'min_samples_leaf': 10, 'n_es... | 0.905524 | 0.906158 | 0.911717 | 0.910583 | 0.911889 | 0.909174 | 0.002766 | 1 |
| 63 | 2.847383 | 0.304467 | 0.177938 | 0.016454 | 8 | 10 | 120 | {'max_depth': 8, 'min_samples_leaf': 10, 'n_es... | 0.905317 | 0.905902 | 0.911882 | 0.910706 | 0.911839 | 0.909129 | 0.002911 | 2 |
| 61 | 1.96701 | 0.2313 | 0.123312 | 0.007473 | 8 | 10 | 80 | {'max_depth': 8, 'min_samples_leaf': 10, 'n_es... | 0.905432 | 0.906058 | 0.911683 | 0.91038 | 0.91163 | 0.909036 | 0.002735 | 3 |
| 60 | 1.440918 | 0.091439 | 0.098033 | 0.014043 | 8 | 10 | 60 | {'max_depth': 8, 'min_samples_leaf': 10, 'n_es... | 0.905579 | 0.905961 | 0.911286 | 0.91029 | 0.911411 | 0.908905 | 0.002592 | 4 |
| 67 | 2.662778 | 0.045478 | 0.167609 | 0.009838 | 8 | 20 | 120 | {'max_depth': 8, 'min_samples_leaf': 20, 'n_es... | 0.904365 | 0.903958 | 0.911039 | 0.908755 | 0.911295 | 0.907882 | 0.003167 | 5 |
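
`mean_test_score` and `std_test_score` summarize the per-fold scores: the first is their mean and the second their population standard deviation. A quick check in plain Python, using the fold scores of the best-ranked row above (they are rounded to six decimals there, so the match is approximate):

```python
from statistics import fmean, pstdev

# split0..split4 test scores of the best-ranked candidate (index 62).
fold_scores = [0.905524, 0.906158, 0.911717, 0.910583, 0.911889]

mean_score = fmean(fold_scores)
std_score = pstdev(fold_scores)  # population standard deviation (divides by n)

print(f"{mean_score:.6f}")  # 0.909174
print(f"{std_score:.6f}")   # 0.002766
```

The mean agrees with `gs.best_score_`, and the small spread across folds suggests the ranking of the top candidates is driven by differences in the fourth decimal place.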


## Evaluating the model

In [11]:
pred_train = best_rf.predict(X_train)
proba_train = best_rf.predict_proba(X_train)[:,1]

pred_test = best_rf.predict(X_test)
proba_test = best_rf.predict_proba(X_test)[:,1]
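
The second column of `predict_proba` is the score passed to `roc_auc_score`. As a reminder of what that metric measures, the sketch below computes it in plain Python as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. The labels and scores are made-up toy values, not outputs of the model, and `pairwise_auc` is a hypothetical helper written only for illustration.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # 1 point when the positive scores higher, 0.5 for a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

This quadratic-time version is only meant to make the definition concrete; it depends on the ordering of the scores, not on any classification threshold.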

### Training


In [12]:
print(f"ROC-AUC: {roc_auc_score(y_train, proba_train):.4f}")
print(classification_report(y_train, pred_train))

ROC-AUC: 0.9135
              precision    recall  f1-score   support

           0       0.87      0.96      0.91     24720
           1       0.80      0.54      0.64      7841

    accuracy                           0.86     32561
   macro avg       0.84      0.75      0.78     32561
weighted avg       0.85      0.86      0.85     32561



### Test

In [13]:
print(f"ROC-AUC: {roc_auc_score(y_test, proba_test):.4f}")
print(classification_report(y_test, pred_test))

ROC-AUC: 0.9086
              precision    recall  f1-score   support

           0       0.87      0.96      0.91     12435
           1       0.80      0.53      0.64      3846

    accuracy                           0.86     16281
   macro avg       0.83      0.74      0.77     16281
weighted avg       0.85      0.86      0.85     16281
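
The figures in the report are related to each other: the F1 score is the harmonic mean of precision and recall, and the weighted average weights each class by its support. A short sketch that re-derives two of the test-set values from the rounded numbers printed above (because the inputs are rounded, only agreement to two decimals should be expected):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the test report.
f1_pos = f1(0.80, 0.53)

support = {0: 12435, 1: 3846}
f1_by_class = {0: 0.91, 1: 0.64}
weighted_f1 = sum(f1_by_class[c] * n for c, n in support.items()) / sum(support.values())

print(round(f1_pos, 2))       # 0.64
print(round(weighted_f1, 2))  # 0.85
```

The gap between precision (0.80) and recall (0.53) for class 1 shows that, at the default decision threshold, the model misses nearly half of the individuals with income above $50,000.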



### Example prediction

To have something to use as a test case, we will predict a single example. In real scenarios, however, using several examples as test cases is recommended.


In [14]:
input_columns_raw = ["native_country", "marital_status", "sex",
                    "capital_gain", "capital_loss", "hours_per_week",
                    "age", "education_num"]
id_num = 23
dataset_test.iloc[id_num,:][input_columns_raw]

native_country              Peru
marital_status     Never-married
sex                         Male
capital_gain                   0
capital_loss                   0
hours_per_week                43
age                           25
education_num                 13
Name: 23, dtype: object

In [15]:
X_test.iloc[id_num, :]

native_usa               0
native_latin_america     1
native_europe            0
native_asia              0
is_male                  1
is_married               0
hours_per_week          43
age                     25
education_num           13
capital_gain             0
capital_loss             0
Name: 23, dtype: int64

In [16]:
# Double brackets select the row as a one-row DataFrame, keeping column names and dtypes.
example_row = X_test.iloc[[id_num]]
prediction_example = best_rf.predict(example_row)[0]
prediction_proba_example = best_rf.predict_proba(example_row)[0,1]
print(f"Income > 50k prediction: {prediction_example}")
print(f"Income > 50k prediction probability: {prediction_proba_example}")

Income > 50k prediction: 0
Income > 50k prediction probability: 0.05645205977672475
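
The values above can be turned into a regression-style test case. The sketch below is a minimal version under stated assumptions: the tolerance and the helper name `check_prediction` are hypothetical, and a stand-in function replaces the trained model so the snippet runs on its own. With the real model, the probability would come from `best_rf.predict_proba(...)[0, 1]` instead.

```python
import math

# Values recorded in the cell output above.
EXPECTED_PROBA = 0.05645205977672475
EXPECTED_LABEL = 0


def check_prediction(proba, expected_proba, expected_label, tol=1e-6):
    """Return True if a predicted probability matches the recorded test case."""
    in_range = 0.0 <= proba <= 1.0
    # For a binary classifier, class 1 is predicted only when P(class 1) exceeds 0.5.
    label = int(proba > 0.5)
    close = math.isclose(proba, expected_proba, abs_tol=tol)
    return in_range and close and label == expected_label


def fake_predict_proba(_features):
    # Stand-in for the trained model (assumption): returns the recorded value.
    return EXPECTED_PROBA


print(check_prediction(fake_predict_proba(None), EXPECTED_PROBA, EXPECTED_LABEL))  # True
print(check_prediction(0.70, EXPECTED_PROBA, EXPECTED_LABEL))                      # False
```

Comparing with a tolerance rather than exact equality keeps the check robust to small floating-point differences across library versions.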


## Saving the model

In [21]:
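model_path = Path.cwd().parents[0]/'model'/'model.joblib'
# Create the target directory if it does not exist yet, so joblib.dump does not fail.
model_path.parent.mkdir(parents=True, exist_ok=True)
_ = joblib.dump(best_rf, filename=model_path)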