``Aviso: Realize a atividade prática antes de responder o questionário, pois o mesmo faz algumas questões sobre a prática.``

### Dataset
Nesta atividade será utilizado o mesmo dataset da semana passada, o dataset spambase que possui 58 atributos preditivos e uma classe binária no qual o valor 1 indica spam e o valor 0 um e-mail comum. Os atributos preditivos são resultados do pré-processamento dos e-mails e contem a frequência de algumas palavras e também características como proporção de letras maiúsculas, etc.

Mais detalhe sobre o dataset por ser encontrado em: [https://www.openml.org/search?type=data&id=44&sort=runs&status=active](https://www.openml.org/search?type=data&id=44&sort=runs&status=active)

### Atividades práticas
* Usando Validação cruzada com 5-folds, avalie a revocação dos algoritmos RF e XGBoost usando os valores dos hyperparâmetros padrão.
* Usando holdout 70%/30% e o valor de random_state=10, calcule a diferença da métrica AUC obtida pelo RF com 50 árvores e com 300 árvores.
* Usando todo o conjunto de dados e as configurações de hyperparâmetro padrão, calcule a importância de cada um dos atributos identificado pelo RF. Use random_state=10.

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier # Bagging
from sklearn.ensemble import ExtraTreesClassifier # Bagging
from sklearn.ensemble import AdaBoostClassifier # Boost
import xgboost as xgb

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import train_test_split, cross_val_score

import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('./spambase.csv')
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_%3B,char_freq_%28,char_freq_%5B,char_freq_%21,char_freq_%24,char_freq_%23,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [3]:
X = df.drop(columns=['class'])
Y = df['class']

In [37]:
# Without ensembler
xg_boost = xgb.XGBClassifier()
rf = RandomForestClassifier()

for model in [xg_boost, rf]:
    scores = cross_val_score(model, X, Y, cv=5, scoring='recall')
    print(model, scores.mean())

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, ...) 0.9172533978661553
RandomForestClassifier() 0.9051291417439081


In [34]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=10, stratify=Y)

for depth in [50, 300]:
    rf = RandomForestClassifier(max_depth=depth, random_state=10)
    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    print(rf, roc_auc_score(y_test, y_pred))


RandomForestClassifier(max_depth=50, random_state=10) 0.9479331383090871
RandomForestClassifier(max_depth=300, random_state=10) 0.9479331383090871


In [25]:
feature_importance = {}
for importance, feature in zip(rf.feature_importances_, X.columns.values):
    feature_importance[feature] = importance

for feature in sorted(feature_importance, key=feature_importance.get, reverse=True):
    print(feature, ':', feature_importance[feature].round(3))

char_freq_%21 : 0.117
char_freq_%24 : 0.088
word_freq_free : 0.074
word_freq_remove : 0.071
word_freq_your : 0.056
capital_run_length_longest : 0.056
capital_run_length_average : 0.054
word_freq_hp : 0.047
capital_run_length_total : 0.045
word_freq_money : 0.037
word_freq_our : 0.033
word_freq_you : 0.029
word_freq_000 : 0.025
word_freq_hpl : 0.024
word_freq_george : 0.023
word_freq_edu : 0.02
word_freq_business : 0.013
word_freq_internet : 0.013
char_freq_%28 : 0.012
word_freq_will : 0.012
word_freq_1999 : 0.011
word_freq_all : 0.011
word_freq_re : 0.01
word_freq_mail : 0.01
word_freq_over : 0.008
word_freq_receive : 0.008
word_freq_credit : 0.008
word_freq_order : 0.008
word_freq_email : 0.008
word_freq_650 : 0.005
word_freq_meeting : 0.005
char_freq_%3B : 0.005
word_freq_address : 0.004
word_freq_85 : 0.004
word_freq_labs : 0.004
char_freq_%23 : 0.004
word_freq_people : 0.004
word_freq_make : 0.004
word_freq_pm : 0.003
word_freq_data : 0.003
word_freq_technology : 0.003
word_freq_fo