```Notice: Complete the practical activity before answering the questionnaire, since some of its questions refer to the practical work.```

### Dataset
The spambase dataset has 57 predictive attributes and a binary class, in which the value 1 indicates spam and the value 0 an ordinary e-mail. The predictive attributes result from preprocessing the e-mails and contain the frequency of certain words and characters, as well as features describing the use of capital letters.

More details about the dataset can be found at: [https://www.openml.org/search?type=data&id=44&sort=runs&status=active](https://www.openml.org/search?type=data&id=44&sort=runs&status=active)

 
For activities 1 to 4, run 10-fold cross-validation; for activity 5, use the entire dataset as the training set.

### Practical activities
* Compare the F1 score obtained by the 4 algorithms presented in class using their default configurations (without changing any hyperparameter). Note: to avoid warnings when running Logistic Regression, set `max_iter=3000`.
* Run K-NN with k = 3, 5, 7 and 9 and analyze the resulting precision and recall.
* Compute the accuracy of the decision tree model when the maximum depth of the tree is set to 5.
* Compute the AUC for the SVM algorithm using the hyperparameter combination C=0.1 and gamma=0.1.
* Using Logistic Regression trained on the entire dataset, identify the most important attribute according to the coefficient values. Note: to avoid warnings when running Logistic Regression, set `max_iter=3000`. Use the absolute values of the coefficients.

In [9]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.metrics import classification_report

In [2]:
df = pd.read_csv('./spambase.csv')
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_%3B,char_freq_%28,char_freq_%5B,char_freq_%21,char_freq_%24,char_freq_%23,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [4]:
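# Features: every column except the target.
# Note: this keeps 'Unnamed: 0' as a feature; it looks like the row index saved
# with the CSV, so add it to the dropped columns if it should be excluded.
X = df.drop(columns=['class'])
# Target: 1 = spam, 0 = ordinary e-mail.
Y = df['class']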

In [5]:
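# Activity 1: the four classifiers with default hyperparameters
# (max_iter is raised only to avoid warnings from Logistic Regression).
knn = KNeighborsClassifier()
lreg = LogisticRegression(max_iter=3000)
dt = DecisionTreeClassifier()
svm = SVC()

# 10-fold cross-validation; 'f1_macro' averages the F1 of both classes.
for model in [knn, lreg, dt, svm]:
    scores = cross_val_score(model, X, Y, cv=10, scoring='f1_macro')
    print(model, scores.mean())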

KNeighborsClassifier() 0.7768480750255765
LogisticRegression(max_iter=3000) 0.9144394892117331
DecisionTreeClassifier() 0.8957573254558431
SVC() 0.6696154584522886
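
The loop above reports the macro-averaged F1 (`f1_macro`), which averages the F1 of the two classes. If the F1 of the spam class alone is of interest, scikit-learn's `'f1'` scorer reports it for binary labels. The sketch below collects both in one cross-validation run per model. It uses synthetic data from `make_classification` as a stand-in for spambase (an assumption made so the example is self-contained), and `compare_f1` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_f1(models, X, y, cv=10):
    """Return {model name: (mean binary F1, mean macro F1)} from one CV run per model."""
    results = {}
    for model in models:
        cv_out = cross_validate(model, X, y, cv=cv, scoring=["f1", "f1_macro"])
        results[type(model).__name__] = (
            cv_out["test_f1"].mean(),
            cv_out["test_f1_macro"].mean(),
        )
    return results


# Synthetic stand-in data (assumption); replace with the notebook's X and Y.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
models = [
    KNeighborsClassifier(),
    LogisticRegression(max_iter=3000),
    DecisionTreeClassifier(random_state=0),
    SVC(),
]
f1_results = compare_f1(models, X_demo, y_demo)
for name, (f1_pos, f1_macro) in f1_results.items():
    print(f"{name}: f1={f1_pos:.3f} f1_macro={f1_macro:.3f}")
```

With the notebook's data, the call would be `compare_f1(models, X, Y)`.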


In [10]:
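# Activity 2: K-NN for k = 3, 5, 7 and 9.
# This cell evaluates on a single stratified 70/30 hold-out split rather than
# 10-fold cross-validation; no random_state is set, so the split changes per run.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
for i in [3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    # One report per k, printed in the order k = 3, 5, 7, 9.
    print(classification_report(y_test, y_pred))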


              precision    recall  f1-score   support

           0       0.85      0.85      0.85       837
           1       0.77      0.77      0.77       544

    accuracy                           0.82      1381
   macro avg       0.81      0.81      0.81      1381
weighted avg       0.82      0.82      0.82      1381

              precision    recall  f1-score   support

           0       0.84      0.85      0.84       837
           1       0.77      0.74      0.75       544

    accuracy                           0.81      1381
   macro avg       0.80      0.80      0.80      1381
weighted avg       0.81      0.81      0.81      1381

              precision    recall  f1-score   support

           0       0.83      0.85      0.84       837
           1       0.76      0.73      0.75       544

    accuracy                           0.80      1381
   macro avg       0.80      0.79      0.79      1381
weighted avg       0.80      0.80      0.80      1381

              preci
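
The reports above come from a single hold-out split, while the activity statement asks for 10-fold cross-validation. One possible way to get precision and recall under cross-validation is to collect out-of-fold predictions with `cross_val_predict` and score them once per value of k. This is a sketch, not the notebook's original method: the data is a synthetic stand-in generated with `make_classification` (an assumption), and `knn_precision_recall` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def knn_precision_recall(X, y, ks=(3, 5, 7, 9), cv=10):
    """Return {k: (precision, recall)} for the positive class, using out-of-fold predictions."""
    results = {}
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k)
        # Each sample is predicted by a model that did not see it during training.
        y_pred = cross_val_predict(knn, X, y, cv=cv)
        results[k] = (precision_score(y, y_pred), recall_score(y, y_pred))
    return results


# Synthetic stand-in data (assumption); replace with the notebook's X and Y.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
knn_results = knn_precision_recall(X_demo, y_demo)
for k, (precision, recall) in knn_results.items():
    print(f"k={k}: precision={precision:.3f} recall={recall:.3f}")
```

Scoring the pooled out-of-fold predictions can differ slightly from averaging per-fold scores, which is what `cross_val_score` does.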

In [11]:
dt = DecisionTreeClassifier(max_depth=5)
scores = cross_val_score(dt, X, Y, cv=10, scoring='accuracy')
print(dt, scores.mean())

DecisionTreeClassifier(max_depth=5) 0.9026261435442798


In [19]:
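svm = SVC(C=0.1, gamma=0.1)
# Score the SVM itself. Note: the output recorded below came from a run that
# passed `dt` (the depth-5 decision tree) here, so it is not the SVM's AUC.
scores = cross_val_score(svm, X, Y, cv=10, scoring='roc_auc')
print(svm, scores.mean())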

DecisionTreeClassifier(max_depth=5) 0.924463912901242
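
AUC is computed from continuous scores rather than from predicted labels. For an `SVC` created without `probability=True`, those scores are the output of `decision_function`. The sketch below shows two ways to obtain a cross-validated AUC for the SVM with C=0.1 and gamma=0.1: the mean of per-fold AUCs, and a single AUC over pooled out-of-fold scores. The data is a synthetic stand-in from `make_classification` (an assumption), so the numbers say nothing about spambase.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the notebook's X and Y.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
svm_demo = SVC(C=0.1, gamma=0.1)

# Option 1: one AUC per fold, then the mean over the 10 folds.
auc_per_fold = cross_val_score(svm_demo, X_demo, y_demo, cv=10, scoring="roc_auc")
print("mean of per-fold AUC:", auc_per_fold.mean())

# Option 2: pool the out-of-fold decision scores and compute a single AUC.
decision_scores = cross_val_predict(
    svm_demo, X_demo, y_demo, cv=10, method="decision_function"
)
pooled_auc = roc_auc_score(y_demo, decision_scores)
print("AUC over pooled out-of-fold scores:", pooled_auc)
```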


In [None]:
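# Five largest coefficients, sorted in descending order.
# The values are signed (absolute values are not applied) and the
# corresponding attribute names are not shown.
lreg = LogisticRegression(max_iter=3000)
lreg.fit(X, Y)
np.sort(lreg.coef_.flatten())[::-1][:5]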


array([3.9369725 , 2.18577869, 2.09792989, 1.33298601, 1.02006995])
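
The cell above sorts the signed coefficients and returns only their values, so it shows neither the attribute names nor any large negative coefficients. Since the activity asks for a ranking by absolute value, one possible approach is to pair each coefficient with its column name and sort by magnitude. This is a sketch under assumptions: the data is synthetic (`make_classification`), the column names `feat_0` to `feat_7` are made up, and `top_features_by_abs_coef` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def top_features_by_abs_coef(model, feature_names, n=5):
    """Return the n features with the largest |coefficient|, keeping the signed values."""
    coefs = pd.Series(model.coef_.ravel(), index=feature_names)
    # Rank by magnitude, then look up the signed coefficients in that order.
    order = coefs.abs().sort_values(ascending=False).index
    return coefs.loc[order].head(n)


# Synthetic stand-in data with made-up column names (assumption).
X_arr, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

lreg_demo = LogisticRegression(max_iter=3000).fit(X_demo, y_demo)
top5 = top_features_by_abs_coef(lreg_demo, X_demo.columns)
print(top5)
print("Most important attribute:", top5.index[0])
```

With the notebook's data, the call would be `top_features_by_abs_coef(lreg, X.columns)`. Coefficient magnitudes depend on the scale of each attribute, so this ranking is easiest to interpret when the attributes are on comparable scales.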