# K-Fold Breast Cancer

Here we use k-fold cross-validation to find the best model for breast cancer classification.

Since some features take values outside the interval [0, 1], it is a good idea to scale the data so that every feature lies within that interval.

Scaling gives every attribute the same weight when KNN computes distances,
and it makes no difference to the decision tree algorithm, whose splits depend only on the ordering of each feature's values.
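To illustrate the effect on KNN, the following is a minimal sketch using only the standard library. The four rows are made-up values, not taken from `dataR2.csv`, and the helper names `min_max_scale` and `nearest` are hypothetical. It applies the min-max formula `(x - min) / (max - min)` per column and compares which row is closest to the first one before and after scaling.

```python
import math


def min_max_scale(rows):
    """Scale each column of `rows` to [0, 1] via (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def nearest(query_idx, rows):
    """Return the index of the row closest (Euclidean) to rows[query_idx]."""
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: math.dist(rows[query_idx], rows[i]))


# Made-up rows: the second column has a much wider range than the first.
rows = [
    [50, 100],  # query row
    [52, 160],
    [80, 105],
    [60, 600],
]

raw_nearest = nearest(0, rows)
scaled_nearest = nearest(0, min_max_scale(rows))
print('nearest before scaling:', raw_nearest)
print('nearest after scaling: ', scaled_nearest)
```

In raw units, a difference of 60 in the wide second column outweighs a difference of 30 in the first, even though 60 is a small fraction of that column's range. After scaling, each column contributes in proportion to its own range, and the nearest neighbour changes.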

In [14]:
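import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# The last column (Classification) is the label; all other columns are features.
data = pd.read_csv('dataR2.csv')
data, labels = data.iloc[:, :-1], data.iloc[:, -1]

# Scale every feature to [0, 1]. The DataFrame is rebuilt so that integer
# columns become float cleanly. Note that the scaler is fitted on the full
# dataset, before any train/test split, so test rows influence the scaling.
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns, index=data.index)

display(data.head(4))
display(labels.head(4))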


|   | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 |
|---|-----|-----|---------|---------|------|--------|-------------|----------|-------|
| 0 | 0.369231 | 0.25385 | 0.070922 | 0.004908 | 0.0 | 0.052299 | 0.221152 | 0.060665 | 0.224659 |
| 1 | 0.907692 | 0.114826 | 0.22695 | 0.01219 | 0.009742 | 0.052726 | 0.103707 | 0.010826 | 0.255926 |
| 2 | 0.892308 | 0.235278 | 0.219858 | 0.036874 | 0.022058 | 0.158526 | 0.571021 | 0.076906 | 0.307912 |
| 3 | 0.676923 | 0.148328 | 0.120567 | 0.014171 | 0.005911 | 0.064811 | 0.151538 | 0.121131 | 0.533934 |


0    1
1    1
2    1
3    1
Name: Classification, dtype: int64
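The cell above fits the scaler on every row before any split, so rows that later land in a test fold contribute to the minima and maxima. A minimal sketch of the alternative, with made-up numbers and hypothetical helper names (`fit_min_max`, `apply_min_max`), learns the bounds from the training rows only and reuses them on the test rows. In scikit-learn this corresponds to calling `fit` on the training fold and `transform` on the test fold, for example by wrapping the scaler and the classifier in a `Pipeline`.

```python
def fit_min_max(train_rows):
    """Learn per-column (min, max) bounds from the training rows only."""
    cols = list(zip(*train_rows))
    return [(min(c), max(c)) for c in cols]


def apply_min_max(rows, bounds):
    """Scale rows with previously learned bounds; constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]


# Made-up values, not taken from dataR2.csv.
train = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]]
test = [[40.0, 2.0]]

bounds = fit_min_max(train)               # uses training rows only
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)  # reuses the training bounds

print(train_scaled)
print(test_scaled)
```

With this approach a test value can fall outside [0, 1] when it lies beyond the range seen in training. That is expected: the test fold is treated as unseen data.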

Now we can build the procedure that tests the different hyperparameters,
evaluating each combination of them on every split of the k-fold:

In [15]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score


def train_model(m, n_splits, X, y):
    """Return the mean accuracy of model `m` over a stratified k-fold of (X, y)."""
    split_scores = []
    kf = StratifiedKFold(n_splits, shuffle=False)
    for tr_id, te_id in kf.split(X, y):
        trData, trLabels = X.iloc[tr_id], y.iloc[tr_id]
        teData, teLabels = X.iloc[te_id], y.iloc[te_id]

        m.fit(trData, trLabels)
        split_scores.append(accuracy_score(teLabels, m.predict(teData)))

    return np.average(split_scores)


def train_tree(n_splits, X, y, models_scores):
    """Score every decision tree configuration and append the results to `models_scores`."""
    for cr in ['gini', 'entropy', 'log_loss']:
        for md in range(10, 101, 10):
            m = DecisionTreeClassifier(criterion=cr, max_depth=md, random_state=42)
            avg_score = train_model(m, n_splits, X, y)
            models_scores.append([('tree', cr, md), m, avg_score])


def train_knn(n_splits, X, y, models_scores):
    """Score every KNN configuration and append the results to `models_scores`."""
    for k in [1, 3, 5, 11, 21, 31]:
        m = KNeighborsClassifier(k)
        avg_score = train_model(m, n_splits, X, y)
        models_scores.append([('knn', k), m, avg_score])


accuracies = []
highest = (None, None, -1)
X, y = data, labels

# Outer loop: 5 stratified folds, each holding out one test fold.
kf = StratifiedKFold(5, shuffle=True, random_state=42)
for tr_id, te_id in kf.split(X, y):
    trData, trLabels = X.iloc[tr_id], y.iloc[tr_id]
    teData, teLabels = X.iloc[te_id], y.iloc[te_id]

    # Inner loop: score every candidate with 5-fold CV on the training part only.
    models_scores = []
    train_tree(5, trData, trLabels, models_scores)
    train_knn(5, trData, trLabels, models_scores)

    # Pick the candidate with the best inner average (the first one wins on ties).
    scores = [score for _, _, score in models_scores]
    mattrs, m, score = models_scores[np.argmax(scores)]
    print('Model with highest accuracy:', mattrs, ', average accuracy: ', score)

    # Refit the winner on the whole training part and score it on the held-out fold.
    m.fit(trData, trLabels)
    acc = accuracy_score(teLabels, m.predict(teData))
    accuracies.append(acc)

    print('Model accuracy with test dataset:', acc)

    if acc > highest[2]:
        highest = (mattrs, m, acc)

total_acc = np.average(accuracies)
print('Average accuracy for all models:', total_acc)

print('Model with highest overall accuracy:', highest[0], ', accuracy with test dataset:', highest[2])


Model with highest accuracy: ('knn', 5) , average accuracy:  0.6514619883040936
Model accuracy with test dataset: 0.625
Model with highest accuracy: ('knn', 5) , average accuracy:  0.6742690058479532
Model accuracy with test dataset: 0.6956521739130435
Model with highest accuracy: ('knn', 21) , average accuracy:  0.6432748538011697
Model accuracy with test dataset: 0.6086956521739131
Model with highest accuracy: ('tree', 'gini', 10) , average accuracy:  0.6538011695906432
Model accuracy with test dataset: 0.9130434782608695
Model with highest accuracy: ('knn', 5) , average accuracy:  0.6959064327485379
Model accuracy with test dataset: 0.782608695652174
Average accuracy for all models: 0.725
Model with highest overall accuracy: ('tree', 'gini', 10) , accuracy with test dataset: 0.9130434782608695
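To make the size of the search in the cell above concrete, the following sketch enumerates the same hyperparameter grid with `itertools.product` and counts the model fits it implies. The variable names are illustrative; the grid values and fold counts are copied from the code above.

```python
from itertools import product

# Hyperparameter values copied from train_tree and train_knn above.
criteria = ['gini', 'entropy', 'log_loss']
depths = list(range(10, 101, 10))
neighbors = [1, 3, 5, 11, 21, 31]

tree_grid = [('tree', cr, md) for cr, md in product(criteria, depths)]
knn_grid = [('knn', k) for k in neighbors]
candidates = tree_grid + knn_grid

outer_folds = 5
inner_folds = 5

# Every candidate is fitted once per inner fold, inside every outer fold.
inner_fits = len(candidates) * inner_folds * outer_folds
# The winning candidate is refitted once per outer fold before testing.
refits = outer_folds
total_fits = inner_fits + refits

print('candidates:', len(candidates))
print('inner fits:', inner_fits)
print('total fits:', total_fits)
```

Each outer fold therefore compares 36 candidates using only its training part, and the held-out fold is touched once, by the single candidate that won the inner comparison.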


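We see, then, that the model with the highest test accuracy is the decision tree with the Gini criterion and maximum depth 10, which reached 0.913 on its outer test fold. That figure comes from a single fold: the average test accuracy over the five outer folds is 0.725, and KNN with k = 5 was the configuration selected in three of the five folds.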
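The per-fold figures printed above can be summarized with the standard library. This is a minimal sketch: the values in `outer_results` are copied from the output above, and the variable names are illustrative. It reports the mean and sample standard deviation of the outer-fold accuracies and counts how often each configuration was selected.

```python
from collections import Counter
from statistics import mean, stdev

# (selected configuration, accuracy on the outer test fold), copied from the output above.
outer_results = [
    (('knn', 5), 0.625),
    (('knn', 5), 0.6956521739130435),
    (('knn', 21), 0.6086956521739131),
    (('tree', 'gini', 10), 0.9130434782608695),
    (('knn', 5), 0.782608695652174),
]

accs = [acc for _, acc in outer_results]
mean_acc = mean(accs)
spread = stdev(accs)  # sample standard deviation across the five folds

selection_counts = Counter(cfg for cfg, _ in outer_results)
most_selected, times = selection_counts.most_common(1)[0]

print(f'mean accuracy: {mean_acc:.3f}, standard deviation: {spread:.3f}')
print('most frequently selected:', most_selected, 'in', times, 'of', len(outer_results), 'folds')
```

Looking at the spread and the selection counts alongside the single best fold helps judge how much of a high score reflects the model and how much reflects the particular split.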