# KNN Activity

In this activity, you will implement a binary classification model on the [heart disease data](https://www.kaggle.com/ronitf/heart-disease-uci). The dataset has 13 numeric variables describing patients, such as age and sex, plus the response variable (target), which is 1 when the patient has heart disease and 0 otherwise.

To complete the activity, follow these steps:

* Split the dataset into training and test sets in an 80%/20% proportion, respectively.
* Train the model with KNN on the training set.
    * Train with different numbers of neighbors (suggestion: 5 and 11).
* Test the models on the test set.
* From the predictions, obtain the confusion matrices and report how effective the models are using the following metrics: accuracy, precision, recall, informedness, and markedness.
* Analyze the results for the two models and state which model predicted the outcomes better.
    * Note that this medical context has a particularity: it is more critical for the model to say that the patient **does not have** a disease when they actually **do** (**FN**) than to say that they **have** one when they actually **do not** (**FP**). In the first case the patient will be unconcerned while actually ill, whereas in the second case the patient will be alerted unnecessarily.
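
As a reading aid, the requested metrics can be written in terms of the confusion-matrix counts (TP, FP, FN, TN). These are the usual textbook definitions, and they agree with the helper functions defined in this notebook:

$$
\begin{aligned}
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
\text{informedness} &= \frac{TP}{TP + FN} - \frac{FP}{FP + TN} \\
\text{markedness} &= \frac{TP}{TP + FP} - \frac{FN}{FN + TN}
\end{aligned}
$$

Informedness is the true positive rate minus the false positive rate, and markedness is the precision minus the false omission rate, so both range from -1 to 1.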

In [0]:
import pandas as pd
import random
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from matplotlib.colors import ListedColormap

# Importing data visualization libs
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

In [0]:
# Actual positives
def rp(tp, fn):
    return tp + fn

# Actual negatives
def rn(fp, tn):
    return fp + tn

# Predicted positives
def pp(tp, fp):
    return tp + fp

# Predicted negatives
def pn(fn, tn):
    return fn + tn

# Accuracy of the model
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fp + fn)

# Precision: true positives over predicted positives
def precision(tp, fp):
    return tp / (tp + fp)

# Recall: true positives over actual positives
def recall(tp, fn):
    return tp / (tp + fn)

# F-measure (harmonic mean of precision and recall)
def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return (2 * p * r) / (r + p)

# Informedness: true positive rate minus false positive rate
def informedness(tp, fp, fn, tn):
    return (tp / rp(tp, fn)) - (fp / rn(fp, tn))

# Markedness: precision minus false omission rate
# (the function name is kept as spelled because later cells call it)
def markdness(tp, fp, fn, tn):
    return (tp / pp(tp, fp)) - (fn / pn(fn, tn))

# Feature scaling (standardization); fits a new scaler on every call
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)

# Plots the classification results (decision regions) for a classifier trained on two features
def plot_results_class(X, y, classifier, title):
    X_set, y_set = X, y
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75, cmap = ListedColormap(('red', 'green')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c = ListedColormap(('red', 'green'))(i), label = j)
    plt.title(title)
    plt.xlabel('Age')
    plt.ylabel('Fare')
    plt.legend()
    plt.show()

In [52]:
import pandas as pd

pacientes_cardiacos = pd.read_csv("https://orionwinter.github.io/datasets/heart.csv")

pacientes_cardiacos.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [53]:
# Exploring the dataset
pacientes_cardiacos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [54]:
# Viewing summary statistics for the numeric columns of the dataset
pacientes_cardiacos.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [56]:
X = pacientes_cardiacos.iloc[:, :13].values  # the first 13 columns are the features
X[:5]

array([[ 63. ,   1. ,   3. , 145. , 233. ,   1. ,   0. , 150. ,   0. ,
          2.3,   0. ,   0. ,   1. ],
       [ 37. ,   1. ,   2. , 130. , 250. ,   0. ,   1. , 187. ,   0. ,
          3.5,   0. ,   0. ,   2. ],
       [ 41. ,   0. ,   1. , 130. , 204. ,   0. ,   0. , 172. ,   0. ,
          1.4,   2. ,   0. ,   2. ],
       [ 56. ,   1. ,   1. , 120. , 236. ,   0. ,   1. , 178. ,   0. ,
          0.8,   2. ,   0. ,   2. ],
       [ 57. ,   0. ,   0. , 120. , 354. ,   0. ,   1. , 163. ,   1. ,
          0.6,   2. ,   0. ,   2. ]])

In [57]:
y = pacientes_cardiacos.iloc[:, 13].values
y[:5]

array([1, 1, 1, 1, 1])

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)

print("Dataset size: {}".format(pacientes_cardiacos.shape[0]))
print("Training set size: {}".format(len(X_train)))
print("Test set size: {}".format(len(X_test)))

Dataset size: 303
Training set size: 242
Test set size: 61


In [59]:
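# Caution: feature_scaling refits the scaler on each call, so the test set is
# standardized with its own mean and standard deviation, not the training set's.
X_train = feature_scaling(X_train)
X_test = feature_scaling(X_test)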

X_train[:5]

array([[-1.35679832,  0.72250438,  0.00809909, -0.61685555,  0.91403366,
        -0.38330071,  0.8431327 ,  0.53278078, -0.67663234, -0.92086403,
         0.95390513, -0.68970073, -0.50904773],
       [ 0.38508599,  0.72250438, -0.97189094,  1.1694912 ,  0.43952674,
        -0.38330071, -1.04610909, -1.75358236,  1.47790748, -0.19378705,
         0.95390513, -0.68970073,  1.17848036],
       [-0.92132724,  0.72250438,  0.98808912,  1.1694912 , -0.30070405,
        -0.38330071,  0.8431327 , -0.13967897, -0.67663234,  2.3509824 ,
        -0.69498803, -0.68970073, -0.50904773],
       [ 0.05848269, -1.38407465,  0.00809909,  0.27631782,  0.0599212 ,
        -0.38330071, -1.04610909,  0.48795013, -0.67663234,  0.35152069,
        -0.69498803, -0.68970073, -0.50904773],
       [ 0.60282153,  0.72250438, -0.97189094, -0.79549023, -0.31968433,
         2.60891771,  0.8431327 ,  0.44311948,  1.47790748,  0.35152069,
         0.95390513,  1.33342142,  1.17848036]])
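
A common alternative to scaling each split independently is to fit the scaler on the training data only and reuse its statistics for the test data, so that no information from the test set influences preprocessing. The sketch below illustrates this on hypothetical random data (`X_train_demo` and `X_test_demo` are stand-ins, not the notebook's arrays); it is not what the notebook ran, and applying it here would likely change the reported numbers slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train / X_test
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on the training data only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training mean/std

# Training columns end up with mean ~0 and standard deviation ~1,
# while test columns are only approximately centered and scaled.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```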

## Training and Validation

In [60]:
from sklearn.neighbors import KNeighborsClassifier

classifier5 = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier11 = KNeighborsClassifier(n_neighbors = 11, metric = 'minkowski', p = 2)


classifier5.fit(X_train, y_train)
classifier11.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=11, p=2,
                     weights='uniform')
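
A single 80/20 split gives only one estimate per value of k. As a possible complement, which is not part of the original activity, the candidates could be compared with cross-validation. The sketch below assumes synthetic data from `make_classification` as a stand-in for the heart-disease table, and scores by recall because false negatives are the critical error in this setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart-disease data (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

cv_scores = {}
for k in (5, 11):
    # The pipeline refits the scaler inside each fold, using only that fold's training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="recall").mean()

for k, score in cv_scores.items():
    print(f"k={k}: mean recall over 5 folds = {score:.3f}")
```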

In [61]:
y_pred5 = classifier5.predict(X_test)
y_pred11 = classifier11.predict(X_test)

print(y_pred5)
print(y_pred11)

[0 0 1 0 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 0 1
 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 1 1 1 0 0 1 0 0 0]
[0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1
 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0]


In [62]:
tn5, fp5, fn5, tp5 = confusion_matrix(y_test, y_pred5).ravel()
tn11, fp11, fn11, tp11 = confusion_matrix(y_test, y_pred11).ravel()

print("Confusion Matrix 5: ", confusion_matrix(y_test, y_pred5))
print("Confusion Matrix 11: ", confusion_matrix(y_test, y_pred11))

Confusion Matrix 5:  [[26  3]
 [ 3 29]]
Confusion Matrix 11:  [[25  4]
 [ 3 29]]
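
The unpacking order `tn, fp, fn, tp` follows from how scikit-learn lays out the matrix: rows are the true classes and columns are the predicted classes, both in sorted label order (0, then 1). A small example with hypothetical labels, chosen so that every cell has a different count, makes the layout easy to check:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 true negatives-or-false-positives, 6 actual positives
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()  # row-major flattening: [[tn, fp], [fn, tp]]

print(cm)
# [[3 1]
#  [2 4]]
print(tn, fp, fn, tp)
# 3 1 2 4
```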


In [63]:
print("Accuracy 5: ", accuracy(tp5, fp5, fn5, tn5))
print("Accuracy 11: ", accuracy(tp11, fp11, fn11, tn11))
print("Score 5: ", classifier5.score(X_test, y_test))
print("Score 11: ",classifier11.score(X_test, y_test))

Accuracy 5:  0.9016393442622951
Accuracy 11:  0.8852459016393442
Score 5:  0.9016393442622951
Score 11:  0.8852459016393442


In [64]:
print("f_measure 5: ", f_measure(tp5, fp5, fn5))
print("f_measure 11: ",f_measure(tp11, fp11, fn11))
print("F1 Score 5: ",f1_score(y_test, y_pred5))  
print("F1 Score 11: ",f1_score(y_test, y_pred11))  

f_measure 5:  0.90625
f_measure 11:  0.8923076923076922
F1 Score 5:  0.90625
F1 Score 11:  0.8923076923076922


In [69]:
print("Informedness 5: ", informedness(tp5, fp5, fn5, tn5))
print("Informedness 11: ", informedness(tp11, fp11, fn11, tn11))
print("Markedness 5: ", markdness(tp5, fp5, fn5, tn5))
print("Markedness 11: ", markdness(tp11, fp11, fn11, tn11))
print("Precision 5: ",precision (tp5, fp5))
print("Precision 11: ",precision (tp11, fp11))
print("Recall 5: ", recall(tp5, fn5))
print("Recall 11: ", recall(tp11, fn11))
print("FN 5: ",fn5)
print("FN 11: ",fn11)

Informedness 5:  0.802801724137931
Informedness 11:  0.7683189655172413
Markedness 5:  0.802801724137931
Markedness 11:  0.7716450216450217
Precision 5:  0.90625
Precision 11:  0.8787878787878788
Recall 5:  0.90625
Recall 11:  0.90625
FN 5:  3
FN 11:  3
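
The printed values can be checked by hand from the two confusion matrices shown earlier. The sketch below recomputes accuracy, informedness, and markedness directly from the counts; the dictionary names are illustrative only. For k = 5, for example, informedness is 29/32 - 3/29 ≈ 0.8028.

```python
# Counts read from the printed confusion matrices: (tn, fp, fn, tp)
counts = {5: (26, 3, 3, 29), 11: (25, 4, 3, 29)}

results = {}
for k, (tn, fp, fn, tp) in counts.items():
    tpr = tp / (tp + fn)          # recall (true positive rate)
    fpr = fp / (fp + tn)          # false positive rate
    ppv = tp / (tp + fp)          # precision
    f_omission = fn / (fn + tn)   # false omission rate
    results[k] = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "informedness": tpr - fpr,
        "markedness": ppv - f_omission,
    }

for k, metrics in results.items():
    print(k, {name: round(value, 4) for name, value in metrics.items()})
```

With 5 neighbors, FP and FN are both 3, so the number of actual negatives equals the number of predicted negatives (29), which is why informedness and markedness coincide for that model.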


From the analysis of the indicators, we can conclude that the better model is the one that uses only 5 neighbors. Its accuracy, precision, F-measure, informedness, and markedness are all higher than those of the 11-neighbor model, while recall is the same for both (0.90625). The number of false negatives is also the same for both models (3).
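
Because false negatives are the more critical error here, one optional way to summarize each model in a single number is the F-beta score with beta = 2, which weights recall more heavily than precision. This metric was not part of the original analysis; the sketch below computes it from the confusion-matrix counts reported for each model, and `f_beta` is a helper introduced only for illustration.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from confusion-matrix counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Counts taken from the confusion matrices of the two models
f2_k5 = f_beta(tp=29, fp=3, fn=3, beta=2)
f2_k11 = f_beta(tp=29, fp=4, fn=3, beta=2)

print(f"F2 with 5 neighbors:  {f2_k5:.4f}")
print(f"F2 with 11 neighbors: {f2_k11:.4f}")
```

With beta = 1 the same function gives the F-measure values already reported. Under beta = 2 the 5-neighbor model still comes out ahead, since both models have the same recall and it has fewer false positives.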