<a href="https://colab.research.google.com/github/esalbuquerquebr/projeto3_programacao_ia/blob/master/ifes_2020_1_ia_t3_p2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Project – Assignment 3 – Learning
## Problem 2 - Supervised Learning
### IFES | 2020/1 | Artificial Intelligence
### Eduardo Soares Albuquerque
--------------------------------------

## Instructions for downloading the dataset and other files

### Installing Kaggle

In [1]:
!pip install kaggle



### Authenticating with Kaggle using kaggle.json

Go to your user profile settings at https://www.kaggle.com/me/account and click 'Create API Token' to download the kaggle.json file that is uploaded in the cell below.

In [2]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 74 bytes
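Outside Colab, the same token placement can be done from plain Python. The following is a minimal sketch, not part of the original notebook: `install_kaggle_token` is a hypothetical helper name, and the demo runs against a temporary directory with a fake token so the real home folder is not touched.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy an API token to <home>/.kaggle/kaggle.json with owner-only permissions."""
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    target.chmod(0o600)  # same effect as `chmod 600` in the shell command above
    return target


# Demo against a temporary directory (assumption: a POSIX-style file system).
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "downloaded.json"
    fake_token.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(fake_token, home=Path(tmp) / "home")
    installed_exists = installed.exists()
    installed_mode = stat.S_IMODE(installed.stat().st_mode)
```

Calling `install_kaggle_token("kaggle.json")` without `home` would write to the current user's home directory, which is where the shell command above places the file.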


### Downloading and unzipping the dataset

In [3]:
!kaggle datasets download -d ronitf/heart-disease-uci -p ./sample_data --unzip

Downloading heart-disease-uci.zip to ./sample_data
  0% 0.00/3.40k [00:00<?, ?B/s]
100% 3.40k/3.40k [00:00<00:00, 3.14MB/s]


## Source code for Problem 2

### Library imports

In [4]:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Metrics, preprocessing, and result reports
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier





### Reading the dataset (heart.csv)

In [5]:
dataset = pd.read_csv('./sample_data/heart.csv')
dataset.head(10)


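|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 5 | 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
| 6 | 56 | 0 | 1 | 140 | 294 | 0 | 0 | 153 | 0 | 1.3 | 1 | 0 | 2 | 1 |
| 7 | 44 | 1 | 1 | 120 | 263 | 0 | 1 | 173 | 0 | 0.0 | 2 | 0 | 3 | 1 |
| 8 | 52 | 1 | 2 | 172 | 199 | 1 | 1 | 162 | 0 | 0.5 | 2 | 0 | 3 | 1 |
| 9 | 57 | 1 | 2 | 150 | 168 | 0 | 1 | 174 | 0 | 1.6 | 2 | 0 | 2 | 1 |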


### Splitting the dataset: features and target


In [6]:
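FEATURES_COLS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# A plain column name (not a one-element list) makes `target` a 1-D Series,
# which is the shape scikit-learn classifiers expect for y.
TARGET_COL = 'target'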

features = dataset[FEATURES_COLS]
target = dataset[TARGET_COL]

### Normalization

In [7]:
def normalize_dataset(dataset):
    normalized = dataset.copy()
    for feature in dataset.columns:
        max_value = dataset[feature].max()
        min_value = dataset[feature].min()
        normalized[feature] = (dataset[feature] - min_value) / (max_value - min_value)
    return normalized

normalized = normalize_dataset(features)
normalized.head(10)

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|
| 0 | 0.708333 | 1.0 | 1.0 | 0.481132 | 0.244292 | 1.0 | 0.0 | 0.603053 | 0.0 | 0.370968 | 0.0 | 0.0 | 0.333333 |
| 1 | 0.166667 | 1.0 | 0.666667 | 0.339623 | 0.283105 | 0.0 | 0.5 | 0.885496 | 0.0 | 0.564516 | 0.0 | 0.0 | 0.666667 |
| 2 | 0.25 | 0.0 | 0.333333 | 0.339623 | 0.178082 | 0.0 | 0.0 | 0.770992 | 0.0 | 0.225806 | 1.0 | 0.0 | 0.666667 |
| 3 | 0.5625 | 1.0 | 0.333333 | 0.245283 | 0.251142 | 0.0 | 0.5 | 0.816794 | 0.0 | 0.129032 | 1.0 | 0.0 | 0.666667 |
| 4 | 0.583333 | 0.0 | 0.0 | 0.245283 | 0.520548 | 0.0 | 0.5 | 0.70229 | 1.0 | 0.096774 | 1.0 | 0.0 | 0.666667 |
| 5 | 0.583333 | 1.0 | 0.0 | 0.433962 | 0.150685 | 0.0 | 0.5 | 0.587786 | 0.0 | 0.064516 | 0.5 | 0.0 | 0.333333 |
| 6 | 0.5625 | 0.0 | 0.333333 | 0.433962 | 0.383562 | 0.0 | 0.0 | 0.625954 | 0.0 | 0.209677 | 0.5 | 0.0 | 0.666667 |
| 7 | 0.3125 | 1.0 | 0.333333 | 0.245283 | 0.312785 | 0.0 | 0.5 | 0.778626 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 8 | 0.479167 | 1.0 | 0.666667 | 0.735849 | 0.166667 | 1.0 | 0.5 | 0.694656 | 0.0 | 0.080645 | 1.0 | 0.0 | 1.0 |
| 9 | 0.583333 | 1.0 | 0.666667 | 0.528302 | 0.09589 | 0.0 | 0.5 | 0.78626 | 0.0 | 0.258065 | 1.0 | 0.0 | 0.666667 |
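The notebook computes the minimum and maximum over the whole dataset before splitting it, so the test rows influence the scaling. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below illustrates that idea on toy data; `fit_min_max` and `apply_min_max` are hypothetical helper names, and the guard for constant columns is an addition that the notebook's function does not have.

```python
import pandas as pd


def fit_min_max(train: pd.DataFrame):
    """Return the per-column minimum and range, computed on training rows only."""
    col_min = train.min()
    # A constant column has range 0; use 1 instead to avoid dividing by zero.
    col_range = (train.max() - col_min).replace(0, 1)
    return col_min, col_range


def apply_min_max(data: pd.DataFrame, col_min, col_range) -> pd.DataFrame:
    """Scale any DataFrame with statistics obtained from fit_min_max."""
    return (data - col_min) / col_range


# Toy data: `fbs` is constant in the training rows on purpose.
train = pd.DataFrame({"age": [63, 37, 41], "fbs": [1, 1, 1]})
test = pd.DataFrame({"age": [56, 70], "fbs": [1, 0]})

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

With this approach, test values can fall outside [0, 1] when they lie beyond the training range (the age of 70 here), which is expected and usually harmless.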


### Splitting the dataset into training and test sets

In [8]:
features_train, features_test, target_train, target_test = train_test_split(normalized, target, test_size=0.30, random_state=42)

### DataFrame for comparison

In [44]:
result_dataframe = []
result_index = []
col_names = ['Accuracy', 'Precision', 'Recall', 'Training time (s)', 'Prediction time (s)']

### K-NN

Model

In [45]:
knn = KNeighborsClassifier(n_neighbors=5)

Training

In [46]:
knn_fit_start = time.time()
knn.fit(features_train, target_train)
knn_fit_end = time.time()

knn_training_time = knn_fit_end - knn_fit_start

print(f'Training time: {knn_training_time}s')

Training time: 0.008011341094970703s


  


Classification

In [47]:
knn_predict_start = time.time()
knn_pred = knn.predict(features_test)
knn_predict_end = time.time()

knn_prediction_time = knn_predict_end - knn_predict_start
print(f'Prediction time: {knn_prediction_time}s')

Prediction time: 0.016013145446777344s
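Timings of a few milliseconds taken from a single `time.time()` pair can vary noticeably between runs. One possible way to get steadier numbers is to repeat the call and keep the best elapsed time, measured with `time.perf_counter()`. This is a sketch only: `time_call` is a hypothetical helper that the notebook does not use, and the demo times the built-in `sorted` rather than a classifier.

```python
import time


def time_call(func, *args, repeats=5):
    """Call func(*args) `repeats` times; return (last result, best elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


# Demo with a cheap built-in function.
result, elapsed = time_call(sorted, [3, 1, 2])
print(f"result={result}, best elapsed={elapsed:.6f}s")
```

In the notebook's setting this could be used as `knn_pred, knn_prediction_time = time_call(knn.predict, features_test)`.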


Confusion Matrix - K-NN

In [48]:
print(confusion_matrix(target_test, knn_pred))

[[32  9]
 [11 39]]


Classification Report - K-NN

In [49]:
print(classification_report(target_test, knn_pred))

              precision    recall  f1-score   support

           0       0.74      0.78      0.76        41
           1       0.81      0.78      0.80        50

    accuracy                           0.78        91
   macro avg       0.78      0.78      0.78        91
weighted avg       0.78      0.78      0.78        91



Collecting the remaining metrics

In [50]:
knn_result = [accuracy_score(target_test, knn_pred), precision_score(target_test, knn_pred), recall_score(target_test, knn_pred), knn_training_time, knn_prediction_time]
result_dataframe.append(knn_result)
result_index.append('K-NN')
print(knn_result)


[0.7802197802197802, 0.8125, 0.78, 0.008011341094970703, 0.016013145446777344]
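The first three values above can be cross-checked by hand from the K-NN confusion matrix. In scikit-learn's layout for binary labels 0 and 1, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. A small sketch (the function name `metrics_from_confusion` is made up for illustration):

```python
def metrics_from_confusion(matrix):
    """Compute accuracy, precision, and recall from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,  # correct predictions over all samples
        "precision": tp / (tp + fp),    # predicted positives that are truly positive
        "recall": tp / (tp + fn),       # true positives that were found
    }


# K-NN confusion matrix reported earlier in this notebook.
knn_metrics = metrics_from_confusion([[32, 9], [11, 39]])
print(knn_metrics)
```

Here accuracy is 71/91, precision is 39/48, and recall is 39/50, which agrees with the values from `accuracy_score`, `precision_score`, and `recall_score`.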


### SVM

Model

In [51]:
svc = SVC(gamma=2, C=1)

Training

In [52]:
svc_fit_start = time.time()
svc.fit(features_train, target_train)
svc_fit_end = time.time()

svc_training_time = svc_fit_end - svc_fit_start

print(f'Training time: {svc_training_time}s')

Training time: 0.007027864456176758s




Classification

In [53]:
svc_predict_start = time.time()
svc_pred = svc.predict(features_test)
svc_predict_end = time.time()

svc_prediction_time = svc_predict_end - svc_predict_start
print(f'Prediction time: {svc_prediction_time}s')

Prediction time: 0.003755807876586914s


Confusion Matrix - SVC

In [54]:
print(confusion_matrix(target_test, svc_pred))

[[34  7]
 [ 9 41]]


Classification Report - SVC

In [55]:
print(classification_report(target_test, svc_pred))

              precision    recall  f1-score   support

           0       0.79      0.83      0.81        41
           1       0.85      0.82      0.84        50

    accuracy                           0.82        91
   macro avg       0.82      0.82      0.82        91
weighted avg       0.83      0.82      0.82        91



Collecting the remaining metrics

In [56]:
svc_result = [accuracy_score(target_test, svc_pred), precision_score(target_test, svc_pred), recall_score(target_test, svc_pred), svc_training_time, svc_prediction_time]
result_dataframe.append(svc_result)
result_index.append('SVM')
print(svc_result)

[0.8241758241758241, 0.8541666666666666, 0.82, 0.007027864456176758, 0.003755807876586914]


### Random Forest



Model

In [57]:
rf = RandomForestClassifier(max_depth=5, n_estimators=10)

Training

In [58]:
rf_fit_start = time.time()
rf.fit(features_train, target_train)
rf_fit_end = time.time()

rf_training_time = rf_fit_end - rf_fit_start

print(f'Training time: {rf_training_time}s')

Training time: 0.023244142532348633s


  


Classification

In [59]:
rf_predict_start = time.time()
rf_pred = rf.predict(features_test)
rf_predict_end = time.time()

rf_prediction_time = rf_predict_end - rf_predict_start
print(f'Prediction time: {rf_prediction_time}s')

Prediction time: 0.004756450653076172s


Confusion Matrix - Random Forest

In [60]:
print(confusion_matrix(target_test, rf_pred))

[[32  9]
 [10 40]]


Classification Report - Random Forest

In [61]:
print(classification_report(target_test, rf_pred))

              precision    recall  f1-score   support

           0       0.76      0.78      0.77        41
           1       0.82      0.80      0.81        50

    accuracy                           0.79        91
   macro avg       0.79      0.79      0.79        91
weighted avg       0.79      0.79      0.79        91



Collecting the remaining metrics

In [62]:
rf_result = [accuracy_score(target_test, rf_pred), precision_score(target_test, rf_pred), recall_score(target_test, rf_pred), rf_training_time, rf_prediction_time]
result_dataframe.append(rf_result)
result_index.append('Random Forest')
print(rf_result)

[0.7912087912087912, 0.8163265306122449, 0.8, 0.023244142532348633, 0.004756450653076172]


### MLP

Model

In [63]:
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=50000, batch_size=64, alpha=1e-1)

Training

In [64]:
mlp_fit_start = time.time()
mlp.fit(features_train, target_train)
mlp_fit_end = time.time()

mlp_training_time = mlp_fit_end - mlp_fit_start

print(f'Training time: {mlp_training_time}s')

Training time: 1.660374641418457s


Classification

In [65]:
mlp_predict_start = time.time()
mlp_pred = mlp.predict(features_test)
mlp_predict_end = time.time()

mlp_prediction_time = mlp_predict_end - mlp_predict_start
print(f'Prediction time: {mlp_prediction_time}s')

Prediction time: 0.004683017730712891s


Confusion Matrix - MLP

In [66]:
print(confusion_matrix(target_test, mlp_pred))

[[32  9]
 [ 7 43]]


Classification Report - MLP

In [67]:
print(classification_report(target_test, mlp_pred))

              precision    recall  f1-score   support

           0       0.82      0.78      0.80        41
           1       0.83      0.86      0.84        50

    accuracy                           0.82        91
   macro avg       0.82      0.82      0.82        91
weighted avg       0.82      0.82      0.82        91



Collecting the remaining metrics

In [68]:
mlp_result = [accuracy_score(target_test, mlp_pred), precision_score(target_test, mlp_pred), recall_score(target_test, mlp_pred), mlp_training_time, mlp_prediction_time]
result_dataframe.append(mlp_result)
result_index.append('MLP')
print(mlp_result)

[0.8241758241758241, 0.8269230769230769, 0.86, 1.660374641418457, 0.004683017730712891]


### Gradient Boosting

Model

In [69]:
gBoosting = GradientBoostingClassifier(n_estimators=10)

Training

In [70]:
# Fit once, inside the timed region.
gb_fit_start = time.time()
gBoosting.fit(features_train, target_train)
gb_fit_end = time.time()

gb_training_time = gb_fit_end - gb_fit_start

print(f'Training time: {gb_training_time}s')

Training time: 0.011766910552978516s




Classification

In [71]:
gb_predict_start = time.time()
gBoosting_pred = gBoosting.predict(features_test)
gb_predict_end = time.time()

gb_prediction_time = gb_predict_end - gb_predict_start
print(f'Prediction time: {gb_prediction_time}s')

Prediction time: 0.0034372806549072266s


Confusion Matrix - Gradient Boosting

In [72]:
print(confusion_matrix(target_test, gBoosting_pred))

[[33  8]
 [10 40]]


Classification Report - Gradient Boosting

In [73]:
print(classification_report(target_test, gBoosting_pred))

              precision    recall  f1-score   support

           0       0.77      0.80      0.79        41
           1       0.83      0.80      0.82        50

    accuracy                           0.80        91
   macro avg       0.80      0.80      0.80        91
weighted avg       0.80      0.80      0.80        91



Collecting the remaining metrics

In [74]:
gb_result = [accuracy_score(target_test, gBoosting_pred), precision_score(target_test, gBoosting_pred), recall_score(target_test, gBoosting_pred), gb_training_time, gb_prediction_time]
result_dataframe.append(gb_result)
result_index.append('Gradient Boosting')
print(gb_result)

[0.8021978021978022, 0.8333333333333334, 0.8, 0.011766910552978516, 0.0034372806549072266]


### Comparison table of the classifiers

Compiling the data

In [75]:
final_df = pd.DataFrame(result_dataframe, columns = col_names, index=result_index)

Displaying the comparison table

In [76]:
final_df.head()

|   | Accuracy | Precision | Recall | Training time (s) | Prediction time (s) |
|---|----------|-----------|--------|-------------------|---------------------|
| K-NN | 0.78022 | 0.8125 | 0.78 | 0.008011 | 0.016013 |
| SVM | 0.824176 | 0.854167 | 0.82 | 0.007028 | 0.003756 |
| Random Forest | 0.791209 | 0.816327 | 0.8 | 0.023244 | 0.004756 |
| MLP | 0.824176 | 0.826923 | 0.86 | 1.660375 | 0.004683 |
| Gradient Boosting | 0.802198 | 0.833333 | 0.8 | 0.011767 | 0.003437 |
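To read the table at a glance, the best classifier per column can be extracted with `idxmax` (higher is better) and `idxmin` (lower is better). The sketch below rebuilds the table from the values shown above; the snake_case column names are chosen here for illustration and are not the labels used in the notebook.

```python
import pandas as pd

# Values copied from the comparison table of this run.
results = pd.DataFrame(
    {
        "accuracy": [0.780220, 0.824176, 0.791209, 0.824176, 0.802198],
        "precision": [0.812500, 0.854167, 0.816327, 0.826923, 0.833333],
        "recall": [0.78, 0.82, 0.80, 0.86, 0.80],
        "training_time": [0.008011, 0.007028, 0.023244, 1.660375, 0.011767],
        "prediction_time": [0.016013, 0.003756, 0.004756, 0.004683, 0.003437],
    },
    index=["K-NN", "SVM", "Random Forest", "MLP", "Gradient Boosting"],
)

higher_is_better = ["accuracy", "precision", "recall"]
lower_is_better = ["training_time", "prediction_time"]

# idxmax/idxmin return the row label of the first best value in each column.
best = pd.concat(
    [results[higher_is_better].idxmax(), results[lower_is_better].idxmin()]
)
print(best)
```

SVM and MLP tie on accuracy in this run; `idxmax` reports only the first of the tied rows, so ties are worth checking separately.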


## Final remarks

After several runs, I found that the best accuracies came from SVM, MLP, and Random Forest, with SVM standing out the most. K-NN was always the worst on this criterion. SVM and Gradient Boosting, in that order, consistently had the best training and prediction times, while MLP was almost always the slowest. For precision, that is, not classifying as having heart disease a patient who does not have it, SVM and Gradient Boosting stood out. Finally, for recall, the classifier's ability to find all positive samples, MLP stood out in most of the runs I followed.
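Results change between runs because Random Forest, MLP, and Gradient Boosting are created without a fixed `random_state`. One way to make the "several runs" comparison repeatable is to loop over explicit seeds and report the mean and spread. The sketch below is illustrative only: it uses synthetic data from `make_classification` instead of heart.csv, varies the split seed as well (the notebook keeps it at 42), and shows only Random Forest.

```python
from statistics import mean, stdev

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: 300 rows, 13 features).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)


def accuracy_for_seed(seed):
    """Train and evaluate one Random Forest with every source of randomness seeded."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


scores = [accuracy_for_seed(seed) for seed in range(5)]
print(f"accuracy: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```

The same loop could be applied to each of the five classifiers so that the comparison rests on averages rather than on a single run.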