## Challenge Exercise: Diabetes Detection with Supervised Classifiers

## Objective
Build a complete classification workflow to predict a diabetes diagnosis in women of Pima Indian heritage from clinical variables. The exercise covers every step from data preparation to a comparative evaluation of several models.

#### **Dataset used:** *Pima Indians Diabetes Database*
The dataset is available in this repository.

### Attribute Dictionary
| Attribute | Column in `diabetes.csv` | Description |
|-----------|--------------------------|-------------|
| `preg` | `Pregnancies` | Number of pregnancies |
| `plas` | `Glucose` | Plasma glucose concentration 2 hours into an oral glucose tolerance test |
| `pres` | `BloodPressure` | Diastolic blood pressure (mm Hg) |
| `skin` | `SkinThickness` | Triceps skinfold thickness (mm) |
| `insu` | `Insulin` | 2-hour serum insulin (µU/mL) |
| `mass` | `BMI` | Body mass index (BMI, kg/m²) |
| `pedi` | `DiabetesPedigreeFunction` | Diabetes pedigree function (family history) |
| `age` | `Age` | Age (years) |
| `class` | `Outcome` | Diagnosis (0 = negative, 1 = positive for diabetes) |

**Answer the following questions**
   - Which model achieved the best **recall**? Why does that matter for diagnosis?
   - Did any model show high **precision** but low **recall**? What does that mean?
   - Which model had the best overall balance (F1-score)?
   - Which model would you recommend for production use in a hospital?
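
The metrics named in these questions can all be derived from the four cells of a binary confusion matrix. The sketch below works through that arithmetic on a handful of made-up labels (`y_true` and `y_pred` are hypothetical and not drawn from the dataset), assuming class 1 is the positive diagnosis.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels (hypothetical): 1 = diabetic, 0 = non-diabetic
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the patients flagged positive, the share that is truly positive
recall = tp / (tp + fn)     # of the truly positive patients, the share that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a screening setting a false negative (a diabetic patient who is missed) is usually costlier than a false positive, which is why recall on the positive class tends to carry the most weight.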

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score,
    recall_score, f1_score, confusion_matrix
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [12]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

path = "/content/drive/MyDrive/DATA_SCIENCE"

Mounted at /content/drive


In [25]:
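# Expected columns, for reference only: diabetes.csv already has a header row
column_names = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'
]

df = pd.read_csv(path + '/diabetes.csv', sep=',')

# A zero in these measurements is physiologically implausible,
# so treat it as a missing value
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

# Count missing values per column
df.isnull().sum()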

Column,Missing values
Pregnancies,0
Glucose,5
BloodPressure,35
SkinThickness,227
Insulin,374
BMI,11
DiabetesPedigreeFunction,0
Age,0
Outcome,0
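
With this many gaps in `Insulin` and `SkinThickness`, dropping every incomplete row discards a large share of the data. One alternative, sketched below on a tiny made-up frame (`toy` and its values are hypothetical), is to fill each gap with the column median. This is an illustration of the trade-off, not the approach the notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for df (values are made up)
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, 168.0],
    "Outcome": [1, 0, 1, 0, 1],
})

# Option A: drop every row that has at least one missing value
dropped = toy.dropna()

# Option B: fill each column's gaps with that column's median
imputed = toy.fillna(toy.median(numeric_only=True))

print(f"original={len(toy)} rows, dropped={len(dropped)} rows, imputed={len(imputed)} rows")
```

If imputation is used in a real evaluation, the medians should be computed on the training split only, so that no information from the test set reaches the model.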


# Correlation Analysis and Data Cleaning

In [26]:
# DataFrame with the rows containing missing values removed
df_clean = df.dropna()

# Select only numerical features for correlation analysis
numerical_df = df_clean.select_dtypes(include=['number'])

# Calculate the correlation matrix
corr = numerical_df.corr()

# Display the correlation matrix
corr

Feature,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.198291,0.213355,0.093209,0.078984,-0.025347,0.007562,0.679608,0.256566
Glucose,0.198291,1.0,0.210027,0.198856,0.581223,0.209516,0.14018,0.343641,0.515703
BloodPressure,0.213355,0.210027,1.0,0.232571,0.098512,0.304403,-0.015971,0.300039,0.192673
SkinThickness,0.093209,0.198856,0.232571,1.0,0.182199,0.664355,0.160499,0.167761,0.255936
Insulin,0.078984,0.581223,0.098512,0.182199,1.0,0.226397,0.135906,0.217082,0.301429
BMI,-0.025347,0.209516,0.304403,0.664355,0.226397,1.0,0.158771,0.069814,0.270118
DiabetesPedigreeFunction,0.007562,0.14018,-0.015971,0.160499,0.135906,0.158771,1.0,0.085029,0.20933
Age,0.679608,0.343641,0.300039,0.167761,0.217082,0.069814,0.085029,1.0,0.350804
Outcome,0.256566,0.515703,0.192673,0.255936,0.301429,0.270118,0.20933,0.350804,1.0
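
A quick way to read the matrix above is to rank the features by their correlation with `Outcome`. The sketch below copies the `Outcome` row by hand so that it runs on its own; inside the notebook the same Series could presumably be obtained with `corr['Outcome'].drop('Outcome')`. The names `outcome_corr` and `ranked` are illustrative.

```python
import pandas as pd

# Correlation of each feature with Outcome, copied from the matrix above
outcome_corr = pd.Series({
    "Pregnancies": 0.256566,
    "Glucose": 0.515703,
    "BloodPressure": 0.192673,
    "SkinThickness": 0.255936,
    "Insulin": 0.301429,
    "BMI": 0.270118,
    "DiabetesPedigreeFunction": 0.209330,
    "Age": 0.350804,
})

# Strongest linear association with the diagnosis first
ranked = outcome_corr.sort_values(ascending=False)
print(ranked.round(2))
```

`Glucose` shows the strongest linear association with the diagnosis (about 0.52), followed by `Age` (about 0.35). Correlation only captures linear, pairwise relationships, so a low value does not by itself mean a feature is useless to a non-linear classifier.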


In [28]:
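# Separate the features (X) from the target (y)
X = df_clean.drop(columns='Outcome')
y = df_clean['Outcome']

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)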

# Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)
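
Fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. One common way to avoid this, sketched below on synthetic data, is to split first and wrap the scaler and the classifier in a scikit-learn pipeline so that the scaler is fitted on the training rows only. `X_demo`, `y_demo`, and `pipe` are hypothetical names, and `make_classification` merely stands in for the eight clinical features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight clinical features (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first; stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics to transform the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)

print(f"Test accuracy: {test_accuracy:.2f}")
```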

# Classification Algorithms

In [None]:
modelos = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel='rbf'),
    "K-NN": KNeighborsClassifier(n_neighbors=5)
}

# Per-Model Evaluation

In [None]:
# Display labels for Outcome 0 and 1
class_names = ['Negative', 'Positive']

for nome, modelo in modelos.items():
    modelo.fit(X_train, y_train)
    y_pred = modelo.predict(X_test)

    # Precision, recall, and F1 are reported for the positive class (Outcome = 1)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
    rec = recall_score(y_test, y_pred, pos_label=1, zero_division=0)
    f1 = f1_score(y_test, y_pred, pos_label=1, zero_division=0)
    cm = confusion_matrix(y_test, y_pred)

    print(f"\n {nome}")
    print(f"Accuracy : {acc:.2f}")
    print(f"Precision: {prec:.2f}")
    print(f"Recall   : {rec:.2f}")
    print(f"F1-Score : {f1:.2f}")

    # Confusion matrix
    plt.figure(figsize=(4, 3))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title(f'Confusion Matrix - {nome}')
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.show()
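
To compare the models side by side, the per-model metrics can be collected into a single table rather than read off the printed output. The sketch below does this on synthetic data with two of the classifiers; `candidates`, `summary`, and the generated data are illustrative names and values, not part of the notebook, and the metrics are computed for the positive class.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, mildly imbalanced stand-in for the diabetes data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
}

# One row of positive-class metrics per model
rows = []
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat, zero_division=0),
        "f1": f1_score(y_te, y_hat, zero_division=0),
    })

# Sort by recall, the metric the first exercise question asks about
summary = pd.DataFrame(rows).set_index("model").sort_values("recall", ascending=False)
print(summary.round(2))
```

Sorting by a different column (for example `f1`) gives the ranking needed for the other questions.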