<a href="https://colab.research.google.com/github/anapaaula/mlp-raio-x/blob/main/Projeto_Classifica%C3%A7%C3%A3o_de_Raio_X_com_MLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# X-Ray Classification

### Importing libraries

In [22]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Importing the data

The goal of this project is to build a Multi-Layer Perceptron (MLP) that classifies chest X-ray images into two categories: NORMAL or PNEUMONIA. The classifier uses visual patterns extracted from the X-ray exams to support the diagnosis of pneumonia.

The images used in the project are available here: [link](https://drive.google.com/drive/folders/1dkmO4y-vzo1SS7cmbT-CumymqUfPWlrf?usp=sharing)

The images were converted to numeric vectors and can be loaded directly from the CSV files: [train](https://drive.google.com/file/d/1p8QQIfkCQxjS1sSR-xtIU5qXsqcXqsPQ/view?usp=sharing) and [test](https://drive.google.com/file/d/1K0eu-4H28VocKWKyZcT9enSx9vJM0R7Q/view?usp=sharing).
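The CSV files have 784 pixel columns, which is consistent with 28 × 28 grayscale images scaled to the [0, 1] range (784 = 28 × 28, and values such as 0.043137 equal 11/255). The exact preprocessing is not shown in this notebook, so the following is only a minimal sketch under that assumption; `flatten_image` is a hypothetical helper, not the code used to build the files.

```python
# Hypothetical helper: turn a 2-D grid of 8-bit grayscale pixels into a flat,
# normalized feature vector. The 28x28 size and the division by 255 are
# assumptions inferred from the CSV columns, not confirmed by the source.
def flatten_image(pixels):
    """Flatten rows of 0-255 integers into one list of floats in [0, 1]."""
    return [value / 255 for row in pixels for value in row]

# Toy 28x28 "image" whose top-left pixel has intensity 11.
side = 28
image = [[0] * side for _ in range(side)]
image[0][0] = 11

vector = flatten_image(image)
print(len(vector))          # 784 features, one per pixel
print(round(vector[0], 6))  # 0.043137
```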

In [23]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [24]:
file_path_train = '/content/drive/MyDrive/CDP/x_ray_train.csv'
df_train = pd.read_csv(file_path_train)

In [25]:
file_path_test = '/content/drive/MyDrive/CDP/x_ray_test.csv'
df_test = pd.read_csv(file_path_test)

In [26]:
df_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,label
0,0.043137,0.117647,0.078431,0.047059,0.05098,0.203922,0.27451,0.372549,0.470588,0.529412,...,0.552941,0.52549,0.482353,0.439216,0.352941,0.211765,0.05098,0.066667,0.066667,1
1,0.427451,0.47451,0.435294,0.329412,0.360784,0.372549,0.368627,0.4,0.419608,0.345098,...,0.760784,0.733333,0.701961,0.662745,0.596078,0.529412,0.176471,0.086275,0.188235,1
2,0.403922,0.32549,0.341176,0.360784,0.337255,0.317647,0.329412,0.419608,0.54902,0.588235,...,0.737255,0.705882,0.682353,0.666667,0.65098,0.607843,0.533333,0.196078,0.015686,1
3,0.294118,0.313725,0.305882,0.32549,0.364706,0.403922,0.533333,0.423529,0.380392,0.380392,...,0.592157,0.568627,0.54902,0.521569,0.486275,0.462745,0.431373,0.443137,0.407843,1
4,0.007843,0.117647,0.301961,0.4,0.556863,0.541176,0.490196,0.482353,0.537255,0.462745,...,0.505882,0.509804,0.627451,0.572549,0.454902,0.564706,0.47451,0.062745,0.0,1


In [27]:
df_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,label
0,0.145098,0.207843,0.392157,0.396078,0.34902,0.372549,0.356863,0.368627,0.517647,0.678431,...,0.67451,0.686275,0.67451,0.627451,0.580392,0.521569,0.458824,0.423529,0.164706,1
1,0.05098,0.207843,0.368627,0.454902,0.415686,0.431373,0.490196,0.517647,0.490196,0.588235,...,0.835294,0.788235,0.768627,0.752941,0.713725,0.654902,0.356863,0.0,0.0,1
2,0.207843,0.227451,0.333333,0.223529,0.247059,0.219608,0.196078,0.321569,0.396078,0.34902,...,0.623529,0.607843,0.588235,0.607843,0.533333,0.443137,0.192157,0.070588,0.105882,1
3,0.529412,0.333333,0.282353,0.14902,0.176471,0.196078,0.168627,0.152941,0.231373,0.196078,...,0.545098,0.494118,0.439216,0.376471,0.305882,0.439216,0.047059,0.043137,0.058824,1
4,0.007843,0.082353,0.168627,0.133333,0.105882,0.105882,0.129412,0.184314,0.290196,0.482353,...,0.866667,0.847059,0.827451,0.772549,0.760784,0.427451,0.098039,0.117647,0.011765,0


In [28]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5234 entries, 0 to 5233
Columns: 785 entries, 0 to label
dtypes: float64(784), int64(1)
memory usage: 31.3 MB


In [29]:
df_train.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,label
count,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,...,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0,5234.0
mean,0.22409,0.285522,0.327118,0.336399,0.349791,0.357629,0.368767,0.392606,0.43837,0.500947,...,0.693873,0.677663,0.657491,0.629152,0.579572,0.479711,0.313809,0.151523,0.083151,0.74188
std,0.183808,0.165954,0.161803,0.171717,0.178899,0.185078,0.195747,0.206407,0.210975,0.200511,...,0.119645,0.12203,0.124553,0.132123,0.155744,0.1982,0.216578,0.166743,0.121439,0.437642
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.078431,0.172549,0.223529,0.215686,0.219608,0.223529,0.223529,0.239216,0.286275,0.360784,...,0.635294,0.615686,0.592157,0.560784,0.501961,0.360784,0.109804,0.019608,0.0,0.0
50%,0.192157,0.27451,0.321569,0.329412,0.345098,0.356863,0.372549,0.4,0.454902,0.523529,...,0.713725,0.698039,0.682353,0.65098,0.607843,0.52549,0.305882,0.094118,0.039216,1.0
75%,0.32549,0.384314,0.431373,0.462745,0.482353,0.498039,0.513725,0.556863,0.607843,0.654902,...,0.772549,0.760784,0.745098,0.72549,0.690196,0.627451,0.498039,0.227451,0.113725,1.0
max,0.941176,0.917647,0.937255,0.929412,0.952941,0.933333,0.980392,0.960784,0.992157,0.960784,...,0.992157,0.976471,0.941176,0.945098,0.952941,0.972549,0.984314,0.886275,1.0,1.0


In [30]:
df_train.isna().sum()

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
780,0
781,0
782,0
783,0


### Balancing the data

In [31]:
x_train, y_train = df_train.drop(columns=["label"]), df_train["label"]

In [32]:
x_test, y_test = df_test.drop(columns=["label"]), df_test["label"]

In [33]:
print(y_train.value_counts())

label
1    3883
0    1351
Name: count, dtype: int64


The training data is noticeably imbalanced in the output variable (`label`): value 1 (pneumonia) occurs 3883 times, against 1351 for value 0 (no pneumonia). This imbalance can bias the model toward the majority class and hurt its ability to detect the minority class.

To mitigate this and help the model treat both classes evenly, we apply SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new synthetic instances of the minority class, balancing the training data and reducing the risk of bias during training.
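The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. The snippet below is a simplified illustration in pure Python with made-up 2-D points; it is not the `imblearn` implementation, and `synthesize` is a hypothetical helper.

```python
import random

def synthesize(sample, neighbour, rng):
    """Return a point on the segment between `sample` and `neighbour`."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up minority-class points (illustrative only, not taken from the data).
sample = [0.20, 0.40]
neighbour = [0.30, 0.10]

new_point = synthesize(sample, neighbour, rng)
print(new_point)

# In the training set above, balancing the classes requires this many
# synthetic samples of class 0:
n_majority, n_minority = 3883, 1351
print(n_majority - n_minority)  # 2532
```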

In [21]:
smote = SMOTE()
x_train_balanced, y_train_balanced = smote.fit_resample(x_train, y_train)

In [34]:
print(y_train_balanced.value_counts())

label
1    3883
0    3883
Name: count, dtype: int64


## Training MLPs

Three models will be trained with the following configurations:

* Model 1: One hidden layer with 50 neurons, using the logistic (sigmoid) activation function.

* Model 2: Two hidden layers, with 50 neurons in the first and 25 in the second, using the ReLU activation function.

* Model 3: Three hidden layers, with 50 neurons in the first, 25 in the second, and 13 in the third, using the tanh activation function.
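To get a rough sense of how large each configuration is, the number of trainable parameters (weights plus biases) can be counted by hand. The sketch below assumes 784 input features, as in the CSV files, and a single output unit, which is how scikit-learn's `MLPClassifier` handles binary problems; `count_parameters` is a hypothetical helper.

```python
def count_parameters(n_inputs, hidden_layers, n_outputs=1):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    # Each layer has (inputs x outputs) weights plus one bias per output.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

configs = {
    "Model 1": (50,),
    "Model 2": (50, 25),
    "Model 3": (50, 25, 13),
}

for name, hidden in configs.items():
    print(name, count_parameters(784, hidden))
# Model 1 39301
# Model 2 40551
# Model 3 40877
```

Most parameters sit in the first layer, so the three models differ far less in size than their layer counts might suggest.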

### Model 1

In [35]:
# One hidden layer of 50 neurons; the trailing comma makes (50,) a tuple.
mlp = MLPClassifier(hidden_layer_sizes=(50,), activation='logistic', max_iter=500, verbose=True)
mlp.fit(x_train_balanced, y_train_balanced)
y_pred = mlp.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)

Iteration 1, loss = 0.60142318
Iteration 2, loss = 0.43074087
Iteration 3, loss = 0.32482165
Iteration 4, loss = 0.26327788
Iteration 5, loss = 0.22262926
Iteration 6, loss = 0.19431250
Iteration 7, loss = 0.17405242
Iteration 8, loss = 0.16277690
Iteration 9, loss = 0.14799245
Iteration 10, loss = 0.14265007
Iteration 11, loss = 0.13452570
Iteration 12, loss = 0.12946276
Iteration 13, loss = 0.12460468
Iteration 14, loss = 0.12350514
Iteration 15, loss = 0.12067226
Iteration 16, loss = 0.11742632
Iteration 17, loss = 0.11549560
Iteration 18, loss = 0.11420150
Iteration 19, loss = 0.11346591
Iteration 20, loss = 0.11083935
Iteration 21, loss = 0.10913015
Iteration 22, loss = 0.10732810
Iteration 23, loss = 0.10763223
Iteration 24, loss = 0.10495503
Iteration 25, loss = 0.10420122
Iteration 26, loss = 0.10625244
Iteration 27, loss = 0.10756828
Iteration 28, loss = 0.10754508
Iteration 29, loss = 0.10489851
Iteration 30, loss = 0.11059751
Iteration 31, loss = 0.10257199
Iteration 32, los

### Model 2

In [36]:
mlp = MLPClassifier(hidden_layer_sizes=(50, 25), activation='relu', max_iter=500, verbose=True)
mlp.fit(x_train_balanced, y_train_balanced)
y_pred = mlp.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)

Iteration 1, loss = 0.49078654
Iteration 2, loss = 0.23553112
Iteration 3, loss = 0.16343278
Iteration 4, loss = 0.13815246
Iteration 5, loss = 0.12272503
Iteration 6, loss = 0.12205482
Iteration 7, loss = 0.11538695
Iteration 8, loss = 0.11259879
Iteration 9, loss = 0.12141650
Iteration 10, loss = 0.11177937
Iteration 11, loss = 0.10647009
Iteration 12, loss = 0.10323007
Iteration 13, loss = 0.10245387
Iteration 14, loss = 0.09755931
Iteration 15, loss = 0.09952236
Iteration 16, loss = 0.09888873
Iteration 17, loss = 0.09463584
Iteration 18, loss = 0.10311815
Iteration 19, loss = 0.11110395
Iteration 20, loss = 0.09773008
Iteration 21, loss = 0.08826620
Iteration 22, loss = 0.08570819
Iteration 23, loss = 0.08865987
Iteration 24, loss = 0.08956014
Iteration 25, loss = 0.09482459
Iteration 26, loss = 0.09570384
Iteration 27, loss = 0.09092536
Iteration 28, loss = 0.08063353
Iteration 29, loss = 0.09062969
Iteration 30, loss = 0.07922107
Iteration 31, loss = 0.08269551
Iteration 32, los

### Model 3

In [37]:
mlp = MLPClassifier(hidden_layer_sizes=(50, 25, 13), activation='tanh', max_iter=500, verbose=True)
mlp.fit(x_train_balanced, y_train_balanced)
y_pred = mlp.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)

Iteration 1, loss = 0.51607151
Iteration 2, loss = 0.25613901
Iteration 3, loss = 0.18501909
Iteration 4, loss = 0.15356598
Iteration 5, loss = 0.13647092
Iteration 6, loss = 0.12161011
Iteration 7, loss = 0.12117365
Iteration 8, loss = 0.11264876
Iteration 9, loss = 0.14113042
Iteration 10, loss = 0.10748791
Iteration 11, loss = 0.11685996
Iteration 12, loss = 0.10187794
Iteration 13, loss = 0.09760666
Iteration 14, loss = 0.09307357
Iteration 15, loss = 0.09033339
Iteration 16, loss = 0.09738205
Iteration 17, loss = 0.09042392
Iteration 18, loss = 0.08939183
Iteration 19, loss = 0.09523366
Iteration 20, loss = 0.08755649
Iteration 21, loss = 0.08151276
Iteration 22, loss = 0.08363506
Iteration 23, loss = 0.08159749
Iteration 24, loss = 0.07263148
Iteration 25, loss = 0.07148891
Iteration 26, loss = 0.07509087
Iteration 27, loss = 0.06578545
Iteration 28, loss = 0.06863482
Iteration 29, loss = 0.06690132
Iteration 30, loss = 0.07143136
Iteration 31, loss = 0.07483224
Iteration 32, los

### Observed results

Model 3 shows the best balance between precision, recall, and overall accuracy, making it the most effective at classifying both classes in the dataset. Compared with Model 1, Models 2 and 3 have better recall for class 0 (no pneumonia), meaning they identify negative cases more reliably. For class 1 (pneumonia), all three models have similar recall, which suggests all of them are effective at identifying pneumonia cases; Model 1 nevertheless stands out with the highest class 1 recall.

The network architecture and the activation function play a key role in model performance. More complex models, with more layers and neurons (such as Models 2 and 3), have greater learning capacity and can capture more complex representations of the data, but they can also increase the risk of overfitting.
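One common way to limit overfitting in larger networks is early stopping, which `MLPClassifier` supports through its `early_stopping`, `validation_fraction`, and `n_iter_no_change` parameters. The sketch below uses synthetic data from `make_classification` rather than the X-ray files, and the parameter values are illustrative assumptions, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption; not the X-ray dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(50, 25),
    activation='relu',
    max_iter=500,
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # 10% used as the validation split
    n_iter_no_change=10,       # stop after 10 epochs without improvement
    random_state=0,
)
clf.fit(X, y)

print("Epochs run:", clf.n_iter_)
print("Best validation accuracy:", clf.best_validation_score_)
```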

When choosing the best model, it is important to keep in mind that the ideal model should be able to identify all pneumonia cases, even if that results in more false positives. Detecting every case matters because the cost of missing a pneumonia case (a false negative) can be higher than the cost of a few false positives.
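When missing a pneumonia case is the costlier error, one option is to lower the decision threshold applied to the predicted probability of class 1, trading precision for recall. The sketch below uses made-up labels and probabilities purely to illustrate the trade-off; with a fitted `MLPClassifier`, such probabilities would come from `predict_proba`.

```python
def precision_recall(y_true, y_prob, threshold):
    """Compute precision and recall of class 1 at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example values (illustrative only, not model output).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.45, 0.2, 0.1]

for threshold in (0.5, 0.35):
    p, r = precision_recall(y_true, y_prob, threshold)
    print(threshold, round(p, 2), round(r, 2))
# 0.5 0.75 0.75
# 0.35 0.67 1.0
```

In this toy case, lowering the threshold recovers the missed positive at the price of one extra false positive.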



## Comparing the best model with an ensemble

In [39]:
rf = RandomForestClassifier()

# Grid of hyperparameter values to search over
hiperparametros = {
    'n_estimators': [50, 100, 200],
    'max_depth': [1, 5, 10],
}

rf_gridsearch = GridSearchCV(rf, hiperparametros, cv=5, scoring='accuracy')
rf_gridsearch.fit(x_train_balanced, y_train_balanced)
rf_best = rf_gridsearch.best_estimator_
rf_predict = rf_best.predict(x_test)

print("Random Forest:")
print("Best hyperparameters:", rf_gridsearch.best_params_)

accuracy = accuracy_score(y_test, rf_predict)
precision = precision_score(y_test, rf_predict)
recall = recall_score(y_test, rf_predict)
f1 = f1_score(y_test, rf_predict)

evaluation = f"Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1 Score: {f1}"

print(evaluation)

Random Forest:
Best hyperparameters: {'max_depth': 10, 'n_estimators': 100}
Accuracy: 0.8076923076923077
Precision: 0.7777777777777778
Recall: 0.9692307692307692
F1 Score: 0.863013698630137


### Comparing the Random Forest with the best MLP

The Random Forest model outperformed Model 3 in several important respects. With an accuracy of about 81%, compared with 77% for the MLP, the Random Forest is more accurate overall. It also reaches a recall of 97% for the positive class (class 1), whereas the MLP's recall is uneven across classes.

The Random Forest's F1 score is also higher and more balanced than the MLP's, which varies considerably between classes. Taken together, the recall and F1 score suggest that the Random Forest performs more robustly and evenly on this task.
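As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported value can be reproduced from the other two metrics printed above. The snippet simply redoes that arithmetic with the numbers from the Random Forest output.

```python
# Values copied from the Random Forest output above.
precision = 0.7777777777777778
recall = 0.9692307692307692

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.863
```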

Comparing the two models, the Random Forest is the more effective choice for this dataset. The MLP, an artificial neural network, may do better on more complex data, but it is sensitive to hyperparameters, which can significantly affect its performance. It is also more prone to overfitting, which should be monitored carefully.
