$$\Large \textit{C0708 | Reconocimiento de Patrones}$$

$$\large \textbf{Challenge Nº2 | KNN, Naive Bayes}$$


- _Alessandra Mercedes Aldave Javier_

# 1. Introduction

Breast cancer is a disease in which the cells of the breast tissue grow and multiply uncontrollably [1]. If the condition goes unchecked, the resulting tumors can metastasize, compromising the patient's prognosis [2].

Early diagnosis through regular *screenings* reduces the risk of dying from breast cancer [3]. Likewise, assessing the size of the tumor and how far it has spread are among the most relevant factors involved in breast cancer prediction, especially in women [4].

### 1.1. Problem statement

The use of Artificial Intelligence has enabled major advances in breast cancer diagnosis. One of the most widely performed screenings is mammography [5]; for diagnosis, however, the histopathological study is considered the gold standard. Even so, this procedure requires human resources, is tedious, and is subject to variability in its interpretation [6].

Given this situation, a first step is to automate the interpretation process to improve the quality and speed of diagnosis.

# 2. Methodology

### Dataset
The dataset used contains parameters obtained by processing histopathological images. It has 30 features and 1 target, which corresponds to the diagnosis of the exam ('M' for malignant or 'B' for benign).
The dataset is available at: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

### Machine Learning models for classification
This challenge builds supervised learning models for the classification task, using K-Nearest Neighbors and Gaussian Naive Bayes separately.

The K-Nearest Neighbors algorithm is a classifier that uses the proximity between the data points in the dataset to predict the class of an individual data point. Moreover, because it is a non-parametric algorithm, it makes no assumptions about the distribution of the data it works with, unlike Naive Bayes in its Gaussian implementation.
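
To make the proximity idea concrete, here is a minimal from-scratch sketch of the K-Nearest Neighbors vote. It is not the scikit-learn implementation used later in this notebook: the function name `knn_predict`, the Euclidean distance, and the toy points are assumptions chosen only for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical, for illustration only): two features, two classes
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

pred_near_first_cluster = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_near_second_cluster = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)
print(pred_near_first_cluster, pred_near_second_cluster)
```

Nothing is learned at fit time in this sketch: the training points are simply stored, and all the work happens when a query arrives.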

The Gaussian Naive Bayes algorithm is used when the features or attributes are continuous, like the features of this dataset; however, this version assumes that the features follow a Gaussian distribution within each class, which can introduce bias during training when they do not.
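
The Gaussian assumption can be sketched in a few lines. The code below is an illustrative from-scratch version, not scikit-learn's `GaussianNB`: the helper names `gaussian_nb_fit` and `gaussian_nb_predict`, the `1e-9` variance floor, and the toy data are all assumptions of this sketch.

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Estimate the prior, feature means and feature variances of each class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "log_prior": np.log(len(Xc) / len(X)),
            "mean": Xc.mean(axis=0),
            # Small constant avoids division by zero (an assumption of this sketch)
            "var": Xc.var(axis=0) + 1e-9,
        }
    return params

def gaussian_nb_predict(params, x):
    """Return the class with the highest log posterior for one sample x."""
    x = np.asarray(x, dtype=float)
    scores = {}
    for c, p in params.items():
        # Sum of per-feature Gaussian log densities: summing is where the
        # "naive" independence assumption between features enters
        log_likelihood = -0.5 * np.sum(
            np.log(2.0 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"]
        )
        scores[c] = p["log_prior"] + log_likelihood
    return max(scores, key=scores.get)

# Toy data (hypothetical, for illustration only)
X_nb_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 6.0], [4.2, 5.8], [3.8, 6.2]]
y_nb_toy = [0, 0, 0, 1, 1, 1]

nb_params = gaussian_nb_fit(X_nb_toy, y_nb_toy)
nb_pred_a = gaussian_nb_predict(nb_params, [1.1, 2.1])
nb_pred_b = gaussian_nb_predict(nb_params, [4.1, 5.9])
print(nb_pred_a, nb_pred_b)
```

Each class is summarized by only a mean and a variance per feature, which is why the model is cheap to train and why it can be biased when a feature is far from Gaussian.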


In [72]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [73]:
df = pd.read_csv("https://raw.githubusercontent.com/alessandraaldave/MyRepository/main/data.csv",index_col=0)
df.head(5)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                 

In [75]:
df.drop(columns=['Unnamed: 32'], inplace=True)

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 842302 to 92751
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                 

In [77]:
df

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [78]:
df['diagnosis'].unique()

array(['M', 'B'], dtype=object)

In [79]:
# Encode the labels as integers: malignant ('M') = 0, benign ('B') = 1
df['diagnosis'] = df['diagnosis'].map({'M': 0, 'B': 1}).astype(int)
df.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,0,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,0,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,0,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Classification with K-Nearest Neighbors

In [80]:
# Split the data into features (X) and target (y)
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Split the data into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the KNeighborsClassifier object
KNN_classifier = KNeighborsClassifier(n_neighbors=6)

# Fit the KNN classifier to the training set with the fit method
KNN_classifier.fit(X_train, y_train)
# The model is now trained

### Classification with Naive Bayes

In [81]:
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes
# Initialize a Gaussian Naive Bayes instance
naivebayes_gaussian = GaussianNB()

# Fit the classification model to the training set
naivebayes_gaussian.fit(X_train, y_train)

# 3. Results

In [82]:
# Predict on the testing set
y_prediccion = KNN_classifier.predict(X_test)

# pos_label=0 treats the malignant class (encoded as 0) as the positive class
accuracy_KNN = accuracy_score(y_test, y_prediccion)
precision_KNN = precision_score(y_test, y_prediccion, pos_label=0)
recall_KNN = recall_score(y_test, y_prediccion, pos_label=0)
F1_KNN = f1_score(y_test, y_prediccion, pos_label=0)


print("Accuracy KNN:", accuracy_KNN)
print("Precision KNN:", precision_KNN)
print("Recall KNN:", recall_KNN)
print("F1-score KNN:", F1_KNN)

Accuracy KNN: 0.9649122807017544
Precision KNN: 0.975609756097561
Recall KNN: 0.9302325581395349
F1-score KNN: 0.9523809523809524


In [83]:
y_prediccion_NBGauss = naivebayes_gaussian.predict(X_test)

accuracy_NBGauss = accuracy_score(y_test, y_prediccion_NBGauss)
precision_NBGauss = precision_score(y_test, y_prediccion_NBGauss, pos_label=0)
recall_NBGauss = recall_score(y_test, y_prediccion_NBGauss, pos_label=0)
F1_NBGauss = f1_score(y_test, y_prediccion_NBGauss, pos_label=0)

print("Accuracy with Gaussian Naive Bayes:", accuracy_NBGauss)
print("Precision with Gaussian Naive Bayes:", precision_NBGauss)
print("Recall with Gaussian Naive Bayes:", recall_NBGauss)
print("F1-score with Gaussian Naive Bayes:", F1_NBGauss)

Accuracy with Gaussian Naive Bayes: 0.9736842105263158
Precision with Gaussian Naive Bayes: 1.0
Recall with Gaussian Naive Bayes: 0.9302325581395349
F1-score with Gaussian Naive Bayes: 0.963855421686747
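
Because the metrics in this section are computed with `pos_label=0`, precision and recall refer to the malignant class. The sketch below shows how they follow from the confusion matrix. The labels are made-up toy values, not this notebook's predictions, and the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 0 = malignant, 1 = benign
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# With labels=[0, 1], row 0 and column 0 refer to the malignant class
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tp, fn = cm[0, 0], cm[0, 1]  # malignant predicted as malignant / as benign
fp, tn = cm[1, 0], cm[1, 1]  # benign predicted as malignant / as benign

precision_manual = tp / (tp + fp)  # of the predicted malignant cases, how many are malignant
recall_manual = tp / (tp + fn)     # of the malignant cases, how many were detected

print("Precision (manual vs sklearn):", precision_manual,
      precision_score(y_true_toy, y_pred_toy, pos_label=0))
print("Recall (manual vs sklearn):", recall_manual,
      recall_score(y_true_toy, y_pred_toy, pos_label=0))
```

In a diagnostic setting, recall on the malignant class is the figure to watch, since each false negative is a missed cancer.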


# 4. Discussion

Based on the metrics obtained, the model trained with the Gaussian version of Naive Bayes appears to perform better than K-Nearest Neighbors. Both models reach the same recall (0.930), and the gaps in precision (1.000 vs. 0.976) and accuracy (0.974 vs. 0.965) are consistent with a single additional false positive for KNN on the 114-sample test set. The apparent advantage may therefore be an overestimate, since the amount of data is not sufficient to obtain a robust model.
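
One way to check whether the difference holds beyond a single train/test split is stratified k-fold cross-validation. The sketch below is not part of the original experiment. It assumes that `load_breast_cancer` from scikit-learn, which bundles the Wisconsin Diagnostic dataset with 0 = malignant and 1 = benign, is an acceptable stand-in for the CSV loaded above. The five folds and the variable names are also assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the notebook's DataFrame (assumption: same underlying dataset)
X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Stratified folds keep the malignant/benign ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "KNN (k=6)": KNeighborsClassifier(n_neighbors=6),
    "Gaussian NB": GaussianNB(),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the standard deviations across folds are larger than the gap between the two means, the ranking of the models should not be read as conclusive.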

Moreover, although Naive Bayes is widely used in real-world medical diagnosis applications, it has the drawback of assuming that all attributes are independent of one another, which is often not true because some parameters are biologically correlated.
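
The independence assumption can be checked directly on the features. The sketch below measures the Pearson correlation between radius, perimeter, and area, which are geometrically related and therefore expected to be strongly correlated. It uses scikit-learn's bundled copy of the dataset as a stand-in for the CSV (an assumption); there the columns are named `mean radius`, `mean perimeter`, and `mean area` rather than `radius_mean`, `perimeter_mean`, and `area_mean`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Stand-in for the notebook's DataFrame (assumption: same underlying dataset)
data = load_breast_cancer()
names = list(data.feature_names)

radius = data.data[:, names.index("mean radius")]
perimeter = data.data[:, names.index("mean perimeter")]
area = data.data[:, names.index("mean area")]

# Pearson correlation coefficient between pairs of features
corr_radius_perimeter = np.corrcoef(radius, perimeter)[0, 1]
corr_radius_area = np.corrcoef(radius, area)[0, 1]

print("corr(mean radius, mean perimeter):", round(corr_radius_perimeter, 3))
print("corr(mean radius, mean area):", round(corr_radius_area, 3))
```

Highly correlated features are effectively counted more than once by Naive Bayes, which tends to make its predicted probabilities overconfident.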

# 5. Conclusions

By analyzing large amounts of data, ML algorithms in healthcare can improve the speed, efficiency, accuracy, and precision of diagnosing a disease or estimating the risk of developing it.

The choice of learning type and model should be aligned not only with the project's objectives but also with the context of the data, since the metrics obtained may not reflect how the model would actually perform on a new dataset whose distribution or characteristics differ.

The models obtained with the KNN classifier and Gaussian Naive Bayes show promising results; nevertheless, they could be re-evaluated and improved with a larger dataset, new preprocessing, and a new parameter configuration.
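
As one possible form of the preprocessing and parameter tuning mentioned above, the sketch below standardizes the features before KNN and searches over `n_neighbors` with cross-validation. It is a suggestion, not part of the reported experiment. Scaling is a natural candidate because KNN relies on distances, and features such as `area_mean` (in the hundreds or thousands) would otherwise dominate features such as `smoothness_mean` (around 0.1). The bundled scikit-learn dataset, the grid values, and the variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's DataFrame (assumption: same underlying dataset)
X_gs, y_gs = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=42, stratify=y_gs
)

# The scaler sits inside the pipeline, so it is refit on each training fold
# and no information from the validation fold leaks into it
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid of neighbor counts to try
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print("Best n_neighbors:", search.best_params_["knn__n_neighbors"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(test_accuracy, 3))
```

The test set is touched only once, after the search has chosen its configuration, so the final figure is not tuned on the data used to report it.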





# 6. References

[1] Centers for Disease Control and Prevention, “What is breast cancer?,” Centers for Disease Control and Prevention, Jul. 25, 2023. https://www.cdc.gov/cancer/breast/basic_info/what-is-breast-cancer.html

[2] “Breast cancer,” www.who.int. https://www.who.int/news-room/fact-sheets/detail/breast-cancer#:~:text=In%202022%2C%20there%20were%202.3

[3] CDC, “What Is Breast Cancer Screening?,” Centers for Disease Control and Prevention, Sep. 22, 2021. https://www.cdc.gov/cancer/breast/basic_info/screening.htm#:~:text=if%20you%20qualify.-

[4] American Cancer Society, “ACS breast cancer screening guidelines,” www.cancer.org, Jan. 14, 2022. https://www.cancer.org/cancer/types/breast-cancer/screening-tests-and-early-detection/american-cancer-society-recommendations-for-the-early-detection-of-breast-cancer.html

[5] J. S. Ahn, S. Shin, S. A. Yang, E. K. Park, K. H. Kim, S. I. Cho, C. Y. Ock, and S. Kim, “Artificial Intelligence in Breast Cancer Diagnosis and Personalized Medicine,” J Breast Cancer, vol. 26, no. 5, pp. 405-435, Oct. 2023. doi: 10.4048/jbc.2023.26.e45. PMID: 37926067; PMCID: PMC10625863.

[6] Z. Hameed, B. Garcia-Zapirain, J. J. Aguirre et al., “Multiclass classification of breast cancer histopathology images using multilevel features of deep convolutional neural network,” Sci Rep, vol. 12, 15600, 2022. https://doi.org/10.1038/s41598-022-19278-2