# KNN
This document walks through training a k-nearest neighbors (KNN) classifier, from hyperparameter selection to evaluating the resulting model. We start by importing the libraries and functions needed for these tasks.

In [1]:
#KNN
import csv
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score


# DATA LOADING AND TARGET SELECTION
We first read the CSV file with pandas. We then separate the attributes into discrete and continuous ones for later processing. Finally, we define the target, `ml_target`, which is a binary category.

In [2]:
#Read the CSV file
nodes = pd.read_csv('../Tablas/TablaAtributos.csv')
nodes.head(10)

Unnamed: 0,id,name,ml_target,Closeness_Centrality,Betweenness_Centrality,Degree_Centrality,Clustering_Coefficient,Triangles,Squares,K_Core,Comunidad,asyn_lpa_community
0,0,Eiryyy,0.0,0.275005,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
1,1,shawflying,0.0,0.294956,1.149733e-06,0.000212,0.178571,6.2e-05,0.072344,6,0.002227,0.0
2,2,JpMCarrilho,1.0,0.261845,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
3,3,SuhwanCha,0.0,0.278718,5.316292e-05,0.000133,0.0,0.0,0.019178,4,0.004454,0.0
4,4,sunilangadi2,1.0,0.243084,6.134318e-09,5.3e-05,0.0,0.0,0.0,2,0.011136,0.0
5,5,j6montoya,0.0,0.343412,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
6,6,sfate,0.0,0.372244,2.098552e-06,0.000159,0.333333,6.2e-05,0.038866,6,0.0,0.0
7,7,amituuush,0.0,0.320201,6.16454e-07,0.000212,0.321429,0.000112,0.054237,8,0.0,0.0
8,8,mauroherlein,0.0,0.351534,1.78402e-07,0.000212,0.75,0.000262,0.057692,8,0.002227,0.0
9,9,ParadoxZero,0.0,0.34359,2.71187e-05,0.000186,0.238095,6.2e-05,0.001971,5,0.004454,0.0


In [3]:
ac=['Closeness_Centrality','Betweenness_Centrality','Degree_Centrality','Clustering_Coefficient','Triangles','Squares','K_Core', 'Comunidad','asyn_lpa_community']
ad=['name']
atributtes = nodes.loc[:, ['id']+ad + ac  ]

#Choose the attribute to predict
y = nodes['ml_target']
atributtes.head(5)

Unnamed: 0,id,name,Closeness_Centrality,Betweenness_Centrality,Degree_Centrality,Clustering_Coefficient,Triangles,Squares,K_Core,Comunidad,asyn_lpa_community
0,0,Eiryyy,0.275005,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
1,1,shawflying,0.294956,1.149733e-06,0.000212,0.178571,6.2e-05,0.072344,6,0.002227,0.0
2,2,JpMCarrilho,0.261845,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
3,3,SuhwanCha,0.278718,5.316292e-05,0.000133,0.0,0.0,0.019178,4,0.004454,0.0
4,4,sunilangadi2,0.243084,6.134318e-09,5.3e-05,0.0,0.0,0.0,2,0.011136,0.0


# DATA PROCESSING
KNN does not work with text strings, so we must convert them to numeric values. We then normalize the encoded values to the range [0, 1] using their minimum and maximum (min-max scaling).

In [4]:
codificador_ad = OrdinalEncoder()
codificador_ad.fit(atributtes[ad])

In [5]:
#Transform the data
atributtes[ad] = codificador_ad.transform(atributtes[ad])

#Normalize the name column
scaler = MinMaxScaler(
    feature_range=(0, 1)
)
atributtes[ad] = scaler.fit_transform(atributtes[['name']])
atributtes.head(5)

Unnamed: 0,id,name,Closeness_Centrality,Betweenness_Centrality,Degree_Centrality,Clustering_Coefficient,Triangles,Squares,K_Core,Comunidad,asyn_lpa_community
0,0,0.061673,0.275005,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
1,1,0.929866,0.294956,1.149733e-06,0.000212,0.178571,6.2e-05,0.072344,6,0.002227,0.0
2,2,0.106687,0.261845,0.0,2.7e-05,0.0,0.0,0.0,1,0.0,0.0
3,3,0.191517,0.278718,5.316292e-05,0.000133,0.0,0.0,0.019178,4,0.004454,0.0
4,4,0.969442,0.243084,6.134318e-09,5.3e-05,0.0,0.0,0.0,2,0.011136,0.0
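
The cell above scales only the encoded `name` column. Because KNN relies on distances, columns with larger ranges (for example `id` or `K_Core`) can dominate the result. As a sketch of one possible alternative, not the approach used in this notebook, the scaling could be applied to every numeric column inside the pipeline. The toy DataFrame, its column names, and the variable names below are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two numeric features on very different scales.
toy = pd.DataFrame({
    "small": [0.1, 0.2, 0.9, 0.8, 0.15, 0.85],
    "large": [100, 900, 200, 800, 300, 700],
})
toy_y = [0, 0, 1, 1, 0, 1]

# Scale every listed column to [0, 1] before the classifier sees it.
preprocess = ColumnTransformer([
    ("scale", MinMaxScaler(), ["small", "large"]),
])
toy_pipeline = Pipeline([
    ("prep", preprocess),
    ("kNN", KNeighborsClassifier(n_neighbors=1)),
])
toy_pipeline.fit(toy, toy_y)

# After fitting, both columns span the same range.
scaled = toy_pipeline.named_steps["prep"].transform(toy)
print(scaled.min(axis=0), scaled.max(axis=0))
```

When such a pipeline is passed to `GridSearchCV`, the scaler is refit on the training part of each fold, so the validation part does not influence the scaling.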


# HYPERPARAMETER SELECTION
To tune the KNN model we use grid search with 10-fold cross-validation, scored by recall. We explore odd numbers of neighbors from 1 to 25 and two distance metrics: `manhattan` and `euclidean`.

As the output further down shows, the best configuration found uses 3 neighbors with the `manhattan` metric.
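
For reference, the two metrics differ only in how per-feature differences are combined: Manhattan sums the absolute differences, while Euclidean takes the square root of the summed squares. A small sketch with two made-up points:

```python
import numpy as np

# Two hypothetical points in a 3-feature space.
a = np.array([0.0, 0.0, 0.0])
b = np.array([3.0, 4.0, 0.0])

manhattan = np.sum(np.abs(a - b))          # |3| + |4| + |0|
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16 + 0)

print(manhattan, euclidean)  # 7.0 5.0
```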

In [6]:

tub_kNN=Pipeline([
    ('kNN', KNeighborsClassifier())
])
parámetros = {
    'kNN__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25],
    'kNN__metric': [ 'euclidean','manhattan']
}
#Grid search with cross-validation
rejilla = GridSearchCV( tub_kNN, parámetros, scoring ='recall', cv=10)
rejilla.fit(atributtes, y)


In [7]:
#Print the best hyperparameters found
print(f"Mejores hiperparámetros: {rejilla.best_params_}")
print(f"Mejor puntaje de recall: {rejilla.best_score_}")

Mejores hiperparámetros: {'kNN__metric': 'manhattan', 'kNN__n_neighbors': 3}
Mejor puntaje de recall: 0.5438398357289528
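
`best_params_` reports only the winning combination. To see how close the other combinations were, the full cross-validation table in `cv_results_` can be inspected. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the notebook's data, and a smaller grid, so its numbers are not comparable to the output above. All `demo_*` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_pipeline = Pipeline([("kNN", KNeighborsClassifier())])
demo_grid = {
    "kNN__n_neighbors": [1, 3, 5],
    "kNN__metric": ["euclidean", "manhattan"],
}
demo_search = GridSearchCV(demo_pipeline, demo_grid, scoring="recall", cv=5)
demo_search.fit(X_demo, y_demo)

# One row per combination, sorted from best to worst mean recall.
results = pd.DataFrame(demo_search.cv_results_)
columns = [
    "param_kNN__n_neighbors",
    "param_kNN__metric",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```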


# VALIDATION
We now check the grid search results on a hold-out set, splitting the data into 80% for training and 20% for testing.
We train several KNN models with different hyperparameter configurations to see which performs best, and then interpret those results.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(atributtes, y, test_size=0.2,stratify=y, random_state=42)
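
The `stratify=y` argument keeps the class proportions the same in both partitions, which matters when one class is less frequent. A minimal sketch with made-up labels (75 negatives and 25 positives), using illustrative variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 negatives and 25 positives.
labels = np.array([0] * 75 + [1] * 25)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

# Share of positives in each partition.
print(y_tr.mean(), y_te.mean())
```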


In [9]:
#Evaluate different configurations of the KNN model
configuraciones = [
    (1, 'manhattan'),
    (3, 'manhattan'),
    (5, 'manhattan'),
    (1, 'euclidean'),
    (3, 'euclidean')
]

In [10]:
for n_neighbors, metric in configuraciones:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
    knn.fit(X_train, y_train)
    predict = knn.predict(X_test)
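    # Note: average='weighted' weights each class's recall by its support, which equals overall accuracy
    recall = recall_score(y_test, predict, average='weighted')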
    print(f"Configuración: n_neighbors={n_neighbors}, metric={metric}, Recall={recall:.4f}")
    confusionM = confusion_matrix(y_test, predict)
    tabla_confusion = pd.DataFrame(confusionM, index=['VN', 'VP'], columns=['PN', 'PP'])
    print(tabla_confusion)

Configuración: n_neighbors=1, metric=manhattan, Recall=0.6423
      PN    PP
VN  4241  1351
VP  1346   602
Configuración: n_neighbors=3, metric=manhattan, Recall=0.6663
      PN   PP
VN  4600  992
VP  1524  424
Configuración: n_neighbors=5, metric=manhattan, Recall=0.6874
      PN   PP
VN  4865  727
VP  1630  318
Configuración: n_neighbors=1, metric=euclidean, Recall=0.6373
      PN    PP
VN  4218  1374
VP  1361   587
Configuración: n_neighbors=3, metric=euclidean, Recall=0.6678
      PN   PP
VN  4616  976
VP  1529  419
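
With `average='weighted'`, `recall_score` averages the per-class recalls weighted by class support, which for a single-label problem equals the overall accuracy. The positive-class recall has to be read from the second row of each matrix. The sketch below redoes that arithmetic for the first configuration (`n_neighbors=1`, `metric='manhattan'`) using the counts printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix for n_neighbors=1, metric='manhattan'
# (rows: actual class, columns: predicted class).
cm = np.array([[4241, 1351],
               [1346,  602]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / cm.sum()    # fraction of all instances classified correctly
recall_positive = tp / (tp + fn)   # share of actual positives that were found
recall_negative = tn / (tn + fp)   # share of actual negatives that were found

# Weighted recall: per-class recalls weighted by the number of true instances.
support = cm.sum(axis=1)
weighted_recall = (
    recall_negative * support[0] + recall_positive * support[1]
) / support.sum()

print(f"accuracy={accuracy:.4f}, weighted recall={weighted_recall:.4f}")
print(f"positive-class recall={recall_positive:.4f}")
```

The same calculation on the other matrices gives positive-class recalls of 424/1948, 318/1948, 587/1948 and 419/1948 respectively.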


# ANALYSIS AND CONCLUSIONS
The `Recall` values printed above were computed with `average='weighted'`, which for this problem is the same as overall accuracy. The recall of the positive class must be read from the `VP` row of each confusion matrix, and it is much lower. By weighted recall the two metrics perform almost identically: `manhattan` is slightly ahead with one neighbor (0.6423 vs. 0.6373), while `euclidean` is slightly ahead with three (0.6678 vs. 0.6663).
As the number of neighbors (`n_neighbors`) grows, the weighted recall rises, but only because the model predicts the negative class more often: false positives fall while false negatives rise.

1. **Best Configuration:** `n_neighbors=5` with `metric='manhattan'` reaches the highest weighted recall, 0.6874, meaning that 68.74% of all test instances are classified correctly. It also has the lowest positive-class recall: only 318 of 1,948 positives (16.3%) are identified. By positive-class recall, `n_neighbors=1` with `metric='manhattan'` is best, with 602 of 1,948 (30.9%).

2. **Metric Comparison:** Neither metric is consistently better on the hold-out set. The differences are at most half a percentage point and change sign between one and three neighbors.

3. **Confusion Matrix:** The confusion matrices show that predicting the positive class (row `VP`) is the main difficulty. False negatives are high in every configuration (between 1,346 and 1,630 of the 1,948 positives are missed), and increasing `n_neighbors` trades false positives (1,351 down to 727 with `manhattan`) for more false negatives.

In summary, these results underline the importance of choosing the hyperparameters, the distance metric, and the evaluation metric carefully in a k-NN model. The weighted recall looks reasonable, but it mostly reflects the majority class, and the positive-class recall on the hold-out set (16% to 31%) is well below the 0.54 reported by the grid search. Improving the detection of AI developers in this imbalanced dataset requires reducing false negatives and exploring additional strategies.
