# What is this?

An exploration of k-nearest neighbors (k-NN), with the goal of thinking about how it can be interpreted in the context of incremental learning.

We will implement k-NN step by step.

- Step 1: Compute the Euclidean distance.
- Step 2: Find the nearest neighbors.
- Step 3: Make predictions.

### Euclidean distance

The smaller the distance between two rows, the more similar they are.

$d(p,q) = \sqrt{\sum_i (p_i - q_i)^2}$

- p is one row of data.
- q is a different row of data.
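
As a worked example, take the feature values (X1, X2) of the first two rows of the dataset loaded later in this notebook, $p = (2.781084, 2.550537)$ and $q = (1.465489, 2.362125)$. The class label is not part of the distance. Intermediate values are rounded:

$$d(p,q) = \sqrt{(2.781084 - 1.465489)^2 + (2.550537 - 2.362125)^2} = \sqrt{1.730790 + 0.035499} \approx 1.329$$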


In [6]:
import math
# Calculate the Euclidean distance between two rows
def euclidean_distance(row1, row2):
    distance = 0.0
    # The last column holds the class label, so it is left out of the distance
    for i in range(len(row1) - 1):
        distance += (row1[i] - row2[i]) ** 2
    return math.sqrt(distance)

### Synthetic data

In [7]:
import pandas as pd
# Read the whitespace-delimited text file into a pandas DataFrame
# (sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas)
df = pd.read_csv('data.txt', sep=r'\s+')
df

Unnamed: 0,X1,X2,Y
0,2.781084,2.550537,0
1,1.465489,2.362125,0
2,3.396562,4.400294,0
3,1.38807,1.85022,0
4,3.064072,3.005306,0
5,7.627531,2.759262,1
6,5.332441,2.088627,1
7,6.922597,1.771064,1
8,8.675419,-0.242069,1
9,7.673756,3.508563,1


Example of computing the distances.

We fix one row and compute the Euclidean distance from it to every row in the dataset.

In [8]:
dataset = df.to_numpy()
row0 = dataset[0]
for row in dataset:
    distance = euclidean_distance(row0, row)
    print(distance)

0.0
1.3290173915275787
1.9494646655653247
1.5591439385540549
0.5356280721938492
4.850940186986411
2.592833759950511
4.214227042632867
6.522409988228337
4.985585382449795
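
The same distances can also be computed without an explicit loop. The following is a minimal sketch, assuming NumPy is available; `distances_to_all` is a hypothetical helper name, and the feature values are copied from the table above with the label column left out.

```python
import numpy as np

# Feature columns (X1, X2) copied from the table above; labels are left out.
X = np.array([
    [2.781084, 2.550537],
    [1.465489, 2.362125],
    [3.396562, 4.400294],
    [1.388070, 1.850220],
    [3.064072, 3.005306],
    [7.627531, 2.759262],
    [5.332441, 2.088627],
    [6.922597, 1.771064],
    [8.675419, -0.242069],
    [7.673756, 3.508563],
])

def distances_to_all(X, query):
    """Euclidean distance from `query` to every row of `X` (hypothetical helper)."""
    # Broadcasting subtracts `query` from each row; the norm is taken per row.
    return np.linalg.norm(X - query, axis=1)

distances = distances_to_all(X, X[0])
print(np.round(distances, 3))
```

Up to rounding of the input values, the result should agree with the loop output above.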


### Visualizing the data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the dataset into a DataFrame
data = pd.read_csv('data.txt', sep=r'\s+')

# Split the data into the two classes
clase_0 = data[data['Y'] == 0]
clase_1 = data[data['Y'] == 1]

# Draw a scatter plot for each class
plt.scatter(clase_0['X1'], clase_0['X2'], color='blue', label='Class 0', s=100, alpha=0.7)
plt.scatter(clase_1['X1'], clase_1['X2'], color='yellow', label='Class 1', s=100, alpha=0.7)

# Add axis labels and a title
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Scatter plot of the data')

# Show the legend
plt.legend()

# Display the plot
plt.show()


### Finding the nearest neighbors

In [15]:
def get_neighbors(train, test_row, num_neighbors):
    """
    Find the 'num_neighbors' most similar neighbors of a 'test_row' within the 'train' dataset.

    Parameters:
    train: list
        The training dataset containing multiple rows.

    test_row: list
        The specific row for which the nearest neighbors are to be found.

    num_neighbors: int
        The number of nearest neighbors to be located.

    Returns:
    list
        List of 'num_neighbors' closest neighboring rows from the 'train' dataset.
    """

    # Calculate the distance between 'test_row' and each row in the 'train' dataset
    distances = [(train_row, euclidean_distance(test_row, train_row)) for train_row in train]

    # Sort the distances in ascending order
    distances.sort(key=lambda tup: tup[1])

    # Retrieve the 'num_neighbors' closest neighbors; slicing avoids an IndexError
    # when 'num_neighbors' exceeds the number of rows in 'train'
    neighbors = [row for row, _ in distances[:num_neighbors]]

    return neighbors


In [14]:
neighbors = get_neighbors(dataset, dataset[0], 3)
for neighbor in neighbors:
    print(f"Most similar vectors: {neighbor}")

Most similar vectors: [2.7810836 2.550537  0.       ]
Most similar vectors: [3.06407232 3.00530597 0.        ]
Most similar vectors: [1.46548937 2.36212508 0.        ]
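
Note that the first neighbor returned above is the query row itself: `dataset[0]` is also part of `train`, and its distance to itself is 0. If that is not desired, one option is to skip the query row during the search. The following is a minimal sketch under that assumption; `euclidean` and `neighbors_excluding` are hypothetical helper names, and the rows are copied from the table above.

```python
import math

# (X1, X2, Y) rows copied from the table above.
rows = [
    [2.781084, 2.550537, 0],
    [1.465489, 2.362125, 0],
    [3.396562, 4.400294, 0],
    [1.388070, 1.850220, 0],
    [3.064072, 3.005306, 0],
    [7.627531, 2.759262, 1],
    [5.332441, 2.088627, 1],
    [6.922597, 1.771064, 1],
    [8.675419, -0.242069, 1],
    [7.673756, 3.508563, 1],
]

def euclidean(a, b):
    # Compare features only; the last element of each row is the class label.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:-1], b[:-1])))

def neighbors_excluding(train, query_index, k):
    """Hypothetical helper: indices of the k rows nearest to train[query_index],
    skipping the query row itself."""
    query = train[query_index]
    candidates = [
        (i, euclidean(query, row))
        for i, row in enumerate(train)
        if i != query_index
    ]
    candidates.sort(key=lambda pair: pair[1])
    return [i for i, _ in candidates[:k]]

nearest = neighbors_excluding(rows, 0, 3)
print(nearest)  # indices of the three closest rows other than row 0
```

Returning indices rather than rows is only a design choice for this sketch; it makes it easy to see which rows were picked.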


### Predictions

In [31]:
from pprint import pprint

def predict_classification(train, test_row, num_neighbors):
    """
    Make a classification prediction for a 'test_row' based on its nearest neighbors in the 'train' dataset.

    Parameters:
    train: list
        The training dataset containing multiple rows.

    test_row: list
        The specific row for which the classification prediction needs to be made.

    num_neighbors: int
        The number of nearest neighbors to consider for prediction.

    Returns:
    int
        The predicted classification for the 'test_row'.
    """

    # Obtain the 'num_neighbors' nearest neighbors using 'get_neighbors' function
    neighbors = get_neighbors(train, test_row, num_neighbors)
    pprint(f"Vecinos mas cercanos:  {neighbors}")

    # Extract the output values (the last column) from the neighbors
    output_values = [row[-1] for row in neighbors]
    print(f"Clases de los vecinos mas cercanos {output_values}")
    # Make a prediction based on the most common output value among the neighbors
    prediction = max(set(output_values), key=output_values.count)

    return prediction


In [37]:
row: int = 4
prediction = predict_classification(dataset, dataset[row], 7)
print('Expected %d, Got %d.' % (dataset[row][-1], prediction))

('Vecinos mas cercanos:  [array([3.06407232, 3.00530597, 0.        ]), '
 'array([2.7810836, 2.550537 , 0.       ]), array([3.39656169, 4.40029353, '
 '0.        ]), array([1.46548937, 2.36212508, 0.        ]), '
 'array([1.38807019, 1.85022032, 0.        ]), array([5.33244125, 2.08862677, '
 '1.        ]), array([6.92259672, 1.77106367, 1.        ])]')
Clases de los vecinos mas cercanos [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
Expected 0, Got 0.
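
The example above predicts the class of a row that is already in the training data. To illustrate the same vote on unseen data, here is a minimal sketch that classifies a made-up point. `knn_predict` and `new_point` are hypothetical names, `math.dist` requires Python 3.8 or later, and the rows are copied from the table in this notebook. With `collections.Counter`, classes with equal counts are returned in the order first encountered, so with neighbors sorted by distance a tie falls to the class of the closest tied neighbor.

```python
import math
from collections import Counter

# (X1, X2, Y) rows copied from the table in this notebook.
rows = [
    [2.781084, 2.550537, 0],
    [1.465489, 2.362125, 0],
    [3.396562, 4.400294, 0],
    [1.388070, 1.850220, 0],
    [3.064072, 3.005306, 0],
    [7.627531, 2.759262, 1],
    [5.332441, 2.088627, 1],
    [6.922597, 1.771064, 1],
    [8.675419, -0.242069, 1],
    [7.673756, 3.508563, 1],
]

def knn_predict(train, features, k):
    """Hypothetical helper: majority vote among the k nearest training rows."""
    # Rank training rows by distance between their features and the query.
    ranked = sorted(train, key=lambda row: math.dist(row[:-1], features))
    # Count the class labels (last column) of the k closest rows.
    votes = Counter(row[-1] for row in ranked[:k])
    return votes.most_common(1)[0][0]

new_point = [6.0, 2.0]  # made-up query, not part of the dataset
prediction = knn_predict(rows, new_point, 3)
print(prediction)
```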


In [16]:
df

Unnamed: 0,X1,X2,Y
0,2.781084,2.550537,0
1,1.465489,2.362125,0
2,3.396562,4.400294,0
3,1.38807,1.85022,0
4,3.064072,3.005306,0
5,7.627531,2.759262,1
6,5.332441,2.088627,1
7,6.922597,1.771064,1
8,8.675419,-0.242069,1
9,7.673756,3.508563,1


## Conclusions

This small experiment shows what these methods do: they only look at similarity, and the class of a new data point is decided by the classes of however many nearest neighbors we choose to consider.

These methods normally keep the entire dataset in memory, so memory becomes a problem as the number of classes and samples grows.

From a bit of research, the memory problem is usually addressed with prototypes: a few representative points are kept for each class, so that nearest-neighbor classification can run efficiently on a much smaller amount of data.
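
To make the idea concrete, here is a minimal sketch of one very simple prototype scheme, chosen purely for illustration: a single prototype per class, kept as a running mean and updated one sample at a time. This is an assumption, not necessarily the method the reference describes. `CentroidPrototypes`, `partial_fit`, and `predict` are hypothetical names, and the rows are copied from the table in this notebook.

```python
import math

class CentroidPrototypes:
    """Hypothetical sketch: one prototype (running mean) per class."""

    def __init__(self):
        self.centroids = {}  # class label -> mean feature vector
        self.counts = {}     # class label -> number of samples seen

    def partial_fit(self, features, label):
        # Incremental mean update: the raw sample is not stored afterwards.
        if label not in self.centroids:
            self.centroids[label] = list(features)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        self.centroids[label] = [
            c + (x - c) / n for c, x in zip(self.centroids[label], features)
        ]

    def predict(self, features):
        # Nearest prototype instead of nearest stored sample.
        return min(
            self.centroids,
            key=lambda label: math.dist(self.centroids[label], features),
        )

# (X1, X2, Y) rows copied from the table in this notebook.
rows = [
    [2.781084, 2.550537, 0],
    [1.465489, 2.362125, 0],
    [3.396562, 4.400294, 0],
    [1.388070, 1.850220, 0],
    [3.064072, 3.005306, 0],
    [7.627531, 2.759262, 1],
    [5.332441, 2.088627, 1],
    [6.922597, 1.771064, 1],
    [8.675419, -0.242069, 1],
    [7.673756, 3.508563, 1],
]

model = CentroidPrototypes()
for *features, label in rows:
    model.partial_fit(features, label)

print(model.centroids)
print(model.predict([6.0, 2.0]))  # made-up query point
```

In this sketch the stored state is one vector and one counter per class, no matter how many samples arrive, which is the property that makes prototypes attractive for incremental learning. The trade-off is that a single mean cannot represent classes with more complicated shapes.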

Prototypes are covered in more detail in [The Elements of Statistical Learning](https://link.springer.com/book/10.1007/978-0-387-84858-7).