<a href="https://colab.research.google.com/github/arnaldog12/Machine_Learning/blob/master/KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

|  |  |
|-------------|-------|
| 🎓 **Learning** | Supervised |
| 📋 **Task** | Classification or Regression |
| 🔧 **Normalization** | Yes |
| ⭐ **Difficulty** | Easy |

# ⚙️ Dependencies

In [18]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.datasets import load_iris, load_diabetes

# 🔍 Introduction

The KNN (*k-Nearest Neighbors*) algorithm is one of the simplest and most intuitive machine learning algorithms. It is widely used for **classification** and **regression** problems.

The main idea is that similar samples tend to lie close to one another. KNN is also a *lazy learner*: it does not build an explicit model during training. In our implementation, the `fit` method does almost nothing: it only stores a reference to the training data.

To classify a new sample, the algorithm:

1. **Computes the distance between the new sample and every sample in the training set**.

> 💡 Any known distance metric can be used. The Manhattan (L1) and Euclidean (L2) distances are the most common choices.

2. **Finds the $k$ nearest neighbors**.

> 💡 In practice, **odd** values of $k$ are common, usually between 3 and 7.

3. **Decides the output**:
    - Classification: the most common class among the neighbors is assigned to the new sample.
    - Regression: the mean or median of the $k$ neighbors' target values.
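
The three steps above can be traced by hand on a toy dataset. The sketch below is only illustrative: the points, labels, targets, and query sample are made up, and the Euclidean distance is an arbitrary choice.

```python
import numpy as np

# Hypothetical toy training set: 5 points in 2D with binary labels.
x_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [5.5, 6.5]])
y_train = np.array([0, 0, 1, 1, 1])
x_new = np.array([2.0, 2.0])  # sample to classify
k = 3

# Step 1: distance from the new sample to every training sample (Euclidean, L2).
distances = np.sqrt(np.sum((x_train - x_new) ** 2, axis=1))

# Step 2: indices of the k nearest neighbors.
nearest = np.argsort(distances)[:k]

# Step 3 (classification): most common class among the neighbors.
predicted_class = np.bincount(y_train[nearest]).argmax()

# Step 3 (regression): mean of the neighbors' targets (hypothetical values).
targets = np.array([10.0, 12.0, 30.0, 33.0, 35.0])
predicted_value = targets[nearest].mean()

print(nearest, predicted_class, predicted_value)
```

Two of the three nearest neighbors belong to class 0, so the vote assigns class 0 to the new sample. Note that `np.bincount(...).argmax()` returns the smallest label when there is a tie.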

✅ **Advantages**:
- **Simplicity**: easy to understand and implement.
- **Non-parametric**: does not assume a specific data distribution.
- **Lazy learner**: requires no training step.
- **Performance**: performs well on small, clean datasets. A good "baseline" model.

❌ **Disadvantages**:

- **Slow prediction**: a brute-force implementation compares each new sample with the whole training set, so it is not recommended for large volumes of data.
- **Memory consumption**: the entire dataset is kept in memory.
- **Sensitive to outliers**
- **Sensitive to scale**: features with larger scales dominate the distance. **Normalization is required!**

> ⚠️ In our implementations, we use the `iris` and `diabetes` datasets, where all features are on the same scale and/or already normalized, so we will not need to normalize. **In the real world, however, remember to normalize!**
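
To see why scale matters, the sketch below uses made-up data in which the second feature is thousands of times larger than the first. It assumes z-score standardization with statistics taken from the training set; the values and the helper name `l2` are hypothetical.

```python
import numpy as np

# Hypothetical data: column 0 lies in the range 0-1, column 1 in the thousands.
x_train = np.array([[0.10, 1000.0], [0.90, 1010.0], [0.15, 1500.0]])
x_new = np.array([0.12, 1012.0])

def l2(a, b):
    # Euclidean distance from each row of a to the vector b.
    return np.sqrt(np.sum((a - b) ** 2, axis=1))

# Raw features: the second column dominates the distance.
nearest_raw = np.argmin(l2(x_train, x_new))

# Z-score standardization using the mean and std of the training set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
x_train_std = (x_train - mean) / std
x_new_std = (x_new - mean) / std
nearest_std = np.argmin(l2(x_train_std, x_new_std))

print(nearest_raw, nearest_std)
```

With raw features the nearest neighbor is sample 1, which is close only in the large-scale column. After standardization it is sample 0, which is close in both columns. scikit-learn's `StandardScaler` applies the same z-score transformation.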

## Choosing the Parameter $k$

- **Small $k$ ($k=1$)**: more sensitive to noise, irregular decision boundaries
- **Large $k$**: smoother, but may ignore local patterns
- **Odd $k$**: avoids ties in binary classification
- **Cross-validation**: the best way to choose $k$
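
A minimal sketch of choosing $k$ by cross-validation, using scikit-learn's `cross_val_score`. The candidate range and the 5 folds are arbitrary assumptions. For brevity the whole `iris` dataset is used here; in a real workflow the search would run on the training split only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

candidate_ks = list(range(1, 16, 2))  # odd values only; the range is an arbitrary choice
mean_scores = []
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: returns one accuracy value per fold.
    scores = cross_val_score(model, x, y, cv=5)
    mean_scores.append(scores.mean())

# Candidate with the highest mean accuracy across folds.
best_k = candidate_ks[int(np.argmax(mean_scores))]
print('best k:', best_k)
```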

# 🎲 Data

In [19]:
iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target
df.head(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
6,4.6,3.4,1.4,0.3,0
7,5.0,3.4,1.5,0.2,0
8,4.4,2.9,1.4,0.2,0
9,4.9,3.1,1.5,0.1,0


In [20]:
diabetes = load_diabetes()

df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target
df.head(10)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346,97.0
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062917,-0.038357,138.0
7,0.063504,0.05068,-0.001895,0.066629,0.09062,0.108914,0.022869,0.017703,-0.035816,0.003064,63.0
8,0.041708,0.05068,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.01496,0.011349,110.0
9,-0.0709,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504,310.0


# 💻 Implementation

## Distance Metrics

In [21]:
def l1_distance(a, b):
    return np.sum(np.abs(a - b), axis=1)

def l2_distance(a, b):
    return np.sqrt(np.sum((a - b)**2, axis=1))
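
Both functions rely on NumPy broadcasting: `a` is a matrix of samples, `b` is a single sample, and `axis=1` sums over the features. The example below repeats the two definitions so that it runs on its own; the input values are made up.

```python
import numpy as np

def l1_distance(a, b):
    return np.sum(np.abs(a - b), axis=1)

def l2_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2, axis=1))

# a: training samples with shape (n_samples, n_features); b: one sample (n_features,).
# Broadcasting subtracts b from every row of a.
a = np.array([[0.0, 0.0], [3.0, 4.0], [-1.0, 2.0]])
b = np.array([0.0, 0.0])

d1 = l1_distance(a, b)  # Manhattan: 0, 7 and 3
d2 = l2_distance(a, b)  # Euclidean: 0, 5 and sqrt(5)
print(d1, d2)
```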

## Classifier

In [22]:
class KNNClassifier(object):
    def __init__(self, n_neighbors=1, metric=l1_distance):
        self.n_neighbors = n_neighbors
        self.metric = metric

    def fit(self, x, y):
        # Lazy learner: just store the training data
        self.x_train = x
        self.y_train = y

    def predict(self, x):
        # 1D output, one prediction per test sample
        y_pred = np.zeros(x.shape[0], dtype=self.y_train.dtype)

        for i, x_test in enumerate(x):
            # Distance from the test sample to every training sample
            distances = self.metric(self.x_train, x_test)
            nn_index = np.argsort(distances)
            # Labels of the k nearest neighbors
            nn_pred = self.y_train[nn_index[:self.n_neighbors]]
            # Majority vote (labels must be non-negative integers)
            y_pred[i] = np.bincount(nn_pred).argmax()

        return y_pred

In [23]:
x = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)

print(x.shape, y.shape)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(150, 4) (150,)
(105, 4) (105,)
(45, 4) (45,)


In [24]:
knn = KNNClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

print('Accuracy: {:.2f}%'.format(accuracy_score(y_test, y_pred) * 100))

Accuracy: 93.33%


In [25]:
list_res = []
for metric in [l1_distance, l2_distance]:
    for k in range(1, 10, 2):
        knn = KNNClassifier(n_neighbors=k, metric=metric)
        knn.fit(x_train, y_train)
        y_pred = knn.predict(x_test)

        acc = accuracy_score(y_test, y_pred) * 100
        list_res.append([k, metric.__name__, acc])

df = pd.DataFrame(list_res, columns=['k', 'metric', 'accuracy'])
df

Unnamed: 0,k,metric,accuracy
0,1,l1_distance,91.111111
1,3,l1_distance,93.333333
2,5,l1_distance,93.333333
3,7,l1_distance,93.333333
4,9,l1_distance,93.333333
5,1,l2_distance,93.333333
6,3,l2_distance,95.555556
7,5,l2_distance,97.777778
8,7,l2_distance,95.555556
9,9,l2_distance,95.555556


### Comparison with Scikit-learn

In [26]:
from sklearn.neighbors import KNeighborsClassifier

list_res = []
for metric in ['cityblock', 'euclidean']:
    for k in range(1, 10, 2):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric, algorithm='brute')
        knn.fit(x_train, y_train)
        y_pred = knn.predict(x_test)

        acc = accuracy_score(y_test, y_pred) * 100
        list_res.append([k, 'l1_distance' if metric == 'cityblock' else 'l2_distance', acc])

df = pd.DataFrame(list_res, columns=['k', 'metric', 'accuracy'])
df

Unnamed: 0,k,metric,accuracy
0,1,l1_distance,91.111111
1,3,l1_distance,93.333333
2,5,l1_distance,93.333333
3,7,l1_distance,93.333333
4,9,l1_distance,93.333333
5,1,l2_distance,93.333333
6,3,l2_distance,95.555556
7,5,l2_distance,97.777778
8,7,l2_distance,95.555556
9,9,l2_distance,95.555556


## Regressor

In [27]:
class KNNRegressor(object):
    def __init__(self, n_neighbors=1, metric=l1_distance):
        self.n_neighbors = n_neighbors
        self.metric = metric

    def fit(self, x, y):
        # Lazy learner: just store the training data
        self.x_train = x
        self.y_train = y

    def predict(self, x):
        # 1D output, one prediction per test sample
        y_pred = np.zeros(x.shape[0], dtype=self.y_train.dtype)

        for i, x_test in enumerate(x):
            # Distance from the test sample to every training sample
            distances = self.metric(self.x_train, x_test)
            nn_index = np.argsort(distances)
            # Targets of the k nearest neighbors
            nn_pred = self.y_train[nn_index[:self.n_neighbors]]
            # Prediction is the mean of the neighbors' targets
            y_pred[i] = np.mean(nn_pred)

        return y_pred

In [28]:
x = diabetes.data
y = diabetes.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

print(x.shape, y.shape)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(442, 10) (442,)
(309, 10) (309,)
(133, 10) (133,)


In [29]:
knn = KNNRegressor(n_neighbors=7)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred)))

MSE: 3088.88


In [30]:
list_res = []
for metric in [l1_distance, l2_distance]:
    for k in range(1, 10, 2):
        knn = KNNRegressor(n_neighbors=k, metric=metric)
        knn.fit(x_train, y_train)
        y_pred = knn.predict(x_test)

        mse = mean_squared_error(y_test, y_pred)
        list_res.append([k, metric.__name__, mse])

df = pd.DataFrame(list_res, columns=['k', 'metric', 'mse'])
df

Unnamed: 0,k,metric,mse
0,1,l1_distance,6247.203008
1,3,l1_distance,3465.568087
2,5,l1_distance,3153.743759
3,7,l1_distance,3088.881387
4,9,l1_distance,3051.905505
5,1,l2_distance,5629.609023
6,3,l2_distance,3663.817043
7,5,l2_distance,3222.117895
8,7,l2_distance,3235.792543
9,9,l2_distance,3175.550543


### Comparison with Scikit-learn

In [31]:
from sklearn.neighbors import KNeighborsRegressor

list_res = []
for metric in ['cityblock', 'euclidean']:
    for k in range(1, 10, 2):
        knn = KNeighborsRegressor(n_neighbors=k, metric=metric, algorithm='brute')
        knn.fit(x_train, y_train)
        y_pred = knn.predict(x_test)

        mse = mean_squared_error(y_test, y_pred)
        list_res.append([k, 'l1_distance' if metric == 'cityblock' else 'l2_distance', mse])

df = pd.DataFrame(list_res, columns=['k', 'metric', 'mse'])
df

Unnamed: 0,k,metric,mse
0,1,l1_distance,6247.203008
1,3,l1_distance,3465.568087
2,5,l1_distance,3153.743759
3,7,l1_distance,3088.881387
4,9,l1_distance,3051.905505
5,1,l2_distance,5629.609023
6,3,l2_distance,3663.817043
7,5,l2_distance,3222.117895
8,7,l2_distance,3235.792543
9,9,l2_distance,3175.550543


# 💭 Final Remarks

## KNN Variations

**1. Weighted kNN**

Instead of a simple vote, it uses weights based on distance:

```python
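epsilon = 1e-8  # avoids division by zero
weights = 1 / (distances + epsilon)
# Sum the weights per class and pick the class with the largest total
weighted_vote = np.bincount(classes, weights=weights).argmax()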
```
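
As a worked example, the sketch below shows a case where distance weighting changes the outcome. The distances and classes are made up, and `epsilon = 1e-8` is an arbitrary small constant.

```python
import numpy as np

# Hypothetical distances and classes of the k=3 nearest neighbors.
distances = np.array([0.1, 2.0, 2.5])
classes = np.array([0, 1, 1])
epsilon = 1e-8  # avoids division by zero

# Simple vote: each neighbor counts once, so class 1 wins 2 to 1.
simple_vote = np.bincount(classes).argmax()

# Weighted vote: the very close class-0 neighbor outweighs the two distant ones.
weights = 1 / (distances + epsilon)
weighted_vote = np.bincount(classes, weights=weights).argmax()

print(simple_vote, weighted_vote)
```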

**2. Radius-based NN**

Instead of a fixed $k$, it uses all points within a radius:

```python
neighbors = x_train[distances < radius]
```

> 💡 Both variations are available in the scikit-learn implementation
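
A minimal sketch of both variations in scikit-learn, on the same `iris` split used earlier. The values `n_neighbors=5` and `radius=1.0` are arbitrary choices, not tuned ones.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y)

# Weighted kNN: closer neighbors get a larger vote (weight = 1 / distance).
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
weighted_knn.fit(x_train, y_train)
acc_weighted = weighted_knn.score(x_test, y_test)

# Radius-based NN: every training point within `radius` votes.
# outlier_label handles test samples that have no neighbor inside the radius.
radius_nn = RadiusNeighborsClassifier(radius=1.0, outlier_label='most_frequent')
radius_nn.fit(x_train, y_train)
acc_radius = radius_nn.score(x_test, y_test)

print(acc_weighted, acc_radius)
```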


## Common Optimizations

**1. Efficient Data Structures**
- **KD-Tree**: for low dimensionality (< 10 dimensions).

- **Ball Tree**: for high dimensionality.

- **LSH (Locality Sensitive Hashing)**: for approximate search.

> 💡 KD-Tree and Ball Tree are available in the scikit-learn implementation
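
In scikit-learn the search structure is selected with the `algorithm` parameter. The sketch below checks that the three exact methods find neighbors at the same distances on `iris`. The dataset is small, so it shows only the API, not any speed difference.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

neighbor_distances = {}
for algorithm in ['brute', 'kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(x, y)
    # kneighbors returns (distances, indices) of the 5 nearest training points.
    distances, _ = knn.kneighbors(x)
    neighbor_distances[algorithm] = distances

# Largest difference between brute force and each tree-based search.
max_diff = max(
    np.abs(neighbor_distances['brute'] - d).max()
    for d in neighbor_distances.values()
)
print(max_diff)
```

The distances are compared instead of the neighbor indices because the methods may order equidistant points differently.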

**2. Dimensionality Reduction**

For very high-dimensional data, it is also common to apply dimensionality reduction algorithms (such as PCA and t-SNE) or feature selection (e.g., removing irrelevant features) before applying KNN.
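
A minimal sketch of PCA followed by KNN using a scikit-learn pipeline. `iris` has only 4 features, so this illustrates the mechanics, not a real high-dimensional case; `n_components=2` and `n_neighbors=5` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y)

# The pipeline fits PCA on the training data only and reuses that projection at prediction time.
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)

# Test samples projected onto the 2 principal components.
reduced = model.named_steps['pca'].transform(x_test)
print(reduced.shape, acc)
```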
