# Computer Lab 1: k-NN classifier

## Exercise 3 – User localization from RSSI

Consider the following scenario, in which we wish to localize a user employing a non-GPS system  (e.g., in indoor localization). The user holds a transmission device (e.g., a smartphone or other sensor with transmission capabilities). Localization is based on measurements of the Received Signal Strength Indicator (RSSI) from D sensors (base stations) placed in the area in which the localization service is provided. The area is divided into $N_C$ square cells, and localization amounts to identifying the cell in which the user is located.

In a **training stage**, the transmission device is placed in the center of each cell and broadcasts a data packet, and RSSI is measured by each sensor. This yields one measurement, corresponding to a vector of length $D$. The process is repeated $M$ times for each cell, and for all $N_C$ cells. The training stage provides a 3-dimensional array of size $N_C \times D \times M$.

In a **test stage**, the user is located in an unknown cell. The transmission device broadcasts a data packet, and each sensor measures the RSSI and communicates it to a fusion center. The fusion center treats the received RSSI values as a test vector of length $D$. It applies a k-NN classifier, comparing the test vector with all $M \times N_C$ training vectors available in the training set. For each test vector, the k-NN classifier outputs the probability that each cell contains the user.

**Available data**: you are provided with a file (`localization.mat` in `/data/` folder) containing two variables, called traindata and testdata. These variables have the same size, and are 3-dimensional arrays of size $D=7$, $M=5$, and $N_C = 24$.

The training data can be seen as labelled data where each cell is a class, and you are given M data vectors for each cell. Regarding the test data, a test vector consists of a single measurement; so each measurement has to be used individually and you can perform up to M tests for each cell.
The data correspond to real acquisition experiments performed outdoors nearby Politecnico di Torino, using an STM32L microcontroller with 915 MHz 802.15.4 transceiver.

**Task**: your task is to implement a k-NN classifier in Matlab for the classification task described above, and evaluate its performance.

**Performance evaluation**: The performance is defined in terms of accuracy in the localization task, and it has to be averaged over all cells. Average accuracy is defined as the posterior probability associated to the cell that the user is actually located in.

In [80]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random
import scipy.io
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from tqdm import tqdm

# Plots setting.
sns.set_context(
    'talk', rc = {
        'font.size': 12.0,
        'axes.labelsize': 10.0,
        'axes.titlesize': 10.0,
        'xtick.labelsize': 10.0,
        'ytick.labelsize': 10.0,
        'legend.fontsize': 10.0,
        'legend.title_fontsize': 12.0,
        'patch.linewidth': 2.0
        }
    )

data_sets = ['Train', 'Test']

In [2]:
# Check current folder.
os.getcwd()

'/'

In [114]:
data_path = "/Users/ernestocolacrai/Documents/GitHub/StatisticalLearning/data/"

try:
    # Attempt to load the MATLAB data file.
    data = scipy.io.loadmat(data_path + f"localization.mat")

    print(
        f"Data ✓\n",
        f"Data Keys: {data.keys()}"
        )
except:
    print(f"Not found data! ({data_path})")

Data ✓
 Data Keys: dict_keys(['__header__', '__version__', '__globals__', 'cell_coordinates', 'testdata', 'traindata'])


In [115]:
def rearrange(dataset, rows, columns, depth):
    arranged = np.zeros([columns * depth, rows + 1])
    count = 0
    for j in range(depth):
        for i in range(columns):
            arranged[i + count, :-1] = dataset[:, i, j].T
            arranged[i + count, -1] = j + 1 # +1 since it starts from 0
        
        count = count + columns
    return arranged

In [116]:
D = 7 # Features number (ROWS)
M = 5 # Measures number for each cell (class) (COLUMNS)
Nc = 24 # Classes number (cells number) (DEPTH)

len(rearrange(data['traindata'], D, M, Nc)), len(rearrange(data['testdata'], D, M, Nc))

(120, 120)

In [117]:
train_data = np.random.permutation(rearrange(data['traindata'], D, M, Nc))
test_data = np.random.permutation(rearrange(data['testdata'], D, M, Nc))

In [118]:
k = 7
bar = True

M = len(test_data)
N = len(train_data)

D = np.zeros([M, N], dtype=float)  # Distance matrix
E = np.zeros([M, k], dtype=int)  # Array of nearest neighbors
pred = np.zeros(M, dtype=int)

for i in tqdm(range(M), colour='green', disable=bar): # For each test point
    for j in range(N): # For each training point
        D[i][j] = np.sqrt(np.sum((test_data[i] - train_data[j]) ** 2)) # Calculate euclidean distance between the points
    # Find indices of k nearest neighbors
    E[i] = np.argsort(D[i])[:k]

    pred[i] = np.argmax(np.bincount(train_data[E[i]][:,-1].astype(int)))

In [122]:
pred, test_data[:,-1].astype(int)

(array([ 3,  9, 20,  3,  7, 19, 11, 23, 18, 13,  6,  3, 21, 20,  2, 13, 20,
        21,  3,  5, 20,  6, 17, 23, 21,  7, 23,  6, 15, 17,  6, 18, 19, 14,
        22, 17, 18,  6, 22, 19,  5, 18,  2, 19,  6, 23, 23, 14, 11, 11, 11,
        15, 23, 15,  3,  6, 18, 11,  6,  7,  4,  1, 17,  6, 14,  1, 17,  3,
         9,  7,  2, 23, 22, 14, 14, 17, 13,  5,  7, 14,  6, 11, 11, 22, 13,
        15, 20, 17, 23, 22,  6,  9, 21,  3, 13,  3,  3, 23, 11,  6, 11,  5,
         4,  6, 19,  4, 11,  4,  4, 11,  6,  3, 17, 14,  9, 21, 23,  7, 15,
         6]),
 array([ 3, 12, 20,  3,  5, 19,  9, 23, 18, 13,  8,  2, 21, 20,  1, 13, 20,
        21,  2,  5, 20, 10, 16, 24, 21,  7, 23, 10, 15, 17,  6, 18, 19, 13,
        22, 17, 18, 10, 22, 19,  5, 18,  1, 19,  8, 23, 23, 14, 11,  9, 12,
        15, 23, 15,  3, 10, 18, 11,  6,  7,  4,  1, 16,  8, 14,  1, 17,  2,
        12,  7,  1, 24, 22, 14, 14, 17, 16,  5,  7, 14,  8,  9,  9, 22, 13,
        15, 20, 17, 24, 22, 10, 12, 21,  2, 16,  2,  3, 24, 11,  6, 11,  5

--- HERE LAST IMPLEMENTATION

In [223]:
idx = random.randint(0,119)

# print(train_data[E[idx]][:,-1].astype(int))
# print(np.bincount(train_data[E[idx]][:,-1].astype(int)) / np.sum(np.bincount(train_data[E[idx]][:,-1].astype(int))))
# print(np.unique(train_data[E[idx]][:,-1].astype(int)))

In [224]:
vals, freqs = np.unique(train_data[E[idx]][:,-1].astype(int), return_counts=True)
percs = freqs / len(train_data[E[idx]][:,-1].astype(int))

print(vals)
print(freqs)
print(percs)
vals[np.argmax(freqs)], percs[np.argmax(freqs)]

[13 15 17]
[4 1 2]
[0.57142857 0.14285714 0.28571429]


(13, 0.5714285714285714)

--- END LAST IMPLEMENTATION

In [7]:
data['traindata'].shape

(7, 5, 24)

In [8]:
data['testdata'].shape

(7, 5, 24)

In [9]:
data['cell_coordinates']

array([[ 1.5,  1.5],
       [ 4.5,  1.5],
       [ 7.5,  1.5],
       [10.5,  1.5],
       [ 1.5,  4.5],
       [ 4.5,  4.5],
       [ 7.5,  4.5],
       [10.5,  4.5],
       [ 1.5,  7.5],
       [ 4.5,  7.5],
       [ 7.5,  7.5],
       [10.5,  7.5],
       [ 1.5, 10.5],
       [ 4.5, 10.5],
       [ 7.5, 10.5],
       [10.5, 10.5],
       [ 1.5, 13.5],
       [ 4.5, 13.5],
       [ 7.5, 13.5],
       [10.5, 13.5],
       [ 1.5, 16.5],
       [ 4.5, 16.5],
       [ 7.5, 16.5],
       [10.5, 16.5]])

In [10]:
np.sqrt(np.sum((rearrange(data['traindata'], D, M, Nc)[0] - rearrange(data['traindata'], D, M, Nc)[1]) ** 2))

1.7320508075688772

In [72]:
def knn(x_train: np.ndarray, y_train: np.ndarray, x_test: np.ndarray, k: int = 3, bar: bool = True) -> np.ndarray:

    """
    This function implements the k-nearest neighbors classification algorithm for classifying data points from a test set based on a training set.

    Args:
        x_train (np.ndarray): A NumPy array containing the training data (features).
        y_train (np.ndarray): A NumPy array containing the training data (labels).
        x_test (np.ndarray): A NumPy array containing the testing data (features).
        k (int, optional): The number of nearest neighbors to consider for classification. Defaults to 3.
        bar (bool, optional): A parameter for showing the progress bar. Defaults to False (not showing).

    Returns:
        np.ndarray: A NumPy array containing the predicted labels for testing data.
    """

    # x_train = train_df.iloc[:,:-1].to_numpy()
    # y_train = train_df.iloc[:,-1].to_numpy()
    # x_test = test_df.iloc[:,:-2].to_numpy()

    # Initialize data structures
    M = len(x_test)
    N = len(x_train)

    # Validate k parameter
    assert (type(k) != float) and (k % 2 == 1), "k parameter should be an odd integer number."
    assert k < N, "k parameter should be smaller than the train set size."

    pred = np.zeros(M, dtype=int)

    D = np.zeros([M, N], dtype=float)  # Distance matrix
    E = np.zeros([M, k], dtype=int)  # Array of nearest neighbors

    for i in tqdm(range(M), colour='green', disable=bar): # For each test point
        for j in range(N): # For each training point
            D[i][j] = np.sqrt(np.sum((x_test[i] - x_train[j]) ** 2)) # Calculate euclidean distance between the points
        
        # Find indices of k nearest neighbors
        E[i] = np.argsort(D[i])[:k]

        # l = np.sum(y_train[E[i]] == 1) >= (k + 1) / 2 # Check majority of labels (l)

        # prediction = 1 if l else 2 # Assign prediction based on majority
        pred[i] = np.argmax(np.bincount(E[i].astype(int)))
    
    return pred

In [73]:
train = np.random.permutation(rearrange(data['traindata'], D, M, Nc))
test = np.random.permutation(rearrange(data['testdata'], D, M, Nc))

x_train = train[:, :-1]
y_train = train[:, -1]
x_test = test[:, :-1]
y_test = test[:, -1]

pred = knn(
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    k=3
)

In [76]:
pred, y_test.astype(int)

(array([ 2, 16, 64, 41, 22, 21, 38, 35, 32,  2,  3,  7,  0, 43,  2, 43, 20,
        43, 26,  2, 31, 10,  9, 26,  9, 26, 21, 30, 26, 54, 26, 54, 28, 10,
        38, 32,  7, 20,  3, 31, 28, 18, 43, 35,  9, 18, 25, 43, 28, 30, 38,
        38,  9,  9, 26, 64, 28, 10, 26, 64, 28, 59,  7, 26, 64, 25, 64, 21,
        15,  3, 35, 25, 25,  3, 10, 15, 21, 12, 12, 54, 12,  7, 22, 88, 25,
         3, 18,  3, 28, 32, 26, 22,  3,  7, 18, 31,  9, 20, 58, 12,  3,  4,
         4, 28, 31,  3, 12,  0, 22,  4, 30, 26, 28, 22, 20, 20, 41, 21,  2,
        10]),
 array([18, 13, 15,  3, 19, 11, 12,  1, 13, 18,  7,  8,  1, 10, 18, 10,  2,
        10, 23, 18, 21,  4, 14, 24, 14, 24, 11, 22, 23,  3, 23,  3, 12,  4,
        12, 13,  8,  2,  6, 21,  9,  5, 10,  1, 14,  5, 20, 10,  9, 22, 12,
        12, 14, 14, 24, 16,  9,  4, 24, 15,  9, 21,  8, 23, 15, 20, 15, 11,
        22,  7,  1, 20, 20,  7,  4, 22, 11, 16, 17,  3, 17,  8, 19, 15, 20,
         6,  5,  6,  6, 13, 24, 19,  7,  8,  5, 21, 13,  2, 17, 17,  5, 16

(120, 3)

In [120]:
import random
i = random.randint(1,120)
y_test[pred[i]], y_test[pred[i]]

(array([ 9.,  6., 12.]), array([ 9.,  6., 12.]))