# Assignment 2

Create a program to evaluate the Generalization Error (GE), Model Prediction Error (ME) and Training Error (TE) of the k-nearest neighbors (KNN) learning approach. To do so, fit the model for every neighborhood size from 1 to 35.

### Imports:

In [3]:
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

### Built-in Custom Functions:

In [4]:
def KNN(N_size, x_train, Y_train, x_test):
    '''
    This function implements the KNN regression learning model. The required
    inputs are the following:
    -   N_size (integer): size of the neighborhood. Automatically reduced to
    training dataset size if greater than it.
    -   x_train (1-D list): list of x values related to the training data set.
    -   Y_train (1-D list): list of Y values related to the training data set.
    -   x_test (1-D list): list of x values related to the testing data set.

    The function outputs the following:
    -   Y_hat (1-D list): list containing the KNN regressed values for the 
    x_test data set according to the model training.
    '''

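    # Cap the neighborhood size at the number of training samples
    N_size = min(len(x_train), N_size)

    # scikit-learn expects 2-D inputs: one row per sample, one column per feature
    X_train = [[x] for x in x_train]
    model = KNeighborsRegressor(n_neighbors=N_size).fit(X_train, Y_train)

    X_test = [[x] for x in x_test]
    Y_hat = model.predict(X_test)

    return Y_hat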


## Solution Code:

### Data Sets and Learning Model:


#### Training Set:

With $N^{training} = 50$:


- Generate $x_i$: $N^{training}$ evenly spaced data points between 0 and 1.

- Generate $n_i$: $N^{training}$ noise samples drawn from a normal distribution with zero mean and variance 0.1.

- Build the observed data model as: 

$Y_i^{training} = f(x_i) + n_i$, with $i = 1, \dots, N^{training}$ and $f(x) = \sin(2 \pi x)$

In [57]:
# Generate a set of noisy observations of f(x) = sin(2*pi*x)
def gen_data(n_samples: int) -> np.ndarray:
    def func_x(x):
        return np.sin(2 * np.pi * x)

    # Evenly spaced inputs in [0, 1]
    x_i = np.linspace(0, 1, num=n_samples)

    # Gaussian noise with zero mean and variance 0.1 (scale is the standard deviation)
    n_i = np.random.normal(loc=0, scale=np.sqrt(0.1), size=n_samples)

    # Noisy observations: signal plus noise
    generated_samples = func_x(x_i) + n_i

    # Report the data set size
    print(np.shape(generated_samples))

    return generated_samples

### Generating training set (`n_samples` = 50)

In [60]:
Y_training = gen_data(n_samples=50)
Y_training

(50,)


array([-0.17325964,  0.40127299,  0.90727945,  0.12146286,  0.67388567,
        0.67142145, -0.089101  ,  0.24022948,  0.94281416,  1.15124657,
        1.30653578,  1.29571659,  1.19991314,  1.31520611,  0.82542668,
        1.12456576,  1.41869973,  1.26865552,  0.80373527,  0.75608585,
       -0.10391818,  0.71083597,  0.20328744,  0.22398467, -0.09016854,
        0.33849203, -0.19251314, -0.17479673, -0.43473069, -0.56812546,
       -0.58501721, -0.57470081, -0.80517429, -1.30976322, -0.85088787,
       -1.02231995, -0.73495988, -1.1142986 , -0.65244351, -1.08301263,
       -0.8755801 , -1.25991329, -0.62545113, -1.25669396, -0.76075747,
       -0.43981506, -0.42468342,  0.377051  , -0.47555186, -0.06910076])

#### Testing Set:

With $N^{testing} = 300$:


- Follow the same steps as for the training set, using $N^{testing}$ instead of $N^{training}$.

In [61]:
Y_testing = gen_data(n_samples=300)
Y_testing

(300,)


array([ 0.13757984,  0.25869262, -0.20949118,  0.17674034, -0.03478698,
        0.22973475, -0.0078699 ,  0.58588469,  0.5497586 , -0.05209111,
        0.66696454,  0.61768343, -0.07776954,  0.36162855,  0.40349582,
        0.44093632,  0.22273598,  0.37262832,  0.89473556,  0.48391178,
        0.3210905 ,  0.84992815,  0.58934761,  0.07998875,  0.10384251,
        0.56599913,  1.06778825, -0.06967744,  0.42451053,  0.85312585,
        0.88410112,  0.94862151,  0.49207096,  1.12226682,  0.15594538,
        0.34996053,  0.51798479,  1.07121846,  0.65218045,  0.90890081,
        1.18803548,  0.25336425,  0.78492318,  0.35925907,  0.59699353,
        0.94894729,  1.07428793,  1.28028865,  1.10787707,  0.60563953,
        0.92393699,  0.68274705,  0.72475129,  1.08630822,  0.82998617,
        0.35854046,  1.22116505,  0.6851774 ,  1.07434808,  0.53684975,
        1.46873249,  0.2319586 ,  0.88006977,  0.92839407,  0.94393505,
        1.05498104,  0.85548012,  1.05461462,  1.31169187,  1.30

#### Learning Model:

Fit the K-Nearest Neighbors model and evaluate its performance. Plot the model result for neighborhood sizes of 1, 5, 15, 25 and 40.

In [None]:
N_sizes_plot = [1,2,5,35] #Plot these neighborhood sizes

#KNN result for every neighborhood size?

#Hint: call KNN(N_size, x_train, Y_train, x_test) iteratively for each
#neighborhood size and plot the result

#Hint: plot with transparency so that all lines remain visible
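
One possible way to follow the hints above is sketched here. It rebuilds the data sets so that the x values are available alongside the Y values, and it calls `KNeighborsRegressor` directly, mirroring what the `KNN` helper does. The fixed seed, the `rng` generator, the `predictions` dictionary and the plot styling are assumptions made for illustration, not part of the assignment.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility


def f(x):
    return np.sin(2 * np.pi * x)


# Rebuild the training and testing sets, keeping the x values
x_train = np.linspace(0, 1, 50)
Y_train = f(x_train) + rng.normal(0, np.sqrt(0.1), 50)
x_test = np.linspace(0, 1, 300)
Y_test = f(x_test) + rng.normal(0, np.sqrt(0.1), 300)

N_sizes_plot = [1, 2, 5, 35]
predictions = {}  # neighborhood size -> regressed values on x_test

fig, ax = plt.subplots()
ax.scatter(x_train, Y_train, s=10, color="gray", label="training data")
for k in N_sizes_plot:
    model = KNeighborsRegressor(n_neighbors=k).fit(x_train.reshape(-1, 1), Y_train)
    predictions[k] = model.predict(x_test.reshape(-1, 1))
    # alpha < 1 keeps overlapping curves visible
    ax.plot(x_test, predictions[k], alpha=0.6, label=f"k = {k}")
ax.set_xlabel("x")
ax.set_ylabel("Y")
ax.legend()
```

Small neighborhoods follow the noise closely, while large ones average over a wide range of x and flatten the curve.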


### Evaluation:

#### Generalization Error (GE):

In [None]:
### GE Calculation:
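
A minimal sketch of the GE calculation, assuming GE is defined as the mean squared error between the noisy testing observations and the KNN predictions on the testing inputs. That definition, the fixed seed and the names `rng`, `k_values` and `GE` are assumptions; adapt them to the definition given in the course.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility


def f(x):
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0, 1, 50)
Y_train = f(x_train) + rng.normal(0, np.sqrt(0.1), 50)
x_test = np.linspace(0, 1, 300)
Y_test = f(x_test) + rng.normal(0, np.sqrt(0.1), 300)

k_values = list(range(1, 36))  # neighborhood sizes 1 to 35
GE = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k).fit(x_train.reshape(-1, 1), Y_train)
    Y_hat = model.predict(x_test.reshape(-1, 1))
    # Assumed definition: MSE against the noisy test observations
    GE.append(np.mean((Y_test - Y_hat) ** 2))
GE = np.array(GE)
```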


#### Model Prediction Error (ME):

In [None]:
### ME Calculation:
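
A minimal sketch of the ME calculation, assuming ME is defined as the mean squared error between the noise-free function $f(x)$ and the KNN predictions on the testing inputs. That definition, the fixed seed and the names `rng`, `k_values` and `ME` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility


def f(x):
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0, 1, 50)
Y_train = f(x_train) + rng.normal(0, np.sqrt(0.1), 50)
x_test = np.linspace(0, 1, 300)

k_values = list(range(1, 36))  # neighborhood sizes 1 to 35
ME = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k).fit(x_train.reshape(-1, 1), Y_train)
    Y_hat = model.predict(x_test.reshape(-1, 1))
    # Assumed definition: MSE against the noise-free target f(x)
    ME.append(np.mean((f(x_test) - Y_hat) ** 2))
ME = np.array(ME)
```

Under this definition, ME measures how far the model is from the underlying function, without the noise present in the testing observations.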

#### Training Error (TE):

In [None]:
### Regression of training dataset values:

#KNN result for training data input?

#TE Calculation:
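
A minimal sketch of the TE calculation, assuming TE is defined as the mean squared error between the training observations and the KNN predictions on the training inputs themselves. That definition, the fixed seed and the names `rng`, `k_values` and `TE` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility


def f(x):
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0, 1, 50)
Y_train = f(x_train) + rng.normal(0, np.sqrt(0.1), 50)
X_train = x_train.reshape(-1, 1)  # 2-D input expected by scikit-learn

k_values = list(range(1, 36))  # neighborhood sizes 1 to 35
TE = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, Y_train)
    # Regress the training inputs with the model trained on them
    Y_hat_train = model.predict(X_train)
    TE.append(np.mean((Y_train - Y_hat_train) ** 2))
TE = np.array(TE)
```

With a neighborhood size of 1, every training point is its own nearest neighbor, so the training error is zero for that size.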

#### Errors Plot:

In [None]:
#Visualization of all errors together? Different axes for the two different trends?
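
One way to show the three errors together is a plot with two y-axes, sketched below with `Axes.twinx`. The error definitions (mean squared error against the noisy test data for GE, against the noise-free $f(x)$ for ME, and against the training data for TE), the fixed seed, the `mse` helper and the choice to put TE on the right-hand axis are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility


def f(x):
    return np.sin(2 * np.pi * x)


def mse(a, b):
    # Hypothetical helper: mean squared error between two arrays
    return np.mean((a - b) ** 2)


x_train = np.linspace(0, 1, 50)
Y_train = f(x_train) + rng.normal(0, np.sqrt(0.1), 50)
x_test = np.linspace(0, 1, 300)
Y_test = f(x_test) + rng.normal(0, np.sqrt(0.1), 300)
X_train, X_test = x_train.reshape(-1, 1), x_test.reshape(-1, 1)

k_values = list(range(1, 36))
GE, ME, TE = [], [], []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, Y_train)
    Y_hat = model.predict(X_test)
    GE.append(mse(Y_test, Y_hat))                   # vs noisy test data
    ME.append(mse(f(x_test), Y_hat))                # vs noise-free function
    TE.append(mse(Y_train, model.predict(X_train)))  # vs training data

fig, ax_left = plt.subplots()
ax_left.plot(k_values, GE, label="GE")
ax_left.plot(k_values, ME, label="ME")
ax_left.set_xlabel("neighborhood size")
ax_left.set_ylabel("GE, ME")

# Second y-axis sharing the same x-axis, used here for the training error
ax_right = ax_left.twinx()
ax_right.plot(k_values, TE, color="tab:green", label="TE")
ax_right.set_ylabel("TE")

# Merge the legend entries of both axes into one legend
handles_l, labels_l = ax_left.get_legend_handles_labels()
handles_r, labels_r = ax_right.get_legend_handles_labels()
ax_left.legend(handles_l + handles_r, labels_l + labels_r)
```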