# Data Loading and Preprocessing

We consider the same dataset used in the labs, which contains house sale prices for King County (which includes Seattle) for homes sold between May 2014 and May 2015.

https://www.kaggle.com/harlfoxem/housesalesprediction

For each house we know 18 features (e.g., number of bedrooms, number of bathrooms, etc.) plus its price, which is what we would like to predict.

## TO DO: Insert your ID number ("numero di matricola") below

In [1]:
#put here your "numero di matricola"
numero_di_matricola = 2005838 #the ID number that appears as random_state in the outputs below

Load the required packages

In [2]:
#import all packages needed
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the data, remove data samples/points with missing values (NaN), and print some statistics.

In [3]:
#load the data
df = pd.read_csv('kc_house_data.csv', sep = ',')

#remove the data samples with missing values (NaN)
df = df.dropna() 

df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0,3164.0
mean,4645240000.0,535435.8,3.381163,2.071903,2070.027813,15250.54,1.434893,0.009798,0.244311,3.459229,7.615676,1761.252212,308.775601,1967.489254,94.668774,98077.125158,47.557868,-122.212337,1982.544564,13176.302465
std,2854203000.0,380900.4,0.895472,0.768212,920.251879,42544.57,0.507792,0.098513,0.776298,0.682592,1.166324,815.934864,458.977904,28.095275,424.439427,54.172937,0.140789,0.139577,686.25667,25413.180755
min,1000102.0,75000.0,0.0,0.0,380.0,649.0,1.0,0.0,0.0,1.0,3.0,380.0,0.0,1900.0,0.0,98001.0,47.1775,-122.514,620.0,660.0
25%,2199775000.0,315000.0,3.0,1.5,1430.0,5453.75,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1950.0,0.0,98032.0,47.459575,-122.32425,1480.0,5429.5
50%,4027701000.0,445000.0,3.0,2.0,1910.0,8000.0,1.0,0.0,0.0,3.0,7.0,1545.0,0.0,1969.0,0.0,98059.0,47.5725,-122.226,1830.0,7873.0
75%,7358175000.0,640250.0,4.0,2.5,2500.0,11222.5,2.0,0.0,0.0,4.0,8.0,2150.0,600.0,1990.0,0.0,98117.0,47.68025,-122.124,2360.0,10408.25
max,9839301000.0,5350000.0,8.0,6.0,8010.0,1651359.0,3.5,1.0,4.0,5.0,12.0,6720.0,2620.0,2015.0,2015.0,98199.0,47.7776,-121.315,5790.0,425581.0


Get the feature matrix and the vector of target values. We want to predict the price by using features other than id as input.

In [4]:
Data = df.values
# m = number of input samples
m = Data.shape[0]
print("Amount of data:",m)
Y = Data[:m,2]
X = Data[:m,3:]

feature_names = df.columns[3:]

Amount of data: 3164


We split the $m$ samples of the data into 3 parts: one will be used for training and choosing the parameters, one for choosing among different models, and one for testing. The part for training and choosing the parameters consists of $m_{train} = \lfloor 2m/3 \rfloor$ samples, the part for choosing among different models consists of $m_{val} = \lfloor (m - m_{train})/2 \rfloor$ samples, and the remaining part consists of $m_{test} = m - m_{train} - m_{val}$ samples.

In [5]:
# Split data into train (2/3 of samples), validation (1/6 of samples), and test data (the rest)
m_train = int(2./3.*m)
m_val = int((m-m_train)/2.)
m_test = m - m_train - m_val
print("Amount of data for training and deciding parameters:",m_train)
print("Amount of data for validation (choosing among different models):",m_val)
print("Amount of data for test:",m_test)
from sklearn.model_selection import train_test_split

#Xtrain_and_val, Ytrain_and_val is the part of data for training and validation
#Xtest, Ytest is the part of data for testing
Xtrain_and_val, Xtest, Ytrain_and_val, Ytest = train_test_split(X, Y, test_size=m_test/m, random_state=numero_di_matricola)

#if you need to consider a specific training and validation split, use
#Xtrain, Ytrain for training and Xval, Yval for validation
Xtrain, Xval, Ytrain, Yval = train_test_split(Xtrain_and_val, Ytrain_and_val, test_size=m_val/(m_train+m_val), random_state=numero_di_matricola)

Amount of data for training and deciding parameters: 2109
Amount of data for validation (choosing among different models): 527
Amount of data for test: 528


Let's scale the data.

In [6]:
# Data pre-processing
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain_scaled = scaler.transform(Xtrain)
Xtrain_and_val_scaled = scaler.transform(Xtrain_and_val)
Xval_scaled = scaler.transform(Xval)
Xtest_scaled = scaler.transform(Xtest)
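
To make the scaling step concrete, here is a minimal sketch of what standardization does, written with plain NumPy on made-up numbers (the arrays `train` and `new` are illustrative and are not taken from the dataset). The mean and standard deviation are computed on the training rows only and then reused for any other rows, which mirrors fitting the scaler on one split and calling `transform` on the others.

```python
import numpy as np

# Toy data: rows are samples, columns are features (illustrative values only).
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
new = np.array([[2.0, 40.0]])

# Column-wise mean and (population) standard deviation of the training rows.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Standardize both sets with the training statistics.
train_scaled = (train - mean) / std
new_scaled = (new - mean) / std

print(train_scaled.mean(axis=0))  # each column has mean ~0
print(train_scaled.std(axis=0))   # each column has standard deviation ~1
print(new_scaled)                 # rows outside the training set need not be centered
```

Reusing the training statistics keeps information from the held-out rows from leaking into the preprocessing.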

# Neural Networks

Let's learn the best neural network with 1 hidden layer and between 1 and 9 hidden nodes, choosing the best number of hidden nodes with cross-validation.
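
As a rough illustration of how cross-validation picks a parameter value, the sketch below splits the sample indices into folds, scores every candidate on each held-out fold, and keeps the candidate with the best mean score. The helpers `k_fold_indices` and `select_by_cv` are hypothetical names written for this example, the folds are contiguous as a simplifying assumption, and the toy score ignores the data entirely; it only shows the selection mechanics.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous validation folds.

    Returns a list of (train_idx, val_idx) pairs.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)
    splits = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        splits.append((train_idx, val_idx))
    return splits

def select_by_cv(candidates, evaluate, n_samples, n_folds=5):
    """Return the candidate with the highest mean validation score.

    evaluate(candidate, train_idx, val_idx) must return a score
    where larger is better (for example R^2).
    """
    mean_scores = {}
    for cand in candidates:
        scores = [evaluate(cand, tr, va)
                  for tr, va in k_fold_indices(n_samples, n_folds)]
        mean_scores[cand] = float(np.mean(scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def toy_score(cand, train_idx, val_idx):
    """Made-up score that peaks at candidate 6 (it does not look at any data)."""
    return -abs(cand - 6)

best, scores = select_by_cv(range(1, 10), toy_score, n_samples=20)
print(best)
```

In a real run, `evaluate` would fit a model with the candidate number of hidden nodes on the training indices and return its score on the validation indices.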

In [7]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

mlp_cv = MLPRegressor()
param_grid = {'hidden_layer_sizes': [i for i in range(1,10)],
              'activation': ['relu'],
              'solver': ['lbfgs'], 
              'random_state': [numero_di_matricola]
             }
mlp_GS = GridSearchCV(mlp_cv, param_grid=param_grid, 
                   cv=5, verbose=True)
mlp_GS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

GridSearchCV(cv=5, estimator=MLPRegressor(),
             param_grid={'activation': ['relu'],
                         'hidden_layer_sizes': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'random_state': [2005838], 'solver': ['lbfgs']},
             verbose=True)

Now let's check which value of the parameter is best, so that the best NN can later be compared with the linear model (learned on train and validation) on test data.

In [8]:
#let's print the best model according to grid search
print("Best model: ",mlp_GS.best_estimator_)
#let's print the error 1-R^2 for the best model
print("Error (1-R^2) of best model: ",1. - mlp_GS.best_score_)

Best model:  MLPRegressor(hidden_layer_sizes=6, random_state=2005838, solver='lbfgs')
Error (1-R^2) of best model:  0.19928166870691155
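
The error reported throughout is $1 - R^2$, where $R^2$ is the coefficient of determination that scikit-learn regressors return from `score`. Since $R^2 = 1 - SS_{res}/SS_{tot}$, the error is simply the ratio $SS_{res}/SS_{tot}$. The helper `one_minus_r2` below is a small illustrative function, not part of the notebook, and the numbers are made up.

```python
def one_minus_r2(y_true, y_pred):
    """Return 1 - R^2 = SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared errors of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(one_minus_r2(y_true, y_true))                # perfect predictions -> 0.0
print(one_minus_r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 1.0
```

An error of 0 therefore means perfect predictions, and an error of 1 means the model does no better than predicting the mean target of the evaluated data.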


Let's learn the best NN using all of training and validation, and then compare the error of the best NN on train and validation and on test data.

In [9]:
best_mlp = MLPRegressor(hidden_layer_sizes=(6,), activation='relu', solver='lbfgs', random_state = numero_di_matricola)
best_mlp.fit(Xtrain_and_val_scaled,Ytrain_and_val)

print("Error best model on train and validation: ",1. - best_mlp.score(Xtrain_and_val_scaled,Ytrain_and_val))
print("Error best model on test data: ",1. - best_mlp.score(Xtest_scaled,Ytest))

Error best model on train and validation:  0.14684514004479243
Error best model on test data:  0.2108732059657511


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


# Linear Regression

Now let's learn the linear model on train and validation, and compute the error (1-R^2) on train and validation and on test data.

In [10]:
from sklearn import linear_model
#LR the linear regression model
LR = linear_model.LinearRegression()

#fit the model on training and validation data
LR.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print("1 - coefficient of determination on training data:"+str(1 - LR.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("1 - coefficient of determination on test data:"+str(1 - LR.score(Xtest_scaled,Ytest)))

1 - coefficient of determination on training data:0.2816091847539549
1 - coefficient of determination on test data:0.28199685648204786


# k-Nearest Neighbours

You will now explore the k-Nearest Neighbours (kNN) method for regression. In order to do this, you will need to load the scikit-learn class *neighbors.KNeighborsRegressor*.

k-Nearest Neighbours for regression works as follows: the predicted value $h(\textbf{x})$ for an instance $\textbf{x}$ is obtained by first finding the $\ell$ instances *in the training set* that are closest to $\textbf{x}$; the predicted value $h(\textbf{x})$ is then the mean of the targets of these $\ell$ instances. $\ell$ is a parameter of the method. The targets of the $\ell$ instances used for prediction can be weighted by the (inverse of) their distance to $\textbf{x}$.
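
The procedure just described can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions (Euclidean distance, a single query instance, toy data); `knn_predict` is a hypothetical helper, with `n_neighbors` playing the role of $\ell$, and it is not how scikit-learn implements the method internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x, n_neighbors=5, weights="uniform"):
    """Predict the target of a single instance x with k-nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from x to every training instance.
    dist = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # Indices of the n_neighbors closest training instances.
    nearest = np.argsort(dist)[:n_neighbors]
    if weights == "uniform":
        return y_train[nearest].mean()
    # Inverse-distance weighting; exact matches (distance 0) take over
    # the prediction, which avoids dividing by zero.
    d = dist[nearest]
    if np.any(d == 0):
        return y_train[nearest][d == 0].mean()
    w = 1.0 / d
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Toy one-dimensional data (illustrative values only).
X = [[0.0], [1.0], [2.0], [10.0]]
y = [0.0, 10.0, 20.0, 100.0]
print(knn_predict(X, y, [1.5], n_neighbors=2))                       # mean of 10 and 20
print(knn_predict(X, y, [1.25], n_neighbors=2, weights="distance"))  # pulled towards 10
print(knn_predict(X, y, [2.0], n_neighbors=2, weights="distance"))   # exact match in the training set
```

The last call shows why inverse-distance weighting gives a very small error on the data used for training: a query that coincides with a training instance gets that instance's own target back.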

## TO DO: load the package for kNN regression, learn the model with default parameters using the training and validation scaled data, and print the error (1-R^2) on the data used to train the model and on the test data.

In [11]:
#TO DO: import package
from sklearn.neighbors import KNeighborsRegressor

#TO DO: learn model
neigh = KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
neigh.fit(Xtrain_and_val_scaled,  Ytrain_and_val)

print("Error on training and validation data:"+str(1 - neigh.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error on test data:"+str(1 - neigh.score(Xtest_scaled,Ytest)))

Error on training and validation data:0.16882691013756013
Error on test data:0.1988584902797098


## TO DO: repeat the step above (including the printing instructions) using the kNN version where points are weighted by the inverse of their distance

In [12]:
#TO DO: import package
from sklearn.neighbors import KNeighborsRegressor

#TO DO: learn model
neigh = KNeighborsRegressor(n_neighbors=5, weights='distance', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
neigh.fit(Xtrain_and_val_scaled,  Ytrain_and_val)

print("Error on training and validation data:"+str(1 - neigh.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error on test data:"+str(1 - neigh.score(Xtest_scaled,Ytest)))

Error on training and validation data:0.00047349954080910805
Error on test data:0.1922744420412975


## TO DO: use cross validation to choose the best number of neighbours (between 2 and 20)

In [13]:
#candidate numbers of neighbours: 2, 3, ..., 20
param_grid = {'n_neighbors': list(range(2, 21))}
#n_neighbors is set by the grid search, so the base estimator keeps its defaults
knnc = KNeighborsRegressor()
knnGS = GridSearchCV(knnc, param_grid=param_grid, cv=5)
knn_results=knnGS.fit(Xtrain_and_val_scaled, Ytrain_and_val)

print(knn_results.best_score_)
print(knn_results.best_params_)

0.7338392839597482
{'n_neighbors': 6}


## TO DO: print the best model according to cross validation above, and print the score of the best model 

In [14]:
#let's print the best model according to grid search
print("Best model: ", knnGS.best_estimator_)
#let's print the error 1-R^2 for the best model
print("Error (1-R^2) of best model: ", 1 - knnGS.best_score_)

Best model:  KNeighborsRegressor(n_neighbors=6)
Error (1-R^2) of best model:  0.2661607160402518


## TO DO: learn the best model on all of the training and validation scaled data, and print the error on training and validation scaled data, and on test scaled data

In [15]:
knn_best=knnGS.best_estimator_
knn_best.fit(Xtrain_and_val_scaled,Ytrain_and_val)

print("Error best model on train and validation: ", str(1 - knn_best.score(Xtrain_and_val_scaled,Ytrain_and_val)))
print("Error best model on test data: ", str(1 - knn_best.score(Xtest_scaled,Ytest)))

Error best model on train and validation:  0.17456408989696204
Error best model on test data:  0.2158415513975861


## TO DO: compare the error on test data of the best kNN model with the error on test data of linear regression and of NNs. Describe what you observe and give a potential explanation.
## [USE MAX 10 LINES]

The test error of the best kNN model is 0.2158.

The test error of linear regression is 0.2820.

The test error of the NN is 0.2109.

The NN has the smallest test error, with kNN very close behind, while linear regression is clearly worse. So the best model to use is the neural network. A possible explanation is that the relationship between the features and the price is nonlinear: both the NN and kNN can capture nonlinearities, whereas the linear model cannot.

# Clustering and "Local" Linear Models

You are now going to explore the use of clustering to identify groups of *similar* instances, and then learning models that are specific to each group.

Once you have clustered the data, and then learned a model for each cluster, the prediction for a new instance is obtained by using the model of the cluster that is the closest to the instance, where the distance of a cluster to the instance is defined as the distance of the *center* of the cluster to the instance.
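
The prediction rule described above can be sketched as follows. The helpers `nearest_center` and `predict_with_local_models` are hypothetical names for this illustration, and the centers and local models are made up; in the notebook the centers would come from k-means and the models from the per-cluster fits.

```python
import numpy as np

def nearest_center(centers, X):
    """Return, for each row of X, the index of the closest cluster center."""
    centers = np.asarray(centers, dtype=float)
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_clusters).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def predict_with_local_models(centers, models, X):
    """Predict each row of X with the model of its closest cluster.

    models[c] is any callable mapping a 2-D array of instances to predictions.
    """
    X = np.asarray(X, dtype=float)
    cluster = nearest_center(centers, X)
    y_pred = np.empty(len(X))
    for c, model in enumerate(models):
        mask = cluster == c
        if mask.any():  # a cluster may receive no new instances
            y_pred[mask] = model(X[mask])
    return y_pred

# Two made-up centers and two made-up local models (illustrative only).
centers = [[0.0, 0.0], [10.0, 10.0]]
models = [lambda A: A.sum(axis=1),     # model for cluster 0
          lambda A: 100.0 + A[:, 0]]   # model for cluster 1
X_new = [[1.0, 2.0], [9.0, 11.0], [0.5, 0.5]]
print(nearest_center(centers, X_new))
print(predict_with_local_models(centers, models, X_new))
```

With a fitted scikit-learn `KMeans` object, `predict` performs the same nearest-center assignment.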

**Note**: in this part you are not explicitly told which part of the data to use; deciding which one is correct is part of the homework!

## TO DO: use k-means in sklearn to learn a clustering with 5 clusters.

In [16]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=numero_di_matricola)
kmeans.fit(Xtrain_and_val_scaled) #k-means is unsupervised, so the targets are not needed

KMeans(n_clusters=5, random_state=2005838)

## TO DO: for each cluster, learn a linear model using the elements of the cluster. For each model, print the error on the data used to learn it.

In [17]:
#training set
x0 = []
x1 = []
x2 = []
x3 = []
x4 = []

y0 = []
y1 = []
y2 = []
y3 = []
y4 = []

for i in range(len(kmeans.labels_)):
    if kmeans.labels_[i] == 0 :
        x0.append(Xtrain_and_val_scaled[i])
        y0.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 1 :
        x1.append(Xtrain_and_val_scaled[i])
        y1.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 2 :
        x2.append(Xtrain_and_val_scaled[i])
        y2.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 3 :
        x3.append(Xtrain_and_val_scaled[i])
        y3.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 4 :
        x4.append(Xtrain_and_val_scaled[i])
        y4.append(Ytrain_and_val[i])
        
LR0 = linear_model.LinearRegression()
LR1 = linear_model.LinearRegression()
LR2 = linear_model.LinearRegression()
LR3 = linear_model.LinearRegression()
LR4 = linear_model.LinearRegression()

LR0.fit(x0,y0)
LR1.fit(x1,y1)
LR2.fit(x2,y2)
LR3.fit(x3,y3)
LR4.fit(x4,y4)




if(len(x0)!=0):
    print("Training error linear model of cluster 0:  "+str(1 - LR0.score(x0,y0)))
if(len(x1)!=0):
    print("Training error linear model of cluster 1:  "+str(1 - LR1.score(x1,y1)))
if(len(x2)!=0):
    print("Training error linear model of cluster 2:  "+str(1 - LR2.score(x2,y2)))
if(len(x3)!=0):
    print("Training error linear model of cluster 3:  "+str(1 - LR3.score(x3,y3)))
if(len(x4)!=0):
    print("Training error linear model of cluster 4:  "+str(1 - LR4.score(x4,y4)))


Training error linear model of cluster 0:  0.3203417032347149
Training error linear model of cluster 1:  0.36033818834211506
Training error linear model of cluster 2:  0.17097136397876644
Training error linear model of cluster 3:  0.06493111033909638
Training error linear model of cluster 4:  0.10344099598748535


## TO DO: *compute* the error (1 - R^2) on the data not used to learn the models.
For each instance not used to learn the model, the prediction is done by:
- finding the cluster C whose center is the closest to the instance
- using the model learned for cluster C to make the prediction
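
Per-cluster errors and a single error for the combined model answer different questions. One possible way to obtain a single $1 - R^2$ for "clustering + local models" is to collect the prediction of every held-out instance (each made by the model of its own cluster) and evaluate the error once on the pooled vectors. The sketch below uses made-up targets, predictions, and cluster assignments; `one_minus_r2` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def one_minus_r2(y_true, y_pred):
    """1 - R^2 = SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return ss_res / ss_tot

# Made-up targets, predictions, and cluster assignments (illustrative only).
y_true = np.array([1.0, 2.0, 3.0, 101.0, 102.0, 103.0])
y_pred = np.array([2.0, 2.0, 2.0, 102.0, 102.0, 102.0])
cluster = np.array([0, 0, 0, 1, 1, 1])

# Error of each local model, measured on its own instances only.
for c in np.unique(cluster):
    mask = cluster == c
    print("cluster", c, one_minus_r2(y_true[mask], y_pred[mask]))

# Error of the combined model, measured on all instances at once.
print("pooled", one_minus_r2(y_true, y_pred))
```

In this toy case both per-cluster errors are 1.0 while the pooled error is about 0.00027, because each per-cluster $SS_{tot}$ only measures the spread inside that cluster. Per-cluster values are therefore not directly comparable with the error of a single global model, whereas the pooled value is.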

In [18]:
#error on test set

prediction_cluster = kmeans.predict(Xtest_scaled)

x0_test = []
x1_test = []
x2_test = []
x3_test = []
x4_test = []

y0_test = []
y1_test = []
y2_test = []
y3_test = []
y4_test = []


for i in range(len(prediction_cluster)):
    if  prediction_cluster[i] == 0 :
        x0_test.append(Xtest_scaled[i])
        y0_test.append(Ytest[i])
    elif prediction_cluster[i] == 1 :
        x1_test.append(Xtest_scaled[i])
        y1_test.append(Ytest[i])
    elif prediction_cluster[i] == 2 :
        x2_test.append(Xtest_scaled[i])
        y2_test.append(Ytest[i])
    elif prediction_cluster[i] == 3 :
        x3_test.append(Xtest_scaled[i])
        y3_test.append(Ytest[i])
    elif prediction_cluster[i] == 4 :
        x4_test.append(Xtest_scaled[i])
        y4_test.append(Ytest[i])



        


## TO DO: *print* the error (1-R^2) on the data not used to learn the models

In [19]:
if(len(x0_test)!=0):
    x0_err = 1. - LR0.score(x0_test, y0_test)
    print("Test error of Cluster 0 Linear Model: " + str(x0_err))
if(len(x1_test)!=0):
    x1_err = 1. - LR1.score(x1_test, y1_test)
    print("Test error of Cluster 1 Linear Model: " + str(x1_err))
if(len(x2_test)!=0):
    x2_err = 1. - LR2.score(x2_test, y2_test)
    print("Test error of Cluster 2 Linear Model: " + str(x2_err))
if(len(x3_test)!=0):
    x3_err = 1. - LR3.score(x3_test, y3_test)
    print("Test error of Cluster 3 Linear Model: " + str(x3_err))
if(len(x4_test)!=0):
    x4_err = 1. - LR4.score(x4_test, y4_test)
    print("Test error of Cluster 4 Linear Model: " + str(x4_err))


Test error of Cluster 0 Linear Model: 0.3500111301228288
Test error of Cluster 1 Linear Model: 0.3359193609841882
Test error of Cluster 2 Linear Model: 0.422641732356422
Test error of Cluster 3 Linear Model: 0.0227036174010663
Test error of Cluster 4 Linear Model: 0.3627737976423909


## TO DO: compare the error of the model "clustering + linear models" and of the linear model (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

The test error of the linear model is 0.2820.

For cluster 3 I get a better test error than the linear model, but for the other clusters the test errors are worse. A possible explanation is the number of clusters used: too many clusters may hurt the learning process, since each local model is then fitted on fewer samples.

## TO DO: compare the error of the model "clustering + linear models" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

The test error of kNN is 0.2158.

As above, the test error of cluster 3 is better than the kNN test error, but for the other clusters the test errors are worse than kNN.
A possible explanation is the number of clusters used. The data are probably not well separated into distinct groups, so they are difficult to divide into meaningful clusters.

# Clustering and "Local" NNs

Repeat the same as above, but using neural networks instead of linear models.

**Note**: we are not telling you which parameters to use for NNs. You have to decide how to select the parameters.

## TO DO: clearly explain how you decided to set the parameters, motivating the choice of your strategy.

I use MLPRegressor with the "lbfgs" solver, since on small datasets it can converge faster and perform better, and with the "identity" activation function. All other parameters are left at their default values.
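
One consequence of the "identity" activation is worth keeping in mind: a network whose hidden units apply no nonlinearity computes an affine function of its inputs, so it belongs to the same hypothesis class as linear regression. The sketch below checks this numerically with random weights; the layer sizes (18 inputs, one hidden layer of 100 units, which is the scikit-learn default) are chosen to mirror the setting here, and the weights themselves are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# A network with one hidden layer of 100 units and identity activation
# (the weights are random: only the structure matters here).
W1, b1 = rng.normal(size=(100, 18)), rng.normal(size=100)
W2, b2 = rng.normal(size=(1, 100)), rng.normal(size=1)

def network(x):
    hidden = W1 @ x + b1   # identity activation: hidden units are left untransformed
    return W2 @ hidden + b2

# The same function written as a single affine map w . x + b.
w = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=18)
print(np.allclose(network(x), w @ x + b))
```

A nonlinear activation such as "relu" would be needed for the local NNs to represent something that the local linear models cannot.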

## TO DO: repeat the analysis in part "Clustering and "Local" Linear Models" using NNs instead of linear models.

In [20]:
#training set
x0 = []
x1 = []
x2 = []
x3 = []
x4 = []

y0 = []
y1 = []
y2 = []
y3 = []
y4 = []

for i in range(len(kmeans.labels_)):
    if kmeans.labels_[i] == 0 :
        x0.append(Xtrain_and_val_scaled[i])
        y0.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 1 :
        x1.append(Xtrain_and_val_scaled[i])
        y1.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 2 :
        x2.append(Xtrain_and_val_scaled[i])
        y2.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 3 :
        x3.append(Xtrain_and_val_scaled[i])
        y3.append(Ytrain_and_val[i])
    elif kmeans.labels_[i] == 4 :
        x4.append(Xtrain_and_val_scaled[i])
        y4.append(Ytrain_and_val[i])
        
NN0 = MLPRegressor(activation="identity", solver="lbfgs")
NN1 = MLPRegressor(activation="identity", solver="lbfgs")
NN2 = MLPRegressor(activation="identity", solver="lbfgs")
NN3 = MLPRegressor(activation="identity", solver="lbfgs")
NN4 = MLPRegressor(activation="identity", solver="lbfgs")

NN0.fit(x0,y0)
NN1.fit(x1,y1)
NN2.fit(x2,y2)
NN3.fit(x3,y3)
NN4.fit(x4,y4)




if(len(x0)!=0):
    print("Training error NN of cluster 0:  "+str(1 - NN0.score(x0,y0)))
if(len(x1)!=0):
    print("Training error NN of cluster 1:  "+str(1 - NN1.score(x1,y1)))
if(len(x2)!=0):
    print("Training error NN of cluster 2:  "+str(1 - NN2.score(x2,y2)))
if(len(x3)!=0):
    print("Training error NN of cluster 3:  "+str(1 - NN3.score(x3,y3)))
if(len(x4)!=0):
    print("Training error NN of cluster 4:  "+str(1 - NN4.score(x4,y4)))
    
#error on test set

prediction_cluster = kmeans.predict(Xtest_scaled)

x0_test = []
x1_test = []
x2_test = []
x3_test = []
x4_test = []

y0_test = []
y1_test = []
y2_test = []
y3_test = []
y4_test = []


for i in range(len(prediction_cluster)):
    if  prediction_cluster[i] == 0 :
        x0_test.append(Xtest_scaled[i])
        y0_test.append(Ytest[i])
    elif prediction_cluster[i] == 1 :
        x1_test.append(Xtest_scaled[i])
        y1_test.append(Ytest[i])
    elif prediction_cluster[i] == 2 :
        x2_test.append(Xtest_scaled[i])
        y2_test.append(Ytest[i])
    elif prediction_cluster[i] == 3 :
        x3_test.append(Xtest_scaled[i])
        y3_test.append(Ytest[i])
    elif prediction_cluster[i] == 4 :
        x4_test.append(Xtest_scaled[i])
        y4_test.append(Ytest[i])


if(len(x0_test)!=0):
    x0_err = 1. - NN0.score(x0_test, y0_test)
    print("\n" + "Test error of Cluster 0 NN: " + str(x0_err))
if(len(x1_test)!=0):
    x1_err = 1. - NN1.score(x1_test, y1_test)
    print("Test error of Cluster 1 NN: " + str(x1_err))
if(len(x2_test)!=0):
    x2_err = 1. - NN2.score(x2_test, y2_test)
    print("Test error of Cluster 2 NN: " + str(x2_err))
if(len(x3_test)!=0):
    x3_err = 1. - NN3.score(x3_test, y3_test)
    print("Test error of Cluster 3 NN: " + str(x3_err))
if(len(x4_test)!=0):
    x4_err = 1. - NN4.score(x4_test, y4_test)
    print("Test error of Cluster 4 NN: " + str(x4_err))

Training error NN of cluster 0:  0.32021596262285934
Training error NN of cluster 1:  0.3597053608834436
Training error NN of cluster 2:  0.17097136696824156
Training error NN of cluster 3:  0.06494463833776809
Training error NN of cluster 4:  0.10344099711344279

Test error of Cluster 0 NN: 0.3497293774414476
Test error of Cluster 1 NN: 0.3362866789679886
Test error of Cluster 2 NN: 0.42263797592977115
Test error of Cluster 3 NN: 0.022602095256945787
Test error of Cluster 4 NN: 0.36275457850517323


## TO DO: compare the error of the model "clustering + NNs" and of NNs (see the beginning of the HW). Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

The test error of the NN model is 0.2109.

For cluster 3 I get a better test error than the NN, but for the other clusters the test errors are worse than the NN.
A possible explanation is the number of clusters used: too many clusters may hurt the learning process, since each local model is then fitted on fewer samples.

## TO DO: compare the error of the model "clustering + NNs" and of kNN. Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

The test error of the kNN model is 0.2158.

For cluster 3 I get a better test error than kNN, but for the other clusters the test errors are worse than kNN.

A possible explanation is the number of clusters used: too many clusters may hurt the learning process, since each local model is then fitted on fewer samples.

## TO DO: compare the error of the model "clustering + NNs" and of "clustering + Linear Models". Describe what you observe, and provide a possible explanation.
## [USE MAX 10 LINES]

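The clustering + linear models errors and the clustering + NNs errors are almost the same on every cluster.
This is expected: with the "identity" activation the NN computes a linear (affine) function of its inputs, so each local NN fits essentially the same model as the corresponding local linear regression. More generally, I think that clustering is not the best technique for this dataset: for most clusters the local models have a larger test error than the corresponding global model, possibly because of the number of clusters used.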