<hr style="color:#a4342d;">
<p align="center">
    <b style="font-size:2.5vw; color:#a4342d; font-weight:bold;">
    Introduction to machine learning - Homework 3
    </b>
</p>
<hr style="color:#a4342d;">

<b>Authors</b>: <i>C. Bosch, M. Cornet & V. Mangeleer</i>

[comment]: <> (Section)
<hr style="color:#a4342d;">
<p align="center">
    <b style="font-size:1.5vw; color:#a4342d;">
    Initialization
    </b>
</p>
<hr style="color:#a4342d;">

[comment]: <> (Description)
<p align="justify">
    In this section, one will be able to initialize all the librairies needed and load the untouched dataset.

In [None]:
# -- LIBRAIRIES --
import copy
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Allow notebook to plot in terminal
%matplotlib inline

In [None]:
# -- FUNCTION --

# Used to print a basic section title in terminal
def section(title = "UNKNOWN"):

    # Number of letters to determine section size
    title_size = len(title)

    # Section title boundaries
    boundary  = "-"
    for i in range(title_size + 1):
        boundary += "-"
    
    # Printing section
    print(boundary)
    print(f" {title} ")
    print(boundary)

In [None]:
# -- ORIGINAL DATASET --
# The original dataset is contained in the "data/original" folder

# Stores the original dataset
dataset_original_X = []
dataset_original_Y = []

# Load the original dataset
for i in range(1, 11):
    dataset_original_X.append(pd.read_csv(f"data/original/X_Zone_{i}.csv"))
    dataset_original_Y.append(pd.read_csv(f"data/original/Y_Zone_{i}.csv"))

In [None]:
# -- BASIC INFORMATION DATASET --

# Loading X and Y dataset for the first wind turbine
dataset_X1 = dataset_original_X[0]
dataset_Y1 = dataset_original_Y[0]

# Displaying their relative information
section("WIND TURBINE 1 - X Dataset")
section("HEAD")
print(dataset_X1.head())
section("INFO")
dataset_X1.info()

section("WIND TURBINE 1 - Y Dataset")
section("HEAD")
print(dataset_Y1.head())
section("INFO")
dataset_Y1.info()

[comment]: <> (Section)
<hr style="color:#a4342d;">
<p align="center">
    <b style="font-size:1.5vw; color:#a4342d;">
    Exploring
    </b>
</p>
<hr style="color:#a4342d;">

[comment]: <> (Description)
<p align="justify">
    In this section, one will be able to gain some basic insight regarding the dataset

In [None]:
# -- HISTOGRAM --

# Extracting only relevant variables
dataset_X1_relevant = dataset_X1[["U10", "U100", "V10", "V100"]]
dataset_Y1_relevant = dataset_Y1[dataset_Y1["TARGETVAR"] >= 0]   # /!\ Removing test samples (y = -1) /!\
dataset_Y1_relevant = dataset_Y1_relevant[["TARGETVAR"]]

# Observing distributions
dataset_X1_relevant.hist(bins = 60, figsize = (20, 15))
dataset_Y1_relevant.hist(bins = 60, figsize = (15, 4))

In [None]:
# -- OBSERVING WIND vs POWER --

# Removing test data
dataset_X_clean = dataset_X1[dataset_Y1["TARGETVAR"] >= 0]
dataset_Y_clean = dataset_Y1[dataset_Y1["TARGETVAR"] >= 0]

# Computing total wind speed
u_wind   = dataset_X_clean[["U100"]].to_numpy()
v_wind   = dataset_X_clean[["V100"]].to_numpy()
wind_tot = np.sqrt(u_wind**2 + v_wind**2)

# Retreiving power
power = dataset_Y_clean[["TARGETVAR"]].to_numpy()

# To see more clearly, one sample out of 2 is removed
for i in range(1):
    wind_tot = wind_tot[1::2]
    power    = power[1::2]

# Plotting
plt.figure(figsize=(15, 10))
plt.scatter(wind_tot, power, s = 3)
plt.grid()
plt.xlabel("Speed [m/s]")
plt.ylabel("Normalized Power [-]")
plt.show()

[comment]: <> (Section)
<hr style="color:#a4342d;"></hr>
<p align="center">
    <b style="font-size:1.5vw; color:#a4342d;">
    Dataset - Train & Test | DataLoader
    </b>
</p>
<hr style="color:#a4342d;"></hr>

[comment]: <> (Description)
<p align="justify">
    In this section, one will be able to explore further the dataset ! First, one needs to create a train and test set. Then, it is interesitng to look for correlations, new variables and possible improvements to the current dataset. All the new datasets will be save in the datafolder and ready to use by our different models ! The functions available are:
</p>

In [2]:
# -- FUNCTIONS : MEAN, VARIANCE, ZONAL AVERAGE SPEED AND TIME STEPS --
#
# Used to compute the mean and variance of a variable over some timeslices in the dataset
def computeMeanVariance(datasets, 
                        variables = ["U100", "V100"],
                        window    = 100,
                        variance  = True):

    # Security
    assert window > 1, "Window size must be greater than 1 to compute mean and var"

    # Looping over all the datasets
    for d in datasets:

        # Looping over the variables whose mean and var must be computed
        for v in variables:

            # Retreiving data 
            data = d.loc[: , [v]].to_numpy()

            # Stores mean and variance (1st and 2nd : mean = their value, var = 0 otherwise NAN problem while computation)
            mean = [data[0][0], data[1][0]]
            var  = [0, 0]

            for i in range(2, len(data)):

                # Start and end index for computation
                index_start = i - window if i - window >= 0 else 0
                index_end   = i - 1 if i - 1 >= 0 else 0

                # Computing mean and variance (much faster using numpy variables)
                mean.append(np.mean(data[index_start:index_end]))
                var.append(np.var(data[index_start:index_end]))
            
            # Adding the new data to dataset
            d[f"{v}_mean"] = mean
            if variance:
                d[f"{v}_var"] = var

# Used to compute the instantenous mean and variance of a variable accross multiple datasets
def computeZonalValue(datasets, 
                      variables = ["U100", "V100"],
                      variance  = True):

    # Security
    assert len(datasets) > 1, "To compute mean and var, at least 2 datasets are needed"

    # Looping over the variables whose mean and var must be computed
    for v in variables:

        # Number of samples
        nb_samples = len(datasets[0])

        # Stores all the different values in numpy matrix for efficient computation
        data = np.zeros((nb_samples, len(datasets)))

        # Retreiving all the corresponding data
        for i, d in enumerate(datasets):
            
            # Squeeze is there to remove useless dimension (Ask Victor)
            data[:, i] = np.squeeze(d.loc[: , [v]].to_numpy())

        # Computing mean and variance (much faster using numpy variables)
        mean = np.mean(data, axis = 1) # Axis = 1 to make mean over each row
        var  = np.var(data, axis = 1)

        # Adding new data to all the datasets
        for d in datasets:
            d[f"{v}_mean"] = mean
            if variance:
                d[f"{v}_var"] = var

# Used to add the value taken by a given variable over the past samples
def addPastTime(datasets,
                variables = ["U100", "V100"],
                window    = 3):
    #
    # Note from Victor
    # This function was a pain in the ass to make ! Even I, am not sure why it works well :D
    #
    # Security
    assert window > 0, "Window size must be greater than 0 to add past samples"

    # Looping over the datasets
    for d in datasets:

        # Looping over the different columns
        for i, v in enumerate(variables):

            # Retrieving current data
            data = d[[v]].to_numpy()

            # Stores all the past results
            former_data = np.zeros((len(data), window))

            # Looping over the corresponding data
            for j in range(len(data)):

                # Start and end index for retreiving values
                index_start = j - window if j - window >= 0 else 0
                index_end   = j if j - 1 >= 0 else 0
                
                # Retrieve corresponding value
                values = data[index_start:index_end]

                # Fixing case where looking at starting indexes < window size
                if len(values) != window:
                    values = np.append(np.zeros((window - len(values), 1)), values)

                # Placing the data (such that by reading left to right: t - 1, t - 2, t - 3, ...)
                for k, val in enumerate(values):
                        former_data[j][k] = val

            # Addding past results in the dataset
            for t in range(window):
                d[f"{v}_(t-{window - t})"] = former_data[:, t]

# Used to normalize the data of different variables
def normalize(datasets,
              norm_type = "argmax",
              data_type = "column",
              variables = ["U100", "V100"]):
    """
    Documentation :
        - norm_type (str) : argmax, mean
            - Normalize using the argmax or by using the mean and std of the data
        - data_type (str) : column, all
            - Apply the normalization using norm_type computed either on a unique column or all the same columns
    """
    # Security
    assert norm_type in ["argmax", "mean"], "Normalization types = argmax, mean"
    assert data_type in ["column", "all"] , "Data types = column, all"

    # Initialization of the normalization variables
    argmax, mean, std = list(), list(), list()

    # 1 - Computing argmax or mean and std of all datasets
    if data_type == "all":

        # Looping over the different variables to normalize
        for i, v in enumerate(variables):

            # Initialization of the normalization variables
            argmax_list, mean_list, std_list = list(), list(), list()

            # Looping over all the datasets
            for d in datasets:

                # Retrieving currently observed data
                current_data = d[[v]].to_numpy()

                # Retrieving variables
                argmax_list.append(np.max(np.abs(current_data)))
                mean_list.append(np.mean(current_data))
                std_list.append(np.std(current_data))

            # Adding results
            argmax.append(max(argmax_list))
            mean.append(sum(mean_list)/len(mean_list))
            std.append(sum(std_list)/len(std_list))
    
    # 2 - Normalization of the datasets
    for d in datasets:

        # Looping over the different columns
        for i, v in enumerate(variables):
            
            # Case 1 - Mean and std - Column
            if norm_type == "mean" and data_type == "column":
                data         = d[[v]].to_numpy()
                d[v] = (data - np.mean(data))/np.std(data)

            # Case 2 - Mean and std - All
            elif norm_type == "mean" and data_type == "all":
                data         = d[[v]].to_numpy()
                d[v] = (data - mean[i])/std[i]
            
            # Case 3 - Argmax - Column
            elif norm_type == "argmax" and data_type == "column":
                data = d[[v]].to_numpy()
                d[v] = data/np.max(data)

            # Case 4 - Argmax - All
            else:
                data = d[[v]].to_numpy()
                d[v] = data/argmax[i]

# Used to remove specific columns from the dataset
def remove(datasets, variables):

    # Looping over all datasets and variables
    for d in datasets:
        for v in variables:

            # Removing
            d.drop(v, inplace = True, axis = 1)

In [None]:
# -- DATA LOADER -- 
# This class has for purpose to handle the data and make our life easier ! 
#
class dataLoader():
    
    # Initialization of the loader
    def __init__(self, datasets_X, datasets_Y):

        # Stores the original, transformed and final datasets
        self.original_datasets_X    = datasets_X
        self.original_datasets_Y    = datasets_Y
        self.transformed_datasets_X = datasets_X
        self.transformed_datasets_Y = datasets_Y
        self.final_dataset_X        = None
        self.final_dataset_Y        = None

        # Used to know if datasets have been combined or not
        self.isCombined = None

    # Used to display the head of the transformed dataset (first set)
    def showHeadTransformed(self):
        section("Dataset - X - Transformed")
        print(self.transformed_datasets_X[0].head())
        section("Dataset - Y - Transformed")
        print(self.transformed_datasets_Y[0].head())

    # Used to split the final dataset into a train and test set (In test set, values for y are equal to -1)
    def splitTrainTest(self, save = False, save_dir = "new_data"):

        # Security
        assert self.isCombined != None, "You must first use self.finalize"

        # Case 1 - Datasets have been combined all together
        if self.isCombined == True:
            X_train = self.final_dataset_X[self.final_dataset_Y['TARGETVAR'] != -1]
            Y_train = self.final_dataset_Y[self.final_dataset_Y['TARGETVAR'] != -1]
            X_test  = self.final_dataset_X[self.final_dataset_Y['TARGETVAR'] == -1]
            Y_test  = self.final_dataset_Y[self.final_dataset_Y['TARGETVAR'] == -1] # Not useful, I know !

        # Case 2 - Datasets are still separated
        if self.isCombined == False:
            
            X_train, Y_train, X_test, Y_test = list(), list(), list(), list()

            # Looping over all the small datasets
            for x, y in zip(self.final_dataset_X, self.final_dataset_Y):
                X_train.append(x[y['TARGETVAR'] != -1])
                Y_train.append(y[y['TARGETVAR'] != -1])
                X_test.append(x[y['TARGETVAR'] == -1])
                Y_test.append(y[y['TARGETVAR'] == -1])

        # Be careful with the order
        return X_train, X_test, Y_train, Y_test
        
    # Used to perfom final operation on dataset (Combining everything or storing them separately)
    def finalization(self, dataset_type = "combined"):

        # Security
        assert dataset_type in ["combined", "separated"], "The final dataset can either be of type combined or separated"

        # Case 1 - Combining into one big dataset
        if dataset_type == "combined":
            self.final_dataset_X = pd.concat(self.transformed_datasets_X)
            self.final_dataset_Y = pd.concat(self.transformed_datasets_Y)
            self.isCombined = True

        # Case 2 - Separated datasets
        else:
            self.final_dataset_X = self.transformed_datasets_X
            self.final_dataset_Y = self.transformed_datasets_Y
            self.isCombined = False

    #--------------------------------------------------------------------------------
    #                                    PIPELINES
    #--------------------------------------------------------------------------------
    #
    # List of functions available:
    #
    # - computeMeanVariance(datasets, variables = ≈, window = 100, variance  = True)
    #
    # - computeZonalValue(datasets, variables = ["U100", "V100"], variance  = True)
    # 
    # - addPastTime(datasets, variables = ["U100", "V100"], window = 3):
    #
    # - normalize(datasets, norm_type = "argmax", data_type = "column", variables = ["U100", "V100"])
    #
    def pipeline(self, useMeanVariance = True, var_MV   = ["U10", "V10", "U100", "V100"], variance_MV  = True, window_MV = 24 * 7,
                       useZonal        = True, var_ZON  = ["U10", "V10", "U100", "V100"], variance_ZON = True,
                       usePastTime     = True, var_PT   = ["U10", "V10", "U100", "V100"], window_ZON   = 3,
                       useNormalize    = True, var_NORM = ["U10", "V10", "U100", "V100"], norm_type = "argmax", data_type = "column"):

        # Copying original dataset
        dX = copy.deepcopy(self.original_datasets_X)
        dY = copy.deepcopy(self.original_datasets_Y)

        # Applying the different transformations
        if useNormalize:
            normalize(dX, variables = var_NORM, norm_type = norm_type, data_type = data_type)
        if useMeanVariance:
            computeMeanVariance(dX, variables = var_MV, window = window_MV, variance = variance_MV)
        if useZonal:
            computeZonalValue(dX, variables = var_ZON, variance = variance_ZON)
        if usePastTime:
            addPastTime(dX, variables = var_PT, window = window_ZON)

        # Updating dataset
        self.transformed_datasets_X = dX
        self.transformed_datasets_Y = dY

        # Making sure one has to finalize again
        self.isCombined = None

In [None]:
# -- GAINING INSIGHTS (2) - ARGMAX ALL -- 
#
# Treshold value for removing correlation
tresh_corr = 0.3

# Initialization of the loader
loader_2 = dataLoader(dataset_original_X, dataset_original_Y)

# Aplying the first prototype of pipeline (Mean value on 30 days)
loader_2.pipeline(norm_type = "argmax", data_type = "all")

# Combine all the small datasets into a big one
loader_2.finalization(dataset_type = "combined")

# Retreives the train and test set (in Pandas frame)
data_X_2, _, _, _ = loader_2.splitTrainTest()

# -- Correlation matrix -- 
corr_2             = data_X_2.corr()
corr_2[np.abs(corr_2) < tresh_corr] = 0
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_2, cmap = "YlGnBu", annot = True)

In [None]:
# -- GAINING INSIGHTS (3) - MEAN COLUMN -- 
#
# Treshold value for removing correlation
tresh_corr = 0.3

# Initialization of the loader
loader_3 = dataLoader(dataset_original_X, dataset_original_Y)

# Aplying the first prototype of pipeline (Mean value on 30 days)
loader_3.pipeline(norm_type = "mean", data_type = "column")

# Combine all the small datasets into a big one
loader_3.finalization(dataset_type = "combined")

# Retreives the train and test set (in Pandas frame)
data_X_3, _, _, _ = loader_3.splitTrainTest()

# -- Correlation matrix -- 
corr_3             = data_X_3.corr()
corr_3[np.abs(corr_3) < tresh_corr] = 0
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_3, cmap = "BuPu", annot = True)

In [None]:
# -- GAINING INSIGHTS (4) - MEAN ALL -- 
#
# Treshold value for removing correlation
tresh_corr = 0.3

# Initialization of the loader
loader_4 = dataLoader(dataset_original_X, dataset_original_Y)

# Aplying the first prototype of pipeline (Mean value on 30 days)
loader_4.pipeline(norm_type = "mean", data_type = "all")

# Combine all the small datasets into a big one
loader_4.finalization(dataset_type = "combined")

# Retreives the train and test set (in Pandas frame)
data_X_4, _, _, _ = loader_4.splitTrainTest()

# -- Correlation matrix -- 
corr_4             = data_X_4.corr()
corr_4[np.abs(corr_4) < tresh_corr] = 0
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_4, cmap = "Greens", annot = True)

In [None]:
# -- CORRELATION MATRIX COMPARISON --
corr_21 = corr_2 - corr_1
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_21, cmap = "Greens", annot = True)

plt.figure()
corr_23 = corr_2 - corr_3
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_23, cmap = "BuPu", annot = True)

plt.figure()
corr_24 = corr_2 - corr_4
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_24, cmap = "YlGnBu", annot = True)

plt.figure()
corr_31 = corr_3 - corr_1
corr_31[np.abs(corr_31) < 0.2] = 0
sns.set(rc={'figure.figsize':(25, 20)})
sns.heatmap(corr_31, cmap = "BuPu", annot = True)

In [None]:
# Template
loader.pipeline(useMeanVariance = True, var_MV   = ["U10", "V10", "U100", "V100"], variance_MV  = True, window_MV = 24 * 7,
                useZonal        = True, var_ZON  = ["U10", "V10", "U100", "V100"], variance_ZON = True,
                usePastTime     = True, var_PT   = ["U10", "V10", "U100", "V100"], window_ZON = 3,
                useNormalize    = True, var_NORM = ["U10", "V10", "U100", "V100"], norm_type = "argmax", data_type = "column")

In [None]:
# Initialization of the loader
loader = dataLoader(dataset_original_X, dataset_original_Y)

In [None]:
# TYPE 1 - ORIGINAL
loader.pipeline(useMeanVariance = False,
                useZonal        = False,
                usePastTime     = False,
                useNormalize    = False)
loader.finalization()

data_X0, submit_X0, data_Y0, submit_Y0 = loader.splitTrainTest()
print(data_X0.head())

In [None]:
# TYPE 2 - INFLUENCE OF NORMALIZATION
loader.pipeline(useMeanVariance = False,
                useZonal        = False,
                usePastTime     = False,
                useNormalize    = True, norm_type = "argmax", data_type = "column")
                
loader.finalization()
data_X1, submit_X1, data_Y1, submit_Y1 = loader.splitTrainTest()
print(data_X1.head())

loader.pipeline(useMeanVariance = False,
                useZonal        = False,
                usePastTime     = False,
                useNormalize    = True, norm_type = "argmax", data_type = "all")
                
loader.finalization()
data_X2, submit_X2, data_Y2, submit_Y2 = loader.splitTrainTest()
print(data_X2.head())

loader.pipeline(useMeanVariance = False,
                useZonal        = False,
                usePastTime     = False,
                useNormalize    = True, norm_type = "mean", data_type = "column")
                
loader.finalization()
data_X3, submit_X3, data_Y3, submit_Y3 = loader.splitTrainTest()
print(data_X3.head())

loader.pipeline(useMeanVariance = False,
                useZonal        = False,
                usePastTime     = False,
                useNormalize    = True, norm_type = "mean", data_type = "all")
                
loader.finalization()
data_X4, submit_X4, data_Y4, submit_Y4 = loader.splitTrainTest()
print(data_X4.head())

In [None]:
# -- Datasets --
data_X_tot = [data_X0, data_X1, data_X2, data_X3, data_X4]
data_Y_tot = [data_Y0, data_Y1, data_Y2, data_Y3, data_Y4]

[comment]: <> (Section)
<hr style="color:#a4342d;"></hr>
<p align="center">
    <b style="font-size:1.5vw; color:#a4342d;">
    Model - Training & Testing
    </b>
</p>
<hr style="color:#a4342d;"></hr>

[comment]: <> (Description)
<p align="justify">
    In this section, one will be able to explore further the dataset ! First, one needs to create a train and test set. Then, it is interesitng to look for correlations, new variables and possible improvements to the current dataset. All the new datasets will be save in the datafolder and ready to use by our different models ! The functions available are:
</p>

In [None]:
# -- FUNCTIONS --
#
# Used to compute a model's accuracy against different datasets
def modelTesting(datasets_X, datasets_y, model, test_size = 0.3, random_state = 69):
    
    # Contains mean accuracy of the model against each dataset
    accuracy_train = []
    accuracy_test = []

    # Looping over whole the different datasets
    for X, y in zip(datasets_X, datasets_y):
        
        # Final conversion (Numpy and retrieving targets)
        X = X.to_numpy()
        y = y[["TARGETVAR"]].to_numpy().ravel()

        # Retrieving datasets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = random_state)

        # Fitting the model on current split
        model.fit(X_train, y_train)

        # Accuracy
        accuracy_train.append(model.score(X_train, y_train))
        accuracy_test.append(model.score(X_test, y_test))

    return accuracy_train, accuracy_test

def modelPlotResults(parameters, acc_train, acc_test, xlabel = "UNKNOWN", param_name = "UNKNOWN", fontsize = 15, save_path = "graphs/"):

    # 1 - Evolution of the test accuracy
    plt.figure()
    sns.set(font_scale = 2)

    # Plotting evolution curve for a specific dataset with varying parameter value
    for i, a in enumerate(acc_test):
        plt.plot(parameters, [a_i * 100 for a_i in a], label = f"Dataset n°{i}", linewidth = 5)
    
    plt.legend(loc="upper right", fontsize = fontsize)
    plt.ylabel("Accuracy [%]", fontsize = fontsize)
    plt.xlabel(xlabel, fontsize = fontsize)
    plt.savefig(f"{save_path}_1.png")
    plt.show()

    # 2 - Bar plot
    acc_test_best  = []
    acc_train_best = []
    k_best         = []

    # Looping over accuracies to find best results
    for a1, a2 in zip(acc_train, acc_test):

        # Finding best test accuracy
        max_value = max(a2)
        max_index = a2.index(max_value)

        # Adding results
        acc_test_best.append(max_value)
        acc_train_best.append(a1[max_index])
        k_best.append(parameters[max_index])

    # Contains x-axis labels
    x_ax_labels = [f"Dataset n°{i} - {param_name} = {k_best[i]}" for i in range(len(acc_train))]

    # Used to make x-axis    
    index = [i for i in range(len(acc_train))]

    # Plotting the results
    plt.figure()
    plt.bar([i - 0.2 for i in index], [a_i * 100 for a_i in acc_train_best], 0.4, label = "Train")
    plt.bar([i + 0.2 for i in index], [a_i * 100 for a_i in acc_test_best], 0.4, label = "Test")
    plt.xticks(index, x_ax_labels)
    plt.ylabel("Accuracy [%]", fontsize = fontsize)
    plt.legend()
    plt.savefig(f"{save_path}_2.png")
    plt.show()



<hr style="color:#a4342d; width: 145px;" align="left">
<p style="color:#a4342d;">KNeighborsRegressor</p>
<hr style="color:#a4342d; width: 145px;" align="left">

In [144]:
# -- Generating results KNN -- 
from sklearn.neighbors import KNeighborsRegressor

# Definition of the parameters to be tested
k_param = np.linspace(1, 100, 100, dtype = int)

# Stores the accuracy of the training and testing
knn_accuracy_train = [[] for i in range(len(data_X_tot))]
knn_accuracy_test  = [[] for i in range(len(data_X_tot))]

for k in k_param:

    # Initialization of the model
    model = KNeighborsRegressor(n_neighbors = k)

    # Computing accuracies on all the datasets
    acc_train, acc_test = modelTesting(data_X_tot, data_Y_tot, model, test_size = 0.3, random_state = 69)

    # Adding the results
    for i, acc_1, acc_2 in zip(range(len(acc_train)), acc_train, acc_test):
        knn_accuracy_train[i].append(acc_1)
        knn_accuracy_test[i].append(acc_2)

# Plotting the results
modelPlotResults(k_param, knn_accuracy_train, knn_accuracy_test, 
                 xlabel = "Number of neighbors - $k$ [-]", fontsize = 30, 
                 param_name = "knn", save_path = "graphs/knn/knn")

TypeError: n_neighbors does not take <class 'numpy.float64'> value, enter integer value

In [None]:
modelPlotResults(k_param, knn_accuracy_train, knn_accuracy_test, 
                 xlabel = "Number of neighbors - $k$ [-]", fontsize = 30, 
                 param_name = "k", save_path = "graphs/knn/knn")

<hr style="color:#a4342d; width: 100px;" align="left">
<p style="color:#a4342d;">Random Forest</p>
<hr style="color:#a4342d; width: 100px;" align="left">

In [None]:
# -- Generating results Random Forest -- 
from sklearn.ensemble import RandomForestRegressor

# Definition of the parameters to be tested
max_depth = np.linspace(1, 40, 40)

# Stores the accuracy of the training and testing
rf_accuracy_train = [[] for i in range(len(data_X_tot))]
rf_accuracy_test  = [[] for i in range(len(data_X_tot))]

for d in max_depth:

    # Initialization of the model
    model = RandomForestRegressor(max_depth = d, n_estimators = 10)

    # Computing accuracies on all the datasets
    acc_train, acc_test = modelTesting(data_X_tot, data_Y_tot, model, test_size = 0.3, random_state = 69)

    # Adding the results
    for i, acc_1, acc_2 in zip(range(len(acc_train)), acc_train, acc_test):
        rf_accuracy_train[i].append(acc_1)
        rf_accuracy_test[i].append(acc_2)

In [None]:
# Plotting the results
modelPlotResults(max_depth, rf_accuracy_train, rf_accuracy_test, 
                 xlabel = "Depth of the tree - $d$ [-]", fontsize = 30, 
                 param_name = "d", save_path = "graphs/rf/rf")