In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

from plotting import plotly_plot, format_colorbar_plot, format_colorbar_pca_plot

## General remark

**This notebook contains a lot of code, especially for plotting. You do not need to understand all of it; focus on the parts that are highlighted.**

## Introduction

In this notebook, you should apply the techniques you learned for regression in a more realistic problem setting. We have a collection of bridges modeled as 2D beams that all feature one defect. Our goal is to train a model to learn the location of this defect as a function of displacement measurements. Since sensors are expensive, we can only place them in five locations on the bridges. The assignment includes the following tasks:

- Select five locations for the sensors based on a visual inspection of the displacement field 
- Pre-process the available data to use it in a neural network
- Train a neural network to learn a mapping from the displacement measurements to the defect location, comment on the choice of hyperparameters (number of hidden layers, nodes per layer, ...)
- Visualise your results and evaluate the accuracy of your network
- Implement an alternative based on PCA where we have sensor data at every location

Let's take a look at the dataset first. It is a CSV file, and a convenient way to read and manipulate this file type is the `DataFrame` of the `pandas` library. Printing a few rows before performing any analysis is good practice, so we load the dataset into a `DataFrame` and print a few rows from the top and bottom. The dataset consists of the displacement fields of a collection of bridges. There are 1000 bridges in total, as can be seen from the tail of the data frame (`df.tail()`), and the displacements are recorded at 712 locations per bridge, as can be seen from the tail of a single sample (`bar_0.tail()`). Note that the `location` column, which holds the defect location, has the same value in every row of a given sample owing to the dataset's structure.

![overview beam structure](img/beam_structure.png)
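
The long-format layout described above can be checked with a few `pandas` calls. The following is a minimal sketch on a small synthetic frame (3 samples, 4 nodes) that only mimics the column names `sample`, `node` and `location` used in this notebook; the values are made up for illustration and are not taken from the real CSV file.

```python
import numpy as np
import pandas as pd

# toy stand-in for the real CSV: 3 samples x 4 nodes in long format (values are made up)
n_samples, n_nodes = 3, 4
toy = pd.DataFrame({
    "sample": np.repeat(np.arange(n_samples), n_nodes),
    "node": np.tile(np.arange(n_nodes), n_samples),
    "location": np.repeat([1.0, 2.5, 4.0], n_nodes),  # one defect location per sample
})

# number of bridges and number of measurement nodes per bridge
n_samples_found = toy["sample"].nunique()
nodes_per_sample = toy.groupby("sample")["node"].nunique()

# the defect location should take exactly one value within each sample
locations_per_sample = toy.groupby("sample")["location"].nunique()

print(n_samples_found, nodes_per_sample.unique(), locations_per_sample.max())
```

Running the same three checks on `df` is a quick way to confirm the numbers quoted in the text before starting the analysis.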

In [None]:
df = pd.read_csv('regression-data_realistic.csv')
bar_0 = df[df['sample'] == 0]
df.head()

In [None]:
df.tail()

In [None]:
bar_0.tail()

## Data visualization and feature extraction

Your first task is to select measurement locations. The following cell plots the displacement in the x-direction, the displacement in the y-direction, or the displacement magnitude over the beam's domain. You can switch between these three quantities with the buttons at the top of the plot. In addition, you can see all the available measurement locations. Hovering over a point displays the corresponding node ID from the data frame.

In [None]:
plotly_plot(df)

Select five measurement locations that you expect to be informative and enter them in the predefined list `measure_locs`. Remember that we only have a budget of five locations; do not exceed it if you want to secure a spot on the leaderboard (more on that later). The remaining code in this cell collects the displacements of all beams at the selected nodes, together with the defect locations. These quantities are stored in the arrays `measurements` and `defect_locs`.
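
As an alternative view on the collection step, the long-format table can also be reshaped with `DataFrame.pivot`. This is only a sketch on made-up data: `toy`, `toy_locs` and `toy_measurements` are hypothetical names, and the column ordering (x and y interleaved per node) is chosen to match the layout used in this notebook.

```python
import numpy as np
import pandas as pd

# toy long-format data: 2 samples x 3 nodes (values are made up)
toy = pd.DataFrame({
    "sample": [0, 0, 0, 1, 1, 1],
    "node":   [0, 1, 2, 0, 1, 2],
    "dx":     [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "dy":     [-0.1, -0.2, -0.3, -1.1, -1.2, -1.3],
})
toy_locs = [2, 0]  # hypothetical sensor nodes

# one row per sample, one column per node
dx_wide = toy.pivot(index="sample", columns="node", values="dx")
dy_wide = toy.pivot(index="sample", columns="node", values="dy")

# interleave as [dx_node_a, dy_node_a, dx_node_b, dy_node_b, ...]
toy_measurements = np.stack(
    [dx_wide[toy_locs].to_numpy(), dy_wide[toy_locs].to_numpy()], axis=2
).reshape(len(dx_wide), -1)

print(toy_measurements)
```

The pivot makes the "one row per bridge" structure explicit, which can be easier to inspect than the loop-and-append version.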

In [None]:
# define measurement locations, get corresponding coord
measure_locs = [481, 461, 312, 41, 66]
measure_coords = np.array([bar_0[bar_0['node'] == loc][['x','y']].to_numpy() for loc in measure_locs]).squeeze(1)

# double check the measurement locations
print(measure_coords)
                             
# read measurement from all samples in dataframe
measurements = np.empty((df['sample'].max()+1,0))

# loop through measurement locations and collect measurements from all samples
for loc in measure_locs:
    dx = df[df['node'] == loc]['dx'].to_numpy()
    dy = df[df['node'] == loc]['dy'].to_numpy()
    measurements = np.append(measurements, np.vstack((dx, dy)).transpose(),axis=1)

# get defect locations
defect_locs = df[df['node']==0]['location'].to_numpy()

Let's plot the defect location as a function of the displacement measurements to get a feel for the dataset:

In [None]:
# plot a few hyperplanes of the dataset
# ======================================== ignore code ==============================================================
fig, ax = plt.subplots(5,2, figsize=(10,20))
[ax.flat[i].scatter(measurements[:,i], defect_locs, s=5) for i in range(len(ax.flat))]
[ax.flat[i].ticklabel_format(style='sci', axis='x', scilimits=(0,0)) for i in range(len(ax.flat))]
[ax[0,i].set_title(title) for i,title in enumerate([r'$u_x$',r'$u_y$'])]
[ax[i,0].text(-0.4, 0.46, r'node {}'.format(i+1), transform=ax[i,0].transAxes, fontsize=12) for i in range(ax.shape[0])]
[ax[i,0].set_ylabel(r'$x_{defect}$') for i in range(ax.shape[0])]
ticks = [np.linspace(np.min(measurements[:,i]), np.max(measurements[:,i]), 4) for i in range(measurements.shape[1])]
[axs.set_xticks(tick) for axs, tick in zip(ax.flat, ticks)]
plt.show()
# ===================================================================================================================

We can see that most measurements do not have a unique mapping to the defect location, suggesting we need multiple features to distinguish between the deformation states.

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
Change the measurement locations (2 code cells above this text) and study the influence on the plots. Finally, pick 5 sensors you believe will lead to the best results.
</p>
</div>

Let's take a look at a 2D scatterplot of our data. Note that this is a projection of the data on this particular 2D subspace of the input space. The color bar indicates the defect location of a data point.

In [None]:
# select components to plot
#   remember that we flattened the measurement array, meaning that
#   even indices correspond to x-measurements, odd indices to y-measurements.
#   the tuple (3,7) therefore corresponds to u_y in node 2, and u_y in node 4
#   you can also look at the plot to see which component you are inspecting

idcs = (3,7)
skip = 5

# make figure and plot data
# ======================================== ignore code ==============================================================
fig, ax = plt.subplots(figsize=(6,5))
plot1 = ax.scatter(measurements[::skip,idcs[0]], measurements[::skip,idcs[1]], c=defect_locs[::skip], s=39)
format_colorbar_plot(fig, ax, plot1, idcs)
plt.show()
# ===================================================================================================================

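This should look more promising; once several measurements are considered together, the defect location appears to be a single-valued function of the inputs, so different defect locations no longer collapse onto the same point.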

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
Select different indices to be plotted. Compare choosing only x components with choosing y components.
</p>
</div>

## Data pre-processing

To train a model using this data, we need to split it into a training, a validation and a test set. The training set is used to train the model, the validation set is used to tune the hyperparameters of the NN, and the test set is used to assess the predictive capabilities of the resulting model. The test set should not be used for model selection or training: we need truly unseen data to properly evaluate the performance of our final model.

<!-- <div style="background-color:rgba(0, 0, 0, 0.0470588); ; vertical-align: middle; padding:10px 20px;"> -->
<div style="background-color:#AABAB2; vertical-align: middle; padding:10px 20px;">
<p>
    <b>Task:</b>

Complete the code below to implement a function that creates a random train, validation & test set.
- Apply a random permutation to the data. Make sure X & y are permutated in the same manner!
- Compute the size of the train, validation & test split
- Select the permutated data splits
- Select sensible fraction sizes for the validation and test set
    </p>
</div>
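
As a cross-check for a hand-written split, the same three-way partition can be sketched with two calls to scikit-learn's `train_test_split`. This is not the function the task asks for, only a reference to compare against; `X_toy` and `y_toy` are made-up arrays and the fractions are example values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 50 samples with 2 features and unique target values
X_toy = np.arange(100, dtype=float).reshape(50, 2)
y_toy = np.arange(50, dtype=float)

val_size, test_size = 0.2, 0.2  # fractions of the full dataset

# first split off the test set, then split the remainder into train and validation;
# the second fraction is rescaled because it now refers to the remaining data only
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=test_size, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=val_size / (1 - test_size), random_state=0)

print(len(y_tr), len(y_va), len(y_te))
```

Because the targets in this toy example are unique, checking that the three target sets do not overlap is a simple way to confirm that no sample ends up in two splits.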

In [None]:
# function to split dataset into training, validation, and test set
def train_test_val_split(X, y, val_size, test_size, seed=0):
    """
    X = [N x features]
    y = [N x outputs]
    val_size = fraction of the full dataset becoming the validation set (e.g. 0.5 = 50%)
    test_size = fraction of the full dataset becoming the test set (e.g. 0.5 = 50%)
    """
    
    # set seed
    np.random.seed(seed)
        
    # get total number of data points
    n_total = X.shape[0]
    
    # Permutate X and y
    perm = np.random.permutation(n_total)
    
    # Compute the number of data points for training, validation, and test set.
    n_test, n_val = int(np.floor(n_total*test_size)), int(np.floor(n_total*val_size))
    n_train = n_total - n_test - n_val
    print(f"Data points in training set: {n_train}, validation set: {n_val}, test set: {n_test}")
    
    # Obtain the indices corresponding to the train, val & test set.
    # With the indices, split X & y into the three sets each.
    idcs_train = perm[0:n_train]
    idcs_val   = perm[n_train:n_train+n_val]
    idcs_test  = perm[n_train+n_val:]
    
    # split X
    X_train    = X[idcs_train]
    X_val      = X[idcs_val]
    X_test     = X[idcs_test]
    
    # split y
    y_train    = y[idcs_train]
    y_val      = y[idcs_val]
    y_test     = y[idcs_test]
    
    # return all
    return X_train, X_val, X_test, y_train, y_val, y_test, idcs_train, idcs_val, idcs_test

In [None]:
# set up scalers and scale data
xscaler, yscaler = StandardScaler(), StandardScaler()
xit = xscaler.inverse_transform
yit = yscaler.inverse_transform
X, y = xscaler.fit_transform(measurements), yscaler.fit_transform(defect_locs[:,None]).reshape(-1)

# Run the function you created
X_train, X_val, X_test, y_train, y_val, y_test, _, _, idcs_test = train_test_val_split(X,y, test_size=0.2, val_size=0.2)

## Linear Model

First, let's try a linear model with linear features. We use the `MLPRegressor` to stay consistent with the workflow for the rest of the notebook. A linear model with linear features can be obtained by setting the activation function to be the identity. The inputs are multiplied with the weights twice &mdash; once before and once after the hidden layer. However, the linear combination of linear models will still result in a linear model, which makes this little trick work. The training is trivial, and we therefore employ the built-in `MLPRegressor.fit()` function to this end.
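
The claim that a network with identity activation collapses to a single linear model can be verified numerically. The sketch below uses random weights with hypothetical sizes (10 inputs, 10 hidden units, 1 output); it does not touch the `MLPRegressor` itself, it only reproduces the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes: 10 inputs, 10 hidden units, 1 output
W1, b1 = rng.normal(size=(10, 10)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 1)), rng.normal(size=1)
x = rng.normal(size=(5, 10))  # five made-up input samples

# network with identity activation: output = (x W1 + b1) W2 + b2
y_network = (x @ W1 + b1) @ W2 + b2

# equivalent single linear model: output = x W + b
W = W1 @ W2
b = b1 @ W2 + b2
y_linear = x @ W + b

print(np.allclose(y_network, y_linear))
```

The hidden layer therefore adds parameters but no expressive power as long as the activation is the identity.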

In [None]:
# Set up linear regression model (one hidden layer with 10 nodes; note the trailing comma that makes this a tuple)
LinearModel = MLPRegressor(solver='sgd', hidden_layer_sizes=(10,), activation='identity', learning_rate='constant')

# train NN
LinearModel.fit(X_train, y_train)
y_pred = LinearModel.predict(X_test)

In [None]:
# select sensor indices
idcs = (3,7) # <-- Change this to change sensor data considered

# ======================================== ignore code ==============================================================
# create figure
fig, ax = plt.subplots(1,3,figsize=(12.5,3.8), constrained_layout=True, sharey=True)

# collect data
X_plot = xit(X_test)[:,idcs]

# plot
plot0 = ax[0].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_test[:,None]).reshape(-1))
plot1 = ax[1].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_pred[:,None]).reshape(-1))
plot2 = ax[2].scatter(X_plot[:,0], X_plot[:,1], c=np.abs(y_pred-y_test))

# adjust plots
format_colorbar_plot(fig, ax, [plot0, plot1, plot2], idcs)
plt.show()
# ===================================================================================================================

In [None]:
# select the sample in the test set we want to look at
index = 74 # <--- Change this index to change the sample!
total_idx = idcs_test[index]

# get prediction and true value of the defect location
defect_loc_true = yit(y_test[[index]][None,:])[0,0]
defect_loc_pred = yit(LinearModel.predict(X_test[[index],:])[:,None])[0,0]

# create the plot
plotly_plot(df, total_idx, measure_locs, defect_loc_true, defect_loc_pred)

This is not awful, but we can do better than that. Let's try a nonlinear model now! As you might already have guessed from our implementation of the linear model, we propose a neural network. We also discard the `MLPRegressor.fit()` function and implement the training loop ourselves, to be able to take a look under the hood.

## Network training
With our dataset ready, we can create a function to train a neural network (NN). There are many choices to be made when creating such a function, and most of them are related to the bias-variance tradeoff discussed in notebook 2. The outline of the function we want to create is as follows:

`def NN_train()`:<br>
&emsp;&emsp; For a maximum number of epochs:<br>
&emsp;&emsp;&emsp;&emsp; Permutate the data<br> 
&emsp;&emsp;&emsp;&emsp; For every minibatch:<br> 
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Collect the X & y training data<br> 
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Take a NN step (partial fit)<br> 
&emsp;&emsp;&emsp;&emsp; Compute the root mean squared error (RMSE) on the validation set<br> 
&emsp;&emsp;&emsp;&emsp; If the RMSE is the lowest:<br> 
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Save the RMSE & NN model<br> 
&emsp;&emsp;&emsp;&emsp; If RMSE has not decreased in the last X epochs<br> 
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Stop the training loop<br> 
&emsp;&emsp;&emsp;&emsp; Adapt the learning rate if necessary<br> 
&emsp;&emsp; Return NN, rmse_min & full_rmse_array<br>

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
    Go through the training loop algorithm and compare with the code. The code is **not** complete. Identify what bits are missing and implement them.
</p>
</div>
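
The early-stopping part of the outline can be studied in isolation. The sketch below applies the same counting rule to a made-up list of validation errors; `early_stopping_epoch` and its `patience` argument are hypothetical names introduced for illustration and are not part of the notebook's `NN_train`.

```python
def early_stopping_epoch(val_errors, patience=20, tol=1e-6):
    """Return (stop_epoch, best_epoch, best_error) for a sequence of validation errors."""
    best_error = float("inf")
    best_epoch = None
    no_improvement_count = 0

    for epoch, error in enumerate(val_errors):
        if error - best_error > tol:
            # clearly worse than the best value seen so far
            no_improvement_count += 1
        elif error < best_error:
            # new best value: remember it and reset the counter
            best_error, best_epoch = error, epoch
            no_improvement_count = 0

        if no_improvement_count == patience:
            return epoch, best_epoch, best_error

    # ran out of epochs without triggering the stopping criterion
    return len(val_errors) - 1, best_epoch, best_error


# made-up validation errors: improve for three epochs, then get worse
errors = [1.0, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7]
result = early_stopping_epoch(errors, patience=3)
print(result)
```

Note that the epoch at which training stops is later than the epoch with the lowest error, which is why the model belonging to the best epoch has to be stored separately.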

In [None]:
# function to train the NN
from copy import deepcopy

def NN_train(NN, X_train, y_train, X_val, y_val, max_epoch=100000, tol=1e-6, verbose=False, lr_init=1e-2, lr_pow=0.9, lr_step=500, seed=0, batchsize=50):
    
    # set seed 
    np.random.seed(seed)
    
    
    # set up list for the validation rmse and improvement count for early stopping
    rmse = [] 
    rmse_min = 1e10
    no_improvement_count = 0
    
    # init learning rate
    lr = lr_init
    NN.learning_rate_init = lr
    
    NN_best = NN
    
    # loop over epochs
    for epoch in range(max_epoch):
            
        # set up permutation of the data
        n = X_train.shape[0]
        perm = np.random.permutation(n)
        batches_per_epoch = int(np.floor(n/batchsize))
        
        # loop over batches
        for it in range(batches_per_epoch):
            
            # collect current batch
            X_batch = X_train[perm[it*batchsize:(it+1)*batchsize]]
            y_batch = y_train[perm[it*batchsize:(it+1)*batchsize]]
                
            # take step
            NN.partial_fit(X_batch, y_batch)
            
        # compute rmse on validation set after each epoch
        # (mean, not sum, so that the value is a root MEAN squared error)
        y_val_hat = NN.predict(X_val)
        rmse.append(np.sqrt(np.mean((y_val - y_val_hat)**2)))
        
        # adapt learning rate
        if (epoch > 0) and (epoch%lr_step==0):
            lr *= lr_pow
            NN.learning_rate_init = lr
            if verbose:
                print("Reduced learning rate to {:.4e}".format(lr))
        
        # check if no improvement occurred in the last epochs
        if rmse[-1] - rmse_min > tol:
            no_improvement_count += 1
        elif rmse[-1] < rmse_min:
            rmse_min = rmse[-1]
            no_improvement_count = 0
            # store a copy; a plain assignment would keep pointing at the model that continues training
            NN_best = deepcopy(NN)
        
        # exit loop when no improvement was registered during the past twenty epochs
        if no_improvement_count == 20:
            print("Training stopped after {} epochs".format(epoch))
            break
        
        # print loss (optional)
        if verbose and epoch%200==0:
            print("\nIteration {}".format(epoch))
            print("   rmse {:.4e}\n".format(rmse[epoch]))
    
    if (epoch==max_epoch-1): print("Reached max_epochs ( {} )".format(max_epoch))
    
    # return best network, its validation rmse, and the full rmse history
    return NN_best, rmse_min, rmse

In [None]:
# Set up NN
NN = MLPRegressor(solver='sgd', hidden_layer_sizes=(5, 5), activation='tanh', learning_rate='constant')

# train NN
NN, _, rmse = NN_train(NN, X_train, y_train, X_val, y_val, max_epoch=10000, verbose=True, lr_init=1e-1, lr_step=50)

You can use the following plotting routines to visualize your predictions. Keep in mind that all of the following graphs are based on projections of the input data on 1D or 2D subspaces that suppress at least part of the information contained in the dataset. Those projections, however, are necessary to enable visualizations of the predictions.

In [None]:
idx = 3 # <-- Change this to change sensor data considered

# ======================================== ignore code ==============================================================
# get prediction
y_pred = NN.predict(X_test)

# create figure and select component
fig, ax = plt.subplots(figsize = (6,5))

# plot data
ax.scatter(xit(X_test)[:,idx], yit(y_test[:,None]).reshape(-1), label='truth', s=30)
ax.scatter(xit(X_test)[:,idx], yit(y_pred[:,None]).reshape(-1), label='prediction', s=30)

# adjust plot
ax.legend()
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
ax.set_xlabel(rf"$u_{{{int(idx/2)+1}, {'x' if idx%2 == 0 else 'y'}}}$", fontsize=12)
ax.set_ylabel(r'$x_{defect}$')
plt.show()
# ===================================================================================================================

In [None]:
idcs = (3,7) # <-- Change this to change sensor data considered

# plot prediction for projection of inputs on 2D subspace
# ======================================== ignore code ==============================================================
# create figure and select measurements to plot
fig, ax = plt.subplots(1,3,figsize=(12.5,3.8), constrained_layout=True, sharey=True)

# collect data
X_plot = xit(X_test)[:,idcs]

# plot
plot0 = ax[0].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_test[:,None]).reshape(-1))
plot1 = ax[1].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_pred[:,None]).reshape(-1))
plot2 = ax[2].scatter(X_plot[:,0], X_plot[:,1], c=np.abs(y_pred-y_test))

# adjust plots
format_colorbar_plot(fig, ax, [plot0, plot1, plot2], idcs)
plt.show()
# ===================================================================================================================

## Hyperparameter tuning

The network's performance looks good on a visual inspection, but we need to quantify the error and compare it for different architectures to find the best-performing model. For this purpose, we turn to a grid-search strategy to find hyperparameters that give the best prediction on a validation set.

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
- Create arrays over the hyperparameters to vary
- Loop over the arrays
- Initialize & train the NN
- Compare the RMSE of different models
</p>
</div>

\
Note: For the parameters, at least vary the number of layers and the size of each layer. Optionally, also look at the activation function.
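
A grid over layer size and layer number can also be enumerated up front, which makes it easy to see exactly which architectures will be trained. The sketch below only builds the `hidden_layer_sizes` tuples and does not train anything; the lists `layer_sizes_demo` and `layer_numbers_demo` are example values, not a recommendation.

```python
from itertools import product

# example grid values (hypothetical, chosen only for illustration)
layer_sizes_demo = [5, 10]
layer_numbers_demo = [1, 2, 3]

# every (size, number) combination becomes one hidden_layer_sizes tuple
architectures = [
    (size,) * number
    for size, number in product(layer_sizes_demo, layer_numbers_demo)
]

for layers in architectures:
    print(layers)
```

Adding a third list, for example of activation functions, to the `product` call extends the grid without further nesting of loops.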

In [None]:
# define coordinate vectors for grid
layer_sizes = [5,10, 15, 20]
layer_numbers = [1, 2, 3, 4]

# array to store the validation rmse of every (layer size, layer number) pair
rmse = np.zeros((len(layer_sizes), len(layer_numbers)))

# loop all hidden layer sizes
for i, lsize in enumerate(layer_sizes):
    
    # loop over all numbers of hidden layers
    for j, lnumber in enumerate(layer_numbers):
    
        # get tuple for architecture and print
        layers = (lsize,) * lnumber
        print("Training NN with hidden layers:  {}".format(layers))
        
        # get NN
        NN = MLPRegressor(solver='sgd', hidden_layer_sizes=layers, activation='tanh')
        NN, rmse[i,j], _ = NN_train(NN, X_train, y_train, X_val, y_val, max_epoch=100000, verbose=False, lr_init=1e-1, lr_step=20)
        
        # print
        print("     Root mean squared error:    {:.4e}\n".format(rmse[i,j]))


# get NN that gave lowest rmse and print
min_size, min_number = np.unravel_index(np.argmin(rmse), rmse.shape)
print("\n\nModel with {} layers and {} neurons per layer gave lowest rmse of {:.4e}".format(layer_numbers[min_number], layer_sizes[min_size], rmse[min_size, min_number]))

## Model prediction on test set

Let's use our test data to visualize our best-performing model and test its predictive capabilities. First, re-initialize & train the model with the optimal hyperparameters.

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
- Obtain the parameters that lead to the best RMSE and retrain the model.
</p>
</div>

In [None]:
# Set up NN
layers = (layer_sizes[min_size],) * layer_numbers[min_number]
NN = MLPRegressor(solver='sgd', hidden_layer_sizes=layers, activation='tanh')

# train NN
NN, _, _ = NN_train(NN, X_train, y_train, X_val, y_val, max_epoch=10000, verbose=False, lr_init=1e-1, lr_step=50)

In [None]:
# get prediction
y_pred = NN.predict(X_test)

# select sensor component
idx = 3 # <-- Change this to change sensor data considered

# ======================================== ignore code ==============================================================
# create figure
fig, ax = plt.subplots(figsize = (6,5))

# plot data
ax.scatter(xit(X_test)[:,idx], yit(y_test[:,None]).reshape(-1), label='truth', s=30)
ax.scatter(xit(X_test)[:,idx], yit(y_pred[:,None]).reshape(-1), label='prediction', s=30)

# adjust plot
ax.legend()
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
ax.set_xlabel(rf"$u_{{{int(idx/2)+1}, {'x' if idx%2 == 0 else 'y'}}}$", fontsize=12)
ax.set_ylabel(r'$x_{defect}$')
plt.show()
# ===================================================================================================================

Let's plot the prediction for a projection of the inputs onto a 2D subspace:

In [None]:
# select sensor indices
idcs = (3,7) # <-- Change this to change sensor data considered

# ======================================== ignore code ==============================================================
# create figure
fig, ax = plt.subplots(1,3,figsize=(12.5,3.8), constrained_layout=True, sharey=True)

# collect data
X_plot = xit(X_test)[:,idcs]

# plot
plot0 = ax[0].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_test[:,None]).reshape(-1))
plot1 = ax[1].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_pred[:,None]).reshape(-1))
plot2 = ax[2].scatter(X_plot[:,0], X_plot[:,1], c=np.abs(y_pred-y_test))

# adjust plots
format_colorbar_plot(fig, ax, [plot0, plot1, plot2], idcs)
plt.show()
# ===================================================================================================================

Let us pick a few data points (or samples/bridges) from our test set to inspect how well our predictions compare with the ground truth. Change the index at the top of the following code block to change the sample; you can ignore the rest of the code.

In [None]:
# select the sample in the test set we want to look at
index = 51 # <--- Change this index to change the sample!
total_idx = idcs_test[index]

# get prediction and true value of the defect location
defect_loc_true = yit(y_test[[index]][None,:])[0,0]
defect_loc_pred = yit(NN.predict(X_test[[index],:])[:,None])[0,0]

# create the plot
plotly_plot(df, total_idx, measure_locs, defect_loc_true, defect_loc_pred)

Finally, we need to compute the RMSE for all samples in the test set to quantify our accuracy.
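
Since the network is trained on standardized targets, it matters in which units the RMSE is reported. The sketch below uses made-up targets and predictions to illustrate that, for a `StandardScaler`, the RMSE in physical units equals the RMSE in standardized units multiplied by the scaler's `scale_`. The helper `rmse_of` is a hypothetical name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def rmse_of(y_true, y_pred):
    """Root mean squared error between two 1D arrays."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


rng = np.random.default_rng(0)
y_raw = rng.uniform(0.0, 10.0, size=200)  # made-up targets in physical units

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y_raw[:, None]).reshape(-1)
y_scaled_pred = y_scaled + rng.normal(scale=0.1, size=200)  # made-up predictions

# error in standardized units and in physical units
rmse_scaled = rmse_of(y_scaled, y_scaled_pred)
y_raw_pred = scaler.inverse_transform(y_scaled_pred[:, None]).reshape(-1)
rmse_physical = rmse_of(y_raw, y_raw_pred)

print(np.isclose(rmse_physical, rmse_scaled * scaler.scale_[0]))
```

Reporting the error in physical units, as done in the next cell, makes it directly comparable with the dimensions of the bridge.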

In [None]:
y_pred_test = NN.predict(X_test)
rmse_test = np.sqrt(np.sum((yit(y_pred_test[:,None]) - yit(y_test[:,None])).reshape(-1)**2) / y_test.shape[0])
print("RMSE on test set for best performing model: {:.4e}".format(rmse_test))

## Feature selection: Beyond individual sensors
So far, we have used our engineering judgement to pick a subset of sensor locations. With these sensors, we have been able to make fairly accurate predictions of the defect location. Choosing a subset of all available sensor locations was necessary to keep the number of inputs manageable and the cost of the sensors low.
Alternatively, if we had a sensor at every location, we could use Principal Component Analysis (PCA) to compress the information from all sensors into a few modes. For a recap on PCA, you can review previous MUDE lectures.

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
- Create a dataset with the dx and dy data from all sensors
- Use PCA to transform the dataset into 10 features per sample. The number 10 is chosen to keep the number of inputs to the network the same compared to 5 sensors with x & y data.
</p>
</div>
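
To get a feeling for how many modes are needed, it helps to look at the cumulative explained variance. The following sketch runs `PCA` on synthetic data that is built from only three latent factors, so the answer is known in advance; the array `data` and the 99% threshold are assumptions made for this illustration and have nothing to do with the bridge dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# synthetic data: 200 samples, 50 features, generated from 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
data = latent @ mixing + 1e-3 * rng.normal(size=(200, 50))

pca_demo = PCA(n_components=10)
modes_demo = pca_demo.fit_transform(data)

# cumulative fraction of the variance captured by the first k modes
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# smallest number of modes that captures at least 99% of the variance
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)

print(modes_demo.shape, n_needed)
```

The same cumulative sum applied to the fitted PCA of the displacement data indicates how much information is lost by truncating to 10 modes.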


In [None]:
# import PCA from sklearn
from sklearn.decomposition import PCA

# Obtain full dataset
measurements = df[['dx','dy']].to_numpy().flatten()
measurements = np.reshape(measurements, (1000, -1))  # Shape: [Num_samples(1000) x features]

# Creating PCA modes
# -------------------
num_modes = 10
# -------------------
pca = PCA(n_components=num_modes)
pca.fit(measurements)

measurement_modes = pca.transform(measurements)

print( f"The variance explained by each component =\n{pca.explained_variance_ratio_}\n")
print( f"The singular values =\n{pca.singular_values_}\n")

print(f"Number of features without PCA: {len(measurements[0])}")
print(f"Number of features with PCA: {len(measurement_modes[0])}")


### Visualizing PCA modes
Similar to how we can plot the defect location depending on the individual sensors above, we can also plot the location based on the PCA modes. This is done in the following section. Select different modes to get a better understanding of the data.

In [None]:
# Select the modes to plot
modes = (0, 1) # <--- Change this index to change the pca modes to plot!

# ======================================== ignore code ==============================================================
# Plot the data
fig, ax = plt.subplots(figsize=(6,5))
plot1 = ax.scatter(measurement_modes[:,modes[0]], measurement_modes[:,modes[1]], c=defect_locs[:], s=40)
format_colorbar_pca_plot(fig, ax, plot1, modes)
plt.show()
# ===================================================================================================================

Similarly to earlier, we need to further pre-process our data.

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
    Re-normalize and split the data. (Note that only the inputs X change with PCA, the output stays the same)
</p>
</div>

In [None]:
# set up scalers and scale data
xscaler_pca = StandardScaler()  # Scaler for y does not change
xit_pca = xscaler_pca.inverse_transform
X_pca = xscaler_pca.fit_transform(measurement_modes)

# Split into train, validation & test set. Permutation is the same, so we only need to obtain new X
X_train_pca, X_val_pca, X_test_pca, y_train_pca, y_val_pca, y_test_pca, _, _, idcs_test_pca = train_test_val_split(X_pca, y, test_size=0.2, val_size=0.2)

<div style="background-color:#AABAB2; vertical-align: middle; padding:3px 20px;">
<p>
    
<b>Task:</b>
Initialize & train a single NN model based on this data.
</p>
</div>

In [None]:
# Set up NN
NN_PCA = MLPRegressor(solver='sgd', hidden_layer_sizes=(15, 15, 15), activation='tanh')

# train NN
NN_PCA, _, _ = NN_train(NN_PCA, X_train_pca, y_train_pca, X_val_pca, y_val_pca, max_epoch=10000, verbose=False, lr_init=1e-1, lr_step=100 )

In [None]:
idx = 0 # <-- Change this to change the PCA mode considered

# ======================================== ignore code ==============================================================
# get prediction
y_pred_pca = NN_PCA.predict(X_test_pca)
X_plot = xscaler_pca.inverse_transform(X_test_pca)

# create figure and select component
fig, ax = plt.subplots(figsize = (6,5))

# plot data
ax.scatter(X_plot[:,idx], yit(y_test_pca[:,None]).reshape(-1), label='truth', s=30)
ax.scatter(X_plot[:,idx], yit(y_pred_pca[:,None]).reshape(-1), label='prediction', s=30)

# adjust plot
ax.legend()
ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
ax.set_xlabel(f"PCA mode {idx}", fontsize=12)
ax.set_ylabel(r'$x_{defect}$')
plt.show()
# ===================================================================================================================

In [None]:
# select PCA modes
modes = (0,1) # <-- Change this to change the PCA modes considered

# ======================================== ignore code ==============================================================
# create figure
fig, ax = plt.subplots(1,3,figsize=(12,3.8), constrained_layout=True, sharey=True)

# collect data
X_plot = xscaler_pca.inverse_transform(X_test_pca)[:,modes]

# plot
plot0 = ax[0].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_test_pca[:,None]).reshape(-1))
plot1 = ax[1].scatter(X_plot[:,0], X_plot[:,1], c=yit(y_pred_pca[:,None]).reshape(-1))
plot2 = ax[2].scatter(X_plot[:,0], X_plot[:,1], c=np.abs(yit(y_pred_pca[:,None]) - yit(y_test_pca[:,None])).reshape(-1))

# adjust plots
format_colorbar_pca_plot(fig, ax, [plot0, plot1, plot2], modes)
plt.show()
# ===================================================================================================================

## Comparison
In the plot above, you can compare the values of the rightmost panel with the same panel made earlier with sensors. Keep in mind that the earlier error panel is computed on the standardized outputs, while the one above is computed in physical units.
Let's see how the PCA predictions compare to those with manual sensors. Change the index at the top of the following code block to change the sample; you can ignore the rest of the code.

In [None]:
# select the sample in the test set we want to look at
index = 51 # <--- Change this index to change the sample!
total_idx = idcs_test[index]

# get prediction and true value of the defect location
defect_loc_true = yit(y_test[[index]][None,:])[0,0]
defect_loc_pred = yit(NN.predict(X_test[[index],:])[:,None])[0,0]
defect_loc_pred_pca = yit(NN_PCA.predict(X_test_pca[[index],:])[:,None])[0,0]

plotly_plot(df, total_idx, measure_locs, defect_loc_true, defect_loc_pred, defect_loc_pred_pca)

Finally, we need to compute the RMSE for all samples in the test set to quantify our accuracy.

In [None]:
rmse_test_pca = np.sqrt(np.sum((yit(y_test_pca[:,None]) - yit(y_pred_pca[:,None])).reshape(-1)**2) / yit(y_pred_pca[:,None]).shape[0])

print("RMSE on test set for this PCA model: {:.4e}".format(rmse_test_pca))
print("RMSE on test set for the best performing model based on 5 sensors: {:.4e}".format(rmse_test))

Compare the error obtained using PCA with that using manual sensors.

Optional: 
- How does the error change when using more or fewer PCA modes?
- Do a hyperparameter study to find the best network when using PCA 

## Bonus
- Instead of using PCA, use K-means clustering to reduce the dimensionality of the problem. Can you obtain a lower error with it? 
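
One possible way to read this bonus task, and it is only one interpretation, is to use the distances of each sample to the K-means cluster centres as the reduced features. The sketch below shows the mechanics on made-up data; the number of clusters, `n_init` and `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# made-up stand-in for the [samples x features] displacement matrix
data = rng.normal(size=(100, 30))

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)

# fit_transform returns the distance of every sample to each of the 10 cluster centres
features = kmeans.fit_transform(data)

print(features.shape)
```

The resulting array has one column per cluster, so with 10 clusters the network input size stays the same as in the PCA and five-sensor setups.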