<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (Academy of Innovative Applications of Digital Technologies). Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Project co-financed by the European Union from the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)"
    </center>

# Validation and cross-validation

In this exercise you will implement a validation pipeline.

At the end of the MSLE exercise you tested your model against the training and test datasets. You should have observed a gap between the two results. Validating your model lets you anticipate test-time performance and also gives you a method for comparing different models.

Implement the basic validation method, i.e. a random split. Test it with your model from the MSLE exercise.

In [None]:
%matplotlib inline

!wget -O mieszkania.csv https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
!wget -O mieszkania_test.csv https://www.dropbox.com/s/dbrj6sbxb4ayqjz/mieszkania_test.csv?dl=1


In [None]:
from typing import Tuple, List

import numpy as np
import pandas as pd
from sklearn import preprocessing
from tqdm import tqdm

def load(name: str) -> Tuple[np.ndarray, np.ndarray]:
    # Read the CSV and separate the target column 'cena' (price) from the features.
    data = pd.read_csv(name)
    x = data.loc[:, data.columns != 'cena'].to_numpy()
    y = data['cena'].to_numpy()

    return x, y

In [None]:
x_train, y_train = load('mieszkania.csv')
x_test, y_test = load('mieszkania_test.csv')

In [None]:
labelencoder = preprocessing.LabelEncoder()
labelencoder.fit(x_train[:, 1])
x_train[:, 1] = labelencoder.transform(x_train[:, 1])
x_test[:, 1] = labelencoder.transform(x_test[:, 1])

x_train = x_train.astype(np.float64)
x_test = x_test.astype(np.float64)

In [None]:
#######################################################
# TODO: Implement the basic validation method,        #
# compare MSLE on training, validation, and test sets #
#######################################################

def random_split(x: np.ndarray, percent: float = 0.8) -> Tuple[np.ndarray, np.ndarray]:
  # Shuffle all row indices, then cut them into a training and a validation part.
  permutation_idx = np.random.permutation(len(x))
  split_point = int(len(x) * percent)

  idx_train = permutation_idx[:split_point]
  idx_val = permutation_idx[split_point:]

  return idx_train, idx_val

idx_train, idx_val = random_split(np.zeros([100, 10]))

len(np.intersect1d(idx_train, idx_val)) == 0

True
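
The index split can then be used to compare training and validation MSLE. The following is a minimal sketch under stated assumptions: the data is synthetic (not `mieszkania.csv`), `holdout_indices` is a seeded stand-in for `random_split`, and the model is a hypothetical placeholder (least squares on `log1p(y)`) rather than the model from the MSLE exercise.

```python
import numpy as np


def msle(ys, ps):
    """Mean squared logarithmic error."""
    return float(np.mean((np.log1p(ys) - np.log1p(ps)) ** 2))


def holdout_indices(n, train_fraction=0.8, seed=0):
    """Seeded random split into disjoint train/validation index arrays."""
    perm = np.random.default_rng(seed).permutation(n)
    cut = int(n * train_fraction)
    return perm[:cut], perm[cut:]


# Synthetic stand-in data: positive targets, roughly exponential in the features.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=(200, 3))
y = np.exp(x @ np.array([1.0, 2.0, 0.5]) + 3.0 + rng.normal(0.0, 0.1, size=200))

idx_tr, idx_va = holdout_indices(len(x))

# Placeholder model: ordinary least squares on log1p(y), fitted on the training part only.
design = np.c_[x, np.ones(len(x))]
coef, *_ = np.linalg.lstsq(design[idx_tr], np.log1p(y[idx_tr]), rcond=None)


def predict(rows):
    # Invert the log1p transform to get predictions on the original scale.
    return np.expm1(rows @ coef)


train_msle = msle(y[idx_tr], predict(design[idx_tr]))
val_msle = msle(y[idx_va], predict(design[idx_va]))
print(f"train MSLE: {train_msle:.4f}, validation MSLE: {val_msle:.4f}")
```

The validation score is computed on rows the model never saw during fitting, so it is the number to compare against the test MSLE.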

To make random-split validation reliable, a large part of the training data may have to be held out. To overcome this problem, one may apply cross-validation.

![alt-text](https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png)

Let's now implement the method. Make sure that:
* number of partitions is a parameter,
* the method is not limited to `mieszkania.csv`,
* the method is not limited to one specific model.
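
The partitioning step on its own can be sketched as follows. This is an illustrative helper, not part of the exercise code: the name `kfold_indices` and its `seed` parameter are assumptions. It relies on `np.array_split`, which accepts sample counts that are not divisible by the number of folds.

```python
from typing import Iterator, Optional, Tuple

import numpy as np


def kfold_indices(
    n_samples: int,
    n_splits: int = 10,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")

    indices = np.arange(n_samples)
    if shuffle:
        indices = np.random.default_rng(seed).permutation(n_samples)

    # The first n_samples % n_splits folds receive one extra element.
    folds = np.array_split(indices, n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx


fold_sizes = [len(val_idx) for _, val_idx in kfold_indices(10, n_splits=3)]
print(fold_sizes)  # [4, 3, 3]
```

Because the helper only produces indices, it depends neither on a particular dataset nor on a particular model.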

In [None]:
####################################
# TODO: Implement cross-validation #
####################################

def msle(ys: np.ndarray, ps: np.ndarray) -> float:
    assert len(ys) == len(ps)

    return ((np.log(1 + ys) - np.log(1 + ps))**2).mean()


def train(x: np.ndarray, y: np.ndarray, lr: float = 1e-2, alpha: int = 10000, max_iter: int = 10000) -> Tuple[float, np.ndarray]:
  # NOTE: `alpha` is accepted but not used anywhere in this function.
  num_features = x.shape[1]

  bias = 0.0
  weights = np.zeros([num_features, 1])

  for epoch in tqdm(range(max_iter)):
    y_pred = (x @ weights + bias).squeeze()

    # Per-sample derivative of the MSLE loss with respect to the prediction p:
    # d/dp (log(1 + y) - log(1 + p))^2 = -2 * (log(1 + y) - log(1 + p)) / (1 + p)
    grad_pred = -2 * (np.log1p(y) - np.log1p(y_pred)) / (1 + y_pred)

    # Chain rule: average over samples; for the weights, scale by the features.
    grad_weights = np.mean(x * grad_pred[:, np.newaxis], axis=0)[:, np.newaxis]
    grad_bias = np.mean(grad_pred)

    weights -= lr * grad_weights
    bias -= lr * grad_bias

    if epoch % 2000 == 0:
      # Report the loss after this epoch's update.
      pred = (x @ weights + bias).squeeze()
      print(f'Epoch {epoch}: Loss = {msle(y, pred):8.8f}')

  return bias, weights

def cross_val(x: np.ndarray, y: np.ndarray, n_splits: int = 10, shuffle: bool = False) -> List[float]:
  if shuffle:
    id_shuffle = np.random.permutation(len(x))

    x = x[id_shuffle]
    y = y[id_shuffle]

  val_losses = []

  # Fold size. Integer division avoids float rounding (e.g. int((1 / 49) * 49) == 0).
  # If len(x) is not divisible by n_splits, the last len(x) % n_splits samples
  # are never validated and only ever appear in training folds.
  k = len(x) // n_splits
  for i in range(n_splits):
    print("fold number:", i)

    val_start, val_end = k * i, k * (i + 1)

    # Train on everything outside [val_start, val_end), validate on that slice.
    x_train_fold = np.r_[x[:val_start], x[val_end:]]
    y_train_fold = np.r_[y[:val_start], y[val_end:]]
    x_val_fold = x[val_start:val_end]
    y_val_fold = y[val_start:val_end]

    bias, weights = train(x_train_fold, y_train_fold)
    val_losses.append(msle(y_val_fold, (x_val_fold @ weights + bias).squeeze()))

  return val_losses

loss = cross_val(x_train, y_train, n_splits=3)
print(f"\nloss: {np.mean(loss)}")

fold number: 0
Epoch 0: Loss = 1.13775879
Epoch 2000: Loss = 1.13752140
Epoch 4000: Loss = 1.13728403
Epoch 6000: Loss = 1.13704666
Epoch 8000: Loss = 1.13680930

fold number: 1
Epoch 0: Loss = 0.94170875
Epoch 2000: Loss = 0.94152347
Epoch 4000: Loss = 0.94133821
Epoch 6000: Loss = 0.94115296
Epoch 8000: Loss = 0.94096772

fold number: 2
Epoch 0: Loss = 1.12280585
Epoch 2000: Loss = 1.12257121
Epoch 4000: Loss = 1.12233659
Epoch 6000: Loss = 1.12210198
Epoch 8000: Loss = 1.12186737

loss: 1.0680527779735025
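
The `cross_val` function above calls `train` directly, so it is tied to one model. One possible way to meet the "not limited to one specific model" requirement is to pass the model in as callables. The sketch below is a suggestion only: the name `cross_validate`, the `fit`/`predict`/`metric` parameters, the constant baseline model, and the synthetic data are all hypothetical.

```python
from typing import Any, Callable, List

import numpy as np


def cross_validate(
    x: np.ndarray,
    y: np.ndarray,
    fit: Callable[[np.ndarray, np.ndarray], Any],
    predict: Callable[[Any, np.ndarray], np.ndarray],
    metric: Callable[[np.ndarray, np.ndarray], float],
    n_splits: int = 5,
    seed: int = 0,
) -> List[float]:
    """Return one validation score per fold for an arbitrary model."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), n_splits)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        model = fit(x[train_idx], y[train_idx])
        scores.append(float(metric(y[val_idx], predict(model, x[val_idx]))))
    return scores


def msle(ys: np.ndarray, ps: np.ndarray) -> float:
    return float(np.mean((np.log1p(ys) - np.log1p(ps)) ** 2))


# Hypothetical baseline model: always predict the training mean in log space.
def fit_baseline(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean(np.log1p(y)))


def predict_baseline(model: float, x: np.ndarray) -> np.ndarray:
    return np.full(len(x), np.expm1(model))


# Synthetic stand-in data, not mieszkania.csv.
rng = np.random.default_rng(1)
x_demo = rng.uniform(size=(30, 2))
y_demo = rng.uniform(100.0, 200.0, size=30)

scores = cross_validate(x_demo, y_demo, fit_baseline, predict_baseline, msle, n_splits=3)
print(f"per-fold MSLE: {scores}, mean: {np.mean(scores):.4f}")
```

With this interface the gradient-descent model could be plugged in by wrapping `train` as the `fit` callable and the linear prediction as `predict`, and a different model can be compared on exactly the same folds by reusing the same `seed`.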





Recall that validation can be tricky, e.g. when there is significant class imbalance, when the number of subjects is small, or when instances are geographically clustered.

What could in theory go wrong here with random, unstratified partitions? Think about potential solutions and investigate the data in order to check whether these problems arise here.

In [None]:
##############################
# TODO: Investigate the data #
##############################
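# Share of each encoded category (column 1) in the full training set.
overall = pd.Series(x_train[:, 1]).value_counts() / len(x_train)

# Share of each category within the training part of one random split.
id_train, _ = random_split(x_train[:, 1])
random = pd.Series(x_train[id_train, 1]).value_counts() / len(id_train)

# Relative deviation of the random-split share from the overall share, in percent.
compare = pd.DataFrame({'overall': overall, 'random': random}).sort_index()
compare['rand error %'] = 100 * random / overall - 100

print(compare)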

     overall   random  rand error %
0.0    0.235  0.23750      1.063830
1.0    0.245  0.25625      4.591837
2.0    0.270  0.27500      1.851852
3.0    0.250  0.23125     -7.500000
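
In the table above, a single random split shifts the category shares by up to 7.5%. One potential solution is a stratified split, which samples each category separately so that the shares are preserved. The sketch below is illustrative: `stratified_split` is a hypothetical helper, and the label counts (47/49/54/50) are chosen to reproduce the `overall` column under the assumption that the training set has 200 rows.

```python
import numpy as np


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices so each label keeps roughly the same share in both parts."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts, val_parts = [], []
    for value in np.unique(labels):
        # Shuffle the rows of this category, then cut them at the same fraction.
        members = rng.permutation(np.flatnonzero(labels == value))
        cut = int(round(len(members) * train_fraction))
        train_parts.append(members[:cut])
        val_parts.append(members[cut:])
    return np.concatenate(train_parts), np.concatenate(val_parts)


# Assumed label counts matching shares 0.235 / 0.245 / 0.270 / 0.250 of 200 rows.
labels = np.repeat([0, 1, 2, 3], [47, 49, 54, 50])
idx_train, idx_val = stratified_split(labels)

overall_share = np.bincount(labels) / len(labels)
train_share = np.bincount(labels[idx_train]) / len(idx_train)
max_deviation = float(np.max(np.abs(train_share - overall_share)))
print(f"largest absolute share deviation: {max_deviation:.4f}")
```

The same idea extends to cross-validation by building each fold from per-category chunks instead of slicing the whole dataset at once.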
