# Exercises 3: Measuring a model's prediction error

## Task 1: Cross-validation and model stability

In [82]:
import numpy as np

# Load data stored in NumPy format (saved with numpy.save(path, array))
data = np.load("vaje3_1.npy")
X = data[:, :-1]
y = data[:, -1]
X

array([[0.28757752, 0.27362273, 0.15967398],
       [0.78830514, 0.59386693, 0.14451585],
       [0.40897692, 0.16018481, 0.14918039],
       ...,
       [0.39149875, 0.75208667, 0.06712062],
       [0.70957985, 0.40160665, 0.43520844],
       [0.10882407, 0.44855811, 0.68013496]])

1.a: Check the (average) accuracy of linear regression with five-fold cross-validation. How stable is the model, i.e. what is the variance of the resulting errors?

<details>
  <summary>Hint:</summary>

  *Use the [sklearn.model_selection.KFold object](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) and its split(X) method*.
   
</details>

In [83]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split the data into 5 consecutive folds
kf = KFold(n_splits=5)

reg = LinearRegression()
scores = []  # RMSE of each fold
coefficients = np.zeros((5, X.shape[1] + 1))  # one row per fold: [intercept, coefficients...]
for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    # Split data into train and test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit model
    reg.fit(X_train, y_train)

    # Calculate the root mean squared error (RMSE) on the test fold
    rmse = np.sqrt(mean_squared_error(y_test, reg.predict(X_test)))
    print("Model score:", rmse)
    scores.append(rmse)

    # Save intercept and coefficients
    coefficients[i, 0] = reg.intercept_
    coefficients[i, 1:] = reg.coef_

print("Score variance:", np.var(scores))

Model score: 0.18960165232406972
Model score: 0.18638864123988205
Model score: 0.1960588986624767
Model score: 0.1771748443908929
Model score: 0.17002566428275065
Score variance: 8.485067415270995e-05
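
The same check can be written more compactly with cross_val_score. The sketch below is a minimal illustration on synthetic data: the arrays X_demo and y_demo and the generating coefficients are assumptions made for the example, not the contents of vaje3_1.npy. It reports the mean RMSE together with its standard deviation across folds, which is often easier to read than the variance because it is in the same units as the error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 3 features, linear target plus noise
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([1.5, 2.4, 1.5]) - 0.2 + rng.normal(0, 0.1, 200)

# scikit-learn treats larger scores as better, so RMSE comes back negated
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=KFold(n_splits=5), scoring="neg_root_mean_squared_error",
)
rmse = -neg_rmse

print("RMSE per fold:", rmse)
print("Mean RMSE:", rmse.mean())
print("RMSE std:", rmse.std())
```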


1.b: Using cross-validation, build five linear regression models and store their intercepts and coefficients (in a 5x4 matrix). Do the corresponding coefficients differ across folds? By how much?

In [84]:
# Each row holds the model of one fold as [b0, b1, b2, b3], where y = b0 + b1*x1 + b2*x2 + b3*x3

coefficients


array([[-0.18415939,  1.50743854,  2.43282336,  1.47957335],
       [-0.19639105,  1.52647655,  2.4416169 ,  1.47915122],
       [-0.16964682,  1.51169294,  2.4205532 ,  1.46727743],
       [-0.15786318,  1.48666751,  2.43633069,  1.47178237],
       [-0.18573746,  1.49966098,  2.42805102,  1.50079499]])
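
One way to answer "by how much" is to summarize each column of the matrix. The sketch below copies the values printed above into a new array (the name coefs is introduced only for this example) and computes the standard deviation and the range of every coefficient across the five folds.

```python
import numpy as np

# Coefficient matrix copied from the output above: one row per fold,
# columns are [intercept, b1, b2, b3]
coefs = np.array([
    [-0.18415939, 1.50743854, 2.43282336, 1.47957335],
    [-0.19639105, 1.52647655, 2.44161690, 1.47915122],
    [-0.16964682, 1.51169294, 2.42055320, 1.46727743],
    [-0.15786318, 1.48666751, 2.43633069, 1.47178237],
    [-0.18573746, 1.49966098, 2.42805102, 1.50079499],
])

col_std = coefs.std(axis=0)         # spread of each coefficient across folds
col_range = np.ptp(coefs, axis=0)   # max - min of each coefficient across folds

print("Std per coefficient:  ", col_std)
print("Range per coefficient:", col_range)
```

For these values every coefficient changes by less than 0.05 between folds, which suggests a fairly stable model.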

1.c: Add second-order variables to the data ($x_1^2$, $x_1\cdot x_2$, $x_1\cdot x_3$, $x_2^2$, ...). You can join columns with the function numpy.concatenate(list of columns, axis=1).

In [85]:
# Add second-order variables. Note: the task intends all squares and pairwise products
# of the variables (see the solutions); this cell adds only a subset of them.

XExtra = np.array([X[:, 0]**2, X[:, 1]**2, X[:, 0]*X[:, 1]]) # x1^2, x2^2, x1*x2 variables

X = np.concatenate((X, XExtra.T), axis=1)



1.d: Check the accuracy of linear regression on the data from task 1.c with five-fold cross-validation. Do the model coefficients differ more or less between folds?

In [90]:
reg2 = LinearRegression()
scores2 = []  # RMSE of each fold
coefficients2 = np.zeros((5, X.shape[1] + 1))  # one row per fold: [intercept, coefficients...]
for i, (train_index, test_index) in enumerate(kf.split(X, y)):
    # Split data into train and test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit model
    reg2.fit(X_train, y_train)

    # Calculate the root mean squared error (RMSE) on the test fold
    rmse = np.sqrt(mean_squared_error(y_test, reg2.predict(X_test)))
    print("Model score:", rmse)
    scores2.append(rmse)

    # Save intercept and coefficients
    coefficients2[i, 0] = reg2.intercept_
    coefficients2[i, 1:] = reg2.coef_

print("Score variance:", np.var(scores2))

print("Coefficient matrix:")
coefficients2


Model score: 0.12971108239867477
Model score: 0.13674574603964723
Model score: 0.1263916088709987
Model score: 0.12705994180945984
Model score: 0.12126477470961572
Score variance: 2.559490198434548e-05
Coefficient matrix:


array([[-0.21647914,  1.80391937,  2.86018547,  1.49333945, -0.8105725 ,
        -0.93980606,  1.01332118],
       [-0.24423595,  1.88427615,  2.92771324,  1.47292813, -0.85300612,
        -0.99473392,  0.99480839],
       [-0.22673122,  1.81576364,  2.92590259,  1.48065408, -0.78607162,
        -1.00764211,  0.97947199],
       [-0.2065375 ,  1.78912485,  2.91707915,  1.47932503, -0.8091055 ,
        -1.03106449,  1.05451385],
       [-0.22596353,  1.80625632,  2.91929052,  1.49427694, -0.81017764,
        -1.01461433,  1.0264871 ]])
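
To compare the stability of the two models, the ranges of the coefficients they share (intercept, $x_1$, $x_2$, $x_3$) can be put side by side. The sketch below copies both printed matrices into new arrays (coefs_first and coefs_second are names introduced only for this example).

```python
import numpy as np

# Copied from the output of 1.b: [intercept, x1, x2, x3] per fold
coefs_first = np.array([
    [-0.18415939, 1.50743854, 2.43282336, 1.47957335],
    [-0.19639105, 1.52647655, 2.44161690, 1.47915122],
    [-0.16964682, 1.51169294, 2.42055320, 1.46727743],
    [-0.15786318, 1.48666751, 2.43633069, 1.47178237],
    [-0.18573746, 1.49966098, 2.42805102, 1.50079499],
])

# Copied from the output of 1.d: [intercept, x1, x2, x3, x1^2, x2^2, x1*x2] per fold
coefs_second = np.array([
    [-0.21647914, 1.80391937, 2.86018547, 1.49333945, -0.81057250, -0.93980606, 1.01332118],
    [-0.24423595, 1.88427615, 2.92771324, 1.47292813, -0.85300612, -0.99473392, 0.99480839],
    [-0.22673122, 1.81576364, 2.92590259, 1.48065408, -0.78607162, -1.00764211, 0.97947199],
    [-0.20653750, 1.78912485, 2.91707915, 1.47932503, -0.80910550, -1.03106449, 1.05451385],
    [-0.22596353, 1.80625632, 2.91929052, 1.49427694, -0.81017764, -1.01461433, 1.02648710],
])

range_first = np.ptp(coefs_first, axis=0)            # max - min across folds
range_second = np.ptp(coefs_second[:, :4], axis=0)   # shared columns only

for name, r1, r2 in zip(["intercept", "x1", "x2", "x3"], range_first, range_second):
    print(f"{name}: range {r1:.4f} (first-order) vs {r2:.4f} (with second-order terms)")
```

For the printed values, the coefficients of $x_1$ and $x_2$ vary more across folds once their squares and product are added, while the coefficient of $x_3$ does not, even though the prediction error itself is lower and less variable.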

## Task 2: Stratified sampling

In [91]:
# Load the data from the file vaje3_2.npz. Arrays x, y can be saved to an .npz file with numpy.savez(path, x=x, y=y)
data = np.load("vaje3_2.npz")
# Store the features in the variable x
x = data["x"]
# Store the target values in the variable y
y = data["y"]

2.a: Check the accuracy of logistic regression with five-fold cross-validation. Print the model's accuracy in each fold using the classification accuracy metric (accuracy). Do you notice anything unusual?

In [97]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
kf = KFold(n_splits=5)

logreg = LogisticRegression()
scores = [] # list of model scores
predictions = [] # list of model predictions
YValues = [] # list of Y's

for train_index, test_index in kf.split(x, y):
    # Split data into train and test set
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit model
    logreg.fit(x_train, y_train)

    # Calculate accuracy, predicting once and reusing the result
    y_pred = logreg.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print("Model score:", acc)
    scores.append(acc)
    predictions.append(y_pred)
    YValues.append(y_test)

Model score: 1.0
Model score: 1.0
Model score: 1.0
Model score: 0.66
Model score: 0.0


2.b: For each fold, print the number of positive and negative values of the target variable in the training and test sets. Why did the model's accuracy in task 2.a differ so much between folds?

In [100]:
# Inspect the lists predictions and YValues in the variable viewer (toolbar).
# The targets appear to be ordered: only 0s at the start and only 1s at the end. When the data
# is cut into 5 consecutive parts, the first 3 test folds contain only 0s and the model predicts
# only 0; in the fourth fold the model gets only part of the predictions right (0.66), and
# in the last fold every prediction is wrong.
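
The counts asked for in this task can be printed with a short loop. The sketch below uses a stand-in target vector: y_demo is an assumption that matches the class totals reported later in this task (799 zeros and 201 ones) and places them in sorted order, which is consistent with the scores seen in 2.a but is not confirmed for the actual file.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in targets (assumption): 799 zeros followed by 201 ones, in sorted order
y_demo = np.concatenate([np.zeros(799, dtype=int), np.ones(201, dtype=int)])

kf_demo = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks of rows
test_counts = []
for fold, (train_index, test_index) in enumerate(kf_demo.split(y_demo)):
    train_pos = int(y_demo[train_index].sum())
    test_pos = int(y_demo[test_index].sum())
    test_counts.append(test_pos)
    print(f"Fold {fold}: train 1s={train_pos}, 0s={len(train_index) - train_pos} | "
          f"test 1s={test_pos}, 0s={len(test_index) - test_pos}")
```

Under this assumption the first three test folds contain no positive examples at all, while the last one contains only positive examples, so the class balance of the training and test sets differs sharply from fold to fold.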

2.c: If the distribution of the target variable in the training and test folds is similar to the original distribution of the target variable, such sampling is called stratified sampling. Build stratified folds for five-fold cross-validation and check how many positive and negative examples the training and test sets contain.


<details>
  <summary>Hint:</summary>

  *Use the [sklearn.model_selection.StratifiedKFold object](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) and its split(x, y) method*.
   
</details>

In [102]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

for i, (train_index, test_index) in enumerate(skf.split(x, y)):
    print(f"Fold{i} (train set): number of 1:{np.sum(y[train_index])}, number of 0:{len(y[train_index])-np.sum(y[train_index])}" )
    print(f"Fold{i} (test set): number of 1:{np.sum(y[test_index])}, number of 0:{len(y[test_index])-np.sum(y[test_index])}" )

print(f"Number of 1's in y: {np.sum(y)}, number of 0's in y: {len(y)-np.sum(y)}")

Fold0 (train set): number of 1:161.0, number of 0:639.0
Fold0 (test set): number of 1:40.0, number of 0:160.0
Fold1 (train set): number of 1:161.0, number of 0:639.0
Fold1 (test set): number of 1:40.0, number of 0:160.0
Fold2 (train set): number of 1:161.0, number of 0:639.0
Fold2 (test set): number of 1:40.0, number of 0:160.0
Fold3 (train set): number of 1:161.0, number of 0:639.0
Fold3 (test set): number of 1:40.0, number of 0:160.0
Fold4 (train set): number of 1:160.0, number of 0:640.0
Fold4 (test set): number of 1:41.0, number of 0:159.0
Number of 1's in y: 201.0, number of 0's in y: 799.0


2.d: Check the accuracy of logistic regression with five-fold cross-validation on the folds obtained by stratified sampling. Are the resulting models more stable? Did you expect the results to be more stable?

In [103]:
logreg2 = LogisticRegression()
scores2 = [] # list of model scores
predictions2 = [] # list of model predictions
YValues2 = [] # list of Y's

for train_index, test_index in skf.split(x, y):
    # Split data into train and test set
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit model
    logreg2.fit(x_train, y_train)

    # Calculate accuracy, predicting once and reusing the result
    y_pred = logreg2.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print("Model score:", acc)
    scores2.append(acc)
    predictions2.append(y_pred)
    YValues2.append(y_test)

Model score: 0.835
Model score: 0.94
Model score: 0.99
Model score: 1.0
Model score: 0.785


2.e: Shuffle the data before stratified sampling. You can do this by passing the parameter shuffle=True to the StratifiedKFold object. Are the results more stable now? What happens if you run the code several times?

In [None]:
np.random.seed(42)

skfShuffle = StratifiedKFold(n_splits=5, shuffle=True)

# Repeat the cross-validation loop from 2.d with skfShuffle in place of skf. The models are now
# trained on shuffled data and are more accurate. The random seed is set for reproducible results.
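
A complete version of this step might look like the sketch below. It runs on synthetic stand-in data (x_demo and y_demo are assumptions for the example, with rows sorted by class), and it fixes the shuffle through the random_state parameter of StratifiedKFold instead of numpy.random.seed, which keeps the seed local to the splitter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Stand-in data (assumption): one informative feature, rows sorted by class
rng = np.random.default_rng(0)
y_demo = np.concatenate([np.zeros(799, dtype=int), np.ones(201, dtype=int)])
x_demo = rng.normal(loc=2.0 * y_demo, scale=1.0).reshape(-1, 1)

# random_state fixes the shuffle for this splitter only, without touching
# NumPy's global random state
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_demo = []
for train_index, test_index in skf_demo.split(x_demo, y_demo):
    model = LogisticRegression()
    model.fit(x_demo[train_index], y_demo[train_index])
    y_pred = model.predict(x_demo[test_index])
    scores_demo.append(accuracy_score(y_demo[test_index], y_pred))

print("Accuracy per fold:", scores_demo)
print("Mean:", np.mean(scores_demo), "Std:", np.std(scores_demo))
```

Changing random_state to another integer gives different folds and slightly different scores, which is a quick way to see how much of the variation comes from the split itself.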

Note: The shuffle parameter makes stratified sampling stochastic by using a random number generator. With numpy.random.seed(integer) we can make sure that the generator always returns the same random numbers, so that our experiments are reproducible.

Caution: Any change in model accuracy between runs is solely a consequence of the random seed. In publications, the random seed is used only to make experiments reproducible (and not as in: "The best result is obtained with random seed 18").

## Additional task

Think about what the average accuracy will be when predicting a discrete variable with the majority classifier model (from the first exercises), if we use a special case of cross-validation called "leave-one-out". Assume that the target variable is uniformly distributed (each target value occurs (number of rows)/(number of unique values) times).
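
The reasoning can be checked empirically. Once a row is left out, its class has one example fewer than every other class, so the majority classifier can never predict it. The sketch below uses scikit-learn's DummyClassifier with strategy="most_frequent" as a stand-in for the majority classifier from the first exercises, and a small balanced target (y_demo, X_demo and the class sizes are assumptions for the example).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut

# Balanced stand-in target (assumption): 3 classes with 4 rows each
y_demo = np.repeat([0, 1, 2], 4)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the majority classifier

hits = []
for train_index, test_index in LeaveOneOut().split(X_demo):
    clf = DummyClassifier(strategy="most_frequent")
    clf.fit(X_demo[train_index], y_demo[train_index])
    # The left-out class is now the least frequent one in the training set
    hits.append(clf.predict(X_demo[test_index])[0] == y_demo[test_index][0])

print("Leave-one-out accuracy:", np.mean(hits))
```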