# Week 34: Introduction to the course, Logistics and Practicalities
**Eric Ludvigsen**, Student

Date: **Week 34, August 19-23, 2024**

## Exercises

Here are three possible exercises for week 34.

## Exercise 1: Setting up various Python environments

The first exercise is purely technical. We want you to
* Install git as version control software and create a user account with a provider such as GitHub. Other providers such as GitLab are equally fine. You can also use the University of Oslo [GitHub facilities](https://www.uio.no/tjenester/it/maskin/filer/versjonskontroll/github.html). 

* Install various Python packages

We will make extensive use of Python as our programming language and of its
myriad available libraries.  You will find
IPython/Jupyter notebooks invaluable in your work.  You can also run **R**
code in the Jupyter/IPython notebooks, with the immediate benefit of
visualizing your data. You can also use compiled languages like C++,
Rust, Fortran etc. if you prefer. The focus in these lectures will be
on Python.

If you have Python installed (we recommend Python3) and you feel
pretty familiar with installing different packages, we recommend that
you install the following Python packages via **pip** as 

1. pip install numpy scipy matplotlib ipython scikit-learn sympy pandas pillow 

For **Tensorflow**, we recommend following the instructions in the text of 
[Aurelien Geron, Hands‑On Machine Learning with Scikit‑Learn and TensorFlow, O'Reilly](http://shop.oreilly.com/product/0636920052289.do)

We will come back to **tensorflow** later. 

For Python3, replace **pip** with **pip3**.

For macOS users we recommend installing Xcode and then
**brew**. Brew allows for a seamless installation of additional
software, for example 

1. brew install python3

Linux users can use **pip** as well. On Debian-based distributions such as the widely popular Ubuntu,
you can install Python with 

1. sudo apt-get install python3  (or python for Python2.7)

If you don't want to perform these operations separately and venture
into the hassle of exploring how to set up dependencies and paths, we
recommend two widely used distributions which set up all relevant
dependencies for Python, namely 

* [Anaconda](https://docs.anaconda.com/), 

which is an open source
distribution of the Python and R programming languages for large-scale
data processing, predictive analytics, and scientific computing, that
aims to simplify package management and deployment. Package versions
are managed by the package management system **conda**. 

* [Enthought canopy](https://www.enthought.com/product/canopy/) 

is a Python
distribution and analysis environment for scientific and analytic
computing, available for free and under a commercial
license.

We recommend using **Anaconda** if you are not too familiar with setting paths in a terminal environment.

In [567]:
# environment is set up, so these imports work
import numpy as np
import pandas as pd
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
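
A quick way to confirm that the installation worked is to import each package and print its version. The following is a minimal sketch, not part of the original exercise: the helper name `installed_versions` is hypothetical, and the list of import names is an assumption derived from the pip command above (scikit-learn is imported as `sklearn`, pillow as `PIL`).

```python
import importlib

# Import names assumed from the pip command above; note that
# scikit-learn is imported as "sklearn" and pillow as "PIL".
packages = ["numpy", "scipy", "matplotlib", "IPython",
            "sklearn", "sympy", "pandas", "PIL"]

def installed_versions(names):
    """Return {name: version or None}; None means the import failed."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # Any import failure is reported as "not installed"
            versions[name] = None
    return versions

versions = installed_versions(packages)
for name, version in versions.items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name:12s} {status}")
```

Any package reported as `NOT INSTALLED` can then be added with pip or conda before moving on.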

## Exercise 2: making your own data and exploring scikit-learn

We will generate our own dataset for a function $y(x)$ where $x \in [0,1]$ is drawn from the uniform distribution. The function $y$ is a quadratic polynomial in $x$ with added stochastic noise drawn from the normal distribution $\mathcal{N}(0,1)$.
The following simple Python instructions define our $x$ and $y$ values (with 100 data points).

In [568]:
n = 100
randomness_coeff = 0.2
x = np.random.rand(n,1)
y = 2.0 + 5*(x*x) + randomness_coeff * np.random.randn(n,1)

# fancier display than numpy print
results_frame = pd.DataFrame({"x":x.flatten(), "y":y.flatten()})
display(results_frame)

Unnamed: 0,x,y
0,0.971290,6.967837
1,0.908235,6.434637
2,0.044615,2.168872
3,0.238674,2.775274
4,0.938416,6.072832
...,...,...
95,0.425518,2.842964
96,0.488470,3.238665
97,0.638463,4.066793
98,0.766629,4.625011


1. Write your own code (following the examples under the [regression notes](https://compphysics.github.io/MachineLearning/doc/LectureNotes/_build/html/chapter1.html)) for computing the parametrization of the data set fitting a second-order polynomial. 



In [569]:
# make design matrix
P = 3
# matrix should correspond to pattern
# beta_1 + beta_2*x + beta_3*x^2
feature_matrix = np.zeros((n, P))

# manual way of filling the matrix (P = 3 columns)
#feature_matrix[:,0] = 1
#feature_matrix[:,1] = x[:,0]
#feature_matrix[:,2] = x[:,0]**2

# automated way of filling matrix
for exponent in range(0,P):
    feature_matrix[:,exponent] = x[:,0]**exponent

print("feature matrix")
df = pd.DataFrame(feature_matrix)
display(df)


feature matrix


Unnamed: 0,0,1,2
0,1.0,0.971290,0.943405
1,1.0,0.908235,0.824891
2,1.0,0.044615,0.001990
3,1.0,0.238674,0.056965
4,1.0,0.938416,0.880624
...,...,...,...
95,1.0,0.425518,0.181066
96,1.0,0.488470,0.238603
97,1.0,0.638463,0.407635
98,1.0,0.766629,0.587721
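
The loop above builds what is known as a Vandermonde matrix. As a cross-check, NumPy's `np.vander` with `increasing=True` should give the same columns. The sketch below regenerates its own `x` with an arbitrary seed so that it runs on its own; the names `X_loop` and `X_vander` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
x = rng.random((100, 1))
P = 3  # number of columns: powers 0, 1, 2

# Loop-based construction, as in the cell above
X_loop = np.zeros((len(x), P))
for exponent in range(P):
    X_loop[:, exponent] = x[:, 0]**exponent

# Same matrix via np.vander; increasing=True orders columns as x^0, x^1, x^2
X_vander = np.vander(x[:, 0], N=P, increasing=True)

print(np.allclose(X_loop, X_vander))
```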


In [570]:
# optimal beta is now given by
# beta_hat = (X^T X)^-1 X^T y
feature_transposed = np.transpose(feature_matrix)
print("feature matrix shape: ", feature_matrix.shape)
print("transposed:", feature_transposed.shape)

x_t_x = np.matmul(feature_transposed, feature_matrix)
print("X^T X shape:", x_t_x.shape)
calc_invert = np.linalg.inv(x_t_x)

beta = calc_invert.dot(feature_transposed).dot(y)
df = pd.DataFrame(beta)
print("\nbeta")
display(df)

y_tilde = feature_matrix @ beta
df = pd.DataFrame(y_tilde)
print("y tilde")
display(df)

print("add modelled y to display frame along with absolute error")
results_frame["y_tilde"] = y_tilde.flatten()
results_frame["y - y_tilde"] = y.flatten()-y_tilde.flatten()
display(results_frame)

feature matrix shape:  (100, 3)
transposed: (3, 100)
X^T X shape: (3, 3)

beta


Unnamed: 0,0
0,2.03577
1,-0.14858
2,5.213718


y tilde


Unnamed: 0,0
0,6.810104
1,6.201573
2,2.039519
3,2.297308
4,6.487664
...,...
95,2.916572
96,3.207202
97,4.066200
98,4.986074


add modelled y to display frame along with absolute error


Unnamed: 0,x,y,y_tilde,y - y_tilde
0,0.971290,6.967837,6.810104,0.157733
1,0.908235,6.434637,6.201573,0.233064
2,0.044615,2.168872,2.039519,0.129353
3,0.238674,2.775274,2.297308,0.477966
4,0.938416,6.072832,6.487664,-0.414832
...,...,...,...,...
95,0.425518,2.842964,2.916572,-0.073608
96,0.488470,3.238665,3.207202,0.031463
97,0.638463,4.066793,4.066200,0.000593
98,0.766629,4.625011,4.986074,-0.361063
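
Forming the inverse of $X^TX$ explicitly works here, but it can become numerically fragile for higher polynomial degrees. The sketch below compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq` on data generated the same way as above (with an arbitrary seed, an assumption made only to keep the block self-contained). For this small, well-conditioned problem all three should agree.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 100
x = rng.random((n, 1))
y = 2.0 + 5 * x**2 + 0.2 * rng.standard_normal((n, 1))
X = np.vander(x[:, 0], N=3, increasing=True)

# 1) Explicit inverse, as in the cell above
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Solve the normal equations (X^T X) beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) Least-squares solver working on X directly (SVD-based)
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(np.hstack([beta_inv, beta_solve, beta_lstsq]))
```

Each column of the printed array holds one estimate of the three coefficients; they should be close to the true values 2, 0 and 5 up to the effect of the noise.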


2. Use thereafter **scikit-learn** (see again the examples in the regression slides) and compare with your own code.   


In [571]:
# if this cell runs twice in a row (without running the above prediction),
# the results are wrong and y_tilde_skl changes - not sure why
clf = skl.LinearRegression().fit(feature_matrix, y)
ytilde_skl = clf.predict(feature_matrix)

In [572]:
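# flatten the (n, 1) prediction array, as done for y_tilde above
results_frame["scikit y_tilde"] = ytilde_skl.ravel()
display(results_frame)
# scikit-learn gives the same y_tilde as the manual calculation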

Unnamed: 0,x,y,y_tilde,y - y_tilde,scikit y_tilde
0,0.971290,6.967837,6.810104,0.157733,6.810104
1,0.908235,6.434637,6.201573,0.233064,6.201573
2,0.044615,2.168872,2.039519,0.129353,2.039519
3,0.238674,2.775274,2.297308,0.477966,2.297308
4,0.938416,6.072832,6.487664,-0.414832,6.487664
...,...,...,...,...,...
95,0.425518,2.842964,2.916572,-0.073608,2.916572
96,0.488470,3.238665,3.207202,0.031463,3.207202
97,0.638463,4.066793,4.066200,0.000593,4.066200
98,0.766629,4.625011,4.986074,-0.361063,4.986074
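
To compare the fitted coefficients and not only the predictions, one option is to pass `fit_intercept=False`, since the design matrix already contains a column of ones. With the default `fit_intercept=True` used above, the predictions are the same, but the constant term typically ends up in `intercept_` rather than in `coef_`. The following is a sketch on freshly generated data with an arbitrary seed; `beta_own` and `beta_skl` are names introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)  # arbitrary seed
n = 100
x = rng.random((n, 1))
y = 2.0 + 5 * x**2 + 0.2 * rng.standard_normal((n, 1))
X = np.vander(x[:, 0], N=3, increasing=True)

# Own OLS estimate
beta_own = np.linalg.lstsq(X, y, rcond=None)[0]

# scikit-learn without a separate intercept: the ones column carries it
model = LinearRegression(fit_intercept=False).fit(X, y)
beta_skl = model.coef_.T  # coef_ has shape (1, 3) for a 2D target

print(np.hstack([beta_own, beta_skl]))
```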



3. Using scikit-learn, compute also the mean square error, a risk metric corresponding to the expected value of the squared (quadratic) error defined as

$$
MSE(\boldsymbol{y},\boldsymbol{\tilde{y}}) = \frac{1}{n}
\sum_{i=0}^{n-1}(y_i-\tilde{y}_i)^2,
$$

In [573]:
def custom_mse(y_true, y_pred):
    n = len(y_true) # number of data points
    mse = (1/n) * np.sum( (y_true - y_pred)**2 )
    return mse 

calculated_mse = custom_mse(y, y_tilde)                          
print(f"Calculated mean squared error: {calculated_mse:.2f}")

# scikit mean squared error
mse_skl = mean_squared_error(y, ytilde_skl)                          
print(f"Scikit mean squared error:     {mse_skl:.2f}")

Calculated mean squared error: 0.04
Scikit mean squared error:     0.04


and the $R^2$ score function.
If $\tilde{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the score $R^2$ is defined as

$$
R^2(\boldsymbol{y}, \tilde{\boldsymbol{y}}) = 1 - \frac{\sum_{i=0}^{n - 1} (y_i - \tilde{y}_i)^2}{\sum_{i=0}^{n - 1} (y_i - \bar{y})^2},
$$

where we have defined the mean value  of $\boldsymbol{y}$ as

$$
\bar{y} =  \frac{1}{n} \sum_{i=0}^{n - 1} y_i.
$$

In [574]:
def custom_r2(y_true, y_pred):
    n = len(y_true) # number of data points
    y_mean = (1/n) * np.sum(y_true)
    r2 =  1 - ((np.sum( (y_true - y_pred)**2) ) / (np.sum( (y_true - y_mean)**2 )))
    return r2

calculated_r2 = custom_r2(y, y_tilde)                          
print(f"Calculated R^2: {calculated_r2:.2f}")

# scikit r^2 score
r2_skl = r2_score(y, ytilde_skl)
print(f"Scikit R^2:     {r2_skl:.2f}")

Calculated R^2: 0.99
Scikit R^2:     0.99


You can use the functionality included in scikit-learn. If you feel like it, you can use your own program and define functions which compute the above two quantities. 
Discuss the meaning of these results. Try also to vary the coefficient in front of the added stochastic noise term and discuss the quality of the fits.

In [575]:
# scikit-learn predictions match the manually computed predictions
# with low stochastic noise the fit is almost perfect (R^2 = 0.99), as expected
# when fitting a second-order polynomial to data generated from one
# the MSE is close to the noise variance: 0.2**2 = 0.04

# increasing the noise coefficient degrades the fit because the noise swamps the signal
# coefficient = 20 gives MSE 358.28 (of the order of 20**2 = 400) and R^2 0.01, which is terrible
# changing the polynomial degree of the model (P) does not help in either case
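
The effect of the noise coefficient can be made systematic with a small loop. The sketch below is an illustration under assumptions: the seed and the list of coefficients are arbitrary, and the helpers `mse` and `r2` are local re-implementations of the two formulas above. For OLS with a correctly specified model with $p$ parameters, the expected training MSE is $\sigma^2 (n-p)/n$, so the MSE should grow roughly with the square of the coefficient.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred)**2)

def r2(y_true, y_pred):
    """R^2 score: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred)**2)
    ss_tot = np.sum((y_true - np.mean(y_true))**2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(3)  # arbitrary seed
n = 100
x = rng.random((n, 1))
X = np.vander(x[:, 0], N=3, increasing=True)
signal = 2.0 + 5 * x**2  # noise-free quadratic

results = []
for noise_coeff in [0.0, 0.2, 1.0, 5.0, 20.0]:
    y = signal + noise_coeff * rng.standard_normal((n, 1))
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    y_tilde = X @ beta
    results.append((noise_coeff, mse(y, y_tilde), r2(y, y_tilde)))
    print(f"noise coeff {noise_coeff:5.1f}: MSE = {results[-1][1]:10.4f}, "
          f"R^2 = {results[-1][2]:6.3f}")
```

With zero noise the fit is exact up to rounding, while for large coefficients $R^2$ drops towards zero because the variance of the noise dominates the variance of the signal.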

## Exercise 3: Split data in test and training data

In this exercise we want you to compute the MSE for the training
data and the test data as a function of the complexity of a polynomial,
that is, the degree of a given polynomial.

The aim is to reproduce Figure 2.11 of [Hastie et al](https://github.com/CompPhysics/MLErasmus/blob/master/doc/Textbooks/elementsstat.pdf).

Our data is defined by $x\in [-3,3]$ with a total of for example $n=100$ data points. You should try to vary the number of data points $n$ in your analysis.

In [576]:
np.random.seed()
n = 100
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)

# fancier display than numpy print
results_frame = pd.DataFrame({"x":x.flatten(), "y":y.flatten()})
display(results_frame)

Unnamed: 0,x,y
0,-3.000000,0.089425
1,-2.939394,0.074634
2,-2.878788,0.058156
3,-2.818182,-0.020584
4,-2.757576,0.189305
...,...,...
95,2.757576,0.660865
96,2.818182,0.942837
97,2.878788,0.652794
98,2.939394,0.694325


where $y$ is the function we want to fit with a given polynomial.

**a)**
Write a first code which sets up a design matrix $X$ defined by a fifth-order polynomial and split your data set in training and test data.

In [577]:
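# make design matrix
# P is the number of columns (powers 0..P-1), so P = 5 gives a fourth-order
# polynomial; a fifth-order polynomial would need P = 6
P = 5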

X = np.zeros((n, P))
for exponent in range(0,P):
    X[:,exponent] = x[:,0]**exponent

print("feature matrix")
df = pd.DataFrame(X)
display(df)


feature matrix


Unnamed: 0,0,1,2,3,4
0,1.0,-3.000000,9.000000,-27.000000,81.000000
1,1.0,-2.939394,8.640037,-25.396472,74.650235
2,1.0,-2.878788,8.287420,-23.857723,68.681324
3,1.0,-2.818182,7.942149,-22.382419,63.077727
4,1.0,-2.757576,7.604224,-20.969224,57.824224
...,...,...,...,...,...
95,1.0,2.757576,7.604224,20.969224,57.824224
96,1.0,2.818182,7.942149,22.382419,63.077727
97,1.0,2.878788,8.287420,23.857723,68.681324
98,1.0,2.939394,8.640037,25.396472,74.650235
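
The cell above builds the design matrix but does not yet split the data. One possible way to do both is sketched below, using `train_test_split` from scikit-learn. The assumptions are labeled in the code: the seed, the 80/20 split ratio and the variable name `degree` are choices made for this illustration, and the matrix has `degree + 1` columns so that it really contains powers up to $x^5$.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)  # arbitrary seed
n = 100
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x - 2)**2) + rng.normal(0, 0.1, x.shape)

# Fifth-order polynomial: powers 0..5, hence degree + 1 columns
degree = 5
X = np.vander(x[:, 0], N=degree + 1, increasing=True)

# Assumption: 20% of the points are held out as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

print("train:", X_train.shape, y_train.shape)
print("test: ", X_test.shape, y_test.shape)
```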


**b)**
Thereafter, write code (using either **scikit-learn** or your own matrix inversion code, for example with **numpy**)
that performs an ordinary least squares fit and computes the mean squared error for the training data and the test data. These calculations should apply to a model given by a fifth-order polynomial.

In [578]:
# optimal beta is given by
# beta_hat = (X^T X)^-1 X^T y

x_t_x = np.matmul(X.T, X)
print("X^T X shape:", x_t_x.shape)
calc_invert = np.linalg.inv(x_t_x)

beta = calc_invert.dot(X.T).dot(y)
df = pd.DataFrame(beta)
print("\nbeta")
display(df)

y_tilde = X @ beta

print("add modelled y to display frame along with absolute error")
results_frame["y_tilde"] = y_tilde.flatten()
results_frame["y - y_tilde"] = y.flatten()-y_tilde.flatten()
display(results_frame)

calculated_mse = custom_mse(y, y_tilde)                          
print(f"Calculated mean squared error: {calculated_mse:.2f}")

calculated_r2 = custom_r2(y, y_tilde)                          
print(f"Calculated R^2: {calculated_r2:.2f}")

X^T X shape: (5, 5)

beta


Unnamed: 0,0
0,0.82431
1,0.441399
2,-0.015144
3,-0.036109
4,-0.00385


add modelled y to display frame along with absolute error


Unnamed: 0,x,y,y_tilde,y - y_tilde
0,-3.000000,0.089425,0.026903,0.062522
1,-2.939394,0.074634,0.025650,0.048984
2,-2.878788,0.058156,0.025160,0.032996
3,-2.818182,-0.020584,0.025442,-0.046027
4,-2.757576,0.189305,0.026509,0.162796
...,...,...,...,...
95,2.757576,0.660865,0.946552,-0.285687
96,2.818182,0.942837,0.896932,0.045905
97,2.878788,0.652794,0.843610,-0.190815
98,2.939394,0.694325,0.786479,-0.092154


Calculated mean squared error: 0.03
Calculated R^2: 0.89
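
The numbers above are computed on the full data set. A sketch of how the training and test MSE could be computed separately is given below. It is an illustration under assumptions (arbitrary seed, 80/20 split, a design matrix with six columns for a fifth-order polynomial) and uses `np.linalg.lstsq` in place of the explicit inverse.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)  # arbitrary seed
n = 100
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x - 2)**2) + rng.normal(0, 0.1, x.shape)

# Fifth-order polynomial: six columns, powers 0..5
X = np.vander(x[:, 0], N=6, increasing=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

# Fit on the training data only
beta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

# Evaluate the same beta on both subsets
mse_train = mean_squared_error(y_train, X_train @ beta)
mse_test = mean_squared_error(y_test, X_test @ beta)

print(f"Training MSE: {mse_train:.4f}")
print(f"Test MSE:     {mse_test:.4f}")
```

The key point is that `beta` is estimated from the training rows only, so the test MSE measures how well the fit generalizes to points it has not seen.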


**c)**
Add now a model which allows you to make polynomials up to degree $15$.  Perform a standard OLS fitting of the training data and compute the MSE for the training and test data and plot both test and training data MSE as functions of the polynomial degree. Compare what you see with Figure 2.11 of Hastie et al. Comment on your results. For which polynomial degree do you find an optimal MSE (smallest value)?

In [579]:
def linear_regression_model(x, y, P):
    """Fit y with OLS on powers x**0 .. x**(P-1); return x, y and the fit y_tilde."""
    if np.shape(x) != np.shape(y):
        raise ValueError("x and y must be the same shape")
    n = len(x)

    # design matrix with P columns, i.e. a polynomial of degree P-1
    X = np.zeros((n, P))
    for exponent in range(0,P):
        X[:,exponent] = x[:,0]**exponent

    # normal-equation solution beta = (X^T X)^-1 X^T y
    beta = (np.linalg.inv(X.T.dot(X))).dot(X.T).dot(y)

    y_tilde = X @ beta

    results_frame = pd.DataFrame({"x":x.flatten(), "y":y.flatten()})
    results_frame["y_tilde"] = y_tilde.flatten()

    return results_frame


In [580]:
# check that the function works
np.random.seed()
n = 200
P = 6

x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)

results_frame = linear_regression_model(x, y, P)

display(results_frame)

calculated_mse = custom_mse(results_frame["y"], results_frame["y_tilde"])
print(f"Calculated mean squared error: {calculated_mse:.2f}")
calculated_r2 = custom_r2(results_frame["y"], results_frame["y_tilde"])
print(f"Calculated R^2: {calculated_r2:.2f}")

Unnamed: 0,x,y,y_tilde
0,-3.000000,0.092148,0.255078
1,-2.969849,-0.014163,0.213800
2,-2.939698,-0.119671,0.175532
3,-2.909548,0.118743,0.140164
4,-2.879397,0.040287,0.107586
...,...,...,...
195,2.879397,0.806904,0.749761
196,2.909548,0.679490,0.692833
197,2.939698,0.726984,0.632511
198,2.969849,0.694786,0.568676


Calculated mean squared error: 0.02
Calculated R^2: 0.91
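
The remaining step of part c), the MSE for training and test data as a function of polynomial degree, could be sketched as follows. This is not the notebook's own solution: the helper `design_matrix`, the seed, the manual 80/20 split and the use of `np.linalg.lstsq` (chosen because $X^TX$ becomes badly conditioned at high degree) are all assumptions made for this illustration.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns x^0 .. x^degree for a column vector x of shape (n, 1)."""
    return np.vander(x[:, 0], N=degree + 1, increasing=True)

rng = np.random.default_rng(6)  # arbitrary seed
n = 100
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x - 2)**2) + rng.normal(0, 0.1, x.shape)

# Manual random split: 20% test, 80% training
indices = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]

degrees = list(range(1, 16))
mse_train, mse_test = [], []
for degree in degrees:
    X_train = design_matrix(x[train_idx], degree)
    X_test = design_matrix(x[test_idx], degree)
    # Fit on training data only, then evaluate on both subsets
    beta = np.linalg.lstsq(X_train, y[train_idx], rcond=None)[0]
    mse_train.append(np.mean((y[train_idx] - X_train @ beta)**2))
    mse_test.append(np.mean((y[test_idx] - X_test @ beta)**2))

best_degree = degrees[int(np.argmin(mse_test))]
for d, tr, te in zip(degrees, mse_train, mse_test):
    print(f"degree {d:2d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
print("degree with smallest test MSE:", best_degree)
```

The lists `mse_train` and `mse_test` can be plotted against `degrees` with matplotlib to compare with Figure 2.11 of Hastie et al. Repeating the run with different seeds and different `n` shows how much the degree with the smallest test MSE fluctuates.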
