# Part a): Ordinary Least Squares on the Franke function
In this notebook, we generate a dataset by sampling the Franke function on the rectangle $[0,1]\times[0,1]$, both with and without added noise. We then fit a polynomial function to each of these datasets and evaluate how well it approximates the data.
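
As a rough illustration of the setup described above, the following sketch samples the Franke function and fits a polynomial by least squares. The data files loaded later in this notebook were generated elsewhere, so the uniform random sampling, the noise level, the polynomial degree of 5, and the helper names `franke` and `polynomial_design_matrix` are all assumptions made for this example, not the notebook's actual generation code.

```python
import numpy as np

def franke(x, y):
    """Franke's bivariate test function, usually evaluated on [0, 1] x [0, 1]."""
    term1 = 0.75 * np.exp(-((9 * x - 2) ** 2) / 4 - ((9 * y - 2) ** 2) / 4)
    term2 = 0.75 * np.exp(-((9 * x + 1) ** 2) / 49 - (9 * y + 1) / 10)
    term3 = 0.5 * np.exp(-((9 * x - 7) ** 2) / 4 - ((9 * y - 3) ** 2) / 4)
    term4 = -0.2 * np.exp(-((9 * x - 4) ** 2) - (9 * y - 7) ** 2)
    return term1 + term2 + term3 + term4

def polynomial_design_matrix(x, y, degree):
    """Hypothetical helper: columns x^i * y^j for all i + j <= degree (intercept first)."""
    columns = [x ** (total - j) * y ** j
               for total in range(degree + 1)
               for j in range(total + 1)]
    return np.column_stack(columns)

# Assumed sampling scheme: 400 uniformly random points in the unit square
rng = np.random.default_rng(0)
x_sample = rng.uniform(0, 1, size=400)
y_sample = rng.uniform(0, 1, size=400)

z_clean = franke(x_sample, y_sample)
z_noisy_sample = z_clean + rng.normal(0, 0.1, size=400)  # assumed noise level

# Least-squares fit of a degree-5 polynomial (the degree is an assumption)
X_sample = polynomial_design_matrix(x_sample, y_sample, degree=5)
beta_hat, *_ = np.linalg.lstsq(X_sample, z_noisy_sample, rcond=None)
```

Using `np.linalg.lstsq` here avoids forming and inverting $X^TX$ explicitly, which tends to be numerically safer for higher polynomial degrees.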

In [1]:
import os
import sys

import numpy as np
import pandas as pd

# Make the project root (the parent of the notebook directory) importable
sys.path.append(os.path.dirname(os.path.abspath('.')))

# Import local modules
from src.models.models import OLS
from src.evaluation.evaluation import mse, r_squared

In [2]:
df_X = pd.read_csv('../data/generated/X.csv', index_col=0)
df_z_no_noise = pd.read_csv('../data/generated/no_noise.csv', usecols=[1])
df_z_some_noise = pd.read_csv('../data/generated/some_noise.csv', usecols=[1])
df_z_noisy = pd.read_csv('../data/generated/noisy.csv', usecols=[1])

X = np.array(df_X)
z_no_noise = np.array(df_z_no_noise).ravel()
z_some_noise = np.array(df_z_some_noise).ravel()
z_noisy = np.array(df_z_noisy).ravel()

z_no_noise.shape

(400,)

We will now compute the MSE and the $R^2$ score for our three datasets. We will store our results in a csv file in order to show them in our final report.
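
For reference, the two metrics are commonly defined as $\mathrm{MSE} = \frac{1}{n}\sum_i (z_i - \hat z_i)^2$ and $R^2 = 1 - \sum_i (z_i - \hat z_i)^2 / \sum_i (z_i - \bar z)^2$. The sketch below implements these textbook definitions; the project's own `mse` and `r_squared` in `src.evaluation.evaluation` are not shown here, so the names `mse_sketch` and `r_squared_sketch` are hypothetical stand-ins and may differ from the actual implementation.

```python
import numpy as np

def mse_sketch(z_true, z_pred):
    """Mean squared error: average of the squared residuals."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    return np.mean((z_true - z_pred) ** 2)

def r_squared_sketch(z_true, z_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    ss_res = np.sum((z_true - z_pred) ** 2)
    ss_tot = np.sum((z_true - np.mean(z_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Small illustrative check on made-up numbers
z_example = np.array([1.0, 2.0, 3.0, 4.0])
perfect_mse = mse_sketch(z_example, z_example)
perfect_r2 = r_squared_sketch(z_example, z_example)
# Predicting the mean everywhere gives R^2 = 0 by construction
mean_prediction = np.full(4, z_example.mean())
baseline_mse = mse_sketch(z_example, mean_prediction)
baseline_r2 = r_squared_sketch(z_example, mean_prediction)
```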

In [44]:
targets = [{
    'name': 'No noise',
    'values': z_no_noise
},
{
    'name': 'Some noise (sigma 0.1)',
    'values': z_some_noise
},
{
    'name': 'Noisy (sigma 0.9)',
    'values': z_noisy
}]
col_names = ['MSE', 'R^2']
output_df = pd.DataFrame(columns=col_names)
print('%-30s|%-10s|%-10s' %('Data', 'MSE', 'R^2'))
print('-'*50)
for target in targets:
    z = target['values']
    ol = OLS()
    ol.fit(X, target['values'])
    predictions = ol.predict(X)
    
    mse_value = mse(z, predictions)
    r_2_value = r_squared(z, predictions)
    print('%-30s|%-10f|%-10f' %(target['name'], mse_value, r_2_value))
    
    # DataFrame.append was removed in pandas 2.0; add the row by label instead
    output_df.loc[target['name']] = [mse_value, r_2_value]
output_df.to_csv('../reports/csv_files/1_mse_r2_score.csv')

Data                          |MSE       |R^2       
--------------------------------------------------
No noise                      |0.002143  |0.999930  
Some noise (sigma 0.1)        |0.010636  |0.999698  
Noisy (sigma 0.9)             |0.773151  |0.997827  


We see that increasing the noise increases the mean squared error, while the $R^2$ statistic remains fairly close to 1.

Now we will compute our $\beta$ estimates and their individual variances. This will enable us to construct approximate 95% confidence intervals for the parameters. The output will be saved in three individual CSV files and shown in the final report.
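
The intervals rest on the standard OLS result $\mathrm{Var}(\hat\beta) = \sigma^2 (X^TX)^{-1}$, so that an approximate 95% interval for each coefficient is $\hat\beta_j \pm 1.96\sqrt{\mathrm{Var}(\hat\beta_j)}$. The sketch below is one possible variant, not the notebook's own code: it estimates $\sigma^2$ with the unbiased estimator $\mathrm{RSS}/(n-p)$, whereas the cell in this notebook plugs in the training MSE, $\mathrm{RSS}/n$. The function name `beta_confidence_intervals` and the synthetic data are hypothetical.

```python
import numpy as np

def beta_confidence_intervals(X, z, z_score=1.96):
    """Return OLS estimates, their variances, and approximate confidence bounds."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    residuals = z - X @ beta
    # Unbiased estimate of the noise variance (assumes n > p)
    sigma2_hat = residuals @ residuals / (n - p)
    # Diagonal of sigma^2 (X^T X)^{-1} gives the per-coefficient variances
    var_beta = sigma2_hat * np.diag(np.linalg.inv(X.T @ X))
    half_width = z_score * np.sqrt(var_beta)
    return beta, var_beta, beta - half_width, beta + half_width

# Synthetic example: intercept 1, slope 2, Gaussian noise with sigma = 0.1
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(200), rng.uniform(0, 1, size=200)])
true_beta = np.array([1.0, 2.0])
z_demo = X_demo @ true_beta + rng.normal(0, 0.1, size=200)

beta_demo, var_demo, lower_demo, upper_demo = beta_confidence_intervals(X_demo, z_demo)
```

With 400 samples and a few dozen polynomial features, the difference between dividing by $n$ and by $n-p$ is small, but the unbiased form is the one usually quoted alongside the 1.96 factor.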

In [45]:
targets = [{
    'name': 'No noise',
    'values': z_no_noise,
    'filename': '1_parameters_with_confidence_intervals_no_noise.csv'
},
{
    'name': 'Some noise (sigma 0.1)',
    'values': z_some_noise,
    'filename': '1_parameters_with_confidence_intervals_some_noise.csv'
},
{
    'name': 'Noisy (sigma 0.9)',
    'values': z_noisy,
    'filename': '1_parameters_with_confidence_intervals_noisy.csv'
}]
features = list(df_X.columns)

for target in targets:
    z = target['values']
    filename = target['filename']
    ol = OLS()
    ol.fit(X, z)
    predictions = ol.predict(X)
    
    mse_value = mse(z, predictions)
    estimate = ol.beta
    var_beta = mse_value*np.diag(np.linalg.inv(np.dot(X.transpose(), X)))
    lower_bound = estimate - 1.96*np.sqrt(var_beta)
    upper_bound = estimate + 1.96*np.sqrt(var_beta)
    
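    # pd.compat.OrderedDict is no longer available in current pandas; a plain
    # dict preserves insertion order (Python 3.7+), so the column order is kept
    parameter_df = pd.DataFrame(
        data={
            'Feature': features,
            'Estimate': estimate,
            'Variance': var_beta,
            'Lower bound': lower_bound,
            'Upper bound': upper_bound
        }
    )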
    parameter_df.to_csv('../reports/csv_files/' + filename)
    
