# Homework 1
***

We are going to work with the following dataset: fluid flow in a tube.
Several summary statistics are provided as features, including the mean, skewness, and kurtosis. We predict the flow rate (column 'tohn/hour') and need to build confidence and predictive intervals for it.

In [24]:
%matplotlib inline

import numpy as np
from sklearn import datasets, linear_model, preprocessing, model_selection
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import pandas as pd

In [25]:
df = pd.read_csv('exxsol_data.csv', sep=';', header=0)

There are 10 features and 1 label to predict:

In [26]:
print(df.columns.values)

['mean' 'std' 'skew' 'kurt' 'RMS' 'crest' 'freq_peak' 'shan' 'perm' 'temp'
 'tohn/hour']


In [27]:
y = df['tohn/hour']
freq_temp = df[['freq_peak', 'temp']]

Physics tells us that the flow rate is a function of the peak frequency and the temperature, so we start with these two features.

In [28]:
# apply the same random permutation to the features and the target
freq_temp, y = shuffle(freq_temp, y)

# alternative (unused): a single hold-out split instead of cross-validation
#from sklearn.model_selection import train_test_split
#train_X, test_X, train_y, test_y = train_test_split(freq_temp, y, train_size=0.7, random_state=2)

lr = linear_model.LinearRegression()
# out-of-fold prediction for every sample (20-fold cross-validation)
predicted = model_selection.cross_val_predict(
    lr, freq_temp, y.to_numpy(), cv=20)
# one R2 score per fold
score = model_selection.cross_val_score(lr, freq_temp, y,
                                         scoring='r2', cv=20)

## Q0: Build a point estimate for the mean R2 score and its standard deviation

In [29]:
mean_r2 = np.mean(score)
std_r2 = np.std(score)
# std_r2 = np.sqrt(np.mean(abs(score - mean_r2) ** 2))  # manual computation
print('Mean R2:', mean_r2)
print('Std R2: ', std_r2)

Mean R2: 0.8268699707067322
Std R2: 0.074594916692097


## Q1: `predicted` is an array of predictions of the label y. Assuming that $\sigma = 0.1$, compute 95% confidence and predictive intervals for the mean squared error.

In [30]:
def plus_minus(base, diff):
    return base - diff, base + diff

In [31]:
sigma = 0.1
num_samples = y.shape[0]
mse = np.mean(np.array((predicted - y) ** 2))
quantile_95 = 1.96  # 0.975 quantile of the standard normal (two-sided 95%)
# a 95% confidence interval needs the same quantile as the predictive one
print('Confidence interval: [{:.4f}, {:.4f}]'.format(*plus_minus(mse, quantile_95 * sigma / np.sqrt(num_samples))))
print('Predictive interval: [{:.4f}, {:.4f}]'.format(*plus_minus(mse, quantile_95 * sigma)))

Confidence interval: [0.1555, 0.1703]
Predictive interval: [-0.0331, 0.3589]
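The constant 1.96 is the 0.975 quantile of the standard normal distribution, which gives a two-sided 95% interval. The known-$\sigma$ intervals can be sketched as below, assuming the observations are treated as i.i.d. with standard deviation $\sigma$. The helper name `z_intervals` and the example numbers are hypothetical and are not taken from the dataset.

```python
import math
from statistics import NormalDist


def z_intervals(center, sigma, n, level=0.95):
    """Return (confidence, predictive) intervals around `center` for known sigma.

    Hypothetical helper: assumes n i.i.d. observations with known
    standard deviation `sigma`.
    """
    # Two-sided quantile: 0.975 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci_half = z * sigma / math.sqrt(n)  # uncertainty of the mean
    pi_half = z * sigma                 # spread of a single observation
    return ((center - ci_half, center + ci_half),
            (center - pi_half, center + pi_half))


# Illustrative numbers only.
ci, pi = z_intervals(center=0.16, sigma=0.1, n=700)
print('Confidence interval:', ci)
print('Predictive interval:', pi)
```

The confidence interval shrinks as $1/\sqrt{n}$, while the predictive interval does not depend on the sample size when $\sigma$ is known.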


## Q2: Compute 95% confidence and predictive intervals for the mean squared error, assuming $\sigma$ is unknown.

In [32]:
# Two-sided 95% quantile of Student's t distribution. The table value for
# 100 degrees of freedom is used; with 700 samples the exact quantile is
# slightly smaller, so the intervals are slightly conservative.
t = 1.984
std = np.std(np.array((predicted - y) ** 2))
print('Confidence interval: [{}, {}]'.format(*plus_minus(mse, t * std / np.sqrt(num_samples))))
print('Predictive interval: [{}, {}]'.format(*plus_minus(mse, t * std)))

Confidence interval: [0.13677054158145663, 0.18909131103007054]
Predictive interval: [-0.5292077955171285, 0.8550696481286556]
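Instead of reading the quantile from a table, it can be computed for the actual number of degrees of freedom. This is a minimal sketch, assuming SciPy is available. The helper `t_intervals` and the synthetic `squared_errors` array are hypothetical stand-ins, not the notebook's data. The sketch uses the sample standard deviation (`ddof=1`) and the textbook predictive half-width $t \, s \sqrt{1 + 1/n}$, which for large $n$ is close to `t * std`.

```python
import numpy as np
from scipy import stats


def t_intervals(samples, level=0.95):
    """Return (t quantile, confidence interval, predictive interval).

    Hypothetical helper: assumes i.i.d. samples with unknown variance.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    center = samples.mean()
    s = samples.std(ddof=1)  # sample std, n - 1 in the denominator
    t = stats.t.ppf(0.5 + level / 2, df=n - 1)
    ci_half = t * s / np.sqrt(n)          # uncertainty of the mean
    pi_half = t * s * np.sqrt(1 + 1 / n)  # spread of one new observation
    return (t,
            (center - ci_half, center + ci_half),
            (center - pi_half, center + pi_half))


# Synthetic stand-in for per-sample squared errors.
rng = np.random.default_rng(0)
squared_errors = 0.16 * rng.chisquare(df=1, size=700)

t_quantile, ci, pi = t_intervals(squared_errors)
print('t quantile:', t_quantile)
print('Confidence interval:', ci)
print('Predictive interval:', pi)
```

Squared errors are non-negative and skewed, so a symmetric normal-theory predictive interval is only a rough approximation and can extend below zero.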


We can also use the additional features and a more complex model, e.g. ElasticNet.

In [33]:
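y = df['tohn/hour']
X = df.drop(['tohn/hour'], axis=1)
X = preprocessing.scale(X)  # standardize features to zero mean and unit variance
X, y = shuffle(X, y)        # new random permutation of the rows

encv = linear_model.ElasticNetCV(cv=10, max_iter=3000, n_alphas=10)
# out-of-fold prediction for every sample (20-fold cross-validation)
predicted_encv = model_selection.cross_val_predict(
    encv, X, y.to_numpy(), cv=20)
# one R2 score per fold
score_encv = model_selection.cross_val_score(encv, X, y.to_numpy(),
                                             scoring='r2', cv=20)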

## Q3: Compute a 95% confidence interval for the difference in mean squared error between the two models, assuming $\sigma$ is unknown.

In [34]:
en_mse = np.mean(np.array((predicted_encv - y) ** 2))
mean = en_mse - mse  # difference of the two MSEs (ElasticNet minus linear regression)
t = 1.984  # two-sided 95% Student's t quantile for 100 degrees of freedom (conservative for 700 samples)
# NOTE: y was reshuffled before fitting ElasticNet, so `predicted` (computed on the
# earlier row order) is not row-aligned with y here, which distorts the per-sample differences.
std = np.std(np.array((predicted - y) ** 2) - np.array((predicted_encv - y) ** 2))
print('Confidence interval: [{}, {}]'.format(*plus_minus(mean, t * std / np.sqrt(num_samples))))
print('Predictive interval: [{}, {}]'.format(*plus_minus(mean, t * std)))

Confidence interval: [-0.2688240989582027, 0.06236934655284773]
Predictive interval: [-4.4845048395869815, 4.278050087181627]
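A paired interval for the difference in MSE needs both prediction arrays in the same row order as the target. In the cells above, `y` is shuffled a second time before the ElasticNet model is fitted, so `predicted` and `predicted_encv` follow different row orders. The sketch below shows the paired computation on row-aligned arrays, assuming SciPy is available. The helper `paired_mse_diff_ci` and the synthetic predictions are hypothetical and only illustrate the technique.

```python
import numpy as np
from scipy import stats


def paired_mse_diff_ci(y_true, pred_a, pred_b, level=0.95):
    """Confidence interval for mean((pred_b - y)^2 - (pred_a - y)^2).

    Hypothetical helper: all three arrays must share the same row order.
    """
    y_true = np.asarray(y_true, dtype=float)
    err_a = (np.asarray(pred_a, dtype=float) - y_true) ** 2
    err_b = (np.asarray(pred_b, dtype=float) - y_true) ** 2
    diff = err_b - err_a  # per-sample difference of squared errors
    n = diff.size
    t = stats.t.ppf(0.5 + level / 2, df=n - 1)
    half = t * diff.std(ddof=1) / np.sqrt(n)
    return diff.mean() - half, diff.mean() + half


# Synthetic, row-aligned data: model B is less noisy than model A.
rng = np.random.default_rng(0)
y_true = rng.normal(size=700)
pred_a = y_true + rng.normal(scale=0.4, size=700)
pred_b = y_true + rng.normal(scale=0.25, size=700)

low, high = paired_mse_diff_ci(y_true, pred_a, pred_b)
print('Confidence interval for MSE(B) - MSE(A): [{}, {}]'.format(low, high))
```

In the notebook, one way to keep the rows aligned would be to shuffle `df` once and derive `freq_temp`, `X`, and `y` from the shuffled frame. An interval that lies entirely below zero then indicates that the second model has the lower MSE.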
