A linear model attempts to estimate sets of observables $y$ with linear combinations of predictors $x$.

---

If $x$ has a single dimension, i.e. there is only one slope coefficient besides the intercept, the resulting model is univariate.

If, on the other hand, $x$ has more than one dimension, each with its own linear coefficient, the resulting model is multivariate.
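
As a sketch of the two cases above, the models can be written as follows. The symbols $\beta_j$ (coefficients), $\varepsilon_i$ (noise term) and $k$ (number of predictor dimensions) are notation assumed here for illustration and do not appear elsewhere in this notebook.

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad \text{(univariate)}
$$

$$
y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i \qquad \text{(multivariate)}
$$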

---

Additionally, the $x$ and $y$ sets may be affected by uncertainties.

Any uncertainty affecting any quantity is here assumed to stem from a normal distribution.

---

This notebook summarises the `python` tools best suited to each of these cases.



# Import statements

In [1]:
import numpy as np

from scipy import stats
from sklearn import linear_model
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import tools

# Univariate models

In [2]:
n = 100
p = 2   # number of parameters to fit, i.e. slope and intercept

real_slope = -5.3
real_intercept = 10.7

x_sigma = 0.5
y_sigma = 3.5

# create column vector of predictors: a regular grid perturbed by normal noise of scale x_sigma
x = np.linspace(start=-30, stop=20, num=n) + np.random.normal(loc=0, scale=x_sigma, size=(n,))
x = x.reshape(-1, 1)

x_design = tools.add_constant(x)

# create column vector of observables: the noise-free line plus normal noise of scale y_sigma
y = real_slope * np.linspace(start=-30, stop=20, num=n) + real_intercept + np.random.normal(loc=0, scale=y_sigma, size=(n,))
y = y.reshape(-1, 1)

x_to_predict = 6
x_to_predict_col_vec = np.array([[1],[x_to_predict]])

alpha = 1 - 0.95   # significance level, i.e. 95% intervals

### Without uncertainties

##### Using `scipy`

In [3]:
# linregress expects 1-D arrays, so flatten the column vectors
x_row = x.flatten()
y_row = y.flatten()

sp_result = stats.linregress(x_row, y_row)

print(f"Estimated slope {sp_result.slope:.4f} ± {sp_result.stderr:.4f}\nEstimated intercept {sp_result.intercept:.4f} ± {sp_result.intercept_stderr:.4f}")

Estimated slope -5.3330 ± 0.0287
Estimated intercept 10.3853 ± 0.4421


In [4]:
# residual standard error: ||residuals||_2 / sqrt(degrees of freedom)
residuals = y_row - (x_row * sp_result.slope + sp_result.intercept)
s_value = np.linalg.norm(residuals, 2) / np.sqrt(n - p)

y_estimated = x_to_predict * sp_result.slope + sp_result.intercept

# x0^T (X^T X)^-1 x0, with x0 = [1, x_to_predict]^T
inner_sqrt_term = (x_to_predict_col_vec.T @ np.linalg.inv(x_design.T @ x_design) @ x_to_predict_col_vec)[0, 0]

# half-width of the two-sided confidence interval for the mean response
one_sided_CL_magnitude = stats.t.ppf(q=1 - alpha / 2, df=n - p) * s_value * np.sqrt(inner_sqrt_term)

print(f"95% confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")


95% confidence interval at x = 6 is y_estimated = -21.6125 ± 1.0402
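
The quantity computed in the cell above corresponds to the standard confidence interval for the mean response of an ordinary least-squares fit. Written out, with notation that is assumed here for illustration ($\mathbf{x}_0 = (1, x_0)^\top$ for `x_to_predict_col_vec`, $X$ for `x_design`, $s$ for `s_value`, and $\mathbf{r}$ for the vector of residuals):

$$
\hat{y}_0 \pm t_{1-\alpha/2,\; n-p} \; s \, \sqrt{\mathbf{x}_0^{\top} \left( X^{\top} X \right)^{-1} \mathbf{x}_0},
\qquad
s = \frac{\lVert \mathbf{r} \rVert_2}{\sqrt{n-p}}
$$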


In [5]:
one_sided_PL_magnitude = stats.t.ppf(q = 1-alpha/2, df = n-p) * s_value * np.sqrt(1+inner_sqrt_term)

print(f"95% prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% prediction interval at x = 6 is y_estimated = -21.6125 ± 8.3575
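
The wider interval computed in the cell above is the prediction interval for a single new observation. Compared with the interval for the mean response, the only change is the additional $1$ under the square root, which accounts for the noise of the new observation itself. The notation is assumed here for illustration ($\mathbf{x}_0 = (1, x_0)^\top$ for `x_to_predict_col_vec`, $X$ for `x_design`, $s$ for `s_value`):

$$
\hat{y}_0 \pm t_{1-\alpha/2,\; n-p} \; s \, \sqrt{1 + \mathbf{x}_0^{\top} \left( X^{\top} X \right)^{-1} \mathbf{x}_0}
$$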


##### Using `sklearn`

In [6]:
sk_lm = linear_model.LinearRegression()
sk_result = sk_lm.fit(x, y)

print(f"Estimated slope {sk_result.coef_[0,0]:.4f} \nEstimated intercept {sk_result.intercept_[0]:.4f} ")

Estimated slope -5.3330 
Estimated intercept 10.3853 


In [7]:
residuals = y_row - (x_row * sk_result.coef_[0,0] + sk_result.intercept_[0])
s_value = np.linalg.norm(residuals, 2) / np.sqrt(n - p)

y_estimated = x_to_predict * sk_result.coef_[0,0] + sk_result.intercept_[0]

inner_sqrt_term = (x_to_predict_col_vec.T @ np.linalg.inv(x_design.T @ x_design ) @ x_to_predict_col_vec )[0,0]

one_sided_CL_magnitude = stats.t.ppf(q = 1-(alpha/2), df = n-p) * s_value * np.sqrt(inner_sqrt_term)

print(f"95% confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% confidence interval at x = 6 is y_estimated = -21.6125 ± 1.0402


In [8]:
one_sided_PL_magnitude = stats.t.ppf(q = 1-alpha/2, df = n-p) * s_value * np.sqrt(1+inner_sqrt_term)

print(f"95% prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% prediction interval at x = 6 is y_estimated = -21.6125 ± 8.3575


##### Using `statsmodels`

In [9]:
sm_lm = OLS(y, x_design)
sm_result = sm_lm.fit()
# print(sm_result.summary())

In [10]:
prediction = sm_result.get_prediction(x_to_predict_col_vec.T)

y_estimated = prediction.predicted_mean[0]

one_sided_CL_magnitude = prediction.summary_frame(alpha=alpha).mean_ci_upper.iloc[0] - prediction.predicted_mean[0]

print(f"95% confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% confidence interval at x = 6 is y_estimated = -21.6125 ± 1.0402


In [11]:
one_sided_PL_magnitude = prediction.summary_frame(alpha=alpha).obs_ci_upper.iloc[0] - prediction.predicted_mean[0]

print(f"95% prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% prediction interval at x = 6 is y_estimated = -21.6125 ± 8.3575


##### Conclusion

Use `statsmodels`: it is the neatest option and needs the least manual manipulation, since both confidence and prediction intervals come directly from `get_prediction`.

Avoid `scikit-learn` for this purpose: it does not provide standard errors for the fitted coefficients.
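
If the standard errors are needed without `statsmodels`, they can be recovered by hand from the residual variance and $(X^\top X)^{-1}$. The following is a minimal sketch using only `numpy`; the helper `ols_standard_errors` and the `*_demo` variables are hypothetical names introduced for illustration, and the data are synthetic.

```python
import numpy as np

def ols_standard_errors(x, y):
    """Fit y ~ intercept + slope * x by least squares.

    Returns (coefficients, standard_errors), both ordered [intercept, slope].
    """
    x = np.asarray(x, dtype=float).reshape(-1)
    y = np.asarray(y, dtype=float).reshape(-1)

    # Design matrix with a leading column of ones for the intercept
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Unbiased estimate of the residual variance
    residuals = y - X @ beta
    s_squared = residuals @ residuals / (n - p)

    # Coefficient covariance: s^2 (X^T X)^-1
    covariance = s_squared * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(covariance))

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
x_demo = np.linspace(-30, 20, 100)
y_demo = -5.3 * x_demo + 10.7 + rng.normal(scale=3.5, size=x_demo.size)

beta_demo, se_demo = ols_standard_errors(x_demo, y_demo)
print(f"Estimated slope {beta_demo[1]:.4f} ± {se_demo[1]:.4f}")
print(f"Estimated intercept {beta_demo[0]:.4f} ± {se_demo[0]:.4f}")
```

The same helper can be applied to the predictors and observables passed to `LinearRegression.fit`, since it only depends on the data and not on the fitted estimator.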

# Multivariate models