A linear model attempts to estimate sets of observables $y$ with linear combinations of predictors $x$.

---

If $x$ has a single dimension, i.e. there is only one coefficient of linearity, the obtained model is univariate.

If, on the other hand, $x$ has more than one dimension, i.e. each of its dimensions has a corresponding linear coefficient, the obtained model is multivariate.

---

Additionally, the $x$ and $y$ sets may be affected by uncertainties.

Any uncertainty affecting any quantity is here assumed to stem from a normal distribution.
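
As a sketch of the notation implied above, the univariate and multivariate models with normally distributed uncertainties could be written as follows. The symbols $\beta_j$, $\varepsilon_i$, $\delta_i$, $\sigma_{y,i}$, $\sigma_{x,i}$ and $x_i^{\ast}$ (the unobserved true predictor) are assumptions introduced here for illustration and are not taken from the notebook.

$$
\begin{aligned}
\text{univariate:}\quad & y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \\
\text{multivariate:}\quad & y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i, \\
\text{uncertainty in } y\text{:}\quad & \varepsilon_i \sim \mathcal{N}\left(0, \sigma_{y,i}^2\right), \\
\text{uncertainty in } x\text{:}\quad & x_i = x_i^{\ast} + \delta_i, \quad \delta_i \sim \mathcal{N}\left(0, \sigma_{x,i}^2\right).
\end{aligned}
$$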

---

This notebook summarises the `python` tools best suited to each of these cases.



# Import statements

In [205]:
import numpy as np

from scipy import odr, stats
from sklearn import linear_model
from statsmodels.regression.linear_model import GLS, OLS, WLS
from statsmodels.tools import tools

# Univariate models

In [251]:
n = 100
p = 2   # number of parameters to fit, i.e. slope and intercept

real_slope = -5.3
real_intercept = 10.7

x_sigma = 0.5
y_sigma = 3.5

# create column vector of predictors
x = np.linspace(start=-30, stop=20, num=n) + np.random.normal(loc=0, scale=x_sigma, size=(n,))
x = x.reshape(-1, 1)

x_design = tools.add_constant(x)

# create column vector of observables
y = real_slope * np.linspace(start=-30, stop=20, num=n) + real_intercept + np.random.normal(loc=0, scale=y_sigma, size=(n,))
y = y.reshape(-1, 1)

x_to_predict = 6
x_to_predict_col_vec = np.array([[1],[x_to_predict]])

alpha = 1-.95
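
The cell above draws from NumPy's global random state without a seed, so the numbers printed in this notebook change from run to run. A minimal sketch of a reproducible variant, assuming a `numpy.random.Generator` is acceptable here (the seed value `0` and the names `rng` and `x_true` are illustrative choices, not part of the original notebook):

```python
import numpy as np

n = 100
real_slope = -5.3
real_intercept = 10.7
x_sigma = 0.5
y_sigma = 3.5

# Seeded generator: the seed value is an arbitrary, illustrative choice
rng = np.random.default_rng(0)

# Noise-free predictor grid, as in the cell above
x_true = np.linspace(start=-30, stop=20, num=n)

# Observed predictors and observables, each perturbed by normal noise
x = (x_true + rng.normal(loc=0, scale=x_sigma, size=n)).reshape(-1, 1)
y = (real_slope * x_true + real_intercept
     + rng.normal(loc=0, scale=y_sigma, size=n)).reshape(-1, 1)

print(x.shape, y.shape)
```

Re-running this cell with the same seed reproduces the same `x` and `y`, which keeps the fitted coefficients comparable across the sections below.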

### Without uncertainties

##### Using `scipy.stats`

In [252]:
# stats.linregress expects one-dimensional arrays, so flatten the column vectors
x_row = x.flatten()
y_row = y.flatten()

sp_result = stats.linregress(x_row, y_row)

print(f"Estimated slope {sp_result.slope:.4f} ± {sp_result.stderr:.4f}\nEstimated intercept {sp_result.intercept:.4f} ± {sp_result.intercept_stderr:.4f}")

Estimated slope -5.2957 ± 0.0264
Estimated intercept 10.8958 ± 0.4084


In [253]:
residuals = y_row - (x_row * sp_result.slope + sp_result.intercept)
s_value = np.linalg.norm(residuals, 2) / np.sqrt(n - p)

y_estimated = x_to_predict * sp_result.slope + sp_result.intercept

inner_sqrt_term = (x_to_predict_col_vec.T @ np.linalg.inv(x_design.T @ x_design ) @ x_to_predict_col_vec )[0,0]

one_sided_CL_magnitude = stats.t.ppf(q = 1-(alpha/2), df = n-p) * s_value * np.sqrt(inner_sqrt_term)

print(f"95% Confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% Confidence interval at x = 6 is y_estimated = -20.8787 ± 0.9584


In [254]:
one_sided_PL_magnitude = stats.t.ppf(q = 1-alpha/2, df = n-p) * s_value * np.sqrt(1+inner_sqrt_term)

print(f"95% Prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% Prediction interval at x = 6 is y_estimated = -20.8787 ± 7.7356
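
For reference, the two half-widths computed in the cells above correspond to the standard ordinary least squares intervals. Here $X$ stands for `x_design`, $x_0$ for `x_to_predict_col_vec`, $s$ for `s_value`, and $t_{1-\alpha/2,\,n-p}$ for the Student-t quantile; these symbols are introduced for illustration only.

$$
\begin{aligned}
\text{residual variance:}\quad & s^2 = \frac{1}{n-p}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \\
\text{confidence interval:}\quad & \hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\; s\,\sqrt{x_0^{\mathsf T}\left(X^{\mathsf T}X\right)^{-1}x_0}, \\
\text{prediction interval:}\quad & \hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\; s\,\sqrt{1 + x_0^{\mathsf T}\left(X^{\mathsf T}X\right)^{-1}x_0}.
\end{aligned}
$$

The extra $1$ under the square root accounts for the scatter of a single new observation around the fitted line, which is why the second interval is much wider than the first.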


##### Using `sklearn`

In [192]:
sk_lm = linear_model.LinearRegression()
sk_result = sk_lm.fit(x, y)

print(f"Estimated slope {sk_result.coef_[0,0]:.4f} \nEstimated intercept {sk_result.intercept_[0]:.4f} ")

Estimated slope -5.3289 
Estimated intercept 10.4126 


In [193]:
residuals = y_row - (x_row * sk_result.coef_[0,0] + sk_result.intercept_[0])
s_value = np.linalg.norm(residuals, 2) / np.sqrt(n - p)

y_estimated = x_to_predict * sk_result.coef_[0,0] + sk_result.intercept_[0]

inner_sqrt_term = (x_to_predict_col_vec.T @ np.linalg.inv(x_design.T @ x_design ) @ x_to_predict_col_vec )[0,0]

one_sided_CL_magnitude = stats.t.ppf(q = 1-(alpha/2), df = n-p) * s_value * np.sqrt(inner_sqrt_term)

print(f"95% Confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% Confidence interval at x = 6 is y_estimated = -21.5607 ± 0.9975


In [191]:
one_sided_PL_magnitude = stats.t.ppf(q = 1-alpha/2, df = n-p) * s_value * np.sqrt(1+inner_sqrt_term)

print(f"95% Prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% Prediction interval at x = 6 is y_estimated = -21.5607 ± 7.9832


##### Using `statsmodels`

In [257]:
sm_lm = OLS(y, x_design)
sm_result = sm_lm.fit()
print(sm_result.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 4.036e+04
Date:                Mon, 17 Mar 2025   Prob (F-statistic):          5.41e-130
Time:                        18:43:42   Log-Likelihood:                -276.16
No. Observations:                 100   AIC:                             556.3
Df Residuals:                      98   BIC:                             561.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.8958      0.408     26.680      0.0

In [258]:
prediction = sm_result.get_prediction(x_to_predict_col_vec.T)

y_estimated = prediction.predicted_mean[0]

one_sided_CL_magnitude = prediction.summary_frame(alpha=alpha).mean_ci_upper.iloc[0] - prediction.predicted_mean[0]

print(f"95% Confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% Confidence interval at x = 6 is y_estimated = -20.8787 ± 0.9584


In [259]:
one_sided_PL_magnitude = prediction.summary_frame(alpha=alpha).obs_ci_upper.iloc[0] - prediction.predicted_mean[0]

print(f"95% Prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% Prediction interval at x = 6 is y_estimated = -20.8787 ± 7.7356


##### Conclusion

Prefer `statsmodels`: it returns the confidence and prediction intervals directly, so less manual variable manipulation is needed.

Avoid `scikit-learn` for this purpose, since it does not provide standard errors for the fit coefficients.
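
If `scikit-learn` must be used anyway, the missing standard errors can be recovered from the design matrix. A minimal NumPy-only sketch, assuming homoscedastic normal errors and synthetic data (the seed, the variable names and the use of `np.linalg.lstsq` as a stand-in for `LinearRegression` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n, p = 100, 2

x = np.linspace(-30, 20, n)
y = -5.3 * x + 10.7 + rng.normal(0, 3.5, n)

# Design matrix with a leading column of ones (intercept, slope)
X = np.column_stack([np.ones(n), x])

# Ordinary least squares fit; stands in for any OLS solver
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased estimate of the residual variance
residuals = y - X @ beta
s_squared = residuals @ residuals / (n - p)

# Covariance of the coefficients: s^2 (X^T X)^-1
cov_beta = s_squared * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))

print(f"intercept {beta[0]:.4f} ± {std_err[0]:.4f}")
print(f"slope     {beta[1]:.4f} ± {std_err[1]:.4f}")
```

The same `cov_beta` matrix is what feeds the confidence and prediction intervals, so this is the only extra ingredient an OLS solver without uncertainty reporting lacks.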

### With uncertainties in $y$

Here, the uncertainty $\Delta y$ is assumed not to depend on the value of $y$ itself.

In [135]:
y_weights = 1/abs(np.random.normal(loc=y_sigma, scale=y_sigma/10, size=(n,)))**2

##### Using `sklearn`

In [186]:
sk_weighted_lm = linear_model.LinearRegression()
sk_weighted_result = sk_weighted_lm.fit(x, y, sample_weight=y_weights)

print(f"Estimated slope {sk_weighted_result.coef_[0,0]:.4f} \nEstimated intercept {sk_weighted_result.intercept_[0]:.4f} ")

Estimated slope -5.3260 
Estimated intercept 10.4618 


In [198]:
residuals = y_row - (x_row * sk_weighted_result.coef_[0,0] + sk_weighted_result.intercept_[0])
s_squared_value = sum(y_weights * residuals**2) / (n - p)

y_estimated = x_to_predict * sk_weighted_result.coef_[0,0] + sk_weighted_result.intercept_[0]

inner_sqrt_term = (x_to_predict_col_vec.T @ np.linalg.inv(x_design.T @ np.diag(y_weights) @ x_design ) @ x_to_predict_col_vec )[0,0]

one_sided_CL_magnitude = stats.t.ppf(q = 1-(alpha/2), df = n-p) * np.sqrt(s_squared_value * inner_sqrt_term)

print(f"95% Confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% Confidence interval at x = 6 is y_estimated = -21.4942 ± 1.0083


In [201]:
one_sided_PL_magnitude = stats.t.ppf(q = 1-alpha/2, df = n-p) * np.sqrt(s_squared_value + s_squared_value * inner_sqrt_term)

print(f"95% Prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% Prediction interval at x = 6 is y_estimated = -21.4942 ± 2.5394
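
Written out, the weighted intervals computed above take the following form, where $W = \mathrm{diag}(w_i)$ holds `y_weights`, $s_w^2$ is `s_squared_value`, and the remaining symbols are illustrative notation rather than the notebook's own.

$$
\begin{aligned}
\text{weighted residual variance:}\quad & s_w^2 = \frac{1}{n-p}\sum_{i=1}^{n} w_i\left(y_i-\hat{y}_i\right)^2, \\
\text{confidence interval:}\quad & \hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\,\sqrt{s_w^2\; x_0^{\mathsf T}\left(X^{\mathsf T}WX\right)^{-1}x_0}, \\
\text{prediction interval:}\quad & \hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\,\sqrt{s_w^2 + s_w^2\; x_0^{\mathsf T}\left(X^{\mathsf T}WX\right)^{-1}x_0}.
\end{aligned}
$$

The standalone $s_w^2$ term in the last line implicitly gives the new observation a weight of 1. Since the weights here are inverse variances, that describes an observation with roughly unit variance, which appears to be why this interval is so much narrower than the unweighted ones. For a new observation with weight $w_0$, that term would presumably become $s_w^2 / w_0$.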


##### Using `statsmodels`

In [187]:
sm_weighted_lm = WLS(y, x_design, y_weights )
sm_weighted_result = sm_weighted_lm.fit()
# print(sm_weighted_result.summary())

In [188]:
weighted_prediction = sm_weighted_result.get_prediction(x_to_predict_col_vec.T)

y_estimated = weighted_prediction.predicted_mean[0]

one_sided_CL_magnitude = weighted_prediction.summary_frame(alpha=alpha).mean_ci_upper.iloc[0] - weighted_prediction.predicted_mean[0]

print(f"95% Confidence interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_CL_magnitude:.4f}")

95% Confidence interval at x = 6 is y_estimated = -21.4942 ± 1.0083


In [189]:
one_sided_PL_magnitude = weighted_prediction.summary_frame(alpha=alpha).obs_ci_upper.iloc[0] - weighted_prediction.predicted_mean[0]

print(f"95% Prediction interval at x = {x_to_predict} is y_estimated = {y_estimated:.4f} ± {one_sided_PL_magnitude:.4f}")

95% Prediction interval at x = 6 is y_estimated = -21.4942 ± 2.5394


##### Conclusion

Again, the `statsmodels` approach is neater and preferred.

`sklearn` can also be used, but the intervals then have to be computed by hand from the residuals and the weighted design matrix.

Avoid `scipy.odr` for this case: it does not compute the exact weighted least squares solution for the fit parameters, and instead relies on an iterative Levenberg-Marquardt-type algorithm.
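
For comparison, the exact weighted least squares estimate referred to above has a closed form, $\hat\beta = (X^{\mathsf T} W X)^{-1} X^{\mathsf T} W y$. A minimal NumPy sketch on synthetic data follows; the seed and variable names are illustrative assumptions, and this is not a claim about how `statsmodels` or `sklearn` implement the fit internally.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 100

x = np.linspace(-30, 20, n)
# Per-point standard deviations of y, scattered around 3.5
sigma_y = np.abs(rng.normal(3.5, 0.35, n))
y = -5.3 * x + 10.7 + rng.normal(0, sigma_y)

# Weights are inverse variances, as for y_weights in the notebook
w = 1 / sigma_y**2

X = np.column_stack([np.ones(n), x])

# Solve the weighted normal equations (X^T W X) beta = X^T W y
XtW = X.T * w  # same as X.T @ np.diag(w), without building the n x n matrix
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

print(f"intercept {beta_wls[0]:.4f}, slope {beta_wls[1]:.4f}")
```

Because the solution is obtained in a single linear solve, no starting guess or convergence criterion is involved, unlike the iterative approach.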

### With uncertainties in $x$ and $y$

In [202]:
x_weights = 1/abs(np.random.normal(loc=x_sigma, scale=x_sigma/10, size=(n,)))**2

##### Using `scipy.odr`

In [206]:
def univariate_linear_function(params, x):
    """
    Univariate linear calculation
    Input: params (list) contain the linear coefficients
    Input: x (row vector) contains the independent variables
    Output: the linear calculation
    """

    return params[0] + params[1] * x

In [207]:
odr_model = odr.Model(univariate_linear_function)

odr_data = odr.RealData(x=x.flatten(), y=y.flatten(), sx=1/np.sqrt(x_weights), sy=1/np.sqrt(y_weights))

odr_instance = odr.ODR(odr_data, odr_model, beta0=[1,30])

odr_output = odr_instance.run()

odr_output.pprint()

Beta: [10.40553543 -5.33367186]
Beta Std Error: [0.42445092 0.02755116]
Beta Covariance: [[0.21446413 0.0046572 ]
 [0.0046572  0.00090361]]
Residual Variance: 0.8400406388050697
Inverse Condition #: 0.0020416657392539838
Reason(s) for Halting:
  Sum of squares convergence
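
The printed output is internally consistent in one useful way: the `Beta Std Error` values equal the square root of the diagonal of `Beta Covariance` scaled by `Residual Variance`. A short check using the numbers printed above is sketched below. The propagation of that covariance to a standard error at `x_to_predict = 6` is a first-order approximation added here as an assumption, not a result from the notebook.

```python
import numpy as np

# Values copied from the odr_output.pprint() output above
beta = np.array([10.40553543, -5.33367186])  # [intercept, slope]
cov_beta = np.array([[0.21446413, 0.0046572],
                     [0.0046572, 0.00090361]])
res_var = 0.8400406388050697

# Standard errors: sqrt of the covariance diagonal scaled by the residual variance
sd_beta = np.sqrt(np.diag(cov_beta) * res_var)
print(sd_beta)

# First-order (linearised) standard error of the fitted line at x0 = [1, x_to_predict]
x0 = np.array([1.0, 6.0])
y0 = x0 @ beta
y0_std_err = np.sqrt(res_var * (x0 @ cov_beta @ x0))
print(f"y({x0[1]:.0f}) = {y0:.4f}, standard error {y0_std_err:.4f}")
```

The ordering `[intercept, slope]` follows `univariate_linear_function`, where `params[0]` is the intercept and `params[1]` the slope.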


##### Conclusion

# Multivariate models