LinearRegression.jl implements linear regression using the least-squares algorithm (relying on the sweep operator). This package is in the alpha stage. Hence it is likely that some bugs exist. Furthermore, the API might change in future versions.
The usage aims to be straightforward, a call to regress
to build a linear regression model, and a call to predict_in_sample
to predict data using the built linear regression model.
When predicting on data not present during the regression, use the predict_out_of_sample
function as this does not require a response value (consequently, statistics that need a response, as the residuals, are not available.)
The regress call will compute some statistics about the fitted model in addition to the coefficients. The statistics computed depend on the value of the req_stats
argument.
The prediction functions compute predicted values together with some statistics. Like for the regress calls, the statistics computed depend on the value of the req_stats
argument.
When some analytical positive weights are used, a weighted regression is performed.
Fitting the model generates some statistics dependent on the req_stats
argument of the regress
function.
n
,p
,"coefs"
and"see"
are always computed"mse"
,"sst"
,"rmse"
,"aic"
,"sigma"
,"t_statistic"
,"vif"
,"r2"
,"adjr2"
,"stderror"
,"t_values"
,"p_values"
,"ci"
, are computed upon request.- some diagnostics can be requested as well. Here is the full list as Symbols
[:diag_normality, :diag_ks, :diag_ad, :diag_jb, :diag_heteroskedasticity, :diag_white, :diag_bp ]
,"diag_normality"
is a shortcut for[:diag_ks, :diag_ad, :diag_jb]
and:diag_heteroskedasticity
is a shortcut for[:diag_white, :diag_bp]
.
- some diagnostics can be requested as well. Here is the full list as Symbols
- "default", includes the mandatory stats, and some of the optional statistics here as Symbols:
[:coefs, :sse, :mse, :sst, :rmse, :sigma, :t_statistic, :r2, :adjr2, :stderror, :t_values, :p_values, :ci]
"all"
includes all availble statistics"none"
include only the mandatory statistics
The meaning for these statistics is given below.
The number of observations n
used to fit the model.
The number of independent variables p
used in the model.
The Total Sum of Squares (SST) is calculated but not presented to the user. In case of model with intercept the SST is computed with the following:
And when there is no intercept with the following:
The Error Sum of Squares (or SSE) also known as Residual Sum of Square (RSS). This package uses the sweep operator (Goodnight, J. (1979). "A Tutorial on the SWEEP Operator." The American Statistician.) to compute the SSE.
The Mean Squared Error (MSE) is calculated as
The Root Mean Squared Error (RMSE) is calculated as
The MSE is the estimator of σ̂² unless at least one robust covariance estimator is requested.
The R² (R2 or R-squared) see (https://en.wikipedia.org/wiki/Coefficient_of_determination) is calculated with the following formula:
The Adjusted R² (ADJR2) is computed with the following formulas:
when it is a model with an intercept:
And when there is no intercept:
The Akaike information criterion is calculated with the Linear Regression specific formula:
The t_statistic is computed by using the inverse cumulative t_distribution (with quantile()
) with parameter (n - p
) at 1 - \frac{α}{2}
.
The standard errors of the coefficients are calculated by multiplying the Sigma (estimated by the MSE) with the pseudo inverse matrix (resulting from the sweep operator), out of which the square root of the diagonal elements are extracted.
The t-values are calculated as the coefficients divided by their standard deviation.
The upper bound of the confidence interval for each coefficient is calculated as the coeffiecent + coefficient's standard error * t_statistic.
The lower bound of the confidence interval for each coefficient is calculated as the coeffiecent - coefficient's standard error * t_statistic.
The p-values are computed using the F Distribution, the degree of freedom for each coefficent.
Variance inflation factor (VIF) is calculated by taking the diagonal elements of the inverse of the correlation matrix formed by the independent variables.
The predicted residual error sum of squares is calculated by taking the sum of squares from the PRESS
(see below the statistics related to predictions) of each observations.
Robust Covariance estimator can be requested through the cov
argument of the regress
function.
The options are (as Symbols):
:white
: Heteroscedasticity:hc0
: Heteroscedasticity:hc1
: Heteroscedasticity):hc2
: Heteroscedasticity):hc3
: Heteroscedasticity):nw
: HAC (Heteroskedasticity and Autocorrelation Consistent estimator)
The user can select estimators from above list. If the user select :white
as an estimator then HC3 will be selected for a small size (n <= 250) otherwise HC0 will be selected. (see "Using Heteroscedasticity Consitent Standard Errors in the Linear Regression Model" J. Scott Long and Laurie H. Ervin (1998-2000)).
If another estimator is requested it is provided. A list of estimator can be requested as in for instance cov=[:hc2, hc3]
.
Comprehensive descriptions of the estimators and their applications shoudl in found in a text book, here only a brief description of the implementation is provided.
Having InvMat the pseudo inverse resulting from the sweep operator. And having xe
being the matrix of the independent variables times the residuals. Then HC0 is calculated as:
Having n being the number of observations and p the number of variables. Then HC1 is calculated as:
The leverage or hat matrix is calculated as:
xe
is scaled by \frac{1}{1 - H}
then
xe
is scaled by \frac{1}{{\left( 1 - H \right)^2}}
then
Newey-West estimator calculation is not documented yet. See reference implementation current implementation for details.
Predicting values using independent variables and a model will generate predicted values and some additional statistics dependent on the value of the req_stats
argument of the predict*
functions.
Here is a list of the available statistics:
[:predicted, :residuals, :leverage, :stdp, :stdi, :stdr, :student, :rstudent, :lcli, :ucli, :lclp, :uclp, :press, :cooksd]
The predicted value is the sum of the dependant variable(s) multiplied by the coefficients from the regression and the intercept (if the model has one). The predicted value is also known as the Y-hat.
The residuals are here defined as the known responses variables minus the predicted values.
The leverage for the i-th independent observation x_i when it is not a weighted regression is calculated as:
And as per below when it is a weighted regression with a vector of weights W
with the i-th weight being w_i
then the i-th leverage is calculated as such:
STDP is the standard error of the mean predicted value, and is calculated as
and for a weighted regression as:
STDI is the standard error of the individual predicted value, and is calculated as
and for a weighted regression as:
STDR is the standard error of the residual, and is calculated as
and for a weighted regression as:
Student represents the standardized residuals, and is calculated by using the residuals over the standard error of the residuals.
RStudent is the studentized residuals calculated as
LCLI is the lower bound of the prediction interval and is calculated as:
UCLI is the upper bound of the prediction interval and is calculated as:
LCLP is the lower bound of the predicted mean confidence interval and is calculated as:
UCLP is the upper bound of the predicted mean confidence interval and is calculated as:
COOKSD is the Cook's Distance for each predicted value, and is calculated as
PRESS is the predicted residual error sum of squares and is calculated as
Type 1 Sum of squares, are calculated as a by-product of the sweep operator.
Type 2 Sum of squares, are calculated using the pseudo-inverse matrix. The Type 2 SS of the ith independent variable is the square of the coefficient of the independent variable divided by the ith element of the diagonal from the pseudo-inverse matrix.
pcorr1
and pcorr2
are the squared partial correlation coefficient calculated as:
When there is an intercept in the model the pcorr1
and pcorr2
are considered missing
for the intercept.
scorr1
and scorr2
are the squared semi-partial correlation coefficient calculated as:
When there is an intercept in the model the scorr1
and scorr2
are considered missing
for the intercept.
For all options and parameters they can be passed as a Vector{String}
or a Vector{Symbol}
or alternatively if only options is needed as a single String
or Symbol
. For instance "all"
, :all
or ["R2", "VIF"]
or [:r2, :vif]
.
regress(f::StatsModels.FormulaTerm, df::AbstractDataFrame, req_plots; α::Float64=0.05, req_stats=["default"], weights::Union{Nothing,String}=nothing, remove_missing=false, cov=[:none], contrasts=nothing, plot_args=Dict("plot_width" => 400, "loess_bw" => 0.6, "residuals_with_density" => false))
regress(f::StatsModels.FormulaTerm, df::DataFrames.AbstractDataFrame; α::Float64=0.05, req_stats=["default"], weights::Union{Nothing,String}=nothing, remove_missing=false, cov=[:none], contrasts=nothing)
predict_out_of_sample(lr::linRegRes, df::AbstractDataFrame; α=0.05, req_stats=["none"], dropmissingvalues=true)
predict_in_sample(lr::linRegRes, df::AbstractDataFrame; α=0.05, req_stats=["none"], dropmissingvalues=true)
kfold(f, df, k, r = 1, shuffle=true; kwargs...)