# Multicollinearity

## Definition:

Define a multivariate linear regression model such that

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i$ for $i = 1, 2, \ldots, n$, with $\varepsilon_i \sim N(0, \sigma^2)$ for every $i$.

If $|Corr(X_j, X_k)|$ is close to $1$ for some pair of predictors $j \neq k$, we say that those predictor variables are **linearly correlated** and that the regression model exhibits **multicollinearity**.

## Description:

Multicollinearity occurs in regression analysis when several predictor variables within a multiple regression model exhibit **strong correlations** with each other. This means that the value of one predictor can be closely approximated by a linear combination of the other predictor variables. As a result, the coefficient estimates become unstable: small changes in the data can produce large changes in the estimates. This instability hinders the accurate determination of the individual relationship between each predictor and the dependent variable, and it reduces the statistical significance of each predictor variable.

## Demonstration and Diagram:

![image.png](attachment:image.png)

As we can see in the correlation matrix and the plot, all variables apart from **TLT** show relatively high correlations $(\rho > 0.6)$ with each other, so this group of exogenous variables may have a **multicollinearity** issue. We can therefore expect to reduce the dimensionality and complexity of our data by applying **dimensionality reduction methods** to our dataframe, making each remaining variable relatively more useful in explaining variation.
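A correlation screen of this kind can be sketched as follows. This is a minimal illustration on synthetic data, not the dataset shown in the figure: the column names `x1` to `x4`, the helper `flag_high_pairs`, and the data-generating process are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors (hypothetical, NOT the notebook's data):
# x1..x3 share a common factor, x4 is independent noise.
common = rng.normal(size=n)
X = np.column_stack([
    common + 0.5 * rng.normal(size=n),
    common + 0.5 * rng.normal(size=n),
    common + 0.5 * rng.normal(size=n),
    rng.normal(size=n),
])
names = ["x1", "x2", "x3", "x4"]

# Pearson correlation matrix of the predictors (columns are variables).
corr = np.corrcoef(X, rowvar=False)


def flag_high_pairs(corr, names, threshold=0.6):
    """Return (name_a, name_b, rho) for each pair with |rho| above threshold."""
    pairs = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            if abs(corr[a, b]) > threshold:
                pairs.append((names[a], names[b], float(corr[a, b])))
    return pairs


high_pairs = flag_high_pairs(corr, names)
for a, b, rho in high_pairs:
    print(f"{a} ~ {b}: rho = {rho:.2f}")
```

In this synthetic setup the independent column `x4` plays the role that **TLT** plays in the plot: it is the one variable that does not get flagged.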

## Diagnosis

We can detect **multicollinearity** initially by creating a **correlation matrix** of the exogenous (predictor) variables. If at least one pair shows high Pearson correlation ($|\rho| > 0.6$), we can suspect that our set of predictor variables has a **multicollinearity** problem. However, to assess formally whether the dataset has multicollinearity, we use a metric called the **Variance Inflation Factor (VIF)**, <u> which is calculated for each of the predictor variables </u> :

$VIF_j = \Large\frac{1}{1 - R_j^2}$

where $R_j^2$ is the coefficient of determination obtained by fitting an OLS model with the $j$-th exogenous variable as the response and all the other exogenous variables as the predictors. A **VIF** of $1$ means the variable is uncorrelated with the others, while an independent variable with a **VIF** greater than $5$ indicates <u> severe </u> multicollinearity.
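The VIF computation can be sketched directly with NumPy, using the standard definition $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. The function name `vif` and the synthetic data below are assumptions for illustration; this is not the code that produced the notebook's results.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of predictor j on the others, with an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic example (hypothetical data): x2 is nearly a copy of x1,
# x3 is unrelated to both.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

The two near-duplicate columns should come out well above the threshold of $5$, while the unrelated column stays close to $1$.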

Although our results indicate that there is only mild multicollinearity $(1 < VIF < 5)$, we can still work towards reducing the complexity of our data and making our predictor variables more useful.

## Damage

**Multicollinearity** in a regression model can have severe consequences if it is not taken care of:

* High correlation among independent variables in a regression model leads to increased standard errors for the coefficient estimates $\hat{\beta}_i$, resulting in less precise coefficient estimates with wider confidence intervals.

* Multicollinearity makes it hard to isolate the distinct impact of each variable: individual coefficients may appear statistically insignificant even when the model as a whole has a high $R^2$.

As a result, **multicollinearity** decreases our confidence in the regression model and our estimates, and may lead to scrapping the model completely if it is not handled properly.
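The effect on standard errors can be illustrated with a small simulation. The sketch below fits the same two-predictor model twice, once with uncorrelated predictors and once with predictors correlated at roughly $0.95$, and compares the classical OLS standard errors. The helper names (`ols_std_errors`, `make_design`), the coefficients, and the data are assumptions chosen for the example.

```python
import numpy as np


def ols_std_errors(X, y):
    """Return OLS coefficients and classical standard errors (intercept first)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = (resid @ resid) / (n - A.shape[1])   # unbiased error variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # Var(beta_hat)
    return coef, np.sqrt(np.diag(cov))


rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
eps = rng.normal(size=n)


def make_design(rho):
    # x2 has (approximately) unit variance and correlation rho with x1.
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * noise
    return np.column_stack([x1, x2])


results = {}
for rho in (0.0, 0.95):
    X = make_design(rho)
    y = 1.0 + 2.0 * X[:, 0] + 2.0 * X[:, 1] + eps
    _, se = ols_std_errors(X, y)
    results[rho] = se
    print(f"rho = {rho:.2f}: se(beta_1) = {se[1]:.3f}, se(beta_2) = {se[2]:.3f}")
```

The error variance and sample size are the same in both fits, so any increase in the standard errors at `rho = 0.95` comes from the correlation between the predictors alone.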

## Directions

Several methods can address **multicollinearity**, such as dropping exogenous variables with high correlations or fitting regressions to the exogenous variables themselves, but a common approach is to use **Principal Component Analysis (PCA)**. Principal Component Analysis creates new variables that are <u> linear combinations of the original predictor variables </u> and that are mutually uncorrelated. The components are ordered by how much of the **variation** in the data they explain, which lets us keep the principal components that provide the most information and thereby reduce the complexity of our data.
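A minimal PCA sketch via the singular value decomposition is shown below. It assumes the predictors are standardized first, which may differ from how the notebook's own components were obtained; the function name `pca` and the synthetic data are illustrative assumptions.

```python
import numpy as np


def pca(X):
    """PCA of standardized X. Returns (scores, loadings, explained_variance_ratio)."""
    X = np.asarray(X, dtype=float)
    # Standardize each column to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                      # component scores, equal to Z @ Vt.T
    loadings = Vt.T                     # column k holds the weights of component k
    explained = s**2 / np.sum(s**2)     # share of total variance per component
    return scores, loadings, explained


# Synthetic predictors (hypothetical): two correlated columns and one independent.
rng = np.random.default_rng(2)
n = 300
common = rng.normal(size=n)
X = np.column_stack([
    common + 0.5 * rng.normal(size=n),
    common + 0.5 * rng.normal(size=n),
    rng.normal(size=n),
])

scores, loadings, explained = pca(X)
print("explained variance ratio:", np.round(explained, 3))
```

Each column of `loadings` gives the weights of one linear combination of the standardized predictors, and the columns of `scores` are the new, mutually uncorrelated variables.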

We can see that the first 3 principal components explain about $85.65\%$ of the variance. We can therefore fit an OLS model using **pc1**, **pc2** and **pc3**, while dropping the rest of the **pc**s.
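Choosing how many components to keep from the cumulative explained variance can be sketched as follows. The ratios in `ratios` are made-up numbers for illustration only, not the values from this analysis, and the helper `n_components_for` and its `target` default of $0.85$ are assumptions.

```python
import numpy as np


def n_components_for(explained_ratio, target=0.85):
    """Smallest number of leading components whose cumulative share reaches target."""
    cumulative = np.cumsum(explained_ratio)
    # Index of the first cumulative value >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target, side="left")) + 1
    # Guard against a target that floating-point rounding puts just out of reach.
    return min(k, len(cumulative))


# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios = np.array([0.50, 0.25, 0.12, 0.08, 0.05])
k = n_components_for(ratios)
print("components to keep:", k)
```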

As we can see from the results table, the $R^2$ value practically stayed the same at $R^2 \approx 0.3$, while the **VIF** values of the new principal component variables are all equal to $1$ and the pairwise correlations between the principal components are approximately $0$. We can therefore conclude that principal component analysis removed the **multicollinearity** among our predictors with little loss of explanatory power, resulting in a better-conditioned OLS regression model.
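This kind of check can be reproduced on synthetic data. The sketch below relies on the fact that the VIFs equal the diagonal of the inverse of the predictors' correlation matrix, and compares the original predictors with their first three principal components. The data and variable names are assumptions and do not reproduce the results table above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Synthetic predictors (hypothetical): four columns driven by one common factor.
common = rng.normal(size=n)
X = np.column_stack([common + 0.4 * rng.normal(size=n) for _ in range(4)])

# Principal component scores of the standardized predictors.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pcs = (U * s)[:, :3]                      # keep the first three components

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
orig_vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
pc_corr = np.corrcoef(pcs, rowvar=False)
pc_vif = np.diag(np.linalg.inv(pc_corr))

print("VIF, original predictors: ", np.round(orig_vif, 2))
print("VIF, principal components:", np.round(pc_vif, 2))
```

Because the component scores are orthogonal by construction, their correlation matrix is the identity up to rounding, which is why every component has a VIF of $1$.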