Create linear_regression_proposal.md #210

Open
Bowen1Zhu wants to merge 9 commits into LinearRegression

Conversation

Bowen1Zhu

Reference to issue

Description of the changes proposed in the pull request

  • Create linear_regression_proposal.md

Reviewers requested:

  • @shreyagupta98


atuljayaram self-assigned this on Jul 17, 2020
@atuljayaram (Contributor)

Assigned to @kylebegovich

\section*{Checking Multicollinearity}
Before the analysis, it is important to verify several regression conditions to ensure that our analysis is valid. Most of the conditions in linear regression can be checked easily with the residuals; the only snag is that we get the residuals only \textit{after} we perform the regression. Still, "no multicollinearity" is one condition that we can check beforehand, without using the residuals.

Since a multiple regression involves more than one predictor variable, multicollinearity occurs when some of the predictors are highly correlated with each other, meaning that they account for similar variance in the target variable. Therefore, although the presence of multicollinearity does not affect the predictive power of our model, it makes it harder for us to assess the individual influence of each predictor on the target variable. We can detect multicollinearity using either a correlation matrix or variance inflation factors (VIFs), as sketched below.
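As a rough illustration (not taken from the original analysis), the two checks can be written in a few lines; the sketch below assumes pandas and statsmodels are available and that the predictors live in a DataFrame named X, a hypothetical placeholder.

\begin{verbatim}
# Minimal sketch: detecting multicollinearity with a correlation matrix and VIFs.
# Assumes the predictors are the numeric columns of a pandas DataFrame named X.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def correlation_matrix(X: pd.DataFrame) -> pd.DataFrame:
    # Pairwise Pearson correlations; values close to +1 or -1 are suspect.
    return X.corr()

def vif_table(X: pd.DataFrame) -> pd.Series:
    # VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing predictor i
    # on all the other predictors; a common rule of thumb flags VIF > 5 or 10.
    Xc = add_constant(X)
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs, name="VIF")
\end{verbatim}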


I think the first sentence "Since a multiple regression involves more than..." could be moved up to the beginning of the "Checking Multicollinearity" section. It would be great to first define the term you have used in the section's title, especially since it is described in great detail.

\begin{figure}[H]\centering\includegraphics[width=0.5\linewidth]{13}\end{figure}
But wait, you may ask, why did I choose to delete X3 rather than X6? Indeed, my choice was somewhat arbitrary here. To remove redundant variables more carefully, we can use stepwise regression to keep only the predictors that lead to the best performance; I will describe backwards stepwise regression later.

Another way to deal with multicollinearity without dropping predictors before the analysis is to perform regression with regularization techniques (such as Lasso and Ridge), which I will also describe at the end of the article. Regularization can handle multicollinearity on its own, so if you don't want to delete any variables, you may skip the OLS model below and jump directly to the Lasso/Ridge/Elastic Net regression presented at the end.
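For readers who skip ahead, a minimal sketch of what those regularized fits could look like with scikit-learn is given below; X and y are hypothetical placeholders for the predictors and the target, and the penalty strengths are arbitrary.

\begin{verbatim}
# Minimal sketch: Ridge, Lasso, and Elastic Net regression with scikit-learn.
# X (predictors) and y (target) are assumed to be prepared beforehand.
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize first, because the penalty acts on the size of the coefficients.
models = {
    "ridge":       make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso":       make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "elastic_net": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)),
}

# fitted = {name: model.fit(X, y) for name, model in models.items()}
\end{verbatim}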


Optional, but the phrase "which I will also describe at the end of the article" could be omitted for concision, given that "Lasso" and "Ridge" are already the headings of two sections.

R-squared, also known as the coefficient of determination, measures the proportion of the variance of the target variable that can be explained by the predictors in our model. It is calculated as $1 - \frac{\text{Sum of Squares Error (SSE)}}{\text{Sum of Squares Total (SST)}}$ and ranges from 0 to 100\%. The higher the $R^2$, the better the model fits. An $R^2$ of 100\%, for example, means the model explains all of the variation in the target variable, whereas a value of 0\% indicates no predictive capability.
Our model has $R^2 = 0.542$, so approximately 54\% of the variation in the house price can be accounted for by our OLS model.

Note that there is an adjusted $R^2 = 0.537$ in the second row, slightly smaller than $R^2$. This is because $R^2$ only works as intended in a simple linear regression model with a single predictor. In a multiple regression, adding a new predictor can only increase $R^2$, never decrease it, so a model may seem to fit better simply because it has more predictors. The adjusted $R^2$, calculated as $1 - \frac{\text{Mean Squares Error (MSE)}}{\text{Mean Squares Total (MST)}}$, or equivalently $1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$ if we already know $R^2$, takes into account the number of predictors $k$ and the number of data points $n$. It increases only if a new predictor improves the model more than would be expected by chance, and decreases otherwise. As a result, adjusted $R^2$ is always less than or equal to $R^2$.
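As a small worked illustration of the formulas above (not taken from the original analysis), both quantities can be computed directly from the fitted values; $n$ is the number of observations and $k$ the number of predictors.

\begin{verbatim}
# Minimal sketch: R^2 and adjusted R^2 from the formulas in the text.
import numpy as np

def r2_and_adjusted_r2(y_true, y_pred, k):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.shape[0]
    sse = np.sum((y_true - y_pred) ** 2)           # Sum of Squares Error
    sst = np.sum((y_true - y_true.mean()) ** 2)    # Sum of Squares Total
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2
\end{verbatim}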


This paragraph has very detailed explanations, but if possible, see which sentences could be shortened; this will make the paragraph more concise!


The two columns in the middle are the results of t-tests checking the relevance of each coefficient. The null hypothesis of each t-test is that the corresponding coefficient equals zero, meaning there is no linear relationship between that predictor and the target variable. The t-statistic is the value in the first column (the coefficient estimate) divided by the value in the second column (its standard error); the p-value corresponding to each t-statistic is presented in the next column. You can see that the p-values of all our coefficients except X1 are smaller than .01, so we can safely reject their null hypotheses and say that these predictors are indeed relevant to predicting our target variable. X1, however, may be a redundant variable at the .01 significance level.
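Assuming the table above came from statsmodels (my assumption, not stated in the excerpt), the coefficient, standard error, t, and p-value columns discussed here come from a fit like the following; X and y are again placeholders.

\begin{verbatim}
# Minimal sketch: fitting OLS with statsmodels and reading the coefficient table.
# The summary's "coef", "std err", "t", and "P>|t|" columns are the ones
# discussed in the text (t = coef / std err).
import statsmodels.api as sm

def fit_ols(y, X):
    # Add the intercept term, fit ordinary least squares, and return the results.
    return sm.OLS(y, sm.add_constant(X)).fit()

# results = fit_ols(y, X)
# print(results.summary())   # full regression table
# print(results.pvalues)     # p-value per coefficient, e.g. to check X1
\end{verbatim}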

\section*{Stepwise Regression}
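The code for this section is not reproduced in the excerpt, so the following is only a rough sketch of one common variant, backward elimination based on p-values, reusing the statsmodels setup assumed above; the 0.01 threshold mirrors the significance level used earlier.

\begin{verbatim}
# Minimal sketch: backwards stepwise regression as p-value-based backward
# elimination. Repeatedly drop the least significant predictor until every
# remaining p-value is below the threshold. X is a pandas DataFrame, y the target.
import pandas as pd
import statsmodels.api as sm

def backward_elimination(y, X: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    cols = list(X.columns)
    while cols:
        results = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvalues = results.pvalues.drop("const")   # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= threshold:
            break                                 # all remaining predictors are significant
        cols.remove(worst)                        # drop the weakest predictor and refit
    return X[cols]
\end{verbatim}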


I think that this section could definitely be shortened. For example, a sentence like "An easier predictor selection procedure, as I mentioned previously, is backwards stepwise regression" could be phrased as "Backwards stepwise regression is an easier procedure and reliable predictor".




That’s it for a complete procedure to perform linear regression analysis in Python! We have checked for multicollinearity, fit and interpreted an OLS model, and explored stepwise regression and regularized alternatives (Lasso, Ridge, and Elastic Net) for handling redundant predictors.


This is a solid draft; I think making some of the longer paragraphs shorter would be great. Very organized and put together!
