Create linear_regression_proposal.md #210

Open
Bowen1Zhu wants to merge 9 commits into LinearRegression

Conversation

Bowen1Zhu

Reference to issue

Description of the changes proposed in the pull request

  • Create linear_regression_proposal.md

Reviewers requested:

  • @shreyagupta98


atuljayaram self-assigned this on Jul 17, 2020
@atuljayaram (Contributor)

Assigned to @kylebegovich

\section*{Checking Multicollinearity}
Before the analysis, it is important to verify several regression conditions to ensure that our analysis is valid. Most of the conditions in linear regression can be checked easily with the residuals; the only snag is that we get the residuals only \textit{after} we perform the regression. Still, "no multicollinearity" is one condition that we can check beforehand, without using the residuals.

Since a multiple regression involves more than one predictor variable, multicollinearity occurs when some of the predictors are highly correlated with each other, meaning that they account for similar variance in the target variable. Therefore, although the presence of multicollinearity does not affect the predictive power of our model, it makes it harder for us to assess the individual influence of each predictor on the target variable. We can detect multicollinearity using either a correlation matrix or variance inflation factors (VIFs), as sketched below.
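As a rough illustration (not taken from the original analysis), the two checks can be written in a few lines; the sketch below assumes pandas and statsmodels are available and that the predictors live in a DataFrame named X, a hypothetical placeholder.

\begin{verbatim}
# Minimal sketch: detecting multicollinearity with a correlation matrix and VIFs.
# Assumes the predictors are the numeric columns of a pandas DataFrame named X.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def correlation_matrix(X: pd.DataFrame) -> pd.DataFrame:
    # Pairwise Pearson correlations; values close to +1 or -1 are suspect.
    return X.corr()

def vif_table(X: pd.DataFrame) -> pd.Series:
    # VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing predictor i
    # on all the other predictors; a common rule of thumb flags VIF > 5 or 10.
    Xc = add_constant(X)
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs, name="VIF")
\end{verbatim}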


I think the first sentence "Since a multiple regression involves more than..." could be moved up to the beginning of the "Checking Multicollinearity" section. It would be great to first define the term you have used in the section's title, especially since it is described in great detail.

\begin{figure}[H]\centering\includegraphics[width=0.5\linewidth]{13}\end{figure}
But wait, you may ask, why did I choose to delete X3 rather than X6? Indeed, my choice was somewhat arbitrary here. To remove redundant variables more carefully, we can use stepwise regression to keep only the predictors that lead to the best performance; I will describe backwards stepwise regression later.

Another way to deal with multicollinearity without dropping predictors before the analysis is to perform regression with regularization techniques (such as Lasso and Ridge), which I will also describe at the end of the article. Regularization can handle multicollinearity on its own, so if you don't want to delete any variables, you may skip the OLS model below and jump directly to the Lasso/Ridge/Elastic Net regression presented at the end.
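For readers who skip ahead, a minimal sketch of what those regularized fits could look like with scikit-learn is given below; X and y are hypothetical placeholders for the predictors and the target, and the penalty strengths are arbitrary.

\begin{verbatim}
# Minimal sketch: Ridge, Lasso, and Elastic Net regression with scikit-learn.
# X (predictors) and y (target) are assumed to be prepared beforehand.
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize first, because the penalty acts on the size of the coefficients.
models = {
    "ridge":       make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso":       make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "elastic_net": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)),
}

# fitted = {name: model.fit(X, y) for name, model in models.items()}
\end{verbatim}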


Optional, but the phrase "which I will also describe at the end of the article" could be omitted for concision, given that "Lasso" and "Ridge" are already the headings of two sections.

R-squared, also known as the coefficient of determination, measures the proportion of the variance of the target variable that can be explained by the predictors in our model. It is calculated as $1 - \frac{\text{Sum of Squares Error (SSE)}}{\text{Sum of Squares Total (SST)}}$ and ranges from 0 to 100\%. The higher the $R^2$, the better the model fits. An $R^2$ of 100\%, for example, means the model explains all of the variation in the target variable, whereas a value of 0\% indicates no predictive capability.
Our model has $R^2 = 0.542$, so approximately 54\% of the variation in the house price can be accounted for by our OLS model.

Note that there is an adjusted $R^2 = 0.537$ in the second row, slightly smaller than $R^2$. This is because $R^2$ only works as intended in a simple linear regression model with a single predictor. In a multiple regression, adding a new predictor can only increase $R^2$, never decrease it, so a model may seem to fit better simply because it has more predictors. The adjusted $R^2$, calculated as $1 - \frac{\text{Mean Squares Error (MSE)}}{\text{Mean Squares Total (MST)}}$, or equivalently $1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$ if we already know $R^2$, takes into account the number of predictors $k$ and the number of data points $n$. It increases only if a new predictor improves the model more than would be expected by chance, and decreases otherwise. As a result, adjusted $R^2$ is always less than or equal to $R^2$.
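As a small worked illustration of the formulas above (not taken from the original analysis), both quantities can be computed directly from the fitted values; $n$ is the number of observations and $k$ the number of predictors.

\begin{verbatim}
# Minimal sketch: R^2 and adjusted R^2 from the formulas in the text.
import numpy as np

def r2_and_adjusted_r2(y_true, y_pred, k):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.shape[0]
    sse = np.sum((y_true - y_pred) ** 2)           # Sum of Squares Error
    sst = np.sum((y_true - y_true.mean()) ** 2)    # Sum of Squares Total
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2
\end{verbatim}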


This paragraph has very detailed explanations, but if possible, see which sentences could be shortened; this will make the paragraph more concise!


The two columns in the middle are the results of t-tests checking the relevance of each coefficient. The null hypothesis of each t-test is that the corresponding coefficient equals zero, meaning there is no linear relationship between that predictor and the target variable. The t-statistic is the value in the first column (the coefficient estimate) divided by the value in the second column (its standard error); the p-value corresponding to each t-statistic is presented in the next column. You can see that the p-values of all our coefficients except X1 are smaller than .01, so we can safely reject their null hypotheses and say that these predictors are indeed relevant to predicting our target variable. X1, however, may be a redundant variable at the .01 significance level.
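Assuming the table above came from statsmodels (my assumption, not stated in the excerpt), the coefficient, standard error, t, and p-value columns discussed here come from a fit like the following; X and y are again placeholders.

\begin{verbatim}
# Minimal sketch: fitting OLS with statsmodels and reading the coefficient table.
# The summary's "coef", "std err", "t", and "P>|t|" columns are the ones
# discussed in the text (t = coef / std err).
import statsmodels.api as sm

def fit_ols(y, X):
    # Add the intercept term, fit ordinary least squares, and return the results.
    return sm.OLS(y, sm.add_constant(X)).fit()

# results = fit_ols(y, X)
# print(results.summary())   # full regression table
# print(results.pvalues)     # p-value per coefficient, e.g. to check X1
\end{verbatim}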

\section*{Stepwise Regression}
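The code for this section is not reproduced in the excerpt, so the following is only a rough sketch of one common variant, backward elimination based on p-values, reusing the statsmodels setup assumed above; the 0.01 threshold mirrors the significance level used earlier.

\begin{verbatim}
# Minimal sketch: backwards stepwise regression as p-value-based backward
# elimination. Repeatedly drop the least significant predictor until every
# remaining p-value is below the threshold. X is a pandas DataFrame, y the target.
import pandas as pd
import statsmodels.api as sm

def backward_elimination(y, X: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    cols = list(X.columns)
    while cols:
        results = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvalues = results.pvalues.drop("const")   # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= threshold:
            break                                 # all remaining predictors are significant
        cols.remove(worst)                        # drop the weakest predictor and refit
    return X[cols]
\end{verbatim}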


I think that this section could definitely be shortened. For example, a sentence like "An easier predictor selection procedure, as I mentioned previously, is backwards stepwise regression" could be phrased as "Backwards stepwise regression is an easier procedure and reliable predictor".




That’s it for a complete procedure to perform linear regression analysis in Python! We have checked for multicollinearity, fit and interpreted an OLS model, and explored stepwise regression and regularized alternatives (Lasso, Ridge, and Elastic Net) for handling redundant predictors.


This is a solid draft; I think making some of the longer paragraphs shorter would be great. Very organized and put together!
