
Regressions

Outline


Regression

  • The most common way to visualize simple linear regression is with a scatter plot.

Correlation Coefficient

  • Correlation coefficients measure the strength and direction of a linear relationship.
    • strength: how closely the points cluster around a line
    • direction: positive or negative
  • Calculation: $r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
  • Rule of thumb:
    • Strong relationship: $0.7 \leq |r| \leq 1.0$
    • Moderate relationship: $0.3 \leq |r| < 0.7$
    • Weak relationship: $0.0 \leq |r| < 0.3$

  Note: a correlation coefficient of 0 means there is no linear relationship between the two variables; a nonlinear relationship may still exist.
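A minimal sketch of computing $r$ directly from the formula above, assuming a pandas DataFrame with made-up x and y columns:

import numpy as np
import pandas as pd

# made-up example data; replace with your own columns
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.1, 3.9, 6.2, 8.1, 9.8]})

# correlation coefficient computed from the definition above
x_dev = df['x'] - df['x'].mean()
y_dev = df['y'] - df['y'].mean()
r = (x_dev * y_dev).sum() / (np.sqrt((x_dev**2).sum()) * np.sqrt((y_dev**2).sum()))

# should match the built-in pandas and numpy versions
print(r, df['x'].corr(df['y']), np.corrcoef(df['x'], df['y'])[0, 1])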

Define the line

  • Component:
    • intercept: the predicted value of the response when the x variable is zero
    • slope: the predicted change in the response for every 1 unit increase in the x variable
  • Regression Line:
    • Method: least squares, i.e. minimize the squared error
    • minimize $\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$
    • In other words: for each data point in the dataset, take the difference between the actual and predicted value, square it, and sum these squared differences; the best-fit line is the one that minimizes this total (see the sketch below).
    • Manual calculation
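
A minimal sketch of the manual calculation, using the standard closed-form least-squares estimates for a single predictor; the data and column names here are made up:

import numpy as np
import pandas as pd

# made-up data; in the notes x is 'area' and y is 'price'
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.0, 4.1, 6.0, 8.2, 9.9]})

# closed-form least-squares estimates for one predictor
x_dev = df['x'] - df['x'].mean()
y_dev = df['y'] - df['y'].mean()
slope = (x_dev * y_dev).sum() / (x_dev**2).sum()
intercept = df['y'].mean() - slope * df['x'].mean()

# sum of squared errors that this line minimizes
y_hat = intercept + slope * df['x']
sse = ((df['y'] - y_hat)**2).sum()
print(slope, intercept, sse)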

Code Practice and interpretation

import statsmodels.api as sm

# add a column of 1s so the model fits an intercept
df['intercept'] = 1

# sm.OLS(endog, exog): the response first, then the predictors (including the intercept column)
lm = sm.OLS(df['y'], df[['intercept', 'x1']])
result = lm.fit()
result.summary()

Interpretation (v1,v2)

  • p-value: -> whether the variable is statistically significant for predicting the dependent variable

    e.g. the p-value associated with area is very small, which suggests there is statistical evidence that the population slope relating area to price is non-zero.

  • R-squared: -> the square of the correlation coefficient, between 0 and 1; the closer to 1, the better the fit. -> the proportion of variability in the dependent variable y explained by the model.

    e.g. 67.8% of the variability in price can be explained by the area of a house.

  • coef:

    e.g. for every one unit increase in area, the predicted increase in price is 348.5.

  • intercept:

    e.g. Based on our predicted values, it would be unexpected to have a price below 9588, because this is the predicted price of a house with no area.


Multilinear Regression

Preview -> Matrix and Numpy refresher
  • Build models

    • through direct matrix calculation in numpy, using the normal-equation formula $\beta = (X'X)^{-1}X'y$
      import numpy as np
      X = df[['intercept', 'x1', 'x2', 'x3']]
      y = df['price']
      # beta = (X'X)^{-1} X'y
      beta = np.dot(np.dot(np.linalg.inv(np.dot(X.transpose(), X)), X.transpose()), y)
      
  • Dummy variables

    • the way to add categorical variables to a multiple linear regression model is with 1/0 encoding; one of the resulting columns must be dropped (the dropped column is called the baseline). Reasons:
      1. to ensure that all of the columns are linearly independent
      2. to ensure that $X'X$ is invertible
      3. to ensure the X matrix is full rank
    • Encoding: create dummies with pd.get_dummies(df[categorical_var]) and drop one column as the baseline (see the sketch below)
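      A minimal sketch of the encode-and-drop step, assuming a hypothetical categorical column `neighborhood` with levels A, B, C and A used as the baseline:

      import pandas as pd

      # made-up data with one categorical column
      df = pd.DataFrame({'neighborhood': ['A', 'B', 'C', 'A', 'B']})

      # one 1/0 dummy column per level ...
      dummies = pd.get_dummies(df['neighborhood'])

      # ... then keep all but one level; the dropped column ('A') is the baseline
      # (pd.get_dummies(..., drop_first=True) does the same in one call)
      df[['B', 'C']] = dummies[['B', 'C']]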
  • Interpretation (v1,v2)

    • coef:
      • Quantitative: for every one unit increase in X, the expected Y increases by the slope, holding all else constant.
        e.g. for every additional unit increase in the area of the house, the price is expected to increase by 348.5, as long as the other variables stay the same.
      • Categorical: interpreted relative to the baseline category.
        e.g. we expect that a house in neighborhood C will cost 7168 less than a neighborhood A house, all else being equal.
  • Potential Problems

    1. A linear relationship doesn't exist
    2. Correlated errors
    3. Non-constant variance
    4. Outliers hurt the model
    5. Multicollinearity
  • Multicollinearity problem: details described in the Model Assumptions part of the Additional section below

    • Consequences: coefficients flipped from their expected direction

    • Scatterplot Matrix

      import seaborn as sb
      sb.pairplot(df[['var1','var2','var3']])
      
    • VIFs (Variance Inflation Factors)

      • Calculation: $VIF_i = \frac{1}{1- R_i^2}$

        • Logic: all other x-variables (excluding $x_i$) are used to predict $x_i$, and $R_i^2$ is computed from that regression. If $x_i$ is related to the others, $R_i^2 \uparrow$, so $1-R_i^2 \downarrow$, and therefore $VIF_i \uparrow$.
      • Code: if VIF > 10, we have multicollinearity in the model

        import pandas as pd
        from patsy import dmatrices
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        # patsy adds an Intercept column automatically, so it is not listed in the formula
        y, X = dmatrices('price ~ area + bedrooms + bathrooms', df, return_type='dataframe')

        vif = pd.DataFrame()
        vif["VIF factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        vif["features"] = X.columns
        vif
        

        then remove one of the variables with a VIF > 10 and refit the model

  • Higher order terms

    • Why: To help fit more complex relationships in the data.
    • How: Multiplying two or more x-variables by one another. Common higher order terms include:
      1. Multiplied by itself: quadratics $(x_1^2)$ and cubics $(x_1^3)$
      2. Interactions: $(x_1 x_2)$
    • When: suggested by curves in the relationship between the y and x variables
    • Notice!!!: we cannot interpret the linear term the same way as before, because the variable appears in the higher order term as well as the linear term (see the sketch after this list).
    • Which curve to add (example plot shapes):
      • quadratic
      • cubic
      • interaction: use when the lines cross or grow apart quickly
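
A minimal sketch of adding a quadratic and an interaction term before fitting; the columns and the generated data are made up for illustration:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# made-up data
rng = np.random.default_rng(0)
df = pd.DataFrame({'area': rng.uniform(500, 3000, 200),
                   'bedrooms': rng.integers(1, 5, 200)})
df['price'] = 50 * df['area'] + 0.02 * df['area']**2 + rng.normal(0, 5000, 200)

# higher order terms: a quadratic and an interaction
df['intercept'] = 1
df['area_squared'] = df['area'] ** 2
df['area_x_bedrooms'] = df['area'] * df['bedrooms']

lm = sm.OLS(df['price'], df[['intercept', 'area', 'area_squared', 'area_x_bedrooms', 'bedrooms']])
print(lm.fit().summary())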

Logistic Regression

  • Basics: used to predict a response that has only two possible outcomes

  • Odds ratio: $\log(\frac{p}{1-p}) = b_0 + b_1x_1 + b_2x_2 + ...$, which solves to $\displaystyle p = \frac{e^{b_0 + b_1x_1 + b_2x_2 + ...}}{1 + e^{b_0 + b_1x_1 + b_2x_2 + ...}}$ ($p$: probability of category 1 occurring)

  • Interpretation: (v1,v2)

    • if Coef. >= 0: interpret a one unit increase using np.exp(Coef.)
      if Coef. < 0: it is easier to interpret a one unit decrease using the reciprocal 1/np.exp(Coef.)

    • Quantitative Vars:

      • For every one unit increase in x1, we expect a multiplicative change in the odds of being in the one category of $e^{b_1}$, holding all other variables constant.

      For each 1 unit increase in duration, fraud is 0.23 (np.exp(-1.46)) times as likely, holding all else constant.

      Better: the version using the reciprocal:

      For each 1 unit decrease in duration, fraud is 4.32 (1/np.exp(-1.46)) times as likely, holding all else constant.

    • Categorical Vars:

      • When in category x1, we expect a multiplicative change in the odds of being in the one category by $e^{b_1}$ compared to the baseline.

      Fraud is 12.76 (np.exp(2.54)) times as likely on weekdays as on weekends, holding all else constant.

  • Model Fit (check Confusion Matrix, and v1)
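
A minimal sketch of fitting the logistic model with statsmodels and reading exponentiated coefficients as odds multipliers; the fraud/duration/weekday names follow the examples above, but the data here is made up:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# made-up data mirroring the fraud example above
rng = np.random.default_rng(1)
df = pd.DataFrame({'duration': rng.uniform(1, 10, 500),
                   'weekday': rng.integers(0, 2, 500)})
logit_p = 2.0 - 1.4 * df['duration'] + 2.5 * df['weekday']
df['fraud'] = (rng.uniform(size=500) < 1 / (1 + np.exp(-logit_p))).astype(int)

df['intercept'] = 1
logit_mod = sm.Logit(df['fraud'], df[['intercept', 'duration', 'weekday']])
results = logit_mod.fit()
print(results.summary())

# exponentiate coefficients to get multiplicative changes in the odds
print(np.exp(results.params))        # positive coefficients: odds multiplier per unit increase
print(1 / np.exp(results.params))    # reciprocal, for negative coefficients like duration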


Additional

1. Matrix and Numpy refresher

2. Model Assumptions

Potential problems, their consequences, and how to assess them:

1. Non-linearity of the y and x relationship
   • Description: linearity assumes that a linear relationship truly exists between the y variable and the x variables.
   • Consequences:
     1. the predictions will not be very accurate
     2. the linear relationships associated with the coefficients aren't useful
   • Assess: a plot of the residuals (y − ŷ) against the predicted values ŷ (see the sketch after this list)
     • curvature patterns: the linear model might not fit (biased)
     • expected: random scatter
2. Correlation of error terms
   • Description: correlated errors frequently occur when the data are collected over time (like forecasting stock prices or interest rates) or are spatially related (like predicting flood or drought regions). We can often improve predictions by using information from past data points (for time) or nearby points (for space).
   • Why it matters: if the errors are correlated, that correlation can be used to your advantage to better predict future events or events spatially close to one another.
   • Assess: the Durbin-Watson statistic is used to check whether correlation of the errors is an issue.
   • Fix: ARIMA or ARMA models can be implemented to use this correlation to make better predictions.
3. Non-constant variance and normally distributed errors
   • Description: non-constant variance is when the spread of the residuals differs depending on the value you are trying to predict. This isn't a huge problem in terms of predicting well.
   • Consequences: it leads to confidence intervals and p-values that are inaccurate. Confidence intervals for the coefficients will be too wide in areas where the actual values are close to the predicted values, and too narrow in areas where the actual values are more spread out from the predicted values.
   • Assess: the residuals plot
     • non-constant variance: heteroscedastic residuals
     • constant variance: homoscedastic residuals (consistent spread across the range of predicted values)
   • Fix: a log (or some other) transformation of the response variable in order to "get rid" of the non-constant variance; a Box-Cox transformation is commonly used to choose it.
4. Outliers / leverage points
   • Description: outliers are points that lie far away from the regular trend of the data.
   • Common cause: when aggregating data from multiple sources, it's possible that some of the data values were carried over or aggregated incorrectly.
   • Fix: regularization.
5. Multicollinearity
   • Description: when the x variables are correlated with one another.
   • Consequences: one of the main concerns is that it can lead to coefficients being flipped from the direction we expect from simple linear regression.
   • Assess: bivariate plots or the variance inflation factor (review the VIF section above); if VIF > 10, we have multicollinearity.
   • VIF: $VIF_i = \frac{1}{1 - R_i^2}$
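
A minimal sketch of the residuals-vs-fitted check from item 1 (which also reveals the heteroscedasticity in item 3), assuming a fitted statsmodels OLS result object like `result` from the Code Practice section:

import matplotlib.pyplot as plt

# residuals vs. predicted values; assumes `result` is a fitted statsmodels OLS result
fitted = result.fittedvalues
residuals = result.resid

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, linestyle='--')
plt.xlabel('predicted values (y-hat)')
plt.ylabel('residuals (y - y-hat)')
plt.title('Look for random scatter; curvature or a funnel shape signals a problem')
plt.show()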

    3. Confusion Matrix

    |                    | Actual Positive                            | Actual Negative                    |                            |
    |--------------------|--------------------------------------------|------------------------------------|----------------------------|
    | Predicted Positive | TP (Sensitivity)                           | FP (Type I error)                  | Precision = TP / (TP + FP) |
    | Predicted Negative | FN (Type II error)                         | TN (Specificity)                   |                            |
    |                    | Recall (True Positive Rate) = TP / (TP + FN) | False Positive Rate = FP / (FP + TN) |                          |
    • Recall / True Positive Rate / Sensitivity: how many of the actual positives are correctly classified? $\frac{True\ Positive}{Total\ Positive} = \frac{TP}{TP+FN}$
    • Precision / Positive Predictive Value: out of all the items labeled positive, how many truly belong to the positive class? $\frac{True\ Positive}{Total\ Predicted\ Positive} = \frac{TP}{TP+FP}$
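
    A minimal sketch of computing these metrics with scikit-learn's standard metrics functions; the labels are made up:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    # made-up true labels and predictions
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # rows = actual class, columns = predicted class; [[TN, FP], [FN, TP]] for labels (0, 1)
    print(confusion_matrix(y_true, y_pred))
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)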

    4. ROC and AUC

    Review (Link)

    • ROC: it compares the rate at which the classifier makes correct positive predictions (TPR) with the rate at which it raises false alarms (FPR). $TPR = TP/(TP+FN)$, $FPR = FP/(FP+TN)$
    • AUC: area under the ROC curve. AUC = 0 -> bad, AUC = 1 -> good (AUC = 0.5 corresponds to random guessing). The more up and to the left the curve bulges, the larger the AUC and the better the classifier.
    1. Build a model using scikit-learn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    
    X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, n_informative=5)
    Xtrain = X[:9000]
    Xtest = X[9000:]
    ytrain = y[:9000]
    ytest = y[9000:]
    
    clf = LogisticRegression()
    clf.fit(Xtrain, ytrain)
    
    2. Calculate the ROC curve
    from sklearn import metrics
    import pandas as pd
    from ggplot import *
    
    preds = clf.predict_proba(Xtest)[:,1]
    fpr, tpr, _ = metrics.roc_curve(ytest, preds)
    
    df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
    ggplot(df, aes(x='fpr', y='tpr')) +\
     geom_line() +\
     geom_abline(linetype='dashed')
    
    3. Calculate the AUC
    auc = metrics.auc(fpr,tpr)
    ggplot(df, aes(x='fpr', ymin=0, ymax='tpr')) +\
     geom_area(alpha=0.2) +\
     geom_line(aes(y='tpr')) +\
     ggtitle("ROC Curve w/ AUC=%s" % str(auc))
    

    Creative Commons License
    This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.