
Regressions

Outline


Regression

  • The most common way to visualize simple linear regression is with a scatter plot.

Correlation Coefficient

  • Correlation coefficients measure the strength and direction of a linear relationship.
    • strength: how closely the points cluster around a line
    • direction: positive or negative
  • Calculation: $r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
  • Rule of thumb:
    • Strong relationship: $0.7 \leq |r| \leq 1.0$
    • Moderate relationship: $0.3 \leq |r| < 0.7$
    • Weak relationship: $0.0 \leq |r| < 0.3$

  Note: a correlation coefficient of 0 means there is no linear relationship between the two variables; a nonlinear relationship may still exist.
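A minimal sketch of computing $r$ directly from the formula above, assuming a pandas DataFrame with made-up x and y columns:

import numpy as np
import pandas as pd

# made-up example data; replace with your own columns
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.1, 3.9, 6.2, 8.1, 9.8]})

# correlation coefficient computed from the definition above
x_dev = df['x'] - df['x'].mean()
y_dev = df['y'] - df['y'].mean()
r = (x_dev * y_dev).sum() / (np.sqrt((x_dev**2).sum()) * np.sqrt((y_dev**2).sum()))

# should match the built-in pandas and numpy versions
print(r, df['x'].corr(df['y']), np.corrcoef(df['x'], df['y'])[0, 1])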

Define the line

  • Component:
    • intercept: the predicted value of the response when the x variable is zero
    • slope: the predicted change in the response for every 1 unit increase in the x variable
  • Regression Line:
    • Method: least squares, i.e. minimize the squared error
    • minimize $\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$
    • In other words: for each data point in the dataset, take the difference between the actual and predicted value, square it, and sum these squared differences; the best-fit line is the one that minimizes this total (see the sketch below).
    • Manual calculation
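
A minimal sketch of the manual calculation, using the standard closed-form least-squares estimates for a single predictor; the data and column names here are made up:

import numpy as np
import pandas as pd

# made-up data; in the notes x is 'area' and y is 'price'
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.0, 4.1, 6.0, 8.2, 9.9]})

# closed-form least-squares estimates for one predictor
x_dev = df['x'] - df['x'].mean()
y_dev = df['y'] - df['y'].mean()
slope = (x_dev * y_dev).sum() / (x_dev**2).sum()
intercept = df['y'].mean() - slope * df['x'].mean()

# sum of squared errors that this line minimizes
y_hat = intercept + slope * df['x']
sse = ((df['y'] - y_hat)**2).sum()
print(slope, intercept, sse)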

Code Practice and interpretation

import statsmodels.api as sm

# add a column of 1s so the model fits an intercept
df['intercept'] = 1

# sm.OLS(endog, exog): the response first, then the predictors (including the intercept column)
lm = sm.OLS(df['y'], df[['intercept', 'x1']])
result = lm.fit()
result.summary()

Interpretation (v1,v2)

  • p-value: -> whether the variable is statistically significant for predicting the dependent variable

    e.g. the p-value associated with area is very small, which suggests there is statistical evidence that the population slope relating area to price is non-zero.

  • R-squared: -> the square of the correlation coefficient, between 0 and 1; the closer to 1, the better the fit. -> the proportion of variability in the dependent variable y explained by the model.

    e.g. 67.8% of the variability in price can be explained by the area of a house.

  • coef:

    e.g. for every one unit increase in area, the predicted increase in price is 348.5.

  • intercept:

    e.g. Based on our predicted values, it would be unexpected to have a price below 9588, because this is the predicted price of a house with no area.


Multilinear Regression

Preview -> Matrix and Numpy refresher
  • Build models

    • through direct matrix calculation in numpy, using the normal-equation formula $\beta = (X'X)^{-1}X'y$
      import numpy as np
      X = df[['intercept', 'x1', 'x2', 'x3']]
      y = df['price']
      # beta = (X'X)^{-1} X'y
      beta = np.dot(np.dot(np.linalg.inv(np.dot(X.transpose(), X)), X.transpose()), y)
      
  • Dummy variables

    • the way to add categorical variables to a multiple linear regression model is with 1/0 encoding; one of the resulting columns must be dropped (the dropped column is called the baseline). Reasons:
      1. to ensure that all of the columns are linearly independent
      2. to ensure that $X'X$ is invertible
      3. to ensure the X matrix is full rank
    • Encoding: create dummies with pd.get_dummies(df[categorical_var]) and drop one column as the baseline (see the sketch below)
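      A minimal sketch of the encode-and-drop step, assuming a hypothetical categorical column `neighborhood` with levels A, B, C and A used as the baseline:

      import pandas as pd

      # made-up data with one categorical column
      df = pd.DataFrame({'neighborhood': ['A', 'B', 'C', 'A', 'B']})

      # one 1/0 dummy column per level ...
      dummies = pd.get_dummies(df['neighborhood'])

      # ... then keep all but one level; the dropped column ('A') is the baseline
      # (pd.get_dummies(..., drop_first=True) does the same in one call)
      df[['B', 'C']] = dummies[['B', 'C']]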
  • Interpretation (v1,v2)

    • coef:
      • Quantitative: for every one unit increase in X, the expected Y increases by the slope, holding all else constant.
        e.g. for every additional unit increase in the area of the house, the price is expected to increase by 348.5, as long as the other variables stay the same.
      • Categorical: interpreted relative to the baseline category.
        e.g. we expect that a house in neighborhood C will cost 7168 less than a neighborhood A house, all else being equal.
  • Potential Problems

    1. A linear relationship doesn't exist
    2. Correlated errors
    3. Non-constant variance
    4. Outliers hurt the model
    5. Multicollinearity
  • Multicollinearity problem: details described in the Model Assumptions part of the Additional section below

    • Consequences: coefficients flipped from their expected direction

    • Scatterplot Matrix

      import seaborn as sb
      sb.pairplot(df[['var1','var2','var3']])
      
    • VIFs (Variance Inflation Factors)

      • Calculation: $VIF_i = \frac{1}{1- R_i^2}$

        • Logic: all other x-variables (excluding $x_i$) are used to predict $x_i$, and $R_i^2$ is computed from that regression. If $x_i$ is related to the others, $R_i^2 \uparrow$, so $1-R_i^2 \downarrow$, and therefore $VIF_i \uparrow$.
      • Code: if VIF > 10, we have multicollinearity in the model

        import pandas as pd
        from patsy import dmatrices
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        # patsy adds an Intercept column automatically, so it is not listed in the formula
        y, X = dmatrices('price ~ area + bedrooms + bathrooms', df, return_type='dataframe')

        vif = pd.DataFrame()
        vif["VIF factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        vif["features"] = X.columns
        vif
        

        then remove one of the variables with a VIF > 10 and refit the model

  • Higher order terms

    • Why: To help fit more complex relationships in the data.
    • How: Multiplying two or more x-variables by one another. Common higher order terms include:
      1. Multiplied by itself: quadratics $(x_1^2)$ and cubics $(x_1^3)$
      2. Interactions: $(x_1 x_2)$
    • When: suggested by curves in the relationship between the y and x variables
    • Notice!!!: we cannot interpret the linear term the same way as before, because the variable appears in the higher order term as well as the linear term (see the sketch after this list).
    • Which curve to add (example plot shapes):
      • quadratic
      • cubic
      • interaction: use when the lines cross or grow apart quickly
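
A minimal sketch of adding a quadratic and an interaction term before fitting; the columns and the generated data are made up for illustration:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# made-up data
rng = np.random.default_rng(0)
df = pd.DataFrame({'area': rng.uniform(500, 3000, 200),
                   'bedrooms': rng.integers(1, 5, 200)})
df['price'] = 50 * df['area'] + 0.02 * df['area']**2 + rng.normal(0, 5000, 200)

# higher order terms: a quadratic and an interaction
df['intercept'] = 1
df['area_squared'] = df['area'] ** 2
df['area_x_bedrooms'] = df['area'] * df['bedrooms']

lm = sm.OLS(df['price'], df[['intercept', 'area', 'area_squared', 'area_x_bedrooms', 'bedrooms']])
print(lm.fit().summary())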

Logistic Regression

  • Basics: used to predict a response that has only two possible outcomes

  • Odds ratio: $\log(\frac{p}{1-p}) = b_0 + b_1x_1 + b_2x_2 + ...$, which solves to $\displaystyle p = \frac{e^{b_0 + b_1x_1 + b_2x_2 + ...}}{1 + e^{b_0 + b_1x_1 + b_2x_2 + ...}}$ ($p$: probability of category 1 occurring)

  • Interpretation: (v1,v2)

    • if Coef. >= 0: interpret a one unit increase using np.exp(Coef.)
      if Coef. < 0: it is easier to interpret a one unit decrease using the reciprocal 1/np.exp(Coef.)

    • Quantitative Vars:

      • For every one unit increase in x1, we expect a multiplicative change in the odds of being in the one category of $e^{b_1}$, holding all other variables constant.

      For each 1 unit increase in duration, fraud is 0.23 (np.exp(-1.46)) times as likely, holding all else constant.

      Better: the version using the reciprocal:

      For each 1 unit decrease in duration, fraud is 4.32 (1/np.exp(-1.46)) times as likely, holding all else constant.

    • Categorical Vars:

      • When in category x1, we expect a multiplicative change in the odds of being in the one category by $e^{b_1}$ compared to the baseline.

      Fraud is 12.76 (np.exp(2.54)) times as likely on weekdays as on weekends, holding all else constant.

  • Model Fit (check Confusion Matrix, and v1)
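
A minimal sketch of fitting the logistic model with statsmodels and reading exponentiated coefficients as odds multipliers; the fraud/duration/weekday names follow the examples above, but the data here is made up:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# made-up data mirroring the fraud example above
rng = np.random.default_rng(1)
df = pd.DataFrame({'duration': rng.uniform(1, 10, 500),
                   'weekday': rng.integers(0, 2, 500)})
logit_p = 2.0 - 1.4 * df['duration'] + 2.5 * df['weekday']
df['fraud'] = (rng.uniform(size=500) < 1 / (1 + np.exp(-logit_p))).astype(int)

df['intercept'] = 1
logit_mod = sm.Logit(df['fraud'], df[['intercept', 'duration', 'weekday']])
results = logit_mod.fit()
print(results.summary())

# exponentiate coefficients to get multiplicative changes in the odds
print(np.exp(results.params))        # positive coefficients: odds multiplier per unit increase
print(1 / np.exp(results.params))    # reciprocal, for negative coefficients like duration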


Additional

1. Matrix and Numpy refresher

2. Model Assumptions

Potential problems, their consequences, and how to assess them:

1. Non-linearity of the y and x relationship
   • Description: linearity assumes that a linear relationship truly exists between the y variable and the x variables.
   • Consequences:
     1. the predictions will not be very accurate
     2. the linear relationships associated with the coefficients aren't useful
   • Assess: a plot of the residuals (y − ŷ) against the predicted values ŷ (see the sketch after this list)
     • curvature patterns: the linear model might not fit (biased)
     • expected: random scatter
2. Correlation of error terms
   • Description: correlated errors frequently occur when the data are collected over time (like forecasting stock prices or interest rates) or are spatially related (like predicting flood or drought regions). We can often improve predictions by using information from past data points (for time) or nearby points (for space).
   • Why it matters: if the errors are correlated, that correlation can be used to your advantage to better predict future events or events spatially close to one another.
   • Assess: the Durbin-Watson statistic is used to check whether correlation of the errors is an issue.
   • Fix: ARIMA or ARMA models can be implemented to use this correlation to make better predictions.
3. Non-constant variance and normally distributed errors
   • Description: non-constant variance is when the spread of the residuals differs depending on the value you are trying to predict. This isn't a huge problem in terms of predicting well.
   • Consequences: it leads to confidence intervals and p-values that are inaccurate. Confidence intervals for the coefficients will be too wide in areas where the actual values are close to the predicted values, and too narrow in areas where the actual values are more spread out from the predicted values.
   • Assess: the residuals plot
     • non-constant variance: heteroscedastic residuals
     • constant variance: homoscedastic residuals (consistent spread across the range of predicted values)
   • Fix: a log (or some other) transformation of the response variable in order to "get rid" of the non-constant variance; a Box-Cox transformation is commonly used to choose it.
4. Outliers / leverage points
   • Description: outliers are points that lie far away from the regular trend of the data.
   • Common cause: when aggregating data from multiple sources, it's possible that some of the data values were carried over or aggregated incorrectly.
   • Fix: regularization.
5. Multicollinearity
   • Description: when the x variables are correlated with one another.
   • Consequences: one of the main concerns is that it can lead to coefficients being flipped from the direction we expect from simple linear regression.
   • Assess: bivariate plots or the variance inflation factor (review the VIF section above); if VIF > 10, we have multicollinearity.
   • VIF: $VIF_i = \frac{1}{1 - R_i^2}$
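
A minimal sketch of the residuals-vs-fitted check from item 1 (which also reveals the heteroscedasticity in item 3), assuming a fitted statsmodels OLS result object like `result` from the Code Practice section:

import matplotlib.pyplot as plt

# residuals vs. predicted values; assumes `result` is a fitted statsmodels OLS result
fitted = result.fittedvalues
residuals = result.resid

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, linestyle='--')
plt.xlabel('predicted values (y-hat)')
plt.ylabel('residuals (y - y-hat)')
plt.title('Look for random scatter; curvature or a funnel shape signals a problem')
plt.show()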

    3. Confusion Matrix

    |                    | Actual Positive                            | Actual Negative                    |                            |
    |--------------------|--------------------------------------------|------------------------------------|----------------------------|
    | Predicted Positive | TP (Sensitivity)                           | FP (Type I error)                  | Precision = TP / (TP + FP) |
    | Predicted Negative | FN (Type II error)                         | TN (Specificity)                   |                            |
    |                    | Recall (True Positive Rate) = TP / (TP + FN) | False Positive Rate = FP / (FP + TN) |                          |
    • Recall / True Positive Rate / Sensitivity: how many of the actual positives are correctly classified? $\frac{True\ Positive}{Total\ Positive} = \frac{TP}{TP+FN}$
    • Precision / Positive Predictive Value: out of all the items labeled positive, how many truly belong to the positive class? $\frac{True\ Positive}{Total\ Predicted\ Positive} = \frac{TP}{TP+FP}$
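
    A minimal sketch of computing these metrics with scikit-learn's standard metrics functions; the labels are made up:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    # made-up true labels and predictions
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # rows = actual class, columns = predicted class; [[TN, FP], [FN, TP]] for labels (0, 1)
    print(confusion_matrix(y_true, y_pred))
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)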

    4. ROC and AUC

    Review (Link)

    • ROC: it compares the rate at which the classifier makes correct positive predictions (TPR) with the rate at which it raises false alarms (FPR). $TPR = TP/(TP+FN)$, $FPR = FP/(FP+TN)$
    • AUC: area under the ROC curve. AUC = 0 -> bad, AUC = 1 -> good (AUC = 0.5 corresponds to random guessing). The more up and to the left the curve bulges, the larger the AUC and the better the classifier.
    1. Build a model using scikit-learn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    
    X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, n_informative=5)
    Xtrain = X[:9000]
    Xtest = X[9000:]
    ytrain = y[:9000]
    ytest = y[9000:]
    
    clf = LogisticRegression()
    clf.fit(Xtrain, ytrain)
    
    2. Calculate the ROC curve
    from sklearn import metrics
    import pandas as pd
    from ggplot import *
    
    preds = clf.predict_proba(Xtest)[:,1]
    fpr, tpr, _ = metrics.roc_curve(ytest, preds)
    
    df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
    ggplot(df, aes(x='fpr', y='tpr')) +\
     geom_line() +\
     geom_abline(linetype='dashed')
    
    3. Calculate the AUC
    auc = metrics.auc(fpr,tpr)
    ggplot(df, aes(x='fpr', ymin=0, ymax='tpr')) +\
     geom_area(alpha=0.2) +\
     geom_line(aes(y='tpr')) +\
     ggtitle("ROC Curve w/ AUC=%s" % str(auc))
    

    Creative Commons License
    This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.