Part II Regression Models (Linear)

C5 Measuring Performance in Regression Models

  • Principle:
    • relying on a single metric can be problematic
    • also use visualizations of model fit, residual plots...
  • Quantitative Measures
    • RMSE: in the same units as the original data; roughly the average distance between predicted and observed values
    • R-square: the squared correlation between predicted and observed values, i.e. the proportion of variation explained; a measure of correlation, not accuracy!
      • R-square depends on the variance of the response: with the same MSE, a smaller variance in the responses leads to a lower R-square
    • Confidence/Prediction Interval: a confidence interval is for the mean prediction, a prediction interval is for a single new data point
    • Spearman rank correlation (depending on your goal: suppose you want to predict the rank as opposed to the actual number)
  • Trade-off between variance and bias
    • Low variance, high bias: robust but not accurate, underfit
    • High variance, low bias: complex, overfitted model, sensitive to new data
      • E(MSE) = sigma^2(irreducible noise) + (Model Bias)^2 + Model Variance, favoring high variance, low bias
  • Common Errors
    • overpredict low values, underpredict high ones -- common in tree-based models
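
The quantitative measures above can be sketched in a few lines of NumPy. This is only an illustration on simulated data: the helper names (`rmse`, `r_squared`, `spearman`) and the simulated responses are assumptions made for the example, `r_squared` uses the squared-correlation definition from the notes, and `spearman` assumes there are no tied values. The two data sets share exactly the same prediction errors, so they show how R-square changes with the spread of the response while RMSE does not.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the response."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

rng = np.random.default_rng(0)
errors = rng.normal(scale=1.0, size=200)      # identical prediction errors for both cases
y_wide = rng.normal(scale=3.0, size=200)      # response with large variance
y_narrow = rng.normal(scale=1.5, size=200)    # response with small variance
pred_wide = y_wide + errors
pred_narrow = y_narrow + errors

rmse_wide, rmse_narrow = rmse(y_wide, pred_wide), rmse(y_narrow, pred_narrow)
r2_wide, r2_narrow = r_squared(y_wide, pred_wide), r_squared(y_narrow, pred_narrow)

# A monotone transformation changes the values but not the ranks.
rho = spearman(y_wide, np.exp(y_wide))

print(f"RMSE: wide={rmse_wide:.3f}, narrow={rmse_narrow:.3f}")
print(f"R^2:  wide={r2_wide:.3f}, narrow={r2_narrow:.3f}")
print(f"Spearman under a monotone transform: {rho:.3f}")
```

The RMSE values match because the errors are shared, while the less variable response gets the lower R-square, which is why R-square should be read together with the spread of the outcome.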

C6 Linear Regression and Its Cousins

Linear Regression (OLS)

  • Meaning of Linear Regression
    • linear in the parameters (the response is modeled as a linear function of the coefficients)
    • additive
    • generally tries to minimize a function of MSE
    • highly interpretable
    • Under the standard assumptions, OLS gives unbiased coefficients with the lowest variance among unbiased linear estimators (MSE can be lowered further by accepting some bias, as in the penalty models)
    • Use case: signal filtering. Use the model variance to identify measurement noise, so that the underlying true signal (y) can be recovered by y_hat
  • beta-equation: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
    • 2 requirements for $X^\top X$ to have a unique inverse
      • full rank = no predictor can be determined from a linear combination of one or more of the other predictors
      • the number of observations is larger than the number of predictors
      • IF VIOLATED: a conditional (generalized) inverse can still be obtained, but the solution is not unique, so interpretability suffers
    • Collinearity
      • Impact
        • standard errors are inflated and coefficient estimates become unstable
        • predictive power is retained, but the significance of individual predictors drops (higher risk of Type II error)
      • How to spot?
        • VIF(variance inflation factor): to assess the correlation
          • $\mathrm{VIF}_i = 1 / (1 - R_i^2)$
          • $R_i^2$ is the unadjusted coefficient of determination from regressing the ith independent variable on the remaining ones
        • individual t-tests are not significant even though the overall R^2 is high
      • How to correct?
        • Delete correlated predictors (p47) so that all pairwise correlations are below a chosen threshold
          • Find the pair with the highest absolute correlation (A and B)
          • Remove whichever of A and B has the higher average correlation with the other variables
          • Repeat until no pairwise correlation is above the threshold
        • PCA, PLS (simultaneous dimension reduction and regression), Lasso, Ridge
        • Situations where multicollinearity doesn't matter: a. dummy variables; b. control variables (as opposed to variables of interest); c. interaction or power terms
    • Assumptions
      • a. linear; b. no collinearity; c. normal errors; d. errors independent of the IVs (correctly specified model); e. errors with constant variance along the DV (homoscedasticity) and independent observations (no grouping, no autocorrelation)
    • Drawbacks: a. only captures linear relationships; b. sensitive to outliers (because it minimizes SSE)
      • Robust Regression: use alternatives to SSE to reduce the influence of outliers
        • Huber function: given a threshold C, use the squared residual when |residual| < C and the absolute residual when |residual| >= C
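
The collinearity diagnostics and the robust loss above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the book's reference code: the function names (`vif`, `drop_correlated`, `huber_loss`), the 0.9 threshold, and the simulated predictors are all hypothetical. `huber_loss` uses the common continuous form of the Huber function (half the squared residual below the threshold, a shifted absolute residual above it), which is a slightly more specific version of the rule stated in the notes.

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the other columns."""
    n, p = X.shape
    values = np.empty(p)
    for i in range(p):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ beta) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        values[i] = ss_tot / ss_res  # algebraically equal to 1 / (1 - R_i^2)
    return values

def drop_correlated(X, threshold=0.9):
    """Return the column indices kept after the pairwise-correlation filter."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        a, b = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[a, b] <= threshold:
            break
        # Of the most correlated pair, drop the one with the higher average correlation.
        drop = a if corr[a].mean() >= corr[b].mean() else b
        keep.pop(int(drop))
    return keep

def huber_loss(residuals, c=1.0):
    """Quadratic for |r| < c, linear in |r| beyond c (continuous at the threshold)."""
    r = np.abs(residuals)
    return np.where(r < c, 0.5 * r ** 2, c * (r - 0.5 * c))

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                   # unrelated predictor
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
kept = drop_correlated(X, threshold=0.9)
losses = huber_loss(np.array([0.5, 3.0]), c=1.0)

print("VIF per predictor:", np.round(vifs, 1))
print("columns kept:", kept)
print("Huber loss for residuals 0.5 and 3.0:", losses)
```

In this simulated case the two near-duplicate predictors get very large VIFs and one of them is removed by the filter, while the large residual is penalized linearly rather than quadratically, which is what limits the pull of outliers.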

Partial Least Squares

  • Meaning
    • Similar to PCA, it builds linear combinations of the predictors that capture their variability
    • In addition, each combination is chosen to have maximum covariance with the response (PLS focuses on covariance with the response rather than on predictor variance alone)
    • A supervised dimension reduction procedure, as opposed to PCA being unsupervised
    • Can deal with multivariate outcomes
  • Drawbacks
    • Does not cope well with a very large number of predictors (penalty models are still more efficient in this setting)
    • Cannot handle nonlinear relationships with ease (trees and other models are recommended)
    • Slow computation: every component requires computing a new IV and DV matrix
  • Resources
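
The contrast between PCA and PLS described above can be made concrete with a small NumPy sketch. It only computes the first component of each method on simulated data, using the fact that the first PLS direction for a single response is proportional to X^T y; the variable names and the simulated predictors are assumptions made for illustration, and this is not a full PLS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(scale=3.0, size=n)            # high variance, unrelated to the response
x2 = rng.normal(scale=1.0, size=n)            # low variance, drives the response
y = 2.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                        # center predictors
yc = y - y.mean()                              # center response

# PCA (unsupervised): direction of maximum predictor variance.
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
w_pca = vt[0]

# First PLS direction (supervised): unit vector maximizing covariance with y.
w_pls = Xc.T @ yc
w_pls = w_pls / np.linalg.norm(w_pls)

t_pca = Xc @ w_pca                             # component scores
t_pls = Xc @ w_pls

cov_pca = abs(t_pca @ yc) / (n - 1)
cov_pls = abs(t_pls @ yc) / (n - 1)
corr_pca = abs(np.corrcoef(t_pca, y)[0, 1])
corr_pls = abs(np.corrcoef(t_pls, y)[0, 1])

print("PCA direction:", np.round(w_pca, 3), "| |corr with y| =", round(corr_pca, 3))
print("PLS direction:", np.round(w_pls, 3), "| |corr with y| =", round(corr_pls, 3))
```

Here PCA follows the high-variance predictor even though it carries no information about the response, while the PLS direction leans toward the predictor that actually drives it.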

Penalty Models

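  • Ridge (shrinkage model, L2, penalty = $\lambda \sum_j \beta_j^2$)
  • Lasso (least absolute shrinkage and selection operator, L1, penalty = $\lambda \sum_j |\beta_j|$)
  • Elastic Net (combines the L1 and L2 penalties)
  • Notice: the penalty is computed on the coefficients, whose size depends on the scale of each predictor, so the predictors should be put on a common scale (centered and scaled) before fitting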
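
A short NumPy sketch of the penalty ideas above, under stated assumptions: `ridge_fit` uses the closed-form ridge solution for the objective ||y - Xb||^2 + lam * ||b||^2 on centered data, and `soft_threshold` is the lasso solution only in the special case of an orthonormal design with objective 0.5 * ||y - Xb||^2 + t * ||b||_1. The function names, the value of `lam`, and the simulated predictors are hypothetical choices for the example.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Shrink toward zero by t and set anything smaller than t exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
n = 1000
# Two equally important predictors measured on very different scales.
X_raw = np.column_stack([rng.normal(size=n), 1000.0 * rng.normal(size=n)])
y = X_raw[:, 0] + 0.001 * X_raw[:, 1] + rng.normal(scale=0.1, size=n)
yc = y - y.mean()

lam = 500.0
X_centered = X_raw - X_raw.mean(axis=0)
X_std = standardize(X_raw)

# Ratio of ridge to OLS coefficients: values near 1 mean almost no shrinkage.
shrink_raw = ridge_fit(X_centered, yc, lam) / ridge_fit(X_centered, yc, 0.0)
shrink_std = ridge_fit(X_std, yc, lam) / ridge_fit(X_std, yc, 0.0)

b_ols = ridge_fit(X_std, yc, 0.0)
b_ridge = ridge_fit(X_std, yc, lam)
thresholded = soft_threshold(np.array([3.0, -0.5, 1.0]), 1.0)

print("shrinkage on raw scale:     ", np.round(shrink_raw, 3))
print("shrinkage on standardized X:", np.round(shrink_std, 3))
print("soft-thresholded values:    ", thresholded)
```

On the raw scale the penalty falls almost entirely on the small-scale predictor, whereas after standardization both coefficients are shrunk to a similar degree; the soft-threshold output shows how an L1 penalty can set coefficients exactly to zero, which is the selection behaviour of the lasso.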