STAT 151A - Linear Modeling: Theory and Applications

  • Description: This is a course on linear models, as well as generalized linear models, and their applications. Topics include linear regression and modeling, visualization and diagnostics, confidence intervals and hypothesis tests, analysis of variance, dealing with a large number of predictors, and generalized linear models.

  • Instructor: Gaston Sanchez

  • Lecture: 3 hours of lecture per week

  • Lab: 2 hours of computer lab sessions

  • Assignments: biweekly HW assignments

  • Exams: Up to 2 midterm exams, and a final exam

  • Notes and texts:

    • Prof. Sanchez's notes
    • Applied Regression Analysis and Generalized Linear Models (by John Fox)
  • Prerequisites: Statistical and Probability Theory, as well as Linear Algebra. It would also be nice to have some familiarity with R.

  • LMS: the specific learning resources of a given semester are shared in the Learning Management System (LMS) approved by Campus authorities (e.g. bCourses, Canvas)

  • Policies:


1. Introduction

πŸ“‡ ABOUT: By the end of this introductory module, you will be able to:

  • Define what a linear model is (in what sense a model is said to be linear)
  • Describe the high-level intuition of regression (and the regression function)

πŸ“– READING:

  • Chapters 1 and 4
  • Preliminary concepts

✏️ TOPICS:

  • Preliminary Concepts
    • Intuition of regression
    • Meaning of the term "linear"
    • Geometric duality of a data set
    • Review of orthogonal projections
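
A minimal R sketch of an orthogonal projection via the hat matrix H = X (X'X)^{-1} X', using simulated data (the variables are made up purely for illustration):

```r
set.seed(151)
X <- cbind(1, rnorm(10))          # design matrix: intercept plus one predictor
y <- 2 + 3 * X[, 2] + rnorm(10)   # simulated response

H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat (projection) matrix
y_hat <- H %*% y                        # orthogonal projection of y onto col(X)

# the residuals are orthogonal to the columns of X (up to numerical error)
round(t(X) %*% (y - y_hat), 10)
```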

2. Simple Linear Regression (SLR)

πŸ“‡ ABOUT:

This week, we introduce the descriptive aspects of a Simple Linear Regression model, postponing the discussion of inferential aspects until later. In particular, we focus on the method of (Ordinary) Least Squares to obtain the estimated coefficients of a simple linear model. Likewise, we discuss the geometric aspects of OLS, and see how the Gauss-Markov assumptions wrap a linear model with a first layer of "soft" statistical assumptions.


πŸ“– READING:

  • Chapters 5.1 and 10.1
  • Geometry of simple regression
  • Gauss-Markov assumptions in simple regression

✏️ TOPICS:

  • Simple Linear Regression (SLR)
    • Residual Sum of Squares
    • Least Squares estimates
    • Geometry of simple OLS
    • Analysis of Variance decomposition
  • SLR under GM assumptions
    • Gauss-Markov Assumptions
    • Properties of OLS coefficients
    • Properties of OLS estimates
    • Estimate of standard deviation (sigma)
    • Gauss-Markov Theorem
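
A minimal R sketch of the least squares estimates in simple regression, computed by hand on simulated data and checked against lm():

```r
set.seed(1)
x <- runif(50, 0, 10)
y <- 1 + 0.5 * x + rnorm(50)     # simulated response

b1 <- cov(x, y) / var(x)         # least squares slope
b0 <- mean(y) - b1 * mean(x)     # least squares intercept

coef(lm(y ~ x))                  # should match the hand computations
c(b0, b1)
```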

3. Multiple Linear Regression (MLR)

πŸ“‡ ABOUT:

This week, we introduce the model-fitting aspects of Multiple Linear Regression. Like we did in the previous module, we postpone the discussion of the inferential aspects for later. We'll keep our focus on the method of (Ordinary) Least Squares to obtain the coefficients of a multiple linear model. Likewise, we'll continue to study the geometric aspects of OLS, and understand how the Gauss-Markov assumptions wrap a linear model with a first layer of "soft" statistical assumptions.


πŸ“– READING:

  • Chapters 5.2, 10.2, and 10.3
  • Geometry of multiple regression
  • Gauss-Markov assumptions in multiple regression

✏️ TOPICS:

  • Multiple Linear Regression (MLR)
    • Introduction to Multiple Regression
    • Least Squares estimates
    • Geometry of multiple OLS
  • MLR under GM assumptions
    • Properties of OLS coefficients
    • Properties of OLS estimates (y-hat and residuals)
    • Estimate of variance
    • Gauss-Markov Theorem
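
A minimal R sketch of the matrix form of the OLS solution, b = (X'X)^{-1} X'y, on simulated data:

```r
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))      # intercept plus two predictors
beta <- c(1, 2, -1)
y <- X %*% beta + rnorm(n)             # simulated response

b <- solve(t(X) %*% X, t(X) %*% y)     # OLS coefficients
fitted <- X %*% b                      # y-hat
resid  <- y - fitted                   # residuals

cbind(b, coef(lm(y ~ X[, 2] + X[, 3])))   # matches lm()
```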

4. Normality Assumptions in Linear Regression

πŸ“‡ ABOUT:

In this module, we begin the introduction of the Normal Theory (i.e. so-called Normality assumptions) for linear regression models. This involves assuming that random error terms are Normally distributed, which is a requirement in order to make inferences (e.g. confidence intervals, hypothesis tests) within regression modeling.

We study how the Normality assumptions wrap a linear model with another layer of theoretical assumptions (we like to think of this as a second layer of "hard" statistical assumptions). This involves deriving Maximum Likelihood (ML) estimators, and also studying the distributions of the estimated regression quantities (e.g. coefficients, fitted values, residuals, sums of squares, etc).


πŸ“– READING:

  • Chapter 6
  • Normality assumptions in simple regression
  • Normality assumptions in multiple regression

✏️ TOPICS:

  • Normality assumptions in SLR
    • Normality assumptions
    • Maximum Likelihood estimators
    • Distributions of estimators
    • Distributions of sum of squares
  • Normality assumptions in MLR
    • Multivariate Normal distribution
    • Distributions of estimators
    • Distributions of sum of squares
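
A minimal R sketch contrasting the maximum likelihood estimate of sigma^2 (divide the RSS by n) with the unbiased estimate (divide by n - p), using simulated data:

```r
set.seed(3)
n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)   # simulated response

fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)              # residual sum of squares
p <- length(coef(fit))                # number of estimated coefficients

sigma2_ml       <- rss / n            # maximum likelihood estimate
sigma2_unbiased <- rss / (n - p)      # unbiased estimate
c(sigma2_ml, sigma2_unbiased, summary(fit)$sigma^2)   # last two agree
```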

5. Inference in Linear Regression Models

πŸ“‡ ABOUT:

After reviewing the normality assumptions in regression models and how they affect the distributions of various estimates, we move on to the inferential aspects. In this module we describe how to construct confidence intervals and how to carry out hypothesis tests.


πŸ“– READING:

  • Chapter 6
  • Confidence Intervals in regression models
  • Hypothesis Tests in regression models

✏️ TOPICS:

  • Confidence Intervals
    • Confidence intervals for regression coefficients
    • Meaning of "predictions"
    • Intervals for predictions
  • Hypothesis Tests
    • Test for a single predictor
    • F-test for multiple predictors
    • F-test and ANOVA test
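
A minimal R sketch of confidence intervals, prediction intervals, and an F-test for nested models, on simulated data:

```r
set.seed(4)
x1 <- rnorm(80); x2 <- rnorm(80)
y  <- 1 + 0.8 * x1 + rnorm(80)          # x2 is irrelevant by construction

fit_full    <- lm(y ~ x1 + x2)
fit_reduced <- lm(y ~ x1)

confint(fit_full)                       # confidence intervals for coefficients
predict(fit_full, newdata = data.frame(x1 = 0, x2 = 0),
        interval = "prediction")        # interval for a prediction
anova(fit_reduced, fit_full)            # F-test comparing nested models
```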

6. Dummy Variables and ANOVA

πŸ“‡ ABOUT:

So far we've studied linear regression models under the implicit assumption that both the response and the predictors are quantitative variables. However, we still need to study what to do when we have one or more predictors that are qualitative (i.e. categorical).


πŸ“– READING:

  • Chapters 7 and 8
  • Dummy Variables
  • ANOVA

✏️ TOPICS:

  • Dummy Variables
    • Dummy Regressors for categorical variables
    • The use of dummy (i.e. binary) indicator variables
    • Various types of encoding for categorical variables
  • ANOVA
    • Introduction to ANOVA
    • One-way ANOVA: constraints, estimates, and dispersion
    • ANOVA test
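
A minimal R sketch of dummy regressors and a one-way ANOVA, using the built-in PlantGrowth data set purely as an illustration:

```r
data(PlantGrowth)   # built-in data: plant weight by treatment group

fit <- lm(weight ~ group, data = PlantGrowth)   # group expanded into dummies
model.matrix(fit)[1:3, ]                        # inspect the dummy regressors
summary(fit)

anova(fit)                                      # one-way ANOVA table
```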

7. Residual Analysis and Diagnostic Tools

πŸ“‡ ABOUT:

The estimation of and inference from the regression model depend on several assumptions. These assumptions should be checked using regression diagnostics before using the model in earnest. This week, we cover diagnostic tools for assessing the validity of assumptions about the model specification, the error terms, and issues with unusual and influential observations.


πŸ“– READING:

  • Chapters 11 and 12
  • Residual Analysis (part 1)
  • Residual Analysis (part 2)

✏️ TOPICS:

  • Residual Analysis (part 1)
    • Problems in regression analysis
    • Residuals and Leverages
    • Types of residuals
    • Basic residual plots
  • Residual Analysis (part 2)
    • Detecting heteroscedasticity
    • Detecting non-normality
    • Detecting unusual observations
    • Detecting influential observations
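
A minimal R sketch of common diagnostic quantities (leverages, standardized and studentized residuals, Cook's distance) and the base residual plots, on a made-up model:

```r
set.seed(5)
x <- rnorm(50)
y <- 1 + x + rnorm(50)      # simulated response
fit <- lm(y ~ x)

h  <- hatvalues(fit)        # leverages
rs <- rstandard(fit)        # standardized residuals
rt <- rstudent(fit)         # studentized residuals
cd <- cooks.distance(fit)   # influence measure

par(mfrow = c(2, 2))
plot(fit)                   # base R residual diagnostic plots
```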

8. Multicollinearity

πŸ“‡ ABOUT:

Previously, we mentioned that one class of problematic issues in regression has to do with the Rank assumption of the design matrix X (full rank). This week, we discuss in what way not having a full rank matrix X affects the estimated regression quantities. More specifically, we'll study the common issue of dealing with multicollinearity.


πŸ“– READING:

  • Chapter 13
  • The Sum-of-Squares-and-Cross-Products (SSCP) matrix X'X
  • Multicollinearity

✏️ TOPICS:

  • Review of the SSCP matrix
    • The Sum-of-Squares-and-Cross-Products (SSCP) matrix
    • SSCP and friends
    • Notion and measures of multidimensional scatter
    • Eigenstructure of the SSCP matrix
  • Multicollinearity
    • What is multicollinearity
    • Examples of multicollinearity
    • Variance Inflation Factor (VIF)
    • Singular Value Decomposition (SVD) and multicollinearity
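
A minimal R sketch of detecting multicollinearity via the SSCP matrix, the singular values of the predictors, and a hand-computed VIF; the commented vif() call assumes the car package, which is one common option but not necessarily the one used in class:

```r
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)    # nearly collinear with x1
y  <- 1 + x1 + x2 + rnorm(n)

X <- cbind(1, x1, x2)
crossprod(X)                      # the SSCP matrix X'X
svd(scale(cbind(x1, x2)))$d       # a small singular value signals collinearity

fit <- lm(y ~ x1 + x2)
# car::vif(fit)                   # VIFs, if the car package is installed
1 / (1 - summary(lm(x1 ~ x2))$r.squared)   # VIF for x1, computed by hand
```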

9. Dealing with Multicollinearity

πŸ“‡ ABOUT:

In this module, we continue the discussion about multicollinearity. More specifically, we describe two methods, Principal Components Regression (PCR) and Ridge Regression (RR), that allow us to overcome some of the obstacles posed when dealing with multicollinearity.


πŸ“– READING:

  • Chapters 13.1 and 13.2.3
  • Principal Components Analysis (PCA)
  • Ridge Regression

✏️ TOPICS:

  • Use of PCA to deal with multicollinearity
    • Crash introduction to Principal Components Analysis
    • PCA and EVD
    • Geometry of PCA
    • Use of PCA for regression analysis
  • Ridge Regression
    • Introduction to Ridge Regression
    • Mean-Square-Error (MSE) in Ridge Regression
    • Geometry of Ridge Regression
    • Solution of Ridge Regression
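
A minimal R sketch of principal components regression and the ridge solution b = (X'X + lambda I)^{-1} X'y, using base R and simulated data; the value of lambda below is arbitrary:

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear predictors
y  <- 1 + x1 + x2 + rnorm(n)

# PCR: regress y on the leading principal component
pca <- prcomp(cbind(x1, x2), scale. = TRUE)
pcr_fit <- lm(y ~ pca$x[, 1])
coef(pcr_fit)

# Ridge: closed-form solution on centered and scaled predictors
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)
lambda <- 1
b_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% yc)
b_ridge
```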

10. Variable Selection and Model Building

πŸ“‡ ABOUT:

In this module, we go over common methods for selecting variables, comparing models of different sizes (i.e. with different numbers of predictors), and choosing the "best" model.


πŸ“– READING:

  • Chapter 22.1
  • Model Choice Criteria

✏️ TOPICS:

  • Model Selection
    • Introduction to model selection
    • Predictive performance
    • Limitations of R-squared for comparing models with different numbers of predictors
  • Model Comparison Criteria
    • Adjusted R-squared
    • Mallows's Cp
    • Akaike Information Criterion (AIC)
    • Bayesian Information Criterion (BIC)
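
A minimal R sketch comparing models of different sizes with adjusted R-squared, AIC, and BIC, using the built-in mtcars data purely as an illustration:

```r
fit1 <- lm(mpg ~ wt,             data = mtcars)
fit2 <- lm(mpg ~ wt + hp,        data = mtcars)
fit3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

sapply(list(fit1, fit2, fit3), function(m) summary(m)$adj.r.squared)
AIC(fit1, fit2, fit3)   # smaller is better
BIC(fit1, fit2, fit3)   # penalizes model size more heavily than AIC
```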

11. Introduction to Logistic Regression

πŸ“‡ ABOUT:

In this module, we transition into the so-called framework of Generalized Linear Models (GLM). Specifically, we start with regression models to predict a (binary) categorical response using the "plain vanilla" logistic regression model.


πŸ“– READING:

  • Chapter 14.1
  • Logistic Regression
  • Logistic Regression toy example

✏️ TOPICS:

  • Logistic Regression
    • Limitations of a linear model when applied to a binary response variable
    • Core idea to formulate a binary regression model with a logistic function
    • The Logistic regression model
  • Logistic Regression Example
    • Coronary Heart Disease (chd) data
    • Fitting a logistic regression model
    • Interpretation of regression coefficients
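
A minimal R sketch of fitting a logistic regression with glm(); the data are simulated, not the coronary heart disease (chd) data used in lecture:

```r
set.seed(8)
age <- runif(100, 20, 70)
p   <- plogis(-5 + 0.1 * age)           # logistic function of the predictor
chd <- rbinom(100, size = 1, prob = p)  # simulated binary response

fit <- glm(chd ~ age, family = binomial)
summary(fit)
exp(coef(fit))   # coefficients on the odds scale (odds ratios)
```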

12. Estimation in Logistic Regression

πŸ“‡ ABOUT:

This week, we focus on the estimation of logistic regression models. The estimation criterion is based on maximum likelihood, which unfortunately cannot be maximized analytically. Instead, we need to use numerical methods such as Newton's method (aka the Newton-Raphson method). This is the method behind what is perhaps the most common algorithm to estimate logistic regression models, namely Iterative Weighted Least Squares (IWLS), also known as Iteratively Reweighted Least Squares (IRLS).


πŸ“– READING:

  • Chapter 14.1
  • Estimation of Logistic Regression

✏️ TOPICS:

  • Maximum Likelihood estimation in Logistic Regression
    • Derivation of the (log-)likelihood of a binary logistic regression model
    • Limitations of maximizing the log-likelihood analytically
    • Estimation via numerical optimization methods (e.g. Newton's method)
    • Review of Newton's method
  • Numerical estimation in Logistic Regression
    • Newton's method to estimate a logistic regression model
    • Iterative Weighted Least Squares (IWLS) algorithm
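
A bare-bones R sketch of the Newton / IWLS update for logistic regression, b <- b + (X'WX)^{-1} X'(y - p), on simulated data; it runs a fixed number of iterations instead of a proper convergence check:

```r
set.seed(9)
n <- 200
X <- cbind(1, rnorm(n))                  # design matrix: intercept + predictor
beta_true <- c(-0.5, 1)
y <- rbinom(n, 1, plogis(X %*% beta_true))   # simulated binary response

b <- c(0, 0)                             # starting values
for (iter in 1:25) {
  p <- plogis(X %*% b)                   # fitted probabilities
  W <- diag(as.vector(p * (1 - p)))      # weights
  b <- b + solve(t(X) %*% W %*% X, t(X) %*% (y - p))   # Newton / IWLS step
}
cbind(b, coef(glm(y ~ X[, 2], family = binomial)))     # should agree closely
```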

13. Poisson Regression

πŸ“‡ ABOUT:

This week we briefly describe Poisson regression, and the theoretical framework of Generalized Linear Models (GLM). Much of what we've discussed about logistic regression applies to Poisson regression, and to other members of the GLM family.


πŸ“– READING:

  • Chapter 15
  • Introduction to Poisson Regression
  • GLM Framework

✏️ TOPICS:

  • Poisson Regression
    • Derivation of the (log-)likelihood of the Poisson regression model
    • Limitations of maximizing the log-likelihood analytically
    • Estimation via numerical optimization methods (e.g. Newton's method)
    • Review of Newton's method
  • GLM Framework
    • Main components of a GLM (random component, linear predictor, and link function)
    • Link functions, and their inverses, for linear regression, Poisson regression, and logistic regression
    • The R function glm() and its summary() output
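
A minimal R sketch of a Poisson regression fit with glm() and its log link, on simulated count data:

```r
set.seed(10)
x  <- runif(150, 0, 2)
mu <- exp(0.3 + 0.9 * x)               # log link: log(mu) = 0.3 + 0.9 x
y  <- rpois(150, lambda = mu)          # simulated counts

fit <- glm(y ~ x, family = poisson)    # link = "log" by default
summary(fit)
exp(coef(fit))                         # multiplicative effects on the mean count
```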