
[Data Analysis] Week 4 Lectures

Preflight:

setwd("~/Desktop/coursera-data-analysis/lectures/")

Clustering Example

slides | video | transcript

Samsung Galaxy S3 case study / example

load('samsungData.rda')
#names(samsungData)
table(samsungData$activity)

###
par(mfrow=c(1,2))
numericActivity <- as.numeric(as.factor(samsungData$activity))[samsungData$subject==1]
plot(samsungData[samsungData$subject==1,1],pch=19,col=numericActivity,ylab=names(samsungData)[1])
plot(samsungData[samsungData$subject==1,2],pch=19,col=numericActivity,ylab=names(samsungData)[2])
legend(150,-0.1,legend=unique(samsungData$activity),col=unique(numericActivity),pch=19)

### Clustering based _just on average_ acceleration
### cluster dendrogram (where are the patterns?)
source("http://dl.dropbox.com/u/7710864/courseraPublic/myplclust.R")
distanceMatrix <- dist(samsungData[samsungData$subject==1,1:3])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering,lab.col=numericActivity)

### Plotting **max** acceleration for the first subject
### (see the difference?)
par(mfrow=c(1,2))
plot(samsungData[samsungData$subject==1,10],
     pch=19, col=numericActivity,
     ylab=names(samsungData)[10])
plot(samsungData[samsungData$subject==1,11],
     pch=19, col=numericActivity,
     ylab=names(samsungData)[11])

### Clustering based on maximum acceleration
#source("http://dl.dropbox.com/u/7710864/courseraPublic/myplclust.R")
distanceMatrix <- dist(samsungData[samsungData$subject==1,10:12])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col=numericActivity)

### Singular value decomposition
svd1 = svd(scale(samsungData[samsungData$subject==1,-c(562,563)]))
par(mfrow=c(1,2))
plot(svd1$u[,1], col=numericActivity, pch=19)
plot(svd1$u[,2], col=numericActivity, pch=19)

### Find maximum contributor
### "look at the right singular vector that corresponds with the left singular vector that gives that pattern"
plot(svd1$v[,2], pch=19)

### New clustering with maximum contributor
maxContrib <- which.max(svd1$v[,2])
distanceMatrix <- dist(samsungData[samsungData$subject==1, c(10:12,maxContrib)])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering, lab.col=numericActivity) 

### Identify the maximum contributor
names(samsungData)[maxContrib]                          
### "separates out a variable that distinguishes between walking and walking-up"

### K-means clustering (nstart=1, first try)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])

### K-means clustering (nstart=1, second try)
### do it again! (note: not quite the same)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6,nstart=1)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])

### K-means clustering (nstart=100, first try)
### 100 random starts; kmeans keeps the best run (lowest total within-cluster sum of squares)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6,nstart=100)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])

### K-means clustering (nstart=100, second try)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6,nstart=100)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])
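A minimal sketch (not from the lecture): kmeans() with nstart=100 keeps the start with the smallest total within-cluster sum of squares, so two separate nstart=100 calls should land on (nearly) the same solution. `sub1` is just shorthand for the subject-1 feature matrix used above.

### compare the kept solutions across two independent nstart=100 runs
sub1 <- samsungData[samsungData$subject==1,-c(562,563)]
fit1 <- kmeans(sub1,centers=6,nstart=100)
fit2 <- kmeans(sub1,centers=6,nstart=100)
c(fit1$tot.withinss, fit2$tot.withinss)  # should be (nearly) identical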

### Cluster 1 Variable Centers (Laying)
plot(kClust$centers[1,1:10],pch=19,ylab="Cluster Center",xlab="")

### Cluster 6 Variable Centers (Walking)
### (note: cluster numbering varies from run to run)
plot(kClust$centers[6,1:10],pch=19,ylab="Cluster Center",xlab="")

Basic Least Squares

slides | video | transcript

  • 1st lecture about regression

Goals of statistical modeling

  • Describe the distribution of variables
  • Describe the relationship between variables
  • Make inferences about distributions or relationships

Example: Average parent and child heights

The Galton Data

Load Galton Data:

library(UsingR); data(galton)
par(mfrow=c(1,2))
hist(galton$child,col="blue",breaks=100)
hist(galton$parent,col="blue",breaks=100)

The distribution of child heights

hist(galton$child,col="blue",breaks=100)

"If you had only one number to use to summarize the distribution what would it be?"

Only know the child - average height

hist(galton$child,col="blue",breaks=100)
meanChild <- mean(galton$child)
lines(rep(meanChild,100),seq(0,150,length=100),col="red",lwd=5)

Here the mean makes sense b/c it's a pretty symmetric distribution

Only know the child - why average?

it minimizes the sum of squared errors - the mean is the single number closest to all the observations in the least-squares sense (see the sketch below)
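A quick numeric check (a sketch, not from the lecture): write the sum of squared errors as a function of a candidate summary value `mu` and confirm it is minimized at the mean.

sse <- function(mu) sum((galton$child - mu)^2)
optimize(sse,interval=range(galton$child))$minimum  # ~ mean(galton$child)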

What if we plot child versus average parent?

plot(galton$parent,galton$child,pch=19,col="blue")

Jitter: (mostly just to visualize the points that are stacked)

# Jittered plot
set.seed(1234)
plot(jitter(galton$parent,factor=2),jitter(galton$child,factor=2),pch=19,col="blue")

Average parent = 65 inches tall:

plot(galton$parent,galton$child,pch=19,col="blue")
near65 <- galton[abs(galton$parent - 65)<1, ]
points(near65$parent,near65$child,pch=19,col="red")
lines(seq(64,66,length=100),rep(mean(near65$child),100),col="red",lwd=4)

"If the average parent is 65 inches tall, what is the average child height?"

Average parent = 71 inches tall:

plot(galton$parent,galton$child,pch=19,col="blue")
near71 <- galton[abs(galton$parent - 71)<1, ]
points(near71$parent,near71$child,pch=19,col="red")
lines(seq(70,72,length=100),rep(mean(near71$child),100),col="red",lwd=4)

Fitting a line

plot(galton$parent,galton$child,pch=19,col="blue")
lm1 <- lm(galton$child ~ galton$parent)
lines(galton$parent,lm1$fitted,col="red",lwd=3)

linear model

"Why not this line?"

plot(galton$parent,galton$child,pch=19,col="blue")
lines(galton$parent, 26 + 0.646*galton$parent)

B/c Not all points are on the line

plot(galton$parent,galton$child,pch=19,col="blue")
lines(galton$parent,lm1$fitted,col="red",lwd=3)

Allowing for variation (e.g., what are the variables we didn't measure?)

How do we pick best?

If $C_i$ is the height of child $i$ and $P_i$ is the height of the average parent, pick the line that makes the child values $C_i$ and our guesses $b_0 + b_1 P_i$ as close as possible, i.e. minimize the sum of squared errors:

$$ \sum_{i=1}^{n} \left( C_i - (b_0 + b_1 P_i) \right)^2 $$
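A minimal sketch (using `galton` and `lm1` from above): compute the residual sum of squares by hand and confirm the least-squares line beats the "why not this line?" candidate from earlier.

rss <- function(b0,b1) sum((galton$child - (b0 + b1*galton$parent))^2)
rss(coef(lm1)[1],coef(lm1)[2])  # RSS at the fitted coefficients
rss(26,0.646)                   # the earlier candidate line does worse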

Plot what is leftover

par(mfrow=c(1,2))
plot(galton$parent,galton$child,pch=19,col="blue")
lines(galton$parent,lm1$fitted,col="red",lwd=3)
plot(galton$parent,lm1$residuals,col="blue",pch=19)
abline(c(0,0),col="red",lwd=3)

Inference Basics

slides | video | transcript

  • prev'ly: fitting lines
  • if we have a sample - how can we say something about our best fit line?
    • what does that line say about the population?

Fit a line to the Galton Data

library(UsingR)
data(galton)
plot(galton$parent,galton$child,pch=19,col="blue")
lm1 <- lm(galton$child ~ galton$parent)
lines(galton$parent,lm1$fitted,col="red",lwd=3)
  • plot average parent height vs. child height
  • lm = "linear model" fn
    • lm always fits an intercept term (unless you tell it not to)
  • use lines to add the fit line (From lm) to the plot
lm1
  • observe the coefficient(s)!

Create a "population" of 1 million families

  • "Create a scenario where we have 1 million families..."
    • based on the fit of the original data set
newGalton <- data.frame(parent=rep(NA, 1e6), child=rep(NA,1e6))

# create the parents (use mean & sd of parents from Galton data)
# again: `rnorm` to generate this normally distributed random sample
newGalton$parent <- rnorm(1e6,mean=mean(galton$parent), sd=sd(galton$parent))

# create the children from the parent data and the coefficients from the `lm`
newGalton$child <- lm1$coeff[1] + lm1$coeff[2]*newGalton$parent + rnorm(1e6,sd=sd(lm1$residuals))

smoothScatter(newGalton$parent,newGalton$child)
abline(lm1,col="red",lwd=3)
# note how the new, simulated data actually fits to our regression line from the actual Galton data

Let's take a sample

set.seed(134325)

# sample from our simulated data
sampleGalton1 <- newGalton[sample(1:1e6,size=50,replace=FALSE),]

# new regression line (from the simulated/random sample)
sampleLm1 <- lm(sampleGalton1$child ~ sampleGalton1$parent)

plot(sampleGalton1$parent,sampleGalton1$child,pch=19,col="blue")

# put our black (sample) & red (actual) lines on there
lines(sampleGalton1$parent,sampleLm1$fitted,lwd=3,lty=2)
abline(lm1,col="red",lwd=3)

Let's take another sample

"Another random sample of 50..."

sampleGalton2 <- newGalton[sample(1:1e6,size=50,replace=FALSE),]
sampleLm2 <- lm(sampleGalton2$child ~ sampleGalton2$parent)
plot(sampleGalton2$parent,sampleGalton2$child,pch=19,col="blue")
lines(sampleGalton2$parent,sampleLm2$fitted,lwd=3,lty=2)
abline(lm1,col="red",lwd=3)
  • again: red line is the original fit line (real Galton data)
  • black line is our random sample
    • sample is always random, so fit line is always a little different

Let's take another sample

Again:

sampleGalton3 <- newGalton[sample(1:1e6,size=50,replace=F),]
sampleLm3 <- lm(sampleGalton3$child ~ sampleGalton3$parent)
plot(sampleGalton3$parent,sampleGalton3$child,pch=19,col="blue")
lines(sampleGalton3$parent,sampleLm3$fitted,lwd=3,lty=2)
abline(lm1,col="red",lwd=3)

Many samples

"Let's do that again. 100 times."

sampleLm <- vector(100,mode="list")
for(i in 1:100){
  sampleGalton <- newGalton[sample(1:1e6,size=50,replace=FALSE),]
  sampleLm[[i]] <- lm(sampleGalton$child ~ sampleGalton$parent)
}

smoothScatter(newGalton$parent,newGalton$child)
for(i in 1:100){abline(sampleLm[[i]],lwd=3,lty=2)}
abline(lm1,col="red",lwd=3)
  • again: red line is original fit line from real data
  • the many black lines? - fit lines from the 100 iterations
    • note the different slopes etc.
    • and/but mostly around the same center
  • real thing? - you only get one of these (black) lines in a given study

Histogram of estimates

par(mfrow=c(1,2))
# hist. of intercept terms:
hist(sapply(sampleLm,function(x){coef(x)[1]}),col="blue",xlab="Intercept",main="")
# hist. of slope terms:
hist(sapply(sampleLm,function(x){coef(x)[2]}),col="blue",xlab="Slope",main="")
  • In the real world you don't get to look at this many estimates.
  • you usually only have one sample
    • and you need to work w/ what you have

Distribution of coefficients

  • central limit theorem - super important: across repeated samples, the coefficient estimates are approximately normally distributed
  • which is what lets us estimate their variance from a single sample
  • standard error = square root of the estimated variance for a coefficient (see the check below)
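A small check (a sketch, not from the lecture): the standard error reported by one fitted model should approximate the spread of the slope estimates across the 100 simulated samples above.

sd(sapply(sampleLm,function(x){coef(x)[2]}))  # empirical SD of the 100 slopes
summary(sampleLm[[1]])$coeff[2,2]             # SE reported by a single fit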

Estimating the values in R

par(mfrow=c(1,1))
sampleGalton4 <- newGalton[sample(1:1e6,size=50,replace=FALSE),]
sampleLm4 <- lm(sampleGalton4$child ~ sampleGalton4$parent)
summary(sampleLm4)

hist(sapply(sampleLm,function(x){coef(x)[2]}),col="blue",xlab="Slope",main="",freq=F)
lines(seq(0,5,length=100),dnorm(seq(0,5,length=100),mean=coef(sampleLm4)[2],
      sd=summary(sampleLm4)$coeff[2,2]),lwd=3,col="red")
  • note the actual slope is slightly different from the slope value that generated the data
    • but the variance is about the same

Why do we standardize?

  • standardize!
    • b/c you want to be able to make comparisons
    • and standardized quantitative data is easier to compare
    • easier to reason about
par(mfrow=c(1,2))
hist(sapply(sampleLm,function(x){coef(x)[1]}),col="blue",xlab="Intercept",main="")
hist(sapply(sampleLm,function(x){coef(x)[2]}),col="blue",xlab="Slope",main="")

Standardized coefficients

  • take the coefficients that we have, subtract the hypothesized value, and divide by the standard error (see the check below)
  • Degrees of Freedom: (approx.)
    • number of samples - number of things you estimated
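A quick check (using `sampleLm4` from above): the standardized (t) statistic that summary() reports is the estimate divided by its standard error, with n - 2 = 48 degrees of freedom for a sample of 50.

est <- summary(sampleLm4)$coeff[2,1]  # slope estimate
se <- summary(sampleLm4)$coeff[2,2]   # its standard error
est/se                                # matches summary(sampleLm4)$coeff[2,3]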

$t_{n−2}$ versus $N(0,1)$

par(mfrow=c(1,1))

# linear set of points
x <- seq(-5,5,length=100)
# plot the normal dist.:
plot(x,dnorm(x),type="l",lwd=3)
# and a couple of t-distributions:
lines(x,dt(x,df=3),lwd=3,col="red")
lines(x,dt(x,df=10),lwd=3,col="blue")

Confidence intervals

  • "what are the possible values that the real B1 has?"
  • level $\alpha$ confidence interval
    • confidence interval: set of values for B1 that we think are plausible
summary(sampleLm4)$coeff
confint(sampleLm4,level=0.95)
  • note on the slide: "+" and "-" are backward in the equation
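A quick check (using `sampleLm4` from above): the interval confint() reports is estimate +/- t-quantile * standard error, again with 48 degrees of freedom.

est <- summary(sampleLm4)$coeff[2,1]
se <- summary(sampleLm4)$coeff[2,2]
est + c(-1,1)*qt(0.975,df=48)*se  # matches confint(sampleLm4)[2,]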

Repeated samples of size 50 from the 1 million family data set:

par(mar=c(4,4,0,2))
plot(1:10,type="n",xlim=c(0,1.5),ylim=c(0,100),
     xlab="Coefficient Values",ylab="Replication")
for(i in 1:100){
    ci <- confint(sampleLm[[i]])
    color="red"
    if((ci[2,1] < lm1$coeff[2]) & (lm1$coeff[2] < ci[2,2])){color = "grey"}
    segments(ci[2,1],i,ci[2,2],i,col=color,lwd=3)
}
lines(rep(lm1$coeff[2],100),seq(0,100,length=100),lwd=3)

How you report the inference

sampleLm4$coeff
confint(sampleLm4,level=0.95)

A one inch increase in parental height is associated with a 0.77 inch increase in child's height (95% CI: 0.42-1.12 inches).


P-values

slides | video | transcript

What is a P-value?

Idea: Suppose nothing is going on - how unusual is it to see the estimate we got?

Approach:

  1. Define the hypothetical distribution of a data summary (statistic) when "nothing is going on" (null hypothesis)
  2. Calculate the summary/statistic with the data we have (test statistic)
  3. Compare what we calculated to our hypothetical distribution and see if the value is "extreme" (p-value)

Galton data

(again; case study cont'd)

library(UsingR)
data(galton)
plot(galton$parent, galton$child,
     pch=19, col="blue")
lm1 <- lm(galton$child ~ galton$parent)
abline(lm1,col="red",lwd=3)
  • null hypothesis - no relationship b/w parent + child height
  • null dist. + observed stat.
x <- seq(-20,20,length=100)
plot(x,dt(x,df=(928-2)),col="blue",lwd=3,type="l")
arrows(summary(lm1)$coeff[2,3],0.25,summary(lm1)$coeff[2,3],0,col="red",lwd=4)

Calculating p-values

summary(lm1)
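A quick check (using `lm1` and `galton` from above): the slope p-value summary() reports is the two-sided tail probability of the t statistic under a t distribution with n - 2 degrees of freedom.

tstat <- summary(lm1)$coeff[2,3]
2*pt(-abs(tstat),df=nrow(galton)-2)  # matches summary(lm1)$coeff[2,4]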

A quick simulated example

set.seed(9898324)
yValues <- rnorm(10)
xValues <- rnorm(10)
lm2 <- lm(yValues ~ xValues)
summary(lm2)
  • random x and y
    • should be totally unrelated
x <- seq(-5,5,length=100)
plot(x,dt(x,df=(10-2)),col="blue",lwd=3,type="l")
arrows(summary(lm2)$coeff[2,3],0.25,summary(lm2)$coeff[2,3],0,col="red",lwd=4)
  • "How much probability density exists to the right of the observed statistic?"
    • how much curve lies to the right of the red arrow?
xCoords <- seq(-5,5,length=100)
plot(xCoords,dt(xCoords,df=(10-2)),col="blue",lwd=3,type="l")
xSequence <- c(seq(summary(lm2)$coeff[2,3],5,length=10),summary(lm2)$coeff[2,3])
ySequence <- c(dt(seq(summary(lm2)$coeff[2,3],5,length=10),df=8),0)
polygon(xSequence,ySequence,col="red"); polygon(-xSequence,ySequence,col="red")

Simulate a ton of data sets with no signal

set.seed(8323); pValues <- rep(NA,100)
for(i in 1:100){
  xValues <- rnorm(20);yValues <- rnorm(20)
  pValues[i] <- summary(lm(yValues ~ xValues))$coeff[2,4]
}
hist(pValues,col="blue",main="",freq=F)
abline(h=1,col="red",lwd=3)

set.seed(8323); pValues <- rep(NA,100)
for(i in 1:100){
  xValues <- rnorm(20);yValues <- 0.2 * xValues + rnorm(20)
  pValues[i] <- summary(lm(yValues ~ xValues))$coeff[2,4]
}
hist(pValues,col="blue",main="",freq=F,xlim=c(0,1)); abline(h=1,col="red",lwd=3)

"Under the null hypothesis, the p-values will always be uniformly distributed." "If the alternative hypothesis is true..."

set.seed(8323); pValues <- rep(NA,100)
for(i in 1:100){
  xValues <- rnorm(100);yValues <- 0.2* xValues + rnorm(100)
  pValues[i] <- summary(lm(yValues ~ xValues))$coeff[2,4]
}
hist(pValues,col="blue",main="",freq=F,xlim=c(0,1)); abline(h=1,col="red",lwd=3)
  • p-value depends on the signal (the slope) and the sample size

Some typical values (single test)

  • $P < 0.05$ (significant)
  • $P < 0.01$ (strongly significant)
  • $P < 0.001$ (very significant)

In modern analyses, people generally report both the confidence interval and P-value. This is less true when many hypotheses are tested.

If you are testing many hypotheses, use a significance threshold smaller than 0.05 to account for the multiple testing.

How you interpret the results

summary(lm(galton$child ~ galton$parent))$coeff

A one inch increase in parental height is associated with a 0.77 inch increase in child's height (95% CI: 0.42-1.12 inches). This difference was statistically significant (P<0.001).

P-value on its own is not necessarily super-informative but can be useful in guiding "where to from here" etc.


Regression with Factor Variables

slides | video | transcript

Key ideas:

  • Outcome is still quantitative
  • Covariate(s) are factor variables
    • "factor" roughly equal "qualitative"
  • Fitting lines = fitting means
  • Want to evaluate contribution of all factor levels at once

Example: movie ratings

Go get some RottenTomatoes.com movie ratings:

download.file("http://www.rossmanchance.com/iscam2/data/movies03RT.txt",destfile="data/movies.txt")
movies <- read.table("data/movies.txt",sep="\t",header=T,quote="",stringsAsFactors=TRUE)  # keep rating a factor (the pre-R-4.0 default)
head(movies)

Rotten tomatoes score vs. rating

plot(movies$score ~ jitter(as.numeric(movies$rating)),col="blue",xaxt="n",pch=19)
axis(side=1,at=unique(as.numeric(movies$rating)),labels=unique(movies$rating))

Plot each movie's score against its rating, jittered so stacked points are visible.

Average score by rating

plot(movies$score ~ jitter(as.numeric(movies$rating)),col="blue",xaxt="n",pch=19)
axis(side=1,at=unique(as.numeric(movies$rating)),labels=unique(movies$rating))
meanRatings <- tapply(movies$score,movies$rating,mean)
points(1:4,meanRatings,col="red",pch="-",cex=5)

Apply a marker to indicate the mean score per rating.

Another way to write it down

$$ S_i = b_0 + b_1 \mathbb{1}(Ra_i = \text{PG}) + b_2 \mathbb{1}(Ra_i = \text{PG-13}) + b_3 \mathbb{1}(Ra_i = \text{R}) + e_i $$

where $S_i$ is the score of movie $i$, $Ra_i$ is its rating, and $\mathbb{1}(\cdot)$ is 1 when the condition holds and 0 otherwise (G is the reference category).

Average values

  • $b_0$ = average of the G movies

  • $b_0 + b_1$ = average of the PG movies

  • $b_0 + b_2$ = average of the PG-13 movies

  • $b_0 + b_3$ = average of the R movies

  • slightly more complicated than just calculating the averages (checked against tapply after the R fit below)

  • BUT: this allows us to fit them w/ linear models

Doing it in R

lm1 <- lm(movies$score ~ as.factor(movies$rating))
summary(lm1)

Note: coercing the ratings to factors w/ as.factor
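A quick check (a sketch, not from the lecture): the group means implied by the coefficients should match the averages computed directly with tapply, in level order G, PG, PG-13, R.

cbind(tapply(movies$score,movies$rating,mean),
      lm1$coeff[1] + c(0,lm1$coeff[2:4]))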

Plot fitted values

plot(movies$score ~ jitter(as.numeric(movies$rating)),col="blue",xaxt="n",pch=19)
axis(side=1,at=unique(as.numeric(movies$rating)),labels=unique(movies$rating))
points(1:4,lm1$coeff[1] + c(0,lm1$coeff[2:4]),col="red",pch="-",cex=5)

Question 1

"What is the average difference in rating b/w G- & R-rated movies?"

lm1 <- lm(movies$score ~ as.factor(movies$rating))
summary(lm1)
confint(lm1)

Question 2

"What is the average difference in rating b/w PG-13 and R-rated movies?"

We could rewrite our model...

# `ref` is the "reference category"
lm2 <- lm(movies$score ~ relevel(movies$rating,ref="R"))
summary(lm2)
confint(lm2)

Question 3

"What if you want to know if there is any difference at all b/w any of the different levels of this particular variable?" (at any level?)

ANOVA!

lm1 <- lm(movies$score ~ as.factor(movies$rating))
anova(lm1)

Sum of squares (G movies)

(remember? from ANOVA (above))

gMovies <- movies[movies$rating=="G",]; xVals <- seq(0.2,0.8,length=4)
plot(xVals,gMovies$score,ylab="Score",xaxt="n",xlim=c(0,1),pch=19)
abline(h=mean(gMovies$score),col="blue",lwd=3); abline(h=mean(movies$score),col="red",lwd=3)
segments(xVals+0.01,rep(mean(gMovies$score),length(xVals)),xVals+0.01,
         rep(mean(movies$score),length(xVals)),col="red",lwd=2)
segments(xVals-0.01,gMovies$score,xVals-0.01,rep(mean(gMovies$score),length(xVals)),col="blue",lwd=2)

Tukey's HSD (honestly significant difference) test

less common than the ANOVA

lm1 <- aov(movies$score ~ as.factor(movies$rating))
TukeyHSD(lm1)

Multiple Variable Regression

slides | video | transcript

"Same least squares approach..."

Key ideas:

  • Regression with multiple covariates
  • Still using least squares/central limit theorem
  • Interpretation depends on all variables

Example data: Millennium Development Goal 1 + WHO childhood hunger

download.file("http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*","data/hunger.csv",method="curl")
hunger <- read.csv("data/hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes",]
head(hunger)

Plot percent hungry versus time

lm1 <- lm(hunger$Numeric ~ hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19,col="blue")

"Remember the linear model"

$$ Hu_i = b_0 + b_1Y_i + e_i $$

  • $b_0$ = percent hungry at Year 0
  • $b_1$ = decrease in percent hungry per year
  • $e_i$ = everything we didn't measure

Add the linear model

lm1 <- lm(hunger$Numeric ~ hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
lines(hunger$Year,lm1$fitted,lwd=3,col="darkgrey")

Color by male/female

plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))

"Now two lines..."* (Male/Female w/ separate intercept lines)

lmM <- lm(hunger$Numeric[hunger$Sex=="Male"] ~ hunger$Year[hunger$Sex=="Male"])
lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"])
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
lines(hunger$Year[hunger$Sex=="Male"],lmM$fitted,col="black",lwd=3)
lines(hunger$Year[hunger$Sex=="Female"],lmF$fitted,col="red",lwd=3)

Two lines, same slope

(in R)

lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] ),col="black",lwd=3)

Interactions

Two lines, different slopes (in R)

lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
plot(hunger$Year,hunger$Numeric,pch=19)
points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] +lmBoth$coeff[4]),col="black",lwd=3)

summary(lmBoth)

Interactions for continuous variables

"a little bit tricky" (!?)


Regression in the Real World

video | transcript

(( just watch ))

  • "Previous examples ... everything works out neatly and nicely."
    • "In real life ... regression is very hard" (complicated)
  • Things to pay attn to:
    • confounders
      • variable that is correlated with both dependent & independent variables
      • sometimes detect w/ careful exploration/visualization
    • complicated interactions
    • skewness
    • outliers
      • data points that are outside the expectations
        • ("way too high or way too low")
        • ("something that doesn't fit the pattern")
      • even 1 outlier can dramatically alter the "shape" of the data
        • and/or the visualization
      • careful w/ outliers
        • sometimes they're b/c of broken data collection
        • and/but sometimes they're real data
          • (that just happen to introduce big-time skew...)
    • non-linear patterns
    • variance changes
      • heteroskedasticity
      • there are some things you can do to deal w/ this
        • (list of models/techniques)
    • issues w/ diff. units or scales
      • sometimes you need to standardize
        • not just ensuring they're all using the same units
      • also: "flatten" some of the data (e.g., not just "total deaths" but "deaths per 1000" or some such thing)
    • overloading regression (??)
    • correlation & causation
  • Ideal data for regression?
    • oval shape (like Galton's)

Study Group Supplementary Session

Dave explains residuals from a linear model:

$$ y = mx + b + e $$

  • Dave: "simplest linear model" (above)
    • $mx$ = slope term
    • $b$ = intercept
  • residual: $r_i = y_i - \hat{y}_i$
    • $= y_i - (mx_i + b)$
    • difference b/w the actual observed value and the prediction made by the model (see the sketch below)
  • coefficients are the values that go into the model to make the predictions
    • so: $m$ and $b$ in our "simplest model" (above)
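A minimal sketch (refitting the Galton model under a fresh name, `lmG`, since `lm1` has been reused above): compute the residuals by hand and check them against what lm() stores.

lmG <- lm(child ~ parent,data=galton)
m <- coef(lmG)[2]; b <- coef(lmG)[1]
rByHand <- galton$child - (m*galton$parent + b)
all.equal(as.numeric(rByHand),as.numeric(lmG$residuals))  # TRUE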