<H1> Logistic Regression

Recall the linear model

$$y = \alpha + \beta x + \epsilon$$


Linear models are appropriate when the response variable ($y$) is continuous and the errors ($\epsilon$) are normally distributed. Many experimental designs (for example, case-control studies) have categorical outcomes and/or outcomes whose errors are not normally distributed. Binary categorical outcomes can be analyzed using a method called 'logistic regression'. The theory was covered in the statistics lectures, so this lab focuses on the R implementation. We first briefly review some definitions and notation.

In logistic regression, we model the *probability* of a binary outcome $y$, given the independent variable $x$. That is, $Y|x$ is Bernoulli, with success probability $p$ given by:

$$\mathbb{E}(Y|x) = p = \frac{1}{1+e^{-(\alpha + \beta x)}}$$

Note that $p$ is bounded between $0$ and $1$ and is defined for all values of $\alpha + \beta x$.
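The boundedness of $p$ can be checked numerically. The following is a minimal sketch in R; the coefficient values and the helper name `p_of_x` are hypothetical, chosen only for illustration.

```r
# Hypothetical coefficients, chosen only for illustration
alpha <- -1
beta  <- 0.5

# Success probability p as a function of x, following the formula above
p_of_x <- function(x) 1 / (1 + exp(-(alpha + beta * x)))

x <- seq(-20, 20, by = 5)
p <- p_of_x(x)
print(round(p, 4))

# Over this range p stays strictly inside (0, 1) and increases with x (beta > 0)
stopifnot(all(p > 0), all(p < 1), all(diff(p) > 0))
```

With these values, $\alpha + \beta x = 0$ at $x = 2$, so `p_of_x(2)` is exactly 0.5.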

We use the notation $expit(t)$ to denote the following:

$$expit(t) = \frac{1}{1+e^{-t}}$$

The function $expit$ is known as the *logistic function*.

Similarly, we use $logit(t)$ to denote:

$$logit(t) = \log\left(\frac{t}{1-t}\right)$$

Note that:

$$logit(expit(t)) = t$$

so that 

$$logit(p) = \alpha + \beta x$$
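The inverse relationship between the two functions can be verified numerically. This is a small sketch: `expit` and `logit` are defined by hand here, with $logit(p) = \log(p/(1-p))$, and compared against base R's `plogis()` and `qlogis()`, which compute the same two functions for the standard logistic distribution.

```r
# Hand-written versions of the two functions
expit <- function(t) 1 / (1 + exp(-t))
logit <- function(p) log(p / (1 - p))

eta <- c(-3, -1, 0, 0.5, 2)   # arbitrary values of the linear predictor
p   <- c(0.1, 0.5, 0.9)       # arbitrary probabilities

# logit undoes expit, and vice versa (up to floating-point error)
stopifnot(isTRUE(all.equal(logit(expit(eta)), eta)))
stopifnot(isTRUE(all.equal(expit(logit(p)), p)))

# Base R equivalents: plogis() is expit, qlogis() is logit
stopifnot(isTRUE(all.equal(expit(eta), plogis(eta))))
stopifnot(isTRUE(all.equal(logit(p), qlogis(p))))
```

In practice, `plogis()` and `qlogis()` are usually preferable to hand-written versions because they are more careful about numerical accuracy in the tails.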

Because $logit(p) = logit(\mathbb{E}(Y|x)) = \alpha + \beta x$, logistic regression is considered a *generalized linear model*. Roughly, that means there is a function (the *link function*, $logit$ in this case) that transforms the conditional expectation of the outcome, given the data, into a linear function of the predictors.
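To make the link-function idea concrete, here is a sketch on simulated data (not the lab's data set). The true values of `alpha` and `beta`, the sample size, and the seed are all arbitrary assumptions. The sketch checks that predictions on the link scale are linear in $x$, and that predictions on the response scale are the $expit$ of the link-scale predictions.

```r
set.seed(1)

# Simulate Bernoulli outcomes from a logistic model with assumed parameters
n     <- 5000
alpha <- -0.5
beta  <- 1.2
x     <- rnorm(n)
y     <- rbinom(n, size = 1, prob = plogis(alpha + beta * x))

# Fit the logistic regression; binomial() uses the logit link by default
fit <- glm(y ~ x, family = binomial)
print(coef(fit))   # estimates on the logit (log-odds) scale

# Link-scale predictions are the linear predictor alpha_hat + beta_hat * x
eta_hat <- predict(fit, type = "link")
p_hat   <- predict(fit, type = "response")
linear  <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * x

stopifnot(isTRUE(all.equal(unname(eta_hat), linear)))
# Response-scale predictions are expit applied to the link-scale predictions
stopifnot(isTRUE(all.equal(plogis(eta_hat), p_hat)))
```

With a sample this large, the printed estimates should land near the assumed `alpha` and `beta`, although the exact values depend on the random draw.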


So, how do we perform a logistic regression in R? First, let's get a data set:

In [46]:
Titanic<-read.csv("titanic.csv")

In [47]:
# Assumption: Tdata, the name used in the model call below, refers to this same data frame.
# attach() is avoided: re-running it masks earlier copies of the columns.
Tdata <- Titanic
library(plyr)
# revalue() returns a recoded copy. It is stored separately (pclass_first is a
# hypothetical name), so Tdata$pclass keeps its three levels in the model below.
pclass_first <- revalue(Tdata$pclass, c("1st" = "1", "2nd" = "0", "3rd" = "0"))
head(Tdata, 10)


Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
1,1,1st,1,"Allen, Miss Elisabeth Walton",29.0,Southampton,"St Louis, MO",B-5,24160 L221,2,female
2,2,1st,0,"Allison, Miss Helen Loraine",2.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
3,3,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,(135),male
4,4,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
5,5,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11,male
6,6,1st,1,"Anderson, Mr Harry",47.0,Southampton,"New York, NY",E-12,,3,male
7,7,1st,1,"Andrews, Miss Kornelia Theodosia",63.0,Southampton,"Hudson, NY",D-7,13502 L77,10,female
8,8,1st,0,"Andrews, Mr Thomas, jr",39.0,Southampton,"Belfast, NI",A-36,,,male
9,9,1st,1,"Appleton, Mrs Edward Dale (Charlotte Lamson)",58.0,Southampton,"Bayside, Queens, NY",C-101,,2,female
10,10,1st,0,"Artagaveytia, Mr Ramon",71.0,Cherbourg,"Montevideo, Uruguay",,,(22),male


In [42]:
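# Logistic regression of survival on sex, passenger class, age, and the class-by-age interaction
fit.survive <- glm(survived ~ sex + pclass + age + age:pclass, family = "binomial", data = Tdata)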

In [43]:
summary(fit.survive)


Call:
glm(formula = survived ~ sex + pclass + age + age:pclass, family = "binomial", 
    data = Tdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8679  -0.6631  -0.3145   0.5478   2.6208  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    4.175621   0.576154   7.247 4.25e-13 ***
sexmale       -3.130012   0.245882 -12.730  < 2e-16 ***
pclass2nd     -0.558110   0.687139  -0.812  0.41666    
pclass3rd     -2.730404   0.726326  -3.759  0.00017 ***
age           -0.039911   0.012260  -3.255  0.00113 ** 
pclass2nd:age -0.030098   0.019484  -1.545  0.12240    
pclass3rd:age  0.001764   0.022794   0.077  0.93831    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 869.54  on 632  degrees of freedom
Residual deviance: 536.87  on 626  degrees of freedom
  (680 observations deleted due to missingness)
AIC: 550.87

Number of Fisher Scorin