# MATH 3350 Course Notes - Module S9 _Supplement_

## Analysis with Categorical Variables

Recall that in multiple regression, we are trying to model a potential "true" relationship:
<center>
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
</center>

where $y$ is the _dependent_ variable (also known as the _response_ variable) and $x_1, x_2, \ldots, x_k$ are the predictors (_independent_ variables).

We **estimate** this relationship with our regression model:

<center>
$\widehat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
</center>

Each predictor is assumed to be numeric. However, categorical variables can also be included in the model by encoding them as binary _indicator_ variables (0 or 1). A categorical variable with $m$ categories needs $m - 1$ indicators: one category serves as the baseline, and each remaining category gets its own indicator. R performs this conversion automatically when a categorical variable appears in a model formula. We will use the class height data set as an example, where gender is the categorical variable we want to include in our analysis.
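
As a minimal sketch of this conversion, the snippet below builds a small made-up data frame (the name `toy` and its values are hypothetical, not taken from the class data) and uses `model.matrix()` to display the numeric columns R constructs for a formula containing a categorical variable.

```r
# Hypothetical toy data (not the class data set)
toy <- data.frame(
  Height = c(64, 66, 70, 72),
  Gender = c("F", "F", "M", "M")
)

# model.matrix() shows the numeric columns R builds from the formula.
# Gender becomes a single 0/1 column named GenderM; "F" is the
# baseline level because it comes first alphabetically.
X <- model.matrix(Height ~ Gender, data = toy)
print(X)
```

The `GenderM` column is 0 for the two female rows and 1 for the two male rows; no separate column is created for the baseline category.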

In [None]:
#Look at data set
heights <- read.csv("ClassHeightData.csv")
head(heights)

First it is helpful to view the scatterplot, with different markers for the two genders.

In [None]:
plot(heights$Height ~ heights$Shoe.Size,col=factor(heights$Gender),main="Height and Shoe Size by Gender", 
     ylab="Height (inches)", xlab="Shoe Size (US)", pch=(as.integer(as.factor(heights$Gender))+15), cex=1.5 )

Now we can fit two separate models (one for each gender) and add their fitted lines to the plot, as follows.

In [None]:
# Split the data by gender
females <- heights[heights$Gender=="F",]
males <- heights[heights$Gender=="M",]

# Fit a separate simple regression for each gender
f_model <- lm(Height ~ Shoe.Size, data = females)
m_model <- lm(Height ~ Shoe.Size, data = males)

# Redraw the scatterplot, then overlay both fitted lines
plot(heights$Height~heights$Shoe.Size,col=factor(heights$Gender),main="Height and Shoe Size by Gender", 
     ylab="Height (inches)", xlab="Shoe Size (US)", pch=(as.integer(as.factor(heights$Gender))+15), cex=1.5)
abline(f_model, lwd=2)                    # female line: solid, default color
abline(m_model, col="red", lty=2, lwd=2)  # male line: dashed, red

In [None]:
summary(f_model)

In [None]:
summary(m_model)

### Predictions with Gender-Specific Models
The female model predicts that a female with shoe size 8.5 will have height $\approx 50.636 + 1.818(8.5) \approx 66.1$ inches. 

The male model predicts that a male with shoe size 8.5 will have height $\approx 57.373 + 1.229(8.5) \approx 67.8$ inches. 
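
The same kind of prediction can be obtained with `predict()` instead of by hand. The sketch below uses a small made-up data frame (`toy_f`, `toy_model`, and the values are hypothetical) so that it runs on its own; with the notebook's objects, the analogous call would be `predict(f_model, newdata = data.frame(Shoe.Size = 8.5))`.

```r
# Hypothetical toy data (not the class data set)
toy_f <- data.frame(
  Shoe.Size = c(6, 7, 8, 9, 10),
  Height    = c(62, 63, 65, 66, 69)
)

toy_model <- lm(Height ~ Shoe.Size, data = toy_f)

# Prediction by hand: intercept + slope * shoe size
b <- coef(toy_model)
by_hand <- unname(b["(Intercept)"] + b["Shoe.Size"] * 8.5)

# Same prediction via predict(); the column name in newdata
# must match the predictor name used in the formula
via_predict <- unname(predict(toy_model, newdata = data.frame(Shoe.Size = 8.5)))

c(by_hand = by_hand, via_predict = via_predict)
```

Because `predict()` uses the unrounded coefficients, its result can differ slightly from a hand calculation based on the rounded values in the `summary()` output.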

## Predicting Height Using Only Gender
What if we want to use only the categorical variable _Gender_ to predict height?

In [None]:
g_model <- lm(Height ~ Gender, data=heights)
summary(g_model)

### Interpreting Coefficients for a Categorical Variable

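Notice that, apart from the intercept, the only coefficient is **_GenderM_**. This is an indicator variable: it equals **_1 for males and 0 for females_**. Females are the baseline category because R orders the categories alphabetically, so "F" comes first.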

Therefore, the predicted height for every female is $64.5 + 6.227(0) = 64.5$ inches.  
The predicted height for every male is $64.5 + 6.227(1) \approx 70.7$ inches.  

Notice below that these predictions are simply the _mean height_ for each gender in this data set.


In [None]:
mean(females$Height)

In [None]:
mean(males$Height)
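
This agreement is not a coincidence. With a single 0/1 predictor, least squares chooses $b_0$ and $b_1$ to minimize

<center>
$\sum_{\text{females}} (y_i - b_0)^2 + \sum_{\text{males}} \left(y_i - (b_0 + b_1)\right)^2$
</center>

Each sum is minimized when its constant equals that group's mean, so

<center>
$b_0 = \bar{y}_F, \qquad b_0 + b_1 = \bar{y}_M, \qquad b_1 = \bar{y}_M - \bar{y}_F$
</center>

Here $\bar{y}_F$ and $\bar{y}_M$ (notation introduced only for this note) are the mean heights of the females and males in the data set. The GenderM coefficient, $b_1 \approx 6.227$, is therefore the difference between the two group means.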

## Using Gender and Shoe Size Together As Predictors


Next, we use R to create one model that uses shoe size _and_ takes gender into account automatically.  

In [None]:
#The '.' character indicates we want to use all variables (other than Height) as predictors in this model
model <- lm(Height~., data=heights)   
summary(model)

### Interpreting the Combined Model

Again, GenderM appears as a predictor; set it to 0 when predicting for females and to 1 when predicting for males.

The combined model predicts that a female with shoe size 8.5 will have height $\approx 55.021 + 1.2431(8.5) + 2.2012(0) \approx 65.6$ inches. 

The combined model predicts that a male with shoe size 8.5 will have height $\approx 55.021 + 1.2431(8.5) + 2.2012(1)  \approx 67.8$ inches. 
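
Another way to read the combined model is as two parallel lines, one per gender. Writing $x_1$ for shoe size and substituting GenderM $= 0$ or $1$ into the fitted equation gives:

<center>
$\text{Female: } \widehat{y} = 55.021 + 1.2431\, x_1$
</center>

<center>
$\text{Male: } \widehat{y} = (55.021 + 2.2012) + 1.2431\, x_1 = 57.2222 + 1.2431\, x_1$
</center>

Both lines share the slope 1.2431, and the GenderM coefficient shifts the male line up by 2.2012 inches at every shoe size. This differs from the two gender-specific models fitted earlier, whose slopes (1.818 and 1.229) were estimated separately.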

Finally, below are two diagnostic plots for the combined model: residuals vs. fitted values and the normal Q-Q plot of the residuals.

In [None]:
plot(model, which=c(1,2))