In [None]:
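library(tidyverse)
library(magrittr) ## provides the %<>% assignment pipe used below
library(olsrr)    ## provides ols_plot_resid_lev() and ols_test_breusch_pagan() used below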

# Multiple Regression

Now that we've got the hang of conducting and interpreting regression with a single IV, we're going to move on to using several variables to fit a model that best describes how our IVs potentially influence our DV.

With multiple IVs, we're looking at the effect of one of the IVs "holding all other variables constant" or "controlling for all other variables."

We're still finding the best linear relationship between the DV and our IVs.

#### Simple Linear Regression (one IV):
## $y = \beta_0 + \beta_1x_1 + \varepsilon$
<BR>
    
#### Multiple Linear Regression (multiple IVs):

## $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_kx_k + \varepsilon$

### Why?

1. We can explain more of the variance in y.
2. We can describe the relationship between y and different x variables, *while holding the other IVs constant.*
3. If we know that a certain variable could explain some of the bias in our data, we can include that variable in the model to "control" for that bias.
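
To make the idea of "controlling" for a variable concrete, here is a minimal sketch using simulated data (the variables `x1`, `x2`, and `y` are made up for illustration and have nothing to do with the Big Brother data).  Because `x2` influences both `x1` and `y`, leaving it out of the model inflates the coefficient for `x1`; including it brings the estimate back toward the true value of 2.

```r
# Simulated (hypothetical) data: x2 influences both x1 and y
set.seed(42)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

simple   <- lm(y ~ x1)        # omits x2
multiple <- lm(y ~ x1 + x2)   # controls for x2

coef(simple)["x1"]    # too large, because x1 also picks up the effect of x2
coef(multiple)["x1"]  # close to the true value of 2
```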

## Big Brother
Big Brother is a TV show in which groups of adults (typically 16 per season) are locked into a house (studio) for an entire summer (~80-90 days).  They are filmed 24/7 and, in addition to 3 televised episodes per week, viewers can also watch 24/7 "live feeds."  The "houseguests" (HGs) are competing to win a half million dollars.  Each week one houseguest wins the head of household (HOH) competition.  That HOH nominates 2 HGs for eviction.  Those HGs have 1 chance to avoid eviction - the veto competition.  6 HGs participate in the veto competition - the 2 nominees, the HOH, and 3 other HGs based on random draw/HG's choice.  If the winner of the veto removes one of the nominees from "the block," the HOH nominates another remaining HG.  One of the two nominees is voted out each week during a live show, then the process is repeated on a weekly basis until 2 HGs remain.  The winner is chosen between those 2 remaining HGs by a jury of evicted HGs.  

I have obtained data on previous HGs from the last 21 seasons of regular US Big Brother, 1 special online season of US Big Brother (OTT), and 2 special shortened celebrity seasons.  This data was collected by Vince Dixon and made available on his github page.  You can see his methodology and his analysis of the data (specifically focused on racial dynamics) at https://vincedixonportfolio.com/2019/08/29/methodology-behind-big-brother-diversity-data-dive/.

We can't use linear regression to predict a winner, because that's a yes/no variable.  But we can use the data to predict how long a player remains in the season, and to see whether any factors are significantly associated with their success.

Let's load the data directly from github.

In [None]:
## use read_csv to read the data from the raw content link 
bbdata <- read_csv("https://raw.githubusercontent.com/vdixon3/big-brother-diversity-data/master/big_brother_data.csv")

## take a peek at the format/variables
glimpse(bbdata)

Given what I know about Big Brother, I'm going to filter out some of our observations.

1. Regular Seasons 1 and 2 are removed because the game was not fully developed with the current rules at that point.  Season 1 had "America Votes" on nominees.  Season 2 had HOH, but not veto competition.
2. BB OTT is going to be removed since the format was different than the traditional seasons and also involved a component of America influencing nominees.
3. There have been a few people who have "self-evicted" or were "ejected" from the season prematurely.  Given this, their tenure in the game was arbitrarily shortened.

In [None]:
# make a vector of seasons I want to exclude
exclude = c("bbus1", "bbus2", "bbott1")

# use the %in% operator to compare season_code against multiple values. ! operator negates - keep seasons not in the exclude list
bb <- bbdata %>% filter(!(season_code %in% exclude))  %>% ## remove non-standard seasons
        filter(self_evicted == "no" & ejected == "no")  %>% ## remove self-evicted and ejected players
        ## select specific variables to keep
        select(index, first, last, season_code, total_houseguests, total_days, age, gender, race_ethnicity, 
               lgbts, final_eviction_day, total_vetos, total_hoh, total_wins, other_comp_wins, 
               total_nominations, final_placement) ## select columns to keep
summary(bb)

Next we're going to take a look at some of our categorical variables to see if they require any data cleaning.

In [None]:
## look at race/ethnicity to see how many groups and size of groups
table(bb$race_ethnicity)

## keep white, black, and lump all other categories into other
bb %<>% mutate(race_ethnicity = fct_lump(race_ethnicity, 2))

## look at table after lumping
table(bb$race_ethnicity)

In [None]:
## look at the lgbt variable to see what the possible values are and the distribution/frequencies
table(bb$lgbts)

## keep non-lgbt and lump all other distinctions into "other"
bb %<>% mutate(lgbts = fct_lump(lgbts, 1))

## confirm our recode
table(bb$lgbts)

In [None]:
## total_hoh and total_vetos are numbers, but they are stored as character in the dataset - convert to numeric

bb %<>% mutate(total_vetos = as.numeric(total_vetos), total_hoh = as.numeric(total_hoh))

## review the summary again
summary(bb)

One more data cleaning step - The total_days variable on each record indicates the number of days in that player's season.  There is some variation in total days - from as few as 26 (celebrity seasons) to as many as 99 days (more current regular seasons).  In order to put tenure in the game on a common scale, I'm going to rescale each player's days in the game to a standard 99 day season.

In [None]:
## create new variable that is days in game standardized to a 99 day season
bb %<>% mutate(days_in_game = final_eviction_day / total_days * 99)
summary(bb$days_in_game)

Wait, what?  Why is there a player whose days in game is longer than the highest possible value (99 days)?  Let's inspect:

In [None]:
## print players with days in game larger than 99
bb %>% filter(days_in_game > 99)


<img src="images/metta.jpg" width="400" height="400">

This appears to be a data entry error - Metta World Peace made it to day 20 of CBB1, not day 40.  We can edit this value.


In [None]:
## correct final_eviction_day for Metta World Peace
bb[bb$first == "Metta",]$final_eviction_day <- 20

# recalculate our standardized days in game variable
bb %<>% mutate(days_in_game = final_eviction_day / total_days * 99)  %>% select(-final_eviction_day, -final_placement)
summary(bb$days_in_game)

Looks good!  Let's look at some relationships between our numeric variables.

In [None]:
colnames(bb)

In [None]:
options(repr.plot.width=15, repr.plot.height=15) ## plot size options for Jupyter notebook ONLY
#scatterplot matrix
pairs(select_if(bb, is.numeric))

In [None]:
# correlation matrix
cor(bb[c(7,11:16)])

It looks like the variable with the largest correlation with days in game is total_hoh wins.  The positive correlation indicates that as number of HOH wins increases, the player's days in the game also increases.  Let's use this variable as an IV in our first model, a simple linear regression model, to predict tenure. Let's look closer at a scatterplot.

In [None]:
options(repr.plot.width=6, repr.plot.height=5) ## plot size options for Jupyter notebook ONLY

## use ggplot to create a scatterplot with a "smoothed" line defined by the lm.  We can show the se, or CI of our line as well.
bb %>% ggplot(aes(x=total_hoh, y=days_in_game)) + ## indicate df, x and y variables.
  geom_point()+ ## scatterplot
  geom_smooth(method=lm, se=TRUE, color = "magenta", fill = "pink") ## method is lm, show CI

## Fitting the Model - Simple Linear Regression, again
Now we'll fit our model, using total HOH wins to predict a player's days in the game.

In [None]:
mod1 <- lm(days_in_game ~ total_hoh, data=bb) ## here I've saved the resulting model to "mod1"
summary(mod1)

## Interpreting Model Output - One IV
We'll go ahead and interpret our output to see what it tells us about the influence of HOH wins on tenure in the game.

1. Our intercept is 52 days.  Because zero is a possible value of HOH wins, we can interpret this as the average number of days in the game (or the predicted/fitted value of days in the game) **when a player has zero HOH wins**.  So on average a player with zero HOH wins makes it about halfway through the game (52 of 99 days).  

2. Our coefficient for HOH wins is 15.  This means that for each increase of one HOH win there is, on average, a 15 day increase in days in the game.  In a modern/recent season with 99 days, that would be about 2 weeks in the game - as each eviction cycle is 7 days.  So they would, on average, remain for 2 more cycles.  They cannot be evicted during the cycle in which they are HOH, so this represents an additional week on top of that, **ON AVERAGE**.

3. The p-value for the total HOH coefficient is below an alpha of 0.05, so total HOH is a significant predictor of tenure in the game.  The coefficient is significantly different from 0.

4. Our R-squared is 0.34.  This means that number of HOH wins explains 34% of the variance in days in game.  HOH wins is an important factor in success in the game.

5. The p-value associated with our F-statistic is well below an alpha of 0.05.  This means that the inclusion of our IV in our model significantly improves the prediction of tenure over a null model, or a model with no IVs (just using the mean/intercept to explain the variance).  This also tells us our R-squared is significantly different from zero.
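
Every number interpreted in the list above can also be pulled out of the saved `summary()` object instead of being read off the printout.  The sketch below uses simulated data (the `wins` and `days` variables are hypothetical stand-ins, not the Big Brother data) and also shows that, with a single IV, the overall F-test and the t-test for the slope are the same test.

```r
# Simulated (hypothetical) stand-ins for HOH wins and days in game
set.seed(1)
wins <- rpois(200, 1)
days <- 50 + 15 * wins + rnorm(200, sd = 20)

fit <- lm(days ~ wins)
s   <- summary(fit)

s$coefficients   # estimates, standard errors, t values, p-values
s$r.squared      # proportion of variance explained

# Overall F-test: F value plus numerator and denominator degrees of freedom
f       <- s$fstatistic
p_model <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# With one IV, F equals the squared t value for the slope
s$coefficients["wins", "t value"]^2
```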

We need to check our assumptions!

## Post Hoc Checks - Model 1

### Linear in the Parameters / Errors are Independent
The assumption of linear in the parameters says that the variables are entered into the model correctly (that the relationships are truly linear).  This is not something we can test directly, but something we need to confirm as modelers.  However, if the residuals show a curvilinear pattern against the fitted values, the errors are not independent, which may be an indication that we need to add a squared term for one of our predictors.

In [None]:
# non-PQ / quick and dirty plot
# note - we're plotting with our saved model output - mod1
plot(mod1, which = 1) ## residuals vs. fitted is plot #1

Looking at the red line, there is potentially a slight curvilinear relationship here we might need to address.  That could mean that there is a benefit to winning HOH competitions, but at a certain point, that benefit is not as large.

### Right variables 
Like linear in the parameters, this assumption relies on the analyst to make sure they're using the right variables in the model.

### Normally Distributed Errors
This we're already familiar with from ANOVA - we look at the normality of the distribution of the residuals using a QQ plot.

In [None]:
## plot() is used with our saved lm output.  the argument which allows us to print only the plot we specify, 
## the second plot is our QQ plot.
plot(mod1, which = 2)

It appears that the distribution of residuals deviates from a normal distribution in the tails - especially the upper tail.  This particular shape of a QQ plot indicates that the distribution of residuals is over-dispersed vs. a standard normal distribution - the "tails" of the distribution are wider than expected. 
(Good quick reference for QQ plot interpretation: http://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html)

### Influential Outliers
We need to make sure that we don't have any observations that 

1. have inordinate influence on the fit of the line, and 
2. are outliers.  

For this we look at a plot of Residuals vs. Leverage.  We're looking for observations that surpass the red dotted line that indicates a threshold of Cook's D = 0.5.

In [None]:
plot(mod1, which = 5) ## plot 5 is Residuals vs. Leverage

A few of our points are labeled, but none of our observations are outside of the dashed red lines that indicate our threshold for Cook's distance (in fact, because we have no such observations, the threshold line doesn't even appear on our plot).  So these are either outliers that do not have inordinate influence on the line, or points that have potentially large leverage but are not outliers.  Out of curiosity, let's see who these players are.

In [None]:
bb[c(168,181,296),]

In [None]:
ols_plot_resid_lev(mod1)

Frankie Grande's large number of HOH wins was partially due to the format of the season he played in - for the majority of the season there was a twist in which there were 2 HOHs per week.  While none of these observations are influential outliers, we could choose to remove the entirety of Season 16 if we were concerned about potential bias there.
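
Cook's distance can also be checked numerically rather than read off the plot.  Here is a minimal sketch using base R's `cooks.distance()` and `hatvalues()` on simulated data, where one high-leverage outlier has been planted on purpose (the data and the planted point are hypothetical).

```r
# Simulated data with one planted influential outlier (observation 51)
set.seed(7)
x <- c(rnorm(50), 8)                   # the last point has high leverage
y <- c(2 * x[1:50] + rnorm(50), -10)   # ...and falls far off the line
fit <- lm(y ~ x)

d <- cooks.distance(fit)
which(d > 0.5)        # observations past the Cook's D = 0.5 threshold
hatvalues(fit)[51]    # leverage of the planted point
```

Running the same two calls on a saved model such as `mod1` would list any observations past the threshold by row number.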

### Homoscedasticity
Let's check to see if our model violates the assumption of constant variance.

In [None]:
ols_test_breusch_pagan(mod1)

We reject the null hypothesis and therefore conclude the variance is not constant.  However, we know this test is pretty sensitive.  Let's look at our plot of residuals vs. fitted values and a scale-location plot.

In [None]:
plot(mod1, which = c(1, 3)) ## plot #1 is residuals vs. fitted, plot #3 is scale-location

There is definitely a "cone" or "funnel" shape, so I'd conclude we are violating the assumption of homoscedasticity.
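
The logic behind the Breusch-Pagan test can be sketched by hand: if the variance is constant, the squared residuals should not be predictable from the fitted values.  The version below is the n × R-squared ("studentized") form of the test on simulated data, so treat it as an illustration of the idea; its statistic may not match what `ols_test_breusch_pagan()` prints, and the data are hypothetical.

```r
# Simulated data where the error spread grows with x (heteroscedastic)
set.seed(3)
x <- runif(300, 0, 10)
y <- 5 + 2 * x + rnorm(300, sd = 0.5 + 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression: do the squared residuals depend on the fitted values?
aux  <- lm(resid(fit)^2 ~ fitted(fit))
stat <- length(resid(fit)) * summary(aux)$r.squared   # n * R-squared
p    <- pchisq(stat, df = 1, lower.tail = FALSE)      # 1 df: one auxiliary predictor
p   # a small p-value means we reject constant variance
```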

## Model 2 - More variables!
Time to add more variables.  Let's add another numerical variable, this time the number of nominations.  A player has to be nominated before they can be evicted, so nominations could definitely influence days in game.  Let's see how it affects our model.

In [None]:
mod2 <- lm(days_in_game ~ total_hoh + total_nominations, data=bb) 
summary(mod2)

## Model 2 - Interpretation
It's interpretation time again! This time we have two slope coefficients!

1. Our intercept is now 37.  Why did it change?  It's because this is now the mean of y (fitted value of y) when both hoh wins AND nominations are zero.  So when a player has zero HOH wins and zero nominations they should last, on average, 37 days in the game.  This might not be considered interpretable, however, since you (typically) can't leave the game without being nominated.

2. The coefficient for HOH wins has changed too!  Why is this?  It's because now we interpret this coefficient as:
**An additional HOH win corresponds to a 12 day increase in days in game....**
    - ...holding all other variables in the model constant.
    - ...holding all other variables constant.
    - ...all else equal.
    - ...accounting for other variables in the model.
    - ...when controlling for `total_nominations`.
    - ...holding all other variables fixed.
    
    This means that the coefficient reflects the sole influence of HOH wins separated from the influence of the other variables. 
    HOH wins remains a significant predictor of days in the game - the p-value is below alpha = 0.05 and therefore we reject the null hypothesis - the coefficient is significantly different from zero.  However, some of the effect attributed to HOH wins in the first model is better explained by total number of nominations.
   
3. The coefficient for nominations is also significantly different from zero, indicating that nominations is also a significant predictor of days in the game.  It's not in the direction you might expect given that nominations lead to eviction which ends one's game.  However, the longer a player is in the game the more chances they have to be nominated.  

    Our interpretation of the coefficient is:  **Holding all other variables equal, each additional nomination corresponds to an average 8 day increase in days in the game.**
    
4. Our R-squared is now 0.49.  Remember this reflects the amount of variance explained by the entire model, and not any single IV.  So overall, with both of our IVs, the model is explaining almost 50% of the variance in tenure.  That's not bad at all!

5. The overall F-test and related p-value.  This test is also an "omnibus" test of the entire model - is the model with both of our IVs better than a null model with **no predictors**?  We reject the null hypothesis, therefore our model is significantly better than a null model.

But what if we wanted to know if this model is better than model 1?  Can we do that?

### Model Comparison
We can compare models to see if one model is a significant improvement over a previous model.  In order to do this, our reduced/smaller model needs to be **nested** inside our larger model.  This means that the smaller model needs to include a subset of the variables in the larger model and cannot include any variables that are not in the larger model.

Both models also need to be built from the same dataset - if you remove influential outliers between models, they are not nested.

In our case our second model contains the IV from the first model, HOH wins, and an additional IV.  Let's see if this model is significantly better than the previous model.  

The way we do this is through an F-test, but this time instead of comparing our model to a null model, we're instead comparing our "full" (larger) model to a "reduced" (smaller) model.

In [None]:
## we use the anova() NOT aov() function and pass it our two saved models.
anova(mod1, mod2)

What does this mean?  

- The numerator degrees of freedom (Df) of 1 indicates that our full model has 1 more parameter (IV) than our reduced model. 
- Res.Df is the denominator degrees of freedom, or the residual degrees of freedom.  It's the number of observations minus the number of parameters in the full model.
- But the most important part is the p-value.  For this test our null hypothesis is that the models fit equally well; the alternative hypothesis is that the fuller model is a significant improvement over the smaller model.  In this case our p-value is below alpha = 0.05, therefore our model with 2 IVs is a significant improvement over our model with only one IV.
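
To see where the numbers in that table come from, the F-statistic for a nested comparison can be computed by hand from the residual sums of squares of the two models.  This sketch uses simulated data (the variables `x1`, `x2`, and `y` are hypothetical) and checks the hand calculation against `anova()`.

```r
# Simulated (hypothetical) data with two real predictors
set.seed(11)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 1 * x2 + rnorm(n)

reduced <- lm(y ~ x1)        # nested model
full    <- lm(y ~ x1 + x2)   # full model

rss_r  <- sum(resid(reduced)^2)                      # residual sum of squares, reduced
rss_f  <- sum(resid(full)^2)                         # residual sum of squares, full
df_num <- df.residual(reduced) - df.residual(full)   # extra parameters in the full model
df_den <- df.residual(full)                          # residual df of the full model

f_stat <- ((rss_r - rss_f) / df_num) / (rss_f / df_den)
p_val  <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

anova(reduced, full)   # the F and Pr(>F) columns should match f_stat and p_val
```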

**IMPORTANT**: It does not matter which model is labeled mod1 or listed first.  The models could be labeled "hohmod" and "nommod" or something like "bob" and "sue."  The F-test always tests the nested model vs. the full model.  

**If we fail to reject null, the smaller (nested) model is preferred.**

**If we reject null, the larger (full) model is preferred.**

This is a typical source of confusion.  You can fit a full model, then a nested model, and do the F-test.  Just because you fit the full model first doesn't matter - the interpretation is based on which is the full and which is the nested, **NOT** on which is fitted first.

Why do we care about including predictors that don't significantly improve the model?

### Parsimony
When we construct statistical models we want to obtain the best fitting model that uses the fewest variables / assumptions.  We want to use the simplest model that is adequate to explain our data. There is no benefit to including more predictors that do not significantly improve the model.

Caveat - model specification should be driven by theory, not necessarily model fit statistics.  So if there is a good reason to control for age in a model of a medical outcome, you would want to retain age in the model as a "control variable" even if it was not statistically significant.

## Model 2 - Post Hoc Checks
We have determined that model 2 is significantly better than the first model, so we should proceed with checking our assumptions.  I will use the "quick and dirty" method of just checking with the built-in, non-PQ R plots.  These are **NOT** appropriate for homework or your Project 4.  I will show you how to create PQ output in the Lab.

In [None]:
## quick and dirty model fit plots
plot(mod2) ## by default includes the 4 we're interested in, don't need to specify which =

1. Residuals vs. Fitted - we are violating the assumption of independent errors - there is a definite curvilinear relationship between the residuals and the predicted values.  This means there is a variable we're not accounting for in the model, which may be a squared term of a variable we're already including.
2. QQ Plot.  The residuals appear to be approximately normally distributed - the deviations in the tails are not as severe as with model 1.
3. Scale-Location - it appears we have that cone/funnel shape (also evident in the plot of Residuals vs. Fitted).  This indicates that we likely have heteroscedasticity, the violation of the assumption of constant variance.  We could run the Breusch-Pagan Test to confirm if we wished.
4. Residuals vs. Leverage - we don't see the red dashed line that indicates Cook's D, and have no observations that surpass that threshold, therefore have no influential outliers.

## Model 3 - Adding a Categorical Predictor
We have two numerical IVs, but what happens when we add a categorical variable?  Let's add gender - which is a categorical variable with only two levels.  Remember this will be "dummy" coded - it will be included in the model as a 0/1 variable -- an "on/off" switch.  By default R uses the first level of the factor (for a character variable, the alphabetically first value) as the "reference group" to which the other levels of the categorical variable are compared.
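
Here is a small sketch of what that dummy coding looks like behind the scenes, using a made-up four-value `gender` vector (hypothetical values, not the real data).  `model.matrix()` shows the 0/1 column that `lm()` would build, and `relevel()` is one way to pick a different reference group.

```r
# Hypothetical categorical variable with two levels
gender <- factor(c("female", "male", "male", "female"))

levels(gender)           # the first level listed is the reference group
model.matrix(~ gender)   # column `gendermale` is 1 for male, 0 for female

# relevel() changes the reference group if a different baseline is wanted
gender2 <- relevel(gender, ref = "male")
colnames(model.matrix(~ gender2))
```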

In [None]:
mod3 <- lm(days_in_game ~ total_hoh + total_nominations + gender, data=bb) 
summary(mod3)

## Interpretation:

1. **Intercept** - The intercept is 38 days.  This is now the average days in game when number of HOH wins equals zero, nominations equals zero, **AND** gender is at the reference value (female).  So for a female with zero HOH wins and zero nominations, the average of days in game is 38.  However, as discussed previously, this is not meaningful as you typically cannot leave the game without being nominated.  This type of subtlety in interpretation is why it's important for the analyst to understand their data.

2. **total_hoh** - Same as above (with the addition of also holding gender constant), this is interpreted as the average increase in days in game with a 1 unit increase in HOH wins, _**holding all else constant**_.  So when we don't vary total nominations or gender, an additional HOH win is associated with an increase of about 12 days in the game.  This number is similar to, but not exactly the same as, the coefficient for total_hoh in the second model.  Remember this is the effect of total_hoh when "controlling for" the other variables in the model.  This is a significant predictor (p < 0.001).  It is important to check the p-value for each coefficient in each model; sometimes the addition of another variable might cause your predictor to no longer be significant.

3. **total_nominations** - This is also the same interpretation as in model 2, with a slightly different coefficient estimate, except now we're holding both total_hoh and gender constant.  Each nomination is associated with an 8 day increase, on average, in days in game, holding HOH wins and gender constant.  This coefficient also remains significant.

4. **gendermale** - This is the coefficient for "gender == male," which we interpret with respect to the reference group, female.  So in comparison to females, males, on average, last 2 fewer days in the game.  However, with p = 0.38, gender is not a significant predictor of days in game.

5. **adjusted r-squared** - This model (the combination of all three variables) explains 50% of the variance in days in game. This adjusted r-squared is actually slightly lower than the one we saw in the previous model (0.4978).  Unlike plain r-squared, which can never decrease when a predictor is added, adjusted r-squared penalizes each extra predictor, so the drop means that adding gender did not improve the predictive power of the model.

6. **F-statistic** - Remember this compares this model with three predictors to the null / empty model with ZERO predictors.  This model with three predictors is significantly better than a null model, but that doesn't tell us if this model with gender as a third predictor is significantly better than the second model without gender.  

But, we can test whether model 3 is a significant improvement over model 2.  In this case, model 2 is now the nested model, and model 3 is the full model.

In [None]:
## we use the anova() NOT aov() function and pass it our two saved models.
anova(mod2, mod3)

This model is not a significant improvement over model 2 without gender - the p-value is above 0.05.  Therefore, in the interest of parsimony, we will not include gender in the model.
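
The adjusted r-squared reported by `summary()` is what lets us compare models with different numbers of predictors: it is 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors.  A minimal sketch on simulated data, where `noise` is a hypothetical predictor unrelated to `y` by construction and `adj_r2()` is a helper written just for this example:

```r
# Simulated (hypothetical) data: x1 matters, noise does not
set.seed(5)
n     <- 150
x1    <- rnorm(n)
noise <- rnorm(n)
y     <- 3 + 2 * x1 + rnorm(n)

small <- summary(lm(y ~ x1))
big   <- summary(lm(y ~ x1 + noise))

# Hypothetical helper: adjusted R-squared from R-squared, n, and p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(small$r.squared, big$r.squared)           # plain R-squared never goes down
c(small$adj.r.squared, big$adj.r.squared)   # adjusted R-squared can go down
adj_r2(big$r.squared, n, p = 2)             # same value as big$adj.r.squared
```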

Before we move on, maybe we'll try to add age to the model (another numerical variable).

### THREE VARIABLES ?!?!?!

In [None]:
mod4 <- lm(days_in_game ~ total_hoh + total_nominations + age, data=bb) 
summary(mod4)

### Model Interpretation

Quick interpretations!

1. The intercept is definitely no longer meaningful - since age can't equal 0.

2. HOH wins is a significant predictor of days in game.  Holding all other variables constant, each HOH win is associated with an additional 13 days in the game.

3. Number of nominations is a significant predictor of days in game.  Holding all other variables constant, each additional nomination is associated with an increase of 7.6 days in game.

4. Age is a significant predictor of days in game, but has a small unstandardized effect size.  An increase of one year in age is associated with an increase of 0.46 days in game, holding all else equal.  This is about half a day.  But you need to remember that just because this coefficient is smaller than the other coefficients, it doesn't mean it's a less important predictor.  A difference of one year in age is not comparable to winning 1 HOH competition or being nominated one time. If we wanted to know the impact of a 10 year increase in age (for example, for a 20 y/o player vs. a 30 y/o player), that would be a 4.6 day increase in days in game.  This is a reminder that we cannot compare the magnitude of the unstandardized coefficient estimates to determine effect size directly; we need to keep in mind how big a change a one unit increase in our x value represents.

5. All of our IVs explain 51% of the variance in days in game. (r-squared)

6. Our model is significantly better than a null model, but is adding age to the model a significant improvement over the second model?
    - We cannot compare this to mod3 - gender is not in this model so one is not nested inside the other.  Also, we decided not to retain gender as a predictor.

In [None]:
anova(mod2, mod4)

Here our full model is mod4 (including age) and our nested model is mod2 (just HOH and noms).  The p-value is less than an alpha of 0.05, therefore the fuller model is a significant improvement over the nested model.  We therefore prefer and retain the fourth model.
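
The point about units in the age interpretation can be checked directly: rescaling a predictor rescales its coefficient and changes nothing else about the fit.  A quick sketch with simulated data (the `age` and `days` values are hypothetical):

```r
# Simulated (hypothetical) ages and days in game
set.seed(9)
age  <- sample(21:60, 200, replace = TRUE)
days <- 40 + 0.5 * age + rnorm(200, sd = 15)

per_year   <- lm(days ~ age)           # coefficient is days per 1 year of age
per_decade <- lm(days ~ I(age / 10))   # coefficient is days per 10 years of age

coef(per_year)[2] * 10    # effect of a 10 year difference, from the per-year model
coef(per_decade)[2]       # same number, estimated directly

# The fit itself is unchanged by the rescaling
c(summary(per_year)$r.squared, summary(per_decade)$r.squared)
```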

### Post Hoc Checks
Quick Check of our plots!

In [None]:
# remember, these are not PQ and should not be used for HW or project
plot(mod4)

That curvilinear relationship on the residuals vs. fitted plot is still there - we maybe should address that. 

Normality of residuals looks good.

Scale-location indicates potential heteroscedasticity (violation of the assumption of constant variance).

There do not appear to be any influential outliers.

## Addressing the Curvilinear relationship 
There has been a clear curvilinear relationship evident in the plot of residuals vs. fitted values since our first model.  We need to address this to improve our model.  Let's start with adding total_hoh squared to the model.

In [None]:
mod5 <- lm(days_in_game ~ total_hoh + I(total_hoh^2) + total_nominations + age, data=bb) 
summary(mod5)

I could have created a new variable with total_hoh squared, but instead I elected to just do the calculation within the formula.  In order to do this, I need to wrap it in `I()`: inside a formula `^` is a formula operator rather than arithmetic, and `I()` tells R to evaluate the calculation as ordinary arithmetic before fitting the model.
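
It's easy to forget the `I()`, and R won't warn you.  This sketch on simulated data (hypothetical `x` and `y`) shows that without `I()` the squared term silently disappears from the model.

```r
# Simulated (hypothetical) data with a true quadratic relationship
set.seed(2)
x <- runif(100, 0, 6)
y <- 10 + 8 * x - 1 * x^2 + rnorm(100)

no_I   <- lm(y ~ x + x^2)      # `^` is formula syntax here, so this is just y ~ x
with_I <- lm(y ~ x + I(x^2))   # I() makes R square x arithmetically

names(coef(no_I))     # only the intercept and x
names(coef(with_I))   # intercept, x, and the squared term
```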

Let's see if this model is a significant improvement over model 4 before proceeding.

In [None]:
anova(mod4, mod5)

The p-value is less than alpha, therefore this full model (model 5) is preferred over the nested model (model 4).  The inclusion of total_HOH squared improved the fit of the model.

Before we interpret the model, let's check our plots to see if this fully addressed the curvilinear relationship between the residuals and the fitted values.

In [None]:
#quick and dirty - do not use on HW or in Projects
plot(mod5)

A curvilinear relationship remains.  Let's add squared nominations to the model and see if that helps.

In [None]:
mod6 <- lm(days_in_game ~ total_hoh + I(total_hoh^2) + total_nominations + I(total_nominations^2) + age, data=bb) 
anova(mod5, mod6)
plot(mod6)

Adding that squared term did improve the model, but there is still a small amount of curvilinearity.  Let's add age squared as well.

In [None]:
mod7 <- lm(days_in_game ~ total_hoh + I(total_hoh^2) + total_nominations + I(total_nominations^2) + age + I(age^2), data=bb) 
anova(mod6, mod7)
plot(mod7)

That quadratic term does not significantly improve the model.  So we would likely decide to use Model 6, with the squared terms for both nominations and HOH wins, as the final model.  Let's now look at the interpretation with the squared terms included.

In [None]:
summary(mod6)

### Final Model Interpretation

1. **Intercept** - not meaningful because age cannot equal 0.
2. **total_hoh** - the coefficient is 21, but because total_hoh squared is also in the model, this is no longer the effect of one additional HOH win on its own.  We can't change HOH wins while holding HOH wins squared constant, so the two terms have to be read together: 21 is the slope of the fitted curve at zero HOH wins, holding the other variables constant.
3. **total_hoh squared** - the coefficient is -2.6, holding all else constant.  Number of HOH wins squared is hard to interpret by itself, but what is more important is that this is a significant predictor of days in game and that its negative sign bends the curve downward: each additional HOH win adds fewer days than the one before, and with enough HOH wins the predicted days in game could start to decline.  The relationship between HOH wins and days in game is not perfectly linear.  This could reflect a situation where players who are seen winning more competitions are viewed as a threat and then targeted for eviction.
4. **total_nominations** - the coefficient is 11, which, as with total_hoh, has to be read together with its squared term: it is the slope of the fitted curve at zero nominations, holding all else constant.
5. **total_nominations squared** - the coefficient is -0.6.  Again, similar to the HOH squared term, a one unit increase in nominations squared is not interpretable by itself; what is important is that this is a significant predictor and reflects that idea of "diminishing returns."  This is accounting for what we mentioned earlier - nominations lead to eviction.  You need to stay in the game to earn more nominations, so overall total nominations is associated with an increase in days in game, but the more you're nominated the more chances you have to be ultimately evicted.
6. **age** - for each increase of one year in age there is an increase of 0.42 days in game, controlling for total HOH wins, total nominations, and the squared terms for both.  Again, this is not a large number, probably because a 1 year difference in age is not a large difference.  If we were interested in a 10 year increase in age we would multiply the coefficient by 10, so a 10 year increase in age would be associated with an average increase of 4.2 days in the game, holding all else constant.
7. **r-squared** - Our final model explains 54% of the variance in days in game, which is really good for "real world" data.
8. **Model F-test** - not surprising, this model is significantly better than a null or empty model.
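
Because a variable and its square move together, it can help to work out what the pair of coefficients implies for one more win at different starting points.  This sketch uses the rounded HOH coefficients quoted in the list above (21 and -2.6), so the results are approximate; `gain_from_next_win()` is a helper written just for this illustration.

```r
# Rounded coefficients quoted in the interpretation above
b_hoh  <- 21     # total_hoh
b_hoh2 <- -2.6   # total_hoh squared

# Predicted change in days in game from one more HOH win, starting at x wins,
# holding nominations and age constant
gain_from_next_win <- function(x) b_hoh + b_hoh2 * ((x + 1)^2 - x^2)

gain_from_next_win(0:4)   # gain from the 1st, 2nd, 3rd, 4th, and 5th win

# Number of HOH wins at which the fitted curve peaks
peak <- -b_hoh / (2 * b_hoh2)
peak
```

With these rounded values the first win is worth about 18 days, each later win is worth less, and the curve tops out at roughly 4 HOH wins.  The same arithmetic applies to the nominations terms.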