Unit 1, Sprint 3, Module 3

### Learning Objectives:
- Student can model the relationship of multiple predictor variables to a single outcome.
- Student can conduct a t-test to determine the significance of individual variables in the model.
- Student can compare model fit using R-squared.

### Returning to our forensic science example:



Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)? 




Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the Longbones.csv dataset. 

**We wish to determine if we can improve our model of the relationship between the nitrogen composition of the long bones of the skeleton and the length of time the body was interred.**

Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.

In [None]:
#Import the Longbones.csv dataset from a URL and print the first few rows

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Longbones/Longbones.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

#Drop records with NAs and rename the dataset longbones

longbones = df.dropna()

longbones.head()

When we left off yesterday, we had tested for and found a statistically significant relationship between long bone nitrogen composition and interment time.

The null hypothesis is that they are unrelated and the slope is equal to 0.  The alternative hypothesis is that they are related and the slope is not equal to 0.  In symbolic notation:

$H_0: \beta_1 = 0$

$H_a: \beta_1 \neq 0$

Today let's take a look at some of the information about model fit displayed in the upper right-hand side of the top box of the regression output.

**R-squared** (also abbreviated $R^2$) is a number between 0 and 1 that is often expressed as a percent.  In simple linear regression, it is equal to the square of the correlation coefficient.



The interpretation of R-squared is "the percent of the variability in the y variable that is explained by differences in the x variable(s)".  


In this case, $R^2$ = 0.45 = 45%.  The interpretation is that 45% of the variability in interment time is explained by differences in nitrogen composition of the long bones.



That means that 55% of the variability in interment time remains unexplained by the model.  

This remaining variability can be seen in the scatterplot, where we have skeletons with the same nitrogen composition but different interment times.

How do we address this unexplained variability?  

Well, one obvious place to start is to see if some of the other variables can help us understand why skeletons with the same nitrogen composition have different interment times.

We can do this by adding additional terms to our model and testing to see if they are statistically significant.


Simple linear regression model: $\hat{y} = \beta_0 + \beta_1x$

Multiple regression model: $\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \dots$

As with simple linear regression, we use t-tests to determine if model features are statistically significantly associated with the target. 


Let's start by looking at Lime, which indicates if the body was buried with Quicklime or not.  


We can ask the question, "Does burial with Quicklime change the relationship between long bone nitrogen composition and interment time?"

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Color the points differently depending on whether the body was buried with Quicklime.
ax.scatter(x=longbones['Nitro'], y=longbones['Time'], c=longbones['Lime'])

ax.set_title('Long Bone Nitrogen Composition and Interment Time')
ax.set_xlabel('Long Bone Nitrogen Composition in g/100g of Bone')
ax.set_ylabel('Interment Time in Years')

plt.show()

There does seem to be a different pattern in the relationship between nitrogen composition and interment time depending on if the body was buried with Quicklime, and we can test this claim formally using a hypothesis test.

Our multiple regression model is: $\hat{y} = \beta_0 + \beta_1 \times \text{Nitro} + \beta_2 \times \text{Lime}$

Now we can test the null hypothesis that burial with Quicklime is not associated with interment time against the alternative hypothesis that burial with Quicklime is associated with interment time.  In symbolic notation:

$H_0: \beta_2 = 0$

$H_a: \beta_2 \neq 0$
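One way to fit this model and run the t-tests is the formula interface in `statsmodels`. The sketch below is an assumption about tooling, not a record of the code that produced the output discussed in this lesson. The helper name `fit_ols` is hypothetical, and the column names `Time`, `Nitro` and `Lime` are the ones used in the plotting cell above.

```python
import pandas as pd
from statsmodels.formula.api import ols


def fit_ols(data: pd.DataFrame, formula: str):
    """Fit an ordinary least squares model described by a formula string.

    The formula puts the target on the left of '~' and the features,
    joined by '+', on the right, e.g. 'Time ~ Nitro + Lime'.
    """
    return ols(formula, data=data).fit()
```

In this notebook, `model = fit_ols(longbones, 'Time ~ Nitro + Lime')` followed by `print(model.summary())` should print a regression table with one row per term. `model.params` holds the fitted coefficients and `model.pvalues` holds the p-value of the t-test for each one.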

In the regression output, "Lime" now appears below Nitro, alongside its model coefficient, standard error, t-statistic, p-value and confidence interval.

Both Nitro and Lime are statistically significant predictors of interment time, so it makes sense to include both of them in our statistical model.

We can write the model using the coefficients shown in the table as:

$\hat{y} = 180.47 - 35.63 \times \text{Nitro} + 25.45 \times \text{Lime}$



When we have a model with more than one feature, we talk about each one being significantly associated with the target *after taking into account the effects of the other variables*.

After taking into account long bone nitrogen composition, burial with Quicklime is statistically significantly associated with interment time.

And vice versa.

After taking into account burial with Quicklime, long bone nitrogen composition is statistically significantly associated with interment time.

Because we now have two significant predictors in our model instead of just one, we'd expect the percent of variability in y explained by the model to be higher for the model with both nitrogen composition and Quicklime than for the model with just nitrogen composition.

The regression output gives us multiple measures of model predictive ability and fit, but we are going to continue to focus on $R^2$ (AIC, BIC and log likelihood are all less easily interpretable measures of model fit).





However, regular old $R^2$ has one big problem.  It never goes down when you add more terms to the model, and it almost always goes up, even if the new terms aren't statistically significant.

Adjusted $R^2$ penalizes you just a little bit for adding in new model terms, so it will only increase if the new term really adds to the predictive ability of the model.
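The usual formula for the penalty is $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ is the number of predictors. The sketch below applies it; the function name and the sample size `n = 35` are hypothetical values chosen for illustration, not figures stated in this lesson.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


n = 35  # hypothetical sample size, for illustration only

# One predictor with R-squared = 0.45
one_predictor = adjusted_r_squared(0.45, n, 1)

# A second predictor that barely raises R-squared (0.45 -> 0.451)
two_predictors = adjusted_r_squared(0.451, n, 2)

print(f"{one_predictor:.3f}")   # 0.433
print(f"{two_predictors:.3f}")  # 0.417
```

Plain $R^2$ went up slightly in the second case, yet adjusted $R^2$ went down, because the small gain did not outweigh the penalty for the extra term.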

In the model with just nitrogen composition, adjusted $R^2$ = 0.433

In the model with both nitrogen composition and burial with Quicklime, adjusted $R^2$ = 0.762

That's a huge increase in predictive ability!


Now we can say that about 76% of the variability in interment time can be explained by differences in long bone nitrogen composition and use of Quicklime at burial.

Can we do even better?

Let's add burial depth to the model.

Our multiple regression model is: $\hat{y} = \beta_0 + \beta_1 \times \text{Nitro} + \beta_2 \times \text{Lime} + \beta_3 \times \text{Depth}$

This time we want to test the null hypothesis that, after accounting for differences in nitrogen composition and burial with Quicklime, burial depth is not associated with interment time, against the alternative hypothesis that burial depth is associated with interment time.  In symbolic notation:

$H_0: \beta_3 = 0$

$H_a: \beta_3 \neq 0$

In this case we **fail to reject the null hypothesis**: after accounting for nitrogen composition and burial with Quicklime, we do not have evidence that burial depth is associated with interment time.

Looking at the adjusted $R^2$, we see the predictive ability of this model falls a tiny bit to 0.75.

We will choose not to include burial depth in our model of interment time.  Our current best model remains:

$\hat{y} = 180.47 - 35.63 \times \text{Nitro} + 25.45 \times \text{Lime}$



We have one more variable to consider.  Let's add age at the time of death to the previous model and see if that adds to the predictive ability.

Our multiple regression model is: $\hat{y} = \beta_0 + \beta_1 \times \text{Nitro} + \beta_2 \times \text{Lime} + \beta_3 \times \text{Age}$

This time we want to test the null hypothesis that, after accounting for differences in nitrogen composition and burial with Quicklime, age at the time of death is not associated with interment time, against the alternative hypothesis that age at the time of death is associated with interment time.  In symbolic notation:

$H_0: \beta_3 = 0$

$H_a: \beta_3 \neq 0$

Again we **fail to reject the null hypothesis**: after accounting for nitrogen composition and burial with Quicklime, we do not have evidence that age at the time of death is associated with interment time.  The adjusted $R^2$ also decreases, as it did when we added burial depth.
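The model comparison we just walked through can be sketched as a small loop that fits each candidate model and collects its adjusted $R^2$. This again assumes `statsmodels` is available; the helper name `compare_adjusted_r2` is hypothetical, and `Depth` and `Age` are assumed to be the column names for burial depth and age at death.

```python
import pandas as pd
from statsmodels.formula.api import ols


def compare_adjusted_r2(data: pd.DataFrame, formulas: list[str]) -> pd.Series:
    """Fit one OLS model per formula and return each model's adjusted R-squared."""
    scores = {
        formula: ols(formula, data=data).fit().rsquared_adj
        for formula in formulas
    }
    return pd.Series(scores, name='adjusted_r2')


# Candidate models from this lesson (column names Depth and Age are assumed).
candidate_formulas = [
    'Time ~ Nitro',
    'Time ~ Nitro + Lime',
    'Time ~ Nitro + Lime + Depth',
    'Time ~ Nitro + Lime + Age',
]
```

Calling `compare_adjusted_r2(longbones, candidate_formulas)` in this notebook would return one adjusted $R^2$ per model, which makes it easy to see which added terms actually improve the fit.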

Based on the data we have, the best model to predict interment time of a skeleton is:

$\hat{y} = 180.47 - 35.63 \times \text{Nitro} + 25.45 \times \text{Lime}$

Let's play CSI one more time and use it to make predictions about how long a body has been interred for different values of long bone nitrogen composition and Quicklime application.

- A body that was not buried with Quicklime and has 3.88 g of nitrogen per 100 g of bone remaining in the long bones of the skeleton.

We estimate this body has been buried for about 42.2 years.

- A body that was buried with Quicklime and has 3.59 g of nitrogen per 100 g of bone remaining in the long bones of the skeleton.

We estimate this body has been buried for about 78 years.
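These predictions come from plugging values into the fitted equation. A minimal sketch follows, using the rounded coefficients shown in this lesson and assuming Lime is coded 1 for burial with Quicklime and 0 otherwise; the function name is hypothetical. Because the coefficients are rounded to two decimals, results may differ slightly in the second decimal from values computed with the full-precision fitted model.

```python
def predict_interment_years(nitro: float, lime: int) -> float:
    """Predicted interment time in years from the two-feature model.

    nitro: nitrogen in g per 100 g of bone
    lime:  1 if buried with Quicklime, 0 if not (assumed coding)
    """
    return 180.47 - 35.63 * nitro + 25.45 * lime


# The two bodies described in the text
no_lime = predict_interment_years(nitro=3.88, lime=0)
with_lime = predict_interment_years(nitro=3.59, lime=1)

print(f"No Quicklime, 3.88 g/100 g:   {no_lime:.1f} years")
print(f"With Quicklime, 3.59 g/100 g: {with_lime:.1f} years")
```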



---



What happens if I have a lot - like thousands - of potential features?

There are many, many machine learning algorithms that have been built to evaluate the most important features of a dataset, and you will learn several of the most common ones as you move throughout the units of the data science curriculum.

However, at a very basic level, they all work the same way we did today by evaluating how much information a particular feature contributes to the model.