# Class 10: Linear models 

This class notebook is designed to let you practice the basics of modelling simple datasets and testing null hypotheses. The datasets are small and simple compared to the project datasets. This is so that you can learn the concepts of data modelling and hypothesis testing without the additional burden of cleaning and manipulating large, messy datasets. 

In this notebook you are analysing datasets of two variables: one response variable and one explanatory variable. In the first two datasets the explanatory variable is categorical and the response variable is numerical. In the last two datasets both the explanatory and the response variables are numerical.

Everything you need to complete these analyses is covered in today's lecture and the accompanying example notebooks. The example notebooks work through the examples in the lecture in more detail. You should make use of those as a reference for completing the analyses in this notebook.

## Imports

In [None]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols

warnings.filterwarnings("ignore")

## Part 1. One categorical explanatory variable. Numerical response variable.

### Do the horns of horned lizards protect them from predation?

<div>
<img src="attachment:lizard.jpg" width='50%' title=""/>
</div>

The horned lizard *Phrynosoma mcallii* is named for the fringe of horns surrounding the head. What are the horns for? One idea proposed by researchers (Young et al. 2004) is that they offer some protection against being eaten by one of their main predators, the loggerhead shrike, *Lanius ludovicianus*. The shrike skewers the lizard on thorns or barbed wire to save for later eating.

The researchers measured the horn lengths of horned lizards that had not been predated (and hence were alive and free at some point in time) and the horn lengths of lizards that had been predated (and hence were skewered and dead at the same point in time). If their hypothesis were true then we might expect to see a difference in mean horn lengths between living and killed lizards. But if their hypothesis is wrong then we might expect to see no such difference.

The data collected by the researchers are in the file `../Datasets/horned_lizards.csv`.

<div class="alert alert-warning">

Use the [ladybirds.ipynb](ladybirds.ipynb) notebook to help you answer this question.
</div>

- Read in the data and print it to see what it looks like.

- Use an appropriate graph to visually examine the relationship between horn length and predation status. 

- State the null and alternative hypotheses. 

> Write your null and alternative hypotheses here

- Write the model formula for the relationship between horn length and predation status.
- Fit the model and test the null hypothesis.
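
As a rough guide to the fit-and-test step, here is a minimal sketch on made-up data. The data frame `toy` and its columns `group` and `length` are hypothetical and are not the horned lizard measurements; only `ols` comes from the notebook's imports. It collects the numbers you would normally read off the summary table into one small table.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data for illustration only; NOT the horned lizard measurements.
# The column names `group` and `length` are hypothetical.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "length": np.concatenate([rng.normal(24, 2, 20), rng.normal(22, 2, 20)]),
})

# Numerical response on the left of `~`, categorical explanatory variable on the right.
fit = ols("length ~ group", data=toy).fit()

# One row per parameter: the mean of the reference level ("a") and the
# difference in means between "b" and "a".
ci = fit.conf_int()  # columns 0 and 1 hold the lower and upper 95% limits
report = pd.DataFrame({
    "estimate": fit.params,
    "ci_lower": ci[0],
    "ci_upper": ci[1],
    "t": fit.tvalues,
    "p": fit.pvalues,
})
print(report)
```

The row for the difference (`group[T.b]` in this toy example) is the one that tests the null hypothesis of no difference in means. `fit.summary()` shows the same numbers in a fuller table.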

- Report the outcome of the test as you would in a scientific report or paper. This means reporting the estimate of the difference including its 95% CI, the value of the test statistic (in this case *t*) and the *p*-value. Also say whether or not the outcome supports the biological hypothesis.

> Write your conclusion here.

### Can light shone on knees reset your circadian clock?

<div>
<img src="attachment:elucidating-the-mechan.jpg" width='50%' title=""/>
</div>

A scientific paper published in 1998 (Campbell and Murphy 1998) reported that the circadian clock of people with jetlag can be reset by shining light on the backs of their knees. A later paper (Wright and Czeisler 2002) reexamined this controversial result. The new experiment measured the phase shift in the circadian cycle of people woken from sleep and subjected to one of three interventions: 1) Light shone in eyes only, 2) light shone on backs of knees only, or 3) no light shone at all. This last intervention is the control group against which the other two interventions were compared. The phase shift was measured after two days of intervention.

The data are in the file `../Datasets/knees.csv`. The variable `shift` is the phase shift measured in hours.

There are two biological hypotheses here. 
1. Shining light on the backs of knees causes a phase shift in the circadian clock.
2. Shining light on eyes causes a phase shift in the circadian clock.

<div class="alert alert-warning">

Use the [ladybirds.ipynb](ladybirds.ipynb) notebook to help you answer this question.
</div>

- Read in the data and print it to see what it looks like.

- Use an appropriate graph to visually examine the relationship between light treatment and phase shift. 


- State the null and alternative hypotheses for both biological hypotheses. 

> Write your null and alternative hypotheses here.

Although the explanatory variable "treatment" has three levels, "control", "knee" and "eyes", we still write the model formula as 

    'response variable ~ explanatory variable'
    
Python will know that the explanatory variable "treatment" has three levels and its summary output table will contain these three estimates:
1. The estimate of the mean phase shift of "control"
2. The estimate of the difference in the mean phase shifts between "control" and "eyes"  
3. The estimate of the difference in the mean phase shifts between "control" and "knee"  
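
The three estimates listed above can be seen in a small sketch on made-up numbers. The column name `treatment` is an assumption about the file, `shift` is the name given in the text, and the values are random noise rather than the data in `knees.csv`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Made-up phase shifts for illustration only; NOT the values in knees.csv.
# The column name `treatment` is an assumption.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "treatment": np.repeat(["control", "knee", "eyes"], 8),
    "shift": rng.normal(0, 0.5, 24),
})

fit = ols("shift ~ treatment", data=toy).fit()

# Levels are sorted alphabetically, so "control" becomes the reference level
# and the other two parameters are differences from it.
print(fit.params)

# Check against the raw group means.
means = toy.groupby("treatment")["shift"].mean()
print(means["eyes"] - means["control"], means["knee"] - means["control"])
```

The parameter labelled `Intercept` is the control mean, and the labels `treatment[T.eyes]` and `treatment[T.knee]` mark the two differences from control.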

- Write the model formula for the relationship between phase shift and light treatment.
- Fit the model and test the null hypotheses.

- Report the outcome of the test as you would in a scientific report or paper. This means reporting the estimates of the differences including their 95% CIs, the values of the test statistic (in this case *t*) and the *p*-values. Also say whether or not the outcomes support the biological hypotheses.

> Write your conclusion here.

In the 1998 study it was later found that the subjects had inadvertently experienced low levels of light to their eyes.

## Part 2. One numerical explanatory variable. Numerical response variable.

### Is soil nitrogen content affected by the number of different earthworm species?

<div>
<img src="attachment:worms.jpg" width='50%' title=""/>
</div>

The forests of northern USA and Canada have no native earthworms. However, earthworms have been introduced by humans and are dramatically changing the soil. To examine if earthworms are changing the nitrogen content of soil, scientists (Gundale et al. 2005) measured nitrogen content and the number of different species of earthworms in 39 hardwood forests in Michigan. 


The data are in the file `../Datasets/earthworms.csv`. Nitrogen content is recorded as a percentage.

<div class="alert alert-warning">

Use the [rattlesnakes.ipynb](rattlesnakes.ipynb) notebook to help you answer this question.
</div>

- Read in the data and print it to see what it looks like.

- Use an appropriate graph to visually examine the relationship between nitrogen content and number of worm species. 

- State the null and alternative hypotheses regarding the relationship between nitrogen content and number of worm species. 

> Write your null and alternative hypotheses here.

- Write the model formula for the relationship between number of earthworm species and soil nitrogen content.
- Fit the model and test the null hypothesis.
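
When the explanatory variable is numerical, the parameter of interest is the slope, and the usual null hypothesis is that the slope is zero. The sketch below uses made-up data with hypothetical columns `x` and `y`, not the earthworm measurements, to show one way of pulling out the slope, its 95% CI, *t* and *p*.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data for illustration only; the columns `x` and `y` are hypothetical.
rng = np.random.default_rng(3)
x = rng.integers(0, 6, 30)
toy = pd.DataFrame({"x": x, "y": 1.0 + 0.5 * x + rng.normal(0, 0.3, 30)})

fit = ols("y ~ x", data=toy).fit()

# The row labelled "x" is the slope: the change in y per unit increase in x.
slope = fit.params["x"]
lower, upper = fit.conf_int().loc["x"]
print(f"slope = {slope:.3f} (95% CI {lower:.3f} to {upper:.3f}), "
      f"t = {fit.tvalues['x']:.2f}, p = {fit.pvalues['x']:.3g}")
```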

- Report the outcome of the test as you would in a scientific report or paper.

> Write your conclusion here.

### What is the relationship between "file" length and call frequency of bush crickets?

<div>
<img src="attachment:fotolia_4278755_XS.jpg" width='50%' title=""/>
</div>

Bush crickets call by rubbing their forewings together so that a scraper on one wing rubs against a file on the other wing. Scientists (Gua et al. 2012) wanted to know what the relationship is between file length and call frequency.

The file `../Datasets/bush_crickets.csv` contains the song frequency (in Hertz) and file length (in mm) of 58 crickets.

<div class="alert alert-warning">

Use the [brain_mass.ipynb](brain_mass.ipynb) notebook to help you answer this question.
</div>

- Read in the data and print it to see what it looks like.

- Use an appropriate graph to visually examine the relationship between song frequency and file length.

Seaborn's regression line does not fit the data well. It also predicts negative song frequencies for file lengths longer than about 5 mm, which is, of course, impossible.

This means that the relationship between file length and song frequency is not a simple straight line but something more complicated: exponential, quadratic or logarithmic, for example. To work out what that relationship is, let's first examine whether each variable is normally distributed.

- Test if `songFrequency` is normally distributed.
- Test if `fileLength` is normally distributed.

You should find that neither `songFrequency` nor `fileLength` is normally distributed. In that case, let's try log-transforming them both and retesting for normality.

- Log-transform songFrequency and fileLength and retest these for normality.
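
One possible way to run these normality checks is sketched below. It assumes the Shapiro-Wilk test from SciPy (`scipy.stats.shapiro`), which may not be the test used in the example notebooks, and it uses a made-up right-skewed sample rather than the cricket data.

```python
import numpy as np
from scipy import stats

# Made-up right-skewed sample for illustration only; NOT the cricket data.
rng = np.random.default_rng(4)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=60)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a
# normal distribution, so a small p-value is evidence against normality.
w_raw, p_raw = stats.shapiro(raw)

# Log-transform and test again.
logged = np.log(raw)
w_log, p_log = stats.shapiro(logged)

print(f"raw:    W = {w_raw:.3f}, p = {p_raw:.3g}")
print(f"logged: W = {w_log:.3f}, p = {p_log:.3g}")
```

With the real data, replace `raw` with the relevant column of your data frame, for example `songFrequency`.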

You should find that both log-songFrequency and log-fileLength are normally distributed.

- Plot log-songFrequency against log-fileLength to see whether the fit of the regression line improves. 

You should find that this looks much better with the data now showing a linear relationship.

This means then that the relationship between file length and song frequency has the form

$$
\log(\mathrm{song\ frequency}) = \mathrm{intercept} + \mathrm{slope} \times \log(\mathrm{file\ length})
$$

That is, log-song frequency and log-file length are linearly related.

- Write the model formula for the relationship between log-song frequency and log-file length.
- Fit the model.
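
A hedged sketch of fitting a straight line on the log-log scale follows. The data are simulated from an arbitrary power law, so the estimates are not those of the bush crickets. The column names `songFrequency` and `fileLength` follow the text, while `logSongFrequency` and `logFileLength` are hypothetical helper columns.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated power-law data for illustration only; NOT the bush cricket data.
rng = np.random.default_rng(5)
file_length = rng.uniform(0.5, 5.0, 58)
toy = pd.DataFrame({
    "fileLength": file_length,
    "songFrequency": 20000.0 * file_length ** -1.0 * np.exp(rng.normal(0, 0.1, 58)),
})

# Natural logs of both variables, stored as new (hypothetical) columns.
toy["logFileLength"] = np.log(toy["fileLength"])
toy["logSongFrequency"] = np.log(toy["songFrequency"])

fit = ols("logSongFrequency ~ logFileLength", data=toy).fit()
intercept = fit.params["Intercept"]
slope = fit.params["logFileLength"]
print(f"log(song frequency) = {intercept:.2f} + ({slope:.2f}) x log(file length)")
```

Storing the logged values as columns is one design choice. Applying `np.log` inside the formula string should be equivalent.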

- Write the formula for the numerical relationship between log-song frequency and log-file length using the estimates from the model fit.

> Write your formula here.

- Report the outcome of the test as you would in a scientific report or paper.

> Write your conclusion here.

The fossil of a 165-million-year-old extinct bush cricket species, *Archaboilus musicus*, has been discovered. Its file length is 9.34 mm. What was its song frequency based on your analysis of living bush crickets?
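
Because the model is fitted on the log scale, a prediction has to be back-transformed with the exponential function: song frequency = exp(intercept + slope × log(file length)). The helper `predict_frequency` below is a hypothetical name, and the coefficients in the usage line are placeholders rather than estimates from the cricket data. The sketch assumes natural logs were used in the fit.

```python
import numpy as np

def predict_frequency(intercept, slope, file_length):
    """Back-transform a prediction from a log-log linear model (natural logs)."""
    log_frequency = intercept + slope * np.log(file_length)
    return np.exp(log_frequency)

# Placeholder coefficients for illustration only; use the estimates from your own fit.
example = predict_frequency(intercept=np.log(20000.0), slope=-1.0, file_length=2.0)
print(example)
```

To answer the question, pass in your fitted intercept and slope together with the fossil's file length.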