## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct ones. Follow the instructions for each Task carefully.

### Instructions

* **Download this notebook** as you would any other ipynb file
* **Upload** to Google Colab or work locally (if you have that set-up)
* **Delete `raise NotImplementedError()`**
* Write your code in the `# YOUR CODE HERE` space
* **Execute** the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
* **Save** your notebook when you are finished
* **Download** as an `ipynb` file (if working in Colab)
* **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

# Lambda School Data Science - Unit 1 Sprint 3 Module 2

## Module Project: Inference for Linear Regression

### Learning Objectives

* identify the appropriate hypotheses to test for a statistically significant relationship between two quantitative variables
* conduct and interpret a t-test for the slope parameter
* identify the appropriate parts of the output of a linear regression model and use them to build a confidence interval for the slope term.
* make the connection between the t-test for a population mean and a t-test for a slope coefficient.
* identify violations of the assumptions for linear regression

### Total notebook points: 10

## Introduction

### Statistical significance between head size and brain weight in healthy adult humans

The `Brainhead.csv` dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. The study authors used cadavers to see whether a relationship could be determined between brain weight and other, more easily measured physiological characteristics such as age, sex, and head size.

The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if there is a linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123.


**Use the above information to complete Tasks 1 - 9.**

### Warmup Questions

Recall from the Module 1 Project that we were working with the brain weight (`Brain`) and head size (`Head`) variables. We identified the dependent and independent variables, plotted our variables on the appropriate axes, and then described the strength of the relationship.

* `Brain` (brain weight in g) - **dependent** variable (y)
* `Head` (head size in cubic cm) - **independent** variable (x)

Now, we're going to bring back some statistics from Sprint 2 and look at the statistical association between head size and brain weight.

First, some warmup questions! These are not autograded but are part of completing the project.

**Warmup Q1** - What type of statistical test will we use to determine if there is a statistically significant association between head size and brain weight in the population?

ANSWER:


**Warmup Q2** - Write the null and alternative hypotheses you would use to test for a statistically significant association between head size and brain weight.

ANSWER:



**Task 1** - Load the data

As we usually begin, let's load the data! The URL has been provided.

* load your CSV file into a DataFrame named `df`

In [None]:
# Task 1

# Imports
import pandas as pd
import numpy as np

data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Brainhead/Brainhead.csv'

# YOUR CODE HERE
raise NotImplementedError()

# Print out your DataFrame
df.head()

In [None]:
# Task 1 - Test

assert isinstance(df, pd.DataFrame), 'Have you created a DataFrame named `df`?'
assert len(df) == 237


**Task 2** - Fit OLS model

Now, we're going to fit a regression model to our two variables. We're going to use `statsmodels.formula.api` and import the `ols` model. This import has been provided for you.

* Fit a model and name your variable `model`
* Using `model.params[1]`, assign the slope to the variable `slope`. Your variable should be a float (`numpy.float64`).
* In the same way, using `model.params[0]`, assign the intercept to the variable `intercept`. Your variable should be a float (`numpy.float64`).
* Print out your model summary.

*Hint: Make sure to use the format Y ~ X for the model input.*
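For intuition about what the fit produces, here is a minimal sketch of the least-squares calculation for a single predictor, using only the standard library. The data values are made up for illustration and are not from `Brainhead.csv`, and `fit_line` is a hypothetical helper, not part of `statsmodels`.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y ~ x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations in x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data that lies exactly on the line y = 2 + 3x
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [5.0, 8.0, 11.0, 14.0]

intercept_demo, slope_demo = fit_line(x_demo, y_demo)
print(intercept_demo, slope_demo)
```

The return order mirrors `model.params`, where position 0 holds the intercept and position 1 holds the slope.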

In [None]:
# Task 2

# Import statsmodels - DON'T Delete
from statsmodels.formula.api import ols

# Set-up and fit the model in one step
# (format Y ~ X)

# YOUR CODE HERE
raise NotImplementedError()

# Print the model summary
print(model.summary())

**Task 2 - Test**

In [None]:
# Task 2 - Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 3** - Formulate the statistical model

Using the model parameters returned above, you will now write out the statistical model as a linear equation. Remember, we are predicting brain weight from head size.

* write your equation below, with LaTeX formatting
* write your equation in Python
    * assign the dependent variable to `y_hat`
    * assign the independent variable to `x` with a value of `4000`
    * write out your slope and intercept terms as floats (you don't need to use the variables you created earlier)
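As a reminder of the general shape only (not the answer), a simple linear regression prediction equation can be written as follows, where $b_0$ and $b_1$ are placeholder symbols for the intercept and slope reported in your model summary:

```latex
\hat{y} = b_0 + b_1 x
```

Here $\hat{y}$ is the predicted brain weight and $x$ is the head size, matching the `y_hat` and `x` variables named above.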

In [None]:
# Task 3

# YOUR CODE HERE
raise NotImplementedError()
print(y_hat)

**Task 3 - Test**

In [None]:
# Task 3 - Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 4** - Statistical parameters

Now that we have fit a model, we're going to pull out the statistical parameters.

* assign the standard error to the variable `std_err`
* assign the value of the t-statistic to the variable `t_stat`
* assign the p-value for the slope to the variable `p_slope`

**Assign values to the thousandths place, that is, three decimal places (for example, `777.555`)**
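To see how these quantities relate to each other, the sketch below computes the standard error of the slope and its t-statistic by hand for a small made-up dataset (not from `Brainhead.csv`). It assumes the usual simple-regression setup with `n - 2` residual degrees of freedom; all variable names ending in `_demo` are illustrative.

```python
from math import sqrt
from statistics import mean

# Made-up data for illustration only
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.0, 4.0, 5.0, 4.0, 5.0]
n_demo = len(x_demo)

# Least-squares slope and intercept
x_bar, y_bar = mean(x_demo), mean(y_demo)
s_xx = sum((xi - x_bar) ** 2 for xi in x_demo)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_demo, y_demo))
slope_demo = s_xy / s_xx
intercept_demo = y_bar - slope_demo * x_bar

# Residual variance uses n - 2 degrees of freedom (two estimated parameters)
sse = sum(
    (yi - (intercept_demo + slope_demo * xi)) ** 2
    for xi, yi in zip(x_demo, y_demo)
)
se_slope_demo = sqrt(sse / (n_demo - 2) / s_xx)

# t-statistic for the null hypothesis that the slope is zero
t_slope_demo = slope_demo / se_slope_demo
print(round(se_slope_demo, 3), round(t_slope_demo, 3))
```

In a `statsmodels` summary, the corresponding values appear in the `std err`, `t`, and `P>|t|` columns; the fitted results object also exposes them as `model.bse`, `model.tvalues`, and `model.pvalues`.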

In [None]:
# Task 4

# YOUR CODE HERE
raise NotImplementedError()

**Task 4 - Test**

In [None]:
# Task 4 - Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 5** - Hypothesis Test (written answer)

Conduct your hypothesis test and determine if head size is statistically significantly associated with brain weight at the alpha = 0.05 level.

ANSWER:


**Task 6** - Hypothesis Test for the intercept? (written answer)

Should you conduct a hypothesis test for the intercept term?  Why or why not?

ANSWER:


**Task 7** - Confidence Interval

Calculate the 95% confidence interval for your slope term. Use your model summary to find these values. Assign them to the thousandths place, that is, three decimal places (for example, `ci_low = 0.345`).

* assign the lower interval value to the variable `ci_low`
* assign the upper interval value to the variable `ci_upper`

Then, interpret this confidence interval in terms of how we expect brain weight to change when we **change head size by one cubic cm**.

* assign the lower value of the change in brain weight to `brain_low`
* assign the upper value of the change in brain weight to `brain_upper`
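For reference, a confidence interval for a slope generally takes the form below. The symbols are placeholders: $b_1$ is the estimated slope, $SE(b_1)$ its standard error, and $t^{*}_{n-2}$ the critical value from a t-distribution with $n - 2$ degrees of freedom at the chosen confidence level.

```latex
b_1 \pm t^{*}_{n-2} \cdot SE(b_1)
```

The endpoints of this interval are what the model summary reports in its `[0.025` and `0.975]` columns. Because the slope is the change in the predicted response per one-unit change in the predictor, the interval can be read directly as a range for that change.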


In [None]:
# Task 7

# YOUR CODE HERE
raise NotImplementedError()

**Task 7 - Test**

In [None]:
# Task 7 - Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 8** - Plot confidence intervals

Plot the confidence limits on both the slope and intercept terms with a shaded area around our regression line. The easiest way to do this is using `seaborn` with the `ci` parameter set to the confidence level you want (the default is `ci=95`). This plot will not be autograded.

In [None]:
# Task 8

# YOUR CODE HERE
raise NotImplementedError()

**Task 9** - Correlation (short answer)

Does it seem plausible that larger head size causes greater brain weight?  Or is it possible that something else causes differences in both of those factors?

ANSWER:



## Part B

### Sleep Data

Use the following information to answer Tasks 10 - 16 in the rest of this project:

Researchers recorded data on sleep duration as well as a set of ecological and constitutional variables for a selection of mammal species. This data is available in the `Sleep.csv` dataset; the URL is provided below.

(*Source: Allison, T. and Cicchetti, D. (1976), "Sleep in Mammals: Ecological and Constitutional Correlates",  Science, November 12, vol. 194, pp. 732-734.*)

**Data Dictionary:**

| Variable Name | Description | Details |
|:-------------:|:--------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------|
| Animal        | Animal name                      | Character [string]                                                                                                                       |
| Body          | Body weight                      | Numeric [kg]                                                                                                                             |
| Brain         | Brain weight                     | Numeric [g]                                                                                                                              |
| SWsleep       | Slow wave (“non-dreaming”) sleep | Numeric [hours]                                                                                                                          |
| Parasleep     | Paradoxical (“dreaming”) sleep   | Numeric [hours]                                                                                                                          |
| Totsleep      | Total sleep                      | Numeric [hours]                                                                                                                          |
| Life          | Maximum life expectancy          | Numeric [years]                                                                                                                          |
| Gest          | Gestation time                   | Numeric [days]                                                                                                                           |
| Pred          | Predation index                  | Numeric [1 – 5] 1 = least likely to be preyed upon, 5 = most likely to be preyed upon                                                    |
| Exposure      | Sleep exposure index             | Numeric [1 – 5] 1 = least amount of exposure during sleep (mammal sleeps indoors or in a den), 5 = most amount of exposure during sleep  |
| Danger        | Overall danger index             | Numeric [1 – 5] 1 = least amount of danger from other animals, 5 = most amount of danger from other animals                              |



**Task 10**

Before we can look at the data, we need to load it. The URL has been provided.

* load in the CSV file as a DataFrame and assign it to the variable `df_sleep`
* make sure to view the DataFrame!

In [None]:
# Task 10

data_url_2 = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Sleep/Sleep.csv'

# YOUR CODE HERE
raise NotImplementedError()

# Print out your DataFrame
df_sleep.head()

**Task 10 - Test**

In [None]:
# Task 10 - Test

assert isinstance(df_sleep, pd.DataFrame), 'Have you created a DataFrame named `df_sleep`?'
assert len(df_sleep) == 42


**Task 11** - Plot to check for linearity

Plot the relationship between *gestation time* and time spent in *dreaming sleep*. This plot will not be autograded.

* you can use `seaborn` for your plot, with `regplot()`
* include the regression line but turn off the confidence interval (`ci=None`)

Describe the relationship between the two variables you plotted below.

ANSWER:

In [None]:
# Task 11

# YOUR CODE HERE
raise NotImplementedError()

**Task 12** - Transform a variable

Let's try something new: taking the log of a variable to transform it. Then we'll look at the relationship between the log of that variable and the other variable (which will remain the same).

* Create a new variable with the log of gestational time and add it as a column to `df_sleep` with the name `log_gest` 

*Hint: use the natural log `np.log()`*
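To illustrate what the transformation does, here is a minimal sketch using the standard library's `math.log`, which is also a natural log. The gestation values are made up for illustration (not taken from `Sleep.csv`), and the `_demo` names are hypothetical.

```python
from math import log

# Hypothetical gestation times in days, each ten times the previous one
gest_days_demo = [10, 100, 1000]

# Natural log of each value; equal ratios become equal differences
log_gest_demo = [log(g) for g in gest_days_demo]

step_1 = log_gest_demo[1] - log_gest_demo[0]
step_2 = log_gest_demo[2] - log_gest_demo[1]
print(step_1, step_2)  # both steps equal ln(10), up to floating-point rounding
```

Values that are widely spread on the original scale end up evenly spaced after the transformation, which is why a log can straighten out a curved relationship. `np.log()` from the hint applies the same function elementwise to a whole array or DataFrame column.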

In [None]:
# Task 12

# YOUR CODE HERE
raise NotImplementedError()

# Look at your new column
df_sleep.head()

**Task 12 - Test**

In [None]:
# Task 12 - Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 13** - Plot a new relationship!

Plot the relationship between the log of *gestational time* and *dreaming sleep*. This plot will not be autograded.

* you can use `seaborn` for your plot, with `regplot()`
* include the regression line but turn off the confidence interval (`ci=None`)

Describe the relationship between the two variables you just plotted.

ANSWER:

In [None]:
# Task 13

# YOUR CODE HERE
raise NotImplementedError()


**Task 14** - Model the sleep data

Next, model the relationship between the log of gestation time and dreaming sleep using `statsmodels.formula.api`. Remember that the `statsmodels` import was made earlier.

* Fit an OLS model and assign it to the variable `model_sleep` (remember to enter the model in the format Y ~ X).
* Print out your model summary.
* Answer the questions below to interpret your results.

In [None]:
# Task 14

# YOUR CODE HERE
raise NotImplementedError()

# Print the model summary
print(model_sleep.summary())

**Task 14 - Test**

In [None]:
# Task 14 - Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 15** - Statistical significance of sleep data (short answer)

Is the *log of gestational time* statistically significantly associated with time spent in *dreaming sleep* at the alpha = 0.05 level?

ANSWER:

**Task 16** - Predicting dreaming sleep from gestation time

The final task! Using the model we just created, predict the amount of dreaming sleep for a mammal that gestates her young for 262 days.

* Assign the gestation time to the variable `x_predict`. This variable will be an integer.
* Take the log of `x_predict` and assign it to the variable `ln_x_predict`. This variable will be a float.
* Use the `slope` and `intercept` variables from your `model_sleep` to complete the calculation. 
* Your result should be a float and assigned to the variable `sleep_predict`.
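The order of operations matters here: because the model was fit on the log of the predictor, the new value has to be log-transformed before it is plugged into the line. The sketch below shows that pattern with hypothetical coefficients and a hypothetical input value, chosen only for illustration; they are not the values from `model_sleep`.

```python
from math import log

# Hypothetical coefficients for illustration only; use the values from your own model
intercept_demo = 5.0
slope_demo = -0.5

x_demo = 100              # predictor on its original scale
ln_x_demo = log(x_demo)   # the model was fit on ln(x), so transform first

# Prediction from the fitted line: y_hat = intercept + slope * ln(x)
y_hat_demo = intercept_demo + slope_demo * ln_x_demo
print(y_hat_demo)
```

Plugging the untransformed value into the equation would give a prediction on the wrong scale, so it is worth printing the logged value as a sanity check.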

In [None]:
# Task 16

# YOUR CODE HERE
raise NotImplementedError()

# Print out the log of x and the predicted sleep value
print('ln 262 = ', ln_x_predict)
print('Predicted dreaming sleep = ', sleep_predict)

**Task 16 - Test**

In [None]:
# Task 16 - Test

assert ln_x_predict == np.log(x_predict), 'Did you use the correct log calculation?'

