In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")

# Homework 10: Linear Regression

**Author**: Yanay Rosen

**Helpful Resource:**
- [Python Reference](http://data8.org/su21/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Reading**: 
* [Linear Regression](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)
* [Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)
* [Least Squares Regression](https://www.inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

For all problems that you must write explanations and sentences for, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Deadline:**

This assignment is due Friday, July 30 at 11:59 P.M. PDT. Late work will not be accepted as per the [policies](http://data8.org/su21/policies.html) page. 

**Note: This homework has hidden tests. That means that even if the tests say 100% passed, your final grade will not necessarily be 100%. We will be running more tests for correctness once everyone turns in the homework.**

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you're stuck. Office hours are held Monday-Friday. The schedule appears on [http://data8.org/su21/office-hours.html](http://data8.org/su21/office-hours.html).

In [1]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Exploring the PTEN Gene with Linear Regression

### A Quick Review of Standard Units

<!-- BEGIN QUESTION -->

**Question 0**

Image A | Image B
:-:| :-:
![one](normal2.png) | ![two](normal1.png)


Two normal distributions, Distribution 1 and Distribution 2, were generated, each with a different mean and standard deviation. Which image above corresponds to the data in regular units? Which image above corresponds to the data in standard units? Explain your thought process in 1-2 sentences. Please format your answer like so:

**Image A:** (Regular/Standard) units

**Image B:** (Regular/Standard) units

**Explanation:** ...

**(2 Points)**

<!--
BEGIN QUESTION
name: q1_0
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### PTEN Linear Regression

This week's homework is about linear regression. The dataset we'll be using is from the Cancer Cell Line Encyclopedia -- you can read more about this database in this [paper](https://www.nature.com/articles/s41586-019-1186-3) and interact with the data yourself at the online portal [here](https://portals.broadinstitute.org/ccle).

The specific dataset we'll be taking a look at is expression data for the PTEN gene in around 1000 cell lines. The PTEN gene is a tumor-suppressing gene, and mutations in the PTEN gene are associated with many types of cancer. A cell line is a group of cells that are kept alive and replicate indefinitely in culture (grown in petri dishes, for example).

Run the following cell to load the `pten` table. The table has four columns: `Cell Line`, which identifies the specific cell line; `Copy Number`, which is how much of the PTEN gene (compared to the reference genome) is found in the DNA of that cell line; and two measurements of mRNA expression, `mRNA Expression (Affy)` and `mRNA Expression (RNAseq)`.

*Note:* Since the PTEN gene can appear fewer times than in the reference genome, the `Copy Number` can be negative.

In [2]:
# Just run this cell
pten = Table().read_table("pten.csv")
pten.show(5)

In [3]:
# Just run this cell
pten.hist("Copy Number", bins = np.arange(-1, 1.5, 0.5))

**Question 1**

Looking at the histogram above, we want to check whether or not `Copy Number` is in standard units. For this question, compute the mean and the standard deviation of the values in `Copy Number` and assign these values to `copy_number_mean` and `copy_number_sd` respectively. After you calculate these values, assign `is_su` to either `True` if you think that `Copy Number` is in standard units or `False` if you think otherwise. **(5 Points)**

*Hint: What would the mean and SD of an array in standard units be?*

<!--
BEGIN QUESTION
name: q1_1
manual: false
points:
 - 0
 - 5
-->

In [4]:
copy_number = pten.column("Copy Number")
copy_number_mean = ...
copy_number_sd = ...
is_su = ...
print(f"Mean: {copy_number_mean}, SD: {copy_number_sd}, Is in standard units?: {is_su}")

In [None]:
grader.check("q1_1")

**Question 2**

Create the function `standard_units` so that it converts the values in the array `arr` to standard units. We'll then use `standard_units` to create a new table, `pten_su`, that converts all the values in the table `pten` to standard units. **(4 Points)**

<!--
BEGIN QUESTION
name: q1_2
manual: false
points:
 - 1
 - 3
-->

In [7]:
def standard_units(arr):
    ...

# DON'T DELETE OR MODIFY ANY OF THE LINES OF CODE BELOW IN THIS CELL
pten_su = Table().with_columns("Cell Line", pten.column("Cell Line"),
                               "Copy Number SU", standard_units(pten.column("Copy Number")),
                               "mRNA Expression (Affy) SU", standard_units(pten.column("mRNA Expression (Affy)")),
                               "mRNA Expression (RNAseq) SU", standard_units(pten.column("mRNA Expression (RNAseq)"))                             
                              )
pten_su.show(5)

In [None]:
grader.check("q1_2")

You should always visually inspect your data before numerically analyzing any relationships in your dataset. Run the following cell to look at the relationships between the variables in our dataset.
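To see why a plot should come before any numerical summary, here is a minimal sketch on made-up data. The arrays `toy_x` and `toy_y` are hypothetical and are not part of `pten`. The points lie exactly on a parabola, yet the correlation is essentially zero, because correlation only measures linear association. `np.corrcoef` is NumPy's built-in correlation routine and is used here purely for illustration.

```python
import numpy as np

# Hypothetical toy data: y is completely determined by x, but not linearly.
toy_x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
toy_y = toy_x ** 2

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
toy_r = np.corrcoef(toy_x, toy_y)[0, 1]
print(f"r for the parabola: {toy_r:.3f}")
```

A scatter plot of these five points would reveal the pattern immediately, while the value of `r` alone would suggest there is no relationship at all.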

In [10]:
# Just run this cell
pten_su.scatter("Copy Number SU", "mRNA Expression (Affy) SU")
pten_su.scatter("Copy Number SU", "mRNA Expression (RNAseq) SU")
pten_su.scatter("mRNA Expression (Affy) SU", "mRNA Expression (RNAseq) SU")

**Question 3**

Which of the following relationships do you think has the highest correlation (i.e. highest absolute value of `r`)? Assign `highest_correlation` to the number corresponding to the relationship you think has the highest correlation.

1. mRNA Expression (Affy) vs. mRNA Expression (RNAseq)
2. Copy Number vs. mRNA Expression (RNAseq)
3. Copy Number vs. mRNA Expression (Affy)

**(4 Points)**

<!--
BEGIN QUESTION
name: q1_3
manual: false
points:
 - 0
 - 4
-->

In [11]:
highest_correlation = ...

In [None]:
grader.check("q1_3")

**Question 4**

Now, using the `standard_units` function, define the function `correlation` which computes the correlation between `arr1` and `arr2`. **(4 Points)**

<!--
BEGIN QUESTION
name: q1_4
manual: false
points:
 - 0
 - 4
-->

In [14]:
def correlation(arr1, arr2):
    '''arr1 and arr2 will always be the same length.'''
    ...

# This computes the correlation between the different variables in pten
copy_affy = correlation(pten.column("Copy Number"), pten.column("mRNA Expression (Affy)"))
copy_rnaseq = correlation(pten.column("Copy Number"), pten.column("mRNA Expression (RNAseq)"))
affy_rnaseq = correlation(pten.column("mRNA Expression (Affy)"), pten.column("mRNA Expression (RNAseq)"))

print(f" \
      Copy Number vs. mRNA Expression (Affy) Correlation: {copy_affy}, \n \
      Copy Number vs. mRNA Expression (RNAseq) Correlation: {copy_rnaseq}, \n \
      mRNA Expression (Affy) vs. mRNA Expression (RNAseq) Correlation: {affy_rnaseq}")

In [None]:
grader.check("q1_4")

**Question 5**

If we switched the order of the arguments to `correlation`, i.e. found the correlation between `mRNA Expression (Affy)` vs. `Copy Number` instead of the other way around, would the correlation change? Assign `correlation_change` to either `True` if you think yes, or `False` if you think no. **(4 Points)**

<!--
BEGIN QUESTION
name: q1_5
manual: false
points:
 - 0
 - 4
-->

In [17]:
correlation_change = ...

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

**Question 6**

Looking at both the scatter plots after Question 2 and the correlations computed in Question 4, what similarities or differences do you see in the strength of the linear relationships? **(6 Points)**

<!--
BEGIN QUESTION
name: q1_6
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 7**

Let's look at the relationship between mRNA Expression (Affy) vs. mRNA Expression (RNAseq) only. Define a function called `regression_parameters` that returns the parameters of the regression line as a two-item array containing the slope and intercept of the regression line as the first and second elements respectively. The function `regression_parameters` takes in two arguments, an array of `x` values, and an array of `y` values. **(7 Points)**

Note: Feel free to use as many lines as needed to define the slope and intercept in `regression_parameters`.

*Hint: You should use a function you previously defined to calculate any intermediate quantities needed.*

<!--
BEGIN QUESTION
name: q1_7
manual: false
points:
 - 0
 - 7
-->

In [20]:
def regression_parameters(x, y):
    ...
    slope = ...
    intercept = ...
    return make_array(slope, intercept)

parameters = regression_parameters(pten.column("mRNA Expression (Affy)"), pten.column("mRNA Expression (RNAseq)"))
parameters

In [None]:
grader.check("q1_7")

**Question 8**

If we switched the order of the arguments to `regression_parameters`, i.e. found the parameters for the regression line for `mRNA Expression (RNAseq)` vs. `mRNA Expression (Affy)` instead of the other way around, would the regression parameters change (would the slope and/or intercept change)? Assign `parameters_change` to either `True` if you think yes, or `False` if you think no. **(4 Points)**

<!--
BEGIN QUESTION
name: q1_8
manual: false
points:
 - 0
 - 4
-->

In [23]:
parameters_change = ...

In [None]:
grader.check("q1_8")

**Question 9**

Now, let's see what the regression parameters look like in standard units. Use the table `pten_su` and the function `regression_parameters`, and assign `parameters_su` to a two-item array containing the slope and the intercept of the regression line for mRNA Expression (Affy) in standard units vs. mRNA Expression (RNAseq) in standard units. **(3 Points)**


<!--
BEGIN QUESTION
name: q1_9
manual: false
points:
 - 0
 - 3
-->

In [26]:
parameters_su = ...
parameters_su

In [None]:
grader.check("q1_9")

If you are unfamiliar with scientific notation, running the following cell will help you see the slope and intercept more clearly.

In [29]:
round(parameters_su.item(0), 2), round(parameters_su.item(1), 2)

<!-- BEGIN QUESTION -->

**Question 10**

Looking at the array `parameters_su`, what do you notice about the slope and intercept values specifically? Relate the slope to another value we already calculated in a previous question, and relate the slope & intercept pair from `parameters_su` to an equation of the regression line. **(8 Points)**


<!--
BEGIN QUESTION
name: q1_10
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 11**

The Data8 cell line is missing from our dataset. Luckily, we now have the parameters of a regression line that we can use to predict the Data8 mRNA Expression (RNAseq) value from its mRNA Expression (Affy) value. Remember, this is the goal of regression: to predict one value given another. If mRNA Expression (Affy) = 8.2, what is our regression estimate of mRNA Expression (RNAseq)? Use the values in `parameters` that we derived in Question 1.7, and assign the result to `data8_rnaseq`. **(4 Points)**

<!--
BEGIN QUESTION
name: q1_11
manual: false
points:
 - 0
 - 4
-->

In [30]:
data8_rnaseq = ...
data8_rnaseq

In [None]:
grader.check("q1_11")

<!-- BEGIN QUESTION -->

**Question 12**

Compute the predicted mRNA Expression (RNAseq) values from the mRNA Expression (Affy) values in the `pten` table. Use the values in the `parameters` array from Question 1.7, and assign the result to `predicted_rnaseq`. We'll plot your computed regression line with the scatter plot from after question 1.2 of mRNA Expression (Affy) vs. mRNA Expression (RNAseq). **(4 Points)**

*Sanity Check: Given the strong correlation between the two variables, does your regression line make sense / have a good fit?*


<!--
BEGIN QUESTION
name: q1_12
manual: true
-->

In [33]:
predicted_rnaseq = ...

# DON'T CHANGE/DELETE ANY OF THE BELOW CODE IN THIS CELL
(pten.with_column("Predicted mRNA Expression (RNAseq)", predicted_rnaseq)
 .select("mRNA Expression (Affy)", "mRNA Expression (RNAseq)", "Predicted mRNA Expression (RNAseq)")
 .scatter("mRNA Expression (Affy)"))
plt.ylabel("mRNA Expression (RNAseq)");

<!-- END QUESTION -->



### Fitting a least-squares regression line

Recall that the least-squares regression line is the unique straight line that minimizes root mean squared error (RMSE) among all possible lines. Using this property, we can find the equation of the regression line by finding the pair of slope and intercept values that minimizes root mean squared error.
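The minimizing property can be checked numerically on a small made-up dataset. The sketch below is an illustration only: `toy_x`, `toy_y`, and `toy_rmse` are hypothetical names that have nothing to do with the `pten` table, and `np.polyfit` stands in for whatever method you use to find the regression line in this homework. It fits a least-squares line and then confirms that nudging the slope or the intercept makes the RMSE larger.

```python
import numpy as np

# Hypothetical toy data, not taken from the pten table.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def toy_rmse(slope, intercept):
    """RMSE of the line y = slope * x + intercept on the toy data."""
    errors = toy_y - (slope * toy_x + intercept)
    return np.sqrt(np.mean(errors ** 2))

# np.polyfit with degree 1 returns the least-squares slope and intercept.
best_slope, best_intercept = np.polyfit(toy_x, toy_y, 1)
best_rmse = toy_rmse(best_slope, best_intercept)

# Nudging either parameter away from the fitted values increases the RMSE.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
nudged_rmses = [toy_rmse(best_slope + ds, best_intercept + di) for ds, di in nudges]
for (ds, di), value in zip(nudges, nudged_rmses):
    print(f"nudge ({ds:+.1f}, {di:+.1f}): RMSE {value:.4f} vs. best {best_rmse:.4f}")
```

Every nudged line has a larger RMSE than the fitted line, which is what "unique minimizer" means in practice.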

**Question 13**

Define a function called `RMSE_mRNA`. It should take two arguments:

1. the slope of a line (a number)
2. the intercept of a line (a number).

It should return a number that is the root mean squared error (RMSE) for a line defined with the arguments slope and intercept used to predict mRNA Expression (RNAseq) values from mRNA Expression (Affy) values for each row in the `pten` table. **(7 Points)**

*Hint: Errors are defined as the difference between the actual `y` values and the predicted `y` values.*

*Note: if you need a refresher on RMSE, here's the [link](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html#Root-Mean-Squared-Error) from the textbook*

<!--
BEGIN QUESTION
name: q1_13
manual: false
points:
 - 0
 - 7
-->

In [34]:
def RMSE_mRNA(slope, intercept):
    affy = pten.column("mRNA Expression (Affy)")
    rnaseq = pten.column("mRNA Expression (RNAseq)")
    predicted_rnaseq = ...
    ...

# DON'T CHANGE THE FOLLOWING LINES BELOW IN THIS CELL
rmse_example = RMSE_mRNA(0.5, 6)
rmse_example

In [None]:
grader.check("q1_13")

<!-- BEGIN QUESTION -->

**Question 14**

What is the RMSE of the line with slope 0 and intercept equal to the mean of `y`? **(8 Points)**

*Hint 1: The line with slope 0 and intercept of mean of `y` is just a straight horizontal line at the mean of `y`*

*Hint 2: What does the formula for RMSE become if we input our predicted `y` values in the formula? Try writing it out on paper! It should be a familiar formula.*

<!--
BEGIN QUESTION
name: q1_14
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 15**

Find the parameters (the slope and intercept) that minimize RMSE of the regression line for mRNA Expression (Affy) vs. mRNA Expression (RNAseq). Assign the result to `minimized_parameters`.

If you haven't tried to use the `minimize` [function](http://data8.org/su21/python-reference.html) yet, now is a great time to practice. Here's an [example from the textbook](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html#numerical-optimization). **(4 Points)**

*Hint: Use the `RMSE_mRNA` function in Question 1.13*

**NOTE: When you use the `minimize` function, please pass in `smooth=True` as the second argument. If you don't, your answer will be incorrect.**

<!--
BEGIN QUESTION
name: q1_15
manual: false
points:
 - 0
 - 0
 - 4
-->

In [37]:
minimized_parameters = ...
minimized_parameters

In [None]:
grader.check("q1_15")

<!-- BEGIN QUESTION -->

**Question 16**

The slope and intercept pair you found in Question 1.15 should be very similar to the values that you found in Question 1.7. Why were we able to minimize RMSE to find the same slope and intercept from the previous formulas? **(6 Points)**


<!--
BEGIN QUESTION
name: q1_16
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 17**

If we had instead minimized mean squared error (MSE), would we have gotten the same slope and intercept that we got by minimizing root mean squared error (RMSE)? Assign `same_parameters` to either `True` if you think yes, or `False` if you think no. **(4 Points)**


<!--
BEGIN QUESTION
name: q1_17
manual: false
points:
 - 0
 - 4
-->

In [41]:
same_parameters = ...
same_parameters

In [None]:
grader.check("q1_17")

Let's look at the scatter plot of the relationship between mRNA Expression (Affy) and mRNA Expression (RNAseq) again:

In [44]:
pten.scatter("mRNA Expression (Affy)", "mRNA Expression (RNAseq)")

<!-- BEGIN QUESTION -->

**Question 18**

Using a linear regression model, would we be able to obtain accurate predictions of mRNA Expression (RNAseq) for most of the points? Explain why or why not. **(7 Points)**


<!--
BEGIN QUESTION
name: q1_18
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Properties of Binary Distributions

Binary distributions arise in regular everyday life, and as data scientists you will encounter them constantly. A binary distribution is a distribution across two categories: such as voting in support of a proposition or voting against it on your local ballot, flipping heads or tails, having heart disease or not having heart disease. Generally we represent 'yes' or `True` as 1, and 'no' or `False` as 0. Binary distributions have some special properties that make working with them especially easy!

For further reading on binary distributions: [Chapter 14.6](https://inferentialthinking.com/chapters/14/6/Choosing_a_Sample_Size.html#The-SD-of-a-collection-of-0's-and-1's)


**The intent of this section of the homework is to walk you through these properties, so we decided to make all of the tests for this section public (i.e. there are no hidden tests to worry about for this section only).**

### **If you pass all tests in the following section, you will receive full credit.**

#### Question 1

Let's generate a random binary distribution of 0's and 1's. Assign `binary_options` to the correct array of possible values in a binary distribution (i.e. look at the previous sentence). **(1 Point)**


<!--
BEGIN QUESTION
name: q2_1
manual: false
points:
 - 0
 - 1
-->

In [45]:
binary_options = ...


# DON'T DELETE/MODIFY ANY OF THE CODE IN THIS CELL BELOW
sample_size = 100
binary_sample = np.random.choice(binary_options, sample_size)

# Run this to see a bar chart of this random distribution.
Table().with_columns("Value", make_array(1, 0), "Number in Sample", make_array(sum(binary_sample), sample_size - sum(binary_sample))).barh("Value")

In [None]:
grader.check("q2_1")

#### Question 2

The first property you should note is that the proportion of ones in a binary distribution is equal to the mean of the distribution. [Think about why this is true](https://www.inferentialthinking.com/chapters/14/1/Properties_of_the_Mean.html#Proportions-are-Means). Complete the following cell to show that this is the case for your `binary_sample`. Assign `number_of_ones` and `number_of_zeros` to the number of `1`'s and the number of `0`'s respectively from your `binary_sample`. **(2 Points)**


<!--
BEGIN QUESTION
name: q2_2
manual: false
points:
 - 2
-->

In [48]:
number_of_ones = ...
number_of_zeros = ...


# DON'T DELETE/MODIFY ANY OF THE CODE BELOW IN THIS CELL
number_values = len(binary_sample)
sum_of_binary_sample = sum(binary_sample)
# Remember that the mean is equal to the sum divided by the number of items
mean_binary_sample = sum_of_binary_sample / number_values

# Don't change this!
print(f"In your binary sample there were {number_of_ones} ones and {number_of_zeros} zeros. 1*{number_of_ones} + 0*{number_of_zeros} = {number_of_ones}")
print(f"The sum of values in your sample was {sum_of_binary_sample}, divided by the number of items, {number_values}, gives us a mean of {mean_binary_sample}")
print(f"The proportion of ones in your sample was {number_of_ones} ones, divided by the number of items, {number_values}, gives us a value of {mean_binary_sample}" )
print('Those values are equal!')

In [None]:
grader.check("q2_2")

Since the proportion of ones is the same as the mean, the Central Limit Theorem applies! That is, if we resample our sample a lot of times, the distribution of the proportion of ones in our resamples will be roughly normal, with a predictable center and spread!
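How predictable are the center and spread? As a rough sketch, assume a made-up sample `toy_sample` (hypothetical, separate from your `binary_sample`) and NumPy's random generator with a fixed seed. The center of the resampled proportions should be close to the sample proportion $p$. The spread should be close to the SD of the sample divided by $\sqrt{n}$, which for zeros and ones works out to $\sqrt{p(1-p)/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 100 zeros and ones (not the notebook's binary_sample).
toy_sample = rng.choice([0, 1], size=100)
toy_n = len(toy_sample)
toy_p = toy_sample.mean()  # proportion of ones = mean of the sample

# Bootstrap: resample with replacement and record each resample's proportion of ones.
toy_props = np.array([
    rng.choice(toy_sample, size=toy_n, replace=True).mean()
    for _ in range(5000)
])

# Center should be near the sample proportion; spread near sqrt(p * (1 - p) / n).
predicted_sd = np.sqrt(toy_p * (1 - toy_p) / toy_n)
print(f"center: {toy_props.mean():.3f} (sample proportion {toy_p:.3f})")
print(f"spread: {toy_props.std():.4f} (predicted {predicted_sd:.4f})")
```

The two pairs of numbers should agree closely, though not exactly, because the resampling is random.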

In [50]:
# Just run this cell
resampled_proportion_of_ones = make_array()

for i in np.arange(5000):
    resample = Table().with_column("Value", binary_sample).sample()
    resample_proportion_ones = resample.where("Value", 1).num_rows / resample.num_rows
    resampled_proportion_of_ones = np.append(resampled_proportion_of_ones, resample_proportion_ones)
    
Table().with_column('Resampled Proportions', resampled_proportion_of_ones).hist()

Let's generate a table where each row has a different number of ones and zeros that we'll use for the following parts.

In [51]:
# Just run this cell
possible_number_ones = np.arange(sample_size + 1)
possible_number_zeros = sample_size - possible_number_ones

possibilities_table = Table().with_columns("Values of One", possible_number_ones, "Values of Zero", possible_number_zeros)
possibilities_table.show(5)

#### Question 3
The second important property of binary distributions is that the standard deviation of every binary distribution is equal to:
$$\sqrt{\text{proportion of ones} \times \text{proportion of zeros}}$$

While this property is useful in some cases, a more useful extension of this property is that it tells us that the maximum standard deviation for a binary distribution is 0.5!

Let's explore why that is the case!

Complete the `binary_std_formula` function below so that it returns the standard deviation of a binary distribution according to the formula above. The function takes in a `row` that looks like a row in `possibilities_table`. **(2 Points)**


<!--
BEGIN QUESTION
name: q2_3
manual: false
points:
 - 2
-->

In [52]:
def binary_std_formula(row):
    num_ones = row.item("Values of One")
    num_zeros = row.item("Values of Zero")
    
    sum_ones_and_zeros = ...
    prop_ones = ...
    prop_zeros = ...
    ...

# DON'T DELETE/MODIFY ANY OF THE LINES BELOW IN THIS CELL
possibilities_table = possibilities_table.with_column("Formula SD", possibilities_table.apply(binary_std_formula))
possibilities_table.show(5)

In [None]:
grader.check("q2_3")

Here's another function that takes in a row object from a table, generates a sample that has the same number of ones and zeros as the row specifies, and then returns the standard deviation of that sample. You should be able to understand exactly what this function does! It returns the same quantity as `binary_std_formula` above, but it computes the standard deviation directly with `np.std` instead of using the formula.

In [54]:
# Just run this cell
def binary_std(row):
    values = make_array()
    for i in np.arange(row.item("Values of One")):
        values = np.append(values, 1)
    for i in np.arange(row.item("Values of Zero")):
        values = np.append(values, 0)
    return np.std(values)

possibilities_table = possibilities_table.with_column("Empirical SD", possibilities_table.apply(binary_std))
possibilities_table.show(5)

All the values are the same! Let's see what this formula means!

In [55]:
# Just run this cell
possibilities_table.scatter("Values of One", "Formula SD")

What a beautiful curve!

Looking at that curve, we can see that the maximum value is $0.5$, which occurs in the middle of the distribution, when the two categories have equal proportions (proportion of ones = proportion of zeros = $\frac{1}{2}$).
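The same conclusion follows from a little algebra: if $p$ is the proportion of ones, then $p(1-p) = \frac{1}{4} - \left(p - \frac{1}{2}\right)^2$, which is largest when $p = \frac{1}{2}$, giving an SD of $\sqrt{1/4} = 0.5$. A quick numerical check, using a hypothetical grid of proportions `toy_p` rather than `possibilities_table`, might look like this:

```python
import numpy as np

# Grid of possible proportions of ones, from 0 to 1 in steps of 0.01.
toy_p = np.linspace(0, 1, 101)

# SD of a zero/one distribution with proportion p of ones.
toy_sd = np.sqrt(toy_p * (1 - toy_p))

# Find the largest SD on the grid and the proportion where it occurs.
largest = toy_sd.max()
where = toy_p[toy_sd.argmax()]
print(f"largest SD on the grid: {largest:.2f}, at proportion {where:.2f}")
```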

## (OPTIONAL, NOT IN SCOPE) Logarithmic Plots

A kind of visualization you will frequently encounter as a data scientist is a scatter plot or line plot that uses a logarithmic scale. This **Optional** section will cover how to read and generate logarithmic plots. Since this is optional, there are no autograded or free-response questions in this section. Just read, run cells, and explore.

What is a logarithm? A logarithm helps us find the inverse of an equation that uses exponentials. Specifically, if

$$a^y = x$$

Then

$$\log_a{x} = y$$

The number $a$ is known as the base of the logarithm. The most commonly used bases are $e$, which is approximately 2.718, and 10 (for powers of 10).

We can use `numpy` to take logs in Python! By default, `np.log` uses a base of $e$.
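As a small sketch of the inverse relationship above, the snippet below takes logs in three different bases. `np.log10` and `np.log2` are NumPy's base-10 and base-2 logarithms. The last lines use the change-of-base identity $\log_a{x} = \ln{x} / \ln{a}$, with the values $a = 10$ and $x = 1000$ chosen purely for illustration.

```python
import numpy as np

# np.log is the natural log (base e); np.log10 and np.log2 use bases 10 and 2.
print(np.log(np.e ** 3))   # the exponent e must be raised to in order to get e**3
print(np.log10(1000))      # the exponent 10 must be raised to in order to get 1000
print(np.log2(8))          # the exponent 2 must be raised to in order to get 8

# Change of base: log_a(x) = ln(x) / ln(a). Here a = 10 and x = 1000.
via_change_of_base = np.log(1000) / np.log(10)
print(via_change_of_base)
```

Each printed value should be 3, up to floating-point rounding.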

In [56]:
make_array(np.log(np.e), np.log(np.e**2), np.log(100))

Back to the visualization: when we plot a trend that grows exponentially, such as the curve

$$ y = 10^x$$

the y-axis has to cover a very large range of values, which makes the plot difficult to read.

Let's see what this looks like:

In [57]:
x = np.arange(0, 10, 1/100)
y = 10 ** x

Table().with_columns("X", x, "Y", y).scatter(0,1)

Note that since $10^{10}$ is so big, we can't really see what's happening at all to the y values when they have x values below 8.

One solution is to change the y-axis and/or x-axis so that, instead of tick marks that increase by a constant amount, the tick marks increase by a constant factor. We do this by putting the tick marks on a logarithmic scale, which lets us understand our data better!

In [58]:
Table().with_columns("X", x, "Y", y).scatter(0,1)
plt.yscale("log")

Now we can tell what's happening to the y values for every x value!

Note how the y values start at $10^0=1$, and increase by a *factor* of $10$ each mark - the next mark is $10^1 = 10$, then $10^2=100$.

You still read this plot like a normal plot, so at a value of $x=5$, $y=10^5=100000$.

How do you calculate intermediate values? 

At a value like $x = 2.5$, the y value appears to be somewhere between $10^2$ and $10^3$. In this graph with a logarithmic scale, you would say that $y=10^{2.5} \approx 316$.
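The intermediate value can be checked directly. In this minimal sketch the names `toy_x`, `toy_y`, and `position` are hypothetical. It computes $10^{2.5}$ and then takes the base-10 log to recover where the point sits on a log-scaled axis.

```python
import numpy as np

toy_x = 2.5
toy_y = 10 ** toy_x  # value of the curve y = 10^x at x = 2.5

# On a log-scaled axis, position is proportional to log10(y), so this value sits
# halfway between the 10^2 and 10^3 tick marks, even though 316 is far from
# the arithmetic midpoint of 100 and 1000 (which is 550).
position = np.log10(toy_y)
print(f"y = {toy_y:.1f}, log10(y) = {position:.1f}")
```

This is why distances on a logarithmic axis should be read as ratios rather than differences.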

When visualizing data about the spread of diseases, you will commonly run into plots with logarithmic scales, such as this example from the New York Times. Make sure to always know what the scales of the data are! 

<img src="virus-log-chart.jpg" width="650"/>

Image is from https://www.nytimes.com/2020/03/20/health/coronavirus-data-logarithm-chart.html

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)