# Logistic Regression and Optimization

## **Prelude**

Earlier in the class, we saw that there was a technique called *Linear Regression* that allowed us to model a collection of data points $(x_i,y_i)$ that we believed to have a linear relationship (we also saw how we could use this technique to model data that we believed to have either an exponential or power relationship; see Section 1.5 of the text). We *did not* learn how to determine the linear regression model (a line $y=mx+b$ determined by constants $m,b$ that "minimize the squared errors") from the data points.

In this notebook, we're going to learn about another technique for modeling functions called *Logistic Regression*. In logistic regression, we have data that is dependent on a continuous variable and that is described with discrete outcomes. The goal of the logistic regression is to find the "best" function, with respect to the given data, for describing the probability of an outcome for a given input of the continuous variable.

We'll see two examples in this notebook. In the first example (that we'll do together), we'll look at students taking an exam. Our data points will have $x$-coordinates given by the amount of time students had spent studying during the quarter up to the exam and $y$-coordinates either $0$, if the student failed, or $1$, if the student passed. We'll perform a logistic regression on this data to produce a function $P(t)$ that "best estimates" the probability another student passes the same exam given the amount of time $t$ that they had studied.

In the second example, we'll look at the concentration of a chemical $R$ in blood samples of animals that we have available to us (e.g. in a sanctuary, or a farm). We'll record whether or not each animal is still alive 6 months after we took our measurement: a 1 indicates that the animal has died and a 0 indicates that the animal is still alive. We'll then perform a logistic regression to find a function $P(x)$ that "best estimates" the probability that an animal dies within 6 months given the magnitude $x$ of the concentration of chemical $R$ found in its blood.

## **Learning Goals**

In this notebook, we're going to do the following:
  1. We will learn how to use the technique of logistic regression to evaluate the probability of an outcome (represented by a piece of binary data like 0 or 1) depending on a given input (represented by a piece of continuous data like a real number).
  2. We will learn how the probability function provided by logistic regression can be interpreted as the result of solving an optimization problem in two unknown variables.
  3. We will use the techniques of calculus to solve the above-mentioned optimization problem (given some helpful pieces of starting information) in two concrete examples.

# 1. An Introduction to Logistic Regression

We're going to work through a concrete example of computing the logistic regression model for a collection of data. The goal will be to gain an understanding of the following:
- we want to understand what we are trying to do (i.e. producing a function from a collection of data),
- we want to understand why the process that we are going through makes sense,
- and we want to understand how to explicitly carry out this process.

**Example:** In this example, we are going to be looking at a collection of data in the form of pairs of real numbers $(x_i,y_i)$. We're putting ourselves in the following situation: students have been preparing for a final exam after a 10 week quarter. We hold a survey before these students take their exam asking the number of hours each student spent studying throughout the quarter leading up to the exam. We then record whether or not the student passed or failed the exam.

Suppose that there are $N$ students that take this test. For each student $i=1,...,N$, we record this information as the data-point $(x_i,y_i)$ where the $x$-coordinate $x_i$ will represent the time a student spent studying prior to the exam, recorded in hours, and $y_i$ will be either $1$ if the student passed or $0$ if the student failed.

We're going to use the two lists below, which correspond to $N=143$ data-points, as a concrete example for this scenario. The data is collected and displayed in a table using a Python package "Pandas" below. It is also plotted via a scatterplot using Matplotlib's pyplot functionality.

In [None]:
x_time = [13.139578505832356, 225.4753405871886, 143.50624862966328, 138.42872959763085, 228.67838116095646,
    22.778143609252187, 93.21519320279181, 173.04019609077545, 148.39204490588284, 103.47734660322207,
    80.28120045691193, 70.4433745657244, 21.007076792169276, 6.2473290053980595, 40.669076909865964,
    29.01584490682468, 63.20057189675148, 105.403013842876, 32.560145690061596, 147.06089156870152,
    40.021421611996644, 96.51175131090487, 56.48704619671636, 180.58628353307287, 81.49802915096919,
    145.5040355262578, 175.6678277286986, 141.78455256517475, 171.5047497420194, 137.06852336224287,
    223.3552675443469, 166.95403213259877, 78.39020668298933, 149.97677817619825, 42.358643935381416,
    36.792441854295866, 113.47426451592109, 63.792117622403325, 202.89525683058974, 113.40888806512577,
    30.135655969568663, 24.423812596500124, 132.68544773103582, 148.22303350472947, 226.44495752148046,
    173.08678782576072, 108.00106320432843, 179.25231087405152, 17.771739021840236, 5.19520867221231,
    48.01946732658932, 107.84273894924387, 126.61893604969517, 86.55285702594863, 229.28279888590532,
    12.453934703328054, 193.77352226208444, 28.166042546172783, 145.93739558461644, 212.76502973333697,
    196.10017998627075, 7.525533608052116, 156.59885033992106, 94.82394364383495, 50.35213814638997,
    111.47248134221095, 32.0694032698006, 0.13616964833608614, 22.2994061065071, 177.90378642522114,
    24.39641811375939, 112.93125088338341, 200.54642800206747, 68.34871407305954, 50.1287378355131,
    28.651660794769825, 141.89183563954987, 110.59830017149, 121.57201863440746, 181.19918368448202,
    87.2307798925667, 107.33756396014041, 218.09558508643408, 83.44118900867832, 45.141925319871326,
    90.67934591630316, 167.53948838546353, 129.16195427253172, 203.12762805292016, 106.69634378790172,
    131.68880313013784, 124.03879598224886, 228.5383397449004, 170.38406607043783, 66.01455845469604,
    116.10176817650003, 177.16969460515304, 161.51691384951692, 58.70880558816803, 132.81085906181977,
    181.66883542495103, 118.93515936256333, 3.8958341442981865, 147.19151180864162, 228.6600064598377,
    177.57536441435263, 37.361524825162384, 49.661043657988685, 18.3711123325557, 44.62065234526582,
    111.57594071370484, 4.212708036608179, 166.95341183819266, 4.201906310346553, 214.6512201897064,
    112.5472320467628, 124.34116502952244, 160.72681406532843, 102.43902887538593, 8.766983633083099,
    114.74130410115907, 71.06376610970945, 160.83137350482332, 134.26290441354223, 109.747586350856,
    18.8442292021611, 43.347713418042844, 31.344675088666275, 41.650038776358706, 176.17484736440824,
    30.5006702664221, 177.78141726305526, 11.772117163818667, 221.69252675993158, 161.15693362011444,
    225.8314358118679, 192.99747471585545, 158.53368195431074, 174.14755220582663, 186.13701472703823,
    119.44704280125062, 225.4153715977053, 9.685579204463952]

In [None]:
y_outcomes = [0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
    1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
    1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
    0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
    1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

In [None]:
import pandas as pd

# We can create a table by setting up {'column name' : list, ...}
# like below
data = {'Cumulative Time Studying' : x_time,
  'Outcome (P=1,F=0)' : y_outcomes}

table = pd.DataFrame(data)
table.style # Displays a stylized table

In [None]:
import matplotlib.pyplot as plt

plt.title("Cumulative hours studied vs Outcomes (P=1, F=0)")
plt.scatter(x_time,y_outcomes)

**Remark:** At this point, it's probably important to state up-front that this data is synthetically generated (i.e. it does not represent a group of real students).

In this data, the $x$-coordinates (measuring the time students studied) have been generated "uniformly at random" from the interval $[0,241]$. This means that an $x$-coordinate was just as likely to land near $0$ as it was to land near $241$, or near any other value in this interval.
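As a rough sketch of what "uniformly at random" looks like in practice, the snippet below draws 143 values from $[0,241]$ using `random.uniform` from Python's standard library. The seed and the name `simulated_hours` are illustrative assumptions; this is not the code that produced `x_time`.

```python
import random

random.seed(0)  # arbitrary seed, chosen only so the sketch is reproducible

# Draw 143 study times, each equally likely to land anywhere in [0, 241].
simulated_hours = [random.uniform(0, 241) for _ in range(143)]

# The smallest and largest simulated values both fall inside the interval.
print(min(simulated_hours), max(simulated_hours))
```

Plotting a histogram of a list like this one should show bars of roughly equal height, with no clustering around a mean.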

**How does this differ from real data?**

In my experience teaching thousands of students over the past few years, it would be atypical to see this kind of distribution representing the hours that students studied throughout the quarter. I would typically expect the following in a real dataset:
- There would be a group of students that had very low to no hours studied throughout the quarter, with at least a few people putting exactly zero.
- Excluding this group, I would expect the distribution to be a bit more "condensed" towards the mean, with fewer students towards the tails.
- In a self-reported poll, almost all students will put integer responses. There would also be a considerable amount of wild rounding, since it's difficult to assess exactly how much time you actually spend working if you don't monitor it precisely.

With this data, we want to find a function $P(x)$ that gives the probability that a student passes this exam given that they study for $x$ hours throughout the quarter leading up to the exam.

In a *linear regression* we would guess that this function is a line, so $P(x)=mx+b$ for constants $m$ and $b$. Does this make sense in our scenario?

The answer is emphatically *no*. Any linear function will either be flat, which does not seem to represent our collection of data at all (students seem more likely to pass if they have studied longer throughout the quarter), or it will go to $\pm \infty$ as the magnitude of $x$ gets very large. For a function $P(x)$ that measures probability, we should expect that $P(x)$ produces values in the interval $[0,1]$ (corresponding to $P(x)=0$ -- there's no chance or 0% -- all the way up to $P(x)=1$ -- it's a guarantee, 100%.)

## 1.1 A Short Tangent on **Odds**.



If $P(x)$ measures the probability of an event occurring given the information $x$, then the odds of that event occurring given $x$ are described by the function
$$L(x) = \frac{P(x)}{1-P(x)}.$$

**Example:** If you have a bag with 7 red balls and 3 blue balls, and you pick a ball at random from the bag, what is the probability that the ball you pick is red?

In this example, the result is independent of any input variable $x$ (for a silly example, maybe $x$ measures the amount of time you wait to pick the ball?). We can therefore say $P(x)=7/10$ since there are 7 red balls and 10 balls in total.

The odds of picking a red ball given $x$ is then

$$L(x) = \frac{(7/10)}{1-(7/10)}=\frac{(7/10)}{(3/10)} = \frac{7}{3} \approx 2.3333.$$

In other words, you are slightly over twice as likely to pick a red ball as you are to not pick a red ball (which amounts, in this example, to picking a blue ball).
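The arithmetic above can be checked with a few lines of Python. This is only a small sketch: the helper name `odds` is our own, and we use the standard `fractions` module so that the result stays an exact fraction.

```python
from fractions import Fraction

def odds(p):
    """Odds of an event whose probability is p (requires p < 1)."""
    return p / (1 - p)

p_red = Fraction(7, 10)     # probability of drawing a red ball

print(odds(p_red))          # the exact odds, as a fraction
print(float(odds(p_red)))   # the same odds as a decimal
```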

Note that, if the function $P(x)$ takes values in the interval $[0,1)$, then the odds function $L(x)$ takes values in the interval $[0,\infty)$ (the odds are undefined when $P(x)=1$).

If we assume that $0<P(x)<1$ for all values of $x$ then, by taking the logarithm, we get a function
$$\mathrm{logodds}(x)=\mathrm{ln}(L(x))=\mathrm{ln}\left( \frac{P(x)}{1-P(x)}\right)$$
that has range inside the interval $(-\infty,\infty)$.

Throughout all of our examples, we are going to make the following assumption, without trying to justify why the assumption makes sense.

**Assumption:** The function $\mathrm{logodds}(x)$ can be modeled with a linear function.

Generally, it's a difficult question whether, for a given example, this is a reasonable assumption to make. Being able to answer this question takes a strong background in statistics.

However, if we are justified in making this assumption, we can write
$$\mathrm{logodds}(x) = mx+b$$ for some constants $m$ and $b$. If we then replace this function with its relation to $P(x)$ and solve we get:

\begin{align} \mathrm{ln}\left( \frac{P(x)}{1-P(x)}\right) &= mx+b\\
\frac{P(x)}{1-P(x)} &= e^{mx+b}\\
P(x)&=(1-P(x))e^{mx+b}\\
(1+e^{mx+b})P(x)&=e^{mx+b}.\end{align}

Solving for $P(x)$ gives us the following expression:
\begin{align}P(x)& =\frac{e^{mx+b}}{1+e^{mx+b}}\\
&=\frac{1}{1+e^{-(mx+b)}}.\end{align}
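As a sanity check on the derivation, we can confirm numerically that taking the log-odds of this $P(x)$ recovers the line $mx+b$. The constants and sample inputs below are arbitrary choices made for illustration, and `logodds` is a helper name of our own.

```python
import math

m, b = 0.85, 0.5  # arbitrary illustrative constants

def P(x):
    # The logistic function obtained at the end of the derivation
    return 1 / (1 + math.exp(-(m * x + b)))

def logodds(p):
    # Natural log of the odds p / (1 - p)
    return math.log(p / (1 - p))

# The last two numbers on each row should agree up to rounding error.
for x in [-3.0, 0.0, 2.5]:
    print(x, logodds(P(x)), m * x + b)
```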

The function $P(x)=\frac{e^{mx+b}}{1+e^{mx+b}}$ is a type of *Logistic* function. Below is some code that will display a plot of a logistic function given $m$ and $b$. Try playing around with different values of $m,b$ to see how the function changes shape.

In [None]:
import math
import matplotlib.pyplot as plt

# Slope m and intercept b of the log-odds line mx+b
m=0.85
b=0.5

# The logistic function P(x) = 1/(1+e^{-(mx+b)})
def P(x):
  return 1/(1+math.exp(-(m*x+b)))

x_samples=[]
P_outputs=[]

jump=0.1
for i in range(200):
  x_samples.append(-10+jump*i)
  P_outputs.append(P(x_samples[i]))

plt.grid(True)
plt.plot(x_samples,P_outputs)

## 1.2 What is Logistic Regression?

In logistic regression, our goal is to find the function $P(x)=\frac{1}{1+e^{-(mx+b)}}$ that "best estimates" the true probability of a successful outcome that underlies our data, given the information of a particular known condition.

What we mean by "finding" the function $P(x)$ is just that we want to determine the explicit values for $m$ and $b$ that give us the "best estimate". To do this, we use a precise notion of "best estimate" which sets up finding $m$ and $b$ as the outcome of an optimization problem in two variables.

Specifically, we will consider a function $z=F(s,t)$ of two variables $s$ and $t$. This function $F$ should tell us: if we were to use $m=s$ and $b=t$ in our function $P(x)$, then how accurate is the resulting function in predicting our data?

The values that we're really interested in finding correspond to a point $(m,b)$ where $F(m,b)$ is minimized (among all of its possible outputs).
This is analogous to the optimization problems that we've already seen, except we have a function with two inputs instead of one.

**Remark**: If this is confusing you, think about the example $z=F(s,t)=s^2+t^2$. Which $(s,t)$ pair minimizes this function?

So which function $F$ should we use for our optimization?

Let's assume that we chose concrete constants $s$ and $t$ and consider the function $P(x)=\frac{1}{1+e^{-(sx+t)}}$. Then for a given data point $(x_i,y_i)$, we would be able to explicitly compare the outcome $y_i$ to the probability of this outcome predicted by $P(x_i)=\frac{1}{1+e^{-(sx_i+t)}}$.


---


**Definition:** For each of our data points $i=1,...,N$ we call the value
$$ l_i = \begin{cases} -\ln(P(x_i)) & \mbox{if } y_i=1\\ -\ln(1-P(x_i)) & \mbox{if } y_i=0\end{cases}$$
the *log-loss* of the point $(x_i,y_i)$.


---

If we had the correct values for the constants $m$ and $b$, then we would expect the log-loss of each point $(x_i,y_i)$ to be relatively small. In the example of students above: if $P(x_i)$ is close to $1$, then the probability that a student passes having studied $x_i$ hours should be high. If $y_i=1$, then the true outcome for this student agrees with our prediction and, in this case, the log-loss is $-\ln(P(x_i))$ which will be small.

On the other hand, if $P(x_i)$ is close to $1$ and if $y_i=0$, then the log-loss is $-\ln(1-P(x_i))$, which is like $-\ln(\epsilon)$ for a small number $\epsilon$, and this will be large. This large value tells us that there is a large disparity between our prediction function's expected output $P(x_i)$ and the true output, telling us something might be off.


The log-loss at a point $(x_i,y_i)$ can be succinctly written in one expression as

$$l_i=-y_i\ln(P(x_i))-(1-y_i)\ln(1-P(x_i)).$$
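To see that this single expression really does agree with the two-case definition, we can compare the two on a handful of values. The function names and the sample probabilities below are our own illustrative choices.

```python
import math

def log_loss_piecewise(p, y):
    # The two-case definition: -ln(p) if y = 1, and -ln(1 - p) if y = 0
    if y == 1:
        return -math.log(p)
    else:
        return -math.log(1 - p)

def log_loss_combined(p, y):
    # The single expression: one of the two terms is always multiplied by 0
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

for p in [0.1, 0.5, 0.9]:
    for y in [0, 1]:
        print(p, y, log_loss_piecewise(p, y), log_loss_combined(p, y))
```

Since $y_i$ is always $0$ or $1$, exactly one of the two terms survives, which is why the two columns match.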

The function $F(s,t)$ that we would like to minimize is the sum of the log-losses:

$$F(s,t)=\sum_{i=1}^N l_i = \sum_{i=1}^N \Big(-y_i\ln(P(x_i))-(1-y_i)\ln(1-P(x_i))\Big).$$

Remember that we are considering $P(x)=1/(1+e^{-(sx+t)})$ so that, after substituting $x=x_i$ into $P(x)$, the resulting expression for $F(s,t)$ is an expression of just the two variables $s$ and $t$.
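Here is a minimal sketch of $F(s,t)$ written directly from the formula. The four data points are hypothetical toy values (not the exam data), picked so that the numbers are easy to check by hand.

```python
import math

# Hypothetical toy data (not the exam data): x-values with 0/1 outcomes.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

def F(s, t):
    # Sum of the log-losses of every data point for the constants s and t
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(s * x + t)))
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total

print(F(0.0, 0.0))   # every P(x_i) is 0.5, so this equals 4*ln(2)
print(F(2.0, -5.0))  # a curve that rises through x = 2.5 fits this data better
```

Once the data is fixed, the only inputs left are $s$ and $t$, which is what makes $F$ a function of two variables.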

## 1.3 Working Out the Logistic Regression in Our Example

If we pick values for $s$ and $t$, we can work out the exact log-loss for the resulting probability function $P(x)$ using our initial data set. Unfortunately, because our formulas combine logarithms and exponentials, computing the log-loss directly on a computer can fail: when $sx_i+t$ is large in magnitude, the exponential can overflow, or $P(x_i)$ can round to exactly $0$ or $1$, in which case one of the logarithms is no longer defined.

Fortunately, there are ways that we can work around this difficulty. Consider a specific data point $(x_i,y_i)$. To simplify things, let $z=sx_i+t$ where $s$ and $t$ are the constants we are using to produce our probability function $P(x)$. Then, in this notation, we can write $P(x_i)=1/(1+e^{-z})$.

If $y_i=1$, then the log-loss is $$l_i=-\ln(P(x_i))=-\ln\left(\frac{1}{1+e^{-z}}\right)=\ln(1+e^{-z}).$$

If $y_i=0$, then the log-loss is $$l_i=-\ln(1-P(x_i))=-\ln\left(1-\frac{1}{1+e^{-z}}\right)=-\ln\left(\frac{e^{-z}}{1+e^{-z}}\right)=-\ln\left(\frac{1}{1+e^z}\right)=\ln(1+e^z).$$ To go from the third spot to the fourth in the sequence of equalities above, we put the leftmost 1 on a common denominator with the expression for $P(x_i)$. To go from the fourth spot to the fifth, we multiply the top and bottom of the fraction by $e^z$.

We are going to use a function `logaddexp(a,b)` from a Python package called Numpy in order to get a precise estimate of these log-losses. The `logaddexp` function provides a way to accurately compute the value of $\mathrm{logaddexp}(a,b)=\ln(e^a+e^b)$ by using known alternative formulas for the right-hand side of this expression (see https://numpy.org/devdocs/reference/generated/numpy.logaddexp.html). Note that our log-losses are either $\mathrm{logaddexp}(0,z)$ or $\mathrm{logaddexp}(0,-z)$.
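The short sketch below illustrates why this matters. The value of `z` is an illustrative choice: it is roughly the size we would get from $s=5$ and $x\approx 229$.

```python
import math
import numpy as np

z = 1145.0  # illustrative large value of s*x + t

# Evaluating ln(1 + e^z) directly fails: e^z is too large for a float.
try:
    naive = math.log(1 + math.exp(z))
except OverflowError:
    naive = None
print("direct formula:", naive)

# logaddexp(0, z) = ln(e^0 + e^z) is designed to avoid this overflow.
stable = np.logaddexp(0, z)
print("logaddexp:", stable)
```

For a $z$ this large, $\ln(1+e^z)$ is indistinguishable from $z$ itself, which is the value that `logaddexp` returns.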

We do this for a few examples below.

In [None]:
import numpy as np

**Example:** $s=0.001$ and $t=1$

In [None]:
s=0.001
t=1

# We'll use a small helper function for z
def z(x):
  return s*x+t

# We haven't seen the keyword else before, but
# it works like this: first the condition y==1 is
# checked to be `true` or `false`. If this condition
# is true, then what is in the `if` block is evaluated.
# If the condition is false, then what is in the `else`
# block is evaluated.
def log_loss(x,y):
  if y==1:
    return np.logaddexp(0,-z(x))
  else:
    return np.logaddexp(0,z(x))

# Now we accumulate our log_losses for the data set (x_time, y_outcomes)
Total_log_loss = 0
for i in range(143):
  Total_log_loss = Total_log_loss + log_loss(x_time[i],y_outcomes[i])

# And we display it to the screen
print(Total_log_loss)

**Example:** $s=5$ and $t=-0.5$

In [None]:
# Since z(x) uses s and t in its definition,
# we can just change those values here and the computer
# will use them the next time z(x) is asked for.
s=5
t=-0.5

# Note that we tell the computer to look-up the log_loss function,
# and then to evaluate it at x_time[i], y_outcomes[i]. When it looks-up
# the log_loss function, it will see that it needs the z(x) function.
# When it computes z(x), it will use the most recent input for s and t.
Total_log_loss = 0
for i in range(143):
  Total_log_loss = Total_log_loss + log_loss(x_time[i],y_outcomes[i])

print(Total_log_loss)

**Example:** $s=-0.0123$ and $t=0.47$


In [None]:
s=-0.0123
t=0.47

Total_log_loss = 0
for i in range(143):
  Total_log_loss = Total_log_loss + log_loss(x_time[i],y_outcomes[i])

print(Total_log_loss)

Of the previous three examples, the total log-loss was smallest when $s=0.001$ and $t=1$, so these constants appear to give us the "most accurate" prediction function $P(x)$ among the three.

Using more advanced mathematical techniques (e.g. a higher-dimensional version of the Newton-Raphson method), we can find the "best constants" to be (approximately) $$s=0.04006626814304893\quad \mbox{and}\quad t=-4.96656060295091.$$
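To give a feel for what such a method looks like, here is a hedged sketch of a two-variable Newton-Raphson iteration for the total log-loss. It runs on a small hypothetical data set rather than the exam data, and it is not necessarily how the constants above were computed. The helper name `gradient_and_hessian`, the starting guess $(0,0)$, and the fixed count of 25 steps are all assumptions made for illustration. The formulas used are the standard partial derivatives of $F$: $\partial F/\partial s=\sum (P(x_i)-y_i)x_i$ and $\partial F/\partial t=\sum (P(x_i)-y_i)$.

```python
import numpy as np

# Hypothetical toy data (not the exam data); chosen so that no threshold
# separates the 0s from the 1s perfectly, which keeps the minimum finite.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
ys = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

def gradient_and_hessian(s, t):
    p = 1 / (1 + np.exp(-(s * xs + t)))   # P(x_i) for every data point
    r = p - ys                             # differences P(x_i) - y_i
    grad = np.array([np.sum(r * xs), np.sum(r)])
    w = p * (1 - p)                        # weights in the second derivatives
    hess = np.array([[np.sum(w * xs**2), np.sum(w * xs)],
                     [np.sum(w * xs),    np.sum(w)]])
    return grad, hess

params = np.zeros(2)  # starting guess (s, t) = (0, 0)
for _ in range(25):
    grad, hess = gradient_and_hessian(*params)
    params = params - np.linalg.solve(hess, grad)  # Newton-Raphson update

s_best, t_best = params
grad, _ = gradient_and_hessian(s_best, t_best)
print("s =", s_best, "t =", t_best)
print("gradient at the final point:", grad)
```

At a minimum both partial derivatives vanish, so a gradient that is essentially zero suggests that the iteration has settled. The constants quoted above play the same role for the exam data.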

Let's check that the constants $s\approx 0.0401$ and $t\approx -4.97$ for our exam data produce a smaller total log-loss than any of our previous examples.

In [None]:
s=0.04006626814304893
t=-4.96656060295091

Total_log_loss = 0
for i in range(143):
  Total_log_loss = Total_log_loss + log_loss(x_time[i],y_outcomes[i])

print(Total_log_loss)

Knowing that these values are our optimal constants giving us the most accurate prediction function $P(x)$, we can now define our probability function explicitly using these values. We'll then graph the probability function $P(x)$ versus our data-set and see how they compare. We'll see, finally, how we can use this information to make some informed guesses about student success rates for this particular exam.

In [None]:
s=0.04006626814304893
t=-4.96656060295091

def P(x):
  return 1/(1+math.exp(-(s*x+t)))

# Lists to display the graph of P
x_inputs = []
P_outputs = []

step = 0.5
for i in range(460):
  x_inputs.append(step*i)
  P_outputs.append(P(x_inputs[i]))

plt.grid(True)
plt.plot(x_inputs,P_outputs, color='black')
plt.scatter(x_time,y_outcomes)

Looking at the graph of this function, we can gain some insight about the test and student performance on the test.

As one observation, we can say that students who study for about 124 hours before the exam will have roughly equal odds of passing or failing the test (a 50/50 chance). To be very confident that students will pass the exam (say a 95% chance) however, we would want to see students that spend close to 200 hours studying prior to the exam.

We can find out how many students actually studied for this amount of time (in both absolute magnitude, and in relative terms as a percentage of the entire class) by running the code below.

In [None]:
count = 0
for i in range(len(x_time)):
  if x_time[i]>200:
    count = count+1

print("The number of students that studied more than 200 hours:", count)
print("The percentage of students that studied more than 200 hours:", 100*count/len(x_time))

We can also answer specific questions using our new function $P(x)$. For example, what is the probability that another student passes this same exam if they have only studied 65 hours prior to the exam?

In [None]:
print("The probability (as a percentage) that a student passes this exam, given that they've studied 65 hours prior:", 100*P(65))

We could also answer questions like: approximately how long would one need to study to have a 75% probability of passing this exam?

**Remark:** Technically, we could solve for this value exactly using algebra (try it yourself). A lazy approximation can be obtained by running the following code-block. A while loop will keep running as long as the expression following the `while` keyword evaluates to `True`, and it stops the first time that expression evaluates to `False`.

In [None]:
x=1
while P(x)<0.75:
  x=x+0.1

print("It will take about", round(x, 1), "hours of prior studying to have a 75% probability of passing the exam.")

As a comment: it's probably more accurate to say that any one student who has studied for this amount of time (about 151 hours) will either pass or fail the test; however if we look at a large number of students, say 1000 students, that have all studied for this amount of time (about 151 hours), then we expect that close to 75% of them will pass and the rest will fail (or, even more precisely, if we took the limit as the number of students went to infinity, then we would expect that the proportion that pass approaches 75% in the limit).
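This "large number of students" interpretation can be illustrated with a small simulation. It is a sketch only: the seed and the group sizes are arbitrary choices, and each simulated student is assumed to pass independently with probability $0.75$.

```python
import random

random.seed(1)  # arbitrary seed, only for reproducibility

p_pass = 0.75  # each simulated student passes with this probability

for n in [10, 1000, 100000]:
    # Count how many of the n simulated students pass
    passes = sum(1 for _ in range(n) if random.random() < p_pass)
    print(n, "students:", passes / n, "passed")
```

Each individual student either passes or fails, but the proportion that pass should settle near 75% as the group gets larger.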

# 2. Chemical Prevalence in Animal Blood Analysis

In this section, you'll be tasked with carrying out a logistic regression for a collection of data provided below. You'll then be asked to complete some exercises that analyze this data.

In this example, we are a group of researchers that are analyzing the survivability of a group of animals after having been exposed to a chemical $R$. When an animal is exposed to chemical $R$, it goes through various pathways, eventually ending up in the animal's blood (and then it's spread to all parts of the animal where the blood will go, maybe accumulating in a certain part of the animal's body, or maybe it won't accumulate but will instead go through the natural decay process to eventually be passed as waste).

We've randomly selected $N=37$ exposed animals that have been tested for the concentration of chemical $R$ in a recently collected blood sample. The following data `R_concentration` reports these measurements in the units of $mg/mL$.

After 6 months, we record whether the animal that we collected blood from is alive or dead. We record these outcomes in the list `D_outcomes` below, where we use $1$ to mark that the animal has died and $0$ to mark that the animal is still alive.

In [None]:
R_concentration = [0.0018692225198974738, 0.009064855931122256, 0.009221190279840929, 0.0007407734620591024, 0.03170895410635477, 0.0026285529505347765, 0.005753885205066147, 0.007293212177370159, 0.002486416677612312, 0.014918607406788645, 0.018218475140936285, 0.006855838849034305, 0.005436202107593114, 0.017557900013196008, 0.0020041307331935586, 0.004282781215758221, 0.0038344694872489474, 0.004726919651374512, 0.0007462942021228984, 0.0007558990064413289, 0.036923541695680726, 0.011724231646764444, 0.005188094207675217, 0.006846426178591216, 1.3886612711223818e-06, 0.009184603223554657, 0.0024309719945358513, 0.0014823370489971605, 0.0009189350177019497, 0.026224475812082846, 0.006386821190372966, 0.011912259712352685, 0.0020914099056144834, 0.0012895488660066563, 0.007677437236697688, 0.005546945160333639, 0.01769446399830894]

In [None]:
D_outcomes = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

## 2.1 Exercises

**Exercise 1:** This question has multiple parts.

a) Make a scatter plot of concentration of chemical $R$ in the blood of an animal vs its recorded outcome (dead or alive) after 6 months.

You'll hopefully see that, although most animals have a very small concentration of chemical $R$ in their blood sample, there are a few with much larger concentrations that all ended up dying.

A more useful comparison between concentration and survivability, in this case, might then be to compare the *magnitude* of the concentration to survivability (rather than comparing the concentration itself to the survivability outcome).

b) Make a list `R_log_con` that contains the values $\log_{10}(c)$ for all concentrations $c$ in `R_concentration`. This new list tells us the magnitude of the exposure to chemical $R$ that has reached an animal's blood.

c) Make a scatter plot comparing `R_log_con` with `D_outcomes`. Are there any outliers in this data? If there are, make a second plot that excludes these outliers by restricting the domain of the scatter plot to the most relevant data.
Label both plots to explicitly state that one contains all data points and that in the other outliers are excluded.

d) Finally, make a table that has columns with entries from `R_concentration`, `R_log_con`, and `D_outcomes`.

**Exercise 2:** We are going to do a logistic regression on the data corresponding to the log of our concentrations vs survivability outcomes.

a) Choose three points $(s,t)$ to use as values for a probability prediction function $P(x)=1/(1+e^{-(sx+t)})$.

For each of these points that you've selected, compute the total log-loss for the corresponding function $P(x)$. Note that $x$ should be the magnitude of concentration in animal blood (i.e. the log(concentration) to base 10) and $P(x)$ should be interpreted as the probability that the animal dies within 6 months of the blood sample being taken.

b) Compare the values you got for the total log-loss at the points $(s,t)$ from part a) to the total log-loss at the point $(10.732782658135832, 20.88396170753766)$. What do you notice about the total log-loss for this point?

**Exercise 3:** The point $(10.732782658135832, 20.88396170753766)$ is actually (approximately) the point at which the function that gives the total (cumulative) log-loss of our data points attains its minimum possible value.

Come up with a possible strategy that would allow you to find this point given only the information of the total log-loss function $F(s,t)$ as above.

Your strategy doesn't have to be correct, or need to give the correct result. You only need to come up with a convincing strategy that utilizes the tools you have available to you.

Provide a description of how your strategy might be implemented. Give an explanation as to why your strategy should provide a good guess to a minimum value.

**Exercise 4:** Using the point given to you in **Exercise 3**, answer the following questions.

a) The Median Lethal Dose (LD-50) is the dose of a chemical or drug that would be required to kill approximately half of a given population. Using our optimal model for $P(x)$, at what concentration of chemical $R$ will the LD-50 occur? In other words, what concentration of chemical $R$ provides a log-concentration $x$ that produces $P(x)=0.5$?

b) Suppose that we measure two new animals that have both been exposed to chemical $R$. One animal appears to have a concentration of chemical $R$ of approximately $0.003132412342$ $mg/mL$ in their blood sample, and the other has concentration $0.0212203912372$ $mg/mL$ in their blood sample.

Given these values, what would we expect the survivability outcome to be for these animals after 6 months from the blood sample collection date?