In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab11.ipynb")

<img style="display: block; margin-left: auto; margin-right: auto" src="./ccsf-logo.png" width="250rem;" alt="The CCSF black and white logo">

<div style="text-align: center;">
    <h1>Lab 11: Conditional Probability</h1>
    <em>View the related <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page for additional details.</em>
</div>

This lab is an introduction to conditional probabilities.  

The lab includes a visualization called an *icon array*. It's meant to be an instructional part of the lab to help build intuitions about conditional probability.  These visualizations do not appear in the textbook and will not appear on any exam.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore')

# This line loads the visualization code for this lab.
import visualizations

## What is conditional probability good for?

Suppose we have a known population, like all dogs in California.  So far, we've seen three ways of *predicting* something about an individual in that population, given incomplete knowledge about the identity of the individual:

* If we know nothing about the individual dog, we could predict that its speed is the *average* or *median* of all the speeds in the population.
* If we know the dog's height but not its speed, we could use *linear regression* to predict its speed from its height.  The resulting prediction is still imperfect, but it might be more accurate than the population average.
* If we know the dog's breed, height, and age, we could use *nearest-neighbor classification* to predict its speed by comparing it to a collection of dogs with known speed.


We can also compute conditional probabilities to make predictions about individuals or events. This technique is different from the previous methods we’ve examined because 

1. our prediction for each outcome is described by a probability, and 
2. each probability can be exactly calculated from assumptions, as opposed to estimated from data.
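
As a minimal sketch of the second point, consider a situation unrelated to this lab's data: one roll of a fair six-sided die. Assuming every face is equally likely (an assumption, not a measurement), the chance of a six given an even roll follows from counting alone. The helper name `conditional_probability` is invented for this illustration.

```python
from fractions import Fraction

# Assumption: a fair six-sided die, so each face is equally likely.
faces = [1, 2, 3, 4, 5, 6]

def conditional_probability(outcomes, event, given):
    """P(event | given) when every outcome is equally likely."""
    remaining = [o for o in outcomes if given(o)]   # outcomes still possible
    matching = [o for o in remaining if event(o)]   # of those, the ones we want
    return Fraction(len(matching), len(remaining))

p_six = conditional_probability(faces, lambda f: f == 6, lambda f: True)
p_six_given_even = conditional_probability(faces, lambda f: f == 6, lambda f: f % 2 == 0)

print(p_six)             # 1/6
print(p_six_given_even)  # 1/3
```

No data were collected here; both numbers come entirely from the assumption of a fair die.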

## Icon arrays

Parts of this lab work with a more complex example about disease, but first, let's start with a simple example.

Imagine you and Samantha are playing a game in which you are given a marble and tasked with determining the marble's texture and size. You don't know anything about the marble you're given, but you know that Samantha drew it **uniformly at random** from a bag that contained the following marbles:
* 4 large shiny marbles,
* 1 large dull marble,
* 6 small shiny marbles,
* 2 small dull marbles.

### Task 01 📍

Knowing only what we've told you so far, what's the probability that the marble you're given was a large shiny marble?

In [None]:
probability_large_shiny = ...

In [None]:
grader.check("task_01")

Here's a table with those marbles:

In [None]:
marbles = Table.read_table("marbles.csv")
marbles.show()

Here are the counts of each type of marble in a pivot table.

In [None]:
marbles.pivot('surface', 'size')

Here are all the different combinations of surface and size, with the count for each surface-size combination. Each type of marble appears in its own row.

In [None]:
marbles.group(['surface', 'size'])

We've included some code to display something called an *icon array*.  The functions in the cell below create icon arrays from various kinds of tables. Don't worry about understanding the code; just run this cell.

**NOTE:** You may ignore the y-axis labels. Just remember that each box represents 1 marble.

In [None]:
# Run this cell.

#######################################################################
# The functions you'll need to actually use are in here.  Each is a
# way of making an icon array from a differently-formatted table.
#######################################################################

def display_icon_array(table, groups, individuals_name):
    """
    Given a table and some columns to group it on, displays an icon array
    of the groups.
    
    groups should be an array of labels of columns in table.
    
    individuals_name is your name for the individual rows of table.
    For example, if we're talking about a population of people,
    individuals_name should be "people".
    
    For example:
    
    display_icon_array(marbles, ["surface", "size"], "marbles")
    """
    display_grouped_icon_array(table.group(groups), individuals_name)

def display_grouped_icon_array(grouped_data, individuals_name):
    """
    Given a table with counts for data grouped by 1 or more categories,
    displays an icon array of the groups represented in the table.
    
    grouped_data should be a table of frequencies or counts, such as
    a table created by calling the group method on some table.
    
    individuals_name is your name for the individual members of the
    dataset.  For example, if we're talking about a population of
    people, individuals_name should be "people".
    
    For example:
    
    display_grouped_icon_array(marbles.group(["surface", "size"]), "marbles")
    """
    visualizations.display_combinations(grouped_data, individuals_name=individuals_name)

def display_crosstab_icon_array(crosstabulation, x_label, individuals_name):
    """
    Given a crosstabulation table, displays an icon array of the groups
    represented in the table.
    
    crosstabulation should be a table of frequencies or counts created by
    calling pivot on some table.
    
    x_label should be the label of the categories listed as columns (on
    the "x axis" when the crosstabulation table is printed).
    
    individuals_name is your name for the individual members of the
    dataset.  For example, if we're talking about a population of
    people, individuals_name should be "people".
    
    For example:
    
    display_crosstab_icon_array(marbles.pivot("surface", "size"), "surface", "marbles")
    """
    display_grouped_icon_array(visualizations.pivot_table_to_groups(crosstabulation, x_label), individuals_name)

Here's an icon array of all the marbles, grouped by surface and size:

In [None]:
visualizations.make_displayed_rectangles(visualizations.compute_combination_data(marbles.group(["surface", "size"]), 'marble(s)'))


In [None]:
# Run this cell.
display_grouped_icon_array(marbles.group(["surface", "size"]), "marble(s)")

Imagine that your marble is a random draw from these 13 icons.

The following is an icon array of the marbles, grouped **only by their surface (shiny/dull)**.

In [None]:
display_grouped_icon_array(marbles.group("surface"), "marble(s)")

Knowing nothing else about the marble, it's equally likely to be any of the marbles depicted above; this is because we've assumed that the marble is selected **uniformly at random** from the bag.

### Task 02 📍

What's the probability that you've been given a shiny marble? Answer using mathematical expressions by looking at your icon array.

In [None]:
probability_shiny = ...

In [None]:
grader.check("task_02")

### Conditional probability

Suppose you overhear Samantha say that you were given a large marble. Does this somehow change the chance that your marble is shiny?  Let's find out.

Go back to the full icon array, displayed below for convenience.

In [None]:
display_grouped_icon_array(marbles.group(["surface", "size"]), "marble(s)")

In Task 02, we assumed that the marble you were given was equally likely to be any of the marbles, simply because we didn't know any better.  That's why we looked at all the marbles to compute the probability that your marble was shiny.

But assuming that you've been given a large marble, we can eliminate some of these possibilities. In particular, you can't have been given a small shiny marble or a small dull marble.

You're still equally likely to have been given any of the remaining marbles, because you don't know any other information.  So here's an icon array of those remaining possibilities:

In [None]:
# Just run this cell.
display_grouped_icon_array(marbles.where("size", "large").group("surface"), "large marble")

#### Task 03 📍

What's the probability Samantha gives you a shiny marble, knowing that she gave you a large marble? 

*Hint: Take a look at the icon array.*

In [None]:
probability_shiny_given_large = ...

In [None]:
grader.check("task_03")

You should have found that this is different from the probability that Samantha gave you a shiny marble, given no size information, which you computed earlier.  The distribution of surfaces among the large marbles is a little different from the distribution of surfaces among all the marbles.
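
The same comparison can be sketched in plain Python on a made-up bag. The bead counts, the `beads` dictionary, and the helpers `is_red`, `is_round`, and `proportion` are all hypothetical and are not part of this lab's data or code.

```python
from fractions import Fraction

# Hypothetical counts of beads by (color, shape); not the lab's marbles.
beads = {
    ("red", "round"): 3,
    ("red", "square"): 5,
    ("blue", "round"): 9,
    ("blue", "square"): 3,
}

def is_red(key):
    return key[0] == "red"

def is_round(key):
    return key[1] == "round"

def proportion(counts, keep, within=None):
    """Share of beads satisfying `keep` among those satisfying `within`."""
    eligible = {k: n for k, n in counts.items() if within is None or within(k)}
    kept = sum(n for k, n in eligible.items() if keep(k))
    return Fraction(kept, sum(eligible.values()))

p_red = proportion(beads, is_red)                        # among all beads
p_red_given_round = proportion(beads, is_red, is_round)  # among round beads only

print(p_red, p_red_given_round)  # 2/5 1/4
```

In this made-up bag, learning that the bead is round lowers the chance that it is red, because red beads are rarer among the round ones than in the bag as a whole.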

Run the following cell to display the icon array that is relevant for the next task.

In [None]:
display_grouped_icon_array(marbles.where("surface", "shiny").group("size"), "shiny marble(s)")

#### Task 04 📍

Suppose instead Samantha had said she gave you a **shiny** marble (hooray!). What's the probability that the marble given to you is large? 

Run the code cell above to display the icon array, then assign `probability_large_given_shiny` to the appropriate value.

In [None]:
probability_large_given_shiny = ...

In [None]:
grader.check("task_04")

#### Task 05 📍🔎

Can you answer the previous two tasks just by looking at the full icon array?  (You can run the cell below to see it again.)

*Hint:* Check out [Ch 18.2](https://inferentialthinking.com/chapters/18/2/Making_Decisions.html) for a refresher on how to use Bayes' Rule.

In [None]:
# Just run this cell.  The next cell is where you should write your answer.
display_grouped_icon_array(marbles.group(["surface", "size"]), "marble(s)")

<!-- BEGIN QUESTION -->

If you can, how? If not, why not? Check with someone to see if you are on the right track. 


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Cancer Screening

Hopefully the icon arrays from the previous portion helped you build intuition for why conditional probabilities can be helpful. Now, let’s look at a real life application.

### Background

Cancer screening is not 100% reliable, and it can have errors. There are cases in which someone without cancer or with a benign tumor sees a positive test result. There are also cases where someone with cancer can receive a negative result. (“Positive” implies an indication of cancer in this context.) The first case, called a false positive, could cause anxiety for a patient and lead to further testing that may be risky and involve unnecessary radiation. In the second case, called a false negative, a person would not receive the necessary information and possible treatment for their condition.

Conditional probability can provide insight into the accuracy of initial cancer tests. For example, you can compute the chance that a person has cancer given the results of a diagnostic test by combining information from different probability distributions. You’ll see that the chance Person X does have cancer even if they receive a positive result is not necessarily 100%.
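
For reference, one standard way to write this combination is Bayes' rule. The symbols are notation chosen here for illustration and are not defined elsewhere in this lab: $C$ means "has cancer" and $+$ means "tests positive".

$$P(C \mid +) = \frac{P(+ \mid C)\,P(C)}{P(+ \mid C)\,P(C) + P(+ \mid \text{not } C)\,P(\text{not } C)}$$

The denominator is the overall chance of a positive result, counting both true positives and false positives.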

It is important to note that conditional probabilities offer just one line of insight to understanding a test result and that this statistical approach does not take into account important factors like heredity or environment that come to affect a person’s health and diagnoses. The [Centers for Disease Control and Prevention](https://www.cdc.gov/cancer/dcpc/prevention/screening.htm) (CDC) recommends getting preventative cancer screenings, and one should always consult a doctor if they are concerned about their health. 

_On May 9, 2023, the U.S. Preventive Services Task Force released an updated [draft recommendation statement](https://www.uspreventiveservicestaskforce.org/uspstf/sites/default/files/file/supporting_documents/breast-cancer-screening-draft-rec-bulletin.pdf) for breast cancer screening._

### Basic cancer statistics

---

Note: All of the following statistics are made up and don't necessarily reflect the actual state of the world.

---

Suppose that in a representative group of 10,000 people who are tested for cancer ("representative" meaning that the frequencies of different events are the same as the frequencies in the whole population):
1. 100 people have cancer.
2. Among the 100 people that have cancer, 90 have positive results on a cancer test and 10 have negative results.  (So 10 people receive false negative results.)
3. The other 9,900 people don't have cancer.
4. Among these 9,900 people, 198 have positive results on a cancer test and the other 9,702 have negative results.  (So 198 see "false positive" results.)

Below we've generated a table with data from these 10,000 hypothetical people.

In [None]:
people = Table().with_columns(
    "status", ["cancer", "cancer", "no cancer", "no cancer"],
    "test status", ["positive", "negative", "positive", "negative"],
    "count", [90, 10, 198, 9702])
people

One way to visualize this dataset is with a contingency table, which you've seen before.

#### Task 06 📍

Create a contingency table that looks like this:

|status|negative|positive|
|-|-|-|
|cancer|||
|no cancer|||

...with the **count** of each group filled in, according to what we've told you above.  The counts in the 4 boxes should sum to 10,000.

*Hint:* Use `pivot` with the `sum` function.


In [None]:
cancer = ...
cancer

In [None]:
grader.check("task_06")

Here is the `people` data in an icon array.

In [None]:
display_grouped_icon_array(people, "people who've taken a cancer test")

Now let's think about how you can use this kind of information when Person X is tested for cancer.

Before you know any information about Person X, you could imagine Person X as a **uniform random sample** of one of the 10,000 people in this imaginary population of people who have been tested.

What's the chance that Person X has cancer, knowing nothing else about them?  It's $\frac{100}{10000}$, or 1%.  We can see that more directly with this icon array:

In [None]:
by_health = people.select(0, 2).group(0, sum).relabeled(1, 'count')
display_grouped_icon_array(by_health, "people who've taken a cancer test")
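
The 1% figure is an exact calculation, but it can also be checked approximately by simulation. The sketch below uses only Python's standard library; the variable names, the seed, and the number of draws are choices made for this illustration.

```python
import random

# Population counts from the lab's hypothetical group of 10,000 people.
population = ["cancer"] * 100 + ["no cancer"] * 9_900

rng = random.Random(0)                     # fixed seed so the run is repeatable
draws = 100_000
sample = rng.choices(population, k=draws)  # uniform draws, with replacement

estimate = sample.count("cancer") / draws
print(estimate)  # close to 0.01, but generally not exactly equal
```

The simulated proportion only approximates the exact answer, which is one reason this lab computes probabilities from assumptions rather than estimating them from samples.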

#### Task 07 📍

What's the chance that Person X has a positive test result, knowing nothing else about them?
Run the next code cell to display an icon array, then assign `probability_positive_test` to this value.

In [None]:
# Run this cell to display an icon array. Then fill in the probability of a positive test result in the last line.
by_test = people.select(1, 2).group(0, sum).relabeled(1, 'count')
display_grouped_icon_array(by_test, "people who've taken a cancer test")

# Now fill in the probability of a positive test result
probability_positive_test = ...

In [None]:
grader.check("task_07")

### Interpreting test results

Suppose Person X has a positive test result.  This means that you can now narrow them down to being part of just one of the two following groups:
1. The people with cancer who have a positive test result.
2. The people without cancer who have a positive test result.

Here's an icon array for those two groups:

In [None]:
# Just run this cell.
display_grouped_icon_array(people.where("test status", are.equal_to("positive")).drop(1), "people who have a positive test result")

The *conditional probability* that Person X **has cancer given their positive test result** is the chance that they're in the first group (cancer), assuming they have a positive test result.

#### Task 08 📍

Eyeballing the icon array above, is the conditional probability that Person X has cancer **given their positive test result** closest to:

1. 9/10

2. 2/3

3. 1/2

4. 1/3

5. 1/100

Assign `rough_prob_sick_given_positive` to the integer (1 through 5) corresponding to your answer.


In [None]:
# Set this to the number (1-5) of your chosen answer above.
rough_prob_sick_given_positive = ...

In [None]:
grader.check("task_08")

#### Task 09 📍

Now write code to calculate that probability exactly, using the original contingency table you wrote (the `cancer` table).

Run the next code cell to see the `cancer` table, then fill in `prob_sick_given_positive` with your code.


In [None]:
# Run this cell first to see the cancer table. Then fill in the next line of this cell.
print(cancer)

prob_sick_given_positive = ...

print('Probability of cancer given positive test result: {}'.format(prob_sick_given_positive))

In [None]:
grader.check("task_09")

#### Task 10 📍🔎

<!-- BEGIN QUESTION -->

Look at the full icon array again.  Using that, how would you compute the conditional probability of cancer given a positive test?

Run the next code cell to see the full icon array. 


_Type your answer here, replacing this text._

<!-- END QUESTION -->



In [None]:
# The full icon array is given here for your convenience.
# Write your answer in the previous cell.
display_grouped_icon_array(people, "people who've taken a cancer test")

#### Task 11 📍🔎

<!-- BEGIN QUESTION -->

Is your answer to Task 09 bigger than the overall proportion of people in the population who have cancer (1%, as computed earlier in this section)?  Does that make sense? 

Check with someone to see if you have the right idea. 


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## A Subjective Prior

Being right isn’t always satisfying. Classifying a patient with a positive result as probably not having cancer still seems somehow wrong for such an accurate test. In this section, we'll take a deeper look at the basis of our probability calculation: the assumption of randomness.

Our assumption was that a randomly chosen person was tested from our population of 10,000 individuals. This doesn’t happen in reality. People go in to get tested because they think they might have the disease, or because their doctor thinks they might have the disease. People getting tested are not randomly chosen members of the population.

That is why our intuition about people getting tested was not fitting well with the answer that we got. In **Task 09**, we calculated that the probability of cancer given a positive test result was **0.3125**, or **31.25%**. This seems unusually low, especially given what we know about the efficacy of cancer screenings. We were imagining a realistic situation of a patient going in to get tested because there was some reason for them to do so, whereas the calculation was based on a randomly chosen person being tested.

So let’s redo our calculation under the more realistic assumption that the patient is getting tested because the doctor thinks there’s a chance the patient has the disease and compare the two probabilities.
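
To see how strongly the starting assumption matters, here is a rough sketch that applies Bayes' rule to a hypothetical test (not the one in this lab) over several prior probabilities. The function `posterior`, its parameter names, and the 80% and 10% rates are assumptions made for illustration.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_positives = prior * true_positive_rate
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Hypothetical test: it flags 80% of people who have the condition
# and 10% of people who do not.
for prior in [0.01, 0.05, 0.25, 0.50]:
    print(f"prior {prior:.2f} -> posterior {posterior(prior, 0.80, 0.10):.3f}")
```

The test itself never changes in this loop. Only the prior does, and the chance of having the condition after a positive result rises with it.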

**Aside:** For more information about Part 4 of this lab, check out [Section 18.2.2](https://inferentialthinking.com/chapters/18/2/Making_Decisions.html#a-subjective-prior) from the textbook.

### A New Look at the Population

Suppose now that our population of 10,000 individuals takes into account the fact that choosing to take a medical test is not done at random; there must be some motive for an individual to take one. Thus, our modified population has the following properties:
1. 1,000 people have cancer.
2. Among the 1,000 people that have cancer, 900 have positive results on a cancer test and 100 have negative results. 
3. The other 9,000 people don't have cancer.
4. Among these 9,000 people, 180 have positive results on a cancer test and the other 8,820 have negative results.  (So 180 see "false positive" results.)
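
One way to see what changed between the two populations is to turn each set of counts into rates. The sketch below uses the counts stated in this lab; the helper `summarize`, the tuple layout, and the dictionary keys are names invented here.

```python
# Counts as (cancer & positive, cancer & negative,
#            no cancer & positive, no cancer & negative).
original = (90, 10, 198, 9702)
modified = (900, 100, 180, 8820)

def summarize(counts):
    """Convert the four counts into a prevalence and two test error rates."""
    tp, fn, fp, tn = counts
    total = tp + fn + fp + tn
    return {
        "prevalence": (tp + fn) / total,           # share of people with cancer
        "true positive rate": tp / (tp + fn),      # positives among people with cancer
        "false positive rate": fp / (fp + tn),     # positives among people without
    }

print(summarize(original))
print(summarize(modified))
```

Under these counts the test behaves identically in both populations. What differs is the share of tested people who have cancer, which is ten times larger in the modified population.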

Below we've generated a population table called `people_new` and a contingency table with data from these 10,000 hypothetical people called `cancer_new`.

In [None]:
# Just run this cell to load the table
people_new = Table().with_columns(
    "status", ["cancer", "cancer", "no cancer", "no cancer"],
    "test status", ["positive", "negative", "positive", "negative"],
    "count", [900, 100, 180, 8820])
people_new

In [None]:
# Just run this cell to load the table
cancer_new = Table().with_columns(
    "status", ["cancer", "no cancer"],
    "negative", [100, 8820],
    "positive", [900, 180])
cancer_new

As before, we can present the data above in an icon array. Here is the `people_new` data in such a format.

In [None]:
display_grouped_icon_array(people_new, "people who've taken a cancer test")

### Interpretation and Comparison

As in the Cancer Screening section, let's suppose Person X has a positive test result.  This means that you can now narrow them down to being part of just one of the two following groups:
1. The people with cancer who have a positive test result.
2. The people without cancer who have a positive test result.

Just as we did before in this lab, here's an icon array for those two groups:

In [None]:
# Just run this cell.
display_grouped_icon_array(people_new.where("test status", are.equal_to("positive")).drop(1), "people who have a positive test result")

#### Task 12 📍

Using the new contingency table, `cancer_new`, that was provided at the beginning of this section, write code to calculate the exact probability that an individual has cancer given that they've received a positive test result.

Run the next code cell to see the `cancer_new` table, then fill in `prob_cancer_given_positive_new` with your code.

In [None]:
# Run this cell first to see the cancer_new table. Then fill in the next line of this cell.
cancer_new.show()

prob_cancer_given_positive_new = ...

print(f'Probability of cancer given positive test result: {prob_cancer_given_positive_new}')

In [None]:
grader.check("task_12")

#### Task 13 📍🔎

<!-- BEGIN QUESTION -->

How does your probability in Task 09 compare to your answer from Task 12? Does that make sense? Check with someone to see if you have the right idea. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Submit your Lab to Canvas

Once you have finished working on the lab questions, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the requirements for a Complete score for this lab assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command will run all of the tests on your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu items "File", "Save and Export Notebook As...", and "HTML (.html)" in the notebook's Toolbar to download an HTML version of this notebook file.
5. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded HTML file.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()