<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 20: Models

Associated Textbook Sections: [10.4, 11.0, 11.1](https://inferentialthinking.com/chapters/10/4/Random_Sampling_in_Python.html)

## Outline

* [Assessing Models](#Assessing-Models)
* [Jury Selection](#Jury-Selection)
* [A Genetic Model](#A-Genetic-Model)
* [Two Viewpoints](#Two-Viewpoints)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Assessing Models

### Models

* A model is a set of assumptions about the data
* In data science, many models involve assumptions about processes that involve randomness ("chance models")
* Key question: Does the model fit the data?

### Approach to Assessment

* If we can simulate data according to the assumptions of the model, we can learn what the model predicts.
* We can then compare the predictions to the data that were observed.
* If the data and the model's predictions are not consistent, that is evidence against the model.

---

## Jury Selection

### Swain vs. Alabama, 1965

* Talladega County, Alabama
* Robert Swain, Black man convicted of crime
* Appeal: one factor was all-White jury
* Only men 21 years or older were allowed to serve
* 26% of this population were Black
* Swain's jury panel consisted of 100 men
* 8 men on the panel were Black

### Supreme Court Ruling (In English ... of the time)

* About disparities between the percentages in the eligible population and the jury panel, the Supreme Court wrote: 
> "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of Negroes"
* The Supreme Court denied Robert Swain’s appeal


### Supreme Court Ruling (in Data)

* Paraphrase: 8/100 is less than 26%, but not different enough to show Black men were systematically excluded
* Question: Is 8/100 a realistic outcome if the jury panel selection process were truly unbiased?

### Sampling from a Distribution

* Sample at random from a categorical distribution using `sample_proportions(sample_size, pop_distribution)`
* The function samples at random from the population and returns an array containing the distribution of the categories in the sample
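
To make the return value concrete, here is a minimal sketch that mimics this behavior with NumPy alone. The helper name `sample_proportions_sketch` and the fixed seed are assumptions made for illustration; this is not the `datascience` library's own implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the example is repeatable

def sample_proportions_sketch(sample_size, pop_distribution):
    """Hypothetical stand-in: draw one random sample and return category proportions."""
    counts = rng.multinomial(sample_size, pop_distribution)  # number drawn in each category
    return counts / sample_size  # proportions in the sample; they sum to 1

# Two categories with population proportions 26% and 74%
example = sample_proportions_sketch(100, [0.26, 0.74])
print(example)
```

Each call returns a different array, because each call draws a new random sample.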


### Demo: Swain vs. Alabama

Create a 2-valued array that holds the proportions of the eligible population that are Black and not Black.

In [None]:
population_proportions = ...
population_proportions

Use `sample_proportions` to simulate the category proportions in one random sample of size 100 from the population defined by the array above.

In [None]:
...

Define a function that returns the proportion of Black individuals in a random sample of size 100 from the given population. Then visualize the distribution of the values obtained by calling that function 1000 times.

In [None]:
def panel_proportion():
    return ...

In [None]:
panel_proportion()

In [None]:
panels = ...

for ...
    new_panel = ...
    panels = ...

In [None]:
Table().with_column('Number of Black Men on Panel of 100', panels).hist(bins=np.arange(5.5,40.))
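
For reference, the whole jury simulation can be sketched with NumPy alone. This is one possible approach under stated assumptions, not the intended solution to the cells above: the seed, the variable names, and the use of `multinomial` in place of `sample_proportions` are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1965)  # seed chosen only for repeatability (an assumption)
eligible = [0.26, 0.74]            # Black, not Black, as stated in the slides
panel_size = 100
repetitions = 1000

# Number of Black men on each simulated panel, assuming purely random selection
simulated_counts = rng.multinomial(panel_size, eligible, size=repetitions)[:, 0]

# Fraction of simulated panels with 8 or fewer Black men (8 is the observed count)
fraction_at_most_8 = np.mean(simulated_counts <= 8)
print(simulated_counts.min(), fraction_at_most_8)
```

Comparing the smallest simulated count with the observed 8 shows where Swain's panel sits relative to what random selection tends to produce.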

---

## A Genetic Model

### Gregor Mendel, 1822-1884

<img src="img/Gregor_Mendel.jpeg" alt="Gregor Mendel"
width = 30%>

Image Source: [Wikipedia - Gregor Mendel](https://en.wikipedia.org/wiki/Gregor_Mendel)

### A Model

* Pea plants of a particular kind
* Each one has either purple flowers or white flowers
* Mendel’s model: Each plant is purple-flowering with chance 75%, regardless of the colors of the other plants
* Question: Is the model good, or not?


### Choosing a Statistic

* Take a sample, see what percent are purple-flowering
* If that percent is much larger or much smaller than 75, that is evidence against the model
* Distance from 75 is the key
* Statistic: `abs(sample_percent_of_purple_flowering_plants - 75)`
* If the statistic is large, that is evidence against the model
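
As a small worked example, the statistic can be written as a function and applied to the counts quoted in this lecture's demo (709 purple-flowering plants out of 929). The function name `discrepancy` is hypothetical.

```python
def discrepancy(sample_percent_purple, model_percent=75):
    """Distance between a sample percent and the percent the model predicts."""
    return abs(sample_percent_purple - model_percent)

# Counts from the demo: 709 of Mendel's 929 plants had purple flowers
observed_percent = 100 * 709 / 929
observed_discrepancy = discrepancy(observed_percent)
print(round(observed_percent, 2), round(observed_discrepancy, 2))
```

Taking the absolute value means a sample that is too high and a sample that is too low by the same amount count equally against the model.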


### Demo: Mendel and Pea Flowers

Define the observed proportion of purple flowers in Mendel's data and the predicted population proportions for purple and non-purple flowers.

In [None]:
# Mendel had 929 plants, of which 709 had purple flowers
observed_purples = ...
observed_purples

In [None]:
predicted_proportions = ...
predicted_proportions

Simulate sampling 929 plants from a population with the proportions predicted by Mendel.

In [None]:
...

Simulate randomly selecting samples of 929 plants under Mendel's model. Repeat this process 10,000 times and visualize the distribution of the percent of purple-flowering plants in each sample.

In [None]:
def purple_flowers():
    return ...

In [None]:
purple_flowers()

In [None]:
purples = ...

for ...
    new_purple = ...
    purples = ...

In [None]:
Table().with_column('Percent of purple flowers in sample of 929', purples).hist()

Visualize the distribution of the test statistics from the simulations and identify where the observed statistic falls in that distribution.

In [None]:
test_statistics = Table().with_column('Discrepancy in sample of 929 if the model is true', abs(purples - 75))
test_statistics

In [None]:
test_statistics.hist()

In [None]:
observed_statistic = ...
observed_statistic

In [None]:
test_statistics.hist()
plt.scatter(observed_statistic, 0, color='red', s=60, zorder=3);
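
The same comparison can be sketched numerically with NumPy alone. The seed and variable names are assumptions, and the fraction computed at the end is only one rough way to describe where the observed statistic falls; it is not part of the demo cells above.

```python
import numpy as np

rng = np.random.default_rng(1866)  # seed fixed only so the sketch is repeatable
model = [0.75, 0.25]               # purple, not purple, under Mendel's model
sample_size = 929
repetitions = 10000

# Percent of purple-flowering plants in each simulated sample of 929
purple_counts = rng.multinomial(sample_size, model, size=repetitions)[:, 0]
simulated_percents = 100 * purple_counts / sample_size

# Test statistic for each simulated sample: distance from 75
simulated_stats = np.abs(simulated_percents - 75)

# Observed statistic from Mendel's data (709 purple out of 929)
observed_stat = abs(100 * 709 / 929 - 75)

# Fraction of simulated statistics at least as large as the observed one
fraction_as_extreme = np.mean(simulated_stats >= observed_stat)
print(round(observed_stat, 2), fraction_as_extreme)
```

A sizeable fraction means the observed statistic sits well inside the simulated distribution, which is consistent with the model.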

---

## Two Viewpoints

### Model and Alternative

* Jury selection:
    * Model: The people on the jury panels were selected at random from the eligible population
    * Alternative viewpoint: No, they weren't
* Genetics:
    * Model: Each plant has a 75% chance of having purple flowers
    * Alternative viewpoint: No, it doesn't

### Steps in Assessing a Model

* Choose a statistic to measure discrepancy between model and data
* Simulate the statistic under the model’s assumptions
* Compare the data to the model’s predictions:
    * Draw a histogram of simulated values of the statistic
    * Compute the observed statistic from the real sample
    * If the observed statistic is far from the histogram, that is evidence against the model
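
The steps above can be sketched as one reusable function. The names `assess_model` and `jury_discrepancy` are hypothetical, and reporting the fraction of simulated values at least as large as the observed one is just one rough way to express "far from the histogram". The usage applies it to the jury panel numbers from this lecture (26% eligible, 8 of 100 observed).

```python
import numpy as np

rng = np.random.default_rng(20)  # seed fixed only so the sketch is repeatable

def assess_model(simulate_one_statistic, observed_statistic, repetitions=1000):
    """Hypothetical helper: simulate the statistic under the model, then report
    the fraction of simulated values at least as large as the observed one."""
    simulated = np.array([simulate_one_statistic() for _ in range(repetitions)])
    return simulated, np.mean(simulated >= observed_statistic)

def jury_discrepancy():
    # One simulated panel of 100 drawn at random from a population that is 26% Black
    percent_black = rng.multinomial(100, [0.26, 0.74])[0]  # a count out of 100 is a percent
    return abs(percent_black - 26)

# Observed panel: 8 Black men out of 100, so the observed discrepancy is |8 - 26|
simulated, fraction = assess_model(jury_discrepancy, abs(8 - 26))
print(simulated.max(), fraction)
```

Passing the simulation as a function keeps the three steps separate: the statistic is chosen once, simulated under the model, and then compared with the observed value.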


---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>