<div class="alert alert-block alert-danger">

# 3D: Westvaco's DGP (Complete)

**Lesson assumes students have read up through page: 3.13**

***NOTE: This notebook is optional, and is very similar to notebook 01B The Whole Thing. Feel free to skip it if you already completed 01B with your class, unless you would like to revisit the lesson now that they've had a chance to become a little more familiar with coding and some of the concepts.***

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 50-60 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

**Preamble:** In the year Robert Martin turned 54, the Westvaco Corporation (an envelope company) decided to downsize. They laid off many people, including Robert Martin. Later that year, he sued Westvaco, claiming he had been laid off because of his age. A major piece of Martin’s case was based on a statistical analysis of the ages of the Westvaco employees.


<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  5-8 mins

</div>

## 1.0 - The Data

1.1 - Here is one set of data from Martin v. Westvaco (from one of the departments, simplified and modified for teaching purposes). What is each of these lines of code doing? Poke around to see what this `Westvaco` data frame is like. 

In [None]:
# This line of code saves the csv file in a data frame called Westvaco
Westvaco <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRSAFbaPcTt1PJvW8C-_x8KA_KLqQQboYBfSDimHMkW8rn3z8VhZfpW7ZG7Bo0nuANCcqAbW4aKEFsq/pub?gid=1945114845&single=true&output=csv", header = TRUE)

# This line of code filters the data to include just the people who were fired and saves it into Westvaco_Fired
Westvaco_Fired <- filter(Westvaco, Status == "Fired")

# Check a few lines of the data
head(Westvaco)
head(Westvaco_Fired)


1.2 - What do you think the rows mean?

<div class="alert alert-block alert-warning">

**Sample Response**

The rows are employees at Westvaco.

</div>

1.3 - What are the questions you have about this data?

<div class="alert alert-block alert-warning">

**Sample Response**

*Questions about the data will vary.*

</div>

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  10-12 mins

</div>

## 2.0 - Explore Variation

2.1 - Run the code below. What is the histogram about?

In [None]:
gf_histogram(~ Age, data = Westvaco_Fired, color = "firebrick", fill = "firebrick2")

<div class="alert alert-block alert-warning">

**Sample Response**

The histogram shows the distribution of ages of the employees who were fired.

</div>

### Check out the data but be careful not to jump to conclusions.

2.2 - What are some reasons (from the data) you have for suspecting discrimination?

<div class="alert alert-block alert-warning">

**Some possible reasons:**

- Most of the people who were fired were over 50
- The distribution is skewed with the peak toward the older employees

</div>

2.3 - What are some reasons (from the data) you have for not suspecting discrimination?

<div class="alert alert-block alert-warning">

**Some possible reasons**

- There were still some young people who were also fired
- The distribution of all the employees already had a lot of older people anyway
- There could be other reasons the people were fired

</div>

2.4 - Some context might be helpful. Try creating a histogram of the ages for all the people in this department at Westvaco or a histogram of ages for those that didn’t get fired below. (You may want to play around with colors, fill, etc.). You can also try layering a boxplot over your histogram.  

After doing so, add on to your reasons (above) for suspecting or not suspecting discrimination. 

In [None]:
# other histograms to explore these data
gf_histogram(~Age, data = Westvaco, color = "antiquewhite2", fill = "cornflowerblue")

# A density histogram with a density curve
gf_dhistogram(~Age, data = Westvaco, color = "royalblue", fill = "lightpink") %>%
    gf_density(fill = "palegreen")
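
A numeric summary can complement the histograms when comparing groups. The sketch below uses only base R and a small made-up data frame called `toy` (hypothetical values, not the real Westvaco records). The `Age` and `Status` columns mirror the notebook's data, but the `"Retained"` label is an assumption. Check the actual labels with `tally(~Status, data = Westvaco)` before adapting it.

```r
# Hypothetical toy data for illustration only (NOT the real Westvaco records)
toy <- data.frame(
    Age = c(25, 31, 38, 42, 47, 50, 53, 55, 56, 59, 61, 64),
    Status = c("Retained", "Retained", "Fired", "Retained", "Retained", "Fired",
               "Fired", "Retained", "Fired", "Fired", "Retained", "Fired")
)

# Median age within each Status group
median_by_status <- tapply(toy$Age, toy$Status, median)
median_by_status

# Median age for the whole (toy) department, for context
median_all <- median(toy$Age)
median_all
```

Comparing each group's median to the overall median is one way to judge whether the fired group looks older than the department as a whole.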

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  10-12 mins

</div>

## 3.0 - Thinking about the DGP

3.1 - What is the population in this case? What is the Data Generating Process that we are interested in?

<div class="alert alert-block alert-warning">

**Sample Response**

- The population is the group of people who could have been fired (the whole data set of people who worked for this department). The DGP we are interested in is the process by which these people were fired (was it a fair process or a process of discrimination?).

</div>

3.2 - Even though random firing isn't a very smart way to run a business, random firing is *at least* not discriminatory. Why is it not considered "discrimination"? 

<div class="alert alert-block alert-warning">

**Sample Response**

- Random firing is not discriminatory because it is a process that does not care how old or young you are. It does not take that information into account.

</div>

3.3 - Is it possible to randomly fire people and still end up with an age distribution that has a lot of older people and fewer younger people in it? Is it likely?

<div class="alert alert-block alert-warning">

**Sample Response**

- It is possible to randomly fire people and get mostly older people. It's hard to say whether it's likely or not (students will vary in their opinions).

</div>

<div class="alert alert-block alert-success">

### 4.0 - Approximate Time:  12-15 mins

</div>

## 4.0 - Simulating a Random DGP

The goal of this section is to consider these questions: What if people were randomly fired? Do you think the distribution would look different? Would randomly fired distributions look similar to the empirical sample of fired people?  (If firing is random, then we all can agree it's not age discrimination.) 

But we aren't just going to sit around and think, we are actually going to **do**. We'll simulate randomly firing people in R.

4.1 - To get you started, how might we simulate randomly firing people in R? How many people should we fire in our simulations?

In [None]:
# Use sample() to simulate randomly firing people
# We should sample the same number of people that were actually fired

# Check the number of people fired
tally(~Status, data = Westvaco)

# Sample 22 people
Westvaco_Rand_Fired <- sample(Westvaco, 22)
Westvaco_Rand_Fired
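
The cell above appears to rely on the `sample()` that `coursekata` loads (assumed here to be the one from the `mosaic` package), which draws rows when given a data frame. Base R's `sample()` would treat a data frame as a list of columns instead. As a hedged base R sketch of the same idea, the code below samples row indices without replacement from a made-up data frame `toy` (hypothetical values, not the real Westvaco records). It counts the fired employees rather than hard-coding the number.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical toy data for illustration only (NOT the real Westvaco records)
toy <- data.frame(
    Age = c(25, 31, 38, 42, 47, 50, 53, 55, 56, 59, 61, 64),
    Status = c("Retained", "Retained", "Fired", "Retained", "Retained", "Fired",
               "Fired", "Retained", "Fired", "Fired", "Retained", "Fired")
)

# Count how many people were actually fired
n_fired <- sum(toy$Status == "Fired")

# Pick that many row numbers at random, without replacement
fired_rows <- sample(nrow(toy), size = n_fired)

# Keep only the randomly "fired" rows
toy_rand_fired <- toy[fired_rows, ]
toy_rand_fired
```

Sampling without replacement matters here because the same person cannot be fired twice.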

### Who will this favor? Martin or Westvaco?

4.2 - Also let’s capture our initial thoughts (we will return to this after the simulations): If it turns out to be pretty rare to generate a random sample like the empirical sample of fired people, does that favor Martin or Westvaco? What DGP would that favor?


<div class="alert alert-block alert-warning">

**Sample Response**

- Students are likely to have mixed feelings. If they haven't done any Westvaco simulations before, they might be guessing (50% for Martin).
- Actual answer: If it turns out to be rare to get a sample like the real one, that favors Martin (DGP of discrimination).
- Common wrong answer: If it turns out to be rare to generate a random sample, and that sample was generated by randomness, that would favor randomness as the DGP.

**Follow up questions to help confused students**
- How are these simulated samples generated? 
- Is randomness similar to what Martin thinks happened? Or different from what Martin believes happened?
- If randomness is easily able to generate the outcome that happened, could Martin have been fired by a process like randomness? Is that fair or discriminatory?

</div>

In [None]:
# Fire some people randomly. Make a histogram of randomly fired people’s ages. 
Fired_Randomly <- sample(Westvaco, 22)
gf_histogram(~Age, data = Fired_Randomly, color = "orange", fill = "yellow")
# Alternate density histogram with density plot
gf_dhistogram(~Age, data = Fired_Randomly, color = "yellow", fill = "orange") %>%
    gf_density(color = "cyan")
# Histogram with boxplot 
gf_histogram(~Age, data = Fired_Randomly, color = "yellow", fill = "orange") %>%
    gf_boxplot(width = .25)
# Run the code a few times to simulate firing samples of people a few times. 

### Do those simulated samples look like our empirical sample?

4.3 - Is it rare to get a sample like our empirical sample? What are you looking at when you decide that a distribution is similar or different from our empirical sample?

<div class="alert alert-block alert-warning">

**Sample Response**

- The majority of the samples will (typically) look more like the whole distribution (of retained and fired employees). 
- Looking at shape of distributions.
- Often students might look at the younger end of the distributions and notice those are different than the empirical sample.

</div>
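
Eyeballing one histogram at a time makes "rare" hard to judge. One possible extension is to repeat the random firing many times and record a single number for each simulated group. The base R sketch below uses mean age as that number, which is an assumption made for illustration (the notebook itself compares shapes by eye). It also uses a made-up data frame `toy` (hypothetical values, not the real Westvaco records).

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical toy data for illustration only (NOT the real Westvaco records)
toy <- data.frame(
    Age = c(25, 31, 38, 42, 47, 50, 53, 55, 56, 59, 61, 64),
    Status = c("Retained", "Retained", "Fired", "Retained", "Retained", "Fired",
               "Fired", "Retained", "Fired", "Fired", "Retained", "Fired")
)

n_fired <- sum(toy$Status == "Fired")

# Mean age of the people who were actually fired (in the toy data)
observed_mean <- mean(toy$Age[toy$Status == "Fired"])

# Randomly "fire" n_fired people 1000 times, keeping each group's mean age
sim_means <- replicate(1000, mean(sample(toy$Age, size = n_fired)))

# Proportion of random firings at least as old, on average, as the observed group
prop_as_old <- mean(sim_means >= observed_mean)
prop_as_old
```

A small proportion would suggest that a random DGP rarely produces a group as old as the observed one. A large proportion would suggest that randomness produces such groups easily.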

<div class="alert alert-block alert-success">

### 5.0 - Approximate Time:  12-15 mins

</div>

## 5.0 - Re-cap and Summary

### Who will this favor? Martin v. Westvaco

5.1 - If it's very rare to randomly generate a sample like the empirical sample of fired people, does that favor Martin or Westvaco? 

5.2 - If it's not too hard to randomly generate a sample like the empirical sample of fired people, does that favor Martin or Westvaco? 

<div class="alert alert-block alert-warning">

**Sample Response**

If it is rare to get a sample like the real one, it favors Martin. If it's not too hard to randomly get a sample like the real one, that favors Westvaco.

</div>

### Analogy Time!

#### Distribution Triad with Dice Rolls

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_nFL51ytC-3-C-Dist-Triad-Dice-Rolls.png" title="Visual summary of the dice example in chapter 3" width = 70%/>

#### Distribution Triad with Westvaco
<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_vQ02D06M-3-C-Dist-Triad-Westvaco.png" title="Visual summary of today's Westvaco example" width = 70%/>


<div class="alert alert-block alert-warning">

### Sample Explanation for Graphic

(Note: We call this way of organizing the three distributions the distribution triad. It comes up repeatedly in course videos starting in chapter 4. We use it to align multiple examples with each other even though each is in a different context. In this case, we compare the distribution of fair dice rolls with the distribution of fired employees.)

1. We almost always start by just taking a look at the data... and data only represent a sample (sometimes we call this the empirical sample). 
In the Westvaco case, we looked at the data for the people that were actually fired and those that were not fired. Things like 58 represent something about this sample.

2. But in statistics, we are often interested in the Data Generating Process: how was that sample data generated? The hard thing about statistics is that we rarely have access to the actual DGP! But we can make little theories (or models) of the DGP.
Here, we wondered -- could the DGP have had something to do with age? (e.g., FIRED = AGE + OTHER STUFF) Or did it have nothing to do with age? (e.g., FIRED = OTHER STUFF)

3. Even though we can't look directly at the DGP, we can mimic a **random** DGP! We can mimic a random process that does NOT fire people based on age (or anything else). With computers, we can do this thousands of times! We can then wonder, "Hm, even if we used a completely random process, sometimes we get a sample that looks like our empirical sample. Did our empirical sample come from a process like this random sample? Or from some other process?"

### Analogies

A. *The Sample Distribution:*

- **Dice** - If we take a sample of 24 dice rolls and do not get any rolls of 3, does this mean the die is not fair? Will it look like other random samples of 24 dice rolls?
- **Westvaco** - The sample of people who were actually fired appears to have mostly older people. Is this likely to happen with other random samples?
    
B. *The Population Distribution/The DGP:*

- **Dice** - the theory is that a fair die will have a uniform distribution, with every number having an equal likelihood
- **Westvaco** - the theory is that the distribution of fired people should look similar to the distribution of all the employees (if they were truly fired randomly)

C. *The Sampling Distribution:*

- **Dice** - If we simulate several random samples (n=24) of dice rolls, how often do we get a distribution similar to our sample distribution? Sometimes we do not get any rolls of a particular number. Our samples don't look very much like a uniform distribution even though they came from a uniform DGP.
- **Westvaco** -  If we simulate several random samples (n=22) of employees, how often do we get a distribution similar to our sample distribution? This is a hard question to eyeball but many students feel like our sample is maybe a bit more "extremely old" than the randomly generated samples. Some of the samples don't look very much like the population even though they were randomly selected from the gray histogram. 

</div>
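
The dice analogy can also be simulated. The base R sketch below rolls a fair die 24 times, repeats that many times, and records how often at least one face never shows up. The variable names and the choice of 10,000 repetitions are assumptions made for illustration.

```r
set.seed(3)  # fixed seed so the sketch is reproducible

n_rolls <- 24     # rolls per sample, as in the dice example
n_sims <- 10000   # number of simulated samples (an arbitrary choice)

# Look at one sample of 24 rolls, counting each face (including faces with 0 rolls)
one_sample <- sample(1:6, size = n_rolls, replace = TRUE)
table(factor(one_sample, levels = 1:6))

# For each simulated sample, record whether at least one face is missing
missing_face <- replicate(n_sims, {
    rolls <- sample(1:6, size = n_rolls, replace = TRUE)
    length(unique(rolls)) < 6
})

# Proportion of fair-die samples in which some face never appeared
prop_missing <- mean(missing_face)
prop_missing
```

Unlike the firing simulation, the dice are sampled with replacement (`replace = TRUE`), because the same face can come up on more than one roll.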