# 1A: Westvaco - The Whole Thing


At the beginning of each Jupyter notebook, we must load the packages that we are going to use within the notebook. We usually copy/paste this code from one of our other Jupyter notebooks!

Always run this first or a lot of our other code won't work.

When you run it, you will not see any output, but you will see the `In[]:` (on the left) change to indicate that it ran properly. If you see an asterisk in it, like this `In[*]:`, it means it is still running. 


In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

### This is a Weird Lesson

Today we'll take a tour through *all the important concepts we will cover in this class*. So it's a weird lesson. You aren't expected to understand everything we talk about today. But we want you to see how it will all fit together. 

We figure if you have a sense of the whole story of statistics and how it gets used, then when we learn all the little details, you'll have a better sense of why they are important.

### The Case of Martin vs. Westvaco

Back in 1994, Robert Martin was laid off from his job at Westvaco, Inc., (an envelope company) soon after he turned 55 years old. Martin believed that he was fired because of his age and sued Westvaco for age discrimination. A major piece of Martin’s case was based on a statistical analysis of the ages of the Westvaco employees. We will be taking a look at a small sample of the data that was part of this court case.




## 1.0 - The Data

This is data from one of Westvaco's departments. Three employees were fired from this department.
- Ages of all the employees in this department: 25, 33, 35, 38, 48, 55, 55, 55, 56, 64
- Ages of the three who were fired: 55, 55, 64

1.1 - What *in the data* supports Martin's claim of age discrimination?

1.2 - What *in the data* supports the company's claim that it was NOT age discrimination?


1.3 - We've put the ages of all the employees into an object called `everyone`. Write some code to save the ages of the three *fired* employees into an object called `fired`.

In [None]:
everyone <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)


# If you write this code correctly, will it print anything out?

1.4 - There is a function called `mean()`. We have written code to calculate the mean age for everyone. Write some code to calculate the mean age of the three people who were fired.


In [None]:
mean(everyone)
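
To see what `mean()` is doing, here is a small sketch that computes the same average by hand: add up the ages, then divide by how many there are. The object names `total_age` and `n_people` are made up for this illustration.

```r
everyone <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)

# Add up all ten ages
total_age <- sum(everyone)

# Count how many employees there are
n_people <- length(everyone)

# Dividing the total by the count gives the same answer as mean(everyone)
total_age / n_people
mean(everyone)
```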

1.5 - Take a moment to practice putting *meaning* onto this number. For example, what does the 58 stand for? Does this help Martin's side of the argument or Westvaco's?

## 2.0 - The Data Generating Process (DGP)

We know that the people who were fired were older on average than the rest. But what we really want to know is *how* were these people fired? **What was the Data Generating Process (DGP)?** 

2.1 - Age might have played a role. But maybe it didn't. Is it possible to have a completely fair process (e.g., randomly firing 3 people) and end up firing a group with an average age of 58? 


Without knowing exactly what happened in the brains of those who fired the three employees, the DGP is unknown. This one example is not enough to figure out the DGP because the *same* DGP can generate different samples! But we can come up with little theories (or models) of the DGP.

2.2 - Here are some possible DGPs written as word equations:

1. *FIRED = AGE + OTHER STUFF* (what is the other stuff?)
> Martin thought that the decision to fire someone had something to do with age. But maybe some other stuff was involved too. 

2. *FIRED = TARDINESS + OTHER STUFF*  
> Some people might think that the firing had a little bit to do with whether people were late to work.

3. *Come up with another idea as a word equation*
> Explain your idea here.

4. *How would you write Westvaco's claim that "age had nothing to do with it" as a word equation*
> Westvaco might say, "Hey the decision was based on other stuff, not age discrimination!"

## 3.0 - A Special DGP: Randomness 


Although firing people randomly would be a terrible way to run a business, it is fair. Also, it is a DGP that we can easily simulate.

3.1 - If you pick three of the ten ages **25, 33, 35, 38, 48, 55, 55, 55, 56, 64** at random, do you think you are likely to get an average age of 58 or greater? Why or why not?


Let's try it! Let’s simulate the process of randomly selecting three people to be fired.
- Write down each of the ten ages of the people in the department on separate little (roughly equal size) pieces of scrap paper.
- Mix up the pieces of paper, and, without looking, randomly select three “people” to be fired and calculate the mean of those three ages (add the ages up and divide by 3).

3.2 - Did anyone get a kind of oldish group? If we do get an oldish group, is it because one of us discriminated based on age?


3.3 - With R, we won't need to tear up pieces of paper. We can choose three ages randomly with a click of a button! 

In [None]:
# This code randomly fires 1 person. Modify it to fire 3 people randomly.
sample(everyone, 1)
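
One detail worth knowing: by default, `sample()` draws *without replacement*, just like pulling slips of paper out of a pile, so the same person cannot be picked twice. The sketch below checks this by drawing all ten ages at once. The name `shuffled` is just for illustration.

```r
everyone <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)

# Drawing as many ages as there are employees just shuffles them
shuffled <- sample(everyone, length(everyone))
shuffled

# Sorted, the shuffled ages match the original list exactly,
# so nobody was drawn twice and nobody was left out
sort(shuffled)
```

The age 55 can still show up more than once in a random group, because there really are three different 55-year-olds in the department.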

3.4 - The code below saves a random group of three people into an object called `fired_randomly`. Write some code to calculate the mean age of the three people fired randomly.

In [None]:
fired_randomly <- sample(everyone, 3)

3.5 - What was the mean age? Did anyone in your class end up firing three people with an average age of 58 or older? Was it because R practiced age discrimination?


3.6 - What would happen if you run the code again? Try it a few times. Do you always get the same mean age from firing randomly? How often do we get an average age of 58 or greater from firing randomly?


3.7 - If the probability of getting an average age of 58 or greater turns out to be *small*, does this favor Martin or Westvaco? Why or why not?

## 4.0 - All the Possibilities: A Sampling Distribution

Now we can return to our two possible models of the DGP:

1. *FIRED = AGE + OTHER STUFF*
2. *FIRED = OTHER STUFF*

We have learned how to simulate firing people in a way that has *nothing to do with age*. A model of randomness is just all "other stuff". 

Even though it is hard to write down or even imagine all the possible ways that three of the 10 people could get randomly fired, with R we can simulate doing this many, *many* times. Then we can see **how likely our actual data (e.g., a mean age of 58) would be to occur IF the second (completely random) model were true.**


In [None]:
# Run the code below. How come only one number gets printed? 
# What is being stored into fired_rand?

fired_rand <- mean(sample(everyone, 3))
fired_rand

4.1 - Let's see what happens if you put `do(3) *` before the code `mean()` in the previous cell. What do you think this will do?

In [None]:
fired_rand <- do(3) * mean(sample(everyone, 3))
fired_rand
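
For comparison, base R has a function called `replicate()` that repeats a piece of code in a similar way. This is only a rough sketch of the idea behind `do()`, not how `do()` is actually written. The names `three_means` and `three_means_df` are made up here.

```r
everyone <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)

# Repeat "fire three people at random and take the mean age" three times
three_means <- replicate(3, mean(sample(everyone, 3)))

# Put the three results into a data frame, in a column called mean
three_means_df <- data.frame(mean = three_means)
three_means_df
```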

4.3 - What does each row in this data frame represent?

4.4 - Try modifying the code above to do this 1000 times! Did it work? Is that a lot faster than shuffling pieces of paper a thousand times?

### Graphing a sampling distribution

4.5 - Notice that it's hard to look through 1000 numbers. That's why we will teach you methods of graphing throughout this class. Run the code below. Which part of this visualization represents times when "oldish" groups of people were fired? Which parts represent times when "youngish" groups of people were fired?

In [None]:
# Run this code. This will make a visualization of the 1,000 mean ages of those fired from our last simulation. This is our sampling distribution.
gf_histogram(~ mean, data=fired_rand, binwidth = 1, color = "skyblue")


4.6 - The code below will shade in the random samples that had average ages of 58 or higher. Is that a lot of the random samples? A little? Is the likelihood of generating a sample that old high? Medium? Low?

In [None]:
gf_histogram(~ mean, data=fired_rand, binwidth = 1, color = "white", fill = ~mean >= 58)


### Using the sampling distribution

Let's use the sampling distribution to help us think about our previous question: How likely is it that we would have obtained an average age of 58 or greater for fired employees if we had just fired people randomly?

R can quickly count up how many means were greater than or equal to 58.

In [None]:
# Run this code.
# 4.7 - How many of the 1000 simulated samples had a mean age of 58 or greater?

tally(~ mean >= 58, data=fired_rand)

In [None]:
# Run this code. 
# 4.8 - What proportion of the 1000 simulated samples had a mean age of 58 or greater?

tally(~ mean >= 58, data=fired_rand, format = "proportion")
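
Because this department is small, we can also check the simulation against an exact count. The sketch below uses base R's `combn()` to list every possible group of three employees, then counts how many of those groups have a mean age of 58 or greater. The object names (`all_groups`, `group_means`) are made up for this illustration, and listing every group like this is only practical when the numbers are small.

```r
everyone <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)

# Each column of all_groups is one possible group of three employees
all_groups <- combn(everyone, 3)

# How many different groups of three are possible?
ncol(all_groups)

# Mean age of each possible group
group_means <- colMeans(all_groups)

# Count the groups with a mean age of 58 or greater,
# then turn that count into a proportion of all possible groups
sum(group_means >= 58)
mean(group_means >= 58)
```

If every group of three is equally likely to be picked, the simulated proportion should land close to this exact proportion, though it will rarely match it exactly.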

## 5.0 - The Whole Thing

5.1 - If the probability of getting an average age of 58 or greater turns out to be small, does this favor Martin or Westvaco? Why?

### Distribution Triad as a .gif

<img src="https://i.postimg.cc/fsgBppCw/Westvaco-Dist-Triad-Gif.gif" title="Westvaco scenario analyzed with the distribution triad gif" />

### Distribution Triad as a static image

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_59dv1Bjm-Gy-Ml-Owy-Imgur.png" title="Westvaco scenario analyzed with the distribution triad" />
