# Understanding How Randomness Can Generate Patterns of Data

## Chapter 4.10-4.13 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)

# read in small and large version of data
set.seed(10)
stroop6 <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSHzwppRht6ukb9_6-r7_xcCXKhe5OMYH9Oj-ORyEa5sbI-zsrBIvUHXsyEpe4-qmMQZIS6wk6G_GEO/pub?gid=333713556&single=true&output=csv")
stroop50 <- sample(read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRMQaRUJQLA3NSwtetd-4IJWCJf4MwyxQtfyS2giAwHoYTuH-e9lRlLGg6lrO6xRt4TQQ9SozWNr5Mz/pub?output=csv"), 50)[, c(1,4,2,3)]

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

## 1 Explaining Variation vs. Causing Variation

We’ve been thinking about how one variable can *explain variation* in another. We've defined *explain* informally: if knowing an observation's score on one variable helps you make a better guess about its score on a second variable, we can say that the first variable *explains* some of the variation in the second.

But just because a variable *explains* some of the variation in an outcome variable, it doesn't mean that it *causes* the variation in the outcome variable. 

Here's an example to illustrate the distinction:
#### Example 1: In months where cities have a higher number of **ice cream sales**, they also have a higher number of **drownings**. 

Here's the word equation to represent this relationship:

**drownings = ice cream sales + other stuff**


<div class="guided-notes">

### 1.1 We’ve filled in the word equation. In the second column, write Y/N/Maybe: do ice cream sales *explain variation* in drownings? In the third column, write Y/N/Maybe: do ice cream sales *cause* drownings?

</div>


<div class="discussion-question">
    <h3>1.2 Key Discussion Question: Do you think ice cream sales <i>cause</i> drownings? (Should we ban ice cream sales in an effort to curb drownings?) Can you think of a reason why ice cream sales might <i>explain</i> some of the variation in drownings?</h3>
</div>

### 1.3 Some more examples

Let's look at some more examples of patterns found in data. In each example below, consider: does the explanatory variable *explain* variation in an outcome variable? Is the relationship merely *explanatory*, or is it *causal*?

#### Example 2: Kids who wear bigger shoes tend to score higher on math tests. 

- Does shoe size cause higher math ability? (Would giving first graders bigger shoes make them better at math?)
    
#### Example 3: Public health officials track flu virus particles in sewage to predict how many flu hospitalizations they will see the next week. 

- Does virus concentration in sewage cause hospitalizations? (If we zapped all the virus particles from sewage, would flu hospitalizations decrease?)
    
#### Example 4: Students who spend a lot of time alone tend to have more anxiety.

- Does alone time cause anxiety? (If students spent less time alone, would they have less anxiety?)

<div class="guided-notes">
    
### 1.4 For each example, fill out the first three columns of the table in the guided notes (word equation, explains variation?, causes variation?).

</div>

<div class="discussion-question">
    
### 1.5 *Key Discussion Question:* How is <i>explaining variation</i> different from  <i>causing variation</i>?
    
</div>

### 1.6 Challenges to inferring causality

Just because one variable *explains variation* in another doesn’t mean that the variable *causes variation* in the other. There are two other reasons one variable might *explain* variation in an outcome but not *cause* the variation:

1. **Directionality:** Maybe the outcome variable *causes* variation in the explanatory variable rather than the other way around. While the explanatory variable *explains* variation in the outcome, it does not cause it. 
2. **Confounding:** There may be a third variable - called a *confound* or *lurking variable* - that causes variation in *both* the explanatory and outcome variables. The two variables appear related, but only because they share a common cause.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/4.10-4.13-directionality-confounding.jpg" alt="causes variation, directionality, confounding">
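
To make confounding concrete, here is a minimal simulation sketch in base R. Every number in it is made up for illustration: a hypothetical `temperature` variable drives both `ice_cream_sales` and `drownings`, and neither of those two variables appears in the formula for the other. The sketch assumes 600 hypothetical city-months and simple linear relationships with random noise.

```r
set.seed(3)
n <- 600  # hypothetical number of city-months

# Hypothetical confound: average monthly temperature (degrees C)
temperature <- runif(n, min = 0, max = 35)

# Both variables depend on temperature plus independent noise;
# neither one is used to generate the other
ice_cream_sales <- 100 + 10 * temperature + rnorm(n, sd = 40)
drownings       <- 2 + 0.3 * temperature + rnorm(n, sd = 2)

# Across all months, the two variables are clearly related
cor_all <- cor(ice_cream_sales, drownings)

# Looking only at months with similar temperatures, the relationship mostly disappears
similar <- temperature > 20 & temperature < 25
cor_similar <- cor(ice_cream_sales[similar], drownings[similar])

cor_all
cor_similar
```

In this made-up world, ice cream sales *explain* variation in drownings even though banning ice cream would change nothing, because the shared cause (temperature) produces the pattern.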



<div class="guided-notes">
    
### 1.7 For each example in the table, fill in the *Possible Confound* column.
There is a relationship between the two variables *in the data.* If the explanatory variable doesn't *cause* the relationship, what might be the source of the relationship? Propose a possible confounding variable that might explain the relationship.

</div>

## 2 Using Experiments to Overcome Directionality and Confounding

If we want to know whether one variable *causes* changes in another, we can't just observe the relationship in the world (this is known as a *correlational* research design). We must, instead, intervene in the world, *create* variation in the explanatory variable, and then observe the outcome. In short, we must use an *experimental* research design.

One way to do this is by randomly assigning research units (e.g., people, mice, cities, etc.) to different values of the explanatory variable (often referred to as different *conditions*). Random assignment creates groups that are comparable—even if individuals are very different. Any differences between groups at the point of random assignment are due only to chance. Then, if we intervene in some way with one of the groups, we have stronger evidence that any subsequent difference on the outcome variable was caused by the explanatory variable.
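
As a rough illustration of random assignment, the sketch below uses base R and made-up data. The `students` data frame and its `baseline_speed` column are hypothetical: they stand in for any characteristic students bring with them before the experiment starts.

```r
set.seed(1)

# Hypothetical data: 20 students with a made-up baseline reading speed (seconds)
students <- data.frame(
  id = 1:20,
  baseline_speed = round(rnorm(20, mean = 30, sd = 5), 1)
)

# Random assignment: put 10 labels of each condition in random order
students$condition <- sample(rep(c("congruent", "incongruent"), each = 10))

# Each condition gets 10 students
table(students$condition)

# Group means on the pre-existing characteristic; any gap here is due to chance
tapply(students$baseline_speed, students$condition, mean)
```

Changing the seed and re-running gives a different assignment each time. The two group means are usually close but rarely identical.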

### 2.1 Review: The Stroop Task

In a previous notebook, we analyzed data from the Stroop Task—a classic psychology experiment used to study how conflicting information affects thinking.

**What’s the task?** You're shown a list of color words and asked to name the font color of each word as quickly as possible—not read the word itself. Sounds simple, right?

<br> <span style="color: red; font-size: 32px;"> RED&nbsp;&nbsp;&nbsp;BLUE </span> <br><br>

For the example above, you should say, "Red, red"—naming the font color, not the word.

In our data, students were randomly assigned to one of two lists:
- **Congruent:** Word and font color match (e.g., "RED" in red)
- **Incongruent:** Word and font color mismatch (e.g., "RED" in green)

<table>
  <tr>
    <th style="text-align: center; border: 1px solid black;">Congruent Words</th>
    <th style="text-align: center; border: 1px solid black;">Incongruent Words</th>
  </tr>
  <tr>
    <td style="text-align: center; border: 1px solid black;">
      <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_overview_3.1-3.4_stroop_congruent.jpg">
    </td>
    <td style="text-align: center; border: 1px solid black;">
      <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_overview_3.1-3.4_stroop_incongruent.jpg">
    </td>
  </tr>
</table>


We measured how long it took each student to read their list (`time_sec`) and compared the average times by `list_type` (congruent vs. incongruent).

#### Research question: Does incongruence (`list_type`) cause people to be slower at naming font colors?



<div class="discussion-question">

### 2.2 Key Discussion Question: Students were randomly assigned to read either the congruent or incongruent list. Before they even start reading, should we expect the two groups to be different in things like reading speed, motivation, or attention span? Why or why not?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/4.10-4.13-random-assignment.jpg" width=80%>

<div class="discussion-question">
    
### 2.3 Key Discussion Question: If there is a difference in `time_sec` across the two list types, can we assume that list type caused the difference in reading times? Why or why not?
    
</div>

<div class="guided-notes">
    
### 2.4 Write It Down: Why are experiments so powerful?
    
Write down a summary of our discussion in your own words. What is accomplished by randomly assigning students to different list types?

</div>

## 3 Inferring Causality in Random-Assignment Experiments 

Random assignment is great! But even with random assignment, some differences can happen just by chance.

To understand this, imagine a counterfactual: **What if all students had read the same list?** We would still expect small differences in average reading time across the two randomly assigned groups—just because of random variation. One group might, by luck, be a little faster. (Maybe a few more faster readers got assigned to that group.)

So when we see a difference between the groups (e.g., congruent group is faster than incongruent group), there are three possible explanations for this difference:

1. **It was caused by the explanatory variable:** The explanatory variable actually causes the outcome to be different.
2. **It was caused by random sampling variation:** The difference happened just by chance.
3. **It was caused by a combination of explanatory variable and randomness.**
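
The counterfactual above can be sketched as a small simulation. This is only an illustration in base R with made-up reading times, under the assumption that all 50 students read the same list, so list type cannot cause any difference. The helper `one_split_diff()` is a hypothetical name introduced here.

```r
set.seed(2)

# Hypothetical reading times (seconds) for 50 students who all read the SAME list
time_sec <- rnorm(50, mean = 30, sd = 5)

# Randomly split the students into two equal groups and
# return the difference between the two group means
one_split_diff <- function(times) {
  group <- sample(rep(c("A", "B"), each = length(times) / 2))
  mean(times[group == "A"]) - mean(times[group == "B"])
}

# Repeat the random split 1000 times
diffs <- replicate(1000, one_split_diff(time_sec))

# The differences center near 0, but individual splits are rarely exactly 0
summary(diffs)
```

Even with no effect built in, some random splits produce a noticeable gap between the groups. That is the chance variation described in explanation 2.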


### 3.1 Write code to take a look at the `stroop6` data frame. Is one group reading faster than the other?

The `stroop6` data frame contains data from 6 students, 3 in each condition. In the code cell below, write code to see what's in the data frame.

In [None]:
# look at the contents of stroop6



<div class="guided-notes">

### 3.2 What are three possible reasons we might see a difference in average reading time between the congruent and incongruent groups in this data?

</div>


## 4 Using `shuffle()` to ask: How big could the difference be and still be caused by chance alone?

In `stroop6`, we observed that students in the congruent group read faster than those in the incongruent group. This might be because list type caused the difference, or it might be due to random chance.

By shuffling the data with the `shuffle()` function in R, we can simulate what kinds of differences we'd expect to see if list type had no effect at all and any differences were just due to chance.

#### This lets us ask: How big could the difference be, just by chance?

If our observed difference is bigger than most of the differences we see by chance, that’s evidence that list type might also matter.

### 4.1 Shuffle: Hands-On Activity

Let's start with the small `stroop6` data frame. Download and [print out your own copies here](https://docs.google.com/document/d/1E2OZbwwZmJdsHdaW7A6dCBwwJH5rhW9QLCG_2eHGJio/edit?usp=sharing).

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-4.10-4.13-shuffle-v2.png" width=60%>

We’re going to shuffle the outcome variable (`time_sec`). If we shuffle `time_sec`, each specific value will now be randomly put into a `list_type` group. 

#### Why are we doing this?

This is not the same as random assignment. Instead, shuffling helps us see what kinds of group differences might appear from randomness alone!

**Step 1:** Get a [printed version of the dataset](https://docs.google.com/document/d/1E2OZbwwZmJdsHdaW7A6dCBwwJH5rhW9QLCG_2eHGJio/edit?usp=sharing) and some scissors.

**Step 2:** Cut out the right-hand column (shaded gray), which is an extra copy of all the `time_sec` values. Then cut out each individual value.

**Step 3:** Shuffle the values and assign them randomly back into the data frame, placing each value in the `time_sec_shuffled` column. 

<div class="discussion-question">

### 4.2 What happens when we shuffle?
    
- Did anyone’s shuffled groups end up with the `congruent` group reading faster than the `incongruent`? Did anyone see the opposite pattern (`incongruent` faster than `congruent`)?
- If you did find that `congruent` times were faster than `incongruent`, why do you think this is?
- If we repeated this random process many times, what do you think the typical difference between groups would look like?
- How large of a difference between the two groups do you think could be created just by a purely random process (like shuffle)?
    
</div>


### 4.3 Using R to simulate random shuffles

In our paper activity, we saw that randomly shuffling `time_sec` could lead to a variety of outcomes, all explainable just by chance. In some cases, congruent times were faster; in others, incongruent times were faster. Let's now simulate this process in R.

We will first `shuffle()` the values of `time_sec` (`stroop6$time_sec`), and then save the shuffled values as a new column in the data frame (`time_sec_shuffled`).
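
For reference, here is a minimal sketch of the same idea on a made-up data frame. It assumes base R's `sample()` as a stand-in for `shuffle()`, since `sample()` with no size argument returns a random reordering of its input. The `toy` data frame and its values are hypothetical, not the real `stroop6` data.

```r
set.seed(4)

# Hypothetical stand-in for a small data frame (made-up times)
toy <- data.frame(
  list_type = rep(c("congruent", "incongruent"), each = 3),
  time_sec  = c(8.1, 9.4, 7.7, 12.3, 14.0, 11.6)
)

# Randomly reorder the outcome values and save them as a new column
toy$time_sec_shuffled <- sample(toy$time_sec)

# list_type and time_sec are unchanged; only the new column is reordered
toy
```

The shuffled column contains exactly the same six values as the original column, just matched to different rows.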

In [None]:
# build up the code here



# print out the resulting stroop6
stroop6

<div class="discussion-question">

### 4.4 Discussion Questions: What happens each time we run this code?
    
- Does it replicate what you did with paper?
- What changes each time you run it? 
- What stays the same?
    
</div>


<div class="guided-notes">

### 4.5 Scaling up: Write R code to simulate a random data generating process with <code>stroop50</code>

Now that we've shuffled `time_sec` in a small dataset, let's apply the same idea to our full dataset. Modify the code below so that it shuffles `time_sec` for all 50 rows in `stroop50`.
    
`stroop6$time_sec_shuffled <- shuffle(stroop6$time_sec)`
    
</div>

In [None]:
# write code here



# print out stroop50
stroop50

# run this a few times; what happens each time?

## 5 Visualizing the Original vs. Shuffled Data in a Jitter Plot

We have shuffled all 50 rows, but it’s hard to tell just by looking at the data frame whether one group is generally faster or slower. Instead of scanning numbers, let’s visualize the data with a jitter plot. We will start with visualizing the original data, then visualize the shuffled data, and then compare the two.

<div class="guided-notes">
    
### 5.1 Write the hypothesis that `list_type` would explain variation in `time_sec` as a word equation
</div>

<div class="guided-notes">
    
### 5.2 Write R code to make a jitter plot to explore the hypothesis in the `stroop50` data frame

</div>

In [None]:
# write code


<div class="discussion-question">

### 5.3 Does the jitter plot support the hypothesis?
    
</div>


### 5.4 Now let's make a jitter plot of the shuffled data.
    
There are two ways of doing this. We can use the saved shuffled variable (`time_sec_shuffled`). But we also could use a shortcut. Look at both methods in the cell below.

In [None]:
# using saved variable


# shortcut


<div class="guided-notes">
    
### 5.5 Write R code to make a jitter plot with the shuffled data.

</div>

<div class="discussion-question">

### 5.6 Key Discussion Question: What do you notice about the jitter plot each time you run the shuffle?
    
</div>


In [None]:
# run this code to create a shuffled jitter plot
gf_jitter(shuffle(time_sec) ~ list_type, data = stroop50, width = .1, height = 0)

<div class="discussion-question">

### 5.7 Key Discussion Question: How does the actual data differ from the shuffled data? 

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-4.10-4.13-grid-of-shuffled-jitter-plots-stroop50.jpg">

<div class="discussion-question">

### 5.8  Could a random DGP have caused the patterns we saw in the real data?

- Could a random DGP like `shuffle()` produce a result that looks like our real data?
- Do most of the shuffled results look like our real data?
    
</div>

## 6 Exploring another variable: Does `vision` explain Stroop task time?

We saw that while `shuffle()` could sometimes create patterns similar to the real `list_type` data, most of the shuffled data looked very different from the real data. This suggests that while randomness alone is a *possible* explanation, it's not a particularly good one.
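
One way to put a number on "most of the shuffled data" is to shuffle many times and count how often chance alone produces a group difference as large as the one observed. The sketch below is only an illustration in base R: the `toy` data are made up with an effect deliberately built in, `sample()` stands in for `shuffle()`, and `mean_diff()` is a hypothetical helper introduced here.

```r
set.seed(5)

# Hypothetical data with an effect built in (not the real stroop50 values)
toy <- data.frame(
  list_type = rep(c("congruent", "incongruent"), each = 25),
  time_sec  = c(rnorm(25, mean = 25, sd = 4), rnorm(25, mean = 32, sd = 4))
)

# Difference in group means: incongruent minus congruent
mean_diff <- function(times, groups) {
  mean(times[groups == "incongruent"]) - mean(times[groups == "congruent"])
}

# The difference in the (made-up) actual data
observed_diff <- mean_diff(toy$time_sec, toy$list_type)

# Shuffle the outcome 1000 times and record the difference each time
shuffled_diffs <- replicate(1000, mean_diff(sample(toy$time_sec), toy$list_type))

# Proportion of shuffles with a difference at least as large as the observed one
prop_as_large <- mean(abs(shuffled_diffs) >= abs(observed_diff))

observed_diff
prop_as_large
```

When that proportion is small, randomness alone is a possible but unconvincing explanation for the observed difference.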

Now, let’s test another hypothesis: In one version of the Stroop experiment, half the students were asked to do this task with only one eye open. The others were allowed to keep both eyes open. Does `vision` (whether a person did the task with one or two eyes open) change reading speed in the Stroop task?

<div class="guided-notes">
    
### 6.1 Write this hypothesis as a word equation.
    
</div>

In [None]:
# run this
head(stroop50)

<div class="guided-notes">
    
### 6.2 Write the R code to make a jitter plot to explore the hypothesis

</div>

In [None]:
# write code; if you want to get squares like the guided notes, use shape = 15


<div class="discussion-question">

### 6.3 Does vision seem to explain variation in reading time?
    
</div>

<div class="guided-notes">
    
### 6.4 Write R code to make a jitter plot of shuffled data

</div>

In [None]:
# write code


<div class="discussion-question">

### 6.5 Key Discussion Question: What is the difference between the original and shuffled data for `vision`?
    
Is the difference between the original and shuffled data as strong as it was for `list_type` (original versus shuffled)?
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-4.10-4.13-grid-of-shuffled-jitter-vision.jpg" alt="grid of original and shuffled jitter plots">

<div class="discussion-question">

### 6.6 Could a random DGP have caused the patterns we saw in the actual data?

- Could a random DGP like `shuffle()` produce a result that looks like our real data?
- Do most of the shuffled results look like our real data?

</div>

## 7 Is a random data generating process responsible for our real data?

We’ve now looked at two different explanatory variables:

1. `list_type` (congruent vs. incongruent): The real data looked very different from the shuffled data.
2. `vision` (one vs. two eyes): The real data looked similar to the shuffled data.

<div class="guided-notes">
    
### 7.1 Which real data looks like it came from a purely random DGP: `list_type` or `vision`? How does `shuffle()` help us decide whether a pattern in actual data could have been generated by randomness alone?
    
</div>