# Scatter, Jitter, Box Plots, & Histograms

## Chapter 4.3-4.5 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)

# read in data
set.seed(100)
stroop100 <- sample(read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSUSCmFFNt6jEMyTNGHMW9VxPsJrHoFbGvylhEAHkbonu0BuJBRH48Cgsk4hyKwkRPSQhhqpX8ypIND/pub?gid=569699744&single=true&output=csv"), 100)[,-4] 

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

## 1 Data Collection Activity: The Stroop Task

The Stroop Task originates from a landmark study by John Ridley Stroop in 1935, published in the paper _Studies of Interference in Serial Verbal Reactions._ Stroop was an American psychologist who investigated how conflicting information affects cognitive processing. Variations of the Stroop task are still widely used today by researchers in many subfields of psychology. Today, you are going to be a participant in a classic Stroop task!

**What is the Stroop task?** The classic Stroop task is very simple. The research participant is given a list of words and asked to _name the color of the font each word is printed in_ as fast as they can. How hard could that be?<br><br>

<center><span style="color: red; font-size: 32px;">
RED&nbsp;&nbsp;&nbsp;BLUE
</span></center>

For example, if you were given the list of two words above you would say, "Red, red," because both words are printed in a red font. (If you said "Red, blue" that would be a mistake, though it is a common mistake.) <br><br>

<div class="guided-notes"> 
<h3>1.1 Collect some data and record it in the guided notes.</h3>
Follow the steps below to collect data on the Stroop task.
</div>

**Step 1.** Pair up with a partner. Decide who will be the **timer** and who will be the **reader** for the first set of words (**section 1.2, below**).

**Step 2.** The timer says "Go" and starts the timer. The reader should name the font colors out loud as quickly as possible, row by row, until completing all five rows of words. Record the time in seconds in the data table on your guided notes.

**Step 3.** Switch roles. Repeat Step 2 with the second set of words (**section 1.3, below**).
    
**Step 4.**  Write down times from other classmates until you fill up the table in your guided notes.


### 1.2 The First Set of Words: Congruent
<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_overview_3.1-3.4_stroop_congruent.jpg">


### 1.3 The Second Set of Words: Incongruent
<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_overview_3.1-3.4_stroop_incongruent.jpg">



<div class="discussion-question">
<h3>1.4 Key Discussion Questions: What did you notice in doing the Stroop task? Was your experience different from your partner's?</h3> 
Why do you think the first list is called <i>congruent</i> and the second list <i>incongruent</i>?
</div>

## 2 Does list type explain variation in time?
There are two different types of lists: congruent and incongruent. Half of the class read the colors from the congruent list and half from the incongruent list. Do you think list type would explain variation in time?

<div class="discussion-question">
    <h3>2.1 Key Discussion Question: What is our informal definition of <i>explain variation</i>? </h3>
    
- What does it mean for one variable to <i>explain variation</i> in another? 
- In the current example, what is the outcome variable? What is the explanatory variable?
</div>

<div class="discussion-question">
    <h3>2.2 Key Discussion Question: Do you think list type would explain some of the variation in time? Why or why not?</h3>
</div>

<div class="guided-notes">
    
### 2.3 Write a word equation to represent the hypothesis that `list_type` would explain some (but not all) of the variation in `time_sec`.
    
</div>

## 3. Explore Visually Whether `list_type` Explains Variation in `time_sec` (Scatter Plots)

Now that we have a hypothesis, let's explore whether it seems to be true in a sample of data. We have loaded a data frame called `stroop100`. It contains data from 100 students who completed the same Stroop task you just did.

### 3.1 Run some code in the next cell to see what's in the data frame

In [None]:
# write code here to see what's in the data frame


# write some code to see how many students read each list_type


# write some code to look at the distribution of the outcome variable time_sec


<div class="guided-notes">   
<h3>3.2 Scatter plot</h3>
    Write some R code to generate a scatter plot to explore the word equation <b>time_sec = list_type + other stuff</b>
</div>

In [None]:
# write code for a scatter plot here


<div class="discussion-question">
    <h3>3.3 Key Discussion Question: What do you see in the scatter plot? Do you see support for the hypothesis represented in the word equation? Do you see support for the inclusion of "other stuff" in the word equation?</h3>
</div>

## 4. Visualizing the Relationship with Jitter Plots

One thing that is annoying about scatter plots is that sometimes dots are right on top of each other. This is especially true in cases like this one, where the explanatory variable is categorical with only two levels. In cases like this we might choose a jitter plot. It's like a scatter plot, but adds in some random "jittering" to make it easier to see the individual dots. Instead of `gf_point` we can use `gf_jitter`.

<div class="guided-notes">
    
### 4.1 Jitter plot
    
Look at the `gf_point` code. What might be the R code to produce a jitter plot?

</div>

In [None]:
# write code for a jitter plot here


<div class="discussion-question">
    <h3>4.2 Key Discussion Question: What can you see in the jitter plot? Which way are the dots jittered? Horizontally or vertically? Does it matter?</h3>
</div>

In [None]:
# modify the amount of jitter on this jitter plot
gf_jitter(time_sec ~ list_type,
          data = stroop100)
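
As a rough sketch of how the amount of jitter can be adjusted, `gf_jitter` accepts `width` and `height` arguments. The example below uses a small made-up data frame (the name `toy`, its columns `group` and `score`, and all of its values are hypothetical and are not the Stroop data), so you can still experiment with `stroop100` yourself.

```r
library(coursekata)

# Made-up data for illustration only (NOT the Stroop data)
toy <- data.frame(
  group = rep(c("a", "b"), each = 6),
  score = c(10, 10, 11, 12, 12, 12, 14, 15, 15, 15, 16, 17)
)

# width controls the amount of horizontal jitter;
# height controls the amount of vertical jitter.
# height = 0 leaves every score at its exact value on the y-axis.
jittered <- gf_jitter(score ~ group, data = toy, width = 0.1, height = 0)
jittered
```

Try a few different values for `width` and `height` and see how the plot changes.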

## 5. Visualizing the Relationship with Box Plots
In the previous chapter we learned to make box plots to represent a single distribution. Now let's see if we can use them to represent a relationship such as the one between list type and time.

<div class="guided-notes">
    
### 5.1 Box plots
    
Look at the `gf_point` and `gf_jitter` code. What might be the R code to produce box plots that represent the relationship between `time_sec` and `list_type`?
</div>

In [None]:
# write code for box plots here


<div class="discussion-question">

### 5.2 Key Discussion Question: Can you see the relationship of `list_type` and `time_sec` in the box plot? What can you more easily see in the box plot? In the jitter plot?
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-4.3-4.5-jitter-boxplots.jpg">

<div class="guided-notes">

### 5.3 [on the next page] In the two graphs, label the x- and y-axes. 
    
Try to estimate the medians for the two groups from the graphs.
    
</div>

<div class="discussion-question">

<h3>5.4 Make a prediction: What would you get if you ran this code: <br><code>favstats(time_sec ~ list_type, data = stroop100)</code>?</h3>
    
</div>

In [None]:
# try and see
favstats(time_sec ~ list_type, data = stroop100)
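
One way to check estimates read off a box plot is to pull the `median` column out of the `favstats` output. This is only a sketch on a small made-up data frame (the name `toy`, its columns, and its values are hypothetical and are not the Stroop data).

```r
library(coursekata)

# Made-up data for illustration only (NOT the Stroop data)
toy <- data.frame(
  group = rep(c("a", "b"), each = 6),
  score = c(10, 10, 11, 12, 12, 12, 14, 15, 15, 15, 16, 17)
)

# favstats() returns one row of summary statistics per group
stats <- favstats(score ~ group, data = toy)
stats

# The median column holds one median per group,
# which is what the thick line in each box represents
stats$median
```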

<div class="guided-notes">

### 5.5 Box plots with jitter plot overlaid on top

We can use the `%>%` operator to overlay the jitter plot on top of the box plot. This is a way to get the benefits of both types of plots.

You can think of each line of `gf_` code as a _layer_. If we write the `gf_boxplot` code then add on `gf_jitter` with `%>%`, we are telling R to print out a box plot first, then add the jitter plot on top.
    
</div>
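
The layering idea can be sketched on a small made-up data frame (the name `toy`, its columns `group` and `score`, and its values are hypothetical and are not the Stroop data). The chained `gf_jitter` call is left without a formula here, on the assumption that it picks up the formula and data from the plot it is chained to.

```r
library(coursekata)

# Made-up data for illustration only (NOT the Stroop data)
toy <- data.frame(
  group = rep(c("a", "b"), each = 6),
  score = c(10, 10, 11, 12, 12, 12, 14, 15, 15, 15, 16, 17)
)

# First layer: box plots. Second layer: jittered points drawn on top.
layered <- gf_boxplot(score ~ group, data = toy) %>%
  gf_jitter(width = 0.1, height = 0)
layered
```

Each additional `gf_` function chained with `%>%` adds one more layer to the same plot.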

In [None]:
# overlay a jitter plot on this
gf_boxplot(time_sec ~ list_type, data = stroop100)

<div class="discussion-question">
<h3>5.6 Make a prediction: What do you think would change about the plot if we put the <code>gf_jitter</code> first and the <code>gf_boxplot</code> second?</h3>  
</div>

In [None]:
# try and see
gf_jitter(time_sec ~ list_type, data = stroop100, height = 0) %>%
  gf_boxplot() 

<div class="discussion-question">
    
### 5.7 Key Discussion Questions:
- What percentage of the dots are within the "box" part of the box plots?
- What is represented by the thicker horizontal line in the box plots?
- Are there any outliers in this data? How can you tell?
- What do the box plots tell you about the *distributions* of time_sec? (Think shape, center, spread)
</div>

## 6. Visualizing the Relationship with Faceted Histograms

One more way to visualize a relationship between two variables is with faceted histograms. They are similar to the histograms you have seen before, but this time they are broken up into two separate _facets_ (or faces): one for congruent times and the other for incongruent times.

<div class="guided-notes"> 
    
### 6.1 Faceted histograms
    
Use `%>%` to chain on `gf_facet_grid()` to the code below to create faceted histograms: a histogram for each `list_type`.

You may wish to see what happens when you try `gf_facet_grid(list_type ~ .)` versus `gf_facet_grid(. ~ list_type)`.
</div>

In [None]:
# add gf_facet_grid to this
gf_histogram(~ time_sec, data = stroop100) 

<div class="discussion-question">

### 6.2 Key Discussion Question: What do faceted histograms show you that the other visualizations did not? What's different about them?
    
</div>

<div class="guided-notes"> 
    
### 6.3 Faceted histograms with box plots overlaid on top

Just like we can overlay box plots onto jitter/scatter plots (and vice versa), we can overlay box plots onto faceted histograms. Try figuring out the code for that based on what you know so far.

(If you want to make the box plots thicker, you can use the `width` argument.)
    
</div>

In [None]:
# overlay box plots on this
gf_histogram(~ time_sec, data = stroop100) %>%
  gf_facet_grid(list_type ~ .) 

<div class="discussion-question">

### 6.4 Key Discussion Question: Why are the boxes arranged horizontally when overlaid on the histograms, but vertically when overlaid on the jitter plots?
    
</div>

## 7. Summary: Which visualizations can we use to explore a relationship between an explanatory variable and an outcome variable?

Let's review all we've learned about visualizing the distributions of individual variables and visualizing the relationships between two variables.

<div class="guided-notes">
    
### 7.1 Visualizations for relationships between variables 

Complete the table by identifying the appropriate visualization and R function for different combinations of outcome and explanatory variables. Use the examples provided as a guide and consider whether the variables are quantitative or categorical.

</div>

<div class="guided-notes">
    
### 7.2 Visualizations for a single outcome variable 

Now, complete the last two rows of the table by selecting the best visualizations for displaying the distribution of a single quantitative or categorical variable.

</div>