# Boxplots and the Five-Number Summary

## Chapter 3.5-3.9 Overview Notebook

In [None]:
# run this to set up the notebook
suppressMessages(library(coursekata))

# read in a subset of the New Zealand Census at School data set
nz_census <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRvwncL5kF1-Y5nLEj7JeN6FzXQB2_CcelCHTLq-Ow05u4Fbsgii2PRByRIv7atjk_IrPVt92I4mPyG/pub?output=csv") %>%
  select(region, gender, age, time_standing_on_left_leg, height, travel_time_to_school, bag_weight, left_foot_length, sleep_time) %>%
  filter(region %in% c("Auckland", "Bay of Plenty", "Canterbury","Otago")) %>%
  rename("time_standing" = "time_standing_on_left_leg")

region <- c("Auckland", "Bay of Plenty", "Canterbury","Otago")

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

## 1. Remember the `time_standing` Variable?

We've looked at this variable before (we called it `time_standing_on_left_leg`, but we've just shortened the name). Remember Ram Phal? He holds the world record for standing on one leg blindfolded, which he did for 2 hours and 21 minutes! (<a href="https://www.guinnessworldrecords.com/news/2024/8/blindfolded-indian-man-balances-on-one-leg-for-record-breaking-time">Guinness World Records</a>, August 2024.) 

**Who cares how long you can stand on one leg?** It turns out, the amount of time you can balance on one leg declines with age, and in older adults it is actually a predictor of longevity! <br><br>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/stand-one-leg.png" >

A distribution shows the pattern of variation in a variable—for example, the number of seconds different people can stand on one leg (not just one person’s time).

**So far in Chapter 3, we’ve explored three ways to view a distribution:**
- A list of values (unsorted)
- A list of values (sorted)
- A histogram

**Today, we’ll explore a few more:**
- The five-number summary
- A box plot
- A bar graph

## 2. The Five-Number Summary

### Starting with a Small Data Set  

To introduce the five-number summary, we will start with a set of data from **20 New Zealand high school students**. The data represent how many seconds each student was able to stand on their left leg with their eyes closed.  

### Run the code in the cell below to load the 20 data points in a vector called `time_standing`

In [None]:
# run this code to input the data into R
time_standing <- c(15, 30, 5, 21, 25, 50, 18, 11, 17, 10, 
                   19, 39, 20, 30, 4, 36, 12, 19, 47, 20)

### 2.1 Review Some Ways to View the Distribution  

Before we introduce new ways of summarizing the data, let’s review three ways we have learned to look at a distribution in R:
- Display the list of data points **unsorted**  
- Display the list of data points **sorted**  
- Create a **histogram** of the data points

<div class="guided-notes">
    <h3>2.2 Write the R code to examine the distribution of <code>time_standing</code> in three ways</h3> 
    As an <b>unsorted list</b>, a <b>sorted list</b>, and as a <b>histogram</b>.
</div>

In [None]:
# write code to view the distribution unsorted (as is)


# write code to view the distribution sorted


# write code to view the distribution as a histogram


### 2.3 What is the Five-Number Summary?  

The **five-number summary** describes key points in a distribution:  

- **Q0 (Minimum)** – The smallest value  
- **Q1**  
- **Q2 (Median)** – The middle value of the distribution  
- **Q3**  
- **Q4 (Maximum)** – The largest value  

The `favstats()` function in R is an easy way to get the five-number summary. Let's get the five-number summary for `time_standing`, and then see what each of the five numbers means. 

<div class="guided-notes">
    <h3>2.4 Write the R code to calculate the five-number summary for <code>time_standing</code></h3> 
</div>

In [None]:
# code here


<div class="guided-notes">
    <h3>2.5 Find the minimum and maximum</h3> 
    Circle the five numbers that constitute the <i>five-number summary</i> in the favstats output. Then, draw two dots on the x-axis of the histogram to show where the min and max are.
</div> 

## 3. The Median
The median is the middle value in the distribution. You can find the median by putting the values in order, then finding the middle value. If there are an odd number of values, the median is the middle value. If there are an even number of values, the median is the average of the middle two numbers. 

<p align="center" style="text-align: center;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/1sbWX4tc.png" width=95% alt="The two halves of data points have ovals drawn around them to indicate that they are groups; Min, Median, and Max are drawn as lines" /></p>

<div class="guided-notes">
    <h3>3.1 Calculate the median</h3>
    Use the 20 data points for time_standing to calculate the median yourself. Then, compare it to the median from favstats. Draw a dot on the x-axis of the histogram to show where the median is.
</div> 

**Here are the data for `time_standing` sorted in order:** <br><br>

<span style="font-size: 26px; padding: 5px; display: block; text-align: center; margin: 10px auto; width: fit-content;">
  4, 5, 10, 11, 12, 15, 17, 18, 19, 19, 20, 20, 21, 25, 30, 30, 36, 39, 47, 50
</span>
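
The procedure described in this section (sort the values, then take the middle value or average the two middle values) can be sketched in base R. The helper name `manual_median` is made up for this illustration and is not part of any package; R's built-in `median()` does the same job, so the two results should agree.

```r
# the same 20 times loaded earlier in the notebook
time_standing <- c(15, 30, 5, 21, 25, 50, 18, 11, 17, 10,
                   19, 39, 20, 30, 4, 36, 12, 19, 47, 20)

# manual_median is a hypothetical helper written only for this example
manual_median <- function(x) {
  sorted <- sort(x)            # put the values in order
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]        # odd count: the single middle value
  } else {
    mean(sorted[c(n / 2, n / 2 + 1)])  # even count: average the two middle values
  }
}

manual_median(time_standing)   # median worked out by hand
median(time_standing)          # R's built-in median, for comparison
```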

## 4. Q1 and Q3
We've looked at min, max, and median. The remaining two numbers in the *five-number summary* are labeled Q1 and Q3. **The easiest way to think of Q1 is as the middle of the lower half of the distribution. And Q3 is the middle of the upper half.** 

<p align="center" style="text-align: center;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/p2BbKK5t.png" width=95% alt="Q0 through Q4 drawn to cut the dot plot into four equal groups of data points" /></p>

There are many different ways to calculate Q1 and Q3 because there are many ways to interpret the word "middle". For this reason, a manual calculation (for example, finding the median of the lower half of the scores) may not match the value R reports for Q1. R actually gives you nine different options for calculating Q1 and Q3! The `favstats()` function uses R's default method (R refers to this method as `type=7`); in general we will use that one.
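
To make the mismatch concrete, here is a minimal base R sketch comparing the "median of each half" approach with `quantile()` using `type = 7`. The variable names (`lower_half`, `q1_halves`, and so on) are chosen just for this example, and splitting the sorted data at positions 10 and 11 assumes exactly 20 data points.

```r
# the same 20 times loaded earlier in the notebook
time_standing <- c(15, 30, 5, 21, 25, 50, 18, 11, 17, 10,
                   19, 39, 20, 30, 4, 36, 12, 19, 47, 20)

sorted_times <- sort(time_standing)

# "middle of each half" approach: split the 20 sorted values into two groups of 10
lower_half <- sorted_times[1:10]
upper_half <- sorted_times[11:20]
q1_halves <- median(lower_half)
q3_halves <- median(upper_half)

# R's default quantile method (type = 7)
q_type7 <- quantile(time_standing, probs = c(0.25, 0.75), type = 7)

q1_halves
q3_halves
q_type7
```

For these data the two approaches happen to agree on Q3 but give slightly different values for Q1. Neither is wrong; they are different definitions of "middle".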

<div class="guided-notes">
    <h3>4.1 Label Q1 and Q3 in your histogram</h3>
    Draw two dots on the x-axis of the histogram to show where Q1 and Q3 are.
</div> 

In [None]:
# run this to see what it might look like 
# to align the histogram with the five number summary
time_favstats <- favstats(time_standing)

gf_histogram(~ time_standing) %>%
  gf_point(0 ~ time_favstats$min) %>%
  gf_point(0 ~ time_favstats$Q1) %>%
  gf_point(0 ~ time_favstats$median) %>%
  gf_point(0 ~ time_favstats$Q3) %>%
  gf_point(0 ~ time_favstats$max)

Together, the three middle cutpoints - Q1, Q2 (the median), and Q3 - divide the distribution into four parts, each containing the same number of scores. These four parts are called *quartiles*. 

<br>
<div class="guided-notes">
    <h3>4.2 Shade the 1st quartile in your histogram</h3> 
    Shade the quartile that contains the lowest 25% of values. How many data points are in that quartile? 
</div>

In [None]:
# run this to see what it might look like to 
# shade the 1st quartile differently
time_favstats <- favstats(time_standing)

gf_histogram(~ time_standing, boundary = 0, binwidth = 2,
            fill = ~(time_standing < time_favstats$Q1),
            show.legend = FALSE) 

<div class="discussion-question">
<h3>4.3 Key Discussion Question: Why are there 5 numbers in the five-number summary but there are only 4 quartiles?</h3>
</div>

## 5. Boxplots
Boxplots are a handy way to visualize a distribution because they visually represent the five-number summary. 

<div class="guided-notes">
    <h3>5.1 Write the R code for viewing a distribution as a boxplot</h3>
     Note that the R code to make a box plot is very similar to the code for making a histogram.
</div> 

In [None]:
# write code to create a boxplot



<div class="guided-notes">
    <h3>5.2 Draw lines from each part of the five-number summary to the corresponding point where it is represented in the boxplot</h3>
</div> 

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/boxplot-time-standing-2.png" width = 60%>

<div class="discussion-question">
<h3>5.3 Key Discussion Question: How many observations are inside the left green rectangle? How many in the right rectangle?</h3>
</div>

### 5.4 New Distributions: `student1` and `student2`

Two friends, student1 and student2, each tried balancing on their left leg 20 times with their eyes closed and recorded their times. Run the code in the cell below to save their data into two vectors, called `student1` and `student2`. 

In [None]:
# Run this code to save the two students' times 
student1 <- c(189, 184.8, 198.1, 185.2, 192.1, 176.1, 182.3, 192.9, 190.1, 190, 172, 158.1, 161.7, 190.8, 194.7, 183.1, 155, 187.4, 160.1, 167.5)
student2 <- c(11,15,2,15,8,24,18,7,10,10,28,42,38,9,5,17,45,13,40,32)

print("Why don't these vectors print out?")

<div class="guided-notes">
    <h3>5.5 Write the R code to create a box plot for each student</h3>
</div> 

In [None]:
# Code here



<div class="discussion-question">
<h3>5.6 Key Discussion Question: We've put the two boxplots side by side to make them easier to compare (below). How would you describe the two distributions?</h3>
</div><br>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-2boxplots-student1-student2.jpg" alt="student1 and student2's boxplots side-by-side" width = 100%>

<div class="discussion-question">
<h3>5.7 <i>Key Discussion Question:</i> Which histogram goes with this box plot? How can you tell?</h3></div> <br>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-grid3graphs.jpg" width = 80%>

### 5.8 Let's find out which histogram belongs to student1 by overlaying the boxplot on a histogram
Run the code cell below to create a histogram of `student1`. Then add onto the code to overlay a boxplot on the histogram.

In [None]:
# add onto this code
gf_histogram(~ student1, fill = "gray15") 

## 6. Range and Inter-Quartile Range (IQR)
Now that you understand the five-number summary, we can introduce two useful measures of the *spread* of a distribution. The median tells you where, roughly, the middle of the distribution is along the measurement scale. **Range** and **Inter-Quartile Range (IQR)** tell you how spread out the distribution is (or how much variation there is around the median).
- The **range** is the distance from Q0 to Q4 (i.e., from the min to the max). It tells you the full range on the measurement scale that is covered by the data points.
- The **IQR** is the distance from Q1 to Q3. It tells you the range within which the middle 50% of the data points lie.
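
Both measures are simple subtractions, as this base R sketch for `student1` shows. It assumes R's default quantile method (`type = 7`), and the names `data_range`, `iqr_value`, and `middle_share` are made up for the example.

```r
# student1's 20 balancing times from section 5.4
student1 <- c(189, 184.8, 198.1, 185.2, 192.1, 176.1, 182.3, 192.9, 190.1, 190,
              172, 158.1, 161.7, 190.8, 194.7, 183.1, 155, 187.4, 160.1, 167.5)

# range: distance from the min (Q0) to the max (Q4)
data_range <- max(student1) - min(student1)

# IQR: distance from Q1 to Q3 (default quantile method)
q1 <- quantile(student1, 0.25, names = FALSE)
q3 <- quantile(student1, 0.75, names = FALSE)
iqr_value <- q3 - q1

# share of data points that fall between Q1 and Q3
middle_share <- mean(student1 >= q1 & student1 <= q3)

data_range
iqr_value
middle_share
```

Base R also has a built-in `IQR()` function that should return the same value as `iqr_value`.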

<div class="guided-notes">
    
### 6.1 On the box plot for student1, draw a bracket or a line to indicate the IQR
    
Label it as the IQR.
    
</div>

In [None]:
# run this code
favstats(~student1)
gf_boxplot(~student1)

## 7. Outliers

Sometimes on a boxplot you will see a data point (or more than one) that lies beyond the end of the whiskers. Run the code below to see an example, from a distribution we'll call `student3`. 

In [None]:
# Run this to see an example of an outlier
student3 <- c(189, 184.8, 178.1, 185.2, 192.1, 176.1, 182.3, 192.9, 190.1, 190, 172, 128.1, 161.7, 190.8, 194.7, 183.1, 155, 187.4, 160.1, 167.5)
gf_boxplot(~ student3)

### 7.1 What is an outlier?

Typically, the ends of the two whiskers mark the min and max of the distribution. But in this box plot, there’s a point beyond the left whisker—this is actually the smallest value in the dataset. R categorizes this data point as an **outlier**.

Outliers are data points that lie far from the rest of the distribution. They might come from a different Data Generating Process (DGP) than the rest of the data. For this reason, sometimes researchers will decide to omit outliers from their analyses.

### 7.2 How R defines an outlier
There is no single way to define what counts as an outlier, but R's boxplots use this rule: **If a data point is more than 1.5 times the IQR below Q1, or more than 1.5 times the IQR above Q3, it is considered an outlier.**
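
As a rough sketch of how the 1.5 × IQR rule can be applied, the base R code below computes the two cutoffs for `student3` and picks out any values beyond them. It assumes R's default quantile method (`type = 7`) for Q1 and Q3; the names `lower_fence` and `upper_fence` are labels chosen for this example, not functions or terms from the course.

```r
# student3's 20 balancing times from the start of section 7
student3 <- c(189, 184.8, 178.1, 185.2, 192.1, 176.1, 182.3, 192.9, 190.1, 190,
              172, 128.1, 161.7, 190.8, 194.7, 183.1, 155, 187.4, 160.1, 167.5)

q1 <- quantile(student3, 0.25, names = FALSE)
q3 <- quantile(student3, 0.75, names = FALSE)
iqr_value <- q3 - q1

# cutoffs: 1.5 IQRs below Q1 and 1.5 IQRs above Q3
lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value

# data points beyond either cutoff
outliers <- student3[student3 < lower_fence | student3 > upper_fence]

lower_fence
upper_fence
outliers
```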

<div class="guided-notes">

### 7.3 Estimate the IQR for student3's data based on favstats. According to R, how high or low would a student3 score need to be to count as an outlier?
</div> 

## 8. Exploring Distributions in the `nz_census` Data Frame

So far, we’ve explored distributions of individual vectors. Now, we’ll shift to working with distributions in data frames, where variables are stored as columns within a dataset.

The `nz_census` data frame, which has been pre-loaded into this notebook, includes data from a subset of New Zealand high school students (n=472). Here are the variables in the dataset:

- **region**: Region of the country students live in
- **gender**: What gender are students
- **age**: Age in years
- **time_standing**: How long in seconds they were able to stand on left leg with eyes closed
- **height**: Height in centimeters
- **travel_time_to_school**: How long does it usually take you to get to school (to nearest minute)
- **bag_weight**: What is the weight of your school bag today (to nearest tenth of a kilogram)
- **left_foot_length**: Length of left foot in centimeters
- **sleep_time**: How much sleep did you get last night (to nearest half hour)

### 8.1 We have looked at distributions by: 
- printing values in order, 
- using histograms, 
- using the five-number summary,  
- using box plots

<div class="guided-notes">
    
### 8.2 Write R code to do each of these things for a variable in a data frame
The table shows how we perform the same types of analyses on a single vector versus a variable in a data frame. Some of the entries are already filled in. Complete the missing pieces by writing the equivalent R code for the empty cells.
</div>

### 8.3 Exploring Distributions of Categorical Variables

Everything we've learned so far has been with quantitative outcome variables (e.g., `time_standing`). Can we use the same tools for exploring the distributions of categorical variables? Let's find out. And if we can't, what tools can we use?
    
### 8.4 Exploring the distribution of `region` in the `nz_census` data frame
Let's consider the variable `region` in the `nz_census` data frame. What makes `region` a categorical variable? Running the code below may help you compare `region` to a variable like `time_standing`.

In [None]:
# run this
head(select(nz_census, region, time_standing))

<div class="discussion-question">
    <h3>8.5 <i>Key Discussion Questions:</i> Can we make a histogram of <code>region</code>?</h3>

Why or why not? If we can't make a histogram, what kind of visualization can we make?
</div>

In [None]:
# try making a histogram of region


# try making a bar graph of region



<div class="discussion-question">
    <h3>8.6 <i>Key Discussion Question:</i> How can we describe the distribution of <code>region</code>?</h3>
        Are the concepts of shape, center, and spread defined the same way as in a histogram?
</div>

<div class="guided-notes">

### 8.7 *Key Discussion Question:* Compare `gf_bar` and `gf_props` and `gf_percents`
    
What's different about these visualizations?
    
</div>
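
One way to think about the comparison: the three functions draw the same bars but put different quantities on the y-axis (counts, proportions, or percentages). The base R sketch below computes those three quantities for a small made-up vector, `regions_example`, which is invented for this illustration and is not the real `nz_census` data.

```r
# a small made-up sample of regions (hypothetical, not from nz_census)
regions_example <- c("Auckland", "Auckland", "Otago", "Canterbury",
                     "Auckland", "Bay of Plenty", "Otago", "Canterbury")

counts <- table(regions_example)   # how many in each category
props <- prop.table(counts)        # counts divided by the total
percents <- 100 * props            # proportions expressed out of 100

counts
props
percents
```

The ordering and relative heights of the categories are the same in all three tables; only the scale changes.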

In [None]:
# try replacing gf_bar with the other functions
gf_bar(~region, data=nz_census) 


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-bar-props-percents.jpg" width = 80%>

## 9. Practice What You Learned
Look at the other variables in the `nz_census` data frame. Practice using what you've learned: boxplots, `favstats`, `tally`, `gf_histogram`, and `gf_bar`.

In [None]:
# run this to see what variables are available
head(nz_census)

### 9.1 See what you can learn by visualizing these variables

In [None]:
# code here


### 9.2 Reflect on what you've learned about these variables
Use the cell below to write your observations.