# Exploring Distributions with Histograms 

## Chapter 3.1-3.4 Overview Notebook

In [None]:
# run this to set up the notebook
suppressMessages(library(coursekata))

# read in Census at School data set
census <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRvwncL5kF1-Y5nLEj7JeN6FzXQB2_CcelCHTLq-Ow05u4Fbsgii2PRByRIv7atjk_IrPVt92I4mPyG/pub?output=csv")

# keep only the variables used in this notebook
census <- select(census, region, gender, age, time_standing_on_left_leg, height, travel_time_to_school, bag_weight, left_foot_length, sleep_time)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))


<div class="teacher-note">
    <b>Section Goals:</b> In this chapter students are introduced to the concept of distribution, defined as the pattern of variation in a variable. This abstract idea is first made concrete by making and interpreting histograms of quantitative data. Students will learn to create histograms with R, and will understand that:
    <ul>
        <li>Histograms are constructed by grouping data points into bins, which are equal-sized intervals along a quantitative measurement scale, then representing the number or proportion of data points falling within each bin by the height of the bars. Changing the bin size will affect how the data points are grouped into bars on a histogram, such that a larger bin size will result in fewer bars with more data points in each bin. </li>
        <li>Histograms can be used to identify key features of a distribution, namely, shape, center, spread, and “weird things” (e.g., outliers or gaps); and that while distributions have these features, individual data points do not. Varying the bin sizes in a histogram will afford different insights into the distribution of the data.</li>
    </ul>
    A <a href="https://docs.google.com/document/d/1NdvgmHintsgMQnTo5_jwT8Wv3PUkV1hcj3N9iFm-eyM/edit?tab=t.5y2a0ykmi2fk#heading=h.wjaasjj3pg90" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
</div>

## 1. Data Collection Activity: Time Standing on Left Leg

Close your eyes. Now, stand on your left leg for as long as you can. Have your partner measure your time in seconds and write down the number on the data collection sheet. Do it just once: we want to see how long you can go without practicing. When you are done, switch places and record time for your partner. Enter that into the data collection sheet. 

**Who cares how long you can stand on one leg?** It turns out, the amount of time you can balance on one leg declines with age, and in older adults it is actually a predictor of longevity!  

**World record:** Ram Phal, 50, balanced on his right leg for 2 hr 21 min, improving upon his previous time by 20 min. Ram has broken this record three times before – once in 2021 and twice in 2023. “That's why I work on this same record to check my own potential as well as try to set a benchmark as high as possible.” (<a href="https://www.guinnessworldrecords.com/news/2024/8/blindfolded-indian-man-balances-on-one-leg-for-record-breaking-time">Guinness Book of World Records</a>, August 2024.)

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/stand-one-leg.png" width = 100%>

<div class="guided-notes">
    
### 1.1 Record your times on the guided notes
</div>


## 2. Looking for Patterns of Variation
Below is the data from a high school class of students in New Zealand. 

**Raw Data:** <br><br>
<span style="font-size: 26px;">15, 30, 5, 21, 25, 50, 18, 11, 17, 10, 19, 39, 20, 30, 4, 36, 12, 19, 47, 20</span>

<div class="discussion-question">
    <h3>2.1 Key Discussion Questions: What patterns do you see in the variation across these 20 students? </h3>
    <ul>
        <li>Describe characteristics of the <i>distribution</i> of standing times. </li>
        <li>What makes this a <i>distribution</i> as opposed to just a list of numbers? </li>
        <li>Now let's look at the 20 data points in order. Does that help you see any patterns more clearly?</li>
    </ul>
</div>


<div class="teacher-note">
    <b>Teacher note:</b> A distribution is not just a list of numbers. The numbers must represent measures of something - and all the same thing - that varies in the world. Students should begin to notice that characteristics of distributions include shape, center, spread, and weird things. Just looking at numbers out of order, they may notice the range (i.e., minimum and maximum). When you sort the numbers in order, it's easier to see more patterns. When students represent the numbers in a histogram (below), it will be even easier for them to see patterns. <i>(Tip: Scroll the notebook so students only see the unsorted distribution first, then the sorted one.)</i>
</div>

**Sorted Data:** <br><br>
<span style="font-size: 26px;">4, 5, 10, 11, 12, 15, 17, 18, 19, 19, 20, 20, 21, 25, 30, 30, 36, 39, 47, 50</span>
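
For teachers who want to show the sorting step in R rather than on the slide, here is a minimal sketch. The names `raw_times`, `sorted_times`, and `time_range` are made up for this illustration and are not used elsewhere in the notebook.

```r
# the 20 standing times (in seconds) in the order they were collected
raw_times <- c(15, 30, 5, 21, 25, 50, 18, 11, 17, 10,
               19, 39, 20, 30, 4, 36, 12, 19, 47, 20)

# sort() returns the same values ordered from smallest to largest
sorted_times <- sort(raw_times)
sorted_times

# range() returns the minimum and the maximum
time_range <- range(raw_times)
time_range
```

Sorting does not change the data. It only makes features such as the minimum, the maximum, and repeated values easier to spot.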

## 3. Representing a Distribution with a Histogram
A histogram is a type of graph that helps you see characteristics of the distribution of a variable. On the x-axis is the scale on which the variable is measured - in this case, the number of seconds each person could stand on their left foot with their eyes closed. The x-axis is divided into equal-sized *bins* along the measurement scale. The y-axis shows how many data points fall within each bin.



<table style="width: 100%; border-collapse: collapse; font-size: 1.2em; text-align: center; ">
    <tr>
        <th style="text-align: center;">binwidth = 2</th>
        <th style="text-align: center;">binwidth = 10</th>
    </tr>
    <tr>
        <td style="padding: 10px;">
            <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.1_graph_paperx10.png">
        </td>
        <td style="padding: 10px;">
            <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/3.1_graph_paperx10.png">
        </td>
    </tr>
</table><br>
<span style="font-size: 26px; display: block; text-align: center;">Data: 4, 5, 10, 11, 12, 15, 17, 18, 19, 19, 20, 20, 21, 25, 30, 30, 36, 39, 47, 50</span>

<div class="guided-notes">
    <h3>3.1 Draw a histogram with binwidth = 2 on the graph on the left</h3>
    <ul>
        <li>What is the lower boundary? In this case we will start it at 0. </li>
        <li>What is the range of your data? Is there room on the x-axis to go from 0 to 50?</li>
        <li>What is the binwidth? In this case, it's 2.</li>
        <li>To make our histogram, shade the bars to represent the count of scores in each bin.</li>
    </ul>
</div>

<div class="teacher-note"><b>Teacher Note:</b>         
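Students may wonder whether a score of 10 should go in the 0-10 bar or in the 10-20 bar. The convention used for the hand-drawn histograms is that the left endpoint of a bin is <i>inclusive</i>, so a score of 10 would go in the 10-20 bar. (ggplot2, which draws the <code>gf_histogram()</code> plots later in this notebook, includes the right endpoint by default, so its bar heights may differ slightly from the hand-drawn ones.)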
</div>
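
To check the hand-drawn tallies under the left-inclusive convention, one option is to count the scores per bin with base R. This is only a sketch: `times`, `edges`, and `bin_counts` are illustrative names, and the bin edges are assumed to run from 0 to 60 in steps of 10.

```r
# the 20 sorted standing times, in seconds
times <- c(4, 5, 10, 11, 12, 15, 17, 18, 19, 19,
           20, 20, 21, 25, 30, 30, 36, 39, 47, 50)

# bin edges every 10 seconds, starting at the lower boundary of 0
edges <- seq(0, 60, by = 10)

# right = FALSE makes each bin include its left endpoint,
# so a score of 10 is counted in the bin that starts at 10
bin_counts <- table(cut(times, breaks = edges, right = FALSE))
bin_counts
```

Changing `by = 10` to `by = 2` gives the tallies for the binwidth = 2 histogram.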

<div class="guided-notes">
    <h3>3.2 Draw a histogram with binwidth = 10 on the graph on the right</h3>
</div>

<div class="discussion-question">
<h3>3.3 Key Discussion Questions: Interpreting histograms. </h3>

- What does the height of the tallest bin (or bar) represent? 
- What does the height of the shortest bin (or bar) represent? 
- What is represented by the left-most bar of the histogram? 
- What is represented by the right-most bar of the histogram? 
- If there are any gaps between bars of a histogram, what does that mean?
    
</div>

<div class="teacher-note">

<b>Teacher Note:</b>         

You may want to cold call through this or allow students to break into pairs to answer these questions.

<b>Sample answers:</b>
- tallest bins: most frequent values of time standing on one leg
- shortest bins: least frequent values of time standing on one leg
- left-most bar: the lowest values of time standing on one leg (briefest times)
- right-most bar: the highest values of time standing on one leg (longest times)
- gaps: no one stood on their left leg for that amount of time (no values in that bin)
    
</div>

<div class="discussion-question">
    <h3>3.4 Compare the two histograms. How does each highlight different features of the distribution? </h3>
    Are there features you can see in the histograms that were harder to see in just the list of numbers?
</div>


<div class="teacher-note">

<b>Sample answers:</b>
- The histogram with a binwidth of 2 has more gaps where there aren’t any values, but the one with a binwidth of 10 looks smoother.
- The binwidth = 10 histogram combines more values together, so you can’t see as much detail.
- The shape looks different depending on the binwidth. Like, the smaller bins show more ups and downs, but the bigger bins make it look more like one general shape.
- In histograms (versus a list of numbers), it’s easier to see where most of the values are, e.g., between 10 and 20 or 10 and 30.
- In histograms, we can see that there are a lot more short times and only a few long times, which was harder to notice just looking at the numbers.
    
</div>

## 4. Making Histograms with R

<div class="discussion-question"><h3>4.1 Make a Prediction: Here's the code to put the 20 data points into a vector called <code>time_standing_on_left_leg</code>. What will happen when you run this code?</h3>
</div>

In [None]:
time_standing_on_left_leg <- c(4, 5, 10, 11, 12, 15, 17, 18, 19, 19, 
                               20, 20, 21, 25, 30, 30, 36, 39, 47, 50)


<div class="teacher-note">

<b>Sample answers:</b>
- Students might predict that it would print these numbers out. But it won't.
- This only saves these values as a vector (called `time_standing_on_left_leg`) in R's memory.
    
</div>
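
If students ask how to see what was saved, a short follow-up like the one below may help. It is a sketch that repeats the assignment so it can run on its own; `n_values` is an illustrative name.

```r
# save the 20 values into a vector (nothing is printed by this line)
time_standing_on_left_leg <- c(4, 5, 10, 11, 12, 15, 17, 18, 19, 19,
                               20, 20, 21, 25, 30, 30, 36, 39, 47, 50)

# typing the name on its own line prints the stored values
time_standing_on_left_leg

# length() reports how many values are stored in the vector
n_values <- length(time_standing_on_left_leg)
n_values
```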

<div class="guided-notes">
    
### 4.2 Create a histogram of `time_standing_on_left_leg` with `binwidth = 2`
    
</div>

In [None]:
# just run
gf_histogram(~ time_standing_on_left_leg, boundary=0, binwidth=2) 

<div class="guided-notes">

### 4.3 Create a histogram of `time_standing_on_left_leg` with `binwidth = 10`
    
</div>

In [None]:
# code here
gf_histogram(~ time_standing_on_left_leg, boundary=0, binwidth=10) 

### 4.4 Try out different binwidths and boundaries to explore how they change the histogram

- What if you leave out `boundary` and `binwidth`? What happens?
- How would you modify `binwidth` to make fewer bins? Why does that work?

<div class="teacher-note">

<b>Teacher note:</b>
- R's defaults are to use the range of the data to find the lower and upper boundaries of the histogram. It also defaults to creating 30 bins.
- Then take this as a starting point `gf_histogram(~ time_standing_on_left_leg, boundary = 0, binwidth = 10)` and modify it to make fewer bins (by increasing the binwidth). 
    
</div>

In [None]:
# modify this
gf_histogram(~ time_standing_on_left_leg, boundary = 0, binwidth = 10)

# default histogram
gf_histogram(~ time_standing_on_left_leg)

# to make fewer bins
gf_histogram(~ time_standing_on_left_leg, boundary = 0, binwidth = 25)
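
The arithmetic behind "a larger binwidth makes fewer bins" can be sketched as follows. This is a rough count of how many bins are needed to cover 0 to 50 seconds, not the exact rule ggplot2 applies, and `upper`, `binwidths`, and `n_bins` are illustrative names.

```r
# the x-axis needs to cover 0 to 50 seconds
upper <- 50

# the three binwidths tried in this section
binwidths <- c(2, 10, 25)

# number of bins needed to span 0 to 50 at each binwidth
n_bins <- ceiling(upper / binwidths)
n_bins
```

Because the span of the x-axis stays the same, dividing it into wider pieces leaves fewer pieces.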

## 5. Applying What You've Learned to a Larger Dataset
Our sample of 20 data points is useful for learning about how histograms work. But the real value of histograms is in seeing patterns in larger datasets with numerous values. Let’s apply what we've learned to a much larger dataset of 754 New Zealand high school students. **The dataset is called `census` and is already loaded into this notebook.**

### 5.1 Check out the contents of the `census` dataset

In [None]:
# sample response
str(census)

<div class="guided-notes">
    
### 5.2 Write the code to make a histogram of `time_standing_on_left_leg` in the `census` dataset
   
Try using the `boundary` and `binwidth` settings you used with the smaller dataset. Then try experimenting with different values for these arguments.
    
</div> 

In [None]:
# type this for students
gf_histogram(~ time_standing_on_left_leg, data = census)
gf_histogram(~ time_standing_on_left_leg, data = census, boundary = 0, binwidth = 100)

<div class="teacher-note">

<b>Teacher note:</b>
- Students may wonder about the `Removed 27 rows...` warning. This is a warning about missing data.
- You may want to start with the default code and then add on arguments like `boundary` and `binwidth`
    
</div>
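
To show students where a "removed rows" warning comes from, you could count the missing values directly. The sketch below uses a tiny made-up data frame, `toy_census`, so that it runs on its own. With the notebook's data loaded, the same call on `census$time_standing_on_left_leg` would count the rows the histogram drops.

```r
# a made-up data frame with two missing standing times
toy_census <- data.frame(
  time_standing_on_left_leg = c(12, NA, 30, 45, NA)
)

# is.na() flags missing values; sum() counts how many are flagged
n_missing <- sum(is.na(toy_census$time_standing_on_left_leg))
n_missing
```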

<div class="discussion-question">
<h3>5.3 Key Discussion Question: How does this histogram differ from the ones you made of the smaller dataset?</h3>
</div>

<div class="teacher-note">

<b>Sample responses:</b>
- Sometimes students may point out visual features such as skinnier bars.
- Students might say how this is "skewed"; most of the data is between 0 and 100 on the x-axis but a few go all the way up to 600 seconds (that's 10 minutes of standing on one leg!). With more students in the data set, we seem to have captured a few people who are extremely good at standing on one leg.
- Some might point out that the code is a little different -- that one of them includes `data = census`.
    
</div>

<div class="discussion-question"><h3>5.4 Make a Prediction: If you increase the binwidth, will the numbers on the y-axis get smaller or larger? Explain your prediction.</h3>
</div>

<div class="teacher-note">

<b>Sample responses:</b>
- You might try going from binwidth = 10 to binwidth = 100. 
- This makes the numbers on the y-axis go up a lot because there are more values grouped into each bin now.
    
</div>
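
The same prediction can be checked on the 20-student sample without drawing anything. This is a sketch under stated assumptions: `tallest_bar()` is a hypothetical helper written for this illustration (it is not part of coursekata or ggformula), and it uses left-inclusive bins starting at 0.

```r
# the 20 standing times, in seconds
times <- c(4, 5, 10, 11, 12, 15, 17, 18, 19, 19,
           20, 20, 21, 25, 30, 30, 36, 39, 47, 50)

# hypothetical helper: height of the tallest bar for a given binwidth
tallest_bar <- function(x, binwidth) {
  # bin edges start at 0 and extend one bin past the largest value
  edges <- seq(0, max(x) + binwidth, by = binwidth)
  # count the values in each left-inclusive bin and return the largest count
  max(table(cut(x, breaks = edges, right = FALSE)))
}

narrow <- tallest_bar(times, binwidth = 2)
wide <- tallest_bar(times, binwidth = 10)
c(narrow = narrow, wide = wide)
```

With wider bins, each bar collects more data points, so the largest count on the y-axis grows.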

### 5.5 The `bins` argument
We have been using the `binwidth` argument to control the width of the bins. We also could use the `bins` argument. Instead of specifying the width of the bins, the `bins` argument specifies the number of bins to use in the histogram.

<div class="discussion-question">
<h3>5.6 Make a Prediction: If we increase the number of bins, what will happen to the binwidth? Why?</h3>
</div>

In [None]:
# try modifying the bins argument
gf_histogram(~ time_standing_on_left_leg, data = census, boundary = 0, bins = 5)


<div class="guided-notes">
<h3>5.7 What is a bin? What is a bin width? How do they relate to one another?</h3>
</div>

<div class="teacher-note">

<b>Sample responses:</b>
- `bins` sets how many bins there are
- `binwidth` sets how wide (or skinny) each bin is
- more bins means a smaller binwidth; larger binwidth means fewer bins (an inverse relationship)
    
</div>
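
The inverse relationship can be illustrated with rough arithmetic. The sketch assumes the standing times span about 0 to 600 seconds, and it approximates the binwidth as the span divided by the number of bins, which may not match ggplot2's exact calculation. The names `data_span`, `bins`, and `approx_binwidth` are illustrative.

```r
# assumed span of the standing times, in seconds
data_span <- 600

# a few choices for the number of bins
bins <- c(5, 10, 30)

# approximate width of each bin: the span split into equal pieces
approx_binwidth <- data_span / bins
approx_binwidth
```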

## 6. Using Histograms to See Features of Distributions

Now that you know how histograms work, let's use them to explore some distributions. When looking at distributions, we tend to look at four features: **shape, center, spread, and weird things**.

In [None]:
# run this code
gf_histogram(~ time_standing_on_left_leg, data=census, boundary = 0, binwidth = 10)

<div class="guided-notes">
<h3>6.1 Describe the shape, center, spread, and weird things in the distribution above</h3>
</div>

<div class="teacher-note">

<b>Sample responses:</b>
- shape: skewed (lots of data in one place and a few stragglers in the "tail")
- center: maybe less than 100
- spread: the range goes from 0 to 600; most of the times are less than 100
    
</div>
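
As an optional extension, a few numbers can back up what students see in a histogram. The sketch below uses the 20-student sample so that it runs on its own, and `center_median`, `center_mean`, and `spread_range` are illustrative names. For the `census` column, which contains missing values, the same functions would need `na.rm = TRUE`.

```r
# the 20 standing times, in seconds
times <- c(4, 5, 10, 11, 12, 15, 17, 18, 19, 19,
           20, 20, 21, 25, 30, 30, 36, 39, 47, 50)

# center: the median is the middle of the sorted values
center_median <- median(times)

# center: the mean is pulled toward the long tail
center_mean <- mean(times)

# spread: the minimum and maximum
spread_range <- range(times)

c(median = center_median, mean = center_mean)
spread_range
```

A mean that sits above the median is one sign of a tail stretching to the right, which is consistent with the skewed shape described above.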

## 7. Practice What You Learned

<div class="teacher-note">

<b>Teacher Note:</b>

Once students have followed along and filled in their guided notes, this section gives them a chance to write code and explanations on their own. You may want to have students go to **Kernel → Restart & Run All** to ensure the notebook has run all the code cells above. 
    
</div>

### 7.1 To get familiar with the data, start by inspecting the data frame `census` with functions like `head()`, `str()`, or `glimpse()`. 

In [None]:
# take a look at the census data frame


Let's use histograms to look at some of the other quantitative variables in the `census` dataset and describe their distributions.

Here are the variables in the `census` dataset:
- **region**: Region of the country students live in
- **gender**: Gender of the student
- **age**: Age in years
- **time_standing_on_left_leg**: Time standing on left leg, in seconds
- **height**: Height in centimeters
- **travel_time_to_school**: How long does it usually take you to get to school (to nearest minute)
- **bag_weight**: What is the weight of your school bag today (to nearest tenth of a kilogram)
- **left_foot_length**: Length of left foot in centimeters
- **sleep_time**: How much sleep did you get last night (to nearest half hour)


### 7.2 Pick another quantitative variable (other than `time_standing_on_left_leg`) and make a histogram in the code cell below. Then describe the shape, center, spread, and weird things.

In [None]:
# make a histogram here


Describe the shape, center, spread, and weird things.


### 7.3 Pick another quantitative variable (other than `time_standing_on_left_leg` and the one you picked for 7.2) and make a histogram in the code cell below. Then describe the shape, center, spread, and weird things.

In [None]:
# make a histogram here


Describe the shape, center, spread, and weird things.


### 7.4 Pick a categorical variable and try making a histogram in the code cell below. What happens? Why?

In [None]:
# try making a histogram with a categorical variable


What happens? Why?
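
One way to frame the discussion is to check whether a variable is stored as numbers, since a histogram needs a numeric measurement scale to divide into bins. The sketch uses a made-up data frame, `toy_students`, with invented values so that it runs on its own. The same checks can be run on columns of `census`.

```r
# a made-up data frame with one categorical and one quantitative variable
toy_students <- data.frame(
  region = c("North", "South", "North"),
  height = c(160, 172, 168)
)

# is.numeric() is TRUE for quantitative variables stored as numbers
height_is_numeric <- is.numeric(toy_students$height)

# and FALSE for categories stored as text
region_is_numeric <- is.numeric(toy_students$region)

c(height = height_is_numeric, region = region_is_numeric)
```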


### 7.5 What is different about making a histogram from a vector (like we did for the 20 students from New Zealand) and making a histogram from a data frame (like we did from the `census` data frame)?