# Where Data Comes From (COMPLETE)

## Chapter 2.1-2.7 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))


# Table of Contents

1 [Where Data Comes From](#1)

2 [Reading tidy data into R](#2)

3 [Variables and values](#3)

4 [Counting categories](#4)

5 [Practice What You Learned: Explore Our Class Data](#5)

<div class="teacher-note">
    <b>Section Goals:</b> In this section, students learn that data results from the process of identifying features of interest that vary in the world (such as happiness, thumb length, or amount of plastic waste), and then measuring those features in a sample of cases (such as college students or beach sites).
    <ul>
    <li>Students will understand that sampling is the process of selecting cases from a population and that measurement is the process of quantifying features of interest in a sample.</li>
    <li>Students are introduced to the concept of independent random sample, and the idea that a sample can be representative or biased.</li>
    <li>Students will understand that quantitative variables result from numerical measurements that vary across cases (e.g., thumb length), and that categorical variables result from assigning objects to categories (e.g., gender or beach type). </li>
    <li>Students will learn the distinction between a variable and its values (e.g., Thumb versus 61 mm).</li>
    <li>Students will learn that data are typically organized in tables in which rows represent the cases or observations; columns represent the variables; and cells contain the measurements or values assigned to each case for each variable, a structure that may be referred to as “tidy data.” </li> 
    <li>Finally, students will learn that tidy datasets are stored as data frames in R, and will practice viewing and exploring data frames using functions like head(), str(), and glimpse(). They will also learn to view individual variables in a data frame using the $ operator (e.g., Fingers$Thumb).</li>
    </ul>
    A <a href="https://docs.google.com/document/d/1MpbaU1ZFmv7YPcv7LXq6Sh2ORbV9tnEjlcQ5hMaar-M/edit?tab=t.5y2a0ykmi2fk#heading=h.wjaasjj3pg90" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
</div>





<a id =1> </a> 
## 1 Where Data Comes From

Data might eventually end up in a neat table or spreadsheet, but that's not where it starts out. Data starts with variation in the messy world around us. Every dataset starts with a decision: What do we want to measure, and who (or what) should we measure it from?

These two big decisions are called:
- **Sampling**: choosing the cases (e.g., people, animals, products) we want to study
- **Measurement**: deciding what we want to record about each case, and how to do it

Today, our class will act as both researchers and participants. You’ll be the cases, and we’ll measure different features of you by asking questions like: "What’s the worst color?" We'll use our data as an example to help us understand some key ideas about data: 

- What is a "case"? *Each person is one case in our dataset.*
- What is a "variable"? *“Worst color” is one variable we're measuring.*
- What do we mean by a "value"? *Each person’s answer, like “brown” or “neon green,” is a value.*
- And how do we organize all this information into a usable dataset?

### 1.1 Collect some data

A group of students decided to collect data from their classmates on some variables of interest. They were interested in their classmates' responses to seven questions:

- What are your initials?
- How many dogs have you kissed?
- Are you left handed or right handed? (1=left-handed; 2=right-handed)
- How many days do you think you would survive in a zombie apocalypse?
- Do you have a driver's license?
- What is the worst color?
- How extroverted are you (on a 10-point scale, where 10 is the most extroverted)?

To collect the data, they gave each classmate 7 pieces of paper, and labeled each piece with the variable they were trying to measure. Try recording the data for yourself.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/2.1-2.7-student-data.jpg">

<div class="guided-notes">

### 1.2  Fill in your own data on the “slips of paper” (on your guided notes).
</div>

<div class="discussion-question">
    
### 1.3 Key Discussion Question: What was hard about reporting your own data? What kinds of decisions did you have to make?

</div>

<div class="teacher-note">

<b>Teacher note:</b> Give students time to share how hard it is to decide things like "number of dogs kissed": What is the definition of kissed? Over what period of time? What do you do if you can't remember all the dogs you've kissed? Measurement of attributes that seem clear at first can actually be quite complicated in practice. It's important to define each variable well enough that the definition can be applied in a consistent way across each person measured.

</div>


### 1.4 A small dataset

Below, the students have arranged the 7 pieces of paper for each of three students.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/2.1-2.7-data-collection.png" width = 100%>

<div class="discussion-question">
    
### 1.5 Key Discussion Questions: Look at the data from these three students. 
    
- Describe how features such as **handedness** or **zombie survival time** vary across the three people. What makes it hard to make these comparisons? What would make it easier?
- Would it be harder or easier if the data for each student were not on the same row? Why?

</div>

<div class="teacher-note">

<b>Sample responses:</b>

- I have to search to find the same variable in each row.
- The variables are in a different order on each row; it would be easier if they were in the same order.
- If the data from each student were not on the same row you would have no way to know which pieces of data belonged to the same student.

</div>


### 1.6 *Tidy data:* A way to organize data to make it easier to analyze

As you noticed when looking at the data above, some ways of organizing data make analysis easier than others. There are many ways data can be organized, but the most common way, and the one we use in this class, is called **tidy data**. Organizing data this way helps us (and computers) read, explore, and analyze the data more easily.

Tidy data are organized according to these three principles:

| 1. Each row is a case<br>(or observation). | 2. Each column is a variable. | 3. Each cell contains a value<br>for the particular case and variable. |
|:--:|:--:|:--:|
| <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/R9jDj2v6.png" width="229px" alt="Each row is a case"> | <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/HTwvwwvP.png" width="230px" alt="Each column is a variable"> | <img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/qJKtbHzk.png" width="228px" alt="Each cell is a value"> |


<div class="guided-notes">
    
### 1.7 Complete the tidy data principles.    
    
</div>

<div class="guided-notes">

### 1.8 Use the blank table to organize the messy data from the three students as tidy data.
    
<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/2.1-2.7-data-collection.png" width = 100%>
</div>

<a id =2> </a> 
## 2 Reading tidy data into R

Many tools you might be familiar with (e.g., Google Sheets, Excel spreadsheets) are organized in tidy data format. Data represented in tidy data format also usually have a **header row** at the top that tells you what each column is (e.g., the headers for this data might say **initials** or **handedness**).

R uses something similar called a **data frame**—a structure that stores tidy data in rows and columns.

To work with your own data in R, a common workflow is:
- Enter the data into a Google Sheet.
- Export it as a CSV file (short for Comma-Separated Values, a simple text file version of the spreadsheet).
- Use R’s `read.csv()` function to read that CSV into a data frame.
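
As a minimal sketch of the last step, the cell below reads a tiny CSV into a data frame. To stay runnable without an internet connection, it passes the CSV as a string through the `text` argument of `read.csv()` rather than a URL. The three rows and the name `tiny_survey` are made up for illustration; they are not the class data.

```r
# Hypothetical CSV contents: the first line is the header row,
# and each later line is one case (one student).
csv_text <- "initials,dogs_kissed,worst_color
AB,2,brown
CD,0,neon green
EF,5,gray"

# read.csv() turns the header row into the column (variable) names
tiny_survey <- read.csv(text = csv_text)

# Print the resulting data frame
tiny_survey
```

Reading from a URL works the same way: the URL simply replaces the `text = ...` argument as the first argument to `read.csv()`.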

We'll practice this using a Google Sheet that another class already filled in. 

### 2.1 Run the code below. This saves the data frame in R's memory. But how do we look at this data frame?

In [None]:
# edit this
little_survey <- read.csv("https://docs.google.com/spreadsheets/d/1edJ2vEMp80CTrQV7nRv5SC7hUzFG-Oyh12b1ThIew3c/export?format=csv&id=1edJ2vEMp80CTrQV7nRv5SC7hUzFG-Oyh12b1ThIew3c&gid=0")

# sample code
little_survey
head(little_survey)

<div class="discussion-question">
    
### 2.2 Key Discussion Question: What happened when we ran this code?

</div>

<div class="teacher-note">

<b>Teacher notes:</b> 
- Make sure students understand that this code reads the data in the CSV file located at the specified URL on the internet into a data frame in R called <b>little_survey</b>. However, the `read.csv()` line by itself prints nothing. Teach students that they can simply type the name of the data frame to look at it, or use the `head()` function to look at the first 6 rows.
- It's easy to input data from a Google Sheet into R; here are some <a href="https://coursekata.org/preview/book/3cc54c0e-3b8a-4804-9a2e-186c8c2bf28f/lesson/19/3" target="_blank">detailed instructions</a> on how to do that. Essentially, you first make the sheet public; then publish it to the web in CSV format; and finally use the <code>read.csv()</code> function to input the data into R.

</div>


### 2.3 How can we figure out how many cases and how many variables are in the data frame? What are the variable names?

In [None]:
# sample response
little_survey

<div class="teacher-note">

<b>Teacher notes:</b>  

- Let students suggest code for this. Some will suggest printing the entire data frame by typing <code>little_survey</code> and then counting the number of columns and rows.
- Other students may notice a little note like this: <code>A data.frame: 18 × 7</code>. They may figure out that this corresponds to 18 rows (not counting the row of variable names) and 7 columns (i.e., variables). Highlight this 18 × 7 line if no one brings it up.
    
</div>
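
Besides reading the `18 × 7` note, base R can report these counts directly. The sketch below uses a small made-up data frame (`tiny_survey`, with hypothetical values) so that it runs on its own; the same functions would apply to `little_survey`.

```r
# Hypothetical stand-in for little_survey: 3 cases, 3 variables
tiny_survey <- data.frame(
  initials    = c("AB", "CD", "EF"),
  dogs_kissed = c(2, 0, 5),
  worst_color = c("brown", "neon green", "gray")
)

nrow(tiny_survey)   # number of cases (rows)
ncol(tiny_survey)   # number of variables (columns)
dim(tiny_survey)    # rows and columns together
names(tiny_survey)  # the variable names
```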


<div class="guided-notes">

### 2.4 Explain what each line of R code does in your own words.
    
</div>

### 2.5 Let's learn two more functions to show you what's in the data frame
In the code cell below, try running `str()` and `glimpse()` to see what information they provide.

In [None]:
# code here
str(little_survey)
glimpse(little_survey)

<div class="guided-notes">
    
### 2.6 Explain what `str()` and `glimpse()` do in your own words.

How are these functions similar to and different from `head()`?
    
</div>

<div class="discussion-question">

### 2.7 Key Discussion Question: How good is this sample? Should we use it to draw conclusions about how people in general estimate their zombie survival time? Why or why not?
</div>

<div class="teacher-note">

<b>Sample responses:</b>
- It’s just a few people from a class — that’s not enough to say what everyone thinks.
- These are other students, so it’s not a good mix of people (like old, young, different areas in the world).
- People probably don’t know how to estimate this kind of thing anyway. (For example, one person said they would survive 1,000,000 days, which is more than 2,700 years.)

</div>


<div class="guided-notes">
    
### 2.8 What kind of critiques did you come up with: About Sampling, Measurement, or Both?

The critiques you came up with can be grouped into two broad categories:
- **Sampling:** Concerns about the cases in the dataset. Do the cases (the people) in the sample represent the larger population we care about?
- **Measurement:** Concerns about the values in the dataset--how the data was collected or what it tells us. Was the question clear? Did people interpret it the same way? Are the responses meaningful?

In the table are some example critiques that students (like you!) often raise. For each one, decide whether the issue is mostly about sampling, measurement, or both. Then explain your reasoning.
    
| **Student Critiques**                                                                     | **What kind of criticism is this? Sampling, Measurement, or Both? Why? (Explain briefly.)** |
|:-------------------------------------------------------------------------------------------|:--------------------------------------------------------------------|
| "It’s just a few people from a class."                                                     |                                                                    |
| "It’s just students so it doesn't represent the whole age range."                         |                                                                    |
| "One person said they’d survive 1,000,000 days — that seems unrealistic."                 |                                                                    |
| "Students might be overconfident about their survival time."                              |                                                                    |

    
</div>

<a id =3> </a> 
## 3 Variables and values

It's important to distinguish *variables* from *values*. Take a look at the head of `little_survey` (also printed in your guided notes). 
- Each **column** is a variable — something we measured, like how many dogs someone kissed, or what they think is the worst color. 
- Each **cell** contains a **value** — one person’s data for that variable.

In [None]:
# run this
head(little_survey)

<div class="discussion-question">
    
### 3.1 Key Discussion Question: What do you notice about how these 6 students vary in the `dogs_kissed` variable?

</div>

<div class="teacher-note">

<b>Sample responses:</b>
- One person kissed 16 dogs!?
- Most people kissed 0–2 dogs
- One person didn’t answer — maybe they didn't know how many dogs they kissed.
- Not sure if we believe these are accurate. They may not remember some dogs they kissed. What does it mean to "kiss a dog" anyway?

Note that the student (BN) who didn't report how many dogs they had kissed was coded as `NA`. Missing data in R appear as `NA` (not available).

</div>
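
A short sketch of how `NA` behaves, using made-up values rather than the class data: `is.na()` flags which values are missing, and functions such as `sum()` return `NA` unless told to skip missing values with `na.rm = TRUE`.

```r
# Hypothetical dogs_kissed values; the third student did not answer
dogs_kissed <- c(2, 0, NA, 16)

is.na(dogs_kissed)              # TRUE marks the missing value
sum(is.na(dogs_kissed))         # how many values are missing
sum(dogs_kissed)                # NA, because one value is unknown
sum(dogs_kissed, na.rm = TRUE)  # adds up only the reported values
```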

### 3.2 How can we look at just one variable in R?

Sometimes we want to look at just one column — like `dogs_kissed` — instead of the whole dataset.

In [None]:
# write code

# sample code
little_survey$dogs_kissed

<div class="teacher-note">

<b>Teacher note:</b>
- Students may want to return to the discussion of how this variable varies once they see values like 1,000,000 (which raises the question of whether these are believable measurements).
</div>

<div class="guided-notes">
    
### 3.3 Write down the code for looking at just one variable from a data frame.
    
</div>

### 3.4 Why are `dogs_kissed` or `worst_color` called *variables* in the first place? What varies?


<div class="teacher-note">

<b>Teacher note:</b>
- You may want to annotate the guided notes to show that `dogs_kissed` is a **variable** while the numbers people reported are called **values**. 
- Students often learn the word variable (maybe it sounds math-y) but they often don't realize they need a word for *value*. 
   
</div>

### 3.5 What's different about the values of a variable such as `dogs_kissed` versus a variable like `worst_color`?

<div class="teacher-note">

<b>Sample Response:</b>
- One is a set of numbers; the other is a set of words. 
   
</div>

<div class="guided-notes">

### 3.6 Circle all the quantitative variables in the `head(little_survey)` data printed in your notes. 
    
- **Quantitative variables** contain measurements that are **numbers**; the numbers represent quantities that you can add, average, or compare. Someone who gets a 2 on a quantitative variable has twice as much of whatever is being measured as someone who gets a 1.
- **Categorical variables** contain labels that assign each case to a **category or group**. We could represent these categories with numbers (e.g., 1=left-handed, 2=right-handed), but these numbers don't represent quantities (so right-handed is not "twice as much" as left-handed). Values of categorical variables are often just represented as words (e.g., "gray").

If you're not sure about one, mark it with a question mark and be ready to discuss!

</div>


<div class="teacher-note">

<b>Teacher note:</b>  
- Some students may assume anything numeric is quantitative; `handedness` provides a counterexample: even though the values are numbers, they stand for the categories "left" and "right."

<b>Sample responses:</b> 
- Quantitative variables: dogs_kissed, zombie_days, extroverted
- Categorical variables: initials, have_dl, worst_color, handedness

</div>


### 3.7 Does R know whether a variable is quantitative or categorical?

Let’s run `glimpse()` again. In addition to the name of each variable, it also tells us the **type of variable** for each variable in the data frame. (This information is conveyed inside the < > brackets in the output.)

In [None]:
# run this
glimpse(little_survey)

<div class="guided-notes">

### 3.8 Fill in what R says about the type of each variable listed, and whether you think the variable is quantitative or categorical.
    
- `zombie_days`
- `have_dl`
- `extroverted`
- `handedness`

</div>


### 3.9 Write R code to tell R that a variable is categorical (also called a "factor").

Sometimes R categorizes a variable incorrectly. We coded `handedness`, for example, as 1 for left-handed and 2 for right-handed. R called it `<int>` for integer, thinking it's a quantitative variable. It's an understandable mistake; R doesn't know what *handedness* means.

We can correct the mistake by using the `factor()` function. A factor in R is always treated as **categorical**, even if it's made up of numbers. Use the code cell below to create a new version of `handedness` called `hand_factor`, and save it as a new variable in the `little_survey` data frame. Then check the data frame to make sure the new variable is there.


In [None]:
# write code

# sample code
little_survey$hand_factor <- factor(little_survey$handedness)

glimpse(little_survey)
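
As an optional variation, `factor()` can also attach readable labels to the codes through its `levels` and `labels` arguments. The sketch below uses a made-up `handedness` vector (1 = left-handed, 2 = right-handed, as in the survey); the name `hand_labeled` is ours, chosen for illustration.

```r
# Hypothetical handedness codes: 1 = left-handed, 2 = right-handed
handedness <- c(2, 2, 1, 2)

# Convert the codes to a factor and give each level a readable label
hand_labeled <- factor(handedness, levels = c(1, 2), labels = c("left", "right"))

hand_labeled          # values now print as labels instead of codes
levels(hand_labeled)  # the set of categories the factor knows about
```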

### 3.10 Predict what you think will happen if you run each line of code. Then run the code in the cell below (one line at a time).

In [None]:
# try running each line of code one at a time
#sum(little_survey$handedness)
#sum(little_survey$hand_factor)

<div class="discussion-question">
    
### 3.11 Key Discussion Question: Why do you think we get the results we get from these lines of code?

</div>

<div class="teacher-note">

<b>Sample responses:</b> 
- R thinks handedness is a number so it has no problem summing those numbers up.
- R recognizes that hand_factor, although the values look like numbers, is a categorical variable so it refuses to sum those numbers up.

</div>
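
The contrast can be reproduced on a small made-up vector. In this sketch, `tryCatch()` is used only so that the error message is captured and printed instead of stopping the cell.

```r
# Hypothetical handedness codes: 1 = left-handed, 2 = right-handed
handedness  <- c(2, 2, 1, 2)
hand_factor <- factor(handedness)

# Works: R treats the codes as ordinary numbers
sum(handedness)

# Fails: sum() is not defined for factors, so R raises an error.
# tryCatch() returns the error message as text instead of halting.
result <- tryCatch(
  sum(hand_factor),
  error = function(e) conditionMessage(e)
)
result
```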


### 3.12 Quantitative vs. categorical: Different names, same idea

In this class, we’ll mostly use the terms **quantitative** and **categorical** to describe types of variables. But in R (and in statistics more broadly), you’ll see other words used to describe different types of variables. For now, you can just think of all these terms as referencing the same distinction: *quantitative vs. categorical*.

<br><center><table style="margin: 0; font-size: 16px; font-family: Segoe UI, sans-serif; border-collapse: collapse; width: auto;">
  <thead>
    <tr style="background-color:#f2f2f2;">
      <th style="text-align: left; padding: 10px; border: 1px solid #ccc;">Quantitative Variable</th>
      <th style="text-align: left; padding: 10px; border: 1px solid #ccc;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</th>
      <th style="text-align: left; padding: 10px; border: 1px solid #ccc;">Categorical Variable</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Numeric (num)</td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;"> </td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Character (chr)</td>
    </tr>
    <tr style="background-color:#f9f9f9;">
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Integer (int)</td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;"> </td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Factor</td>
    </tr>
    <tr>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Double (dbl)</td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;"> </td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Nominal</td>
    </tr>
    <tr style="background-color:#f9f9f9;">
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Continuous</td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;"> </td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Qualitative</td>
    </tr>
    <tr>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;">Scale</td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;"> </td>
      <td style="text-align: left; padding: 10px; border: 1px solid #ccc;"></td>
    </tr>
  </tbody>
</table></center>
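
To see which of these R type names applies to each column, `class()` can be applied to every variable with `sapply()`. This is a sketch on a made-up data frame (hypothetical values and column choices), not the class data.

```r
# Hypothetical mini data frame with one column of each common type
tiny_survey <- data.frame(
  initials    = c("AB", "CD", "EF"),      # character
  dogs_kissed = c(2L, 0L, 5L),            # integer
  extroverted = c(7.5, 3, 10),            # numeric (double)
  hand_factor = factor(c(2, 1, 2)),       # factor
  stringsAsFactors = FALSE                # keep text as character in older R versions
)

# Report the class of every column
sapply(tiny_survey, class)
```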


<a id =4> </a> 
## 4 Counting categories

### 4.1 How do we count how many people in this data set have a driver's license?

We can use the `tally()` function to count the number of rows that fall into each category of a categorical variable. 

In [None]:
# write code

# sample code
tally(~have_dl, data = little_survey)

<div class="discussion-question">

### 4.2 We found that 6 students in `little_survey` have driver's licenses. Can we add that information to the `little_survey` data frame? Why or why not?
    
</div>


<div class="teacher-note">

<b>Teacher note:</b> Some students will suggest that a new row could be added. Ask them how this would or would not violate the three principles of tidy data. In this case, the new row would not represent another case of the same thing: it wouldn't be a person, but a row describing the whole dataset. In tidy data, "Each row should represent one case **of the same kind**."

</div>


<div class="guided-notes">

### 4.3 Practice using `tally()` to count other categories in the data.
For each variable, write the code and report the findings. 

</div>

In [None]:
# code here

# sample code
tally(~handedness, data = little_survey)
tally(~worst_color, data = little_survey)
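
For comparison, base R's `table()` produces the same kind of counts as `tally()`, and a logical comparison inside `sum()` counts a single category. The values below are made up, including the assumption that `have_dl` is recorded as "yes"/"no"; this is a sketch, not the class data.

```r
# Hypothetical data: handedness coded 1 = left-handed, 2 = right-handed
tiny_survey <- data.frame(
  have_dl    = c("yes", "no", "yes", "yes"),
  handedness = c(2, 1, 2, 2)
)

# Count the rows in each category
table(tiny_survey$have_dl)

# Count just one category: how many cases are left-handed?
sum(tiny_survey$handedness == 1)
```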

<a id =5> </a> 
## 5 Practice What You Learned: Explore Our Class Data

Let's read our own class's data into an R data frame and see what we find. 

Use this shared Google Sheet to add your own row: **[Teacher Shares Link]**

Once it is complete, you can use this CSV link: **[Teacher Shares Link]**


### 5.1 How many cases are in the data set? How many variables?

### 5.2 How many quantitative variables should there be? 

### 5.3 How many categorical variables should there be? 

### 5.4 Are there any variables that should be one type but R sees them as a different type? What's going on? What can we do to fix those?

### 5.5 Try creating the `hand_factor` variable. Use R code to tally up how many left-handed people are in our class.