# Categorical, and More-Than-One, Explanatory Variables 

## Chapter 4.6-4.9 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)

# read in the full data once, then keep a subset of columns for epic_char
epic_characters <- read.csv("https://docs.google.com/spreadsheets/d/1lIQJUqAmZTgBnuRw4DfR8_S-K_moYJH5gsnCEnMZ_BI/export?format=csv&gid=808955994")
epic_char <- epic_characters %>%
  select(char_name, uni_name, gender, chill_quant, chill_cat, loveable_quant, loveable_cat)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

## 1 The `epic_char` Data Frame

The `epic_char` data frame contains data on 81 fictional characters from various epic universes (including the Marvel Cinematic Universe, Star Wars, Dark Knight, Harry Potter, and Lord of the Rings). Each character was rated on a number of character traits by more than 3 million volunteers as part of the [Open Psychometrics Project](https://openpsychometrics.org/tests/characters/). 

For example, volunteers were presented with the character Darth Vader (from *Star Wars*) and asked to indicate, using a slider, where Darth Vader would fall on a 100-point scale from punchable (coded as 0) to loveable (100). The average ratings on a subset of these dimensions are saved in the `epic_char` data frame, which includes 7 variables.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-rate-characters.jpg" alt="example of how people rated a character with a slider" width = 50%>

### The Variables

- `char_name` The character's name.	
- `uni_name` The universe name of the book, game, movie, or TV show.
- `gender` The gender of the character (M=Male, F=Female).
- `chill_quant` Average rating of how chill (vs offended) the character is on a scale from 0 (offended) to 100 (chill).
- `chill_cat` Coded as *offended* if the character had a score less than 50 on `chill_quant` and *chill* otherwise.
- `loveable_quant` Average rating of how loveable (vs punchable) the character is on a scale from 0 (punchable) to 100 (loveable).
- `loveable_cat` Coded as *punchable* if the character had a score less than 50 on `loveable_quant` and *loveable* otherwise.
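
The coding rule for the two `_cat` variables can be sketched in base R as follows. This is only an illustration of the rule described above: the scores are hypothetical values made up for the example (they are not taken from `epic_char`), and `example_scores` and `example_cat` are illustrative names.

```r
# Hypothetical ratings, made up for illustration (not from epic_char)
example_scores <- c(12.5, 49.9, 50, 87.3)

# Scores below 50 are coded "offended"; scores of 50 or more are coded "chill"
example_cat <- ifelse(example_scores < 50, "offended", "chill")
example_cat
```

The same pattern, with the labels swapped for *punchable* and *loveable*, would apply to `loveable_quant`.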

#### Data Source: 

The complete data set from which `epic_char` was drawn was made available by Tanya Shapiro as a [Tidy Tuesday data set](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-08-16).

### 1.1 An excerpt from the `epic_char` data frame is below (and also available in your guide notes). What are the rows? The columns? The values?

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-4.6-4.9-data-table.png" alt="excerpt from the epic_char data frame" width = 90%>

<div class="discussion-question">
<h3>1.2 Discussion Question: Do you agree with how each character was rated? Why or why not? </h3>
</div>

<div class="guided-notes">
    
### 1.3 Fill in the missing values in the data frame
    
Use the `chill_quant` and `loveable_quant` values to fill in the missing `chill_cat` and `loveable_cat` columns.
    
</div>

<div class="guided-notes">
    
### 1.4 Circle/highlight the names of all the *categorical* variables

How can you tell a variable is categorical?

</div>

## 2 Exploring a Hypothesis: Characters Who Are More Chill Tend to Be More Loveable
We will start by using the quantitative versions of the two variables: `chill_quant` and `loveable_quant`.

<div class="guided-notes">

### 2.1 Write a word equation to represent the hypothesis
    
</div>

word equation: 

<div class="guided-notes">

### 2.2 Write R code to visualize the distribution of the outcome variable
What do you notice about the distribution of this variable?
 
</div>

In [None]:
# code here



<div class="guided-notes">

### 2.3 Write R code to visualize the hypothesis
What do you notice about the relationships between these quantitative variables? 
</div>

In [None]:
# code here


<div class="discussion-question">
<h3>2.4 Discussion Question: Do you see support for the hypothesis in the graph? Explain. </h3>
</div>

## 3 Explore the Hypothesis with Categorical Versions of the Variables
In the previous section, we used `loveable_quant` and `chill_quant`. In this section, let's use `loveable_cat` and `chill_cat`, the categorical versions of these variables, instead.

<div class="guided-notes">

### 3.1 Write a word equation to represent the hypothesis
Be sure to use the categorical variable names in your word equation.
    
</div>

word equation: 

<div class="guided-notes">

### 3.2 Write R code to visualize the distribution of the outcome variable
What do you notice about the distribution of this variable?
 
</div>

In [None]:
# code here




<div class="discussion-question">
    <h3>3.3 Discussion Question: Can we adjust the binwidth for <code>loveable_cat</code>? Why or why not? </h3>
</div>

<div class="guided-notes">

### 3.4 Write R code to visualize the hypothesis
What do you notice about the relationships between these categorical variables? 
</div>

In [None]:
# add onto this code 
gf_bar(~ loveable_cat, data = epic_char)

<div class="discussion-question">
    <h3>3.5 Discussion Question: Do you see support for the hypothesis that chill characters are more loveable in the faceted bar graph? Does it make more sense for the outcome variable to be in the bars or in the facets? Explain. </h3>
</div>

### 3.6 The function `gf_bar()` graphs counts (the number of observations in each category). How would we modify the code below to show proportions? Percents?

Try functions such as `gf_props()` (shows proportions) and `gf_percents()` (shows percents).
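
As a rough sketch of how these functions swap in for `gf_bar()`, the example below uses R's built-in `mtcars` data instead of `epic_char`, so the exercise itself is still yours to do. The `car_data` data frame, its `gearbox` column, and the plot names are made up for this illustration.

```r
library(coursekata)

# Assumption: car_data and gearbox are illustrative names, not part of this notebook's data
car_data <- mtcars
car_data$gearbox <- ifelse(car_data$am == 1, "manual", "automatic")  # in mtcars, am: 0 = automatic, 1 = manual

# counts on the y-axis
count_plot <- gf_bar(~ gearbox, data = car_data) %>%
  gf_facet_grid(. ~ cyl)

# same graph with proportions on the y-axis
prop_plot <- gf_props(~ gearbox, data = car_data) %>%
  gf_facet_grid(. ~ cyl)

# same graph with percents on the y-axis
percent_plot <- gf_percents(~ gearbox, data = car_data) %>%
  gf_facet_grid(. ~ cyl)

percent_plot
```

Only the function name changes between the three versions; the formula, the `data` argument, and the faceting stay the same.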

In [None]:
# edit the code here
gf_bar(~ loveable_cat, data = epic_char) %>%
  gf_facet_grid(. ~ chill_cat)

<div class="discussion-question">
    <h3>3.7 Discussion Question: The percent of characters that are chill and loveable is 97%; the percent of characters that are offended and loveable is 35%. Why don't these two bars add up to 100%?</h3>

</div>

The figure below shows the `gf_bar()`, `gf_props()`, and `gf_percents()` versions of the graph side by side.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb-4.6-4.9-bar-props-percents.png" alt="all three bar graphs">  

<div class="guided-notes">
    
### 3.8 Write the R code that was used to make the gray graph in your guided notes.
    
</div>

## 4 Explore the Hypothesis with Contingency Tables
Up to now we've been using bar graphs to explore the relationship between two categorical variables. Let's see how we could accomplish the same goal using contingency tables.
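
Here is a minimal sketch of a two-way table made with `tally()`, again using the built-in `mtcars` data rather than `epic_char`. The `car_data` data frame, its `gearbox` column, and the object names `counts` and `props` are assumptions made up for this example.

```r
library(coursekata)

# Assumption: illustrative column built from mtcars (am: 0 = automatic, 1 = manual)
car_data <- mtcars
car_data$gearbox <- ifelse(car_data$am == 1, "manual", "automatic")

# counts: outcome (gearbox) in the rows, explanatory variable (cyl) in the columns
counts <- tally(gearbox ~ cyl, data = car_data)
counts

# proportions of each gearbox type within each level of cyl
props <- tally(gearbox ~ cyl, data = car_data, format = "proportion")
props
```

The formula follows the same `outcome ~ explanatory` pattern used in the graphing functions.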

<div class="guided-notes">
    
### 4.1 Write the R code to make the frequency tables that correspond to each graph in the guided notes

</div>

In [None]:
# code here


<div class="guided-notes">

### 4.2 Annotate the bar graphs with the corresponding numbers from the `tally()` output

</div>

## 5 Adding More Explanatory Variables to a Graph
Now that we’ve seen how to visualize the relationship between one explanatory variable and an outcome, let’s try adding more explanatory variables.

<div class="discussion-question">
    
### 5.1 Key Discussion Question: How would you write a word equation to express the hypothesis that **both** gender and chill explain variation in loveable?
    
</div>

### 5.2 Using color to represent an additional explanatory variable

We've created a scatter plot below to show `loveable_quant` as a function of `chill_quant`. Try adding in the argument `color = ~gender` and see what happens.

(Try using `shape` or `size` instead of `color` to see what happens.)
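
For a sense of what mapping a variable to color looks like, here is a small sketch on the built-in `mtcars` data (not `epic_char`). The `car_data` copy and the name `color_plot` are illustrative assumptions.

```r
library(coursekata)

# Assumption: car_data is an illustrative copy of mtcars with cyl stored as a factor
car_data <- transform(mtcars, cyl = factor(cyl))

# scatter plot of mpg as a function of wt, with point color mapped to cyl
color_plot <- gf_point(mpg ~ wt, data = car_data, color = ~cyl)
color_plot
```

The tilde in `color = ~cyl` is what tells the function to map a variable to color; a plain value such as `color = "blue"` would instead color every point the same.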

In [None]:
# edit this code 
gf_point(loveable_quant ~ chill_quant, data = epic_char) 



### 5.3 Using facets (`gf_facet_grid`) to represent an additional explanatory variable

Try faceting this basic scatter plot by `gender`.
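
As a sketch of the faceting pattern, the example below chains `gf_facet_grid()` onto a scatter plot of the built-in `mtcars` data (not `epic_char`). The `car_data` copy and the name `facet_plot` are illustrative assumptions.

```r
library(coursekata)

# Assumption: car_data is an illustrative copy of mtcars with cyl stored as a factor
car_data <- transform(mtcars, cyl = factor(cyl))

# one panel per level of cyl, arranged side by side in columns
facet_plot <- gf_point(mpg ~ wt, data = car_data) %>%
  gf_facet_grid(. ~ cyl)
facet_plot
```

Writing `cyl ~ .` instead of `. ~ cyl` would stack the panels in rows rather than columns.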

In [None]:
# add facets to this
gf_point(loveable_quant ~ chill_quant, data = epic_char) 


<div class="discussion-question">
    
### 5.4 Key Discussion Question: What do these visualizations tell you about how gender and chill relate to loveable?
    
</div>

<div class="guided-notes">
    
### 5.5 Summarize: Fill in what you've learned today in the table of tools for visualizing relationships between two variables

</div>

<div class="guided-notes">
    
### 5.6 Summarize: What are some ways you can add more explanatory variables to a visualization?

</div>

## 6 Practice What You Learned

Take a look at the `epic_characters` data frame. It includes 15 variables, compared to the 7 in `epic_char`, which gives you more to work with!

In [None]:
# write code


### 6.1 Write a word equation that hypothesizes more than one explanatory variable to explain variation in `loveable_quant`

word equation: 

### 6.2 Use R to visualize your hypothesis

In [None]:
# visualization


### 6.3 Write a word equation that hypothesizes more than one explanatory variable to explain variation in `loveable_cat`

word equation: 

### 6.4 Use R to visualize your hypothesis

In [None]:
# visualization
