<div class="alert alert-block alert-danger">

# 14C: Super Bowl Ads (COMPLETE)

**Use with textbook version 6.0+**


**Lesson assumes students have read up through page: 14.11**


</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this notebook, students will explore a dataset that measured a variety of factors about Super Bowl ads, such as their YouTube analytics and whether certain features are present or absent in the ad. They will have the freedom to explore any multivariate hypothesis they are interested in. Many of the potential outcome variables lack variation and are highly skewed with outliers, so students will need to decide how to handle these outliers, and will likely be faced with a model that does not explain any variation. There are also slides and a printable worksheet to help students understand PRE by drawing Venn Diagrams.
 
#### Includes:

- Fitting and interpreting multivariate models
- Dealing with outliers
- Interpreting a model that does not explain any variation
- Drawing Venn Diagrams of PRE

<b>Teacher Resources:</b> 


Link to [sample slides](https://docs.google.com/presentation/d/16vD_BcpKw-1-6oIe_S8Ar5wtLq8sMC5_x3MSaOJNSt4/edit?usp=sharing)

Link to [printable handout](https://docs.google.com/document/d/1dEG1phIGxmx8dEDLM_s1HbszCydJaJa5Keb18kuOzQg/edit?usp=sharing)

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 75-105 Mins

</div>

In [None]:
# Load the CourseKata library
library(coursekata)

# Adjust scientific notation
options(scipen = 10)

# Read data set
ads <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv")

## Introduction

Brands will pay millions of dollars to have a 30-second commercial air during the Super Bowl because they know they will have tens of millions of people watching.

**What makes an impactful Super Bowl commercial? If you only had 30-60 seconds to broadcast your brand to the public, what kinds of ads would have the greatest impact?**

Today, you will get the chance to analyze some Super Bowl commercials and develop your own theories on what factors have the biggest impact on how people engage with these commercials.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/jnb_Vm03hJpg-xcd-08c-superbowl.png" alt="collage of images from super bowl ads depicted in a television set" width = 50%>

## The Data

This data frame comes from [FiveThirtyEight](https://fivethirtyeight.com/). They watched over 200 ads from the 10 brands that aired the most spots in all 21 Super Bowls this century, according to superbowl-ads.com. They evaluated ads using the specific criteria listed below.

**Description of Variables**

- `year` Super Bowl year
- `brand` Brand for commercial
- `superbowl_ads_dot_com_url` Super Bowl ad URL
- `youtube_url` YouTube URL
- `funny` Contains humor
- `show_product_quickly` Shows product quickly
- `patriotic` Patriotic
- `celebrity` Contains celebrity
- `danger` Contains danger
- `animals` Contains animals
- `use_sex` Uses sexuality
- `id` YouTube ID
- `kind` YouTube kind
- `etag` YouTube etag
- `view_count` YouTube view count
- `like_count` YouTube like count
- `dislike_count` YouTube dislike count
- `favorite_count` YouTube favorite count
- `comment_count` YouTube comment count
- `published_at` YouTube publication date
- `title` YouTube title
- `description` YouTube description
- `thumbnail` YouTube thumbnail
- `channel_title` YouTube channel name
- `category_id` YouTube content category ID


**Data Sources** 

- [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-02/readme.md)

- Here is a link to their [article](https://projects.fivethirtyeight.com/super-bowl-ads/) on the topic, where you can also watch the commercials and see their analyses.

<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  10-15 mins

</div>

## 1.0: Explore Variation

**1.1:** Take a look at the data frame called `ads`. What is a good outcome variable to explore?

In [None]:
# Sample Response
head(ads)

<div class="alert alert-block alert-warning">

<b>Sample Responses:</b> 

- view_count
- like_count
- dislike_count 
- favorite_count
- comment_count

</div>

**1.2:** What are some explanatory variables that would help explain variation in that outcome? Why are those interesting?

Write your multivariate idea as a word equation.

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Students may come up with a variety of multivariate hypotheses here. One example might be:

- like_count = funny + animals + other stuff

Note that most of the possible outcome variables are skewed, with many extreme outliers that make it difficult to see much variation in visualizations of the models. Students may want to think about why some of these outliers are so extreme and will need to grapple with how to handle them. Some may choose to keep them, some may choose to filter them out into a new data frame, and some may choose to pursue both paths and compare them. Whatever they choose, they should be able to articulate the reason for the path(s) they chose.

</div>

**1.3:** Create a data visualization of your word equation. Describe whether you see the pattern you thought you might see.

In [None]:
# Sample Response for like_count = funny + animals + other stuff

gf_jitter(like_count ~ funny, data = ads) %>% 
    gf_facet_grid(animals ~.)

gf_jitter(like_count ~ funny, color = ~animals, data = ads)

# If they notice the lack of variation, you may want to encourage them to
# look at the distribution of the outcome variable to discover the shape/outliers
gf_histogram(~like_count, data = ads)
gf_boxplot(~like_count, data = ads)

# Some students may want to recode categorical variables from TRUE/FALSE
# For example:
ads$funny_2 <- factor(ads$funny, levels = c("TRUE", "FALSE"), labels = c("funny", "not funny"))
ads$animals_2 <- factor(ads$animals, levels = c("TRUE", "FALSE"), labels = c("animals", "no animals"))

############################################
# If they choose to filter outliers:

# Students will need to decide their own cutoff value
# They could formally calculate the outliers, or come up with 
# a justification for choosing another number

ads_2 <- filter(ads, like_count < 5000)

gf_histogram(~like_count, data = ads_2)
gf_boxplot(~like_count, data = ads_2)

gf_jitter(like_count ~ funny, data = ads_2) %>% 
    gf_facet_grid(animals ~.)

gf_jitter(like_count ~ funny, color = ~animals, data = ads_2)
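
If students want to "formally calculate the outliers" rather than pick a cutoff by eye, one common convention is to flag values more than 1.5 times the IQR above the third quartile. The sketch below is only an illustration under that assumption: the `likes` vector is made up for demonstration and is not taken from the `ads` data, and the variable names are hypothetical.

```r
# Hypothetical like counts (made up for illustration; NOT from the ads data)
likes <- c(12, 30, 45, 60, 80, 110, 150, 200, 260, 175000)

# First and third quartiles of the outcome
q <- quantile(likes, probs = c(0.25, 0.75), na.rm = TRUE)

# Interquartile range and the upper cutoff under the 1.5 * IQR convention
iqr <- q[2] - q[1]
upper_cutoff <- unname(q[2] + 1.5 * iqr)

# Keep only the values at or below the cutoff
likes_no_outliers <- likes[likes <= upper_cutoff]

upper_cutoff
likes_no_outliers
```

With the notebook's data, students could swap in `ads$like_count` for `likes`. The cutoff they get this way will probably differ from the 5000 used in the sample response, which is fine as long as they can justify their choice.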


## 2.0: Model Variation

**2.1:** Create the best-fitting model (or models), expressed in GLM notation, to represent your research question/word equation. Overlay it onto your data.


In [None]:
# Sample Response

model <- lm(like_count ~ funny + animals, data = ads)
model

gf_jitter(like_count ~ funny, color = ~animals, data = ads) %>%
    gf_model(model)

# If they choose to filter outliers:
model_no_outliers <- lm(like_count ~ funny + animals, data = ads_2)
model_no_outliers

gf_jitter(like_count ~ funny, color = ~animals, data = ads_2) %>%
    gf_model(model_no_outliers)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

- $like\_count_i = 5991 + -1772(funnyTRUE) + -1671(animalsTRUE) + e_i$

Or, no outliers:

- $like\_count_i = 436 + -104.6(funnyTRUE) + 112.1(animalsTRUE) + e_i$

</div>
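
To help students read this notation, it may be useful to show that `funnyTRUE` and `animalsTRUE` are coded as 1 when the feature is present and 0 when it is absent, so the model makes only four distinct predictions. The sketch below plugs in the rounded coefficients from the full-data sample response above; the object names (`b0`, `b_funny`, `b_animals`, `cells`) are hypothetical.

```r
# Rounded coefficients copied from the full-data sample response
b0 <- 5991
b_funny <- -1772
b_animals <- -1671

# Every combination of the two TRUE/FALSE predictors
cells <- expand.grid(funny = c(FALSE, TRUE), animals = c(FALSE, TRUE))

# In arithmetic, R treats TRUE as 1 and FALSE as 0, matching the dummy coding
cells$predicted_likes <- b0 + b_funny * cells$funny + b_animals * cells$animals

cells
```

The row where both predictors are `FALSE` reproduces the intercept, and each `TRUE` shifts the prediction by that predictor's coefficient.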

**2.2:** How well does your model fit the data?

In [None]:
# Sample Response

supernova(model)

supernova(model_no_outliers)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

In these examples, the model does not fit the data very well. The PREs are quite low, so most of the variation is left unexplained (error).

</div>
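
For students who want to see where the PRE in the `supernova()` table comes from, it can be computed by hand as the proportion of the empty model's error that the multivariate model removes. The sketch below is an assumption-laden illustration: it uses simulated data (`sim_ads`, generated here and unrelated to the real `ads` data) and base R only, with `like_count ~ 1` standing in for the empty model.

```r
# Simulated stand-in data (NOT the real ads data): a skewed outcome
# generated independently of two TRUE/FALSE predictors
set.seed(14)
n <- 200
sim_ads <- data.frame(
  funny = sample(c(TRUE, FALSE), n, replace = TRUE),
  animals = sample(c(TRUE, FALSE), n, replace = TRUE),
  like_count = round(rexp(n, rate = 1 / 500))
)

# Empty model (intercept only) and the multivariate model
empty_model <- lm(like_count ~ 1, data = sim_ads)
full_model <- lm(like_count ~ funny + animals, data = sim_ads)

# Sum of squared residuals from each model
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(full_model)^2)

# PRE: proportion of the empty model's error removed by the multivariate model
pre <- (ss_total - ss_error) / ss_total
pre
```

This value should match the R-squared reported by `summary(full_model)`, which is the same quantity under a different name.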

**2.3:** Draw the Venn diagrams of the PRE with the handout provided by your instructor.

<div class="alert alert-block alert-warning">

<b>Teacher Resources:</b> 


Link to [sample slides](https://docs.google.com/presentation/d/16vD_BcpKw-1-6oIe_S8Ar5wtLq8sMC5_x3MSaOJNSt4/edit?usp=sharing)

Link to [printable handout](https://docs.google.com/document/d/1dEG1phIGxmx8dEDLM_s1HbszCydJaJa5Keb18kuOzQg/edit?usp=sharing)


</div>

## 3.0: Evaluate Models

**3.1:** Compare your multivariate model against the empty model. Which is a better model of the DGP and why?

In [None]:
# Sample Response

supernova(model)

supernova(model_no_outliers)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

In these examples, the empty model is the better model of the DGP. The p-values for both of these models are quite high, meaning we cannot rule out randomness as a potential DGP. Thus, we should not reject the empty model.

</div>
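
If students ask how the F and p-value in the ANOVA table relate to the comparison with the empty model, the sketch below works through the arithmetic in base R. It is an illustration under stated assumptions only: the data are simulated (`sim_ads` is generated here, not taken from `ads`), and the object names are hypothetical.

```r
# Simulated stand-in data (NOT the real ads data)
set.seed(14)
n <- 200
sim_ads <- data.frame(
  funny = sample(c(TRUE, FALSE), n, replace = TRUE),
  animals = sample(c(TRUE, FALSE), n, replace = TRUE),
  like_count = round(rexp(n, rate = 1 / 500))
)

empty_model <- lm(like_count ~ 1, data = sim_ads)
full_model <- lm(like_count ~ funny + animals, data = sim_ads)

# Partition the empty model's error into model and error sums of squares
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(full_model)^2)
ss_model <- ss_total - ss_error

# Degrees of freedom: 2 added parameters; n minus 3 estimated parameters
df_model <- 2
df_error <- n - 3

# F ratio and the probability of an F at least this large under the empty model
f_stat <- (ss_model / df_model) / (ss_error / df_error)
p_value <- pf(f_stat, df_model, df_error, lower.tail = FALSE)

# Base R's nested-model comparison, for checking the hand calculation
comparison <- anova(empty_model, full_model)
comparison
```

The hand-computed `f_stat` and `p_value` should agree with the second row of `comparison`.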

**3.2:** If your multivariate model is better than the empty model, then compare your multivariate model against all possible single-predictor models. Which is a better model of the DGP and why?

In [None]:
# Sample Response for like_count = funny + animals + other stuff

# The MV model was already not better than the empty model
# In this example, neither single-predictor model is better than the empty model either

model_funny <- lm(like_count ~ funny, data = ads)
supernova(model_funny)

model_animals <- lm(like_count ~ animals, data = ads)
supernova(model_animals)

**3.3:** What has your class learned about Super Bowl ads?

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

*Student responses will vary.*

In this example case, we learned that whether an ad contains animals or humor did not explain variation in like count.

</div>