# 4C: Explaining Variation in Video Games 

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

In [None]:
# This code will import the data frame with the filtered Platform variable

gamesales3platforms_csv_link <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQoFFhkA1rGSNoPoeOUnt-Km6uZbnyUyLSUNyj6kmIszrXOZdr-SEsNNOibiZBVPxwJ2n2XiG3iKk07/pub?gid=722274646&single=true&output=csv"
gamesales <- read.csv(gamesales3platforms_csv_link, header = TRUE)


## 1.0 - Why are some games better than others?

A little while ago, we explored how critics and users rate games. Now let’s turn our attention to this question: Why are some games rated more highly than others? 

1.1 - Let’s get your first impressions. Why do you think some games are “better” than others?

### About the `gamesales` data 

We will be using the `gamesales` data set. The original data include over 30 platform categories, so they have been filtered to keep only three platforms that were made by competing companies and released around the same time (between 2005 and 2006): 
 - PS3 (Sony, 2006)
 - Wii (Nintendo, 2006)
 - X360 (Microsoft, 2005)

Since we are loading the data from a CSV file, there is no built-in documentation for the data frame. That means you can't run `?gamesales` to get more information about the variables. Instead, here is a description of each variable:


| Variable          | Description                                                   |
|:------------------|:--------------------------------------------------------------|
| `Name`            | The video game's name                                         |
| `Platform`        | Platform the game was released on: PS3, Wii, or X360          |
| `Year_of_Release` | Year of the game's release                                    |
| `Genre`           | Genre of the game                                             |
| `Publisher`       | Publisher of the game                                         |
| `NA_Sales`        | Sales in North America (in millions)                          |
| `EU_Sales`        | Sales in Europe (in millions)                                 |
| `JP_Sales`        | Sales in Japan (in millions)                                  |
| `Other_Sales`     | Sales in the rest of the world (in millions)                  |
| `Global_Sales`    | Total worldwide sales (in millions)                           |
| `Critic_Score`    | Aggregate score compiled by Metacritic staff                  |
| `Critic_Count`    | The number of critics used in coming up with the critic score |
| `User_Score`      | Score by Metacritic's subscribers                             |
| `User_Count`      | Number of users who gave the user score                       |
| `Rating`          | The ESRB ratings (Entertainment Software Rating Board)        |
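
Because there is no help page for this data frame, one way to get oriented is to inspect it directly. The lines below are a minimal sketch using base R functions, and they assume the import cell above has already run so that `gamesales` exists.

```r
# Assumes `gamesales` was created by the import cell above
str(gamesales)             # variable names, types, and a preview of the values
head(gamesales)            # the first six rows
table(gamesales$Platform)  # how many games are on each platform
```

Checking the output of `str()` against the table above is a quick way to confirm that each variable was read in with the type you expect.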

1.2 - Let’s consider one hypothesis as a class: 

> Game developers might create better games for some platforms than others. Maybe that’s why those games might then be rated more highly. 

Let’s call this the Platform hypothesis. Before we look at the data, what do you think about this hypothesis? Do you think it’s true? Which platform might have better or worse games? 

## 2.0 - Exploring the Platform Hypothesis with Critic Scores

2.1 - If we think a game’s `Platform` can help us explain the variation we see in `Critic_Score`, how can we write this idea as a word equation? Which is the outcome and which is the explanatory variable?

2.2 - Try to create a faceted histogram to help us see whether `Platform` might be related to critics’ ratings of games.
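
If you get stuck, here is one possible approach, not the only one. It is a sketch that assumes the `coursekata` package is loaded and `gamesales` has been imported; the object name `critic_hist` is just a label chosen for this example.

```r
# One possible approach: a histogram of the outcome (Critic_Score),
# split into one panel per Platform, with the panels stacked vertically
critic_hist <- gf_histogram(~ Critic_Score, data = gamesales) %>%
    gf_facet_grid(Platform ~ .)
critic_hist
```

Stacking the panels in rows (`Platform ~ .`) lines the three distributions up on the same x-axis, which makes it easier to compare where each one is centered.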

2.3 - Looking at the histograms, what do you notice about these distributions? Does any `Platform` seem like it tends to have better or worse critic ratings? 

2.4 - Now try using a boxplot or jitterplot to explore the same hypothesis. (Optional: Can you overlay the two?)
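
Here is one way the overlay could look, again as a sketch that assumes `coursekata` is loaded and `gamesales` is imported. The jitter settings (`width`, `alpha`) are arbitrary choices for readability, and `critic_box` is a name made up for this example.

```r
# Boxplot of Critic_Score for each Platform, with the individual games
# layered on top as jittered points
critic_box <- gf_boxplot(Critic_Score ~ Platform, data = gamesales) %>%
    gf_jitter(width = 0.2, height = 0, alpha = 0.3)
critic_box
```

Setting `height = 0` keeps each point at its actual critic score, so the jitter only spreads points sideways within a platform.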

2.5 - Looking at this other visualization, do you see the same pattern you saw in the histograms? What pattern do you notice?

2.6 - Based on these visualizations, if you knew that a game was a Wii game and had to guess its critic rating, would you adjust your guess to be a little lower or a little higher? Why?

## 3.0 - Exploring the Platform Hypothesis on User Scores

3.1 - As we all know, users are a little different from critics. Do you think we will see the same relationship between platform and ratings when we look at user scores? Why or why not?

3.2 - If we think a game’s `Platform` can help us explain the variation we see in `User_Score`, how can we write this idea as a word equation? Which is the outcome and which is the explanatory variable?

3.3 - Make two different visualizations (your choice!) to check out the `Platform` hypothesis on user ratings.

3.4 - Looking at the visualizations, what do you notice about these distributions? Does any `Platform` seem like it tends to have better or worse user ratings? 

3.5 - If you knew that a game was a Wii game and had to guess its user rating, would you adjust your guess to be a little lower or a little higher? Why?

## 4.0 - Wrap-up of the Platform Hypothesis

4.1 - One definition of “explain variation” is that if we know a little bit more about a game, we can use that information to make a better prediction of some outcome. Now that we’ve explored the Platform hypothesis, does `Platform` help us explain variation in some outcome (either critic or user ratings)? Which outcome?
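
To put numbers next to the pictures, one option is to compare summary statistics across platforms. This sketch uses `favstats()` (loaded with `coursekata`) and assumes that both score variables were read in as numeric columns; the object names are made up for this example.

```r
# Summary statistics of each outcome, computed separately for each Platform.
# Assumes Critic_Score and User_Score are numeric; check with str(gamesales).
critic_stats <- favstats(Critic_Score ~ Platform, data = gamesales)
user_stats <- favstats(User_Score ~ Platform, data = gamesales)
critic_stats
user_stats
```

If the means or medians differ noticeably across platforms for one outcome but not the other, knowing the platform would change your best guess for that outcome only.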

4.2 - When we look at a visualization and think – yeah, this one looks like some of the variation is explained – what features of the visualization should we look at?

4.3 - What are your overall conclusions? Does `Platform` matter for “how good a game is”? 

## 5.0 - Reflect and Connect

5.1 - In *4A*, we looked at the `States` data and explored whether `SubstanceAbuse`, `HouseholdIncomeK`, or `MedianRent` can explain variation in `PITHomeless`. As a reminder, write the word equations from that discussion here. 

5.2 - Compare those to the models we explored today:

- `User_Score` = `Platform` + Other Stuff
- `Critic_Score` = `Platform` + Other Stuff

Aside from the variable names, what makes our models in 4A different from our models in this lesson? What makes them similar?

5.3 - In both 4A and this lesson, how did we decide whether the explanatory variables were explaining variation in the outcome variables, even though we were using different visualizations? Why did we need to use different visualizations?

## 6.0 - From Data to Decision-Making

6.1 - Hmmm… just out of curiosity, let’s check which platforms the 10 best-selling games worldwide were released on (using `Global_Sales`). Which platforms have the best-selling games? Are they the same platforms that seem to have higher ratings? What might explain this?
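
One way to pull out the top sellers, sketched here with base R only, is to sort the rows by `Global_Sales` and keep the first 10. It assumes `gamesales` has been imported; `top10` is a name chosen for this example.

```r
# Sort the rows by Global_Sales, largest first, and keep the first 10
top10 <- head(gamesales[order(gamesales$Global_Sales, decreasing = TRUE), ], 10)

# Show only the columns needed for this question
top10[, c("Name", "Platform", "Global_Sales")]

# Count how many of the top 10 games are on each platform
table(top10$Platform)
```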

6.2 - If you had to write a public relations statement on behalf of Wii games (made by Nintendo) based on all the analyses we’ve done so far, what would you say? 