# Visualizing Video Games Sales Data

In this code-along, we'll use ggplot2 to visualize sales of popular video games in North America, Europe, and Japan.

The dataset used here is a subset of the "Video Games Sales Data" dataset available in Workspace. The dataset was originally [sourced from Kaggle](https://www.kaggle.com/datasets/gregorut/videogamesales).

![](dataset-video-games-sales-data.png)


- Every video game in this dataset has at least 100k global sales.
- We'll look at games from some of the most popular home consoles in the [4th to 8th console generations](https://en.wikipedia.org/wiki/Home_video_game_console_generations).
- Since the dataset is from Kaggle, the trustworthiness is questionable. This is for fun, not for real-world business decisions.

## Loading packages

In this code-along, we'll use `readr` to import the dataset, `dplyr` and `forcats` to manipulate the data, and `ggplot2` to visualize the data.

#### Instructions

- Load the `readr`, `dplyr`, `forcats` and `ggplot2` packages.

In [None]:
# Load readr, dplyr, forcats, ggplot2


The following lines of code make it easier to see the visualizations during the webinar.

In [None]:
# Set the default figure font size to 20 (and use the gray theme for plot colors)
theme_set(theme_gray(20))

# Display plots in the workspace with a width of 10 inches and a height of 8 inches
opts <- options(repr.plot.width = 10, repr.plot.height = 8)

## Import the dataset

The dataset is stored in a CSV file named `vgsales.csv` in the `data` directory.
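If you haven't imported a CSV file with `readr` before, here is a minimal sketch of the pattern. It does not touch `data/vgsales.csv`: the temporary file and the two `toy_` names are made up for illustration.

```r
library(readr)
library(dplyr)

# Hypothetical toy file: write two rows to a temporary CSV so there is something to read
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Global_Sales", "Game A,1.5", "Game B,0.7"), toy_path)

# read_csv() guesses a type for each column and returns a tibble
toy_games <- read_csv(toy_path)

# glimpse() prints one line per column: its name, its type, and the first few values
glimpse(toy_games)
```

`read_csv()` also prints a message describing the column types it guessed, which is a quick way to catch a numeric column that was read as text.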

#### Instructions

- Read the CSV file "data/vgsales.csv". Assign the result to `vgsales`.
- Glimpse the column information in `vgsales`.

In [None]:
# Read the CSV file "data/vgsales.csv"
vgsales <- 

# Glimpse the result


#### Data dictionary

- `Name`: The name of the game.
- `Year`: The year that the game was released on the platform.
- `Platform`: The name of the console platform that the game was released on.
- `Platform_Generation`: The console generation of the platform.
- `Platform_Company`: The company that made the console platform.
- `NA_Sales`: Millions of units sold on that platform in North America.
- `EU_Sales`: Millions of units sold on that platform in Europe.
- `JP_Sales`: Millions of units sold on that platform in Japan.
- `Global_Sales`: Millions of units sold on that platform globally.

## Drawing bar plots

### What are the top selling video games in the dataset?

Let's start by visualizing which games in the dataset were the top sellers. Since we want to plot a numeric variable (`Global_Sales`), split by a categorical variable (`Name`), a bar plot is the ideal choice.

There's a slight catch: some video games are available on multiple platforms. Look at the rows of the dataset for *Grand Theft Auto V* to see this.
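As a reminder of how row filtering works in `dplyr`, here is a small sketch on made-up data (the `toy_sales` table and its values are hypothetical, not taken from the dataset).

```r
library(dplyr)

# Hypothetical toy data: "Game A" was released on two platforms
toy_sales <- tibble(
  Name = c("Game A", "Game A", "Game B"),
  Platform = c("PS3", "X360", "Wii"),
  Global_Sales = c(2.0, 1.5, 4.0)
)

# filter() keeps only the rows where the condition is TRUE
toy_game_a <- toy_sales %>%
  filter(Name == "Game A")

toy_game_a
```

The result has one row per platform, which is why the sales need to be summed before ranking games.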

#### Instructions

- Using `vgsales`, filter for rows where `Name` equals `"Grand Theft Auto V"`.

In [None]:
# Using vgsales, filter for rows where Name equals "Grand Theft Auto V"


We need to get the total sales for each video game across all platforms, then get the top 10 sellers.
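The two steps (total per game, then keep the largest totals) can be sketched on made-up data like this. The `toy_` names and the numbers are hypothetical, and we keep the top 2 rather than the top 10 so the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data: one row per game per platform (sales in millions)
toy_sales <- tibble(
  Name = c("Game A", "Game A", "Game B", "Game C", "Game C"),
  Platform = c("PS3", "X360", "Wii", "PS3", "X360"),
  Global_Sales = c(2.0, 1.5, 4.0, 0.5, 0.7)
)

# Sum sales across platforms so that each game appears exactly once
toy_totals <- toy_sales %>%
  group_by(Name) %>%
  summarize(Total_Global_Sales = sum(Global_Sales))

# Keep the 2 rows with the largest totals; the result is sorted from largest to smallest
toy_top <- toy_totals %>%
  slice_max(Total_Global_Sales, n = 2)

toy_top
```

Note that `slice_max()` keeps ties by default, so it can return more than `n` rows when several games share the cut-off value.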

#### Instructions

- Using `vgsales`, group by `Name`, then summarize to calculate `Total_Global_Sales` as the sum of `Global_Sales`. Assign to `global_sales_by_name`.
- Slice `global_sales_by_name` to get the top 10 rows by maximum `Total_Global_Sales`. Assign to `top_global_sales_by_name`.

In [None]:
# Using vgsales, group by Name, then summarize to calculate Total_Global_Sales as the sum of Global_Sales
global_sales_by_name <- 


# Slice global_sales_by_name for top 10 rows by maximum Total_Global_Sales
top_global_sales_by_name <- 


# See the result
top_global_sales_by_name

Now we can draw a bar plot of `Total_Global_Sales` versus `Name`. Since the data is already summarized (one game per row), ggplot2 refers to this type of bar plot as a "column plot".
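To see the difference in practice, here is a minimal column plot on made-up data (the `toy_totals` table is hypothetical).

```r
library(ggplot2)

# Hypothetical toy data: already summarized, one row per game
toy_totals <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Total_Global_Sales = c(3.5, 4.0, 1.2)
)

# geom_col() uses the y values as bar heights;
# geom_bar() would instead count the number of rows per category by default
toy_col_plot <- ggplot(toy_totals, aes(x = Name, y = Total_Global_Sales)) +
  geom_col()

toy_col_plot
```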

#### Instructions

- Using `top_global_sales_by_name`, plot `Total_Global_Sales` versus `Name`.
- Add a column geom.

In [None]:
# Using top_global_sales_by_name, plot Total_Global_Sales versus Name
# Add a column geom



This is a good start, but the game name labels are overlapping. We can flip the axis coordinates to solve this.
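A sketch of flipping the coordinates, again on made-up data with deliberately long (hypothetical) game names:

```r
library(ggplot2)

# Hypothetical toy data with long labels that would collide on a horizontal axis
toy_totals <- data.frame(
  Name = c(
    "A Very Long Game Title: Special Edition",
    "Another Long Game Title: The Sequel",
    "Yet Another Long Game Title: Remastered"
  ),
  Total_Global_Sales = c(3.5, 4.0, 1.2)
)

# coord_flip() swaps the axes after everything else has been computed,
# so the aesthetics are still written as x = category, y = value
toy_flipped_plot <- ggplot(toy_totals, aes(x = Name, y = Total_Global_Sales)) +
  geom_col() +
  coord_flip()

toy_flipped_plot
```

In ggplot2 3.3.0 and later, mapping the category to `y` and the value to `x` should give a similar horizontal plot without `coord_flip()`; which one to use is mostly a matter of taste.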

#### Instructions

- Redraw the previous plot, but with the x and y axes flipped.

In [None]:
# Redraw the previous plot
# Use flipped coordinates



Currently the bars are ordered alphabetically by game name. It's easier to read the plot if the bars are ordered from longest to shortest.
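ggplot2 draws categories in the order of their factor levels, so reordering the bars means reordering the levels. Here is a sketch using `fct_reorder()` from `forcats` on made-up data (the `toy_` names are hypothetical).

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: already summarized, one row per game
toy_totals <- tibble(
  Name = c("Game A", "Game B", "Game C"),
  Total_Global_Sales = c(3.5, 4.0, 1.2)
)

# fct_reorder() converts Name to a factor whose levels are sorted
# by the matching Total_Global_Sales values, smallest first
toy_ordered <- toy_totals %>%
  mutate(Name = fct_reorder(Name, Total_Global_Sales))

levels(toy_ordered$Name)
# "Game C" "Game A" "Game B"
```

With flipped coordinates the first level is drawn at the bottom, so this ascending order puts the longest bar at the top.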

#### Instructions

- Mutate `top_global_sales_by_name` so `Name` is reordered by `Total_Global_Sales`. Assign to `top_global_sales_by_name_ordered`.
- Redraw the previous plot.

In [None]:
# Mutate top_global_sales_by_name so Name is reordered by Total_Global_Sales
top_global_sales_by_name_ordered <- 

# Redraw the previous plot



## Drawing line plots

Line plots are ideal for exploring how numeric metrics change from year to year.

For simplicity, let's first look at the 7th generation of consoles. We need to filter the dataset.

#### Instructions

- Using `vgsales`, filter for rows where `Platform_Generation` is equal to `"7th"`. Assign to `seventh_generation`.

In [None]:
# Using vgsales, filter for rows where Platform_Generation is equal to "7th"
seventh_generation <- 


# See the result
seventh_generation

### What are the total yearly sales of the 7th gen games included in the dataset?

The 7th generation of consoles is widely considered to have run from 2005 to 2017. By looking at total sales by year, we can get a sense of when this generation peaked in popularity.

#### Instructions

- Using `seventh_generation`, group by `Year`, then summarize to calculate `Total_Global_Sales` as the sum of `Global_Sales`. Assign to `total_7th_gen_global_sales_by_year`.

In [None]:
# Using seventh_generation,  
# group by Year,  
# then summarize to calculate Total_Global_Sales as the sum of Global_Sales
total_7th_gen_global_sales_by_year <- 


# See the result
total_7th_gen_global_sales_by_year

Now we can visualize these sales over time with a line plot.
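A minimal line plot on made-up yearly totals looks like this (the `toy_yearly` numbers are hypothetical). This sketch assumes ggplot2 3.4.0 or later, where line thickness is set with `linewidth`; on older versions the equivalent argument is `size`.

```r
library(ggplot2)

# Hypothetical toy data: one total per year
toy_yearly <- data.frame(
  Year = 2005:2009,
  Total_Global_Sales = c(10, 35, 60, 80, 75)
)

# geom_line() connects the points in order of the x variable
toy_line_plot <- ggplot(toy_yearly, aes(x = Year, y = Total_Global_Sales)) +
  geom_line(linewidth = 2)  # assumption: ggplot2 >= 3.4.0; use size = 2 on older versions

toy_line_plot
```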

#### Instructions

- Using `total_7th_gen_global_sales_by_year`, plot `Total_Global_Sales` versus `Year`.
- Add a line geom. To make the line easier to see, set the size to `2`.

In [None]:
# Using total_7th_gen_global_sales_by_year, plot Total_Global_Sales versus Year
# Add a line geom with size 2



### What's the split of those games by platform?

Across all the 7th generation platforms, sales of the games in the dataset peaked in 2009. But the peak for individual platforms may have been in different years. We can explore this by drawing a separate line for each platform.
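Here is a sketch of the idea on made-up data: summarize by two variables, drop the grouping, then map the second variable to color. The `toy_` names and values are hypothetical, and `linewidth` assumes ggplot2 3.4.0 or later.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: several games per year and platform
toy_sales <- tibble(
  Year = c(2006, 2006, 2006, 2007, 2007, 2007),
  Platform = c("PS3", "Wii", "Wii", "PS3", "PS3", "Wii"),
  Global_Sales = c(1.0, 3.0, 2.0, 2.5, 1.5, 6.0)
)

# One row per Year/Platform pair; .groups = "drop" returns an ungrouped tibble
toy_by_year_platform <- toy_sales %>%
  group_by(Year, Platform) %>%
  summarize(Total_Global_Sales = sum(Global_Sales), .groups = "drop")

# Mapping Platform to color draws one line per platform
toy_platform_plot <- ggplot(
  toy_by_year_platform,
  aes(x = Year, y = Total_Global_Sales, color = Platform)
) +
  geom_line(linewidth = 2)  # assumption: ggplot2 >= 3.4.0; use size = 2 on older versions

toy_platform_plot
```

Without `.groups = "drop"`, the result would still be grouped by `Year`, which can cause surprises in later `dplyr` steps.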

#### Instructions

- Using `seventh_generation`, group by `Year` and `Platform`, then summarize to calculate `Total_Global_Sales` as the sum of `Global_Sales`. Drop all groups from the summarization. Assign to `total_7th_gen_global_sales_by_year_platform`.
- Using `total_7th_gen_global_sales_by_year_platform`, plot `Total_Global_Sales` versus `Year`, colored by `Platform`.
- Add a line geom with size `2`.

In [None]:
# Using seventh_generation,  
# group by Year and Platform,  
# then summarize to calculate Total_Global_Sales as the sum of Global_Sales
total_7th_gen_global_sales_by_year_platform <- 


# Using total_7th_gen_global_sales_by_year_platform, plot Total_Global_Sales versus Year, colored by Platform.
# Add a line geom with size 2



How do you interpret this plot?

*add your answer here*

### How can we visualize all generations together?

Let's try the same plot again with all the data from `vgsales`.

#### Instructions

- Rework the previous plot, but start with `vgsales`.

In [None]:
# Using vgsales, 
# group by Year and Platform,  
# then summarize to calculate Total_Global_Sales as the sum of Global_Sales
total_global_sales_by_year_platform <- 


# Using total_global_sales_by_year_platform, plot Total_Global_Sales versus Year, colored by Platform.
# Add a line geom with size 2



This is really messy! With so many colors, it is hard to tell what is going on. The plot can be made clearer by using one color for each company, and by plotting each generation in its own panel.
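Faceting splits one plot into panels, one per value of a variable. A sketch on made-up data (the `toy_by_gen` table is hypothetical, and `linewidth` assumes ggplot2 3.4.0 or later):

```r
library(ggplot2)

# Hypothetical toy data: already summarized by year, company, and generation
toy_by_gen <- data.frame(
  Year = c(2006, 2007, 2006, 2007, 2013, 2014, 2013, 2014),
  Platform_Company = rep(c("Company X", "Company Y"), each = 2, times = 2),
  Platform_Generation = rep(c("7th", "8th"), each = 4),
  Total_Global_Sales = c(5, 8, 3, 6, 4, 9, 2, 7)
)

# One color per company, one panel per generation
toy_facet_plot <- ggplot(
  toy_by_gen,
  aes(x = Year, y = Total_Global_Sales, color = Platform_Company)
) +
  geom_line(linewidth = 2) +  # assumption: ggplot2 >= 3.4.0; use size = 2 on older versions
  facet_wrap(vars(Platform_Generation))

toy_facet_plot
```

By default, every panel shares the same x and y scales, which keeps the panels directly comparable.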

#### Instructions

- Using `vgsales`, group by `Year`, `Platform_Company` and `Platform_Generation`, then summarize to calculate `Total_Global_Sales` as the sum of `Global_Sales`. Drop all groups from the summarization. Assign to `total_global_sales_by_year_platform`.
- Using `total_global_sales_by_year_platform`, plot `Total_Global_Sales` versus `Year`, colored by `Platform_Company`.
- Add a line geom with size `2`.
- Facet the plot, wrapping by `Platform_Generation`.

In [None]:
# Using vgsales, 
# group by Year, Platform_Company and Platform_Generation, 
# then summarize to calculate Total_Global_Sales as the sum of Global_Sales
total_global_sales_by_year_platform <- 


# Using total_global_sales_by_year_platform, plot Total_Global_Sales versus Year, colored by Platform_Company
# Add a line geom with size 2
# Facet the plot, wrapping by Platform_Generation.



This is much clearer, but it's a bit tricky to compare timelines for generations that are side by side. It would be easier to see what is happening if we put all the panels in a single column.
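The number of panel columns is controlled by the `ncol` argument of `facet_wrap()`. A sketch on made-up data (the `toy_by_gen` table is hypothetical, and `linewidth` assumes ggplot2 3.4.0 or later):

```r
library(ggplot2)

# Hypothetical toy data: yearly totals for three generations
toy_by_gen <- data.frame(
  Year = c(1995, 1996, 2001, 2002, 2007, 2008),
  Platform_Generation = rep(c("5th", "6th", "7th"), each = 2),
  Total_Global_Sales = c(4, 6, 7, 9, 10, 12)
)

# ncol = 1 stacks the panels vertically, so the shared Year axis lines up across generations
toy_stacked_plot <- ggplot(toy_by_gen, aes(x = Year, y = Total_Global_Sales)) +
  geom_line(linewidth = 2) +  # assumption: ggplot2 >= 3.4.0; use size = 2 on older versions
  facet_wrap(vars(Platform_Generation), ncol = 1)

toy_stacked_plot
```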

#### Instructions

- Redraw the same plot, with 1 column in the facetting.

In [None]:
# Redraw the same plot, with 1 column in the facetting

