![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=video-game-popularity/video-game-popularity.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Callysto’s Weekly Data Visualization

## Video Game Popularity

### Recommended grade level: 5-8

Callysto's Weekly Data Visualization is a learning resource that helps Grades 5-12 teachers and students grow and develop data literacy skills. We do this by providing a data visualization, like a graph, and asking teachers and students to interpret it. This companion resource walks learners through how the data visualization is created and interpreted using the data science process. The steps of this process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer our question. This includes creating visualizations. 
5. Interpret - Explain how the evidence answers our question. 
6. Communicate - Reflect on the interpretation. 

## 1. Question

Have you ever wondered what the most popular video games are? 

There are many ways you might want to answer this question. 
- You could ask your friends for their opinions.
- You could read posts on Facebook, Twitter, and Reddit and judge which games are discussed most.
- You could gather financial information from companies' listings about their activities.

It is worth thinking about this question, and about how you would gather information, in order to answer it well.

Our goal here is to use some Python code to create pie charts and bar charts that summarize current information about which video games are most popular. This is a short example of data visualization, using data about the popularity of video games.

There are many possible sources for the data. In this notebook, we look at the total dollar value of game sales and also at the number of players. We use Wikipedia as a convenient source for this data.

## 2. Gather

The code below will import the Python programming libraries we need to gather and organize the data to answer our question.

In [None]:
## import libraries
%pip install -q pyodide_http plotly nbformat cufflinks
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import cufflinks as cf
cf.go_offline()

### Popularity of video games, by sales

We grabbed the data from this Wikipedia page on video game sales:

https://en.wikipedia.org/wiki/List_of_best-selling_video_games

We saved the data in CSV (comma-separated values) format as a simple list of games and sales figures.

We uploaded the .csv file to our Jupyter hub, where our code can access it. The file should be available when you run this notebook on the Callysto hub.

In [None]:
## import data
sales_df = pd.read_csv('data/VideoBySales.csv')
sales_df

## 3. Organize

The code below will arrange the data cleanly so that we can do analysis on it. This is a quality control step for our data and involves examining the data to detect anything odd with the data (e.g. structure, missing values), fixing the oddities, and checking if the fixes worked. 

The data needs to be adjusted a bit before we can plot it. This is called "data munging", which just means we are tidying things up so the computer can work with them.

The list of "Sales" dollar values looks like a list of numbers, but to the computer, these are actually text strings -- little words made up of digits and commas. We need to transform these into computer numbers, by removing the commas, then telling the computer to treat these entries as numbers. We will also delete the "Rank" column as we will explore the data by sorting the data in different ways, which will make this column not very meaningful.

In [None]:
# data cleaning
sales_df['Sales'] = sales_df['Sales'].str.replace(',' , '')  ## this removes the commas by replacing them with an empty string
sales_df['Sales'] = pd.to_numeric(sales_df['Sales'])         ## this changes the texts to numerical values
sales_df.drop('Rank', axis=1, inplace=True) ## this deletes the 'Rank' column as the order of the data will change
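As a small, self-contained illustration of the cleaning step, the sketch below applies the same two operations to a made-up `Sales` column. The titles and figures are invented for illustration and are not taken from the notebook's CSV file.

```python
import pandas as pd

# Hypothetical sample data: sales stored as text, with commas as thousands separators
toy_df = pd.DataFrame({
    'Title': ['Game A', 'Game B', 'Game C'],
    'Sales': ['1,200,000', '950,000', '30,500'],
})

# Remove the commas, then ask pandas to treat the text as numbers
toy_df['Sales'] = toy_df['Sales'].str.replace(',', '', regex=False)
toy_df['Sales'] = pd.to_numeric(toy_df['Sales'])

print(toy_df['Sales'].tolist())   # the entries are now numbers, not text
print(toy_df['Sales'].sum())      # arithmetic works once the values are numeric
```

Before the conversion, adding two entries would glue the text together instead of adding the numbers, which is why this step comes before any plotting.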

## 4. Explore

The code below will be used to help us look for evidence to answer our question. This can involve looking at data in table format, applying math and statistics, and creating different types of visualizations to represent our data.

Let's start with a pie chart:

In [None]:
# data exploration
sales_df.iplot(kind='pie',labels='Title',values='Sales',title='Sales by Game Title')
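Each slice of a pie chart shows one value as a share of the total. The sketch below works out those shares by hand for a few invented figures (they are not from the notebook's data), which can help when reading the percentages on the chart.

```python
import pandas as pd

# Hypothetical sales figures, invented for illustration
toy_sales = pd.Series([50, 30, 20], index=['Game A', 'Game B', 'Game C'])

# A slice's share is its value divided by the total, written as a percentage
shares = toy_sales * 100 / toy_sales.sum()

for title, percent in shares.items():
    print(f"{title}: {percent:.1f}%")
```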

Let's look at just the top ten entries:

In [None]:
sales_df[0:10].iplot(kind='pie',labels='Title',values='Sales',title='Sales by Game Title (US dollars), Top Ten')
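Slicing with `[0:10]` picks the first ten rows, so it gives the top ten only when the table is already sorted from largest to smallest. A possible alternative, sketched below on invented data, is the pandas `nlargest` method, which selects the rows with the largest values regardless of row order.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data (not from the notebook's CSV file)
toy_df = pd.DataFrame({
    'Title': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Sales': [20, 75, 5, 40],
})

# Keep the two rows with the largest 'Sales' values, largest first
top_two = toy_df.nlargest(2, 'Sales')
print(top_two)
```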

Maybe you prefer to see this data as a bar chart. This will show explicitly the dollar value.

In [None]:
sales_df.index = sales_df['Title']
sales_df[0:10]['Sales'].iplot(kind='bar',values='Sales',title='Sales by Game Title, Top Ten')

While we are looking at sales, maybe we can also look at which companies are selling well. We have that data in our dataframe, so let's group the information together by company, sort it by sales and visualize it.

In [None]:
publisher_df = sales_df.groupby('Publisher',as_index=False).sum()
publisher_df = publisher_df.sort_values(by="Sales",ascending=False)  ## keep the sorted table so the "top ten" cells below use it
publisher_df
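The group-then-sort idea can be tried on a tiny invented table, as in the sketch below (the studio names and numbers are made up). One detail worth knowing is that `sort_values` returns a new, sorted table rather than changing the original, so the result has to be stored in a variable if later code is meant to use it.

```python
import pandas as pd

# Hypothetical data, invented for illustration
toy_df = pd.DataFrame({
    'Title': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Publisher': ['Studio X', 'Studio Y', 'Studio X', 'Studio Z'],
    'Sales': [10, 50, 15, 30],
})

# Add up the sales for each publisher
by_publisher = toy_df.groupby('Publisher', as_index=False)['Sales'].sum()

# sort_values returns a new table, so we store the result
by_publisher = by_publisher.sort_values(by='Sales', ascending=False)
print(by_publisher)
```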

In [None]:
# pie chart, all companies
publisher_df.iplot(kind='pie',labels='Publisher',values='Sales',title='Sales by Company (US dollars)')

In [None]:
# pie chart, top 10 companies
publisher_df[0:10].iplot(kind='pie',labels='Publisher',values='Sales',title='Sales by Company (US dollars), Top Ten')

Let's visualize the same top 10 sales by company data with a bar chart.

In [None]:
# bar chart, top 10 companies
publisher_df.index = publisher_df['Publisher']
publisher_df[0:10]['Sales'].iplot(kind='bar',values='Sales',title='Sales by Company (US dollars), Top Ten')

## Repeat Steps 2, 3, and 4 (Gather, Organize, Explore) with more data

Let's look at game popularity, as measured by number of players. We also got this data from Wikipedia, here:
https://en.wikipedia.org/wiki/List_of_most-played_video_games_by_player_count

In [None]:
players_df = pd.read_csv('data/VideoByPlayers.csv')
players_df

Let's fix up the 'Number' column, to change the text into a number.

It is pretty easy -- just strip out the words, and leave the digits. This will give us the number of players in millions.

The Python code is a bit tricky here. We want to look at each entry in the 'Number' column, call it x, examine each character in x, and keep it (join it) if that character is a digit. The function to do that is written like this:
```
''.join( c for c in x if c.isdigit() )
```
We apply this as a lambda function to the dataframe, as follows:


In [None]:
players_df['Number']=players_df['Number'].apply(lambda x: ''.join(c for c in x if c.isdigit()))
players_df

### Problem!

We notice that the entry for CrossFire is wrong: it says 1 million players when it should be 1000 million players. The original entry said there are a billion players, so let's fix that by changing the value to 1000.

In [None]:
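players_df.iloc[0,1]=1000   ## row 0, column 1 holds the CrossFire player count
players_df['Number']=pd.to_numeric(players_df['Number'])   ## treat the digits as numbers, not text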
players_df
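The CrossFire problem comes from the digit-keeping step itself: it throws away words such as "billion" along with everything else that is not a digit. The sketch below reproduces this on made-up entries and shows one possible unit-aware alternative. The helper `to_millions` and the sample strings are hypothetical and are not part of the original notebook, and the helper assumes whole-number entries.

```python
def keep_digits(x):
    """Keep only the digit characters of a string (the approach used in this notebook)."""
    return ''.join(c for c in x if c.isdigit())

def to_millions(x):
    """Hypothetical helper: read the digits, then scale by 1000 if the text says 'billion'."""
    number = int(keep_digits(x))
    if 'billion' in x.lower():
        number = number * 1000
    return number

# Invented sample entries, not taken from the notebook's CSV file
samples = ['1 billion', '350 million', '100 million']

print([keep_digits(s) for s in samples])   # '1 billion' loses its unit here
print([to_millions(s) for s in samples])   # every value is expressed in millions
```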

Now we are ready to visualize our data. Let's start with a pie chart.

In [None]:
# pie chart, all data
players_df.iplot(kind='pie',labels='Game',values='Number',title='Popularity by Number of Players (millions)')

Again, it might be more interesting to plot the top ten games only.

In [None]:
# pie chart, top 10 games
players_df[0:10].iplot(kind='pie',labels='Game',values='Number',title='Popularity by Number of Players (millions), Top Ten')

Maybe you prefer to see this data as a bar chart.

In [None]:
# bar chart, top 10 games
players_df.index = players_df['Game']
players_df[0:10]['Number'].iplot(kind='bar',title='Popularity by Number of Players, Top Ten')

## 5. Interpret

Below we will discuss the results of the data exploration. Here are a few key questions to ask when interpreting the results from data analysis to answer our question. These questions help you think critically about the information you see.

- Where did the data come from? How was the data gathered? 
- If you’re using more than one data source, how are the sources similar? 
- Describe what’s happening in the data visualization (graph). What do you notice (e.g. big or small values, or trends)? 
- How does our key evidence help answer our question?

## Some interpretations

- The ranking of games depends on what you measure. 
- For instance, **Sales** says *Minecraft* is the top game, while **Number of Players** says *CrossFire* is the top game.
- There is some overlap between **Sales** and **Number of Players**: *Minecraft*, *PlayerUnknown's Battlegrounds*, and *Tetris* show up in both lists.
- Free games don't show up at all under **Sales** but can still be popular, e.g. *MS Solitaire*.
- Be careful about double counting, e.g. two versions of *Tetris* (EA, Nintendo) show up separately under **Sales**.
- The numbers are HUGE! 200 million dollars for *Minecraft*, a billion players for *CrossFire*.
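One way to check which games appear in both rankings is to compare the two lists of titles as sets. The sketch below uses short, made-up lists rather than the notebook's data, and it assumes a game is spelled the same way in both lists, which may not hold for real data.

```python
# Hypothetical title lists, invented for illustration
sales_titles = ['Game A', 'Game B', 'Game C', 'Game D']
player_titles = ['Game C', 'Game E', 'Game A']

# Titles that appear in both lists
in_both = sorted(set(sales_titles) & set(player_titles))

# Titles that appear only in the players list (for example, free games with no sales figures)
players_only = sorted(set(player_titles) - set(sales_titles))

print(in_both)
print(players_only)
```

With the notebook's dataframes, the two lists would come from `sales_df['Title']` and `players_df['Game']`.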


## 6. Communicate

Below are some writing prompts to help you reflect on the new information presented by the data. When you look at the evidence, think about what you perceive in the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________ but now I know ____________________.
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

## Add your comments here

✏️

## Saving your work

You can download this notebook, with your comments, by using the **File** menu item in the toolbar above. You might like to download it in .html format so that the graphics remain active and can be viewed in your web browser.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)