![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Callysto’s Weekly Data Visualization

## Video Game Popularity

### Recommended grade level: 6-9

Callysto's Weekly Data Visualization is a learning resource that helps Grades 5-12 teachers and students grow and develop data literacy skills. We do this by providing a data visualization, like a graph, and asking teachers and students to interpret it. This companion resource walks learners through how the data visualization is created and interpreted using the data science process. The steps of this process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer our question. This includes creating visualizations. 
5. Interpret - Explain how the evidence answers our question. 
6. Communicate - Reflect on the interpretation. 

## 1. Question

Have you ever wondered what the most popular video games are? 

There are many ways you might want to answer this question. 
- You could ask your friends for their opinions.
- You could read Facebook posts, Twitter tweets, Reddit posts and judge which ones are discussed most.
- You could gather financial information from companies' public listings about their activities.

It is worth thinking about how you would gather information to answer this question well.

Our goal here is to use some Python code to create pie charts and bar charts that summarize current information about which video games are most popular. This is a short example of data visualization, using data about the popularity of video games.

There are many possible sources for the data. In this example, we look at the total sales of each game (the number of copies sold) and at the number of players. We use Wikipedia as a convenient source for this data.

In this notebook, we look at video game sales records as well as the number of players.

## 2. Gather

The code below will import the Python programming libraries we need to gather and organize the data to answer our question.

In [1]:
## import libraries
import numpy as np
import pandas as pd
import cufflinks as cf
cf.go_offline()
%matplotlib inline

cf.set_config_file(offline=True)


### Popularity of video games, by sales

We grabbed the data from this Wikipedia page on video game sales:

https://en.wikipedia.org/wiki/List_of_best-selling_video_games

We saved the data in CSV (Comma Separated Values) format as a simple list of games and sales figures.

We uploaded the .csv file to our Jupyter hub, where our code can access it. This file should be available when you open this notebook on the Callysto hub.

In [2]:
## import data
sales_df = pd.read_csv('VideoBySales.csv')
sales_df

Unnamed: 0,Rank,Title,Sales,Platform,Initial release date,Developer,Publisher,Ref.
0,1,Minecraft,200000000,Multi-platform,"November 18, 2011[b]",Mojang Studios,Mojang Studios,[3]
1,2,Grand Theft Auto V,135000000,Multi-platform,"September 17, 2013",Rockstar North,Rockstar Games,[4]
2,3,Tetris (EA),100000000,Mobile,"September 12, 2006",EA Mobile,Electronic Arts,[5]
3,4,Wii Sports,82900000,Wii,"November 19, 2006",Nintendo EAD,Nintendo,[6]
4,5,PlayerUnknown's Battlegrounds,70000000,Multi-platform,"December 20, 2017",PUBG Corporation,PUBG Corporation,[7]
5,6,Super Mario Bros.,48240000,Multi-platform,"September 13, 1985",Nintendo,Nintendo,[c]
6,7,Pokémon Red / Green / Blue / Yellow,47520000,Multi-platform,"February 27, 1996",Game Freak,Nintendo,[d]
7,8,Wii Fit and Wii Fit Plus,43800000,Wii,"December 1, 2007",Nintendo EAD,Nintendo,[6]
8,9,Tetris (Nintendo),43000000,Game Boy / NES,"June 14, 1989",Nintendo R&D1,Nintendo,[e]
9,10,Pac-Man,39098000,Multi-platform,July 1980,Namco,Namco,[f]


## 3. Organize

The code below will arrange the data cleanly so that we can do analysis on it. This is a quality control step for our data and involves examining the data to detect anything odd with the data (e.g. structure, missing values), fixing the oddities, and checking if the fixes worked. 

The data needs to be adjusted a bit before we can plot it. This is called "data munging", which just means we are fixing things so the computer can work with them.

The "Sales" values (the number of copies sold) look like a list of numbers, but to the computer these are actually text strings -- little words made up of digits and commas. We need to transform these into computer numbers by removing the commas, then telling the computer to treat these entries as numbers.

In [3]:
# data cleaning
sales_df['Sales'] = sales_df['Sales'].str.replace(',', '', regex=False)  ## this removes the commas
sales_df['Sales'] = pd.to_numeric(sales_df['Sales'])                     ## this changes the text to numerical values
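
To check whether a fix like this worked, it can help to try the same two steps on a tiny hand-made example and look at the data type before and after. The sketch below uses a made-up Series named `sample` (our own name, not part of the notebook's data file), with three values written the way they might appear as text.

```python
import pandas as pd

# A small, made-up column of sales figures stored as text (for illustration only)
sample = pd.Series(['200,000,000', '135,000,000', '82,900,000'])
print(sample.dtype)            # a text type, so arithmetic would not work yet

# Step 1: remove the commas. Step 2: convert the text to numbers.
cleaned = pd.to_numeric(sample.str.replace(',', '', regex=False))
print(cleaned.dtype)           # an integer type
print(cleaned.isna().sum())    # number of missing values after the conversion
print(cleaned.sum())           # adding the values now works: 417900000
```

If the data type is still a text type after cleaning, or the count of missing values is not zero, something in the column probably still needs fixing.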

## 4. Explore

The code below will be used to help us look for evidence to answer our question. This can involve looking at data in table format, applying math and statistics, and creating different types of visualizations to represent our data.

Let's start with a pie chart:

In [4]:
# data exploration
sales_df.iplot(kind='pie',labels='Title',values='Sales',title='Sales by Game Title')

Let's look at just the top ten entries:

In [5]:
sales_df[0:10].iplot(kind='pie',labels='Title',values='Sales',title='Sales by Game Title, Top Ten')
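
Each slice of a pie chart is that row's share of the total of the rows being plotted, so the percentages change when we plot fewer rows. As a rough sketch of the arithmetic behind the chart, the code below takes three sales values from the table shown earlier and works out their shares by hand. The variable name `top3` and the column name `Share (%)` are our own choices.

```python
import pandas as pd

# Three rows copied from the sales table shown earlier in this notebook
top3 = pd.DataFrame({
    'Title': ['Minecraft', 'Grand Theft Auto V', 'Tetris (EA)'],
    'Sales': [200000000, 135000000, 100000000],
})

# Each slice is the row's sales divided by the total of the rows being plotted
top3['Share (%)'] = (top3['Sales'] / top3['Sales'].sum() * 100).round(1)
print(top3)
```

Here Minecraft's share is 46.0%, but only because the total includes just three games. In a chart of more games, the same sales figure would be a smaller slice.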

Maybe you prefer to see this data as a bar chart. This shows the sales figure for each game explicitly.

In [6]:
sales_df.index = sales_df['Title']
sales_df[0:10]['Sales'].iplot(kind='bar',values='Sales',title='Sales by Game Title, Top Ten')

While we are looking at sales, maybe we can also look at which companies are selling well. We have that data in our dataframe, so let's group the information together and visualize it.

In [7]:
df = sales_df.groupby('Publisher', as_index=False)[['Rank', 'Sales']].sum()  ## add up only the numeric columns
df

Unnamed: 0,Publisher,Rank,Sales
0,2K Games,40,22000000
1,Activision,141,129600000
2,Bethesda Softworks,19,30000000
3,Blizzard Entertainment,20,30000000
4,CD Projekt,24,28000000
5,Electronic Arts,37,124000000
6,Mojang Studios,1,200000000
7,Namco,10,39098000
8,Nintendo,493,695452500
9,Nintendo / The Pokémon Company,147,91380000


In [8]:
df = df.sort_values(by="Sales",ascending=False)
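
To see what grouping and sorting do, here is a minimal sketch on a made-up table. The publishers "A" and "B" and the sales numbers are placeholders, not real data. Rows with the same publisher are collected together, their sales are added up, and the totals are sorted from largest to smallest.

```python
import pandas as pd

# A made-up table: publishers "A" and "B" are placeholders, not real companies
toy = pd.DataFrame({
    'Title': ['Game 1', 'Game 2', 'Game 3', 'Game 4'],
    'Publisher': ['A', 'B', 'A', 'B'],
    'Sales': [10, 40, 25, 5],
})

# Group rows by publisher, add up the Sales in each group, then sort largest first
totals = toy.groupby('Publisher', as_index=False)['Sales'].sum()
totals = totals.sort_values(by='Sales', ascending=False)
print(totals)
```

Publisher B ends up first with 45 (40 + 5), followed by publisher A with 35 (10 + 25).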

In [9]:
df.iplot(kind='pie',labels='Publisher',values='Sales',title='Sales by Company')

In [10]:
df[0:10].iplot(kind='pie',labels='Publisher',values='Sales',title='Sales by Company, Top Ten')

In [11]:
df.index = df['Publisher']
df[0:10]['Sales'].iplot(kind='bar',values='Sales',title='Sales by Company, Top Ten')

## 2,3,4. Gather, Organize, Explore, with more data

Let's look at the game popularity, as measured by number of players. We also got this data from Wikipedia, here: 
https://en.wikipedia.org/wiki/List_of_most-played_video_games_by_player_count

In [12]:
players_df = pd.read_csv('VideoByPlayers.csv')
players_df

Unnamed: 0,Game,Number,As of,Business model,Release date,Publisher(s),Ref.
0,CrossFire,1 billion,February 2020,Free-to-play,"May 3, 2007",Smilegate / Tencent,[1]
1,Dungeon Fighter Online (DFO),700 million,May 2020,Free-to-play,August 2005,Nexon / Tencent,[2]
2,PlayerUnknown's Battlegrounds (PUBG),600 million[a],December 2019,Pay-to-play/free-to-play,"December 20, 2017",PUBG Corporation / Bluehole,[3]
3,Pac-Man Doodle,505 million peak daily players[b][a],June 2010,Free-to-play,"May 21, 2010",Namco / Google,[4][5]
4,QQ Speed,500 million[a],January 2020,Free-to-play,"January 23, 2008",Tencent,[7]
5,Candy Crush Saga,500 million[a],June 2014,Free-to-play,"April 12, 2012",King,[8]
6,Tetris,500 million[a],2016,Pay-to-play/free-to-play,"June 6, 1984",Various,[9]
7,Minecraft,480 million[a],November 2019,Pay-to-play/free-to-play,"November 18, 2011",Mojang,[10]
8,Microsoft Solitaire,400 million,July 2015,Free-to-play,1990,Microsoft,[11]
9,Mini World,400 million[a],April 2020,Free-to-play,"December 26, 2015",Miniwan,[12]


Let's fix up the 'Number' column, to change the text into a number.

It is pretty easy -- just strip out the words, and leave the digits. This will give us the number of players in millions.

The Python code is a bit tricky here. We want to look at each entry in the 'Number' column, call it x, examine each character in x, and keep it (join it) if that character is a digit. The expression to do that is written like this:
```
''.join( c for c in x if c.isdigit() )
```
We apply this as a lambda function to the dataframe, as follows:


In [13]:
players_df['Number']=players_df['Number'].apply(lambda x: ''.join(c for c in x if c.isdigit()))
players_df

Unnamed: 0,Game,Number,As of,Business model,Release date,Publisher(s),Ref.
0,CrossFire,1,February 2020,Free-to-play,"May 3, 2007",Smilegate / Tencent,[1]
1,Dungeon Fighter Online (DFO),700,May 2020,Free-to-play,August 2005,Nexon / Tencent,[2]
2,PlayerUnknown's Battlegrounds (PUBG),600,December 2019,Pay-to-play/free-to-play,"December 20, 2017",PUBG Corporation / Bluehole,[3]
3,Pac-Man Doodle,505,June 2010,Free-to-play,"May 21, 2010",Namco / Google,[4][5]
4,QQ Speed,500,January 2020,Free-to-play,"January 23, 2008",Tencent,[7]
5,Candy Crush Saga,500,June 2014,Free-to-play,"April 12, 2012",King,[8]
6,Tetris,500,2016,Pay-to-play/free-to-play,"June 6, 1984",Various,[9]
7,Minecraft,480,November 2019,Pay-to-play/free-to-play,"November 18, 2011",Mojang,[10]
8,Microsoft Solitaire,400,July 2015,Free-to-play,1990,Microsoft,[11]
9,Mini World,400,April 2020,Free-to-play,"December 26, 2015",Miniwan,[12]
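
To see the digit filter at work on its own, here is a small sketch that applies it to three entries written in the same style as the 'Number' column. The function name `keep_digits` is our own and is not part of pandas.

```python
# Sample entries in the same style as the 'Number' column
samples = ['700 million', '600 million[a]', '505 million peak daily players[b][a]']

def keep_digits(x):
    # Look at each character c in the text x and keep it only if it is a digit
    return ''.join(c for c in x if c.isdigit())

results = [keep_digits(s) for s in samples]
print(results)   # ['700', '600', '505']

# Caution: a numbered footnote would be glued onto the value
print(keep_digits('600 million[1]'))   # prints 6001
```

The footnote markers in this column are letters, so they are dropped safely. A footnote written with a number would be kept as part of the value, which is one reason to look over the result after cleaning.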


### Problem!

We notice that the entry for CrossFire is wrong -- it says 1 million players, when it should be 1000 million players. That is, the original entry said there are a billion players. So let's fix that by changing that value to 1000.

In [14]:
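players_df.loc[0, 'Number'] = 1000                          ## CrossFire: 1 billion players = 1000 million
players_df['Number'] = pd.to_numeric(players_df['Number'])  ## this changes the text to numerical values
players_df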

Unnamed: 0,Game,Number,As of,Business model,Release date,Publisher(s),Ref.
0,CrossFire,1000,February 2020,Free-to-play,"May 3, 2007",Smilegate / Tencent,[1]
1,Dungeon Fighter Online (DFO),700,May 2020,Free-to-play,August 2005,Nexon / Tencent,[2]
2,PlayerUnknown's Battlegrounds (PUBG),600,December 2019,Pay-to-play/free-to-play,"December 20, 2017",PUBG Corporation / Bluehole,[3]
3,Pac-Man Doodle,505,June 2010,Free-to-play,"May 21, 2010",Namco / Google,[4][5]
4,QQ Speed,500,January 2020,Free-to-play,"January 23, 2008",Tencent,[7]
5,Candy Crush Saga,500,June 2014,Free-to-play,"April 12, 2012",King,[8]
6,Tetris,500,2016,Pay-to-play/free-to-play,"June 6, 1984",Various,[9]
7,Minecraft,480,November 2019,Pay-to-play/free-to-play,"November 18, 2011",Mojang,[10]
8,Microsoft Solitaire,400,July 2015,Free-to-play,1990,Microsoft,[11]
9,Mini World,400,April 2020,Free-to-play,"December 26, 2015",Miniwan,[12]
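
Fixing one value by hand works for a short table, but in a longer table it would be easy to miss a second "billion" entry. One possible alternative, sketched below, reads the unit word along with the number. The helper `to_millions` and the `UNITS` table are hypothetical names of our own, and the sketch assumes every entry is written as a number followed by "million" or "billion".

```python
import re

# Hypothetical lookup table: how many millions each unit word stands for
UNITS = {'million': 1, 'billion': 1000}

def to_millions(text):
    # Find the first number that is followed by the word 'million' or 'billion'
    match = re.search(r'(\d+(?:\.\d+)?)\s*(million|billion)', text)
    if match is None:
        return None                      # leave unknown formats for a person to check
    value, unit = match.groups()
    return float(value) * UNITS[unit]

print(to_millions('1 billion'))                             # 1000.0
print(to_millions('600 million[a]'))                        # 600.0
print(to_millions('505 million peak daily players[b][a]'))  # 505.0
```

Because the pattern only reads the number next to the unit word, footnote markers after it are ignored.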


Now we are ready to plot.

In [15]:
players_df.iplot(kind='pie',labels='Game',values='Number',title='Popularity by Number of Players')

Again, it might be more interesting to plot the top ten games only.

In [16]:
players_df[0:10].iplot(kind='pie',labels='Game',values='Number',title='Popularity by Number of Players, Top Ten')

Maybe you prefer to see this data as a bar chart.

In [17]:
players_df.index = players_df['Game']
players_df[0:10]['Number'].iplot(kind='bar',title='Popularity by Number of Players, Top Ten')

## 5. Interpret

Below we will discuss the results of the data exploration. Here are a few key questions to ask when interpreting the results from data analysis to answer our question. These questions help you think critically about the information you see.

- Where did the data come from? How was the data gathered? 
- If you’re using more than one data source, how are the sources similar? 
- Describe what’s happening in the data visualization (graph). What do you notice (e.g. big or small values, or trends)? 
- How does our key evidence help answer our question?

*Markdown cell, add interpretation here*


## 6. Communicate

Below we will reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have? These writing prompts can help you reflect.

- I used to think ____________________but now I know____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

*Markdown cell, add reflections here*

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)