![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Callysto’s Weekly Data Visualization

## Twitch and Game Popularity

### Recommended Grade levels: 8-12

![Twitch website](images/Twitch_header.png)
<br>

### Instructions
#### “Run” the cells to see the graphs
Click “Cell” and select “Run All”.<br> This will import the data and run all the code, so you can see this week's data visualization. Scroll to the top after you’ve run the cells.<br> 

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don’t need to do any coding to view the visualizations**.
The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

# Question

**Twitch** is an online streaming service that lets viewers anywhere in the world watch their favourite gamers play video games, live. It is extremely popular, with viewers spending hundreds of hours watching someone else play a game.

Here is a screen shot from the Twitch website, showing a game in play as well as other channels to look at.

![Image](images/Twitch.png)

Why is Twitch so popular? I do not know!

Have you ever wondered which games on Twitch are most popular, and just how much watching is going on?

### Goal
Our goal is to show an overview of which games are most popular, based on the number of hours watched and the peak number of viewers watching.

We will use pie charts and bar graphs to visually represent this data in an informative way.  

# Gather

### Code:
The code below will import the Python programming libraries we need to gather and organize the data to answer our question.

In [1]:
## import libraries
import pandas as pd
import cufflinks as cf
cf.go_offline()

### Data:

There are many sources for information about Twitch and its usage statistics. We used a site called Twitch Analytics, hosted by SullyGnome.com. The website is here: https://sullygnome.com/

This web page has several options for downloading information. The site asks us not to "scrape" its data, so we had a choice of downloading their files in CSV (Comma Separated Values) format, or copying and pasting directly from the web page into our own spreadsheets and saving those as CSV. Here are our four files for this project: 
- watch-time-30.csv: the amount of time (hours) watched, per game, in the last 30 days
- watch-time-365.csv: the same, but for the last 365 days
- peak-viewers-30.csv: the peak number of viewers, per game, in the last 30 days
- peak-viewers-365.csv: the same, but for the last 365 days
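
To make the CSV format more concrete, here is a minimal sketch that reads a tiny CSV text with Python's built-in `csv` module. The two data rows are copied from the tables shown later in this notebook, but the text itself and the variable names (`csv_text`, `rows`) are our own illustration; the real files have more columns.

```python
import csv
import io

# A tiny CSV text: the first line holds the column names.
# Values that contain commas are wrapped in double quotes.
csv_text = '''Game,Watch time,Peak viewers
League of Legends,"164,249,369 hours",703375
Minecraft,"89,118,793 hours",647145
'''

# DictReader turns each row into a dictionary keyed by column name
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row['Game'], '|', row['Watch time'], '|', row['Peak viewers'])
```

Notice that every value comes back as text, including the numbers. This is why a cleaning step is usually needed before doing any math.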

We discovered that copying and pasting gave us better data. We suspect the CSV files directly downloaded from the website are flawed. (We sent a message to the author of the sullygnome.com webpage.) Our data was downloaded on February 28, 2021. 

We then uploaded the .csv files to our Jupyter hub, where we can access them with our code. These files are all available when you access this code on the Callysto hub.

### Import the data

In [2]:
## import data, from csv into a data frame (df)
time30_df = pd.read_csv('data/watch-time-30.csv')
time365_df = pd.read_csv('data/watch-time-365.csv')
viewers30_df = pd.read_csv('data/peak-viewers-30.csv')
viewers365_df = pd.read_csv('data/peak-viewers-365.csv')

### Comment on the data

We can check the size of each data frame by using the "shape" command. This will tell us how many rows and columns are in each data frame.

In [3]:
time30_df.shape, time365_df.shape, viewers30_df.shape, viewers365_df.shape

((49, 14), (50, 14), (50, 14), (50, 14))

From this "shape" inquiry, we see each data frame has 49 or 50 rows and 14 columns. 
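
If the idea of "shape" is new to you, the following small sketch may help. It builds a made-up data frame (the name `toy_df` and its values are invented for illustration) and reads off its number of rows and columns.

```python
import pandas as pd

# A small made-up data frame with 3 rows and 2 columns
toy_df = pd.DataFrame({
    'Game': ['Game A', 'Game B', 'Game C'],
    'Peak viewers': [300, 200, 100],
})

# shape is a pair of numbers: (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(n_rows, 'rows and', n_cols, 'columns')
```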

We can display the first few rows of each data frame using the "head" command, as in the following code:

In [4]:
time30_df.head()

Unnamed: 0.1,Unnamed: 0,Game,Watch time,Stream time,Peak viewers,Peak channels,Streamers,Average viewers,Average channels,Average viewer ratio,Followers gained,Views gained,FPR,VPR
0,1,Just Chatting,"259,503,353 hours","3,455,992 hours",1138978,13308,468472,360421,4799,75.09,20418682,393638526,5.91,113.9
1,2,League of Legends,"164,249,369 hours","3,988,498 hours",703375,9687,283482,228124,5539,41.18,3239671,213156507,0.81,53.44
2,3,Grand Theft Auto V,"139,348,297 hours","2,152,262 hours",506815,6008,186520,193539,2989,64.75,12944046,116185493,6.01,53.98
3,4,Fortnite,"115,073,855 hours","7,113,555 hours",541227,18928,682572,159824,9879,16.18,11185681,189220609,1.57,26.6
4,5,Minecraft,"89,118,793 hours","3,821,103 hours",647145,9865,477777,123776,5307,23.32,7497684,83435320,1.96,21.84


In [5]:
time365_df.head()

Unnamed: 0.1,Unnamed: 0,Game,Watch time,Stream time,Peak viewers,Peak channels,Streamers,Average viewers,Average channels,Average viewer ratio,Followers gained,Views gained,FPR,VPR
0,1,Just Chatting,"2,300,697,968 hours","32,232,285 hours",2787896,13308,1835635,262636,3679,71.38,171344648,3193916413,5.32,99.09
1,2,League of Legends,"1,652,835,706 hours","40,766,355 hours",2020835,11606,1209767,188679,4653,40.54,42937137,2010989819,1.05,49.33
2,3,Fortnite,"1,165,037,506 hours","88,236,119 hours",2331987,129860,3496932,132995,10072,13.2,142967972,1455149313,1.62,16.49
3,4,Grand Theft Auto V,"945,599,705 hours","19,037,766 hours",506815,6530,1014076,107945,2173,49.67,49180683,901816871,2.58,47.37
4,5,Call of Duty: Warzone,"944,741,957 hours","60,397,163 hours",951957,31506,2143304,107847,6894,15.64,40124735,820593661,0.66,13.59


In [6]:
viewers30_df.head()


Unnamed: 0.1,Unnamed: 0,Game,Watch time,Stream time,Peak viewers,Peak channels,Streamers,Average viewers,Average channels,Average viewer ratio,Followers gained,Views gained,FPR,VPR
0,1,Just Chatting,"259,503,353 hours","3,455,992 hours",1138978,13308,468472,360421,4799,75.09,20418682,393638526,5.91,113.9
1,2,Special Events,"4,469,982 hours","42,301 hours",930019,4232,10782,6208,58,105.67,156399,9842045,3.7,232.67
2,3,League of Legends,"164,249,369 hours","3,988,498 hours",703375,9687,283482,228124,5539,41.18,3239671,213156507,0.81,53.44
3,4,Minecraft,"89,118,793 hours","3,821,103 hours",647145,9865,477777,123776,5307,23.32,7497684,83435320,1.96,21.84
4,5,Counter-Strike: Global Offensive,"72,063,331 hours","1,635,037 hours",607485,4551,172349,100087,2270,44.07,2739766,124697779,1.68,76.27


In [7]:
viewers365_df.head()

Unnamed: 0.1,Unnamed: 0,Game,Watch time,Stream time,Peak viewers,Peak channels,Streamers,Average viewers,Average channels,Average viewer ratio,Followers gained,Views gained,FPR,VPR
0,1,Special Events,"89,403,711 hours","476,595 hours",3123208,6896,60379,10205,54,187.59,2749021,133446937,5.77,280.0
1,2,Just Chatting,"2,300,697,968 hours","32,232,285 hours",2787896,13308,1835635,262636,3679,71.38,171344648,3193916413,5.32,99.09
2,3,Fortnite,"1,165,037,506 hours","88,236,119 hours",2331987,129860,3496932,132995,10072,13.2,142967972,1455149313,1.62,16.49
3,4,League of Legends,"1,652,835,706 hours","40,766,355 hours",2020835,11606,1209767,188679,4653,40.54,42937137,2010989819,1.05,49.33
4,5,VALORANT,"936,638,883 hours","34,068,505 hours",1728977,16287,1380665,106922,3889,27.49,44001457,1231846982,1.29,36.16


# Organize

The code below will arrange the data cleanly so that we can do analysis on it. This is a quality control step for our data and involves examining the data to detect anything odd with the data (e.g. structure, missing values), fixing the oddities, and checking if the fixes worked. 

One thing we notice is that **Just Chatting** shows up as the top item in most of the data frames. However, this is not really a game but rather a channel that viewers go to in order to chat, not play games. Since it is not a game, we will remove it from each data frame. Similarly, **Special Events** shows up in the "viewers30" and "viewers365" data frames, so we will remove that one as well.


In [8]:
# data cleaning
time30_df = time30_df.drop(index=0)  ## drop row 0, which is Just Chatting
time365_df = time365_df.drop(index=0)
viewers30_df = viewers30_df.drop(index=[0,1])  ## drop rows 0 and 1, Just Chatting and Special Events
viewers365_df = viewers365_df.drop(index=[0,1])
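
Dropping rows by their row number works here because the two channels happen to sit in the first rows. As an alternative, here is a hedged sketch that removes rows by name instead, so it does not depend on where they appear. The data frame `toy_df` is a small stand-in built from values shown in the tables above, and `not_games` is a name we made up; this is not the method used in the rest of this notebook.

```python
import pandas as pd

# A small stand-in for one of our data frames (values from the tables above)
toy_df = pd.DataFrame({
    'Game': ['Special Events', 'Just Chatting', 'Fortnite', 'League of Legends'],
    'Peak viewers': [3123208, 2787896, 2331987, 2020835],
})

# Channels that are not games
not_games = ['Just Chatting', 'Special Events']

# Keep only the rows whose Game is NOT in the list (~ means "not")
games_only_df = toy_df[~toy_df['Game'].isin(not_games)]
print(games_only_df)
```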


We also need to convert the columns 'Watch time' and 'Peak viewers' to numbers, rather than text. 
We do this in Python by replacing the text 'hours' with a blank, replacing the commas with a blank, 
and then converting the text to an integer. We do this for all four data frames, and for both columns.

In [9]:
# convert strings to numbers
time30_df['Watch time'] = time30_df['Watch time'].str.replace(',', '').str.replace('hours','').astype(int)
time365_df['Watch time'] = time365_df['Watch time'].str.replace(',', '').str.replace('hours','').astype(int)
viewers30_df['Watch time'] = viewers30_df['Watch time'].str.replace(',', '').str.replace('hours','').astype(int)
viewers365_df['Watch time'] = viewers365_df['Watch time'].str.replace(',', '').str.replace('hours','').astype(int)

# astype(str) lets the same steps work even if pandas already read this column as numbers
time30_df['Peak viewers'] = time30_df['Peak viewers'].astype(str).str.replace(',', '').str.replace('hours','').astype(int)
time365_df['Peak viewers'] = time365_df['Peak viewers'].astype(str).str.replace(',', '').str.replace('hours','').astype(int)
viewers30_df['Peak viewers'] = viewers30_df['Peak viewers'].astype(str).str.replace(',', '').str.replace('hours','').astype(int)
viewers365_df['Peak viewers'] = viewers365_df['Peak viewers'].astype(str).str.replace(',', '').str.replace('hours','').astype(int)
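
To see what each step of the conversion does, here is a small sketch that works on just two text values copied from the 'Watch time' column. The variable names are our own, and the extra `strip` step (which removes the leftover space) is an addition for clarity rather than part of the cell above.

```python
import pandas as pd

# Two text values copied from the 'Watch time' column
watch_time = pd.Series(['164,249,369 hours', '89,118,793 hours'])

no_commas = watch_time.str.replace(',', '')      # '164249369 hours'
no_hours = no_commas.str.replace('hours', '')    # '164249369 ' (a space is left over)
trimmed = no_hours.str.strip()                   # '164249369'
as_numbers = trimmed.astype(int)                 # 164249369

print(as_numbers)
```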


### Comment on the data

We can look at the data head again, to ensure that we have removed those annoying channels, Just Chatting and Special Events.

We also verify that the two columns "Watch time" and "Peak viewers" show up as plain numbers, not text.

In [10]:
time30_df.head()

Unnamed: 0.1,Unnamed: 0,Game,Watch time,Stream time,Peak viewers,Peak channels,Streamers,Average viewers,Average channels,Average viewer ratio,Followers gained,Views gained,FPR,VPR
1,2,League of Legends,164249369,"3,988,498 hours",703375,9687,283482,228124,5539,41.18,3239671,213156507,0.81,53.44
2,3,Grand Theft Auto V,139348297,"2,152,262 hours",506815,6008,186520,193539,2989,64.75,12944046,116185493,6.01,53.98
3,4,Fortnite,115073855,"7,113,555 hours",541227,18928,682572,159824,9879,16.18,11185681,189220609,1.57,26.6
4,5,Minecraft,89118793,"3,821,103 hours",647145,9865,477777,123776,5307,23.32,7497684,83435320,1.96,21.84
5,6,Call of Duty: Warzone,78161349,"3,489,624 hours",305415,9097,215326,108557,4846,22.4,3335512,70580684,0.96,20.23


In [11]:
viewers30_df.head()

Unnamed: 0.1,Unnamed: 0,Game,Watch time,Stream time,Peak viewers,Peak channels,Streamers,Average viewers,Average channels,Average viewer ratio,Followers gained,Views gained,FPR,VPR
2,3,League of Legends,164249369,"3,988,498 hours",703375,9687,283482,228124,5539,41.18,3239671,213156507,0.81,53.44
3,4,Minecraft,89118793,"3,821,103 hours",647145,9865,477777,123776,5307,23.32,7497684,83435320,1.96,21.84
4,5,Counter-Strike: Global Offensive,72063331,"1,635,037 hours",607485,4551,172349,100087,2270,44.07,2739766,124697779,1.68,76.27
5,6,Fortnite,115073855,"7,113,555 hours",541227,18928,682572,159824,9879,16.18,11185681,189220609,1.57,26.6
6,7,Grand Theft Auto V,139348297,"2,152,262 hours",506815,6008,186520,193539,2989,64.75,12944046,116185493,6.01,53.98


# Explore

The code below will be used to help us look for evidence to answer our question. This can involve looking at data in table format, applying math and statistics, and creating different types of visualizations to represent our data.

We will start by displaying the 30 day data in a pie chart.

In [12]:
# data exploration
time30_df.iplot(kind='pie',labels='Game',values='Watch time',title='Hours watched, by Game Title')

In [13]:
viewers30_df.iplot(kind='pie',labels='Game',values='Peak viewers',title='Peak viewers, by Game Title')
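
Each slice of a pie chart is one game's share of the total. As a rough illustration, the sketch below computes those shares for just five games, using watch-time values copied from the cleaned table above. Because it uses only five games, the percentages will not match the full chart; the names `toy_df` and `Share (%)` are our own.

```python
import pandas as pd

# Watch time (hours) for five games, copied from the cleaned table above
toy_df = pd.DataFrame({
    'Game': ['League of Legends', 'Grand Theft Auto V', 'Fortnite',
             'Minecraft', 'Call of Duty: Warzone'],
    'Watch time': [164249369, 139348297, 115073855, 89118793, 78161349],
})

# Each slice of a pie chart is that game's share of the total
total = toy_df['Watch time'].sum()
toy_df['Share (%)'] = (toy_df['Watch time'] / total * 100).round(1)

print(toy_df[['Game', 'Share (%)']])
```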

# Interpret

The pie charts are overwhelming as they contain information about 50 games each. Let's display again, using only the top ten items in each chart. 

Notice the code to plot this is very similar to the above, except we select the first ten rows by adding the slice [0:10].

In [15]:
time30_df[0:10].iplot(kind='pie',labels='Game',values='Watch time',title='Hours watched, by Game Title')

In [16]:
viewers30_df[0:10].iplot(kind='pie',labels='Game',values='Peak viewers',title='Peak viewers, by Game Title')
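
The slice picks rows by position, so it gives the "top" rows only when the data frame is already sorted from largest to smallest. The sketch below, using a made-up data frame, compares a slice with the pandas `nlargest` method, which picks the rows with the biggest values directly. It is offered as an illustration, not as the approach used in this notebook.

```python
import pandas as pd

# A small made-up data frame, already sorted from largest to smallest
toy_df = pd.DataFrame({
    'Game': ['Game A', 'Game B', 'Game C', 'Game D', 'Game E'],
    'Peak viewers': [500, 400, 300, 200, 100],
})

# [0:3] keeps the rows in positions 0, 1 and 2 (position 3 is not included)
first_three = toy_df[0:3]

# nlargest picks the 3 rows with the biggest values, whatever the row order
top_three = toy_df.nlargest(3, 'Peak viewers')

print(first_three)
print(top_three)
```

For sorted data like this, both approaches select the same rows.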

## Interpreting the charts

We notice that the order of games depends on what you measure: hours watched, or peak number of viewers.

However, there is some consistency. For instance, in both charts, the top five games include these four: 
> League of Legends, Minecraft, Fortnite, and Grand Theft Auto V. 
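
We can check this overlap with a few lines of plain Python. The two lists below are typed in by hand from the cleaned 30-day tables shown earlier; the variable names are our own.

```python
# Top five games by hours watched (from the cleaned time30 table)
top5_hours = ['League of Legends', 'Grand Theft Auto V', 'Fortnite',
              'Minecraft', 'Call of Duty: Warzone']

# Top five games by peak viewers (from the cleaned viewers30 table)
top5_viewers = ['League of Legends', 'Minecraft',
                'Counter-Strike: Global Offensive', 'Fortnite',
                'Grand Theft Auto V']

# Keep the games from the first list that also appear in the second
in_both = [game for game in top5_hours if game in top5_viewers]
print(in_both)
```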

## Alternate visualizations

Perhaps you might find a bar chart to be more revealing. Not much code is needed for this: we just change the pie chart code above into a bar chart, being sure to plot the 'Watch time' column here. 

In [17]:
time30_df.index = time30_df['Game']
time30_df[0:10]['Watch time'].iplot(kind='bar',labels='Game',values='Watch time',title='Hours watched, by Game Title')

We can also do a plot of the Peak Viewers, in bar chart form.

In [18]:
viewers30_df.index = viewers30_df['Game']
viewers30_df[0:10]['Peak viewers'].iplot(kind='bar',labels='Game',values='Peak viewers',title='Peak viewers, by Game Title')

## Sanity check

In our first attempt at making this notebook, we thought there was something weird about the numbers. In that first version, we found that League of Legends had about 10 billion hours of viewing, and 700 thousand viewers at its peak. 

10 billion divided by 700 thousand is about 14,000, which would be the number of hours each viewer spent watching the game over 30 days. 

Per day, this is about 476 hours. Yet there are only 24 hours in a day. How can this be?

After some investigation, we determined that the csv files downloaded directly from the sullygnome.com website were flawed. So we used a cut-and-paste method to get better data, which is used in this current version of our Python notebook.



In [19]:
## Here are the numbers
10000000000/700000, 10000000000/700000/30


(14285.714285714286, 476.1904761904762)

## Sanity check number two

With this new, corrected data, we found that League of Legends had about 165 million hours of viewing, and 700 thousand viewers at its peak. 

165 million divided by 700 thousand is about 236, which would be the number of hours each viewer spent watching the game over 30 days. 

Per day, this is about 7.9 hours. That does fit into a 24 hour day, but it is like a full work day. So it seems a lot of people are viewing these games for a long time each day. 

You might ask yourself why the number of hours is so high. Some possibilities:
- people open Twitch on their computer and leave the games playing in the background during the work day
- people create bots, or self-running programs that pretend to watch the game, to bring up the numbers
- game players like to "game" the numbers to increase the payments they get for having many viewers. What are some other ways they can game the system?


Here are the numbers, for numbers of hours per viewer, over 30 days, and per day:


In [20]:
165000000/700000, 165000000/700000/30

(235.71428571428572, 7.857142857142858)
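
Both sanity checks follow the same recipe, so they can be sketched as a small reusable function. The function names below are hypothetical ones we made up for this illustration, and the estimate rests on the same rough assumption as above: that the total hours are shared equally among the peak number of viewers.

```python
def hours_per_viewer_per_day(total_hours, peak_viewers, days=30):
    """Rough estimate: total hours shared equally among the peak viewers."""
    return total_hours / peak_viewers / days

def is_plausible(hours_per_day):
    """A day has only 24 hours, so anything above that cannot be right."""
    return hours_per_day <= 24

first_try = hours_per_viewer_per_day(10_000_000_000, 700_000)   # flawed data
second_try = hours_per_viewer_per_day(165_000_000, 700_000)     # corrected data

print(round(first_try, 1), is_plausible(first_try))
print(round(second_try, 1), is_plausible(second_try))
```

A check like this will not prove the data is correct, but it can quickly flag numbers that are impossible.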

# Communicate

Below are some writing prompts to help you reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________but now I know____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

## Add your comments here

✏️

## Saving your work

You can download this notebook, with your comments, by using the **File** menu item in the toolbar above. You might like to download it in .html format so that the graphics remain active and can be viewed in your web browser.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)