1. **Video Game Data**

For this exercise, I combine two datasets with information about video games, their reviews, and their sales. The first can be found here:
https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings - This includes video games and their sales figures, as well as the reviews from both critics and users at metacritic.com.

And the second here:
http://wiki.urbanhogfarm.com/index.php/IGN_Game_Reviews - This has the reviews of video games on the gaming website IGN.com.

I am interested in combining the two to see how much the two datasets have in common. Then, in the future, I can do more complicated analysis with the resulting dataset to see whether either of the two sites is better at predicting the success of a video game (in terms of sales).

Let's take a look at the first dataset.

In [12]:
import pandas as pd

sales = pd.read_csv("game_sales.csv")
sales.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


Now let's import and look at the IGN dataset.

In [13]:
ign = pd.read_excel("ign.xlsx")
ign.head()

Unnamed: 0,Game,Platform,Score,Genre
0,Wolfenstein: The New Order,Xbox One,7.8,Shooter
1,Mario Kart 8,Wii U,9.0,"Racing, Action"
2,Sportsfriends,PlayStation 3,8.7,"Action, Compilation"
3,Sportsfriends,PlayStation 4,8.7,"Action, Compilation"
4,Sportsfriends,PC,8.7,"Action, Compilation"


Now that we have those loaded into memory, it's time to see if we can combine them. We'll do that by going through each row in the sales dataframe and looking for a game with the same name in the IGN dataframe. If we find one, we'll add its score to a list (using the first match when a game is listed more than once, for example on several platforms). If not, we'll add zero to indicate that we didn't find it. That way we can just put this list into the sales dataframe as a new column when we're done.

In [14]:
ign_rating = []
for game in sales['Name']:
    matches = ign[ign['Game'] == game]
    if matches.shape[0] > 0:
        ign_rating.append(matches['Score'].iloc[0])
    else:
        ign_rating.append(0)
print(ign_rating[0:10])

[7.5, 9.0, 8.5, 7.7000000000000002, 0, 9.0, 9.5, 0, 8.9000000000000004, 0]
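
The loop above scans the whole IGN dataframe once for every row in the sales dataframe. As a possible alternative, here is a minimal sketch of the same first-match lookup done with `drop_duplicates` and `map`. The dataframes `sales_demo` and `ign_demo` are small hypothetical stand-ins (the title "Unknown Game" is made up), and unmatched titles come back as NaN directly instead of 0.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values only)
sales_demo = pd.DataFrame({"Name": ["Mario Kart 8", "Sportsfriends", "Unknown Game"]})
ign_demo = pd.DataFrame({
    "Game": ["Mario Kart 8", "Sportsfriends", "Sportsfriends"],
    "Score": [9.0, 8.7, 8.7],
})

# Keep only the first IGN row per title, mirroring matches['Score'].iloc[0]
first_scores = ign_demo.drop_duplicates(subset="Game").set_index("Game")["Score"]

# Look up each sales title; titles without an IGN review become NaN
sales_demo["IGN"] = sales_demo["Name"].map(first_scores)
print(sales_demo)
```

Like the loop, this matches on the game's name only, so a game released on several platforms gets the same IGN score on every platform.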


Now that we've created the list, it's time to add the IGN scores to the sales dataframe as a new column, then clean up the data a little bit. That means making sure the 0s in the IGN column end up as NaN rather than zero, so that games without an IGN review aren't treated as if they had scored zero.

In [15]:
import numpy as np
sales['IGN'] = ign_rating
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
sales['IGN'] = sales['IGN'].replace(0, np.nan)
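
To see why the placeholder matters, here is a small sketch with made-up scores: a 0 is averaged in like any other value, while NaN is skipped by pandas by default.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; 0.0 stands in for "no IGN review found"
scores = pd.Series([9.0, 8.0, 0.0])

mean_with_zero = scores.mean()                    # placeholder drags the mean down
mean_with_nan = scores.replace(0, np.nan).mean()  # NaN is ignored by mean()

print(mean_with_zero, mean_with_nan)
```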

It's time to get rid of all of those pesky NaNs. First we'll find out what we're dealing with, so we can understand how much data we're losing.

In [16]:
rating_columns = ['IGN', 'User_Score', 'Critic_Score']
sales[pd.isnull(sales['User_Score'])].head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,IGN
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,,9.0
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,,
5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26,,,,,,,9.0
9,Duck Hunt,NES,1984.0,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31,,,,,,,
10,Nintendogs,DS,2005.0,Simulation,Nintendo,9.05,10.95,1.93,2.74,24.67,,,,,,,
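
Looking at the first few rows shows what the gaps look like, but not how many there are. One way to put a number on it is sketched below with a hypothetical three-row dataframe named `demo`; on the real data the same two expressions would be applied to `sales`.

```python
import numpy as np
import pandas as pd

# Toy dataframe with the same rating columns (values are illustrative)
demo = pd.DataFrame({
    "IGN": [7.5, 9.0, np.nan],
    "User_Score": [8.0, np.nan, np.nan],
    "Critic_Score": [76.0, np.nan, np.nan],
})
rating_columns = ["IGN", "User_Score", "Critic_Score"]

# Number of missing values in each rating column
missing_per_column = demo[rating_columns].isnull().sum()

# Number of rows that would survive if every rating is required
rows_fully_rated = len(demo.dropna(subset=rating_columns))

print(missing_per_column)
print(rows_fully_rated, "of", len(demo), "rows have all three ratings")
```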


For some reason, some of the best-selling games don't appear to have review information. That might mean we will miss some important data points if we delete them, but without review scores there is nothing to analyze for those games.

In [17]:
# Keep only the rows that have a value in every rating column
cleaned_sales = sales.dropna(subset=rating_columns)
cleaned_sales.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,IGN
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E,7.5
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E,8.5
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E,7.7
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.5,2.88,29.8,89.0,65.0,8.5,431.0,Nintendo,E,9.5
8,New Super Mario Bros. Wii,Wii,2009.0,Platform,Nintendo,14.44,6.94,4.7,2.24,28.32,87.0,80.0,8.4,594.0,Nintendo,E,8.9


We've lost a lot of sales data, but what's left looks a lot better, and for what we want to do the ratings are essential. So let's export this data to a file so we can use it to visualize the data, and then do some cool analysis.

In [18]:
cleaned_sales = cleaned_sales.reset_index()
cleaned_sales.to_csv("cleaned_game.csv")
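
One side effect of the export above: `reset_index()` keeps the old row labels as a new `index` column, and `to_csv` writes the row labels as an unnamed first column by default. If those extra columns aren't wanted, a sketch along these lines avoids them. The dataframe `demo` is a hypothetical stand-in, and an in-memory buffer is used in place of a file so the result can be inspected.

```python
import io
import pandas as pd

# Toy dataframe with one row that will be dropped (illustrative values)
demo = pd.DataFrame({"Name": ["A", "B", "C"], "IGN": [7.5, None, 8.5]})

# drop=True discards the old row labels instead of keeping them as a column
demo = demo.dropna(subset=["IGN"]).reset_index(drop=True)

# index=False leaves the row labels out of the CSV
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)
```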

But before we finish up, let's take a look at some summary statistics.

In [19]:
cleaned_sales.describe()



Unnamed: 0,index,Year_of_Release,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Count,IGN
count,5832.0,5712.0,5832.0,5832.0,5832.0,5832.0,5832.0,5832.0,5832.0,5083.0,5832.0
mean,6790.136488,2006.391632,0.394901,0.224703,0.060355,0.080163,0.760352,69.730796,27.23251,163.949439,6.970456
std,4533.674532,3.58485,0.986996,0.699866,0.295021,0.281303,2.014686,13.997864,18.667517,546.26652,1.663179
min,0.0,1988.0,0.0,0.0,0.0,0.0,0.01,17.0,3.0,4.0,1.0
25%,2858.25,,0.06,0.01,0.0,0.01,0.11,61.0,13.0,,6.0
50%,6200.5,,0.15,0.05,0.0,0.02,0.28,72.0,23.0,,7.3
75%,10274.25,,0.38,0.19,0.01,0.07,0.72,80.0,37.0,,8.2
max,16702.0,2016.0,41.36,28.96,6.5,10.57,82.53,98.0,107.0,9629.0,10.0
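
Note that `User_Score` is missing from the summary above. `describe()` only summarizes numeric columns by default, so the column was probably read as text. Assuming it contains placeholder strings (the "tbd" below is a hypothetical example of one), a minimal sketch of converting it to numbers looks like this:

```python
import pandas as pd

# Toy column stored as strings, with a hypothetical non-numeric placeholder
demo = pd.DataFrame({"User_Score": ["8.0", "tbd", "8.3"]})

# errors="coerce" turns anything that isn't a number into NaN
demo["User_Score"] = pd.to_numeric(demo["User_Score"], errors="coerce")

print(demo["User_Score"].dtype)
print(demo["User_Score"].mean())
```

After a conversion like this, the NaN filter would need to be run again, since coerced placeholders become new missing values.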


A few things that we can see from the dataset:

1. The mean isn't a good description of what we might expect a typical game to sell. For example, the 75th percentile of global sales (0.72) is lower than the mean (0.76). That means a few games sold unusually well and pulled the average up.

2. IGN and Metacritic use different scales for review scores (10 possible vs 100 possible), but if the two are normalized their means are nearly identical (about 0.70 for both). IGN scores span a wider range than critic scores after normalizing (1.0 to 10 versus 17 to 98). But otherwise, their distributions have a lot in common.

3. The number of user reviews on Metacritic varies widely between games (from 4 to 9,629, with a mean of about 164). That means user scores may not be easy to compare across games, since some rest on far fewer reviews than others.
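
The comparison in point 2 can be checked with a little arithmetic. This sketch copies the numbers from the `describe()` output and divides each by the maximum possible score on its scale. That is one simple way to normalize, chosen here for illustration; the helper `rescale` is hypothetical.

```python
# Summary statistics copied from the describe() output
critic = {"mean": 69.730796, "std": 13.997864, "min": 17.0, "max": 98.0}  # out of 100
ign = {"mean": 6.970456, "std": 1.663179, "min": 1.0, "max": 10.0}        # out of 10

def rescale(stats, top):
    """Divide every statistic by the maximum possible score."""
    return {name: value / top for name, value in stats.items()}

critic_norm = rescale(critic, 100.0)
ign_norm = rescale(ign, 10.0)

# Difference between the normalized means, and the spread of each scale
mean_gap = abs(critic_norm["mean"] - ign_norm["mean"])
critic_range = critic_norm["max"] - critic_norm["min"]
ign_range = ign_norm["max"] - ign_norm["min"]

print(mean_gap, critic_range, ign_range)
```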