# Get top Video Games through Gamespot API


### ABSTRACT


This notebook builds a video game dataset by extracting data from three different sources and munging them into one consistent dataset. We clean the extracted data to remove errors and inconsistencies, then build a database from it and present the design as an Entity-Relationship Diagram.
The primary source is vgchartz.com, which provides video game ratings, genres, publishers and release years.

## Importing Libraries

In [21]:
import requests
import pandas as pd
import json
import os
import rawgpy
from bs4 import BeautifulSoup
import numpy as np

## DATASOURCE 1 - Web Scraping Using Beautiful Soup

### What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
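
As a minimal sketch of that navigating and searching, the snippet below parses a small, made-up HTML table and reads the link text and cell values out of one row. The markup and values are invented for illustration and are not copied from vgchartz.com.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like one row of a ranking table.
html_doc = """
<table>
  <tr>
    <td>1</td>
    <td><a href="http://example.com/game/1/">Example Game</a></td>
    <td><img alt="Wii" src="wii.png"/></td>
    <td>Example Publisher</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

link = soup.find("a")          # first <a> tag in the document
row = link.find_parent("tr")   # walk up to the enclosing table row
cells = row.find_all("td")     # every cell in that row

name = link.get_text().strip()
rank = int(cells[0].get_text())
platform = cells[2].find("img")["alt"]
publisher = cells[3].get_text().strip()

print(rank, name, platform, publisher)
```

Searching from the link and then walking up to its row is the same idea the scraping cell later in this notebook relies on.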


The site we are going to use is http://www.vgchartz.com/gamedb/games.php?name=&keyword=&console=&region=All&developer=&publisher=&goty_year=&genre=&boxart=Both&banner=Both&ownership=Both&showmultiplat=Yes&results=200&order=Sales&showtotalsales=0&showpublisher=0&showpublisher=1&showvgchartzscore=0&shownasales=0&showdeveloper=0&showcriticscore=0&showpalsales=0&showreleasedate=0&showuserscore=0&showjapansales=0&showlastupdate=0&showothersales=0&showshipped=0. Please visit the link to see what is being scraped.
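
A long query string like this one is easier to read and edit when it is assembled from a dictionary. The sketch below uses only the standard library and a small subset of the parameters from the URL above. Whether the site returns the same page with only this subset is an assumption that has not been checked here, and `build_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

BASE_URL = "http://www.vgchartz.com/gamedb/games.php"

# Subset of the query parameters that appear in the URL above.
params = {
    "region": "All",
    "results": 200,
    "order": "Sales",
    "showpublisher": 1,
}


def build_url(base, query):
    """Return `base` with `query` appended as an encoded query string."""
    return f"{base}?{urlencode(query)}"


url = build_url(BASE_URL, params)
print(url)

# Round trip: parse the query string back into a dict of strings.
parsed = dict(parse_qsl(urlsplit(url).query))
print(parsed)
```

`requests.get` also accepts the same dictionary through its `params` argument, which avoids building the string by hand.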



In [22]:
# Fetching the tags from the website
url = 'http://www.vgchartz.com/gamedb/games.php?name=&keyword=&console=&region=All&developer=&publisher=&goty_year=&genre=&boxart=Both&banner=Both&ownership=Both&showmultiplat=Yes&results=200&order=Sales&showtotalsales=0&showpublisher=0&showpublisher=1&showvgchartzscore=0&shownasales=0&showdeveloper=0&showcriticscore=0&showpalsales=0&showreleasedate=0&showuserscore=0&showjapansales=0&showlastupdate=0&showothersales=0&showshipped=0'
html = requests.get(url, headers = {'User-Agent':'Mozilla/5.0'})
print(html.status_code) 
# Printing the status code, 200 means the request has succeeded

200


In [23]:
# Displaying the parsed HTML in a more readable, indented format using the html parser
soup = BeautifulSoup(html.content, 'html.parser')
print(soup.prettify())  # prettify is a method, so it must be called

<!DOCTYPE html>

<html lang="en">
<head>
<!-- VGCHARTZ MAIN HEADER TAGS -->
<!-- Venatus Market Ad-Manager (vgchartz.com) -->
<script>
    (function(){document.write('<div id="vmv3-ad-manager" style="display:none"></div>');document.getElementById("vmv3-ad-manager").innerHTML='<iframe id="vmv3-frm" src="javascript:\'<html><body></body></html>\'" width="0" height="0" data-mode="scan" data-site-id="5b11330346e0fb00017cd841"></iframe>';var a=document.getElementById("vmv3-frm");a=a.contentWindow?a.contentWindow:a.contentDocument;a.document.open();a.document.write('<script src="https://hb.vntsm.com/v3/live/ad-manager.min.js" type="text/javascript" async>'+'</scr'+'ipt>');a.document.close()})();
    </script>
<!-- / Venatus Market Ad-Manager (vgchartz.com) -->
<meta content="4J-ohFOExhky4T8zKx5Nz04hdjbaxo52B6qVrLlM8o0" name="google-site-verification"/>
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/java

In [24]:
# We will be extracting Rank, Name, Platform and Publisher from vgchartz.com
k = []
rank = []
gname = []
publisher = []
platform = []
# Walk every link on the page; game links start with the /game/ URL prefix
for tag in soup.find_all('a'):
    # .get() avoids a KeyError on anchors that have no href attribute
    if tag.get('href', '').startswith('http://www.vgchartz.com/game/'):
        k.append(tag.get_text().strip())
        # Two levels up from the link we expect the table row holding its cells
        data = tag.parent.parent.find_all('td')
        if data != []:
            rank.append(np.int32(data[0].string))
            platform.append(data[3].find('img').attrs['alt'].strip(' '))
            publisher.append(data[4].string.strip(' '))
gname = k[10:] # Our data starts from index position 10 onwards
# Creating a dictionary to store the column names for the dataframe
columns = {
    'Rank': rank,
    'Name':gname,
    'Platform':platform,
    'Publisher':publisher
}
df = pd.DataFrame(columns) # Creating a dataframe with column names Rank, Name, Platform and Publisher.
df = df[[
    'Rank', 'Name', 'Platform',
    'Publisher']]
# Saving the obtained dataframe on a file named vgsales.csv
df.to_csv("vgsales.csv", sep=",", encoding='utf-8', index=False) # Saves the data to .csv file
#df.drop_duplicates(subset ="Name",keep = False, inplace = True)
df

Unnamed: 0,Rank,Name,Platform,Publisher
0,1,Wii Sports,Wii,Nintendo
1,2,Super Mario Bros.,NES,Nintendo
2,3,Mario Kart Wii,Wii,Nintendo
3,4,PLAYERUNKNOWN'S BATTLEGROUNDS,PC,PUBG Corporation
4,5,Minecraft,PC,Mojang
...,...,...,...,...
195,196,New Super Mario Bros. U,WiiU,Nintendo
196,197,Red Dead Redemption 2,XOne,Rockstar Games
197,198,Destiny,PS4,Activision
198,199,Tekken 2,PS,Namco


In [25]:
print(df.isnull().any())
print(df.columns)

Rank         False
Name         False
Platform     False
Publisher    False
dtype: bool
Index(['Rank', 'Name', 'Platform', 'Publisher'], dtype='object')


## DATASOURCE 2 - Using Raw Data

In [26]:
df2 = pd.read_csv('vgsales2019.csv')
#df2.drop(['Rank','ESRB_Rating','Platform','Publisher','Developer','User_Score', 'Total_Shipped', 'Global_Sales', 'NA_Sales', 'PAL_Sales', 'JP_Sales', 'Other_Sales'], axis=1, inplace= True)

### Displaying the output

In [27]:
df2.head()

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,Total_Shipped,Global_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,82.86,,,,,,2006.0
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,40.24,,,,,,1985.0
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,37.14,,,,,,2008.0
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,36.6,,,,,,2017.0
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,33.09,,,,,,2009.0


### Checking the information of the data - data type and total number of records in each column

In [28]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55792 entries, 0 to 55791
Data columns (total 16 columns):
Rank             55792 non-null int64
Name             55792 non-null object
Genre            55792 non-null object
ESRB_Rating      23623 non-null object
Platform         55792 non-null object
Publisher        55792 non-null object
Developer        55775 non-null object
Critic_Score     6536 non-null float64
User_Score       335 non-null float64
Total_Shipped    1827 non-null float64
Global_Sales     19415 non-null float64
NA_Sales         12964 non-null float64
PAL_Sales        13189 non-null float64
JP_Sales         7043 non-null float64
Other_Sales      15522 non-null float64
Year             54813 non-null float64
dtypes: float64(9), int64(1), object(6)
memory usage: 6.8+ MB


### How to find the missing values

In [29]:
# Checking for missing (NaN) values in the dataframe loaded from the CSV
df2.isnull().any()

Rank             False
Name             False
Genre            False
ESRB_Rating       True
Platform         False
Publisher        False
Developer         True
Critic_Score      True
User_Score        True
Total_Shipped     True
Global_Sales      True
NA_Sales          True
PAL_Sales         True
JP_Sales          True
Other_Sales       True
Year              True
dtype: bool

### Checking the total null values in the column using sum() function

In [30]:
df2.isnull().sum()

Rank                 0
Name                 0
Genre                0
ESRB_Rating      32169
Platform             0
Publisher            0
Developer           17
Critic_Score     49256
User_Score       55457
Total_Shipped    53965
Global_Sales     36377
NA_Sales         42828
PAL_Sales        42603
JP_Sales         48749
Other_Sales      40270
Year               979
dtype: int64
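
Raw counts are easier to judge as a share of all rows. For example, `User_Score` is missing in 55457 of 55792 rows, roughly 99%. The sketch below shows one way to compute that share per column and list the columns that are mostly empty. It runs on invented toy data rather than `vgsales2019.csv`, and the 0.5 cut-off is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df2; the values are invented.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Critic_Score": [7.7, np.nan, 8.2, np.nan],
    "User_Score": [np.nan, np.nan, 9.1, np.nan],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Columns where more than half of the values are missing.
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
print(sparse_cols)
```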

### Checking the shape of the data

In [31]:
df2.shape

(55792, 16)

### Checking the columns present in the data

In [32]:
df2.columns

Index(['Rank', 'Name', 'Genre', 'ESRB_Rating', 'Platform', 'Publisher',
       'Developer', 'Critic_Score', 'User_Score', 'Total_Shipped',
       'Global_Sales', 'NA_Sales', 'PAL_Sales', 'JP_Sales', 'Other_Sales',
       'Year'],
      dtype='object')

In [33]:
# Keep only Name, Genre, Critic_Score and Year by dropping every other column
df2.drop(['Rank','ESRB_Rating','Platform','Publisher','Developer','User_Score', 'Total_Shipped', 'Global_Sales', 'NA_Sales', 'PAL_Sales', 'JP_Sales', 'Other_Sales'], axis=1, inplace=True)
# Drop every row that still has a missing value in any remaining column
df2.dropna(axis=0, how='any', inplace=True)
#df2.head(100)
#print(df2.columns)
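
An alternative to listing every column to drop is to list the few columns to keep, and to name the columns that must be non-null. The sketch below shows that variant on invented toy data. It is not the cell above rewritten, only an illustration of the same idea.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df2; the values are invented.
toy = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Genre": ["Sports", "Platform", "Racing"],
    "Critic_Score": [7.7, np.nan, 8.2],
    "Year": [2006.0, 1985.0, np.nan],
})

keep = ["Name", "Genre", "Critic_Score", "Year"]

# Select the wanted columns, then drop rows missing a score or a year.
trimmed = toy[keep].dropna(subset=["Critic_Score", "Year"])
print(trimmed)
```

Only row "A" survives here, because "B" has no critic score and "C" has no year.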

In [34]:
# Merging the two dataframes df and df2 on their shared 'Name' column
result = pd.merge(df,df2, 
                 on = 'Name')

In [35]:
result

Unnamed: 0,Rank,Name,Platform,Publisher,Genre,Critic_Score,Year
0,1,Wii Sports,Wii,Nintendo,Sports,7.7,2006.0
1,2,Super Mario Bros.,NES,Nintendo,Platform,10.0,1985.0
2,2,Super Mario Bros.,NES,Nintendo,Platform,9.0,2006.0
3,3,Mario Kart Wii,Wii,Nintendo,Racing,8.2,2008.0
4,5,Minecraft,PC,Mojang,Misc,10.0,2010.0
...,...,...,...,...,...,...,...
308,193,Super Mario 3D World,WiiU,Nintendo,Platform,9.5,2013.0
309,195,Link's Crossbow Training,Wii,Nintendo,Shooter,6.9,2007.0
310,196,New Super Mario Bros. U,WiiU,Nintendo,Platform,8.4,2012.0
311,198,Destiny,PS4,Activision,Shooter,8.3,2014.0
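
The merged frame has more rows than the scraped table because `Name` is not unique in `df2`: "Super Mario Bros." matches two rows with different years. The sketch below reproduces that effect on invented toy data and compares it with merging on two keys. Merging on `Name` and `Platform` would require keeping `Platform` in `df2` and assumes both sources spell platform names the same way, which has not been verified here.

```python
import pandas as pd

# Toy stand-ins for df (scraped) and df2 (CSV); the values are invented.
left = pd.DataFrame({
    "Rank": [1, 2],
    "Name": ["Game A", "Game B"],
    "Platform": ["Wii", "NES"],
})
right = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game B"],
    "Platform": ["Wii", "NES", "Wii"],
    "Year": [2006.0, 1985.0, 2006.0],
})

# One key: every "Game B" row on the right matches the single one on the left.
by_name = pd.merge(left, right[["Name", "Year"]], on="Name")

# Two keys: only the row with the same name and platform matches.
by_name_platform = pd.merge(left, right, on=["Name", "Platform"])

print(len(by_name), len(by_name_platform))
```

Passing `validate="one_to_one"` to `pd.merge` makes pandas raise an error instead of silently duplicating rows when a key is repeated.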


In [36]:
result.to_csv("result.csv",encoding="utf-8",index=False)

In [37]:
result.drop_duplicates(subset ="Name",keep = "first", inplace = True)
result

Unnamed: 0,Rank,Name,Platform,Publisher,Genre,Critic_Score,Year
0,1,Wii Sports,Wii,Nintendo,Sports,7.7,2006.0
1,2,Super Mario Bros.,NES,Nintendo,Platform,10.0,1985.0
3,3,Mario Kart Wii,Wii,Nintendo,Racing,8.2,2008.0
4,5,Minecraft,PC,Mojang,Misc,10.0,2010.0
24,6,Wii Sports Resort,Wii,Nintendo,Sports,8.0,2009.0
...,...,...,...,...,...,...,...
308,193,Super Mario 3D World,WiiU,Nintendo,Platform,9.5,2013.0
309,195,Link's Crossbow Training,Wii,Nintendo,Shooter,6.9,2007.0
310,196,New Super Mario Bros. U,WiiU,Nintendo,Platform,8.4,2012.0
311,198,Destiny,PS4,Activision,Shooter,8.3,2014.0
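
With `keep="first"`, the row that survives is whichever duplicate happens to come first, so the result depends on row order. The sketch below, on invented toy data, contrasts that with sorting first so that the surviving row is chosen deliberately. Picking the earliest year is only one possible rule and is an assumption, not something this notebook does.

```python
import pandas as pd

# Toy data with a duplicated name; the values are invented.
toy = pd.DataFrame({
    "Name": ["Game B", "Game B", "Game A"],
    "Critic_Score": [9.0, 10.0, 7.7],
    "Year": [2006.0, 1985.0, 2006.0],
})

# Keeps whichever duplicate appears first in the current row order.
first_seen = toy.drop_duplicates(subset="Name", keep="first")

# Sort by year first, so "first" means the earliest release.
earliest = (
    toy.sort_values("Year", kind="stable")
       .drop_duplicates(subset="Name", keep="first")
       .sort_index()
)

print(first_seen)
print(earliest)
```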


In [38]:
print(len(result))
#for i in range(len(result)):
    #gen.setdefault(result["Name"][i], {})[result["Genre"][i]] = ''
    #To eliminate the duplication of key-value pairs
    #plat.setdefault(result["Name"][i], {})[result["Platform"][i]] = ''  

144


In [39]:
#result.drop_duplicates(subset ="Name",keep = 'first', inplace = True)

In [40]:
critic_score = result['Critic_Score']
Id = result['Rank']
year = result['Year']
genre = result['Genre']  # 'gen' is never defined (its loop is commented out), so read genres from the merged frame
final = pd.DataFrame(columns = ['ID','Name', 'Genre', 'Platform',
                               'Critic_Score', 'RAWG.IO Ratings',
                                'Rank', 'Year' ])

## DATASOURCE 3 - Using API

#### What is an API:

API stands for Application Programming Interface. It lets developers integrate parts of one application, or separate applications, with each other. An API consists of elements such as functions, protocols, and tools that developers use to build applications. A common goal of all APIs is to speed up development by providing functionality out of the box, so developers do not have to implement it themselves.

We will use an API wrapper to get data from the video game database www.rawg.io. At the time of writing, the database was public and its API did not require an API key.
The wrapper is rawgpy, a Python class for www.rawg.io's API.

In [41]:
import rawgpy # API wrapper for getting data from rawg.io
import json
gname = result['Name']
platform1 = []
rawg_ratings = []
genre = []
rawg = rawgpy.RAWG("User-Agent, this should identify your app")
for name in gname:
    results = rawg.search(name)  # defaults to returning the top 5 results
    game = results[0]
    game.populate()
    rawg_ratings.append(game.rating)
print (len(rawg_ratings))

144
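
The loop above assumes every search returns at least one hit. A single empty result would stop it and leave `rawg_ratings` shorter than the list of names. The sketch below shows one way to guard against that while keeping the two lists aligned. It assumes the search result behaves like a list of objects with a `rating` attribute. `first_rating`, `FakeGame` and `fake_search` are hypothetical names used so the example runs offline, and they are not part of rawgpy.

```python
import math


def first_rating(search, name):
    """Return the rating of the first hit for `name`, or NaN if there is none.

    `search` is any callable that takes a name and returns a list of
    objects with a `rating` attribute (a stand-in for the API client).
    """
    hits = search(name)
    if not hits:
        return math.nan
    return hits[0].rating


# Hypothetical stand-in for the real client, so the sketch runs offline.
class FakeGame:
    def __init__(self, rating):
        self.rating = rating


def fake_search(name):
    catalog = {"Known Game": [FakeGame(4.2)]}
    return catalog.get(name, [])


names = ["Known Game", "Missing Game"]
ratings = [first_rating(fake_search, n) for n in names]
print(ratings)
```

Returning NaN for a miss keeps one rating per name, so the list can still be assigned as a dataframe column.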


In [42]:
result["RAWG Score"] = rawg_ratings
result

Unnamed: 0,Rank,Name,Platform,Publisher,Genre,Critic_Score,Year,RAWG Score
0,1,Wii Sports,Wii,Nintendo,Sports,7.7,2006.0,4.20
1,2,Super Mario Bros.,NES,Nintendo,Platform,10.0,1985.0,4.27
3,3,Mario Kart Wii,Wii,Nintendo,Racing,8.2,2008.0,4.16
4,5,Minecraft,PC,Mojang,Misc,10.0,2010.0,4.33
24,6,Wii Sports Resort,Wii,Nintendo,Sports,8.0,2009.0,4.19
...,...,...,...,...,...,...,...,...
308,193,Super Mario 3D World,WiiU,Nintendo,Platform,9.5,2013.0,4.33
309,195,Link's Crossbow Training,Wii,Nintendo,Shooter,6.9,2007.0,3.67
310,196,New Super Mario Bros. U,WiiU,Nintendo,Platform,8.4,2012.0,3.89
311,198,Destiny,PS4,Activision,Shooter,8.3,2014.0,3.78
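
The two score columns in this table do not share a scale: the values shown suggest `Critic_Score` runs from 0 to 10 and `RAWG Score` from 0 to 5. Under that assumption, which is inferred from the values above rather than confirmed from either source, the sketch below puts both on a 0 to 10 scale before comparing them. The data and the new column names are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the merged result; the values are invented.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Critic_Score": [8.0, 6.0],   # assumed 0-10 scale
    "RAWG Score": [4.0, 3.5],     # assumed 0-5 scale
})

# Rescale the RAWG rating to 0-10, then take the difference.
toy["RAWG Score (0-10)"] = toy["RAWG Score"] * 2
toy["Score Gap"] = toy["Critic_Score"] - toy["RAWG Score (0-10)"]
print(toy)
```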
