# Data Collection

Most of my video game data was collected from Metacritic. I used the RAWG API to fill in missing values, such as genres and summary descriptions, and to add extra information, such as the URLs of the online stores where each video game can be purchased.

## Imports

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import regex as re
import time

# Functions

## Names and Scores from Metacritic

This scraping function gets the name, slug, metascore, user score, and release date for every video game on the page and returns them as a list of tuples, one tuple per game. The parameters are your user agent and the link to the Metacritic page you want to scrape. (The slug is the part of the URL to the game's web page that includes the name of the game.)

I used this function to get all the PC games that had a metascore from 2010 until 2020.
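
To make the slug concrete, here is a minimal sketch of the same `split('pc/', 1)` step that `get_games` applies to each game link. The helper name `slug_from_href` and the example links are hypothetical and only assume that a link contains `pc/` followed by the slug.

```python
def slug_from_href(href):
    """Return everything after the first 'pc/' in a game link."""
    return href.split('pc/', 1)[1]


# Hypothetical link values, used only for illustration
relative_link = '/game/pc/example-game'
absolute_link = 'https://www.metacritic.com/game/pc/another-example'

print(slug_from_href(relative_link))
print(slug_from_href(absolute_link))
```

Because the split is limited to one occurrence, any later `pc/` inside the slug itself would be left untouched.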

In [2]:
# Function for scraping the game names and basic info

def get_games(user_agent, url):
    game_list = []
    
    # Get request
    # User-agent required or else will return a 403 status code
    res = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(res.content, 'lxml')
    
    # Get name, slug, scores, and date released for first game
    # First game has a different class attribute than the rest 
    game = soup.find('li', {'class': 'product game_product first_product'})
    
    name = game.find('div', {'class': 'basic_stat product_title'}).text.strip()
    slug = game.find('a')['href'].split('pc/', 1)[1]
    meta = game.find('div', {'class': 'metascore_w small game positive'}).text
    user = game.find('span', {'class': 'data'}).text
    released = game.find_all('span', {'class': 'data'})[-1].text
    
    # Append it to list of games as a tuple
    game_list.append((name, slug, meta, user, released))
    
    # Loop through the rest of the games on the page and append each to the list
    for game in soup.find_all('li', {'class': 'product game_product'}):
        
        name = game.find('div', {'class': 'basic_stat product_title'}).text.strip()
        slug = game.find('a')['href'].split('pc/', 1)[1]
        meta = game.find('div', {'class': 'metascore_w small game positive'}).text
        user = game.find('span', {'class': 'data'}).text
        released = game.find_all('span', {'class': 'data'})[-1].text
        
        game_list.append((name, slug, meta, user, released))
    
    # Return game info as a list of tuples
    return game_list


## Summary and Genres

To get the summary and genres for each game, I again used a function to access the Metacritic page for that game. The parameters for this function are the user agent and a list of slugs for every game.

I scraped my data in batches in order to avoid 500 and 503 error codes, but if I did happen to get any sort of error code, the function would break out of the loop and return everything up until that error.

In [3]:
# Function for scraping the summary and genre of each game from Metacritic

def get_summary_genre(user_agent, list_of_slugs):
    game_details = []

    # Loop through each game
    for slug in list_of_slugs:
        # Get request
        url = f'https://www.metacritic.com/game/pc/{slug}'
        res = requests.get(url, headers={'User-Agent': user_agent})
        
        # If page doesn't exist, return missing values for summary and genre 
        # and break out of loop
        if res.status_code == 404:
            print(f'Error. Status code: {res.status_code} for {slug}.')
            summary = genres = np.nan
            game_details.append((slug, summary, genres))
            break
            
        # If 5xx error, break out of loop without saving that particular game
        elif res.status_code >= 500:
            print(f'Error. Status code: {res.status_code}. {slug} not scraped.')
            break
        
        elif res.status_code == 200:
            print(f'Status: {res.status_code}. Scraping {slug}...')
            soup = BeautifulSoup(res.content, 'lxml')

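            # Summary
            summ = soup.find('li', {'class': 'summary_detail product_summary'})

            if summ is None:
                summary = np.nan
            else:
                # If the expanded blurb span is absent, find() returns None and
                # .get_text() raises AttributeError, so fall back to the full text
                try:
                    summary = summ.find('span', {'class': 'blurb blurb_expanded'}).get_text()
                except AttributeError:
                    summary = summ.get_text()

            # Genre (missing value if the page has no genre section)
            genre = soup.find('li', {'class': 'summary_detail product_genre'})
            if genre is None:
                genres = np.nan
            else:
                genres = {g.text for g in genre.find_all('span', {'class': 'data'})}

        # Any other status code (e.g. 403): stop without saving this game
        else:
            print(f'Error. Status code: {res.status_code}. {slug} not scraped.')
            break

        game_details.append((slug, summary, genres))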

        # Lag time between requests
        time.sleep(3)
    
    return game_details
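
Since `get_summary_genre` returns everything it scraped before an error, a batch that stopped early can be resumed from the first slug that is missing from the results. The sketch below shows one way to split the slugs into batches and work out what is left. The helper names `chunked` and `remaining_slugs` are hypothetical, and the sample rows are made up.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def remaining_slugs(all_slugs, scraped):
    """Return the slugs that do not yet appear in `scraped`.

    `scraped` is a list of tuples whose first element is the slug,
    like the list returned by get_summary_genre.
    """
    done = {row[0] for row in scraped}
    return [slug for slug in all_slugs if slug not in done]


# Made-up slugs and a partial result, for illustration only
slugs = ['game-a', 'game-b', 'game-c', 'game-d', 'game-e']
partial = [('game-a', 'A summary', {'Action'}), ('game-b', 'B summary', {'Puzzle'})]

print(list(chunked(slugs, 2)))
print(remaining_slugs(slugs, partial))
```

Comparing by slug rather than by position keeps the helper correct whether or not the failing game was appended before the loop stopped.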


## Ratings

This function grabs the ESRB rating from the Metacritic page of each game. The parameters for this function are the user agent and a list of slugs. Not all PC games have a rating because ESRB ratings are voluntary: console manufacturers require a rating for games published on their systems, but PC releases can go unrated.

Source: https://www.esrb.org/faqs/#are-all-games-required-to-have-a-rating

In [4]:
# Function for scraping the ESRB rating of each game from Metacritic

def get_rating(user_agent, list_of_slugs):
    rating_list = []
    
    for slug in list_of_slugs:
        url = f'https://www.metacritic.com/game/pc/{slug}'
        res = requests.get(url, headers={'User-Agent': user_agent})  
        
        # Return missing value if page doesn't exist
        if res.status_code == 404:
            print(f'{res.status_code} Error. No rating scraped.')
            rating = np.nan
        
        # If 5xx status code, send another get request after a 3 sec lag
        elif res.status_code >= 500:
            print(f'{res.status_code} Error. Trying {slug} again.')
            time.sleep(3)
            
            res = requests.get(url, headers={'User-Agent': user_agent})
            soup = BeautifulSoup(res.content, 'lxml')
            
            rate = soup.find('li', {'class': 'summary_detail product_rating'})
            
            # Return NaN if there is no rating
            if rate is None:
                rating = np.nan
            else:
                rating = rate.text.strip().replace('Rating:\n', '')

        elif res.status_code == 200:
            print(f'Status: {res.status_code}. Scraping {slug}...')
            
            soup = BeautifulSoup(res.content, 'lxml')
            rate = soup.find('li', {'class': 'summary_detail product_rating'})
            
            if rate is None:
                rating = np.nan
            else:
                rating = rate.text.strip().replace('Rating:\n', '')

        # Any other status code (e.g. 403): record a missing value
        else:
            print(f'{res.status_code} Error. No rating scraped.')
            rating = np.nan
        
        # Append rating to list
        rating_list.append((slug, rating))
        time.sleep(3)
        
    return rating_list
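
The retry-after-a-lag step in `get_rating` could also be factored into a small reusable helper. This is only a sketch of that idea, not the code used for the project: `get_with_retry`, `FakeResponse`, and `make_flaky_fetch` are hypothetical names, and the request is passed in as a callable so the example runs without touching the network.

```python
import time


def get_with_retry(fetch, url, retries=1, delay=3):
    """Call fetch(url); on a 5xx status, wait `delay` seconds and try again.

    `fetch` is any callable that returns an object with a `status_code`
    attribute, for example a wrapper around requests.get.
    """
    res = fetch(url)
    attempts = 0
    while res.status_code >= 500 and attempts < retries:
        time.sleep(delay)
        res = fetch(url)
        attempts += 1
    return res


class FakeResponse:
    """Stand-in for a response object, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code


def make_flaky_fetch(statuses):
    """Return a fetch function that yields the given status codes in order."""
    queue = list(statuses)
    return lambda url: FakeResponse(queue.pop(0))


# First call fails with 503, the retry succeeds
res = get_with_retry(make_flaky_fetch([503, 200]), 'https://example.com', delay=0)
print(res.status_code)
```

With real requests, `fetch` could be something like `lambda u: requests.get(u, headers={'User-Agent': user_agent})`.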

## RAWG API

I retrieved the data on every PC game in the RAWG database with these functions. These games were also collected in batches to avoid any errors.

After that, I matched the API IDs to my list of Metacritic games in order to get the descriptions for each game from the API.  
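
The matching step itself is not shown in this notebook. As a rough sketch of one way it could work, the example below assumes each game dictionary from the API carries an `id` and a `slug`, and that the RAWG slug equals the Metacritic slug. Both are assumptions for illustration, and real data would likely need fuzzier matching on names. The function name `match_api_ids` and the sample records are hypothetical.

```python
import math


def match_api_ids(metacritic_slugs, api_games):
    """Map each Metacritic slug to an API ID, or NaN when there is no match."""
    ids_by_slug = {game['slug']: game['id'] for game in api_games}
    return {slug: ids_by_slug.get(slug, math.nan) for slug in metacritic_slugs}


# Made-up API records (assumed to have 'id' and 'slug' keys)
api_games = [
    {'id': 101, 'slug': 'example-game'},
    {'id': 102, 'slug': 'other-game'},
]

matched = match_api_ids(['example-game', 'unlisted-game'], api_games)
print(matched)
```

Unmatched games get NaN here so that a later check such as `api_id > 0` can skip them.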

### PC Games

In [5]:
# Filter results for PC games only 
def filter_pc(results):
    pc = []
    
    for game in results:
        for platform in game['platforms']:
            # Platform ID 4 is treated as PC
            if platform['platform']['id'] == 4:
                pc.append(game)
                break
    return pc
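
To show the nested structure that `filter_pc` walks through, here is a compact equivalent written with `any()` and run on made-up records. The sample games and the non-PC ID 18 are placeholders, and `is_pc_game` is a hypothetical helper name.

```python
PC_PLATFORM_ID = 4  # the ID that filter_pc checks for


def is_pc_game(game):
    """True if any of the game's platforms has the PC platform ID."""
    return any(p['platform']['id'] == PC_PLATFORM_ID for p in game['platforms'])


# Made-up results; 18 is an arbitrary non-PC ID used for illustration
sample = [
    {'name': 'Game A', 'platforms': [{'platform': {'id': 4}}, {'platform': {'id': 18}}]},
    {'name': 'Game B', 'platforms': [{'platform': {'id': 18}}]},
]

pc_only = [game for game in sample if is_pc_game(game)]
print([game['name'] for game in pc_only])  # ['Game A']
```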

In [6]:
# Function for grabbing all the PC games from the RAWG API

def get_api_games(key, start=1, end=500):
    data = []
    
    # Loop through pages start to end - 1 (max page = 9942)
    for i in range(start, end):
        url = f'https://rawg-video-games-database.p.rapidapi.com/games?page={i}'

        headers = {
            'x-rapidapi-host': "rawg-video-games-database.p.rapidapi.com",
            'x-rapidapi-key': key
        }
        
        # Max results per page
        params = {'page_size': 40}
        
        res = requests.request('GET', url, headers=headers, params=params)
        games = res.json()['results']
        
        # Filter results for PC games only
        pc = filter_pc(games)
        data += pc
        
    return data
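
Because `get_api_games` loops over `range(start, end)`, the `end` page is excluded, so the defaults cover pages 1 to 499. A minimal sketch of generating the `(start, end)` pairs for batched collection might look like this. The helper name `page_batches` and the batch size of 500 are assumptions, and the last page is taken from the comment in the function.

```python
def page_batches(first_page, last_page, batch_size):
    """Yield (start, end) pairs covering first_page..last_page inclusive.

    `end` is exclusive, matching range(start, end) in get_api_games.
    """
    for start in range(first_page, last_page + 1, batch_size):
        yield start, min(start + batch_size, last_page + 1)


batches = list(page_batches(1, 9942, 500))
print(batches[0])    # (1, 501)
print(batches[-1])   # (9501, 9943)
print(len(batches))  # 20
```

Each pair could then be passed along as `get_api_games(key, start, end)`, with the results saved between batches.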

### Descriptions

In [7]:
# Function for getting the game descriptions from RAWG API

def get_description(key, api_id): 
    # Accounts for missing IDs
    if api_id > 0:
        url = f"https://rawg-video-games-database.p.rapidapi.com/games/{int(api_id)}"
        headers = {
            'x-rapidapi-host': "rawg-video-games-database.p.rapidapi.com",
            'x-rapidapi-key': key
        }
        
        res = requests.request('GET', url, headers=headers)
        
        # Gets rid of html tags (parser named explicitly so bs4 does not
        # warn about guessing one)
        description = BeautifulSoup(res.json()['description'], 'lxml')
        return description.get_text()
    
    else:
        return np.nan
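
The `api_id > 0` check works for missing IDs because NaN compares as `False` against any number. The short sketch below demonstrates that behaviour with the standard library. The helper name `has_api_id` and the ID value are illustrative only. It assumes missing IDs are stored as NaN, which would also explain the `int(api_id)` conversion, since a pandas column containing NaN holds floats by default.

```python
import math


def has_api_id(api_id):
    """True only for real, positive IDs; NaN > 0 evaluates to False."""
    return api_id > 0


print(has_api_id(3498.0))    # True
print(has_api_id(math.nan))  # False
```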

# Stores and Images

After cleaning and combining all the data I collected using the functions above into one dataframe, I went back and retrieved the URLs to the stores from RAWG and scraped the images from Metacritic.

### Online Stores

In [8]:
# Function for getting the URLs to the online stores from RAWG API

def get_store(key, api_id):
    url = f"https://rawg-video-games-database.p.rapidapi.com/games/{int(api_id)}"
    headers = {
        'x-rapidapi-host': "rawg-video-games-database.p.rapidapi.com",
        'x-rapidapi-key': key
    }

    res = requests.request('GET', url, headers=headers)
    stores = res.json()['stores']
    
    # Grab the link to Steam or Epic Games, two popular distribution services,
    # if available. If not, grab the first link in the list
    if stores:
        for store in stores:
            name = store['store']['name']
            if 'Steam' in name or 'Epic' in name:
                return store['url']

        return stores[0]['url']
    
    else:
        return np.nan
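
The store preference can be tried out without calling the API by applying the same rule to a hand-written list. This is a sketch with made-up store entries. `pick_store_url` is a hypothetical name, and it returns `None` rather than `np.nan` for an empty list so the example needs no third-party imports.

```python
def pick_store_url(stores):
    """Prefer a Steam or Epic link, else the first link, else None."""
    for store in stores:
        name = store['store']['name']
        if 'Steam' in name or 'Epic' in name:
            return store['url']
    return stores[0]['url'] if stores else None


# Made-up entries shaped like the 'stores' list that get_store reads
stores = [
    {'store': {'name': 'Some Store'}, 'url': 'https://example.com/some-store'},
    {'store': {'name': 'Steam'}, 'url': 'https://example.com/steam'},
]

print(pick_store_url(stores))      # https://example.com/steam
print(pick_store_url(stores[:1]))  # https://example.com/some-store
print(pick_store_url([]))          # None
```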

### Image Links

In [9]:
# Function for scraping the image links from Metacritic
def get_image(user_agent, slug):
    url = f'https://www.metacritic.com/game/pc/{slug}'
    res = requests.get(url, headers={'User-Agent': user_agent}) 
    
    time.sleep(3)
    
    if res.status_code == 200:
        soup = BeautifulSoup(res.content, 'lxml')
        image = soup.find('img', {'class': 'product_image large_image'})

        # Return NaN if the page has no matching image tag
        return image['src'] if image is not None else np.nan
    else:
        return np.nan