# Data Wrangling
***
Video game reviews and related features were scraped from www.metacritic.com for role-playing, shooter and sports games on three gaming consoles (Xbox One, PS4, Nintendo Switch). For every game with at least 15 user reviews, 15 of the most recent user reviews were scraped along with each review's individual user score and sentiment; games with fewer user reviews were skipped. To keep the focus on the sentiment of the common user, critic reviews were not collected individually: only the average critic score (metascore) and its overall sentiment were recorded.

__Note:__ Missing values were substituted with `np.nan`.

### Feature definition of scraped data:
***

1. __title:__ Title of the game <br>

2. __platform:__ The console the reviewed game was released on <br>

3. __metascore:__ The average score given to the game by various game critics (float range of 1-100) <br>

4. __metasentiment:__ The overall critic sentiment classification based on critic ratings/metascore (positive, mixed, negative) <br>

5. __average_userscore:__ The average score given to the game by users (float range of 1-10) <br>

6. __average_usersentiment:__ The overall user sentiment classification based on average user score (positive, mixed, negative) <br>

7. __developer:__ Developer of game <br>

8. __genre:__ Genre of game <br>

9. __number_of_players:__ Number of players that can play the game <br>

10. __esrb_rating:__ Entertainment Software Rating Board (ESRB) rating <br>

11. __release_date:__ Release date of game <br>

12. __username:__ The Metacritic username of the game reviewer <br>

13. __userscore:__ Individual user rating (integer range of 1-10) <br>

14. __usersentiment:__ Individual user sentiment classification based on their user score (positive, mixed, negative) <br>

15. __review:__ Text review left by user

16. __review_date:__ Date review was left by user
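
Each of the three sentiment features is derived from a score. The sketch below shows one way such a mapping could be written. The function name `classify_sentiment` and the cut-offs (75 and 50 on a 100-point scale) are assumptions made for illustration; they are not stated anywhere in the scraped data, although they agree with the sample rows in this notebook (a user score of 7 is labelled mixed, 7.8 positive).

```python
def classify_sentiment(score, scale=100, positive_cutoff=75, mixed_cutoff=50):
    """Map a score to 'positive', 'mixed' or 'negative'.

    The cut-offs are ASSUMED, not taken from the source. They are expressed
    on a 100-point scale; scores on another scale (e.g. scale=10 for user
    scores) are rescaled before comparison.
    """
    normalized = float(score) * 100 / scale
    if normalized >= positive_cutoff:
        return 'positive'
    if normalized >= mixed_cutoff:
        return 'mixed'
    return 'negative'


print(classify_sentiment(97))             # metascore on the 100-point scale
print(classify_sentiment(7.8, scale=10))  # average user score
print(classify_sentiment(7, scale=10))    # individual user score
```

Passing the cut-offs as parameters keeps the assumption visible and easy to change once the real thresholds have been confirmed.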

In [1]:
# Import requests to fetch the web pages and BeautifulSoup to parse their HTML
import requests
from bs4 import BeautifulSoup

In [2]:
# Listing-page URLs for each console; the page number is appended when the request is made
urls = ['https://www.metacritic.com/browse/games/release-date/available/xboxone/metascore?page=',
       'https://www.metacritic.com/browse/games/release-date/available/ps4/metascore?page=',
       'https://www.metacritic.com/browse/games/release-date/available/switch/metascore?page=']

# Send a browser-like User-Agent header; without it the site responds with a 403 status code
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"}

In [3]:
from time import sleep

def get_request(url, headers):
    '''Request a webpage, retrying every 5 seconds until a response is received.
    Used to overcome failed attempts made in quick succession.'''
    while True:
        try:
            return requests.get(url, headers=headers)
        except requests.exceptions.RequestException:
            # Catch only request errors so the loop can still be interrupted manually
            print("Connection refused by the server..")
            print("URL:", url)
            print("Awaiting 5 seconds")
            print("...")
            sleep(5)
            print("Retrying...")
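
The function above retries indefinitely at a fixed 5-second interval. A bounded variant with exponential backoff is sketched below as one possible alternative, not as the method used in this notebook. The names `fetch_with_backoff`, `max_attempts`, `base_delay`, `retry_on` and `wait` are hypothetical, and the fetch function is passed in as an argument so the sketch runs without a network connection.

```python
from time import sleep


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=5,
                       retry_on=(OSError,), wait=sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    Returns None if every attempt fails. OSError is the default because the
    exceptions raised by requests derive from it.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts - 1:
                break  # no attempts left, give up
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed for {url}; retrying in {delay}s")
            wait(delay)
    return None


# Demonstration with a stub that fails twice before succeeding.
# The delays are recorded instead of slept so the demo finishes instantly.
attempts = []
delays = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("refused")
    return "page content"

result = fetch_with_backoff(flaky_fetch, "https://example.com", wait=delays.append)
```

In the notebook, the same helper could wrap the real request, for example `fetch_with_backoff(lambda u: requests.get(u, headers=headers), url)`. The caller then has to handle a `None` result for pages that never load.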

In [4]:
# Scrape all titles and the platform they were released on
title = []
platform = []

# Number of listing pages to scrape for each console
pages_per_console = {'xboxone': 18, 'ps4': 25, 'switch': 18}

# Loop through all pages for rated Xbox One, PS4 and Switch game titles
for url in urls:
    # The console identifier is the second-to-last segment of the listing URL
    console_key = url.split('/')[-2]

    for page in range(pages_per_console[console_key]):
        request = get_request(url + str(page), headers)
        soup = BeautifulSoup(request.content, 'html.parser')

        titles = [t.string for t in soup.findAll('a', class_=['title'])]
        pltfrm = [p.find('span', class_='data').string.strip() for p in soup.findAll('div', class_=['clamp-details'])]

        title.extend(titles)
        platform.extend(pltfrm)


In [5]:
# Combine platform and title into a subset of the url that will be used to collect user reviews
url_title = []
for i in range(len(title)):
    t_for_url = "-".join(title[i].lower().replace(":","").replace("'","").split())
    p_for_url = "-".join(platform[i].lower().split())
    url_title.append(p_for_url + '/' + t_for_url)
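
To make the transformation concrete, the sketch below wraps the same string operations in a function and applies them to a few titles. The function name `build_url_path` and the title spellings are illustrative assumptions. Only colons and apostrophes are removed, so any other punctuation passes through unchanged, which is why paths such as `overcooked!-2` and `warhammer-end-times---vermintide` appear in the retry log further down.

```python
def build_url_path(title, platform):
    """Mirror the notebook's logic: lowercase the text, drop colons and
    apostrophes from the title, and join the words with hyphens."""
    t_for_url = "-".join(title.lower().replace(":", "").replace("'", "").split())
    p_for_url = "-".join(platform.lower().split())
    return p_for_url + '/' + t_for_url


print(build_url_path("Red Dead Redemption 2", "Xbox One"))
# xbox-one/red-dead-redemption-2
print(build_url_path("Overcooked! 2", "Xbox One"))
# xbox-one/overcooked!-2
print(build_url_path("Warhammer: End Times - Vermintide", "PlayStation 4"))
# playstation-4/warhammer-end-times---vermintide
```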

In [6]:
# Import numpy to create null values and regex to extract data
import numpy as np
import re

# Initialize a url base
url_base = 'https://www.metacritic.com/game/'
# Initialize a regex pattern to extract metacritic sentiment classes
sentiment_pattern = r'\b(positive|mixed|negative)\b'
# Initialize the number of reviews to scrape
num_of_reviews = 15

# Create empty lists for features to be scraped
game_title = []
console = []
metascore = []
metasentiment = []
avg_userscore = []
avg_usersentiment = []
developer = []              
genre = []                  
players = []
esrb_rating = []
release_date = []
username = []
userscore = []
usersentiment = []
review = []
review_date = []

# Loop over all titles and extract aforementioned features
for i in range(len(url_title)):
    # Create a request to gather general features before actual review page
    request = get_request(url_base + url_title[i], headers)
    soup = BeautifulSoup(request.content, 'html.parser')
    
    ttl = title[i]
    pltfrm = platform[i]
    
    try:
        mscore = soup.find('a', class_=['metascore_anchor'], href=["/game/" + url_title[i] + "/critic-reviews"]).find('span').get_text()
    except AttributeError:
        mscore = np.nan
        
    try:
        msentiment = soup.find('a', class_=['metascore_anchor'], href=["/game/" + url_title[i] + "/critic-reviews"]).find('div')['class'][-1]
    except (AttributeError, TypeError):
        # TypeError covers an anchor that is present but has no inner <div>
        msentiment = np.nan
        
    try:
        avg_uscore = soup.find('a', class_=['metascore_anchor'], href=["/game/" + url_title[i] + "/user-reviews"]).find('div').get_text()
    except AttributeError:
        avg_uscore = np.nan
        
    try:
        avg_usentiment = soup.find('a', class_=['metascore_anchor'], href=["/game/" + url_title[i] + "/user-reviews"]).find('div')['class'][-1]
    except (AttributeError, TypeError):
        # TypeError covers an anchor that is present but has no inner <div>
        avg_usentiment = np.nan
    
    try:
        dvlpr = soup.find('li', class_=['summary_detail developer']).find('span', class_=['data']).get_text(strip=True)
    except AttributeError:
        dvlpr = np.nan
    
    try:
        # Use a short loop variable so it is not confused with the `genre` list defined above
        gnrs = [g.get_text(strip=True) for g in soup.find('li', class_=['summary_detail product_genre']).findAll('span', class_=['data'])]
        
        RPG_kw = r"\b(RPG|Role-Playing|Action RPG)\b"
        shooter_kw = r"\bShooter\b"
        sports_kw = r"\b(Sports|Racing)\b"
        action_adventure_kw = r"\b(Action Adventure|Open-World)\b"

        if re.search(RPG_kw, " ".join(gnrs)):
            gnr = 'RPG'
        elif re.search(shooter_kw, " ".join(gnrs)):
            gnr = 'Shooter'
        elif re.search(sports_kw, " ".join(gnrs)):
            gnr = 'Sports'
        elif re.search(action_adventure_kw, " ".join(gnrs)):
            gnr = 'Action adventure'
        else:
            gnr = 'Other'
            
    except AttributeError:
        gnr = np.nan
        
    try:
        num_of_players = soup.find('li', class_=['summary_detail product_players']).find('span', class_=['data']).get_text(strip=True)
    except AttributeError:
        num_of_players = np.nan
    
    try:
        esrb = soup.find('li', class_=['summary_detail product_rating']).find('span', class_=['data']).get_text()
    except AttributeError:
        esrb = np.nan
        
    try:
        rel_date = soup.find('li', class_=['summary_detail release_data']).find('span', class_=['data']).get_text()
    except AttributeError:
        rel_date = np.nan
    

    # Create a new request to extract user reviews and other pertinent features
    request = get_request(url_base + url_title[i] + '/user-reviews', headers)
    soup = BeautifulSoup(request.content, 'html.parser')
    
    reviews = soup.findAll('li', class_=['user_review'])

    if len(reviews) >= num_of_reviews:
        for rev in reviews[:num_of_reviews]:
            uname = rev.find('a').get_text()
            uscore = rev.find('div', class_=['review_grade']).div.get_text()
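            match = re.search(sentiment_pattern, ' '.join(rev.find('div', class_=['review_grade']).div['class']))
            # Fall back to a missing value if no sentiment class is found
            usentiment = match.group() if match else np.nan
            try:
                # Prefer the expanded blurb when it is present
                rvw = rev.find('span', class_=['blurb blurb_expanded']).get_text(" ", strip=True)
            except AttributeError:
                # Otherwise read the text from the review body
                rvw = rev.find('div', class_=['review_body']).get_text(" ", strip=True)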
            rev_date = rev.find('div', class_=['date']).get_text()
            
            game_title.append(ttl)
            console.append(pltfrm)
            metascore.append(mscore)
            metasentiment.append(msentiment)
            avg_userscore.append(avg_uscore)
            avg_usersentiment.append(avg_usentiment)
            developer.append(dvlpr)
            genre.append(gnr)
            players.append(num_of_players)
            esrb_rating.append(esrb)
            release_date.append(rel_date)
            username.append(uname)
            userscore.append(uscore)
            usersentiment.append(usentiment)
            review.append(rvw)
            review_date.append(rev_date)

Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/ori-and-the-blind-forest/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/kingdom-new-lands
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/kingdom-new-lands
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/f1-2016
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/overcooked!-2
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/bayonetta-&-vanquish
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/xbox-one/spyro-reignited-trilogy/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..

Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/just-dance-2020/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/operencia-the-stolen-sun
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/saints-row-iv-re-elected
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/draugen/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/ancestors-legacy
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/neversong/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/playstation-4/warhammer-end-times---vermintide/use

Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/pikuniku
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/cattails
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/miles-&-kilo/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/opus-the-day-we-found-earth/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/project-warlock
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/monster-prom-xxl/user-reviews
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
URL: https://www.metacritic.com/game/switch/opus-rocket-of-whispers
Awaiting 5 seconds
...
Retrying...
Connection refused by the server..
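
Two parts of the scraping loop lend themselves to small helpers: the repeated `try`/`except` blocks and the genre rules. The sketch below is one possible refactoring, not the code that produced the data. The names `safe_extract`, `GENRE_RULES` and `classify_genre` are hypothetical, and `math.nan` stands in for `np.nan` so the example needs only the standard library.

```python
import math
import re


def safe_extract(extractor, default=math.nan):
    """Run extractor() and return default if the element is missing.

    AttributeError is raised when a lookup returns None and a method is
    called on it; TypeError when None is indexed, e.g. None['class'].
    """
    try:
        return extractor()
    except (AttributeError, TypeError):
        return default


# Rules are checked in order; the first matching pattern decides the label.
# The patterns are the ones used in the scraping loop.
GENRE_RULES = [
    ('RPG', r"\b(RPG|Role-Playing|Action RPG)\b"),
    ('Shooter', r"\bShooter\b"),
    ('Sports', r"\b(Sports|Racing)\b"),
    ('Action adventure', r"\b(Action Adventure|Open-World)\b"),
]


def classify_genre(genre_tags):
    """Collapse a list of genre tags into one of five broad labels."""
    joined = " ".join(genre_tags)
    for label, pattern in GENRE_RULES:
        if re.search(pattern, joined):
            return label
    return 'Other'


missing = None  # stands in for a soup.find(...) call that found nothing
print(safe_extract(lambda: missing.get_text()))   # nan
print(safe_extract(lambda: " 97 ".strip()))       # 97
print(classify_genre(['Action', 'Shooter', 'First-Person']))  # Shooter
print(classify_genre(['Role-Playing', 'Shooter']))            # RPG
print(classify_genre(['Puzzle']))                             # Other
```

Because the tags are joined with spaces before matching, two separate tags `Action` and `Adventure` also match the `Action Adventure` pattern. The order of the rules matters as well: a game tagged as both role-playing and shooter is labelled `RPG`.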


In [7]:
# Import pandas to manage extracted data
import pandas as pd

# Create a dictionary to transform into dataframe
scraped = {'title': game_title,
           'platform': console,
           'metascore': metascore,
           'metasentiment': metasentiment,
           'average_userscore': avg_userscore,
           'average_usersentiment': avg_usersentiment,
           'developer': developer,
           'genre': genre,
           'number_of_players': players,
           'esrb_rating': esrb_rating,
           'release_date':release_date,
           'username': username,
           'userscore': userscore,
           'usersentiment': usersentiment,
           'review': review,
           'review_date': review_date
          }
df = pd.DataFrame(scraped)

# View scraped data
pd.set_option('display.max_columns', None)
df.head()

| | title | platform | metascore | metasentiment | average_userscore | average_usersentiment | developer | genre | number_of_players | esrb_rating | release_date | username | userscore | usersentiment | review | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Red Dead Redemption 2 | Xbox One | 97 | positive | 7.8 | positive | Rockstar Games | Action adventure | Up to 32 | M | Oct 26, 2018 | gnadenlos | 7 | mixed | The main problem is, that it's not a real open... | Nov 1, 2018 |
| 1 | Red Dead Redemption 2 | Xbox One | 97 | positive | 7.8 | positive | Rockstar Games | Action adventure | Up to 32 | M | Oct 26, 2018 | Feriatus | 7 | mixed | It's not a bad game but the gameplay is an out... | Oct 29, 2018 |
| 2 | Red Dead Redemption 2 | Xbox One | 97 | positive | 7.8 | positive | Rockstar Games | Action adventure | Up to 32 | M | Oct 26, 2018 | ponux | 7 | mixed | Visually superb (except cutscenes), good (not ... | Nov 5, 2018 |
| 3 | Red Dead Redemption 2 | Xbox One | 97 | positive | 7.8 | positive | Rockstar Games | Action adventure | Up to 32 | M | Oct 26, 2018 | Picklock | 5 | mixed | Great looking game backed up by clumsy, overly... | Nov 4, 2018 |
| 4 | Red Dead Redemption 2 | Xbox One | 97 | positive | 7.8 | positive | Rockstar Games | Action adventure | Up to 32 | M | Oct 26, 2018 | Saints | 6 | mixed | Red Dead Redemption 2 is an amazing game that ... | Oct 30, 2018 |
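
Every value in this preview was scraped as text, including the scores and dates. The sketch below shows how those strings could be converted in a later cleaning step; it is not part of this notebook's pipeline. The function names `parse_review_date` and `parse_score` are hypothetical, and the date format is inferred from the sample rows above.

```python
import math
from datetime import datetime


def parse_review_date(text):
    """Parse a date such as 'Oct 26, 2018' into a datetime.date.

    Assumes English month abbreviations, since %b depends on the locale.
    """
    return datetime.strptime(text.strip(), "%b %d, %Y").date()


def parse_score(value):
    """Convert a scraped score string to a float, keeping NaN for missing values."""
    if isinstance(value, float) and math.isnan(value):
        return value
    return float(value)


print(parse_review_date("Oct 26, 2018"))  # 2018-10-26
print(parse_review_date("Nov 1, 2018"))   # 2018-11-01
print(parse_score("7.8"))                 # 7.8
```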


In [9]:
# Export data to csv file (the row index is written as an unnamed first column;
# pass index=False to to_csv to omit it)
df.to_csv('MetacriticGameReviews.csv')