# Research Question: Can We Predict a Game's Metacritic Score?

### by: Tzach Fleischer and Moshe Dego

![title](video_game_photo.png)

The video game industry is a multibillion-dollar industry and part of everyday life for many people.
Metacritic is an aggregator site: it collects reviews from many video game outlets and gives each game a single aggregated score.

We chose this project out of our mutual love of video games; both of us have been gamers all our lives.


### Our dataset is made of the following columns:
- Game Name: The title of the video game.
- Developer: The company that developed the game.
- Developer Score: The aggregated Metacritic score across the developer's career.
- Publisher: The company that published the game and financed its development.
- Publisher Score: The aggregated Metacritic score across the publisher's career.
- Rating: The age rating the game received from the ESRB (Entertainment Software Rating Board).
- User Score: Metacritic users can write their own review of the game and the site will aggregate it.
- Genre: Which gaming genre does the game belong to.
- Number of Critic Reviews: How many reviews were counted for the game score.
- Number of Positive Reviews: Out of all the reviews, how many were positive.
- Number of Mixed Reviews: Out of all the reviews, how many were mixed.
- Number of Negative Reviews: Out of all the reviews, how many were negative.
- Release Month: The month in which the game was released.
- Release Year: The year in which the game was released.
- Platform: The platform on which the game was reviewed.
- Description: The description of the game from the site itself.
- **Score: Our target column, the aggregated score of the game.**
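
The three per-category review counts should add up to the total number of critic reviews, which gives a cheap sanity check on scraped rows. The sketch below assumes the column names `num_crit_review`, `num_pos_critic`, `num_mix_critic` and `num_neg_critic` used in the scraper's DataFrame; `review_counts_consistent` is a hypothetical helper, and the two example rows are copied from the raw dataset preview.

```python
def review_counts_consistent(row):
    """Return True when positive + mixed + negative equals the total count."""
    parts = row["num_pos_critic"] + row["num_mix_critic"] + row["num_neg_critic"]
    return row["num_crit_review"] == parts


# Two rows taken from the raw dataset preview (only the relevant columns).
rows = [
    {"game_name": "The Legend of Zelda: Ocarina of Time",
     "num_crit_review": 22, "num_pos_critic": 22,
     "num_mix_critic": 0, "num_neg_critic": 0},
    {"game_name": "Vroom in the Night Sky",
     "num_crit_review": 15, "num_pos_critic": 0,
     "num_mix_critic": 0, "num_neg_critic": 15},
]

# Collect the names of any rows whose counts do not add up.
inconsistent = [r["game_name"] for r in rows if not review_counts_consistent(r)]
print(inconsistent)
```

Any game name that shows up in `inconsistent` would point at a page where the scraper read the counts incorrectly.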

## Data Collection:

Metacritic keeps a list of all video games sorted by [score](https://www.metacritic.com/browse/games/score/metascore/all/all/filtered).
Each page lists 100 games, and there are 197 pages.
Our scraper runs over each list page and opens each game's page to collect the required information. It then opens the publisher's page and, if the game lists a separate developer, the developer's page (for some games the publisher is also the developer).
After each page of 100 games, it appends the collected rows to a CSV file.
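
The page-by-page saving step can be sketched in isolation. This is a minimal illustration using only the standard library rather than the pandas call in the scraper; `list_page_url` and `append_rows` are hypothetical helper names, and the demo rows are made up. It writes the CSV header only when the file is new, so every later page is appended as plain rows.

```python
import csv
import os
import tempfile

BASE_URL = "https://www.metacritic.com"
LIST_PATH = "/browse/games/score/metascore/all/all/filtered?page="


def list_page_url(page_index):
    """Build the URL of one list page (pages are numbered from 0)."""
    return BASE_URL + LIST_PATH + str(page_index)


def append_rows(csv_path, rows, fieldnames):
    """Append rows to csv_path, writing the header only when the file is new."""
    is_new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new_file:
            writer.writeheader()
        writer.writerows(rows)


# Demo with made-up rows standing in for two scraped pages.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_scores.csv")
fields = ["game_name", "score"]
append_rows(demo_path, [{"game_name": "Game A", "score": 90}], fields)
append_rows(demo_path, [{"game_name": "Game B", "score": 85}], fields)

# Read the file back to confirm the header was written once.
with open(demo_path, newline="", encoding="utf-8") as handle:
    saved = list(csv.DictReader(handle))
print(list_page_url(0))
print(saved)
```

Deciding on the header by checking the file instead of the page index means a run restarted from a later page still produces a valid file, at the cost of appending to a leftover file if one already exists.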
	

In [6]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import time 
from dateutil.parser import parse  # to process dates
import re # to parse partial tags
import numpy as np

In [8]:
userAgent={'User-agent':'Mozilla/5.0'}
base_url='https://www.metacritic.com'
for ind in range(0, 195):
    game_names=[]  #name
    scores=[]  #game score - what we will try and predict
    developers=[]  #developer name
    publishers=[]  #publisher name
    genres=[]  #genres the game is associated with
    ratings=[]  #game age rating
    user_score=[]  #game user score as aggregated by user reviews on site
    num_crit_reviews=[]  #number of critics reviews for game
    release_month=[]  #the month the game was released in
    release_year=[]  #the year the game was released in
    platforms=[]  #platform the game was reviewed for
    descriptions=[]  #game description for text processing
    dev_career_score=[]  #developer aggregated score
    pub_career_score=[]  #publisher aggregated score
    num_pos_critic=[]  #number of positive reviews
    num_mix_critic=[]  #number of mixed reviews
    num_neg_critic=[]  #number of negative reviews
    
    print('***************')
    # parses list page
    url_page=base_url+'/browse/games/score/metascore/all/all/filtered?page='+str(ind)
    print(url_page)
    response=requests.get(url_page,headers=userAgent)
    time.sleep(0.1) #short pause between requests; we added it after running into problems parsing the pages
    soup_page = BeautifulSoup(response.content, 'html.parser')
    # creates a list of all games links in the page
    url_game_list=soup_page.find_all('a',attrs={'class':'title'})
    #**************#
    
    # The loop runs over each game in the list
    for game in url_game_list:
        game_url=base_url+game['href']
        print(game_url)
        
        #skip a page that caused a specific problem we couldn't solve
        if(game_url == "https://www.metacritic.com/game/pc/wild-west-online"):
            continue
        response = requests.get(game_url,headers=userAgent)
        time.sleep(0.5) #short pause between requests; we added it after running into problems parsing the pages
        soup_game = BeautifulSoup(response.content, 'html.parser')
        #**************#
    
        #game name
        game_names.append(soup_game.find("a",{"class":"hover_none"}).get_text().strip())
        #**************#

        #game score
        scores.append(int(soup_game.find("span",{"itemprop":"ratingValue"}).get_text().strip()))
        #**************#

        #list of genres
        genres_list=[] #the list of stripped-down genres
        #finding the right tag
        genre_li=soup_game.find("li",attrs={"class":"summary_detail product_genre"})
        #find returns None when the tag is missing (find_all returns a list, which is never None)
        if(genre_li is not None):
            #receiving a list of all the genres
            gen_list=genre_li.find_all("span",attrs={"class":"data"})
            for genre in gen_list:
                genres_list.append(genre.get_text().strip()) #stripping names and adding them to list
            genres.append(list(dict.fromkeys(genres_list))) #removes duplicates inside genre list
        else:
            genres.append(np.nan)
        #**************#

        #rating
        #checking if a rating exists, not all games get ratings - usually hyper violent games
        rating=soup_game.find("li",attrs={"class":"summary_detail product_rating"})
        if(rating is not None):
            ratings.append(rating.find("span",attrs={"class":"data"}).get_text().strip())
        else:
            ratings.append(np.nan)
        #**************#

        #user score
        user_score_element = soup_game.find("div",{"class":re.compile(r'^metascore_w user large game')})
        # some games didn't have an aggregated user score
        if(user_score_element is None or user_score_element.get_text().strip() == 'tbd'):
            user_score.append(np.nan)
        else:
            #reuse the element found above instead of searching the page again
            user_score.append(float(user_score_element.get_text().strip()))
        #**************#

        #number of reviews
        #reviews divided by positive, mixed and negative
        game_scores_div=soup_game.find_all("div",attrs={"class":"module reviews_module critic_reviews_module"})[0]
        score_list=game_scores_div.find_all("span",attrs={"class":"count"})
        #convert every count to an int once, then reuse the values
        counts=[int(span.get_text().strip()) for span in score_list]
        num_pos_critic.append(counts[0])
        num_mix_critic.append(counts[1])
        num_neg_critic.append(counts[2])
        num_crit_reviews.append(sum(counts))
        #**************#

        #release date, split into month and year
        releas=soup_game.find_all("li",attrs={"class":"summary_detail release_data"})[0]
        date=releas.find("span",attrs={"class":"data"}).get_text().strip()
        dt=parse(date) #parses a date string into a datetime object
        release_month.append(dt.month)
        release_year.append(dt.year)
        #**************#

        #platforms
        platforms.append(soup_game.find("span",{"class":"platform"}).get_text().strip())
        #**************#

        #game description
        #checking if a description exists
        desc=soup_game.find("li",{"class":"summary_detail product_summary"})
        if(desc is not None):
            #on some pages the summary is too long, so it is split into collapsed and expanded parts
            expanded=desc.find("span",attrs={"class":"blurb blurb_expanded"})
            if(expanded is None):
                descr=desc.find("span",attrs={"class":"data"}).get_text().strip()
            else:
                descr=expanded.get_text().strip()
            descriptions.append(descr)
        else:
            descriptions.append(np.nan)
        #**************#

        #publisher
        #publisher career score
        pub_li=soup_game.find("li",attrs={"class":"summary_detail publisher"}) #finding the right tag
        publisher_temp = np.nan
        pub_career_temp = np.nan
        #checking if a publisher exists on page
        if(pub_li != None):          
            #finding first publisher
            publisher_temp=pub_li.find("a").get_text().strip()
            publishers.append(publisher_temp)
            
            # finding the publisher url
            pub_url=pub_li.find("a")['href']
            pub_domain=base_url+pub_url
            response=requests.get(pub_domain,headers=userAgent)
            soup_pub = BeautifulSoup(response.content, 'html.parser')
            
            # finding the publisher score with a specific element
            pub_career_temp=soup_pub.select("tr.review_average > td > span")
            # some publishers didn't have a score, so we had to check
            if(len(pub_career_temp) == 0):
                pub_career_temp = np.nan
                pub_career_score.append(pub_career_temp)
            else:
                pub_career_temp = pub_career_temp[0].get_text().strip()
                pub_career_score.append(int(pub_career_temp))     
        else:
            publishers.append(publisher_temp)
            pub_career_score.append(pub_career_temp)
        #**************#

        #developer
        #developer career score
        dev_li=soup_game.find("li",attrs={"class":"summary_detail developer"}) #finding the right tag
        #some games didn't have a developer listed; fall back to the publisher
        if(dev_li is not None):
            developer_temp=dev_li.find("a").get_text().strip()
        else:
            developer_temp = publisher_temp
        developers.append(developer_temp)
        # appending the developer score
        #only fetch the developer page when a developer tag exists and differs from the publisher, to save time
        if(dev_li is not None and developer_temp!=publisher_temp):
            dev_url=dev_li.find("a")['href']
            dev_domain=base_url+dev_url
            response=requests.get(dev_domain,headers=userAgent)
            soup_dev = BeautifulSoup(response.content, 'html.parser')
            dev_career_temp=soup_dev.select("tr.review_average > td > span")
            if(len(dev_career_temp) == 0):
                dev_career_score.append(np.nan)
            else:
                dev_career_score.append(int(dev_career_temp[0].get_text().strip()))
        else:
            #reuse the publisher score appended above; it may be NaN, which int() cannot convert
            dev_career_score.append(pub_career_score[-1])
        #**************#
        #end of loop
        
    # creating a dataframe of the games on this page
    df=pd.DataFrame({'game_name': game_names, 'score': scores,'user_score': user_score,'platform': platforms,
                         'developer': developers,"developer_score":dev_career_score, 'publisher': publishers,
                         "publisher_score":pub_career_score,'rating': ratings,'release_month': release_month,
                         'release_year': release_year,'num_crit_review': num_crit_reviews, 'num_pos_critic': num_pos_critic,
                         'num_mix_critic': num_mix_critic,'num_neg_critic': num_neg_critic, 'genres':genres,
                         'descriptions': descriptions})
    if(ind>0): # on successive runs append to the existing file
        df.to_csv("game_score_database.csv",mode='a',index=False,header=False)
    else:  # on first time create the file
        df.to_csv("game_score_database.csv",index=False)
    time.sleep(30)

***************
https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?page=0
https://www.metacritic.com/game/nintendo-64/the-legend-of-zelda-ocarina-of-time
https://www.metacritic.com/game/playstation/tony-hawks-pro-skater-2
https://www.metacritic.com/game/playstation-3/grand-theft-auto-iv
https://www.metacritic.com/game/dreamcast/soulcalibur
https://www.metacritic.com/game/xbox-360/grand-theft-auto-iv
https://www.metacritic.com/game/wii/super-mario-galaxy
https://www.metacritic.com/game/wii/super-mario-galaxy-2
https://www.metacritic.com/game/xbox-one/red-dead-redemption-2
https://www.metacritic.com/game/xbox-one/grand-theft-auto-v
https://www.metacritic.com/game/playstation-3/grand-theft-auto-v
https://www.metacritic.com/game/pc/disco-elysium-the-final-cut
https://www.metacritic.com/game/xbox-360/grand-theft-auto-v
https://www.metacritic.com/game/dreamcast/tony-hawks-pro-skater-2
https://www.metacritic.com/game/switch/the-legend-of-zelda-breath-of-the-wild
https://

https://www.metacritic.com/game/switch/hades
https://www.metacritic.com/game/xbox/tony-hawks-pro-skater-3
https://www.metacritic.com/game/pc/half-life-alyx
https://www.metacritic.com/game/switch/divinity-original-sin-ii---definitive-edition
https://www.metacritic.com/game/pc/divinity-original-sin-ii
https://www.metacritic.com/game/pc/unreal-tournament-2004
https://www.metacritic.com/game/xbox-360/bioshock-infinite
https://www.metacritic.com/game/switch/ori-and-the-will-of-the-wisps
https://www.metacritic.com/game/xbox-360/braid
https://www.metacritic.com/game/playstation-2/god-of-war-ii
https://www.metacritic.com/game/wii-u/super-mario-3d-world
https://www.metacritic.com/game/pc/starcraft-ii-wings-of-liberty
https://www.metacritic.com/game/playstation-2/ssx
https://www.metacritic.com/game/xbox-360/street-fighter-iv
https://www.metacritic.com/game/pc/minecraft
https://www.metacritic.com/game/switch/undertale
https://www.metacritic.com/game/playstation-vita/persona-4-golden
https://www.m

## This is our raw dataset

In [7]:
df2=pd.read_csv('game_score_database.csv')
df2


Unnamed: 0,game_name,score,user_score,platform,developer,developer_score,publisher,publisher_score,rating,release_month,release_year,num_crit_review,num_pos_critic,num_mix_critic,num_neg_critic,genres,descriptions
0,The Legend of Zelda: Ocarina of Time,99,9.1,Nintendo 64,Nintendo,76.0,Nintendo,76.0,E,11,1998,22,22,0,0,"['Action Adventure', 'Fantasy']","As a young boy, Link is tricked by Ganondorf, ..."
1,Tony Hawk's Pro Skater 2,98,7.5,PlayStation,Neversoft Entertainment,80.0,Activision,68.0,T,9,2000,19,19,0,0,"['Sports', 'Alternative', 'Skateboarding']",As most major publishers' development efforts ...
2,Grand Theft Auto IV,98,7.8,PlayStation 3,Rockstar North,90.0,Rockstar Games,81.0,M,4,2008,64,64,0,0,"['Action Adventure', 'Modern', 'Open-World']",[Metacritic's 2008 PS3 Game of the Year; Also ...
3,SoulCalibur,98,8.5,Dreamcast,Namco,70.0,Namco,70.0,T,9,1999,24,24,0,0,"['Action', 'Fighting', '3D']","This is a tale of souls and swords, transcendi..."
4,Grand Theft Auto IV,98,7.9,Xbox 360,Rockstar North,90.0,Rockstar Games,81.0,M,4,2008,86,86,0,0,"['Action Adventure', 'Modern', 'Open-World']",[Metacritic's 2008 Xbox 360 Game of the Year; ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19317,Vroom in the Night Sky,17,3.2,Switch,Poisoft,37.0,Poisoft,37.0,E,4,2017,15,0,0,15,"['Sports', 'Individual', 'Biking']",Vroom in the night sky is a magical bike actio...
19318,Leisure Suit Larry: Box Office Bust,17,1.9,PlayStation 3,Team17,70.0,Funsta,21.0,M,5,2009,11,0,0,11,"['Action Adventure', 'Adventure', 'Third-Perso...",The Leisure Suit Larry: Box Office Bust video ...
19319,Yaris,17,4.4,Xbox 360,Backbone Entertainment,66.0,Backbone Entertainment,66.0,E10+,10,2007,7,0,0,7,"['Driving', 'Racing', 'Arcade', 'Automobile']",[Xbox Live Arcade] Hop into a Toyota Yaris an...
19320,Ride to Hell: Retribution,16,1.4,PC,Eutechnyx,52.0,Deep Silver,67.0,M,6,2013,9,0,0,9,"['Driving', 'Modern', 'Racing', 'Motorcycle', ...",The game is set in the last years of the roari...
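
In the preview above, each `genres` cell is shown as the text of a Python list, for example `"['Action Adventure', 'Fantasy']"`, so the column would need to be converted back into real lists before the genres can be used as features. A minimal sketch, assuming the cells are stored as such strings and that missing genres are read back as a non-string value such as NaN; `parse_genres` is a hypothetical helper name.

```python
import ast


def parse_genres(cell):
    """Turn a stringified list such as "['Sports', 'Biking']" into a list."""
    # Missing values (for example NaN) are not strings; treat them as no genres.
    if not isinstance(cell, str):
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval.
    return ast.literal_eval(cell)


print(parse_genres("['Action Adventure', 'Fantasy']"))
print(parse_genres(float("nan")))
```

With the DataFrame loaded above, the same helper could be applied to the whole column with `df2['genres'].apply(parse_genres)`.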
