This code worked as of 13/07/22. Future changes to the websites' HTML may break it!

The aim of this project is to teach myself more about web scraping. 

I want to scrape the titles of games from the IGN website and then scrape more data about those games from the Steam Store. 

I first tried using BeautifulSoup and Requests to scrape the data, but that approach could not deal with infinite scrolling on the IGN website. Instead, I am using Selenium.

Selenium allows me to interact with dynamic webpages, e.g. pages that load more content as you scroll.

In [None]:
!pip install selenium
!pip install chromedriver-py==103.0.5060.53
!pip install webdriver-manager

The first step is to use Selenium to open the IGN webpage.

I also change my Chrome options to disable images. Through trial and error, I found that images were preventing elements from loading before the next scroll, which resulted in some data not being scraped.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

# block images to speed up webpage load
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

# Selenium 4 takes the driver path via Service and the options via `options`
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://www.ign.com/reviews/games/pc'

driver.get(url) # open IGN webpage

# wait up to 10 seconds for an element to appear before raising an exception
driver.implicitly_wait(10)

The data I want is ordered by date, and the webpage has to be scrolled down to load older reviews. I want data from the present back to the start of 2019. Scrolling down to that point took approximately 4 minutes.

In [None]:
import time

t0 = time.time()

scroll_string = "window.scrollTo(0, document.body.scrollHeight);"

for scroll in range(50): # scroll down to the start of 2019
    driver.execute_script(scroll_string)
    time.sleep(5) # wait for page to load
    
t1 = time.time()

T = round((t1 - t0)/60,2)
print('Time to load data:', T, 'mins')
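
A fixed count of 50 scrolls only works while the page length stays roughly the same. As a possible alternative, the sketch below keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are my own and are not part of Selenium; the sketch only assumes the driver offers the `execute_script` method used above. It scrolls as far as the page allows rather than stopping at 2019, so `max_scrolls` caps the run.

```python
import time

def scroll_until_stable(driver, pause=5, max_scrolls=50):
    """Scroll down until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for n in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new items time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded
            return n
        last_height = new_height
    return max_scrolls
```

If a slow connection makes the height check fire too early, a longer `pause` is probably the first thing to try.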

By inspecting the HTML of the webpage, I know that the elements I want to scrape are identified by the class name 'item-body'.

Next, I take just the text from each scraped element and store it in a list, removing some common junk text and trailing spaces. Each game in my list then has an IGN review score and a game title.

In [None]:
# scrape the data
reviews = driver.find_elements(By.CLASS_NAME, "item-body")

ign_data = []
for review in reviews:
    review_string = review.text
    strings = review_string.split("\n")
    strings[1] = strings[1].replace(' Review','')
    strings[1] = strings[1].replace(' Early Access', '')
    strings[1] = strings[1].replace(' - Final', '')
    # remove common junk text
    strings[1] = strings[1].rstrip() #remove trailing spaces
    game = [strings[0],strings[1]] # [IGN score, game title]
    ign_data.append(game)
    
#print(len(ign_data)) #check how many titles were scraped

driver.close()  
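
The cleaning step can be tried without a browser by moving it into a small function. This is only a sketch: `parse_item` and `JUNK` are names I made up, the input text is hypothetical, and I assume the score is on the first line and the title on the second, as in the loop above. Unlike the loop, it returns None instead of raising an error when an element has fewer than two lines.

```python
JUNK = (" Review", " Early Access", " - Final")  # same junk strings as above

def parse_item(text):
    """Split one 'item-body' text block into [IGN score, cleaned title].

    Returns None when the block has fewer than two lines.
    """
    lines = text.split("\n")
    if len(lines) < 2:
        return None
    title = lines[1]
    for junk in JUNK:
        title = title.replace(junk, "")
    return [lines[0], title.rstrip()]  # [IGN score, game title]

# hypothetical text, shaped like the score/title layout assumed above
print(parse_item("9\nExample Game Review \nSome summary text"))
# ['9', 'Example Game']
```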

The next step should be to use the titles to scrape more data about the game from Steam. However, the list of game titles needs to be cleaned first.

For example, some titles as scraped from the IGN website do not direct to a game on the Steam website. This could be because the title is written differently, or the title is simply not available on Steam.

The list of IGN games has over 500 elements, so checking each one manually could take ~4 hrs. Instead, I can speed up the cleaning process by first finding the game titles that do not produce an exact match on Steam.

In [None]:
# getting list of ign titles to clean

t0 = time.time()

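from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path via Service and the options via `options`
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)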
driver.implicitly_wait(10)
url = 'https://store.steampowered.com/'
driver.get(url) #open Steam website

dirty_list = []
for i, (score, title) in enumerate(ign_data):

    driver.find_element(By.ID, "store_nav_search_term").send_keys(title)
    time.sleep(1)

    # look for the pop-up text; if it is not there yet, wait longer and retry once
    pop_up_text = ""
    for extra_wait in (0, 2):
        time.sleep(extra_wait)
        try:
            pop_up = driver.find_element(By.XPATH, "//*[@id='search_suggestion_contents']/a[1]")
            pop_up_text = pop_up.text.split('\n')[0]
        except Exception:
            pop_up_text = "" # no suggestion (yet); no longer crashes the loop
        if pop_up_text != "":
            break

    # if the pop-up does not match the title (we don't care about text case),
    # the game is not on Steam or is written differently: add it to the dirty list
    if pop_up_text.upper() != title.upper():
        dirty_list.append([score, pop_up_text, title, i]) # score, suggestion, title, index

    driver.find_element(By.ID, "store_nav_search_term").clear()
    time.sleep(1)

driver.close()

t1 = time.time()
T = round((t1 - t0)/60,2)

print('Time to check titles :', T, 'mins')

#print(len(dirty_list))

What is the above code doing?
    
It generates a list of game titles that need to be cleaned. 

For each game title, it types the title into the Steam search bar, which has a dynamic dropdown list of suggested titles that might match.

If the text of the first element in the dropdown matches the title scraped from IGN, the title is not added to the list, i.e. it does not need to be cleaned.
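
The matching rule on its own is small enough to try without a browser. A minimal sketch, where `needs_cleaning` is a hypothetical helper name and the titles are made up for illustration:

```python
def needs_cleaning(ign_title, suggestion):
    """Return True when the first Steam suggestion is not an exact,
    case-insensitive match for the IGN title."""
    return suggestion.upper() != ign_title.upper()

# hypothetical titles, for illustration only
print(needs_cleaning("Example Game", "EXAMPLE GAME"))     # False
print(needs_cleaning("Example Game", "Example Game II"))  # True
print(needs_cleaning("Example Game", ""))                 # True (no suggestion)
```

An exact match is deliberately strict: a near miss such as a sequel or a special edition is flagged for manual checking rather than accepted.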

Generating the list of game titles that need to be cleaned took ~1 hr. The list has 228 elements (45% of the original scraped data), so this step cut the manual cleaning time roughly in half!
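
The estimate can be checked with quick arithmetic. The counts are the ones quoted in this notebook (228 flagged titles out of 510 scraped); the assumption that manual checking time scales linearly with the number of titles is mine.

```python
scraped = 510        # titles scraped from IGN
flagged = 228        # titles without an exact match on Steam
manual_all_hrs = 4   # rough estimate for checking every title by hand

share = flagged / scraped
manual_flagged_hrs = manual_all_hrs * share  # assumes time scales linearly

print(f"{share:.0%} of titles flagged")
print(f"~{manual_flagged_hrs:.1f} hrs of manual checking instead of ~{manual_all_hrs}")
# 45% of titles flagged
# ~1.8 hrs of manual checking instead of ~4
```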

177 titles were removed because they were not available leaving 393 of the original 510 scraped titles.

As a final check, I can run the same code on the cleaned data to confirm that every title returns a result on Steam.

With the cleaned titles stored in steam_titles (the same [IGN score, title] layout as ign_data), I can finally scrape data from Steam:

In [None]:
# scrape steam data using final (cleaned) titles

import pandas as pd
from selenium.webdriver.support.select import Select

t0 = time.time()

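from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path via Service and the options via `options`
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)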
driver.implicitly_wait(10)
url = 'https://store.steampowered.com/'
driver.get(url)

time.sleep(5)
# click the "accept all" pop-up; it was interfering with the age check
driver.find_element(By.ID, 'acceptAllButton').click()

game_data = []
# the Selenium window must be in half-window snap mode for this page layout!
for title in steam_titles:
    
    text1, text2, text3 = "", "", "" #initiate text data
    
    # get pricing data
    try: 
        driver.find_element(By.ID, "store_nav_search_term").send_keys(title[1])   
        data2 = driver.find_element(By.XPATH, "//*[@id='search_suggestion_contents']/a[1]")
        # xpath very helpful when you know an element will be there
        text2 = data2.text
    except:
        pass #clean later
    
    # pass titles and click pop-up to go to game page
    try:
        driver.find_element(By.XPATH, "//*[@id='search_suggestion_contents']/a[1]").click()
        time.sleep(2) # wait for page to load
    except: 
        # if pop up to click isn't there try again...
        driver.find_element(By.ID, "store_nav_search_term").clear()
        time.sleep(1)
        driver.find_element(By.ID, "store_nav_search_term").send_keys(title[1]) 
        driver.find_element(By.XPATH, "//*[@id='search_suggestion_contents']/a[1]").click()
    
    # look for and deal with the age check window
    try:
        driver.find_element(By.ID, 'view_product_page_btn')
        age_check = True # age check is present
    except Exception:
        age_check = False # age check is not present
    
    if age_check:
        # pick an adult birth year, then continue to the game page
        select = Select(driver.find_element(By.ID,'ageYear'))
        select.select_by_value('1993')
        driver.find_element(By.ID, 'view_product_page_btn').click() 
    
    # get review and release data
    try:
        data1 = driver.find_element(By.ID, "game_highlights")
        text1 = data1.text
    except:
        pass #clean later
    
    #get genre data
    try:
        data3 = driver.find_element(By.ID, "genresAndManufacturer")
        text3 = data3.text
    except:
        pass #clean later
            
    game = parse_steam(title,text1,text2,text3) #parse text data
    game_data.append(game)
    
game_df = pd.DataFrame(game_data)
driver.close()
t1 = time.time()
T = round((t1 - t0)/60,2)
print('Time to scrape Steam data:', T, 'mins')

The above code tries to:

1. Get the price data of the game from the Steam drop down menu.
2. Click the Steam drop down to go to the game page.
3. Check and deal with an age check before some adult rated games.
4. Get the review and release date data from a webpage element.
5. Get the game genre data from a webpage element.

Try/except statements are very useful at this stage because of differences in formatting between game webpages. It took ~1.5 hrs to scrape this data.
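
Since the same try/except pattern is repeated for each element, it could be wrapped in a small helper. This is a sketch only: `safe_text` is a hypothetical name, not a Selenium function, and it catches every exception so that it stays independent of the Selenium import (in practice the failure would usually be Selenium's NoSuchElementException).

```python
def safe_text(driver, by, value, default=""):
    """Return the text of the first matching element, or default if the lookup fails."""
    try:
        return driver.find_element(by, value).text
    except Exception:  # e.g. Selenium's NoSuchElementException
        return default
```

With this helper, a lookup such as the review data would become `text1 = safe_text(driver, By.ID, "game_highlights")`, and a missing element simply leaves an empty string to be cleaned later.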

The parse_steam function is one I wrote to parse the text data into a useful format. It is defined at the end of this notebook, so that cell must be run before the scraping cell above.

Notes:
    
There will still be errors in the dataframe. Some titles may have directed to the wrong game, for example Call of Duty: I believe several game titles in this franchise went to the same Steam webpage. I can catch duplicate entries when I am cleaning the dataframe.

Similarly, games with low review numbers likely went to the wrong page, since games reviewed by IGN should be popular and have many reviews.
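
Both checks can be sketched on the list of dictionaries before it becomes a dataframe. The helper name `flag_suspect_rows`, the 100-review threshold, and the example rows are all assumptions for illustration. Rows are treated as duplicates when their Steam-derived fields are identical, because the 'Game' column holds the IGN title and so differs even when two titles landed on the same Steam page.

```python
from collections import Counter

def flag_suspect_rows(rows, min_reviews=100):
    """Return indices of rows that share Steam data with another row,
    have fewer than min_reviews reviews, or have no review count at all."""
    def steam_key(row):
        return (row["Release Date"], row["No. of Reviews"], row["Positive Reviews [%]"])

    counts = Counter(steam_key(r) for r in rows)
    suspects = []
    for i, row in enumerate(rows):
        n = row["No. of Reviews"]
        duplicate = counts[steam_key(row)] > 1
        few_reviews = isinstance(n, int) and n < min_reviews
        missing = n == ""  # parsing failed for this row
        if duplicate or few_reviews or missing:
            suspects.append(i)
    return suspects

# hypothetical rows, for illustration only
rows = [
    {"Game": "A", "Release Date": "2020-01-01", "No. of Reviews": 5000, "Positive Reviews [%]": 0.9},
    {"Game": "B", "Release Date": "2020-01-01", "No. of Reviews": 5000, "Positive Reviews [%]": 0.9},
    {"Game": "C", "Release Date": "2021-05-05", "No. of Reviews": 12, "Positive Reviews [%]": 0.8},
    {"Game": "D", "Release Date": "2019-03-03", "No. of Reviews": 800, "Positive Reviews [%]": 0.7},
]
print(flag_suspect_rows(rows))  # [0, 1, 2]
```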

In [None]:
from datetime import datetime, date

def parse_steam(title,text1,text2,text3):

    sentiment, positive_per, no_review, release_date = "", "", "", ""
    try:
        lines1 = text1.split("\n")
        previous_line = ""
        for line in lines1:
            if previous_line == 'ALL REVIEWS:':

                sentiment = line.split('-')[0].rstrip()
                sub_string = line.split('-')[1].split(' ')
                positive_per = float(sub_string[1].replace('%',''))/100
                no_review = int( sub_string[4].replace(',','') )

            if previous_line == 'RELEASE DATE:':
                release_date = datetime.strptime(line, '%d %b, %Y').date()
            previous_line = line
    except:
        pass #can be cleaned later
    
    price = ""
    try:
        lines2 = text2.split("\n")
        if "£" in lines2[1]: # only parse prices shown in pounds
            price = float(lines2[1].replace('£',''))
    except (IndexError, ValueError):
        pass #can be cleaned later
    
    genre = ""
    for line in text3.split("\n"):
        if 'GENRE:' in line:
            genre = line.replace('GENRE:','').strip() # also drop surrounding spaces

    game = {'Game':title[1], 'Price':price, 'Genres':genre, 'IGN_score':float(title[0]),
            'Release Date':release_date, 'Player Sentiment':sentiment, 'Positive Reviews [%]':positive_per,
            'No. of Reviews':no_review}
    
    return game
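
The review line is the most fragile part of the parsing, because it relies on fixed positions after splitting on spaces. As a possible alternative, here is a sketch that uses a regular expression instead. The line format is an assumption inferred from how parse_steam indexes the text, the example line is made up, and `parse_review_line` is a hypothetical helper that parse_steam does not use.

```python
import re

# sentiment, then " - ", then "<pct>% of the <count> user reviews"
REVIEW_RE = re.compile(
    r"^(?P<sentiment>.+?)\s*-\s*(?P<pct>\d+)% of the (?P<count>[\d,]+) user reviews"
)

def parse_review_line(line):
    """Return (sentiment, positive fraction, review count), or None if no match."""
    match = REVIEW_RE.match(line)
    if match is None:
        return None
    return (
        match["sentiment"],
        int(match["pct"]) / 100,
        int(match["count"].replace(",", "")),
    )

# hypothetical line, shaped the way parse_steam expects it
line = "Very Positive (1,234) - 95% of the 1,234 user reviews for this game are positive."
print(parse_review_line(line))
# ('Very Positive (1,234)', 0.95, 1234)
```

A line that does not fit the pattern returns None, which is easier to spot during cleaning than a silently skipped exception.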