# Steam Store and SteamSpy Scraping

## Introduction

In this file, I scrape game information from the official Steam store and the SteamSpy API. You do not have to run this page, since all the datasets are included in the Google Drive folder (and on GitHub). If you would like updated data, or want to confirm that this page works, feel free to run the code.

## Scraping

Before scraping, I need to decide which games to collect data for. I take the app_ids from the [Steam Reviews](https://www.kaggle.com/datasets/andrewmvd/steam-reviews) dataset by Larxel on Kaggle. This dataset includes over 6.4 million publicly available reviews in English, along with each review's sentiment and the number of users who found it helpful. To keep the run time reasonable, I only take the first 1500 unique games from this dataset.

In [None]:
# Importing the necessary libraries for scraping
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import json
import altair as alt

In [2]:
df = pd.read_csv("steam_review_data.csv")
unique_ids = df['app_id'].dropna().unique()[:1500]

First, I scrape the game information from the official Steam Store by appending each app_id from the Steam Reviews dataset to the base Steam Store URL. Some games with a mature rating redirect to an age-verification page, so I send the cookies birthtime, lastagecheckage, and mature_content with each request; with these set, the request goes straight to the game page instead of the age check. From that page, I extract the game title, genre, developer, publisher, franchise, and release date, and build a Pandas DataFrame from the results.
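The `birthtime` value looks like a Unix timestamp (seconds since 1970-01-01 UTC). That reading is an assumption based on the value itself, not on any Steam documentation. As a minimal sketch, the snippet below decodes the value used in this notebook and shows how a similar value could be built for another date. The helper name `birthtime_for` is hypothetical.

```python
from datetime import datetime, timezone

# Value used for the 'birthtime' cookie in this notebook.
BIRTHTIME = 568022401

# Assumption: the cookie holds seconds since the Unix epoch (UTC).
decoded = datetime.fromtimestamp(BIRTHTIME, tz=timezone.utc)
print(decoded.isoformat())  # 1988-01-01T08:00:01+00:00


def birthtime_for(year: int, month: int, day: int) -> str:
    """Hypothetical helper: epoch seconds for midnight UTC, as a cookie string."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return str(int(moment.timestamp()))


print(birthtime_for(1990, 1, 1))  # 631152000
```

Under that assumption, the cookie decodes to early 1988 while `lastagecheckage` says 1990. Both describe an adult, which is presumably all the age check needs.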

NOTE: Code below took 37 minutes and 4 seconds to run on MacBook (VS Code); may take longer/shorter on Google Colab.

In [None]:
# Create cookies to bypass age verification
cookies = {
    'birthtime': '568022401',
    'lastagecheckage': '1-January-1990',
    'mature_content': '1'
}

base_url = "https://store.steampowered.com/app/"

# List that collects one dict per game before building the DataFrame
results = []

# For each game in unique_ids, navigate to the Steam page and extract data
for app_id in unique_ids:
    url = f"{base_url}{int(app_id)}/"
    try:
        response = requests.get(url, cookies=cookies, timeout=10)
        
        if response.status_code != 200:
            raise Exception(f"Bad status code {response.status_code} for app_id {app_id}")
        
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the HTML element with the information needed
        info_block = soup.find('div', id='genresAndManufacturer')

        # Make the data default to N/A if information is not found
        title = genre = developer = publisher = franchise = release_date = "N/A"

        if info_block:
            # Title
            title_line = info_block.find(string=lambda t: 'Title:' in t)
            if title_line:
                title = title_line.parent.next_sibling.strip()

            # Genre
            genre_line = info_block.find(string=lambda t: 'Genre:' in t)
            if genre_line:
                genre_span = genre_line.find_next('span')
                if genre_span:
                    genre = ', '.join([a.text.strip() for a in genre_span.find_all('a')])
            # Developer
            dev_block = info_block.find('b', string='Developer:')
            if dev_block:
                developer = ', '.join([a.text for a in dev_block.find_next_siblings('a')])

            # Publisher
            pub_block = info_block.find('b', string='Publisher:')
            if pub_block:
                publisher = ', '.join([a.text for a in pub_block.find_next_siblings('a')])

            # Franchise
            fr_block = info_block.find('b', string='Franchise:')
            if fr_block:
                franchise = ', '.join([a.text for a in fr_block.find_next_siblings('a')])

            # Release Date
            # Use string= (text= is the deprecated spelling), as in the lookups above
            rel_line = info_block.find(string=lambda t: 'Release Date:' in t)
            if rel_line:
                release_date = rel_line.parent.next_sibling.strip()

        results.append({
            'app_id': app_id,
            'title': title,
            'genre': genre,
            'developer': developer,
            'publisher': publisher,
            'franchise': franchise,
            'release_date': release_date
        })

    # If there is an error, then make the row N/A
    except Exception as e:
        results.append({
            'app_id': app_id,
            'title': f"Error: {e}",
            'genre': "N/A",
            'developer': "N/A",
            'publisher': "N/A",
            'franchise': "N/A",
            'release_date': "N/A"
        })
    # Added sleep to make sure I don't make too many requests at once
    time.sleep(.5)

# Save results
output_df = pd.DataFrame(results)
output_df.to_csv("steam_game_details.csv", index=False)
print(output_df.head())

Then, to scrape data from the SteamSpy API, I parse the JSON that the API returns and extract the number of positive and negative reviews, the range of owners, the price, and the tags. Since the API only allows one request per second, I wait one second after each call. This data is also saved as a Pandas DataFrame.
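A fixed one-second sleep after every call waits a full second on top of the request's own duration. One possible alternative, sketched here with only the standard library, is to record when the last call happened and sleep only for the remainder of the interval. The `Throttle` class and its `wait` method are hypothetical names, not part of `requests` or SteamSpy.

```python
import time


class Throttle:
    """Hypothetical helper: keep at least `min_interval` seconds between calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call = None  # monotonic time of the previous call, if any

    def wait(self) -> float:
        """Sleep just long enough to respect the interval; return seconds slept."""
        now = time.monotonic()
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


# Demo with a short interval so it finishes quickly; a SteamSpy loop would use 1.0.
throttle = Throttle(min_interval=0.05)
first = throttle.wait()   # no previous call, so no sleep
second = throttle.wait()  # called right away, so it sleeps for most of the interval
print(first, second > 0)
```

In a scraping loop, `throttle.wait()` would be called just before each request, so a slow response does not add to the total delay.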

NOTE: Code below took 63 minutes and 34 seconds to run on MacBook (VS Code); may take longer/shorter on Google Colab.

In [None]:
base_url = "https://steamspy.com/api.php?request=appdetails&appid="

# List that collects one dict per game before building the DataFrame
results = []

# For each app_id in unique_ids extract game information
for app_id in unique_ids:
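    # No trailing slash here: app_id is a query-string value, not a path segment
    url = f"{base_url}{int(app_id)}"
    try:
        # Timeout so a stalled connection cannot hang the whole loop
        response = requests.get(url, timeout=10)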
        
        if response.status_code != 200:
            raise Exception(f"Bad status code {response.status_code} for app_id {app_id}")
        
        data = json.loads(response.text)
        
        # Getting data and storing in variables to append to results array
        positive = data['positive']
        negative = data['negative']
        owners = data['owners']
        price = data['price']
        tags = data['tags']
        results.append({
            'app_id': app_id,
            'positive': positive,
            'negative': negative,
            'owners': owners,
            'price': price,
            'tags': tags,
        })
        
    # If there is an error, then make the row N/A
    except Exception as e:
        results.append({
            'app_id': app_id,
            'positive': "N/A",
            'negative': "N/A",
            'owners': "N/A",
            'price': "N/A",
            'tags': "N/A",
        })
        print(e)

    # Added 1 second sleep to make sure that I don't send too many requests at once
    time.sleep(1)

# Save results
output_df = pd.DataFrame(results)
output_df.to_csv("steam_game_stats.csv", index=False)
print(output_df.head())
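SteamSpy returns `owners` as a range string and `price` as a string; in the sample output for Counter-Strike these are `"10,000,000 .. 20,000,000"` and `"999"`. A minimal sketch of turning them into numbers follows. It assumes the range always uses `..` as the separator and that the price is in US cents; both are inferred from that one sample. The function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_owners(owners: str) -> Tuple[int, int]:
    """Split an owners range like '20,000 .. 50,000' into (low, high) integers."""
    low, high = owners.split("..")
    return int(low.replace(",", "").strip()), int(high.replace(",", "").strip())


def price_to_dollars(price: Optional[str]) -> Optional[float]:
    """Convert a price in cents (assumed) to dollars; pass through missing values."""
    if price is None or price == "N/A":  # "N/A" is the placeholder used on errors
        return None
    return int(price) / 100


print(parse_owners("10,000,000 .. 20,000,000"))  # (10000000, 20000000)
print(price_to_dollars("999"))                   # 9.99
```

Keeping the bounds as two integers makes it possible to sort or bin games by ownership later, which the raw string does not allow.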


<Response [200]>
{"appid":10,"name":"Counter-Strike","developer":"Valve","publisher":"Valve","score_rank":"","positive":242768,"negative":6388,"userscore":0,"owners":"10,000,000 .. 20,000,000","average_forever":0,"average_2weeks":0,"median_forever":0,"median_2weeks":0,"price":"999","initialprice":"999","discount":"0","ccu":11198,"languages":"English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean","genre":"Action","tags":{"Action":5499,"FPS":4924,"Multiplayer":3471,"Shooter":3417,"Classic":2847,"Team-Based":1920,"First-Person":1760,"Competitive":1654,"Tactical":1391,"1990's":1246,"e-sports":1232,"PvP":928,"Old School":828,"Military":665,"Strategy":640,"Survival":322,"Score Attack":304,"1980s":285,"Assassin":248,"Nostalgia":205}}
<Response [200]>
{"appid":1002,"name":"Rag Doll Kung Fu","developer":"Mark Healey","publisher":"Mark Healey","score_rank":"","positive":89,"negative":30,"userscore":0,"owners":"20,000 .. 50,000","average_forever":0,"av