<a href="https://colab.research.google.com/github/d3ttl4ff/steam-data-scraper/blob/main/steam_data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Import Libraries**

We begin by importing the libraries we will use: first the standard library modules that ship with Python, then the third-party packages. We use requests to interact with the APIs, BeautifulSoup to parse HTML pages, and the popular pandas and numpy libraries to handle the downloaded data.

In [45]:
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time
import sys

# third-party imports
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup


# customisations - ensure tables show all columns
pd.set_option("display.max_columns", 100)

Next, we define a general-purpose function to send a GET request to an API, identified by a URL parameter. An optional dictionary of parameters is passed along with the request, depending on the requirements of the API.

Rather than simply returning the response, we handle a couple of scenarios to help automation. Occasionally we encounter an SSL Error, in which case we simply wait a few seconds then try again (by recursively calling the function). When this happens, and generally throughout this project, we provide quite verbose feedback to show when these errors are encountered and how they are handled.

Sometimes a request does not succeed and the server answers with an error status, in which case the response object evaluates as false. This usually happens when too many requests are made in a short period of time and the polling limit has been reached. We try to avoid this by pausing briefly between requests, as we'll see later, but if we do breach the polling limit we wait 10 seconds then try again.

Handling these errors in this way ensures that our function almost always returns the desired response, which we return as parsed JSON to make processing easier.

In [46]:
def get_request(url, parameters=None):
    """Return json-formatted response of a get request using optional parameters.

    Parameters
    ----------
    url : string
    parameters : {'parameter': 'value'}
        parameters to pass as part of get request

    Returns
    -------
    json_data
        json-formatted response (dict-like)
    """
    try:
        response = requests.get(url=url, params=parameters)
    except requests.exceptions.SSLError as s:
        print('SSL Error:', s)

        for i in range(5, 0, -1):
            print('\rWaiting... ({})'.format(i), end='')
            time.sleep(1)
        print('\rRetrying.' + ' '*10)

        # recursively try again
        return get_request(url, parameters)

    if response:
        return response.json()
    else:
        # a falsy response (error status) usually means too many requests. Wait and try again
        print('No response, waiting 10 seconds...')
        time.sleep(10)
        print('Retrying.')
        return get_request(url, parameters)
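
The retry logic described above recurses without an upper bound, so an endpoint that never recovers would keep the function waiting indefinitely. As a minimal sketch of one alternative, the loop below caps the number of attempts. The names `call_with_retries`, `fetch` and `fake_fetch` are hypothetical and not part of this project, and the request is replaced by a stand-in so the example runs offline.

```python
import time


def call_with_retries(fetch, max_attempts=5, wait_seconds=10, sleep=time.sleep):
    """Call fetch() until it returns a truthy value or attempts run out.

    fetch        : zero-argument callable performing the request (hypothetical)
    max_attempts : upper bound on tries, so a dead endpoint cannot loop forever
    sleep        : injectable for testing; defaults to time.sleep
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except OSError as error:
            # the exceptions raised by requests (SSLError included) derive from OSError
            print('Attempt {} raised: {}'.format(attempt, error))
        else:
            if result:
                return result
            print('Attempt {} returned no usable response.'.format(attempt))

        # only wait if another attempt will follow
        if attempt < max_attempts:
            sleep(wait_seconds)

    return None


# stand-in for the real request: raises once, returns nothing once, then succeeds
outcomes = [OSError('simulated SSL failure'), None, {'success': True}]


def fake_fetch():
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


# sleep is replaced with a no-op so the demonstration finishes immediately
print(call_with_retries(fake_fetch, sleep=lambda seconds: None))
```

With a real request, `fetch` could be something like `lambda: requests.get(url, params=parameters)`, since a response with an error status evaluates as false.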

# **Generate List of App IDs**

Every app on the Steam store has a unique app ID. Whilst different apps can have the same name, they can't have the same ID. This will be very useful to us for identifying apps and eventually merging our tables of data.

Before we get to that, we need to generate a list of app IDs which we can use to build our data sets. It's possible to generate one from the Steam API, but this has over 70,000 entries, many of which are demos and videos, with no way to tell them apart. Instead, SteamSpy provides an 'all' request, supplying some information about the apps they track. It doesn't supply all information about each app, so we still need to request this individually, but it provides a good starting point.

Because many of the return fields are strings containing commas and other punctuation, it is easiest to read the response into a pandas dataframe, and export the required appid and name fields to a csv. We could keep only the appid column as a list or pandas series, but it may be useful to keep the app name at this stage.

In [47]:
# define the base URL and parameters
url = "https://steamspy.com/api.php"

# initialize an empty list to store data from all pages
all_data = []

# loop through pages; only page 0 is fetched here (use range(80) for all pages, 0 to 79)
for page in range(1):
    parameters = {"request": "all", "page": page}
    response = requests.get(url, params=parameters)
    if response.status_code == 200:
        page_data = response.json()
        all_data.extend(page_data.values())
        sys.stdout.write(f"\rFinished fetching data for page {page}")
        sys.stdout.flush()
    else:
        sys.stdout.write(f"\rFailed to fetch data for page {page}")
        sys.stdout.flush()

# convert the collected data into a DataFrame
steam_spy_all = pd.DataFrame(all_data)

# generate a sorted app_list from steamspy data
app_list = steam_spy_all[['appid', 'name']].sort_values('appid').reset_index(drop=True)

# save to CSV
output_path = '/content/drive/MyDrive/Colab Notebooks/data/download/app_list.csv'
app_list.to_csv(output_path, index=False)

# read from the stored CSV
app_list = pd.read_csv(output_path)

# count all items
app_list_count = len(app_list)
print(f"\nThe number of items in the dataset is: {app_list_count}")

# display first few rows
print(app_list.head())

Finished fetching data for page 0
The number of items in the dataset is: 1000
   appid                       name
0     10             Counter-Strike
1     20      Team Fortress Classic
2     30              Day of Defeat
3     50  Half-Life: Opposing Force
4     60                   Ricochet
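
To make the conversion step concrete, here is a small offline sketch of how a response of this kind turns into the two-column app list. The payload is hand-written for illustration and assumes the response is a dictionary of per-app records, each carrying its own `appid` and `name`. The real response has many more fields per app, and `app_list_demo` is a hypothetical name.

```python
import pandas as pd

# illustrative payload only: the real response holds many more fields per app
page_data = {
    "10": {"appid": 10, "name": "Counter-Strike", "developer": "Valve"},
    "30": {"appid": 30, "name": "Day of Defeat", "developer": "Valve"},
    "20": {"appid": 20, "name": "Team Fortress Classic", "developer": "Valve"},
}

# one row per app; the dictionary keys are dropped because appid is also a field
frame = pd.DataFrame(list(page_data.values()))

# keep only the two columns needed later, ordered by appid
app_list_demo = frame[["appid", "name"]].sort_values("appid").reset_index(drop=True)
print(app_list_demo)
```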


# **Define Download Logic**

Now we have the app_list dataframe, we can iterate over the app IDs and request individual app data from the servers. Here we set out our logic to retrieve and process this information, then finally store the data as a csv file.

Because it takes a long time to retrieve the data, it would be dangerous to attempt it all in one go as any errors or connection time-outs could cause the loss of all our data. For this reason we define a function to download and process the requests in batches, appending each batch to an external file and keeping track of the highest index written in a separate file.

This not only provides security, allowing us to easily restart the process if an error is encountered, but also means we can complete the download across multiple sessions.

Again, we provide verbose output for rows exported, batches complete, time taken and estimated time remaining.

In [48]:
def get_app_data(start, stop, parser, pause):
    """Return list of app data generated from parser.

    parser : function to handle request
    """
    app_data = []

    # iterate through each row of app_list, confined by start and stop
    for index, row in app_list[start:stop].iterrows():
        print('Current index: {}'.format(index), end='\r')

        appid = row['appid']
        name = row['name']

        # retrieve app data for a row, handled by supplied parser, and append to list
        data = parser(appid, name)
        app_data.append(data)

        time.sleep(pause) # prevent overloading api with requests

    return app_data


def process_batches(parser, app_list, download_path, data_filename, index_filename,
                    columns, begin=0, end=-1, batchsize=100, pause=1):
    """Process app data in batches, writing directly to file.

    parser : custom function to format request
    app_list : dataframe of appid and name
    download_path : path to store data
    data_filename : filename to save app data
    index_filename : filename to store highest index written
    columns : column names for file

    Keyword arguments:

    begin : starting index (get from index_filename, default 0)
    end : index to finish (defaults to end of app_list)
    batchsize : number of apps to write in each batch (default 100)
    pause : time to wait after each api request (default 1)

    returns: none
    """
    print('Starting at index {}:\n'.format(begin))

    # by default, process all apps in app_list
    if end == -1:
        end = len(app_list)  # slices exclude the stop index, so no + 1 is needed

    # generate array of batch begin and end points
    batches = np.arange(begin, end, batchsize)
    batches = np.append(batches, end)

    apps_written = 0
    batch_times = []

    for i in range(len(batches) - 1):
        start_time = time.time()

        start = batches[i]
        stop = batches[i+1]

        app_data = get_app_data(start, stop, parser, pause)

        rel_path = os.path.join(download_path, data_filename)

        # writing app data to file
        with open(rel_path, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')

            for j in range(3,0,-1):
                print("\rAbout to write data, don't stop script! ({})".format(j), end='')
                time.sleep(0.5)

            writer.writerows(app_data)
            print('\rExported lines {}-{} to {}.'.format(start, stop-1, data_filename), end=' ')

        apps_written += len(app_data)

        idx_path = os.path.join(download_path, index_filename)

        # writing last index to file
        with open(idx_path, 'w') as f:
            index = stop
            print(index, file=f)

        # logging time taken
        end_time = time.time()
        time_taken = end_time - start_time

        batch_times.append(time_taken)
        mean_time = statistics.mean(batch_times)

        est_remaining = (len(batches) - i - 2) * mean_time

        remaining_td = dt.timedelta(seconds=round(est_remaining))
        time_td = dt.timedelta(seconds=round(time_taken))
        mean_td = dt.timedelta(seconds=round(mean_time))

        print('Batch {} time: {} (avg: {}, remaining: {})'.format(i, time_td, mean_td, remaining_td))

    print('\nProcessing batches complete. {} apps written'.format(apps_written))
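
The batch boundaries are the part of this logic that is easiest to get wrong, so here is a minimal pure-Python sketch of the same idea: list the start points, append the end, then pair neighbouring values. `batch_bounds` is a hypothetical helper written for illustration. The function above builds the same sequence with `np.arange` and `np.append`.

```python
def batch_bounds(begin, end, batchsize):
    """Return (start, stop) pairs covering begin..end in steps of batchsize."""
    starts = list(range(begin, end, batchsize))
    edges = starts + [end]
    # pair each edge with the next one: (edges[0], edges[1]), (edges[1], edges[2]), ...
    return list(zip(edges[:-1], edges[1:]))


print(batch_bounds(0, 10, 5))   # two full batches
print(batch_bounds(5, 12, 5))   # resuming at index 5, final batch is shorter
print(batch_bounds(10, 10, 5))  # nothing left to do
```

Each `stop` is exclusive, which matches how the rows are sliced, and it is also the value written to the index file so that the next run starts exactly where this one ended.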

Next we define some functions to handle and prepare the external files.

We use reset_index for testing and demonstration, allowing us to easily reset the index in the stored file to 0, effectively restarting the entire download process.

We define get_index to retrieve the index from file, maintaining persistence across sessions. Every time a batch of information (app data) is written to file, we write the highest index within app_data that was retrieved. As stated, this is partially for security, ensuring that if there is an error during the download we can read the index from file and continue from the end of the last successful batch. Keeping track of the index also allows us to pause the download, continuing at a later time.

Finally, the prepare_data_file function readies the csv for storing the data. If the index we retrieved is 0, it means we are either starting for the first time or starting over. In either case, we want a blank csv file with only the header row to begin writing to, so we wipe the file (by opening in write mode) and write the header. Conversely, if the index is anything other than 0, it means we already have downloaded information, and can leave the csv file alone.

In [49]:
def reset_index(download_path, index_filename):
    """Reset index in file to 0."""
    rel_path = os.path.join(download_path, index_filename)

    with open(rel_path, 'w') as f:
        print(0, file=f)


def get_index(download_path, index_filename):
    """Retrieve index from file, returning 0 if file not found."""
    try:
        rel_path = os.path.join(download_path, index_filename)

        with open(rel_path, 'r') as f:
            index = int(f.readline())

    except FileNotFoundError:
        index = 0

    return index


def prepare_data_file(download_path, filename, index, columns):
    """Create file and write headers if index is 0."""
    if index == 0:
        rel_path = os.path.join(download_path, filename)

        with open(rel_path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()
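
The persistence idea can be tried in isolation. The sketch below runs the round trip in a temporary folder: a first read finds no file and falls back to 0, a batch finishes and stores its stop index, and a later read picks that value up. `write_checkpoint`, `read_checkpoint` and `demo_index.txt` are hypothetical names used only for this illustration.

```python
import os
import tempfile


def write_checkpoint(path, index):
    """Store the next index to process (hypothetical helper)."""
    with open(path, 'w') as f:
        print(index, file=f)


def read_checkpoint(path):
    """Return the stored index, or 0 when no checkpoint exists yet."""
    try:
        with open(path, 'r') as f:
            return int(f.readline())
    except FileNotFoundError:
        return 0


with tempfile.TemporaryDirectory() as folder:
    checkpoint = os.path.join(folder, 'demo_index.txt')

    first_run = read_checkpoint(checkpoint)    # no file yet
    write_checkpoint(checkpoint, 5)            # a batch covering rows 0-4 finished
    resumed = read_checkpoint(checkpoint)      # a later session picks up here

print(first_run, resumed)
```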

# **Download Steam Data**

Now we are ready to start downloading data and writing it to file. First we define the parsing logic specific to the Steam Store API: if a request is unsuccessful, we return just the name and appid. Then we set some parameters: the files we will write our data and index to, and the columns for the csv file. The API doesn't return every column for every app, so it is best to set these explicitly.

Next we run our functions to set up the files, and make a call to process_batches to begin the process. Some additional parameters have been added for demonstration, to constrain the download to just a few rows and smaller batches. Removing these would allow the entire download process to be repeated.

In [50]:
def parse_steam_request(appid, name):
    """Unique parser to handle data from Steam Store API.

    Returns : json formatted data (dict-like)
    """
    url = "http://store.steampowered.com/api/appdetails/"
    parameters = {"appids": appid}

    json_data = get_request(url, parameters=parameters)
    json_app_data = json_data[str(appid)]

    if json_app_data['success']:
        data = json_app_data['data']
    else:
        data = {'name': name, 'steam_appid': appid}

    return data


# set file parameters
download_path = '/content/drive/MyDrive/Colab Notebooks/data/download'
steam_app_data = 'steam_app_data.csv'
steam_index = 'steam_index.txt'

steam_columns = [
    'type', 'name', 'steam_appid', 'required_age', 'is_free', 'controller_support',
    'dlc', 'detailed_description', 'about_the_game', 'short_description', 'fullgame',
    'supported_languages', 'header_image', 'website', 'pc_requirements', 'mac_requirements',
    'linux_requirements', 'legal_notice', 'drm_notice', 'ext_user_account_notice',
    'developers', 'publishers', 'demos', 'price_overview', 'packages', 'package_groups',
    'platforms', 'metacritic', 'reviews', 'categories', 'genres', 'screenshots',
    'movies', 'recommendations', 'achievements', 'release_date', 'support_info',
    'background', 'content_descriptors'
]

# overwrites last index for demonstration (would usually store highest index so can continue across sessions)
reset_index(download_path, steam_index)

# retrieve last index downloaded from file
index = get_index(download_path, steam_index)

# wipe or create data file and write headers if index is 0
prepare_data_file(download_path, steam_app_data, index, steam_columns)

# set end and batchsize for demonstration - remove to run through entire app list
process_batches(
    parser=parse_steam_request,
    app_list=app_list,
    download_path=download_path,
    data_filename=steam_app_data,
    index_filename=steam_index,
    columns=steam_columns,
    begin=index,
    end=10,
    batchsize=5
)

Starting at index 0:

Exported lines 0-4 to steam_app_data.csv. Batch 0 time: 0:00:08 (avg: 0:00:08, remaining: 0:00:08)
Exported lines 5-9 to steam_app_data.csv. Batch 1 time: 0:00:08 (avg: 0:00:08, remaining: 0:00:00)

Processing batches complete. 10 apps written
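
Setting the columns explicitly matters because of how `csv.DictWriter` treats uneven records. The short sketch below writes to an in-memory buffer instead of a file: keys missing from a record are written as empty cells, and keys outside the column list are dropped by `extrasaction='ignore'`. The three-column list and the `unlisted_key` entry are made up for the illustration.

```python
import csv
import io

columns = ['name', 'steam_appid', 'website']

rows = [
    # full record, plus a key that is not in `columns`
    {'name': 'Day of Defeat', 'steam_appid': 30,
     'website': 'http://www.dayofdefeat.com/', 'unlisted_key': 'dropped'},
    # fallback-style record holding only the name and id
    {'name': 'Ricochet', 'steam_appid': 60},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

Without `extrasaction='ignore'`, the writer would raise a `ValueError` on the first record that contains an unlisted key.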


In [51]:
# inspect downloaded data
pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/download/steam_app_data.csv').head()

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://shared.akamai.steamstatic.com/store_it...,,{'minimum': '\n\t\t\t<p><strong>Minimum:</stro...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'PLN', 'initial': 3599, 'final': ...","[574941, 7]","[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://shared.a...",,{'total': 157131},,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://shared.akamai.steamstatic.com/store_it...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://shared.akamai.steamstatic.com/store_it...,,{'minimum': '\n\t\t\t<p><strong>Minimum:</stro...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'USD', 'initial': 499, 'final': 4...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://shared.a...",,{'total': 6423},,"{'coming_soon': False, 'date': 'Apr 1, 1999'}","{'url': '', 'email': ''}",https://shared.akamai.steamstatic.com/store_it...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://shared.akamai.steamstatic.com/store_it...,http://www.dayofdefeat.com/,{'minimum': '\n\t\t\t<p><strong>Minimum:</stro...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'USD', 'initial': 499, 'final': 4...","[30, 944613]","[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://shared.a...",,{'total': 4208},,"{'coming_soon': False, 'date': 'May 1, 2003'}","{'url': '', 'email': ''}",https://shared.akamai.steamstatic.com/store_it...,"{'ids': [2, 5], 'notes': 'This game includes f..."
3,game,Half-Life: Opposing Force,50,0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://shared.akamai.steamstatic.com/store_it...,,{'minimum': '\n\t\t\t<p><strong>Minimum:</stro...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'USD', 'initial': 499, 'final': 4...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://shared.a...",,{'total': 21289},,"{'coming_soon': False, 'date': 'Nov 1, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://shared.akamai.steamstatic.com/store_it...,"{'ids': [2, 5], 'notes': 'Half-Life: Opposing ..."
4,game,Ricochet,60,0,False,,,A futuristic action game that challenges your ...,A futuristic action game that challenges your ...,A futuristic action game that challenges your ...,,"English, French, German, Italian, Spanish - Sp...",https://shared.akamai.steamstatic.com/store_it...,,{'minimum': '\n\t\t\t<p><strong>Minimum:</stro...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'USD', 'initial': 499, 'final': 4...",[33],"[{'name': 'default', 'title': 'Buy Ricochet', ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://shared.a...",,{'total': 4233},,"{'coming_soon': False, 'date': 'Nov 1, 2000'}","{'url': '', 'email': ''}",https://shared.akamai.steamstatic.com/store_it...,"{'ids': [], 'notes': None}"
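
Several columns above hold nested values, which the csv writer stores as the text form of the Python object. One way to turn such a cell back into a dictionary or list is `ast.literal_eval`, sketched below on two cells copied from the Counter-Strike row. This assumes the cells are read back as plain strings, and it is only one possible approach to cleaning them.

```python
import ast

# cell values copied from the Counter-Strike row above
recommendations_cell = "{'total': 157131}"
platforms_cell = "{'windows': True, 'mac': True, 'linux': True}"

# literal_eval only accepts Python literals, so it will not run arbitrary code
recommendations = ast.literal_eval(recommendations_cell)
platforms = ast.literal_eval(platforms_cell)

print(recommendations['total'])
print([name for name, supported in platforms.items() if supported])
```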


# **Download SteamSpy Data**

To retrieve data from SteamSpy we perform a very similar process. Our parse function is a little simpler because of how the data is returned, and the maximum polling rate of this API is higher, so we can set a lower value for pause in the process_batches function and download more quickly. Apart from that, we set the new variables and call the process_batches function once again.

In [52]:
def parse_steamspy_request(appid, name):
    """Parser to handle SteamSpy API data."""
    url = "https://steamspy.com/api.php"
    parameters = {"request": "appdetails", "appid": appid}

    json_data = get_request(url, parameters)
    return json_data


# set files and columns
download_path = '/content/drive/MyDrive/Colab Notebooks/data/download'
steamspy_data = 'steamspy_data.csv'
steamspy_index = 'steamspy_index.txt'

steamspy_columns = [
    'appid', 'name', 'developer', 'publisher', 'score_rank', 'positive',
    'negative', 'userscore', 'owners', 'average_forever', 'average_2weeks',
    'median_forever', 'median_2weeks', 'price', 'initialprice', 'discount',
    'languages', 'genre', 'ccu', 'tags'
]

reset_index(download_path, steamspy_index)
index = get_index(download_path, steamspy_index)

# Wipe data file if index is 0
prepare_data_file(download_path, steamspy_data, index, steamspy_columns)

process_batches(
    parser=parse_steamspy_request,
    app_list=app_list,
    download_path=download_path,
    data_filename=steamspy_data,
    index_filename=steamspy_index,
    columns=steamspy_columns,
    begin=index,
    end=20,
    batchsize=5,
    pause=0.3
)

Starting at index 0:

Exported lines 0-4 to steamspy_data.csv. Batch 0 time: 0:00:03 (avg: 0:00:03, remaining: 0:00:10)
Exported lines 5-9 to steamspy_data.csv. Batch 1 time: 0:00:03 (avg: 0:00:03, remaining: 0:00:07)
Exported lines 10-14 to steamspy_data.csv. Batch 2 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:04)
Exported lines 15-19 to steamspy_data.csv. Batch 3 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:00)

Processing batches complete. 20 apps written


In [53]:
# inspect downloaded steamspy data
pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/download/steamspy_data.csv').head()

Unnamed: 0,appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
0,10,Counter-Strike,Valve,Valve,,240080,6307,0,"10,000,000 .. 20,000,000",12188,57,159,35,999,999,0,"English, French, German, Italian, Spanish - Sp...",Action,13846,"{'Action': 5489, 'FPS': 4916, 'Multiplayer': 3..."
1,20,Team Fortress Classic,Valve,Valve,,7465,1117,0,"5,000,000 .. 10,000,000",3881,0,11,0,499,499,0,"English, French, German, Italian, Spanish - Sp...",Action,70,"{'Action': 764, 'FPS': 330, 'Multiplayer': 279..."
2,30,Day of Defeat,Valve,Valve,,6324,678,0,"5,000,000 .. 10,000,000",954,0,18,0,499,499,0,"English, French, German, Italian, Spanish - Spain",Action,81,"{'FPS': 801, 'World War II': 270, 'Multiplayer..."
3,50,Half-Life: Opposing Force,Gearbox Software,Valve,,23422,1159,0,"2,000,000 .. 5,000,000",992,0,114,0,499,499,0,"English, French, German, Korean",Action,136,"{'FPS': 931, 'Action': 360, 'Classic': 289, 'S..."
4,60,Ricochet,Valve,Valve,,4867,1027,0,"5,000,000 .. 10,000,000",565,0,7,0,499,499,0,"English, French, German, Italian, Spanish - Sp...",Action,10,"{'Action': 593, 'FPS': 138, 'Multiplayer': 120..."


In [54]:
def parse_steamspy_html(appid, name):
    """Parse HTML from SteamSpy to extract followers and old userscore."""
    url = f"https://steamspy.com/app/{appid}"
    try:
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to fetch data for appid {appid}. HTTP status: {response.status_code}")
            return {"appid": appid, "name": name, "followers": None, "old_userscore": None}

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract followers count
        followers_tag = soup.find('strong', string='Followers')
        if followers_tag:
            followers_text = followers_tag.next_sibling.strip()  # Extract text after "Followers"
            followers = followers_text.lstrip(': ').replace(',', '')  # Remove leading ": " and commas
        else:
            followers = None
            print(f"Followers not found for appid {appid}")

        # Extract old userscore
        userscore_tag = soup.find('strong', string='Old userscore:')
        if userscore_tag:
            old_userscore = userscore_tag.next_sibling.strip()  # Extract text after "Old userscore:"
        else:
            old_userscore = None
            print(f"Old userscore not found for appid {appid}")

        return {
            "appid": appid,
            "name": name,
            "followers": followers,
            "old_userscore": old_userscore
        }
    except Exception as e:
        print(f"Error while processing appid {appid}: {e}")
        return {"appid": appid, "name": name, "followers": None, "old_userscore": None}

# Define the file paths and columns
download_path = '/content/drive/MyDrive/Colab Notebooks/data/download'
steamspy_extended_data = 'steamspy_data_extended.csv'
steamspy_extended_index = 'steamspy_data_extended_index.txt'
full_data_path = f"{download_path}/{steamspy_extended_data}"
index_file_path = f"{download_path}/{steamspy_extended_index}"

steamspy_extended_columns = [
    'appid', 'name', 'followers', 'old_userscore'
]

# Reset index for the extended data file
reset_index(download_path, steamspy_extended_index)
index = get_index(download_path, steamspy_extended_index)

# Prepare the file to write results
prepare_data_file(download_path, steamspy_extended_data, index, steamspy_extended_columns)

# Process and write results to the file
process_batches(
    parser=parse_steamspy_html,
    app_list=app_list,
    download_path=download_path,
    data_filename=steamspy_extended_data,
    index_filename=steamspy_extended_index,
    columns=steamspy_extended_columns,
    begin=index,
    end=20,  # Adjust as needed
    batchsize=5,
    pause=0.3
)

Starting at index 0:

Exported lines 0-4 to steamspy_data_extended.csv. Batch 0 time: 0:00:06 (avg: 0:00:06, remaining: 0:00:18)
Exported lines 5-9 to steamspy_data_extended.csv. Batch 1 time: 0:00:06 (avg: 0:00:06, remaining: 0:00:11)
Exported lines 10-14 to steamspy_data_extended.csv. Batch 2 time: 0:00:05 (avg: 0:00:06, remaining: 0:00:06)
Exported lines 15-19 to steamspy_data_extended.csv. Batch 3 time: 0:00:05 (avg: 0:00:06, remaining: 0:00:00)

Processing batches complete. 20 apps written


In [55]:
# inspect downloaded extended steamspy data
pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/download/steamspy_data_extended.csv').head()

   appid                       name  followers old_userscore
0     10             Counter-Strike     316536           97%
1     20      Team Fortress Classic      10665           86%
2     30              Day of Defeat       8017           90%
3     50  Half-Life: Opposing Force      12127           95%
4     60                   Ricochet       3070           82%
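
The HTML parser relies on finding a label tag and reading the text node that follows it. The sketch below applies the same calls to a hand-written snippet, so it runs without a network connection. The markup is an assumption about the relevant part of the page, not a copy of it, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for the relevant part of the page (an assumption)
html = """
<p>
<strong>Followers</strong>: 316,536<br>
<strong>Old userscore:</strong> 97%<br>
</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# the value sits in the text node directly after the label tag
followers_tag = soup.find('strong', string='Followers')
followers = followers_tag.next_sibling.strip().lstrip(': ').replace(',', '')

userscore_tag = soup.find('strong', string='Old userscore:')
old_userscore = userscore_tag.next_sibling.strip()

print(followers, old_userscore)
```

Both values come back as strings, so a later cleaning step would still need to convert them to numbers.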


# **Download SteamCharts Data**

In [56]:
def parse_steamcharts_html(appid, name):
    """
    Parse HTML from SteamCharts to extract 24-hour peak, all-time peak,
    and the all-time peak date based on the Peak Players table.
    """
    url = f"https://steamcharts.com/app/{appid}"
    try:
        response = requests.get(url)
        if response.status_code != 200:
            print(f"Failed to fetch data for appid {appid}. HTTP status: {response.status_code}")
            return {"appid": appid, "name": name, "24-hour peak": None, "all-time peak": None, "all-time peak date": None}

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract 24-hour peak
        peak_24h_tag = soup.find('div', class_='app-stat').find_next('div', class_='app-stat')
        peak_24h = peak_24h_tag.find('span', class_='num').text.replace(',', '') if peak_24h_tag else None

        # Extract all-time peak
        all_time_peak_tag = peak_24h_tag.find_next('div', class_='app-stat') if peak_24h_tag else None
        all_time_peak = all_time_peak_tag.find('span', class_='num').text.replace(',', '') if all_time_peak_tag else None

        # Extract all-time peak date
        all_time_peak_date = None
        if all_time_peak:
            peak_players_table = soup.find('table', class_='common-table')
            if peak_players_table:
                rows = peak_players_table.find_all('tr')
                for row in rows:
                    # Find the row with the matching all-time peak value
                    cells = row.find_all('td')
                    if len(cells) > 4:
                        peak_value = cells[4].text.replace(',', '').strip()
                        if peak_value == all_time_peak:
                            all_time_peak_date = cells[0].text.strip()
                            break

        return {
            "appid": appid,
            "name": name,
            "24-hour peak": peak_24h,
            "all-time peak": all_time_peak,
            "all-time peak date": all_time_peak_date,
        }
    except Exception as e:
        print(f"Error while processing appid {appid}: {e}")
        return {"appid": appid, "name": name, "24-hour peak": None, "all-time peak": None, "all-time peak date": None}

# Set files and columns
download_path = '/content/drive/MyDrive/Colab Notebooks/data/download'
steamcharts_data = 'steamcharts_data.csv'
steamcharts_index = 'steamcharts_index.txt'

steamcharts_columns = [
    'appid', 'name', '24-hour peak', 'all-time peak', 'all-time peak date'
]

# Reset index for the SteamCharts data file
reset_index(download_path, steamcharts_index)
index = get_index(download_path, steamcharts_index)

# Prepare the file to write results
prepare_data_file(download_path, steamcharts_data, index, steamcharts_columns)

# Process and write results to the file
process_batches(
    parser=parse_steamcharts_html,
    app_list=app_list,
    download_path=download_path,
    data_filename=steamcharts_data,
    index_filename=steamcharts_index,
    columns=steamcharts_columns,
    begin=index,
    end=20,  # Adjust as needed
    batchsize=5,
    pause=0.3
)

Starting at index 0:

Exported lines 0-4 to steamcharts_data.csv. Batch 0 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:11)
Exported lines 5-9 to steamcharts_data.csv. Batch 1 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:07)
Exported lines 10-14 to steamcharts_data.csv. Batch 2 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:04)
Exported lines 15-19 to steamcharts_data.csv. Batch 3 time: 0:00:04 (avg: 0:00:04, remaining: 0:00:00)

Processing batches complete. 20 apps written


In [57]:
# inspect downloaded steamcharts data
pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/download/steamcharts_data.csv').head()

   appid                       name  24-hour peak  all-time peak all-time peak date
0     10             Counter-Strike         17784          65188       January 2013
1     20      Team Fortress Classic            89            210       January 2020
2     30              Day of Defeat            97            654       January 2013
3     50  Half-Life: Opposing Force           156           5373        August 2023
4     60                   Ricochet             8             87           May 2019
