The goal of this little script is to collect a set of IDs for Steam games using Beautiful Soup. We'll go straight to steamdb.info and scrape it, since it has a simple interface. We won't collect info on ALL of the apps: some of them are DLC, and some IDs exist but have nothing behind them. First, let's scrape the first page and look at what comes back.

In [1]:
from bs4 import BeautifulSoup
import requests

In [1]:
response = requests.get("https://steamdb.info/apps/")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

There's a lot to learn here, such as how to parse each row properly. I can't yet extract all the information I want, so I'll come back to this.

In [2]:
# Every app on the page sits in its own table row.
app_text = parser.find_all('tr')
app_embedded = list(app_text)
print(app_embedded[:3])

[<tr>
<th class="no-sort" style="width:119px"> </th>
<th class="span1">AppID</th>
<th>Name</th>
<th class="span2">Last Updated</th>
</tr>, <tr class="app" data-appid="509980">
<td class="applogo">
<img alt="" onerror="this.onerror=null;this.src='/static/img/applogo.gif'" src="/static/camo/apps/509980/capsule_sm_120.jpg">
</img></td>
<td><a href="/app/509980/">509980</a></td>
<td>
<a aria-label="This app is in store" class="pull-right tooltipped tooltipped-w" href="//store.steampowered.com/app/509980/?utm_source=SteamDB&amp;utm_medium=SteamDB&amp;utm_campaign=SteamDB%20Apps%20Page" style="margin:10px 10px 0 0">
<span class="octicon octicon-globe"></span>
</a>
<a class="b" href="/app/509980/">Finding Bigfoot</a>
<i class="subinfo">Game</i>
</td>
<td class="timeago" data-sort="1486673771" title="2017-02-09T20:56:11+00:00">February 9, 2017 – 20:56:11 UTC</td>
</tr>, <tr class="app" data-appid="590380">
<td class="applogo">
<img alt="" onerror="this.onerror=null;this.src='/static/img/applog

In [2]:
def get_app_id(row):
    # Data rows carry the app ID in a data-appid attribute. The header row
    # has no such attribute, so .get() returns None for it.
    return row.attrs.get('data-appid')

In [3]:
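# This isn't working quite yet: it returns the text of the first link that has
# a class attribute, which is not necessarily the name link. We'll try this again later...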
def get_app_name(row):
    links = row.find_all('a')
    for a in links:
        attrs = a.attrs
        if 'class' in attrs:
            text = a.text
            if 'SteamDB Unknown App' in text:
                return None
            else:
                print(a)
                return text
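One possible way to make the name lookup more selective, judging only from the row markup printed above, is to ask for the link with class `b` instead of the first link that has any class. This is a minimal sketch: `get_app_name_by_class` is a hypothetical name, the sample row is trimmed from the output shown earlier, and it assumes SteamDB keeps using `class="b"` for the name link.

```python
from bs4 import BeautifulSoup

# Sample row trimmed from the page output shown earlier.
SAMPLE_ROW = """
<table><tr class="app" data-appid="509980">
<td><a href="/app/509980/">509980</a></td>
<td>
<a aria-label="This app is in store" class="pull-right tooltipped tooltipped-w" href="//store.steampowered.com/app/509980/">
<span class="octicon octicon-globe"></span>
</a>
<a class="b" href="/app/509980/">Finding Bigfoot</a>
<i class="subinfo">Game</i>
</td>
</tr></table>
"""

def get_app_name_by_class(row):
    """Hypothetical variant: read the name from the <a class="b"> link only."""
    link = row.find('a', class_='b')
    if link is None:
        return None
    text = link.get_text(strip=True)
    if 'SteamDB Unknown App' in text:
        return None
    return text

row = BeautifulSoup(SAMPLE_ROW, 'html.parser').find('tr')
name = get_app_name_by_class(row)
print(name)
```

The store-icon link has the classes `pull-right`, `tooltipped` and `tooltipped-w`, so filtering on `b` skips it and lands on the name link.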

Reading the name from the SteamDB row is unreliable, but if we go straight to Steam, we might get a better result:

In [7]:
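# How about we try just going to Steam?
def get_from_steam(app_id):
    # Build the store page URL for this app ID and fetch it.
    loc = "http://store.steampowered.com/app/" + app_id + '/'
    req = requests.get(loc)
    parsed = BeautifulSoup(req.content, 'html.parser')
    title = parsed.title.text
    # A 'Welcome to Steam' title means we did not get a store page for this app.
    if 'Welcome to Steam' in title:
        return None
    title = title.replace('on Steam', '')
    # Keep only ASCII characters (this drops symbols such as the trademark signs).
    stripped = [ch for ch in title if 0 < ord(ch) < 127]
    return ''.join(stripped)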

This works: whenever the title contains 'Welcome to Steam', we treat the app as nonexistent, skip it, and move on to the next entry.
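The title handling can also be pulled out into a pure function, which makes it easy to check without hitting the network. This is only a sketch: `clean_title` is a hypothetical helper that is not part of the notebook, and unlike `get_from_steam` it also trims the trailing space left behind by removing 'on Steam'.

```python
def clean_title(raw_title):
    """Hypothetical helper: reduce a store page <title> to a bare game name."""
    # The store front page means there is no page for this app.
    if 'Welcome to Steam' in raw_title:
        return None
    name = raw_title.replace('on Steam', '')
    # Keep ASCII characters only, as get_from_steam does.
    ascii_only = ''.join(ch for ch in name if 0 < ord(ch) < 127)
    # Assumption: surrounding whitespace is not wanted in the final name.
    return ascii_only.strip()

print(clean_title('Rust on Steam'))
print(clean_title('Welcome to Steam'))
```

Promotional prefixes such as 'Save 20% on' are left untouched by this sketch.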

In [5]:
def is_game(row):
    info = row.find_all('i')
    for i in info:
        if 'Game' in i.text:
            return True
    return False

In [7]:
games = {}
for row in app_embedded:
    app_id = get_app_id(row)
    if app_id is not None:
        # Look the title up on the Steam store rather than parsing it from the row.
        title = get_from_steam(app_id)
        if title is not None:
            games[app_id] = title
    # Then we need to go through all of the possibilities of a and find our name:
    #title = get_app_name(row)

In [None]:
games

{'221380': 'Age of Empires II HD ',
 '230270': 'N++ (NPLUSPLUS) ',
 '25000': 'Overgrowth ',
 '252490': 'Rust ',
 '252870': 'PULSAR: Lost Colony ',
 '253840': 'Shantae: Half-Genie Hero ',
 '260430': 'The Four Kings Casino and Slots ',
 '271240': 'Offworld Trading Company ',
 '272270': 'Torment: Tides of Numenera ',
 '286000': 'Tooth and Tail ',
 '288140': 'Jack Nicklaus Perfect Golf ',
 '289070': 'Save 20% on Sid Meier’s Civilization® VI ',
 '291860': 'Pit People® ',
 '298240': 'Red Crucible®: Firestorm ',
 '298900': 'Space Hulk: Deathwing ',
 '312660': 'Pre-purchase Sniper Elite 4 ',
 '332650': 'The Exiled ',
 '333950': 'Medieval Engineers ',
 '352460': 'Dead Realm ',
 '376210': 'The Isle ',
 '383120': 'Empyrion - Galactic Survival ',
 '390540': 'High Fidelity ',
 '394490': 'Buildanauts ',
 '414390': 'Dirty Bomb® - Down and Dirty Starter Pack ',
 '431650': 'Save 10% on Phoning Home ',
 '434030': 'Aerofly FS 2 Flight Simulator ',
 '439310': 'Until I Have You ',
 '447820': 'Day of Infamy

Now that this works for one page, we need to run it across all the pages. Once we've done that, we can turn the result into a list of IDs.
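Going through all the pages mostly comes down to generating one URL per page. A minimal sketch, assuming the listing pages follow a `page<N>/` path under `https://steamdb.info/apps/`; `build_page_urls` is a hypothetical helper, and whether `page1/` repeats the landing page has not been checked here.

```python
BASE_URL = "https://steamdb.info/apps/"

def build_page_urls(last_page):
    """Hypothetical helper: landing page first, then page1/ .. page<last_page>/."""
    urls = [BASE_URL]
    urls.extend(f"{BASE_URL}page{n}/" for n in range(1, last_page + 1))
    return urls

urls = build_page_urls(19)
print(len(urls))
print(urls[:3])
```

If the landing page and `page1/` turn out to list the same apps, dropping one of them would avoid fetching the same titles twice.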

In [8]:
import pandas as pd

def crawl_pages(page_list):
    games = {'game': [], 'game_id': []}
    print('...', end='')
    for page in page_list:
        # Fetch and parse one listing page.
        response = requests.get("https://steamdb.info/apps/" + page)
        parser = BeautifulSoup(response.content, 'html.parser')
        for row in parser.find_all('tr'):
            app_id = get_app_id(row)
            if app_id is not None:
                title = get_from_steam(app_id)
                if title is not None:
                    print(title, ': ', app_id)
                    #TODO: Make sure we clean up the title so that there are no non-unicode strings
                    games['game'].append(title)
                    games['game_id'].append(app_id)
    # Save the collected names and IDs.
    g_df = pd.DataFrame(games)
    g_df.to_csv("games.csv")

# The landing page plus page1/ through page19/.
pages = ['page' + str(x) + '/' for x in range(1, 20)]
pages.insert(0, '')
crawl_pages(pages)

...Rise: Battle Lines  :  386350
Go! Go! Nippon! ~My First Trip to Japan~  :  251870
Save 30% on Stage of Development: Indie City  :  591480
Soviet Monsters: Ekranoplans  :  372160
Toricky  :  477720
Age of Empires II HD  :  221380
RiME  :  493200
DRAGON BALL XENOVERSE 2  :  454650
Mind Unleashed  :  450740
PULSAR: Lost Colony  :  252870
AltspaceVRThe Social VR App  :  407060
FORTIFY  :  505040
Runbow  :  464650
Labyrinth  :  412310
Torch Cave  :  501690
Sleengster  :  559500
Rage Parking Simulator 2016  :  449630
Squad  :  393380
Pixvana SPIN Technology Preview  :  534690
Torch Cave 2  :  533710
Skull Rush  :  587210
Fur Fun  :  589370
High Fidelity  :  390540
Urban Pirate: The 8-bit Soundtrack  :  523910
Syndrome  :  409320
Clash Cup Turbo  :  393280
Red Crucible: Firestorm  :  298240
Mutant Fighting Cup 2  :  561450
Grappledrome  :  242610
Constellation Distantia  :  570150
Oxygen Not Included  :  457140
Fernbus Simulator  :  427100
Thunderbird: The Legend Begins  :  355460
Rogalia 