# IndieP: Predicting the Success of Indie Games on Steam

If you haven't been to a video game convention such as PAX before, let me paint a picture for you... indie mega booth, with 10 games chosen as the PAX 10, the indie games designated the "best" of the year by some committee. But what is "best"? What factors determine which of the thousands of games released each year will succeed, which will be remembered for years to come, and which will fade away?

In this project, I will create a dataset of all of the games in the "Indie" genre of the Steam store, and investigate the factors (or features) that lead to a successful indie game.

## 1. Data Collection

The goal is to create a dataset that includes details for every indie game available on Steam. I'm interested in learning which features have the greatest effect on a game's success, and in using this information to predict which new games will be successful.

...API, etc ... steamfront python wrapper


### 1.1 Gathering Initial Data from Steam Store

While the genre tags of each Steam game are accessible through Steamfront, the Steam API only provides a list of *all* available apps, with no way to filter them efficiently. A few alternative user-made APIs offer more features and can be filtered by genre, but because they are user-maintained and not always online, I chose to avoid them for this initial step so that our data comes directly from the main source. (We will use one user-made API, SteamSpy, later to collect estimated sales and popularity data, since as far as I know there's no other way to get this information.) Rather than make an API request for every single app (~70,000 requests) and check whether it fits our genre, it is more efficient to scrape a list of Indie games from Steam's online store:

https://store.steampowered.com/search/?sort_by=Name_ASC&tags=492&category1=998

Here, I've filtered Steam's full game list to include only apps with the tag "indie," further narrowed the list down to include only games (excluding other software, downloadable content, demos, and soundtracks), and sorted the list alphabetically. There were 27,271 results at the time of writing.

I've learned to love (and hate) [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) in my prior web scraping projects, but for this project I've chosen the simpler and more elegant [Scrapy](https://scrapy.org/), which will have a much easier time crawling through the 1,085 pages of results and is more flexible when Steam updates its store, which happens frequently (plus, I've never used it and I like trying new things!). Scrapy also makes it easy to control a rate limiter, which is a plus since I want the scraper to remain well-behaved.
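As a rough illustration of the rate limiting mentioned above, the dictionary below uses standard Scrapy setting names and could be supplied as a spider's `custom_settings`. The values are assumptions chosen for illustration, not the configuration actually used for this project, and `estimated_minutes` is a hypothetical helper for judging how long a polite crawl might take.

```python
# Hypothetical values; the setting names are standard Scrapy settings,
# but the numbers are illustrative assumptions only.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 2.0,                # base delay (seconds) between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 2.0,
    "AUTOTHROTTLE_MAX_DELAY": 30.0,
}


def estimated_minutes(pages, delay_seconds):
    """Lower bound on crawl time if every page costs at least one delay."""
    return pages * delay_seconds / 60
```

With a 2-second delay, the 1,085 result pages would take at least about 36 minutes, which is a reasonable price for staying well-behaved.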

For the initial data collection, I scraped the following features for each game, as they are universal, easy to understand, and all accessible without following additional links (which would increase the number of requests):
* AppID
* Url
* Title
* Release Date
* Price

When I initially gathered this data, I only collected the Title, Release Date, and Price - but quickly discovered that there's no guarantee that these three features would be unique and there could easily be multiple games with the same title. The AppID, however, is a unique identifier and will be used to distinguish between data points throughout.
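One way to sanity-check a scraped AppID is to compare it with the number embedded in the store URL. The sketch below assumes URLs follow the `/app/<AppID>/` pattern seen in the scraped data; `appid_from_url` is a hypothetical helper and not part of the scraper itself.

```python
import re

# Matches store URLs of the form https://store.steampowered.com/app/<digits>/...
APP_URL_PATTERN = re.compile(r"store\.steampowered\.com/app/(\d+)")


def appid_from_url(url):
    """Return the AppID embedded in an /app/ store URL, or None otherwise."""
    match = APP_URL_PATTERN.search(url)
    return match.group(1) if match else None


print(appid_from_url("https://store.steampowered.com/app/603750/"))  # 603750
print(appid_from_url("https://store.steampowered.com/sub/28005/"))   # None
```

URLs under `/sub/` point to packages rather than individual apps, so the helper returns `None` for them instead of guessing.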

In [166]:
import csv
import json
import time 
import os

import pandas as pd
import requests
import numpy as np

pd.set_option("display.max_columns", 100)  # full option name; the bare "max_columns" pattern can be ambiguous in newer pandas

To start, let's load in the basic data we gathered into a Pandas dataframe - for now, all we really need here is the AppIDs, but we'll load in everything to begin. We'll use the rest of the columns later when we're cleaning up the full dataset, in the case of apps that had unsuccessful API calls.

In [107]:
def get_scrape_data(f):
    ''' Load in json data that we previously collected to extract a list of appIDs for all Indie games '''
    
    data = pd.read_json(f)
    return data

In [121]:
data = get_scrape_data('./steam-scrapy/steam_scrape/output/indie_all_final.json')

In [122]:
data.head()

Unnamed: 0,appid,price,release_date,title,url
0,1034230,,"Mar 17, 2019",***,https://store.steampowered.com/app/1034230/_/?...
1,603750,$2.99,"Mar 10, 2017",- Arcane Raise -,https://store.steampowered.com/app/603750/_Arc...
2,729370,$14.99,"Jan 23, 2019",-KLAUS-,https://store.steampowered.com/app/729370/KLAU...
3,638510,,2019,.Age,https://store.steampowered.com/app/638510/Age/...
4,1091520,$0.99,"Sep 9, 2019",鸿门一宴(Malicious Dinner),https://store.steampowered.com/app/1091520/Mal...


Right off the bat, we can see that some of the titles are questionable, so we'll need to dig into the data later to ensure that everything we include in our dataset would be considered a game (no demos/soundtracks/videos/etc - we filtered these out initially but you never know what crept through). We can also see that not all of the release dates have the full day, month, and year, and that not all games have prices (of course, some are free to play). We'll keep everything as is until we have all of the data collected, and then work on cleaning it up into a nice dataset.
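To give a flavor of the cleanup that will eventually be needed, here is a minimal sketch of how the price and release date strings might be normalized. Both helpers are hypothetical, and the rules they encode (for example, treating any string starting with "Free" as a price of 0.0 and leaving incomplete dates as `None`) are assumptions rather than decisions made in this project.

```python
from datetime import datetime


def parse_price(raw):
    """Convert a scraped price string to a float in dollars.

    Assumptions: a missing price stays None, and anything starting
    with "Free" (e.g. "Free to Play") counts as 0.0.
    """
    if raw is None or raw.strip() == "":
        return None
    text = raw.strip()
    if text.lower().startswith("free"):
        return 0.0
    return float(text.lstrip("$").replace(",", ""))


def parse_release_date(raw):
    """Return a date for strings like 'Mar 17, 2019', or None if incomplete.

    Note: %b relies on English month abbreviations in the current locale.
    """
    try:
        return datetime.strptime(raw, "%b %d, %Y").date()
    except (TypeError, ValueError):
        return None
```

A year-only value such as `2019` comes back as `None` here, which keeps the ambiguity visible instead of inventing a day and month.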

In [110]:
applist=data['appid']

In [111]:
applist.head()

0    1034230
1     603750
2     729370
3     638510
4    1091520
Name: appid, dtype: object

In [112]:
applist.describe()

count       22758
unique      22753
top       1155470
freq            2
Name: appid, dtype: object

For some reason, it looks like there are a few duplicates in the applist, which is surprising because the AppID should be a unique identifier. Let's look under the hood to see if we can figure out why:

In [113]:
data[data.duplicated(subset='appid')]

Unnamed: 0,appid,price,release_date,title,url
4612,230820,$24.99,"May 28, 2013",The Night of the Rabbit Premium Edition,https://store.steampowered.com/sub/28005/?snr=...
12707,430280,$4.99,"Dec 21, 2015",Nature Defenders,https://store.steampowered.com/app/430280/Natu...
12782,1155470,Free,"Oct 7, 2019",Mythic Ocean: Prologue,https://store.steampowered.com/app/1155470/Myt...
12833,434260,$0.99,"Feb 15, 2016",My Name is Mayo,https://store.steampowered.com/app/434260/My_N...
15982,557260,Free to Play,"Jan 16, 2017",iREC,https://store.steampowered.com/app/557260/iREC...


We can see that, for some reason, these 5 AppIDs have been included more than once. Before just deleting the duplicates, we want to make sure that the full rows are duplicated, not just the AppID:

In [114]:
data.loc[data['appid'] == '230820']

Unnamed: 0,appid,price,release_date,title,url
4611,230820,,"May 28, 2013",The Night of the Rabbit,https://store.steampowered.com/app/230820/The_...
4612,230820,$24.99,"May 28, 2013",The Night of the Rabbit Premium Edition,https://store.steampowered.com/sub/28005/?snr=...


In [115]:
data.loc[data['appid'] == '430280']

Unnamed: 0,appid,price,release_date,title,url
12658,430280,$4.99,"Dec 21, 2015",Nature Defenders,https://store.steampowered.com/app/430280/Natu...
12707,430280,$4.99,"Dec 21, 2015",Nature Defenders,https://store.steampowered.com/app/430280/Natu...


In [116]:
data.loc[data['appid'] == '1155470']

Unnamed: 0,appid,price,release_date,title,url
12727,1155470,Free,"Oct 7, 2019",Mythic Ocean: Prologue,https://store.steampowered.com/app/1155470/Myt...
12782,1155470,Free,"Oct 7, 2019",Mythic Ocean: Prologue,https://store.steampowered.com/app/1155470/Myt...


So, it looks like in the first case, both the original and premium versions of the game are given the same AppID. This could be important to keep in mind for later, but since we'll be making API calls based on a specific AppID, it doesn't make sense to keep both rows in our list: a single API call for an AppID can't return two different responses. Presumably, information on both versions (or at least the fact that multiple versions exist) is included in the same response, but we'll want to check this later. The remaining four duplicates appear to have simply been scraped twice, so we can go ahead and drop the duplicate rows.
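Rather than looking up each repeated AppID one at a time, the same check can be sketched in a single pass. The frame below is a small stand-in built from rows shown above (URLs omitted), and the variable names are illustrative only.

```python
import pandas as pd

# Small stand-in frame built from rows shown above (URLs omitted).
sample = pd.DataFrame({
    "appid": ["230820", "230820", "430280", "430280"],
    "price": [None, "$24.99", "$4.99", "$4.99"],
    "title": [
        "The Night of the Rabbit",
        "The Night of the Rabbit Premium Edition",
        "Nature Defenders",
        "Nature Defenders",
    ],
})

# keep=False marks every copy of a repeated AppID, not just the later ones.
same_appid = sample[sample.duplicated(subset="appid", keep=False)]

# Rows that are identical in every column (true duplicates).
exact_copies = sample[sample.duplicated(keep=False)]

# AppIDs that repeat but whose rows differ somewhere (e.g. base vs. premium edition).
conflicting = sorted(set(same_appid["appid"]) - set(exact_copies["appid"]))
print(conflicting)  # ['230820']
```

Any AppID that lands in `conflicting` deserves a manual look before its extra rows are dropped.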

In [222]:
data = data.drop_duplicates(subset='appid')

Before we finish and write our applist to file, there's one other caveat with the scraped data. It turns out (after a handful of confusing error messages from the Steam API later on...) that when a game includes DLC packages, soundtracks, or other media, the AppID scraped from the store is actually a comma-separated string of AppIDs for the main game and the other media. The first 5 instances are shown below, but there are quite a few more. In all cases, the first AppID listed is the game, and those following it are additional media. Since we only want the games and aren't interested in the DLC or soundtracks (in fact, we'll later collect just a "yes or no" answer as to whether the game has DLC, etc.), we want to keep only the part of each string before the first comma (we'll likely use this technique again later in the project to create new features out of existing ones). Then, we'll want to check for duplicates again.

In [241]:
stop = 5
c = 0
for index, row in data.iterrows():
    if ',' in row['appid'] and c<stop:
        print(row['appid'])
        c+=1
    

258090,262141,262142,262230,262300
220820,220822
92300,92302,92303
15500,15520
273700,259830
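As a small, self-contained illustration of the "keep everything before the first comma" idea, the sketch below applies pandas string methods to the five strings printed above. The variable names are illustrative only.

```python
import pandas as pd

# The five concatenated strings printed above.
raw_ids = pd.Series([
    "258090,262141,262142,262230,262300",
    "220820,220822",
    "92300,92302,92303",
    "15500,15520",
    "273700,259830",
])

# Split once on the first comma and keep the leading piece (the game's AppID).
first_ids = raw_ids.str.split(",", n=1).str[0]
print(first_ids.tolist())
# ['258090', '220820', '92300', '15500', '273700']
```

Strings without a comma pass through unchanged, so the same expression can be applied to the whole column.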


In [256]:
data['appid'] = data['appid'].str.split(',', n=1).str[0]  # keep only the first AppID (the game itself)

In [270]:
applist=data['appid']

In [271]:
applist.describe()

count      22754
unique     22695
top       209190
freq           4
Name: appid, dtype: object

In [276]:
applist = applist.drop_duplicates()

In [277]:
applist.describe()

count      22695
unique     22695
top       601320
freq           1
Name: appid, dtype: object

Now we have a dataset of AppIDs with no duplicates and no concatenated strings that will throw errors upon API requests. We'll write the data to file for safekeeping, take this list of apps to the Steam Store API, and start filling in our dataset.

In [279]:
applist.to_csv('applist.csv')

### 1.2 Using Steam Store API to Get Game Data

Now that we have a list of AppIDs for all of the Indie games on Steam, we can use the Steam API to obtain as much information as we can about each app. We start with some functions to make and parse through API requests, and take a look at the Steam API data that's returned for one appID before iterating through our whole list.

In [123]:
def get_api_request(url, parameters=None):
    '''Return the JSON response for a GET request, retrying on SSL errors or failed responses.'''
    try:
        response = requests.get(url=url, params=parameters)
    except requests.exceptions.SSLError as err:
        print('SSL Error:', err)
        
        for i in range(5, 0, -1):
            print('Waiting... ({})'.format(i))
            time.sleep(1)
        print('Retrying...')
        
        # recursively try again if errored
        return get_api_request(url, parameters)
    
    if response:
        return response.json()
    else:
        # A falsy response (HTTP status >= 400) usually means too many requests in a given time
        print('No response, waiting 10 seconds...')
        time.sleep(10)
        print('Retrying...')
        return get_api_request(url, parameters)
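The function above retries indefinitely, which is convenient for a long unattended run but could loop forever if the API stays down. As an alternative sketch, not the approach used in this project, a bounded retry with exponential backoff could look like the following. `call_with_retries` and its parameters are hypothetical names, and catching every `Exception` is a simplification for illustration.

```python
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A request could then be wrapped as `call_with_retries(lambda: requests.get(url, params=parameters))`. The `sleep` argument exists only so the waiting can be swapped out when testing.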

In [124]:
def parse_steam_request(appid):
    '''Makes a request to the Steam Store API and returns all data for a given AppID.
    
    Returns : data in json format
    '''
    
    url = "http://store.steampowered.com/api/appdetails/"
    param = {"appids": appid}
    
    data = get_api_request(url, parameters=param)
    
    json_app_data = data[str(appid)]
    
    if json_app_data['success']:
        data = json_app_data['data']
    else:
        data = {'steam_appid': appid}
        
    return data
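To make the response handling easier to follow without hitting the network, here is a sketch that unpacks hand-written payloads shaped the way the function above expects: a dictionary keyed by the AppID as a string, holding a `success` flag and, on success, a `data` dictionary. The payloads are made up for illustration and are not real API output; `unpack_appdetails` is a hypothetical helper.

```python
def unpack_appdetails(payload, appid):
    """Return the 'data' dict for appid, or a stub holding only the AppID."""
    entry = payload.get(str(appid), {})
    if entry.get("success") and "data" in entry:
        return entry["data"]
    return {"steam_appid": appid}


# Hand-written payloads that mimic the response shape (not real API output).
ok_payload = {"1091520": {"success": True, "data": {"steam_appid": 1091520, "type": "game"}}}
failed_payload = {"999": {"success": False}}

print(unpack_appdetails(ok_payload, 1091520))
# {'steam_appid': 1091520, 'type': 'game'}
print(unpack_appdetails(failed_payload, 999))
# {'steam_appid': 999}
```

Returning a stub for failed lookups keeps one row per AppID, so unsuccessful calls can be found and filled in later.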

Pick an arbitrary appID to make sure the API call to Steam works, and see what kind of data we're getting.

In [125]:
test_data=parse_steam_request(1091520)

In [126]:
test_data.keys()

dict_keys(['type', 'name', 'steam_appid', 'required_age', 'is_free', 'detailed_description', 'about_the_game', 'short_description', 'supported_languages', 'header_image', 'website', 'pc_requirements', 'mac_requirements', 'linux_requirements', 'developers', 'publishers', 'price_overview', 'packages', 'package_groups', 'platforms', 'categories', 'genres', 'screenshots', 'movies', 'achievements', 'release_date', 'support_info', 'background', 'content_descriptors'])

And we can take a peek at the full data for one game (trying not to think too much about the particular game I picked, with questionable content but no age restrictions... I guess this is an indication of the full scope of Indie games we're going to find).

In [127]:
test_data

{'type': 'game',
 'name': '鸿门一宴(Malicious Dinner)',
 'steam_appid': 1091520,
 'required_age': 0,
 'is_free': False,
 'detailed_description': 'Malicious Dinner is an AVG indie game I made <strong>(not a Gal or sexual game)</strong> after work. It\'s aslo my 1st diy game. Ur choices will drive the entire plot. Pls do it wisely. Different choices will have influence on ur quality, popularity, game difficulty and results. <br><img src="https://steamcdn-a.akamaihd.net/steam/apps/1091520/extras/games.jpg?t=1572422069" ><br>Including 1 bonus egg, 4 casual games, 5 careers, 6 results and 100+ hidden items. This is only 1st chapter of the story. Will optimize, update and add 4 more chapters of DLC later.<h2 class="bb_tag"><strong>【Chapter 1】</strong></h2><img src="https://steamcdn-a.akamaihd.net/steam/apps/1091520/extras/故事-动画_1-000.png?t=1572422069" ><br>Now U play a school girl who is invited to a volunteer-reward active, only to find it\'s actually a mailcious dinner. How to survive from it?

For now, let's keep everything except: 'detailed_description' (because, if you look, it can be *really long* and there are two other fields with very similar information), 'pc_requirements', 'mac_requirements', 'linux_requirements' (because the 'platforms' feature gives us plenty of information there), 'legal_notice' (because who wants to deal with that...), and 'email' (probably won't be relevant). We'll keep all of the images such as screenshots and assets for now, even if we end up not using all of them, because as a stretch goal we may want to do some image analysis later to see if we can learn anything about the art styles used in each game.

In [68]:
cols_to_keep = [
    'type', 'name', 'steam_appid', 'required_age', 'is_free', 'about_the_game', 'short_description', 'supported_languages',
    'header_image', 'website', 'developers', 'publishers', 'price_overview', 'packages', 'package_groups', 'platforms', 
    'categories', 'genres', 'screenshots', 'movies', 'achievements', 'release_date', 'background', 'content_descriptors'
]

### 1.3 Collecting, Formatting and Writing Steam Store Data

With the framework set up to collect the data we want from the Steam Store API, we can go ahead and iterate through the full list of AppIDs, writing to a CSV file. Since we have a large number of API requests to make, in order to avoid catastrophic data loss or massive overloading of the API, we'll want to split up the requests into batches and make sure to pause briefly between requests (Thanks to Nik Davis for this tip!).

In [190]:
def make_app_data_list(start, stop, parse_func, applist, pause, output=True):
    """Return list of app data generated from parser.
    
    parse_func : function to handle request
    """
    app_data = []
    
    # iterate through each row of applist between start and stop
    for index, appid in applist[start:stop].items():  # .items() replaces .iteritems(), which was removed in pandas 2.0
        if output:  # print details; useful for short runs but should be suppressed for full production
            print('Index: {}; Appid: {}'.format(index, appid))
        
        # retrieve app data for a row, handled by supplied parser, and append to list
        data = parse_func(appid)
        app_data.append(data)

        time.sleep(pause)  # prevent overloading API with requests
    
    return app_data

First, check on a small subset of apps to make sure the above function works as intended:

In [191]:
app_data = make_app_data_list(0,2,parse_steam_request, applist, 5)

Index: 0; Appid: 1034230
Index: 1; Appid: 603750


Next, create and initialize a CSV file to which we will iteratively write our data.

In [280]:
def init_csv_file(path, filename, columns):
    '''Create file and write header row'''
    file = os.path.join(path, filename)
    with open(file, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()

In [282]:
datafile = 'steam_data.csv'
data_path = './data/'
init_csv_file(data_path, datafile, cols_to_keep)
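Two behaviors of `csv.DictWriter` are worth knowing when writing API records against a fixed column list: keys that are not in `fieldnames` can be silently dropped with `extrasaction='ignore'`, and missing keys are written as empty strings. Nested values such as dictionaries are stored as their string representation, so they need to be parsed when read back. The sketch below demonstrates this with a made-up record and an in-memory buffer instead of a file.

```python
import ast
import csv
import io

columns = ["name", "steam_appid", "platforms", "website"]
row = {
    "name": "Example Game",                        # made-up record
    "steam_appid": 1,
    "platforms": {"windows": True, "mac": False},  # nested value
    "support_info": {"email": "x@example.com"},    # not in columns, so dropped
}

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
writer.writerow(row)    # the missing 'website' key is written as an empty string

buffer.seek(0)
loaded = next(csv.DictReader(buffer))
platforms = ast.literal_eval(loaded["platforms"])  # back to a real dict
print(platforms["windows"])  # True
```

Everything read back from a CSV is a string (the AppID comes back as `'1'`), so column types have to be restored during cleaning.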

Finally, collect and write the entire dataset in a batch process, saving to file every 100 lines (or whatever batch size you choose). This process takes quite a while, as there are over 22,000 apps to process, so as an example I've reduced the "stop" parameter to only 1000. Change this to "-1" or use the default to run the full dataset.
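The batching itself comes down to computing start and stop positions for each chunk. As a minimal sketch using only the standard library (`batch_bounds` is a hypothetical helper, not part of the notebook), the boundaries can be generated like this:

```python
def batch_bounds(start, stop, batchsize):
    """Return (batch_start, batch_stop) pairs covering [start, stop)."""
    return [
        (lo, min(lo + batchsize, stop))
        for lo in range(start, stop, batchsize)
    ]


print(batch_bounds(0, 250, 100))
# [(0, 100), (100, 200), (200, 250)]
```

The last batch is simply shorter when the range is not a multiple of the batch size.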

In [284]:
def get_and_write_data_batches(path, filename, columns, applist, parse_func, start=0, stop=-1, batchsize=100, pause=2):
    '''Collect app data in batches and append each batch to the CSV file.'''
    
    if stop == -1:  # process entire app list
        stop = len(applist)
        
    batches = np.arange(start, stop, batchsize)  # "start" value for each batch request
    batches = np.append(batches, stop)  # final "stop" value
    start_time = time.time()
    
    for i in range(len(batches)-1):
        start_time_batch = time.time()
        
        batch_start = batches[i]
        batch_stop = batches[i+1]
        print('Batch #: {}'.format(i+1))
        
        # get data for batch
        print('Collecting data from API...')
        app_data = make_app_data_list(batch_start, batch_stop, parse_func, applist, pause, output=False)
        
        # write batch to file, using the columns passed in rather than a global
        print('Writing data to file...')
        file = os.path.join(path, filename)
        with open(file, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')
            time.sleep(1)
            writer.writerows(app_data)
        
        # get time info and estimate remaining time
        total_time = time.time()-start_time
        batch_time = time.time()-start_time_batch
        estimated_time_remaining = (stop-start)/batchsize*batch_time-total_time
        print('Elapsed time: {} seconds'.format(round(total_time,3)))
        print('Estimated time remaining: {} minutes'.format(round(estimated_time_remaining/60,3)))
        print(' ')
    print('Done!')
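The remaining-time estimate above scales the most recent batch time to the whole run. An alternative sketch, offered here only as one possible variation, averages over all finished batches, which tends to be steadier when individual batches vary. `estimate_remaining_seconds` is a hypothetical helper.

```python
def estimate_remaining_seconds(elapsed, batches_done, total_batches):
    """Estimate remaining time as (average time per batch) x (batches left)."""
    if batches_done == 0:
        raise ValueError("need at least one finished batch")
    average = elapsed / batches_done
    return average * (total_batches - batches_done)


print(estimate_remaining_seconds(250.0, 1, 10))
# 2250.0
```

For example, if the first of 10 batches takes 250 seconds, the estimate is 2,250 seconds, or 37.5 minutes.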

In [285]:
get_and_write_data_batches(path=data_path, filename=datafile, columns=cols_to_keep, applist=applist, parse_func = parse_steam_request, start=0, stop=1000, batchsize=100, pause=2)

Batch #: 1
Collecting data from API...
Writing data to file...
Elapsed time: 250.309 seconds
Estimated time remaining: 37.546 minutes
 
Batch #: 2
Collecting data from API...
Writing data to file...
Elapsed time: 489.038 seconds
Estimated time remaining: 31.638 minutes
 
Batch #: 3
Collecting data from API...
Writing data to file...
Elapsed time: 742.828 seconds
Estimated time remaining: 29.918 minutes
 
Batch #: 4
Collecting data from API...
Writing data to file...
Elapsed time: 984.301 seconds
Estimated time remaining: 23.84 minutes
 
Batch #: 5
Collecting data from API...
Writing data to file...
Elapsed time: 1234.803 seconds
Estimated time remaining: 21.17 minutes
 
Batch #: 6
Collecting data from API...
Writing data to file...
Elapsed time: 1476.933 seconds
Estimated time remaining: 15.739 minutes
 
Batch #: 7
Collecting data from API...
Writing data to file...
Elapsed time: 1736.837 seconds
Estimated time remaining: 14.37 minutes
 
Batch #: 8
Collecting data from API...
Writing d

### 1.4 Using SteamSpy to Get Game Popularity Data

In [52]:
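def parse_steamspy_request(appid, name=None):
    """Parser to handle SteamSpy API data.
    
    name is accepted for convenience but is not used in the request.
    """
    url = "https://steamspy.com/api.php"
    parameters = {"request": "appdetails", "appid": appid}
    
    # reuse the request helper defined earlier (get_api_request)
    json_data = get_api_request(url, parameters)
    return json_data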



In [54]:
parse_steamspy_request(603750, 'test')

{'appid': 603750,
 'name': '- Arcane Raise -',
 'developer': 'Arcane Raise',
 'publisher': 'WAX Publishing',
 'score_rank': '',
 'positive': 54,
 'negative': 91,
 'userscore': 0,
 'owners': '50,000 .. 100,000',
 'average_forever': 220,
 'average_2weeks': 0,
 'median_forever': 226,
 'median_2weeks': 0,
 'price': '299',
 'initialprice': '299',
 'discount': '0',
 'languages': 'English',
 'genre': 'Adventure, Casual, Indie, RPG, Strategy',
 'ccu': 0,
 'tags': {'Adventure': 66,
  'RPG': 48,
  'Strategy': 40,
  'Casual': 34,
  'RPGMaker': 30,
  'JRPG': 26,
  'Indie': 24,
  'Fantasy': 24,
  'Story Rich': 19,
  'Rogue-like': 5}}
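Several fields in this response arrive as strings and need converting before they can serve as popularity features. The sketch below shows one way to do that for the values shown above. The helper names are hypothetical, and reading `price` as a value in cents is an assumption based on `'299'` lining up with the $2.99 store price scraped earlier.

```python
def parse_owners(owners):
    """Turn an owners range like '50,000 .. 100,000' into (low, high) integers."""
    low, high = owners.split("..")
    return int(low.strip().replace(",", "")), int(high.strip().replace(",", ""))


def positive_ratio(positive, negative):
    """Share of positive reviews, or None when there are no reviews."""
    total = positive + negative
    return positive / total if total else None


def cents_to_dollars(price):
    """Convert a price string assumed to be in cents (e.g. '299') to dollars."""
    return int(price) / 100


print(parse_owners("50,000 .. 100,000"))  # (50000, 100000)
print(cents_to_dollars("299"))            # 2.99
```

For this game, `positive_ratio(54, 91)` comes out at roughly 0.37, meaning about 37% of its reviews are positive.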