# IndieP: Predicting the Success of Indie Games on Steam

If you haven't been to a video game convention such as PAX before, let me paint a picture for you... indie mega booth, with 10 games chosen as the PAX 10, the indie games that have been designated the "best" of the year by some committee. But what is "best"? What factors determine which of the thousands of games released each year will succeed, which will be remembered for years to come, and which will fade away?

In this project, I will create a dataset of all of the games in the "Indie" genre of the Steam store, and investigate the factors (or features) that lead to a successful indie game.

# 1. Data Collection

The goal is to create a dataset that includes details for every indie game available on Steam. I'm interested in gaining insight into which features have the greatest effect on the success of a game, and in using this information to predict which new games will be successful.

...API, etc ... steamfront python wrapper

While the genre tags of each Steam game are accessible through Steamfront, the Steam API only provides a list of *all* available apps and no way to filter through them efficiently. Alternative user-made APIs such as SteamSpy offer a wider variety of API features, but as they are user-maintained and not always online, I chose to avoid these alternatives. Rather than make an API request for every single app (~70,000 requests) and check if it fits our genre, it will be more efficient to scrape a list of Indie games from Steam's online store: 

https://store.steampowered.com/search/?sort_by=Name_ASC&tags=492&category1=998

Here, I've filtered Steam's full game list to include only apps with the tag "indie," further narrowed the list down to include only games (excluding other software, downloadable content, demos, and soundtracks), and sorted the list alphabetically. There's a total of 27,271 results at the time of writing this.

I've learned to love (and hate) [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) in my prior web scraping projects, but for this project I've chosen the simpler, more elegant [Scrapy](https://scrapy.org/). It should have a much easier time crawling through the 1,085 pages of results, and it offers more flexibility given how frequently Steam updates its store (plus, I've never used it and I like trying new things!). Scrapy also makes it easy to configure rate limiting, which matters because I want the scraper to remain well-behaved.
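As a rough illustration of what "well-behaved" can mean in practice, here is one possible set of throttling options for a Scrapy project's `settings.py`. The setting names are standard Scrapy settings, but the values are illustrative assumptions and not necessarily the ones used for this project.

```python
# settings.py (sketch): throttling options for a polite crawl.
# All values below are assumptions chosen for illustration.

ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                 # base delay (seconds) between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # only one in-flight request per domain

# AutoThrottle adjusts the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0       # initial delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 30.0        # upper bound on the delay when the server is slow
```

With roughly a thousand result pages, a one-second delay keeps the whole crawl to a matter of minutes while staying gentle on the server.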

For the initial data collection, I scraped the following features for each game, as they are universal, easy to understand, and all available on the search results pages themselves, so no additional links need to be followed (which would increase the number of requests):
* AppID: 
* Url:
* Title:
* Release Date:
* Price:

When I initially gathered this data, I only collected the Title, Release Date, and Price - but quickly discovered that there's no guarantee that these three features would be unique and there could easily be multiple games with the same title. The AppID, however, is a unique identifier and will be used to distinguish between data points throughout.
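To make the uniqueness problem concrete, here is a small sketch with made-up rows (the titles, dates, and AppIDs are hypothetical, and `is_unique_key` is a helper name introduced only for this example). It checks whether a set of columns can serve as a unique key.

```python
import pandas as pd

# Toy rows, invented for illustration: two different apps that happen to
# share a title, release date, and price.
games = pd.DataFrame({
    "appid": ["111", "222", "333"],
    "title": ["Escape", "Escape", "Other Game"],
    "release_date": ["Jan 1, 2019", "Jan 1, 2019", "Feb 2, 2019"],
    "price": ["$0.99", "$0.99", "$4.99"],
})


def is_unique_key(df, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()


title_key_ok = is_unique_key(games, ["title", "release_date", "price"])
appid_key_ok = is_unique_key(games, ["appid"])

print("title/date/price unique:", title_key_ok)
print("appid unique:", appid_key_ok)
```

In this toy frame the three descriptive columns collide for the first two rows, while the AppID column still tells them apart.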

In [1]:
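import csv
import json
import time

import pandas as pd
import requests
from requests.exceptions import SSLError

# Use the full option name; the short form "max_columns" can be ambiguous in newer pandas versions
pd.set_option("display.max_columns", 100)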

In [2]:
def get_request(url, parameters=None):
    """Return json-formatted response of a get request using optional parameters.
    
    Parameters
    ----------
    url : string
    parameters : {'parameter': 'value'}
        parameters to pass as part of get request
    
    Returns
    -------
    json data
        json-formatted response (dict-like)
    """
    try:
        response = requests.get(url=url, params=parameters)
    except SSLError as s:
        print('SSL Error:', s)
        
        for i in range(5, 0, -1):
            print('\rWaiting... ({})'.format(i), end='')
            time.sleep(1)
        print('\rRetrying.' + ' '*10)
        
        # recursively try again
        return get_request(url, parameters)
    
    if response:
        return response.json()
    else:
        # a falsy response means an error status code, usually from too many requests. Wait and try again
        print('No response, waiting 10 seconds...')
        time.sleep(10)
        print('Retrying.')
        return get_request(url, parameters)
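One thing to keep in mind with the recursive approach is that it retries without an upper bound, so a URL that never succeeds would keep the function calling itself. As a hedged alternative, here is a minimal sketch of a bounded retry loop. The helper name `call_with_retries` and its parameters are assumptions made for this example, not part of `requests` or of the function above.

```python
import time


def call_with_retries(func, max_attempts=5, wait_seconds=10,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it returns a truthy value or attempts run out.

    Returns the first truthy result, or None if every attempt failed.
    `sleep` is injectable so the waiting can be skipped in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = func()
        except exceptions as err:
            print('Attempt {} raised: {}'.format(attempt, err))
            result = None

        if result:
            return result

        # Only wait if there is another attempt still to come.
        if attempt < max_attempts:
            print('No usable response, waiting {} seconds...'.format(wait_seconds))
            sleep(wait_seconds)

    return None
```

It could wrap a request along the lines of `call_with_retries(lambda: requests.get(url, params=parameters))`, with the caller deciding what to do when `None` comes back.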

To start, let's load the basic data we gathered into a pandas DataFrame - it's possible that all we really need here are the AppIDs, but we'll load in everything to begin.

In [23]:
def get_scrape_data(f):
    """ Load in json data that we previously collected to extract a list of appIDs for all Indie games """
    data = pd.read_json(f)
    return data

In [24]:
data = get_scrape_data('./steam-scrapy/steam_scrape/output/indie_all_final.json')

In [25]:
data.head()

Unnamed: 0,appid,price,release_date,title,url
0,1034230,,"Mar 17, 2019",***,https://store.steampowered.com/app/1034230/_/?...
1,603750,$2.99,"Mar 10, 2017",- Arcane Raise -,https://store.steampowered.com/app/603750/_Arc...
2,729370,$14.99,"Jan 23, 2019",-KLAUS-,https://store.steampowered.com/app/729370/KLAU...
3,638510,,2019,.Age,https://store.steampowered.com/app/638510/Age/...
4,1091520,$0.99,"Sep 9, 2019",鸿门一宴(Malicious Dinner),https://store.steampowered.com/app/1091520/Mal...


Right off the bat, we can see that some of the titles are questionable, so we'll need to dig into the data later to ensure that everything we include in our dataset would be considered a game (no demos/soundtracks/videos/etc). We can also see that not all of the release dates have the full day, month, and year, and that not all games have prices (of course, some are free to play). We'll keep everything as is until we have all of the data collected, and then later work on cleaning it up into a nice dataset.
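Without changing anything yet, it can be useful to get a feel for how many rows are affected. The sketch below uses three rows copied from the preview above and flags which ones have a full "Mon D, YYYY" release date and which have a price. The column names `has_full_date` and `has_price` are made up for this example, and the date format is an assumption based on the rows shown.

```python
import pandas as pd

# Three rows taken from the preview above.
sample = pd.DataFrame({
    "appid": ["1034230", "603750", "638510"],
    "price": [None, "$2.99", None],
    "release_date": ["Mar 17, 2019", "Mar 10, 2017", "2019"],
})

# Dates that don't match the full format become NaT instead of raising an error.
parsed = pd.to_datetime(sample["release_date"], format="%b %d, %Y", errors="coerce")

sample["has_full_date"] = parsed.notna()
sample["has_price"] = sample["price"].notna()

print(sample[["appid", "has_full_date", "has_price"]])
print("Rows with partial dates:", int((~sample["has_full_date"]).sum()))
print("Rows without a price:", int((~sample["has_price"]).sum()))
```

The original columns are left untouched, so this kind of check fits with postponing the actual cleanup until all of the data is collected.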

In [26]:
applist=data['appid']

In [27]:
applist.head()

0    1034230
1     603750
2     729370
3     638510
4    1091520
Name: appid, dtype: object

In [28]:
applist.describe()

count      22758
unique     22753
top       230820
freq           2
Name: appid, dtype: object

For some reason, it looks like there are a few duplicates in the applist, which is surprising because the AppID should be a unique identifier. Let's look under the hood to see if we can figure out why:

In [46]:
data[data.duplicated(subset='appid')]

Unnamed: 0,appid,price,release_date,title,url
4612,230820,$24.99,"May 28, 2013",The Night of the Rabbit Premium Edition,https://store.steampowered.com/sub/28005/?snr=...
12707,430280,$4.99,"Dec 21, 2015",Nature Defenders,https://store.steampowered.com/app/430280/Natu...
12782,1155470,Free,"Oct 7, 2019",Mythic Ocean: Prologue,https://store.steampowered.com/app/1155470/Myt...
12833,434260,$0.99,"Feb 15, 2016",My Name is Mayo,https://store.steampowered.com/app/434260/My_N...
15982,557260,Free to Play,"Jan 16, 2017",iREC,https://store.steampowered.com/app/557260/iREC...


We can see that, for some reason, these 5 AppIDs have been included more than once. Before just deleting the duplicates, we want to make sure that the full rows are duplicated, not just the AppID:

In [38]:
data.loc[data['appid'] == '230820']

Unnamed: 0,appid,price,release_date,title,url
4611,230820,,"May 28, 2013",The Night of the Rabbit,https://store.steampowered.com/app/230820/The_...
4612,230820,$24.99,"May 28, 2013",The Night of the Rabbit Premium Edition,https://store.steampowered.com/sub/28005/?snr=...


In [39]:
data.loc[data['appid'] == '430280']

Unnamed: 0,appid,price,release_date,title,url
12658,430280,$4.99,"Dec 21, 2015",Nature Defenders,https://store.steampowered.com/app/430280/Natu...
12707,430280,$4.99,"Dec 21, 2015",Nature Defenders,https://store.steampowered.com/app/430280/Natu...


In [40]:
data.loc[data['appid'] == '1155470']

Unnamed: 0,appid,price,release_date,title,url
12727,1155470,Free,"Oct 7, 2019",Mythic Ocean: Prologue,https://store.steampowered.com/app/1155470/Myt...
12782,1155470,Free,"Oct 7, 2019",Mythic Ocean: Prologue,https://store.steampowered.com/app/1155470/Myt...


In [41]:
data.loc[data['appid'] == '434260']

Unnamed: 0,appid,price,release_date,title,url
12783,434260,$0.99,"Feb 15, 2016",My Name is Mayo,https://store.steampowered.com/app/434260/My_N...
12833,434260,$0.99,"Feb 15, 2016",My Name is Mayo,https://store.steampowered.com/app/434260/My_N...


In [43]:
data.loc[data['appid'] == '557260']

Unnamed: 0,appid,price,release_date,title,url
15933,557260,Free to Play,"Jan 16, 2017",iREC,https://store.steampowered.com/app/557260/iREC...
15982,557260,Free to Play,"Jan 16, 2017",iREC,https://store.steampowered.com/app/557260/iREC...


So, it looks like in the first case, both the original and premium versions of the game are given the same AppID - this could be something important to keep in mind for later. The remaining four duplicates, however, appear to have simply been recorded twice, so we can go ahead and delete the extra copies of those four. We'll keep both the original and premium versions of AppID 230820, at least for now. We can exclude that first app from our to-delete list by requiring that both the AppID and the price (or any other column that differs between the two rows) be the same. Since the two rows for AppID 230820 have different prices (one is missing, the other is $24.99), neither counts as a duplicate of the other:

In [50]:
data = data.drop_duplicates(subset=['appid', 'price'])
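As a sanity check on this step, here is a small sketch using four rows copied from the outputs above. It shows two things: `duplicated(..., keep=False)` lists every copy of a repeated AppID in one call (instead of one lookup per AppID), and dropping duplicates on `['appid', 'price']` removes the exact repeat while keeping both editions of AppID 230820. The variable names are chosen just for this example.

```python
import pandas as pd

# Rows copied from the duplicate checks above.
rows = pd.DataFrame({
    "appid": ["230820", "230820", "430280", "430280"],
    "price": [None, "$24.99", "$4.99", "$4.99"],
    "title": [
        "The Night of the Rabbit",
        "The Night of the Rabbit Premium Edition",
        "Nature Defenders",
        "Nature Defenders",
    ],
})

# keep=False marks every member of a duplicated group, not just the later copies.
all_copies = rows[rows.duplicated(subset="appid", keep=False)]

# Rows count as duplicates only if both the AppID and the price match.
deduped = rows.drop_duplicates(subset=["appid", "price"])

# How many rows remain for each AppID after de-duplication.
remaining = deduped["appid"].value_counts()

print(all_copies)
print(remaining)
```

After this step, AppID 230820 still appears twice, so any later code that assumes one row per AppID will need to handle that case.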