# Pulling data from Steam Web API

The [storefront API](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI) can return the store page information for each game, but its documentation shows that it requires each game's appid. Our data doesn't include these appids; however, a second API, found through [this thread](https://stackoverflow.com/questions/57441606/how-to-get-the-steam-appid-by-appname-in-steam-webapi), returns a list that maps each appid to its title.

Since the engine to be built will be content based, the text on each game's store page is the data of interest.


In [1]:
#inline graphing just in case, and libraries to be used
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import pickle
from tqdm import tqdm
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


Since the app list API returns JSON, it will be easier to first move it into a DataFrame before merging it with our other table.

Since this process takes several minutes, I decided to add the [tqdm](https://github.com/tqdm/tqdm) wrapper to my loop to show its progress.

In [2]:
#use the requests module to send a GET request
r = requests.get('http://api.steampowered.com/ISteamApps/GetAppList/v2')

#assign the relevant information from the JSON response to the variable 'app_json'
app_json = r.json()['applist']['apps']

#create an empty DataFrame with two columns, then loop through each dict in the list and add both values as a new row
#(DataFrame.append was removed in pandas 2.0, so rows are added with .loc instead)
appid_df = pd.DataFrame(columns=['appid', 'name'])
for app in tqdm(app_json):
    appid_df.loc[len(appid_df)] = [app['appid'], app['name']]

#print head to confirm results
appid_df.head()

100%|██████████| 100000/100000 [13:39<00:00, 121.96it/s]


| | appid | name |
|---|---|---|
| 0 | 216938 | Pieterw test app76 ( 216938 ) |
| 1 | 660010 | test2 |
| 2 | 660130 | test3 |
| 3 | 1346400 | Battle Engine Aquila |
| 4 | 1346420 | Transpire |
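Adding rows one at a time is slow because pandas has to grow the frame on every iteration. As a minimal sketch, assuming the response keeps the list-of-dicts shape shown above, the same table can be built in a single constructor call. The `sample_apps` list is a stand-in for `app_json`, with values copied from the output above, and `appid_df_fast` is a name introduced only for this illustration.

```python
import pandas as pd

# Stand-in for r.json()['applist']['apps']: a list of {'appid', 'name'} dicts.
sample_apps = [
    {"appid": 216938, "name": "Pieterw test app76 ( 216938 )"},
    {"appid": 660010, "name": "test2"},
    {"appid": 1346400, "name": "Battle Engine Aquila"},
]

# pandas turns each dict into one row in a single pass, with no Python-level loop.
appid_df_fast = pd.DataFrame(sample_apps, columns=["appid", "name"])
print(appid_df_fast.head())
```

With this approach the progress bar is no longer needed, since the construction is a single call.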


Now that all the app names and their corresponding appids are stored in a DataFrame, information about each app can be queried from the storefront API.

To understand what to pull from this API, the JSON response will be loaded into a DataFrame to explore. It is also good to cross-reference the description of the data in the documentation linked above. Since the API call requires an appid, the appid for 'PAYDAY 2' (218620) was used as a test.

In [5]:
#make GET request for the store page content of PAYDAY 2
app_info = requests.get('https://store.steampowered.com/api/appdetails?appids=218620')

#assign app_data for testing in the next few cells
app_data = app_info.json()['218620']['data']

#load to DataFrame to view
pd.DataFrame(app_info.json()['218620'])

| | success | data |
|---|---|---|
| about_the_game | True | `<strong><a href="https://store.steampowered.co...` |
| achievements | True | `{'total': 1207, 'highlighted': [{'name': 'Comi...` |
| background | True | `https://steamcdn-a.akamaihd.net/steam/apps/218...` |
| categories | True | `[{'id': 2, 'description': 'Single-player'}, {'...` |
| content_descriptors | True | `{'ids': [], 'notes': None}` |
| controller_support | True | `full` |
| detailed_description | True | `<strong><a href="https://store.steampowered.co...` |
| developers | True | `[OVERKILL - a Starbreeze Studio.]` |
| dlc | True | `[1347750, 1347751, 1351060, 1252200, 1255151, ...` |
| genres | True | `[{'id': '1', 'description': 'Action'}, {'id': ...` |
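The response carries a `success` flag next to `data`, so it can be checked before the store page fields are read. The helper below is a sketch only: `extract_app_data` is a hypothetical name, and the shape of the failed payload (a `success` of False with no `data` key) is an assumption rather than something confirmed by the output above.

```python
def extract_app_data(payload, appid):
    """Return the 'data' dict for appid, or None if the lookup did not succeed."""
    entry = payload.get(str(appid)) if isinstance(payload, dict) else None
    if not entry or not entry.get("success"):
        return None
    return entry.get("data")


# Hypothetical payloads shaped like the response explored above.
ok_payload = {"218620": {"success": True, "data": {"name": "PAYDAY 2"}}}
failed_payload = {"999": {"success": False}}  # assumed failure shape

print(extract_app_data(ok_payload, 218620))
print(extract_app_data(failed_payload, 999))
```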


After looking through the JSON, the data that looks most useful for our recommendation system is in the description values, such as "about_the_game", "detailed_description", and "short_description". It wouldn't hurt to add the words from "genres" and "categories", too.

Below are functions defined to properly extract the strings from each of these values.

In [10]:
def get_categories(app_data):
    try:
        categories = app_data['categories']
    except KeyError:
        #no categories listed for this app
        return ""
    else:
        #concatenate the description of every category, separated by spaces
        category_string = ""
        for cat in categories:
            category_string = category_string + " " + cat['description']

        return category_string

In [11]:
def get_genres(app_data):
    try:
        genres = app_data['genres']
    except KeyError:
        return ""
    else:
        genres_string = ""
        for gen in genres:
            genres_string = genres_string + " " + gen['description']
    
        return genres_string
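The two extractors above differ only in the key they read, so they could be folded into one helper. This is a sketch under that observation, not the notebook's own code: `join_descriptions` is a hypothetical name, and unlike the functions above it does not put a leading space in front of the first entry.

```python
def join_descriptions(app_data, key):
    """Join the 'description' of every entry under key; '' if the key is missing."""
    return " ".join(item["description"] for item in app_data.get(key, []))


# Made-up app_data with the same structure as the 'categories' and 'genres' values.
sample_app = {
    "categories": [
        {"id": 2, "description": "Single-player"},
        {"id": 1, "description": "Multi-player"},
    ],
    "genres": [{"id": "1", "description": "Action"}],
}

print(join_descriptions(sample_app, "categories"))  # Single-player Multi-player
print(join_descriptions(sample_app, "genres"))      # Action
```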

Since all the descriptions are HTML-formatted strings, using the BeautifulSoup library to parse the text saves us the work of cleaning out the tags ourselves. This last function parses all three descriptive HTML strings into one long string.

In [13]:
def html_parser(html_string):
    #strip the HTML tags and return only the text
    soup = BeautifulSoup(html_string, 'html.parser')
    return soup.get_text()

def get_all_descriptions(app_data):
    try:
        detailed_description = html_parser(app_data['detailed_description'])
        about_the_game = html_parser(app_data['about_the_game'])
        short_description = html_parser(app_data['short_description'])
    except Exception:
        #a description is missing or could not be parsed
        return " "
    else:
        return " ".join([detailed_description, about_the_game, short_description])
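One detail of `get_text()` matters for the vectorization later on: by default it joins text nodes with no separator, so words from adjacent tags can be glued into a single token. The sketch below uses a made-up HTML snippet to show the difference when a separator is passed; whether the store descriptions actually suffer from this was not checked.

```python
from bs4 import BeautifulSoup

# Made-up snippet with adjacent block-level tags and no whitespace between them.
html = "<h1>Heists</h1><p>Rob banks</p><ul><li>Co-op</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: text nodes are concatenated directly.
default_text = soup.get_text()

# With a separator, each text node is stripped and joined with a space.
spaced_text = soup.get_text(separator=" ", strip=True)

print(repr(default_text))
print(repr(spaced_text))
```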
                                         

The final function puts everything together to return one string with the categories, genres, and descriptions.

In [14]:
def collect_descriptions(app_data):
    return get_categories(app_data) + get_genres(app_data) + " " + get_all_descriptions(app_data)

For each row in the DataFrame, we have to build the query string and add it to the URL for the GET request. One more function is written to do this. Note the conditional `if` statement at its start, which deals with -1 values.

In [15]:
def getDescrFromAPI(appid):
    #-1 values are skipped without calling the API
    if appid == -1:
        return ""

    url_string = 'https://store.steampowered.com/api/appdetails?appids=' + str(appid)
    try:
        app_json = requests.get(url_string).json()
        app_data = app_json[str(appid)]['data']
    except Exception:
        #failed request, non-JSON response, or no 'data' entry for this appid
        return ""
    else:
        return collect_descriptions(app_data)

The function is applied to the appid column, and the result is saved as a new column, all-descriptions. After completion, the values will be investigated.

In [16]:
appid_df['all-descriptions'] = appid_df['appid'].apply(getDescrFromAPI)
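Applying the function this way sends one request per row, back to back, and any failure silently becomes an empty string. As a hedged sketch, the loop below pauses after each request and retries empty results. The name `fetch_all`, the delay, and the attempt count are all assumptions chosen for illustration, and nothing here establishes that throttling is why requests fail. The fetcher is passed in as an argument so the example runs offline with a stand-in.

```python
import time


def fetch_all(appids, fetch_one, delay_seconds=1.5, max_attempts=3):
    """Call fetch_one(appid) for each appid, pausing after every request.

    fetch_one is any callable that returns a string, with '' meaning no data
    (for example getDescrFromAPI). Empty results are retried until
    max_attempts calls have been made for that appid.
    """
    results = {}
    for appid in appids:
        text = ""
        for _ in range(max_attempts):
            text = fetch_one(appid)
            time.sleep(delay_seconds)  # pause after every request
            if text:
                break
        results[appid] = text
    return results


# Stand-in fetcher so the sketch runs without network access.
fake_store = {10: "Action shooter", 20: ""}
descriptions = fetch_all([10, 20], fake_store.get, delay_seconds=0)
print(descriptions)
```

Apps that genuinely have no store page will be retried too, so a real run would trade extra time for fewer spurious blanks.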

In [21]:
appid_df['all-descriptions'].value_counts().sort_values(ascending=False)


The value counts above tell us that about 90% of the apps didn't receive a description, for reasons that were not investigated. This still leaves the table with roughly 9,000 rows, which should suffice for the content-based algorithm.

In [11]:
description_mask = appid_df['all-descriptions'] != ""

appid_wDescr = appid_df[description_mask]

Next, the text descriptions will be vectorized using two methods, count vectorization and TF-IDF vectorization. First is the CountVectorizer implementation.

In [12]:
#note CountVectorizer was imported. Stopwords set to english
Vectorizer = CountVectorizer(stop_words='english')

#fit and transform corpus
CountVector = Vectorizer.fit_transform(appid_wDescr['all-descriptions'])

Next, to see how similar each text is to each other, the cosine similarity will be calculated using scikit-learn's implementation.

In [20]:
cos_sim_cnt_count = cosine_similarity(CountVector)
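For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the mix of words rather than on how long a description is. The sketch below computes it by hand with NumPy for two made-up count vectors; `cosine` is a helper name introduced only for this example.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up word-count vectors for two documents over a five-word vocabulary.
a = np.array([1, 0, 1, 1, 1])
b = np.array([1, 1, 1, 0, 0])

print(cosine(a, b))      # shared words give a score between 0 and 1
print(cosine(a, 3 * a))  # scaling a vector does not change the score
```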

To make it easier to interpret the result, the results are put into a dataframe with index and columns corresponding to the game titles.

In [24]:
game_names = appid_wDescr.name.to_list()

cos_sim_cnt_count_df = pd.DataFrame(cos_sim_cnt_count, index=game_names, columns=game_names)

In [25]:
cos_sim_cnt_count_df["The Pirate's Fate"].sort_values(ascending=False)

The Pirate's Fate                    1.000000
What Makes Us Special                0.257252
Always Sometimes Monsters            0.256670
Capsize                              0.244739
Pirates? Pirates!                    0.242325
                                       ...   
Men of War Red Tide Trailer          0.000000
Aztaka Trailer                       0.000000
Super Laser Racer Trailer            0.000000
AvP - Alien Gameplay                 0.000000
TIGER GAME ASSETS LOGOTYPE VOL.10    0.000000
Name: The Pirate's Fate, Length: 9022, dtype: float64
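Since every title matches itself with a score of 1.0, a lookup for recommendations would normally drop that row. A minimal sketch of such a helper follows; `top_matches` is a hypothetical name, the toy matrix is made up, and the code assumes titles are unique (with duplicate names, selecting a column returns a DataFrame instead of a Series).

```python
import pandas as pd


def top_matches(sim_df, title, n=5):
    """Return the n titles most similar to title, excluding the title itself."""
    scores = sim_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


# Made-up 3x3 similarity matrix with titles as both index and columns.
toy_sim = pd.DataFrame(
    [[1.0, 0.4, 0.1], [0.4, 1.0, 0.3], [0.1, 0.3, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

print(top_matches(toy_sim, "A", n=2))
```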

In [31]:
with open('cosine_similarity_countvector.pkl', 'wb') as pklfile:
    pickle.dump(cos_sim_cnt_count_df, pklfile)

Now the same thing will be done using the TF-IDF vectorizer and the linear kernel. The linear kernel is used because `TfidfVectorizer` L2-normalizes each row by default, so the plain dot product gives the same score as cosine similarity.

In [28]:
#vectorizer from sklearn
Vectorizer = TfidfVectorizer(stop_words='english')

#fit and transform corpus
TfidfVector = Vectorizer.fit_transform(appid_wDescr['all-descriptions'])

#compare
cos_sim_cnt_tfidf = linear_kernel(TfidfVector)

#load in for viewing
cos_sim_cnt_tfidf_df = pd.DataFrame(cos_sim_cnt_tfidf, index=game_names, columns=game_names)
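The equivalence between `linear_kernel` and `cosine_similarity` on TF-IDF output can be checked on a toy corpus. The sentences below are made up; the check relies on the vectorizer's default settings, under which every non-empty row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions; each contains at least one non-stop word.
toy_corpus = [
    "pirate ship adventure on the sea",
    "farm adventure with a pirate",
    "relaxing farm simulation",
]
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_corpus)

# Row lengths should all be 1.0 with the default normalization.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
unit_rows = bool(np.allclose(row_norms, 1.0))

# With unit-length rows the dot product and the cosine similarity coincide.
same_scores = bool(np.allclose(linear_kernel(toy_tfidf), cosine_similarity(toy_tfidf)))

print(unit_rows, same_scores)
```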

In [29]:
cos_sim_cnt_tfidf_df['Adventure Farm VR'].sort_values(ascending=False)

Adventure Farm VR                                                                                              1.000000
Hope's Farm                                                                                                    0.398280
Farmtale                                                                                                       0.331884
Magic Farm 2: Fairy Lands (Premium Edition)                                                                    0.261245
My Breast Friend Sally                                                                                         0.241121
                                                                                                                 ...   
Sailor Moon Season 1: Naru's Tears: Nephrite Dies for Love                                                     0.000000
The Assignment: Filmmaking Portraits: The Assignment                                                           0.000000
Miss Kobayashi's Dragon Maid: Emperor of

In [30]:
cos_sim_cnt_tfidf

array([[1.        , 0.02893374, 0.00554547, ..., 0.00140314, 0.02537805,
        0.03858828],
       [0.02893374, 1.        , 0.05157235, ..., 0.01314139, 0.02377156,
        0.05689864],
       [0.00554547, 0.05157235, 1.        , ..., 0.00593665, 0.01115483,
        0.05555096],
       ...,
       [0.00140314, 0.01314139, 0.00593665, ..., 1.        , 0.0029901 ,
        0.02907821],
       [0.02537805, 0.02377156, 0.01115483, ..., 0.0029901 , 1.        ,
        0.06136301],
       [0.03858828, 0.05689864, 0.05555096, ..., 0.02907821, 0.06136301,
        1.        ]])

In [32]:
with open('cosine_similarity_tfidfvector.pkl', 'wb') as pklfile:
    pickle.dump(cos_sim_cnt_tfidf_df, pklfile)

The two recommenders can then be combined by averaging the corresponding values in the two cos_sim arrays.

In [35]:
cos_sim_cnt_hybrid = np.mean([cos_sim_cnt_count, cos_sim_cnt_tfidf],axis=0)

cos_sim_cnt_hybrid_df = pd.DataFrame(cos_sim_cnt_hybrid, index=game_names, columns=game_names)
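A plain mean gives both recommenders equal say. If one of them turned out to work better, the same idea extends to a weighted average. The sketch below is a possible variation rather than something the notebook does; `blend` and `weight_a` are names introduced here, and the matrices are made up.

```python
import numpy as np


def blend(sim_a, sim_b, weight_a=0.5):
    """Weighted average of two similarity matrices of the same shape."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return weight_a * sim_a + (1.0 - weight_a) * sim_b


# Made-up 2x2 similarity matrices standing in for the count and TF-IDF results.
count_sim = np.array([[1.0, 0.2], [0.2, 1.0]])
tfidf_sim = np.array([[1.0, 0.4], [0.4, 1.0]])

print(blend(count_sim, tfidf_sim))                # equal weights, same as np.mean
print(blend(count_sim, tfidf_sim, weight_a=0.25)) # leans toward the TF-IDF scores
```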

In [36]:
cos_sim_cnt_hybrid_df['Adventure Farm VR'].sort_values(ascending=False)

Adventure Farm VR                                  1.000000
Hope's Farm                                        0.394847
Farmtale                                           0.331603
VRGROUND : Crazy Farm                              0.254929
Magic Farm 2: Fairy Lands (Premium Edition)        0.247873
                                                     ...   
CME DLC GF and TR Trailer                          0.000000
The Eccentric Family: The Friday Club Once More    0.000000
The Eccentric Family: Arima Hell                   0.000000
The Eccentric Family: The Conjurer Tenmaya         0.000000
Mandela: A Long Walk to Freedom                    0.000000
Name: Adventure Farm VR, Length: 9022, dtype: float64

In [37]:
with open('cosine_similarity_hybrid.pkl', 'wb') as pklfile:
    pickle.dump(cos_sim_cnt_hybrid_df, pklfile)
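The saved files can be read back with `pickle.load`. The round trip below uses a small made-up frame and a temporary path so it does not touch the real files. Two points to keep in mind: a 9022 x 9022 matrix of float64 values takes roughly 650 MB, and pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small made-up similarity frame standing in for cos_sim_cnt_hybrid_df.
toy_df = pd.DataFrame(
    [[1.0, 0.3], [0.3, 1.0]], index=["A", "B"], columns=["A", "B"]
)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_similarity.pkl")

with open(path, "wb") as pklfile:
    pickle.dump(toy_df, pklfile)

with open(path, "rb") as pklfile:
    restored_df = pickle.load(pklfile)

print(restored_df.equals(toy_df))
```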

### End

This concludes the development of a content-based recommendation system. Unfortunately, since the data used is missing the majority of games, it isn't viable to build a public-facing application for playing around with this model. However, with sklearn and pandas, it's quick and easy to build and test a recommendation system.