<a href="https://colab.research.google.com/github/abnormalPotassium/DATA620/blob/main/Project%20Final/ProjectFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project: Video Game Description Processing for Model Building
By: Al Haque, Taha Ahmad


---
## Requirements

This project's requirements are to:
- Answer our proposal's guiding question with work that incorporates text processing, one of the main themes of the course
- Show all of our work in a coherent workflow and in a reproducible format
- Explain how we evaluate the “goodness” of the final model and parameters



---
## Introduction

### Project Goals

Our project tackles one part of a problem that has grown far more difficult for companies in an era of thousands of similar competing products: how does a product stand out from its competitors and garner more sales in today's huge digital marketplaces? An answer would be invaluable to those trying to break into a market without existing consumer recognition.

There are many ways to approach this problem, such as getting advertisements in front of prospective customers, choosing flattering images to showcase the product, or maximizing quality to win word-of-mouth favor. Our focus is on text mining product descriptions to see whether certain writing features lead to more sales, as a partial indicator of visibility to the customer. Specifically, we want to attempt this analysis with products in the video game industry on the computer game storefront of Steam. Thus, the problem we want to tackle is:

What features within a description of a computer game on Steam are most influential to the amount of sales for the product?

### Methodology

To answer this question, we will start by sourcing the data from [Gamalytic](https://gamalytic.com/api-reference.txt), a third-party API which collects data from the Steam marketplace platform including features of the products on the platform and the estimated sales. Additionally, we will need to gather description data from an official [Steam API](https://store.steampowered.com) with no official documentation, but [unofficial documentation](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI) is available and was used.

Once the data is loaded, the plan is to break the descriptions down into a range of features, such as the number of times a word is used, the length of the description, and the overall sentiment of the description based on a sentiment dictionary. With the features extracted, we will train several classification models, but focus on a logistic regression model with a larger feature set because of its greater interpretability. Both members will work on the iterative model building process and feature extraction. Our final trained model should give us insight into which features within the description increase sales and thus consumer reach.

Perhaps the biggest concern at the moment is extracting enough features with a meaningful relationship to our response variable to make a good model, as description features may not actually be that influential on the total number of buyers.
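The description features named in the methodology (word counts, description length, dictionary-based sentiment) can be sketched in a few lines. This is only an illustration under stated assumptions: the `describe` function and the tiny `POSITIVE`/`NEGATIVE` word sets are hypothetical stand-ins, not the sentiment dictionary or feature set used later in the project.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; a real analysis would use
# a published sentiment dictionary.
POSITIVE = {"epic", "beautiful", "fun", "rich"}
NEGATIVE = {"grim", "brutal", "deadly"}

def describe(text):
    """Return simple description features: length, word counts, sentiment."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return {
        "char_length": len(text),
        "word_count": len(tokens),
        "top_words": counts.most_common(3),
        # Net positive words per token; 0.0 for an empty description
        "sentiment": (pos - neg) / len(tokens) if tokens else 0.0,
    }

sample_text = "Explore a beautiful, rich world. A fun but brutal world awaits."
features = describe(sample_text)
print(features)
```

Normalizing the sentiment score by token count keeps long descriptions from scoring higher simply because they contain more words.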

---
## Package Installation


Any packages that need to be installed for our project are added in the code block below. Our initial assumption was that we would only need nltk and possibly pandas; requests and beautifulsoup4 are also installed for the data collection steps.

In [None]:
!pip install nltk
!pip install pandas
!pip install requests
!pip install beautifulsoup4

---
## Data Collection

We begin our data collection by importing the Requests package, which lets us easily query the APIs that supply our data. We also import the time module so we can pause between repeated queries, to be polite and avoid overloading the servers. The json package, combined with the Google Colab drive module, saves the data queried so far to Google Drive, so that an error partway through does not force us to restart from scratch.

In [None]:
import requests
import time
import json

### SteamIDs and Copies Sold

Using Requests, we wrote a function that pages through the Gamalytic API's JSON responses to collect information on 200 of the latest games. Each page's results are appended to a list, which is saved to disk once the function finishes.

In [49]:
def paged_requests(url = 'https://api.gamalytic.com/steam-games/list', payload = None, maxpage = 2, timeout = 10):
  # Build or copy the query here so a shared default dict is never mutated between calls
  if payload is None:
    payload = {"price_min": 1, 'date_min': 1420070399999, "limit" : 100, "page": 0}
  else:
    payload = dict(payload)
  request_list = []
  start = payload['page']

  # The first response also reports how many pages are available
  r = requests.get(url, params=payload, timeout = timeout)

  # Cap the page count at maxpage when it is an integer, otherwise fetch every page
  if isinstance(maxpage, int):
    maxpage = min(maxpage, r.json()['pages'])
  else:
    maxpage = r.json()['pages']

  for page in range(payload['page'], maxpage):
    if page == start:
      # Reuse the response we already have for the starting page
      request_list.extend(r.json()['result'])
      time.sleep(1)
    else:
      payload['page'] = page

      try:
        r = requests.get(url, params=payload, timeout = timeout)
      except requests.exceptions.Timeout:
        # Save what has been collected so far before giving up
        print(f"Timed out on page {page}")
        with open(f'{drive}/erroratpage{page}.json', 'w') as f:
          f.write(json.dumps(request_list))
        return request_list

      request_list.extend(r.json()['result'])
      time.sleep(1)
  with open(f'{drive}/completedpagedquery.json', 'w') as f:
    f.write(json.dumps(request_list))
  return request_list
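The paging logic in `paged_requests` can be tried without touching the network. The sketch below is a simplified illustration, not the notebook's function: `collect_pages` and `fake_fetch` are hypothetical names, and the fake response only mirrors the `pages` and `result` keys that `paged_requests` reads.

```python
def collect_pages(fetch_page, first_page=0, max_pages=None):
    """Gather results from a paged endpoint.

    fetch_page(page) is a hypothetical callable returning a dict with
    'pages' (total page count) and 'result' (list of records).
    """
    first = fetch_page(first_page)
    total = first["pages"]
    last = total if max_pages is None else min(max_pages, total)

    results = []
    for page in range(first_page, last):
        # Reuse the first response instead of requesting that page twice
        data = first if page == first_page else fetch_page(page)
        results.extend(data["result"])
    return results


# Fake three-page API used only to exercise the loop offline.
def fake_fetch(page):
    return {"pages": 3, "result": [f"game-{page}-{i}" for i in range(2)]}

capped = collect_pages(fake_fetch, max_pages=2)
everything = collect_pages(fake_fetch)
print(len(capped), len(everything))
```

Passing the fetcher in as an argument keeps the loop testable; in the notebook the same role is played by `requests.get` with the page number in the query parameters.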

Initially our functions saved the API information to Google Drive after mounting the drive. However, once we were able to download the information we moved it to GitHub for easier reproducibility and download the data again from there.

In [18]:
drive = "/content/drive/MyDrive/Colab Notebooks"

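'''
# Imported under an alias so the module does not overwrite the `drive` path string above
from google.colab import drive as gdrive
gdrive.mount('/content/drive')
gamalytic_list = paged_requests()
'''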

Afterwards, we import pandas, which will be crucial for working with our data in a reasonable manner. Reading our uploaded JSON file as a dataframe lets us preview the data we have collected so far, although we really only need the id and copiesSold fields from it.

In [None]:
import pandas as pd

!wget https://github.com/abnormalPotassium/DATA620/raw/main/Project%20Final/completedpagedquery.json
df = pd.read_json("completedpagedquery.json")

In [316]:
df

Unnamed: 0,steamId,id,name,price,reviews,reviewScore,copiesSold,revenue,avgPlaytime,tags,genres,features,developers,publishers,unreleased,earlyAccess,releaseDate,EAReleaseDate,publisherClass
0,1086940,1086940,Baldur's Gate 3,59.99,537110,96,16113300,8.386532e+08,65.780621,"[RPG, Choices Matter, Character Customization,...","[Adventure, RPG, Strategy]","[Single-player, Online Co-op, LAN Co-op, Steam...",[Larian Studios],[Larian Studios],False,False,1691035200000,1.601957e+12,AAA
1,271590,271590,Grand Theft Auto V,39.99,1640585,86,35118817,7.586372e+08,190.169759,"[Open World, Action, Multiplayer, Crime, Autom...","[Action, Adventure]","[Single-player, Online PvP, Online Co-op, Stea...",[Rockstar North],[Rockstar Games],False,False,1428897600000,,AAA
2,1091500,1091500,Cyberpunk 2077,59.99,653469,81,14072558,5.800287e+08,66.727864,"[Cyberpunk, Open World, Nudity, RPG, Singlepla...",[RPG],"[Single-player, Steam Achievements, Steam Trad...",[CD PROJEKT RED],[CD PROJEKT RED],False,False,1607490000000,,AAA
3,306130,306130,The Elder Scrolls® Online,19.99,137887,82,5226369,5.233994e+08,128.857166,"[RPG, MMORPG, Open World, Fantasy, Adventure, ...","[Action, Adventure, Massively Multiplayer, RPG]","[MMO, Online PvP, Online Co-op, Steam Trading ...",[ZeniMax Online Studios],[Bethesda Softworks],False,False,1495425600000,1.404173e+12,AAA
4,359550,359550,Tom Clancy's Rainbow Six® Siege,19.99,1178856,86,21327664,4.976081e+08,218.986833,"[FPS, PvP, eSports, Multiplayer, Tactical, Sho...",[Action],"[Single-player, Online PvP, Online Co-op, Stea...",[Ubisoft Montreal],[Ubisoft],False,False,1448946000000,,AAA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,431240,431240,Golf With Your Friends,14.99,58936,89,3734612,3.566563e+07,11.216670,"[Multiplayer, Casual, Mini Golf, Sports, Golf,...","[Casual, Indie, Sports]","[Single-player, Online PvP, Shared/Split Scree...","[Blacklight Interactive, Team17]",[Team17],False,False,1589860800000,1.451606e+12,AA
196,292730,292730,Call of Duty®: Infinite Warfare,59.99,25718,57,600697,3.555098e+07,22.269230,"[Action, FPS, Multiplayer, Zombies, Space, Fut...","[Action, Adventure]","[Single-player, Online PvP, Online Co-op, Stea...",[Infinity Ward],[Activision],False,False,1478145600000,,AAA
197,310950,310950,Street Fighter V,19.99,42075,67,1278316,3.535854e+07,44.706498,"[Fighting, Action, 2D Fighter, Multiplayer, Co...",[Action],"[Single-player, Cross-Platform Multiplayer, St...","[CAPCOM Co., Ltd.]","[CAPCOM Co., Ltd.]",False,False,1455512400000,,AAA
198,960090,960090,Bloons TD 6,13.99,275170,97,7209688,3.497538e+07,56.794273,"[Tower Defense, Strategy, Multiplayer, Singlep...",[Strategy],"[Single-player, Online Co-op, Steam Achievemen...",[Ninja Kiwi],[Ninja Kiwi],False,False,1545022800000,,Indie


### Steam Descriptions and Historical Copies Sold

To gather the game descriptions, we need another API, the Steam app details API. It must be queried with the individual steamId of each game we currently have, so we take the ids from the dataframe of 200 games gathered from the Gamalytic API and convert them into a list to be used down the line.

In [318]:
id_list = list(df["steamId"])

We import the BeautifulSoup package to parse the HTML remnants that are common in descriptions retrieved from the Steam API. However, some of that HTML might itself be a good indicator of sales, such as the number of images or links within the description, so we make sure to count those in our parsing function.

In [191]:
from bs4 import BeautifulSoup

def getdesc(description, id):
  description = {k: description[str(id)]['data'][k] for k in ('steam_appid', 'about_the_game')}
  description["steamId"] = description.pop('steam_appid')
  soup = BeautifulSoup(description['about_the_game'], 'html.parser')
  description['about_the_game'] = soup.get_text(separator=" ").strip()
  description["img_count"] = len(soup.find_all('img'))
  description["link_count"] = len(soup.find_all('a'))
  return description

An example of a singular parse result is shown below:

In [192]:
"""
description = r.json()
getdesc(description, id_list[2])
"""

{'about_the_game': 'Cyberpunk 2077 is an open-world, action-adventure RPG set in the megalopolis of Night City, where you play as a cyberpunk mercenary wrapped up in a do-or-die fight for survival. Improved and featuring all-new free additional content, customize your character and playstyle as you take on jobs, build a reputation, and unlock upgrades. The relationships you forge and the choices you make will shape the story and the world around you. Legends are made here. What will yours be? IMMERSE YOURSELF WITH UPDATE 2.1 Night City feels more alive than ever with the free Update 2.1! Take a ride on the fully functional NCART metro system, listen to music as you explore the city with the Radioport, hang out with your partner in V’s apartment, compete in replayable races, ride new vehicles, enjoy improved bike combat and handling, discover hiddens secrets and much, much more! CREATE YOUR OWN CYBERPUNK Become an urban outlaw equipped with cybernetic enhancements and build your legend 
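To see what the text extraction and tag counting in `getdesc` amount to, here is a dependency-free sketch built on the standard library's `html.parser` instead of BeautifulSoup. It is a swapped-in technique for illustration only: the `DescriptionParser` class and the `sample` HTML are hypothetical, and whitespace handling will not match `get_text(separator=" ")` exactly.

```python
from html.parser import HTMLParser

class DescriptionParser(HTMLParser):
    """Collect visible text and count <img> and <a> tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.img_count = 0
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img":
            self.img_count += 1
        elif tag == "a":
            self.link_count += 1

    def handle_data(self, data):
        # Keep only non-empty text fragments
        if data.strip():
            self.parts.append(data.strip())

# Made-up snippet in the style of a store description, not real Steam output.
sample = '<h1>About</h1><p>Build a <a href="https://example.com">base</a>.</p><img src="a.png"/><img src="b.png">'

parser = DescriptionParser()
parser.feed(sample)
parser.close()
text = " ".join(parser.parts)
print(text, parser.img_count, parser.link_count)
```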

We also want to account for the fact that some games have sold more simply because they have been out longer. An alternate response variable, historical unit sales from around 30 days after release, lets us mitigate the advantage older games get just from existing longer. To get it we call another Gamalytic API endpoint that returns historical snapshots keyed by Unix timestamps (in milliseconds). We take the first snapshot recorded at least a month after the earliest one (usually 1 to 1.5 months after release) and use its sales figure as the historical sales indicator.

In [303]:
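# Length of an average month (about 30.44 days) in milliseconds
MONTH_MS = 2629743000

def findmonth(timestamps):
  history = timestamps['history']
  # If the history starts within a month of the listed release date, return the latest snapshot
  if abs(history[0]['timeStamp'] - timestamps['releaseDate']) <= MONTH_MS:
    hist_price = history[-1]
    hist_price["steamId"] = int(timestamps["steamId"])
    return hist_price
  # Otherwise return the first snapshot taken at least a month after the earliest one
  for i in range(1,len(history)):
    if abs(history[0]['timeStamp'] - history[i]['timeStamp']) >= MONTH_MS:
      hist_price = history[i]
      hist_price["steamId"] = int(timestamps["steamId"])
      return hist_price
  # Falls through and returns None when no snapshot is a month past the first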

An example of the results from a singular call is shown below.

In [317]:
"""
timestamps = r.json()
findmonth(timestamps)
"""

{'timeStamp': 1679889883569,
 'reviews': 103815,
 'price': 29.99,
 'score': 81,
 'rank': 8,
 'players': 22892.14285714286,
 'sales': 4855906,
 'revenue': 125325077,
 'steamId': 1326470}
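The core selection step, picking the first snapshot at least a month after the earliest one, can be checked on synthetic data. This is a minimal sketch: `first_snapshot_after` is a hypothetical helper, the `history` values are invented, and it assumes the snapshots arrive in chronological order.

```python
MONTH_MS = 2629743000  # same threshold as findmonth: roughly 30.44 days in ms
DAY_MS = 86_400_000

def first_snapshot_after(history, min_gap_ms=MONTH_MS):
    """Return the first entry at least min_gap_ms after the first entry, or None."""
    start = history[0]["timeStamp"]
    for entry in history[1:]:
        if entry["timeStamp"] - start >= min_gap_ms:
            return entry
    return None

# Invented snapshots at 0, 14, 35 and 70 days after tracking began.
history = [
    {"timeStamp": 0, "sales": 100},
    {"timeStamp": 14 * DAY_MS, "sales": 250},
    {"timeStamp": 35 * DAY_MS, "sales": 400},
    {"timeStamp": 70 * DAY_MS, "sales": 520},
]

snapshot = first_snapshot_after(history)
too_short = first_snapshot_after(history[:2])
print(snapshot, too_short)
```

Returning `None` explicitly when the history is too short makes that case easy to filter out before the results are loaded into a dataframe.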

Now we combine our parsing and processing functions inside another function, which calls both the Steam and Gamalytic APIs for each game and pauses half a second after each one has been processed. It retains features from the previous function, such as saving on error and saving upon completion.

In [203]:
def dual_requests(id_list = id_list, timeout = 10):
  url2 = "https://store.steampowered.com/api/appdetails"
  payload2 = {"filters": "basic"}

  request_list1 = []
  request_list2 = []

  for game in id_list:
    url1 = f'https://api.gamalytic.com/game/{game}'
    payload2["appids"] = game
    try:
      r1 = requests.get(url1, timeout = timeout)
      r2 = requests.get(url2, params=payload2, timeout = timeout)
    except requests.exceptions.Timeout:
      index = id_list.index(game)
      print(f"Timed out on id index {index}")
      with open(f'{drive}/erroratindex{index}1.json', 'w') as f:
        f.write(json.dumps(request_list1))
      with open(f'{drive}/erroratindex{index}2.json', 'w') as f:
        f.write(json.dumps(request_list2))
      return request_list1, request_list2

    request_list1.append(findmonth(r1.json()))
    request_list2.append(getdesc(r2.json(), game))
    time.sleep(0.5)

  with open(f'{drive}/completedpagedquery1.json', 'w') as f:
    f.write(json.dumps(request_list1))
  with open(f'{drive}/completedpagedquery2.json', 'w') as f:
    f.write(json.dumps(request_list2))
  return request_list1, request_list2
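The save-on-error pattern used by `dual_requests` can be exercised offline with a fetcher that fails on purpose. The sketch below makes several assumptions: `collect_with_checkpoint` and `flaky_fetch` are hypothetical, the built-in `TimeoutError` stands in for `requests.exceptions.Timeout`, and a temporary directory replaces the Google Drive path.

```python
import json
import os
import tempfile

def collect_with_checkpoint(ids, fetch, out_dir):
    """Call fetch(id) for each id; on a timeout, save partial results and stop."""
    results = []
    for index, item_id in enumerate(ids):
        try:
            results.append(fetch(item_id))
        except TimeoutError:
            # Record how far we got so the run can be resumed from this index
            path = os.path.join(out_dir, f"erroratindex{index}.json")
            with open(path, "w") as f:
                json.dump(results, f)
            return results, path
    path = os.path.join(out_dir, "completed.json")
    with open(path, "w") as f:
        json.dump(results, f)
    return results, path

# Fake fetcher that times out on one id, to trigger the checkpoint branch.
def flaky_fetch(item_id):
    if item_id == 30:
        raise TimeoutError("simulated timeout")
    return {"steamId": item_id}

with tempfile.TemporaryDirectory() as tmp:
    partial, saved_path = collect_with_checkpoint([10, 20, 30, 40], flaky_fetch, tmp)
    with open(saved_path) as f:
        on_disk = json.load(f)
    saved_name = os.path.basename(saved_path)

print(saved_name, on_disk)
```

Using `enumerate` gives the failing position directly, which avoids a second scan of the list and stays correct even if an id appears twice.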

As before, we have uploaded the query results to GitHub for easier access, so this project can be reproduced by simply downloading the files below:

In [232]:
"""
timestamps,descriptions = dual_requests(id_list = id_list)
"""

!wget https://github.com/abnormalPotassium/DATA620/raw/main/Project%20Final/completedpagedquery1.json && wget https://github.com/abnormalPotassium/DATA620/raw/main/Project%20Final/completedpagedquery2.json

We load both query result files into pandas dataframes and preview what they look like. We can see that df1 has a fair amount of excess information, such as rank and followers, which we will trim later on.

In [338]:
df1 = pd.read_json("completedpagedquery1.json")
df2 = pd.read_json("completedpagedquery2.json")

In [313]:
df1

Unnamed: 0,timeStamp,reviews,price,score,players,sales,revenue,steamId,rank,followers
0,2020-12-01 05:00:00.000,28169,59.99,88.999966,4831.348504,1504620,79542168.0,1086940,,
1,2024-05-07 06:26:56.409,1632508,0.00,92.000000,52077.000000,39394460,,271590,41.0,3106521.0
2,2024-05-07 05:31:27.344,649930,59.99,94.000000,15309.000000,14080902,580349985.0,1091500,53.0,1405847.0
3,2015-02-02 00:00:00.000,1024,59.99,81.000485,907.869089,39870,1471040.0,306130,,
4,2024-05-06 10:32:59.826,1085195,19.99,77.000000,38523.000000,21336380,203473302.0,359550,31.0,1233525.0
...,...,...,...,...,...,...,...,...,...,...
195,2016-02-02 00:00:00.000,172,5.39,98.000000,69.363436,20942,105998.0,431240,,
196,2024-05-07 05:09:05.067,14505,59.99,86.000000,246.000000,601660,24392556.0,292730,1435.0,69292.0
197,2024-05-06 09:30:24.148,26924,19.99,76.000000,192.000000,1278519,19755011.0,310950,1875.0,120733.0
198,2024-05-07 05:00:23.050,269741,13.99,96.000000,7629.000000,7213041,35062034.0,960090,192.0,79064.0


In [327]:
df2

Unnamed: 0,about_the_game,steamId,img_count,link_count
0,Gather your party and return to the Forgotten ...,1086940,12,0
1,"When a young street hustler, a retired bank ro...",271590,0,1
2,"Cyberpunk 2077 is an open-world, action-advent...",1091500,8,1
3,Experience an ever-expanding story across all ...,306130,0,0
4,“One of the best first-person shooters ever ma...,359550,4,0
...,...,...,...,...
195,Why have friends if not to play Golf... With Y...,431240,6,0
196,Includes the Terminal Bonus Map and Zombies in...,292730,1,0
197,Experience the intensity of head-to-head battl...,310950,6,0
198,Craft your perfect defense from a combination ...,960090,5,6


The last part of data collection is dropping the fields we will not use and merging our three API collections into a single dataframe, which we will process further for model building. The most important feature is about_the_game, the description text from which we still have much meaning to extract. We keep two potential response variables to compare: copiesSold, the total copies sold as of data collection (5/7/2024), and sales, the total copies sold about a month after release. The timeStamp and releaseDate fields are also useful, since they let us calculate how long after release the sales value was recorded.

In [344]:
df = pd.merge(df, df1, on='steamId')
df = pd.merge(df, df2, on='steamId')
df = df[["steamId","name","copiesSold","timeStamp","releaseDate","sales","about_the_game","img_count","link_count"]]

In [345]:
df

Unnamed: 0,steamId,name,copiesSold,timeStamp,releaseDate,sales,about_the_game,img_count,link_count
0,1086940,Baldur's Gate 3,16113300,2020-12-01 05:00:00.000,1691035200000,1504620,Gather your party and return to the Forgotten ...,12,0
1,271590,Grand Theft Auto V,35118817,2024-05-07 06:26:56.409,1428897600000,39394460,"When a young street hustler, a retired bank ro...",0,1
2,1091500,Cyberpunk 2077,14072558,2024-05-07 05:31:27.344,1607490000000,14080902,"Cyberpunk 2077 is an open-world, action-advent...",8,1
3,306130,The Elder Scrolls® Online,5226369,2015-02-02 00:00:00.000,1495425600000,39870,Experience an ever-expanding story across all ...,0,0
4,359550,Tom Clancy's Rainbow Six® Siege,21327664,2024-05-06 10:32:59.826,1448946000000,21336380,“One of the best first-person shooters ever ma...,4,0
...,...,...,...,...,...,...,...,...,...
195,431240,Golf With Your Friends,3734612,2016-02-02 00:00:00.000,1589860800000,20942,Why have friends if not to play Golf... With Y...,6,0
196,292730,Call of Duty®: Infinite Warfare,600697,2024-05-07 05:09:05.067,1478145600000,601660,Includes the Terminal Bonus Map and Zombies in...,1,0
197,310950,Street Fighter V,1278316,2024-05-06 09:30:24.148,1455512400000,1278519,Experience the intensity of head-to-head battl...,6,0
198,960090,Bloons TD 6,7209688,2024-05-07 05:00:23.050,1545022800000,7213041,Craft your perfect defense from a combination ...,5,6
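The time-after-release calculation mentioned above can be sketched with the standard library. `days_after_release` is a hypothetical helper, and it assumes the timeStamp strings are in UTC, which the source does not state; the example values are copied from the last row of the table (Bloons TD 6).

```python
from datetime import datetime, timezone

def days_after_release(release_ms, snapshot):
    """Days between a release date (epoch milliseconds) and a snapshot string."""
    released = datetime.fromtimestamp(release_ms / 1000, tz=timezone.utc)
    # Assumption: the snapshot string is in UTC
    taken = datetime.strptime(snapshot, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=timezone.utc)
    return (taken - released).total_seconds() / 86400

# releaseDate and timeStamp from the Bloons TD 6 row of the merged dataframe
bloons_days = days_after_release(1545022800000, "2024-05-07 05:00:23.050")
print(round(bloons_days, 1))
```

A negative result means the snapshot predates the listed release date. That appears to happen for some rows in the table, for example Golf With Your Friends, whose snapshot is from 2016 while its releaseDate falls in 2020, possibly because of its early access period. In pandas, `pd.to_datetime(df["releaseDate"], unit="ms")` performs the same epoch conversion on a whole column.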
