# Setup
## Checking the IP address

We aim to trace the reviews that would have been visible to each gamer while they were writing their own review. We do not know exactly how Steam's algorithm decides which reviews are displayed to each gamer at that moment, but we can reverse-engineer it and form a hypothesis. In this notebook we test and validate that hypothesized, reverse-engineered algorithm.


In [None]:
# checking the IP address again because Steam limits the number of calls.
!curl ipecho.net/plain

35.236.167.36

## Reading the necessary files and libraries

In [None]:
# importing packages and reading the previously generated list of games with reviews

from google.colab import drive
from tqdm import tqdm
drive.mount("/content/drive")
%cd '/content/drive/My Drive/Dissertation'
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pickle
import random
with open("games_list", "rb") as fp1: 
  games_id_list = pickle.load(fp1) 
games_id_list=list(games_id_list)

Mounted at /content/drive
/content/drive/My Drive/Dissertation




## Selecting 5 random games to simulate our hypothesized version of Steam's review-display algorithm and compare it in real time

In [None]:
# selecting 5 random games.


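def random_no_generator():
  # random.randrange excludes the upper bound, so the index is always valid;
  # random.randint(0, len(...)) could return len(...) and raise an IndexError
  return(random.randrange(len(games_id_list)))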

def generate_five_gameIDs(games_id_list):
  random_id_list=[]
  while len(random_id_list)<5:
    no=random_no_generator()
    game_id=games_id_list[no]
    random_id_list.append(game_id)
  return(random_id_list)
random_id_list=generate_five_gameIDs(games_id_list)

In [None]:
random_id_list
# [252870, '1547380', '298110', '39530', 240]

# Resimulation stage

## Gathering reviews in real-time from the 5 randomly selected video-games


In [None]:
# extracting the reviews of the 5 games in real time. 

def get_one_review(appid, params={'json':1}):
        url = 'https://store.steampowered.com/appreviews/'
        response = requests.get(url=url+appid, params=params, headers={'User-Agent': 'Mozilla/5.0'})
        return response.json()


def get_all_reviews(appid):
    review_count=int(((get_one_review(str(appid)))["query_summary"])["total_reviews"])
    reviews = []
    cursor = '*'
    params = {
            'json' : 1,
            'filter' : 'all',
            'language' : 'english',
            'day_range' : 9223372036854775807,
            'review_type' : 'all',
            'purchase_type' : 'all'
            }
    while review_count > 0:
        params['cursor'] = cursor.encode()
        params['num_per_page'] = min(100, review_count)
        review_count -= 100
        response = get_one_review(appid, params)
        cursor = response['cursor']
        reviews += response['reviews']
        if len(response['reviews']) < 100: 
          break
    return (reviews)
    
def collect_reviews(FPS_appids):
  review_list=[]
  for i in tqdm(range(len(FPS_appids))):
    appid=str(FPS_appids[i])
    reviews=get_all_reviews(appid)
    review_list.append(reviews)
  return(review_list)
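
The cursor-following loop in `get_all_reviews` can be sketched in isolation, without touching the network. This is a minimal illustration only: `paginate_reviews`, `fetch_page`, `pause_seconds`, `max_pages` and the fake pages are hypothetical names and data introduced here, and the optional pause between calls is an assumption about how one might stay under a call limit, not a documented Steam requirement.

```python
import time
from typing import Callable, Dict, List


def paginate_reviews(fetch_page: Callable[[str], Dict],
                     pause_seconds: float = 0.0,
                     max_pages: int = 1000) -> List[Dict]:
    """Follow the cursor returned by each page until a page is empty or the cursor repeats."""
    reviews: List[Dict] = []
    cursor = "*"          # same starting cursor as in get_all_reviews
    seen_cursors = set()  # guards against looping on a repeated cursor
    for _ in range(max_pages):
        if cursor in seen_cursors:
            break
        seen_cursors.add(cursor)
        page = fetch_page(cursor)
        batch = page.get("reviews", [])
        if not batch:
            break
        reviews.extend(batch)
        cursor = page.get("cursor", cursor)
        if pause_seconds:
            time.sleep(pause_seconds)  # hypothetical throttle between calls
    return reviews


# Fake three-page response standing in for the real endpoint (made-up data).
FAKE_PAGES = {
    "*": {"cursor": "a", "reviews": [{"recommendationid": "1"}, {"recommendationid": "2"}]},
    "a": {"cursor": "b", "reviews": [{"recommendationid": "3"}]},
    "b": {"cursor": "b", "reviews": []},
}

collected = paginate_reviews(lambda cursor: FAKE_PAGES[cursor])
print([r["recommendationid"] for r in collected])
```

Passing the page fetcher in as a function keeps the pagination logic testable offline; in the notebook the equivalent role is played by `get_one_review`.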

In [None]:
validating_list=collect_reviews(random_id_list)

100%|██████████| 5/5 [03:23<00:00, 40.63s/it]


In [None]:
# converting the lists nicely into a dataframe
import pandas as pd 
def turn_data_into_df(games_id_list,single_FPS_reviews):
  game_id,steam_id_list,player_review_list,time_stamp_created_list,time_stamp_updated_list,scores,votes_up_list,recommendation_id_list=[],[],[],[],[],[],[],[]
  for i in tqdm(range(len(games_id_list))):
    game=games_id_list[i]
    game_reviews=single_FPS_reviews[i]
    for j in (range(len(game_reviews))):
      steam_id=game_reviews[j]["author"]["steamid"]
      player_review=game_reviews[j]["review"]
      timestamp_created=game_reviews[j]["timestamp_created"]
      timestamp_updated=game_reviews[j]["timestamp_updated"]
      score=game_reviews[j]["weighted_vote_score"]
      votes_up=game_reviews[j]["votes_up"]
      recommendation_id=game_reviews[j]["recommendationid"]
      game_id.append(game)
      steam_id_list.append(steam_id)
      player_review_list.append(player_review)
      time_stamp_created_list.append(timestamp_created)
      time_stamp_updated_list.append(timestamp_updated)
      scores.append(score)
      votes_up_list.append(votes_up)
      recommendation_id_list.append(recommendation_id)

  df = pd.DataFrame(list(zip(game_id, scores,steam_id_list,player_review_list,time_stamp_created_list,time_stamp_updated_list,votes_up_list,recommendation_id_list)), columns =['game_id','score', 'steam_id','review','timestamp_created','timestamp_updated','votes_up','recommendation_id'])  
  return(df)

In [None]:
df=turn_data_into_df(random_id_list,validating_list)

100%|██████████| 5/5 [00:00<00:00, 112.85it/s]


## Testing our hypothesized sorting algorithm for the default-sorted (observable) and recently-sorted (unobservable) 'summary' reviews


The code below re-simulates and validates the default-sorted 'summary' reviews that a reviewer would have seen while writing their own review. We do so by taking the time at which the review was written and applying our hypothesized default-sorted 'summary' algorithm as of that time.

### How our hypothesized default-ordered (observable) algorithm works:
We attempted to reconstruct its contents using the Steam-provided weighted_vote_score. Whenever enough reviews have been written within the past 30 days and have received at least one 'helpfulness' vote from other community members, we treat the weighted vote score as determining the order of reviews at the time of display. If there were not enough reviews published in the last 30 days to fill the main bar, we expanded our search under the assumption that Steam uses an extended window of 30--90 days to select the most helpful reviews. If at this stage there are still not enough reviews to be displayed, the same procedure is applied to reviews written in the 90--180 day period, after which all reviews older than 180 days are used.
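
The tiered fallback described above can be sketched compactly as follows. This is one simplified reading of the description, not the notebook's exact logic: `pick_main_bar`, `tier_bounds_days` and the toy reviews are hypothetical, and the sketch simply fills any remaining slots with the top-scored reviews of each older tier.

```python
DAY = 86400  # seconds per day


def pick_main_bar(reviews, written_at, tier_bounds_days=(30, 90, 180), n=10):
    """Fill up to n slots tier by tier (0-30, 30-90, 90-180, older), then rank by score."""
    # Only reviews written before the target review, with at least one helpful vote.
    eligible = [r for r in reviews
                if r["timestamp_created"] < written_at and r["votes_up"] > 0]
    picked = []
    lower = 0.0
    for upper in (*tier_bounds_days, float("inf")):
        slots_left = n - len(picked)
        if slots_left <= 0:
            break
        tier = [r for r in eligible
                if lower < (written_at - r["timestamp_created"]) / DAY <= upper]
        tier.sort(key=lambda r: r["score"], reverse=True)
        picked.extend(tier[:slots_left])  # top up from this tier only as needed
        lower = upper
    return sorted(picked, key=lambda r: r["score"], reverse=True)


# Toy data (made up): ages in days relative to the moment the review is written.
now = 1000 * DAY
toy_reviews = [
    {"recommendation_id": "A", "timestamp_created": now - 5 * DAY, "score": 0.60, "votes_up": 3},
    {"recommendation_id": "B", "timestamp_created": now - 10 * DAY, "score": 0.90, "votes_up": 0},
    {"recommendation_id": "C", "timestamp_created": now - 40 * DAY, "score": 0.80, "votes_up": 1},
    {"recommendation_id": "D", "timestamp_created": now - 200 * DAY, "score": 0.95, "votes_up": 7},
    {"recommendation_id": "E", "timestamp_created": now - 60 * DAY, "score": 0.70, "votes_up": 2},
]

shown = pick_main_bar(toy_reviews, now, n=3)
print([r["recommendation_id"] for r in shown])
```

In this toy case review B is dropped for having no helpful votes, and review D never appears despite its high score, because the three slots are already filled from the 0--30 and 30--90 day tiers.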

### How our hypothesized recently-ordered (unobservable) algorithm works:
To reconstruct the side-bar content, where recency-sorted reviews are displayed, we simply obtained the most recent reviews (relative to the time when the target review was written), excluding reviews already shown in the main-bar section of the web page.
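
A minimal sketch of this recency rule, under stated assumptions: `pick_side_bar`, the cap `n` and the toy reviews are hypothetical, and reviews are represented as plain dictionaries rather than a dataframe.

```python
def pick_side_bar(reviews, written_at, main_bar_ids, n=10):
    """Most recent reviews written before written_at, skipping those already in the main bar."""
    already_shown = set(main_bar_ids)
    earlier = [r for r in reviews
               if r["timestamp_created"] < written_at
               and r["recommendation_id"] not in already_shown]
    earlier.sort(key=lambda r: r["timestamp_created"], reverse=True)  # newest first
    return earlier[:n]


# Toy data (made up): review "d" is written after the target review, "b" is in the main bar.
toy_reviews = [
    {"recommendation_id": "a", "timestamp_created": 900},
    {"recommendation_id": "b", "timestamp_created": 950},
    {"recommendation_id": "c", "timestamp_created": 990},
    {"recommendation_id": "d", "timestamp_created": 1005},
]

side_bar = pick_side_bar(toy_reviews, written_at=1000, main_bar_ids=["b"], n=2)
print([r["recommendation_id"] for r in side_bar])
```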

In [None]:
# this is our hypothesized default-sorted algorithm for the summary reviews. 

from datetime import datetime, timedelta
from tqdm import tqdm
import time

# this collects a dataframe and sorts them according to score
def sort_and_rearrange(temp_df):
  temp_df["score"] = pd.to_numeric(temp_df["score"])
  temp_df=temp_df.sort_values("score",ascending=False)
  temp_df["index"]=[i for i in range(len(temp_df))]
  temp_df=temp_df.set_index("index")
  return(temp_df)

# this takes a timestamp and a number of days, and subtracts that many days from the timestamp
def get_x_days_ago(end_int,days):
  end= datetime.fromtimestamp(end_int)
  start=(end - timedelta(days=days))
  start_int=int((time.mktime(start.timetuple())))
  return(start_int)

# this gives us the dataframe of comments written from (end_int-days) date to end_int date
def get_x_days_ago_comments(days, df,end_int):
  start=get_x_days_ago(int(end_int),days)
  df['timestamp_created'] = pd.to_numeric(df['timestamp_created'])
  temp_dates = df[df['timestamp_created'].between(start,(end_int-1))]
  return(temp_dates)


def condition_against(temp1,temp2,difference):
  if len(temp1)==10:
    temp=temp1
  elif difference==10:
    temp=temp2[0:len(temp2)]
  elif len(temp2)==0:
    temp=temp1
  else:
    temp=temp1
    #print(len(list(temp2["score"])),len(list(temp1["score"])))
    for i in range(difference):
      if i>=len(list(temp2["score"])):
        temp=temp
      elif float((list(temp2["score"]))[i])>float((list(temp1["score"]))[0]):
        to_be_added=temp2[i:i+1]
        temp=pd.concat([temp,to_be_added])
      else:
        temp=temp
  return(temp)

# temp_conditional_return_1 to temp_conditional_return_3 cover the 3 scenarios in which reviews are compared across the different time windows (i.e. 30 days, 90 days, 180 days)
def temp_conditional_return_1(time_created,day_range1,day_range_ultimate,df):
  day_range1_reviews=get_x_days_ago_comments(day_range1,df,time_created)
  day_range1_date=get_x_days_ago(time_created,day_range1)
  days_diff=day_range_ultimate-day_range1
  in_between_days_reviews=get_x_days_ago_comments(days_diff, df,day_range1_date)
  difference=int(10-(len(day_range1_reviews)))
  temp1=sort_and_rearrange(day_range1_reviews)
  temp2=sort_and_rearrange(in_between_days_reviews)
  #print(difference)
  temp=condition_against(temp1,temp2,difference)
  return(temp)



def temp_conditional_return_2(time_created,day_range1,day_range2,day_range_ultimate,df):
  day_range1_reviews=get_x_days_ago_comments(day_range1,df,time_created)
  day_range1_date=get_x_days_ago(time_created,day_range1)
  days_diff_1=day_range2-day_range1
  day_range2_reviews=get_x_days_ago_comments(days_diff_1, df,day_range1_date)
  day_range2_date=get_x_days_ago(time_created,day_range2)
  difference=int(10-(len(day_range1_reviews)))
  temp1=sort_and_rearrange(day_range1_reviews)
  temp2=sort_and_rearrange(day_range2_reviews)
  temp=condition_against(temp1,temp2,difference)
  difference_2=int(10-len(temp))
  days_diff_2=day_range_ultimate-day_range2
  day_range3_reviews=get_x_days_ago_comments(days_diff_2, df,day_range2_date)
  temp3=sort_and_rearrange(day_range3_reviews)
  temp=condition_against(temp,temp3,difference_2)
  return(temp)



def temp_conditional_return_3(time_created,day_range1,day_range2,day_range_3,start_date,df):

  # 0-30 days
  day_range1_reviews=get_x_days_ago_comments(day_range1,df,time_created)
  day_range1_date=get_x_days_ago(time_created,day_range1)

  # 30-90 days
  days_diff_1=day_range2-day_range1
  day_range2_reviews=get_x_days_ago_comments(days_diff_1, df,day_range1_date)
  day_range2_date=get_x_days_ago(time_created,day_range2)
  difference=int(10-(len(day_range1_reviews)))
  temp1=sort_and_rearrange(day_range1_reviews)
  temp2=sort_and_rearrange(day_range2_reviews)
  temp=condition_against(temp1,temp2,difference)

  # 90-180 days
  difference_2=int(10-len(temp))
  days_diff_2=day_range_3-day_range2
  day_range3_reviews=get_x_days_ago_comments(days_diff_2, df,day_range2_date)
  temp3=sort_and_rearrange(day_range3_reviews)
  temp=condition_against(temp,temp3,difference_2)

  # 180 days - forever days
  day_range3_date=get_x_days_ago(time_created,day_range_3)
  day_range_4=int((time_created - start_date) / 86400)
  difference_3=int(10-len(temp))
  days_diff_3=day_range_4-day_range_3
  day_range4_reviews=get_x_days_ago_comments(days_diff_3, df,day_range3_date)
  temp4=sort_and_rearrange(day_range4_reviews)
  temp=condition_against(temp,temp4,difference_3)
  return(temp)

import math
def get_list_of_visible_comments(df):
  order_of_visibility,order_of_scores,order_of_times,list_of_visible_id=[],[],[],[]
  x=0
  for i in tqdm(range(len(df))):
    #print(df["steam_id"][i])
    player=str(df["steam_id"][i])
    
    if player!="validating":
      continue
    else:
      game_id=str(df["game_id"][i])
      time_created=int(df["timestamp_created"][i])
      temp=df[(df['game_id'].astype(str)==game_id)] # compare ids as strings, since random_id_list mixes int and str ids
      temp=temp[(temp['votes_up']!=0)]
      temp_dates = get_x_days_ago_comments(30, temp,time_created)
      if len(temp)==0:
        temp=temp_dates
      elif len(temp_dates)>=10:
        temp=temp_dates
      else: 
        temp_90_dates = temp_conditional_return_1(time_created,30,90,temp)
        if len(temp_90_dates)>=10:
          temp=temp_90_dates
        else:
          temp_180_dates = temp_conditional_return_2(time_created,30,90,180,temp)
          if len(temp_180_dates)>=10:
            temp=temp_180_dates
          else:
            start_date=int(temp['timestamp_created'].min())
            temp=temp_conditional_return_3(time_created,30,90,180,start_date,temp)

    
      temp["score"] = pd.to_numeric(temp["score"])
      temp=temp.sort_values("score",ascending=False)
      temp["index"]=[i for i in range(len(temp))]
      temp=temp.set_index("index")
      if len(temp)>=10:
        n=10
      else:
        n=len(temp)
      list_of_visible=list(temp["review"][0:n]) # check this line by running below
      list_of_scores=list(temp["score"][0:n])
      list_of_times=list(temp["timestamp_created"][0:n])
      list_of_visible_2=list(temp["recommendation_id"][0:n])
      list_of_visible_id.append(list_of_visible_2)
      order_of_visibility.append(list_of_visible)
      order_of_scores.append(list_of_scores)
      order_of_times.append(list_of_times)
      #x+=1
  return(order_of_visibility,order_of_scores,order_of_times,list_of_visible_id)

In [None]:
# creating hypothetical data such that we are the reviewers writing the review in real time. 

df=df[(df['steam_id']!="validating")]
stimulated_time=1656606800
import random
game_id=random_id_list
scores=[(random.random()) for i in range(len(random_id_list))]
steam_id_list=["validating","validating","validating","validating","validating"]
player_review_list=["validating","validating","validating","validating","validating"]
time_stamp_created_list=[stimulated_time,stimulated_time,stimulated_time,stimulated_time,stimulated_time]
recommendation_id_list_2=["1",'2','3','4','5']
time_stamp_updated_list=time_stamp_created_list
votes_up_list=[0,0,0,0,0]
stimulated = pd.DataFrame(list(zip(game_id, scores,steam_id_list,player_review_list,time_stamp_created_list,time_stamp_updated_list,votes_up_list,recommendation_id_list_2)), columns =['game_id','score', 'steam_id','review','timestamp_created','timestamp_updated','votes_up','recommendation_id']) 
df=pd.concat([df,stimulated])
df["index"]=[i for i in range(len(df))]
df=df.set_index("index")

In [None]:
# perform the re-simulation with the hypothesized default-sorted algorithm 
# visible_ids avoids shadowing the Python built-in id(); the dataframe column is still named 'id'
order_of_visible_comments,order_of_scores,order_of_times,visible_ids=get_list_of_visible_comments(df)
validating = pd.DataFrame(list(zip(game_id,order_of_visible_comments,order_of_scores,order_of_times,visible_ids)), columns =['game_id','reviews','scores', 'times order','id']) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
100%|██████████| 31351/31351 [00:00<00:00, 101706.97it/s]


|  | game_id | reviews | scores | times order |
|---|---|---|---|---|
| 0 | 252870 | [After seeing Neebs gaming announce that they ... | [0.8003709912300109, 0.719564139842987, 0.5564... | [1652381956, 1650579308, 1655533461, 165460486... |
| 1 | 1547380 | [[table][th]\nReview by [url=https://store.ste... | [0.7956037521362304, 0.789889633655548, 0.6298... | [1644564603, 1644558942, 1649456280, 165500634... |
| 2 | 298110 | [Far Cry 4 is currently [h1]FREE[/h1] on Amazo... | [0.8769996762275695, 0.8227600455284118, 0.667... | [1654100832, 1654153357, 1654373965, 165420953... |
| 3 | 39530 | [Faster than a bullet\nTerrifying scream\nEnra... | [0.865518569946289, 0.8042218089103698, 0.7882... | [1533718571, 1493389162, 1496480907, 157765442... |
| 4 | 240 | [Before I played Counter-Strike: Source i had ... | [0.9212783575057985, 0.9050492644309996, 0.838... | [1654604660, 1654361577, 1654020027, 165412612... |


In [None]:
# print the reviews nicely

from typing import List
def print_nicely(df):
  game_list,reviews_list,order=[],[],[]
  for i in range(0,len(df)):
    game=(df["game_id"])[i]
    reviews=list((df["reviews"])[i])
    # a game may have fewer than 10 visible reviews, so iterate over what is there
    for j,review in enumerate(reviews[:10]):
      game_list.append(game)
      reviews_list.append(str(review))
      order.append(j+1)
  new_df = pd.DataFrame(list(zip(game_list,reviews_list,order)), columns =['game_id','review',"order"])
  return(new_df)

In [None]:
validation=print_nicely(validating)
validation

|  | game_id | review | order |
|---|---|---|---|
| 0 | 252870 | After seeing Neebs gaming announce that they w... | 1 |
| 1 | 252870 | The game is definitely fun IF you play with pe... | 2 |
| 2 | 252870 | Pulsar is a game that can't really be describe... | 3 |
| 3 | 252870 | My new favourite! Great fun for playing with f... | 4 |
| 4 | 252870 | This game is janky as all hell with poor feedb... | 5 |
| 5 | 252870 | A load of fun, but better with a real crew tha... | 6 |
| 6 | 252870 | Amazing game to play with friends. Cool mechan... | 7 |
| 7 | 252870 | funny space game where you get mad at friends | 8 |
| 8 | 252870 | Game be gud | 9 |
| 9 | 252870 | i think this game is really fun with friends a... | 10 |


In [None]:
# Comparing our results to the ground-truth results using Kendall's Tau coefficient

## ID=252870	
ground_truth=[1,2,10,3,4,5,6,7,8,9]
predicted=[i for i in range(1,11)]
from scipy import stats
tau, p_value = stats.kendalltau(ground_truth, predicted)
print("tau is", tau, " with p-value of", p_value)

tau is 0.6888888888888888  with p-value of 0.00468694885361552
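
As a cross-check on the coefficient above, Kendall's tau can be computed by hand from concordant and discordant pairs. The sketch below assumes there are no ties (true for these rankings), in which case the tau-b reported by `stats.kendalltau` reduces to (concordant - discordant) / total pairs; `kendall_tau_no_ties` is a hypothetical helper written for illustration.

```python
from itertools import combinations


def kendall_tau_no_ties(x, y):
    """Return (concordant, discordant, tau) for two equally long rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1   # the pair is ordered the same way in both rankings
        elif direction < 0:
            discordant += 1   # the pair is ordered in opposite ways
    total_pairs = len(x) * (len(x) - 1) // 2
    return concordant, discordant, (concordant - discordant) / total_pairs


# Same rankings as for game 252870.
ground_truth_252870 = [1, 2, 10, 3, 4, 5, 6, 7, 8, 9]
predicted_ranks = list(range(1, 11))
concordant_pairs, discordant_pairs, tau_by_hand = kendall_tau_no_ties(
    ground_truth_252870, predicted_ranks)
print(concordant_pairs, discordant_pairs, tau_by_hand)
```

The single misplaced review (rank 10 appearing third) is out of order relative to the seven reviews after it, giving 7 discordant pairs out of 45, so tau = (38 - 7) / 45 = 31/45 ≈ 0.689.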


In [None]:
## ID=1547380	
ground_truth=[1,3,11,4,5,6,7,8,9,10]
tau, p_value = stats.kendalltau(ground_truth, predicted)
print("tau is", tau, " with p-value of", p_value)

tau is 0.6888888888888888  with p-value of 0.00468694885361552


In [None]:
## ID=298110
ground_truth=[1,2,3,4,5,6,7,8,11,12]
tau, p_value = stats.kendalltau(ground_truth, predicted)
print("tau is", tau, " with p-value of", p_value)

tau is 0.9999999999999999  with p-value of 5.511463844797178e-07


In [None]:
## ID = 39530
ground_truth=[1,2,11,6,7,8,12,9,10,13]
tau, p_value = stats.kendalltau(ground_truth, predicted)
print("tau is", tau, " with p-value of", p_value)

tau is 0.6888888888888888  with p-value of 0.00468694885361552


In [None]:
## ID = 240
ground_truth=[1,2,3,4,5,6,7,8,9,10]
tau, p_value = stats.kendalltau(ground_truth, predicted)
print("tau is", tau, " with p-value of", p_value)

tau is 0.9999999999999999  with p-value of 5.511463844797178e-07


## Testing our algorithm for the recently-sorted reviews displayed on the miniature side-bar of the webpage. 

In [None]:
# this is the recently-sorted algorithm.

## it is pretty straightforward: it picks out the most recent reviews, except for those already listed in the default-sorted display of reviews. 

def get_temp_df(df,reviews_list):
  invisible_comments_list_all=[]
  for i in tqdm(range(len(df))):
    end_int=1656606800
    start=int(end_int-(30*86400))
    game_id=str(df.game_id[i])
    temp_dates=reviews_list[(reviews_list['game_id'].astype(str)==game_id)] # get reviews only from the same game (ids compared as strings, since int and str ids are mixed)
    temp_dates = temp_dates[temp_dates['timestamp_created'].between(start,(end_int-1))] # get reviews from the past 30 days since the date of the written comment
    temp_dates=temp_dates.sort_values("timestamp_created",ascending=False) # sort by most recent to earliest
    observable_list=df.id[i] 
    temp_dates_list=list(temp_dates.recommendation_id)
    unobservable_temp_dates_list = [x for x in temp_dates_list if x not in observable_list] # keep only reviews not already shown in the default-sorted (observable) display
    unobservable_temp_dates_list=unobservable_temp_dates_list[0:10]
    invisible_comments_list_all.append(unobservable_temp_dates_list)   
  return(invisible_comments_list_all)
unobservable_comments_list=get_temp_df(validating,df)

# We have retrieved the review ids; this function looks up and returns the actual review text for each id. 
def convert_id_to_review(df,list_of_review):
  review_list=[]
  for i in (list_of_review):
    for j in range(len(df)):
      if str(df.recommendation_id[j])==str(i):
        review_list.append(df.review[j])
      else:
        continue

  return(review_list)

# print out the reviews. 
for i in unobservable_comments_list:
  print(convert_id_to_review(df,i))

# in the order of: [252870, '1547380', '298110', '39530', 240] 
 
# all of the outputs match the screenshots exactly, so tau = 1 (p-value effectively 0)

["Pretty relaxing space sim, flying isn't too bad and once you get the hang of what you should upgrade first and work toward, you're set. I've been getting by just fine playing alone with just the bots to help, but playing online is where it shines for extra enjoyment haha :D", 'Nice game to playing with your friends', "Really fun game putting you and several other players in control of a spaceship, this game completely depends on the people you're playing with, but In my experience it's not that difficult to find a good crew of randoms. The challenges this game gives can be somewhat difficult with some puzzles, but the spaceship combat is like no other! \nI especially love the engineer role as balancing power is a really fun challenge that requires constant attnetion during fights as you need to maintain power to shields and weapons without overheating your reactor! With good communication you can achieve this by having cool down periods where you throttle the reactor. As a captain yo