<a href="https://colab.research.google.com/github/adnan-mujagic/steam-recommender-system-using-implicitly-inferred-ratings/blob/main/steam_dataset_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing dependencies
`numpy` and `pandas` are libraries commonly used for machine learning and data processing; `time` is a Python standard-library module, used here to measure execution time.

In [None]:
import numpy as np
import pandas as pd
import time

## Importing dataset
The dataset of interest is the `steam-200k.csv` file, located in my Google Drive.

In [None]:
dataset = pd.read_csv("/content/drive/MyDrive/ML/Steam/steam-200k.csv", header = 0, names = ["user_id", "game_name", "action", "hours_played", "_"])

# take a few samples from the dataset
dataset.sample(5)

Unnamed: 0,user_id,game_name,action,hours_played,_
184759,110248234,Call of Duty Modern Warfare 3,play,17.0,0
136479,183838531,Portal 2,purchase,1.0,0
26482,105877396,Dwarfs F2P,play,1.0,0
1541,11373749,Trials 2 Second Edition,play,7.7,0
32724,86912006,Post Apocalyptic Mayhem,purchase,1.0,0


I will drop the last column since it is redundant

In [None]:
dataset = dataset.drop(columns = ["_"])

dataset.sample(5)

Unnamed: 0,user_id,game_name,action,hours_played
148011,11970504,Styx Master of Shadows,purchase,1.0
127916,197190654,Euro Truck Simulator 2,purchase,1.0
47055,20464587,Robocraft,play,0.6
39033,156297441,Company of Heroes Opposing Fronts,purchase,1.0
126185,186126772,Call of Duty Black Ops II - Zombies,purchase,1.0


## Getting a better understanding of the data


In [None]:
dataset.shape

(199999, 4)

In [None]:
print("Number of users in dataset is:", dataset.user_id.unique().shape[0])
print("Number of games in dataset is:", dataset.game_name.unique().shape[0])

Number of users in dataset is: 12393
Number of games in dataset is: 5155


In [None]:
dataset.action.value_counts()

purchase    129510
play         70489
Name: action, dtype: int64

To generate implicit ratings, I will compare a particular user's playing time for a game to the average playing time across all users who have that game. The score will be high if the user's playing time exceeds the average, and low if it is below average.

Before I get into that, I will create some helper functions.

I noticed the dataset is not very clean. Each user and game pair can appear under two actions, `play` and `purchase`, and I want to avoid that duplication. I only want one row per user and game that records the total playing time, with 0 hours if the user purchased the game but never played it.

In [None]:
def clean_hours_played(input_dataframe):
  # collect rows in a list and build the dataframe once at the end;
  # DataFrame.append was removed in pandas 2.0 and is slow inside a loop
  rows = []

  unique_users = input_dataframe.user_id.unique()
  unique_users_count = unique_users.shape[0]
  processed_users = 0

  for user_id in unique_users:
    user_actions = input_dataframe.loc[input_dataframe.user_id == user_id]
    for game_name in user_actions.game_name.unique():
      user_actions_per_game = user_actions.loc[user_actions.game_name == game_name]
      play_rows = user_actions_per_game.loc[user_actions_per_game.action == "play"]
      if user_actions_per_game.shape[0] == 2 and not play_rows.empty:
        # purchased and played: use the recorded playtime
        hours_played = play_rows.iloc[0].hours_played
      else:
        # otherwise record 0 hours played as the actual playtime
        hours_played = 0
      rows.append({'user_id': user_id, 'game_name': game_name, 'hours_played': hours_played})
    processed_users = processed_users + 1
    print("Processed ", processed_users, "/", unique_users_count, ".")

  return pd.DataFrame(rows, columns = ["user_id", "game_name", "hours_played"])
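
The loop above visits every user and game pair one at a time, which is slow on roughly 200k rows. As a rough sketch only, the same idea can be expressed with a `groupby` and a left merge. The function name `clean_hours_played_vectorized` and the small `sample_actions` frame are made up for illustration. Note one assumed difference: this sketch uses the first `play` row whenever one exists, whereas the loop records 0 hours for any pair that does not have exactly two rows.

```python
import pandas as pd

def clean_hours_played_vectorized(input_dataframe):
  # hypothetical helper, not part of the original notebook
  # one row per (user, game) pair, in order of first appearance
  pairs = input_dataframe[["user_id", "game_name"]].drop_duplicates()
  # first recorded playtime for every pair that has a "play" action
  played = (
    input_dataframe.loc[input_dataframe.action == "play"]
    .groupby(["user_id", "game_name"], sort = False)["hours_played"]
    .first()
    .reset_index()
  )
  # pairs without a "play" row get NaN from the left merge, then 0 hours
  merged = pairs.merge(played, on = ["user_id", "game_name"], how = "left")
  merged["hours_played"] = merged["hours_played"].fillna(0.0)
  return merged

# made-up sample data in the same shape as the Steam dataset
sample_actions = pd.DataFrame({
  "user_id": [1, 1, 1, 2],
  "game_name": ["A", "A", "B", "A"],
  "action": ["purchase", "play", "purchase", "purchase"],
  "hours_played": [1.0, 5.0, 1.0, 1.0],
})
cleaned = clean_hours_played_vectorized(sample_actions)
print(cleaned)
```

If the two versions need to agree exactly, the pairs with more or fewer than two rows would have to be handled separately before the merge.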

In [None]:
# clean_dataset = clean_hours_played(dataset) -> run this the first time to generate clean dataset
clean_dataset_export_path = "/content/drive/MyDrive/ML/Steam/steam-200k-clean.csv"
clean_dataset = pd.read_csv(clean_dataset_export_path)

clean_dataset.head()

Unnamed: 0,user_id,game_name,hours_played
0,151603712,The Elder Scrolls V Skyrim,0.0
1,151603712,Fallout 4,87.0
2,151603712,Spore,14.9
3,151603712,Fallout New Vegas,12.1
4,151603712,Left 4 Dead 2,8.9


In [None]:
clean_dataset.sample(10)

Unnamed: 0,user_id,game_name,hours_played
2241,202395275,Dota 2,4.3
17043,98649241,GTR Evolution,0.0
35456,17531316,Half-Life 2 Deathmatch,0.0
29248,229556803,Team Fortress 2,23.0
70663,211454678,One Finger Death Punch,9.9
124696,20772968,Deus Ex Human Revolution,0.0
38138,293600797,RaceRoom Racing Experience,0.0
69766,172434236,Goat Simulator,1.7
20286,9128105,Portal 2,7.2
73343,82686146,Clive Barker's Jericho,3.8


In [None]:
clean_dataset.shape

(128804, 3)

In [None]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128804 entries, 0 to 128803
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   user_id       128804 non-null  int64  
 1   game_name     128804 non-null  object 
 2   hours_played  128804 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.9+ MB


In [None]:
clean_dataset.to_csv(clean_dataset_export_path, index = False)

## Implicit rating deduction formula
Now that I have a cleaner dataset, I can start developing the algorithm for implicit rating deduction.

### Version 1
Rating for user `u` and game `g` is:

```python
rating = 5 * Sigmoid((hours_played_u_g - average_hours_played_g) / average_hours_played_g)
```

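Since the sigmoid maps any real number into `(0, 1)`, the ratings fall inside the range `(0, 5)`. In practice the lower end is never reached: playtime cannot be negative, so the argument of the sigmoid is at least `-1`, and the lowest possible rating is `5 * sigmoid(-1)`, approximately `1.34`.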

This function is further explained in the following cells.
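
To get a feel for the formula before the helper functions are built, here is a minimal standalone sketch. The function `rating_from_hours` and the average of 10 hours are made up for illustration and are not part of the notebook's pipeline; the sketch also assumes the average is non-zero.

```python
import numpy as np

def rating_from_hours(hours_played, average_hours):
  # hypothetical helper: Version 1 formula for a single user and game
  joy = (hours_played - average_hours) / average_hours
  return 5 / (1 + np.exp(-joy))

average = 10.0  # assumed average playtime for some game
examples = {hours: rating_from_hours(hours, average) for hours in (0.0, 10.0, 20.0, 100.0)}
for hours, rating in examples.items():
  print(hours, "hours ->", round(rating, 3))
```

Zero hours gives the lowest possible rating (about 1.34), playing exactly the average gives 2.5, and the rating approaches 5 as playtime grows far beyond the average.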

In [None]:
# Wrapper function to measure execution time of other functions
def time_benchmark(func):
  
    def wrap(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        elapsed = end - start
        print(func.__name__, "returned a result after:", elapsed, "seconds")
        return result
    return wrap

In [None]:
@time_benchmark
def get_average_hours_played_by_game(game_name):
  return clean_dataset.loc[clean_dataset.game_name == game_name].hours_played.mean()

In [None]:
get_average_hours_played_by_game("Far Cry 3")

get_average_hours_played_by_game returned a result after: 0.01874375343322754 seconds


24.52173913043478

In [None]:
get_average_hours_played_by_game("Counter-Strike Global Offensive")

228.59178470254955
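
Each call to `get_average_hours_played_by_game` filters the whole dataframe again. One possible alternative, sketched here on made-up data, is to compute all per-game averages once and look them up afterwards. The names `build_average_hours_lookup` and `sample` are assumptions for illustration only.

```python
import pandas as pd

def build_average_hours_lookup(dataframe):
  # hypothetical helper: one mean per game, computed in a single pass
  return dataframe.groupby("game_name")["hours_played"].mean()

# made-up sample in the same shape as clean_dataset
sample = pd.DataFrame({
  "user_id": [1, 2, 3, 1],
  "game_name": ["Portal 2", "Portal 2", "Portal 2", "Dota 2"],
  "hours_played": [0.0, 6.0, 9.0, 4.0],
})
average_hours_by_game = build_average_hours_lookup(sample)
portal_average = average_hours_by_game["Portal 2"]
print(portal_average)
```

The result is a `pandas` Series indexed by game name, so a lookup costs far less than re-filtering the full dataset for every row.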

In [None]:
def get_hours_played_by_user_and_game(user_id, game_name):
  try:
    return clean_dataset.loc[(clean_dataset.user_id == user_id) & (clean_dataset.game_name == game_name)].iloc[0].hours_played
  except IndexError:  # no row exists for this user and game
    return 0

In [None]:
get_hours_played_by_user_and_game(126340495, "Saints Row IV")

48.0

In [None]:
get_hours_played_by_user_and_game(254906420, "Team Fortress 2")

1.2

In [None]:
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

In [None]:
sigmoid(0)

0.5

In [None]:
sigmoid(3)

0.9525741268224334

In [None]:
sigmoid(-5)

0.0066928509242848554

In [None]:
# turn this on if you want -> @time_benchmark
def get_user_rating_for_game(user_id, game_name):
  hours_played_by_user = get_hours_played_by_user_and_game(user_id, game_name)
  average_hours_played_for_game = get_average_hours_played_by_game(game_name)
  if average_hours_played_for_game == 0.0:
    joy_coefficient = 0
  else:
    joy_coefficient = (hours_played_by_user - average_hours_played_for_game) / average_hours_played_for_game
  return sigmoid(joy_coefficient) * 5

In [None]:
clean_dataset.sample(10)

Unnamed: 0,user_id,game_name,hours_played
65632,180901075,Silent Storm,0.0
117153,52072237,Counter-Strike,4.8
97889,110546460,Portal 2,5.8
91642,158576100,Euro Truck Simulator 2,0.0
676,97298878,Europa Universalis III,8.3
69339,23608098,Age of Empires II HD The Forgotten,0.0
12849,26430231,Nosgoth,0.8
118924,154230933,State of Decay - Breakdown,0.0
103445,87694960,Bionic Dues,0.0
88264,155919035,Crash Time II,0.6


Let's analyze this formula

In [None]:
get_average_hours_played_by_game("Warframe")

31.965289256198346

In [None]:
get_user_rating_for_game(98649241, "Warframe")

4.157689811271309

The previous cells give an example of how the ratings are generated.

In the dataset, the user with id `98649241` spent `83.0` hours playing a game called `Warframe`.

As calculated above, the average playtime for `Warframe` is approximately `31.97`, which means this user exceeded the average by approximately `1.6` average playtimes for that game: `(83.0 - 31.97) / 31.97` is roughly `1.6`. I named this number the `joy_coefficient`.

The joy coefficient is positive if the user plays the game more than average, negative if the user plays it less than average, and 0 if the user plays it exactly the average amount of time.


This leads to the conclusion that the user likes the game quite a lot. In fact, the implicitly inferred rating obtained by feeding the `joy_coefficient` to the `5 * sigmoid` function is `4.16 / 5`, as you can see in the plot below:

![Sigmoid](https://drive.google.com/uc?id=12krBswoap1c35m9Pp8pZfqxb1gWg5xIV)
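
The `Warframe` numbers can be checked by hand. The sketch below only repeats the arithmetic with the values quoted in this notebook (83.0 hours played, an average of about 31.97 hours); the variable names are chosen for illustration.

```python
import numpy as np

hours_played = 83.0                   # playtime of user 98649241 for Warframe
average_hours = 31.965289256198346    # average Warframe playtime computed above

# how far the user is above the average, measured in average playtimes
joy_coefficient = (hours_played - average_hours) / average_hours
# squash into (0, 1) with the sigmoid and scale to a 5-point rating
rating = 5 / (1 + np.exp(-joy_coefficient))

print(round(joy_coefficient, 2), round(rating, 2))
```

The rounded values agree with the `1.6` joy coefficient and the `4.16` rating discussed above.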

Finally, I will iterate through the current dataset and create a new one in which each row has the following columns: `user_id`, `game_name` and `rating`, where the rating is generated using the function explained above.

In [None]:
def get_dataset_with_implicit_ratings(input_dataframe, debug = False, progress_update_percentage = 1):
  # collect rows in a list and build the dataframe once at the end;
  # DataFrame.append was removed in pandas 2.0 and is slow inside a loop
  rows = []
  progress = 0.
  number_of_rows = input_dataframe.shape[0]
  for idx, row in input_dataframe.iterrows():
    # report progress each time another progress_update_percentage is completed
    if debug and ((idx + 1) / number_of_rows) * 100 > progress + progress_update_percentage:
      progress = progress + progress_update_percentage
      print("Processing progress:", progress, "%")
    user_id, game_name = row.user_id, row.game_name
    rows.append({"user_id": user_id, "game_name": game_name, "rating": get_user_rating_for_game(user_id, game_name)})
  return pd.DataFrame(rows, columns = ["user_id", "game_name", "rating"])

In [None]:
# clean_dataset = pd.read_csv("/content/drive/MyDrive/ML/Steam/steam-200k-clean.csv")

# clean_dataset = clean_dataset.iloc[:100]

dataset_with_ratings = get_dataset_with_implicit_ratings(clean_dataset, debug = True)
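
Calling `get_user_rating_for_game` once per row means two dataframe scans for each of the 128804 rows. As a hedged alternative, the whole rating column could be computed in one pass with `groupby(...).transform`. The function `add_implicit_ratings_vectorized` and the `sample_clean` frame are assumptions for illustration, and the sketch assumes each user and game pair appears only once in the input.

```python
import numpy as np
import pandas as pd

def add_implicit_ratings_vectorized(input_dataframe):
  # hypothetical helper, not part of the original notebook
  hours = input_dataframe["hours_played"]
  # per-row average playtime of the row's game
  average = input_dataframe.groupby("game_name")["hours_played"].transform("mean")
  # mask zero averages as NaN so the division is safe, then use joy = 0 there
  safe_average = average.where(average != 0.0)
  joy = ((hours - average) / safe_average).fillna(0.0)
  ratings = 5 / (1 + np.exp(-joy))
  return input_dataframe[["user_id", "game_name"]].assign(rating = ratings)

# made-up sample in the same shape as clean_dataset
sample_clean = pd.DataFrame({
  "user_id": [1, 2, 1],
  "game_name": ["A", "A", "B"],
  "hours_played": [0.0, 20.0, 0.0],
})
sample_ratings = add_implicit_ratings_vectorized(sample_clean)
print(sample_ratings)
```

Game `B` has an average of 0 hours, so its only row gets the neutral rating of 2.5, matching the zero-average branch in `get_user_rating_for_game`.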

In [None]:
dataset_with_ratings.sample(5)

Unnamed: 0,user_id,game_name,rating
21346,244290943,Audition Online,3.263658
86166,62486194,Empire Total War,3.593875
31634,24872208,Counter-Strike Condition Zero,1.344707
46333,302652704,Unturned,1.402747
60977,100741663,Rome Total War,1.714285


In [None]:
dataset_with_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128804 entries, 0 to 128803
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    128804 non-null  object 
 1   game_name  128804 non-null  object 
 2   rating     128804 non-null  float64
dtypes: float64(1), object(2)
memory usage: 2.9+ MB


In [None]:
# Export the dataset with ratings

dataset_with_ratings_export_path = "/content/drive/MyDrive/ML/Steam/steam-200k-with-ratings.csv"

dataset_with_ratings.to_csv(dataset_with_ratings_export_path, index = False)

Now that I have my inferred ratings, it's time to get started with the algorithm. I will do that in a separate notebook. You can see the implementation details of the recommender system at this [link](https://github.com/adnan-mujagic/steam-recommender-system-using-implicitly-inferred-ratings/blob/main/recommender_system.ipynb).