## Simple Recommenders on Steam Games Data

In this notebook we try out two different recommenders on the [Steam dataset on Kaggle](https://www.kaggle.com/tamber/steam-video-games). We will use:
- ```numpy, pandas, sklearn``` for data preprocessing
- ```turicreate``` for building an item-item similarity recommender

The dataset has no header but comes in 5 columns:
- User ID: integer
- Game title: string
- Activity: string (purchase/play)
- Status: float (1.0 if activity is purchase, total number of hours in game if activity is play)
- A column containing only 0s, which we will discard
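
As a minimal sketch of reading a headerless file in this layout, the column names can be passed explicitly to `pd.read_csv`. The names used here (`userId`, `game`, `activity`, `status`, `unused`) and the sample rows are illustrative assumptions, not the notebook's actual loading code.

```python
import io
import pandas as pd

# Tiny inline sample in the same 5-column, headerless layout (made-up rows).
raw = io.StringIO(
    '1001,"Game A",purchase,1.0,0\n'
    '1001,"Game A",play,12.5,0\n'
    '1002,"Game B",purchase,1.0,0\n'
)

# Column names are assumptions chosen for illustration.
columns = ["userId", "game", "activity", "status", "unused"]
sample = pd.read_csv(raw, header=None, names=columns)

# Discard the all-zero fifth column.
sample = sample.drop(columns="unused")
print(sample)
```

For the real data, the `io.StringIO` object would be replaced by the path to the downloaded CSV file.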

In [38]:
import pandas as pd
import numpy as np 
import matplotlib
# import matplotlib.pyplot as plt 
import seaborn as sns
import turicreate
import sklearn as sk
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.model_selection import train_test_split

In [61]:
steam_data = pd.read_pickle("clean_steam_data.pkl")
# Work on an independent copy so the loaded frame stays untouched
data_copy = steam_data.copy()
data_copy.head()

We can easily get basic information about the data in each column:

In [56]:
data_copy.describe()

| | userId | purchase | play | Actions |
|---|---|---|---|---|
| count | 129535.0 | 129535.0 | 70785.0 | 70785.0 |
| mean | 102441000.0 | 1.0 | 1.0 | 48.770761 |
| std | 72362060.0 | 0.0 | 0.0 | 228.927258 |
| min | 5250.0 | 1.0 | 1.0 | 0.1 |
| 25% | 45483460.0 | 1.0 | 1.0 | 1.0 |
| 50% | 86055700.0 | 1.0 | 1.0 | 4.5 |
| 75% | 154230700.0 | 1.0 | 1.0 | 19.1 |
| max | 309903100.0 | 1.0 | 1.0 | 11754.0 |


#### Some observations

The summary above shows that ```purchase``` takes only a single value, 1.0. We will **not** include purchase status when building our recommender, for two reasons:
1. The number of hours played is more informative: it is implicit feedback on how much a user likes each game they have purchased.
2. [Jaccard similarity](https://apple.github.io/turicreate/docs/api/generated/turicreate.recommender.item_similarity_recommender.ItemSimilarityRecommender.html) measures the overlap between two sets of elements, so when it is used the number of hours played is reduced to a binary purchase status (1/0) anyway.
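
To make the second point concrete, here is a small sketch of Jaccard similarity computed by hand on two hypothetical sets of user IDs. It only illustrates the definition (size of the intersection divided by size of the union) and is not how ```turicreate``` computes it internally.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets have similarity 0.
        return 0.0
    return len(a & b) / len(union)

# Hypothetical sets of user IDs who own each game.
owners_game_x = {1, 2, 3, 4}
owners_game_y = {3, 4, 5}

# 2 shared owners out of 5 distinct owners.
print(jaccard(owners_game_x, owners_game_y))
```

Only set membership enters the formula, so a user with 0.1 hours and a user with 1000 hours in a game count the same.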

For the hours played (the ```Actions``` column), the data is heavily skewed and has a large range. There are different ways to normalize it. Since we are going to use it to represent how a user ranks the games among their own purchases, let's convert it to the percentage of a user's total hours spent on each game they own.

But before doing that, notice that there are 128804 purchase records and only 70489 play records, meaning some users bought certain games but never played them. We need a score for those less-favoured games too. To differentiate them from games that a user has never bought, we can impute the missing hours with a small value below the minimum of 0.1, say 0.05. Then we can move on to calculate the percentage as planned.
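
The imputation and per-user normalization described above can be sketched on a toy frame. The column name `hours` and the values are made up for illustration (in the notebook the hours live in ```Actions```), and the result is expressed as a fraction; multiply by 100 for a percentage.

```python
import numpy as np
import pandas as pd

# Made-up records: user 1 bought game "B" but never played it.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "game": ["A", "B", "C", "A", "D"],
    "hours": [9.95, np.nan, 10.0, 3.0, 1.0],
})

# Bought-but-unplayed games get a small placeholder below the 0.1 minimum.
toy["hours"] = toy["hours"].fillna(0.05)

# Each game's share of its owner's total hours; shares sum to 1 per user.
toy["share"] = toy["hours"] / toy.groupby("userId")["hours"].transform("sum")
print(toy)
```

Using `transform("sum")` keeps the result aligned with the original rows, so no merge back onto the frame is needed.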

In [62]:
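# purchase is constant (1.0), so it carries no information
data_copy = data_copy.drop("purchase", axis=1)
# Bought-but-never-played games: placeholder hours of 0.05 and a play flag of 0
data_copy["Actions"] = data_copy["Actions"].fillna(0.05)
data_copy["play"] = data_copy["play"].fillna(0)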

In case the dataframe contains duplicated rows, we drop them so that every remaining row is distinct.

In [67]:
# Remove duplicate rows, keeping the first occurrence of each
data_copy = data_copy.drop_duplicates(keep="first")
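
As a quick sanity check, one can count how many rows are exact duplicates before dropping them. The sketch below uses a made-up three-row frame rather than the Steam data.

```python
import pandas as pd

# Made-up frame in which the second row repeats the first.
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "game": ["A", "A", "B"],
    "hours": [2.0, 2.0, 5.0],
})

# duplicated() flags every row that is identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats and renumber the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(deduped))
```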