## Advanced Cleaning on Steam Games Data

In this notebook we try out two different recommenders on the [Steam dataset on Kaggle](https://www.kaggle.com/tamber/steam-video-games). We will use:
- ```numpy, pandas, sklearn``` for data preprocessing

The dataset has no header but comes in 5 columns:
- User ID: integer
- Game title: string
- Activity: string (purchase/play)
- Status: float (1.0 if activity is purchase, total number of hours in game if activity is play)
- A column full of 0's, which we will discard
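
As a minimal sketch of how a headerless file with this layout could be loaded, the snippet below parses a few made-up rows with `pandas.read_csv`. The column names (`userId`, `gameName`, `activity`, `status`, `zero`) and the sample rows are illustrative assumptions, not taken from the Kaggle file.

```python
import io
import pandas as pd

# Made-up rows in the five-column layout described above (not real data).
raw = io.StringIO(
    "1001,Game A,purchase,1.0,0\n"
    "1001,Game A,play,12.5,0\n"
    "1002,Game B,purchase,1.0,0\n"
)

# Hypothetical column names; the file itself ships without a header.
cols = ["userId", "gameName", "activity", "status", "zero"]
sample = pd.read_csv(raw, header=None, names=cols)

# Discard the trailing all-zero column.
sample = sample.drop(columns="zero")
print(sample)
```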

In [1]:
import pandas as pd
import numpy as np 
import matplotlib
# import matplotlib.pyplot as plt 
import seaborn as sns
import turicreate
import sklearn as sk
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

In [2]:
steam_data = pd.read_pickle("clean_steam_data.pkl")
# work on an independent copy so that the loaded frame stays untouched
data_copy = steam_data.copy()
data_copy.head()

### We can easily get basic information about the data in each column

In [3]:
data_copy.describe()

| | userId | purchase | play | Actions |
|---|---|---|---|---|
| count | 129535.0 | 129535.0 | 70785.0 | 70785.0 |
| mean | 102441000.0 | 1.0 | 1.0 | 48.770761 |
| std | 72362060.0 | 0.0 | 0.0 | 228.927258 |
| min | 5250.0 | 1.0 | 1.0 | 0.1 |
| 25% | 45483460.0 | 1.0 | 1.0 | 1.0 |
| 50% | 86055700.0 | 1.0 | 1.0 | 4.5 |
| 75% | 154230700.0 | 1.0 | 1.0 | 19.1 |
| max | 309903100.0 | 1.0 | 1.0 | 11754.0 |


### Some observations

The summary above shows that ```purchase``` takes only a single value, 1.0. We will **not** include purchase status when building our recommender, for two reasons:
1. The number of hours played is more interesting to look at: it is implicit feedback on the user's preference for the games he/she has purchased.
2. If we use [Jaccard similarity](https://apple.github.io/turicreate/docs/api/generated/turicreate.recommender.item_similarity_recommender.ItemSimilarityRecommender.html) to measure the similarity between two sets of elements, the number of hours played is already treated as a binary purchase status (1/0).
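
To illustrate the second point, here is a minimal sketch of the textbook Jaccard similarity between two games, each represented by the set of users who own it. The user IDs and hours are made up, and this is not Turi Create's implementation; it only shows that the measure depends on who owns a game, not on how many hours were played.

```python
def jaccard(owners_a, owners_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets of user IDs."""
    a, b = set(owners_a), set(owners_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical hours played per user for two games.
hours_game_x = {1: 0.5, 2: 300.0, 3: 12.0}
hours_game_y = {2: 1.0, 3: 45.0, 4: 7.5}

# Only the keys (who owns the game) enter the computation; the hours are ignored.
sim = jaccard(hours_game_x.keys(), hours_game_y.keys())
print(sim)  # 2 shared owners out of 4 distinct owners
```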

The hours played, stored in the ```Actions``` column, are heavily skewed and span a large range (0.1 to 11754). There are different ways to normalize them. Since we are going to use this value to represent the user's ranking of preference within his/her own purchases, let's convert it to the percentage of hours each user spends on each game he/she owns.

But before doing that, notice that the summary counts 129535 purchase records and only 70785 play records, meaning some users bought certain games but never played them. We need a score for those less-favoured games too. To differentiate them from games that a user has never bought, we impute the missing hours with a small value below the minimum of 0.1, say 0.05. Then we can move on to calculate the percentage as planned.

In [4]:
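data_copy.drop("purchase", axis=1, inplace=True)
# hours played: purchased-but-never-played games get a small value below the 0.1 minimum
data_copy["Actions"] = data_copy["Actions"].fillna(0.05)
# play flag: 0 marks games that were never played
data_copy["play"] = data_copy["play"].fillna(0)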
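
On a toy frame the effect of this imputation looks as follows. The rows are invented; the point is only that a purchased-but-unplayed game ends up with 0.05 hours, just below the smallest recorded playtime.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "gameName": ["Game A", "Game B", "Game A"],
    "play": [1.0, np.nan, 1.0],        # NaN: bought but never played
    "Actions": [10.0, np.nan, 0.1],    # hours played
})

# Assign the result back instead of calling fillna(inplace=True) on a column.
toy["Actions"] = toy["Actions"].fillna(0.05)
toy["play"] = toy["play"].fillna(0)
print(toy)
```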


Just in case there are duplicated rows in the dataframe, we call ```drop_duplicates``` to ensure that every row is distinct.

In [5]:
#remove duplicates 
data_copy = data_copy.drop_duplicates(subset=None,keep="first")

Besides, the same userId and gameName pair can still appear in more than one row. This is probably a mistake from the data collection stage, where the hours were not fully merged and accumulated. We sum the hours across such rows to correct this.

In [6]:
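# rows whose (userId, gameName) pair occurs more than once
dup_mask = data_copy.duplicated(subset=['userId', 'gameName'], keep=False)
dp = data_copy[dup_mask].groupby(['userId', 'gameName'], as_index=False)['Actions'].sum()
print(dp)
# give every row the summed hours of its (userId, gameName) pair;
# transform returns a result aligned to data_copy's own index
data_copy['Actions'] = data_copy.groupby(['userId', 'gameName'])['Actions'].transform('sum')
data_copy = data_copy.drop_duplicates(subset=['userId', 'gameName'], keep="first")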

       userId                                      gameName  Actions
0    28472068                          Grand Theft Auto III      0.5
1    28472068                  Grand Theft Auto San Andreas      0.9
2    28472068                    Grand Theft Auto Vice City      5.7
3    33865373                   Sid Meier's Civilization IV    137.0
4    50769696                  Grand Theft Auto San Andreas     14.0
5    59925638                       Tom Clancy's H.A.W.X. 2      7.4
6    71411882                          Grand Theft Auto III      1.3
7    71510748                  Grand Theft Auto San Andreas      0.8
8   118664413                  Grand Theft Auto San Andreas      2.1
9   148362155                  Grand Theft Auto San Andreas     26.3
10  176261926                   Sid Meier's Civilization IV     14.8
11  176261926  Sid Meier's Civilization IV Beyond the Sword    564.4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[col] = expressions.where(mask, this, that)
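
The merge of repeated (userId, gameName) pairs can be sketched on a toy frame with `groupby(...).transform('sum')`, which returns a result aligned to the original index. The rows are invented, and this is one possible way to do the merge, not necessarily what produced the output above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "userId": [7, 7, 8],
        "gameName": ["Game A", "Game A", "Game B"],
        "Actions": [0.2, 0.3, 5.0],
    },
    index=[40, 41, 42],  # non-default index, as left behind by earlier filtering
)

keys = ["userId", "gameName"]
# Every row receives the total hours of its (userId, gameName) pair.
toy["Actions"] = toy.groupby(keys)["Actions"].transform("sum")
# Keep one row per pair.
merged = toy.drop_duplicates(subset=keys, keep="first")
print(merged)
```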


### Replace actual hours played with percentage of hours on a game for each user

In [7]:
# divide each user's hours by that user's total hours, giving a per-user share
data_copy['Actions'] = data_copy['Actions'] / data_copy.groupby('userId')['Actions'].transform('sum')
# convert userId to int
data_copy['userId'] = data_copy['userId'].astype(int)
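
A small sketch of the per-user normalization on invented rows, using `groupby(...).transform('sum')` to get each user's total hours. The `share` column name is only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "gameName": ["Game A", "Game B", "Game C", "Game A"],
    "Actions": [30.0, 9.95, 0.05, 4.0],   # hours; 0.05 marks an unplayed purchase
})

# Share of each user's total hours spent on each game.
totals = toy.groupby("userId")["Actions"].transform("sum")
toy["share"] = toy["Actions"] / totals
print(toy)
```

Note that a user who owns a single game always gets a share of 1.0 for it, regardless of the hours played.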

### Output and save the dataframe

In [8]:
data_copy.to_pickle('clean_steam.pkl')

### Before building the recommender, here is an optional step - indexing the game names with integers
According to the documentation of ```turicreate.recommender.create```, the user ID and item ID columns can be either int or str, so keeping the game title as is should be fine. Here let's take an extra step and index those strings with integers. (Maybe when we feel less lazy we can try spark.ml's recommender on it too.)

Indexing is, however, essential when you build models with Spark's machine learning libraries.

In [9]:
data_copy['gameId'] = data_copy['gameName'].astype('category').cat.codes

### We want to create a lookup table for the gameName and gameId columns

In [10]:
data_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128792 entries, 41879 to 83172
Data columns (total 5 columns):
userId      128792 non-null int64
gameName    128792 non-null object
play        128792 non-null float64
Actions     128792 non-null float64
gameId      128792 non-null int16
dtypes: float64(2), int16(1), int64(1), object(1)
memory usage: 5.2+ MB


In [11]:
game_map = data_copy[['gameName','gameId']].copy().drop_duplicates().sort_values(by=['gameId'])
game_map.head()

| | gameName | gameId |
|---|---|---|
| 40639 | 007 Legends | 0 |
| 72197 | 0RBITALIS | 1 |
| 58190 | 1... 2... 3... KICK IT! (Drop That Beat Like a... | 2 |
| 72198 | 10 Second Ninja | 3 |
| 47824 | 10000000 | 4 |
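
A small sketch of the same encoding on made-up titles: `cat.codes` numbers the distinct values in sorted order, and the lookup table can be turned into a dictionary to map IDs back to names. The titles and the `id_to_name` dictionary are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"gameName": ["Portal", "Half-Life", "Portal", "Doom"]})

# Categories are sorted, so codes follow alphabetical order of the titles.
toy["gameId"] = toy["gameName"].astype("category").cat.codes

# One row per game, ordered by ID.
toy_map = toy[["gameName", "gameId"]].drop_duplicates().sort_values(by="gameId")

# Hypothetical helper: dictionary from integer ID back to the title.
id_to_name = dict(zip(toy_map["gameId"], toy_map["gameName"]))
print(id_to_name)
```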


### Item - Item similarity model in Turicreate

According to the documentation [here](https://apple.github.io/turicreate/docs/userguide/recommender/using-trained-models.html), turicreate by default excludes from its recommendations the items that are already observed for each user. To test the accuracy of the model, it is therefore important to hold out a proportion of each test user's data, splitting the data into train and test sets for validation.
For now, though, let us just feed all the data into our model.

There are 3 options for the similarity type: jaccard, cosine and pearson. In our case, we choose Pearson as the similarity measure.
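
For intuition, the Pearson correlation between two games can be sketched in plain Python over the users who have a score for both. This is the textbook formula on made-up scores; Turi Create's internal computation may differ in details, such as how it handles users without a common score.

```python
from math import sqrt

def pearson(scores_a, scores_b):
    """Pearson correlation of two items over the users who scored both.

    scores_a, scores_b: dicts mapping userId -> score.
    Returns 0.0 when fewer than two users overlap or a variance is zero.
    """
    common = sorted(set(scores_a) & set(scores_b))
    if len(common) < 2:
        return 0.0
    xs = [scores_a[u] for u in common]
    ys = [scores_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical per-user shares of playtime for two games.
game_x = {1: 0.1, 2: 0.5, 3: 0.9}
game_y = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.3}
print(pearson(game_x, game_y))
```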

In [12]:
# create SFrame from DataFrame
from turicreate import SFrame

game_sf = SFrame(data=data_copy)

item_item_reco = turicreate.recommender.item_similarity_recommender.create(
    game_sf, user_id='userId', item_id='gameId',
    target="Actions", similarity_type='pearson')

In [13]:
# Get the k most similar items for each item in items. Default number is 10.
similar_games = item_item_reco.get_similar_items()

In [14]:
# for example, to check some results, attach the game names to both sides of each pair
game_map_sframe = turicreate.SFrame(game_map)

similar_games_named = similar_games.join(game_map_sframe, on="gameId", how="left") \
    .join(game_map_sframe, on={"similar": "gameId"}, how="left") \
    .rename({"gameName.1": "similar_game"}, True) \
    .select_columns(["gameId", "gameName", "similar", "similar_game", "score", "rank"])

# print the result of one game for testing
a = similar_games_named[similar_games_named['gameName'] == "Half-Life 2 Episode One"]
a.print_rows(num_rows=10, num_columns=6)


+--------+-------------------------+---------+-------------------------------+
| gameId |         gameName        | similar |          similar_game         |
+--------+-------------------------+---------+-------------------------------+
|  2072  | Half-Life 2 Episode One |   2070  |          Half-Life 2          |
|  2072  | Half-Life 2 Episode One |   2077  |  Half-Life Deathmatch Source  |
|  2072  | Half-Life 2 Episode One |   2073  |    Half-Life 2 Episode Two    |
|  2072  | Half-Life 2 Episode One |   2074  |     Half-Life 2 Lost Coast    |
|  2072  | Half-Life 2 Episode One |   3627  |       SEGA Bass Fishing       |
|  2072  | Half-Life 2 Episode One |   1003  |           Crazy Taxi          |
|  2072  | Half-Life 2 Episode One |   480   | Battlefield Bad Company 2 ... |
|  2072  | Half-Life 2 Episode One |   3222  |             Portal            |
|  2072  | Half-Life 2 Episode One |   2076  |      Half-Life Blue Shift     |
|  2072  | Half-Life 2 Episode One |   1073  |      

### Let's pick a user and see what recommendations he / she gets:

In [45]:
rec_result = item_item_reco.recommend(diversity=3, random_seed=42) \
    .join(game_map_sframe, on="gameId", how="left")
print(rec_result)

+--------+--------+--------------------+------+-------------------------------+
| userId | gameId |       score        | rank |            gameName           |
+--------+--------+--------------------+------+-------------------------------+
|  5250  |  1705  |        1.0         |  1   |    Fast & Furious Showdown    |
|  5250  |  831   |        1.0         |  2   |       Chip's Challenge 2      |
|  5250  |  4934  |        1.0         |  3   |     Warrior Kings Battles     |
|  5250  |  1271  |        1.0         |  4   |          Diaper Dash          |
|  5250  |  1607  |        1.0         |  5   |        FIFA Manager 10        |
|  5250  |  1837  |        1.0         |  6   |         Funk of Titans        |
|  5250  |  4013  | 0.9385474860335195 |  7   |    Star Traders 4X Empires    |
|  5250  |  3609  | 0.9222188130095555 |  8   |        Runestone Keeper       |
|  5250  |  714   |        0.7         |  9   |  Cabela's Hunting Expeditions |
|  5250  |  2535  | 0.696555909573661  |

### To better estimate the accuracy of our model, we split the data into training and testing sets and use "precision_recall" as our metric
Check [here](https://apple.github.io/turicreate/docs/api/generated/turicreate.recommender.item_similarity_recommender.ItemSimilarityRecommender.evaluate.html#turicreate.recommender.item_similarity_recommender.ItemSimilarityRecommender.evaluate) for more info.

In [46]:
game_sf2 = SFrame(data=data_copy)

# The test dataset is generated by first choosing max_num_users out of the total number of users in the dataset.
# Then, for each of the chosen test users, a portion of the user’s items (determined by item_test_proportion)
# is randomly chosen to be included in the test set.
# This split allows the training data to retain enough information about the users in the test set,
# so that adequate recommendations can be made.
# The total number of users in the test set may be fewer than max_num_users if a user was chosen
# for the test set but none of their items are selected.

train_sframe, test_sframe = turicreate.recommender.util.random_split_by_user(game_sf2, user_id="userId", item_id="gameId", max_num_users=500)
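
The splitting scheme described in the comments above can be sketched in plain Python. This is an illustrative re-implementation on made-up (user, item) pairs, not Turi Create's code; the function name `split_by_user` and its defaults are assumptions.

```python
import random

def split_by_user(rows, max_num_users=2, item_test_proportion=0.5, seed=0):
    """Split (user, item) pairs into train and test lists.

    Up to max_num_users users are sampled; each of their rows goes to the
    test set with probability item_test_proportion. All other rows stay in train.
    """
    rng = random.Random(seed)
    users = sorted({user for user, _ in rows})
    test_users = set(rng.sample(users, min(max_num_users, len(users))))
    train, test = [], []
    for user, item in rows:
        if user in test_users and rng.random() < item_test_proportion:
            test.append((user, item))
        else:
            train.append((user, item))
    return train, test

# Made-up interactions: three users, four games each.
rows = [(u, g) for u in (1, 2, 3) for g in ("A", "B", "C", "D")]
train, test = split_by_user(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two lists, and users outside the sampled group keep all of their rows in the training set.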

In [69]:
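from turicreate.toolkits.recommender.util import precision_recall_by_user

# NOTE: the model here is fit on the full data (game_sf2), not on train_sframe.
# recommend() skips items a user already has in the training data, so held-out
# test items can never be recommended; fit on train_sframe for a meaningful score.
item_item_reco = turicreate.recommender.item_similarity_recommender.create(
    game_sf2, user_id='userId', item_id='gameId',
    target="Actions", similarity_type='pearson')
rec = item_item_reco.recommend(diversity=3, random_seed=42)
rec = rec.join(game_map_sframe, on="gameId", how="left")

# per-user precision and recall of the recommendations against the held-out items
result = precision_recall_by_user(test_sframe, rec)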
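
Precision and recall at a cutoff k can be sketched for a single user as follows. The item IDs are made up, and the helper `precision_recall_at_k` is a hypothetical name for illustration; it follows the usual definitions rather than reproducing Turi Create's code.

```python
def precision_recall_at_k(recommended, held_out, k=10):
    """Return (precision, recall) of the top-k recommendations.

    recommended: ranked list of item IDs, best first.
    held_out: collection of the user's test-set item IDs.
    """
    top_k = recommended[:k]
    relevant = set(held_out)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and held-out items for one user.
recs = [11, 42, 7, 19, 3]
test_items = [42, 3, 99, 100]
print(precision_recall_at_k(recs, test_items, k=5))
```

If none of a user's held-out items appear among the recommendations, both values are 0.0 for that user.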

In [74]:
# look at the recommendations for user 8585433

# Filtering
filter_sf = rec[rec['userId'] == 8585433]
# Displaying
filter_sf[['userId', 'gameName', 'score']].print_rows(num_rows=21, num_columns=3)

+---------+-------------------------------+--------------------+
|  userId |            gameName           |       score        |
+---------+-------------------------------+--------------------+
| 8585433 |       Chip's Challenge 2      |        1.0         |
| 8585433 |          Diaper Dash          |        1.0         |
| 8585433 |            Insane 2           |        1.0         |
| 8585433 |           Jack Keane          |        1.0         |
| 8585433 | TyranoBuilder Visual Novel... |        1.0         |
| 8585433 |         Funk of Titans        |        1.0         |
| 8585433 | NOBUNAGA'S AMBITION Kakush... | 0.975877192982456  |
| 8585433 |    Eastside Hockey Manager    | 0.9458077709611452 |
| 8585433 |           Monochroma          |        0.75        |
| 8585433 |     Football Manager 2010     | 0.5283422527644849 |
+---------+-------------------------------+--------------------+
[10 rows x 3 columns]



In [75]:
# Filtering
test_sf = test_sframe[test_sframe['userId'] == 8585433]
# Displaying
test_sf.materialize()
test_sf[['userId', 'gameName']].print_rows(num_rows=21, num_columns=2)

+---------+-------------------------------+
|  userId |            gameName           |
+---------+-------------------------------+
| 8585433 |             VVVVVV            |
| 8585433 |         Super Meat Boy        |
| 8585433 | Sang-Froid - Tales of Were... |
| 8585433 |    Ori and the Blind Forest   |
| 8585433 | Rising Storm/Red Orchestra... |
| 8585433 |             Saira             |
| 8585433 |  The Walking Dead Season Two  |
| 8585433 |       Shadowrun Returns       |
| 8585433 |            Distance           |
| 8585433 |         Mirror's Edge         |
| 8585433 |    Crysis 2 Maximum Edition   |
| 8585433 |         SteamWorld Dig        |
| 8585433 | Never Alone (Kisima Ingitc... |
| 8585433 | The Misadventures of P.B. ... |
| 8585433 |         Secret Ponchos        |
| 8585433 |            Spelunky           |
| 8585433 |     Euro Truck Simulator 2    |
| 8585433 |       Tabletop Simulator      |
| 8585433 | Sword of the Stars The Pit... |
| 8585433 |       Surgeon Simula

In [20]:
# obviously, for those users the precision will be 0 because there is no data about those users in test_sframe

In [76]:
result = test_sframe.join(result, on="userId", how="left") \
   .select_columns(["userId","gameId", "gameName", "precision", "recall", "count"])

In [77]:
result.print_rows(num_rows=100, num_columns=6)

+----------+--------+-------------------------------+-----------+--------+-------+
|  userId  | gameId |            gameName           | precision | recall | count |
+----------+--------+-------------------------------+-----------+--------+-------+
| 1024319  |  2078  |    Half-Life Opposing Force   |    0.0    |  0.0   |   1   |
| 1364546  |  1179  |       Deathmatch Classic      |    0.0    |  0.0   |   1   |
| 4325465  |  1179  |       Deathmatch Classic      |    0.0    |  0.0   |   1   |
| 8542204  |  2712  |           Metro 2033          |    0.0    |  0.0   |   9   |
| 8542204  |  4527  |  The Witcher Enhanced Edition |    0.0    |  0.0   |   9   |
| 8542204  |  1978  |       Grand Theft Auto V      |    0.0    |  0.0   |   9   |
| 8542204  |  4343  |            The Crew           |    0.0    |  0.0   |   9   |
| 8542204  |  726   |     Call of Duty Black Ops    |    0.0    |  0.0   |   9   |
| 8542204  |  3223  |            Portal 2           |    0.0    |  0.0   |   9   |
| 85