## Advanced Cleaning on Steam Games Data

In this notebook we clean the [Steam dataset on Kaggle](https://www.kaggle.com/tamber/steam-video-games) ahead of trying out two different recommenders on it. We will use:
- ```numpy, pandas, sklearn``` for data preprocessing

The dataset has no header but comes in 5 columns:
- User ID: integer
- Game title: string
- Activity: string (purchase/play)
- Status: float (1.0 if activity is purchase, total number of hours in game if activity is play)
- A column full of 0's, which we will discard
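
As a minimal sketch of how a headerless file with this layout could be read, the snippet below parses three made-up rows from an in-memory buffer. The column names (`userId`, `gameName`, `activity`, `status`, `unused`) are our own assumption, since the file itself carries none, and the rows are invented for illustration.

```python
import io
import pandas as pd

# Three invented rows in the raw layout described above (no header, 5 columns).
raw = io.StringIO(
    "1,Game A,purchase,1.0,0\n"
    "1,Game A,play,273.0,0\n"
    "2,Game B,purchase,1.0,0\n"
)

# Hypothetical column names; the real file has no header row.
columns = ["userId", "gameName", "activity", "status", "unused"]
raw_data = pd.read_csv(raw, header=None, names=columns)

# Discard the all-zero fifth column.
raw_data = raw_data.drop(columns="unused")
print(raw_data.dtypes)
```

With a real file, the `io.StringIO` buffer would be replaced by the path to the downloaded CSV.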

In [1]:
import pandas as pd
import numpy as np 
import matplotlib
# import matplotlib.pyplot as plt 
import seaborn as sns
import turicreate
import sklearn as sk
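# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute is its replacement.
from sklearn.impute import SimpleImputer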
from sklearn.model_selection import train_test_split

In [2]:
steam_data = pd.read_pickle("clean_steam_data.pkl")
# Work on an independent copy so the loaded frame stays untouched.
data_copy = steam_data.copy()
data_copy.head()

### We can easily get basic information about the data in each column

In [3]:
data_copy.describe()

| | userId | purchase | play | Actions |
|---|---|---|---|---|
| count | 129535.0 | 129535.0 | 70785.0 | 70785.0 |
| mean | 102441000.0 | 1.0 | 1.0 | 48.770761 |
| std | 72362060.0 | 0.0 | 0.0 | 228.927258 |
| min | 5250.0 | 1.0 | 1.0 | 0.1 |
| 25% | 45483460.0 | 1.0 | 1.0 | 1.0 |
| 50% | 86055700.0 | 1.0 | 1.0 | 4.5 |
| 75% | 154230700.0 | 1.0 | 1.0 | 19.1 |
| max | 309903100.0 | 1.0 | 1.0 | 11754.0 |


### Some observations

The summary above shows that ```purchase``` takes only a single value, 1.0. We will **not** include purchase status in building our recommender, for two reasons:
1. The number of hours played is more interesting to look at: it is implicit feedback on the user's preference among the games he/she has purchased.
2. When [Jaccard similarity](https://apple.github.io/turicreate/docs/api/generated/turicreate.recommender.item_similarity_recommender.ItemSimilarityRecommender.html) is used to measure the similarity between two sets of elements, the number of hours played is treated as a binary purchase status (1/0) anyway.
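
To make the second point concrete, here is a small hand-rolled sketch of Jaccard similarity between two games, computed purely from the sets of users who own them. The `jaccard` helper and the toy data are our own illustration, not Turi Create's implementation; note that the `hours` column never enters the calculation.

```python
import pandas as pd

# Toy ownership data: hours differ wildly but play no role in Jaccard similarity.
toy = pd.DataFrame({
    "userId":   [1, 1, 2, 2, 3],
    "gameName": ["A", "B", "A", "B", "A"],
    "hours":    [100.0, 0.5, 2.0, 300.0, 7.0],
})

def jaccard(item_a, item_b, frame):
    """Size of the intersection over size of the union of the two owner sets."""
    users_a = set(frame.loc[frame["gameName"] == item_a, "userId"])
    users_b = set(frame.loc[frame["gameName"] == item_b, "userId"])
    return len(users_a & users_b) / len(users_a | users_b)

# Game A is owned by users {1, 2, 3}, game B by {1, 2}: 2 shared out of 3 total.
score = jaccard("A", "B", toy)
print(score)
```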

For the hours played (the ```Actions``` column in the summary above), the data is heavily skewed and has a large range. There are different ways to normalize it. Since we are going to use it to represent the user's ranking of preference within his / her own purchases, let's convert it to the percentage of hours each user spends on each game he / she owns.

But before doing that, notice that there are 129535 purchase records and only 70785 play records, meaning some users bought certain games but never played them. We need a score for those less-favoured games too. To differentiate them from games that a user has never bought, we can impute the missing ```Actions``` values with a small value below the minimum of 0.1, say 0.05 (and set the missing ```play``` flags to 0). Then we can move on to calculating the percentage as planned.

In [4]:
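data_copy = data_copy.drop("purchase", axis=1)
# Assign the filled columns back instead of calling fillna(inplace=True) on a
# column selection, which may act on a temporary copy and leave the frame unchanged.
data_copy["Actions"] = data_copy["Actions"].fillna(0.05)
data_copy["play"] = data_copy["play"].fillna(0)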
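
The effect of the imputation can be sketched on a toy frame (invented values, same ```Actions``` column name as above): a bought-but-never-played game ends up ranked just below the least-played one, rather than being indistinguishable from a game the user does not own.

```python
import numpy as np
import pandas as pd

# One user, three owned games; game B was bought but never played (NaN hours).
toy = pd.DataFrame({
    "userId":   [1, 1, 1],
    "gameName": ["A", "B", "C"],
    "Actions":  [10.0, np.nan, 0.1],
})

# 0.05 sits below the smallest recorded play time (0.1) but above zero.
toy["Actions"] = toy["Actions"].fillna(0.05)

# Ascending order of preference: B (unplayed), then C, then A.
ranking = toy.sort_values("Actions")["gameName"].tolist()
print(ranking)
```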

In case the dataframe contains duplicated rows, we drop them so that every row is distinct.

In [5]:
#remove duplicates 
data_copy = data_copy.drop_duplicates(subset=None,keep="first")

Besides exact duplicates, the same userId and gameName pair can appear in more than one row. This is probably a data-collection error in which the hours were not fully merged and accumulated, so we correct it by summing the hours for each such pair.

In [6]:
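# Rows whose (userId, gameName) pair occurs more than once, with hours summed per pair.
dup_mask = data_copy.duplicated(subset=['userId','gameName'], keep=False)
dp = data_copy[dup_mask].groupby(['userId','gameName'], as_index=False)['Actions'].sum()
print(dp)
# Write the summed hours onto every row of each pair. transform('sum') keeps the
# original index, whereas update(dp) aligns on dp's fresh 0..n index and would
# overwrite unrelated rows.
summed = data_copy.groupby(['userId','gameName'])['Actions'].transform('sum')
data_copy = data_copy.assign(Actions=summed)
data_copy = data_copy.drop_duplicates(subset=['userId','gameName'], keep="first")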

       userId                                      gameName  Actions
0    28472068                          Grand Theft Auto III      0.5
1    28472068                  Grand Theft Auto San Andreas      0.9
2    28472068                    Grand Theft Auto Vice City      5.7
3    33865373                   Sid Meier's Civilization IV    137.0
4    50769696                  Grand Theft Auto San Andreas     14.0
5    59925638                       Tom Clancy's H.A.W.X. 2      7.4
6    71411882                          Grand Theft Auto III      1.3
7    71510748                  Grand Theft Auto San Andreas      0.8
8   118664413                  Grand Theft Auto San Andreas      2.1
9   148362155                  Grand Theft Auto San Andreas     26.3
10  176261926                   Sid Meier's Civilization IV     14.8
11  176261926  Sid Meier's Civilization IV Beyond the Sword    564.4
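
One index-safe way to merge such repeated pairs can be sketched on toy data (invented values, with a deliberately non-default index). `DataFrame.update` aligns on index labels, so feeding it a grouped frame with a fresh 0..n index risks touching the wrong rows; `groupby(...).transform('sum')` returns a result aligned with the original rows instead.

```python
import pandas as pd

# User 1 has two partial records for game A; the index is deliberately not 0..n.
toy = pd.DataFrame(
    {
        "userId":   [1, 1, 2],
        "gameName": ["A", "A", "B"],
        "Actions":  [0.5, 2.0, 4.0],
    },
    index=[10, 11, 12],
)

# Every row of a (userId, gameName) pair receives the pair's total hours.
totals = toy.groupby(["userId", "gameName"])["Actions"].transform("sum")
toy = toy.assign(Actions=totals)

# Keep a single row per pair; it already carries the accumulated hours.
toy = toy.drop_duplicates(subset=["userId", "gameName"], keep="first")
print(toy)
```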




### Replace actual hours played with percentage of hours on a game for each user

In [7]:
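# Each user's hours on a game divided by that user's total hours (a fraction in (0, 1]).
# transform('sum') stays aligned with the original rows on every pandas version,
# unlike apply() followed by update().
user_totals = data_copy.groupby('userId')['Actions'].transform('sum')
data_copy = data_copy.assign(Actions=data_copy['Actions'] / user_totals)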
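
A toy illustration of this per-user normalization (invented values; the `share` column name is our own): each user's shares sum to 1 regardless of how many total hours that user has logged, which is what makes heavy and light players comparable.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId":   [1, 1, 2, 2],
    "gameName": ["A", "B", "A", "C"],
    "Actions":  [9.0, 1.0, 3.0, 1.0],
})

# Divide each row's hours by the total hours of the user it belongs to.
toy["share"] = toy["Actions"] / toy.groupby("userId")["Actions"].transform("sum")
print(toy)

# Sanity check: the shares of every user add up to 1.
per_user = toy.groupby("userId")["share"].sum()
```

Multiplying `share` by 100 would give the literal percentage; the ordering of games within a user is the same either way.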


### Output and save the dataframe

In [8]:
data_copy.to_pickle('clean_steam.pkl')

### Before building the recommender, here is an optional step - indexing the game names with integers
According to the documentation for turicreate.recommender.create, the user ID and item ID columns can be either int or str, so keeping the game title as is should be fine. Here we take the extra step of indexing those strings with integers. (Maybe when we feel less lazy we can try spark.ml's recommender on it too.)

Indexing is essential, however, when you build models with Spark's machine learning libraries.

In [20]:
data_copy['gameId']= data_copy.gameName.astype('category').cat.codes
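
As a small sketch of what `cat.codes` produces (toy titles, not taken from the dataset): for string data the categories are sorted, so the codes follow the sorted order of the distinct titles, and repeated titles share one code.

```python
import pandas as pd

names = pd.Series(["Dota 2", "Portal", "Dota 2", "Half-Life"])

# Sorted categories: "Dota 2" -> 0, "Half-Life" -> 1, "Portal" -> 2.
as_category = names.astype("category")
codes = as_category.cat.codes

print(list(as_category.cat.categories))
print(codes.tolist())
```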

### We want to create a lookup table for the gameName and gameId columns

In [21]:
data_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 128792 entries, 41879 to 83172
Data columns (total 6 columns):
userId      128792 non-null float64
gameName    128792 non-null object
play        128792 non-null float64
Actions     128792 non-null float64
gameID      128792 non-null int16
gameId      128792 non-null int16
dtypes: float64(3), int16(2), object(1)
memory usage: 5.4+ MB


In [31]:
game_map = data_copy[['gameName','gameId']].copy().drop_duplicates().sort_values(by=['gameId'])
game_map.head()

| | gameName | gameId |
|---|---|---|
| 40639 | 007 Legends | 0 |
| 72197 | 0RBITALIS | 1 |
| 58190 | 1... 2... 3... KICK IT! (Drop That Beat Like a... | 2 |
| 72198 | 10 Second Ninja | 3 |
| 47824 | 10000000 | 4 |
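
A lookup table like this is typically used to translate integer IDs back into titles once a model returns recommendations. The sketch below rebuilds a tiny version of the table on toy data and turns it into a plain dictionary; the `id_to_name` name is our own.

```python
import pandas as pd

toy = pd.DataFrame({"gameName": ["Portal", "Dota 2", "Portal"]})
toy["gameId"] = toy["gameName"].astype("category").cat.codes

# One row per distinct game, ordered by its integer ID.
game_map = toy[["gameName", "gameId"]].drop_duplicates().sort_values(by="gameId")

# Plain dict for ID -> title lookups; tolist() converts the codes to Python ints.
id_to_name = dict(zip(game_map["gameId"].tolist(), game_map["gameName"].tolist()))
print(id_to_name)
```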


