## Initial Considerations

1. Is the data better suited to content-based or collaborative-filtering recommendations?
2. How can we represent the data appropriately? Possibly as aggregated matrices.


Potential pitfalls:
- Mistyped game titles, for example 'Dota2' vs. 'Dota 2'.
- Data normalization: hours played is used as a direct proxy for enjoyment, so games with very high hour counts could skew the output.
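
To illustrate the first pitfall, here is a minimal sketch of how near-duplicate titles might be flagged. The helper names `title_key` and `find_title_variants` are hypothetical, and the sample titles are made up. Collapsing punctuation and spacing can also merge titles that are genuinely different, so the result is best treated as a list of candidates for manual review.

```python
import re
from collections import defaultdict


def title_key(title):
    """Collapse a title to lowercase letters and digits only."""
    return re.sub(r"[^a-z0-9]+", "", title.lower())


def find_title_variants(titles):
    """Group distinct spellings that share the same normalized key."""
    groups = defaultdict(set)
    for title in titles:
        groups[title_key(title)].add(title)
    # Keep only keys that have more than one spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Made-up sample; in the notebook this could be data['game_title'].unique().
sample_titles = ["Dota 2", "Dota2", "Team Fortress 2", "dota 2 ", "Portal"]
variants = find_title_variants(sample_titles)
print(variants)
```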

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff

## Part 1: Data Discovery

Objectives:
- Ordered ranking of all game titles based on how many purchases made for each title.
- Ordered ranking of all game titles based on total hours played between all users.
- Ordered ranking of all game titles based on the average number of hours played by users who played the game after purchasing it. 

**Note**: When a user purchases a title they will have a record where the activity field shows "purchase" and the hours field shows "1.0".  However, a user will only have a record where the activity field is "play" if they have actually played the game after purchasing it, in which case the hours field will indicate the number of hours played.
The output of your code should be saved in clearly named files that can be submitted with your source code.

In [4]:
data = pd.read_csv('steam-200k_trimmed.csv')

In [5]:
data.head()

Unnamed: 0,user_id,game_title,activity,hours
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
1,151603712,The Elder Scrolls V Skyrim,play,273.0
2,151603712,Fallout 4,purchase,1.0
3,151603712,Fallout 4,play,87.0
4,151603712,Spore,purchase,1.0


What does the data look like overall?

In [49]:
# Named `summary` rather than `dict` to avoid shadowing the built-in type
summary = {
    'Unique User IDs': data['user_id'].nunique(),
    'Unique Game Titles': data['game_title'].nunique(),
    'Unique Activity Types': data['activity'].unique().tolist(),
}
# Build the one-row frame directly; DataFrame.append is deprecated in favor of pandas.concat
overview = pd.DataFrame([summary])
print(overview)

   Unique User IDs  Unique Game Titles Unique Activity Types
0            12393                5155      [purchase, play]



1. Ordered ranking of all game titles based on how many purchases were made for each title.

In [8]:
Purchases = data[data['activity']=='purchase'].groupby('game_title').activity.agg('count').sort_values(ascending = False).reset_index()

In [9]:
Purchases_df = pd.DataFrame(Purchases)

In [10]:
Purchases_df.head()

Unnamed: 0,game_title,activity
0,Dota 2,4841
1,Team Fortress 2,2323
2,Unturned,1563
3,Counter-Strike Global Offensive,1412
4,Half-Life 2 Lost Coast,981


In [51]:
# to_csv returns None when given a file path, so the result is not assigned
Purchases_df.to_csv('Purchase_count_per_game.csv')

In [11]:
def plot_distribution():
    # The 'activity' column holds the purchase count per title
    activity_data = [Purchases_df['activity']]  # create_distplot expects a list of series
    group_labels = ['distplot']  # name of the dataset

    fig = ff.create_distplot(activity_data, group_labels)
    fig.show()


plot_distribution()

2. Ordered ranking of all game titles based on total hours played between all users. Filtered to remove the 1 hour count for purchase.

In [12]:
Hours = data[data['activity']=='play'].groupby('game_title').hours.agg('sum').sort_values(ascending = False).reset_index()

In [13]:
Hours_df = pd.DataFrame(Hours)
Hours_df.head()

Unnamed: 0,game_title,hours
0,Dota 2,981684.6
1,Counter-Strike Global Offensive,322771.6
2,Team Fortress 2,173673.3
3,Counter-Strike,134261.1
4,Sid Meier's Civilization V,99821.3


In [54]:
# to_csv returns None when given a file path, so the result is not assigned
Hours_df.to_csv('Hours_played_per_game.csv')

3. Ordered ranking of all game titles based on the average number of hours played by users who played the game after purchasing it.

Note, the wording on this requirement made it seem as though a user could play a game _without_ purchasing it. I did not QC this, but that could be a differentiator in the output.
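
The check that was skipped could look something like the sketch below. It runs on a made-up frame with the same columns as `data`, and the variable names (`toy`, `play_without_purchase`) are illustrative only. It lists every (user, game) pair that has a play record but no matching purchase record.

```python
import pandas as pd

# Toy frame with the same columns as the notebook's `data` (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "game_title": ["Dota 2", "Dota 2", "Spore", "Portal 2", "Spore"],
    "activity": ["purchase", "play", "play", "purchase", "purchase"],
    "hours": [1.0, 12.5, 3.0, 1.0, 1.0],
})

keys = ["user_id", "game_title"]
played = toy.loc[toy["activity"] == "play", keys].drop_duplicates()
purchased = toy.loc[toy["activity"] == "purchase", keys].drop_duplicates()

# A left merge with indicator=True marks play pairs with no purchase as "left_only".
check = played.merge(purchased, on=keys, how="left", indicator=True)
play_without_purchase = check.loc[check["_merge"] == "left_only", keys]

print(play_without_purchase)
```

If the resulting frame is empty on the real data, every play record has a purchase record and the two readings of the requirement give the same ranking.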

In [55]:
Avg_hours = data[data['activity']=='play'].groupby('game_title').hours.agg('mean').sort_values(ascending = False).reset_index()

In [58]:
Avg_Hours_df = pd.DataFrame(Avg_hours)
Avg_Hours_df.head()

Unnamed: 0,game_title,hours
0,Eastside Hockey Manager,1295.0
1,Baldur's Gate II Enhanced Edition,475.255556
2,FIFA Manager 09,411.0
3,Perpetuum,400.975
4,Football Manager 2014,391.984615


In [60]:
# to_csv returns None when given a file path, so the result is not assigned
Avg_Hours_df.to_csv('Avg_hours_played_per_game.csv')

## Part 2: Recommendations generator

Use the data however you see fit in order to create code to generate a list of the "top 5 game recommendations"

We want to be able to generate "Top 5 game recommendations for you" lists for users based on the games that they have already purchased/played.

**Assume**:
- A user expects to enjoy a game when they purchase it
- More hours spent playing a game indicate higher levels of user enjoyment
- Fewer hours of play indicate dissatisfaction

**Output**:
- List of 5 recommendations per user saved in an output file

**Output Requirements**:
- Should not include any titles the user already owns
- If there is not enough data for a user to provide 5 recommendations, the list can be padded with generally popular titles (from section 1 analysis)
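
The two output requirements can be sketched as a small helper. This is a hypothetical function (`pad_recommendations`), not part of the notebook's pipeline, and the inputs are made up apart from the popular titles, which mirror the top of the purchase ranking from Part 1.

```python
def pad_recommendations(recommended, owned, popular, n=5):
    """Return up to n titles: model picks first, then popular titles as filler."""
    owned = set(owned)
    result = []
    for title in list(recommended) + list(popular):
        if title in owned or title in result:
            continue  # skip owned titles and duplicates
        result.append(title)
        if len(result) == n:
            break
    return result


# Hypothetical inputs for a single user.
popular_titles = ["Dota 2", "Team Fortress 2", "Unturned",
                  "Counter-Strike Global Offensive", "Half-Life 2 Lost Coast"]
owned_titles = ["Dota 2", "Spore"]
model_picks = ["Portal 2", "Spore", "Alan Wake"]

padded = pad_recommendations(model_picks, owned_titles, popular_titles)
print(padded)
```

Model picks keep their order and come first; popular titles only fill the remaining slots, and anything the user owns is skipped in both lists.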

Assume that a user expects to enjoy a game when they purchase it, and that more hours spent playing a game indicate a higher level of enjoyment. Note, however, that a low number of hours played likely indicates dissatisfaction, and we want to avoid recommending games that will leave a user similarly dissatisfied.

Recommendation lists should be generated for each distinct user_id. These generated lists should exclude games that the user already owns. If not enough data is available for a user to fully populate their list of 5 recommendations, the list can be padded to 5 titles using games that were found to be generally popular according to the analysis portion above. Please explain your general approach and any significant design/implementation decisions, either through code comments or in a separate document accompanying your source code for the recommendation generator. The generated lists of recommendations should be saved in an output file that can be submitted with your source code.

In [16]:
data.head()

Unnamed: 0,user_id,game_title,activity,hours
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
1,151603712,The Elder Scrolls V Skyrim,play,273.0
2,151603712,Fallout 4,purchase,1.0
3,151603712,Fallout 4,play,87.0
4,151603712,Spore,purchase,1.0


In [17]:
data_play = data[data['activity']=='play']

In [18]:
# `data_play.info` without parentheses only returns the bound method; print the frame instead
print(data_play)

          user_id                  game_title activity  hours
1       151603712  The Elder Scrolls V Skyrim     play  273.0
3       151603712                   Fallout 4     play   87.0
5       151603712                       Spore     play   14.9
7       151603712           Fallout New Vegas     play   12.1
9       151603712               Left 4 Dead 2     play    8.9
...           ...                         ...      ...    ...
199991  128470551                Fallen Earth     play    2.4
199993  128470551                 Magic Duels     play    2.2
199995  128470551                 Titan Souls     play    1.5
199997  128470551  Grand Theft Auto Vice City     play    1.5
199999  128470551                        RUSH     play    1.4

[70489 rows x 4 columns]

In [19]:
data_purchase = data[data['activity']=='purchase']

In [20]:
data_purchase.head()

Unnamed: 0,user_id,game_title,activity,hours
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
2,151603712,Fallout 4,purchase,1.0
4,151603712,Spore,purchase,1.0
6,151603712,Fallout New Vegas,purchase,1.0
8,151603712,Left 4 Dead 2,purchase,1.0


In [21]:
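# Sum only the numeric 'hours' column; summing the whole frame would also try to aggregate the 'activity' strings
DataGrouped = data_play.groupby(['user_id', 'game_title'], as_index=False)['hours'].sum()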

In [22]:
DataGrouped.head()

Unnamed: 0,user_id,game_title,hours
0,5250,Alien Swarm,4.9
1,5250,Cities Skylines,144.0
2,5250,Deus Ex Human Revolution,62.0
3,5250,Dota 2,0.2
4,5250,Portal 2,13.6


In [79]:
DataGrouped['hours'].max() # I will use this to set the max value of my rating scale

11754.0

### Surprise

After researching recommendation libraries in the Python ecosystem, I settled on the Surprise library. It seemed one of the most straightforward to implement, and it offers many algorithms for further developing and refining the recommendations.

For input data I will use 'DataGrouped'. This is the data grouped by user and game title, displaying the total hours played per user by game title.

In [30]:
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import SVD
from collections import defaultdict

In [80]:
# https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset
reader = Reader(rating_scale = (1, 12000))
data_pred = Dataset.load_from_df(DataGrouped[['user_id','game_title', 'hours']], reader)

In [35]:
cross_validate(NormalPredictor(), data_pred, cv=2)

{'test_rmse': array([283.6375919 , 283.40344586]),
 'test_mae': array([141.81055473, 140.09291715]),
 'fit_time': (0.04998469352722168, 0.0704641342163086),
 'test_time': (0.30907225608825684, 0.4136683940887451)}

In [40]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

I will be implementing the SVD algorithm. Research showed it has been widely and effectively applied for recommendations. This is just one of many options within Surprise.

In place of the user ratings that are usually the input to SVD, I will use the raw game hours played. This rests on the assumption that more hours played represent higher user satisfaction.

In [37]:
# https://surprise.readthedocs.io/en/stable/FAQ.html?highlight=recommendations#how-to-get-the-top-n-recommendations-for-each-user
trainset = data_pred.build_full_trainset()
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1f9f4889580>

In [41]:
# Predict ratings for all pairs (u,i) that are NOT in the training set 
# Successful run of 13 mins
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=5)

In [42]:
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

5250 ['Age of Empires II HD Edition', 'Banished', 'Call of Duty Black Ops', 'Call of Duty Black Ops - Multiplayer', 'Call of Duty Modern Warfare 2']
76767 ['Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Team Fortress 2', 'Alan Wake']
86540 ['Alien Swarm', 'Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Portal 2']
144736 ['Alien Swarm', 'Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Portal 2']
181212 ['Alien Swarm', 'Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Portal 2']
229911 ['Alien Swarm', 'Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Portal 2']
298950 ['Call of Duty Black Ops', 'Call of Duty Black Ops - Multiplayer', 'Call of Duty Modern Warfare 2', 'Call of Duty Modern Warfare 2 - Multiplayer', 'Call of Duty Modern Warfare 3']
381543 ['Alien Swarm', 'Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Portal 2']
547685 ['Alien Swarm', 'Cities Skylines', 'Deus Ex Human Revolution', 'Dota 2', 'Portal 2']
554278 ['Alien Sw

In [61]:
top_n_df = pd.DataFrame(top_n)

In [63]:
top_n_df.head()

Unnamed: 0,5250,76767,86540,144736,181212,229911,298950,381543,547685,554278,...,309228590,309255941,309262440,309265377,309404240,309434439,309554670,309626088,309824202,309903146
0,"(Age of Empires II HD Edition, 12000)","(Cities Skylines, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Call of Duty Black Ops, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)",...,"(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)","(Alien Swarm, 12000)"
1,"(Banished, 12000)","(Deus Ex Human Revolution, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Call of Duty Black Ops - Multiplayer, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)",...,"(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)","(Cities Skylines, 12000)"
2,"(Call of Duty Black Ops, 12000)","(Dota 2, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Call of Duty Modern Warfare 2, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)",...,"(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)","(Deus Ex Human Revolution, 12000)"
3,"(Call of Duty Black Ops - Multiplayer, 12000)","(Team Fortress 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Call of Duty Modern Warfare 2 - Multiplayer, ...","(Dota 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)",...,"(Portal 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Portal 2, 12000)","(Dota 2, 12000)","(Dota 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)"
4,"(Call of Duty Modern Warfare 2, 12000)","(Alan Wake, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Call of Duty Modern Warfare 3, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)",...,"(Team Fortress 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Team Fortress 2, 12000)","(Portal 2, 12000)","(Portal 2, 12000)","(Team Fortress 2, 12000)","(Team Fortress 2, 12000)"


In [75]:
print(data[data['user_id']==181212])

       user_id                game_title  activity  hours
42336   181212            Counter-Strike  purchase    1.0
42337   181212            Counter-Strike      play    1.8
42338   181212    Half-Life 2 Lost Coast  purchase    1.0
42339   181212    Half-Life 2 Lost Coast      play    0.4
42340   181212     Counter-Strike Source  purchase    1.0
42341   181212             Day of Defeat  purchase    1.0
42342   181212        Deathmatch Classic  purchase    1.0
42343   181212                 Half-Life  purchase    1.0
42344   181212               Half-Life 2  purchase    1.0
42345   181212    Half-Life 2 Deathmatch  purchase    1.0
42346   181212      Half-Life Blue Shift  purchase    1.0
42347   181212  Half-Life Opposing Force  purchase    1.0
42348   181212                  Ricochet  purchase    1.0
42349   181212     Team Fortress Classic  purchase    1.0


In [81]:
# to_csv returns None when given a file path, so the result is not assigned
top_n_df.to_csv('Output_recs.csv')
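
The wide frame saved above has one column per user and a (title, estimate) tuple in each cell, which can be awkward to read back. As an alternative, here is a minimal sketch that flattens a `top_n`-style dict into one row per recommendation. The helper `top_n_to_rows`, the column names, and the sample values are all assumptions made for illustration.

```python
import csv
import io


def top_n_to_rows(top_n):
    """Flatten {user_id: [(title, estimate), ...]} into one row per recommendation."""
    rows = []
    for user_id, user_ratings in top_n.items():
        for rank, (title, estimate) in enumerate(user_ratings, start=1):
            rows.append({"user_id": user_id, "rank": rank,
                         "game_title": title, "estimate": estimate})
    return rows


# Made-up stand-in for the notebook's `top_n` dict.
example_top_n = {
    5250: [("Banished", 12000), ("Alan Wake", 11000)],
    76767: [("Dota 2", 9000)],
}
rows = top_n_to_rows(example_top_n)

# Written to an in-memory buffer here; open() a real file to save it to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "rank", "game_title", "estimate"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```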

### Final Thoughts

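A quick QC of the results suggests that games a user already owns are not being recommended. Note, however, that the anti-testset only excludes games the user has played, so purchased-but-unplayed titles are not guaranteed to be filtered out. The number of columns, which represents the number of users with recommendations, does not match the number of users in the original dataset. No train/test split is involved here, since the model was fit on the full trainset. The more likely cause is that the model was trained on play records only, so users who purchased games but never played any are absent from the output; these users could be given the generally popular titles from Part 1. Also note that every estimate visible in the output equals 12000, the upper bound of the rating scale, so the ordering within each list carries little information.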
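
One way to investigate the user-count mismatch is to list users who have purchase records but no play records, since the model above was trained on play records only. The sketch below uses a made-up frame with the same columns as `data`; the variable names are illustrative.

```python
import pandas as pd

# Toy frame with the same columns as the notebook's `data` (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "game_title": ["Dota 2", "Dota 2", "Spore", "Portal 2", "Portal 2"],
    "activity": ["purchase", "play", "purchase", "purchase", "play"],
    "hours": [1.0, 40.0, 1.0, 1.0, 6.5],
})

all_users = set(toy["user_id"])
users_with_play = set(toy.loc[toy["activity"] == "play", "user_id"])

# Users with only purchase rows never reach a model trained on play rows.
purchase_only_users = sorted(all_users - users_with_play)
print(purchase_only_users)
```

If the size of this set on the real data equals the gap between the two user counts, that would account for the mismatch.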

Further improvement could come from testing the other algorithms in the Surprise library. Further data munging and cleanup could also improve the output, for example normalizing title strings to catch typos. Normalizing the hours played could also reduce the effect of outliers on the recommendations.
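
As a rough illustration of normalizing the hours, the sketch below applies a log transform and rescales the result to a 1 to 5 range. This is one possible choice among many, not something that was tested in this notebook; the frame and the `log_hours` and `rating` column names are made up.

```python
import numpy as np
import pandas as pd

# Made-up hours in the same shape as the notebook's DataGrouped frame.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "game_title": ["Dota 2", "Spore", "Dota 2", "Portal 2"],
    "hours": [0.2, 14.9, 11754.0, 62.0],
})

# log1p(x) = log(1 + x) keeps 0 hours at 0 and compresses the long right tail.
toy["log_hours"] = np.log1p(toy["hours"])

# Rescale to a 1-5 range so it resembles a conventional rating scale.
lo, hi = toy["log_hours"].min(), toy["log_hours"].max()
toy["rating"] = 1 + 4 * (toy["log_hours"] - lo) / (hi - lo)

print(toy[["hours", "rating"]].round(2))
```

A column like `rating` could then be loaded with `Reader(rating_scale=(1, 5))` in place of the raw hours, which should make it less likely that estimates pile up at the top of the scale.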