# Introduction

In this project, we developed a recommendation system for the Steam gaming platform. The goal is to suggest games to users based on their preferences and on game information. To achieve this, we used data analysis and feature engineering to build a model that predicts which games a user is likely to be interested in. This notebook documents the steps from the end of data preprocessing through model training and evaluation.


# Data preparation

## Library imports

In [83]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from datetime import datetime

## Dataset imports

This part uses two preprocessed datasets, 'games_filtered.csv' and 'users_filtered.csv', which were cleaned and filtered in previous work.

In [84]:
# Import dataset
games_df = pd.read_csv("../data/games_filtered.csv")
users_df = pd.read_csv("../data/users_filtered.csv")

## Data preprocessing

Further cleaning: set `Hours` to 0 for purchase records, so that only play records contribute playtime, and drop the columns we do not need.

In [85]:
# Set the 'Hours' for 'purchase' action to 0
users_df['Hours'] = np.where(users_df['Action'] == 'purchase', 0, users_df['Hours'])

# Drop the columns that are not needed
games_df.drop(['name', 'platforms','required_age'], axis=1, inplace=True)
users_df.drop('Game_Name', axis=1, inplace=True)

In [86]:
games_df.head(3)

| Unnamed: 0 | appid | release_date | english | developer | publisher | categories | genres | steamspy_tags | achievements | positive_ratings | negative_ratings | average_playtime | median_playtime | owners | price | cleaned_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 2000-11-01 | 1 | Valve | Valve | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | 0 | 124534 | 3339 | 17612 | 317 | 10000000-20000000 | 7.19 | counter strike |
| 1 | 20 | 1999-04-01 | 1 | Valve | Valve | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | 0 | 3318 | 633 | 277 | 62 | 5000000-10000000 | 3.99 | team fortress classic |
| 2 | 30 | 2003-05-01 | 1 | Valve | Valve | Multi-player;Valve Anti-Cheat enabled | Action | FPS;World War II;Multiplayer | 0 | 3416 | 398 | 187 | 34 | 5000000-10000000 | 3.99 | day of defeat |


In [87]:
users_df.head(3)

| Unnamed: 0 | User_ID | Action | Hours | cleaned_game_title |
|---|---|---|---|---|
| 0 | 151603712 | purchase | 0.0 | fallout 4 |
| 1 | 151603712 | play | 87.0 | fallout 4 |
| 2 | 151603712 | purchase | 0.0 | spore |


Now **'users_df'** contains each user's purchase and play records, including hours played, for every game they own.  
For **'games_df'**, the columns are as follows:  
**appid:** The unique identifier for each game on the Steam platform.  
**release_date:** The date when the game was released.  
**english:** A binary indicator of whether the game is available in English (e.g., 1 for English, 0 for other languages).  
**categories:** Categories associated with the game.  
**genres:** The genre of the game.  
**steamspy_tags:** Tags or keywords associated with the game.  
**achievements:** The number of in-game achievements available in the game.  
**positive_ratings:** The number of positive ratings or reviews given by players.  
**negative_ratings:** The number of negative ratings or reviews given by players.  
**average_playtime:** The average playtime in minutes for players of the game.  
**median_playtime:** The median playtime in minutes for players of the game.  
**owners:** The estimated number of game owners (users who have purchased or obtained the game).  
**price:** The price of the game.  
**cleaned_title:** The title of the game after cleaning.  

# Feature engineering

Feature engineering is a fundamental step in the machine learning pipeline, where we transform raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. In this recommendation system, we're specifically focusing on features that are indicative of user preferences and game qualities.


## Feature Transformation Pipeline

### Text Features: Categories, Genres, and Tags
For each game, descriptive text data such as categories, genres, and tags provide a rich set of information about the game's content and style. However, this textual information is not immediately usable by machine learning algorithms, which require numerical input. To convert this text into a numerical format, we employ a `TfidfVectorizer`. This method transforms the text into a matrix of TF-IDF features, which capture the importance of words in relation to the documents in which they appear.
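
To make this concrete, here is a minimal sketch on two made-up tag strings (the strings and variable names are illustrative assumptions, not rows from the dataset). It also shows a side effect worth knowing: the default tokenizer splits on punctuation, so a hyphenated tag such as `Multi-player` is not kept as a single token.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up "documents", combined the same way as in this notebook:
# categories + " " + genres + " " + steamspy_tags
docs = [
    "Multi-player;Valve Anti-Cheat enabled Action FPS;World War II;Multiplayer",
    "Single-player Indie Puzzle;Relaxing;Indie",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Learned vocabulary (lowercased word tokens)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(tfidf.shape)

# With the default norm="l2", every row has unit length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

If whole tags should be kept intact, one option would be a custom `token_pattern` or `tokenizer` that splits only on `;`, but that is a design choice this notebook does not make.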

### User Ratings
User ratings are a direct indication of a game's reception by its players. We create a feature that reflects the ratio of positive ratings to total ratings, offering a normalized metric that can be used to gauge overall user satisfaction.
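
As a small illustration of this ratio, the sketch below computes it on a toy frame. The third row has no ratings at all, which would produce a division by zero; filling that case with a neutral 0.5 is an assumption made here for illustration and is not something the notebook's `RatingsTransformer` does.

```python
import numpy as np
import pandas as pd

# Toy data: the first two rows reuse rating counts shown in games_df.head(3),
# the third row is a hypothetical game without any ratings
toy_games = pd.DataFrame({
    "positive_ratings": [124534, 3318, 0],
    "negative_ratings": [3339, 633, 0],
})

total_ratings = toy_games["positive_ratings"] + toy_games["negative_ratings"]

# Replace a zero total with NaN to avoid dividing by zero, then fill the
# resulting NaN with a neutral 0.5 (assumption for this sketch)
rating_ratio = (
    toy_games["positive_ratings"] / total_ratings.replace(0, np.nan)
).fillna(0.5)

print(rating_ratio.round(3).tolist())
```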

### Playtime
The amount of time users spend playing a game is a good indicator of engagement and enjoyment. We normalize the average and median playtime by dividing each by its maximum value in the fitted data, so that both features fall between 0 and 1.

### Ownership
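The number of users who own a game can provide insights into its popularity and accessibility. The `owners` column is given as a range (for example, `10000000-20000000`), so we take the midpoint of its lower and upper bounds as the estimated number of owners.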
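
A minimal sketch of turning an ownership range into a single number by taking the midpoint of its bounds. The first two values are taken from the `games_df.head(3)` output above; the third and all variable names are made up for illustration.

```python
import pandas as pd

owners = pd.Series(["10000000-20000000", "5000000-10000000", "0-20000"])

# Split "low-high" into two integer columns and average them row by row
bounds = owners.str.split("-", expand=True).astype(int)
owners_midpoint = bounds.mean(axis=1)

print(owners_midpoint.tolist())
```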

### Days Since Release
The number of days since the game was released, which can be a useful signal of a game's popularity.
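
The sketch below shows one way to compute this feature in a vectorized form. It uses a fixed reference date, which is an assumption made here so that the result is reproducible; the dates are made up for illustration.

```python
import pandas as pd

release_dates = pd.Series(["2023-12-31", "2023-01-01"])

# Assumed fixed reference date; using the current time instead would give
# different values on every run
reference_date = pd.Timestamp("2024-01-01")

parsed = pd.to_datetime(release_dates, format="%Y-%m-%d")
days_since_release = (reference_date - parsed).dt.days

print(days_since_release.tolist())
```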

In [88]:
# Transformer for text features
class CategoriesGenresTagsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = TfidfVectorizer(stop_words='english')
        
    def fit(self, X, y=None):
        combined_text = X['categories'] + " " + X['genres'] + " " + X['steamspy_tags']
        self.vectorizer.fit(combined_text)
        return self
    
    def transform(self, X):
        combined_text = X['categories'] + " " + X['genres'] + " " + X['steamspy_tags']
        return self.vectorizer.transform(combined_text).toarray()

# Transformer for ratings
class RatingsTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        total_ratings = X['positive_ratings'] + X['negative_ratings']
        rating_ratio = X['positive_ratings'] / total_ratings
        return rating_ratio.values.reshape(-1, 1)

# Transformer for playtime
class PlaytimeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.average_playtime_max = X['average_playtime'].max()
        self.median_playtime_max = X['median_playtime'].max()
        return self
    
    def transform(self, X):
        normalized_average_playtime = X['average_playtime'] / self.average_playtime_max
        normalized_median_playtime = X['median_playtime'] / self.median_playtime_max
        return pd.concat([normalized_average_playtime, normalized_median_playtime], axis=1).values

# Transformer for ownership
class OwnersTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        owners_processed = X['owners'].str.split('-').apply(lambda x: (int(x[0]) + int(x[1])) / 2)
        return owners_processed.values.reshape(-1, 1)

# Transformer for the number of days since the game was released
class DaysSinceReleaseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        now = datetime.now()
        days_since_release = X['release_date'].apply(lambda x: (now - datetime.strptime(x, '%Y-%m-%d')).days)
        return days_since_release.values.reshape(-1, 1)

### Combine feature transformer

The feature transformation pipeline is a critical component of our system. By using `FeatureUnion`, we can combine multiple feature-processing steps into a single transformer object that can be applied to the dataset. This approach not only simplifies the code but also ensures that all features are processed in a consistent manner. The pipeline includes our custom transformers, each handling a different aspect of the game data to create a comprehensive feature set for the recommendation model.
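
To show what `FeatureUnion` does mechanically, here is a minimal sketch with two stand-in transformers on a toy frame. The `FunctionTransformer` wrappers and the column choices are assumptions for illustration; they are not the custom transformers defined above.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

toy_games = pd.DataFrame({
    "price": [7.19, 3.99, 3.99],
    "achievements": [0, 0, 12],
})

# Each transformer returns a 2-D array with one row per game
toy_union = FeatureUnion([
    ("price", FunctionTransformer(lambda df: df[["price"]].to_numpy())),
    ("achievements", FunctionTransformer(lambda df: df[["achievements"]].to_numpy())),
])

# FeatureUnion fits every transformer on the same input and stacks the
# outputs side by side, in the order in which they were listed
combined = toy_union.fit_transform(toy_games)
print(combined.shape)
```

Note that `FeatureUnion` only concatenates columns; it does not rescale them, so features on very different scales stay on those scales unless a scaling step is added.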


In [89]:
# Combined Feature Transformer
feature_transformer = FeatureUnion([
    ('categories_genres_tags', CategoriesGenresTagsTransformer()),
    ('ratings', RatingsTransformer()),
    ('playtime', PlaytimeTransformer()),
    ('owners', OwnersTransformer()),
    ('days_since_release', DaysSinceReleaseTransformer())
])

# Model Training

With our features engineered and ready, the model training process is divided into two parts: clustering users based on their gaming preferences and building a recommendation model for suggesting new games.

## Clustering User Data

The user data is analyzed to determine patterns in gaming preferences. This is achieved through clustering, where we aim to group users with similar behaviors:

- We use the K-Means algorithm, which is effective for grouping data into distinct clusters based on feature similarity.
- Standardization of user data is crucial here as it ensures that each feature contributes equally to the distance computations during clustering.
- The number of clusters is selected based on exploratory data analysis or domain knowledge. It dictates how many distinct user groups we expect to find.

The process involves the following steps:
1. Standardizing the user data to ensure fair contribution from all features.
2. Training the K-Means model with the standardized user data to find clusters.
3. Analyzing the clusters to understand the common characteristics of user groups.
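
The three steps above can be sketched on a toy user-by-game hours matrix. The data, the choice of two clusters, and all variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy hours matrix: users 1 and 2 mostly play "game a",
# users 3 and 4 mostly play "game b"
toy_hours = pd.DataFrame(
    {"game a": [50.0, 60.0, 0.0, 1.0], "game b": [0.0, 2.0, 40.0, 45.0]},
    index=["user 1", "user 2", "user 3", "user 4"],
)

# Step 1: standardize each column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_hours)

# Step 2: fit K-Means; n_init is set explicitly so the behavior does not
# depend on the installed scikit-learn version's default
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_scaled)

# Step 3: inspect the average hours per game within each cluster
cluster_means = toy_hours.groupby(toy_labels).mean()
print(toy_labels)
print(cluster_means)
```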

In [90]:
# Create a user-game interaction matrix
interaction_matrix = users_df.pivot_table(index='User_ID', columns='cleaned_game_title', values='Hours', aggfunc='sum').fillna(0)

# Standardize the interaction matrix
scaler = StandardScaler()
scaled_matrix = scaler.fit_transform(interaction_matrix)

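# Perform clustering using KMeans
# n_init is set explicitly to 10 (the default at the time of writing) to avoid the FutureWarning about its changing default
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)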
cluster_labels = kmeans.fit_predict(scaled_matrix)

# Add cluster labels to user data
interaction_matrix['cluster_label'] = cluster_labels

# Create a user-game hours matrix
user_game_hours = users_df.groupby(['User_ID', 'cleaned_game_title'])['Hours'].sum().unstack(fill_value=0)



In [91]:
def games_liked_by_cluster_function(cluster_users):
    # Filter out the user_game_hours dataframe for users in the given cluster
    cluster_game_hours = user_game_hours.loc[cluster_users]
    
    # Sum the hours for each game across all users in the cluster
    cluster_game_sum = cluster_game_hours.sum().sort_values(ascending=False)
    
    return cluster_game_sum

## User Representation through Game Features
Creating a representation of each user's preferences is a key component in making personalized game recommendations. The system must understand which games a user likes and how they interact with different game genres and styles. This is achieved by transforming the raw game data into feature vectors and then aggregating them into a user profile.

The function `compute_user_representation` constructs a user's profile based on their gameplay history. It follows these steps:

1. For a given user, the function retrieves all the games they have interacted with from `users_df`.
2. It initializes a vector to represent the user, starting with a zero vector with the same shape as the game feature vectors.
3. For each of the user's interaction rows, the function looks up the game's feature vector in `game_features_dict` and weights it by the hours recorded in that row (purchase rows carry 0 hours and therefore contribute nothing).
4. The weighted feature vectors are summed to create a composite representation of the user's gaming preferences.
5. This sum is then divided by the number of interaction rows for the user (purchase and play rows both count) to produce the final user representation vector.
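
The steps above can be traced on a toy example. The two-dimensional feature vectors, the titles, and the helper name `toy_user_representation` are hypothetical; the membership check for titles without a feature vector is an added safeguard in this sketch and is not part of the notebook's function.

```python
import numpy as np
import pandas as pd

# Hypothetical feature vectors for two games
toy_features = {
    "game a": np.array([1.0, 0.0]),
    "game b": np.array([0.0, 1.0]),
}

# One user with a purchase row (0 hours) and two play rows
toy_users = pd.DataFrame({
    "User_ID": [1, 1, 1],
    "cleaned_game_title": ["game a", "game a", "game b"],
    "Hours": [0.0, 10.0, 2.0],
})

def toy_user_representation(user_id, users, features):
    rows = users[users["User_ID"] == user_id]
    vector = np.zeros_like(next(iter(features.values())))
    for title, hours in zip(rows["cleaned_game_title"], rows["Hours"]):
        if title in features:  # skip titles without a feature vector
            vector += features[title] * hours
    # Divide by the number of interaction rows, as in the steps above
    return vector / len(rows)

toy_representation = toy_user_representation(1, toy_users, toy_features)
print(toy_representation)
```

Here the weighted sum is `[10, 2]` and the user has three rows, so the result is `[10/3, 2/3]`.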


In [92]:
# Use feature_transformer to get game feature vectors
feature_transformer.fit(games_df)
game_features = feature_transformer.transform(games_df)

# Map each game title to its feature vector
game_features_dict = {game: features for game, features in zip(games_df['cleaned_title'], game_features)}

# compute user representation
def compute_user_representation(user_id, users_df, game_features_dict):
    user_games = users_df[users_df['User_ID'] == user_id]
    representation = np.zeros(next(iter(game_features_dict.values())).shape)
    for _, row in user_games.iterrows():
        game_title = row['cleaned_game_title']
        hours = row['Hours']
        representation += game_features_dict[game_title] * hours
    return representation / len(user_games)

## Training Recommendation Model

Our recommendation system employs a hybrid approach that combines collaborative filtering with content-based filtering. This approach leverages both the similarities between users' preferences (collaborative) and the features of the games themselves (content-based). Here's how the `hybrid_recommendation` function operates to generate game recommendations for a given user:

### Combining Collaborative and Content-Based Methods

1. **Cosine Similarity for Content-Based Filtering**:
   - The function first invokes `compute_user_representation` to get the feature vector representation of the user's preferences based on their gameplay history.
   - It then calculates the cosine similarity between this user representation and the feature vectors of all games in the dataset.
   - This similarity score serves as a basis for content-based recommendations, suggesting games with features similar to those the user has shown a preference for.

2. **Cluster-Based Recommendations for Collaborative Filtering**:
   - The user is assigned to a cluster based on their past interactions. This cluster represents a group of users with similar gaming tastes.
   - We find other users in the same cluster and retrieve games that are popular within this user group, representing the collaborative aspect.
   - A function `games_liked_by_cluster_function` is used to aggregate the preferred games of users in the same cluster.

### Generating Hybrid Recommendations

- The function sorts the games based on their similarity scores and selects the top `N` games as content-based recommendations.
- It also retrieves the top `N` games liked by the user's cluster for collaborative recommendations.
- These two lists are then combined to form a single list of recommended games, ensuring any duplicate entries are removed.
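
The ranking and merging steps can be sketched on toy data as follows. The game vectors, the titles, and the cluster list are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy_titles = ["game a", "game b", "game c"]
toy_game_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_user_vector = np.array([1.0, 0.0])

# Cosine similarity between the user and every game (one row of scores)
scores = cosine_similarity([toy_user_vector], toy_game_vectors)[0]

# Indices of the top_n highest scores, best first
top_n = 2
top_indices = scores.argsort()[-top_n:][::-1]
content_based = [toy_titles[i] for i in top_indices]

# Hypothetical list of games popular in the user's cluster
cluster_based = ["game b", "game a"]

# dict.fromkeys keeps the first occurrence of each title and preserves order
merged = list(dict.fromkeys(content_based + cluster_based))
print(merged)
```

Because the content-based titles are placed first, they stay ahead of the cluster-based titles in the merged list, which matters whenever that list is truncated afterwards.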


In [93]:
def hybrid_recommendation(user_id, game_features, top_n=5):
    # 1. Calculate cosine similarity between the user's representation and game feature vectors
    user_representation = compute_user_representation(user_id, users_df, game_features_dict)
    similarities = cosine_similarity([user_representation], game_features)
    
    # 2. Recommend games based on what other users in the same cluster as the target user like
    user_cluster = interaction_matrix.loc[user_id, 'cluster_label']
    users_in_same_cluster = interaction_matrix[interaction_matrix['cluster_label'] == user_cluster].index
    games_liked_by_cluster = games_liked_by_cluster_function(users_in_same_cluster)
    
    # Combine the results from both strategies
    recommended_indices_similarity = similarities[0].argsort()[-top_n:][::-1]
    recommended_games_similarity = games_df.iloc[recommended_indices_similarity]['cleaned_title'].tolist()
    
    recommended_games_cluster = games_liked_by_cluster.head(top_n).index.tolist()
    
    # Merge the two recommendation lists and remove duplicate games
    hybrid_recommendations = list(dict.fromkeys(recommended_games_similarity + recommended_games_cluster))
    
    return hybrid_recommendations[:top_n]

In [94]:
# Example usage for testing
user_id_example = users_df['User_ID'].iloc[0]
hybrid_recommendation(user_id_example, game_features)

['clicker heroes',
 'garry s mod',
 'dirty bomb',
 'a story about my uncle',
 'half life 2']

# Model evaluation

Evaluating the performance of the recommendation system is crucial to understand its effectiveness and to identify areas for improvement. We focus on a metric known as the 'hit rate', which measures how often at least one recommended game matches a game the user has actually spent time playing.

## Hit Rate Calculation

The hit rate is calculated as follows:

- For each user, we generate a set of recommended games based on the user's profile and interaction with games.
- We then compare these recommended games against the games that the user has spent a significant amount of time on (above a certain threshold of hours), which we consider as games they "truly like".
- If there is an overlap between the recommended games and the liked games, it is considered a 'hit'.

## Defining a Successful Recommendation

In our system, a successful recommendation for a user is defined by whether any of the games suggested by our model is among the games that the user has played for more than a set threshold of hours (e.g., 5 hours). This threshold is used as a proxy for the user's preference and satisfaction with the game.
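
As a minimal worked example of this definition, the sketch below computes a hit rate for two hypothetical users. The recommendation and liked-game lists are made up for illustration.

```python
# Hypothetical recommendations and liked games (played above the threshold)
toy_recommended = {
    "user 1": ["game a", "game b"],
    "user 2": ["game c"],
}
toy_liked = {
    "user 1": ["game b", "game z"],
    "user 2": ["game y"],
}

# A user counts as a hit if at least one recommended game is also liked
toy_hits = sum(
    len(set(toy_recommended[user]) & set(toy_liked[user])) > 0
    for user in toy_recommended
)
toy_hit_rate = toy_hits / len(toy_recommended)
print(toy_hit_rate)
```

User 1 has one overlapping game and user 2 has none, so one of two users is a hit and the hit rate is 0.5.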


In [95]:
def recommendation(user_id, game_features, users_df, threshold_hours=5, top_n=5):

    recommended_games = hybrid_recommendation(user_id, game_features, top_n=top_n)
    
    # Get games that user truly likes based on threshold_hours
    liked_games = users_df[(users_df['User_ID'] == user_id) & (users_df['Hours'] > threshold_hours)]['cleaned_game_title'].tolist()
    
    # Check for hits
    hits = set(recommended_games).intersection(set(liked_games))
    
    return len(hits) > 0

*Because our computational power is limited, the hit rate is only tested on a sample of 1000 users. Feel free to use a larger sample size to see the result.*

In [97]:
# Set the random seed to 42 for reproducibility
np.random.seed(42)

# Calculate hit rate for testing users
test_users_size = 1000
random_user_ids = np.random.choice(users_df['User_ID'].unique(), size=test_users_size, replace=False)

hits = sum(recommendation(user_id, game_features, users_df) for user_id in random_user_ids)

hit_rate = hits / test_users_size
print('The hit rate for a sample size of {} is: {}%'.format(test_users_size, hit_rate*100))

The hit rate for a sample size of 1000 is: 41.8%


The hit rate for our recommendation system, calculated over a random sample of users, is shown above. This metric indicates that for roughly 41.8% of the users in our sample, at least one game recommended by our system was a game that they had spent a significant amount of time playing (beyond the threshold of 5 hours, which we've used as a proxy for user preference).

### Interpretation of the Hit Rate

- **Positive Aspects**: A hit rate of 41.8% suggests that the hybrid recommendation system identifies at least one game that aligns with the user's preferences for roughly two in five users. This is a strong starting point, particularly if we consider the vast number of games available and the complexity of individual user preferences.

- **User behavior**: Because Steam users often have wide-ranging game collections, it is hard to predict the exact games they own. As a further evaluation, we could try to predict games that fit the genres a user favors.  

- **Room for Improvement**: While the hit rate is relatively good, it also indicates that there is room for improvement. For 58.2% of the sampled users, none of the recommended games was among the games they had played for more than 5 hours. This could be due to various factors such as the diversity of users' tastes, the granularity of the user clusters, the balance between collaborative and content-based recommendations, or limitations in the features used to represent games and users.

### Analyzing the Results

- **User Cluster Granularity**: We may need to re-examine the granularity of user clusters. Too many clusters can lead to overfitting, where the system fails to generalize from the users' past behavior. Conversely, too few clusters might not capture the nuances of different users' preferences.

- **Feature Representation**: The features used to represent games and users play a crucial role in the recommendation process. It may be beneficial to revisit the feature engineering step to ensure that we are capturing the most relevant aspects of games that influence user preferences.

- **Threshold for Liked Games**: The threshold of 5 hours is an arbitrary cutoff for user preference. This threshold could be fine-tuned, or a different method of gauging user satisfaction could be explored, such as incorporating user ratings or the frequency of interactions with a game.

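- **Balancing Recommendations**: The balance between collaborative and content-based recommendations is also critical. If one type is dominating the recommendations, it may not provide a well-rounded set of suggestions to the user. In the current `hybrid_recommendation` implementation, the content-based titles come first in the merged list, which is then truncated to `top_n`, so cluster-based titles only reach the final list when the content-based list contains duplicate titles. We may need to adjust how we weigh these two components in the final recommendation list.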

## Moving Forward

To improve the hit rate, we can undertake several actions:

- **Hyperparameter Tuning**: Experiment with different numbers of clusters and other hyperparameters in the K-Means algorithm to find a more optimal user grouping.
- **Enhanced Feature Engineering**: Consider including additional game features or user interaction data that could improve the personalization of recommendations.
- **Cross-Validation**: Implement a cross-validation framework to ensure that the hit rate is consistent across different subsets of users.
- **Alternative Metrics**: Besides the hit rate, we can explore other metrics like precision, recall, F1-score, or even user surveys to get a more comprehensive understanding of the recommendation system's performance.
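
As one possible starting point for the alternative metrics mentioned above, the sketch below computes precision and recall at `k` for a single user. The helper name `precision_recall_at_k` and the toy lists are assumptions for illustration; they are not part of this notebook's code.

```python
def precision_recall_at_k(recommended, liked, k):
    """Return (precision@k, recall@k) for one user."""
    top_k = recommended[:k]
    n_hits = len(set(top_k) & set(liked))
    precision = n_hits / k if k else 0.0
    recall = n_hits / len(liked) if liked else 0.0
    return precision, recall

# Hypothetical recommendations and liked games for one user
toy_recommended = ["game a", "game b", "game c", "game d", "game e"]
toy_liked = ["game b", "game e", "game z"]

toy_precision, toy_recall = precision_recall_at_k(toy_recommended, toy_liked, k=5)
print(toy_precision, toy_recall)
```

Two of the five recommended games are liked, so precision is 2/5, and two of the three liked games were recommended, so recall is 2/3. Unlike the hit rate, both values change with the number of matching games rather than only with whether any match exists.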

The current hit rate provides a benchmark for the effectiveness of the recommendation system. It is a quantifiable outcome that we can aim to improve upon through iterative development and testing.
