In [1]:
import sys
import os

current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, '..'))
sys.path.append(os.path.join(project_root, 'src'))

In [61]:
import pandas as pd
import numpy as np
import random


from metrics import map_score, mrr_score, ndcg_score, rmse_score, average_precision
from utils import to_user_movie_matrix, make_binary_matrix, RatingMatrix

In [3]:
#for the first stage 
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split

#for the second stage 
from sklearn.model_selection import train_test_split as train_test_split_sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

In [4]:
ratings = pd.read_csv('../data/ratings.dat', sep='::', engine='python', names=['UserID', 'MovieID', 'Rating', 'Timestamp'])
users = pd.read_csv('../data/users.dat', sep='::', engine='python', names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'])
movies = pd.read_csv('../data/movies.dat', sep='::', engine='python', names=['MovieID', 'Title', 'Genres'], encoding='latin1')

data = ratings.merge(users, on='UserID').merge(movies, on='MovieID')

### Data preprocessing

We need some data preprocessing to prepare the context features for the models. The steps are:

1. Extract the first listed genre of each film as its main genre, since a film can have several genres
2. Create dummy variables for the categorical columns

In [5]:
data['FirstGenre'] = data['Genres'].apply(lambda x: x.split('|')[0])
data = pd.get_dummies(data, columns=['Occupation', 'Gender', 'FirstGenre'])

In [6]:
data['genres_number'] = data['Genres'].apply(lambda x: len(x.split('|')))

In [7]:
data.head(1)

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Age,Zip-code,Title,Genres,Occupation_0,Occupation_1,...,FirstGenre_Film-Noir,FirstGenre_Horror,FirstGenre_Musical,FirstGenre_Mystery,FirstGenre_Romance,FirstGenre_Sci-Fi,FirstGenre_Thriller,FirstGenre_War,FirstGenre_Western,genres_number
0,1,1193,5,978300760,1,48067,One Flew Over the Cuckoo's Nest (1975),Drama,False,False,...,False,False,False,False,False,False,False,False,False,1


### Training the first stage - candidates generation

We need to narrow down the vast number of possible movies to a smaller set of potentially relevant items. To do this, we will use SVD and, for each user, pick the movies with the highest predicted ratings out of all possible items.
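For intuition, the SVD algorithm in Surprise estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item latent factor vectors. A minimal sketch of that prediction rule in plain Python follows; the function name and all numbers are made-up assumptions for illustration and are not taken from the trained model.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1, 5)):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + dot
    # Clip the estimate to the rating scale, as Surprise does by default
    low, high = rating_scale
    return min(high, max(low, estimate))


# Toy values (assumptions): two latent factors for one user and one movie
example = predict_rating(3.5, 0.2, -0.1, [0.5, -0.3], [0.4, 0.1])
clipped = predict_rating(4.8, 0.5, 0.3, [1.0], [1.0])
print(round(example, 2), clipped)
```

Candidate generation then amounts to computing this estimate for every unseen movie and keeping the highest-scoring ones.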

In [9]:
#convert our data into the format Surprise library expects
reader = Reader(rating_scale=(1, 5))
surprise_data = Dataset.load_from_df(data[['UserID', 'MovieID', 'Rating']], reader)

In [10]:
trainset, testset = train_test_split(surprise_data, test_size=0.2)

Let's use GridSearchCV to search for the best hyperparameters for SVD:
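An exhaustive grid search fits one model per parameter combination and per cross-validation fold, so its cost grows multiplicatively. A small sketch that counts the fits implied by the grid and the fold count used in the cells that follow (the variable names `combinations` and `n_fits` are illustrative):

```python
from itertools import product

# Same grid and fold count as in the notebook cells
param_grid = {
    'n_factors': [20, 30, 40, 50, 60, 70, 80, 90, 100],
    'n_epochs': [5, 10, 20, 30, 40, 50],
}
cv_folds = 10

# Every combination of the listed values is one candidate configuration
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds
print(f'{len(combinations)} combinations, {n_fits} model fits')
```

This is worth checking before launching the search, because each fit on the full ratings data takes several seconds.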

In [13]:
param_grid = {
  'n_factors': [20, 30, 40, 50, 60, 70, 80, 90, 100],
  'n_epochs': [5, 10, 20, 30, 40, 50]
}
 

In [16]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=10)
gs.fit(surprise_data)
 
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.8656169437820651
{'n_factors': 50, 'n_epochs': 30}


In [17]:
#initialize SVD with the best params found by the grid search

svd = SVD(n_factors=50, n_epochs=30)

In [18]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x21112a52d60>

Now, let's check the performance of SVD on our dataset. The RMSE threshold for accepting the model is 1.5 (our FunkSVD result from the previous homework).

In [46]:
results = cross_validate(svd, surprise_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8736  0.8744  0.8754  0.8744  0.8744  0.8744  0.0006  
MAE (testset)     0.6824  0.6845  0.6842  0.6836  0.6838  0.6837  0.0007  
Fit time          12.40   11.67   12.50   13.18   13.60   12.67   0.67    
Test time         0.98    0.83    0.89    1.97    0.87    1.11    0.43    


The mean RMSE (0.8744) is below the threshold, so we can proceed with this model.
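For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, and MAE is the mean absolute difference. A minimal sketch in plain Python on made-up numbers (not the notebook's data):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy ratings (assumptions for illustration)
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 2.0]
print(rmse(actual, predicted), mae(actual, predicted))
```

Because the errors are squared before averaging, RMSE penalizes large misses more than MAE does, which is why RMSE is the larger of the two in the cross-validation table.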

### Training the second stage - ranking of the candidates 

Now, we need to train a more powerful model to rank the candidate movies and generate the final set of recommendations. We will try GradientBoostingClassifier from sklearn because of its good trade-off between training time and quality.

We will use the predictions from the first stage as a feature in the second stage model. 

In [19]:
ranking_data = []
indices = []

In [20]:
#let's prepare X and y inputs for the GradientBoosting() model 

for idx, row in data.iterrows():
    user_id = row['UserID']
    movie_id = row['MovieID']
    
    #the trained SVD model predicts ratings for all user-item pairs
    pred = svd.predict(user_id, movie_id)
    
    features = [
        pred.est,
        row['Age'],
        row['Gender_M'],
        row['Occupation_0'],
        row['Occupation_1'],
        row['Occupation_2'],
        row['Occupation_3'],
        row['Occupation_4'],
        row['Occupation_5'],
        row['Occupation_6'],
        row['Occupation_7'],
        row['Occupation_8'],
        row['Occupation_9'],
        row['Occupation_10'],
        row['Occupation_11'],
        row['Occupation_12'],
        row['Occupation_13'],
        row['Occupation_14'],
        row['Occupation_15'],
        row['Occupation_16'],
        row['Occupation_17'],
        row['Occupation_18'],
        row['Occupation_19'],
        row['Occupation_20'],
        row['FirstGenre_Action'],
        row['FirstGenre_Adventure'], 
        row['FirstGenre_Animation'],
        row["FirstGenre_Children's"], 
        row['FirstGenre_Comedy'], 
        row['FirstGenre_Crime'],
        row['FirstGenre_Documentary'], 
        row['FirstGenre_Drama'], 
        row['FirstGenre_Fantasy'],
        row['FirstGenre_Film-Noir'], 
        row['FirstGenre_Horror'], 
        row['FirstGenre_Musical'],
        row['FirstGenre_Mystery'], 
        row['FirstGenre_Romance'], 
        row['FirstGenre_Sci-Fi'],
        row['FirstGenre_Thriller'], 
        row['FirstGenre_War'], 
        row['FirstGenre_Western'],
        row['genres_number']
    ]
    
    #treat ratings of 4 and 5 as positive (liked) labels
    label = 1 if row['Rating'] >= 4 else 0
    

    ranking_data.append((features, label))
    indices.append(idx) 

In [21]:
X, y = zip(*ranking_data)
X, y = np.array(X), np.array(y)
indices = np.array(indices)

In [22]:
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split_sklearn(
    X, y, indices, test_size=0.2, random_state=42
)

Let's train the second stage model:

In [24]:
ranking_model = GradientBoostingClassifier()
ranking_model.fit(X_train, y_train)

### Evaluation 

We are now ready to evaluate the model using our evaluation framework.

In [38]:
y_pred = ranking_model.predict_proba(X_test)[:, 1]   #we extract probabilities for the positive class (likelihood of liking the movie)


In [46]:
ranked_pred_df = pd.DataFrame({
    'UserID': data.loc[indices_test, 'UserID'].values,
    'MovieID': data.loc[indices_test, 'MovieID'].values,
    'Rating': y_pred
})

In [48]:
pred_matrix = to_user_movie_matrix(ranked_pred_df)

In [60]:
#build the ground-truth matrix once and reuse it for all three metrics
true_matrix = to_user_movie_matrix(data.loc[indices_test, ['UserID', 'MovieID', 'Rating']])

map_score_value = map_score(pred_matrix, true_matrix, top=20)
mrr_score_value = mrr_score(pred_matrix, true_matrix, top=20)
ndcg_score_value = ndcg_score(pred_matrix, true_matrix, top=20)

print(f'Two-stage MAP: {map_score_value}')
print(f'Two-stage MRR: {mrr_score_value}')
print(f'Two-stage NDCG: {ndcg_score_value}')

Two-stage MAP: 0.6889973280868418
Two-stage MRR: 0.9665626125664224
Two-stage NDCG: 0.8749891741359047


A MAP of 0.689 means that, averaged over all users, the average precision of the top-20 ranking is nearly 69%. It is one of our best results across all approaches tried.

An MRR of 0.967 shows that the first relevant item is, on average, ranked very close to the top of the recommendation list. Since a first hit in second position contributes only 0.5 to the mean, a value this high implies that for the large majority of users the first relevant item is in the first position.

NDCG evaluates the quality of the ranking by considering the position of relevant items. An NDCG of 0.875 suggests that the overall ranking is highly effective, with most relevant items placed near the top of the ranked list.
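To make these three metrics concrete, here is a minimal sketch of how each can be computed for a single ranked list of 0/1 relevance flags. These are common textbook definitions and an assumption on our side: the project's `metrics` module is not shown here and may normalize differently. MAP and MRR are then the means of the per-user values.

```python
import math


def average_precision(relevance, k):
    """AP@k: mean of precision@rank over the ranks that hold a relevant item."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def reciprocal_rank(relevance, k):
    """1 / rank of the first relevant item within the top k, or 0 if there is none."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg(relevance, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    def dcg(flags):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(flags, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


# Toy ranking (assumption): relevant items at positions 1, 3 and 4
ranked = [1, 0, 1, 1, 0]
print(average_precision(ranked, 5), reciprocal_rank(ranked, 5), ndcg(ranked, 5))
```

In this toy list the reciprocal rank is already perfect because the first item is relevant, while average precision and NDCG are penalized for the irrelevant item in second position.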

### The pipeline  

Once the first- and second-stage models are trained, we can illustrate the recommendation pipeline for a random user from the test dataset.

The steps are as follows: 

**Step 1: select a user**

**Step 2: generate initial candidate recommendations using the SVD model**
We use the trained SVD model to predict ratings for all movies that the selected user has not yet rated. Then, we select the top-N movies with the highest predicted ratings as initial candidates.

**Step 3: re-rank the candidates using the GradientBoosting model**
For each of the top-N candidate movies, we create feature vectors that include the predicted rating from the SVD model and other relevant features (like user demographics and movie attributes). Then, we use the trained GradientBoosting model to predict a ranking score for each candidate.

**Step 4: display the final recommendations**
We sort the candidate movies by the ranking score generated by the GradientBoosting model and display the top-ranked movies as the final recommendations for the user.
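The four steps can be sketched as one function. This is a simplified illustration under stated assumptions: `recommend`, `first_stage`, and `second_stage` are hypothetical names, and the toy scorers merely stand in for the SVD and GradientBoosting models.

```python
def recommend(user_id, candidate_items, first_stage_score, second_stage_score, top_n=20):
    """Two-stage recommendation: shortlist with one scorer, re-rank with another."""
    # Stage 1: keep the top_n items by first-stage score (e.g. a predicted rating)
    shortlist = sorted(candidate_items,
                       key=lambda item: first_stage_score(user_id, item),
                       reverse=True)[:top_n]
    # Stage 2: re-score only the shortlist, passing the first-stage score as a feature
    rescored = [(item, second_stage_score(user_id, item, first_stage_score(user_id, item)))
                for item in shortlist]
    return [item for item, _ in sorted(rescored, key=lambda pair: pair[1], reverse=True)]


# Toy scorers (assumptions for illustration only)
toy_ratings = {'A': 4.8, 'B': 4.5, 'C': 3.0, 'D': 2.0}
toy_boost = {'A': 0.0, 'B': 0.5, 'C': 0.9, 'D': 0.9}


def first_stage(user_id, item):
    return toy_ratings[item]


def second_stage(user_id, item, est):
    return est / 5 + toy_boost[item]


result = recommend(1, ['A', 'B', 'C', 'D'], first_stage, second_stage, top_n=2)
print(result)
```

Note that items dropped in the first stage (here 'C' and 'D') can never be recovered by the second stage, however high their second-stage score would have been.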

In [77]:
#Step 1

random_user_id = random.choice(data.loc[indices_test, 'UserID'].values)

In [78]:
#Step 2
top_n = 20


#select unseen movies
movie_ids = data.loc[indices_test, 'MovieID'].unique()
user_rated_movies = data[data['UserID'] == random_user_id]['MovieID']
unseen_movies = np.setdiff1d(movie_ids, user_rated_movies)

#SVD predictions 
predictions = [svd.predict(random_user_id, movie_id) for movie_id in unseen_movies]
top_candidates = sorted(predictions, key=lambda x: x.est, reverse=True)[:top_n]  # Top-N candidates

In [85]:
#Step 3

ranking_data = []

for pred in top_candidates:
    movie_id = pred.iid
    predicted_rating = pred.est
    
    
    user_info = data[data['UserID'] == random_user_id].iloc[0]
    movie_info = data[data['MovieID'] == movie_id].iloc[0]
    
    features = [
        predicted_rating,
        user_info['Age'],
        user_info['Gender_M'],
        user_info['Occupation_0'],
        user_info['Occupation_1'],
        user_info['Occupation_2'],
        user_info['Occupation_3'],
        user_info['Occupation_4'],
        user_info['Occupation_5'],
        user_info['Occupation_6'],
        user_info['Occupation_7'],
        user_info['Occupation_8'],
        user_info['Occupation_9'],
        user_info['Occupation_10'],
        user_info['Occupation_11'],
        user_info['Occupation_12'],
        user_info['Occupation_13'],
        user_info['Occupation_14'],
        user_info['Occupation_15'],
        user_info['Occupation_16'],
        user_info['Occupation_17'],
        user_info['Occupation_18'],
        user_info['Occupation_19'],
        user_info['Occupation_20'],
        movie_info['FirstGenre_Action'],
        movie_info['FirstGenre_Adventure'], 
        movie_info['FirstGenre_Animation'],
        movie_info["FirstGenre_Children's"], 
        movie_info['FirstGenre_Comedy'], 
        movie_info['FirstGenre_Crime'],
        movie_info['FirstGenre_Documentary'], 
        movie_info['FirstGenre_Drama'], 
        movie_info['FirstGenre_Fantasy'],
        movie_info['FirstGenre_Film-Noir'], 
        movie_info['FirstGenre_Horror'], 
        movie_info['FirstGenre_Musical'],
        movie_info['FirstGenre_Mystery'], 
        movie_info['FirstGenre_Romance'], 
        movie_info['FirstGenre_Sci-Fi'],
        movie_info['FirstGenre_Thriller'], 
        movie_info['FirstGenre_War'], 
        movie_info['FirstGenre_Western'],
        movie_info['genres_number']
    ]
    
    ranking_data.append(features)

#score all candidates at once, after the loop has collected their features
ranking_scores = ranking_model.predict_proba(ranking_data)[:, 1]

#Let's combine the ranking scores with the movie IDs
ranked_candidates = [(top_candidates[i].iid, ranking_scores[i]) for i in range(len(top_candidates))]

In [88]:
#Step 4

ranked_candidates = sorted(ranked_candidates, key=lambda x: x[1], reverse=True)
final_recommendations = [movie_id for movie_id, score in ranked_candidates]

In [None]:
#Let's look up the movie titles and genres for display
#note: isin() keeps the row order of `movies`, not the ranking order of final_recommendations

recommended_movies = movies[movies['MovieID'].isin(final_recommendations)]

In [106]:
print(f"Top recommendations for user {random_user_id}:")
recommended_movies[['Title', 'Genres']].head(10)

Top recommendations for user 1468:


Unnamed: 0,Title,Genres
476,Jurassic Park (1993),Action|Adventure|Sci-Fi
589,"Silence of the Lambs, The (1991)",Drama|Thriller
735,"Close Shave, A (1995)",Animation|Comedy|Thriller
740,Dr. Strangelove or: How I Learned to Stop Worr...,Sci-Fi|War
898,Some Like It Hot (1959),Comedy|Crime
907,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical
910,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),Film-Noir
911,Citizen Kane (1941),Drama
1022,"Sound of Music, The (1965)",Musical
1081,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
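
Note that `isin` only filters rows and keeps the order of the `movies` table, so the listing above is sorted by index rather than by ranking score, and `head(10)` shows 10 of the 20 candidates. A minimal sketch of a display that preserves the ranking order, using hypothetical IDs, titles, and scores rather than the notebook's data:

```python
# Hypothetical stand-ins: in the notebook the lookup could be built with
# dict(zip(movies['MovieID'], movies['Title']))
title_by_id = {10: 'Movie A', 20: 'Movie B', 30: 'Movie C', 40: 'Movie D'}

# (MovieID, score) pairs already sorted by score, best first
ranked_candidates = [(30, 0.91), (10, 0.85), (40, 0.40)]

# Walk the ranked list and look up each title, so the order follows the scores
ordered_titles = [(title_by_id[movie_id], score) for movie_id, score in ranked_candidates]
for position, (title, score) in enumerate(ordered_titles, start=1):
    print(f'{position}. {title} (score={score:.2f})')
```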


In [107]:
#watched and rated movies

data[data['UserID'] == random_user_id].sort_values(['Rating'], ascending = False).head(10).loc[:, ['Title', 'Genres', 'Rating']]

Unnamed: 0,Title,Genres,Rating
47470,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,5
160388,"Fugitive, The (1993)",Action|Thriller,5
134011,Raiders of the Lost Ark (1981),Action|Adventure,5
141046,"Matrix, The (1999)",Action|Sci-Fi|Thriller,5
489506,"Godfather, The (1972)",Action|Crime|Drama,5
149712,Rocky (1976),Action|Drama,5
155088,Die Hard: With a Vengeance (1995),Action|Thriller,5
416827,"Boat, The (Das Boot) (1981)",Action|Drama|War,5
350083,Face/Off (1997),Action|Sci-Fi|Thriller,5
129073,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War,5


The user's top-rated movies are dominated by the action, adventure, and sci-fi genres. Recommended movies like Jurassic Park, E.T. the Extra-Terrestrial, and Wizard of Oz each share at least one of these genres, which indicates a reasonable match in terms of genre preferences.

Both the recommended and rated movies include classic and highly rated films, suggesting that the user appreciates iconic movies. Recommendations like Citizen Kane and Sound of Music fall into this category.

We can conclude that the recommendations seem well-aligned with the user's historical preferences.
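
One rough way to back this reading with a number is to measure how many of the displayed recommendations share at least one genre with the user's top-rated movies. The sketch below uses the genre strings copied from the two tables above; the helper name `split_genres` and the overlap measure itself are our own assumptions, not part of the evaluation framework.

```python
def split_genres(genres):
    """Turn a 'A|B|C' genre string into a set of genre names."""
    return set(genres.split('|'))


# Genres of the user's ten top-rated movies (from the table above)
top_rated = [
    'Action|Adventure|Fantasy|Sci-Fi', 'Action|Thriller', 'Action|Adventure',
    'Action|Sci-Fi|Thriller', 'Action|Crime|Drama', 'Action|Drama',
    'Action|Thriller', 'Action|Drama|War', 'Action|Sci-Fi|Thriller',
    'Action|Adventure|Drama|Sci-Fi|War',
]
# Genres of the ten displayed recommendations (from the table above)
recommended = [
    'Action|Adventure|Sci-Fi', 'Drama|Thriller', 'Animation|Comedy|Thriller',
    'Sci-Fi|War', 'Comedy|Crime', "Adventure|Children's|Drama|Musical",
    'Film-Noir', 'Drama', 'Musical', "Children's|Drama|Fantasy|Sci-Fi",
]

liked_genres = set().union(*(split_genres(g) for g in top_rated))
matches = [bool(split_genres(g) & liked_genres) for g in recommended]
share = sum(matches) / len(recommended)
print(f'{share:.0%} of the displayed recommendations share a genre with the top-rated movies')
```

Sharing any single genre is a loose criterion, so this check is only a sanity test of the qualitative reading and not a substitute for the ranking metrics.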

### Summary 

In this task, we have developed a two-stage recommendation system that uses user ratings, demographic information, and movie metadata.

The two-stage approach involved:

1. Candidate generation: we used the SVD algorithm from the scikit-surprise library to generate an initial list of candidate movies for each user. This stage focused on reducing the large set of possible recommendations to a manageable number by predicting ratings for all user-movie pairs and selecting the top-N candidates (20 candidates in our case).
2. Ranking: the candidates generated in the first stage were then re-ranked using a Gradient Boosting Classifier. This model took into account additional features such as user demographics, movie genres, and the initial predicted ratings from the SVD model. The goal of this stage was to fine-tune the recommendations by considering more granular factors that influence user preferences.

The performance of the developed system was evaluated using three key metrics: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG). 

The results showed high scores across all three metrics, indicating that the system was effective in providing relevant and well-ranked recommendations.
* Two-stage MAP: 0.6889973280868418
* Two-stage MRR: 0.9665626125664224
* Two-stage NDCG: 0.8749891741359047

Benefits of the system: 
* Scalability: by first narrowing down the potential candidates from a large set of items, the two-stage approach ensures that the more computationally intensive ranking stage is applied only to a smaller, relevant subset. This makes the system more scalable in production.
* Flexibility: the two-stage approach allows for flexibility in model selection and feature engineering. Different algorithms can be used for candidate generation and ranking, and each can be fine-tuned independently to optimize the overall recommendation quality.
* Improved recommendation quality: by combining the strengths of collaborative filtering (in candidate generation) with additional context-aware features (in ranking), the two-stage approach often results in more personalized and accurate recommendations.

Drawbacks: 
* Cold start: the system doesn't fully solve the cold start problem. For new users or items with limited data, it may struggle to make accurate initial selections in the first stage.
* Error propagation: mistakes or biases in the first stage can propagate and amplify in the second stage, potentially leading to suboptimal final recommendations.
* Difficulty in real-time updates: incorporating new user behavior or item information in real-time across both stages can be challenging.
* Complexity: implementing and maintaining a two-stage system is more complex than using a single-stage model. It requires careful integration of different algorithms and may involve additional computational overhead.