In [51]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Import Descriptions

- **Pandas (`pd`)**: Data manipulation and analysis.
- **Pickle**: Serialize the trained model to disk.
- **Tabulate**: Display data in tabular format.
- **Surprise**:
  - `Dataset`, `Reader`: Load and read data for recommendation algorithms.
  - `SVD`: Collaborative filtering using Singular Value Decomposition.
  - `train_test_split`, `accuracy`: Train-test splitting and accuracy evaluation.
- **Optuna**: Hyperparameter optimization.
- **Optuna Visualization**: Visualize optimization results.

In [52]:
import pandas as pd

import pickle

from tabulate import tabulate
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

import optuna
import optuna.visualization as vis



### Data Import
Load the MovieLens 1M users and ratings files, plus the movie metadata, for the recommendation system.

In [53]:
users = pd.read_csv(
    'data/ml-1m/users.dat',
    sep="::",
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
    engine="python",
)

ratings = pd.read_csv(
    'data/ml-1m/ratings.dat',
    sep="::",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
    engine="python",
)

movies = pd.read_csv('data/movies.csv')

Prefix user IDs, age groups, and occupations in the `users` DataFrame, and movie IDs and user IDs in the `ratings` DataFrame, for clarity and consistency. Also convert ratings to floats.

In [54]:
users["user_id"] = users["user_id"].apply(lambda x: f"user_{x}")
users["age_group"] = users["age_group"].apply(lambda x: f"group_{x}")
users["occupation"] = users["occupation"].apply(lambda x: f"occupation_{x}")

ratings["movie_id"] = ratings["movie_id"].apply(lambda x: f"movie_{x}")
ratings["user_id"] = ratings["user_id"].apply(lambda x: f"user_{x}")
ratings["rating"] = ratings["rating"].astype(float)

Prepare the data for the Surprise library by loading the ratings DataFrame and splitting it into training and testing sets:

1. **Create Reader:**
    - Defines the rating scale for the dataset.

2. **Load Data:**
    - Loads the ratings data into a format suitable for Surprise.

3. **Split Data:**
    - Splits the data into training and testing sets with 80% for training and 20% for testing.

In [55]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

Tune the SVD hyperparameters with Optuna:

1. **Objective Function:**
    - **Hyperparameter Suggestions:** Samples `n_factors` (10 to 200), `n_epochs` (10 to 50), `lr_all` (0.001 to 0.2), and `reg_all` (0.01 to 0.2) for each trial.
    - **Algorithm Initialization:** Initializes the SVD algorithm with the suggested hyperparameters.
    - **Training and Testing:** Trains the algorithm on the training set and generates predictions for the test set.
    - **RMSE Calculation:** Computes the RMSE (Root Mean Square Error) of the predictions.
    - **Reporting and Pruning:** Reports the RMSE to Optuna once training has finished and raises `optuna.TrialPruned` if the pruner flags the trial.

2. **Optimization Process:**
    - **Study Creation:** Creates an Optuna study to minimize the RMSE.
    - **Study Optimization:** Runs the optimization for 25 trials using the defined objective function.
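To make the RMSE step concrete, here is a minimal pure-Python sketch of the metric. It is not Surprise's implementation: the helper name `rmse` and the example ratings are assumptions made up for illustration.

```python
import math


def rmse(pairs):
    """Root mean square error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("rmse() needs at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (true, estimated) ratings, invented for this example only.
example = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
print(round(rmse(example), 4))  # prints 0.7071
```

A lower RMSE means the predicted ratings sit closer to the true ratings, which is why the study is set up to minimize it.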

In [56]:
def objective(trial):
    n_factors = trial.suggest_int('n_factors', 10, 200)
    n_epochs = trial.suggest_int('n_epochs', 10, 50)
    lr_all = trial.suggest_float('lr_all', 0.001, 0.2)
    reg_all = trial.suggest_float('reg_all', 0.01, 0.2)
    
    algo = SVD(n_factors=n_factors, n_epochs=n_epochs, lr_all=lr_all, reg_all=reg_all)
    algo.fit(trainset)

    predictions = algo.test(testset)
    rmse = accuracy.rmse(predictions, verbose=False)
    
    trial.report(rmse, step=n_epochs)

    if trial.should_prune():
        raise optuna.TrialPruned()
    
    return rmse

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=25)

[I 2024-07-06 12:52:21,406] A new study created in memory with name: no-name-a88f4bee-6d2c-4085-b901-392f5b138583
[I 2024-07-06 12:52:29,141] Trial 0 finished with value: 0.9093574440842732 and parameters: {'n_factors': 151, 'n_epochs': 18, 'lr_all': 0.029994959873405044, 'reg_all': 0.16424616599530517}. Best is trial 0 with value: 0.9093574440842732.
[I 2024-07-06 12:52:35,119] Trial 1 finished with value: 0.9055079701925798 and parameters: {'n_factors': 51, 'n_epochs': 26, 'lr_all': 0.0021449741536371915, 'reg_all': 0.07198255239590926}. Best is trial 1 with value: 0.9055079701925798.
[I 2024-07-06 12:52:38,246] Trial 2 finished with value: 0.9717008036322241 and parameters: {'n_factors': 27, 'n_epochs': 16, 'lr_all': 0.16168894845105186, 'reg_all': 0.12056752098864361}. Best is trial 1 with value: 0.9055079701925798.
[I 2024-07-06 12:52:52,692] Trial 3 finished with value: 0.8945899906799015 and parameters: {'n_factors': 165, 'n_epochs': 46, 'lr_all': 0.021144088939005826, 'reg_all'

In [62]:
vis.plot_optimization_history(study).show()
vis.plot_param_importances(study).show()
vis.plot_slice(study).show()
vis.plot_parallel_coordinate(study).show()

In [63]:
best_params = study.best_params
print("Best hyperparameters: ", best_params)

best_algo = SVD(n_factors=best_params['n_factors'], n_epochs=best_params['n_epochs'],
                lr_all=best_params['lr_all'], reg_all=best_params['reg_all'])

best_algo.fit(trainset)
predictions = best_algo.test(testset)
rmse = accuracy.rmse(predictions)
print("RMSE with best hyperparameters: ", rmse)

Best hyperparameters:  {'n_factors': 169, 'n_epochs': 20, 'lr_all': 0.022792466492533893, 'reg_all': 0.05452270422523691}
RMSE: 0.8586
RMSE with best hyperparameters:  0.8586390797040614


#### Save the model with the best hyperparameters

In [59]:
filename = 'models/svd_model.pkl'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(best_algo, f)
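To reuse the saved model in a later session, it can be read back with `pickle.load`. The sketch below shows a minimal round trip. `save_model`, `load_model`, and the stand-in dictionary are hypothetical names used only for illustration, and a temporary directory replaces `models/svd_model.pkl` so the example runs anywhere. Only unpickle files from a source you trust.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Pickle `model` to `path`, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be the trained `best_algo`.
stand_in = {"n_factors": 169, "n_epochs": 20}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "models" / "demo.pkl"
    save_model(stand_in, demo_path)
    restored = load_model(demo_path)

print(restored == stand_in)  # prints True
```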

### Detailed Explanation of the Algorithm

The function `recommend_movies_for_two_users` generates movie recommendations for two users based on their past ratings and the predictions made by a trained recommendation algorithm. Here is a step-by-step explanation of how it works:

1. **Get Movies Already Seen by Both Users:**
   - **User1's Seen Movies:** Retrieves the set of movie IDs that user1 has already rated.
   - **User2's Seen Movies:** Retrieves the set of movie IDs that user2 has already rated.
   - **All Seen Movies:** Combines the sets of movies seen by both users to get the total set of movies that either user has seen.

2. **Calculate Number of Movies Seen by Each User:**
   - **Number of Movies Seen by User1:** Counts the number of movies user1 has seen.
   - **Number of Movies Seen by User2:** Counts the number of movies user2 has seen.

3. **Determine Weights Based on Number of Movies Seen:**
   - **Total Movies Seen:** Sums the two users' counts (a movie rated by both users is counted twice).
   - **Weight for User1:** Calculates the weight for user1 based on the proportion of the total movies seen.
   - **Weight for User2:** Calculates the weight for user2 based on the proportion of the total movies seen.
   - **User Influence:** The user who has seen more movies has a greater influence on the final recommendations due to their higher weight.

4. **Predict Scores for All Unseen Movies:**
   - **All Movies:** Retrieves the set of all movie IDs in the dataset.
   - **Unseen Movies:** Identifies the movies that neither user has seen.
   - **Predictions for User1:** Predicts the ratings for all unseen movies for user1 using the trained algorithm.
   - **Predictions for User2:** Predicts the ratings for all unseen movies for user2 using the trained algorithm.

5. **Combine Predictions with Weights:**
   - **Combined Predictions:** For each unseen movie, combines the predictions from both users, weighted by the number of movies each user has seen.

6. **Convert Predictions to DataFrame:**
   - Creates a DataFrame with movie IDs and their combined predicted scores.

7. **Add Columns `averageRating` and `numVotes`:**
   - Merges the prediction DataFrame with the `movies` DataFrame to attach each movie's average rating and number of votes.

8. **Sort by Predicted Score:**
   - Sorts the movies by their combined predicted score in descending order and selects the first `top_n` recommendations.

9. **Get Recommended Movies Information:**
   - Retrieves detailed information about the top recommended movies.
   - Merges the recommended movies with their predicted scores.

10. **Sort Final Recommendations:**
    - Sorts the final recommendations by average rating and number of votes to prioritize highly rated and popular movies.


This algorithm uses collaborative filtering to predict movie ratings and combines the preferences of two users to generate personalized movie recommendations, giving more influence to the user who has seen more movies.
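The weighting in steps 3 and 5 can be sketched on a single movie. This is an illustrative helper, not part of the notebook: `combine_predictions` is a hypothetical name, the rating counts (198 and 129) come from the notebook's output for `user_5` and `user_2`, and the two predicted ratings are made up.

```python
def combine_predictions(pred1, pred2, seen1, seen2):
    """Weight each user's predicted rating by their share of rated movies."""
    total = seen1 + seen2
    if total == 0:
        raise ValueError("at least one user must have rated a movie")
    weight1 = seen1 / total
    weight2 = seen2 / total
    return weight1 * pred1 + weight2 * pred2


# 198 and 129 are the rating counts printed for user_5 and user_2;
# the predicted ratings 4.5 and 3.0 are invented for this example.
score = combine_predictions(4.5, 3.0, 198, 129)
print(round(score, 3))  # prints 3.908
```

Because the weights sum to 1, the combined score always lies between the two individual predictions and is pulled toward the user with more ratings.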

In [60]:
def recommend_movies_for_two_users(user1, user2, ratings, movies, algo, top_n=10):
    # 1. Movies each user has already rated, and their union
    user1_seen_movies = set(ratings[ratings['user_id'] == user1]['movie_id'])
    user2_seen_movies = set(ratings[ratings['user_id'] == user2]['movie_id'])
    all_seen_movies = user1_seen_movies.union(user2_seen_movies)

    # 2. Number of movies rated by each user
    num_movies_seen_user1 = len(user1_seen_movies)
    num_movies_seen_user2 = len(user2_seen_movies)

    # 3. Weights proportional to each user's share of the rated movies
    total_movies_seen = num_movies_seen_user1 + num_movies_seen_user2
    if total_movies_seen == 0:
        # Avoid a division by zero when neither user appears in `ratings`
        raise ValueError(f"Neither {user1} nor {user2} has any ratings")
    weight_user1 = num_movies_seen_user1 / total_movies_seen
    weight_user2 = num_movies_seen_user2 / total_movies_seen

    print(f"{user1} has seen {num_movies_seen_user1} movies, weight: {weight_user1}")
    print(f"{user2} has seen {num_movies_seen_user2} movies, weight: {weight_user2}")

    # 4. Predict ratings for every movie that neither user has seen
    all_movies = set(movies['movie_id'])
    unseen_movies = list(all_movies - all_seen_movies)

    predictions_user1 = [algo.predict(user1, movie_id).est for movie_id in unseen_movies]
    predictions_user2 = [algo.predict(user2, movie_id).est for movie_id in unseen_movies]

    # 5. Weighted combination of the two users' predictions
    combined_predictions = [
        (movie_id, weight_user1 * pred1 + weight_user2 * pred2)
        for movie_id, pred1, pred2 in zip(unseen_movies, predictions_user1, predictions_user2)
    ]

    # 6. Collect the combined scores in a DataFrame
    pred_df = pd.DataFrame(combined_predictions, columns=['movie_id', 'predicted_score'])

    # 7. Attach averageRating and numVotes
    pred_df = pd.merge(pred_df, movies[['movie_id', 'averageRating', 'numVotes']], on='movie_id', how='left')

    # 8. Keep the top_n movies by combined predicted score
    top_recommendations = pred_df.sort_values(by='predicted_score', ascending=False).head(top_n)

    # 9. Fetch full movie information and add the predicted score
    recommended_movies = movies[movies['movie_id'].isin(top_recommendations['movie_id'])]
    recommended_movies = recommended_movies.merge(top_recommendations[['movie_id', 'predicted_score']], on='movie_id')

    # 10. Order the final list by average rating, then number of votes
    return recommended_movies.sort_values(by=['averageRating', 'numVotes'], ascending=False)

In [61]:
headers = ['title', 'year', 'genres', 'isAdult', 'runtimeMinutes', 'averageRating', 'numVotes', 'predicted_score']

users_pairs = [('user_5', 'user_2'), ('user_30', 'user_45')]
for user1, user2 in users_pairs:
    recommendations = recommend_movies_for_two_users(user1, user2, ratings, movies, best_algo, top_n=5)
    print(f'The top 5 movies recommended for {user1} and {user2} are:')
    print(tabulate(recommendations[headers], headers='keys', tablefmt='pretty'))
    print('\n')

user_5 has seen 198 movies, weight: 0.6055045871559633
user_2 has seen 129 movies, weight: 0.3944954128440367
The top 5 movies recommended for user_5 and user_2 are:
+---+--------------------------------------------------------------+------+--------------+---------+----------------+---------------+----------+-------------------+
|   |                            title                             | year |    genres    | isAdult | runtimeMinutes | averageRating | numVotes |  predicted_score  |
+---+--------------------------------------------------------------+------+--------------+---------+----------------+---------------+----------+-------------------+
| 0 | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) | 1954 | Action|Drama |    0    |      207       |      4.3      |  368459  | 4.074780144285222 |
| 1 |                       Some Like It Hot                       | 1959 | Comedy|Crime |    0    |      121       |      4.1      |  284749  | 4.026338557939705 |
| 3 |    