<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PreferredAI/tutorials/blob/master/recommender-systems/10_model_ensembling.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/PreferredAI/tutorials/blob/master/recommender-systems/10_model_ensembling.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# Model Ensembling

This Jupyter Notebook shows how to combine multiple recommendation models using the Cornac library. Ensembling is a technique where we combine predictions from different models to get more accurate results. By using this method, we can improve the performance of a recommendation system.

### What You'll Learn
This tutorial is divided into five parts:

1. **Introduction**.
We’ll start with a simple experiment using the **BPR** and **WMF** models and explore the dataset.
2. **Simple Model Ensembling**.
Learn how to combine BPR and WMF predictions using a method called **Borda Count**.
3. **Further Ensembling**.
Create variations of the WMF model and ensemble their predictions.
4. **Ensembling with Regression Models**.
Use **linear regression** and **random forest regression** from `scikit-learn` to combine WMF models.
5. **Further Evaluation**.
Evaluate the ensemble models to see how they perform compared to individual models.

**Note:** Part of this notebook (in Section 4) uses the `scikit-learn` package.

## 1. Introduction
<a id='introduction'></a>

In this section, we’ll run a basic experiment with the **BPR** (Bayesian Personalized Ranking) and **WMF** (Weighted Matrix Factorization) models to see how they work. We’ll also look at the dataset to understand its structure and distribution.

### 1.1 Install required dependencies

In [1]:
!pip install --quiet cornac==2.3.2

In [2]:
import os
import sys
import logging

# Suppress TensorFlow's C++ log output (INFO, WARNING, and ERROR messages)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
logging.getLogger('tensorflow').setLevel(logging.ERROR)

# Import necessary libraries and functions
from IPython.display import display
import numpy as np
import pandas as pd
from tqdm import tqdm

import cornac
from cornac.datasets import movielens
from cornac.models import BPR, WMF
from cornac.eval_methods import RatioSplit
from cornac.metrics import Precision, Recall
from cornac.utils import cache
from cornac import Experiment

from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor

import tensorflow as tf

print(f"System version: {sys.version}")
print(f"Cornac version: {cornac.__version__}")
print(f"Tensorflow version: {tf.__version__}")

System version: 3.11.12 (main, Apr  9 2025, 08:55:54) [GCC 11.4.0]
Cornac version: 2.3.2
Tensorflow version: 2.18.0


### 1.2 Loading Dataset

First, we load the **MovieLens 100K** dataset.

In [3]:
data = movielens.load_feedback(variant="100K") # Load MovieLens Dataset

rs = RatioSplit(data, test_size=0.2, rating_threshold=4.0, seed=42, verbose=True) # Split into train and test sets (80/20)
train_set, test_set = rs.train_set, rs.test_set

rating_threshold = 4.0
exclude_unknowns = True
---
Training data:
Number of users = 943
Number of items = 1651
Number of ratings = 80000
Max rating = 5.0
Min rating = 1.0
Global mean = 3.5
---
Test data:
Number of users = 943
Number of items = 1651
Number of ratings = 19964
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 943
Total items = 1651


### 1.3 Training BPR and WMF models

We will train two models:

1. **BPR (Bayesian Personalized Ranking)**
2. **WMF (Weighted Matrix Factorization)**

In [4]:
bpr_model = BPR(k=10, max_iter=100, learning_rate=0.01, lambda_reg=0.001, seed=123) # Initialize BPR model
wmf_model = WMF(k=10, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=123) # Initialize WMF model

models = [bpr_model, wmf_model]
metrics = [Precision(k=100), Recall(k=100)] # Set metrics for experiment

experiment = Experiment(rs, models, metrics, user_based=True).run() # Run Experiment to compare BPR model to WMF model individually


TEST:
...
    | Precision@100 | Recall@100 | Train (s) | Test (s)
--- + ------------- + ---------- + --------- + --------
BPR |        0.0706 |     0.6607 |    2.5258 |   7.8199
WMF |        0.0772 |     0.7208 |   30.2204 |   0.3964



On both **Precision@100** and **Recall@100**, **BPR** and **WMF** give comparable results, with WMF slightly ahead (0.0772 vs. 0.0706 precision, 0.7208 vs. 0.6607 recall).
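
To make these two metrics concrete, here is a minimal sketch of how Precision@k and Recall@k can be computed for a single user. The function name `precision_recall_at_k` and the toy item IDs are made up for illustration, and Cornac's own implementation may differ in details such as how unknown items are handled.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Return (precision@k, recall@k) for one user.

    ranked_items: items ordered from most to least recommended.
    relevant_items: set of items the user actually liked in the test set.
    """
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Toy data (hypothetical): six ranked items, three relevant ones
ranked = ["A", "B", "C", "D", "E", "F"]
relevant = {"B", "E", "Z"}

p, r = precision_recall_at_k(ranked, relevant, k=5)
print(f"Precision@5 = {p:.3f}, Recall@5 = {r:.3f}")
```

With `k=100` and a comparatively small number of relevant test items per user, precision is bound to be small while recall can still be high, which is consistent with the pattern in the results table above.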

Let's move on to try to interpret these results by using the genres of movies that were recommended to us.

### 1.4 Interpreting Results

##### 1.4.1 Creating a Movie Genre Dataframe

In [5]:
# Creating a dataframe of movies with its corresponding genres

# Download some information of MovieLens 100K dataset
item_df = pd.read_csv(
  cache("http://files.grouplens.org/datasets/movielens/ml-100k/u.item"),
  sep="|", encoding="ISO-8859-1",
  names=["ItemID", "Title", "Release Date", "Video Release Date", "IMDb URL",
         "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
         "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
         "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]
).set_index("ItemID").drop(columns=["Video Release Date", "IMDb URL", "unknown"])

item_idx2id = train_set.item_ids # mapping between item index and original film ID
user_idx2id = train_set.user_ids # mapping between user index and original user ID

# Let's take a look at an example of this dataframe
display(item_df.head(3))

Data from http://files.grouplens.org/datasets/movielens/ml-100k/u.item
will be cached into /root/.cornac/u.item


0.00B [00:00, ?B/s]

File cached!


| ItemID | Title | Release Date | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|--------|-------|--------------|--------|-----------|-----------|------------|--------|-------|-------------|-------|---------|-----------|--------|---------|---------|---------|--------|----------|-----|---------|
| 1 | Toy Story (1995) | 01-Jan-1995 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | GoldenEye (1995) | 01-Jan-1995 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | Four Rooms (1995) | 01-Jan-1995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


The `item_df` dataframe lists every movie together with its genre flags.

Further below, we filter this table by the recommendations of each model to see which genres the models favor.

##### 1.4.2 Creating Training Data Dataframe

To get a sense of what data has been inserted into our model for training, let's count the genres of the training data used to train the model.

But first, let's create a `training_data_df` dataframe with all training data.

The training data consists of 80,000 (**User Index**, **Item Index**, **Rating**) triplets, matching the dataset summary in Section 1.2.

In [6]:
# Let's view a sample of the training data dataframe
print("Sample row of record:")
print("(user_index, item_index, rating):", list(zip(*train_set.uir_tuple))[0])

# Create a training data dataframe
training_data_df = pd.DataFrame(zip(*train_set.uir_tuple)) # adding all training data into dataframe
training_data_df.columns = ['user_idx', 'item_idx', 'rating'] # adding column names to the data

# Add new column, 'item_id', for further filtering in later sections
training_data_df['item_id'] = training_data_df.apply(lambda row: item_idx2id[int(row['item_idx'])], axis=1) # converted from the item index field

Sample row of record:
(user_index, item_index, rating): (np.int64(0), np.int64(0), np.float64(4.0))


##### 1.4.3 Filtering Training Data

Let's filter based on a particular user to learn more about the user.

We set ``UIDX`` to user index **3**, and ``TOPK`` to **100**, to get the top 100 recommendations in each model for comparison.

In [7]:
# Let's define the user index and top-k movies to be recommended
UIDX = 3
TOPK = 100

# Positively rated items by a user (rating >= 4.0 as rating_threshold used earlier, and user index = UIDX)
positively_rated_items = training_data_df[
    (training_data_df['rating'] >= 4.0) & (training_data_df['user_idx'] == UIDX)
]['item_id'].unique()
filter_df = item_df.loc[[int(item_id) for item_id in positively_rated_items]] # get genres of movie items

print("Number of movies:", len(filter_df)) # Number of movies positively rated by user index 3 in training data

# Group by Movie Genre and Sum by genres
filter_df = filter_df.select_dtypes(np.number).sum()
filter_df = filter_df.to_frame("Sum") # Let's call that column 'Sum'

# Add a new column '%' for the percentage of individual genre sum compared to total sum
filter_df["%"] = filter_df["Sum"] / filter_df["Sum"].sum() * 100
filter_df["%"] = filter_df["%"].round(1)

# Let's see the training data genres, sums and percentages
print("Positively rated movies by user index 3 in training data")
display(filter_df.sort_values("Sum", ascending=False)[:10])

Number of movies: 250
Positively rated movies by user index 3 in training data


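|            | Sum | %    |
|------------|-----|------|
| Drama      | 117 | 22.6 |
| Comedy     | 72  | 13.9 |
| Romance    | 56  | 10.8 |
| Action     | 55  | 10.6 |
| Thriller   | 50  | 9.7  |
| Adventure  | 36  | 6.9  |
| Children's | 23  | 4.4  |
| War        | 20  | 3.9  |
| Crime      | 20  | 3.9  |
| Sci-Fi     | 18  | 3.5  |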


As shown above in the training data, the top genres for user index 3 with positively rated movies include 'Drama', 'Comedy', 'Romance', 'Action' and 'Thriller'.

Let's now compare them to the recommendations of the BPR and WMF models respectively.

##### 1.4.4 Interpreting Recommendations of BPR, WMF Models

In [8]:
# Get the Top 5 Genres in filtered training data for user index 3
top_genres = filter_df.sort_values("Sum", ascending=False).head(5).index.tolist()
print("\nTop 5 Genres in training data:", top_genres)

# Get top K recommendations for BPR and put them into the genre dataframe
bpr_recommendations, bpr_scores = bpr_model.rank(UIDX) # rank all items by score
bpr_recommendations = bpr_recommendations[:TOPK] # limit to top K
bpr_topk = [item_idx2id[iidx] for iidx in bpr_recommendations] # convert item indexes into item ids
bpr_df = item_df.loc[[int(iid) for iid in bpr_topk]] # filter the movie genre dataframe by item ids

# Let's view the top recommendations for BPR by top genres
display("BPR: Top recommendations", bpr_df[["Title"] + top_genres].head(10))

# Now, let's do likewise for WMF - get top K recommendations and put them into the genre dataframe
wmf_recommendations, wmf_scores = wmf_model.rank(UIDX) # rank recommendations by score
wmf_recommendations = wmf_recommendations[:TOPK] # limit to top K
wmf_topk = [item_idx2id[iidx] for iidx in wmf_recommendations] # convert item indexes into item ids
wmf_df = item_df.loc[[int(iid) for iid in wmf_topk]] # filter the movie genre dataframe by item ids

# View the top recommendations for WMF
display("WMF: Top recommendations", wmf_df[["Title"] + top_genres].head(10))


Top 5 Genres in training data: ['Drama', 'Comedy', 'Romance', 'Action', 'Thriller']


'BPR: Top recommendations'

| ItemID | Title | Drama | Comedy | Romance | Action | Thriller |
|--------|-------|-------|--------|---------|--------|----------|
| 781 | French Kiss (1995) | 0 | 1 | 1 | 0 | 0 |
| 294 | Liar Liar (1997) | 0 | 1 | 0 | 0 | 0 |
| 1 | Toy Story (1995) | 0 | 1 | 0 | 0 | 0 |
| 181 | Return of the Jedi (1983) | 0 | 0 | 1 | 1 | 0 |
| 121 | Independence Day (ID4) (1996) | 0 | 0 | 0 | 1 | 0 |
| 100 | Fargo (1996) | 1 | 0 | 0 | 0 | 1 |
| 739 | Pretty Woman (1990) | 0 | 1 | 1 | 0 | 0 |
| 313 | Titanic (1997) | 1 | 0 | 1 | 1 | 0 |
| 402 | Ghost (1990) | 0 | 1 | 1 | 0 | 1 |
| 471 | Courage Under Fire (1996) | 1 | 0 | 0 | 0 | 0 |


'WMF: Top recommendations'

| ItemID | Title | Drama | Comedy | Romance | Action | Thriller |
|--------|-------|-------|--------|---------|--------|----------|
| 313 | Titanic (1997) | 1 | 0 | 1 | 1 | 0 |
| 204 | Back to the Future (1985) | 0 | 1 | 0 | 0 | 0 |
| 8 | Babe (1995) | 1 | 1 | 0 | 0 | 0 |
| 125 | Phenomenon (1996) | 1 | 0 | 1 | 0 | 0 |
| 318 | Schindler's List (1993) | 1 | 0 | 0 | 0 | 0 |
| 15 | Mr. Holland's Opus (1995) | 1 | 0 | 0 | 0 | 0 |
| 655 | Stand by Me (1986) | 1 | 1 | 0 | 0 | 0 |
| 64 | Shawshank Redemption, The (1994) | 1 | 0 | 0 | 0 | 0 |
| 692 | American President, The (1995) | 1 | 1 | 1 | 0 | 0 |
| 732 | Dave (1993) | 0 | 1 | 1 | 0 | 0 |


Now that we have seen the top recommendations of the BPR and WMF models, let's do a comparison by taking a look at the genre distribution.

##### 1.4.5 Comparing Models by Genre Distribution

In [9]:
# Let's introduce `combined_df` for comparison.
# This dataframe will be used to compare models by summing up genres from recommendations of different models
combined_df = pd.DataFrame({
    "Train Data %": filter_df["%"],
    "BPR Sum": bpr_df.select_dtypes(np.number).sum(), # group by genres, then get sum of each genre
    "WMF Sum": wmf_df.select_dtypes(np.number).sum() # likewise for WMF
})

# Share of the top-K recommendations that carry each genre
# (a movie can have several genres, so these percentages need not sum to 100)
combined_df['BPR %'] = combined_df['BPR Sum'] / TOPK * 100
combined_df["WMF %"] = combined_df["WMF Sum"] / TOPK * 100

combined_df = combined_df.round(1) # round all
combined_df = combined_df.sort_values("Train Data %", ascending=False)

# Let's take a look at the genre distribution by percentages
display("Train Data to Recommended % Distribution", combined_df[['BPR %', 'WMF %']][:10])

'Train Data to Recommended % Distribution'

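|            | BPR % | WMF % |
|------------|-------|-------|
| Drama      | 43.0  | 49.0  |
| Comedy     | 31.0  | 40.0  |
| Romance    | 33.0  | 32.0  |
| Action     | 31.0  | 16.0  |
| Thriller   | 29.0  | 12.0  |
| Adventure  | 17.0  | 11.0  |
| Children's | 4.0   | 6.0   |
| War        | 11.0  | 10.0  |
| Crime      | 9.0   | 4.0   |
| Sci-Fi     | 11.0  | 9.0   |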


Now that we have seen the genre distribution of each individual model, let's see how the distribution changes when we ensemble the two.

## 2. Simple Model Ensembling

In this section, we’ll combine the predictions from the **BPR** and **WMF** models using a method called **Borda Count**.

### What is Borda Count?

Borda Count is a simple ranking method that assigns points based on an item’s rank in each model. Higher-ranked items get more points. We then add up the points from all models to get a combined ranking.

**Example**:
1. Each model ranks items from 1 to 5.
2. Items earn points based on their rank (e.g., 1st place gets 4 points, 2nd gets 3 points, etc.).
3. We sum the points for each item across all models.
4. The item with the highest total points becomes the top recommendation.

Here’s a sample ranking for a user:

| Rank | Model 1 | Model 2 | Model 3 | Points (5 - rank) |
|------|---------|---------|---------|-------------------|
| 1    | A       | D       | E       | 4                 |
| 2    | B       | C       | A       | 3                 |
| 3    | C       | A       | B       | 2                 |
| 4    | D       | B       | D       | 1                 |
| 5    | E       | E       | C       | 0                 |

**Borda Count Result**:

| Item | Total Points |
|------|--------------|
| A    | 9            |
| B    | 6            |
| C    | 5            |
| D    | 6            |
| E    | 4            |

**Final Ranking: A > B = D > C > E** (B and D tie with 6 points each)
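
As a quick check of the arithmetic, the toy example above can be reproduced in a few lines of plain Python. This is only a sketch: the helper name `borda_count` is made up for illustration, and ties are broken alphabetically purely so that the output is deterministic.

```python
from collections import defaultdict

# The three toy rankings from the table above (best item first)
rankings = {
    "Model 1": ["A", "B", "C", "D", "E"],
    "Model 2": ["D", "C", "A", "B", "E"],
    "Model 3": ["E", "A", "B", "D", "C"],
}


def borda_count(rankings):
    """Sum (number of items - rank) points for each item across all models."""
    totals = defaultdict(int)
    for ranked_items in rankings.values():
        n_items = len(ranked_items)
        for rank, item in enumerate(ranked_items, start=1):
            totals[item] += n_items - rank
    return dict(totals)


totals = borda_count(rankings)

# Sort by total points (descending); break ties alphabetically
final_ranking = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(final_ranking)
```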

Now, let’s implement this method!

In [10]:
# Let's create a new dataframe to calculate ranking and borda count
rank_df = pd.DataFrame({
    "ItemID": item_idx2id,
})

total_items = len(rank_df) # 1651 items

# Obtain points (inverse of rank) of the items based on the BPR score
rank_df["BPR Score"] = bpr_scores
rank_df["BPR Rank"] = rank_df["BPR Score"].rank(ascending=False).astype(int) # Get Rank where 1 = Top recommendation
rank_df["BPR Points"] = total_items - rank_df["BPR Rank"] # Get points by calculating ('Total Item count' - 'Rank')

# Do likewise for WMF
rank_df["WMF Score"] = wmf_scores
rank_df["WMF Rank"] = rank_df["WMF Score"].rank(ascending=False).astype(int) # Get Rank where 1 = Top recommendation
rank_df["WMF Points"] = total_items - rank_df["WMF Rank"] # Get points by calculating ('Total Item count' - 'Rank')

# Get Borda Count by summing up points of BPR and WMF
rank_df["Borda Count"] = rank_df["BPR Points"] + rank_df["WMF Points"]
rank_df["Borda Rank"] = rank_df["Borda Count"].rank(ascending=False).astype(int) # Get Rank where 1 = Top recommendation

# Round decimal places for readability purposes
rank_df = rank_df.round(3)
rank_df.sort_values("Borda Rank", inplace=True)

# Now let's take a look at the table with Borda Count
display(rank_df[["ItemID", "BPR Rank", "WMF Rank", "Borda Rank"]].head(5))

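|     | ItemID | BPR Rank | WMF Rank | Borda Rank |
|-----|--------|----------|----------|------------|
| 152 | 313    | 8        | 1        | 1          |
| 194 | 739    | 7        | 11       | 2          |
| 425 | 237    | 15       | 18       | 3          |
| 382 | 655    | 27       | 7        | 4          |
| 310 | 692    | 26       | 9        | 5          |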


The top recommendation, **ItemID 313**, was ranked **8th** by BPR and **1st** by WMF. Similarly, the second recommendation, **ItemID 739**, was ranked **7th** by BPR and **11th** by WMF.

This shows how ensembling combines the views of multiple models: items that both models rank highly rise to the top of the ensemble, even when neither model ranks them first.
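
One detail worth knowing about the code above: `Series.rank` uses `method="average"` by default, so tied values receive fractional ranks that `.astype(int)` then truncates. Ties are unlikely for floating-point model scores but more likely for the integer Borda Count totals. The following sketch, using made-up scores, shows the behavior and one possible alternative (`method="first"`), which is not what the notebook uses.

```python
import pandas as pd

# Hypothetical scores with a tie between items "x" and "y"
scores = pd.Series([0.9, 0.7, 0.7, 0.1], index=["w", "x", "y", "z"])

# Default tie handling: tied items share the average of their ranks
average_ranks = scores.rank(ascending=False)
truncated_ranks = average_ranks.astype(int)  # fractional ranks are truncated

# Alternative: break ties by order of appearance, giving unique ranks
first_ranks = scores.rank(ascending=False, method="first").astype(int)

print(pd.DataFrame({
    "score": scores,
    "average": average_ranks,
    "truncated": truncated_ranks,
    "first": first_ranks,
}))
```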

---

Next, we’ll incorporate the recommendations into the genre distribution dataframe to compare their performance against the individual base models.

In [11]:
UIDX = 3
TOPK = 100

borda_count_topk = rank_df["ItemID"].values[:TOPK] # Get top K (100) Item IDs

borda_df = item_df.loc[[int(i) for i in borda_count_topk]] # Filter genre data frame by the top item IDs

# Add Borda Count results into 'combined_df' dataframe for comparison
combined_df["Borda Count Sum"] = borda_df.select_dtypes(np.number).sum() # group by genre, and calculate sum of each genre
combined_df["BPR + WMF Borda Count %"] = combined_df["Borda Count Sum"] / TOPK * 100 # Calculate percentage of sum to total
combined_df["BPR + WMF Borda Count %"] = combined_df["BPR + WMF Borda Count %"].round(1) # rounding for readability purposes

# Let's take a look at the genre distribution of BPR, WMF and the newly added Borda Count
display("BPR + WMF Borda Count Recommendations Distribution", combined_df[["BPR %", "WMF %", "BPR + WMF Borda Count %"]][:10])

'BPR + WMF Borda Count Recommendations Distribution'

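|            | BPR % | WMF % | BPR + WMF Borda Count % |
|------------|-------|-------|-------------------------|
| Drama      | 43.0  | 49.0  | 51.0                    |
| Comedy     | 31.0  | 40.0  | 32.0                    |
| Romance    | 33.0  | 32.0  | 35.0                    |
| Action     | 31.0  | 16.0  | 25.0                    |
| Thriller   | 29.0  | 12.0  | 22.0                    |
| Adventure  | 17.0  | 11.0  | 15.0                    |
| Children's | 4.0   | 6.0   | 4.0                     |
| War        | 11.0  | 10.0  | 13.0                    |
| Crime      | 9.0   | 4.0   | 6.0                     |
| Sci-Fi     | 11.0  | 9.0   | 8.0                     |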


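As Borda Count combines the BPR and WMF models, its distribution is influenced by both. For several genres (Comedy, Action, Thriller, Adventure, Crime) the Borda Count share falls between the BPR and WMF shares, although for others (Drama, Romance, War) it exceeds both.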

In the next section, we will further add more models to the ensemble.

## 3. Further Ensembling

In this step, we enhance our ensemble by creating variations of the **WMF** model and combining their predictions using Borda Count. Each variation introduces slight adjustments, such as changes in parameters, to capture different perspectives of the dataset.

Think of choosing a movie with friends, each with slightly different tastes: by considering everyone's preferences, you reach a decision that suits the group. As with any statistical learning model, the predictions of a WMF model vary with the random seed and hyperparameters used during training. Ensembling several such variations can reduce this variance and give a more balanced and robust recommendation.
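
The variance-reduction argument can be illustrated with a small simulation. This is a sketch on synthetic numbers, not on the WMF models themselves: it assumes nine hypothetical models whose errors are independent, which is an idealization. Real WMF variants trained on the same data are correlated, so the reduction in practice would be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 3.5              # hypothetical "true" preference score
n_models, n_trials = 9, 10_000

# Each column is one simulated model: the true score plus independent noise
predictions = true_score + rng.normal(0.0, 0.5, size=(n_trials, n_models))

single_var = predictions[:, 0].var()           # variance of one model
ensemble_var = predictions.mean(axis=1).var()  # variance of the averaged ensemble

print(f"Single model variance:   {single_var:.4f}")
print(f"Ensemble (mean of {n_models}):   {ensemble_var:.4f}")
```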

### Approach:
1. **Different Random Seeds**:  
   Train multiple models with different random seeds (e.g., `seed=123`). This variation captures different nuances, as some models may perform better for certain users than others.
   
2. **Varying Number of Latent Factors**:  
   Adjust the number of latent factors (`k`). By changing `k`, the models can capture diverse aspects of the data, providing a broader view of the underlying patterns.

Let’s implement this by training several WMF models with different seeds and latent factor values, then ensemble them using Borda Count to improve the overall recommendation performance.

In [12]:
# WMF models with different seeds
wmf_model_123 = WMF(name="WMF_123", k=10, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=123)
wmf_model_456 = WMF(name="WMF_456", k=10, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=456)
wmf_model_789 = WMF(name="WMF_789", k=10, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=789)
wmf_model_888 = WMF(name="WMF_888", k=10, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=888)
wmf_model_999 = WMF(name="WMF_999", k=10, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=999)
# WMF models with different number of latent factors
wmf_model_k20 = WMF(name="WMF_k20", k=20, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=123)
wmf_model_k30 = WMF(name="WMF_k30", k=30, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=123)
wmf_model_k40 = WMF(name="WMF_k40", k=40, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=123)
wmf_model_k50 = WMF(name="WMF_k50", k=50, max_iter=300, a=1.0, b=0.1, learning_rate=0.001, lambda_u=0.01, lambda_v=0.01, seed=123)

models = [wmf_model_123, wmf_model_456, wmf_model_789, wmf_model_888, wmf_model_999, wmf_model_k20, wmf_model_k30, wmf_model_k40, wmf_model_k50]

metrics = [Precision(k=100), Recall(k=100)] # The same metrics as before

# Let's run an experiment to see how much these models differ when only the random seed or the number of latent factors changes!
experiment = Experiment(rs, models, metrics, user_based=True).run()


TEST:
...
        | Precision@100 | Recall@100 | Train (s) | Test (s)
------- + ------------- + ---------- + --------- + --------
WMF_123 |        0.0772 |     0.7208 |   19.1655 |   0.3843
WMF_456 |        0.0756 |     0.7119 |   19.9272 |   0.3958
WMF_789 |        0.0772 |     0.7229 |   18.3668 |   0.3777
WMF_888 |        0.0769 |     0.7172 |   20.4630 |   0.3775
WMF_999 |        0.0772 |     0.7190 |   20.0061 |   0.5143
WMF_k20 |        0.0777 |     0.7294 |   19.7682 |   0.4365
WMF_k30 |        0.0749 |     0.7069 |   21.7206 |   0.4034
WMF_k40 |        0.0723 |     0.6863 |   22.5502 |   0.3932
WMF_k50 |        0.0707 |     0.6767 |   24.6839 |   0.4126



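The results show that the same model can perform differently depending on its seed and number of latent factors: Recall@100 ranges from 0.7119 to 0.7229 across the five seeds, and from 0.6767 (`k=50`) to 0.7294 (`k=20`) across the different values of `k`.

Let's try ensembling all these models into a single model with Borda Count, and look at its recommendations.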

In [13]:
# Let's create a different dataframe to calculate ranking and borda count
rank_2_df = pd.DataFrame({
    "ItemID": item_idx2id,
})

# Add a column named 'WMF Family Borda Count' to accumulate the points
rank_2_df["WMF Family Borda Count"] = 0

# Calculate the points (inverse of rank) for each of the models and accumulate them into the 'WMF Family Borda Count' column
# We use the same formula as the 'Borda Count' calculation
for model in models:
    name = model.name
    recommendations, scores = model.rank(UIDX)
    rank_2_df[name + "_score"] = scores
    rank_2_df[name + "_rank"] = rank_2_df[name + "_score"].rank(ascending=False).astype(int)
    rank_2_df[name + "_points"] = total_items - rank_2_df[name + "_rank"]
    rank_2_df["WMF Family Borda Count"] = rank_2_df["WMF Family Borda Count"] + rank_2_df[name + "_points"]

# Let's sort and view the top recommendations!
display("Top 10 Recommendations for WMF Borda Count", rank_2_df[["ItemID", "WMF Family Borda Count"]].sort_values("WMF Family Borda Count", ascending=False).head(10))

'Top 10 Recommendations for WMF Borda Count'

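|     | ItemID | WMF Family Borda Count |
|-----|--------|------------------------|
| 37  | 318    | 14757                  |
| 152 | 313    | 14708                  |
| 197 | 191    | 14660                  |
| 132 | 272    | 14637                  |
| 156 | 64     | 14622                  |
| 61  | 204    | 14605                  |
| 279 | 402    | 14595                  |
| 305 | 181    | 14584                  |
| 405 | 22     | 14578                  |
| 604 | 215    | 14544                  |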
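
The loop above handles one model at a time with pandas. As an alternative sketch, the same points formula (number of items minus rank) can be applied to a whole matrix of scores with NumPy. The function name `borda_from_scores` and the toy scores are hypothetical, and unlike the pandas version, ties here are broken by item position rather than averaged.

```python
import numpy as np


def borda_from_scores(score_matrix):
    """Total Borda points per item.

    score_matrix: array-like of shape (n_models, n_items); higher = better.
    """
    scores = np.asarray(score_matrix, dtype=float)
    n_models, n_items = scores.shape

    # Item indices ordered from best to worst for every model
    order = np.argsort(-scores, axis=1, kind="stable")

    # Convert the ordering into 1-based ranks (1 = top recommendation)
    ranks = np.empty_like(order)
    rows = np.arange(n_models)[:, None]
    ranks[rows, order] = np.arange(1, n_items + 1)

    points = n_items - ranks
    return points.sum(axis=0)


# Toy scores (hypothetical): 3 models x 3 items
toy_scores = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.6],
    [0.7, 0.3, 0.4],
]
toy_totals = borda_from_scores(toy_scores)
print(toy_totals)
```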


In [14]:
# Now, let's add them to the combined dataframe for comparison with earlier models
wmf_borda_count_topk = rank_2_df.sort_values("WMF Family Borda Count", ascending=False)["ItemID"].values[:TOPK]
wmf_borda_df = item_df.loc[[int(i) for i in wmf_borda_count_topk]]

combined_df["WMF Family Borda Count Sum"] = wmf_borda_df.select_dtypes(np.number).sum()
combined_df["WMF Family Borda Count %"] = combined_df["WMF Family Borda Count Sum"] / TOPK * 100
combined_df["WMF Family Borda Count %"] = combined_df["WMF Family Borda Count %"].round(1)

# Let's compare the recommendation distribution
display("Combined Recommendations Distribution", combined_df[["WMF %", "BPR + WMF Borda Count %", "WMF Family Borda Count %"]][:10])

'Combined Recommendations Distribution'

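|            | WMF % | BPR + WMF Borda Count % | WMF Family Borda Count % |
|------------|-------|-------------------------|--------------------------|
| Drama      | 49.0  | 51.0                    | 53.0                     |
| Comedy     | 40.0  | 32.0                    | 29.0                     |
| Romance    | 32.0  | 35.0                    | 33.0                     |
| Action     | 16.0  | 25.0                    | 21.0                     |
| Thriller   | 12.0  | 22.0                    | 18.0                     |
| Adventure  | 11.0  | 15.0                    | 14.0                     |
| Children's | 6.0   | 4.0                     | 8.0                      |
| War        | 10.0  | 13.0                    | 11.0                     |
| Crime      | 4.0   | 6.0                     | 8.0                      |
| Sci-Fi     | 9.0   | 8.0                     | 10.0                     |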


Comparing the results of the WMF Borda Count model, we can see that the different random seed initializations, along with the different number of latent factors, have influenced the recommendations.

-------

Now that we have covered Borda Count, let's see how other methods and popular packages such as **scikit-learn** can be used for more advanced model ensembling.

## 4. Ensembling with Regression Models

In this step, we’ll explore ensembling using **linear regression** and **random forest regression**. These methods allow us to model the relationship between the predictions of multiple models and the actual outcomes, resulting in a more adaptive and accurate ensemble.

### Why Use Regression Models?

- **Linear Regression**:  
  A simple and interpretable approach, best suited when the relationship between model predictions and true values is linear.
  
- **Random Forest Regression**:  
  A more flexible method that captures non-linear relationships and complex interactions, making it well-suited for diverse and intricate datasets.

These regression-based methods go beyond basic averaging by adapting to patterns in the data, potentially improving prediction accuracy.

### Approach

This process can be seen as a meta-learning problem. Here’s how it works:

We use the predictions of the base models (the WMF variations) as features and train a meta-learner (such as Linear Regression or Random Forest) on them to make the final prediction. This framework leaves room to experiment with various machine learning models as the meta-learner, including Linear Regression, Random Forest, Gradient Boosting, or even Neural Networks.
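
Before working with the real WMF scores, the idea can be sketched on synthetic data. In the example below, three hypothetical base models produce noisy versions of a made-up "true" rating, and a least-squares fit without an intercept (computed with `np.linalg.lstsq`, which solves the same problem as a linear regression with no intercept) learns one weight per base model. All numbers and names here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 500

# Made-up ground truth and three hypothetical base models with increasing noise
true_rating = rng.uniform(1.0, 5.0, size=n_samples)
base_scores = np.column_stack([
    true_rating + rng.normal(0.0, 0.3, n_samples),
    true_rating + rng.normal(0.0, 0.6, n_samples),
    true_rating + rng.normal(0.0, 1.0, n_samples),
])

# Meta-learner: one weight per base model, no intercept
weights, *_ = np.linalg.lstsq(base_scores, true_rating, rcond=None)
stacked = base_scores @ weights


def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


print("Weights:", np.round(weights, 3))
for j in range(base_scores.shape[1]):
    print(f"Base model {j} RMSE: {rmse(base_scores[:, j], true_rating):.3f}")
print(f"Stacked RMSE:      {rmse(stacked, true_rating):.3f}")
```

On the data it was fitted on, the stacked prediction can never have a higher squared error than any single base model, because using one base model alone is just a special case of the weights. Whether that advantage carries over to unseen data has to be checked on a held-out set.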

Let’s begin by training a **Linear Regression** model to combine the predictions from the WMF variations.

##### 4.1 Prepare Data

In [15]:
# First, let's create training and inference dataframes
training_df = pd.DataFrame(zip(*train_set.uir_tuple)) # Add 'User Index', 'Item Index', 'Rating' triples as records in dataframe
training_df.columns = ['user_idx', 'item_idx', 'rating'] # Set column names

# Get all possible user_index, item_index combinations, add them into dataframe for inference
all_df = pd.DataFrame({
    "user_idx": [user_idx for user_idx in range(train_set.num_users) for _ in range(train_set.num_items)],
    "item_idx": [item_idx for _ in range(train_set.num_users) for item_idx in range(train_set.num_items)],
})
all_df['item_id'] = all_df.apply(lambda row: item_idx2id[int(row['item_idx'])], axis=1) # Add 'Item ID' column into dataframe by converting 'Item Index' to 'Item ID'

# Let's get all the scores for the models trained in Part 3.
models = [wmf_model_123, wmf_model_456, wmf_model_789, wmf_model_888, wmf_model_999, wmf_model_k20, wmf_model_k30, wmf_model_k40, wmf_model_k50]

# For each model, add its predicted scores to the training and inference dataframes
for model in tqdm(models):
    name = model.name

    # Group by user_idx and apply score function to each group
    def score_items(group):
        return pd.Series(model.score(int(group.name))[group['item_idx'].values], index=group.index)

    training_df[name + "_score"] = training_df.groupby("user_idx").apply(score_items, include_groups=False).reset_index(level=0, drop=True) # for training
    all_df[name + "_score"] = all_df.groupby("user_idx").apply(score_items, include_groups=False).reset_index(level=0, drop=True) # for inference

# Let's pick out the 9 features - predicted scores from the 9 models trained
feature_cols = [model.name + "_score" for model in models]
X_train = training_df[feature_cols] # use these predicted scores as features
y_train = training_df['rating'] # use ground truth ratings as the regression target
X_inference = all_df[feature_cols] # all user-item pairs, used to predict values for ranking

display("Training features", X_train.head(3)) # predicted scores used as features
display("Target values", y_train.head(3)) # ground truth ratings
display("Inference Data", X_inference.head(3)) # all inference data

100%|██████████| 9/9 [00:06<00:00,  1.33it/s]


'Training features'

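|   | WMF_123_score | WMF_456_score | WMF_789_score | WMF_888_score | WMF_999_score | WMF_k20_score | WMF_k30_score | WMF_k40_score | WMF_k50_score |
|---|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| 0 | 2.110019 | 2.071391 | 1.903634 | 2.302152 | 3.117326 | 2.806854 | 3.366545 | 4.248308 | 3.842377 |
| 1 | 2.791607 | 2.692413 | 2.42141 | 2.479053 | 2.736743 | 2.779775 | 2.640032 | 2.263458 | 2.301732 |
| 2 | 3.750998 | 3.385033 | 3.542022 | 3.761468 | 3.728116 | 4.09543 | 3.427522 | 3.495448 | 3.09535 |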


'Target values'

|   | rating |
|---|--------|
| 0 | 4.0    |
| 1 | 3.0    |
| 2 | 4.0    |


'Inference Data'

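|   | WMF_123_score | WMF_456_score | WMF_789_score | WMF_888_score | WMF_999_score | WMF_k20_score | WMF_k30_score | WMF_k40_score | WMF_k50_score |
|---|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| 0 | 2.110019 | 2.071391 | 1.903634 | 2.302152 | 3.117326 | 2.806854 | 3.366545 | 4.248308 | 3.842377 |
| 1 | 0.807391 | 1.295641 | 0.918384 | 0.553748 | 0.58848 | -0.075686 | -0.365777 | -0.918214 | 0.783866 |
| 2 | 1.648536 | 1.456618 | 1.591769 | 1.272854 | 1.677722 | 2.479369 | 2.03364 | 2.326872 | 0.942713 |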
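
The cell above enumerates every (user, item) pair with nested list comprehensions and maps item indices to IDs with a row-wise `apply`. As an alternative sketch, the same pairs can be built with `np.repeat` and `np.tile`, and the ID lookup can be done with array indexing. The sizes and item IDs below are hypothetical stand-ins for `train_set.num_users`, `train_set.num_items`, and `item_idx2id`.

```python
import numpy as np
import pandas as pd

# Hypothetical small sizes and item IDs for illustration
num_users, num_items = 3, 4
item_ids = np.array(["10", "20", "30", "40"])

pairs_df = pd.DataFrame({
    # Each user index repeated once per item: 0,0,0,0,1,1,1,1,...
    "user_idx": np.repeat(np.arange(num_users), num_items),
    # The full item index range repeated once per user: 0,1,2,3,0,1,2,3,...
    "item_idx": np.tile(np.arange(num_items), num_users),
})

# Vectorized lookup from item index to item ID
pairs_df["item_id"] = item_ids[pairs_df["item_idx"].to_numpy()]

# Reference: the list-comprehension construction used in the notebook
expected_users = [u for u in range(num_users) for _ in range(num_items)]
expected_items = [i for _ in range(num_users) for i in range(num_items)]

print(pairs_df.head(6))
```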


Now that the data is prepared for a **scikit-learn** model, let's first train a Linear Regression model.

##### 4.2 Fitting Linear Regression Model

In [16]:
UIDX = 3
TOPK = 100

# Let's now fit into a Linear Regression model
regr = linear_model.LinearRegression(fit_intercept=False) # force model to only use predictions from WMF models
regr.fit(X_train, y_train) # train the model

# Input: 9 base model predicted ratings. Output: final predicted rating based on linear regression
y_pred = regr.predict(X_inference) # Get predictions based on trained model

all_df["WMF Linear Regression"] = y_pred # create a column in `all_df` for the predictions

# Get the top-K items for the user from the predictions
sorted_df = all_df.sort_values("WMF Linear Regression", ascending=False) # sort by predicted ratings
top_item_ids = sorted_df[sorted_df['user_idx'] == UIDX]['item_id'].values[:TOPK] # keep the top K items for this user (TOPK = 100 in this cell)

# Place them into the comparison distribution dataframe
linear_regression_df = item_df.loc[[int(i) for i in top_item_ids]] # look up the genre flags of the recommended items
combined_df["WMF Linear Regression Sum"] = linear_regression_df.select_dtypes(np.number).sum() # count recommended items per genre
combined_df["WMF Linear Regression %"] = combined_df["WMF Linear Regression Sum"] / TOPK * 100 # share of the top-K items tagged with each genre

combined_df["WMF Linear Regression %"] = combined_df["WMF Linear Regression %"].round(1) # round values for readability

print("Coefficients of the linear regression model")
print(regr.coef_) # coefficients of the linear regression model
print(regr.intercept_) # intercept of the linear regression model

Coefficients of the linear regression model
[-0.03614638  0.0501909  -0.09003779 -0.05209999 -0.1257568  -0.10536644
  0.13963693  0.48313987  0.84896576]
0.0


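The coefficients of the Linear Regression model indicate how much each base model contributes to the ensemble. Here the largest weights go to `WMF_k50_score` (about 0.85) and `WMF_k40_score` (about 0.48), while all other weights are at most about 0.14 in magnitude.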

We have successfully trained a **Linear Regression** model using the predictions from the 9 WMF base models, which included variations with different seeds and latent factors.
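
Because the model was fitted with `fit_intercept=False`, each ensembled score is just a weighted sum of the nine base-model scores. The sketch below reproduces that sum by hand for the first inference row, using the coefficients and feature values copied from the outputs shown above. The variable names (`feature_names`, `coefs`, `first_row`) are illustrative and not part of the notebook.

```python
# Feature names in the same column order as X_train / X_inference
feature_names = [
    "WMF_123_score", "WMF_456_score", "WMF_789_score", "WMF_888_score", "WMF_999_score",
    "WMF_k20_score", "WMF_k30_score", "WMF_k40_score", "WMF_k50_score",
]

# Coefficients copied from the printed `regr.coef_` output
coefs = [
    -0.03614638, 0.0501909, -0.09003779, -0.05209999, -0.1257568,
    -0.10536644, 0.13963693, 0.48313987, 0.84896576,
]

# First row of the inference data, copied from the displayed output
first_row = [
    2.110019, 2.071391, 1.903634, 2.302152, 3.117326,
    2.806854, 3.366545, 4.248308, 3.842377,
]

# With no intercept, the prediction is the dot product of weights and base scores
prediction = sum(w * x for w, x in zip(coefs, first_row))

# Rank the base models by the magnitude of their weight
ranked = sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)

for name, weight in ranked[:3]:
    print(f"{name}: {weight:+.3f}")
print(f"Ensembled score for the first row: {prediction:.2f}")
```

Pairing `regr.coef_` with the column names in this way makes it easier to see which base models dominate the ensemble.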

Next, let's proceed to train a **Random Forest Regressor** model.

##### 4.3 Fitting the Random Forest Model

We will use the same training data to fit a **Random Forest Regressor** model.

While we are using a Random Forest in this example, we also have the option to experiment with other models, such as Gradient Boosting and others, to see how they perform.
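
As a rough sketch of how such a swap might look, the block below fits scikit-learn's `GradientBoostingRegressor` on synthetic data that stands in for `X_train` and `y_train`. The synthetic arrays (`X_demo`, `y_demo`) and the hyperparameter values are assumptions chosen for illustration only; they are not tuned and have not been evaluated on the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for the training data: 9 noisy "base model" scores per row
true_rating = rng.uniform(1.0, 5.0, size=500)
X_demo = true_rating[:, None] + rng.normal(0.0, 0.5, size=(500, 9))
y_demo = true_rating

# Same fit / predict interface as RandomForestRegressor
gb_model = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=42)
gb_model.fit(X_demo, y_demo)

# Predict ensembled scores for a few rows
y_demo_pred = gb_model.predict(X_demo[:5])
print(y_demo_pred.round(2))

# Relative importance of each base-model feature (normalised to sum to 1)
print(gb_model.feature_importances_.round(3))
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and predictions would be made on `X_inference`.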

In [17]:
UIDX = 3
TOPK = 100

# Let's now train a Random Forest model
randomforest_model = RandomForestRegressor(n_estimators=50, max_depth=2, random_state=42)
randomforest_model.fit(X_train, y_train) # Train the model

# Input: 9 base model predicted ratings. Output: final predicted rating based on random forest
y_pred = randomforest_model.predict(X_inference)

all_df["WMF Random Forest"] = y_pred # create a column in `all_df` for the predictions

# Get the top-K items for the user from the predictions
sorted_df = all_df.sort_values("WMF Random Forest", ascending=False) # sort by predicted ratings
top_item_ids = sorted_df[sorted_df['user_idx'] == UIDX]['item_id'].values[:TOPK] # keep the top K items for this user (TOPK = 100 in this cell)

# Place them into the comparison distribution dataframe
random_forest_df = item_df.loc[[int(i) for i in top_item_ids]] # look up the genre flags of the recommended items
combined_df["WMF Random Forest Sum"] = random_forest_df.select_dtypes(np.number).sum() # count recommended items per genre
combined_df["WMF Random Forest %"] = combined_df["WMF Random Forest Sum"] / TOPK * 100 # share of the top-K items tagged with each genre

combined_df["WMF Random Forest %"] = combined_df["WMF Random Forest %"].round(1) # round values for readability

# Now let's take a look at how the genre distribution is
display("Combined Recommendations Distribution", combined_df[["WMF %", "WMF Family Borda Count %", "WMF Linear Regression %", "WMF Random Forest %"]][:10])

'Combined Recommendations Distribution'

Unnamed: 0,WMF %,WMF Family Borda Count %,WMF Linear Regression %,WMF Random Forest %
Drama,49.0,53.0,59.0,38.0
Comedy,40.0,29.0,24.0,34.0
Romance,32.0,33.0,24.0,30.0
Action,16.0,21.0,20.0,32.0
Thriller,12.0,18.0,23.0,24.0
Adventure,11.0,14.0,11.0,17.0
Children's,6.0,8.0,4.0,3.0
War,10.0,11.0,8.0,12.0
Crime,4.0,8.0,9.0,8.0
Sci-Fi,9.0,10.0,6.0,15.0




We have also successfully trained a **Random Forest Regressor** model using the predictions from the 9 WMF base models, which included variations with different seeds and latent factors.

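The genre distributions differ noticeably between the ensembles. For this user, Linear Regression leans further towards Drama (59%), whereas Random Forest recommends less Drama (38%) and more Action (32%) and Sci-Fi (15%). This indicates that the ensemble models leveraged the base model predictions in different ways to generate the final predictions.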
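
Each percentage in the table is the share of the user's top-K items tagged with that genre. Since a movie can carry several genres, a column can add up to more than 100%. The toy sketch below illustrates the calculation with made-up items and genre flags; none of the values come from the dataset.

```python
# Hypothetical top-4 recommendations with one-hot genre flags (not real data)
top_items = [
    {"Drama": 1, "Romance": 1, "Comedy": 0},
    {"Drama": 1, "Romance": 0, "Comedy": 0},
    {"Drama": 0, "Romance": 1, "Comedy": 1},
    {"Drama": 1, "Romance": 0, "Comedy": 1},
]
top_k = len(top_items)
genres = ["Drama", "Romance", "Comedy"]

# Share of the top-K items tagged with each genre, as a percentage
genre_pct = {
    genre: round(sum(item[genre] for item in top_items) / top_k * 100, 1)
    for genre in genres
}

print(genre_pct)
print("Column total:", sum(genre_pct.values()))
```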

---

In the next section, we will compare the results of the various models to evaluate their performance.

## 5. Further Evaluation

At the beginning, we split the dataset into training and testing sets. Now, we will evaluate the performance of the ensemble models using **Precision@100** and **Recall@100** metrics.

We will use the test set to evaluate the models.

### 5.1 Preparing the Data

In [18]:
rank_df = pd.DataFrame({
    "user_idx": all_df["user_idx"],
    "item_idx": all_df["item_idx"],
})

total_items = train_set.num_items # 1651 items

models_to_calculate = [bpr_model, wmf_model, wmf_model_123, wmf_model_456, wmf_model_789, wmf_model_888, wmf_model_999, wmf_model_k20, wmf_model_k30, wmf_model_k40, wmf_model_k50]

# Calculate points for each model using the Borda count process.
# Take note that points should be calculated on a per user basis.
for model in tqdm(models_to_calculate):
    name = model.name

    # Group by user_idx and apply score function to each group
    def score_items(group):
        return pd.Series(model.score(int(group.name))[group['item_idx'].values], index=group.index)

    rank_df[name + "_score"] = rank_df.groupby("user_idx").apply(score_items, include_groups=False).reset_index(level=0, drop=True)

    # Calculate ranks and points for all users at once
    rank_df[name + "_rank"] = rank_df.groupby("user_idx")[name + "_score"].rank(ascending=False, method='min').astype(int)
    rank_df[name + "_points"] = total_items - rank_df[name + "_rank"] + 1

100%|██████████| 11/11 [00:14<00:00,  1.36s/it]


This is how you calculate Borda Count scores for all users.

Once we have the scores calculated, we will sum them up according to the Borda Count formula outlined in Sections 2 and 3.
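
To make the per-user computation concrete, the sketch below traces it on a toy example with made-up scores from two hypothetical models over four items. Ties receive the smallest rank, mirroring `method='min'`, and points are `total_items - rank + 1`. Here `total_items` is simply the number of toy items, whereas the notebook uses `train_set.num_items`.

```python
def borda_points(scores):
    """Return Borda points per item; the best score gets len(scores) points."""
    total_items = len(scores)
    points = []
    for s in scores:
        # 'min' ranking: rank is 1 plus the number of strictly higher scores
        rank = 1 + sum(other > s for other in scores)
        points.append(total_items - rank + 1)
    return points


# Hypothetical scores from two models for the same user over four items
model_a_scores = [4.2, 3.1, 3.1, 1.0]
model_b_scores = [2.0, 4.5, 3.9, 0.5]

points_a = borda_points(model_a_scores)
points_b = borda_points(model_b_scores)

# The ensemble score of an item is the sum of its points across models
ensemble_points = [a + b for a, b in zip(points_a, points_b)]

# Item positions ordered by total points, highest first
ranking = sorted(range(len(ensemble_points)), key=lambda i: ensemble_points[i], reverse=True)

print(points_a, points_b, ensemble_points, ranking)
```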

**BPR + WMF Borda Count**:  
To clarify, our basic Borda Count model includes the **BPR Model** and the **WMF Model**.

**WMF Family Borda Count**:  
The `WMF Family Borda Count` model, on the other hand, consists of multiple variations:
- Models initialized with different random seeds: **wmf_model_123**, **wmf_model_456**, **wmf_model_789**, **wmf_model_888**, and **wmf_model_999**.
- Models with different latent factors: **wmf_model_k20**, **wmf_model_k30**, **wmf_model_k40**, and **wmf_model_k50**.

In [19]:
borda_count_models = [bpr_model, wmf_model]
rank_df["BPR + WMF Borda Count"] = rank_df[[model.name + "_points" for model in borda_count_models]].sum(axis=1) # Sum up points of BPR and WMF

wmf_borda_count_models = [wmf_model_123, wmf_model_456, wmf_model_789, wmf_model_888, wmf_model_999, wmf_model_k20, wmf_model_k30, wmf_model_k40, wmf_model_k50]
rank_df["WMF Family Borda Count"] = rank_df[[model.name + "_points" for model in wmf_borda_count_models]].sum(axis=1) # Sum up points of all WMF models

# Now, let's add them into the `all_df` dataframe for comparison
all_df.sort_values(by=["user_idx", "item_idx"], inplace=True) # ensure that the dataframe is sorted by user index and item index

all_df["BPR_score"] = rank_df["BPR_score"].values
all_df["WMF_score"] = rank_df["WMF_score"].values

all_df["BPR + WMF Borda Count"] = rank_df["BPR + WMF Borda Count"].values
all_df["WMF Family Borda Count"] = rank_df["WMF Family Borda Count"].values

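Now that we have all model scores in the same table, let's calculate the same **Precision@K** and **Recall@K** values as in the earlier experiments.

We do this manually. For each test user, precision is the number of relevant items among the top K divided by the number of recommended items, and recall is the same count divided by the number of relevant items in the test set.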
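
The two formulas can be written as a small helper. This is a minimal sketch for a single user with made-up item indices; the function name `precision_recall_at_k` is illustrative and not part of Cornac.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Compute Precision@K and Recall@K for one user.

    ranked_items: item indices sorted by predicted score, best first.
    relevant_items: ground-truth items for the user in the test set.
    """
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Hypothetical ranking and ground truth for one user
ranked = [10, 4, 7, 2, 9]
relevant = {4, 9, 30}

precision, recall = precision_recall_at_k(ranked, relevant, k=4)
print(f"Precision@4 = {precision:.2f}, Recall@4 = {recall:.2f}")
```

Averaging these per-user values over all test users gives the figures reported in the tables below.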

### 5.2 Results for Borda Count of BPR and WMF

We calculate the **Precision@100** and **Recall@100** values for the BPR + WMF Borda Count model, which combines the BPR and WMF models.

In [20]:
models = ["BPR_score", "WMF_score", "BPR + WMF Borda Count"]

result_data = {
    "Metrics": ["Precision@100", "Recall@100"],
}

test_users = set(test_set.uir_tuple[0])
for model in tqdm(models):
    sorted_df = all_df.sort_values(model, ascending=False) # sort by predicted ratings
    precisions, recalls = [], []

    for uidx in test_users:
        true_top_k = test_set.user_data[uidx][0] # ground truth data
        predicted_top_k = sorted_df[sorted_df['user_idx'] == uidx]['item_idx'].values[:TOPK].astype(int)
        # Precision@K
        precision = len(set(true_top_k) & set(predicted_top_k)) / len(predicted_top_k)
        precisions.append(precision)
        # Recall@K
        recall = len(set(true_top_k) & set(predicted_top_k)) / len(true_top_k)
        recalls.append(recall)

    result_data[model] = [np.mean(precisions), np.mean(recalls)]

# Now let's take a look at the results
result_df = pd.DataFrame(result_data)

display("Base BPR and Base WMF in comparison with BPR + WMF Borda Count", result_df[["Metrics", "BPR_score", "WMF_score", "BPR + WMF Borda Count"]])

100%|██████████| 3/3 [00:13<00:00,  4.59s/it]


'Base BPR and Base WMF in comparison with BPR + WMF Borda Count'

Unnamed: 0,Metrics,BPR_score,WMF_score,BPR + WMF Borda Count
0,Precision@100,0.083255,0.084989,0.085968
1,Recall@100,0.545496,0.558469,0.564845


We observe slightly better precision and recall for the Borda Count ensemble than for either individual model: Recall@100 rises to 0.565, from 0.545 for BPR and 0.558 for WMF.

### 5.3 Results for WMF Related Models

We calculate the **Precision@100** and **Recall@100** values for the WMF related models.

In [21]:
models = ["WMF Family Borda Count", "WMF Linear Regression", "WMF Random Forest"]

result_data = {
    "Metrics": ["Precision@100", "Recall@100"],
}

test_users = set(test_set.uir_tuple[0])
for model in tqdm(models):
    sorted_df = all_df.sort_values(model, ascending=False) # sort by predicted ratings
    precisions, recalls = [], []

    for uidx in test_users:
        true_top_k = test_set.user_data[uidx][0] # ground truth data
        predicted_top_k = sorted_df[sorted_df['user_idx'] == uidx]['item_idx'].values[:TOPK].astype(int)
        # Precision@K
        precision = len(set(true_top_k) & set(predicted_top_k)) / len(predicted_top_k)
        precisions.append(precision)
        # Recall@K
        recall = len(set(true_top_k) & set(predicted_top_k)) / len(true_top_k)
        recalls.append(recall)

    result_data[model] = [np.mean(precisions), np.mean(recalls)]

# Now let's take a look at the results
result_df = pd.DataFrame(result_data)

display("WMF Models Comparison", result_df[["Metrics", "WMF Family Borda Count", "WMF Linear Regression", "WMF Random Forest"]])

100%|██████████| 3/3 [00:14<00:00,  4.77s/it]


'WMF Models Comparison'

Unnamed: 0,Metrics,WMF Family Borda Count,WMF Linear Regression,WMF Random Forest
0,Precision@100,0.083543,0.067638,0.065926
1,Recall@100,0.567534,0.488337,0.439542


However, we also observe that performance varies, and an ensemble does not always improve on the individual models. The WMF Family Borda Count reaches the highest Recall@100 (0.568), but the Linear Regression and Random Forest ensembles score lower than the base WMF model on both metrics (Recall@100 of 0.488 and 0.440, versus 0.558).

One further direction to explore is to create a new ensemble from the many different base models that Cornac supports.

During the development of these models, we found many ways to experiment in order to improve them. However, there is also a risk of overfitting the models to the training data.

It is important to evaluate the models on the test set to ensure that they generalize well to unseen data.

## 6. Conclusion

Our results show that there’s no one-size-fits-all solution.

### Which models and configurations perform best?

Testing multiple models and ensemble techniques helps find the best approach for each dataset. While ensembling can improve accuracy, results will depend on how well models complement each other.

- **Try Different Base Models**: Cornac offers a variety of models; experimenting with each helps reveal what works best.
- **Adjust Model Parameters**: Tuning settings can optimize individual models and enhance ensemble performance.

### Is Ensembling Always Better?

- **Performance vs. Resources**: Ensembles often require more computation, so it’s important to balance resource use with performance gains.
- **Know When Not to Ensemble**: In some cases, a single well-tuned model may work as well as, or even better than, an ensemble.

These questions guide future experiments towards better recommender systems.