## Project 10: Content recommendation application
### Part 2: Modeling
In this notebook we model our data in order to build a content recommendation system.
We will use two main approaches:
- Content-based filtering
- Collaborative filtering

We rely on the [Surprise](https://surpriselib.com/) library to implement our models.

### 1. Imports

#### 1.1 Library imports

In [1]:
import os
import pickle
from collections import defaultdict
from heapq import nlargest
from math import floor

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

import surprise
from surprise import SVD, SVDpp, Dataset, Reader, KNNBasic, KNNWithMeans
from surprise import accuracy
from surprise.model_selection import GridSearchCV, train_test_split


#### 1.2 Loading the data

With the libraries imported, we now load the data files that serve as the basis for the models.

#### 1.2.1 Defining the paths

In [2]:
data_path = "../data/raw/globocom/"
clicks_path= "../data/raw/globocom/clicks/"

#### 1.2.2 Loading the article metadata file

In [3]:
articles_df = pd.read_csv(data_path + 'articles_metadata.csv')
articles_df.drop(columns=['created_at_ts'], inplace=True)
articles_df.head()

Unnamed: 0,article_id,category_id,publisher_id,words_count
0,0,0,0,168
1,1,1,0,189
2,2,1,0,250
3,3,1,0,230
4,4,1,0,162


In [4]:
articles_df.columns

Index(['article_id', 'category_id', 'publisher_id', 'words_count'], dtype='object')

#### 1.2.3 Loading the article embeddings file

In [5]:
# Open the pickle file and display the first 5 rows
with open(data_path + 'articles_embeddings.pickle', 'rb') as f:
    data = pickle.load(f)

embeddings_df = pd.DataFrame(data)
embeddings_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,240,241,242,243,244,245,246,247,248,249
0,-0.161183,-0.957233,-0.137944,0.050855,0.830055,0.901365,-0.335148,-0.559561,-0.500603,0.165183,...,0.321248,0.313999,0.636412,0.169179,0.540524,-0.813182,0.28687,-0.231686,0.597416,0.409623
1,-0.523216,-0.974058,0.738608,0.155234,0.626294,0.485297,-0.715657,-0.897996,-0.359747,0.398246,...,-0.487843,0.823124,0.412688,-0.338654,0.320787,0.588643,-0.594137,0.182828,0.39709,-0.834364
2,-0.619619,-0.97296,-0.20736,-0.128861,0.044748,-0.387535,-0.730477,-0.066126,-0.754899,-0.242004,...,0.454756,0.473184,0.377866,-0.863887,-0.383365,0.137721,-0.810877,-0.44758,0.805932,-0.285284
3,-0.740843,-0.975749,0.391698,0.641738,-0.268645,0.191745,-0.825593,-0.710591,-0.040099,-0.110514,...,0.271535,0.03604,0.480029,-0.763173,0.022627,0.565165,-0.910286,-0.537838,0.243541,-0.885329
4,-0.279052,-0.972315,0.685374,0.113056,0.238315,0.271913,-0.568816,0.341194,-0.600554,-0.125644,...,0.238286,0.809268,0.427521,-0.615932,-0.503697,0.61445,-0.91776,-0.424061,0.185484,-0.580292


#### 1.2.4 Loading the user interaction files

In [6]:
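def get_all_files_clicks(path):
    # Read every click file in the directory, then concatenate once
    # (concatenating inside the loop copies the growing frame on each pass)
    frames = [pd.read_csv(os.path.join(path, file)) for file in os.listdir(path)]

    # ignore_index gives the combined frame a unique row index
    return pd.concat(frames, axis=0, ignore_index=True)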

In [7]:
clicks_df = get_all_files_clicks(clicks_path)

In [8]:
clicks_df['click_timestamp'] = pd.to_datetime(clicks_df['click_timestamp'], unit='ms')
clicks_df['session_start'] = pd.to_datetime(clicks_df['session_start'], unit='ms')
clicks_df.head()

Unnamed: 0,user_id,session_id,session_start,session_size,click_article_id,click_timestamp,click_environment,click_deviceGroup,click_os,click_country,click_region,click_referrer_type
0,93863,1507865792177843,2017-10-13 03:36:32,2,96210,2017-10-13 03:37:12.925,4,3,2,1,21,2
1,93863,1507865792177843,2017-10-13 03:36:32,2,158094,2017-10-13 03:37:42.925,4,3,2,1,21,2
2,294036,1507865795185844,2017-10-13 03:36:35,2,20691,2017-10-13 03:36:59.095,4,3,20,1,9,2
3,294036,1507865795185844,2017-10-13 03:36:35,2,96210,2017-10-13 03:37:29.095,4,3,20,1,9,2
4,77136,1507865796257845,2017-10-13 03:36:36,2,336245,2017-10-13 03:42:13.178,4,3,2,1,25,2


### 2. Content-based Filtering

**Content-based filtering** is a recommendation method that uses detailed information about items to recommend other, similar items. For example, in a movie recommendation system, content-based filtering could use information such as the movie's genre, director, and cast.

**Principle**
The idea is that if a user liked a given item in the past, they will probably like similar items in the future. The system therefore recommends items that are similar to those the user previously liked.

**Computing similarity**
Similarity between items is generally computed with techniques such as cosine similarity or Euclidean distance. The items most similar to those the user liked are recommended.
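
To make the similarity step concrete, here is a minimal sketch, assuming a handful of made-up item vectors. The names `cosine_sim`, `top_k_similar`, and `item_vectors` are illustrative and do not come from the notebook. It computes the cosine similarity, cos(a, b) = a·b / (‖a‖ ‖b‖), and ranks the other items against a query item.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_similar(item_vectors, query_idx, k=2):
    """Return the indices of the k items most similar to item `query_idx`."""
    query = item_vectors[query_idx]
    scores = [
        (cosine_sim(query, vec), idx)
        for idx, vec in enumerate(item_vectors)
        if idx != query_idx  # never recommend the query item itself
    ]
    scores.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scores[:k]]

# Hypothetical 3-dimensional item vectors (illustrative values only)
item_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])

print(top_k_similar(item_vectors, query_idx=0, k=2))  # [1, 2]
```

Item 1 points in almost the same direction as item 0, so it ranks first; item 3 is orthogonal to item 0 and gets a similarity of 0.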

**Important note**
Content-based filtering does not take other users' opinions into account. It focuses solely on the preferences of the current user.

In [9]:
def recommend_articles(articles, clicks, user_id, n=5):
    # `articles` holds one feature vector per article, indexed by article id.
    # Normalise the ids without modifying the caller's DataFrames
    user_id = int(user_id)
    articles = articles.set_axis(articles.index.astype(int), axis=0)

    # Get the articles read by the user
    articles_read = clicks.loc[clicks['user_id'].astype(int) == user_id, 'click_article_id'].astype(int).tolist()
    print(f"Articles read by user {user_id}: {articles_read}")

    # If the user hasn't read any articles, recommend the most popular ones
    if len(articles_read) == 0:
        most_popular_articles = clicks['click_article_id'].value_counts().index.tolist()
        print(f"User {user_id} has not read any articles. Recommending most popular articles: {most_popular_articles[:n]}")
        return most_popular_articles[:n]

    # Get the feature vectors of the articles read by the user
    articles_read_embedding = articles.loc[articles_read]
    print(f"Number of articles read by user {user_id}: {len(articles_read)}")

    # Remove the articles read by the user from the list of articles
    articles = articles.drop(articles_read)
    print(f"Remaining articles after removing articles read by user {user_id}: {len(articles)}")

    # Calculate the cosine similarity between the articles read by the user and the other articles
    matrix = cosine_similarity(articles_read_embedding, articles)

    recommendations = []

    # Recommend the articles most similar to any article read by the user
    for _ in range(min(n, matrix.shape[1])):
        # The column of the largest remaining similarity identifies the candidate article
        _, coord_y = np.unravel_index(np.argmax(matrix), matrix.shape)

        recommendations.append(int(articles.index[coord_y]))

        # Mask the whole column so the same article cannot be recommended twice
        matrix[:, coord_y] = -np.inf

    # Sanity check: count recommended articles that have already been read by the user
    already_read = len(set(recommendations) & set(articles_read))
    print(f"Number of recommended articles that have already been read by user {user_id}: {already_read}")

    return recommendations

In [10]:
user_id = 7723

In [11]:
# Recommend 10 articles for the selected user. The output recorded below was produced with the
# metadata table `articles_df`; pass `embeddings_df` to compare articles on their embedding vectors.
recommend = recommend_articles(articles_df, clicks_df, user_id, 10)
print(f"recommended articles: {recommend}")

Articles read by user 7723: [214455, 9308, 9649, 129799, 141548, 336220, 353673, 337192, 84763, 107179, 199197, 271400, 84835, 338339, 60252, 303565, 31520, 36685, 36609, 163505, 123434, 141050, 313504, 272660, 72618, 72646, 140445, 277491, 226648, 57740, 128551, 140324, 198659, 166581, 156560, 282964, 225124, 277491, 128707]
Number of articles read by user 7723: 39
Remaining articles after removing articles read by user 7723: 364009
Number of recommended articles that have already been read by user 7723: 0
recommended articles: [336221, 226650, 303553, 226639, 129806, 287088, 236086, 336171, 156544, 271364]


### 3. Collaborative Filtering

**Collaborative filtering** is a recommendation method that relies on users' past behaviour to predict what a user might like.

**Principle**
The main idea is that if two users behaved similarly in the past (for example, they liked the same movies or bought the same products), they are likely to have similar interests in the future.

**Types of collaborative filtering**
There are three main types of collaborative filtering:

1. **User-based collaborative filtering**: this method finds users similar to the target user and recommends items that those similar users liked.

2. **Item-based collaborative filtering**: this method finds items similar to those the target user liked and recommends those similar items.

3. **Model-based collaborative filtering**: this method uses modeling techniques, such as matrix factorization or clustering, to predict a user's interest in an item. It relies on the past behaviour of all users as well as on the ratings the target user gave to other items.

**Computing similarity**
Similarity between users or between items is generally computed with techniques such as the Pearson correlation or cosine similarity.
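
As an illustration of the Pearson correlation between two users, here is a minimal pure-Python sketch restricted to the items both users have rated. The function name `pearson` and the rating dictionaries are hypothetical, and returning 0.0 for degenerate cases is a convention chosen for this example only.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users over the items both have rated.

    Each argument is a dict mapping item id -> rating.
    Returns 0.0 when fewer than two items are shared or a variance is zero.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings (illustrative values only)
alice = {"a": 5, "b": 3, "c": 1}
bob = {"a": 4, "b": 2, "c": 0, "d": 5}  # same ordering as alice, shifted by 1
carol = {"a": 1, "b": 3, "c": 5}        # opposite ordering

print(pearson(alice, bob))    # 1.0
print(pearson(alice, carol))  # -1.0
```

Because the correlation subtracts each user's mean, Bob counts as perfectly similar to Alice even though he rates everything one point lower.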

**Important note**
Unlike content-based filtering, collaborative filtering does not require detailed information about the items. It relies solely on past interactions between users and items.
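
As a small illustration of relying on interactions alone, the sketch below turns a raw click log into implicit ratings by counting clicks per user and article, the same idea as the `click_count` column built later in this notebook. The log contents and the helper name `click_counts` are made up for the example.

```python
from collections import Counter

def click_counts(click_log):
    """Count how many times each user clicked each article.

    `click_log` is an iterable of (user_id, article_id) pairs, one per click.
    """
    return Counter(click_log)

# Hypothetical click log (illustrative values only)
log = [(1, 10), (1, 10), (1, 20), (2, 10)]
ratings = click_counts(log)

for (user, article), count in sorted(ratings.items()):
    print(user, article, count)
```

No article metadata is involved: the only inputs are who clicked what, and how often.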

First, we search for the best hyperparameters for the model using `GridSearchCV`. `GridSearchCV` performs an exhaustive search over every combination of values in a parameter grid to find the one with the best cross-validation score.

In machine learning, hyperparameters are the model settings we tune to improve performance. For example, in a random forest model they could include the number of trees in the forest (`n_estimators`) and the maximum depth of the trees (`max_depth`).

`GridSearchCV` trains and evaluates a model for each parameter combination. It measures performance by cross-validation: the data is split into several folds (`cv=3` in the code below), each fold serves in turn as the test set while the model is trained on the remaining folds, and the scores are averaged across folds.
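
The search procedure described above can be sketched in plain Python. This is a simplified illustration, not Surprise's implementation: the folds here are contiguous and unshuffled, and `score_fn` is a hypothetical callback standing in for "train on the training fold, return the error on the test fold". All function names are invented for the example.

```python
from itertools import product

def param_combinations(param_grid):
    """Yield one dict per combination of values in `param_grid`."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def grid_search(param_grid, n_samples, k, score_fn):
    """Return (best_params, best_score); lower scores are better (e.g. RMSE)."""
    best_params, best_score = None, float("inf")
    for params in param_combinations(param_grid):
        folds = list(k_fold_indices(n_samples, k))
        mean_score = sum(score_fn(params, tr, te) for tr, te in folds) / len(folds)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy scoring function (an assumption for the demo): the error is smallest
# at n_epochs=10 and lr_all=0.005, regardless of the fold.
def toy_score(params, train, test):
    return abs(params["n_epochs"] - 10) / 100 + abs(params["lr_all"] - 0.005)

grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
best_params, best_score = grid_search(grid, n_samples=7, k=3, score_fn=toy_score)
print(best_params, best_score)
```

With two values for each of two hyperparameters and three folds, the search performs 4 × 3 = 12 train-and-evaluate rounds, which is why grids grow expensive quickly.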

Once `GridSearchCV` has finished the search, the best parameters are available through its `best_params` attribute, a dictionary keyed by measure (for example `gs.best_params['rmse']`). We can then use these parameters to train our final model.

In [12]:
from surprise import SVD, SVDpp, KNNWithMeans, CoClustering, SlopeOne, NormalPredictor
from surprise.model_selection import GridSearchCV

# Create a 'click_count' column
clicks_df['click_count'] = clicks_df.groupby(['user_id', 'click_article_id'])['click_timestamp'].transform('count')

# Load a 10% sample of the data into a Surprise dataset
reader = Reader(rating_scale=(0, clicks_df.click_count.max()))
data = Dataset.load_from_df(clicks_df[['user_id', 'click_article_id', 'click_count']].sample(frac=0.1, random_state=42), reader)


models = {
    SVD: {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]},
    SVDpp: {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]},
    KNNWithMeans: {'k': [20], 'sim_options': {'name': ['msd', 'cosine'], 'user_based': [False]}},
    CoClustering: {'n_cltr_u': [3, 5], 'n_cltr_i': [3, 5]},
    SlopeOne: {},
    NormalPredictor: {}
}

best_score = float('inf')
best_model = None
best_params = None

for model, param_grid in models.items():
    gs = GridSearchCV(model, param_grid, measures=['rmse', 'mae'], cv=3)
    gs.fit(data)
    params = gs.best_params['rmse']
    score = gs.best_score['rmse']
    print(f"Best parameters for {model.__name__}: {params}")
    print(f"Best RMSE for {model.__name__}: {score}")
    print(f"Best MAE for {model.__name__}: {gs.best_score['mae']}")
    if score < best_score:
        best_score = score
        best_model = model.__name__
        best_params = params

print(f"\nBest model: {best_model}")
print(f"Best parameters: {best_params}")
print(f"Best RMSE: {best_score}")

Best parameters for SVD: {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
Best RMSE for SVD: 0.28798814019027646
Best MAE for SVD: 0.05711371978412593
Best parameters for SVDpp: {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
Best RMSE for SVDpp: 0.28899137485788406
Best MAE for SVDpp: 0.054097179295432146
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Best parameters for KNNWithMeans: {'k': 20, 'sim_options': {'name': 'msd', 'user_based': False}}
Best RMSE for KNNWithMeans: 0.3053635006599857
Best MAE for KNNWithMeans: 0.06530060814579225
Best parameters for CoClustering: {'n_cltr_u': 3, '

### SVD

In [15]:
# Create a 'click_count' column
clicks_df['click_count'] = clicks_df.groupby(['user_id', 'click_article_id'])['click_timestamp'].transform('count')

# Load the full data into a Surprise dataset
reader = Reader(rating_scale=(0, clicks_df.click_count.max()))
data = Dataset.load_from_df(clicks_df[['user_id', 'click_article_id', 'click_count']], reader)

# Define the parameter grid
param_grid = {
    'n_factors': [10, 20, 30],
    'n_epochs': [5, 10, 15],
    'lr_all': [0.001, 0.002, 0.003],
    'reg_all': [0.01, 0.02, 0.03]
}

# Run a grid search with cross-validation
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Get the best parameters
best_params = gs.best_params['rmse']

print(f"Best parameters: {best_params}")

print(f"Best RMSE: {gs.best_score['rmse']}")
print(f"Best MAE: {gs.best_score['mae']}")

Best parameters: {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.01}
Best RMSE: 0.19410449113534123
Best MAE: 0.04801835921096572


### Recommendations

In [18]:
def precision_recall_at_k(predictions, k_list=(5, 10)):
    '''Return per-user precision and recall at k for each k in k_list.

    An item counts as relevant only when its estimated rating equals its true rating exactly.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = {k: dict() for k in k_list}
    recalls = {k: dict() for k in k_list}
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        for k in k_list:
            # Number of recommended items in top k
            n_rec_k = len(user_ratings[:k])

            # Number of relevant and recommended items in top k
            n_rel_and_rec_k = sum((true_r == est) for (est, true_r) in user_ratings[:k])

            # Precision@K: Proportion of recommended items that are relevant
            precisions[k][uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

            # Number of relevant items
            n_rel = sum((true_r == est) for (est, true_r) in user_ratings)

            # Recall@K: Proportion of relevant items that are recommended
            recalls[k][uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls

def collaborativeFilteringRecommendArticle(articles, clicks, user_id, n=5):
    # Normalise the user id; the input DataFrames are left unmodified
    user_id = int(user_id)

    # Check if user_id is in clicks
    if user_id not in clicks['user_id'].values:
        raise ValueError(f"User ID {user_id} not found in clicks data.")

    # Create a new DataFrame that counts the number of times a user clicked on an article
    click_counts = clicks.groupby(['user_id', 'click_article_id']).size().reset_index(name='click_count')

    # Use the full click-count table (no subsampling is applied here)
    data_subset = click_counts

    # Create a reader and a data object
    reader = Reader(rating_scale=(1, data_subset.click_count.max()))  # every observed pair has at least one click
    data = Dataset.load_from_df(data_subset, reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # Train a SVD model with the best parameters
    algo = SVD(n_factors=best_params['n_factors'],n_epochs=best_params['n_epochs'], lr_all=best_params['lr_all'], reg_all=best_params['reg_all'])
    algo.fit(trainset)

    # Predict ratings for the testset
    predictions_test = algo.test(testset)

    # Calculate precision and recall at k
    precisions, recalls = precision_recall_at_k(predictions_test, k_list=[5, 10])
    for k in [5, 10]:
        avg_precision = sum(prec for prec in precisions[k].values()) / len(precisions[k])
        avg_recall = sum(rec for rec in recalls[k].values()) / len(recalls[k])
        print(f"Average Precision at {k}: {avg_precision}")
        print(f"Average Recall at {k}: {avg_recall}")

    # Get the set of articles read by the user
    articles_read = set(clicks.loc[clicks['user_id'] == user_id, 'click_article_id'])

    # Get the list of all articles
    all_articles = list(articles.index)

    # Remove the articles already read by the user (set membership keeps this pass linear)
    articles_to_predict = [article for article in all_articles if article not in articles_read]

    # Get the predicted ratings for the articles not yet read by the user
    predictions = {article: algo.predict(user_id, article).est for article in articles_to_predict}

    # Get the top n articles
    top_n_articles = nlargest(n, predictions, key=predictions.get)

    return top_n_articles
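
The `precision_recall_at_k` helper above counts an item as relevant only when the estimate equals the true rating exactly. A more common definition uses a relevance threshold on both values. The sketch below shows that variant for a single `k`, as an alternative rather than the method used for the figures in this notebook. The function name, the `threshold` value, and the sample predictions are assumptions made for the example.

```python
from collections import defaultdict

def precision_recall_at_k_threshold(predictions, k=10, threshold=2.0):
    """Per-user precision@k and recall@k with threshold-based relevance.

    `predictions` is an iterable of (uid, iid, true_r, est, details) tuples.
    An item is relevant if true_r >= threshold and recommended if it is in
    the user's top k by estimate with est >= threshold.
    """
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, ratings in user_est_true.items():
        ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = ratings[:k]
        n_rel = sum(true_r >= threshold for _, true_r in ratings)
        n_rec_k = sum(est >= threshold for est, _ in top_k)
        n_rel_and_rec_k = sum(
            est >= threshold and true_r >= threshold for est, true_r in top_k
        )
        # Undefined ratios are reported as 0.0 in this sketch
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel else 0.0
    return precisions, recalls

# Hypothetical predictions for one user (illustrative values only)
preds = [
    ("u1", "a", 3.0, 2.8, None),  # relevant, recommended
    ("u1", "b", 1.0, 2.5, None),  # not relevant, recommended
    ("u1", "c", 2.0, 1.2, None),  # relevant, not recommended
    ("u1", "d", 1.0, 1.0, None),  # not relevant, not recommended
]
precisions, recalls = precision_recall_at_k_threshold(preds, k=2, threshold=2.0)
print(precisions["u1"], recalls["u1"])  # 0.5 0.5
```

Of the two recommended items only one is relevant (precision 0.5), and of the two relevant items only one is recommended (recall 0.5).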

In [19]:
recommended_articles = collaborativeFilteringRecommendArticle(articles_df, clicks_df, 7723, n=10)

Average Precision at 5: 0.3394941377528835
Average Recall at 5: 0.9252399546895873
Average Precision at 10: 0.35810079658429156
Average Recall at 10: 0.9811812621699382


In [23]:
print("Recommended articles for user 7723:")
for article_id in recommended_articles:
    print(articles_df.loc[article_id])

Recommended articles for user 7723:
article_id      237071
category_id        375
publisher_id         0
words_count        161
Name: 237071, dtype: int64
article_id      363925
category_id        458
publisher_id         0
words_count        326
Name: 363925, dtype: int64
article_id      38823
category_id        60
publisher_id        0
words_count       262
Name: 38823, dtype: int64
article_id      43032
category_id        68
publisher_id        0
words_count       148
Name: 43032, dtype: int64
article_id      73431
category_id       138
publisher_id        0
words_count       183
Name: 73431, dtype: int64
article_id      69463
category_id       136
publisher_id        0
words_count       172
Name: 69463, dtype: int64
article_id      105941
category_id        228
publisher_id         0
words_count        145
Name: 105941, dtype: int64
article_id      146230
category_id        271
publisher_id         0
words_count        221
Name: 146230, dtype: int64
article_id      225378
category_

In [35]:
recommended_articles = collaborativeFilteringRecommendArticle(articles_df, clicks_df, 20 , n=10)

Average Precision at 5: 0.15603441275239588
Average Recall at 5: 0.955144725884014
Average Precision at 10: 0.1678363519898356
Average Recall at 10: 0.9918343063678295


In [25]:
recommended_articles = collaborativeFilteringRecommendArticle(articles_df, clicks_df, 160974 , n=10)

Average Precision at 5: 0.1556990868167984
Average Recall at 5: 0.9554786949001022
Average Precision at 10: 0.16748426520754528
Average Recall at 10: 0.9918598440366335


In [26]:
print("Recommended articles for user 20:")
for article_id in recommended_articles:
    print(articles_df.loc[article_id])

Recommended articles for user 20:
article_id      68851
category_id       136
publisher_id        0
words_count       278
Name: 68851, dtype: int64
article_id      237071
category_id        375
publisher_id         0
words_count        161
Name: 237071, dtype: int64
article_id      363925
category_id        458
publisher_id         0
words_count        326
Name: 363925, dtype: int64
article_id      105941
category_id        228
publisher_id         0
words_count        145
Name: 105941, dtype: int64
article_id      73431
category_id       138
publisher_id        0
words_count       183
Name: 73431, dtype: int64
article_id      69463
category_id       136
publisher_id        0
words_count       172
Name: 69463, dtype: int64
article_id      62197
category_id       127
publisher_id        0
words_count       251
Name: 62197, dtype: int64
article_id      74254
category_id       141
publisher_id        0
words_count       141
Name: 74254, dtype: int64
article_id      146230
category_id     

### Hybrid Filtering

In [32]:
def hybridRecommendArticle(articles, clicks, user_id, n=5):
    # Normalise the user id; the input DataFrames are left unmodified
    user_id = int(user_id)

    # Check if user_id is in clicks
    if user_id not in clicks['user_id'].values:
        raise ValueError(f"User ID {user_id} not found in clicks data.")

    # Create a new DataFrame that counts the number of times a user clicked on an article
    click_counts = clicks.groupby(['user_id', 'click_article_id']).size().reset_index(name='click_count')

    # Use the full click-count table (no subsampling is applied here)
    data_subset = click_counts

    # Create a reader and a data object
    reader = Reader(rating_scale=(1, data_subset.click_count.max()))  # every observed pair has at least one click
    data = Dataset.load_from_df(data_subset, reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # Train an SVD++ model, reusing the hyperparameters found for SVD (n_factors keeps its default)
    algo = SVDpp(n_epochs=best_params['n_epochs'], lr_all=best_params['lr_all'], reg_all=best_params['reg_all'])
    algo.fit(trainset)

    # Predict ratings for the testset
    predictions_test = algo.test(testset)

    # Calculate precision and recall at k
    precisions, recalls = precision_recall_at_k(predictions_test, k_list=[5, 10])
    for k in [5, 10]:
        avg_precision = sum(prec for prec in precisions[k].values()) / len(precisions[k])
        avg_recall = sum(rec for rec in recalls[k].values()) / len(recalls[k])
        print(f"Average Precision at {k}: {avg_precision}")
        print(f"Average Recall at {k}: {avg_recall}")

    # Get the set of articles read by the user
    articles_read = set(clicks.loc[clicks['user_id'] == user_id, 'click_article_id'])

    # Get the list of all articles
    all_articles = list(articles.index)

    # Remove the articles already read by the user (set membership keeps this pass linear)
    articles_to_predict = [article for article in all_articles if article not in articles_read]

    # Get the predicted ratings for the articles not yet read by the user
    predictions = {article: algo.predict(user_id, article).est for article in articles_to_predict}

    # Get the top n articles
    top_n_articles_collab = nlargest(n, predictions, key=predictions.get)

    # Content-based filtering
    top_n_articles_content = recommend_articles(articles, clicks, user_id, n)

    # Combine the results of collaborative and content-based filtering,
    # dropping duplicates while keeping the collaborative results first
    top_n_articles = list(dict.fromkeys(top_n_articles_collab + top_n_articles_content))

    return top_n_articles
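
The hybrid function above simply appends the content-based list to the collaborative list. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to alternate between the two rankings so that both sources are represented near the top, while skipping duplicates. The helper name `interleave_unique` and the sample id lists are illustrative.

```python
from itertools import chain, zip_longest

def interleave_unique(first, second, n):
    """Alternate between two ranked lists, skip duplicates, keep at most n items."""
    if n <= 0:
        return []
    missing = object()  # sentinel marking the end of the shorter list
    merged, seen = [], set()
    pairs = zip_longest(first, second, fillvalue=missing)
    for item in chain.from_iterable(pairs):
        if item is missing or item in seen:
            continue
        seen.add(item)
        merged.append(item)
        if len(merged) == n:
            break
    return merged

# Illustrative ranked lists of article ids
collab = [237071, 363925, 38823]
content = [336221, 237071, 226650]

print(interleave_unique(collab, content, n=4))
# [237071, 336221, 363925, 38823]
```

Article 237071 appears in both lists but is kept only once, at the position where it first occurs.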

In [33]:
recommended_articles = hybridRecommendArticle(articles_df, clicks_df, 7723, n=10)

Average Precision at 5: 0.15311180453172374
Average Recall at 5: 0.9551493733941426
Average Precision at 10: 0.16524886347860998
Average Recall at 10: 0.9920455463510742
Articles read by user 7723: [214455, 9308, 9649, 129799, 141548, 336220, 353673, 337192, 84763, 107179, 199197, 271400, 84835, 338339, 60252, 303565, 31520, 36685, 36609, 163505, 123434, 141050, 313504, 272660, 72618, 72646, 140445, 277491, 226648, 57740, 128551, 140324, 198659, 166581, 156560, 282964, 225124, 277491, 128707]
Number of articles read by user 7723: 39
Remaining articles after removing articles read by user 7723: 364009
Number of recommended articles that have already been read by user 7723: 0
