## Project 10: Content recommendation application
### Part 2: Modeling
In this notebook we model our data in order to build a content recommendation system.
We will use two main approaches:
- Content-based filtering
- Collaborative filtering

We will rely on the [Surprise](https://surpriselib.com/) library to build our models.

### 1. Imports

#### 1.1 Library imports

In [1]:
import os
import pickle
import pandas as pd


from sklearn.metrics.pairwise import cosine_similarity
from math import floor
import numpy as np

import surprise
from surprise import SVD, Dataset, Reader, KNNBasic, KNNWithMeans
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split
from heapq import nlargest

#### 1.2 Data import

With the libraries imported, we now load the data files that will serve as the basis for the models.

In [3]:
data_path = "../data/raw/globocom/"
clicks_path= "../data/raw/globocom/clicks/"

#### 1.2.2 Import of the article metadata file

In [4]:
articles_df = pd.read_csv(data_path + 'articles_metadata.csv')
articles_df.head()

Unnamed: 0,article_id,category_id,created_at_ts,publisher_id,words_count
0,0,0,1513144419000,0,168
1,1,1,1405341936000,0,189
2,2,1,1408667706000,0,250
3,3,1,1408468313000,0,230
4,4,1,1407071171000,0,162


#### 1.2.3 Import of the article embeddings file

In [23]:
embeddings_df = pd.read_pickle(data_path + 'articles_embeddings.pickle')
embeddings_df = pd.DataFrame(embeddings_df, columns=["emb_" + str(i) for i in range(embeddings_df.shape[1])])
embeddings_df.head()

Unnamed: 0,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,emb_9,...,emb_240,emb_241,emb_242,emb_243,emb_244,emb_245,emb_246,emb_247,emb_248,emb_249
0,-0.161183,-0.957233,-0.137944,0.050855,0.830055,0.901365,-0.335148,-0.559561,-0.500603,0.165183,...,0.321248,0.313999,0.636412,0.169179,0.540524,-0.813182,0.28687,-0.231686,0.597416,0.409623
1,-0.523216,-0.974058,0.738608,0.155234,0.626294,0.485297,-0.715657,-0.897996,-0.359747,0.398246,...,-0.487843,0.823124,0.412688,-0.338654,0.320787,0.588643,-0.594137,0.182828,0.39709,-0.834364
2,-0.619619,-0.97296,-0.20736,-0.128861,0.044748,-0.387535,-0.730477,-0.066126,-0.754899,-0.242004,...,0.454756,0.473184,0.377866,-0.863887,-0.383365,0.137721,-0.810877,-0.44758,0.805932,-0.285284
3,-0.740843,-0.975749,0.391698,0.641738,-0.268645,0.191745,-0.825593,-0.710591,-0.040099,-0.110514,...,0.271535,0.03604,0.480029,-0.763173,0.022627,0.565165,-0.910286,-0.537838,0.243541,-0.885329
4,-0.279052,-0.972315,0.685374,0.113056,0.238315,0.271913,-0.568816,0.341194,-0.600554,-0.125644,...,0.238286,0.809268,0.427521,-0.615932,-0.503697,0.61445,-0.91776,-0.424061,0.185484,-0.580292


#### 1.2.4 Import of the user interaction files

In [6]:
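def get_all_files_clicks(path):
    # Read every click file in the directory, then concatenate once at the end
    # (concatenating inside the loop copies the accumulated frame on every pass)
    frames = [pd.read_csv(os.path.join(path, file)) for file in os.listdir(path)]

    return pd.concat(frames, axis=0)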

In [7]:
clicks_df = get_all_files_clicks(clicks_path)

In [8]:
clicks_df = clicks_df[['user_id', 'session_id', 'session_size', 'click_article_id']]
clicks_df.head()

Unnamed: 0,user_id,session_id,session_size,click_article_id
0,93863,1507865792177843,2,96210
1,93863,1507865792177843,2,158094
2,294036,1507865795185844,2,20691
3,294036,1507865795185844,2,96210
4,77136,1507865796257845,2,336245


In [9]:
users_df = clicks_df.groupby('user_id').agg({'click_article_id':lambda x: list(x)})
users_df.head()

user_id,click_article_id
0,"[157541, 68866, 96755, 313996, 160158, 233470,..."
1,"[327984, 183176, 235840, 96663, 59758, 160474,..."
2,"[119592, 30970, 30760, 209122]"
3,"[236444, 234318, 233688, 237452, 235745, 12096..."
4,"[336499, 271261, 48915, 44488, 195887, 195084,..."


### 2. Content-based Filtering

**Content-based filtering** is a recommendation method that uses detailed information about items to recommend other, similar items. For example, in a movie recommendation system, content-based filtering could use information such as the movie's genre, director, actors, and so on. In this notebook, that information is the embedding vector of each article.

**Principle**
The idea is that if a user liked a given item in the past, they will probably like similar items in the future. The system therefore recommends items that are similar to those the user previously liked.

**Computing similarity**
Similarity between items is usually computed with measures such as cosine similarity or Euclidean distance. The items most similar to those the user liked are recommended.
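
To make the two measures concrete, here is a minimal sketch using NumPy. The three vectors are hypothetical 3-dimensional stand-ins for the 250-dimensional article embeddings, and the helper names `cosine_sim` and `euclidean_dist` are illustrative, not part of the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    # Straight-line distance between a and b: 0 = identical vectors
    return float(np.linalg.norm(a - b))

# Hypothetical embeddings (assumption: invented values for illustration)
read_article = np.array([1.0, 0.0, 1.0])
same_direction = np.array([2.0, 0.0, 2.0])  # same orientation, larger norm
orthogonal = np.array([0.0, 1.0, 0.0])      # shares no component with read_article

print(cosine_sim(read_article, same_direction))      # close to 1
print(cosine_sim(read_article, orthogonal))          # 0
print(euclidean_dist(read_article, same_direction))  # non-zero despite the same direction
```

Cosine similarity ignores vector length and only compares orientation, whereas Euclidean distance also penalizes a difference in norm. That is one reason cosine similarity is a common choice for embeddings.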

**Important note**
Content-based filtering does not take other users' opinions into account. It relies only on the preferences of the current user.

In [28]:
def contentBasedFiltering(articles, clicks, user_id, n=5):
    # Get the articles read by the user
    articles_read = clicks[clicks['user_id'] == int(user_id)]['click_article_id'].tolist()
    print(f"Articles read by user {user_id}: {articles_read}")

    # If the user hasn't read any articles, recommend the most popular ones
    if len(articles_read) == 0:
        most_popular_articles = clicks['click_article_id'].value_counts().index.tolist()
        print(f"User {user_id} has not read any articles. Recommending most popular articles: {most_popular_articles[:n]}")
        return most_popular_articles[:n]

    # Get the embeddings of the articles read by the user
    articles_read_embedding = articles.loc[articles_read]
    print(f"Number of articles read by user {user_id}: {len(articles_read)}")

    # Candidate articles: everything the user has not read yet
    candidates = articles.drop(articles_read)
    print(f"Remaining articles after removing articles read by user {user_id}: {len(candidates)}")

    # Cosine similarity between each read article (rows) and each candidate (columns)
    matrix = cosine_similarity(articles_read_embedding, candidates)

    # Score each candidate by its best similarity to any read article, so that a
    # candidate close to several read articles cannot be recommended more than once
    best_similarity = matrix.max(axis=0)

    # Positions of the n highest scores, from most to least similar
    top_positions = np.argsort(best_similarity)[::-1][:n]

    return [int(candidates.index[pos]) for pos in top_positions]

In [30]:
# With the standard embeddings file
test = contentBasedFiltering(embeddings_df, clicks_df, 21)
print(test)

Articles read by user 21: [156560, 162655, 303565, 124751, 313504, 140720, 124177, 123757]
Number of articles read by user 21: 8
Remaining articles after removing articles read by user 21: 364039
[123852, 304790, 301782, 304136, 304708]


### 3. Collaborative Filtering

**Collaborative filtering** is a recommendation method that relies on users' past behavior to predict what a given user might like.

**Principle**
The main idea is that if two users behaved similarly in the past (for example, they liked the same movies or bought the same products), they are likely to have similar interests in the future.

**Types of collaborative filtering**
There are three main types of collaborative filtering:

1. **User-based collaborative filtering**: this method finds users similar to the target user and recommends items that those similar users liked.

2. **Item-based collaborative filtering**: this method finds items similar to those the target user liked and recommends those similar items.

3. **Model-based collaborative filtering**: this method uses modeling techniques, such as matrix factorization or clustering, to predict a user's interest in an item. It relies on the past behavior of all users, as well as on the ratings the target user gave to other items.
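
As an illustration of the model-based variant, the sketch below factorizes a tiny rating set with stochastic gradient descent. It is a simplified stand-in and not the notebook's model: it omits the bias terms that Surprise's `SVD` uses, and the function names, the rating triples, and the hyperparameter values are all assumptions made for the example.

```python
import numpy as np

def factorize(ratings, n_users, n_items, n_factors=2, n_epochs=200, lr=0.05, reg=0.02, seed=0):
    """Learn factor matrices P (users) and Q (items) so that P[u] @ Q[i] approximates r_ui."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error on this rating
            p_u = P[u].copy()              # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings])))

# Hypothetical (user, item, rating) triples on a 0-1 scale
ratings = [(0, 0, 1.0), (0, 1, 0.8), (1, 0, 0.9), (1, 2, 0.1), (2, 1, 0.7), (2, 2, 0.2)]

P0, Q0 = factorize(ratings, 3, 3, n_epochs=0)  # untrained factors, for comparison
P, Q = factorize(ratings, 3, 3)
rmse_before, rmse_after = rmse(ratings, P0, Q0), rmse(ratings, P, Q)
print(rmse_before, rmse_after)
print(P[0] @ Q[2])  # predicted rating for a (user, item) pair that was never observed
```

The last line shows the point of the approach: once the factors are learned, a score can be computed for any user and item pair, including pairs with no recorded interaction.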

**Computing similarity**
Similarity between users or between items is usually computed with measures such as the Pearson correlation or cosine similarity.
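
The difference between the two measures can be seen on a small example. The sketch below assumes three hypothetical users who all rated the same four items; in practice the similarity is usually computed only over the items that both users rated.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a, b):
    # Pearson correlation is the cosine similarity of the mean-centered vectors
    return cosine_sim(a - a.mean(), b - b.mean())

# Hypothetical ratings (assumption: invented values for illustration)
alice = np.array([5.0, 4.0, 1.0, 2.0])
bob = np.array([4.0, 3.0, 0.0, 1.0])    # same tastes as alice, but rates one point lower
carol = np.array([1.0, 2.0, 5.0, 4.0])  # opposite tastes

print(pearson_sim(alice, bob))    # close to 1: the constant offset is removed by centering
print(pearson_sim(alice, carol))  # close to -1: opposite preferences
print(cosine_sim(alice, carol))   # still positive, because all ratings are positive
```

Pearson correlation corrects for users who systematically rate higher or lower than others, while plain cosine similarity on raw ratings does not.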

**Important note**
Unlike content-based filtering, collaborative filtering does not require detailed information about the items. It relies only on past interactions between users and items.

In [31]:
def calculRatingByClick(clicks):

    count_user_article_size = (clicks.groupby(['user_id', "click_article_id"]).agg(user_article_size=("session_size", "sum")))
    count_user_total_size = (clicks.groupby(['user_id']).agg(user_total_size=("session_size", "sum")))

    ratings = count_user_article_size.join(count_user_total_size, on="user_id")

    ratings['rating'] = ratings['user_article_size'] / ratings['user_total_size']

    ratings = ratings.reset_index().drop(['user_article_size', 'user_total_size'], axis = 1).rename({'click_article_id': 'article_id'}, axis = 1)

    return ratings

In [32]:
ratings = calculRatingByClick(clicks_df)

ratings.head()

Unnamed: 0,user_id,article_id,rating
0,0,68866,0.125
1,0,87205,0.125
2,0,87224,0.125
3,0,96755,0.125
4,0,157541,0.125
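
The rating produced by `calculRatingByClick` can be checked by hand on a tiny log. The sketch below re-implements the same formula in plain Python: for each (user, article) pair, the sum of `session_size` over that pair's clicks divided by the sum of `session_size` over all of the user's clicks. The function name `ratings_from_clicks` and the click tuples are hypothetical.

```python
from collections import defaultdict

def ratings_from_clicks(clicks):
    """clicks: iterable of (user_id, session_size, article_id) tuples, one per click."""
    per_pair = defaultdict(int)  # sum of session_size for each (user, article)
    per_user = defaultdict(int)  # sum of session_size for each user
    for user_id, session_size, article_id in clicks:
        per_pair[(user_id, article_id)] += session_size
        per_user[user_id] += session_size
    return {pair: total / per_user[pair[0]] for pair, total in per_pair.items()}

# Hypothetical log: user 7 has one session of 2 clicks and one session of 3 clicks
clicks = [
    (7, 2, 100), (7, 2, 200),
    (7, 3, 100), (7, 3, 300), (7, 3, 400),
]

toy_ratings = ratings_from_clicks(clicks)
print(toy_ratings)  # article 100 gets 5/13, article 200 gets 2/13, articles 300 and 400 get 3/13
```

With this definition, a user's ratings always sum to 1, so a user who reads many articles gets a smaller rating per article. This is consistent with the value 0.125 shown above for each of the first rows of user 0.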


In [36]:
reader = Reader(rating_scale=(0, 1))

data = Dataset.load_from_df(ratings.sample(frac=0.1, random_state=42), reader=reader)

param_grid = {'n_factors': [20, 50, 100], 'n_epochs': [10, 20, 50],
              'lr_all': [0.002, 0.005, 0.01], 'reg_all': [0.02, 0.04, 0.1]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# Get the best parameters
best_params = gs.best_params['rmse']

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.11419393435585924
{'n_factors': 20, 'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.1}


In [40]:
# Get the best parameters
best_params = gs.best_params['rmse']

In [41]:
from surprise import accuracy

def collaborativeFilteringRecommendArticle(articles, ratings, user_id, n=5):
    # Convert user_id in ratings DataFrame to string
    # ratings['user_id'] = ratings['user_id'].astype(str)
    # user_id = str(user_id)

    # Create a reader and a data object
    reader = Reader(rating_scale=(0, 1))  # ratings range from 0 to 1
    data = Dataset.load_from_df(ratings[['user_id', 'article_id', 'rating']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # Train an SVD model with the best parameters found by the grid search
    algo = SVD(n_factors=best_params['n_factors'], n_epochs=best_params['n_epochs'], lr_all=best_params['lr_all'], reg_all=best_params['reg_all'])
    algo.fit(trainset)

    # Predict ratings for the test set
    predictions = algo.test(testset)

    # Calculate and print the RMSE
    rmse = accuracy.rmse(predictions)
    print(f"RMSE: {rmse}")

    # Calculate and print the MAE
    mae = accuracy.mae(predictions)
    print(f"MAE: {mae}")

    # Get the list of articles read by the user
    articles_read = ratings[ratings['user_id'] == user_id]['article_id'].tolist()

    # Print the number of articles read by the user
    print(f"The user has read {len(articles_read)} articles.")

    # Get the list of all articles
    all_articles = list(articles.index)

    # Remove the articles already read by the user (set lookup is O(1) per article)
    read_set = set(articles_read)
    articles_to_predict = [article for article in all_articles if article not in read_set]

    # Get the predicted ratings for the articles not yet read by the user
    # (raw item ids are integers in `ratings`, so they must not be cast to str)
    predictions = {article: algo.predict(user_id, article).est for article in articles_to_predict}

    # Get the top n articles
    top_n_articles = nlargest(n, predictions, key=predictions.get)

    return top_n_articles

In [44]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import train_test_split
from heapq import nlargest

def collaborativeFilteringRecommendArticle(articles, ratings, user_id, n=5):
    # Create a reader and a data object
    reader = Reader(rating_scale=(0, 1))  # ratings range from 0 to 1
    data = Dataset.load_from_df(ratings[['user_id', 'article_id', 'rating']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # Train a SVD model
    algo = SVD(n_factors=best_params['n_factors'], n_epochs=best_params['n_epochs'], lr_all=best_params['lr_all'], reg_all=best_params['reg_all'])
    algo.fit(trainset)

    # Get the list of articles read by the user
    articles_read = ratings[ratings['user_id'] == user_id]['article_id'].tolist()

    # Get the list of all articles
    all_articles = list(articles.index)

    # Remove the articles already read by the user (set lookup is O(1) per article)
    read_set = set(articles_read)
    articles_to_predict = [article for article in all_articles if article not in read_set]

    # Get the predicted ratings for the articles not yet read by the user
    predictions = {article: algo.predict(user_id, article).est for article in articles_to_predict}

    # Get the top n articles
    top_n_articles = nlargest(n, predictions, key=predictions.get)

    return top_n_articles

In [43]:
def collaborativeFilteringRecommendArticle(articles, clicks, user_id, n=5):

    index = list(articles.index)

    articles_read = clicks[clicks['user_id'] == user_id]['click_article_id'].tolist()

    for ele in articles_read:
        if ele in index:
            index.remove(ele)

    results = dict()

    for i in index:
        pred = model_SVD.predict(user_id, i)
        results[pred.iid] = pred.est
    
    return nlargest(n, results, key = results.get)

In [45]:
user_id = 21  # integer, like the user ids in `ratings`
n_recommendations = 5

# This version of the function expects the ratings DataFrame; passing clicks_df raised
# KeyError: "['article_id', 'rating'] not in index"
recommended_articles = collaborativeFilteringRecommendArticle(embeddings_df, ratings, user_id, n_recommendations)

print(f"Top {n_recommendations} recommended articles for user {user_id}: {recommended_articles}")

In [46]:
data = Dataset.load_from_df(ratings, reader=reader)

trainset = data.build_full_trainset()

model_SVD = gs.best_estimator['rmse']

model_SVD.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x35a4d1e10>

In [64]:
from sklearn.metrics import precision_score, recall_score

def collaborativeFilteringRecommendArticle(articles, clicks, user_id, n=5, evaluate=False):

    index = list(articles.index)

    articles_read = clicks[clicks['user_id'] == user_id]['click_article_id'].tolist()

    for ele in articles_read:
        if ele in index:
            index.remove(ele)

    results = dict()

    for i in index:
        pred = model_SVD.predict(user_id, i)
        results[pred.iid] = pred.est

    recommendations = nlargest(n, results, key = results.get)

    if evaluate:
        # Create a binary list for whether each article was recommended
        recommended = [1 if article in recommendations else 0 for article in articles.index]
        # Create a binary list for whether each article was read
        read = [1 if article in articles_read else 0 for article in articles.index]
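        # Calculate precision and recall
        # Note: every article in articles_read was removed from the candidates above,
        # so no recommendation can match it and both scores are 0 by construction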
        precision = precision_score(read, recommended)
        recall = recall_score(read, recommended)
        return recommendations, precision, recall

    return recommendations

In [65]:
def precision_at_k(recommended, read, k):
    # Only consider top k recommendations
    recommended = recommended[:k]
    if not recommended:
        return 0.0
    # Fraction of the top k recommendations that appear in `read`
    read = set(read)
    return sum(1 for article in recommended if article in read) / len(recommended)

def collaborativeFilteringRecommendArticle(articles, clicks, user_id, n=5, evaluate=False):

    index = list(articles.index)

    articles_read = clicks[clicks['user_id'] == user_id]['click_article_id'].tolist()

    for ele in articles_read:
        if ele in index:
            index.remove(ele)

    results = dict()

    for i in index:
        pred = model_SVD.predict(user_id, i)
        results[pred.iid] = pred.est

    recommendations = nlargest(n, results, key = results.get)

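    if evaluate:
        # Calculate precision at k
        # Note: articles_read was removed from the candidates above, so comparing the
        # recommendations against it always yields 0; held-out clicks are needed instead
        precision = precision_at_k(recommendations, articles_read, n)
        return recommendations, precision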

    return recommendations

In [67]:
test_cf = collaborativeFilteringRecommendArticle(embeddings_df,  clicks_df, 5, evaluate=True)
print(test_cf)

([289003, 74455, 50644, 39894, 36162], 0.0)


In [69]:
# Replace these with actual user_ids
user_ids = [1, 2, 3, 4, 5]

for user_id in user_ids:
    recommendations, precision = collaborativeFilteringRecommendArticle(embeddings_df, clicks_df, user_id, n=5, evaluate=True)
    print(f"User ID: {user_id}")
    print(f"Recommended articles: {recommendations}")
    print(f"Precision at 5: {precision}\n")

User ID: 1
Recommended articles: [74455, 289003, 67185, 192263, 283009]
Precision at 5: 0.0

User ID: 2
Recommended articles: [36193, 289003, 42270, 163686, 74457]
Precision at 5: 0.0

User ID: 3
Recommended articles: [289003, 74455, 283009, 277107, 50644]
Precision at 5: 0.0

User ID: 4
Recommended articles: [316504, 74455, 289003, 213668, 36193]
Precision at 5: 0.0

User ID: 5
Recommended articles: [289003, 74455, 50644, 39894, 36162]
Precision at 5: 0.0
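
A precision of 0.0 for every user is expected by construction here: the function removes `articles_read` from the candidates and then scores the recommendations against the same `articles_read`. A meaningful estimate needs clicks that the recommender never saw. The sketch below shows one possible protocol, holding out each user's last click. All names (`split_last_clicks`, `holdout_precision_at_k`, `toy_recommender`) and the data are hypothetical, and the click lists are assumed to be in chronological order.

```python
def split_last_clicks(clicks_by_user, n_holdout=1):
    """Split each user's ordered clicks into (history, held_out) dictionaries."""
    history, held_out = {}, {}
    for user, articles in clicks_by_user.items():
        if len(articles) > n_holdout:  # skip users with too few clicks to split
            history[user] = articles[:-n_holdout]
            held_out[user] = articles[-n_holdout:]
    return history, held_out

def holdout_precision_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant)
    return sum(1 for article in top_k if article in relevant) / len(top_k)

def toy_recommender(seen):
    # Stand-in for a trained model: a fixed ranking minus the articles already seen
    catalogue = [12, 21, 30, 31, 32]
    return [article for article in catalogue if article not in seen]

# Hypothetical click histories
clicks_by_user = {1: [10, 11, 12], 2: [20, 21]}
history, held_out = split_last_clicks(clicks_by_user)

scores = {}
for user in history:
    recommendations = toy_recommender(history[user])  # only the history is visible
    scores[user] = holdout_precision_at_k(recommendations, held_out[user], k=2)
print(scores)
```

In the notebook, the same idea would mean fitting the model on the history part only, then comparing its top-n list with the held-out clicks.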

