# Movie Recommendation System

This notebook builds a hybrid movie recommendation system. It combines sentiment analysis and topic extraction to suggest movies that match the user's current mood.

The system uses sentiment similarity, topic similarity, and the IMDb rating to generate recommendations.
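
As a rough illustration of how the three signals might be combined, the sketch below computes a weighted sum of two cosine similarities and a rating rescaled to [0, 1]. The function names (`cosine_similarity`, `hybrid_score`), the weights, and the 0 to 10 rating scale are assumptions made for this example; the notebook does not specify them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(user_sentiment, user_topics, movie_sentiment, movie_topics,
                 imdb_rating, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of sentiment similarity, topic similarity and rescaled rating.

    The weights are hypothetical and would need tuning.
    """
    w_sent, w_topic, w_rating = weights
    sent_sim = cosine_similarity(user_sentiment, movie_sentiment)
    topic_sim = cosine_similarity(user_topics, movie_topics)
    rating_norm = imdb_rating / 10.0  # assumes IMDb ratings on a 0-10 scale
    return w_sent * sent_sim + w_topic * topic_sim + w_rating * rating_norm


# Toy vectors, invented for illustration only
score = hybrid_score([1.0, 0.0], [0.5, 0.5], [1.0, 0.0], [0.5, 0.5], imdb_rating=8.0)
print(round(score, 2))
```

Keeping each component in a comparable range (similarities and the rescaled rating all at most 1) means the weights directly express how much each signal contributes.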

## Pipeline

The project pipeline is as follows:
1. Basic data cleaning and tokenization.
2. Fine-tuning a Sentence-BERT encoder for binary sentiment classification.
3. Topic extraction with BERTopic to obtain topic vectors.
4. Building a hybrid recommender that combines sentiment similarity, topic similarity, and the IMDb rating.


In [None]:
import pandas as pd
import os
import re

from datasets import Dataset
from transformers import AutoTokenizer
from sklearn import preprocessing
import spacy


In [None]:
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/AA IPSSI 2025/AA_TIME SERIES/Projet')
os.getcwd()

Mounted at /content/drive


'/content/drive/MyDrive/AA IPSSI 2025/AA_TIME SERIES/Projet'

## 1. Loading and Exploring the Data

The columns relevant to our analysis are:
- label: "positif" if rating > 5, "négatif" otherwise
- movie: the movie title
- review: concatenation of review_summary and review_detail

In [None]:
# Load the data
df = pd.read_json('sample.json')  #'part-06.json'
df.head()

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful
0,rw1133942,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,Good follow up that answers all the questions,24 July 2005,0,"After seeing Tarantino's Kill Bill Vol: 1, I g...","[0, 1]"
1,rw1133943,sentra14,Journey to the Unknown (1968– ),,Excellent series,24 July 2005,0,"I have the entire series on video, taped mostl...","[11, 11]"
2,rw1133946,GreenwheelFan2002,The Island (2005),9.0,"Not just about action, but about survival...",24 July 2005,0,Once again the critics prove themselves as mor...,"[2, 5]"
3,rw1133948,itsascreambaby,Win a Date with Tad Hamilton! (2004),3.0,Falls under the category: seen it a million ti...,24 July 2005,0,This IS a film that has been done too many tim...,"[2, 3]"
4,rw1133949,OriginalMovieBuff21,Saturday Night Live: The Best of Chris Farley ...,10.0,"Before Tommy Boy and Black Sheep, there was Sa...",24 July 2005,0,Chris Farley is one of my favorite comedians a...,"[4, 4]"


In [None]:
df.columns

Index(['review_id', 'reviewer', 'movie', 'rating', 'review_summary',
       'review_date', 'spoiler_tag', 'review_detail', 'helpful'],
      dtype='object')

In [None]:
df.shape

(100000, 9)

**Creating the review and label fields and balancing the data**

In [None]:
def null_values_summary(df, percent=0):
    total = df.isnull().sum()
    pct = (total / len(df)) * 100
    missing_data = pd.concat([total, pct], axis=1, keys=['Total', 'Percent'])
    filtered_data = missing_data[(missing_data['Total'] != 0) & (missing_data['Percent'] >= percent)]
    return filtered_data.sort_values('Total', ascending=False)

In [None]:
def preprocess_dataframe(df):
    """
    Preprocesses the DataFrame by handling missing ratings, creating a 'review' column,
    converting rating to integer, and creating a 'label' column.

    Args:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: The preprocessed DataFrame.
    """
    # Drop rows with missing ratings; .copy() makes the result an independent
    # DataFrame so the assignments below do not raise SettingWithCopyWarning
    df = df.dropna(subset=["rating"]).copy()

    # Create the 'review' column
    df["review"] = df["review_summary"].fillna("") + " " + df["review_detail"].fillna("")

    # Convert the rating to an integer
    df["rating"] = df["rating"].astype(int)

    # Create the 'label' column
    df["label"] = df["rating"].apply(lambda x: "positif" if x > 5 else "négatif")
    #df["label"] = df["rating"].apply(lambda x: "positif" if x > 5 else ("négatif" if x < 5 else "neutre"))

    return df

In [None]:
# Missing values
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
null_values_summary(df)

Unnamed: 0,Total,Percent
rating,12092,12.092


In [None]:
df = preprocess_dataframe(df)
df.head()

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful,review,label
0,rw1133942,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8,Good follow up that answers all the questions,24 July 2005,0,"After seeing Tarantino's Kill Bill Vol: 1, I g...","[0, 1]",Good follow up that answers all the questions ...,positif
2,rw1133946,GreenwheelFan2002,The Island (2005),9,"Not just about action, but about survival...",24 July 2005,0,Once again the critics prove themselves as mor...,"[2, 5]","Not just about action, but about survival... O...",positif
3,rw1133948,itsascreambaby,Win a Date with Tad Hamilton! (2004),3,Falls under the category: seen it a million ti...,24 July 2005,0,This IS a film that has been done too many tim...,"[2, 3]",Falls under the category: seen it a million ti...,négatif
4,rw1133949,OriginalMovieBuff21,Saturday Night Live: The Best of Chris Farley ...,10,"Before Tommy Boy and Black Sheep, there was Sa...",24 July 2005,0,Chris Farley is one of my favorite comedians a...,"[4, 4]","Before Tommy Boy and Black Sheep, there was Sa...",positif
5,rw1133950,Aaron1375,Outlaw Star (1998– ),10,Great anime series soars through the stars.,24 July 2005,0,"I love this anime series, my only complaint is...","[11, 12]",Great anime series soars through the stars. I ...,positif


In [None]:
# # Null ratings set aside for future predictions with the trained model
# invalid_ratings_df = df[df["rating"].isna()] #| (df["rating"] == "")
# invalid_ratings_10= invalid_ratings_df.head(10)
# invalid_ratings_10.to_csv("invalid_ratings.csv", index=False)
# invalid_ratings_10.shape

(10, 9)

In [None]:
df['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
positif,64611
négatif,23297


In [None]:
def balance_positive_negative_samples(df, n_samples=25000, random_state=42):
    """
    Balances the number of 'positif' and 'négatif' samples in the DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame with a 'label' column.
        n_samples (int): The number of 'positif' samples to sample.
        random_state (int): The random state for reproducibility.

    Returns:
        pd.DataFrame: The balanced DataFrame.
    """
    # Sample the positive reviews
    df_positif = df[df['label'] == 'positif'].sample(n=n_samples, random_state=random_state)

    # Keep all negative reviews
    df_negatif = df[df['label'] == 'négatif']

    # Concatenate the sampled positive reviews with all negative reviews
    df_balanced = pd.concat([df_positif, df_negatif])

    # Shuffle the rows to avoid any ordering bias
    df_balanced = df_balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

    return df_balanced

In [None]:
# Use the function to get the balanced DataFrame
data = balance_positive_negative_samples(df)

# Show the label distribution in the balanced DataFrame
data['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
positif,25000
négatif,23297


In [None]:
# Dataset for the database
data.to_json("reviews_labeled.json", orient="records", force_ascii=False, indent=2)

# Keep only the review and label columns
data = data[["review", "label"]]
data.to_csv("reviews.csv", index=False)

## 2. Data Processing and Cleaning

This section covers preprocessing of the text data. We apply basic cleaning steps:
- punctuation removal
- lowercasing
- tokenization
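
To make the three steps concrete, the sketch below applies them to a single sentence using only the standard library. Whitespace splitting stands in for tokenization here; that is an assumption for illustration, since the notebook also imports a Hugging Face `AutoTokenizer` that would typically handle subword tokenization for the model. `basic_tokens` is a hypothetical name.

```python
import re


def basic_tokens(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop anything that is not a word character or whitespace
    return text.split()


tokens = basic_tokens("Good follow-up: answers ALL the questions!")
print(tokens)
```

Note that deleting punctuation (rather than replacing it with a space) merges hyphenated words such as "follow-up" into a single token, "followup".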

In [None]:
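from sklearn import preprocessing

# Assumption: `label_column_name` was undefined in the original cell, and `df` is taken
# to be the balanced `data` frame (review, label) built above.
label_column_name = "label"
df = data.copy()
le = preprocessing.LabelEncoder()
# Classes are sorted alphabetically: "négatif" -> 0, "positif" -> 1
df["label"] = le.fit_transform(df[label_column_name])
#df["label"] = df["label"].map({"négatif": 0, "positif": 1})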

In [None]:
df.head()

Unnamed: 0,review,label
0,Cooler Than Cool Surely one of the suavest mov...,1
1,"Along with 'The Wizard of Oz', the supreme fil...",1
2,"Jeff Costello, a nearly perfect gangster A rar...",1
4,Stylish and cluelessly silly. Inoffensive. The...,0
5,One of the Most Perfect Alibis in a Great Fren...,1


In [None]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text


df["review_clean"] = df["review"].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 3. Fine-tuning the Sentence-BERT Encoder for Sentiment

We fine-tune a Sentence-BERT model for binary sentiment classification (positive/negative).

## 4. Topic Extraction (BERTopic)

We use BERTopic to extract the main topics/sentiments from the movie reviews.

## 5. Building the Recommendation Engine


## 6. Evaluation and Improvement (Future Steps)

This section would describe how to evaluate the performance of the recommendation system.
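
One common way to score a ranked list of recommendations is precision at k: the fraction of the top k recommended items that the user actually found relevant. The sketch below is only an example of such a metric; the notebook does not say which metric is intended, and the relevance set is invented for illustration.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Titles taken from the sample data; the relevance judgments are made up
recommended = [
    "Kill Bill: Vol. 2 (2004)",
    "The Island (2005)",
    "Win a Date with Tad Hamilton! (2004)",
]
relevant = {"Kill Bill: Vol. 2 (2004)", "Win a Date with Tad Hamilton! (2004)"}
p_at_2 = precision_at_k(recommended, relevant, k=2)
print(p_at_2)
```

Averaging this value over many users would give a single number for comparing different weightings of the hybrid score.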
