🧬 08 - Hybrid Recommendation System (Content + Collaborative)
Objectives
Merge the two brains: Thematic Similarity (Content-Based) and Social Affinity (SVD).

Exploit the richness of the Master Clean dataset by drawing on more than 10 columns (Genres, Themes, Synopsis, Ratings, etc.).

Implement a weighted hybrid score to balance precision and discovery.
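
The weighted hybrid score named in the last objective can be sketched as a convex combination of the two similarity signals. This is a minimal illustration on made-up values, not the notebook's final engine; `blend_scores` is a hypothetical helper name.

```python
import numpy as np

def blend_scores(content_sim, collab_sim, alpha=0.8):
    """Weighted blend: alpha weights the content signal, (1 - alpha) the collaborative one."""
    content_sim = np.asarray(content_sim, dtype=float)
    collab_sim = np.asarray(collab_sim, dtype=float)
    return alpha * content_sim + (1.0 - alpha) * collab_sim

# Toy similarities for three candidate titles (made-up values)
content = np.array([0.9, 0.2, 0.5])
collab = np.array([0.1, 0.8, 0.5])

content_heavy = blend_scores(content, collab, alpha=0.8)  # favours candidate 0
collab_heavy = blend_scores(content, collab, alpha=0.2)   # favours candidate 1
```

A high `alpha` keeps recommendations close to the target's content profile, while a low `alpha` lets co-rating behaviour surface titles that look different on paper.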

(Imports & Configuration)

In [5]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Path configuration
DATA_PATH = "../data/processed/"
MASTER_CLEAN = os.path.join(DATA_PATH, "anime_master_clean.csv")
SVD_MATRIX = os.path.join(DATA_PATH, "svd_similarity_matrix.pkl")

(Data loading)

In [6]:
# Load the source of truth
df_master = pd.read_csv(MASTER_CLEAN)

# Retrieve the social signal computed in Notebook 07
if os.path.exists(SVD_MATRIX):
    df_similarity_svd = pd.read_pickle(SVD_MATRIX)
    print(f"✅ SVD matrix loaded: {df_similarity_svd.shape}")
else:
    raise FileNotFoundError("SVD matrix not found. Export it from Notebook 07.")

✅ SVD matrix loaded: (3064, 3064)
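
The SVD matrix covers 3064 titles. If the master table holds more rows than that (its size is not printed here, so this is an assumption), every title without a row in the matrix contributes no collaborative signal later on. A quick coverage check could look like the sketch below; `svd_coverage` is a hypothetical helper and the frames are toy stand-ins for `df_master['mal_id']` and `df_similarity_svd`.

```python
import pandas as pd

def svd_coverage(master_ids, sim_matrix):
    """Share of catalogue ids that have a row in the similarity matrix."""
    ids = pd.Series(master_ids)
    return ids.isin(sim_matrix.index).mean()

# Toy stand-ins: four catalogue ids, two of which appear in the matrix
toy_ids = [1, 5, 20, 21]
toy_sim = pd.DataFrame([[1.0, 0.3], [0.3, 1.0]], index=[1, 20], columns=[1, 20])

coverage = svd_coverage(toy_ids, toy_sim)
print(f"SVD coverage: {coverage:.0%}")
```

On the real data the call would presumably be `svd_coverage(df_master['mal_id'], df_similarity_svd)`, assuming the matrix is indexed by `mal_id`.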


(Feature Engineering - 10+ Columns)

In [7]:
# Build a rich content matrix
# Candidate columns: synopsis, genres, themes, type, rating, score, year, members, completion_rate, fav_count...
# This cell uses the synopsis plus the six numeric columns listed in stat_cols.

# 1. Text processing (synopsis)
tfidf = TfidfVectorizer(stop_words='english', max_features=500)
synopsis_features = tfidf.fit_transform(df_master['synopsis'].fillna('')).toarray()

# 2. Normalize the numeric statistics (6 columns)
# Note: fillna(0) on 'year' widens the min-max range if any year is missing.
scaler = MinMaxScaler()
stat_cols = ['score', 'members', 'year', 'completion_rate', 'drop_rate', 'fav_count']
stats_scaled = scaler.fit_transform(df_master[stat_cols].fillna(0))

# 3. Combine everything into the "DNA" profile of each anime
# Concatenate the features (text + statistics)
content_features = np.hstack([synopsis_features, stats_scaled])
print(f"✅ DNA profile generated with {content_features.shape[1]} features.")

✅ DNA profile generated with 506 features.
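
One detail of this concatenation: `TfidfVectorizer` L2-normalizes each row by default, so the text block has norm at most 1, whereas six min-max columns can reach a norm of about 2.45 (the square root of 6). The statistics may therefore weigh more in a cosine similarity than the 500 text features. The sketch below shows one possible way to control the balance, assuming both blocks are row-normalized first; the `text_weight` value and the toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, normalize

docs = [
    "a young ninja trains to become village leader",
    "a ninja clan fights a secret war",
    "two students fall in love at university",
]
stats = np.array([[8.0, 50_000], [7.5, 3_000_000], [7.7, 1_100_000]])  # toy score, members

# Text block: TF-IDF rows are already unit-length (norm='l2' is the default)
text_block = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
# Stat block: min-max scale, then L2-normalize each row to match the text block
stat_block = normalize(MinMaxScaler().fit_transform(stats))

text_weight = 0.7  # hypothetical share of the similarity given to the synopsis
features = np.hstack([
    np.sqrt(text_weight) * text_block,
    np.sqrt(1 - text_weight) * stat_block,
])

# Rows of `features` have unit norm, so the dot product is the cosine similarity
combined_sim = features @ features.T
```

With unit-norm blocks, the combined cosine equals `text_weight` times the text cosine plus `1 - text_weight` times the statistics cosine, which makes the balance explicit. The identity breaks for rows whose statistics are all at the column minimum, since such rows cannot be normalized.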


In [8]:
def prepare_content_matrix(df):
    # 1. Text analysis of the synopsis (NLP)
    tfidf = TfidfVectorizer(stop_words='english', max_features=500)
    synopsis_features = tfidf.fit_transform(df['synopsis'].fillna('')).toarray()
    
    # 2. Normalize the statistics (score, members, year, completion, drop, favorites)
    # Uses 6 columns from the Master Clean dataset
    scaler = MinMaxScaler()
    stat_cols = ['score', 'members', 'year', 'completion_rate', 'drop_rate', 'fav_count']
    stats_scaled = scaler.fit_transform(df[stat_cols].fillna(0))
    
    # Fusion: combine text + stats into a single DNA profile per anime
    return np.hstack([synopsis_features, stats_scaled])

content_matrix = prepare_content_matrix(df_master)
print(f"✅ DNA generated: {content_matrix.shape[1]} features used.")

✅ DNA generated: 506 features used.
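
The 506 features come from the synopsis and six numeric columns; genres and themes are not encoded yet. A minimal sketch of how a list-valued column could be turned into a multi-hot block, assuming `genres_list` is stored in the CSV as a stringified Python list (the printed results further down suggest so, but this is not confirmed). `parse_list` is a hypothetical helper and the frame is toy data.

```python
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def parse_list(cell):
    """Turn a stringified list such as "['Action', 'Drama']" into a Python list."""
    if isinstance(cell, list):
        return cell
    if isinstance(cell, str) and cell.strip().startswith("["):
        return ast.literal_eval(cell)
    return []  # missing value or unexpected format -> no genres

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres_list": ["['Action', 'Fantasy']", "['Drama']", None],
})

mlb = MultiLabelBinarizer()
genre_block = mlb.fit_transform(toy["genres_list"].apply(parse_list))
print(list(mlb.classes_))  # column order of the multi-hot block
print(genre_block)
```

The resulting block could then be appended with `np.hstack` next to the text and statistics blocks, at the cost of changing the feature count reported above.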


(Hybrid Engine & Inference)

In [9]:
# --- Recalibrated hybrid engine ---
def get_hybrid_recs(anime_title, alpha=0.8, n_recs=10):
    # 1. Identify the target: first title containing the query (substring match,
    #    so the row picked is not necessarily the exact title)
    matches = df_master.index[df_master['title'].str.contains(anime_title, case=False, na=False)]
    if len(matches) == 0:
        raise ValueError(f"No title contains '{anime_title}'")
    idx = df_master.index.get_loc(matches[0])  # positional index into content_matrix
    target_id = df_master.iloc[idx]['mal_id']
    
    # --- BRAIN 1: DNA (weight alpha, 80% by default) ---
    # Content similarity from the synopsis TF-IDF and the numeric statistics
    content_sim = cosine_similarity(content_matrix[idx].reshape(1, -1), content_matrix).flatten()
    
    # --- BRAIN 2: Social (weight 1 - alpha, 20% by default) ---
    # Taken from the SVD matrix of Notebook 07; titles absent from it get 0
    if target_id in df_similarity_svd.index:
        svd_sim = df_master['mal_id'].map(df_similarity_svd.loc[target_id]).fillna(0).values
    else:
        svd_sim = np.zeros(len(df_master))
        
    # FUSION: blend the two signals, giving priority to the DNA
    final_scores = (alpha * content_sim) + ((1 - alpha) * svd_sim)
    
    # RECALIBRATION: popularity boost through log scaling
    # (assign returns a copy, so df_master itself is left untouched)
    boost = np.log10(df_master['members'].fillna(1) + 1)
    scored = df_master.assign(relevance_score=final_scores * boost)
    
    # Cleanup: drop the target itself and, unless the target is Ecchi, drop Ecchi titles
    res = scored[scored['mal_id'] != target_id]
    if 'ecchi' not in str(scored.iloc[idx]['genres_list']).lower():
        res = res[~res['genres_str'].str.contains('Ecchi', case=False, na=False)]
    
    return res.sort_values('relevance_score', ascending=False).head(n_recs)[
        ['title', 'genres_list', 'rating', 'score', 'members', 'relevance_score']
    ]

# FINAL TEST
get_hybrid_recs("Naruto", alpha=0.8)

       title                      genres_list                              rating                          score  members  relevance_score
16814  Naruto                     ['Action', 'Adventure', 'Fantasy']       PG-13 - Teens 13 or older        8.01  3035328  3.830703
16826  Naruto: Shippuuden         ['Action', 'Adventure', 'Fantasy']       PG-13 - Teens 13 or older        8.28  2668197  3.620484
 7575  Golden Time                ['Drama', 'Romance']                     PG-13 - Teens 13 or older        7.74  1109634  3.358148
 9000  Higurashi no Naku Koro ni  ['Horror', 'Mystery', 'Suspense']        R - 17+ (violence & profanity)   7.87   842969  3.34122
 2493  Bleach                     ['Action', 'Adventure', 'Supernatural']  PG-13 - Teens 13 or older        7.98  2134926  3.293337
 2445  Black Clover               ['Action', 'Fantasy']                    PG-13 - Teens 13 or older        8.14  1831877  3.255047
26163  Violet Evergarden          ['Drama']                                PG-13 - Teens 13 or older        8.69  1933489  3.24265
17983  One Punch Man              ['Action', 'Comedy']                     R - 17+ (violence & profanity)   8.48  3430848  3.219387
26154  Vinland Saga Season 2      ['Action', 'Adventure', 'Drama']         R - 17+ (violence & profanity)   8.82   784755  3.213663
 3334  Chainsaw Man               ['Action', 'Fantasy']                    R - 17+ (violence & profanity)   8.44  1813584  3.188493
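
The result above lists Naruto itself even though the function drops the target row, which suggests that the substring search resolved "Naruto" to another title containing that word. Below is a minimal sketch of a stricter lookup, assuming an exact (case-insensitive) match should win over a substring match. `resolve_title` is a hypothetical helper and the titles are toy data, not rows taken from the dataset.

```python
import pandas as pd

def resolve_title(df, query, title_col="title"):
    """Return the positional index of the best title match for `query`."""
    titles = df[title_col].fillna("").str.lower()
    q = query.lower()

    # 1. Prefer an exact, case-insensitive match
    exact = (titles == q).to_numpy().nonzero()[0]
    if len(exact):
        return int(exact[0])

    # 2. Fall back to the first literal substring match
    partial = titles.str.contains(q, regex=False).to_numpy().nonzero()[0]
    if len(partial):
        return int(partial[0])

    raise ValueError(f"No title matches {query!r}")

toy = pd.DataFrame({
    "title": ["Boruto: Naruto Next Generations", "Naruto", "Naruto: Shippuuden", None],
})

pos_exact = resolve_title(toy, "Naruto")        # exact match beats the earlier substring hit
pos_partial = resolve_title(toy, "shippuuden")  # no exact match, substring fallback
```

The return value is positional, so it could replace the lookup at the top of `get_hybrid_recs` as long as the rows of `df_master` and `content_matrix` share the same order.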
