🧬 08 - Hybrid Recommendation System (Content + Collaborative)
Objectives
Merge the two brains: Thematic Similarity (Content-Based) and Social Affinity (SVD).

Exploit the richness of the Master Clean dataset by drawing on more than 10 columns (Genres, Themes, Synopsis, Ratings, etc.).

Implement a weighted hybrid score to balance precision and discovery.

(Imports & Configuration)

In [17]:
# --- Modified imports cell ---
import pandas as pd
import numpy as np
import os
import yaml  # Added: used to load the YAML configuration
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# --- New cell: load the configuration ---
with open("../config/config_hybrid.yaml", 'r', encoding="utf-8") as stream:
    config = yaml.safe_load(stream)

MASTER_PATH = config['data']['dataset_path']
SVD_PATH = config['data']['svd_matrix_path']
TFIDF_MAX = config['features']['tfidf_max_features']
DNA_COLS = config['features']['dna_stats_cols']
ALPHA = config['model']['hybrid_alpha']

# --- Loading cell (uses the YAML settings) ---
df_master = pd.read_csv(MASTER_PATH)
df_similarity_svd = pd.read_pickle(SVD_PATH)

# --- In the DNA-matrix preparation function ---
# Use TFIDF_MAX and DNA_COLS
tfidf = TfidfVectorizer(stop_words='english', max_features=TFIDF_MAX)
# ...
scaler = MinMaxScaler()
stats_scaled = scaler.fit_transform(df_master[DNA_COLS].fillna(0))
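
The configuration cell above reads five settings from config_hybrid.yaml, but the file itself is not shown. As a minimal sketch, the check below spells out the nested structure those lookups imply and fails early if a key is missing. `validate_config` is a hypothetical helper, not part of the notebook, and every sample value (the paths, 500, the column list, 0.8) is a placeholder assumed for illustration.

```python
# Hypothetical helper: checks the nested keys the notebook reads from the YAML.
REQUIRED_KEYS = {
    "data": ["dataset_path", "svd_matrix_path"],
    "features": ["tfidf_max_features", "dna_stats_cols"],
    "model": ["hybrid_alpha"],
}


def validate_config(config):
    """Raise KeyError/ValueError if the config lacks a setting the notebook uses."""
    for section, keys in REQUIRED_KEYS.items():
        if section not in config:
            raise KeyError(f"missing section: {section}")
        for key in keys:
            if key not in config[section]:
                raise KeyError(f"missing key: {section}.{key}")
    alpha = config["model"]["hybrid_alpha"]
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f"hybrid_alpha must be in [0, 1], got {alpha}")
    if config["features"]["tfidf_max_features"] <= 0:
        raise ValueError("tfidf_max_features must be positive")
    return config


# Placeholder values for illustration only (assumed, not taken from the real file).
example_config = {
    "data": {"dataset_path": "path/to/master.csv", "svd_matrix_path": "path/to/svd.pkl"},
    "features": {
        "tfidf_max_features": 500,
        "dna_stats_cols": ["score", "members", "year", "completion_rate", "drop_rate", "fav_count"],
    },
    "model": {"hybrid_alpha": 0.8},
}

validate_config(example_config)
```

In the notebook, the same check could run on the dictionary returned by yaml.safe_load, before any path or parameter is read from it.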

(Data Loading)

In [18]:

# Use MASTER_PATH (defined via YAML) instead of MASTER_CLEAN
df_master = pd.read_csv(MASTER_PATH)

# Use SVD_PATH (defined via YAML) instead of SVD_MATRIX
if os.path.exists(SVD_PATH):
    df_similarity_svd = pd.read_pickle(SVD_PATH)
    print(f"✅ SVD matrix loaded: {df_similarity_svd.shape}")
else:
    raise FileNotFoundError(f"SVD matrix not found at: {SVD_PATH}")

✅ SVD matrix loaded: (3064, 3064)
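
The SVD matrix covers 3,064 anime, while the result table at the end of the notebook shows row indices above 25,000, so many titles in df_master likely have no collaborative signal at all. A quick coverage check makes that visible before blending. This is a sketch on invented IDs; `svd_coverage` is a hypothetical helper, not something defined in the notebook.

```python
def svd_coverage(master_ids, svd_ids):
    """Return the fraction of master IDs that also appear in the SVD matrix index."""
    master = set(master_ids)
    if not master:
        return 0.0
    return len(master & set(svd_ids)) / len(master)


# Toy IDs, invented for illustration.
master_ids = [1, 5, 6, 7, 8, 15, 16, 17]
svd_ids = [1, 5, 6, 99]

coverage = svd_coverage(master_ids, svd_ids)
print(f"SVD coverage: {coverage:.1%}")  # 3 of the 8 master IDs are covered
```

In the notebook this could be called as svd_coverage(df_master['mal_id'], df_similarity_svd.index), assuming the matrix is indexed by mal_id, as the membership test in the recommendation code suggests.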


Code (Feature Engineering - 10+ Columns)

In [19]:
# Build a rich content matrix. The dataset offers 10+ candidate columns:
# synopsis, genres, themes, type, rating, score, year, members, completion_rate, fav_count...
# This cell encodes the synopsis (TF-IDF) plus six numeric statistics.

# 1. Text processing (synopsis)
tfidf = TfidfVectorizer(stop_words='english', max_features=500)
synopsis_features = tfidf.fit_transform(df_master['synopsis'].fillna('')).toarray()

# 2. Normalize the numeric statistics (6 columns)
scaler = MinMaxScaler()
stat_cols = ['score', 'members', 'year', 'completion_rate', 'drop_rate', 'fav_count']
stats_scaled = scaler.fit_transform(df_master[stat_cols].fillna(0))

# 3. Combine them into each anime's "DNA" profile
# Concatenate the features (text + statistics)
content_features = np.hstack([synopsis_features, stats_scaled])
print(f"✅ DNA profile generated with {content_features.shape[1]} features.")

✅ DNA profile generated with 506 features.


In [20]:
def prepare_content_matrix(df):
    # NLP features, with max_features taken from the YAML config
    tfidf = TfidfVectorizer(stop_words='english', max_features=TFIDF_MAX)
    synopsis_features = tfidf.fit_transform(df['synopsis'].fillna('')).toarray()

    # Numeric statistics, with the column list taken from the YAML config
    scaler = MinMaxScaler()
    stats_scaled = scaler.fit_transform(df[DNA_COLS].fillna(0))

    # One row per anime: [TF-IDF features | scaled statistics]
    return np.hstack([synopsis_features, stats_scaled])

content_matrix = prepare_content_matrix(df_master)
print(f"✅ DNA generated: {content_matrix.shape[1]} features used.")

✅ DNA generated: 506 features used.
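
Because the profile is a plain concatenation, cosine similarity mixes the text block and the statistics block according to their vector norms. TfidfVectorizer L2-normalizes each row by default, so the text part has norm at most 1, while six MinMax-scaled statistics can reach a norm of about 2.45 (the square root of 6). The sketch below uses invented vectors to show how the statistics can dominate the similarity, and how an optional block weight changes that. The `stats_weight` parameter and both helper functions are hypothetical; the notebook does not weight its blocks.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def profile(text_vec, stats_vec, stats_weight=1.0):
    """Concatenate a text block and an optionally down-weighted statistics block."""
    return np.hstack([text_vec, stats_weight * np.asarray(stats_vec)])


# Invented toy data: two anime with unrelated synopses but near-identical statistics.
text_a = np.array([1.0, 0.0, 0.0])  # unit-norm rows, like TF-IDF output
text_b = np.array([0.0, 1.0, 0.0])
stats_a = np.array([0.9, 0.8, 0.9, 0.7, 0.1, 0.8])
stats_b = np.array([0.9, 0.8, 0.9, 0.7, 0.1, 0.7])

text_only = cosine(text_a, text_b)
plain = cosine(profile(text_a, stats_a), profile(text_b, stats_b))
damped = cosine(profile(text_a, stats_a, 0.25), profile(text_b, stats_b, 0.25))

print(f"text only: {text_only:.3f}, plain concat: {plain:.3f}, stats x0.25: {damped:.3f}")
```

With these toy values the synopses share no terms, yet the plain concatenation still reports a high similarity because the statistics carry most of the vector norm. Whether that balance is desirable depends on how much weight popularity-style statistics should have in the "DNA" profile.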


(Parameter Optimization - Alpha Grid Search)

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# --- OPTIMIZATION PHASE (ALPHA TUNING) ---
print("🧪 Searching for the optimal hybrid balance...")

# 1. Prepare the test data (the first anime in the dataset is used as the probe)
idx_test = 0
target_id = df_master.iloc[idx_test]['mal_id']
target_title = df_master.iloc[idx_test]['title']

print(f"Analysis based on the anime: {target_title}")

# Compute both kinds of similarity for this anime
# Content similarity (DNA)
c_sim = cosine_similarity(content_matrix[idx_test].reshape(1, -1), content_matrix).flatten()

# Social similarity (SVD), aligned to the rows of df_master by mal_id.
# Anime absent from the SVD matrix get a similarity of 0.
if target_id in df_similarity_svd.index:
    s_sim = df_master['mal_id'].map(df_similarity_svd.loc[target_id]).fillna(0).values
else:
    s_sim = np.zeros(len(df_master))

# 2. Quality-score function
def calculate_quality_score(alpha, content_scores, svd_scores):
    # Weighted blend
    hybrid_score = (alpha * content_scores) + ((1 - alpha) * svd_scores)
    # Heuristic: mean strength of the 10 best recommendations.
    # It favors whichever signal has larger raw values; it is not a ground-truth metric.
    return np.mean(np.sort(hybrid_score)[-10:])

# 3. Test loop
alphas_to_test = [0.1, 0.3, 0.5, 0.7, 0.9]
best_alpha = 0.5
max_quality = -np.inf  # any observed score can become the best

for a in alphas_to_test:
    q_score = calculate_quality_score(a, c_sim, s_sim)
    print(f"Alpha {a:.1f} -> Observed quality score: {q_score:.4f}")

    if q_score > max_quality:
        max_quality = q_score
        best_alpha = a

print(f"\n🏆 The search selected Alpha = {best_alpha} as the best trade-off.")

# --- 📝 UPDATE ---
ALPHA = best_alpha
print(f"✅ ALPHA is now set to {ALPHA}.")

🚀 Optimizing the hybrid balance (actual formula)...
Test Alpha 0.1 : Signal strength = 0.2622
Test Alpha 0.2 : Signal strength = 0.5243
Test Alpha 0.3 : Signal strength = 0.7865
Test Alpha 0.4 : Signal strength = 1.0486
Test Alpha 0.5 : Signal strength = 1.3108
Test Alpha 0.6 : Signal strength = 1.5730
Test Alpha 0.7 : Signal strength = 1.8351
Test Alpha 0.8 : Signal strength = 2.0973
Test Alpha 0.9 : Signal strength = 2.3594

🏆 BEST ALPHA FOUND: 0.9
This is the value that maximizes relevance with your 'Popularity Boost' formula.
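
The logged signal strengths grow almost exactly in proportion to alpha (roughly 2.62 times alpha), so a metric based on the magnitude of the top scores picks the largest alpha tested, whatever the quality of the recommendations. One hedged alternative is to score each alpha against items known to be relevant, for example with precision@k. The sketch below assumes such a relevance set exists, which the notebook does not provide: the `relevant` indices, the similarity arrays, and both function names are invented for illustration.

```python
import numpy as np


def precision_at_k(scores, relevant, k=3, exclude=None):
    """Fraction of the k highest-scoring items that belong to the relevant set."""
    scores = np.asarray(scores, dtype=float).copy()
    if exclude is not None:
        scores[exclude] = -np.inf  # never recommend the query item itself
    top_k = np.argsort(scores)[::-1][:k]
    return len(set(top_k.tolist()) & set(relevant)) / k


def search_alpha(content_sim, svd_sim, relevant, alphas, k=3, exclude=None):
    """Return (best_alpha, best_precision) over a grid of alpha values."""
    best_alpha, best_p = None, -1.0
    for alpha in alphas:
        hybrid = alpha * content_sim + (1 - alpha) * svd_sim
        p = precision_at_k(hybrid, relevant, k=k, exclude=exclude)
        if p > best_p:  # ties keep the first (smallest) alpha
            best_alpha, best_p = alpha, p
    return best_alpha, best_p


# Invented toy data: item 0 is the query; items 1, 2 and 3 are assumed relevant.
content_sim = np.array([1.0, 0.9, 0.2, 0.1, 0.8, 0.7])
svd_sim = np.array([1.0, 0.3, 0.9, 0.8, 0.1, 0.2])
relevant = {1, 2, 3}

best_alpha, best_p = search_alpha(
    content_sim, svd_sim, relevant, alphas=[0.1, 0.3, 0.5, 0.7, 0.9], exclude=0
)
print(f"best alpha = {best_alpha}, precision@3 = {best_p:.2f}")
```

Unlike the mean of the top scores, this metric does not depend on the raw scale of either similarity, only on which items end up in the top k.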


(Hybrid Engine & Inference)

In [27]:
# --- REPLACEMENT: Recalibrated hybrid engine ---
def get_hybrid_recs(anime_title, alpha=0.8, n_recs=10):
    # 1. Identify the target: first title containing the query (literal, case-insensitive)
    matches = df_master['title'].str.contains(anime_title, case=False, regex=False, na=False)
    if not matches.any():
        raise ValueError(f"No anime title contains: {anime_title!r}")
    idx = int(np.flatnonzero(matches.to_numpy())[0])  # positional index, safe for iloc
    target_id = df_master.iloc[idx]['mal_id']

    # --- BRAIN 1: DNA (weight = alpha) ---
    # Thematic resemblance computed from the content matrix (synopsis TF-IDF + statistics)
    content_sim = cosine_similarity(content_matrix[idx].reshape(1, -1), content_matrix).flatten()

    # --- BRAIN 2: Social (weight = 1 - alpha) ---
    # Taken from the SVD matrix built in Notebook 07, aligned to df_master by mal_id
    if target_id in df_similarity_svd.index:
        svd_sim = df_master['mal_id'].map(df_similarity_svd.loc[target_id]).fillna(0).values
    else:
        svd_sim = np.zeros(len(df_master))

    # BLEND: weighted mix controlled by the alpha argument
    final_scores = (alpha * content_sim) + ((1 - alpha) * svd_sim)

    # RECALIBRATION: popularity boost via log scaling
    df_master['relevance_score'] = final_scores * np.log10(df_master['members'].fillna(1) + 1)

    # Cleanup: drop the target itself, and hide 'Ecchi' titles unless the target is Ecchi
    res = df_master[df_master['mal_id'] != target_id]
    if 'ecchi' not in str(df_master.iloc[idx]['genres_list']).lower():
        res = res[~res['genres_str'].str.contains('Ecchi', case=False, na=False)]

    return res.sort_values('relevance_score', ascending=False).head(n_recs)[
        ['title', 'genres_list', 'rating', 'score', 'members', 'relevance_score']
    ]

# FINAL TEST
get_hybrid_recs("Kawaii dake ja Nai Shikimori-san", alpha=ALPHA)

index,title,genres_list,rating,score,members,relevance_score
25097,Toradora!,"['Drama', 'Romance']",PG-13 - Teens 13 or older,8.04,2318129,4.095022
4012,Clannad,"['Drama', 'Romance']",PG-13 - Teens 13 or older,7.99,1480574,4.060356
21129,Seishun Buta Yarou wa Bunny Girl Senpai no Yum...,"['Drama', 'Romance', 'Supernatural']",PG-13 - Teens 13 or older,8.23,1894065,4.057907
17340,Nisekoi,"['Comedy', 'Romance']",PG-13 - Teens 13 or older,7.55,1224489,4.042125
9395,Horimiya,['Romance'],PG-13 - Teens 13 or older,8.19,1559915,4.040972
12888,"Komi-san wa, Comyushou desu.",['Comedy'],PG-13 - Teens 13 or older,7.81,977740,4.023463
20016,ReLIFE,"['Drama', 'Romance']",PG-13 - Teens 13 or older,7.96,1111693,4.022584
4094,Code Geass: Hangyaku no Lelouch R2,"['Action', 'Award Winning', 'Drama', 'Sci-Fi']",R - 17+ (violence & profanity),8.91,1900357,3.98622
1133,Ao Haru Ride,['Romance'],PG-13 - Teens 13 or older,7.63,990250,3.965876
17480,Noragami,"['Action', 'Supernatural']",PG-13 - Teens 13 or older,7.94,2273787,3.964966
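
The relevance score in the table is the blended similarity multiplied by log10(members + 1). For Toradora!, with 2,318,129 members, the multiplier is about 6.37, so a relevance of 4.095 corresponds to a blended similarity of roughly 0.64. The sketch below isolates that formula as a side-effect-free function on invented arrays. `hybrid_relevance` is a hypothetical name, and unlike the notebook code it assumes the member counts contain no missing values.

```python
import numpy as np


def hybrid_relevance(content_sim, svd_sim, members, alpha):
    """Blend two similarity vectors, then apply the log10 popularity boost."""
    content_sim = np.asarray(content_sim, dtype=float)
    svd_sim = np.asarray(svd_sim, dtype=float)
    members = np.asarray(members, dtype=float)
    blended = alpha * content_sim + (1 - alpha) * svd_sim
    return blended * np.log10(members + 1)


# Invented toy values: a niche close match versus a popular looser match.
content_sim = [0.90, 0.60]
svd_sim = [0.50, 0.70]
members = [999, 999_999]

scores = hybrid_relevance(content_sim, svd_sim, members, alpha=0.8)
print(scores)  # the popular title ends up ranked first despite a lower blend
```

Since the boost roughly doubles between a thousand-member and a million-member title, popularity can outweigh similarity in the final ranking. That may be why very popular titles from other genres appear in the list above.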
