# Notebook 05 – Baseline Models (TF-IDF + Cosine Similarity)

## Objectives
- Build a basic content-based recommender system
- Evaluate performance with suitable metrics (Precision@K, Recall@K)
- Establish a baseline for comparison with the upcoming ML/DL models


### 1. Loading the data and preprocessing

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import joblib

# Load the data
df = pd.read_csv("../data/processed/anime_clean.csv")
features = joblib.load("../data/processed/final_features_sparse.pkl")

print(f"Features shape: {features.shape}")
print(f"Dataframe shape: {df.shape}")


### 2. Computing content similarity

We start from a simple model based on the cosine similarity between anime.
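
As a reminder, the cosine similarity of two feature vectors is their dot product divided by the product of their L2 norms, so it measures direction rather than magnitude. The sketch below illustrates this on toy vectors; `cosine_sim`, `v0`, `v1` and `v2` are hypothetical names and made-up values, not part of the notebook or the dataset.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy feature vectors, not taken from the anime dataset
v0 = np.array([1.0, 0.0, 1.0])
v1 = np.array([2.0, 0.0, 2.0])  # same direction as v0, different magnitude
v2 = np.array([0.0, 3.0, 0.0])  # orthogonal to v0

print(f"sim(v0, v1) = {cosine_sim(v0, v1):.2f}")
print(f"sim(v0, v2) = {cosine_sim(v0, v2):.2f}")
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.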

In [None]:
# Compute the similarity matrix (restricted to the first 5000 rows if the full matrix is too heavy)
sample_size = min(5000, features.shape[0])
similarity_matrix = cosine_similarity(features[:sample_size])

print(f"✅ Similarity matrix computed: {similarity_matrix.shape}")


### 3. Recommendation function

In [None]:
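def get_recommendations(title, top_n=10):
    """Return the anime most similar to a given title."""
    matches = np.flatnonzero(df['title'].values == title)
    if len(matches) == 0:
        return f"⚠️ '{title}' not found in the database."

    idx = matches[0]  # positional index, consistent with df.iloc and the matrix rows
    if idx >= similarity_matrix.shape[0]:
        # The matrix only covers the first sample_size rows
        return f"⚠️ '{title}' is outside the sample covered by the similarity matrix."

    sim_scores = list(enumerate(similarity_matrix[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the query itself, then keep the top_n neighbours
    sim_scores = [s for s in sim_scores if s[0] != idx][:top_n]

    anime_indices = [i[0] for i in sim_scores]
    # .copy() avoids a SettingWithCopyWarning when adding the similarity column
    recommendations = df.iloc[anime_indices][['title', 'score', 'genres', 'type', 'year']].copy()
    recommendations['similarity'] = [i[1] for i in sim_scores]
    return recommendations.reset_index(drop=True)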
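
The ranking step can be illustrated in isolation on a small similarity matrix. This is a minimal sketch under assumptions: `top_n_similar` and `toy_sim` are hypothetical names with made-up values. It filters out the query index explicitly instead of relying on the query being ranked first, which matters when two items have identical features.

```python
import numpy as np

def top_n_similar(sim_matrix, idx, top_n=3):
    """Return (index, score) pairs for the top_n most similar items, excluding idx."""
    scores = sim_matrix[idx]
    order = np.argsort(-scores, kind="stable")  # indices by descending similarity
    order = order[order != idx][:top_n]         # drop the query item explicitly
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical 4x4 similarity matrix (symmetric, ones on the diagonal)
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.4],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.8],
    [0.4, 0.3, 0.8, 1.0],
])

print(top_n_similar(toy_sim, idx=0, top_n=2))
```

For item 0, the two nearest neighbours are item 1 (0.9) and item 3 (0.4).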


### 4. Testing the model

In [None]:
# Test example
example_title = df.iloc[100]['title']
print(f"🎬 Recommendations for: {example_title}")
recs = get_recommendations(example_title, top_n=5)
display(recs)


### 5. Baseline evaluation – Recommendation metrics

In [None]:
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

# Simulated example: 10 random "Action" titles stand in for a user's relevant items
relevant_items = set(df[df['genres'].str.contains('Action', na=False)].sample(10)['title'])
# Keep the recommendations as an ordered list so that the top-k cut is meaningful
recommended_items = list(get_recommendations(example_title, top_n=10)['title'])

print(f"Precision@10 = {precision_at_k(recommended_items, relevant_items, 10):.2f}")
print(f"Recall@10 = {recall_at_k(recommended_items, relevant_items, 10):.2f}")
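
To make the two metrics concrete: Precision@K is the number of relevant items among the top K recommendations divided by K, and Recall@K is the same count divided by the total number of relevant items. The worked example below uses made-up data; `toy_recommended` and `toy_relevant` are hypothetical and unrelated to the anime dataset.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    """Share of the relevant items found in the top-k recommendations."""
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

# Hypothetical ranked recommendations and ground-truth relevant items
toy_recommended = ["A", "B", "C", "D", "E"]
toy_relevant = {"B", "D", "F", "G"}

# Top 3 is ["A", "B", "C"]: one hit ("B") out of 3 recommended and 4 relevant
p_at_3 = precision_at_k(toy_recommended, toy_relevant, 3)
r_at_3 = recall_at_k(toy_recommended, toy_relevant, 3)

# Top 5 adds "D": two hits out of 5 recommended and 4 relevant
p_at_5 = precision_at_k(toy_recommended, toy_relevant, 5)
r_at_5 = recall_at_k(toy_recommended, toy_relevant, 5)

print(f"Precision@3 = {p_at_3:.2f}, Recall@3 = {r_at_3:.2f}")
print(f"Precision@5 = {p_at_5:.2f}, Recall@5 = {r_at_5:.2f}")
```

The order of `toy_recommended` matters: slicing `[:k]` is only meaningful on a ranked list, not on a set.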


### 6. Visualizing similarities

In [None]:
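import matplotlib.pyplot as plt
import seaborn as sns

# Row of the similarity matrix for the example title (position 100, as in section 4)
example_idx = 100
# The 20 most similar anime, excluding the example itself
order = [i for i in np.argsort(similarity_matrix[example_idx])[::-1] if i != example_idx][:20]
top_sim = similarity_matrix[example_idx][order]
titles = df['title'].iloc[order].values

plt.figure(figsize=(10, 6))
sns.barplot(x=top_sim, y=titles)
plt.title(f"Top similarities for {example_title}")
plt.xlabel("Similarity score")
plt.show()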


### 7. Saving and preparing the model for Streamlit

In [None]:
import pickle

# Save the similarity matrix (the context manager closes the file properly)
with open("../models/baseline_similarity.pkl", "wb") as f:
    pickle.dump(similarity_matrix, f)
print("✅ Baseline model saved successfully!")
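
On the Streamlit side, the saved matrix would presumably be read back with `pickle.load`. The round-trip sketch below uses a temporary file and a tiny identity matrix as stand-ins; `toy_matrix` and the file name `toy_similarity.pkl` are assumptions for illustration, not the real `../models/baseline_similarity.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the real similarity matrix (hypothetical toy data)
toy_matrix = np.eye(3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_similarity.pkl")  # hypothetical file name

    # Save the matrix; the context manager closes the file handle
    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)

    # Load it back, as an app would do at startup
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(np.array_equal(toy_matrix, loaded))
```

Since the matrix rows correspond to the first `sample_size` rows of `df`, any consumer of the file needs the dataframe in the same row order to map indices back to titles.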


📘 Summary

In [None]:
## 📘 Summary - Baseline Models

| Item | Method | Details |
|----------|----------|----------|
| Model type | Content-Based | TF-IDF + Cosine Similarity |
| Data used | features_engineered.csv | Text, numerical, categorical |
| Simulated metrics | Precision@K, Recall@K | (User profiling to come) |
| Strengths | Simple, interpretable, fast | |
| Limitations | No user personalization | |
| Next | Notebook 06 – Machine Learning (RandomForest, XGBoost) | |
