# Exploratory Data Analysis (EDA)
## MovieLens 25M Dataset

This notebook analyzes the MovieLens dataset used in our hybrid recommender system.

**Contents:**
1. Loading and previewing the data
2. Rating distribution
3. Genre analysis
4. User activity
5. Temporal analysis
6. Most popular and top-rated movies
7. Sparsity and advanced statistics

In [3]:
import pandas as pd
import numpy as np
import matplotlib
# Non-interactive backend: figures are written to disk by savefig() below
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import os, warnings
warnings.filterwarnings('ignore')

sns.set_theme(style='darkgrid', palette='viridis')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['figure.dpi'] = 100

os.makedirs('figures', exist_ok=True)
print('Libraries loaded')

Libraries loaded


---
## 1. Loading and Previewing the Data

In [4]:
movies = pd.read_csv('../movies.csv')

# Load only the first 2 million ratings to save memory (the full dataset has ~25M rows).
# Note: nrows keeps the first rows of the file (the preview below starts at userId 1),
# so this is a prefix of the data, not a random sample.
ratings = pd.read_csv('../ratings.csv', nrows=2_000_000)

print('Movies:   {:,} rows, {} columns'.format(movies.shape[0], movies.shape[1]))
print('Ratings:  {:,} rows (sample), {} columns'.format(ratings.shape[0], ratings.shape[1]))
print('(Full dataset: ~25,000,000 ratings)')
print()
print('--- movies.csv (first 5 rows) ---')
print(movies.head().to_string())
print()
print('--- ratings.csv (first 5 rows) ---')
print(ratings.head().to_string())

Movies:   62,423 rows, 3 columns
Ratings:  2,000,000 rows (sample), 4 columns
(Full dataset: ~25,000,000 ratings)

--- movies.csv (first 5 rows) ---
   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy

--- ratings.csv (first 5 rows) ---
   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510
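
Because `nrows` keeps a prefix of the file, a sample made of whole users drawn from across the file may be more representative for the activity and temporal plots. The following is a minimal sketch of that idea, not what this notebook does: the helper name `sample_users_chunked`, the modulus-based user selection, and the chunk size are illustrative assumptions.

```python
import io
import pandas as pd

def sample_users_chunked(csv_source, keep_every=10, chunksize=500_000):
    """Keep every rating of roughly one user in `keep_every`, reading in chunks.

    Hypothetical helper: users are selected with `userId % keep_every == 0`,
    a deterministic stand-in for random user sampling.
    """
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        kept.append(chunk[chunk['userId'] % keep_every == 0])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for ratings.csv (toy values, not real data)
toy_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,100\n"
    "2,10,3.0,101\n"
    "2,11,5.0,102\n"
    "3,12,2.5,103\n"
    "4,10,4.5,104\n"
)
toy_sample = sample_users_chunked(toy_csv, keep_every=2, chunksize=2)
print(toy_sample['userId'].unique().tolist())  # users 2 and 4 are kept
```

On the real file the call would be along the lines of `sample_users_chunked('../ratings.csv')`, which keeps about one user in ten together with that user's full rating history.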


In [5]:
print('=== movies.csv ===')
print(movies.dtypes)
print('\nMissing values:')
print(movies.isnull().sum())

print('\n=== ratings.csv ===')
print(ratings.dtypes)
print('\nMissing values:')
print(ratings.isnull().sum())

=== movies.csv ===
movieId     int64
title      object
genres     object
dtype: object

Missing values:
movieId    0
title      0
genres     0
dtype: int64

=== ratings.csv ===
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

Missing values:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64


In [6]:
print('=== Rating statistics ===')
print(ratings['rating'].describe())

print('\nUnique movies rated:   {:,}'.format(ratings['movieId'].nunique()))
print('Unique users:          {:,}'.format(ratings['userId'].nunique()))
print('Overall mean rating:   {:.2f}'.format(ratings['rating'].mean()))
print('Median rating:         {:.1f}'.format(ratings['rating'].median()))

=== Rating statistics ===
count    2.000000e+06
mean     3.540147e+00
std      1.059694e+00
min      5.000000e-01
25%      3.000000e+00
50%      4.000000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

Unique movies rated:   27,321
Unique users:          13,322
Overall mean rating:   3.54
Median rating:         4.0


---
## 2. Rating Distribution

In [7]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

rating_counts = ratings['rating'].value_counts().sort_index()
colors = sns.color_palette('viridis', len(rating_counts))

axes[0].bar(rating_counts.index, rating_counts.values, width=0.4, color=colors, edgecolor='white')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Number of ratings')
axes[0].set_title('Rating Distribution', fontsize=14, fontweight='bold')
axes[0].set_xticks(rating_counts.index)
for x, y in zip(rating_counts.index, rating_counts.values):
    axes[0].text(x, y + rating_counts.values.max()*0.01, '{:.0f}K'.format(y/1e3), ha='center', fontsize=9)

pct = (rating_counts / len(ratings) * 100)
axes[1].pie(
    pct.values, labels=[str(x) + ' stars' for x in pct.index],
    autopct='%1.1f%%', colors=colors, startangle=90,
    textprops={'fontsize': 9}
)
axes[1].set_title('Rating Shares', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('figures/01_rating_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print('The most frequent rating is {} stars'.format(rating_counts.idxmax()))

The most frequent rating is 4.0 stars


---
## 3. Genre Analysis

In [8]:
all_genres = []
for genres in movies['genres'].dropna():
    all_genres.extend(genres.split('|'))

genre_counts = Counter(all_genres)
genre_df = pd.DataFrame(genre_counts.items(), columns=['Genre', 'Nombre']).sort_values('Nombre', ascending=True)
genre_df = genre_df[genre_df['Genre'] != '(no genres listed)']

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.barh(genre_df['Genre'], genre_df['Nombre'], color=sns.color_palette('viridis', len(genre_df)))
ax.set_xlabel('Number of Movies')
ax.set_title('Number of Movies by Genre', fontsize=14, fontweight='bold')

for bar, val in zip(bars, genre_df['Nombre']):
    ax.text(bar.get_width() + 50, bar.get_y() + bar.get_height()/2, '{:,}'.format(val), va='center', fontsize=9)

plt.tight_layout()
plt.savefig('figures/02_genres_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

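# genre_counts still contains the '(no genres listed)' placeholder filtered out of genre_df above
print('{} distinct genre labels (including "(no genres listed)")'.format(len(genre_counts)))
print('Most common genre: {} ({:,} movies)'.format(genre_df.iloc[-1]['Genre'], genre_df.iloc[-1]['Nombre']))

20 distinct genre labels (including "(no genres listed)")
Most common genre: Drama (25,606 movies)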


In [9]:
# Mean rating by genre (a movie with several genres counts toward each of them)
movies_exploded = movies.assign(genre=movies['genres'].str.split('|')).explode('genre')
movies_exploded = movies_exploded[movies_exploded['genre'] != '(no genres listed)']
merged = movies_exploded.merge(ratings[['movieId', 'rating']], on='movieId')

genre_ratings = merged.groupby('genre')['rating'].agg(['mean', 'count', 'std']).reset_index()
genre_ratings.columns = ['Genre', 'Note_Moyenne', 'Nombre_Notes', 'Ecart_Type']
genre_ratings = genre_ratings.sort_values('Note_Moyenne', ascending=True)

fig, ax = plt.subplots(figsize=(12, 8))
colors = sns.color_palette('RdYlGn', len(genre_ratings))
bars = ax.barh(genre_ratings['Genre'], genre_ratings['Note_Moyenne'], color=colors, edgecolor='white')
ax.axvline(x=ratings['rating'].mean(), color='red', linestyle='--', linewidth=1.5,
           label='Overall mean ({:.2f})'.format(ratings['rating'].mean()))
ax.set_xlabel('Mean Rating')
ax.set_title('Mean Rating by Genre', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.set_xlim(2.5, 4.2)

for bar, val in zip(bars, genre_ratings['Note_Moyenne']):
    ax.text(val + 0.02, bar.get_y() + bar.get_height()/2, '{:.2f}'.format(val), va='center', fontsize=9)

plt.tight_layout()
plt.savefig('figures/03_average_rating_by_genre.png', dpi=150, bbox_inches='tight')
plt.show()

In [10]:
# Genre co-occurrence heatmap (the diagonal holds the number of movies per genre)
top_genres = genre_df.tail(12)['Genre'].tolist()

cooccurrence = pd.DataFrame(0, index=top_genres, columns=top_genres)
for genres in movies['genres'].dropna():
    genre_list = [g for g in genres.split('|') if g in top_genres]
    for i, g1 in enumerate(genre_list):
        for g2 in genre_list[i:]:
            cooccurrence.loc[g1, g2] += 1
            if g1 != g2:
                cooccurrence.loc[g2, g1] += 1

fig, ax = plt.subplots(figsize=(12, 10))
mask = np.triu(np.ones_like(cooccurrence, dtype=bool), k=1)
sns.heatmap(cooccurrence, mask=mask, annot=True, fmt='d', cmap='YlOrRd', ax=ax,
            linewidths=0.5, cbar_kws={'label': 'Number of movies'})
ax.set_title('Genre Co-occurrence (Top 12)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('figures/04_genre_cooccurrence.png', dpi=150, bbox_inches='tight')
plt.show()
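
The nested loop above counts, for each pair of genres, the movies tagged with both. The same table can be obtained as a matrix product of genre indicator columns. This is a sketch on toy rows (the three movies below are hypothetical), offered as an alternative formulation rather than the notebook's method:

```python
import pandas as pd

# Toy stand-in for movies.csv (hypothetical rows)
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Comedy|Romance', 'Comedy|Drama|Romance', 'Drama'],
})

# One indicator column per genre: 1 if the movie carries that genre, else 0
indicators = toy_movies['genres'].str.get_dummies(sep='|')

# (genres x movies) @ (movies x genres): entry [a, b] counts the movies tagged
# with both a and b; the diagonal holds the number of movies per genre
toy_cooccurrence = indicators.T @ indicators
print(toy_cooccurrence)
```

Applied to the real `movies` frame, the result could be restricted to the twelve most frequent genres with `.loc[top_genres, top_genres]` before plotting.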

---
## 4. User Activity

In [11]:
user_counts = ratings.groupby('userId').size()

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].hist(user_counts, bins=100, color='#3498db', edgecolor='white', alpha=0.8)
axes[0].set_xlabel('Number of ratings per user')
axes[0].set_ylabel('Number of users')
axes[0].set_title('Distribution of User Activity', fontsize=14, fontweight='bold')
axes[0].set_yscale('log')

bp = axes[1].boxplot(user_counts, vert=True, patch_artist=True,
                     boxprops=dict(facecolor='#3498db', alpha=0.7),
                     medianprops=dict(color='red', linewidth=2))
axes[1].set_ylabel('Number of ratings')
axes[1].set_title('Boxplot: Ratings per User', fontsize=14, fontweight='bold')
axes[1].set_yscale('log')

plt.tight_layout()
plt.savefig('figures/05_user_activity.png', dpi=150, bbox_inches='tight')
plt.show()

print('Median:     {:.0f} ratings/user'.format(user_counts.median()))
print('Mean:       {:.0f} ratings/user'.format(user_counts.mean()))
print('Min:        {} ratings'.format(user_counts.min()))
print('Max:        {:,} ratings'.format(user_counts.max()))

Median:     70 ratings/user
Mean:       150 ratings/user
Min:        20 ratings
Max:        4,689 ratings


In [12]:
movie_counts = ratings.groupby('movieId').size()

fig, ax = plt.subplots(figsize=(14, 6))
ax.hist(movie_counts, bins=100, color='#e74c3c', edgecolor='white', alpha=0.8)
ax.set_xlabel('Number of ratings per movie')
ax.set_ylabel('Number of movies')
ax.set_title('Distribution of Movie Popularity', fontsize=14, fontweight='bold')
ax.set_yscale('log')
ax.axvline(x=movie_counts.median(), color='blue', linestyle='--',
           label='Median ({:.0f})'.format(movie_counts.median()))
ax.axvline(x=100, color='green', linestyle='--', label='Popularity threshold (100)')
ax.legend()

plt.tight_layout()
plt.savefig('figures/06_movie_popularity.png', dpi=150, bbox_inches='tight')
plt.show()

print('Movies with < 10 ratings:    {:,} ({:.1f}%)'.format((movie_counts < 10).sum(), (movie_counts < 10).mean()*100))
print('Movies with > 1000 ratings:  {:,} ({:.1f}%)'.format((movie_counts > 1000).sum(), (movie_counts > 1000).mean()*100))

Movies with < 10 ratings:    17,667 (64.7%)
Movies with > 1000 ratings:  425 (1.6%)


---
## 5. Temporal Analysis

In [13]:
ratings['date'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['date'].dt.year

yearly = ratings.groupby('year').agg(
    count=('rating', 'count'),
    mean_rating=('rating', 'mean')
).reset_index()

fig, ax1 = plt.subplots(figsize=(14, 6))
ax2 = ax1.twinx()

ax1.bar(yearly['year'], yearly['count']/1e3, color='#3498db', alpha=0.7, label='Count (K)')
ax2.plot(yearly['year'], yearly['mean_rating'], color='#e74c3c', linewidth=2.5,
         marker='o', markersize=5, label='Mean rating')

ax1.set_xlabel('Year')
ax1.set_ylabel('Number of ratings (thousands)', color='#3498db')
ax2.set_ylabel('Mean rating', color='#e74c3c')
ax2.set_ylim(3.0, 4.2)
ax1.set_title('Ratings Over Time', fontsize=14, fontweight='bold')

lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')

plt.tight_layout()
plt.savefig('figures/07_temporal_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

---
## 6. Top Movies

In [14]:
movie_stats = ratings.groupby('movieId').agg(
    count=('rating', 'count'),
    mean=('rating', 'mean'),
    std=('rating', 'std')
).reset_index()
movie_stats = movie_stats.merge(movies, on='movieId')

top_rated_count = movie_stats.nlargest(20, 'count')

fig, ax = plt.subplots(figsize=(14, 8))
y_pos = range(len(top_rated_count))
colors = sns.color_palette('viridis', len(top_rated_count))
bars = ax.barh(y_pos, top_rated_count['count'], color=colors, edgecolor='white')
ax.set_yticks(list(y_pos))
ax.set_yticklabels([t[:45] for t in top_rated_count['title']], fontsize=10)
ax.invert_yaxis()
ax.set_xlabel('Number of ratings')
ax.set_title('Top 20 Most-Rated Movies', fontsize=14, fontweight='bold')

for bar, mean in zip(bars, top_rated_count['mean']):
    ax.text(bar.get_width() + 200, bar.get_y() + bar.get_height()/2,
            'Mean: {:.2f}'.format(mean), va='center', fontsize=9, color='orange')

plt.tight_layout()
plt.savefig('figures/08_top_movies_count.png', dpi=150, bbox_inches='tight')
plt.show()

In [15]:
popular = movie_stats[movie_stats['count'] >= 500]
top_quality = popular.nlargest(20, 'mean')

fig, ax = plt.subplots(figsize=(14, 8))
y_pos = range(len(top_quality))
colors = sns.color_palette('YlOrRd_r', len(top_quality))
bars = ax.barh(y_pos, top_quality['mean'], color=colors, edgecolor='white')
ax.set_yticks(list(y_pos))
ax.set_yticklabels([t[:45] for t in top_quality['title']], fontsize=10)
ax.invert_yaxis()
ax.set_xlabel('Mean Rating')
ax.set_title('Top 20 Highest-Rated Movies (min. 500 ratings)', fontsize=14, fontweight='bold')
ax.set_xlim(3.8, 4.6)

for bar, (mean, count) in zip(bars, zip(top_quality['mean'], top_quality['count'])):
    ax.text(mean + 0.01, bar.get_y() + bar.get_height()/2,
            '{:.2f} ({:,} ratings)'.format(mean, count), va='center', fontsize=9)

plt.tight_layout()
plt.savefig('figures/09_top_movies_quality.png', dpi=150, bbox_inches='tight')
plt.show()
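
The hard cutoff of 500 ratings is one way to keep small-sample averages out of the ranking. A common alternative, which this notebook does not use, is a shrinkage (Bayesian) average that pulls each movie's mean toward the global mean in proportion to how few ratings it has. The sketch below runs on toy values; the function name `bayesian_mean` and the prior weight `m` are assumptions chosen for illustration.

```python
import pandas as pd

def bayesian_mean(stats, global_mean, m=50):
    """Shrink each movie's mean rating toward the global mean.

    stats: DataFrame with 'count' and 'mean' columns (same layout as movie_stats).
    m: hypothetical prior weight, i.e. the number of "virtual" ratings at the
       global mean that are added to every movie.
    """
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

# Toy values, not taken from the dataset
toy_stats = pd.DataFrame({
    'title': ['Obscure gem', 'Well-known classic'],
    'count': [2, 1000],
    'mean': [5.0, 4.4],
})
toy_stats['weighted'] = bayesian_mean(toy_stats, global_mean=3.5, m=50)
print(toy_stats.sort_values('weighted', ascending=False))
```

With two ratings, the 5.0 average is pulled almost all the way back to the global mean, while the movie with 1,000 ratings barely moves, so no explicit threshold is needed. The value of `m` would have to be tuned on the real data.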

---
## 7. Sparsity of the User-Item Matrix

In [16]:
n_users = ratings['userId'].nunique()
n_movies_r = ratings['movieId'].nunique()
n_ratings = len(ratings)
# Figures for the full dataset
total_ratings_real = 25_000_095
n_users_real = 162_541
n_items_real = 62_423
sparsity = 1 - total_ratings_real / (n_users_real * n_items_real)

print('=' * 50)
print('  USER-ITEM MATRIX (full dataset)')
print('=' * 50)
print('  Users:         {:>10,}'.format(n_users_real))
print('  Movies:        {:>10,}'.format(n_items_real))
print('  Ratings:       {:>10,}'.format(total_ratings_real))
print('  Total cells:   {:>14,}'.format(n_users_real * n_items_real))
print('  Sparsity:      {:>9.4f}%'.format(sparsity*100))
print('=' * 50)
print('\nOnly {:.4f}% of the matrix is filled!'.format((1-sparsity)*100))
print('Collaborative filtering (SVD) predicts the empty cells')
print('by learning latent patterns.')

==================================================
  USER-ITEM MATRIX (full dataset)
==================================================
  Users:            162,541
  Movies:            62,423
  Ratings:       25,000,095
  Total cells:   10,146,296,843
  Sparsity:        99.7536%
==================================================

Only 0.2464% of the matrix is filled!
Collaborative filtering (SVD) predicts the empty cells
by learning latent patterns.
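
The sparsity figure depends on which matrix is measured. As a small worked example, the formula from the cell above can be wrapped in a function (the name `matrix_sparsity` is an illustrative assumption) and applied both to the full-dataset figures and to the counts reported for the 2M-row sample in section 1:

```python
def matrix_sparsity(n_ratings, n_users, n_items):
    """Fraction of empty cells in an n_users x n_items rating matrix."""
    return 1 - n_ratings / (n_users * n_items)

# Full dataset: the figures used in the cell above
full_sparsity = matrix_sparsity(25_000_095, 162_541, 62_423)

# 2M-row sample: counts reported in section 1 (13,322 users, 27,321 rated movies)
sample_sparsity = matrix_sparsity(2_000_000, 13_322, 27_321)

print('Full dataset: {:.2f}%'.format(full_sparsity * 100))
print('Sample:       {:.2f}%'.format(sample_sparsity * 100))
```

The sample comes out slightly less sparse (about 99.45%) because its denominator only counts movies that received at least one rating, whereas the full-dataset figure uses every movie listed in `movies.csv`.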


In [17]:
# Sparsity visualization (small sample of 150 users x 150 movies)
np.random.seed(42)
sample_users = np.random.choice(ratings['userId'].unique(), min(150, ratings['userId'].nunique()), replace=False)
sample_movies = np.random.choice(ratings['movieId'].unique(), min(150, ratings['movieId'].nunique()), replace=False)
sample = ratings[(ratings['userId'].isin(sample_users)) & (ratings['movieId'].isin(sample_movies))]

user_map = {u: i for i, u in enumerate(sorted(sample_users))}
movie_map = {m: i for i, m in enumerate(sorted(sample_movies))}

fig, ax = plt.subplots(figsize=(12, 10))
# Map raw ids to compact positions (vectorized); every row of `sample` is
# already restricted to the sampled users and movies, so no extra check is needed
xs = sample['movieId'].map(movie_map).tolist()
ys = sample['userId'].map(user_map).tolist()
cs = sample['rating'].tolist()

if len(xs) > 0:
    scatter = ax.scatter(xs, ys, c=cs, cmap='YlOrRd', s=5, vmin=0.5, vmax=5)
    plt.colorbar(scatter, ax=ax, label='Note')

ax.set_xlabel('Movies (sample)')
ax.set_ylabel('Users (sample)')
ax.set_title('Sparsity Visualization (full dataset: {:.2f}%)'.format(sparsity*100), fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('figures/10_sparsity_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

In [18]:
user_avg = ratings.groupby('userId')['rating'].mean()

fig, ax = plt.subplots(figsize=(12, 5))
ax.hist(user_avg, bins=50, color='#9b59b6', edgecolor='white', alpha=0.8)
ax.axvline(x=user_avg.mean(), color='red', linestyle='--', linewidth=2,
           label='Mean ({:.2f})'.format(user_avg.mean()))
ax.set_xlabel('Mean rating per user')
ax.set_ylabel('Number of users')
ax.set_title('Distribution of Per-User Mean Ratings', fontsize=14, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.savefig('figures/11_user_avg_rating.png', dpi=150, bbox_inches='tight')
plt.show()

print('The harshest users average {:.1f} stars'.format(user_avg.min()))
print('The most generous users average {:.1f} stars'.format(user_avg.max()))
print('The median user averages {:.1f} stars'.format(user_avg.median()))

The harshest users average 0.5 stars
The most generous users average 5.0 stars
The median user averages 3.7 stars
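
Since per-user averages range from 0.5 to 5.0, the same raw rating can mean different things for different users. One standard way to account for this before fitting a collaborative model is to subtract each user's own mean. The sketch below uses toy ratings (hypothetical values) and is not a step performed in this notebook:

```python
import pandas as pd

# Toy ratings (hypothetical values): user 1 rates harshly, user 2 generously
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'movieId': [10, 11, 12, 10, 11],
    'rating':  [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Subtract each user's own mean so that 0 means "typical for this user"
user_mean = toy_ratings.groupby('userId')['rating'].transform('mean')
toy_ratings['centered'] = toy_ratings['rating'] - user_mean
print(toy_ratings)
```

After centering, a 3.0 from the harsh user (+1.0) reads as more enthusiastic than a 4.0 from the generous user (-0.5), which the raw values would not show.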


---
## EDA Summary

### Key observations:
1. **Ratings are skewed toward the high end**: ratings of 3, 3.5, and 4 dominate (mean 3.54, median 4.0)
2. **The matrix is very sparse** (~99.75% on the full dataset): SVD factorization is well suited
3. **Drama and Comedy** are the most frequent genres
4. **Film-Noir and War** have the highest mean ratings
5. Users show **widely varying rating behavior**: per-user means range from 0.5 to 5.0, with a median of 3.7