# 🎵 Exploratory Data Analysis (EDA)
## Music Recommendation System - MCRec-30M Dataset

---

### 📌 Role of this Notebook

This notebook is the **first, foundational step** of the music recommendation system project. Its main goal is to carry out an **in-depth exploratory data analysis (EDA)** of the MCRec-30M dataset in order to:

✅ **Understand the data structure**: dimensions, variable types, value distributions  
✅ **Assess quality**: detect missing values, duplicates, and anomalies  
✅ **Identify patterns**: listening behavior, musical preferences, temporal trends  
✅ **Analyze the features**: audio characteristics, correlations, relevant variables  
✅ **Guide preprocessing**: determine the required transformations (normalization, encoding, filtering)  
✅ **Document the insights**: generate a JSON report for future reference  

This analysis supports **informed decisions** in the following steps (preprocessing and modeling) and helps ensure the quality of the final recommendation system.

---

**📊 Dataset**: `personalized_music_recommendation_dataset.csv`  
**🔧 Configuration**: loaded from `config.yaml`  
**📈 Output**: interactive visualizations + JSON report (`data/processed/eda_report.json`)

---

2 : Library imports and data loading

In [1]:
# Library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import yaml
import json
import os
import warnings
warnings.filterwarnings('ignore')

# Display configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully")

# Load the configuration
with open('../config.yaml', 'r', encoding='utf-8') as f:
    config = yaml.safe_load(f)

print(f"✅ Configuration loaded: {config['project']['name']} v{config['project']['version']}")

# Load the dataset (absolute path specific to the author's machine)
data_path = r"C:\Users\ekoub\OneDrive\Bureau\run python\music_recommendation_system\data\raw\personalized_music_recommendation_dataset.csv"

print("\nLoading data...")
df = pd.read_csv(data_path)
print(f"✅ Dataset loaded: {df.shape[0]:,} rows and {df.shape[1]} columns")
print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Show the first rows
df.head()

✅ Libraries imported successfully
✅ Configuration loaded: Music Recommendation System v1.0.0

Loading data...
✅ Dataset loaded: 70,129 rows and 49 columns
💾 Memory usage: 96.22 MB


Unnamed: 0,timestamp,user_id,age,gender,location,device_type,listening_time_mins,sessions_per_day,time_of_day,day_of_week,preferred_genre,preferred_artist,recent_skip_rate,subscription_type,song_id,title,artist,album,genre,release_year,language,duration_sec,popularity,explicit,tempo,key,mode,time_signature,energy,danceability,acousticness,instrumentalness,liveness,valence,loudness,speechiness,lyrics_sentiment,emotion_tag,play_count,skip_count,added_to_playlist,finished_song,time_spent_on_song,repeat_count,first_time_listening,context_type,song_position_in_session,session_duration_mins,liked
0,2021-01-01 00:00:00,1040,35,Female,US,Mobile,3,2,Afternoon,Weekday,Rock,ArtistB,0.080488,Premium,10122,SongA,ArtistB,AlbumX,Pop,2023,English,301,88,No,129.741127,1,Major,3,0.568881,0.353782,0.258067,0.148152,0.109031,0.684514,-15.66057,0.104678,Positive,Sad,5,2,0,1,109,1,0,Relax,10,31,0
1,2021-01-01 00:30:00,1000,45,Male,US,Mobile,28,1,Evening,Weekday,Pop,ArtistA,0.137869,Premium,10168,SongC,ArtistA,AlbumZ,Pop,2013,Spanish,184,17,No,105.300463,6,Major,4,0.759281,0.657928,0.14774,0.067243,0.077384,0.777952,-3.80269,0.200428,Positive,Happy,7,1,1,1,116,1,0,Workout,6,38,0
2,2021-01-01 01:00:00,1025,35,Other,US,Mobile,65,3,Afternoon,Weekday,Pop,ArtistA,0.298151,Free,10165,SongB,ArtistB,AlbumZ,Jazz,2013,Spanish,165,58,No,132.40856,1,Major,4,0.53818,0.615189,0.649933,0.254371,0.089832,0.710532,-10.006042,0.082967,Positive,Sad,5,0,0,0,172,1,0,Workout,8,25,1
3,2021-01-01 01:30:00,1049,45,Male,US,Smart Speaker,2,1,Afternoon,Weekend,Pop,ArtistB,0.154449,Free,10184,SongB,ArtistB,AlbumZ,Rock,2024,English,207,72,No,114.133532,1,Major,5,0.768903,0.812795,0.341269,0.43311,0.134263,0.475146,-11.681155,0.255022,Positive,Calm,5,0,0,1,172,0,1,Relax,8,49,0
4,2021-01-01 02:00:00,1011,45,Male,UK,Mobile,44,1,Evening,Weekday,Jazz,ArtistA,0.107674,Free,10091,SongD,ArtistA,AlbumZ,Pop,2021,English,200,40,No,124.51349,6,Major,4,0.334829,0.471074,0.535915,0.144477,0.003084,0.36091,-8.800733,0.080975,Negative,Sad,4,0,1,0,178,1,0,Relax,2,25,0
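
The loading cell hard-codes an absolute Windows path, so it only runs on one machine. A minimal sketch of a portable alternative follows. It assumes the notebook sits in a folder one level below the project root (which the `../config.yaml` path suggests) and that raw files live under `data/raw/`. The helper `resolve_data_path` and the folder name `notebooks` are hypothetical.

```python
from pathlib import Path


def resolve_data_path(notebook_dir, filename):
    """Build the raw-data path relative to the project root.

    Assumes (hypothetically) that notebooks sit one level below the
    project root and that raw files live in <root>/data/raw/.
    """
    project_root = Path(notebook_dir).resolve().parent
    return project_root / "data" / "raw" / filename


# "notebooks" is an assumed folder name, used only for illustration.
data_path = resolve_data_path(
    "music_recommendation_system/notebooks",
    "personalized_music_recommendation_dataset.csv",
)
print(data_path.parts[-3:])
```

The resulting `Path` can be passed directly to `pd.read_csv`.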


3 : Overview and general information

In [2]:
print("\n" + "="*80)
print("GENERAL INFORMATION ABOUT THE DATASET")
print("="*80)

# General statistics
print(f"\n📊 DIMENSIONS:")
print(f"  • Rows: {df.shape[0]:,}")
print(f"  • Columns: {df.shape[1]}")
print(f"  • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\n🔢 UNIQUE ENTITIES:")
print(f"  • Users: {df['user_id'].nunique():,}")
print(f"  • Songs: {df['song_id'].nunique():,}")
print(f"  • Artists: {df['artist'].nunique():,}")
print(f"  • Genres: {df['genre'].nunique():,}")
print(f"  • Total interactions: {len(df):,}")

print(f"\n📋 DATA TYPES:")
df.info()

print(f"\n📊 DESCRIPTIVE STATISTICS:")
df.describe()


GENERAL INFORMATION ABOUT THE DATASET

📊 DIMENSIONS:
  • Rows: 70,129
  • Columns: 49
  • Memory usage: 96.22 MB

🔢 UNIQUE ENTITIES:
  • Users: 50
  • Songs: 200
  • Artists: 3
  • Genres: 5
  • Total interactions: 70,129

📋 DATA TYPES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70129 entries, 0 to 70128
Data columns (total 49 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   timestamp                 70129 non-null  object 
 1   user_id                   70129 non-null  int64  
 2   age                       70129 non-null  int64  
 3   gender                    70129 non-null  object 
 4   location                  70129 non-null  object 
 5   device_type               70129 non-null  object 
 6   listening_time_mins       70129 non-null  int64  
 7   sessions_per_day          70129 non-null  int64  
 8   time_of_day               70129 non-null  ob

Unnamed: 0,user_id,age,listening_time_mins,sessions_per_day,recent_skip_rate,song_id,release_year,duration_sec,popularity,tempo,key,time_signature,energy,danceability,acousticness,instrumentalness,liveness,valence,loudness,speechiness,play_count,skip_count,added_to_playlist,finished_song,time_spent_on_song,repeat_count,first_time_listening,song_position_in_session,session_duration_mins,liked
count,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0,70129.0
mean,1024.530508,35.282351,19.501961,1.523592,0.285876,10099.389397,2018.799655,209.463489,49.569622,119.927511,5.497155,3.950605,0.714504,0.599602,0.285205,0.199693,0.166208,0.571546,-10.003112,0.332984,4.98691,0.994068,0.19822,0.699768,159.301231,0.994382,0.399749,5.48994,29.593392,0.298122
std,14.47466,10.439658,20.035027,0.729316,0.159505,57.697092,4.014108,59.920928,28.883892,14.974337,3.464414,0.383333,0.159536,0.199771,0.159343,0.120535,0.14029,0.174675,3.00361,0.178484,2.232181,0.99787,0.398662,0.458362,30.026655,0.998,0.48985,2.871212,9.978299,0.457437
min,1000.0,18.0,0.0,1.0,0.00154,10000.0,2010.0,-74.0,0.0,56.870131,0.0,3.0,0.051737,0.02365,0.001503,0.000207,8e-06,0.011293,-21.976866,0.000659,0.0,0.0,0.0,0.0,35.0,0.0,0.0,1.0,-13.0,0.0
25%,1012.0,25.0,5.0,1.0,0.161633,10049.0,2016.0,169.0,25.0,109.830796,2.0,4.0,0.610684,0.455842,0.16074,0.106748,0.055954,0.447943,-12.042249,0.192346,3.0,0.0,0.0,0.0,139.0,0.0,0.0,3.0,23.0,0.0
50%,1025.0,35.0,13.0,1.0,0.265101,10100.0,2019.0,209.0,50.0,120.005073,6.0,4.0,0.735693,0.614265,0.264379,0.178982,0.128973,0.578667,-9.995646,0.31335,5.0,1.0,0.0,1.0,159.0,1.0,0.0,5.0,30.0,0.0
75%,1037.0,45.0,27.0,2.0,0.388704,10149.0,2022.0,250.0,75.0,130.138337,9.0,4.0,0.83946,0.757045,0.388802,0.272376,0.242381,0.70244,-7.964235,0.454204,6.0,2.0,0.0,1.0,179.0,2.0,1.0,8.0,36.0,1.0
max,1049.0,55.0,204.0,4.0,0.946192,10199.0,2024.0,491.0,99.0,181.040198,11.0,5.0,0.999498,0.998792,0.931008,0.794926,0.882037,0.989947,3.234937,0.952045,17.0,8.0,1.0,1.0,295.0,8.0,1.0,10.0,74.0,1.0
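
The summary table reports a minimum of -74 for `duration_sec` and -13 for `session_duration_mins`, which cannot be valid durations. A minimal sketch of an explicit sign check is shown here on a small made-up frame. Which columns must be non-negative is an assumption based on what their names mean.

```python
import pandas as pd

# Toy frame standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "duration_sec": [301, 184, -74, 207],
    "session_duration_mins": [31, -13, 25, 49],
    "popularity": [88, 17, 58, 72],
})

# Columns that, by their meaning, should never be negative (an assumption).
non_negative_cols = ["duration_sec", "session_duration_mins", "popularity"]

# Count negative entries per column.
negative_counts = (toy[non_negative_cols] < 0).sum()
print(negative_counts[negative_counts > 0])

# Rows containing at least one impossible value, kept for later inspection.
invalid_rows = toy[(toy[non_negative_cols] < 0).any(axis=1)]
print(invalid_rows)
```

Running the same check on the full dataset would show how many rows are affected before deciding whether to drop or correct them.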


4 : Data quality analysis

In [3]:
print("\n" + "="*80)
print("DATA QUALITY ANALYSIS")
print("="*80)

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\n🔍 Duplicates: {duplicates:,} ({duplicates/len(df)*100:.2f}%)")

# Missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing Values': df.isnull().sum(),
    'Percentage (%)': (df.isnull().sum() / len(df)) * 100
}).sort_values('Missing Values', ascending=False)

missing_data = missing_data[missing_data['Missing Values'] > 0]

if len(missing_data) > 0:
    print(f"\n⚠️ MISSING VALUES DETECTED in {len(missing_data)} columns:")
    print(missing_data.to_string(index=False))

    # Visualization
    fig = px.bar(missing_data, x='Column', y='Percentage (%)',
                 title='Percentage of Missing Values per Column',
                 color='Percentage (%)', color_continuous_scale='Reds')
    fig.show()
else:
    print("\n✅ No missing values detected!")

# Outlier analysis (IQR method)
print("\n🔍 OUTLIER DETECTION (IQR method):")
for col in ['listening_time_mins', 'sessions_per_day', 'duration_sec', 'popularity']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Rows falling outside the [lower_bound, upper_bound] fences
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_pct = (len(outliers) / len(df)) * 100

    print(f"  • {col}: {len(outliers):,} outliers ({outlier_pct:.2f}%)")


DATA QUALITY ANALYSIS

🔍 Duplicates: 0 (0.00%)

✅ No missing values detected!

🔍 OUTLIER DETECTION (IQR method):
  • listening_time_mins: 3,342 outliers (4.77%)
  • sessions_per_day: 1,378 outliers (1.96%)
  • duration_sec: 474 outliers (0.68%)
  • popularity: 0 outliers (0.00%)
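
To make the outlier counts easier to interpret, the IQR fences can be worked out by hand from the quartiles in the descriptive statistics table. This is a small sketch of that arithmetic. The helper name `iqr_bounds` is introduced here for illustration, and the quartiles are copied from the `describe()` output.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles (25% and 75%) copied from the describe() table.
quartiles = {
    "listening_time_mins": (5.0, 27.0),
    "sessions_per_day": (1.0, 2.0),
    "duration_sec": (169.0, 250.0),
    "popularity": (25.0, 75.0),
}

for col, (q1, q3) in quartiles.items():
    lower, upper = iqr_bounds(q1, q3)
    print(f"{col}: [{lower}, {upper}]")
```

These fences explain the counts: `popularity` ranges from 0 to 99 and so stays inside [-50, 150], while every row with `sessions_per_day` equal to 4 exceeds the upper fence of 3.5. For `listening_time_mins` the lower fence is negative, so only sessions longer than 60 minutes are flagged. The `duration_sec` fences (47.5 and 371.5) catch both the negative minimum and the longest tracks.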


5 : User analysis

In [4]:
print("\n" + "="*80)
print("USER ANALYSIS")
print("="*80)

# Distribution of interactions per user
user_interactions = df.groupby('user_id').size().reset_index(name='nb_interactions')

print(f"\n📊 INTERACTION STATISTICS PER USER:")
print(f"  • Mean: {user_interactions['nb_interactions'].mean():.2f}")
print(f"  • Median: {user_interactions['nb_interactions'].median():.0f}")
print(f"  • Min: {user_interactions['nb_interactions'].min():.0f}")
print(f"  • Max: {user_interactions['nb_interactions'].max():.0f}")
print(f"  • Standard deviation: {user_interactions['nb_interactions'].std():.2f}")

# Demographic profile (counts are over interaction rows, not unique users)
print(f"\n👥 DEMOGRAPHIC PROFILE:")
print(f"\nGender:")
print(df['gender'].value_counts())
print(f"\nLocation (Top 10):")
print(df['location'].value_counts().head(10))
print(f"\nDevice Type:")
print(df['device_type'].value_counts())
print(f"\nSubscription Type:")
print(df['subscription_type'].value_counts())

# Visualizations
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Distribution of Interactions', 'Gender', 'Device', 'Subscription'),
    specs=[[{'type':'histogram'}, {'type':'pie'}],
           [{'type':'pie'}, {'type':'pie'}]]
)

# Distribution of interactions
fig.add_trace(
    go.Histogram(x=user_interactions['nb_interactions'], nbinsx=50, name='Interactions'),
    row=1, col=1
)

# Gender
gender_counts = df['gender'].value_counts()
fig.add_trace(go.Pie(labels=gender_counts.index, values=gender_counts.values),
              row=1, col=2)

# Device
device_counts = df['device_type'].value_counts()
fig.add_trace(go.Pie(labels=device_counts.index, values=device_counts.values),
              row=2, col=1)

# Subscription
sub_counts = df['subscription_type'].value_counts()
fig.add_trace(go.Pie(labels=sub_counts.index, values=sub_counts.values),
              row=2, col=2)

fig.update_layout(height=700, showlegend=False,
                  title_text="User Analysis")
fig.show()


USER ANALYSIS

📊 INTERACTION STATISTICS PER USER:
  • Mean: 1402.58
  • Median: 1402
  • Min: 1320
  • Max: 1482
  • Standard deviation: 36.01

👥 DEMOGRAPHIC PROFILE:

Gender:
gender
Male      42175
Female    20990
Other      6964
Name: count, dtype: int64

Location (Top 10):
location
US    35008
IN    13963
DE     7125
BR     7081
UK     6952
Name: count, dtype: int64

Device Type:
device_type
Mobile           49168
Desktop          13933
Smart Speaker     7028
Name: count, dtype: int64

Subscription Type:
subscription_type
Free       56133
Premium    13996
Name: count, dtype: int64
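
The demographic counts in this cell are taken over the 70,129 interaction rows, so they describe shares of interactions rather than shares of the 50 users. Before summarizing users, it helps to check whether each profile attribute is constant per `user_id`. This is a minimal sketch on made-up data, and whether the attributes are stable in the real dataset is not established here.

```python
import pandas as pd

# Toy interactions; values are made up for illustration.
toy = pd.DataFrame({
    "user_id": [1000, 1000, 1001, 1001, 1002],
    "gender": ["Male", "Male", "Female", "Other", "Male"],
    "subscription_type": ["Free", "Free", "Premium", "Premium", "Free"],
})

profile_cols = ["gender", "subscription_type"]

# Number of distinct values each user takes for each attribute.
distinct_per_user = toy.groupby("user_id")[profile_cols].nunique()

# Attributes that stay constant for every user can be summarized per user.
stable_cols = [c for c in profile_cols if (distinct_per_user[c] == 1).all()]
print("Stable per user:", stable_cols)

# One row per user, then count users (not interactions) per category.
users = toy.drop_duplicates("user_id")
print(users["subscription_type"].value_counts())
```

If an attribute varies within a user, it is better treated as a property of the interaction (context) than of the user profile.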


6 : Song and music content analysis

In [5]:
print("\n" + "="*80)
print("SONG AND MUSIC CONTENT ANALYSIS")
print("="*80)

# Song statistics
song_stats = df.groupby('song_id').agg({
    'play_count': 'sum',
    'liked': 'sum',
    'skip_count': 'sum',
    'added_to_playlist': 'sum'
}).reset_index()

print(f"\n📊 PLAY COUNT STATISTICS PER SONG:")
print(f"  • Mean: {song_stats['play_count'].mean():.2f}")
print(f"  • Median: {song_stats['play_count'].median():.0f}")
print(f"  • Max: {song_stats['play_count'].max():.0f}")

# Top songs (counts interaction rows per song_id/title/artist combination;
# a song_id whose title or artist varies across rows is split into several groups)
top_songs = df.groupby(['song_id', 'title', 'artist']).size().reset_index(name='nb_ecoutes')
top_songs = top_songs.sort_values('nb_ecoutes', ascending=False).head(20)

print(f"\n🎵 TOP 10 MOST PLAYED SONGS:")
for _, row in top_songs.head(10).iterrows():
    print(f"  {row['title']} - {row['artist']} ({row['nb_ecoutes']} plays)")

# Genres
genre_counts = df['genre'].value_counts()
print(f"\n🎸 GENRE DISTRIBUTION:")
print(genre_counts)

# Top artists
top_artists = df['artist'].value_counts().head(10)
print(f"\n🎤 TOP 10 ARTISTS:")
print(top_artists)

# Visualizations
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Top 15 Songs', 'Genre Distribution',
                    'Top 10 Artists', 'Song Popularity'),
    specs=[[{'type':'bar'}, {'type':'pie'}],
           [{'type':'bar'}, {'type':'histogram'}]]
)

# Top songs
top15 = top_songs.head(15)
fig.add_trace(
    go.Bar(y=top15['title'], x=top15['nb_ecoutes'], orientation='h'),
    row=1, col=1
)

# Genres
fig.add_trace(
    go.Pie(labels=genre_counts.index, values=genre_counts.values),
    row=1, col=2
)

# Top artists (top_artists already holds at most 10 entries)
fig.add_trace(
    go.Bar(y=top_artists.index, x=top_artists.values, orientation='h'),
    row=2, col=1
)

# Popularity
fig.add_trace(
    go.Histogram(x=df['popularity'], nbinsx=30),
    row=2, col=2
)

fig.update_layout(height=800, showlegend=False,
                  title_text="Song and Content Analysis")
fig.show()


SONG AND MUSIC CONTENT ANALYSIS

📊 PLAY COUNT STATISTICS PER SONG:
  • Mean: 1748.63
  • Median: 1746
  • Max: 2054

🎵 TOP 10 MOST PLAYED SONGS:
  SongA - ArtistA (76 plays)
  SongD - ArtistA (73 plays)
  SongA - ArtistA (72 plays)
  SongB - ArtistA (72 plays)
  SongA - ArtistA (72 plays)
  SongC - ArtistA (71 plays)
  SongD - ArtistA (71 plays)
  SongB - ArtistA (71 plays)
  SongB - ArtistA (70 plays)
  SongD - ArtistA (70 plays)

🎸 GENRE DISTRIBUTION:
genre
Pop          42141
Rock         14032
Jazz          6991
EDM           4888
Classical     2077
Name: count, dtype: int64

🎤 TOP 10 ARTISTS:
artist
ArtistA    42304
ArtistB    20922
ArtistC     6903
Name: count, dtype: int64
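
With 200 distinct `song_id` values and 70,129 rows, each song averages about 350 rows, yet the largest (`song_id`, `title`, `artist`) group in the ranking holds only 76. The same `song_id` must therefore appear with several title and artist labels. The sketch below, on made-up rows, shows one way to quantify that and to rank songs by `song_id` alone.

```python
import pandas as pd

# Toy interactions; values are made up for illustration.
toy = pd.DataFrame({
    "song_id": [10122, 10122, 10122, 10168, 10168],
    "title": ["SongA", "SongB", "SongA", "SongC", "SongC"],
    "artist": ["ArtistB", "ArtistA", "ArtistB", "ArtistA", "ArtistA"],
})

# Distinct (title, artist) pairs attached to each song_id.
pairs_per_song = (
    toy.drop_duplicates(["song_id", "title", "artist"])
       .groupby("song_id")
       .size()
)

# A clean catalogue would have exactly one pair per song_id.
inconsistent = pairs_per_song[pairs_per_song > 1]
print(inconsistent)

# Ranking by song_id alone avoids splitting one song across several labels.
plays_per_song = toy.groupby("song_id").size().sort_values(ascending=False)
print(plays_per_song)
```

If most songs turn out to carry several labels, `title` and `artist` are better read as row-level attributes than as song metadata, which matters for any content-based model.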


7 : Audio feature analysis

In [6]:
print("\n" + "="*80)
print("AUDIO FEATURE ANALYSIS")
print("="*80)

# Audio features listed in the config
audio_features = config['audio_features']

print(f"\n📊 AUDIO FEATURE STATISTICS:")
print(df[audio_features].describe())

# Correlation matrix
correlation_matrix = df[audio_features].corr()

# Visualization of distributions and correlations
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Audio Feature Distributions', 'Correlation Matrix'),
    specs=[[{'type':'box'}, {'type':'heatmap'}]],
    column_widths=[0.4, 0.6]
)

# Boxplots (only a few features, for readability)
selected_features = audio_features[:5]
for feature in selected_features:
    fig.add_trace(
        go.Box(y=df[feature], name=feature),
        row=1, col=1
    )

# Correlation heatmap
fig.add_trace(
    go.Heatmap(
        z=correlation_matrix.values,
        x=correlation_matrix.columns,
        y=correlation_matrix.columns,
        colorscale='RdBu',
        zmid=0,
        text=correlation_matrix.values.round(2),
        texttemplate='%{text}',
        textfont={"size": 8}
    ),
    row=1, col=2
)

fig.update_layout(height=500, title_text="Audio Feature Analysis")
fig.show()

# Strong correlations (upper triangle only, so each pair is reported once)
print(f"\n⚠️ STRONG CORRELATIONS (|r| > 0.7):")
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            print(f"  • {correlation_matrix.columns[i]} ↔ {correlation_matrix.columns[j]}: {correlation_matrix.iloc[i, j]:.3f}")
            high_corr.append((correlation_matrix.columns[i], correlation_matrix.columns[j], correlation_matrix.iloc[i, j]))

if not high_corr:
    print("  No strong correlation detected.")


AUDIO FEATURE ANALYSIS

📊 AUDIO FEATURE STATISTICS:
              tempo        energy  danceability  acousticness  \
count  70129.000000  70129.000000  70129.000000  70129.000000   
mean     119.927511      0.714504      0.599602      0.285205   
std       14.974337      0.159536      0.199771      0.159343   
min       56.870131      0.051737      0.023650      0.001503   
25%      109.830796      0.610684      0.455842      0.160740   
50%      120.005073      0.735693      0.614265      0.264379   
75%      130.138337      0.839460      0.757045      0.388802   
max      181.040198      0.999498      0.998792      0.931008   

       instrumentalness      liveness       valence      loudness  \
count      70129.000000  70129.000000  70129.000000  70129.000000   
mean           0.199693      0.166208      0.571546    -10.003112   
std            0.120535      0.140290      0.174675      3.003610   
min            0.000207      0.000008      0.011293    -21.976


⚠️ STRONG CORRELATIONS (|r| > 0.7):
  No strong correlation detected.
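
The statistics for this cell show features on very different scales: `tempo` averages about 120 and `loudness` about -10, while `energy`, `danceability` and the others lie in [0, 1]. Any distance-based comparison of songs would be dominated by `tempo` unless the columns are rescaled. A minimal z-score sketch on made-up values follows. It uses plain pandas and is one possible normalization, not necessarily the one the project adopts.

```python
import pandas as pd

# Toy audio features; values are made up for illustration.
toy = pd.DataFrame({
    "tempo": [129.7, 105.3, 132.4, 114.1],
    "energy": [0.57, 0.76, 0.54, 0.77],
    "loudness": [-15.7, -3.8, -10.0, -11.7],
})

# Z-score each column: subtract its mean, divide by its standard deviation.
# ddof=0 uses the population formula.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(2))
```

In a modeling pipeline the mean and standard deviation would normally be computed on the training split only and then reused on the test split, to avoid leaking information.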


 8 : Temporal and behavioral analysis

In [7]:
print("\n" + "="*80)
print("TEMPORAL AND BEHAVIORAL ANALYSIS")
print("="*80)

# Timestamp conversion
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['date'] = df['timestamp'].dt.date

# Temporal statistics
time_of_day_counts = df['time_of_day'].value_counts()
day_of_week_counts = df['day_of_week'].value_counts()
hourly_counts = df['hour'].value_counts().sort_index()

print(f"\n⏰ TEMPORAL DISTRIBUTION:")
print(f"\nBy Time of Day:")
print(time_of_day_counts)
print(f"\nBy Day Type:")
print(day_of_week_counts)

# Behavioral statistics
# Note: skip_count is a count per interaction, so its mean is not a rate
print(f"\n💡 BEHAVIORAL STATISTICS:")
print(f"  • Mean skips per interaction: {df['skip_count'].mean():.2f}")
print(f"  • Like rate: {df['liked'].mean()*100:.2f}%")
print(f"  • Playlist add rate: {df['added_to_playlist'].mean()*100:.2f}%")
print(f"  • Completion rate: {df['finished_song'].mean()*100:.2f}%")
print(f"  • Mean repeats: {df['repeat_count'].mean():.2f}")

# Emotions and contexts
emotion_counts = df['emotion_tag'].value_counts()
context_counts = df['context_type'].value_counts()

print(f"\n😊 EMOTION DISTRIBUTION:")
print(emotion_counts)
print(f"\n🎧 LISTENING CONTEXT:")
print(context_counts)

# Visualizations
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Plays per Hour', 'Time of Day', 'Day Type',
                   'Emotions', 'Context', 'Trend over Time'),
    specs=[[{'type':'bar'}, {'type':'pie'}, {'type':'pie'}],
           [{'type':'bar'}, {'type':'pie'}, {'type':'scatter'}]]
)

# Hour
fig.add_trace(go.Bar(x=hourly_counts.index, y=hourly_counts.values), row=1, col=1)

# Time of day
fig.add_trace(go.Pie(labels=time_of_day_counts.index, values=time_of_day_counts.values), row=1, col=2)

# Day type
fig.add_trace(go.Pie(labels=day_of_week_counts.index, values=day_of_week_counts.values), row=1, col=3)

# Emotions
fig.add_trace(go.Bar(x=emotion_counts.index, y=emotion_counts.values), row=2, col=1)

# Context
fig.add_trace(go.Pie(labels=context_counts.index, values=context_counts.values), row=2, col=2)

# Trend over time (interactions per day)
daily_counts = df.groupby('date').size().reset_index(name='count')
fig.add_trace(go.Scatter(x=daily_counts['date'], y=daily_counts['count'], mode='lines'), row=2, col=3)

fig.update_layout(height=700, showlegend=False, title_text="Temporal and Behavioral Analysis")
fig.show()


TEMPORAL AND BEHAVIORAL ANALYSIS

⏰ TEMPORAL DISTRIBUTION:

By Time of Day:
time_of_day
Evening      28094
Afternoon    20878
Morning      14102
Night         7055
Name: count, dtype: int64

By Day Type:
day_of_week
Weekday    49436
Weekend    20693
Name: count, dtype: int64

💡 BEHAVIORAL STATISTICS:
  • Mean skips per interaction: 0.99
  • Like rate: 29.81%
  • Playlist add rate: 19.82%
  • Completion rate: 69.98%
  • Mean repeats: 0.99

😊 EMOTION DISTRIBUTION:
emotion_tag
Happy    34799
Sad      14181
Calm     14007
Angry     7142
Name: count, dtype: int64

🎧 LISTENING CONTEXT:
context_type
Workout    20999
Commute    17552
Relax      13933
Study      10634
Party       7011
Name: count, dtype: int64
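
In the first rows shown by `df.head()`, the interaction stamped 00:00 is labelled `Afternoon` and the one stamped 00:30 is labelled `Evening`, so `time_of_day` does not appear to be derived from `timestamp`. The sketch below cross-checks the two on a toy frame. Its first two rows copy those timestamps and labels, and the last two are made up. The hour boundaries in `bucket_from_hour` are an assumption, since the dataset does not document them.

```python
import pandas as pd

# First two rows copy df.head(); the last two are made up for illustration.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:00:00", "2021-01-01 00:30:00",
        "2021-01-01 09:00:00", "2021-01-01 19:30:00",
    ]),
    "time_of_day": ["Afternoon", "Evening", "Morning", "Evening"],
})


def bucket_from_hour(hour):
    """Map an hour to a day part. The boundaries are an assumption."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 23:
        return "Evening"
    return "Night"


toy["derived"] = toy["timestamp"].dt.hour.map(bucket_from_hour)

# Share of rows where the stored label agrees with the derived one.
agreement = (toy["derived"] == toy["time_of_day"]).mean()
print(pd.crosstab(toy["derived"], toy["time_of_day"]))
print(f"Agreement: {agreement:.0%}")
```

A low agreement on the full dataset would suggest treating `time_of_day` and the `hour` column as independent features rather than two views of the same signal.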


 9 : Conclusions and report export

In [8]:
print("\n" + "="*80)
print("CONCLUSIONS AND RECOMMENDATIONS")
print("="*80)

print("""
✅ KEY FINDINGS:

1. DATA QUALITY:
   - Structured dataset with appropriate types
   - No missing values and no duplicates detected
   - Outliers to handle, including negative durations

2. DISTRIBUTION:
   - Interactions evenly spread across users (1,320 to 1,482 each)
   - Dominant genres/artists (popularity bias)
   - Temporal patterns worth exploiting

3. AUDIO FEATURES:
   - Features on different scales → normalization needed
   - No strong correlation (|r| > 0.7) detected
   - Rich features for content-based filtering

📋 PREPROCESSING RECOMMENDATIONS:

1. CLEANING:
   ✓ Keep the missing-value and duplicate checks as safeguards
   ✓ Handle outliers

2. FEATURE ENGINEERING:
   ✓ Normalize audio features
   ✓ Encode categorical variables
   ✓ Create aggregated features

3. FILTERING:
   ✓ Filter users/songs with < 5 interactions

4. SPLIT:
   ✓ Temporal train/test split (80/20)

➡️ Next step: Notebook 02_preprocessing.ipynb
""")

# Create the processed folder
os.makedirs('../data/processed', exist_ok=True)

# Save the report
report = {
    'dataset_shape': list(df.shape),
    'n_users': int(df['user_id'].nunique()),
    'n_songs': int(df['song_id'].nunique()),
    'n_artists': int(df['artist'].nunique()),
    'n_genres': int(df['genre'].nunique()),
    'n_interactions': len(df),
    'duplicates': int(duplicates),
    'user_interactions_mean': float(user_interactions['nb_interactions'].mean()),
    'user_interactions_median': float(user_interactions['nb_interactions'].median()),
    'liked_rate': float(df['liked'].mean()),
    # Note: mean of skip_count (skips per interaction), not a proportion
    'skip_rate': float(df['skip_count'].mean()),
    'completion_rate': float(df['finished_song'].mean()),
    'audio_features': audio_features,
    'top_genres': genre_counts.head(5).to_dict(),
    'top_artists': top_artists.head(5).to_dict()
}

with open('../data/processed/eda_report.json', 'w', encoding='utf-8') as f:
    json.dump(report, f, indent=4)

print("\n💾 Analysis report saved: data/processed/eda_report.json")
print("✅ Exploratory analysis complete!")


CONCLUSIONS AND RECOMMENDATIONS

✅ KEY FINDINGS:

1. DATA QUALITY:
   - Structured dataset with appropriate types
   - No missing values and no duplicates detected
   - Outliers to handle, including negative durations

2. DISTRIBUTION:
   - Interactions evenly spread across users (1,320 to 1,482 each)
   - Dominant genres/artists (popularity bias)
   - Temporal patterns worth exploiting

3. AUDIO FEATURES:
   - Features on different scales → normalization needed
   - No strong correlation (|r| > 0.7) detected
   - Rich features for content-based filtering

📋 PREPROCESSING RECOMMENDATIONS:

1. CLEANING:
   ✓ Keep the missing-value and duplicate checks as safeguards
   ✓ Handle outliers

2. FEATURE ENGINEERING:
   ✓ Normalize audio features
   ✓ Encode categorical variables
   ✓ Create aggregated features

3. FILTERING:
   ✓ Filter users/songs with < 5 interactions

4. SPLIT:
   ✓ Temporal train/test split (80/20)

➡️ Next step: Notebook 02_preprocessing.ipynb
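
The recommended 80/20 temporal split can be sketched as follows. This is a minimal global split by timestamp on made-up rows, and the helper `temporal_split` is hypothetical. The preprocessing notebook may instead split per user or handle tied timestamps differently.

```python
import pandas as pd

# Toy interactions at 30-minute steps; values are made up for illustration.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=10, freq="30min"),
    "user_id": [1040, 1000, 1025, 1049, 1011, 1040, 1000, 1025, 1049, 1011],
    "liked": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})


def temporal_split(frame, time_col="timestamp", train_frac=0.8):
    """Split chronologically: the earliest rows train, the latest rows test."""
    ordered = frame.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


train, test = temporal_split(toy)
print(len(train), len(test))
```

Unlike a random split, this keeps every test interaction later than every training interaction, which mirrors how a recommender is used in practice.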

