# SAE-66: Loading the Users data

**Objective**: Load the file `yelp_academic_dataset_user4students.jsonl`, explore the user attributes, and test the join with the reviews.

**Ticket**: [SAE-66](https://linear.app/sae6c01/issue/SAE-66/chargement-donnees-users-json)

In [1]:
import pandas as pd
import os

users_path = '../../data/raw/yelp_academic_dataset_user4students.jsonl'
reviews_path = '../../data/raw/yelp_academic_reviews4students.jsonl'

print("Files present:")
print(f"Users: {os.path.exists(users_path)}")
print(f"Reviews: {os.path.exists(reviews_path)}")

Files present:
Users: True
Reviews: True


## 1. Loading the Users

In [2]:
print("Loading Users...")
# Warning: large file (>1 GB)
df_users = pd.read_json(users_path, lines=True)
print(f"Users loaded: {len(df_users)}")
df_users.head()

Loading Users...

Users loaded: 558095


| | user_id | name | review_count | yelping_since | useful | funny | cool | elite | friends | fans | ... | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | qVc8ODYU5SZjKXVBgXdI7w | Walker | 585 | 2007-01-25 16:47:26 | 7217 | 1259 | 5994 | 2007 | NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA... | 267 | ... | 65 | 55 | 56 | 18 | 232 | 844 | 467 | 467 | 239 | 180 |
| 1 | j14WgRoU_-2ZE1aw1dXrJg | Daniel | 4333 | 2009-01-25 04:35:42 | 43091 | 13066 | 27281 | 2009,2010,2011,2012,2013,2014,2015,2016,2017,2... | ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A... | 3138 | ... | 264 | 184 | 157 | 251 | 1847 | 7054 | 3131 | 3131 | 1521 | 1946 |
| 2 | SZDeASXq7o05mMNLshsdIA | Gwen | 224 | 2005-11-29 04:38:33 | 512 | 330 | 299 | 200920102011 | enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg... | 28 | ... | 4 | 1 | 6 | 2 | 12 | 16 | 26 | 26 | 10 | 9 |
| 3 | hA5lMy-EnncsH4JoR-hFGQ | Karen | 79 | 2007-01-05 19:40:59 | 29 | 15 | 7 | | PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA... | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | q_QQ5kBBwlCcbL1s4NVK3g | Jane | 1221 | 2005-03-14 20:26:35 | 14953 | 9940 | 11211 | 200620072008200920102011201220132014 | xBDpTUbai0DXrvxCe3X16Q, 7GPNBO496aecrjJfW6UWtg... | 1357 | ... | 163 | 191 | 361 | 147 | 1212 | 5696 | 2543 | 2543 | 815 | 323 |
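
The load above reads the whole file in one call, and the code comment warns that it is larger than 1 GB. On a memory-constrained machine, one possible alternative is to read the JSON Lines file in chunks and keep only the columns needed downstream. The sketch below is not part of the original notebook: it uses a tiny in-memory stand-in with made-up records, and the column selection in `keep_cols` is an assumption about what later steps require.

```python
import io

import pandas as pd

# Hypothetical stand-in for the JSON Lines file: three made-up user records.
sample_jsonl = io.StringIO(
    '{"user_id": "a", "review_count": 3, "average_stars": 4.0, "friends": "x, y"}\n'
    '{"user_id": "b", "review_count": 10, "average_stars": 3.5, "friends": "z"}\n'
    '{"user_id": "c", "review_count": 1, "average_stars": 5.0, "friends": ""}\n'
)

# Assumed subset of columns; wide text columns such as `friends` are dropped.
keep_cols = ["user_id", "review_count", "average_stars"]

# With lines=True and chunksize set, read_json yields DataFrames of at most
# `chunksize` rows instead of materialising the whole file at once.
reader = pd.read_json(sample_jsonl, lines=True, chunksize=2)
chunks = [chunk[keep_cols] for chunk in reader]

# Reassemble the filtered chunks into a single frame with a clean index.
df_small = pd.concat(chunks, ignore_index=True)
print(df_small.shape)
```

For the real file, `sample_jsonl` would be replaced by `users_path` and `chunksize` raised to something like 100000 rows, a value to tune against available RAM.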


In [3]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558095 entries, 0 to 558094
Data columns (total 22 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   user_id             558095 non-null  object 
 1   name                558095 non-null  object 
 2   review_count        558095 non-null  int64  
 3   yelping_since       558095 non-null  object 
 4   useful              558095 non-null  int64  
 5   funny               558095 non-null  int64  
 6   cool                558095 non-null  int64  
 7   elite               558095 non-null  object 
 8   friends             558095 non-null  object 
 9   fans                558095 non-null  int64  
 10  average_stars       558095 non-null  float64
 11  compliment_hot      558095 non-null  int64  
 12  compliment_more     558095 non-null  int64  
 13  compliment_profile  558095 non-null  int64  
 14  compliment_cute     558095 non-null  int64  
 15  compliment_list     558095 non-nul
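
`info()` reports `yelping_since` and `elite` as `object` columns, so date arithmetic and per-year filtering will not work on them directly. A minimal sketch of one way to convert them follows, on a small hypothetical frame. It assumes that `elite` holds comma-separated years and that an empty string means "never elite"; both assumptions should be checked against the real data before reuse. The names `demo`, `elite_years` and `n_elite_years` are illustrative only.

```python
import pandas as pd

# Hypothetical rows shaped like the Users data (values are illustrative).
demo = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "yelping_since": [
        "2007-01-25 16:47:26",
        "2009-01-25 04:35:42",
        "2007-01-05 19:40:59",
    ],
    "elite": ["2007", "2009,2010,2011", ""],
})

# Parse the timestamp strings into a datetime64 column.
demo["yelping_since"] = pd.to_datetime(demo["yelping_since"])

# Split the assumed comma-separated years into lists of ints;
# an empty string becomes an empty list.
demo["elite_years"] = demo["elite"].apply(
    lambda s: [int(y) for y in s.split(",") if y.strip()]
)

# Number of elite years per user.
demo["n_elite_years"] = demo["elite_years"].str.len()

print(demo.dtypes)
```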

## 2. Basic statistics

In [4]:
mean_reviews = df_users['review_count'].mean()
mean_stars = df_users['average_stars'].mean()

print(f"Mean number of reviews per user: {mean_reviews:.2f}")
print(f"Mean of users' average ratings: {mean_stars:.2f}")

Mean number of reviews per user: 40.47
Mean of users' average ratings: 3.69


### Top 10 Reviewers

In [5]:
top_users = df_users.nlargest(10, 'review_count')[['name', 'review_count', 'average_stars', 'yelping_since']]
top_users

| | name | review_count | average_stars | yelping_since |
|---|---|---|---|---|
| 11600 | Fox | 17473 | 3.77 | 2009-05-26 11:33:58 |
| 3280 | Bruce | 16567 | 3.67 | 2009-03-08 21:47:44 |
| 100 | Kim | 9941 | 3.81 | 2006-05-31 21:27:42 |
| 2749 | Nijole | 8363 | 3.75 | 2011-11-29 15:50:53 |
| 11947 | Vincent | 8354 | 3.87 | 2012-03-18 10:04:51 |
| 5126 | George | 7738 | 3.49 | 2009-11-06 22:53:16 |
| 7469 | Kenneth | 6766 | 3.32 | 2011-06-10 03:52:07 |
| 850 | Jennifer | 6679 | 3.34 | 2009-11-09 20:44:45 |
| 1692 | Sunil | 6459 | 3.53 | 2009-01-28 23:35:24 |
| 394 | Eric | 5887 | 3.94 | 2007-03-28 19:08:35 |
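
The top reviewers reach several thousand reviews while the overall mean is 40.47, which suggests (without proving it) a strongly right-skewed `review_count` distribution. In that situation the median and upper quantiles usually describe a typical user better than the mean. The sketch below illustrates the effect on a made-up series, not on the actual dataset.

```python
import pandas as pd

# Made-up review counts: most users write few reviews, one writes very many.
review_counts = pd.Series([1, 2, 2, 3, 5, 8, 12, 40, 900], name="review_count")

mean_value = review_counts.mean()        # pulled upward by the single outlier
median_value = review_counts.median()    # robust to the outlier
p90 = review_counts.quantile(0.90)       # upper tail of the distribution

print(f"mean={mean_value:.2f}, median={median_value}, p90={p90:.1f}")
```

On the real data the same three calls can be applied to `df_users['review_count']`.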


## 3. Join test (merge)
We load a sample of the reviews to test the join.

In [6]:
print("Loading a sample of Reviews (100k)...")
# Load only 100k rows to validate the join mechanism without exhausting RAM on a limited machine
df_reviews_sample = pd.read_json(reviews_path, lines=True, nrows=100000)
print(f"Reviews sample: {len(df_reviews_sample)}")

# Join: left merge keeps every review, even if its user is missing from df_users
print("Joining Reviews -> Users...")
merged_df = df_reviews_sample.merge(df_users, on='user_id', how='left', suffixes=('_review', '_user'))

print(f"Shape after merge: {merged_df.shape}")
print("Resulting columns:")
print(merged_df.columns.tolist())

# Check the first rows
merged_df[['review_id', 'user_id', 'name', 'average_stars']].head()

Loading a sample of Reviews (100k)...

Reviews sample: 100000
Joining Reviews -> Users...

Shape after merge: (100000, 30)
Resulting columns:
['review_id', 'user_id', 'business_id', 'stars', 'useful_review', 'funny_review', 'cool_review', 'text', 'date', 'name', 'review_count', 'yelping_since', 'useful_user', 'funny_user', 'cool_user', 'elite', 'friends', 'fans', 'average_stars', 'compliment_hot', 'compliment_more', 'compliment_profile', 'compliment_cute', 'compliment_list', 'compliment_note', 'compliment_plain', 'compliment_cool', 'compliment_funny', 'compliment_writer', 'compliment_photos']


| | review_id | user_id | name | average_stars |
|---|---|---|---|---|
| 0 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | Dev | 3.72 |
| 1 | HlXP79ecTquSVXmjM10QxQ | bAt9OUFX9ZRgGLCXG22UmA | Kyle | 4.2 |
| 2 | JBBULrjyGx6vHto2osk_CQ | NRHPcLq2vGWqgqwVugSgnQ | Courtney | 5.0 |
| 3 | U9-43s8YUl6GWBFCpxUGEw | PAxc0qpqt5c2kA0rjDFFAg | Dianne | 4.0 |
| 4 | 8T8EGa_4Cj12M6w8vRgUsQ | BqPR1Dp5Rb_QYs9_fz9RiA | Billie June | 5.0 |
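
The row count staying at 100000 after a left merge shows that no review was duplicated, but it does not show how many reviews actually found a matching user. One way to check both properties explicitly is to pass `validate` and `indicator` to `merge`. The following is a minimal sketch on hypothetical frames (`reviews`, `users` and their values are made up), not the notebook's own verification step.

```python
import pandas as pd

# Hypothetical reviews; user "u9" has no row in the users table.
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u9"],
    "useful": [1, 0, 2],
})

# Hypothetical users; `useful` overlaps with the reviews column name.
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "name": ["Ann", "Bob"],
    "useful": [10, 20],
})

checked = reviews.merge(
    users,
    on="user_id",
    how="left",
    suffixes=("_review", "_user"),
    validate="many_to_one",  # raises MergeError if user_id is duplicated in users
    indicator=True,          # adds a `_merge` column: both / left_only / right_only
)

# Share of reviews that found a user, and the reviews that did not.
match_rate = (checked["_merge"] == "both").mean()
unmatched = checked.loc[checked["_merge"] == "left_only", "review_id"].tolist()

print(f"match rate: {match_rate:.2%}, unmatched reviews: {unmatched}")
```

Applied to the notebook's frames, the same two arguments can be added to the existing `df_reviews_sample.merge(df_users, ...)` call. A match rate below 100% would mean some `name` and `average_stars` values are `NaN` after the join.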
