# MovieLens - Batch Ingestion and Quality Check

## 1. Spark Initialization

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder \
    .appName("Data Ingestion - MovieLens") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print("✅ SparkSession initialized.")


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/02 08:56:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ SparkSession initialized.


In [2]:

## 2. Preliminary HDFS check

!hdfs dfs -ls /

## 3. Create the destination directory in HDFS

!hdfs dfs -mkdir -p /user/movielens/raw
!hdfs dfs -ls /user/movielens

## 4. Upload from local storage to HDFS (from /notebooks/data or elsewhere)
# Replace the paths below with the actual paths inside the Docker container if they differ

!hdfs dfs -put -f ./data/movielens/rating.csv /user/movielens/raw/
!hdfs dfs -put -f ./data/movielens/movie.csv /user/movielens/raw/

## 5. Verify the upload in HDFS
!hdfs dfs -ls /user/movielens/raw

## 6. Read with Spark (integrity check)
ratings_path = "hdfs://namenode:9000/user/movielens/raw/rating.csv"
movies_path = "hdfs://namenode:9000/user/movielens/raw/movie.csv"

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Found 2 items
drwxrwx---   - root supergroup          0 2025-04-29 14:38 /tmp
drwxr-xr-x   - root supergroup          0 2025-04-29 23:24 /user
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Found 2 items
drwxr-xr-x   - root supergroup          0 2025-04-30 11:11 /user/movielens/clean
drwxr-xr-x   - root supergroup          0 2025-04-30 09:30 /user/movielens/raw
log4j:WARN No app
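
The shell commands above can also be driven from Python, which lets the notebook stop early when an upload did not land in HDFS. The following is a minimal sketch, assuming the `hdfs` CLI is on the `PATH` of the notebook container. `hdfs_path_exists`, `missing_paths`, and the `runner` parameter are hypothetical helpers, not part of the original notebook.

```python
import subprocess


def hdfs_path_exists(path, runner=subprocess.run):
    """Return True if `path` exists in HDFS.

    Relies on `hdfs dfs -test -e <path>`, which exits with status 0
    when the path exists. `runner` is injectable (assumption: same call
    shape as subprocess.run) so the helper can be tried without a cluster.
    """
    result = runner(
        ["hdfs", "dfs", "-test", "-e", path],
        stdout=subprocess.DEVNULL,  # hide the log4j warnings
        stderr=subprocess.DEVNULL,
        check=False,                # a non-zero exit is an answer, not an error
    )
    return result.returncode == 0


def missing_paths(paths, exists=hdfs_path_exists):
    """Return the paths that are absent from HDFS, in their original order."""
    return [p for p in paths if not exists(p)]
```

In the notebook, `missing_paths(["/user/movielens/raw/rating.csv", "/user/movielens/raw/movie.csv"])` should return an empty list once both uploads have succeeded.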

## 2. Loading the Data from HDFS

In [3]:

ratings_df = spark.read.option("header", True).option("inferSchema", True).csv(ratings_path)
movies_df = spark.read.option("header", True).option("inferSchema", True).csv(movies_path)

print("✅ Files loaded with Spark.")


✅ Files loaded with Spark.
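
`inferSchema` makes Spark take an extra pass over each file to guess column types, and the guessed types can drift if the data changes. A hedged alternative is to declare the schema explicitly. The sketch below assumes the column names and types that Spark infers for these two files in this notebook. `RATINGS_DDL`, `MOVIES_DDL`, and `read_csv_with_schema` are illustrative names, and the `timestampFormat` value is an assumption based on timestamps of the form `2005-04-02 23:53:47`.

```python
# DDL-formatted schema strings, which DataFrameReader.schema() accepts.
RATINGS_DDL = "userId INT, movieId INT, rating DOUBLE, timestamp TIMESTAMP"
MOVIES_DDL = "movieId INT, title STRING, genres STRING"


def read_csv_with_schema(spark, path, ddl):
    """Read a CSV file that has a header row, using a fixed schema.

    `spark` is an existing SparkSession. No inference pass is needed
    because the column types come from `ddl`.
    """
    return (
        spark.read
        .option("header", True)
        # Assumption: timestamps look like "2005-04-02 23:53:47".
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
        .schema(ddl)
        .csv(path)
    )
```

With this helper, `ratings_df = read_csv_with_schema(spark, ratings_path, RATINGS_DDL)` would replace the `inferSchema` read. In the default `PERMISSIVE` mode, values that cannot be parsed become null, so an explicit schema pairs well with a null count afterwards.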


## 3. Data Exploration


In [4]:
# Display the schemas
print("Ratings schema:")
ratings_df.printSchema()

print("\nMovies schema:")
movies_df.printSchema()

Ratings schema:
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)


Movies schema:
root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)
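
Printing the schema relies on visual inspection. For a batch job, the same check can be automated by comparing `DataFrame.dtypes` with the expected types. This is a sketch under that assumption: `EXPECTED_RATINGS`, `EXPECTED_MOVIES`, and `schema_mismatches` are hypothetical names, and the expected types are copied from the output above.

```python
# Expected column types, using the short names reported by DataFrame.dtypes.
EXPECTED_RATINGS = {
    "userId": "int",
    "movieId": "int",
    "rating": "double",
    "timestamp": "timestamp",
}
EXPECTED_MOVIES = {"movieId": "int", "title": "string", "genres": "string"}


def schema_mismatches(actual_dtypes, expected):
    """Compare (name, type) pairs, as returned by DataFrame.dtypes,
    with an expected {name: type} mapping.

    Returns a list of readable problems; an empty list means the schema matches.
    """
    actual = dict(actual_dtypes)
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems
```

In the notebook, `schema_mismatches(ratings_df.dtypes, EXPECTED_RATINGS)` should return an empty list for the schema shown above.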



In [5]:
print("🎬 Preview of rating.csv:")
ratings_df.show(5)
ratings_df.printSchema()

print("🎥 Preview of movie.csv:")
movies_df.show(5)
movies_df.printSchema()


🎬 Preview of rating.csv:
+------+-------+------+-------------------+
|userId|movieId|rating|          timestamp|
+------+-------+------+-------------------+
|     1|      2|   3.5|2005-04-02 23:53:47|
|     1|     29|   3.5|2005-04-02 23:31:16|
|     1|     32|   3.5|2005-04-02 23:33:39|
|     1|     47|   3.5|2005-04-02 23:32:07|
|     1|     50|   3.5|2005-04-02 23:29:40|
+------+-------+------+-------------------+
only showing top 5 rows

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)

🎥 Preview of movie.csv:
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|C

In [6]:
# Descriptive statistics
print("Rating statistics:")
ratings_df.describe().show()

print("\nTotal number of ratings:")
print(ratings_df.count())

print("\nTotal number of movies:")
print(movies_df.count())

print("\nNumber of unique users:")
print(ratings_df.select("userId").distinct().count())

Rating statistics:

+-------+-----------------+-----------------+------------------+
|summary|           userId|          movieId|            rating|
+-------+-----------------+-----------------+------------------+
|  count|         20000263|         20000263|          20000263|
|   mean|69045.87258292554|9041.567330339605|3.5255285642993797|
| stddev| 40038.6266531621|  19789.477445413| 1.051988919294244|
|    min|                1|                1|               0.5|
|    max|           138493|           131262|               5.0|
+-------+-----------------+-----------------+------------------+


Total number of ratings:

20000263

Total number of movies:
27278

Number of unique users:

138493
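
The statistics above already hint at data quality: the three `count` values are equal, and ratings stay between 0.5 and 5.0. The sketch below turns such observations into explicit checks. `null_count_exprs` and `quality_report` are hypothetical helpers, and the rule that a user rates a given movie at most once is an assumption about the dataset, not something this notebook verifies.

```python
def null_count_exprs(columns):
    """Build one SQL aggregate per column that counts its NULL values."""
    return [
        f"sum(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}_nulls`"
        for c in columns
    ]


def quality_report(ratings_df, movies_df):
    """Run basic integrity checks and return the counts as a dict.

    Every value should be 0 on clean data. Each check triggers its own
    Spark job, so the ratings are scanned several times.
    """
    # NULL count per column, computed in a single aggregation.
    nulls = (
        ratings_df
        .selectExpr(*null_count_exprs(ratings_df.columns))
        .first()
        .asDict()
    )
    # Ratings outside the 0.5 to 5.0 range reported by describe().
    out_of_range = ratings_df.filter("rating < 0.5 OR rating > 5.0").count()
    # Assumption: each (userId, movieId) pair appears at most once.
    unique_pairs = ratings_df.dropDuplicates(["userId", "movieId"]).count()
    duplicate_pairs = ratings_df.count() - unique_pairs
    # Ratings whose movieId has no matching row in the movies table.
    orphan_ratings = ratings_df.join(movies_df, on="movieId", how="left_anti").count()
    return {
        **nulls,
        "out_of_range": out_of_range,
        "duplicate_pairs": duplicate_pairs,
        "orphan_ratings": orphan_ratings,
    }
```

Calling `quality_report(ratings_df, movies_df)` before `spark.stop()` returns a dictionary that can be logged or asserted on. The checks use SQL expression strings so that the helper needs no extra imports.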


                                                                                

In [7]:
# Stop the Spark session
spark.stop()