<a href="https://colab.research.google.com/github/dantetarraga/Spark-movie_Recommendation/blob/main/Spark_movie-recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Install the required libraries and set up Spark**

In [3]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285398 sha256=f7165f21d0b27735e5a62331e2208f0660394d9cbc77243ca829cc2d62a6adb1
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


**Step 2: Read the CSV files and load them into DataFrames**

In [16]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer

# Create a Spark session
spark = SparkSession.builder.appName("MovieRecommender").getOrCreate()

# Paths to the movies and ratings CSV files
movies_path = "/content/movies.csv"
ratings_path = "/content/ratings.csv"

# Load the movie and rating data
movies_df = spark.read.csv(movies_path, header=True, inferSchema=True)
ratings_df = spark.read.csv(ratings_path, header=True, inferSchema=True)


In [17]:
movies_df.show()
ratings_df.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

**Step 3: Preprocess the dataset**

*   We join movies_df with ratings_df to obtain the movie titles

In [18]:
# Join the DataFrames on the "movieId" column
ratings_df = ratings_df.join(movies_df, "movieId")

# Select the desired columns and rename userId/movieId to user/movie
ratings_df = (
    ratings_df.select("userId", "movieId", "rating", "title")
    .withColumnRenamed("userId", "user")
    .withColumnRenamed("movieId", "movie")
)

# Show the combined DataFrame
ratings_df.show()

+----+-----+------+--------------------+
|user|movie|rating|               title|
+----+-----+------+--------------------+
|   1|  307|   3.5|Three Colors: Blu...|
|   1|  481|   3.5|   Kalifornia (1993)|
|   1| 1091|   1.5|Weekend at Bernie...|
|   1| 1257|   4.5|Better Off Dead.....|
|   1| 1449|   4.5|Waiting for Guffm...|
|   1| 1590|   2.5|Event Horizon (1997)|
|   1| 1591|   1.5|        Spawn (1997)|
|   1| 2134|   4.5|Weird Science (1985)|
|   1| 2478|   4.0|¡Three Amigos! (1...|
|   1| 2840|   3.0|     Stigmata (1999)|
|   1| 2986|   2.5|    RoboCop 2 (1990)|
|   1| 3020|   4.0| Falling Down (1993)|
|   1| 3424|   4.5|Do the Right Thin...|
|   1| 3698|   3.5|Running Man, The ...|
|   1| 3826|   2.0|   Hollow Man (2000)|
|   1| 3893|   3.5|  Nurse Betty (2000)|
|   2|  170|   3.5|      Hackers (1995)|
|   2|  849|   3.5|Escape from L.A. ...|
|   2| 1186|   3.5|Sex, Lies, and Vi...|
|   2| 1235|   3.0|Harold and Maude ...|
+----+-----+------+--------------------+
only showing top

In [20]:
# Convert rating column to float
ratings_df = ratings_df.withColumn("rating", ratings_df["rating"].cast("float"))

# Filter out any invalid or missing values
ratings_df = ratings_df.filter(ratings_df["user"].isNotNull() & ratings_df["movie"].isNotNull() & ratings_df["rating"].isNotNull() & ratings_df["title"].isNotNull())

**Step 4: Split the dataset into two subsets (train and test)**

In [21]:
# Split the data into training and testing sets (80% for training, 20% for testing)
train_data, test_data = ratings_df.randomSplit([0.8, 0.2], seed=42)
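
To make the role of the seed concrete, here is a plain-Python sketch of a seeded random split. It is only an illustration and not how Spark implements `randomSplit`: Spark samples rows independently, so the resulting sizes are approximately, not exactly, 80/20. The helper `split_rows` and the toy rows are hypothetical.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent seeded draw."""
    rng = random.Random(seed)  # same seed -> same sequence of draws
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

# Hypothetical (user, movie, rating) rows
rows = [(u, m, 3.0) for u in range(1, 11) for m in range(1, 11)]
train_a, test_a = split_rows(rows)
train_b, test_b = split_rows(rows)

print(len(train_a), len(test_a))  # sizes are close to 80/20, not exact
print(train_a == train_b)         # True: the same seed reproduces the split
```

Fixing the seed matters here because the later steps index and train on `train_data`; a different split would produce a different model.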

**Step 5: Index the identifier columns (user, movie) and train the model**

In [22]:
# Create StringIndexers for the user and movie columns
user_indexer = StringIndexer(inputCol="user", outputCol="userIndex")
movie_indexer = StringIndexer(inputCol="movie", outputCol="movieIndex")

# Fit the StringIndexers on the training data and apply them
indexed_data = user_indexer.fit(train_data).transform(train_data)
indexed_data = movie_indexer.fit(indexed_data).transform(indexed_data)

# Create an ALS recommender model
als = ALS(userCol="userIndex", itemCol="movieIndex", ratingCol="rating", nonnegative=True)

# Fit the model to the training data
model = als.fit(indexed_data)

In [23]:
indexed_data.show()

+----+-----+------+--------------------+---------+----------+
|user|movie|rating|               title|userIndex|movieIndex|
+----+-----+------+--------------------+---------+----------+
|   1|  307|   3.5|Three Colors: Blu...|  25111.0|     910.0|
|   1|  481|   3.5|   Kalifornia (1993)|  25111.0|    1091.0|
|   1| 1257|   4.5|Better Off Dead.....|  25111.0|    1167.0|
|   1| 1449|   4.5|Waiting for Guffm...|  25111.0|     993.0|
|   1| 1590|   2.5|Event Horizon (1997)|  25111.0|     798.0|
|   1| 2134|   4.5|Weird Science (1985)|  25111.0|     996.0|
|   1| 2840|   3.0|     Stigmata (1999)|  25111.0|    1155.0|
|   1| 2986|   2.5|    RoboCop 2 (1990)|  25111.0|    1154.0|
|   1| 3020|   4.0| Falling Down (1993)|  25111.0|     874.0|
|   1| 3424|   4.5|Do the Right Thin...|  25111.0|     940.0|
|   1| 3826|   2.0|   Hollow Man (2000)|  25111.0|     800.0|
|   1| 3893|   3.5|  Nurse Betty (2000)|  25111.0|    1263.0|
|   2|  170|   3.5|      Hackers (1995)|  25511.0|     692.0|
|   2|  
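
The notebook imports `RegressionEvaluator` in Step 2 but never uses it. One common way to judge a fitted rating model is the root-mean-square error (RMSE) between predicted and actual ratings on held-out rows, which is one of the metrics that evaluator supports. The sketch below computes the same quantity in plain Python; the rating values are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical held-out ratings and model predictions
actual = [3.5, 4.0, 2.0, 5.0]
predicted = [3.0, 4.5, 2.0, 4.0]
print(rmse(actual, predicted))
```

A lower value means predictions are closer to the real ratings. In Spark, ALS returns NaN predictions for users or movies it has not seen unless `coldStartStrategy="drop"` is set, so those rows would need to be excluded before computing the metric.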

**Step 6: Index the identifier columns (user, movie) and build the test set**

In [24]:
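# handleInvalid="skip" drops test rows whose user or movie was not seen in training
user_indexer = StringIndexer(inputCol="user", outputCol="userIndex", handleInvalid="skip")
movie_indexer = StringIndexer(inputCol="movie", outputCol="movieIndex", handleInvalid="skip")

# Fit the indexers on the training data so the test set gets the same
# userIndex/movieIndex mapping the model was trained on, then transform the test data
indexed_test_data = user_indexer.fit(train_data).transform(test_data)
indexed_test_data = movie_indexer.fit(train_data).transform(indexed_test_data)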
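
An index mapping depends on the data the indexer was fitted on, so the test set should be indexed with the mapping learned from the training data. By default `StringIndexer` gives index 0 to the most frequent label. The plain-Python sketch below imitates that ordering and shows how two separately fitted mappings disagree. `fit_indexer` and the toy user IDs are hypothetical, and the alphabetical tie-break is an assumption based on Spark 3.x behaviour.

```python
from collections import Counter

def fit_indexer(values):
    """Map each distinct value to an index, most frequent first (ties: string order)."""
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical user IDs as they might appear in each split
train_users = [7, 7, 7, 3, 3, 9]
test_users = [9, 9, 3, 7]

train_map = fit_indexer(train_users)
test_map = fit_indexer(test_users)

print(train_map)  # {'7': 0.0, '3': 1.0, '9': 2.0}
print(test_map)   # {'9': 0.0, '3': 1.0, '7': 2.0}

# Reusing the training mapping keeps test rows aligned with the model
aligned = [train_map[str(u)] for u in test_users if str(u) in train_map]
print(aligned)    # [2.0, 2.0, 1.0, 0.0]
```

In the toy data, user 7 is index 0 in training but index 2 in the separately fitted test mapping, so a model trained on the first mapping would return recommendations for the wrong user.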

**Step 7: Generate 5 recommendations for each user in the test set**

In [25]:
# Generate top 5 recommendations for each user in the test data
recommendations = model.recommendForUserSubset(indexed_test_data, 5)

**Step 8: Show the recommendations**

In [26]:
# Show the recommendations
recommendations.show(truncate=False)

+---------+----------------------------------------------------------------------------------------------------+
|userIndex|recommendations                                                                                     |
+---------+----------------------------------------------------------------------------------------------------+
|1        |[{26170, 5.344473}, {26891, 5.2205157}, {26635, 5.2205157}, {26272, 5.2205157}, {26066, 5.2205157}] |
|3        |[{26891, 5.7220397}, {26635, 5.7220397}, {26272, 5.7220397}, {26066, 5.7220397}, {21873, 5.7220397}]|
|5        |[{20702, 6.064125}, {26891, 6.0557866}, {26635, 6.0557866}, {26272, 6.0557866}, {26066, 6.0557866}] |
|6        |[{26891, 4.952735}, {26635, 4.952735}, {26272, 4.952735}, {26066, 4.952735}, {21873, 4.952735}]     |
|9        |[{15432, 7.1770325}, {15159, 6.5486417}, {27433, 6.406582}, {16047, 6.376807}, {14503, 6.186342}]   |
|12       |[{24399, 5.2823105}, {26170, 5.051624}, {26891, 4.8729715}, {26635, 4.8729715}, {2627
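
Each pair in the output above is a movie index and a predicted score. Conceptually, an ALS model scores a user-movie pair as the dot product of the user's factor vector and the movie's factor vector, then keeps the highest-scoring movies. Those dot products are not clipped to the rating scale, which is why some scores above exceed 5. Below is a minimal plain-Python sketch of that scoring and top-k step; `top_k_for_user` and the rank-2 factors are hypothetical, not values from the fitted model.

```python
import heapq

def top_k_for_user(user_vector, item_factors, k=5):
    """Score every item by dot product with the user vector and keep the k best."""
    scores = (
        (item, sum(u * v for u, v in zip(user_vector, vector)))
        for item, vector in item_factors.items()
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])

# Hypothetical rank-2 factors; a fitted model stores one vector per user and per item
user_vector = [1.0, 2.0]
item_factors = {
    10: [1.0, 1.0],  # score 3.0
    11: [0.5, 2.0],  # score 4.5
    12: [2.0, 0.0],  # score 2.0
}

print(top_k_for_user(user_vector, item_factors, k=2))  # [(11, 4.5), (10, 3.0)]
```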

**Step 9: Ask for a user ID and recommend a movie**

In [31]:
from pyspark.sql.functions import col

# Ask for a user number (it is matched against userIndex from Step 5, not the raw user ID)
id_to_retrieve = int(input("Enter the user number: "))

# Rows from the training data for that user
filtered_data = indexed_data.filter(col("userIndex") == id_to_retrieve)

# Note: this takes the first movie the user rated in the training data;
# it does not read the ALS output stored in `recommendations`
movie_index = filtered_data.select("movie").collect()[0][0]

movie_title = (ratings_df.filter(col("movie") == movie_index)).select("title").collect()[0][0]

# Print the movie for the user
print("For user %s, the recommended movie is: %s" % (id_to_retrieve, movie_title))

Enter the user number: 10531
For user 10531, the recommended movie is: Toy Story (1995)
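
The cell above reads a title from the user's own training rows. To report a movie the model actually recommends, the lookup would go from the user index to the top entry in `recommendations`, then from that movie index back to the movie ID, and finally to the title. The plain-Python sketch below shows that chain with dictionaries standing in for the DataFrames. `recommended_title` and all the data are hypothetical; in Spark the index-to-ID mapping could come from `indexed_data` or from the fitted indexer's `labels`.

```python
def recommended_title(user_index, recommendations, index_to_movie, titles):
    """Return the title of the highest-scored movie for a user, or None if unknown."""
    ranked = recommendations.get(user_index)
    if not ranked:
        return None
    top_movie_index, _score = ranked[0]         # lists are sorted best first
    movie_id = index_to_movie[top_movie_index]  # undo the index mapping
    return titles[movie_id]

# Hypothetical data shaped like the outputs above
recommendations = {1.0: [(0.0, 5.3), (2.0, 5.2)]}
index_to_movie = {0.0: 1, 1.0: 2, 2.0: 6}
titles = {1: "Toy Story (1995)", 2: "Jumanji (1995)", 6: "Heat (1995)"}

print(recommended_title(1.0, recommendations, index_to_movie, titles))   # Toy Story (1995)
print(recommended_title(99.0, recommendations, index_to_movie, titles))  # None
```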
