# Spark on Tour
## Automatic movie recommendation from rating events

In this notebook we explore one of the main features of Spark's ML library: automatically generating recommendations from the history of user events.

Specifically, Spark ships with a parallelized implementation of a collaborative filtering recommendation algorithm:

https://spark.apache.org/docs/2.4.3/ml-collaborative-filtering.html

Algorithms of this kind use the feedback captured from a large number of users to generate recommendations automatically.

Here we use the MovieLens dataset, which contains movie ratings, to recommend new movies that our users might like.

### Import libraries, define schemas, and initialize the Spark session

In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

import plotly.express as px


ratingSchema = StructType([
    StructField("user", IntegerType()),
    StructField("movie", IntegerType()),
    StructField("rating", FloatType())
])

movieSchema = StructType([
    StructField("movie", IntegerType()),
    StructField("title", StringType()),
    StructField("genres", StringType())
])


# Set up the Spark session
sparkSession = (SparkSession.builder
                .appName("Introducción API estructurada")
                .master("local[*]")
                .config("spark.scheduler.mode", "FAIR")
                .getOrCreate())
sparkSession.sparkContext.setLogLevel("ERROR")

### Read the user / movie ratings dataset

In [2]:
ratings = sparkSession.read.csv("/tmp/movielens/ratings.csv", schema=ratingSchema, header=True)
ratings.show(10)

+----+-----+------+
|user|movie|rating|
+----+-----+------+
|   1|    1|   4.0|
|   1|    3|   4.0|
|   1|    6|   4.0|
|   1|   47|   5.0|
|   1|   50|   5.0|
|   1|   70|   3.0|
|   1|  101|   5.0|
|   1|  110|   4.0|
|   1|  151|   5.0|
|   1|  157|   5.0|
+----+-----+------+
only showing top 10 rows



## Recommending movies with the Alternating Least Squares algorithm
We use the ratings directly as training data for the collaborative filtering algorithm.

We split the dataset into a training set and a smaller test set, which we use to validate the performance of the recommendation model.

In [3]:
(training, test) = ratings.randomSplit([0.8, 0.2])
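
The split above is random, so the error reported later will vary between runs. `randomSplit` accepts an optional `seed` argument that makes the split repeatable. The following plain-Python sketch illustrates the general idea of a seeded, weight-based split, where each row is assigned independently and the split sizes are therefore only approximately 80/20. The function `random_split` and the variables `toy_rows`, `toy_training` and `toy_test` are hypothetical names for illustration; this is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

toy_rows = list(range(1000))  # stand-in for rating rows
toy_training, toy_test = random_split(toy_rows, [0.8, 0.2], seed=42)
```

Calling the function again with the same seed returns the same two lists, which is what makes a later comparison of error values between runs meaningful.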

Collaborative filtering algorithms build a two-dimensional matrix that crosses every 'user' with every 'movie', using the rating as the cell value. On this matrix they try to predict the values that are missing, that is, the movies a given user has not rated yet.
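
To make the idea concrete, here is a minimal plain-Python sketch of alternating least squares with a single latent factor per user and per movie. The ratings, the regularization value and all names (`toy_ratings`, `user_f`, `movie_f`) are made up for illustration. Spark's ALS uses higher-rank factors and a distributed solver, so this only shows the alternating structure, not the library's implementation.

```python
# Toy data: observed cells of a small user x movie matrix (made-up values).
toy_ratings = {
    ("u1", "m1"): 4.0, ("u1", "m2"): 5.0,
    ("u2", "m1"): 2.0, ("u2", "m3"): 1.0,
    ("u3", "m2"): 4.0, ("u3", "m3"): 2.0,
}
users = sorted({u for u, _ in toy_ratings})
movies = sorted({m for _, m in toy_ratings})

reg = 0.01                           # regularization strength (assumed value)
user_f = {u: 1.0 for u in users}     # one latent factor per user
movie_f = {m: 1.0 for m in movies}   # one latent factor per movie

def squared_error():
    """Sum of squared errors over the observed cells only."""
    return sum((r - user_f[u] * movie_f[m]) ** 2
               for (u, m), r in toy_ratings.items())

initial_error = squared_error()
for _ in range(20):
    # Hold the movie factors fixed and solve least squares for each user.
    for u in users:
        num = sum(r * movie_f[m] for (uu, m), r in toy_ratings.items() if uu == u)
        den = reg + sum(movie_f[m] ** 2 for (uu, m) in toy_ratings if uu == u)
        user_f[u] = num / den
    # Hold the user factors fixed and solve least squares for each movie.
    for m in movies:
        num = sum(r * user_f[u] for (u, mm), r in toy_ratings.items() if mm == m)
        den = reg + sum(user_f[u] ** 2 for (u, mm) in toy_ratings if mm == m)
        movie_f[m] = num / den
final_error = squared_error()

# Predicted values for the cells that had no rating in the input.
missing = {(u, m): user_f[u] * movie_f[m]
           for u in users for m in movies if (u, m) not in toy_ratings}
```

After the loop, `missing` holds a predicted score for each of the three unrated user/movie pairs, which is the kind of value a recommender ranks to pick new movies.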

In [4]:
als = ALS(maxIter=5, regParam=0.01, userCol="user", itemCol="movie", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)

We generate predictions for the test set and compute the error against the actual ratings in that set.

In [5]:
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.0618606673886553
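
For reference, the metric reported above is the square root of the mean squared difference between actual and predicted ratings. A small plain-Python sketch of that formula follows; the function name `rmse` and the sample pairs are made up and are not taken from the notebook's test set.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Made-up (actual, predicted) ratings for illustration only.
example_pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 4.5)]
example_error = rmse(example_pairs)
```

Because the differences are squared before averaging, a few large misses weigh more heavily than many small ones.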


We generate recommendations (5 movies) for each user.

In [6]:
recommendations = model.recommendForAllUsers(5)

In [7]:
recommendations.sort(asc("user")).show(10, truncate=False)

+----+--------------------------------------------------------------------------------------------------+
|user|recommendations                                                                                   |
+----+--------------------------------------------------------------------------------------------------+
|1   |[[3836, 6.511707], [2013, 6.3986654], [125, 6.246298], [3505, 6.2374196], [142488, 6.2144933]]    |
|2   |[[1014, 8.222667], [307, 8.031318], [2492, 7.4105945], [3446, 7.376205], [4649, 7.236281]]        |
|3   |[[86320, 6.682943], [4821, 6.6205907], [70946, 5.449472], [5181, 5.274498], [37380, 5.268049]]    |
|4   |[[7700, 7.3021345], [8228, 6.6762133], [6932, 6.446845], [111743, 6.385763], [72, 6.3837175]]     |
|5   |[[4649, 10.152126], [2135, 10.130684], [222, 10.052282], [4041, 9.367788], [1327, 9.3218775]]     |
|6   |[[2524, 6.5490828], [179819, 5.963917], [47423, 5.8685794], [4857, 5.848678], [103335, 5.802064]] |
|7   |[[446, 9.419422], [1283, 9.395563], [306
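
Conceptually, each row of this output is the result of ranking a user's predicted scores and keeping the highest ones. The plain-Python sketch below illustrates that top-N selection with made-up scores; `predicted` and `top_n` are hypothetical names, and this is not how Spark computes the result internally.

```python
import heapq

# Hypothetical predicted scores per user: {user: {movie_id: score}}.
predicted = {
    1: {10: 3.2, 20: 4.8, 30: 4.1, 40: 2.5},
    2: {10: 4.9, 20: 1.0, 30: 3.3, 40: 4.0},
}

def top_n(scores_by_user, n):
    """Return, per user, the n (movie, score) pairs with the highest score."""
    return {
        user: heapq.nlargest(n, scores.items(), key=lambda item: item[1])
        for user, scores in scores_by_user.items()
    }

top2 = top_n(predicted, 2)
```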

We can also generate the inverse recommendations: which users are most likely to be interested in a given movie.

In [8]:
userRecs = model.recommendForAllItems(5)
userRecs.sort(asc("movie")).show(10, truncate=False)

+-----+-----------------------------------------------------------------------------------------+
|movie|recommendations                                                                          |
+-----+-----------------------------------------------------------------------------------------+
|1    |[[157, 6.3491497], [576, 5.822107], [236, 5.6325927], [548, 5.6133566], [512, 5.562183]] |
|2    |[[399, 5.6755395], [273, 5.189376], [595, 4.9810853], [589, 4.960952], [13, 4.915225]]   |
|3    |[[243, 7.4542813], [569, 6.8461537], [130, 6.379391], [535, 6.3497453], [99, 6.221362]]  |
|4    |[[536, 6.12868], [423, 5.446683], [485, 4.879732], [443, 4.7059207], [413, 4.622047]]    |
|5    |[[423, 6.8599854], [569, 6.1487494], [485, 6.0215416], [130, 5.782857], [243, 5.7175436]]|
|6    |[[147, 7.5006423], [243, 6.340062], [203, 5.9637938], [569, 5.8089347], [457, 5.7893596]]|
|7    |[[243, 5.8030853], [557, 5.75836], [547, 5.51956], [203, 5.181615], [468, 5.177589]]     |
|8    |[[99, 7.68777

## Analyzing the recommendations

We read the movies dataset.

In [9]:
movies = sparkSession.read.csv("/tmp/movielens/movies.csv", schema=movieSchema, header=True)
movies.show(10, truncate=False)

+-----+----------------------------------+-------------------------------------------+
|movie|title                             |genres                                     |
+-----+----------------------------------+-------------------------------------------+
|1    |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2    |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3    |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4    |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5    |Father of the Bride Part II (1995)|Comedy                                     |
|6    |Heat (1995)                       |Action|Crime|Thriller                      |
|7    |Sabrina (1995)                    |Comedy|Romance                             |
|8    |Tom and Huck (1995)               |Adventure|Children                         |
|9    |Sudden Death (1995)               |A

In [10]:
recs = recommendations.select("user", explode("recommendations"))
recs = recs.select("user", recs["col"]["movie"].alias("movie"), recs["col"]["rating"].alias("rating"))
recs.sort(asc("user")).show(truncate=False)

+----+------+---------+
|user|movie |rating   |
+----+------+---------+
|1   |3836  |6.511707 |
|1   |2013  |6.3986654|
|1   |142488|6.2144933|
|1   |3505  |6.2374196|
|1   |125   |6.246298 |
|2   |2492  |7.4105945|
|2   |4649  |7.236281 |
|2   |3446  |7.376205 |
|2   |307   |8.031318 |
|2   |1014  |8.222667 |
|3   |37380 |5.268049 |
|3   |70946 |5.449472 |
|3   |86320 |6.682943 |
|3   |4821  |6.6205907|
|3   |5181  |5.274498 |
|4   |111743|6.385763 |
|4   |8228  |6.6762133|
|4   |72    |6.3837175|
|4   |7700  |7.3021345|
|4   |6932  |6.446845 |
+----+------+---------+
only showing top 20 rows
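
The `explode` step above turns one row per user, holding an array of recommendations, into one row per recommendation. A plain-Python sketch of the same reshaping, using the first two entries for users 1 and 2 from the earlier output (the variable names `nested` and `flat` are illustrative):

```python
# Rows shaped like the recommendations output: (user, [(movie, rating), ...]).
nested = [
    (1, [(3836, 6.511707), (2013, 6.3986654)]),
    (2, [(1014, 8.222667), (307, 8.031318)]),
]

# One output row per array element, repeating the user id on each row.
flat = [
    {"user": user, "movie": movie, "rating": rating}
    for user, recs_for_user in nested
    for movie, rating in recs_for_user
]
```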



We join with the movies dataset to see the recommendations by title.

In [11]:
recs = recs.join(movies, "movie", "left_outer")
recs.sort(asc("user")).show(20)

+------+----+---------+--------------------+--------------------+
| movie|user|   rating|               title|              genres|
+------+----+---------+--------------------+--------------------+
|  2013|   1|6.3986654|Poseidon Adventur...|Action|Adventure|...|
|  3836|   1| 6.511707|Kelly's Heroes (1...|   Action|Comedy|War|
|  3505|   1|6.2374196|   No Way Out (1987)|Drama|Mystery|Thr...|
|142488|   1|6.2144933|    Spotlight (2015)|            Thriller|
|   125|   1| 6.246298|Flirting With Dis...|              Comedy|
|  3446|   2| 7.376205|  Funny Bones (1995)|        Comedy|Drama|
|   307|   2| 8.031318|Three Colors: Blu...|               Drama|
|  2492|   2|7.4105945|     20 Dates (1998)|      Comedy|Romance|
|  4649|   2| 7.236281|Wet Hot American ...|              Comedy|
|  1014|   2| 8.222667|    Pollyanna (1960)|Children|Comedy|D...|
| 70946|   3| 5.449472|      Troll 2 (1990)|      Fantasy|Horror|
| 37380|   3| 5.268049|         Doom (2005)|Action|Horror|Sci-Fi|
| 86320|  
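
A left outer join keeps every recommendation row, even when no matching movie metadata exists; in that case the title and genres are empty. A plain-Python sketch of that behavior follows. The names `toy_recs`, `toy_movies` and `joined` are illustrative, and the movie id 999999 is a made-up value used only to show the unmatched case.

```python
toy_recs = [
    {"user": 2, "movie": 1014, "rating": 8.222667},
    {"user": 2, "movie": 307, "rating": 8.031318},
    {"user": 2, "movie": 999999, "rating": 7.0},  # hypothetical id with no metadata
]
toy_movies = {
    1014: {"title": "Pollyanna (1960)", "genres": "Children|Comedy|Drama"},
    307: {"title": "Three Colors: Blue (Trois couleurs: Bleu) (1993)",
          "genres": "Drama"},
}

# Every left-side row is kept; unmatched rows get None for the right-side columns.
no_match = {"title": None, "genres": None}
joined = [{**rec, **toy_movies.get(rec["movie"], no_match)} for rec in toy_recs]
```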

In [12]:
movieRatings = ratings.join(movies, "movie", "left_outer")

In [13]:
movieRatings.filter("user == 2").toPandas()

   movie  user  rating  title                                              genres
0  318    2     3.0     Shawshank Redemption, The (1994)                   Crime|Drama
1  333    2     4.0     Tommy Boy (1995)                                   Comedy
2  1704   2     4.5     Good Will Hunting (1997)                           Drama|Romance
3  3578   2     4.0     Gladiator (2000)                                   Action|Adventure|Drama
4  6874   2     4.0     Kill Bill: Vol. 1 (2003)                           Action|Crime|Thriller
5  8798   2     3.5     Collateral (2004)                                  Action|Crime|Drama|Thriller
6  46970  2     4.0     Talladega Nights: The Ballad of Ricky Bobby (2...  Action|Comedy
7  48516  2     4.0     Departed, The (2006)                               Crime|Drama|Thriller
8  58559  2     4.5     Dark Knight, The (2008)                            Action|Crime|Drama|IMAX
9  60756  2     5.0     Step Brothers (2008)                               Comedy


In [14]:
recs.filter("user == 2").toPandas()

   movie  user  rating    title                                             genres
0  1014   2     8.222667  Pollyanna (1960)                                  Children|Comedy|Drama
1  307    2     8.031318  Three Colors: Blue (Trois couleurs: Bleu) (1993)  Drama
2  2492   2     7.410594  20 Dates (1998)                                   Comedy|Romance
3  3446   2     7.376205  Funny Bones (1995)                                Comedy|Drama
4  4649   2     7.236281  Wet Hot American Summer (2001)                    Comedy


In [15]:
movieRatings.groupBy("user").count().filter("count < 30").show()

+----+-----+
|user|count|
+----+-----+
| 471|   28|
| 496|   29|
| 392|   25|
| 516|   26|
| 251|   23|
|  53|   20|
| 296|   27|
| 472|   29|
| 530|   27|
|  81|   26|
| 406|   20|
|  26|   21|
| 192|   22|
| 329|   23|
| 388|   29|
| 548|   26|
| 578|   27|
| 333|   25|
| 157|   21|
| 360|   25|
+----+-----+
only showing top 20 rows



In [16]:
recs.filter("user == 26").toPandas()

   movie  user  rating    title                               genres
0  4649   26    9.671824  Wet Hot American Summer (2001)      Comedy
1  86911  26    8.025575  Hangover Part II, The (2011)        Comedy
2  1327   26    7.998844  Amityville Horror, The (1979)       Drama|Horror|Mystery|Thriller
3  4041   26    7.685206  Officer and a Gentleman, An (1982)  Drama|Romance
4  1014   26    7.528905  Pollyanna (1960)                    Children|Comedy|Drama


In [17]:
movieRatings.filter("user == 26").toPandas()

   movie  user  rating  title                              genres
0  10     26    3.0     GoldenEye (1995)                   Action|Adventure|Thriller
1  34     26    3.0     Babe (1995)                        Children|Drama
2  47     26    4.0     Seven (a.k.a. Se7en) (1995)        Mystery|Thriller
3  150    26    3.0     Apollo 13 (1995)                   Adventure|Drama|IMAX
4  153    26    3.0     Batman Forever (1995)              Action|Adventure|Comedy|Crime
5  165    26    4.0     Die Hard: With a Vengeance (1995)  Action|Crime|Thriller
6  185    26    3.0     Net, The (1995)                    Action|Crime|Thriller
7  208    26    2.0     Waterworld (1995)                  Action|Adventure|Sci-Fi
8  225    26    3.0     Disclosure (1994)                  Drama|Thriller
9  288    26    3.0     Natural Born Killers (1994)        Action|Crime|Thriller
