# Second Exercise: Cosine Similarity for movie comparison

In this exercise you have to implement, in a Python notebook using the Spark framework:

1. The distributed (map/reduce) algorithm of slide "3.7" (in notebook "8-Item-to-Items-globalfiltering-recommenders-py3-sshow.ipynb")
for computing the cosine similarity of a set of products with negative and positive ratings. The input is an RDD (or a Spark DataFrame, which is also distributed) of ratings in this format:

     (userID,movieID,rating)

2. The computation of the cosine similarity (with the previous algorithm) of all pairs of movies from the files provided with this exercise:
  filtered50movies.csv filtered100movies.csv  filtered150movies.csv   filtered200movies.csv

Each file contains ratings for a different set of movies, but the movies in a smaller file
are always a subset of those in the larger files. Files of different sizes are provided
in case you run into memory issues on your computer, so use the biggest file you can handle. While testing your code you can of course use the smallest file, or even a smaller subset of filtered50movies.csv.

3. Show on screen the "top 10" most similar pairs, using the
movie names found in the movies file.

All steps must be implemented with map/reduce operations on Spark RDDs/DataFrames, except the last step, where you look up the names of the movies in the top-ten recommendations.

Present your notebook with plenty of comments in all your functions.

NOTE: The movie ratings come from the smallest dataset available at:
https://grouplens.org/datasets/movielens/
The ratings have been re-scaled from the range [0,5] to the range [-3,2.5].

In [1]:
# Libraries
import pyspark
import os
import math
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

In [2]:
# Spark Session
spark = SparkSession.builder.appName('MovieRecommender').getOrCreate()
spark

In [3]:
# Movies Information
moviesDF = spark.read.csv('inputs/movies.csv',header=True)

# Cast
moviesDF = moviesDF.withColumn("movieId",moviesDF.movieId.cast('int'))
moviesDF.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [4]:
# User ratings
userMoviesDF = spark.read.csv('inputs/filtered50movies.csv',header=True)

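# Cast
# NOTE: an 'int' cast cannot represent fractional ratings (e.g. 2.5); use 'float' to keep them.
# The outputs shown in this notebook were produced with the 'int' cast below.
userMoviesDF = userMoviesDF.withColumn("UserID",userMoviesDF.UserID.cast('int'))
userMoviesDF = userMoviesDF.withColumn("MovieID",userMoviesDF.MovieID.cast('int'))
userMoviesDF = userMoviesDF.withColumn("Rating",userMoviesDF.Rating.cast('int'))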

# Sort By MoviesID
userMoviesDF = userMoviesDF.orderBy('MovieID')

userMoviesDF.show(5)

+------+-------+------+
|UserID|MovieID|Rating|
+------+-------+------+
|     1|      1|     1|
|     5|      1|     1|
|     7|      1|     2|
|    15|      1|     0|
|    17|      1|     2|
+------+-------+------+
only showing top 5 rows



In [5]:
# Cartesian product
userMoviesRDD = userMoviesDF.rdd.cartesian(userMoviesDF.rdd)

# Keep only pairs rated by the same user, with the first MovieID smaller than the second so each movie pair is counted once
userMoviesRDD = userMoviesRDD.filter(lambda x: x[0][0] == x[1][0] and x[0][1] < x[1][1])

In [6]:
userMoviesRDD.take(5)

[(Row(UserID=1, MovieID=1, Rating=1), Row(UserID=1, MovieID=3, Rating=1)),
 (Row(UserID=19, MovieID=1, Rating=1), Row(UserID=19, MovieID=3, Rating=0)),
 (Row(UserID=32, MovieID=1, Rating=0), Row(UserID=32, MovieID=3, Rating=0)),
 (Row(UserID=43, MovieID=1, Rating=2), Row(UserID=43, MovieID=3, Rating=2)),
 (Row(UserID=44, MovieID=1, Rating=0), Row(UserID=44, MovieID=3, Rating=0))]
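
The Cartesian product compares every rating row with every other row before filtering, so its cost grows with the square of the number of ratings. A common alternative is to group the ratings by user and pair up only rows that share a user (in Spark this is usually expressed as a self-join on the user key, for example with `RDD.join`). The plain-Python sketch below only illustrates that logic on a small illustrative sample; `pairs_by_user` and the sample rows are hypothetical and not part of the notebook, and the sketch assumes each user rates a movie at most once.

```python
from collections import defaultdict
from itertools import combinations

def pairs_by_user(ratings):
    """Yield ((user, movie_a, rating_a), (user, movie_b, rating_b)) with movie_a < movie_b."""
    by_user = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append((movie, rating))
    for user, rated in by_user.items():
        # Sorting by movie ID keeps the same ordering rule as the filter above
        for (m1, r1), (m2, r2) in combinations(sorted(rated), 2):
            yield (user, m1, r1), (user, m2, r2)

# Illustrative rows in (userID, movieID, rating) format
sample = [(1, 1, 1), (1, 3, 1), (19, 1, 1), (19, 3, 0), (19, 6, 2), (7, 1, 2)]
pairs = list(pairs_by_user(sample))
print(len(pairs))  # 4
```

User 7 rated a single movie in this sample, so it contributes no pair, and no row is ever compared with a row from a different user.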

In [7]:
# Map: key = (movie_1, movie_2); value = (r1*r2, r1^2, r2^2) for one user who rated both
userMoviesRDD = userMoviesRDD.map(lambda i:((i[0][1],i[1][1]), (i[0][2]*i[1][2],i[0][2]**2,i[1][2]**2)))

In [8]:
userMoviesRDD.take(5)

[((1, 3), (1, 1, 1)),
 ((1, 3), (0, 1, 0)),
 ((1, 3), (0, 0, 0)),
 ((1, 3), (4, 4, 4)),
 ((1, 3), (0, 0, 0))]

In [9]:
# Reduce by key: sum the three components over all users who rated both movies
userMoviesRDD = userMoviesRDD.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]+y[2]))

In [10]:
userMoviesRDD.sortByKey().take(5)

[((1, 3), (22, 52, 30)),
 ((1, 6), (62, 86, 102)),
 ((1, 47), (109, 144, 197)),
 ((1, 50), (141, 156, 242)),
 ((1, 70), (15, 39, 42))]
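
To make the map and reduce steps concrete, the sketch below replays them in plain Python on the five rating pairs for movies 1 and 3 shown in the first `take(5)` output. The helper names `to_partial_sums` and `add_triples` are introduced here for illustration only. The result covers just those five users, which is why it differs from the full sum `(22, 52, 30)` that Spark reports for the key `(1, 3)`.

```python
from functools import reduce

# (rating of movie 1, rating of movie 3) for the five users shown earlier
rating_pairs = [(1, 1), (1, 0), (0, 0), (2, 2), (0, 0)]

def to_partial_sums(r1, r2):
    """Map step: one co-rating contributes (r1*r2, r1^2, r2^2)."""
    return (r1 * r2, r1 ** 2, r2 ** 2)

def add_triples(x, y):
    """Reduce step: component-wise sum, the same operation passed to reduceByKey."""
    return (x[0] + y[0], x[1] + y[1], x[2] + y[2])

mapped = [to_partial_sums(r1, r2) for r1, r2 in rating_pairs]
partial = reduce(add_triples, mapped)
print(mapped)   # [(1, 1, 1), (0, 1, 0), (0, 0, 0), (4, 4, 4), (0, 0, 0)]
print(partial)  # (5, 6, 5)
```

Component-wise addition is associative and commutative, so Spark can combine the partial sums in any order across partitions.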

In [11]:
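# Cosine similarity from the accumulated sums (dot product, sum of squares 1, sum of squares 2).
# Despite its name, this returns a similarity in [-1, 1], not a distance.
def cosineDistance(val):
    denominator = math.sqrt(val[1])*math.sqrt(val[2])
    # Guard against pairs whose co-ratings are all zero (zero norm); 0.0 is used by convention
    return val[0]/denominator if denominator else 0.0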

In [12]:
# Get Cosine for each pair of movies
userMoviesRDD = userMoviesRDD.map(lambda x: (x[0][0],x[0][1],cosineDistance(x[1])))
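
As a quick sanity check of the formula, the snippet below applies it by hand to the triple `(22, 52, 30)` that `reduceByKey` produced for the pair `(1, 3)`. The function name `cosine_from_sums` is hypothetical and used only in this sketch.

```python
import math

def cosine_from_sums(dot, sq_a, sq_b):
    """Cosine similarity from the three accumulated sums."""
    return dot / (math.sqrt(sq_a) * math.sqrt(sq_b))

# Sums for the pair (1, 3) taken from the reduceByKey output
sim_1_3 = cosine_from_sums(22, 52, 30)
print(round(sim_1_3, 3))  # 0.557
```

A value of about 0.557 falls below the tenth entry of the top-10 list for this file, so the pair `(1, 3)` would not be expected to appear there.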

In [13]:
# Get top 10 relations
top10 = userMoviesRDD.sortBy(lambda x: -x[2]).take(10)

In [14]:
top10

[(151, 441, 0.9999999999999998),
 (661, 923, 0.9203579866168444),
 (362, 543, 0.8864052604279183),
 (50, 296, 0.8838436225454884),
 (608, 923, 0.877239316059811),
 (151, 919, 0.8660254037844387),
 (151, 457, 0.8486684247915055),
 (441, 923, 0.8411101831919989),
 (50, 593, 0.8383073406122689),
 (50, 527, 0.825837875567179)]
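
`sortBy(...).take(10)` sorts every movie pair just to keep ten of them. Spark's `RDD.takeOrdered(num, key)` asks for the leading entries directly, which is usually cheaper. The plain-Python analogue below uses `heapq.nlargest` on a few of the tuples listed above, purely as an illustration of selecting the top entries without a full sort.

```python
import heapq

# A few (movie_1, movie_2, similarity) tuples copied from the output above
scored = [
    (50, 527, 0.825837875567179),
    (151, 441, 0.9999999999999998),
    (608, 923, 0.877239316059811),
    (661, 923, 0.9203579866168444),
]

# Keep the two highest-scoring pairs, best first
top2 = heapq.nlargest(2, scored, key=lambda t: t[2])
print(top2)  # [(151, 441, 0.9999999999999998), (661, 923, 0.9203579866168444)]
```

In the notebook, the corresponding Spark call would be along the lines of `userMoviesRDD.takeOrdered(10, key=lambda x: -x[2])`.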

In [15]:
# Get movie title by ID (helper; the join below is what the notebook actually uses)
def getMovieTitleById(movie_id):
    return moviesDF.filter(moviesDF.movieId == movie_id).first().title

In [16]:
# Create Dataframe with top 10 relations
top10DF = spark.createDataFrame(top10,["Movie_1_ID","Movie_2_ID","Cosine Distance"])

In [17]:
top10DF.show()

+----------+----------+------------------+
|Movie_1_ID|Movie_2_ID|   Cosine Distance|
+----------+----------+------------------+
|       151|       441|0.9999999999999998|
|       661|       923|0.9203579866168444|
|       362|       543|0.8864052604279183|
|        50|       296|0.8838436225454884|
|       608|       923| 0.877239316059811|
|       151|       919|0.8660254037844387|
|       151|       457|0.8486684247915055|
|       441|       923|0.8411101831919989|
|        50|       593|0.8383073406122689|
|        50|       527| 0.825837875567179|
+----------+----------+------------------+



In [18]:
# Join Movies Id with Titles
subset = top10DF.join(moviesDF,top10DF.Movie_1_ID == moviesDF.movieId,"inner")
subset = subset.withColumnRenamed("title", "Movie_1")
subset = subset.select(col("Movie_1"), col("Movie_2_ID"), col("Cosine Distance"))

subset = subset.join(moviesDF,subset.Movie_2_ID == moviesDF.movieId,"inner")
subset = subset.withColumnRenamed("title", "Movie_2")
subset = subset.select(col("Movie_1"), col("Movie_2"), col("Cosine Distance"))
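# Joins do not guarantee row order, so sort explicitly before showing
subset.orderBy(col("Cosine Distance").desc()).show()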

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|0.9999999999999998|
|James and the Gia...| Citizen Kane (1941)|0.9203579866168444|
|Jungle Book, The ...|So I Married an A...|0.8864052604279183|
|Usual Suspects, T...| Pulp Fiction (1994)|0.8838436225454884|
|        Fargo (1996)| Citizen Kane (1941)| 0.877239316059811|
|      Rob Roy (1995)|Wizard of Oz, The...|0.8660254037844387|
|      Rob Roy (1995)|Fugitive, The (1993)|0.8486684247915055|
|Dazed and Confuse...| Citizen Kane (1941)|0.8411101831919989|
|Usual Suspects, T...|Silence of the La...|0.8383073406122689|
|Usual Suspects, T...|Schindler's List ...| 0.825837875567179|
+--------------------+--------------------+------------------+
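
Since the exercise allows the final name lookup to happen outside map/reduce, one alternative to the two joins is a dictionary lookup on the driver once the top pairs have been collected. The sketch below assumes a small `titles` dictionary; its entries are copied from the tables above, whereas a real run would build it from movies.csv. Both `titles` and `with_titles` are hypothetical names.

```python
# Titles copied from the joined output above (illustrative subset only)
titles = {
    296: "Pulp Fiction (1994)",
    608: "Fargo (1996)",
    923: "Citizen Kane (1941)",
}

top_pairs = [(608, 923, 0.877239316059811)]

def with_titles(pairs, titles):
    """Replace movie IDs by titles, falling back to the ID when a title is missing."""
    return [(titles.get(a, a), titles.get(b, b), score) for a, b, score in pairs]

named = with_titles(top_pairs, titles)
print(named)  # [('Fargo (1996)', 'Citizen Kane (1941)', 0.877239316059811)]
```

This keeps the order of the collected list, so no extra sort is needed after the lookup.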



In [19]:
def getRecommendations(csv_file,n=10):
    """Show the n most similar movie pairs (by cosine similarity) for a ratings file in inputs/."""
    # Movies Information
    moviesDF = spark.read.csv('inputs/movies.csv',header=True)
    moviesDF = moviesDF.withColumn("movieId",moviesDF.movieId.cast('int'))
    
    # User ratings
    userMoviesDF = spark.read.csv('inputs/'+csv_file,header=True)

    # Cast
    userMoviesDF = userMoviesDF.withColumn("UserID",userMoviesDF.UserID.cast('int'))
    userMoviesDF = userMoviesDF.withColumn("MovieID",userMoviesDF.MovieID.cast('int'))
    userMoviesDF = userMoviesDF.withColumn("Rating",userMoviesDF.Rating.cast('int'))

    # Sort By MoviesID
    userMoviesDF = userMoviesDF.orderBy('MovieID')
    
    # Cartesian product
    userMoviesRDD = userMoviesDF.rdd.cartesian(userMoviesDF.rdd)

    # Filter the same user and different movie
    userMoviesRDD = userMoviesRDD.filter(lambda x: x[0][0] == x[1][0] and x[0][1] < x[1][1])
    
    # Map
    userMoviesRDD = userMoviesRDD.map(lambda i:((i[0][1],i[1][1]), (i[0][2]*i[1][2],i[0][2]**2,i[1][2]**2)))
    
    # Reduce by Key
    userMoviesRDD = userMoviesRDD.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]+y[2]))
    
    # Get Cosine for each pair of movies
    userMoviesRDD = userMoviesRDD.map(lambda x: (x[0][0],x[0][1],cosineDistance(x[1])))
    
    # Top n most similar pairs, shown by movie title
    top10 = userMoviesRDD.sortBy(lambda x: -x[2]).take(n)
    top10DF = spark.createDataFrame(top10,["Movie_1_ID","Movie_2_ID","Cosine Distance"])
    subset = top10DF.join(moviesDF,top10DF.Movie_1_ID == moviesDF.movieId,"inner")
    
    subset = subset.withColumnRenamed("title", "Movie_1")
    subset = subset.select(col("Movie_1"), col("Movie_2_ID"), col("Cosine Distance"))

    subset = subset.join(moviesDF,subset.Movie_2_ID == moviesDF.movieId,"inner")
    subset = subset.withColumnRenamed("title", "Movie_2")
    subset = subset.select(col("Movie_1"), col("Movie_2"), col("Cosine Distance"))
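    # Joins do not guarantee row order, so sort explicitly before showing
    subset.orderBy(col("Cosine Distance").desc()).show()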

In [20]:
getRecommendations('filtered50movies.csv')

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|0.9999999999999998|
|James and the Gia...| Citizen Kane (1941)|0.9203579866168444|
|Jungle Book, The ...|So I Married an A...|0.8864052604279183|
|Usual Suspects, T...| Pulp Fiction (1994)|0.8838436225454884|
|        Fargo (1996)| Citizen Kane (1941)| 0.877239316059811|
|      Rob Roy (1995)|Wizard of Oz, The...|0.8660254037844387|
|      Rob Roy (1995)|Fugitive, The (1993)|0.8486684247915055|
|Dazed and Confuse...| Citizen Kane (1941)|0.8411101831919989|
|Usual Suspects, T...|Silence of the La...|0.8383073406122689|
|Usual Suspects, T...|Schindler's List ...| 0.825837875567179|
+--------------------+--------------------+------------------+



In [21]:
getRecommendations('filtered100movies.csv')

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|0.9999999999999998|
|Star Wars: Episod...|Star Wars: Episod...|0.9404393848984065|
|      Platoon (1986)|   Goodfellas (1990)|0.9258200997725515|
| Pulp Fiction (1994)|      Platoon (1986)|0.9254934184530236|
|James and the Gia...| Citizen Kane (1941)|0.9203579866168444|
|Star Wars: Episod...|Star Wars: Episod...|0.9128739147503566|
|Reservoir Dogs (1...|   Goodfellas (1990)|0.9108460178468284|
|      Rob Roy (1995)|     Fantasia (1940)|0.9045340337332909|
|Reservoir Dogs (1...|      Platoon (1986)|0.9029210851443182|
|Raiders of the Lo...|Indiana Jones and...|0.8995067832853526|
+--------------------+--------------------+------------------+



In [22]:
getRecommendations('filtered150movies.csv')

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|0.9999999999999998|
|Star Wars: Episod...|Star Wars: Episod...|0.9404393848984065|
|      Platoon (1986)|   Goodfellas (1990)|0.9258200997725515|
| Pulp Fiction (1994)|      Platoon (1986)|0.9254934184530236|
|    Tommy Boy (1995)|Negotiator, The (...|0.9245003270420487|
|James and the Gia...| Citizen Kane (1941)|0.9203579866168444|
| Citizen Kane (1941)|L.A. Confidential...|0.9191161050815164|
|        Bambi (1942)|         Tron (1982)|0.9185586535436918|
|Star Wars: Episod...|Star Wars: Episod...|0.9128739147503566|
|Reservoir Dogs (1...|   Goodfellas (1990)|0.9108460178468284|
+--------------------+--------------------+------------------+



In [23]:
getRecommendations('filtered200movies.csv')

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|Jungle Book, The ...|           Go (1999)|               1.0|
|      Rob Roy (1995)|Dazed and Confuse...|0.9999999999999998|
|Jungle Book, The ...|   Dick Tracy (1990)|0.9999999999999998|
|      Rob Roy (1995)|           Go (1999)|0.9622504486493763|
|      Rob Roy (1995)|Crocodile Dundee ...|0.9561828874675149|
|Jungle Book, The ...|¡Three Amigos! (1...|0.9486832980505138|
|Dirty Dozen, The ...|   Robin Hood (1973)|0.9486832980505138|
|  Wild Things (1998)|   Robin Hood (1973)|0.9428090415820632|
|Star Wars: Episod...|Star Wars: Episod...|0.9404393848984065|
|    Tommy Boy (1995)|Iron Giant, The (...|0.9337990556476821|
+--------------------+--------------------+------------------+



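- The value lies in the range [-1,1].
- -1 means that the users who rated both movies gave them opposite ratings, so the movies are dissimilar.
- 0 means that the ratings of the two movies show no relationship (neutral).
- 1 means that the two movies were rated alike, so they are similar.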
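
For reference, the quantity computed by the map/reduce steps in this notebook can be written as follows. The notation is introduced here and is not taken from the slides: $U_{ab}$ is the set of users who rated both movies $a$ and $b$, and $r_{u,a}$ is the rating that user $u$ gave to movie $a$.

$$
\mathrm{sim}(a,b) = \frac{\sum_{u \in U_{ab}} r_{u,a}\, r_{u,b}}{\sqrt{\sum_{u \in U_{ab}} r_{u,a}^{2}}\;\sqrt{\sum_{u \in U_{ab}} r_{u,b}^{2}}}
$$

All three sums run only over users who rated both movies, because the map step emits a value only for such pairs. One consequence is that when a single user rated both movies (with non-zero ratings), the expression reduces to $\pm 1$ regardless of the actual values, so scores at the extremes may reflect very few shared raters rather than strong agreement.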

## Conclusions

The datasets provided contain at most 200 movies, and the main signal they carry is the users' opinion.
The users' ratings are what link the movies into pairs and drive the recommendations.

Relying on the users' opinion can sometimes give poor results: two movies may be similar, but if users dislike one of them, that movie will not be recommended. For this case study, we believe that using the genres of the movies to measure their similarity would work better.
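
One possible way to act on the genre idea, not implemented in this notebook, is to compare the genre sets of two movies with the Jaccard index. The sketch below is only an assumption about how that could look; `genre_set` and `jaccard` are hypothetical helpers, and the genre strings are the untruncated ones from the movies table shown earlier.

```python
def genre_set(genres):
    """Split a 'A|B|C' genre string into a set of genre names."""
    return set(genres.split("|"))

def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

grumpier = genre_set("Comedy|Romance")        # movieId 3
waiting = genre_set("Comedy|Drama|Romance")   # movieId 4
father = genre_set("Comedy")                  # movieId 5

print(jaccard(grumpier, waiting))  # 2 shared genres out of 3 distinct ones
print(jaccard(grumpier, father))   # 0.5
```

Unlike the rating-based score, this measure ignores whether users liked the movies, so it would complement rather than replace the cosine similarity.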

As the different results show, the recommendation algorithm based on the users' opinion works reasonably well. For example, movies that belong to the same saga, such as Star Wars or Indiana Jones, are paired together. Movies with a similar scope, such as Reservoir Dogs and Goodfellas, are also related. So are movies that, in our reading, share actors or reference each other, for example The Usual Suspects and Schindler's List.

In conclusion, the recommender developed here uses user feedback to recommend movies that are similar.