# Second Exercise: Cosine Similarity for movie comparison

In this exercise you have to implement the following in a Python notebook, using the Spark framework:

1. The distributed (map/reduce) algorithm of slide "3.7" (in notebook "8-Item-to-Items-globalfiltering-recommenders-py3-sshow.ipynb")
for computing the cosine similarity of a set of products with negative and positive ratings. The input is an RDD (or a Spark DataFrame, which is also distributed) of ratings in this format:

     (userID,movieID,rating)

2. The computation of the cosine similarity (with the previous algorithm) of all pairs of movies in the files provided with this exercise:
  filtered50movies.csv filtered100movies.csv  filtered150movies.csv   filtered200movies.csv

Each file contains ratings for a different set of movies, and the movies in a smaller file
are always a subset of those in a larger file. We provide files of different sizes
in case you run into memory issues on your computer, so use the largest file you can handle. While testing your code you can of course use the smallest file, or even a smaller subset of filtered50movies.csv.

3. Show on the screen the information for the "top 10" most similar pairs, using the
names of the movies, which you can find in the file movies.csv.

All steps must be implemented with map/reduce operations on Spark RDDs/DataFrames, except the last one, where you look up the names of the movies in the top-ten recommendations.

Present your notebook with plenty of comments in all your functions.

NOTE: The movie ratings come from the smallest dataset available at:
https://grouplens.org/datasets/movielens/
The ratings have been re-scaled from the range [0,5] to the range [-3,2.5].
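
For reference, the quantity that the cells below compute for a pair of movies $i$ and $j$ can be written as follows. This is read off the notebook's own code rather than copied from slide 3.7, and the symbols are notation introduced here: $U_{ij}$ is the set of users who rated both movies, and $r_{u,i}$ is the rating user $u$ gave to movie $i$.

$$
\mathrm{sim}(i,j) = \frac{\sum_{u \in U_{ij}} r_{u,i}\, r_{u,j}}{\sqrt{\sum_{u \in U_{ij}} r_{u,i}^{2}}\;\sqrt{\sum_{u \in U_{ij}} r_{u,j}^{2}}}
$$

Both norms in the denominator run only over the users who rated both movies, because the pairs are built per user before anything is summed.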

In [1]:
# Libraries
import pyspark
import os
import math
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

In [2]:
# Spark Session
spark = SparkSession.builder.appName('MovieRecommender').getOrCreate()
spark

In [3]:
# Movies Information
moviesDF = spark.read.csv('inputs/movies.csv',header=True)

# Cast
moviesDF = moviesDF.withColumn("movieId",moviesDF.movieId.cast('int'))
moviesDF.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [4]:
# User ratings
userMoviesDF = spark.read.csv('inputs/filtered50movies.csv',header=True)

# Cast
userMoviesDF = userMoviesDF.withColumn("UserID",userMoviesDF.UserID.cast('int'))
userMoviesDF = userMoviesDF.withColumn("MovieID",userMoviesDF.MovieID.cast('int'))
userMoviesDF = userMoviesDF.withColumn("Rating",userMoviesDF.Rating.cast('float'))

# Sort By MoviesID
userMoviesDF = userMoviesDF.orderBy('MovieID')

userMoviesDF.show(5)

+------+-------+------+
|UserID|MovieID|Rating|
+------+-------+------+
|     1|      1|   1.5|
|     5|      1|   1.5|
|     7|      1|   2.0|
|    15|      1|  -0.5|
|    17|      1|   2.0|
+------+-------+------+
only showing top 5 rows



In [5]:
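# Cartesian product of the ratings with themselves: every rating row is paired with every other one
userMoviesRDD = userMoviesDF.rdd.cartesian(userMoviesDF.rdd)

# Keep only pairs from the same user with MovieID_1 < MovieID_2,
# so that each pair of movies appears exactly once per user
userMoviesRDD = userMoviesRDD.filter(lambda x: x[0][0] == x[1][0] and x[0][1] < x[1][1])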

In [6]:
userMoviesRDD.take(5)

[(Row(UserID=1, MovieID=1, Rating=1.5), Row(UserID=1, MovieID=3, Rating=1.5)),
 (Row(UserID=19, MovieID=1, Rating=1.5),
  Row(UserID=19, MovieID=3, Rating=0.5)),
 (Row(UserID=32, MovieID=1, Rating=0.5),
  Row(UserID=32, MovieID=3, Rating=0.5)),
 (Row(UserID=43, MovieID=1, Rating=2.5),
  Row(UserID=43, MovieID=3, Rating=2.5)),
 (Row(UserID=44, MovieID=1, Rating=0.5),
  Row(UserID=44, MovieID=3, Rating=0.5))]
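
The cartesian product compares every rating row with every other row, and the filter then discards most of those combinations. The same per-user movie pairs can be obtained by grouping the ratings by user first. The sketch below shows that idea in plain Python (not Spark) on rows copied from the output above; `pairs_per_user` is a hypothetical helper, and it assumes each user rates a movie at most once. In Spark, joining the ratings with themselves on `UserID` would likely play the same role.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows taken from the notebook output above: (UserID, MovieID, Rating)
ratings = [
    (1, 1, 1.5), (1, 3, 1.5),
    (19, 1, 1.5), (19, 3, 0.5),
]

def pairs_per_user(rows):
    """Return (user, (movie_a, rating_a), (movie_b, rating_b)) with movie_a < movie_b."""
    by_user = defaultdict(list)
    for user, movie, rating in rows:
        by_user[user].append((movie, rating))
    pairs = []
    for user in sorted(by_user):
        # Sorting by movie ID makes combinations() emit each pair once, smaller ID first
        for a, b in combinations(sorted(by_user[user]), 2):
            pairs.append((user, a, b))
    return pairs

user_pairs = pairs_per_user(ratings)
print(user_pairs)
```

Each user contributes one entry per pair of movies they rated, which is the same set of pairs the filter keeps.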

In [7]:
# Map each pair to key (movie_1, movie_2) and value (r1*r2, r1^2, r2^2)
userMoviesRDD = userMoviesRDD.map(lambda i:((i[0][1],i[1][1]), (i[0][2]*i[1][2],i[0][2]**2,i[1][2]**2)))

In [8]:
userMoviesRDD.take(5)

[((1, 3), (2.25, 2.25, 2.25)),
 ((1, 3), (0.75, 2.25, 0.25)),
 ((1, 3), (0.25, 0.25, 0.25)),
 ((1, 3), (6.25, 6.25, 6.25)),
 ((1, 3), (0.25, 0.25, 0.25))]
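
To make the map step concrete, here is a small plain-Python sketch that applies the same transformation to the first two pairs printed by cell In [6]. The function name `to_key_value` is made up for this illustration; rows are written as plain tuples instead of Spark `Row` objects.

```python
def to_key_value(pair):
    """Map one (rating_row_a, rating_row_b) pair to ((movie_a, movie_b), (a*b, a^2, b^2))."""
    (_, movie_a, r_a), (_, movie_b, r_b) = pair
    return (movie_a, movie_b), (r_a * r_b, r_a ** 2, r_b ** 2)

# First two pairs from the take(5) output of cell In [6], as (UserID, MovieID, Rating)
sample = [
    ((1, 1, 1.5), (1, 3, 1.5)),
    ((19, 1, 1.5), (19, 3, 0.5)),
]
mapped = [to_key_value(p) for p in sample]
print(mapped)
```

The two results agree with the first two records of the output above: `(2.25, 2.25, 2.25)` and `(0.75, 2.25, 0.25)`, both under key `(1, 3)`.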

In [9]:
# Reduce by key: per movie pair, sum the products and the squared ratings over all users
userMoviesRDD = userMoviesRDD.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]+y[2]))

In [10]:
userMoviesRDD.sortByKey().take(5)

[((1, 3), (48.0, 88.25, 51.5)),
 ((1, 6), (119.0, 150.25, 160.5)),
 ((1, 47), (190.0, 243.5, 291.25)),
 ((1, 50), (240.25, 255.0, 348.0)),
 ((1, 70), (31.75, 68.75, 61.75))]
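
The reduce step adds the three components of all records that share a key. A minimal plain-Python sketch of that aggregation, using the five mapped records printed by cell In [8], could look like this (`reduce_by_key` is a hypothetical stand-in for Spark's `reduceByKey`, not its implementation):

```python
from collections import defaultdict

# The five mapped records shown in the take(5) output of cell In [8]
mapped = [
    ((1, 3), (2.25, 2.25, 2.25)),
    ((1, 3), (0.75, 2.25, 0.25)),
    ((1, 3), (0.25, 0.25, 0.25)),
    ((1, 3), (6.25, 6.25, 6.25)),
    ((1, 3), (0.25, 0.25, 0.25)),
]

def reduce_by_key(records):
    """Sum (dot, sq_a, sq_b) component-wise for every key."""
    totals = defaultdict(lambda: (0.0, 0.0, 0.0))
    for key, (dot, sq_a, sq_b) in records:
        t = totals[key]
        totals[key] = (t[0] + dot, t[1] + sq_a, t[2] + sq_b)
    return dict(totals)

partial = reduce_by_key(mapped)
print(partial)
```

These are partial sums over five users only, so they are smaller than the full totals `(48.0, 88.25, 51.5)` that the notebook reports for the pair `(1, 3)`.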

In [11]:
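# Cosine similarity from the reduced sums (dot product, sum of squares 1, sum of squares 2).
# Note: despite its name, this function returns a similarity, not a distance.
def cosineDistance(val):
    denominator = math.sqrt(val[1])*math.sqrt(val[2])
    # If all ratings on one side are 0 the denominator is 0; treat that as "no relation"
    return val[0]/denominator if denominator else 0.0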
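
As a worked example of this formula, the reduced sums that cell In [10] prints for the pair `(1, 3)` can be plugged in by hand. The variable names below are chosen for this illustration only.

```python
import math

# Reduced sums for the pair (1, 3) as printed by cell In [10]
dot, sq_1, sq_3 = 48.0, 88.25, 51.5

sim_1_3 = dot / (math.sqrt(sq_1) * math.sqrt(sq_3))
print(round(sim_1_3, 3))  # roughly 0.712
```

A value of about 0.71 is well below the scores in the top-10 list, which is consistent with the pair `(1, 3)` not appearing there.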

In [12]:
# Get Cosine for each pair of movies
userMoviesRDD = userMoviesRDD.map(lambda x: (x[0][0],x[0][1],cosineDistance(x[1])))

In [13]:
# Get top 10 relations
top10 = userMoviesRDD.sortBy(lambda x: -x[2]).take(10)

In [14]:
top10

[(151, 441, 1.0),
 (661, 923, 0.9476070829586856),
 (151, 457, 0.9178523316578322),
 (50, 296, 0.9117100850753502),
 (151, 919, 0.9101665113610138),
 (3, 151, 0.9052317076000181),
 (163, 661, 0.9030616159415418),
 (441, 923, 0.901674573007367),
 (151, 608, 0.8979133729352984),
 (608, 923, 0.8908671638779693)]
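
Sorting the whole RDD only to keep ten rows does more work than needed, since only the n largest scores have to be tracked. The plain-Python sketch below shows the idea with `heapq.nlargest` on a few rows copied from the output above. On an RDD, `takeOrdered(10, key=lambda x: -x[2])` is the method intended for this and should return the same rows, although it was not run as part of this notebook.

```python
import heapq

# A few (movie_1, movie_2, similarity) rows copied from the top10 output above,
# deliberately shuffled
scored = [
    (151, 457, 0.9178523316578322),
    (151, 441, 1.0),
    (50, 296, 0.9117100850753502),
    (661, 923, 0.9476070829586856),
]

# Keep only the two highest-scoring rows, without sorting the full list
top2 = heapq.nlargest(2, scored, key=lambda row: row[2])
print(top2)
```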

In [15]:
# Get Movie Title By Id
def getMovieTitleById(movie_id):
    return moviesDF.filter(moviesDF.movieId == movie_id).first().title

In [16]:
# Create Dataframe with top 10 relations
top10DF = spark.createDataFrame(top10,["Movie_1_ID","Movie_2_ID","Cosine Distance"])

In [17]:
top10DF.show()

+----------+----------+------------------+
|Movie_1_ID|Movie_2_ID|   Cosine Distance|
+----------+----------+------------------+
|       151|       441|               1.0|
|       661|       923|0.9476070829586856|
|       151|       457|0.9178523316578322|
|        50|       296|0.9117100850753502|
|       151|       919|0.9101665113610138|
|         3|       151|0.9052317076000181|
|       163|       661|0.9030616159415418|
|       441|       923| 0.901674573007367|
|       151|       608|0.8979133729352984|
|       608|       923|0.8908671638779693|
+----------+----------+------------------+



In [18]:
# Join Movies Id with Titles
subset = top10DF.join(moviesDF,top10DF.Movie_1_ID == moviesDF.movieId,"inner")
subset = subset.withColumnRenamed("title", "Movie_1")
subset = subset.select(col("Movie_1"), col("Movie_2_ID"), col("Cosine Distance"))

subset = subset.join(moviesDF,subset.Movie_2_ID == moviesDF.movieId,"inner")
subset = subset.withColumnRenamed("title", "Movie_2")
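subset = subset.select(col("Movie_1"), col("Movie_2"), col("Cosine Distance"))
# A join does not guarantee row order, so sort again before showing
subset = subset.orderBy(col("Cosine Distance").desc())
subset.show()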

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|               1.0|
|James and the Gia...| Citizen Kane (1941)|0.9476070829586856|
|      Rob Roy (1995)|Fugitive, The (1993)|0.9178523316578322|
|Usual Suspects, T...| Pulp Fiction (1994)|0.9117100850753502|
|      Rob Roy (1995)|Wizard of Oz, The...|0.9101665113610138|
|Grumpier Old Men ...|      Rob Roy (1995)|0.9052317076000181|
|    Desperado (1995)|James and the Gia...|0.9030616159415418|
|Dazed and Confuse...| Citizen Kane (1941)| 0.901674573007367|
|      Rob Roy (1995)|        Fargo (1996)|0.8979133729352984|
|        Fargo (1996)| Citizen Kane (1941)|0.8908671638779693|
+--------------------+--------------------+------------------+
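
Since the exercise allows the title lookup to happen outside map/reduce and only ten rows are involved, the two joins could also be replaced by a dictionary lookup on the driver. The sketch below is plain Python with IDs and titles taken from the outputs above (only titles that are not truncated are used); `with_titles` is a hypothetical helper, and in the notebook the dictionary would have to be built from `moviesDF`.

```python
# IDs and titles as they appear in the outputs above
titles = {
    151: "Rob Roy (1995)",
    457: "Fugitive, The (1993)",
    608: "Fargo (1996)",
    923: "Citizen Kane (1941)",
}
top_pairs = [
    (151, 457, 0.9178523316578322),
    (151, 608, 0.8979133729352984),
    (608, 923, 0.8908671638779693),
]

def with_titles(pairs, title_by_id):
    """Replace both movie IDs with titles, falling back to the ID as text if unknown."""
    return [
        (title_by_id.get(a, str(a)), title_by_id.get(b, str(b)), score)
        for a, b, score in pairs
    ]

named = with_titles(top_pairs, titles)
for row in named:
    print(row)
```

A list built this way keeps the order of `top_pairs`, so no extra sort is needed after the lookup.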



In [19]:
def getCosineSimilarity(dataframe):
    
    """
    Compute the cosine similarity of every pair of products, using as input
    a dataframe with rows like (UserID,Product,Rating).
    
    Returns an RDD with format (Product1,Product2,CosineSimilarity)
    """
    
    # Cartesian product of the ratings with themselves
    resultRDD = dataframe.rdd.cartesian(dataframe.rdd)

    # Keep pairs from the same user with movie_1 < movie_2 (each movie pair once per user)
    resultRDD = resultRDD.filter(lambda x: x[0][0] == x[1][0] and x[0][1] < x[1][1])
    
    # Map to ((movie_1, movie_2), (r1*r2, r1^2, r2^2))
    resultRDD = resultRDD.map(lambda i:((i[0][1],i[1][1]), (i[0][2]*i[1][2],i[0][2]**2,i[1][2]**2)))
    
    # Sum the three components per movie pair
    resultRDD = resultRDD.reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]+y[2]))
    
    # Cosine similarity for each pair of movies
    resultRDD = resultRDD.map(lambda x: (x[0][0],x[0][1],cosineDistance(x[1])))
    
    return resultRDD

In [20]:
def getRecommendations(csv_file,n=10):
    
    """
    Compute the n most similar pairs of movies (by cosine similarity) from the ratings
    in a CSV file located in the inputs/ folder.
    
    Returns a DataFrame with columns [Movie_1,Movie_2,Cosine Distance]
    
    """
    
    # Movies Information
    moviesDF = spark.read.csv('inputs/movies.csv',header=True)
    moviesDF = moviesDF.withColumn("movieId",moviesDF.movieId.cast('int'))
    
    # User ratings
    userMoviesDF = spark.read.csv('inputs/'+csv_file,header=True)

    # Cast
    userMoviesDF = userMoviesDF.withColumn("UserID",userMoviesDF.UserID.cast('int'))
    userMoviesDF = userMoviesDF.withColumn("MovieID",userMoviesDF.MovieID.cast('int'))
    userMoviesDF = userMoviesDF.withColumn("Rating",userMoviesDF.Rating.cast('float'))

    # Sort By MoviesID
    userMoviesDF = userMoviesDF.orderBy('MovieID')
    
    # Calc Cosine Similarity
    userMoviesRDD = getCosineSimilarity(userMoviesDF)
    
    # Top n most similar pairs, with movie titles instead of IDs
    top10 = userMoviesRDD.sortBy(lambda x: -x[2]).take(n)
    top10DF = spark.createDataFrame(top10,["Movie_1_ID","Movie_2_ID","Cosine Distance"])
    subset = top10DF.join(moviesDF,top10DF.Movie_1_ID == moviesDF.movieId,"inner")
    
    subset = subset.withColumnRenamed("title", "Movie_1")
    subset = subset.select(col("Movie_1"), col("Movie_2_ID"), col("Cosine Distance"))

    subset = subset.join(moviesDF,subset.Movie_2_ID == moviesDF.movieId,"inner")
    subset = subset.withColumnRenamed("title", "Movie_2")
    subset = subset.select(col("Movie_1"), col("Movie_2"), col("Cosine Distance"))
    # A join does not guarantee row order, so sort again before returning
    return subset.orderBy(col("Cosine Distance").desc())

In [21]:
getRecommendations('filtered50movies.csv').show()

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|               1.0|
|James and the Gia...| Citizen Kane (1941)|0.9476070829586856|
|      Rob Roy (1995)|Fugitive, The (1993)|0.9178523316578322|
|Usual Suspects, T...| Pulp Fiction (1994)|0.9117100850753502|
|      Rob Roy (1995)|Wizard of Oz, The...|0.9101665113610138|
|Grumpier Old Men ...|      Rob Roy (1995)|0.9052317076000181|
|    Desperado (1995)|James and the Gia...|0.9030616159415418|
|Dazed and Confuse...| Citizen Kane (1941)| 0.901674573007367|
|      Rob Roy (1995)|        Fargo (1996)|0.8979133729352984|
|        Fargo (1996)| Citizen Kane (1941)|0.8908671638779693|
+--------------------+--------------------+------------------+



In [22]:
getRecommendations('filtered100movies.csv').show()

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|               1.0|
|Star Wars: Episod...|Star Wars: Episod...|0.9560939845339547|
|      Platoon (1986)|   Goodfellas (1990)|0.9504490201227582|
|James and the Gia...| Citizen Kane (1941)|0.9476070829586856|
|      Rob Roy (1995)|Monty Python's Li...|0.9472110029417574|
|Raiders of the Lo...|Indiana Jones and...| 0.942606433250396|
| Pulp Fiction (1994)|      Platoon (1986)|0.9396928763131717|
|        Fargo (1996)|      Platoon (1986)|0.9374347353342417|
|Reservoir Dogs (1...|      Platoon (1986)|0.9364700594629586|
|Star Wars: Episod...|Star Wars: Episod...|0.9342673969280977|
+--------------------+--------------------+------------------+



In [23]:
getRecommendations('filtered150movies.csv').show()

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|               1.0|
|Star Wars: Episod...|Star Wars: Episod...|0.9560939845339547|
|      Platoon (1986)|   Goodfellas (1990)|0.9504490201227582|
|James and the Gia...| Citizen Kane (1941)|0.9476070829586856|
|      Rob Roy (1995)|Monty Python's Li...|0.9472110029417574|
| Citizen Kane (1941)|Conan the Barbari...|0.9435183924675248|
|      Rob Roy (1995)|American History ...|0.9434864536402258|
|Raiders of the Lo...|Indiana Jones and...| 0.942606433250396|
|Jungle Book, The ...|Conan the Barbari...|0.9417419115948374|
| Citizen Kane (1941)|L.A. Confidential...| 0.941241407642987|
+--------------------+--------------------+------------------+



In [24]:
getRecommendations('filtered200movies.csv').show()

+--------------------+--------------------+------------------+
|             Movie_1|             Movie_2|   Cosine Distance|
+--------------------+--------------------+------------------+
|      Rob Roy (1995)|Dazed and Confuse...|               1.0|
|Jungle Book, The ...|   Dick Tracy (1990)|0.9922778767136677|
|Dirty Dozen, The ...|   Robin Hood (1973)|0.9855258295520649|
|Jungle Book, The ...|       Dr. No (1962)|0.9832820049844603|
|Jungle Book, The ...|¡Three Amigos! (1...|0.9785497849867492|
|      Rob Roy (1995)|           Go (1999)|0.9782797401561579|
|  Wild Things (1998)|   Robin Hood (1973)| 0.977897823397447|
|      Rob Roy (1995)|Crocodile Dundee ...|0.9734503756241593|
|        Bambi (1942)|Dirty Dozen, The ...|0.9689627902499088|
|Jungle Book, The ...|           Go (1999)|0.9666666666666666|
+--------------------+--------------------+------------------+



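The values in the "Cosine Distance" column are cosine similarities (not distances), so their range is [-1,1]:
- -1 means that the pair of movies is rated in opposite ways: users who rate one positively rate the other negatively.
- 0 means that the ratings of the pair show no relation.
- 1 means that the pair of movies is rated in the same way by the users who rated both.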
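
One caveat when reading the tables: with this formula, a pair of movies that shares a single user always scores exactly 1 or -1, whatever the two ratings are (as long as neither is 0). The sketch below illustrates this and also returns the number of shared users, which could be used to ignore weakly supported pairs. `cosine_with_support` is a hypothetical helper that is not part of the notebook, and whether a single shared user explains the 1.0 scores above cannot be told from the outputs shown.

```python
import math

def cosine_with_support(co_ratings):
    """co_ratings: list of (rating_a, rating_b) from users who rated both movies.
    Returns (similarity, number_of_co_raters)."""
    dot = sum(a * b for a, b in co_ratings)
    sq_a = sum(a * a for a, _ in co_ratings)
    sq_b = sum(b * b for _, b in co_ratings)
    denom = math.sqrt(sq_a) * math.sqrt(sq_b)
    sim = dot / denom if denom else 0.0
    return sim, len(co_ratings)

same_sign = cosine_with_support([(2.0, 0.5)])       # one shared user, same sign
opposite_sign = cosine_with_support([(2.0, -0.5)])  # one shared user, opposite sign
print(same_sign, opposite_sign)
```

Carrying the count alongside the three sums would only need a fourth component in the mapped tuple, so it fits the same map/reduce structure.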

## Conclusions

The datasets provided contain at most 200 movies, and the only signal used is the users' opinion of them (the rating).
The users' opinions are what links the movies into pairs, and they drive the recommendation.

Relying only on users' opinions can sometimes give poor results, because two movies may be similar while this particular dataset does not relate them. In this case study, we believe it would be better to also use the genres of the movies as a second criterion for finding the similarity between them.

As the different results show, the recommendation algorithm based on users' opinions seems to work fairly well. For instance, movies from sagas such as Star Wars or Indiana Jones are highly similar to the other movies of the same saga. In addition, movies that share the same scope, actors or references are related.

In conclusion, the recommender we developed uses user feedback to recommend movies that receive similar opinions.

By Francesc Contreras & Albert Pérez