# Activity: Item-Based Collaborative Filtering

**Task**: Use item-based collaborative filtering for recommending movies to users.

## Approach
* Map input ratings to `(userID, (movieID, rating))`
* Find every movie pair rated by the same user.
   * This can be done with a "self-join" operation.
   * At this point each record is `(userID, ((movieID1, rating1), (movieID2, rating2)))`, one record per pair of ratings by the same user.
* Filter out duplicate pairs.
* Make the movie pairs the key.
   * Map to `((movieID1, movieID2), (rating1, rating2))`
* `groupByKey()` to get every rating pair found for each movie pair.
* Compute similarity between ratings for each movie in the pair.
* Sort, save, and display results.
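
The steps above can be tried on plain Scala collections before bringing in Spark. This is only a minimal sketch: the `toyRatings` data is made up, and a `for` comprehension and `groupBy` stand in for the RDD `join` and `groupByKey` used later in the notebook.

```scala
// Toy data (made up for illustration): (userID, (movieID, rating))
val toyRatings: Seq[(Int, (Int, Double))] = Seq(
  (1, (10, 4.0)), (1, (20, 5.0)),
  (2, (10, 2.0)), (2, (20, 3.0)), (2, (30, 1.0))
)

// Self-join on userID, keeping each unordered movie pair once (movie1 < movie2)
val toyPairs: Seq[((Int, Int), (Double, Double))] =
  for {
    (userA, (movie1, rating1)) <- toyRatings
    (userB, (movie2, rating2)) <- toyRatings
    if userA == userB && movie1 < movie2
  } yield ((movie1, movie2), (rating1, rating2))

// Collect every rating pair found for each movie pair
val toyGrouped: Map[(Int, Int), Seq[(Double, Double)]] =
  toyPairs.groupBy(_._1).map { case (pair, rows) => (pair, rows.map(_._2)) }
```

Here movies 10 and 20 were rated together by both users, so their entry in `toyGrouped` holds two rating pairs, while the other two movie pairs hold one each.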

In [1]:
spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.19:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1589209136071)
SparkSession available as 'spark'


res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6a402e0b


In [11]:
import scala.io.Source
import scala.io.Codec
import java.nio.charset.CodingErrorAction
import scala.math.sqrt

import scala.io.Source
import scala.io.Codec
import java.nio.charset.CodingErrorAction
import scala.math.sqrt


# User-defined Data types

To ease the implementation, we define a few custom type aliases.

In [6]:
type MovieRating = (Int, Double)
type UserRatingPair = (Int, (MovieRating, MovieRating))
type RatingPair = (Double, Double)
type RatingPairs = Iterable[RatingPair]

defined type alias MovieRating
defined type alias UserRatingPair
defined type alias RatingPair
defined type alias RatingPairs


# Helper functions

## `loadMovieNames`

Reads `u.item` and returns a map from movie ID to movie name.

In [12]:
def loadMovieNames(): Map[Int, String] ={
    //Handle character encoding issues
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
    
    // Create Map of Ints to Strings, and populate it from u.item
    var movieNames:Map[Int, String] = Map()
    val lines = Source.fromFile("../../ml-100k/u.item").getLines()
    for (line <- lines){
        val fields = line.split('|')
        if (fields.length > 1){
            movieNames += (fields(0).toInt -> fields(1))
        }
    }
    return movieNames
}

loadMovieNames: ()Map[Int,String]
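
To see what the per-line parsing does, here is a small sketch applied to a single line. The helper name `parseMovieLine` is hypothetical, and the sample line is written in the pipe-delimited `u.item` layout with the trailing columns left out.

```scala
// Hypothetical helper: parse one pipe-delimited line into (movieID, title).
// Returns None for lines with fewer than two fields, mirroring the length check above.
def parseMovieLine(line: String): Option[(Int, String)] = {
  val fields = line.split('|')
  if (fields.length > 1) Some(fields(0).toInt -> fields(1)) else None
}

// Sample line in the u.item layout; only the first three columns are shown
val parsed = parseMovieLine("50|Star Wars (1977)|01-Jan-1977")

// A blank line has a single empty field, so it is skipped
val skipped = parseMovieLine("")
```

Splitting on the character `'|'` rather than the string `"|"` matters here: the string overload of `split` treats its argument as a regular expression, in which `|` means alternation.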


## `makePairs`

Input: `UserRatingPair`

Output: Tuples of the form `((movie1,movie2), (rating1,rating2))`

In [4]:
def makePairs(userRatings: UserRatingPair) = {
    
    // Extract movieRating pairs
    val movieRating1 = userRatings._2._1
    val movieRating2 = userRatings._2._2
    
    // Extract the movie and rating elements
    val movie1 = movieRating1._1
    val rating1 = movieRating1._2
    
    val movie2 = movieRating2._1
    val rating2 = movieRating2._2
    
    ((movie1, movie2), (rating1, rating2))
}

makePairs: (userRatings: UserRatingPair)((Int, Int), (Double, Double))
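
The same re-keying can also be written with a single pattern match instead of chained `._1`/`._2` accessors. This is an equivalent sketch, not the notebook's code: `makePairsMatch` is a hypothetical name, and the sample ratings are made up.

```scala
// Type aliases repeated from the notebook so the snippet stands alone
type MovieRating = (Int, Double)
type UserRatingPair = (Int, (MovieRating, MovieRating))

// Hypothetical alternative to makePairs: destructure the nested tuple in one pattern
def makePairsMatch(userRatings: UserRatingPair): ((Int, Int), (Double, Double)) =
  userRatings match {
    case (_, ((movie1, rating1), (movie2, rating2))) =>
      ((movie1, movie2), (rating1, rating2))
  }

// User 7 rated movie 50 with 5.0 and movie 172 with 4.0 (made-up values)
val example: UserRatingPair = (7, ((50, 5.0), (172, 4.0)))
val keyed = makePairsMatch(example)
```

The user ID is dropped on purpose: from this point on, only the movie pair and its two ratings matter.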


## `filterDuplicates`

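The self-join emits every movie pair twice, once as `(A, B)` and once as `(B, A)`, and it also pairs each movie with itself. This predicate keeps exactly one copy of each pair.

Takes a joined pair of user ratings and returns whether `movie1 < movie2`. The comparison is on movie IDs, not on ratings.

Input: `UserRatingPair`

Output: `Boolean`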


In [5]:
def filterDuplicates(userRatings: UserRatingPair): Boolean = {
    val movieRating1 = userRatings._2._1
    val movieRating2 = userRatings._2._2
    
    val movie1 = movieRating1._1
    val movie2 = movieRating2._1
    
    return movie1 < movie2
}

filterDuplicates: (userRatings: UserRatingPair)Boolean
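
The effect of the `movie1 < movie2` test is easiest to see by counting. The sketch below uses made-up ratings for one user and plain collections instead of an RDD: a user with 3 ratings yields 9 joined pairs, of which 3 survive.

```scala
// Made-up ratings for a single user: (movieID, rating)
val oneUser: Seq[(Int, Double)] = Seq((10, 4.0), (20, 5.0), (30, 1.0))

// The self-join pairs every rating with every rating, including itself: 3 * 3 = 9
val allPairs = for (a <- oneUser; b <- oneUser) yield (a, b)

// Keeping only movie1 < movie2 drops self-pairs and mirrored pairs
val uniquePairs = allPairs.filter { case ((movie1, _), (movie2, _)) => movie1 < movie2 }

// The surviving movie ID pairs
val uniqueKeys = uniquePairs.map { case ((movie1, _), (movie2, _)) => (movie1, movie2) }
```

In general a user with n ratings contributes n * n joined records before the filter and n * (n - 1) / 2 after it.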


## `computeCosineSimilarity`

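To measure how similar two movies are, we compute the cosine similarity between their ratings, using only the users who rated both movies.

Input: `RatingPairs`, all rating pairs collected for one movie pair

Output: `(score, numPairs)`, the similarity score and the number of rating pairs it is based on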
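
Written out, with $(x_i, y_i)$ denoting the $i$-th rating pair for the two movies and $n$ the number of pairs (`numPairs`), the score computed by the function below is:

$$
\text{score} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}}
$$

The three sums correspond to `sum_xy`, `sum_xx`, and `sum_yy` in the code. If the denominator is 0, the score is defined as 0.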

In [10]:
def computeCosineSimilarity(ratingPairs: RatingPairs): (Double, Int)={
    //Initialize scores
    var numPairs: Int = 0
    var sum_xx: Double = 0.0
    var sum_yy:Double = 0.0
    var sum_xy:Double = 0.0
    
    // Iterate through all rating pairs
    for (pair <- ratingPairs){
        val ratingX = pair._1
        val ratingY = pair._2
        
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1
    }
    val numerator:Double = sum_xy
    val denominator = sqrt(sum_xx) * sqrt(sum_yy)
    
    var score:Double = 0.0
    if (denominator != 0){
        score = numerator/denominator
    }
    return (score, numPairs)
}

computeCosineSimilarity: (ratingPairs: RatingPairs)(Double, Int)
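
A small worked example helps to check the arithmetic. The snippet below is a compact restatement of the computation above so that it runs on its own; the name `cosineOf` is hypothetical and the ratings are made up.

```scala
import scala.math.sqrt

// Compact restatement of the cosine computation, not the notebook's function
def cosineOf(pairs: Iterable[(Double, Double)]): (Double, Int) = {
  val sumXX = pairs.map { case (x, _) => x * x }.sum
  val sumYY = pairs.map { case (_, y) => y * y }.sum
  val sumXY = pairs.map { case (x, y) => x * y }.sum
  val denominator = sqrt(sumXX) * sqrt(sumYY)
  val score = if (denominator != 0) sumXY / denominator else 0.0
  (score, pairs.size)
}

// Two users rated both movies: sumXX = 20, sumYY = 34, sumXY = 26
val (score, numPairs) = cosineOf(Seq((4.0, 5.0), (2.0, 3.0)))
// score = 26 / (sqrt(20) * sqrt(34)), roughly 0.997

// No co-ratings: the denominator is 0, so the score falls back to 0.0
val (emptyScore, emptyCount) = cosineOf(Seq.empty)
```

Because all ratings are positive, cosine scores on this kind of data tend to sit close to 1, which is likely why a threshold as high as 0.97 is used later on.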


# Implementation

Load movie names

In [13]:
val nameDict = loadMovieNames()

nameDict: Map[Int,String] = Map(645 -> Paris Is Burning (1990), 892 -> Flubber (1997), 69 -> Forrest Gump (1994), 1322 -> Metisse (Caf? au Lait) (1993), 1665 -> Brother's Kiss, A (1997), 1036 -> Drop Dead Fred (1991), 1586 -> Lashou shentan (1992), 1501 -> Prisoner of the Mountains (Kavkazsky Plennik) (1996), 809 -> Rising Sun (1993), 1337 -> Larger Than Life (1996), 1411 -> Barbarella (1968), 629 -> Victor/Victoria (1982), 1024 -> Mrs. Dalloway (1997), 1469 -> Tom and Huck (1995), 365 -> Powder (1995), 1369 -> Forbidden Christ, The (Cristo proibito, Il) (1950), 138 -> D3: The Mighty Ducks (1996), 1190 -> That Old Feeling (1997), 1168 -> Little Buddha (1993), 760 -> Screamers (1995), 101 -> Heavy Metal (1981), 1454 -> Angel and the Badman (1947), 1633 -> ? k?ldum klaka (Cold Fever) (199...

Load ratings data

In [14]:
val data = sc.textFile("../../ml-100k/u.data")

data: org.apache.spark.rdd.RDD[String] = ../../ml-100k/u.data MapPartitionsRDD[1] at textFile at <console>:32


Map ratings to key/value pairs: `userID => (movieID, rating)`

In [15]:
val ratings = {data.map(l => l.split("\t"))
                  .map(l => (l(0).toInt, (l(1).toInt, l(2).toDouble) ))
              }

ratings: org.apache.spark.rdd.RDD[(Int, (Int, Double))] = MapPartitionsRDD[3] at map at <console>:34
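
For a single input line, the two `map` steps amount to the following. This is a sketch on a plain string rather than an RDD, and the sample line is only assumed to follow the tab-separated `u.data` layout of user ID, movie ID, rating, and timestamp.

```scala
// Sample line in the u.data layout: userID, movieID, rating, timestamp
val sampleLine = "196\t242\t3\t881250949"

// Same two steps as the RDD code: split on tabs, then build (userID, (movieID, rating))
val fields = sampleLine.split("\t")
val record: (Int, (Int, Double)) =
  (fields(0).toInt, (fields(1).toInt, fields(2).toDouble))
```

The timestamp in the fourth column is not used by this activity, so it is simply dropped.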


Emit every pair of movies rated by the same user. A self-join on `userID` produces every combination.

In [16]:
val joinedRatings = ratings.join(ratings)

joinedRatings: org.apache.spark.rdd.RDD[(Int, ((Int, Double), (Int, Double)))] = MapPartitionsRDD[6] at join at <console>:33


At this point our RDD consists of `userID => ((movieID1, rating1), (movieID2, rating2))`. We can now filter out the duplicate pairs.

In [17]:
val uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)

uniqueJoinedRatings: org.apache.spark.rdd.RDD[(Int, ((Int, Double), (Int, Double)))] = MapPartitionsRDD[7] at filter at <console>:35


Now key each record by its `(movie1, movie2)` pair.

In [18]:
val moviePairs = uniqueJoinedRatings.map(makePairs)

moviePairs: org.apache.spark.rdd.RDD[((Int, Int), (Double, Double))] = MapPartitionsRDD[8] at map at <console>:35


Now we have one `(movie1, movie2) => (rating1, rating2)` record per user who rated both movies. We collect all rating pairs for each movie pair.

In [19]:
val moviePairRatings = moviePairs.groupByKey()

moviePairRatings: org.apache.spark.rdd.RDD[((Int, Int), Iterable[(Double, Double)])] = ShuffledRDD[9] at groupByKey at <console>:33


We compute the similarity for each movie pair and cache the result so that later queries do not recompute it.

In [20]:
val moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).cache()

moviePairSimilarities: org.apache.spark.rdd.RDD[((Int, Int), (Double, Int))] = MapPartitionsRDD[10] at mapValues at <console>:35


Save the results if desired.

In [21]:
//val sorted = moviePairSimilarities.sortByKey()
//sorted.saveAsTextFile("movie-sims")

Extract the pairs that include the movie we care about and that are "good", meaning they pass both a minimum similarity score and a minimum number of co-ratings.

In [23]:

val scoreThreshold = 0.97
val coOccurenceThreshold = 50.0

// Find movies that are similar to movieID = 50
val movieID:Int = 50

// Keep pairs that contain movieID and that are "good" as defined by
// the two quality thresholds above.
val filteredResults = moviePairSimilarities.filter(x=>
                      {
                          val pair=x._1
                          val sim=x._2
                          (pair._1 == movieID || pair._2 == movieID) && sim._1 > scoreThreshold && sim._2 > coOccurenceThreshold
                      })

// Sort by quality score
val results = filteredResults.map(x=>(x._2, x._1)).sortByKey(false).take(10)

println("\nTop 10 similar movies for " + nameDict(movieID))
for (result <- results) {
    val sim = result._1
    val pair = result._2
    // Display the similarity result that isn't the movie we're looking at
    var similarMovieID = pair._1
    if (similarMovieID == movieID) {
        similarMovieID = pair._2
    }
    println(nameDict(similarMovieID) + "\tscore: " + sim._1 + "\tstrength: " + sim._2)
}


Top 10 similar movies for Star Wars (1977)
Empire Strikes Back, The (1980)	score: 0.9895522078385338	strength: 345
Return of the Jedi (1983)	score: 0.9857230861253026	strength: 480
Raiders of the Lost Ark (1981)	score: 0.981760098872619	strength: 380
20,000 Leagues Under the Sea (1954)	score: 0.9789385605497993	strength: 68
12 Angry Men (1957)	score: 0.9776576120448436	strength: 109
Close Shave, A (1995)	score: 0.9775948291054827	strength: 92
African Queen, The (1951)	score: 0.9764692222674887	strength: 138
Sting, The (1973)	score: 0.9751512937740359	strength: 204
Wrong Trousers, The (1993)	score: 0.9748681355460885	strength: 103
Wallace & Gromit: The Best of Aardman Animation (1996)	score: 0.9741816128302572	strength: 58


scoreThreshold: Double = 0.97
coOccurenceThreshold: Double = 50.0
movieID: Int = 50
filteredResults: org.apache.spark.rdd.RDD[((Int, Int), (Double, Int))] = MapPartitionsRDD[11] at filter at <console>:43
results: Array[((Double, Int), (Int, Int))] = Array(((0.9895522078385338,345),(50,172)), ((0.9857230861253026,480),(50,181)), ((0.981760098872619,380),(50,174)), ((0.9789385605497993,68),(50,141)), ((0.9776576120448436,109),(50,178)), ((0.9775948291054827,92),(50,408)), ((0.9764692222674887,138),(50,498)), ((0.9751512937740359,204),(50,194)), ((0.9748681355460885,103),(50,169)), ((0.9741816128302572,58),(50,114)))
