# Collaborative Filtering

We use collaborative filtering to measure how similar movies are to each other, and use those similarities to make suggestions.

New methods used here:

- ``mapValues``: passes each value in a key-value pair RDD through a map function without changing the keys.
- ``cache``: persists the RDD with the default storage level.
- ``take``: returns the first *n* elements of the RDD.
- the ``local[*]`` argument to ``setMaster``: runs Spark locally with as many worker threads as there are logical cores on the machine, so more than one core of the PC is used.
- ``saveAsTextFile``: saves the RDD as text in a directory with the given name (here, relative to the current folder). It writes one part file per partition.

We also chain several ``map`` calls together.

We are going to improve upon the previous work by performing several steps:

- throwing out low ratings (below 2)
- using different similarity measures
- adding genre to the solution
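
For reference, the two similarity measures computed in the code below can be written out as follows. The notation is introduced here only for illustration: $x_i$ and $y_i$ are the ratings that user $i$ gave to the two movies of a pair, and $n$ is the number of users who rated both.

```latex
\[
\mathrm{cosine}(x, y) =
  \frac{\sum_i x_i y_i}
       {\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}
\]

\[
\mathrm{pearson}(x, y) =
  \frac{\sum_i x_i y_i - \frac{1}{n}\sum_i x_i \sum_i y_i}
       {\sqrt{\left(\sum_i x_i^2 - \frac{1}{n}\bigl(\sum_i x_i\bigr)^2\right)
              \left(\sum_i y_i^2 - \frac{1}{n}\bigl(\sum_i y_i\bigr)^2\right)}}
\]
```

The Pearson form is the cosine form applied after subtracting each movie's mean rating over the co-rating users. In the code, both functions return a score of 0 when the denominator is 0.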

In [1]:
import findspark
findspark.init()

import sys
from pyspark import SparkConf, SparkContext
from math import sqrt

conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf = conf)

In [4]:
def loadMovieNames():
    movieNames = {}
    with open("c:/SparkCourse/ml-100k/u.ITEM", encoding='ascii', errors='ignore') as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

#Python 3 doesn't let you pass around unpacked tuples,
#so we explicitly extract the ratings now.
def makePairs( userRatings ):
    ratings = userRatings[1]
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return ((movie1, movie2), (rating1, rating2))

def filterDuplicates( userRatings ): # handy pattern for removing duplicate pairs in Spark
    ratings = userRatings[1]
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    # The self-join yields each pair in both orders; keep only the one
    # whose first movie ID is smaller. This also drops pairs of a movie with itself.
    return movie1 < movie2

def filterLowRatings(movieinfo):
    # movieinfo is (userID, (movieID, rating)); keep only ratings of 2 or higher
    rating = float(movieinfo[1][1])
    return rating >= 2

def computeCosineSimilarity(ratingPairs):
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1

    numerator = sum_xy
    denominator = sqrt(sum_xx) * sqrt(sum_yy)

    score = 0
    if (denominator):
        score = (numerator / (float(denominator)))

    return (score, numPairs)



def computePearsonSimilarity(ratingPairs):
    numPairs = 0
    sum_xx = sum_yy = sum_xy = sum_x = sum_y =  0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        sum_x += ratingX
        sum_y += ratingY
        numPairs += 1

    numerator = sum_xy - sum_x * sum_y / numPairs
    denominator = sqrt((sum_xx - (sum_x)**2/numPairs) * ((sum_yy - (sum_y)**2/numPairs)))

    score = 0
    if (denominator):
        score = (numerator / (float(denominator)))

    return (score, numPairs)


print("\nLoading movie names...")
nameDict = loadMovieNames()

data = sc.textFile("file:///SparkCourse/ml-100k/u.data")

# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split()).map(lambda l: (int(l[0]), (int(l[1]), float(l[2])))).filter(filterLowRatings)

# Emit every movie rated together by the same user.
# Self-join to find every combination.
joinedRatings = ratings.join(ratings)
print("\nDataset is self-joined...")
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))

# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
print("\nDuplicates are filtered...")

# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs)
print("\nMovie data is anonymized, no userid...")

# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()

# We now have (movie1, movie2) => (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computePearsonSimilarity).cache()

# Save the results if desired (sortByKey returns a new RDD, so chain the calls)
#moviePairSimilarities.sortByKey().saveAsTextFile("movie-sims")



Loading movie names...

Dataset is self-joined...

Duplicates are filtered...

Movie data is anonymized, no userid...
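
The pipeline in the cell above (self-join, duplicate filter, re-keying by movie pair, grouping) can be hard to picture on the full dataset. The following is a minimal plain-Python sketch of the same steps on a hypothetical five-rating dataset. It does not use Spark, and the names `by_user`, `joined`, `unique` and `pair_ratings` are made up for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy data in the same shape as the `ratings` RDD:
# (userID, (movieID, rating))
ratings = [
    (1, (10, 4.0)), (1, (20, 5.0)), (1, (30, 3.0)),
    (2, (10, 2.0)), (2, (20, 4.0)),
]

# Collect each user's ratings so the self-join on userID can be emulated.
by_user = defaultdict(list)
for user, movie_rating in ratings:
    by_user[user].append(movie_rating)

# Self-join: every ordered pair of ratings by the same user. This includes
# a movie paired with itself and each real pair in both orders.
joined = [(user, (a, b))
          for user, rs in by_user.items()
          for a, b in product(rs, repeat=2)]

# Duplicate filter: keep a pair only when the first movie ID is smaller.
unique = [(user, (a, b)) for user, (a, b) in joined if a[0] < b[0]]

# Re-key by (movie1, movie2), drop the user ID, and group the rating pairs.
pair_ratings = defaultdict(list)
for _user, ((m1, r1), (m2, r2)) in unique:
    pair_ratings[(m1, m2)].append((r1, r2))

print(len(joined), len(unique))  # 13 joined rows shrink to 4 unique ones
print(dict(pair_ratings))
```

In this toy run the self-join produces 13 rows, of which only 4 survive the filter. That is why filtering before the grouping step matters for the size of the shuffle.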


In [5]:

# Extract similarities for the movie we care about that are "good".
scoreThreshold = 0.38
coOccurenceThreshold = 50

movieID = 483 # Casablanca

# Keep only pairs that include this movie and are "good" as defined by
# our quality thresholds above (all conditions are checked in a single filter)
filteredResults = moviePairSimilarities.filter(lambda pairSim: \
    (pairSim[0][0] == movieID or pairSim[0][1] == movieID) \
    and pairSim[1][0] > scoreThreshold and pairSim[1][1] > coOccurenceThreshold)

# Sort by quality score.
results = filteredResults.map(lambda pairSim: (pairSim[1], pairSim[0])).sortByKey(ascending = False).take(10)

print("Top 10 similar movies for " + nameDict[movieID])
for result in results:
    (sim, pair) = result
    # Display the similarity result that isn't the movie we're looking at
    similarMovieID = pair[0]
    if (similarMovieID == movieID):
        similarMovieID = pair[1]
    print(nameDict[similarMovieID] + "\tscore: " + str(sim[0]) + "\tstrength: " + str(sim[1]))

Top 10 similar movies for Casablanca (1942)
Third Man, The (1949)	score: 0.6036773350335661	strength: 52
Maltese Falcon, The (1941)	score: 0.5029606006150984	strength: 111
Shine (1996)	score: 0.4770054774358142	strength: 68
Bob Roberts (1992)	score: 0.473589560952675	strength: 57
It Happened One Night (1934)	score: 0.4577801353022157	strength: 60
African Queen, The (1951)	score: 0.4469809047597789	strength: 113
My Left Foot (1989)	score: 0.42873131674438764	strength: 67
Chinatown (1974)	score: 0.4236842501725612	strength: 105
Roman Holiday (1953)	score: 0.4121908459625984	strength: 51
Manchurian Candidate, The (1962)	score: 0.41196637795707625	strength: 90
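
To make the filtering and ranking logic of the cell above easier to follow, here is a plain-Python sketch on hypothetical similarity records. It assumes the same record shape as `moviePairSimilarities`, that is `((movie1, movie2), (score, numPairs))`. The movie IDs other than 483 and all scores are invented for illustration.

```python
# Hypothetical records: ((movie1, movie2), (score, numPairs))
pair_sims = [
    ((483, 500), (0.60, 52)),
    ((100, 483), (0.50, 111)),
    ((483, 600), (0.90, 10)),   # high score, but too few co-ratings
    ((483, 700), (0.20, 200)),  # many co-ratings, but low score
    ((100, 200), (0.95, 300)),  # does not involve the target movie
]

movie_id = 483
score_threshold = 0.38
co_occurrence_threshold = 50


def is_good(pair_sim):
    """Apply all three conditions at once, as the Spark filter does."""
    (m1, m2), (score, num_pairs) = pair_sim
    return (movie_id in (m1, m2)
            and score > score_threshold
            and num_pairs > co_occurrence_threshold)


# Swap to (sim, pair) so that sorting compares the (score, numPairs) tuple
# first, then take the ten largest.
top = sorted(((sim, pair) for pair, sim in filter(is_good, pair_sims)),
             reverse=True)[:10]

# Report the member of each pair that is not the movie we asked about.
others = [m1 if m1 != movie_id else m2 for _sim, (m1, m2) in top]
print(others)  # [500, 100]
```

Because the key is the whole `(score, numPairs)` tuple, ties on score are broken by the number of co-ratings.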


Based on the cosine score, here are the results:

Top 10 similar movies for Casablanca (1942)
- Third Man, The (1949)	score: 0.9914873737018578	strength: 52
- Maltese Falcon, The (1941)	score: 0.9884013076509861	strength: 111
- African Queen, The (1951)	score: 0.9865259305014779	strength: 113
- Manchurian Candidate, The (1962)	score: 0.984613301863293	strength: 90
- It Happened One Night (1934)	score: 0.9845081807112164	strength: 60
- Vertigo (1958)	score: 0.9832931334707176	strength: 127
- Citizen Kane (1941)	score: 0.9830936340463006	strength: 143
- Silence of the Lambs, The (1991)	score: 0.9829257762211996	strength: 185
- Treasure of the Sierra Madre, The (1948)	score: 0.9828749443758958	strength: 61
- Dial M for Murder (1954)	score: 0.9828585785349522	strength: 59

Personally, I find the results from the Pearson correlation more relevant.
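
One likely reason the cosine scores above all sit close to 1 is that every rating is positive, so any two rating vectors point in roughly the same direction. Pearson correlation subtracts each movie's mean first. The sketch below shows the effect on a hypothetical set of four rating pairs where tastes are perfectly opposed. The helpers `cosine` and `pearson` are simplified stand-ins for `computeCosineSimilarity` and `computePearsonSimilarity` that return only the score.

```python
from math import sqrt


def cosine(pairs):
    """Cosine similarity of the raw (uncentred) rating vectors."""
    sum_xx = sum(x * x for x, _ in pairs)
    sum_yy = sum(y * y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    return sum_xy / denominator if denominator else 0.0


def pearson(pairs):
    """Pearson correlation, using the same sums as the notebook code."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xx = sum(x * x for x, _ in pairs)
    sum_yy = sum(y * y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator if denominator else 0.0


# Hypothetical ratings: whoever rates the first movie high rates the second low.
opposed = [(5.0, 2.0), (4.0, 3.0), (3.0, 4.0), (2.0, 5.0)]

print(round(cosine(opposed), 4))   # 0.8148, still looks "similar"
print(round(pearson(opposed), 4))  # -1.0, perfectly anti-correlated
```

On this toy input the cosine score stays high while Pearson reports a perfect negative correlation, which is consistent with the cosine list above being much less selective.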