### How To Predict Movie Ratings With K-Nearest Neighbors

Let's try to predict the rating of a movie based on the K movies that are "nearest" to it in terms of genre and popularity (number of ratings).

To begin, let's load up our pandas dataframe with the MovieLens 100K data set.

In [111]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()


,user_id,movie_id,rating
0,0,50,5
1,0,172,5
2,0,133,1
3,196,242,3
4,186,302,3


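Next let's group by movie ID, then calculate the number of ratings ("size") and the average rating ("mean") for each movie.

In [112]:
import numpy as np

# String names select pandas' built-in aggregations; recent pandas versions
# warn when NumPy callables such as np.size or np.mean are passed here.
movieProperties = ratings.groupby('movie_id').agg({'rating': ['size', 'mean']})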
movieProperties.head()

,rating,rating
movie_id,size,mean
1,452,3.878319
2,131,3.206107
3,90,3.033333
4,209,3.550239
5,86,3.302326
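The two-level column header above can be awkward to index. As a minimal sketch on a made-up toy table (not MovieLens data), named aggregation is one way to get flat column names; the names `toy`, `toy_props`, `num_ratings`, and `avg_rating` are illustrative assumptions, and the rest of this walkthrough keeps using `movieProperties` as defined above.

```python
import pandas as pd

# Toy ratings table with made-up values (not MovieLens data)
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 2, 2],
    'rating':   [5, 4, 3, 2, 4],
})

# Named aggregation: each keyword becomes a flat output column
toy_props = toy.groupby('movie_id').agg(
    num_ratings=('rating', 'count'),  # same as the group size when no ratings are missing
    avg_rating=('rating', 'mean'),
)
print(toy_props)
```

With this layout a value is read as `toy_props.loc[1, 'avg_rating']` rather than through two header levels.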


Let's now create a dataframe that normalizes each "number of ratings" value to a value between 0 (the lowest number of ratings) and 1 (the highest number of ratings).

In [113]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()

movie_id,size
1,0.773585
2,0.222985
3,0.152659
4,0.356775
5,0.145798
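To make the min-max formula concrete, here is a small sketch on made-up counts (the values and the names `counts` and `normalized` are assumptions for illustration only). The smallest count maps to 0, the largest to 1, and everything else falls proportionally in between.

```python
import pandas as pd

# Made-up rating counts for four hypothetical movies
counts = pd.Series([10, 20, 50, 110], index=[1, 2, 3, 4], name='size')

# Min-max normalization: (x - min) / (max - min)
normalized = (counts - counts.min()) / (counts.max() - counts.min())
print(normalized)
```

Here the range is 110 - 10 = 100, so a count of 50 becomes (50 - 10) / 100 = 0.4.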


Next let's create a Python dictionary with each item containing:

- Movie name
- List of genre values (one for each of the 19 possible genres; 0 = not in genre, 1 = in genre)
- Normalized number of ratings
- Average rating

In [114]:
movieDict = {}
# u.item is pipe-delimited and ISO-8859-1 encoded, so name the encoding explicitly
with open(r'ml-100k/u.item', encoding='ISO-8859-1') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # Fields 5 through 23 hold the 19 genre flags
        genres = np.array([int(g) for g in fields[5:24]])
        popularity = movieNormalizedNumRatings.loc[movieID].get('size')
        meanRating = movieProperties.loc[movieID].rating.get('mean')
        movieDict[movieID] = (name, genres, popularity, meanRating)


Let's take a look at the dictionary item for "Toy Story" (movie ID = 1):

In [115]:
movieDict[1]

('Toy Story (1995)',
 array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 0.7735849056603774,
 3.8783185840707963)

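Now we'll define "ComputeDistance" to see how close two movies are in terms of genre and number of ratings. It takes the cosine distance between the two genre vectors and the absolute difference between the two normalized rating counts, then combines them as the square root of the sum of their squares.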

In [116]:
from scipy import spatial
import math

def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    overallDistance = math.sqrt((genreDistance ** 2) + (popularityDistance ** 2))
    return overallDistance
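To see what the function computes, here is a hand-worked sketch on two made-up movies. The helper `cosine_distance` is a hypothetical stand-in written out in plain Python; it follows the usual definition (1 minus the dot product divided by the product of the norms) that `spatial.distance.cosine` also uses. Note that this sketch divides by zero if a genre vector is all zeros.

```python
import math

def cosine_distance(u, v):
    """1 - (u . v) / (|u| * |v|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

# Two made-up movies: 4 genre flags each, plus a normalized popularity
genres_a, popularity_a = [0, 1, 1, 0], 0.8
genres_b, popularity_b = [0, 1, 0, 1], 0.5

genre_distance = cosine_distance(genres_a, genres_b)     # one shared genre out of two each: about 0.5
popularity_distance = abs(popularity_a - popularity_b)   # about 0.3
overall = math.sqrt(genre_distance ** 2 + popularity_distance ** 2)
print(round(overall, 4))  # sqrt(0.25 + 0.09), about 0.5831
```

Because both components are given equal weight, a large gap in popularity can outweigh a perfect genre match, which is a design choice worth keeping in mind when reading the neighbor lists.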

Let's test our "ComputeDistance" function with two movies that aren't very similar: "GoldenEye" (movie ID 2) and "Get Shorty" (movie ID 4).

In [117]:
ComputeDistance(movieDict[2], movieDict[4])

0.6799591207583364

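Both components lie between 0 and 1 here (the genre flags are never negative and the popularity values are normalized), so the combined distance can range from 0 to about 1.41. A score of about 0.68 is plausible for two movies that aren't very alike, so now let's define a function that gets the "K Nearest Neighbors" based on this "Distance" score and test it with "Toy Story" (movie ID = 1).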

In [118]:
import operator

def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors
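Sorting every distance works, but only the K smallest are needed. As a rough sketch of an alternative, the standard library's `heapq.nsmallest` can pick them directly. The function `get_neighbors_sketch`, its `items` and `distance_fn` parameters, and the toy data are all assumptions made so the example runs on its own; it is not a replacement for the function above.

```python
import heapq

def get_neighbors_sketch(target_id, k, items, distance_fn):
    """Return the ids of the k items closest to items[target_id]."""
    candidates = (
        (distance_fn(items[target_id], items[other]), other)
        for other in items
        if other != target_id
    )
    # nsmallest compares (distance, id) tuples, so ties fall back to the lower id
    return [other for _, other in heapq.nsmallest(k, candidates)]

# Toy data: each "movie" is a single made-up number
toy_items = {1: 0.0, 2: 0.9, 3: 0.1, 4: 0.4}

def toy_distance(a, b):
    return abs(a - b)

print(get_neighbors_sketch(1, 2, toy_items, toy_distance))  # [3, 4]
```

One difference from the sort-based version: equal distances are ordered by movie ID here rather than by dictionary order.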

Now let's use this function to get the average rating for the "5 Nearest Neighbors".

In [119]:
K = 5

print(str(K) + " Nearest Neighbors: ")
print(" ")

sumRatings = 0
neighbors = getNeighbors(1, K)
for neighbor in neighbors:
    sumRatings += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))

avgRating = sumRatings / K
print(" ")
print("Average Rating of the " + str(K) + " Nearest Neighbors: " + str(avgRating))

5 Nearest Neighbors: 
 
Willy Wonka and the Chocolate Factory (1971) 3.6319018404907975
Aladdin (1992) 3.8127853881278537
Liar Liar (1997) 3.156701030927835
Monty Python and the Holy Grail (1974) 4.0664556962025316
Full Monty, The (1997) 3.926984126984127
 
Average Rating of the 5 Nearest Neighbors: 3.7189656165466287


How close did we get to "Toy Story"'s actual rating?

In [120]:
print (movieDict[1][0] + " " + str(movieDict[1][3]))

Toy Story (1995) 3.8783185840707963


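Pretty close! The neighbor average of about 3.72 is within roughly 0.16 of Toy Story's actual average rating of about 3.88.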
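The result depends on the choice of K, so it can be worth wrapping the prediction in a function and trying a few values. The sketch below does that on made-up data; `predict_rating`, `toy_items`, and `toy_distance` are hypothetical names, and each toy item is a `(feature, average rating)` pair rather than the four-element tuples stored in `movieDict`.

```python
def predict_rating(target_id, k, items, distance_fn):
    """Average the ratings of the k items nearest to items[target_id]."""
    others = [m for m in items if m != target_id]
    others.sort(key=lambda m: distance_fn(items[target_id], items[m]))
    nearest = others[:k]
    return sum(items[m][1] for m in nearest) / len(nearest)

# Toy data: movie id -> (made-up feature, made-up average rating)
toy_items = {
    1: (0.0, 4.0),
    2: (0.1, 3.5),
    3: (0.2, 4.5),
    4: (0.9, 2.0),
}

def toy_distance(a, b):
    return abs(a[0] - b[0])

for k in (1, 2, 3):
    print(k, predict_rating(1, k, toy_items, toy_distance))
```

In this toy set the prediction for movie 1 moves from 3.5 to 4.0 and then drops once the distant, low-rated movie 4 is included, which illustrates how a larger K can pull in less relevant neighbors.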