# Using K-Nearest Neighbors (kNN) to Predict the Average Rating of Similar Movies

## K-Nearest Neighbors (kNN) 

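**K-Nearest Neighbors (kNN)** is a supervised machine learning algorithm that can be used for classification and regression. To make a prediction for a new observation, it looks up the *k closest* training examples in the given feature space, based on some measure of distance (e.g. Euclidean, cosine, etc.). The simplest way to describe how the algorithm works is that it gives the observation the most common label (classification) or the average value (regression) of its most similar, or "nearest", neighbors. Here we use it in the second sense: we find the movies nearest to a given movie and average their ratings.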
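To make the idea concrete before touching the movie data, here is a minimal sketch of kNN classification in plain Python 3. The function name `knn_classify` and the toy points are made up for illustration and are not part of the notebook; the sketch assumes Euclidean distance and a simple majority vote.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
]

near_a = knn_classify(train, (1.1, 1.0), k=3)
near_b = knn_classify(train, (4.8, 5.1), k=3)
```

The rest of this notebook uses the same "find the nearest, then aggregate" pattern, except that it averages the neighbors' ratings instead of voting on a label.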

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import zscore
from scipy import spatial
import operator

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('/Users/czar.yobero/SparkScala/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()

,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


Now, we'll group everything by movie ID and compute the total number of ratings (i.e. each movie's popularity) and the mean rating for every movie.

In [2]:
# 'size' counts the ratings per movie; 'mean' averages them.
movie_properties = ratings.groupby('movie_id').agg({'rating': ['size', 'mean']})
movie_properties.head()

movie_id,size,mean
1,452,3.878319
2,131,3.206107
3,90,3.033333
4,209,3.550239
5,86,3.302326
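For readers who want to see what the grouping step computes without pandas, here is a rough equivalent in plain Python 3. The rows in `toy_ratings` are invented for illustration and do not come from the MovieLens file; the variable names are hypothetical as well.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating) rows, made up for this example.
toy_ratings = [
    (1, 10, 4),
    (2, 10, 2),
    (3, 10, 3),
    (1, 20, 5),
    (2, 20, 4),
]

# Collect every rating under its movie ID.
ratings_by_movie = defaultdict(list)
for _user_id, movie_id, rating in toy_ratings:
    ratings_by_movie[movie_id].append(rating)

# "size" is the number of ratings (popularity), "mean" is the average rating.
movie_stats = {
    movie_id: {"size": len(values), "mean": sum(values) / len(values)}
    for movie_id, values in ratings_by_movie.items()
}
```

This mirrors the two columns in the table above: one count and one mean per movie ID.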


The raw number of ratings is not very useful for computing distances between movies, so we will create a new data frame that contains the standardized (z-score) number of ratings. A value of 0 means the movie received an average number of ratings, a positive value means it is more popular than average, and a negative value means it is less popular than average. 

In [3]:
num_ratings = pd.DataFrame(movie_properties['rating']['size'])
num_ratings_norm = num_ratings.apply(lambda x: (x - np.mean(x)) / np.std(x))
num_ratings_norm.head()

movie_id,size
1,4.884858
2,0.890331
3,0.380127
4,1.860964
5,0.330351
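The lambda in cell [3] subtracts the mean and divides by the standard deviation, which is a z-score. Below is a small Python 3 sketch of the same arithmetic using only the standard library. It uses just the five counts printed for movies 1 to 5, so the resulting values will not match the table above, which is standardized over every movie in the data set. `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
import statistics

# Rating counts for movie IDs 1-5, copied from the movie_properties output.
counts = [452, 131, 90, 209, 86]

mean_count = statistics.mean(counts)
std_count = statistics.pstdev(counts)  # population std, like np.std's default

# z-score: how many standard deviations each count sits from the mean.
z_scores = [(c - mean_count) / std_count for c in counts]
```

After this transformation the values have mean 0 and standard deviation 1, so a movie with an average number of ratings lands near 0.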


Now, let's retrieve the genre information from the "u.item" file. There are 19 fields that each correspond to a specific genre (e.g. horror, comedy, drama, etc.). A value of 0 means the movie is not in that genre, and a value of 1 means that it is. A movie may have more than one genre associated with it (think "dramedies"). 

We will also put together everything into one big Python dictionary. Each entry will contain the movie name, list of genre values, the normalized (z-score) popularity score, and the average rating of each movie. 

In [4]:
movie_dict = {}
with open(r'/Users/czar.yobero/SparkScala/ml-100k/u.item') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movie_id = int(fields[0])
        name = fields[1]
        # Fields from index 5 onward are the 19 genre flags (0 or 1).
        # list(...) keeps this a real list on Python 3 as well.
        genres = list(map(int, fields[5:25]))
        # Entry layout: (name, genre flags, z-scored popularity, size/mean of ratings)
        movie_dict[movie_id] = (name, genres, num_ratings_norm.loc[movie_id].get('size'), 
                                movie_properties.loc[movie_id].rating)

In [5]:
movie_dict[11]

('Seven (Se7en) (1995)',
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 2.1969522760542399,
 size    236.000000
 mean      3.847458
 Name: 11, dtype: float64)
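To show what the parsing loop does to a single record, here is a Python 3 sketch on one pipe-delimited line. The line is a stand-in written for this example: it assumes the layout is five metadata fields (ID, title, release date, video release date, URL) followed by 19 genre flags, and the URL is a placeholder rather than the real value from "u.item".

```python
# Illustrative record in the assumed u.item layout; the URL is a placeholder.
sample_line = (
    "1|Toy Story (1995)|01-Jan-1995||http://example.com/placeholder|"
    "0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0\n"
)

fields = sample_line.rstrip("\n").split("|")
movie_id = int(fields[0])
name = fields[1]

# Slicing past the end of a list is harmless in Python, so [5:25] and [5:24]
# return the same 19 flags when the line has 24 fields.
genres = [int(flag) for flag in fields[5:25]]
```

Under that assumed layout, the three 1s in the flags line up with the *Toy Story* genre list printed later in the notebook.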

Now, let us define a function that computes the distance between two movies based on how similar their genres are and how similar their popularity is. To test, we will compute the distance between movie IDs 27 and 1. There are multiple ways to measure distance. Here the genre part is the Euclidean distance between the two genre vectors, the popularity part is the absolute difference between the two popularity scores, and the total distance is their sum. The Euclidean distance is given by 

$$
\begin{aligned}
\text{dist}(p,q) = \sqrt{(q_{1} - p_{1})^{2} + (q_{2} - p_{2})^{2} + \cdots + (q_{n} - p_{n})^{2}} = \sqrt{\sum_{i=1}^{n}{(q_{i}-p_{i})^{2}}}
\end{aligned}
$$

In [6]:
def computeDistance(a, b):
    genre_a = a[1]
    genre_b = b[1]
    genre_distance = spatial.distance.euclidean(genre_a, genre_b)
    popularity_a = a[2]
    popularity_b = b[2]
    popularity_distance = abs(popularity_a - popularity_b)
    
    return genre_distance + popularity_distance

computeDistance(movie_dict[27], movie_dict[1])

6.9153840441811365

Remember that the higher the distance, the less similar the movies are. Let's check what movies 1 and 27 actually are to confirm how similar they are.

In [7]:
print(movie_dict[1])
print(movie_dict[27])

('Toy Story (1995)', [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 4.8848584875558236, size    452.000000
mean      3.878319
Name: 1, dtype: float64)
('Bad Boys (1995)', [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], -0.030525556625312499, size    57.000000
mean     3.105263
Name: 27, dtype: float64)


This seems about right. *Toy Story* is an animated children's comedy with far more ratings than *Bad Boys*, which is a plain action movie, so I am willing to bet they are not that similar, hence the high distance score.
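The distance printed by cell [6] can be reproduced by hand from the two tuples shown above. This Python 3 sketch copies the genre flags and popularity scores from that output and repeats the arithmetic of `computeDistance` with the standard library instead of SciPy; the variable names are only for this example.

```python
import math

# Genre flags and popularity z-scores copied from the printed tuples.
toy_story_genres = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bad_boys_genres = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_story_popularity = 4.8848584875558236
bad_boys_popularity = -0.030525556625312499

# The vectors differ in four positions, so the squared differences sum to 4.
genre_distance = math.sqrt(
    sum((p - q) ** 2 for p, q in zip(toy_story_genres, bad_boys_genres))
)
popularity_distance = abs(toy_story_popularity - bad_boys_popularity)

total_distance = genre_distance + popularity_distance
```

The genre part contributes 2 and the popularity gap contributes roughly 4.9, so most of the distance between these two movies comes from how much more often *Toy Story* was rated.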

Now we just need to write a few more lines of code to compute the distance between a test movie, *Bad Boys* in this case (great movie deserving of an Oscar), and all of the other movies in the data set. We will then sort them by distance and print out the K-nearest neighbors.   

In [8]:
def getNeighbors(movie_id, K):
    distances = []
    for movie in movie_dict:
        if (movie != movie_id):
            distance = computeDistance(movie_dict[movie_id], movie_dict[movie])
            distances.append((movie, distance))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])    
    return neighbors
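As a possible variation, not what the notebook does, the sort-then-slice step can be replaced with `heapq.nsmallest`, which returns the k smallest items in ascending order and simply returns fewer items if K is larger than the number of candidates. The Python 3 sketch below is self-contained: `toy_movies`, `compute_distance`, and `get_neighbors` are hypothetical stand-ins that follow the same tuple layout and distance definition as above.

```python
import heapq
import math


def compute_distance(a, b):
    """Euclidean genre distance plus absolute popularity difference."""
    genre_distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a[1], b[1])))
    return genre_distance + abs(a[2] - b[2])


def get_neighbors(movies, movie_id, k):
    """Return the IDs of the k movies closest to movie_id, nearest first."""
    target = movies[movie_id]
    candidates = (other for other in movies if other != movie_id)
    return heapq.nsmallest(
        k, candidates, key=lambda other: compute_distance(target, movies[other])
    )


# Hypothetical entries: (name, genre flags, popularity z-score).
toy_movies = {
    1: ("Movie A", [1, 0, 0], 0.5),
    2: ("Movie B", [1, 0, 0], 0.4),
    3: ("Movie C", [0, 1, 0], 0.5),
    4: ("Movie D", [0, 1, 1], -1.0),
}

two_nearest = get_neighbors(toy_movies, 1, 2)
all_others = get_neighbors(toy_movies, 1, 10)
```

For a data set of this size the difference in speed is negligible; the main practical gain is that asking for more neighbors than exist does not raise an `IndexError`.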

Let's now get the nearest neighbors (i.e. movies most similar to one of my fave movies of all time, Bad Boys).

In [9]:
K = 10
mean_rating = 0
neighbors = getNeighbors(27, K)

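for neighbor in neighbors:
    # Element 3 of each entry holds the rating stats; 'mean' is the average rating.
    neighbor_mean = movie_dict[neighbor][3]['mean']
    mean_rating += neighbor_mean
    print(movie_dict[neighbor][0] + " " + str(neighbor_mean))
    
mean_rating /= float(K)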

Bulletproof (1996) 3.20408163265
Substitute, The (1996) 2.69387755102
Under Siege 2: Dark Territory (1995) 2.45833333333
Sudden Death (1995) 2.72340425532
Shadow, The (1994) 2.88888888889
Money Train (1995) 2.51162790698
Metro (1997) 2.91666666667
Terminal Velocity (1994) 2.67647058824
Drop Zone (1994) 2.54838709677
Judgment Night (1993) 2.68


Let's look at the mean rating of the ten nearest neighbors to *Bad Boys*, which the loop above already computed.

In [10]:
mean_rating

2.7301737919867741

Now, how does this compare to the mean rating of *Bad Boys*? (Which by the way should be a five out of five, but whatevs. #HatersGonHate)

In [11]:
movie_dict[27][3]['mean']

3.1052631578947367

The ratings are fairly similar. The ten nearest neighbors average about 2.73, which is within roughly 0.4 of the actual mean rating of *Bad Boys* (about 3.11), so both land at $\approx3$.
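The comparison can be double-checked from the numbers printed earlier. This Python 3 sketch copies the ten neighbor ratings from the output of cell [9] and the *Bad Boys* mean from cell [11]; since the printed ratings are rounded, the result agrees with the notebook only to about ten decimal places.

```python
# Mean ratings of the ten nearest neighbors, copied from the cell [9] output.
neighbor_ratings = [
    3.20408163265,  # Bulletproof (1996)
    2.69387755102,  # Substitute, The (1996)
    2.45833333333,  # Under Siege 2: Dark Territory (1995)
    2.72340425532,  # Sudden Death (1995)
    2.88888888889,  # Shadow, The (1994)
    2.51162790698,  # Money Train (1995)
    2.91666666667,  # Metro (1997)
    2.67647058824,  # Terminal Velocity (1994)
    2.54838709677,  # Drop Zone (1994)
    2.68,           # Judgment Night (1993)
]
bad_boys_mean = 3.1052631578947367  # from the cell [11] output

predicted_rating = sum(neighbor_ratings) / len(neighbor_ratings)
prediction_gap = bad_boys_mean - predicted_rating
```

Every one of the ten neighbors except *Bulletproof* is rated below *Bad Boys*, which is why the prediction comes in a little low.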