# K-Nearest Neighbors to Predict Movie Ratings

## Data Source: [MovieLens 100K](https://grouplens.org/datasets/movielens/100k/)

In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head(2)


|   | user_id | movie_id | rating |
|---|---------|----------|--------|
| 0 | 0       | 50       | 5      |
| 1 | 0       | 172      | 5      |


In [2]:
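# Group by movie ID, then compute each movie's popularity (number of ratings) and average rating

import numpy as np
# String aggregation names avoid the deprecation warning newer pandas raises for np.size / np.mean
movieProperties = ratings.groupby('movie_id').agg({'rating': ['size', 'mean']})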
movieProperties.head(2)

| movie_id | rating (size) | rating (mean) |
|----------|---------------|---------------|
| 1        | 452           | 3.878319      |
| 2        | 131           | 3.206107      |
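
To make the aggregation concrete, here is a minimal sketch on a hand-made ratings table. The data and the names `toy_ratings` and `toy_props` are hypothetical, chosen only so the result can be checked by eye.

```python
import pandas as pd

# Hypothetical toy data: three ratings for movie 1, one rating for movie 2
toy_ratings = pd.DataFrame({
    'user_id':  [0, 1, 2, 0],
    'movie_id': [1, 1, 1, 2],
    'rating':   [5, 4, 3, 2],
})

# Per movie: how many ratings it received (popularity) and their mean
toy_props = toy_ratings.groupby('movie_id')['rating'].agg(['size', 'mean'])
print(toy_props)
```

Movie 1 ends up with a size of 3 and a mean of 4.0, while movie 2 has a single rating of 2.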


In [3]:
# Min-max normalize the number of ratings per movie to the range [0, 1]
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head(3)

| movie_id | size     |
|----------|----------|
| 1        | 0.773585 |
| 2        | 0.222985 |
| 3        | 0.152659 |
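
The normalization is ordinary min-max scaling: subtract the smallest count and divide by the range. The sketch below applies it to a small series. The counts 452 and 131 are the ones reported for movies 1 and 2; the endpoints 1 and 584 are assumptions, picked so that the toy result lines up with the normalized values shown.

```python
import pandas as pd

# Rating counts: 452 and 131 come from the notebook output;
# the minimum (1) and maximum (584) are assumed for illustration
counts = pd.Series(
    [1, 131, 452, 584],
    index=['least rated', 'movie 2', 'movie 1', 'most rated'],
)

# Min-max scaling: the least rated movie maps to 0, the most rated to 1
normalized = (counts - counts.min()) / (counts.max() - counts.min())
print(normalized)
```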


Now let's get the genre information from the u.item file. Each movie record carries 19 genre fields, one per genre: a value of '0' means the movie is not in that genre, and '1' means it is. A movie may belong to more than one genre.

While we're at it, we'll put everything together into one big Python dictionary called movieDict. Each entry holds the movie name, the array of genre values, the normalized popularity score, and the average rating for that movie:

In [4]:
movieDict = {}
# u.item is encoded as ISO-8859-1 (Latin-1), so the default UTF-8 decoding can fail
with open(r'ml-100k/u.item', encoding='ISO-8859-1') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # The 19 genre flags occupy fields 5 through 23
        genres = np.array(list(map(int, fields[5:24])))
        movieDict[movieID] = (name, genres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))


In [5]:
# Sanity check: movie 5 is Copycat (1995)
print(movieDict[5])

('Copycat (1995)', array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.1457975986277873, 3.302325581395349)
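
The 19-element array is easier to read once the flags are mapped back to genre names. The sketch below does that for the Copycat vector printed above. The genre order in `GENRE_NAMES` is an assumption based on the u.genre file that ships with the data set, and `genre_names` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Assumed genre order, following the u.genre file distributed with ml-100k
GENRE_NAMES = [
    'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy',
    'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
    'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western',
]

def genre_names(flags):
    """Return the names of the genres whose flag is 1."""
    return [name for name, flag in zip(GENRE_NAMES, flags) if flag == 1]

# Genre flags for Copycat (1995), copied from the output above
copycat_flags = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
print(genre_names(copycat_flags))
```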


In [6]:
# Compute the "distance" between two movies from their genre similarity and their popularity.
# The higher the distance, the lower the similarity.

from scipy import spatial

def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
    
ComputeDistance(movieDict[2], movieDict[4])

print(movieDict[2])
print(movieDict[4])

('GoldenEye (1995)', array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.22298456260720412, 3.2061068702290076)
('Get Shorty (1995)', array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.3567753001715266, 3.550239234449761)
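
As a worked example, the distance between these two movies can be recomputed by hand. This sketch uses plain NumPy in place of `scipy.spatial.distance.cosine`; `cosine_distance` is a hypothetical helper, and the genre vectors and popularity scores are copied from the output above.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

goldeneye = np.array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
get_shorty = np.array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Each movie has three genres and they share exactly one of them
genre_distance = cosine_distance(goldeneye, get_shorty)

# Absolute difference of the normalized popularity scores
popularity_distance = abs(0.22298456260720412 - 0.3567753001715266)

total = genre_distance + popularity_distance
print(genre_distance, popularity_distance, total)
```

With one shared genre out of three each, the cosine similarity is 1/3, so the genre distance is about 0.667. The popularity gap adds roughly 0.134, for a total of about 0.800.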


In [7]:
# Compute the distance between 'Copycat (1995)' (movie 5) and every other movie in the data set, then keep the K nearest
import operator

def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors

K = 20
avgRating = 0
neighbors = getNeighbors(5, K)
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))
    
avgRating /= K

Once Upon a Time in America (1984) 3.4
Desperate Measures (1998) 3.3333333333333335
Kiss the Girls (1997) 3.4615384615384617
Desperate Measures (1998) 3.2962962962962963
Kiss of Death (1995) 2.85
Amateur (1994) 3.1666666666666665
Guilty as Sin (1993) 2.1666666666666665
Juror, The (1996) 2.817073170731707
Dolores Claiborne (1994) 3.3417721518987342
City Hall (1996) 3.1392405063291138
Bound (1996) 3.8217054263565893
Diabolique (1996) 2.887323943661972
Extreme Measures (1996) 3.171875
Murder in the First (1995) 3.6
Kalifornia (1993) 3.2203389830508473
Last Supper, The (1995) 3.4482758620689653
Red Corner (1997) 3.3859649122807016
Carlito's Way (1993) 3.4074074074074074
Professional, The (1994) 3.704697986577181
Bonnie and Clyde (1967) 3.819672131147541


In [8]:
# The average rating of the K = 20 nearest neighbors
avgRating

3.271992445300609
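
The neighbor search and the averaging step can also be folded into one function. The sketch below is one possible arrangement, not the notebook's own code: `predict_rating`, `cosine_distance`, and `toy_movies` are hypothetical, and it uses `heapq.nsmallest` instead of sorting the full distance list. The toy catalogue follows the same (name, genres, popularity, mean rating) layout as movieDict.

```python
import heapq
import numpy as np

def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_rating(movie_id, movies, k):
    """Average the mean ratings of the k movies closest to movie_id."""
    target = movies[movie_id]

    def distance(other_id):
        other = movies[other_id]
        return cosine_distance(target[1], other[1]) + abs(target[2] - other[2])

    # Exclude the movie itself, then keep the k smallest distances
    candidates = (m for m in movies if m != movie_id)
    nearest = heapq.nsmallest(k, candidates, key=distance)
    return sum(movies[m][3] for m in nearest) / len(nearest)

# Hypothetical toy catalogue: (name, genres, popularity, mean rating)
toy_movies = {
    1: ('A', np.array([1, 0, 0]), 0.5, 4.0),
    2: ('B', np.array([1, 0, 0]), 0.6, 3.0),
    3: ('C', np.array([1, 1, 0]), 0.5, 5.0),
    4: ('D', np.array([0, 0, 1]), 0.1, 1.0),
}
print(predict_rating(1, toy_movies, k=2))
```

Dividing by `len(nearest)` rather than `k` keeps the average correct when fewer than `k` other movies are available.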

In [9]:
# Compare with the actual average rating: the prediction of about 3.27 is close to the actual 3.30
movieDict[5]

('Copycat (1995)',
 array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]),
 0.1457975986277873,
 3.302325581395349)