In [9]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Movie Recommendations from k-Nearest Neighbors

We're going to create a movie recommendation system that will recommend a movie similar to an input movie.  For example, suppose a person just finished watching the horror movie "Saw."  We're going to assume they liked it, and recommend another movie like "Saw" by using only user ratings (we'll consider the user's actual preferences and previous movie ratings in a later project).

Start by making a numpy matrix where the row is the user's ID and the column is the movie's ID, and the contents are the rating entered by that user and stored in your dataset's `ratings.csv`.

Most users have not seen most movies, so there are many blanks.  Decide what to fill them in with, and explain in a markdown cell what you chose.  You may use generative AI in this portion.

In [10]:
# filepath to the datasets for this lab
filepath = "datasets/ml-latest-small/"

# read csv
df = pd.read_csv(filepath + "ratings.csv")

# create dataframe with movieId as rows and userId as columns (using pivot function from ChatGPT)
ratings = df.pivot(index='movieId', columns='userId', values='rating').fillna(0)

I decided to use `0` when filling in the blanks.
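Filling with `0` treats "not rated" the same as a rating below the lowest real score, which is worth keeping in mind when comparing fill strategies later. As a rough illustration only, here is a sketch of two fills on a made-up toy table; the names `toy`, `wide`, `zero_filled`, and `mean_centered` are hypothetical and the ratings are invented, not taken from the dataset.

```python
import pandas as pd

# toy long-format ratings with the same columns as ratings.csv (values are made up)
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 2.0, 5.0, 3.0, 1.0],
})

# movies as rows, users as columns; unrated cells are NaN
wide = toy.pivot(index="movieId", columns="userId", values="rating")

# option 1: treat "unrated" as 0
zero_filled = wide.fillna(0)

# option 2: subtract each movie's mean rating, then fill blanks with 0,
# so "unrated" ends up meaning "average for this movie"
mean_centered = wide.sub(wide.mean(axis=1), axis=0).fillna(0)

print(zero_filled)
print(mean_centered)
```

With the second option, a blank sits at the movie's own average rather than below every real rating, which may change which movies look close to each other.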

Now we need to read in `movies.csv`, so we can map movie IDs to titles.  Using your data structures knowledge, decide how to store this information so your lookup is quick.  Be aware that there are movieIDs missing (for example, there is no movie 33).  You may **not** use generative AI for this portion or anywhere later on this assignment until further notice.
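One common structure for this kind of lookup is a hash map keyed by ID, since gaps in the IDs do not matter and lookups take constant time on average. The sketch below is only an illustration on made-up data; `toy_movies` and `title_by_id` are hypothetical names, and it is not necessarily the approach taken in the cell that follows.

```python
import pandas as pd

# toy stand-in for movies.csv with a deliberate gap in the IDs (titles are made up)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 4],
    "title": ["Movie A", "Movie B", "Movie D"],
})

# dict keyed by movieId: average O(1) lookup, and missing IDs simply are not keys
title_by_id = dict(zip(toy_movies["movieId"], toy_movies["title"]))

print(title_by_id[4])
print(title_by_id.get(3, "unknown movieId"))
```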

In [11]:
# read in movies.csv to map movieIds to movie names
movies = pd.read_csv(filepath + "movies.csv").drop(columns=['genres'])

# create list of all movieIds that have at least one rating
withRatings = df['movieId'].unique()

# loop through all movieIds in movies.csv, and for movies with no ratings, add an all-zero row to the 'ratings' dataframe (row positions would be thrown off if they were missing)
for i in movies['movieId']:
    if i not in withRatings:
        ratings.loc[i] = np.zeros(ratings.shape[1], dtype=int)

# sort rows of dataframe by movieId
ratings = ratings.sort_index()

Naturally, [sklearn has a handy object for nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html).  Write a function which accepts a distance metric as an argument and builds a NearestNeighbors object, runs it to calculate the 15 movies most similar to any given movie using that metric, and returns that object.

In [12]:
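def buildNN(distMetric):
    # ask for 16 neighbors: the query movie is part of the fitted data and normally comes back as its own nearest neighbor, leaving 15 others
    return NearestNeighbors(n_neighbors=16, metric=distMetric)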
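For reference, here is a minimal sketch of a builder that also fits the object before returning it, shown on a made-up matrix. The function `build_fitted_nn`, its `matrix` and `k` parameters, and the toy values are assumptions for illustration; only `NearestNeighbors`, `fit`, and `kneighbors` come from scikit-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_fitted_nn(matrix, dist_metric, k=15):
    # k + 1 neighbors because the query row is itself in the fitted data
    nn = NearestNeighbors(n_neighbors=k + 1, metric=dist_metric)
    return nn.fit(matrix)  # fit() returns the estimator itself

# toy movie-by-user matrix (4 movies, 3 users); values are made up
toy = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 5.0, 0.0],
    [0.0, 0.0, 4.0],
    [0.0, 1.0, 5.0],
])

nn = build_fitted_nn(toy, "euclidean", k=2)
dist, idx = nn.kneighbors(toy[[0]])
print(idx[0])   # row positions, nearest first
print(dist[0])  # matching distances
```

Fitting inside the builder means each metric is fitted once, and the returned object can then answer any number of queries.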

Make a function that accepts a movieID and NearestNeighbors object and returns the names of the 15 movies most similar to that movieID.

In [25]:
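def findNN(movieID, nn):
    nn.fit(ratings)
    # row positions of the nearest movies; the query movie itself normally comes first (distance 0)
    idx = nn.kneighbors([ratings.loc[movieID]], return_distance=False)[0]
    # row positions in 'ratings' match row positions in 'movies'; [1:] drops the query movie
    return movies.iloc[idx][['title']].reset_index(drop=True)[1:]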

Horror movies usually do very well with this approach.  Try it on Saw using euclidean distance, and make sure the answers make sense.

In [26]:
# 8957 is the movieId for 'Saw'
findNN(8957, buildNN('euclidean'))

Unnamed: 0,title
1,Phone Booth (2002)
2,Saw II (2005)
3,Final Destination 3 (2006)
4,Saw III (2006)
5,Resident Evil: Afterlife (2010)
6,Hostel (2005)
7,Repo Men (2010)
8,Hostage (2005)
9,Freddy vs. Jason (2003)
10,Firewall (2006)


Now, for analysis.  Choose five movies of different genres, and try your approach.  Run it with 1) **different distance metrics** and 2) **different techniques for filling in blank spaces**.  Qualitatively explain which recommendations make the most sense to you, and which combination of filled-in-blanks and distance metrics you would actually deploy.
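To make the effect of the distance metric concrete, the sketch below compares a few metrics on a made-up, zero-filled matrix. The helper `nearest` and the toy values are hypothetical and chosen only so that the metrics disagree; real results depend on the dataset and the fill strategy.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy movie-by-user matrix, zero-filled (values are made up)
toy = np.array([
    [4.0, 4.0, 0.0, 0.0],  # query movie
    [2.0, 1.0, 0.0, 0.0],  # same two raters, lower scores
    [4.0, 3.0, 2.0, 0.0],  # similar scores plus one extra rater
    [0.0, 0.0, 5.0, 5.0],  # disjoint audience
])

def nearest(matrix, row, metric, k=1):
    # fit on the whole matrix, query with one row, then drop the row itself
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(matrix)
    idx = nn.kneighbors(matrix[[row]], return_distance=False)[0]
    return [int(i) for i in idx if i != row][:k]

results = {m: nearest(toy, 0, m) for m in ("euclidean", "manhattan", "cosine")}
print(results)
```

On this toy data, euclidean and manhattan pick row 2 (closest in absolute rating values), while cosine picks row 1 (closest in direction, since it ignores overall magnitude).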

In [33]:
# 44 is the movieId for 'Mortal Kombat (1995)'
findNN(44, buildNN('euclidean'))

Unnamed: 0,title
1,Super Mario Bros. (1993)
2,"One, The (2001)"
3,Cradle 2 the Grave (2003)
4,Vampires (1998)
5,Mortal Kombat: Annihilation (1997)
6,"Jerky Boys, The (1995)"
7,Batman: Under the Red Hood (2010)
8,Batman Beyond: Return of the Joker (2000)
9,"Fast and the Furious: Tokyo Drift, The (Fast a..."
10,Bride of Chucky (Child's Play 4) (1998)


**Perfect completion** to this point is a 90.  For the A, now copy your above code, and alter it to do **(at least)** **one of two things.**  One is to use the genres and tags in some way to try and meaningfully improve your recommendations. The other is to scale up to the [25M Dataset under the "recommended for new research" heading](https://grouplens.org/datasets/movielens/).  You will need some time for this to run.  Out of respect for your own time, you will want to run `.fit()` as few times as possible.  **You'll need to do this on `ssh.cs.usna.edu`; your lab machines won't have enough memory**.

In either case, qualitatively compare your results to your previous product.
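For the scale-up option, one possible way to keep memory and run time down is to store the ratings in a sparse matrix and reuse a single fitted object for every query. The sketch below shows the idea on made-up data; the variable names and values are hypothetical, and it assumes a zero fill, since a sparse matrix stores unobserved cells as implicit zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# toy (movie position, user position, rating) triples; values are made up
movie_pos = np.array([0, 0, 1, 1, 2, 3, 3])
user_pos = np.array([0, 1, 0, 1, 2, 1, 2])
rating = np.array([5.0, 4.0, 5.0, 5.0, 4.0, 1.0, 5.0])

# sparse movie-by-user matrix: only observed ratings are stored,
# and unobserved cells are implicitly 0
sparse_ratings = csr_matrix((rating, (movie_pos, user_pos)), shape=(4, 3))

# fit once, then reuse the fitted object for every query
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(sparse_ratings)

neighbors = {}
for query in (0, 3):
    # each query is a single sparse row; no refit is needed
    neighbors[query] = nn.kneighbors(sparse_ratings[query], return_distance=False)[0]
    print(query, neighbors[query])
```

Fill strategies that replace blanks with nonzero values would make the matrix dense again, so they may not fit in memory at the 25M scale without further changes.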