In [14]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Movie Recommendations from k-Nearest Neighbors

We're going to create a movie recommendation system that recommends movies similar to an input movie.  For example, suppose a person just finished watching the horror movie "Saw."  We'll assume they liked it and recommend other movies like "Saw" using only user ratings (we'll consider the user's actual preferences and previous movie ratings in a later project).

Start by making a numpy matrix where the row is the user's ID and the column is the movie's ID, and the contents are the rating entered by that user and stored in your dataset's `ratings.csv`.

Most users have not seen most movies, so there are many blanks.  Decide what to fill them in with, and explain in a markdown cell what you chose.  You may use generative AI in this portion.

In [15]:
# filepath to the datasets for this lab
filepath = "datasets/ml-latest-small/"

# read csv (pivot function from ChatGPT)
ratings = pd.read_csv(filepath + "ratings.csv").pivot(index='userId', columns='movieId', values='rating').fillna(2.5)

I decided to use `2.5` when filling in the blanks. Because it sits near the middle of the 0.5 to 5.0 rating scale, I figured it would best represent the fact that the user has not seen the movie at all (and thus neither likes nor dislikes it).

Now we need to read in `movies.csv`, so we can map movie IDs to titles.  Using your data structures knowledge, decide how to store this information so your lookup is quick.  Be aware that there are movieIDs missing (for example, there is no movie 33).  You may **not** use generative AI for this portion or anywhere later on this assignment until further notice.

In [16]:
movies = pd.read_csv(filepath + "movies.csv").set_index('movieId').drop(columns=['genres'])
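
One way to think about the lookup requirement is a hash map keyed by movie ID, which tolerates gaps in the IDs without wasting space. The snippet below is a minimal sketch on made-up data: `toy_movies`, `title_by_id`, and `title_for` are hypothetical names, and the titles are placeholders rather than rows from `movies.csv`. An index-based lookup such as `movies.loc[movie_id, 'title']` on the DataFrame above is a comparable alternative.

```python
import pandas as pd

# Toy stand-in for movies.csv (placeholder titles). Note the gap: there is no movieId 3.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 4],
    "title": ["Movie A", "Movie B", "Movie D"],
})

# A dict gives average constant-time lookup and is unaffected by missing IDs.
title_by_id = dict(zip(toy_movies["movieId"], toy_movies["title"]))


def title_for(movie_id):
    """Return the title for movie_id, or None if that ID does not exist."""
    return title_by_id.get(movie_id)


print(title_for(2))
print(title_for(3))
```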

Naturally, [sklearn has a handy object for nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html).  Write a function which accepts a distance metric as an argument and builds a NearestNeighbors object, runs it to calculate the 15 movies most similar to any given movie using that metric, and returns that object.

In [18]:
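def findNN(distMetric):
    # build the model, then fit it on the transposed ratings so each row is one movie
    result = NearestNeighbors(n_neighbors=15, metric=distMetric)
    result.fit(ratings.T.to_numpy())
    return result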

Make a function that accepts a movieID and NearestNeighbors object and returns the names of the 15 movies most similar to that movieID.
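
A possible shape for such a function is sketched below on toy data. It assumes the NearestNeighbors object was fit on the transposed ratings matrix (one row per movie) and that the movie table is indexed by `movieId` with a `title` column. The name `similar_movies`, its extra `ratings`, `movies`, and `k` parameters, and all `toy_*` variables are illustrative assumptions, not part of the assignment's required signature.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def similar_movies(movie_id, nn, ratings, movies, k=15):
    """Return the titles of the k movies closest to movie_id.

    Assumes nn was fit on ratings.T (rows are movies, columns are users).
    """
    movie_vectors = ratings.T
    query = movie_vectors.loc[[movie_id]].to_numpy()
    # Ask for one extra neighbor: the query movie is at distance 0 from itself.
    _, idx = nn.kneighbors(query, n_neighbors=k + 1)
    neighbor_ids = [m for m in movie_vectors.index[idx[0]] if m != movie_id][:k]
    return movies.loc[neighbor_ids, "title"].tolist()


# Toy data: 3 users (rows) by 4 movies (columns), already filled in.
toy_ratings = pd.DataFrame(
    [[5.0, 4.5, 1.0, 2.5],
     [4.0, 5.0, 2.5, 1.0],
     [1.0, 2.5, 5.0, 4.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)
toy_titles = pd.DataFrame(
    {"title": ["A", "B", "C", "D"]},
    index=pd.Index([10, 20, 30, 40], name="movieId"),
)

toy_nn = NearestNeighbors(n_neighbors=2, metric="euclidean")
toy_nn.fit(toy_ratings.T.to_numpy())

toy_result = similar_movies(10, toy_nn, toy_ratings, toy_titles, k=2)
print(toy_result)
```

Filtering by ID rather than simply dropping the first result keeps the function correct even when another movie has an identical rating vector and ties with the query at distance 0.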

Horror movies usually do very well with this approach.  Try it on Saw using euclidean distance, and make sure the answers make sense.

Now, for analysis.  Choose five movies of different genres, and try your approach.  Run it with 1) **different distance metrics** and 2) **different techniques for filling in blank spaces**.  Qualitatively explain which recommendations make the most sense to you, and which combination of filled-in-blanks and distance metrics you would actually deploy.
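
The comparison can be organized as a small grid over fill strategies and metrics. The sketch below runs on a toy matrix with real blanks (`NaN`); the three strategies (`zero`, `midpoint`, `movie_mean`) and the helper names `fill_blanks` and `neighbor_ids` are assumptions chosen for illustration, not the only reasonable options. It refits for every combination, which is fine on the small dataset but would be costly on a larger one.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def fill_blanks(raw, strategy):
    """Fill unrated (NaN) entries of a user-by-movie DataFrame."""
    if strategy == "zero":
        return raw.fillna(0.0)
    if strategy == "midpoint":
        return raw.fillna(2.5)
    if strategy == "movie_mean":
        # A Series passed to fillna is matched against column labels,
        # so each movie's blanks get that movie's mean rating.
        return raw.fillna(raw.mean(axis=0))
    raise ValueError(f"unknown fill strategy: {strategy}")


def neighbor_ids(raw, movie_id, strategy, metric, k):
    """Return the IDs of the k movies nearest to movie_id for one combination."""
    vectors = fill_blanks(raw, strategy).T  # rows are movies
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric)
    nn.fit(vectors.to_numpy())
    _, idx = nn.kneighbors(vectors.loc[[movie_id]].to_numpy())
    return [m for m in vectors.index[idx[0]] if m != movie_id][:k]


# Toy user-by-movie ratings with blanks.
toy_raw = pd.DataFrame(
    [[5.0, 4.0, np.nan, 1.0],
     [4.5, np.nan, 2.0, 1.5],
     [np.nan, 1.0, 5.0, 4.0],
     [1.0, 2.0, 4.5, np.nan]],
    index=pd.Index([1, 2, 3, 4], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)

comparison = {
    (strategy, metric): neighbor_ids(toy_raw, 10, strategy, metric, k=2)
    for strategy in ("zero", "midpoint", "movie_mean")
    for metric in ("euclidean", "manhattan", "cosine")
}
for combo, ids in comparison.items():
    print(combo, ids)
```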

**Perfect completion** to this point is a 90.  For the A, now copy your above code, and alter it to do **(at least)** **one of two things.**  One is to use the genres and tags in some way to try and meaningfully improve your recommendations. The other is to scale up to the [25M Dataset under the "recommended for new research" heading](https://grouplens.org/datasets/movielens/).  You will need some time for this to run.  Out of respect for your own time, you will want to run `.fit()` as few times as possible.  **You'll need to do this on `ssh.cs.usna.edu`; your lab machines won't have enough memory**.
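
For the scale-up option, a dense pivot stores an entry for every user-movie pair, so one possible approach is a SciPy sparse matrix that stores only the ratings that exist. The sketch below is an assumption about how that could look, not a prescribed solution: `build_sparse_movie_matrix` and the `toy_*` names are hypothetical, and the data is a toy stand-in for the long-format ratings file. Note that a sparse matrix treats every blank as `0`, which is a different fill choice from `2.5` and will change the neighbors.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def build_sparse_movie_matrix(ratings_long):
    """Build a movie-by-user sparse matrix from long-format ratings.

    ratings_long needs the columns userId, movieId, and rating.
    Returns the matrix and the movie IDs in row order.
    """
    movie_codes, movie_ids = pd.factorize(ratings_long["movieId"])
    user_codes, user_ids = pd.factorize(ratings_long["userId"])
    matrix = csr_matrix(
        (ratings_long["rating"].to_numpy(dtype=np.float32),
         (movie_codes, user_codes)),
        shape=(len(movie_ids), len(user_ids)),
    )
    return matrix, movie_ids


# Toy long-format ratings (same column names as ratings.csv).
toy_long = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 40, 10],
    "rating":  [5.0, 4.0, 4.5, 5.0, 4.0, 3.5, 1.0],
})

toy_matrix, toy_movie_ids = build_sparse_movie_matrix(toy_long)

# Brute-force search accepts sparse input; fit once and reuse the object.
toy_sparse_nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
toy_sparse_nn.fit(toy_matrix)

# Query with the row for the first movie; the result includes the movie itself.
toy_dist, toy_idx = toy_sparse_nn.kneighbors(toy_matrix[0])
print(list(toy_movie_ids[toy_idx[0]]))
```

Fitting once and then calling `kneighbors` for each query movie keeps the number of `.fit()` calls to one per metric.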

In either case, qualitatively compare your results to your previous product.