In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Ignore only the data conversion warnings raised when using the jaccard distance metric
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings('ignore', category=DataConversionWarning)

# Movie Recommendations from k-Nearest Neighbors

We're going to create a movie recommendation system that recommends movies similar to an input movie. For example, suppose a person just finished watching the horror movie "Saw." We'll assume they liked it and recommend other movies like "Saw" using only user ratings (we'll consider the user's own preferences and previous movie ratings in a later project).

Start by making a numpy matrix where the row is the user's ID and the column is the movie's ID, and the contents are the rating entered by that user and stored in your dataset's `ratings.csv`.

Most users have not seen most movies, so there are many blanks.  Decide what to fill them in with, and explain in a markdown cell what you chose.  You may use generative AI in this portion.

In [2]:
# filepath to the datasets for this lab
filepath = "../datasets/ml-latest-small/"

# read csv
df = pd.read_csv(filepath + "ratings.csv")

# create dataframe with movieId as rows and userId as columns (using pivot function from ChatGPT)
ratings = df.pivot(index='movieId', columns='userId', values='rating').fillna(0)

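I decided to fill the blanks with `0`. Ratings in this dataset run from 0.5 to 5.0, so `0` never collides with a real rating and simply marks a movie the user has not rated.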
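To make the shape of the resulting matrix concrete, here is a minimal sketch of the same pivot-and-fill step on a made-up ratings table. The `toy` frame and its values are invented for illustration and are not taken from `ratings.csv`.

```python
import pandas as pd

# Hypothetical miniature of ratings.csv: two users, two movies, three ratings
toy = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 20],
    "rating":  [4.0, 3.5, 5.0],
})

# One row per movie, one column per user; cells with no rating become 0
toy_ratings = toy.pivot(index="movieId", columns="userId", values="rating").fillna(0)
print(toy_ratings)
```

User 2 never rated movie 10, so that cell is the only one filled in.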

Now we need to read in `movies.csv`, so we can map movie IDs to titles.  Using your data structures knowledge, decide how to store this information so your lookup is quick.  Be aware that there are movieIDs missing (for example, there is no movie 33).  You may **not** use generative AI for this portion or anywhere later on this assignment until further notice.

In [3]:
# read in movies.csv to map movieIds to movie names
movies = pd.read_csv(filepath + "movies.csv").drop(columns=['genres'])

# create set of all movieIds that have at least one rating (a set gives fast membership tests)
withRatings = set(df['movieId'])

# loop through all movieIds in movies.csv, and for movies with no ratings, add an all-zero row to the 'ratings' dataframe
# (rows of 'ratings' must line up with rows of 'movies', so missing rows would throw off the indices)
for i in movies['movieId']:
    if i not in withRatings:
        ratings.loc[i] = np.zeros(ratings.shape[1])

# sort rows of dataframe by movieId
ratings = ratings.sort_index()
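As an alternative sketch of the same two steps, a plain dictionary gives constant-time ID-to-title lookups, and `DataFrame.reindex` adds the all-zero rows for unrated movies in a single call. The frames below are invented for illustration; `movies_toy`, `ratings_toy`, `title_of`, and `ratings_full` are hypothetical names that stand in for the real data.

```python
import pandas as pd

# Hypothetical miniature of movies.csv; note the gap in IDs between 2 and 34
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 34],
    "title":   ["A (1995)", "B (1996)", "C (1997)"],
})

# Hypothetical movie-by-user matrix in which movie 2 has no ratings at all
ratings_toy = pd.DataFrame(
    {1: [4.0, 0.0], 2: [0.0, 3.0]},
    index=pd.Index([1, 34], name="movieId"),
)

# Dictionary keyed by movieId: missing IDs are simply absent
title_of = dict(zip(movies_toy["movieId"], movies_toy["title"]))

# Align the matrix with movies.csv; movies without ratings get a row of zeros
ratings_full = ratings_toy.reindex(movies_toy["movieId"], fill_value=0)

print(title_of[34])
print(ratings_full)
```

After the `reindex`, the rows of `ratings_full` appear in the same order as the rows of `movies_toy`, which is the property the loop above is trying to guarantee.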

Naturally, [sklearn has a handy object for nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html).  Write a function which accepts a distance metric as an argument and builds a NearestNeighbors object, runs it to calculate the 15 movies most similar to any given movie using that metric, and returns that object.

In [4]:
def buildNN(distMetric):
    return NearestNeighbors(n_neighbors=16, metric=distMetric)

Make a function that accepts a movieID and NearestNeighbors object and returns the names of the 15 movies most similar to that movieID.

In [5]:
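def findNN(movieID, nn):
    nn.fit(ratings)  # refit on each call, since later cells change how blanks are filled
    # row positions of the 16 nearest movies ('ratings' and 'movies' rows are in the same order)
    positions = nn.kneighbors(ratings.loc[[movieID]], return_distance=False)[0]
    # drop the first hit (normally the query movie itself) and keep the other 15 titles
    return movies.iloc[positions][['title']].reset_index(drop=True).iloc[1:]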
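Because `findNN` refits on every call, each query pays the fitting cost again. A minimal sketch of one alternative, fitting once when the object is built and reusing it for every query, is shown below on an invented four-movie matrix. The names `build_fitted_nn`, `similar_titles`, `toy`, and `toy_titles` are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical 4-movie x 3-user matrix (rows are movies, as in `ratings`)
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [5.0, 4.5, 0.0],
     [0.0, 0.0, 3.0],
     [0.0, 1.0, 3.5]],
    index=pd.Index([1, 2, 34, 50], name="movieId"),
)
toy_titles = {1: "A", 2: "B", 34: "C", 50: "D"}

def build_fitted_nn(matrix, metric, k):
    """Fit once on `matrix`, asking for k + 1 neighbours so the query itself can be dropped."""
    return NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(matrix.to_numpy())

def similar_titles(movie_id, nn, matrix, titles):
    """Return the titles of the neighbours of `movie_id`, leaving out the movie itself."""
    query = matrix.loc[[movie_id]].to_numpy()
    positions = nn.kneighbors(query, return_distance=False)[0]
    ids = matrix.index[positions]
    return [titles[i] for i in ids if i != movie_id]

nn = build_fitted_nn(toy, "euclidean", k=2)   # the only call to .fit()
print(similar_titles(1, nn, toy, toy_titles))
print(similar_titles(34, nn, toy, toy_titles))
```

The fitted object has to be rebuilt whenever the underlying matrix changes, for example after switching to a different way of filling in blanks.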

Horror movies usually do very well with this approach.  Try it on Saw using euclidean distance, and make sure the answers make sense.

In [6]:
# 8957 is the movieId for 'Saw'
findNN(8957, buildNN('euclidean'))

Unnamed: 0,title
1,Phone Booth (2002)
2,Saw II (2005)
3,Final Destination 3 (2006)
4,Saw III (2006)
5,Resident Evil: Afterlife (2010)
6,Hostel (2005)
7,Repo Men (2010)
8,Hostage (2005)
9,Freddy vs. Jason (2003)
10,Firewall (2006)


Now, for analysis.  Choose five movies of different genres, and try your approach.  Run it with 1) **different distance metrics** and 2) **different techniques for filling in blank spaces**.  Qualitatively explain which recommendations make the most sense to you, and which combination of filled-in-blanks and distance metrics you would actually deploy.

# I chose the following movies:
* 1 - Toy Story (1995)
* 44 - Mortal Kombat (1995)
* 104 - Happy Gilmore (1996)
* 116 - Anne Frank Remembered (1995)
* 215 - Before Sunrise (1995)

# First, using euclidean distance with blanks filled with 0:
---

In [7]:
# Toy Story
findNN(1, buildNN('euclidean'))

Unnamed: 0,title
1,Toy Story 2 (1999)
2,Mission: Impossible (1996)
3,Independence Day (a.k.a. ID4) (1996)
4,"Bug's Life, A (1998)"
5,"Nutty Professor, The (1996)"
6,Willy Wonka & the Chocolate Factory (1971)
7,Babe (1995)
8,Groundhog Day (1993)
9,"Mask, The (1994)"
10,"Honey, I Shrunk the Kids (1989)"


In [8]:
# Mortal Kombat
findNN(44, buildNN('euclidean'))

Unnamed: 0,title
1,Super Mario Bros. (1993)
2,"One, The (2001)"
3,Cradle 2 the Grave (2003)
4,Vampires (1998)
5,Mortal Kombat: Annihilation (1997)
6,"Jerky Boys, The (1995)"
7,Batman: Under the Red Hood (2010)
8,Batman Beyond: Return of the Joker (2000)
9,"Fast and the Furious: Tokyo Drift, The (Fast a..."
10,Bride of Chucky (Child's Play 4) (1998)


In [9]:
# Happy Gilmore
findNN(104, buildNN('euclidean'))

Unnamed: 0,title
1,Billy Madison (1995)
2,"Waterboy, The (1998)"
3,"Nutty Professor, The (1996)"
4,Role Models (2008)
5,"Cable Guy, The (1996)"
6,Wayne's World 2 (1993)
7,Bio-Dome (1996)
8,"Mighty Ducks, The (1992)"
9,Super Troopers (2001)
10,Old School (2003)


In [10]:
# Anne Frank Remembered
findNN(116, buildNN('euclidean'))

Unnamed: 0,title
1,"Thin Line Between Love and Hate, A (1996)"
2,"Line King: The Al Hirschfeld Story, The (1996)"
3,"Last Klezmer: Leopold Kozlowski, His Life and ..."
4,Hustler White (1996)
5,"Single Girl, A (Fille seule, La) (1995)"
6,Moll Flanders (1996)
7,Baby Boom (1987)
8,"Pompatus of Love, The (1996)"
9,Wild Reeds (Les roseaux sauvages) (1994)
10,French Twist (Gazon maudit) (1995)


In [11]:
# Before Sunrise
findNN(215, buildNN('euclidean'))

Unnamed: 0,title
1,Before Sunset (2004)
2,"Dreamlife of Angels, The (Vie rêvée des anges,..."
3,Melancholia (2011)
4,L.I.E. (2001)
5,"Searchers, The (1956)"
6,"Anniversary Party, The (2001)"
7,"Devil's Backbone, The (Espinazo del diablo, El..."
8,"Age of Innocence, The (1993)"
9,Bringing Out the Dead (1999)
10,Last Life in the Universe (Ruang rak noi nid m...


# Second, using jaccard distance with blanks filled with 0:
---

In [12]:
# Toy Story
findNN(1, buildNN('jaccard'))

Unnamed: 0,title
1,Independence Day (a.k.a. ID4) (1996)
2,Jurassic Park (1993)
3,Star Wars: Episode IV - A New Hope (1977)
4,Forrest Gump (1994)
5,Star Wars: Episode VI - Return of the Jedi (1983)
6,Mission: Impossible (1996)
7,"Lion King, The (1994)"
8,Back to the Future (1985)
9,Men in Black (a.k.a. MIB) (1997)
10,Groundhog Day (1993)


In [13]:
# Mortal Kombat
findNN(44, buildNN('jaccard'))

Unnamed: 0,title
1,Judge Dredd (1995)
2,"Crow, The (1994)"
3,Hackers (1995)
4,Super Mario Bros. (1993)
5,Demolition Man (1993)
6,GoldenEye (1995)
7,Batman & Robin (1997)
8,Ronin (1998)
9,Universal Soldier (1992)
10,Dr. No (1962)


In [14]:
# Happy Gilmore
findNN(104, buildNN('jaccard'))

Unnamed: 0,title
1,Billy Madison (1995)
2,American Pie (1999)
3,Wayne's World (1992)
4,"Nutty Professor, The (1996)"
5,Austin Powers: The Spy Who Shagged Me (1999)
6,Austin Powers: International Man of Mystery (1...
7,Ace Ventura: When Nature Calls (1995)
8,Dumb & Dumber (Dumb and Dumber) (1994)
9,Ferris Bueller's Day Off (1986)
10,There's Something About Mary (1998)


In [15]:
# Anne Frank Remembered
findNN(116, buildNN('jaccard'))

Unnamed: 0,title
1,Burglar (1987)
2,Baby Boom (1987)
3,"MatchMaker, The (1997)"
4,Back to the Beach (1987)
5,Against All Odds (1984)
6,Max Dugan Returns (1983)
7,"River, The (1984)"
8,Moll Flanders (1996)
9,Punchline (1988)
10,"Mirror Has Two Faces, The (1996)"


In [16]:
# Before Sunrise
findNN(215, buildNN('jaccard'))

Unnamed: 0,title
1,Before Sunset (2004)
2,"Age of Innocence, The (1993)"
3,Three Colors: White (Trzy kolory: Bialy) (1994)
4,Big Night (1996)
5,Crumb (1994)
6,State and Main (2000)
7,Smoke (1995)
8,"Wings of Desire (Himmel über Berlin, Der) (1987)"
9,Sexy Beast (2000)
10,Dogville (2003)


# Third, using euclidean distance with blanks filled with averages:
---

In [17]:
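# replace blanks (zeros) with the average of that user's actual ratings
# ('ratings' has one column per user, so apply() passes in one user's column at a time)
def replaceMean(col):
    rated = col.where(col != 0)  # unrated entries become NaN, so they are left out of the mean
    return rated.fillna(rated.mean())

# apply function to each user's column
ratings = ratings.apply(replaceMean)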

In [18]:
# Toy Story
findNN(1, buildNN('euclidean'))

Unnamed: 0,title
1,Toy Story 2 (1999)
2,Toy Story 3 (2010)
3,True Grit (1969)
4,Hugo (2011)
5,Clerks II (2006)
6,Dirty Harry (1971)
7,"Night at the Opera, A (1935)"
8,Condorman (1981)
9,48 Hrs. (1982)
10,Batman: Mask of the Phantasm (1993)


In [19]:
# Mortal Kombat
findNN(44, buildNN('euclidean'))

Unnamed: 0,title
1,Don't Be a Menace to South Central While Drink...
2,Mr. Nice Guy (Yat goh ho yan) (1997)
3,"Claymation Christmas Celebration, A (1987)"
4,Suburban Commando (1991)
5,"Sweetest Thing, The (2002)"
6,Joe's Apartment (1996)
7,"Hollywood Knights, The (1980)"
8,"Curse of the Jade Scorpion, The (2001)"
9,Sorority Boys (2002)
10,Feeling Minnesota (1996)


In [20]:
# Happy Gilmore
findNN(104, buildNN('euclidean'))

Unnamed: 0,title
1,Billy Madison (1995)
2,Wayne's World 2 (1993)
3,Ice Castles (1978)
4,"Amazing Panda Adventure, The (1995)"
5,Black Widow (1987)
6,Frankenstein Meets the Wolf Man (1943)
7,"Russia House, The (1990)"
8,Inventing the Abbotts (1997)
9,Burglar (1987)
10,Ruthless People (1986)


In [21]:
# Anne Frank Remembered
findNN(116, buildNN('euclidean'))

Unnamed: 0,title
1,"Pompatus of Love, The (1996)"
2,"Thin Line Between Love and Hate, A (1996)"
3,Baby Boom (1987)
4,"Gate, The (1987)"
5,Private School (1983)
6,Puppet Master II (1991)
7,Children of the Corn IV: The Gathering (1996)
8,Lost & Found (1999)
9,Prancer (1989)
10,"Bounty, The (1984)"


In [22]:
# Before Sunrise
findNN(215, buildNN('euclidean'))

Unnamed: 0,title
1,Before Sunset (2004)
2,Alpha Dog (2007)
3,Trust (1990)
4,Bring It On Again (2004)
5,"Searchers, The (1956)"
6,Looking for Richard (1996)
7,Last Life in the Universe (Ruang rak noi nid m...
8,Police Story (Ging chaat goo si) (1985)
9,"Bittersweet Life, A (Dalkomhan insaeng) (2005)"
10,American Dreamz (2006)


# Finally, using cosine distance with blanks filled with averages:

#### Jaccard distance does not work here: it operates on boolean values, and once the blanks are filled with non-zero averages, every single value is treated as 'True'. Every movie then looks identical to every other movie (all distances are 0), so the nearest neighbors returned are arbitrary.
---
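The effect described above can be checked on a small example. This is a minimal sketch on invented numbers: `filled` stands in for a mean-filled matrix and `zero_filled` for a zero-filled one, and both are hypothetical.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical mean-filled matrix: no entry is zero
filled = np.array([[4.0, 3.2, 5.0],
                   [3.5, 3.2, 1.0],
                   [3.5, 2.0, 4.1]])

# Hypothetical zero-filled matrix: zeros mark unrated cells
zero_filled = np.array([[4.0, 0.0, 5.0],
                        [0.0, 3.0, 0.0],
                        [4.5, 0.0, 0.0]])

# Jaccard is a boolean metric, so only the rated/unrated pattern matters
d_filled = pairwise_distances(filled != 0, metric="jaccard")
d_zero = pairwise_distances(zero_filled != 0, metric="jaccard")

print(d_filled)  # every distance is 0: all rows look identical
print(d_zero)    # distances now depend on which users rated both movies
```

With zeros left in place, Jaccard compares which users rated each movie and ignores the rating values themselves.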

In [23]:
# Toy Story
findNN(1, buildNN('cosine'))

Unnamed: 0,title
1,Toy Story 2 (1999)
2,True Grit (1969)
3,Hugo (2011)
4,Toy Story 3 (2010)
5,Condorman (1981)
6,48 Hrs. (1982)
7,We Bought a Zoo (2011)
8,"War of the Roses, The (1989)"
9,Dirty Harry (1971)
10,Clerks II (2006)


In [24]:
# Mortal Kombat
findNN(44, buildNN('cosine'))

Unnamed: 0,title
1,Don't Be a Menace to South Central While Drink...
2,Mr. Nice Guy (Yat goh ho yan) (1997)
3,"Claymation Christmas Celebration, A (1987)"
4,Suburban Commando (1991)
5,"Curse of the Jade Scorpion, The (2001)"
6,"Hollywood Knights, The (1980)"
7,"Sweetest Thing, The (2002)"
8,Joe's Apartment (1996)
9,Sorority Boys (2002)
10,Feeling Minnesota (1996)


In [25]:
# Happy Gilmore
findNN(104, buildNN('cosine'))

Unnamed: 0,title
1,Billy Madison (1995)
2,Wayne's World 2 (1993)
3,Ice Castles (1978)
4,"Amazing Panda Adventure, The (1995)"
5,Black Widow (1987)
6,Frankenstein Meets the Wolf Man (1943)
7,"Russia House, The (1990)"
8,Inventing the Abbotts (1997)
9,Burglar (1987)
10,"Last Kiss, The (2006)"


In [26]:
# Anne Frank Remembered
findNN(116, buildNN('cosine'))

Unnamed: 0,title
1,"Pompatus of Love, The (1996)"
2,"Thin Line Between Love and Hate, A (1996)"
3,Baby Boom (1987)
4,Puppet Master II (1991)
5,Private School (1983)
6,"Gate, The (1987)"
7,Lost & Found (1999)
8,Prancer (1989)
9,Children of the Corn IV: The Gathering (1996)
10,"Bounty, The (1984)"


In [27]:
# Before Sunrise
findNN(215, buildNN('cosine'))

Unnamed: 0,title
1,Before Sunset (2004)
2,Alpha Dog (2007)
3,Bring It On Again (2004)
4,Trust (1990)
5,"Searchers, The (1956)"
6,Looking for Richard (1996)
7,Last Life in the Universe (Ruang rak noi nid m...
8,Police Story (Ging chaat goo si) (1985)
9,American Dreamz (2006)
10,All Over the Guy (2001)


# Results:

#### Based on these results, euclidean distance with blanks filled with 0 appears to be the best option. Recommendations that clearly make sense (for example, Billy Madison for Happy Gilmore and Before Sunset for Before Sunrise) showed up under most combinations, but the other combinations also returned far more random, seemingly unrelated movies.
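One possible way to put a number on how much two recommendation lists agree is the share of titles they have in common. The sketch below is an illustration only: `overlap` is a hypothetical helper, and the lists are the ten Toy Story titles displayed in the outputs above.

```python
def overlap(list_a, list_b):
    """Jaccard similarity of two title lists: shared titles / all distinct titles."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

# Toy Story neighbours as displayed above (ten titles per list)
euclid_zero = [
    "Toy Story 2 (1999)", "Mission: Impossible (1996)",
    "Independence Day (a.k.a. ID4) (1996)", "Bug's Life, A (1998)",
    "Nutty Professor, The (1996)", "Willy Wonka & the Chocolate Factory (1971)",
    "Babe (1995)", "Groundhog Day (1993)", "Mask, The (1994)",
    "Honey, I Shrunk the Kids (1989)",
]
euclid_mean = [
    "Toy Story 2 (1999)", "Toy Story 3 (2010)", "True Grit (1969)", "Hugo (2011)",
    "Clerks II (2006)", "Dirty Harry (1971)", "Night at the Opera, A (1935)",
    "Condorman (1981)", "48 Hrs. (1982)", "Batman: Mask of the Phantasm (1993)",
]
cosine_mean = [
    "Toy Story 2 (1999)", "True Grit (1969)", "Hugo (2011)", "Toy Story 3 (2010)",
    "Condorman (1981)", "48 Hrs. (1982)", "We Bought a Zoo (2011)",
    "War of the Roses, The (1989)", "Dirty Harry (1971)", "Clerks II (2006)",
]

same_fill = overlap(euclid_mean, cosine_mean)    # 8 shared of 12 distinct titles
same_metric = overlap(euclid_zero, euclid_mean)  # 1 shared of 19 distinct titles
print(round(same_fill, 2), round(same_metric, 2))
```

For this one movie, changing how the blanks are filled moves the displayed recommendations far more than switching between euclidean and cosine distance does.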

**Perfect completion** to this point is a 90.  For the A, now copy your above code, and alter it to do **(at least)** **one of two things.**  One is to use the genres and tags in some way to try and meaningfully improve your recommendations. The other is to scale up to the [25M Dataset under the "recommended for new research" heading](https://grouplens.org/datasets/movielens/).  You will need some time for this to run.  Out of respect for your own time, you will want to run `.fit()` as few times as possible.  **You'll need to do this on `ssh.cs.usna.edu`; your lab machines won't have enough memory**.

In either case, qualitatively compare your results to your previous product.
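For the scale-up option, a dense movie-by-user matrix would consist almost entirely of filled-in blanks. One possible approach, sketched here under the assumption that blanks are treated as 0, is to store only the actual ratings in a SciPy sparse matrix, fit once, and query all movies in a single call. The `big` frame is an invented stand-in for the larger `ratings.csv`, and the other names are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the large ratings.csv
big = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 20],
    "rating":  [5.0, 4.0, 4.5, 4.0, 3.0, 1.0],
})

# Map raw IDs to consecutive row/column positions
movie_codes, movie_ids = pd.factorize(big["movieId"], sort=True)
user_codes, user_ids = pd.factorize(big["userId"], sort=True)

# Only rated cells are stored; unrated cells are implicit zeros
sparse_ratings = csr_matrix(
    (big["rating"].to_numpy(), (movie_codes, user_codes)),
    shape=(len(movie_ids), len(user_ids)),
)

# Fit a single time; brute-force search accepts sparse input with the cosine metric
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(sparse_ratings)

# Query every movie in one call instead of refitting per movie
neighbours = nn.kneighbors(sparse_ratings, return_distance=False)
nearest_other = {int(movie_ids[i]): int(movie_ids[row[1]]) for i, row in enumerate(neighbours)}
print(nearest_other)
```

Filling blanks with per-user averages would make the matrix dense again, so this sketch only applies to the zero-filled variant.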