# Movie Recommendations

This notebook uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/)
as well as content information linked through the respective movie pages on [TMDB](https://www.themoviedb.org/).

* I have included the CSV files in our class repo on GitHub
* License info is included in the file https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/movielens-data/README.txt

**NOTE:** Running this on the JupyterHub will result in the kernel dying part-way through, due to RAM constraints.  If you want to follow along with this notebook, please download it and run it on your local system.

In [None]:
import pandas as pd

In [None]:
ratings = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/movielens-data/ratings.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/movielens-data/movies.csv')

In [None]:
ratings

In [None]:
movies

The ratings cover 610 users and 9724 movies, as the next two cells confirm:

In [None]:
len(ratings['userId'].unique())

In [None]:
len(ratings['movieId'].unique())

In [None]:
ratings['rating'].unique()

# Idea 3: Recommend based on Content

Here's an opportunity to use some of our text-based algorithms!

We won't use NLTK or Gensim here, though, just good ole scikit-learn.

[This notebook was motivated by DataCamp's ["Beginner Tutorial: Recommender Systems in Python"](https://www.datacamp.com/tutorial/recommender-systems-python).]

In [None]:
# We're going to use the 'overview' column first:

movies['overview'][0]

In [None]:
# Some entries are NaNs.  We don't want those, so replace with empty strings:

movies['overview'] = movies['overview'].fillna('')

In [None]:
# sklearn already has a class that computes word counts

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# and conveniently it will get rid of stopwords

count = CountVectorizer(stop_words='english')

In [None]:
count_matrix = count.fit_transform(movies['overview'])

In [None]:
count_matrix.shape

In [None]:
count_matrix[0,:]

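The count matrix is stored as a sparse matrix, so printing a row doesn't show the words directly. We can use `nonzero()` to get the positions of the nonzero elements and look up the matching words.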

In [None]:
doc = 0
feature_names = count.get_feature_names_out()
feature_index = count_matrix[doc,:].nonzero()[1]
for i in feature_index:
    print(feature_names[i], count_matrix[doc,i])

In [None]:
count_matrix.shape

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
cosine_sim.shape

In [None]:
cosine_sim[1]

In [None]:
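# Construct a reverse map from movie titles to row positions.
# Series.drop_duplicates() would compare the row positions (all distinct),
# so filter on the index instead to keep the first row for each repeated title.

indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]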

In [None]:
indices[0:2]
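
One detail worth checking when building a title lookup like this: `Series.drop_duplicates()` looks at the values of the Series, not at its index labels. The sketch below uses a hypothetical three-row catalogue with one repeated title to show the difference from filtering on `index.duplicated()`. The titles and the `toy_*` names are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-catalogue with one repeated title
toy_movies = pd.DataFrame({'title': ['Movie A (2000)', 'Movie B (2001)', 'Movie A (2000)']})

toy_indices = pd.Series(toy_movies.index, index=toy_movies['title'])

# drop_duplicates() compares the values (0, 1, 2), which are all distinct,
# so nothing is removed and the repeated title survives
after_drop = toy_indices.drop_duplicates()

# Filtering on the index labels keeps only the first row for each title
unique_titles = toy_indices[~toy_indices.index.duplicated(keep='first')]

print(len(after_drop), len(unique_titles))
print(unique_titles['Movie A (2000)'])
```

With a unique index, looking up a title returns a single row position, which is what `get_recommendations` expects.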

In [None]:
# Function that takes a movie title and a similarity matrix (`method`)
# and returns the titles of the most similar movies
def get_recommendations(title, method):

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(method[idx]))

    # Sort the movies based on the similarity scores, highest first
    sim_scores = sorted(sim_scores, key=lambda x: -x[1])

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies['title'].iloc[movie_indices]
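
The `[1:11]` slice assumes the queried movie always sorts into first place. That can fail when another movie ties with it (for example, two movies with identical overviews). Here is a minimal sketch of an alternative that drops the queried movie explicitly. The helper name `top_similar` and the toy scores are my own, not part of the tutorial this notebook follows.

```python
import numpy as np

def top_similar(sim_row, idx, k=10):
    """Return positions of the k highest scores in sim_row, excluding idx itself."""
    sim_row = np.asarray(sim_row, dtype=float)
    order = np.argsort(-sim_row, kind='stable')   # highest score first
    order = order[order != idx]                   # drop the queried movie explicitly
    return order[:k]

# Toy similarity row for movie 2; movie 1 ties with it at 1.0
toy_row = np.array([0.2, 1.0, 1.0, 0.0, 0.5])
toy_top = top_similar(toy_row, idx=2, k=3)
print(toy_top)
```

On this toy row, sorting and then slicing `[1:4]` would keep movie 2 itself and lose movie 1, while the explicit filter returns movies 1, 4, and 0.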

In [None]:
get_recommendations('Toy Story (1995)', cosine_sim)

If you know these movies, you'll know that this is a rather ridiculous set of suggestions.

Child's Play 2????  Um.... no.

* The recommender should improve if we use better metadata and capture more of the finer details.
* First, let's try an alternative metric for assigning importance to words.  Instead of raw counts, use Term Frequency - Inverse Document Frequency (TF-IDF), which down-weights words that appear in many overviews.

Special Note:  If your kernel runs out of memory, the execution here may die and you'll get a note about the kernel dying.  If that happens, restart the kernel, re-run from the top up to (but not including) the `CountVectorizer` import, and resume here.
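
Most of the memory goes to the dense movie-by-movie similarity matrix: with about 9,700 movies and 8-byte floats, that is roughly 756 MB per matrix. If memory stays tight, one possible workaround (not what this notebook does) is to compute only the row you need. The sketch below shows the idea on a made-up corpus. The helper `dense_similarity_bytes` and the `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dense_similarity_bytes(n_items, bytes_per_value=8):
    """Approximate size of a dense n_items x n_items float64 matrix."""
    return n_items * n_items * bytes_per_value

print(dense_similarity_bytes(9724) / 1e6, "MB")

# Made-up corpus, only for illustration
toy_docs = ["toys cowboy astronaut", "toys dinosaur", "cowboy ranch horse"]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

# Full matrix: every document against every document
full = cosine_similarity(toy_matrix, toy_matrix)

# One row on demand: a single document against every document
one_row = cosine_similarity(toy_matrix[0], toy_matrix).ravel()

print(np.allclose(full[0], one_row))
```

The single row matches the corresponding row of the full matrix, so a recommendation function could call `cosine_similarity` per query instead of storing everything up front.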

In [None]:
# Use the TF-IDF algorithm from sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Initialize the TF-IDF Vectorizer, and set it to remove stopwords.

tfidf = TfidfVectorizer(stop_words='english')

In [None]:
# Create the matrix of TF-IDF weights (one row per overview, one column per word)

tfidf_matrix = tfidf.fit_transform(movies['overview'])

In [None]:
tfidf_matrix.shape

In [None]:
tfidf_matrix[0,:]

In [None]:
doc = 0
feature_names = tfidf.get_feature_names_out()
feature_index = tfidf_matrix[doc,:].nonzero()[1]
for i in feature_index:
    print(feature_names[i], tfidf_matrix[doc,i])

In [None]:
doc = 0
feature_names = tfidf.get_feature_names_out()
feature_index = tfidf_matrix[doc,:].nonzero()[1]
tfidf_toystory_scores = {}
for i in feature_index:
    tfidf_toystory_scores[feature_names[i]] = tfidf_matrix[doc,i]
sorted(tfidf_toystory_scores.items(), key=lambda x: -x[1])

In [None]:
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel already equals the cosine similarity

from sklearn.metrics.pairwise import linear_kernel

In [None]:
# Compute the cosine similarity matrix

cosine_sim2 = linear_kernel(tfidf_matrix, tfidf_matrix)
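
Why does `linear_kernel` give a cosine similarity here? With its default `norm='l2'` setting, `TfidfVectorizer` scales every row to unit length, so the dot product of two rows is already their cosine. A quick check on a made-up corpus (the documents and `toy_*` names are only for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up corpus, only for illustration
toy_docs = ["toys cowboy astronaut", "toys dinosaur", "cowboy ranch horse"]
toy_tfidf_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Each row has length 1 under the default L2 normalization
row_norms = np.linalg.norm(toy_tfidf_matrix.toarray(), axis=1)
print(row_norms)

# So the plain dot product and the cosine similarity agree
dot = linear_kernel(toy_tfidf_matrix, toy_tfidf_matrix)
cos = cosine_similarity(toy_tfidf_matrix, toy_tfidf_matrix)
print(np.allclose(dot, cos))
```

This shortcut would not hold for the raw count matrix, whose rows are not normalized.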

In [None]:
cosine_sim2.shape

In [None]:
cosine_sim2[1]

In [None]:
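# Construct a reverse map from movie titles to row positions.
# Series.drop_duplicates() would compare the row positions (all distinct),
# so filter on the index instead to keep the first row for each repeated title.

indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]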

In [None]:
indices[0:2]

In [None]:
# Function that takes a movie title and a similarity matrix (`method`)
# and returns the titles of the most similar movies
def get_recommendations(title, method):

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(method[idx]))

    # Sort the movies based on the similarity scores, highest first
    sim_scores = sorted(sim_scores, key=lambda x: -x[1])

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies['title'].iloc[movie_indices]

In [None]:
get_recommendations('Toy Story (1995)', cosine_sim2)

Still gives us a rather ridiculous set of movies.

* Now we'll try to improve the recommender by using better metadata (director and keywords) that captures more of the finer details.

In [None]:
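# Use a combination of director and keywords to get another content-based similarity.
# Fill missing values first: a NaN on either side would make the whole combined string NaN.

movies['dir_and_keys'] = movies['director'].fillna('') + ' ' + movies['keywords'].fillna('')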
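
A note on combining text columns in pandas: if either side of a string concatenation is missing, the result is missing too, so a movie with a director but no keywords would lose both. The sketch below shows this on a hypothetical two-row frame. The values and the `toy` name are made up, and I'm assuming the real `director` and `keywords` columns are plain strings.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing its keywords, one missing its director
toy = pd.DataFrame({'director': ['Director A', np.nan],
                    'keywords': [np.nan, 'toy friendship']})

# A NaN on either side makes the concatenated value NaN
naive = toy['director'] + ' ' + toy['keywords']

# Filling each column first keeps whatever text is available
filled = (toy['director'].fillna('') + ' ' + toy['keywords'].fillna('')).str.strip()

print(naive.isna().tolist())
print(filled.tolist())
```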

In [None]:
movies.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
movies['dir_and_keys'] = movies['dir_and_keys'].fillna('')
count_matrix = count.fit_transform(movies['dir_and_keys'])

In [None]:
count_matrix.shape

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
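# Construct the reverse mapping as before, keeping the first row for each repeated title.
# (`metadata` is a reset-index copy of `movies`; the function below still reads `movies`.)
metadata = movies.reset_index()
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]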

In [None]:
# Function that takes a movie title and a similarity matrix (`method`)
# and returns the titles of the most similar movies
def get_recommendations(title, method):

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(method[idx]))

    # Sort the movies based on the similarity scores, highest first
    sim_scores = sorted(sim_scores, key=lambda x: -x[1])

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies['title'].iloc[movie_indices]

In [None]:
get_recommendations('Toy Story (1995)', cosine_sim)

In [None]:
get_recommendations('40-Year-Old Virgin, The (2005)', cosine_sim)

Okay, this is looking better.