In this exercise, you will create a command line interactive movie recommender system. Your program will ask the user to rate ten movies and recommend other movies they might like by finding users with similar tastes in movies and recommending movies they rated highly. This technique is often called collaborative filtering.

You can find the data for this exercise in the movielens folder. You will need the movies.csv and ratings.csv files to complete this assignment.

**1. Recommendation Engine**

The first step in the process is creating a matrix of user ratings for each movie. If there are M users and N movies, then the user rating matrix will be M x N, with one row per user and one column per movie.

<img src="files/Images/1.jpg">

The recommendation engine should take as input user ratings for one or more movies and output an ordered list of movie recommendations. Test your recommendation engine by selecting and rating ten movies from the top 200 most rated movies. The following is an example of test input.

<img src="files/Images/2.jpg">
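One way to find the most rated movies is to count how often each `movieId` appears in the ratings data. The sketch below assumes a dataframe with the same `movieId` column as ratings.csv; the helper name `top_rated_movie_ids` and the toy data are made up for illustration.

```python
import pandas as pd

def top_rated_movie_ids(ratings_df, n=200):
    """Return the IDs of the n movies with the most ratings, most rated first."""
    counts = ratings_df['movieId'].value_counts()
    return counts.head(n).index.tolist()

# Toy data for illustration only (not taken from ratings.csv)
ratings_df = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3],
    'movieId': [10, 20, 10, 30, 10, 20],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
print(top_rated_movie_ids(ratings_df, n=2))
```

With the real data, the returned IDs could be joined against movies.csv to show titles to pick from.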

In order to determine which movies to recommend, you should first compute the cosine similarity between each existing user’s ratings and the inputs given. You can compute it using scipy or scikit-learn (note that scipy’s `cosine` function returns the cosine distance, which is 1 minus the similarity). Next, compute a weighted average of all users’ ratings using the cosine similarity as the weight. In other words, users who gave ratings similar to the input ratings are weighted higher than those who gave less similar ratings.

In mathematical terms, if there are N movies and M users, the user rating matrix is an M x N matrix. The user input is a vector of length N. One row of the user rating matrix represents the movie ratings for one user. If you compute the cosine similarity between that row and the user’s input, you have a measure of how close that user’s ratings are to the input ratings. If the two rating vectors point in exactly opposite directions, the cosine similarity is -1. If they rate everything the same, the cosine similarity is 1.
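As a small illustration of this measure, the sketch below computes the cosine similarity of made-up rating vectors with NumPy. The helper name `cosine_sim` and the sample vectors are assumptions for illustration. Note that raw ratings are never negative, so their similarity stays between 0 and 1; a value of -1 is only reachable if the ratings are centered first, for example by subtracting each user's mean.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical rating vectors over four movies
user_input = [5, 0, 3, 0]
same_taste = [5, 0, 3, 0]

# Hypothetical centered ratings that are exact opposites
centered_a = [2, -2, 1, -1]
centered_b = [-2, 2, -1, 1]

print(cosine_sim(user_input, same_taste))
print(cosine_sim(centered_a, centered_b))
```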

The following Python code demonstrates how to calculate the movie recommendations without using NumPy’s matrix math operations. It uses Python list comprehensions and for loops for the purpose of clarity. The output of this function is a list of length N (the number of unique movies) with a suggested rating for each movie.

<img src="files/Images/3.jpg">

This function does not take into account movies the user has already rated, so you should make sure you remove those movies from your recommendations. Additionally, a more efficient solution would use NumPy’s matrix operations rather than Python lists.
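The weighted average described above can be sketched with NumPy array operations as follows. This is a minimal illustration, not the function from the assignment: the name `recommend`, the convention that 0 means "not rated", and the toy data are all assumptions, and unrated entries are simply averaged in as zeros.

```python
import numpy as np

def recommend(ratings, user_input, top_n=5):
    """Rank unrated movies by similarity-weighted average rating.

    ratings: (num_users, num_movies) array, 0 meaning "not rated".
    user_input: length num_movies vector in the same format.
    Returns a list of column indices, best recommendation first.
    """
    ratings = np.asarray(ratings, dtype=float)
    user_input = np.asarray(user_input, dtype=float)

    # Cosine similarity between the input and every existing user (row)
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(user_input)
    sims = np.divide(ratings @ user_input, norms,
                     out=np.zeros(len(ratings)), where=norms > 0)

    # Similarity-weighted average rating for each movie
    total = sims.sum()
    if total > 0:
        scores = sims @ ratings / total
    else:
        scores = np.zeros(ratings.shape[1])

    # Exclude movies the user has already rated
    scores[user_input > 0] = -np.inf
    ranked = np.argsort(-scores)
    ranked = ranked[np.isfinite(scores[ranked])]
    return ranked[:top_n].tolist()

# Toy data: three users, four movies (illustration only)
ratings = [[5, 4, 0, 1],
           [4, 5, 5, 0],
           [0, 1, 5, 4]]
user_input = [5, 0, 0, 0]
print(recommend(ratings, user_input, top_n=2))
```

The returned values are column positions in the rating matrix, so they would still need to be mapped back to movie IDs and titles.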

**2. Movie Search Engine**

Now that you have a recommendation engine, you need to provide a way for users to find movies to rate. You will need to create a function that takes in a search parameter and returns a ranked list of movies that best match the input. For this data set, there are fewer than 10,000 movies and you only need to worry about searching the titles for those movies. Therefore, you do not need to worry as much about coming up with an optimal solution that scales for larger datasets.

When returning candidate movie titles, you will want to return the titles that match the search input with the highest probability. Consider dividing up the titles and the user input into n-grams, but instead of using n-grams of words, use n-grams of the characters in the string.

For example, the title Batman contains the bigrams [‘ba’, ‘at’, ‘tm’, ‘ma’, ‘an’]. You could then match that input title to titles that contain those bigrams with the highest probability. Find a search method that generally returns correct recommendations based on the search input.
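A minimal sketch of character-bigram matching is shown below. Scoring by Jaccard overlap (shared bigrams divided by all distinct bigrams) is one possible choice, not something the assignment prescribes; the helper names `char_bigrams` and `bigram_score` and the sample titles are assumptions for illustration.

```python
def char_bigrams(text):
    """Return the set of character bigrams in text, ignoring case and spaces."""
    cleaned = ''.join(text.lower().split())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}

def bigram_score(query, title):
    """Jaccard overlap between the bigram sets of query and title (0 to 1)."""
    q, t = char_bigrams(query), char_bigrams(title)
    return len(q & t) / len(q | t) if q | t else 0.0

# Hypothetical titles and a misspelled query
titles = ['Superman', 'Batman Returns', 'Batman']
ranked = sorted(titles, key=lambda t: bigram_score('batmn', t), reverse=True)
print(ranked)
```

Dividing by the size of the union keeps long titles from ranking highly just because they contain many bigrams, which a raw count of shared bigrams would allow.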

**3. Movie Recommendation Application**

In this part, you create an interactive movie recommendation application by combining the movie recommendation engine and the movie search engine. To accomplish this, you will create a simple command line application. Upon starting the application, you should ask the user to search for a movie to rate. The application should then return a numbered list of matching movies along with an “I don’t see what I’m looking for” option.

<img src="files/Images/4.jpg">

If the user selects a movie, they need to enter a rating from zero to five and that movie will be added to their list of movies.

<img src="files/Images/5.jpg">

After the user has rated at least five movies, give them the option to rate more movies or to get recommendations. Your application should return a ranked list of five movie recommendations.

<img src="files/Images/6.jpg">
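The interaction described above could be organized as a single loop, sketched below. Everything here is an assumption for illustration: `run_app` is a hypothetical name, and `search` and `recommend` stand in for whatever search and recommendation functions you build. The prompts are passed in as `ask` and `show` so the loop can be exercised without a terminal, and numeric input is not validated.

```python
def run_app(search, recommend, ask=input, show=print, min_ratings=5):
    """Collect ratings interactively, then show up to five recommendations.

    search(query) -> list of titles; recommend(rated) -> list of titles.
    Both are hypothetical callables supplied by the caller.
    """
    rated = {}
    while True:
        # Once enough movies are rated, let the user stop and get results
        if len(rated) >= min_ratings:
            if ask('Rate another movie? (y/n): ').strip().lower() != 'y':
                break

        matches = search(ask('Search for a movie to rate: '))
        for number, title in enumerate(matches, start=1):
            show(f'{number}. {title}')
        show("0. I don't see what I'm looking for")

        choice = int(ask('Select a number: '))
        if not 1 <= choice <= len(matches):
            continue  # nothing selected, search again

        rating = float(ask('Rating (0-5): '))
        if 0 <= rating <= 5:
            rated[matches[choice - 1]] = rating

    for number, title in enumerate(recommend(rated)[:5], start=1):
        show(f'{number}. {title}')
    return rated
```

Calling `run_app(my_search, my_recommend)` with your own two functions would start the prompt loop in a terminal.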

### 1. Recommendation Engine

In [257]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse.linalg import svds
from collections import Counter
from nltk.util import ngrams
from nltk.corpus import stopwords
import string
import re
import nltk

# Load the English stop word list (requires the NLTK 'stopwords' corpus to be downloaded)

stop_words = stopwords.words('english')

# Setting this option off since I need to update a dataframe in a loop

pd.options.mode.chained_assignment = None

In [2]:
def read_csv_file(file):
    """
    Read data from csv file and return a 
    pandas data frame
    Args: file name
    Returns: pandas data frame
    """
    df = pd.read_csv(file)

    return df

In [3]:
# Read the movies and ratings data

movies_df = read_csv_file('data/movielens/movies.csv')
ratings_df = read_csv_file('data/movielens/ratings.csv')

print('Movies:', movies_df.head())
print('Ratings:', ratings_df.head())

Movies:    movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
Ratings:    userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205


In [4]:
# Pivot the ratings into a user x movie matrix (one row per user, one column per movie); unrated entries become 0

R_df = ratings_df.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
R_df.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [256]:
# De-mean the data by subtracting each user's mean rating (unrated movies count as 0 in the mean)

R = R_df.values
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

# Truncated SVD: keep k = 50 latent factors to approximate the de-meaned ratings matrix

U, sigma, Vt = svds(R_demeaned, k = 50)

# Diagonal matrix form

sigma = np.diag(sigma)

# Predictions from the Decomposed Matrices

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)

In [9]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    """
    Recommend movies for an existing user
    Args: prediction dataframe, user id, all movies data frame, 
          original ratings dataframe, number of recommendations (default = 5)
    Returns: tuple of (movies already rated by the user, recommended movies),
             both as dataframes; also prints how many movies the user has rated
    """
    
    # Get and sort the user's predictions
    
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    
    user_data = original_ratings_df[original_ratings_df['userId'] == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    
    recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

# Execute function

already_rated, predictions = recommend_movies(preds_df, 100, movies_df, ratings_df, 10)
predictions

User 100 has already rated 25 movies.
Recommending the highest 10 predicted ratings movies not already rated.


Unnamed: 0,movieId,title,genres
428,494,Executive Decision (1996),Action|Adventure|Thriller
652,832,Ransom (1996),Crime|Thriller
219,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
28,36,Dead Man Walking (1995),Crime|Drama
638,805,"Time to Kill, A (1996)",Drama|Thriller
630,788,"Nutty Professor, The (1996)",Comedy|Fantasy|Romance|Sci-Fi
327,376,"River Wild, The (1994)",Action|Thriller
549,653,Dragonheart (1996),Action|Adventure|Fantasy
670,852,Tin Cup (1996),Comedy|Drama|Romance
12,17,Sense and Sensibility (1995),Drama|Romance


### 2. Movie Search Engine

In [255]:
def process_text(text):
    """
    Normalize a movie title for matching: strip whitespace, lower-case,
    drop a trailing release year and remove punctuation
    Args: text string
    Returns: lower case text string with no spaces or punctuation
             (empty if the whole title is a single stop word)
    """
    
    # Remove all whitespace and convert to lower case
    
    textcln = re.sub(r'\s+', '', text).lower()
    
    # If the movie name ends with a ')', drop the last six characters,
    # which is the release year in a title such as 'Toy Story (1995)'
    
    if text.strip().endswith(')'):
        textcln = textcln[:-6]
    
    # Remove punctuation
    
    textcln = textcln.translate(str.maketrans('', '', string.punctuation))
    
    # Whitespace is already gone, so the title is a single token here;
    # it is dropped only if that token is itself a stop word
    
    textcln = ' '.join([word for word in textcln.split() if word not in stop_words])
    
    return textcln


def get_bigrams(text, isClean = 'Y'):
    """
    Split a string into character bigrams
    Args: text string, cleaning flag Y/N (Y runs process_text first)
    Returns: list of bigrams, each as two space-separated characters
             (the final item is the last character on its own)
    """
    
    if isClean == 'Y':
        text = process_text(text)
        
    # Generate bigrams out of the new spaced string
    
    spaced = ''
    for ch in text:
        spaced = spaced + ch + ' '

    
    tokenized = spaced.split(" ")
    myList = list(nltk.bigrams(tokenized))
       
    # Join the items in each tuple in myList together and put them in a new list

    Bigrams = []

    for i in myList:
        Bigrams.append((''.join([w + ' ' for w in i])).strip())

    return Bigrams


def search_movie(query, all_movies, N=10):
    """
    Returns matching movies (by name) following a basic algorithm
    Args: string of movie name to search, all movies dataframe, 
          number of similar movies (defaulted to 10)
    Returns: dataframe of movies matching with search, sorted by similarity
    """

    # Clean the query once and get its bigrams
    
    query_clean = process_text(query)
    source_ngram = set(get_bigrams(query))
    
    # Copy the movie database so the caller's dataframe is not modified
    
    match_df = all_movies.copy()
    
    # Get bigrams of each target movie and calculate a similarity score
    # by counting the distinct bigrams shared by source and target

    similarity = []
    for target_name in match_df['title']:
        
        # A title identical to the query after cleaning keeps a score of 0,
        # so an exact match is never returned
        
        if query_clean == process_text(target_name):
            similarity.append(0)
            continue
            
        target_ngram = set(get_bigrams(target_name))
        similarity.append(len(source_ngram & target_ngram))

    match_df['similarity'] = similarity

    # Remove rows that did not match
        
    match_df = match_df[match_df.similarity != 0]
    
    return match_df.sort_values('similarity', ascending=False).head(N)
    
    
search_movie('Bells Are Ringing (1960)',movies_df, 10)

Unnamed: 0,movieId,title,genres,similarity
7702,82934,Most Dangerous Man in America: Daniel Ellsberg...,Documentary,8
2223,2776,"Marcello Mastroianni: I Remember Yes, I Rememb...",Documentary,7
6535,48385,Borat: Cultural Learnings of America for Make ...,Comedy,7
6819,55814,"Diving Bell and the Butterfly, The (Scaphandre...",Drama,7
4543,6256,"House with Laughing Windows, The (Casa dalle f...",Horror|Mystery|Thriller,7
1540,1978,Friday the 13th Part V: A New Beginning (1985),Horror,7
7669,81831,"First Beautiful Thing, The (La prima cosa bell...",Comedy|Drama,7
5871,26842,Tai Chi Master (Twin Warriors) (Tai ji: Zhang ...,Action|Adventure|Comedy|Drama,7
5868,26835,Positively True Adventures of the Alleged Texa...,Comedy|Thriller,7
7294,70533,Evangelion: 1.0 You Are (Not) Alone (Evangerio...,Action|Animation|Sci-Fi,7


### 3. Movie Recommendation Application