# Dennis Ting - Streaming Service Home Screen
# Movie Recommender System - 03_Recommendations

**Purpose:** This notebook (03_Recommendations) brings together the models built in the models notebook (02_Models) and produces the full set of recommendations for a specific user.

Please read this notebook in conjunction with my report, and my other notebooks (see report for full outline).

# Section 0 - Import 

In this section, I will import the relevant Python libraries and packages, along with the relevant data and models (from MovieLens, 01_EDA notebook, 02_Models notebook).

## Import Libraries and Packages

In [1]:
# Import Common Data Science Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import time

In [2]:
# Libraries for Learning
from scipy import sparse
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD, SVDpp

In [3]:
# Other Libraries (scipy.sparse is already imported above)
import joblib

## Import Data / Models

*MovieLens Data*

In [4]:
# Import Movie Lens Data
df_movies = pd.read_csv('data/movies.csv')

*SVD++ Model (Collaborative Filtering)*

In [5]:
# Import Test Data (from 01_EDA)
df_test_ratings = pd.read_csv('DT_files/DT_df_rating_test.csv', index_col = 0)

In [6]:
# Import model (from 02 Models)

# model for `net_user_mean_rating` column
net_user_mean_ratings_svd_pp = joblib.load('DT_files/DT_net_user_mean_ratings_svd_pp.pkl')

In [7]:
# Import Sparse Matrix - Anti Set (from 02 Models)
full_ratings_anti_sparse = sparse.load_npz('DT_files/DT_sparse_ratings_anti.npz')

*Top / Popular Movies*

In [8]:
# Import top 50 movies (from 01 EDA)
df_top_50_popular_movies = pd.read_csv('DT_files/DT_df_top_50_popular_movies.csv', index_col = 0)

*Tags-based Recommendation*

In [9]:
# Import Tag-based Recommendation related files

# Import Tag Similarities - Sparse Matrix (NPZ file)
tag_similarities = sparse.load_npz('DT_files/DT_tag_similarities.npz')

# Import - Tags of Movies DataFrame (from 01 EDA)
df_tags_of_movies = pd.read_csv('DT_files/DT_df_tags_of_movies.csv', index_col = 0)

# Section 1 - Models

In this section, I will retrieve recommendations using the various models:
- Funk SVD++ Model (Collaborative Filtering)
- Top / Popular Movies
- Tag-based Recommendations

This section demonstrates the outputs of the different models, using User ID \# 95 as the example. In Section 2, I will compile the code into one main function (made up of a few helper functions) that can recommend movies for a particular user (random or specified).

## 1.1 - Funk SVD++ Model (Collaborative Filtering)

Based on a specific user's movie rating history, SVD++ recommends movies that the user hasn't rated yet, using the rating behaviour of other users.

In [10]:
# Get array of user IDs
arr_userid = full_ratings_anti_sparse[:,0].toarray().squeeze()
arr_userid # show

array([3.00000e+00, 3.00000e+00, 3.00000e+00, ..., 1.62538e+05,
       1.62538e+05, 1.62538e+05])

In [11]:
# define an input user 
input_user_id = 95

In [12]:
# Get index for input user
input_user_array_index = (arr_userid == input_user_id)
input_user_array_index #show

array([False, False, False, ..., False, False, False])

In [13]:
# Get Input User Test Set
input_user_testset = full_ratings_anti_sparse[input_user_array_index, :].toarray().tolist()

In [14]:
# Predict net user mean ratings
input_user_prediction = net_user_mean_ratings_svd_pp.test(input_user_testset)

In [15]:
# Put predictions into a DataFrame
df_input_user_predictions = pd.DataFrame(input_user_prediction, columns=['userId',
                                                             'movieId',
                                                             'actual_rating',
                                                             'prediction',
                                                             'details'])

In [16]:
# Get top n movies for input_user
top_n = 10

# Sort by Prediction value (Descending, Largest to smallest)
df_top_n_input_user = df_input_user_predictions.sort_values('prediction', ascending = False).copy()

# Output only Movie ID and Prediction (title not available here)
df_top_n_input_user = df_top_n_input_user[['movieId','prediction']]

# Filter for top_n movies
df_top_n_input_user = df_top_n_input_user.head(top_n)

In [17]:
# Show
df_top_n_input_user

Unnamed: 0,movieId,prediction
1196,2064.0,3.032643
510,112552.0,2.93521
648,92259.0,2.902103
3250,182723.0,2.66039
615,86345.0,2.60546
149,5669.0,2.576275
5299,120807.0,2.514898
1865,86347.0,2.440685
1114,8638.0,2.43663
3274,55814.0,2.426929


##### Summary / Findings:
- I have successfully recommended 10 movies for User ID \# 95 based on predicted scores (net ratings, i.e. rating net user's mean rating) from SVD++.
- The movie title information is not available in this output. I will need to merge it with `df_movies` when I build the function so that the title is available.
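
The "net rating" mentioned above is each rating minus that user's mean rating. The actual column was built in the earlier notebooks; the snippet below is only a minimal sketch of the idea on made-up data (the toy ratings and the name `df_toy` are hypothetical, and any rounding applied in 01_EDA is not reproduced here).

```python
import pandas as pd

# Hypothetical toy ratings; the real data comes from MovieLens
df_toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'movieId': [10, 20, 30, 10, 40],
    'rating':  [10, 7, 4, 6, 8],
})

# Mean rating of each user, aligned row by row with the original ratings
user_mean = df_toy.groupby('userId')['rating'].transform('mean')

# Net rating = rating minus that user's mean rating
df_toy['rating_net_user_mean'] = df_toy['rating'] - user_mean

print(df_toy)
```

A positive net rating means the user liked the movie more than their own average, which makes scores comparable between generous and harsh raters.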

## 1.2 - Top / Popular Movies

The goal of recommending top / popular movies is to have a **user independent** recommender system, which helps combat the **cold start** issue that **collaborative filtering** faces.

The dataframe below (`df_top_50_popular_movies`) contains the top 50 movies based on `total_net_rating`. It was created in the EDA notebook (01_EDA); please see that notebook for more details. From the top 50 movies, I will return 10 movies at random to the user. This is done so that the user can see different movies on their screen without me constantly updating the list for them.

In the real world, I would expect this top list to be updated frequently and tailored to the user (e.g. top films this week, top films based on geographic location). Below, I will just be recommending popular movies from the dataset (movies rated from 2015 to 2019). The code and execution below can be applied to the real-world scenarios discussed above.

In [18]:
# Randomly select 10 movies from the top 50 movies
df_top_50_popular_movies.sample(n=10)[['movieId','title']]

Unnamed: 0,movieId,title
520,1196,Star Wars: Episode V - The Empire Strikes Back...
152,296,Pulp Fiction (1994)
2152,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...
1171,2571,"Matrix, The (1999)"
294,608,Fargo (1996)
151,293,Léon: The Professional (a.k.a. The Professiona...
134,260,Star Wars: Episode IV - A New Hope (1977)
5233,134130,The Martian (2015)
4019,79132,Inception (2010)
3547,58559,"Dark Knight, The (2008)"


##### Summary / Findings:
- I was successfully able to generate 10 random top movie suggestions from our dataframe of top movies (`df_top_50_popular_movies`).
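
As a minimal sketch of the sampling step, the helper below draws up to `n` rows at random and never asks for more rows than the list holds. The names `sample_popular` and `df_popular_toy` and the `seed` parameter are hypothetical additions for illustration; the five rows are a stand-in for the real top 50 list.

```python
import pandas as pd

# Hypothetical stand-in for df_top_50_popular_movies
df_popular_toy = pd.DataFrame({
    'movieId': [260, 296, 2571, 79132, 58559],
    'title': ['Star Wars: Episode IV - A New Hope (1977)',
              'Pulp Fiction (1994)',
              'Matrix, The (1999)',
              'Inception (2010)',
              'Dark Knight, The (2008)'],
})

def sample_popular(df, n=10, seed=None):
    """Return up to n rows chosen at random, without replacement."""
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed)[['movieId', 'title']]

picks = sample_popular(df_popular_toy, n=3, seed=0)   # reproducible draw
everything = sample_popular(df_popular_toy, n=10)     # capped at 5 rows

print(picks)
```

Passing a fixed `seed` is only useful for demos and tests; leaving it as `None` gives a different selection on each call, which is the behaviour wanted on a home screen.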

## 1.3 - Tags-based Recommendation

Starting from a movie that the user has recently enjoyed (judged by net rating, i.e. rating net user's mean rating), I will use user-generated tags to find similar movies to recommend.

### Top 10 Recently watched films

In [19]:
input_user_id = 95

df_recent_10_films = df_test_ratings[df_test_ratings['userId'] == input_user_id].copy()

In [20]:
df_recent_10_films.sort_values('rating_net_user_mean', ascending=False)

Unnamed: 0,userId,movieId,rating,timestamp,rating_net_user_mean
3058,95,1221,10,1512243413,3
3148,95,58559,10,1511480584,3
3072,95,2023,9,1512243578,2
3127,95,26776,9,1514335977,2
3129,95,31878,9,1514335981,2
3163,95,73881,9,1514335989,2
3155,95,68237,8,1512247657,1
3049,95,1189,7,1511480584,0
3161,95,72998,6,1511480584,-1
3179,95,95510,4,1511480584,-3


In [21]:
# A film that was recently liked by user 95 (rating_net_user_mean = 3)
df_movies[df_movies['movieId'] == 58559]

Unnamed: 0,movieId,title,genres
12221,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX


##### Comment:
- It appears that User ID \# 95 has recently watched The Dark Knight (movie ID \# 58559) and gave it a rating of 10, which is a net rating of 3.
- I will look for similar movies below using user-generated tags.

### Recommendation based on a recently watched film

In [22]:
# Get an array of unique movie IDs in `df_tags_of_movies`
unique_tags_movies = df_tags_of_movies['movieId'].unique()

In [23]:
# Define a movie index
movie_index = 58559 # The Dark Knight

In [24]:
# Check to see if movie is in `df_tags_of_movies`
movie_index in unique_tags_movies

True

In [25]:
# Create a dataframe with the movie titles
df_movie_similarity = pd.DataFrame({'movieId':df_tags_of_movies['movieId'],
                                    'title':df_tags_of_movies['title'], 
                                    'genres':df_tags_of_movies['genres'], 
                                    'similarity': np.array(tag_similarities[(df_tags_of_movies['movieId'] == movie_index), :].todense()).squeeze()})

In [26]:
# Visualize first 5 rows
df_movie_similarity.sort_values('similarity', ascending=False).head()

Unnamed: 0,movieId,title,genres,similarity
9760,58559,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX,1.0
8006,33794,Batman Begins (2005),Action|Crime|IMAX,0.669983
12866,91529,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX,0.587846
8390,42015,Casanova (2005),Action|Adventure|Comedy|Drama|Romance,0.510389
4461,5611,"Four Feathers, The (2002)",Adventure|War,0.499386


In [27]:
# Set up parameters for Tags-based recommender system

# Set Cosine Similarity Threshold
sim_threshold = 0.25

# Set Max number of results
top_n = 10

In [28]:
# Recommend movies based on tags (store results in df_tag_based_reco)

# Only return movies above threshold
df_tag_based_reco = df_movie_similarity[df_movie_similarity['similarity'] > sim_threshold].copy()

# Sort by Similarity
df_tag_based_reco = df_tag_based_reco.sort_values('similarity', ascending=False)

# Only keep columns - Movie ID, Title of Movie, Cosine Similarity
df_tag_based_reco = df_tag_based_reco[['movieId','title','similarity']]

# Only keep 2nd row onwards (first row is the same title)
df_tag_based_reco = df_tag_based_reco[1:]

# Keep Max number of results (or less)
df_tag_based_reco = df_tag_based_reco.head(top_n)

df_tag_based_reco # show

Unnamed: 0,movieId,title,similarity
8006,33794,Batman Begins (2005),0.669983
12866,91529,"Dark Knight Rises, The (2012)",0.587846
8390,42015,Casanova (2005),0.510389
4461,5611,"Four Feathers, The (2002)",0.499386
3402,4299,"Knight's Tale, A (2001)",0.498619
7330,27073,Two Hands (1999),0.48959
5840,7372,Ned Kelly (2003),0.439258
2547,3213,Batman: Mask of the Phantasm (1993),0.40579
23831,190017,The Death of Superman,0.371514
20359,161354,Batman: The Killing Joke (2016),0.35835


##### Summary / Findings:
- I have successfully recommended 10 movies for User ID \# 95 based on user-generated tags using cosine similarity.
    - Cosine similarity ranges from 0 to 1 in this table. The closer the cosine similarity is to 1, the more similar the vectors are. For more information, please refer back to the 02_Models notebook, where the cosine similarity matrix was created.
- I have excluded the top cosine similarity for each recommendation, as that would be with the movie itself (i.e. cosine similarity of 1). It would not make sense to recommend a movie based on the exact same movie.
- I have set a cosine similarity threshold of 0.25, meaning that movies with a cosine similarity of 0.25 or lower will not be presented to the user. If fewer than 10 movies pass the threshold, only those available will be presented.
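
To make the cosine similarity and threshold steps concrete, here is a minimal sketch on a tiny made-up tag matrix. The matrix values and the helper name `cosine_to_row` are hypothetical; the real similarity matrix was computed in 02_Models and may have been built differently.

```python
import numpy as np

# Hypothetical tag-count vectors (rows = movies, columns = tags)
tag_matrix = np.array([
    [2.0, 1.0, 0.0],   # movie 0: the seed movie
    [4.0, 2.0, 0.0],   # movie 1: same tag mix as the seed
    [1.0, 0.0, 1.0],   # movie 2: partial overlap
    [0.0, 0.0, 3.0],   # movie 3: no tags in common with the seed
])

def cosine_to_row(matrix, row):
    """Cosine similarity between one row and every row of the matrix."""
    norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ matrix[row]) / (norms * norms[row])

sims = cosine_to_row(tag_matrix, 0)

# Drop the seed itself, keep scores above the threshold, best first
sim_threshold = 0.25
candidates = [int(i) for i in np.argsort(-sims)
              if i != 0 and sims[i] > sim_threshold]

print(sims)
print(candidates)
```

Note that movie 1 ties with the seed at a similarity of 1, so this sketch excludes the seed by its index rather than by dropping the first sorted row.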


# Section 2 - Define Functions

In this section, I will define functions that will in aggregate produce recommendations for a specific user when called upon.

### Functions for each Model (Sections 1.1, 1.2, 1.3 above)

Build a function for SVD++ Recommendation using the code from Section 1.1 above, with titles added from `df_movies`.

In [29]:
def DT_SVD_pp_recommendation(SVD_user_id, SVD_top_n=10):
    # Get array of user IDs
    arr_userid = full_ratings_anti_sparse[:,0].toarray().squeeze()

    # Get index for input user
    input_user_array_index = (arr_userid == SVD_user_id)

    # Get Input User Test Set
    input_user_testset = full_ratings_anti_sparse[input_user_array_index, :].toarray().tolist()

    # Predict net user mean ratings
    input_user_prediction = net_user_mean_ratings_svd_pp.test(input_user_testset)

    # Put predictions into a DataFrame
    df_input_user_predictions = pd.DataFrame(input_user_prediction, columns=['userId',
                                                                 'movieId',
                                                                 'actual_rating',
                                                                 'prediction',
                                                                 'details'])
    
    # Sort by Prediction value (Descending, Largest to smallest)
    df_top_n_input_user = df_input_user_predictions.sort_values('prediction', ascending = False).copy()

    # Output only Movie ID and Prediction (title not available here)
    df_top_n_input_user = df_top_n_input_user[['movieId','prediction']]
    
    # Change the movie ID to int (default is float)
    df_top_n_input_user['movieId'] = df_top_n_input_user['movieId'].astype('int')

    # Filter for top_n movies
    df_top_n_input_user = df_top_n_input_user.head(SVD_top_n)
    
    
    # Merge with df_movies to get title
    
    # Add movie titles 
    df_top_n_input_user = pd.merge(left=df_top_n_input_user,             # left table
                                   right=df_movies[['movieId','title']], # right table
                                   how='left',                           # left join (i.e. return all records on left table)
                                   left_on='movieId',                    # column to join on left table
                                   right_on='movieId')                   # column to join on right table
    
    # Rearrange columns to Movie ID, title, prediction
    df_top_n_input_user = df_top_n_input_user[['movieId', 'title', 'prediction']]
    
    return df_top_n_input_user

Build a function for Top / Popular Movies Recommendation using the code from Section 1.2 above. 

In [30]:
def DT_popular_n_recommendation(popular_n=10):
    
    # Length of Popular movie dataframe 
    length_popular_df = len(df_top_50_popular_movies)
        # I know this is 50, this is just dynamic code in case future dataframes don't have the same length

    # If more rows are asked for than are available, return all movies in the popular dataframe
    n_to_return = min(popular_n, length_popular_df)

    # Return the requested number of random movies
    return df_top_50_popular_movies.sample(n=n_to_return)[['movieId','title']]
    

Build a function for Tags-based Recommendation using the code from Section 1.3 above.

In [31]:
def DT_tag_based_recommendation(tag_user_id, tag_top_n = 10, net_rating_threshold = 0, sim_threshold = 0.25):
    
    # Find 10 most-recently rated films
    df_recent_10_films = df_test_ratings[df_test_ratings['userId'] == tag_user_id].copy()

    # Keep only certain columns - Movie ID & Net Rating 
    df_recent_10_films = df_recent_10_films[['movieId', 'rating_net_user_mean']]
    
    # Filter for films that are equal to or greater than `net_rating_threshold`
    df_recent_10_films = df_recent_10_films[df_recent_10_films['rating_net_user_mean']>= net_rating_threshold]
    
    # Sort Net Rating by descending order
    df_recent_10_films = df_recent_10_films.sort_values('rating_net_user_mean', ascending=False)
    
    # Get Series of recent films above threshold
    series_recent_movies = df_recent_10_films['movieId']
    
    # Create empty DataFrame (returned as-is when no recommendation can be made)
    df_tag_based_reco = pd.DataFrame()

    
    # If no movies left after filter above, return empty dataframe, and nothing
    if len(series_recent_movies) == 0:
        return df_tag_based_reco, None 
        # df_tag_based_reco would be empty
    
    else:

        # Get unique series of movies
        unique_tags_movies = df_tags_of_movies['movieId'].unique()
        
        # Loop through the series_recent_movies
        for movie_id in series_recent_movies:
            
            # Check if movie is in our tags-recommendation matrix, else, skip
            if movie_id in unique_tags_movies:
                
                # Create a dataframe with the movie titles
                df_movie_similarity = pd.DataFrame({'movieId':df_tags_of_movies['movieId'],
                                                    'title':df_tags_of_movies['title'], 
                                                    'genres':df_tags_of_movies['genres'], 
                                                    'similarity': np.array(tag_similarities[(df_tags_of_movies['movieId'] == movie_id), :].todense()).squeeze()})

                # Recommend movies based on tags (store results in df_tag_based_reco)

                # Only return movies above threshold
                df_tag_based_reco = df_movie_similarity[df_movie_similarity['similarity'] > sim_threshold].copy()

                # Sort by Similarity
                df_tag_based_reco = df_tag_based_reco.sort_values('similarity', ascending=False)

                # Only keep columns - Movie ID, Title of Movie, Cosine Similarity
                df_tag_based_reco = df_tag_based_reco[['movieId','title','similarity']]

                # Only keep 2nd row onwards (first row is the same title)
                df_tag_based_reco = df_tag_based_reco[1:]

                # Keep Max number of results (or less)
                df_tag_based_reco = df_tag_based_reco.head(tag_top_n)                
                
                # output - recommendation df, and movie ID that recommendations were based on
                return df_tag_based_reco, movie_id 
                                
            # If movie not in tags-recommendation matrix - do nothing
            else:
                pass
        
        # If at the end of the series_recent_movies loop, return empty dataframe, and nothing
        return df_tag_based_reco, None 
        # df_tag_based_reco would be empty
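
The way the function above chooses which film to base the recommendations on can be sketched in isolation: keep the user's ratings at or above the net rating threshold, sort them from best to worst, and take the first movie that has tag data. The helper name `pick_seed_movie` and the toy data are hypothetical and only meant to illustrate that selection step.

```python
import pandas as pd

def pick_seed_movie(df_user_ratings, movies_with_tags, net_rating_threshold=0):
    """Return the best-liked movieId that has tag data, or None."""
    liked = df_user_ratings[
        df_user_ratings['rating_net_user_mean'] >= net_rating_threshold
    ]
    liked = liked.sort_values('rating_net_user_mean', ascending=False)

    for movie_id in liked['movieId']:
        if movie_id in movies_with_tags:
            return int(movie_id)

    return None  # nothing liked enough, or no liked movie has tags

# Hypothetical recent ratings for one user
df_user_toy = pd.DataFrame({
    'movieId': [1221, 58559, 2023, 95510],
    'rating_net_user_mean': [3, 3, 2, -3],
})

seed = pick_seed_movie(df_user_toy, movies_with_tags={58559, 2023})
print(seed)
```

Returning `None` when no seed is found mirrors the second return value of the function above, so the caller can skip the tags-based section entirely.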
    
    

### Putting the various model functions together

Below I have grouped the various model functions together into one function.

In [32]:
def DT_movie_recommendations(input_user_id, top_n=10, nr_t=0, cs_t=0.25):
    
    print ("--------------------------------------------------------------------------------------------------------------")
    print('') # intentionally left blank for spacing

    # Title
    print('\033[1m' + 'Recently rated movies' + '\033[0m')
    
    df_recent_10_movies = df_test_ratings[df_test_ratings['userId'] == input_user_id].copy()

    # Sort by the raw timestamp, most recent first (sorting the formatted M/D/Y strings would misorder dates across months and years)
    df_recent_10_movies = df_recent_10_movies.sort_values('timestamp', ascending=False)

    df_recent_10_movies['Review Date - M/D/Y'] = pd.to_datetime(df_recent_10_movies['timestamp'], unit='s')
    df_recent_10_movies['Review Date - M/D/Y'] = df_recent_10_movies['Review Date - M/D/Y'].dt.strftime("%m/%d/%y")

    # Add movie titles from df_movies (a left merge keeps the row order of the left table)
    df_recent_10_movies = pd.merge(left=df_recent_10_movies,             # left table
                                   right=df_movies[['movieId','title']], # right table
                                   how='left',                           # left join (i.e. return all records on left table)
                                   left_on='movieId',                    # column to join on left table
                                   right_on='movieId')                   # column to join on right table

    df_recent_10_movies = df_recent_10_movies[['Review Date - M/D/Y', 'movieId', 'title','rating', 'rating_net_user_mean']]

    display(df_recent_10_movies.style.hide(axis='index')) # hide index on display

    
    
    print ("--------------------------------------------------------------------------------------------------------------")
    print('') # intentionally left blank for spacing
    
    # Title
    print('\033[1m' + 'Recommendations based on similar users' + '\033[0m')

    # Display Collaborative Filtering - SVD++ Movie recommendations
    display(DT_SVD_pp_recommendation(input_user_id, top_n).style.hide(axis='index')) # hide index on display
    
    
    print ("--------------------------------------------------------------------------------------------------------------")
    print('') # intentionally left blank for spacing
    
    # Title
    print('\033[1m' + 'Popular movies' + '\033[0m')
    
    # Display Top / Popular Movie recommendations
    display(DT_popular_n_recommendation(top_n).style.hide(axis='index')) # hide index on display
    
    
    print ("--------------------------------------------------------------------------------------------------------------")
    print('') # intentionally left blank for spacing
    
    # Get tags-based recommendations and recently watched movie it was based on
    df_reco_tags, recent_movie_id = DT_tag_based_recommendation(input_user_id, top_n, nr_t, cs_t)
    
    if df_reco_tags.empty:
        pass # Do not display anything
    
    else:
        # Get recent movie name
        recent_movie_name = df_movies[df_movies['movieId']== recent_movie_id]['title'].iloc[0]
        
        # Title
        print('\033[1m' + f'Recommendations based on a film you recently reviewed: {recent_movie_name}' + '\033[0m')
        
        # Display recommendations
        display(df_reco_tags.style.hide(axis='index')) # hide index on display
        
        print ("--------------------------------------------------------------------------------------------------------------")
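
One detail worth illustrating for the "Recently rated movies" table is the ordering of review dates. A date formatted as M/D/Y is a string, so sorting on it compares the month first; sorting on the raw Unix timestamp keeps the rows chronological. The sketch below uses three made-up timestamps (the name `df_dates_toy` is hypothetical).

```python
import pandas as pd

# Hypothetical review timestamps (Unix seconds, UTC):
# 2020-01-01, 2019-12-01 and 2019-01-01
df_dates_toy = pd.DataFrame({'timestamp': [1577836800, 1575158400, 1546300800]})

# Same conversion and format as used for the 'Review Date - M/D/Y' column
df_dates_toy['review_date'] = (
    pd.to_datetime(df_dates_toy['timestamp'], unit='s').dt.strftime('%m/%d/%y')
)

# Chronological order: sort on the numeric timestamp
by_timestamp = (df_dates_toy.sort_values('timestamp', ascending=False)
                ['review_date'].tolist())

# Alphabetical order: sort on the formatted string
by_string = (df_dates_toy.sort_values('review_date', ascending=False)
             ['review_date'].tolist())

print(by_timestamp)  # ['01/01/20', '12/01/19', '01/01/19']
print(by_string)     # ['12/01/19', '01/01/20', '01/01/19']
```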
    

Below, I have created a function that can provide recommendations if a User ID is provided. For the sake of demonstrating this to others, I have also added an ability for this function to randomly select a user and show the recommendations for that selected user.

In [33]:
def DT_user_selection_movie_recommendation():
    unique_users = df_test_ratings['userId'].unique().astype('str')

    # Get input
    user_id = input('Please input valid User ID or the word "random" (no quotes, lowercase) for a random user:')
    
    print ("--------------------------------------------------------------------------------------------------------------")
    print('') # intentionally left blank for spacing
    
    if user_id == "random":

        # Select a random ID & print message
        random_id = np.random.choice(unique_users)
        print(f'Randomly selected User ID: {random_id}.')
        print('') # intentionally left blank for spacing
        
        # Convert random_id back to an integer for recommendation input below
        random_id = int(random_id)

        # Run the Recommendation
        DT_movie_recommendations(random_id)

    elif user_id in unique_users:
        # Print message
        print(f'User ID: {user_id} is valid.')
        print('') # intentionally left blank for spacing

        # Convert user_id back to an integer for recommendation input below
        user_id = int(user_id)


        # Run the Recommendation
        DT_movie_recommendations(user_id)


    else:
        print ('Invalid entry - please enter valid User ID or "random" (no quotes, lowercase) for a random user')
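
Because the function above reads from `input()`, it is awkward to test automatically. One possible approach, sketched here under my own assumptions, is to separate the decision logic into a small pure function. The name `resolve_user_id`, the `rng` parameter, and the whitespace stripping are all hypothetical additions that are not part of the function above.

```python
import random

def resolve_user_id(entry, valid_ids, rng=random):
    """Map raw text to a user ID.

    'random' picks any known ID; otherwise the text must be a known ID.
    Returns None for an invalid entry so the caller can report it.
    """
    entry = entry.strip()  # assumption: tolerate stray spaces around the entry

    if entry == 'random':
        return rng.choice(sorted(valid_ids))

    if entry.isdigit() and int(entry) in valid_ids:
        return int(entry)

    return None

# Hypothetical set of known user IDs
valid = {3, 95, 127484}

print(resolve_user_id('95', valid))      # 95
print(resolve_user_id('abc', valid))     # None
print(resolve_user_id('random', valid))  # one of the IDs in `valid`
```

The interactive wrapper would then only need to call `input()`, pass the text to this helper, and print the appropriate message.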

# Section 3 - Recommend Movies

In this section, I will recommend movies using the functions defined above in Section 2.

In [34]:
# Run Recommendations
DT_user_selection_movie_recommendation()
    # input random for a random user or input a valid user ID

Please input valid User ID or the word "random" (no quotes, lowercase) for a random user:random
--------------------------------------------------------------------------------------------------------------

Randomly selected User ID: 127484.

--------------------------------------------------------------------------------------------------------------

**Recently rated movies**


Review Date - M/D/Y,movieId,title,rating,rating_net_user_mean
09/04/19,48516,"Departed, The (2006)",6,-1
07/21/19,81156,Jackass 3D (2010),4,-3
07/14/19,32460,Knockin' on Heaven's Door (1997),4,-3
07/13/19,3252,Scent of a Woman (1992),3,-4
07/12/19,202439,Parasite (2019),6,-1
07/08/19,2324,Life Is Beautiful (La Vita è bella) (1997),5,-2
07/06/19,139575,Baahubali: The Beginning (2015),8,1
07/01/19,3271,Of Mice and Men (1992),5,-2
06/30/19,94959,Moonrise Kingdom (2012),7,0
06/23/19,73881,3 Idiots (2009),6,-1


--------------------------------------------------------------------------------------------------------------

**Recommendations based on similar users**


movieId,title,prediction
26578,"Sacrifice, The (Offret - Sacraficatio) (1986)",2.189614
7089,Amarcord (1973),2.027769
4438,"Fist of Fury (Chinese Connection, The) (Jing wu men) (1972)",2.015345
5899,Zulu (1964),1.946836
3639,"Man with the Golden Gun, The (1974)",1.910484
2993,Thunderball (1965),1.901428
4680,Vampire's Kiss (1989),1.876063
133001,The Spy Who Loved Flowers (1966),1.875987
2947,Goldfinger (1964),1.870695
1587,Conan the Barbarian (1982),1.868464


--------------------------------------------------------------------------------------------------------------

**Popular movies**


movieId,title
260,Star Wars: Episode IV - A New Hope (1977)
1704,Good Will Hunting (1997)
1258,"Shining, The (1980)"
2571,"Matrix, The (1999)"
50,"Usual Suspects, The (1995)"
1221,"Godfather: Part II, The (1974)"
92259,Intouchables (2011)
47,Seven (a.k.a. Se7en) (1995)
134130,The Martian (2015)
4995,"Beautiful Mind, A (2001)"


--------------------------------------------------------------------------------------------------------------

**Recommendations based on a film you recently reviewed: Baahubali: The Beginning (2015)**


movieId,title,similarity
171611,Baahubali 2: The Conclusion (2017),0.552022
110295,"Legend of Hercules, The (2014)",0.521345
188091,Аргонавты (1971),0.479974
152063,Gods of Egypt (2016),0.44295
123264,Hercules and the Circle of Fire (1994),0.4406
188073,Лабиринт (1971),0.423635
93766,Wrath of the Titans (2012),0.387183
5540,Clash of the Titans (1981),0.38438
123565,Hercules in the Underworld (1994),0.367562
104074,Percy Jackson: Sea of Monsters (2013),0.361039


--------------------------------------------------------------------------------------------------------------
