### Content Based Recommendations

In the previous notebook, you were introduced to a way of making recommendations using collaborative filtering.  However, that technique left a large number of users without any recommendations at all, and left other users with fewer than the ten recommendations our function was set up to retrieve.

In order to help these users out, let's try another technique: **content based** recommendations. Let's start off where we were in the previous notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import HTML
#import progressbar
import tests as t
import pickle


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']


all_recs = pickle.load(open("all_recs.p", "rb"))

In [2]:
movies.head(5)

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [3]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
0,1,68646,10,1381620027,2013-10-12 23:20:27,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
1,1,113277,10,1379466669,2013-09-18 01:11:09,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,2,422720,8,1412178746,2014-10-01 15:52:26,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3,2,454876,8,1394818630,2014-03-14 17:37:10,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,2,790636,7,1389963947,2014-01-17 13:05:47,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [4]:
display(reviews.info())
display(movies.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712337 entries, 0 to 712336
Data columns (total 23 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    712337 non-null  int64 
 1   movie_id   712337 non-null  int64 
 2   rating     712337 non-null  int64 
 3   timestamp  712337 non-null  int64 
 4   date       712337 non-null  object
 5   month_1    712337 non-null  int64 
 6   month_2    712337 non-null  int64 
 7   month_3    712337 non-null  int64 
 8   month_4    712337 non-null  int64 
 9   month_5    712337 non-null  int64 
 10  month_6    712337 non-null  int64 
 11  month_7    712337 non-null  int64 
 12  month_8    712337 non-null  int64 
 13  month_9    712337 non-null  int64 
 14  month_10   712337 non-null  int64 
 15  month_11   712337 non-null  int64 
 16  month_12   712337 non-null  int64 
 17  year_2013  712337 non-null  int64 
 18  year_2014  712337 non-null  int64 
 19  year_2015  712337 non-null  int64 
 20  year

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31245 entries, 0 to 31244
Data columns (total 35 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   movie_id     31245 non-null  int64 
 1   movie        31245 non-null  object
 2   genre        31022 non-null  object
 3   date         31245 non-null  int64 
 4   1800's       31245 non-null  int64 
 5   1900's       31245 non-null  int64 
 6   2000's       31245 non-null  int64 
 7   History      31245 non-null  int64 
 8   News         31245 non-null  int64 
 9   Horror       31245 non-null  int64 
 10  Musical      31245 non-null  int64 
 11  Film-Noir    31245 non-null  int64 
 12  Mystery      31245 non-null  int64 
 13  Adventure    31245 non-null  int64 
 14  Sport        31245 non-null  int64 
 15  War          31245 non-null  int64 
 16  Music        31245 non-null  int64 
 17  Reality-TV   31245 non-null  int64 
 18  Adult        31245 non-null  int64 
 19  Crime        31245 non-nu

None

In [5]:
# Get the dimensions of the dictionary: number of users and length of the longest recommendation list
def dict_dims(mydict):
    d1 = len(mydict)
    d2 = 0
    for d in mydict.keys():
        d2 = max(d2, len(mydict[d]))
    return d1, d2

dict_dims(all_recs)

(23512, 1049)

In [6]:
len(all_recs)

23512

### Datasets

From the above, you now have access to three important items that you will be using throughout the rest of this notebook.  

`a.` **movies** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **reviews** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_recs** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

For the individuals in **all_recs** who did receive 10 recommendations using collaborative filtering, we don't really need to worry about them.  However, there were a number of individuals in our dataset who did not receive any recommendations.

-----

`1.` Let's start by finding all of the users in our dataset who didn't get all 10 recommendations we would have liked them to have using collaborative filtering.  

In [105]:
def get_users(mydict, num_recs=10):
    """
    INPUT :
    mydict : a dictionary whose keys are user_ids and whose values are lists of recommended movies.
    num_recs : the number of recommendations each user should have.
    
    OUTPUT:
    all_recs : list of user_ids with at least num_recs recommendations
    need_recs : list of user_ids with fewer than num_recs recommendations
    """
    all_recs = [user for user, user_recs in mydict.items() if len(user_recs) >= num_recs]
    need_recs = [user for user, user_recs in mydict.items() if len(user_recs) < num_recs]
    return all_recs, need_recs

with_all_recs_ids, need_recs_ids = get_users(all_recs,10)

In [106]:
print(len(with_all_recs_ids), len(need_recs_ids))

22187 1325


In [107]:
users_with_all_recs = with_all_recs_ids # Store user ids who have all their recommendations in this (10 or more)
users_who_need_recs = need_recs_ids # Store users who still need recommendations here


In [108]:
# A quick test
assert len(users_with_all_recs) == 22187
print("That's right there were still another {} users who needed recommendations when we only used collaborative filtering!".format(len(users_who_need_recs)))

That's right there were still another 1325 users who needed recommendations when we only used collaborative filtering!


### Content Based Recommendations

You will be doing a bit of a mix of content and collaborative filtering to make recommendations for the users this time.  This will allow you to obtain recommendations in many cases where we didn't make recommendations earlier.     

`2.` Before finding recommendations, rank the user's ratings from highest to lowest. You will move through the movies in this order looking for other similar movies.
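
As a small illustration of this ordering, the sketch below sorts a made-up reviews table by user and then by rating within each user. The `toy_reviews` dataframe and its values are assumptions for illustration only, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the reviews dataframe (values are made up for illustration)
toy_reviews = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2],
    "movie_id": [10, 11, 10, 12, 13],
    "rating":   [7, 9, 8, 10, 6],
})

# Sort by user first, then by rating from highest to lowest within each user
toy_ranked = toy_reviews.sort_values(by=["user_id", "rating"], ascending=[True, False])

# Movies for each user in the order they would be visited
user1_order = toy_ranked.loc[toy_ranked.user_id == 1, "movie_id"].tolist()
user2_order = toy_ranked.loc[toy_ranked.user_id == 2, "movie_id"].tolist()
print(user1_order, user2_order)
```

Sorting on `rating` alone, as done in the next cell, gives the same per-user order once the rows are filtered to a single user; sorting on both columns simply keeps each user's rows together.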

In [27]:
# create a dataframe like reviews, sorted by rating from highest to lowest (so each user's rows also appear best-rated first)
reviews_ranked = reviews.sort_values(by=['rating'], ascending=False)
reviews_ranked.head(2)

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
0,1,68646,10,1381620027,2013-10-12 23:20:27,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
499038,37313,387914,10,1497817674,2017-06-18 20:27:54,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [119]:
def get_df(df, ids):
    """
    Get the rows of a reviews dataframe that belong to the given users.
    
    INPUT:
    df : reviews dataframe with a user_id column
    ids : user ids to keep
    
    OUTPUT:
    df_ids : dataframe of reviews for those user ids.
    """
    df_ids = df[df.user_id.isin(ids)]
    return df_ids

users_who_need_recs_df = get_df(reviews_ranked, users_who_need_recs)
users_who_need_recs_df.head(3)

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
498192,37217,1375666,10,1374169145,2013-07-18 17:39:05,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
498193,37217,1499658,10,1376158697,2013-08-10 18:18:17,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
498194,37217,1640459,10,1364251883,2013-03-25 22:51:23,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


### Similarities

In our first lecture at ITI, we talked about similarity metrics and the differences between the dot product, cosine similarity, and Euclidean distance. Here we will apply what we learned.
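
As a quick reminder, the sketch below computes the three measures for two made-up indicator vectors. The vectors `a` and `b` are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Two hypothetical indicator vectors (e.g. century and genre flags for two movies)
a = np.array([1, 0, 1, 1])
b = np.array([1, 0, 1, 0])

# Dot product: counts the features the two vectors share
dot = int(np.dot(a, b))

# Cosine similarity: dot product scaled by the lengths of both vectors
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: smaller means more alike (a distance, not a similarity)
euclid = float(np.linalg.norm(a - b))

print(dot, round(cosine, 4), euclid)
```

For 0/1 indicator vectors the dot product is simply the number of shared features, which is why it works as a cheap similarity score in this notebook.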


In many cases, it turns out that one of the fastest ways we can find out how similar items are to one another (when our matrix isn't totally sparse like it was in the earlier section) is by simply using matrix multiplication.

For us to pull out a matrix that describes the movies in our dataframe in terms of content, we might just use the indicator variables related to **year** and **genre** for our movies.  

Then we can obtain a matrix of how similar movies are to one another by taking the dot product of this matrix with itself.  

We can perform the dot product on a matrix of movies with content characteristics to provide a movie by movie matrix where each cell is an indication of how similar two movies are to one another.  
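
To make this concrete, here is a minimal sketch on a made-up content matrix. The four movies and the four indicator columns are hypothetical; the real matrix uses the century and genre columns of `movies`.

```python
import numpy as np

# Rows are movies; columns are hypothetical indicators: [1900's, 2000's, Comedy, Horror]
toy_content = np.array([
    [1, 0, 1, 0],  # movie A: 1900's comedy
    [1, 0, 1, 0],  # movie B: 1900's comedy
    [0, 1, 0, 1],  # movie C: 2000's horror
    [0, 1, 1, 0],  # movie D: 2000's comedy
])

# Movie-by-movie similarity: entry (i, j) counts the indicators movies i and j share
toy_sim = toy_content @ toy_content.T
print(toy_sim)
```

Movies A and B share both indicators, A and D share only the comedy flag, and A and C share nothing, so the entries in A's row fall in that order.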




`3.` Create a numpy array that is a matrix of indicator variables related to year (by century) and movie genres by movie.  Perform the dot product of this matrix with itself (transposed) to obtain a similarity matrix of each movie with every other movie.  The final matrix should be 31245 x 31245.

In [13]:
display(movies.head(2))
print('\n-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-\n')
display(movies.iloc[:,4:].head(2))

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0



-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-



Unnamed: 0,1800's,1900's,2000's,History,News,Horror,Musical,Film-Noir,Mystery,Adventure,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


In [100]:
# Subset so movie_content is only using the dummy variables for each genre and the 3 century based year dummy columns
movie_content = np.array(movies.iloc[:,4:])

# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = np.dot(movie_content, movie_content.T)
dot_prod_movies

array([[3, 3, 3, ..., 0, 0, 0],
       [3, 3, 3, ..., 0, 0, 0],
       [3, 3, 3, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 2, 2, 1],
       [0, 0, 0, ..., 2, 2, 1],
       [0, 0, 0, ..., 1, 1, 2]], dtype=int64)

In [18]:
# create checks for the dot product matrix
assert dot_prod_movies.shape[0] == 31245
assert dot_prod_movies.shape[1] == 31245
assert dot_prod_movies[0, 0] == np.max(dot_prod_movies[0])
print("Looks like you passed all of the tests.  Though they weren't very robust - if you want to write some of your own, I won't complain!")

Looks like you passed all of the tests.  Though they weren't very robust - if you want to write some of your own, I won't complain!


### For Each User...


Now you have a matrix where each user has their ratings ordered.  You also have a second matrix where movies are each axis, and the matrix entries are larger where the two movies are more similar and smaller where the two movies are dissimilar.  This matrix is a measure of content similarity. Therefore, it is time to get to the fun part.

For each user, we will perform the following:

    i. For each movie, find the movies that are most similar that the user hasn't seen.

    ii. Continue through the available, rated movies until 10 recommendations or until there are no additional movies.

As a final note, you may need to adjust the criteria for 'most similar' to obtain 10 recommendations.  As a first pass, I used only movies with the highest possible similarity to one another as similar enough to add as a recommendation.
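
The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `content_recs_for_user` is a hypothetical helper that works on row indices rather than movie ids, and it uses the "highest possible similarity" rule described in the note above.

```python
import numpy as np

def content_recs_for_user(ranked_seen_idx, sim, n_recs=10):
    """Hypothetical helper: walk a user's movies (best rated first) and collect
    unseen movies tied for the highest similarity, up to n_recs."""
    seen = set(ranked_seen_idx)
    recs = []
    for idx in ranked_seen_idx:
        row = sim[idx].copy()
        row[idx] = 0  # ignore the similarity of a movie with itself
        best = row.max()
        if best == 0:
            continue  # nothing shares any content with this movie
        for cand in np.where(row == best)[0]:
            cand = int(cand)
            # skip movies the user has already seen or that were already recommended
            if cand not in seen and cand not in recs:
                recs.append(cand)
                if len(recs) == n_recs:
                    return recs
    return recs

# Made-up content matrix: 5 movies, 4 indicator columns
toy_content = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])
toy_sim = toy_content @ toy_content.T

# The user rated movie 0 highest, then movie 2
print(content_recs_for_user([0, 2], toy_sim, n_recs=2))
```

Filtering on `seen` inside the loop keeps already-watched movies out of the list, and checking `cand not in recs` avoids recommending the same movie twice.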

`3.` In the cell below, complete each of the functions needed for making content based recommendations.

In [35]:
x = np.array([[1,0,3],[0,6,0]])
#np.where(np.argmax(x[0]>0))
np.argmax(x[0]>0)

0

In [46]:
# First find the index where a specific movie exists
# Then search using this index in the similarity matrix.
index = np.where(movies.movie_id == 10)[0][0]
index

1

In [51]:
most_similar_value = np.max(dot_prod_movies[index])
most_similar_value

3

In [53]:
indices = np.where(dot_prod_movies[index] == most_similar_value)
indices

(array([    0,     2,  9584, 12766, 21723], dtype=int64),)

In [54]:
movies.iloc[indices]

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
9584,154152,Annabelle Serpentine Dance (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
12766,392728,Roundhay Garden Scene (1888),Documentary|Short,1888,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
21723,2185694,Llandudno Happy Valley and Minstrel Show (1898),Documentary|Short,1898,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


In [57]:
movies.iloc[indices]['movie']

0          Edison Kinetoscopic Record of a Sneeze (1894)
2                          The Arrival of a Train (1896)
9584                   Annabelle Serpentine Dance (1895)
12766                       Roundhay Garden Scene (1888)
21723    Llandudno Happy Valley and Minstrel Show (1898)
Name: movie, dtype: object

In [174]:
# Join the two dataframes to create a dataframe of movies and ratings, sorted by rating.
sub_movies = movies[['movie_id','movie']]
sub_reviews = reviews_ranked[['movie_id','rating']]
df_movie_rate = sub_reviews.join(sub_movies.set_index('movie_id'), on=['movie_id'])
df_movie_rate

Unnamed: 0,movie_id,rating,movie
0,68646,10,The Godfather (1972)
499038,387914,10,Voces inocentes (2004)
498972,1877832,10,X-Men: Days of Future Past (2014)
498974,368226,10,The Room (2003)
498975,2090488,10,Down and Dangerous (2013)
...,...,...,...
184769,3450958,0,War for the Planet of the Apes (2017)
612049,17136,0,Metropolis (1927)
427090,3280262,0,Cult of Chucky (2017)
660995,60827,0,Persona (1966)


In [180]:
def removeMoviesDiagonal(dot):
    """
    Remove the similarity between each movie and itself.
   
    INPUT:
    dot : input dot product similarity matrix
   
    OUTPUT:
    None : the diagonal of dot is set to zero in place
    """
    np.fill_diagonal(dot, 0)


removeMoviesDiagonal(dot_prod_movies)
dot_prod_movies

array([[0, 3, 3, ..., 0, 0, 0],
       [3, 0, 3, ..., 0, 0, 0],
       [3, 3, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 2, 1],
       [0, 0, 0, ..., 2, 0, 1],
       [0, 0, 0, ..., 1, 1, 0]], dtype=int64)

In [212]:
from collections import defaultdict

def find_similar_movies(movie_id):
    '''
    INPUT
    movie_id - a movie_id 
    OUTPUT
    similar_movies - a pandas Series of the titles of the most similar movies
    '''
    # find the row of this movie id
    index = np.where(movies.movie_id == movie_id)[0][0]
    
    # find the most similar movie indices - as a first pass, only movies tied for the highest similarity count
    most_similar_value = np.max(dot_prod_movies[index])
    indices = np.where(dot_prod_movies[index] == most_similar_value)
    
    # pull the movie titles based on the indices
    similar_movies = movies.iloc[indices]['movie']
    
    return similar_movies
    
    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
             (returned in the row order of the movies dataframe)
    '''
    # keep the rows whose movie_id is in movie_ids and return their titles as a list
    movie_lst = movies.loc[movies.movie_id.isin(movie_ids), 'movie'].to_list()
    
    return movie_lst


def make_recs():
    '''
    INPUT
    None
    OUTPUT
    recs - a dictionary with keys of the user and values of the recommendations
    '''
    # Create dictionary to return with users and recommendations
    recs = defaultdict(set)
    
    reviews_ranked = reviews.sort_values(by=['rating'], ascending=False) # Reviews sorted by rating in descending order.
    
    users_with_all_recs, users_who_need_recs = get_users(all_recs,10) # Users who already have 10 recs, and users who still need some.
    
    # Create a dataframe with columns: movie_id, rating, movie
    df_movie_rate = reviews_ranked[['movie_id','rating']].join(movies[['movie_id','movie']].set_index('movie_id'), on=['movie_id'])
    
    # For each user
    for user_id in users_who_need_recs:    
        
        temp_similar_movies = []
        
        # Pull only the reviews the user has seen
        watched_movies_ids_lst = reviews_ranked[reviews_ranked.user_id == user_id]['movie_id'].values
        watched_movies_lst = get_movie_names(watched_movies_ids_lst)
        
        # number of collaborative filtering recs this user already has:
        recsPerUser = len(all_recs[user_id])
        #print('User_ID:'+str(user_id)+"  current number of recs:"+str(recsPerUser))
        
        # Look at each of the movies (highest ranked first), 
        # pull the movies the user hasn't seen that are most similar
        # These will be the recommendations - continue until 10 recs 
        # or you have depleted the movie list for the user
        
        # Create a list of Series containing the movies similar to the ones the user has rated.
        temp_similar_movies = [find_similar_movies(movie_id) for movie_id in watched_movies_ids_lst]
        
        # Flatten into one list, then remove duplicates
        flat_lst = [item for sublist in temp_similar_movies for item in sublist]
        all_recommended_movies = set(flat_lst) # remove duplicated movies
        
        # Titles of the candidate movies, ordered by rating across all reviews (highest first)
        movies_recs = df_movie_rate.loc[df_movie_rate.movie.isin(all_recommended_movies),'movie'].to_list()
   
        if len(all_recommended_movies) == 0:
            recs[user_id] = set()
        else :    
            recs[user_id] = movies_recs[:10-recsPerUser]
        # Looping Over df_movie_rate, get the higher ranked movie by movie until recsPerUser = 10
        #for movie in df_movie_rate.movie.values:
        #    if movie in all_recommended_movies:
        #        recs[user_id].add(movie)
        #        if len(recs[user_id]) > 10:
        #            break
                    
    return recs

In [182]:
# Debugging
y = defaultdict(set)
x = [[1,2,3],[2,5,8]]
flat_list = [item for sublist in x for item in sublist]
no_duplicate = set(flat_list)
y['a'] = no_duplicate
y['a']

{1, 2, 3, 5, 8}

In [89]:
# Testing
# Testing on ids = 8, 12, which are already known to be similar
get_movie_names([8,12]) 

['Edison Kinetoscopic Record of a Sneeze (1894)',
 'The Arrival of a Train (1896)']

In [196]:
# Testing 
sim_list = get_movie_names([10, 12, 8342946, 8402090])
print("Length = "+str(len(sim_list)))
sim_list

Length = 4


['La sortie des usines Lumière (1895)',
 'The Arrival of a Train (1896)',
 'Tig Notaro: Happy To Be Here (2018)',
 'Cumali Ceber 2 (2018)']

In [213]:
# Testing
recs = make_recs()
recs

defaultdict(set,
            {26: ['The Intouchables (2011)',
              'P.K. (2014)',
              'Casse-tête chinois (2013)',
              'Don Jon (2013)',
              'Be Somebody (2016)',
              'The Grand Budapest Hotel (2014)',
              'Be Somebody (2016)',
              'War Dogs (2016)',
              'Nebraska (2013)',
              'Song for Marion (2012)'],
             71: ['Unrest (2017)',
              'Man of Steel (2013)',
              'Star Wars: The Force Awakens (2015)',
              'Man of Steel (2013)',
              'War Dogs (2016)',
              'The Amazing Spider-Man 2 (2014)',
              "Tim's Vermeer (2013)",
              'Man of Steel (2013)',
              'Spider-Man: Homecoming (2017)',
              'Citizenfour (2014)'],
             108: ['Voces inocentes (2004)',
              'The Room (2003)',
              'Lo imposible (2012)',
              'The Intouchables (2011)',
              "La vie d'Adèle (2013)",
        

### How Did We Do?

Now that you have made the recommendations, how did we do in providing everyone with a set of recommendations?

`4.` Use the cells below to see how many individuals you were able to make recommendations for, as well as explore characteristics about individuals for whom you were not able to make recommendations.  

In [218]:
# Explore recommendations
def count_users_recs(mydict):
    """
    INPUT :
    mydict : a dictionary whose keys are user_ids and whose values are the recommended movies.
    
    OUTPUT:
    all_recs : number of users with exactly 10 recommendations
    need_recs : number of users with between 1 and 9 recommendations
    no_recs : number of users with no recommendations
    """
    # Note: `len(x) < 10 & len(x) > 0` binds as `len(x) < (10 & len(x)) > 0`, which is never true,
    # so a chained comparison is used for the partial case instead.
    all_recs = sum(len(user_recs) == 10 for user_recs in mydict.values())
    need_recs = sum(0 < len(user_recs) < 10 for user_recs in mydict.values())
    no_recs = sum(len(user_recs) == 0 for user_recs in mydict.values())
    
    return all_recs, need_recs, no_recs

all_recs_count, need_recs_count, no_recs_count = count_users_recs(recs)
print(all_recs_count, need_recs_count, no_recs_count)

1319 6 0


In [None]:
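# Some characteristics of my content based recommendations
users_without_all_recs = [user for user, user_recs in recs.items() if len(user_recs) < 10] # store user ids without all 10 recs
users_with_all_recs = [user for user, user_recs in recs.items() if len(user_recs) == 10] # users with all recs here
no_recs = [user for user, user_recs in recs.items() if len(user_recs) == 0] # users with no recommendations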

print("There were {} users without all 10 recommendations we would have liked to have.".format(len(users_without_all_recs)))
print("There were {} users with all 10 recommendations we would like them to have.".format(len(users_with_all_recs)))
print("There were {} users with no recommendations at all!".format(len(no_recs)))