### Data Acquisition

The dataset used to build the Content Based Movie Recommender System comes from https://grouplens.org/datasets/movielens/latest/. This notebook uses the small dataset, ml-latest-small.zip.

In [1]:
# Connect to Google Drive to access dataset
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os
import pandas as pd
import numpy as np

In [3]:
moviesDF = pd.read_csv('/content/gdrive/My Drive/MovieRatingsDataset/movies.csv')
ratingsDF = pd.read_csv('/content/gdrive/My Drive/MovieRatingsDataset/ratings.csv')
tagsDF = pd.read_csv('/content/gdrive/My Drive/MovieRatingsDataset/tags.csv')

In [4]:
moviesDF.head(5)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
ratingsDF.head(5)

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
tagsDF.head(5)

,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


### Data Preparation

In this step, we will try to generate Movie profiles and User profiles from the data available to us.

Generating the Movie profiles involves computing a high-dimensional embedding for each movie from the title, genre and tag information available for that movie.

Generating the User profiles involves computing a Weight Normalized Linear Combination of the profiles of the Movies that the particular user has rated.
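
As a small worked illustration of this combination, the sketch below uses made-up 3-dimensional movie vectors and ratings in plain Python. Each rating is centred on the user's mean rating and divided by the number of rated movies, and the result weights the corresponding movie vector. The function name `user_profile` and the toy data are assumptions for illustration, not part of the notebook.

```python
def user_profile(movie_vectors, ratings):
    """Weighted sum of movie vectors, with weights (rating - mean rating) / n."""
    n = len(ratings)
    mean_rating = sum(ratings) / n
    weights = [(r - mean_rating) / n for r in ratings]
    dim = len(movie_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, movie_vectors))
            for d in range(dim)]

# Toy data: three rated movies described by 3-dimensional vectors.
movie_vectors = [[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]]
ratings = [5.0, 3.0, 1.0]  # the mean rating is 3.0

profile = user_profile(movie_vectors, ratings)
print(profile)
```

With these numbers the weights are 2/3, 0 and -2/3: a movie rated above the user's mean pulls the profile toward it, a movie rated below the mean pushes the profile away, and a movie rated exactly at the mean contributes nothing.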

The profiles generated in this step will be used in the later stages to calculate the similarity between a User and a Movie that the user has not rated before. Based on this similarity we decide whether or not to recommend the Movie to the User.

In [7]:
# I like working with lists so I am converting the dataframes to lists.
# Later on, once we create more appropriate Hash table based representations, these lists will be deleted to free up main memory.

movieList = moviesDF.values.tolist()
ratingList = ratingsDF.values.tolist()
tagList = tagsDF.values.tolist()

In [8]:
print(len(movieList))
print(len(ratingList))
print(len(tagList))

9742
100836
3683


In [9]:
# Create Movie Dictionary
movieDict = {}

for item in movieList:
  movieDict[item[0]] = {'title': item[1],'genre':item[2]}

del movieList
del moviesDF

In [10]:
print(movieDict)



In [11]:
# Create Tags Dictionary
tagDict = {}

# Create empty list of tags for each movie
for item in movieDict.keys():
  tagDict[item] = list()

# Fill out the list of tags for the movies that were indeed tagged by some of the users.
# If a movie is not tagged by any user, then just keep empty list.
for i in range(len(tagList)):
  tagDict[tagList[i][1]].append(tagList[i][2])

print(tagDict)

{1: ['pixar', 'pixar', 'fun'], 2: ['fantasy', 'magic board game', 'Robin Williams', 'game'], 3: ['moldy', 'old'], 4: [], 5: ['pregnancy', 'remake'], 6: [], 7: ['remake'], 8: [], 9: [], 10: [], 11: ['politics', 'president'], 12: [], 13: [], 14: ['politics', 'president'], 15: [], 16: ['Mafia'], 17: ['Jane Austen'], 18: [], 19: [], 20: [], 21: ['Hollywood'], 22: ['serial killer'], 23: [], 24: [], 25: ['alcoholism'], 26: ['Shakespeare'], 27: [], 28: ['In Netflix queue', 'Jane Austen'], 29: ['kidnapping'], 30: [], 31: ['high school', 'teacher'], 32: ['time travel', 'time travel', 'Brad Pitt', 'Bruce Willis', 'mindfuck', 'Post apocalyptic', 'post-apocalyptic', 'remake', 'time travel', 'twist ending'], 34: ['Animal movie', 'pigs', 'villain nonexistent or not needed for good story'], 36: ['death penalty', 'Nun'], 38: ['twins'], 39: ['chick flick', 'funny', 'Paul Rudd', 'quotable', 'seen more than once', 'Emma', 'Jane Austen'], 40: ['In Netflix queue', 'South Africa'], 41: ['Shakespeare'], 42: 
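
The same grouping can also be sketched with `collections.defaultdict`, which removes the need to pre-create an empty list per movie. The rows below are a few toy entries in the column order of tags.csv; the name `tags_by_movie` is an assumption for illustration.

```python
from collections import defaultdict

# Toy rows in the same column order as tags.csv: userId, movieId, tag, timestamp.
tag_rows = [
    [2, 60756, 'funny', 1445714994],
    [2, 60756, 'will ferrell', 1445714992],
    [2, 89774, 'MMA', 1445715200],
]

tags_by_movie = defaultdict(list)
for _user_id, movie_id, tag, _timestamp in tag_rows:
    tags_by_movie[movie_id].append(tag)

print(tags_by_movie[60756])  # tags collected for a tagged movie
print(tags_by_movie[4])      # a movie nobody tagged yields an empty list
```

One design caveat: looking up a missing key in a `defaultdict` inserts that key, so the dictionary grows as untagged movies are queried.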

In [12]:
del tagsDF
del tagList

In [13]:
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYA

In [14]:
# Define function that takes a Movie Id, the complete Movie Dictionary, and Tag Dictionary as input and produces a d dimensional embedding for the Movie Profile 

from gensim.models import KeyedVectors
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
wordModel = KeyedVectors.load_word2vec_format('/content/gdrive/My Drive/GoogleNews-vectors-negative300.bin', binary=True)

def getMovieProfile(movieId, movieDict, tagDict):
  # Find the movie and tag details using the dictionaries and the movieId key.
  movieDetails = movieDict[movieId]
  tagDetails = tagDict[movieId]

  # We will convert the movie title into 384-dimensional embeddings using Sentence Transformer
  # Sentence Transformer: https://pypi.org/project/sentence-transformers/

  titleEmbedding = np.array(model.encode([movieDetails['title']])).ravel()
  
  # The Genre embedding is found by treating each genre of the movie as an individual word.
  # Each word is converted into a 300-dimensional embedding vector using the word2vec model loaded with Gensim.
  # Genres that are missing from the word2vec vocabulary fall back to a zero vector.
  # The element-wise average over all genres of the movie is the final Genre embedding.

  genreList = movieDetails['genre'].split('|')

  genreEmbeddingsList = list()
  for genre in genreList:
    try:
      genreEmbeddingsList.append(wordModel[genre])
    except KeyError:
      genreEmbeddingsList.append(np.zeros(300))
  genreEmbeddingsList = np.mean(np.array(genreEmbeddingsList), axis=0)

  # Tag Embeddings are calculated using Sentence Transformer.
  # For multiple tags, the element-wise average of the tag embeddings is used.
  # A movie without tags gets a 384-dimensional zero vector.
  if len(tagDetails) != 0:
    tagEmbeddingsList = np.mean(np.array(model.encode(tagDetails)), axis=0)
  else:
    tagEmbeddingsList = np.zeros(384)

  # Concatenate all three embeddings
  movieProfile = np.concatenate([np.array(titleEmbedding),np.array(genreEmbeddingsList),np.array(tagEmbeddingsList)])

  return movieProfile
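
To make the effect of the zero-vector fallback concrete, here is a minimal sketch with a hypothetical 2-dimensional vocabulary in which one genre is deliberately missing. It compares the fallback used above with an alternative that averages only the genres that have a vector. The names `toy_vectors`, `mean_with_zero_fallback` and `mean_of_known_only` are assumptions for illustration.

```python
# Hypothetical 2-dimensional "word vectors"; 'Sci-Fi' is deliberately absent
# from this toy vocabulary.
toy_vectors = {'Comedy': [2.0, 0.0], 'Drama': [0.0, 4.0]}
DIM = 2

def mean_with_zero_fallback(genres):
    """Average genre vectors, using a zero vector for unknown genres."""
    vecs = [toy_vectors.get(g, [0.0] * DIM) for g in genres]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def mean_of_known_only(genres):
    """Average only the genres that have a vector; zeros if none do."""
    vecs = [toy_vectors[g] for g in genres if g in toy_vectors]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(mean_with_zero_fallback(['Comedy', 'Sci-Fi']))  # [1.0, 0.0]
print(mean_of_known_only(['Comedy', 'Sci-Fi']))       # [2.0, 0.0]
```

With the zero fallback, every unknown genre shrinks the averaged vector. Because the genre block is concatenated with the title and tag blocks, its magnitude influences how much the genres contribute to the later cosine similarity, so which variant is preferable is a judgment call.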

In [15]:
# Check the getMovieProfile function

a = getMovieProfile(4, movieDict, tagDict)
print(a.shape)
print(list(a))

(1068,)
[-0.0872306153178215, 0.03661156818270683, -0.021703243255615234, -0.012105151079595089, 0.06295542418956757, 0.04352511093020439, 0.013924775645136833, -0.05505093187093735, 0.07251060754060745, -0.10118148475885391, 0.08453982323408127, -0.006872117053717375, -0.047712769359350204, -0.032097212970256805, 0.0397515632212162, 0.05789643153548241, 0.014613780193030834, 0.03318478539586067, 0.013079020194709301, -0.032603781670331955, -0.03977445513010025, 0.11831115931272507, 0.020672893151640892, 0.014648020267486572, -0.05312395468354225, -0.01833694614470005, -0.011012895032763481, -0.08187359571456909, -0.03377119451761246, 0.03459329158067703, 0.04618014022707939, -0.006381849758327007, -0.07438511401414871, -0.0442366786301136, -0.02104763500392437, -0.026363598182797432, -0.030498404055833817, 0.07753938436508179, -0.06011407449841499, 0.02023373730480671, -0.025297818705439568, -0.05337719991803169, -0.043759629130363464, -0.01622876524925232, 0.01621541753411293, -0.026
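
The printed shape of 1068 is the sum of the three concatenated blocks: 384 (title) + 300 (genre) + 384 (tags). The sketch below shows that layout with placeholder values and the slices that recover each block; the constant and variable names are assumptions for illustration.

```python
TITLE_DIM, GENRE_DIM, TAG_DIM = 384, 300, 384

# Placeholder blocks standing in for the real embeddings.
title_block = [0.1] * TITLE_DIM
genre_block = [0.2] * GENRE_DIM
tag_block = [0.0] * TAG_DIM  # a movie without tags gets all zeros

profile = title_block + genre_block + tag_block

# Slices that recover each block from the concatenated profile.
title_slice = slice(0, TITLE_DIM)
genre_slice = slice(TITLE_DIM, TITLE_DIM + GENRE_DIM)
tag_slice = slice(TITLE_DIM + GENRE_DIM, TITLE_DIM + GENRE_DIM + TAG_DIM)

print(len(profile))                      # 1068
print(profile[tag_slice] == tag_block)   # True
```

Movie 4 has an empty tag list in the tag dictionary, so the last 384 entries of its profile are the zero block.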

In [16]:
# Create User Rating Dictionary

userRatingDict = {}

# Find all user ids and initialize empty list

for i in range(len(ratingList)):
  userRatingDict[int(ratingList[i][0])] = list()

# For each user create a list of dictionaries specifying Movie id and rating given by the user
for i in range(len(ratingList)):
  userRatingDict[int(ratingList[i][0])].append({int(ratingList[i][1]): ratingList[i][2]})

print(userRatingDict)

{1: [{1: 4.0}, {3: 4.0}, {6: 4.0}, {47: 5.0}, {50: 5.0}, {70: 3.0}, {101: 5.0}, {110: 4.0}, {151: 5.0}, {157: 5.0}, {163: 5.0}, {216: 5.0}, {223: 3.0}, {231: 5.0}, {235: 4.0}, {260: 5.0}, {296: 3.0}, {316: 3.0}, {333: 5.0}, {349: 4.0}, {356: 4.0}, {362: 5.0}, {367: 4.0}, {423: 3.0}, {441: 4.0}, {457: 5.0}, {480: 4.0}, {500: 3.0}, {527: 5.0}, {543: 4.0}, {552: 4.0}, {553: 5.0}, {590: 4.0}, {592: 4.0}, {593: 4.0}, {596: 5.0}, {608: 5.0}, {648: 3.0}, {661: 5.0}, {673: 3.0}, {733: 4.0}, {736: 3.0}, {780: 3.0}, {804: 4.0}, {919: 5.0}, {923: 5.0}, {940: 5.0}, {943: 4.0}, {954: 5.0}, {1009: 3.0}, {1023: 5.0}, {1024: 5.0}, {1025: 5.0}, {1029: 5.0}, {1030: 3.0}, {1031: 5.0}, {1032: 5.0}, {1042: 4.0}, {1049: 5.0}, {1060: 4.0}, {1073: 5.0}, {1080: 5.0}, {1089: 5.0}, {1090: 4.0}, {1092: 5.0}, {1097: 5.0}, {1127: 4.0}, {1136: 5.0}, {1196: 5.0}, {1197: 5.0}, {1198: 5.0}, {1206: 5.0}, {1208: 4.0}, {1210: 5.0}, {1213: 5.0}, {1214: 4.0}, {1219: 2.0}, {1220: 5.0}, {1222: 5.0}, {1224: 5.0}, {1226: 5.0}, 
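
An alternative layout, sketched below on a few toy rows, stores one `{movieId: rating}` dictionary per user instead of a list of single-entry dictionaries. This is not what the notebook uses (the functions that follow expect the list form), and it assumes each user rates a given movie at most once. The name `ratings_by_user` is hypothetical.

```python
# Toy rows in the same column order as ratings.csv: userId, movieId, rating, timestamp.
rating_rows = [
    [1, 1, 4.0, 964982703],
    [1, 3, 4.0, 964981247],
    [1, 47, 5.0, 964983815],
]

ratings_by_user = {}
for user_id, movie_id, rating, _timestamp in rating_rows:
    ratings_by_user.setdefault(int(user_id), {})[int(movie_id)] = rating

print(ratings_by_user[1][47])    # rating that user 1 gave movie 47
print(47 in ratings_by_user[1])  # membership test for "already rated"
print(list(ratings_by_user[1]))  # movie ids rated by user 1
```

With this layout, checking whether a user has already rated a movie is a single dictionary lookup rather than a scan over the user's list.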

In [17]:
del ratingList
del ratingsDF

In [18]:
# Function for creating User profiles

def getUserProfile_WeightNormalizedAverage(userId, userRatingDict, movieDict, tagDict):
  movieEmbeddings = list()
  userRatings = list()
  
  for i in range(len(userRatingDict[userId])):
    movieId = list(userRatingDict[userId][i].keys())[0]
    movieEmbeddings.append(getMovieProfile(movieId, movieDict, tagDict))
    userRatings.append(list(userRatingDict[userId][i].values())[0])
  
  userAvgRating = np.mean(userRatings)

  # Subtract the User's mean rating from the rating list to get the Normalized ratings

  for i in range(len(userRatings)):
    userRatings[i] = userRatings[i] - userAvgRating
  
  # Weight each Movie Profile by (normalized rating / number of rated movies)

  for i in range(len(userRatings)):
    movieEmbeddings[i] = movieEmbeddings[i] * (userRatings[i] / len(userRatings))
  
  # Sum up the modified movieEmbeddings to get the user profile

  movieEmbeddings = np.array(movieEmbeddings)
  userProfile = np.sum(movieEmbeddings,axis=0)

  return userProfile 

In [19]:
# Check the get user profile function

a = getUserProfile_WeightNormalizedAverage(1, userRatingDict, movieDict, tagDict)
print(a.shape)
print(list(a))
del a

(1068,)
[0.0019569398226182665, -0.0008425040481359395, -0.0023818866394362814, 0.001555051491958307, 0.0011206191746439278, 0.0021245691227347306, -0.0010396635080487774, 0.0013175655331059135, 0.0023493037398966492, 0.0036062036777062944, -0.0002943099218018257, -0.0010856631266486134, 0.002537062655284614, -0.0022347278379819683, -0.0010411152072284887, -0.0033850706773984737, -0.0008387522014626744, -0.00466739446902995, 0.00183636444367591, -0.004723243188861901, 0.002209361404469789, 0.0019076647079408185, 0.003821372548620273, 0.003815763140256745, 0.001016040140375526, -0.00041056571519890216, 0.001221696576251036, 0.0003178448092673186, -0.00015499734571732096, -0.004710893255475722, -0.003896595755172569, -0.006057775596908935, 0.0035064899440072914, -0.00038087201208591884, -0.0020285808788461385, -0.0010641444834871285, 0.0018328514325593852, 0.005047508577500051, -0.0022232972631253314, -0.0011820451319501928, -0.0007583164458367518, 0.0012452251179818014, -0.0027728634187

In [20]:
from scipy import spatial

def getSimilarity(userProfile, movieId):
  
  movieProfile = getMovieProfile(movieId, movieDict, tagDict)

  result = 1 - spatial.distance.cosine(userProfile, movieProfile)  
  return result
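
Cosine similarity is undefined when either vector has zero norm. In this notebook that can happen for a user whose ratings are all identical, because every mean-centred weight is then zero and the user profile is the zero vector. The following is a minimal plain-Python sketch of a guarded similarity; returning 0.0 in that case is an assumption made for illustration, and `safe_cosine_similarity` is a hypothetical helper, not a SciPy function.

```python
import math

def safe_cosine_similarity(u, v):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat "undefined" as "no similarity"
    return dot / (norm_u * norm_v)

print(safe_cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(safe_cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(safe_cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # 0.0
```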

In [23]:
import heapq

def recommend_K_Movies(userId,k=5):
  allMovieSimDict = {}
  alreadyRatedMovies = {}

  userProfile = getUserProfile_WeightNormalizedAverage(userId, userRatingDict, movieDict, tagDict)

  for item in movieDict.keys():
    alreadyRatedMovies[item] = 0

  for i in range(len(userRatingDict[userId])):
    alreadyRatedMovies[list(userRatingDict[userId][i].keys())[0]] = 1

  print('Starting User-Movie Similarity Calculation....')

  for movieId in movieDict.keys():
    if alreadyRatedMovies[movieId]==0:
      allMovieSimDict[movieId] = getSimilarity(userProfile,movieId)
  
  print('User-Movie Similarity Calculation Finished')
  print('\n')

  print('The User with Id '+str(userId)+' has watched and rated the following movies: ')

  count = 0
  for i in range(len(userRatingDict[userId])):
    if list(userRatingDict[userId][i].values())[0] >=4 and count <5:
      print(movieDict[list(userRatingDict[userId][i].keys())[0]]['title']+', Genre: '+movieDict[list(userRatingDict[userId][i].keys())[0]]['genre']+', Rating: '+str(list(userRatingDict[userId][i].values())[0]))
      count += 1

  print('\n')
  print('Finding top '+str(k)+' movie recommendations....')
  k_keys_sorted = heapq.nlargest(k, allMovieSimDict, key=allMovieSimDict.__getitem__)

  for i in range(len(k_keys_sorted)):
    print('\n')
    print("Recommendation "+str(i+1)+": \n")
    print("MovieId: "+str(k_keys_sorted[i])+"\t Title: "+movieDict[k_keys_sorted[i]]['title']+"\t Genre: "+movieDict[k_keys_sorted[i]]['genre'])


In [26]:
recommend_K_Movies(7,k=5)

Starting User-Movie Similarity Calculation....
User-Movie Similarity Calculation Finished


The User with Id 7 has watched and rated the following movies: 
Toy Story (1995), Genre: Adventure|Animation|Children|Comedy|Fantasy, Rating: 4.5
Usual Suspects, The (1995), Genre: Crime|Mystery|Thriller, Rating: 4.5
Apollo 13 (1995), Genre: Adventure|Drama|IMAX, Rating: 4.5
Die Hard: With a Vengeance (1995), Genre: Action|Crime|Thriller, Rating: 4.0
Star Wars: Episode IV - A New Hope (1977), Genre: Action|Adventure|Sci-Fi, Rating: 5.0


Finding top 5 movie recommendations....


Recommendation 1: 

MovieId: 541	 Title: Blade Runner (1982)	 Genre: Action|Sci-Fi|Thriller


Recommendation 2: 

MovieId: 3527	 Title: Predator (1987)	 Genre: Action|Sci-Fi|Thriller


Recommendation 3: 

MovieId: 6283	 Title: Cowboy Bebop: The Movie (Cowboy Bebop: Tengoku no Tobira) (2001)	 Genre: Action|Animation|Sci-Fi|Thriller


Recommendation 4: 

MovieId: 474	 Title: In the Line of Fire (1993)	 Genre: Action|Thrill