In [20]:
import sys
import os

# Get the current working directory of the Jupyter notebook
notebook_directory = os.getcwd()

# Assuming the notebook is in the 'bin/' folder, add the parent directory to sys.path
parent_directory = os.path.dirname(notebook_directory)
sys.path.append(parent_directory)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
import random

In [2]:
import spacy

#Run the following commands on terminal:
# conda install spacy
# python -m spacy download en_core_web_sm

In [3]:
#NLTK data files needed for stop-word removal, tokenization, POS tagging, and lemmatization to work.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

#This is needed for removing names from the text (#todo)
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\detab\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\detab\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\detab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\detab\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [4]:
#Hello World code for TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example documents
documents = ['the sky is blue', 'the sun is bright', 'the sun in the sky is bright', 'we can see the shining sun, the bright sun']

# Create the transform
vectorizer = TfidfVectorizer()

# Tokenize, build the vocabulary, and return the TF-IDF matrix (one row per document)
tfidf_matrix = vectorizer.fit_transform(documents)

# Compute cosine similarity between all pairs
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

#print(cosine_sim)
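
A quick sanity check on the toy example above, as a sketch: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the cosine similarity between two rows should reduce to a plain dot product. The variable name `dot_products` is introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    'the sky is blue',
    'the sun is bright',
    'the sun in the sky is bright',
    'we can see the shining sun, the bright sun',
]

tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Rows are unit length, so the Gram matrix X @ X.T already holds the cosines.
dot_products = (tfidf_matrix @ tfidf_matrix.T).toarray()
cosine_sim = cosine_similarity(tfidf_matrix)

print(np.allclose(dot_products, cosine_sim))   # do the two computations agree?
print(np.allclose(np.diag(cosine_sim), 1.0))   # is every document identical to itself?
```

If both checks pass, the full movie matrix could be compared with sparse dot products alone, which is what `cosine_similarity` effectively does after normalizing.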

**Overall Recommender System:**

Context: The current group preferences (filters), plus the full movie dataset and its properties.

Input: All movies voted on by a user

Output: The next M movies to recommend to the user (M = 5 or 10; M = 10 is assumed below), batched so the user doesn't have to wait for loading after every vote.

**Recommender Algorithm:**

Content-based filtering with TF-IDF and cosine similarity

1. Preprocess data:
    - Get all movie overview strings
    - Tokenize the strings (break into words)
    - Clean up data not useful for comparison (stopwords, numbers, etc.)
    - Stemming/lemmatization (reduce words to their root form)
    <p> <br> </p>
2. TF-IDF vector of words:
    - Convert all the descriptions into vectors using TF-IDF
    - Convert categorical features like genre into binary features using one-hot encoding
    - Normalize numerical features such as release year and user ratings to ensure they are on the same scale as other features (0-1)
    - Combine all 3 into one total vector describing the movie
    <p> <br> </p>
3. Calculate the user profile as a weighted average of the feature vectors of all movies liked so far. It should have the same length as each movie's vector.
    - We could later introduce logic to use disliked movies in the algorithm, though I don't think we should.
    <p> <br> </p>
4. Generate recommendations:
    - Whenever the user votes (or after every N votes, to be more efficient), recalculate the user profile vector.
    - Whenever the client requests the next M top movies, calculate the cosine similarity between the current user profile and every candidate movie in the database. Candidate movies are all movies that match the group filters and have not yet been swiped by the user.
    - Time complexity = O(number of candidate movies x number of features per movie), i.e. linear in the size of the candidate matrix.
    - Return the top M = 10 movies with the highest cosine similarity.
     <p> <br> </p>
5. Handle new users who have not swiped yet:
    - The initial recommendation just applies the group filters and sorts by IMDb rating.
    - Future versions can try to present a more diverse set of initial movies to get better user input, leading to better subsequent recommendations.
    <p> <br> </p>
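
Step 2 is not implemented in the cells that follow, so here is a minimal sketch of how the three feature groups could be combined into one sparse matrix. The toy table, its column names (`genres`, `release_year`, `vote_average`), and all values are made up for illustration. Because a movie can have several genres, the sketch uses a multi-hot encoding (`MultiLabelBinarizer`) in place of a strict one-hot encoding.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, MultiLabelBinarizer

# Toy stand-in for the movie table; column names and values are assumptions.
movies = pd.DataFrame({
    "overview": [
        "toy cowboy friend adventure",
        "board game jungle adventure",
        "neighbor feud fishing wedding",
    ],
    "genres": [["Animation", "Comedy"], ["Adventure", "Fantasy"], ["Comedy", "Romance"]],
    "release_year": [1995, 1999, 2004],
    "vote_average": [7.7, 6.9, 6.5],
})

# 1. Text: TF-IDF over the (already cleaned) overviews, values in [0, 1].
text_features = TfidfVectorizer().fit_transform(movies["overview"])

# 2. Categorical: one binary column per genre.
genre_features = csr_matrix(MultiLabelBinarizer().fit_transform(movies["genres"]))

# 3. Numerical: rescale each column to the 0-1 range.
numeric_features = csr_matrix(
    MinMaxScaler().fit_transform(movies[["release_year", "vote_average"]])
)

# Concatenate column-wise: one row per movie, all feature groups side by side.
movie_features = hstack([text_features, genre_features, numeric_features]).tocsr()
print(movie_features.shape)
```

One design point to keep in mind: after concatenation, the genre and numeric columns can dominate the cosine similarity because TF-IDF entries are usually much smaller than 1, so some per-group weighting may be needed.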

In [5]:
#Load movie dataset
df = pd.read_csv("../amf.csv")

df['original_title'] = df['original_title'].fillna('')
df['overview'] = df['overview'].fillna('')

In [6]:
#Get string columns as lists. We won't use title for TF-IDF, just for verification purposes
movie_ids = df['id'].tolist()  # named movie_ids to avoid shadowing the built-in id()
titles = df['original_title'].tolist()
overviews = df['overview'].tolist()

print(overviews[:5])

["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.", "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.", "A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is

In [7]:
#Lemmatization stuff

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None
    
def lemmatize_sentence(sentence, lemmatizer):
    # Tokenize the sentence and find the POS tag for each token
    nltk_tagged = pos_tag(word_tokenize(sentence))  
    # Tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            # else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [8]:
#Function to delete people's names from descriptions (like Harry, Ron, etc.)

def remove_people_names(text):
    # Create a spaCy document
    doc = nlp(text)
    
    # Collect the text of every entity that spaCy labels as a person
    people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

    # Replace people's names with an empty string.
    # Note: str.replace removes every occurrence of the substring, including inside longer words.
    for person in people:
        text = text.replace(person, '')

    return text

In [10]:
#Removes stop words, punctuation, digits, and repeated spaces. The stop-word check is case-sensitive, so capitalized words such as "But" are kept.
def remove_stops(text, stops):
    words = text.split()
    final = []
    for word in words:
        if word not in stops:
            final.append(word)
    final = " ".join(final)
    final = final.translate(str.maketrans("", "", string.punctuation))
    final = "".join([i for i in final if not i.isdigit()])
    while "  " in final:
        final = final.replace("  ", " ")
    return (final)


#take in a list of strings and clean them up for use in TF-IDF
def clean_docs(docs):
    lemmatizer = WordNetLemmatizer()
    stops = stopwords.words("english")
    final = []
    for doc in docs:
        clean_doc = doc
        #clean_doc = remove_people_names(doc)
        clean_doc = lemmatize_sentence(clean_doc, lemmatizer)
        clean_doc = remove_stops(clean_doc, stops)
        #word_tokenize splits a possessive into a separate "'s" token; once punctuation is stripped it becomes a stray "s", so drop it
        clean_doc = clean_doc.replace(' s ', ' ')
        final.append(clean_doc)
    return (final)
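
The stop-word comparison in `remove_stops` is case-sensitive, while NLTK's English list is all lowercase, so sentence-initial words like "But" or "When" pass through (the `TfidfVectorizer` used later lowercases and filters again, which hides most of this). A sketch of a case-insensitive variant follows. The name `remove_stops_ci` and the tiny `demo_stops` list are hypothetical, chosen so the example runs without any NLTK downloads.

```python
import string

def remove_stops_ci(text, stops):
    """Drop stop words (case-insensitively), punctuation, digits, and extra spaces."""
    stop_set = {s.lower() for s in stops}              # set lookup instead of list scan
    kept = [w for w in text.split() if w.lower() not in stop_set]
    cleaned = " ".join(kept)
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    cleaned = "".join(ch for ch in cleaned if not ch.isdigit())
    return " ".join(cleaned.split())                   # collapse whitespace, trim ends

demo_stops = ["but", "when", "the", "a", "from"]       # illustrative list, not NLTK's
demo_result = remove_stops_ci(
    "But when the 3 toys escape from a box, chaos follows!", demo_stops
)
print(demo_result)
```

Unlike the original function, this variant also trims leading and trailing spaces, and the set lookup avoids rescanning the stop-word list for every word.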

In [9]:
#FYI - Stop words that will be deleted by the remove_stops function:
stops = stopwords.words("english")
print(stops)
print(len(stops))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
#[10 mins to run] Get the cleaned overviews that will be fed into the TF-IDF function
cleaned_overviews = clean_docs(overviews)
print(cleaned_overviews[:5])

['Led Woody Andy toy live happily room Andy birthday bring Buzz Lightyear onto scene Afraid lose place Andy heart Woody plots Buzz But circumstance separate Buzz Woody owner duo eventually learn put aside difference ', 'When sibling Judy Peter discover enchant board game open door magical world unwittingly invite Alan adult trap inside game year living room Alan hope freedom finish game prove risky three find run giant rhinoceros evil monkey terrifying creature ', 'A family wedding reignite ancient feud nextdoor neighbor fishing buddy John Max Meanwhile sultry Italian divorcée open restaurant local bait shop alarm local worry ll scare fish away But less interested seafood cook hot time Max ', 'Cheated mistreat step woman hold breath wait elusive good man break string lessthanstellar lover Friends confidant Vannah Bernie Glo Robin talk determine find good way breathe ', 'Just George Banks recover daughter wedding receive news pregnant George wife Nina expect He plan sell home plan like 

In [12]:
#Generate vectorizer model. Takes about 11 seconds
vectorizer = TfidfVectorizer(
    lowercase=True,
    max_features=5000,
    max_df=0.8,
    min_df=5,
    ngram_range=(1, 3),
    stop_words="english",
)

vectors = vectorizer.fit_transform(cleaned_overviews)

feature_names = vectorizer.get_feature_names_out()

In [13]:
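# Densify for inspection only: 45,466 rows x 5,000 float64 columns is roughly 1.8 GB of memory.
dense_vectors = vectors.toarray()
# Note: this rebinds `df`, so the movie DataFrame loaded from the CSV is no longer reachable under that name.
df = pd.DataFrame(dense_vectors, columns=feature_names)
print(df)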


       aaron  abandon  abandoned  abby  abduct  ability  able  aboard  \
0        0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
1        0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
2        0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
3        0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
4        0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
...      ...      ...        ...   ...     ...      ...   ...     ...   
45461    0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
45462    0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
45463    0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
45464    0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   
45465    0.0      0.0        0.0   0.0     0.0      0.0   0.0     0.0   

       abortion  abroad  ...  youth  youthful   yu  zatoichi  zealand  zero  \
0           0.0     0.0  ...    0.0       0.

In [14]:
#Top 10 TF-IDF terms for row 892 (The Wizard of Oz)

top_values = df.iloc[892].sort_values(ascending=False)[:10]
print(top_values)

thing          0.316296
yellow         0.271024
make friend    0.265093
dorothy        0.263323
wicked         0.261096
lion           0.256555
wizard         0.251303
make way       0.240276
make           0.236940
witch          0.223544
Name: 892, dtype: float64


In [15]:
print(vectors[:10])

  (0, 1201)	0.19953064420228217
  (0, 253)	0.228135251450199
  (0, 2521)	0.13672908683661142
  (0, 1509)	0.16180787148930936
  (0, 1334)	0.2058782082915691
  (0, 3179)	0.15628659156625843
  (0, 3951)	0.1814768168092712
  (0, 738)	0.19322501352486027
  (0, 2028)	0.15470588341737868
  (0, 3312)	0.1348644698142803
  (0, 2638)	0.13355872379188607
  (0, 81)	0.2184228005546969
  (0, 3873)	0.162179038005134
  (0, 523)	0.13087873107203457
  (0, 433)	0.19162174426761105
  (0, 3790)	0.16951345675152515
  (0, 1984)	0.20704015258040712
  (0, 2599)	0.11169821423324114
  (0, 4535)	0.21393290725764616
  (0, 165)	0.6391564300822646
  (1, 996)	0.15563187815440185
  (1, 4441)	0.19134065423614086
  (1, 2904)	0.20918469376033463
  (1, 1514)	0.13334559243126776
  (1, 1870)	0.16715962965957554
  :	:
  (8, 3968)	0.19189179478322363
  (8, 4484)	0.14775784938517658
  (8, 4357)	0.19593045192860659
  (8, 4406)	0.1148077691392191
  (8, 4716)	0.31727749237618247
  (8, 756)	0.19699223086722104
  (8, 2360)	0.1639448

In [35]:
#This calculates cosine similarity between 2 vectors (movies).

#Note: cosine_similarity expects 2D inputs.
#To compare single vectors, reshape each one to the 2D shape (1, N), where N is the vector length.
#to-do: Update this function to become a weighted cosine, using weights from a file.
def get_cosine_similarity(movie_vector_1, movie_vector_2):

    cosine_sim = cosine_similarity(movie_vector_1, movie_vector_2)
    return cosine_sim
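
For the weighted-cosine to-do, one possible approach (a sketch, not a settled design) is to scale feature i by the square root of its weight and then take the ordinary cosine. This equals the weighted cosine sum(w_i * a_i * b_i) / (sqrt(sum(w_i * a_i^2)) * sqrt(sum(w_i * b_i^2))). The function name and the in-memory `weights` argument are assumptions; loading weights from a file is left out.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from sklearn.metrics.pairwise import cosine_similarity

def get_weighted_cosine_similarity(matrix_1, matrix_2, weights):
    """Cosine similarity after scaling feature i by sqrt(weights[i])."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    scale = diags(np.sqrt(weights))               # sparse diagonal matrix, shape (N, N)
    return cosine_similarity(matrix_1 @ scale, matrix_2 @ scale)

vec_a = csr_matrix([[1.0, 0.0, 1.0]])
vec_b = csr_matrix([[1.0, 1.0, 0.0]])

uniform = get_weighted_cosine_similarity(vec_a, vec_b, [1.0, 1.0, 1.0])
boosted = get_weighted_cosine_similarity(vec_a, vec_b, [4.0, 1.0, 1.0])
print(uniform)   # equal weights: same as the plain cosine
print(boosted)   # the shared first feature counts more, so similarity rises
```

With all weights equal to 1 this falls back to `get_cosine_similarity`, so it could replace that function without changing existing results.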

In [36]:
#Testing Cosine Similarity

movie_vector_1 = vectors[0] #Toy Story
movie_vector_2 = vectors[1] #Jumanji

print(get_cosine_similarity(movie_vector_1, movie_vector_2))

movie_vector_1 = vectors[4766] #Harry Potter 1 (TPS)
movie_vector_2 = vectors[5678] #Harry Potter 2 (TCoS)

print(get_cosine_similarity(movie_vector_1, movie_vector_2))

movie_vector_1 = vectors[4766] #Harry Potter 1 (TPS)
movie_vector_2 = vectors[892] #The Wizard of Oz
print(get_cosine_similarity(movie_vector_1, movie_vector_2))

[[0.02594483]]
[[0.15380733]]
[[0.0632266]]


In [40]:
#Get the top movies relating to a given movie vector using cosine similarity. 
#2 use cases for this:
# 1. given_movie_vector = a specific movie's TF-IDF vector. This will return top movies relating to that movie.
# 2. given_movie_vector = user_profile's vector. This will return top movies recommended for this user. 

def get_top_movies_cosine(tfidf_matrix, given_movie_vector, movie_titles, top_n=5, skip_first=True):
    
    # Compute cosine similarity between the given vector and every movie in the matrix
    cosine_similarities = get_cosine_similarity(given_movie_vector, tfidf_matrix).flatten()
    
    # Get the indices of the top_n movies with the highest cosine similarity scores
    # Use argsort and reverse it with [::-1] to get the indices in descending order of similarity
    # skip_first=True drops the best match: in use case 1 that is the movie itself (similarity 1).
    # In use case 2 (user profile) the best match is a real candidate, so pass skip_first=False to keep it.
    start = 1 if skip_first else 0
    similar_indices = cosine_similarities.argsort()[::-1][start:start + top_n]
    
    # Get the scores for the top_n movies
    similar_scores = cosine_similarities[similar_indices]
    
    # Combine indices and scores into a list of tuples and return
    top_movies = [(movie_titles[index], index, score) for index, score in zip(similar_indices, similar_scores)]

    print("Top similar movies to the provided movie vector:\n")
    for num, (title, index, score) in enumerate(top_movies, start=1):
        print(f"{num}. \"{title}\" at ROW {index} with similarity score: {score}")

    return top_movies

In [42]:
get_top_movies_cosine(vectors, vectors[162], titles, 10);

Top similar movies to the provided movie vector:

1. "Broadway Melody of 1940" at ROW 10175 with similarity score: 0.3059949855886764
2. "If These Knishes Could Talk: The Story of the NY Accent" at ROW 37657 with similarity score: 0.2521433823774023
3. "The Transfiguration" at ROW 43275 with similarity score: 0.2515363367569358
4. "Les Ripoux" at ROW 44766 with similarity score: 0.2486966593060226
5. "Khiladi 786" at ROW 40343 with similarity score: 0.24721219021926344
6. "Loose Cannons" at ROW 6411 with similarity score: 0.2472099680754719
7. "Texas Killing Fields" at ROW 19018 with similarity score: 0.24454771954582263
8. "Shoot the Moon" at ROW 5989 with similarity score: 0.2262166595812431
9. "Strictly Ballroom" at ROW 1147 with similarity score: 0.22098725761853977
10. "Alvin and the Chipmunks: The Road Chip" at ROW 38589 with similarity score: 0.2112776847969979


In [27]:
#Calculate the updated user profile after the user has voted on M movies.
# M = 1 means an immediate feedback loop, but that may not be ideal: it might bias our recommendations towards our initial dataset (high exploit, low explore).
# I think M = 5 or 10 might be better.
# An even better idea is a hybrid of the above: M = 10 initially, and after some votes M --> 1.

def update_user_profile_batch(user_profile, movie_vectors, ratings, M):
    """
    Update the user profile based on a batch of movie ratings.

    :param user_profile: scipy.sparse matrix, the current user profile vector (1, N)
    :param movie_vectors: list of scipy.sparse matrices, the TF-IDF vectors of the rated movies [(1, N), (1, N), ...]
    :param ratings: list of str, the ratings for each movie ('like' or 'dislike')
    :param M: int, the minimum number of ratings required in the batch (every supplied rating is processed)
    :return: scipy.sparse matrix, the updated user profile vector (1, N)
    """
    dislike_factor = 1/3 #we can tweak this to see impact on recommendations. 

    if len(movie_vectors) != len(ratings):
        raise ValueError("The number of movie vectors and ratings must be the same")

    if len(movie_vectors) < M:
        raise ValueError("The number of movie vectors must be at least M")

    # Initialize a temporary profile change vector
    profile_change = csr_matrix((1, user_profile.shape[1]))

    # Process each movie vector and rating
    for movie_vector, rating in zip(movie_vectors, ratings):
        if rating == 'like':
            profile_change += movie_vector
        elif rating == 'dislike':
            profile_change -= (dislike_factor * movie_vector)
        else:
            raise ValueError("Rating must be 'like' or 'dislike'")

    # Update the user profile after processing the whole batch
    updated_profile = user_profile + profile_change

    # Normalize the updated profile
    updated_profile = normalize(updated_profile, norm='l2', axis=1)

    return updated_profile


In [50]:
#Example usage of User Profile Update:

# In our app, we should initialize user_profile as a (1, N) sparse matrix of zeros when the User() is created.
# i.e. user_profile should be a property of the User() object.

VECTOR_LENGTH = vectors.shape[1] #This could be assigned as a global variable. Once we settle on an algorithm, this should not change. 

user_profile = csr_matrix((1, VECTOR_LENGTH)) #Sparse matrix for quick maths. (e.g. 2 + 2 is 4. Minus 1 that's 3)
print(type(user_profile), user_profile.shape)

movie_vectors = [vectors[i] for i in range(5)]  # Replace with actual indices of movies the user rated
ratings = ['like', 'dislike', 'like', 'like', 'dislike']  # Example ratings

#For display purposes:
print('Displaying rated movies:')
for i, _ in enumerate(movie_vectors):
    print(f"{i}. {titles[i]} - {ratings[i]}")

# Update the profile based on user ratings of M movies
M = 5
user_profile = update_user_profile_batch(user_profile, movie_vectors, ratings, M)

<class 'scipy.sparse._csr.csr_matrix'> (1, 5000)
Displaying rated movies:
0. Toy Story - like
1. Jumanji - dislike
2. Grumpier Old Men - like
3. Waiting to Exhale - like
4. Father of the Bride Part II - dislike


In [45]:
#Now that the user profile has been updated, get the top 10 recommendations for this user:
print(get_top_movies_cosine(vectors, user_profile, titles, 10))

Top similar movies to the provided movie vector:

1. "Toy Story" at ROW 0 with similarity score: 0.5573292945991553
2. "Grumpier Old Men" at ROW 2 with similarity score: 0.550677658042087
3. "The 40 Year Old Virgin" at ROW 10301 with similarity score: 0.2886061333063643
4. "The Champ" at ROW 8327 with similarity score: 0.28021104108161937
5. "Toy Story 3" at ROW 15348 with similarity score: 0.2789042331047257
6. "Andy Kaufman Plays Carnegie Hall" at ROW 43427 with similarity score: 0.26488180074963247
7. "Andy Hardy's Blonde Trouble" at ROW 23843 with similarity score: 0.25113475128272983
8. "Superstar: The Life and Times of Andy Warhol" at ROW 38476 with similarity score: 0.24818528564740497
9. "Andy Peters: Exclamation Mark Question Point" at ROW 42721 with similarity score: 0.23805647569731256
10. "桃姐" at ROW 20326 with similarity score: 0.22845267824390414
[('Toy Story', 0, 0.5573292945991553), ('Grumpier Old Men', 2, 0.550677658042087), ('The 40 Year Old Virgin', 10301, 0.28860613
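
The list above still contains Toy Story and Grumpier Old Men, which this user has already rated, whereas the algorithm notes restrict candidates to movies that match the group filters and have not been swiped. A sketch of one way to apply that restriction, together with the rating-based fallback for new users, is shown below. The function `recommend`, the boolean `candidate_mask`, and the `fallback_scores` list are all hypothetical names, and the small matrix is made-up data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def recommend(tfidf_matrix, user_profile, candidate_mask, fallback_scores, top_n=10):
    """Return row indices of the top_n candidate movies for one user.

    candidate_mask:  boolean array, True where a movie passes the group filters
                     and has not been swiped yet.
    fallback_scores: per-movie ratings, used while the profile is still empty.
    """
    candidate_mask = np.asarray(candidate_mask, dtype=bool)
    if user_profile.nnz == 0:
        # Cold start: no votes yet, so rank candidates by rating instead.
        scores = np.asarray(fallback_scores, dtype=float).copy()
    else:
        scores = cosine_similarity(user_profile, tfidf_matrix).ravel()
    scores[~candidate_mask] = -np.inf             # excluded movies sort last
    ranked = np.argsort(scores)[::-1][:top_n]
    return [int(i) for i in ranked if candidate_mask[i]]

movie_matrix = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.2, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]))
movie_ratings = [8.3, 6.9, 7.5, 9.0]
not_swiped = np.array([False, True, True, True])  # movie 0 was already rated

liked_profile = movie_matrix[0]                   # pretend the user liked movie 0
warm_picks = recommend(movie_matrix, liked_profile, not_swiped, movie_ratings, top_n=2)
cold_picks = recommend(movie_matrix, csr_matrix((1, 3)), not_swiped, movie_ratings, top_n=2)
print(warm_picks)   # nearest unswiped movies to the profile
print(cold_picks)   # highest-rated unswiped movies
```

Masking scores rather than slicing the matrix keeps the returned indices aligned with the original row numbers, so they can still be used to look up entries in `titles`.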