# OVERVIEW

Note that this notebook is *not intended* to be run sequentially. Because of the nature of each stage, e.g. the triplets dataset creating stage taking two hours, the notebook should be run disjointly.

It is thus advised to first run the STAGE-4 (GPU required) where the triplets and the vectorized datasets are imported from the Google Drive and everything else is ready to be working.

Otherwise, you can run the notebook sequentially but the last stage will still use the files created beforehand rather than use the one produced during this session. This means that all the parts of the task can be theoretically reproduced in this notebook.

**How did I decide to tackle this problem?**

To create vector embeddings for users and movies of the MovieLens dataset I decided to use a custom Siamese network and a Bert model as an initial encoder as using only Fasttext, for instance, as embedder and look straight as is for similar movies neighbors proved unreliable. Thus, I needed to make my own embeddings.

The basic idea of why Siamese network is ideal for clustering and suggesting similar to something items is expressed at the Doordash Engineering blog in the article [Using Triplet Loss and Siamese Neural Networks to Train Catalog Item Embeddings](https://doordash.engineering/2021/09/08/using-twin-neural-networks-to-train-catalog-item-embeddings) and also at the medium page [News Aggregator in 2 weeks](https://towardsdatascience.com/news-aggregator-in-2-weeks-5b38783b95e3) by Ilya Gusev. The key idea is that Siamese network allows to take vectors of movies and force the similar movies to be close to each other in the latent vector space while making dissimilar movies as far from each other as possible. As a result, by carefully crafting a dataset in the form (anchor, positive, negative) we can start training and change the model's weights in such a way that a naturally existing clusters that weren't robust enough in the previous vector space start to emerge in the new one.

In the case of Doordash the triplets for the Siamese network were formed using various heuristics, for instance, as whether a person bought an item or not etc; Ilya Gusev used time of the article was written to create the triplets. Our case is similar but still different. To create a triplet for each movie we took its genres strings and its first 5 most relevant tags defined in the genome files. We then considered an anchor movie as being close in the meaning to the positive under condition that they shared most overlapping strings in their genres and tags (hereafter the descriptors). Negative and anchor pairs respectively were defined as the pairs with the least number of overlapping elements with the current anchor element. At the end the list of descriptors for each movie was embedded using Bert encoder by extracting CLS token of the string and averaging the resulting embeddings.

I trained the Siamese network using Triplet loss with only one layer with 50 hidden units. As a distance metric I used the Euclidean distance metric with normalization but no ReLu activation, dropouts or more layers used. I trained the model for 10 epochs without using early stopping. Train and test split were prepared in the proportion 90 % and 10 % respectively since the dataset was big enough to allow that: the resulting dataset contained roughly 57 thousand triplets.

Thus, the movie embedding is the averaged embedding of the Bert CLS token over the descriptors that was passed through the Siamese network embedder. The user embedding, however, is the weighted average of the movies embeddings that the person already seen and ranked. As weights we used the user’s rating for each movie scaled from 0 to 1. The recommendations were prepared using the Scikit’s K-Nearest Neighbor algorithm where the movies were discarded from recommendation list if the user had already seen them.

The simplicity of the model didn’t seem to affect the end results too badly. But here some improvements that could be made: 1) improve heuristics for how triplets are formed especially when it comes to picking the negative element 2) Add more layers, dropout and ReLu activations to the model to make it more robust to overfitting 3) lemmatize the genome tags for the movies and keep only unique ones 4) deal with the dataset being skewed to much: almost 60 % of the data is genres Drama and Comedy and as a result it is hard, for instance, to find something very specific recommended as a movie about climbers or alpinists 5) instead of 5 most relevant genome tags I could include 10 6) I could improve the way the embedding for the movie prepared by using something more sophisticated then average.


# STAGE-1 DATA WRANGLING

Run this set of cells to see how the data was preprocessed to look like merged_clean_final.csv in the stage 3.

Installation and imports

In [None]:
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9657 sha256=f341312de7eb601e70b8bc444f20e674fd5717ca77c874169028143d3a340f96
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
!pip install transformers

In [None]:
import wget
import pandas as pd
import ast
from tqdm.auto import tqdm
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances_argmin_min
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np

Download the dataset

In [None]:
output_directory = r'/content'
url = r'https://files.grouplens.org/datasets/movielens/ml-latest.zip'
filename = wget.download(url, out=output_directory)

In [None]:
!unzip /content/ml-latest.zip

Archive:  /content/ml-latest.zip
   creating: ml-latest/
  inflating: ml-latest/links.csv     
  inflating: ml-latest/tags.csv      
  inflating: ml-latest/genome-tags.csv  
  inflating: ml-latest/ratings.csv   
  inflating: ml-latest/README.txt    
  inflating: ml-latest/genome-scores.csv  
  inflating: ml-latest/movies.csv    


In [None]:
cd ml-latest/

/content/ml-latest


Load and format genome-scores.csv file

In [None]:
df_scores = pd.read_csv('genome-scores.csv', encoding='utf-8')

1) Sort by the relevance column so that the top relevant tags for a movie are at the top and 2) then group by the the movieId column 3) Keep for each movie only top five tagId rows - this way each movie will have five words that best describe it

In [None]:
df_scores = df_scores.sort_values('relevance', ascending=False).groupby('movieId').head(5)

Load the genome-tags.csv since the previos files only had the tags ids but not the tags themselves

In [None]:
df_tags = pd.read_csv('genome-tags.csv', encoding='utf-8')

Merge two dataframes

In [None]:
df_5_tags = pd.merge(df_scores, df_tags, on='tagId', how='left') # merge two columns replacings the tag ids with their actual word representation
df_5_tags = df_5_tags.drop('tagId', axis=1)
df_5_tags = df_5_tags.rename(columns={'tag': 'tagId'})

Merge all tags for each movie into one row

In [None]:
df_5_tags = df_5_tags.drop('relevance', axis=1)
df_5_tags = df_5_tags.groupby('movieId')['tagId'].apply(list).reset_index(name='tags_5')

Load the movies.csv

In [None]:
df_movies = pd.read_csv('movies.csv', encoding='utf-8')

Merge df_5_tags and df_movies so that now each movie besides genres column describing its main genres would have also a tags_5 column which tells which tags most desctive each movie

In [None]:
df_movies = pd.merge(df_movies, df_5_tags, on='movieId', how='left')
df_movies['tags_5'] = df_movies['tags_5'].fillna('no tags')

Note that the table now has 4 colums (movieId, title, genres, tags_5) and 58098 rows

Since each movie has multiple genres ascribed to and it is expressed as a string with a pipe as a delimiter we need to break each string into a list

In [None]:
df_movies['genres'] = df_movies['genres'].str.split('|')

Get some statistics on genres column

In [None]:
# Create a new dataframe where each row represents a single genre for a single movie
df_genres = df_movies.explode('genres')

# Count the number of occurrences of each genre
genre_counts = df_genres['genres'].value_counts()

# Calculate the percentage of movies that belong to each genre
genre_percentages = genre_counts / len(df_movies) * 100

# Display the results
print(genre_counts)
print(genre_percentages)

Drama                 24144
Comedy                15956
Thriller               8216
Romance                7412
Action                 7130
Horror                 5555
Documentary            5118
Crime                  5105
(no genres listed)     4266
Adventure              4067
Sci-Fi                 3444
Mystery                2773
Children               2749
Animation              2663
Fantasy                2637
War                    1820
Western                1378
Musical                1113
Film-Noir               364
IMAX                    197
Name: genres, dtype: int64
Drama                 41.557369
Comedy                27.463940
Thriller              14.141623
Romance               12.757754
Action                12.272367
Horror                 9.561431
Documentary            8.809253
Crime                  8.786877
(no genres listed)     7.342766
Adventure              7.000241
Sci-Fi                 5.927915
Mystery                4.772970
Children               4.7316

While exploring dataset we found that some movies have neither genre nor tags describing it. We need to separate them from the main dataset

In [None]:
count = df_movies.apply(lambda x: x['genres'] == ['(no genres listed)'] and x['tags_5'] == 'no tags', axis=1).sum()
print(f'\nNumber of movies that have no desctiptive information: {count}')


Number of movies that have no desctiptive information: 4237


Separate the empty movies

In [None]:
df_filtered = df_movies[df_movies.apply(lambda x: x['genres'] == ['(no genres listed)'] and x['tags_5'] == 'no tags', axis=1)]

Export the ids into a txt file

In [None]:
ids_with_no_info = df_filtered['movieId'].to_list()

In [None]:
with open('no_info_ids.txt', 'w') as f:
  for i in ids_with_no_info:
    f.write((str(i)) + ',')

I decided that 4k samples missing from the dataset it is too much so I wrote and parsed the MovieLens page and retrieved for most of the movies their genres

You can look into the scripts that I used at my [repository](https://github.com/eistakovskii/vk_internship_2023/tree/main/scrape_genres)
I don't include the code here to execute since I used Selenium and it took me almost 9 hours to parse 4k webpages. Anyway, all the scripts and their output files you can find in the repo

Clone the repo with restored data that I parsed

In [None]:
!git clone https://github.com/eistakovskii/vk_internship_2023.git

Cloning into 'vk_internship_2023'...
remote: Enumerating objects: 14, done.[K
remote: Counting objects: 100% (14/14), done.[K
remote: Compressing objects: 100% (10/10), done.[K
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (14/14), 171.44 KiB | 1.84 MiB/s, done.


Load the missing data

In [None]:
df_repl = pd.read_csv(r'/content/ml-latest/vk_internship_2023/scrape_genres/output_clean.csv', encoding='utf-8')
df_repl['Genres'] = df_repl['Genres'].apply(ast.literal_eval) # list was registered as a string, fixing it here

Replacing the data back into the main dataframe by merging and filling missing values

In [None]:
df_merged = pd.merge(df_movies, df_repl, left_on='movieId', right_on='MovieId', how='left')
df_merged['genres'] = df_merged['Genres'].where(df_merged['Genres'].notnull(), df_merged['genres'])
df_merged = df_merged.drop('Genres', axis=1)
df_merged = df_merged.drop('MovieId', axis=1)

Delete rows in df5 where 'genres' column contains ['(no genres listed)'] and 'tags_5' column contains 'no tags'. I need to do that because for some ids I couldn't retrieve genres or there weren't any. Around 617 sample was lost as a result

In [None]:
df_merged = df_merged[~((df_merged['genres'].apply(lambda x: '(no genres listed)' in x)) & (df_merged['tags_5'] == 'no tags'))]

Define a function to merge the 'genres' and 'tags_5' columns

In [None]:
def merge_genres_tags(row):
    genres = row['genres']
    tags = row['tags_5']
    
    # check if genres is a list and remove '(no genres listed)' if it's in the list
    if isinstance(genres, list) and '(no genres listed)' in genres:
        genres.remove('(no genres listed)')
    
    # check if tags is a string and set it to an empty list if it's 'no tags'
    if isinstance(tags, str) and tags == 'no tags':
        tags = []
    
    # concatenate the two lists and return the result
    return genres + tags


Create a new column 'descriptors' by applying the merge_genres_tags function to each row

In [None]:
df_merged['descriptors'] = df_merged.apply(merge_genres_tags, axis=1)

Export the resulting dataframe where each movie has a list of words that describe it

In [None]:
df_merged.to_csv('merged_clean_final.csv', index=False, encoding='utf-8') # Note that this is the csv that it is then imported at the stage 4

# STAGE-2 CREATE DATASET OF TRIPLETS

To form triplets I tried two approaches:
1) Slow one with O(n^2) speed using a for loop
2) Fast one using machine learning (K-means)

At the end I used a slow one and it took me two hours to create a dataset. I was satisfied with how anchors and positives turned out though negatives weren't very varied.

Fast solution created dataset in 5 mins, anchors, positives, negatives are okay but at the time of developing I was not sure that the dataset was good enough so I didn't use it. Afterall this solution does not use sets to match based on literal overlaps and thus can't really guarantee the match. Still, the solution using K-means is viable and seems to be seriously speeding up the dataset preparation process.  

## SLOW SOLUTION

Below is the function that takes a list of tuples as input and returns a pandas DataFrame with columns 'anchor', 'positive', and 'negative' where each element is a movie id.

This function iterates over each tuple in the input list and treats it as an anchor. For each anchor, it finds the positive element by sorting the remaining tuples by the size of the intersection of their sets with the anchor's set in descending order and taking the first element. It then finds the negative element by sorting the remaining tuples by the size of the intersection of their sets with the anchor's set in ascending order and taking a random choice from the first three elements. The function then appends a tuple containing the ids of the anchor, positive, and negative elements to a list. Finally, it creates a DataFrame from this list with columns 'anchor', 'positive', and 'negative' and returns it.

In [None]:
# DO NOT RUN THIS CELL. FOR DEMONSTRATION PURPOSES ONLY. I MEAN IT WILL LITERALLY TAKE HOURS TO FINISH
def process_descriptors(row):
    descriptors = row['descriptors']
    return row['movieId'], set(map(str.lower, descriptors))

result = list(df_merged.apply(process_descriptors, axis=1))

def create_dataframe_2(tuples):
    data = []
    for anchor in tqdm(tuples):
        positives = [t for t in tuples if t[0] != anchor[0]]
        positives.sort(key=lambda x: len(x[1].intersection(anchor[1])), reverse=True)
        positive = positives[0]
        negatives = [t for t in tuples if t[0] != anchor[0] and t[0] != positive[0]]
        negatives.sort(key=lambda x: len(x[1].intersection(anchor[1])))
        negative = random.choice(negatives[:3])
        data.append((anchor[0], positive[0], negative[0]))
    df = pd.DataFrame(data, columns=['anchor', 'positive', 'negative'])
    return df

df_siam = create_dataframe_2(result)

df_siam.to_csv(r'triplets_dataset.csv', encoding='utf-8')

##FAST SOLUTION

Define a function to lowercase each string in the 'descriptors' list and convert it into a set

In [None]:
def process_descriptors(row):
    descriptors = row['descriptors']
    return row['movieId'], set(map(str.lower, descriptors))

Now create a list of tuples from the 'movieId' and processed 'descriptors' columns

In [None]:
result = list(df_merged.apply(process_descriptors, axis=1))
# The final tuple for each movie will look somewhat like that 
#  (1,
#  {'adventure',
#   'animation',
#   'children',
#   'comedy',
#   'computer animation',
#   'fantasy',
#   'kids and family',
#   'pixar animation',
#   'toys'})

Create a dictionary to map each movie ID to its set of descriptors for future ease of use

In [None]:
descriptors_dict = {movie_id: descriptors for movie_id, descriptors in result}

  This function uses K-means to cluster the tuples based on their sets of descriptors and find the positive and negative values for each anchor. First the sets of descriptors are converted into strings by joining their elements using the join method. These strings are then passed to the CountVectorizer,which converts them into a matrix of binary features. Then the function uses the KMeans class from the scikit library to cluster the tuples into k clusters. For each tuple, the function finds the positive value as the closest tuple in the same cluster as the anchor and the negative value as the closest tuple in a different cluster than the anchor. Finally, it creates a DataFrame with columns 'anchor', 'positive', and 'negative' containing the respective values.

In [None]:
def create_dataframe(tuples):
    data = []
    descriptors_dict = {movie_id: ' '.join(descriptors) for movie_id, descriptors in tuples}
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(descriptors_dict.values())
    kmeans = KMeans(n_clusters=min(len(tuples), 10))
    kmeans.fit(X)
    closest_tuples, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
    for i, anchor in enumerate(tuples):
        positive = tuples[closest_tuples[kmeans.labels_[i]]]
        negative_cluster = min(set(range(kmeans.n_clusters)) - {kmeans.labels_[i]}, key=lambda x: kmeans.transform(X[i].reshape(1,-1))[0][x])
        negative = tuples[closest_tuples[negative_cluster]]
        data.append((anchor[0], positive[0], negative[0]))
    df = pd.DataFrame(data, columns=['anchor', 'positive', 'negative'])
    return df

In [None]:
df_siam = create_dataframe(result) # runnign time is 5 min



In [None]:
df_siam

Unnamed: 0,anchor,positive,negative
0,1,4075,576
1,2,4075,3085
2,3,825,576
3,4,825,130
4,5,576,825
...,...,...,...
57476,193876,3085,134
57477,193878,576,825
57478,193880,130,1165
57479,193882,2658,130


Sanity check that the triplets are somewhat valid

In [None]:
print(f'\n{descriptors_dict[1]}\n{descriptors_dict[4075]}\n{descriptors_dict[576]}')


{'pixar animation', 'animation', 'fantasy', 'comedy', 'kids and family', 'adventure', 'toys', 'computer animation', 'children'}
{'animation', 'children'}
{'comedy'}


In [None]:
print(f'\n{descriptors_dict[2]}\n{descriptors_dict[4075]}\n{descriptors_dict[3085]}')


{'fantasy', 'kids', 'adventure', 'jungle', 'children'}
{'animation', 'children'}
{'horror'}


In [None]:
print(f'\n{descriptors_dict[3]}\n{descriptors_dict[825]}\n{descriptors_dict[576]}')


{'romance', 'comedy', 'sequel', 'original', 'sequels', 'good sequel'}
{'romance', 'comedy'}
{'comedy'}


Export the resulting dataset as a csv

In [None]:
df_siam.to_csv('triplets_dataset.csv', encoding='utf-8') # note that exactly this dataset we are using at the end when training

# STAGE-3 CAST TRIPLETS DATASET INTO VECTOR FORM

Note that now we have two dataframes triplets_dataset and merged_clean_final.

While we have a ready dataset in triplets_dataset it contains only pointers (movie indices) to the textual data (a list of strings) in merged_clean_final. We need however a vector representation to train a Siamese network

To prepare the dataset I first need to find an encoder to get for each movie an averaged embedding of the words in the column 'descriptors'.
First I thought of using Fasttext but decided against it since it takes to much space, takes too much time load its almost 5 gb.

In [None]:
df_merged

Unnamed: 0,movieId,title,genres,tags_5,descriptors
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[toys, computer animation, pixar animation, an...","[Adventure, Animation, Children, Comedy, Fanta..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]","[adventure, children, fantasy, kids, jungle]","[Adventure, Children, Fantasy, adventure, chil..."
2,3,Grumpier Old Men (1995),"[Comedy, Romance]","[sequel, good sequel, sequels, comedy, original]","[Comedy, Romance, sequel, good sequel, sequels..."
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]","[women, chick flick, girlie movie, romantic, a...","[Comedy, Drama, Romance, women, chick flick, g..."
4,5,Father of the Bride Part II (1995),[Comedy],"[good sequel, sequel, sequels, pregnancy, fath...","[Comedy, good sequel, sequel, sequels, pregnan..."
...,...,...,...,...,...
58093,193876,The Great Glinka (1946),"[music, history]",no tags,"[music, history]"
58094,193878,Les tribulations d'une caissière (2011),[Comedy],no tags,[Comedy]
58095,193880,Her Name Was Mumu (2016),[Drama],no tags,[Drama]
58096,193882,Flora (2017),"[Adventure, Drama, Horror, Sci-Fi]",no tags,"[Adventure, Drama, Horror, Sci-Fi]"


I stopped at the idea of using your run of the mill Bert encoder and extract for each word as input its CLS token. This way I will be able to both catch a decent semantic representation of the string and also deal with long strings, bigrams and trigrams, for them you need to take into account the syntax, the CLS token does fairly well that.

First intitiate Bert

In [None]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model = model.to('cuda')

Create a function that extracts a CLS token for the input string

In [None]:
def get_cls_token(str_in: str, model_curr, tokenizer_curr):
    
    inputs = tokenizer(str_in, return_tensors='pt').to('cuda')
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden_states = outputs.hidden_states[-1]
    cls_token = last_hidden_states[0,0,:]
    cls_token = cls_token.detach().cpu().numpy()
    
    return cls_token

Define the function that takens as input a list of strings and gets an average embedding for it

In [None]:
def avg_embedding(strings):
  embeddings = []
  for s in strings:
    embeddings.append(get_cls_token(s, model, tokenizer))
  avg_embedding = np.mean(embeddings, axis=0)
  return avg_embedding

Now the code below will go through each row of the df_merged and use the avg_embedding function to turn each list of strings describing respective movie in the descriptors column into a numpy vector. These vectors are then stored in a numpy matrix with dimensions len(df_final) by 300. Finally, the matrix is exported to a file named result_matrix.npy. At the time of running it took around 40 minutes on GPU to get all vectors

In [None]:
result_matrix = np.zeros((len(df_merged), 768))
for i, row in df_merged.iterrows():
    result_matrix[i] = avg_embedding(row['descriptors'])

np.save('result_matrix.npy', result_matrix) # this matrix will be imported then at the stage 4

Congratulations! The dataset is prepared!

Now we have two csv files: one csv in the form anchor, positive, negative where each element has a pointer to the movie id in another csv file that contains the movies indices with their meanigful descriptors and their dataframe indices. These aforementioned indices as you have already guessed point in their turn to the numpy matrix. This way we store all the necessary files to train the Siamese network and then run inference on it while preserving the connection and links to the data.

Please restart the notebook and run only the last stage to see the training and get a chance test the recommender model in inference.

# STAGE-4 TRAIN SIAMESE NETWORK AND IMPLEMENT RECOMMENDER FUNCTIONS

Imports and downloads

In [1]:
!pip install gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!gdown https://drive.google.com/uc?id=1rc5pUykrv0d9Sk_n0WSN24zfiqSTYZWi # download the main csv file

Downloading...
From: https://drive.google.com/uc?id=1rc5pUykrv0d9Sk_n0WSN24zfiqSTYZWi
To: /content/merged_clean_final.csv
  0% 0.00/6.62M [00:00<?, ?B/s]100% 6.62M/6.62M [00:00<00:00, 88.6MB/s]


In [3]:
!gdown https://drive.google.com/uc?id=1AMrfYhXCJWn8xjgJ1VBq8DZscpzvJlM_ # download the triplets dataset

Downloading...
From: https://drive.google.com/uc?id=1AMrfYhXCJWn8xjgJ1VBq8DZscpzvJlM_
To: /content/triplets_dataset.csv
  0% 0.00/1.07M [00:00<?, ?B/s]100% 1.07M/1.07M [00:00<00:00, 87.9MB/s]


In [4]:
!gdown https://drive.google.com/uc?id=1an3kEe79b5K92nTBhQ4RtlbdWasFooy1 # download the numpy matrix with with every movie encoded as an vector (768,). The dimensions are (57481, 768)

Downloading...
From: https://drive.google.com/uc?id=1an3kEe79b5K92nTBhQ4RtlbdWasFooy1
To: /content/result_matrix.npy
100% 353M/353M [00:03<00:00, 100MB/s]


In [5]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.0-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m41.7 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m14.2 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m24.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.0


In [6]:
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9657 sha256=bda90385611f50e4655cb8256a38c7660b8765f40544baaf0feddc412a3c2bb7
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [7]:
import wget

import pandas as pd
import ast
import numpy as np

from sklearn import metrics
from scipy import spatial

import torch
import torch.nn as nn
import torch.optim as optim

import random
import time

import plotly.express as px
from sklearn.decomposition import PCA

from transformers import AutoModel, AutoTokenizer

from collections import Counter

from sklearn.neighbors import NearestNeighbors

Load csv files and the numpy matrix

In [8]:
output_directory = r'/content' 
url = r'https://files.grouplens.org/datasets/movielens/ml-latest.zip' # Download the dataset
filename = wget.download(url, out=output_directory)

In [9]:
!unzip /content/ml-latest.zip

Archive:  /content/ml-latest.zip
   creating: ml-latest/
  inflating: ml-latest/links.csv     
  inflating: ml-latest/tags.csv      
  inflating: ml-latest/genome-tags.csv  
  inflating: ml-latest/ratings.csv   
  inflating: ml-latest/README.txt    
  inflating: ml-latest/genome-scores.csv  
  inflating: ml-latest/movies.csv    


In [10]:
cd ml-latest/

/content/ml-latest


In [11]:
df_triplets = pd.read_csv(r'/content/triplets_dataset.csv', encoding='utf-8')
df_triplets = df_triplets.drop('Unnamed: 0', axis=1)

In [12]:
df_main = pd.read_csv(r'/content/merged_clean_final.csv', encoding='utf-8')
df_main['descriptors'] = df_main['descriptors'].apply(ast.literal_eval)

In [13]:
main_matrix = np.load(r'/content/result_matrix.npy')

Implement function to retrieve for every triplet (anchor, positive, negative) in df_triplets a respective tuple with embeddings

How does it work?

This function takes as input the df_triplets and df_main dataframes and the np_matrix numpy matrix. It iterates over each row of df_triplets, finds the index of the corresponding movieId in df_main, retrieves the vector at that index from the numpy matrix and appends it to a list of tuples. The final result is a list of tuples containing the vectors for the anchor, positive and negative columns of each row in df_triplets.

This resulting list of tuples with vectors we can safely feed into a further down the line Siames network

In [14]:
def get_vectors(df_triplets, df_main, np_matrix):
    result = []
    for _, row in df_triplets.iterrows():
        anchor_index = df_main[df_main['movieId'] == row['anchor']].index[0]
        positive_index = df_main[df_main['movieId'] == row['positive']].index[0]
        negative_index = df_main[df_main['movieId'] == row['negative']].index[0]
        anchor_vector = np_matrix[anchor_index]
        positive_vector = np_matrix[positive_index]
        negative_vector = np_matrix[negative_index]
        result.append((anchor_vector, positive_vector, negative_vector))
    return result

Call the function and retrieve triplets thus as a result creating a dataset

In [15]:
samples = get_vectors(df_triplets, df_main, main_matrix) # running time around 1 min

Split the resulting dataset into train and test (90/10)

In [16]:
test_size = len(samples) // 10
train_samples = samples[:-test_size]
test_samples = samples[-test_size:]

Run a sanity check to make sure that the anchor is in fact close to the positive and similarly anchor is distant from a negative

In [25]:
scores = []
test_y = []
for sample in test_samples:
    left_vector, pos_right_vector, neg_right_vector = sample
    test_y += [1, 0]
    scores.append(-spatial.distance.cosine(left_vector, pos_right_vector))
    scores.append(-spatial.distance.cosine(left_vector, neg_right_vector))
roc_auc = metrics.roc_auc_score(test_y, scores)
print(f'TEST ROC AUC: {roc_auc}')

TEST ROC AUC: 0.7505658680590361


Define Siamese model with Triplet loss

In [18]:
random.seed(42) # fix the random seed

In [19]:
class SiamiseModelTripletLoss(nn.Module):
    def __init__(self, embedding_dim=768, hidden_dim=50):
        super().__init__()
        
        self.mapping_layer = nn.Linear(embedding_dim, hidden_dim)
        self.distance = nn.PairwiseDistance(p=2)
        self.margin = 0.3
    
    def build_projections(self, in_vectors):
        projections = self.mapping_layer(in_vectors)
        norm = projections.norm(p=2, dim=1, keepdim=True)
        projections = projections.div(norm)
        return projections

    def forward(self, pivot_vectors, positive_vectors, negative_vectors):
        pivot = self.build_projections(pivot_vectors)
        positive = self.build_projections(positive_vectors)
        negative = self.build_projections(negative_vectors)
        distances = self.distance(pivot, positive) - self.distance(pivot, negative) + self.margin
        loss = torch.mean(torch.max(distances, torch.zeros_like(distances)))
        return loss
    
    def apply(self, vectors):
        return self.build_projections(vectors)

Define helper function to generate batches

In [20]:
def get_next_gen_batch(samples, batch_size=32):
    indices = np.arange(len(samples))
    np.random.shuffle(indices)
    batch_begin = 0
    while batch_begin < len(samples):
        batch_indices = indices[batch_begin: batch_begin + batch_size]
        pivot_vectors = []
        positive_vectors = []
        negative_vectors = []
        for data_ind in batch_indices:
            pivot, positive, negative = samples[data_ind]
            pivot_vectors.append(pivot)
            positive_vectors.append(positive)
            negative_vectors.append(negative)
        batch_begin += batch_size
        yield torch.cuda.FloatTensor(pivot_vectors), torch.cuda.FloatTensor(positive_vectors), torch.cuda.FloatTensor(negative_vectors)

Define training loop

In [21]:
def train_model(model, train_samples, val_samples, epochs_count=10, loss_every_nsteps=10000, lr=0.01, device_name="cuda"):
    device = torch.device(device_name)
    model = model.to(device)
    total_loss = 0
    start_time = time.time()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    prev_avg_val_loss = None
    for epoch in range(epochs_count):
        model.train()
        for step, (pivot, positive, negative) in enumerate(get_next_gen_batch(train_samples)):
            loss = model(pivot, positive, negative)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            total_loss += loss.item()
            if step % loss_every_nsteps == 0:
                val_total_loss = 0
                val_batch_count = 0
                model.eval()
                for _, (pivot, positive, negative) in enumerate(get_next_gen_batch(val_samples)):
                    val_total_loss += model(pivot, positive, negative)
                    val_batch_count += 1
                avg_val_loss = val_total_loss/val_batch_count
                print("Epoch = {}, Avg Train Loss = {:.6f}, Avg val loss = {:.6f}, Time = {:.2f}s".format(epoch, total_loss / loss_every_nsteps, avg_val_loss, time.time() - start_time))
                total_loss = 0
                start_time = time.time()

Start training

In [22]:
random.shuffle(train_samples)
random.shuffle(test_samples)
model = SiamiseModelTripletLoss()
train_model(model, train_samples, test_samples) # running time around 3 min

  yield torch.cuda.FloatTensor(pivot_vectors), torch.cuda.FloatTensor(positive_vectors), torch.cuda.FloatTensor(negative_vectors)


Epoch = 0, Avg Train Loss = 0.000020, Avg val loss = 0.116902, Time = 2.65s
Epoch = 1, Avg Train Loss = 0.000108, Avg val loss = 0.000010, Time = 25.67s
Epoch = 2, Avg Train Loss = 0.000021, Avg val loss = 0.000036, Time = 17.90s
Epoch = 3, Avg Train Loss = 0.000021, Avg val loss = 0.000060, Time = 17.86s
Epoch = 4, Avg Train Loss = 0.000011, Avg val loss = 0.000045, Time = 19.33s
Epoch = 5, Avg Train Loss = 0.000018, Avg val loss = 0.000000, Time = 17.71s
Epoch = 6, Avg Train Loss = 0.000008, Avg val loss = 0.000070, Time = 20.44s
Epoch = 7, Avg Train Loss = 0.000008, Avg val loss = 0.000039, Time = 18.33s
Epoch = 8, Avg Train Loss = 0.000009, Avg val loss = 0.000008, Time = 17.83s
Epoch = 9, Avg Train Loss = 0.000002, Avg val loss = 0.000003, Time = 19.42s


Sanity check after

In [24]:
test_left = []
test_right = []
test_y = []
for sample in test_samples:
    left, pos_right, neg_right = sample
    test_left += [left, left]
    test_right += [pos_right, neg_right]
    test_y += [1, 0]

batch = []
batch_start = 0
nrows = len(test_left)
scores = []
while batch_start < nrows:
    batch_end = batch_start + 32
    left_batch = test_left[batch_start: batch_end]
    right_batch = test_right[batch_start: batch_end]
    left = model.apply(torch.cuda.FloatTensor(left_batch)).cpu().detach().numpy()
    right = model.apply(torch.cuda.FloatTensor(right_batch)).t().cpu().detach().numpy()
    left = left / np.linalg.norm(left)
    right = right / np.linalg.norm(right)
    score = (left.dot(right) + 1.0) / 2.0 - 1.0
    score = np.diag(score)
    scores.extend(score.tolist())
    batch_start = batch_end
roc_auc = metrics.roc_auc_score(test_y, scores)
print(f'TEST ROC AUC: {roc_auc}')

TEST ROC AUC: 0.9953614315759703


Create an Enbedder class to use the linear layer weights trained before for encoding

In [26]:
class Embedder(nn.Module):
    def __init__(self, embedding_dim=768, hidden_dim=50):
        super().__init__()
        
        self.mapping_layer = nn.Linear(embedding_dim, hidden_dim)
    
    def forward(self, in_vectors):
        projections = self.mapping_layer(in_vectors)
        norm = projections.norm(p=2, dim=1, keepdim=True)
        projections = projections.div(norm)
        return projections

Cast to embedder the resulting model transferring its weights and biases

In [27]:
model = model.cpu()
embedder = Embedder()
embedder.mapping_layer.weight.data = model.mapping_layer.weight.data
embedder.mapping_layer.bias = model.mapping_layer.bias

Pass the matrix of our dataset from (57481, 768) to (57481, 50) using our freshly made encoder

In [28]:
input_vectors = torch.FloatTensor(main_matrix)
embeddings = embedder(input_vectors)
embeddings = embeddings.detach().cpu().numpy()

Plot our vectors to make sure some clusters emerge. Note that the plot is interactive and you can look in to see cluster closer

In [29]:
def plot_2d_interactive(matrix):
    pca = PCA(n_components=2)
    reduced_matrix = pca.fit_transform(matrix)
    fig = px.scatter(x=reduced_matrix[:, 0], y=reduced_matrix[:, 1])
    fig.show()
plot_2d_interactive(embeddings)

Initiate Bert encoder

In [30]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model = model.to('cuda')

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Define function to take bert as encoder and retrieve its CLS token encoding to represent worsd longer than 1

In [31]:
def get_cls_token(str_in: str):
    inputs = tokenizer(str_in, return_tensors='pt').to('cuda')
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden_states = outputs.hidden_states[-1]
    cls_token = last_hidden_states[0,0,:]
    cls_token = cls_token.detach().cpu().numpy()
    return cls_token

Define function to get an average embedding of a list of words and phrases

In [32]:
def avg_embedding(strings):
  embeddings_in = []
  for s in strings:
    embeddings_in.append(get_cls_token(s))
  avg_embedding = np.mean(embeddings_in, axis=0)
  return avg_embedding

Define a function to retrieve k-nearest neighbors

In [34]:
def find_nearest_neighbors_for_key_words(df_main, embeddings, test_embedding):
    # Fit the NearestNeighbors model to the data
    nbrs = NearestNeighbors(n_neighbors=10).fit(embeddings)
    
    # Find the indices of the 10 nearest neighbors for test_embedding
    distances, indices = nbrs.kneighbors(test_embedding.reshape(1,-1))

    # Get the titles of the 10 nearest neighbors from the dataframe
    titles = list(df_main.loc[indices[0], 'title'].values)
    descr = list(df_main.loc[indices[0], 'descriptors'].values )
    ids = list(df_main.loc[indices[0], 'movieId'].values)

    l_of_tuples = zip(ids, titles, descr)

    print('\n\tMOVIES FOUND TO BE MOST SIMILAR TO THE KEY WORDS PROVIDED:\n')
    curr_descr = list()
    for i in l_of_tuples:
      ind, ttl, des = i
      curr_descr.extend(des)
      print(f'\t{ind}\t{ttl}\t{des}')
    
    curr_descr = [i.lower() for i in curr_descr]
    descr_counter = Counter(curr_descr)
    
    print(f'\n\tHERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH THE MOVIES WE SUGGESTED:')
    print(f'\n\t{descr_counter.most_common(10)}')

    print('\n')
    
    pass

Define a function that takes a list of keywords, produce an average embedding and finds for it closest neighbors, i.e. movies most similar to these key words

In [35]:
def return_movies_most_associated_with_key_words(your_key_words: list):
  print('\n\tKEY WORDS PROVIDED:\n')
  print(f'\t{your_key_words}')
  key_w_emb = avg_embedding(your_key_words)
  key_w_emb = torch.FloatTensor(key_w_emb)
  key_w_emb = key_w_emb.unsqueeze(0) # add an extra dimension
  key_w_emb = embedder(key_w_emb)
  key_w_emb = key_w_emb.detach().cpu().numpy()

  find_nearest_neighbors_for_key_words(df_main, embeddings, key_w_emb)

  pass

Feel free to test your list of key words

In [36]:
return_movies_most_associated_with_key_words(['Children', 'family', 'animation'])


	KEY WORDS PROVIDED:

	['Children', 'family', 'animation']

	MOVIES FOUND TO BE MOST SIMILAR TO THE KEY WORDS PROVIDED:

	4090	Brave Little Toaster, The (1987)	['Animation', 'Children', 'childhood', 'animation', 'cartoon', 'kids', 'children']
	5539	Care Bears Movie II: A New Generation (1986)	['Animation', 'Children', 'kids and family', 'cartoon', 'kids', 'children', 'animation']
	4519	Land Before Time, The (1988)	['Adventure', 'Animation', 'Children', 'Fantasy', 'kids and family', 'animation', 'childhood', 'dinosaurs', 'oscar (best animated feature)']
	5538	Care Bears Movie, The (1985)	['Animation', 'Children', 'Fantasy', 'cartoon', 'kids', 'children', 'kids and family', 'childhood']
	89586	Phineas and Ferb the Movie: Across the 2nd Dimension (2011)	['Adventure', 'Animation', 'Children', 'cartoon', 'animation', 'family', 'animals', 'kids and family']
	6251	Piglet's Big Movie (2003)	['Animation', 'Children', 'disney animated feature', 'cartoon', 'animation', 'children', 'disney']
	118

In [37]:
return_movies_most_associated_with_key_words(['war', 'violence', 'drama'])


	KEY WORDS PROVIDED:

	['war', 'violence', 'drama']

	MOVIES FOUND TO BE MOST SIMILAR TO THE KEY WORDS PROVIDED:

	72605	Brothers (2009)	['Drama', 'Thriller', 'War', 'family bonds', 'war', 'drama', 'family drama', 'intense']
	8589	Winter War (Talvisota) (1989)	['Drama', 'War', 'finnish', 'best war films', 'war', 'war movie', 'history']
	48001	Bow, The (Hwal) (2005)	['Drama', 'Romance', 'boat', 'moral ambiguity', 'melancholic', 'visual', 'police investigation']
	3342	Birdy (1984)	['Drama', 'War', 'obsession', 'drama', 'friendship', 'unlikely friendships', 'great ending']
	8126	Shock Corridor (1963)	['Drama', 'criterion', 'insanity', 'stylized', 'murder', 'tense']
	144916	Taking Chances (2011)	['drama', 'family', 'war']
	159195	I, Daniel Blake (2016)	['Drama', 'heartbreaking', 'social commentary', 'disability', 'dramatic', 'workplace']
	137996	Silent Tongue (1993)	['Drama', 'Horror', 'Western']
	61449	Burning Plain, The (2008)	['Drama', 'Romance', 'storytelling', 'non-linear', 'guilt', 

Load in the user ratings.csv table

In [38]:
df_ratings = pd.read_csv('ratings.csv', encoding='utf-8')
df_ratings = df_ratings.drop('timestamp', axis=1)

Define function to receive as input an integer referring to the user id in ratings.csv, retrieve the movies seen by the user and their respective ratings and form the user embedding as a weighted mean of movie vectors

In [39]:
def get_weighted_mean_embedding(userId, df_ratings, df_main):
    
    # extract movie ids and ratings for user
    user_ratings = df_ratings[df_ratings['userId'] == userId][['movieId', 'rating']]

    
    # merge with main dataframe to get movie titles and descriptors
    user_movie_info = pd.merge(user_ratings, df_main, on='movieId')

    user_movie_info = user_movie_info.drop('genres', axis=1)
    user_movie_info = user_movie_info.drop('tags_5', axis=1)

    movies_watched = user_movie_info['movieId'].to_list()
    
    # turn descriptors into embeddings
    user_movie_info['embedding'] = user_movie_info['descriptors'].apply(avg_embedding)
    # Scale ratings from 0 to 1
    user_movie_info['rating'] = user_movie_info['rating'] / 5.0 

    # calculate weighted mean embedding
    embeddings = np.array(user_movie_info['embedding'].tolist())
    ratings = np.array(user_movie_info['rating'].tolist())
    weighted_mean_embedding = np.average(embeddings, axis=0, weights=ratings)

    descr_l = user_movie_info['descriptors'].tolist()
    descr_l = [i.lower() for i in [item for sublist in descr_l for item in sublist]]
    descr_set = set(descr_l)
    descr_counter = Counter(descr_l)
    
    print(f'\n\tHERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES THE USER {userId} ALREADY SEEN:')
    print(f'\n\t{descr_counter.most_common(10)}')
    
    return weighted_mean_embedding, movies_watched

Define function to find movies most similar to the user embedding with a condition that they haven't been seen yet by the user

In [40]:
def find_nearest_neighbors_ids(df_main, embeddings, test_embedding, watched_mvs):
    seen_movies = set(watched_mvs)

    # Fit the NearestNeighbors model to the data
    nbrs = NearestNeighbors(n_neighbors=10).fit(embeddings)
    
    # Find the indices of the 10 nearest neighbors for test_embedding
    distances, indices = nbrs.kneighbors(test_embedding.reshape(1,-1))

    # Get the titles, indices, and descriptors of the 10 nearest neighbors from the dataframe
    titles = list(df_main.loc[indices[0], 'title'].values)
    descr = list(df_main.loc[indices[0], 'descriptors'].values )
    ids = list(df_main.loc[indices[0], 'movieId'].values)

    l_of_tuples = zip(ids, titles, descr)

    l_of_tuples = [i for i in l_of_tuples if i[0] not in seen_movies]

    print('\n\tRECOMMENDED MOVIES FOR THE USER NOT SEEN BEFORE:\n')

    for i in l_of_tuples:
      ind, ttl, des = i
      print(f'\t{ind}\t{ttl}\t{des}')

    curr_descr = list()
    for i in l_of_tuples:
      curr_descr.extend(i[2])
    curr_descr = [i.lower() for i in curr_descr]
    descr_counter = Counter(curr_descr)
    
    # print(descr_set)
    print(f'\n\tHERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES WE SUGGESTED:')
    print(f'\n\t{descr_counter.most_common(10)}')
    print('\n')
    pass

Define function that takes as an input a user id and recommeds to user movies based on the movies seen before and the ratings given

In [41]:
def return_recommendations_for_a_user_x(user_id_in: int):
  """
  This function returns movies recommended for a user with an id number user_id_in.
  The movies and user rating retrieved from the csv file ratings.csv
  Input:
    user_id_in: a unique user id (an integer) retrieve from ratings.csv from the column
  """
  user_x, movies_watched_user_x = get_weighted_mean_embedding(user_id_in, df_ratings, df_main)
  user_x = torch.FloatTensor(user_x) # your input vector here
  user_x = user_x.unsqueeze(0) # add an extra dimension
  user_x = embedder(user_x)
  user_x = user_x.detach().cpu().numpy()

  find_nearest_neighbors_ids(df_main, embeddings, user_x, movies_watched_user_x)

  pass

Feel free to test the function on any user

In [42]:
return_recommendations_for_a_user_x(3)


	HERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES THE USER 3 ALREADY SEEN:

	[('drama', 9), ('horror', 4), ('thriller', 3), ('oscar (best directing)', 3), ('original', 2), ('crime', 2), ('devil', 2), ('gangsters', 2), ('comedy', 2), ('mystery', 2)]

	RECOMMENDED MOVIES FOR THE USER NOT SEEN BEFORE:

	7820	Virgin Spring, The (Jungfrukällan) (1960)	['Crime', 'Drama', 'oscar (best foreign language film)', 'rape', 'criterion', 'innocence lost', 'bleak']
	4235	Amores Perros (Love's a Bitch) (2000)	['Drama', 'Thriller', 'oscar (best foreign language film)', 'amazing photography', 'imdb top 250', 'violence', 'spanish']
	906	Gaslight (1944)	['Drama', 'Thriller', 'oscar (best actress)', 'tense', 'murder', 'manipulation', 'psychological']
	3335	Jail Bait (1954)	['Crime', 'Drama', 'horrible', "so bad it's funny", 'sexy', 'bad acting', 'original']
	111	Taxi Driver (1976)	['Crime', 'Drama', 'Thriller', 'loneliness', 'masterpiece', 'golden palm', 'imdb top 250', 'character study']
	60735	Shotg

In [43]:
return_recommendations_for_a_user_x(2)


	HERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES THE USER 2 ALREADY SEEN:

	[('comedy', 10), ('drama', 9), ('romance', 6), ('thriller', 3), ('action', 2), ('adventure', 2), ('awesome soundtrack', 2), ('los angeles', 2), ('dark humor', 2), ('relationships', 2)]

	RECOMMENDED MOVIES FOR THE USER NOT SEEN BEFORE:

	100226	Why Stop Now (2012)	['Comedy', 'Drama', 'dysfunctional family', 'original', 'independent film', 'addiction', 'weed']
	67839	Lucky Ones, The (2008)	['Comedy', 'Drama', 'War', 'road trip', 'war', 'independent film', 'road movie', 'iraq war']
	1484	Daytrippers, The (1996)	['Comedy', 'Drama', 'Mystery', 'Romance', 'relationships', 'dysfunctional family', 'independent film', 'ensemble cast', 'road trip']
	3410	Soft Fruit (1999)	['Comedy', 'Drama', 'dysfunctional family', 'visually appealing', 'culture clash', 'original', 'criterion']
	7122	King of Hearts (1966)	['Comedy', 'Drama', 'War', 'war', 'anti-war', 'insanity', 'wartime', 'war movie']
	7286	Simple Men (1992)	['C

In [44]:
return_recommendations_for_a_user_x(1)


	HERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES THE USER 1 ALREADY SEEN:

	[('comedy', 8), ('sci-fi', 8), ('thriller', 7), ('drama', 6), ('action', 5), ('horror', 3), ('criterion', 2), ('violent', 2), ('violence', 2), ('silly fun', 2)]

	RECOMMENDED MOVIES FOR THE USER NOT SEEN BEFORE:

	51709	Host, The (Gwoemul) (2006)	['Comedy', 'Drama', 'Horror', 'Sci-Fi', 'Thriller', 'monsters', 'social commentary', 'family drama', 'monster', 'family bonds']
	70728	Bronson (2009)	['Action', 'Comedy', 'Drama', 'Thriller', 'violent', 'male nudity', 'insanity', 'violence', 'prison']
	7924	Stray Dog (Nora inu) (1949)	['Drama', 'Film-Noir', 'Thriller', 'criterion', 'japan', 'bleak', 'tense', 'kurosawa']
	4108	Five Corners (1987)	['Drama', 'coen bros', 'quirky', 'violence', 'vengeance', 'criterion']
	2026	Disturbing Behavior (1998)	['Horror', 'Thriller', 'teen movie', 'high school', 'teen', 'teens', 'teenagers']
	163645	Hacksaw Ridge (2016)	['Drama', 'War', 'war', 'best war films', 'true story', 

In [45]:
return_recommendations_for_a_user_x(5)


	HERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES THE USER 5 ALREADY SEEN:

	[('drama', 52), ('crime', 34), ('thriller', 24), ('comedy', 24), ('action', 15), ('imdb top 250', 12), ('mystery', 11), ('war', 11), ('great acting', 9), ('romance', 9)]

	RECOMMENDED MOVIES FOR THE USER NOT SEEN BEFORE:

	2212	Man Who Knew Too Much, The (1934)	['Drama', 'Thriller', 'hitchcock', 'criterion', 'assassination', 'kidnapping', 'suspense']
	3335	Jail Bait (1954)	['Crime', 'Drama', 'horrible', "so bad it's funny", 'sexy', 'bad acting', 'original']
	2939	Niagara (1953)	['Drama', 'Thriller', 'noir thriller', 'hitchcock', 'murder', 'sexy', 'suspense']
	155288	Eye in the Sky (2016)	['Drama', 'Thriller', 'War', 'surveillance', 'military', 'suspense', 'terrorism', 'tense']
	4235	Amores Perros (Love's a Bitch) (2000)	['Drama', 'Thriller', 'oscar (best foreign language film)', 'amazing photography', 'imdb top 250', 'violence', 'spanish']
	55363	Assassination of Jesse James by the Coward Robert Ford, Th

In [46]:
return_recommendations_for_a_user_x(6)


	HERE ARE TOP 10 KEY WORDS MOST ASSOCIATED WITH MOVIES THE USER 6 ALREADY SEEN:

	[('action', 30), ('thriller', 22), ('crime', 15), ('drama', 15), ('comedy', 12), ('adventure', 11), ('sci-fi', 9), ('romance', 8), ('good action', 8), ('adapted from:book', 5)]

	RECOMMENDED MOVIES FOR THE USER NOT SEEN BEFORE:

	174055	Dunkirk (2017)	['Action', 'Drama', 'Thriller', 'War', '70mm', 'best war films', 'tense', 'war', 'intense']
	51086	Number 23, The (2007)	['Drama', 'Mystery', 'Thriller', 'obsession', 'plot twist', 'conspiracy theory', 'twist ending', 'twist']
	882	Trigger Effect, The (1996)	['Drama', 'Thriller', 'very interesting', 'original', 'bad ending', 'paranoia', 'pointless']
	37733	History of Violence, A (2005)	['Action', 'Crime', 'Drama', 'Thriller', 'violence', 'gratuitous violence', 'violent', 'brutality', 'graphic novel']
	2535	Earthquake (1974)	['Action', 'Drama', 'Thriller', 'natural disaster', 'disaster', '70mm', 'special effects', 'los angeles']
	1208	Apocalypse Now (1979)	[