# BBC Tech Screening - Principal Data Scientist

For this technical screening I chose the MovieLens dataset to build a personalised movie recommender system. It is a multi-model recommender. First, an approximate nearest neighbors lookup finds movies whose titles and genres are semantically similar to a base query, which acts as a form of candidate item filtering. Then a collaborative filtering model, trained on all of the user interactions, reranks the list of candidates, and movies the user has already seen are removed to create a personalised list of recommendations.

I have packaged all of this into a single recommender class. You can import it and call the *print_recs* method with an integer user ID and a string describing the movies to look up, as in the example below: *personalisedRecommender.print_recs(42, "Horror films with zombies")*.

I'm going to start off with this example, then walk backwards through the components of this project and wrap up with an evaluation and recommendations for improvement.

In [1]:
# Import the personalisedSearcher class
from recommender.recommender import personalisedSearcher

In [2]:
# Instantiate an instance of it, this will take a few moments as
# in the initialization it loads into memory all of the requisite data
personalisedRecommender = personalisedSearcher()

ImportError: The scann library is not present. Please install it using `pip install scann` to use the ScaNN layer.

In [None]:
# Now the recommendations can be generated for any userid and request string.
# Below you can see the example for user 42's personalised recommendations
# for the request for "Horror films with Zombies".
personalisedRecommender.print_recs(42, "Horror films with zombies")

## Intro

I'll now go through this project from the beginning, highlighting my choices and how I made them. The first step was to map out how I would accomplish this. The workflow below illustrates the final structure I came up with: one model generates embeddings from the text and metadata around each movie and produces candidate items from a lookup against them, and the other learns user embeddings to refine these lists and offer a personalised experience. All of the data required to replicate this should be available in the correct subdirectories, but the entire project can also be run from scratch: the setup command in the makefile creates a virtual environment, installs the requirements and downloads the data. The embeddings and collaborative filtering model can then be generated by running *src/cf.py* and *src/nlp.py* with python.

![alt](diagram.png)

## NLP embedding generation

**All of the code I will be walking through for this section can be found in *src/nlp.py*.**

I used the Hugging Face sentence-encoder model "LaBSE", which stands for "Language-agnostic BERT Sentence Embedding", to create the embeddings for the items. This was for two reasons: I knew there were non-English titles and this model would handle those cases, and it is the most used sentence-encoder model, which provides some assurance of its robustness.

```python
import pandas as pd
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")
data = pd.read_csv("../ml-25m/movies.csv")
```

The data has two useful fields: the title (with release year) and the genres attached to each title. There isn't a good reason to encode the pipes in the genres or to keep the parentheses in the title, so I will remove each of these.


```python 
>>> data.head()
   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy
```

I removed the unneeded characters in both fields and replaced the pipes with spaces, so that each genre became a separate word rather than one long string. I then created an input string for the model by concatenating the remaining information for each item.

```python 
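import re

def remove_pars(x):
    # strip the parentheses around the release year
    x = str(x)
    return re.sub(r'[()]', "", x)

def remove_pipes(x):
    # turn "Comedy|Romance" into "Comedy Romance"
    x = str(x)
    return re.sub(r'\|', " ", x)

def remove_nulls(a, b, i):
    # join title and genres, dropping the "(no genres listed)" placeholder
    string_m = a[i] + " " + b[i]
    return re.sub(r"\(no genres listed\)", "", string_m)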

# process the titles and genres
titles = [remove_pars(i) for i in data['title']]
genres = [remove_pipes(i) for i in data['genres']]

# make a list of the input strings for each item from these bits of data. 
input_string = [remove_nulls(titles, genres, i) for i in range(len(genres))]
```
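
To make the cleaning concrete, here is a minimal single-row sketch of the same steps. The helper name `build_input_string`, the trailing `.strip()` and the second example title are my own illustrative assumptions and are not part of *src/nlp.py*.

```python
import re

def build_input_string(title: str, genres: str) -> str:
    """Hypothetical single-row version of the cleaning steps above."""
    clean_title = re.sub(r"[()]", "", title)      # drop parentheses around the year
    clean_genres = genres.replace("|", " ")       # one word per genre
    combined = f"{clean_title} {clean_genres}"
    # remove the placeholder used when a movie has no genres
    # (.strip() is an addition here to tidy the trailing space)
    return re.sub(r"\(no genres listed\)", "", combined).strip()

print(build_input_string("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"))
# Toy Story 1995 Adventure Animation Children Comedy Fantasy
print(build_input_string("Some Film (2001)", "(no genres listed)"))
# Some Film 2001
```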

Lastly, I loaded the model onto a GPU and ran each string through the Hugging Face tokenizer and model to create its sentence embedding. These were then extracted from the output tensor and saved to be loaded into the recommender later.

```python
# this will be using a GPU to speed things up 
# but will default to CPU if no devices are available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Create embeddings for each item 
embeddings_list = []
for idx, text in enumerate(input_string):
    encoded_input = tokenizer(text, padding=True, truncation=True, max_length=64, return_tensors='pt').to(device)
    with torch.no_grad():
        model_output = model(**encoded_input)
    embeddings = model_output.pooler_output
    # L2-normalise each embedding to unit length
    embeddings = torch.nn.functional.normalize(embeddings)
    embeddings_list.append(embeddings)
    # print progress every 10,000 items
    if idx % 10000 == 0:
        print(idx)
        
# extract the embeddings
embeddings_list_tensors = []
for i in embeddings_list:
    d = i.cpu()[0].numpy()
    embeddings_list_tensors.append(d)

# save them to local file. 
embeddings = pd.DataFrame(np.vstack(embeddings_list_tensors))
embeddings.to_csv("../embeddings/data.csv")
```
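
The normalisation step matters for the later lookup: once every vector has unit length, a plain dot product between two embeddings equals their cosine similarity. The sketch below checks this on toy 2-dimensional vectors in pure Python. The helper names are illustrative, and it assumes the nearest neighbour index scores items by inner product.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# Dot product of the normalised vectors matches the cosine similarity (0.96 here)
print(round(dot(l2_normalize(a), l2_normalize(b)), 6))
print(round(cosine(a, b), 6))
```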

## Collaborative Filtering model training

**All of the code I will be walking through for this section can be found in *src/cf.py*.**

First the data needs some preprocessing. The ratings are read into memory, the unique user and movie IDs are extracted, and each original ID is mapped to a contiguous integer index (with the reverse mapping kept for decoding later). The ratings are scaled to the range 0 to 1. I then shuffled the data for fairer sampling and split it 90/10 into training and validation sets.

```python
df = pd.read_csv("../ml-25m/ratings.csv")

user_ids = df["userId"].unique().tolist()
movie_ids = df["movieId"].unique().tolist()

user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)

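# Sizes of the embedding tables and the rating bounds used for scaling
num_users = len(user2user_encoded)
num_movies = len(movie2movie_encoded)
min_rating = df["rating"].min()
max_rating = df["rating"].max()

# Shuffle, then scale ratings to [0, 1] to match the sigmoid output of the model
df = df.sample(frac=1, random_state=42)
x = df[["user", "movie"]].values
y = df["rating"].apply(lambda r: (r - min_rating) / (max_rating - min_rating)).values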

train_indices = int(0.9 * df.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)
```
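
As a small worked example of the rating scaling, the sketch below assumes the MovieLens bounds of 0.5 and 5.0 stars. The function names are hypothetical, and the inverse mapping is included only to show how a model output in [0, 1] could be read back as a star rating.

```python
def scale_rating(rating, min_rating=0.5, max_rating=5.0):
    """Min-max scale a star rating into [0, 1]."""
    return (rating - min_rating) / (max_rating - min_rating)

def unscale_rating(score, min_rating=0.5, max_rating=5.0):
    """Map a [0, 1] score back to the star scale."""
    return score * (max_rating - min_rating) + min_rating

print(scale_rating(0.5))    # 0.0
print(scale_rating(2.75))   # 0.5
print(scale_rating(5.0))    # 1.0
print(unscale_rating(0.5))  # 2.75
```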

I used a battle-tested collaborative filtering architecture and chose an embedding size of 128. The choice of Keras was pragmatic: it is a simple option for building something robust quickly.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# shape the neural collaborative filtering network
EMBEDDING_SIZE = 128

class RecommenderNet(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
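        # Per-example dot product with shape (batch, 1); tensordot with axes=2
        # would instead collapse the whole batch into a single scalar
        dot_user_movie = tf.reduce_sum(user_vector * movie_vector, axis=1, keepdims=True)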
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)
```
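
For a single (user, movie) pair, the score this architecture is meant to produce is the sigmoid of the embedding dot product plus the two bias terms. The pure Python sketch below writes that out with toy 2-dimensional vectors. The function name and the numbers are illustrative assumptions, not values from the trained model.

```python
import math

def predict_score(user_vec, movie_vec, user_bias, movie_bias):
    """sigmoid(user . movie + user_bias + movie_bias) for one pair."""
    logit = sum(u * m for u, m in zip(user_vec, movie_vec)) + user_bias + movie_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Orthogonal-ish vectors and zero biases give a neutral score of 0.5
print(predict_score([0.5, -1.0], [1.0, 0.5], 0.0, 0.0))  # 0.5

# A positive movie bias (a generally well-liked film) pushes the score up
print(predict_score([0.5, -1.0], [1.0, 0.5], 0.0, 1.0) > 0.5)  # True
```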

Lastly, I instantiate, compile, train and then save the model so that it can be loaded later to make predictions on the candidate items per user. I experimented with three parameters here: the optimizer, the learning rate and the batch size. I tried SGD, Adam, NAdam and Adamax; Adam provided the smoothest loss curve, so I stayed with that. For the learning rate I started at 0.01 and reduced it by factors of 10 until I found a value at which the loss no longer fluctuated between epochs. I did try learning rate schedulers, but didn't find a stable method. Lastly, the batch size was set quite large to keep training time under 1 hour; on a T4 GPU this took approximately 50 minutes.

```python
# instantiate the model
model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)

# compile the model
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(), 
    optimizer = keras.optimizers.Adam(learning_rate=0.0001)
)

# train the model
history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=4096,
    epochs=5,
    verbose=1,
    validation_data=(x_val, y_val),
)

# save the model after training. 
model.save('../CF')
```
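
One thing worth keeping in mind when reading the loss values: because the targets are scaled ratings rather than hard 0/1 labels, binary cross-entropy is minimised when the prediction equals the target, but that minimum is not zero. The sketch below shows this in pure Python. The function name and the clipping constant `eps` are my own assumptions for illustration.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one example, with y_true allowed in [0, 1]."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

target = 0.8
for pred in (0.6, 0.8, 0.95):
    print(pred, round(bce(target, pred), 4))
# The loss is lowest at pred == target, yet still well above zero
```

So a validation loss that flattens out well above zero is expected with this setup and is not, on its own, a sign of underfitting.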

The loss curve looks sensible and the model appears to be converging, so I am happy with this model at these parameters.


![Training vs Validation Loss](loss.png)

## Evaluate the component performance

To start with I wanted to look at how well the embeddings return appropriate movies for a given query. To do that I indexed the embeddings in a ScaNN index, embedded a test query into a vector of the same length as the item embeddings and looked at the resulting movies.

```python
scann = tfrs.layers.factorized_top_k.ScaNN(num_leaves=1000, 
                                           num_leaves_to_search = 100, 
                                           k = round(np.sqrt(len(item_tensor))))
scann.index(item_tensor)

test = "Horror films with zombies"
encoded_input = tokenizer(test, padding=True, truncation=True, max_length=64, return_tensors='pt').to(device)
with torch.no_grad():
    model_output = model(**encoded_input)
query = model_output.pooler_output
query = torch.nn.functional.normalize(query)

test_case = scann(np.array(query.cpu()))
data.iloc[test_case[1].numpy()[0]][0:9]
```
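
ScaNN approximates the result of an exhaustive search, so a brute-force version is a useful mental model (and a possible baseline for checking recall on a small sample). The sketch below is a pure Python stand-in on toy vectors, assuming inner-product scoring. It is not how the index is implemented, and `top_k_exact` is a hypothetical name.

```python
def top_k_exact(query, items, k):
    """Return the indices of the k items with the highest dot product to query."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Toy unit-length "item embeddings"
items = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]

print(top_k_exact([1.0, 0.0], items, 2))  # [0, 3]
```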

For the base query of "Horror films with zombies" I got 250 movies returned, the first nine of which are printed below. From an initial examination they look pretty decent! "Zombie" is in all of the titles and Horror is present in the genre tags. I don't know yet if this is the optimal solution, but it is at least a sensible starting place.


```text
 	movieId 	title 	genres
11068 	47980 	Bio Zombie (Sun faa sau si) (1998) 	Comedy|Horror
13822 	71535 	Zombieland (2009) 	Action|Comedy|Horror
46049 	171651 	Redneck Zombies (1989) 	Horror
23643 	118810 	Zombie Women of Satan (2009) 	Comedy|Horror
45150 	169738 	Zombie Wars (2006) 	Horror
55180 	191327 	Teenage Zombies (1960) 	Horror|Sci-Fi
41540 	161912 	Zombie Night (2003) 	Comedy|Horror|Sci-Fi
23642 	118808 	Zombie Reanimation (2009) 	Action|Comedy|Horror
14427 	75404 	ZMD: Zombies of Mass Destruction (2009) 	Comedy|Horror
```

Next I performed a similar sense check on the collaborative filtering recommender. I took a random user, scored the films they had not seen before using their user embedding, and looked at the top recommendations.

```python
recs = model.predict(user_movie_array).flatten()
top_ratings_indices = recs.argsort()[-10:][::-1]
recommended_movie_ids = [movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices]
```
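
The `argsort()[-10:][::-1]` idiom can be unpacked as follows: sort the indices by ascending score, keep the last k, then reverse so the best comes first. Here is a pure Python equivalent on toy scores (`top_indices` is a hypothetical helper name):

```python
def top_indices(scores, k):
    """Indices of the k highest scores, best first (mirrors argsort()[-k:][::-1])."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    return ascending[-k:][::-1]

print(top_indices([0.2, 0.9, 0.5, 0.7], 2))  # [1, 3]

# With fewer than k scores the slice simply returns everything available
print(top_indices([0.2, 0.9], 10))  # [1, 0]
```

The second call shows why a recommendation list can come back shorter than ten items when few unwatched candidates remain.
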
For user 136160, five of their highest rated films are listed below, followed by the ten recommendations generated from that user's history. To me this looks like a sensible set of recommendations: similar genres are present, and the movies are geared to an older audience with slightly more mature themes. It suggests the model has learned reasonable user embeddings and has picked up something real about the user's activity in training.

```text
Showing recommendations for user: 136160
====================================
Movies with high ratings from user
--------------------------------
GoldenEye (1995) : Action|Adventure|Thriller
Twelve Monkeys (a.k.a. 12 Monkeys) (1995) : Mystery|Sci-Fi|Thriller
From Dusk Till Dawn (1996) : Action|Comedy|Horror|Thriller
Batman Forever (1995) : Action|Adventure|Comedy|Crime
Robin Hood: Men in Tights (1993) : Comedy
--------------------------------
Top movie recommendations
--------------------------------
Usual Suspects, The (1995) : Crime|Mystery|Thriller
Shawshank Redemption, The (1994) : Crime|Drama
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) : Comedy|War
Godfather, The (1972) : Crime|Drama
Rear Window (1954) : Mystery|Thriller
One Flew Over the Cuckoo's Nest (1975) : Drama
12 Angry Men (1957) : Drama
Godfather: Part II, The (1974) : Crime|Drama
Seven Samurai (Shichinin no samurai) (1954) : Action|Adventure|Drama
Fight Club (1999) : Action|Crime|Drama|Thriller
```

## Putting it all together

**In *recommender/recommender.py* you can find the code for the next section**

I combined the implementation of both of the above models into one personalisedSearcher class. The first step was to load all of the embeddings and models in the *\_\_init\_\_* of the class and create the ScaNN index.

```python
class personalisedSearcher:
    def __init__(self):
        self.movies = pd.read_csv("ml-25m/movies.csv")
        self.ratings = pd.read_csv("ml-25m/ratings.csv")
        self.embeddings = pd.read_csv("embeddings/data.csv", index_col=0)
        self.item_tensor = tf.convert_to_tensor(self.embeddings, dtype=tf.float32)
        self.scann = tfrs.layers.factorized_top_k.ScaNN(num_leaves=1000, 
                                                        num_leaves_to_search = 100, 
                                                        k = round(np.sqrt(len(self.item_tensor))))
        self.scann.index(self.item_tensor)
        self.model = AutoModel.from_pretrained("sentence-transformers/LaBSE")
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
        self.recommender = keras.models.load_model('CF')
```

I wrote a couple of helper functions to process the data along the way: they build the list of item indices in the user's history, generate the candidates from the query string, remove anything already in the user's watch history and create the movie array of candidates. This culminates in the following prediction step, which reranks the candidates based on the user embedding.

```python
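    def personalised_search(self, user_id, query):
        # Generate candidates for the query and drop films the user has already seen
        movie_array, movies_not_watched, movies_watched_by_user = self.filter_candidates(user_id, query)
        # Score every remaining candidate for this user with the CF model
        scored_items = self.recommender.predict(movie_array).flatten()
        # Indices of the (up to) ten highest scores, best first
        top_rated = scored_items.argsort()[-10:][::-1]
        # Map encoded movie indices back to the original movie IDs
        _, movie_encoded2movie = self.get_movie_encodings()
        recommended_movie_ids = [movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_rated]

        return recommended_movie_ids, movies_watched_by_user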
```
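
Stripped of the model and data loading, the two-stage flow is: take the search candidates, drop what the user has watched, score the rest for that user, and sort. A minimal sketch with a stand-in scoring function is below. The names `personalised_rerank` and `score_fn`, and the toy IDs and scores, are assumptions for illustration rather than the class's real helpers.

```python
def personalised_rerank(candidate_ids, watched_ids, score_fn, top_n=10):
    """Filter out watched candidates, then order the rest by a per-user score."""
    unwatched = [m for m in candidate_ids if m not in watched_ids]
    ranked = sorted(unwatched, key=score_fn, reverse=True)
    return ranked[:top_n]

# Toy data: movie IDs in the order returned by the search stage
candidates = [1, 2, 3, 4]
watched = {2}
toy_scores = {1: 0.3, 3: 0.9, 4: 0.5}  # stand-in for the CF model's predictions

print(personalised_rerank(candidates, watched, toy_scores.get))  # [3, 4, 1]
```

Note that the search stage fixes which movies can appear at all. The collaborative filtering stage only changes their order and which ones make the cut.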

This is subsequently wrapped up in the *print_recs()* method, which takes in the user ID and query string and feeds them down the stack of helper functions to generate and rerank the recommendations. It then prints the user history for reference, followed by the reranked recommendations, resulting in lists like the one below from the initial query string!

```text
Showing recommendations for user: 42
====================================
Movies with high ratings from user
--------------------------------
Seven (a.k.a. Se7en) (1995) : Mystery|Thriller
Silence of the Lambs, The (1991) : Crime|Horror|Thriller
Snake Eyes (1998) : Action|Crime|Mystery|Thriller
Payback (1999) : Action|Thriller
Total Recall (1990) : Action|Adventure|Sci-Fi|Thriller
--------------------------------
Top movie recommendations
--------------------------------
Bio Zombie (Sun faa sau si) (1998) : Comedy|Horror
Zombieland (2009) : Action|Comedy|Horror
ZMD: Zombies of Mass Destruction (2009) : Comedy|Horror
Zombie Reanimation (2009) : Action|Comedy|Horror
Zombie Women of Satan (2009) : Comedy|Horror
The Zombie Diaries (2006) : Action|Horror|Thriller
Redneck Zombies (1989) : Horror
Hobgoblins 2 (2009) : Horror|Sci-Fi
Teenage Zombies (1960) : Horror|Sci-Fi
    
    
 	movieId 	title 	genres
11068 	47980 	Bio Zombie (Sun faa sau si) (1998) 	Comedy|Horror
13822 	71535 	Zombieland (2009) 	Action|Comedy|Horror
46049 	171651 	Redneck Zombies (1989) 	Horror
23643 	118810 	Zombie Women of Satan (2009) 	Comedy|Horror
45150 	169738 	Zombie Wars (2006) 	Horror
55180 	191327 	Teenage Zombies (1960) 	Horror|Sci-Fi
41540 	161912 	Zombie Night (2003) 	Comedy|Horror|Sci-Fi
23642 	118808 	Zombie Reanimation (2009) 	Action|Comedy|Horror
14427 	75404 	ZMD: Zombies of Mass Destruction (2009) 	Comedy|Horror
```

## Improvements and considerations

So there is definitely reranking of the films, and it is clear where items like Hobgoblins 2 are introduced to replace items the user has previously watched. However, I think there are ways of improving this before it could be put in front of audiences or adapted to other use cases.

The first clear improvement would be a filter on the content for children: either a "family friendly" toggle, or a rule that hides mature content if some percentage of the user's history is made up of young audience content. Both of these are impeded by the fact that there is no age rating data, so the programme metadata would first need to be enriched with it.

A second improvement would be to further enrich the items with movie summaries or plot descriptions. These could then be added to the encoded strings to better refine the initial search. This could probably be done by joining to IMDb or Wiki data. It would certainly help when the titles do not describe the content or are unusual non-lexical words, for example "Jumanji".

Additionally, the length of the candidate list could be varied. When only returning the top 12 items from the ScaNN index we end up with a decent list length: 8-10 items after filtering and reranking for all of the users I tested. However, for a given query those lists are very similar and there is only minimal reranking between users. As the candidate list gets longer, more diversity is included in the recommendations. At the very end of the list are some truly unrelated items, but I suspect it would be possible to tune the list length to get greater diversity between users' lists. Whether that behaviour is desirable, or whether the nearest things to the query string should be more respected, is another matter to balance.

Lastly, this reranking only works for users that are present in the CF model's user embeddings. A cold start option should be added for new users or any others outside of this set, which would likely just be the raw results from the sentence embedding query.
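
A cold start fallback along those lines could be as simple as the sketch below: if the user is unknown to the collaborative filtering model, return the search results in their original order, otherwise rerank as usual. Everything here (`recommend`, `known_users`, `score_fn`, the toy IDs and scores) is hypothetical and only illustrates the idea.

```python
def recommend(user_id, candidate_ids, known_users, score_fn, top_n=10):
    """Rerank for known users; fall back to raw search order for unknown ones."""
    if user_id not in known_users:
        # Cold start: no user embedding, so keep the search ranking as-is
        return candidate_ids[:top_n]
    ranked = sorted(candidate_ids, key=lambda m: score_fn(user_id, m), reverse=True)
    return ranked[:top_n]

toy_scores = {5: 0.1, 6: 0.9, 7: 0.5}

def score_fn(user_id, movie_id):
    # Stand-in for the CF model's prediction for this user and movie
    return toy_scores[movie_id]

print(recommend(1, [5, 6, 7], {1, 2}, score_fn, top_n=2))   # [6, 7]
print(recommend(99, [5, 6, 7], {1, 2}, score_fn, top_n=2))  # [5, 6]
```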