In [2]:
#Adapted from the Keras Example https://keras.io/examples/structured_data/collaborative_filtering_movielens/

In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt

# Week 5 - Embeddings for Recommendation 

Here we'll see how to train our simple **Dot Product** model, along with our **user embeddings** and **item embeddings** using the **Keras** library. As before, we'll be checking out the **MovieLens** dataset 

## Loading in the Dataset

First we load in the small version of the dataset. As this is a **Collaborative Filtering** approach, we are interested in the **ratings.csv**, which has all over ratings made by each user.

In [6]:
url = 'https://raw.githubusercontent.com/dsahla/mycourse/main/ratings.csv'

In [7]:
df = pd.read_csv(url)

In [8]:
df.tail(10)

Unnamed: 0,userId,movieId,rating,timestamp
99994,671,5952,5.0,1063502716
99995,671,5989,4.0,1064890625
99996,671,5991,4.5,1064245387
99997,671,5995,4.0,1066793014
99998,671,6212,2.5,1065149436
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663
100003,671,6565,3.5,1074784724


## Preprocessing 

What we have in the dataset is a list of **userId** and **movieId** pairs loaded into a ``Pandas`` DataFrame. 

As we said before, you can think of an embedding layer as a **one-hot encoding** layer the size of your **vocabulary**, followed by a **fully connected layer** the size of your embedding. 

When we make the embedding, we will need a way of mapping back from **indexes** in the **one-hot encoding** back to the ids for the users and movies. 

### Vocabulary 

In order to make the vocabulary (all the unique ids), we can use the ``unique()`` function in ``Pandas``

In [9]:
user_ids = df["userId"].unique().tolist()
movie_ids = df["movieId"].unique().tolist()

In [10]:
len(movie_ids)

9066

In [11]:
#Non-sequential list of ids
movie_ids[:6]

[31, 1029, 1061, 1129, 1172, 1263]

### Dictionary Comprehensions 

We've seen ``Dictionaries`` (e.g. when looking at JSON from REST APIs). This is a collection like a ``List``, but instead of using indexes to access data (**values**), we use **keys**. 

We've also seen ``List Comprehensions``, a short hand way to iterate through an existing collection and make a new ``List``. 

As we want something where we can use an arbitrary string/number (e.g. a movie or user id) to look up an index, a ``Dictionary`` seems like a good data structure to use. We can declare dictionaries manually (see below), but it would be much quicker and cleaner to use the information we already have to make this.


In [12]:
#Manually making the dictionary
movie_id_to_index = {
    31: 1,
    1029: 2,
    1061: 3
}
#Use a movie id to look up an index
movie_id_to_index[31]

1

Like the ``List Comprehension``, the ``Dictionary Comprehension`` iterates through a given collection, does some calculation and stores new values in a new collection. 

In this case, we need to return both a ``Key`` and a ``Value`` for each item. 

```
a = [1,2,3]
b = {i:i+1 for i in a} 
```

is the same as 

```
a = [1,2,3]
b = {}
for i in a:
    b[i] = i+1
```

where we end up with the ``Dictionary``

```
{
    1: 2,
    2: 3,
    3: 4
}
```

Below, we combine the dictionary comprehension with the ``enumerate()`` function to return the id (x) and the index (i) and store them in a new dictionary 

In [14]:
#Make a dictionary mapping ids (keys) to indexes (values)
user_id_to_index = {x: i for i, x in enumerate(user_ids)}
movie_id_to_index = {x: i for i, x in enumerate(movie_ids)}

In [15]:
#Make a new column in the dataframe which contains the appropriate index for each user and movie
df["user_index"] = [user_id_to_index[i] for i in df["userId"]]
df["movie_index"] = [movie_id_to_index[i] for i in df["movieId"]]

In [16]:
df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp,user_index,movie_index
0,1,31,2.5,1260759144,0,0
1,1,1029,3.0,1260759179,0,1
2,1,1061,3.0,1260759182,0,2
3,1,1129,2.0,1260759185,0,3
4,1,1172,4.0,1260759205,0,4


### Scaling the ratings

As is good when working with ``gradient descent``, it helps to have our values on a similar range, and for that to be between 0 and 1. We can use the ``MinMaxScaler`` from ``Scikit-Learn`` to scale our ratings to between 0 and 1

In [19]:
df["rating"].describe()

count    100004.000000
mean          3.543608
std           1.058064
min           0.500000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

In [None]:
from sklearn.preprocessing import MinMaxScaler
##Pick the range
df["rating"] = MinMaxScaler().fit_transform(df["rating"].values.reshape(-1, 1))

In [None]:
df["rating"].describe()

## Training Set

We are making a **predictive model** that will take a **user** and **movie** and return a **rating**. 

For our training, we will make a dataset using the information we already know. In this context, our input feautres (``x``) are the movie and user indexes, and the our output (``y``) is the rating. 

We make a train - test split of ``10%`` to validate our model. 

In [20]:
from sklearn.model_selection import train_test_split
#Inputs
x = df[["user_index", "movie_index"]]
#Outputs
y = df["rating"]
#Get train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

## Making a Custom Model 

Previously in ``Keras`` we have used to pre-existing layers, connecting them all together in using the [Sequential](https://keras.io/guides/sequential_model/) object. This allows us to fit together layers that pass information forwards in a structure that works for most **Neural Networks**.

``Keras`` also has a [Model](https://keras.io/api/models/model/) object which we can **subclass**. Without getting too bogged down in the details of **Object Oriented Programming**, essentially what this means is we can take the **existing functionality** from this object and **override** certain functions to add in custom behaviour.

Using the ``Model`` structure, we have something that can take advantage of a lot the things that are built into the ``Keras`` library. It can be trained, can have layers, can have parameters that can be optimised.

**But**, we can also add in our own functionality. 

The two main functions we want to override and these are 

1. ``def __init__()``
    
    * This is called **once** when the object is first made. We can use this to define our layers 
    

2. ``def call()``

    * This is called everytime we want to make a forwards pass. This means it takes some **inputs** and returns some **outputs**. This is called during training, or for inference on a trained model. 
    
### LouisNet

Below, we show an **incredibly simple model**, but it should help you get an intuition for what function is called at when in the training process

We can see the ``__init__()`` is called once, and then the ``call()`` is called **once per batch**, where we get the inputs for this batch and return some outputs

This model doesnt actually have any parameters to train, its more to demonstrate the subclassing principle in the simplest terms

In [22]:
#Define class and subclass keras.Model
class LouisNet(keras.Model):
    
    #Override __init__()
    def __init__(self, **kwargs):
        super(LouisNet, self).__init__(**kwargs)
        print("__init__ called")
    
    #Override call()
    def call(self, inputs):
        tf.print("\nforwards pass (new batch)")
        tf.print(inputs,"\n")
        #return the output (its just the input, unchanged)
        return inputs

#Make a new instance of LouisNet    
louisNet = LouisNet()
louisNet.compile()
#Train
louisNet.fit(
    x=[[1],[2],[3],[4]],
    y=[[5],[6],[7],[8]],
    epochs=2,
    batch_size=2
)

__init__ called
Epoch 1/2

forwards pass (new batch)
[[3]
 [4]] 

forwards pass (new batch)
[[1]
 [2]] 

Epoch 2/2

forwards pass (new batch)
[[3]
 [1]] 

forwards pass (new batch)
[[4]
 [2]] 



<tensorflow.python.keras.callbacks.History at 0x7f409d3cc210>

## The Dot Product Recommender Model

Lets remember the model we're trying to make. 


```
Predicted Rating = Dot Product(user_vector, item_vector) + user_bias + item_bias
```


Our target is to find a vector for each movie and user so that their dot product (+ their biases) is an accurate prediction for the rating that user would make for that movie. 

Each of these vectors will be contained in a matrix, that we call an **embedding**


### The Embedding Layer 

Again, you can think of an embedding layer as a **one-hot encoding** layer the size of your **vocabulary**, followed by a **fully connected layer** the size of your embedding. 

Luckily, ```Keras``` has a layer already we can use, all we have to say is 

1. How many items we have (vocabulary size)

2. The size of the embedding 

You might use something between 10-300, and this is something you will have to tune

### New Arguments for ``__init__``

Again, we will override the ```__init__()``` function, but this time we will add in some extra arguments. We can use this to pass in 

1. Number of users 

2. Number of movies

3. Size of Embedding

These get passed in when we make the new object 

```
model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)

```

### Saving Variables and ```self```

Finally, the last **Object-oriented** concept we'll need allows us to save things within the object. These are sometimes called ``instance variables`` or ``fields``, but the main thing you need to know is **these are like the variables we use all the time to store objects and data**, apart from they belong to the object, and only work within this context 

We use the keyword ```self``` within the object to refer to itself. We can use this to make layers in the ```__init__()``` function, store them in the object, and then reuse and update them in the ```call()``` function.


In [24]:
#Define the new class
class RecommenderNet(keras.Model):
    
    #Override init with new arguments 
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        #Make an embedding layer for users
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        #Make an embedding layer for user bias
        self.user_bias = layers.Embedding(num_users, 1)
        #Make an embedding layer for movies
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        #Make an embedding layer for movie bias
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        #inputs contains [[user,movie],[user,movie],[user,movie]...]
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
        #Dot product 
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)

### Train 

Now we can ```compile()``` and ```fit()``` just like we would any model. 


On every forwards pass (see ``call()`` above)

1. We take a batch of ``users`` and ``movies``


2. Run them through the normal embedding and bias embedding layers respectively 


3. Get the vectors for each out 


4. Get the dot product of the user and movie vectors 


5. Add the biases 


6. Run through a sigmoid


7. Return!

In [40]:
#Pick Embedding size
EMBEDDING_SIZE = 20
#Make new object (calls __init__())
num_users = len(user_ids)
num_movies = len(movie_ids)
model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)
model.compile(
    loss=tf.keras.losses.MeanSquaredError(), optimizer=keras.optimizers.Adam(learning_rate=0.001)
)


In [27]:
#TRAIN
history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=64,
    epochs=20,
    validation_data=(x_test, y_test)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Accessing the Embeddings 

We can access the **embedding layers** in our model object, and access the ``trainable_weights``. This is the embedding and we can see is has a shape of ```num_users x EMBEDDING_SIZE```

In [29]:
model.user_embedding.trainable_weights


[<tf.Variable 'recommender_net_1/embedding_4/embeddings:0' shape=(671, 20) dtype=float32, numpy=
 array([[ 0.00323575,  0.00570699,  0.01023343, ..., -0.01301755,
          0.01245636,  0.00198435],
        [ 0.05382209,  0.06811429,  0.09230632, ..., -0.10678348,
          0.10328662,  0.04452353],
        [ 0.01374136,  0.02334543,  0.04015435, ..., -0.05157775,
          0.04909866,  0.00910005],
        ...,
        [ 0.03227362,  0.03606731,  0.04236639, ..., -0.04617469,
          0.04538468,  0.0296296 ],
        [ 0.02007853,  0.03041926,  0.04551858, ..., -0.05465017,
          0.05285931,  0.01347085],
        [ 0.06410526,  0.08520655,  0.119528  , ..., -0.14170747,
          0.13725142,  0.05167754]], dtype=float32)>]

### Making Predictions 

Now, we can use our trained model to make predictions, and with the predicted ratings, we can pick some recommendations!

In order to get the ratings for all movies for a given user, we need to get pass in our data in the form 

```
[
    [user_id, movie_1_id],
    [user_id, movie_2_id],
    [user_id, movie_3_id],
    .....
]

```

In [35]:
url2 = 'https://raw.githubusercontent.com/dsahla/mycourse/main/movies.csv'

In [36]:
#Get the movie data so we can map back to names
movie_data = pd.read_csv(url2)
movie_data.columns

Index(['movieId', 'title', 'genres'], dtype='object')

### Making predictions and `argsort()`

Once we have the predicted ratings for each film, we need to get the **Top N**

Here we use `np.argsort()`, which does the sort based on the **ratings** but returns the **indexes** rather than the **ratings themselves**. We can then use this to look up the `movie_ids` and then the `title`.

In [37]:
user = 3
n = 10
#For one user, make a pair with every movie index
x = [[user, i] for i in np.arange(num_movies)]

In [38]:
#Predict
predicted_ratings = model.predict(x).flatten()
#Get Top-N indexes
top_n_indexes = predicted_ratings.argsort()[-n:]
#Get Movie Names
top_n = [movie_data[movie_data["movieId"]==movie_ids[i]]["title"] for i in top_n_indexes]

In [39]:
top_n

[2407    Being John Malkovich (1999)
 Name: title, dtype: object, 1393    As Good as It Gets (1997)
 Name: title, dtype: object, 1486    There's Something About Mary (1998)
 Name: title, dtype: object, 977    Godfather: Part II, The (1974)
 Name: title, dtype: object, 2004    Office Space (1999)
 Name: title, dtype: object, 309    Ace Ventura: Pet Detective (1994)
 Name: title, dtype: object, 203    Dumb & Dumber (Dumb and Dumber) (1994)
 Name: title, dtype: object, 1336    Truman Show, The (1998)
 Name: title, dtype: object, 521    Aladdin (1992)
 Name: title, dtype: object, 427    Jurassic Park (1993)
 Name: title, dtype: object]

# Assessed Assignment 2

Please remember to comment your code clearl, submit ``.ipynb`` 

## Task 1

We're going to ask you take the trained model and write the code to make two metrics - **Diversity** and **Novelty**

### Diversity 

This tells us what the mean diversity (1-similarity, based on movie embeddings) between each film in every users Top 10 films is.  

### Novelty 

This tells us what the mean popularity (e.g. mean rating) of the films in every users Top 10 films is 

## Task 2

Using a dimensionality reduction approach, plot the top 30 best rated films on a 2-D graph based on their movie embeddings 

In [78]:
#Install 
!pip install scikit-surprise



In [85]:
from surprise import SVD
from surprise import KNNBaseline
from surprise.model_selection import train_test_split
from surprise.model_selection import LeaveOneOut


In [None]:
print("\nComputing complete recommendations, no hold outs...")
algo.fit(fullTrainSet)
bigTestSet = fullTrainSet.build_anti_testset()
allPredictions = algo.test(bigTestSet)
topNPredicted = RecommenderMetrics.GetTopN(allPredictions, n=10)

# Measure diversity of recommendations:
print("\nDiversity: ", RecommenderMetrics.Diversity(topNPredicted, simsAlgo))

# Measure novelty (average popularity rank of recommendations):
print("\nNovelty (average popularity rank): ", RecommenderMetrics.Novelty(topNPredicted, rankings))

In [88]:
surprise.similarities.cosine()

NameError: ignored