In [1]:
#ONLY FOR USING GPU (LOUIS)
import tensorflow as tf
def expand():
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
      try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
          tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
      except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
expand()

1 Physical GPUs, 1 Logical GPUs


# Week 7 - RNNs for Recommending Fashion 

This week we'll take a look at how we can take the **sequence of purchases** into account when recommending new things. 

Building on **embeddings** (Week 5) and the **deep learning** approaches used at YouTube (Week 6), we're going to use a dataset of customer interactions from the fashion rental site [Rent the Runway](https://www.renttherunway.com/).

We'll adapt an approach recently described by [Decathlon's](https://www.decathlon.co.uk/) engineers as they moved from a **matrix factorisation approach** (similar to what we saw in Week 5) to a **Recurrent Neural Network**. The full description is in this [blog post](https://medium.com/decathlondevelopers/building-a-rnn-recommendation-engine-with-tensorflow-505644aa9ff3).

### The Approach 

**Decathlon** describe a dataset where they have a list of items purchased by each user, and the date each item was purchased. They aim to leverage this information about the sequence of purchases, as well as the information encoded within the date of each purchase, to build a model that can predict future items to recommend. 

As we've seen before, **Recurrent Neural Networks** (RNNs) are great at encoding this sequential information. We also have sparse categorical data for ``item ids``, so we'll also incorporate an **embedding layer** at the beginning. 

We don't have their data, but we can find a dataset that has this sequential, dated purchase history by user. The **Rent the Runway** dataset has this, along with loads of other rich metadata about the customers and their experiences of each rental. 

In [2]:
import pandas as pd 
import numpy as np

In [3]:
##Load the data 
df = pd.read_json("../data/renttherunway_cleaned.json")

In [4]:
df.head(5)

Unnamed: 0,fit,user_id,bust size,item_id,weight,rating,rented for,review_text,body type,review_summary,category,height,size,age,review_date
0,fit,420272,34d,2260466,137lbs,10.0,vacation,An adorable romper! Belt and zipper were a lit...,hourglass,So many compliments!,romper,"5' 8""",14,28.0,"April 20, 2016"
1,fit,273551,34b,153475,132lbs,10.0,other,I rented this dress for a photo shoot. The the...,straight & narrow,I felt so glamourous!!!,gown,"5' 6""",12,36.0,"June 18, 2013"
2,fit,360448,,1063761,,10.0,party,This hugged in all the right places! It was a ...,,It was a great time to celebrate the (almost) ...,sheath,"5' 4""",4,116.0,"December 14, 2015"
3,fit,909926,34c,126335,135lbs,8.0,formal affair,I rented this for my company's black tie award...,pear,Dress arrived on time and in perfect condition.,dress,"5' 5""",8,34.0,"February 12, 2014"
4,fit,151944,34b,616682,145lbs,10.0,wedding,I have always been petite in my upper body and...,athletic,Was in love with this dress !!!,gown,"5' 9""",12,27.0,"September 26, 2016"


In [5]:
#How many interactions in total, how many unique users, how many unique items 
len(df), len(df["user_id"].unique()), len(df["item_id"].unique())

(192544, 105571, 5850)

### Dates 

Since sequence is important to us, we're going to need to sort the data by review date. Currently, ``Pandas`` sees this column as an ``object``, so we'll use ``pd.to_datetime()`` to convert the string to a date. We can then sort it, and do maths with it! 

In [6]:
#Parse the dates
df["review_date"] = pd.to_datetime(df["review_date"])

In [7]:
#Confirm the data types are correct
df.dtypes

fit                       object
user_id                    int64
bust size                 object
item_id                    int64
weight                    object
rating                   float64
rented for                object
review_text               object
body type                 object
review_summary            object
category                  object
height                    object
size                       int64
age                      float64
review_date       datetime64[ns]
dtype: object

In [8]:
#Sort
df = df.sort_values(by="review_date")

### Getting the User Sequence 

Currently, our dataset has each item on a separate row. In order to get the sequence of purchase for each user, we need to format the data. 

We also take the date in its absolute form and change it to ``days since end of dataset`` (the most recent rental being ``0 days``)

In [9]:
users = df["user_id"].unique()

In [10]:
data = []
last_date = df["review_date"].max()

for user in users:
    ##Get all the items for that user
    rows = df[df["user_id"]==user]
    
    #Get all the item_ids
    items = rows["item_id"]
    
    #Get all the dates
    days_since_now = (last_date - rows["review_date"])
    days_since_now = np.array([i.days for i in days_since_now])
    
    #Collect into a dictionary for each user
    data.append({"item_id":items.values,"nb_days":days_since_now})
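
The loop above filters the whole dataframe once per user, which can be slow with over 100,000 users. As a hedged alternative, the sketch below builds the same kind of per-user lists with a single ``groupby``. The toy dataframe and its values are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the rental log (values are made up for illustration)
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 2, 1],
    "item_id": [10, 20, 30, 40, 50],
    "review_date": pd.to_datetime(
        ["2016-01-01", "2016-01-05", "2016-02-01", "2016-03-01", "2016-03-11"]),
}).sort_values("review_date")

last_date = toy["review_date"].max()
# Whole days between each rental and the most recent rental in the data
toy["nb_days"] = (last_date - toy["review_date"]).dt.days

# One row per user; sort=False keeps users in order of first appearance,
# and rows inside each group keep the date order of the sorted frame
grouped = (toy.groupby("user_id", sort=False)[["item_id", "nb_days"]]
              .agg(list)
              .reset_index(drop=True))
print(grouped)
```

Each row of ``grouped`` then holds one user's items and day offsets in chronological order, the same shape of data that the loop collects.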

In [11]:
#Convert to a dataframe and save
data = pd.DataFrame(data)

In [35]:
data

Unnamed: 0,item_id,nb_days
0,"[125564, 183200]","[2623, 1383]"
1,"[126335, 132738, 130259]","[2520, 2096, 1745]"
2,[126335],[2511]
3,"[125564, 240137, 468020]","[2510, 848, 848]"
4,[190529],[2500]
...,...,...
105566,[454564],[3]
105567,[1498329],[3]
105568,[2835159],[3]
105569,[1969604],[3]


### Make the Dataset 

Now we need to make the actual **training set** that we will use for our model. 

Remember, the purpose of this model is to learn to predict the **next item rented in the sequence**. So for our input we will have 

 * [item1, item2, item3, item4, ..., itemN]
 
 * [days1, days2, days3, days4, ..., daysN]
 
 
And for our output we will have **the next item in the sequence** 
  
 
 * [itemN+1]
 
 
We also need our sequences to all be **of equal length**, and because this dataset doesn't have loads of really long sequences, we don't want to throw away anything below a threshold! So instead we **zero pad** the start of each sequence if it's not as long as the maximum length we have picked (in this case ``5``)

* [0, 0, item1, item2, ..., itemN]
 
* [0, 0, days1, days2, ..., daysN]
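
To make the padding concrete, here is a minimal NumPy sketch of left-padding. ``pad_left`` is a hypothetical helper written for this illustration; it mimics the default behaviour of ``pad_sequences`` (zeros added at the front, and over-long sequences cut down to their last ``max_len`` entries).

```python
import numpy as np

def pad_left(seq, max_len, value=0):
    """Hypothetical helper: left-pad seq, or keep only its last max_len items."""
    seq = list(seq)[-max_len:]
    return np.array([value] * (max_len - len(seq)) + seq)

short = pad_left([126335, 132738], max_len=5)
long = pad_left([1, 2, 3, 4, 5, 6, 7], max_len=5)
print(short)  # [     0      0      0 126335 132738]
print(long)   # [3 4 5 6 7]
```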

In [12]:
from keras.preprocessing.sequence import pad_sequences

In [100]:
#split into examples of 5 last things + 1, if less than 5, zero pad
training_set = []
max_len = 5
for _, user in data.iterrows():
    i = 0
    #Get all items for each user 
    items = user["item_id"]
    num_items = len(items)
    #If only one item purchased, there is no sequence! We need at least 2
    #I've found better accuracy limiting to min 3
    if num_items > 2:
        nb_days = np.array(user["nb_days"])
        end = False
        #Cycle over items, taking windows of 5 and moving forwards by 1 each time
        while not end:
            
            #If we're off the end of the list, break out of the loop
            target = i + max_len
            if target >= num_items - 1:
                end = True
                target = num_items - 1
            
            #Get the input item and day features, zero padding
            input_items = pad_sequences([items[i:target]], max_len)[0]
            days = pad_sequences([nb_days[i:target]], max_len)[0]
            
            #Get the adjusted sequence (shifted by one) for the target
            target_items = np.concatenate((input_items[1:],[items[target]]))
            #target_items = [items[target]]
            
            #Add to the dataset
            row = {
                "item_id":input_items,
                "nb_days":days,
                "target":target_items
            }
            training_set.append(row)
            
            #Increment pointer
            i = i + 1
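
To see what the windowing loop produces, the sketch below repeats the same logic for a single made-up user. ``make_windows`` and ``pad_left`` are hypothetical names used only here, and ``pad_left`` stands in for ``pad_sequences`` so the example runs with NumPy alone.

```python
import numpy as np

def pad_left(seq, max_len):
    # Hypothetical stand-in for the default left-padding of pad_sequences
    seq = list(seq)[-max_len:]
    return np.array([0] * (max_len - len(seq)) + seq)

def make_windows(items, max_len=5):
    """Return (input, target) pairs built the same way as the loop above."""
    examples, i, end = [], 0, False
    while not end:
        target = i + max_len
        # Clamp the target to the last item and stop after this window
        if target >= len(items) - 1:
            end, target = True, len(items) - 1
        inputs = pad_left(items[i:target], max_len)
        # Target sequence is the input shifted left by one, plus the next item
        shifted = np.concatenate((inputs[1:], [items[target]]))
        examples.append((inputs.tolist(), shifted.tolist()))
        i += 1
    return examples

for inputs, shifted in make_windows([11, 22, 33, 44, 55, 66, 77]):
    print(inputs, "->", shifted)
```

A user with seven rentals yields two examples here, while a user with three rentals yields one heavily padded example.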

In [101]:
training_set = pd.DataFrame(training_set)

In [102]:
len(training_set)

34588

In [103]:
training_set

Unnamed: 0,item_id,nb_days,target
0,"[0, 0, 0, 126335, 132738]","[0, 0, 0, 2520, 2096]","[0, 0, 126335, 132738, 130259]"
1,"[0, 0, 0, 125564, 240137]","[0, 0, 0, 2510, 848]","[0, 0, 125564, 240137, 468020]"
2,"[0, 0, 0, 126335, 531077]","[0, 0, 0, 2478, 992]","[0, 0, 126335, 531077, 253667]"
3,"[0, 0, 126335, 1338469, 2261828]","[0, 0, 2474, 1105, 1105]","[0, 126335, 1338469, 2261828, 1846462]"
4,"[0, 0, 126335, 180014, 833666]","[0, 0, 2458, 1119, 735]","[0, 126335, 180014, 833666, 640839]"
...,...,...,...
34583,"[0, 0, 0, 715164, 2720289]","[0, 0, 0, 4, 4]","[0, 0, 715164, 2720289, 2665815]"
34584,"[0, 0, 0, 872442, 322704]","[0, 0, 0, 4, 4]","[0, 0, 872442, 322704, 844580]"
34585,"[0, 0, 2893615, 2072280, 2675545]","[0, 0, 4, 4, 4]","[0, 2893615, 2072280, 2675545, 1434889]"
34586,"[0, 0, 0, 1252971, 2945301]","[0, 0, 0, 4, 4]","[0, 0, 1252971, 2945301, 1793377]"


## Feature Engineering 

### Item Ids

Currently the ``item_ids`` are arbitrary and relate to the **Rent the Runway** catalogue. We are going to encode them in an **embedding**, so we need to turn them into indexes from ``0 -> num_items``

In [72]:
#Get all the unique items from the data set
all_input_items = np.array([i for i in training_set["item_id"].values]).flatten()
all_target_items = np.array([i for i in training_set["target"].values]).flatten()
all_items = np.concatenate((all_input_items, all_target_items))
unique_items = np.unique(all_items)

In [73]:
#Make a look up dictionary from item_id to index
item_to_index = {item_id:i for i, item_id in enumerate(unique_items)}

In [74]:
#Swap out the new ids and convert to 2d arrays for use in training 
item_indexes = np.array([np.array([item_to_index[i] for i in ids], dtype=int) for ids in training_set["item_id"]], dtype=int)
target_indexes = np.array([np.array([item_to_index[i] for i in ids], dtype=int) for ids in training_set["target"]], dtype=int)
days = np.array([np.array([j for j in i],dtype=int) for i in training_set["nb_days"]],dtype=int)
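
As a small worked example of the re-indexing, the sketch below maps made-up catalogue ids to contiguous indexes. It also shows ``np.searchsorted`` as one possible vectorised alternative to the nested list comprehensions; that swap is a suggestion, not what the cells above do.

```python
import numpy as np

# Toy padded sequences using made-up catalogue ids
raw = np.array([[0, 0, 126335, 132738],
                [0, 126335, 180014, 833666]])

unique_items = np.unique(raw)  # sorted, so the padding value 0 comes first
item_to_index = {item_id: i for i, item_id in enumerate(unique_items)}
index_to_item = {i: item_id for item_id, i in item_to_index.items()}

# np.searchsorted does the same lookup in one vectorised call
indexes = np.searchsorted(unique_items, raw)
print(indexes)
```

Because ``np.unique`` returns sorted values, the padding value ``0`` keeps index ``0`` as long as every real id is positive, which is what the padding masks later in the notebook rely on.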

### Bucketing Dates

One interesting approach they have taken at Decathlon is to **bucket** days and then **learn an embedding**, almost treating it as a **categorical variable**. 

To do this, we use ``tf.keras.layers.experimental.preprocessing.Discretization()`` to separate the days into **100 equally spaced bins**, with one more for the **zero padding** 
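
The bucketing idea can be tried out without TensorFlow. The sketch below uses ``np.digitize`` as a stand-in for the Keras layer, with only 4 bins and made-up day values so the result is easy to check by hand. Exact bucket numbering may differ from what the Keras layer produces.

```python
import numpy as np

days = np.array([0, 3, 30, 60, 100])  # 0 marks zero padding
n_bins = 4                             # the notebook uses 100

# Interior edges: the first edge at 1 keeps padding alone in bucket 0
edges = np.linspace(1, days.max(), n_bins + 1)[:-1]
buckets = np.digitize(days, edges)
print(edges)    # edges at 1, 25.75, 50.5 and 75.25
print(buckets)  # [0 1 2 3 4]
```

With ``n_bins`` real buckets plus bucket ``0`` for padding, an embedding over these values needs ``n_bins + 1`` rows.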

In [75]:
import tensorflow as tf

In [80]:
#Get min and max
days_min = days.min()
days_max = days.max()

#Generate 100 equally spaced boundaries 
boundaries = list(np.linspace(days_min, days_max,100,dtype=int))
boundaries.insert(0,0)
boundaries[1] = 1

#Bucket the day features
discretize_layer = tf.keras.layers.experimental.preprocessing.Discretization(
    bins=boundaries)
bucket_days = discretize_layer(days).numpy() - 1



In [81]:
days

array([[   0,    0,    0, ...,    0, 2520, 2096],
       [   0,    0,    0, ...,    0, 2510,  848],
       [   0,    0,    0, ...,    0, 2478,  992],
       ...,
       [   0,    0,    0, ...,    4,    4,    4],
       [   0,    0,    0, ...,    0,    4,    4],
       [   0,    0,    0, ...,    0,    3,    3]])

In [82]:
bucket_days

array([[  0,   0,   0, ...,   0, 100,  84],
       [  0,   0,   0, ...,   0, 100,  35],
       [  0,   0,   0, ...,   0,  99,  40],
       ...,
       [  0,   0,   0, ...,   2,   2,   2],
       [  0,   0,   0, ...,   0,   2,   2],
       [  0,   0,   0, ...,   0,   2,   2]], dtype=int32)

## Training and Validation Sets 

Next we split the data into training and validation sets 

In [83]:
#Generate and shuffle the indexes
total = len(training_set)
all_indexes = np.arange(total)
np.random.shuffle(all_indexes)

#Split the indexes
split = 0.9
train_indexes = all_indexes[:int(total*split)]
test_indexes = all_indexes[int(total*split):]

In [84]:
#Make a dictionary for the inputs 
train_x = {'item_id': item_indexes[train_indexes],
          'nb_days': bucket_days[train_indexes]}
test_x = {'item_id': item_indexes[test_indexes],
          'nb_days': bucket_days[test_indexes]}

#Make an array for the outputs 
train_y = target_indexes[train_indexes]
test_y = target_indexes[test_indexes]

In [85]:
test_x

{'item_id': array([[   0,    0,    0, ...,  145, 1705, 2357],
        [3126, 1764,  315, ..., 3407, 4145,  458],
        [5408, 3923, 5160, ...,  722, 1066, 2175],
        ...,
        [   0,    0,    0, ...,    0,  101,  154],
        [1041, 1941, 1403, ..., 4106, 3607, 1453],
        [3911, 5262, 5070, ..., 2835, 1346, 5518]]),
 'nb_days': array([[ 0,  0,  0, ..., 50, 18, 18],
        [37, 37, 37, ..., 34, 31, 25],
        [12, 11, 11, ..., 11, 10, 10],
        ...,
        [ 0,  0,  0, ...,  0, 55, 53],
        [10, 10,  9, ...,  7,  7,  7],
        [11, 11, 11, ..., 11, 11, 11]], dtype=int32)}

## Build and Train the Model 

Previously we have used ``Keras's Sequential API``, where we first make a model then **sequentially add layers to it** one by one. This works because each layer only has one input, and outputs directly into the next. 

However, whilst this is broadly true of our model, it's not strictly true. Our **Embedding layers** both feed into the **Concatenate layer**. So instead, we will use the [Keras Functional API](https://keras.io/guides/functional_api/).

As the documentation says 

```
The Keras functional API is a way to create models that are more flexible than the tf.keras.Sequential API. The functional API can handle models with non-linear topology, shared layers, and even multiple inputs or outputs.
```

Essentially, when we build up the ``network graph``, instead of **adding things to the model**, we **specify which input** we want each new layer to have by calling the layer on it. 

Here, we call one layer on ``inputs`` (a ``tf.keras.Input``) to get ``x``, and then call a second layer on ``x`` to get ``outputs``, specifying that ``x`` is the input for ``outputs``

```
x = layers.Dense(64, activation="relu")(inputs)
outputs = layers.Dense(10)(x)
```
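
To show why the Functional API suits this problem, here is a minimal two-input sketch in the same spirit as the model built below. The layer sizes are arbitrary toy values, and ``toy_model`` is a name chosen for this illustration; it is not the model trained in this notebook.

```python
import numpy as np
import tensorflow as tf

max_len = 5
# Two named inputs, one per feature
item_in = tf.keras.Input(shape=(max_len,), dtype="int32", name="item_id")
days_in = tf.keras.Input(shape=(max_len,), dtype="int32", name="nb_days")

# Each layer is called on the tensor it should take as input
item_emb = tf.keras.layers.Embedding(input_dim=10, output_dim=4)(item_in)
days_emb = tf.keras.layers.Embedding(input_dim=6, output_dim=2)(days_in)
merged = tf.keras.layers.Concatenate()([item_emb, days_emb])
scores = tf.keras.layers.Dense(10)(merged)

toy_model = tf.keras.Model({"item_id": item_in, "nb_days": days_in}, scores)
batch = {"item_id": np.zeros((2, max_len), dtype="int32"),
         "nb_days": np.zeros((2, max_len), dtype="int32")}
print(toy_model(batch).shape)  # (2, 5, 10)
```

Two branches merging into one ``Concatenate`` layer is exactly the kind of topology the Sequential API cannot express.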

In [86]:
#Model hyper parameters 
item_vocab_size = len(unique_items)
hp = {
    "embedding_item":100,
    "embedding_nb_days":20,
    "rnn_units_cat":[1024,512],
    "learning_rate":0.01
}

### ``tf.keras.Input()``

The [Input Layer](https://keras.io/api/layers/core_layers/input/) sits at the base of our ``Model`` and does what it says on the tin: it takes the input. 

We specify a dictionary of inputs to match the dictionaries we made for the **training set**. This means we have **two separate inputs** that feed into **two separate embedding layers**.

We give them a ``batch_input_shape`` of ``[None, max_len]``, so that the model knows each example will be a sequence of ``max_len`` (in our case 5) items, but the batch size itself is left open (``None``) and isn't decided until we call ``fit()``. 

In [87]:
inputs = {}
inputs['item_id'] = tf.keras.Input(batch_input_shape=[None, max_len],
                                   name='item_id', dtype=tf.int32)

# nb_days bucketized
inputs['nb_days'] = tf.keras.Input(batch_input_shape=[None, max_len],
                                   name='nb_days', dtype=tf.int32)

### ``tf.keras.layers.Embedding()``

Then we add the embedding layers, each time specifying which item from the **input dictionary** to take as input. 

The **item embedding** takes an input the size of the **number of unique items (vocab size)** and learns a mapping to a denser embedding of a given size. 

The **days embedding** takes an input the size of the **number of buckets + 1 (for zero padding)** and learns a mapping to a denser embedding of a given size. 
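
Under the hood, an embedding layer is a lookup table: each integer index selects one row of a weight matrix. The NumPy sketch below illustrates just that lookup, with random numbers standing in for learned weights and toy sizes chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 3                      # toy sizes, far smaller than the notebook's
table = rng.normal(size=(vocab_size, emb_dim))  # stands in for learned weights

batch = np.array([[0, 0, 2, 5],
                  [0, 1, 3, 4]])  # (batch_size, max_len) integer indexes
embedded = table[batch]           # fancy indexing picks one row per index
print(embedded.shape)             # (2, 4, 3)
```

This is why ``input_dim`` must be at least one more than the largest index fed to the layer.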


In [88]:
embedding_item = tf.keras.layers.Embedding(input_dim=item_vocab_size,
                                           output_dim=hp.get('embedding_item'),
                                           name='embedding_item'
                                          )(inputs['item_id'])
# nbins=100, +1 for zero padding
embedding_nb_days = tf.keras.layers.Embedding(input_dim=100 + 1,
                                              output_dim=hp.get('embedding_nb_days'),
                                              name='embedding_nb_days'
                                             )(inputs['nb_days'])

### `` tf.keras.layers.Concatenate()``

We then concatenate the two embedding layers into one layer 

In [89]:
# Concatenate embedding layers
concat_embedding_input = tf.keras.layers.Concatenate(
 name='concat_embedding_input')([embedding_item, embedding_nb_days])

concat_embedding_input = tf.keras.layers.BatchNormalization(
 name='batchnorm_inputs')(concat_embedding_input)

### LSTM Layers 

We then put in the ``tf.keras.layers.LSTM()`` layers, with a ``tf.keras.layers.BatchNormalization()`` either side. 

More on that in the lecture!

In [90]:
input_layer = concat_embedding_input

for i, num_units in enumerate(hp.get('rnn_units_cat')):
    
    # LSTM layer
    rnn = tf.keras.layers.LSTM(units=num_units,
                                   return_sequences=True,
                                   recurrent_initializer='glorot_normal',
                                   name='LSTM_cat' + str(i)
                                   )(input_layer)

    rnn = tf.keras.layers.BatchNormalization(name='batchnorm_lstm' + str(i))(rnn)
    
    input_layer = rnn

# create encoding padding mask
encoding_padding_mask = tf.math.logical_not(tf.math.equal(inputs['item_id'], 0))

# Self attention so key=value in inputs
att = tf.keras.layers.Attention(use_scale=False, causal=True,
                                name='attention')(inputs=[rnn, rnn],
                                                  mask=[encoding_padding_mask,
                                                        encoding_padding_mask])



### The Output

Finally, we bring it all together in a ``tf.keras.layers.Dense()`` **softmax layer**. This means that the output of this layer will be 

``
[batch_size x max_len x item_vocab_size]
``

Where each sequence in the batch is a ``[max_len x item_vocab_size]`` tensor telling us the probability of that item in the catalogue being next in the sequence 

In [91]:
# Last layer is a fully connected one
output = tf.keras.layers.Dense(item_vocab_size, activation = tf.nn.softmax, name='output')(att)

### Loss Function 

We write a custom loss function. This is necessary because we need to again mask out the 0s to stop the model optimising towards them. 

In [92]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
def loss_function(real, pred):
    #Keep one loss value per position so the mask can be applied before averaging
    loss = SparseCategoricalCrossentropy(reduction='none')(real, pred)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask

    return tf.reduce_mean(loss)
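
The masking idea can be checked by hand with a small NumPy sketch. ``masked_sparse_cce`` is a hypothetical function written for this illustration, and the probabilities are made up. It computes one cross-entropy value per position, zeroes the positions whose target is the padding index ``0``, and then averages.

```python
import numpy as np

def masked_sparse_cce(real, pred):
    """NumPy sketch: mean cross-entropy with padded targets (index 0) zeroed out."""
    # Probability the model assigned to the true item at each position
    true_prob = np.take_along_axis(pred, real[..., None], axis=-1)[..., 0]
    per_position = -np.log(true_prob)
    mask = (real != 0).astype(per_position.dtype)
    return (per_position * mask).mean()

real = np.array([[0, 2, 1]])  # first position is padding
pred = np.array([[[0.10, 0.80, 0.10],
                  [0.20, 0.30, 0.50],
                  [0.25, 0.50, 0.25]]])
print(masked_sparse_cce(real, pred))
```

Changing the predicted probabilities at the padded position leaves the result unchanged, which is the whole point of the mask.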

### ``tf.keras.Model()``

Finally, we're ready to join the ``Inputs`` and the ``Outputs`` into a ``Model()`` object and ``compile()``.

In [93]:
model = tf.keras.Model(inputs, output)

model.compile(
    optimizer=tf.keras.optimizers.Adam(hp.get('learning_rate')),
    loss=loss_function,
    metrics=['sparse_categorical_accuracy'])

In [94]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
item_id (InputLayer)            [(None, 8)]          0                                            
__________________________________________________________________________________________________
nb_days (InputLayer)            [(None, 8)]          0                                            
__________________________________________________________________________________________________
embedding_item (Embedding)      (None, 8, 100)       570100      item_id[0][0]                    
__________________________________________________________________________________________________
embedding_nb_days (Embedding)   (None, 8, 20)        2020        nb_days[0][0]                    
____________________________________________________________________________________________

In [95]:
history = model.fit(train_x,train_y,
                    epochs=20, 
                    verbose=1,
                    batch_size=512,
                    validation_data=(test_x, test_y))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## Predictions 



In [96]:
#Predictions for whole test set
results = model.predict(test_x)

In [97]:
#examples x max_len x vocab size
results.shape

(2779, 8, 5701)

In [98]:
#max_len x vocab size
results[100].shape

(8, 5701)

In [99]:
#Probabilities over the item vocab at the last position should sum to 1
results[100][-1].sum()

1.0

In [51]:
results = model.predict([{"item_id":test_x["item_id"][0],"nb_days":test_x["nb_days"][0]}])



In [53]:
results[-1].argmax()

3174

In [57]:
index_to_item = {v:k for k,v in item_to_index.items()}    

In [59]:
index_to_item[3174]

1698166
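
A recommender usually returns several candidates rather than a single arg-max. The sketch below shows one way to take the top ``k`` items from a probability vector and map them back to catalogue ids. ``top_k_items`` is a hypothetical helper, and the probabilities and ids are made up for illustration.

```python
import numpy as np

def top_k_items(probs, index_to_item, k=3):
    """Return the k most probable catalogue ids, skipping padding index 0."""
    ranked = np.argsort(probs)[::-1]             # indexes, most probable first
    ranked = [int(i) for i in ranked if i != 0]  # never recommend the padding slot
    return [index_to_item[i] for i in ranked[:k]]

# Made-up probabilities and ids, for illustration only
probs = np.array([0.05, 0.10, 0.50, 0.05, 0.30])
index_to_item = {0: 0, 1: 126335, 2: 132738, 3: 180014, 4: 833666}
print(top_k_items(probs, index_to_item, k=2))  # [132738, 833666]
```

With the real model, ``probs`` would be the last row of a single prediction, for example ``results[-1]`` from the cell above.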