# Recommender systems

The final challenge notebook will use graph neural networks for a music recommender system. However, such a 'sophisticated' approach is not always the most efficient one. In some cases, simpler models or algorithms provide similar results with lower computational requirements.

In this notebook, we will go through the basics of recommender systems.

### Basics

![basics](https://miro.medium.com/max/875/1*AaE5pUCOkMS6Dv6j96trsA.png)

In short, recommender systems can be divided into the following categories:

1. **Popularity-based**. Recommends items based on their overall rating (for instance, IMDB, Netflix, etc.)
2. **Content-based**. If you gave a high rating to a certain item, the system will try to recommend similar items
3. **Collaborative filtering**. Such a system groups individuals by their preferences and then recommends items that were highly rated by the individuals in the same group
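
As a rough illustration of the first category, a popularity-based recommender can be sketched as ranking items by their mean rating. The toy data, the `most_popular` helper and its `min_count` threshold are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

# Toy ratings, made up for illustration only.
toy_ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "item_id": ["a", "b", "a", "c", "a", "b"],
    "rating":  [9, 4, 8, 6, 7, 5],
})

def most_popular(ratings, top_k=2, min_count=2):
    """Rank items by mean rating, ignoring items with too few ratings."""
    stats = ratings.groupby("item_id")["rating"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_count]
    ranked = stats.sort_values("mean", ascending=False)
    return ranked.head(top_k).index.tolist()

# Every user gets the same list: there is no personalisation here.
popular = most_popular(toy_ratings)
print(popular)
```

The `min_count` filter is one simple way to keep an item with a single high rating from topping the list.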


### Implementation

In the implementation part, we will compare a content-based approach with a more sophisticated collaborative filtering approach. For that, we will use a randomly generated film dataset.

#### Preparing data

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc

from collections import defaultdict

USER_COL = 'user_id'
ITEM_COL = 'item_id'
RATING_COL = 'rating'


# synthesize data
NUM_USERS = 10_000
NUM_ITEMS = 1_000
user_id = np.arange(start = 0, stop = NUM_USERS)
item_id = np.arange(start = 0, stop = NUM_ITEMS)
np.random.seed(42)

user_item_dict = defaultdict(list)
genres = ['Action','Comedy','Drama','Fantasy','Horror','Mystery','Romance','Thriller','Western']
for id in user_id:
    
    # draw how many items this user rates:
    # 3 or 4, because `high` is exclusive in np.random.randint
    num_rand_item = np.random.randint(low = 3, high = 5)

    # sample that many distinct item ids
    rand_items = np.random.choice(item_id, size = num_rand_item, replace = False)

    # draw a rating from 1 to 9 for each item_id (`high` is exclusive)
    rand_rating = np.random.randint(low = 1, high = 10, size = num_rand_item)

    # collect the user-item pairs.
    for uid, iid,rating in zip([id] * num_rand_item, rand_items, rand_rating):
        user_item_dict['user_id'].append(uid)
        user_item_dict['item_id'].append(iid)
        user_item_dict['rating'].append(rating)

# prepare dataframe
ratings = pd.DataFrame(user_item_dict)
print("Rating Dataframe")
ratings[['user_id','item_id']] = ratings[['user_id','item_id']].astype(str)
display(ratings.head())

item_genre_dict = defaultdict(list)
for iid in item_id:

    # draw the number of genres: 1 or 2 (`high` is exclusive)
    num_rand_genre = np.random.randint(low = 1, high = 3)
    # sample that many distinct genres
    rand_genres = np.random.choice(genres, size = num_rand_genre, replace = False)
    item_genre_dict['item_id'].append(iid)
    item_genre_dict['genres'].append(', '.join(list(rand_genres)))

# prepare dataframe
items = pd.DataFrame(item_genre_dict)
print("\nItem Dataframe")
items = items.astype(str)
display(items.head())

Rating Dataframe


| | user_id | item_id | rating |
|---|---|---|---|
| 0 | 0 | 521 | 2 |
| 1 | 0 | 941 | 8 |
| 2 | 0 | 741 | 2 |
| 3 | 1 | 986 | 5 |
| 4 | 1 | 275 | 5 |



Item Dataframe


| | item_id | genres |
|---|---|---|
| 0 | 0 | Romance, Action |
| 1 | 1 | Mystery |
| 2 | 2 | Drama, Western |
| 3 | 3 | Fantasy, Horror |
| 4 | 4 | Comedy, Drama |


#### Content-based

As mentioned above, a content-based approach recommends items that are similar to items you liked previously.

When we analyse our dataset, information about each user is stored as a vector built from the user's historical data (the profile vector). Information about each item is stored as a vector as well. The content-based approach then computes the cosine of the angle between the profile vector and the item vector (**cosine similarity**).
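
To make the cosine similarity concrete, here is a minimal sketch using NumPy only. The three-genre layout and the example vectors are hypothetical and chosen just to show the range of values.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical genre order: [action, comedy, romance]
profile   = np.array([1.0, 0.0, 1.0])  # user liked action + romance items
item_same = np.array([1.0, 0.0, 1.0])  # action + romance
item_half = np.array([1.0, 0.0, 0.0])  # action only
item_none = np.array([0.0, 1.0, 0.0])  # comedy only

# One similarity per candidate item, in the order listed above.
sims = [cosine(profile, v) for v in (item_same, item_half, item_none)]
print(sims)
```

Identical genre sets give a similarity of 1, a partial overlap gives a value between 0 and 1, and disjoint genre sets give 0.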

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

def top_k_items(item_id, top_k, corr_mat, map_name):
    
    # argsort is ascending: keep the last top_k indices, then reverse (most similar first)
    top_items = corr_mat[item_id,:].argsort()[-top_k:][::-1] 
    top_items = [map_name[e] for e in top_items] 

    return top_items

# preprocessing
rated_items = items.loc[items[ITEM_COL].isin(ratings[ITEM_COL])].copy()

# extract the genre
### implement here
genre =

# get all possible genre
all_genre = set()
for c in genre.columns:
    distinct_genre = genre[c].str.lower().str.strip().unique()
    all_genre.update(distinct_genre)
all_genre.remove(None)

# create item-genre matrix
item_genre_mat = rated_items[[ITEM_COL, 'genres']].copy()
item_genre_mat['genres'] = item_genre_mat['genres'].str.lower().str.strip()

# add columns based on genres
### implement here

item_genre_mat = item_genre_mat.drop(['genres'], axis=1)
item_genre_mat = item_genre_mat.set_index(ITEM_COL)

# compute similarity matrix
### implement here
corr_mat =

# get top-k similar items
ind2name = {ind:name for ind,name in enumerate(item_genre_mat.index)}
name2ind = {v:k for k,v in ind2name.items()}
similar_items = top_k_items(name2ind['99'],
                            top_k = 10,
                            corr_mat = corr_mat,
                            map_name = ind2name)

# display result
print("The top-k similar movie to item_id 99")
display(items.loc[items[ITEM_COL].isin(similar_items)])

del corr_mat
gc.collect();

The top-k similar movie to item_id 99


| | item_id | genres |
|---|---|---|
| 0 | 0 | Romance, Action |
| 99 | 99 | Romance, Action |
| 211 | 211 | Romance, Action |
| 352 | 352 | Romance, Action |
| 512 | 512 | Action, Romance |
| 618 | 618 | Romance, Action |
| 737 | 737 | Action, Romance |
| 744 | 744 | Romance, Action |
| 813 | 813 | Action, Romance |
| 858 | 858 | Romance, Action |
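
The content-based cell leaves three steps open (`### implement here`). The sketch below shows one possible way to go from a genre string to a similarity ranking on a toy table; it is not the notebook's reference solution. The names `toy_items`, `toy_mat` and `toy_sim` are made up for this example, and cosine similarity is computed with plain NumPy instead of `sklearn`.

```python
import numpy as np
import pandas as pd

# Toy item table in the same shape as `items` (values made up).
toy_items = pd.DataFrame({
    "item_id": ["0", "1", "2", "3"],
    "genres": ["Romance, Action", "Mystery", "Action, Romance", "Action"],
})

# Multi-hot item-genre matrix: one column per genre, 1 if the item has it.
toy_mat = toy_items["genres"].str.lower().str.get_dummies(sep=", ")
toy_mat.index = toy_items["item_id"]

# Cosine similarity between all item pairs: normalise rows, then dot product.
X = toy_mat.to_numpy(dtype=float)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
toy_sim = X_unit @ X_unit.T

# Same ranking idea as top_k_items: argsort ascending, keep the last k, reverse.
query = 0
top = toy_sim[query].argsort()[-2:][::-1]
top_ids = [toy_mat.index[i] for i in top]
print(top_ids)
```

Because the genre order inside the string does not affect the multi-hot vector, "Romance, Action" and "Action, Romance" are treated as identical. The query item is also returned as its own nearest neighbour, which matches item 99 appearing in its own result list above.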


#### Collaborative filtering

Collaborative filtering is one of the most widely used approaches in industry; it analyses user behaviour to recommend items. The algorithm may analyse user behaviour from two perspectives:
- It may look for similarities between individuals based on their item ratings (user-based)
- It may measure the similarity between pairs of items (item-based)

In addition, the algorithm can be memory-based (does not involve gradient descent) or model-based (trains a model using gradient descent).
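
A memory-based variant can be sketched directly on a user-item rating matrix, without any training. The matrix `R`, the convention that 0 means "not rated", and the helper names below are assumptions for illustration only.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
R = np.array([
    [9, 8, 0, 1],
    [8, 9, 2, 0],
    [1, 0, 9, 8],
    [0, 2, 8, 9],
], dtype=float)

def cosine_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_matrix(R)    # user-user perspective
item_sim = cosine_matrix(R.T)  # item-item perspective

def predict_user_based(R, user_sim, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    rated = R[:, item] > 0   # users who rated the item
    rated[user] = False      # exclude the target user
    weights = user_sim[user, rated]
    return float(weights @ R[rated, item] / weights.sum())

# User 0 has not rated item 2; the estimate leans on the most similar user (user 1).
pred = predict_user_based(R, user_sim, user=0, item=2)
print(round(pred, 2))
```

Treating missing ratings as zeros inside the cosine is a simplification; it keeps the sketch short but biases similarities for sparse data.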

There are many different algorithms in the collaborative filtering category, but for now let's focus on DNN-based matrix factorization (MF).

It tries to reconstruct each rating from the inner product of the user's and the item's learned feature vectors (embeddings), which live in a shared latent space.
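
The inner-product idea can be sketched without any deep learning framework: learn one small vector per user and per item so that their dot product approximates the observed rating. This is plain matrix factorization trained with stochastic gradient descent, not the TensorFlow model used below; the toy observations, the learning rate and the regularisation strength are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed ratings as (user index, item index, rating) triples.
obs = [(0, 0, 9.0), (0, 1, 8.0), (1, 0, 8.0), (1, 2, 2.0),
       (2, 2, 9.0), (2, 3, 8.0), (3, 1, 2.0), (3, 3, 9.0)]
n_users, n_items, k = 4, 4, 2

P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors

def rmse(P, Q):
    """Root mean squared error over the observed ratings."""
    errs = [r - P[u] @ Q[i] for u, i, r in obs]
    return float(np.sqrt(np.mean(np.square(errs))))

rmse_before = rmse(P, Q)

lr, reg = 0.01, 0.01
for epoch in range(1000):
    for u, i, r in obs:
        err = r - P[u] @ Q[i]   # prediction is the inner product
        p_u = P[u].copy()       # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * p_u)
        Q[i] += lr * (err * p_u - reg * Q[i])

rmse_after = rmse(P, Q)
print(rmse_before, rmse_after)
```

The error here is measured on the training triples only, so it says nothing about generalisation; the notebook's model is evaluated on a held-out test split instead.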

In [5]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann

You should consider upgrading via the 'C:\Users\marty\anaconda3\envs\ai_learning\python.exe -m pip install --upgrade pip' command.
ERROR: Could not find a version that satisfies the requirement scann (from versions: none)
ERROR: No matching distribution found for scann


In [6]:
from IPython.display import clear_output

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow.keras as keras
from sklearn.model_selection import train_test_split

from typing import Dict, Text, Tuple

def df_to_ds(df):

    # convert pd.DataFrame to tf.data.Dataset
    ds = tf.data.Dataset.from_tensor_slices(
        (dict(df[['user_id','item_id']]), df['rating']))
    
    # convert Tuple[Dict[Text, tf.Tensor], tf.Tensor] to Dict[Text, tf.Tensor]
    ds = ds.map(lambda x, y: {
    'user_id' : x['user_id'],
    'item_id' : x['item_id'],
    'rating' : y
    })

    return ds.batch(256)

class RankingModel(keras.Model):

    def __init__(self, user_id, item_id, embedding_size):
        super().__init__()
        
        # user model
        input = keras.Input(shape=(), dtype=tf.string)
        x = keras.layers.StringLookup(
            ### implement here
            vocabulary =    , mask_token = None
            )(input)
        output = keras.layers.Embedding(
            input_dim = len(user_id) + 1,
            output_dim = embedding_size,
            name = 'embedding'
        )(x)
        self.user_model = keras.Model(inputs = input,
                                      outputs = output,
                                      name = 'user_model')

        # item model
        input = keras.Input(shape=(), dtype=tf.string)
        x = keras.layers.StringLookup(
            ### implement here
            vocabulary =    , mask_token = None
            )(input)
        output = keras.layers.Embedding(
            input_dim = len(item_id) + 1,
            output_dim = embedding_size,
            name = 'embedding'
        )(x)
        self.item_model = keras.Model(inputs = input,
                                  outputs = output,
                                  name = 'item_model')

        # rating model
        user_input = keras.Input(shape=(embedding_size,), name='user_emb')
        item_input = keras.Input(shape=(embedding_size,), name='item_emb')
        
        ### implement here
        x = 
        output = keras.layers.Dense(1)(x)
        
        self.rating_model = keras.Model(
            inputs = {
                'user_id' : user_input,
                'item_id' : item_input
            },
            outputs = output,
            name = 'rating_model'
        )

    def call(self, inputs: Dict[Text, tf.Tensor]) -> tf.Tensor:

        user_emb = self.user_model(inputs['user_id'])
        item_emb = self.item_model(inputs['item_id'])

        prediction = self.rating_model({
            'user_id' : user_emb,
            'item_id' : item_emb
        })
        
        return prediction

class GMFModel(tfrs.models.Model):

    def __init__(self, user_id, item_id, embedding_size):
        super().__init__()
        self.ranking_model = RankingModel(user_id, item_id, embedding_size)
        self.task = tfrs.tasks.Ranking(
            ### implement here
            loss = 
            metrics = 
        )
    
    def call(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:
        
        return self.ranking_model(
            {
             'user_id' : features['user_id'], 
             'item_id' : features['item_id']
            })

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:

        return self.task(labels = features.pop('rating'),
                         predictions = self.ranking_model(features))

# preprocess
train, test = train_test_split(ratings, train_size = .8, random_state=42)
train, test = df_to_ds(train), df_to_ds(test)

# init model
embedding_size = 64
model = GMFModel(user_id.astype(str),
                 item_id.astype(str),
                 embedding_size)
model.compile(
    optimizer = keras.optimizers.Adagrad(learning_rate = .01)
)

# fit the model
### implement here

# evaluate with the test data
result = model.evaluate(test, return_dict=True, verbose=0)
print("\nEvaluation on the test set:")
display(result)

# extract item embedding
item_emb = model.ranking_model.item_model.layers[-1].get_weights()[0]


item_corr_mat = cosine_similarity(item_emb)

print("\nThe top-k similar movie to item_id 99")
similar_items = top_k_items(name2ind['99'],
                            top_k = 10,
                            corr_mat = item_corr_mat,
                            map_name = ind2name)

display(items.loc[items[ITEM_COL].isin(similar_items)])

del item_corr_mat
gc.collect();


Evaluation on the test set:


{'root_mean_squared_error': 2.579379081726074,
 'loss': 6.170330047607422,
 'regularization_loss': 0,
 'total_loss': 6.170330047607422}


The top-k similar movie to item_id 99


| | item_id | genres |
|---|---|---|
| 83 | 83 | Horror |
| 99 | 99 | Romance, Action |
| 126 | 126 | Fantasy |
| 173 | 173 | Thriller |
| 179 | 179 | Action, Romance |
| 337 | 337 | Drama, Fantasy |
| 469 | 469 | Mystery, Fantasy |
| 516 | 516 | Action |
| 601 | 601 | Comedy |
| 621 | 621 | Horror |
