# Implementing diversity in final recommendations



The notion of diversity considered here is limited to diversity in content. Diversity is treated as the inverse of similarity: a group of movie recommendations with similar content is less diverse, so the higher the content similarity, the lower the diversity.

Similarity measures are obtained through natural language processing techniques, as already implemented in notebooks "3_content_based_recommender 1,2". 


In this notebook, the goal is to produce recommendations with a diversity level chosen by the user. There are three inputs to this notebook:
- the recommendations produced by collaborative filtering (CF)
- the similarity matrix resulting from content-based filtering analysis
- a desired diversity level set by an external user





In [199]:
import sys
import pandas as pd
from pandas_profiling import ProfileReport

import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import random

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline


Importing the top 100 CF recommendations, the similarity matrix and the df_features dataframe: 

In [288]:

df_cf_recos = pd.read_csv('../data/recommendations1.csv')
df_cos_sim = pd.read_csv('../data/cos_sim_matrix.csv')
df_features = pd.read_csv('../data/df_features.csv')



In [289]:
df_cos_sim.head()

Unnamed: 0,movieId,1,2,3,4,5,6,7,8,9,...,175475,175569,175577,175585,175693,175705,175707,175743,175781,176051
0,1,1.0,0.0339,0.010186,0.010879,0.0,0.0,0.012142,0.046066,0.0,...,0.015958,0.0,0.016464,0.0,0.018334,0.014679,0.008266,0.025425,0.052926,0.0
1,2,0.0339,1.0,0.020544,0.0,0.010148,0.030094,0.0,0.009292,0.100504,...,0.0,0.042701,0.0,0.0,0.0,0.0,0.008336,0.0,0.0,0.0
2,3,0.010186,0.020544,1.0,0.013186,0.0,0.0,0.0,0.011167,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.010879,0.0,0.013186,1.0,0.026053,0.019316,0.031439,0.023855,0.0,...,0.020659,0.0,0.04263,0.01472,0.023736,0.019004,0.010701,0.032915,0.068519,0.013027
4,5,0.0,0.010148,0.0,0.026053,1.0,0.017865,0.014539,0.0,0.073432,...,0.0,0.0,0.019714,0.0,0.0,0.017576,0.0,0.0,0.0,0.024096


The columns of the similarity matrix carry the right labels (the movie IDs), but the rows do not, so we copy the column labels onto the rows.

The movieId column is also unnecessary here, so we drop it.

In [290]:


# dropping movieId column
df_cos_sim = df_cos_sim.drop('movieId', axis=1, inplace=False)
# creating row indices identical to column indices
clm_lst = df_cos_sim.columns.to_list()
df_cos_sim = df_cos_sim.set_axis(clm_lst, axis=0)
df_cos_sim.head()
df_cos_sim.tail()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,175475,175569,175577,175585,175693,175705,175707,175743,175781,176051
1,1.0,0.0339,0.010186,0.010879,0.0,0.0,0.012142,0.046066,0.0,0.0,...,0.015958,0.0,0.016464,0.0,0.018334,0.014679,0.008266,0.025425,0.052926,0.0
2,0.0339,1.0,0.020544,0.0,0.010148,0.030094,0.0,0.009292,0.100504,0.009387,...,0.0,0.042701,0.0,0.0,0.0,0.0,0.008336,0.0,0.0,0.0
3,0.010186,0.020544,1.0,0.013186,0.0,0.0,0.0,0.011167,0.0,0.011282,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.010879,0.0,0.013186,1.0,0.026053,0.019316,0.031439,0.023855,0.0,0.0,...,0.020659,0.0,0.04263,0.01472,0.023736,0.019004,0.010701,0.032915,0.068519,0.013027
5,0.0,0.010148,0.0,0.026053,1.0,0.017865,0.014539,0.0,0.073432,0.0,...,0.0,0.0,0.019714,0.0,0.0,0.017576,0.0,0.0,0.0,0.024096


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,175475,175569,175577,175585,175693,175705,175707,175743,175781,176051
175705,0.014679,0.0,0.0,0.019004,0.017576,0.013031,0.0,0.0,0.0,0.0,...,0.027875,0.0,0.0,0.0,0.0,1.0,0.014438,0.044412,0.0,0.0
175707,0.008266,0.008336,0.0,0.010701,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.011184,0.0,0.014438,1.0,0.0,0.0,0.0
175743,0.025425,0.0,0.0,0.032915,0.0,0.0,0.036736,0.0,0.0,0.0,...,0.04828,0.0,0.049814,0.0,0.05547,0.044412,0.0,1.0,0.160128,0.0
175781,0.052926,0.0,0.0,0.068519,0.0,0.0,0.076472,0.0,0.0,0.0,...,0.100504,0.0,0.103695,0.0,0.11547,0.0,0.0,0.160128,1.0,0.0
176051,0.0,0.0,0.0,0.013027,0.024096,0.008932,0.0,0.022063,0.036716,0.02229,...,0.0,0.012674,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Now, the similarity matrix seems fine. 
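A visual check of `head()` and `tail()` can be backed by a few programmatic checks. The sketch below is illustrative only: `check_similarity_matrix` and the toy matrix are hypothetical names introduced here, and the checks (square shape, matching row and column labels, symmetry, unit diagonal) are what one would expect of a cosine similarity matrix, not something the original notebook runs.

```python
import numpy as np
import pandas as pd

def check_similarity_matrix(df_sim, tol=1e-6):
    """Return basic sanity checks for a labelled similarity matrix (hypothetical helper)."""
    values = df_sim.to_numpy(dtype=float)
    is_square = values.shape[0] == values.shape[1]
    return {
        "square": is_square,
        "labels_match": list(df_sim.index) == list(df_sim.columns),
        # Symmetry and the diagonal are only meaningful for a square matrix
        "symmetric": is_square and bool(np.allclose(values, values.T, atol=tol)),
        "unit_diagonal": is_square and bool(np.allclose(np.diag(values), 1.0, atol=tol)),
    }

# Toy 3x3 matrix with string labels, mimicking the layout of df_cos_sim
toy_ids = ["1", "2", "3"]
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=toy_ids, columns=toy_ids,
)
checks = check_similarity_matrix(toy_sim)
print(checks)
```

On the real data the same call would be `check_similarity_matrix(df_cos_sim)`. Any `False` entry would suggest revisiting the relabelling step.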

A function is created to calculate the intra-list similarity of a group of movies. The intra-list similarity is the mean similarity over all movie pairs in the list, based on the content-based similarity matrix.
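Written out, with notation introduced here purely for illustration ($L$ is a list of $n$ movies and $s_{ij}$ is the content similarity between movies $i$ and $j$), the average over all unordered pairs is:

$$
\mathrm{ILS}(L) = \frac{2}{n(n-1)} \sum_{i<j} s_{ij}
$$

For a list of $n = 6$ movies this averages over $6 \cdot 5 / 2 = 15$ pairs. The self-similarities $s_{ii} = 1$ on the diagonal are excluded.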

In [292]:

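def intralist_similarity(list_of_movieIds, df_cos_sim=df_cos_sim):
    """Mean pairwise content similarity of the given movies."""
    df_reco = df_cos_sim[list_of_movieIds]   # keep only the requested movieId columns
    df_reco = df_reco.loc[list_of_movieIds]  # keep only the requested movieId rows
    np_reco = df_reco.to_numpy()
    # Upper triangle above the diagonal (k=1): each pair once, self-similarities excluded
    np_reco_triu = np_reco[np.triu_indices(np_reco.shape[0], k=1)]
    return np.mean(np_reco_triu)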
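To see what the function computes, here is a small self-contained check on a toy matrix. The helper `intralist_similarity_toy` and the toy data are assumptions made for this illustration. The helper follows the same logic as the function above but takes the matrix as an explicit argument.

```python
import numpy as np
import pandas as pd

def intralist_similarity_toy(movie_ids, df_sim):
    """Mean similarity over all unordered pairs in movie_ids (illustrative helper)."""
    sub = df_sim.loc[movie_ids, movie_ids].to_numpy()
    upper = sub[np.triu_indices(sub.shape[0], k=1)]  # pairs above the diagonal
    return float(np.mean(upper))

toy_ids = ["1", "2", "3"]
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=toy_ids, columns=toy_ids,
)

# Pairs (1,2), (1,3), (2,3) have similarities 0.2, 0.0 and 0.5
toy_ils = intralist_similarity_toy(["1", "2", "3"], toy_sim)
print(toy_ils)  # mean of 0.2, 0.0 and 0.5
```

A list with fewer than two movies has no pairs, so the mean is undefined in that case.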


Reducing the top 100 recommendations to a short list (six items, for example) and calculating the intra-list similarity of this subset:

In [295]:
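# Intra-list similarity of the first `top_list_len` CF recommendations.
# recos_100 must already hold one user's top-100 CF movie IDs as strings
# (see how get_diverse_recommendations builds it further down).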

top_list_len = 6
recos_ss = recos_100[:top_list_len]
cf_similarity = intralist_similarity(recos_ss)
print(f"The movies list {recos_ss} has an intra-list similarity of  {cf_similarity}")

The movies list ['57669', '953', '3421', '2324', '1041', '4848'] has an intra-list similarity of  0.014107927666389014


How do we optimise for the diversity level we want?

- We shuffle the top 100, take a subset, compute its intra-list similarity, and repeat until we obtain a list with the required diversity.

Obtaining movie titles for a list of movie IDs:

In [299]:

def get_titles(movieIds_list):
    titles = []
    for movie in movieIds_list:
        # IDs arrive as strings; cast to int to match df_features['movieId']
        titles.append(df_features[df_features['movieId'] == int(movie)].title.tolist()[0])
    return titles

get_titles(recos_ss)

['Spirited Away',
 'Toy Story 3',
 'Stand by Me',
 'The Lord of the Rings: The Return of the King',
 'Bound',
 'The Untouchables']
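`get_titles` above filters `df_features` once per movie. A possible alternative, sketched here under the assumption that each `movieId` maps to a single title, builds an ID-to-title lookup once and reindexes it. The name `get_titles_indexed` and the toy data are hypothetical.

```python
import pandas as pd

def get_titles_indexed(movie_ids, df_feat):
    """Look up titles for string or int movie IDs, preserving input order (illustrative)."""
    # One row per movieId, indexed by ID, so each lookup is a label access
    title_by_id = df_feat.drop_duplicates("movieId").set_index("movieId")["title"]
    # reindex returns NaN for IDs that are absent instead of raising
    return title_by_id.reindex([int(m) for m in movie_ids]).tolist()

toy_features = pd.DataFrame(
    {"movieId": [10, 20, 30], "title": ["Film A", "Film B", "Film C"]}
)
toy_titles = get_titles_indexed(["30", "10"], toy_features)
print(toy_titles)
```

One behavioural difference to keep in mind: an unknown ID yields `NaN` here, whereas the loop version raises an `IndexError`.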

Creating a function that produces recommendations for each user within a given similarity interval:

In [351]:
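top_list_len = 6    # The length of the recommendation list

# Reference intra-list similarity levels, used as interval bounds
low_sim         = 0.004
mean_sim        = 0.020
high_sim        = 0.035
very_high_sim   = 0.050
vv_high_sim     = 0.100

def get_diverse_recommendations(user_id, sim_lower_bound, sim_higher_bound, top_list_len=6):
    """Return (similarity, titles) for a random list whose intra-list similarity lies
    strictly between the bounds, or None if no such list is found in 1000 draws."""
    # The user's top-100 CF movie IDs (skipping the first column), as strings to match df_cos_sim labels
    recos_100 = df_cf_recos[df_cf_recos['userId'] == user_id].values.tolist()[0][1:]
    recos_100 = [str(i) for i in recos_100]
    for _ in range(1000):
        recos_100_sfl = random.sample(recos_100, len(recos_100))  # shuffled copy
        recos_ss = recos_100_sfl[:top_list_len]
        reco_sim = intralist_similarity(recos_ss)
        if sim_lower_bound < reco_sim < sim_higher_bound:
            titles = get_titles(recos_ss)
            return reco_sim, titles
    return None  # no list within the interval was found

#get_diverse_recommendations(1, low_sim, mean_sim)
get_diverse_recommendations(10, very_high_sim, vv_high_sim)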

(0.05609805153294482,
 ['The Green Mile',
  "Harry Potter and the Philosopher's Stone",
  'Rebecca',
  'Bound',
  'Manhattan',
  'Harry Potter and the Goblet of Fire'])
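The search above is a form of rejection sampling, and it returns `None` when none of the 1000 draws lands in the requested interval. The self-contained sketch below illustrates both outcomes on toy data. The function name `sample_list_in_range`, its `seed` parameter and the toy matrix are assumptions made for this example, not part of the notebook.

```python
import random
import numpy as np
import pandas as pd

def sample_list_in_range(candidates, df_sim, low, high, k=3, max_tries=1000, seed=None):
    """Draw k candidates at random until their intra-list similarity is in (low, high).

    Returns (similarity, ids), or None if no draw qualifies within max_tries.
    """
    rng = random.Random(seed)  # private generator, so a seed makes the run repeatable
    for _ in range(max_tries):
        picked = rng.sample(candidates, k)
        sub = df_sim.loc[picked, picked].to_numpy()
        sim = float(np.mean(sub[np.triu_indices(k, k=1)]))
        if low < sim < high:
            return sim, picked
    return None

toy_ids = ["1", "2", "3", "4"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.1, 0.0],
     [0.1, 0.1, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]],
    index=toy_ids, columns=toy_ids,
)

# Only the triple {1, 2, 3} has a mean pairwise similarity inside (0.3, 0.4)
found = sample_list_in_range(toy_ids, toy_sim, low=0.3, high=0.4, k=3, seed=0)
# No triple reaches 0.9, so the search gives up and returns None
missing = sample_list_in_range(toy_ids, toy_sim, low=0.9, high=1.0, k=3, seed=0)
print(found, missing)
```

A caller can test for `None` and, for instance, widen the interval or fall back to the unshuffled top of the CF list.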

### Examples of recommendation lists with high and low similarity scores for user 10:

+ 0.11280016687936643
- Das Boot
- Mister Roberts
- Lawrence of Arabia
- Lifeboat
- Castle in the Sky
- Life Is Beautiful

+ 0.12338505218616813
- Mister Roberts
- The Great Escape
- Lifeboat
- Das Boot
- Heavenly Creatures


+ 0.0006693831806740066
- The Gods Must Be Crazy
- Secrets & Lies
- The Departed
- Raiders of the Lost Ark
- Love and Death
- 8½



+ 0.0025674328024291398
- The Wrong Trousers
- Run Lola Run
- Snatch
- Open Range
- American History X
- Being There