# Improving recommendation lists through topic diversification

This notebook implements the Topic Diversification Algorithm defined in the research paper [Improving Recommendation Lists through Topic Diversification](https://www.researchgate.net/publication/200110416_Improving_recommendation_lists_through_topic_diversification). The dataset can be found [here](https://www.kaggle.com/timschaum/subreddit-recommender).

In [1]:
import pandas as pd
import numpy as np
import torch
import io
import csv
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Sarcasm Dataset
df_m = pd.read_csv('../../civility/recommender/train-balanced-sarcasm-processed.csv')

In [3]:
# Add all comments to a list
corpus = df_m['comment'].to_list()

In [4]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Embed each comment
import time
start_time = time.time()
sarcasm_embeddings = embedder.encode(corpus, convert_to_tensor=True)
end_time = time.time()
print("Time for computing embeddings:"+ str(end_time-start_time) )

Time for computing embeddings:37.183178424835205


In [5]:
# Add vector embeddings as column in df
vectors = []
for vector in sarcasm_embeddings:
    vectors.append(list(vector.cpu().numpy()))
    
df_m['vector'] = vectors

### Step 1: Generate predictions (at least 5N for a final top-N recommendation list).

In [6]:
# Define Method for getting top-n similar comments
def get_similar_posts(query, n):
    """
    query (string): the text of the post
    n (int): number of posts to recommend
    """
    # Find the n closest corpus sentences to the query based on cosine similarity
    top_k = min(n, len(corpus))
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    similarities = []
    pairs = []

    # We use cosine similarity and torch.topk to find the top_k highest scores
    cos_scores = util.pytorch_cos_sim(query_embedding, sarcasm_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("\nTop {n} most similar sentences in corpus:".format(n=n))

    for score, idx in zip(top_results[0], top_results[1]):
#         print(corpus[idx], "(Score: {:.4f})".format(score))
        pairs.append(tuple((corpus[idx], score)))
    
    recommend_frame = []
    for val in pairs:
        recommend_frame.append({'Comment':val[0],'Similarity':val[1].cpu().numpy()})
     
    df = pd.DataFrame(recommend_frame)
    return df

In [7]:
n = 10 # Select how many recommendations you want
N = 5 * n
C_prime = get_similar_posts('Xbox is much better than PS4', N)
C_prime = C_prime.join(df_m.set_index('comment'), on='Comment')
C_prime.head()

Query: Xbox is much better than PS4

Top 50 most similar sentences in corpus:


Unnamed: 0.1,Comment,Similarity,Unnamed: 0,parent_comment,author,subreddit,vector
0,PS4 owners are stupid and ruin the gaming comm...,0.7556305,60973,Why are most ps4 owners so adamant about their...,poppymelt,xboxone,"[0.20560789, -0.4412364, 0.32114562, -0.819991..."
1,Well all I can say is that I used to be xbox f...,0.7221293,84910,"On the IGN post-show, they mentioned a good po...",koncept61,gaming,"[-0.17582169, -0.47674617, 0.123696476, -0.365..."
2,out of curiosity is there anything you prefer ...,0.69925743,67893,It's not blind it's preference,Cageshep,metalgearsolid,"[-0.17672586, -0.40686485, 0.24583732, -0.5249..."
3,Yes because games on the Xbox 360 look so grea...,0.6865504,74049,It hasn't aged at all. It was on the Xbox 360 ...,ScreamHawk,pics,"[-0.031248106, -0.80437386, -0.010843297, -0.7..."
4,"Works fine on my PC, although I'd say the orig...",0.68609136,90807,I get a really unplayable choppy frame rate on...,LordManders,gaming,"[-0.022004578, -0.56376666, 0.11417452, -0.567..."


### Step 2: For each item at position N+1, calculate the ILS (diversity) as if this item were part of the top-N list.

For every list entry $z \in [2, N]$, we collect the items from the candidate set $B_i$ that do not occur in positions $o < z$ in $P_{w_i*}$ and compute their similarity with the set $\{P_{w_i*}(k) \mid k \in [1, z)\}$, which contains all new recommendations preceding rank $z$.
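
To make the notion of intra-list similarity concrete, here is a minimal sketch on toy vectors. It is an illustration only, not the computation used in the cells of this notebook: the helper name `intra_list_similarity` is hypothetical, cosine similarity is assumed as the similarity measure, and each unordered pair is counted once.

```python
import numpy as np

def intra_list_similarity(vectors):
    """Sum of cosine similarities over all unordered pairs of rows (hypothetical helper)."""
    X = np.asarray(vectors, dtype=float)
    # Normalise rows so that the dot product equals cosine similarity
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    # Keep each unordered pair once (strict upper triangle)
    upper = np.triu_indices(len(X), k=1)
    return float(sim[upper].sum())

# Toy example: two vectors pointing the same way and one orthogonal vector
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
ils_value = intra_list_similarity(toy_vectors)
```

A higher value means the list is more homogeneous, so a candidate that raises this sum the least contributes the most diversity.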

In [8]:
# Prepare df for pairwise distance
df_ils = C_prime.copy()
df_ils = df_ils.set_index(['Comment'])

In [9]:
ils = {}
# set ILS for first item
ils[df_ils.head(1)['Similarity'].index.values.item(0)] = df_ils.head(1)['Similarity'].values[0].item(0)
for i in range(2, N + 1):  # list positions 2..N (N = 50 candidates)
    top_n = df_ils.head(i - 1)
    top_n = top_n[['Similarity']]
    bottom = df_ils.tail(len(df_ils) - i + 1)
    bottom = bottom[['Similarity']]
    for item in bottom.index:
        rowData = bottom.loc[[item], :]
        # DataFrame.append was removed in pandas 2.0; pd.concat builds the same list
        candidate_list = pd.concat([top_n, rowData])
        ils[item] = sum(pdist(candidate_list)) / len(candidate_list)  # ILS calculation

In [10]:
len(ils)

50

### Step 3: Sort the remaining items in reverse (according to ILS rank) to get their dissimilarity rank.

In [11]:
dissimilarity_rank = {k: v for k, v in sorted(ils.items(), key=lambda item: item[1], reverse=True)}
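
The cell above keeps the ILS values themselves, ordered from highest to lowest. If rank positions are wanted instead, they could be derived as in the sketch below. The scores and the `toy_` names are made up for illustration, and the direction of the ordering (lower ILS gets the better rank) is an assumption that depends on whether the score measures similarity or distance.

```python
# Toy ILS values (made up for illustration)
toy_ils = {"a": 0.9, "b": 0.2, "c": 0.5}

# Assumption: a lower ILS means the item adds more diversity,
# so it receives the better (smaller) dissimilarity rank.
ordered_items = sorted(toy_ils, key=toy_ils.get)
toy_dissimilarity_rank = {item: pos for pos, item in enumerate(ordered_items, start=1)}
print(toy_dissimilarity_rank)  # {'b': 1, 'c': 2, 'a': 3}
```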

### Step 4: Calculate new rank for each item as r = a ∗ P + b ∗ Pd, with P being the original rank, Pd being the dissimilarity rank and a, b being constants in range [0, 1]

In [12]:
from collections import OrderedDict
# a,b ∈ [0,1]
a = 0.5
b = 0.5
new_rank = {}
ordered_dissimilarity_rank = OrderedDict(dissimilarity_rank)
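for item in df_ils.index:
    # Note: P and Pd below are the raw similarity score and ILS value,
    # not rank positions; the rank-position variant is kept commented out.
    P = C_prime['Similarity'][C_prime['Comment'] == item].values[0]
    Pd = ordered_dissimilarity_rank[item]
#     P = C_prime.index[C_prime['Comment'] == item]
#     Pd = list(ordered_dissimilarity_rank.keys()).index(item)
    new_rank[item] = (a * P) + (b * Pd)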
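
For comparison, the formula r = a ∗ P + b ∗ Pd can also be applied to rank positions rather than scores. The following is a small sketch under that assumption. The function `merge_ranks` and the toy dictionaries are hypothetical, and because position 1 is the best rank, the merged values are sorted in ascending order.

```python
def merge_ranks(original_rank, dissim_rank, a=0.5, b=0.5):
    """Combine two rank-position dictionaries as r = a * P + b * Pd (lower is better)."""
    return {item: a * original_rank[item] + b * dissim_rank[item]
            for item in original_rank}

# Toy rank positions (made up): 1 is the best position in each list
toy_P = {"a": 1, "b": 2, "c": 3}
toy_Pd = {"a": 3, "b": 1, "c": 2}

toy_merged = merge_ranks(toy_P, toy_Pd)
# Ascending sort, because a smaller merged rank is better
toy_top = sorted(toy_merged, key=toy_merged.get)[:2]
print(toy_top)  # ['b', 'a']
```

With a = 1 and b = 0 the original order is returned unchanged, while larger values of b give more weight to diversity.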

In [15]:
ordered_dissimilarity_rank

OrderedDict([('Yeah Obviously you have some inside knowledge and not just guessing, yeah fuck you Xbox',
              1.1531847298145295),
             ('So what game do you keep pre-loaded in the ps1?',
              1.1328668448389794),
             ('Yeah, on about 2-3 years well see the Ps4 Pro slim and the new Ps5.',
              1.1119956001639366),
             ("I'm sorry love, but you are on Xbox and I am on PS3, it just wasn't meant to be.",
              1.0904722822473405),
             ('Lag isnt exactly about power you know... And the Xbox One isn\'t exactly what I would call "powerful", but yeah, not underpowered to the point we see in that video tho.',
              1.0682330183360889),
             ("Ah, ok, I don't have a PS4 yet but hopefully will soon.",
              1.0458334313498603),
             ('I am sure that kind of a place is much more shady than where Xboxes get made.',
              1.0224904783747413),
             ("I honestly didn't check to see th

### Step 5: Select the top-N items according to the newly calculated rank

In [13]:
final_ranks = {k: v for k, v in sorted(new_rank.items(), key=lambda item: item[1], reverse=True)}

In [16]:
data = []
for comment, score in final_ranks.items():
    data.append({'Comment': comment,'Rank': score})

df_sim = pd.DataFrame(data)
df_sim = df_sim.set_index(['Comment'])
similarities = []
for item in df_sim.index:
    similarities.append(ordered_dissimilarity_rank[item])

df_sim['Similarity'] = similarities
df_sim = df_sim.head(10)
df_sim.sort_values(by=['Rank'], ascending=False)

Comment,Rank,Similarity
"Yeah Obviously you have some inside knowledge and not just guessing, yeah fuck you Xbox",0.865436,0.57768637
So what game do you keep pre-loaded in the ps1?,0.85542,0.5779736
"Yeah, on about 2-3 years well see the Ps4 Pro slim and the new Ps5.",0.8451,0.57820475
"I'm sorry love, but you are on Xbox and I am on PS3, it just wasn't meant to be.",0.834446,0.5784199
"Lag isnt exactly about power you know... And the Xbox One isn't exactly what I would call ""powerful"", but yeah, not underpowered to the point we see in that video tho.",0.823732,0.57923
"Ah, ok, I don't have a PS4 yet but hopefully will soon.",0.812568,0.57930315
I am sure that kind of a place is much more shady than where Xboxes get made.,0.800996,0.5795016
"I honestly didn't check to see the costs of higher ones, but typically this Xbox branded one is more expensive, and people here recommend against Seagate because they have a higher failure rate.",0.789357,0.58044994
"I used HD598s and I have a GAME ONE now...HD598s were more comfortable, but they're both great.",0.777584,0.5813107
"now you're just being silly, everybody knows only Xboxes overheat and RROD...",0.765714,0.5822885


In [17]:
# Diversity of the re-ranked top-n list: mean pairwise distance over the Similarity column
n = 10
df_copy = df_sim.copy()
df_copy = df_copy.drop(columns=['Rank'])
dis_similarity = [x for x in pdist(df_copy)]
avg_dissim_greedy = (sum(dis_similarity))/((n/2)*(n-1))
avg_dissim_greedy

0.001762733194563124
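
The denominator `(n/2)*(n-1)` is the number of unordered pairs, n(n-1)/2, so the value above is the mean pairwise distance. The sketch below reproduces that calculation by hand on made-up numbers, using the Euclidean distance that `pdist` applies by default. The `toy_` names are illustrative only.

```python
from itertools import combinations
import numpy as np

# Toy one-column scores (made up), shaped like a single Similarity column
toy_scores = np.array([[0.1], [0.4], [0.8]])
n_toy = len(toy_scores)

# One Euclidean distance per unordered pair, n(n-1)/2 in total
pairwise = [np.linalg.norm(u - v) for u, v in combinations(toy_scores, 2)]
avg_dissim_toy = sum(pairwise) / ((n_toy / 2) * (n_toy - 1))
```

Here the three distances are 0.3, 0.7 and 0.4, so the average is 1.4 / 3.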

In [18]:
# Redefine get_similar_posts so the returned DataFrame is indexed by comment
def get_similar_posts(query, n):
    """
    query (string): the text of the post
    n (int): number of posts to recommend
    """
    # Find the n closest corpus sentences to the query based on cosine similarity
    top_k = min(n, len(corpus))
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    similarities = []
    pairs = []

    # We use cosine similarity and torch.topk to find the top_k highest scores
    cos_scores = util.pytorch_cos_sim(query_embedding, sarcasm_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("\nTop {n} most similar sentences in corpus:".format(n=n))

    for score, idx in zip(top_results[0], top_results[1]):
#         print(corpus[idx], "(Score: {:.4f})".format(score))
        pairs.append(tuple((corpus[idx], score)))
    
    recommend_frame = []
    for val in pairs:
        recommend_frame.append({'Comment':val[0],'Similarity':val[1].cpu().numpy()})
     
    df = pd.DataFrame(recommend_frame)
    df = df.set_index(['Comment'])
    return df

In [19]:
df_control = get_similar_posts('Xbox is much better than PS4', 10)
df_control.head()

Query: Xbox is much better than PS4

Top 10 most similar sentences in corpus:


Comment,Similarity
"PS4 owners are stupid and ruin the gaming community, they just know Xbox is 10 times better.",0.7556305
"Well all I can say is that I used to be xbox first, and then I'd pick up the playstation to play a few games like MGS... Looks like I'm just going to have a PS4 now.",0.7221293
out of curiosity is there anything you prefer about the PS4 compared to the PC?,0.69925743
Yes because games on the Xbox 360 look so great now.,0.6865504
"Works fine on my PC, although I'd say the original Xbox 360 version is the ""best""",0.68609136


In [20]:
n = 10
dis_similarity = [x for x in pdist(df_control)]

avg_dissim_control = (sum(dis_similarity))/((n/2)*(n-1))
avg_dissim_control

0.029833086331685386

In [21]:
percent_change = ((avg_dissim_greedy - avg_dissim_control)/avg_dissim_control)*100
round(percent_change, 2)

-94.09
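
As a check on the arithmetic, the percent change can be worked out from the two values printed above (rounded here to seven decimal places):

$$
\frac{0.0017627 - 0.0298331}{0.0298331} \times 100 \approx -94.09
$$

The negative sign means the value measured for the re-ranked list is lower than the value measured for the control list.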
