# Diversity in Reddit Comments

This notebook implements the Bounded Greedy Selection algorithm described in the research paper [Improving Recommendation Diversity](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.8.5232&rep=rep1&type=pdf). The dataset can be found [here](https://www.kaggle.com/sherinclaudia/sarcastic-comments-on-reddit).

## Reading the Data

In [1]:
import pandas as pd
import numpy as np
import torch
import io
import csv
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Sarcasm Dataset
df_m = pd.read_csv('../../civility/recommender/train-balanced-sarcasm-processed.csv')

In [3]:
# Add all comments to a list
corpus = df_m['comment'].to_list()

In [5]:
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Embed each comment
import time
start_time = time.time()
sarcasm_embeddings = embedder.encode(corpus, convert_to_tensor=True)
end_time = time.time()
print("Time for computing embeddings:"+ str(end_time-start_time) )

Time for computing embeddings:281.16767168045044


In [6]:
# Add vector embeddings as column in df
vectors = []
for vector in sarcasm_embeddings:
    vectors.append(list(vector.cpu().numpy()))
    
df_m['vector'] = vectors

## Saving Embeddings to .pt File

In [None]:
# Vectors
torch.save(sarcasm_embeddings, 'sarcasm_embeddings.pt')
# Load Vectors from file 
# sarcasm_embeddings = torch.load('sarcasm_embeddings.pt')

In [None]:
# Comments
# Write one comment per row; the context manager closes the file even if writing fails
with open('sarcasm_metadata.csv', 'w', newline='') as outfile:
    out = csv.writer(outfile)
    out.writerows([comment] for comment in corpus)
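
To read the comments back later, something like the following should work. This is a minimal sketch that assumes the one-comment-per-row layout written above; `sample_corpus` and `restored` are illustrative names, and an in-memory buffer stands in for the file on disk.

```python
import csv
import io

# Toy comments standing in for `corpus`; they include a comma, quotes, and a newline
sample_corpus = ['Xbox or PS4', 'No, because "reasons"', 'line one\nline two']

# Write one comment per row (StringIO replaces the file on disk)
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows([comment] for comment in sample_corpus)

# Read the rows back and unwrap the single column
buffer.seek(0)
restored = [row[0] for row in csv.reader(buffer)]
print(restored == sample_corpus)
```

Using the `csv` module for both directions means commas, quotes, and embedded newlines inside a comment are escaped and restored consistently.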

## Example Query Recommendations

In [7]:
"""
Run this for example queries
This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
# Query sentences:
queries = ['Trump has improved the economy so much!', 'Xbox is much better than PS4', 'Taliban bombs school']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.pytorch_cos_sim(query_embedding, sarcasm_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, sarcasm_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """





Query: Trump has improved the economy so much!

Top 5 most similar sentences in corpus:
Trump will make it better. (Score: 0.6741)
Well thank goodness Trump is going to make it great again. (Score: 0.6713)
Because Donald Trump is going to make America great again. (Score: 0.6546)
BUT TRUMP SAID HE WAS GONNA MAKE AMERICA GREAT AGAIN! (Score: 0.6480)
But Trump can make all of America great again! (Score: 0.6452)




Query: Xbox is much better than PS4

Top 5 most similar sentences in corpus:
No, because the xbox one is soooo much better than the ps4 (Score: 0.9228)
but xbox has the highest quality pixels, so it should be better than ps4, right? (Score: 0.8895)
Xbox or PS4 (Score: 0.8632)
Ayy, better than Xbox 7 and PS5 (Score: 0.8622)
Ps4 or Xbox (Score: 0.8430)




Query: Taliban bombs school

Top 5 most similar sentences in corpus:
Nah, the Pakistani Taliban shoot school girls for going to school because of the invasion of Iraq. (Score: 0.7269)
And bomb Afghani schools (Score: 0.71

## Recommending the Top 50 Similar Comments

Here, we recommend the 50 comments most similar to the query.

In [8]:
# Define Method for getting top-n similar comments/titles
def get_similar_posts(query, n):
    """
    query (string): the text of the post
    n (int): number of posts to recommend
    """
    # Find the n sentences of the corpus closest to the query, based on cosine similarity
    top_k = min(n, len(corpus))
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    pairs = []

    # We use cosine-similarity and torch.topk to find the highest n scores
    cos_scores = util.pytorch_cos_sim(query_embedding, sarcasm_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("\nTop {n} most similar sentences in corpus:".format(n=n))

    for score, idx in zip(top_results[0], top_results[1]):
#         print(corpus[idx], "(Score: {:.4f})".format(score))
        pairs.append(tuple((corpus[idx], score)))
    
    recommend_frame = []
    for val in pairs:
        recommend_frame.append({'Comment':val[0],'Similarity':val[1].cpu().numpy()})
     
    df = pd.DataFrame(recommend_frame)
    df = df.set_index(['Comment'])
    return df

In [10]:
df_control = get_similar_posts('Xbox is much better than PS4', 50)
df_control.head()

Query: Xbox is much better than PS4

Top 50 most similar sentences in corpus:


| Comment | Similarity |
|---|---|
| No, because the xbox one is soooo much better than the ps4 | 0.9228309 |
| but xbox has the highest quality pixels, so it should be better than ps4, right? | 0.88953227 |
| Xbox or PS4 | 0.8632204 |
| Ayy, better than Xbox 7 and PS5 | 0.8621861 |
| Ps4 or Xbox | 0.8430183 |


## Calculating the Diversity

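The diversity of a set of items, $c_1, \ldots, c_n$, is defined as the average dissimilarity between all pairs of items in the result set:

$$
\mathrm{Diversity}(c_1, \ldots, c_n) = \frac{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \left(1 - \mathrm{Similarity}(c_i, c_j)\right)}{\frac{n}{2}(n-1)}
$$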

Here, we can calculate the average dissimilarity of items recommended when no diversity algorithms are implemented. This will be used as a control to help us evaluate and compare our results later on.

In [11]:
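n = 50
# pdist on the one-column frame gives |Similarity_i - Similarity_j| for every pair of recommendations
dis_similarity = [x for x in pdist(df_control)]

# Average over the n * (n - 1) / 2 pairs
avg_dissim_control = (sum(dis_similarity))/((n/2)*(n-1))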
avg_dissim_control

0.059638280090020625
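
Note that `pdist(df_control)` operates on the single `Similarity` column, so it measures differences between query-similarity scores. If Similarity(ci, cj) in the definition is instead taken to be the cosine similarity between the items' embedding vectors (an assumption here, not something the notebook does), the average could be computed roughly as follows. The sketch uses plain Python and toy vectors; `cosine_sim`, `average_dissimilarity`, and `toy_vectors` are illustrative names.

```python
import math
from itertools import combinations

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_dissimilarity(vectors):
    """Mean of (1 - cosine similarity) over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))  # n * (n - 1) / 2 pairs
    return sum(1 - cosine_sim(u, v) for u, v in pairs) / len(pairs)

# Toy vectors: the first and third are identical, the second is orthogonal to both
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_diversity = average_dissimilarity(toy_vectors)
print(toy_diversity)
```

Of the three pairs, two are orthogonal (dissimilarity 1) and one is identical (dissimilarity 0), so the average is 2/3.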

## Bounded Greedy Selection Algorithm

The Greedy Selection algorithm takes a more principled approach to improving diversity: it uses a quality metric to build the result set, R, incrementally. During each iteration the remaining items are ordered by quality and the highest-quality item is added to R. The first item selected is always the one most similar to the target. In each subsequent iteration, the item selected is the one with the highest quality with respect to the items selected so far. This is expensive, because every iteration recomputes the quality of every remaining item against the items already in R.

To reduce the complexity of the Greedy Selection algorithm we can implement a bounded version. The Bounded Greedy Selection algorithm first selects the best x cases according to their similarity to the target query and then applies the greedy selection method to these. 

[Source](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.8.5232&rep=rep1&type=pdf)
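
The two stages described above can be sketched on toy data as follows. This is a minimal illustration in plain Python, not the notebook's implementation: `cosine_sim`, `quality`, `bounded_greedy`, and the toy vectors are all assumed names and values, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def quality(name, selected, items, sim_to_target):
    """Similarity to the target times relative diversity w.r.t. `selected`."""
    if not selected:
        return sim_to_target[name]  # relative diversity is 1 for an empty set
    rel_div = sum(1 - cosine_sim(items[name], items[r]) for r in selected) / len(selected)
    return sim_to_target[name] * rel_div

def bounded_greedy(target, items, bound, k):
    """Pick up to k item names from `items` (name -> vector) for `target`."""
    sim_to_target = {name: cosine_sim(target, vec) for name, vec in items.items()}
    # Bound: keep only the `bound` items most similar to the target
    candidates = sorted(items, key=sim_to_target.get, reverse=True)[:bound]
    selected = []
    # Greedy: repeatedly move the highest-quality candidate into the result
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda name: quality(name, selected, items, sim_to_target))
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: 'a' and 'b' are near-duplicates, 'c' points elsewhere, 'd' is unrelated
target = [1.0, 0.0]
items = {'a': [1.0, 0.1], 'b': [1.0, 0.12], 'c': [0.7, 0.7], 'd': [0.0, 1.0]}
picked = bounded_greedy(target, items, bound=3, k=2)
print(picked)
```

On this toy data the sketch selects `'a'` and then `'c'`, whereas ranking by similarity alone would return the near-duplicates `'a'` and `'b'`.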

### Step 1: Select the best x = 500 cases according to their similarity to the target query and call this set C'

In [12]:
# Redefine get_similar_posts without set_index, so that 'Comment' stays a regular column for the join below
def get_similar_posts(query, n):
    """
    query (string): the text of the post
    n (int): number of posts to recommend
    """
    # Find the n sentences of the corpus closest to the query, based on cosine similarity
    top_k = min(n, len(corpus))
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    pairs = []

    # We use cosine-similarity and torch.topk to find the highest n scores
    cos_scores = util.pytorch_cos_sim(query_embedding, sarcasm_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("Query:", query)
    print("\nTop {n} most similar sentences in corpus:".format(n=n))

    for score, idx in zip(top_results[0], top_results[1]):
#         print(corpus[idx], "(Score: {:.4f})".format(score))
        pairs.append(tuple((corpus[idx], score)))
    
    recommend_frame = []
    for val in pairs:
        recommend_frame.append({'Comment':val[0],'Similarity':val[1].cpu().numpy()})
     
    df = pd.DataFrame(recommend_frame)
    return df

In [13]:
C_prime = get_similar_posts('Xbox is much better than PS4', 500)
C_prime = C_prime.join(df_m.set_index('comment'), on='Comment')
C_prime.head()

Query: Xbox is much better than PS4

Top 500 most similar sentences in corpus:


| | Comment | Similarity | label | author | subreddit | score | ups | downs | date | created_utc | parent_comment | vector |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No, because the xbox one is soooo much better ... | 0.9228309 | 1 | awesome7332 | gaming | 1 | 1 | 0 | 2015-06 | 2015-06-08 03:38:26 | How about we stop arguing about the platforms. | [0.30111814, -0.6113374, 0.116583675, -0.63111... |
| 1 | but xbox has the highest quality pixels, so it... | 0.88953227 | 1 | its_high_knut | pcmasterrace | 2 | 2 | 0 | 2016-08 | 2016-08-09 20:20:45 | They look even better if you play them on your... | [0.41256794, -0.464011, -0.2718223, -0.6063775... |
| 2 | Xbox or PS4 | 0.8632204 | 0 | DogblockBernie | Rainbow6 | 1 | 1 | 0 | 2016-09 | 2016-09-19 01:26:24 | Same here! | [0.51289624, -0.74949, 0.08234843, -0.94248724... |
| 3 | Ayy, better than Xbox 7 and PS5 | 0.8621861 | 1 | DarkShadow1253 | pcmasterrace | 1 | 1 | 0 | 2015-06 | 2015-06-15 17:59:29 | To make you happy with your specs: YOUR SPECS ... | [0.083633505, -0.6993774, -0.01547823, -0.6487... |
| 4 | Ps4 or Xbox | 0.8430183 | 1 | Karma_y0 | AskReddit | 1 | 1 | 0 | 2016-05 | 2016-05-21 21:37:44 | If you could ask anyone on the internet someth... | [0.48774123, -0.7246228, 0.0145492, -0.9653652... |


### Step 2: Add the most similar item from C' as the first item in the result set R and drop this item from C'

In [14]:
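# df_temp holds the remaining candidates; drop() returns a new frame, so C_prime keeps all 500 rows
df_temp = C_prime
# The first item in R is always the one with the highest similarity to the query
recommendations = [C_prime["Comment"][0]]

# Remove the selected item from the remaining candidates
index = df_temp[(df_temp.Comment == recommendations[0])].index
df_temp = df_temp.drop(index)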

### Step 3: During each subsequent iteration, the item selected is the one with the highest quality with respect to the set of cases selected during the previous iteration

The quality of an item c is proportional to the similarity between c and the current target t, and to the diversity of c relative to those items so far selected, R = {r1,...,rm}.

Quality(t,c,R) = Similarity(t,c) ∗ RelDiversity(c,R)
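
The formula above relies on RelDiversity, which the text does not spell out. Judging from `calculate_quality` in the next cell, which averages 1 minus the cosine similarity between c and each item in R, it appears to be the following. This is a reconstruction under that assumption, with the empty-set case set to 1 as in the code:

$$
\mathrm{RelDiversity}(c, R) =
\begin{cases}
1 & \text{if } R = \{\} \\
\dfrac{1}{m} \sum_{i=1}^{m} \left(1 - \mathrm{Similarity}(c, r_i)\right) & \text{otherwise}
\end{cases}
$$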

In [15]:
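from sklearn.metrics.pairwise import cosine_similarity
def calculate_quality(c, R, df, df_sim):
    """
    Quality(t, c, R) = Similarity(t, c) * RelDiversity(c, R)
    c (string): candidate comment
    R (list): comments selected so far
    df (DataFrame): remaining candidates
    df_sim (DataFrame): all of C', used to look up the vectors of items in R
    """
    similarity = df['Similarity'][df['Comment'] == c].to_numpy()[0] # similarity to the query

    # Relative diversity is 1 when nothing has been selected yet;
    # returning early also avoids dividing by len(R) == 0
    if len(R) == 0:
        return similarity

    vector = np.array(df['vector'][df['Comment'] == c].to_numpy()[0]).reshape(1, -1)
    diversity = []
    for item in R:
        item_vector = np.array(df_sim['vector'][df_sim['Comment'] == item].to_numpy()[0]).reshape(1, -1)
        diversity.append(1 - cosine_similarity(vector, item_vector)[0][0])

    rel_diversity = sum(diversity)/len(R) # relative diversity
    return rel_diversity * similarity # quality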

In [16]:
# Run k = 50 greedy iterations. recommendations already holds the first item,
# so the list ends up with k + 1 entries (use range(k - 1) for exactly k).
k = 50
for i in range(k):
    qualities = {}
    # Calculate the quality of each remaining comment
    for item in df_temp['Comment']:
        qualities[item] = calculate_quality(item, recommendations, df_temp, C_prime)

    # Add the highest-quality comment to the result set
    best_comment = max(qualities, key=qualities.get)
    recommendations.append(best_comment)

    # Remove it from the remaining candidates
    index = df_temp[(df_temp.Comment == recommendations[-1])].index
    df_temp = df_temp.drop(index)

In [17]:
recommendations[:5]

['No, because the xbox one is soooo much better than the ps4',
 'The best part is that Xbox gamers will play on separate servers to PC.',
 'Tip: PSAs are better.',
 "Omg no, I'm an investor and I'm sooooo disappointed in Xbox One not outselling PS4!",
 'Works fine on my PC, although I\'d say the original Xbox 360 version is the "best"']

## Evaluate the Recommendations

In [18]:
similarities = []
for item in recommendations:
    sim = C_prime['Similarity'][C_prime['Comment'] == item].to_numpy()[0]
    similarities.append(sim)

pairs = list(zip(recommendations, similarities))
recommend_frame = []
for val in pairs:
    recommend_frame.append({'Comment':val[0],'Similarity':val[1].item(0)})    

df_sim = pd.DataFrame(recommend_frame)
df_sim = df_sim.set_index(['Comment'])
df_sim.head()

| Comment | Similarity |
|---|---|
| No, because the xbox one is soooo much better than the ps4 | 0.922831 |
| The best part is that Xbox gamers will play on separate servers to PC. | 0.578906 |
| Tip: PSAs are better. | 0.599215 |
| Omg no, I'm an investor and I'm sooooo disappointed in Xbox One not outselling PS4! | 0.722312 |
| Works fine on my PC, although I'd say the original Xbox 360 version is the "best" | 0.686091 |


In [19]:
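# Find the diversity with the same calculation used for the control set.
# n is hard-coded: it should match len(df_sim) for the pair count to be exact.
n = 50
dis_similarity = [x for x in pdist(df_sim)]
avg_dissim_greedy = (sum(dis_similarity))/((n/2)*(n-1))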
avg_dissim_greedy

0.07405671810617252

## Compare Results to Normal RecSys

We can now compare the average dissimilarity of the diversified recommendations with that of the original ones to evaluate the diversity of each set.

In [20]:
percent_change = ((avg_dissim_greedy - avg_dissim_control)/avg_dissim_control)*100
round(percent_change, 2)

24.18

Thus, diversity (defined as average dissimilarity) increased by 24.2% when moving from a plain similarity-based recommendation system to one that implements the Bounded Greedy Selection algorithm.

In [None]:
### END OF NOTEBOOK ###