# Recommendation System - Community Embedding

In this system, we use the vectors produced by the Community Embedding model (Word2Vec skip-gram) to compute cosine distances and recommend the most similar subreddits. To reiterate, the community embeddings were generated as described below:

"Our community embedding is learned solely from interaction data—high similarity between a pair of communities. It requires **not a similarity in language but a similarity in the users who comment in them**. To generate our embedding, we applied the Word2Vec algorithm to interaction data by treating communities as “words” and commenters as “contexts”—every instance of a user commenting in a community becomes a word-context pair. **Communities are then similar if and only if many similar users have the time and interest to comment in them both**." - [Source](https://www.cs.toronto.edu/~ashton/pubs/cultural-dims2020.pdf)

In this case, we **DO NOT** use PCA to perform dimensionality reduction. The full 128-dimensional vectors are used to compute the cosine distance. The results are much better than those obtained with PCA: the information stays intact, so the system is able to give the best recommendations! :)
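
For reference, here is a minimal sketch of the cosine distance used throughout this notebook, written with plain NumPy. The helper name `cosine_distance` and the toy vectors are assumptions for illustration only; the notebook itself calls `scipy.spatial.distance.cosine`, which follows the same definition.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 2-dimensional vectors (the real embeddings have 128 dimensions)
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # orthogonal to a
c = np.array([2.0, 0.0])   # same direction as a, different length

print(cosine_distance(a, b))
print(cosine_distance(a, c))
```

Because the measure depends only on the angle between vectors, `a` and `c` are at distance 0 despite their different lengths. A smaller distance means two communities are more similar.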

## Exploring the Data

In [None]:
import pandas as pd
import numpy as np
from scipy.spatial import distance

In [None]:
# Reading the dataframe
df = pd.read_csv('../datasets/vectors.tsv', sep='\t', header=None)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107253 entries, 0 to 107252
Columns: 128 entries, 0 to 127
dtypes: float64(128)
memory usage: 104.7 MB


In [None]:
df.head() # 128 dimensions

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
0,0.033573,-0.00306,0.028717,-0.019389,-0.01554,0.025682,-0.04433,-0.046595,0.041823,-0.014383,...,-0.013282,-0.013637,-0.001407,0.037057,0.010626,-0.008644,0.045264,-0.003153,0.020575,-0.005486
1,0.044829,-0.01299,0.007363,-0.047944,0.034858,-0.017518,-0.002088,0.023707,-0.005278,-0.025445,...,0.005026,0.026392,-0.028762,-0.010561,-0.047328,0.031681,0.005442,0.046587,0.007106,-0.039871
2,-0.018356,0.031759,-0.040295,0.005679,0.026772,-0.03112,0.00129,0.013565,0.048065,-0.042526,...,0.045085,0.008656,0.045598,-0.01429,0.046919,0.028597,-0.021231,-0.026456,0.010762,-0.026287
3,0.024258,0.047903,-0.024987,-0.042511,-0.026175,0.010981,0.045995,0.024367,0.005051,-0.043828,...,-0.03832,-0.021706,-0.036309,0.040087,0.015059,-0.002017,0.029141,0.031405,-0.04953,-0.0117
4,0.006294,0.022231,0.015397,0.046233,0.042291,-0.029548,0.02926,-0.014885,-0.006767,-0.030875,...,-0.032459,0.032619,0.000161,0.010016,0.019834,-0.048005,0.046414,-0.047809,-0.028996,0.044578


## New Dataframe with Labels and Corresponding 128-dim Vectors

In [None]:
df_labels = pd.read_csv('../datasets/metadata.tsv', sep='\t', names=['Labels'])
df_labels.head()

Unnamed: 0,Labels
0,------Username------
1,----Michel----
2,----The_Truth-----
3,----meh----
4,----petrichor----


In [None]:
# Collect each row's 128 values into a single list-valued 'vector' column
df['vector'] = df[:].values.tolist()

In [None]:
dfnew = pd.concat([df_labels, df], axis = 1)

In [None]:
dfnew.head()

Unnamed: 0,Labels,0,1,2,3,4,5,6,7,8,...,119,120,121,122,123,124,125,126,127,vector
0,------Username------,0.033573,-0.00306,0.028717,-0.019389,-0.01554,0.025682,-0.04433,-0.046595,0.041823,...,-0.013637,-0.001407,0.037057,0.010626,-0.008644,0.045264,-0.003153,0.020575,-0.005486,"[0.033573065, -0.003059756, 0.0287168729999999..."
1,----Michel----,0.044829,-0.01299,0.007363,-0.047944,0.034858,-0.017518,-0.002088,0.023707,-0.005278,...,0.026392,-0.028762,-0.010561,-0.047328,0.031681,0.005442,0.046587,0.007106,-0.039871,"[0.04482906, -0.0129903555, 0.0073633566, -0.0..."
2,----The_Truth-----,-0.018356,0.031759,-0.040295,0.005679,0.026772,-0.03112,0.00129,0.013565,0.048065,...,0.008656,0.045598,-0.01429,0.046919,0.028597,-0.021231,-0.026456,0.010762,-0.026287,"[-0.018355988, 0.03175925, -0.04029547, 0.0056..."
3,----meh----,0.024258,0.047903,-0.024987,-0.042511,-0.026175,0.010981,0.045995,0.024367,0.005051,...,-0.021706,-0.036309,0.040087,0.015059,-0.002017,0.029141,0.031405,-0.04953,-0.0117,"[0.024258208, 0.047902945, -0.02498666, -0.042..."
4,----petrichor----,0.006294,0.022231,0.015397,0.046233,0.042291,-0.029548,0.02926,-0.014885,-0.006767,...,0.032619,0.000161,0.010016,0.019834,-0.048005,0.046414,-0.047809,-0.028996,0.044578,"[0.006293676999999999, 0.02223121, 0.015396725..."


## Subreddit Recommender 

First, we look up the vector for the subreddit passed to the function. We then compute the cosine distance between this vector and every vector in the dataset, appending each result to a *distances* list. Next, we pair each distance with its label so that we can see which subreddit corresponds to which cosine distance. Finally, we sort by distance, skip the first entry (the query subreddit itself, at distance 0), and return a dataframe containing the 10 most similar subreddits along with their cosine distances.

In [None]:
# Defining the subreddit recommender function
def subreddit_recommender(sub_name):
    num_subs_to_recommend = 10
    distances = []
    # Look up the 128-dim vector for the requested subreddit
    sub_name_vector = dfnew['vector'][dfnew['Labels'] == sub_name].to_numpy()[0]

    # Cosine distance between the query vector and every vector in the dataset
    for vector in dfnew['vector'].tolist():
        distances.append(distance.cosine(sub_name_vector, vector))

    pairs = list(zip(dfnew['Labels'], distances))
    # Sort by ascending distance; skip index 0, which is the query subreddit itself
    closest_subs = sorted(pairs, key=lambda item: item[1])[1:num_subs_to_recommend+1]
    recommend_frame = []
    for val in closest_subs:
        recommend_frame.append({'Subreddit':val[0],'Distance':val[1]})

    df_result = pd.DataFrame(recommend_frame)
    return df_result
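
The loop above calls `distance.cosine` once per subreddit. As a possible alternative, the same distances can be computed in a single matrix operation. The sketch below is an assumption-laden illustration, not the notebook's method: the function name `recommend_vectorized`, its parameters, and the toy data are all hypothetical, and it assumes no embedding row is the zero vector.

```python
import numpy as np
import pandas as pd

def recommend_vectorized(labels, matrix, sub_name, top_n=10):
    """Return the top_n nearest labels to sub_name by cosine distance."""
    labels = np.asarray(labels)
    matrix = np.asarray(matrix, dtype=float)

    matches = np.flatnonzero(labels == sub_name)
    if matches.size == 0:
        raise KeyError(f"Unknown subreddit: {sub_name}")
    idx = matches[0]

    # Normalise each row so that a dot product equals cosine similarity
    # (assumes no row is all zeros).
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[idx]

    # Drop the query row explicitly rather than relying on it sorting first.
    order = np.argsort(distances)
    order = order[order != idx][:top_n]
    return pd.DataFrame({"Subreddit": labels[order], "Distance": distances[order]})

# Toy data standing in for the real labels and 128-dim vectors
toy_labels = ["a", "b", "c", "d"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
result = recommend_vectorized(toy_labels, toy_vectors, "a", top_n=2)
print(result)
```

With the frames built earlier, a call might look like `recommend_vectorized(dfnew['Labels'], df.iloc[:, :128], "gaming")`, assuming the first 128 columns of `df` hold the embedding values. Unlike the loop version, this sketch raises a `KeyError` for an unknown subreddit name instead of failing with an `IndexError`.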

### Some Examples

In [None]:
subreddit_recommender("CryptoCurrencies")

Unnamed: 0,Subreddit,Distance
0,ethfinance,0.302931
1,eos,0.325076
2,cardano,0.327063
3,binance,0.353524
4,CryptoCurrency,0.357308
5,SPCE,0.37602
6,TREZOR,0.378562
7,ethtrader,0.378717
8,Tronix,0.383169
9,LINKTrader,0.384919


In [None]:
subreddit_recommender("ApplyingToCollege")

Unnamed: 0,Subreddit,Distance
0,chanceme,0.282058
1,Sat,0.31579
2,collegeresults,0.33027
3,CollegeEssayReview,0.406791
4,ACT,0.42767
5,A2Relationships,0.454906
6,dartmouth,0.471442
7,APStudents,0.472742
8,MITAdmissions,0.481454
9,APChem,0.488204


In [None]:
subreddit_recommender("gaming")

Unnamed: 0,Subreddit,Distance
0,StratagemFC,0.444791
1,RedDeadGlitches,0.476793
2,playrustservers,0.506419
3,RebelsHate,0.51076
4,StarWarsAndor,0.512291
5,poole,0.518877
6,FinDuGame,0.519495
7,TestMySite,0.523709
8,Witcher3WildHunt,0.524311
9,BuildAnApp,0.524365


In [None]:
subreddit_recommender("ProgrammingLanguages")

Unnamed: 0,Subreddit,Distance
0,LinuxOnThinkpad,0.371785
1,EmojiSquad,0.380775
2,brainfuck,0.398505
3,DeusVult,0.408758
4,phonebatterylevelbot,0.409959
5,CreativeDestruction1,0.422297
6,MinecraftBedrockers,0.422919
7,BanHotWheels,0.424109
8,gibson,0.424873
9,DragonQuestBuilders2,0.428571


These results are much more relevant and accurate than those obtained with PCA, and they are almost identical to those from the TensorFlow Projector. This system, without PCA, should be the one used going forward to implement diversity! :)