# Subreddit2Vec Model for Community Embeddings

Word2Vec is not a single algorithm. Rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven successful on a variety of downstream natural language processing tasks.

The Continuous Skip-gram Model predicts words within a certain range before and after the current word in the same sentence. The model predicts the context (or neighbors) of a word, given the word itself. The context of a word can be represented through a set of skip-gram pairs of (target_word, context_word) where context_word appears in the neighboring context of target_word. The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. [Source](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb#scrollTo=gK1gN1jwkMpU)
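
As a minimal sketch of how skip-gram pairs arise from a sentence, the plain-Python helper below collects (target_word, context_word) pairs inside a fixed window. The helper name skipgram_pairs, the example sentence, and the window size are illustrative assumptions and are not taken from the linked tutorial.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs for each token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "wide", "road", "shimmered"]  # hypothetical sentence
pairs = skipgram_pairs(sentence, window_size=1)
for pair in pairs:
    print(pair)
```

With a window of 1, each interior word contributes two pairs and each boundary word contributes one.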

In this case, we use a Subreddit2Vec model to generate subreddit embeddings rather than word embeddings. We apply the Word2Vec algorithm to interaction data by treating subreddits as "words" and the users who comment in them as "contexts": every instance of a user commenting in a subreddit becomes a word-context (subreddit-user) pair. Subreddits are then similar if and only if many similar users have the time and interest to comment in them both. [Source](https://www.cs.toronto.edu/~ashton/pubs/cultural-dims2020.pdf)

## Reading the Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('reddit_user_data_count.csv')

In [None]:
df.head()

,user,subreddit,count
0,------Username------,AskReddit,20
1,------Username------,Barca,9
2,------Username------,FIFA,4
3,------Username------,MMA,5
4,------Username------,RioGrandeValley,3


In [None]:
df.columns

Index(['user', 'subreddit', 'count'], dtype='object')

In [None]:
df['subreddit'].value_counts()

AskReddit             21021
funny                 10796
pics                  10608
gaming                 9353
memes                  9291
                      ...  
wagakkiband               1
SignsBeingBros            1
Muxiphobia                1
Memelanterncorps          1
GoldenMountainDogs        1
Name: subreddit, Length: 69490, dtype: int64

In [None]:
df['user'].value_counts()

CarpenterAcademic    851
sukhata              701
PrincessBananas85    586
munmoonpat           495
i_like_the_idea      445
                    ... 
garysanchezisagod      1
Loyal-Two              1
Hugh___Mungus          1
VexianJura             1
SPACsBot               1
Name: user, Length: 37845, dtype: int64

### Setup

In [None]:
import io
import re
import string
import tensorflow as tf
import tqdm

from tensorflow.keras import Model
from tensorflow.keras.layers import Dot, Embedding, Flatten
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
SEED = 42
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [None]:
word_tokens = df["subreddit"].tolist()

In [None]:
context_tokens = df["user"].tolist()

In [None]:
tokens = context_tokens + word_tokens

Create a vocabulary to save mappings from tokens to integer indices.

In [None]:
vocab, index = {}, 1
vocab['<pad>'] = 0  # add a padding token
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1

vocab_size = len(vocab)
print(vocab)



Create an inverse vocabulary to save mappings from integer indices to tokens.

In [None]:
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)
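
Because tokens lists every user before every subreddit, users receive the low indices and subreddits the indices after them. The toy sketch below illustrates this ordering and the round trip through the inverse mapping. The names build_vocab, toy_vocab, and the sample users and subreddits are hypothetical. It also suggests one caveat of a shared vocabulary: a user and a subreddit with the same name would map to the same index.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer index, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free index
    return vocab


users = ["alice", "bob", "alice"]                # hypothetical commenters
subreddits = ["AskReddit", "funny", "AskReddit"]  # hypothetical subreddits

# Users first, then subreddits, mirroring the order used for `tokens`.
toy_vocab = build_vocab(users + subreddits)
toy_inverse = {index: token for token, index in toy_vocab.items()}

print(toy_vocab)
print(toy_inverse[toy_vocab["funny"]])
```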



## Negative sampling for one skip-gram subreddit

In [None]:
# Create tuple for each comment in a subreddit - (subreddit, commenter/user)
df['pair'] = list(zip(df.subreddit, df.user))
pairing = df['pair'].tolist()
pairing[:20]

[('AskReddit', '------Username------'),
 ('Barca', '------Username------'),
 ('FIFA', '------Username------'),
 ('MMA', '------Username------'),
 ('RioGrandeValley', '------Username------'),
 ('Showerthoughts', '------Username------'),
 ('WTF', '------Username------'),
 ('bodybuilding', '------Username------'),
 ('cringepics', '------Username------'),
 ('funny', '------Username------'),
 ('malefashionadvice', '------Username------'),
 ('movies', '------Username------'),
 ('pics', '------Username------'),
 ('realmadrid', '------Username------'),
 ('short', '------Username------'),
 ('soccer', '------Username------'),
 ('streetwear', '------Username------'),
 ('thanosdidnothingwrong', '------Username------'),
 ('ACNHIslandInspo', '----Michel----'),
 ('ACNHTurnips', '----Michel----')]

In [None]:
for target, context in df['pair'][:20]:
    print(f"({vocab[target]}, {vocab[context]}): ({target}, {context})")

(37846, 1): (AskReddit, ------Username------)
(37847, 1): (Barca, ------Username------)
(37848, 1): (FIFA, ------Username------)
(37849, 1): (MMA, ------Username------)
(37850, 1): (RioGrandeValley, ------Username------)
(37851, 1): (Showerthoughts, ------Username------)
(37852, 1): (WTF, ------Username------)
(37853, 1): (bodybuilding, ------Username------)
(37854, 1): (cringepics, ------Username------)
(37855, 1): (funny, ------Username------)
(37856, 1): (malefashionadvice, ------Username------)
(37857, 1): (movies, ------Username------)
(37858, 1): (pics, ------Username------)
(37859, 1): (realmadrid, ------Username------)
(37860, 1): (short, ------Username------)
(37861, 1): (soccer, ------Username------)
(37862, 1): (streetwear, ------Username------)
(37863, 1): (thanosdidnothingwrong, ------Username------)
(37864, 2): (ACNHIslandInspo, ----Michel----)
(37865, 2): (ACNHTurnips, ----Michel----)


In the original Word2Vec tutorial, the skipgrams function returns all positive skip-gram pairs by sliding over a given window span. Here the positive pairs are simply the (subreddit, user) tuples built above. To produce additional pairs that serve as negative samples for training, you need to sample random tokens from the vocabulary. Use the tf.random.log_uniform_candidate_sampler function to sample num_ns negative samples for a given target subreddit. You can call the function on one pair's target subreddit and pass its context user as the true class.

In [None]:
# Get target and context words for one positive skip-gram.
target_word, context_word = vocab[df['pair'][0][0]], vocab[df['pair'][0][1]]
print(target_word, context_word)

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print(type(negative_sampling_candidates))
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

37846 1
<class 'tensorflow.python.framework.ops.EagerTensor'>
['AmavaArts', '-Steelseries-', 'Manvic', '1us1']
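
To get a feel for which indices the log-uniform sampler prefers, the sketch below evaluates the base distribution stated in the TensorFlow documentation, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1), using only the standard library. The helper name log_uniform_prob and the range of 1000 are assumptions for illustration.

```python
import math


def log_uniform_prob(class_id, range_max):
    """Base probability of drawing class_id from a log-uniform sampler."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


range_max = 1000  # hypothetical vocabulary size
probs = [log_uniform_prob(c, range_max) for c in range(range_max)]

print(f"total probability : {sum(probs):.6f}")
print(f"P(index 0)        : {probs[0]:.4f}")
print(f"P(index 999)      : {probs[-1]:.6f}")
```

Low indices are strongly favoured. This distribution is designed for vocabularies sorted by decreasing frequency, whereas the vocabulary here is indexed in order of first appearance, so the tokens that appear earliest in the data are likely to be drawn as negatives most often.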


## Construct one Training Example

For a given positive (target_subreddit, context_user) pair, you now also have num_ns negatively sampled contexts drawn at random from the vocabulary. Batch the 1 positive context_word and the num_ns negative contexts into one tensor. This produces one positive pair (labelled as 1) and num_ns negative samples (labelled as 0) for each target subreddit.

In [None]:
# Add a dimension so you can use concatenation (on the next step).
negative_sampling_candidates = tf.expand_dims(negative_sampling_candidates, 1)

# Concat positive context word with negative sampled words.
context = tf.concat([context_class, negative_sampling_candidates], 0)

# Label first context word as 1 (positive) followed by num_ns 0s (negative).
label = tf.constant([1] + [0]*num_ns, dtype="int64")

# Squeeze target to a scalar and context and label to shape (num_ns+1,).
target = tf.squeeze(target_word)
context = tf.squeeze(context)
label = tf.squeeze(label)

In [None]:
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

target_index    : 37846
target_word     : AskReddit
context_indices : [    1  1286    65 11554   284]
context_words   : ['------Username------', 'AmavaArts', '-Steelseries-', 'Manvic', '1us1']
label           : [1 0 0 0 0]


In [None]:
print("target  :", target)
print("context :", context)
print("label   :", label)
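
The same construction can be sketched without TensorFlow. Note the swapped-in technique: this version draws negatives uniformly from a list of user indices and explicitly excludes the positive, rather than using the log-uniform sampler over the whole vocabulary. The helper make_example and the toy indices are hypothetical.

```python
import random


def make_example(target, positive, candidates, num_ns, rng):
    """Return (target, contexts, labels) with 1 positive and num_ns negatives."""
    # Uniform sampling without replacement, explicitly excluding the positive.
    pool = [c for c in candidates if c != positive]
    negatives = rng.sample(pool, num_ns)
    contexts = [positive] + negatives
    labels = [1] + [0] * num_ns  # positive first, then the negatives
    return target, contexts, labels


toy_rng = random.Random(42)         # fixed seed so the sketch is repeatable
toy_user_ids = list(range(1, 11))   # hypothetical user indices 1..10
toy_target, toy_contexts, toy_labels = make_example(
    target=11, positive=3, candidates=toy_user_ids, num_ns=4, rng=toy_rng)

print("target  :", toy_target)
print("context :", toy_contexts)
print("label   :", toy_labels)
```

Restricting the candidate pool to user indices is a design choice of this sketch: it guarantees that every negative context is a user rather than a subreddit or the padding token.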

## Compile all steps into one function

In [None]:
# Generates training examples with negative sampling from a list of
# (subreddit, user) tuples, given the number of negative samples and the vocabulary.

def generate_training_data(tokens, tuples, num_ns, vocab_size, vocab, seed):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []

    # No sampling table is needed here: the positive pairs come directly
    # from `tuples` rather than from a sliding window over sentences.

    # Iterate over all tuples in dataset
    # Generate positive skip-gram pairs for a tuple
    target_words, context_words = [], []
    for i in tuples:
        target_words.append(vocab[i[0]])
        context_words.append(vocab[i[1]])
    
    positive_skip_grams = list(zip(target_words, context_words))
    

    # Iterate over each positive skip-gram pair to produce training examples
    # with positive context word and negative samples.
    for target_word, context_word in positive_skip_grams:
        context_class = tf.expand_dims(tf.constant([context_word], dtype="int64"), 1)
        
        negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

        # Build context and label vectors (for one target word)
        negative_sampling_candidates = tf.expand_dims(negative_sampling_candidates, 1)
        context = tf.concat([context_class, negative_sampling_candidates], 0)
        label = tf.constant([1] + [0]*num_ns, dtype="int64")

        # Append each element from the training example to global lists.
        targets.append(target_word)
        contexts.append(context)
        labels.append(label)
    
    return targets, contexts, labels

In [None]:
targets, contexts, labels = generate_training_data(tokens, df['pair'], 4, vocab_size, vocab, SEED)
print(len(targets), len(contexts), len(labels))

1738737 1738737 1738737


To perform efficient batching for the potentially large number of training examples, use the tf.data.Dataset API. After this step, you have a tf.data.Dataset object of ((target_subreddit, context_users), labels) elements to train your Word2Vec model.

In [None]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

Add cache() and prefetch() to improve performance.

In [None]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
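
Since batch() is called with drop_remainder=True, the final partial batch is discarded. A quick arithmetic sketch, using the example count printed above and the batch size of 1024, shows how many steps one epoch contains and how many examples are left out. The variable names are illustrative.

```python
num_examples = 1_738_737  # length reported by generate_training_data
batch_size = 1024         # same value as BATCH_SIZE

# divmod gives the number of full batches and the leftover examples.
full_batches, dropped = divmod(num_examples, batch_size)

print(f"steps per epoch  : {full_batches}")  # 1697
print(f"dropped examples : {dropped}")       # 1009
```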

## Model and Training

The Word2Vec model can be implemented as a classifier to distinguish between true context users from skip-grams and false context users obtained through negative sampling. You can perform a dot product between the embeddings of target subreddit and context users to obtain predictions for labels and compute loss against true labels in the dataset.

Use the Keras Subclassing API to define your Word2Vec model with the following layers:

- target_embedding: A tf.keras.layers.Embedding layer that looks up the embedding of a token when it appears as a target (here, a subreddit). The number of parameters in this layer is (vocab_size * embedding_dim).
- context_embedding: Another tf.keras.layers.Embedding layer that looks up the embedding of a token when it appears as a context (here, a user). The number of parameters in this layer is the same as in target_embedding, i.e. (vocab_size * embedding_dim).
- dots: A tf.keras.layers.Dot layer that computes the dot product of target and context embeddings from a training pair.
- flatten: A tf.keras.layers.Flatten layer to flatten the results of the dots layer into logits.

With the subclassed model, you can define the call() function that accepts (target, context) pairs, which are then passed into their corresponding embedding layers. Reshape the context_embedding to perform a dot product with target_embedding and return the flattened result.
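
The forward pass described above reduces to one dot product per context. The sketch below shows this for a single training example in plain Python, with randomly initialised vectors standing in for the two embedding lookups. The sizes and names (toy_dim, toy_num_ns, random_vector) are assumptions chosen to keep the example small.

```python
import random

toy_rng = random.Random(0)
toy_dim, toy_num_ns = 8, 4  # hypothetical embedding size and negative count


def random_vector(dim):
    """Stand-in for an embedding lookup: a random vector of length dim."""
    return [toy_rng.uniform(-1.0, 1.0) for _ in range(dim)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


target_vec = random_vector(toy_dim)  # embedding of the target subreddit
context_vecs = [random_vector(toy_dim) for _ in range(toy_num_ns + 1)]

# One logit per context: the positive user first, then the negatives.
toy_logits = [dot(context_vec, target_vec) for context_vec in context_vecs]
print(toy_logits)
```

Training pushes the first logit up and the remaining ones down, which moves the target subreddit's vector towards the vectors of users who commented in it.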

In [None]:
class Word2Vec(Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.target_embedding = Embedding(vocab_size,
                                          embedding_dim,
                                          input_length=1,
                                          name="w2v_embedding")
        self.context_embedding = Embedding(vocab_size,
                                           embedding_dim,
                                           input_length=num_ns+1)
        self.dots = Dot(axes=(3, 2))
        self.flatten = Flatten()

    def call(self, pair):
        target, context = pair
        word_emb = self.target_embedding(target)
        context_emb = self.context_embedding(context)
        dots = self.dots([context_emb, word_emb])
        return self.flatten(dots)

## Define loss function and compile model

In [None]:
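def custom_loss(y_true, y_pred):
    # Keras passes (y_true, y_pred); cast the int64 labels to the logits' dtype.
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_pred)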

It's time to build your model! Instantiate your Word2Vec class with an embedding dimension of 128 (you could experiment with different values). Compile the model with the tf.keras.optimizers.Adam optimizer.

In [None]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
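
To make the two loss options concrete, the sketch below computes by hand a softmax cross-entropy over the num_ns + 1 logits (the quantity a categorical cross-entropy on logits measures) and the per-element sigmoid cross-entropy used by custom_loss. The function names and the example logits are hypothetical, and the sigmoid form follows the numerically stable expression max(x, 0) - x * z + log(1 + exp(-|x|)).

```python
import math


def softmax_cross_entropy(labels, logits):
    """Cross-entropy between one-hot labels and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))


def sigmoid_cross_entropy(labels, logits):
    """Per-element binary cross-entropy on logits, in its stable form."""
    return [max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            for y, z in zip(labels, logits)]


toy_labels = [1, 0, 0, 0, 0]
toy_logits = [2.0, -1.0, 0.5, -0.3, 0.0]  # hypothetical model outputs

print("softmax CE :", softmax_cross_entropy(toy_labels, toy_logits))
print("sigmoid CE :", sigmoid_cross_entropy(toy_labels, toy_logits))
```

The softmax version treats the five contexts as competing classes, while the sigmoid version scores each (subreddit, user) pair as an independent binary decision.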

In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Train the model on the dataset prepared above for some number of epochs.

In [None]:
word2vec.fit(dataset, epochs=10, callbacks=[tensorboard_callback])

Epoch 1/10
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f3514517e10>

## Embedding lookup and analysis

In [None]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]

Download the vectors.tsv and metadata.tsv to analyze the obtained embeddings in the [Embedding Projector](https://projector.tensorflow.org/).

In [None]:
# Context managers close both files even if a write fails.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
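
Besides the Embedding Projector, nearest neighbours can be inspected directly with cosine similarity. The sketch below uses three made-up 2-dimensional vectors rather than the trained weights. The names sub_a, sub_b, sub_c, cosine_similarity, and most_similar are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name, embeddings, top_k=1):
    """Return the top_k (name, similarity) pairs closest to `name`."""
    query = embeddings[name]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != name]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up vectors for illustration only; not output of the trained model.
toy_embeddings = {
    "sub_a": [1.0, 0.1],
    "sub_b": [0.9, 0.2],
    "sub_c": [-0.2, 1.0],
}

print(most_similar("sub_a", toy_embeddings, top_k=2))
```

To apply the same idea to the trained model, one would build the dictionary from rows of weights whose indices belong to subreddits, since the shared vocabulary also contains users.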