# VAE for Topic Modeling

## Key Concepts:
1. `Encoder/Decoder`: The `VAE` architecture has an encoder that maps documents to a latent space and a decoder that attempts to reconstruct the input from the latent space.
2. `Latent Space`: The latent variables in the `VAE` can represent topics. By clustering or examining this space, topics can be discovered.
3. `Loss Function`: Combines reconstruction loss (how well the input is reconstructed) and KL divergence (encouraging the latent space to follow a normal distribution).  

In [56]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import math

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

## Sample dataset: List of documents

tf.clip_by_global_norm

In [58]:
data_words_bigrams = pd.read_pickle('scied_words_bigrams_V5.pkl')
documents_full = data_words_bigrams #taking first group of documents in our dataset

In [59]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora.dictionary import Dictionary
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
 
# Load the data
#data_words_bigrams = pd.read_pickle('scied_words_bigrams_V5.pkl')
 
# Take the first 10 documents
documents = data_words_bigrams  # Represents all words in the first 10 documents
 
# Gensim filtering
no_below = 15  # Keep tokens which are contained in at least 15 documents
no_above = 0.5  # Remove tokens that are contained in more than 50% of the documents
id2word = gensim.corpora.Dictionary(documents)
id2word.filter_extremes(no_below=no_below, no_above=no_above, keep_n=100000)
 
# Convert documents to Bag-of-Words representation
bow_corpus = [id2word.doc2bow(doc) for doc in documents]
 
# Prepare documents for CountVectorizer
documents = [" ".join(doc) for doc in documents]
 
# Generate Bag-of-Words matrix
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(documents).toarray()
 
# Normalize the BoW matrix using MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_normalized = scaler.fit_transform(X)
 
# Check the shape of the normalized data
print(X_normalized.shape)

(5577, 1500)


In [61]:
X_normalized

array([[0.05555556, 0.45945946, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.01587302, 0.02702703, 0.00869565, ..., 0.        , 0.        ,
        0.        ],
       [0.01587302, 0.02702703, 0.        , ..., 0.        , 0.00943396,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.01639344]])

In [40]:
documents_full = [" ".join(doc) for doc in documents_full]
documents = documents_full

In [41]:
import pandas as pd
from collections import Counter

def count_word_frequencies_to_dataframe(documents):
    """
    Count the frequency of each word across all documents and return a DataFrame.

    :param documents: List of strings, where each string is a document.
    :return: A Pandas DataFrame with columns 'word' and 'count', sorted by 'count' in descending order.
    """
    # Flatten all documents into one list of words
    all_words = []
    for doc in documents:
        all_words.extend(doc.split())  # Split each document into words and add to the list

    # Use Counter to count word frequencies
    word_frequencies = Counter(all_words)

    # Convert the Counter to a DataFrame
    df = pd.DataFrame(word_frequencies.items(), columns=['word', 'count'])

    # Sort the DataFrame by 'count' in descending order
    df = df.sort_values(by='count', ascending=False).reset_index(drop=True)

    return df

In [62]:
#count_word_frequencies_to_dataframe(documents).head(20)

In [43]:
common_words = {"science", "student", "teacher", "study", "teach", "learn", "education", "school", "group", "course"}

def remove_common_words(documents, common_words):
    """
    Remove specified common words from a list of documents.

    :param documents: List of strings, where each string is a document.
    :param common_words: Set of words to remove.
    :return: List of documents with the common words removed.
    """
    cleaned_documents = []
    for doc in documents:
        # Split the document into words, remove common words, and join it back
        filtered_words = [word for word in doc.split() if word.lower() not in common_words]
        cleaned_documents.append(' '.join(filtered_words))
    return cleaned_documents

In [44]:
#documents = remove_common_words(documents, common_words)

## Preprocessing text data using CountVectorizer
* We use `CountVectorizer` from `scikit-learn` to transform the documents into a `Bag-of-Words` representation.  
* This means that each document is converted into a vector where each element represents the frequency of a word in the document.  
* `max_features=1000` limits the number of words (or features) to 1,000, though this is more relevant for larger corpora.  

In [45]:
#vectorizer = CountVectorizer(max_features=1500)
#X = vectorizer.fit_transform(documents).toarray()

In [46]:
X.shape

(5577, 1500)

In [66]:
#X
X = X_normalized

## Define Variational Autoencoder (VAE) architecture

In [67]:
#Architecture
#1. `Encoder/Decoder`: The `VAE` architecture has an encoder that maps documents to a latent space and a decoder that attempts to reconstruct the input from the latent space.
#2. `Latent Space`: The latent variables in the `VAE` can represent topics. By clustering or examining this space, topics can be discovered.
#3. `Loss Function`: Combines reconstruction loss (how well the input is reconstructed) and KL divergence (encouraging the latent space to follow a normal distribution).  

#`encode()`: The encoder splits the output into mean and logvar (log variance).
#`reparameterize()`: Instead of directly using the mean and log variance, this function samples from the Gaussian distribution defined by the mean and variance. The reason for this "reparameterization trick" is to allow backpropagation through the stochastic sampling process.
#`sample()`: Once trained, can call this functiion to use the decoder to generate new samples (e.g., new document representations) by sampling from the latent space.

class VAE(tf.keras.Model):
    def __init__(self, original_dim, latent_dim):
        super(VAE, self).__init__()
        
        # Encoder with careful initialization
        self.encoder = tf.keras.Sequential([
            layers.InputLayer(input_shape=(original_dim,)),
            layers.Dense(512, 
                activation='relu', 
                kernel_initializer=tf.keras.initializers.HeNormal(),
                bias_initializer='zeros'),
            layers.Dense(256, 
                activation='relu', 
                kernel_initializer=tf.keras.initializers.HeNormal(),
                bias_initializer='zeros'),
            layers.Dense(latent_dim * 2,  
                kernel_initializer=tf.keras.initializers.GlorotNormal(),
                bias_initializer='zeros')
        ])
        
        # Decoder with careful initialization
        self.decoder = tf.keras.Sequential([
            layers.InputLayer(input_shape=(latent_dim,)),
            layers.Dense(256, 
                activation='relu', 
                kernel_initializer=tf.keras.initializers.HeNormal(),
                bias_initializer='zeros'),
            layers.Dense(512, 
                activation='relu', 
                kernel_initializer=tf.keras.initializers.HeNormal(),
                bias_initializer='zeros'),
            layers.Dense(original_dim, 
                activation='sigmoid',  # Ensure output is between 0 and 1
                kernel_initializer=tf.keras.initializers.GlorotNormal(),
                bias_initializer='zeros')
        ])
        
    def encode(self, x):
        # Add numerical stability checks
        x = tf.cast(x, tf.float32)
        
        # Split output into mean and log variance
        z = self.encoder(x)
        mean, logvar = tf.split(z, num_or_size_splits=2, axis=1)
        
        # Clip log variance to prevent extreme values
        logvar = tf.clip_by_value(logvar, -10, 10)
        
        return mean, logvar
    
    def reparameterize(self, mean, logvar):
        # Stable reparameterization trick
        std = tf.exp(0.5 * logvar)
        eps = tf.random.normal(tf.shape(mean))
        return mean + std * eps
    
    def decode(self, z, apply_sigmoid=True):
        # Decode and optionally apply sigmoid
        logits = self.decoder(z)
        if apply_sigmoid:
            return tf.sigmoid(logits)
        return logits
    
    def sample(self, eps=None):
        if eps is None:
            eps = tf.random.normal(shape=(100, self.decoder.input_shape[1]))
        return self.decode(eps)

## Compute loss


In [68]:
#The loss function in a VAE combines two parts:
#1. `Reconstruction Loss`: Measures how well the model can reconstruct the input from the latent space.
#2. `KL Divergence`: Encourages the distribution of latent variables to be close to a normal distribution (helps with regularization).

#`Reconstruction loss`: This is the binary cross-entropy between the original input x and its reconstruction x_logit. It is summed across all dimensions (words) of the document.  

#`KL divergence`: This term encourages the latent variable distribution to stay close to a unit normal distribution (with mean 0 and variance 1). This helps regularize the model and ensure that the latent space is structured.  

#The final loss is the sum of the reconstruction loss and KL divergence, and we minimize this during training.

def kl_divergence_loss(mean, logvar):
    # Standard KL divergence loss between the learned distribution and standard normal
    kl_loss = -0.5 * tf.reduce_mean(1 + logvar - tf.square(mean) - tf.exp(logvar))
    return kl_loss

def compute_loss(model, x, beta=1.0):
    x = tf.clip_by_value(tf.cast(x, tf.float32), 1e-7, 1 - 1e-7)
    mean, logvar = model.encode(x)
    z = model.reparameterize(mean, logvar)
    x_recon = model.decode(z)
    
    # Reconstruction loss
    reconstruction_loss = tf.reduce_mean(tf.reduce_sum(
        x_recon - x * tf.math.log(x_recon + 1e-7), axis=-1
    ))
    
    # KL divergence
    kl_loss = -0.5 * tf.reduce_mean(1 + logvar - tf.square(mean) - tf.exp(logvar))
    
    # Total loss with adjustable beta
    total_loss = reconstruction_loss + beta * kl_loss
    return total_loss

## Set hyperparameters

In [69]:
# Training loop modifications
latent_dim = 15 #Latent dimensions to use
original_dim = X.shape[1] #Original shape of the input data
vae = VAE(original_dim, latent_dim) #Init VAE model

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  #Setting the learning rate
epochs = 100  #Number of training epochs
batch_size = 32  #Batch size



## Training the VAE model


In [None]:
#* `Optimizer`: We use the Adam optimizer to minimize the loss function.

#* `Training Loop`: We loop over the data in batches (batch size 2 in this case) for a number of epochs. Inside the loop:

#A mini-batch of documents (x_batch) is fed into the VAE.
#The loss is computed using the compute_loss function.
#The gradients are calculated using TensorFlow’s GradientTape, and the model is updated accordingly.
#After each epoch, the current loss is printed out to track progress.

# training loop
for epoch in range(epochs):
    epoch_loss = 0.0
    for i in range(0, len(X), 32):
        x_batch = X[i:i+32]
        
        with tf.GradientTape() as tape:
            loss = compute_loss(vae, x_batch)
        
        gradients = tape.gradient(loss, vae.trainable_variables)
        gradients = [tf.clip_by_norm(g, 1.0) for g in gradients]
        optimizer.apply_gradients(zip(gradients, vae.trainable_variables))
        
        epoch_loss += loss.numpy()
    
    print(f"Epoch {epoch}, Loss: {epoch_loss / (len(X) // 32)}")

Epoch 0, Loss: 773.776985606928
Epoch 1, Loss: 769.5078837076823
Epoch 2, Loss: 769.5078237248563
Epoch 3, Loss: 769.5078132015535
Epoch 4, Loss: 769.5077907518408
Epoch 5, Loss: 769.5077714591191
Epoch 6, Loss: 769.5077725114494
Epoch 7, Loss: 769.5077507632902
Epoch 8, Loss: 769.5077448000853
Epoch 9, Loss: 769.5077412923177
Epoch 10, Loss: 769.5077367322199
Epoch 11, Loss: 769.5077300674614


## Interpreting the Latent Space
* The latent space can be used to represent topics  
* Can cluster the latent vectors to interpret the topics
* z represents the latent space, and clustering these latent representations will give groups of documents that share similar latent features, which can be interpreted as topics.

In [27]:
mean, logvar = vae.encode(X)
z = vae.reparameterize(mean, logvar)

# Steps to Visualize Documents by Topic:
1. Obtain Latent Vectors: After training the VAE, extract the latent vectors (z) for each document.
2. Assign Topics: Assign topics to documents by clustering the latent vectors (using KMeans or another clustering algorithm).
3. Dimensionality Reduction: Use t-SNE or PCA to reduce the dimensionality of the latent vectors to 2D or 3D for visualization.
4. Plot: Use Matplotlib to plot the projections and color them by their assigned topics.

The visualization shows a scatter plot of documents mapped to a 2D latent space, where the colors represent different topic clusters as identified by the KMeans clustering algorithm. However, the plot alone doesn't tell us what the actual topics are in terms of words or themes.

To determine the topics, you need to inspect the original document vectors and see which words (or features) are most associated with each topic. There are a couple of ways to extract and interpret the topics:

1. Get the Top Words per Topic:
After clustering the documents, you can examine which words or features are most influential in each topic cluster by looking at the original word distributions for the documents assigned to each topic.

Approach:
Assign the topics to each document and, for each topic, sum the word counts across the documents assigned to that topic.

In [30]:
# `z` is the latent space representation of the documents

# Assign topics using KMeans clustering
num_topics = 23  # You can adjust this depending on how many topics you expect
kmeans = KMeans(n_clusters=num_topics, random_state=42)
topics = kmeans.fit_predict(z)

ValueError: n_samples=9 should be >= n_clusters=23.

In [None]:
# `topics` are the cluster assignments for each document
# and `X` is the term-document matrix from CountVectorizer

# Get feature names (words) from the CountVectorizer
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for easier analysis
df = pd.DataFrame(X, columns=feature_names)

# Add the topic assignments to the DataFrame
df['topic'] = topics

# Get top words for each topic by summing word counts for each topic
for topic in range(num_topics):
    print(f"Topic {topic}:")
    # Get all documents assigned to this topic
    topic_docs = df[df['topic'] == topic].drop(columns='topic')
    
    # Sum the word counts for this topic
    topic_word_sums = topic_docs.sum(axis=0).sort_values(ascending=False)
    
    # Display the top 10 words for the topic
    print(topic_word_sums.head(10))
    print()


In [None]:
df['topic'].value_counts()

In [None]:
# If `latent_dim` is 2 (for easier visualization in 2D), we skip t-SNE.
# If latent_dim > 2, you can uncomment t-SNE for dimensionality reduction.

import matplotlib.pyplot as plt

if z.shape[1] == 2:
    plt.scatter(z[:, 0], z[:, 1], c=topics, cmap='viridis', alpha=0.5)
    plt.colorbar(label='Topic')
    plt.title('Latent Space Clustering')
    plt.show()
else:
    from sklearn.manifold import TSNE
    z_2d = TSNE(n_components=2, random_state=42).fit_transform(z)
    plt.scatter(z_2d[:, 0], z_2d[:, 1], c=topics, cmap='viridis', alpha=0.5)
    plt.colorbar(label='Topic')
    plt.title('Latent Space Clustering (t-SNE)')
    plt.show()