An autoencoder is a neural network used to compress and then reconstruct input data. Here, we're using it on text data converted into a bag-of-words format.

In this project, I used a bag-of-words model to vectorize text. Then I trained an autoencoder: the encoder compresses the high-dimensional word vectors into a small latent space (`encoding_dim` in the code), and the decoder reconstructs the input. The idea is to learn a compressed representation of each sentence. I used Binary Cross Entropy as the loss since the inputs are binary, and trained the model with the Adam optimizer.

In [38]:
# Import essential libraries
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [39]:
# GPU Check
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [89]:
# Sample sentences
# sentences = [
#     "i loved this movie",
#     "the acting was terrible",
#     "great performances by the cast",
#     "i fell asleep during the film",
#     "this film is a masterpiece",
#     "the special effects were amazing",
#     "worst movie i have seen",
#     "the soundtrack was beautiful"
# ]

sentences = texts

In [90]:
# Create BOW representation for sample data
# CountVectorizer(binary=True) creates a binary bag-of-words (1 = word is present, 0 = not).
# X is a 2D numpy array of shape (len(sentences), vocab_size), where each row is a sentence.
vectorizer = CountVectorizer(binary = True)
X = vectorizer.fit_transform(sentences).toarray()
vocab_size = len(vectorizer.get_feature_names_out())
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 3131
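
To make the binary bag-of-words step concrete, here is a minimal pure-Python sketch of roughly what `CountVectorizer(binary=True)` computes with its default settings. The helper name `binary_bow` and the two toy sentences are illustrative assumptions, not part of the notebook. The regex mirrors scikit-learn's default token pattern, which keeps only tokens of two or more word characters, so single-letter words such as "i" and "a" never enter the vocabulary.

```python
import re

# Default CountVectorizer token pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def binary_bow(sentences):
    """Return (sorted vocabulary, binary matrix) for a list of sentences."""
    tokenized = [set(TOKEN_PATTERN.findall(s.lower())) for s in sentences]
    vocab = sorted(set().union(*tokenized))
    # One row per sentence: 1 if the vocabulary word is present, else 0.
    matrix = [[1 if word in tokens else 0 for word in vocab] for tokens in tokenized]
    return vocab, matrix

toy_sentences = ["i loved this movie", "this movie was a masterpiece"]
vocab, X_toy = binary_bow(toy_sentences)
print(vocab)   # ['loved', 'masterpiece', 'movie', 'this', 'was']
for row in X_toy:
    print(row)
```

This also explains why words like "i" never show up among the reconstructed words: they are dropped before the vocabulary is built.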


In [91]:
# Convert to Tensors
# Converts the X matrix to a float PyTorch tensor and moves it to the selected device (GPU if available).
X_tensor = torch.FloatTensor(X).to(device)

In [92]:
# Encoder: Linear layer compresses the input (vocab_size-dim) into an encoding_dim-dimensional latent space.
# Decoder: Linear layer expands the latent vector back into the original input shape.
# ReLU for encoding, sigmoid for decoding since the targets are binary.
# Define Autoencoder Architecture
class TextAutoencoder(nn.Module):
  def __init__(self, input_dim, encoding_dim):
    super(TextAutoencoder, self).__init__()
    self.encoder = nn.Linear(input_dim, encoding_dim)
    self.decoder = nn.Linear(encoding_dim, input_dim)

  def forward(self, x):
    encoded = torch.relu(self.encoder(x))
    decoded = torch.sigmoid(self.decoder(encoded))
    return decoded, encoded
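
As a rough sanity check on model size, the sketch below counts the trainable parameters of a two-layer linear autoencoder like the one above. The function name `linear_autoencoder_param_count` is hypothetical; the numbers 3131 and 20 are the vocabulary size and `encoding_dim` that appear in this notebook.

```python
def linear_autoencoder_param_count(input_dim, encoding_dim):
    """Weights plus biases of one encoder and one decoder nn.Linear layer."""
    encoder = input_dim * encoding_dim + encoding_dim   # weight matrix + bias
    decoder = encoding_dim * input_dim + input_dim      # weight matrix + bias
    return encoder + decoder

total = linear_autoencoder_param_count(3131, 20)
print(total)  # 128391
```

On a live model the same figure should come out of `sum(p.numel() for p in model.parameters())`.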



In [137]:
# Initialize the model and move to device
input_dim = vocab_size
encoding_dim = 20 # Compressed representation size
model = TextAutoencoder(input_dim, encoding_dim).to(device)

In [138]:
# Loss Criterion
# BCELoss: Binary Cross Entropy, since the targets are binary vectors and the sigmoid outputs lie in (0, 1).
# Adam: A fast and adaptive optimizer.

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr = 0.01)
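
To show what this loss measures, here is a small hand-rolled version of mean binary cross entropy in plain Python. It is a sketch for intuition only, not the PyTorch implementation; the function name and the toy vectors are assumptions made for illustration.

```python
import math

def binary_cross_entropy(predictions, targets):
    """Mean BCE over all elements (the default 'mean' reduction of nn.BCELoss)."""
    total = 0.0
    for p, t in zip(predictions, targets):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(predictions)

targets = [1, 0, 0, 1]
uninformed = [0.5, 0.5, 0.5, 0.5]   # every output at 0.5
confident = [0.9, 0.1, 0.1, 0.9]    # outputs close to the targets

loss_uninformed = binary_cross_entropy(uninformed, targets)
loss_confident = binary_cross_entropy(confident, targets)
print(f"{loss_uninformed:.4f}")  # ln(2), about 0.6931
print(f"{loss_confident:.4f}")   # -ln(0.9), about 0.1054
```

An untrained network with sigmoid outputs near 0.5 sits at roughly ln(2) ≈ 0.693, which is close to the first-epoch loss reported in the training log.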

Input is passed through the model.

Model learns to reconstruct the same input (X_tensor) from compressed encoded representation.

Prints loss every 20 epochs to monitor learning.

In [139]:
# Training loop with GPU Acceleration
num_epochs = 1000
for epoch in range(num_epochs):
  # Forward pass
  reconstructed, encoded = model(X_tensor)
  loss = criterion(reconstructed, X_tensor)

  # Backward pass and optimize
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  # Print Progress
  if epoch%20 == 0:
    print(f"Epoch[{epoch+1}/{num_epochs}], Loss: {loss.item(): .4f}")

Epoch[1/1000], Loss:  0.6958
Epoch[21/1000], Loss:  0.0265
Epoch[41/1000], Loss:  0.0251
Epoch[61/1000], Loss:  0.0233
Epoch[81/1000], Loss:  0.0218
Epoch[101/1000], Loss:  0.0210
Epoch[121/1000], Loss:  0.0206
Epoch[141/1000], Loss:  0.0204
Epoch[161/1000], Loss:  0.0202
Epoch[181/1000], Loss:  0.0201
Epoch[201/1000], Loss:  0.0200
Epoch[221/1000], Loss:  0.0198
Epoch[241/1000], Loss:  0.0197
Epoch[261/1000], Loss:  0.0195
Epoch[281/1000], Loss:  0.0193
Epoch[301/1000], Loss:  0.0191
Epoch[321/1000], Loss:  0.0189
Epoch[341/1000], Loss:  0.0187
Epoch[361/1000], Loss:  0.0184
Epoch[381/1000], Loss:  0.0181
Epoch[401/1000], Loss:  0.0179
Epoch[421/1000], Loss:  0.0176
Epoch[441/1000], Loss:  0.0173
Epoch[461/1000], Loss:  0.0170
Epoch[481/1000], Loss:  0.0167
Epoch[501/1000], Loss:  0.0165
Epoch[521/1000], Loss:  0.0162
Epoch[541/1000], Loss:  0.0159
Epoch[561/1000], Loss:  0.0156
Epoch[581/1000], Loss:  0.0153
Epoch[601/1000], Loss:  0.0150
Epoch[621/1000], Loss:  0.0147
Epoch[641/1000

In [None]:
# Original Vs Reconstructed:

# Original: i rented i am curiousyellow from my video store because of all the controversy that
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for

# Original: i am curious yellow is a risible and pretentious steaming pile it doesnt matter what
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for

# Original: if only to avoid making this type of film in the future this film is
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for

# Original: this film was probably inspired by godards masculin fminin and i urge you to see
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for, one

# Original: oh brotherafter hearing about this ridiculous film for umpteen years all i can think of
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for, one

# Original: i would put this at the top of my list of films in the category
# Reconstructed words: the, this, of, to, is, movie, and, it, was, in, film, that

# Original: whoever wrote the screenplay for this movie obviously never consulted any books about lucille ball
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for, one, with

# Original: when i first saw a glimpse of this movie i quickly noticed the actress who
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that

# Original: who are these they the actors the filmmakers certainly couldnt be the audience this is
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for

# Original: this is said to be a personal film for peter bogdonavitch he based it on
# Reconstructed words: the, this, of, to, is, movie, and, it, in, was, film, that, for, one

In [136]:
# Test the Autoencoder
model.eval()       # Inference mode for layers such as dropout/batch norm (none here, but good practice)
with torch.no_grad():  # Disable gradient tracking; forward pass only
  reconstructed, encoded_data = model(X_tensor)

  # Move data back to CPU for processing
  reconstructed = reconstructed.cpu()
  encoded_data = encoded_data.cpu()
  X_tensor_cpu = X_tensor.cpu()


  # # Print encoded representation
  # print(f"\nEncoded representations ({encoding_dim}-dimensional):")
  # for i, sentence in enumerate(sentences):
  #     print(f"{sentence}: {encoded_data[i].numpy()}")

  # Print original and reconstructed text
  feature_names = vectorizer.get_feature_names_out()  # look up once, not per word
  print("\nOriginal Vs Reconstructed: ")
  for i in range(10):
    print(f"\nOriginal: {sentences[i]}")

    # Get Original words
    original_indices = X_tensor_cpu[i].nonzero().flatten().tolist()
    original_words = [feature_names[idx] for idx in original_indices]

    # Get reconstructed words (top N where N is number of words in original)
    n_words = len(original_indices)
    values, indices = torch.topk(reconstructed[i], n_words)
    reconstructed_words = [feature_names[idx.item()] for idx in indices]

    print(f"Reconstructed words: {', '.join(reconstructed_words)}")





Original Vs Reconstructed: 

Original: i rented i am curiousyellow from my video store because of all the controversy that
Reconstructed words: store, of, the, curiousyellow, controversy, video, rented, one, that, at, my, just, all

Original: i am curious yellow is a risible and pretentious steaming pile it doesnt matter what
Reconstructed words: and, steaming, risible, yellow, is, pretentious, what, curious, matter, pile, film, it, in

Original: if only to avoid making this type of film in the future this film is
Reconstructed words: this, the, is, to, future, of, type, avoid, making, film, into, has, in

Original: this film was probably inspired by godards masculin fminin and i urge you to see
Reconstructed words: this, to, was, urge, masculin, fminin, inspired, godards, probably, and, see, be, in, dont

Original: oh brotherafter hearing about this ridiculous film for umpteen years all i can think of
Reconstructed words: of, this, brotherafter, umpteen, can, oh, hearing, ridiculous,
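
One way to put a number on outputs like these is the fraction of original tokens that reappear among the reconstructed words. The sketch below computes that for the first example printed above. The helper names `token_set` and `reconstruction_recall` are hypothetical, and the tokenizer regex assumes CountVectorizer's default pattern (tokens of two or more word characters).

```python
import re

def token_set(sentence):
    # Same rule as CountVectorizer's default: tokens of 2+ word characters.
    return set(re.findall(r"(?u)\b\w\w+\b", sentence.lower()))

def reconstruction_recall(original_sentence, reconstructed_words):
    """Share of the original vocabulary tokens found in the reconstruction."""
    original = token_set(original_sentence)
    return len(original & set(reconstructed_words)) / len(original)

original = ("i rented i am curiousyellow from my video store "
            "because of all the controversy that")
reconstructed = ("store, of, the, curiousyellow, controversy, video, "
                 "rented, one, that, at, my, just, all").split(", ")

recall = reconstruction_recall(original, reconstructed)
print(f"{recall:.3f}")  # 0.769 (10 of 13 original tokens recovered)
```

Because the notebook picks exactly as many top-scoring words as the original has, precision and recall coincide for each sentence under this measure.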

In [None]:
## Testing this approach on a larger dataset

In [23]:
def load_imdb_dataset(max_samples=1000):
  from torchtext.datasets import IMDB

  train_iter = IMDB(split='train')

  # Convert iterator to list for easier processing
  data = []
  labels = []
  count = 0

  for label, text in train_iter:
    # Process and clean the text (simplified)
    text = text.lower()
    data.append(text)
    labels.append(1 if label == 'pos' else 0)
    count += 1
    if count >= max_samples:
      break

  return data, labels

texts, labels = load_imdb_dataset(100)


In [24]:
len(texts)

100

In [48]:
import re

def clean_text(text, max_words=15):
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)  # remove HTML tags
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove non-alphanumeric characters
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace

    words = text.split()
    if len(words) > max_words:
      words = words[:max_words]  # honor the max_words argument instead of a hard-coded 15
    return " ".join(words)

def load_imdb_dataset(max_samples=1000):
    from torchtext.datasets import IMDB
    train_iter = IMDB(split='train')

    data = []
    labels = []
    count = 0

    for label, text in train_iter:
        text = clean_text(text)
        data.append(text)
        # Older torchtext releases yield 'pos'/'neg' strings; newer ones yield 2/1.
        labels.append(1 if label in ('pos', 2) else 0)
        count += 1
        if count >= max_samples:
            break

    return data, labels
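
To see what the cleaning step does to a raw review, here is a self-contained copy of `clean_text` that truncates to `max_words`, applied to a made-up input. The sample string is an assumption for illustration and does not come from the IMDB data.

```python
import re

def clean_text(text, max_words=15):
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)          # remove HTML tags
    text = re.sub(r'[^a-z0-9\s]', '', text)     # remove non-alphanumeric characters
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace
    return " ".join(text.split()[:max_words])   # keep the first max_words words

sample = "It doesn't matter!<br />Great film."
cleaned_full = clean_text(sample)
cleaned_short = clean_text(sample, max_words=3)
print(cleaned_full)   # it doesnt matter great film
print(cleaned_short)  # it doesnt matter
```

Dropping apostrophes is why tokens such as "doesnt" and "couldnt" appear in the cleaned texts.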


In [54]:
texts, labels = load_imdb_dataset(1000)

In [55]:
texts

['i rented i am curiousyellow from my video store because of all the controversy that',
 'i am curious yellow is a risible and pretentious steaming pile it doesnt matter what',
 'if only to avoid making this type of film in the future this film is',
 'this film was probably inspired by godards masculin fminin and i urge you to see',
 'oh brotherafter hearing about this ridiculous film for umpteen years all i can think of',
 'i would put this at the top of my list of films in the category',
 'whoever wrote the screenplay for this movie obviously never consulted any books about lucille ball',
 'when i first saw a glimpse of this movie i quickly noticed the actress who',
 'who are these they the actors the filmmakers certainly couldnt be the audience this is',
 'this is said to be a personal film for peter bogdonavitch he based it on',
 'it was great to see some of my favorite stars of 30 years ago including',
 'i cant believe that those praising this movie herein arent thinking of some o