# Zvi Badash 214553034
## Assignment 17
### Question 2 - Implementing a Textual Denoising Autoencoder
The video explaining the exercise can be found [here](https://youtu.be/5s7z2aDHATs)

## Imports

In [None]:
!pip install 'portalocker>=2.0.0'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting portalocker>=2.0.0
  Downloading portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.7.0


In [None]:
import random
import torch
import re
import nltk
from nltk.corpus import stopwords
from tqdm.notebook import tqdm
from torch import nn
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import matplotlib.pyplot as plt

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Deciding on a dataset
Choosing a dataset wasn't easy because the question gave few details, but I ended up going with the IMDB dataset.

The dataset lets me reuse the learned encoder for a downstream classification task later, and from what I gathered it is pretty easy to use together with TorchText.

In [None]:
train_dataset, test_dataset = IMDB(split=('train', 'test'))
train_dataset = list(train_dataset)

In [None]:
_, train_sentences = zip(*train_dataset)

## Cleaning the data, as it's quite irregular
I copied this section from https://www.kaggle.com/code/grigol1/applying-lstm-to-sentiment-analysis-imdb-reviews.

In [None]:
STOPWORDS = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower() # lowercase text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    text = re.sub(r'\W', ' ', text) # Remove all the special characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # remove all single characters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text) # Remove a single character at the start (an unescaped ^ anchors to the start)
    text = re.sub(r'\s+', ' ', text) # Substituting multiple spaces with single space
    return text

In [None]:
train_sentences = list(map(preprocess_text, train_sentences))
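
To see what these steps do in practice, here is a minimal sketch that runs a simplified copy of the cleaning steps on a made-up review. `DEMO_STOPWORDS` and `demo_preprocess` are hypothetical names, and the tiny stop-word set only stands in for NLTK's English list, so real output will differ.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
DEMO_STOPWORDS = {"the", "a", "is", "this", "was", "i"}

def demo_preprocess(text):
    text = text.lower()
    text = ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)
    text = re.sub(r'\W', ' ', text)              # punctuation and HTML brackets become spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop single letters surrounded by whitespace
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return text

sample = "This movie was GREAT!!! <br /> I'd watch it again."
cleaned_tokens = demo_preprocess(sample).split()
print(cleaned_tokens)
# ['movie', 'great', 'br', 'd', 'watch', 'it', 'again']
```

The `br` comes from the `<br />` tag, and the lone `d` survives because the whitespace before it was already consumed when the neighbouring `i` was removed. The same effect is why stray letters such as `m` can appear in the cleaned reviews.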

## Tokenization and filtering

For the tokenizer I went with torchtext's simple `basic_english` tokenizer (lowercasing, punctuation normalization and whitespace splitting), because I didn't want a more complex model that would only complicate my workflow.

In [None]:
tokenizer = get_tokenizer('basic_english')

As in the study guide, I filtered the dataset and chose to train only on sentences with 200 tokens or fewer.

In [None]:
MAX_LENGTH = 200
sentence_filter = lambda x: len(x) <= MAX_LENGTH

In [None]:
train_sentences_tokenized = map(tokenizer, train_sentences)

In [None]:
train_sentences_tokenized = filter(sentence_filter, train_sentences_tokenized)

In [None]:
train_sentences_tokenized = list(train_sentences_tokenized)

Shuffle the dataset to randomize the training order and reduce bias (the sentences are grouped with respect to each movie).

In [None]:
random.shuffle(train_sentences_tokenized)

## Creating the vocabulary

My vocabulary has special \<START\> and \<END\> tokens because I thought they would help training, but it seems they didn't, so in retrospect they might have been redundant.

The vocab also has a \<DROPPED\> token that signals to the model which words have been accidentally deleted (or purposefully redacted).

In [None]:
vocab = build_vocab_from_iterator(train_sentences_tokenized, min_freq=2, specials=['<UNK>', '<START>', '<END>', '<DROPPED>'])
vocab.set_default_index(0)
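
As a rough illustration of what `min_freq`, `specials` and the default index do, here is a plain-Python sketch. `build_demo_vocab` and `lookup` are hypothetical helpers, not torchtext functions, and the ordering of the non-special tokens in the real vocabulary may differ from the one assumed here.

```python
from collections import Counter

def build_demo_vocab(token_lists, min_freq, specials):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Assumption: specials come first, then tokens by descending frequency (ties alphabetical).
    kept = sorted(
        (t for t, c in counts.items() if c >= min_freq and t not in specials),
        key=lambda t: (-counts[t], t),
    )
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def lookup(stoi, tokens, default_index=0):
    # Tokens missing from the vocabulary fall back to the default index (<UNK>).
    return [stoi.get(tok, default_index) for tok in tokens]

corpus = [["good", "film"], ["bad", "film"], ["good", "acting"]]
demo_stoi, demo_itos = build_demo_vocab(
    corpus, min_freq=2, specials=['<UNK>', '<START>', '<END>', '<DROPPED>']
)
print(demo_itos)
print(lookup(demo_stoi, ["good", "acting", "film"]))  # "acting" appears once, so it maps to 0
```

Words seen fewer than `min_freq` times never get their own index, so at reconstruction time the model can only ever predict \<UNK\> for them.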

In [None]:
start_token_index = vocab.get_stoi()['<START>']
end_token_index = vocab.get_stoi()['<END>']
unk_token_index = vocab.get_stoi()['<UNK>']
dropped_token_index = vocab.get_stoi()['<DROPPED>']

## Creating the noisy dataset by corrupting the train sentences

As I mentioned earlier, the corruption method I used is to randomly drop a few tokens in each sequence (I didn't use a dropout layer because I replace the tokens rather than completely discarding them). Each dropped token is replaced with a \<DROPPED\> token to signal to the model that it needs replacement.

I chose this corruption because it's word-level and not character-level, and because I found it both easy to implement and useful (perhaps this kind of model can be used to un-redact classified documents or help with archaeological reconstruction of ancient scrolls).

In [None]:
p = 0.075
delete_random_word = lambda sentence: [word if random.random() > p else '<DROPPED>' for word in sentence]
corrupt_sentence = lambda sentence: delete_random_word(sentence)
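
The same masking idea can be sketched as a small function with an explicit probability and random generator, which makes it easier to check its properties. The names `corrupt` and `rng` are hypothetical, and the seed is only there so the demo is repeatable.

```python
import random

DROPPED = '<DROPPED>'

def corrupt(sentence, p, rng):
    """Replace each token with DROPPED independently with probability p."""
    return [tok if rng.random() > p else DROPPED for tok in sentence]

rng = random.Random(0)  # seeded only so the demo is repeatable
clean = "watched film beginning end promised friend".split()
noisy = corrupt(clean, p=0.075, rng=rng)

# Length and word order are preserved; only some slots are masked.
masked_positions = [i for i, tok in enumerate(noisy) if tok == DROPPED]
print(noisy, masked_positions)

# On average p * len(sentence) tokens are masked: 0.075 * 200 = 15 for the longest sentences.
expected_masked_at_max_length = 0.075 * 200
```

Because every token is masked independently, short sentences often come through untouched, in which case the noisy input and the clean target are identical.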

In [None]:
train_sentences_clean = train_sentences_tokenized.copy()

In [None]:
train_sentences_noisy = [corrupt_sentence(sentence) for sentence in train_sentences_clean.copy()]

## Padding the sentences with \<START\> and \<END\>

In [None]:
train_sentences_clean = [['<START>'] + sentence + ['<END>'] for sentence in train_sentences_clean]
train_sentences_noisy = [['<START>'] + sentence + ['<END>'] for sentence in train_sentences_noisy]

In [None]:
stoi = lambda x: torch.tensor(vocab(x))

In [None]:
train_sentences_clean_ints = list(map(stoi, train_sentences_clean))
train_sentences_noisy_ints = list(map(stoi, train_sentences_noisy))

## Looking at an example from the data

In [None]:
itos = lambda t: vocab.get_itos()[t.item()]

In [None]:
clean_example = list(map(itos, train_sentences_clean_ints[67]))
noisy_example = list(map(itos, train_sentences_noisy_ints[67]))

In [None]:
' '.join(clean_example)

'<START> watched film beginning end promised friend would lacks even unintentional entertainment value many bad films have may worst film ever seen m surprised distributor put name it <END>'

In [None]:
' '.join(noisy_example)

'<START> watched film beginning end promised friend would lacks even unintentional entertainment value many bad films have <DROPPED> worst film ever seen m surprised distributor put name it <END>'

## Even after filtering

The model has ~20K examples to learn from, more than enough in my opinion.

In [None]:
len(train_sentences_clean_ints)

20689

## Defining the model

To implement the model I aimed for the simplest model possible.

The encoder and the decoder are simple LSTMs, and at each step the decoder output is projected to a score (logit) for every word in the vocabulary; a softmax over these scores gives a probability distribution over the word at that position.

In [None]:
class DAE(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_layers=2):
        super(DAE, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)
        self.encoder = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers)
        self.decoder = nn.LSTM(hidden_dim, embedding_dim, num_layers=num_layers)
        self.fc = nn.Linear(embedding_dim, len(vocab))

    def forward(self, input_seq):
        embedded = self.embedding(input_seq)
        encoded_outputs, _ = self.encoder(embedded)
        decoded_outputs, _ = self.decoder(encoded_outputs)
        word_logits = self.fc(decoded_outputs)
        return word_logits

In [None]:
class Seq2SeqDAE(nn.Module):
    def __init__(self, hidden_dim, num_layers, dropout):
        super(Seq2SeqDAE, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(len(vocab), hidden_dim)
        self.encoder   = nn.LSTM(hidden_dim, hidden_dim, num_layers, dropout=dropout)
        self.decoder   = nn.LSTM(hidden_dim, hidden_dim, num_layers, dropout=dropout)
        self.fc        = nn.Linear(hidden_dim, len(vocab))
        self.dropout   = nn.Dropout(dropout)

    def forward(self, src, trg):
        src_emb = self.dropout(self.embedding(src))
        trg_emb = self.dropout(self.embedding(trg))

        _, contexts = self.encoder(src_emb)
        output, _ = self.decoder(trg_emb, contexts)

        return self.fc(output)

    def evaluate_sentence(self, sentence):
        self.eval()
        with torch.no_grad():
            sentence = sentence.unsqueeze(1)
            src_emb = self.embedding(sentence)
            _, (hidden, cell) = self.encoder(src_emb)

            # Start decoding from the <START> token (index defined when the vocabulary was built)
            output = torch.tensor([[start_token_index]])
            outputs = []

            for _ in range(MAX_LENGTH):
                output_emb = self.embedding(output)
                output, (hidden, cell) = self.decoder(output_emb, (hidden.expand(-1, 1, -1), cell.expand(-1, 1, -1)))
                output = self.fc(output.squeeze(0))
                generated_token = torch.argmax(output, dim=1)  # greedy choice of the next token
                outputs.append(generated_token.item())
                if generated_token.item() == end_token_index:
                    break
                output = generated_token.unsqueeze(0)

        generated_sentence = [vocab.get_itos()[token] for token in outputs]
        return ' '.join(generated_sentence)

## Training

I used a typical training loop for training the model, but the loss function is worth discussing.

As with any autoencoder, I encode the noisy sentence, decode it, and receive a "reconstructed" version of it.

This version is trained to be as close as possible to the clean, original sentence, measured by the cross-entropy between the predicted probability distribution at each position and the true token at that position.
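
To make the loss concrete, here is a small pure-Python sketch of per-token cross-entropy averaged over a sequence, which is the quantity `nn.CrossEntropyLoss` computes with its default mean reduction when given logits and class indices. The helper names and the 4-word toy vocabulary are assumptions for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(logits_per_step, target_ids):
    """Mean over positions of -log p(true token at that position)."""
    losses = [-math.log(softmax(logits)[t]) for logits, t in zip(logits_per_step, target_ids)]
    return sum(losses) / len(losses)

targets = [2, 0, 3]  # indices of the clean tokens at three positions

# A model that knows nothing assigns equal scores to all 4 words: loss = ln(4).
uniform_loss = sequence_cross_entropy([[0.0] * 4] * 3, targets)

# A model that puts a high score on the right word at each position gets a much lower loss.
confident = [[0, 0, 5, 0], [5, 0, 0, 0], [0, 0, 0, 5]]
confident_loss = sequence_cross_entropy(confident, targets)
print(uniform_loss, confident_loss)
```

With the real vocabulary of size V, an untrained model starts near ln(V), and training pushes the value down as the predicted distribution concentrates on the clean tokens.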

In [None]:
# model = DAE(250, 200, 2)
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
# def iterate_one_pair(input_seq, target_seq):
#     optimizer.zero_grad()
#     reconstructed_seq = model(input_seq)
#     loss = criterion(reconstructed_seq, target_seq)
#     loss.backward()
#     optimizer.step()
#     return loss.item()

In [None]:
# num_samples, epochs = 2500, 10
# epoch_losses = []

In [None]:
# %%time
# for epoch in range(epochs):
#     torch.save(model.state_dict(), f'DAE_model_epoch:{epoch}.pth')
#     batch_loss_agg = torch.tensor([0.])

#     for idx in tqdm(range(num_samples)):
#         batch_loss_agg += iterate_one_pair(train_sentences_noisy_ints[idx], train_sentences_clean_ints[idx])

#     epoch_loss = batch_loss_agg / num_samples
#     epoch_losses.append(epoch_loss)
#     print("Epoch", epoch, " loss:", epoch_loss.item())

In [None]:
# plt.plot(epoch_losses)

## Testing the model on a previously unseen sentence

In [None]:
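def evaluate_max_word(eval_model, input_seq):
    eval_model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        reconstructed_seq = eval_model(input_seq)
    # Pick the most likely vocabulary entry at every position
    cleaned_seq_int = torch.argmax(reconstructed_seq, dim=1)
    itos_list = vocab.get_itos()
    cleaned_seq = [itos_list[x] for x in cleaned_seq_int.tolist()]
    return cleaned_seq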

In [None]:
test_idx = 4564

In [None]:
trained_model = DAE(250, 200, 2)
untrained_model = DAE(250, 200, 2)

In [None]:
trained_model.load_state_dict(torch.load('DAE_model_epoch:9.pth'))
untrained_model.load_state_dict(torch.load('DAE_model_epoch:0.pth'))

In [None]:
denoised_untrained = evaluate_max_word(untrained_model, train_sentences_noisy_ints[test_idx])
denoised_trained = evaluate_max_word(trained_model, train_sentences_noisy_ints[test_idx])
clean = list(map(itos, train_sentences_clean_ints[test_idx]))
dirty = list(map(itos, train_sentences_noisy_ints[test_idx]))

In [None]:
' '.join(denoised_untrained)

'romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp romp'

In [None]:
' '.join(denoised_trained)

'<START> one film actual teenager took place pulled war eyed kept attention end yours viewing it ever read morgan bucks curious see cowgirls direct off movie historical films era br br my major sandler rhode environment look good acting little stiff times like old man names combining kong golden paris soldier playing rates models film strange overall aankhen film <END>'

In [None]:
' '.join(clean)

'<START> interesting film actual event took place civil war vermont kept attention end regret viewing it ever read raid incident curious see rebels pulled off enjoy historical films era br br my major complaint confederate uniforms look good acting little stiff times like old man eating mashed potatoes teeth wounded soldier playing fetch hound little strange overall descent film <END>'

In [None]:
' '.join(dirty)

'<START> <DROPPED> film actual event took place civil war vermont kept attention end regret viewing it ever read raid incident curious see rebels pulled off <DROPPED> historical films era br br my major complaint confederate uniforms look good acting little stiff times like old man eating mashed potatoes teeth wounded soldier playing fetch hound <DROPPED> strange overall descent film <END>'
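
Comparing the four strings by eye is hard, so one possible way to quantify the result is to score the masked and the unmasked positions separately. This is only a sketch: `accuracy_at` is a hypothetical helper, it assumes the sequences are aligned token by token, and the demo uses just the first seven tokens of the outputs above.

```python
DROPPED = '<DROPPED>'

def accuracy_at(clean, denoised, positions):
    """Fraction of the given positions where the denoised token equals the clean one."""
    if not positions:
        return None  # undefined when there is nothing to score
    hits = sum(clean[i] == denoised[i] for i in positions)
    return hits / len(positions)

# First seven tokens of the clean, noisy and denoised (trained model) sentences shown above.
clean_prefix = "<START> interesting film actual event took place".split()
noisy_prefix = "<START> <DROPPED> film actual event took place".split()
denoised_prefix = "<START> one film actual teenager took place".split()

masked = [i for i, tok in enumerate(noisy_prefix) if tok == DROPPED]
kept = [i for i, tok in enumerate(noisy_prefix) if tok != DROPPED]

restored_accuracy = accuracy_at(clean_prefix, denoised_prefix, masked)  # did it fill the gap?
copied_accuracy = accuracy_at(clean_prefix, denoised_prefix, kept)      # did it keep the rest?
print(restored_accuracy, copied_accuracy)
```

On this prefix the single masked word is not restored ("one" instead of "interesting"), and one unmasked word is also changed ("teenager" instead of "event"), so the two scores separate the denoising behaviour from plain copying.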

## Downstream tasks

Or, using the model for a classification task

### Handle data loading

In [None]:
train_dataset, test_dataset = IMDB(split=('train', 'test'))

In [None]:
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

In [None]:
random.shuffle(train_dataset)
random.shuffle(test_dataset)

In [None]:
train_labels, train_sentences = zip(*train_dataset)
test_labels, test_sentences = zip(*test_dataset)

In [None]:
train_sentences = list(map(preprocess_text, train_sentences))
test_sentences = list(map(preprocess_text, test_sentences))

In [None]:
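train_sentences = list(map(tokenizer, train_sentences))
test_sentences = list(map(tokenizer, test_sentences))

In [None]:
# Filter (label, sentence) pairs together so the labels stay aligned with the kept sentences
train_pairs = [(label, s) for label, s in zip(train_labels, train_sentences) if sentence_filter(s)]
test_pairs = [(label, s) for label, s in zip(test_labels, test_sentences) if sentence_filter(s)]

In [None]:
train_labels, train_sentences = map(list, zip(*train_pairs))
test_labels, test_sentences = map(list, zip(*test_pairs))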

In [None]:
train_sentences_ints = list(map(stoi, train_sentences))
test_sentences_ints = list(map(stoi, test_sentences))
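
Because the length filter removes some reviews, labels and sentences only stay paired if they are filtered together. A toy sketch of the difference, with made-up data and a hypothetical `MAX_LEN`:

```python
MAX_LEN = 3  # hypothetical small limit, only for this demo

labels = [1, 2, 1, 2]
sentences = [
    ["bad"],
    ["a", "very", "long", "glowing", "review"],  # too long, will be filtered out
    ["dull", "plot"],
    ["great", "fun"],
]

# Filtering the sentences alone shifts every later sentence onto the wrong label.
short_only = [s for s in sentences if len(s) <= MAX_LEN]
misaligned = list(zip(labels, short_only))

# Filtering (label, sentence) pairs keeps each sentence with its own label.
aligned = [(label, s) for label, s in zip(labels, sentences) if len(s) <= MAX_LEN]

print(misaligned)
print(aligned)
```

In the misaligned version "dull plot" ends up with label 2 and "great fun" with label 1, so a classifier trained on it sees labels that are close to random.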

### Model definition

The model I chose is the DAE's pretrained embedding layer, followed by its encoder, followed by a new classification head.

In [None]:
class IMDBClassifier(nn.Module):
    def __init__(self, DAE):
        super(IMDBClassifier, self).__init__()
        self.embedding = DAE.embedding
        self.encoder = DAE.encoder
        self.classification_head = nn.Sequential(
            nn.LazyLinear(128), nn.ReLU(), nn.Dropout(0.3),
            nn.LazyLinear(32), nn.ReLU(), nn.Dropout(0.3),
            nn.LazyLinear(2)
        )

    def forward(self, input_seq):
        embedded = self.embedding(input_seq.unsqueeze(0)).squeeze()
        encoded_outputs, _ = self.encoder(embedded)
        encoded_outputs = encoded_outputs[-1, :].view(-1)
        return self.classification_head(encoded_outputs)

In [None]:
classification_model = IMDBClassifier(trained_model)



In [None]:
for p in classification_model.embedding.parameters():
    p.requires_grad = False

In [None]:
for p in classification_model.encoder.parameters():
    p.requires_grad = False

### Training!

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classification_model.classification_head.parameters(), lr=0.005)

In [None]:
num_samples, epochs = 5000, 5
epoch_losses = []

In [None]:
for epoch in range(epochs):
    epoch_loss = 0.0

    for idx in tqdm(range(num_samples)):
        optimizer.zero_grad()
        inputs, label = train_sentences_ints[idx], train_labels[idx]
        label = torch.tensor([label - 1])
        label = nn.functional.one_hot(label, num_classes=2).squeeze()
        label = label.type(torch.float)
        output = classification_model(inputs)

        loss = criterion(output, label)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
    epoch_losses.append(epoch_loss)
    print("Epoch", epoch, "loss:", epoch_loss)

Epoch 0 loss: 3508.877294052392
Epoch 1 loss: 3473.967836678028
Epoch 2 loss: 3473.092191338539
Epoch 3 loss: 3473.08959454298
Epoch 4 loss: 3473.0895958542824
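
The printed values are sums over the 5000 training samples, so dividing by 5000 gives the mean loss per review. A short worked check, comparing it with ln 2, the cross-entropy of a two-class predictor that always outputs 50/50 (variable names here are only for illustration):

```python
import math

# Summed losses per epoch, copied from the training output.
epoch_totals = [
    3508.877294052392,
    3473.967836678028,
    3473.092191338539,
    3473.08959454298,
    3473.0895958542824,
]
num_samples = 5000

per_sample = [total / num_samples for total in epoch_totals]
chance_level = math.log(2)  # about 0.6931

for epoch, loss in enumerate(per_sample):
    print(f"epoch {epoch}: mean loss {loss:.4f} (chance level {chance_level:.4f})")
```

The mean loss settles at about 0.6946, essentially the chance level, which suggests the classification head is not extracting a usable signal from its inputs and labels as trained here.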


### Testing the model

In [None]:
def test_IMDb_classification_model():
    correct = 0
    total = 0
    with torch.no_grad():
        for idx in tqdm(range(len(test_sentences_ints))[:1000]):
            inputs, label = test_sentences_ints[idx], test_labels[idx]
            output = classification_model(inputs)
            predicted = torch.argmax(output)
            total += 1
            if predicted.item() == label - 1:
                correct += 1
    return correct / total

In [None]:
f'Model accuracy: {test_IMDb_classification_model() * 100:.2f}%'

'Model accuracy: 52.40%'