<a href="https://colab.research.google.com/github/dmika1234/dl_uwr/blob/develop/Assignments/Assignment5/assignment5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 5

**Submission deadlines**:

* last lab before 20.06.2023

**Points:** Aim to get 6 (updated value) out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1uufpGn46Mwv4oBwajIeOj4rvAK96iaS-?usp=sharing> (or will be soon :) )


## Task 1 (5 points)

Consider the vowel reconstruction task -- i.e. inserting missing vowels (aeuioy) to obtain proper English text. For instance for the input sentence:

<pre>
h m gd smbd hs stln ll m vwls
</pre>

the best result is

<pre>
oh my god somebody has stolen all my vowels
</pre>

In this task, both the dev and test data come from the two books about Winnie-the-Pooh. You have to train two RNN language models on *pooh_train.txt*. For the first model, use the code below; for the second, choose different hyperparameters (a different dropout rate, fewer units or layers, or any other modification you want).

The code below is based on
https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html

In [1]:
!gdown "https://drive.google.com/uc?id=1-k8e9OG7NOVk73Kkv4WpqNQKHrVVmVXa" -O pooh_train.txt
!gdown "https://drive.google.com/uc?id=1ADNyasf6AEUsmz-163DWHw_rSldfnpta" -O pooh_test.txt
!gdown "https://drive.google.com/uc?id=1POiC9I_BjZKBQe-7XkW5CW0z8_6inWtY" -O pooh_words.txt

Downloading...
From: https://drive.google.com/uc?id=1-k8e9OG7NOVk73Kkv4WpqNQKHrVVmVXa
To: /content/pooh_train.txt
100% 255k/255k [00:00<00:00, 116MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ADNyasf6AEUsmz-163DWHw_rSldfnpta
To: /content/pooh_test.txt
100% 34.6k/34.6k [00:00<00:00, 83.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1POiC9I_BjZKBQe-7XkW5CW0z8_6inWtY
To: /content/pooh_words.txt
100% 20.4k/20.4k [00:00<00:00, 83.7MB/s]


### Code and first model training

In [2]:
import torch
from collections import Counter

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEQUENCE_LENGTH = 15

class PoohDataset(torch.utils.data.Dataset):
    def __init__(self, sequence_length, device):
        txt = open('pooh_train.txt').read()

        self.words = txt.lower().split() # The text is already tokenized

        self.uniq_words = self.get_uniq_words()

        self.index_to_word = {index: word for index, word in enumerate(self.uniq_words)}
        self.word_to_index = {word: index for index, word in enumerate(self.uniq_words)}

        self.words_indexes = [self.word_to_index[w] for w in self.words]
        self.sequence_length = sequence_length
        self.device = device


    def get_uniq_words(self):
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.words_indexes) - self.sequence_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.words_indexes[index:index+self.sequence_length], device=self.device),
            torch.tensor(self.words_indexes[index+1:index+self.sequence_length+1], device=self.device)
        )

pooh_dataset = PoohDataset(SEQUENCE_LENGTH, device)

In [138]:
from torch import nn, optim

class LSTMModel(nn.Module):
    def __init__(self, dataset, device):
        super(LSTMModel, self).__init__()
        self.lstm_size = 512
        self.embedding_dim = 100
        self.num_layers = 2
        self.device = device


        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device))

model = LSTMModel(pooh_dataset, device)
model.to(device)

In [4]:
import numpy as np
from torch.utils.data import DataLoader

batch_size = 512
max_epochs = 30

def train(dataset, model):
    model.train()

    dataloader = DataLoader(dataset, batch_size=batch_size)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(max_epochs):
        state_h, state_c = model.init_state(SEQUENCE_LENGTH)

        for batch, (x, y) in enumerate(dataloader):

            optimizer.zero_grad()

            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
            loss = criterion(y_pred.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()

            loss.backward()
            optimizer.step()

        print({ 'epoch': epoch, 'batch': batch, 'loss': loss.item() })

train(pooh_dataset, model)

{'epoch': 0, 'batch': 113, 'loss': 5.560373306274414}
{'epoch': 1, 'batch': 113, 'loss': 5.049388408660889}
{'epoch': 2, 'batch': 113, 'loss': 4.688978672027588}
{'epoch': 3, 'batch': 113, 'loss': 4.363079071044922}
{'epoch': 4, 'batch': 113, 'loss': 4.132043838500977}
{'epoch': 5, 'batch': 113, 'loss': 3.9728634357452393}
{'epoch': 6, 'batch': 113, 'loss': 3.8696725368499756}
{'epoch': 7, 'batch': 113, 'loss': 3.7312123775482178}
{'epoch': 8, 'batch': 113, 'loss': 3.6177730560302734}
{'epoch': 9, 'batch': 113, 'loss': 3.5054659843444824}
{'epoch': 10, 'batch': 113, 'loss': 3.3890910148620605}
{'epoch': 11, 'batch': 113, 'loss': 3.284022808074951}
{'epoch': 12, 'batch': 113, 'loss': 3.182687997817993}
{'epoch': 13, 'batch': 113, 'loss': 3.09560489654541}
{'epoch': 14, 'batch': 113, 'loss': 3.0284974575042725}
{'epoch': 15, 'batch': 113, 'loss': 2.9465301036834717}
{'epoch': 16, 'batch': 113, 'loss': 2.8474338054656982}
{'epoch': 17, 'batch': 113, 'loss': 2.7900633811950684}
{'epoch': 1

In [145]:
# torch.save(model.state_dict(), 'pooh_2x512_30ep.model')
# model = LSTMModel(pooh_dataset, device)
# model.load_state_dict(torch.load("pooh_2x512_30ep.model"))
# model.eval()

LSTMModel(
  (embedding): Embedding(2548, 100)
  (lstm): LSTM(100, 512, num_layers=2, dropout=0.2)
  (fc): Linear(in_features=512, out_features=2548, bias=True)
)

In [6]:
# The predict function is a text generator. You have to modify this code!

def predict(dataset, model, text, next_words=15):
    model.eval()

    words = text.split()
    state_h, state_c = model.init_state(len(words))

    for i in range(0, next_words):
        x = torch.tensor([[dataset.word_to_index[w] for w in words[i:]]])
        x = x.to(device)

        y_pred, (state_h, state_c) = model(x, (state_h, state_c))

        last_word_logits = y_pred[0][-1]
        p = torch.nn.functional.softmax(last_word_logits, dim=0).detach().cpu().numpy()
        word_index = np.random.choice(len(last_word_logits), p=p)
        words.append(dataset.index_to_word[word_index])

    return ' '.join(words)

# DEMO
speakers = ['pooh', 'piglet', 'christopher robin', 'rabbit', 'owl', 'tigger', 'eeyore']
for s in speakers:
    prompt = 'in the morning ' + s
    for i in range(1):
        print (predict(pooh_dataset, model, prompt, 50))
    print ()

in the morning pooh certain he thought by a friend or two , which shows piglet ! '' `` i see , '' said pooh . chapter viii . and pooh and piglet bent into a arms and said `` oh ! '' and that he was talking about . `` it 's all

in the morning piglet remember drifted tuesday 'help pen r cuckoo spotted land vests enjoyed numb skoos help-yourself dried interrupted shoulders misty suggested board ice fought help-yourself seized referring stepped affectionately ease practised mole sea strip curious rains damp noise-you-make-before- curious line practised supper fluttered blue-bells outland slept flashed desert regretful beautiful sparkle slippery

in the morning christopher robin -- -- he called piglet again . `` that 's just come in a sort of idea , pooh ? '' piglet said `` yes , yes , and then he had never had , piglet 's house as north . '' `` well , '' said pooh , ``

in the morning rabbit giving front trap can pooh said `` oh ! '' he added brushing . `` it 's easy . come on out and cou

In [39]:
# You can use the code if you want

from collections import defaultdict as dd

vowels = set("aoiuye'")
def devowelize(s):
    rv = ''.join(a for a in s if a not in vowels)
    if rv:
        return rv
    return '_' # Symbol for words without consonants

pooh_words = set(open('pooh_words.txt').read().split())
representation = dd(set)

for w in pooh_words:
    r = devowelize(w)
    if w in pooh_dataset.word_to_index.keys():
      representation[r].add(w)

hard_words = set()
for r, ws in representation.items():
    if len(ws) > 1:
        hard_words.update(ws)

print (len(hard_words))

799


### solutions -> results

In [156]:
def sequence_probability(model, dataset, sequence):
    model.eval()

    words = sequence.split()
    state_h, state_c = model.init_state(len(words))

    x = torch.tensor([[dataset.word_to_index[w] for w in words]], device=device)
    y_pred, _ = model(x, (state_h, state_c))

    # softmax to get probabilities
    p = torch.nn.functional.softmax(y_pred[0], dim=1).detach().cpu().numpy()

    # get log-probability of the sequence
    log_p = np.log(p[range(len(words)), [dataset.word_to_index[w] for w in words]]).sum()

    return log_p

def beam_search(model, devowelized_sentence, beam_width, dataset=pooh_dataset):
    # Initialize the list of candidate sequences
    candidates = [('', 0.0)]  # note that we initialize with log-probability 0

    # Process each devowelized word in the sentence
    for word in devowelized_sentence.split():
        if len(representation[word]) == 0:
            # No known word matches these consonants: skip it and keep the candidates unchanged
            continue

        new_candidates = []
        # For each candidate sequence, extend it with each possible reconstructed word
        for seq, seq_log_p in candidates:
            for reconstructed_word in representation[word]:
                new_seq = seq + ' ' + reconstructed_word if seq else reconstructed_word
                # Add the log-probability of the extended sequence to the running score
                new_seq_log_p = seq_log_p + sequence_probability(model, dataset, new_seq)
                new_candidates.append((new_seq, new_seq_log_p))

        # Sort the new candidates by score in descending order and keep the best beam_width
        new_candidates.sort(key=lambda x: x[1], reverse=True)
        candidates = new_candidates[:beam_width]

    # Return the candidate sequence with the highest overall score
    return candidates[0][0]
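
When a sentence is scored with a next-word language model, the prediction made after reading the first i words is conventionally used to score word i+1, so the targets are shifted by one position relative to the inputs. The following is a minimal sketch of that shifted scoring on a toy model. `next_word_probs` and the bigram table are hypothetical stand-ins for the softmax over LSTM logits, and the numbers are made up; the first word is left unscored because the sketch has no start-of-sentence token.

```python
import math

def sequence_log_prob(words, next_word_probs):
    """Sum of log P(words[i + 1] | words[0..i]) over the sequence.

    `next_word_probs(prefix)` is a hypothetical callable that returns a dict
    {word: probability} for the word following `prefix`.
    """
    total = 0.0
    for i in range(len(words) - 1):
        probs = next_word_probs(words[: i + 1])
        total += math.log(probs[words[i + 1]])
    return total

# Toy bigram table standing in for a trained language model (made-up numbers).
TOY_BIGRAMS = {
    "oh": {"my": 0.5, "dear": 0.5},
    "my": {"god": 0.25, "good": 0.75},
}

def toy_next_word_probs(prefix):
    # Only the last word matters for this toy bigram model.
    return TOY_BIGRAMS[prefix[-1]]

score = sequence_log_prob(["oh", "my", "god"], toy_next_word_probs)
print(score)  # log(0.5) + log(0.25)
```

With this decomposition, a beam search step only needs to add the single term for the newly appended word to the running score of its parent candidate.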

You can assume that only words from pooh_words.txt can occur in the reconstructed text. For decoding you have two options (choose one, or implement both and get **+1** bonus point):

1. Sample reconstructed text several times (with quite a low temperature), choose the most likely result.
2. Perform beam search.

Of course in the sampling procedure you should consider only words matching the given consonants.
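
Option 1 can be sketched as follows: restrict the distribution to the words that match the given consonants, divide the logits by a temperature below 1 to sharpen it, and sample. This is only an illustration under stated assumptions: `sample_constrained`, `toy_logits`, and the candidate list are hypothetical, and in the notebook the logits would come from the trained model.

```python
import math
import random

def sample_constrained(logits, allowed, temperature=0.5, rng=random):
    """Sample a word from `allowed` with probability proportional to
    exp(logit / temperature); words outside `allowed` are never chosen."""
    scaled = [logits[w] / temperature for w in allowed]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(allowed, weights=weights, k=1)[0]

# Made-up logits over a tiny vocabulary; the consonant skeleton "h"
# could stand for "he", "oh" or "ha", but not for "pooh".
toy_logits = {"he": 2.0, "oh": 1.0, "ha": -1.0, "pooh": 3.0}
candidates = ["he", "oh", "ha"]

rng = random.Random(0)
draws = [sample_constrained(toy_logits, candidates, 0.5, rng) for _ in range(1000)]
print({w: draws.count(w) for w in candidates})
```

Repeating the whole-sentence sampling several times and keeping the sample with the highest total log-probability gives the "choose the most likely result" step.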

Report the accuracy of your methods (for both language models). The accuracy should be computed with the following function, and it should be *greater than 0.25*.


```python
def accuracy(original_sequence, reconstructed_sequence):
    sa = original_sequence
    sb = reconstructed_sequence
    score = len([1 for (a,b) in zip(sa, sb) if a == b])
    return score / len(original_sequence)
```


In [123]:
def accuracy(original_sequence, reconstructed_sequence):
    sa = original_sequence
    sb = reconstructed_sequence
    score = len([1 for (a,b) in zip(sa, sb) if a == b])
    return score / len(original_sequence)

In [157]:
model.to(device)
devowelized_sentence = "h m gd smbd hs stln ll m vwls"
beam_width = 4  # You can adjust the beam width
result = beam_search(model, devowelized_sentence, beam_width)
print(f"Reconstructed sentence: {result}")
print(f"First model accuracy: {accuracy('oh my god somebody has stolen all my vowels'.split(), result.split()):.3f}")

Reconstructed sentence: he my good somebody his stolen all my
First model accuracy: 0.556


### Second model

In [143]:
class LSTMModel2(nn.Module):
    def __init__(self, dataset, device):
        super(LSTMModel2, self).__init__()
        self.lstm_size = 256
        self.embedding_dim = 100
        self.num_layers = 2
        self.device = device


        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )
        self.fc = nn.Sequential(
            nn.Linear(self.lstm_size, self.lstm_size),
            nn.ReLU(),
            nn.Linear(self.lstm_size, n_vocab))


    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device))

model2 = LSTMModel2(pooh_dataset, device)
model2.to(device)

LSTMModel2(
  (embedding): Embedding(2548, 100)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.2)
  (fc): Sequential(
    (0): Linear(in_features=256, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=2548, bias=True)
  )
)

In [146]:
train(pooh_dataset, model2)

{'epoch': 0, 'batch': 113, 'loss': 5.370290756225586}
{'epoch': 1, 'batch': 113, 'loss': 4.9626994132995605}
{'epoch': 2, 'batch': 113, 'loss': 4.6660475730896}
{'epoch': 3, 'batch': 113, 'loss': 4.377070903778076}
{'epoch': 4, 'batch': 113, 'loss': 4.202948093414307}
{'epoch': 5, 'batch': 113, 'loss': 4.038951396942139}
{'epoch': 6, 'batch': 113, 'loss': 3.903787612915039}
{'epoch': 7, 'batch': 113, 'loss': 3.763707160949707}
{'epoch': 8, 'batch': 113, 'loss': 3.6513772010803223}
{'epoch': 9, 'batch': 113, 'loss': 3.552811861038208}
{'epoch': 10, 'batch': 113, 'loss': 3.433504819869995}
{'epoch': 11, 'batch': 113, 'loss': 3.2995903491973877}
{'epoch': 12, 'batch': 113, 'loss': 3.1876747608184814}
{'epoch': 13, 'batch': 113, 'loss': 3.08644437789917}
{'epoch': 14, 'batch': 113, 'loss': 3.011042356491089}
{'epoch': 15, 'batch': 113, 'loss': 2.9016997814178467}
{'epoch': 16, 'batch': 113, 'loss': 2.787229061126709}
{'epoch': 17, 'batch': 113, 'loss': 2.7463219165802}
{'epoch': 18, 'batch

In [147]:
torch.save(model2.state_dict(), '2_pooh_2x512_30ep.model')
# model2 = LSTMModel2(pooh_dataset, device)
# model2.load_state_dict(torch.load("2_pooh_2x512_30ep.model"))
# model2.eval()
# model2.to(device)

In [158]:
model2.to(device)
devowelized_sentence = "h m gd smbd hs stln ll m vwls"
beam_width = 4  # You can adjust the beam width
result = beam_search(model2, devowelized_sentence, beam_width)
print(f"Reconstructed sentence: {result}")
print(f"Second model accuracy: {accuracy('oh my god somebody has stolen all my vowels'.split(), result.split()):.3f}")

Reconstructed sentence: oh my good somebody has stolen 'll my
Second model accuracy: 0.667


## Task 2 (6 points)

This task is about text generation. You have to:

**A**. Create a text corpus containing texts with similar vocabulary (for instance books from the same genre, or written by the same author). The corpus should have approximately 1M words. You can consider using the following sources: Project Gutenberg (https://www.gutenberg.org/), Wolne Lektury (https://wolnelektury.pl/), or parts of BookCorpus (https://github.com/soskek/bookcorpus), but feel free to use others. Texts can be in English, Polish, or any other language you know.

**B**. Choose the tokenization procedure. It should have two stages:

1. word tokenization (you can use nltk.tokenize.word_tokenize, the tokenizer from spaCy, pytorch, keras, ...). Test your tokenizer on your corpus and look at the set of tokens containing both letters and special characters. If, in your opinion, some of them should be treated as a sequence of tokens, modify the tokenization procedure.

2. sub-word tokenization (you can either use an existing procedure, like wordpiece or sentencepiece, or create something yourself). Here is a simple idea: take the 8K most popular words (W), the 1K most popular suffixes (S), and the 1K most popular prefixes (P). Words in W are their own tokens. A word x outside W should be tokenized as 'p_ _s', where p is the longest prefix of x in P and s is the longest suffix of x in S.
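
The simple prefix/suffix idea from stage 2 might look roughly like the sketch below. The function names, the tiny example sets, and the `<unk>` fallback for words with no known prefix or suffix are assumptions made for illustration; the task text does not say how such words should be handled.

```python
def longest_prefix(word, prefixes):
    # Try the longest candidate first, then shorten it.
    for end in range(len(word), 0, -1):
        if word[:end] in prefixes:
            return word[:end]
    return None

def longest_suffix(word, suffixes):
    # Try the longest candidate first, then shorten it.
    for start in range(len(word)):
        if word[start:] in suffixes:
            return word[start:]
    return None

def subword_tokenize(word, vocab, prefixes, suffixes, unk="<unk>"):
    """Words in `vocab` are single tokens; other words become 'p_' and '_s'."""
    if word in vocab:
        return [word]
    p = longest_prefix(word, prefixes)
    s = longest_suffix(word, suffixes)
    if p is None or s is None:
        return [unk]  # assumed fallback, not specified in the task
    return [p + "_", "_" + s]

# Tiny hand-made sets standing in for W, P and S.
W = {"the", "cat"}
P = {"un", "unbe"}
S = {"le", "able"}

print(subword_tokenize("cat", W, P, S))
print(subword_tokenize("unbelievable", W, P, S))
```

Note that the middle of an out-of-vocabulary word is discarded, so several different words can share the same pair of tokens.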

**C**. Write a text generation procedure. The procedure should fulfill the following requirements:

1. it should use the RNN language model (trained on sub-word tokens)
2. generated tokens should be presented as a text containing words (without extra spaces or other extra characters, such as begin-of-word markers introduced during tokenization)
3. all words in a generated text should belong to the corpus (note that this is not guaranteed by the LSTM)
4. in generation Top-P sampling should be used (see NN-NLP.6, slide X)
5. in generated texts every token 3-gram should be unique
6. *(optionally, +1 point)* all token bigrams in generated texts occur in the corpus
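
Top-P (nucleus) sampling from requirement 4 keeps the smallest set of most probable tokens whose cumulative probability reaches p, renormalizes within that set, and samples from it. Below is a minimal sketch on a plain list of probabilities; `top_p_filter` and `sample_top_p` are hypothetical helper names, and in practice the probabilities would be the softmax of the language model's logits.

```python
import random

def top_p_filter(probs, p):
    """Return {token_id: renormalized probability} for the nucleus."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # stop as soon as the nucleus covers probability p
            break
    return {i: probs[i] / mass for i in kept}

def sample_top_p(probs, p, rng=random):
    nucleus = top_p_filter(probs, p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

# Made-up next-token distribution over four token ids.
toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(toy_probs, 0.7)
print(nucleus)  # only token ids 0 and 1 remain
print(sample_top_p(toy_probs, 0.7, random.Random(0)))
```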

Of course, to fulfill these constraints you have to do rejection sampling, or beam search, or ... If you want to be more up-to-date, you can also use a transformer-like language model. In this case consider using nanoGPT (by A. Karpathy).
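
As one possible reading of the rejection-sampling option, the sketch below resamples the next token whenever it would repeat an already generated token 3-gram, and stops early if no acceptable token is found. `generate_unique_trigrams`, `sample_next`, and the three-token vocabulary are hypothetical; a real `sample_next` would draw from the language model, for instance with Top-P sampling.

```python
import random

def generate_unique_trigrams(sample_next, length, max_tries=50):
    """Generate up to `length` tokens such that no token 3-gram repeats.

    `sample_next(prefix)` is a hypothetical callable returning one token.
    """
    tokens, seen = [], set()
    while len(tokens) < length:
        for _ in range(max_tries):
            candidate = sample_next(tokens)
            trigram = tuple(tokens[-2:]) + (candidate,)
            if len(trigram) < 3 or trigram not in seen:
                break  # candidate accepted
        else:
            break  # every attempt repeated a 3-gram: stop early
        if len(trigram) == 3:
            seen.add(trigram)
        tokens.append(candidate)
    return tokens

rng = random.Random(0)
toy_vocab = ["a", "b", "c"]
generated = generate_unique_trigrams(lambda prefix: rng.choice(toy_vocab), length=12)
trigrams = list(zip(generated, generated[1:], generated[2:]))
print(generated)
```

The same accept/reject test can be extended with further checks, such as requiring that the bigram formed with the previous token occurs in the corpus.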

## Task 3 (4 or 6 p)

In this task you have to create a network which looks at the characters of a word and tries to guess whether the word is a noun, a verb, an adjective, and so on. To be more precise: the input is a word (without context), the output is a POS-tag (Part-of-Speech). Since some words are ambiguous, and we have no context, our network is supposed to return the set of possible tags.
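
Because the output is a set of tags, one common way to frame the targets is as a multi-hot vector with one entry per tag, with predictions decoded by thresholding per-tag scores. The sketch below assumes that framing; the helper names, the 0.5 threshold, and the three example pairs are illustrative and not part of the assignment data.

```python
from collections import defaultdict

def build_tag_sets(pairs):
    """Collect the set of tags observed for every (lowercased) word."""
    tag_sets = defaultdict(set)
    for word, tag in pairs:
        tag_sets[word.lower()].add(tag)
    return dict(tag_sets)

def multi_hot(tags, tag_list):
    """Encode a set of tags as a 0/1 vector ordered like `tag_list`."""
    return [1.0 if t in tags else 0.0 for t in tag_list]

def decode(scores, tag_list, threshold=0.5):
    """Turn per-tag scores in [0, 1] back into a set of tags."""
    return {t for t, s in zip(tag_list, scores) if s >= threshold}

# Made-up (word, tag) observations using Universal Dependencies tag names.
pairs = [("run", "VERB"), ("run", "NOUN"), ("quickly", "ADV")]
tag_list = ["ADV", "NOUN", "VERB"]

tag_sets = build_tag_sets(pairs)
target = multi_hot(tag_sets["run"], tag_list)
print(target)
print(decode([0.1, 0.8, 0.6], tag_list))
```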

The data is taken from the Universal Dependencies English corpus, and of course it contains errors, especially because not all possible tags occurred in the data.

Train a network (4p) or two networks (+2p) solving this task. Both networks should look at character n-grams occurring in the word. There are two options:

* **Fixed size:** for instance take 2-, 3-, and 4-character suffixes of the word and use them as features (with 1-hot encoding). You can also combine prefix and suffix features. Simple, useful trick: when looking at suffixes, add some '_' characters at the beginning of the word to guarantee that shorter words have suffixes of the desired length.

* **Variable size:** take for instance 4-grams (or 4-grams and 3-grams) and use a Deep Averaging Network. Simple trick: add an extra character at the beginning and at the end of the word, to add the information that an n-gram occurs at a special position ('ed' at the end has a slightly different meaning than 'ed' in the middle).
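
Both feature schemes can be sketched in a few lines. The '_' padding follows the trick described above; the boundary markers '^' and '$' and the function names are assumptions chosen for illustration.

```python
def suffix_features(word, sizes=(2, 3, 4), pad="_"):
    """Fixed-size features: one suffix per requested length, left-padded
    with `pad` so that short words still yield suffixes of every length."""
    padded = pad * max(sizes) + word
    return [padded[-n:] for n in sizes]

def char_ngrams(word, n=4, start="^", end="$"):
    """Variable-size features: all character n-grams of the word after
    adding boundary markers at both ends."""
    marked = start + word + end
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(suffix_features("is"))      # ['is', '_is', '__is']
print(char_ngrams("walked", 4))   # ['^wal', 'walk', 'alke', 'lked', 'ked$']
```

A word shorter than n - 2 characters yields no n-grams of size n in this sketch, which is one reason to combine several n-gram sizes.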


## Task 4 (5p)

Apply a seq2seq model (you can modify the code from this tutorial: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) to perform grapheme-to-phoneme conversion for English. Train the model on dev_cmu_dict.txt and test it on test_cmu_dict.txt. Report the accuracy of your solution using two metrics:
* exact match (how many words are perfectly converted to phonemes)
* exact match without stress (how many words are perfectly converted to phonemes when we remove the information about stress)
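
The two metrics can be computed as in the sketch below. It assumes that pronunciations are lists of CMU-style phoneme symbols in which stress is marked by a trailing digit on vowels (for example `AH0`, `OW1`); the function names and the two example words are illustrative.

```python
import re

def strip_stress(phonemes):
    """Remove stress digits, e.g. 'OW1' -> 'OW'."""
    return [re.sub(r"\d", "", p) for p in phonemes]

def exact_match(references, predictions, ignore_stress=False):
    """Fraction of words whose predicted phoneme sequence equals the reference."""
    hits = 0
    for ref, hyp in zip(references, predictions):
        if ignore_stress:
            ref, hyp = strip_stress(ref), strip_stress(hyp)
        hits += ref == hyp
    return hits / len(references)

# Made-up references and model outputs for two words.
refs = [["HH", "AH0", "L", "OW1"], ["K", "AE1", "T"]]
hyps = [["HH", "AH1", "L", "OW1"], ["K", "AE1", "T"]]

print(exact_match(refs, hyps))                      # 0.5
print(exact_match(refs, hyps, ignore_stress=True))  # 1.0
```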
