Name: Edvin Yang   
Email address: st124277@ait.ac.th

# Task 1 Dataset Acquisition

In this task, "crude" catetory under Reuters dataset from the NLTK(Natural Language Toolkit) library will be used. The dataset's original source is Reuters News Agency, which provided a collection of news articles for research purposes.

Source: https://www.nltk.org/


# Task 2 Model Training

1) Detailed steps for preprocessing text data are described in form of comments.
2) Model architecture and training processs are illustrated in following code snippet.

# Task 3 Web App

Web app is created with DASH and can be found in folder app. 

### 1. Library Imports and Configure Device:

In [1]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import torchtext, math
from sklearn.model_selection import train_test_split

In [2]:
# Configure the device for training (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### 2. Download datasets

In [3]:
import nltk
from nltk.corpus import reuters

# Download the 'reuters' dataset from the NLTK library.
nltk.download('reuters')
# Download the 'punkt' tokenizer models.It is used by NLTK for sentence tokenization for breaking text into individual sentences.
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### 3. Data Cleaning and Splitting:

In [4]:
# Define a function 'clean_text' that takes 'text' as input and performs text cleaning operations.
import re
def clean_text(text):
    text = text.replace('\n', ' ').replace('\r', ' ')
    text = re.sub(' +', ' ', text)
    return text

In [5]:
# Get a list of document IDs from the Reuters corpus.
documents = reuters.fileids()

In [6]:
# Select documents that belong to the 'crude' category and store them in 'crude_docs'.
crude_docs = [doc for doc in documents if 'crude' in reuters.categories(doc)]

# Clean and concatenate the raw text from selected 'crude' category documents into a single 'text' variable.
text = ' '.join([clean_text(reuters.raw(doc)) for doc in crude_docs])

In [7]:
# Tokenize the concatenated 'text' into sentences using NLTK's sentence tokenizer.
sentences = nltk.sent_tokenize(text)

In [8]:
random_seed = 42

# Split the 'sentences' dataset into training (70%), testing (15%), and validation (15%) sets.
train_data, temp_data = train_test_split(sentences, test_size=0.3, random_state=random_seed)
valid_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=random_seed)


print(f"Number of training set: {len(train_data)}")
print(f"Number of validation set: {len(valid_data )}")
print(f"Number of test set: {len(test_data)}")

Number of training set: 3276
Number of validation set: 702
Number of test set: 702


### 4. Data Tokenization and Vocabulary Building:

In [9]:
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [10]:
# Create a tokenizer for basic English text.
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

# Define a function 'tokenize_data' that takes an 'example' and a 'tokenizer' as input.
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example)}

In [11]:
# Tokenize the training, testing, and validation datasets using the tokenizer.
tokenized_train_data = [{'tokens': tokenizer(example)} for example in train_data]
tokenized_test_data = [{'tokens': tokenizer(example)} for example in test_data]
tokenized_validation_data = [{'tokens': tokenizer(example)} for example in valid_data]

In [12]:
# Combine tokenized datasets into suitable lists for building the vocabulary.
tokenized_train_dataset = [entry['tokens'] for entry in tokenized_train_data]
tokenized_test_dataset = [entry['tokens'] for entry in tokenized_test_data]
tokenized_validation_dataset = [entry['tokens'] for entry in tokenized_validation_data]

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big. Also we shall make sure to add unk and eos.

In [13]:
# Build a vocabulary from the tokenized training dataset.
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_train_dataset)
vocab.insert_token('<unk>', 0)
vocab.insert_token('<eos>', 1)  # '<eos>' is used to indicate the end of a sequence.
vocab.set_default_index(vocab['<unk>'])  # Set the default index for out-of-vocabulary words to '<unk>'.

print(len(vocab))

7404


In [14]:
print(vocab.get_itos()[:10])

['<unk>', '<eos>', '.', 'the', ',', 'to', 'of', 'in', 'said', 'and']


In [15]:
# Save the vocabulary to 'vocab.pt'.
torch.save(vocab, 'app/vocab.pt')

### 5. Data Batching Function:

In [16]:
# Define a function 'get_data'
def get_data(dataset, vocab, batch_size):
    data = []
    for example in dataset:
        if example:
            tokens = example.append('<eos>') # Add '<eos>' to the end of each sentence.
            tokens = [vocab[token] for token in example]  # Convert each word to its corresponding numerical index.
            data.extend(tokens)

    # Convert the 'data' list to a PyTorch LongTensor.
    data = torch.LongTensor(data)

    # Calculate the number of batches based on the desired 'batch_size'.
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    data = data.view(batch_size, num_batches)
    return data

In [17]:
batch_size = 32
train_data = get_data(tokenized_train_dataset, vocab, batch_size)
valid_data = get_data(tokenized_validation_dataset, vocab, batch_size)
test_data  = get_data(tokenized_test_dataset,  vocab, batch_size)

### 6. Model Definition:

In [18]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim

        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)

        self.init_weights()

    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_other)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #Wh

    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell

    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell

    def forward(self, src, hidden):
        #src: [batch_size, seq len]
        embedding = self.dropout(self.embedding(src)) #harry potter is
        #embedding: [batch-size, seq len, emb dim]
        output, hidden = self.lstm(embedding, hidden)
        #ouput: [batch size, seq len, hid dim]
        #hidden: [num_layers * direction, seq len, hid_dim]
        output = self.dropout(output)
        prediction =self.fc(output)
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

### 7. Model Configuration:

In [19]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65
lr = 1e-3

In [20]:
model      = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer  = optim.Adam(model.parameters(), lr=lr)
criterion  = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 31,964,396 trainable parameters


### 8. Batch Preparation Function:

In [21]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1
    return src, target

### 9. Training Function:

In [22]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):

    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, seq len]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]

    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)

    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()

        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]
        target = target.reshape(-1)
        loss = criterion(prediction, target)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

### 10. Evaluation Function:

In [23]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

### 11. Training Loop:

Follows very basic procedure. One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

In [24]:
n_epochs = 50
seq_len  = 50 #<----decoding length
clip    = 0.25

# Initializes a learning rate scheduler.
# The learning rate scheduler reduces the learning rate (lr) during training if the validation loss stops improving.
lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

# Enters a loop that iterates over the specified number of epochs
for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion,
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size,
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'app/best-val-lstm_lm.pt')

    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')



	Train Perplexity: 830.553
	Valid Perplexity: 603.404




	Train Perplexity: 503.127
	Valid Perplexity: 398.726




	Train Perplexity: 349.864
	Valid Perplexity: 318.659




	Train Perplexity: 280.327
	Valid Perplexity: 267.742




	Train Perplexity: 229.223
	Valid Perplexity: 235.110




	Train Perplexity: 193.302
	Valid Perplexity: 210.282




	Train Perplexity: 166.559
	Valid Perplexity: 194.332




	Train Perplexity: 144.612
	Valid Perplexity: 182.184




	Train Perplexity: 127.507
	Valid Perplexity: 172.989




	Train Perplexity: 112.697
	Valid Perplexity: 166.259




	Train Perplexity: 100.794
	Valid Perplexity: 163.566




	Train Perplexity: 89.998
	Valid Perplexity: 157.723




	Train Perplexity: 81.246
	Valid Perplexity: 155.146




	Train Perplexity: 73.027
	Valid Perplexity: 154.237




	Train Perplexity: 66.369
	Valid Perplexity: 151.237




	Train Perplexity: 59.948
	Valid Perplexity: 149.899




	Train Perplexity: 54.639
	Valid Perplexity: 149.659




	Train Perplexity: 50.031
	Valid Perplexity: 150.222




	Train Perplexity: 44.039
	Valid Perplexity: 150.681




	Train Perplexity: 39.974
	Valid Perplexity: 147.991




	Train Perplexity: 38.180
	Valid Perplexity: 147.373




	Train Perplexity: 36.944
	Valid Perplexity: 148.207




	Train Perplexity: 35.407
	Valid Perplexity: 147.996




	Train Perplexity: 34.586
	Valid Perplexity: 147.798




	Train Perplexity: 33.925
	Valid Perplexity: 148.174




	Train Perplexity: 33.901
	Valid Perplexity: 148.295




	Train Perplexity: 33.790
	Valid Perplexity: 148.318




	Train Perplexity: 33.694
	Valid Perplexity: 148.326




	Train Perplexity: 33.795
	Valid Perplexity: 148.348




	Train Perplexity: 33.708
	Valid Perplexity: 148.347




	Train Perplexity: 33.756
	Valid Perplexity: 148.353




	Train Perplexity: 33.640
	Valid Perplexity: 148.357




	Train Perplexity: 33.696
	Valid Perplexity: 148.358




	Train Perplexity: 33.730
	Valid Perplexity: 148.358




	Train Perplexity: 33.719
	Valid Perplexity: 148.359




	Train Perplexity: 33.627
	Valid Perplexity: 148.359




	Train Perplexity: 33.746
	Valid Perplexity: 148.359




	Train Perplexity: 33.677
	Valid Perplexity: 148.359




	Train Perplexity: 33.516
	Valid Perplexity: 148.359




	Train Perplexity: 33.727
	Valid Perplexity: 148.359




	Train Perplexity: 33.538
	Valid Perplexity: 148.359




	Train Perplexity: 33.682
	Valid Perplexity: 148.359




	Train Perplexity: 33.709
	Valid Perplexity: 148.359




	Train Perplexity: 33.658
	Valid Perplexity: 148.359




	Train Perplexity: 33.533
	Valid Perplexity: 148.360




	Train Perplexity: 33.679
	Valid Perplexity: 148.360




	Train Perplexity: 33.710
	Valid Perplexity: 148.360




	Train Perplexity: 33.680
	Valid Perplexity: 148.360




	Train Perplexity: 33.713
	Valid Perplexity: 148.360




	Train Perplexity: 33.516
	Valid Perplexity: 148.360


### 12. Testing:

In [25]:
model.load_state_dict(torch.load('app/best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 166.933


### 13. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions. We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word. We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get then we give that another try. Once we get we stop predicting.

We decode the prediction back to strings last lines.

In [26]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)

            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab

            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)
            prediction = torch.multinomial(probs, num_samples=1).item()

            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [27]:
prompt = 'Petrol gas is '
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer,
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
petrol gas is the first oil company , he said .

0.7
petrol gas is the four countries , including reuters .

0.75
petrol gas is half by much in the market .

0.8
petrol gas is half by much in the market .

1.0
petrol gas is half by much yesterday , al-wattari said .

