# COLX 563 Lab Assignment 4: Slot filling
## Assignment Objectives

In this lab, you will build an end-to-end system for basic (binary) intent recognition and slot filling in the context of a dialogue system. This is a team assignment that you will complete with your capstone team. You have nearly complete freedom with regard to your solution, subject to a few restrictions mentioned below.

## Getting Started

Add imports below.

In [26]:
#provided code
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset, RandomSampler, SequentialSampler
from tqdm import tqdm, trange
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [27]:
manual_seed = 11
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)
    # only query the GPU name when a GPU is actually available
    print(torch.cuda.get_device_name(0))

cuda
Tesla P100-PCIE-16GB


For this lab, you'll be working with the MultiWOZ dataset of goal-oriented dialogues (version 2.2). You can look at the full corpus [here](https://github.com/budzianowski/multiwoz/tree/master/data/MultiWOZ_2.2). The original annotation is impressively detailed, covering multiple turns and multiple goals; we have simplified it to just the initiating request (first turn), with two possible intents and the corresponding slots for those intents. Download the data from [github](https://github.ubc.ca/jungyeul/COLX_563_adv-semantics_lab_students/raw/master/Multiwoz.zip), unzip it into a directory outside of your lab repo, and change the path below.

In [28]:
#provided code
woz_directory ="/content/drive/MyDrive/Colab Notebooks/COLX_563_lab4/data/"

## Tidy Submission
rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the instructions

## Inspecting the data

Let's look at corresponding pairs of utterances and answers from the training portion of our corpus:

In [29]:
count = 0
with open(woz_directory + "WOZ_train_utt.txt") as f1:
    with open(woz_directory + "WOZ_train_ans.txt") as f2:
        while count < 20:
            print(f1.readline().strip())
            print(f2.readline().strip())
            print("------")
            count += 1

Guten Tag, I am staying overnight in Cambridge and need a place to sleep. I need free parking and internet.
find_hotel|hotel-area=centre|hotel-internet=yes|hotel-parking=yes
------
Hi there! Can you give me some info on Cityroomz?
find_hotel|hotel-name=cityroomz
------
I am looking for a hotel named alyesbray lodge guest house.
find_hotel|hotel-name=alyesbray lodge guest house
------
I am looking for a restaurant. I would like something cheap that has Chinese food.
find_restaurant|restaurant-food=chinese|restaurant-pricerange=cheap
------
I'm looking for an expensive restaurant in the centre if you could help me.
find_restaurant|restaurant-area=centre|restaurant-pricerange=expensive
------
I'm looking for a places to go and see during my upcoming trip to Cambridge.
find_hotel
------
Yeah, could you recommend a good gastropub?
find_restaurant|restaurant-food=gastropub
------
I want to find an expensive restaurant and serves european food. Can i also have the address, phone number and it

In [30]:
X_train = []
y_train = []
with open(woz_directory + "WOZ_train_utt.txt") as f1:
  for line in f1:
    line = line.strip()
    X_train.append(line)
with open(woz_directory + "WOZ_train_ans.txt") as f2:
  for line in f2:
    line = line.strip()
    line_lst = line.split('|')
    y_train.append(line_lst[0])

In [7]:
# X_train

In [31]:
X_dev = []
y_dev = []
with open(woz_directory + "WOZ_dev_utt.txt") as f1:
  for line in f1:
    line = line.strip()
    X_dev.append(line)
with open(woz_directory + "WOZ_dev_ans.txt") as f2:
  for line in f2:
    line = line.strip()
    line_lst = line.split('|')
    y_dev.append(line_lst[0])

In [None]:
len(X_train) == len(y_train)

True

In [None]:
# with open(woz_directory + "train.tsv", "w") as fout:
#   for text, label in zip(X_train, y_train):
#     fout.write(text + "\t" + label + "\n")

# with open(woz_directory + "dev.tsv", "w") as fout:
#   for text, label in zip(X_dev, y_dev):
#     fout.write(text + "\t" + label + "\n")

In [None]:
with open(woz_directory + "test.tsv", "w") as fout:
  with open(woz_directory + "WOZ_test_utt.txt", "r") as f:
    for line in f:
      line = line.strip()
      fout.write(line + "\t" + "find_hotel" + "\n")


In [12]:
! pip install transformers

Collecting transformers
Collecting sacremoses
Collecting tokenizers<0.11,>=0.10.1
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [13]:
! pip install sentencepiece

Collecting sentencepiece
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.95


In [14]:
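# import only the names used in this notebook instead of a wildcard import
from transformers import BertTokenizerFast, BertModel, AdamW, get_linear_schedule_with_warmup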

In [None]:
X_train[0]

'Guten Tag, I am staying overnight in Cambridge and need a place to sleep. I need free parking and internet.'

In [15]:
model_path = "bert-large-uncased"
# define label to number dictionary
lab2ind = {'find_hotel': 0, 'find_restaurant': 1}

# tokenizer from pre-trained BERT model
tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased',do_lower_case=True)

In [18]:
class CustomDataset(Dataset):
    # initialization
    def __init__(self, data_lst, labels, tokenizer, max_len, lab2ind):
        """
          data_lst: list of input utterances (strings)
          labels: list of intent labels, one per utterance
          tokenizer: Hugging Face BERT/RoBERTa tokenizer
          max_len: maximal length of input sequence
          lab2ind: dictionary mapping label names to class indices
        """
        self.tokenizer = tokenizer
        # self.data = dataframe
        self.comment_text = data_lst
        self.labels = labels
        self.max_len = max_len
        self.lab2ind = lab2ind

    # get the size of the dataset
    def __len__(self):
        return len(self.comment_text)

    # generate sample by index
    def __getitem__(self, index):
        # get ith sample and label
        comment_text = str(self.comment_text[index])
        label = str(self.labels[index])

        label = self.lab2ind[label]
        # use encode_plus() of Transformers to tokenize and vectorize the input sequence and convert it to tensors. 
        # this method truncates or pads the sequence to the maximal length and then returns PyTorch tensors. 
        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            return_tensors = "pt"
        )

        return {
            'ids': inputs['input_ids'],
            'mask': inputs['attention_mask'],
            'targets': torch.tensor(label, dtype=torch.long)
        }

In [24]:
def regular_encode(X_train, y_train, tokenizer, lab2ind, shuffle=True, num_workers = 2, batch_size=64, maxlen = 32, mode = 'train'): 
    '''
      X_train: list of utterances (strings)
      y_train: list of intent labels, one per utterance
      tokenizer: tokenizer method
      lab2ind: label-to-index dictionary
      shuffle: shuffle the dataset or not
      num_workers: number of worker processes used to load the data
      batch_size: number of examples per batch
      maxlen: maximal sequence length
      mode: the type of dataset, either 'train' or 'predict'
    '''
    # the data is passed in as lists, so no file needs to be loaded here;
    # in predict mode the labels are placeholders and are ignored downstream.
    if mode not in ('train', 'predict'):
        print("the type of mode should be either 'train' or 'predict'. ")
        return

    # instantiate the dataset instance 
    custom_set = CustomDataset(X_train, y_train, tokenizer, maxlen, lab2ind)
    
    dataset_params = {'batch_size': batch_size, 'shuffle': shuffle, 'num_workers': num_workers}

    batch_data_loader = DataLoader(custom_set, **dataset_params)
    # return a data iterator
    return batch_data_loader

In [20]:
tokenizer.encode_plus(X_train[1], add_special_tokens=True, padding='longest', return_token_type_ids=False, max_length=128, return_tensors="pt")['input_ids'][0]

tensor([  101,  7632,  2045,   999,  2064,  2017,  2507,  2033,  2070, 18558,
         2006,  2103,  9954,  2480,  1029,   102])

In [22]:
import os
import pandas as pd
batch_size = 32
max_seq_length = 32
num_epochs = 5
warmup_proportion = 0.1
learning_rate = 3e-4
max_grad_norm = 1.0


train_file = os.path.join(woz_directory, "train.tsv")
dev_file = os.path.join(woz_directory, "dev.tsv")
test_file = os.path.join(woz_directory, "test.tsv")

In [25]:
train_dataloader = regular_encode(X_train, y_train, tokenizer, lab2ind, shuffle=True, batch_size=batch_size, maxlen = max_seq_length)
validation_dataloader = regular_encode(X_dev,y_dev, tokenizer, lab2ind, shuffle=False, batch_size=batch_size, maxlen = max_seq_length)
# test_dataloader = regular_encode(test_file, tokenizer, lab2ind, shuffle=False, batch_size=batch_size, maxlen = max_seq_length)

In [None]:
class Bert_cls(nn.Module):

    def __init__(self, lab2ind, model_path, hidden_size):
        super(Bert_cls, self).__init__()
        self.model_path = model_path
        self.hidden_size = hidden_size
        self.bert_model = BertModel.from_pretrained(model_path, output_hidden_states=True, output_attentions=True)
        
        self.label_num = len(lab2ind)
        
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.hidden_size, self.label_num)

    def forward(self, bert_ids, bert_mask):
        outputs = self.bert_model(input_ids=bert_ids, attention_mask = bert_mask)
        pooler_output = outputs['pooler_output']
        attentions = outputs['attentions']
        
        x = self.dense(pooler_output)
        x = torch.tanh(x)
        x = self.dropout(x)
        fc_output = self.fc(x)

        return fc_output, attentions

In [None]:
dense = nn.Linear(1024, 1024).to(device)
dropout = nn.Dropout(0.1).to(device)
fc = nn.Linear(1024, 2).to(device)

In [None]:
bert_model = Bert_cls(lab2ind, 'bert-large-uncased', 1024).to(device)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(bert_model):,} trainable parameters')

The model has 336,193,538 trainable parameters


In [None]:
# Parameters:
lr = 2e-5
max_grad_norm = 1.0
epochs = 3
warmup_proportion = 0.1
num_training_steps  = len(train_dataloader) * epochs
num_warmup_steps = int(num_training_steps * warmup_proportion)

### In Transformers, optimizer and schedules are instantiated like this:
# Note: AdamW is a class from the huggingface library
# the 'W' stands for 'Weight Decay'
optimizer = AdamW(bert_model.parameters(), lr=lr, correct_bias=False)
# schedules
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler

# We use nn.CrossEntropyLoss() as our loss function. 
criterion = nn.CrossEntropyLoss()

In [None]:
def train(model, iterator, optimizer, scheduler, criterion):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        # Add batch to GPU
        # print(i, batch)
        input_ids = batch['ids'].to(device, dtype = torch.long)
        input_mask = batch['mask'].to(device, dtype = torch.long)
        labels = batch['targets'].to(device, dtype = torch.long)
        # Unpack the inputs from our dataloader
        # input_ids, input_mask, labels = batch
        # print(input_mask.shape)
        # print(input_ids.size())
        
        # both tensors come out of the dataset with shape [batch, 1, max_len]
        outputs, _ = model(input_ids.squeeze(1), input_mask.squeeze(1))
        # print(outputs.shape)

        loss = criterion(outputs, labels)
        # delete used variables to free GPU memory
        del batch, input_ids, input_mask, labels
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore
        optimizer.step()
        scheduler.step()
        epoch_loss += loss.cpu().item()
        optimizer.zero_grad()
    
    # free GPU memory
    if device.type == 'cuda':
        torch.cuda.empty_cache()

    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    all_pred=[]
    all_label = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            # Add batch to GPU
            batch = tuple(t.to(device) for t in batch.values())
            # Unpack the inputs from our dataloader
            input_ids, input_mask, labels = batch
            
            outputs, _ = model(input_ids.squeeze(1), input_mask.squeeze(1))
            
            loss = criterion(outputs, labels)

            # delete used variables to free GPU memory
            del batch, input_ids, input_mask
            epoch_loss += loss.cpu().item()

            # identify the predicted class (highest logit) for each example in the batch
            _, predicted = torch.max(outputs.cpu().data, 1)
            # put all the true labels and predictions into two lists of plain ints
            all_pred.extend(predicted.tolist())
            all_label.extend(labels.cpu().tolist())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return epoch_loss / len(iterator), accuracy, f1score

In [None]:
import os
save_path = '/content/drive/MyDrive/Colab Notebooks/COLX_563_lab4/ckpt'
os.makedirs(save_path, exist_ok=True)

In [None]:
# Train the model
loss_list = []
acc_list = []

for epoch in trange(epochs, desc="Epoch"):
    train_loss = train(bert_model, train_dataloader, optimizer, scheduler, criterion)  
    val_loss, val_acc, val_f1 = evaluate(bert_model, validation_dataloader, criterion)

    # Create checkpoint at end of each epoch
    state = {
        'epoch': epoch,
        'state_dict': bert_model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict()
        }

    torch.save(state, os.path.join(save_path, "BERT_" + str(epoch+1) + ".pt"))

    print('\n Epoch [{}/{}], Train Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}, Validation F1: {:.4f}'.format(epoch+1, epochs, train_loss, val_loss, val_acc, val_f1))




 Epoch [1/3], Train Loss: 0.0011, Validation Loss: 0.0257, Validation Accuracy: 0.9951, Validation F1: 0.9951
 Epoch [2/3], Train Loss: 0.0009, Validation Loss: 0.0257, Validation Accuracy: 0.9951, Validation F1: 0.9951
 Epoch [3/3], Train Loss: 0.0009, Validation Loss: 0.0257, Validation Accuracy: 0.9951, Validation F1: 0.9951





In [None]:


# test labels are unknown, so "find_hotel" is only a placeholder (as in test.tsv)
with open(woz_directory + "WOZ_test_utt.txt") as f:
    X_test = [line.strip() for line in f]
test_dataloader = regular_encode(X_test, ["find_hotel"] * len(X_test), tokenizer, lab2ind, shuffle=False)


In [None]:
all_pred = []
# all_probs = []
softmax = nn.Softmax(dim=1)
with torch.no_grad():

    for batch in test_dataloader:

        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch.values())
        # Unpack the inputs from our dataloader
        input_ids, input_mask, labels = batch

        outputs,_ = bert_model(input_ids.squeeze(1), input_mask.squeeze(1))
        

        # delete used variables to free GPU memory
        del batch, input_ids, input_mask

        # identify the predicted class for each example in the batch
        prob_dist = softmax(outputs.cpu().data)
        _, predicted = torch.max(prob_dist, 1)
        # collect the predictions in a single list of plain ints
        all_pred.extend(predicted.tolist())

In [None]:
ind2label = {0: 'find_hotel', 1: 'find_restaurant'}
i = 0
with open('/content/drive/MyDrive/Colab Notebooks/COLX_563_lab4/data/WOZ_test_ans_predict.txt', "w") as fout:
  for pred in all_pred:
    fout.write(ind2label[int(pred)]+"\n")

In [None]:
len(all_pred)

400

Each utterance is a request for information about either hotels or restaurants. The answer starts with the intent (either find_restaurant or find_hotel) and then lists the slots that have been filled in based on the utterance. Your goal is to generate this string of intent and slots based purely on the utterance. A few things to note:

* Not all slots are filled in, and sometimes there are no slots filled in at all (but there is always an intent).
* There is a fixed number of slots for each intent, and when they are filled in, they always appear in a particular order.
* The slot values sometimes, but not always, correspond to what appears in the utterance. For example, a mention of wanting wifi in the request becomes hotel-internet=yes.
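
The answer format described above can be taken apart and put back together with a couple of small helpers. The following is a minimal sketch, not part of the provided code: `parse_answer` and `format_answer` are hypothetical names, and emitting slots in alphabetical order is an assumption that matches the examples printed earlier but should be verified against the full training set.

```python
def parse_answer(answer):
    """Split 'intent|slot=value|...' into (intent, {slot: value})."""
    parts = answer.strip().split("|")
    intent = parts[0]
    slots = {}
    for part in parts[1:]:
        # partition keeps any '=' inside the value intact
        name, _, value = part.partition("=")
        slots[name] = value
    return intent, slots


def format_answer(intent, slots):
    """Rebuild the answer string.

    Assumption: slots are emitted in alphabetical order of slot name.
    """
    fields = [intent] + [f"{name}={slots[name]}" for name in sorted(slots)]
    return "|".join(fields)


intent, slots = parse_answer("find_restaurant|restaurant-food=chinese|restaurant-pricerange=cheap")
print(intent, slots)
print(format_answer(intent, slots))
```

Round-tripping every training answer through these two functions is a quick way to check whether the ordering assumption actually holds.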

We will be evaluating based on exact duplication of the entire output string, so before you start coding a solution, you should look carefully at examples in the training set and make sure you understand all the different components of the output and how they relate to the input utterance. In particular, you should identify the various constituent parts of the task, and judge which are likely to be easy and which are likely to be more difficult.
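
Because scoring is all-or-nothing per utterance, it can help to measure exact-match accuracy directly during development. This is a rough sketch under the assumption that gold answers and predictions are parallel lists of strings; `exact_match_accuracy` is a hypothetical helper, not the official scorer.

```python
def exact_match_accuracy(gold, predicted):
    """Fraction of predictions identical to the gold answer string."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    # a prediction counts only if the whole string (intent and all slots) matches
    correct = sum(g.strip() == p.strip() for g, p in zip(gold, predicted))
    return correct / len(gold)


gold = ["find_hotel|hotel-name=cityroomz", "find_restaurant|restaurant-food=gastropub"]
pred = ["find_hotel|hotel-name=cityroomz", "find_restaurant"]
print(exact_match_accuracy(gold, pred))
```

Note that a correct intent with one missing slot scores the same as a completely wrong answer, which is why per-component error analysis on the training set is still worthwhile.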

## Solution
rubric={accuracy:10,quality:5,efficiency:3}

You will build a system that, when provided with an utterance, predicts the appropriate intent and slots in the format used in the provided answers. This is an open-ended problem and you may solve it however you like, with the following restrictions:

* Your solution should include at least one of the token-level prediction models used in Labs 1-3 of this course, i.e. you should make use of a CRF, an LSTM, or a BERT model. You may use multiple models.
* You may use basic NLP tools (tokenizer, POS, parser) and unsupervised resources such as word embeddings, but you should NOT use an existing NER system, or any additional labeled data for this task.
* Your solution should be appropriately decomposed into parts, and documented. This is a complex enough problem that you should have several functions. You may wrap things up into a single class if you like, but you don't have to.
* Use the provided assert to test `dev_predicted`, the output of your complete model on the dev set. You will need to pass the assert to get full accuracy points. 
* Though you may use dev *accuracy* to guide the development of your model, you should not look at either utterances or answers for the dev (or test) set when developing your model. Limit your inspection of the data (e.g. for the purposes of error analysis) to the training set.
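
One possible way to obtain training data for a token-level model is to project slot values onto the utterance as BIO tags. The sketch below is only an illustration: `bio_tags` is a hypothetical helper, it assumes whitespace-separated tokens with punctuation already split off, and it only tags values that appear verbatim (case-insensitively) in the utterance. Values such as `hotel-internet=yes` that never appear literally are left untagged and would need a different mechanism, for example a classifier or rules.

```python
def bio_tags(tokens, slots):
    """Tag tokens with B-/I-<slot> where a slot value appears verbatim."""
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for name, value in slots.items():
        value_tokens = value.lower().split()
        n = len(value_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # do not overwrite tokens already claimed by another slot
            span_free = all(tag == "O" for tag in tags[start:start + n])
            if span_free and lowered[start:start + n] == value_tokens:
                tags[start] = "B-" + name
                for i in range(start + 1, start + n):
                    tags[i] = "I-" + name
                break  # tag only the first occurrence of this value
    return tags


tokens = "I am looking for a hotel named alyesbray lodge guest house".split()
print(list(zip(tokens, bio_tags(tokens, {"hotel-name": "alyesbray lodge guest house"}))))
```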

Other things to consider:

* You may want to build "standard" (non-sequential) ML classifiers for some aspects of this problem, but you don't have to!
* You may want to use appropriate lexicons. You can build them yourself, or find some.
* Rather than using statistical classifiers, you may want to use rule-based methods to solve some of the problems you're facing.
* You should probably do regular error analysis. Some kind of cross-validation on the training set is a good approach for this, or you can create another (inspectable) internal dev set by splitting up the training set.
* If you're looking for just a little bit more performance, don't forget to tune your hyperparameters!
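
As an illustration of the internal dev set idea, the training pairs can be shuffled once with a fixed seed and split. This is a minimal sketch: `split_train`, the 10% default, and the seed value are assumptions chosen for the example, not requirements of the lab.

```python
import random


def split_train(utterances, answers, dev_fraction=0.1, seed=11):
    """Split parallel lists into (train_pairs, internal_dev_pairs)."""
    if len(utterances) != len(answers):
        raise ValueError("utterances and answers must have the same length")
    indices = list(range(len(utterances)))
    # a seeded Random instance keeps the split reproducible across runs
    random.Random(seed).shuffle(indices)
    n_dev = int(len(indices) * dev_fraction)
    dev_idx, train_idx = indices[:n_dev], indices[n_dev:]
    train = [(utterances[i], answers[i]) for i in train_idx]
    dev = [(utterances[i], answers[i]) for i in dev_idx]
    return train, dev


utts = [f"utterance {i}" for i in range(10)]
answers = [f"answer {i}" for i in range(10)]
train_pairs, inner_dev_pairs = split_train(utts, answers, dev_fraction=0.2)
print(len(train_pairs), len(inner_dev_pairs))
```

Since this split comes entirely from the training data, its utterances and answers can be inspected freely for error analysis.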

## Report
rubric={raw:2,reasoning:3,writing:1}

Describe your system, and discuss your thinking about particular choices and any experiments you tried. Please talk about things you tried that didn't work, or things you thought of doing but didn't. Finally, discuss how each group member contributed to the project. As usual, there is an expectation that every group member will have made some significant contribution to the project. 

## Submit to Kaggle 
rubric={accuracy:2}

Run your system over the test data, and submit the result (in the same format as the train/dev answers) to the Kaggle competition. The competition is hosted [here](https://www.kaggle.com/c/mds-cl-2020-21-colx-563-lab-assignment-4). To get full points, you need to beat the public baseline. Please use the name of your capstone partner as your team name!


## Exercise: Kaggle competition (Optional)
rubric={raw:2}

As a team, compete to get the best result in the task. Since there are only 8 teams, the distribution of marks is a bit different than usual: only the top 3 groups will get bonus points. As usual, the rankings will be based on the score on the private leaderboard:


- 1st place: 2
- 2nd place: 1
- 3rd place: 0.5