# Assignment 3: Intent Classifier with Transformers

In assignment 2, you built a simple classifier with traditional machine learning methods. In this assignment, you will get hands-on experience with newer and larger pre-trained models, in particular Transformers, an architecture designed for sequence-to-sequence tasks that handles long-range dependencies well. Transformers compute representations of their input and output using a self-attention mechanism. For further reading about Transformers, please refer to [this well-written blog](https://nlp.seas.harvard.edu/2018/04/03/attention.html) or the [original paper](https://arxiv.org/abs/1706.03762). Secondly, you are going to create a similar intent dataset of your own based on the UW CSE course catalog.

This assignment mainly focuses on helping you get familiar with [PyTorch](https://pytorch.org/), an open source machine learning library based on the `Torch` library, and the [`transformers` library from Hugging Face](https://huggingface.co/transformers/), as well as on learning to create good-quality datasets. It is okay if you would like to continue with `TensorFlow`, as long as your write-up questions are correctly arranged.

Before you start writing any code, please read through this specification and make sure you understand the questions.

## Setting up your environment

This assignment is presented as a [Jupyter Notebook](https://ipython.org/notebook.html), making it easier for students without GPUs to use resources from [Google Colab](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or other platforms that provide free access to GPUs. However, if you prefer not to use Jupyter Notebook, please be sure to include your write-up as `hw2_writeup.pdf` in your repository.

### Installing Dependencies

In addition to the dependencies you installed in assignment 2, you should also install `PyTorch` from [this page](https://pytorch.org/); be sure to select the correct OS, package, and compute platform. If you are using a newer GPU such as the RTX 3090, you need to install specific versions of `PyTorch` and `CUDA`, for which [this page](https://lambdalabs.com/blog/install-tensorflow-and-pytorch-on-rtx-30-series/) may be helpful.

Then, you can install the `transformers` library with `pip` or `conda`.

### Using Colab/DeepNote

To use Colab or DeepNote, you can simply upload this notebook as well as your data/files and run it with Google/DeepNote's computing resources. When you are done with this notebook, simply click `File`->`Download` and save it to your repository. Be sure to select a GPU; otherwise, it may take hours to run on CPU.

For further instructions, see [this blog for `Colab`](https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c) and [this video](deepnote.com) for DeepNote.

## Tips

* Be sure to *select GPU* for your Colab/DeepNote (`Edit`->`Notebook Settings`); otherwise it will take hours to run the code.
* If you are using `Windows WSL` with `CUDA` and find it very slow, consider not using `WSL`, whose support for `CUDA` was limited (in addition, only `WSL2` supports `CUDA`).
* You may find documentation for `transformers` particularly useful for this assignment.
* Take care when creating the UW CSE course catalog dataset, because it is going to be used in the following assignments.
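
To confirm that the runtime actually sees a GPU before starting a long run, a quick check along these lines can help. This is a minimal sketch: the helper name `describe_device` is made up for illustration, and it simply reports a missing install if `torch` is not available yet.

```python
import importlib.util


def describe_device():
    """Return 'cuda', 'cpu', or 'torch-missing' for the current runtime."""
    # find_spec returns None when the package cannot be found
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print("Runtime device:", describe_device())
```

If this prints `cpu` on Colab or DeepNote, the GPU setting mentioned above has most likely not been applied to the current session.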


In [None]:
# Dependencies
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader
# Import transformers library here
# TODO: If you would like to use a model other than DistilBert, change the import here and the names in later parts
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
# The AdamW implementation in transformers is deprecated in favor of the PyTorch one
from torch.optim import AdamW
from tqdm import tqdm
import torch

import tools


## Part 1. Data preprocessing & Tokenization

In this section, you are going to

* Read in your train, validation, and test data.
* Convert the categories of data into ids.
* Tokenize your texts.
* Create an `IntentDataset` class for later use.
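
The label-to-id step can be pictured as follows. This is only a sketch of the idea with made-up intent names; `build_label_index` and `encode_labels` are hypothetical stand-ins, not the actual `tools.labels` and `tools.relabel` helpers, whose implementation may differ.

```python
def build_label_index(labels):
    """Return the sorted unique class names found in `labels`."""
    return sorted(set(labels))


def encode_labels(labels, classes):
    """Map each string label to its position in `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]


# Made-up example labels, for illustration only
raw_labels = ["weather", "greeting", "weather", "alarm"]
classes = build_label_index(raw_labels)
label_ids = encode_labels(raw_labels, classes)
print(classes)
print(label_ids)
```

Building the class list once from the training labels and reusing it for the validation and test splits keeps the ids consistent across all three.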

In [None]:
# TODO: Read data, replace the dummy file names with your file paths
train_texts, train_labels = tools.read_data("train")
val_texts, val_labels = tools.read_data("val")
test_texts, test_labels = tools.read_data("test")
train_texts = train_texts.tolist()
val_texts = val_texts.tolist()
test_texts = test_texts.tolist()

In [None]:
# Create integer class labels instead of strings
classes = tools.labels(train_labels).tolist()
train_labels = tools.relabel(train_labels, classes)
val_labels = tools.relabel(val_labels, classes)
test_labels = tools.relabel(test_labels, classes)

In [None]:
# You will need a tokenizer to translate human-readable text into machine-readable representations
# TODO: Initialize a tokenizer that converts your texts into encodings

# TODO: Use the tokenizer to get representations of your texts


### IntentDataset

Making your data preparation easier and more extensible will save much effort. The Dataset class provided by `PyTorch` is one of the tools that make data loading simpler.

In this part, you are going to create a [PyTorch Dataset class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) for the data. Your `IntentDataset` should inherit from `Dataset` and override the methods below.
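
These two methods form the contract that a `DataLoader` relies on for a map-style dataset: `__len__` reports how many samples exist and `__getitem__` returns the sample at a given index. The toy class below illustrates that contract with plain Python lists. `ToyPairDataset` is a hypothetical name used only here; your `IntentDataset` will instead build its samples from the tokenizer encodings.

```python
class ToyPairDataset:
    """Map-style dataset over parallel lists of texts and labels."""

    def __init__(self, texts, labels):
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = texts
        self.labels = labels

    def __getitem__(self, idx):
        # One sample is a dict, so fields can be looked up by name
        return {"text": self.texts[idx], "label": self.labels[idx]}

    def __len__(self):
        return len(self.labels)


toy = ToyPairDataset(["hello there", "set an alarm"], [1, 0])
print(len(toy), toy[1])
```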

In [None]:
class IntentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        """
        To support the indexing such that dataset[i] can be used to get the i-th sample
        """
        # TODO: Implement this
        
        return item

    def __len__(self):
        """
        Returns the size of the dataset.
        """
        # TODO: Implement this

        return length

In [None]:
# Turn the encodings and labels to a dataset object
train_dataset = IntentDataset(train_encodings, train_labels)
val_dataset = IntentDataset(val_encodings, val_labels)
test_dataset = IntentDataset(test_encodings, test_labels)

## Part 2. Initialize model and Train/Validate function

In this part, you are going to 
* Initialize a classification model from `transformers`.
* Implement the train function.
* Implement the validate function.
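
Both functions end by turning the model's output logits into predicted label ids: each row of logits holds one score per class, and the prediction is the index of the largest score. Below is a plain-Python sketch of that step with invented numbers; `logits_to_labels` is a hypothetical helper, and in the notebook you would typically apply an `argmax` over the class dimension of the logits instead.

```python
def logits_to_labels(logits):
    """Return the index of the highest score in each row of `logits`."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Two samples, three classes; the values are invented for illustration
batch_logits = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.9]]
predicted = logits_to_labels(batch_logits)
print(predicted)
```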

In [None]:
# TODO: Initialize model from pretrained. Make sure that your model is consistent with your tokenizer and that num_labels is set correctly


In [None]:
def train(dataloader, optimizer, device):
    """
    Train the model for one epoch
    Arguments:
        dataloader : the pytorch dataloader that parses data into tensors
        optimizer : to optimize the loss
        device : current device

    Returns:
        a list of predicted labels, a list of true labels, average loss
    """
    # Initialize the labels and total loss
    pred_labels = []
    true_labels = []
    tot_loss = 0.0

    # TODO: Implement this, following the comments as hints (each comment can be done in one line of code)
    # Let model enter training mode

    # Loop through each batch to process
    for batch in tqdm(dataloader, total=len(dataloader)):
        # Save original labels for evaluation
        
        # Set device for everything in batch
        
        # Clear previously calculated gradients
        
        # Feed items in batch into the forward pass
        
        # Record the loss and logits
        
        # Backward pass with loss
        
        # Add loss value into tot_loss
        
        # Proceed a step with optimizer
        
        # Move logits and labels to CPU
        
        # Convert these logits to list of predicted labels values.
        
        
    return pred_labels, true_labels, tot_loss / len(dataloader)

In [None]:
def validate(dataloader, device):
    """
    Validate the model for one epoch
    Arguments:
        dataloader : the pytorch dataloader that parses data into tensors
        device : current device

    Returns:
        a list of predicted labels, a list of true labels, average loss
    """
    pred_labels = []
    true_labels = []
    tot_loss = 0.0

    # TODO: Implement this, following the comments as hints. validate()
    # is going to be very similar to train()

    # Let model enter evaluation mode
    
    # Loop through each batch to process
    for batch in tqdm(dataloader, total=len(dataloader)):
        # Save original labels for evaluation
        
        # Set device for everything in batch
        
        # Do not compute gradients, to save memory and speed things up
        with torch.no_grad():
            # Feed items in batch into the forward pass
            
            # Record the loss and logits
            
            # Add loss value into tot_loss
            
            # Move logits and labels to CPU
            
            # Convert these logits to list of predicted labels values.
            

    return pred_labels, true_labels, tot_loss / len(dataloader)

## Part 3. Fine-tune your model

Good work! You have now completed the core parts of fine-tuning a model.
In this section, you will implement the final parts of the fine-tuning process.
Train and validate your model, recording the accuracy and loss along the way.
It may take some time to run this section, so please be patient and double check for typos before running.
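
Accuracy is the fraction of predictions that match the true labels, which is what `accuracy_score` from `sklearn.metrics` reports by default. A dependency-free sketch of the same computation, with invented labels (the function name `accuracy` is chosen only for this illustration):

```python
def accuracy(true_labels, pred_labels):
    """Return the fraction of positions where prediction equals truth."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)


# Three of the four invented predictions are correct
example_acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
print(example_acc)
```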

In [None]:
# Create optimizer, set your learning rate
optimizer = AdamW(model.parameters(), lr=5e-5)
# Set device variable
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# Create dataloader for train and validation dataset
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=True)
# Record losses and accuracies for each epoch
losses = {'train_loss':[], 'val_loss':[]}
accuracies = {'train_acc':[], 'val_acc':[]}
# Define the number of epochs you want to run
epoch_num = 3

model.to(device)
# TODO: Loop through each epoch, train and validate
for epoch in tqdm(range(epoch_num)):
    # TODO: Implement this. Each line has a comment and its beginning as a hint.
    print("Processing epoch ", epoch)
    # Fine-tune using the training set
    train_pred, train_labels, train_loss = 
    # Get validation results
    val_pred, val_labels, val_loss = 
    # Compute the accuracies using the results
    train_acc = 
    val_acc = 
    # Record the accuracies and losses for train and validation
    accuracies['train_acc'].append(train_acc)
    losses['train_loss'].append(train_loss)
    accuracies['val_acc'].append(val_acc)
    losses['val_loss'].append(val_loss)

print(losses)
print(accuracies)

## Part 4. Evaluation and Analysis
In this section, you are going to
* Plot the training and validation loss/accuracy with respect to epochs you ran.
* Compute the performance metrics (precision, F1, recall) of your _final model_.
* Compare these results with your model from Assignment 2.

No extra code is required in this section, but you should run it and observe the result.
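
For reference, the per-class numbers in the classification report follow the usual definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them for a single class on invented labels. `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a ratio is undefined is a choice made for this sketch.

```python
def precision_recall_f1(true_labels, pred_labels, target):
    """Compute precision, recall, and F1 for the class id `target`."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented labels: class 1 has 2 true positives, 1 false positive, 1 false negative
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
print(precision_recall_f1(y_true, y_pred, target=1))
```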

In [None]:
# Plot the loss with respect to epochs
plt.plot(losses['train_loss'], 'r--', label='train loss')
plt.plot(losses['val_loss'], 'b', label='validation loss')
plt.title("Loss wrt Epoch")
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc='upper right')
# Label the ticks 1..epoch_num so the plot stays correct if epoch_num changes
plt.xticks(list(range(epoch_num)), list(range(1, epoch_num + 1)))
plt.show()

# Plot the accuracies with respect to epochs
plt.plot(accuracies['train_acc'], 'r--', label='train accuracy')
plt.plot(accuracies['val_acc'], 'b', label='validation accuracy')
plt.title("Accuracy wrt Epoch")
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
# Label the ticks 1..epoch_num so the plot stays correct if epoch_num changes
plt.xticks(list(range(epoch_num)), list(range(1, epoch_num + 1)))
plt.show()

In [None]:
# Evaluate using the test set at the very end
# Note: You should only run this cell ONCE at the very end after you have completed any tuning
#       and training. 
#       It is inadvisable to develop your model against the test set as you can end up
#       inadvertently overfitting to the test data.
from sklearn.metrics import classification_report
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
pred_labels, true_labels, _ = validate(test_loader, device)
# Compute the evaluation report
report = classification_report(true_labels, pred_labels, labels=list(range(len(classes))), target_names=classes)
print()
print(report)

## Part 5. Answer the Following Questions

For each of the questions below, either write a short paragraph or report the requested metrics with clear annotation.

#### 1. What model and optimizer did you try?

YOUR ANSWER HERE

#### 2. How long did it take for you to fine-tune your model? How does it compare to assignment 2?

YOUR ANSWER HERE

#### 3. Report your general accuracy for train, validation, and test set here.

YOUR ANSWER HERE

#### 4. How did the performance compare to assignment 2? Why is that the case?

YOUR ANSWER HERE

#### 5. Did you observe any trends from the plot of loss/accuracy with respect to epochs?

YOUR ANSWER HERE


## Part 6. Create your own dataset

Look at the page of [UW CSE course catalog](https://www.cs.washington.edu/education/courses/) and review the intent dataset you have worked with. 
You would like to support 3 new intents called "cse_course_content", "cse_course_prerequisites" and "cse_course_id", which model a student asking questions about various aspects of the course offerings (what content a certain course covers, what prerequisites a certain course has, and what course ids cover a certain type of content).
Can you come up with some questions that a student may ask a virtual assistant for these intents? Valid questions should be answerable by a human with only the information on the course catalog page.

For example, a question about "cse_course_content" might be something like: 'What is CSE P 590B about?'

For this part, your goal is to create some training data for these new intents that you would like to support. Brainstorm at least 10 questions for _each_ intent, following the same format as the intent dataset you were working with. Create a new file `data/my_intents_train.json`.
Tips: Refer to the existing examples in the dataset provided for inspiration on how to come up with training examples. Remember, because the training data is concrete, even small variations like a different course id can constitute a separate question. For grading purposes we're OK with even small variations, though it's still a good idea to make some effort to come up with as many diverse phrasings as you can (like in the provided dataset) as this new data will be useful in future assignments.
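
A small sanity check can confirm that each new intent has at least 10 examples before you submit. The sketch below assumes the examples have already been loaded into `(text, intent)` pairs; that in-memory layout and the helper names `count_per_intent` and `missing_examples` are assumptions for illustration, so adapt the loading step to whatever format the provided dataset actually uses.

```python
from collections import Counter

REQUIRED_INTENTS = (
    "cse_course_content",
    "cse_course_prerequisites",
    "cse_course_id",
)
MIN_EXAMPLES = 10


def count_per_intent(examples):
    """Count how many (text, intent) pairs exist for each intent."""
    return Counter(intent for _, intent in examples)


def missing_examples(examples):
    """Return {intent: number of examples still needed} for short intents."""
    counts = count_per_intent(examples)
    return {
        intent: MIN_EXAMPLES - counts[intent]
        for intent in REQUIRED_INTENTS
        if counts[intent] < MIN_EXAMPLES
    }


# A single example taken from the text above; a real file would hold many more
sample = [("What is CSE P 590B about?", "cse_course_content")]
print(missing_examples(sample))
```

An empty dictionary means every required intent has reached the minimum.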

## Submission

For submission, along with the other files provided by this assignment, you should include a report `hw3.pdf`, which you can export from the notebook. This report should include a history of your fine-tuning process, the classification report, and any plots that were generated.

You can also use other means to create the report if you're not using a notebook as long as it has all the information included.

You should also make sure to include the training data you created in `data/my_intents_train.json`.

## Extra Credit: Implement a Model with A Custom Architecture or Train With Your New Intents

In this assignment, you created a classifier with `transformers` provided by `Huggingface`. However, you can also build everything from scratch and define your own architecture. As one option for extra credit, you can implement a simple alternative neural model such as a Recurrent Neural Network (RNN), or try a different layer setup. Include the code for your model and report the accuracy (train, dev, test) for the model you trained. The procedure will be very similar to the previous parts of this assignment, except this time you will be designing your own forward pass.

If you decide to do this part, [this classification tutorial provided by PyTorch](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html) may be helpful.

Another option for extra credit is to train with your new intents: you can either fine-tune a model with your new intent dataset or train a traditional model as you did in assignment 2. Include the code for your model and report the accuracy (train, dev, test) for the model you trained/fine-tuned.

In [None]:
# TODO: Your code here (Optional)

#### Report your test accuracy values here. (Optional)

YOUR ANSWER HERE