# Assignment 3: Intent Classifier with Transformers

In assignment 2, you have built a simple classifier with traditional machine learning methods. In this assignment, you are going to have a hands-on experience of newer and larger pre-trained models, particularly in Transformers, which is an architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Transformers compute representations of its input and output using its self-attention mechanism. For further reading about Transformers, please refer to [this well-written blog](https://nlp.seas.harvard.edu/2018/04/03/attention.html) or the [original paper](https://arxiv.org/abs/1706.03762). Secondly, you are going create a similar intent data set of your own based on UW CSE course catalogs.

This assignment will mainly focus helping you getting familiar with [Pytorch](https://pytorch.org/), an open source machine learning library based on the `Torch` library, and ['transformers' library from Huggingface](https://huggingface.co/transformers/), as well as learning to create good-quality datasets. It is okay if you would like to continue with `Tensorflow`, as long as you have your write-up questions correctly arranged.

Before you start writing any code, please read through this specification, understand the questions.

## Setting up your environment

This assignment will be presented in [Jupyter Notebook](https://ipython.org/notebook.html), making it easier for students without GPUs to utilize resources from [Google Colab](https://colab.research.google.com/), [DeepNote](https://deepnote.com/) or other platform that provide free access to GPU/devices. However, if you prefer to not use Jupyter Notebook, please be sure to include your write up file in `hw2_writeup.pdf` in your repository.

### Installing Dependencies

Following the dependencies you installed in assignment 2, you should also install `Pytorch` from [this page](https://pytorch.org/), be sure to select the correct OS, package, and compute platform. If you are using newer GPUs such as RTX3090, you need to install a specific version of `Pytorch` and `CUDA`, for which [this page](https://lambdalabs.com/blog/install-tensorflow-and-pytorch-on-rtx-30-series/) may be helpful.

Then, you can install `transformers` library with `pip` or `conda`.

### Using Colab/DeepNote

To use Colab or DeepNote, you can simply upload this notebook as well as your data/files and run it with Google/DeepNote's computing resources. When you are done with this notebook, simply click `File`->`Download` and save it to your repository. Be sure to select GPU, otherwise, it may takes hours to run on CPU.

For further instructions, see [this blog for `Colab`](https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c) and [this video](deepnote.com) for DeepNote.

## Tips

* Be sure to *select GPU* for your Colab/DeepNote(`Edit`->`Notebook Settings`), otherwise it will take hours for you to run the code.
* If you are using `Windows WSL` with `CUDA` and find it very slow, consider not using `WSL`, whose support for `CUDA` was limited(in addition, only `WSL2` support `CUDA`).
* You may find documentation for `transformers` particularly useful for this assignment.
* You may want to be careful on creating UW CSE course catalog datasets, because it is going to be used in the following assignments.


In [1]:
# Dependencies
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader
# Import transformers library here
# TODO: If you would like to use models other than DistilBert, you can change the import here and names in later part
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AdamW, AutoTokenizer
from tqdm import tqdm
import torch
from evaluate import load

import tools

In [2]:
torch.cuda.is_available()

True

In [3]:
torch.cuda.current_device()

0

In [4]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

## Part 1. Data preprocessing & Tokenization

In this section, you are going to

* Read in you train, validation, and test data.
* Convert the categories of data into ids.
* Tokenize your texts.
* Create a IntentDataset class for later use. 

In [5]:
train_texts, train_labels = tools.read_data("train")
val_texts, val_labels = tools.read_data("val")
test_texts, test_labels = tools.read_data("test")
train_texts = train_texts.tolist()
val_texts = val_texts.tolist()
test_texts = test_texts.tolist()

In [6]:
# Create integer class labels instead of strings
classes = tools.labels(train_labels).tolist()
train_labels = tools.relabel(train_labels, classes)
val_labels = tools.relabel(val_labels, classes)
test_labels = tools.relabel(test_labels, classes)

### IntentDataset

Making your data prepration easier and more extendable will save much effort. The Dataset class provided by `PyTorch` is one of the tools that make data loading simpler.

In this part, you are going to create a [PyTorch Dataset class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) for the data. Your `IntentDataset` should inherit `Dataset` and override below methods.

In [7]:
class IntentDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels


  def __getitem__(self, idx):
    """
    To support the indexing such that dataset[i] can be used to get the i-th sample
    """
#         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    item['label'] = torch.tensor(self.labels[idx])
    return item


  def __len__(self):
    """
    Returns the size of the dataset.
    """
    return len(self.labels)

## Part 2. Initialize model and Train/Validate function

In this part, you are going to 
* initialize a classification model from `transformers`.
* Implement the train function.
* Implement the validate function.

In [8]:
def compute_metrics(eval_pred):
  accuracy = load("accuracy")
  precision = load("precision")
  f1 = load("f1")
  recall = load("recall")
  
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  
  accuracy.compute(predictions=predictions, references=labels)
  precision.compute(predictions=predictions, references=labels, average="micro")
  f1.compute(predictions=predictions, references=labels, average="micro")
  recall.compute(predictions=predictions, references=labels, average="micro")
  
  return {"accuracy": accuracy, "precision": precision, "f1": f1, "recall": recall}

In [9]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

train_encodings = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
val_encodings = tokenizer(val_texts, padding=True, truncation=True, return_tensors="pt")
test_encodings = tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt")

# Turn the encodings and labels to a dataset object
train_dataset = IntentDataset(train_encodings, train_labels)
val_dataset = IntentDataset(val_encodings, val_labels)
test_dataset = IntentDataset(test_encodings, test_labels)

# model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(classes), problem_type="multi_label_classification").to('cuda') 
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(classes))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [10]:
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    optim="adamw_torch",
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    no_cuda=False,
    skip_memory_metrics=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [11]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,F1,Recall
1,4.4124,2.465194,"EvaluationModule(name: ""accuracy"", module_type: ""metric"", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """""" Args:  predictions (`list` of `int`): Predicted labels.  references (`list` of `int`): Ground truth labels.  normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.  sample_weight (`list` of `float`): Sample weights Defaults to None. Returns:  accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy. Examples:  Example 1-A simple example  >>> accuracy_metric = evaluate.load(""accuracy"")  >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])  >>> print(results)  {'accuracy': 0.5}  Example 2-The same as Example 1, except with `normalize` set to `False`.  >>> accuracy_metric = evaluate.load(""accuracy"")  >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)  >>> print(results)  {'accuracy': 3.0}  Example 3-The same as Example 1, except with `sample_weight` set.  >>> accuracy_metric = evaluate.load(""accuracy"")  >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])  >>> print(results)  {'accuracy': 0.8778625954198473} """""", stored examples: 0)","EvaluationModule(name: ""precision"", module_type: ""metric"", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """""" Args:  predictions (`list` of `int`): Predicted class labels.  references (`list` of `int`): Actual class labels.  labels (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`. If `average` is `None`, it should be the label order. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.  pos_label (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.  average (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.  - 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.  - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.  - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.  - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.  - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).  sample_weight (`list` of `float`): Sample weights Defaults to None.  zero_division (`int` or `string`): Sets the value to return when there is a zero division. Defaults to 'warn'.  - 0: Returns 0 when there is a zero division.  - 1: Returns 1 when there is a zero division.  - 'warn': Raises warnings and then returns 0 when there is a zero division. Returns:  precision (`float` or `array` of `float`): Precision score or list of precision scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher values indicate that fewer negative examples were incorrectly labeled as positive, which means that, generally, higher scores are better. Examples:  Example 1-A simple binary example  >>> precision_metric = evaluate.load(""precision"")  >>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])  >>> print(results)  {'precision': 0.5}  Example 2-The same simple binary example as in Example 1, but with `pos_label` set to `0`.  >>> precision_metric = evaluate.load(""precision"")  >>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0)  >>> print(round(results['precision'], 2))  0.67  Example 3-The same simple binary example as in Example 1, but with `sample_weight` included.  >>> precision_metric = evaluate.load(""precision"")  >>> results = precision_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])  >>> print(results)  {'precision': 0.23529411764705882}  Example 4-A multiclass example, with different values for the `average` input.  >>> predictions = [0, 2, 1, 0, 0, 1]  >>> references = [0, 1, 2, 0, 1, 2]  >>> results = precision_metric.compute(predictions=predictions, references=references, average='macro')  >>> print(results)  {'precision': 0.2222222222222222}  >>> results = precision_metric.compute(predictions=predictions, references=references, average='micro')  >>> print(results)  {'precision': 0.3333333333333333}  >>> results = precision_metric.compute(predictions=predictions, references=references, average='weighted')  >>> print(results)  {'precision': 0.2222222222222222}  >>> results = precision_metric.compute(predictions=predictions, references=references, average=None)  >>> print([round(res, 2) for res in results['precision']])  [0.67, 0.0, 0.0] """""", stored examples: 0)","EvaluationModule(name: ""f1"", module_type: ""metric"", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """""" Args:  predictions (`list` of `int`): Predicted labels.  references (`list` of `int`): Ground truth labels.  labels (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`, and the order of the labels if `average` is `None`. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.  pos_label (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.  average (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.  - 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.  - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.  - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.  - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.  - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).  sample_weight (`list` of `float`): Sample weights Defaults to None. Returns:  f1 (`float` or `array` of `float`): F1 score or list of f1 scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher f1 scores are better. Examples:  Example 1-A simple binary example  >>> f1_metric = evaluate.load(""f1"")  >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])  >>> print(results)  {'f1': 0.5}  Example 2-The same simple binary example as in Example 1, but with `pos_label` set to `0`.  >>> f1_metric = evaluate.load(""f1"")  >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0)  >>> print(round(results['f1'], 2))  0.67  Example 3-The same simple binary example as in Example 1, but with `sample_weight` included.  >>> f1_metric = evaluate.load(""f1"")  >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])  >>> print(round(results['f1'], 2))  0.35  Example 4-A multiclass example, with different values for the `average` input.  >>> predictions = [0, 2, 1, 0, 0, 1]  >>> references = [0, 1, 2, 0, 1, 2]  >>> results = f1_metric.compute(predictions=predictions, references=references, average=""macro"")  >>> print(round(results['f1'], 2))  0.27  >>> results = f1_metric.compute(predictions=predictions, references=references, average=""micro"")  >>> print(round(results['f1'], 2))  0.33  >>> results = f1_metric.compute(predictions=predictions, references=references, average=""weighted"")  >>> print(round(results['f1'], 2))  0.27  >>> results = f1_metric.compute(predictions=predictions, references=references, average=None)  >>> print(results)  {'f1': array([0.8, 0. , 0. ])}  Example 5-A multi-label example  >>> f1_metric = evaluate.load(""f1"", ""multilabel"")  >>> results = f1_metric.compute(predictions=[[0, 1, 1], [1, 1, 0]], references=[[0, 1, 1], [0, 1, 0]], average=""macro"")  >>> print(round(results['f1'], 2))  0.67 """""", stored examples: 0)","EvaluationModule(name: ""recall"", module_type: ""metric"", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """""" Args: - **predictions** (`list` of `int`): The predicted labels. - **references** (`list` of `int`): The ground truth labels. - **labels** (`list` of `int`): The set of labels to include when `average` is not set to `binary`, and their order when average is `None`. Labels present in the data can be excluded in this input, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order. Defaults to None. - **pos_label** (`int`): The class label to use as the 'positive class' when calculating the recall. Defaults to `1`. - **average** (`string`): This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.  - `'binary'`: Only report results for the class specified by `pos_label`. This is applicable only if the target labels and predictions are binary.  - `'micro'`: Calculate metrics globally by counting the total true positives, false negatives, and false positives.  - `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.  - `'weighted'`: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. Note that it can result in an F-score that is not between precision and recall.  - `'samples'`: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification). - **sample_weight** (`list` of `float`): Sample weights Defaults to `None`. - **zero_division** (): Sets the value to return when there is a zero division. Defaults to .  - `'warn'`: If there is a zero division, the return value is `0`, but warnings are also raised.  - `0`: If there is a zero division, the return value is `0`.  - `1`: If there is a zero division, the return value is `1`. Returns: - **recall** (`float`, or `array` of `float`): Either the general recall score, or the recall scores for individual classes, depending on the values input to `labels` and `average`. Minimum possible value is 0. Maximum possible value is 1. A higher recall means that more of the positive examples have been labeled correctly. Therefore, a higher recall is generally considered better. Examples:  Example 1-A simple example with some errors  >>> recall_metric = evaluate.load('recall')  >>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1])  >>> print(results)  {'recall': 0.6666666666666666}  Example 2-The same example as Example 1, but with `pos_label=0` instead of the default `pos_label=1`.  >>> recall_metric = evaluate.load('recall')  >>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1], pos_label=0)  >>> print(results)  {'recall': 0.5}  Example 3-The same example as Example 1, but with `sample_weight` included.  >>> recall_metric = evaluate.load('recall')  >>> sample_weight = [0.9, 0.2, 0.9, 0.3, 0.8]  >>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1], sample_weight=sample_weight)  >>> print(results)  {'recall': 0.55}  Example 4-A multiclass example, using different averages.  >>> recall_metric = evaluate.load('recall')  >>> predictions = [0, 2, 1, 0, 0, 1]  >>> references = [0, 1, 2, 0, 1, 2]  >>> results = recall_metric.compute(predictions=predictions, references=references, average='macro')  >>> print(results)  {'recall': 0.3333333333333333}  >>> results = recall_metric.compute(predictions=predictions, references=references, average='micro')  >>> print(results)  {'recall': 0.3333333333333333}  >>> results = recall_metric.compute(predictions=predictions, references=references, average='weighted')  >>> print(results)  {'recall': 0.3333333333333333}  >>> results = recall_metric.compute(predictions=predictions, references=references, average=None)  >>> print(results)  {'recall': array([1., 0., 0.])} """""", stored examples: 0)"


Trainer is attempting to log a value of "EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 

Trainer is attempting to log a value of "EvaluationModule(name: "recall", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
- **predictions** (`list` of `int`): The predicted labels.
- **references** (`list` of `int`): The ground truth labels.
- **labels** (`list` of `int`): The set of labels to include when `average` is not set to `binary`, and their order when average is `None`. Labels present in the data can be excluded in this input, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order. Defaults to None.
- **pos_label** (`int`): The class label to use as the 'positive class' when calculating the recall. Defaults to `1`.
- **average** (`string`): This parameter 

TypeError: cannot pickle '_thread.lock' object

In [None]:
accuracies = []
test_results = trainer.evaluate(test_dataset)
accuracies.append(test_results["eval_accuracy"])

## Part 3. Fine-tune your model

Good work! You have now completed the core parts of fine-tuning model. 
In this section, will be implementing the final parts of the fine-tuning process. 
Train and validate your model recording the accuracy and loss along the way.
It may take some time to run this section, so please be patient and double check for typos before running.

## Part 4. Evaluation and Analysis
In this section, you are going to
* Plot the training and validation loss/accuracy with respect to epochs you ran.
* Compute the performance metrics (precision, F1, recall) of your _final model_.
* Compare these results with your model from Assignment 2.

No extra code is required in this section, but you should run it and observe the result.

In [None]:
# Plot the loss with respect to epoches
plt.plot(losses['train_loss'], 'r--', label='train loss')
plt.plot(losses['val_loss'], 'b', label='validation loss')
plt.title("Loss wrt Epoch")
plt.xlabel('Epoches')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.xticks([0, 1, 2], [1, 2, 3])
plt.show()

# Plot the accuricies with respect to epoches
plt.plot(accuracies['train_acc'], 'r--', label='train accuracy')
plt.plot(accuracies['val_acc'], 'b', label='validation accuracy')
plt.title("Accuracy wrt Epoch")
plt.xlabel('Epoches')
plt.ylabel('Accuracy')
plt.legend()
plt.xticks([0, 1, 2], [1, 2, 3])
plt.show()

In [None]:
# Evaluate using the test set at the very end
# Note: You should only run this cell ONCE at the very end after you have completed any tuning
#       and training. 
#       It is inadvisible to develop your model against the test set as you can end up
#       inadvertantly overfitting to the test data. 
from sklearn.metrics import classification_report
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
pred_labels, true_labels, _ = validate(test_loader, device)
# Compute evaluate report
report = classification_report(true_labels, pred_labels, labels=[i for i in range(len(classes))], target_names=classes)
print()
print(report)

## Part 5. Answer the Following Questions

For each the questions below, either write a short paragraph or report the metrics asked with clear annotation.

#### 1. What model and optimizer did you tried?

YOUR ANSWER HERE

#### 2. How long did it take for you to fine-tune your model? How does it compare to assignment 2?

YOUR ANSWER HERE

#### 3. Report your general accuracy for train, validation, and test set here.

YOUR ANSWRE HERE

#### 4. How was the performance compare to assignment 2? Why is it the case?

YOUR ANSWRE HERE

#### 5. Did you observe any trends from the plot of loss/accuracy with respect to epochs?

YOUR ANSWER HERE


## Part 6. Create your own dataset

Look at the page of [UW CSE course catalog](https://www.cs.washington.edu/education/courses/) and review the intent dataset you have worked with. 
You would like to support 3 new intents called "cse_course_content", "cse_course_prerequisites" and "cse_course_id" which models a student asking questions about various aspects (what content a certain course covers, what prequisites a certain course has, and what course ids cover a certain type of content) of the course offerings. 
Can you come up with some questions that a student may ask a virtual assistant for these intents? Valid questions should be answerable by a human with only the information on the course catalog page.

For example, a question about "cse_course_content" might be something like: 'What is CSE P 590B about?'

For this part, your goal is to create some training data for these new intents that you would like to support. Brainstorm at least 10 questions for _each_ intent, following the same format as the intent dataset you were working with. Create a new file `data/my_intents_train.json`.
Tips: Refer to the existing examples in the dataset provided for inspiration on how to come up with training examples. Remember, because the training data is concrete, even small variations like a different course id can constitute a separate question. For grading purposes we're OK with even small variations, though it's still a good idea to make some effort to come up with as many diverse phrasings as you can (like in the provided dataset) as this new data will be useful in future assignments.

## Submission

For submission, along with other files provided by this assignment you should include a report `hw3.pdf` which you can export from

the notebook. This report which should include a history of your fine-tuning process, classification report and any plots that were generated. 

You can also use other means to create the report if you're not using a notebook as long as it has all the information included.

You should also make sure to include the training data you created in `data/my_intents_train.json`.

## Extra Credit: Implement a Model with A Custom Architecture or Train With Your New Intents

In this assignment, you created a classifier with `transformers` provided by `Huggingface`. However, you can also build everything from scratch and define your own architecture. For an option of extra credit, you can implement a simple alternative neural model such as a Recurrent Neural Network (RNN), or try some different layer setup. Include the code for your model and report the accuracy (train, dev, test) for the model you trained. The procedure will be very similar to the previous parts of this assignment, except this time you will be designing your own forward pass.

If you decide to do this part, [this classification tutorial provided by Pytorch](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html) may be helpful.

Another option of extra credit is to train with your new intents, you can either fine-tune some model with your new intent dataset or train some traditional models as you did in assignment 2. Include the code for your model and report the accuracy (train, dev, test) for the model you trained/fine-tuned.

In [None]:
# TODO: Your code here (Optional)

#### Report your test accuracy values here. (Optional)

YOUR ANSWER HERE