
# COLX 563 Lab Assignment 3: Question-Answering with BERT
## Assignment Objectives

In this lab, you will implement and train a (Distil)BERT model for question answering on a subset of the [SQuAD v2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset. Lab objectives include:

1. Convert the data to tensors using the BERT tokenizer
2. Train a model for Question-Answering by tuning on top of a pre-trained BERT model
3. Optimize the choice of start and end indices

We use DistilBERT in this lab because it is significantly smaller and faster than BERT while performing nearly as well. Even though we are using DistilBERT, we will call it BERT throughout this lab.

You will likely need a GPU for this lab, but the free version of Colab should be sufficient.  The code to mount your drive on Colab is provided.

In [1]:
import sys
!{sys.executable} -m pip install pulp
!{sys.executable} -m pip install transformers
from google.colab import drive
drive.mount('/content/gdrive')

Collecting pulp
  Downloading PuLP-2.8.0-py3-none-any.whl (17.7 MB)
Installing collected packages: pulp
Successfully installed pulp-2.8.0
Mounted at /content/gdrive


## Getting Started

Run the code below to access relevant modules (you can add to this as needed).

In [3]:
#provided code
import numpy as np
import torch
import pulp
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score

For this lab, you'll be working with the SQuAD dataset. Download the SQuAD data from [the repo](https://github.ubc.ca/MDS-CL-2023-24/COLX_563_adv-semantics_students/tree/master/data/lab3), unzip it into a directory outside of your lab repo, and change the path below. Later you will probably want to put the data on Google Drive and change this path so it points to your mounted data.

The question, context (also called the passage), and answer for a given set of QA training data are stored in separate files with corresponding line numbers. You should open up the data files to make sure you understand what they each represent.
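Because the three files are aligned by line number, reading them in parallel recovers one (question, context, answer) triple per line. The following is a minimal sketch, assuming files named `<split>.question`, `<split>.context`, and `<split>.answer` as used later in this lab. The helper name `read_split` and the toy file contents are made up for illustration.

```python
import os
import tempfile

def read_split(data_dir, split):
    """Return a list of (question, context, answer) triples for one split."""
    columns = []
    for suffix in ("question", "context", "answer"):
        path = os.path.join(data_dir, f"{split}.{suffix}")
        with open(path, encoding="utf-8") as f:
            # Line i of each file belongs to example i
            columns.append([line.rstrip("\n") for line in f])
    # All three files must have the same number of lines to stay aligned
    assert len({len(col) for col in columns}) == 1, "files are not parallel"
    return list(zip(*columns))

# Toy data standing in for the real SQuAD files (hypothetical contents)
demo_dir = tempfile.mkdtemp()
toy = {
    "question": ["Why?", "How?"],
    "context": ["I think it is because we can.", "It was done very well."],
    "answer": ["because we can", "very well"],
}
for suffix, lines in toy.items():
    with open(os.path.join(demo_dir, f"train.{suffix}"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

triples = read_split(demo_dir, "train")
print(triples[0])
```

Checking that the line counts agree before zipping is a cheap way to catch a misaligned download early.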

In [12]:
cd Lab3

/content/gdrive/MyDrive/Lab3


In [None]:
mv /content/gdrive/MyDrive/'Colab Notebooks'/Lab3.ipynb /content/gdrive/MyDrive/Lab3

In [4]:
#provided code
squad_path = '/content/gdrive/MyDrive/Lab3/'

In [23]:
from google.colab import files
uploaded = files.upload()

Saving test.question to test.question


In [14]:
device = torch.device("cuda")
torch.backends.cudnn.deterministic=True
print(device)

cuda


In [15]:
!{sys.executable} -m pip install torch


## Tidy Submission
rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the instructions

## Exercise 1: Initial Data Processing

### Exercise 1.1
rubric={accuracy:2}

Your first task is to write a function, `convert_to_BERT_tensors`, which uses the built-in BERT tokenizer to create tensors for input to the BERT model. You should call the tokenizer directly; see the tokenizer [docs](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__). This function should involve only two lines of code, but you need to get the arguments right. Your tokenization process must
* return PyTorch tensors corresponding to the input_ids and attention masks (which prevent BERT from attending to padding)
* combine questions and contexts into a single input with a separator token
* truncate when the question and context together are too long to work with BERT (longer than 512 tokens)
* add padding when the question and context together are too short
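To make the four requirements concrete, here is a toy, pure-Python sketch of the layout the tokenizer is expected to produce for one question/context pair. It is not the Hugging Face implementation: the helper `build_pair_input` is hypothetical, the word ids are made up, and it simplifies truncation by shortening only the context. The special ids (101 for [CLS], 102 for [SEP], 0 for padding) are those of the uncased BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special ids in the uncased BERT vocabulary

def build_pair_input(question_ids, context_ids, max_len):
    """Toy layout: [CLS] question [SEP] context [SEP], truncated, then padded.

    Assumes the question is short enough that question + 3 special tokens
    fits within max_len; only the context is truncated (a simplification).
    """
    room = max_len - len(question_ids) - 3  # 3 positions go to special tokens
    context_ids = context_ids[:max(room, 0)]
    ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    mask = [1] * len(ids)                   # 1 = real token
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding

# Short pair: padded up to max_len, and the padding is masked out
short_ids, short_mask = build_pair_input([7, 8], [9], max_len=8)
print(short_ids, short_mask)

# Long pair: the context is cut so the total length is exactly max_len
long_ids, long_mask = build_pair_input([7, 8], [5] * 10, max_len=8)
print(long_ids, long_mask)
```

In the real function, a single call to the tokenizer handles all of this for a whole list of pairs at once.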

In [16]:

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


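def convert_to_BERT_tensors(questions, contexts):
    '''takes parallel lists of question strings and context strings,
    returns the input_ids and attention_mask tensors'''
    # Passing two lists encodes each (question, context) pair as one sequence joined by [SEP];
    # truncation=True cuts inputs at the model maximum (512), padding=True pads to the longest in the batch
    encoded = tokenizer(questions, contexts, return_tensors='pt', truncation=True, padding=True)
    return encoded['input_ids'], encoded['attention_mask']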




In [17]:
test_questions = ["Why?", "How?"]
test_contexts = ["I think it is because we can bluminate", "It was done " + " ".join(["very"]*1000) + " well"]

ids, mask = convert_to_BERT_tensors(test_questions,test_contexts)
assert ids.shape == (2,512) # 512 because that's the max allowed
assert ids[0][3] == 102 # fourth token is separator
assert list(ids[0][-100:]) == [0]*100 # first row is mostly padding
assert list(ids[1][-100:]) != [0]*100 # second row is not
assert list(mask[0][-100:]) == [0]*100 # first row padding is masked
assert list(mask[1][-100:]) != [0]*100 # second row is not padding, no mask
print("Success!")

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Success!


### Exercise 1.2
rubric={accuracy:3, efficiency:1}

Creating tensors for the answers is a bit trickier. As our target for training, we want tensors of indices which correspond to the beginning and end of the answer span. Since BERT uses a special tokenizer and puts the question and the context together into a single input sequence, the original answer spans provided by SQuAD (which correspond to regular tokens in the context) aren't useful.

Instead you will write a function called `get_answer_span_tensor` which takes strings corresponding to a question, context, and answer and identifies the location of the answer in the vector corresponding to the input (question + context, including the required special BERT tokens \[CLS\] and \[SEP\]). To accomplish this, you will want to use the [tokenize](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.tokenize) method of the tokenizer, which just does the tokenization, not the vectorization (i.e. the creation of the tensor representation, already accomplished in Exercise 1.1). Apply `tokenize` to both the answer string and the input string, and then match the answer within the input string to get the indices. Note that the end index should be inclusive, i.e., if your indices were 34 and 36, that would correspond to a 3-word answer, not a 2-word answer.

Note that if the answer does not appear in the input, you should set the start and end indices both to zero. Remember that in Exercise 1.1 you are truncating the input to length 512, so if one of the indices falls outside of 512, you should treat it as a failed match (this is important: if you don't do this, your code will crash in Exercise 2!)
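The matching step itself is just a sub-list search over tokens. The sketch below illustrates it in plain Python on hand-written token lists rather than real tokenizer output. The helper name `find_answer_span` is hypothetical, and it assumes the final position of the truncated input is taken by the closing [SEP], so only the first `max_len - 1` tokens can hold answer text.

```python
def find_answer_span(input_tokens, answer_tokens, max_len=512):
    """Return (start, end) of the first match, end inclusive, or (0, 0)."""
    n = len(answer_tokens)
    if n == 0:
        return (0, 0)  # an empty answer counts as a failed match
    # Tokens past the truncation point are not visible to the model;
    # the last kept position is assumed to be the closing [SEP]
    visible = input_tokens[:max_len - 1]
    for start in range(len(visible) - n + 1):
        if visible[start:start + n] == answer_tokens:
            return (start, start + n - 1)
    return (0, 0)  # answer not found in the visible input

# Hand-written tokens for illustration only
tokens = ["[CLS]", "why", "?", "[SEP]", "i", "think", "it", "is",
          "because", "we", "can", "win"]
answer = ["because", "we", "can"]

print(find_answer_span(tokens, answer))              # indices 8 and 10: a 3-token answer
print(find_answer_span(tokens, ["because", "they"])) # no match
print(find_answer_span(tokens, answer, max_len=10))  # answer cut off by truncation
```

The last call shows why the truncation check matters: the answer exists in the full text but not in the part the model will see, so it must be reported as (0, 0).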

In [18]:
def get_answer_span_tensor(question_text, context_text, answer_text):
    '''returns a tensor [start, end] (end inclusive) locating the answer in the
    tokenized input, or [0, 0] if it is missing or falls outside 512 tokens'''
    # Tokenize question and context together, with the special tokens BERT expects
    input_tokens = tokenizer.tokenize('[CLS] ' + question_text + ' [SEP] ' + context_text)

    # Tokenize the answer text separately
    answer_tokens = tokenizer.tokenize(answer_text)
    span_length = len(answer_tokens)

    # An empty answer would trivially match at position 0; treat it as a failed match
    if span_length == 0:
        return torch.tensor([0, 0])

    # Only consider start positions whose span ends before the 512-token truncation point
    num_starts = min(len(input_tokens) - span_length + 1, 512 - span_length - 1)
    for i in range(num_starts):
        if input_tokens[i:i + span_length] == answer_tokens:
            # The end index is inclusive
            return torch.tensor([i, i + span_length - 1])

    # No match found
    return torch.tensor([0, 0])


In [19]:
test_question = "Why?"
test_context = "I think it is because we can bluminate"
test_answer = "because we can bluminate"
bad_answer  = "because we can fumiage"
span = get_answer_span_tensor(test_question,test_context,test_answer)
assert span.shape == (2,)
assert list(span) == [8,12]
span = get_answer_span_tensor(test_question,test_context,bad_answer)
assert list(span) == [0,0]
print('Success!')

Success!


### Exercise 1.3
rubric={accuracy:2, quality:1}

Now write code that builds a `QAdataset` (defined below) and a corresponding dataloader for each of the train, dev, and test splits with the provided `batch_size`.

In [20]:
#provided code
batch_size = 16

class QAdataset(Dataset):
    '''A dataset for housing QA data, including input_data, output_data, and padding mask'''
    def __init__(self, input_data, output_data,mask):
        self.input_data = input_data
        self.output_data = output_data
        self.mask = mask

    def __len__(self):
        return len(self.input_data)

    def __getitem__(self, index):
        target = self.output_data[index]
        data_val = self.input_data[index]
        mask = self.mask[index]
        return data_val,target,mask

You will want to open the corresponding question, context, and answer files for each split and use the functions from 1.1 and 1.2 to create tensors to be passed to the QAdataset constructor. Note that you don't have answers for the test split, but you can just default to a (0,0) target in that case. Don't be surprised if your dataloaders take on the order of 10 minutes to build (it is a large dataset!)

In [21]:
def prepare_QA_dataset(split):
    '''builds a QAdataset for one split ("train", "dev", or "test")'''
    # Questions and contexts are parallel files: line i of each belongs to example i
    with open(squad_path + split + ".question", encoding="utf-8") as file_q:
        questions = file_q.readlines()
    with open(squad_path + split + ".context", encoding="utf-8") as file_c:
        contexts = file_c.readlines()
    input_tensors, masks = convert_to_BERT_tensors(questions, contexts)

    if split in ("train", "dev"):
        with open(squad_path + split + ".answer", encoding="utf-8") as file_a:
            answers = file_a.readlines()
        # Locate each answer inside its tokenized (question + context) input
        spans = [get_answer_span_tensor(q, c, a)
                 for q, c, a in zip(questions, contexts, answers)]
    else:
        # The test split has no answers, so use a dummy (0, 0) target
        spans = [torch.tensor([0, 0])] * len(questions)
    return QAdataset(input_tensors, spans, masks)

In [None]:
!pip install tqdm



In [None]:
from tqdm import tqdm

In [24]:
train_dataset = prepare_QA_dataset("train")
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
dev_dataset = prepare_QA_dataset('dev')
dev_dataloader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)
test_dataset = prepare_QA_dataset('test')
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

In [25]:
pwd

'/content/gdrive/MyDrive/Lab3'

## Exercise 2: BERT Training


### Exercise 2.1
rubric={accuracy:2}

The Hugging Face library has a [BERT QA model](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering) that includes the main pre-trained BERT model as well as the QA heads, i.e. start and end embeddings, which are combined by dot product with BERT's output embedding for each token to get unnormalized scores indicating which token is most likely to be the start and which the end of the answer. There is a version that has already been fine-tuned on SQuAD, but we won't be using that here. Instead, we load the `DistilBertForQuestionAnswering` module with the general-purpose pre-trained DistilBERT model `distilbert-base-uncased` (which could be used for any task) and teach it to do QA. It will warn you that this model has not been trained for QA; don't worry about this! The initialization is provided for you below.

You need to set up an appropriate loss function and optimizer; for the latter, use Adam with a learning rate of 0.00003. Then iterate over the data using your dataloader, passing the inputs and masks to the forward function of the model in the usual way, and calculate the loss for both the start and end "logits" (the unnormalized scores for each token being the start/end of the answer), which form part of the output of the model. Print out the loss regularly; if everything is correct you should see it drop *very* fast. You only need to train your model for a single epoch here (you may want to increase this number later for the Kaggle competition).
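To see what "calculating the loss for both start and end logits" amounts to, the following pure-Python sketch computes the cross-entropy for a single toy example with five token positions. It is an illustration under simplifying assumptions, not the training code: the helper names are made up, and `nn.CrossEntropyLoss` performs the same computation on tensors and averages over the batch by default.

```python
import math

def log_softmax(logits):
    """Turn unnormalized scores into log-probabilities (numerically stable)."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def span_loss(start_logits, end_logits, true_start, true_end):
    """Cross-entropy of the gold start plus cross-entropy of the gold end."""
    start_term = -log_softmax(start_logits)[true_start]
    end_term = -log_softmax(end_logits)[true_end]
    return start_term + end_term

# Gold answer span covers tokens 1..2 (made-up logits)
confident = span_loss([0.0, 3.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 3.0, 0.0, 0.0], 1, 2)
uniform = span_loss([0.0] * 5, [0.0] * 5, 1, 2)
print(f"confident: {confident:.3f}  uniform: {uniform:.3f}")
```

With uniform logits over n positions the loss is 2 ln n, so a freshly initialized QA head over 512 positions should start near 2 ln 512 ≈ 12.5 and fall as the logits for the gold indices grow.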

In [27]:
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00003)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

epochs = 1

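model.to(device)
model.train()  # from_pretrained returns the model in eval mode; switch dropout back on for training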
for epoch in range(epochs):
    epoch_loss = 0
    batch_counter = 0
    for train_text_batch, train_span_batch, masks in train_dataloader:
        model.zero_grad()
        train_text_batch, train_span_batch, masks = train_text_batch.to(device), train_span_batch.to(device), masks.to(device)
        output = model(train_text_batch,attention_mask=masks)
        loss = loss_function(output.start_logits, train_span_batch[:,0])
        loss += loss_function(output.end_logits, train_span_batch[:,1])
        loss.backward()
        optimizer.step()
        batch_counter += 1
        if batch_counter % 10 == 0:
            print("Processed ", batch_counter*batch_size, "QA pairs of ", len(train_dataset))
            print("Last loss:", loss.item())
        epoch_loss += loss.item()
    print('After epoch:', epoch, 'Loss is:', epoch_loss)


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Processed  160 QA pairs of  77558
Last loss: 11.438800811767578
Processed  320 QA pairs of  77558
Last loss: 9.11563491821289
Processed  480 QA pairs of  77558
Last loss: 9.385068893432617
Processed  640 QA pairs of  77558
Last loss: 7.9191107749938965
Processed  800 QA pairs of  77558
Last loss: 8.244309425354004
Processed  960 QA pairs of  77558
Last loss: 6.816885948181152
Processed  1120 QA pairs of  77558
Last loss: 6.925061225891113
Processed  1280 QA pairs of  77558
Last loss: 6.212676048278809
Processed  1440 QA pairs of  77558
Last loss: 6.955818176269531
Processed  1600 QA pairs of  77558
Last loss: 6.984383583068848
Processed  1760 QA pairs of  77558
Last loss: 6.755611419677734
Processed  1920 QA pairs of  77558
Last loss: 4.889864921569824
Processed  2080 QA pairs of  77558
Last loss: 5.028020858764648
Processed  2240 QA pairs of  77558
Last loss: 5.240192890167236
Processed  2400 QA pairs of  77558
Last loss: 5.477156639099121
Processed  2560 QA pairs of  77558
Last loss:

### Exercise 2.2
rubric={accuracy:2}

Now run the trained classifier over the dev set and calculate the accuracy for each of the start and end predictions (independently). You should get above 60% performance for both. Don't forget to put the model in `eval` mode, with no gradients!
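Scoring the start and end predictions independently means taking the argmax of each row of logits and comparing it with the gold index, once for starts and once for ends. A small sketch on made-up logits, using hypothetical helper names and plain Python lists in place of tensors:

```python
def argmax(scores):
    """Index of the highest score (the first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)

def index_accuracy(logit_rows, gold_indices):
    """Fraction of rows whose argmax equals the gold index."""
    hits = sum(argmax(row) == gold for row, gold in zip(logit_rows, gold_indices))
    return hits / len(gold_indices)

# Toy logits for three examples with four token positions each
start_logits = [[0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]
end_logits = [[0.0, 0.1, 2.5, 0.3], [0.2, 1.1, 0.1, 0.0], [0.0, 0.1, 0.2, 3.0]]
gold_starts, gold_ends = [1, 0, 2], [2, 0, 3]

start_acc = index_accuracy(start_logits, gold_starts)
end_acc = index_accuracy(end_logits, gold_ends)
print(start_acc, end_acc)
```

Note that an example can have a correct start and a wrong end (or vice versa), which is why the two accuracies are reported separately.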

In [28]:
# Initialize empty lists to hold predicted and true start and end indices
pred_starts = []
true_starts = []
pred_ends = []
true_ends = []

# Set the model to evaluation mode (disables dropout)
model.eval()

# Disable gradient calculations to save memory and improve computation speed
with torch.no_grad():
    # Iterate over batches in the dataloader
    for dev_text_batch, dev_span_batch, masks in dev_dataloader:
        # Move input and mask tensors to the device (GPU or CPU)
        dev_text_batch, masks = dev_text_batch.to(device), masks.to(device)

        # Compute model output, including start and end logits for span prediction
        output = model(dev_text_batch, attention_mask=masks)

        # Extract start and end scores from the model's output
        start_scores = output.start_logits
        end_scores = output.end_logits

        # Retrieve true start and end indices from the batch data
        targets = dev_span_batch

        # Predict indices by finding the position of the highest score and convert to numpy arrays
        pred_starts.extend(torch.argmax(start_scores, dim=1).cpu().numpy())
        pred_ends.extend(torch.argmax(end_scores, dim=1).cpu().numpy())

        # Store the true start and end indices as numpy arrays
        true_starts.extend(targets[:, 0].cpu().numpy())
        true_ends.extend(targets[:, 1].cpu().numpy())

# Output the accuracy of start and end index predictions
print("Starts accuracy")
print(accuracy_score(true_starts, pred_starts))
print("Ends accuracy")
print(accuracy_score(true_ends, pred_ends))


Starts accuracy
0.6235052955244278
Ends accuracy
0.6506662111376836


## Exercise 3: Discrete optimization of answer spans
rubric={accuracy:3, efficiency:2, quality:1}

The model from exercise 2 independently predicts both start and end indices for the answer span. However, this is a case where there is a dependency between predictions that needs to be considered. In particular, it doesn't make sense to have the end index appear before the start index, or too long after it. You want to pick the highest probability pair that satisfies those basic constraints.

You should write a function select_best_answer_span() which takes three arguments:

* start_probs, a tensor of size num_examples x 512 giving the start-of-span log-probabilities.
* end_probs, a tensor of size num_examples x 512 giving the end-of-span log-probabilities.
* distance, the maximum distance between the start and end of the returned span.

For each example, you should compute the start and end indices for which the sum of log-probabilities is maximal, provided that the start index < the end index and the distance between the indices is <= distance.

Return a list of (start, end) pairs.  There are a few ways of doing this.  You can earn a bonus mark if you find a very efficient way.
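One possible efficient approach (a sketch, not necessarily the one the instructors have in mind) treats the problem as a sliding-window maximum: for each end index, the best partner is the highest-scoring start among the previous `distance` positions, which a monotonic deque tracks in O(n) per example instead of O(n × distance). The function names here are hypothetical, inputs are plain Python sequences, and the sketch assumes single-token answers are allowed (start == end).

```python
from collections import deque

def best_span(start_scores, end_scores, distance):
    """Return (start, end) maximising start_scores[start] + end_scores[end]
    subject to start <= end <= start + distance."""
    window = deque()  # candidate start indices, scores non-increasing
    best_score, best = float("-inf"), (0, 0)
    for end in range(len(end_scores)):
        # A new start beats every older candidate with a lower score
        while window and start_scores[window[-1]] < start_scores[end]:
            window.pop()
        window.append(end)
        # Discard the front candidate if it is now too far from this end
        if window[0] < end - distance:
            window.popleft()
        score = start_scores[window[0]] + end_scores[end]
        if score > best_score:
            best_score, best = score, (window[0], end)
    return best

def select_spans(start_rows, end_rows, distance):
    """Apply best_span to every example (row) in a batch."""
    return [best_span(s, e, distance) for s, e in zip(start_rows, end_rows)]

starts = [[0.1, 0.5, 0.2, 0.1, 0.1], [0.3, 0.2, 0.2, 0.1, 0.1]]
ends = [[0.4, 0.1, 0.3, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.6]]
print(select_spans(starts, ends, 2))  # [(1, 2), (2, 4)]
```

Each index enters and leaves the deque at most once, which is where the linear running time comes from. The exhaustive double loop is simpler and is a useful reference for checking that both give the same best score.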

In [29]:
def select_best_answer_span(start_probs, end_probs, distance):
    '''Given 2 matrices of (log-)probabilities for each index of a text being
    the start or end of an answer span, respectively, finds for each example
    the pair with the highest summed score by exhaustive search, under the
    restriction that the end index is at or after the start and no more than
    distance after it. Returns a list of (start index, end index) tuples,
    one per example'''
    # Get the number of examples and the sequence length from the shape of start probabilities
    num_examples, seq_length = start_probs.shape

    best_spans = []

    for i in range(num_examples):
        best_start, best_end = 0, 0
        best_score = float('-inf')

        for start in range(seq_length):
            for end in range(start, min(start + distance + 1, seq_length)):
                span_score = start_probs[i, start] + end_probs[i, end]

                if span_score > best_score:
                    best_start, best_end = start, end
                    best_score = span_score

        best_spans.append((best_start, best_end))

    return best_spans


In [30]:
test_starts = np.array([[0.1,0.5,0.2,0.1,0.1], [0.3,0.2,0.2,0.1,0.1]])
test_ends = np.array([[0.4,0.1,0.3,0.1,0.1], [0.1,0.1,0.1,0.1,0.6]])
assert select_best_answer_span(test_starts,test_ends,2) == [(1,2),(2,4)]
print("Success!")

Success!


## Exercise 4: Kaggle Competition

### Exercise 4.1
rubric={accuracy:2}

Use your trained model from Exercise 2 to predict answers for the test set. You will use your indices from `select_best_answer_span` to slice the input tensor and get a list of token ids, which you can pass to the [decode](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.decode) method of your tokenizer to get back to a string. Otherwise, you should be able to copy the dev prediction code from above. Create the usual output file (a file with header 'Id,Predicted', where 'Id' is a column of integers and 'Predicted' is the corresponding starting index, i.e. the "tags" of the past are the starting indices now) and submit to the Kaggle competition [here](https://www.kaggle.com/t/eb1a8df14a8c433a99200551852882f6). If you've done everything correctly you should be able to beat the baseline, which is required to get an A+ on this section.
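If the values written to the 'Predicted' column can contain commas or double quotes (decoded answer text often does), the standard `csv` module takes care of the quoting. The sketch below writes to an in-memory buffer for illustration; the helper name `write_predictions` and the sample predictions are made up.

```python
import csv
import io

def write_predictions(predictions, out_file):
    """Write one row per prediction under an 'Id,Predicted' header."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    for idx, prediction in enumerate(predictions):
        # Fields containing commas or quotes are quoted and escaped automatically
        writer.writerow([idx, prediction])

buffer = io.StringIO()
write_predictions(["paris", 'the "big" one, maybe'], buffer)
print(buffer.getvalue())
```

To write a real submission, pass a file opened with `open(path, "w", encoding="utf-8", newline="")` in place of the buffer.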

In [32]:
!pip install evaluate


In [34]:
!pip install --upgrade evaluate



In [35]:
# No import is needed here: select_best_answer_span is defined in Exercise 3 above.
# (The `evaluate` package does not export select_best_answer_span_v2.)

In [38]:
### Your code here
distance = 20  # maximum allowed gap between the start and end indices
model.eval()

answers = []
with torch.no_grad():
    for test_text_batch, test_span_batch, masks in test_dataloader:
        test_text_batch, masks = test_text_batch.to(device), masks.to(device)
        output = model(test_text_batch, attention_mask=masks)

        # Copy from Exercise 2.2
        start_scores = output.start_logits
        end_scores = output.end_logits

        # Start_probs and end_probs by `F.log_softmax` of start_scores and end_scores
        start_probs = F.log_softmax(start_scores, dim=1).cpu().numpy()
        end_probs = F.log_softmax(end_scores, dim=1).cpu().numpy()

        # Find the best spans using select_best_answer_span from Exercise 3
        spans = select_best_answer_span(start_probs, end_probs, distance)

        # Append text to answers
        for i in range(len(spans)):
            # Using `tokenizer.decode` for `test_text_batch` to append to `answers`
            text_ids = test_text_batch[i][spans[i][0]:spans[i][1]+1].cpu().numpy()
            decoded_text = tokenizer.decode(text_ids)
            answers.append(decoded_text)

# Save to the file
with open("/content/gdrive/MyDrive/Lab3/test_answers.txt", "w", encoding="utf-8") as fout:
    fout.write('Id,Predicted\n')
    for idx, pred in enumerate(answers):
        fout.write(str(idx) + ',"' + str(pred).replace('"', '""') + '"\n')

### Exercise 4.2 (Optional)
rubric={raw:2}

Now compete to improve your QA system. The two easiest ways to improve your score are to tune the hyperparameters and to use a more powerful version of BERT. You can do some research to find out which flavor is currently the best on this task.

As usual, points in this exercise will be given based on ranking in the *Private* leaderboard at the deadline.

- 1st: 2 points
- 2nd: 1.8 points
- 3rd: 1.6 points
- 4th: 1.4 points
- 5th: 1.2 points
- 6th: 1 point

Again, make sure we can duplicate your result by running your code.