<a href="https://colab.research.google.com/github/aislinblack/CS6120-NLP-Project/blob/main/albert/Albert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Answering with ALBERT

Based on the following Jupyter notebook: https://colab.research.google.com/github/spark-ming/albert-qa-demo/blob/master/Question_Answering_with_ALBERT.ipynb#scrollTo=1qfQAtRsMVl7

## Introduction to ALBERT





## 1.0 Setup

Let's check out what kind of GPU our friends at Google gave us. The notebook metadata is configured to request a P100 😃, although the run recorded below was assigned a Tesla T4.

In [1]:
!nvidia-smi

Sat Jul 30 21:25:33 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

First, we clone the Hugging Face `transformers` library from GitHub and check out a pinned commit.

In [2]:
!git clone https://github.com/huggingface/transformers \
&& cd transformers \
&& git checkout a3085020ed0d81d4903c50967687192e3101e770 

Cloning into 'transformers'...
remote: Enumerating objects: 102756, done.
remote: Counting objects: 100% (641/641), done.
remote: Compressing objects: 100% (245/245), done.
remote: Total 102756 (delta 393), reused 472 (delta 323), pack-reused 102115
Receiving objects: 100% (102756/102756), 96.06 MiB | 17.54 MiB/s, done.
Resolving deltas: 100% (75805/75805), done.
Note: checking out 'a3085020ed0d81d4903c50967687192e3101e770'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at a3085020e Added repetition penalty to PPLM example (#2436)


In [3]:
!pip install ./transformers
!pip install tensorboardX

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./transformers
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting tokenizers==0.0.11
  Downloading tokenizers-0.0.11-cp37-cp37m-manylinux1_x86_64.whl (4.7 MB)
     |████████████████████████████████| 4.7 MB 17.9 MB/s 
Collecting boto3
  Downloading boto3-1.24.42-py3-none-any.whl (132 kB)
     |████████████████████████████████| 132 kB 70.3 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |███████████████████████████████

## 2.0 Train Model

We could train our own model (the notebook linked above shows how), but that would take a really long time. Instead, Hugging Face lets us borrow a pretrained ALBERT model that was already trained on the SQuAD dataset.

The tutorial lets us know that it takes about 1.5 hours per epoch to train ALBERT on SQuAD because the dataset is so large.



## 3.0 Setup prediction code

Now we can use the Hugging Face library to make predictions with the model (the pretrained one, or your own if you trained it). Note that a lot of the code is pulled from `run_squad.py` in the Hugging Face repository, with all the training parts removed. This modified code lets us run predictions on questions and a context passed in directly as strings, rather than in the .json format used by the training/test set.

NOTE: if you decided to train your own model, change the flag `use_own_model` to `True`.
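
For intuition about what happens to the start and end logits that the prediction code in this section collects, here is a simplified sketch of span selection. It is not the library's implementation: `best_span` is a hypothetical helper, it assumes position 0 is the `[CLS]` token whose score stands for "no answer", and it skips the n-best list and the token-to-text mapping that `compute_predictions_logits` handles.

```python
def best_span(start_logits, end_logits, max_answer_length=30, null_threshold=0.0):
    """Return (start, end) of the highest-scoring span, or None for "no answer".

    Hypothetical helper for illustration only. Index 0 is assumed to be the
    [CLS] position, whose combined score represents the "no answer" option.
    """
    null_score = start_logits[0] + end_logits[0]
    best, best_score = None, float("-inf")

    for start in range(1, len(start_logits)):
        # Only consider spans of at most max_answer_length tokens.
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score

    # Predict "no answer" when the null score beats the best span by more
    # than the threshold (the role of null_score_diff_threshold).
    if best is None or null_score - best_score > null_threshold:
        return None
    return best


print(best_span([0.1, 2.0, 0.3, 0.1], [0.2, 0.1, 3.0, 0.5]))
print(best_span([5.0, 0.0, 0.0], [5.0, 0.0, 0.0]))
```

In the first call the span covering tokens 1 to 2 wins; in the second the null score dominates, which corresponds to the empty-string predictions a SQuAD v2 model can return.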


In [4]:
import os
import torch
import time
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features
)

from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample

from transformers.data.metrics.squad_metrics import compute_predictions_logits

# READER NOTE: Set this flag to use own model, or use pretrained model in the Hugging Face repository
use_own_model = False

if use_own_model:
  model_name_or_path = "/content/model_output"
else:
  model_name_or_path = "ktrapeznikov/albert-xlarge-v2-squad-v2"

output_dir = ""

# Config
n_best_size = 1
max_answer_length = 30
do_lower_case = True
null_score_diff_threshold = 0.0

def to_list(tensor):
    return tensor.detach().cpu().tolist()

# Setup model
config_class, model_class, tokenizer_class = (
    AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer)
config = config_class.from_pretrained(model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(
    model_name_or_path, do_lower_case=do_lower_case)
model = model_class.from_pretrained(model_name_or_path, config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

processor = SquadV2Processor()

def run_prediction(question_texts, context_text):
    """Answer each question against context_text; returns a dict of qas_id -> answer text."""
    examples = []

    for i, question_text in enumerate(question_texts):
        example = SquadExample(
            qas_id=str(i),
            question_text=question_text,
            context_text=context_text,
            answer_text=None,
            start_position_character=None,
            title="Predict",
            is_impossible=False,
            answers=None,
        )

        examples.append(example)

    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )

    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=10)

    all_results = []

    model.eval()  # inference mode (disables dropout); only needs setting once

    for batch in eval_dataloader:
        batch = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }

            example_indices = batch[3]

            outputs = model(**inputs)

            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                unique_id = int(eval_feature.unique_id)

                # Take row i of each output tensor as a plain Python list;
                # the two outputs are unpacked as start and end logits.
                start_logits, end_logits = [to_list(tensor[i]) for tensor in outputs]
                result = SquadResult(unique_id, start_logits, end_logits)
                all_results.append(result)

    output_prediction_file = "predictions.json"
    output_nbest_file = "nbest_predictions.json"
    output_null_log_odds_file = "null_predictions.json"

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size,
        max_answer_length,
        do_lower_case,
        output_prediction_file,
        output_nbest_file,
        output_null_log_odds_file,
        False,  # verbose_logging
        True,  # version_2_with_negative
        null_score_diff_threshold,
        tokenizer,
    )

    return predictions

Downloading:   0%|          | 0.00/717 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/760k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/235M [00:00<?, ?B/s]
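
`run_prediction` passes `max_seq_length=384` and `doc_stride=128` to `squad_convert_examples_to_features`, which means a long context is split into overlapping windows rather than truncated. The sketch below illustrates that idea only; `sliding_windows` is a hypothetical helper, and the real conversion also reserves room for the question and special tokens, which is ignored here.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end) token ranges covering a sequence with overlap.

    Hypothetical helper for illustration. `window` is the number of context
    tokens that fit in one model input, `stride` is how far the window start
    moves each step.
    """
    spans = []
    start = 0
    while start < num_tokens:
        length = min(num_tokens - start, window)
        spans.append((start, start + length))
        if start + length == num_tokens:
            break  # the last window reached the end of the context
        start += min(length, stride)
    return spans


# A made-up 1000-token context with the notebook's window and stride values.
for span in sliding_windows(1000, 384, 128):
    print(span)
```

Each window becomes a separate model input, so a long context with many questions produces many forward passes.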

## 4.0 Run predictions on the Covid QA set

Now for the fun part: testing the model on different inputs. The example here is pretty rudimentary, but `run_prediction` can be pointed at any context and any list of questions.

In [7]:
import pandas as pd



def read_test_set():
    df = pd.read_json("Covid-QA-more-focused.json")
    return df['data']


data = read_test_set()
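
The evaluation code in this section reads `paragraphs`, `context`, `qas`, `question`, `answers` and `text` from each item, which matches the SQuAD JSON layout. As a rough sketch of that layout, the snippet below builds a made-up one-question sample and walks it. The sample text, the `title`, `id`, `answer_start` and `is_impossible` fields, and the `iter_questions` helper are assumptions for illustration and are not taken from `Covid-QA-more-focused.json`.

```python
# Made-up sample in SQuAD-style layout (not real data from the Covid QA set).
sample = {
    "data": [
        {
            "title": "hypothetical article",
            "paragraphs": [
                {
                    "context": "Masks reduce the spread of respiratory droplets.",
                    "qas": [
                        {
                            "id": "q0",
                            "question": "What do masks reduce?",
                            "answers": [
                                {
                                    "text": "the spread of respiratory droplets",
                                    "answer_start": 13,
                                }
                            ],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ]
}


def iter_questions(dataset):
    """Yield (context, question, gold_answer_texts) for every question."""
    for item in dataset["data"]:
        for paragraph in item["paragraphs"]:
            for qa in paragraph["qas"]:
                gold_texts = [answer["text"] for answer in qa["answers"]]
                yield paragraph["context"], qa["question"], gold_texts


for context, question, gold_texts in iter_questions(sample):
    print(question, "->", gold_texts)
```

Unlike this helper, the evaluation loop in this notebook only looks at the first paragraph of each item.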

In [None]:
import time
import tensorflow as tf
print(tf.test.gpu_device_name())

num_right = 0  # credit whenever the prediction is a non-empty substring of a gold answer
total = 0

for item in data:
    start = time.time()

    # Only the first paragraph of each item is evaluated.
    paragraph = item["paragraphs"][0]
    context = paragraph["context"]
    questions = [qa["question"] for qa in paragraph["qas"]]
    answers = [qa["answers"] for qa in paragraph["qas"]]

    predictions = run_prediction(questions, context)

    # run_prediction uses str(i) as qas_id, so each key maps back to a question index.
    for key, predicted_text in predictions.items():
        gold_texts = [answer["text"] for answer in answers[int(key)]]
        # An empty prediction means "no answer". Since "" is a substring of
        # every string, it has to be excluded explicitly.
        if predicted_text and any(predicted_text in text for text in gold_texts):
            num_right += 1
        total += 1

    print(time.time() - start)

print("num-right:", num_right)
print("total:", total)

/device:GPU:0


convert squad examples to features: 100%|██████████| 10/10 [00:10<00:00,  1.01s/it]
add example index and unique id: 100%|██████████| 10/10 [00:00<00:00, 21709.65it/s]


112.06072545051575


convert squad examples to features: 100%|██████████| 2/2 [00:00<00:00,  2.65it/s]
add example index and unique id: 100%|██████████| 2/2 [00:00<00:00, 10058.28it/s]


26.2904851436615


convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00,  4.68it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 6096.37it/s]


8.694236040115356


convert squad examples to features: 100%|██████████| 10/10 [00:04<00:00,  2.21it/s]
add example index and unique id: 100%|██████████| 10/10 [00:00<00:00, 25282.12it/s]


111.30094766616821


convert squad examples to features: 100%|██████████| 8/8 [00:01<00:00,  4.36it/s]
add example index and unique id: 100%|██████████| 8/8 [00:00<00:00, 27962.03it/s]


70.97893834114075


convert squad examples to features: 100%|██████████| 5/5 [00:01<00:00,  3.15it/s]
add example index and unique id: 100%|██████████| 5/5 [00:00<00:00, 2209.39it/s]


57.73640322685242


convert squad examples to features: 100%|██████████| 4/4 [00:00<00:00, 23.06it/s]
add example index and unique id: 100%|██████████| 4/4 [00:00<00:00, 32201.95it/s]


8.230858564376831


convert squad examples to features: 100%|██████████| 11/11 [00:02<00:00,  5.46it/s]
add example index and unique id: 100%|██████████| 11/11 [00:00<00:00, 32286.45it/s]


84.41187691688538


convert squad examples to features: 100%|██████████| 8/8 [00:03<00:00,  2.52it/s]
add example index and unique id: 100%|██████████| 8/8 [00:00<00:00, 2257.58it/s]


111.15062499046326


convert squad examples to features: 100%|██████████| 5/5 [00:00<00:00,  5.80it/s]
add example index and unique id: 100%|██████████| 5/5 [00:00<00:00, 22995.09it/s]


38.42412066459656


convert squad examples to features: 100%|██████████| 29/29 [00:08<00:00,  3.40it/s]
add example index and unique id: 100%|██████████| 29/29 [00:00<00:00, 29674.27it/s]


317.0217218399048


convert squad examples to features: 100%|██████████| 4/4 [00:00<00:00, 18.29it/s]
add example index and unique id: 100%|██████████| 4/4 [00:00<00:00, 11915.64it/s]


10.752725601196289


convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 11.04it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 8371.86it/s]


4.3583807945251465


convert squad examples to features: 100%|██████████| 27/27 [00:02<00:00, 12.55it/s]
add example index and unique id: 100%|██████████| 27/27 [00:00<00:00, 62394.60it/s]


110.8208258152008


convert squad examples to features: 100%|██████████| 20/20 [00:04<00:00,  4.56it/s]
add example index and unique id: 100%|██████████| 20/20 [00:00<00:00, 9591.37it/s]


165.38205742835999


convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 24.63it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 525.01it/s]


1.8990294933319092


convert squad examples to features: 100%|██████████| 30/30 [00:05<00:00,  5.16it/s]
add example index and unique id: 100%|██████████| 30/30 [00:00<00:00, 32388.45it/s]


238.93355107307434


convert squad examples to features: 100%|██████████| 59/59 [00:37<00:00,  1.56it/s]
add example index and unique id: 100%|██████████| 59/59 [00:00<00:00, 19037.15it/s]


1121.712605714798


convert squad examples to features: 100%|██████████| 15/15 [00:07<00:00,  2.10it/s]
add example index and unique id: 100%|██████████| 15/15 [00:00<00:00, 19430.07it/s]


244.37357091903687


convert squad examples to features: 100%|██████████| 13/13 [00:01<00:00,  6.73it/s]
add example index and unique id: 100%|██████████| 13/13 [00:00<00:00, 34863.14it/s]


80.88931679725647


convert squad examples to features: 100%|██████████| 5/5 [00:01<00:00,  4.53it/s]
add example index and unique id: 100%|██████████| 5/5 [00:00<00:00, 6256.42it/s]


44.185272455215454


convert squad examples to features: 100%|██████████| 119/119 [03:54<00:00,  1.97s/it]
add example index and unique id: 100%|██████████| 119/119 [00:00<00:00, 7781.15it/s]


4635.500182151794


convert squad examples to features: 100%|██████████| 23/23 [00:10<00:00,  2.22it/s]
add example index and unique id: 100%|██████████| 23/23 [00:00<00:00, 21086.12it/s]


340.0011866092682


convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 17.20it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 8774.69it/s]


2.738431215286255


convert squad examples to features: 100%|██████████| 10/10 [00:01<00:00,  6.08it/s]
add example index and unique id: 100%|██████████| 10/10 [00:00<00:00, 25435.44it/s]


67.85349440574646


convert squad examples to features: 100%|██████████| 1/1 [00:01<00:00,  1.25s/it]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 3927.25it/s]


29.10935640335083


convert squad examples to features: 100%|██████████| 24/24 [00:06<00:00,  3.76it/s]
add example index and unique id: 100%|██████████| 24/24 [00:00<00:00, 30320.27it/s]


240.44512939453125


convert squad examples to features: 100%|██████████| 2/2 [00:01<00:00,  1.04it/s]
add example index and unique id: 100%|██████████| 2/2 [00:00<00:00, 6700.17it/s]


48.34503436088562


convert squad examples to features: 100%|██████████| 9/9 [00:00<00:00, 12.84it/s]
add example index and unique id: 100%|██████████| 9/9 [00:00<00:00, 36542.82it/s]


31.68479633331299


convert squad examples to features: 100%|██████████| 3/3 [00:00<00:00, 13.04it/s]
add example index and unique id: 100%|██████████| 3/3 [00:00<00:00, 14891.02it/s]


10.732195138931274


convert squad examples to features: 100%|██████████| 21/21 [00:03<00:00,  6.69it/s]
add example index and unique id: 100%|██████████| 21/21 [00:00<00:00, 39060.04it/s]


129.97586822509766


convert squad examples to features: 100%|██████████| 125/125 [01:24<00:00,  1.48it/s]
add example index and unique id: 100%|██████████| 125/125 [00:00<00:00, 18132.67it/s]


2482.013066291809


convert squad examples to features:   2%|▏         | 1/56 [00:59<54:40, 59.65s/it]
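
The evaluation loop in this section gives credit whenever the prediction is a substring of a gold answer. That is lenient: very short predictions score easily, and an empty prediction would match anything unless it is filtered out. As a point of comparison, here is a minimal sketch of SQuAD-style exact match and token-level F1. The helper names are hypothetical, and the normalization follows the usual SQuAD convention (lowercase, strip punctuation and articles) as an assumption rather than something this notebook uses.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)


def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as agreement; otherwise no credit.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# The substring check accepts an empty prediction; the stricter metrics do not.
print("" in "the novel coronavirus")
print(exact_match("", "the novel coronavirus"))
print(token_f1("novel coronavirus", "the novel coronavirus disease"))
```

With several gold answers per question, the usual approach is to take the maximum score over them.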

## So how does it perform on questions about the CDC's COVID advice website?

In [26]:
cdc_guidance = ""

SyntaxError: ignored