In [1]:
import time
from typing import List, Tuple, Union

import kscope
import torch
import torch.nn as nn
from datasets import load_dataset
from torch import cuda
from torch.utils.data import DataLoader
from transformers import (
    AutoTokenizer,
    PreTrainedTokenizer,
    PreTrainedTokenizerFast,
    T5ForConditionalGeneration,
    T5Tokenizer,
)
from transformers.modeling_outputs import Seq2SeqLMOutput
from transformers.tokenization_utils_base import BatchEncoding

# Setting seed to ensure reproducibility
torch.manual_seed(1776)

<torch._C.Generator at 0x7f91f0505770>

In this notebook, we'll perform a simple experiment. The question is:

_Do the discrete prompts learned to optimize T5's ability to perform the SST2 sentiment analysis task also improve performance for LLaMA-2-7B?_

__NOTE__: We performed discrete prompt optimization using an in-house implementation of AutoPrompt. While the implementation is not included in this iteration of the lab, if you are interested feel free to ask your facilitator about it and we can provide access to a repository containing the implementation.

In the optimization, the original seed prompt was 

    "Generate the sentiment of the next sentence. ". 

For T5, this prompt induced about 68% accuracy on the whole SST2 dataset, of which we consider a sample to perform the transferability measurements conducted here. After gradient-based search optimization, we produced the optimized prompt

    "tumour negative .05. Positive respins the Contains sentence. " 
    
which yielded an accuracy of about 81%. Running the optimization for additional iterations produced a final prompt of

    "childcare negative .05. Positive respins wSt Thank sentence." 
    
with an accuracy of 83%. Both of these prompts are difficult to read and it's not clear why these induce improved performance for T5, beyond the fact that they do incorporate the label space (Negative, Positive).

Let's determine whether either of these odd but apparently performant prompts also improves results for LLaMA-2-7B over the original.

First, let's set up the prompts.

In [2]:
initial_prompt = "Generate the sentiment of the next sentence. "
optimized_prompt_1 = "tumour negative .05. Positive respins the Contains sentence. "
optimized_prompt_2 = "childcare negative .05. Positive respins wSt Thank sentence. "

Next, we'll set up the tokenizers associated with the target language models.

For activation retrieval, we need to instantiate a tokenizer to obtain appropriate token indices for our labels.

__NOTE__: All LLaMA-2 models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

If you are on the cluster, the tokenizer may be loaded from `/model-weights/Llama-2-7b-hf`. Otherwise, you'll need to download the `config.json`, `tokenizer.json`, `tokenizer.model`, and `tokenizer_config.json` from there to your local machine.

In [3]:
llama_tokenizer = AutoTokenizer.from_pretrained("/Users/david/Desktop/LLaMA2_Tokenizer")
t5_tokenizer = T5Tokenizer.from_pretrained("google/t5-large-lm-adapt")
# Let's test out how the tokenizer works on an example sentence. Note that the token with ID = 1 is the
# Beginning of sentence token ("<s>")
llama_encoded_tokens = llama_tokenizer.encode("Hello this is a test")
print(f"LLaMA-2 Encoded Tokens: {llama_encoded_tokens}")

t5_encoded_tokens = t5_tokenizer.encode("Hello this is a test")
print(f"T5 Encoded Tokens: {t5_encoded_tokens}")

# If you ever need to move back from token ids, you can use tokenizer.decode or tokenizer.batch_decode
llama_decoded_tokens = llama_tokenizer.decode(llama_encoded_tokens)
print(f"LLaMA Decoded Tokens: {llama_decoded_tokens}")

t5_decoded_tokens = t5_tokenizer.decode(t5_encoded_tokens)
print(f"T5 Decoded Tokens: {t5_decoded_tokens}")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


LLaMA-2 Encoded Tokens: [1, 15043, 445, 338, 263, 1243]
T5 Encoded Tokens: [8774, 48, 19, 3, 9, 794, 1]
LLaMA Decoded Tokens: <s> Hello this is a test
T5 Decoded Tokens: Hello this is a test</s>


Set up some of the necessary variables for the inference below.

In [4]:
label_words = ["negative", "positive"]
# How big should inference batches be
batch_size = 10
# How many batches in total to process from the dataloader (batch_size*batches_to_sample = datapoints to process)
batches_to_sample = 20

# Determine whether cuda and a GPU are available to speed up processing
device = "cuda" if cuda.is_available() else "cpu"
print(f"Detected Device {device}")
softmax = nn.Softmax(dim=0)

Detected Device cpu


In [5]:
dataset = load_dataset("sst2", split="validation")
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
dataset[0]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


{'idx': 0,
 'sentence': "it 's a charming and often affecting journey . ",
 'label': 1}

To facilitate answer extraction, we need to know the token IDs for each of the labels we expect the model to respond with. In this case, we optimized the prompts for T5 to produce either "negative" or "positive." So we use the tokenizers to extract the IDs associated with these labels.

In [6]:
dummy_sentence = "I love this movie!"
print(llama_tokenizer(f"{initial_prompt}{dummy_sentence} negative")["input_ids"])
print(llama_tokenizer(f"{initial_prompt}{dummy_sentence} positive")["input_ids"])
print(t5_tokenizer(f"{initial_prompt}{dummy_sentence} negative </s>", add_special_tokens=False)["input_ids"])
print(t5_tokenizer(f"{initial_prompt}{dummy_sentence} positive </s>", add_special_tokens=False)["input_ids"])

[1, 3251, 403, 278, 19688, 310, 278, 2446, 10541, 29889, 306, 5360, 445, 14064, 29991, 8178]
[1, 3251, 403, 278, 19688, 310, 278, 2446, 10541, 29889, 306, 5360, 445, 14064, 29991, 6374]
[6939, 2206, 8, 6493, 13, 8, 416, 7142, 5, 27, 333, 48, 1974, 55, 2841, 1]
[6939, 2206, 8, 6493, 13, 8, 416, 7142, 5, 27, 333, 48, 1974, 55, 1465, 1]


Note that, for each tokenizer, the two outputs differ in only one position: the last token ID for LLaMA-2, and the second-to-last for T5, whose final ID of 1 is the end-of-sentence token `</s>`. So we infer that these IDs must be the tokens associated with the negative and positive labels. Let's confirm this below.

In [7]:
print(f"LLaMA Decode: {llama_tokenizer.decode([8178, 6374])}")
print(f"T5 Decode: {t5_tokenizer.decode([2841, 1465])}")

LLaMA Decode: negative positive
T5 Decode: negative positive
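
The visual comparison of the ID lists can also be done programmatically. The following is a minimal sketch that uses only the token IDs printed above; `differing_positions` is a hypothetical helper introduced for illustration and is not part of the lab code.

```python
from typing import List, Tuple


def differing_positions(ids_a: List[int], ids_b: List[int]) -> List[Tuple[int, int, int]]:
    """Return (position, id_a, id_b) for every position where the two sequences differ."""
    if len(ids_a) != len(ids_b):
        raise ValueError("Sequences must have the same length to be compared position by position.")
    return [(i, a, b) for i, (a, b) in enumerate(zip(ids_a, ids_b)) if a != b]


# Token IDs copied from the tokenizer output above
llama_negative = [1, 3251, 403, 278, 19688, 310, 278, 2446, 10541, 29889, 306, 5360, 445, 14064, 29991, 8178]
llama_positive = [1, 3251, 403, 278, 19688, 310, 278, 2446, 10541, 29889, 306, 5360, 445, 14064, 29991, 6374]
t5_negative = [6939, 2206, 8, 6493, 13, 8, 416, 7142, 5, 27, 333, 48, 1974, 55, 2841, 1]
t5_positive = [6939, 2206, 8, 6493, 13, 8, 416, 7142, 5, 27, 333, 48, 1974, 55, 1465, 1]

print(differing_positions(llama_negative, llama_positive))
print(differing_positions(t5_negative, t5_positive))
```

If a label word were split into several tokens by a different tokenizer, the sequences would differ in length or in more than one position, and a single-ID lookup like the one used in this notebook would no longer be sufficient.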


In [8]:
llama_label_tokens = {"negative": [8178], "positive": [6374]}
t5_label_tokens = {"negative": [2841], "positive": [1465]}

We'll start by measuring the performance of LLaMA-2-7B with the initial prompt.

In [9]:
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)
client.model_instances

ConnectionError: HTTPConnectionPool(host='llm.cluster.local', port=3001): Max retries exceeded with url: /models/instances (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa918c35070>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

The model requires a generation configuration to run, but we're going to be extracting the label likelihoods rather than generating. So the only setting we really care about is `max_tokens = 1`. For a discussion of the configuration parameters see: `src/reference_implementations/prompting_vector_llms/CONFIG_README.md`

In [None]:
model = client.load_model("llama-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)

generation_config = {"max_tokens": 1}
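
The polling loop above waits indefinitely if the model never becomes active. As a hedged alternative, here is a small sketch of a polling helper with a timeout. `wait_for_state` and its parameters are hypothetical names introduced for illustration; the helper only assumes that the state can be read through a zero-argument callable.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str = "ACTIVE",
    timeout_s: float = 600.0,
    poll_interval_s: float = 1.0,
) -> float:
    """Poll get_state() until it returns target; return the seconds waited or raise TimeoutError."""
    start = time.monotonic()
    while True:
        state = get_state()
        if state == target:
            return time.monotonic() - start
        if time.monotonic() - start > timeout_s:
            raise TimeoutError(f"State is still {state!r} after {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Stand-in for a model that becomes active on the third poll
fake_states = iter(["LAUNCHING", "LOADING", "ACTIVE"])
print(wait_for_state(lambda: next(fake_states), poll_interval_s=0.0) >= 0.0)
```

In this notebook the call might look like `wait_for_state(lambda: model.state)`, assuming `model.state` is refreshed on each access, as the loop above implies.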

In [None]:
# We're interested in the activations from the last layer of the model, because this will allow us to calculate the
# likelihoods
last_layer_name = model.module_names[-1]
last_layer_name

'decoder.output_projection'

In [9]:
def create_prompt_str(instruction: str, sentences: List[str]) -> List[str]:
    return [f"{instruction}{sentence}" for sentence in sentences]

Let's take a look at the prompts we're creating

In [10]:
example_batch = next(iter(dataloader))
example_prompts_initial = create_prompt_str(initial_prompt, example_batch["sentence"])
example_prompts_optimized_1 = create_prompt_str(optimized_prompt_1, example_batch["sentence"])
example_prompts_optimized_2 = create_prompt_str(optimized_prompt_2, example_batch["sentence"])
print(example_prompts_initial[0])
print(example_prompts_optimized_1[0])
print(example_prompts_optimized_2[0])

Generate the sentiment of the next sentence. the son 's room is a triumph of gentility that earns its moments of pathos . 
tumour negative .05. Positive respins the Contains sentence. the son 's room is a triumph of gentility that earns its moments of pathos . 
childcare negative .05. Positive respins wSt Thank sentence. the son 's room is a triumph of gentility that earns its moments of pathos . 


In [12]:
def select_label_from_activations_llama(label_token_ids: torch.Tensor, layer_matrix: torch.Tensor) -> int:
    # The activations we care about are the last token (corresponding to our label token) and the values for our label
    # vocabulary
    label_activations = layer_matrix[-1][label_token_ids].float()
    # Softmax is not strictly necessary, but it helps to contextualize the "probability" the model associates with each
    # label relative to the others
    label_distributions = softmax(label_activations)
    # We select the label index with the largest value and return it as a plain int, matching the annotation
    max_label_index = torch.argmax(label_distributions)
    return int(max_label_index.item())
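
To make the selection step concrete without needing the model, here is a pure-Python sketch of the same idea: take the logits at the last position, keep only the entries for the label tokens, and pick the largest. The toy vocabulary, logit values, and the `select_label` helper are made up for illustration.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_label(last_token_logits: List[float], label_token_ids: List[int]) -> int:
    """Return the index (into label_token_ids) of the label with the largest logit."""
    label_logits = [last_token_logits[token_id] for token_id in label_token_ids]
    probabilities = softmax(label_logits)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Toy vocabulary of 6 tokens; IDs 2 and 4 stand in for "negative" and "positive" (made-up values)
toy_logits = [0.1, -1.0, 1.5, 0.3, 2.5, -0.2]
toy_label_ids = [2, 4]
print(softmax([toy_logits[i] for i in toy_label_ids]))
print(select_label(toy_logits, toy_label_ids))
```

Because softmax is monotonic, the selected index is the same with or without it. The normalization only makes the two scores readable as a distribution restricted to the label tokens.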

In [14]:
report: List[Tuple[str, float]] = []

In [25]:
correct = 0
total = 0
label_token_ids = torch.Tensor([llama_label_tokens["negative"], llama_label_tokens["positive"]]).long()
for batch_num, batch in enumerate(dataloader):
    prompts = create_prompt_str(initial_prompt, batch["sentence"])
    labels = batch["label"]
    assert len(prompts) == len(labels)
    # Because activation retrieval is a somewhat heavy operation, we just do batches of 1
    for prompt, label in zip(prompts, labels):
        activations = model.get_activations([prompt], [last_layer_name], generation_config)
        last_layer_matrix = activations.activations[0][last_layer_name]
        predicted_label = select_label_from_activations_llama(label_token_ids, last_layer_matrix)
        if predicted_label == int(label.item()):
            correct += 1
        total += 1
    print(f"Batch number {batch_num+1} Complete")
    if batch_num + 1 == batches_to_sample:
        break
accuracy = correct / total
report.append((initial_prompt, accuracy))

Batch number 1 Complete
Batch number 2 Complete
Batch number 3 Complete
Batch number 4 Complete
Batch number 5 Complete
Batch number 6 Complete
Batch number 7 Complete
Batch number 8 Complete
Batch number 9 Complete
Batch number 10 Complete
Batch number 11 Complete
Batch number 12 Complete
Batch number 13 Complete
Batch number 14 Complete
Batch number 15 Complete
Batch number 16 Complete
Batch number 17 Complete
Batch number 18 Complete
Batch number 19 Complete
Batch number 20 Complete


In [26]:
print(f"Accuracy: {accuracy}")

Accuracy: 0.6
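
The same evaluation loop is repeated for each prompt in this notebook. One way to avoid the duplication, sketched here under assumptions, is to factor the loop into a helper that takes the instruction and a prediction function. `evaluate_prompt`, `toy_batches`, and `toy_predict` are hypothetical names; in the notebook, the prediction function would wrap the activation retrieval and label selection shown above.

```python
from typing import Callable, Dict, Iterable, List


def evaluate_prompt(
    instruction: str,
    batches: Iterable[Dict[str, List]],
    predict: Callable[[str], int],
    max_batches: int,
) -> float:
    """Prepend instruction to each sentence, predict a label, and return the accuracy."""
    correct = 0
    total = 0
    for batch_num, batch in enumerate(batches):
        if batch_num == max_batches:
            break
        for sentence, label in zip(batch["sentence"], batch["label"]):
            correct += int(predict(f"{instruction}{sentence}") == int(label))
            total += 1
    return correct / total if total else 0.0


# Stand-in data and predictor, for illustration only
toy_batches = [
    {"sentence": ["a charming journey . ", "a dull mess . "], "label": [1, 0]},
    {"sentence": ["good fun . ", "bad pacing . "], "label": [1, 1]},
]


def toy_predict(prompt: str) -> int:
    return int("charming" in prompt or "good" in prompt)


print(evaluate_prompt("Generate the sentiment of the next sentence. ", toy_batches, toy_predict, max_batches=2))
```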


Now let's try both of our "optimized prompts"

In [27]:
correct = 0
total = 0
label_token_ids = torch.Tensor([llama_label_tokens["negative"], llama_label_tokens["positive"]]).long()
for batch_num, batch in enumerate(dataloader):
    prompts = create_prompt_str(optimized_prompt_1, batch["sentence"])
    labels = batch["label"]
    assert len(prompts) == len(labels)
    # Because activation retrieval is a somewhat heavy operation, we just do batches of 1
    for prompt, label in zip(prompts, labels):
        activations = model.get_activations([prompt], [last_layer_name], generation_config)
        last_layer_matrix = activations.activations[0][last_layer_name]
        predicted_label = select_label_from_activations_llama(label_token_ids, last_layer_matrix)
        if predicted_label == int(label.item()):
            correct += 1
        total += 1
    print(f"Batch number {batch_num+1} Complete")
    if batch_num + 1 == batches_to_sample:
        break
accuracy = correct / total
report.append((optimized_prompt_1, accuracy))

Batch number 1 Complete
Batch number 2 Complete
Batch number 3 Complete
Batch number 4 Complete
Batch number 5 Complete
Batch number 6 Complete
Batch number 7 Complete
Batch number 8 Complete
Batch number 9 Complete
Batch number 10 Complete
Batch number 11 Complete
Batch number 12 Complete
Batch number 13 Complete
Batch number 14 Complete
Batch number 15 Complete
Batch number 16 Complete
Batch number 17 Complete
Batch number 18 Complete
Batch number 19 Complete
Batch number 20 Complete


In [28]:
print(f"Accuracy: {accuracy}")

Accuracy: 0.555


In [29]:
correct = 0
total = 0
label_token_ids = torch.Tensor([llama_label_tokens["negative"], llama_label_tokens["positive"]]).long()
for batch_num, batch in enumerate(dataloader):
    prompts = create_prompt_str(optimized_prompt_2, batch["sentence"])
    labels = batch["label"]
    assert len(prompts) == len(labels)
    # Because activation retrieval is a somewhat heavy operation, we just do batches of 1
    for prompt, label in zip(prompts, labels):
        activations = model.get_activations([prompt], [last_layer_name], generation_config)
        last_layer_matrix = activations.activations[0][last_layer_name]
        predicted_label = select_label_from_activations_llama(label_token_ids, last_layer_matrix)
        if predicted_label == int(label.item()):
            correct += 1
        total += 1
    print(f"Batch number {batch_num+1} Complete")
    if batch_num + 1 == batches_to_sample:
        break
accuracy = correct / total
report.append((optimized_prompt_2, accuracy))

Batch number 1 Complete
Batch number 2 Complete
Batch number 3 Complete
Batch number 4 Complete
Batch number 5 Complete
Batch number 6 Complete
Batch number 7 Complete
Batch number 8 Complete
Batch number 9 Complete
Batch number 10 Complete
Batch number 11 Complete
Batch number 12 Complete
Batch number 13 Complete
Batch number 14 Complete
Batch number 15 Complete
Batch number 16 Complete
Batch number 17 Complete
Batch number 18 Complete
Batch number 19 Complete
Batch number 20 Complete


In [30]:
print(f"Accuracy: {accuracy}")

Accuracy: 0.615


### HuggingFace T5

Let's try these prompts in the context of the original T5 model from HuggingFace to verify that they indeed do "optimize" the prompt performance.

In [11]:
# Instantiate the model and set it to eval mode
t5_model = T5ForConditionalGeneration.from_pretrained("google/t5-large-lm-adapt").to(device).eval()

In [12]:
def create_encoder_decoder_inputs(
    prompts: List[str], t5_tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], device: str
) -> Tuple[BatchEncoding, BatchEncoding]:
    # Repeat each prompt twice (once for each label)
    repeated_prompts = [prompt for prompt in prompts for i in range(2)]
    # repeat label words for each repeated prompt
    decoder_labels = [f"{label_word} </s>" for label_word in label_words] * len(prompts)
    encoder_inputs = t5_tokenizer(
        repeated_prompts,
        truncation=True,
        padding="max_length",
        max_length=64,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(device)
    decoder_inputs = t5_tokenizer(
        decoder_labels,
        truncation=True,
        padding="max_length",
        max_length=16,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(device)
    return encoder_inputs, decoder_inputs
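
The alignment between the repeated prompts and the label targets is easy to get wrong, so here is a small pure-Python sketch of just that pairing, without any tokenization. `pair_prompts_with_labels` is a hypothetical helper that mirrors the two list constructions in the function above.

```python
from typing import List, Tuple


def pair_prompts_with_labels(prompts: List[str], label_words: List[str]) -> List[Tuple[str, str]]:
    """Repeat each prompt once per label word and pair it with that label's decoder target."""
    repeated_prompts = [prompt for prompt in prompts for _ in label_words]
    decoder_labels = [f"{label_word} </s>" for label_word in label_words] * len(prompts)
    return list(zip(repeated_prompts, decoder_labels))


for encoder_text, decoder_text in pair_prompts_with_labels(["prompt A", "prompt B"], ["negative", "positive"]):
    print(f"{encoder_text!r} -> {decoder_text!r}")
```

Each prompt appears in consecutive rows, once per label, which is what later allows the likelihoods to be regrouped into one (negative, positive) pair per prompt.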

In [17]:
def get_likelihoods_from_t5_output(
    output: Seq2SeqLMOutput, loss_func: torch.nn.CrossEntropyLoss, decoder_ids: torch.Tensor
) -> torch.Tensor:
    # loss_func is expected to be CrossEntropyLoss(ignore_index=-100, reduction="none"), so that it returns one loss
    # value per token. The negative of the loss gets us back to raw log-probabilities.
    log_likelihoods = -loss_func(output.logits.view(-1, output.logits.size(-1)), decoder_ids.view(-1))
    batch_size, sequence_length, _ = output.logits.size()
    # Reshape to per-token log probabilities for each sequence, e.g. for the target output [pos, it, ive]
    log_likelihoods = log_likelihoods.view(batch_size, sequence_length)
    # Pad tokens were replaced with the index -100, which the loss ignores; zero them out explicitly.
    good_log_p = log_likelihoods.masked_fill_(decoder_ids == -100, 0.0)
    # good_log_p now has the log probability of the output sequence tokens.
    # Sum over the sequence length to compute the log probability for a full sequence.
    return torch.sum(good_log_p, dim=1).squeeze()
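
With `reduction="none"`, cross-entropy returns the negative log-probability of the target token at each position, so negating it and summing over the non-padding positions gives the log-likelihood of the whole label sequence. The pure-Python sketch below reproduces that computation on made-up logits; `log_softmax` and `sequence_log_likelihood` are illustrative helpers, not part of the lab code.

```python
import math
from typing import List

IGNORE_INDEX = -100


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one decoding step."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_norm for v in logits]


def sequence_log_likelihood(step_logits: List[List[float]], target_ids: List[int]) -> float:
    """Sum log p(target token) over decoding steps, skipping ignored (padded) positions."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        if target == IGNORE_INDEX:
            continue
        total += log_softmax(logits)[target]
    return total


# Toy example: vocabulary of 3 tokens, 3 decoding steps, last step is padding
toy_logits = [[0.0, 0.0, 0.0], [0.0, math.log(2.0), 0.0], [5.0, 1.0, -3.0]]
toy_targets = [2, 1, IGNORE_INDEX]
print(sequence_log_likelihood(toy_logits, toy_targets))
```

In the toy example the first step is uniform (probability 1/3) and the second step assigns probability 1/2 to the target, so the result equals log(1/6); the padded third step contributes nothing.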

In [14]:
def run_t5_model_on_encodings(
    encoder_encodings: BatchEncoding, decoder_encodings: BatchEncoding, t5_model: T5ForConditionalGeneration
) -> Tuple[Seq2SeqLMOutput, torch.Tensor]:
    decoder_ids = decoder_encodings.input_ids
    # we have to make sure that the PAD token is ignored.
    # HuggingFace ignores a pad token if the token is -100!
    decoder_ids = decoder_ids.masked_fill(decoder_ids == t5_tokenizer.pad_token_id, -100)
    # Disable gradient tracking for faster inference
    with torch.no_grad():
        model_output = t5_model(
            input_ids=encoder_encodings.input_ids,
            attention_mask=encoder_encodings.attention_mask,
            decoder_attention_mask=decoder_encodings.attention_mask,
            decoder_input_ids=t5_model._shift_right(decoder_ids),
            labels=None,
        )
    return model_output, decoder_ids
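
The decoder is fed the label sequence shifted one position to the right, so that at each step it predicts the next label token from the previous ones. The sketch below illustrates that shift in pure Python. It assumes, as is the case for the standard T5 configuration to the best of our knowledge, that both the decoder start token and the pad token have ID 0; `shift_right` here is an illustrative stand-in, not the library's implementation.

```python
from typing import List

IGNORE_INDEX = -100


def shift_right(label_ids: List[int], decoder_start_token_id: int = 0, pad_token_id: int = 0) -> List[int]:
    """Build teacher-forcing decoder inputs: prepend the start token, drop the last label, un-mask padding."""
    shifted = [decoder_start_token_id] + label_ids[:-1]
    return [pad_token_id if token_id == IGNORE_INDEX else token_id for token_id in shifted]


# "positive </s>" is [1465, 1] per the T5 tokenizer output above; the rest is masked padding
labels = [1465, 1, IGNORE_INDEX, IGNORE_INDEX]
print(shift_right(labels))
```

The masked positions are mapped back to the pad ID because -100 is not a valid embedding index; it is only meaningful to the loss function.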

In [15]:
def extract_label_from_likelihoods(softmax_func: nn.Softmax, likelihoods: torch.Tensor) -> torch.Tensor:
    # Pair the likelihoods associated with negative and positive labels for each prompt
    likelihoods = likelihoods.reshape(-1, 2)
    likelihoods = softmax_func(likelihoods)
    return torch.argmax(likelihoods, dim=1)
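
The reshape relies on the scores arriving in the order (negative, positive) for the first prompt, then (negative, positive) for the second, and so on. A pure-Python sketch of that regrouping, using made-up log-likelihoods and a hypothetical helper `labels_from_flat_scores`:

```python
from typing import List


def labels_from_flat_scores(flat_scores: List[float], num_labels: int = 2) -> List[int]:
    """Group consecutive scores per prompt and return the index of the highest score in each group."""
    if len(flat_scores) % num_labels != 0:
        raise ValueError("Number of scores must be a multiple of num_labels.")
    predictions = []
    for start in range(0, len(flat_scores), num_labels):
        group = flat_scores[start : start + num_labels]
        predictions.append(max(range(num_labels), key=group.__getitem__))
    return predictions


# Made-up log-likelihoods for three prompts, ordered [neg, pos, neg, pos, neg, pos]
toy_scores = [-2.3, -0.4, -0.1, -3.0, -1.2, -1.1]
print(labels_from_flat_scores(toy_scores))
```

The returned indices line up with the SST2 labels because `label_words` lists "negative" first (label 0) and "positive" second (label 1).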

First, we run the original prompt through the T5 Model to establish the "baseline" accuracy that we're hoping the optimized prompts improve upon.

In [18]:
correct = 0
total = 0
# We're going to use a loss function to extract the log probabilities of the labels.
loss_func = nn.CrossEntropyLoss(ignore_index=-100, reduction="none")
softmax = nn.Softmax(dim=1)
for batch_num, batch in enumerate(dataloader):
    prompts = [f"{prompt} </s>" for prompt in create_prompt_str(initial_prompt, batch["sentence"])]
    labels = batch["label"].to(device)
    encoder_encodings, decoder_encodings = create_encoder_decoder_inputs(prompts, t5_tokenizer, device)

    model_output, decoder_ids = run_t5_model_on_encodings(encoder_encodings, decoder_encodings, t5_model)
    likelihoods = get_likelihoods_from_t5_output(model_output, loss_func, decoder_ids)
    predicted_labels = extract_label_from_likelihoods(softmax, likelihoods)
    match_tensor = (predicted_labels == labels).long()
    correct += torch.sum(match_tensor)
    total += len(match_tensor)
    print(f"Batch number {batch_num+1} Complete")
    if batch_num + 1 == batches_to_sample:
        break
accuracy = correct / total

report.append((initial_prompt, accuracy))

Batch number 1 Complete
Batch number 2 Complete
Batch number 3 Complete
Batch number 4 Complete
Batch number 5 Complete
Batch number 6 Complete
Batch number 7 Complete
Batch number 8 Complete
Batch number 9 Complete
Batch number 10 Complete
Batch number 11 Complete
Batch number 12 Complete
Batch number 13 Complete
Batch number 14 Complete
Batch number 15 Complete
Batch number 16 Complete
Batch number 17 Complete
Batch number 18 Complete
Batch number 19 Complete
Batch number 20 Complete


NameError: name 'report' is not defined

In [19]:
print(f"Accuracy: {accuracy}")

Accuracy: 0.6800000071525574


Next, we run the first optimized prompt through T5 to see if there is a better accuracy outcome.

In [20]:
correct = 0
total = 0
# We're going to use a loss function to extract the log probabilities of the labels.
loss_func = nn.CrossEntropyLoss(ignore_index=-100, reduction="none")
softmax = nn.Softmax(dim=1)
for batch_num, batch in enumerate(dataloader):
    prompts = [f"{prompt} </s>" for prompt in create_prompt_str(optimized_prompt_1, batch["sentence"])]
    labels = batch["label"].to(device)
    encoder_encodings, decoder_encodings = create_encoder_decoder_inputs(prompts, t5_tokenizer, device)
    model_output, decoder_ids = run_t5_model_on_encodings(encoder_encodings, decoder_encodings, t5_model)
    likelihoods = get_likelihoods_from_t5_output(model_output, loss_func, decoder_ids)
    predicted_labels = extract_label_from_likelihoods(softmax, likelihoods)
    match_tensor = (predicted_labels == labels).long()
    correct += torch.sum(match_tensor)
    total += len(match_tensor)
    print(f"Batch number {batch_num+1} Complete")
    if batch_num + 1 == batches_to_sample:
        break
accuracy = correct / total
report.append((optimized_prompt_1, accuracy))

Batch number 1 Complete
Batch number 2 Complete
Batch number 3 Complete
Batch number 4 Complete
Batch number 5 Complete
Batch number 6 Complete
Batch number 7 Complete
Batch number 8 Complete
Batch number 9 Complete
Batch number 10 Complete
Batch number 11 Complete
Batch number 12 Complete
Batch number 13 Complete
Batch number 14 Complete
Batch number 15 Complete
Batch number 16 Complete
Batch number 17 Complete
Batch number 18 Complete
Batch number 19 Complete
Batch number 20 Complete


NameError: name 'report' is not defined

In [21]:
print(f"Accuracy: {accuracy}")

Accuracy: 0.8199999928474426


Finally, we run the second optimized prompt to establish its performance on a small subset of the SST2 task.

In [22]:
correct = 0
total = 0
# We're going to use a loss function to extract the log probabilities of the labels.
loss_func = nn.CrossEntropyLoss(ignore_index=-100, reduction="none")
softmax = nn.Softmax(dim=1)
for batch_num, batch in enumerate(dataloader):
    prompts = [f"{prompt} </s>" for prompt in create_prompt_str(optimized_prompt_2, batch["sentence"])]
    labels = batch["label"].to(device)
    encoder_encodings, decoder_encodings = create_encoder_decoder_inputs(prompts, t5_tokenizer, device)
    model_output, decoder_ids = run_t5_model_on_encodings(encoder_encodings, decoder_encodings, t5_model)
    likelihoods = get_likelihoods_from_t5_output(model_output, loss_func, decoder_ids)
    predicted_labels = extract_label_from_likelihoods(softmax, likelihoods)
    match_tensor = (predicted_labels == labels).long()
    correct += torch.sum(match_tensor)
    total += len(match_tensor)
    print(f"Batch number {batch_num+1} Complete")
    if batch_num + 1 == batches_to_sample:
        break
accuracy = correct / total
report.append((optimized_prompt_2, accuracy))

Batch number 1 Complete
Batch number 2 Complete
Batch number 3 Complete
Batch number 4 Complete
Batch number 5 Complete
Batch number 6 Complete
Batch number 7 Complete
Batch number 8 Complete
Batch number 9 Complete
Batch number 10 Complete
Batch number 11 Complete
Batch number 12 Complete
Batch number 13 Complete
Batch number 14 Complete
Batch number 15 Complete
Batch number 16 Complete
Batch number 17 Complete
Batch number 18 Complete
Batch number 19 Complete
Batch number 20 Complete


NameError: name 'report' is not defined

In [23]:
print(f"Accuracy: {accuracy}")

Accuracy: 0.7599999904632568


As a final step, let's summarize our findings and analyze the results that we've collected in the report list.

In [42]:
print("Summary")
print("LLaMA-2 7B Performance:")
for prompt, acc in report[0:3]:
    print(f"Prompt: {prompt}, Accuracy: {acc}")
print("T5 Performance:")
for prompt, acc in report[3:6]:
    print(f"Prompt: {prompt}, Accuracy: {acc}")

Summary
LLaMA-2 7B Performance:
Prompt: Generate the sentiment of the next sentence. , Accuracy: 0.6
Prompt: tumour negative .05. Positive respins the Contains sentence. , Accuracy: 0.555
Prompt: childcare negative .05. Positive respins wSt Thank sentence. , Accuracy: 0.615
T5 Performance:
Prompt: Generate the sentiment of the next sentence. , Accuracy: 0.6549999713897705
Prompt: tumour negative .05. Positive respins the Contains sentence. , Accuracy: 0.7900000214576721
Prompt: childcare negative .05. Positive respins wSt Thank sentence. , Accuracy: 0.7549999952316284


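It's fairly clear that LLaMA-2 7B does not do well with this type of prompt (60% accuracy with the initial instruction), whereas T5 does a better job with it, especially once the prompt is optimized.

The interesting part is that these odd prompts clearly improve the performance of T5, while their effect on LLaMA-2 7B is mixed: the first optimized prompt lowers accuracy (0.555 vs. 0.6) and the second raises it only slightly (0.615). With 200 samples per run, differences this small may simply reflect sampling noise.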
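
To get a rough sense of how much of the differences in the summary could be sampling noise, one can compute a binomial standard error for each accuracy. This is a back-of-the-envelope sketch, not part of the original analysis; it assumes each prediction is an independent draw, and it ignores the fact that each run uses a different shuffled sample of the validation set.

```python
import math


def accuracy_standard_error(accuracy: float, num_samples: int) -> float:
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


num_samples = 10 * 20  # batch_size * batches_to_sample
for name, acc in [("initial", 0.6), ("optimized 1", 0.555), ("optimized 2", 0.615)]:
    se = accuracy_standard_error(acc, num_samples)
    print(f"{name}: {acc:.3f} +/- {1.96 * se:.3f} (approximate 95% interval)")
```

Increasing `batches_to_sample` narrows these intervals, at the cost of longer inference time.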