<a href="https://colab.research.google.com/github/cicl-iscl/SemEval23-Task5/blob/main/WebSemble_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation and Imports

In [49]:
!git clone https://github.com/cicl-iscl/SemEval23-Task5.git

Cloning into 'SemEval23-Task5'...
remote: Enumerating objects: 112, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 112 (delta 12), reused 0 (delta 0), pack-reused 79
Receiving objects: 100% (112/112), 26.69 MiB | 19.72 MiB/s, done.
Resolving deltas: 100% (38/38), done.


In [50]:
%cd SemEval23-Task5

/content/SemEval23-Task5


In [None]:
!pip install transformers
!pip install datasets

In [4]:
import torch
import gc
import os

In [29]:
if torch.cuda.is_available():
    DEVICE = 'cuda'
    torch.cuda.set_per_process_memory_fraction(0.8, 0) # cap this process at 80% of GPU 0's memory so torch can't claim all of it
    torch.cuda.empty_cache()
    gc.collect()
else: DEVICE = 'cpu'

# Preprocessing

In [None]:
import transformers
from datasets import Dataset

from utils.preprocess import squad_format, from_squad

`squad_format` reads the input .jsonl files into SQuAD-format records, then `from_squad` transforms them into `Dataset` instances, adding or computing the other fields we'll use later for training and predicting, such as start and end indexes, input ids, and labels.
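To make the target layout concrete, here is a minimal sketch of one SQuAD-style record with the same fields that `X_test.features` lists below. All values are made up for illustration, and `answer_spans` is a hypothetical helper, not part of `utils.preprocess`.

```python
# Hypothetical record in the SQuAD-style layout; every value here is made up.
context = "An example post title - The body of the linked article goes here."
answer_text = "The body"

record = {
    "id": "example-0001",
    "title": "An example post title",
    "question": "What does the post hide?",
    "context": context,
    # answer_start is a character offset into the context
    "answers": [{"answer_start": context.find(answer_text), "text": answer_text}],
    "label": "phrase",
}

def answer_spans(rec):
    """Return (start, end) character spans for every answer in the record."""
    return [
        (a["answer_start"], a["answer_start"] + len(a["text"]))
        for a in rec["answers"]
    ]

for start, end in answer_spans(record):
    # Each span should slice the exact answer text out of the context.
    print(record["context"][start:end])
```

The token-level `start_positions` and `end_positions` are derived from character spans like these once the text is tokenized.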

In [7]:
DIR = 'webis22_original'

test_path = os.path.join(DIR, 'validation.jsonl')
X_test = Dataset.from_pandas(
    squad_format(test_path)
)
X_test = X_test.map(from_squad, batched=True)


For the sake of demonstration, we'll select a small subset of the dataset.

In [8]:
X_test = X_test.select(range(100))

In [9]:
X_test.features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'answers': [{'answer_start': Value(dtype='int64', id=None),
   'text': Value(dtype='string', id=None)}],
 'label': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'offset_mapping': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'start_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'end_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

In [10]:
X_test[0]['id'], X_test[0]['title'], X_test[0]['context']

('6dc7ddef-4e8e-4a6b-9296-526377518071',
 'Five Nights at Freddy’s Sequel Delayed for Weird Reason',
 'Five Nights at Freddy’s Sequel Delayed for Weird Reason - Five Nights at Freddy’s creator Scott Cawthon takes to Steam to tease a possible delay for Five Nights at Freddy’s: Sister Location, the fifth game in the series. For the past couple of years, horror gaming fans have been able to look forward to one new entry in the Five Nights at Freddy’s series after another, with four core games, one RPG spinoff, and a novel released so far. The next game in the franchise, Five Nights at Freddy’s: Sister Location, was scheduled to release this coming Friday, October 7th, but if developer Scott Cawthon is to be believed, the project has been delayed by a few months. According to a post by Cawthon on the Five Nights at Freddy’s: Sister Location Steam page, the game is being delayed because it’s too dark. Cawthon said that some of the plot elements are so disturbing that they are making him fee

# QA

We use an ensemble of models: we gather logits from each model (see `QA/Predict`), compute their mean values, and retrieve the top-k answers from the posts during postprocessing.

## Predict

All the information about the models we use, and about the preliminary steps to run on these models, is stored in .json files that we call *instructions*. Instructions have several essential fields:

1. `name`: just for reference
2. `use`: whether the model will be used
3. `input_model_path`: where to pull the pre-trained model from in case it's to be fine-tuned (assuming it's not ready to be used yet)
4. `fine-tune`: whether it's to be fine-tuned
5. `output_model_path`: where to pull the fine-tuned model from (assuming it's ready to be used from here)

Instructions can also contain additional information, e.g. training arguments or notes.
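As a rough sketch of how such files could be consumed, the snippet below loads every .json file from a directory and keeps only the entries whose `use` flag is set. `load_instructions` and `active` are hypothetical helpers written for this illustration; the repo's own `read_instructions` may work differently, and the second instruction file is invented.

```python
import json
import os
import tempfile

def load_instructions(directory):
    """Load every .json file in `directory` as a dict, sorted by file name."""
    instructions = []
    for file_name in sorted(os.listdir(directory)):
        if file_name.endswith(".json"):
            with open(os.path.join(directory, file_name), encoding="utf-8") as f:
                instructions.append(json.load(f))
    return instructions

def active(instructions):
    """Keep only the instructions whose `use` flag is true."""
    return [inst for inst in instructions if inst.get("use", False)]

# Two example files: one enabled model and one (made-up) disabled model.
examples = {
    "a.json": {
        "name": "roberta-base-squad2",
        "use": True,
        "input_model_path": "deepset/roberta-base-squad2",
        "fine-tune": False,
        "output_model_path": "deepset/roberta-base-squad2",
    },
    "b.json": {
        "name": "some-disabled-model",
        "use": False,
        "input_model_path": "path/to/pretrained",
        "fine-tune": True,
        "output_model_path": "path/to/fine-tuned",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    for file_name, content in examples.items():
        with open(os.path.join(tmp, file_name), "w", encoding="utf-8") as f:
            json.dump(content, f)
    loaded = load_instructions(tmp)

active_names = [inst["name"] for inst in active(loaded)]
print(active_names)
```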

In [36]:
from run import read_instructions

The GitHub repo doesn't contain any model weights, so the models that are supposed to be stored locally are unavailable here. For simplicity we'll just skip them.

In [66]:
%rm instructions/QA/bart-base-webis22.json

In [67]:
qa_instructions_dir = 'instructions/QA'
qa_instructions = read_instructions(qa_instructions_dir)

In [68]:
qa_instructions[0]

{'name': 'roberta-base-squad2',
 'use': True,
 'input_model_path': 'deepset/roberta-base-squad2',
 'fine-tune': False,
 'training_kwargs': {},
 'trainer_kwargs': {},
 'output_model_path': 'deepset/roberta-base-squad2',
 'notes': ['Fine-tuned on SQuAD v2.0 out of the box.']}

In [69]:
# leave only 'question', 'context' for qa to accept it
to_remove = X_test.column_names
to_remove.remove('question'); to_remove.remove('context')
X_test_ = X_test.remove_columns(to_remove)

In [70]:
X_test_.features

{'question': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None)}

Here we gather logits from all models.

In [71]:
from downstream.answer import build_model as build_qa, answer

WebSemble distinguishes two modes of usage: `train` and `test`. The only difference is that fine-tuning takes place only when the mode is `train`. For the sake of demonstration, we won't fine-tune anything and will only use models that are already fine-tuned.

In [72]:
mode = 'test'

In [74]:
batch_size = 8

qa_start_logits, qa_end_logits = [], []
for instruction in qa_instructions:
    if not instruction['use']: continue
    print(instruction['name'])
    qa = build_qa(instruction, X_train=None, X_dev=None, mode=mode)
    if X_test_ is not None:
        X_test_ = X_test_.map(
            answer, batched=True, batch_size=batch_size,
            fn_kwargs={'qa': qa}
        )
        qa_start_logits.append(X_test_['start_logits'])
        qa_end_logits.append(X_test_['end_logits'])

roberta-base-squad2
bert-large-uncased-whole-word-masking-finetuned-squad
distilbert-base-cased-distilled-squad

To obtain the decision of the ensemble, we compute the mean logit values. Before that, we need to bring the logits to the same shape, because different models can output tensors of different lengths.

In [75]:
from utils.postprocess import truncate_logits

In [76]:
# now we have our logits in lists because we retrieve them from a datasets.Dataset instance
minlen = min([len(logits) for start_logits in qa_start_logits for logits in start_logits])
qa_start_logits = truncate_logits(qa_start_logits, minlen)
qa_end_logits = truncate_logits(qa_end_logits, minlen)

In [77]:
for logits in qa_start_logits: print(logits.shape) # --> len(X_test), minlen

torch.Size([100, 512])
torch.Size([100, 512])
torch.Size([100, 512])


In [78]:
start_logits_ensemble = torch.mean(qa_start_logits, axis=0).cpu()
end_logits_ensemble = torch.mean(qa_end_logits, axis=0).cpu()

In [79]:
start_logits_ensemble.shape

torch.Size([100, 512])
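The truncate-then-average step above can be sketched with plain Python lists. This assumes truncation simply cuts every sequence to the shortest length. The actual `truncate_logits` may do more, and `truncate` and `mean_over_models` are hypothetical names used only here.

```python
def truncate(per_model_logits, length):
    """Cut every logit sequence of every model down to `length` positions."""
    return [[seq[:length] for seq in model] for model in per_model_logits]

def mean_over_models(per_model_logits):
    """Average position-wise across models.

    Input layout: (models, examples, positions); output: (examples, positions).
    """
    n_models = len(per_model_logits)
    return [
        [sum(values) / n_models for values in zip(*example_seqs)]
        for example_seqs in zip(*per_model_logits)
    ]

# Made-up logits: two models, two examples, different sequence lengths.
model_a = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
model_b = [[3.0, 2.0, 1.0], [2.0, 4.0, 0.0]]
logits = [model_a, model_b]

min_len = min(len(seq) for model in logits for seq in model)
ensemble = mean_over_models(truncate(logits, min_len))
print(min_len, ensemble)
```

With tensors, the same averaging is what `torch.mean(..., axis=0)` does over the stacked per-model logits.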

Logits are not human-readable, so we have to postprocess them to extract the spoilers.
`postprocess_qa` returns the best `top_k` spoilers, where `top_k` is passed in `postprocess_qargs`; otherwise the default value is used.
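A common way to turn start and end logits into ranked text spans is to score each candidate span by the sum of its start and end logit, then map token indices back to characters through the offset mapping. The sketch below illustrates that idea under simplifying assumptions (whitespace tokens, made-up logits). It is not the implementation of `postprocess_qa`, and `best_indices` and `top_k_spans` are hypothetical names.

```python
import re

def best_indices(scores, n_best):
    """Indices of the n_best highest scores, best first."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n_best]

def top_k_spans(start_logits, end_logits, offsets, context,
                top_k=5, n_best=20, max_tokens=30):
    """Return up to `top_k` answer strings ranked by start logit + end logit."""
    candidates = []
    for s in best_indices(start_logits, n_best):
        for e in best_indices(end_logits, n_best):
            # Skip spans that end before they start or are too long.
            if e < s or e - s + 1 > max_tokens:
                continue
            text = context[offsets[s][0]:offsets[e][1]]
            candidates.append((start_logits[s] + end_logits[e], text))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [text for _, text in candidates[:top_k]]

context = "The game is delayed because it is too dark."
# Whitespace "tokens" with (start, end) character offsets, standing in for
# a real tokenizer's offset_mapping.
offsets = [m.span() for m in re.finditer(r"\S+", context)]

# Made-up logits, one value per token.
start_logits = [0.1, 0.0, 0.0, 0.2, 0.3, 2.0, 0.0, 3.0, 0.5]
end_logits = [0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.2, 1.0, 4.0]

spans = top_k_spans(start_logits, end_logits, offsets, context, top_k=3)
print(spans)  # ['too dark.', 'it is too dark.', 'dark.']
```

Overlapping candidates like these are expected from this kind of ranking.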

In [80]:
from utils.postprocess import postprocess_qa

In [81]:
postprocess_qargs = {
    'top_k': 5
}

In [82]:
output = postprocess_qa(X_test, (start_logits_ensemble, end_logits_ensemble), **postprocess_qargs)
pred_spoilers = output.predictions

In [83]:
pred_spoilers[33:36]

[{'id': '947a87bc-7e74-42a8-b20a-acf3d88983f1',
  'prediction_texts': ['Jennifer Lawrence has chopped off her hair into a pixie',
   'chopped off her hair into a pixie',
   'Jennifer Lawrence has chopped off her hair into a pi',
   'Jennifer Lawrence',
   'Jennifer Lawrence has chopped off her hair into a pixie cut. Or at least it']},
 {'id': 'ccb005f3-30d7-413f-bf64-7675c96f016a',
  'prediction_texts': ['The oven cleaning hacks that will have your appliance sparkling in MINUTES without AN',
   'scrubbing (from removing',
   'scrubbing (from removing grease with oil to banishing bad smells with water) - Scrubbing',
   'The oven cleaning hacks that will have your appliance',
   'The oven cleaning hacks that will have your appliance sparkling in MINUTES without ANY']},
 {'id': 'c212469f-5111-4fde-aad1-ed8ebf7e563b',
  'prediction_texts': ['Lawmakers could do this to fight corruption. But they won’t',
   'they won’t',
   'won’t',
   'Lawmakers could do this to fight corruption',
   'But t

# Summarization

Optionally, we can use summarization before classification, so that we predict the spoiler type from a condensed version of the text rather than from the whole text.

Unlike QA and classification below, we don't use an ensemble here but a single pre-chosen model.

In [None]:
!pip install sentencepiece

In [12]:
from transformers import PegasusForConditionalGeneration, AutoTokenizer

from downstream.summarize import summarize

In [None]:
DEFAULT_S2S_MODEL_NAME = 'google/pegasus-xsum'
model = PegasusForConditionalGeneration.from_pretrained(DEFAULT_S2S_MODEL_NAME).to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(DEFAULT_S2S_MODEL_NAME)

In [30]:
X_test = X_test.map(
    summarize, input_columns='context', batched=True, batch_size=batch_size,
    fn_kwargs={'summarizer': model, 'tokenizer': tokenizer}
)


In [31]:
X_test.features # 'summarized' added

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'answers': [{'answer_start': Value(dtype='int64', id=None),
   'text': Value(dtype='string', id=None)}],
 'label': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'offset_mapping': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'start_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'end_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'summarized': Value(dtype='string', id=None)}

In [34]:
X_test['context'][0], X_test['summarized'][0]

('Five Nights at Freddy’s Sequel Delayed for Weird Reason - Five Nights at Freddy’s creator Scott Cawthon takes to Steam to tease a possible delay for Five Nights at Freddy’s: Sister Location, the fifth game in the series. For the past couple of years, horror gaming fans have been able to look forward to one new entry in the Five Nights at Freddy’s series after another, with four core games, one RPG spinoff, and a novel released so far. The next game in the franchise, Five Nights at Freddy’s: Sister Location, was scheduled to release this coming Friday, October 7th, but if developer Scott Cawthon is to be believed, the project has been delayed by a few months. According to a post by Cawthon on the Five Nights at Freddy’s: Sister Location Steam page, the game is being delayed because it’s too dark. Cawthon said that some of the plot elements are so disturbing that they are making him feel sick, and so he is thinking about delaying the game so that he can rework it entirely "into somethi

# Classification

Here we can use an ensemble of models as well: we gather logits in `Classification/Predict`, then compute the mean values and retrieve the most probable label during postprocessing.

## Predict

Besides the fields used for QA, instructions for classification contain a `label_mapping` field that maps labels to their ids.

In [37]:
classification_instructions_dir = 'instructions/TextClassification'
classification_instructions = read_instructions(classification_instructions_dir)

In [38]:
classification_instructions[0]

{'name': 'distilbert-base-uncased-webis22',
 'use': True,
 'input_model_path': 'models/distilbert-base-uncased-webis22',
 'fine-tune': False,
 'training_kwargs': {'output_dir': 'flow/checkpoints/distilbert-base-uncased-webis22',
  'logging_dir': 'flow/log/distilbert-base-uncased-webis22',
  'num_train_epochs': 0.5},
 'trainer_kwargs': {},
 'summarize': True,
 'label_mapping': {'phrase': 0, 'passage': 1, 'multi': 2},
 'output_model_path': 'models/distilbert-base-uncased-webis22',
 'notes': ['Fine-tuned on Webis22.']}

In [39]:
# leave only 'context' / 'summarized' --> 'text' for classifier to accept it
text_column = 'summarized' if 'summarized' in X_test.column_names else 'title' # 'context'
to_remove = X_test.column_names
to_remove.remove(text_column)
X_test_ = X_test.remove_columns(to_remove).rename_column(text_column, 'text')

In [40]:
X_test_.features

{'text': Value(dtype='string', id=None)}

In [51]:
from downstream.classify import build_model as build_classifier, classify

In the original script we used a fine-tuned model that was stored locally, but, once again, we can't put it on GitHub, so it isn't available here. For classification in this demo we'll therefore pull a pre-trained model that is not fine-tuned on our data.

In [52]:
classification_instructions = [{
    'name': 'distilbert-base-uncased',
    'use': True,
    'input_model_path': 'distilbert-base-uncased',
    'fine-tune': False,
    'training_kwargs': {
        'output_dir': 'flow/checkpoints/distilbert-base-uncased-webis22',
        'logging_dir': 'flow/log/distilbert-base-uncased-webis22',
        'num_train_epochs': 0.5
        },
    'trainer_kwargs': {},
    'summarize': True,
    'label_mapping': {'phrase': 0, 'passage': 1, 'multi': 2},
    'output_model_path': 'distilbert-base-uncased',
    'notes': ['Just for here.']
}]

In [55]:
cl_logits = []
for instruction in classification_instructions:
    if not instruction['use']: continue
    print(instruction['name'])
    classifier = build_classifier(instruction, X_train=None, X_dev=None, mode=mode)
    if X_test_ is not None:
        X_test_ = X_test_.map(
            classify, batched=True, batch_size=batch_size,
            fn_kwargs={'classifier': classifier}
        )
        cl_logits.append(X_test_['logits'])

distilbert-base-uncased


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi


These logits always have the same length (one value per label), so we don't need to truncate or pad them.

In [56]:
cl_logits = torch.tensor(cl_logits, requires_grad=False)
logits_ensemble = torch.mean(cl_logits, axis=0)

In [57]:
logits_ensemble.shape # len(X_test_), len(label_mapping)

torch.Size([100, 3])

We should apply `argmax` to retrieve label ids.

In [58]:
from utils.postprocess import postprocess_classify

In [59]:
output = postprocess_classify(X_test, logits_ensemble)
pred_label_ids = output.predictions

In [60]:
pred_label_ids[2:7]

array([2, 2, 2, 2, 2])

Now we can map the label ids back to human-readable labels.

In [61]:
id2label = lambda label_id: classifier.model.config.id2label[label_id] # use latest
pred_labels = [
    {'id': id, 'label': id2label(label_id)}
    for id, label_id in zip(X_test['id'], pred_label_ids)
]

In [64]:
pred_labels[27:31]

[{'id': '55a310e9-49c5-4d04-9427-1dd32bdf30a5', 'label': 'multi'},
 {'id': '226c19e3-5040-4cff-896d-1d7ab0aad6fa', 'label': 'multi'},
 {'id': '30634377-bf2f-48a1-a6ae-aa984621ec94', 'label': 'multi'},
 {'id': 'd8186b19-22ad-4a05-89f9-7c199410a433', 'label': 'multi'}]
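The argmax and label-lookup steps above can be sketched in plain Python. The logits are made up, and `id2label` is built here by inverting the `label_mapping` from the instruction; in the notebook itself the lookup goes through `classifier.model.config.id2label`.

```python
label_mapping = {"phrase": 0, "passage": 1, "multi": 2}
# Invert label -> id into id -> label.
id2label = {idx: label for label, idx in label_mapping.items()}

def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up ensemble logits: one row per example, one column per label.
logits_ensemble = [
    [0.2, 1.5, -0.3],
    [2.0, 0.1, 0.4],
    [-1.0, 0.0, 0.7],
]

pred_label_ids = [argmax(row) for row in logits_ensemble]
pred_label_names = [id2label[i] for i in pred_label_ids]
print(pred_label_ids, pred_label_names)
```

Note that a model that hasn't been fine-tuned on the task, as in this demo, can easily predict the same label for every example.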

# Finalization

In this block we'll just put our predictions together and (optionally) apply a joint postprocessing.

In [84]:
pred_spoilers_ = {entry['id']: entry['prediction_texts'] for entry in pred_spoilers}
pred_labels_ = {entry['id']: entry['label'] for entry in pred_labels}

In [85]:
def find_by_id(preds, id):
    # preds maps example id -> prediction, so a direct dict lookup is enough
    return preds[id]

In [86]:
answers = []

for id in X_test['id']:

    top_k = find_by_id(pred_spoilers_, id)
    label = find_by_id(pred_labels_, id)
    # final_spoiler = postprocess_top_k(top_k, label)
    final_spoiler = top_k[0]

    answer = {
        'uuid': id,
        'spoilerType': label,
        'spoiler': final_spoiler
    }
    answers.append(answer)

In [87]:
answers[19:24]

[{'uuid': '07fa586f-3911-4192-99e6-f8e22f12e172',
  'spoilerType': 'multi',
  'spoiler': 'Parents Discover'},
 {'uuid': '0a5b3d8b-af6e-4e13-be29-e8d5883728bc',
  'spoilerType': 'multi',
  'spoiler': 'The One Morning'},
 {'uuid': '080bd61d-86f7-41f8-801e-3efc956b42aa',
  'spoilerType': 'multi',
  'spoiler': 'Is It Safe'},
 {'uuid': '6e384564-5281-48d3-aa5d-a99ed108666b',
  'spoilerType': 'multi',
  'spoiler': 'Capeside gang will likely never get back together'},
 {'uuid': '55405e61-e388-4f7b-8fbb-3549c4f94e2a',
  'spoilerType': 'multi',
  'spoiler': 'twins would survive'}]
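If the answers need to be saved, one option is to write them as JSON Lines, with one object per line. The sketch below assumes that format is wanted; `write_jsonl` is a hypothetical helper, and the two records are copied from the output above.

```python
import io
import json

def write_jsonl(records, stream):
    """Write each record as one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")

sample_answers = [
    {"uuid": "07fa586f-3911-4192-99e6-f8e22f12e172",
     "spoilerType": "multi",
     "spoiler": "Parents Discover"},
    {"uuid": "0a5b3d8b-af6e-4e13-be29-e8d5883728bc",
     "spoilerType": "multi",
     "spoiler": "The One Morning"},
]

# An in-memory buffer stands in for a file opened with open(path, "w").
buffer = io.StringIO()
write_jsonl(sample_answers, buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))
```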