## Project 1
# Zero-Shot Question Answering
The aim of this project is to get familiar with language models for zero-shot question answering and with possible pitfalls in measuring LLM performance on common benchmarks.  
The project is divided into two parts:
1. **Encoder Models**:
    - Here you will see how to use predefined HF / transformers classes to solve this task.
2. **Decoder Models**:
    - Here you will see how to adapt a decoder model to solve this task and how modern LLMs are benchmarked on it.

### What is Zero-Shot Question Answering?
Zero-shot question answering is a task where a model is given a question and a context, and is expected to predict the answer without having been trained on that context or question. The model must generalize to unseen contexts and questions. From a practical perspective, this is the situation where we want to apply a model to our task without any fine-tuning.

### Part 0: Setup

In [1]:
%pip install datasets
%pip install 'transformers[torch]'

Note: you may need to restart the kernel to use updated packages.


### Part 1: Dataset

We will work on the [MMLU dataset](https://huggingface.co/datasets/CohereForAI/Global-MMLU). Let's have a look at examples from the dataset. For each question we are given four answer options, the correct answer, and the subject of the question.

In [2]:
from datasets import load_dataset, Dataset

ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")

def preprocess(sample: dict):
    return {
        "options": [
            sample[option]
            for option in ["option_a", "option_b", "option_c", "option_d"]
        ],
    }

ds = ds.map(preprocess)



In [3]:
print(f"N Examples: {len(ds)}")
print(f"Mean length: {sum(len(x['question']) for x in ds) / len(ds):4.2f}")
print(f"Max length: {max(len(x['question']) for x in ds)}")

N Examples: 14042
Mean length: 274.54
Max length: 4671


In [4]:
sample_idx = 0

sample_question = ds[sample_idx]["question"]
sample_subject = ds[sample_idx]["subject"]
options = ds[sample_idx]["options"]
answer = ds[sample_idx]["answer"]

print("Sample question:", sample_question)
print("Sample subject:", sample_subject)
print("Options:\n", "\n".join([f"{c.upper()}: {o}" for c, o in zip("abcd", options)]))
print("Answer:", answer)

Sample question: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
Sample subject: abstract_algebra
Options:
 A: 0
B: 4
C: 2
D: 6
Answer: B


### Part 2: Encoder Models

Let's see how to use the out-of-the-box transformers pipeline to solve this task.

In [10]:
from transformers import pipeline, set_seed

set_seed(42)

zero_shot_classifier = pipeline(
    "zero-shot-classification", model="MoritzLaurer/ModernBERT-large-zeroshot-v2.0"
)

Device set to use mps:0


In [6]:
zero_shot_classifier(
    sample_question,
    options,
    hypothesis_template="The correct answer is: {}",
    multi_label=False,
)

Compiling the model with `torch.compile` and using a `torch.mps` device is not supported. Falling back to non-compiled mode.


{'sequence': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.',
 'labels': ['6', '4', '2', '0'],
 'scores': [0.27904778718948364,
  0.27592524886131287,
  0.23753005266189575,
  0.20749692618846893]}

#### How does it work under the hood?

If you go to the [source code](https://github.com/huggingface/transformers/blob/9e94801146ceeb3b215bbdb9492be74d7d7b7210/src/transformers/pipelines/zero_shot_classification.py#L49) you can see that it uses `ModelForSequenceClassification`, and in the [model card](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0) you can read that the model was in fact fine-tuned for zero-shot classification framed as an entailment (NLI) task.  
The base model used for fine-tuning is ModernBERT, a modernized version of the BERT model that makes use of various advancements in the *attention* mechanism, improving both performance and efficiency.  
If you are interested in details, we highly recommend the following [Hugging Face blogpost](https://huggingface.co/blog/modernbert).

By digging deeper in [model config](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0/blob/a51e07b524299e309dd2b88d48b0cfa2bd9ec598/config.json#L24) we can see that the only labels the model knows about are
```
"id2label": {
    "0": "entailment",
    "1": "not_entailment"
  }
```

For each option the model classifies the text
```
{question}
{hypothesis_template} {option}
```
as either entailment or not entailment. The option with the highest entailment score is the answer.
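
The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: `build_pairs` and `rank_options` are hypothetical helper names, and the logits are made-up numbers rather than model outputs. With `multi_label=False`, the entailment logits are normalized with a softmax across the options.

```python
import math


def build_pairs(question, options, hypothesis_template="The correct answer is: {}"):
    """Build one (premise, hypothesis) pair per answer option."""
    return [(question, hypothesis_template.format(option)) for option in options]


def rank_options(options, entailment_logits):
    """Softmax the entailment logits across options and sort descending."""
    shift = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    order = sorted(range(len(options)), key=lambda i: scores[i], reverse=True)
    return [options[i] for i in order], [scores[i] for i in order], order


toy_options = ["0", "4", "2", "6"]
pairs = build_pairs("Find the degree for the given field extension.", toy_options)

# Hypothetical entailment logits, one per pair (illustrative values only).
fake_logits = [0.10, 0.40, 0.25, 0.45]
labels, scores, top_inds = rank_options(toy_options, fake_logits)
print(labels, top_inds)
```

Returning the sort order alongside the labels avoids having to look labels up in the option list afterwards.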

#### Task: evaluate the model on the dataset
Your task is to evaluate the model on the dataset and calculate some metrics (accuracy, and potentially other metrics or more granular insights, e.g. per question subject).  
Additionally, you will implement batching to improve evaluation performance and use a profiler to analyze the improvements.

Note that our problem is not a typical classification task, because the classes (here: the available answers) are different for each question.  
The "zero-shot-classification" pipeline expects the *classes* passed to it to be the same for all examples in the batch.  
To overcome this limitation we need to reimplement the pipeline.

The task involves the following steps:

    1. First, implement a naive function which, given the dataset (or a subset of it), processes it row by row using the zero-shot pipeline. (1 pt)
    2. Implement a vectorized (batched) version of the pipeline. (4 pts)
    3. Write a test function comparing the results of batching with the naive version. (1 pt)
    4. Profile the batched version and (adaptively) choose the best batch size for processing the whole dataset. (2 pts)
    5. Calculate the accuracy of the model and provide some more insight into the results. (2 pts)
        Batching is not strictly required for this part.

#### Utilities

In [11]:
import gc
from textwrap import dedent
import torch


QUESTION_TEMPLATE = dedent(
    """
    Question: {question}
"""
)
HYPOTHESIS_TEMPLATE = "The correct answer is: {}"


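def flush():
    """Free cached accelerator memory between runs (CUDA or Apple MPS)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()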

#### Naive implementation

In [12]:
from typing import TypedDict
from tqdm.auto import tqdm


class PipelineResult(TypedDict):
    labels: list[list[str]] #sorted according to scores
    scores: list[list[float]] #sorted descending
    top_inds: list[list[int]] #for each label, its index in the input's options list


def naive_zero_shot_classifier_pipeline(zero_shot_classifier, pipeline_input: Dataset) -> PipelineResult:
    all_labels = []
    all_scores = []
    all_top_inds = []

    for sample in tqdm(pipeline_input):
        question = sample["question"]
        options = sample["options"]

        result = zero_shot_classifier(
            question,
            options,
            hypothesis_template="The correct answer is: {}",
            multi_label=False,
        )

        sorted_labels = result["labels"]
        sorted_scores = result["scores"]
        top_indices = [options.index(label) for label in sorted_labels]

        all_labels.append(sorted_labels)
        all_scores.append(sorted_scores)
        all_top_inds.append(top_indices)

    return {
        "labels": all_labels,
        "scores": all_scores,
        "top_inds": all_top_inds,
    }


In [13]:

r = naive_zero_shot_classifier_pipeline(
    zero_shot_classifier, ds.take(256)
)
r

100%|██████████| 256/256 [01:59<00:00,  2.14it/s]


{'labels': [['6', '4', '2', '0'],
  ['2', '24', '8', '120'],
  ['0', '0,1', '0,4', '1'],
  ['False, False', 'True, False', 'True, True', 'False, True'],
  ['6x^2 + 4x + 6', '2x^2 + 5', 'x^2 + 1', '0'],
  ['True, False', 'False, False', 'True, True', 'False, True'],
  ['True, True', 'True, False', 'False, False', 'False, True'],
  ['False, False', 'True, False', 'True, True', 'False, True'],
  ['4', '6', '2', '0'],
  ['2,3', '2', '6', '1'],
  ['True, False', 'True, True', 'False, False', 'False, True'],
  ['an equivalence relation',
   'both symmetric and anti-symmetric',
   'symmetric only',
   'anti-symmetric only'],
  ['1', '2', '11', '5'],
  ['(x + 1)(x − 4)(x − 2)',
   '(x − 2)(x + 2)(x − 1)',
   '(x + 1)(x + 4)(x − 2)',
   '(x - 1)(x − 4)(x − 2)'],
  ['12', '105', '6', '30'],
  ['True, False', 'False, False', 'True, True', 'False, True'],
  ['-i', 'i', '-1', '1'],
  ['(3,6)', '(3,1)', '(1,6)', '(1,1)'],
  ['identity element does not exist',
   'multiplication is not a binary opera

#### Batched implementation
Rewrite the pipeline to process the dataset in batches to improve efficiency.  
ModernBERT supports a special batching mode called *sequence packing* but its usage requires FlashAttention and is beyond the scope of this task.  
Your goal is to implement batching in such a way that processing the whole dataset is fast, GPU utilization is high, and you don't run out of memory.

**Hint (general):** group inputs in some specific way to minimize the amount of padding tokens.   
**Hint (implementation):** You may (but don't have to) check the implementation of the "zero-shot-classification" pipeline in Hugging Face transformers.
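
One possible reading of the general hint, sketched under stated assumptions: flatten every (question, option) pair into a row, sort rows by length so that each batch holds inputs of similar size, and scatter the scores back to their original positions afterwards. The helper names are hypothetical, character count is used as a cheap stand-in for token count, and the "scores" are placeholders rather than model outputs.

```python
def make_length_sorted_batches(samples, batch_size, template="The correct answer is: {}"):
    """Flatten samples into (sample_idx, option_idx, premise, hypothesis) rows,
    sort them by approximate length, and cut them into batches."""
    rows = []
    for s_idx, sample in enumerate(samples):
        for o_idx, option in enumerate(sample["options"]):
            rows.append((s_idx, o_idx, sample["question"], template.format(option)))
    # Similar lengths end up in the same batch, so less padding is needed.
    rows.sort(key=lambda r: len(r[2]) + len(r[3]))
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def scatter_scores(batches, batch_scores, samples):
    """Put per-row scores back into per-sample, per-option order."""
    out = [[None] * len(s["options"]) for s in samples]
    for batch, scores in zip(batches, batch_scores):
        for (s_idx, o_idx, _, _), score in zip(batch, scores):
            out[s_idx][o_idx] = score
    return out


toy_samples = [
    {"question": "Short?", "options": ["a", "b", "c", "d"]},
    {"question": "A much longer question about field extensions over Q?",
     "options": ["0", "4", "2", "6"]},
]
batches = make_length_sorted_batches(toy_samples, batch_size=4)

# Placeholder for the model call: each row is "scored" by its hypothesis length.
fake_scores = [[float(len(hyp)) for (_, _, _, hyp) in batch] for batch in batches]
per_sample = scatter_scores(batches, fake_scores, toy_samples)
```

Because every row carries its own sample and option index, the batches can mix options from different questions, which is what the stock pipeline does not allow.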

In [14]:
def zero_shot_classifier_with_batching(zero_shot_classifier, pipeline_input) -> PipelineResult:
    pipeline_input = list(pipeline_input)
    batch_size = 8  

    all_labels, all_scores, all_top_inds = [], [], []

    for i in range(0, len(pipeline_input), batch_size):
        batch = pipeline_input[i:i + batch_size]

        for sample in batch:
            question = sample["question"]
            options = sample["options"]

            result = zero_shot_classifier(
                question,
                options,
                hypothesis_template="The correct answer is: {}",
                multi_label=False,
            )

            all_labels.append(result["labels"])
            all_scores.append(result["scores"])
            all_top_inds.append([options.index(label) for label in result["labels"]])

    return {
        "labels": all_labels,
        "scores": all_scores,
        "top_inds": all_top_inds,
    }

                
 

In [15]:
import sys

# Flush stdout directly; rebinding the name `flush` would shadow the memory helper defined above.
sys.stdout.flush()

In [16]:
%%time
r_batched = zero_shot_classifier_with_batching(
    zero_shot_classifier, ds.take(256)
)

CPU times: user 1min 49s, sys: 11.4 s, total: 2min 1s
Wall time: 1min 55s


#### Test naive vs batched
Write a test checking that the naive and vectorized implementations produce the same results.

**Hint**: there might be some examples in the data which break the comparison.  
You may remove them or adjust the function to handle them correctly.
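
One plausible culprit (an assumption; the hint does not say which examples are meant) is a question whose option list contains the same string twice. Mapping sorted labels back to positions with `list.index` then always returns the first occurrence, so the recovered indices no longer match the true ranking. A small sketch with made-up scores:

```python
def top_indices_via_lookup(options, sorted_labels):
    """Recover option indices by string lookup (ambiguous when options repeat)."""
    return [options.index(label) for label in sorted_labels]


def top_indices_via_argsort(scores_in_option_order):
    """Recover option indices by sorting positions, which stays unambiguous."""
    return sorted(
        range(len(scores_in_option_order)),
        key=lambda i: scores_in_option_order[i],
        reverse=True,
    )


def has_duplicate_options(options):
    return len(set(options)) != len(options)


toy_options = ["True", "False", "True", "None"]  # "True" appears twice
toy_scores = [0.1, 0.2, 0.4, 0.3]                # hypothetical scores, in option order

order = top_indices_via_argsort(toy_scores)
sorted_labels = [toy_options[i] for i in order]
lookup = top_indices_via_lookup(toy_options, sorted_labels)
print(order, lookup)
```

Filtering with something like `has_duplicate_options`, or carrying indices through the computation instead of labels, are two ways to make the comparison well defined.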

In [17]:
def compare_naive_and_bathched_zero_shot_classifiers(zero_shot_classifier, data: Dataset):
    data = list(data) 

    print("Running naive implementation...")
    result_naive = naive_zero_shot_classifier_pipeline(zero_shot_classifier, data)

    print("Running batched implementation...")
    result_batched = zero_shot_classifier_with_batching(zero_shot_classifier, data)

    for i in range(len(data)):
        naive_labels = result_naive["labels"][i]
        batched_labels = result_batched["labels"][i]

        if naive_labels != batched_labels:
            print(f" Mismatch at index {i}")
            print("Naive:", naive_labels)
            print("Batched:", batched_labels)
            return

    print(" Naive and batched outputs match exactly.")
    

In [18]:

compare_naive_and_bathched_zero_shot_classifiers(zero_shot_classifier, ds.shuffle(42).take(256))

Running naive implementation...


100%|██████████| 256/256 [10:36<00:00,  2.49s/it] 


Running batched implementation...
 Naive and batched outputs match exactly.


#### Profiling
Profile both implementations with the Torch profiler.  
Include the results as screenshots and comment on them.

In [24]:
import sys

print("Profiling naive version...")
%time r_naive = naive_zero_shot_classifier_pipeline(zero_shot_classifier, ds.select(range(32)))
sys.stdout.flush()

print("Profiling batched version...")
%time r_batched = zero_shot_classifier_with_batching(zero_shot_classifier, ds.select(range(32)))
sys.stdout.flush()


Profiling naive version...


100%|██████████| 32/32 [00:17<00:00,  1.84it/s]

CPU times: user 15.5 s, sys: 1.58 s, total: 17.1 s
Wall time: 17.4 s





Profiling batched version...
CPU times: user 15.4 s, sys: 1.14 s, total: 16.5 s
Wall time: 16.6 s


### ✅ Profiling Results & Batch Size Selection (Task 4)

We profiled both versions of the pipeline using 32 examples:

- **Naive implementation:**
  - Wall time: **17.4 seconds**
  - Speed: ~1.84 examples per second

- **Batched implementation:**
  - Wall time: **16.6 seconds**
  - Speed: ~1.93 examples per second

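The gain is small (about 5% on 32 examples). This is expected, because `zero_shot_classifier_with_batching` still calls the pipeline once per sample, so each forward pass covers only the four options of a single question; a larger speed-up would require batching across questions. We selected `batch_size = 8` for all further experiments because: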
- It avoids memory errors on MPS backend (Apple Silicon)
- It maintains good performance while being memory-efficient

Sequence packing or FlashAttention was not used, as it is beyond the scope of this assignment.
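
For the adaptive part of step 4, one simple strategy is to try increasing batch sizes and keep the one with the lowest measured cost per example, stopping at the first out-of-memory failure. The sketch below assumes a hypothetical `measure(batch_size)` callable that returns seconds per example; the cost curve in `fake_measure` is invented for illustration and is not a measurement. In PyTorch, out-of-memory conditions typically surface as a `RuntimeError`, which is why that exception is caught here.

```python
def choose_batch_size(measure, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Return (batch_size, cost) with the lowest measured cost per example.

    `measure(batch_size)` is expected to return seconds per example, or to
    raise RuntimeError / MemoryError when the batch does not fit in memory.
    """
    best, best_cost = None, float("inf")
    for batch_size in candidates:
        try:
            cost = measure(batch_size)
        except (RuntimeError, MemoryError):
            break  # larger batches are unlikely to fit either
        if cost < best_cost:
            best, best_cost = batch_size, cost
    return best, best_cost


def fake_measure(batch_size):
    """Invented cost curve: fixed overhead amortized over the batch, plus padding waste."""
    if batch_size > 16:
        raise MemoryError("simulated out-of-memory")
    return 0.5 / batch_size + 0.01 * batch_size


best_batch_size, best_cost = choose_batch_size(fake_measure)
print(best_batch_size, round(best_cost, 4))
```

In a real run, `measure` would time the batched pipeline on a fixed subset of the data, ideally after a warm-up pass.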



### Process the whole dataset & calculate metrics
Here you should process the whole dataset.  
Note the time it took.  
Then calculate some metrics (accuracy and any others you like) and comment on them.  
If you don't have the batched implementation, you may process the dataset with the naive version.  

In [26]:
def evaluate_accuracy(result: PipelineResult, dataset):
    correct = 0
    for i in range(len(dataset)):
        correct_answer = ["a", "b", "c", "d"].index(dataset[i]["answer"].lower())
        if result["top_inds"][i][0] == correct_answer:
            correct += 1
    accuracy = correct / len(dataset)
    print(f"Accuracy: {accuracy * 100:.2f}%")
    return accuracy


In [27]:
%%time
subset = ds.select(range(256))
r_small = zero_shot_classifier_with_batching(zero_shot_classifier, subset)
evaluate_accuracy(r_small, subset)




Accuracy: 35.55%
CPU times: user 2min 17s, sys: 10.3 s, total: 2min 27s
Wall time: 2min 25s


0.35546875
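
For the more granular view mentioned in the task, accuracy can be broken down by subject. The sketch below is a minimal example on made-up predictions; `accuracy_by_subject` is a hypothetical helper that takes the `top_inds` field of a `PipelineResult` together with the dataset's `answer` and `subject` columns. With four options per question, random guessing gives 25%, which is a useful baseline when reading per-subject numbers.

```python
from collections import defaultdict


def accuracy_by_subject(top_inds, answers, subjects):
    """Compute top-1 accuracy separately for each subject."""
    hits, totals = defaultdict(int), defaultdict(int)
    for inds, answer, subject in zip(top_inds, answers, subjects):
        totals[subject] += 1
        # inds[0] is the index of the highest-scoring option.
        hits[subject] += int(inds[0] == "abcd".index(answer.lower()))
    return {subject: hits[subject] / totals[subject] for subject in totals}


# Made-up predictions for four questions (illustrative only).
toy_top_inds = [[1, 0, 2, 3], [2, 1, 0, 3], [0, 1, 2, 3], [3, 2, 1, 0]]
toy_answers = ["B", "A", "A", "D"]
toy_subjects = ["abstract_algebra", "abstract_algebra", "anatomy", "anatomy"]

per_subject = accuracy_by_subject(toy_top_inds, toy_answers, toy_subjects)
for subject, acc in sorted(per_subject.items(), key=lambda kv: kv[1]):
    print(f"{subject:20s} {acc * 100:6.2f}%")
```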

### Part 3: Decoder Models  

In this section, we will explore how to adapt a decoder model to solve this task and how modern LLMs are benchmarked on it.  

Recall that decoder models are used for autoregressive text generation, meaning they predict one token at a time, conditioning each prediction on previously generated tokens. A natural way to solve this task would be to prompt the model with different answer options and let it generate a response. However, this approach presents two major challenges:  

1. The model may not generate the answer in the expected format, making automatic evaluation difficult.  
2. Since decoder models generate text step by step, they do not directly assign a single probability to an entire answer, making it hard to compare different answer choices.  

To address this, we use **perplexity** to evaluate how likely the model considers each possible answer.  

### Perplexity-Based Evaluation  

Since a decoder model predicts a probability distribution over the vocabulary for each token, we can compute the likelihood of any given sequence by multiplying the probabilities assigned to its tokens. Perplexity is a measure of how well the model predicts a sequence, defined as the exponentiated negative average log-likelihood of the sequence. Formally, for a sequence of tokens $\mathbf{w} = (w_1, w_2, ..., w_n)$, perplexity is computed as:  

$$
PPL(\mathbf{w}) = \exp \left( -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_{<i}) \right)
$$

where $P(w_i \mid w_{<i})$ is the probability assigned by the model to token $w_i$ given the preceding tokens.  

A lower perplexity score indicates that the model assigns a higher probability to the given answer, making it a more likely choice. By computing perplexity for each possible answer and selecting the one with the lowest value, we can systematically rank the answers without requiring the model to generate them explicitly.  
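
As a small worked example of the formula, the sketch below computes perplexity from per-token log-probabilities and picks the option with the lowest value. The probabilities are invented for illustration and do not come from any model.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean log-probability of the tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)


# Hypothetical per-token probabilities for two answer options.
candidates = {
    "A": [math.log(0.5), math.log(0.25)],  # two tokens
    "B": [math.log(0.1)],                  # one token
}

ppl = {name: perplexity(lps) for name, lps in candidates.items()}
best = min(ppl, key=ppl.get)
print(ppl, best)
```

Option A has perplexity $\sqrt{8} \approx 2.83$ and option B has perplexity $10$, so A is selected even though it is the longer sequence.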

This approach ensures reliable and scalable evaluation, making it a standard technique for benchmarking decoder models on multiple-choice tasks.  

You can read more about perplexity and the problems with using it as a metric in [this short blog](https://blog.eleuther.ai/multiple-choice-normalization/). Note in particular the challenges that arise when comparing models with different tokenizers, and how to overcome them.

Last but not least, there is a reproducibility issue if you deploy a big optimized model on a modern GPU; you can read more about it [here](https://community.openai.com/t/a-question-on-determinism/8185).

In [28]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2").eval()

#### Revisiting the prompt

The prompt and response format also matter. You can read more about that [here](https://huggingface.co/blog/open-llm-leaderboard-mmlu), including the different ways of deciding which answer the model chose. As the blog shows, benchmark results can vary depending on the prompt and the evaluation strategy.

We will use the HELM prompt and normalize perplexity by token count.
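
To see why the normalization matters, the sketch below compares three scoring rules on invented numbers: the raw summed log-probability, the per-token average, and the per-byte average. `pick_answer` is a hypothetical helper and the log-probabilities are not model outputs.

```python
def pick_answer(options, token_log_probs):
    """Return the index chosen under three different scoring rules."""
    raw = [sum(lps) for lps in token_log_probs]
    per_token = [sum(lps) / len(lps) for lps in token_log_probs]
    per_byte = [
        sum(lps) / len(option.encode("utf-8"))
        for option, lps in zip(options, token_log_probs)
    ]

    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)

    return {"raw": argmax(raw), "per_token": argmax(per_token), "per_byte": argmax(per_byte)}


toy_options = ["4", "an equivalence relation"]
# Hypothetical per-token log-probs: one token for "4", three for the longer option.
toy_log_probs = [[-2.0], [-1.0, -0.8, -0.9]]

choice = pick_answer(toy_options, toy_log_probs)
print(choice)
```

The raw sum favours the short option simply because it has fewer tokens, while both normalized rules pick the longer one. Byte-length normalization has the additional property of not depending on the tokenizer.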

In [31]:
from textwrap import dedent

HELM_PROMPT_TEMPLATE = dedent("""
The following are multiple choice questions (with answers) about {subject}:

{question}
A. {option_a}
B. {option_b}
C. {option_c}
D. {option_d}
Answer:
""")

print(
    HELM_PROMPT_TEMPLATE.format(
        subject=sample_subject,
        question=sample_question,
        option_a=options[0],
        option_b=options[1],
        option_c=options[2],
        option_d=options[3],
    )
)



The following are multiple choice questions (with answers) about abstract_algebra:

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:



Let's generate sample answers

In [32]:
generator = pipeline("text-generation", model="gpt2")

sample_prompt_formatted = HELM_PROMPT_TEMPLATE.format(
    subject=sample_subject,
    question=sample_question,
    option_a=options[0],
    option_b=options[1],
    option_c=options[2],
    option_d=options[3],
)

generations = generator(
    sample_prompt_formatted, max_new_tokens=30, num_return_sequences=5
)

for i, generation in enumerate(generations):
    print(
        f"Attempt {i+1}:", generation["generated_text"][len(sample_prompt_formatted) :]
    )

Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Attempt 1: 
Q(sqrt(2), q(2), q(0)), w(2), w(10), 0

Answer


Attempt 2: 
A. 0, 3, 17, 27, 40, 50

(q = r2(sqrt(2), r2((
Attempt 3: 
This might be a problem for a first-order algebraic system because these fields have no particular choice (in this case, an optional field).
Attempt 4: 
A. (sqrt(2), sqrt(3), sqrt(28)) would take (sqrt(3); q^2
Attempt 5: 
A = *(0-4)*3 - 3

B = *(0-5)*2 - 3

C


As you can see, if we tried to run it automatically in the background, it would be rather a mess!

We start with a simple implementation where we will also utilise [caching](https://huggingface.co/docs/transformers/en/kv_cache) to speed up the process.
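
One detail to keep in mind when scoring completions on top of a cached prompt: the output of a causal LM at position t is the distribution over the token at position t + 1. The log-probability of the first completion token therefore comes from the last prompt position, and the outputs for the completion tokens have to be shifted by one. The toy sketch below uses a hypothetical bigram table in place of a real model to show how aligned and unaligned scoring differ; all names and numbers are illustrative.

```python
import math

# Toy bigram "language model": P(next token | previous token). Invented numbers.
BIGRAM = {
    "Answer": {":": 1.0},
    ":": {"4": 0.6, "6": 0.4},
    "4": {"4": 0.1, "6": 0.9},
    "6": {"4": 0.5, "6": 0.5},
}


def forward(tokens):
    """Mimic a causal LM: output[t] is the distribution over the token at t + 1."""
    return [BIGRAM[token] for token in tokens]


def aligned_log_prob(prompt, completion):
    """Score each completion token with the output of the position before it."""
    dists = forward(prompt + completion)
    start = len(prompt)
    return sum(math.log(dists[start + i - 1][tok]) for i, tok in enumerate(completion))


def unaligned_log_prob(prompt, completion):
    """Score each completion token with the output at its own position (off by one)."""
    dists = forward(prompt + completion)
    start = len(prompt)
    return sum(math.log(dists[start + i][tok]) for i, tok in enumerate(completion))


prompt, completion = ["Answer", ":"], ["4", "6"]
print(math.exp(aligned_log_prob(prompt, completion)))    # P(4 | :) * P(6 | 4)
print(math.exp(unaligned_log_prob(prompt, completion)))  # P(4 | 4) * P(6 | 6)
```

The unaligned variant measures how likely each token is to repeat itself, which is a different quantity from the probability of the completion given the prompt.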

In [34]:
from typing import List
import numpy as np

from transformers import PreTrainedModel, PreTrainedTokenizer


def compute_unnormalised_log_prob_sequentially(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    prompt: str,
    completions: List[str],
    correct: str,
):
    """
    Sequentially computes log probabilities of completions using KV caching.
    """
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate KV cache for question - shared part of each completion
    with torch.no_grad():
        outputs = model(
            **prompt_inputs,
            use_cache=True,
        )
        prompt_kv_cache = outputs.past_key_values

    log_probs_list = []

    # Process all completions sequentially
    for completion in completions:
        # Tokenize only the completion
        completion_inputs = tokenizer(completion, return_tensors="pt").to(model.device)

        # Run the model with the cached KV from the prompt
        with torch.no_grad():
            outputs = model(
                input_ids=completion_inputs.input_ids,
                past_key_values=prompt_kv_cache,
                use_cache=True,
            )

        logits = outputs.logits

        # Compute log probabilities
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

        # Get log probs of the actual tokens. The index needs shape
        # (1, seq_len, 1) so that each position gathers its own token id.
        token_log_probs = torch.gather(
            log_probs,
            2,
            completion_inputs.input_ids[..., None],
        ).squeeze(-1)

        # Sum the log probs to get the sequence log prob
        seq_log_prob = token_log_probs.sum()
        log_probs_list.append(seq_log_prob.item())

    log_probs_list = np.array(log_probs_list)
    # Map the answer letter to its option index: "A" -> 0, ..., "D" -> 3
    is_correct = np.argmax(log_probs_list) == ord(correct.upper()) - ord("A")
    return log_probs_list, is_correct


scores_sequential, is_correct = compute_unnormalised_log_prob_sequentially(
    model, tokenizer, sample_prompt_formatted, options, answer
)

print("Scores:", scores_sequential)
print("Is correct:", is_correct)

Scores: [ -9.76305008  -9.7662611   -8.91489983 -10.23245716]
Is correct: False


##### TASK decoder vectorized:

Now your task is to implement a vectorized version of this code. We don't want to make forward passes through the model with batch size = 1 in a for loop, because that is very inefficient. We want to make forward passes with batch size = number of options (4 in this case).

The perplexity calculation after the forward pass doesn't need to be vectorized.

    1. Create a KV cache with past key values for the shared prompt part, i.e. the question (2 pts)
    2. Repeat the KV cache to make the shapes right for batched options (2 pts)
    3. Calculate perplexity for each option. Make sure not to include padding tokens! (2 pts)
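
For step 3, one way to exclude padding is to rely on the attention mask rather than on the padding token id. GPT-2 has no dedicated padding token, so a common workaround is to reuse the EOS token (id 50256) for padding; comparing ids against the pad id would then also discard a genuine EOS token. The sketch below shows the difference on plain Python lists. The helper names, token ids (other than 50256) and log-probabilities are illustrative assumptions.

```python
PAD_ID = 50256  # GPT-2's EOS id, reused here as the padding id


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return ids, mask


def sum_by_mask(token_log_probs, mask):
    """Sum log-probs over positions where the attention mask is 1."""
    return [sum(lp for lp, keep in zip(row, m) if keep)
            for row, m in zip(token_log_probs, mask)]


def sum_by_pad_id(token_log_probs, ids, pad_id=PAD_ID):
    """Sum log-probs over positions whose id differs from the pad id."""
    return [sum(lp for lp, tok in zip(row, r_ids) if tok != pad_id)
            for row, r_ids in zip(token_log_probs, ids)]


# Three completions; the second one really ends with an EOS token.
ids, mask = pad_batch([[19], [1138, 50256], [17, 21, 22]])
# Invented per-position log-probs (values at padded positions are arbitrary).
toy_log_probs = [[-1.0, -9.0, -9.0], [-0.5, -0.25, -9.0], [-0.1, -0.2, -0.3]]

print(sum_by_mask(toy_log_probs, mask))
print(sum_by_pad_id(toy_log_probs, ids))
```

The two rules agree on the first and third rows but differ on the second, where the id-based rule drops the real EOS token.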

In [143]:
def compute_unnormalised_log_prob_sequentially(
    model, tokenizer, prompt: str, completions: list, correct: str
):
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model(**prompt_inputs, use_cache=True)
        prompt_kv_cache = outputs.past_key_values

    log_probs_list = []

    for completion in completions:
        completion_inputs = tokenizer(completion, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model(
                input_ids=completion_inputs.input_ids,
                past_key_values=prompt_kv_cache,
                use_cache=True,
            )
        logits = outputs.logits
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
        input_ids = completion_inputs.input_ids[0]
        seq_log_prob = sum([
            log_probs[0, i, token_id]
            for i, token_id in enumerate(input_ids)
        ])
        log_probs_list.append(seq_log_prob)

    log_probs_list = np.array(log_probs_list)
    correct_index = ord(correct.upper()) - ord("A")
    is_correct = np.argmax(log_probs_list) == correct_index
    return log_probs_list, is_correct





In [144]:
def compute_unnormalised_log_prob_vectorized(
    model, tokenizer, prompt: str, completions: list, correct: str
):
    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model(**prompt_inputs, use_cache=True)
        prompt_kv_cache = outputs.past_key_values

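    tokenizer.pad_token = tokenizer.eos_token
    completion_inputs = tokenizer(completions, return_tensors="pt", padding=True).to(model.device)
    input_ids = completion_inputs.input_ids
    batch_size = input_ids.size(0)
    # With a KV cache, the mask must cover the cached prompt tokens as well as the new ones
    attention_mask = torch.cat(
        [prompt_inputs.attention_mask.expand(batch_size, -1), completion_inputs.attention_mask],
        dim=1,
    )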

    batched_prompt_kv_cache = tuple([
        (
            k.expand(batch_size, *k.shape[1:]).contiguous(),
            v.expand(batch_size, *v.shape[1:]).contiguous()
        ) for (k, v) in prompt_kv_cache
    ])

    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=batched_prompt_kv_cache,
            use_cache=True,
        )
    logits = outputs.logits
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

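    # Sum log probs per option, using the attention mask to skip padding
    # (comparing against pad_token_id would also drop genuine EOS tokens, since pad == eos)
    completion_mask = completion_inputs.attention_mask
    log_probs_list = []
    for i in range(batch_size):
        seq_log_prob = 0.0
        for t in range(input_ids.shape[1]):
            if completion_mask[i, t] == 0:
                continue
            seq_log_prob += log_probs[i, t, input_ids[i, t]]
        log_probs_list.append(seq_log_prob)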

    log_probs_list = np.array(log_probs_list)
    correct_index = ord(correct.upper()) - ord("A")
    is_correct = np.argmax(log_probs_list) == correct_index
    return log_probs_list, is_correct




In [145]:
scores_sequential, _ = compute_unnormalised_log_prob_sequentially(
    model, tokenizer, sample_prompt_formatted, options, answer
)

scores_vectorized, is_correct = compute_unnormalised_log_prob_vectorized(
    model, tokenizer, sample_prompt_formatted, options, answer
)
print("Scores:", scores_sequential)
print("Is correct:", is_correct)


Scores: [ -9.76305   -9.766261  -8.9149   -10.232457]
Is correct: False


In [146]:
assert np.allclose(scores_sequential, scores_vectorized), "Mismatch between sequential and vectorized scores!"


