## Intro

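1. MultiNLI (English)
   -> Train (392,702), Validation (9,815 matched + 9,832 mismatched, ~20k total), Test (20k, not part of the copy loaded below)

2. XNLI (MultiNLI for 14 languages)
   -> Train (same 392,702 examples, translated), Test (5,010), Validation (2,490)

3. AfriXNLI (MultiNLI for 16 African languages)
   -> Train (none of its own; reuses the above), Test (600), Validation (450) per language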

In [None]:
!pip install -U "transformers>=4.46.0" datasets


In [None]:
import warnings

import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification

warnings.filterwarnings("ignore")

## MultiNLI Dataset

In [None]:
mnli = load_dataset("nyu-mll/multi_nli")
print("MultiNLI:", mnli)
print("\nSample MultiNLI:", mnli["train"][0])

MultiNLI: DatasetDict({
    train: Dataset({
        features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
        num_rows: 392702
    })
    validation_matched: Dataset({
        features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
        num_rows: 9815
    })
    validation_mismatched: Dataset({
        features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
        num_rows: 9832
    })
})

Sample MultiNLI: {'promptID': 31193, 'pairID': '31193n', 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.', 'premise_binary_parse': '( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product 

## XNLI Dataset

In [None]:
xnli = load_dataset("xnli", "all_languages")
print("XNLI:", xnli)
print("Sample XNLI:", xnli["test"][0])

XNLI: DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 392702
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 5010
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 2490
    })
})
Sample XNLI: {'premise': {'ar': 'حسنا ، لم أكن أفكر حتى حول ذلك ، لكن كنت محبطاً تماما ،وأنهيت الحديث معه مرة ثانية .', 'bg': 'Е, аз дори не мислех за това, но бях толкова разочарована, а в крайна сметка отново разговарях с него.', 'de': 'Nun, daran dachte ich nicht einmal, aber ich war so frustriert, dass ich am Ende doch mit ihm redete.', 'el': 'Λοιπόν, δεν το σκέφτηκα καν, αλλά ήμουν τόσο απογοητευμένος, και κατέληξα να του μιλάω και πάλι.', 'en': "Well, I wasn't even thinking about that, but I was so frustrated, and, I ended up talking to him again.", 'es': 'Bien, ni estaba pensando en eso, pero estaba tan frustrada y empecé a hablar con él de 

## AfriXNLI Dataset

In [None]:
afri = load_dataset("masakhane/afrixnli", "swa")
afri_test = afri["test"]
print("AfriXNLI:", afri)
print("Sample AfriXNLI:", afri["test"][0])

AfriXNLI: DatasetDict({
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 450
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 600
    })
})
Sample AfriXNLI: {'premise': 'Naam, sikukuwa nafikiri juu ya hilo, lakini nilichanganyikiwa sana, na, hatimaye nikaendelea kuzungumza naye tena.', 'hypothesis': 'Sijaongea na yeye tena.', 'label': 2}
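
The integer `label` in each sample follows the ids that this notebook's `label_map` defines further down (0 = entailment, 1 = neutral, 2 = contradiction). A minimal sketch for turning a sample into a readable line; the helper name `describe` is an assumption made for illustration, and the sample dict is copied from the output above.

```python
# Label ids as defined by label_map later in this notebook.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe(example):
    """Return a one-line, human-readable summary of an NLI example."""
    name = ID2LABEL[example["label"]]
    return f"{example['premise']} => {example['hypothesis']} [{name}]"


# Sample copied from the Swahili test split printed above.
sample = {
    "premise": "Naam, sikukuwa nafikiri juu ya hilo, lakini nilichanganyikiwa sana, "
               "na, hatimaye nikaendelea kuzungumza naye tena.",
    "hypothesis": "Sijaongea na yeye tena.",
    "label": 2,
}

print(describe(sample))
```

Reading a few decoded samples this way is a quick sanity check that the label ids mean what the evaluation code assumes.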


## Demo of AfriXNLI for 'Swahili' language

In [None]:
model_name = "joeddav/xlm-roberta-large-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

label_map = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2
}

id2label = {v: k for k, v in label_map.items()}
correct = 0
total = len(afri_test)

for item in tqdm(afri_test, desc="Evaluating"):
    premise = item["premise"]
    hypothesis = item["hypothesis"]
    label = item["label"]
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True).to(model.device)

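    with torch.no_grad():
        outputs = model(**inputs)
        pred_id = torch.argmax(outputs.logits, dim=-1).item()

    # Translate the model's own label name into the dataset's label id;
    # the checkpoint's output order is not guaranteed to match label_map.
    pred = label_map[model.config.id2label[pred_id].lower()]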

    if pred == label:
        correct += 1

accuracy = correct / total
print(f"\nAccuracy on Swahili test set: {accuracy*100:.2f}%")

Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Evaluating: 100%|██████████| 600/600 [07:58<00:00,  1.25it/s]


Accuracy on Swahili test set: 33.00%
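
An accuracy of 33.00% is close to the 1/3 expected from guessing on three roughly balanced classes, so a per-class breakdown is more informative than a single number. Below is a sketch of a confusion matrix in plain Python; the `gold` and `pred` lists are made-up toy values, not results from this run. A matrix in which two classes land almost entirely in each other's column would point to a label-order mismatch between model and dataset rather than a weak model.

```python
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]  # dataset ids 0, 1, 2


def confusion_matrix(gold, pred, num_classes=3):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in range(num_classes)]
            for g in range(num_classes)]


# Toy data (hypothetical): classes 0 and 2 are swapped, class 1 is correct.
gold = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 1, 1, 0, 0]

matrix = confusion_matrix(gold, pred)
for name, row in zip(LABELS, matrix):
    print(f"{name:<14}{row}")

# Accuracy is the sum of the diagonal divided by the number of examples.
toy_accuracy = sum(matrix[i][i] for i in range(3)) / len(gold)
print(f"accuracy: {toy_accuracy:.2f}")
```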





## Demo of AfriXNLI for 'Yoruba' language

In [None]:
afri = load_dataset("masakhane/afrixnli", "yor")
afri_test = afri["test"]
print("AfriXNLI:", afri)
print("Sample AfriXNLI:", afri["test"][0])

model_name = "joeddav/xlm-roberta-large-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

label_map = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2
}

id2label = {v: k for k, v in label_map.items()}
correct = 0
total = len(afri_test)

for item in tqdm(afri_test, desc="Evaluating"):
    premise = item["premise"]
    hypothesis = item["hypothesis"]
    label = item["label"]
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True).to(model.device)

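    with torch.no_grad():
        outputs = model(**inputs)
        pred_id = torch.argmax(outputs.logits, dim=-1).item()

    # Translate the model's own label name into the dataset's label id;
    # the checkpoint's output order is not guaranteed to match label_map.
    pred = label_map[model.config.id2label[pred_id].lower()]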

    if pred == label:
        correct += 1

accuracy = correct / total
print(f"\nAccuracy on Yoruba test set: {accuracy*100:.2f}%")

AfriXNLI: DatasetDict({
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 450
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 600
    })
})
Sample AfriXNLI: {'premise': 'Ó dáa, mi ò tiẹ̀ ronú nípa ìyẹn, ṣùgbọ́n inúù mi bàjẹ́ , àtiwípé, mo tún pada tún bá a sọ̀rọ̀.', 'hypothesis': 'Mi ò tún tíì bá a sọ̀rọ̀.', 'label': 2}


Evaluating: 100%|██████████| 600/600 [15:24<00:00,  1.54s/it]


Accuracy on Yoruba test set: 28.17%
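
The Swahili and Yoruba cells repeat the same loop and score one example per forward pass. A possible refactor, sketched under assumptions: `batched`, `evaluate`, and `predict_batch` are hypothetical names, and `predict_batch` stands in for any function that maps a list of (premise, hypothesis) pairs to predicted label ids. The stub below always predicts label 1 so that the snippet runs without downloading a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def evaluate(examples, predict_batch, batch_size=16):
    """Return the accuracy of `predict_batch` over a list of NLI examples."""
    correct = 0
    for batch in batched(examples, batch_size):
        pairs = [(ex["premise"], ex["hypothesis"]) for ex in batch]
        preds = predict_batch(pairs)
        correct += sum(int(p == ex["label"]) for p, ex in zip(preds, batch))
    return correct / len(examples)


def always_neutral(pairs):
    # Stub predictor (hypothetical): returns one label id per input pair.
    return [1] * len(pairs)


# Toy examples (hypothetical), one per class.
toy = [
    {"premise": "a", "hypothesis": "b", "label": 0},
    {"premise": "c", "hypothesis": "d", "label": 1},
    {"premise": "e", "hypothesis": "f", "label": 2},
]
print(evaluate(toy, always_neutral, batch_size=2))
```

In the notebook, `predict_batch` would wrap the tokenizer call (which accepts lists of premises and hypotheses) and an argmax over the batch logits. A Hugging Face `Dataset` should be converted with `list(afri_test)` first, since slicing it returns columns rather than a list of rows.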





## Zero-shot vs. few-shot comparison on AfriXNLI

In [None]:
def evaluate_strategies(language_code="swa"):
    """
    Evaluate XLM-RoBERTa on AfriXNLI using zero-shot and few-shot strategies
    """
    # Load dataset
    afri = load_dataset("masakhane/afrixnli", language_code)
    afri_test = afri["test"]

    # Load model
    model_name = "joeddav/xlm-roberta-large-xnli"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    label_map = {"entailment": 0, "neutral": 1, "contradiction": 2}
    id2label = {v: k for k, v in label_map.items()}

    # Few-shot examples (hand-written English toy sentences, not drawn from MultiNLI)
    few_shot_examples = [
        {
            "premise": "The cat sat on the mat.",
            "hypothesis": "The cat is on the mat.",
            "label": 0,  # entailment
            "label_name": "entailment"
        },
        {
            "premise": "The cat sat on the mat.",
            "hypothesis": "The dog is on the mat.",
            "label": 2,  # contradiction
            "label_name": "contradiction"
        },
        {
            "premise": "The cat sat on the mat.",
            "hypothesis": "The animal is resting.",
            "label": 1,  # neutral
            "label_name": "neutral"
        }
    ]

    strategies = {
        "zero_shot": {
            "correct": 0,
            "total": 0
        },
        "few_shot": {
            "correct": 0,
            "total": 0
        }
    }

    print(f"Evaluating on {language_code.upper()} test set ({len(afri_test)} examples)...")
    print("-" * 60)

    for item in tqdm(afri_test, desc="Evaluating strategies"):
        premise = item["premise"]
        hypothesis = item["hypothesis"]
        true_label = item["label"]

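        # Zero-shot evaluation: score the (premise, hypothesis) pair directly
        inputs_zs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs_zs = model(**inputs_zs)
            pred_zs_id = torch.argmax(outputs_zs.logits, dim=-1).item()
        # Map the model's label name to the dataset's label id
        pred_zs = label_map[model.config.id2label[pred_zs_id].lower()]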

        strategies["zero_shot"]["total"] += 1
        if pred_zs == true_label:
            strategies["zero_shot"]["correct"] += 1

        # Few-shot evaluation
        few_shot_prompt = ""
        for example in few_shot_examples:
            few_shot_prompt += f"Premise: {example['premise']}\n"
            few_shot_prompt += f"Hypothesis: {example['hypothesis']}\n"
            few_shot_prompt += f"Relationship: {example['label_name']}\n\n"

        few_shot_prompt += f"Premise: {premise}\n"
        few_shot_prompt += f"Hypothesis: {hypothesis}\n"
        few_shot_prompt += "Relationship:"

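        # Use the same classifier, feeding the whole prompt as a single text segment
        inputs_fs = tokenizer(few_shot_prompt, return_tensors="pt", truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs_fs = model(**inputs_fs)
            pred_fs_id = torch.argmax(outputs_fs.logits, dim=-1).item()
        # Map the model's label name to the dataset's label id
        pred_fs = label_map[model.config.id2label[pred_fs_id].lower()]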

        strategies["few_shot"]["total"] += 1
        if pred_fs == true_label:
            strategies["few_shot"]["correct"] += 1

    # Calculate and display results
    print("\n" + "=" * 60)
    print("STRATEGY PERFORMANCE RESULTS")
    print("=" * 60)

    for strategy_name, results in strategies.items():
        accuracy = results["correct"] / results["total"] * 100
        print(f"{strategy_name.replace('_', ' ').title():<15}: {accuracy:.2f}% ({results['correct']}/{results['total']})")

    return strategies

# Run evaluation for a language
results = evaluate_strategies("swa")

Evaluating on SWA test set (600 examples)...
------------------------------------------------------------


Evaluating strategies: 100%|██████████| 600/600 [27:54<00:00,  2.79s/it]



STRATEGY PERFORMANCE RESULTS
Zero Shot      : 33.00% (198/600)
Few Shot       : 33.50% (201/600)
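
The two strategies differ by 3 correct answers out of 600. As a rough check on whether that gap is meaningful, the sketch below computes the binomial standard error of each accuracy from the counts printed above. Treating the examples as independent draws is a simplifying assumption, and `accuracy_with_stderr` is a name chosen for illustration.

```python
import math


def accuracy_with_stderr(correct, total):
    """Return (accuracy, binomial standard error) for a correct/total count."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


# Counts taken from the results printed above.
zero_shot = accuracy_with_stderr(198, 600)
few_shot = accuracy_with_stderr(201, 600)

for name, (p, se) in [("zero shot", zero_shot), ("few shot", few_shot)]:
    print(f"{name:<10} {p * 100:.2f}% +/- {se * 100:.2f} points")

gap = few_shot[0] - zero_shot[0]
print(f"gap: {gap * 100:.2f} points")
```

Both standard errors come out near 1.9 percentage points, so a 0.5-point gap is well within noise and does not show that the few-shot prompt helps.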
