# Fine-Tuning BERT for Natural Language Inference (MNLI)
Natural Language Inference (NLI) is the task of deciding whether a "hypothesis" follows from a "premise" (entailment), conflicts with it (contradiction), or is left undetermined by it (neutral).

## 1. Setup Environment and Installation
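We install the transformers, datasets, accelerate, evaluate, and scikit-learn libraries. Since MNLI is part of the GLUE benchmark, we load it through the Hugging Face datasets library using the "glue" dataset with the "mnli" configuration.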

In [None]:
!pip install transformers datasets accelerate evaluate scikit-learn -q

import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


## 2. Load and Explore the MNLI Dataset
We load the dataset from the GLUE benchmark. Note that MNLI has two validation sets: matched (same domains as training) and mismatched (different domains).

In [None]:
# MNLI is part of the GLUE benchmark
raw_datasets = load_dataset("glue", "mnli")

# Preview training data
train_df = pd.DataFrame(raw_datasets['train']).head()
print("MNLI Data Sample:")
print(train_df[['premise', 'hypothesis', 'label']])

# Label mapping: 0 -> Entailment, 1 -> Neutral, 2 -> Contradiction
labels = raw_datasets["train"].features["label"].names
print(f"\nLabels: {labels}")

MNLI Data Sample:
                                             premise  \
0  Conceptually cream skimming has two basic dime...   
1  you know during the season and i guess at at y...   
2  One of our number will carry out your instruct...   
3  How do you know? All this is their information...   
4  yeah i tell you what though if you go price so...   

                                          hypothesis  label  
0  Product and geography are what make cream skim...      1  
1  You lose the things to the following level if ...      0  
2  A member of my team will execute your orders w...      0  
3                  This information belongs to them.      0  
4           The tennis shoes have a range of prices.      1  

Labels: ['entailment', 'neutral', 'contradiction']
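
The label names printed above can be turned into lookup tables in both directions. The following is a minimal sketch, assuming the order shown in the output; the names `id2label` and `label2id` are chosen here to match the keyword arguments that `from_pretrained` accepts for a model config, but they are not part of the original notebook.

```python
# Label names in the order printed by the cell above (list index = class id).
label_names = ["entailment", "neutral", "contradiction"]

# Map class ids to readable names and back again.
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id["contradiction"])
```

Passing such mappings when the model is created should make the saved config report readable names rather than generic ones such as LABEL_0.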


## 3. Baseline Comparison (Traditional Machine Learning)
For the baseline, we concatenate the premise and hypothesis with a separator, vectorize the result with TF-IDF, and train a Logistic Regression classifier on the first 20,000 training examples.

In [None]:
# Prepare text for baseline (concatenate premise and hypothesis)
def prepare_baseline_text(dataset_split):
    return [p + " [SEP] " + h for p, h in zip(dataset_split['premise'], dataset_split['hypothesis'])]

# Use the first 20,000 training examples to keep the baseline fast
train_subset = raw_datasets['train'].select(range(20000))
train_texts = prepare_baseline_text(train_subset)
test_texts = prepare_baseline_text(raw_datasets['validation_matched'])

y_train = train_subset['label']
y_test = raw_datasets['validation_matched']['label']

# Vectorization
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Training Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# Accuracy
lr_preds = lr_model.predict(X_test)
print(f"Baseline (Logistic Regression) Accuracy: {accuracy_score(y_test, lr_preds):.4f}")

Baseline (Logistic Regression) Accuracy: 0.4177
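
To put the baseline accuracy in context, it helps to compare it with a classifier that always predicts the most frequent training label. The sketch below uses a hypothetical helper, `majority_class_accuracy`, and toy labels rather than MNLI data. MNLI's three classes are roughly balanced, so on the real `y_train` and `y_test` the result would be expected to land near one third; this is an expectation, not a measured figure.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for y in test_labels if y == majority_label)
    return correct / len(test_labels)

# Hypothetical toy labels, used only to illustrate the computation.
toy_train = [0, 1, 2, 0, 1, 2, 0]
toy_test = [0, 1, 2, 0]
toy_accuracy = majority_class_accuracy(toy_train, toy_test)
print(toy_accuracy)
```

In the notebook itself the call would be `majority_class_accuracy(y_train, y_test)`.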


## 4. BERT Tokenization for Sentence Pairs
Unlike single-sentence classification, BERT handles NLI by taking both sentences as one input sequence, separated by a [SEP] token. Segment (token type) embeddings mark which tokens belong to the premise and which to the hypothesis.
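
The pair layout can be illustrated without the real tokenizer. This is a simplified sketch: whitespace splitting stands in for WordPiece tokenization, and `encode_pair` is a hypothetical helper, not a library function. It shows only where the special tokens go and how the segment ids are assigned.

```python
def encode_pair(premise, hypothesis):
    """Lay out a sentence pair as [CLS] premise [SEP] hypothesis [SEP].

    Whitespace splitting is an assumption made for illustration;
    the real BERT tokenizer splits text into WordPiece units.
    """
    first = ["[CLS]"] + premise.lower().split() + ["[SEP]"]
    second = hypothesis.lower().split() + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], the premise, and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP].
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids

tokens, token_type_ids = encode_pair("A man sleeps", "A person rests")
for tok, seg in zip(tokens, token_type_ids):
    print(f"{seg}  {tok}")
```

The real tokenizer returns the same kind of information in its `input_ids` and `token_type_ids` fields.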

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Pass both premise and hypothesis to the tokenizer
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)


## 5. Model Configuration
We load bert-base-uncased with 3 output labels for Entailment, Neutral, and Contradiction.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
model.to(device)

import evaluate
metric = evaluate.load("glue", "mnli")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
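
The `compute_metrics` function above reduces each row of logits to a predicted class with an argmax and then scores the predictions. The pure-Python sketch below mirrors that logic on hypothetical toy logits; `argmax_accuracy` is an illustrative name, and the real function delegates the scoring to the `evaluate` metric.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row, then compute plain accuracy."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return predictions, correct / len(labels)

# Hypothetical logits for three examples and three classes.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.0, 1.0, 0.5],
]
toy_labels = [0, 2, 2]

preds, acc = argmax_accuracy(toy_logits, toy_labels)
print(preds, acc)
```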


## 6. Fine-Tuning BERT for NLI
We use the Trainer API. Because MNLI is very large (~392k rows) and this run uses the CPU, we train on a small stratified subset of 500 samples (set by target_train_size in the code below).

In [None]:
training_args = TrainingArguments(
    output_dir="finetuning-bert-nli",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)

# Express the target sample count as a fraction of the full training set
num_total_train_samples = len(tokenized_datasets["train"])
target_train_size = 500
train_split_fraction = target_train_size / num_total_train_samples

# Create a stratified sample
# The train_test_split method on a Dataset returns a DatasetDict with 'train' and 'test' keys
stratified_split = tokenized_datasets["train"].train_test_split(
    train_size=train_split_fraction, # This will be the smaller training set
    stratify_by_column="label",
    seed=42 # for reproducibility
)

# Use the 'train' part of the stratified split as the new training dataset
stratified_train_dataset = stratified_split["train"]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=stratified_train_dataset,
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.891076,0.599796
2,No log,0.889205,0.620173




TrainOutput(global_step=64, training_loss=0.6409909129142761, metrics={'train_runtime': 8620.5205, 'train_samples_per_second': 0.116, 'train_steps_per_second': 0.007, 'total_flos': 65778354432000.0, 'train_loss': 0.6409909129142761, 'epoch': 2.0})
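
The reported `global_step=64` can be checked against the training configuration. The sketch below assumes a single device, no gradient accumulation, and a subset of 500 samples as requested by `target_train_size`; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(num_samples=500, batch_size=16, num_epochs=2)
print(steps)
```

With 32 steps per epoch over 2 epochs this gives 64 steps, matching the output. Since the default `logging_steps` in `TrainingArguments` is 500, a run this short would not reach the first logging step, which would explain the "No log" entries in the Training Loss column.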

## 7. Final Evaluation and Saving
We evaluate on the validation_matched set and save the model to fulfill the GitHub submission requirement.

In [None]:
# Evaluate BERT
results = trainer.evaluate()
print(f"Final BERT Accuracy on MNLI Matched: {results['eval_accuracy']:.4f}")

# Save the model
model.save_pretrained("./finetuning-bert-nli")
tokenizer.save_pretrained("./finetuning-bert-nli")

Final BERT Accuracy on MNLI Matched: 0.6202


('./finetuning-bert-nli/tokenizer_config.json',
 './finetuning-bert-nli/special_tokens_map.json',
 './finetuning-bert-nli/vocab.txt',
 './finetuning-bert-nli/added_tokens.json',
 './finetuning-bert-nli/tokenizer.json')

## 8. Inference (Prediction)
Test the model with custom premise-hypothesis pairs.

In [None]:
nli_pipeline = pipeline("text-classification", model="./finetuning-bert-nli", device=0 if torch.cuda.is_available() else -1)

# Example: Entailment
p = "A soccer game with multiple players owns the field."
h = "Some people are playing a sport."

# The pipeline above is created but not used for the prediction below.
# Instead, we call the fine-tuned model directly so the premise and hypothesis are encoded as a sentence pair.
def predict_nli(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True).to(device)
    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=-1).item()
    return labels[prediction]

print(f"Premise: {p}")
print(f"Hypothesis: {h}")
print(f"Prediction: {predict_nli(p, h)}")

Device set to use cpu


Premise: A soccer game with multiple players owns the field.
Hypothesis: Some people are playing a sport.
Prediction: neutral
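
A single predicted label hides how confident the model is. One way to inspect this is to convert the logits into probabilities with a softmax. The sketch below uses hypothetical logit values, not the output of the fine-tuned model, and a hand-written `softmax` in place of `torch.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_names = ["entailment", "neutral", "contradiction"]
example_logits = [0.4, 1.1, -0.3]  # hypothetical values for illustration

probs = softmax(example_logits)
for name, p in zip(label_names, probs):
    print(f"{name}: {p:.3f}")
```

In `predict_nli`, the equivalent step would be to apply `torch.softmax(logits, dim=-1)` before taking the argmax.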
