# **Contradictory, My Dear Watson: Multi-Model Classification Write-Up**

---

## 1. Introduction

Textual Entailment (also referred to as Natural Language Inference, or NLI) is a fundamental task in Natural Language Processing (NLP). The goal is to determine whether a *premise* and a *hypothesis* are in a relationship of **entailment**, **contradiction**, or **neutrality**. In the [“Contradictory, My Dear Watson”](https://www.kaggle.com/competitions/contradictory-my-dear-watson) Kaggle competition, we aim to classify pairs of sentences in multiple languages into one of these three categories.

Our approach uses **transformer-based** models from the [Hugging Face](https://huggingface.co/) `transformers` library. We fine-tune three multilingual pretrained models (**XLM-RoBERTa**, **Multilingual BERT**, and **mDeBERTa**) for the classification task, compare their validation accuracy, and write predictions for the test data.

Section 3 walks through the notebook cell by cell, followed by a conclusion and references:
1. **Environment Setup**
2. **Loading the Dataset**
3. **Defining Helper Functions**
4. **Preparing Train/Validation Split**
5. **Training Multiple Models**
6. **Comparing & Saving Predictions**
7. **Compare Early Predictions**
8. **Conclusion & References**

We'll also see the core math behind cross-entropy loss, which is minimized during fine-tuning.

## 2. Problem Description

Each sample in the **Contradictory, My Dear Watson** dataset provides:
- A **premise** (a sentence).
- A **hypothesis** (another sentence).
- A **label** giving the relationship of the hypothesis to the premise: **entailment**, **neutral**, or **contradiction**.

These examples span **15 different languages**, making it a challenging **multilingual** problem. **Multilingual pretrained** transformer models map text from all of these languages into a shared representation space, so a single fine-tuned model can handle every input.

Hence, we treat this as a **3-class** classification task:
1. **Contradiction**
2. **Neutral**
3. **Entailment**
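
In the data itself these classes appear as integers. A small lookup makes predictions easier to read; the sketch below assumes the encoding given in the competition's data description (0 = entailment, 1 = neutral, 2 = contradiction). The names `LABEL_NAMES` and `decode_labels` are illustrative and not part of the original notebook.

```python
# Assumed label encoding from the competition's data description:
# 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

def decode_labels(label_ids):
    """Map integer class ids to their human-readable names."""
    return [LABEL_NAMES[int(i)] for i in label_ids]

print(decode_labels([0, 2, 1]))
```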

## 3. Method / Code

#### Cell 1: Environment Setup
We'll install required libraries (`transformers`, etc.) and import the necessary modules.


In [None]:
!pip install transformers --quiet

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import shutil  # For copying best model's CSV
from transformers import AutoTokenizer, TFAutoModel
from tensorflow import keras

# For reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

print('Setup complete!')

#### Cell 2: Loading the Dataset
We read the **train** and **test** CSVs. Adjust file paths as necessary for your environment.

On Kaggle, you can load them from `../input/contradictory-my-dear-watson`; locally, use any path. Here we read them from a GitHub copy of the competition files.

In [None]:
train_data = pd.read_csv(
    "https://raw.githubusercontent.com/caseym7875/DSCI-478-Kaggle/refs/heads/main/train.csv"
)
test_data = pd.read_csv(
    "https://raw.githubusercontent.com/caseym7875/DSCI-478-Kaggle/refs/heads/main/test.csv"
)

print('Train shape:', train_data.shape)
print('Test shape:', test_data.shape)
train_data.head()

#### Cell 3: Defining Helper Functions
We define:
1. **`tokenize_data`**: Loads a tokenizer for a given model and converts premise/hypothesis text to numerical IDs.
2. **`build_model`**: Creates a Keras model with the pretrained transformer base plus a simple classification head.
3. **`plot_history`**: Plots training and validation accuracy/loss.

When we tokenize, each premise/hypothesis pair is truncated or padded to a maximum length $L$:
$$
\text{encodings} = \mathrm{Tokenizer}(\text{premise}, \text{hypothesis}, \texttt{max\_length}=L)
$$
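
To make the fixed-length idea concrete, here is a toy sketch of what truncating or padding to $L$ means for a single sequence of token ids. It is not the Hugging Face implementation: it ignores special tokens and the pair-truncation strategy that real tokenizers apply, and `pad_or_truncate` and `pad_id` are hypothetical names used only for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Toy illustration: force a sequence of token ids to exactly max_len entries."""
    ids = list(token_ids)[:max_len]   # truncate if too long
    mask = [1] * len(ids)             # 1 marks real tokens
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad           # pad if too short
    mask += [0] * n_pad               # 0 marks padding positions
    return ids, mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(range(10), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The second return value plays the role of the `attention_mask` that the model receives alongside `input_ids`, so padded positions can be ignored.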

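In each training step, we **minimize cross-entropy** over the 3 classes. For a single example:
$$
\mathcal{L} = - \sum_{i=1}^{3} y_i \log(\hat{y}_i),
$$
where $y$ is the one-hot encoding of the true label and $\hat{y}$ is the predicted probability distribution. In code, the labels stay as integer class ids, and Keras's `sparse_categorical_crossentropy` computes the same quantity without building the one-hot vectors.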
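
As a quick numerical check of this formula, the sketch below evaluates the loss for one example in two ways: with a one-hot target, and by indexing the predicted probability of the true class directly. The probability vector is a made-up value, and the function names are illustrative only.

```python
import numpy as np

def cross_entropy_one_hot(y_one_hot, y_prob):
    """Cross-entropy -sum_i y_i * log(y_hat_i) for a single example."""
    return float(-np.sum(y_one_hot * np.log(y_prob)))

def cross_entropy_sparse(label_id, y_prob):
    """Same loss when the label is an integer class id instead of one-hot."""
    return float(-np.log(y_prob[label_id]))

y_prob = np.array([0.7, 0.2, 0.1])      # hypothetical softmax output
y_one_hot = np.array([1.0, 0.0, 0.0])   # true class is index 0

loss_a = cross_entropy_one_hot(y_one_hot, y_prob)
loss_b = cross_entropy_sparse(0, y_prob)
print(round(loss_a, 4), round(loss_b, 4))  # 0.3567 0.3567
```

Both values equal $-\log(0.7)$, because only the term for the true class survives the sum.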

In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from transformers import AutoTokenizer, TFAutoModel
from IPython.display import display  # used later to render the results table

def tokenize_data(model_name, dataframe, max_len=128):
    """
    Tokenize premise/hypothesis using the specified Hugging Face model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encodings = tokenizer(
        list(dataframe['premise'].values),
        list(dataframe['hypothesis'].values),
        truncation=True,
        padding='max_length',
        max_length=max_len,
        return_tensors='tf'
    )
    return encodings

def build_model(model_name, max_len=128, num_labels=3, lr=2e-5, dropout_rate=0.3):
    """
    Create a Keras model using TFAutoModel + a simple classification head.
    """
    base_model = TFAutoModel.from_pretrained(model_name)
    input_ids      = keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    
    outputs = base_model({'input_ids': input_ids, 'attention_mask': attention_mask})
    cls_token = outputs.last_hidden_state[:, 0, :]  # first token ([CLS] for BERT/DeBERTa, <s> for XLM-R)
    
    x = keras.layers.Dropout(dropout_rate)(cls_token)
    x = keras.layers.Dense(num_labels, activation='softmax')(x)
    
    model = keras.Model(inputs=[input_ids, attention_mask], outputs=x)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

def plot_history(history, title_prefix=""):
    hist = history.history
    epochs = range(1, len(hist['loss']) + 1)
    
    plt.figure(figsize=(10,4))
    # Plot Loss
    plt.subplot(1,2,1)
    plt.plot(epochs, hist['loss'], 'bo-', label='Train Loss')
    if 'val_loss' in hist:
        plt.plot(epochs, hist['val_loss'], 'ro-', label='Val Loss')
    plt.title(f'{title_prefix} Loss')
    plt.xlabel('Sub-Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    # Plot Accuracy
    plt.subplot(1,2,2)
    plt.plot(epochs, hist['accuracy'], 'bo-', label='Train Acc')
    if 'val_accuracy' in hist:
        plt.plot(epochs, hist['val_accuracy'], 'ro-', label='Val Acc')
    plt.title(f'{title_prefix} Accuracy')
    plt.xlabel('Sub-Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

print("Helper functions defined.")

#### Cell 4: Preparing Train/Validation Split
We shuffle the training data and split off 10% as a validation set to monitor performance.

In [None]:
VALIDATION_SPLIT = 0.1
val_size = int(len(train_data) * VALIDATION_SPLIT)

# Shuffle
train_data_shuffled = train_data.sample(frac=1, random_state=SEED).reset_index(drop=True)

val_data = train_data_shuffled.iloc[:val_size]
train_data_ = train_data_shuffled.iloc[val_size:]

y_val = val_data['label'].values
y_train = train_data_['label'].values

print(f"Training set size: {len(train_data_)}")
print(f"Validation set size: {len(val_data)}")
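
Because the split is a plain shuffle-and-slice rather than a stratified split, it can be worth checking that the three labels appear in similar proportions in both parts. The sketch below repeats the same split logic on a hypothetical toy DataFrame and tabulates label shares; `split_and_compare` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

def split_and_compare(df, val_fraction=0.1, seed=42):
    """Shuffle, slice off a validation set, and report label shares per split."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    val_size = int(len(shuffled) * val_fraction)
    val_part = shuffled.iloc[:val_size]
    train_part = shuffled.iloc[val_size:]
    shares = pd.DataFrame({
        "train": train_part["label"].value_counts(normalize=True),
        "val": val_part["label"].value_counts(normalize=True),
    }).fillna(0.0).sort_index()
    return train_part, val_part, shares

# Hypothetical toy data standing in for train_data
toy = pd.DataFrame({"label": [0, 1, 2] * 100})
train_part, val_part, shares = split_and_compare(toy)
print(len(train_part), len(val_part))
print(shares)
```

If the shares differ noticeably between the two columns, a stratified split would be a reasonable alternative.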

#### Cell 5: Training Multiple Models
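We define a list of model candidates:
1. **XLM-RoBERTa** (`joeddav/xlm-roberta-large-xnli`, a large checkpoint that was already fine-tuned on NLI data)
2. **Multilingual BERT** (`bert-base-multilingual-cased`)
3. **mDeBERTa** (`microsoft/mdeberta-v3-base`)

Because the XLM-RoBERTa checkpoint is larger and has already seen NLI data, the comparison with the two base-sized models is not like-for-like.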

For each:
1. We **tokenize** train and validation data.
2. **Build** the Keras model.
3. **Train** for a specified number of epochs, using cross-entropy.
4. **Evaluate** on the validation set.
5. **Predict** on test data.
6. **Save** each model’s predictions to a CSV.

The loss minimized in step 3 is the three-class cross-entropy defined under Cell 3.

In [None]:
model_candidates = [
    "joeddav/xlm-roberta-large-xnli",
    "bert-base-multilingual-cased",
    "microsoft/mdeberta-v3-base"
]

BATCH_SIZE = 8   # Adjust to fit GPU memory
EPOCHS = 2       # Full epochs (will be split into sub-epochs)
MAX_LEN = 64     # Increase if memory allows (e.g., 128/256)

results = {}                  # model_name -> validation accuracy
predictions_dict = {}         # model_name -> predicted labels on test
predictions_csv_dict = {}     # model_name -> CSV path for predictions
saved_model_paths = []        # track where each model is saved

for model_name in model_candidates:
    print(f"\n=== Training model: {model_name} ===")
    # Tokenize train & val
    train_encodings = tokenize_data(model_name, train_data_, max_len=MAX_LEN)
    val_encodings   = tokenize_data(model_name, val_data,   max_len=MAX_LEN)
    
    # Number of batches in one full pass over the training data
    original_steps = int(np.ceil(len(train_data_) / BATCH_SIZE))

    # Build TF Datasets
    train_dataset = (
        tf.data.Dataset
        .from_tensor_slices((dict(train_encodings), y_train))
        .shuffle(buffer_size=len(train_data_))
        .batch(BATCH_SIZE)
        .repeat()  # fit() below draws steps_per_epoch * epochs batches from a single iterator
    )

    val_dataset = (
        tf.data.Dataset
        .from_tensor_slices((dict(val_encodings), y_val))
        .batch(BATCH_SIZE)
    )

    # Split each full epoch into sub-epochs so metrics are logged more often
    subepochs_per_epoch = 10
    steps_per_subepoch = original_steps // subepochs_per_epoch
    total_subepochs = EPOCHS * subepochs_per_epoch  # 2 * 10 = 20
    
    print(f"Original steps per epoch: {original_steps}")
    print(f"Sub-epochs per epoch: {subepochs_per_epoch}")
    print(f"Steps per sub-epoch: {steps_per_subepoch}")
    print(f"Total sub-epochs: {total_subepochs}")
    
    # Build model
    model = build_model(model_name, max_len=MAX_LEN)
    
    # Train: Use total_subepochs as the number of epochs and steps_per_subepoch as steps per epoch.
    history = model.fit(
        train_dataset,
        validation_data=val_dataset,
        epochs=total_subepochs,
        steps_per_epoch=steps_per_subepoch,
        verbose=1
    )
    
    # Plot training history (20 slices)
    plot_history(history, title_prefix=model_name)
    
    # Evaluate on validation set
    val_loss, val_acc = model.evaluate(val_dataset, verbose=0)
    results[model_name] = val_acc
    print(f"Validation Accuracy for {model_name}: {val_acc:.4f}")
    
    # Predict on the test set
    test_encodings = tokenize_data(model_name, test_data, max_len=MAX_LEN)
    test_dataset = tf.data.Dataset.from_tensor_slices(dict(test_encodings)).batch(BATCH_SIZE)
    test_preds = model.predict(test_dataset)
    test_labels = np.argmax(test_preds, axis=1)
    predictions_dict[model_name] = test_labels
    
    # Save model
    save_path = f"./{model_name.replace('/', '_')}_model"
    model.save(save_path)
    saved_model_paths.append(save_path)
    print(f"Model saved to {save_path}")
    
    # Create and save CSV for predictions
    pred_csv_path = f"{model_name.replace('/', '_')}_predictions.csv"
    submission_df = pd.DataFrame({
        'id': test_data['id'],
        'prediction': test_labels
    })
    submission_df.to_csv(pred_csv_path, index=False)
    predictions_csv_dict[model_name] = pred_csv_path
    print(f"Prediction CSV saved as {pred_csv_path}\n")
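
The sub-epoch bookkeeping in the loop above reduces to a few integer operations. The sketch below isolates that arithmetic so it can be checked without training anything. The training-set size is a hypothetical round number; the other values mirror the constants in this cell, and `subepoch_schedule` is an illustrative name.

```python
import math

def subepoch_schedule(n_train, batch_size, epochs, subepochs_per_epoch):
    """Return (steps_per_epoch, steps_per_subepoch, total_subepochs)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    steps_per_subepoch = steps_per_epoch // subepochs_per_epoch
    total_subepochs = epochs * subepochs_per_epoch
    return steps_per_epoch, steps_per_subepoch, total_subepochs

# Hypothetical training-set size; batch size, epochs, and sub-epochs follow the cell above
steps, sub_steps, total = subepoch_schedule(
    n_train=10_000, batch_size=8, epochs=2, subepochs_per_epoch=10
)
print(steps, sub_steps, total)   # 1250 125 20
print(sub_steps * total * 8)     # examples drawn in total: 20000
```

Because of the floor division, a few batches per epoch can be dropped when the step count is not divisible by the number of sub-epochs.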

#### Cell 6: Comparing & Saving Predictions
Here we:
1. Summarize each model’s validation accuracy.
2. Pick the **best** model.
3. Copy that model’s prediction CSV to **`submission.csv`** for Kaggle.


In [None]:
# Create summary DataFrame
model_performance_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Val Accuracy': list(results.values()),
    'Prediction CSV': [predictions_csv_dict[m] for m in results.keys()]
}).sort_values(by='Val Accuracy', ascending=False)

print("=== Model Validation Performance ===")
display(model_performance_df)

# Select best model
best_model_name = model_performance_df.iloc[0]['Model']
best_model_csv = model_performance_df.iloc[0]['Prediction CSV']
print(f"\nBest Model: {best_model_name}")
print(f"Corresponding CSV: {best_model_csv}")

# Copy best model's CSV to submission.csv
shutil.copyfile(best_model_csv, "submission.csv")
print("\nFinal submission.csv created from the best model!")

#### Cell 7: Compare Early Predictions
We can look at the first 10 predictions from each model side by side to see if they disagree on specific samples.


In [None]:
comparison_df = pd.DataFrame()
comparison_df['test_id'] = test_data['id'].head(10)

for model_name in model_candidates:
    comparison_df[model_name] = predictions_dict[model_name][:10]

comparison_df
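
Beyond eyeballing the first ten rows, disagreement can be summarized as a pairwise agreement rate over all test samples. The sketch below uses hypothetical toy predictions in place of `predictions_dict`; `pairwise_agreement` is an illustrative helper.

```python
from itertools import combinations
import numpy as np

def pairwise_agreement(predictions):
    """Fraction of samples on which each pair of models predicts the same label."""
    rates = {}
    for name_a, name_b in combinations(predictions, 2):
        a = np.asarray(predictions[name_a])
        b = np.asarray(predictions[name_b])
        rates[(name_a, name_b)] = float(np.mean(a == b))
    return rates

# Hypothetical predictions standing in for predictions_dict
toy_predictions = {
    "model_a": [0, 1, 2, 2, 0],
    "model_b": [0, 1, 1, 2, 0],
    "model_c": [2, 1, 1, 2, 0],
}
agreement = pairwise_agreement(toy_predictions)
for pair, rate in agreement.items():
    print(pair, rate)
```

Passing the real `predictions_dict` instead of the toy dictionary should work the same way, since it also maps model names to label arrays.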

## 4. Conclusion

We have demonstrated a **multilingual classification** approach using pretrained transformers:
- **XLM-RoBERTa**
- **Multilingual BERT**
- **mDeBERTa**

Each was fine-tuned on the **Contradictory, My Dear Watson** dataset for 2 epochs; training longer may improve accuracy up to the point where validation loss starts to rise. We generated test predictions, saved them to separate CSVs, and chose the best model based on validation accuracy.

### Potential Improvements
- **Longer Training**: More epochs.
- **Hyperparameter Tuning**: Adjust learning rate, batch size.
- **Layer-wise Tuning**: All transformer layers are already trainable here; freezing lower layers or lowering their learning rate may stabilize fine-tuning.
- **Ensembling**: Combine predictions from multiple models.
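
As one possible form of the ensembling idea, the sketch below takes a majority vote over per-model label predictions. It is a minimal illustration on hypothetical toy data, not something the notebook runs. Ties are broken toward the lowest label id, which is an arbitrary choice.

```python
import numpy as np

def majority_vote(predictions, num_labels=3):
    """Pick, per sample, the label predicted by most models (ties go to the lowest id)."""
    # Shape (n_models, n_samples)
    stacked = np.stack([np.asarray(p) for p in predictions.values()])
    # Shape (num_labels, n_samples): how many models voted for each label
    counts = np.stack([(stacked == k).sum(axis=0) for k in range(num_labels)])
    return counts.argmax(axis=0)

# Hypothetical predictions standing in for predictions_dict
toy_predictions = {
    "model_a": [0, 1, 2, 2],
    "model_b": [0, 2, 2, 1],
    "model_c": [1, 2, 2, 0],
}
ensemble_labels = majority_vote(toy_predictions)
print(ensemble_labels)
```

Averaging the softmax probabilities is another common option, but it would require keeping `test_preds` for each model rather than only the argmax labels.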


## 5. References
- [Vaswani et al., 2017] *Attention Is All You Need.* NIPS.
- [Devlin et al., 2019] *BERT: Pre-training of Deep Bidirectional Transformers.* NAACL.
- [Conneau et al., 2020] *Unsupervised Cross-lingual Representation Learning at Scale.* ACL. (XLM-RoBERTa)
- [He et al., 2021] *DeBERTa: Decoding-Enhanced BERT with Disentangled Attention.* ICLR.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)