# 🔬 DiffLlama vs Llama: Google Colab Experiment

This notebook runs a comparative experiment on Google Colab to evaluate the noise robustness of DiffLlama and Llama on mathematical reasoning tasks.

## 📋 Experiment Overview
- **Objective**: Compare DiffLlama-375M and Llama-375M performance on noisy math problems
- **Dataset**: GSM8K math reasoning dataset and its noisy variants
- **Evaluation**: Zero-shot performance + attention mechanism analysis
- **Environment**: Google Colab (GPU recommended)
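
For orientation, zero-shot accuracy on GSM8K is commonly scored by comparing the last number in the model's output with the reference answer, which GSM8K stores after a `####` marker at the end of each solution. The sketch below illustrates that idea only. The helper names (`to_number`, `gold_answer`, `predicted_answer`, `accuracy`) are hypothetical, and the repository's own scoring code may differ.

```python
import re

# A number with optional sign, thousands separators, and decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def to_number(text):
    """Parse the last number found in `text`, or return None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def gold_answer(solution):
    # GSM8K reference solutions end with a line of the form "#### <answer>".
    return to_number(solution.split("####")[-1])

def predicted_answer(generation):
    # Assumption: the model's final answer is the last number it prints.
    return to_number(generation)

def accuracy(generations, solutions):
    """Fraction of generations whose final number matches the reference."""
    correct = 0
    for generation, solution in zip(generations, solutions):
        prediction = predicted_answer(generation)
        if prediction is not None and prediction == gold_answer(solution):
            correct += 1
    return correct / len(solutions)
```

The same scoring can be applied unchanged to the clean and noisy variants, since noise is added to the question text rather than to the reference answer.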

---

## 🚀 Step 1: Environment Setup

First, check the runtime environment and configure necessary settings.

In [1]:
# Check GPU availability
import torch
print(f"🖥️  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🔧 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️  No GPU detected. Experiment will be slow on CPU.")

🖥️  CUDA available: True
🔧 GPU: Tesla T4
💾 GPU Memory: 15.8 GB


In [2]:
# Clone from Git repository if project files are not in current directory
# Replace with your actual repository URL
import os
if not os.path.exists('colab/experiment.py'):
    print("📥 Cloning repository...")
    !git clone https://github.com/github-bowen/DiffLlama-Math-Robustness.git
    print("📥 Copying files...")
    !cp -r DiffLlama-Math-Robustness/* .
    print("📥 Removing repository...")
    !rm -rf DiffLlama-Math-Robustness
    print("📥 Done")
else:
    print("✅ Project files found")

📥 Cloning repository...
Cloning into 'DiffLlama-Math-Robustness'...
remote: Enumerating objects: 215, done.
remote: Counting objects: 100% (215/215), done.
remote: Compressing objects: 100% (149/149), done.
remote: Total 215 (delta 120), reused 153 (delta 61), pack-reused 0 (from 0)
Receiving objects: 100% (215/215), 133.91 KiB | 6.70 MiB/s, done.
Resolving deltas: 100% (120/120), done.
📥 Copying files...
📥 Removing repository...
📥 Done


## 📁 Step 2: Upload Project Files

If you didn't clone using Git, manually upload the following files to Colab:

**Required Files**:
- `colab/experiment.py` (main Colab script, run as `python -m colab.experiment`)
- `scripts/download_models.py` (model download script, run as `python -m scripts.download_models`)
- All Python files in the `src/` directory
- `requirements.txt`

Use Colab's file upload feature or copy files from Google Drive.
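
After a manual upload it can help to confirm that nothing is missing before running any experiment. The following is a minimal sketch, not part of the repository. The paths in `REQUIRED` are assumptions inferred from the module invocations used in this notebook (`python -m colab.experiment`, `python -m scripts.download_models`), and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout, inferred from the commands used in this notebook.
REQUIRED = [
    "colab/experiment.py",
    "scripts/download_models.py",
    "src",
    "requirements.txt",
]

def missing_paths(required, root="."):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]

missing = missing_paths(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required files are present")
```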

## 📖 Step 3: View Usage Instructions

Run the command below to view detailed usage instructions and options.

In [3]:
# Display usage instructions
!python -m colab.experiment --instructions


🎯 GOOGLE COLAB USAGE INSTRUCTIONS

1. 📱 Basic Setup (Run once):
   !python -m colab.experiment --setup

2. 🚀 Quick Test (Recommended first run):
   !python -m colab.experiment --mode quick

3. 📊 Medium Experiment:
   !python -m colab.experiment --mode medium

4. 🔬 Full Experiment:
   !python -m colab.experiment --mode full --max-samples 500

5. 🎯 Experiment with Fine-tuning:
   !python -m colab.experiment --mode medium --enable-sft --sft-samples 200

6. 🔄 Skip Zero-shot (only SFT and attention):
   !python -m colab.experiment --mode medium --skip-zero-shot --enable-sft

7. 📈 Only Fine-tuning workflow:
   !python -m colab.experiment --mode medium --skip-zero-shot --enable-sft --skip-attention

🔧 Options:
   --mode: quick/medium/full (experiment scope)
   --max-samples: Limit number of evaluation samples
   --enable-sft: Enable supervised fine-tuning (disabled by default)
   --sft-samples: Number of samples for fine-tuning (default: varies by mode)
   --sft-epochs: Number of epochs for 

## 🔧 Step 4: Initial Setup

Run initial setup to install dependencies and configure the environment.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Run initial setup (includes Google Drive mounting)
!python -m colab.experiment --setup

📦 Installing dependencies...
Collecting evaluate>=0.4.0 (from -r requirements.txt (line 5))
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting hf_xet (from -r requirements.txt (line 13))
  Downloading hf_xet-1.1.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (879 bytes)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->-r requirements.txt (line 1))
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->-r requirements.txt (line 1))
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->-r requirements.txt (line 1))
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->-r requirements.txt (line 1))
  Downloadi

In [6]:
!ls -al

total 80
drwxr-xr-x 1 root root  4096 Jun  2 23:08 .
drwxr-xr-x 1 root root  4096 Jun  2 23:05 ..
lrwxrwxrwx 1 root root    50 Jun  2 23:08 cache -> /content/drive/MyDrive/DiffLlama_Experiment/models
drwxr-xr-x 3 root root  4096 Jun  2 23:06 colab
drwxr-xr-x 4 root root  4096 May 29 14:01 .config
lrwxrwxrwx 1 root root    48 Jun  2 23:08 data -> /content/drive/MyDrive/DiffLlama_Experiment/data
drwx------ 6 root root  4096 Jun  2 23:06 drive
-rw-r--r-- 1 root root  1074 Jun  2 23:06 LICENSE
-rw-r--r-- 1 root root 18135 Jun  2 23:06 main.py
lrwxrwxrwx 1 root root    60 Jun  2 23:08 models_finetuned -> /content/drive/MyDrive/DiffLlama_Experiment/models_finetuned
-rw-r--r-- 1 root root  6409 Jun  2 23:06 README.md
-rw-r--r-- 1 root root   226 Jun  2 23:06 requirements.txt
lrwxrwxrwx 1 root root    51 Jun  2 23:08 results -> /content/drive/MyDrive/DiffLlama_Experiment/results
drwxr-xr-x 1 root root  4096 May 29 14:01 sample_data
drwxr-xr-x 2 root root  4096 Jun  2 23:06 scripts
drwxr-xr-x 2
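
The listing shows `cache`, `data`, `models_finetuned`, and `results` as symlinks into Google Drive, which is what lets downloaded models and results survive a runtime reset. How the setup step creates them is not shown here, so the sketch below is only one plausible way to reproduce that layout. `link_to_drive` and `LINKS` are hypothetical names.

```python
import os
from pathlib import Path

# Link name in the working directory -> folder name on the Drive side,
# copied from the `ls -al` output above.
LINKS = {
    "cache": "models",
    "data": "data",
    "models_finetuned": "models_finetuned",
    "results": "results",
}

def link_to_drive(links, drive_root, workdir="."):
    """Create workdir/<link> symlinks pointing at drive_root/<folder>."""
    drive_root = Path(drive_root)
    workdir = Path(workdir)
    for link_name, folder in links.items():
        target = (drive_root / folder).resolve()
        target.mkdir(parents=True, exist_ok=True)  # create the Drive-side folder
        link = workdir / link_name
        if link.is_symlink() or link.exists():
            continue  # leave anything that is already there untouched
        os.symlink(target, link, target_is_directory=True)

# Example call on Colab, after Drive is mounted:
# link_to_drive(LINKS, "/content/drive/MyDrive/DiffLlama_Experiment")
```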

In [None]:
!python -m scripts.download_models

## 🚀 Step 5: Run Experiments

Choose an appropriate experiment mode based on your needs:

```text
options:
  -h, --help            show this help message and exit
  --mode {quick,medium,full}
                        Experiment mode (default: quick)
  --max-samples MAX_SAMPLES
                        Maximum samples for evaluation
  --enable-sft          Enable supervised fine-tuning (disabled by default)
  --sft-samples SFT_SAMPLES
                        Number of samples for fine-tuning
  --sft-epochs SFT_EPOCHS
                        Number of epochs for fine-tuning
  --skip-attention      Skip attention analysis
  --skip-zero-shot      Skip zero-shot evaluation
  --setup               Only run setup (dependencies and environment)
  --instructions        Display usage instructions
```
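
To make the interaction between these flags concrete, the help text above can be mirrored with `argparse`. This is a reconstruction from the help output, not the repository's parser. The `int` types and `None` defaults are assumptions, since the help text does not state them.

```python
import argparse

def build_parser():
    """Rebuild the option set shown in the help text (types are assumed)."""
    parser = argparse.ArgumentParser(prog="colab.experiment")
    parser.add_argument("--mode", choices=["quick", "medium", "full"],
                        default="quick", help="Experiment mode")
    parser.add_argument("--max-samples", type=int, default=None,
                        help="Maximum samples for evaluation")
    parser.add_argument("--enable-sft", action="store_true",
                        help="Enable supervised fine-tuning")
    parser.add_argument("--sft-samples", type=int, default=None,
                        help="Number of samples for fine-tuning")
    parser.add_argument("--sft-epochs", type=int, default=None,
                        help="Number of epochs for fine-tuning")
    parser.add_argument("--skip-attention", action="store_true",
                        help="Skip attention analysis")
    parser.add_argument("--skip-zero-shot", action="store_true",
                        help="Skip zero-shot evaluation")
    parser.add_argument("--setup", action="store_true",
                        help="Only run setup")
    parser.add_argument("--instructions", action="store_true",
                        help="Display usage instructions")
    return parser

# Parse the same flags as the custom experiment example in this notebook.
args = build_parser().parse_args(
    ["--mode", "medium", "--skip-attention", "--max-samples", "100"]
)
print(args.mode, args.max_samples, args.skip_attention, args.enable_sft)
```

Note that fine-tuning is opt-in (`--enable-sft`), while zero-shot evaluation and attention analysis are opt-out (`--skip-zero-shot`, `--skip-attention`).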

### 🏃 Quick Test (Recommended for First Run)
Validate the experiment workflow on a small number of samples. This takes about 30-60 minutes.

In [None]:
# Quick test mode
!python -m colab.experiment --mode quick

### 📊 Medium-Scale Experiment
Use a moderate number of samples, balancing time and result quality.

In [None]:
# Medium-scale experiment (make sure quick test runs successfully first)
!python -m colab.experiment --mode medium

### 🔬 Full Experiment
Use the complete dataset for the experiment. This may take several hours.

In [None]:
# Full experiment (run only when you have enough time)
# The four commands below are alternatives: keep one and comment out the others.
## Evaluation only
!python -m colab.experiment --mode full --skip-attention

## SFT only
!python -m colab.experiment --mode full --skip-zero-shot --enable-sft --skip-attention

## Evaluation + SFT
!python -m colab.experiment --mode full --enable-sft --skip-attention

## All steps: Evaluation + SFT + Attention Analysis
!python -m colab.experiment --mode full --enable-sft

In [24]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [26]:
# SFT on all 7,473 GSM8K training samples for one epoch, then attention analysis
# (zero-shot evaluation is skipped)
!python -m main --max-samples 50 --sft-samples 7473 --sft-epochs 1 --skip-zero-shot

2025-06-02 23:20:24.555666: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748906424.589585    4380 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748906424.610868    4380 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-02 23:20:24.668618: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
🔬 Running FULL EXPERIMENT
DIFFLAMA VS LLAMA: NOISE ROBUSTNESS EXPERIMENT
Start time: 2025-06-02 23:20:30
✓ All requir

### 🛠 Custom Experiment
Adjust experiment parameters as needed.

In [None]:
# Custom experiment example
# Only run evaluation, skip attention analysis to save time
!python -m colab.experiment --mode medium --skip-attention --max-samples 100

## 📊 Step 6: View Experiment Results

After completing the experiment, review the generated result files.

In [17]:
# List generated result files
!ls -la results/

total 4
drwx------ 5 root root 4096 Jun  2 23:17 attention_maps


In [18]:
# View the latest experiment summary
import json
import glob

# Find the latest summary file
summary_files = glob.glob('results/colab_summary_*.json')
if summary_files:
    latest_summary = max(summary_files)
    print(f"📋 Latest experiment summary: {latest_summary}")

    with open(latest_summary, 'r') as f:
        summary = json.load(f)

    print("\n📊 Experiment Summary:")
    for key, value in summary.items():
        print(f"  {key}: {value}")
else:
    print("No experiment summary found. Please run an experiment first.")

No experiment summary found. Please run an experiment first.
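
The cell above picks the latest file with `max(summary_files)`, which compares file names as strings. That matches the newest file only if the names embed a timestamp that sorts chronologically. A variant that relies on modification time instead is sketched below. `latest_file` is a hypothetical helper, not part of the repository.

```python
import glob
import os

def latest_file(pattern):
    """Return the most recently modified path matching `pattern`, or None."""
    paths = glob.glob(pattern)
    return max(paths, key=os.path.getmtime) if paths else None

latest_summary = latest_file("results/colab_summary_*.json")
print(latest_summary or "No experiment summary found.")
```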


In [19]:
# Display main results
import pandas as pd

# Find the latest results file
result_files = glob.glob('results/colab_results_*.csv')
if result_files:
    latest_results = max(result_files)
    print(f"📈 Latest results: {latest_results}")

    df = pd.read_csv(latest_results)
    # Rows are models, columns are datasets, cells are accuracy
    pivot_df = df.pivot(index='model', columns='dataset', values='accuracy')
    print("\n📊 Performance Comparison:")
    print(pivot_df)

    # Calculate performance differences
    if 'llama' in pivot_df.index and 'diffllama' in pivot_df.index:
        print("\n🔍 Performance Difference (DiffLlama - Llama):")
        diff = pivot_df.loc['diffllama'] - pivot_df.loc['llama']
        print(diff)
else:
    print("No results found. Please run an experiment first.")

No results found. Please run an experiment first.
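
Besides the model-to-model difference, noise robustness is often read as the accuracy each model loses when moving from clean to noisy inputs. The sketch below computes that drop with the standard library so it runs without a results file. The column names (`model`, `dataset`, `accuracy`) come from the cell above, while the dataset names `clean` and `noisy` and all numbers are made up for illustration and are not experiment results.

```python
import csv
import io
from collections import defaultdict

def accuracy_table(rows):
    """Nest rows as table[model][dataset] = accuracy."""
    table = defaultdict(dict)
    for row in rows:
        table[row["model"]][row["dataset"]] = float(row["accuracy"])
    return dict(table)

def drop_from_clean(table, clean="clean"):
    """Accuracy lost on each non-clean dataset relative to the clean one."""
    return {
        model: {
            name: scores[clean] - acc
            for name, acc in scores.items()
            if name != clean
        }
        for model, scores in table.items()
    }

# Toy data only: these values are invented, not measured.
toy_csv = """model,dataset,accuracy
llama,clean,0.30
llama,noisy,0.20
diffllama,clean,0.30
diffllama,noisy,0.25
"""
table = accuracy_table(csv.DictReader(io.StringIO(toy_csv)))
for model, drops in drop_from_clean(table).items():
    for dataset, drop in drops.items():
        print(f"{model} loses {drop:.2f} accuracy on {dataset}")
```

A smaller drop for one model on the same noisy dataset would indicate better robustness, independent of which model has the higher absolute accuracy.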


## 📈 Step 7: Results Visualization

If your experiment included attention analysis, you can view the generated attention heatmaps.

In [20]:
# Display attention heatmaps
import matplotlib.pyplot as plt
from IPython.display import Image, display
import os

attention_dir = 'results/attention_maps'
if os.path.exists(attention_dir):
    print("🧠 Attention Visualization Files:")

    # List all attention map files
    for root, dirs, files in os.walk(attention_dir):
        for file in files:
            if file.endswith('.png'):
                file_path = os.path.join(root, file)
                print(f"  📊 {file_path}")

                # Display images (optional, uncomment to show)
                # display(Image(file_path))
else:
    print("No attention maps found. Run experiment with attention analysis enabled.")

🧠 Attention Visualization Files:
  📊 results/attention_maps/clean_q1/llama_attn_layer-1_head0_sample.png
  📊 results/attention_maps/clean_q1/diffllama_attn_layer-1_head0_sample.png
  📊 results/attention_maps/noisy_q1/llama_attn_layer-1_head0_sample.png
  📊 results/attention_maps/noisy_q1/diffllama_attn_layer-1_head0_sample.png
  📊 results/attention_maps/clean_q2/llama_attn_layer-1_head0_sample.png


In [21]:
# Display attention analysis results
attention_files = glob.glob('results/colab_attention_*.json')
if attention_files:
    latest_attention = max(attention_files)
    print(f"🧠 Latest attention analysis: {latest_attention}")

    with open(latest_attention, 'r') as f:
        attention_data = json.load(f)

    print("\n📊 Attention Allocation Analysis:")
    for model, data in attention_data.items():
        print(f"\n{model.upper()} Model:")
        for condition, stats in data.items():
            print(f"  {condition.capitalize()}:")
            print(f"    KMI (Key Math Info): {stats['kmi_mean']:.3f} ± {stats['kmi_std']:.3f}")
            print(f"    NI (Noise Info): {stats['ni_mean']:.3f} ± {stats['ni_std']:.3f}")
            print(f"    OC (Other Context): {stats['oc_mean']:.3f} ± {stats['oc_std']:.3f}")
else:
    print("No attention analysis found. Run experiment with attention analysis enabled.")

No attention analysis found. Run experiment with attention analysis enabled.
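
The statistics printed above (`kmi_mean`, `ni_mean`, `oc_mean` and their `_std` counterparts) summarize how attention is split between key math information, noise, and other context. How the repository derives them is not shown in this notebook. The sketch below assumes one simple definition: each token carries a category label, and a category's allocation is its share of the total attention mass. The function names, the labels, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

CATEGORIES = ("kmi", "ni", "oc")

def allocation(weights, labels):
    """Share of total attention mass received by each token category."""
    total = sum(weights)
    shares = {cat: 0.0 for cat in CATEGORIES}
    for weight, label in zip(weights, labels):
        shares[label] += weight / total
    return shares

def summarize(samples):
    """Aggregate per-sample shares into <category>_mean / <category>_std."""
    per_sample = [allocation(weights, labels) for weights, labels in samples]
    stats = {}
    for cat in CATEGORIES:
        values = [shares[cat] for shares in per_sample]
        stats[f"{cat}_mean"] = mean(values)
        stats[f"{cat}_std"] = pstdev(values)
    return stats

# Toy attention weights for two short prompts (invented values).
samples = [
    ([0.5, 0.25, 0.25], ["kmi", "ni", "oc"]),
    ([0.25, 0.25, 0.5], ["kmi", "oc", "oc"]),
]
stats = summarize(samples)
print({key: round(value, 3) for key, value in stats.items()})
```

Under this definition, a noise-robust model would show a lower `ni_mean` and a higher `kmi_mean` on noisy inputs.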


## 💾 Step 8: Download Results

Download experiment results locally or ensure they are saved in Google Drive.

In [22]:
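# Compress result files for download
import os
import zipfile
from datetime import datetime

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
zip_filename = f'experiment_results_{timestamp}.zip'

# ZIP_DEFLATED enables compression; the ZipFile default (ZIP_STORED) only archives
with zipfile.ZipFile(zip_filename, 'w', compression=zipfile.ZIP_DEFLATED) as zipf: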
    # Add all files from results directory
    for root, dirs, files in os.walk('results'):
        for file in files:
            file_path = os.path.join(root, file)
            zipf.write(file_path)

print(f"📦 Results packaged in: {zip_filename}")
print("You can download this file from Colab's Files panel.")

# Reminder if Google Drive was used
if os.path.exists('/content/drive/MyDrive/DiffLlama_Experiment'):
    print("\n💾 Results are also saved in Google Drive:")
    print("  /content/drive/MyDrive/DiffLlama_Experiment/")

📦 Results packaged in: experiment_results_20250602_231718.zip
You can download this file from Colab's Files panel.

💾 Results are also saved in Google Drive:
  /content/drive/MyDrive/DiffLlama_Experiment/


## 🛠 Troubleshooting

If you encounter issues, try the following solutions:

In [None]:
# Clear GPU memory cache
import torch
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("✅ GPU cache cleared")

# Check available memory
import psutil
memory = psutil.virtual_memory()
print(f"💾 RAM: {memory.available / 1e9:.1f}GB available / {memory.total / 1e9:.1f}GB total")

In [None]:
# If memory is insufficient, you can restart the runtime (use with caution)
# import os
# os.kill(os.getpid(), 9)

## 🎯 Experiment Conclusions

Based on the experiment results, you can analyze the following key questions:

1. **Noise Robustness**: Does DiffLlama perform better on noisy data?
2. **Attention Mechanism**: Is differential attention more effective at focusing on key information?
3. **Performance Degradation**: How does each model's performance change across different noise types?

---

**Thank you for using this experiment framework!** 🎉

If you run into problems, check that:
- GPU memory is sufficient
- All required files are uploaded
- The network connection is stable

**Tip**: It's recommended to run the quick test mode first to validate the environment before running the full experiment.