# Medical Cross-Task Knowledge Transfer - Kaggle Setup

**Project**: Medical NLP with Small Language Models  
**Goal**: Study cross-task knowledge transfer in medical NLP tasks  
**GPU**: T4 (16GB VRAM)  

---

## Setup Checklist

Before running this notebook:
1. ✅ Enable **GPU T4 x2** in Settings → Accelerator
2. ✅ Enable **Internet** in Settings → Internet
3. ✅ Set **Persistence** to "Files only" in Settings

---

## 1️⃣ Clone Repository

In [None]:
# Clone your GitHub repository
!git clone https://github.com/bharathbolla/Crosstalk_Medical_LLM.git
%cd Crosstalk_Medical_LLM

# Verify structure
print("\n📁 Repository structure:")
!ls -la

## 2️⃣ Install Dependencies

In [None]:
# Install required packages
!pip install -q transformers datasets evaluate wandb accelerate scikit-learn pyyaml

print("✅ Dependencies installed!")
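
Because `pip install -q` suppresses most output, a failed install can go unnoticed until a later import breaks. The following is a minimal sketch of an optional check, not part of the repository: the helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative assumptions. It uses only the standard library to confirm that each package can be found.

```python
from importlib.util import find_spec

# Import names for the packages installed above. The pip name and the import
# name differ for scikit-learn (imported as sklearn) and pyyaml (imported as yaml).
REQUIRED_MODULES = [
    "transformers", "datasets", "evaluate", "wandb",
    "accelerate", "sklearn", "yaml",
]

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {missing}")
else:
    print("All required modules are importable.")
```

If the list is not empty, re-running the install cell without `-q` shows the full pip output.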

## 3️⃣ Verify GPU

In [None]:
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("⚠️ GPU not available! Check Settings → Accelerator → GPU T4 x2")

## 4️⃣ Download Datasets (15 minutes)

Downloads all 8 medical NLP datasets from Hugging Face as Parquet files and saves them under `data/raw`.

In [None]:
from datasets import load_dataset
from pathlib import Path

# Create data directory
data_path = Path("data/raw")
data_path.mkdir(parents=True, exist_ok=True)

# Dataset configurations
datasets_config = {
    "bc2gm": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/bc2gm",
        "splits": ["train", "validation", "test"]
    },
    "jnlpba": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/jnlpba",
        "splits": ["train", "validation", "test"]
    },
    "chemprot": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/chemprot",
        "splits": ["train", "validation", "test"]
    },
    "ddi": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/ddi_corpus",
        "splits": ["train", "test"]
    },
    "gad": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/gad",
        "splits": ["train", "test"]
    },
    "hoc": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/hallmarks_of_cancer",
        "splits": ["train", "validation", "test"]
    },
    "pubmedqa": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/pubmed_qa",
        "splits": ["train", "validation", "test"]
    },
    "biosses": {
        "url": "https://huggingface.co/datasets/bigbio/blurb/resolve/refs%2Fconvert%2Fparquet/biosses",
        "splits": ["train", "validation", "test"]
    }
}

print("📥 Downloading 8 medical NLP datasets...\n")

total_samples = 0
for name, config in datasets_config.items():
    print(f"Downloading {name}...")
    base_url = config["url"]
    
    # Build data_files dict
    data_files = {}
    for split in config["splits"]:
        data_files[split] = f"{base_url}/{split}/0000.parquet"
    
    # Load and save
    dataset = load_dataset("parquet", data_files=data_files)
    dataset.save_to_disk(str(data_path / name))
    
    # Show stats
    train_size = len(dataset["train"])
    total_samples += train_size
    print(f"  ✓ {name}: {train_size:,} training samples\n")

print(f"✅ All {len(datasets_config)} datasets downloaded!")
print(f"📊 Total training samples: {total_samples:,}")
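
Since the download takes about 15 minutes, it can be worth skipping datasets that an earlier session already saved. The sketch below is one possible way to do that and is not part of the repository: `is_already_saved` and `pending_downloads` are hypothetical helpers, and the check assumes that `save_to_disk()` on a `DatasetDict` leaves a `dataset_dict.json` file in the target directory.

```python
import tempfile
from pathlib import Path

def is_already_saved(dataset_dir: Path) -> bool:
    """Return True if dataset_dir looks like a finished save_to_disk() output."""
    # Assumption: DatasetDict.save_to_disk() writes a dataset_dict.json marker file.
    return (dataset_dir / "dataset_dict.json").is_file()

def pending_downloads(names, data_root: Path):
    """Return the dataset names that still need to be downloaded."""
    return [name for name in names if not is_already_saved(data_root / name)]

# Demo: a temporary directory stands in for data/raw.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "bc2gm").mkdir()
    (root / "bc2gm" / "dataset_dict.json").write_text("{}")
    todo = pending_downloads(["bc2gm", "jnlpba"], root)
    print(f"Still to download: {todo}")
```

In the download loop, the same check could guard each iteration, for example `if is_already_saved(data_path / name): continue`.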

## 5️⃣ Test Parsers

In [None]:
# Test that parsers work
import sys
sys.path.insert(0, "src")

from data import TaskRegistry, BC2GMDataset
from pathlib import Path

# Check registered tasks
print(f"Registered tasks: {TaskRegistry.list_tasks()}")

# Load one dataset
dataset = BC2GMDataset(
    data_path=Path("data/raw"),
    split="train"
)
print(f"\nLoaded {len(dataset)} BC2GM samples")
print(f"First sample:\n  {dataset[0].input_text[:150]}...")

# Check label schema
schema = dataset.get_label_schema()
print(f"\nLabel schema ({len(schema)} labels): {list(schema.keys())}")

print("\n✅ Everything works! Ready to train!")

## 6️⃣ Smoke Test - Quick Training Test (10 minutes)

Train BERT on 100 samples for 50 steps to verify the pipeline works.

In [None]:
from transformers import (
    AutoTokenizer, 
    AutoModelForTokenClassification, 
    TrainingArguments, 
    Trainer
)
from src.data import BC2GMDataset
from src.data.collators import NERCollator
from pathlib import Path

print("🚀 Starting smoke test...\n")

# 1. Load tiny subset (100 samples only)
dataset = BC2GMDataset(data_path=Path("data/raw"), split="train")
small_dataset = [dataset[i] for i in range(100)]
print(f"✓ Loaded {len(small_dataset)} samples")

# 2. Load BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"✓ Loaded tokenizer: {model_name}")

label_schema = dataset.get_label_schema()
num_labels = len(label_schema)

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels
).to("cuda")
print(f"✓ Loaded model: {model_name} ({num_labels} labels)")

# 3. Setup training (just 50 steps!)
training_args = TrainingArguments(
    output_dir="./smoke_test_output",
    max_steps=50,
    per_device_train_batch_size=8,
    logging_steps=10,
    save_steps=25,
    fp16=True,  # Use mixed precision for speed
    report_to="none",  # Don't log to wandb yet
)
print("✓ Training config ready")

# 4. Create collator
collator = NERCollator(tokenizer=tokenizer, label_schema=label_schema)
print("✓ Collator ready")

# 5. Train!
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_dataset,
    data_collator=collator
)

print("\n" + "="*60)
print("Training 50 steps on 100 samples...")
print("="*60 + "\n")

trainer.train()

print("\n" + "="*60)
print("✅ Smoke test complete! Your pipeline works on Kaggle!")
print("="*60)

## 🎉 Success!

If you got here without errors, you're ready for real experiments!

---

## Next Steps

### Option 1: Run Contamination Check (2 hours)

Before training, check whether test data leaked into the models' pre-training corpora:

```python
!python scripts/run_contamination_check.py \
    --data_path data/raw \
    --output_dir contamination_results \
    --device cuda
```

### Option 2: Run First Baseline (1 hour)

BERT baseline on BC2GM:

```python
!python scripts/run_baseline.py \
    --model bert-base-uncased \
    --task bc2gm \
    --epochs 3 \
    --batch_size 16
```

### Option 3: Run Full Experiment (4-6 hours)

Single-task training on all tasks:

```python
!python scripts/run_experiment.py strategy=s1_single task=all
```

---

## 📊 Monitor GPU Usage

Run this in a separate cell for a one-off snapshot (`watch` would keep the cell running and block the notebook kernel until interrupted):

```python
!nvidia-smi
```
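
To log GPU memory from Python, for example between training runs, the `nvidia-smi` query mode can be parsed directly. This is a minimal sketch rather than repository code: `parse_gpu_stats` and `query_gpus` are hypothetical helpers, and the sample text is made-up output in the format produced by `--format=csv,noheader,nounits`.

```python
import subprocess

QUERY_FIELDS = "index,name,memory.used,memory.total,utilization.gpu"

def parse_gpu_stats(csv_text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, name, mem_used, mem_total, util = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(mem_used),
            "memory_total_mib": int(mem_total),
            "utilization_pct": int(util),
        })
    return stats

def query_gpus() -> list:
    """Call nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)

# Illustrative sample text, not real measurements.
sample = "0, Tesla T4, 1024, 15360, 35\n1, Tesla T4, 3, 15360, 0\n"
gpus = parse_gpu_stats(sample)
for gpu in gpus:
    print(f"GPU {gpu['index']}: {gpu['memory_used_mib']}/{gpu['memory_total_mib']} MiB")
```

On a GPU session, `query_gpus()` returns the live values. The parser assumes every field is numeric, so it would need extra handling if the driver reports a field as unavailable.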

---

## 🔧 Troubleshooting

**"CUDA out of memory"**:
- Reduce `per_device_train_batch_size` to 4 or 2
- Add `gradient_accumulation_steps=4` to simulate larger batch
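
The two settings above trade memory for extra forward and backward passes while keeping the number of samples per optimizer update unchanged. A minimal sketch of the arithmetic follows; the helper `effective_batch_size` is illustrative, while the keyword names in `oom_safe_args` are the `TrainingArguments` parameters mentioned above.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps=1, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# Baseline setting from the BC2GM example: batch size 16, no accumulation.
baseline = effective_batch_size(16)

# Memory-friendly setting: 4 samples per step, accumulated over 4 steps.
reduced = effective_batch_size(4, gradient_accumulation_steps=4)

print(f"baseline={baseline}, reduced={reduced}")

# Keyword arguments that could be passed on to TrainingArguments(...).
oom_safe_args = dict(per_device_train_batch_size=4, gradient_accumulation_steps=4)
```

Both configurations update the weights on 16 samples at a time, but the second holds only 4 samples' worth of activations in memory at once.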

**"ModuleNotFoundError"**:
- Make sure `sys.path.insert(0, "src")` is in the cell
- Re-run the imports cell

**Session disconnected**:
- Your checkpoints are saved every 200 steps
- Resume training from the last checkpoint
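
`Trainer` writes each checkpoint to a `checkpoint-<step>` folder inside `output_dir`, so resuming comes down to finding the folder with the highest step number. The sketch below shows one way to locate it with the standard library. `find_last_checkpoint` is a hypothetical helper, not repository code, and the folder names in the demo are made up.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def find_last_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        return None
    candidates = []
    for child in output_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare step numbers numerically so that checkpoint-200 beats checkpoint-50.
    return max(candidates, key=lambda item: item[0])[1]

# Demo with made-up checkpoint folders in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for step in (25, 50, 200):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    last_name = find_last_checkpoint(tmp).name
    print(f"Latest checkpoint: {last_name}")

# With a configured Trainer, training could then continue with:
# trainer.train(resume_from_checkpoint=str(find_last_checkpoint("./smoke_test_output")))
```

Passing `resume_from_checkpoint=True` to `trainer.train()` instead lets the `Trainer` pick the latest checkpoint in `output_dir` itself.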

---

**Good luck with your experiments!** 🚀