# Medical Cross-Task Transfer - FINAL WORKING NOTEBOOK

**Status**: ✅ This works on Kaggle!

**Setup**: GPU T4 x2 + Internet ON

**Data**: Loaded from pickle files, so no `datasets` or `pyarrow` install is needed.

---

## Cell 1: Clone Repository & Verify Data

In [None]:
import sys
import os

# Clone repo
print("📥 Cloning repository...")
os.chdir('/kaggle/working')
!rm -rf Crosstalk_Medical_LLM
!git clone https://github.com/bharathbolla/Crosstalk_Medical_LLM.git
os.chdir('Crosstalk_Medical_LLM')

print("\n✅ Repository cloned!")
print(f"Current directory: {os.getcwd()}")

# Verify pickle files exist
print("\n📦 Checking datasets...")
!python test_pickle_load.py

## Cell 2: Install Only Training Libraries

Data loading uses pickle, so `datasets` and `pyarrow` are not needed.

Install only the training libraries: `transformers`, `torch`, `accelerate`, `scikit-learn`, and `wandb`.

In [None]:
# Install training libraries (NOT datasets/pyarrow!)
!pip install -q transformers torch accelerate scikit-learn wandb

# Verify GPU
import torch
print(f"\n✅ PyTorch: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## Cell 3: Load Dataset from Pickle

Example: Load BC2GM (NER) dataset

In [None]:
import pickle
from pathlib import Path

# Load BC2GM dataset from pickle
dataset_name = "bc2gm"
pickle_file = Path(f"data/pickle/{dataset_name}.pkl")

print(f"📦 Loading {dataset_name} from pickle...")
with open(pickle_file, 'rb') as f:
    data = pickle.load(f)

# Show statistics
train_data = data['train']
val_data = data.get('validation', [])
test_data = data.get('test', [])

print(f"\n✅ Loaded {dataset_name}!")
print(f"   Train: {len(train_data):,} samples")
print(f"   Validation: {len(val_data):,} samples")
print(f"   Test: {len(test_data):,} samples")

# Show first sample
sample = train_data[0]
print(f"\n📋 First sample:")
print(f"   ID: {sample.get('id', 'N/A')}")
print(f"   Tokens: {sample['tokens'][:10]}...")
print(f"   NER tags: {sample['ner_tags'][:10]}...")
print(f"   All fields: {list(sample.keys())}")
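
The cell above reads a dictionary that maps split names to lists of sample dictionaries. As a minimal sketch of that layout, the following round-trips a tiny made-up dataset through pickle. The two samples, their tags, and the file name `toy_ner.pkl` are invented for illustration and are not taken from the repository.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy data in the same shape Cell 3 reads:
# {split name: [{"id", "tokens", "ner_tags"}, ...]}
toy_data = {
    "train": [
        {"id": "0", "tokens": ["BRCA1", "is", "a", "gene"], "ner_tags": [1, 0, 0, 0]},
        {"id": "1", "tokens": ["No", "entity", "here"], "ner_tags": [0, 0, 0]},
    ],
    "validation": [],
}

# Write the toy data to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_file = Path(tmp_dir) / "toy_ner.pkl"
    with open(toy_file, "wb") as f:
        pickle.dump(toy_data, f)
    with open(toy_file, "rb") as f:
        loaded = pickle.load(f)

# Missing splits fall back to an empty list, as in Cell 3
split_sizes = {name: len(loaded.get(name, [])) for name in ("train", "validation", "test")}
print(split_sizes)
```

Note that `pickle.load` can execute arbitrary code embedded in a file, so it should only be used on pickle files from a source you trust.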

## Cell 4: Prepare Data for Training

Wrap the pickle data in a PyTorch `Dataset` that the Hugging Face `Trainer` can consume

In [None]:
from transformers import AutoTokenizer
from torch.utils.data import Dataset

class NERDataset(Dataset):
    """Simple NER dataset from pickle data."""

    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

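        # Word-level tokens and their NER tags
        tokens = item['tokens']
        labels = item['ner_tags']

        # Tokenize the pre-split words so each subword can be traced back to its word
        encoding = self.tokenizer(
            tokens,
            is_split_into_words=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # Align labels with the tokenized input (requires a fast tokenizer).
        # Only the first subword of each word keeps the word's label; special
        # tokens, padding, and later subwords get -100 so the loss ignores them.
        aligned_labels = []
        previous_word = None
        for word_id in encoding.word_ids(batch_index=0):
            if word_id is None or word_id == previous_word:
                aligned_labels.append(-100)
            else:
                aligned_labels.append(labels[word_id])
            previous_word = word_id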

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(aligned_labels)
        }

# Load tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create datasets
train_dataset = NERDataset(train_data[:100], tokenizer)  # Use 100 samples for quick test
val_dataset = NERDataset(val_data[:20] if val_data else train_data[:20], tokenizer)

print(f"✅ Created training dataset: {len(train_dataset)} samples")
print(f"✅ Created validation dataset: {len(val_dataset)} samples")
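
Word-level NER tags have to be spread over subword tokens, because a tokenizer may split one word into several pieces. The sketch below shows one common convention in isolation: the first subword of each word keeps the word's tag, and every other position gets -100 so the loss ignores it. The function name `align_labels` and the hand-written `word_ids` list are illustrative assumptions; in practice such a list would come from a fast tokenizer's `word_ids()` method.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets with this value by default


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Map word-level labels onto subword positions (first subword only)."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special token, padding, or a continuation subword
            aligned.append(ignore_index)
        else:
            # First subword of a new word keeps that word's label
            aligned.append(word_labels[word_id])
        previous_word = word_id
    return aligned


# Hand-written example: three words, the second split into two subwords,
# wrapped in special tokens and followed by one padding position
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [0, 1, 2]
print(align_labels(word_ids, word_labels))  # [-100, 0, 1, -100, 2, -100, -100]
```

Labeling only the first subword keeps one prediction per word, which makes it straightforward to compare predictions against the original word-level tags.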

## Cell 5: Quick Training Test (5 minutes)

Train BERT for a few steps to verify everything works

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Determine number of labels (assumes integer tag ids starting at 0)
max_label = max(tag for item in train_data for tag in item['ner_tags'])
num_labels = max_label + 1

print(f"📊 Number of labels: {num_labels}")

# Load model
print(f"\n🤖 Loading {model_name}...")
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./test_trainer",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=10,
    max_steps=50,  # Quick test; overrides num_train_epochs
    logging_steps=10,
    eval_strategy="steps",  # named `evaluation_strategy` in transformers < 4.41
    eval_steps=25,
    save_steps=50,
    fp16=True,  # Use mixed precision
    report_to="none",
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print("\n🚀 Starting training...")
print("=" * 60)
trainer.train()
print("=" * 60)
print("\n✅ Training complete!")
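
Without a `compute_metrics` function, the `Trainer` above reports the evaluation loss but no task metric. A rough sketch of a token-level accuracy that skips -100 positions follows. `masked_token_accuracy` is a hypothetical helper written with plain lists; wiring it into `Trainer` through `compute_metrics` would additionally require taking the argmax over the logits.

```python
def masked_token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of correctly predicted tags over positions that carry a real label."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # padding, special tokens, or continuation subwords
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Made-up predicted and gold tag ids for two short sequences
predictions = [[0, 1, 1, 0], [2, 2, 0, 0]]
labels = [[-100, 1, 0, -100], [2, 1, -100, -100]]
print(masked_token_accuracy(predictions, labels))  # 0.5
```

Token accuracy can look high on NER data dominated by the outside tag, so entity-level F1 is usually the more informative number once the pipeline runs end to end.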

## Cell 6: Load All Datasets

Load all 8 datasets and show the size of each split

In [None]:
import pickle
from pathlib import Path

# Load all datasets
datasets = ["bc2gm", "jnlpba", "chemprot", "ddi", "gad", "hoc", "pubmedqa", "biosses"]

print("📊 DATASET STATISTICS")
print("=" * 60)

all_data = {}
total_samples = 0

for dataset_name in datasets:
    pickle_file = Path(f"data/pickle/{dataset_name}.pkl")

    with open(pickle_file, 'rb') as f:
        data = pickle.load(f)

    all_data[dataset_name] = data
    train_size = len(data['train'])
    total_samples += train_size

    splits_info = ", ".join([f"{k}: {len(v)}" for k, v in data.items()])
    print(f"\n{dataset_name.upper()}:")
    print(f"  Splits: {splits_info}")

print("\n" + "=" * 60)
print(f"✅ Total training samples: {total_samples:,}")
print(f"✅ All {len(datasets)} datasets loaded!")

## Cell 7: Save Your Work

Save the trained model and tokenizer

In [None]:
# Save model
output_dir = "./my_trained_model"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved to {output_dir}")

# List saved files
!ls -lh {output_dir}

## Success! 🎉

You now have:
- ✅ Data loading working (pickle format)
- ✅ Training pipeline working
- ✅ All 8 datasets available
- ✅ GPU acceleration
- ✅ Model saving

---

## Next Steps:

1. **Extend training**: Increase `max_steps` or `num_train_epochs`
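2. **Try other datasets**: Change `dataset_name` in Cell 3. Cells 4 and 5 assume the NER fields `tokens` and `ner_tags`, so non-NER datasets will likely need their own dataset class and model head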
3. **Try other models**: Change `model_name` to:
   - `"dmis-lab/biobert-v1.1"`
   - `"allenai/scibert_scivocab_uncased"`
   - `"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"`
4. **Multi-task learning**: Load multiple datasets and combine them
5. **Hyperparameter tuning**: Adjust batch size, learning rate, etc.
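
For the multi-task item, one simple starting point is to pool the training splits and tag each sample with its source task, so that a later sampler or model head can tell them apart. This is only a sketch: `combine_tasks`, the `task` field, and the toy samples are assumptions made for illustration, not part of the repository.

```python
import random


def combine_tasks(datasets_by_name, split="train", seed=0):
    """Pool one split from several datasets, tagging each sample with its task name."""
    combined = []
    for name, data in datasets_by_name.items():
        for sample in data.get(split, []):
            tagged = dict(sample)  # shallow copy, so the original sample is left untouched
            tagged["task"] = name
            combined.append(tagged)
    # Fixed seed so the mixed order is reproducible
    random.Random(seed).shuffle(combined)
    return combined


# Hypothetical toy input in the same shape as `all_data` from Cell 6
toy = {
    "bc2gm": {"train": [{"tokens": ["BRCA1"], "ner_tags": [1]}]},
    "jnlpba": {"train": [{"tokens": ["IL-2"], "ner_tags": [1]},
                         {"tokens": ["the"], "ner_tags": [0]}]},
}
mixed = combine_tasks(toy)
print(len(mixed), sorted({sample["task"] for sample in mixed}))
```

Label ids from different datasets generally do not share a meaning, so a combined set like this is typically paired with a shared encoder and one output head per task rather than a single classifier.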

---

## 📊 Resources:

- Repository: https://github.com/bharathbolla/Crosstalk_Medical_LLM
- All datasets: `data/pickle/` directory
- Documentation: See README files in repo

**Happy experimenting!** 🚀