# Medical Cross-Task Transfer - Kaggle (Virtual Environment)

**Solution**: Uses an isolated virtual environment to avoid dependency conflicts

**Setup**: GPU T4 x2 + Internet ON

---

## Cell 1: Clone Repository

In [None]:
!git clone https://github.com/bharathbolla/Crosstalk_Medical_LLM.git
%cd Crosstalk_Medical_LLM

## Cell 2: Create Virtual Environment

This creates an isolated Python environment to avoid conflicts with Kaggle's pre-installed packages.

In [None]:
# Create virtual environment
!python3 -m venv venv

print("✅ Virtual environment created!")
print("\nNext: Run Cell 3 to install packages")
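
To confirm that an interpreter really is the isolated one, a small check along the following lines may help. It relies on the standard-library behaviour that `sys.prefix` and `sys.base_prefix` differ inside a `venv`; the function name `in_virtualenv` is an illustrative choice, not something from the repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print(f"prefix:  {sys.prefix}")
print(f"in venv: {in_virtualenv()}")
```

The result depends on which interpreter runs the code. Saved to a file (say, a hypothetical `check_env.py`) and run as `!venv/bin/python check_env.py`, it should report the `venv` directory as the prefix.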

## Cell 3: Install Compatible Packages

Installs pinned, mutually compatible package versions into the isolated environment.

In [None]:
# Install packages in venv (compatible versions)
!venv/bin/pip install -q --upgrade pip
!venv/bin/pip install -q pyarrow==14.0.0 datasets==2.20.0
!venv/bin/pip install -q transformers==4.40.0 evaluate==0.4.2
!venv/bin/pip install -q torch accelerate==0.30.0 scikit-learn pyyaml

# Verify versions
!venv/bin/python -c "import datasets; import pyarrow; print(f'datasets: {datasets.__version__}'); print(f'pyarrow: {pyarrow.__version__}')"

print("\n✅ All packages installed in virtual environment!")
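
The verification line above prints only the `datasets` and `pyarrow` versions. A minimal sketch of a broader check, comparing every pin from the install commands against what is actually installed, could look like this. `PINS` and `find_mismatches` are hypothetical names; the lookup uses the standard-library `importlib.metadata.version`, and the script would need to be run with `venv/bin/python` to inspect the virtual environment rather than the notebook kernel.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands in Cell 3.
PINS = {
    "pyarrow": "14.0.0",
    "datasets": "2.20.0",
    "transformers": "4.40.0",
    "evaluate": "0.4.2",
    "accelerate": "0.30.0",
}


def find_mismatches(pins, lookup=version):
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

The `lookup` parameter exists only so the comparison logic can be exercised without touching the real environment.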

## Cell 4: Verify Datasets Exist

In [None]:
from pathlib import Path

data_path = Path("data/raw")
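# Named dataset_names to avoid shadowing the `datasets` library
dataset_names = ["bc2gm", "jnlpba", "chemprot", "ddi", "gad", "hoc", "pubmedqa", "biosses"]

print("Checking datasets...\n")
missing = []
for name in dataset_names:
    found = (data_path / name).exists()
    if not found:
        missing.append(name)
    print(f"{'✓' if found else '✗'} {name}")

# Report success only when every dataset directory was actually found
if missing:
    print(f"\n✗ Missing datasets: {', '.join(missing)}")
else:
    print("\n✅ All datasets are included in the repository!")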

## Cell 5: Test Parsers (Using venv)

In [None]:
# Run test_parsers.py using the virtual environment Python
!venv/bin/python test_parsers.py

## Cell 6: Quick Smoke Test (Optional)

Tests the training pipeline on 100 samples for 50 steps. Because the run sets `fp16=True`, it needs the GPU from the setup above.

In [None]:
%%bash
source venv/bin/activate

python -c "
import sys
sys.path.insert(0, 'src')

from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from data import BC2GMDataset
from data.collators import NERCollator
from pathlib import Path

print('Loading dataset...')
dataset = BC2GMDataset(data_path=Path('data/raw'), split='train')
small_dataset = [dataset[i] for i in range(100)]
print(f'Loaded {len(small_dataset)} samples')

print('Loading BERT model...')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
label_schema = dataset.get_label_schema()
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(label_schema))

print('Setting up training...')
training_args = TrainingArguments(
    output_dir='./smoke_test',
    max_steps=50,
    per_device_train_batch_size=8,
    logging_steps=10,
    fp16=True,
    report_to='none'
)

collator = NERCollator(tokenizer=tokenizer, label_schema=label_schema)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_dataset,
    data_collator=collator
)

print('Training...')
trainer.train()
print('✅ Smoke test complete!')
"
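
The 100-sample subset in the smoke test indexes `dataset[i]` for `i` up to 99, which assumes the split holds at least 100 examples. A minimal sketch of a bounded variant follows; `take_first` is a hypothetical helper, and it assumes the dataset object supports `len()` in addition to the integer indexing the smoke test already uses.

```python
from typing import Any, List, Sequence


def take_first(dataset: Sequence[Any], n: int) -> List[Any]:
    """Return up to n leading examples without indexing past the end."""
    limit = min(n, len(dataset))
    return [dataset[i] for i in range(limit)]
```

In the smoke test this would replace the list comprehension, for example `small_dataset = take_first(dataset, 100)`.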

## Cell 7: Run Baseline Experiment

In [None]:
# Run baseline experiment using venv Python
!venv/bin/python scripts/run_baseline.py \
    --model bert-base-uncased \
    --task bc2gm \
    --epochs 3 \
    --batch_size 16

## Success! 🎉

If you got here without errors:
- ✅ Virtual environment created
- ✅ Compatible packages installed
- ✅ All 8 parsers working
- ✅ Ready for experiments!

---

### Key Points:

1. **Always use venv Python**: `venv/bin/python` instead of system Python
2. **All datasets included**: No downloads needed
3. **No version conflicts**: Isolated environment
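
When scripts are launched from Python rather than with `!`, the first key point can be enforced by building every command around the `venv` interpreter. The sketch below assumes the Linux layout used throughout this notebook (`venv/bin/python`); `venv_command` is a hypothetical helper, and the flags in the example are the ones passed to `scripts/run_baseline.py` in Cell 7.

```python
from pathlib import PurePosixPath
from typing import List


def venv_command(script: str, *args: str, venv_dir: str = "venv") -> List[str]:
    """Build an argument list that runs `script` with the venv interpreter."""
    python = PurePosixPath(venv_dir) / "bin" / "python"
    return [str(python), script, *args]


cmd = venv_command(
    "scripts/run_baseline.py",
    "--model", "bert-base-uncased",
    "--task", "bc2gm",
)
print(" ".join(cmd))
# prints: venv/bin/python scripts/run_baseline.py --model bert-base-uncased --task bc2gm
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` so that a failing script raises an error instead of passing silently.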

### Next Steps:

Run the full experiments (prefix the command with `!` when running it from a notebook cell):
```bash
venv/bin/python scripts/run_experiment.py strategy=s1_single task=all
```