# Colab Starter: Universal Source Modeling Challenge

This notebook is a practical starting point for students to train at home and rehearse live-day evaluation.

Key reminders:
- Prediction is **sequential/online** (no lookahead).
- The evaluator enforces a **context limit** and **runtime limit**.
- Your competition-day submission is a Python file that defines `build_predictor(...)`.

See also:
- `COMPETITION_RULES.md`
- `submissions/README.md`
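
The sequential setting from the reminders above can be sketched as a small loop. This is a minimal illustration, not the evaluator's implementation: the helper name `online_bits_per_symbol`, the `predict_fn(context)` callable, and the bits-per-symbol metric are all assumptions made for the example. The real scoring lives in `competition.run_live_eval`.

```python
import numpy as np


def online_bits_per_symbol(sequence, predict_fn, max_context_length):
    """Average code length in bits/symbol under strictly sequential prediction.

    ASSUMPTION: predict_fn(context) returns a probability vector over symbols.
    """
    total_bits = 0.0
    for t in range(len(sequence)):
        # Only past symbols are visible (no lookahead), truncated to the context limit.
        start = max(0, t - max_context_length)
        context = sequence[start:t]
        probs = predict_fn(context)
        # The true symbol is revealed only after the prediction has been made.
        total_bits += -np.log2(probs[sequence[t]])
    return total_bits / len(sequence)


def uniform_predict(context):
    # Hypothetical baseline: ignore the context, spread mass evenly over 4 symbols.
    return np.full(4, 0.25)


rng = np.random.default_rng(0)
seq = rng.integers(0, 4, size=1000)
bps = online_bits_per_symbol(seq, uniform_predict, max_context_length=16)
print('bits per symbol:', bps)
```

A uniform predictor over four symbols costs 2 bits per symbol regardless of the data, which makes it a handy sanity baseline for any loop of this shape.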


In [None]:
from __future__ import annotations

import platform
import sys
from pathlib import Path

import numpy as np

repo_markers = [Path('competition'), Path('src'), Path('submissions')]
in_repo_root = all(p.exists() for p in repo_markers)
print('cwd:', Path.cwd())
print('repo_root_detected:', in_repo_root)
if not in_repo_root:
    print('If this is Colab, clone/mount the repo and set the notebook cwd to the repository root.')
    print('Expected markers:', [str(p) for p in repo_markers])

# Optional (uncomment if needed):
# import subprocess
# subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'], check=True)

print('python_version:', sys.version.split()[0])
print('platform:', platform.platform())
print('numpy_version:', np.__version__)


## Data Availability

Expected local paths for baseline practice:
- Training data: `data/generator/train.npy`
- Practice test data: `data/generator/test.npy`

If these files are missing, ask the instructor for the practice split or generate one with the provided synthetic source tools (instructor-side workflow).
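
While waiting for the practice split, a toy stand-in can keep the rest of the notebook exercisable. The sketch below samples from a random first-order Markov source. It is not the instructor's synthetic source tooling, and the function names, alphabet size, lengths, and `int64` dtype are all assumptions chosen for illustration.

```python
import numpy as np


def make_transition_matrix(alphabet_size, seed=0):
    """Random row-stochastic matrix; row i is the next-symbol distribution after symbol i."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.full(alphabet_size, 0.5), size=alphabet_size)


def sample_markov_sequence(transition, length, seed=0):
    """Sample a 1-D symbol sequence from a first-order Markov chain."""
    rng = np.random.default_rng(seed)
    k = transition.shape[0]
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(k)
    for t in range(1, length):
        seq[t] = rng.choice(k, p=transition[seq[t - 1]])
    return seq


# Same source for both splits, different sampling seeds.
toy_transition = make_transition_matrix(alphabet_size=4, seed=0)
toy_train = sample_markov_sequence(toy_transition, 10_000, seed=1)
toy_test = sample_markov_sequence(toy_transition, 2_000, seed=2)
print('toy train:', toy_train.shape, toy_train.dtype)

# To persist, write to a scratch location of your choosing (hypothetical path), e.g.:
# np.save('toy_train.npy', toy_train)
```

Saving toy arrays outside `data/generator/` avoids confusing them with the real practice split later.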


In [None]:
from pathlib import Path
import numpy as np

train_path = Path('data/generator/train.npy')
test_path = Path('data/generator/test.npy')

missing = [str(p) for p in (train_path, test_path) if not p.exists()]
if missing:
    print('Missing required files for practice:', missing)
    print('Place practice data under data/generator/ or ask the instructor for a practice split.')
else:
    train = np.load(train_path)
    test = np.load(test_path)
    print('train shape:', train.shape, 'dtype:', train.dtype)
    print('test shape: ', test.shape, 'dtype:', test.dtype)
    if train.ndim == 1 and train.size > 0:
        print('train min/max symbol:', int(train.min()), int(train.max()))
    if test.ndim == 1 and test.size > 0:
        print('test min/max symbol: ', int(test.min()), int(test.max()))


## Predictor Template

Your submission file must define:

```python
def build_predictor(alphabet_size: int, max_context_length: int) -> Predictor:
    ...
```

Start by copying or modifying `submissions/template_predictor.py`. The live evaluator imports this function dynamically.
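
For orientation, here is a minimal sketch of what such a file might contain. The `Predictor` interface itself is defined by the template, so everything beyond the `build_predictor` signature is an assumption here: the class name `FrequencyPredictor`, the `predict_next(context)` signature, and the return type (a probability vector) are illustrative only. Check `submissions/template_predictor.py` for the actual contract.

```python
import numpy as np


class FrequencyPredictor:
    """Add-one smoothed symbol frequencies over the visible context (illustrative baseline)."""

    def __init__(self, alphabet_size: int, max_context_length: int):
        self.alphabet_size = alphabet_size
        self.max_context_length = max_context_length

    def predict_next(self, context):
        # ASSUMPTION: context is a 1-D sequence of past symbols in [0, alphabet_size).
        ctx = np.asarray(context, dtype=np.int64)
        n = self.max_context_length
        recent = ctx[-n:] if n > 0 else ctx[:0]
        counts = np.bincount(recent, minlength=self.alphabet_size).astype(np.float64)
        counts += 1.0  # add-one smoothing keeps every symbol's probability nonzero
        return counts / counts.sum()


def build_predictor(alphabet_size: int, max_context_length: int) -> FrequencyPredictor:
    return FrequencyPredictor(alphabet_size, max_context_length)


probs = build_predictor(4, 8).predict_next([0, 0, 1])
print(probs)
```

Smoothing matters in this kind of baseline because assigning zero probability to a symbol that then occurs would make a log-based score unbounded.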


## Run Smoke Test (5000 tokens)

This quickly verifies that your predictor imports correctly and produces the canonical `FINAL_SCORE` line. The output should report `evaluated_tokens=5000`.


In [None]:
import subprocess
import sys

cmd = [
    sys.executable,
    '-m', 'competition.run_live_eval',
    '--test-path', 'data/generator/test.npy',
    '--predictor-path', 'submissions/template_predictor.py',
    '--smoke-test',
]
print('Running:', ' '.join(cmd))
result = subprocess.run(cmd, capture_output=True, text=True)
print('returncode:', result.returncode)
if result.stdout:
    print(result.stdout.strip())
if result.stderr:
    print('STDERR:\n' + result.stderr.strip())


## Run Full Practice Evaluation (default 200000-token prefix)

This mirrors the competition default behavior. It should print one canonical `FINAL_SCORE` line.
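
A small helper can confirm that exactly one such line appears in the captured output. This is a sketch only: the helper name `final_score_lines` is hypothetical, and it assumes nothing about the line's format beyond the presence of the `FINAL_SCORE` marker. The sample text below is a placeholder, not real evaluator output.

```python
def final_score_lines(stdout: str) -> list[str]:
    """Return every output line that contains the FINAL_SCORE marker."""
    return [line.strip() for line in stdout.splitlines() if 'FINAL_SCORE' in line]


# Placeholder text standing in for captured stdout.
sample_stdout = 'some log line\nFINAL_SCORE <placeholder>\n'
score_lines = final_score_lines(sample_stdout)
print('FINAL_SCORE lines found:', len(score_lines))
```

With a real run, pass `result.stdout` from the `subprocess.run` call and check that the returned list has length one.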


In [None]:
import subprocess
import sys

cmd = [
    sys.executable,
    '-m', 'competition.run_live_eval',
    '--test-path', 'data/generator/test.npy',
    '--predictor-path', 'submissions/template_predictor.py',
]
print('Running:', ' '.join(cmd))
result = subprocess.run(cmd, capture_output=True, text=True)
print('returncode:', result.returncode)
if result.stdout:
    print(result.stdout.strip())
if result.stderr:
    print('STDERR:\n' + result.stderr.strip())


## Tips / Extensions

Common optimization knobs:
- Keep `--validate-probabilities` off during normal runs (it is slower).
- Avoid slow Python work inside `predict_next` (especially repeated allocations).
- Cache expensive preprocessing/model state where possible.
- Keep memory usage within the Colab Pro+ budget (the competition design assumes a GPU memory limit of roughly 16 GB).

Workflow reminders:
- Training at home is allowed. Live day is inference/evaluation under time constraints.
- Test both `--smoke-test` and a full practice run before competition day.
- Do not modify the evaluator script or rely on lookahead.
