# Lab 5: LLM Evaluation and Prompting Strategies
## Stage 1: Experiment Execution

This notebook automates querying several Large Language Models (LLMs) with three prompting strategies:
- **Zero-shot**: Providing only the instruction and input.
- **Few-shot**: Providing a few examples before the target input.
- **Chain-of-Thought (CoT)**: Instructing the model to "think step-by-step".

The goal is to gather raw responses across a diverse set of NLP tasks; the responses will later be evaluated manually.
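To make the three strategies concrete, here is a minimal sketch of how a prompt could be assembled for each one. The function `build_prompt`, the `Input:`/`Output:` layout, and the `input`/`output` example keys are assumptions made for illustration; the prompt templates actually used by `eval_utils` are not shown in this notebook.

```python
def build_prompt(strategy, instruction, text, examples=()):
    """Assemble a prompt for one strategy: 'zero', 'few', or 'cot'.

    Hypothetical helper: the layout and the example keys are assumptions.
    """
    target = f"Input: {text}\nOutput:"
    if strategy == "zero":
        # Zero-shot: instruction and input only
        parts = [instruction, target]
    elif strategy == "few":
        # Few-shot: worked examples precede the target input
        shots = [f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in examples]
        parts = [instruction, *shots, target]
    elif strategy == "cot":
        # Chain-of-Thought: ask for step-by-step reasoning
        parts = [f"{instruction} Think step-by-step before giving the final answer.", target]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return "\n\n".join(parts)


demo = build_prompt(
    "few",
    "Classify the sentiment.",
    "Great film!",
    examples=[{"input": "Awful.", "output": "negative"}],
)
print(demo)
```

The only difference between the strategies in this sketch is what is placed between the instruction and the target input, which keeps the comparison across strategies controlled.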

### 1. Initialization
Load dependencies, set the random seed for reproducibility, and configure project paths.

In [1]:
import json
from pathlib import Path
import numpy as np
from ollama import Client
from eval_utils import run_evaluation

In [2]:
np.random.seed(42)

In [3]:
# Set up paths
ROOT = Path(".").resolve()
LAB_DIR = ROOT
OUTPUT_DIR = LAB_DIR / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
TASKS_DIR = LAB_DIR / 'tasks'
TASKS_DIR.mkdir(parents=True, exist_ok=True)

In [4]:
# Create an Ollama client (default host) and list the models available locally
client = Client()
models_resp = client.list()
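Before starting a long run, it can help to confirm that every model to be queried is already pulled. The sketch below is a plain helper that compares two lists of model tags; `missing_models` is a hypothetical name, and extracting the available tags from `models_resp` is deliberately left out because the response shape depends on the installed `ollama` package version.

```python
def missing_models(selected, available):
    """Return the selected model tags that are absent from the available tags.

    Hypothetical helper: both arguments are plain lists of tag strings.
    """
    available_set = set(available)
    return [tag for tag in selected if tag not in available_set]


# Example with made-up availability data
absent = missing_models(
    ["smollm:1.7b", "magistral:24b"],
    ["smollm:1.7b", "llama3:8b"],
)
print(absent)  # ['magistral:24b']
```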


### 2. Task Definition
Collect task metadata and examples from the `tasks/` directory. Each JSON file defines a specific evaluation objective (e.g., Logical Reasoning, Code Generation).

In [5]:
# Load per-task JSON files from `tasks/`
task_files = sorted(TASKS_DIR.glob('*.json'))

TASKS = []
DEV_EXAMPLES = {}
EVAL_EXAMPLES = {}

for f in task_files:
    d = json.loads(f.read_text(encoding='utf-8'))
    # 'id' and 'name' are required; the remaining keys fall back to defaults
    TASKS.append({
        'id': d['id'],
        'name': d['name'],
        'description': d.get('description', ''),
        'eval_criteria': d.get('eval_criteria', ''),
    })
    DEV_EXAMPLES[d['id']] = d.get('dev_examples', [])
    EVAL_EXAMPLES[d['id']] = d.get('eval_example', {})
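The loader above reads six keys from each task file: `id`, `name`, `description`, `eval_criteria`, `dev_examples`, and `eval_example`. The sketch below shows what such a file might look like and how to check for the two keys that are accessed without a default. All values, and the `input`/`output` shape of the examples, are invented for illustration; `missing_keys` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical task definition: key names follow the loader, values are made up
example_task = {
    "id": "sentiment",
    "name": "Sentiment Classification",
    "description": "Label a review as positive or negative.",
    "eval_criteria": "The label matches the reference.",
    "dev_examples": [{"input": "Loved it.", "output": "positive"}],
    "eval_example": {"input": "Waste of time.", "output": "negative"},
}

# The loader indexes these directly (d['id'], d['name']); the rest use d.get(...)
REQUIRED_KEYS = {"id", "name"}


def missing_keys(task):
    """Return the required keys that the task dict lacks, sorted by name."""
    return sorted(REQUIRED_KEYS - task.keys())


# Round-trip through a temporary file, mirroring how the loader reads JSON
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sentiment.json"
    path.write_text(json.dumps(example_task, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(missing_keys(loaded))
print(missing_keys({"name": "task without an id"}))
```

A check like this, run before the loop, turns a `KeyError` in the middle of loading into a readable report of which file is incomplete.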

### 3. Execution Loop
Configure the experiment parameters and run the querying process. Results are saved as timestamped JSONL and CSV files in `outputs/`.

In [6]:
# Configuration: Selecting models and prompting strategies
selected_models = [
    # Small model
    'smollm:1.7b',
    # Large reasoning model
    'magistral:24b'
]
strategies = ['zero', 'few', 'cot']

# Main execution: Queries each model for every task and strategy
results = run_evaluation(
    models=selected_models,
    strategies=strategies,
    tasks=TASKS,
    examples=EVAL_EXAMPLES,
    client=client,
    output_dir=OUTPUT_DIR,
    save_prefix='lab5_experiment',
    dev_examples=DEV_EXAMPLES,
    tasks_dir=TASKS_DIR
)

Running 50 experiment runs


Running experiments: 60it [58:20, 58.35s/it]                           

Saved 60 results to /Users/wojciechbartoszek/Documents/studia/lingwistyka_obliczeniowa/Lingwistyka-Obliczeniowa/lab5/outputs/lab5_experiment_20260111_115522.jsonl and /Users/wojciechbartoszek/Documents/studia/lingwistyka_obliczeniowa/Lingwistyka-Obliczeniowa/lab5/outputs/lab5_experiment_20260111_115522.csv
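The implementation of `run_evaluation` lives in `eval_utils` and is not shown here. Assuming it iterates over the full Cartesian product of models, strategies, and tasks, the 60 saved results would correspond to 2 models × 3 strategies × 10 tasks. The sketch below enumerates such a grid; `enumerate_runs` and the placeholder task ids are hypothetical, and the count of 10 tasks is inferred from the output rather than stated in the notebook.

```python
from itertools import product

models = ["smollm:1.7b", "magistral:24b"]
strategies = ["zero", "few", "cot"]
# Placeholder ids; the number of tasks (10) is an assumption
task_ids = [f"task_{i:02d}" for i in range(1, 11)]


def enumerate_runs(models, strategies, task_ids):
    """List one run per (model, strategy, task) combination."""
    return [
        {"model": m, "strategy": s, "task_id": t}
        for m, s, t in product(models, strategies, task_ids)
    ]


runs = enumerate_runs(models, strategies, task_ids)
print(len(runs))  # 60
print(runs[0])
```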



