Collection of evals on telecommunications tasks.
Read more: GSMA Benchmarks Blog Post
Before using this repository, you must request permission for the benchmark datasets on HuggingFace:
HuggingFace Configuration:
- Get your access token from your HuggingFace account
- Add the above repositories to "Repositories permissions"
- Click "read access to contents of selected repos"
Docker or OrbStack (required for sandbox execution)
- Docker: https://www.docker.com/get-started
- OrbStack (Mac): https://orbstack.dev
uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
- Install dependencies:
uv sync
- Configure environment variables:
Create a .env file in the root folder with your API credentials:
# Required: HuggingFace token for dataset access
HF_TOKEN=your_huggingface_token_here
# Add API keys for the models you want to use
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
List of all available models: https://inspect.aisi.org.uk/models.html
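Before running an eval, it can help to confirm that the .env file actually defines the required token. The following is a minimal sketch using only the Python standard library; the helper names `parse_env` and `missing_keys` are hypothetical and not part of this repository, and the parser only handles simple KEY=value lines.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required=("HF_TOKEN",)) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to sit in the repository root
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("missing keys:", missing_keys(env))
```

Pass a longer `required` tuple (for example including `OPENAI_API_KEY`) to also check the model provider keys you plan to use.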
Run evals from the command line:
# TeleQnA
uv run inspect eval src/open_telco/teleqna/teleqna.py
# TeleMath
uv run inspect eval src/open_telco/telemath/telemath.py
# TeleLogs
uv run inspect eval src/open_telco/telelogs/telelogs.py
# TeleYaml
uv run inspect eval src/open_telco/teleyaml/teleyaml.py
With options:
# Specific model
uv run inspect eval src/open_telco/telemath/telemath.py --model openai/gpt-4o
# Limit samples
uv run inspect eval src/open_telco/telemath/telemath.py --limit 10
Alternative: Use the Inspect VS Code Extension or run the Web UI with python ui/app.py
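When scripting several runs, the command lines above can be assembled programmatically. This is only a sketch: `build_eval_command` is a hypothetical helper, and it uses just the `--model` and `--limit` options shown above.

```python
import shlex


def build_eval_command(task_path, model=None, limit=None):
    """Assemble the argument list for `uv run inspect eval <task>`."""
    cmd = ["uv", "run", "inspect", "eval", task_path]
    if model is not None:
        cmd += ["--model", model]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command(
    "src/open_telco/telemath/telemath.py",
    model="openai/gpt-4o",
    limit=10,
)
# Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.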
- TeleQnA: Benchmark Dataset to Assess Large Language Models for Telecommunications
A benchmark dataset of 10,000 question-answer pairs sourced from telecommunications standards and research articles. Evaluates LLMs' knowledge across general telecom inquiries and complex standards-related questions. Paper | Dataset
uv run inspect eval src/open_telco/teleqna/teleqna.py
- TeleMath: Evaluating Mathematical Reasoning in Telecom Domain
500 mathematically intensive problems covering signal processing, network optimization, and performance analysis. Implemented as a ReAct agent using bash and python tools to solve domain-specific mathematical computations. Paper | Dataset
uv run inspect eval src/open_telco/telemath/telemath.py
Metrics: pass@1, const@16 (majority voting over 16 answers)
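To illustrate what majority voting over repeated answers means, here is a minimal sketch. It assumes answers have already been normalized to comparable strings and that ties go to the answer seen first; the function names are hypothetical, and the repository's actual scorer may differ.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent answer; ties go to the one seen first."""
    return Counter(answers).most_common(1)[0][0]


def majority_correct(answers, target):
    """True if the majority-voted answer matches the reference answer."""
    return majority_answer(answers) == target


# Hypothetical example: 16 sampled answers to one problem.
samples = ["3.2"] * 9 + ["3.1"] * 5 + ["4.0"] * 2
print(majority_answer(samples), majority_correct(samples, "3.2"))
```

The per-problem results would then be averaged over the dataset to obtain the reported score.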
- TeleLogs: Root Cause Analysis in 5G Networks
A synthetic dataset for root cause analysis (RCA) in 5G networks. Given network configuration parameters and user-plane data (throughput, RSRP, SINR), models must identify which of 8 predefined root causes explain throughput degradation below 600 Mbps. Use -T <N> to specify epochs for pass@1 and maj@4 metrics. Paper | Dataset
uv run inspect eval src/open_telco/telelogs/telelogs.py -T 4
Metrics: pass@1 (averaged over N epochs), maj@4 (majority voting)
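As a rough illustration of pass@1 averaged over N epochs, the sketch below takes one list of per-epoch correctness flags per sample, averages within each sample, then averages across samples. The function name and input layout are assumptions for illustration and are not taken from the repository.

```python
def pass_at_1_over_epochs(results):
    """Average per-sample success rates, where each sample has one flag per epoch.

    results: list of lists of booleans, e.g. [[True, False, True, True], ...]
    """
    per_sample = [sum(flags) / len(flags) for flags in results]
    return sum(per_sample) / len(per_sample)


# Hypothetical run with 2 samples and 4 epochs each.
results = [
    [True, False, True, True],     # correct in 3 of 4 epochs
    [False, False, False, False],  # never correct
]
print(pass_at_1_over_epochs(results))
```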
- TeleYaml: 5G Network Configuration Generation
Evaluates the capability of LLMs to generate standard-compliant YAML configurations for 5G core network tasks, specifically AMF Configuration, Network Slicing, and UE Provisioning. Dataset
uv run inspect eval src/open_telco/teleyaml/teleyaml.py
Metrics: model-graded accuracy
- 3GPP TSG: Technical Specification Group Classification
Classifies 3GPP technical documents according to their working group. Models act as a distinguished expert in the telecommunication domain to identify the correct group for a given text. Dataset
uv run inspect eval src/open_telco/three_gpp/three_gpp.py
Metrics: accuracy, stderr
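For readers unfamiliar with the two metrics, the sketch below computes accuracy as the mean of per-sample 0/1 scores and stderr as the standard error of that mean (sample standard deviation divided by the square root of n). This is an assumption about the usual definition; the exact formula used by the eval framework may differ, and `accuracy_and_stderr` is a hypothetical helper.

```python
import math


def accuracy_and_stderr(scores):
    """Return (mean, standard error of the mean) for a list of 0/1 scores."""
    n = len(scores)
    mean = sum(scores) / n
    if n < 2:
        return mean, 0.0  # standard error is undefined for a single sample
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical scores: 3 of 4 documents classified correctly.
acc, err = accuracy_and_stderr([1, 0, 1, 1])
print(acc, err)
```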