Generates synthetic question-passage pairs for training information retrieval models in the electrical and electronics engineering (EE) domain. The output format matches tomaarsen/miriad-4.4M-split — the medical IR dataset used to train models like embeddinggemma-300m-medical — but targets EE literature instead.
Large-scale question-passage datasets exist for the biomedical domain (MIRIAD: 4.4M pairs), but nothing comparable exists for electrical and electronics engineering. This project fills that gap by using OpenAI models to generate high-quality, diverse synthetic pairs across 40 EE subdomains.
```text
electrical-engineering-ir-dataset/
├── generate_ee_dataset.py     # CLI entry point
├── requirements.txt           # Python dependencies
├── .env                       # OpenAI API key (not committed)
├── README.md
│
├── ee_dataset/                # Domain data & export
│   ├── __init__.py
│   ├── diversity.py           # 40 subdomains, ~450 topics, styles, personas
│   └── exporter.py            # Parquet/CSV export + diversity statistics
│
├── prompts/                   # Prompt engineering
│   ├── __init__.py
│   ├── templates.py           # System & user prompt templates
│   └── builder.py             # Randomized prompt construction
│
├── models/                    # Generation engine
│   ├── __init__.py
│   ├── config.py              # GenerationConfig dataclass
│   └── generator.py           # Async DatasetGenerator (OpenAI calls)
│
├── utils/                     # Shared utilities
│   ├── __init__.py
│   ├── checkpoint.py          # JSONL checkpoint read/write
│   ├── parsing.py             # LLM response parsing & validation
│   ├── rate_limiter.py        # Token-bucket rate limiter (RPM)
│   └── logging_config.py      # Centralised logging setup
│
└── output/                    # Generated dataset (created at runtime)
    ├── ee_dataset.parquet
    ├── ee_dataset.csv
    ├── ee_dataset_metadata.parquet
    └── checkpoint.jsonl
```
| Module | What it owns |
|---|---|
| ee_dataset/ | All domain knowledge — the 40 subdomains, ~450 specific topics, question styles, passage styles, author personas, complexity levels, seed phrases. Also the export pipeline and diversity stats. |
| prompts/ | Prompt templates (easy to edit without touching code) and the builder that randomly samples all diversity axes to construct a unique prompt per sample. |
| models/ | The GenerationConfig dataclass (all tuneable knobs) and the DatasetGenerator class that orchestrates async OpenAI calls with batching, retries, and rate-limit backoff. |
| utils/ | Cross-cutting concerns — JSONL checkpoint I/O, LLM response parsing/validation, token-bucket rate limiter, and logging setup. |
Each generated sample contains two fields, identical to the MIRIAD schema:
| Column | Type | Description |
|---|---|---|
| question | string | A specific technical question (20–200 words) |
| passage_text | string | An academic-style passage answering the question (150–600 words) |
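A hypothetical row under this schema might look like the following (the question and passage text here are invented for illustration, not actual generator output):

```python
# Hypothetical example of one generated row (illustrative content only).
sample = {
    "question": (
        "How does the choice of inductor ripple-current ratio affect "
        "efficiency and transient response in a synchronous buck converter?"
    ),
    "passage_text": (
        "In synchronous buck converter design, the inductor ripple-current "
        "ratio is commonly chosen between 20% and 40% of the rated load "
        "current. A larger ripple permits a smaller inductor but raises RMS "
        "conduction and core losses, while a smaller ripple improves "
        "efficiency at the cost of slower transient response..."
    ),
}
```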
Output files are saved to output/ after every batch, so you always have usable files even if the run is interrupted.
LLMs tend to produce repetitive text when generating at scale. This project mitigates that by randomizing six independent dimensions for every single sample:
| Dimension | Pool Size | Examples |
|---|---|---|
| Subdomains | 40 | Power Electronics, VLSI Design, Control Systems, EMC/EMI, ... |
| Specific topics | ~450 | "buck converter inductor design", "FPGA timing closure", ... |
| Question styles | 15 | Design-oriented, failure modes, comparisons, trade-offs, ... |
| Passage styles | 8 | Journal paper, textbook, conference paper, dissertation, ... |
| Author personas | 8 | Professor, industry engineer, PhD researcher, standards writer, ... |
| Complexity levels | 4 | Undergraduate, advanced undergraduate, graduate, research |
Additional techniques:
- Randomized temperature per request (default range: 0.85–1.15, disabled for reasoning models)
- Randomized passage length target (150–600 words)
- Seed phrases that push the model toward specific angles (numerical examples, industry standards, application domains, trade-offs, historical evolution)
- Explicit anti-repetition instructions in both system and user prompts
- JSON mode (`response_format: json_object`) for guaranteed valid JSON output
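Per-sample randomization across independent axes might be sketched as below. The pool contents are abbreviated and the function name is illustrative; the real lists live in ee_dataset/diversity.py and the sampling in prompts/builder.py:

```python
import random

# Abbreviated illustrative pools; the full lists are much larger.
SUBDOMAINS = ["Power Electronics", "VLSI Design", "Control Systems"]
TOPICS = {"Power Electronics": ["buck converter inductor design", "gate driver layout"]}
QUESTION_STYLES = ["design-oriented", "failure modes", "comparison"]
PASSAGE_STYLES = ["journal paper", "textbook", "conference paper"]
PERSONAS = ["professor", "industry engineer"]
COMPLEXITY = ["undergraduate", "advanced undergraduate", "graduate", "research"]

def sample_prompt_axes(rng: random.Random) -> dict:
    """Draw one independent value per diversity axis for a single sample."""
    subdomain = rng.choice(SUBDOMAINS)
    return {
        "subdomain": subdomain,
        "topic": rng.choice(TOPICS.get(subdomain, ["general"])),
        "question_style": rng.choice(QUESTION_STYLES),
        "passage_style": rng.choice(PASSAGE_STYLES),
        "persona": rng.choice(PERSONAS),
        "complexity": rng.choice(COMPLEXITY),
        # Per-request temperature jitter (skipped for reasoning models).
        "temperature": rng.uniform(0.85, 1.15),
    }

axes = sample_prompt_axes(random.Random(0))
```

Because each axis is drawn independently, the number of distinct prompt configurations is the product of the pool sizes, which is what keeps large runs from collapsing into repetitive output.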
The 40 subdomains span the full breadth of EE:
- Analog Circuit Design
- Digital Circuit Design
- Power Electronics
- Power Systems and Smart Grids
- Control Systems and Automation
- Signal Processing (DSP)
- Electromagnetic Theory and Wave Propagation
- Antenna Design and RF Engineering
- Microelectronics and VLSI Design
- Semiconductor Physics and Devices
- Communication Systems (Analog and Digital)
- Wireless and Mobile Communications
- Optical Fiber Communications
- Embedded Systems and IoT
- Microprocessor and Microcontroller Architecture
- Electric Machines and Drives
- Renewable Energy Systems (Solar, Wind)
- High Voltage Engineering and Insulation
- Instrumentation and Measurement
- Biomedical Electronics
- Robotics and Mechatronics
- Radar and Satellite Systems
- Digital Image Processing
- Machine Learning for EE Applications
- Computer Architecture and Hardware Design
- Integrated Circuit Fabrication
- Programmable Logic (FPGA/CPLD)
- Electric Vehicle Technology
- Power Quality and Harmonics
- Network Theory and Circuit Analysis
- Electromechanical Energy Conversion
- Switchgear and Protection Systems
- PCB Design and Signal Integrity
- Photovoltaic Systems
- Battery Technology and Energy Storage
- Superconducting Electronics
- MEMS and Nanotechnology
- Audio and Acoustic Engineering
- Telecommunications and 5G/6G
- Electromagnetic Compatibility (EMC/EMI)
- Python 3.10+
- An OpenAI API key
Install dependencies:

```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```bash
OPENAI_API_KEY=sk-your-key-here
```
Quick test run:

```bash
python generate_ee_dataset.py --num-samples 50 --batch-size 5
```

Standard run:

```bash
python generate_ee_dataset.py --num-samples 10000 --batch-size 10 --max-concurrent 10
```

Large run:

```bash
python generate_ee_dataset.py --num-samples 100000 --batch-size 10 --max-concurrent 10 --rpm 4000
```

Resume an interrupted run:

```bash
python generate_ee_dataset.py --num-samples 20000 --resume
```

Choosing a model:

```bash
# Most powerful non-reasoning model (recommended)
python generate_ee_dataset.py --num-samples 10000 --model gpt-4.1

# Cheapest option
python generate_ee_dataset.py --num-samples 10000 --model gpt-4.1-mini

# Reasoning model (slower, higher cost due to thinking tokens)
python generate_ee_dataset.py --num-samples 10000 --model gpt-5-mini
```

Using a local OpenAI-compatible server:

```bash
# LM Studio
python generate_ee_dataset.py --num-samples 1000 --base-url http://localhost:1234/v1 --model local-model --batch-size 3 --max-concurrent 3

# Ollama
python generate_ee_dataset.py --num-samples 1000 --base-url http://localhost:11434/v1 --model llama3
```

CLI options:

```text
--num-samples       Total pairs to generate (default: 10000)
--batch-size        Concurrent requests per batch (default: 5)
--max-concurrent    Max concurrent API calls (default: 10)
--model             OpenAI model (default: gpt-4.1-mini)
--output-dir        Output directory (default: output)
--checkpoint-every  Checkpoint interval in samples (default: 100)
--resume            Resume from existing checkpoint
--temp-min          Minimum sampling temperature (default: 0.85)
--temp-max          Maximum sampling temperature (default: 1.15)
--base-url          Custom API base URL for local servers
--rpm               Max requests per minute (default: 500)
```
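The `--rpm` limit is enforced by the token-bucket limiter in utils/rate_limiter.py. A minimal synchronous sketch of the idea (the real implementation is async, and these names are illustrative):

```python
import time

class TokenBucket:
    """Minimal token bucket: allows `rpm` requests per minute, refilling
    continuously instead of in fixed one-minute windows."""

    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.refill_per_sec = rpm / 60.0
        self.last = time.monotonic()

    def acquire(self) -> float:
        """Take one token; return how long the caller should sleep first."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        self.tokens -= 1  # may go negative: the caller sleeps off the debt
        return max(0.0, -self.tokens / self.refill_per_sec)

bucket = TokenBucket(rpm=600)  # roughly 10 requests/second
```

Spacing requests out proactively like this is cheaper than reacting to 429 responses, because a rejected request still counts against the provider's limit.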
| Model | Type | Speed | Cost (10k samples) | Best for |
|---|---|---|---|---|
| gpt-4.1 | Non-reasoning | Medium | ~$55 | Highest quality passages |
| gpt-4.1-mini | Non-reasoning | Fast | ~$12 | Best cost/quality balance |
| gpt-5-mini | Reasoning | Slow (~30s/sample) | ~$45 (incl. thinking tokens) | Not recommended for this task |
| gpt-5-nano | Reasoning | Fast | ~$3 | Budget runs (lower quality) |
Recommendation: Use gpt-4.1 for quality or gpt-4.1-mini for volume. Reasoning models (gpt-5-*) spend ~65% of output tokens on invisible thinking, making them slower and more expensive without meaningful quality gains for structured generation tasks.
Reasoning model support: The generator auto-detects reasoning models (gpt-5-, o1-, o3-, o4-) and adapts — disables temperature, merges system/user messages, and increases token limits to account for thinking tokens.
Mixing models: You can switch models between runs with --resume. The checkpoint is model-agnostic.
Approximate costs per model (non-reasoning models use ~500 input + ~800 output tokens per sample; reasoning models add ~1,500 thinking tokens):
| Model | Input $/1M | Output $/1M | 10k samples | 100k samples |
|---|---|---|---|---|
| gpt-4.1-mini | $0.40 | $1.60 | ~$12 | ~$120 |
| gpt-4.1 | $2.00 | $8.00 | ~$55 | ~$550 |
| gpt-5-nano | $0.05 | $0.40 | ~$3 | ~$26 |
| gpt-5-mini | $0.25 | $2.00 | ~$45* | ~$450* |
*Reasoning models include ~1,500 invisible thinking tokens per sample billed as output.
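The per-token arithmetic behind the table can be sketched as below. The token counts are the rough averages stated above, so the results land near (but not exactly on) the table's rounded figures:

```python
def estimate_cost(n_samples: int,
                  in_price_per_m: float, out_price_per_m: float,
                  in_tokens: int = 500, out_tokens: int = 800,
                  thinking_tokens: int = 0) -> float:
    """Back-of-envelope run cost in dollars. Thinking tokens are billed
    at the output rate even though they are invisible in the response."""
    per_sample = (in_tokens * in_price_per_m
                  + (out_tokens + thinking_tokens) * out_price_per_m) / 1_000_000
    return n_samples * per_sample

# gpt-4.1-mini, 10k samples: ≈ $14.8 with these token assumptions
cost_mini = estimate_cost(10_000, 0.40, 1.60)

# gpt-5-mini with ~1,500 thinking tokens per sample: ≈ $47.25
cost_reasoning = estimate_cost(10_000, 0.25, 2.00, thinking_tokens=1_500)
```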
- Checkpoint/resume: Progress is saved to `checkpoint.jsonl` after every batch. Use `--resume` to continue from where you left off.
- Incremental export: Parquet, CSV, and metadata files are written after every batch — you always have usable output files.
- JSON mode: Uses `response_format: json_object` to guarantee valid JSON from the API.
- Retry logic: Each sample retries up to 5 times on parse errors or API failures.
- Rate limiter: Token-bucket rate limiter (`--rpm`) spaces out requests to prevent 429 errors instead of relying on retry backoff.
- Jittered backoff: On 429 errors, retries use exponential backoff with random jitter to avoid a thundering herd.
- Reasoning model auto-detection: Automatically adapts API parameters for gpt-5-*/o-series models (no temperature, merged messages, higher token limits).
- Robust parsing: Handles markdown-fenced JSON, preamble text before JSON, and empty/None responses.
- Validation: Rejects samples with missing keys or content that is too short.
- Logging: Full logs to both stdout and `generation.log`.
Load with pandas:

```python
import pandas as pd

df = pd.read_parquet("output/ee_dataset.parquet")
print(df.head())
print(f"Questions: {len(df)}")
```

Load with Hugging Face `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="output/ee_dataset.parquet")
print(ds["train"][0])
```

Train a Sentence Transformers retrieval model:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

dataset = load_dataset("parquet", data_files="output/ee_dataset.parquet")

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# In-batch negatives loss, the standard choice for unlabeled
# (question, passage) pairs.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="ee-ir-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    loss=loss,
)
trainer.train()
```

This project generates synthetic data using OpenAI models. The generated dataset is subject to OpenAI's usage policies. The generation code itself is released under the MIT License.