di37/electrical-electronics-ir-dataset-generator

Electrical & Electronics Engineering Information Retrieval Dataset Generator

Synthetically generates question-passage pairs for training information retrieval models in the electrical and electronics engineering (EE) domain. The output format matches tomaarsen/miriad-4.4M-split — the medical IR dataset used to train models like embeddinggemma-300m-medical — but targets EE literature instead.

Motivation

Large-scale question-passage datasets exist for the biomedical domain (MIRIAD: 4.4M pairs), but nothing comparable exists for electrical and electronics engineering. This project fills that gap by using OpenAI models to generate high-quality, diverse synthetic pairs across 40 EE subdomains.

Project Structure

electrical-engineering-ir-dataset/
├── generate_ee_dataset.py          # CLI entry point
├── requirements.txt                # Python dependencies
├── .env                            # OpenAI API key (not committed)
├── README.md
│
├── ee_dataset/                     # Domain data & export
│   ├── __init__.py
│   ├── diversity.py                # 40 subdomains, ~450 topics, styles, personas
│   └── exporter.py                 # Parquet/CSV export + diversity statistics
│
├── prompts/                        # Prompt engineering
│   ├── __init__.py
│   ├── templates.py                # System & user prompt templates
│   └── builder.py                  # Randomized prompt construction
│
├── models/                         # Generation engine
│   ├── __init__.py
│   ├── config.py                   # GenerationConfig dataclass
│   └── generator.py                # Async DatasetGenerator (OpenAI calls)
│
├── utils/                          # Shared utilities
│   ├── __init__.py
│   ├── checkpoint.py               # JSONL checkpoint read/write
│   ├── parsing.py                  # LLM response parsing & validation
│   ├── rate_limiter.py             # Token-bucket rate limiter (RPM)
│   └── logging_config.py           # Centralised logging setup
│
└── output/                         # Generated dataset (created at runtime)
    ├── ee_dataset.parquet
    ├── ee_dataset.csv
    ├── ee_dataset_metadata.parquet
    └── checkpoint.jsonl

Module Responsibilities

| Module | What it owns |
| --- | --- |
| ee_dataset/ | All domain knowledge — the 40 subdomains, ~450 specific topics, question styles, passage styles, author personas, complexity levels, and seed phrases. Also the export pipeline and diversity stats. |
| prompts/ | Prompt templates (easy to edit without touching code) and the builder that randomly samples all diversity axes to construct a unique prompt per sample. |
| models/ | The GenerationConfig dataclass (all tuneable knobs) and the DatasetGenerator class that orchestrates async OpenAI calls with batching, retries, and rate-limit backoff. |
| utils/ | Cross-cutting concerns — JSONL checkpoint I/O, LLM response parsing/validation, the token-bucket rate limiter, and logging setup. |

Output Format

Each generated sample contains two fields, identical to the MIRIAD schema:

| Column | Type | Description |
| --- | --- | --- |
| question | string | A specific technical question (20–200 words) |
| passage_text | string | An academic-style passage answering the question (150–600 words) |

Output files are saved to output/ after every batch, so you always have usable files even if the run is interrupted.
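The parsing layer in utils/parsing.py enforces this schema before a sample is accepted. A minimal sketch of such a check (the function name is illustrative; the word-count bounds come from the schema table above):

```python
def validate_sample(sample: dict) -> bool:
    """Check a generated sample against the MIRIAD-style schema.

    Illustrative sketch only; the real validation lives in utils/parsing.py.
    """
    if not isinstance(sample, dict):
        return False
    question = sample.get("question")
    passage = sample.get("passage_text")
    if not isinstance(question, str) or not isinstance(passage, str):
        return False
    # Word-count bounds from the schema table: 20-200 and 150-600 words
    if not 20 <= len(question.split()) <= 200:
        return False
    if not 150 <= len(passage.split()) <= 600:
        return False
    return True
```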

Diversity Strategy

LLMs tend to produce repetitive text when generating at scale. This project mitigates that by randomizing six independent dimensions for every single sample:

| Dimension | Pool size | Examples |
| --- | --- | --- |
| Subdomains | 40 | Power Electronics, VLSI Design, Control Systems, EMC/EMI, ... |
| Specific topics | ~450 | "buck converter inductor design", "FPGA timing closure", ... |
| Question styles | 15 | Design-oriented, failure modes, comparisons, trade-offs, ... |
| Passage styles | 8 | Journal paper, textbook, conference paper, dissertation, ... |
| Author personas | 8 | Professor, industry engineer, PhD researcher, standards writer, ... |
| Complexity levels | 4 | Undergraduate, advanced undergraduate, graduate, research |

Additional techniques:

  • Randomized temperature per request (default range: 0.85–1.15, disabled for reasoning models)
  • Randomized passage length target (150–600 words)
  • Seed phrases that push the model toward specific angles (numerical examples, industry standards, application domains, trade-offs, historical evolution)
  • Explicit anti-repetition instructions in both system and user prompts
  • JSON mode (response_format: json_object) for guaranteed valid JSON output
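A minimal sketch of how the six axes plus temperature and length target might be sampled per request (the pool contents and field names below are illustrative miniatures; the real ~450-topic pools live in ee_dataset/diversity.py and the sampling in prompts/builder.py):

```python
import random

# Illustrative miniature pools; the real pools live in ee_dataset/diversity.py
SUBDOMAINS = ["Power Electronics", "Microelectronics and VLSI Design"]
TOPICS = {
    "Power Electronics": ["buck converter inductor design", "GaN gate driving"],
    "Microelectronics and VLSI Design": ["FPGA timing closure", "clock tree synthesis"],
}
QUESTION_STYLES = ["design-oriented", "failure modes", "trade-off analysis"]
PASSAGE_STYLES = ["journal paper", "textbook"]
PERSONAS = ["professor", "industry engineer"]
COMPLEXITY = ["undergraduate", "graduate"]

def sample_prompt_spec(temp_min: float = 0.85, temp_max: float = 1.15) -> dict:
    """Draw one independent value per diversity axis for a single sample."""
    subdomain = random.choice(SUBDOMAINS)
    return {
        "subdomain": subdomain,
        "topic": random.choice(TOPICS[subdomain]),
        "question_style": random.choice(QUESTION_STYLES),
        "passage_style": random.choice(PASSAGE_STYLES),
        "persona": random.choice(PERSONAS),
        "complexity": random.choice(COMPLEXITY),
        # Randomized temperature per request (disabled for reasoning models)
        "temperature": round(random.uniform(temp_min, temp_max), 2),
        # Randomized passage length target
        "target_words": random.randint(150, 600),
    }
```

Because every axis is drawn independently, two consecutive requests almost never share the same combination, which is what keeps large runs from collapsing into repetitive output.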

Subdomains Covered

The 40 subdomains span the full breadth of EE:

  • Analog Circuit Design
  • Digital Circuit Design
  • Power Electronics
  • Power Systems and Smart Grids
  • Control Systems and Automation
  • Signal Processing (DSP)
  • Electromagnetic Theory and Wave Propagation
  • Antenna Design and RF Engineering
  • Microelectronics and VLSI Design
  • Semiconductor Physics and Devices
  • Communication Systems (Analog and Digital)
  • Wireless and Mobile Communications
  • Optical Fiber Communications
  • Embedded Systems and IoT
  • Microprocessor and Microcontroller Architecture
  • Electric Machines and Drives
  • Renewable Energy Systems (Solar, Wind)
  • High Voltage Engineering and Insulation
  • Instrumentation and Measurement
  • Biomedical Electronics
  • Robotics and Mechatronics
  • Radar and Satellite Systems
  • Digital Image Processing
  • Machine Learning for EE Applications
  • Computer Architecture and Hardware Design
  • Integrated Circuit Fabrication
  • Programmable Logic (FPGA/CPLD)
  • Electric Vehicle Technology
  • Power Quality and Harmonics
  • Network Theory and Circuit Analysis
  • Electromechanical Energy Conversion
  • Switchgear and Protection Systems
  • PCB Design and Signal Integrity
  • Photovoltaic Systems
  • Battery Technology and Energy Storage
  • Superconducting Electronics
  • MEMS and Nanotechnology
  • Audio and Acoustic Engineering
  • Telecommunications and 5G/6G
  • Electromagnetic Compatibility (EMC/EMI)

Setup

Prerequisites

  • Python 3.10+
  • An OpenAI API key

Installation

pip install -r requirements.txt

Configuration

Create a .env file in the project root:

OPENAI_API_KEY=sk-your-key-here

Usage

Quick test (50 samples)

python generate_ee_dataset.py --num-samples 50 --batch-size 5

Standard run (10k samples)

python generate_ee_dataset.py --num-samples 10000 --batch-size 10 --max-concurrent 10

Large-scale generation (100k samples)

python generate_ee_dataset.py --num-samples 100000 --batch-size 10 --max-concurrent 10 --rpm 4000

Resume after interruption

python generate_ee_dataset.py --num-samples 20000 --resume

Use a different model

# Most powerful non-reasoning model (recommended)
python generate_ee_dataset.py --num-samples 10000 --model gpt-4.1

# Cheapest option
python generate_ee_dataset.py --num-samples 10000 --model gpt-4.1-mini

# Reasoning model (slower, higher cost due to thinking tokens)
python generate_ee_dataset.py --num-samples 10000 --model gpt-5-mini

Use a local model (LM Studio, Ollama, vLLM)

# LM Studio
python generate_ee_dataset.py --num-samples 1000 --base-url http://localhost:1234/v1 --model local-model --batch-size 3 --max-concurrent 3

# Ollama
python generate_ee_dataset.py --num-samples 1000 --base-url http://localhost:11434/v1 --model llama3

All options

--num-samples       Total pairs to generate (default: 10000)
--batch-size        Concurrent requests per batch (default: 5)
--max-concurrent    Max concurrent API calls (default: 10)
--model             OpenAI model (default: gpt-4.1-mini)
--output-dir        Output directory (default: output)
--checkpoint-every  Checkpoint interval in samples (default: 100)
--resume            Resume from existing checkpoint
--temp-min          Minimum sampling temperature (default: 0.85)
--temp-max          Maximum sampling temperature (default: 1.15)
--base-url          Custom API base URL for local servers
--rpm               Max requests per minute (default: 500)
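The flags above map onto a standard argparse interface. A sketch of how they might be wired, using the names and defaults from the list (the real CLI lives in generate_ee_dataset.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI wiring mirroring the options list above."""
    p = argparse.ArgumentParser(description="Generate EE question-passage pairs")
    p.add_argument("--num-samples", type=int, default=10000)
    p.add_argument("--batch-size", type=int, default=5)
    p.add_argument("--max-concurrent", type=int, default=10)
    p.add_argument("--model", default="gpt-4.1-mini")
    p.add_argument("--output-dir", default="output")
    p.add_argument("--checkpoint-every", type=int, default=100)
    p.add_argument("--resume", action="store_true")
    p.add_argument("--temp-min", type=float, default=0.85)
    p.add_argument("--temp-max", type=float, default=1.15)
    p.add_argument("--base-url", default=None)
    p.add_argument("--rpm", type=int, default=500)
    return p
```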

Model Selection Guide

| Model | Type | Speed | Cost (10k samples) | Best for |
| --- | --- | --- | --- | --- |
| gpt-4.1 | Non-reasoning | Medium | ~$55 | Highest quality passages |
| gpt-4.1-mini | Non-reasoning | Fast | ~$12 | Best cost/quality balance |
| gpt-5-mini | Reasoning | Slow (~30 s/sample) | ~$45 (incl. thinking tokens) | Not recommended for this task |
| gpt-5-nano | Reasoning | Fast | ~$3 | Budget runs (lower quality) |

Recommendation: Use gpt-4.1 for quality or gpt-4.1-mini for volume. Reasoning models (gpt-5-*) spend ~65% of output tokens on invisible thinking, making them slower and more expensive without meaningful quality gains for structured generation tasks.

Reasoning model support: The generator auto-detects reasoning models (gpt-5-, o1-, o3-, o4-) and adapts — disables temperature, merges system/user messages, and increases token limits to account for thinking tokens.
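A sketch of what such prefix-based detection and parameter adaptation might look like (function names are illustrative; the token limit of 4096 is an assumed placeholder, not the generator's actual value):

```python
REASONING_PREFIXES = ("gpt-5-", "o1-", "o3-", "o4-")

def is_reasoning_model(model: str) -> bool:
    """Heuristic: treat gpt-5-* and o-series models as reasoning models."""
    return model.startswith(REASONING_PREFIXES)

def adapt_request_params(model: str, params: dict) -> dict:
    """Drop temperature and raise the output token limit for reasoning models."""
    if not is_reasoning_model(model):
        return params
    adapted = {k: v for k, v in params.items() if k != "temperature"}
    # Extra headroom for invisible thinking tokens billed as output
    adapted["max_completion_tokens"] = max(adapted.get("max_completion_tokens", 0), 4096)
    return adapted
```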

Mixing models: You can switch models between runs with --resume. The checkpoint is model-agnostic.

Cost Estimation

Approximate costs per model (non-reasoning models use ~500 input + ~800 output tokens per sample; reasoning models add ~1,500 thinking tokens):

| Model | Input $/1M | Output $/1M | 10k samples | 100k samples |
| --- | --- | --- | --- | --- |
| gpt-4.1-mini | $0.40 | $1.60 | ~$12 | ~$120 |
| gpt-4.1 | $2.00 | $8.00 | ~$55 | ~$550 |
| gpt-5-nano | $0.05 | $0.40 | ~$3 | ~$26 |
| gpt-5-mini | $0.25 | $2.00 | ~$45* | ~$450* |

*Reasoning models include ~1,500 invisible thinking tokens per sample billed as output.

Reliability Features

  • Checkpoint/resume: Progress is saved to checkpoint.jsonl after every batch. Use --resume to continue from where you left off.
  • Incremental export: Parquet, CSV, and metadata files are written after every batch — you always have usable output files.
  • JSON mode: Uses response_format: json_object to guarantee valid JSON from the API.
  • Retry logic: Each sample retries up to 5 times on parse errors or API failures.
  • Rate limiter: Token-bucket rate limiter (--rpm) spaces out requests to prevent 429 errors instead of relying on retry backoff.
  • Jittered backoff: On 429 errors, retries use exponential backoff with random jitter to avoid thundering herd.
  • Reasoning model auto-detection: Automatically adapts API parameters for gpt-5-*/o-series models (no temperature, merged messages, higher token limits).
  • Robust parsing: Handles markdown-fenced JSON, preamble text before JSON, and empty/None responses.
  • Validation: Rejects samples with missing keys or content that is too short.
  • Logging: Full logs to both stdout and generation.log.
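The rate limiter and backoff features can be sketched roughly as follows (class and function names are illustrative; the real implementations live in utils/rate_limiter.py and models/generator.py):

```python
import random
import time

class TokenBucket:
    """Token-bucket limiter: allows at most `rpm` requests per minute."""

    def __init__(self, rpm: int):
        self.capacity = float(rpm)
        self.tokens = float(rpm)
        self.refill_rate = rpm / 60.0   # tokens added per second
        self.last = time.monotonic()

    def acquire(self) -> float:
        """Consume one token; return how long the caller should sleep first."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # Deficit: sleep until the bucket refills back to zero
        return -self.tokens / self.refill_rate

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter, for retrying 429 responses."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Spacing requests up front with the bucket keeps most runs under the provider's rate limit, while the jittered delay prevents many concurrent workers from retrying in lockstep after a 429.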

Using the Dataset

Load with pandas

import pandas as pd

df = pd.read_parquet("output/ee_dataset.parquet")
print(df.head())
print(f"Questions: {len(df)}")

Load with Hugging Face datasets

from datasets import load_dataset

ds = load_dataset("parquet", data_files="output/ee_dataset.parquet")
print(ds["train"][0])

Train a sentence-transformer

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

dataset = load_dataset("parquet", data_files="output/ee_dataset.parquet")

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# The (question, passage_text) pairs carry no labels, so use an
# in-batch-negatives loss rather than the trainer's default.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="ee-ir-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    loss=loss,
)
trainer.train()

License

This project generates synthetic data using OpenAI models. The generated dataset is subject to OpenAI's usage policies. The generation code itself is released under the MIT License.