# Synthetic Academic Query Generation with OpenAlex

This notebook demonstrates how to generate diverse academic queries with the OpenAlex API. The generated queries are realistic enough to serve as test inputs for lead generation systems.

## Features

- **Diverse Query Generation**: Uses OpenAlex concept hierarchy and institution data
- **Configurable Parameters**: Target number, batch size, concept levels, etc.
- **Checkpointing**: Saves progress and can resume from interruptions
- **Batching**: Processes queries in batches for efficiency
- **Results with Names and Institutions**: Generates structured lead data
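
As a rough illustration of how batching and checkpointing can fit together, here is a minimal standard-library sketch. It is not the implementation inside `SyntheticQueryGenerator`: the function `run_in_batches`, its `process` callback, and the single JSON checkpoint file are all hypothetical, and the sketch assumes exactly one result is stored per input item, in input order.

```python
import json
import tempfile
from pathlib import Path


def run_in_batches(items, batch_size, checkpoint_file, process):
    """Process items batch by batch, saving progress after each batch."""
    checkpoint_file = Path(checkpoint_file)
    # Resume from the saved results if a checkpoint already exists
    if checkpoint_file.exists():
        results = json.loads(checkpoint_file.read_text())
    else:
        results = []

    # One result is stored per item, so len(results) is the resume position
    for start in range(len(results), len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(process(item) for item in batch)
        checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
        checkpoint_file.write_text(json.dumps(results))
    return results


# Hypothetical usage: str.upper stands in for executing one query
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    first = run_in_batches(["a", "b", "c"], 2, path, str.upper)
    # The second call finds all three results on disk and does no new work
    second = run_in_batches(["a", "b", "c"], 2, path, str.upper)
    print(first, second)
```

Writing the checkpoint once per batch rather than once per item trades a little repeated work after a crash for far fewer disk writes.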

## Quick Start


In [1]:
import os
import sys
import asyncio
from pathlib import Path

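# Add the project root and its src directory to the path so that
# `from src.evals ...` resolves (assumes this notebook sits one level below the root)
sys.path.extend(["..", "../src"])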

from src.evals.generate_synthetic_questions import (
    GenerationConfig,
    SyntheticQueryGenerator,
)
from rich import print as rprint
import json
import pandas as pd

## Optional: Clean Up Old Checkpoints

If loading checkpoints from an earlier run raises validation errors, delete the checkpoint directory and start fresh:


In [2]:
# Remove the demo checkpoint directory so the next run starts from scratch
# (this discards any saved progress)
import shutil
from pathlib import Path

checkpoint_path = Path("checkpoints/demo_queries")
if checkpoint_path.exists():
    shutil.rmtree(checkpoint_path)
    print("Cleaned up old checkpoints")
else:
    print("No old checkpoints to clean")

Cleaned up old checkpoints


## Configuration

First, let's set up the configuration for our query generation. You can adjust these parameters based on your needs.


In [3]:
# Set a contact email for the OpenAlex API (recommended: requests that include
# one are served from OpenAlex's faster "polite pool")
# You can set this as an environment variable: OPENALEX_EMAIL
# os.environ['OPENALEX_EMAIL'] = 'your.email@example.com'

# Create configuration
config = GenerationConfig(
    target_queries=10,  # Start small for testing
    batch_size=10,
    max_results_per_query=5,
    checkpoint_dir="checkpoints/demo_queries",
    output_file="demo_synthetic_queries.json",
)

print("Configuration:")
print(f"Target queries: {config.target_queries}")
print(f"Batch size: {config.batch_size}")
print(f"Max results per query: {config.max_results_per_query}")
print(f"Output file: {config.output_file}")

Configuration:
Target queries: 10
Batch size: 10
Max results per query: 5
Output file: demo_synthetic_queries.json


## Generate Synthetic Queries

Now let's run the query generation. This will:

1. Fetch concepts and institutions from OpenAlex
2. Generate diverse query variations
3. Execute queries against OpenAlex
4. Save results with checkpointing

**Note**: This may take several minutes depending on your target number of queries.
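
Step 2 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the generator's actual logic: the seed lists, the `templates`, and the function `generate_queries` are all hypothetical, and the real notebook draws its topics and institutions from OpenAlex rather than from hard-coded values.

```python
import itertools
import random

# Hypothetical seed data; the notebook fetches the real values from OpenAlex
topics = ["machine learning", "marine biology"]
institutions = ["Example University", "Sample Institute"]
templates = [
    "researchers working on {topic} at {institution}",
    "who publishes on {topic} at {institution}?",
]


def generate_queries(topics, institutions, templates, target, seed=0):
    """Return up to `target` distinct queries sampled from all combinations."""
    combos = list(itertools.product(templates, topics, institutions))
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    rng.shuffle(combos)
    return [
        template.format(topic=topic, institution=institution)
        for template, topic, institution in combos[:target]
    ]


queries = generate_queries(topics, institutions, templates, target=5)
for query in queries:
    print(query)
```

Shuffling the full cross product and then truncating spreads the sample across templates, topics, and institutions without ever repeating a query.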


In [4]:
# Create the generator
generator = SyntheticQueryGenerator(config)

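# Load the concept and institution data the generator draws from
# (top-level await works in Jupyter; in a plain script, wrap it in asyncio.run)
await generator.initialize_data()

# Inspect the institution -> topic IDs mapping available after initialization
print(generator.topics_per_institution)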

{'city': 'Washington', 'geonames_city_id': '4140963', 'region': None, 'country_code': 'US', 'country': 'United States', 'latitude': 38.89511, 'longitude': -77.03637}
{'city': 'Vancouver', 'geonames_city_id': '6173331', 'region': None, 'country_code': 'CA', 'country': 'Canada', 'latitude': 49.24966, 'longitude': -123.11934}
{'city': 'Atlanta', 'geonames_city_id': '4180439', 'region': None, 'country_code': 'US', 'country': 'United States', 'latitude': 33.749, 'longitude': -84.38798}
{'city': 'Atlanta', 'geonames_city_id': '4180439', 'region': None, 'country_code': 'US', 'country': 'United States', 'latitude': 33.749, 'longitude': -84.38798}
{'city': 'Atlanta', 'geonames_city_id': '4180439', 'region': None, 'country_code': 'US', 'country': 'United States', 'latitude': 33.749, 'longitude': -84.38798}


defaultdict(<class 'set'>, {'https://openalex.org/I1294578330': {'https://openalex.org/T10145'}, 'https://openalex.org/I141945490': {'https://openalex.org/T10145'}, 'https://openalex.org/I1288198617': {'https://openalex.org/T10556'}})


In [5]:
rprint(generator.institution_based_searches)

In [None]:
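import pyalex
from rich import print as rprint

# Spot-check one institution/topic pair from the mapping printed above:
# works published after 2023 with an author at institution I1288198617
# on topic T10556, restricted to a handful of countries
works = (
    pyalex.Works()
    .filter(publication_year=">2023")
    .filter(authorships={"institutions.country_code": "US|GB|CA|AU|BR"})
    .filter(authorships={"institutions.id": "https://openalex.org/I1288198617"})
    .filter(topics={"id": "https://openalex.org/T10556"})
    .get()  # returns the first page of results only
)

rprint(works)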
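
The hand-written filter chain above targets a single institution/topic pair. As a sketch of how the same filters could be derived for every pair in `topics_per_institution`, the snippet below builds the comma-separated `key:value` string that the OpenAlex REST API accepts in its `filter` parameter. The helpers `build_work_filters` and `to_filter_string` are hypothetical, the mapping is copied from the output shown earlier, and the country restriction is left out for brevity. No request is sent.

```python
from collections import defaultdict

# Copied from the mapping printed earlier in this notebook
topics_per_institution = defaultdict(set, {
    "https://openalex.org/I1294578330": {"https://openalex.org/T10145"},
    "https://openalex.org/I141945490": {"https://openalex.org/T10145"},
    "https://openalex.org/I1288198617": {"https://openalex.org/T10556"},
})


def build_work_filters(mapping, min_year_exclusive=2023):
    """Yield one filter dict per (institution, topic) pair, in sorted order."""
    for institution_id in sorted(mapping):
        for topic_id in sorted(mapping[institution_id]):
            yield {
                "publication_year": f">{min_year_exclusive}",
                "authorships.institutions.id": institution_id,
                "topics.id": topic_id,
            }


def to_filter_string(filters):
    """Join key:value pairs with commas, as in OpenAlex's `filter` parameter."""
    return ",".join(f"{key}:{value}" for key, value in filters.items())


for filters in build_work_filters(topics_per_institution):
    print(to_filter_string(filters))
```

Sorting the institution and topic IDs keeps the order of generated filters stable between runs, which matters because sets have no guaranteed iteration order.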

In [None]:
import pyalex
from rich import print as rprint

# Look up a single author record by its OpenAlex ID
rprint(pyalex.Authors()["A5108562369"])