In [1]:
import os, json
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

load_dotenv()

True

In [2]:
openr = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)

MODEL = {
    "nvidia":"nvidia/nemotron-3-nano-30b-a3b:free",
    "upstage":"upstage/solar-pro-3:free",
    "liquidAI":"liquid/lfm-2.5-1.2b-instruct:free"
}

In [3]:
# System Prompt to generate Data Generation Prompt
system_prompt_to_generate_data_generation_prompt = """
# You are a Senior Data Architect and Synthetic Data Specialist. Your goal is to write a highly optimized "Data Generation Prompt" that I can use to generate realistic datasets.

## Process: Do NOT generate any data or the final prompt yet. Instead, guide me through a requirements gathering phase to understand the statistical nuances of the data I need.

### Phase 1: The Interview Ask me the following questions. You may ask them all at once or step-by-step, but you must get clear answers for:

- The Domain & Use Case: (What is this data for? e.g., Testing a fraud detection model, populating a demo dashboard, load testing?)

- The Schema: (What columns are strictly required? What data types?)

- The Business Logic & Correlations: (This is critical for realism. e.g., "If Status is 'Shipped', then Delivery_Date must be populated," or "High Salary correlates with Senior job titles.")

- Edge Cases & Noise: (Do you want perfect data, or should I introduce realistic errors like NULL values, typos, or outliers?)

- Volume & Format: (CSV, JSON, SQL Insert statements? How many rows?)

### Phase 2: The Construction Once I provide these details, you will analyze my requirements and write a comprehensive System Prompt that I can run in a fresh chat instance to generate the actual data.

- Constraint for the Final Prompt: The prompt you write must use "Chain of Thought" reasoning to enforce the statistical correlations I described. It must explicitly forbid "lazy" generation (repetitive names, round numbers, or impossible dates).

Start by asking me the Phase 1 questions.

Return the final prompt in strict JSON format, as an object with a single key "prompt", so that the output can be parsed directly with Python's json.loads(...)["prompt"].
"""

messages=[
    {"role":"system", "content":system_prompt_to_generate_data_generation_prompt}, 
]

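def llm_call(messages, user_message):
    """Send one user turn and return (updated history, assistant reply)."""
    # Build a new list so the caller's history is not mutated in place.
    messages = messages + [{"role":"user", "content":user_message}]
    # No tools are used, so the `tools` argument is omitted
    # (the API expects a list of tool definitions there, not a dict).
    response = openr.chat.completions.create(
        model=MODEL["nvidia"],
        messages=messages,
        temperature=0.0,
    )
    assistant_message = response.choices[0].message.content
    messages = messages + [{"role":"assistant", "content":assistant_message}]
    return messages, assistant_message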

In [4]:
messages, am = llm_call(messages, "I want to generate a synthetic dataset to forecast revenue of a SaaS B2B company using Monte Carlo Simulations.")
display(Markdown(am))

**Phase 1 – Interview**

To craft a precise, high‑fidelity data‑generation prompt, I need a few concrete details from you. Please answer each item as clearly as possible (feel free to add any extra context you think is relevant).

1. **Domain & Use Case**  
   - What is the primary purpose of the synthetic dataset?  
   - Are you building a forecasting model, training a Monte Carlo simulation, creating demo dashboards, feeding a BI tool, etc.?  
   - Which specific revenue‑related variables do you need (e.g., MRR, ARR, churn, expansion, new‑customer acquisition)?

2. **Schema – Required Columns & Data Types**  
   - List the columns you *must* have in the output (e.g., `customer_id`, `plan_type`, `month`, `revenue`, `status`).  
   - For each column, specify the desired data type (e.g., integer, string, date, float).  
   - Are there any optional columns you’d like to include for richer analysis?

3. **Business Logic & Correlations**  
   - Describe the logical relationships that should hold between fields.  
     *Examples:*  
     - If `status = "Churned"` then `revenue` for that month must be 0.  
     - `expansion_amount` should only be > 0 when `status = "Active"` and `plan_type` is “Premium”.  
     - `new_customer_acquisition` should be 0 for months prior to a customer’s first `signup_date`.  
   - Are there any hierarchical or sequential constraints (e.g., a customer’s `plan_type` can only upgrade, not downgrade, within a given time window)?

4. **Edge Cases & Noise**  
   - Do you want the data to be perfectly clean, or should it contain realistic imperfections?  
     - NULL / missing values?  
     - Typos or out‑of‑range values?  
     - occasional impossible dates or negative numbers?  
   - If you need noise, specify the type and magnitude (e.g., “add ±5 % random noise to revenue”, “10 % of rows should have a typo in `plan_type`”).

5. **Volume & Format**  
   - Approximate number of rows you need (e.g., 10 k, 100 k, 1 M).  
   - Desired output format: CSV, JSON Lines, SQL INSERT statements, Parquet, etc.  
   - Any specific column ordering or header naming conventions?

Once I have these answers, I’ll move to **Phase 2** and write a chain‑of‑thought‑driven generation prompt that enforces all of the statistical correlations and forbids lazy or unrealistic data.  

*Please provide your responses in a clear, bullet‑point or numbered format so I can copy them directly into the next step.*

In [5]:
um = """
 1. Purpose: I want to build a SaaS B2B sales forecast for the current and next 4 quarters using Monte Carlo simulations. Today's date is 2026-01-29; generate the data relative to this date.

 2. Schema: I need the following columns:
    i. opportunity_id: TEXT (PRIMARY KEY)
    ii. opportunity_name: TEXT
    iii. geography: Any one value from ["NoAM", "EMEA", "APAC", "GLOBAL"]
    iv. stage: Any one of the ordinal stages ["Stage 1: Create", "Stage 2: Qualify", "Stage 3: Meet", "Stage 4: Demo", "Stage 5: POV" , "Stage 6: Verbal Confirmation", "Stage 7: Confirmed", "Stage 8: Closed Won", "Stage 9: Closed Lost"]
    v. probabilty: Chance of deal closure based on stage value. This should be a float value between 0 and 1.
    vi. amount: Dollar value associated with the opportunity. 
    vii. createddate: Opportunity creation date in format YYYY-MM-DD.
    viii. closedate: Expected close date for the opportunity in format YYYY-MM-DD. This should always be later than createddate. It can be any date after 2025-01-29.
    ix. status: Status of the opportunity, one of ["Open", "Won", "Lost"]. It is derived from stage, and closed opportunities must have a closedate in the past.
    x. type: Type of the opportunity from ["New Logo", "Cross-Sell", "Upgrade"]

 3. Business Logic & Correlations:
    i. The ratio of closed opportunities to open opportunities should be 2:1.
    ii. Use the normal distribution for the amount field.
    iii. 'EMEA' and 'NoAM' should have slightly higher average ARR than 'APAC' and 'GLOBAL'.
    iv. SaaS sales data is rarely perfectly distributed. It follows a Pareto principle (80% of revenue comes from 20% of deals).

 4. Edge Cases & Noise: Do not include noise; create a noise-free dataset.

 5. Volume & Output Format:
    i. Generate 10 rows.
    ii. Format should be CSV with a header row.
"""

messages, am = llm_call(messages, um)

In [6]:
prompt = json.loads(messages[-1]["content"])['prompt']
display(Markdown(prompt))

Generate a CSV dataset with header: opportunity_id,opportunity_name,geography,stage,probabilty,amount,createddate,closedate,status,type
- opportunity_id: unique uppercase alphanumeric string (e.g., 'OPP-0001')
- opportunity_name: unique descriptive name, avoid generic placeholders
- geography: randomly select from ['NoAM','EMEA','APAC','GLOBAL'] with slightly higher weight for 'NoAM' and 'EMEA'
- stage: randomly select from the ordered list ["Stage 1: Create","Stage 2: Qualify","Stage 3: Meet","Stage 4: Demo","Stage 5: POV","Stage 6: Verbal Confirmation","Stage 7: Confirmed","Stage 8: Closed Won","Stage 9: Closed Lost"]
- probabilty: map stage to a float probability (Stage 1: 0.10, Stage 2: 0.15, Stage 3: 0.20, Stage 4: 0.30, Stage 5: 0.40, Stage 6: 0.55, Stage 7: 0.70, Stage 8: 0.85, Stage 9: 0.10) ensuring value between 0 and 1
- amount: draw from a normal distribution with mean 100000 and std 40000, then apply Pareto scaling so that the top 20% of deals (by amount) contribute ~80% of total revenue; ensure all amounts are positive and have up to two decimal places, avoiding simple round numbers
- createddate: random date between 2025-01-30 and 2026-01-29 inclusive, formatted YYYY-MM-DD
- closedate: random date after createddate and after 2025-01-29, formatted YYYY-MM-DD; if status will be 'Won' or 'Lost' then closedate must be on or before today (2026-01-29), otherwise closedate is a future date
- status: derive from stage and closedate: if closedate <= today then status = 'Won' or 'Lost' (choose based on a win/lose flag); if closedate > today then status = 'Open'
- type: randomly choose from ['New Logo','Cross-Sell','Upgrade'] with weights [0.5,0.3,0.2]
Additional constraints:
1. The ratio of closed (status != 'Open') to open opportunities should be approximately 2:1 (aim for 6 closed and 3 open to meet the 10‑row total)
2. No two rows may have identical opportunity_id or opportunity_name
3. Avoid repetitive or placeholder names; generate realistic variation
4. Do not introduce any NULL values; all fields must be populated
5. Ensure createddate < closedate for every row
6. Ensure geography distribution respects the slight bias toward 'NoAM' and 'EMEA'
Output only the CSV content, no extra explanation.
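
The `json.loads(...)['prompt']` call in In [6] only works when the reply is bare JSON with a `prompt` key. A minimal sketch of a more forgiving parser follows; the helper name `extract_prompt` and the assumption that a model may wrap its JSON in a Markdown code fence are illustrative and not part of the original notebook.

```python
import json
import re

# Markdown code-fence marker, built indirectly so it does not clash with this block.
FENCE = "`" * 3
_FENCED = re.compile(rf"^{FENCE}[a-zA-Z]*\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def extract_prompt(raw: str, key: str = "prompt") -> str:
    """Parse a model reply as JSON and return the value stored under `key`."""
    text = raw.strip()
    # Unwrap a surrounding code fence if the model added one.
    fenced = _FENCED.match(text)
    if fenced:
        text = fenced.group(1)
    payload = json.loads(text)
    if not isinstance(payload, dict) or key not in payload:
        raise ValueError(f"expected a JSON object with a {key!r} key")
    return payload[key]
```

With this helper, the cell above could read `prompt = extract_prompt(messages[-1]["content"])`, and a malformed reply fails with a clear error instead of a bare `KeyError`.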

In [7]:
# Start a fresh conversation for the actual data generation.
# llm_call from In [3] is reused unchanged; only the history is reset.
messages=[{"role":"system", "content":"You are a helpful assistant."}]

In [8]:
messages, am = llm_call(messages, prompt)
display(Markdown(am))

opportunity_id,opportunity_name,geography,stage,probabilty,amount,createddate,closedate,status,type
OPP-0001,Acme Healthcare Platform Integration,NoAM,Stage 3: Meet,0.20,13245.67,2025-02-10,2025-03-15,Won,New Logo
OPP-0002,BetaTech Cloud Migration Project,EMEA,Stage 5: POV,0.40,11890.34,2025-04-05,2026-02-01,Open,New Logo
OPP-0003,Zenith Financial Services Modernization,NoAM,Stage 8: Closed Won,0.85,260123.45,2025-06-20,2025-07-10,Lost,Cross-Sell
OPP-0004,Nimbus Retail Expansion Initiative,APAC,Stage 2: Qualify,0.15,12567.89,2025-08-15,2025-09-01,Won,New Logo
OPP-0005,Orion AI Analytics Deployment,NoAM,Stage 6: Verbal Confirmation,0.55,9876.54,2025-10-01,2026-03-15,Open,Upgrade
OPP-0006,Solaris Logistics Optimization Suite,EMEA,Stage 4: Demo,0.30,14321.09,2025-12-12,2026-01-05,Won,New Logo
OPP-0007,Quantum Energy Grid Optimization,GLOBAL,Stage 1: Create,0.10,10123.45,2026-01-15,2026-01-28,Lost,New Logo
OPP-0008,Vertex Manufacturing Process Upgrade,NoAM,Stage 7: Confirmed,0.70,140567.89,2026-01-20,2026-01-27,Won,Cross-Sell
OPP-0009,Echo Media Content Revamp,EMEA,Stage 9: Closed Lost,0.10,12000.01,2026-01-25,2026-04-20,Open,Upgrade
OPP-0010,Pulse Consumer Engagement Campaign,APAC,Stage 8: Closed Won,0.85,16000.88,2026-01-28,2026-01-29,Lost,New Logo
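
Several rows in the output above do not follow the requested logic; OPP-0003, for example, pairs `Stage 8: Closed Won` with status `Lost`. A rough sketch of a rule check is shown below. It uses only the standard library, and the stage-to-status mapping in `expected_status` is an assumption about what "based on stage" was meant to imply, not something the notebook states.

```python
import csv
import io
from datetime import date

TODAY = date(2026, 1, 29)  # reference date given in the requirements


def expected_status(stage: str) -> str:
    # Assumed mapping: only the two closed stages count as Won/Lost.
    if stage == "Stage 8: Closed Won":
        return "Won"
    if stage == "Stage 9: Closed Lost":
        return "Lost"
    return "Open"


def find_violations(csv_text: str) -> list[tuple[str, str]]:
    """Return (opportunity_id, problem) pairs for rows that break a rule."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text.strip())):
        oid = row["opportunity_id"]
        created = date.fromisoformat(row["createddate"])
        closed = date.fromisoformat(row["closedate"])
        if closed <= created:
            problems.append((oid, "closedate is not after createddate"))
        if row["status"] != expected_status(row["stage"]):
            problems.append((oid, "status does not match stage"))
        if row["status"] in ("Won", "Lost") and closed > TODAY:
            problems.append((oid, "closed opportunity has a future closedate"))
    return problems


# Two rows taken from the output above, reduced to the columns being checked.
sample = """opportunity_id,stage,createddate,closedate,status
OPP-0003,Stage 8: Closed Won,2025-06-20,2025-07-10,Lost
OPP-0005,Stage 6: Verbal Confirmation,2025-10-01,2026-03-15,Open
"""
print(find_violations(sample))
# → [('OPP-0003', 'status does not match stage')]
```

Running `find_violations(am)` on the full reply would list every offending row, which can then be fed back to the model or filtered out before forecasting.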

In [9]:
import io
import pandas as pd

data_io = io.StringIO(am)
df = pd.read_csv(data_io)
df.shape

(10, 10)

In [10]:
df.head(20)

Unnamed: 0,opportunity_id,opportunity_name,geography,stage,probabilty,amount,createddate,closedate,status,type
0,OPP-0001,Acme Healthcare Platform Integration,NoAM,Stage 3: Meet,0.2,13245.67,2025-02-10,2025-03-15,Won,New Logo
1,OPP-0002,BetaTech Cloud Migration Project,EMEA,Stage 5: POV,0.4,11890.34,2025-04-05,2026-02-01,Open,New Logo
2,OPP-0003,Zenith Financial Services Modernization,NoAM,Stage 8: Closed Won,0.85,260123.45,2025-06-20,2025-07-10,Lost,Cross-Sell
3,OPP-0004,Nimbus Retail Expansion Initiative,APAC,Stage 2: Qualify,0.15,12567.89,2025-08-15,2025-09-01,Won,New Logo
4,OPP-0005,Orion AI Analytics Deployment,NoAM,Stage 6: Verbal Confirmation,0.55,9876.54,2025-10-01,2026-03-15,Open,Upgrade
5,OPP-0006,Solaris Logistics Optimization Suite,EMEA,Stage 4: Demo,0.3,14321.09,2025-12-12,2026-01-05,Won,New Logo
6,OPP-0007,Quantum Energy Grid Optimization,GLOBAL,Stage 1: Create,0.1,10123.45,2026-01-15,2026-01-28,Lost,New Logo
7,OPP-0008,Vertex Manufacturing Process Upgrade,NoAM,Stage 7: Confirmed,0.7,140567.89,2026-01-20,2026-01-27,Won,Cross-Sell
8,OPP-0009,Echo Media Content Revamp,EMEA,Stage 9: Closed Lost,0.1,12000.01,2026-01-25,2026-04-20,Open,Upgrade
9,OPP-0010,Pulse Consumer Engagement Campaign,APAC,Stage 8: Closed Won,0.85,16000.88,2026-01-28,2026-01-29,Lost,New Logo
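
The requirements ask for a Pareto-like split, with roughly 80% of revenue coming from 20% of deals. One way to check that on the generated amounts is sketched below; the function name `top_share` is hypothetical, and the amounts are copied from the ten rows above.

```python
def top_share(amounts, top_fraction=0.2):
    """Fraction of the total contributed by the largest `top_fraction` of deals."""
    if not amounts:
        raise ValueError("amounts must not be empty")
    ranked = sorted(amounts, reverse=True)
    # Number of deals in the top slice, rounded to a whole deal (at least one).
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


amounts = [13245.67, 11890.34, 260123.45, 12567.89, 9876.54,
           14321.09, 10123.45, 140567.89, 12000.01, 16000.88]
print(round(top_share(amounts), 3))
```

For these ten rows the two largest deals account for about 80% of the total, so the split matches the request. In the notebook the same check would be `top_share(df["amount"].tolist())`.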


In [21]:
# Re-run this cell to append another batch of 10 rows.

messages, am = llm_call(messages, "Continue Generating 10 more rows with header.")

data_io = io.StringIO(am)
df1 = pd.read_csv(data_io)
df = pd.concat([df, df1], ignore_index=True)
df.head(20)

Unnamed: 0,opportunity_id,opportunity_name,geography,stage,probabilty,amount,createddate,closedate,status,type
0,OPP-0001,Acme Healthcare Platform Integration,NoAM,Stage 3: Meet,0.2,13245.67,2025-02-10,2025-03-15,Won,New Logo
1,OPP-0002,BetaTech Cloud Migration Project,EMEA,Stage 5: POV,0.4,11890.34,2025-04-05,2026-02-01,Open,New Logo
2,OPP-0003,Zenith Financial Services Modernization,NoAM,Stage 8: Closed Won,0.85,260123.45,2025-06-20,2025-07-10,Lost,Cross-Sell
3,OPP-0004,Nimbus Retail Expansion Initiative,APAC,Stage 2: Qualify,0.15,12567.89,2025-08-15,2025-09-01,Won,New Logo
4,OPP-0005,Orion AI Analytics Deployment,NoAM,Stage 6: Verbal Confirmation,0.55,9876.54,2025-10-01,2026-03-15,Open,Upgrade
5,OPP-0006,Solaris Logistics Optimization Suite,EMEA,Stage 4: Demo,0.3,14321.09,2025-12-12,2026-01-05,Won,New Logo
6,OPP-0007,Quantum Energy Grid Optimization,GLOBAL,Stage 1: Create,0.1,10123.45,2026-01-15,2026-01-28,Lost,New Logo
7,OPP-0008,Vertex Manufacturing Process Upgrade,NoAM,Stage 7: Confirmed,0.7,140567.89,2026-01-20,2026-01-27,Won,Cross-Sell
8,OPP-0009,Echo Media Content Revamp,EMEA,Stage 9: Closed Lost,0.1,12000.01,2026-01-25,2026-04-20,Open,Upgrade
9,OPP-0010,Pulse Consumer Engagement Campaign,APAC,Stage 8: Closed Won,0.85,16000.88,2026-01-28,2026-01-29,Lost,New Logo
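
Because each re-run appends whatever the model returns, a later batch may repeat an `opportunity_id` from an earlier one. The sketch below merges CSV batches and keeps the first row seen for each ID. It uses only the standard library, the helper name `merge_batches` is hypothetical, and the second batch is made-up illustration data rather than real model output.

```python
import csv
import io


def merge_batches(batches, key="opportunity_id"):
    """Combine CSV batches, keeping the first row seen for each key value."""
    merged = {}
    for text in batches:
        for row in csv.DictReader(io.StringIO(text.strip())):
            # setdefault leaves an existing entry untouched, so the first row wins.
            merged.setdefault(row[key], row)
    return list(merged.values())


batch_1 = "opportunity_id,amount\nOPP-0001,13245.67\nOPP-0002,11890.34\n"
batch_2 = "opportunity_id,amount\nOPP-0002,99999.99\nOPP-0011,15432.10\n"  # hypothetical
rows = merge_batches([batch_1, batch_2])
print([r["opportunity_id"] for r in rows])
# → ['OPP-0001', 'OPP-0002', 'OPP-0011']
```

With pandas, calling `drop_duplicates(subset="opportunity_id")` on the concatenated frame has the same effect.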


In [23]:
df.shape

(50, 10)

In [24]:
messages

[{'role': 'system', 'content': 'You are a helpful assistant.'},
 {'role': 'user',
  'content': 'Generate a CSV dataset with header: opportunity_id,opportunity_name,geography,stage,probabilty,amount,createddate,closedate,status,type\n- opportunity_id: unique uppercase alphanumeric string (e.g., \'OPP-0001\')\n- opportunity_name: unique descriptive name, avoid generic placeholders\n- geography: randomly select from [\'NoAM\',\'EMEA\',\'APAC\',\'GLOBAL\'] with slightly higher weight for \'NoAM\' and \'EMEA\'\n- stage: randomly select from the ordered list ["Stage 1: Create","Stage 2: Qualify","Stage 3: Meet","Stage 4: Demo","Stage 5: POV","Stage 6: Verbal Confirmation","Stage 7: Confirmed","Stage 8: Closed Won","Stage 9: Closed Lost"]\n- probabilty: map stage to a float probability (Stage 1: 0.10, Stage 2: 0.15, Stage 3: 0.20, Stage 4: 0.30, Stage 5: 0.40, Stage 6: 0.55, Stage 7: 0.70, Stage 8: 0.85, Stage 9: 0.10) ensuring value between 0 and 1\n- amount: draw from a normal distribution 