# Notebook 1: Data Generation with Tinker

**Goal:** Use `tinker` to generate a `jsonl` dataset for fine-tuning. This file will be our "teacher training" data.

## 1. Setup and Imports

In [None]:
import os
import json
import tinker
from dotenv import load_dotenv

# Load API keys from .env file
load_dotenv()

# Pass the OpenAI API key to tinker explicitly.
# Assumes tinker accepts the key via `tinker.api_key`.
tinker.api_key = os.getenv("OPENAI_API_KEY")

# Define output path
OUTPUT_DIR = "outputs/datasets/"
OUTPUT_FILE = os.path.join(OUTPUT_DIR, "training_data.jsonl")

# Ensure directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Fall back to 'unknown' if the package does not expose __version__
print(f"Tinker version: {getattr(tinker, '__version__', 'unknown')}")
print(f"Output file will be saved to: {OUTPUT_FILE}")
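
The setup cell assumes `OPENAI_API_KEY` is present in the environment or the `.env` file. If it is missing, `os.getenv` silently returns `None`, and the failure only surfaces at the first API call. A minimal sketch of failing fast instead, using a hypothetical helper `require_env` (not part of `tinker` or `python-dotenv`):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Add it to your .env file or export it before running this notebook."
        )
    return value
```

With this helper, the key assignment could read `tinker.api_key = require_env("OPENAI_API_KEY")`, so a missing key stops the notebook in the setup cell rather than partway through generation.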

## 2. Define Data Generation Task

Define the `tinker.Task` for the model. This includes the system prompt, the user prompt template, the response format, and one few-shot example.

In [None]:
# This is a hypothetical task based on the ADAM project's goals
data_gen_task = tinker.Task(
    name="Credit Risk Assessor",
    description="Generates a structured credit risk assessment from a company summary.",
    system_prompt=(
        "You are a senior credit analyst. Analyze the provided company summary and "
        "produce a JSON object with two keys: 'key_risks' (a list of strings describing "
        "the key credit risks) and 'final_rating' (one of 'Low', 'Medium', 'High')."
    ),
    # No leading whitespace, so rendered prompts match the few-shot example below
    user_prompt_template="Analyze the following company summary:\n\n{company_summary}",
    response_format="json",
    # Provide a few-shot example
    examples=[
        {
            "messages": [
                {"role": "user", "content": "Analyze the following company summary:\n\nTechCorp is a SaaS company with $50M ARR, but high churn (30%) and new competition. They are burning $2M/month."},
                {"role": "assistant", "content": "{\"key_risks\": [\"High customer churn (30%)\", \"Negative cash flow ($2M/month burn)\", \"Increasing competitive pressure\"], \"final_rating\": \"High\"}"}
            ]
        }
    ]
)

print("Tinker Task defined successfully.")
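
The task asks for a JSON object with a rating of 'Low', 'Medium', or 'High', and the few-shot example uses the keys `key_risks` and `final_rating`. A minimal sketch of checking a response string against that shape, in plain Python and independent of `tinker`. The helper name `validate_assessment` and the exact rules (a list of strings plus one of three ratings) are assumptions drawn from the example above, not a schema the library enforces:

```python
import json

ALLOWED_RATINGS = {"Low", "Medium", "High"}


def validate_assessment(raw: str) -> list[str]:
    """Return a list of problems found in a model response; empty means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    risks = data.get("key_risks")
    if not isinstance(risks, list) or not all(isinstance(r, str) for r in risks):
        problems.append("'key_risks' must be a list of strings")
    rating = data.get("final_rating")
    if not isinstance(rating, str) or rating not in ALLOWED_RATINGS:
        problems.append("'final_rating' must be one of Low, Medium, High")
    return problems


# The assistant message from the few-shot example above
few_shot_response = (
    '{"key_risks": ["High customer churn (30%)", '
    '"Negative cash flow ($2M/month burn)", '
    '"Increasing competitive pressure"], "final_rating": "High"}'
)
print(validate_assessment(few_shot_response))  # prints []
```

Running the same check on the few-shot example itself is a cheap way to confirm that the example and the system prompt agree before any API calls are made.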

## 3. Generate Training Dataset

Use `tinker.generate_dataset` to create the training examples. We'll start with a small batch for testing.

In [None]:
# Define some seed inputs to generate variations from
seed_inputs = [
    {"company_summary": "StableCo is a utilities provider with 10-year government contracts, low debt, and 5% annual profit growth."},
    {"company_summary": "Growthly is a pre-profit tech startup with a new patent but no revenue and only 6 months of runway left."},
    {"company_summary": "RetailGiant is a 50-year-old retailer facing declining foot traffic due to e-commerce, but has significant real estate assets."}
]

NUM_EXAMPLES = 10  # Small batch for testing; increase once the output looks right

print(f"Generating {NUM_EXAMPLES} examples based on {len(seed_inputs)} seed inputs...")

# This call uses the Tinker API and your OpenAI key to generate data
try:
    generated_dataset = tinker.generate_dataset(
        task=data_gen_task,
        inputs=seed_inputs,
        num_examples=NUM_EXAMPLES,
    )
    print(f"Successfully generated {len(generated_dataset)} examples.")
    # Guard against an empty result before indexing
    if generated_dataset:
        print("\n--- Example 0 ---")
        print(json.dumps(generated_dataset[0], indent=2))
except Exception as e:
    print(f"An error occurred during data generation: {e}")
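
Before scaling past a small test batch, it can help to look at how the generated ratings are distributed, since a set dominated by one rating makes weak fine-tuning data. The sketch below counts `final_rating` values. It assumes each example is a dict with a `messages` list of `role`/`content` entries (the shape this notebook expects from `tinker`), and the helper name `rating_distribution` is hypothetical:

```python
import json
from collections import Counter


def rating_distribution(examples):
    """Count final_rating values across examples; unparseable ones count as 'invalid'."""
    counts = Counter()
    for example in examples:
        # Collect assistant messages; the last one is treated as the answer
        answers = [
            m.get("content", "")
            for m in example.get("messages", [])
            if m.get("role") == "assistant"
        ]
        try:
            rating = str(json.loads(answers[-1])["final_rating"])
        except (IndexError, KeyError, TypeError, json.JSONDecodeError):
            rating = "invalid"
        counts[rating] += 1
    return counts


# Hand-written sample records for illustration, not real tinker output
sample = [
    {"messages": [
        {"role": "user", "content": "Analyze the following company summary:\n\n(summary A)"},
        {"role": "assistant", "content": '{"key_risks": ["Short runway"], "final_rating": "High"}'},
    ]},
    {"messages": [
        {"role": "user", "content": "Analyze the following company summary:\n\n(summary B)"},
        {"role": "assistant", "content": "not json"},
    ]},
]
print(rating_distribution(sample))
```

In the notebook, the same function could be called as `rating_distribution(generated_dataset)`. A large 'invalid' count would suggest the prompt or response format needs adjusting before generating more data.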

## 4. Save Data to JSONL

Write the generated data in the `jsonl` format required by the OpenAI fine-tuning API: one JSON object per line, each containing a `messages` list.

In [None]:
# Also check that the dataset is non-empty so we never write an empty file
if 'generated_dataset' in locals() and generated_dataset:
    count = 0
    with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
        for example in generated_dataset:
            # Assumes each tinker example carries a 'messages' key,
            # which is the shape the OpenAI API expects.
            if "messages" in example:
                json_line = json.dumps({"messages": example["messages"]})
                f.write(json_line + "\n")
                count += 1

    skipped = len(generated_dataset) - count
    print(f"Successfully saved {count} examples to {OUTPUT_FILE} ({skipped} skipped)")
else:
    print("Skipping save, no data was generated.")
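
Once the file is written, reading it back is a quick way to confirm that every line is a standalone JSON object with a usable `messages` list. The sketch below uses only the standard library. The helper name `check_jsonl` is hypothetical, and the role set is limited to the three roles that appear in this notebook, which is an assumption rather than the full list the fine-tuning API accepts:

```python
import json

EXPECTED_ROLES = {"system", "user", "assistant"}


def check_jsonl(path):
    """Return (line_number, problem) tuples for records that fail basic checks."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "not valid JSON"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((line_number, "missing or empty 'messages' list"))
                continue
            for message in messages:
                if not isinstance(message, dict):
                    problems.append((line_number, "message is not an object"))
                    continue
                role = message.get("role")
                if not isinstance(role, str) or role not in EXPECTED_ROLES:
                    problems.append((line_number, f"unexpected role: {role!r}"))
                elif not isinstance(message.get("content"), str):
                    problems.append((line_number, "'content' is not a string"))
    return problems
```

Calling `check_jsonl(OUTPUT_FILE)` after the save cell should return an empty list when the file is well formed; any tuples it returns point to the line that needs attention.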