
# Day 12 · Notebook 5 — Active‑Learning & Prompt Retraining  (Concepts #106 & #151)

We’ll capture real‑time **thumbs‑up / thumbs‑down** feedback, store it in SQLite, and periodically use it to fine‑tune a small reward model or to select better prompt variants.

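The motivating idea is that a modest reward signal can steer generation quality without a full RLHF pipeline. A small *Process‑Reward Model* (PRM) is one option; *DeepSeek R1* reports strong results from simple rule‑based rewards, and its authors describe PRMs as hard to make work at scale.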


In [None]:

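import random
import sqlite3

from openai import OpenAI  # openai>=1.0; the old openai.ChatCompletion API was removed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Feedback log: one row per rated response (rating: 1 = thumbs-up, 0 = thumbs-down).
conn = sqlite3.connect("feedback.db")
conn.execute("""CREATE TABLE IF NOT EXISTS feedback (
    id INTEGER PRIMARY KEY,
    prompt TEXT,
    response TEXT,
    rating INTEGER,
    ts DATETIME DEFAULT CURRENT_TIMESTAMP
)""")
conn.commit()

PROMPT = "Explain photosynthesis to a 10‑year‑old in exactly three sentences."

def get_response(prompt=PROMPT):
    """Return a single completion for `prompt` as a stripped string."""
    res = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100,
    )
    return res.choices[0].message.content.strip()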

# Simulated user rating (1👍, 0👎)
def simulate_rating() -> int:
    return random.choice([0,1])

resp = get_response()
rating = simulate_rating()
conn.execute("INSERT INTO feedback(prompt,response,rating) VALUES(?,?,?)",
             (PROMPT, resp, rating))
conn.commit()
print("Stored feedback →", rating)
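
Once a few ratings have accumulated, one simple way to "choose better prompt variants" is to compare approval rates per prompt. The sketch below is only an illustration: it uses an in‑memory database with made‑up rows, and the names `demo_conn`, `approval_by_prompt`, and the "variant A/B" prompts are hypothetical, not part of the notebook above.

```python
import sqlite3

# In-memory copy of the feedback schema, filled with made-up rows (assumption).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""CREATE TABLE feedback (
    id INTEGER PRIMARY KEY,
    prompt TEXT,
    response TEXT,
    rating INTEGER,
    ts DATETIME DEFAULT CURRENT_TIMESTAMP
)""")
demo_rows = [
    ("variant A", "resp 1", 1),
    ("variant A", "resp 2", 0),
    ("variant A", "resp 3", 1),
    ("variant B", "resp 4", 0),
    ("variant B", "resp 5", 0),
]
demo_conn.executemany(
    "INSERT INTO feedback(prompt, response, rating) VALUES (?, ?, ?)", demo_rows
)
demo_conn.commit()

def approval_by_prompt(conn):
    """Return [(prompt, n_ratings, approval_rate)], best approval first."""
    return conn.execute(
        """SELECT prompt, COUNT(*) AS n, AVG(rating) AS approval
           FROM feedback
           GROUP BY prompt
           ORDER BY approval DESC, n DESC"""
    ).fetchall()

stats = approval_by_prompt(demo_conn)
best_prompt = stats[0][0]
print(stats)
print("Best variant so far:", best_prompt)
```

With only a handful of ratings per variant the ranking is noisy, so in practice it is probably worth requiring a minimum count before switching prompts.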



### Next Steps
1. Aggregate feedback into a “replay buffer.”  
2. Fine‑tune a tiny reward model or simple heuristic scorer.  
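3. Use the scorer inside a reward‑guided selection loop (best‑of‑N sampling) to choose between multiple candidate generations; a *process‑supervised* variant would score intermediate steps rather than only the final response.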
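
The three steps above can be sketched end to end without any API calls. This is a minimal illustration under stated assumptions: the scorer is a hand‑written heuristic that only checks the "exactly three sentences" constraint from `PROMPT`, the rows and candidates are made up, and every helper name (`load_replay_buffer`, `heuristic_score`, `scorer_agreement`, `best_of_n`) is hypothetical. It scores whole responses, so it is outcome‑level selection rather than true step‑level process supervision.

```python
import re
import sqlite3

def load_replay_buffer(conn, limit=1000):
    """Step 1: pull the most recent rated examples out of SQLite."""
    return conn.execute(
        "SELECT prompt, response, rating FROM feedback ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()

def count_sentences(text):
    """Crude sentence count: each run of '.', '!' or '?' ends one sentence."""
    return len(re.findall(r"[.!?]+", text))

def heuristic_score(response, target_sentences=3):
    """Step 2: 1.0 when the sentence count hits the target, lower as it drifts."""
    return 1.0 / (1 + abs(count_sentences(response) - target_sentences))

def scorer_agreement(buffer, threshold=0.5):
    """Fraction of buffered ratings the scorer would have predicted."""
    if not buffer:
        return 0.0
    hits = sum(
        int(heuristic_score(resp) > threshold) == rating
        for _, resp, rating in buffer
    )
    return hits / len(buffer)

def best_of_n(candidates, score=heuristic_score):
    """Step 3: keep the highest-scoring candidate."""
    return max(candidates, key=score)

# Made-up feedback rows in an in-memory database (assumption, for illustration).
buf_conn = sqlite3.connect(":memory:")
buf_conn.execute(
    "CREATE TABLE feedback (id INTEGER PRIMARY KEY, prompt TEXT, "
    "response TEXT, rating INTEGER)"
)
buf_conn.executemany(
    "INSERT INTO feedback(prompt, response, rating) VALUES (?, ?, ?)",
    [
        ("p", "Plants eat light. They drink water. They make sugar.", 1),
        ("p", "Photosynthesis is complicated.", 0),
        ("p", "Leaves catch sun. Roots drink. Air comes in. Sugar comes out. Done.", 1),
    ],
)
buffer = load_replay_buffer(buf_conn)
print("Scorer agrees with users on", scorer_agreement(buffer), "of the buffer")

candidates = [
    "Plants make food.",
    "Plants use sunlight. They take in water and air. They make sugar and oxygen.",
    "Sun. Water. Air. Sugar.",
]
chosen = best_of_n(candidates)
print("Chosen:", chosen)
```

In the notebook, `candidates` would presumably come from calling `get_response()` several times, and the agreement figure gives a rough check on whether the scorer tracks real user ratings before it is trusted to pick responses.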
