# RAG
This notebook sets up the prompt logic for grading a randomly selected QnA entry and uses RAG (retrieval-augmented generation) to supply supporting context.

## Context setup

In [121]:
import os
import json
import random
from openai import OpenAI
from dotenv import load_dotenv
from qdrant_client import QdrantClient

In [122]:
# Load environment variables from .env file
load_dotenv()

# Initialize clients with API keys from .env
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
qdrant_client = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY")
)

In [123]:
# set up a question
question = 'who is the president of the united states?'

In [124]:
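def get_context(question: str, embedding_model: str = "text-embedding-3-small", limit: int = 2) -> str:
    """Embed the question and return the top `limit` matching pages from Qdrant as one string."""
    # Embed the question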
    embedding_response = openai_client.embeddings.create(
        model=embedding_model,
        input=question
    )
    question_embedding = embedding_response.data[0].embedding

    # Search Qdrant for relevant pages
    search_results = qdrant_client.query_points(
        collection_name="usa_civics_guide",
        query=question_embedding,
        limit=limit
    )

    # Build context from search results
    context = "\n\n---\n\n".join([
        f"Page {result.payload['page_number']}:\n{result.payload['text']}"
        for result in search_results.points
    ])

    return context

In [125]:
context = get_context(question)
print(context)

Page 13:
THE COMMANDER IN CHIEF 
The President of the United States is the 
Commander in Chief of the military. This 
means that the President is in charge of everyone 
serving in U.S. military. Another term for the U.S. 
military is the U.S. Armed Forces.  
In the United States, serving in the U.S. Armed 
Forces is voluntary. Today there are over 2 million people serving in the U.S. Armed Forces.  For more 
information about the U.S. Armed Forces, please visit: 
U.S. Department of Defense at www.defense.gov .There are six branches of the U.S. Armed Forces:
VOTING FOR PRESIDENT
U.S. citizens vote for President in November 27, 
and the President is elected for four years. 
A person can only be elected to be President two  
times. This means a person can get elected for four 
years, and then get reelected for four more years. 
The Constitution says that in order to get elected 
President, a person must be 35 years old or older and was a U.S. citizen at birth. 
When a person runs for Pres
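
The joining step at the end of `get_context` can be tried without calling OpenAI or Qdrant. The following is a minimal sketch, assuming each payload is a plain dict with the same `page_number` and `text` keys used above. The helper name `format_context`, the sample payloads, and the choice to skip hits with empty text are illustrative assumptions, not part of the notebook.

```python
def format_context(payloads: list[dict], separator: str = "\n\n---\n\n") -> str:
    """Join retrieved page payloads into one context string (hypothetical helper)."""
    sections = []
    for payload in payloads:
        text = (payload.get("text") or "").strip()
        if not text:
            # Assumption: a hit without text adds nothing useful to the prompt
            continue
        page = payload.get("page_number", "unknown")
        sections.append(f"Page {page}:\n{text}")
    return separator.join(sections)


# Made-up payloads, only to exercise the helper
sample_payloads = [
    {"page_number": 13, "text": "The President is the Commander in Chief."},
    {"page_number": 14, "text": ""},
    {"text": "Congress makes federal laws."},
]
print(format_context(sample_payloads))
```

Using `payload.get` instead of direct indexing means one malformed point does not raise a `KeyError` for the whole request; whether skipping such a point is acceptable depends on how the collection was populated.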

## LLM setup

In [126]:
def llm(system_prompt: str, user_prompt: str, model: str = "gpt-4o-mini") -> dict:
    """
    Get an answer from the LLM using custom system and user prompts.
    
    Args:
        system_prompt: The system message that defines the assistant's behavior and role
        user_prompt: The user message containing the question and any context
        model: OpenAI model to use (default: gpt-4o-mini)
    
    Returns:
        The LLM's parsed JSON output as a dict. If the API call or JSON parsing fails, returns {"error": "..."}.
    """
    try:
        completion = openai_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            response_format={"type": "json_object"},
            temperature=0.3,
        )

        response_text = completion.choices[0].message.content

        try:
            data = json.loads(response_text)
        except json.JSONDecodeError as e:
            raise ValueError(f"Failed to parse JSON response: {e}\nRaw output: {response_text}")

        return data

    except Exception as e:
        return {"error": str(e)}
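
The JSON parsing step inside `llm` can be isolated so that it can be tested without an API call. This is a sketch under assumptions: the helper name `parse_json_reply` is hypothetical, and it returns the error dict directly rather than raising and catching an exception as `llm` does.

```python
import json


def parse_json_reply(response_text) -> dict:
    """Parse a model reply into a dict, or return {"error": "..."} (hypothetical helper)."""
    try:
        data = json.loads(response_text)
    except (json.JSONDecodeError, TypeError) as e:
        # TypeError covers a reply whose content is None
        return {"error": f"Failed to parse JSON response: {e}"}
    if not isinstance(data, dict):
        # Valid JSON that is not an object (for example a list) is treated as an error here
        return {"error": "Expected a JSON object"}
    return data


print(parse_json_reply('{"answer": "Congress"}'))
print(parse_json_reply("not json"))
```

Callers of `llm` can then check for the `"error"` key before reading any other field.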


### Prompts

In [127]:
system_prompt = """
You are a friendly USCIS officer helping the user practice for the U.S. naturalization civics test.

You will receive:
- a *question* (one of the official USCIS civics test questions),
- a list of acceptable *answers*,
- the *user_state* (the U.S. state where the user lives),
- the *user_answer* (the user’s attempt),
- and some *context* (background facts retrieved from a knowledge base).

Your task:
1. Decide if the user's answer is correct based on the *answers* list.  
   - For most questions, match the *user_answer* against the list of acceptable *answers*.  
   - Accept small typos, alternate spellings, or equivalent forms (e.g., “Vance” = “JD Vance”).  
   - Only use *context* to provide background_info, **not** to judge correctness unless it directly matches *answers*.  

2. **If the question depends on the user's state** (for example, questions like “Name one of your state’s U.S. Senators” or “Who is your state’s governor?”):
   - Look for answers that begin with the user’s state abbreviation (e.g., “AZ: …” for Arizona).  
   - Only consider those entries as correct for that user.  
   - When explaining correctness, remove the state code (e.g., return “Juanita” instead of “AZ: Juanita”).

3. **If the question asks about “your U.S. Representative” or “name your Representative”:**
   - Assume any of the listed answers is acceptable.  
   - Always append this sentence to the *reason* field:  
     "Note: Your actual U.S. Representative depends on where you live. You can find yours at https://www.house.gov/representatives/find-your-representative"

4. If the user is correct, respond positively and encouragingly.  
5. If the user is incorrect, gently explain why and provide the correct answer(s) for their state or the general list as appropriate.  
6. Include one concise, interesting fact or detail drawn from the *context* that relates to the question or its answer.

### Output format
You must reply **only** in valid JSON — no commentary or extra text.

output:
{
  "success": true | false,
  "reason": "Short, friendly message congratulating or explaining the correct answer.",
  "background_info": "Concise, interesting fact or context related to the question."
}


### Style rules
- Keep responses short and conversational, as if speaking during an interview.  
- Ensure JSON is 100% valid (no trailing commas, no quotes around booleans).  
- Do **not** repeat the question or the user’s answer unless helpful to the explanation.  

### Examples

**Example 1 (state-specific question, correct answer)**  
**Input:**
- Question: "Name one of your state’s U.S. Senators."  
- Answers: ["AZ: Pepito García", "AZ: Juanita Cruz", "CA: Alex Padilla", "CA: Laphonza Butler"]  
- User State: "AZ"  
- User Answer: "Juanita"  
- Context: "Each state elects two senators to represent it in the U.S. Senate for six-year terms."  

**Output:**
{
  "success": true,
  "reason": "Correct! Juanita Cruz is one of the U.S. Senators from Arizona.",
  "background_info": "Each state elects two senators to represent it in the U.S. Senate for six-year terms."
}

---

**Example 2 (state-specific question, incorrect answer)**  
**Input:**
- Question: "Name one of your state’s U.S. Senators."  
- Answers: ["AZ: Pepito García", "AZ: Juanita Cruz", "CA: Alex Padilla", "CA: Laphonza Butler"]  
- User State: "AZ"  
- User Answer: "Alex Padilla"  
- Context: "California and Arizona each have two senators in the U.S. Senate."  

**Output:**
{
  "success": false,
  "reason": "Not quite. Alex Padilla represents California. For Arizona, acceptable answers include Pepito García or Juanita Cruz.",
  "background_info": "Each U.S. state has two senators who represent that state in the Senate."
}

---

**Example 3 (state-independent question)**  
**Input:**
- Question: "How many amendments does the U.S. Constitution have?"  
- Answers: ["27", "twenty-seven"]  
- User State: "FL"  
- User Answer: "27"  
- Context: "The first 10 amendments are known as the Bill of Rights."  

**Output:**
{
  "success": true,
  "reason": "Correct! The Constitution has 27 amendments in total.",
  "background_info": "The first 10 amendments are known as the Bill of Rights."
}

---

**Example 4 (representative question)**  
**Input:**
- Question: "Name your U.S. Representative."  
- Answers: ["CA: Nancy Pelosi", "CA: Jim Jordan", "CA: Alexandria Ocasio-Cortez", "CA: Kevin McCarthy", "CA: Elise Stefanik"]  
- User State: "CA"  
- User Answer: "Nancy Pelosi"  
- Context: "Members of the U.S. House of Representatives serve two-year terms and represent specific districts within each state."  

**Output:**
{
  "success": true,
  "reason": "Correct! Nancy Pelosi is one of the current U.S. Representatives in your state. Note: Your actual U.S. Representative depends on where you live. You can find yours at https://www.house.gov/representatives/find-your-representative",
  "background_info": "Members of the House of Representatives serve two-year terms and represent districts within each state."
}
"""

In [128]:
user_prompt = """
<question>
{question}
</question>

<answers>
{answers}
</answers>

<user_state>
{user_state}
</user_state>

<user_answer>
{user_answer}
</user_answer>

<context>
{context}
</context>
"""

## QnA retrieval setup

In [129]:
# establish user selections
test_year = 2008
user_state = 'AZ'

In [130]:
# Load the QnA pairs for the selected test year
filename = f"../documents/{test_year}_civics_test_qa_pairs.json"

with open(filename, "r", encoding="utf-8") as f:
    data = json.load(f)

In [131]:
# select a random entry
random_entry = random.choice(data)

# get the question and answer
question = random_entry['question']
answers = random_entry['answers']

In [132]:
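# Simulate a user answer by asking the LLM to respond as a test taker
ans_system_prompt = f"Pretend you are a US immigrant living in {user_state}. Answer the *question* accurately and concisely. Output must be valid JSON only: {{\"answer\": \"<your answer>\"}}."
ans_user_prompt = f"<question> {question} </question>"

output = llm(ans_system_prompt, ans_user_prompt)
# llm() returns {"error": "..."} on failure, so check before reading the answer
if "error" in output:
    raise RuntimeError(f"Could not simulate a user answer: {output['error']}")
user_answer = output['answer']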

In [133]:
print(f'question: {question}')
print(f'possible answer(s): {answers}')
print(f'user answer: {user_answer}')

question: Name one branch or part of the government.
possible answer(s): ['Congress', 'legislative', 'President', 'executive', 'the courts', 'judicial']
user answer: Legislative Branch


## RAG

In [134]:
def rag(question: str, answers: list[str], user_state: str, user_answer: str, model: str = "gpt-4o-mini") -> dict:
    # Retrieve context for the question
    context = get_context(question)
    # Fill the user prompt template with the question data and retrieved context
    qna_user_prompt = user_prompt.format(
        question=question,
        answers=answers,
        user_state=user_state,
        user_answer=user_answer,
        context=context,
    )
    # Send both prompts to the LLM (llm() already parses the output into a dict)
    return llm(system_prompt, qna_user_prompt, model)

In [135]:
response = rag(question, answers, user_state, user_answer)

In [136]:
response

{'success': True,
 'reason': 'Correct! The Legislative Branch is one of the three branches of the U.S. government.',
 'background_info': 'The three branches of government are the Legislative Branch (Congress), the Executive Branch (President), and the Judicial Branch (Supreme Court).'}
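
The response above has the three keys requested in the system prompt's output format. Since `llm` can also return `{"error": "..."}`, and the model is not guaranteed to follow the format, a shape check before using the result may be worthwhile. A minimal sketch, with the hypothetical helper name `validate_feedback`:

```python
def validate_feedback(data: dict) -> dict:
    """Return data unchanged if it has the expected shape, else an error dict."""
    if "error" in data:
        # Pass through errors already reported upstream
        return data
    expected = {"success": bool, "reason": str, "background_info": str}
    for key, expected_type in expected.items():
        if key not in data:
            return {"error": f"Missing key: {key}"}
        if not isinstance(data[key], expected_type):
            return {"error": f"Key {key!r} should be {expected_type.__name__}"}
    return data


good = {"success": True, "reason": "Correct!", "background_info": "A short fact."}
print(validate_feedback(good))
print(validate_feedback({"success": "true", "reason": "Correct!", "background_info": "A short fact."}))
```

The second call is rejected because `"true"` is a string rather than a boolean, which is the mistake the prompt's style rules warn about.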