### Demo 3: Finance Exam Tutor with Chain-of-Thought Reasoning

In [1]:

import litellm
litellm._turn_on_debug()


def query_litellm(model_name: str, prompt: str) -> str:
    """
    Query a LiteLLM-compatible model with a given prompt.

    Args:
        model_name (str): The model to use (e.g. 'ollama/qwen2.5' or 'huggingface/your-model').
        prompt (str): The user's input question.

    Returns:
        str: The model's response text.
    """
    response = litellm.completion(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )

    return response['choices'][0]['message']['content']
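
Because `query_litellm` needs a running model server, it can be handy to exercise the rest of the pipeline offline. The following is a minimal sketch, not part of the original notebook: `make_stub_query` and the canned reply are hypothetical, chosen only so that the stub has the same `(model_name, prompt) -> str` call shape as `query_litellm`.

```python
from typing import Callable, Dict


def make_stub_query(canned: Dict[str, str],
                    default: str = "<think>stub</think><answer>n/a</answer>"
                    ) -> Callable[[str, str], str]:
    """Build an offline stand-in with the same call shape as query_litellm."""
    def stub_query(model_name: str, prompt: str) -> str:
        # The prompt is ignored; the reply depends only on the model name.
        return canned.get(model_name, default)
    return stub_query


# Hypothetical canned reply, for illustration only.
stub = make_stub_query({
    "ollama/phi3": "<think>500,000 - 350,000</think><answer>$150,000</answer>",
})
print(stub("ollama/phi3", "What is the net income?"))
```

Passing such a stub wherever the notebook calls `query_litellm` would let the prompt-building and storage code run without any model installed.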



In [2]:
exam_questions = [
    "A company's revenue is $500,000 and expenses are $350,000. What is the net income?",
    "An investor buys a bond for $950 and receives $1,000 at maturity in one year. What is the yield?",
    "A project requires an initial investment of $10,000 and is expected to return $2,500 annually for 5 years. What is the payback period?"
]

# --- Step 2: Shared prompt template for both models ---
def generate_exam_prompt(question: str) -> str:
    return f"""
You are a finance tutor helping a student prepare for an exam.
Answer the following question using Chain-of-Thought (CoT) reasoning format.
Wrap your reasoning in <think> tags and your final answer in <answer> tags.

Question:
"{question}"
"""

def call_models(question: str,
                model1_name: str = 'ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF',
                model2_name: str = 'ollama/phi3') -> tuple[str, str]:
    """
    Call the two models with the same prompt and return their responses.
    """
    # Both models receive an identical prompt so the comparison is like for like.
    # Note: the default second model is phi3, although the results table later
    # in this demo stores its output in a column named qwen2_output.
    prompt = generate_exam_prompt(question)
    model1_response = query_litellm(model_name=model1_name, prompt=prompt)
    model2_response = query_litellm(model_name=model2_name, prompt=prompt)
    return model1_response, model2_response

def judge_exam_answers(question: str, output_a: str, output_b: str) -> str:
    """Build the judge prompt that asks a model to score both answers."""
    return f"""
    You are a finance instructor evaluating student responses to an exam question. Two models have answered using a Chain-of-Thought format.

    Question:
    "{question}"

    ---
    Model A:
    {output_a}

    Model B:
    {output_b}

    Evaluate each model on:
    1. Correctness of the answer
    2. Accuracy and completeness of the reasoning
    3. Use of <think> and <answer> tags

    Please respond in this format:

    Model A Score: <score>/10
    Justification A: <reasoning>

    Model B Score: <score>/10
    Justification B: <reasoning>

    Preferred Model: Model A or Model B
    """


In [None]:
import sqlite3
conn = sqlite3.connect("finance_exam_results.db")
cursor = conn.cursor()

cursor.execute("""
CREATE TABLE IF NOT EXISTS exam_results (
    question TEXT,
    fin_r1_output TEXT,
    qwen2_output TEXT,
    judge_prompt TEXT,
    judge_result TEXT
)
""")

# --- Step 6: Run all examples and store results ---
for question in exam_questions:
    fin_r1_output, qwen2_output = call_models(question)
    judge_prompt = judge_exam_answers(question, fin_r1_output, qwen2_output)
    judge_result = query_litellm('ollama/qwen2.5', judge_prompt)
    print("\n>> Judge Evaluation:\n", judge_result)

    cursor.execute("""
    INSERT INTO exam_results (question, fin_r1_output, qwen2_output, judge_prompt, judge_result)
    VALUES (?, ?, ?, ?, ?)""",
    (question, fin_r1_output, qwen2_output, judge_prompt, judge_result))
    conn.commit()

conn.close()
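
The three exam questions have closed-form answers, so the judge's correctness claims in the output that follows can be checked independently. Here is a minimal sketch of the arithmetic. The function names are illustrative and not from the original notebook, and the bond yield is taken as the simple one-year holding-period return, which is an assumption about what the question intends.

```python
def net_income(revenue: float, expenses: float) -> float:
    """Net income as revenue minus expenses."""
    return revenue - expenses


def one_year_yield(price: float, payoff: float) -> float:
    """Simple one-year return: gain divided by the purchase price."""
    return (payoff - price) / price


def payback_period(investment: float, annual_cash_flow: float) -> float:
    """Years needed to recover the investment with equal annual cash flows."""
    return investment / annual_cash_flow


print(net_income(500_000, 350_000))                # 150000
print(round(one_year_yield(950, 1_000) * 100, 2))  # 5.26
print(payback_period(10_000, 2_500))               # 4.0
```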


15:11:36 - LiteLLM:DEBUG: utils.py:311 - 

15:11:36 - LiteLLM:DEBUG: utils.py:311 - Request to litellm:
15:11:36 - LiteLLM:DEBUG: utils.py:311 - litellm.completion(model='ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF', messages=[{'role': 'user', 'content': '\nYou are a finance tutor helping a student prepare for an exam.\nAnswer the following question using Chain-of-Thought (CoT) reasoning format.\nWrap your reasoning in <think> tags and your final answer in <answer> tags.\n\nQuestion:\n"A company\'s revenue is $500,000 and expenses are $350,000. What is the net income?"\n'}])
15:11:36 - LiteLLM:DEBUG: utils.py:311 - 

15:11:36 - LiteLLM:DEBUG: litellm_logging.py:388 - self.optional_params: {}
15:11:36 - LiteLLM:DEBUG: utils.py:311 - SYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache')['no-cache']: False
15:11:36 - LiteLLM:INFO: utils.py:3002 - 
LiteLLM completion() model= hf.co/ernanhughes/Fin-R1-Q8_0-GGUF; provider = ollama
15:11:36 - Li


>> Judge Evaluation:
 Model A Score: 8/10
Justification A: 
- Correctness of the answer: The model correctly identified that ABC Corporation's revenue is growing, though not at a consistent rate, and expenses are stable. However, it did not explicitly compare these figures with industry averages or project future performance based on current trends and economic factors.
- Accuracy and completeness of reasoning: Model A provided a detailed step-by-step process for solving the problem but could have delved more into interpreting the data against industry standards and economic factors.
- Use of <think> and <answer> tags: The model appropriately used these tags, though the thought process was somewhat fragmented and lacked a cohesive narrative.

Model B Score: 7/10
Justification B: 
- Correctness of the answer: Model B correctly identified ABC Corporation's revenue growth and stable expenses but failed to provide a comparison with industry standards or future projections.
- Accuracy and 

15:12:37 - LiteLLM:DEBUG: utils.py:311 - 

15:12:37 - LiteLLM:DEBUG: utils.py:311 - Request to litellm:
15:12:37 - LiteLLM:DEBUG: utils.py:311 - litellm.completion(model='ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF', messages=[{'role': 'user', 'content': '\nYou are a finance tutor helping a student prepare for an exam.\nAnswer the following question using Chain-of-Thought (CoT) reasoning format.\nWrap your reasoning in <think> tags and your final answer in <answer> tags.\n\nQuestion:\n"An investor buys a bond for $950 and receives $1,000 at maturity in one year. What is the yield?"\n'}])
15:12:37 - LiteLLM:DEBUG: utils.py:311 - 

15:12:37 - LiteLLM:DEBUG: litellm_logging.py:388 - self.optional_params: {}
15:12:37 - LiteLLM:DEBUG: utils.py:311 - SYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache')['no-cache']: False
15:12:37 - LiteLLM:INFO: utils.py:3002 - 
LiteLLM completion() model= hf.co/e


>> Judge Evaluation:
 Model A Score: 9.5/10
Justification A: 
- **Correctness of the answer:** The calculation is correct, and the final yield is accurately determined to be approximately 5.26%.
- **Accuracy and completeness of reasoning:** The step-by-step approach is clear, and the explanation covers all necessary aspects, including the formula used and a double-check for accuracy. There are minor redundancies in the explanation but overall, it provides thorough justification.
- **Use of <think> and <answer> tags:** Proper use of these tags to separate thought processes from the final answer.

Model B Score: 8/10
Justification B: 
- **Correctness of the answer:** The calculation is correct, leading to a yield of approximately 5.26%.
- **Accuracy and completeness of reasoning:** The explanation is concise but lacks some detail. For instance, it does not explain why simple annual return was used or mention the formula in full before plugging in numbers.
- **Use of <think> and <answer>

15:14:10 - LiteLLM:INFO: utils.py:1165 - Wrapper: Completed Call, calling success_handler
15:14:10 - LiteLLM:INFO: cost_calculator.py:588 - selected model name for cost calculation: ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF
15:14:10 - LiteLLM:DEBUG: litellm_logging.py:1089 - Logging Details LiteLLM-Success Call: Cache_hit=None
15:14:10 - LiteLLM:DEBUG: utils.py:4300 - checking potential_model_names in litellm.model_cost: {'split_model': 'hf.co/ernanhughes/Fin-R1-Q8_0-GGUF', 'combined_model_name': 'ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF', 'stripped_model_name': 'hf.co/ernanhughes/Fin-R1-Q8_0-GGUF', 'combined_stripped_model_name': 'ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF', 'custom_llm_provider': 'ollama'}
15:14:10 - LiteLLM:INFO: cost_calculator.py:588 - selected model name for cost calculation: ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF
15:14:10 - LiteLLM:DEBUG: utils.py:4300 - checking potential_model_names in litellm.model_cost


>> Judge Evaluation:
 Model A Score: 9/10
Justification A: 
- The reasoning is clear and logical, accurately identifying the payback period as the time required to recover the initial investment.
- The calculation method of dividing the initial investment by the annual cash flow is correctly applied.
- The response provides a detailed thought process that explains why the payback period is 4 years, even though it took an additional step to confirm the exact year.

Model B Score: 8/10
Justification B:
- Model B's reasoning is also clear and logical. It accurately calculates the cumulative cash flow over time until the initial investment is recovered.
- However, the response could be more concise. The thought process of calculating the cumulative balance for each year can be combined into fewer steps without losing clarity.
- While it correctly identifies that the payback period is 4 years, a slight improvement in structuring could make the explanation easier to follow.

Preferred Model
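
The judge replies above follow the `Model A Score: <score>/10` layout requested in the prompt, so the scores can be pulled out with a regular expression. This is a minimal sketch under that assumption. `parse_judge_result` is a hypothetical helper, not part of the original notebook, and it returns `None` for any field it cannot find, for example when a reply is cut off before the `Preferred Model:` line is complete.

```python
import re
from typing import Dict, Optional, Union


def parse_judge_result(text: str) -> Dict[str, Optional[Union[float, str]]]:
    """Extract both scores and the preferred model from a judge reply."""
    def score(label: str) -> Optional[float]:
        # Accepts integer or decimal scores such as 8/10 or 9.5/10.
        m = re.search(rf"Model {label} Score:\s*(\d+(?:\.\d+)?)\s*/\s*10", text)
        return float(m.group(1)) if m else None

    pref = re.search(r"Preferred Model:\s*(?:Model\s*)?([AB])\b", text)
    return {
        "score_a": score("A"),
        "score_b": score("B"),
        "preferred": pref.group(1) if pref else None,
    }


# Illustrative reply in the requested format (not actual model output).
sample = (
    "Model A Score: 9.5/10\nJustification A: ...\n\n"
    "Model B Score: 8/10\nJustification B: ...\n\n"
    "Preferred Model: Model A"
)
print(parse_judge_result(sample))
# {'score_a': 9.5, 'score_b': 8.0, 'preferred': 'A'}
```

Storing the parsed fields alongside the raw `judge_result` text would make it possible to aggregate scores across questions with plain SQL.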

15:14:31 - LiteLLM:DEBUG: cost_calculator.py:344 - Returned custom cost for model=ollama/qwen2.5 - prompt_tokens_cost_usd_dollar: 0, completion_tokens_cost_usd_dollar: 0
15:14:31 - LiteLLM:DEBUG: litellm_logging.py:899 - response_cost: 0
15:14:31 - LiteLLM:DEBUG: utils.py:4300 - checking potential_model_names in litellm.model_cost: {'split_model': 'qwen2.5', 'combined_model_name': 'ollama/qwen2.5', 'stripped_model_name': 'qwen2.5', 'combined_stripped_model_name': 'ollama/qwen2.5', 'custom_llm_provider': 'ollama'}
15:14:31 - LiteLLM:DEBUG: utils.py:4584 - model_info: {'key': 'qwen2.5', 'litellm_provider': 'ollama', 'mode': 'chat', 'supports_function_calling': True, 'input_cost_per_token': 0.0, 'output_cost_per_token': 0.0, 'max_tokens': 32768, 'max_input_tokens': 32768, 'max_output_tokens': 32768}
15:14:31 - LiteLLM:ERROR: litellm_logging.py:3525 - Error creating standard logging object - __annotations__
Traceback (most recent call last):
  F