# Prompting

Welcome! In this notebook, you’ll explore how large language models (LLMs) can be guided to solve **mathematical reasoning problems** using the ⚡ **Groq API**.  
We’ll work with the 🧮 **GSM8k dataset**, a benchmark designed to test step-by-step math reasoning.  

### 🔍 What you’ll do
- 🔑 **Set up Groq** → get your API key and connect to the Groq service.  
- 📂 **Load the GSM8k dataset** → explore the structure of questions, answers, and reasoning.  
- ✍️ **Experiment with prompting** → try different styles of instructions and demonstrations.  
- 🛠️ **Build and test solvers** → send structured prompts to the model and parse responses.  
- 🎛️ **Play and improve** → adjust few-shot examples, prompts, models, and generation parameters to push performance further.  

✨ By the end, you’ll see how **prompt design + reasoning demonstrations + model settings** can dramatically change performance on challenging reasoning tasks. Have fun experimenting!


In [None]:
import os
import ast
import json
import regex

from time import sleep
import pandas as pd

from datasets import load_dataset

from groq import Groq

### Groq
- Groq builds specialized processors (LPUs: Language Processing Units) designed to run large AI models extremely fast and efficiently compared to traditional GPUs/CPUs.
- Their chips and software stack focus on minimizing inference latency, making them well-suited for real-time AI applications like chatbots, speech, and recommendation systems.
- Groq provides APIs that let you run models (like LLMs) on their hardware in the cloud, so you can use their speed without owning the hardware yourself.

Check out Groq's website: [link](https://groq.com/)!


**`TODO:`** 

1. Create a personal account on **Groq**.  
   - For this exercise, you can remain on the **Free tier**—no payment details or credit card are required.  

2. Generate an **API key** and store it in your environment variables under the name `GROQ_API_KEY`.  

3. Verify your setup by running the following code snippet.  
   - Once confirmed, feel free to experiment with different prompts and models.  


In [None]:
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

model_name = "llama-3.3-70b-versatile"

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    model=model_name
)

print(chat_completion.choices[0].message.content)
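
If the call above fails with an authentication error, the key is probably not visible to the notebook kernel. A minimal check, assuming the key is stored under `GROQ_API_KEY` as described in the TODO; `has_groq_key` is a hypothetical helper and not part of the Groq SDK:

```python
import os

def has_groq_key(env=None):
    """Return True if a non-empty GROQ_API_KEY is present in the environment."""
    # `env` can be replaced by a plain dict, which makes the check easy to test
    env = os.environ if env is None else env
    return bool(env.get("GROQ_API_KEY", "").strip())

if not has_groq_key():
    print("GROQ_API_KEY is not set; export it and restart the notebook kernel.")
```

Environment variables are read when the kernel starts, so a key exported afterwards usually only becomes visible after a restart.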

### GSM8k
- GSM8K (Grade School Math 8K) is a benchmark dataset designed to evaluate the mathematical reasoning ability of language models.
- It contains around 8,500 high-quality, linguistically diverse word problems that require multi-step reasoning at a grade-school (roughly middle-school) math level.
- It is widely used in research for training and evaluating large language models, especially for testing their ability to reason step by step rather than just recall facts.

You can have a look at the Dataset's card [here](https://huggingface.co/datasets/openai/gsm8k).

**`TODO:`**  

- Load the **train** and **test** splits of the `gsm8k` dataset using Hugging Face.  
- This dataset has two configurations: `main` and `socratic`.  
  - Be sure to set `name="main"` when loading to select the main configuration.  
- 💡 Hint: If you need a refresher on loading a `Dataset` from Hugging Face, revisit the **Lab Session 7** exercises on fine-tuning.  

In [None]:
dataset_train = load_dataset("gsm8k", "main", split="train")
dataset_test = load_dataset("gsm8k", "main", split="test")

**`TODO:`** Check out the features and the number of rows in the train set. For the first two samples, print all of their features to get an understanding of the dataset.

In [None]:
display(dataset_train)

# Print every feature of the first two samples
for sample in dataset_train.select(range(2)):
    print(f"Question:\n{sample['question']}\n")
    print(f"Answer:\n{sample['answer']}\n")

### Introduction to Reasoning

As you might have noticed, each answer in the dataset can be divided into two parts: (1) the reasoning needed to get to the result and (2) the actual, final response. The “reasoning” is the step-by-step explanation that comes before the `####` marker and shows how the final answer was derived; the final response is the number that follows the marker. The individual steps are often called thoughts: each one is a single move in the solver’s attempt to work through the problem and reach the final answer. Reasoning can be segmented in different ways depending on the dataset or the solver’s style, but for our exercise we’ll keep it simple and treat each sentence in the reasoning (one line of the answer text) as a single thought.
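
To make the split concrete, here is a minimal sketch on a hand-written answer string that imitates the dataset's layout (it is not an actual GSM8k row). `split_answer` is a hypothetical helper name used only for this illustration:

```python
def split_answer(answer):
    """Split a GSM8k-style answer string into (thoughts, final_answer)."""
    # Everything before the marker is reasoning, everything after is the result
    reasoning, _, final = answer.partition("####")
    # One thought per non-empty line of the reasoning part
    thoughts = [line.strip() for line in reasoning.splitlines() if line.strip()]
    return thoughts, final.strip()

# Hand-written example in the same layout as the dataset's `answer` field
example = (
    "Tom has 3 boxes with 4 pens each, so he has 3*4 = 12 pens.\n"
    "He gives away 2 pens, so 12-2 = 10 pens remain.\n"
    "#### 10"
)
thoughts, final = split_answer(example)
print(thoughts)
print(final)  # 10
```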

**`TODO:`** Create a few simple answer demonstrations.  

Complete the code below so that it generates `qa_obj` dictionaries containing **only**:
- the `question`
- the *final response*

Make sure the reasoning steps are excluded.


In [None]:
answer_demonstrations = []
for i, row in enumerate(dataset_train.shuffle(seed=42)):
    qa_obj = {}
    # TODO: only include the question and the final answer (no reasoning steps)
    qa_obj['Question'] = row['question']
    qa_obj['Response'] = row['answer'].split('####')[-1].strip()
    answer_demonstrations.append(qa_obj)
    if i==20:
        break
answer_demonstrations[0:2]

**`TODO:`** Now let's create some demonstrations that include reasoning as well.
- For each sample, create a new field called `"Response"`.  
  - `"Response"` should be a dictionary with the following keys:  
    - `"Thoughts"` → a list capturing the reasoning steps or intermediate thoughts leading to the solution.  
    - `"Answer"` → the final numeric answer to the problem.  


In [None]:
reason_demonstrations = []
for i, row in enumerate(dataset_train.shuffle(seed=42)):
    qa_obj = {}
    qa_obj['Question'] = row['question']
    qa_obj['Response'] = {}
    # TODO: Split the answer into thoughts and final answer
    qa_obj['Response']['Thoughts'] = row['answer'].split('####')[0].strip().split('\n')
    qa_obj['Response']['Answer'] = row['answer'].split('####')[-1].strip()
    reason_demonstrations.append(qa_obj)
    if i==20:
        break
reason_demonstrations[0:2]

**`TODO:`**  

- From the **test split** of the dataset, select **100 random samples**.
- Preprocess each sample in the same way that you preprocessed the reasoning demonstrations.
- For each sample, create a new field called `"Response"`.  
  - `"Response"` should be a dictionary with the following keys:  
    - `"Thoughts"` → a list capturing the reasoning steps or intermediate thoughts leading to the solution.  
    - `"Answer"` → the final numeric answer to the problem.  


In [None]:
test_questions = []
for i, row in enumerate(dataset_test.shuffle(seed=42)):
    qa_obj = {}
    qa_obj['Question'] = row['question']
    qa_obj['Response'] = {}
    # TODO: Split the answer into thoughts and final answer
    qa_obj['Response']['Thoughts'] = row['answer'].split('####')[0].strip().split('\n')
    qa_obj['Response']['Answer'] = row['answer'].split('####')[-1].strip()
    test_questions.append(qa_obj)
    # Stop once exactly 100 samples have been collected (i == 100 would keep 101)
    if len(test_questions) == 100:
        break
test_questions[0:2]

### Generation parameters
- Generation parameters control how the model decodes its probability distribution into actual text.
- We'll look into them extensively during the next lecture and lab session, when we break down the text generation task.
- For the moment, here's a quick sneak peek.
    - `Max tokens`: Controls the maximum length of the model's response.
    - `Temperature`: Controls how random or creative the output is (higher = more varied).
    - `Top-p`: Limits choices to the most likely tokens until their probabilities add up to p. Also called nucleus sampling.

In the code below, we define three different configurations of generation parameters. These configurations vary exclusively in temperature, ranging from high (more randomness and creativity) to low (more deterministic and focused).
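
As a rough illustration of what temperature and top-p do, the sketch below applies them to a made-up three-token distribution in plain Python. It is a simplified model of sampling, not Groq's actual decoding code; the function names and the logits are assumptions chosen for the example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    # temperature must be > 0 here; real APIs treat 0 as greedy decoding
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# With top_p=0.9 the least likely token is dropped before sampling
print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9))
```

With `top_p` fixed at 1, as in the three configurations, no token is filtered out and only the temperature changes the shape of the distribution.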


In [None]:
payload = {
    "conf1": {
      "max_tokens": 512,
      "temperature": 1,
      "top_p": 1,
    },
    "conf2": {
      "max_tokens": 512,
      "temperature": 0.5,
      "top_p": 1,
    },
    "conf3": {
      "max_tokens": 512,
      "temperature": 0.1,
      "top_p": 1,
    }
}

### 📘 GSM8K Math Solver (How It Works)

This function builds a structured **prompting pipeline** for solving grade-school math problems (from GSM8K).  
Here’s the idea:

1. **System role** → Tells the model it is a *mathematical reasoning expert*.  
2. **Task instructions** → Explain that input comes in JSON (`{"Question": "..."}`) and output must also be JSON (`{"Response": "..."}`).  
3. **Few-shot examples** → Provide sample Q&A pairs so the model learns the pattern.  
4. **Actual question** → Append the new math problem to solve.  
5. **Call the model** → Send everything to the AI and get an answer in the required format.  
6. **Return results** → Output includes the model’s response (a number in JSON) and how many tokens were used.

This design makes the solver’s answers **consistent, machine-readable, and reliable** for classroom demonstrations.
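
Steps 1 to 4 can be sketched without calling the API. `build_messages` below is a hypothetical helper with shortened prompt text, and the demonstration is invented for the example. It uses `json.dumps` so that quotes or newlines inside a question are escaped and every message remains valid JSON.

```python
import json

def build_messages(question, qa_demonstrations, fewshot=2):
    """Assemble system prompt, task instructions, few-shot pairs and the new question."""
    messages = [
        {"role": "system", "content": "You are a mathematical reasoning expert."},
        {"role": "user", "content": 'Answer only in JSON: {"Response": "<numerical answer>"}'},
    ]
    # Each demonstration becomes one user turn and one assistant turn
    for demo in qa_demonstrations[:fewshot]:
        messages.append({"role": "user", "content": json.dumps({"Question": demo["Question"]})})
        messages.append({"role": "assistant", "content": json.dumps({"Response": demo["Response"]})})
    # The question to solve always comes last
    messages.append({"role": "user", "content": json.dumps({"Question": question})})
    return messages

demos = [{"Question": "What is 2 + 3?", "Response": "5"}]
question = 'A sign says "3 for 2". How many items do you pay for if you take 6?'
messages = build_messages(question, demos)
print(len(messages))  # 5
print(messages[-1]["content"])
```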


In [None]:
def math_question_solver(question, qa_demonstrations, conf='conf1', fewshot=5, n=1):
    # Current system prompt
    demonstrations = [{"role": "system", "content":
                 '''You are a mathematical reasoning expert whose job is to provide answers to mathematical reasoning questions.'''}]

    # Current initial prompt for describing the task
    demonstrations.append({"role": "user", "content": '''##Goal \nProvide an answer to a given mathematical reasoning question.\n
    ##Input \nYou will be given input in the JSON format as described below\n
    Input format:
    {
        "Question": "<question>"
    }

    ##Output \nYou should only respond in the JSON format as described below\n
    Output format:
    ```json
    {
        "Response": "<numerical answer>"
    }
    ```
    Ensure the response can be parsed by Python json.loads
    '''})

    # Include `fewshot` demonstrations so that the model understands the input and output structure
    if fewshot:
        # always the same demonstrations
        for demo in qa_demonstrations[0:fewshot]:
            # json.dumps escapes quotes and newlines, so each message stays valid JSON
            demonstrations.append({"role": "user", "content": json.dumps({"Question": demo['Question']})})
            demonstrations.append({"role": "assistant", "content": json.dumps({"Response": demo['Response']})})
    
    messages = []
    messages.extend(demonstrations)
    # json.dumps keeps the message valid JSON even if the question contains quotes
    messages.append({"role": "user", "content": json.dumps({"Question": question})})

    response = client.chat.completions.create(model=model_name, **payload[conf], messages=messages, n=n)

    # Extract the text content of each returned choice
    response_text = [choice.message.content for choice in response.choices]
    return response_text, response.usage.total_tokens

**`TODO:`** Now that you’ve seen the simple math solver, build a **reasoning solver**.  
- First, define the output format you want (just like the math solver did).  
- Then, instead of simple Q&A examples, provide **few-shot examples that include reasoning demonstrations** along with the final answer.  
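
One possible way to turn a reasoning demonstration into a pair of chat messages is sketched below. `format_reasoning_demo` is a hypothetical helper and the demonstration is hand-written; the keys follow the `"Thoughts"` / `"Answer"` structure built earlier in this notebook.

```python
import json

def format_reasoning_demo(demo):
    """Turn one reasoning demonstration into a (user, assistant) message pair."""
    user = {"role": "user", "content": json.dumps({"Question": demo["Question"]})}
    assistant = {
        "role": "assistant",
        # The assistant turn mirrors the output format the model should imitate
        "content": json.dumps({
            "Thoughts": demo["Response"]["Thoughts"],
            "Answer": demo["Response"]["Answer"],
        }),
    }
    return user, assistant

# Hand-written demonstration in the structure of `reason_demonstrations`
demo = {
    "Question": "What is 3 * 4 - 2?",
    "Response": {"Thoughts": ["3 * 4 = 12.", "12 - 2 = 10."], "Answer": "10"},
}
user, assistant = format_reasoning_demo(demo)
parsed = json.loads(assistant["content"])
print(parsed["Thoughts"], parsed["Answer"])
```

Because the assistant message is produced with `json.dumps`, it can always be read back with `json.loads`, which is the same check later applied to the model's own output.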


In [None]:
def math_question_solver_with_reasoning(question, qa_demonstrations, conf='conf1', fewshot=5, n=1):
    # Current system prompt
    demonstrations = [{"role": "system", "content":
                 '''You are a mathematical reasoning expert whose job is to provide answers to mathematical reasoning questions.'''}]

    # Current initial prompt for describing the task
    demonstrations.append({"role": "user", "content": '''##Goal \nGiven a mathematical reasoning question, you should first think about it primarily by (1) breaking down the problem into steps, (2) reasoning about individual steps, and then (3) combining individual thoughts to come up with a final answer to the question.\n
    ##Input \nYou will be given input in the JSON format as described below\n
    Input format:
    {
        "Question": "<question>"
    }

    ##Output \nYou should only respond in the JSON format as described below\n
    Output format:
    ```json
    {
        "Thoughts": "<short bulleted list explaining the step-by-step strategy used to obtain the final answer>",
        "Answer": "<numerical answer>"
    }
    ```
    Ensure the response can be parsed by Python json.loads
    '''})

    # TODO: Include few-shot reasoning demonstrations so that the model understands the input and output structure.
    if fewshot:
        # always the same demonstrations
        for demo in qa_demonstrations[0:fewshot]:
            # json.dumps escapes quotes and serialises the list of thoughts as valid JSON
            demonstrations.append({"role": "user", "content": json.dumps({"Question": demo['Question']})})
            demonstrations.append({"role": "assistant", "content": json.dumps({"Thoughts": demo['Response']['Thoughts'], "Answer": demo['Response']['Answer']})})
    
    # TODO: Create the messages object that includes all demonstrations as well as the current question
    messages = []
    messages.extend(demonstrations)
    messages.append({"role": "user", "content": json.dumps({"Question": question})})

    # TODO: Generate the response from the model
    response = client.chat.completions.create(model= model_name, **payload[conf], messages=messages, n=n)

    # TODO: Parse the responses to extract the text content
    response_text = [choice.message.content for choice in response.choices]
    return response_text, response.usage.total_tokens

In the code below we provide two helper functions to facilitate your process:

- `extract_json_from_response(response)`<br>
  This function scans a text response and extracts candidate JSON objects embedded within it.  
  It uses a recursive regex pattern to detect balanced curly braces `{...}` and returns a list of the matched strings. The strings are not validated, so parse them with `json.loads` afterwards.

- `is_number(s)`<br>
  This function checks if a given string can be interpreted as a number (e.g., integer, float, complex).  
  It attempts to safely parse the string using `ast.literal_eval` and returns `True` if successful, otherwise `False`.
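
For comparison, here is an alternative sketch that uses only the standard library instead of the recursive regex. It relies on `json.JSONDecoder.raw_decode`, so it returns parsed objects rather than strings and skips brace-delimited text that is not valid JSON. `find_json_objects` is a hypothetical name and not one of the helpers provided in this notebook.

```python
import json

def find_json_objects(text):
    """Return every JSON object that can be decoded from `text`, in order."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start`
            # and returns it together with the index where it ended
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here, try the next brace
            continue
        objects.append(obj)
        pos = end
    return objects

reply = 'Sure! {"Thoughts": ["3 * 4 = 12"], "Answer": "12"} Hope that helps {not json'
print(find_json_objects(reply))
```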


In [None]:
def extract_json_from_response(response):
    '''
    Return every balanced `{...}` block found in `response` as a list of strings.

    Example: for the text 'Result: {"Answer": "18"} Done.' the function
    returns ['{"Answer": "18"}'].
    '''
    # (?R) recurses into the whole pattern, so nested braces are matched as one block
    pattern = r"""\{(?:[^{}]|(?R))*\}"""
    matches = regex.finditer(pattern, response, regex.VERBOSE)

    return [match.group() for match in matches]

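# Function to check if a value is a number
def is_number(s):
    try:
        # Parse strings such as "18" or "3.5"; JSON numbers arrive already parsed
        value = ast.literal_eval(s) if isinstance(s, str) else s
    except (ValueError, SyntaxError):
        return False
    # Reject other literals: "1,000" parses as a tuple and "'abc'" as a string
    return isinstance(value, (int, float)) and not isinstance(value, bool)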

### 🔄 The `run_solver` Function

This function is a **wrapper** around the math solver that makes it more reliable.  
- It retries up to 3 times if something goes wrong.  
- It extracts the model’s JSON output and checks if it contains a valid number.  
- If the output is missing or invalid, it returns `"NA"`.  
- It also tracks the total number of tokens used.  

Final return values:  
1. **`response`** → raw model reply  
2. **`answer`** → cleaned numeric answer (or `"NA"`)  
3. **`total_tokens_used`** → tokens spent across tries
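
The retry-and-validate pattern can be tried out in isolation with a stand-in for the model. In the sketch below, `retry_until_valid` and `fake_solver` are hypothetical names, and the 10-second sleep of the real wrapper is left out so the example runs instantly.

```python
def retry_until_valid(call, is_valid, max_tries=3):
    """Call `call()` up to `max_tries` times; return (result, tries_used)."""
    result = "NA"
    for attempt in range(1, max_tries + 1):
        try:
            result = call()
        except Exception as exc:
            # A failed attempt still counts as a try, so the loop always ends
            print(f"Attempt {attempt} failed: {exc}")
            result = "NA"
            continue
        if is_valid(result):
            return result, attempt
        result = "NA"
    return result, max_tries

# Stand-in for the model: raises once, returns junk once, then returns a number
replies = iter([RuntimeError("rate limited"), "about twelve", "12"])

def fake_solver():
    reply = next(replies)
    if isinstance(reply, Exception):
        raise reply
    return reply

answer, tries = retry_until_valid(fake_solver, lambda s: s.isdigit())
print(answer, tries)  # 12 3
```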


In [None]:
def run_solver(question, qa_demonstrations, conf, fewshot=5):
    total_tokens_used = 0
    num_tries = 0
    # Defaults, so the return statement works even if every attempt fails
    response, answer = None, "NA"

    while num_tries < 3:
        # Count the attempt up front so that repeated exceptions cannot loop forever
        num_tries += 1
        try:
            response, token_used = math_question_solver(question, qa_demonstrations, conf, fewshot)
            total_tokens_used += token_used
            response_json_obj = extract_json_from_response(response[0])

            if response_json_obj:
                response_obj = json.loads(response_json_obj[0])
                if "Response" in response_obj:
                    # Normalise to a string without thousands separators, e.g. "1,000" -> "1000"
                    answer = str(response_obj["Response"]).replace(",", "")
                    if not is_number(answer):
                        answer = "NA"
                else:
                    answer = "NA"
                    print(response_obj)
            else:
                answer = "NA"
                print(response_json_obj)

            if answer != "NA":
                break
        except Exception as e:
            print(f"Encountered an exception {e}, sleeping")
            print(num_tries)
            sleep(10)
            answer = "NA"

    return response, answer, total_tokens_used



**`TODO:`** Finish the `run_solver_with_reason` function.  
- Use the same retry and validation logic as in `run_solver`.  
- But this time, extract both `"Thoughts"` and `"Answer"` from the model’s JSON output.  
- Return the raw response, the thoughts, the final answer, and the total tokens used.

In [None]:
def run_solver_with_reason(question, qa_demonstrations, conf, fewshot=5):
    total_tokens_used = 0
    num_tries = 0
    # Defaults, so the return statement works even if every attempt fails
    response, thoughts, answer = None, 'NA', 'NA'

    while num_tries < 3:
        try:
            num_tries += 1
            response, tokens_used = math_question_solver_with_reasoning(question, qa_demonstrations, conf, fewshot)
            total_tokens_used += tokens_used
            response_json_obj = extract_json_from_response(response[0])

            if response_json_obj:
                response_obj = json.loads(response_json_obj[0])
                # .get avoids a KeyError (and the sleep below) when a key is missing
                thoughts = response_obj.get('Thoughts', 'NA')
                # Normalise to a string without thousands separators, e.g. "1,000" -> "1000"
                answer = str(response_obj.get('Answer', 'NA')).replace(',', '')

                if not is_number(answer):
                    answer = 'NA'
            else:
                thoughts, answer = 'NA', 'NA'
                print(response_json_obj)

            if answer != 'NA':
                break

        except Exception as e:
            print(f"Encountered an exception {e}, sleeping")
            print(num_tries)
            sleep(10)
            thoughts, answer = 'NA', 'NA'

    return response, thoughts, answer, total_tokens_used

### 📊 Few-Shot Experiment

This loop tests how the solver performs with different numbers of few-shot demonstrations (`0, 1, 5, 10, 20`).  
- For each setting, it runs on two test questions.  
- It compares the model’s answer to the human gold-standard answer.  
- Results are marked as **Successful** (correct) or **Unsuccessful** (wrong/NA).  
- Each run is tested across three configurations (`conf1`, `conf2`, `conf3`).  
- All outcomes (question, reasoning, answers, correctness, tokens, etc.) are saved to a CSV file for later analysis.
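
The comparison step can be sketched on its own, without any model calls. `is_correct` is a hypothetical helper, and the answer pairs below are invented for illustration rather than taken from an actual run.

```python
def is_correct(model_answer, gold_answer):
    """Return 1 if both answers denote the same number, else 0 ("NA" always scores 0)."""
    if model_answer == "NA":
        return 0
    try:
        # Strip thousands separators before converting, as done for the human answer
        predicted = float(str(model_answer).replace(",", ""))
        gold = float(str(gold_answer).replace(",", ""))
    except ValueError:
        return 0
    return int(predicted == gold)

# Invented (model answer, gold answer) pairs
pairs = [("18", "18"), ("1,000", "1000.0"), ("NA", "7"), ("6", "7")]
scores = [is_correct(model, gold) for model, gold in pairs]
print(scores, sum(scores) / len(scores))  # [1, 1, 0, 0] 0.5
```

Averaging these 0/1 scores is the same accuracy that the later analysis cell computes from the `is_correct` column of each CSV file.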


In [None]:
os.makedirs("generated_answers_without_reasoning_demonstrations", exist_ok=True)

for fewshot in [0, 1, 5, 10, 20]:
    print(f"Few-shot demonstrations: {fewshot}")
    generated_answers = []

    for index, row in enumerate(test_questions[:2]):
        human_answer = float(row['Response']['Answer'].replace(",",""))
        
        print(f"Q{index}: {row['Question']}")
        print(f"Human answer: {human_answer}")
        
        for conf in ['conf1', 'conf2', 'conf3']:
            response, answer, num_tokens = run_solver(row['Question'], answer_demonstrations, conf, fewshot)

            # Answer not found
            if answer == "NA":
                print(answer, human_answer, 'Unsuccessful', num_tokens, payload[conf]['temperature'], fewshot)
                generated_answers.append([row['Question'], row['Response']['Thoughts'], human_answer, response, answer, 0, num_tokens, payload[conf]['temperature'], fewshot])

            # Correct answer
            elif float(answer) == human_answer:
                print(float(answer), human_answer, 'Successful', num_tokens, payload[conf]['temperature'], fewshot)
                generated_answers.append([row['Question'], row['Response']['Thoughts'], human_answer, response, float(answer), 1, num_tokens, payload[conf]['temperature'], fewshot])

            # Incorrect answer
            else:
                print(float(answer), human_answer, 'Unsuccessful', num_tokens, payload[conf]['temperature'], fewshot)
                generated_answers.append([row['Question'], row['Response']['Thoughts'], human_answer, response, float(answer), 0, num_tokens, payload[conf]['temperature'], fewshot])

    generated_answers_df = pd.DataFrame(generated_answers, columns=['Question', 'AnswerReasoning_Human', 'Answer_Human', 'Response_LLM', 'Answer_LLM', 'is_correct', 'num_tokens', 'temperature', 'num_demonstrations'])
    generated_answers_df.to_csv(f'generated_answers_without_reasoning_demonstrations/{fewshot}.csv', index=False)
        

In [None]:
results = {}

for fewshot in [0, 1, 5, 10, 20]:
    df = pd.read_csv(f"generated_answers_without_reasoning_demonstrations/{fewshot}.csv")
    accuracy = df['is_correct'].mean()
    avg_tokens = df['num_tokens'].mean()
    results[fewshot] = {"accuracy": accuracy, "avg_tokens": avg_tokens}

results_df = pd.DataFrame(results).T
print(results_df)


**`TODO:`** Write the code to run the same few-shot experiment, but this time using your **reasoning solver** (`run_solver_with_reason`).  
- Loop over different numbers of few-shot demonstrations (`0, 1, 5, 10, 20`).  
- For each test question and configuration (`conf1`, `conf2`, `conf3`), call the reasoning solver.  
- Collect not only the final answer but also the model’s reasoning steps.  
- Compare with the human gold-standard answer and mark results as **Successful** or **Unsuccessful**.  
- Save all results (question, human reasoning, human answer, model response, model reasoning, model answer, correctness, tokens, etc.) into a CSV file.


In [None]:
os.makedirs("generated_answers_with_reasoning_demonstrations", exist_ok=True)

for fewshot in [0, 1, 5, 10, 20]:
    print(f'Few-shot demonstrations={fewshot}')
    generated_answers = []

    for index, row in enumerate(test_questions[:2]):
        human_answer = float(row['Response']['Answer'].replace(",",""))
        
        print(f"Q{index}: {row['Question']}")
        print(f"Human answer: {human_answer}")

        for conf in ['conf1', 'conf2', 'conf3']:
            response, thoughts, answer, num_tokens = run_solver_with_reason(row['Question'], reason_demonstrations, conf, fewshot)

            # Answer not found
            if answer == "NA":
                print(answer, human_answer, 'Unsuccessful', num_tokens, payload[conf]['temperature'], fewshot)
                generated_answers.append([row['Question'], row['Response']['Thoughts'], human_answer, response, thoughts, answer, 0, num_tokens, payload[conf]['temperature'], fewshot])

            # Correct answer
            elif float(answer) == human_answer:
                print(float(answer), human_answer, 'Successful', num_tokens, payload[conf]['temperature'], fewshot)
                generated_answers.append([row['Question'], row['Response']['Thoughts'], human_answer, response, thoughts, float(answer), 1, num_tokens, payload[conf]['temperature'], fewshot])

            # Incorrect answer
            else:
                print(float(answer), human_answer, 'Unsuccessful', num_tokens, payload[conf]['temperature'], fewshot)
                generated_answers.append([row['Question'], row['Response']['Thoughts'], human_answer, response, thoughts, float(answer), 0, num_tokens, payload[conf]['temperature'], fewshot])

    generated_answers_df = pd.DataFrame(generated_answers, columns=['Question', 'AnswerReasoning_Human', 'Answer_Human', 'Response_LLM', 'AnswerReasoning_LLM', 'Answer_LLM', 'is_correct', 'num_tokens', 'temperature', 'num_demonstrations'])
    generated_answers_df.to_csv(f'generated_answers_with_reasoning_demonstrations/{fewshot}.csv', index=False)

In [None]:
results = {}

for fewshot in [0, 1, 5, 10, 20]:
    df = pd.read_csv(f"generated_answers_with_reasoning_demonstrations/{fewshot}.csv")
    accuracy = df['is_correct'].mean()
    avg_tokens = df['num_tokens'].mean()
    results[fewshot] = {"accuracy": accuracy, "avg_tokens": avg_tokens}

results_df = pd.DataFrame(results).T
print(results_df)


**`Practise (Optional):`** Cool, now that you’ve experimented with these solvers using different types of few-shot demonstrations and reasoning strategies, you can take things further on your own!  

- **Tweak the prompts** → try rephrasing the instructions or adding different examples.  
- **Swap models** → see how different models handle the same task.  
- **Experiment freely** → change generation parameters, add constraints, or design creative demonstrations.  

The goal is to explore how these choices impact performance and discover what works best for your problem!