<a href="https://colab.research.google.com/github/antahiap/debugging-dl-models/blob/master/notebooks/3_debugging_with_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Please, make a copy of the notebook.
!pip install datasets
!pip install langchain[llms]
!pip install openai
!pip install python-dotenv


In [None]:
import dotenv
import gdown
import json
import langchain

from collections import defaultdict
from datasets import load_dataset

from langchain.callbacks import get_openai_callback
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

from langchain.prompts.few_shot import FewShotChatMessagePromptTemplate
from langchain.prompts import PromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate

#### Set up an OpenAI API key.

In [None]:
def download_dotenv_from_drive(file_id, destination=".env"):
    url = f'https://drive.google.com/uc?id={file_id}'
    gdown.download(url, destination, quiet=True)

def setup_openai_api_key(from_dotenv=False, key=None):
  if from_dotenv:
    # Load a .env file from google drive and load the API key from the file.
    file_id = '1Ak2Yk1EA5GgioenYBNT8WvlD1QeHesnO'
    download_dotenv_from_drive(file_id)
    dotenv.load_dotenv(".env")
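  else:
    assert key is not None, "Please, provide an OpenAI API key"
    # A shell `export` would only affect a child process, so set the variable in this process instead.
    import os
    os.environ["OPENAI_API_KEY"] = key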

setup_openai_api_key(from_dotenv=True)
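
A note on passing the key by hand: a shell command started from a notebook cell runs in a child process, so a variable exported there disappears when that process exits. The sketch below illustrates this with the standard library only. `DEMO_API_KEY` is a made-up variable name, and the example assumes a POSIX shell is available.

```python
import os
import subprocess

# DEMO_API_KEY is a hypothetical variable name used only for this illustration.
os.environ.pop("DEMO_API_KEY", None)

# Exporting inside a child shell does not change the environment of this Python process.
subprocess.run("export DEMO_API_KEY=secret", shell=True, check=True)
visible_after_export = "DEMO_API_KEY" in os.environ

# Writing to os.environ changes the current process, which is where libraries
# that read keys from the environment will look.
os.environ["DEMO_API_KEY"] = "secret"
visible_after_environ = "DEMO_API_KEY" in os.environ

print(visible_after_export, visible_after_environ)
```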

# Rubber duck debugging with LLMs

Useful resources:

  - [Teaching Large Language Models to Self-Debug](https://arxiv.org/abs/2304.05128)
  - [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732)
  - [Langchain Docs](https://python.langchain.com/docs/get_started/introduction.html)
  - [Example prompts in python format](https://colab.research.google.com/drive/12pDif3-hFrs1Rjujd_X8kmon-5z61GAd#scrollTo=Vkt2XPgEUlVV)

<div align="center">
    <img src="https://drive.google.com/uc?export=view&id=19kuukNxxfOB2z_KrC6dBqzqF73RYZ4F6" alt="It's quack-tually a simple bug!" width="500" height="500"/>
    <br>
    <i>It's quack-tually a very simple bug!</i>
</div>

Rubber Duck Debugging is a method of debugging code where a programmer explains their code, line by line, to a rubber duck (or any inanimate object) with the aim of finding errors or logical issues.

### **Main idea: an LLM would explain the code to itself line by line and debug in an iterative manner.**

<div align="center">
    <img src="https://drive.google.com/uc?export=view&id=1iniWVZu7RoxFT0RvKMruJvOnPDxaLy8u" alt="Self-debugging schema" width="550" height="300"/>
    <br>
    <i>Self-debugging schema.</i>
</div>
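
The loop in the schema can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the notebook's implementation: `self_debug`, `fake_model` and `passes_test` are hypothetical names, and `fake_model` is a stand-in that returns canned code instead of calling an LLM.

```python
from typing import Callable, Optional

def self_debug(generate: Callable[[str, Optional[str]], str],
               passes: Callable[[str], bool],
               task: str,
               max_iters: int = 3) -> Optional[str]:
    """Ask for code, test it, and feed failures back until it passes or the budget runs out."""
    previous = None
    for _ in range(max_iters):
        code = generate(task, previous)
        if passes(code):
            return code
        # Hand the failed attempt back so the next call can explain and repair it.
        previous = code + "\n# Feedback: the code above is wrong. Please fix it."
    return None

# Hypothetical stand-in for an LLM: wrong on the first call, right once it sees feedback.
def fake_model(task, previous):
    if previous is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"

def passes_test(code):
    namespace = {}
    exec(code, namespace)
    return namespace["add"](2, 3) == 5

result = self_debug(fake_model, passes_test, "Write a function that adds two numbers.")
print(result)
```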


#### Download the few-shot example prompts.
- Examples with unit tests.
- Examples with explanations + unit tests.
- Examples only with explanations.

In [None]:
def download_few_shot_prompts(file_id, destination):
    url = f'https://drive.google.com/uc?id={file_id}'
    gdown.download(url, destination, quiet=True)

examples_ut_id = "1_vX7ETmxJZr6hzyl2nnexPPJjHtrkRan"
examples_ut_exp_id = "15MbPBlK42dOcRE2Qs6p6QLIcxzu0wMW_"
examples_exp_id = "1qCG3ry5VXoFAgMjIammUdYt0D2cVKsSr"

download_few_shot_prompts(examples_ut_id, "examples_ut.json")
download_few_shot_prompts(examples_ut_exp_id, "examples_ut_exp.json")
download_few_shot_prompts(examples_exp_id, "examples_exp.json")


with open("examples_ut.json", "r") as file:
    examples_ut = json.load(file)

with open("examples_ut_exp.json", "r") as file:
    examples_ut_exp = json.load(file)

with open("examples_exp.json", "r") as file:
    examples_exp = json.load(file)

In [None]:
# Explore the example prompts
print(f"The number of examples with unit tests is: {len(examples_ut)}")
print(f"The number of examples with unit tests and explanations is: {len(examples_ut_exp)}")
print(f"The number of examples only with explanations: {len(examples_exp)} \n")
print("One of the prompt examples: \n")
examples_ut[0]

The number of examples with unit tests is: 8
The number of examples with unit tests and explanations is: 5
The number of examples only with explanations: 5 

One of the prompt examples: 



{'assertion': '\n      assert count_ways(2) == 3\n      ',
 'task': '\n      Write a python function to find the number of ways to fill it with 2 x 1 dominoes for the given 3 x n board.\n      ',
 'original_code': '\n      None\n      ',
 'answer': '\n      ```python\n      def count_ways(n):\n          if n == 0:\n              return 1\n          if n == 1:\n              return 1\n          if n == 2:\n              return 3\n          return count_ways(n-1) + count_ways(n-2)\n      ```\n      Feedback: With the above function, count_ways(2) == 3. The assertion is "assert count_ways(2) == 3". \n      So the code passes the assertion. The code above is wrong. Please fix it.\n\n      ```python\n      def count_ways(n):\n          A = [0] * (n + 1)\n          B = [0] * (n + 1)\n          A[0] = 1\n          A[1] = 0\n          B[0] = 0\n          B[1] = 1\n          for i in range(2, n+1):\n              A[i] = A[i - 2] + 2 * B[i - 1]\n              B[i] = A[i - 1] + B[i - 2]\n        

## Natural language to Python translation with LLMs

First of all, let's download the MBPP (**M**ostly **B**asic **P**ython **P**roblems) dataset from HuggingFace and have a look at it. It consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers. Each problem consists of a task description, a code solution and 3 automated test cases.

In [None]:
# Load dataset from HuggingFace datasets
mbpp = load_dataset("mbpp", "sanitized")
mbpp

Let's look at an example of a data point:

In [None]:
mbpp["test"][0]

In [None]:
print(f"Prompt: {mbpp['test'][0]['prompt']} \n")
print(f"Assertions: {mbpp['test'][0]['test_list']} \n")
print(f"Code: \n{mbpp['test'][0]['code']}")
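
To make the record layout concrete, here is a small sketch that checks a reference solution against its tests. The record is hand-written for illustration and is not taken from MBPP; only the field names `prompt`, `code` and `test_list` follow the cells above, and `solution_passes` is a hypothetical helper.

```python
# A hand-written record in the same shape as an MBPP data point (illustrative, not from the dataset).
record = {
    "prompt": "Write a function to square a number.",
    "code": "def square(x):\n    return x * x",
    "test_list": [
        "assert square(3) == 9",
        "assert square(-2) == 4",
    ],
}

def solution_passes(code, tests):
    """Run the solution and then every assertion in one fresh namespace."""
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        # Covers syntax errors, runtime errors and failed assertions alike.
        return False
    return True

all_pass = solution_passes(record["code"], record["test_list"])
print(all_pass)
```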

### Self-debugging with LangChain

LangChain provides:

- Easy reusable chat prompt templates.
- Few-shot chat templates.
- Useful interface to chain the templates and an input prompt for a smooth run of an LLM.
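
To show what these templates boil down to, here is a plain-Python sketch that assembles a few-shot chat prompt as a list of `(role, text)` pairs. It does not use LangChain and is not a description of how LangChain implements it; `HUMAN_TEMPLATE`, `build_messages` and the example data are made up for illustration.

```python
HUMAN_TEMPLATE = (
    "assertion:\n{assertion}\n\n"
    "task:\n{task}\n\n"
    "answer from the previous iteration:\n{original_code}"
)

def build_messages(system_text, examples, query):
    """Return [(role, text), ...]: system instructions, worked examples, then the new question."""
    messages = [("system", system_text)]
    for example in examples:
        # Each example becomes one human turn and the AI turn that answers it.
        # str.format ignores the extra "answer" key.
        messages.append(("human", HUMAN_TEMPLATE.format(**example)))
        messages.append(("ai", example["answer"]))
    messages.append(("human", HUMAN_TEMPLATE.format(**query)))
    return messages

examples = [{
    "assertion": "assert double(2) == 4",
    "task": "Double a number.",
    "original_code": "None",
    "answer": "def double(x):\n    return 2 * x",
}]
query = {"assertion": "assert triple(2) == 6", "task": "Triple a number.", "original_code": "None"}

messages = build_messages("You generate Python functions.", examples, query)
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```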

We will use a chat model (GPT-3.5), because the model used in the paper is weaker, deprecated and more expensive than GPT-3.5.

In [None]:
# Create a template for a human question. The input variables are given in curly brackets.
human_q_template = """
assertion:
These are the assertions for your function:
{assertion}

task:
{task}

AI:
answer from the previous iteration:
{original_code}
"""
human_q_prompt = HumanMessagePromptTemplate.from_template(human_q_template)

In [None]:
# Create a template for an AI answer.
ai_a_template = """
{answer}
"""
ai_a_prompt = AIMessagePromptTemplate.from_template(ai_a_template)

In [None]:
# Combine the human prompt template and the AI answer template into a chat template.
example_ut_prompt = ChatPromptTemplate.from_messages(
    [
        human_q_prompt,
        ai_a_prompt
    ]
)

In [None]:
# Let's check what we've got with the examples loaded from the json file.
print(example_ut_prompt.format(**examples_ut[1]))

In [None]:
# Create a few-shot examples prompt template for a chat.
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_ut_prompt, # converts each example into 1 or more messages through its format_messages method.
    examples=examples_ut # examples to be included in the final prompt.
)
print(few_shot_prompt.format()[0:2000])

In [None]:
# Add system templates to give initial instructions to the model.
system_template = """
You are a helpful assistant who generates python functions and feedback based on the provided assertion tests.
A user will pass in an assertion test and a description of what a function needs to do.
You should generate a function and a feedback message following the format of the provided examples above.
ONLY return a function and a feedback message, and nothing else.
If the "answer from previous iteration" is "None", it means the current iteration is 0.
Do NOT ask to provide the answer from iteration 0 or complain that it is missing.
"""

In [None]:
# Construct the final prompt.
final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_template),
        few_shot_prompt,
        human_q_prompt,
        ("system", "You are executing iteration {iteration}.")
    ]
)

In [None]:
# Print the last 1017 characters of an example of the final prompt.
print(f"[my comment that the chain wouldn't see]: "
      f"Here comes all the few-shot examples given above \n\n... \n"
      )
print(final_prompt.format(
    assertion=mbpp['test'][0]['test_list'][0],
    task=mbpp['test'][0]['prompt'],
    original_code="None",
    iteration="0"
)[-1017:])

### Exercise 3.1

The task is to just play around with the chain:
  - Modify this code to run the chain:
  ```python
  # Create a chain and feed it the prompt template
  chain = LLMChain(
      llm=ChatOpenAI(model="gpt-3.5-turbo", max_tokens=1000, temperature=0.7),
      prompt=final_prompt
  )

  answer = chain.run(
      assertion="...",
      task="...",
      original_code="None",
      iteration="0"
  )
  print(answer)
  ```
  - Use different examples from the `mbpp["test"]` dataset and run the `chain` with the following input variables:
    - `assertion=mbpp["test"][idx]["test_list"][0]`, where `idx` is an example index chosen by you. Some more difficult case indices: `[7, 22, 34, 77, 211]`. Can you find more? 😸
    - `task=mbpp["test"][idx]["prompt"]`
    - `original_code="None"`
    - `iteration="0"`
    - Hint: it's helpful to print a case before running it.
  - Run the assertions manually: copy the code from the `answer` and execute it with any test from `mbpp["test"][idx]["test_list"]`.
  - If an assertion fails, you can run the `chain` again, providing the `answer` as the `original_code` argument to the `chain`:
    - `original_code=answer`
  - Use a bigger model and try different temperature values:
    - `"gpt-3.5-turbo-16k"`, which allows a `max_tokens` bigger than 1000, e.g. `max_tokens=4000`
    - `temperature=0.2`, `temperature=0.4`, `temperature=0.6`, `temperature=1`
    - Can `"gpt-4"` solve the cases at indices `[7, 22, 34, 77, 211]` of `mbpp["test"]`?
  
What are your observations?
  

### Self-debugging with unit tests
Let's create an iterative process that generates code, runs the assertions and, if any of them fails, passes the previous code back to the chain as an input argument and runs it again. First, some helper functions.


In [None]:
import re
import textwrap
from typing import Optional

def dedent_func_block(code: str, debug: bool=False) -> Optional[str]:
    """Dedent a code block from its last 'def' statement onwards.
    code: A string with some function definition
    debug: Run in a debug mode.
    Returns the code, or None if it contains no function definition.
    """
    lines = code.split("\n")
    pattern = r'^\s*def\s+[a-zA-Z_][a-zA-Z0-9_]*\s*\('

    # Remember the last line that starts a function definition;
    # everything before it (imports, helper functions) is kept unchanged.
    def_line = None
    for i, line in enumerate(lines):
        if re.match(pattern, line):
            def_line = i

    # If no function definition is found, return None
    if def_line is None:
        if debug:
            print("[dedent_func_block]: No function block was found")
        return None

    code = textwrap.dedent("\n".join(lines[def_line:]))
    preceding_lines = "\n".join(lines[:def_line])
    if debug:
        print(code)
    return preceding_lines + "\n" + code

def is_answer_correct(answer: str, assertion: str, verbose=True, debug=False) -> bool:
    """Check if an LLM's answer is correct by running the extracted code with an assertion test.
    answer: An answer from an LLM.
    assertion: An assertion test string.
    verbose: Print the run info.
    debug: Run in a debug mode.
    """
    # Extract Python code from the provided string using regex
    # We look for code starting and ending with triple backticks, and having "python" after the opening backticks
    code_matches = re.findall(r'```python\s+(.*?)```', answer, re.DOTALL)
    if not code_matches:
        if verbose:
          print(f"[is_answer_correct]: The answer is: {answer} \n")
          print("[is_answer_correct]: No valid Python code found in the provided string.")
        return False
    # The answer may contain several code blocks (earlier attempts followed by a fix);
    # the last one is the model's final version.
    code = dedent_func_block(code_matches[-1], debug=debug)

    # Run the code and the assertion in a fresh namespace, so that functions
    # defined by earlier answers cannot make a later assertion pass by accident.
    namespace = {}
    try:
        # Execute the code
        exec(code, namespace)
    except Exception as e:
        # If there's any error, return False indicating the code is incorrect
        if verbose:
          print(f"[is_answer_correct]: Error: {e}")
          print("[is_answer_correct]: The code is not executable. \n")
        return False

    # If the code runs, try executing the assertion
    try:
      exec(assertion, namespace)
    except Exception as e:
      if verbose:
        print(f"[is_answer_correct]: Assertion error: {e} \n")
      return False
    return True
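
As a quick illustration of the extraction step, the sketch below applies the same regular expression to a made-up answer that contains two code blocks and keeps the last one. The answer text is invented, and the fence marker is built with string multiplication only so that this example can sit inside a code fence itself.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly for readability

# A made-up model answer: a first attempt, feedback, then a corrected version.
answer = (
    f"{FENCE}python\ndef add(a, b):\n    return a - b\n{FENCE}\n"
    "Feedback: add(2, 3) returns -1, so the code does not pass the assertion. Please fix it.\n"
    f"{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}\n"
    "Feedback: add(2, 3) == 5, so the code passes the assertion.\n"
)

# Same pattern as in is_answer_correct: lazily capture everything between the fences.
pattern = FENCE + r"python\s+(.*?)" + FENCE
blocks = re.findall(pattern, answer, re.DOTALL)
final_code = blocks[-1]

# Run the final block and one assertion in a throwaway namespace.
namespace = {}
exec(final_code, namespace)
exec("assert add(2, 3) == 5", namespace)
print(len(blocks))
print(final_code)
```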

We will sample 20 cases from the test set and run several iterations of debugging in a loop. The authors say that usually 3 iterations are enough.

Let's create a function that iterates over a dataset in the MBPP format, generates python functions and performs self-debugging.


In [None]:
def run_self_debug(chain, dataset, n_iter, verbose=True, debug=False, use_assertions=True):
  # Run with an openai callback to track the number of tokens
  with get_openai_callback() as cb:
      correct_tasks = defaultdict()
      # Iterate over the cases in the dataset
      for num, task_assertions in enumerate(zip(dataset['prompt'], dataset['test_list'])):
          task, assertions = task_assertions
          original_code = "None"
          correct_tasks[task] = False
          assertion = assertions[0]
          if use_assertions:
            print(f"\n[run]: Running task number {num}. \nThe task is: \n{task} \n"
                  f"The assertion for the task is: \n{assertion}\n"
                  )
          else:
            print(f"\n[run]: Running task number {num}. \nThe task is: \n{task} \n")

          # Do n_iter of code generation or break when all assertions are passed
          for iteration in range(n_iter):

            # Generate code + feedback
            answer = chain.run(
                assertion=assertions[0],
                task=task,
                original_code=original_code,
                iteration=str(iteration)
                )
            print(f"\n[run]: Executing iteration {iteration}.\n \n {answer} \n")

            # Run every assertion (a list, not a generator, so that each failure is reported)
            if all([is_answer_correct(answer, test_assertion, verbose, debug) for test_assertion in assertions]):
              correct_tasks[task] = True
              print(f"\n[run]: The task number {num} was successfully accomplished. All assertions were passed.\n")
              break
            # Take the answer and use it as the previous answer in the prompt if any assertion fails
            else:
              system_feedback_message = "Actually, the last sentences are false. The code does not pass the assertion. The code above is wrong. Please fix it."
              if not use_assertions:
                system_feedback_message = "Actually, the last sentences are false. The code above is wrong. Please fix it."
              original_code = f"{answer}. \n{system_feedback_message}"
              if iteration < n_iter - 1:
                print(f"\n[run]: Previous answer: \n{original_code} \n")
              else:
                print(f"\n[run]: The task number {num} failed.")

  # Print the results and the cases that failed
  total_tokens = cb.total_tokens
  print(
      f"These are the tasks that failed:\n",
      "\n".join([task for task, result in correct_tasks.items() if result == False])
      )
  print(f"\nTotal tokens spent for the request: {total_tokens}")

  acc = sum(correct_tasks.values()) / len(correct_tasks)
  print(f"Accuracy is {acc}.")
  return acc, correct_tasks


Let's sample 20 examples from the test set and remove duplicates.

In [None]:
from datasets import Dataset
import numpy as np

def replace_duplicates_with_random(lst, low, high):
    while len(lst) != len(set(lst)):  # while there are duplicates in the list
        seen = set()
        for i, item in enumerate(lst):
            if item in seen:
                while True:
                    random_num = int(np.random.randint(low, high))
                    if random_num not in lst:
                        lst[i] = random_num
                        break
            seen.add(item)
    return lst

# Sample indices and create a subset of the test set.
def create_subdataset(dataset: Dataset, given_idx: list, size: int):
  idx = list(np.random.randint(0, len(dataset), size - len(given_idx))) + given_idx
  idx = replace_duplicates_with_random(idx, 0, len(dataset))
  return dataset[idx]

# Create a subsampled dataset
np.random.seed(262342)
# Add some difficult examples to make it harder for an LLM
test_ds = create_subdataset(mbpp["test"], [7, 22, 34, 77, 211], 20)
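
If reproducing the exact indices above is not required, unique indices that always include a fixed set can be drawn more directly. The sketch below uses only the standard library. `sample_indices` is a hypothetical helper, not part of the notebook, and it will not reproduce the sample produced by the NumPy seed above.

```python
import random

def sample_indices(n_total, required, size, seed=0):
    """Return `size` unique indices in [0, n_total) that include every index in `required`."""
    required = list(dict.fromkeys(required))  # drop repeats, keep order
    if len(required) > size or size > n_total:
        raise ValueError("size must be between len(required) and n_total")
    rng = random.Random(seed)
    required_set = set(required)
    # Sample without replacement from everything that is not already required.
    candidates = [i for i in range(n_total) if i not in required_set]
    return rng.sample(candidates, size - len(required)) + required

# n_total=300 is an illustrative size; in the notebook it would be len(mbpp["test"]).
indices = sample_indices(n_total=300, required=[7, 22, 34, 77, 211], size=20)
print(sorted(indices))
```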

All is ready for generation and self-debugging. Let's do it!

In [None]:
# Run the tasks with the self-debugging loop
chain = LLMChain(
  llm=ChatOpenAI(model="gpt-3.5-turbo-16k", max_tokens=4000, temperature=0.7),
  prompt=final_prompt
  )
acc, correct_task = run_self_debug(chain, test_ds, 3, verbose=True, debug=False, use_assertions=True)


[run]: Running task number 0. 
The task is: 
Write a function to find the nth octagonal number. 
The assertion for the task is: 
assert is_octagonal(5) == 65


[run]: Executing iteration 0.
 
 ```python
def is_octagonal(n):
    return n * (3 * n - 2)
```

Feedback: With the above function, is_octagonal(5) returns 35. The assertion is "assert is_octagonal(5) == 65". So the code does not pass the assertion. Please fix it.

```python
def is_octagonal(n):
    return n * (3 * n - 1) // 2
```

Feedback: With the above function, is_octagonal(5) returns 65. The assertion is "assert is_octagonal(5) == 65". So the code passes the assertion. The code above is correct. 

[is_answer_correct]: Assertion error:  

[is_answer_correct]: Assertion error:  

[is_answer_correct]: Assertion error:  


[run]: Previous answer: 
```python
def is_octagonal(n):
    return n * (3 * n - 2)
```

Feedback: With the above function, is_octagonal(5) returns 35. The assertion is "assert is_octagonal(5) == 65". So the 

In [None]:
# Run the tasks with the self-debugging loop
chain = LLMChain(
  llm=ChatOpenAI(model="gpt-4", max_tokens=4000, temperature=0.7),
  prompt=final_prompt
  )
acc, correct_task = run_self_debug(chain, test_ds, 3, verbose=True, debug=False, use_assertions=True)


[run]: Running task number 0. 
The task is: 
Write a function to find the nth octagonal number. 
The assertion for the task is: 
assert is_octagonal(5) == 65


[run]: Executing iteration 0.
 
 
      ```python
      def is_octagonal(n):
          return n * (3*n - 2)
      ```

      Feedback: With the above function, is_octagonal(5) == 65. The assertion is "assert is_octagonal(5) == 65". 
      So the code passes the assertion. The code above is correct.
 


[run]: The task number 0 was succesfully accomplished. All assertions were passed.


[run]: Running task number 1. 
The task is: 
Write a python function to check whether the given array is monotonic or not. 
The assertion for the task is: 
assert is_Monotonic([6, 5, 4, 4]) == True


[run]: Executing iteration 0.
 
 

      ```python
      def is_Monotonic(A):
          return (all(A[i] <= A[i + 1] for i in range(len(A) - 1)) or
                  all(A[i] >= A[i + 1] for i in range(len(A) - 1)))
      ```

      Feedback: With th

### Rubber Duck Debugging with LLMs (UT + explanations)

We will include line-by-line explanations in the prompts and ask an LLM to explain its code before writing feedback.

In [None]:
def create_few_shot_prompt(examples, human_q_template, ai_a_template, system_template):
    # Human prompt
    human_q_prompt = HumanMessagePromptTemplate.from_template(human_q_template)

    # AI prompt
    ai_a_prompt = AIMessagePromptTemplate.from_template(ai_a_template)

    # System prompt
    system_prompt = SystemMessagePromptTemplate.from_template(system_template)

    # Chat prompt: one human question followed by one AI answer per example.
    example_prompt = ChatPromptTemplate.from_messages(
        [
            human_q_prompt,
            ai_a_prompt
        ]
    )

    # Few-shot prompt
    few_shot_prompt = FewShotChatMessagePromptTemplate(
        example_prompt=example_prompt,
        examples=examples
    )

    # Final prompt
    final_prompt = ChatPromptTemplate.from_messages(
        [
            system_prompt,
            few_shot_prompt,
            human_q_prompt,
            ("system", "You are executing iteration {iteration}.")
        ]
    )
    return final_prompt

In [None]:
human_q_template_ut_exp="""
    assertion:
    These are the assertions for your function:
    {assertion}

    task:
    {task}

    AI:
    answer from the previous iteration:
    {original_code}
    """
ai_a_template_ut_exp="""
    {answer}
    """
system_template_ut_exp="""
    You are a helpful assistant who generates python functions, line-by-line explanations for them and feedback based on the provided assertion tests.
    A user will pass in an assertion test and a description of what a function needs to do.
    You should generate a function, a line-by-line explanation for the function and a feedback message following the format of the provided examples above.
    ONLY return a function, line-by-line explanation and a feedback message, and nothing else.
    If the "answer from previous iteration" is "None", it means the current iteration is 0.
    Do NOT ask to provide the answer from iteration 0 or complain that it is missing.
    """

In [None]:
final_prompt_ut_exp = create_few_shot_prompt(examples_ut_exp, human_q_template_ut_exp, ai_a_template_ut_exp, system_template_ut_exp)

# Print some last characters of an example of the final prompt.
print(f"[my comment that the chain wouldn't see]: "
      f"Here comes all the few-shot examples given above \n\n... \n"
      )
print(final_prompt_ut_exp.format(
    assertion=mbpp['test'][0]['test_list'][0],
    task=mbpp['test'][0]['prompt'],
    original_code="None",
    iteration="0"
)[-5300:])

[my comment that the chain wouldn't see]: Here comes all the few-shot examples given above 

... 


    
Human: 
    assertion:
    These are the assertions for your function:
    
      assert find_Rotations("aaaa") == 1
      

    task:
    
      Write a python function to find the minimum number of rotations required to get the same string.
      

    AI:
    answer from the previous iteration:
    
      None
      
    
AI: 
    
      ```python
      def find_Rotations(s):
          if len(s) == 1:
              return 1
          elif len(s) == 0:
              return 0
          else:
              l = len(s)
              min = len(s)
              for i, c in enumerate(s):
                  if c == s[0]:
                      temp = i
                      if temp < min:
                          min = temp
              return min
      ```

      Here is a line-by-line explanation of the code:
      ‘def find_Rotations(s):‘: This line defines a function named ‘find_Rotat

In [None]:
# Run Rubber Duck Debugging with unit tests and explanations
chain = LLMChain(
  llm=ChatOpenAI(model="gpt-3.5-turbo-16k", max_tokens=8000, temperature=0.7),
  prompt=final_prompt_ut_exp
  )
acc, correct_task = run_self_debug(chain, test_ds, 3, verbose=True, debug=False, use_assertions=True)


[run]: Running task number 0. 
The task is: 
Write a function to find the nth octagonal number. 
The assertion for the task is: 
assert is_octagonal(5) == 65


[run]: Executing iteration 0.
 
 ```python
def is_octagonal(n):
    return n * (3 * n - 2)
```

Here is a line-by-line explanation of the code:

`def is_octagonal(n):`: This line defines a function named `is_octagonal` that takes a single argument, `n`. `n` represents the position of the octagonal number to be calculated.

`return n * (3 * n - 2)`: This line calculates the nth octagonal number using the formula `n * (3 * n - 2)`. The formula represents the pattern of adding successive multiples of 8 starting from 1. The nth octagonal number is equal to `n` multiplied by `(3 * n - 2)`.

Feedback: With the above function, `is_octagonal(5) == 65`. The assertion is "assert is_octagonal(5) == 65". So the code passes the assertion. The code above is correct. 


[run]: The task number 0 was succesfully accomplished. All assertions wer

Unfortunately there's no access to GPT4-32K, so we can't test it at the moment.

What if we have no unit tests? Explanations should help!

### Rubber Duck Debugging with LLMs (only explanations)

In [None]:
human_q_template_exp="""
    task:
    {task}

    AI:
    answer from the previous iteration:
    {original_code}
    """
ai_a_template_exp="""
    {answer}
    """
system_template_exp="""
    You are a helpful assistant who generates python functions, line-by-line explanations for them and feedback based on the explanations.
    A user will pass in a description of what a function needs to do.
    You should generate a function, a line-by-line explanation for the function and a feedback message following the format of the provided examples above.
    ONLY return a function, line-by-line explanation and a feedback message, and nothing else.
    If the "answer from previous iteration" is "None", it means the current iteration is 0.
    Do NOT ask to provide the answer from iteration 0 or complain that it is missing.
    """

In [None]:
final_prompt_exp = create_few_shot_prompt(examples_exp, human_q_template_exp, ai_a_template_exp, system_template_exp)

# Print some last characters of an example of the final prompt.
print(f"[my comment that the chain wouldn't see]: "
      f"Here comes all the few-shot examples given above \n\n... \n"
      )
print(final_prompt_exp.format(
    assertion=mbpp['test'][0]['test_list'][0],
    task=mbpp['test'][0]['prompt'],
    original_code="None",
    iteration="0"
)[-4770:])

[my comment that the chain wouldn't see]: Here comes all the few-shot examples given above 

... 

   
Human: 
    task:
    
      Write a python function to find the minimum number of rotations required to get the same string.
      

    AI:
    answer from the previous iteration:
    
      None
      
    
AI: 
    
      ```python
      def find_Rotations(s):
          if len(s) == 1:
              return 1
          elif len(s) == 0:
              return 0
          else:
              l = len(s)
              min = len(s)
              for i, c in enumerate(s):
                  if c == s[0]:
                      temp = i
                      if temp < min:
                          min = temp
              return min
      ```

      Here is a line-by-line explanation of the code:
      ‘def find_Rotations(s):‘: This line defines a function named ‘find_Rotations ‘ that takes a single argument, ‘s‘.
      ‘s‘ represents the string whose minimum number of rotations is to be ca

In [None]:
# Run Rubber Duck Debugging with explanations only
chain = LLMChain(
  llm=ChatOpenAI(model="gpt-3.5-turbo-16k", max_tokens=8000, temperature=0.7),
  prompt=final_prompt_exp
  )
acc, correct_task = run_self_debug(chain, test_ds, 3, verbose=True, debug=False, use_assertions=False)


[run]: Running task number 0. 
The task is: 
Write a function to find the nth octagonal number. 


[run]: Executing iteration 0.
 
 ```python
def find_octagonal_number(n):
    return n * (3 * n - 2)
```

Here is a line-by-line explanation of the code:
- `def find_octagonal_number(n):` This line defines a function named `find_octagonal_number` that takes a single argument, `n`. `n` represents the position of the octagonal number to be calculated.
- `return n * (3 * n - 2)` This line calculates the nth octagonal number using the formula `n * (3 * n - 2)`. The formula represents the number of dots required to form an octagon.

Feedback: The code above is correct. 


[run]: The task number 0 was succesfully accomplished. All assertions were passed.


[run]: Running task number 1. 
The task is: 
Write a python function to check whether the given array is monotonic or not. 


[run]: Executing iteration 0.
 
 ```python
def is_monotonic(arr):
    increasing = True
    decreasing = True
    fo

### Results based on the small experiments above:

In [None]:
from IPython.display import display, HTML

table_html = """
<style>
    /* Adjusts the table's font size */
    table.big-table {
        font-size: 1.5em;
        width: 100%;
    }
    /* Adjusts the table cell padding */
    table.big-table th, table.big-table td {
        padding: 10px 20px;
    }
</style>

<table class="big-table">
    <thead>
        <tr>
            <th>Model and method</th>
            <th>Tokens spent</th>
            <th>Accuracy</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>GPT3.5-16K + UT</td>
            <td>97222</td>
            <td>0.75</td>
        </tr>
        <tr>
            <td>GPT3.5-16K + UT + Expl.</td>
            <td>194988</td>
            <td>0.8</td>
        </tr>
        <tr>
            <td>GPT3.5-16K + Expl</td>
            <td>132589</td>
            <td><b>0.85</b></td>
        </tr>
        <tr>
            <td>GPT4-8K + UT</td>
            <td>90154</td>
            <td>0.8</td>
        </tr>
    </tbody>
</table>
"""

display(HTML(table_html))

| Model and method | Tokens spent | Accuracy |
|---|---|---|
| GPT3.5-16K + UT | 97222 | 0.75 |
| GPT3.5-16K + UT + Expl. | 194988 | 0.8 |
| GPT3.5-16K + Expl | 132589 | **0.85** |
| GPT4-8K + UT | 90154 | 0.8 |


These differences are most likely noise caused by the small test set: with only 20 tasks, a single task changes the accuracy by 0.05.
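
To put a rough number on the effect of the small test set, the sketch below computes the binomial standard error of each reported accuracy for 20 tasks. It assumes the tasks are independent draws, which is a simplification, since five hard cases were added on purpose.

```python
import math

n_tasks = 20
# Accuracies copied from the results table above.
reported = {
    "GPT3.5-16K + UT": 0.75,
    "GPT3.5-16K + UT + Expl.": 0.80,
    "GPT3.5-16K + Expl": 0.85,
    "GPT4-8K + UT": 0.80,
}

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

errors = {name: standard_error(acc, n_tasks) for name, acc in reported.items()}
for name, err in errors.items():
    print(f"{name}: {reported[name]:.2f} +/- {err:.3f}")
```

Each standard error comes out at roughly 0.08 to 0.10, about as large as the whole spread between the best and the worst row (0.10), so the ranking of the methods should not be read as established.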