# OpenAICookbookPromptEngineering
## Overview
This notebook:
1. Scrapes Java code from the `fineract` directory
2. Cleans it up and tokenizes it
3. Splits the codebase into chunks of at most a fixed token count (defaults to 1600)
4. Embeds the tokenized code chunks via OpenAI's `text-embedding-ada-002` model
5. Stores the embedded text as a dataframe
6. Uses the dataframe to supply extra context to ChatGPT, based on the question asked

## Operations
Assuming you're starting from scratch and opening this notebook for the first time, its cells offer the following functionalities (the numbers do not necessarily correspond to individual cells):

1. The first cell is meant to `pip install` the required Python modules/libraries, so that running a single cell ensures the dependencies are present. It does no harm to include another `import <package>` line in a later cell, so this is not a hard-and-fast rule.

2. The second cell points the OpenAI client at a local `openai-key.txt` file that holds your API key. This repo's `.gitignore` includes that file, so the key stays out of version control as long as you save it under exactly that name. Even so, ALWAYS run `git status` before running any `git commit -am ""` command to ensure you do not accidentally include sensitive information.

> **Warning**
> Once any sensitive information is baked into ANY git commit & pushed to the remote repo, it exists within the git log. It is *possible* to remove it without destroying all code history, but it is very painstaking and sometimes difficult to detect after the fact. Always ensure that your commit history does not include any sensitive information prior to pushing to a remote repository.

3. Git clones the Fineract web app from the open-source repo on the 1.8.4 git branch. The clone code is commented out by default because it is slow and, as written, attempts the clone on every run; uncomment it for the first run and re-comment it once the repository exists locally. This repo's `.gitignore` excludes the cloned directory, so nested git repositories are not a concern.

4. Sets up Java cleanup function(s) that remove extraneous content (comments, empty lines, and unit test code blocks). Removing unit test code blocks is essential: if too many embedded chunks contain the word `test` (which unit test code blocks typically do), including `test` in your prompt/question to ChatGPT may skew which chunks are retrieved as context.

5. Tokenizes the files & chunks them based on the chunksize provided (defaults to 1600). It uses the SentencePiece Byte-Pair Encoding Tokenizer from the `tokenizers` python module due to more correct formatting of the results. Other options are available (character, byte-level, or Bert Word) but the results weren't as great; these options are included within the import cell if experimentation is desired. The tokenized chunks are optionally written to a local flat directory `java_files_chunks_sentence` (two lines are marked as "uncomment" if you want to see the results written locally).

6. The tokenized chunks are stored as a dataframe (and then ultimately a CSV file in the `data` directory) with vectorized strings. We define functions that provide that dataframe based on a query, with the query representing the question we'll use later on to ask ChatGPT a question about the fineract codebase (in our case as an attempt to write unit tests).

7. We define our prompt and test the vector search to make sure the files it finds match the question we're asking.

8. We define the functions for how to ask ChatGPT a question & then use all the previous functions to search for extra context & provide that all over an API call
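
The fixed-size chunking in step 5 comes down to ceiling division over the token list. The sketch below illustrates that arithmetic, assuming a plain Python list stands in for real tokenizer output; `chunk_tokens` is a hypothetical helper name and does not appear in the notebook.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 1600) -> List[List[str]]:
    """Split tokens into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division: a partial final chunk still counts as a chunk.
    num_chunks = (len(tokens) + chunk_size - 1) // chunk_size
    return [
        list(tokens[i * chunk_size:(i + 1) * chunk_size])
        for i in range(num_chunks)
    ]


# Toy input: 3500 placeholder tokens instead of real tokenizer output.
toy_tokens = [f"tok{i}" for i in range(3500)]
chunks = chunk_tokens(toy_tokens)
print([len(chunk) for chunk in chunks])  # [1600, 1600, 300]
```

Only the last chunk can be shorter than `chunk_size`, and an empty file yields no chunks at all.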


## Sources
This notebook stems from two OpenAI Cookbook templates:

* [Embedding Wikipedia Tokens (Prerequisite for Next Example)](https://github.com/openai/openai-cookbook/blob/2a2753e8d0566fbf21a8270ce6afaf761d7cdee5/examples/Embedding_Wikipedia_articles_for_search.ipynb)
* [Question Answering Using Embeddings](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb)

The OpenAICookbookPromptEngineering notebook uses the Apache Fineract Java web app [(git branch 1.8.4)](https://github.com/apache/fineract/tree/1.8.4) as the application code it consumes.

In [None]:
#Setup installation of packages and what to import/consume later:
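%pip install scipy
%pip install tokenizers
# The cells below use the pre-1.0 openai interface (openai.Embedding, openai.ChatCompletion), so pin the package below 1.0:
%pip install "openai<1"
# These packages are imported below as well:
%pip install pandas tiktoken regex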

import regex
import IPython.display
import os
import shutil
import subprocess
import openai
import re
from tokenizers import Tokenizer
from tokenizers import CharBPETokenizer as CBPET
from tokenizers import ByteLevelBPETokenizer as BBPET
from tokenizers import SentencePieceBPETokenizer as SPBPET
from tokenizers import BertWordPieceTokenizer as BWPT
from tokenizers.trainers import BpeTrainer
import ast  # for converting embeddings saved as strings back to arrays
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search

In [None]:
# Set the OpenAI API key:
openai.api_key_path = 'openai-key.txt'
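
If the key file is missing or empty, the failure otherwise only surfaces at the first API call. The sketch below reads the key eagerly so that problem shows up straight away. `read_api_key` is a hypothetical helper, not part of the `openai` package, and it assumes the file contains nothing but the key.

```python
from pathlib import Path


def read_api_key(path: str = "openai-key.txt") -> str:
    """Return the key stored in a one-line text file, stripped of whitespace."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"Expected an API key file at {key_file.resolve()}")
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_file} exists but is empty")
    return key
```

The returned string could then be assigned to `openai.api_key` instead of setting `openai.api_key_path`.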

In [None]:
""" Unit Test Scraper Functions
Below Unit Test Finder functions find SOME unit tests in *.java files. Makes the following assumptions:
    - Unit Test starts with "@Test <return character><tab character> public void"
    - Unit Test does NOT throw any exception, and thus just has a return character after its declaration

Also cleans up
    - single-line comments
    - multi-line comments
    - empty new lines

With the above assumptions it finds 465 of the 588 unit test cases littered throughout the fineract application
"""

# Regex patterns for stripping comments, blank lines, and unit tests:
singleline_pattern = r"(?!\/\/localhost\b)\/\/.*"
multiline_pattern = r"/\*(.|[\r\n])*?\*/"
whiteline_pattern = r"\n\s*\n"
test_pattern = r"@Test\s*public\s*void\s*\w+\s*\(\s*\)\s*\{[^{}]*+(?:(?:\{[^{}]*+\})*+[^{}]*+)*+\}"

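def remove_comments(java_code):
    # Remove single-line comments
    java_code = re.sub(singleline_pattern, "", java_code, flags=re.MULTILINE)

    # Remove multi-line comments. flags must be passed by keyword:
    # the fourth positional argument of re.sub is count, not flags.
    java_code = re.sub(multiline_pattern, "", java_code, flags=re.MULTILINE)

    # Remove unit tests. test_pattern uses possessive quantifiers (*+), which the
    # stdlib re module only accepts from Python 3.11, so use the regex module here.
    java_code = regex.sub(test_pattern, "", java_code, flags=regex.MULTILINE)

    # Collapse runs of blank lines into one newline so neighbouring lines are not joined
    java_code = re.sub(whiteline_pattern, "\n", java_code)

    return java_code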
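
The docstring above notes that the regex only catches tests whose declaration has no `throws` clause. One possible way around that is to find each `@Test` header and then count braces to locate the end of the method body. The sketch below is an alternative technique, not what the notebook does; `TEST_HEADER` and `strip_test_methods` are hypothetical names. It assumes braces never appear inside string literals or annotation arguments, which real code can violate.

```python
import re

# "@Test", then anything up to the opening brace of the method body.
# Excluding ";" keeps the match from running past a statement boundary.
TEST_HEADER = re.compile(r"@Test\b[^{;]*\{")


def strip_test_methods(java_code: str) -> str:
    """Remove every @Test-annotated method by matching its braces."""
    kept = []
    pos = 0
    while True:
        match = TEST_HEADER.search(java_code, pos)
        if match is None:
            kept.append(java_code[pos:])
            break
        kept.append(java_code[pos:match.start()])
        depth = 1  # the header match consumed the opening brace
        i = match.end()
        while i < len(java_code) and depth > 0:
            if java_code[i] == "{":
                depth += 1
            elif java_code[i] == "}":
                depth -= 1
            i += 1
        pos = i  # resume just after the method's closing brace
    return "".join(kept)


sample = """class A {
    @Test
    public void works() throws Exception {
        if (x) { y(); }
    }
    int keep() { return 1; }
}
"""
print(strip_test_methods(sample))
```

Because it counts braces instead of matching a fixed nesting pattern, the sketch also handles `throws` clauses and arbitrarily deep blocks.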

In [None]:
"""Set up the codebase chunks and embed them
    - Clone the repo down (commented out right now so I don't have to wait for a clone each time it fails)
    - Extract & Clean (using above remove_comments() function) the java files
    - Chunk the java files into 1600 token chunks
    - Store the chunks in memory
    - Send the chunks off to openAI's embedding API endpoint (using text-embedding-ada-002)
    - Store the embedded text as a Pandas Dataframe
    - Print the dataframe (just to be sure)
"""

repo_dir = "fineract"  # Use the cleaned repo
# Clone the GitHub repository -- should really only need to do this once
# repo_url = "https://github.com/apache/fineract"
# subprocess.run(["git", "clone", "-b", "1.8.4", repo_url, repo_dir]) # Grab the 1.8.4 fineract git branch just because it's stable

# Set up the tokenizer
# tokenizer = BBPET()  # Byte
tokenizer = SPBPET()   # Sentence


java_files = []
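for root, dirs, files in os.walk(repo_dir):
    # dirs is a list, so comparing it to "test" never matched. Prune "test"
    # directories from the walk instead of deleting them from disk.
    dirs[:] = [d for d in dirs if d != "test"]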
    for file in files:
        if file.endswith(".java"):
            java_files.append(os.path.join(root, file))

# Process the Java files and break them into 1600 token chunks
new_dir = "fineract-java"
output_dir = "java_files_chunks_sentence"
os.makedirs(output_dir, exist_ok=True)
os.makedirs(new_dir, exist_ok=True)

for file in java_files:
    with open(file, "r") as f:
        java_code = f.read()

    # Use the remove_comments function to redefine java_code as sans-comments java code:
    java_code = remove_comments(java_code)

    new_filename = f"cleaned-{os.path.basename(file)}"
    new_filepath = os.path.join(new_dir, new_filename)

    with open(new_filepath, "w") as f:
        f.write(java_code)

new_java_files = []
# new_dir is flat (every cleaned file is written directly into it), so no directory filtering is needed
for root, dirs, files in os.walk(new_dir):
    for file in files:
        if file.endswith(".java"):
            new_java_files.append(os.path.join(root, file))

# Train the tokenizer on the Java files:
# tokenizer_trainer = BpeTrainer(vocab_size=1600, min_frequency=2)
tokenizer.train(new_java_files)

output_content = []
for file in new_java_files:
    with open(file, "r") as f:
        new_java_code = f.read()

    encoding = tokenizer.encode(new_java_code)
    tokens = encoding.tokens
    ids = encoding.ids
    chunk_size = 1600
    num_chunks = (len(tokens) + chunk_size - 1) // chunk_size

    for i in range(num_chunks):
        start = i * chunk_size
        end = (i + 1) * chunk_size
        chunk_tokens = tokens[start:end]
        chunk_ids = ids[start:end]
        chunk_code = tokenizer.decode(chunk_ids)

        chunk_filename = os.path.basename(file) + f".chunk{i+1}.java"
        chunk_filepath = os.path.join(output_dir, chunk_filename)

        output_content.append(chunk_code)
        # If you want to visualize the tokenized java chunks, uncomment the following two lines:
        # with open(chunk_filepath, "w") as f:
        #     f.write(chunk_code)

EMBEDDING_MODEL = "text-embedding-ada-002"

# Number of token chunks to send at a time
# OpenAI's example specifies 1000 but I've had greater success with 100
BATCH_SIZE = 100

embeddings = []
for batch_start in range(0, len(output_content), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = output_content[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")

    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

df = pd.DataFrame({"text": output_content, "embedding": embeddings})

# Print out the Dataframe, just to be sure:
df
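
The batching loop above can be tried without network access by passing in the embedding call as a function. In the sketch below, `embed_in_batches` and `fake_embed_batch` are hypothetical names; the fake function stands in for the real API request and returns one toy vector per input text.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed texts batch by batch, keeping vectors in input order."""
    embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch(batch)
        # Guard against a backend returning the wrong number of vectors.
        if len(vectors) != len(batch):
            raise ValueError(f"expected {len(batch)} vectors, got {len(vectors)}")
        embeddings.extend(vectors)
    return embeddings


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Stand-in for an API call: a two-dimensional toy vector per text."""
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
```

Swapping `fake_embed_batch` for a function that wraps the real embedding request keeps the loop unchanged.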

In [None]:
# Create the directory so that py is happy
os.makedirs("data", exist_ok=True)

# save document chunks and embeddings
SAVE_PATH = "data/fineract.csv"

df.to_csv(SAVE_PATH, index=False)

In [None]:
embeddings_path = SAVE_PATH

df = pd.read_csv(embeddings_path)

In [None]:
df['embedding'] = df['embedding'].apply(ast.literal_eval)
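
The `ast.literal_eval` step is needed because a CSV cell can only hold text: a list of floats is written out as its string form and comes back as a string. The sketch below shows that round trip with the standard-library `csv` module standing in for pandas, so it runs without dependencies; the sample row is made up.

```python
import ast
import csv
import io

embedding = [0.25, -0.5, 1.0]

# Write one row; csv.writer stores the list via str(), i.e. as text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["class Foo {}", embedding])

# Read it back: the embedding column is now a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["embedding"]).__name__)  # str

# literal_eval safely parses the string back into a list of floats.
restored = ast.literal_eval(row["embedding"])
print(restored == embedding)  # True
```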

In [None]:
"""Setup the vector search function that scrapes through your dataframe and/or CSV file to find related code based on your prompt
"""

# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
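
The default `relatedness_fn` is cosine similarity, since `1 - cosine distance` is the cosine of the angle between the two vectors. The sketch below performs the same ranking in pure Python on hand-made two-dimensional vectors; `cosine_similarity`, `rank_by_relatedness` and the toy rows are illustrative assumptions, not real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_relatedness(
    query_vector: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    top_n: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_n (text, score) pairs, most related first."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


toy_rows = [
    ("loan code", [1.0, 0.0]),
    ("client code", [0.0, 1.0]),
    ("loan test", [1.0, 1.0]),
]
ranked = rank_by_relatedness([1.0, 0.0], toy_rows)
for text, score in ranked:
    print(f"{score:.3f} {text}")
```

With these toy vectors, the row pointing in the same direction as the query ranks first and the orthogonal row is dropped by `top_n=2`.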

In [None]:
"""[Optional] Test the vector scraping from the embedded text
    Don't actually need to run this, but it's good to understand
    what it's doing under the hood
"""

prompt = "Write a new unit test for CodesApiResource {"

strings, relatednesses = strings_ranked_by_relatedness(prompt, df, top_n=3)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(string)

In [None]:
"""Define the functions for how to reach out to ChatGPT
    use by calling ask(<prompt>)
    Can optionally get the hidden context printed by typing

        ask(<prompt>, print_message=True)
"""

GPT_MODEL = "gpt-3.5-turbo"

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below codeset from the fineract java web application to answer the subsequent question. If the answer cannot be found from the code sample, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nFineract Application codebase selection:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question
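
`query_message` packs context greedily: it appends snippets in relevance order and stops at the first one that would push the message over the token budget. The sketch below isolates that logic. `pack_context` is a hypothetical helper, and `count_words` is a whitespace word counter standing in for `tiktoken` so the block runs without dependencies.

```python
from typing import Callable, Iterable


def pack_context(
    introduction: str,
    question: str,
    snippets: Iterable[str],
    count_tokens: Callable[[str], int],
    token_budget: int,
) -> str:
    """Append snippets in order until the next one would exceed the budget."""
    message = introduction
    for snippet in snippets:
        candidate = message + "\n\n" + snippet
        # The question is always appended at the end, so count it as well.
        if count_tokens(candidate + question) > token_budget:
            break  # stop at the first overflow; later snippets are not tried
        message = candidate
    return message + question


def count_words(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


question = "\n\nQuestion: what does it do?"
packed = pack_context(
    "Use the code below.",
    question,
    ["one two three", "four five six seven", "eight"],
    count_words,
    token_budget=12,
)
print(packed)
```

Breaking instead of skipping means a short, lower-ranked snippet is never added after a long one is rejected, which keeps the context strictly in relevance order.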


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the Fineract Web Application."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message

In [None]:
"""Call out to ChatGPT

Syntax:
    
    ask("<prompt>", print_message=<True/False>)

print_message defaults to False; if set to True, it prints the hidden context in addition to returning the answer provided by ChatGPT
"""

# Ask a question while hiding the context:
ask(prompt, print_message=False)