<a href="https://colab.research.google.com/github/ghopper3/ChatGPT-Project/blob/main/Copy_of_Accounting_Chatbot_Using_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Fine-Tuned RAG Chatbot with LangChain
# (Part 2: Retrieval-Augmented Generation)


# Introduction
In [Part 1](https://colab.research.google.com/drive/1nvAbE1bWQX_zHNCVEcyn2JSd-w1hYWc8?usp=sharing) of this project, we fine-tuned ChatGPT to be a finance and accounting chatbot. As a reminder, fine-tuning involves re-training a pre-trained LLM on a new dataset of labeled examples (in our case, sample questions and answers from CPA and CFA exams). Fine-tuning is meant to improve the LLM's performance on specific tasks, such as text classification, language translation, or question answering.

In this next step, we're going to use Retrieval-Augmented Generation (RAG) to focus our model on our desired goals even more.

So what is RAG?
Retrieval-Augmented Generation (RAG) augments a pre-trained large language model (LLM) with a retrieval component. This lets the LLM access and retrieve information from external knowledge bases, which can improve its accuracy and informativeness on knowledge-intensive tasks. A RAG system first retrieves a set of relevant documents from the knowledge base for the given query, then generates the final output text by conditioning on both the query and the retrieved documents. The approach applies to tasks such as question answering, summarization, and translation.
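
To make the retrieve-then-generate flow concrete, here is a minimal sketch. It is not the pipeline built in this notebook: it assumes a tiny in-memory knowledge base and ranks documents by simple word overlap rather than vector embeddings. The helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical, and the sample documents are written for illustration only.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, contexts: list) -> str:
    """Combine the retrieved contexts and the query into a single prompt."""
    joined = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{joined}\n\n"
        f"Query: {query}"
    )

# Toy knowledge base (illustrative sentences, not taken from the project's PDFs)
knowledge_base = [
    "Internal controls help prevent and detect fraud in financial reporting.",
    "Depreciation allocates the cost of an asset over its useful life.",
    "An audit committee oversees the external auditor.",
]

demo_query = "How do internal controls relate to fraud?"
demo_contexts = retrieve(demo_query, knowledge_base, k=1)
demo_prompt = build_prompt(demo_query, demo_contexts)
print(demo_prompt)
```

In a real RAG system the word-overlap ranking would be replaced by a similarity search over embeddings, and the resulting prompt would be sent to the LLM for the generation step.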

**Use fine-tuning:**
*   When you have a labeled dataset of examples for the specific task that you want the LLM to perform.
*   When you need the LLM to perform well on a specific task, even if it means sacrificing some performance on other tasks.


**Use RAG:**
*   When you need the LLM to be accurate and informative on a wide range of knowledge-intensive tasks.
*   When you don't have a labeled dataset for the specific task that you want the LLM to perform.
*   When you need the LLM to be able to access and retrieve information from external knowledge bases.

In our case, we're using both in an effort to achieve even better results.

This portion of the project is based on the work of James Briggs (https://aurelio.ai/) and his project: "Building RAG Chatbots with LangChain."

I am including James' instructions throughout, and adding my own where they relate specifically to our use case of building a Finance & Accounting Chatbot using a fine-tuned model with accounting skills.

In this example, we'll work on building an AI chatbot from start to finish. We will be using LangChain, OpenAI, and the Pinecone vector database to build a chatbot capable of learning from the external world using Retrieval-Augmented Generation (RAG).

Our dataset will include accounting texts, including "Internal Control Strategies for Compliance with the Sarbanes-Oxley Act of 2002" and the "Auditing Standards of the Public Company Accounting Oversight Board" to help our chatbot answer questions about accounting-specific topics.

By the end of the example we'll have a functioning chatbot and RAG pipeline that can hold a conversation and provide informative responses based on a knowledge base.

### Before you begin

You'll need to get an [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io).
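
One way to avoid pasting keys directly into notebook cells is to read them from environment variables and prompt only when they are missing. The sketch below is an optional convenience, not part of the original notebook; `get_api_key` is a hypothetical helper name.

```python
import os
from getpass import getpass

def get_api_key(env_var: str, prompt_fn=getpass) -> str:
    """Return the key stored in env_var, prompting for it only if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        # Ask interactively (input is hidden) and cache it for later cells
        key = prompt_fn(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `get_api_key("OPENAI_API_KEY")` or `get_api_key("PINECONE_API_KEY")` before the cells that need those keys would then populate the environment variables they read.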

### Prerequisites

Before we start building our chatbot, we need to install some Python libraries. Here's a brief overview of what each library does:

- **langchain**: This is a framework for building applications on top of LLMs. We'll use it to chain together different language models and components for our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

You can install these libraries using pip like this:

(Note: You may have to run this twice if you get an error message about dependencies.)

In [None]:
!pip install -qU \
    langchain==0.0.292 \
    openai==0.28.0 \
    datasets==2.10.1 \
    pinecone-client==2.2.4 \
    tiktoken==0.5.1

Let's make sure we have the latest version of pandas.

In [None]:
!pip install --upgrade pandas


### Building a Chatbot (no RAG)

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. We do this by initializing a `ChatOpenAI` object. For this we do need an [OpenAI API key](https://platform.openai.com/account/api-keys).

Note that for the model here, we are using the fine-tuned model that we created in Part 1 of this project.

In [None]:
import os
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or "YOUR_OPENAI_API_KEY"

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='LINK_TO_YOUR_FINE_TUNED_MODEL_FROM_PART_1' # Alternatively, use an off-the-shelf model such as 'gpt-3.5-turbo' or 'gpt-4'
)

Chats with OpenAI's `gpt-3.5-turbo` and `gpt-4` chat models are typically structured (in plain text) like this:

```
System: You are a helpful accounting assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand FASB Update ASU 2016-13.

Assistant:
```

The final `"Assistant:"` without a response is what would prompt the model to continue the conversation. In the official OpenAI `ChatCompletion` endpoint these would be passed to the model in a format like:

```python
[
    {"role": "system", "content": "You are a helpful accounting assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand FASB Update ASU 2016-13."}
]
```
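
To see how the role-based list corresponds to the plain-text layout shown earlier, the following sketch renders one into the other. It is for illustration only: `render_transcript` and `ROLE_LABELS` are hypothetical names, and the API itself never needs this conversion.

```python
# Display label for each role (the plain-text layout is for illustration only)
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}

def render_transcript(chat_messages: list) -> str:
    """Render role/content dictionaries as a plain-text transcript."""
    lines = [f"{ROLE_LABELS[m['role']]}: {m['content']}" for m in chat_messages]
    # The trailing empty "Assistant:" marks where the model continues
    lines.append("Assistant:")
    return "\n\n".join(lines)

example_messages = [
    {"role": "system", "content": "You are a helpful accounting assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand FASB Update ASU 2016-13."},
]

transcript = render_transcript(example_messages)
print(transcript)
```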

In LangChain there is a slightly different format. We use three types of _message_ objects like so:

In [None]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful accounting assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand FASB Update ASU 2016-13.")
]

The format is very similar. We've just swapped the role of `"user"` for `HumanMessage`, and the role of `"assistant"` for `AIMessage`.

We generate the next response from the AI by passing these messages to the `ChatOpenAI` object.

In [None]:
res = chat(messages)
res

In response we get another AI message object. We can print it more clearly like so:

In [None]:
print(res.content)

Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="How might this impact my business?"
)
# add to messages
messages.append(prompt)

# send to chat-gpt
res = chat(messages)

print(res.content)

### Dealing with Hallucinations

We have our chatbot, but as mentioned — the knowledge of LLMs can be limited. The reason for this is that LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, like ASU 2023-05, which was issued in August 2023 and addresses the accounting for joint venture formations.

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What can you tell me about ASU 2023-05?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [None]:
print(res.content)

Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It was very clear from this answer that the LLM doesn't know the information, but sometimes an LLM may respond like it _does_ know the answer, and this can be very hard to detect.

OpenAI have since adjusted the behavior for this particular example as we can see below:

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Can you tell me what ASU 2023-05 says about the initial measurement of a joint venture’s total net assets?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [None]:
print(res.content)

There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the ASU 2023-05 question, using a few passages that describe the update.

In [None]:
llmchain_information = [
    "ASU 2023-05 treats a joint venture formation as the formation of a new entity without an accounting acquirer. The formation of a joint venture is the creation of a new reporting entity, and none of the assets and/or businesses contributed to the joint venture are viewed as having survived the combination as an independent entity; that is, an accounting acquirer will not be identified.",
    "Under ASU 2023-05, A joint venture measures its identifiable net assets and goodwill, if any, at the formation date. The joint venture formation date is the date on which an entity initially meets the definition of a joint venture.",
    "Initial measurement of a joint venture’s total net assets is equal to the fair value of 100 percent of the joint venture’s equity. The amendments require that a joint venture measure its total net assets upon formation as the fair value of the joint venture as a whole. The fair value of the joint venture as a whole equals the fair value of 100 percent of a joint venture’s equity immediately following formation (including any noncontrolling interest in the net assets recognized by the joint venture)."
]

source_knowledge = "\n".join(llmchain_information)

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [None]:
query = "Can you tell me what ASU 2023-05 says about the initial measurement of a joint venture’s total net assets?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

Now we feed this into our chatbot as we did before.

In [None]:
# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [None]:
print(res.content)

The quality of this answer is much better. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?

This is where we will use Pinecone and vector databases, but first, we'll need a dataset.

### Importing the Data

In this step, we will be importing our data. We will be using a couple of accounting-specific PDFs for this test. Specifically, we are using "Internal Control Strategies for Compliance with the Sarbanes-Oxley Act of 2002" and the "Auditing Standards of the Public Company Accounting Oversight Board."
However, once the model works, we could use a long list of accounting references with documents like:

**Regulatory and Standards Manuals**
1.	FASB Accounting Standards Codification (ASC)
2.	International Financial Reporting Standards (IFRS) Handbook
3.	PCAOB Auditing Standards Manual
4.	Sarbanes-Oxley Act (SOX) Compliance Guide

Before we can import our PDFs, we have to install some additional software and mount our Google Drive to access the documents. We are also going to try to do some data cleanup by converting the files to text documents.

The code to do this is below.

In [None]:

!pip install PyPDF2
!pip install transformers


In [None]:
import os
import PyPDF2
import pandas as pd
from tqdm.auto import tqdm
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer
import torch

# Note: the Pinecone index is created and connected to later, in the
# knowledge-base section, once the client has been initialized with an API key.

# Define a function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    with open(file_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

# Define a function to split text into chunks
def split_text_into_chunks(text, chunk_size=1000):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i : i + chunk_size])
    return chunks

# Function to embed documents using BERT
def embed_documents(texts, model):
    embeddings = []
    for text in texts:
        inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
    return embeddings

# Initialize tokenizer and model for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# If you're using google drive to store your documents, you need to mount it and insert the path to where your files are stored
from google.colab import drive
drive.mount('/content/drive')

# Define the path to your PDF files in Google Drive
pdf_dir = "INSERT_PATH_TO_YOUR_DIRECTORY"

# Create a list of dictionaries containing the text and filename for each PDF file
pdf_data = []
for filename in os.listdir(pdf_dir):
    if filename.endswith(".pdf"):
        file_path = os.path.join(pdf_dir, filename)
        text = extract_text_from_pdf(file_path)
        chunks = split_text_into_chunks(text)
        for chunk in chunks:
            pdf_data.append({"text": chunk, "filename": filename})

# Convert the list of dictionaries to a dictionary with column names as keys and lists of values as values
pdf_data_dict = {}
for key in pdf_data[0].keys():
    pdf_data_dict[key] = [d[key] for d in pdf_data]

# Convert the dictionary to a dataset object
dataset = Dataset.from_dict(pdf_data_dict)

# Add the doi column to the dataset
dataset = dataset.map(lambda x: {"doi": x["filename"].split(".")[0], **x})

# Convert the dataset to a pandas DataFrame object
data = dataset.to_pandas()

# Remove the filename column from the DataFrame
data = data.drop(columns=["filename"])

# Set the batch size and loop over the data with a progress bar
batch_size = 10
all_ids, all_embeds, all_metadata = [], [], []
for start in tqdm(range(0, len(data), batch_size)):
    batch = data.iloc[start : start + batch_size]

    # One unique ID per chunk: "<doi>-<row index>"
    ids = [f"{row['doi']}-{idx}" for idx, row in batch.iterrows()]
    texts = batch["text"].tolist()

    # Embed the batch with BERT (768-dimensional vectors)
    embeds = embed_documents(texts, model)

    # Keep the chunk text as metadata for each vector
    metadata = [{"text": text} for text in texts]

    all_ids.extend(ids)
    all_embeds.extend(embeds)
    all_metadata.extend(metadata)

# Nothing is added to Pinecone here: these BERT vectors have 768 dimensions,
# while the index created below expects 1536-dimensional OpenAI embeddings.

In [None]:
dataset[0]

#### Dataset Overview

The dataset we are using comes from the two PDFs we loaded, but with a larger database and more documentation we could increase our model's knowledge base. The PDFs were each several hundred pages, so we had to break them into smaller "chunks" so that our model could read them. Each entry in the dataset represents a "chunk" of text from these PDFs.
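
A fixed-size split can cut a sentence in half at a chunk boundary. A common variation, sketched here as an option rather than what this notebook ran, is to overlap consecutive chunks so that text near a boundary appears in both. The function name `split_with_overlap` and its `overlap` parameter are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into chunks of up to chunk_size characters, where each chunk
    shares `overlap` characters with the one before it."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Overlap increases the number of chunks, and therefore the number of embeddings to compute and store, so it is a trade-off rather than a free improvement.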

Because most **L**arge **L**anguage **M**odels (LLMs) only contain knowledge of the world as it was during training, they cannot answer questions about information that wasn't available until after they were trained. This method allows us to add new information as it's available and to focus the model for our specific needs -- in this case, building a finance and accounting expert.

### Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [None]:
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'YOUR_PINECONE_API_KEY',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [None]:
import time

index_name = 'acctg-chatbot'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

index = pinecone.Index(index_name)
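
The `metric='cosine'` setting means the index ranks vectors by cosine similarity. As a rough illustration of that calculation (Pinecone performs it internally, and this is not its implementation), here is a sketch in plain Python. The short vectors are made up and stand in for 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors for illustration
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # parallel to the query vector
orthogonal = [0.0, 1.0, 0.0]      # shares no direction with the query vector

score_parallel = cosine_similarity(query_vec, same_direction)
score_orthogonal = cosine_similarity(query_vec, orthogonal)
print(score_parallel, score_orthogonal)
```

Because the score depends only on direction, scaling a vector does not change it: the parallel vector scores approximately 1 even though it is twice as long as the query vector.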

We can then check the index statistics:

In [None]:
index.describe_index_stats()

Our index is now ready but it's empty. It is a vector index, so it needs vectors. As mentioned, to create these vector embeddings we will use OpenAI's `text-embedding-ada-002` model, which we can access via LangChain like so:

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

Using this model we can create embeddings like so:

In [None]:
# Define a list of dictionaries containing the text and metadata for each chunk.
# These are placeholders: replace them with real chunks before running, because
# everything in this list is upserted into the index.
chunks = [
    {'text': 'This is the text of the first chunk.', 'source': 'source1', 'title': 'title1', 'doi': 'doi1', 'chunk-id': 'chunk1'},
    {'text': 'This is the text of the second chunk.', 'source': 'source2', 'title': 'title2', 'doi': 'doi2', 'chunk-id': 'chunk2'},
    {'text': 'This is the text of the third chunk.', 'source': 'source3', 'title': 'title3', 'doi': 'doi3', 'chunk-id': 'chunk3'},
    # Add more chunks here
]

# Split the chunks into batches of 100
batch_size = 100
chunks_batches = [chunks[i:i+batch_size] for i in range(0, len(chunks), batch_size)]

# Embed each batch and store the embeddings and metadata in Pinecone
for chunks_batch in tqdm(chunks_batches):
    # Get one 1536-dimensional embedding per chunk in the batch
    texts = [chunk['text'] for chunk in chunks_batch]
    embeddings_batch = embed_model.embed_documents(texts)
    # Generate a unique id for each chunk in the batch
    ids_batch = [f"{chunk['doi']}-{chunk['chunk-id']}" for chunk in chunks_batch]
    # Build the metadata stored alongside each vector
    metadata_batch = [
        {'text': chunk['text'], 'source': chunk['source'], 'title': chunk['title']}
        for chunk in chunks_batch
    ]
    # Ids, embeddings, and metadata are aligned by position
    index.upsert(vectors=list(zip(ids_batch, embeddings_batch, metadata_batch)))


Each chunk of text yields one 1536-dimensional embedding. The placeholder list above contains three chunks; our full dataset contains 1,970.

We're now ready to embed and index all of our data! We do this by looping through our dataset, embedding each batch of chunks, and inserting the results into the index.

In [None]:
import os
import PyPDF2
import pandas as pd
from tqdm.auto import tqdm
from datasets import Dataset

# Define a function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ''
        for page in pdf_reader.pages:
            # Guard against pages that yield no extractable text
            text += page.extract_text() or ''
    return text

# Define the path to your PDF files
pdf_dir = 'INSERT_PATH_TO_YOUR_FILES'

# Size of each chunk in characters
chunk_size = 1000

# Create one record per 1,000-character chunk of each PDF file
pdf_data = []
for filename in os.listdir(pdf_dir):
    if filename.endswith('.pdf'):
        file_path = os.path.join(pdf_dir, filename)
        text = extract_text_from_pdf(file_path)
        for start in range(0, len(text), chunk_size):
            pdf_data.append({'text': text[start:start + chunk_size], 'filename': filename})

# Convert the list of dictionaries to a dictionary with column names as keys and lists of values as values
pdf_data_dict = {}
for key in pdf_data[0].keys():
    pdf_data_dict[key] = [d[key] for d in pdf_data]

# Convert the dictionary to a dataset object
dataset = Dataset.from_dict(pdf_data_dict)

# Add the doi column to the dataset
dataset = dataset.map(lambda x: {'doi': x['filename'].split('.')[0], **x})

# Convert the dataset to a pandas DataFrame object
data = dataset.to_pandas()

# Remove the filename column from the DataFrame
data = data.drop(columns=['filename'])

# Set the batch size and loop over the data with a progress bar
batch_size = 100
for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate a unique id for each chunk: "<doi>-<row index>"
    ids = [f"{x['doi']}-{idx}" for idx, x in batch.iterrows()]
    # get text to embed
    texts = batch['text'].tolist()
    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [{'text': text} for text in texts]
    # add to Pinecone; ids, embeds, and metadata are aligned by position
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
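
An upsert loop like this depends on the IDs, embeddings, and metadata staying aligned within every batch. A small helper can make that requirement explicit. This is a sketch under assumptions: `make_batches` is a hypothetical name, the data below is placeholder data, and no Pinecone call is made.

```python
def make_batches(ids: list, vectors: list, metadata: list, batch_size: int = 100):
    """Yield lists of (id, vector, metadata) tuples, batch_size at a time."""
    if not (len(ids) == len(vectors) == len(metadata)):
        raise ValueError("ids, vectors, and metadata must have the same length")
    records = list(zip(ids, vectors, metadata))
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Placeholder data standing in for real chunk IDs, embeddings, and text
demo_ids = [f"doc-{n}" for n in range(5)]
demo_vectors = [[float(n), 0.0] for n in range(5)]
demo_metadata = [{"text": f"chunk {n}"} for n in range(5)]

demo_batches = list(make_batches(demo_ids, demo_vectors, demo_metadata, batch_size=2))
print([len(batch) for batch in demo_batches])
# [2, 2, 1]
```

Each yielded batch is a list of tuples in the same `(id, values, metadata)` order that the notebook's `index.upsert(vectors=...)` calls build with `zip`.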


We can check that the vector index has been populated using `describe_index_stats` like before:

In [None]:
index.describe_index_stats()

#### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be asking specific questions from the documents we uploaded.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [None]:
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

Using this `vectorstore` we can already query the index and see if we have any relevant information based on text from one of the uploaded PDFs.

In [None]:
query = "What are two different types of fraud?"

vectorstore.similarity_search(query, k=3)

This returns a lot of text, and it's not clear which parts are relevant. Fortunately, our LLM can parse this information much faster than we can. All we need to do is connect the output from our `vectorstore` to our `chat` chatbot, using the same logic as earlier.

In [None]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

Using this we produce an augmented prompt:

In [None]:
print(augment_prompt(query))

There is still a lot of text here, so let's pass it onto our chat model to see how it performs.

In [None]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat(messages)

print(res.content)

# Try it out for yourself
Enter your finance and accounting question here as you would in your favorite LLM-driven Chatbot.


In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Initialize the list to store messages
messages = []

# Function to augment the prompt
def augment_prompt(query: str):
    results = vectorstore.similarity_search(query, k=3)
    source_knowledge = "\n".join([x.page_content for x in results])
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

# Function to handle button click
def on_button_click(b):
    clear_output()
    display(textbox, button)

    query = textbox.value  # User input
    augmented_query = augment_prompt(query)  # Augment the prompt

    prompt = HumanMessage(content=augmented_query)  # Create new user prompt
    messages.append(prompt)  # Add to existing messages

    res = chat(messages)  # Get the response
    messages.append(res)  # Keep the reply in the conversation history

    print(res.content)  # Display the response

# Create a text box for user input
textbox = widgets.Text(
    value='',
    placeholder='Enter your prompt here',
    description='Prompt:',
    disabled=False
)

# Create a button to submit the prompt
button = widgets.Button(description="Submit")

# Display textbox and button
display(textbox, button)

# Attach the button click event to the function
button.on_click(on_button_click)
