# Using Retrieval-Augmented Generation to Search a Research Notes Database

Retrieval-augmented generation, or _RAG_, is a technique that supplies a large language model with additional context at query time, without fine-tuning or retraining. Grounding the prompt in retrieved documents helps the model give factual responses, which is a known weakness of models that rely only on their training data.

The goal of this project is to build a question-answering bot for questions about alternative data. To achieve this, we will use RAG to provide factual information to the language model. We will upload the alternative data news to a vector database and search it for context that is relevant to each question.

We will be using the following tools and models:
- [OpenAI](https://openai.com)'s `gpt-3.5-turbo` model for prompt completions
- OpenAI's `text-embedding-ada-002` model to create vector embeddings
- [Pinecone](https://www.pinecone.io/) as the vector database to store the embeddings
- [langchain](https://www.langchain.com/) as the tool to interact with OpenAI and Pinecone

The dataset used for this project is based on a real company news dataset. The news headlines themselves were generated with OpenAI text completion models.

## Before you begin

To get started with this project, you'll need a developer account for OpenAI and Pinecone. Follow the steps in the [getting-started.ipynb](https://app.datacamp.com/workspace/w/f1d996aa-0aaa-47e3-bd61-2b5b5a0fa558/edit/getting-started.ipynb) notebook to create an API key and store it in Workspace.

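This project needs two keys, `OPENAI_API_KEY` and `PINECONE_API_KEY`. The notebook cells below read them from a local secrets file of `KEY=value` lines. Setting them as environment variables works as well if you adapt those cells.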
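If you keep the keys in environment variables, a small helper can fail early with a readable message when one is missing. This is a minimal sketch: `require_env` is a hypothetical helper name, and the value set below is a placeholder, not a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder value for illustration only (not a real key)
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(require_env("EXAMPLE_API_KEY"))
```

Checking for the keys up front avoids running the earlier, slower cells only to fail at the first API call.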

## Setup

To perform this analysis, we need to install the following packages:

- `openai`: for interacting with OpenAI
- `pinecone-client`: for interacting with Pinecone
- `tiktoken`: a tokenizer that encodes strings into the tokens used by OpenAI models. It is useful for estimating the number of tokens used.
- `langchain`: the toolchain used to interact with OpenAI and Pinecone

### Instructions

Run the cell below to install the corresponding packages.

In [1]:
%%capture
# The command below installs specific versions of the packages
# Feel free to experiment with different versions
# However, this workspace has only been tested with these specific versions
!pip install pinecone-client==2.2.2 openai==0.28.0 tiktoken==0.5.1 langchain==0.0.291

In [2]:
import os
os.getcwd()

'c:\\Users\\ahmed\\Downloads'

In [3]:
# Set up secrets file

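def load_secrets(file_path):
    """Parse KEY=value lines from a file into a dict, skipping blanks and comments."""
    secrets = {}
    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            # Skip empty lines, comment lines, and lines without a key/value separator
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, value = line.split('=', 1)
            secrets[key.strip()] = value.strip()
    return secrets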

# Load your secrets
secrets_file_path = 'c:\\Users\\ahmed\\Downloads\\chatbot_secrets.env'  # Update this to your file's path
secrets = load_secrets(secrets_file_path)

## Task 1: Import the Research Notes Data

In [4]:
import pandas as pd

research_notes = pd.read_csv("normalized_rn.csv")

In [5]:
research_notes.columns

Index(['Unnamed: 0', 'gvkey', 'note', 'note_creation', 'data_source', 'conm'], dtype='object')

In [6]:
research_notes.head()

| Unnamed: 0.1 | Unnamed: 0 | gvkey | note | note_creation | data_source | conm |
|---|---|---|---|---|---|---|
| 0 | 0 | 1690 | Republican House Representative Lloyd Smucker ... | 2024-01-10 | Politician trading | APPLE INC |
| 1 | 1 | 2018 | On the 1st of April, BANK LEUMI LE ISRAEL B M ... | 2024-01-10 | Share buyback intention | BANK LEUMI LE ISRAEL B M |
| 2 | 2 | 2176 | The Republican House representative, Morris Br... | 2024-01-10 | Politician trading | BERKSHIRE HATHAWAY |
| 3 | 3 | 2435 | On the 13th of July, BROWN FORMAN CORP will la... | 2024-01-10 | Share buyback intention | BROWN FORMAN CORP |
| 4 | 4 | 2968 | Democrat House Representative Bradley Schneide... | 2024-01-10 | Politician trading | JPMORGAN CHASE & CO |


## Task 2: Create Documents from the Data

Later in this project, we will be creating vector embeddings for all of the rows in the `research_notes` DataFrame. Before we do so, we need to create [Document](https://docs.langchain.com/docs/components/schema/document) objects from the data in the DataFrame. To accomplish this, we can utilize the `DataFrameLoader` class provided by langchain, which allows us to create documents from a pandas DataFrame.

For the main content of the documents, we will create a summary string that includes relevant information about each research note/news headline. 

### Instructions

- Import `DataFrameLoader` from `langchain.document_loaders`
- Only keep the columns `page_content` and `data_source` in the DataFrame
- Use `DataFrameLoader` to load documents from the `research_notes` DataFrame into `docs`. Use `"page_content"` as the `page_content_column`.
- Print the first 3 documents and the total number of documents

In [7]:
# Import DataFrameLoader
from langchain.document_loaders import DataFrameLoader

# Create page content column
research_notes["page_content"] = "Company: " + research_notes["conm"] + "\n" + \
                         "Note: " + research_notes["note"] + "\n" + \
                         "Source: " + research_notes["data_source"]

# Drop all columns except for page_content and data_source
research_notes = research_notes[["page_content", "data_source"]]

# Load the documents from the dataframe into docs
docs = DataFrameLoader(
    research_notes,
    page_content_column="page_content",
).load()

# Print the first 3 documents and the number of documents
print(f"First 3 documents: {docs[:3]}")
print(f"Number of documents: {len(docs)}")

First 3 documents: [Document(page_content='Company: APPLE INC\nNote: Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD.\nSource: Politician trading', metadata={'data_source': 'Politician trading'}), Document(page_content='Company: BANK LEUMI LE ISRAEL B M\nNote: On the 1st of April, BANK LEUMI LE ISRAEL B M will initiate a share buyback program with the goal of repurchasing 700M (ILS) from the market.\nSource: Share buyback intention', metadata={'data_source': 'Share buyback intention'}), Document(page_content='Company: BERKSHIRE HATHAWAY\nNote: The Republican House representative, Morris Brooks, disclosed a purchase of BERKSHIRE HATHAWAY BRK/B with an estimated value of around 32.5K USD.\nSource: Politician trading', metadata={'data_source': 'Politician trading'})]
Number of documents: 346


## Task 3: Estimate the Cost of Embedding

We're going to be using OpenAI to calculate [vector embeddings](https://platform.openai.com/docs/guides/embeddings/embeddings) of the document texts. Creating embeddings is a form of dimensionality reduction, where we assign the text to a point in an N-dimensional space. Texts that are semantically close to each other should end up being close to each other in the N-dimensional space.
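To make "close to each other" concrete, the sketch below computes cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are made up for illustration. Real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only
buyback_note = [0.9, 0.1, 0.0]
repurchase_note = [0.8, 0.2, 0.1]
politician_note = [0.0, 0.2, 0.9]

# The two buyback-like vectors point in nearly the same direction
print(cosine_similarity(buyback_note, repurchase_note))
# The unrelated vector scores much lower
print(cosine_similarity(buyback_note, politician_note))
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.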

Luckily, OpenAI has several models that are trained to calculate these kinds of embeddings, so we don't have to do that ourselves. Of course, a cost is associated with this. You can derive the cost from the [pricing page of OpenAI](https://openai.com/pricing).

The calculation is based on the number of _tokens_ in the text. All text is encoded into tokens before OpenAI models process it. For English text, a token corresponds to roughly 4 characters on average. However, we can calculate the exact number of tokens for a string of text by using the `tiktoken` package.

The goal of this task is to calculate the number of tokens in the documents, to then extrapolate the estimated cost.
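Before running the exact count, a character-based estimate can serve as a quick sanity check. The sketch below is a heuristic only: `rough_token_estimate` is a hypothetical helper, and the default ratio of 4 characters per token is a commonly quoted figure for English text that you can change through the parameter.

```python
import math


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate the token count from the character length (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


sample = "Company: APPLE INC\nSource: Politician trading"
print(rough_token_estimate(sample))
```

If the exact `tiktoken` count differs wildly from this estimate, that is a hint to double-check which text is actually being encoded.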

### Instructions

- Import `tiktoken`
- Create the encoder, use the `"cl100k_base"` encoder. This is the encoder used by OpenAI to calculate the embeddings for text using the `text-embedding-ada-002` model.
- Create a list that contains the number of tokens for each document
- Calculate the estimated cost: the sum of all tokens, divided by 1000, multiplied by $0.0001

In [8]:
# Import tiktoken
import tiktoken

# Create the encoder
encoder = tiktoken.get_encoding("cl100k_base")

# Create a list containing the number of tokens for each document
tokens_per_doc = [len(encoder.encode(doc.page_content)) for doc in docs]

# Show the estimated cost, which is the total number of tokens divided by 1000, times $0.0001
total_tokens = sum(tokens_per_doc)
cost_per_1000_tokens = 0.0001
cost = (total_tokens / 1000) * cost_per_1000_tokens
cost

0.0022922999999999997
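As a quick check on the result above, the arithmetic can be run in reverse to recover the total token count and an average per document. The numbers are taken from the outputs in this notebook (346 documents, a price of $0.0001 per 1000 tokens). The variable names are just for this sketch.

```python
cost = 0.0022923               # estimated cost in USD, from the output above
cost_per_1000_tokens = 0.0001  # price used in the calculation above
num_docs = 346                 # number of documents reported in Task 2

# Invert cost = (total_tokens / 1000) * cost_per_1000_tokens
total_tokens = round(cost / cost_per_1000_tokens * 1000)
avg_tokens_per_doc = total_tokens / num_docs

print(total_tokens)
print(round(avg_tokens_per_doc, 1))
```

That works out to 22,923 tokens in total, or about 66 tokens per note, which is plausible for notes of one or two sentences.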

## Task 4: Create the Index on Pinecone

It looks like calculating the embeddings is not going to be too expensive. It's always smart to get a rough estimate of the number of tokens used, so you have an idea of the cost of calculating the embeddings using OpenAI.

Now we're ready to create the index on Pinecone. An [index in Pinecone](https://docs.pinecone.io/docs/indexes) can be used to store vectors. You can compare an index in Pinecone to a table in SQL: it stores information about one type of object.

In a later task, we'll create vectors from the documents we just created using OpenAI's second-generation embedding model. We need to settle on the embedding model now, because the index must be created with that model's output dimension. For `text-embedding-ada-002`, this is `1536` dimensions ([source](https://platform.openai.com/docs/guides/embeddings/second-generation-models)).
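Because the index dimension is fixed at creation time, it can help to verify a vector's length before sending it anywhere. The following is a minimal sketch with a hypothetical helper, `check_dimension`. It does not call Pinecone or OpenAI.

```python
EXPECTED_DIMENSION = 1536  # output size of text-embedding-ada-002, per the text above


def check_dimension(vector, expected=EXPECTED_DIMENSION):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected:
        raise ValueError(f"Expected {expected} dimensions, got {len(vector)}")
    return True


# A dummy vector of the right length passes the check
print(check_dimension([0.0] * 1536))
```

A mismatch usually means the index was created for a different embedding model than the one producing the vectors.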

At the end of this task, you should be able to find your new index in the [Pinecone UI](https://app.pinecone.io/).

![Pinecone UI](pinecone_ui.png)

### Instructions

- Import `os` and `pinecone`
- Use `.init` to initialize the Pinecone client with your `PINECONE_API_KEY` (read here from the `secrets` dictionary loaded during setup). Use the `"gcp-starter"` environment on Pinecone.
- Print all the indexes on Pinecone by using `.list_indexes` on the client.
- Use `.create_index` to create an index, but only if it does not exist yet. The metric we'll use is `"cosine"`, and as we mentioned above, the embeddings will have `1536` dimensions.

In [28]:
# Import os and pinecone
import os
import pinecone

# Initialize pinecone using the `PINECONE_API_KEY` variable. Use the gcp-starter environment
pinecone.init(
    api_key=secrets["PINECONE_API_KEY"],
    environment="gcp-starter"
)

# Print the indexes
print(pinecone.list_indexes())

index_name = "research-notes-test"

# First check that the given index does not exist yet
if index_name not in pinecone.list_indexes():
    # Create the index if it does not exist
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,
    )

[]


## Task 5: Fill the Index with the Documents

Now that we have the vector index at our disposal, it's time to populate it with some vectors. In this task, we'll need to:

1. Generate vector embeddings for all documents in `docs`. We'll utilize OpenAI for this purpose. langchain provides a convenient helper for this task, `langchain.embeddings.openai.OpenAIEmbeddings`, which generates embeddings with the `text-embedding-ada-002` model.
2. Populate the vector index in Pinecone with these embeddings. Fortunately, langchain also offers assistance with this through the [`langchain.vectorstores.Pinecone`](https://python.langchain.com/docs/integrations/vectorstores/pinecone) helper.

These two steps can be combined using the convenient helper method `.from_documents` of the `Pinecone` class. This method accepts the documents and an embedding model as input, calculates the embeddings, and uploads them to Pinecone. We will also introduce some control flow to the code to ensure we do not add data to the Pinecone index if it already contains data. To achieve this, we can make use of the `.from_existing_index` method of `Pinecone`.

In addition to storing vectors, Pinecone allows the storage of additional metadata. The langchain helpers automatically assume that vectors should be created from the `page_content` property of each `Document`. All other properties will be included as metadata.
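The split between vector and metadata can be pictured with plain Python. The sketch below is an illustration, not langchain's implementation: `NoteDocument` is a hypothetical stand-in for a `Document`, `fake_embed` is a placeholder rather than a real model, and the record is shaped like the id/values/metadata dictionaries that Pinecone stores.

```python
from dataclasses import dataclass, field


@dataclass
class NoteDocument:
    """Hypothetical stand-in for a langchain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_record(doc_id, doc, embed):
    """Build an id/values/metadata record from a document."""
    return {
        "id": doc_id,
        "values": embed(doc.page_content),  # the vector comes from page_content only
        "metadata": dict(doc.metadata),     # everything else travels as metadata
    }


def fake_embed(text):
    """Placeholder embedding: NOT a real model, just two numbers."""
    return [float(len(text)), float(text.count(" "))]


doc = NoteDocument(
    page_content="Company: APPLE INC\nSource: Politician trading",
    metadata={"data_source": "Politician trading"},
)
print(to_record("note-0", doc, fake_embed))
```

Metadata stored this way can later be used to filter or label search results without affecting the similarity score.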

You can verify that everything has worked correctly by accessing the index in the Pinecone UI.

### Instructions

- Import `OpenAIEmbeddings` from `langchain.embeddings.openai` and `Pinecone` from `langchain.vectorstores` and `Index` from `pinecone.index`
- Create the `embeddings` object, which should be an instance of `OpenAIEmbeddings`. The defaults are good to go here.
- Use `Pinecone.from_documents` to fill up the vector index on Pinecone using the given documents and embeddings object. This will take a while to run, as it will automatically calculate embeddings from all of the `page_content` properties of the documents, and upload that along with metadata to Pinecone. Assign the result to `docsearch`.
   - Some control flow code is already provided for you. It uses the existing index if it already contains vectors, which avoids filling it twice.
- Test out the `docsearch` vector database object, by calling `.as_retriever().get_relevant_documents` with a given question. This will first create a [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/) from the vector database, and then use that to match the most similar documents in the database.

In [44]:
# Import OpenAIEmbeddings, Pinecone and Index
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from pinecone.index import Index

# Create the embeddings object
embeddings = OpenAIEmbeddings(openai_api_key=secrets["OPENAI_API_KEY"])

index = Index(index_name)

# Check if there is already some data in the index on Pinecone
if index.describe_index_stats()['total_vector_count'] > 0:
    # If there is, use from_existing_index to use the vector store
    docsearch = Pinecone.from_existing_index(index_name, embeddings)
else:
    # If there is not, use from_documents to fill the vector store
    docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)

question = "What has been happening with apple inc?"
    
# Use the vector database as a retriever and get the relevant documents for a question
docsearch.as_retriever().get_relevant_documents(question)

[Document(page_content='Company: APPLE INC\nNote: Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD.\nSource: Politician trading', metadata={'data_source': 'Politician trading'}),
 Document(page_content='Company: APPLE INC\nNote: Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD.\nSource: Politician trading', metadata={'data_source': 'Politician trading'}),
 Document(page_content='Company: ALPHABET INC\nNote: John Curtis, a member of the Republican House, recently announced the sale of ALPHABET INC GOOGL stocks for approximately 32.5K USD at a press conference.\nSource: Politician trading', metadata={'data_source': 'Politician trading'}),
 Document(page_content='Company: ALPHABET INC\nNote: John Curtis, a member of the Republican House, recently announced the sale of ALPHABET INC GOOGL stocks for approximately 32.5K USD at a press c

In [56]:
question = "What is the politician trading activity in Apple?"
docsearch.as_retriever().get_relevant_documents(question)[3]

Document(page_content='Company: SLACK TECHNOLOGIES INC\nNote: House Democrat Nancy Pelosi has announced a purchase of SLACK TECHNOLOGIES INC WORK, which is estimated to be worth around 274.9K US dollars.\nSource: Politician trading', metadata={'data_source': 'Politician trading'})
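The first retrieval output above lists the same note twice. One possible cause, stated here as an assumption, is that the index was filled more than once or that the source data contains repeated rows. Whatever the cause, duplicates can be dropped after retrieval. The sketch below uses a hypothetical `Retrieved` tuple in place of real `Document` objects and abbreviated example content.

```python
from collections import namedtuple

# Hypothetical stand-in for a retrieved langchain Document
Retrieved = namedtuple("Retrieved", ["page_content", "metadata"])


def dedupe_by_content(documents):
    """Keep the first occurrence of each page_content, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Abbreviated example content, for illustration only
results = [
    Retrieved("Company: APPLE INC (example)", {"data_source": "Politician trading"}),
    Retrieved("Company: APPLE INC (example)", {"data_source": "Politician trading"}),
    Retrieved("Company: ALPHABET INC (example)", {"data_source": "Politician trading"}),
]
print(len(dedupe_by_content(results)))
```

Removing duplicates before building the prompt leaves more room in the context window for distinct notes.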

## Task 6: Create Prompts for RAG


We require two types of [prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/):
- A template that demonstrates how the information in relevant documents is presented to the LLM
- A template that combines the context with the rest of the prompt

Some example prompt templates are provided in the sample code below, which you are free to edit. Notice that these example templates contain `=========` separators between different parts of the text. These kinds of delimiters are a common tactic to help the LLM distinguish between different parts of your input prompt.
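The way the two templates fit together can be approximated with plain string formatting. This is only a sketch: the templates are shortened, the notes are invented, and the blank-line separator between formatted documents is an assumption about how the pieces are joined.

```python
DOCUMENT_TEMPLATE = "{page_content}\n========="
QUESTION_TEMPLATE = "QUESTION: {question}\n=========\n{summaries}\nFINAL ANSWER:"

# Invented notes, for illustration only
notes = [
    "Company: APPLE INC\nNote: example note A",
    "Company: ALPHABET INC\nNote: example note B",
]

# Step 1: format every document with the document template
summaries = "\n\n".join(DOCUMENT_TEMPLATE.format(page_content=n) for n in notes)

# Step 2: drop the formatted documents and the question into the full prompt
prompt = QUESTION_TEMPLATE.format(question="Example question?", summaries=summaries)
print(prompt)
```

Printing the assembled prompt like this is a cheap way to confirm that each delimiter ends up where you expect before spending tokens on a real request.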

### Instructions

- Import `PromptTemplate` from `langchain.prompts`
- Some example prompt templates are already provided for you. You are free to adapt them at your will. There are two prompt templates:
  - `DOCUMENT_PROMPT`: this template shows how a summary text is created for each document. The properties between the curly brackets (`{`) are replaced with the properties of each `Document`.
  - `QUESTION_PROMPT`: this template creates the full prompt that is sent to the LLM. `question` is replaced by the question asked by the user, and `summaries` is replaced with the summary of each relevant document, created by the `DOCUMENT_PROMPT` template
- Create the `PromptTemplate` objects by using `PromptTemplate.from_template`. Call them `document_prompt` and `question_prompt`, respectively.

In [54]:
research_notes.iloc[0]["page_content"]

'Company: APPLE INC\nNote: Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD.\nSource: Politician trading'

In [69]:
# Import PromptTemplate
from langchain.prompts import PromptTemplate

# Read/adapt the prompts below at will
DOCUMENT_PROMPT = """{page_content}
========="""

QUESTION_PROMPT = """Given the following extracted parts of a database and a question, create a final answer with the research note.
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
Be concise and do not add additional text that does not exist in the note.

QUESTION: What is the politician trading activity in a given company?
=========
Company: APPLE INC
Note: Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD
=========
Company: SLACK TECHNOLOGIES INC
Note: House Democrat Nancy Pelosi has announced a purchase of SLACK TECHNOLOGIES INC WORK, which is estimated to be worth around 274.9K US dollars
=========
FINAL ANSWER: Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD

QUESTION: {question}
=========
{summaries}
FINAL ANSWER:"""

# Create prompt template objects
document_prompt = PromptTemplate.from_template(DOCUMENT_PROMPT)
question_prompt = PromptTemplate.from_template(QUESTION_PROMPT)

## Task 7: Chain Everything Together to Perform RAG

Finally, we have the vector index filled up with information and the prompt templates set up. That means we have everything we need to build a question-answering bot, which can use the information retrieved from Pinecone to answer questions about the news headlines.

We'll use the `gpt-3.5-turbo` model of OpenAI in order to provide a completion for the question prompt above.

Langchain provides a convenient concept, called [chains](https://python.langchain.com/docs/modules/chains/), that does some of the heavy lifting when you need to combine multiple AI systems into a single application. For the purpose of this project, we'll be using the `RetrievalQAWithSourcesChain` class. This chain will accept a `question` and a `retriever`. When asked a question, it will first use the retriever to retrieve relevant documents. Afterwards, it will combine the documents into a prompt and send it to the LLM to provide a completion.
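The control flow described above (retrieve, combine, complete) can be sketched without any external service. Everything in the block below is a stand-in: `keyword_retriever` replaces the vector search, `echo_llm` replaces the model, and `answer_question` is a hypothetical name, not the chain's actual implementation.

```python
def answer_question(question, retriever, llm, k=4):
    """Retrieve documents, stuff them into one prompt, and ask the LLM."""
    documents = retriever(question)[:k]
    context = "\n".join(documents)
    prompt = f"QUESTION: {question}\n{context}\nFINAL ANSWER:"
    return {"question": question, "answer": llm(prompt)}


# Stubs for illustration only: no network calls, no real model
NOTES = [
    "Company: APPLE INC\nNote: example note about a share sale",
    "Company: CITIGROUP INC\nNote: example note about a buyback",
]


def keyword_retriever(question):
    """Return notes sharing at least one word with the question."""
    words = set(question.lower().split())
    return [n for n in NOTES if words & set(n.lower().split())]


def echo_llm(prompt):
    """Pretend LLM that reports how many context blocks it received."""
    return f"received {prompt.count('Company:')} document(s)"


print(answer_question("Any buyback activity?", keyword_retriever, echo_llm))
```

Passing the retriever and the model in as plain callables keeps each step easy to swap out or test on its own, which is the same separation the chain provides.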

### Instructions

- Import `RetrievalQAWithSourcesChain` from `langchain.chains` and `ChatOpenAI` from `langchain.chat_models`
- Use `RetrievalQAWithSourcesChain` to create the chain to answer questions. Use the `.from_chain_type` method:
  - Set `chain_type` to `"stuff"`. This is the simplest type of chain, and will just stuff the document context in one prompt.
  - Set `llm` to an instance of `ChatOpenAI` with `model_name` set to `"gpt-3.5-turbo"` and `temperature` set to `0`.
  - Use the `PromptTemplate` objects you created above to pass to `chain_type_kwargs`
  - As a retriever, use the `docsearch.as_retriever` method you've seen before

In [62]:
question = "What is the politician trading activity in Apple?"

In [72]:
# Import RetrievalQAWithSourcesChain and ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI

# Create the QA bot LLM chain
qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    chain_type="stuff",
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=secrets["OPENAI_API_KEY"]),
    chain_type_kwargs={
        "document_prompt": document_prompt,
        "prompt": question_prompt,
    },
    retriever=docsearch.as_retriever(),
)


In [73]:
question = "What is the recent buyback activity?"
qa_with_sources(question)

{'question': 'What is the recent buyback activity?',
 'answer': 'On the 1st of July, BANK OF AMERICA CORP will commence a share buyback program, purchasing 20.6 billion USD from the market. Citigroup Inc. will also begin a share buyback program on the same date, aiming to retrieve 17.6 billion US dollars from the market.',
 'sources': ''}

In [74]:
question = "Any political trading activity in Apple?"
qa_with_sources(question)

{'question': 'Any political trading activity in Apple?',
 'answer': 'Republican House Representative Lloyd Smucker announced that he had sold shares in APPLE INC (AAPL) for approximately 32.5K USD. There is no other political trading activity in Apple.',
 'sources': ''}

In [79]:
question = "Any interesting insider activity related to Century Casinos?"
qa_with_sources(question)

{'question': 'Any interesting insider activity related to Century Casinos?',
 'answer': 'President/CEO, Peter Hoetzinger, from CENTURY CASINOS INC, reported a Buy of 100K shares worth 414.8K USD.',
 'sources': ''}
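Every result above comes back with an empty `sources` field. One plausible explanation, offered as an assumption rather than confirmed library behavior, is that the chain expects the model's reply to contain a sources marker, while the custom prompt ends at `FINAL ANSWER:` and never asks for one. The sketch below illustrates that idea with a hypothetical parser. It is not the chain's own code.

```python
import re


def split_answer_and_sources(reply: str):
    """Split a reply on a 'SOURCES:' marker; sources is '' if no marker exists."""
    parts = re.split(r"SOURCES?:", reply, maxsplit=1, flags=re.IGNORECASE)
    answer = parts[0].strip()
    sources = parts[1].strip() if len(parts) > 1 else ""
    return answer, sources


# A reply without a marker yields empty sources, as in the outputs above
print(split_answer_and_sources("Example answer."))
# A reply that includes the marker yields a non-empty sources string
print(split_answer_and_sources("Example answer.\nSOURCES: Politician trading"))
```

If populated sources matter for your use case, one option to try is asking for a `SOURCES:` line in the question prompt and including the `data_source` value in the document prompt.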