# Retrieval Agent with Pinecone & LLaMa for Notion

This notebook shows how to use a local LLaMa model to run retrieval-augmented generation (RAG) over a Pinecone index.
In this example, we embed a Notion workspace and answer questions about its content.

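Configure the following settings. The API keys and the Pinecone environment are read from `.env`; the LLaMa model path and the Notion download directory are set directly in the notebook: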
- Pinecone API key
- Pinecone Environment
- OpenAI API key
- LLaMa model path
- Notion Download Directory


In [None]:
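# pinecone.init() below exists only in pinecone-client 2.x, so we pin below 3.0.
# The langchain import paths used here are the older layout; pin an older langchain release if they fail.
%pip install -qU "pinecone-client<3" tqdm langchain llama-cpp-python python-dotenv openai tiktoken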

In [None]:
# Decide if you want to use LLaMa or OpenAI for the embedding
embedding_model = "openai" # "openai" "llama"

# Decide if you want to use LLaMa or OpenAI for the chat
chat_model = "llama" # "openai" "llama"

In [None]:
import os
from dotenv import find_dotenv, load_dotenv
# Load environment variables from .env file
load_dotenv(find_dotenv())

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone_api_key = os.getenv("PINECONE_API_KEY")
# find your environment next to the api key in pinecone console
pinecone_env = os.getenv("PINECONE_ENV")

# get our OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")

LLaMa_model_path="../../../llama/llama-2-7b-chat/ggml-model-f16_q4_0.gguf"
NOTION_DIRECTORY_PATH="../notion_data/Support_runbook"
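
Before going further, it can help to check that the `.env` file actually provided the settings read above. The following is a minimal sketch; the helper name `missing_env_vars` is our own and not part of any library.

```python
import os

def missing_env_vars(names, environ=os.environ):
    """Return the names from `names` that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]

# Variable names taken from the os.getenv calls in the cell above.
REQUIRED = ["PINECONE_API_KEY", "PINECONE_ENV", "OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Missing settings in .env: {', '.join(missing)}")
else:
    print("All required settings are present.")
```

If anything is reported missing, fix the `.env` file and re-run the cell that calls `load_dotenv`.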

In [None]:
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader(NOTION_DIRECTORY_PATH)
data = loader.load()

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
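
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-width chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); the hypothetical `chunk_text` helper only illustrates what the two parameters control.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as used above, a sentence that straddles a chunk boundary is cut in two; a small overlap trades some extra storage for more context per chunk.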


In [None]:
# count the number of chunks
num_chunks = len(all_splits)

print(f"Total number of chunks: {num_chunks}")

In [None]:
from langchain.embeddings import LlamaCppEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings

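if embedding_model == "llama":
    embeddings = LlamaCppEmbeddings(model_path=LLaMa_model_path)
elif embedding_model == "openai":
    embeddings = OpenAIEmbeddings()
else:
    # fail early instead of leaving `embeddings` undefined
    raise ValueError(f"Unknown embedding_model: {embedding_model!r}")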

In [None]:
import pinecone

pinecone.init(api_key=pinecone_api_key, environment=pinecone_env)
pinecone.whoami()

In [None]:
import time
# set dimension to 4096 for LLaMa 7B model and 1536 for OpenAI
dimension = 4096 if embedding_model == "llama" else 1536

index_name = 'notion-db-chatbot'

if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# we create a new index
pinecone.create_index(
    name=index_name,
    metric='dotproduct',
    dimension=dimension
)

# wait for index to be initialized
while not pinecone.describe_index(index_name).status['ready']:
    time.sleep(1)
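
The wait loop above never gives up if the index fails to become ready. A minimal sketch of the same polling idea with a timeout follows; `wait_until` is a hypothetical helper of ours, and the 300-second default is an arbitrary choice.

```python
import time

def wait_until(is_ready, timeout=300.0, interval=1.0):
    """Poll is_ready() until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Usage sketch (commented out because it needs the Pinecone objects from this notebook):
# ready = wait_until(lambda: pinecone.describe_index(index_name).status['ready'])
# if not ready:
#     raise TimeoutError(f"Index {index_name} was not ready in time")
```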

Connect to the index:

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

We should see that the new Pinecone index has a `total_vector_count` of `0`, as we haven't added any vectors yet.

Now we upsert the data to Pinecone:

In [None]:
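from langchain.vectorstores import Pinecone
from tqdm.auto import tqdm

# create the vector store from the first chunk
vectordb = Pinecone.from_documents(all_splits[:1], embeddings, index_name=index_name)

# add each remaining chunk exactly once (add_documents expects a list)
for i in tqdm(range(1, len(all_splits))):
    vectordb.add_documents([all_splits[i]])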

A quick note on performance:
in our runs, embedding 300 splits took ~20 min with OpenAI, while LLaMa embeddings took ~10 min per split.
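
One thing that may help on the upload side is sending several chunks per call instead of one. Below is a minimal sketch of a batching helper; `batched` is our own hypothetical function, and the batch size of 50 in the usage comment is an arbitrary value we have not benchmarked.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage sketch (commented out because it needs `vectordb` and `all_splits` from this notebook):
# for batch in batched(all_splits, 50):
#     vectordb.add_documents(batch)

print([len(b) for b in batched(list(range(7)), 3)])  # [3, 3, 1]
```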

Now that everything is indexed, we can check the number of vectors in our index:

In [None]:
# Check that it worked
index.describe_index_stats()

## Creating a chat agent using Pinecone and our LLaMa model

In [None]:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Initialize the Llama 2 model
llm = LlamaCpp(
    model_path=LLaMa_model_path
    , temperature=0.0
    , max_tokens=2000
    , n_gqa=8 # grouped-query attention factor (not a GPU count); it was needed for Llama 2 70B GGML files, so this 7B GGUF model can likely omit it
    , top_p=1
    , callback_manager=callback_manager # stream tokens to stdout as they are generated
    , verbose=True # Verbose is required to pass to the callback manager
    , f16_kv=True  # MUST be set to True, otherwise you will run into problems after a couple of calls
    , n_ctx=2048 # context window size
    , n_batch=512 # Should be between 1 and n_ctx; consider the amount of RAM of your Apple Silicon chip.
    , n_gpu_layers=1 # With Metal, setting this to 1 is enough.
)

In [None]:
from langchain.chains import ConversationalRetrievalChain

# Set up the Conversational Retrieval Chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    vectordb.as_retriever(search_kwargs={'k': 2}),
    return_source_documents=True
)

Now we can chat with our Notion content in a simple loop. Type `exit`, `quit`, or `q` to stop.

In [None]:
chat_history = []
while True:
    query = input('Prompt: ')
    if query.lower() in ["exit", "quit", "q"]:
        print('Exiting')
        break  # leave the loop without raising SystemExit in the notebook
    # the chain rewrites the question using chat_history, retrieves chunks, then answers
    result = qa_chain({'question': query, 'chat_history': chat_history})
    print('Answer: ' + result['answer'] + '\n')
    chat_history.append((query, result['answer']))
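
Because the chain was built with `return_source_documents=True`, each result also carries the retrieved chunks under `source_documents`. Here is a minimal sketch for displaying them; `format_sources` is a hypothetical helper, the `fake_result` below is a stand-in for a real chain result, and we assume each document exposes `page_content` and a `source` entry in `metadata`.

```python
from types import SimpleNamespace

def format_sources(result, max_chars=80):
    """Build one line per retrieved document: index, source path, short snippet."""
    lines = []
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        source = doc.metadata.get("source", "unknown")
        # collapse whitespace so the snippet fits on one line
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"[{i}] {source}: {snippet}")
    return "\n".join(lines)

# Stand-in for a real chain result, only to show the output format.
fake_result = {
    "answer": "placeholder answer",
    "source_documents": [
        SimpleNamespace(
            page_content="Restart the\nservice first.",
            metadata={"source": "runbook.md"},
        )
    ],
}
print(format_sources(fake_result))  # [1] runbook.md: Restart the service first.
```

Inside the chat loop, `print(format_sources(result))` after the answer would show which Notion pages the answer drew on.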

## Creating a chat agent using Pinecone and OpenAI GPT-3.5 Turbo

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

# chat completion llm
llm = ChatOpenAI(
    openai_api_key=openai_api_key,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)
# retrieval qa chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever()
)
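
The `k=5` setting on the conversational memory means only the five most recent exchanges are kept. A minimal standalone sketch of that sliding-window idea follows; the `WindowMemory` class is hypothetical and is not LangChain's implementation.

```python
from collections import deque

class WindowMemory:
    """Keep only the last k (user, ai) exchanges."""

    def __init__(self, k=5):
        # a deque with maxlen drops the oldest entry automatically
        self.turns = deque(maxlen=k)

    def save(self, user_message, ai_message):
        self.turns.append((user_message, ai_message))

    def load(self):
        return list(self.turns)

memory = WindowMemory(k=5)
for n in range(7):
    memory.save(f"question {n}", f"answer {n}")

print(len(memory.load()))   # 5
print(memory.load()[0])     # ('question 2', 'answer 2')
```

A bounded window keeps the prompt size predictable, at the cost of the agent forgetting anything said more than `k` exchanges ago.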

In [None]:
from langchain.agents import Tool

tools = [
    Tool(
        name='Knowledge Base',
        func=qa.run,
        description=(
            'use this tool when answering general knowledge queries to get '
            'more information about the topic'
        )
    )
]

Initialize the agent:

In [None]:
from langchain.agents import initialize_agent

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory=conversational_memory
)

## Run the conversational agent

In [None]:
# get the query from a user input
query = input('Prompt: ')

agent(query)

Some takeaways from this exercise:
- OpenAI is much faster than LLaMa (more than 10x)
- OpenAI is fairly cheap (embedding 4 MB of Notion data cost us ~$0.36; running GPT-3.5 Turbo is "virtually" free)
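
To sanity-check an embedding bill like the one above, a back-of-the-envelope estimate converts data size to tokens and tokens to dollars. This is only a sketch: `estimate_embedding_cost` is a hypothetical helper, and both the bytes-per-token ratio and the price used below are placeholder assumptions, not actual OpenAI pricing. Check the current price list before relying on the result.

```python
def estimate_embedding_cost(num_bytes, bytes_per_token, usd_per_1k_tokens):
    """Rough embedding cost: bytes -> tokens -> dollars."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    tokens = num_bytes / bytes_per_token
    return tokens / 1000 * usd_per_1k_tokens

# Placeholder inputs: ~4 MB of text, an assumed 4 bytes per token,
# and an assumed price per 1,000 tokens.
cost = estimate_embedding_cost(
    num_bytes=4_000_000,
    bytes_per_token=4,
    usd_per_1k_tokens=0.0004,
)
print(f"Estimated embedding cost: ${cost:.2f}")
```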