# Setting Up RAG Chain with Memory Integration

## Setup and Configuration


### Install Necessary Packages

In [16]:
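# Install required packages (uncomment to run).
# langchain provides the chain helpers used later; pypdf is required by PyPDFLoader.
# ! pip install langchain langchain_openai langchain_community langchain_weaviate pypdf redis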

### Set up API keys

In [17]:
import warnings

# Suppress specific warning types
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)


In [18]:
import getpass
import os

# Prompt for the API keys and store them as environment variables
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Hugging Face Hub API token: ")


# Document Processing and Vector Storage

## Import Packages

In [19]:
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_community.embeddings import HuggingFaceHubEmbeddings


## Initialize Weaviate Client

In [20]:
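# Connect to a local Weaviate instance (v4 client API, as required by langchain_weaviate)
weaviate_client = weaviate.connect_to_local()

# Clean up: this deletes ALL existing collections in the instance
weaviate_client.collections.delete_all()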


## Load and Process PDF Documents

In [21]:
# Load PDF and split into documents
loader = PyPDFLoader("./data.pdf")
docs = loader.load_and_split()
docs[:3]


[Document(metadata={'source': './data.pdf', 'page': 0}, page_content='GPUSimilarity: \nSimilarity Searching a Billion Compounds in Real Time \nPat Lorton \nhttps://github.com/schrodinger/gpusimilarity'),
 Document(metadata={'source': './data.pdf', 'page': 1}, page_content='Original Idea, and some questions \nOriginal idea: \nComputational complexity of fingerprint similarity (tanimoto) is trivial, \nembarrassingly parallel, and involves no branching.  Limiting factor of a \nstraightforward brute force solution is compute power:  This sounds like \nsomething GPUs would be good at. \nThen I asked some questions.. \n2'),
 Document(metadata={'source': './data.pdf', 'page': 2}, page_content='Idea for architecture \nThis problem has been solved in different ways with various success using \ncomplex architectures involving intelligent chemical understanding.  This was \nrequired because a brute force attempt was simply too expensive in terms of \nRAM and compute requirements \nRe-examining th

## Generate Embeddings for Documents

In [22]:
# Initialize embeddings
embeddings = HuggingFaceHubEmbeddings()

# Generate embeddings for a test document
text = "This is a test document."
query_result = embeddings.embed_query(text)
print(query_result[:3])


[-0.048951830714941025, -0.03986202925443649, -0.021562786772847176]
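
The vector printed above is what the vector store compares at query time. As a minimal sketch of that comparison, the block below computes cosine similarity between two vectors in plain Python. Cosine is an assumption made for illustration: the metric actually used depends on how the Weaviate collection is configured. The toy vectors are hypothetical values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
doc_vec = [1.0, 0.0, 1.0]
query_vec = [1.0, 0.0, 0.0]

print(cosine_similarity(doc_vec, doc_vec))    # identical direction
print(cosine_similarity(doc_vec, query_vec))  # partially aligned
```

A vector compared with itself gives the maximum similarity, and orthogonal vectors give zero.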


## Initialize Vector Store

In [23]:

client = weaviate.connect_to_local()

# Initialize vector store with Weaviate
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client)

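# Perform similarity search (kept in a separate variable so the loaded docs are not overwritten)
query = "Similarity"
results = db.similarity_search(query)

# Search again, returning the top 3 matches together with their scores
results_with_scores = db.similarity_search_with_score(query=query, k=3)
results_with_scores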


[(Document(metadata={'page': 0.0, 'source': './data.pdf'}, page_content='GPUSimilarity: \nSimilarity Searching a Billion Compounds in Real Time \nPat Lorton \nhttps://github.com/schrodinger/gpusimilarity'),
  1.0),
 (Document(metadata={'page': 1.0, 'source': './data.pdf'}, page_content='Original Idea, and some questions \nOriginal idea: \nComputational complexity of fingerprint similarity (tanimoto) is trivial, \nembarrassingly parallel, and involves no branching.  Limiting factor of a \nstraightforward brute force solution is compute power:  This sounds like \nsomething GPUs would be good at. \nThen I asked some questions.. \n2'),
  0.6986278295516968),
 (Document(metadata={'page': 17.0, 'source': './data.pdf'}, page_content='Conclusions \n•Compute speeds and RAM prices have come down enough that brute-force \nis a reasonable method for a similarity search server.  It will only become \ncheaper. \n•Fingerprint folding with re-scoring is a reasonable method to deal with lower \nGPU mem

## Retrieval-Augmented Generation (RAG) Chain 

### Setup RAG Chain

In [30]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


In [24]:
retriever = db.as_retriever()


In [31]:
# Define the RAG prompt template
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Initialize LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Define the RAG chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
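
The chain above hands the retriever's list of documents directly to the prompt, so the `{context}` slot receives the stringified list, metadata included. A common alternative is to join only the page text first. The sketch below shows such a helper. `Doc` is a hypothetical stand-in for LangChain's `Document` so the block runs on its own, and `format_docs` is an illustrative name, not part of the original chain.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Two toy chunks; only the text survives, the metadata is dropped.
sample = [Doc("First chunk."), Doc("Second chunk.", {"page": 1})]
print(format_docs(sample))
```

With the real chain, this helper could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.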

In [25]:
# Wrap the chain in a small helper and test it
def rag(question):
    return rag_chain.invoke(question)

rag("explain the GPUSimilarity Searching in detail")


'GPUSimilarity Searching allows for similarity searching of a billion compounds in real-time using GPUs. The method involves folding fingerprints, merging and sorting results of chunk searches on the GPU, and utilizing brute force with better math for result retrieval. The approach takes advantage of decreasing compute speeds and RAM prices, making brute-force a viable option for similarity search servers.'

## Memory Management with Redis

## Let's try adding Memory

### Setup Redis for Chat History

In [None]:
from langchain_community.chat_message_histories import RedisChatMessageHistory


In [None]:
REDIS_URL = "redis://localhost:6379"


In [32]:


def get_message_history(user_id: str, conversation_id: str) -> RedisChatMessageHistory:
    """Return the Redis-backed history for one conversation of one user."""
    # The session ID combines both identifiers, so each conversation gets its own history
    return RedisChatMessageHistory(session_id=f"{user_id}:{conversation_id}", url=REDIS_URL)

def get_all_conversations(user_id: str, conversation_id: str) -> list:
    """Return every message of the given conversation as 'type: content' strings."""
    chat_history = get_message_history(user_id, conversation_id)
    return [f"{message.type}: {message.content}" for message in chat_history.messages]

get_all_conversations(user_id="666", conversation_id="1")


['human: Could you please let me know about fingerprints and how GPUsimilarity manages memory constraints',
 'ai: Fingerprints are generated using RDKit and stored in memory for similarity searches. To manage memory constraints, the idea of folding fingerprints to fit GPU memory and then re-screening results on the CPU is used. This technique allows for efficient processing while dealing with GPU memory limitations.',
 'human: Could you please let me know about fingerprints and how GPUsimilarity manages memory constraints',
 'ai: Fingerprints are generated using RDKit and stored in memory for similarity searches. To manage memory constraints, the approach involves folding fingerprints to fit GPU memory and then re-screening results on the CPU. This method helps optimize processing while addressing limitations in GPU memory capacity.']
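
To see how the composite `user_id:conversation_id` session key keeps conversations apart without a running Redis server, here is a minimal in-memory sketch. `InMemoryHistoryStore` is a hypothetical stand-in for Redis and the messages are toy data. Only the key format is taken from `get_message_history` above.

```python
from collections import defaultdict


class InMemoryHistoryStore:
    """Hypothetical stand-in for Redis: maps a session key to a list of messages."""

    def __init__(self):
        self._sessions = defaultdict(list)

    @staticmethod
    def session_key(user_id: str, conversation_id: str) -> str:
        # Same composite key format as get_message_history above.
        return f"{user_id}:{conversation_id}"

    def add(self, user_id, conversation_id, role, content):
        key = self.session_key(user_id, conversation_id)
        self._sessions[key].append((role, content))

    def dump(self, user_id, conversation_id):
        # .get avoids creating an empty entry for unknown sessions.
        key = self.session_key(user_id, conversation_id)
        return [f"{role}: {content}" for role, content in self._sessions.get(key, [])]


store = InMemoryHistoryStore()
store.add("666", "1", "human", "hello")
store.add("666", "1", "ai", "hi there")
store.add("666", "2", "human", "a different conversation")

print(store.dump("666", "1"))
print(store.dump("666", "2"))
```

Messages written under one key never appear under another, which is why the same user can hold several independent conversations.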

## Enhanced RAG Chain with Memory


### Setup History-Aware RAG Chain

In [27]:
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import ConfigurableFieldSpec
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI


In [28]:
# Define contextualization prompt
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# Create history-aware retriever
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

# Define QA prompt
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
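
The history-aware retriever created above only reformulates the question when there is chat history. With an empty history, the input goes to the retriever unchanged. The sketch below mirrors that branching in plain Python. It is a simplified illustration, not LangChain's implementation: `history_aware_retrieve`, `fake_reformulate`, and `fake_retrieve` are hypothetical names, and the stubs replace the LLM and the vector store.

```python
def history_aware_retrieve(question, chat_history, reformulate, retrieve):
    """Reformulate the question only when earlier turns exist, then retrieve."""
    if not chat_history:
        # No earlier turns: the question is already standalone.
        return retrieve(question)
    standalone_question = reformulate(chat_history, question)
    return retrieve(standalone_question)


def fake_reformulate(chat_history, question):
    # Stub for the LLM call: append the first human message as context.
    first_human = chat_history[0][1]
    return f"{question} (regarding: {first_human})"


def fake_retrieve(query):
    # Stub for the vector store: echo the query it received.
    return [f"docs for: {query}"]


print(history_aware_retrieve("What is folding?", [], fake_reformulate, fake_retrieve))

history = [("human", "Tell me about fingerprints"), ("ai", "They encode molecules.")]
print(history_aware_retrieve("How are they folded?", history, fake_reformulate, fake_retrieve))
```

The second call shows why this step matters: "they" is ambiguous on its own, so the query sent to the retriever carries the context from the earlier turn.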

In [33]:
# Create QA chain
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)


In [34]:

# Define conversational RAG chain with memory
conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_message_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
    history_factory_config=[
        ConfigurableFieldSpec(
            id="user_id",
            annotation=str,
            name="User ID",
            description="Unique identifier for the user.",
            default="",
            is_shared=True,
        ),
        ConfigurableFieldSpec(
            id="conversation_id",
            annotation=str,
            name="Conversation ID",
            description="Unique identifier for the conversation.",
            default="",
            is_shared=True,
        ),
    ],
)
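
Conceptually, the wrapper configured above loads the stored messages, passes them to the chain under `chat_history`, and then appends the new human and AI messages to the store. The sketch below imitates that loop with a plain list. It is a simplified illustration under that assumption, not LangChain's implementation. `invoke_with_history` and `echo_chain` are hypothetical names.

```python
def invoke_with_history(chain, history, user_input):
    """Call `chain` with the prior messages, then record the new exchange."""
    answer = chain({"input": user_input, "chat_history": list(history)})
    history.append(("human", user_input))
    history.append(("ai", answer))
    return answer


def echo_chain(payload):
    # Stub chain: report how many earlier messages it was given.
    return f"seen {len(payload['chat_history'])} earlier messages"


history = []
first = invoke_with_history(echo_chain, history, "hello")
second = invoke_with_history(echo_chain, history, "hello")

print(first)
print(second)
print(len(history))
```

Because every call appends a new human/AI pair, asking the same question twice stores it twice, which matches the repeated entries returned by `get_all_conversations`.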

In [35]:


# Example usage of the conversational RAG chain
user_id = "66b521e88d4963b1300eb61e"
session_id = "2e1d9dbb-230b-4689-b879-935d18244acc"

response1 = conversational_rag_chain.invoke(
    {"input": "let me know what the previous message was about"},
    config={"configurable": {"user_id": user_id, "conversation_id": session_id}}
)
print(response1["answer"])

response2 = conversational_rag_chain.invoke(
    {"input": "Could you please let me know about fingerprints and how GPUsimilarity manages memory constraints"},
    config={"configurable": {"user_id": "666", "conversation_id": "1"}}
)
print(response2["answer"])

# Print all conversations

The previous message discussed the challenges faced when trying to scale beyond a "toy" case of 20 million compounds. It mentioned issues such as the inability to fit everything inside a GPU's memory, difficulties in allocating multi-gigabyte contiguous blocks of RAM, and limitations in QByteArray's capacity. The message also touched upon the complexity of parallelizing contiguous arrays across multiple GPUs and the solution of slicing memory into chunks during database creation.
Fingerprints are generated using RDKit and stored in memory for similarity searches. To address memory constraints, GPUsimilarity folds fingerprints to fit GPU memory and then re-screens results on the CPU. This strategy optimizes processing efficiency while working within the limitations of GPU memory.


In [36]:
get_all_conversations(user_id="666", conversation_id="1")


['human: Could you please let me know about fingerprints and how GPUsimilarity manages memory constraints',
 'ai: Fingerprints are generated using RDKit and stored in memory for similarity searches. To manage memory constraints, the idea of folding fingerprints to fit GPU memory and then re-screening results on the CPU is used. This technique allows for efficient processing while dealing with GPU memory limitations.',
 'human: Could you please let me know about fingerprints and how GPUsimilarity manages memory constraints',
 'ai: Fingerprints are generated using RDKit and stored in memory for similarity searches. To manage memory constraints, the approach involves folding fingerprints to fit GPU memory and then re-screening results on the CPU. This method helps optimize processing while addressing limitations in GPU memory capacity.',
 'human: Could you please let me know about fingerprints and how GPUsimilarity manages memory constraints',
 'ai: Fingerprints are generated using RDKit 