# Semantic Search with Redis LangChain & OpenAI

* Redis (Remote Dictionary Server) is an in-memory data structure store, used as a distributed, in-memory key–value database, cache and message broker, with optional durability.

In [40]:
# pip install redis
# pip install pypdf

### 1. Import libraries and set API Key

In [3]:
import os
import openai
import getpass
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores.redis import Redis
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain

In [4]:
p = getpass.getpass(prompt = '\nPlease enter your OpenAI API Secret Key (Input hidden):') 

os.environ['OPENAI_API_KEY'] = p
openai.api_key = os.environ["OPENAI_API_KEY"]

### 2. Load Documents file .txt

#### 2.1. Load file .txt and Embedding

In [3]:
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

In [4]:
docs[:3]

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source'

#### 2.2. Connect Redis DB

In [4]:
rds = Redis.from_documents(docs, embeddings, redis_url="redis://localhost:6379",  index_name='link')
rds.index_name

'link'

#### 2.3. Semantic search

In [9]:
query = "What did the president say about Ketanji Brown Jackson"
results = rds.similarity_search(query)
print(results[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


##### Call OpenAI

In [64]:
def get_similiar_docs(query, k=2, score=False):
  if score:
    similar_docs = rds.similarity_search_with_score(query, k=k)
  else:
    similar_docs = rds.similarity_search(query, k=k)
  return similar_docs

In [60]:
llm = OpenAI(model_name="text-davinci-003", temperature=0)

chain = load_qa_chain(llm, chain_type="stuff")

def get_answer(query):
  similar_docs = get_similiar_docs(query)
  answer = chain.run(input_documents=similar_docs, question=query)
  return answer

In [61]:
query = "What did the president say about Ketanji Brown Jackson"
answer = get_answer(query)
print(answer)

 The president said that Ketanji Brown Jackson is one of our nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.


#### 2.4. Add texts on Document

In [10]:
print(rds.add_texts(["Ankush went to Princeton"]))

['doc:link:a42838898c0745d19ee4dee1624638c6']


In [11]:
query = "Princeton"
results = rds.similarity_search(query)
print(results[0].page_content)

Ankush went to Princeton


#### 2.5. Load from existing index

In [12]:
# Load from existing index
rds = Redis.from_existing_index(embeddings, redis_url="redis://localhost:6379", index_name='link')

query = "What did the president say about Ketanji Brown Jackson"
results = rds.similarity_search(query)
print(results[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


#### 2.6. RedisVectorStoreRetriever

* Here we go over different options for using the vector store as a retriever.

* There are three different search methods we can use to do retrieval. By default, it will use semantic similarity.

In [None]:
retriever = rds.as_retriever()
docs = retriever.get_relevant_documents(query)
retriever = rds.as_retriever(search_type="similarity_limit")

In [16]:
# Here we can see it doesn't return any results because there are no relevant documents
retriever.get_relevant_documents("where did ankush go to college?")

[Document(page_content='Ankush went to Princeton', metadata={}),
 Document(page_content='One was stationed at bases and breathing in toxic smoke from “burn pits” that incinerated wastes of war—medical and hazard material, jet fuel, and more. \n\nWhen they came home, many of the world’s fittest and best trained warriors were never the same. \n\nHeadaches. Numbness. Dizziness. \n\nA cancer that would put them in a flag-draped coffin. \n\nI know. \n\nOne of those soldiers was my son Major Beau Biden. \n\nWe don’t know for sure if a burn pit was the cause of his brain cancer, or the diseases of so many of our troops. \n\nBut I’m committed to finding out everything we can. \n\nCommitted to military families like Danielle Robinson from Ohio. \n\nThe widow of Sergeant First Class Heath Robinson.  \n\nHe was born a soldier. Army National Guard. Combat medic in Kosovo and Iraq. \n\nStationed near Baghdad, just yards from burn pits the size of football fields. \n\nHeath’s widow Danielle is here 

#### RetrievalQA with OpenAI

In [6]:
qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name="text-davinci-003"), retriever=rds.as_retriever())

question = "What did the president say about Ketanji Brown Jackson"

result = qa({"query": question})
print("result ", result)

result  {'query': 'What did the president say about Ketanji Brown Jackson', 'result': " The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence."}


#### ConversationalRetrievalChain

In [41]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader

# load document
loader = PyPDFLoader("./2023_GPT4All_Technical_Report.pdf")
documents = loader.load()
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()
# create the vectorestore to use as the index
rds = Redis.from_documents(texts, embeddings, redis_url="redis://localhost:6379",  index_name='link')

# expose this index in a retriever interface
retriever = rds.as_retriever(search_type="similarity", search_kwargs={"k":2})
# create a chain to answer questions 
qa = ConversationalRetrievalChain.from_llm(OpenAI(model_name="text-davinci-003"), retriever)
chat_history = []
query = "what is the total number of AI publications?"
result = qa({"question": query, "chat_history": chat_history})
chat_history.append({'query':query, 'answer':result["answer"]})

In [42]:
result

{'question': 'what is the total number of AI publications?',
 'chat_history': [{'query': 'what is the total number of AI publications?',
   'answer': " I don't know."}],
 'answer': " I don't know."}

In [43]:
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query, "chat_history": chat_history})
chat_history.append({'query':query, 'answer':result["answer"]})

In [44]:
result

{'question': 'What did the president say about Ketanji Brown Jackson',
 'chat_history': [{'query': 'what is the total number of AI publications?',
   'answer': " I don't know."},
  {'query': 'What did the president say about Ketanji Brown Jackson',
   'answer': " The President said that Ketanji Brown Jackson is one of the nation's top legal minds, and that she will continue Justice Breyer's legacy of excellence."}],
 'answer': " The President said that Ketanji Brown Jackson is one of the nation's top legal minds, and that she will continue Justice Breyer's legacy of excellence."}

In [28]:
chat_history

[('what is the total number of AI publications?', " I don't know.")]

In [29]:
result['answer']

" The president said that Ketanji Brown Jackson is one of the nation's top legal minds and that she will continue Justice Breyer's legacy of excellence."

## 3. Load Document file .pdf

In [41]:
from langchain.document_loaders import PyPDFLoader

In [45]:
# load document
loader_pdf = PyPDFLoader("./2023_GPT4All_Technical_Report.pdf")
documents_pdf = loader_pdf.load()

# split the documents into chunks
text_splitter_pdf = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts_pdf = text_splitter_pdf.split_documents(documents_pdf)
# select which embeddings we want to use
embeddings_pdf = OpenAIEmbeddings()

In [46]:
rds_pdf = Redis.from_documents(documents_pdf, embeddings_pdf, redis_url="redis://localhost:6379",  index_name='link')
rds_pdf.index_name

'link'

In [None]:
# expose this index in a retriever interface
retriever = rds_pdf.as_retriever(search_type="similarity", search_kwargs={"k":2})
# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)
query = "what is the total number of AI publications?"
result = qa({"query": query})

In [50]:
query = "What was the cost of training the GPT4all model?"
retriever = rds_pdf.as_retriever()
docs = retriever.get_relevant_documents(query)
retriever = rds.as_retriever(search_type="similarity_limit")

In [52]:
# Here we can see it doesn't return any results because there are no relevant documents
retriever.get_relevant_documents("What was the cost of training the GPT4all model?")

[Document(page_content='(a) TSNE visualization of the final training data, ten-colored\nby extracted topic.\n(b) Zoomed in view of Figure 2a. The region displayed con-\ntains generations related to personal health and wellness.\nFigure 2: The final training data was curated to ensure a diverse distribution of prompt topics and model responses.\n2.1 Reproducibility\nWe release all data (including unused P3 genera-\ntions), training code, and model weights for the\ncommunity to build upon. Please check the Git\nrepository for the most up-to-date data, training\ndetails and checkpoints.\n2.2 Costs\nWe were able to produce these models with about\nfour days work, $800 in GPU costs (rented from\nLambda Labs and Paperspace) including several\nfailed trains, and $500 in OpenAI API spend.\nOur released model, gpt4all-lora, can be trained in\nabout eight hours on a Lambda Labs DGX A100\n8x 80GB for a total cost of $100 .\n3 Evaluation\nWe perform a preliminary evaluation of our model\nusing the

In [49]:
query = "What was the cost of training the GPT4all model?"
results_pdf = rds_pdf.similarity_search(query)
print(results_pdf[0].page_content)

(a) TSNE visualization of the final training data, ten-colored
by extracted topic.
(b) Zoomed in view of Figure 2a. The region displayed con-
tains generations related to personal health and wellness.
Figure 2: The final training data was curated to ensure a diverse distribution of prompt topics and model responses.
2.1 Reproducibility
We release all data (including unused P3 genera-
tions), training code, and model weights for the
community to build upon. Please check the Git
repository for the most up-to-date data, training
details and checkpoints.
2.2 Costs
We were able to produce these models with about
four days work, $800 in GPU costs (rented from
Lambda Labs and Paperspace) including several
failed trains, and $500 in OpenAI API spend.
Our released model, gpt4all-lora, can be trained in
about eight hours on a Lambda Labs DGX A100
8x 80GB for a total cost of $100 .
3 Evaluation
We perform a preliminary evaluation of our model
using the human evaluation data from the Self-
Instruc