# Building QA Systems with LangChain

Building a QA system consists of the following steps:
![](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

But let's start with a quick example to see things in action.

In [1]:
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

loader = TextLoader("nyc_text.txt")
index = VectorstoreIndexCreator().from_loaders([loader])

In [24]:
question = "How did New York City get its name?"
index.query(question)

' New York City was named after King Charles II of England, who granted the lands to his brother, the Duke of York.'

In this example we built a QA system with just three lines of code. LangChain does all of the above steps under the hood for you.

But we're here to study the internals of LangChain, so let's walk through each step in turn.

## 1. Load Documents

In [3]:
from langchain.document_loaders import TextLoader

loader = TextLoader("nyc_text.txt")
data = loader.load()

This returns a list of `Document` objects, each of which contains the text we want along with its metadata.

In [5]:
len(data)

1
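To make the shape of the loaded data concrete, here is a minimal sketch of what a loader for a plain text file produces. The `Doc` class and `load_text` function are hypothetical stand-ins written for illustration, not LangChain's own classes; they only mirror the idea of one document per file, carrying `page_content` plus a `source` entry in its metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(text: str, source: str) -> List[Doc]:
    """Mimic a text loader: wrap the whole text in a single document."""
    return [Doc(page_content=text, metadata={"source": source})]


# The sample text is a placeholder, not the contents of nyc_text.txt.
docs = load_text("Some text about New York City.", "nyc_text.txt")
print(len(docs), docs[0].metadata)
```

This is why `len(data)` is 1 above: the whole file is a single document until we split it.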

## 2. Transform Documents

With the docs loaded, the next step is to transform them into a format that is easier to work with. One common strategy is to split each document into smaller chunks that can be passed to the LLM.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter()
all_splits = text_splitter.split_documents(data)

In [9]:
len(all_splits)

45
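The idea behind recursive splitting can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it tries coarse separators first (paragraphs), falls back to finer ones (lines, words) only for pieces that are still too long, and ignores chunk overlap entirely. The function name `recursive_split` and the sample text are assumptions made for this example.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the limit allows.\n\nEnd."
chunks = recursive_split(text, chunk_size=25, separators=["\n\n", "\n", " "])
for chunk in chunks:
    print(repr(chunk))
```

Splitting on paragraph boundaries first tends to keep related sentences together, which is the main reason to prefer this over fixed-width cuts.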

## 3. Store

Now we can store the chunks somewhere they can be easily retrieved. If you're thinking about databases now, you're right. We will use a special kind of database called a vector database, which stores the vector representations of the chunks we just made. This makes it easier to retrieve the relevant ones.

In [11]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
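As a rough picture of what "storing vector representations" means, the sketch below builds a toy in-memory store. It assumes a bag-of-words count vector as the embedding, which is far cruder than the learned embeddings a model like OpenAI's produces; `build_vocab`, `embed`, and `store` are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Tuple


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct lowercase word a column index."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def embed(text: str, vocab: Dict[str, int]) -> List[float]:
    """Toy embedding: word counts over the vocabulary (unknown words ignored)."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


chunks = ["the duke of york", "new york was new amsterdam"]
vocab = build_vocab(chunks)

# The "vector store": each chunk kept next to its vector.
store: List[Tuple[str, List[float]]] = [(c, embed(c, vocab)) for c in chunks]
for chunk, vec in store:
    print(chunk, vec)
```

A real vector database adds persistence and fast approximate search on top of this pairing of text and vector.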

## 4. Retrieve

All the steps so far prepare and store your data so that, at query time, we can retrieve it and pass it to the LLM.

The first step at query time is to get the relevant documents, which we do by fetching the texts most similar to the query.

In [25]:
docs = vectorstore.similarity_search(question)
len(docs)

4
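Four documents come back because the search returns the top `k` matches and `k` defaults to 4. A minimal sketch of top-k retrieval by cosine similarity might look like this; the two-dimensional vectors and labels are made up for illustration, and a real store would use an index rather than scoring every entry.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          indexed: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[str]:
    """Return the k texts whose vectors are most similar to the query."""
    scored = ((cosine(query_vec, vec), text) for text, vec in indexed)
    return [text for _, text in heapq.nlargest(k, scored)]


# Hypothetical chunk labels with hand-picked 2-d vectors.
indexed = [
    ("naming", [1.0, 0.0]),
    ("food", [0.0, 1.0]),
    ("history", [0.9, 0.1]),
    ("sports", [0.1, 0.9]),
    ("transit", [0.5, 0.5]),
]
results = top_k([1.0, 0.0], indexed)
print(results)
```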

In [26]:
from langchain.retrievers import SVMRetriever

svm_retriever = SVMRetriever.from_documents(all_splits, OpenAIEmbeddings())
docs_svm = svm_retriever.get_relevant_documents(question)
len(docs_svm)



4

Other retrieval methods to explore:
- SVM retriever
- MMR (maximal marginal relevance)
- MultiQueryRetriever
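Of the options above, MMR is easy to sketch by hand. The version below is a plain-Python illustration of the standard greedy formulation, not LangChain's code: each round picks the candidate with the best trade-off between relevance to the query and redundancy with what was already picked. It assumes all vectors are unit length, so a dot product equals cosine similarity; the `mmr` function and the sample vectors are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr(query: Sequence[float],
        candidates: List[Sequence[float]],
        k: int,
        lambda_mult: float = 0.5) -> List[int]:
    """Greedy MMR: return indices of k candidates, balancing relevance and diversity."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = dot(query, candidates[i])
            redundancy = max(
                (dot(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are exact duplicates; candidate 2 is related but different.
candidates = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]

diverse = mmr(query, candidates, k=2, lambda_mult=0.3)
relevant_only = mmr(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevant_only)
```

With a low `lambda_mult` the duplicate is skipped in favour of the different chunk; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.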

## 5. Generate

Here we pass the retrieved texts to the LLM so that it can use them to answer the question.

In [27]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever()
)
result = qa_chain({"query": question})
result

{'query': 'How did New York City get its name?',
 'result': 'New York City was named in honor of the Duke of York, who would later become King James II of England. In 1664, King Charles II appointed the Duke as the proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control.'}

In [28]:
print(result["result"])

New York City was named in honor of the Duke of York, who would later become King James II of England. In 1664, King Charles II appointed the Duke as the proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control.
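Conceptually, the chain above "stuffs" all retrieved chunks into a single prompt before calling the LLM. The sketch below shows that step in isolation. The template wording and the `stuff_prompt` helper are assumptions for illustration and are not the library's exact default prompt.

```python
from typing import List

# Illustrative template; the real default prompt may be worded differently.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def stuff_prompt(chunks: List[str], question: str, template: str = TEMPLATE) -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


prompt = stuff_prompt(
    ["Chunk about the Duke of York.", "Chunk about New Amsterdam."],
    "How did New York City get its name?",
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason the documents were split into chunks earlier.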


You can also change the prompt easily, like this:

In [34]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. 
Answer as if you are a pirate and say "Arrr!" at the end.

{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain({"query": question})
result["result"]

'Arrr! New York City got its name in honor of the Duke of York, who later became King James II of England. Arrr!'
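A custom template still needs the `{context}` and `{question}` placeholders that the chain fills in. As a hedged sketch, the hypothetical helper below uses the standard library's `string.Formatter` to list a template's placeholders and fail early if one is missing; it is a convenience check written for this example, not part of LangChain.

```python
from string import Formatter
from typing import Iterable, Set


def template_variables(template: str) -> Set[str]:
    """Return the placeholder names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(template: str,
                   required: Iterable[str] = ("context", "question")) -> None:
    """Raise ValueError if the template lacks any required placeholder."""
    missing = set(required) - template_variables(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")


pirate_template = (
    "Use the following pieces of context to answer the question at the end.\n"
    'Answer as if you are a pirate and say "Arrr!" at the end.\n\n'
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)
check_template(pirate_template)  # passes silently
print(sorted(template_variables(pirate_template)))
```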

If you want to retrieve the source documents as well, set `return_source_documents=True`:

In [36]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
result = qa_chain({"query": question})
print(len(result['source_documents']))

list(result.keys())

4


['query', 'result', 'source_documents']

It is also helpful to return citations. In LangChain, you can do that with `RetrievalQAWithSourcesChain`:

In [37]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever()
)

result = qa_chain({"question": question})
result

{'question': 'How did New York City get its name?',
 'answer': 'New York City was named in honor of the Duke of York, who would become King James II of England. It was named in 1664 when England seized the territory of New Netherland from Dutch control. The city was originally called New Amsterdam but was renamed New York after it came under British control. \n',
 'sources': 'nyc_text.txt'}
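If you already have the source documents, a citation string can also be derived directly from their metadata. The sketch below is one deterministic way to do that and is not how the chain above produces its `sources` field. It assumes each document is a plain dict with a `metadata["source"]` entry; `collect_sources` is a hypothetical helper.

```python
from typing import Dict, List


def collect_sources(docs: List[Dict]) -> str:
    """Return the distinct sources of the given documents, in first-seen order."""
    seen: List[str] = []
    for doc in docs:
        source = doc.get("metadata", {}).get("source")
        if source and source not in seen:
            seen.append(source)
    return ", ".join(seen)


# Four chunks from the same file, as in the example above.
retrieved = [
    {"page_content": f"chunk {i}", "metadata": {"source": "nyc_text.txt"}}
    for i in range(4)
]
citations = collect_sources(retrieved)
print(citations)
```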

## 6. Converse

This is actually pretty neat, and it is an extension of the basic RAG pipeline: the ability to have a conversation with the document. Other systems can implement this easily.

In [38]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history", 
    return_messages=True
)
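A buffer memory simply keeps every turn of the conversation in order so it can be replayed to the model later. The class below is a hypothetical minimal version written to show the idea; it does not reproduce the library's interface.

```python
from typing import List, Tuple


class BufferMemory:
    """Hypothetical minimal chat memory: keeps every turn, in order."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str]] = []

    def save_turn(self, question: str, answer: str) -> None:
        """Record one question/answer exchange."""
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_text(self) -> str:
        """Render the history as plain text, one message per line."""
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


toy_memory = BufferMemory()
toy_memory.save_turn("How did New York City get its name?",
                     "It was named in honor of the Duke of York.")
print(toy_memory.as_text())
```

Since nothing is ever dropped, the history grows with every turn; long conversations eventually need truncation or summarisation.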

We use a different chain for this.

In [39]:
from langchain.chains import ConversationalRetrievalChain

retriever = vectorstore.as_retriever()
chat = ConversationalRetrievalChain.from_llm(
    llm, 
    retriever=retriever, 
    memory=memory
)

In [40]:
r = chat(
    {"question": "How did New York city get its name"}
)
r["answer"]

"New York City was named in honor of the Duke of York, who would later become King James II of England. In 1664, King Charles II appointed the Duke as proprietor of the former territory of New Netherland, including the city of New Amsterdam, when England seized it from Dutch control. The Duke of York's name was then given to the city, and it became known as New York."

In [41]:
r = chat(
    {"question": "tell me more about the Duke of York"}
)
r["answer"]

'The Duke of York mentioned in the context is James, who later became King James II of England. He was the younger brother of King Charles II. In 1664, King Charles II appointed James as the proprietor of the territory of New Netherland, which included the city of New Amsterdam (later renamed New York). James played a significant role in the English takeover of New Amsterdam from Dutch control. He was eventually deposed in the Glorious Revolution.'

In [42]:
r = chat(
    {"question": "what is New Netherland and New Amsterdam"}
)
r["answer"]

'New Netherland was a Dutch colony established in the early 17th century in what is now the northeastern United States. It encompassed the area that would later become New York, New Jersey, Delaware, and parts of Connecticut and Pennsylvania. New Amsterdam was the capital and main settlement of New Netherland, located on the southern tip of Manhattan Island. It was founded in 1625 and later renamed New York when it came under English control in 1664.'
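A follow-up such as "tell me more about the Duke of York" works because the conversational chain first rewrites the follow-up, together with the chat history, into a standalone question and only then runs retrieval. The sketch below illustrates that condensing step. The template wording, the `standalone_question` helper, and the fake LLM are all assumptions for illustration; a real run would call an actual model.

```python
from typing import Callable, List, Tuple

# Illustrative wording; the chain's real condense prompt may differ.
CONDENSE_TEMPLATE = (
    "Given the conversation so far and a follow-up question, "
    "rephrase the follow-up as a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def standalone_question(history: List[Tuple[str, str]],
                        follow_up: str,
                        llm: Callable[[str], str]) -> str:
    """Turn a follow-up into a question that makes sense without the history."""
    if not history:
        return follow_up  # first turn: nothing to condense
    lines = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)
    prompt = CONDENSE_TEMPLATE.format(history=lines, question=follow_up)
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    return " Who was the Duke of York that New York City is named after? "


history = [("How did New York city get its name",
            "It was named in honor of the Duke of York.")]
rewritten = standalone_question(history, "tell me more about him", fake_llm)
print(rewritten)
```

The rewritten question is what gets embedded and searched, so the retriever sees "Duke of York" even when the user only typed "him".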