# Data-Augmented Question Answering

We want to build a personal learning assistant using LangChain. The components we need are:

- user question (input)
- role prompting so the model acts as a learning assistant
- relevant context retrieved from a data source
    - knowledge base/data source (we use lecture transcriptions for simplicity)
- vector database to store the data source and support semantic search
- personalized response with sources/citations (summarized output)


<a href="https://colab.research.google.com/github/dair-ai/maven-pe-for-llms-8/blob/main/demos/session-3/rag-qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
# update or install the necessary libraries
!pip install --upgrade openai
!pip install --upgrade langchain
!pip install --upgrade python-dotenv
!pip install --upgrade chromadb

In [2]:
import openai
import os
import IPython
from langchain.llms import OpenAI
from dotenv import load_dotenv

In [3]:
load_dotenv()

# API configuration
openai.api_key = os.getenv("OPENAI_API_KEY")

# for LangChain
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

First, we import the components needed to load, split, embed, and index the data that will serve as the source for augmenting generation.

In [4]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.cohere import CohereEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI

As our data source, we use a transcript of Andrej Karpathy's recent lecture on GPT.

In [5]:
# split text into chunks
with open('../data/kar-gpt.txt') as f:
    text_data = f.read()
    
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator=" ")
texts = text_splitter.split_text(text_data)

# embeddings obtained from OpenAI (open-source embedding models, e.g. from sentence-transformers, also work)
embeddings = OpenAIEmbeddings()
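
To make the splitting step concrete, here is a minimal pure-Python sketch of what separator-based chunking does conceptually: break the text on a separator, then greedily pack the pieces into chunks no longer than `chunk_size`. The function name `split_on_separator` is hypothetical, and this is an approximation of the idea, not LangChain's actual implementation of `CharacterTextSplitter`.

```python
def split_on_separator(text: str, chunk_size: int = 1000, separator: str = " ") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. A single piece longer than
    chunk_size is kept whole and becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        # cost of appending this piece, including a separator if the chunk is non-empty
        added = len(piece) + (len(separator) if current else 0)
        if current and current_len + added > chunk_size:
            # the current chunk is full: emit it and start a new one
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added = len(piece)
        current.append(piece)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "the quick brown fox jumps over the lazy dog"
chunks = split_on_separator(sample, chunk_size=15)
print(chunks)
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text, so a sentence that straddles a boundary is split between two chunks.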



In [6]:
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))])

In [7]:
query = "What is the course about?"
docs = docsearch.similarity_search(query)
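
Conceptually, the vector store embeds the query, compares it with every stored chunk embedding, and returns the closest chunks with their metadata. The sketch below illustrates that idea in plain Python. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and `similarity_search` here is a hypothetical helper, not Chroma's implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query: str, texts: list[str], metadatas: list[dict], k: int = 4):
    """Return the k (text, metadata) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(t)), t, m) for t, m in zip(texts, metadatas)]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest similarity first
    return [(t, m) for _, t, m in scored[:k]]


toy_texts = [
    "this course covers training transformer language models",
    "attention lets tokens communicate",
    "we tokenize text into integers",
]
toy_metadatas = [{"source": str(i)} for i in range(len(toy_texts))]
hits = similarity_search("what is the course about", toy_texts, toy_metadatas, k=1)
print(hits)
```

A real embedding model captures meaning rather than exact word overlap, which is what lets semantic search match a question to a passage that uses different wording.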

In [9]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
query = "What is the course about?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

{'output_text': ' The course is about training language models, specifically the transformer neural network, using Python and basic understanding of calculus and statistics. It also includes a code base and notebook for training models like GPT3. \nSOURCES: 108, 1, 107, 7'}
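
The `"stuff"` chain type places all retrieved chunks into a single prompt. The following is a rough sketch of that idea, assuming each chunk is rendered as a `Content:`/`Source:` pair. The helper `stuff_documents` and the dictionary keys are hypothetical, and the exact formatting LangChain uses internally may differ.

```python
def stuff_documents(docs: list[dict], question: str) -> str:
    """Concatenate every retrieved chunk, tagged with its source id, into one prompt."""
    summaries = "\n\n".join(
        f"Content: {doc['text']}\nSource: {doc['source']}" for doc in docs
    )
    return f"=========\n{summaries}\n=========\nQuestion: {question}"


# Illustrative chunks, not taken from the actual transcript
retrieved = [
    {"text": "This lecture trains a transformer language model.", "source": "1"},
    {"text": "We only need Python and some calculus.", "source": "7"},
]
prompt = stuff_documents(retrieved, "What is the course about?")
print(prompt)
```

Because every chunk travels with its source id, the model can cite those ids in its answer. The trade-off is that all chunks must fit in the model's context window at once.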

In [10]:
template = """
Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.

=========
{summaries}
=========

Given the summary above, help answer the following question from the user:

Question: {question}
"""


# create a prompt template
PROMPT = PromptTemplate(template=template, input_variables=["summaries", "question"])

# query 
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
query = "What is the course about?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

{'output_text': '\nAnswer: The course is about training language models, specifically the transformer neural network, using Python and basic understanding of calculus and statistics. It also covers fine tuning and other stages for tasks such as sentiment detection. The course includes a code base and notebook for training models similar to GPT-3. (SOURCES: 1, 7, 107, 108)'}
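
The two outputs above cite their sources in slightly different ways: one on a trailing `SOURCES:` line, the other in trailing parentheses. If you want to map the cited ids back to chunks, a small parser can separate the answer from the source list. This is a minimal sketch that assumes the sources always appear at the end of the text in one of those two forms. `split_answer_and_sources` is a hypothetical helper, not part of LangChain.

```python
import re


def split_answer_and_sources(output_text: str) -> tuple[str, list[str]]:
    """Split a model reply into (answer, list of cited source ids)."""
    # Matches a trailing "SOURCES: a, b" or "(SOURCES: a, b)" segment
    match = re.search(r"\(?\s*SOURCES:\s*([^)\n]*)\)?\s*$", output_text)
    if match is None:
        return output_text.strip(), []
    answer = output_text[: match.start()].strip()
    sources = [s.strip() for s in match.group(1).split(",") if s.strip()]
    return answer, sources


answer_a, sources_a = split_answer_and_sources(
    " It is about transformers. \nSOURCES: 108, 1, 107, 7"
)
answer_b, sources_b = split_answer_and_sources(
    "\nAnswer: It is about transformers. (SOURCES: 1, 7)"
)
print(answer_a, sources_a)
print(answer_b, sources_b)
```

Since model output is free text, a parser like this can fail on other formats, so it is worth handling the empty-sources case explicitly.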

Check out other chain types such as `map_reduce` and `refine` if the retrieved documents are too large to fit into a single prompt: https://docs.langchain.com/docs/components/chains/index_related_chains
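
As a rough illustration of the map-reduce idea, the sketch below queries a model once per chunk and then combines the partial answers in a final call. The `llm` argument is a hypothetical callable that takes a prompt string and returns a string. The prompts and the `fake_llm` stub are made up for this example and are not LangChain's actual prompts.

```python
from typing import Callable


def map_reduce_answer(chunks: list[str], question: str, llm: Callable[[str], str]) -> str:
    """Answer a question over many chunks without putting them all in one prompt."""
    # Map step: ask the model about each chunk independently
    partial_answers = [
        llm(f"Context: {chunk}\nQuestion: {question}\nAnswer using only this context.")
        for chunk in chunks
    ]
    # Reduce step: combine the partial answers in one final call
    combined = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\nQuestion: {question}\nWrite one final answer.")


# Stub model for demonstration: records each prompt and returns a numbered reply
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


final = map_reduce_answer(["chunk A", "chunk B", "chunk C"], "What is the course about?", fake_llm)
print(len(calls), final)
```

This pattern makes one model call per chunk plus one to combine, so it trades extra calls for the ability to handle more text than a single prompt can hold.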

In [11]:
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0.9)

response_prompt = PromptTemplate(
    input_variables=["response"],
    template="""You are a personal learning assistant. 
    Just take the answer from the previous response {response} and summarize it into one sentence.

    Agent:
    """
)

query = "What is the course about?"

response_chain = ( {"response": chain} | response_prompt | llm)

response_chain.invoke({"input_documents": docs ,"question": query})

'\nThe course covers training language models, particularly the transformer neural network, using Python and basic understanding of calculus and statistics, with a focus on fine tuning and other stages for tasks such as sentiment detection, and includes a code base and notebook for training models similar to GPT-3.'

Exercise: Add another chain that connects to the previous `response_chain` to create an agent that tries to be helpful and asks a follow-up question when that helps keep the conversation going.