# Building a RAG chain from Wikipedia

This notebook shows how to use ApertureDB as part of a Retrieval-Augmented Generation (RAG) LangChain pipeline. We use ApertureDB as a vector-based search engine to find documents that match the query, and then use a large language model to generate an answer based on those documents.

We already have a corpus of >600k paragraphs from the Simple English Wikipedia with associated embeddings provided by Cohere.
(If not, see [Ingesting Wikipedia into ApertureDB](./cohere_wikipedia_ingest.ipynb)).
We'll use that to answer natural-language questions.

![RAG workflow](images/RAG_Demo.png)

## Install Dependencies

In [25]:
%pip install --quiet aperturedb langchain langchain-core langchain-community langchainhub langchain-cohere

Note: you may need to restart the kernel to use updated packages.


## Choose a prompt

The prompt ties together the source documents and the user's query, and also sets some basic parameters for the chat engine.

In [1]:
from langchain_core.prompts import PromptTemplate
prompt = PromptTemplate.from_template("""You are an assistant for question-answering tasks. Use the following documents to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.  Additionally, you should always indicate which documents support each part of your answer.
Question: {question} 
{context} 
Answer:""")
print(prompt.template)

You are an assistant for question-answering tasks. Use the following documents to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.  Additionally, you should always indicate which documents support each part of your answer.
Question: {question} 
{context} 
Answer:
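To see how the placeholders are filled before the text reaches the LLM, here is a minimal sketch that uses plain `str.format` as a stand-in for `PromptTemplate`. The shortened template and the sample question and context are illustrative assumptions, not output from the real chain.

```python
# Stand-in for PromptTemplate: plain str.format on a shortened template.
# The template text and sample values are illustrative only.
template = (
    "Use the following documents to answer the question.\n"
    "Question: {question}\n"
    "{context}\n"
    "Answer:"
)

filled = template.format(
    question="What is the Euclidean Algorithm used for?",
    context="Document 1: (retrieved paragraph text goes here)",
)
print(filled)
```

In the real chain, `{context}` receives the numbered documents returned by the retriever, and `{question}` receives the user's query unchanged.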


For comparison, we're also going to ask the same questions of the language model without using documents.  This prompt is for a non-RAG chain.

In [2]:
from langchain_core.prompts import PromptTemplate
prompt2 = PromptTemplate.from_template("""You are an assistant for question-answering tasks. Answer the question from your general knowledge.  If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Answer:""")
print(prompt2.template)

You are an assistant for question-answering tasks. Answer the question from your general knowledge.  If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Answer:


## Cohere API Key

In order to continue with this demo, you will need to enter an API key for Cohere.
An evaluation API key can be obtained for free from [dashboard.cohere.com/api-keys](https://dashboard.cohere.com/api-keys).

In [3]:
import os
from getpass import getpass

os.environ['COHERE_API_KEY'] = getpass()
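When re-running the notebook, it can be convenient to prompt only if the key is not already set. The following is a minimal sketch: `ensure_env` is a hypothetical helper, not part of any library, and its `prompt_fn` parameter exists only so that the helper can be exercised without a terminal.

```python
import os
from getpass import getpass

def ensure_env(name, prompt_fn=getpass):
    """Return os.environ[name], prompting for it only if it is missing or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"{name}: ")
    return os.environ[name]

# In the notebook this would be called as: ensure_env("COHERE_API_KEY")
```

The same helper could be reused for any other secret that is passed through an environment variable.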


## Select an embedding scheme

Here we select the embedding scheme that matches the embeddings we have preloaded.

In [4]:
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-multilingual-v3.0")

emb = embeddings.embed_query("Hello, world!")
print(emb[:10], len(emb))

[0.0030612946, 0.046173096, 0.024490356, 0.032440186, -0.028900146, -0.026855469, -0.02810669, -0.03074646, -0.068481445, 0.033966064] 1024
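An embedding is only useful when compared with other embeddings. As a rough illustration, the sketch below computes cosine similarity between small made-up vectors. Cosine similarity is an assumption here: the notebook does not state which metric the descriptor set is configured with, and real embeddings have 1024 dimensions rather than three.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (illustrative values only).
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # same direction as the query
doc_far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query_vec, doc_close))
print(cosine_similarity(query_vec, doc_far))
```

A vector search returns the stored paragraphs whose embeddings score highest against the query embedding under the configured metric.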


## Select a vectorstore

Here we're using an instance of ApertureDB that has already been pre-loaded with a selection of paragraphs from Wikipedia.

First activate the connection to ApertureDB.

In [19]:
import os
from getpass import getpass
os.environ['APERTUREDB_JSON'] = getpass()

## Create vectorstore

Now we create a LangChain vectorstore object, backed by the ApertureDB instance that already holds our documents.

In [41]:
from langchain_community.vectorstores import ApertureDB
import logging
import sys
# date_strftime_format = "%Y-%m-%d %H:%M:%S"
# logging.basicConfig(stream=sys.stdout, level=logging.WARN, 
#                     format="%(asctime)s %(levelname)s %(funcName)s %(message)s", datefmt=date_strftime_format)

DESCRIPTOR_SET = "cohere_wikipedia_2023_11_embed_multilingual_v3"

vectorstore = ApertureDB(embeddings=embeddings, 
                 descriptor_set=DESCRIPTOR_SET)

## Create a retriever

The retriever is responsible for finding the most relevant documents in the vectorstore for a given query. Here we're using the "max marginal relevance" (MMR) retriever, which is a simple but effective way to find a diverse set of documents that are relevant to a query. For each query, we pass 10 documents to the LLM, but we do so by fetching the 100 nearest candidates and then selecting 10 of them using the MMR algorithm.

In [42]:
search_type = "mmr" # "similarity" or "mmr"
k = 10              # number of results used by LLM
fetch_k = 100       # number of results fetched for MMR
retriever = vectorstore.as_retriever(search_type=search_type, 
    search_kwargs=dict(k=k, fetch_k=fetch_k))
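The idea behind MMR selection can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the code that LangChain or ApertureDB actually runs: the function names, the toy 2-D vectors, and the `lambda_mult` weight of 0.5 are all choices made for this example. In the retriever above, the candidates would be the `fetch_k` nearest results and `k` of them would be kept.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Return the indices of k candidates chosen greedily by maximal marginal relevance.

    Each step picks the candidate with the best trade-off between relevance
    to the query and redundancy with the candidates already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is less
# relevant to the query but points in a different direction.
query = [1.0, 1.0]
candidates = [[1.0, 0.9], [1.0, 0.85], [0.3, 1.0]]

print(mmr_select(query, candidates, k=2))                   # diversity-aware pick
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to a plain similarity ranking, which corresponds to the `"similarity"` search type mentioned in the cell above.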

## Select an LLM engine

Here we're again using Cohere, but there's no need to use the same provider as we used for embeddings.

In [43]:
from langchain_cohere import ChatCohere

llm = ChatCohere(model="command-r")

## Build the chain

Now we put it all together.  The chain is responsible for taking a user query and returning a response.  It does this by first retrieving the most relevant documents using vector search, then using the LLM to generate a response.

For demonstration purposes, we're printing the documents that were retrieved, but in a real application you would probably want to hide this information from the user.

In [44]:
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n".join(f"Document {i}: " + doc.page_content for i, doc in enumerate(docs, start=1))


rag_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain) 
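To make the data flow concrete, the sketch below mimics the shape of the dictionary that moves through the chain, using stand-ins instead of the real retriever and LLM. `FakeDoc`, `fake_retriever`, and `fake_answer` are hypothetical names invented for this illustration; only `format_docs` is copied from the cell above.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Minimal stand-in for a LangChain Document (only page_content is used)."""
    page_content: str

def format_docs(docs):
    # Same formatting as in the chain: number the documents starting from 1.
    return "\n\n".join(f"Document {i}: " + doc.page_content for i, doc in enumerate(docs, start=1))

def fake_retriever(question):
    # Stand-in for the vector search: always returns the same two paragraphs.
    return [FakeDoc("Paris is the capital of France."),
            FakeDoc("Lille is a city in northern France.")]

def fake_answer(inputs):
    # Stand-in for prompt | llm | StrOutputParser: reports what it received.
    context_text = format_docs(inputs["context"])
    return f"(answer based on {len(context_text)} characters of context)"

question = "What is the largest city in France?"
# Step 1: retrieval and pass-through run side by side on the same input.
state = {"context": fake_retriever(question), "question": question}
# Step 2: the answer is added while the retrieved documents are kept.
state["answer"] = fake_answer(state)

print(format_docs(state["context"]))
print(sorted(state))
```

Keeping `context` in the final result is what lets the notebook display the source documents next to the generated answer.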

For comparison, this chain does not use RAG: it sends the question straight to the LLM with no retrieved documents.

In [45]:
plain_chain = (
  {"question": RunnablePassthrough()}
    | prompt2
    | llm
    | StrOutputParser()
)

## Look at some documents

In order to come up with questions that match the corpus, it might be a good idea to look at some random documents.

In [46]:
from aperturedb.CommonLibrary import create_connector
offset = 0
query = [ {"FindDescriptor": {"set": DESCRIPTOR_SET, "results": { "list": ["text", "lc_title"], "limit": 10}, "offset": offset, "sort": { "key": "uniqueid" } }} ]
client = create_connector()
response, _ = client.query(query)
for i, result in enumerate(list(response[0].values())[0]["entities"], start=1):
    print(f"{i}. {result['lc_title']}: {result['text']}")

1. Family: Foster families are families where a child lives with and is cared for by people who are not his or her biological parents.
2. Flag of the United States: When a new state joins the United States, a new flag is made with an extra star. The new flag is first flown on the 4th of July (Independence Day).
3. Arithmetic: Most people learn arithmetic in primary school, but some people do not learn arithmetic and others forget the arithmetic they learned. Many jobs require a knowledge of arithmetic, and many employers complain that it is hard to find people who know enough arithmetic. A few of the many jobs that require arithmetic include carpenters, plumbers, mechanics, accountants, architects, doctors, and nurses. Arithmetic is needed in all areas of mathematics, science, and engineering.
4. Prison: There are many books and poems about prisons or prison life, such as The Count of Monte Cristo by Alexandre Dumas, père and The Ballad of Reading Gaol by Oscar Wilde.
5. European Union
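To browse further into the corpus, the `offset` value can be stepped in increments of the page size. The sketch below only builds the query lists and does not contact a server. `find_descriptor_page` is a hypothetical helper, and the query layout simply mirrors the cell above rather than being checked against the ApertureDB query reference.

```python
def find_descriptor_page(descriptor_set, offset, limit=10):
    """Build a FindDescriptor query for one page, mirroring the query used above."""
    return [{
        "FindDescriptor": {
            "set": descriptor_set,
            "results": {"list": ["text", "lc_title"], "limit": limit},
            "offset": offset,
            "sort": {"key": "uniqueid"},
        }
    }]

DESCRIPTOR_SET = "cohere_wikipedia_2023_11_embed_multilingual_v3"

# Queries for the first three pages of ten documents each.
pages = [find_descriptor_page(DESCRIPTOR_SET, offset) for offset in (0, 10, 20)]
print([page[0]["FindDescriptor"]["offset"] for page in pages])
```

Each query list could then be passed to `client.query(...)` in the same way as the original query.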

## Run the chain

Now we can enter a query and see the response.

In [47]:
from ipywidgets import Combobox, Output
from IPython.display import Markdown, display, clear_output

queries = [
"What is the answer to the great question of life, the universe, and everything?",
"What is the largest city in France?",
"What is the queen of citadels in Lille?",
"What is the Elephant's Graveyard?",
"What can you see along the A61 in England?",
"What person has a connection with the city of Leicester?",
"What are the ancient wards of the city of London?",
"Who created mini marvels?",
"What happens at the end of Moby Dick?",
"What is the Euclidean Algorithm used for?",
"Can I use an EV to power my house?",
"Who is the main character in The Glass Castle?",
"What did Winston Smith do for work?",
"What schools are in Great Bend, Kansas?",
"Who was the first Catholic nominee for U.S. President for a major party?",
"What are the inside and outside styles of the Azerbaijan State Philharmonic Hall?",
"What were the sides in the Mali War?",
"What is the population of Kingston, New Hampshire?",
"What did Jacques Ellul say about mass media communication?",
"Who played Lucas Sinclair?",
"What is stannic oxide used for?",
]

def handler(event):
    if event.name != 'value' or input.value == "":
        return
        
    user_query = input.value
    input.value = ""

    with output:
        clear_output()
        run_query(user_query)
    
def run_query(user_query):
    display(Markdown(f"### User Query\n{user_query}"))
        
    nonrag_answer = plain_chain.invoke(user_query)
    display(Markdown(f"### Non-RAG Answer\n{nonrag_answer}"))

    rag_answer = rag_chain_with_source.invoke(user_query)
    display(Markdown("\n".join([
        f"### RAG Answer\n{rag_answer['answer']}",
        f"### Documents",
        *(f"{i}. **[{doc.metadata['title']}]({doc.metadata['url']})**: {doc.page_content}" for i, doc in enumerate(rag_answer["context"], 1) )
        ])))


interactive = True
if interactive:
    input = Combobox(
        placeholder="Enter a question...",
        options = queries,
        ensure_option=False,
        disabled=False,
        continuous_update=False,
    )
    
    output = Output()
        
    input.observe(handler)
    
    display(input)
    display(output)    
else: # For testing non-interactively
    user_query = queries[0]
    run_query(user_query)
