# Document Assistant Using RAG




Let's start by downloading the model we need to build the RAG chain.

We download the Mistral-7B-OpenOrca model (GGUF format, `Q4_K_M` quantization) from the Hugging Face Hub using `huggingface_hub`:

In [None]:
!python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='mav23/Mistral-7B-OpenOrca-GGUF', filename='mistral-7b-openorca.Q4_K_M.gguf', local_dir='./LLM_Model')"

## Setting up the model
Define the LLM that we'll use as part of the workflow. We load the downloaded GGUF file locally through `LlamaCpp`.

In [106]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler


# Load the LLM so it runs locally

model = LlamaCpp(
    model_path="./LLM_Model/mistral-7b-openorca.Q4_K_M.gguf",
    n_batch=256,  # number of prompt tokens processed per batch
    n_ctx=2048,   # context window size, in tokens
    verbose=False,
)

llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


We can test the model by asking a simple question.

In [97]:
model.invoke("who was the first man to land on the moon")

'.\nMoon landing in 1969 by astronauts Neil Armstrong and Edwin E. "Buzz" Aldrin, Jr., and Michael Collins. Photo courtesy NASA.\nHe also led the Apollo program from 1967-1972 and was responsible for the successful launch of the Lunar Orbiter, which was a spacecraft that flew by the Moon to collect data on its surface. He also directed the Apollo 11 mission and the Apollo 15 mission, which were both successful lunar landing missions.\nIn 1973, Armstrong became the first person to walk on the moon when he walked on the surface of the Moon during the Apollo 11 mission.\nArmstrong was a mechanical engineer by training and earned his PhD from the Massachusetts Institute of Technology (MIT) in 1950. He then joined the US Navy as an engineering officer. In 1962, he transferred to MIT, where he became involved with the development of the Apollo program.\nIn the late 1950s and early 1960s, Armstrong worked on projects in aerod'

## Prompt template

We want to provide the model with some context and the question.

In [85]:
from langchain.prompts import ChatPromptTemplate

# Define the prompt template

template = """
You are an assistant that only answers questions based on the context provided. Strictly follow these rules:
- Do not add information or explanations not present in the context.
- If the context does not provide enough information to answer, respond only with: "I don't know."

Here is the context:
{context}

Question:
{question}

Answer:
"""
prompt = ChatPromptTemplate.from_template(template)


# Test the prompt output
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nYou are an assistant that only answers questions based on the context provided. Strictly follow these rules:\n- Do not add information or explanations not present in the context.\n- If the context does not provide enough information to answer, respond only with: "I don\'t know."\n\nHere is the context:\nMary\'s sister is Susana\n\nQuestion:\nWho is Mary\'s sister?\n\nAnswer:\n'

## Splitting the documents


Large Language Models support limited context sizes. The documents we are using are too long for the model to handle in a single prompt (we loaded it with `n_ctx=2048` tokens), so we need a different solution.
Since we can't use the entire document collection as the context for the model, a potential solution is to split the documents into smaller chunks. We can then invoke the model using only the chunks relevant to a particular question.


Let's start by loading all the documents in our directory:

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_huggingface import HuggingFaceEmbeddings

# downloading HuggingFace Embedding model

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# documents directory path
directory = "./documents"

# function to load the documents using langchain
def load_docs(directory):
    loader = DirectoryLoader(directory)
    return loader.load()

docs = load_docs(directory)



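There are many different ways to split a document. For this example, we'll use `RecursiveCharacterTextSplitter`, a simple splitter that breaks each document into chunks of at most a fixed size.

We split the documents into chunks of 1000 characters with an overlap of 50 characters. The overlap keeps a sentence that straddles a chunk boundary visible in both neighboring chunks.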
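To make the chunk size and overlap concrete, here is a minimal plain-Python sketch of fixed-size splitting. The helper `split_text` is hypothetical and is not what the notebook runs: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its chunks are usually a bit shorter than the limit.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_text(sample)

print([len(c) for c in chunks])
# The last 50 characters of one chunk are the first 50 of the next.
print(chunks[0][-50:] == chunks[1][:50])
```

With a step of 950 characters, each window starts 50 characters before the previous one ended, which is where the overlap comes from.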



## Finding the relevant chunks

Given a particular question, we need to find the relevant chunks from the documents to send to the model. Here is where the idea of **embeddings** comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept in a high-dimensional space. Embeddings have a simple characteristic: the projections of related concepts will be close to each other, while concepts with different meanings will lie far apart.

To provide the model with the most relevant chunks, we compute the similarity between the embedding of the question and the embedding of each chunk. We then select the chunks with the highest similarity to the question and use them as the context for the model.

The closer two embeddings are, the more similar the meaning of their texts.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to measure how close the question embedding is to each chunk embedding.
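As a rough illustration, cosine similarity can be computed in a few lines of plain Python. The three-dimensional vectors below are made-up stand-ins for real embeddings (the `all-MiniLM-L6-v2` model produces much longer vectors), and `cosine_similarity` is a hypothetical helper, not a function from the libraries used in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only.
query = [1.0, 0.0, 1.0]
related = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, related))    # close to 1: same direction
print(cosine_similarity(query, unrelated))  # 0: nothing in common
```

Because the score depends only on direction, a vector and a scaled copy of it are treated as equally similar to the query.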



## Setting up a Vector Store

We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use ChromaDB as the **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches. 
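Conceptually, a vector store keeps each chunk next to its embedding and returns the chunks whose embeddings score highest against a query embedding. The sketch below is a toy in-memory version using brute-force search; `ToyVectorStore`, its methods, and the hand-written vectors are all hypothetical. A real store such as ChromaDB relies on an approximate nearest-neighbor index instead of scanning every vector.

```python
import math
from dataclasses import dataclass, field


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class ToyVectorStore:
    """Brute-force vector store: fine for a handful of chunks, too slow at scale."""
    texts: list = field(default_factory=list)
    vectors: list = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> None:
        self.texts.append(text)
        self.vectors.append(vector)

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[float, str]]:
        """Return the k (score, text) pairs with the highest cosine similarity."""
        scored = [(_cosine(query_vector, v), t) for v, t in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Hand-written vectors standing in for real embeddings.
store = ToyVectorStore()
store.add("chunk about the gut", [0.9, 0.1, 0.0])
store.add("chunk about the airways", [0.1, 0.9, 0.0])
store.add("chunk about something else", [0.0, 0.1, 0.9])

results = store.search([0.8, 0.2, 0.0], k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```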


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# split the docs into chunks using the recursive character splitter
def split_docs(documents, chunk_size=1000, chunk_overlap=50):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(documents)

docs = split_docs(docs)

# use ChromaDB as the vector store and index the chunks in it

# collection_metadata switches the distance function of the embedding space
# from the default (squared L2 norm) to cosine; the key must be exactly "hnsw:space"
vector_store = Chroma.from_documents(docs, embeddings, collection_metadata={"hnsw:space": "cosine"})

## Connecting the vector store to the chain

We can use the vector store to find the most relevant chunks from the documents to send to the model.

We need to configure a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/). The retriever runs a similarity search in the vector store and returns the most similar documents to the next step in the chain.

We can get a retriever directly from the vector store we created before by calling `vector_store.as_retriever()`.

Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) class. The retriever fills the "context" key, while `RunnablePassthrough` forwards the original question unchanged under the "question" key, so the prompt receives a map with both keys.
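The data flow of that map can be sketched in plain Python. This is only an illustration of the idea, not LangChain's implementation: `build_inputs`, `run_chain`, `fake_retrieve`, and `echo_llm` are hypothetical names, and the template is a shortened version of the one defined earlier.

```python
from typing import Callable

TEMPLATE = (
    "Here is the context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:\n"
)


def build_inputs(question: str, retrieve: Callable[[str], list[str]]) -> dict:
    """Mimic the map step: both keys are computed from the same input question."""
    return {
        "context": retrieve(question),  # retriever branch: look up relevant chunks
        "question": question,           # passthrough branch: input forwarded as-is
    }


def run_chain(question: str, retrieve: Callable[[str], list[str]], llm: Callable[[str], str]) -> str:
    """Map -> prompt -> model, the same three stages as the chain."""
    inputs = build_inputs(question, retrieve)
    prompt_text = TEMPLATE.format(
        context="\n\n".join(inputs["context"]),
        question=inputs["question"],
    )
    return llm(prompt_text)


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for the vector store retriever: always returns one fixed chunk.
    return ["Mary's sister is Susana"]


def echo_llm(prompt_text: str) -> str:
    # Stand-in for the model: returns the prompt so we can inspect it.
    return prompt_text


result = run_chain("Who is Mary's sister?", fake_retrieve, echo_llm)
print(result)
```

The sketch joins the retrieved chunks into one string before filling the template. The chain in this notebook passes the retriever output to the prompt directly; joining each document's `page_content` in a similar way may produce a cleaner prompt, though that is a suggestion rather than something the notebook does.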

In [107]:
from langchain_core.runnables import RunnablePassthrough

# Build the chain: map (context + question) -> prompt -> model

chain = (
    {"context": vector_store.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
)

RAG system test

In [105]:
chain.invoke("explique le Système nerveux intestinal ?")

llama_perf_context_print:        load time =  190985.39 ms
llama_perf_context_print: prompt eval time =       0.00 ms /   682 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    35 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  841024.22 ms /   717 tokens


'Le système nerveux intestinal est un ensemble de neurones interconnectés situés dans la paroi gastrique et les plexus entériques.'

In [77]:
chain.invoke("c'est quoi l'adénosinomimétiques ? ")

"\nAnswer: Adénosiomimétrie s'agit de la production et la libération du purin monophosphate par les neurones, névralgies ou muscle."

In [91]:
chain.invoke("c'est quoi Les effets anaphylactoïdes ? ")

"Les effet(s) anaphylactoïdes sont des manifestations nocive(s) cliniquement procésées(s) des effets immuno-globulié(s) IgE. C'est-à-dire que ces manifestations (lesquelles peuvent être sévères) sont liées aux immuno-globulines IgE, c'est-à-dire aux médicaments anti-allergique ou au traitement immunologique."

In [22]:
chain.invoke("c'est quoi La Bronchoconstriction ?")

Llama.generate: 9 prefix-match hit, remaining 563 prompt tokens to eval



llama_perf_context_print:        load time =    3455.96 ms
llama_perf_context_print: prompt eval time =       0.00 ms /   563 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   255 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   32156.93 ms /   818 tokens


"\nAnswer: La Bronchoconstriction est une réaction immunitaire au risque d'insuffisance respiratoire, qui peut être liée à des facteurs de risques tels que l'infection ou une maladie cardiovasculaire. Elle s'accompagne souvent de symptômes d'inflammation (cérulose) et de survolte du centre respiratoire. La réaction immunitaire peut également être liée à des facteurs non immunitaires, tels que une infection ou une maladie cardiovasculaire. Dans un patient avec Bronchoconstruction, le symptôme principal est la cérulose qui apparaît généralement après la première infection et s'accompagne souvent de survolte du centre respiratoire. Les symptômes peuvent varier dans la faible proportion d'infections, les patients ne survivront pas tous, la mortalité est très faible dans la plupart des cas. Dans une personne atteinte de Bronchoconstruction, la survolte du centre respiratoire peut être due à l’effort de respiration (cér"