# Part 1: Introduction to RAG Systems

## Demonstrating a simple RAG pipeline

## Importing the libraries
- We need `torch` because `transformers`, the main library used in this notebook, runs its models on it.
- The `transformers` library provides a `pipeline` wrapper that takes care of loading the model and tokenizer and running inference. Here it is used to build the text-generation step.

In [2]:
import torch
from transformers import pipeline

## Defining the knowledge corpus
- In this simple example the corpus is merely a collection of sentences. In real use cases it would consist of documents, which would then have to be split using various chunking strategies; these will be discussed in later notebooks.
- The sentences are chosen at random, but a few of them have some semantic similarity to each other, which is relevant for later use cases.

In [3]:

# Step 1: Define documents
knowledge_base = [
    "Metformin is a medication used to treat type 2 diabetes.",
    "It works by lowering glucose production in the liver.",
    "Common side effects include nausea, upset stomach, and diarrhea.",
    "Octopuses have three hearts and blue blood.",
    "Honey never spoils and has been found edible in ancient Egyptian tombs.",
    "Sharks have been around longer than trees, existing for over 400 million years."
]
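
The corpus above is already sentence-sized, so it needs no splitting. For longer documents, one common starting point is fixed-size chunking with overlap. The sketch below is only illustrative: the function name `chunk_words`, the chunk size, and the overlap are assumptions made for this example, not necessarily the strategies covered in later notebooks.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Hypothetical longer "document" built from two corpus sentences
sample_document = (
    "Metformin is a medication used to treat type 2 diabetes. "
    "It works by lowering glucose production in the liver."
)
for chunk in chunk_words(sample_document):
    print(chunk)
```

The overlap keeps a few words of shared context between neighbouring chunks, so a sentence cut at a chunk boundary is still partly visible in both.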


## Simple Retriever engine
- To keep things simple for now, the retriever just looks for words shared between the query and the documents.
- If a document contains any of the words in the query, it is included in the context for generation.
- This approach is very naive. It skips pre-processing steps such as tokenization and stopword removal, and it matches substrings: punctuation attached to a query word (for example `metformin?`) prevents a match, while common words such as `the` match almost anything. That is just to keep things simple.

In [4]:


# Step 2: Define a simple retriever (keyword matching)
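def retrieve_docs(query, documents):
    """Return every document that contains at least one query word as a substring."""
    # Lowercase and split the query once; punctuation stays attached to the words
    query_words = query.lower().split()
    matching_docs = []

    for doc in documents:
        doc_lower = doc.lower()

        for word in query_words:
            # Substring test: "the" also matches inside longer words
            if word in doc_lower:
                matching_docs.append(doc)
                break  # Stop checking after the first match

    return matching_docs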
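
The pre-processing steps mentioned above (tokenization, stopword removal) can be sketched as follows. This is a minimal illustration under stated assumptions, not a replacement for the retriever used in this notebook: the names `tokenize` and `retrieve_docs_tokenized`, the tiny `STOPWORDS` set, and the ranking by overlap count are all choices made for this example.

```python
import re

# Illustrative stopword list; a real system would use a fuller, curated one
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "it", "of", "the", "to", "what"}


def tokenize(text):
    """Lowercase the text, keep alphanumeric runs as tokens, and drop stopwords."""
    return {tok for tok in re.findall(r"[a-z0-9]+", text.lower()) if tok not in STOPWORDS}


def retrieve_docs_tokenized(query, documents):
    """Return documents sharing at least one non-stopword token with the query, most overlap first."""
    query_tokens = tokenize(query)
    scored = []
    for doc in documents:
        overlap = len(query_tokens & tokenize(doc))
        if overlap > 0:
            scored.append((overlap, doc))
    # The sort is stable, so documents with equal overlap keep their corpus order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


docs = [
    "Metformin is a medication used to treat type 2 diabetes.",
    "It works by lowering glucose production in the liver.",
    "Common side effects include nausea, upset stomach, and diarrhea.",
    "Octopuses have three hearts and blue blood.",
]
print(retrieve_docs_tokenized("What are the side effects of Metformin?", docs))
```

With whole-token matching, `metformin?` is reduced to `metformin` and now matches the first sentence, while `the` no longer pulls in unrelated text. The sentence beginning "It works by" is not retrieved, because it never names the drug; this is a limitation of any purely keyword-based retriever.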




## Creating the prompt and generating the response
- Build the prompt from the retrieved documents (the context) and the query.
- Create a `pipeline` object that specifies the model used for generation; we are using `GPT-2` here.
- Pass the prompt together with parameters such as `max_new_tokens`, which caps the number of tokens the model may generate. With a small model such as GPT-2, longer generations are more likely to drift off topic.

In [7]:
# Step 3: Combine query with retrieved docs
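def create_prompt(query, retrieved_docs):
    # Note: "\\n" is an escaped backslash, so the prompt contains the two literal
    # characters \n instead of line breaks; use "\n" if real newlines are wanted.
    context = "\\n".join(retrieved_docs)
    return f"Context: {context}\\n\\nQuestion: {query}\\nAnswer:"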


# Step 4: Use an LLM to generate answer
qa_pipeline = pipeline("text-generation", model="gpt2")
print("\n \n")
# query
query = "What are the side effects of Metformin?"

# matched documents
retrieved = retrieve_docs(query, knowledge_base)

# prompt that packs the query + matched documents
prompt = create_prompt(query, retrieved)

# generation: `generated_text` holds the prompt followed by the sampled continuation
result = qa_pipeline(prompt, max_new_tokens=100, do_sample=True)[0]['generated_text']


print("User Query: ")
print(query, '\n')
print('Matched documents: ')
print(retrieved)
print()
print("Generated Response \n \n")
print(result, "\n")

Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



 

User Query: 
What are the side effects of Metformin? 

Matched documents: 
['It works by lowering glucose production in the liver.', 'Common side effects include nausea, upset stomach, and diarrhea.']

Generated Response 
 

Context: It works by lowering glucose production in the liver.\nCommon side effects include nausea, upset stomach, and diarrhea.\n\nQuestion: What are the side effects of Metformin?\nAnswer: The side effect profile may include: severe abdominal pain, nausea, and vomiting related to ingestion of Metformin.\n\nDiagnosis: Metformin is well tolerated.\n\nDiagnosis should be made if the gastrointestinal side effect occurs.\n\nEfficacy: Metformin is used to treat a number of gastrointestinal symptoms as described above.\n\nOther side effects include: severe diarrhea, constipation, vomiting

The side effects 
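
The generated text above echoes the whole prompt and shows literal `\n` sequences, because `create_prompt` joins its parts with `"\\n"` rather than `"\n"`. A minimal sketch of one way to address both points follows. The helper names `build_prompt` and `extract_answer` are hypothetical, and the model call is replaced by a stand-in string so the snippet runs without downloading GPT-2.

```python
def build_prompt(query, retrieved_docs):
    """Assemble the prompt with real line breaks between its parts."""
    context = "\n".join(retrieved_docs)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"


def extract_answer(generated_text, prompt):
    """Drop the echoed prompt from the model output and keep only the continuation."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = build_prompt(
    "What are the side effects of Metformin?",
    ["Common side effects include nausea, upset stomach, and diarrhea."],
)
# Stand-in for the `generated_text` field returned by the pipeline: prompt + continuation
fake_generation = prompt + " Nausea, upset stomach, and diarrhea."
print(extract_answer(fake_generation, prompt))
```

As an alternative to stripping the prompt by hand, the text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text.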

