# Introduction to RAG

## 1. Indexing Documents with minsearch

In [1]:
import json
import minsearch

In [2]:
with open('faq_database.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

# Flatten the per-course structure into one list, tagging each document with its course
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

Each document is a flat dictionary holding the answer text, section, question, and course. The first and last entries look like this:

In [3]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [4]:
documents[-1]

{'text': 'Problem description\nInfrastructure created in AWS with CD-Deploy Action needs to be destroyed\nSolution description\nFrom local:\nterraform init -backend-config="key=mlops-zoomcamp-prod.tfstate" --reconfigure\nterraform destroy --var-file vars/prod.tfvars\nAdded by Erick Calderin',
 'section': 'Module 6: Best practices',
 'question': 'How to destroy infrastructure created via GitHub Actions',
 'course': 'mlops-zoomcamp'}

`minsearch` is a minimal search engine that returns the documents most similar to a new query.

In [5]:
minsearch.Index?

Init signature: minsearch.Index(text_fields, keyword_fields, vectorizer_params={})
Docstring:
A simple search index using TF-IDF and cosine similarity for text fields and exact matching for keyword fields.

Attributes:
    text_fields (list): List of text field names to index.
    keyword_fields (list): List of keyword field names to index.
    vectorizers (dict): Dictionary of TfidfVectorizer instances for each text field.
    keyword_df (pd.DataFrame): DataFrame containing keyword field data.
    text_matrices (dict): Dictionary of TF-IDF matrices for each text field.
    docs (list): List of documents indexed.
Init docstring:
Initializes the Index with specified text and keyword fields.

Args:
    text_fields (list): List of text field names to index.
    keyword_fields (list): List of keyword 
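
The docstring above mentions TF-IDF and cosine similarity. The following is a rough, standard-library-only illustration of that idea, not minsearch's actual code (which, per the attribute list, relies on `TfidfVectorizer`). The whitespace tokenizer, the smoothed IDF formula, and all function names here are assumptions made for the sketch.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    # Naive tokenizer (assumption): lowercase and split on whitespace
    return text.lower().split()


def tfidf_vectors(texts: list) -> tuple:
    """Return one sparse TF-IDF vector (a dict) per text, plus the IDF table."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency; rarer terms get larger weights
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in counts.items()})
    return vectors, idf


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, texts: list) -> list:
    """Return the indices of `texts`, most similar to `query` first."""
    vectors, idf = tfidf_vectors(texts)
    query_counts = Counter(tokenize(query))
    # Terms never seen in the corpus carry no weight
    query_vec = {t: c * idf[t] for t, c in query_counts.items() if t in idf}
    scores = [cosine(query_vec, vec) for vec in vectors]
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
```

Calling `rank("join the course", [...])` on a handful of question strings returns the positions of those strings ordered by similarity, which is the same kind of ranking the index produces for each text field.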

In [6]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]  # exact-match field, available for filtering at search time
)

In [7]:
index.fit(documents)

<minsearch.Index at 0x706ad67f7460>

In [8]:
index.search?

Signature: index.search(query, filter_dict={}, boost_dict={}, num_results=10)
Docstring:
Searches the index with the given query, filters, and boost parameters.

Args:
    query (str): The search query string.
    filter_dict (dict): Dictionary of keyword fields to filter by. Keys are field names and values are the values to filter by.
    boost_dict (dict): Dictionary of boost scores for text fields. Keys are field names and values are the boost scores.
    num_results (int): The number of top results to return. Defaults to 10.

Returns:
    list of dict: List of documents matching the search criteria, ranked by relevance.
File:      ~/Documents/Data-Science/LLM Courses/LLMZoomcamp/01-introduction/minsearch.py
Type:

In [9]:
def search(question: str, filter_dict: dict = None, num_results: int = 5) -> list:
    """
    Search the minsearch index for documents relevant to the given question.

    Parameters:
    question (str): The question to search for.
    filter_dict (dict, optional): Keyword fields to filter by, e.g. {"course": "data-engineering-zoomcamp"}. Defaults to None (no filtering).
    num_results (int): The number of results to return. Default is 5.

    Returns:
    list: A list of matching documents, ranked by relevance.
    """

    # Weight matches in the question field more heavily than matches in the section field
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=question,
        filter_dict=filter_dict or {},  # restrict results to the given keyword values
        boost_dict=boost,
        num_results=num_results
    )

    return results

In [10]:
question = "The course has already started, can I still enroll?"
search(question=question, filter_dict={"course": "data-engineering-zoomcamp"})

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

The search returns the FAQ entries most similar to the question, ranked by relevance.
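
The `boost_dict` and `filter_dict` arguments passed to `index.search` can be pictured with a small sketch. It assumes that per-field similarity scores are combined as a boost-weighted sum and that documents failing a keyword filter get a score of zero; the function name `combine_scores` and the toy numbers are hypothetical, and minsearch's internals may differ in detail.

```python
def combine_scores(field_scores: dict, boost: dict, docs: list, filter_dict: dict) -> list:
    """Combine per-field similarity scores into one score per document.

    field_scores maps a text field name to a list of similarities (one per doc).
    Fields missing from `boost` get a weight of 1.0 (assumption).
    """
    totals = [0.0] * len(docs)
    for field, scores in field_scores.items():
        weight = boost.get(field, 1.0)
        for i, score in enumerate(scores):
            totals[i] += weight * score

    # Exact-match filtering: zero out documents whose keyword fields do not match
    for i, doc in enumerate(docs):
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            totals[i] = 0.0
    return totals
```

With a boost of 3.0 on `question`, a strong match in the question field outweighs a similar match in the answer text, which suits an FAQ where the stored question is usually the best signal.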

## 2. Set Up the LLM

In [11]:
import os
from dotenv import load_dotenv
import google.generativeai as genai

In [12]:
load_dotenv("../.env")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
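
`os.getenv` returns `None` when the variable is missing, so a typo in the `.env` path or key name only surfaces later, when the API is first called. A small guard can make that failure explicit. The helper name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check the .env file and its path"
        )
    return value
```

It could be used as `GEMINI_API_KEY = require_env("GEMINI_API_KEY")` right after `load_dotenv`.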

In [13]:
"""
Install the Google AI Python SDK

$ pip install google-generativeai

See the getting started guide for more information:
https://ai.google.dev/gemini-api/docs/get-started/python
"""

genai.configure(api_key=GEMINI_API_KEY)

In [14]:
# Create the model
# See https://ai.google.dev/api/python/google/generativeai/GenerativeModel
generation_config = {
    "temperature": 0,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 1024,
    "response_mime_type": "text/plain",
}

In [15]:
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config=generation_config,
    # safety_settings = Adjust safety settings
    # See https://ai.google.dev/gemini-api/docs/safety-settings
)

In [16]:
chat_session = model.start_chat(history=[])

In [17]:
response = chat_session.send_message("Who is the last champion of FIFA World Cup?")
print(response.text)

The last champion of the FIFA World Cup is **Argentina**, who won the tournament in **2022**. 



The model handles a general-knowledge question well. Now let's ask the question about the Data Engineering Zoomcamp:

In [18]:
response = chat_session.send_message(question)
print(response.text)

Please provide me with more context! I need to know which course you're referring to in order to answer your question. 

Tell me:

* **What is the name of the course?**
* **Where is the course offered (online, university, etc.)?**

Once I have this information, I can help you find out if you can still enroll. 



We do not get a useful response. A general-purpose LLM has no access to the course FAQ, so it needs the relevant context supplied together with the question.

## 3. LLM & RAG System

In [19]:
def build_prompt(query: str, search_results: list) -> str:
    """
    Build a prompt for generating an answer based on a given user query and search results.

    Args:
        query (str): The question/query for which the prompt is being built.
        search_results (list): A list of dictionaries containing search results.

    Returns:
        str: The generated prompt.

    """
    
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT is not enough to answer the QUESTION, return NONE.

QUESTION: {question}

CONTEXT: 
{context}

""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
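
The prompt template asks the model to return NONE when the context is insufficient. Downstream code may want to treat that reply as "no answer" instead of showing it to the user. A minimal sketch, assuming the model replies with the bare word NONE (it may not always do so exactly); `parse_answer` and `NO_ANSWER_SENTINEL` are hypothetical names.

```python
from typing import Optional

# Sentinel the prompt template asks the model to return when context is insufficient
NO_ANSWER_SENTINEL = "NONE"


def parse_answer(raw_answer: str) -> Optional[str]:
    """Return the cleaned answer, or None if the model signalled it could not answer."""
    cleaned = raw_answer.strip()
    if cleaned.upper() == NO_ANSWER_SENTINEL:
        return None
    return cleaned
```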

In [20]:
def invoke_llm(prompt: str, model: genai.GenerativeModel, history: list = None) -> str:
    """
    Invokes the LLM (large language model) in a fresh chat session to generate a response to the given prompt.

    Parameters:
    prompt (str): The input prompt for the LLM.
    model (genai.GenerativeModel): The LLM model to be used for generating the response.
    history (list, optional): A list of previous chat messages as history. Defaults to None.

    Returns:
    str: The generated response from the LLM.
    """

    chat_session = model.start_chat(history=history)
    
    response = chat_session.send_message(prompt)

    return response.text

In [21]:
def run_rag_min_search(query: str, model: genai.GenerativeModel, filter_dict: dict = None) -> str:
    """
    Runs the RAG (Retrieval-Augmented Generation) pipeline with minsearch as the retriever.

    Args:
        query (str): The query to search for.
        model (genai.GenerativeModel): The LLM to use for generation.
        filter_dict (dict, optional): A dictionary of filters to apply during the search. Defaults to None (no filtering).

    Returns:
        str: The generated answer.

    """
    
    search_results = search(query, filter_dict=filter_dict or {}, num_results=5)

    prompt = build_prompt(query, search_results)

    answer = invoke_llm(prompt, model)

    return answer

In [22]:
question = "The course has already started, can I still enroll?"
response = run_rag_min_search(query=question, model=model, filter_dict={"course": "data-engineering-zoomcamp"})
print(response)

Yes, even if you don't register, you're still eligible to submit the homeworks. 
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute. 



In [23]:
question = "Can I use other programming languages in the course?"
response = run_rag_min_search(query=question, model=model)
print(response)

Technically, yes. Advisable? Not really. Reasons:
Some homework(s) asks for specific python library versions.
Answers may not match in MCQ options if using different languages other than Python 3.10 (the recommended version for 2023 cohort)
And as for midterms/capstones, your peer-reviewers may not know these other languages. Do you want to be penalized for others not knowing these other languages?
You can create a separate repo using course’s lessons but written in other languages for your own learnings, but not advisable for submissions. 



## 4. Elasticsearch

In [24]:
from tqdm.auto import tqdm
from elasticsearch import Elasticsearch

In [25]:
es_client = Elasticsearch('http://localhost:9200') 

In [26]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [27]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
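
Indexing one document per request is fine for 948 documents but becomes slow for larger collections. The sketch below shows one way to prepare the same documents for `elasticsearch.helpers.bulk`. The helper names `to_bulk_actions` and `bulk_index` are hypothetical, and the import is done lazily so the action-building part can be read and tested without the Elasticsearch package or a running server.

```python
from typing import Iterable, Iterator


def to_bulk_actions(docs: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Wrap each document in the action format expected by the bulk helper."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, docs: Iterable[dict], index_name: str):
    """Send all documents in batched requests instead of one request per document."""
    # Imported here so the sketch above does not require the package to be installed
    from elasticsearch import helpers

    return helpers.bulk(es_client, to_bulk_actions(docs, index_name))
```

Under these assumptions, `bulk_index(es_client, documents, index_name)` would replace the loop above.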

In [28]:
def elastic_search(query: str, num_results: int = 5, filter_dict: dict = None) -> list[dict]:
    """
    Perform an Elasticsearch search based on the given query and filter criteria.

    Args:
        query (str): The search query string.
        num_results (int, optional): The number of results to return. Defaults to 5.
        filter_dict (dict, optional): Exact-match filters on keyword fields, e.g. {"course": "data-engineering-zoomcamp"}. Defaults to None (no filtering).

    Returns:
        list[dict]: The matching documents (the `_source` of each hit), ranked by relevance.
    """

    # Construct the search query
    search_query = {
        "size": num_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    # A term clause takes exactly one field, so build one clause per filter
    # and omit the filter entirely when no filters are given
    if filter_dict:
        search_query["query"]["bool"]["filter"] = [
            {"term": {field: value}} for field, value in filter_dict.items()
        ]

    # Perform the search using Elasticsearch client
    response = es_client.search(index=index_name, body=search_query)

    # Extract the source documents from the hits
    result_docs = [hit['_source'] for hit in response['hits']['hits']]

    return result_docs

In [29]:
def run_rag_elastic_search(query: str, model: genai.GenerativeModel, filter_dict: dict = None) -> str:
    """
    Runs the RAG (Retrieval-Augmented Generation) pipeline with Elasticsearch as the retriever.

    Args:
        query (str): The search query.
        model (genai.GenerativeModel): The LLM to be used for generation.
        filter_dict (dict, optional): A dictionary of filters to be applied to the search results. Defaults to None (no filtering).

    Returns:
        str: The generated answer based on the query and search results.
    """
    
    search_results = elastic_search(query, num_results=5, filter_dict=filter_dict or {})

    prompt = build_prompt(query, search_results)

    answer = invoke_llm(prompt, model)
    
    return answer

In [30]:
question = "The course has already started, can I still enroll?"
response = run_rag_elastic_search(query=question, model=model, filter_dict={"course": "data-engineering-zoomcamp"})
print(response)

Yes, even if you don't register, you're still eligible to submit the homeworks. 
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute. 
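
The two `run_rag_*` functions differ only in the retrieval step. One possible refactoring, sketched here under the assumption that retrieval, prompt building, and generation can each be passed in as a plain callable, is a single pipeline function. The name `run_rag` and its parameter names are hypothetical.

```python
from typing import Callable


def run_rag(
    query: str,
    search_fn: Callable[[str], list],
    prompt_fn: Callable[[str, list], str],
    llm_fn: Callable[[str], str],
) -> str:
    """Retrieve context, build a prompt from it, and generate an answer."""
    search_results = search_fn(query)           # retrieval step (minsearch, Elasticsearch, ...)
    prompt = prompt_fn(query, search_results)   # augmentation step
    return llm_fn(prompt)                       # generation step
```

With the functions defined earlier, a call might look like `run_rag(question, lambda q: elastic_search(q, filter_dict={"course": "data-engineering-zoomcamp"}), build_prompt, lambda p: invoke_llm(p, model))`. Passing stubs for `search_fn` and `llm_fn` also makes the pipeline testable without a search server or an API key.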

