# Introduction to RAG

## 1. Indexing Documents with minsearch

In [16]:
import json
import minsearch

In [17]:
with open('../01-introduction/faq_database.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

Each document is one FAQ entry with its answer text, section, question, and course. Two examples:

In [18]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [19]:
documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp'}

minsearch is a minimal search engine that returns the documents most similar to a new query.

In [20]:
minsearch.Index?

Init signature: minsearch.Index(text_fields, keyword_fields, vectorizer_params={})
Docstring:
A simple search index using TF-IDF and cosine similarity for text fields and exact matching for keyword fields.

Attributes:
    text_fields (list): List of text field names to index.
    keyword_fields (list): List of keyword field names to index.
    vectorizers (dict): Dictionary of TfidfVectorizer instances for each text field.
    keyword_df (pd.DataFrame): DataFrame containing keyword field data.
    text_matrices (dict): Dictionary of TF-IDF matrices for each text field.
    docs (list): List of documents indexed.
Init docstring:
Initializes the Index with specified text and keyword fields.

Args:
    text_fields (list): List of text field names to index.
    keyword_fields (list): List of keyword 
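
The docstring above says the index ranks text fields by TF-IDF and cosine similarity and filters keyword fields by exact match. The following pure-Python sketch illustrates that idea on a single text field. It is not minsearch's actual implementation: `tokenize`, `build_tfidf`, `cosine`, `toy_search`, and the sample documents are hypothetical names and data introduced here, and the whitespace tokenizer and smoothed IDF formula are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def build_tfidf(texts):
    # Returns the IDF table and one sparse TF-IDF vector (a dict) per text.
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: terms that occur in fewer documents get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_search(query, docs, field, filter_dict=None, num_results=3):
    filter_dict = filter_dict or {}
    idf, vectors = build_tfidf([d[field] for d in docs])
    # Query terms that never occur in the corpus are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [
        (cosine(q_vec, vec), doc)
        for vec, doc in zip(vectors, docs)
        # Exact-match filter on keyword fields.
        if all(doc.get(k) == v for k, v in filter_dict.items())
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical sample data, not taken from the FAQ database.
toy_docs = [
    {"question": "can I still enroll after the start", "course": "a"},
    {"question": "what are the prerequisites", "course": "a"},
    {"question": "can I still enroll late", "course": "b"},
]
hits = toy_search("can I enroll", toy_docs, "question", {"course": "a"}, num_results=1)
print(hits)
```

The filter removes the course `"b"` entry before ranking, so only the two course `"a"` entries compete on similarity.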

In [21]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]  # exact-match field, used to filter results by course
)

In [22]:
index.fit(documents)

<minsearch.Index at 0x77990e880b50>

In [23]:
index.search?

Signature: index.search(query, filter_dict={}, boost_dict={}, num_results=10)
Docstring:
Searches the index with the given query, filters, and boost parameters.

Args:
    query (str): The search query string.
    filter_dict (dict): Dictionary of keyword fields to filter by. Keys are field names and values are the values to filter by.
    boost_dict (dict): Dictionary of boost scores for text fields. Keys are field names and values are the boost scores.
    num_results (int): The number of top results to return. Defaults to 10.

Returns:
    list of dict: List of documents matching the search criteria, ranked by relevance.
File:      ~/Documents/Data-Science/LLM Courses/LLMZoomcamp/02-open-source/minsearch.py
Type:   

In [24]:
def search(question: str, filter_dict: dict = None, num_results: int = 5) -> list:
    """
    Search the index for the FAQ entries most relevant to the given question.

    Parameters:
    question (str): The question to search for.
    filter_dict (dict, optional): Keyword fields to filter by,
        e.g. {"course": "data-engineering-zoomcamp"}. Defaults to no filtering.
    num_results (int): The number of results to return. Default is 5.

    Returns:
    list: A list of matching documents, ranked by relevance.
    """

    # Weight matches in 'question' up and matches in 'section' down
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=question,
        filter_dict=filter_dict or {},  # None means no filtering; avoids a mutable default
        boost_dict=boost,
        num_results=num_results
    )

    return results
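
To make the effect of `boost_dict` concrete, here is a small sketch of how per-field similarity scores could be combined into one ranking score. It assumes each field's score is multiplied by its boost and the products are summed, and that fields missing from the boost dictionary get a weight of 1.0; both are assumptions for illustration, not a statement about minsearch internals. `combine_field_scores` and the score values are hypothetical.

```python
def combine_field_scores(field_scores, boost_dict=None, default_boost=1.0):
    """Sum per-field similarity scores, each multiplied by its boost."""
    boost_dict = boost_dict or {}
    n_docs = len(next(iter(field_scores.values())))
    totals = [0.0] * n_docs
    for field, scores in field_scores.items():
        weight = boost_dict.get(field, default_boost)
        for i, score in enumerate(scores):
            totals[i] += weight * score
    return totals


def rank(totals):
    # Document positions ordered from highest to lowest total score.
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)


# Hypothetical similarity scores for two documents across three text fields.
field_scores = {
    "question": [0.2, 0.5],
    "text": [0.4, 0.1],
    "section": [1.0, 0.0],
}

unboosted = combine_field_scores(field_scores)
boosted = combine_field_scores(field_scores, {"question": 3.0, "section": 0.5})

print(rank(unboosted), rank(boosted))
```

With equal weights the first document wins on its strong `section` match; with the boosts used in `search`, the second document overtakes it because its `question` match counts three times as much.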

In [25]:
question = "The course has already started, can I still enroll?"
search(question=question, filter_dict={"course": "data-engineering-zoomcamp"})

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

The search returns the FAQ entries most similar to the question, ranked by relevance.

## 2. Set Up the LLM

In [26]:
import ollama

In [27]:
ollama.list()

{'models': [{'name': 'phi3:latest',
   'model': 'phi3:latest',
   'modified_at': '2024-07-07T14:14:27.925410356-03:00',
   'size': 2176178401,
   'digest': 'd184c916657ef4eaff1908b1955043cec01e7aafd2cef8a5bbfd405a7d35d1fb',
   'details': {'parent_model': '',
    'format': 'gguf',
    'family': 'phi3',
    'families': ['phi3'],
    'parameter_size': '3.8B',
    'quantization_level': 'Q4_0'}}]}

In [28]:
response = ollama.generate(model='phi3', prompt='Why is the sky blue?')

In [29]:
print(response["response"])

 The sky appears predominantly blue to observers on Earth due to a phenomenon called Rayleigh scattering. As sunlight passes through Earth's atmosphere, it encounters molecules and small particles that are much smaller than its wavelength (i.e., less than 1/10th the wavelength of visible light). These airborne particles cause short blue-wavelength photons to scatter in different directions more efficiently than other colors because they have a shorter, smaller wavelength which is better suited for this interaction with atmospheric molecules.


The scattered sunlight enters the eye from all directions except directly from where it originally came; thus, we see light that has been scattered towards us and not just direct light coming from above as our line of sight typically lies below the scattering paths when looking toward any point in the sky (except during or shortly after a solar eclipse). This is why even if you are standing under an open sky with no clouds blocking your view, eve

In [30]:
response = ollama.generate(model='phi3', prompt=question)
print(response["response"])

 Enrolling in a course once it has begun is typically not possible as classes are often designed with fixed schedules and seats that become filled quickly. However, there might be exceptions depending on the institution or format of your class (e.g., an open online session where new registrants can join). To address this matter accurately:

1. Check directly with the instructor for their specific enrollment policy regarding late registration – some courses may allow it under certain conditions, while others will not accept any changes after they have started.
2. If your course offers an open section or if there's another session that could accommodate you without conflicting schedules, consider looking into those options for future reference to avoid the same issue.
3. For courses with late registration policies (which are uncommon), ensure all other prerequisites such as required materials and assignments have been completed before enrolling.


## 3. LLM & RAG System

In [31]:
def build_prompt(query: str, search_results: list) -> str:
    """
    Build a prompt for generating an answer based on a given user query and search results.

    Args:
        query (str): The question/query for which the prompt is being built.
        search_results (list): A list of dictionaries containing search results.

    Returns:
        str: The generated prompt.

    """
    
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT is not enough to answer the QUESTION, return NONE.

QUESTION: {question}

CONTEXT: 
{context}

""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [32]:
def invoke_llm(prompt: str) -> str:
    """
    Invokes the LLM (large language model) to generate a response based on the given prompt.

    Parameters:
    prompt (str): The input prompt for the LLM.

    Returns:
    str: The generated response from the LLM.
    """

    response = ollama.generate(model='phi3', prompt=prompt)
    
    return response["response"]

In [33]:
def run_rag_min_search(query: str, filter_dict: dict = None) -> str:
    """
    Runs the RAG (Retrieval-Augmented Generation) pipeline to generate an answer for the given query.

    Args:
        query (str): The query to search for.
        filter_dict (dict, optional): A dictionary of filters to apply during the search. Defaults to no filtering.

    Returns:
        str: The generated answer.

    """

    # 1. Retrieve the most relevant FAQ entries
    search_results = search(query, filter_dict=filter_dict or {}, num_results=5)

    # 2. Build a prompt that contains the question and the retrieved context
    prompt = build_prompt(query, search_results)

    # 3. Ask the LLM to answer using only that context
    answer = invoke_llm(prompt)

    return answer
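
The pipeline above is three steps: retrieve, build a prompt, generate. One way to exercise that flow without a running model is to pass the three steps in as functions and substitute stubs. The sketch below assumes this injectable design; `run_rag`, `fake_search`, `simple_prompt`, and `stub_llm` are hypothetical helpers, not part of the notebook or of any library.

```python
def run_rag(query, search_fn, prompt_fn, llm_fn):
    """Retrieve documents, build a prompt from them, and generate an answer."""
    results = search_fn(query)
    prompt = prompt_fn(query, results)
    return llm_fn(prompt)


def fake_search(query):
    # Stand-in for the real index: always returns one hard-coded FAQ entry.
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def simple_prompt(query, results):
    # Simplified prompt layout, shorter than the template used in build_prompt.
    context = "\n\n".join(
        f"question: {d['question']}\nanswer: {d['text']}" for d in results
    )
    return f"QUESTION: {query}\n\nCONTEXT:\n{context}"


def stub_llm(prompt):
    # Stand-in for the model: echoes the first line of the prompt.
    return "STUB: " + prompt.splitlines()[0]


answer = run_rag("Can I still enroll?", fake_search, simple_prompt, stub_llm)
print(answer)
```

Swapping `stub_llm` for a function that calls the real model leaves the rest of the flow unchanged, which makes the retrieval and prompt steps easy to check in isolation.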

In [34]:
question = "The course has already started, can I still enroll?"
response = run_rag_min_search(query=question, filter_dict={"course": "data-engineering-zoomcamp"})
print(response)

 Based on the CONTEXT provided:

Yes, even if a course has started and students are already enrolled in some form of self-paced mode or have not yet registered officially beforehand, you still can submit homework assignments for points as stated that "even if you don't register, you're still eligible to submit the homeworks." However, there may be deadlines regarding final project submissions which should not be overlooked.

The CONTEXT doesn’t explicitly mention late registration into an already started course or any specific policies for doing so after a class has begun and materials have been released. Therefore: 


In [35]:
question = "Can I use other programming languages in the course?"
response = run_rag_min_search(query=question)
print(response)

 NONE - The CONTEXT does not provide specific guidance on whether students can use other programming languages such as R or Scala within the course submissions and assessments. However, it advises against using different languages due to potential issues with library versions in homework assignments and peer-review difficulties for midterms/capstones.
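
The prompt asks the model to return NONE when the context is insufficient, but as the output above shows, the model may follow the sentinel with extra text. A calling application could detect this case with a small check like the one sketched here. `is_no_answer` is a hypothetical helper, and matching on the first word only is an assumption about how the sentinel appears.

```python
def is_no_answer(response, sentinel="NONE"):
    """Return True if the response starts with the 'no answer' sentinel word."""
    words = response.split()
    if not words:
        return False
    # Compare only the first word, ignoring case and surrounding punctuation,
    # so that words such as "Nonetheless" do not count as the sentinel.
    first_word = words[0].strip(".,:;!-")
    return first_word.upper() == sentinel.upper()


print(is_no_answer(" NONE - The CONTEXT does not provide specific guidance."))
print(is_no_answer(" Based on the CONTEXT provided: Yes."))
```

The first call prints `True` and the second prints `False`.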
