# Retrieval Augmented Generation (RAG)
The idea is to first retrieve relevant documents with a search engine indexed on our own data, and then have an LLM generate an answer based on those documents.

In [26]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-09-12 23:32:47--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py.1’


2024-09-12 23:32:47 (24.0 MB/s) - ‘minsearch.py.1’ saved [3832/3832]



In [27]:
import minsearch  # Alexey's small and fast search engine
import json
import openai
import os
from dotenv import load_dotenv

from openai import OpenAI

## Build and fit the search engine

In [28]:
# Load the JSON data.
# The cleaned JSON dictionaries were extracted by Alexey Grigorev using:
# https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/parse-faq.ipynb
# The raw input was the FAQ Google Docs for the MLOps, ML, and Data Engineering Zoomcamps.

with open('documents.json','rt') as f_in:
    docs_raw = json.load(f_in)

In [29]:
# Flatten the data: tag each FAQ entry with its course and collect all entries in one list
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']  # add the course name to every FAQ entry
        documents.append(doc)
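
To make the flattening step easier to picture, here is a minimal sketch on a made-up miniature of `documents.json`. The nesting (`course` plus a `documents` list with `text`, `section`, and `question`) follows the code above; the two sample entries and the names `docs_raw_example` and `flatten_documents` are hypothetical and only serve as illustration.

```python
# Hypothetical miniature of documents.json; the real file has many more entries.
docs_raw_example = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"text": "Yes, you can still join.", "section": "General", "question": "Can I enroll late?"},
        ],
    },
    {
        "course": "mlops-zoomcamp",
        "documents": [
            {"text": "Check the setup section.", "section": "Setup", "question": "How do I start?"},
        ],
    },
]


def flatten_documents(raw):
    """Return one flat list of FAQ entries, each tagged with its course."""
    flat = []
    for course_dict in raw:
        for doc in course_dict["documents"]:
            # Build a new dict so the input is left untouched
            # (the notebook loop instead adds the key in place).
            flat.append({**doc, "course": course_dict["course"]})
    return flat


flat_example = flatten_documents(docs_raw_example)
print(len(flat_example))
print(flat_example[0])
```

After flattening, every entry carries its own `course` key, which is what allows the search engine to filter by course later on.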

In [30]:
# Initialize the index: text_fields are searched as free text, keyword_fields are used for exact-match filtering
index = minsearch.Index(
    text_fields=['text','section','question'],
    keyword_fields=['course']
)

# Fit the index on the documents
index.fit(docs=documents)

<minsearch.Index at 0x157514640>

In [31]:
q = 'the course has already started, can I enroll?'

boost = {'question': 3.0, 'section': 0.5}  # per-field weights: matches in 'question' count more, matches in 'section' count less
filter_dict = {'course': 'data-engineering-zoomcamp'}
results = index.search(
    query=q,
    filter_dict=filter_dict,
    boost_dict=boost,
    num_results=5
)
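
To build some intuition for what `filter_dict` and `boost_dict` do, the following is a toy sketch in plain Python. It is not how minsearch computes relevance: the word-overlap score, the function `toy_search`, and the sample documents are all assumptions made for illustration. It only shows the two ideas of exact-match filtering on keyword fields and per-field weighting of text matches.

```python
def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    """Rank docs by weighted word overlap; a toy stand-in, not minsearch's scoring."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_tokens = set(query.lower().split())

    scored = []
    for doc in docs:
        # Keyword fields must match exactly, otherwise the document is skipped.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            field_tokens = set(str(doc.get(field, "")).lower().split())
            overlap = len(query_tokens & field_tokens)
            # Fields without an explicit boost get weight 1.0.
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical sample documents.
toy_docs = [
    {"course": "data-engineering-zoomcamp", "section": "General", "question": "Can I enroll late?", "text": "Yes."},
    {"course": "data-engineering-zoomcamp", "section": "enroll", "question": "Where is the syllabus?", "text": "On GitHub."},
    {"course": "mlops-zoomcamp", "section": "General", "question": "Can I enroll late?", "text": "Yes."},
]

toy_hits = toy_search(
    toy_docs,
    "can I enroll",
    text_fields=("text", "section", "question"),
    filter_dict={"course": "data-engineering-zoomcamp"},
    boost_dict={"question": 3.0, "section": 0.5},
)
for hit in toy_hits:
    print(hit["question"])
```

In this toy setup the third document is dropped by the course filter, and the first document ranks above the second because its match is in the heavily weighted `question` field rather than in `section`.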

## Generate an answer with the LLM

In [32]:
# Load environment variables (such as OPENAI_API_KEY) from a .env file
load_dotenv()
client = OpenAI()

In [33]:
# Get a response from the OpenAI API using gpt-4o-mini, without any retrieved context
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role':'user','content':q}]
)

In [34]:
response.choices[0].message.content

"Whether you can enroll in a course that has already started typically depends on the institution's policies and the specific course in question. Many educational programs may allow late enrollment, while others may have strict deadlines. Here are a few steps you can take:\n\n1. **Check the Institution’s Policies**: Look for information regarding late enrollment or add/drop deadlines on the institution’s website.\n   \n2. **Contact the Instructor or Administration**: Reach out to the course instructor or the administration office to inquire about the possibility of joining the course late.\n\n3. **Consider Alternatives**: If enrollment is not possible, ask about future offerings of the course or other related courses.\n\n4. **Evaluate Your Commitment**: If you can enroll late, consider whether you will be able to keep up with the material that has already been covered.\n\nGood luck!"

In [35]:
# Give the LLM some context.
# Alexey mentions that writing the prompt is part art, part science: you iterate until you find something that works for you.
prompt_template = """

You are a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT does not contain the answer, output NONE.

QUESTION: {question}

CONTEXT: {context}

""".strip()  # remove leading and trailing whitespace

In [36]:
# Build the context by concatenating the section, question, and answer of each search result
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer:{doc['text']}\n\n"
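
The template and the context loop can also be wrapped into small reusable helpers. The sketch below is one possible way to do that; the names `PROMPT_TEMPLATE`, `build_context`, and `build_prompt` are hypothetical. It differs slightly from the cells above: it puts a space after `answer:` and joins entries with blank lines instead of appending them, so no trailing whitespace needs to be stripped.

```python
PROMPT_TEMPLATE = """
You are a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT does not contain the answer, output NONE.

QUESTION: {question}

CONTEXT: {context}
""".strip()


def build_context(search_results):
    """Turn a list of FAQ entries into one text block, entries separated by blank lines."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    return "\n\n".join(entries)


def build_prompt(question, search_results):
    """Fill the question and the formatted search results into the template."""
    return PROMPT_TEMPLATE.format(question=question, context=build_context(search_results))


# Hypothetical search results with the same keys as the FAQ entries.
sample_results = [
    {"section": "General", "question": "Can I enroll late?", "text": "Yes, you can."},
    {"section": "General", "question": "When does it start?", "text": "See the course page."},
]
print(build_prompt("the course has already started, can I enroll?", sample_results))
```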

In [37]:
# Fill the question and the context into the prompt template
prompt = prompt_template.format(question=q,context=context).strip()
print(prompt)

You are a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT does not contain the answer, output NONE.

QUESTION: the course has already started, can I enroll?

CONTEXT: section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer:Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - When will the course start?
answer:The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1
Subsc

In [38]:
# Send the prompt (question plus retrieved context) to gpt-4o-mini; no training is involved
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': prompt}]
)

In [39]:

# Get the answer to the question, now grounded in the retrieved context
response.choices[0].message.content

"Yes, you can still join the course after the start date. Even if you don't register, you're eligible to submit the homeworks, but be mindful of the deadlines for turning in the final projects."
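
The three steps of this notebook (search, build the prompt, ask the LLM) can be chained into one function. The following is a minimal sketch under the assumption that each step is passed in as a callable; the names `rag`, `fake_search`, `simple_prompt`, and `fake_llm` are hypothetical, and the stand-ins exist only so the example runs without an index or an API key.

```python
def rag(query, search_fn, prompt_fn, llm_fn):
    """Retrieve documents, build a prompt from them, and ask the LLM."""
    results = search_fn(query)
    prompt = prompt_fn(query, results)
    return llm_fn(prompt)


# Stand-in components so the sketch runs without an index or an API key.
def fake_search(query):
    return [{"section": "General", "question": "Can I enroll late?", "text": "Yes, you can."}]


def simple_prompt(query, results):
    context = "\n\n".join(
        f"section: {d['section']}\nquestion: {d['question']}\nanswer: {d['text']}"
        for d in results
    )
    return f"QUESTION: {query}\n\nCONTEXT: {context}"


def fake_llm(prompt):
    # Reports the prompt length instead of calling a real model.
    return f"(stub answer for a prompt of {len(prompt)} characters)"


demo_answer = rag("can I enroll?", fake_search, simple_prompt, fake_llm)
print(demo_answer)
```

With the objects from this notebook, `search_fn` could wrap the `index.search(...)` call, `prompt_fn` could wrap `prompt_template.format(...)`, and `llm_fn` could wrap `client.chat.completions.create(...)` and return `response.choices[0].message.content`. Keeping the steps as separate callables makes it easy to swap the search engine or the model later.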