## STEP 0 - install dependencies

In [2]:
# enter the OpenAI API key once here; after that, all cells can be run without further prompts
import os
from getpass import getpass
from openai import OpenAI

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key


In [3]:
# Source video: https://www.youtube.com/watch?v=GH3lrOsU3AU
# Original notebook: https://github.com/DataTalksClub/llm-zoomcamp/blob/main/0a-agents/notebook.ipynb
# Workshop to follow along: https://github.com/alexeygrigorev/rag-agents-workshop

# STEP 0 - install the packages needed by this notebook

# Run in the VS Code terminal:
# pip install --upgrade pip  # stops pip's upgrade reminders
# pip install tqdm notebook==7.1.2 openai elasticsearch==8.13.0 pandas scikit-learn ipywidgets
# jupyter notebook  # start the Jupyter server locally

In [4]:
%pip install minsearch -q

Note: you may need to restart the kernel to use updated packages.


## STEP 1 - download the FAQ data from GitHub and build an in-memory search index with minsearch

In [5]:
# download our FAQ dataset
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()  # fail early on a bad HTTP status
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

documents[-1]  # inspect the format of one record in the downloaded FAQ dataset

{'text': 'Problem description\nInfrastructure created in AWS with CD-Deploy Action needs to be destroyed\nSolution description\nFrom local:\nterraform init -backend-config="key=mlops-zoomcamp-prod.tfstate" --reconfigure\nterraform destroy --var-file vars/prod.tfvars\nAdded by Erick Calderin',
 'section': 'Module 6: Best practices',
 'question': 'How to destroy infrastructure created via GitHub Actions',
 'course': 'mlops-zoomcamp'}
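
The loop above flattens the nested JSON (a list of courses, each with its own `documents` list) into one flat list in which every record carries its course name. A minimal sketch of the same idea on made-up data; the two toy courses and the names `toy_raw` and `toy_flat` are illustrative and not part of the real dataset. Unlike the original loop, this variant copies each record instead of modifying it in place.

```python
# Toy input in the same nested shape as documents.json (contents are made up).
toy_raw = [
    {"course": "course-a", "documents": [
        {"text": "t1", "section": "s1", "question": "q1"},
        {"text": "t2", "section": "s1", "question": "q2"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "t3", "section": "s2", "question": "q3"},
    ]},
]

# Copy each record and tag it with the name of the course it belongs to.
toy_flat = [
    {**doc, "course": course["course"]}
    for course in toy_raw
    for doc in course["documents"]
]

print(len(toy_flat))
print(toy_flat[-1])
```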

In [6]:
# build index - takes a few seconds with minsearch

from minsearch import AppendableIndex

index = AppendableIndex(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

<minsearch.append.AppendableIndex at 0x71629c047470>

In [7]:
# create a search function that boosts the 'question' field by 3.0 and the 'section' field by 0.5

def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5,
        output_ids=True
    )

    return results

# note: num_results=5 and the course filter ('data-engineering-zoomcamp') are hard-coded here
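
Because the course and the number of results are fixed inside `search`, querying another course means editing the function. One possible way around that, sketched here under assumptions, is a small factory that binds those values once. The names `make_search` and `EchoIndex` are hypothetical; `EchoIndex` is only a stand-in that echoes its arguments so the sketch runs without minsearch. With the real notebook you would pass the fitted `index` instead.

```python
def make_search(index, course, num_results=5, boost=None):
    """Return a search(query) function bound to one course."""
    if boost is None:
        boost = {'question': 3.0, 'section': 0.5}

    def search(query):
        # Same keyword arguments as the notebook's index.search call.
        return index.search(
            query=query,
            filter_dict={'course': course},
            boost_dict=boost,
            num_results=num_results,
            output_ids=True,
        )

    return search


class EchoIndex:
    """Stand-in for the real index: returns the arguments it was called with."""

    def search(self, **kwargs):
        return kwargs


search_mlops = make_search(EchoIndex(), course='mlops-zoomcamp', num_results=3)
print(search_mlops('how to deploy'))
```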

In [8]:
question = 'Can I still join the course?'

In [9]:
# test if our search function works 

search(question)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  '_id': 2},
 {'text': "No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.",
  'section': 'General course-related questions',
  'question': 'Certificate - Can I follow the course in a self-paced mode and get a certificate?',
  'course': 'data-engineering-zoomcamp',
  '_id': 11},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the cou

## STEP 2 - build prompt

In [10]:
# build a prompt

prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
{question}
</QUESTION>

<CONTEXT>
{context}
</CONTEXT>
""".strip()

def build_prompt(query, search_results):
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt



In [11]:
# test how our prompt build works by calling search(question)

prompt = build_prompt(question, search(question))
print(prompt) # better formatted for readability

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
Can I still join the course?
</QUESTION>

<CONTEXT>
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Certificate - Can I follow the course in a self-paced mode and get a certificate?
answer: No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.

section: General

## STEP 3 - Connect to OpenAI

In [12]:
# connecting to LLM, we entered our API KEY already at the first cell
client = OpenAI()

In [13]:
search_results = search(question)
# just repeating step 1 - keyword search 

In [14]:
prompt = build_prompt(question, search_results)
# repeating step 2 - building prompt on top of search results 

In [15]:
# function to send our prompt to llm 

def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content



In [16]:
# test if our llm can answer something meaningful based on prompt provided

answer = llm(prompt)
print(answer)

Yes, you can still join the course after the start date. Even if you don't register, you are eligible to submit homework. However, keep in mind that there will be deadlines for turning in the final projects, so it's advisable not to leave everything until the last minute.


In [17]:
print(prompt) # this is what we have sent to llm

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

<QUESTION>
Can I still join the course?
</QUESTION>

<CONTEXT>
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Certificate - Can I follow the course in a self-paced mode and get a certificate?
answer: No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.

section: General

## STEP 4 - assemble our RAG pipeline

In [18]:
# this is our RAG pipeline

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer



In [19]:
# test with an irrelevant question, something that is NOT in the FAQ

rag("How do I patch KDE under FreeBSD?")

"I'm sorry, but there is no information available in the FAQ database regarding how to patch KDE under FreeBSD. Please consult the official KDE or FreeBSD documentation or forums for assistance."

In [20]:
rag("What Shakespeare said about peeling a carrot?") 

'The context provided does not include any information regarding Shakespeare or his views on peeling a carrot. Therefore, I cannot answer the question about what Shakespeare said about peeling a carrot.'

In [21]:
# relevant question - it actually was in FAQ
answer = rag("How to run Kafka in Docker")
print(answer)

To run Kafka in Docker, first ensure that your Kafka broker Docker container is functioning properly. You can check the status of your Docker containers by running the command `docker ps`. If the Kafka broker is not running, navigate to the directory containing your Docker Compose YAML file and execute the command `docker compose up -d` to start all the instances.


In [22]:
# the LLM on its own knows the answer, but our RAG prompt restricts it to the FAQ context
# above, rag("How do I patch KDE under FreeBSD?") declined to answer, but a direct call answers:
print(llm("How do I patch KDE under FreeBSD?"))

Patching KDE under FreeBSD involves several steps, including downloading the source code, applying your patch, and rebuilding the application or library. Below is a general guide on how to do this:

### Prerequisites

1. **Ensure FreeBSD Ports and Source Tree are Installed:**
   Make sure you have the FreeBSD ports collection installed. If not, you can install it with the following command:
   ```sh
   portsnap fetch extract
   ```

2. **Install Required Tools:**
   Make sure you have the necessary development tools and libraries. You can install them using:
   ```sh
   pkg install git cmake gcc gmake
   ```

### Steps to Patch KDE

1. **Locate the KDE Port:**
   Find the KDE port you want to patch. For example, if you want to patch `kde5`, you can look for it in the ports collection, typically under `/usr/ports/x11/kde5`.

2. **Navigate to the Port Directory:**
   Change to the directory of the specific port. For instance:
   ```sh
   cd /usr/ports/x11/kde5
   ```

3. **Fetch the Late

In [23]:
# This is the main idea behind STEP 5 below: if the LLM can answer the question by itself, fine;
# if not, it should be able to search our FAQ database and build the context for the answer.

## STEP 5 - "Agentic" RAG

In [24]:
# essentially we only modify our prompt, so the LLM can decide either to answer immediately
# or to use a SEARCH tool to get more CONTEXT from our FAQ

prompt_template = """
You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.
At the beginning the context is EMPTY.

<QUESTION>
{question}
</QUESTION>

<CONTEXT> 
{context}
</CONTEXT>

If CONTEXT is EMPTY, you can use our FAQ database.
In this case, use the following output template:

{{
"action": "SEARCH",
"reasoning": "<add your reasoning here>"
}}

If you can answer the QUESTION using CONTEXT, use this template:

{{
"action": "ANSWER",
"answer": "<your answer>",
"source": "CONTEXT"
}}

If the context doesn't contain the answer, use your own knowledge to answer the question

{{
"action": "ANSWER",
"answer": "<your answer>",
"source": "OWN_KNOWLEDGE"
}}
""".strip()
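
The doubled braces in the template are `str.format` escapes: `{{` and `}}` produce a literal `{` and `}`, while single-brace fields such as `{question}` are substituted. A small sketch with a shortened, made-up template (`mini_template` is not part of the notebook):

```python
mini_template = """
<QUESTION>
{question}
</QUESTION>

{{
"action": "SEARCH"
}}
""".strip()

# {question} is filled in; the doubled braces collapse to single ones.
rendered = mini_template.format(question="Can I still join the course?")
print(rendered)
```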

In [25]:
question = 'Can I still join the course?'
context = 'EMPTY'

In [26]:
prompt = prompt_template.format(question=question, context=context)
print(prompt)

You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.
At the beginning the context is EMPTY.

<QUESTION>
Can I still join the course?
</QUESTION>

<CONTEXT> 
EMPTY
</CONTEXT>

If CONTEXT is EMPTY, you can use our FAQ database.
In this case, use the following output template:

{
"action": "SEARCH",
"reasoning": "<add your reasoning here>"
}

If you can answer the QUESTION using CONTEXT, use this template:

{
"action": "ANSWER",
"answer": "<your answer>",
"source": "CONTEXT"
}

If the context doesn't contain the answer, use your own knowledge to answer the question

{
"action": "ANSWER",
"answer": "<your answer>",
"source": "OWN_KNOWLEDGE"
}


In [27]:
answer_json = llm(prompt)
answer_json
# the model chose the "SEARCH" action because CONTEXT is empty
# and gave its reason in the "reasoning" field (see the output below)

'{\n"action": "SEARCH",\n"reasoning": "The question about joining the course requires specific information regarding enrollment deadlines or policies, which is not provided in the current context."\n}'

In [28]:
import json
# we can parse the llm answer and use tool if appropriate:

answer = json.loads(answer_json)
answer['action']

'SEARCH'
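
`json.loads` raises `json.JSONDecodeError` when the reply is not pure JSON, for example when the model wraps it in a Markdown code fence or adds a sentence around it, which chat models sometimes do. A minimal sketch of a more tolerant parser; the helper name `parse_llm_json` is hypothetical and not part of the original notebook, and the fallback (take everything between the first `{` and the last `}`) is a simple heuristic, not a full JSON extractor.

```python
import json


def parse_llm_json(text):
    """Parse a JSON object from an LLM reply, tolerating surrounding text.

    Tries the reply as-is first, then falls back to the span between the
    first '{' and the last '}'.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end <= start:
        raise ValueError(f'no JSON object found in reply: {text!r}')
    return json.loads(text[start:end + 1])


# A made-up reply wrapped in a Markdown code fence.
fence = '`' * 3
fenced_reply = fence + 'json\n{"action": "SEARCH", "reasoning": "context is empty"}\n' + fence
print(parse_llm_json(fenced_reply)['action'])
```

In the notebook, this could stand in for the bare `json.loads(answer_json)` calls if the model's replies turn out to be inconsistent.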

In [29]:
# let's ask the LLM something it already knows well:
question = 'Can I run Docker on Windows 10?'
context = 'EMPTY'
prompt = prompt_template.format(question=question, context=context)

In [30]:
answer_json = llm(prompt)
answer_json
# it answers immediately, without searching: "action": "ANSWER"
# and reports that it used its own knowledge: "source": "OWN_KNOWLEDGE"

'{\n"action": "ANSWER",\n"answer": "Yes, you can run Docker on Windows 10. Docker provides a version called Docker Desktop that works on Windows 10 Pro, Enterprise, and Education editions, and it uses the Windows Subsystem for Linux (WSL 2) for running Linux containers. If you are using Windows 10 Home edition, you can still run Docker Desktop, but it will rely on WSL 2 directly. You will need to ensure that WSL 2 is installed and enabled before installing Docker Desktop.",\n"source": "OWN_KNOWLEDGE"\n}'

In [31]:
answer = json.loads(answer_json)
answer['action']

'ANSWER'

In [32]:
# if the model says SEARCH, we run the search tool, build the context, and update the prompt

def build_context(search_results):
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    return context.strip()



In [33]:
# asking something it should search for:
question = 'Can I still join the course?'

search_results = search(question)
context = build_context(search_results)
prompt = prompt_template.format(question=question, context=context)
print(prompt) # for better formatting

You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.
At the beginning the context is EMPTY.

<QUESTION>
Can I still join the course?
</QUESTION>

<CONTEXT> 
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Certificate - Can I follow the course in a self-paced mode and get a certificate?
answer: No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the cour

In [34]:
# let's ask the LLM again, now that we have run the search and filled in the context

answer_json = llm(prompt)
print(answer_json)

{
"action": "ANSWER",
"answer": "Yes, you can still join the course even after the start date. While registration is encouraged, you can submit the homework without officially registering. Just keep in mind that there are deadlines for the final projects, so be sure to manage your time wisely.",
"source": "CONTEXT"
}


In [35]:
# put everything together and parse the JSON response automatically
# logic: if "action" is "SEARCH", run our FAQ search tool and ask the LLM again with the context
# if "action" is "ANSWER", return the answer immediately

def agentic_rag_v1(question):
    context = "EMPTY"
    prompt = prompt_template.format(question=question, context=context)
    answer_json = llm(prompt)
    answer = json.loads(answer_json)
    print(answer)

    if answer['action'] == 'SEARCH':
        print('need to perform search...')
        search_results = search(question)
        context = build_context(search_results)
        
        prompt = prompt_template.format(question=question, context=context)
        answer_json = llm(prompt)
        answer = json.loads(answer_json)
        print(answer)

    return answer



In [36]:
# test our agentic RAG function

agentic_rag_v1('how do I join the course?')
# first reply:  'action': 'SEARCH', 'reasoning': '... the context is currently empty ...'
# second reply: 'action': 'ANSWER', 'source': 'CONTEXT'

{'action': 'SEARCH', 'reasoning': 'The question is about how to join the course, and the context is currently empty, so I need to check the FAQ database for guidance on enrollment procedures.'}
need to perform search...
{'action': 'ANSWER', 'answer': "To join the course, you need to register before the course starts. Use the provided registration link to sign up. The course will start on 15th January 2024 at 17h00, and you can also find important announcements by joining the course's Telegram channel. Don't forget to subscribe to the course public Google Calendar to keep track of important dates.", 'source': 'CONTEXT'}


{'action': 'ANSWER',
 'answer': "To join the course, you need to register before the course starts. Use the provided registration link to sign up. The course will start on 15th January 2024 at 17h00, and you can also find important announcements by joining the course's Telegram channel. Don't forget to subscribe to the course public Google Calendar to keep track of important dates.",
 'source': 'CONTEXT'}

In [37]:
agentic_rag_v1('how patch KDE under FreeBSD?')
# in this run the model did not ask for SEARCH; it answered directly:
# 'action': 'ANSWER'
# 'source': 'OWN_KNOWLEDGE'

{'action': 'ANSWER', 'answer': 'To patch KDE under FreeBSD, you generally need to follow these steps:\n\n1. **Install the necessary tools**: Ensure that you have `portsnap` or `ports` collection installed, along with `editors` or `git` for modifying code.\n\n2. **Fetch the KDE source code**: You can download the KDE ports from the FreeBSD ports tree using `portsnap fetch extract` or from the official KDE repositories.\n\n3. **Locate the specific port**: Navigate to the directory of the KDE component you wish to patch (e.g., `/usr/ports/x11/kde5`).\n\n4. **Create a patch**: Modify the source files as needed and then create a patch file using the `diff` command. Example:\n   ```\n   diff -u originalfile.cpp modifiedfile.cpp > mypatch.patch\n   ```\n\n5. **Apply the patch**: Move the patch file to the relevant port directory (e.g., `/usr/ports/x11/kde5/files/`) and apply it using the `patch` command:\n   ```\n   patch < mypatch.patch\n   ```\n\n6. **Build and install**: Run `make install`

{'action': 'ANSWER',
 'answer': 'To patch KDE under FreeBSD, you generally need to follow these steps:\n\n1. **Install the necessary tools**: Ensure that you have `portsnap` or `ports` collection installed, along with `editors` or `git` for modifying code.\n\n2. **Fetch the KDE source code**: You can download the KDE ports from the FreeBSD ports tree using `portsnap fetch extract` or from the official KDE repositories.\n\n3. **Locate the specific port**: Navigate to the directory of the KDE component you wish to patch (e.g., `/usr/ports/x11/kde5`).\n\n4. **Create a patch**: Modify the source files as needed and then create a patch file using the `diff` command. Example:\n   ```\n   diff -u originalfile.cpp modifiedfile.cpp > mypatch.patch\n   ```\n\n5. **Apply the patch**: Move the patch file to the relevant port directory (e.g., `/usr/ports/x11/kde5/files/`) and apply it using the `patch` command:\n   ```\n   patch < mypatch.patch\n   ```\n\n6. **Build and install**: Run `make install

## STEP 6 - Agentic Search


So far we had only two actions: search and answer.

But we can let our "agent" formulate one or more search queries and repeat this for a few iterations until it finds an answer.

Let's build a prompt that:

- Lists the available actions:
  - Search in FAQ
  - Answer using own knowledge
  - Answer using information extracted from FAQ
- Provides access to the previous actions
- Has clear stop criteria (no more than X iterations)

We also specify the output format, so it's easier to parse.

In [38]:
# deduplication: keep the first search result seen for each '_id'

def dedup(seq):
    seen = set()
    result = []
    for el in seq:
        _id = el['_id']
        if _id in seen:
            continue
        seen.add(_id)
        result.append(el)
    return result
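
Different keywords often return the same FAQ entry, so the merged result list needs deduplicating by `_id`. A compact sketch of the same idea using a dict; the name `dedup_first` and the sample hits are made up for illustration. It relies on dicts preserving insertion order, so the first occurrence of each `_id` keeps its position.

```python
def dedup_first(seq):
    """Drop repeated search results, keeping the first one seen per '_id'."""
    unique = {}
    for el in seq:
        unique.setdefault(el['_id'], el)  # ignored if this _id is already stored
    return list(unique.values())


hits = [
    {'_id': 2, 'question': 'Can I still join the course?'},
    {'_id': 11, 'question': 'Can I get a certificate?'},
    {'_id': 2, 'question': 'Can I still join the course?'},
]
print([el['_id'] for el in dedup_first(hits)])  # → [2, 11]
```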



In [39]:
prompt_template = """
You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic. 

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than {max_iterations} iterations for a given student question.
The current iteration number: {iteration_number}. If we exceed the allowed number 
of iterations, give the best possible answer with the provided information.

Output templates:

If you want to perform search, use this template:

{{
"action": "SEARCH",
"reasoning": "<add your reasoning here>",
"keywords": ["search query 1", "search query 2", ...]
}}

If you can answer the QUESTION using CONTEXT, use this template:

{{
"action": "ANSWER_CONTEXT",
"answer": "<your answer>",
"source": "CONTEXT"
}}

If the context doesn't contain the answer, use your own knowledge to answer the question

{{
"action": "ANSWER",
"answer": "<your answer>",
"source": "OWN_KNOWLEDGE"
}}

<QUESTION>
{question}
</QUESTION>

<SEARCH_QUERIES>
{search_queries}
</SEARCH_QUERIES>

<CONTEXT> 
{context}
</CONTEXT>

<PREVIOUS_ACTIONS>
{previous_actions}
</PREVIOUS_ACTIONS>
""".strip()
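
The prompt defines three output templates, but nothing guarantees that the model sticks to them. A minimal sketch of a check that could run on the parsed reply before the loop acts on it; the names `REQUIRED_KEYS` and `validate_action` are hypothetical, and the required keys are simply read off the templates above.

```python
# Keys each output template in the prompt is expected to contain.
REQUIRED_KEYS = {
    'SEARCH': {'reasoning', 'keywords'},
    'ANSWER_CONTEXT': {'answer', 'source'},
    'ANSWER': {'answer', 'source'},
}


def validate_action(answer):
    """Check that a parsed reply matches one of the three output templates."""
    action = answer.get('action')
    if action not in REQUIRED_KEYS:
        raise ValueError(f'unknown action: {action!r}')

    missing = REQUIRED_KEYS[action] - set(answer)
    if missing:
        raise ValueError(f'{action} reply is missing keys: {sorted(missing)}')

    if action == 'SEARCH' and not isinstance(answer['keywords'], list):
        raise ValueError('keywords must be a list of search queries')
    return answer


reply = {'action': 'SEARCH', 'reasoning': 'context is empty', 'keywords': ['module 1 tips']}
print(validate_action(reply)['keywords'])
```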

In [40]:
# iterative task: search, update the context, search again if needed, and drop duplicate results
# initial state before the first iteration:

question = 'how do I do well on module 1'
max_iterations = 3
iteration_number = 0
search_queries = []
search_results = []
previous_actions = []



In [41]:
context = build_context(search_results)

prompt = prompt_template.format(
    question=question,
    context=context,
    search_queries="\n".join(search_queries),
    previous_actions='\n'.join([json.dumps(a) for a in previous_actions]),
    max_iterations=max_iterations,
    iteration_number=iteration_number
)

print(prompt)

You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic. 

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than 3 iterations for a given student question.
The current iteration number

In [42]:
answer_json = llm(prompt)
answer_json

'{\n"action": "SEARCH",\n"reasoning": "To provide a comprehensive answer on how to do well in module 1, I need to find specific strategies, tips, or resources outlined in the FAQ that address academic success in the module.",\n"keywords": ["how to succeed in module 1", "tips for module 1", "successful strategies module 1"]\n}'

In [43]:
answer = json.loads(answer_json)
answer

{'action': 'SEARCH',
 'reasoning': 'To provide a comprehensive answer on how to do well in module 1, I need to find specific strategies, tips, or resources outlined in the FAQ that address academic success in the module.',
 'keywords': ['how to succeed in module 1',
  'tips for module 1',
  'successful strategies module 1']}

In [44]:
previous_actions.append(answer)
keywords = answer['keywords']
keywords

['how to succeed in module 1',
 'tips for module 1',
 'successful strategies module 1']

In [45]:
# now search for each keyword and collect the results for the context

for kw in keywords:
    search_queries.append(kw)
    sr = search(kw)
    search_results.extend(sr)

In [46]:
# remove duplicates and see what we found

search_results = dedup(search_results)
search_results

[{'text': 'You need to look for the Py4J file and note the version of the filename. Once you know the version, you can update the export command accordingly, this is how you check yours:\n` ls ${SPARK_HOME}/python/lib/ ` and then you add it in the export command, mine was:\nexport PYTHONPATH=”${SPARK_HOME}/python/lib/Py4J-0.10.9.5-src.zip:${PYTHONPATH}”\nMake sure that the version under `${SPARK_HOME}/python/lib/` matches the filename of py4j or you will encounter `ModuleNotFoundError: No module named \'py4j\'` while executing `import pyspark`.\nFor instance, if the file under `${SPARK_HOME}/python/lib/` was `py4j-0.10.9.3-src.zip`.\nThen the export PYTHONPATH statement above should be changed to `export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"` appropriately.\nAdditionally, you can check for the version of ‘py4j’ of the spark you’re using from here and update as mentioned above.\n~ Abhijit Chakraborty: Sometimes, even with adding the correct version of p

In [47]:
iteration_number = 1  # second pass; the first prompt above used iteration 0

context = build_context(search_results)

prompt = prompt_template.format(
    question=question,
    context=context,
    search_queries="\n".join(search_queries),
    previous_actions='\n'.join([json.dumps(a) for a in previous_actions]),
    max_iterations=max_iterations,
    iteration_number=iteration_number
)

In [48]:
answer_json = llm(prompt)
answer = json.loads(answer_json)
print(answer)

{'action': 'SEARCH', 'reasoning': "Since the previous search did not yield specific tips or strategies for succeeding in Module 1, I'll search for generic study tips or strategies that may apply to success in an academic module, focusing on relevant skills such as Docker and Terraform.", 'keywords': ['study tips for Docker', 'success strategies for Terraform', 'best practices for learning Docker and Terraform', 'effective studying techniques for Module 1']}


In [49]:
# now let's run the search in a loop (it is wrapped into a function further below)

question = "what do I need to do to be successful at module 1?"

search_queries = []
search_results = []
previous_actions = []

iteration = 0

while True:
    print(f'ITERATION #{iteration}...')

    context = build_context(search_results)
    prompt = prompt_template.format(
        question=question,
        context=context,
        search_queries="\n".join(search_queries),
        previous_actions='\n'.join([json.dumps(a) for a in previous_actions]),
        max_iterations=3,
        iteration_number=iteration
    )

    print(prompt)

    answer_json = llm(prompt)
    answer = json.loads(answer_json)
    print(json.dumps(answer, indent=2))

    previous_actions.append(answer)

    action = answer['action']
    if action != 'SEARCH':
        break

    keywords = answer['keywords']
    search_queries = list(set(search_queries) | set(keywords))
    
    for k in keywords:
        res = search(k)
        search_results.extend(res)

    search_results = dedup(search_results)
    
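    iteration = iteration + 1
    # hard stop after 4 passes (0-3), one more than the max_iterations=3 given in the prompt
    # note: if this triggers, the last `answer` is a SEARCH action and has no 'answer' key
    if iteration >= 4:
        break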

    print()


ITERATION #0...
You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic. 

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than 3 iterations for a given student question.
The current 

In [50]:
print(answer['answer'])  # print only the answer text, for readability

To be successful in Module 1, which covers Docker and Terraform, consider the following strategies:

1. **Understand the Basics**: Gain a solid understanding of Docker and Terraform concepts, including containers, images, orchestration, and infrastructure as code.

2. **Hands-On Practice**: Engage in hands-on practice by setting up Docker containers and using Terraform to manage infrastructure. Follow tutorial guides and execute example projects.

3. **Learn Command Line Proficiency**: Familiarize yourself with command line operations for both Docker and Terraform, as much of the interaction with these tools occurs via the terminal.

4. **Study the Documentation**: Regularly reference the official documentation for Docker and Terraform. This will provide you with the most accurate and up-to-date information.

5. **Utilize Online Resources**: Explore additional online tutorials, videos, and forums. Community discussions can provide valuable insights and tips.

6. **Group Study**: If pos

In [51]:
iteration

2

In [52]:
# put together agentic search as a function

def agentic_search(question):
    search_queries = []
    search_results = []
    previous_actions = []

    iteration = 0
    
    while True:
        print(f'ITERATION #{iteration}...')
    
        context = build_context(search_results)
        prompt = prompt_template.format(
            question=question,
            context=context,
            search_queries="\n".join(search_queries),
            previous_actions='\n'.join([json.dumps(a) for a in previous_actions]),
            max_iterations=3,
            iteration_number=iteration
        )
    
        print(prompt)
    
        answer_json = llm(prompt)
        answer = json.loads(answer_json)
        print(json.dumps(answer, indent=2))

        previous_actions.append(answer)
    
        action = answer['action']
        if action != 'SEARCH':
            break
    
        keywords = answer['keywords']
        search_queries = list(set(search_queries) | set(keywords))

        for k in keywords:
            res = search(k)
            search_results.extend(res)
    
        search_results = dedup(search_results)
        
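        iteration = iteration + 1
        # hard stop after 4 passes (0-3), one more than the max_iterations=3 given in the prompt
        # note: if this triggers, the returned `answer` is a SEARCH action without an 'answer' key
        if iteration >= 4:
            break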
    
        print()

    return answer
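
The loop above can only be exercised with a live API key, and if the hard stop triggers it returns a SEARCH action rather than an answer. A sketch of a variant that addresses both points, under stated assumptions: the LLM, search, and prompt builder are passed in as arguments so they can be replaced by scripted stand-ins, and an exhausted loop returns an explicit marker. The names `run_agent_loop`, `fake_llm`, `fake_search`, `fake_prompt`, and the `'ITERATION_LIMIT'` source value are all hypothetical and not part of the original notebook.

```python
import json


def run_agent_loop(question, llm, search, build_prompt, max_iterations=3):
    """Bounded search/answer loop with injected dependencies.

    llm(prompt) -> JSON string; search(query) -> list of dicts with '_id';
    build_prompt(question, results, queries, actions, iteration) -> str.
    """
    queries, results, actions = [], [], []
    seen_ids = set()

    # Passes 0..max_iterations, so the model gets one extra pass to answer.
    for iteration in range(max_iterations + 1):
        prompt = build_prompt(question, results, queries, actions, iteration)
        answer = json.loads(llm(prompt))
        actions.append(answer)

        if answer['action'] != 'SEARCH':
            return answer

        for kw in answer['keywords']:
            if kw in queries:
                continue  # skip queries that were already run
            queries.append(kw)
            for doc in search(kw):
                if doc['_id'] not in seen_ids:
                    seen_ids.add(doc['_id'])
                    results.append(doc)

    # The model kept asking to search: say so instead of returning a SEARCH action.
    return {'action': 'ANSWER', 'answer': None, 'source': 'ITERATION_LIMIT'}


# Scripted stand-ins so the loop can run without an API key.
replies = iter([
    '{"action": "SEARCH", "reasoning": "need data", "keywords": ["module 1"]}',
    '{"action": "ANSWER_CONTEXT", "answer": "Practice Docker.", "source": "CONTEXT"}',
])
fake_llm = lambda prompt: next(replies)
fake_search = lambda query: [{'_id': 1, 'text': 'Docker basics'}]
fake_prompt = lambda question, results, queries, actions, iteration: question

result = run_agent_loop('how do I do well on module 1', fake_llm, fake_search, fake_prompt)
print(result['action'])
```

With the notebook's own pieces, `llm` and `search` could be passed in directly, and `build_prompt` would be a small wrapper around `prompt_template.format(...)`.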



In [53]:
# and test our agentic search function

answer = agentic_search('how do I prepare for the course?')


ITERATION #0...
You're a course teaching assistant.

You're given a QUESTION from a course student and that you need to answer with your own knowledge and provided CONTEXT.

The CONTEXT is build with the documents from our FAQ database.
SEARCH_QUERIES contains the queries that were used to retrieve the documents
from FAQ to and add them to the context.
PREVIOUS_ACTIONS contains the actions you already performed.

At the beginning the CONTEXT is empty.

You can perform the following actions:

- Search in the FAQ database to get more data for the CONTEXT
- Answer the question using the CONTEXT
- Answer the question using your own knowledge

For the SEARCH action, build search requests based on the CONTEXT and the QUESTION.
Carefully analyze the CONTEXT and generate the requests to deeply explore the topic. 

Don't use search queries used at the previous iterations.

Don't repeat previously performed actions.

Don't perform more than 3 iterations for a given student question.
The current 

In [54]:
# for better formatting

print(answer)

{'action': 'ANSWER_CONTEXT', 'answer': "To prepare for the course, make sure to register before the course starts using the provided registration link. Join the public Google Calendar to stay updated on class times and events, and subscribe to the course's Telegram channel for announcements. It's also advisable to check the materials available after the course concludes, as you can continue learning at your own pace and work on your final capstone project. Additionally, visit the document containing recommended resources for further reading: [Awesome Data Engineering](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/awesome-data-engineering.md).", 'source': 'CONTEXT'}


In [55]:
# no need to parse JSON here: agentic_search already returns a dict (it calls json.loads internally)
# answer = json.loads(answer)

print(answer['answer'])


To prepare for the course, make sure to register before the course starts using the provided registration link. Join the public Google Calendar to stay updated on class times and events, and subscribe to the course's Telegram channel for announcements. It's also advisable to check the materials available after the course concludes, as you can continue learning at your own pace and work on your final capstone project. Additionally, visit the document containing recommended resources for further reading: [Awesome Data Engineering](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/awesome-data-engineering.md).


## STEP 7 - Function calling ("tool use")

Part 3: Function calling - code from
https://github.com/alexeygrigorev/rag-agents-workshop/tree/main

Function calling in OpenAI

So far, we have put all of this logic inside our prompt.

But OpenAI and other providers offer a convenient API for adding extra functionality such as search.

https://platform.openai.com/docs/guides/function-calling

It's called "function calling": you define functions that the model can call, and if it decides to call one, it returns structured output describing that call.

For example, let's take our search function:

In [56]:
# search tool

def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5,
        output_ids=True
    )

    return results



In [57]:
answer = search("How to run Kafka in Docker")
print(answer)

[{'text': "Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\nTo create a virtual env and install packages (run only once)\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\nTo activate it (you'll need to run it every time you need the virtual env):\nsource env/bin/activate\nTo deactivate it:\ndeactivate\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it's env/Scripts/activate)\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.", 'section': 'Module 6: streaming with kafka', 'question': 'Module “kafka” not found when trying to run producer.py', 'course': 'data-engineering-zoomcamp', '_id': 372}, {'text': "Below I have listed some steps I took to rectify this and potentially other minor errors, in Windows:\nUse the git bash terminal in windows.\nActivate python venv from git

In [58]:
# how to describe our search function to OpenAI:
# (this is a JSON-style description of the search tool defined above, which wraps minsearch)

search_tool = {
    "type": "function",
    "name": "search",
    "description": "Search the FAQ database",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query text to look up in the course FAQ."
            }
        },
        "required": ["query"],
        "additionalProperties": False
    }
}

# "name": "search" must match the name of our Python function search(query) from above.
# The model never runs Python itself: it returns a function call object with this name and
# a JSON string of arguments (here "query"), and our code then calls search(query).
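
Writing the tool description by hand is fine for one function, but the same dictionary could also be derived from the function itself. The sketch below shows one way to do that with `inspect`. `describe_tool` and `search_stub` are hypothetical names, and the helper assumes every parameter is a required string, as is the case for `search(query)`.

```python
import inspect


def describe_tool(func, description, param_descriptions):
    """Build a tool description dict for `func` (hypothetical helper).

    Assumes every parameter is a required string, as with search(query).
    """
    params = inspect.signature(func).parameters
    properties = {
        name: {"type": "string", "description": param_descriptions[name]}
        for name in params
    }
    return {
        "type": "function",
        "name": func.__name__,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(params),
            "additionalProperties": False,
        },
    }


def search_stub(query):
    # Stand-in for the real minsearch-backed search function.
    return []


demo_tool = describe_tool(
    search_stub,
    "Search the FAQ database",
    {"query": "Search query text to look up in the course FAQ."},
)
print(demo_tool["name"], demo_tool["parameters"]["required"])
```

Deriving the name from `func.__name__` keeps the description and the Python function in sync, which matters because the function is later looked up by that name.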



In [59]:
# Here we have:

# name: search
# description: when to use it
# parameters: all the arguments that the function can take and their description
# In order to use function calling, we'll use a newer API - the "responses" API (not "chat completions" as previously):

question = "How do I do well in module 1?"

developer_prompt = """
You're a course teaching assistant. 
You're given a question from a course student and your task is to answer it.
""".strip()

tools = [search_tool]

chat_messages = [
    {"role": "developer", "content": developer_prompt},
    {"role": "user", "content": question}
]

response = client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=tools
)

response.output


[ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_2dd1C8lgE68HKjnlGgCyh9bK', name='search', type='function_call', id='fc_68bc63d64e50819eabd5f8a08add38780388b2f6a5ecf56d', status='completed')]

In [60]:
# Let's execute the requested call ourselves with the search(query) function defined earlier,
# which queries the minsearch index (a minsearch.append.AppendableIndex object)

calls = response.output
call = calls[0]
call


ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_2dd1C8lgE68HKjnlGgCyh9bK', name='search', type='function_call', id='fc_68bc63d64e50819eabd5f8a08add38780388b2f6a5ecf56d', status='completed')

In [61]:
call_id = call.call_id
call_id

'call_2dd1C8lgE68HKjnlGgCyh9bK'

In [62]:
f_name = call.name
f_name

'search'

In [63]:
arguments = json.loads(call.arguments)
arguments


{'query': 'how to do well in module 1'}

In [64]:
# Nice Python trick - use the globals() function, which returns a dict of all global variables and functions in our environment

globals()[f_name] # which is f_name = call.name
# <function __main__.search(query)> - this is our minsearch function

<function __main__.search(query)>

In [65]:
globals()["search_tool"] 
# shows the dictionary that describes our tool to the OpenAI function-calling interface

{'type': 'function',
 'name': 'search',
 'description': 'Search the FAQ database',
 'parameters': {'type': 'object',
  'properties': {'query': {'type': 'string',
    'description': 'Search query text to look up in the course FAQ.'}},
  'required': ['query'],
  'additionalProperties': False}}

In [66]:
# essentially using globals() we can call any function and pass arguments
arguments

{'query': 'how to do well in module 1'}
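
The `globals()` lookup is convenient in a notebook, but it lets a model-supplied name reach any global object. A common alternative is an explicit registry of allowed tools. The following is a minimal sketch of that idea. `TOOL_REGISTRY`, `call_tool` and `echo` are hypothetical names used only for illustration.

```python
def echo(text):
    # Hypothetical stand-in tool; in the notebook this would be search(query).
    return [{"text": text}]


# Explicit allow-list: only functions registered here can be called by name.
TOOL_REGISTRY = {"echo": echo}


def call_tool(name, arguments):
    """Look up a tool by name and call it with keyword arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**arguments)


print(call_tool("echo", {"text": "hello"}))
```

The call pattern is the same as `globals()[f_name](**arguments)`, only restricted to the names listed in the registry.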

In [72]:
f = globals()[f_name]

results = f(**arguments) # here we call our minsearch search function, which returns up to 5 FAQ entries for the query

search_results = json.dumps(results, indent=2)
print(search_results)

[
  {
    "text": "Following dbt with BigQuery on Docker readme.md, after `docker-compose build` and `docker-compose run dbt-bq-dtc init`, encountered error `ModuleNotFoundError: No module named 'pytz'`\nSolution:\nAdd `RUN python -m pip install --no-cache pytz` in the Dockerfile under `FROM --platform=$build_for python:3.9.9-slim-bullseye as base`",
    "section": "Module 4: analytics engineering with dbt",
    "question": "DBT - Error: No module named 'pytz' while setting up dbt with docker",
    "course": "data-engineering-zoomcamp",
    "_id": 299
  },
  {
    "text": "Even after installing pyspark correctly on linux machine (VM ) as per course instructions, faced a module not found error in jupyter notebook .\nThe solution which worked for me(use following in jupyter notebook) :\n!pip install findspark\nimport findspark\nfindspark.init()\nThereafter , import pyspark and create spark contex<<t as usual\nNone of the solutions above worked for me till I ran !pip3 install pyspark inst

In [70]:
call


ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_2dd1C8lgE68HKjnlGgCyh9bK', name='search', type='function_call', id='fc_68bc63d64e50819eabd5f8a08add38780388b2f6a5ecf56d', status='completed')

In [71]:
chat_messages # this list is the conversation history, i.e. the model's "memory"

[{'role': 'developer',
  'content': "You're a course teaching assistant. \nYou're given a question from a course student and your task is to answer it."},
 {'role': 'user', 'content': 'How do I do well in module 1?'}]

In [73]:
# save both the model's function call and the result of executing it

chat_messages.append(call)

chat_messages.append({
    "type": "function_call_output",
    "call_id": call.call_id,
    "output": search_results,
})

chat_messages

[{'role': 'developer',
  'content': "You're a course teaching assistant. \nYou're given a question from a course student and your task is to answer it."},
 {'role': 'user', 'content': 'How do I do well in module 1?'},
 ResponseFunctionToolCall(arguments='{"query":"how to do well in module 1"}', call_id='call_2dd1C8lgE68HKjnlGgCyh9bK', name='search', type='function_call', id='fc_68bc63d64e50819eabd5f8a08add38780388b2f6a5ecf56d', status='completed'),
 {'type': 'function_call_output',
  'call_id': 'call_2dd1C8lgE68HKjnlGgCyh9bK',
  'output': '[\n  {\n    "text": "Following dbt with BigQuery on Docker readme.md, after `docker-compose build` and `docker-compose run dbt-bq-dtc init`, encountered error `ModuleNotFoundError: No module named \'pytz\'`\\nSolution:\\nAdd `RUN python -m pip install --no-cache pytz` in the Dockerfile under `FROM --platform=$build_for python:3.9.9-slim-bullseye as base`",\n    "section": "Module 4: analytics engineering with dbt",\n    "question": "DBT - Error: No m

In [74]:
# Now chat_messages contains both the call description (so it keeps track of history) and the results

# Let's make another call to the model:

response = client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=tools
)

In [75]:
# This time it should be a text response (but it can also be another function call):

r = response.output[0]
print(r.content[0].text)

To do well in Module 1, consider the following tips:

1. **Understand Key Concepts**: Familiarize yourself with the basics of Docker and Terraform, as these are central to the module. Make sure you understand how they work and their applications in data engineering.

2. **Follow Installation Instructions**: Ensure that you correctly install all necessary tools and libraries. For instance, the `psycopg2` library is crucial for PostgreSQL integration. If you encounter errors such as `ModuleNotFoundError: No module named 'psycopg2'`, use the command:
   ```bash
   pip install psycopg2-binary
   ```

3. **Hands-On Practice**: Engage in hands-on exercises and projects using Docker and Terraform. Setting up a PostgreSQL container and connecting to it through Python can solidify your understanding.

4. **Troubleshooting**: Familiarize yourself with common errors and their solutions. For example, if you encounter problems with SQLAlchemy and PostgreSQL connection, ensure that the `psycopg2` mo

## STEP 8 - Making multiple calls

In [76]:
# Making multiple calls
# What if we want to make multiple calls? Change the developer prompt a little:

developer_prompt = """
You're a course teaching assistant. 
You're given a question from a course student and your task is to answer it.
If you look up something in FAQ, convert the student question into multiple queries.
""".strip()

chat_messages = [
    {"role": "developer", "content": developer_prompt},
    {"role": "user", "content": question}
]

response = client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=tools
)

question # 'How do I do well in module 1?'

'How do I do well in module 1?'

In [77]:
# Organising code to look neat
# First, create a function do_call:

def do_call(tool_call_response):
    function_name = tool_call_response.name
    arguments = json.loads(tool_call_response.arguments)

    f = globals()[function_name]
    result = f(**arguments)

    return {
        "type": "function_call_output",
        "call_id": tool_call_response.call_id,
        "output": json.dumps(result, indent=2),
    }
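
`do_call` raises if the model sends malformed arguments or names a function that does not exist. One option is to send the error back as the tool output so the model can react to it. The sketch below assumes that approach. `safe_do_call` and the `registry` argument are hypothetical, and `SimpleNamespace` only imitates the fields of a tool call object that the function reads.

```python
import json
from types import SimpleNamespace


def safe_do_call(tool_call, registry):
    """Variant of do_call that reports failures instead of raising (sketch)."""
    try:
        arguments = json.loads(tool_call.arguments)
        result = registry[tool_call.name](**arguments)
        output = json.dumps(result, indent=2)
    except Exception as exc:  # broad on purpose: any failure goes back to the model
        output = json.dumps({"error": f"{type(exc).__name__}: {exc}"})

    return {
        "type": "function_call_output",
        "call_id": tool_call.call_id,
        "output": output,
    }


# Stand-in for a tool call object with only the fields used above.
bad_call = SimpleNamespace(name="missing_tool", arguments="{}", call_id="call_demo")
print(safe_do_call(bad_call, registry={}))
```

Whether to surface errors to the model or fail fast is a design choice. During development, letting the exception propagate, as `do_call` does, is often easier to debug.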



In [78]:
# Now iterate over responses:

for entry in response.output:
    chat_messages.append(entry)
    print(entry.type)

    if entry.type == 'function_call':      
        result = do_call(entry)
        chat_messages.append(result)
    elif entry.type == 'message':
        print(entry.content[0].text)  # message entries keep their text in content[0].text



function_call
function_call


In [79]:
# The first response will probably contain function calls, so let's call the model again:

response = client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=tools
)

for entry in response.output:
    chat_messages.append(entry)
    print(entry.type)
    print()

    if entry.type == 'function_call':      
        result = do_call(entry)
        chat_messages.append(result)
    elif entry.type == 'message':
        print(entry.content[0].text) 

# this time it is a text response, because the LLM now has the search results in its context

message

To do well in Module 1, here are some tips and important details that can help you succeed:

1. **Familiarize with Tools**:
   - Make sure you understand how to set up Docker and Terraform as they are crucial for this module.
   - Follow the official documentation and the readme files closely. For example, ensure that you have Docker installed and configured correctly.

2. **Resolve Common Issues**:
   - You might encounter specific errors like `ModuleNotFoundError: No module named 'psycopg2'`. To resolve this, you can run:
     ```bash
     pip install psycopg2-binary
     ```
     If issues persist, consider updating your package manager and reinstalling.

3. **Practice Coding**:
   - Regularly practice coding examples and assignments provided in the module to solidify your understanding.
   - Use Jupyter Notebook effectively, and ensure all necessary packages are installed.

4. **Engage with the Community**:
   - Don’t hesitate to ask questions in forums or community boards
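
The two cells above repeat the same steps by hand: call the model, execute any function calls, and call the model again. That pattern can be folded into a loop that stops once a round contains no function calls. The following is a minimal sketch under stated assumptions. `run_until_answer` and its `create_response` argument are hypothetical, and the scripted stand-in replaces the real API so the example runs offline.

```python
from types import SimpleNamespace


def run_until_answer(create_response, messages, handle_call, max_rounds=5):
    """Call the model repeatedly until a round contains no function calls.

    create_response(messages) -> object with an .output list (hypothetical
    wrapper; in the notebook it would wrap client.responses.create).
    handle_call(entry) -> function_call_output dict, e.g. do_call.
    """
    for _ in range(max_rounds):
        response = create_response(messages)
        has_calls = False
        for entry in response.output:
            messages.append(entry)
            if entry.type == "function_call":
                messages.append(handle_call(entry))
                has_calls = True
        if not has_calls:
            return response
    raise RuntimeError("model kept requesting tool calls")


# Scripted stand-in for the API: one tool call, then a final message.
scripted = [
    SimpleNamespace(output=[SimpleNamespace(type="function_call", call_id="c1")]),
    SimpleNamespace(output=[SimpleNamespace(type="message")]),
]
rounds = iter(scripted)

history = []
final = run_until_answer(
    create_response=lambda msgs: next(rounds),
    messages=history,
    handle_call=lambda e: {
        "type": "function_call_output",
        "call_id": e.call_id,
        "output": "[]",
    },
)
print(final.output[0].type, len(history))
```

In the notebook, `create_response` could be `lambda m: client.responses.create(model='gpt-4o-mini', input=m, tools=tools)`, with `chat_messages` as `messages` and `do_call` as `handle_call`. The `max_rounds` cap guards against a model that never stops requesting searches.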

## STEP 9 - a Python class to wrap the chat logic

In [81]:
!wget https://raw.githubusercontent.com/alexeygrigorev/rag-agents-workshop/refs/heads/main/chat_assistant.py

# code from https://raw.githubusercontent.com/alexeygrigorev/rag-agents-workshop/refs/heads/main/chat_assistant.py
# or https://github.com/alexeygrigorev/rag-agents-workshop/blob/main/chat_assistant.py

--2025-09-06 17:31:38--  https://raw.githubusercontent.com/alexeygrigorev/rag-agents-workshop/refs/heads/main/chat_assistant.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3495 (3.4K) [text/plain]
Saving to: ‘chat_assistant.py’


2025-09-06 17:31:38 (57.4 MB/s) - ‘chat_assistant.py’ saved [3495/3495]

