# Create the retrieval part of the RAG pipeline

More specifically, we will:
- Read a JSON document containing the internal information that we need
- Connect to Elasticsearch and create an index
- Index the documents in Elasticsearch so that we can retrieve them later, which gives us a small search engine

In [1]:
# Download the document containing our own knowledge base
!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
# Peek at the first lines of the file
!head documents.json

--2024-05-08 17:03:58--  https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json [following]
--2024-05-08 17:03:58--  https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json.3’


2024-05-08 17:03:58 (52.3 MB/s) - ‘documents.json.3’ saved [658332/658332]

[
  {
    "course": "data-engineering-zoomcamp",
    "documents": 

In [2]:
# Load the file and flatten it by adding the course name to each question record

# Import the necessary libraries to read JSON files
import json

# Read the JSON file
with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

# Initialise the flattened list of Q&A records
documents = []

# Flatten the file
for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
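
To make the flattening step concrete, here is a minimal sketch on a tiny made-up input that mirrors the shape of `documents.json` (a list of courses, each holding a list of documents). The sample records and the `flatten_courses` helper name are illustrative assumptions, not part of the workshop data. Unlike the loop above, this version copies each record instead of modifying it in place.

```python
def flatten_courses(courses):
    """Return one flat list of records, each tagged with its course name."""
    flat = []
    for course in courses:
        for doc in course["documents"]:
            # Copy the record so the nested input is left untouched
            flat.append({**doc, "course": course["course"]})
    return flat


# Hypothetical sample data with the same shape as documents.json
sample = [
    {"course": "course-a", "documents": [{"question": "Q1", "text": "A1", "section": "S"}]},
    {"course": "course-b", "documents": [{"question": "Q2", "text": "A2", "section": "S"}]},
]

flat_sample = flatten_courses(sample)
print(len(flat_sample))          # 2
print(flat_sample[0]["course"])  # course-a
```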

In [3]:
# Print the number of questions we have
print(len(documents))
# Show the first Q&A record
documents[0]

948


{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [5]:
from elasticsearch import Elasticsearch
# Connect to the local Elasticsearch instance
es = Elasticsearch("http://localhost:9200")
# Verify that the connection works
es.info()

ObjectApiResponse({'name': '7b59ed66498f', 'cluster_name': 'docker-cluster', 'cluster_uuid': '8D8G9h_BQ72GSZ9JkGUK3Q', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [6]:
# Define the settings and field mappings of the Elasticsearch index
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}  # exact-match field, used for filtering
        }
    }
}
# Name of the index
index_name = "course-questions"
# Create the index (recent 8.x Python clients may warn that `body` is deprecated)
response = es.indices.create(index=index_name, body=index_settings)
# Verify that the index has been created
response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
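
The mapping above declares `course` as `keyword` while the other fields are `text`. The sketch below is a rough illustration of why that matters for exact filtering: a `keyword` value is stored as one unmodified term, whereas a `text` value is split into tokens. The `rough_tokens` function is only a crude stand-in for a real analyzer (an assumption made for illustration), and all helper names are hypothetical.

```python
import re


def rough_tokens(value):
    """Crude stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


def term_matches_keyword(stored_value, term):
    # A keyword field keeps the whole value as a single term
    return stored_value == term


def term_matches_text(stored_value, term):
    # A text field is indexed token by token, so the term must equal one token
    return term in rough_tokens(stored_value)


course = "data-engineering-zoomcamp"
print(rough_tokens(course))                  # ['data', 'engineering', 'zoomcamp']
print(term_matches_keyword(course, course))  # True
print(term_matches_text(course, course))     # False
```

Under these assumptions, an exact-match filter on the full course name only succeeds when the field is stored as a single term, which is the behaviour the `keyword` type provides.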

In [7]:
from tqdm.auto import tqdm
# Index all the documents in Elasticsearch, adding each one to the index created above
for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 948/948 [00:24<00:00, 37.96it/s]
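
The loop above sends one request per document. One possible alternative is `elasticsearch.helpers.bulk`, which accepts an iterable of action dictionaries and sends them in batches. The sketch below only builds those actions, so it runs without a cluster; the `build_bulk_actions` name and the sample records are hypothetical, and the commented lines show how the actions might be passed to the helper.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk indexing action per document."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Hypothetical sample records standing in for the flattened `documents` list
sample_docs = [
    {"question": "Q1", "text": "A1", "section": "S", "course": "course-a"},
    {"question": "Q2", "text": "A2", "section": "S", "course": "course-a"},
]

actions = list(build_bulk_actions(sample_docs, "course-questions"))
print(len(actions))  # 2

# With a running cluster and the `es` client from above, this could be sent with:
#   from elasticsearch.helpers import bulk
#   bulk(es, build_bulk_actions(documents, index_name))
```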


In [8]:
# Run a simple query against Elasticsearch

# The question to search for
user_question = "How do I join the course after it has started?"

# Create the body of the search request
search_query = {
    "size": 5,     # number of documents to retrieve
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,  # the text to search for
                    # Fields to search; ^3 boosts matches in the question field by a factor of 3
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {   # restrict the search to documents from a single course
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

# Query the Elasticsearch index
response = es.search(index=index_name, body=search_query)

# Inspect the raw response, which contains the top 5 search results
response

ObjectApiResponse({'took': 143, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 407, 'relation': 'eq'}, 'max_score': 53.20021, 'hits': [{'_index': 'course-questions', '_id': 'calCWY8B0JXxoL3Rzu4E', '_score': 53.20021, '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp'}}, {'_index': 'course-questions', '_id': 'dqlCWY8B0JXxoL3Rzu6I', '_score': 47.001244, '_source': {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next c
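
To illustrate how `question^3` and `"type": "best_fields"` interact, the sketch below mimics the scoring rule with made-up per-field scores: each field's score is multiplied by its boost, and the best field determines the result (optionally adding a fraction of the other fields via `tie_breaker`, which defaults to 0). The numbers and the `best_fields_score` name are illustrative assumptions; the real per-field scores are computed inside Elasticsearch.

```python
def best_fields_score(field_scores, boosts=None, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields query does, in simplified form."""
    boosts = boosts or {}
    # Apply the boost of each field (1.0 when no boost is given)
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # The best field wins; the other fields contribute only through tie_breaker
    return best + tie_breaker * (sum(boosted) - best)


# Made-up raw scores for one document
raw = {"question": 2.0, "text": 3.5, "section": 0.5}
print(best_fields_score(raw))                            # 3.5
print(best_fields_score(raw, boosts={"question": 3.0}))  # 6.0
```

In this toy example the `text` field wins without boosting, while the boost of 3 makes the `question` field the deciding one.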

In [9]:
# Prettify the response
for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}\nQuestion: {doc['question']}\nAnswer: {doc['text']}\n\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.


Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.


Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terra

In [10]:
# Wrap the steps above in a single function

# Connect to the Elasticsearch instance running in Docker
es = Elasticsearch("http://localhost:9200")

# Retrieve the documents that best match a user question
def retrieve_documents(query, index_name="course-questions", max_results=5):
    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents
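
The function above always filters on `data-engineering-zoomcamp`. As a possible variation, the request body can be built by a separate helper that takes the course as a parameter, which also makes the query easy to inspect without a running cluster. The `build_search_query` name and the `some-other-course` value are hypothetical and used only for illustration.

```python
def build_search_query(query, course="data-engineering-zoomcamp", max_results=5):
    """Build the search request body for a question, restricted to one course."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact-match filter on the keyword field
                "filter": {"term": {"course": course}},
            }
        },
    }


example = build_search_query("How do I join?", course="some-other-course", max_results=3)
print(example["size"])                     # 3
print(example["query"]["bool"]["filter"])  # {'term': {'course': 'some-other-course'}}
```

The resulting dictionary could then be passed to `es.search` in the same way as `search_query` above.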

In [11]:
# Ask a question and print the retrieved documents
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

for doc in response:
    print(f"Section: {doc['section']}\nQuestion: {doc['question']}\nAnswer: {doc['text']}\n\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.


Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.


Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terra