In [None]:
%%bash
which python
python --version
#python -m ipykernel install --name py3.10-env --user
pip install -q tqdm openai elasticsearch pandas scikit-learn transformers accelerate bitsandbytes tiktoken

## Q1. Running Mage

Clone the same repo we used in the module and run mage:


```bash
git clone https://github.com/mage-ai/rag-project
```

In [1]:
!git clone https://github.com/mage-ai/rag-project

Cloning into 'rag-project'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 60 (delta 14), reused 55 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (60/60), 14.14 KiB | 14.14 MiB/s, done.
Resolving deltas: 100% (14/14), done.


Add the following libraries to the requirements document:

```
python-docx
elasticsearch
```

In [3]:
!echo "python-docx" >> rag-project/llm/requirements.txt
!echo "elasticsearch" >> rag-project/llm/requirements.txt

Make sure you use the latest version of mage:

```bash
docker pull mageai/mageai:llm
```

Start it:

```bash
./scripts/start.sh
```

Now mage is running on [http://localhost:6789/](http://localhost:6789/)

What's the version of mage? 



In [4]:
!docker pull mageai/mageai:llm

llm: Pulling from mageai/mageai

In [None]:
#!cd ./rag-project && ./scripts/start.sh

In [7]:
"v0.9.72"

'v0.9.72'


## Creating a RAG pipeline

Create a RAG pipeline

## Q2. Reading the documents

Now we can ingest the documents. Create a custom code ingestion
block.

Let's read the documents. We will use the same code we used
for parsing the FAQ: [parse-faq-llm.ipynb](parse-faq-llm.ipynb)


Use the following `document_id`: `1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E`

This is the document ID of
[LLM FAQ version 1](https://docs.google.com/document/d/1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E/edit).

Copy the code to the editor.

How many FAQ documents did we process?

* **1**
* 2
* 3
* 4

## Q3. Chunking

We don't really need to do any chunking because our documents
already have well-specified boundaries. So we just need
to return the documents without any changes.

So let's go to the transformation part and add a custom code
chunking block:

```python
documents = []

for doc in data['documents']:
    doc['course'] = data['course']
    # previously we used just "id" for document ID
    doc['document_id'] = generate_document_id(doc)
    documents.append(doc)

print(len(documents))

return documents
```


Here `data` is the input parameter to the transformer, and
`generate_document_id` is defined in the same way
as in module 4:

```python
import hashlib

def generate_document_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id
```
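
As a quick check of the function above, the sketch below hashes a made-up record (the field values are illustrative, not taken from the FAQ). It confirms that the ID is an 8-character hex string, that it stays the same across calls, and that it changes when the course changes.

```python
import hashlib


def generate_document_id(doc):
    # Same logic as above: hash the course, the question
    # and the first 10 characters of the text
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


# Hypothetical record, used only for illustration
sample = {
    "course": "llm-zoomcamp",
    "question": "Can I still join?",
    "text": "Yes, you can join at any time.",
}

first_id = generate_document_id(sample)
second_id = generate_document_id(dict(sample))
other_id = generate_document_id({**sample, "course": "another-course"})

print(first_id, first_id == second_id, first_id != other_id)
```

Because the ID depends only on the content, re-running the pipeline on an unchanged record should give the same `document_id`.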

Note: if instead of a single dictionary you get a list, 
add a for loop:

```python
for course_dict in data:
    ...
```

You can check the type of `data` with this code:

```python
print(type(data))
```
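
One way to handle both shapes in the same block is to normalize the input first. This is a minimal sketch; `as_course_list` is a hypothetical helper name, and the sample inputs are made up for illustration.

```python
def as_course_list(data):
    """Return a list of course dictionaries for either input shape."""
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list):
        return data
    raise TypeError(f"Unexpected input type: {type(data).__name__}")


# Hypothetical inputs in the two shapes mentioned above
single = {"course": "llm-zoomcamp", "documents": [{"question": "q1", "text": "t1"}]}
wrapped = [single]

for data in (single, wrapped):
    for course_dict in as_course_list(data):
        print(course_dict["course"], len(course_dict["documents"]))
```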

How many documents (chunks) do we have in the output?

* 66
* 76
* **86**
* 96

```python
documents = []
course = data[0]['course']
docs = data[0]['documents']

for doc in docs:
    doc['course'] = course
    # previously we used just "id" for document ID
    doc['document_id'] = generate_document_id(doc)
    documents.append(doc)

print(len(documents))

return documents
```


## Tokenization and embeddings

We don't need any tokenization, so we skip it.

Because currently it's required in mage, we can create 
a dummy code block:

* Create a custom code block
* Don't change it

Because we will use text search, we also don't need embeddings,
so we skip them too.

If you want to use sentence transformers (the ones from module
3), you don't need tokenization, but you do need embeddings.
You don't need them for this homework.

## Q4. Export

Now we're ready to index the data with Elasticsearch. For that,
we use the Export part of the pipeline.

* Go to the Export part
* Select vector databases -> Elasticsearch
* Open the code for editing

Because we won't use vector search, but usual text search, we
will need to adjust the code.

First, let's change the line where we read the index name:

```python
index_name = kwargs.get('index_name', 'documents')
``` 

Change it to `index_name_prefix`. We will parametrize the index name
with the date and time of the pipeline run:

```python
from datetime import datetime

index_name_prefix = kwargs.get('index_name', 'documents')
current_time = datetime.now().strftime("%Y%m%d_%M%S")
index_name = f"{index_name_prefix}_{current_time}"
print("index name:", index_name)
```
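
To see what this format string produces, the sketch below applies it to a fixed, made-up timestamp. Note that `%M%S` is minutes and seconds, so the hour is not part of the name; the second variant, which adds `%H`, is an assumed alternative and not what the homework uses.

```python
from datetime import datetime

# Fixed timestamp so the result is reproducible: 2024-08-19 10:04:16
run_time = datetime(2024, 8, 19, 10, 4, 16)

# Format used above: date, then minutes and seconds (no hour)
suffix = run_time.strftime("%Y%m%d_%M%S")
print(f"documents_{suffix}")  # documents_20240819_0416

# Assumed alternative that also includes the hour
suffix_with_hour = run_time.strftime("%Y%m%d_%H%M%S")
print(f"documents_{suffix_with_hour}")  # documents_20240819_100416
```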


We need to save the name in a global variable so that it is accessible in other code blocks:

```python
from mage_ai.data_preparation.variable_manager import set_global_variable

set_global_variable('YOUR_PIPELINE_NAME', 'index_name', index_name)
```

Here `YOUR_PIPELINE_NAME` is the name of your pipeline, e.g.
`transcendent_nexus` (replace the space with an underscore `_`).



Replace index settings with the settings we used previously:

```python
index_settings = {
    "settings": {
        "number_of_shards": number_of_shards,
        "number_of_replicas": number_of_replicas
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "document_id": {"type": "keyword"}
        }
    }
}
```
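
Before indexing, it can help to check that each document actually carries the fields named in the mapping. The sketch below is one way to do that; `missing_fields` is a hypothetical helper, the sample document is made up, and the shard and replica values are the defaults used by the exporter block.

```python
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "document_id": {"type": "keyword"},
        }
    },
}


def missing_fields(document, settings):
    """Return the mapped field names that are absent from the document."""
    expected = settings["mappings"]["properties"].keys()
    return sorted(field for field in expected if field not in document)


# Hypothetical document that has not been through the chunking block yet
raw_doc = {"text": "t", "section": "s", "question": "q", "course": "llm-zoomcamp"}
print(missing_fields(raw_doc, index_settings))  # ['document_id']
```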

Remove the lines that convert the embeddings:

```python
if isinstance(document[vector_column_name], np.ndarray):
    document[vector_column_name] = document[vector_column_name].tolist()
```

At the end (outside of the indexing for loop), print the last document:

```python
print(document)
```

Now execute the block.

What's the last document id?

Also note the index name.

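```python
from datetime import datetime
from typing import Dict, List, Union

import numpy as np
from elasticsearch import Elasticsearch

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter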
def elasticsearch(
    documents: List[Dict[str, Union[Dict, List[int], np.ndarray, str]]], *args, **kwargs,
):
    """
    Exports document data to an Elasticsearch database.
    """

    from mage_ai.data_preparation.variable_manager import set_global_variable


    connection_string = kwargs.get('connection_string', 'http://rag-project-elasticsearch-1:9200')
    #index_name = kwargs.get('index_name', 'documents')

    index_name_prefix = kwargs.get('index_name', 'documents')
    current_time = datetime.now().strftime("%Y%m%d_%M%S")
    index_name = f"{index_name_prefix}_{current_time}"
    print("index name:", index_name)
    set_global_variable('mesmerizing_singularity', 'index_name', index_name)
    number_of_shards = kwargs.get('number_of_shards', 1)
    number_of_replicas = kwargs.get('number_of_replicas', 0)
    vector_column_name = kwargs.get('vector_column_name', 'embedding')

    dimensions = kwargs.get('dimensions')
    if dimensions is None and len(documents) > 0:
        document = documents[0]
        dimensions = len(document.get(vector_column_name) or [])

    es_client = Elasticsearch(connection_string)

    print(f'Connecting to Elasticsearch at {connection_string}')

    index_settings = {
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas
        },
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "section": {"type": "text"},
                "question": {"type": "text"},
                "course": {"type": "keyword"},
                "document_id": {"type": "keyword"}
            }
        }
    }

    #index_settings = dict(
    #    settings=dict(
    #        number_of_shards=number_of_shards,
    #        number_of_replicas=number_of_replicas,
    #    ),
    #    mappings=dict(
    #        properties=dict(
    #            chunk=dict(type='text'),
    #            document_id=dict(type='text'),
    #            embedding=dict(type='dense_vector', dims=dimensions),
    #        ),
    #    ),
    #)

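    if not es_client.indices.exists(index=index_name):
        # Pass the settings and mappings explicitly; without them
        # Elasticsearch creates the index with dynamic mappings
        # and index_settings is never applied.
        es_client.indices.create(index=index_name, body=index_settings)
        print('Index created with properties:', index_settings)
        print('Embedding dimensions:', dimensions)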

    print(f'Indexing {len(documents)} documents to Elasticsearch index {index_name}')
    for document in documents:
        print(f'Indexing document {document["document_id"]}')

        #if isinstance(document[vector_column_name], np.ndarray):
        #    document[vector_column_name] = document[vector_column_name].tolist()

        es_client.index(index=index_name, document=document)
    print(document)
```

index name: documents_20240819_0416
```json
{'text': 'Yes, you need to pass the Capstone project to get the certificate. Homework is not mandatory, though it is recommended for reinforcing concepts, and the points awarded count towards your rank on the leaderboard.', 'section': 'General course-related questions', 'question': 'I missed the first homework - can I still get a certificate?', 'course': 'llm-zoomcamp', 'document_id': 'fa136280'}
```
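
The exporter above indexes documents one at a time. For larger collections, one option (an assumption, not part of the homework) is the bulk helper from the `elasticsearch` Python client, which accepts an iterable of actions. The sketch below only builds the actions, so it runs without a cluster; `to_bulk_actions` is a hypothetical helper name and the documents are made up.

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk action per document for the given index."""
    for document in documents:
        yield {"_index": index_name, "_source": document}


# Hypothetical documents, for illustration only
documents = [
    {"text": "t1", "question": "q1", "course": "llm-zoomcamp", "document_id": "aaaaaaaa"},
    {"text": "t2", "question": "q2", "course": "llm-zoomcamp", "document_id": "bbbbbbbb"},
]

actions = list(to_bulk_actions(documents, "documents_20240819_0416"))
print(len(actions), actions[0]["_index"])

# With a running cluster, the actions could be sent like this:
# from elasticsearch import Elasticsearch, helpers
# helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
```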

## Q5. Testing the retrieval

Now let's test the retrieval. Use mage or jupyter notebook to
test it.

Let's use the following query: "When is the next cohort?"

What's the ID of the top matching result?

In [8]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 
es_client.info().body

{'name': '1cdf405c2a67',
 'cluster_name': 'docker-cluster',
 'cluster_uuid': 'sPJ4rh_NQq6klLrJ2OPo5g',
 'version': {'number': '8.5.0',
  'build_flavor': 'default',
  'build_type': 'docker',
  'build_hash': 'c94b4700cda13820dad5aa74fae6db185ca5c304',
  'build_date': '2022-10-24T16:54:16.433628434Z',
  'build_snapshot': False,
  'lucene_version': '9.4.1',
  'minimum_wire_compatibility_version': '7.17.0',
  'minimum_index_compatibility_version': '7.0.0'},
 'tagline': 'You Know, for Search'}

In [10]:
index_name="documents_20240819_0416"

In [25]:
def _merge_dicts(a: dict, b: dict, path=None):
    """Recursively merge b into a (in place); raise on conflicting values."""
    path = path or []
    for key in b:
        if key in a:
            if isinstance(a[key], dict) and isinstance(b[key], dict):
                _merge_dicts(a[key], b[key], path + [str(key)])
            elif a[key] != b[key]:
                raise Exception('Conflict at ' + '.'.join(path + [str(key)]))
        else:
            a[key] = b[key]
    return a


def search(q, index=index_name, es_client=es_client, size=5, fields=None,
           type_match="best_fields", query_patch: dict = None):
    fields = fields or ["question^4", "text"]
    search_query = {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        # use the function argument, not the global `query`
                        "query": q,
                        "fields": fields,
                        "type": type_match
                    }
                }
            }
        }
    }
    # optionally merge extra clauses (e.g. a filter) into the request body
    if query_patch:
        search_query = _merge_dicts(search_query, query_patch)
    return es_client.search(index=index, body=search_query)
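
The `query_patch` argument of `search` merges extra clauses into the request body. The sketch below shows the idea without a cluster: it builds only the body and merges in a course filter. `merge_dicts` and `build_search_query` are hypothetical standalone names that mirror the cell above, and the filter itself is an illustrative assumption.

```python
import copy


def merge_dicts(a, b, path=None):
    """Recursively merge b into a (in place); raise on conflicting values."""
    path = path or []
    for key, value in b.items():
        if key in a:
            if isinstance(a[key], dict) and isinstance(value, dict):
                merge_dicts(a[key], value, path + [str(key)])
            elif a[key] != value:
                raise ValueError("Conflict at " + ".".join(path + [str(key)]))
        else:
            a[key] = value
    return a


def build_search_query(q, size=5, fields=None, type_match="best_fields", query_patch=None):
    """Build the request body only, so it can be inspected before sending."""
    search_query = {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": q,
                        "fields": fields or ["question^4", "text"],
                        "type": type_match,
                    }
                }
            }
        },
    }
    if query_patch:
        # Copy the patch so the caller's dictionary is not shared with the body
        merge_dicts(search_query, copy.deepcopy(query_patch))
    return search_query


# Illustrative patch: restrict results to one course
course_filter = {"query": {"bool": {"filter": {"term": {"course": "llm-zoomcamp"}}}}}
body = build_search_query("When is the next cohort?", query_patch=course_filter)
print(body["query"]["bool"]["filter"])
```

A patch that contradicts an existing value, such as `{"size": 1}` against the default `size=5`, raises an error instead of silently overriding it.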

In [12]:
query = "When is the next cohort?"

response = search(query)

In [22]:
response["hits"]["hits"][0]

{'_index': 'documents_20240819_0416',
 '_id': 'EXnibJEBUptsjNsiUjeI',
 '_score': 33.77578,
 '_source': {'text': 'Summer 2025 (via Alexey).',
  'section': 'General course-related questions',
  'question': 'When will the course be offered next?',
  'course': 'llm-zoomcamp',
  'document_id': 'bf024675'}}
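
Note that the hit carries two identifiers: `_id`, which Elasticsearch generated at indexing time, and `document_id`, the hash computed in the chunking block. A small sketch for pulling out the latter, using a response trimmed to the fields shown above (`top_document_id` is a hypothetical helper name):

```python
def top_document_id(response):
    """Return the document_id of the best hit, or None if there are no hits."""
    hits = response["hits"]["hits"]
    if not hits:
        return None
    return hits[0]["_source"]["document_id"]


# Response trimmed to the fields shown in the output above
response = {
    "hits": {
        "hits": [
            {
                "_id": "EXnibJEBUptsjNsiUjeI",
                "_score": 33.77578,
                "_source": {
                    "question": "When will the course be offered next?",
                    "document_id": "bf024675",
                },
            }
        ]
    }
}

print(top_document_id(response))  # bf024675
```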

## Q6. Reindexing

Our FAQ document changes: every day course participants add
new records or improve existing ones.

Imagine some time passed and the document changed. For that we have another version of the FAQ document: [version 2](https://docs.google.com/document/d/1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0/edit).

The ID of this document is `1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0`.

Let's re-execute the entire pipeline with the updated data.

For the same query, "When is the next cohort?", what's the ID of the top matching result?

In [26]:


query = "When is the next cohort?"

response = search(query, index="documents_20240820_0416")
response["hits"]["hits"][0]

{'_index': 'documents_20240820_0416',
 '_id': '09IAbZEBJy6ui47HOUNS',
 '_score': 68.84985,
 '_source': {'text': 'Summer 2026.',
  'section': 'General course-related questions',
  'question': 'When is the next cohort?',
  'course': 'llm-zoomcamp',
  'document_id': 'b6fa77f3'}}
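
Since every run creates a new index, it can be handy to pick the most recent one by name rather than hard-coding it. This is a minimal sketch under the assumption that all index names share the `documents_` prefix followed by the timestamp suffix; `latest_index` is a hypothetical helper name.

```python
def latest_index(index_names, prefix="documents"):
    """Return the name with the greatest timestamp suffix, or None."""
    candidates = [name for name in index_names if name.startswith(prefix + "_")]
    # Names share the prefix, so string comparison orders them by suffix
    return max(candidates) if candidates else None


# The two index names used in this notebook, plus an unrelated one
names = ["documents_20240819_0416", "documents_20240820_0416", "other_index"]
print(latest_index(names))  # documents_20240820_0416

# With a running cluster, the names could come from the keys of
# es_client.indices.get(index="documents_*")
```

Because the suffix contains the date followed by minutes and seconds only, this ordering is reliable across days but not for several runs on the same day.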