In [1]:
from rag_data_uploader.utils.mappings import ES7BaseMapping, OSBaseMapping
from typing import Dict

## Mappings as Pydantic classes

To make custom mappings for different vector stores easier to create, mappings are implemented as Pydantic classes. Default mappings for the existing use cases can be imported from the `rag_data_uploader.utils.mappings` module. To create a custom mapping for one of the supported data stores, inherit from the corresponding base mapping class, as in the following examples.



#### Creating a custom mapping for Elasticsearch 7

To create a custom mapping for Elasticsearch 7, inherit from the class `ES7BaseMapping`. This class contains only a `content` field of type `text`.

In [2]:
ES7BaseMapping().model_dump()

{'properties': {'content': {'type': 'text'}}}

To create a new mapping with more fields, inherit from the base class and add `str` attributes whose default values are the desired Elasticsearch field types. Calling `model_dump()` on an instance of the new mapping returns a dictionary corresponding to the Elasticsearch 7 mapping.

In [3]:
class ES7CustomMapping(ES7BaseMapping):
    title: str = "keyword"
    version: str = "float"

custom_mapping = ES7CustomMapping()
custom_mapping.model_dump()

{'properties': {'content': {'type': 'text'},
  'title': {'type': 'keyword'},
  'version': {'type': 'float'}}}

To add an embedding field to the mapping, add an `int` attribute named `embedding` whose value is the desired number of dimensions. The parent class converts it into the correct format for the given vector store. Note the difference between the embedding mappings for Elasticsearch and OpenSearch.

In [4]:
# Embeddings that are consistent with Elasticsearch 7
class ES7EmbeddingMapping(ES7BaseMapping):
    embedding: int = 1024

es_embedding_mapping = ES7EmbeddingMapping()
es_embedding_mapping.model_dump()

{'properties': {'content': {'type': 'text'},
  'embedding': {'type': 'dense_vector', 'dims': 1024}}}

In [5]:
# Embeddings that are consistent with Opensearch
class OSEmbeddingMapping(OSBaseMapping):
    embedding: int = 1024

os_embedding_mapping = OSEmbeddingMapping()
os_embedding_mapping.model_dump()

{'properties': {'content': {'type': 'text'},
  'embedding': {'type': 'knn_vector', 'dims': 1024}}}
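
The conversion applied to the `embedding` attribute can be pictured with a small standalone sketch. The helper `embedding_field` and the table `VECTOR_TYPES` below are hypothetical and not part of `rag_data_uploader`. They only reproduce the two output shapes shown above using plain dictionaries, so the actual implementation in the parent classes may differ.

```python
from typing import Any, Dict

# Hypothetical lookup table: the vector field type per store,
# copied from the two outputs shown above.
VECTOR_TYPES: Dict[str, str] = {
    "elasticsearch7": "dense_vector",
    "opensearch": "knn_vector",
}


def embedding_field(store: str, dims: int) -> Dict[str, Any]:
    """Build an embedding field definition for the given store (illustrative only)."""
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    if store not in VECTOR_TYPES:
        raise ValueError(f"unknown store: {store!r}")
    return {"type": VECTOR_TYPES[store], "dims": dims}


es_field = embedding_field("elasticsearch7", 1024)
os_field = embedding_field("opensearch", 1024)
print(es_field)
print(os_field)
```

The only thing that changes between the two stores in this sketch is the value of `type`, which matches the difference visible in the two outputs.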

A nested field can be specified by supplying a dictionary whose keys are the subfield names and whose values are the subfield types. For example, to extend the Elasticsearch embedding mapping with a field called `sections` that has the keyword subfields `chapter` and `subchapter_1`:

In [6]:
class ES7NestedMapping(ES7EmbeddingMapping):
    sections: Dict[str, str] = {"chapter": "keyword", "subchapter_1": "keyword"}

nested_mapping = ES7NestedMapping()
nested_mapping.model_dump()

{'properties': {'content': {'type': 'text'},
  'embedding': {'type': 'dense_vector', 'dims': 1024},
  'sections': {'properties': {'chapter': {'type': 'keyword'},
    'subchapter_1': {'type': 'keyword'}}}}}
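
To make the relationship between the class attributes and the dumped dictionary concrete, here is a minimal standalone sketch of the same transformation on plain dictionaries. The function `to_properties` is a hypothetical name introduced for illustration. It is not the package's implementation, only an assumption about the rule that the outputs above suggest: a string becomes `{'type': ...}` and a dictionary becomes a nested `properties` block.

```python
from typing import Any, Dict, Union

# A field is described either by a type name or by a dict of subfield types.
FieldSpec = Union[str, Dict[str, str]]


def to_properties(fields: Dict[str, FieldSpec]) -> Dict[str, Any]:
    """Convert flat and nested field specs into a mapping-style dictionary."""
    properties: Dict[str, Any] = {}
    for name, spec in fields.items():
        if isinstance(spec, dict):
            # Nested field: each subfield gets its own {'type': ...} entry.
            properties[name] = {
                "properties": {sub: {"type": sub_type} for sub, sub_type in spec.items()}
            }
        else:
            # Flat field: the string is used directly as the field type.
            properties[name] = {"type": spec}
    return {"properties": properties}


nested = to_properties(
    {
        "content": "text",
        "sections": {"chapter": "keyword", "subchapter_1": "keyword"},
    }
)
print(nested)
```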

Sometimes a user wants to control which type is assigned to new fields that are added to an index after it has been created. For instance, in the 3GPP mapping, new fields with `string` values are mapped to `keyword` fields. To control this, add a field called `template`, defined as a dictionary with the keys `name`, `match` and `to`. Here, `name` is the name of the dynamic mapping template, `match` is the detected type of the fields to be mapped, and `to` is the datatype to map those fields to. The following example extends the previous mapping into something similar to the 3GPP mapping currently in use:

In [7]:
class ES7Mapping3GPPMinimal(ES7NestedMapping):
    template: Dict[str, str] = {"name": "3gpp_template", "match": "string", "to" : "keyword"}

mapping_3gpp = ES7Mapping3GPPMinimal()
mapping_3gpp.model_dump()


{'properties': {'content': {'type': 'text'},
  'embedding': {'type': 'dense_vector', 'dims': 1024},
  'sections': {'properties': {'chapter': {'type': 'keyword'},
    'subchapter_1': {'type': 'keyword'}}}},
 'dynamic_templates': [{'3gpp_template': {'match_mapping_type': 'string',
    'mapping': {'type': 'keyword'}}}]}
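
The way the three `template` keys end up in the `dynamic_templates` entry can be sketched as a small standalone function. The name `to_dynamic_templates` is hypothetical and the code is not taken from `rag_data_uploader`. It only reproduces the output shape shown above under the assumption that `match` maps to `match_mapping_type` and `to` maps to the target `type`.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"name", "match", "to"}


def to_dynamic_templates(template: Dict[str, str]) -> List[Dict[str, Any]]:
    """Turn a {'name', 'match', 'to'} dict into a dynamic_templates list."""
    missing = REQUIRED_KEYS - template.keys()
    if missing:
        raise KeyError(f"template is missing keys: {sorted(missing)}")
    return [
        {
            template["name"]: {
                # Detected type of the incoming field, e.g. 'string'.
                "match_mapping_type": template["match"],
                # Type the matching fields are mapped to, e.g. 'keyword'.
                "mapping": {"type": template["to"]},
            }
        }
    ]


templates = to_dynamic_templates(
    {"name": "3gpp_template", "match": "string", "to": "keyword"}
)
print(templates)
```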

The mapping classes, whether default or user-defined, are meant to be used with the uploader classes in the `rag_data_uploader` package. An instance of one of the mapping classes is passed to either the `upload_from_folder()` or the `upload_documents()` method under the `mapping` key. Remember to pass a mapping instance, not the class itself. The distinction is shown below:

In [11]:
# Mapping instance, which should be passed to uploader methods
mapping = ES7BaseMapping()
mapping

{"properties":{"content":{"type":"text"}}}

In [12]:
# Class, which should NOT be passed to uploader methods without being instantiated
mapping = ES7BaseMapping
mapping

rag_data_uploader.utils.base_mappings.ES7BaseMapping
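
One way to catch the class-versus-instance mistake early in your own code is a small guard that runs before the mapping is handed to an uploader. The sketch below is an assumption about how such a check could look, not something the package is known to provide. `ExampleMapping` is a stand-in class and `ensure_mapping_instance` is a hypothetical helper, both introduced only for illustration.

```python
import inspect
from typing import Any, Dict


class ExampleMapping:
    """Stand-in for a mapping class such as ES7BaseMapping (hypothetical)."""

    def model_dump(self) -> Dict[str, Any]:
        return {"properties": {"content": {"type": "text"}}}


def ensure_mapping_instance(mapping: Any) -> Any:
    """Return the mapping unchanged, or raise if a class was passed instead."""
    if inspect.isclass(mapping):
        name = mapping.__name__
        raise TypeError(
            f"Expected a mapping instance but got the class {name}; "
            f"instantiate it first with {name}()."
        )
    return mapping


# An instance passes through unchanged.
checked = ensure_mapping_instance(ExampleMapping())
print(checked.model_dump())

# The bare class is rejected with an explanatory error.
try:
    ensure_mapping_instance(ExampleMapping)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```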