In [42]:
import datetime
from copy import deepcopy
from typing import Literal, Any

from pydantic import (
    AliasChoices,
    AliasPath,
    AnyHttpUrl,
    BaseModel,
    ConfigDict,
    Field,
    field_serializer,
    field_validator,
    model_serializer
)

#### This notebook demonstrates the parsing models in the PIA rag-data-uploader package. The parsing models are built on Pydantic. Their main job is to extract data from potentially nested fields and to ensure that the extracted values follow a set format.

#### Throughout this notebook we will use the following dummy document to illustrate the function of these parsing models.

In [43]:
dummy_document = {
    "page_content": "Random text for this dummy document.",
    "metadata": {
        "source": "1/DUMMY-DUMMY DUMMY DUMMY",
        "seq_num": 1,
        "book_id": "DUMMY",
        "book_number": "DUMMY-DUMMY",
        "book_title": "DUMMY",
        "book_edition": "DUMMY",
        "book_alt_title": "",
        "book_pdf": "DUMMY-DUMMY",
        "book_view_permission": "DUMMY",
        "book_audience": "DUMMY",
        "book_date": "2022-11-10",
        "doctype": "DUMMY",
        "date": "",
        "view_permission": "DUMMY",
        "inferred_products": [],
        "cpi_folders": None,
        "pia_graphs": None,
        "builder_meta": {
            "identity": "DUMMY-DUMMY",
            "document_number": "DUMMY-DUMMY",
            "decimal_class": "DUMMY",
            "revision": "DUMMY",
            "dxp_filename": "DUMMY-DUMMY.dxp",
            "eridoc_document_number": "1/DUMMY-DUMMY DUMMY DUMMY",
            "elex_filename": "DUMMY-DUMMY.DUMMY.html",
            "title": "DUMMY",
            "document_type": "DUMMY Guide",
            "category_tree": "Category does not exist",
            "library_url": "https://DUMMY.DUMMY.DUMMY.com/DUMMY?LI=DUMMY/DUMMY+DUMMY+DUMMY+DUMMY",
            "document_url": "https://DUMMY.DUMMY.DUMMY.com/DUMMY?LI=DUMMY/DUMMY+DUMMY+DUMMY+DUMMY&FN=DUMMY-DUMMY.DUMMY.html",
            "content_from_libraries": [
                {
                    "identity": "DUMMY/DUMMY DUMMY DUMMY DUMMY",
                    "title": "DUMMY DUMMY DUMMY.DUMMY.2",
                    "date": "2024-04-03",
                }
            ],
            "external_document_url": "https://DUMMY.DUMMY.DUMMY.com/DUMMY?LI=DUMMY/DUMMY+DUMMY+DUMMY+DUMMY&FN=DUMMY-DUMMY.DUMMY.html",
        },
        "chunk_token_size": 887,
    },
    "embedding": [
        0.0048415693,
        0.039551053,
        0.00871117,
    ],
}

#### These parsing models are meant to simplify parsing document sources within the RegenX framework and to let users create their own parsing models in a flexible way. For this reason there is a base parsing model, `RegenXBaseParsingModel`, which holds the mandatory fields of RegenX documents and can be imported with `from rag_data_uploader.utils.parsing_models import RegenXBaseParsingModel`. All other parsing models within the framework should inherit from it. It is equivalent to the model in the next cell.

#### If we use this model to parse the dummy document, the result contains only the three fields defined in the model. The model keeps `page_content` but renames it to `content`, keeps the `embedding` field, and adds a new field called `content_type`, which is required by LangChain. Everything else, including the whole `metadata` object, is dropped because the model is configured with `extra="ignore"`.

In [44]:
class RegenXBaseParsingModel(BaseModel):
    """Base parsing model with required fields.

    Any parsing model should inherit from this base model. Desired metadata fields,
    field validators and serializers should be added to the child class.
    """

    model_config = ConfigDict(extra="ignore")

    content: str = Field(validation_alias=AliasChoices("page_content", "content"))
    content_type: Literal["text"] = "text"
    embedding: list[float]

In [45]:
RegenXBaseParsingModel.model_validate(dummy_document).model_dump()

{'content': 'Random text for this dummy document.',
 'content_type': 'text',
 'embedding': [0.0048415693, 0.039551053, 0.00871117]}
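
#### The renaming relies on `AliasChoices`, which lets a field be populated from any of several input keys. The cell below is a minimal sketch of that behaviour, not part of the package: `BaseParsingSketch` is a hypothetical stand-in that copies the base model's fields, and the two input dictionaries are made up for illustration.

```python
from typing import Literal

from pydantic import AliasChoices, BaseModel, ConfigDict, Field


class BaseParsingSketch(BaseModel):
    """Hypothetical stand-in mirroring the fields of the base parsing model."""

    model_config = ConfigDict(extra="ignore")

    content: str = Field(validation_alias=AliasChoices("page_content", "content"))
    content_type: Literal["text"] = "text"
    embedding: list[float]


# A raw document uses "page_content" and carries extra keys.
raw = {"page_content": "abc", "embedding": [0.1, 0.2], "metadata": {"x": 1}}
# An already parsed document uses "content".
already_parsed = {"content": "abc", "embedding": [0.1, 0.2]}

from_raw = BaseParsingSketch.model_validate(raw).model_dump()
from_parsed = BaseParsingSketch.model_validate(already_parsed).model_dump()

# Both inputs produce the same output, and the extra "metadata" key is ignored.
print(from_raw == from_parsed)
print(from_raw)
```

#### Accepting both keys means that the output of a parsing model can be fed back into the same model without raising a validation error.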

#### To illustrate what happens if a field does not follow the specified schema or if a required field is missing, we will modify the dummy document and parse it again.

In [None]:
faulty_document = deepcopy(dummy_document)
faulty_document["page_content"] = 1
_ = faulty_document.pop("embedding")
RegenXBaseParsingModel.model_validate(faulty_document)

```text
ValidationError: 2 validation errors for RegenXBaseParsingModel
page_content
  Input should be a valid string [type=string_type, input_value=1, input_type=int]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
embedding
  Field required [type=missing, input_value={'page_content': 1, 'meta...chunk_token_size': 887}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing
```

#### We see that Pydantic raises a single `ValidationError` that reports two problems: one for the wrongly typed input and one for the missing field. This is what happens whenever a document passed into the parsing model does not conform to the specified schema.
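
#### When many documents are parsed in one go, it can be useful to collect the failures instead of stopping at the first one. The cell below is a minimal sketch of that pattern under stated assumptions: `TinyModel` and the `documents` list are hypothetical and only stand in for a real parsing model and real data. It uses `ValidationError.errors()`, which returns one dictionary per problem.

```python
from pydantic import BaseModel, ValidationError


class TinyModel(BaseModel):
    """Hypothetical two-field model used only for this illustration."""

    content: str
    embedding: list[float]


documents = [
    {"content": "ok", "embedding": [0.1]},
    {"content": 1},  # wrong type for content, and embedding is missing
]

valid = []
rejected = []
for doc in documents:
    try:
        valid.append(TinyModel.model_validate(doc))
    except ValidationError as exc:
        # Keep only the field location and the error type of each problem.
        rejected.append([(err["loc"], err["type"]) for err in exc.errors()])

print(len(valid))
print(rejected)
```

#### The error types collected here (`string_type` and `missing`) are the same ones that appear in the traceback above.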

#### As mentioned before, the parsing models are meant to work as flexible pre-processing helpers, and users can create their own parsers by inheriting from the base parser. A parsing model which inherits from the base parsing model has already been created for the RegenX CPI documents. It can be imported with `from rag_data_uploader.utils.parsing_models import RegenXParsingModelCPI`. It contains the fields which have been specified for the RegenX framework CPI use case and is equivalent to the model seen below.

#### This model uses a helper function called `metadata_field`, which tells the model where to look for a field name in a potentially nested JSON object: first at the top level, then under `metadata`, then under `metadata.builder_meta`. The model also uses a sub-model called `ContentFromLibraries` to parse a nested field which is itself a JSON object. This is the preferred way to handle fields which are not simple types or lists of simple types, because it lets us parse and type-check the values inside the nested field as well. Finally, `RegenXParsingModelCPI` has a required field called `chapter_title`. It was required by the RegenX framework standard when the repo was created. This might change going forward, in which case the model will need to be modified. For now, we simply add such a field to the dummy document in order to parse it with the model.

In [47]:
def metadata_field(field_name: str, **kwargs: Any) -> Any:
    """Helper function which constructs lookup paths for a given metadata field.

    The field is looked up at the top level, under "metadata" and under
    "metadata" -> "builder_meta", in that order. The function returns the
    result of pydantic.Field and accepts any keyword arguments that Field
    accepts. If 'validation_alias' is specified it is dropped.
    """
    kwargs.pop("validation_alias", None)
    return Field(
        validation_alias=AliasChoices(
            field_name,
            AliasPath("metadata", field_name),
            AliasPath("metadata", "builder_meta", field_name),
        ),
        **kwargs,
    )


class RegenXParsingModelCPI(RegenXBaseParsingModel):
    """Parsing model for RegenX CPI data.

    This model inherits from the base parsing model and extends it
    with required metadata fields for the RegenX CPI schema.
    """

    class ContentFromLibraries(BaseModel):
        """Sub-model to be used for nested dictionary within CPI document."""

        model_config = ConfigDict(extra="ignore")

        identity: str
        title: str
        date: datetime.date

        @field_serializer("date")
        def serialize_date(self, v: datetime.date) -> str:
            return str(v)

    source: str = metadata_field("source")
    document_number: str = metadata_field("document_number")
    identity: str = metadata_field("identity")
    chapter_title: str | list[str] = metadata_field("chapter_title")
    revision: str = metadata_field("revision")
    title: str = metadata_field("title")
    document_type: str = metadata_field("document_type")
    category_tree: str = metadata_field("category_tree")
    document_url: AnyHttpUrl = metadata_field("document_url")
    library_url: AnyHttpUrl = metadata_field("library_url")
    external_document_url: AnyHttpUrl = metadata_field("external_document_url")
    content_from_libraries: list[ContentFromLibraries] = metadata_field("content_from_libraries")
    eridoc_document_number: str = metadata_field("eridoc_document_number")

    @field_validator("chapter_title")
    @classmethod
    def validate_chapter_title(cls, v: str | list[str]) -> str:
        """Ensures that chapter_title is a string instead of a list of strings."""
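        if isinstance(v, list):
            # Only a single-item list can be unwrapped unambiguously.
            if len(v) != 1:
                raise ValueError(
                    f"chapter_title must be a string or a single-item list, got {len(v)} items"
                )
            return v[0]
        return v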

    @field_serializer("document_url", "library_url", "external_document_url")
    def serialize_url(self, v: AnyHttpUrl) -> str:
        """URL type is not JSON serializable."""
        return str(v)
    

In [48]:
dummy_document.update({"chapter_title": "DUMMY"})
RegenXParsingModelCPI.model_validate(dummy_document).model_dump()

{'content': 'Random text for this dummy document.',
 'content_type': 'text',
 'embedding': [0.0048415693, 0.039551053, 0.00871117],
 'source': '1/DUMMY-DUMMY DUMMY DUMMY',
 'document_number': 'DUMMY-DUMMY',
 'identity': 'DUMMY-DUMMY',
 'chapter_title': 'DUMMY',
 'revision': 'DUMMY',
 'title': 'DUMMY',
 'document_type': 'DUMMY Guide',
 'category_tree': 'Category does not exist',
 'document_url': 'https://dummy.dummy.dummy.com/DUMMY?LI=DUMMY/DUMMY+DUMMY+DUMMY+DUMMY&FN=DUMMY-DUMMY.DUMMY.html',
 'library_url': 'https://dummy.dummy.dummy.com/DUMMY?LI=DUMMY/DUMMY+DUMMY+DUMMY+DUMMY',
 'external_document_url': 'https://dummy.dummy.dummy.com/DUMMY?LI=DUMMY/DUMMY+DUMMY+DUMMY+DUMMY&FN=DUMMY-DUMMY.DUMMY.html',
 'content_from_libraries': [{'identity': 'DUMMY/DUMMY DUMMY DUMMY DUMMY',
   'title': 'DUMMY DUMMY DUMMY.DUMMY.2',
   'date': '2024-04-03'}],
 'eridoc_document_number': '1/DUMMY-DUMMY DUMMY DUMMY'}
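
#### The top-level `chapter_title` added above is picked up because `AliasChoices` tries its alternatives in order and uses the first one that exists in the input. The cell below is a minimal sketch of that lookup order. It repeats the `metadata_field` helper so that it runs on its own; `TitleOnly` and the three input dictionaries are hypothetical and exist only for this illustration.

```python
from typing import Any

from pydantic import AliasChoices, AliasPath, BaseModel, Field


def metadata_field(field_name: str, **kwargs: Any) -> Any:
    """Copy of the helper above, repeated so this cell is self-contained."""
    kwargs.pop("validation_alias", None)
    return Field(
        validation_alias=AliasChoices(
            field_name,
            AliasPath("metadata", field_name),
            AliasPath("metadata", "builder_meta", field_name),
        ),
        **kwargs,
    )


class TitleOnly(BaseModel):
    """Hypothetical one-field model used to show the lookup order."""

    title: str = metadata_field("title")


deep_only = {"metadata": {"builder_meta": {"title": "deep"}}}
middle_and_deep = {"metadata": {"title": "middle", "builder_meta": {"title": "deep"}}}
top_and_middle = {"title": "top", "metadata": {"title": "middle"}}

resolved = [
    TitleOnly.model_validate(doc).title
    for doc in (deep_only, middle_and_deep, top_and_middle)
]
print(resolved)  # ['deep', 'middle', 'top']
```

#### A consequence of this order is that a key at the top level shadows a key with the same name further down, which is worth keeping in mind when the same name occurs at several levels of a document.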

#### To create your own custom model, inherit from `RegenXBaseParsingModel` and add the fields you need together with their types. I would suggest using the helper function `metadata_field` to specify each field name as above, since it takes care of the potentially nested structure of the JSON objects. For more information on using Pydantic models as parsing models, check the [Pydantic documentation](https://docs.pydantic.dev/latest/concepts/models/).
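
#### As an illustration of these steps, the cell below sketches a custom model that keeps three of the book-related fields of the dummy document. It is a sketch under stated assumptions, not a model from the package: `BookParsingModel` is a hypothetical name, the choice of fields and the `gt=0` constraint are made up, and the base model and `metadata_field` are repeated as local copies so that the cell runs without the package installed.

```python
import datetime
from typing import Any, Literal

from pydantic import (
    AliasChoices,
    AliasPath,
    BaseModel,
    ConfigDict,
    Field,
    field_serializer,
)


class RegenXBaseParsingModel(BaseModel):
    """Local copy of the base model shown earlier in this notebook."""

    model_config = ConfigDict(extra="ignore")

    content: str = Field(validation_alias=AliasChoices("page_content", "content"))
    content_type: Literal["text"] = "text"
    embedding: list[float]


def metadata_field(field_name: str, **kwargs: Any) -> Any:
    """Local copy of the helper shown earlier in this notebook."""
    kwargs.pop("validation_alias", None)
    return Field(
        validation_alias=AliasChoices(
            field_name,
            AliasPath("metadata", field_name),
            AliasPath("metadata", "builder_meta", field_name),
        ),
        **kwargs,
    )


class BookParsingModel(RegenXBaseParsingModel):
    """Hypothetical custom model keeping a few book-related fields."""

    book_title: str = metadata_field("book_title")
    book_date: datetime.date = metadata_field("book_date")
    # Extra Field arguments such as constraints are passed through the helper.
    chunk_token_size: int = metadata_field("chunk_token_size", gt=0)

    @field_serializer("book_date")
    def serialize_book_date(self, v: datetime.date) -> str:
        # Dates are not JSON serializable, so dump them as ISO strings.
        return v.isoformat()


document = {
    "page_content": "Random text for this dummy document.",
    "metadata": {
        "book_title": "DUMMY",
        "book_date": "2022-11-10",
        "chunk_token_size": 887,
    },
    "embedding": [0.1, 0.2],
}

parsed = BookParsingModel.model_validate(document).model_dump()
print(parsed)
```

#### The date string is validated as a `datetime.date` on the way in and written back as a string on the way out, the same pattern that `ContentFromLibraries` uses for its `date` field.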