# Self Retrievers

A self-query retriever can parse and understand queries posed in natural language, then search and filter relevant information from its database or stored documents based on those queries. It does so by transforming each query into a structured format that it can interpret and process efficiently. As a result, in addition to comparing the user's query against the documents to find matches, it can also filter the results by specific criteria extracted from that query.
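To make the idea of a "structured format" concrete, here is a minimal sketch in plain Python. The `Comparison`, `Operation`, and `StructuredQuery` dataclasses are simplified stand-ins written for illustration, not LangChain's own classes, and `to_where` is a hypothetical helper. It assumes the target store accepts MongoDB-style operator dictionaries (`$and`, `$eq`, and so on), which is the convention Chroma uses for its `where` filters.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class Comparison:
    """A single condition on one metadata attribute."""

    comparator: str  # e.g. "eq", "ne", "gt", "lt"
    attribute: str
    value: Any


@dataclass
class Operation:
    """A logical combination ("and" / "or") of nested conditions."""

    operator: str
    arguments: list


@dataclass
class StructuredQuery:
    """What remains of the user's request after it has been parsed."""

    query: str  # text used for the similarity search
    filter: Union[Comparison, Operation, None]  # metadata conditions


def to_where(node):
    """Recursively translate a filter tree into a MongoDB-style dict."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    return {f"${node.operator}": [to_where(arg) for arg in node.arguments]}


# "code examples about retrievers" could plausibly be parsed into:
structured = StructuredQuery(
    query="retrievers",
    filter=Operation(
        operator="and",
        arguments=[
            Comparison("eq", "code_snippet", True),
            Comparison("eq", "talks_about_retriever", True),
        ],
    ),
)

where = to_where(structured.filter)
print(where)
```

The text in `query` drives the similarity search, while the translated filter narrows the candidates by metadata before or during that search.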

![Self Retrievers](../diagrams/slide_diagrama_02.png)

## Libraries

In [1]:
from pprint import pprint

from dotenv import load_dotenv
from langchain.chains import create_tagging_chain_pydantic
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.retrievers import SelfQueryRetriever
from langchain.schema import Document
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from pydantic import BaseModel, Field

from src.langchain_docs_loader import LangchainDocsLoader, num_tokens_from_string

load_dotenv()

True

## Data loading

In [2]:
text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=400,
    chunk_overlap=50,
    length_function=num_tokens_from_string,
)

loader = LangchainDocsLoader(include_output_cells=False)
docs = loader.load()
docs = text_splitter.split_documents(docs)
len(docs)

3190

In [3]:
docs = [doc for doc in docs if doc.page_content != "```"]

## Language model initialization

In [4]:
llm = ChatOpenAI(temperature=0.1)

## Document tagging

Documents are useful on their own, but they become more useful when tagged with additional information. For example, if we tag documents with their language, we can filter out those that are not in the language we care about. If we tag them with their topic, we can filter out those unrelated to the topic of interest. This reduces the search space and tends to yield better results.
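As a rough illustration of how tags shrink the search space, the sketch below filters a handful of made-up fragments by their metadata. The fragment contents, the `language` and `topic` keys, and the `filter_by_tags` helper are all hypothetical and exist only for this example.

```python
# Hypothetical tagged fragments: each one has text plus a metadata dict.
fragments = [
    {"text": "How to build a retriever", "metadata": {"language": "en", "topic": "retriever"}},
    {"text": "Cómo crear un retriever", "metadata": {"language": "es", "topic": "retriever"}},
    {"text": "Chains overview", "metadata": {"language": "en", "topic": "chain"}},
    {"text": "Untagged fragment", "metadata": {}},
]


def filter_by_tags(items, **required):
    """Keep only the items whose metadata matches every required tag."""
    return [
        item
        for item in items
        if all(item["metadata"].get(key) == value for key, value in required.items())
    ]


candidates = filter_by_tags(fragments, language="en", topic="retriever")
print(f"{len(fragments)} fragments -> {len(candidates)} candidate(s)")
```

Only the surviving candidates would then need to be compared against the query, which is where the quality gain is expected to come from.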

### Creating the tag schema

In [5]:
class Tags(BaseModel):
    completness: str = Field(
        description="Describes how useful is the text in terms of self-explanation. It is critical to excel.",
        enum=["Very", "Quite", "Medium", "Little", "Not"],
    )
    code_snippet: bool = Field(
        default=False,
        description="Whether the text fragment includes a code snippet. Code snippets are valid markdown code blocks.",
    )
    description: bool = Field(
        default=False, description="Whether the text fragment includes a description."
    )
    talks_about_vectorstore: bool = Field(
        default=False,
        description="Whether the text fragment talks about a vectorstore.",
    )
    talks_about_retriever: bool = Field(
        default=False, description="Whether the text fragment talks about a retriever."
    )
    talks_about_chain: bool = Field(
        default=False, description="Whether the text fragment talks about a chain."
    )
    talks_about_expression_language: bool = Field(
        default=False,
        description="Whether the text fragment talks about the LangChain Expression Language (LCEL).",
    )
    contains_markdown_table: bool = Field(
        default=False,
        description="Whether the text fragment contains a markdown table.",
    )


pprint(Tags.schema())

{'properties': {'code_snippet': {'default': False,
                                 'description': 'Whether the text fragment '
                                                'includes a code snippet. Code '
                                                'snippets are valid markdown '
                                                'code blocks.',
                                 'title': 'Code Snippet',
                                 'type': 'boolean'},
                'completness': {'description': 'Describes how useful is the '
                                               'text in terms of '
                                               'self-explanation. It is '
                                               'critical to excel.',
                                'enum': ['Very',
                                         'Quite',
                                         'Medium',
                                         'Little',
                                         'Not'],

### Creating the tag generation chain (tagger)

In [6]:
tagging_prompt = """Extract the desired information from the following passage.

Only extract the properties mentioned in the 'information_extraction' function.
Completness should involve more than one sentence.
To consider that a passage talks about a property, it is enough that it mentions it once.
If there is no mention of a property, set it to False. This only applies to the talks_about_* properties.

For instance,
To set `talks_about_vectorstore` to True, document should contain the word 'vectorstore' at least once.
To set `talks_about_retriever` to True, document should contain the word 'retriever' at least once.
To set `talks_about_chain` to True, document should contain the word 'chain' at least once.
To set `talks_about_expression_language` to True, document should contain the word 'expression language' or 'LCEL' at least once.

Passage:
{input}
"""

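from langchain.prompts import ChatPromptTemplate

# Pass the custom prompt explicitly; otherwise the chain uses its default prompt.
tagging_chain = create_tagging_chain_pydantic(
    Tags,
    llm,
    prompt=ChatPromptTemplate.from_template(tagging_prompt),
)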
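The keyword rules stated in the tagging prompt can also be written down as a deterministic baseline, which may help when sanity-checking the tags returned by the model. This is a sketch under assumptions: the `keyword_tags` and `disagreements` helpers are hypothetical, and matching is done case-insensitively on substrings, so a mention of "LangChain" also counts as a mention of "chain".

```python
# Keywords taken from the rules in the tagging prompt.
KEYWORDS = {
    "talks_about_vectorstore": ["vectorstore"],
    "talks_about_retriever": ["retriever"],
    "talks_about_chain": ["chain"],
    "talks_about_expression_language": ["expression language", "lcel"],
}


def keyword_tags(text):
    """Apply the prompt's keyword rules with case-insensitive substring matching."""
    lowered = text.lower()
    return {
        tag: any(word in lowered for word in words)
        for tag, words in KEYWORDS.items()
    }


def disagreements(llm_tags, text):
    """Return {tag: (llm_value, keyword_value)} for every tag where the two differ."""
    expected = keyword_tags(text)
    return {
        tag: (llm_tags.get(tag), value)
        for tag, value in expected.items()
        if llm_tags.get(tag) != value
    }


sample = "A retriever can be combined with a chain using LCEL."
baseline = keyword_tags(sample)

# Hypothetical model output that marks every topic as present.
all_true = {tag: True for tag in KEYWORDS}
mismatches = disagreements(all_true, sample)
print(baseline)
print(mismatches)
```

A fragment with many mismatches is a reasonable candidate for manual review or for re-tagging.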

### Tagger usage examples

A fragment that contains nothing but a list of links to other fragments, which are themselves indexed, is probably not very useful. It could cause us to retrieve a document that is not relevant to the query while the document that is relevant fails to appear near the top of the result list.
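One cheap way to spot such link-only fragments, independent of the LLM tagger, is to measure how much of the text sits inside markdown links. The following is a heuristic sketch: the `link_ratio` and `is_link_list` helpers and the 0.8 threshold are assumptions made for illustration, and the regular expression does not handle link labels that themselves contain square brackets.

```python
import re

# Matches simple markdown links of the form [label](target).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\([^)]*\)")
WHITESPACE = re.compile(r"\s")


def link_ratio(text):
    """Fraction of non-whitespace characters that belong to markdown links."""
    total = len(WHITESPACE.sub("", text))
    if total == 0:
        return 0.0
    linked = sum(
        len(WHITESPACE.sub("", match.group(0)))
        for match in LINK_PATTERN.finditer(text)
    )
    return linked / total


def is_link_list(text, threshold=0.8):
    """Flag fragments that consist almost entirely of links."""
    return link_ratio(text) >= threshold


navigation = "[Tutorials](/docs/tutorials)[Gallery](https://example.com/gallery)"
prose = (
    "This notebook shows how to store chat history. "
    "See the [AWS CLI](https://example.com/cli) guide."
)

print(is_link_list(navigation), is_link_list(prose))
```

Fragments flagged this way could be dropped before indexing or simply tagged so that the retriever can filter them out.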

In [7]:
idx = 0

result = tagging_chain.invoke(input={"input": docs[idx].page_content})
print(result.get("input"))
pprint(result.get("text").dict())

[📄️ DependentsDependents stats for langchain-ai/langchain](/docs/additional_resources/dependents)[📄️ TutorialsBelow are links to tutorials and courses on LangChain. For written guides on common use cases for LangChain, check out the use cases guides.](/docs/additional_resources/tutorials)[📄️ YouTube videos⛓ icon marks a new addition [last update 2023-09-21]](/docs/additional_resources/youtube)[🔗 Gallery](https://github.com/kyrolabs/awesome-langchain)
{'code_snippet': False,
 'completness': 'Not',
 'contains_markdown_table': False,
 'description': True,
 'talks_about_chain': True,
 'talks_about_expression_language': True,
 'talks_about_retriever': True,
 'talks_about_vectorstore': True}


A fragment with a link to its documentation and a usage example would be more useful.

In [8]:
idx = 1000

result = tagging_chain.invoke(input={"input": docs[idx].page_content})
print(result.get("input"))
pprint(result.get("text").dict())

# AWS DynamoDB

[Amazon AWS DynamoDB](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/dynamodb/index.html) is a fully managed `NoSQL` database service that provides fast and predictable performance with seamless scalability.

This notebook goes over how to use `DynamoDB` to store chat message history.

First make sure you have correctly configured the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). Then make sure you have installed `boto3`.

```bash
pip install boto3
```

Next, create the `DynamoDB` Table where we will be storing messages:

```python
import boto3

# Get the service resource.
dynamodb = boto3.resource("dynamodb")

# Create the DynamoDB table.
table = dynamodb.create_table(
    TableName="SessionTable",
    KeySchema=[{"AttributeName": "SessionId", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "SessionId", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)

# Wait until the table exists.

In [9]:
idx = 1400

result = tagging_chain.invoke(input={"input": docs[idx].page_content})
print(result.get("input"))
pprint(result.get("text").dict())

text content from PubMed Central and publisher web sites.](/docs/integrations/retrievers/pubmed)[📄️ RePhraseQueryRetrieverSimple retriever that applies an LLM between the user input and the query pass the to retriever.](/docs/integrations/retrievers/re_phrase)[📄️ SEC filings dataSEC filings data powered by Kay.ai and Cybersyn.](/docs/integrations/retrievers/sec_filings)[📄️ SVMSupport vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.](/docs/integrations/retrievers/svm)[📄️ TF-IDFTF-IDF means term-frequency times inverse document-frequency.](/docs/integrations/retrievers/tf_idf)[📄️ VespaVespa is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.](/docs/integrations/retrievers/vespa)[📄️ Weaviate Hybrid SearchWeaviate is an open source vector database.](/docs/integrations/retrievers/weaviate-hybrid)[📄️ WikipediaWiki

### Tagging the documents

In [10]:
tagging_results = tagging_chain.batch(
    inputs=[{"input": doc.page_content} for doc in docs[:200]],
    return_exceptions=True,
    config={
        "max_concurrency": 50,
    },
)

docs_with_tags = [
    Document(
        page_content=doc.page_content,
        metadata={
            **doc.metadata,
            **result.get("text").dict(),
        },
    )
    for doc, result in zip(docs, tagging_results)
    if not isinstance(result, Exception)
]

f"Documents with tags: {len(docs_with_tags)}"

'Documents with tags: 184'

## Document indexing

In [11]:
vectorstore = Chroma(
    collection_name="langchain_docs",
    embedding_function=OpenAIEmbeddings(),
)

record_manager = SQLRecordManager(
    db_url="sqlite:///:memory:",
    namespace="chroma/langchain_docs",
)

record_manager.create_schema()

index(
    docs_source=docs_with_tags,
    record_manager=record_manager,
    vector_store=vectorstore,
    cleanup="full",
    source_id_key="source",
)

{'num_added': 184, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

## Document retrieval with a `Self Retriever`

### Describing the metadata available in the index

In [12]:
metadata_field_info = [
    AttributeInfo(
        name="completness",
        description="Describes how useful is the text in terms of self-explanation. It is critical to excel.",
        type='enum=["Very", "Quite", "Medium", "Little", "Not"]',
    ),
    AttributeInfo(
        name="code_snippet",
        description="Whether the text fragment includes a code snippet. Code snippets are valid markdown code blocks.",
        type="bool",
    ),
    AttributeInfo(
        name="description",
        description="Whether the text fragment includes a description.",
        type="bool",
    ),
    AttributeInfo(
        name="talks_about_vectorstore",
        description="Whether the text fragment talks about a vectorstore.",
        type="bool",
    ),
    AttributeInfo(
        name="talks_about_retriever",
        description="Whether the text fragment talks about a retriever.",
        type="bool",
    ),
    AttributeInfo(
        name="talks_about_chain",
        description="Whether the text fragment talks about a chain.",
        type="bool",
    ),
    AttributeInfo(
        name="contains_markdown_table",
        description="Whether the text fragment contains a markdown table.",
        type="bool",
    ),
]

document_content_description = "Langchain documentation"

### Creating the `retriever`

In [13]:
llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    enable_limit=True,
    verbose=True,
)

### Retrieving documents with the `retriever`

In [14]:
relevant_documents = retriever.get_relevant_documents(
    "useful documents that talk about expression language and retrievers"
)
relevant_documents



query='expression language retrievers' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='completness', value='Very'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='talks_about_retriever', value=True)]) limit=None


[Document(page_content='```\n\n```python\nchain = (\n    {"context": retriever, "question": RunnablePassthrough()} \n    | prompt \n    | model \n    | StrOutputParser()\n)\n```\n\n```python\nchain.invoke("where did harrison work?")\n```\n\n```python\ntemplate = """Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n\nAnswer in the following language: {language}\n"""\nprompt = ChatPromptTemplate.from_template(template)\n\nchain = {\n    "context": itemgetter("question") | retriever, \n    "question": itemgetter("question"), \n    "language": itemgetter("language")\n} | prompt | model | StrOutputParser()\n```\n\n```python\nchain.invoke({"question": "where did harrison work", "language": "italian"})\n```', metadata={'code_snippet': True, 'completness': 'Very', 'contains_markdown_table': False, 'description': True, 'language': 'en', 'source': 'https://python.langchain.com/docs/expression_language/cookbook/retrieval', 'talks_about_chain': True, 'tal

In [15]:
relevant_documents = retriever.get_relevant_documents(
    "useful documents that talk about expression language and retrievers or vectorstores"
)
relevant_documents

query='expression language' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='completness', value='Very'), Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='talks_about_retriever', value=True), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='talks_about_vectorstore', value=True)])]) limit=None


[Document(page_content='# Code writing\n\nExample of how to use LCEL to write Python code.\n\n```python\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate\nfrom langchain.schema.output_parser import StrOutputParser\nfrom langchain.utilities import PythonREPL\n```\n\n> **API Reference:**\n> - [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html)\n> - [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html)\n> - [SystemMessagePromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.SystemMessagePromptTemplate.html)\n> - [HumanMessagePromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.HumanMessagePromptTemplate.html)\n> - [StrOutputParser](https://api.python.langchain.com/en/latest/schema/langchain