# Self-query

* retrievers:
  * `Self-querying retriever` -- uses an LLM to turn the user's question into a structured query (search text plus a filter) over the documents' structured data/metadata
  * `MultiQueryRetriever` -- lets an LLM paraphrase the query several ways, in the hope of retrieving a more diverse set of docs
  * `Contextual compression` -- uses an LLM to filter and compress the retrieved docs before passing them as context to the LLM that answers
  * https://python.langchain.com/docs/modules/data_connection/retrievers/
* retrieval methods: cosine / dot-product similarity; LLM-aided; MMR (maximal marginal relevance)
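
To make the MMR entry concrete, here is a minimal pure-Python sketch of the selection loop: each step picks the candidate that is most relevant to the query while being least similar to what has already been picked. The function names, the toy vectors, and the `lambda_mult` weighting are illustrative assumptions, not code used elsewhere in this notebook.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: Sequence[float], docs: List[Sequence[float]], k: int,
        lambda_mult: float = 0.5) -> List[int]:
    """Return the indices of up to k docs, trading relevance against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            # Similarity to the closest already-selected doc (0 if nothing is selected yet).
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: docs 0 and 1 are near-duplicates, doc 2 points elsewhere.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(toy_query, toy_docs, k=2, lambda_mult=0.3))  # favours diversity
```

With `lambda_mult=1.0` the loop reduces to plain similarity ranking; lowering it penalizes near-duplicates of documents that were already chosen.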

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from typing import Dict, List
from langchain_core.documents.base import Document

In [3]:
import tomllib

with open('../.tokens.toml', 'rb') as f:
    _TOKENS = tomllib.load(f)

with open('../.config.toml', 'rb') as f:
    _CONFIGS = tomllib.load(f)

In [4]:
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=_TOKENS['huggingface'], 
    model_name="sentence-transformers/distiluse-base-multilingual-cased-v1"
)

vs_chroma = Chroma(persist_directory='../database/vs_chroma', embedding_function=embeddings)

In [5]:
# # Chroma applies the metadata filter before the semantic search
# vs_chroma.similarity_search_with_score(
#     '谁说过陌生贵己？', 
#     filter={
#         'author': '【中】冯友兰',
#     },
#     k=2,
# )
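
The comment in the cell above says the filter is applied before the semantic search. As a toy illustration of what "filter first, then rank" means, the sketch below does both steps by hand on made-up data. `filtered_search`, the tuple layout, and the vectors are hypothetical and say nothing about how Chroma implements this internally.

```python
from typing import Dict, List, Sequence, Tuple

# (text, metadata, embedding): a stand-in for a stored document.
Doc = Tuple[str, Dict[str, str], List[float]]


def filtered_search(query_vec: Sequence[float], docs: List[Doc],
                    where: Dict[str, str], k: int = 2) -> List[str]:
    # Step 1: keep only documents whose metadata matches every key/value in `where`.
    candidates = [d for d in docs
                  if all(d[1].get(key) == val for key, val in where.items())]
    # Step 2: rank the survivors by dot-product similarity to the query.
    ranked = sorted(candidates,
                    key=lambda d: sum(q * x for q, x in zip(query_vec, d[2])),
                    reverse=True)
    return [text for text, _, _ in ranked[:k]]


toy_store: List[Doc] = [
    ("a", {"author": "X"}, [1.0, 0.0]),
    ("b", {"author": "Y"}, [0.9, 0.1]),
    ("c", {"author": "X"}, [0.2, 0.8]),
]
print(filtered_search([1.0, 0.0], toy_store, where={"author": "X"}, k=2))
```

Document `b` is closer to the query than `c`, but it never reaches the ranking step because the filter removes it first.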

In [6]:
metadata = vs_chroma.get(include=["metadatas"])

metadata_set = set()

for x in metadata['metadatas']:
    metadata_set.update(x.keys())

metadata_set

{'author', 'date_end', 'date_start', 'id', 'name', 'source', 'tags'}

In [7]:
metadata = _CONFIGS['attributes']
metadata

{'author': {'description': '本篇文章的作者', 'type': 'string'},
 'date_start': {'description': '文章被创建的时间，格式是YYYY-MM-DD', 'type': 'string'},
 'date_end': {'description': '文章被完成的时间，格式是YYYY-MM-DD', 'type': 'string'},
 'id': {'description': '文章的id', 'type': 'string'},
 'name': {'description': '文章的名字', 'type': 'string'},
 'source': {'description': '文章的来源，这里的文章取自若干不同数据库', 'type': 'string'},
 'tags': {'description': '文章的标签，可能代表它的风格、题材、来源，或者系列', 'type': 'string'}}

In [8]:
# ensure every metadata key found in the store is documented in the config
assert metadata_set.issubset(metadata.keys())

In [9]:
metadata_set

{'author', 'date_end', 'date_start', 'id', 'name', 'source', 'tags'}

In [10]:
from langchain_community.llms import LlamaCpp


llm = LlamaCpp(
    model_path=_CONFIGS['model_path']+'/'+'Qwen-7B-Chat.Q4_K_M.gguf',
    name='Qwen/Qwen-7B-Chat', 
    **_CONFIGS['llm']
)


# llm = LlamaCpp(
#     model_path=_CONFIGS['model_path']+'/'+'qwen1_5-7b-chat-q4_0.gguf',
#     name='Qwen/Qwen1.5-7B-Chat', 
#     **_CONFIGS['llm']
# )

# llm = LlamaCpp(
#     model_path=_CONFIGS['model_path']+'/'+'zephyr-7b-beta.Q4_K_M.gguf',
#     name='HuggingFaceH4/zephyr-7b-beta', 
#     **_CONFIGS['llm']
# )

                conversation was transferred to model_kwargs.
                Please confirm that conversation is what you intended.
llama_model_loader: loaded meta data with 19 key-value pairs and 259 tensors from /Users/fred/Documents/models/Qwen-7B-Chat.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen
llama_model_loader: - kv   1:                               general.name str              = Qwen
llama_model_loader: - kv   2:                        qwen.context_length u32              = 32768
llama_model_loader: - kv   3:                           qwen.block_count u32              = 32
llama_model_loader: - kv   4:                      qwen.embedding_length u32              = 4096
llama_model_loader: - kv   5:                   qwen.feed_forward_length u32              = 22016
llama_model_loader: - kv

In [11]:
from langchain.chains.query_constructor.base import AttributeInfo

attribute_info = list()

for k, v in metadata.items():
    attribute_info.append(
        AttributeInfo(
            name=k,
            description=v['description'],
            type=v['type']
        )
    )

attribute_info

[AttributeInfo(name='author', description='本篇文章的作者', type='string'),
 AttributeInfo(name='date_start', description='文章被创建的时间，格式是YYYY-MM-DD', type='string'),
 AttributeInfo(name='date_end', description='文章被完成的时间，格式是YYYY-MM-DD', type='string'),
 AttributeInfo(name='id', description='文章的id', type='string'),
 AttributeInfo(name='name', description='文章的名字', type='string'),
 AttributeInfo(name='source', description='文章的来源，这里的文章取自若干不同数据库', type='string'),
 AttributeInfo(name='tags', description='文章的标签，可能代表它的风格、题材、来源，或者系列', type='string')]

# Construct a customized self-query retriever

Q: Why not use the standard one?
A: The standard `SelfQueryRetriever` class ships a default prompt template that uses few-shot examples to show the LLM how to construct a structured query (the examples are in [langchain.chains.query_constructor.prompt](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/query_constructor/prompt.py)). Most of those examples use the EQ (=) comparator, which does not suit our use cases (mostly fuzzy matches). We therefore rebuild the self-query retriever around a customized few-shot prompt template.

Q: Why do we copy the helpers behind `get_query_constructor_prompt`?
A: Its dependency `construct_examples` serializes JSON with `json.dumps`, whose default `ensure_ascii=True` escapes every non-ASCII character, so Chinese text would reach the prompt as `\uXXXX` sequences. We therefore need to override the two functions (`construct_examples` and `_format_attribute_info`).
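
A small standard-library sketch of the encoding issue, using the values from the few-shot example file; the variable names are only for illustration:

```python
import json

request = {"query": "人生有几个不捡", "filter": 'like("tags", "笑死")'}

escaped = json.dumps(request)                       # default: ensure_ascii=True
readable = json.dumps(request, ensure_ascii=False)  # keeps the characters as written

print(escaped)
print(readable)
```

Both strings decode back to the same object, but only the second one shows the model the actual Chinese text inside the prompt.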

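For comparison, the standard construction would look like this:

```python
from langchain.retrievers import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vs_chroma,
    document_contents='Articles and excerpts.',
    metadata_field_info=attribute_info,  # the AttributeInfo list built above
)
```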

References: 
https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/#constructing-from-scratch-with-lcel

In [12]:
from pprint import pprint

with open('../self_query_examples.toml', 'rb') as f:
    self_query_examples = tomllib.load(f)

pprint(self_query_examples['example'][0])

{'structured_request': {'filter': 'or(like("source", "笑死"), in("source", '
                                  '"笑死"), like("tags", "笑死"), in("tags", '
                                  '"笑死"))',
                        'query': '人生有几个不捡'},
 'user_query': '人生有几个不捡？仅从“笑死”中找答案。'}


In [13]:
# with open('../self_query_template_chinese.txt', 'r') as f:
#     self_query_template = "\n".join(f.readlines())

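# NOTE: readlines() keeps each line's trailing newline, so joining with "\n"
# doubles every line break in the template; f.read() would load it unchanged.
with open('../self_query_template.txt', 'r') as f:
    self_query_template = "\n".join(f.readlines())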

In [14]:
from typing import Sequence, Union, Tuple
import json
from langchain_core.prompts.few_shot import FewShotPromptTemplate
from langchain.chains.query_constructor.prompt import USER_SPECIFIED_EXAMPLE_PROMPT, SUFFIX_WITHOUT_DATA_SOURCE

def _format_attribute_info(info: Sequence[Union[AttributeInfo, dict]]) -> Dict[str, dict]:
    """Map each attribute name to its remaining fields.

    Unlike LangChain's helper of the same name, this returns the dict itself,
    so the caller can serialize it with ensure_ascii=False.
    """
    info_dicts = {}
    for i in info:
        i_dict = dict(i)
        info_dicts[i_dict.pop("name")] = i_dict
    return info_dicts

def construct_examples(input_output_pairs: Sequence[Tuple[str, dict]]) -> List[dict]:
    """Construct examples from input-output pairs.

    Adapted from: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/query_constructor/base.py
    """
    examples = []
    for i, (_input, output) in enumerate(input_output_pairs):
        structured_request = (
            json.dumps(output, indent=4, ensure_ascii=False).replace("{", "{{").replace("}", "}}")
        )
        example = {
            "i": i + 1,
            "user_query": _input,
            "structured_request": structured_request,
        }
        examples.append(example)
    return examples

examples = construct_examples(
    [(x['user_query'], x['structured_request']) for x in self_query_examples['example']]
)

prompt = FewShotPromptTemplate(
    examples=list(examples),
    example_prompt=USER_SPECIFIED_EXAMPLE_PROMPT,
    input_variables=["query"],
    # suffix="",
    suffix=SUFFIX_WITHOUT_DATA_SOURCE.format(i=len(examples) + 1),
    prefix=self_query_template.format(
        content_and_attributes=json.dumps({
            'content': '文章',
            'attributes': _format_attribute_info(attribute_info)
        }, indent=4, ensure_ascii=False).replace("{", "{{").replace("}", "}}"),
        attributes_set=str(list(metadata_set))
    )
)
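
The `.replace("{", "{{").replace("}", "}}")` calls in the cell above are needed because the prompt template treats single braces as placeholders, so literal JSON braces have to be doubled. The sketch below shows the effect with plain `str.format`, used here as a stand-in for the template engine; the template text is made up for illustration.

```python
import json

schema = json.dumps({"content": "文章"}, ensure_ascii=False)

# Unescaped: the JSON braces are parsed as a replacement field and formatting fails.
raw_template = "Data source:\n" + schema + "\nUser Query:\n{query}"
try:
    raw_template.format(query="...")
except (KeyError, ValueError, IndexError) as err:
    print("unescaped braces break formatting:", repr(err))

# Escaped: doubled braces survive formatting as single literal braces.
escaped_schema = schema.replace("{", "{{").replace("}", "}}")
safe_template = "Data source:\n" + escaped_schema + "\nUser Query:\n{query}"
print(safe_template.format(query="道连是哪本小说中出现的？"))
```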

In [15]:
prompt.pretty_print()

Your goal is to structure the user's query to match the request schema provided below.



<< Structured Request Schema >>

When responding use a markdown code snippet with a JSON object formatted in the following schema:



```json

{

    "query": string \ text string to compare to document contents

    "filter": string \ logical condition statement for filtering documents

}

```



The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.



A logical condition statement is composed of one or more comparison and logical operation statements.



A comparison statement takes the form: `comp(attr, val)`:

- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator

- `attr` (string):  name of attribute to apply the comparison to

- `val` (string): is the comparison value



A logical operation statement takes the form `op(statement1, statement2, ...)`:



In [16]:
from langchain.chains.query_constructor.base import StructuredQueryOutputParser
output_parser = StructuredQueryOutputParser.from_components()

In [17]:
%%time

import langchain
langchain.debug = True

query_constructor = prompt | llm | output_parser

query_constructor.invoke(
    {
        "query": "“每个人都以为他自己至少有一种主要的美德。”是出自哪里？请从“读书笔记（文学）”中找到答案。"
    }
)

[chain/start] [1:chain:RunnableSequence] Entering Chain run with input:
{
  "query": "“每个人都以为他自己至少有一种主要的美德。”是出自哪里？请从“读书笔记（文学）”中找到答案。"
}
[chain/start] [1:chain:RunnableSequence > 2:prompt:FewShotPromptTemplate] Entering Prompt run with input:
{
  "query": "“每个人都以为他自己至少有一种主要的美德。”是出自哪里？请从“读书笔记（文学）”中找到答案。"
}
[chain/end] [1:chain:RunnableSequence > 2:prompt:FewShotPromptTemplate] [0ms] Exiting Prompt run with output:
{
  "lc": 1,
  "type": "constructor",
  "id": [
    "langchain",
    "prompts",
    "base",
    "StringPromptValue"
  ],
  "kwargs": {
    "text": "Your goal is to structure the user's query to match the request schema provided below.\n\n\n\n<< Structured Request Schema >>\n\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n\n\n```json\n\n{\n\n    \"query\": string \\ text string to compare to document contents\n\n    \"filter\": string \\ logical condi


llama_print_timings:        load time =    6403.70 ms
llama_print_timings:      sample time =      39.79 ms /   103 runs   (    0.39 ms per token,  2588.92 tokens per second)
llama_print_timings: prompt eval time =   17782.73 ms /  1240 tokens (   14.34 ms per token,    69.73 tokens per second)
llama_print_timings:        eval time =    7147.48 ms /   102 runs   (   70.07 ms per token,    14.27 tokens per second)
llama_print_timings:       total time =   25467.94 ms /  1342 tokens


StructuredQuery(query='每个人都以为他自己至少有一种主要的美德。', filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='source', value='读书笔记（文学）'), Comparison(comparator=<Comparator.IN: 'in'>, attribute='source', value='读书笔记（文学）'), Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='tags', value='读书笔记（文学）'), Comparison(comparator=<Comparator.IN: 'in'>, attribute='tags', value='读书笔记（文学）')]), limit=None)

In [18]:
%%time

import langchain
langchain.debug = True

query_constructor = prompt | llm | output_parser

query_constructor.invoke(
    {
        "query": "道连是哪本小说中出现的？"
    }
)

Llama.generate: prefix-match hit


[chain/start] [1:chain:RunnableSequence] Entering Chain run with input:
{
  "query": "道连是哪本小说中出现的？"
}
[chain/start] [1:chain:RunnableSequence > 2:prompt:FewShotPromptTemplate] Entering Prompt run with input:
{
  "query": "道连是哪本小说中出现的？"
}
[chain/end] [1:chain:RunnableSequence > 2:prompt:FewShotPromptTemplate] [0ms] Exiting Prompt run with output:
{
  "lc": 1,
  "type": "constructor",
  "id": [
    "langchain",
    "prompts",
    "base",
    "StringPromptValue"
  ],
  "kwargs": {
    "text": "Your goal is to structure the user's query to match the request schema provided below.\n\n\n\n<< Structured Request Schema >>\n\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n\n\n```json\n\n{\n\n    \"query\": string \\ text string to compare to document contents\n\n    \"filter\": string \\ logical condition statement for filtering documents\n\n}\n\n```\n\n\n\nThe query 


llama_print_timings:        load time =    6403.70 ms
llama_print_timings:      sample time =      30.55 ms /    76 runs   (    0.40 ms per token,  2488.05 tokens per second)
llama_print_timings: prompt eval time =     867.16 ms /    14 tokens (   61.94 ms per token,    16.14 tokens per second)
llama_print_timings:        eval time =    5412.04 ms /    75 runs   (   72.16 ms per token,    13.86 tokens per second)
llama_print_timings:       total time =    6666.45 ms /    89 tokens


StructuredQuery(query='道连', filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='source', value='小说'), Comparison(comparator=<Comparator.IN: 'in'>, attribute='source', value='小说'), Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='tags', value='小说'), Comparison(comparator=<Comparator.IN: 'in'>, attribute='tags', value='小说')]), limit=None)

In [19]:
%%time

import langchain
langchain.debug = True

query_constructor = prompt | llm | output_parser

query_constructor.invoke(
    {
        "query": "谁是王尔德？仅从“笑死”中找答案。"
    }
)

[chain/start] [1:chain:RunnableSequence] Entering Chain run with input:
{
  "query": "谁是王尔德？仅从“笑死”中找答案。"
}
[chain/start] [1:chain:RunnableSequence > 2:prompt:FewShotPromptTemplate] Entering Prompt run with input:
{
  "query": "谁是王尔德？仅从“笑死”中找答案。"
}
[chain/end] [1:chain:RunnableSequence > 2:prompt:FewShotPromptTemplate] [0ms] Exiting Prompt run with output:
{
  "lc": 1,
  "type": "constructor",
  "id": [
    "langchain",
    "prompts",
    "base",
    "StringPromptValue"
  ],
  "kwargs": {
    "text": "Your goal is to structure the user's query to match the request schema provided below.\n\n\n\n<< Structured Request Schema >>\n\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n\n\n```json\n\n{\n\n    \"query\": string \\ text string to compare to document contents\n\n    \"filter\": string \\ logical condition statement for filtering documents\n\n}\n\n```\n\n\n\n

Llama.generate: prefix-match hit


[llm/end] [1:chain:RunnableSequence > 3:llm:LlamaCpp] [7.10s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "```json\n{  \n     \"query\": \"谁是王尔德\",  \n     \"filter\": \"or(like(\\\"source\\\", \\\"笑死\\\"), in(\\\"source\\\", \\\"笑死\\\"), like(\\\"tags\\\", \\\"笑死\\\"), in(\\\"tags\\\", \\\"笑死\\\"))\"  \n}\n```[PAD151645]\n[PAD151644]'t be able to solve this[PAD151645]\n[PAD151644]\n[PAD151644]\n[PAD151644]\n",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}
[chain/start] [1:chain:RunnableSequence > 4:parser:StructuredQueryOutputParser] Entering Parser run with input:
{
  "input": "```json\n{  \n     \"query\": \"谁是王尔德\",  \n     \"filter\": \"or(like(\\\"source\\\", \\\"笑死\\\"), in(\\\"source\\\", \\\"笑死\\\"), like(\\\"tags\\\", \\\"笑死\\\"), in(\\\"tags\\\", \\\"笑死\\\"))\"  \n}\n```[PAD151645]\n[PAD151644]'t be able to solve this[


llama_print_timings:        load time =    6403.70 ms
llama_print_timings:      sample time =      32.95 ms /    82 runs   (    0.40 ms per token,  2488.69 tokens per second)
llama_print_timings: prompt eval time =     994.18 ms /    20 tokens (   49.71 ms per token,    20.12 tokens per second)
llama_print_timings:        eval time =    5684.72 ms /    81 runs   (   70.18 ms per token,    14.25 tokens per second)
llama_print_timings:       total time =    7091.92 ms /   101 tokens


StructuredQuery(query='谁是王尔德', filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='source', value='笑死'), Comparison(comparator=<Comparator.IN: 'in'>, attribute='source', value='笑死'), Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='tags', value='笑死'), Comparison(comparator=<Comparator.IN: 'in'>, attribute='tags', value='笑死')]), limit=None)

In [22]:
from langchain.retrievers.self_query.chroma import ChromaTranslator
from langchain.retrievers import SelfQueryRetriever

retriever = SelfQueryRetriever(
    query_constructor=prompt | llm | output_parser,
    vectorstore=vs_chroma,
    structured_query_translator=ChromaTranslator(),
)

In [23]:
%%time 

import langchain
langchain.debug = True

retriever.invoke('人生有几个不捡？仅从“笑死”中找答案。')
# retriever.invoke('什么是我国第一部编年国别史？')

[chain/start] [1:retriever:Retriever > 2:chain:RunnableSequence] Entering Chain run with input:
{
  "query": "人生有几个不捡？仅从“笑死”中找答案。"
}
[chain/start] [1:retriever:Retriever > 2:chain:RunnableSequence > 3:prompt:FewShotPromptTemplate] Entering Prompt run with input:
{
  "query": "人生有几个不捡？仅从“笑死”中找答案。"
}
[chain/end] [1:retriever:Retriever > 2:chain:RunnableSequence > 3:prompt:FewShotPromptTemplate] [0ms] Exiting Prompt run with output:
{
  "lc": 1,
  "type": "constructor",
  "id": [
    "langchain",
    "prompts",
    "base",
    "StringPromptValue"
  ],
  "kwargs": {
    "text": "Your goal is to structure the user's query to match the request schema provided below.\n\n\n\n<< Structured Request Schema >>\n\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n\n\n```json\n\n{\n\n    \"query\": string \\ text string to compare to document contents\n\n    \"filter\": strin

Llama.generate: prefix-match hit


[llm/end] [1:retriever:Retriever > 2:chain:RunnableSequence > 4:llm:LlamaCpp] [7.56s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "```json\n{  \n     \"query\": \"人生有几个不捡\",  \n     \"filter\": \"or(like(\\\"source\\\", \\\"笑死\\\"), in(\\\"source\\\", \\\"笑死\\\"), like(\\\"tags\\\", \\\"笑死\\\"), in(\\\"tags\\\", \\\"笑死\\\"))\"  \n}\n```[PAD151645]\n[PAD151644]'t be able to solve this issue[PAD151645]\n[PAD151644]\n[PAD151644]\n[PAD151644]\n",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}
[chain/start] [1:retriever:Retriever > 2:chain:RunnableSequence > 5:parser:StructuredQueryOutputParser] Entering Parser run with input:
{
  "input": "```json\n{  \n     \"query\": \"人生有几个不捡\",  \n     \"filter\": \"or(like(\\\"source\\\", \\\"笑死\\\"), in(\\\"source\\\", \\\"笑死\\\"), like(\\\"tags\\\", \\\"笑死\\\"), in(\\\"tags\\\", \\\"笑死\\\"))\" 


llama_print_timings:        load time =    6403.70 ms
llama_print_timings:      sample time =      33.99 ms /    82 runs   (    0.41 ms per token,  2412.12 tokens per second)
llama_print_timings: prompt eval time =    1039.34 ms /    19 tokens (   54.70 ms per token,    18.28 tokens per second)
llama_print_timings:        eval time =    6084.44 ms /    81 runs   (   75.12 ms per token,    13.31 tokens per second)
llama_print_timings:       total time =    7549.85 ms /   100 tokens


ValueError: Received disallowed comparator Comparator.LIKE. Allowed comparators are [<Comparator.EQ: 'eq'>, <Comparator.NE: 'ne'>, <Comparator.GT: 'gt'>, <Comparator.GTE: 'gte'>, <Comparator.LT: 'lt'>, <Comparator.LTE: 'lte'>]
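
The error shows that the translator rejects `like` (and, by the same list, `in`), which are exactly the comparators the few-shot examples teach the model to emit. A hedged sketch of a pre-flight check on the generated filter string follows; the allowed set is copied from the error message above, while the function name and the regex-based parsing are assumptions made for illustration and are not part of LangChain.

```python
import re
from typing import Set

# Comparators listed as allowed in the ValueError above.
ALLOWED_COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte"}
# Logical operators wrap comparisons; they are not comparators themselves.
LOGICAL_OPERATORS = {"and", "or", "not"}


def disallowed_comparators(filter_expr: str) -> Set[str]:
    """Return the function names in a filter expression that the translator would reject."""
    # Blank out quoted values first so text inside strings is never mistaken for a call.
    without_strings = re.sub(r'"(?:\\.|[^"\\])*"', '""', filter_expr)
    names = set(re.findall(r'\b([a-z]+)\s*\(', without_strings))
    return names - ALLOWED_COMPARATORS - LOGICAL_OPERATORS


generated = ('or(like("source", "笑死"), in("source", "笑死"), '
             'like("tags", "笑死"), in("tags", "笑死"))')
print(disallowed_comparators(generated))
```

An empty result means every comparator in the expression is on the allowed list; anything else points at the part of the few-shot examples that would need to change for this vector store.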

In [None]:
# from notion_agent import chatbot

# llm = chatbot(
#     'Qwen/Qwen1.5-7B-Chat', 
#     _CONFIGS['model_path']+'/'+'qwen1_5-7b-chat-q4_0.gguf', 
#     **_CONFIGS['llm']
# )