# Text splitting by header

Text splitting for vector storage often uses sentences or other delimiters [to keep related text together](https://www.pinecone.io/learn/chunking-strategies/). 

But many documents (such as `Markdown` files) have structure (headers) that can be explicitly used in splitting. 

The `MarkdownHeaderTextSplitter` lets a user split `Markdown` files based on specified headers. 

This results in chunks that retain, in their metadata, the header(s) they came from.
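To make the idea concrete, here is a minimal pure-Python sketch of header-based splitting. It is an illustration only, not the library's implementation: the function name `split_by_headers` and the dict layout are assumptions made for this example, and unlike a full splitter it tracks only the most recent matching header rather than a nested hierarchy.

```python
from typing import Dict, List, Tuple


def split_by_headers(text: str, headers: List[Tuple[str, str]]) -> List[Dict]:
    """Group lines under the most recent matching header (illustration only)."""
    chunks: List[Dict] = []
    metadata: Dict[str, str] = {}
    lines: List[str] = []

    def flush() -> None:
        # Emit the lines collected so far, tagged with the current header metadata.
        content = "\n".join(lines).strip()
        if content:
            chunks.append({"page_content": content, "metadata": dict(metadata)})
        lines.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, name in headers:
            # Require a space after the marker so "###" does not match "####".
            if stripped.startswith(marker + " "):
                flush()
                metadata = {name: stripped[len(marker):].strip()}
                break
        else:
            # Not a header we split on: keep the line as content.
            lines.append(line)
    flush()
    return chunks


sample = "# Title\n\n### Introduction\nIntro text.\n\n### Testing\nTest text."
for chunk in split_by_headers(sample, [("###", "Section")]):
    print(chunk)
```

Text that appears before the first matching header (here, the `# Title` line) ends up in a chunk with empty metadata.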

This works nicely with `SelfQueryRetriever`.

First, tell the retriever about our splits.

Then, query based on the doc structure (e.g., "summarize the doc introduction"). 

Only chunks from that section of the document will be retrieved and used in chat / Q+A.

Let's test this out on an [example Notion page](https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b?pvs=4)!

First, I download the page to Markdown as explained [here](https://python.langchain.com/docs/ecosystem/integrations/notion).

In [1]:
# Load the Notion page as a Markdown file
from langchain.document_loaders import NotionDirectoryLoader

path = "datasets/Notion_DB/"
loader = NotionDirectoryLoader(path)
docs = loader.load()
md_file = docs[0].page_content

In [2]:
# Let's create groups based on the section headers in our page
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("###", "Section"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md_file)

Now, perform text splitting on the header-grouped documents. 

In [3]:
# Define our text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
all_splits = text_splitter.split_documents(md_header_splits)
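The point of this second pass is that every smaller chunk inherits the metadata of the section it was cut from. The sketch below shows that behavior with a simplified word-based splitter. It is a stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses, and the name `split_keep_metadata` and the dict layout are assumptions for illustration.

```python
from typing import Dict, List


def split_keep_metadata(sections: List[Dict], chunk_size: int) -> List[Dict]:
    """Cut each section into pieces of at most chunk_size characters, breaking on whitespace.

    A single word longer than chunk_size is kept whole in this simplified version.
    """
    out: List[Dict] = []
    for section in sections:
        current = ""
        for word in section["page_content"].split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > chunk_size:
                # The piece is full: emit it with a copy of the section metadata.
                out.append({"page_content": current, "metadata": dict(section["metadata"])})
                current = word
            else:
                current = candidate
        if current:
            out.append({"page_content": current, "metadata": dict(section["metadata"])})
    return out


sections = [{"page_content": "aaa bbb ccc ddd", "metadata": {"Section": "Introduction"}}]
for piece in split_keep_metadata(sections, chunk_size=7):
    print(piece)
```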

In [13]:
all_splits

[Document(page_content='# Auto-Evaluation of Metadata Filtering  \n[Lance Martin](https://twitter.com/RLanceMartin)'),
 Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),
 Document(page_content='metadata tags prior to semantic search.', metadata={'Section': 'Introduction'}),
 Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%201643790c6b344105abea1618b6d72d3d/Untitled.png)', metadata={'Section': 'Introduction'}),
 Document(page_content='I [previously bui

This sets us up well to perform metadata filtering based on the document structure.
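As a rough sketch of what metadata filtering means here: chunks are first narrowed down by a metadata condition, and only the survivors are ranked for relevance. The example below uses a toy word-overlap score as a stand-in for embedding similarity; the function names and the sample chunks are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def filter_by_metadata(chunks: List[Dict], key: str, value: str) -> List[Dict]:
    """Keep only the chunks whose metadata maps key to value."""
    return [c for c in chunks if c["metadata"].get(key) == value]


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def filtered_search(chunks: List[Dict], query: str, key: str, value: str, k: int = 2) -> List[Dict]:
    """Filter on metadata first, then rank the remaining chunks by the toy score."""
    candidates = filter_by_metadata(chunks, key, value)
    ranked = sorted(candidates, key=lambda c: word_overlap(query, c["page_content"]), reverse=True)
    return ranked[:k]


chunks = [
    {"page_content": "Metadata filtering pre-filters chunks.", "metadata": {"Section": "Introduction"}},
    {"page_content": "We test metadata filtering on questions.", "metadata": {"Section": "Testing"}},
    {"page_content": "# Title", "metadata": {}},
]
hits = filtered_search(chunks, "metadata filtering", "Section", "Introduction")
print([h["page_content"] for h in hits])
```

Chunks without the requested metadata key are simply excluded, so text that precedes the first header never matches a section filter.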

Let's bring this all together by building a vectorstore first.

In [4]:
# ! pip install chromadb

In [5]:
from langchain.embeddings import LlamaCppEmbeddings

# llama_model_path = "../../models/zephyr-7b-beta.Q4_K_M.gguf"
llama_model_path = "../../models/zephyr-7b-beta.Q8_0.gguf"
n_ctx = 3096

# Use the Llama model for embeddings
embeddings = LlamaCppEmbeddings(model_path=llama_model_path, n_ctx=n_ctx)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ../../models/zephyr-7b-beta.Q8_0.gguf (version unknown)
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 

In [6]:
# Build vectorstore and keep the metadata
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)


llama_print_timings:        load time =  8337.68 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 10220.65 ms /    29 tokens (  352.44 ms per token,     2.84 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 10226.12 ms

llama_print_timings:        load time =  8337.68 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 11633.12 ms /   126 tokens (   92.33 ms per token,    10.83 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 11653.17 ms

llama_print_timings:        load time =  8337.68 ms
llama_print_timings:   

In [7]:
from langchain.llms import LlamaCpp

temperature = 0
n_gpu_layers = 1  # With Metal, setting this to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM on your Apple Silicon chip.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=llama_model_path,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=n_ctx,
    temperature=temperature,
    f16_kv=True,  # MUST be set to True; otherwise you will run into problems after a couple of calls
    verbose=True,
)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ../../models/zephyr-7b-beta.Q8_0.gguf (version unknown)
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q8_0     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q8_0     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q8_0     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 

Let's create a `SelfQueryRetriever` that can filter on the metadata we defined.

In [8]:
# Create retriever
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define our metadata
metadata_field_info = [
    AttributeInfo(
        name="Section",
        description="Part of the document that the text comes from",
        type="string or list[string]",
    ),
]
document_content_description = "Major sections of the document"

# Define self query retriever
# llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

Now we can query *only for texts* in the `Introduction` section of the document!
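One caveat with a local model: the structured request may come back with extra text after the first JSON object. The failed run below shows this as an `OutputParserException` ending in `Extra data`, where the model kept generating further examples. The following is a minimal sketch, assuming the first JSON object in the raw output is the one we want, of how such output could be trimmed before parsing. The helper name `first_json_object` is hypothetical, and the snippet is not wired into `SelfQueryRetriever`; it only illustrates the idea with the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Any, Dict


def first_json_object(text: str) -> Dict[str, Any]:
    """Parse the first JSON object found in text and ignore anything after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode stops at the end of the first complete JSON value,
    # so trailing text does not trigger an "Extra data" error.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Simulated model output: a valid request followed by extra generated text.
wanted = json.dumps({"query": "", "filter": 'eq("Section", "Introduction")'}, indent=4)
extra = json.dumps({"query": "", "filter": "NO_FILTER"}, indent=4)
raw_output = wanted + "\n\n<< Example 4. >>\n" + extra

print(first_json_object(raw_output))
```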

In [10]:
# Test
retriever.get_relevant_documents("Summarize the Introduction section of the document")

Llama.generate: prefix-match hit

llama_print_timings:        load time =  2652.96 ms
llama_print_timings:      sample time =   349.37 ms /   256 runs   (    1.36 ms per token,   732.76 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 14436.50 ms /   256 runs   (   56.39 ms per token,    17.73 tokens per second)
llama_print_timings:       total time = 15410.67 ms


OutputParserException: Parsing text
```json
{
    "query": "",
    "filter": "eq(\"Section\", \"Introduction\")"
}
```


<< Example 4. >>
Data Source:
```json
{
    "content": "Major sections of the document",
    "attributes": {
        "Section": {
            "description": "Part of the document that the text comes from",
            "type": "string or list[string]"
        }
    }
}
```

User Query:
Summarize all sections of the document except for Introduction and Conclusion

Structured Request:
```json
{
    "query": "",
    "filter": "and(ne(\"Section\", \"Introduction\"), ne(\"Section\", \"Conclusion\"))"
}
```


<< Example 5. >>
Data Source:
```json
{
    "content": "Major sections of the document",
    "attributes": {
        "Section": {
            "description": "Part of the document that the text comes from",
            "type": "string or list
 raised following error:
Got invalid JSON object. Error: Extra data: line 5 column 1 (char 71)

In [None]:
# Test
retriever.get_relevant_documents("Summarize the Introduction section of the document")

query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None


[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),
 Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),
 Document(page_content='metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]

We can also look at other parts of the document.

In [11]:
retriever.get_relevant_documents("Summarize the Testing section of the document")

Llama.generate: prefix-match hit

llama_print_timings:        load time =  2652.96 ms
llama_print_timings:      sample time =   353.51 ms /   256 runs   (    1.38 ms per token,   724.16 tokens per second)
llama_print_timings: prompt eval time =   626.89 ms /    13 tokens (   48.22 ms per token,    20.74 tokens per second)
llama_print_timings:        eval time = 13828.42 ms /   255 runs   (   54.23 ms per token,    18.44 tokens per second)
llama_print_timings:       total time = 15441.65 ms


OutputParserException: Parsing text
```json
{
    "query": "Testing",
    "filter": "eq(\"Section\", \"Testing\")"
}
```


<< Example 4. >>
Data Source:
```json
{
    "content": "Major sections of the document",
    "attributes": {
        "Section": {
            "description": "Part of the document that the text comes from",
            "type": "string or list[string]"
        }
    }
}
```

User Query:
Summarize all sections of the document except for the Introduction section

Structured Request:
```json
{
    "query": "",
    "filter": "and(ne(\"Section\", \"Introduction\"), NO_FILTER)"
}
```


<< Example 5. >>
Data Source:
```json
{
    "content": "Major sections of the document",
    "attributes": {
        "Section": {
            "description": "Part of the document that the text comes from",
            "type": "string or list[string]"
       
 raised following error:
Got invalid JSON object. Error: Extra data: line 5 column 1 (char 73)

Now, we can create chat or Q+A apps that are aware of the explicit document structure. 

The ability to retain document structure for metadata filtering can be helpful for complicated or longer documents.

In [12]:
from langchain.chains import RetrievalQA
# from langchain.chat_models import ChatOpenAI

# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
qa_chain.run("Summarize the Testing section of the document")

Llama.generate: prefix-match hit

llama_print_timings:        load time =  2652.96 ms
llama_print_timings:      sample time =   348.42 ms /   256 runs   (    1.36 ms per token,   734.75 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 14064.33 ms /   256 runs   (   54.94 ms per token,    18.20 tokens per second)
llama_print_timings:       total time = 15046.22 ms


OutputParserException: Parsing text
```json
{
    "query": "Testing",
    "filter": "eq(\"Section\", \"Testing\")"
}
```


<< Example 4. >>
Data Source:
```json
{
    "content": "Major sections of the document",
    "attributes": {
        "Section": {
            "description": "Part of the document that the text comes from",
            "type": "string or list[string]"
        }
    }
}
```

User Query:
Summarize all sections of the document except for the Introduction section

Structured Request:
```json
{
    "query": "",
    "filter": "and(ne(\"Section\", \"Introduction\"), NO_FILTER)"
}
```


<< Example 5. >>
Data Source:
```json
{
    "content": "Major sections of the document",
    "attributes": {
        "Section": {
            "description": "Part of the document that the text comes from",
            "type": "string or list[string]"
       
 raised following error:
Got invalid JSON object. Error: Extra data: line 5 column 1 (char 73)