In [66]:
%%html
<style>
    body {
        --vscode-font-family: "Segoe UI"
    }
</style>

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from pathlib import Path
import pickle
from pprint import pprint

In [3]:
from llama_index import SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import (SemanticSplitterNodeParser,
                                     SentenceSplitter, TokenTextSplitter)
from llama_index.node_parser.file import (MarkdownNodeParser,
                                          SimpleFileNodeParser)
from llama_index.readers.file.flat_reader import FlatReader
from llama_index.schema import MetadataMode

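`SimpleDirectoryReader` parses Markdown and PDF files, but the resulting documents carry only file-level metadata (path, name, size, dates) and nothing about the structure of the document.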

From `//llama_index/llama_index/readers/file/base.py`
```
DEFAULT_FILE_READER_CLS: Dict[str, Type[BaseReader]] = {
    ".hwp": HWPReader,
    ".pdf": PDFReader,
    ".docx": DocxReader,
    ".pptx": PptxReader,
    ".ppt": PptxReader,
    ".pptm": PptxReader,
    ".jpg": ImageReader,
    ".png": ImageReader,
    ".jpeg": ImageReader,
    ".mp3": VideoAudioReader,
    ".mp4": VideoAudioReader,
    ".csv": PandasCSVReader,
    ".epub": EpubReader,
    ".md": MarkdownReader,
    ".mbox": MboxReader,
    ".ipynb": IPYNBReader,
}

```

In [4]:
DATAROOT = Path.home() / "mldata" / "sherlock"
TMPROOT = Path.home() / "Desktop" / "temp"

In [5]:
docs = SimpleDirectoryReader(str(DATAROOT), recursive=True).load_data(show_progress=True)

Loading files: 100%|██████████| 15/15 [00:09<00:00,  1.66file/s]


In [6]:
len(docs)

789

In [7]:
docs[0].metadata

{'file_path': '/Users/avilay/mldata/sherlock/md/advs.md',
 'file_name': 'advs.md',
 'file_type': None,
 'file_size': 563886,
 'creation_date': '2024-04-29',
 'last_modified_date': '2024-04-26',
 'last_accessed_date': '2024-04-29'}

In [9]:
prev_name, prev_idx = None, None
for i, doc in enumerate(docs):
    file_name = doc.metadata["file_name"]
    if file_name != prev_name:
        if prev_name is not None:
            print(f"docs[{prev_idx}] - docs[{i-1}]: {prev_name}")
        prev_name = file_name
        prev_idx = i
# A group is only printed when the next one starts, so flush the last group here.
if prev_name is not None:
    print(f"docs[{prev_idx}] - docs[{len(docs)-1}]: {prev_name}")

docs[0] - docs[14]: advs.md
docs[15] - docs[28]: case.md
docs[29] - docs[40]: lstb.md
docs[41] - docs[52]: mems.md
docs[53] - docs[66]: retn.md
docs[67] - docs[228]: advs.pdf
docs[229] - docs[365]: case.pdf
docs[366] - docs[477]: lstb.pdf
docs[478] - docs[613]: mems.pdf
docs[614] - docs[783]: retn.pdf
docs[784] - docs[784]: advs.txt
docs[785] - docs[785]: case.txt
docs[786] - docs[786]: lstb.txt
docs[787] - docs[787]: mems.txt
docs[788] - docs[788]: retn.txt


As can be seen, "The Adventures of Sherlock Holmes" (advs) in Markdown format was chunked into 15 docs, one per header, whereas the PDF version was chunked into 229 - 67 = 162 docs, one per page of the PDF.
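The per-file ranges can also be computed with `itertools.groupby`. This is a minimal sketch on a toy list of file names; `doc_ranges` is a hypothetical helper, and in the notebook the input would be `[doc.metadata["file_name"] for doc in docs]`.

```python
from itertools import groupby


def doc_ranges(file_names):
    """Return (first_idx, last_idx, name) for each run of consecutive equal names."""
    ranges = []
    start = 0
    for name, run in groupby(file_names):
        count = sum(1 for _ in run)
        ranges.append((start, start + count - 1, name))
        start += count
    return ranges


# Toy stand-in for the file names of the loaded docs.
names = ["advs.md"] * 3 + ["advs.pdf"] * 2 + ["advs.txt"]
for first, last, name in doc_ranges(names):
    print(f"docs[{first}] - docs[{last}]: {name} ({last - first + 1} docs)")
```

Because `groupby` closes every run, including the last one, no separate flush step is needed after the loop.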

In [10]:
print(docs[0].text)



The Adventures of Sherlock Holmes
Arthur Conan Doyle 





In [24]:
type(docs[0])

llama_index.schema.Document

In [11]:
docs[0]

Document(id_='a31eac91-7027-434f-98ab-8dc644471881', embedding=None, metadata={'file_path': '/Users/avilay/mldata/sherlock/md/advs.md', 'file_name': 'advs.md', 'file_type': None, 'file_size': 563886, 'creation_date': '2024-04-29', 'last_modified_date': '2024-04-26', 'last_accessed_date': '2024-04-29'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nThe Adventures of Sherlock Holmes\nArthur Conan Doyle \n\n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

```python
Document(
    id_='a31eac91-7027-434f-98ab-8dc644471881', 
    embedding=None, 
    metadata={
        'file_path': '/Users/avilay/mldata/sherlock/md/advs.md', 
        'file_name': 'advs.md', 
        'file_type': None, 
        'file_size': 563886, 
        'creation_date': '2024-04-29', 
        'last_modified_date': '2024-04-26', 
        'last_accessed_date': '2024-04-29'
    }, 
    excluded_embed_metadata_keys=[
        'file_name', 
        'file_type', 
        'file_size', 
        'creation_date', 
        'last_modified_date', 
        'last_accessed_date'
    ], 
    excluded_llm_metadata_keys=[
        'file_name', 
        'file_type', 
        'file_size', 
        'creation_date', 
        'last_modified_date', 
        'last_accessed_date'
    ], 
    relationships={}, 
    text='\n\nThe Adventures of Sherlock Holmes\nArthur Conan Doyle \n\n\n', 
    start_char_idx=None, 
    end_char_idx=None, 
    text_template='{metadata_str}\n\n{content}', 
    metadata_template='{key}: {value}', 
    metadata_seperator='\n'
)
```

In [12]:
print(docs[1].text)



A Scandal in Bohemia
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more distu

In [13]:
docs[1].metadata

{'file_path': '/Users/avilay/mldata/sherlock/md/advs.md',
 'file_name': 'advs.md',
 'file_type': None,
 'file_size': 563886,
 'creation_date': '2024-04-29',
 'last_modified_date': '2024-04-26',
 'last_accessed_date': '2024-04-29'}

Even though this object has a lot of metadata, not all of it will be available to the LLM as can be seen from the `excluded_llm_metadata_keys` attribute -

In [14]:
docs[1].excluded_llm_metadata_keys

['file_name',
 'file_type',
 'file_size',
 'creation_date',
 'last_modified_date',
 'last_accessed_date']

To see what the LLM will see, we can do the following -

In [15]:
print(docs[1].get_content(metadata_mode=MetadataMode.LLM))

file_path: /Users/avilay/mldata/sherlock/md/advs.md



A Scandal in Bohemia
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one 

Similarly, not all the metadata will be available to the embedding model either, which can be seen from the `excluded_embed_metadata_keys`.

In [16]:
docs[1].excluded_embed_metadata_keys

['file_name',
 'file_type',
 'file_size',
 'creation_date',
 'last_modified_date',
 'last_accessed_date']

To see what the embed model will see we can do -

In [17]:
print(docs[1].get_content(metadata_mode=MetadataMode.EMBED))

file_path: /Users/avilay/mldata/sherlock/md/advs.md



A Scandal in Bohemia
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one 
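The outputs above are consistent with a simple rule: drop the excluded keys, format the remaining metadata with `metadata_template`, and slot the result into `text_template`. Below is a rough sketch of that rule in plain Python. The function `render_content` and the sample path are my own illustrations, not the library's code; the default templates are copied from the `Document` repr shown earlier.

```python
def render_content(
    text,
    metadata,
    excluded_keys,
    text_template="{metadata_str}\n\n{content}",
    metadata_template="{key}: {value}",
    metadata_separator="\n",
):
    """Approximate what a model sees once the excluded metadata keys are dropped."""
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    metadata_str = metadata_separator.join(
        metadata_template.format(key=k, value=v) for k, v in visible.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)


# Illustrative values only; the path is made up.
metadata = {"file_path": "/data/advs.md", "file_name": "advs.md", "file_size": 563886}
excluded = ["file_name", "file_size"]
rendered = render_content("A Scandal in Bohemia", metadata, excluded)
print(rendered)
```

Only `file_path` survives the filter here, which mirrors why `file_path` is the only metadata line in the LLM and embedding views above.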

In [18]:
print(docs[2].text)



Chapter II
At three o'clock precisely I was at Baker Street, but Holmes had not yet returned. The landlady informed me that he had left the house shortly after eight o'clock in the morning. I sat down beside the fire, however, with the intention of awaiting him, however long he might be. I was already deeply interested in his inquiry, for, though it was surrounded by none of the grim and strange features which were associated with the two crimes which I have already recorded, still, the nature of the case and the exalted station of his client gave it a character of its own. Indeed, apart from the nature of the investigation which my friend had on hand, there was something in his masterly grasp of a situation, and his keen, incisive reasoning, which made it a pleasure to me to study his system of work, and to follow the quick, subtle methods by which he disentangled the most inextricable mysteries. So accustomed was I to his invariable success that the very possibility of his failing 

The quality of the Markdown parsing is not great. In `docs[1]` the "Chapter I" section title was dropped. The header information is not in the metadata, so it is never presented to the LLM. Also, all Markdown elements are stripped, so there is no way for the LLM to know that the text "A Scandal in Bohemia" is actually an H2.

Let us see how PDF parsing fared -

In [19]:
docs[75].metadata

{'page_label': '5',
 'file_name': 'advs.pdf',
 'file_path': '/Users/avilay/mldata/sherlock/pdf/advs.pdf',
 'file_type': 'application/pdf',
 'file_size': 753250,
 'creation_date': '2024-04-26',
 'last_modified_date': '2024-04-26',
 'last_accessed_date': '2024-04-26'}

In [20]:
print(docs[75].text)

A Scandal in Bohemia
CHAPTER I.
ToSherlock Holmes she is always the
woman. I have seldom heard him men-
tion her under any other name. In his
eyes she eclipses and predominates the
whole of her sex. It was not that he felt any emo-
tion akin to love for Irene Adler. All emotions, and
that one particularly, were abhorrent to his cold,
precise but admirably balanced mind. He was, I
take it, the most perfect reasoning and observing
machine that the world has seen, but as a lover he
would have placed himself in a false position. He
never spoke of the softer passions, save with a gibe
and a sneer. They were admirable things for the ob-
server—excellent for drawing the veil from men’s
motives and actions. But for the trained reasoner
to admit such intrusions into his own delicate and
ﬁnely adjusted temperament was to introduce a dis-
tracting factor which might throw a doubt upon all
his mental results. Grit in a sensitive instrument, or
a crack in one of his own high-power lenses, would
not

The quality of PDF parsing is slightly better but not great. The sentences are cut off and the paragraph information is lost. There is also a page number at the end of each page. This is not computed by the parser; it is just a label that the PDF provides, so the initial pages had labels like $i$, $ii$, and so on. Just like in Markdown, all the formatting is stripped, so there is no way for the LLM to know that the texts "A Scandal in Bohemia" and "CHAPTER I" are headers at different levels.

In [21]:
print(docs[75].get_content(MetadataMode.LLM))

page_label: 5
file_path: /Users/avilay/mldata/sherlock/pdf/advs.pdf

A Scandal in Bohemia
CHAPTER I.
ToSherlock Holmes she is always the
woman. I have seldom heard him men-
tion her under any other name. In his
eyes she eclipses and predominates the
whole of her sex. It was not that he felt any emo-
tion akin to love for Irene Adler. All emotions, and
that one particularly, were abhorrent to his cold,
precise but admirably balanced mind. He was, I
take it, the most perfect reasoning and observing
machine that the world has seen, but as a lover he
would have placed himself in a false position. He
never spoke of the softer passions, save with a gibe
and a sneer. They were admirable things for the ob-
server—excellent for drawing the veil from men’s
motives and actions. But for the trained reasoner
to admit such intrusions into his own delicate and
ﬁnely adjusted temperament was to introduce a dis-
tracting factor which might throw a doubt upon all
his mental results. Grit in a sensitive 

To improve the parsing quality, I can avoid using `SimpleDirectoryReader` and all its magic. To start with, I can use the `FlatReader`. It does nothing fancy; it just loads the entire doc without any chunking. But the doc does not have any additional metadata beyond the file name and extension.

UPDATE: Even though the Markdown parse quality is not too good, I am better off with `SimpleDirectoryReader` and its magic because it supports more types of docs out of the box.

In [22]:
advs_md = FlatReader().load_data(DATAROOT/"md"/"advs.md")
len(advs_md)

1

In [23]:
type(advs_md[0])

llama_index.schema.Document

In [25]:
advs_md[0].metadata

{'filename': 'advs.md', 'extension': '.md'}

I can of course add any additional metadata I want.

In [26]:
advs_md[0].metadata["genre"] = "Mystery"
advs_md[0].metadata["is_fictional"] = True

In [27]:
advs_md[0].metadata

{'filename': 'advs.md',
 'extension': '.md',
 'genre': 'Mystery',
 'is_fictional': True}

`SimpleFileNodeParser` chooses a parser for the doc based on its extension. In this case it automatically chooses the `MarkdownNodeParser` and breaks the doc into one node per header (H1, H2, and H3). This parsing is far superior, with all the metadata and content preserved. Of course the formatting is still stripped, but the presence of metadata makes up for it. See [doc](http://127.0.0.1:8000/module_guides/loading/node_parsers/modules.html) for all other node parsers.

`SimpleDirectoryReader` reads all the files in the given directory, recursively if specified, and runs each file through a format-specific reader that splits a single file into multiple chunks. The resulting chunks are still `Document` objects. E.g., for PDFs, it will chunk the document per page, with each page getting its own `Document` object. With Markdown files it will chunk based on headers (I have verified H1 and H2, don't know about the rest).

`FlatReader`, on the other hand, simply reads the file as is without any format-specific parsing. The resulting `Document` object then needs to be passed through a more specific node parser. Using the `SimpleFileNodeParser` results in behavior similar to `SimpleDirectoryReader`. The difference is that `SimpleDirectoryReader` uses `MarkdownReader` to read Markdown files, but `SimpleFileNodeParser` uses `MarkdownNodeParser`. The node parsers are better because they save the header information in the metadata, something that the readers do not.
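To make the "header information in the metadata" point concrete, here is a rough sketch of header-based splitting in plain Python. It is not `MarkdownNodeParser`'s actual implementation; `split_by_headers` is a hypothetical function that only mimics the observable behavior (one section per header, with `Header N` keys tracking the header path), and it ignores edge cases such as `#` characters inside fenced code.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text, base_metadata=None):
    """Split Markdown into one section per header, recording the header path."""
    base_metadata = base_metadata or {}
    sections = []
    path = {}   # header level -> title of the currently open header at that level
    lines = []  # lines of the section being accumulated

    def flush():
        text = "\n".join(lines).strip()
        if text:
            meta = {f"Header {lvl}": title for lvl, title in sorted(path.items())}
            meta.update(base_metadata)
            sections.append({"text": text, "metadata": meta})

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section under the old header path
            lines.clear()
            level = len(match.group(1))
            # A new header closes any open header at the same or a deeper level.
            for lvl in [l for l in path if l >= level]:
                del path[lvl]
            path[level] = match.group(2).strip()
            lines.append(path[level])
        else:
            lines.append(line)
    flush()
    return sections


sample = "# Book\nAuthor\n## Story\n### Chapter I\nFirst text.\n### Chapter II\nSecond text.\n"
sections = split_by_headers(sample, {"filename": "book.md"})
for section in sections:
    print(section["metadata"], "->", repr(section["text"]))
```

As in the real output, a header with no body of its own still produces a section that contains just the header text.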

In [28]:
parser = SimpleFileNodeParser()
nodes = parser.get_nodes_from_documents(advs_md)
len(nodes)

16

In [None]:
nodes[0].metadata

{'Header 1': 'The Adventures of Sherlock Holmes',
 'filename': 'advs.md',
 'extension': '.md',
 'genre': 'Mystery',
 'is_fictional': True}

In [29]:
type(nodes[0])

llama_index.schema.TextNode

Typically a single document (or a single page of a multi-page document) is represented by a single `Document` object. But this is not necessarily how I might want to save the document to the index. A typical use case is to break the document into multiple chunks and then index the chunks individually, which makes retrieval more precise. A document chunk is called a **Node** in Llama-Index. The type of a document chunk, aka Node, is actually `TextNode`. Confusingly enough, there is also a `Node` class in this class family.

![doc-class](./doc-class.png)

But for the most part, when the documentation is referring to a "Node" it is referring to `TextNode`.

In [33]:
nodes[0]

TextNode(id_='bd2e8ccc-94be-4db2-ba41-50b4de47954a', embedding=None, metadata={'Header 1': 'The Adventures of Sherlock Holmes', 'filename': 'advs.md', 'extension': '.md', 'genre': 'Mystery', 'is_fictional': True}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4f42f4f3-a391-4f87-b79d-838e185bc846', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'advs.md', 'extension': '.md', 'genre': 'Mystery', 'is_fictional': True}, hash='37b84532e73711af9d8d89eddef48d97b7579f757fb85dea8a9d2379e7be0c82'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='30447259-c502-438e-a9ae-5504c3211fff', node_type=<ObjectType.TEXT: '1'>, metadata={'Header 1': 'The Adventures of Sherlock Holmes', 'Header 2': 'A Scandal in Bohemia', 'filename': 'advs.md', 'extension': '.md', 'genre': 'Mystery', 'is_fictional': True}, hash='b0d789cddbeb0bafc0ec0eb4901d998969c1b1f2f9493c46e5797d3817cb14c1')}, text='The Adventure

```python
TextNode(
    id_='bd2e8ccc-94be-4db2-ba41-50b4de47954a', 
    embedding=None, 
    metadata={
        'Header 1': 'The Adventures of Sherlock Holmes', 
        'filename': 'advs.md', 
        'extension': '.md', 
        'genre': 'Mystery', 
        'is_fictional': True
    }, 
    excluded_embed_metadata_keys=[], 
    excluded_llm_metadata_keys=[], 
    relationships={
        <NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(
            node_id='4f42f4f3-a391-4f87-b79d-838e185bc846', 
            node_type=<ObjectType.DOCUMENT: '4'>, 
            metadata={
                'filename': 'advs.md', 
                'extension': '.md', 
                'genre': 'Mystery', 
                'is_fictional': True
            }, 
            hash='37b84532e73711af9d8d89eddef48d97b7579f757fb85dea8a9d2379e7be0c82'
        ), 
        <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(
            node_id='30447259-c502-438e-a9ae-5504c3211fff', 
            node_type=<ObjectType.TEXT: '1'>, 
            metadata={
                'Header 1': 'The Adventures of Sherlock Holmes', 
                'Header 2': 'A Scandal in Bohemia', 
                'filename': 'advs.md', 
                'extension': '.md', 
                'genre': 'Mystery', 
                'is_fictional': True
            }, 
            hash='b0d789cddbeb0bafc0ec0eb4901d998969c1b1f2f9493c46e5797d3817cb14c1'
        )
    }, 
    text='The Adventures of Sherlock Holmes\nArthur Conan Doyle', 
    start_char_idx=2, 
    end_char_idx=54, 
    text_template='{metadata_str}\n\n{content}', 
    metadata_template='{key}: {value}', 
    metadata_seperator='\n'
)
```

In [30]:
print(nodes[0].get_content(MetadataMode.LLM))

Header 1: The Adventures of Sherlock Holmes
filename: advs.md
extension: .md
genre: Mystery
is_fictional: True

The Adventures of Sherlock Holmes
Arthur Conan Doyle


In [31]:
print(nodes[1].get_content(MetadataMode.LLM))

Header 1: The Adventures of Sherlock Holmes
Header 2: A Scandal in Bohemia
filename: advs.md
extension: .md
genre: Mystery
is_fictional: True

A Scandal in Bohemia


In [32]:
print(nodes[2].get_content(MetadataMode.LLM))

Header 1: The Adventures of Sherlock Holmes
Header 2: A Scandal in Bohemia
Header 3: CHAPTER I 
filename: advs.md
extension: .md
genre: Mystery
is_fictional: True

CHAPTER I 
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which 

In [34]:
nodes[5].metadata

{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'The Red-Headed League',
 'filename': 'advs.md',
 'extension': '.md',
 'genre': 'Mystery',
 'is_fictional': True}

In [35]:
len(nodes[5].text.split())

9101

However, some sections can be really long as seen in "The Red-Headed League" story which has around 9000 words. It is necessary to chunk this doc down further. The simplest way to do this is to use the `TokenTextSplitter`.

In [36]:
splitter = TokenTextSplitter(
    chunk_size=1024,
    chunk_overlap=100
)
chunks = splitter.get_nodes_from_documents(nodes)
len(chunks)

157

16 nodes were chunked into 157 chunks. These are still the same old `TextNode` objects.

In [37]:
type(chunks[0])

llama_index.schema.TextNode

In [38]:
chunks[1].metadata

{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'A Scandal in Bohemia',
 'filename': 'advs.md',
 'extension': '.md',
 'genre': 'Mystery',
 'is_fictional': True}

In [39]:
# This is how the chapters were chunked
prev = None
for i, chunk in enumerate(chunks):
    header = chunk.metadata.get("Header 1", "") + " > " + chunk.metadata.get("Header 2", "") + " > " + chunk.metadata.get("Header 3", "")
    if header != prev:
        prev = header
        print(f"chunk[{i}] = {header}")

chunk[0] = The Adventures of Sherlock Holmes >  > 
chunk[1] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > 
chunk[2] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > CHAPTER I 
chunk[8] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter II 
chunk[14] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter III 
chunk[16] = The Adventures of Sherlock Holmes > The Red-Headed League > 
chunk[29] = The Adventures of Sherlock Holmes > A Case of Identity > 
chunk[39] = The Adventures of Sherlock Holmes > The Boscombe Valley Mystery > 
chunk[53] = The Adventures of Sherlock Holmes > The Five Orange Pips > 
chunk[64] = The Adventures of Sherlock Holmes > The Man with the Twisted Lip > 
chunk[78] = The Adventures of Sherlock Holmes > The Adventure of the Blue Carbuncle > 
chunk[90] = The Adventures of Sherlock Holmes > The Adventure of the Speckled Band > 
chunk[104] = The Adventures of Sherlock Holmes > The Adventure of the Engineer's Thu

In [40]:
# And this is how the original nodes - 1 per Markdown header - were parsed out of the single document
prev = None
for i, node in enumerate(nodes):
    header = node.metadata.get("Header 1", "") + " > " + node.metadata.get("Header 2", "") + " > " + node.metadata.get("Header 3", "")
    if header != prev:
        prev = header
        print(f"node[{i}] = {header}")

node[0] = The Adventures of Sherlock Holmes >  > 
node[1] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > 
node[2] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > CHAPTER I 
node[3] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter II 
node[4] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter III 
node[5] = The Adventures of Sherlock Holmes > The Red-Headed League > 
node[6] = The Adventures of Sherlock Holmes > A Case of Identity > 
node[7] = The Adventures of Sherlock Holmes > The Boscombe Valley Mystery > 
node[8] = The Adventures of Sherlock Holmes > The Five Orange Pips > 
node[9] = The Adventures of Sherlock Holmes > The Man with the Twisted Lip > 
node[10] = The Adventures of Sherlock Holmes > The Adventure of the Blue Carbuncle > 
node[11] = The Adventures of Sherlock Holmes > The Adventure of the Speckled Band > 
node[12] = The Adventures of Sherlock Holmes > The Adventure of the Engineer's Thumb > 
node[13] = The

It can be seen that "CHAPTER I" got chunked into 8 - 2 = 6 chunks. The number of words in all the chunks will be slightly more than the number of words in the original node because of the token overlap argument that we gave the splitter.

In [42]:
len(nodes[2].text.split())

3433

In [43]:
for i in range(2, 8):
    print(len(chunks[i].text.split()))

798
768
722
752
662
82


In [44]:
798 + 768 + 722 + 752 + 662 + 82

3784

The totals are roughly the same but not equal: 3784 - 3433 = 351 extra words. These come from the overlap of 100 tokens (about 70 words here) between each of the 5 pairs of consecutive chunks.
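The effect of the overlap on the word count can be reproduced with a tiny sliding-window chunker. This is only a sketch: `chunk_with_overlap` is a hypothetical function that treats each whitespace-separated word as a token, whereas `TokenTextSplitter` counts tokens with a real tokenizer and has its own splitting logic.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


tokens = [f"w{i}" for i in range(25)]
chunks = chunk_with_overlap(tokens, chunk_size=10, chunk_overlap=2)
sizes = [len(c) for c in chunks]
# Every boundary between two chunks repeats chunk_overlap tokens.
print(sizes, sum(sizes), len(tokens) + 2 * (len(chunks) - 1))
```

The summed chunk length exceeds the original length by the overlap times the number of chunk boundaries, which is the same pattern as the 351 extra words above.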

In [45]:
for i in range(2, 8):
    print(chunks[i].text[:500] + "\n--- <snip> ----\n" + chunks[i].text[-500:])
    print("-" * 50)

CHAPTER I 
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false pos
--- <snip> ----
en!" I answered. 

"Indeed, I should have thought a little more. Just a trifle more, I fancy, Watson. And in practice again, I observe. You did not tell me that you intended to go into harness." 

"Then, how do you know?" 

"I see it, I deduce it. How do I know that you have been getting yourself very wet lately, and that you have a most clumsy and careless servant girl?" 

"My dear Holmes," said I, "this is too much. You would certainly have been burned, had you lived a few cen

This is not a bad way of chunking, however notice how the text gets cut off mid-sentence. I can use the `SentenceSplitter` to avoid this.

In [46]:
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=100
)
chunks = splitter.get_nodes_from_documents(nodes)
len(chunks)

160

Now instead of 157 chunks I see 160.

In [47]:
chunks[1].metadata

{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'A Scandal in Bohemia',
 'filename': 'advs.md',
 'extension': '.md',
 'genre': 'Mystery',
 'is_fictional': True}

In [48]:
prev = None
for i, chunk in enumerate(chunks):
    header = chunk.metadata.get("Header 1", "") + " > " + chunk.metadata.get("Header 2", "") + " > " + chunk.metadata.get("Header 3", "")
    if header != prev:
        prev = header
        print(f"chunk[{i}] = {header}")

chunk[0] = The Adventures of Sherlock Holmes >  > 
chunk[1] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > 
chunk[2] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > CHAPTER I 
chunk[8] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter II 
chunk[14] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter III 
chunk[16] = The Adventures of Sherlock Holmes > The Red-Headed League > 
chunk[30] = The Adventures of Sherlock Holmes > A Case of Identity > 
chunk[41] = The Adventures of Sherlock Holmes > The Boscombe Valley Mystery > 
chunk[55] = The Adventures of Sherlock Holmes > The Five Orange Pips > 
chunk[66] = The Adventures of Sherlock Holmes > The Man with the Twisted Lip > 
chunk[80] = The Adventures of Sherlock Holmes > The Adventure of the Blue Carbuncle > 
chunk[92] = The Adventures of Sherlock Holmes > The Adventure of the Speckled Band > 
chunk[107] = The Adventures of Sherlock Holmes > The Adventure of the Engineer's Thu

The number of chunks looks more or less the same, with some sections getting a couple more chunks.

In [49]:
for i in range(2, 8):
    print(chunks[i].text[:500] + "\n--- <snip> ----\n" + chunks[i].text[-500:])
    print("-" * 50)

CHAPTER I 
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false pos
--- <snip> ----
n. 

"Wedlock suits you," he remarked. "I think, Watson, that you have put on seven and a half pounds since I saw you." 

"Seven!" I answered. 

"Indeed, I should have thought a little more. Just a trifle more, I fancy, Watson. And in practice again, I observe. You did not tell me that you intended to go into harness." 

"Then, how do you know?" 

"I see it, I deduce it. How do I know that you have been getting yourself very wet lately, and that you have a most clumsy and carele

The text is now cut off at a sentence boundary. This is useful when my doc retriever does not return the next chunk.
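The idea of never breaking inside a sentence can be sketched as greedy packing of whole sentences into a word budget. This is an illustration only, not how `SentenceSplitter` is implemented: `split_sentences` and `pack_sentences` are hypothetical helpers, the sentence detection is a naive regex, the budget is in words rather than tokens, and there is no overlap.

```python
import re


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)  # a sentence longer than max_words gets its own chunk
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "I see it. I deduce it. How do I know? You have been getting very wet lately."
chunks = pack_sentences(split_sentences(text), max_words=8)
for chunk in chunks:
    print(repr(chunk))
```

Because chunk boundaries move to the nearest sentence end, chunks come out slightly shorter than the budget, which fits with seeing a few more chunks than with the token-based splitter.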

So far the docs were chunked into fixed sizes, but the `SemanticSplitterNodeParser` adaptively picks where to split using embeddings.

In [50]:
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
)
chunks = splitter.get_nodes_from_documents(nodes)
len(chunks)

385

In [51]:
chunks[1].metadata

{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'A Scandal in Bohemia',
 'filename': 'advs.md',
 'extension': '.md',
 'genre': 'Mystery',
 'is_fictional': True}

In [52]:
prev = None
for i, chunk in enumerate(chunks):
    header = chunk.metadata.get("Header 1", "") + " > " + chunk.metadata.get("Header 2", "") + " > " + chunk.metadata.get("Header 3", "")
    if header != prev:
        prev = header
        print(f"chunk[{i}] = {header}")

chunk[0] = The Adventures of Sherlock Holmes >  > 
chunk[1] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > 
chunk[2] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > CHAPTER I 
chunk[18] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter II 
chunk[34] = The Adventures of Sherlock Holmes > A Scandal in Bohemia > Chapter III 
chunk[41] = The Adventures of Sherlock Holmes > The Red-Headed League > 
chunk[74] = The Adventures of Sherlock Holmes > A Case of Identity > 
chunk[98] = The Adventures of Sherlock Holmes > The Boscombe Valley Mystery > 
chunk[133] = The Adventures of Sherlock Holmes > The Five Orange Pips > 
chunk[158] = The Adventures of Sherlock Holmes > The Man with the Twisted Lip > 
chunk[192] = The Adventures of Sherlock Holmes > The Adventure of the Blue Carbuncle > 
chunk[223] = The Adventures of Sherlock Holmes > The Adventure of the Speckled Band > 
chunk[257] = The Adventures of Sherlock Holmes > The Adventure of the Engineer'

In [53]:
for i in range(2, 18):
    print(len(chunks[i].text.split()))

703
425
210
471
494
233
7
224
3
21
33
67
27
129
42
344


In [54]:
for i in range(2, 18):
    print(chunks[i].text[:500] + "\n--- <snip> ----\n" + chunks[i].text[-500:])
    print("-" * 50)

CHAPTER I 
To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false pos
--- <snip> ----
 which had formerly been in part my own. 

His manner was not effusive. It seldom was; but he was glad, I think, to see me. With hardly a word spoken, but with a kindly eye, he waved me to an armchair, threw across his case of cigars, and indicated a spirit case and a gasogene in the corner. Then he stood before the fire and looked me over in his singular introspective fashion. 

"Wedlock suits you," he remarked. "I think, Watson, that you have put on seven and a half pounds sin

There is no overlap between chunks, but the text is split at clean boundaries.

reader -> `[Document]` -> parser -> `[TextNode]` -> splitter -> `[TextNode]`
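
The staged flow can be sketched in plain Python. This is only an illustration of how data and metadata move through the three stages, not LlamaIndex's implementation: `Node`, `read`, `parse_markdown`, and `split` are hypothetical stand-ins, and the word-count splitter is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a Document / TextNode: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def read(raw: str, filename: str) -> list[Node]:
    """Reader stage: wrap the raw file content in a single document."""
    return [Node(raw, {"filename": filename})]


def parse_markdown(docs: list[Node]) -> list[Node]:
    """Parser stage: start a new node at every '# ' heading."""
    nodes = []
    for doc in docs:
        current = None
        for line in doc.text.splitlines():
            if line.startswith("# "):
                # Each node inherits the document metadata and records its header.
                current = Node("", {**doc.metadata, "Header 1": line[2:].strip()})
                nodes.append(current)
            elif current is not None and line.strip():
                current.text += line.strip() + " "
    return nodes


def split(nodes: list[Node], max_words: int) -> list[Node]:
    """Splitter stage: cut each node into chunks of at most max_words words."""
    chunks = []
    for node in nodes:
        words = node.text.split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            # Every chunk keeps a copy of its parent node's metadata.
            chunks.append(Node(piece, dict(node.metadata)))
    return chunks


raw = "# One\na b c d e\n# Two\nf g\n"
chunks = split(parse_markdown(read(raw, "demo.md")), max_words=3)
for chunk in chunks:
    print(chunk.metadata["Header 1"], "|", chunk.text)
```

The point of the sketch is that each stage takes a list and returns a list, and that metadata attached early (file name, header) is carried down to every chunk produced later.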

There is also a `HierarchicalNodeParser` but I am not so sure of its usage.

Refs
  * [Node Parsers](http://localhost:8000/module_guides/loading/node_parsers/modules.html)
  * [Customizing Documents](http://localhost:8000/module_guides/loading/documents_and_nodes/usage_documents.html)
  * [Customizing Nodes](http://localhost:8000/module_guides/loading/documents_and_nodes/usage_nodes.html)
  


#### Metadata Extraction
Instead of statically adding metadata to the chunks, I can have an LLM extract various types of metadata directly from the content. I was able to get the `TitleExtractor` working correctly, but when I tried to use the `QuestionsAnsweredExtractor`, the program hung.

Run `metadata.py` as a stand-alone script. It won't run inside a notebook because the metadata extractor runs an asyncio loop.
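
A likely mechanism, stated as an assumption rather than something verified against the extractor's source: Jupyter executes cells while an event loop is already running, and `asyncio.run` refuses to start a second loop from inside a running one. The stdlib-only sketch below reproduces that behaviour; `sync_entry_point` and `notebook_cell` are hypothetical names standing in for a synchronous library call and a notebook cell.

```python
import asyncio


def sync_entry_point():
    """Stand-in for a synchronous call that starts its own event loop."""
    coro = asyncio.sleep(0)
    try:
        return asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {err}"


async def notebook_cell():
    """Stand-in for a notebook cell: it executes while a loop is running."""
    return sync_entry_point()


outside = sync_entry_point()            # no loop running yet, so this succeeds
inside = asyncio.run(notebook_cell())   # loop already running, so this fails
print("stand-alone script:", outside)
print("inside running loop:", inside)
```

The `nest_asyncio` package is a commonly used workaround for nested loops in notebooks, though whether it resolves the hang described above has not been checked here.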

In [164]:
from glob import glob

filenames = glob(str(TMPROOT/"chunks"/"*.pkl"))
chunks = [None] * len(filenames)

for filename in filenames:
    # The chunk index is the second "_"-separated field of the file stem.
    idx = int(Path(filename).stem.split("_")[1])
    with open(filename, "rb") as fin:
        chunk = pickle.load(fin)
        if chunk is not None:
            chunks[idx] = chunk
        else:
            print(f"chunk[{idx}] is None.")
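
The loader above relies on each file stem carrying the chunk index after an underscore. A minimal round-trip sketch of that convention follows; the `chunk_` prefix, the helper names `save_chunks` and `load_chunks`, and the dictionary payloads are all assumptions made for illustration, since the script that wrote the files is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def save_chunks(chunks, folder: Path) -> None:
    """Write one pickle per chunk, named chunk_<idx>.pkl (assumed naming)."""
    folder.mkdir(parents=True, exist_ok=True)
    for idx, chunk in enumerate(chunks):
        with open(folder / f"chunk_{idx}.pkl", "wb") as fout:
            pickle.dump(chunk, fout)


def load_chunks(folder: Path) -> list:
    """Read the pickles back into a list ordered by the index in the file name."""
    paths = list(folder.glob("*.pkl"))
    loaded = [None] * len(paths)
    for path in paths:
        idx = int(path.stem.split("_")[1])
        with open(path, "rb") as fin:
            loaded[idx] = pickle.load(fin)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    # Twelve items, so that chunk_10 and chunk_2 would sort wrongly by name.
    original = [{"text": f"chunk {n}"} for n in range(12)]
    save_chunks(original, Path(tmp) / "chunks")
    restored = load_chunks(Path(tmp) / "chunks")

# Any slot still set to None means a file was missing or held None.
missing = [idx for idx, chunk in enumerate(restored) if chunk is None]
print(len(restored), "chunks restored, missing indices:", missing)
```

Placing each chunk by its parsed index, rather than by the order the files are listed, keeps the list aligned with the original chunk order regardless of how the file system sorts the names.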

In [168]:
len(chunks)

165

Two chunks with the same headers have the same inferred title even though their contents are possibly different. It's possible that the transformations are re-ordered, with the metadata extractors being applied before the splitter.
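
One way to check this observation is to group the chunks by header path and count the distinct titles in each group. A minimal sketch: `titles_by_header` is a hypothetical helper, and `sample` holds the header and title fields of the first four chunks' metadata as printed in this notebook.

```python
from collections import defaultdict

HEADER_KEYS = ("Header 1", "Header 2", "Header 3")


def titles_by_header(metadata_list):
    """Map each header path to the list of document titles found under it."""
    groups = defaultdict(list)
    for metadata in metadata_list:
        path = tuple(metadata.get(key, "") for key in HEADER_KEYS)
        groups[path].append(metadata.get("document_title"))
    return dict(groups)


sample = [
    {"Header 1": "The Adventures of Sherlock Holmes",
     "document_title": '"The Intriguing Cases and Brilliant Mind of Sherlock '
                       'Holmes: A Comprehensive Study"'},
    {"Header 1": "The Adventures of Sherlock Holmes",
     "Header 2": "A Scandal in Bohemia",
     "document_title": "The Scandalous Bohemian Affair: A Case Study"},
    {"Header 1": "The Adventures of Sherlock Holmes",
     "Header 2": "A Scandal in Bohemia",
     "Header 3": "CHAPTER I ",
     "document_title": '"The Adventures of Sherlock Holmes: A Collection of '
                       'Mysteries and Investigations"'},
    {"Header 1": "The Adventures of Sherlock Holmes",
     "Header 2": "A Scandal in Bohemia",
     "Header 3": "CHAPTER I ",
     "document_title": '"The Adventures of Sherlock Holmes: A Collection of '
                       'Mysteries and Investigations"'},
]

groups = titles_by_header(sample)
for path, titles in groups.items():
    label = " > ".join(part for part in path if part)
    print(f"{len(titles)} chunk(s), {len(set(titles))} distinct title(s): {label}")
```

On the full data the same helper can be called as `titles_by_header(chunk.metadata for chunk in chunks)`. A header path with several chunks but a single distinct title matches the pattern described above.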

In [171]:
for i, chunk in enumerate(chunks):
    pprint(chunk.metadata)

{'Header 1': 'The Adventures of Sherlock Holmes',
 'document_title': '"The Intriguing Cases and Brilliant Mind of Sherlock '
                   'Holmes: A Comprehensive Study"',
 'extension': '.md',
 'filename': 'advs.md'}
{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'A Scandal in Bohemia',
 'document_title': 'The Scandalous Bohemian Affair: A Case Study',
 'extension': '.md',
 'filename': 'advs.md'}
{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'A Scandal in Bohemia',
 'Header 3': 'CHAPTER I ',
 'document_title': '"The Adventures of Sherlock Holmes: A Collection of '
                   'Mysteries and Investigations"',
 'extension': '.md',
 'filename': 'advs.md'}
{'Header 1': 'The Adventures of Sherlock Holmes',
 'Header 2': 'A Scandal in Bohemia',
 'Header 3': 'CHAPTER I ',
 'document_title': '"The Adventures of Sherlock Holmes: A Collection of '
                   'Mysteries and Investigations"',
 'extension': '.md',
 'filename': 'advs.md'}
{'Header

#### Transformations
Parsers, splitters, and metadata extractors are all called transformers; they derive from the `TransformComponent` abstract base class. I can implement my own transformer by subclassing this base class. See [doc](http://127.0.0.1:8000/module_guides/loading/ingestion_pipeline/transformations.html#custom-transformations) for a simple example.

So far I have been using the different transformers directly, so it is not immediately clear why they all share the same base class. However, there are two use cases where this is needed:
  * Combining with `ServiceContext` where we set the `transformations` key with an array of our desired transformers and they are then applied when creating the vector DB. IMHO - this is an anti-pattern because it hides what exactly is going on somewhere in some module. See [doc](http://127.0.0.1:8000/module_guides/loading/ingestion_pipeline/transformations.html#combining-with-servicecontext) for example.
  * [Ingestion pipelines](http://127.0.0.1:8000/module_guides/loading/ingestion_pipeline/root.html#): It'll make sense for me to come back to this after I have covered the Indexing and Storing modules. 
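
The benefit of a shared interface can be sketched without the library: if every transformer is a callable that takes a list of nodes and returns a list of nodes, a caller can apply any sequence of them without knowing what each one does. Everything below is a hypothetical stand-in (`Transform` is not the real `TransformComponent`, nodes are plain dictionaries, and `run_transformations` is an illustrative helper), so treat it as a picture of the pattern rather than of LlamaIndex's code.

```python
import re
from abc import ABC, abstractmethod


class Transform(ABC):
    """Hypothetical stand-in for the shared base class: nodes in, nodes out."""

    @abstractmethod
    def __call__(self, nodes: list[dict]) -> list[dict]:
        ...


class TextCleaner(Transform):
    """Custom transformer: keep only letters, digits, and spaces."""

    def __call__(self, nodes):
        return [{**node, "text": re.sub(r"[^0-9A-Za-z ]", "", node["text"])}
                for node in nodes]


class WordCounter(Transform):
    """Custom transformer: attach a word count as metadata."""

    def __call__(self, nodes):
        return [{**node, "n_words": len(node["text"].split())} for node in nodes]


def run_transformations(nodes, transformations):
    """Apply each transformer in order, feeding its output to the next one."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


nodes = [{"text": "Hello, world! How are you?"}]
result = run_transformations(nodes, [TextCleaner(), WordCounter()])
print(result)
```

Because the list is applied strictly in order, swapping two entries changes what each transformer sees, which is worth keeping in mind when a metadata extractor and a splitter are combined.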