
# LlamaIndex Bottoms-Up Development - Documents and Nodes
In order to answer questions about the LlamaIndex docs, we first need to load them!

A majority of our documentation is in markdown format. For the sake of scope, we will ONLY worry about markdown files for now.

When parsing these files, there are a few things we might want to keep track of:

- Current header (and header hierarchy!)
- Code blocks
- Text
- Source file names

While LlamaIndex does have a built-in markdown loader, we can write our own to fit our requirements exactly! Loaders are not magic -- they just read files and create documents. So building our own is easy!
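
To make "read files and create documents" concrete, here is a minimal sketch of the parsing step using only the standard library. It is not the reader shipped with this repo: `parse_markdown`, `HEADER_RE`, and `FENCE` are hypothetical names, and the dictionary keys are simply chosen to mirror the metadata printed later in this notebook.

```python
import re
from typing import Dict, List

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # markdown code fence marker


def parse_markdown(text: str, file_name: str) -> List[Dict[str, str]]:
    """Split markdown into chunks tagged with file name, content type, and header path."""
    chunks: List[Dict[str, str]] = []
    header_stack = []  # (level, title) pairs describing the current header hierarchy
    buffer: List[str] = []
    in_code = False

    def flush(content_type: str) -> None:
        # Emit the buffered lines as one chunk, skipping empty chunks
        body = "\n".join(buffer).strip()
        buffer.clear()
        if body:
            chunks.append({
                "File Name": file_name,
                "Content Type": content_type,
                "Header Path": "/".join(title for _, title in header_stack),
                "text": body,
            })

    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            # A fence closes either the running text chunk or the running code chunk
            flush("code" if in_code else "text")
            in_code = not in_code
            continue
        match = None if in_code else HEADER_RE.match(line)
        if match:
            flush("text")
            level = len(match.group(1))
            # Drop headers at the same or deeper level before pushing the new one
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)

    flush("code" if in_code else "text")
    return chunks
```

Each returned dictionary could then be wrapped in a document object, with `text` as the content and the remaining keys as metadata.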

We have provided an implementation of a custom markdown loader in the source code. Let's test it out to see how it works!

In [1]:
import os
import sys
sys.path.append(os.path.join(os.getcwd(), '..'))

In [9]:
from llama_docs_bot.markdown_docs_reader import MarkdownDocsReader
from llama_index import SimpleDirectoryReader

def load_markdown_docs(filepath):
    """Load markdown docs from a directory, excluding all other file types."""
    loader = SimpleDirectoryReader(
        input_dir=filepath,
        exclude=["*.rst", "*.ipynb", "*.py", "*.bat", "*.txt", "*.png", "*.jpg", "*.jpeg", "*.csv", "*.html", "*.js", "*.css", "*.pdf", "*.json"],
        # Parse every .md file with the custom reader provided in this repo
        file_extractor={".md": MarkdownDocsReader()},
        recursive=True
    )

    return loader.load_data()
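
As a rough illustration of what the recursive, markdown-only scan amounts to, the sketch below collects `.md` files with the standard library. It uses an allow-list instead of the exclude list above and is not how `SimpleDirectoryReader` is implemented; `find_markdown_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_markdown_files(input_dir: str) -> List[Path]:
    """Recursively collect every .md file under input_dir, sorted for a stable order."""
    return sorted(path for path in Path(input_dir).rglob("*.md") if path.is_file())
```

An allow-list like this stays correct when new file types appear in the docs folder, whereas an exclude list has to be extended each time.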

In [None]:
# Load our documents from each folder.
# We keep them separate for now, in order to create separate indexes later.
getting_started_docs = load_markdown_docs("../docs/getting_started")
community_docs = load_markdown_docs("../docs/community")
data_docs = load_markdown_docs("../docs/core_modules/data_modules")
agent_docs = load_markdown_docs("../docs/core_modules/agent_modules")
model_docs = load_markdown_docs("../docs/core_modules/model_modules")
query_docs = load_markdown_docs("../docs/core_modules/query_modules")
supporting_docs = load_markdown_docs("../docs/core_modules/supporting_modules")
tutorials_docs = load_markdown_docs("../docs/end_to_end_tutorials")
contributing_docs = load_markdown_docs("../docs/development")

In [None]:
# Make our printing look nice
from llama_index.schema import MetadataMode

In [None]:
print(agent_docs[5].get_content(metadata_mode=MetadataMode.ALL))

File Name: ../docs/core_modules/agent_modules/agents/root.md
Content Type: text
Header Path: Data Agents/Concept/Tool Abstractions

You can learn more about our Tool abstractions in our Tools section.


In [None]:
print(agent_docs[0].metadata)

{'File Name': '../docs/core_modules/agent_modules/agents/modules.md', 'Content Type': 'text', 'Header Path': 'Module Guides'}


Not bad! Each document carries metadata alongside nicely formatted content.

But we can improve the formatting even further. By providing better templates, we give the LLM and embedding models a clearer idea of what they are reading.

In [None]:
text_template = "Content Metadata:\n{metadata_str}\n\nContent:\n{content}"

metadata_template = "{key}: {value},"
# Note: the attribute really is spelled `metadata_seperator` in this LlamaIndex version
metadata_seperator = " "

for doc in agent_docs:
    doc.text_template = text_template
    doc.metadata_template = metadata_template
    doc.metadata_seperator = metadata_seperator

In [None]:
print(agent_docs[0].get_content(metadata_mode=MetadataMode.ALL))

Content Metadata:
File Name: ../docs/core_modules/agent_modules/agents/modules.md, Content Type: text, Header Path: Module Guides,

Content:
These guide provide an overview of how to use our agent classes.

For more detailed guides on how to use specific tools, check out our tools module guides.
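
To see how the three settings combine, the formatting can be approximated in plain Python. This is a sketch of the assumed behavior, not LlamaIndex's actual code: `render` is a hypothetical helper, and the file name and content strings are placeholders.

```python
from typing import Dict


def render(content: str, metadata: Dict[str, str], text_template: str,
           metadata_template: str, separator: str) -> str:
    """Format each metadata pair, join the pairs, then fill the outer text template."""
    metadata_str = separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
    )
    return text_template.format(metadata_str=metadata_str, content=content)


text_template = "Content Metadata:\n{metadata_str}\n\nContent:\n{content}"
metadata_template = "{key}: {value},"

rendered = render(
    "Example content.",
    {"File Name": "modules.md", "Content Type": "text", "Header Path": "Module Guides"},
    text_template,
    metadata_template,
    " ",
)
print(rendered)
```

The per-pair template controls how one key and value look, the separator controls what sits between pairs, and the text template decides where the joined metadata appears relative to the content.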


### Advanced Customization
Going even further with metadata, we can also customize which metadata fields the embedding model and the LLM each get to see.

In [None]:
# Hide the File Name from the LLM
agent_docs[0].excluded_llm_metadata_keys = ["File Name"]
print(agent_docs[0].get_content(metadata_mode=MetadataMode.LLM))

Content Metadata:
Content Type: text, Header Path: Module Guides,

Content:
These guide provide an overview of how to use our agent classes.

For more detailed guides on how to use specific tools, check out our tools module guides.


In [None]:
# Hide the File Name from the embedding model
agent_docs[0].excluded_embed_metadata_keys = ["File Name"]
print(agent_docs[0].get_content(metadata_mode=MetadataMode.EMBED))

Content Metadata:
Content Type: text, Header Path: Module Guides,

Content:
These guide provide an overview of how to use our agent classes.

For more detailed guides on how to use specific tools, check out our tools module guides.


# Conclusion
In this notebook, we covered how to use a custom data loader, as well as how to customize the text representations of your data when including metadata for both LLMs and embedding models.