<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/node_postprocessor/FileNodeProcessors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# File Based Node Parsers

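`SimpleFileNodeParser` and `FlatReader` are designed to open a variety of file types and automatically select the best `NodeParser` for each one. `FlatReader` loads a file as raw text and attaches the file information to the metadata. `SimpleFileNodeParser` then maps the file type to a node parser in `node_parser/file`, choosing the one best suited to the job.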

`SimpleFileNodeParser` does not perform token-based chunking of the text and is intended to be used in combination with a token node parser.

Let's look at an example that uses `FlatReader` and `SimpleFileNodeParser` to load content. For the README file I will use the LlamaIndex README, and for the HTML file the Stack Overflow landing page, but any README and HTML file will work.


If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.


In [None]:
%pip install llama-index-readers-file

In [None]:
!pip install llama-index

In [None]:
from llama_index.core.node_parser import SimpleFileNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path



In [None]:
reader = FlatReader()
html_file = reader.load_data(Path("./stack-overflow.html"))
md_file = reader.load_data(Path("./README.md"))
print(html_file[0].metadata)
print(html_file[0])
print("----")
print(md_file[0].metadata)
print(md_file[0])

{'filename': 'stack-overflow.html', 'extension': '.html'}
Doc ID: a6750408-b0fa-466d-be28-ff2fcbcbaa97
Text: <!DOCTYPE html>       <html class="html__responsive
html__unpinned-leftnav" lang="en">      <head>          <title>Stack
Overflow - Where Developers Learn, Share, &amp; Build Careers</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackove
rflow/Img/favicon.ico?v=ec617d715196">         <link rel="apple-touch-
icon" hr...
----
{'filename': 'README.md', 'extension': '.md'}
Doc ID: 1d872f44-2bb3-4693-a1b8-a59392c23be2
Text: # 🗂️ LlamaIndex 🦙 [![PyPI -
Downloads](https://img.shields.io/pypi/dm/llama-
index)](https://pypi.org/project/llama-index/) [![GitHub contributors]
(https://img.shields.io/github/contributors/jerryjliu/llama_index)](ht
tps://github.com/jerryjliu/llama_index/graphs/contributors) [![Discord
](https://img.shields.io/discord/1059199217496772688)](https:...


## Parsing the files

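The flat reader has simply loaded the content of each file into a `Document` object for further processing. We can see that the file information is retained in the metadata. Let's pass the documents to the node parser and look at the parsed result.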
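
To make the role of the `extension` metadata key concrete, here is a minimal sketch of extension-based dispatch. It is not the library's implementation: the lookup table, the parser names, and the `choose_parser` helper are hypothetical and only illustrate the idea of selecting a parser from the metadata that `FlatReader` attaches.

```python
from typing import Dict

# Hypothetical lookup table: file extension -> name of the parser to use.
PARSER_FOR_EXTENSION: Dict[str, str] = {
    ".md": "markdown",
    ".html": "html",
}


def choose_parser(metadata: Dict[str, str]) -> str:
    """Pick a parser name from document metadata, falling back to plain text."""
    extension = metadata.get("extension", "").lower()
    return PARSER_FOR_EXTENSION.get(extension, "plain-text")


print(choose_parser({"filename": "README.md", "extension": ".md"}))
print(choose_parser({"filename": "stack-overflow.html", "extension": ".html"}))
print(choose_parser({"filename": "notes.txt", "extension": ".txt"}))
```

A lookup table with a fallback keeps unknown file types from failing: anything without a dedicated parser is simply passed through as plain text in this sketch.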


In [None]:
parser = SimpleFileNodeParser()
md_nodes = parser.get_nodes_from_documents(md_file)
html_nodes = parser.get_nodes_from_documents(html_file)
print(md_nodes[0].metadata)
print(md_nodes[0].text)
print(md_nodes[1].metadata)
print(md_nodes[1].text)
print("----")
print(html_nodes[0].metadata)
print(html_nodes[0].text)

{'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}
🗂️ LlamaIndex 🦙
[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-index)](https://pypi.org/project/llama-index/)
[![GitHub contributors](https://img.shields.io/github/contributors/jerryjliu/llama_index)](https://github.com/jerryjliu/llama_index/graphs/contributors)
[![Discord](https://img.shields.io/discord/1059199217496772688)](https://discord.gg/dGcwcsnxhU)


LlamaIndex (GPT Index) is a data framework for your LLM application.

PyPI: 
- LlamaIndex: https://pypi.org/project/llama-index/.
- GPT Index (duplicate): https://pypi.org/project/gpt-index/.

LlamaIndex.TS (Typescript/Javascript): https://github.com/run-llama/LlamaIndexTS.

Documentation: https://gpt-index.readthedocs.io/.

Twitter: https://twitter.com/llama_index.

Discord: https://discord.gg/dGcwcsnxhU.
{'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙', 'Header 3': 'Ecosystem'}
Ecosystem

- LlamaHub (community librar

## Further processing of files

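We can see that the Markdown and HTML files have been split into chunks based on the structure of the document. The Markdown node parser splits on any headers and attaches the header hierarchy to the metadata. The HTML node parser extracts text from common text elements, which simplifies the HTML file, and merges adjacent nodes that come from the same element type. Compared with working with the raw HTML, this is already a big improvement for retrieving meaningful text content.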
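
The header-hierarchy behaviour described above can be sketched in plain Python. This is a simplified illustration, not the Markdown node parser itself: `split_by_headings` is a hypothetical helper, it drops the heading line from the section text, and it does not handle `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown at headings; return (header-path metadata, body text) pairs."""
    sections: List[Tuple[Dict[str, str], str]] = []
    path: Dict[int, str] = {}  # heading level -> title currently in effect
    body: List[str] = []

    def flush() -> None:
        # Emit the accumulated body under the header path that is in effect.
        text = "\n".join(body).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(path.items())}
            sections.append((metadata, text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading replaces any heading at the same or a deeper level.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


sample = "# Guide\nIntro text.\n### Ecosystem\n- LlamaHub\n## Install\npip install"
for metadata, text in split_by_headings(sample):
    print(metadata, repr(text))
```

Keeping the header path as a level-to-title mapping means a section under a `###` heading still carries its enclosing `#` title, which matches the `'Header 1'` and `'Header 3'` keys seen in the parsed README metadata.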

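Because these files were split only according to their structure, we can apply further processing with a text splitter to turn the content into nodes of limited token length.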
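
As a rough picture of what "limited length with no overlap" means, the sketch below greedily packs whole sentences into chunks under a word budget. It is only an approximation under stated assumptions: `pack_sentences` is a hypothetical helper, it counts whitespace-separated words rather than model tokens, and it is not how `SentenceSplitter` is implemented.

```python
import re
from typing import List


def pack_sentences(text: str, max_words: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
for chunk in pack_sentences(text, max_words=5):
    print(chunk)
```

In this sketch a single sentence longer than the budget is kept whole as its own oversized chunk. A production splitter would need a fallback for that case.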


In [None]:
from llama_index.core.node_parser import SentenceSplitter

# For clarity in the demonstration, make small splits without overlap
splitting_parser = SentenceSplitter(chunk_size=200, chunk_overlap=0)

html_chunked_nodes = splitting_parser(html_nodes)
md_chunked_nodes = splitting_parser(md_nodes)
print(f"\n\nHTML parsed nodes: {len(html_nodes)}")
print(html_nodes[0].text)

print(f"\n\nHTML chunked nodes: {len(html_chunked_nodes)}")
print(html_chunked_nodes[0].text)

print(f"\n\nMD parsed nodes: {len(md_nodes)}")
print(md_nodes[0].text)

print(f"\n\nMD chunked nodes: {len(md_chunked_nodes)}")
print(md_chunked_nodes[0].text)



HTML parsed nodes: 67
About
Products
For Teams
Stack Overflow
Public questions & answers
Stack Overflow for Teams
Where developers & technologists share private knowledge with coworkers
Talent

								Build your employer brand
Advertising
Reach developers & technologists worldwide
Labs
The future of collective knowledge sharing
About the company
current community
















            Stack Overflow
        



help
chat









            Meta Stack Overflow
        






your communities            



Sign up or log in to customize your list.                


more stack exchange communities

company blog


HTML chunked nodes: 87
About
Products
For Teams
Stack Overflow
Public questions & answers
Stack Overflow for Teams
Where developers & technologists share private knowledge with coworkers
Talent

								Build your employer brand
Advertising
Reach developers & technologists worldwide
Labs
The future of collective knowledge sharing
About the company
current community




## Summary

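We can see that the files have been processed further within the splits created by `SimpleFileNodeParser`, and are now ready to be ingested by an index or vector store. The code cell below shows the parsers chained together, going from the raw file to chunked nodes: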


In [None]:
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    documents=reader.load_data(Path("./README.md")),
    transformations=[
        SimpleFileNodeParser(),
        SentenceSplitter(chunk_size=200, chunk_overlap=0),
    ],
)

md_chunked_nodes = pipeline.run()
print(md_chunked_nodes)

[TextNode(id_='e6236169-45a1-4699-9762-c8d3d89f8fa0', embedding=None, metadata={'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e7bc328f-85c1-430a-9772-425e59909a58', node_type=None, metadata={'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}, hash='e538ad7c04f635f1c707eba290b55618a9f0942211c4b5ca2a4e54e1fdf04973'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='51b40b54-dfd3-48ed-b377-5ca58a0f48a3', node_type=None, metadata={'filename': 'README.md', 'extension': '.md', 'Header 1': '🗂️ LlamaIndex 🦙'}, hash='ca9e3590b951f1fca38687fd12bb43fbccd0133a38020c94800586b3579c3218')}, hash='ec733c85ad1dca248ae583ece341428ee20e4d796bc11adea1618c8e4ed9246a', text='🗂️ LlamaIndex 🦙\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-index)](https://pypi.org/project/llama-index/)\n[![GitHub contrib