# `HTMLToDocument`


In [1]:
%load_ext autoreload
%autoreload 2

## On its own


In [2]:
from pathlib import Path
from haystack.components.converters import HTMLToDocument

In [3]:
converter = HTMLToDocument()
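
The cells in this section assume that a `sample.html` file exists in the working directory; the notebook itself does not include it. As a minimal sketch for reproducing the run, the snippet below writes a small stand-in file using only the standard library. The HTML content, the temporary directory, and the names `html`, `workdir`, and `sample_path` are assumptions made for illustration, not the original file.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in content; the real sample.html is not part of this notebook.
html = """<!DOCTYPE html>
<html>
  <head><title>VKS</title></head>
  <body>
    <h1>What is VKS?</h1>
    <p>VKS (VNGCloud Kubernetes Service) is a managed service on VNGCloud.</p>
  </body>
</html>
"""

# Write into a fresh temporary directory so no existing file is overwritten.
workdir = Path(tempfile.mkdtemp())
sample_path = workdir / "sample.html"
sample_path.write_text(html, encoding="utf-8")

print(sample_path)
```

With such a file in place, `sample_path` could be passed in `sources` instead of `Path("sample.html")`.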

In [4]:
docs = converter.run(sources=[Path("sample.html")])

In [5]:
print(docs)

{'documents': [Document(id=dc99820f61d2e9d012e89462cc1586cdc81654abdb859c607d09b317e26051b9, content: 'What is VKS?
VKS (VNGCloud Kubernetes Service) is a managed service on VNGCloud that simplifies the ...', meta: {'file_path': 'sample.html'})]}


In [6]:
docs['documents'][0].content

'What is VKS?\nVKS (VNGCloud Kubernetes Service) is a managed service on VNGCloud that simplifies the deployment and management of container-based applications. Kubernetes, an open-source platform developed by Google, is widely used to manage and deploy containerized applications in distributed environments.\nHighlights of VKS\nFully Managed control plane: VKS will free you from the burden of managing the Kubernetes Control Plane, allowing you to focus on developing applications.\nSupport for the latest Kubernetes versions: VKS is always updating to the latest Kubernetes versions (minor versions from 1.27, 1.28, 1.29) to ensure you can take advantage of the most advanced features.\nKubernetes Networking: VKS integrates Calico CNI, providing high efficiency and security.\nUpgrade seamlessly: VKS supports easy and fast upgrades between Kubernetes versions, helping you stay updated with the latest improvements.\nScaling & Healing Automatically: VKS automatically scales the Node group when

## In a pipeline


In [7]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

In [8]:
document_store = InMemoryDocumentStore()

In [9]:
pipeline = Pipeline()

pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7209f8b8ae10>
🚅 Components
  - converter: HTMLToDocument
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - writer: DocumentWriter
🛤️ Connections
  - converter.documents -> cleaner.documents (List[Document])
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> writer.documents (List[Document])
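
One thing to watch with `split_by="sentence"`: in the chunks printed at the end of this notebook, the first chunk ends at `1.27, 1.` and the next begins with `28, 1.29)`. That is consistent with a splitter that treats every period as a sentence boundary, including the periods inside version numbers. The sketch below is a plain-Python illustration of that behaviour under this assumption; `split_on_periods` is a hypothetical helper, not Haystack's implementation.

```python
def split_on_periods(text: str, split_length: int) -> list[str]:
    """Split text after every '.', then group the units into chunks of split_length."""
    pieces = text.split(".")
    # Re-attach the period to every unit that was followed by one.
    units = [piece + "." for piece in pieces[:-1]]
    if pieces[-1]:
        units.append(pieces[-1])
    return [
        "".join(units[i:i + split_length])
        for i in range(0, len(units), split_length)
    ]


sample = "Supported minor versions are 1.27, 1.28, 1.29. Calico CNI is integrated."
chunks = split_on_periods(sample, split_length=2)

for chunk in chunks:
    print(repr(chunk))
# The first chunk is 'Supported minor versions are 1.27, 1.' (cut inside a version number).
```

If such mid-number breaks matter for retrieval, splitting by `"word"` or `"passage"` instead is one way to sidestep them, at the cost of different chunk boundaries.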

In [10]:
file_names = ["sample.html"]
res = pipeline.run({"converter": {"sources": file_names}})

In [11]:
res

{'writer': {'documents_written': 4}}

In [12]:
for doc in document_store.filter_documents():
    print(f"Content: {doc.content}\nLength {len(doc.content)}", end="\n========================\n\n")

Content: What is VKS?
VKS (VNGCloud Kubernetes Service) is a managed service on VNGCloud that simplifies the deployment and management of container-based applications. Kubernetes, an open-source platform developed by Google, is widely used to manage and deploy containerized applications in distributed environments.
Highlights of VKS
Fully Managed control plane: VKS will free you from the burden of managing the Kubernetes Control Plane, allowing you to focus on developing applications.
Support for the latest Kubernetes versions: VKS is always updating to the latest Kubernetes versions (minor versions from 1.27, 1.
Length 611

Content: 28, 1.29) to ensure you can take advantage of the most advanced features.
Kubernetes Networking: VKS integrates Calico CNI, providing high efficiency and security.
Upgrade seamlessly: VKS supports easy and fast upgrades between Kubernetes versions, helping you stay updated with the latest improvements.
Scaling & Healing Automatically: VKS automatically sca