# `DOCXToDocument`

In [1]:
%load_ext autoreload
%autoreload 2

## On its own

In [2]:
from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat
from datetime import datetime

In [3]:
converter = DOCXToDocument()
# Or set the table format explicitly (DOCXTableFormat.CSV or DOCXTableFormat.MARKDOWN)
converter = DOCXToDocument(table_format=DOCXTableFormat.CSV)
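
To give a feel for what the two table formats mean for the extracted text, here is a minimal standard-library sketch that renders the same small table as CSV and as Markdown. The `rows` data and the `to_csv` / `to_markdown` helpers are hypothetical and only illustrate the two styles; this is not how `DOCXToDocument` renders tables internally, and its exact output may differ.

```python
import csv
import io

# Hypothetical table content, used only for illustration.
rows = [["Name", "Role"], ["Ada", "Engineer"], ["Grace", "Admiral"]]


def to_csv(rows):
    """Render rows as CSV text, one line per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(to_csv(rows))
print(to_markdown(rows))
```

CSV tends to be more compact, while Markdown keeps the header row visually distinct, which can matter if the text is later shown to a reader or passed to a language model.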

In [4]:
# Convert the file; the `meta` dict is attached to every resulting Document
results = converter.run(
    sources=["sample.docx"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

In [5]:
print(documents[0].content)

Sample Document
This document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should be completely accessible using assistive technologies such as screen readers.
Headings
There are eight section headings in this document. At the beginning, "Sample Document" is a level 1 heading. The main section headings, such as "Headings" and "Lists" are level 2 headings. The Tables section contains two sub-headings, "Simple Table" and "Complex Table," which are both level 3 headings.
Lists
The following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, "Tables," contains a nested unordered (bulleted) list with two items.
Headings 
Lists 
Links 
Images 
Tables 
Simple Tables 
Complex Tables 
Columns 
Links
In web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word documents or PDFs:
Top of this Page
Sample Document
Sa

In [6]:
documents

[Document(id=90ce147c1c802e5d9146deeb539604917f1b5aa83a5f8b2f222743cfd414bec0, content: 'Sample Document
 This document was created using accessibility techniques for headings, lists, image ...', meta: {'file_path': 'sample.docx', 'date_added': '2024-12-30T10:27:34.994496', 'docx': DOCXMetadata(author='Mike Scott', category='', comments='', content_status='', created='2016-12-06T21:20:00+00:00', identifier='', keywords='', language='', last_modified_by='Mike Scott', last_printed=None, modified='2016-12-06T22:10:00+00:00', revision=3, subject='', title='', version='')})]
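
The timestamps in the output above are ISO 8601 strings, so they can be parsed back into `datetime` objects with the standard library. The sketch below hard-codes the values shown in the output so that it runs without `sample.docx`; in a real run the `date_added` string would come from `documents[0].meta["date_added"]`.

```python
from datetime import datetime

# Values copied from the output above (hard-coded for illustration).
date_added = datetime.fromisoformat("2024-12-30T10:27:34.994496")
created = datetime.fromisoformat("2016-12-06T21:20:00+00:00")
modified = datetime.fromisoformat("2016-12-06T22:10:00+00:00")

# The DOCX timestamps carry a UTC offset, so they are timezone-aware;
# the `date_added` value from datetime.now() is naive (no timezone).
editing_time = modified - created

print(date_added.date())
print(editing_time)
```

Because `date_added` is naive and the DOCX timestamps are timezone-aware, comparing them directly would raise a `TypeError`; one option is to store `datetime.now(timezone.utc).isoformat()` instead if such comparisons are needed.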

## In a pipeline

In [7]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import DOCXToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

In [8]:
document_store = InMemoryDocumentStore()

In [9]:
pipeline = Pipeline()

# Register the components: convert DOCX -> clean text -> split into chunks -> write to the store
pipeline.add_component("converter", DOCXToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Wire each component's output into the next component's input
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7e5586ff24e0>
🚅 Components
  - converter: DOCXToDocument
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - writer: DocumentWriter
🛤️ Connections
  - converter.documents -> cleaner.documents (List[Document])
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> writer.documents (List[Document])

In [10]:
file_names = ["sample.docx"]
res = pipeline.run({"converter": {"sources": file_names}})

In [11]:
res

{'writer': {'documents_written': 5}}
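
To illustrate the idea behind `split_by="sentence"` with `split_length=5`, here is a simplified standard-library sketch that groups sentences into chunks of at most five. The `split_into_chunks` helper and its naive regex are assumptions made for illustration; `DocumentSplitter` uses its own sentence detection, so its chunk boundaries can differ from this sketch.

```python
import re


def split_into_chunks(text: str, split_length: int = 5) -> list[str]:
    """Group sentences into chunks of at most `split_length` sentences."""
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


text = "One. Two. Three. Four. Five. Six. Seven."
chunks = split_into_chunks(text, split_length=5)
for chunk in chunks:
    print(chunk)
```

With seven sentences and a chunk size of five, the last chunk holds the two leftover sentences, which is why the final document written by a splitter is often shorter than the others.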

In [12]:
for doc in document_store.filter_documents():
    print(f"Content: {doc.content}\nLength {len(doc.content)}", end="\n========================\n\n")

Content: Sample Document
This document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should be completely accessible using assistive technologies such as screen readers.
Headings
There are eight section headings in this document. At the beginning, "Sample Document" is a level 1 heading. The main section headings, such as "Headings" and "Lists" are level 2 headings.
Length 422

Content:  The Tables section contains two sub-headings, "Simple Table" and "Complex Table," which are both level 3 headings.
Lists
The following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, "Tables," contains a nested unordered (bulleted) list with two items.
Headings Lists Links Images Tables Simple Tables Complex Tables Columns Links
In web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word documents or PDFs:
Top of this 