# Microsoft Word (doc, docx) with LangChain


## Overview

This tutorial covers two ways to load Microsoft Word documents into LangChain `Document` objects that can be used in a RAG pipeline.

We demonstrate `Docx2txtLoader` and `UnstructuredWordDocumentLoader` and show how each one processes and loads `.docx` files.

A comparison of the two loaders is also provided to help you choose the one that fits your requirements.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Comparison of DOCX Loading Methods](#Comparison-of-DOCX-Loading-Methods)
- [Docx2txtLoader](#Docx2txtLoader)
- [UnstructuredWordDocumentLoader](#UnstructuredWordDocumentLoader)

### References

- [UnstructuredWordDocumentLoader Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader.html#langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader/)
- [Docx2txtLoader Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.word_document.Docx2txtLoader.html/)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [29]:
# Install required packages
from langchain_opentutorial import package

package.install(
    ["langchain", "langchain_community", "docx2txt", "unstructured", "python-docx"],
    verbose=False,
    upgrade=False,
)

## Comparison of DOCX Loading Methods

| **Feature**           | **Docx2txtLoader**      | **UnstructuredWordDocumentLoader**    |
|-----------------------|-------------------------|---------------------------------------|
| **Base Library**      | docx2txt               | Unstructured                         |
| **Speed**             | Fast                   | Relatively slow                      |
| **Memory Usage**      | Efficient              | Relatively high                      |
| **Installation Dependencies** | Lightweight (only requires docx2txt) | Heavy (requires multiple dependency packages) |
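
The speed difference in the table depends on the file and the environment, so it can be worth measuring on your own documents. The following is a minimal sketch of a timing helper. `time_call` and the two stand-in workloads are hypothetical names introduced here for illustration; they are not part of LangChain.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workloads (assumption) so the snippet runs without a .docx file.
def light_work() -> int:
    return sum(range(1_000))


def heavy_work() -> int:
    return sum(range(200_000))


timings = {"light": time_call(light_work), "heavy": time_call(heavy_work)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

To compare the two loaders, pass each loader's bound `load` method instead of the stand-ins, for example `time_call(Docx2txtLoader(path).load)`.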

## Docx2txtLoader

**Used Library** : `docx2txt`, a lightweight Python module for text extraction.

**Key Features** :
- Extracts text from `.docx` files quickly and simply.
- Suitable for straightforward tasks that only need the plain text of a document.

**Use Case** :
- When you need to quickly retrieve text data from `.docx` files.
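
To give a rough idea of why plain-text extraction can be lightweight: a `.docx` file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below uses only the standard library to read that part. It is an illustration under that assumption, not the actual implementation of `docx2txt`, and `extract_docx_text` and `make_minimal_docx` are hypothetical helpers. The generated archive is a minimal stand-in, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by paragraph (w:p) and text (w:t) elements
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(docx_bytes: bytes) -> str:
    """Join the text of every paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs: list[str]) -> bytes:
    """Build an in-memory archive containing only word/document.xml (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


demo_bytes = make_minimal_docx(["First paragraph", "Second & last"])
print(extract_docx_text(demo_bytes))
```

Real documents also contain headers, footers, tables, and images, which is where a dedicated library becomes useful.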

In [3]:
from langchain_community.document_loaders import Docx2txtLoader

# Initialize the document loader
loader = Docx2txtLoader("/content/text_data.docx")

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the document
print(f"Document Metadata: {docs[0].metadata}\n")

# Note: The entire docx file is converted into a single document.
# It needs to be split into smaller parts using a text splitter.
print("Document Content")
print(docs[0])

Document Count: 1

Type of loader: <class 'langchain_community.document_loaders.word_document.Docx2txtLoader'>

Document Metadata: {'source': '/content/text_data.docx'}

Document Content
page_content='KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk SEROLOGICAL PIPETTES 25mL CORNING 250mL STERILE CONTAINERS CORNING 125mL STERILE CONTAINERS R2A MEDIUM 10 PLATES SHARPS BIN SQUARE 4.5L SHARPS BIN SQUARE 4.5L SHARPS BIN SQUARE 12.5L SHARPS BIN SQUARE 12.5L sodium hydrogen carbonate ENDOSAFE PTS CARTIDGE, 0.05-5.0EU/mL ENDOSAFE PTS CARTIDGE, 0.01-1.0 EU/mL CARBON DIOXIDE MEDICAL G SIZE Not Provided ACETIC ACID GLACIAL DRW CITRIC ACID MONOHYDRATE DRW SODIUM ACETATE TRIHYDRATE DRW SODIUM ACETATE TRIHYDRATE DRW Sodium Hydroxide Pellets DRW SODIUM PHOSPHATE DIBASIC DIHYDRATE DRW Sodium Chloride DRW RESIN CAPTO Q AKTA READY GRADIENT HIGH FLOW SECTION SUCROSE (FF) NON-CATALOG Medium Dulbecco's Modified SENSOR IRRADIATED SINGLE USE - 1" Tubing, Silicone Platinum-cu
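
As the comment in the cell above notes, the single `Document` usually has to be split before it is embedded. The sketch below shows the idea with a dependency-free, fixed-size splitter with overlap. `split_text` is a hypothetical stand-in written for this illustration; it is not LangChain's splitter, and in a real pipeline a LangChain text splitter such as `RecursiveCharacterTextSplitter` would take this role.

```python
def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Sample text taken from the beginning of the output above
sample = "KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk SEROLOGICAL PIPETTES 25mL"
chunks = split_text(sample, chunk_size=30, chunk_overlap=5)
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

With the loaded data, the input would be `docs[0].page_content` rather than the short sample string.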

## UnstructuredWordDocumentLoader

**Used Library** : `unstructured`, a comprehensive document analysis library.

**Key Features** :
- Capable of understanding the structure of a document, such as titles and body, and separating them into distinct elements.
- Allows hierarchical representation and detailed processing of documents.
- Extracts meaningful information from unstructured data and transforms it into structured formats.

**Use Case** :
- When you need to extract text while preserving the document's structure, formatting, and metadata.
- Suitable for handling complex document structures or converting unstructured data into structured formats.

| **Parameter**           | **Option**              | **Description**                                                                               |
|-------------------------|-------------------------|---------------------------------------------------------------------------------------------|
| `mode`                  | `single` (default)      | Returns the entire document as a single `Document` object.                                  |
|                         | `elements`              | Splits the document into elements (e.g., title, body) and returns each as a `Document` object. |
| `strategy`              | `None` (default)        | No specific strategy is applied.                                                           |
|                         | `fast`                  | Prioritizes speed (may reduce accuracy).                                                    |
|                         | `hi_res`                | Prioritizes high accuracy (slower processing).                                              |
| `include_page_breaks`   | `True` (default)        | Detects page breaks and adds `PageBreak` elements.                                          |
|                         | `False`                 | Ignores page breaks.                                                                        |
| `infer_table_structure` | `True` (default)        | Infers table structure and includes it in HTML format.                                      |
|                         | `False`                 | Does not infer table structure.                                                            |
| `starting_page_number`  | `1` (default)           | Specifies the starting page number of the document.                                         |

### mode: single (default)


In this mode, the entire document is returned as a single LangChain `Document` object that contains all of the document's text.

In [7]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader
loader = UnstructuredWordDocumentLoader("/content/text_data.docx")

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the document
print(f"Document Metadata: {docs[0].metadata}\n")

# Note: The entire docx file is converted into a single document.
# It needs to be split into smaller parts using a text splitter.
print("Document Content")
print(docs[0])

Document Count: 1

Type of loader: <class 'langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader'>

Document Metadata: {'source': '/content/text_data.docx'}

Document Content
page_content='KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk SEROLOGICAL PIPETTES 25mL CORNING 250mL STERILE CONTAINERS CORNING 125mL STERILE CONTAINERS R2A MEDIUM 10 PLATES SHARPS BIN SQUARE 4.5L SHARPS BIN SQUARE 4.5L SHARPS BIN SQUARE 12.5L SHARPS BIN SQUARE 12.5L sodium hydrogen carbonate ENDOSAFE PTS CARTIDGE, 0.05-5.0EU/mL ENDOSAFE PTS CARTIDGE, 0.01-1.0 EU/mL CARBON DIOXIDE MEDICAL G SIZE Not Provided ACETIC ACID GLACIAL DRW CITRIC ACID MONOHYDRATE DRW SODIUM ACETATE TRIHYDRATE DRW SODIUM ACETATE TRIHYDRATE DRW Sodium Hydroxide Pellets DRW SODIUM PHOSPHATE DIBASIC DIHYDRATE DRW Sodium Chloride DRW RESIN CAPTO Q AKTA READY GRADIENT HIGH FLOW SECTION SUCROSE (FF) NON-CATALOG Medium Dulbecco's Modified SENSOR IRRADIATED SINGLE USE - 1" Tubing, Sili

### mode: elements


The document is divided into individual elements, such as `Title` and `NarrativeText`. Each element is returned as a separate `Document` object with its own metadata (for example `category` and `page_number`), allowing for more detailed analysis or processing of the document's structure.

In [1]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader with "elements" mode
loader = UnstructuredWordDocumentLoader(
    "/content/text_data.docx", mode="elements"
)

# Load the document
docs = loader.load()

# Print the number of documents
# In "elements" mode, each element is converted into a separate Document object
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the first document element
print(f"Document Metadata: {docs[0].metadata}\n")

# Print the full list of document elements
print("Document Content")
print(docs)

Document Count: 7

Type of loader: <class 'langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader'>

Document Metadata: {'source': '/content/text_data.docx', 'category_depth': 0, 'file_directory': '/content', 'filename': 'text_data.docx', 'last_modified': '2025-02-27T08:57:08', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'NarrativeText', 'element_id': '86094d379d5752796dc6927ec33cce34'}

Document Content
[Document(metadata={'source': '/content/text_data.docx', 'category_depth': 0, 'file_directory': '/content', 'filename': 'text_data.docx', 'last_modified': '2025-02-27T08:57:08', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'NarrativeText', 'element_id': '86094d379d5752796dc6927ec33cce34'}, page_content='KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk KLERCIDE 70/30 STERILE IPA SPR

In [22]:
docs[2].page_content

'BalanCD CHO Growth A Powder BOTTLE - 20L Mixing Vessel Water, LAL Reagent, 30mL Tubing Manifold, Charter Medical, 21 x 2 MVP VF Viresolve Pump Head MVP VF Viresolve Filter Manifold MVP VF Viresolve Outlet Manifold MVP VF Viresolve Feed Manifold MVP VF Viresolve Jumper Manifold BAG PAK CELSIUS 2 L FIlter, Millipak, 0.22um Sterile, 200 9/ Aseptiquick G, Bod, 3/4", High T Aseptiquick G, Bod, 1/2", High T Aseptiquick G, Bod, 3/8", High T Non-Catalog Disposable Non-Woven Steri NON-CATALOG Disp. Non-Woven Sterile Co Disp. Non-Woven Sterile Coverall, 4 XL Disp. Non-Woven Sterile Coverall, XL Disp. Non-Woven Sterile Coverall, LG Disp. Non-Woven Sterile Coverall, MED AKTA Ready Low Flow Kit Tube Asem - Sta-Pure Loadsure 19mm ID Gasket - UF/DF Membrane NON-CATALOG Filter Millipore 3P Millif Sight Glass - Oring NON-CATALOG Bottles Spray Sterile 16oz BAG-SUB 20L Cellbag BPC Bioclear 10w/AQG BAG-SUB 2L Cellbag BPC Bioclear 11 w/AQG BAG-SUB 10L Cellbag BPC Bioclear 11w/AQG BAG-SUB 20L Cellbag BPC B
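
Because each element carries a `category` in its metadata, the results of `elements` mode can be counted or filtered by type. The sketch below illustrates this with stand-in objects that expose the same `page_content` and `metadata` attributes as a LangChain `Document`, so it runs without the sample file. The sample data and the helper names `count_categories` and `filter_by_category` are made up for this illustration.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(docs) -> Counter:
    """Count how many elements fall into each metadata category."""
    return Counter(doc.metadata.get("category", "Unknown") for doc in docs)


def filter_by_category(docs, category: str) -> list:
    """Keep only the elements whose metadata category matches."""
    return [doc for doc in docs if doc.metadata.get("category") == category]


# Stand-in elements (assumption); real ones come from loader.load()
sample_docs = [
    SimpleNamespace(page_content="Inventory List", metadata={"category": "Title"}),
    SimpleNamespace(page_content="First body text.", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Second body text.", metadata={"category": "NarrativeText"}),
]

print(count_categories(sample_docs))
for doc in filter_by_category(sample_docs, "NarrativeText"):
    print(doc.page_content)
```

With real results, replace `sample_docs` with the `docs` list returned by the loader in `elements` mode.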

### Efficient Document Loader Configuration with Various Parameter Combinations

Combining parameters lets you configure a loader that fits your needs. Settings such as `mode`, `strategy`, and `include_page_breaks` control how different document structures are handled. The example below leaves `mode` at its default (`single`), so the result is still one `Document`.


In [8]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader with specific parameters
loader = UnstructuredWordDocumentLoader(
    "/content/text_data.docx",
    strategy="fast",  # Prioritize fast processing
    include_page_breaks=True,  # Include page breaks as PageBreak elements
    infer_table_structure=True,  # Infer table structures and include in HTML format
    starting_page_number=1,  # Start page numbering from 1
)

# Load the document
docs = loader.load()

# Print the number of documents
print(f"Document Count: {len(docs)}\n")

# Print the type of the loader
print(f"Type of loader: {type(loader)}\n")

# Print the metadata of the first document
print(f"Document Metadata: {docs[0].metadata}\n")

# Print the content of the first document
print("Document Content")
print(docs[0])

Document Count: 1

Type of loader: <class 'langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader'>

Document Metadata: {'source': '/content/text_data.docx'}

Document Content
page_content='KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk KLERCIDE 70/30 STERILE IPA SPRAY 1LX6/pk SEROLOGICAL PIPETTES 25mL CORNING 250mL STERILE CONTAINERS CORNING 125mL STERILE CONTAINERS R2A MEDIUM 10 PLATES SHARPS BIN SQUARE 4.5L SHARPS BIN SQUARE 4.5L SHARPS BIN SQUARE 12.5L SHARPS BIN SQUARE 12.5L sodium hydrogen carbonate ENDOSAFE PTS CARTIDGE, 0.05-5.0EU/mL ENDOSAFE PTS CARTIDGE, 0.01-1.0 EU/mL CARBON DIOXIDE MEDICAL G SIZE Not Provided ACETIC ACID GLACIAL DRW CITRIC ACID MONOHYDRATE DRW SODIUM ACETATE TRIHYDRATE DRW SODIUM ACETATE TRIHYDRATE DRW Sodium Hydroxide Pellets DRW SODIUM PHOSPHATE DIBASIC DIHYDRATE DRW Sodium Chloride DRW RESIN CAPTO Q AKTA READY GRADIENT HIGH FLOW SECTION SUCROSE (FF) NON-CATALOG Medium Dulbecco's Modified SENSOR IRRADIATED SINGLE USE - 1" Tubing, Sili
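
When several parameter combinations are used repeatedly, one option is to keep them as named presets and unpack them into the loader. This is a minimal sketch: the option names come from the parameter table above, while the preset names and the `loader_kwargs` helper are hypothetical and not part of LangChain.

```python
# Hypothetical presets built from the options in the parameter table
PRESETS = {
    "quick_single": {"mode": "single", "strategy": "fast"},
    "structured": {
        "mode": "elements",
        "strategy": "hi_res",
        "infer_table_structure": True,
    },
}


def loader_kwargs(preset: str, **overrides) -> dict:
    """Return a copy of a preset with optional per-call overrides applied."""
    if preset not in PRESETS:
        raise KeyError(f"Unknown preset: {preset!r}")
    return {**PRESETS[preset], **overrides}


config = loader_kwargs("structured", include_page_breaks=False)
print(config)
```

The resulting dictionary can then be unpacked into the constructor, for example `UnstructuredWordDocumentLoader("/content/text_data.docx", **config)`.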