### Word Document Processing


In [1]:
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredWordDocumentLoader

1. Docx2txtLoader (a wrapper around the docx2txt package) → Fast, simple plain-text extraction that returns the whole file as a single document (best for quick preprocessing).

In [3]:
# Method 1: Using Docx2txtLoader
print("Docx2txtLoader")

try:
    docx_loader = Docx2txtLoader("data/Job Description.docx")
    docx_documents = docx_loader.load()
    print(f"✅ Loaded {len(docx_documents)} documents")
    print(f"Content Preview: {docx_documents[0].page_content[:500]}")
    print(f"Metadata: {docx_documents[0].metadata}")
except Exception as e:
    print(f"Error loading document: {e}")

Docx2txtLoader
✅ Loaded 1 documents
Content Preview: Job Description – Data Scientist

Company: AXVG Pvt. Ltd.
Location: Delhi
Employment Type: Full-Time



About AXVG

AXVG is a forward-thinking technology and analytics company, committed to solving real-world business problems using data-driven decision-making. We leverage AI, machine learning, and advanced analytics to optimize operations, enhance customer experiences, and create innovative products.



Role Overview

We are seeking a Data Scientist who is passionate about extracting insights f
Metadata: {'source': 'data/Job Description.docx'}
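
Because Docx2txtLoader returns the whole file as one document with only a `source` entry in its metadata, any structure has to be recovered from the text itself. The following is a minimal sketch, not part of the original notebook, that splits the extracted text into paragraphs on blank lines. The helper name `split_paragraphs` and the shortened `sample_text` (copied from the preview above) are assumptions made for illustration.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split plain text on blank lines and drop empty pieces."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


# Shortened stand-in for docx_documents[0].page_content (assumption: copied
# from the content preview so the sketch runs without the .docx file).
sample_text = (
    "Job Description – Data Scientist\n\n"
    "Company: AXVG Pvt. Ltd.\nLocation: Delhi\nEmployment Type: Full-Time\n\n\n\n"
    "About AXVG\n\n"
)

paragraphs = split_paragraphs(sample_text)
for number, paragraph in enumerate(paragraphs, start=1):
    print(f"[{number}] {paragraph!r}")
```

In the notebook, the same helper could be called as `split_paragraphs(docx_documents[0].page_content)`. Lines separated by a single newline, such as the Company/Location/Employment Type block, stay together in one paragraph.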


2. UnstructuredWordDocumentLoader → Element-level extraction with per-element metadata such as category, page number, and emphasized text (best for pipelines that filter or chunk by document structure).

In [6]:
# Method 2: Using UnstructuredWordDocumentLoader
print("UnstructuredWordDocumentLoader")

try:
    unstructured_loader = UnstructuredWordDocumentLoader("data/Job Description.docx", mode="elements")
    unstructured_documents = unstructured_loader.load()
    print(f"✅ Loaded {len(unstructured_documents)} documents")
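    # In "elements" mode each returned document is one element of the file.
    for i, doc in enumerate(unstructured_documents[:3]):
        print(f"Document {i + 1} Preview: {doc.page_content[:500]}")
        # Print only the element category; fall back to "Unknown" if it is missing.
        print(f"Metadata: {doc.metadata.get('category', 'Unknown')}")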
    print(f"Content Preview: {unstructured_documents[0].page_content[:500]}")
    print(f"Metadata: {unstructured_documents[0].metadata}")
except Exception as e:
    print(f"Error loading document: {e}")


UnstructuredWordDocumentLoader
✅ Loaded 34 documents
Document 1 Preview: Job Description – Data Scientist
Metadata: UncategorizedText
Document 2 Preview: Company: AXVG Pvt. Ltd.
Location: Delhi
Employment Type: Full-Time
Metadata: NarrativeText
Document 3 Preview: About AXVG
Metadata: UncategorizedText
Content Preview: Job Description – Data Scientist
Metadata: {'source': 'data/Job Description.docx', 'category_depth': 0, 'emphasized_text_contents': ['Job Description – Data Scientist'], 'emphasized_text_tags': ['b'], 'file_directory': 'data', 'filename': 'Job Description.docx', 'last_modified': '2025-08-25T01:46:59', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'UncategorizedText', 'element_id': '351e478b06a946e7b8d4c049d665e816'}
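
The `category` field in each element's metadata is what makes this output useful downstream. As a rough sketch, assuming only that each document exposes `page_content` and `metadata`, elements can be grouped by category. `ElementDoc` is a hypothetical stand-in for the LangChain document class so the block runs without the loader installed, and `group_by_category` is an illustrative helper, not a library function.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ElementDoc:
    """Hypothetical stand-in for a LangChain document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_category(documents):
    """Map each element category to the list of texts that carry it."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc.metadata.get("category", "Unknown")].append(doc.page_content)
    return dict(groups)


# First three elements, copied from the output above.
elements = [
    ElementDoc("Job Description – Data Scientist", {"category": "UncategorizedText"}),
    ElementDoc(
        "Company: AXVG Pvt. Ltd.\nLocation: Delhi\nEmployment Type: Full-Time",
        {"category": "NarrativeText"},
    ),
    ElementDoc("About AXVG", {"category": "UncategorizedText"}),
]

groups = group_by_category(elements)
for category, texts in groups.items():
    print(f"{category}: {len(texts)} element(s)")
```

Since the helper reads only `page_content` and `metadata`, it should also accept `unstructured_documents` directly, which would summarize all 34 elements by category.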
