A Neuron AI FileDataLoader reader for .docx
documents, powered by elephant-php.
Drop it into a Neuron RAG pipeline and .docx files become embeddable
documents alongside the bundled PDF, HTML and plain-text readers.
composer require endless-creativity/elephant-php-neuron

Requires PHP 8.2+. No external binaries needed (unlike PdfReader).
use NeuronAI\RAG\DataLoader\FileDataLoader;
use EndlessCreativity\ElephantPhpNeuron\DocxReader;
$documents = FileDataLoader::for(__DIR__.'/knowledge')
->addReader('docx', new DocxReader())
->getDocuments();
MyRAG::make()->addDocuments($documents);

Pass a directory and Neuron walks it, picking the right reader per extension; pass a single file to ingest just that one.
By default the reader returns plain text via
Converter::extractRawText() — paragraphs separated by "\n\n", no
markup. This is usually what you want for embeddings: less syntactic
noise, more semantic signal per token.
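
Because the default output separates paragraphs with "\n\n", paragraph-level chunking takes little code. The snippet below is a minimal sketch. `splitParagraphs()` is a hypothetical helper, not part of elephant-php or Neuron, and the sample string stands in for whatever the reader would return.

```php
<?php

declare(strict_types=1);

/**
 * Split plain text into paragraphs on blank lines.
 * Hypothetical helper for illustration; not part of elephant-php or Neuron.
 *
 * @return list<string>
 */
function splitParagraphs(string $text): array
{
    // Split on the "\n\n" paragraph separator, tolerating extra newlines.
    $parts = preg_split('/\n{2,}/', $text) ?: [];

    // Trim each piece and drop any that end up empty.
    $parts = array_map('trim', $parts);

    return array_values(array_filter($parts, static fn (string $p): bool => $p !== ''));
}

// Sample text in the shape described above (assumed, not real reader output).
$raw = "Quarterly report\n\nRevenue grew in Q3.\n\nOutlook remains stable.";
$paragraphs = splitParagraphs($raw);
// $paragraphs now holds three strings, one per paragraph.
```

In practice you would likely leave chunking to your splitter of choice; the point is only that the separator is predictable.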
If you'd rather preserve headings, lists and links — for example because your splitter or post-processor relies on Markdown structure — request Markdown explicitly through the reader options:
FileDataLoader::for($path)
->addReader('docx', new DocxReader())
->getDocuments(['format' => DocxReader::FORMAT_MARKDOWN]);

The $options array is forwarded by FileDataLoader to every reader,
so the same flag is in effect for the whole loading pass.
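
To illustrate why Markdown structure can matter downstream, here is a rough sketch of a heading-based post-processor. It assumes the Markdown output uses ATX-style headings (`#`, `##`, ...), which is not confirmed here. `splitByHeadings()` is a hypothetical helper and the sample input is made up.

```php
<?php

declare(strict_types=1);

/**
 * Split Markdown into sections, each starting at an ATX heading.
 * Hypothetical helper for illustration; assumes "#"-style headings.
 *
 * @return list<string>
 */
function splitByHeadings(string $markdown): array
{
    // The lookahead splits before each heading line without consuming it,
    // so every section keeps its own heading.
    $sections = preg_split('/^(?=#{1,6} )/m', $markdown) ?: [];
    $sections = array_map('trim', $sections);

    return array_values(array_filter($sections, static fn (string $s): bool => $s !== ''));
}

// Made-up Markdown standing in for a converted .docx document.
$markdown = "# Intro\nHello.\n\n## Details\n- one\n- two";
$sections = splitByHeadings($markdown);
// $sections holds two strings: the "# Intro" section and the "## Details" section.
```

With plain-text output there are no heading markers to split on, which is the trade-off between the two formats.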
- Only OOXML .docx is supported. Legacy binary .doc (Word 97–2003) is not handled by elephant-php and therefore not by this reader either.
- Images embedded in the document are dropped during text extraction. This is intentional for RAG, since embeddings are text-only.
- Conversion warnings emitted by elephant-php (Result::messages) are currently silenced. If you need them, open an issue.
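
Since legacy .doc files are not handled, it can help to spot them before a loading pass so they can be converted to .docx first. The following is a sketch only. `findLegacyDocFiles()` is a hypothetical helper, and the file names are placeholders.

```php
<?php

declare(strict_types=1);

/**
 * Return the paths whose extension is the legacy ".doc" (case-insensitive).
 * Hypothetical helper for illustration; not part of this package.
 *
 * @param list<string> $paths
 * @return list<string>
 */
function findLegacyDocFiles(array $paths): array
{
    return array_values(array_filter(
        $paths,
        // pathinfo() returns '' when there is no extension, so this is safe.
        static fn (string $path): bool => strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'doc'
    ));
}

// Placeholder file names; in practice these would come from scanning a directory.
$legacy = findLegacyDocFiles(['handbook.docx', 'policy.DOC', 'notes.txt']);
// $legacy contains only 'policy.DOC'.
```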
BSD-2-Clause. See LICENSE.