A GPU-accelerated DOCX parser that efficiently processes complex documents using CuPy and RAPIDS, featuring advanced XML parsing, hierarchical data structures, and robust error handling.
- Advanced XML Parsing
  - Utilizes a Finite State Machine (FSM) for robust parsing directly on the GPU.
- Parallel Attribute Extraction
  - Efficiently extracts XML attributes with GPU-optimized methods.
- Hierarchical Data Structures
  - Represents DOCX elements like Tables, Rows, Cells, and Paragraphs for easy manipulation.
- Performance Optimizations
  - Minimizes GPU-CPU data transfers and supports multi-GPU environments for scalability.
- Robust Error Handling
  - Comprehensive XML schema validation with informative error messages.
- Seamless Integration with GPU Libraries
  - Integrates effortlessly with RAPIDS cuDF for data frames and cuStrings for string manipulation.
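To give a feel for the FSM and attribute-extraction features listed above, here is a minimal CPU-only sketch in plain Python. It is not the project's GPU implementation: the `State` names and the `scan_tags` function are hypothetical, and the sketch ignores comments, CDATA sections, and entity decoding. It only shows the kind of state transitions an FSM-based tag scanner goes through.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()        # outside any tag
    TAG_NAME = auto()    # reading the element name after '<'
    IN_TAG = auto()      # inside a tag, between attributes
    ATTR_NAME = auto()   # reading an attribute name
    ATTR_VALUE = auto()  # inside a quoted attribute value


def scan_tags(xml: str):
    """Return (tag_name, {attr: value}) for each opening or self-closing tag.

    Hypothetical helper: closing tags and '<?...?>' declarations are skipped.
    """
    state = State.TEXT
    tags = []
    name, attr, value, quote = [], [], [], ""
    attrs = {}
    skip = False  # True while inside a closing tag or declaration

    for ch in xml:
        if state is State.TEXT:
            if ch == "<":
                state, name, attrs, skip = State.TAG_NAME, [], {}, False
        elif state is State.TAG_NAME:
            if ch in "/?!" and not name:
                skip = True              # '</', '<?' or '<!' prefix
            elif ch.isspace():
                state = State.IN_TAG
            elif ch == ">":
                if not skip:
                    tags.append(("".join(name), attrs))
                state = State.TEXT
            elif ch != "/":              # ignore the '/' of '<tag/>'
                name.append(ch)
        elif state is State.IN_TAG:
            if ch == ">":
                if not skip:
                    tags.append(("".join(name), attrs))
                state = State.TEXT
            elif not ch.isspace() and ch not in "/?":
                state, attr = State.ATTR_NAME, [ch]
        elif state is State.ATTR_NAME:
            if ch in "\"'":
                state, quote, value = State.ATTR_VALUE, ch, []
            elif ch != "=" and not ch.isspace():
                attr.append(ch)
        elif state is State.ATTR_VALUE:
            if ch == quote:              # closing quote ends the value
                attrs["".join(attr)] = "".join(value)
                state = State.IN_TAG
            else:
                value.append(ch)
    return tags


sample = '<w:p w:rsidR="00A1"><w:r><w:t xml:space="preserve">Hi</w:t></w:r></w:p>'
print(scan_tags(sample))
```

A GPU version would presumably run this kind of transition table over a byte array rather than looping character by character, but that part is specific to the project and not shown here.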
Important: The packages you need to install depend on your CUDA version. Make sure you install the ones that match your graphics card and CUDA toolkit.
- Install CuPy
Choose the correct installation command for your CUDA version:
- For CUDA 12.x:
  pip install cupy-cuda12x
- For CUDA 11.x (I haven't tested 11.x):
  pip install cupy-cuda11x
More information is available on the CuPy PyPI page.
- Install RAPIDS AI
pip install rapidsai
- Install Dask cuDF
Follow the installation instructions available in the Dask cuDF documentation.
Note: This project does not provide support for installation or configuration issues for your specific graphics card. Please consult the respective documentation for help with any problems.
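If you're not sure which of the two CuPy commands above applies to you, and you have the CUDA toolkit installed, `nvcc --version` prints a line containing something like `release 12.4`. The sketch below turns that output into the matching package name. `cupy_package_for` is a made-up helper for illustration, not part of this project, and it only knows about the two wheels listed above.

```python
import re


def cupy_package_for(nvcc_output: str) -> str:
    """Map the text printed by `nvcc --version` to a CuPy wheel name.

    Hypothetical helper; only CUDA 11.x and 12.x are covered.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release number")
    major = int(match.group(1))
    if major == 12:
        return "cupy-cuda12x"
    if major == 11:
        return "cupy-cuda11x"
    raise ValueError(f"no CuPy wheel listed here for CUDA {major}.x")


# Example with a typical last line of `nvcc --version` output.
print(cupy_package_for("Cuda compilation tools, release 12.4, V12.4.131"))
```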
Usage
Here’s a quick example of how to use the GPU-Accelerated DOCX Parser:
from gpu_docx_parser import GpuDocxParser
# Your DOCX XML content.
# A .docx file is a ZIP archive: unzip it and read word/document.xml.
# (The path below assumes you unzipped into the current directory.)
# I want to write this unzip portion in Rust eventually.
with open("word/document.xml", encoding="utf-8") as f:
    xml_content = f.read()
# Initialize the parser
parser = GpuDocxParser(xml_content)
# Parse the document
parsed_elements = parser.parse()
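The unzip step mentioned in the comments can also be done from Python with the standard `zipfile` module, without unpacking anything to disk. This is a small sketch, not part of the parser: `read_document_xml` is a hypothetical helper, and it assumes the document part is UTF-8 encoded and that `GpuDocxParser` accepts the XML as a string.

```python
import zipfile


def read_document_xml(docx_path) -> str:
    """Return the main document part of a .docx file as text.

    `docx_path` may be a filesystem path or a binary file-like object.
    Assumes UTF-8, which is the usual encoding for word/document.xml.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

You could then pass the result straight to the parser, e.g. `GpuDocxParser(read_document_xml("example.docx"))`, where `example.docx` is a placeholder path.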
This project is licensed under the MIT License.
- CuPy
- RAPIDS
- Dask
- Python
- Rust