
A GPU-accelerated DOCX parser that efficiently processes complex documents using CuPy and RAPIDS for fast XML parsing, hierarchical data structures, and robust error handling.


andrewprograms/gpu_xml_parser


GPU-Accelerated DOCX Parser

A GPU-accelerated DOCX parser that efficiently processes complex documents using CuPy and RAPIDS, featuring advanced XML parsing, hierarchical data structures, and robust error handling.

MIT License Python CUDA

Features

  • Advanced XML Parsing

    • Utilizes a Finite State Machine (FSM) for robust parsing directly on the GPU.
  • Parallel Attribute Extraction

    • Efficiently extracts XML attributes with GPU-optimized methods.
  • Hierarchical Data Structures

    • Represents DOCX elements like Tables, Rows, Cells, and Paragraphs for easy manipulation.
  • Performance Optimizations

    • Minimizes GPU-CPU data transfers and supports multi-GPU environments for scalability.
  • Robust Error Handling

    • Comprehensive XML schema validation with informative error messages.
  • Seamless Integration with GPU Libraries
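
To make the FSM idea in the feature list concrete, here is a minimal CPU-only sketch of a finite state machine that splits XML into tag and text tokens. It is not the project's GPU implementation: the `State` enum and the `tokenize` function are hypothetical names chosen for illustration, and the sketch ignores comments, CDATA sections, and processing instructions.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()    # between tags
    TAG = auto()     # inside <...>, outside any quoted attribute value
    QUOTED = auto()  # inside a quoted attribute value


def tokenize(xml):
    """Return a list of ("tag", body) and ("text", body) tokens."""
    tokens = []
    state = State.TEXT
    quote = ""
    start = 0
    for i, ch in enumerate(xml):
        if state is State.TEXT:
            if ch == "<":
                if i > start:
                    tokens.append(("text", xml[start:i]))
                state, start = State.TAG, i + 1
        elif state is State.TAG:
            if ch in "\"'":
                # Remember which quote opened the value so only it closes it.
                state, quote = State.QUOTED, ch
            elif ch == ">":
                tokens.append(("tag", xml[start:i]))
                state, start = State.TEXT, i + 1
        else:  # State.QUOTED
            if ch == quote:
                state = State.TAG
    if state is not State.TEXT:
        raise ValueError("unterminated tag")
    if start < len(xml):
        tokens.append(("text", xml[start:]))
    return tokens


tokens = tokenize('<w:p a="x>y">Hi</w:p>')
```

The `QUOTED` state is what keeps a `>` inside an attribute value from ending the tag early, which is the main reason to use a state machine rather than a plain search for angle brackets.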

Install

Important: The installation steps vary with your CUDA version. Make sure you install the packages that match the CUDA toolkit for your graphics card.

  1. Install CuPy

Choose the correct installation command for your CUDA version:

  • For CUDA 12.x:

    pip install cupy-cuda12x
  • For CUDA 11.x:

    pip install cupy-cuda11x
    • Note: CUDA 11.x has not been tested.

More information is available on the CuPy PyPI page.

  2. Install RAPIDS AI

    pip install rapidsai
  3. Install Dask cuDF

Follow the installation instructions available in the Dask cuDF documentation.

Note: This project does not provide support for installation or configuration issues for your specific graphics card. Please consult the respective documentation for help with any problems.
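
If you are unsure which CuPy package from step 1 applies, the following sketch reads the CUDA release from the output of `nvcc --version` and maps it to one of the two package names listed above. The helper names (`cuda_major_from_nvcc`, `suggest_cupy_wheel`) are made up for this example, and it assumes the CUDA toolkit's `nvcc` is on your `PATH`.

```python
import re
import shutil
import subprocess

# Package names taken from the install steps above.
CUPY_WHEELS = {12: "cupy-cuda12x", 11: "cupy-cuda11x"}


def cuda_major_from_nvcc(nvcc_output):
    """Extract the CUDA major version from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return int(match.group(1)) if match else None


def suggest_cupy_wheel(nvcc_output):
    """Return the matching CuPy package name, or None if unrecognized."""
    return CUPY_WHEELS.get(cuda_major_from_nvcc(nvcc_output))


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found; check your CUDA installation")
    else:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        print(suggest_cupy_wheel(result.stdout) or "no matching package listed")
```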

Getting Started

Usage

Here’s a quick example of how to use the GPU-Accelerated DOCX Parser:

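import zipfile

from gpu_docx_parser import GpuDocxParser

# A DOCX file is a ZIP archive; the main body is stored in word/document.xml.
# "example.docx" is a placeholder path. Decoding to str is an assumption:
# the README does not say whether the parser expects str or bytes.
# (The unzip step may eventually be rewritten in Rust.)
with zipfile.ZipFile("example.docx") as docx:
    xml_content = docx.read("word/document.xml").decode("utf-8")

# Initialize the parser
parser = GpuDocxParser(xml_content)

# Parse the document
parsed_elements = parser.parse()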
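
This README does not document what `parse()` returns. As a purely hypothetical illustration of the hierarchical structures named under Features (Tables, Rows, Cells, Paragraphs), the sketch below models them as nested dataclasses. The class fields and the `table_to_text_grid` helper are assumptions made for this example and may not match the project's actual classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    text: str = ""


@dataclass
class Cell:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Table:
    rows: List[Row] = field(default_factory=list)


def table_to_text_grid(table):
    """Flatten a Table into a list of rows, each a list of cell strings."""
    return [
        [" ".join(p.text for p in cell.paragraphs) for cell in row.cells]
        for row in table.rows
    ]


example = Table(rows=[
    Row(cells=[
        Cell(paragraphs=[Paragraph("a")]),
        Cell(paragraphs=[Paragraph("b"), Paragraph("c")]),
    ]),
])
grid = table_to_text_grid(example)
```

Nesting the types this way mirrors how tables contain rows, rows contain cells, and cells contain paragraphs in a DOCX body, so walking a table is a pair of nested loops.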

License

This project is licensed under the MIT License.

Acknowledgements

  • CuPy
  • RAPIDS
  • Dask
  • Python
  • Rust

Thank you for checking out the GPU-Accelerated DOCX Parser! If you encounter any issues or have suggestions, feel free to open an issue.
