Parser

A Rust-based document parsing system that extracts text content from various file formats.

Live Demo | API Endpoint

📚 Overview

Parser is a modular Rust project that exposes its document parsing through four components:

  • Core library: The foundation providing parsing functionality for various file formats
  • CLI tool: Command-line interface for quick file parsing
  • Web API: REST service for parsing files via HTTP requests
  • Web UI: Simple interface for testing the parser functionality

📦 Project Structure

The project is organized as a Rust workspace with multiple crates:

  • parser-core: The core parsing engine
  • parser-cli: Command-line interface
  • parser-web: Web API and frontend
  • test-utils: Shared testing utilities

📄 Supported File Types

  • Documents: PDF (.pdf), Word (.docx), PowerPoint (.pptx), Excel (.xlsx)
  • Text: Plain text (.txt), CSV, JSON, YAML, source code, and other text-based formats
  • Images: PNG, JPEG, WebP, and other image formats with OCR (Optical Character Recognition)

OCR supports English and French.
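
As a rough illustration of the three categories above, the sketch below groups files by extension. It is not the project's detection logic: the `Category` enum and `categorize` function are hypothetical names, and the extension lists cover only the formats named in this section (plus `.yml` and `.jpg` as common aliases). The real parser may inspect file contents rather than names.

```rust
use std::path::Path;

/// Hypothetical grouping that mirrors the three bullets above.
#[derive(Debug, PartialEq)]
enum Category {
    Document,
    Text,
    Image,
    Unknown,
}

/// Hypothetical helper: classify a path by its extension only.
fn categorize(path: &str) -> Category {
    // Lowercase the extension so "report.PDF" and "report.pdf" match.
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        Some("pdf" | "docx" | "pptx" | "xlsx") => Category::Document,
        Some("txt" | "csv" | "json" | "yaml" | "yml") => Category::Text,
        // Images would be the inputs routed through OCR.
        Some("png" | "jpg" | "jpeg" | "webp") => Category::Image,
        _ => Category::Unknown,
    }
}

fn main() {
    for path in ["report.PDF", "data/config.yaml", "scan.jpeg", "README"] {
        println!("{path}: {:?}", categorize(path));
    }
}
```

Extension matching is the simplest possible approach; a content-based check is more robust when files are uploaded without a reliable name.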

🛠️ Getting Started

Prerequisites

  • Rust (latest stable)
  • OCR Dependencies:
    • Tesseract development libraries
    • Leptonica development libraries
    • Clang development libraries

Installing OCR Dependencies

Debian/Ubuntu:

sudo apt install libtesseract-dev libleptonica-dev libclang-dev

macOS:

brew install tesseract

Windows: Follow the instructions in the Tesseract GitHub repository.

Building from Source

# Build all crates
cargo build

# Build in release mode
cargo build --release

Using the CLI

# Run directly with cargo
cargo run -p parser-cli -- path/to/file1.pdf path/to/file2.docx

# Or use the built binary
./target/release/parser-cli path/to/file1.pdf path/to/file2.docx

Running the Web Server

# Run the web server
cargo run -p parser-web

# With custom port
PARSER_APP_PORT=9000 cargo run -p parser-web

# With file serving enabled (for frontend)
ENABLE_FILE_SERVING=true cargo run -p parser-web

🚀 Deployment

The easiest way to deploy the service is with Docker Compose:

curl -o compose.yaml https://raw.githubusercontent.com/excoffierleonard/parser/refs/heads/main/compose.yaml && \
docker compose up -d

Environment Variables

  • PARSER_APP_PORT: The port on which the web service listens (default: 8080)
  • ENABLE_FILE_SERVING: Enable serving frontend files (default: false)
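
The defaults above can be pictured with a small sketch that uses only the standard library. The `Settings` struct and `settings_from` helper are hypothetical and are not taken from parser-web. Falling back to the default on an unparsable value is also an assumption; the real service may reject bad input instead.

```rust
use std::env;

/// Hypothetical container for the two documented variables.
#[derive(Debug, PartialEq)]
struct Settings {
    port: u16,
    enable_file_serving: bool,
}

/// Hypothetical helper: build settings from raw variable values,
/// applying the documented defaults (8080 and false).
fn settings_from(port: Option<&str>, file_serving: Option<&str>) -> Settings {
    Settings {
        // Unset or unparsable values fall back to 8080 (assumed behavior).
        port: port.and_then(|p| p.trim().parse().ok()).unwrap_or(8080),
        // Only "true" (case-insensitive) enables file serving in this sketch.
        enable_file_serving: file_serving
            .map(|v| v.trim().eq_ignore_ascii_case("true"))
            .unwrap_or(false),
    }
}

fn main() {
    // Read the real environment, then delegate to the pure helper.
    let port = env::var("PARSER_APP_PORT").ok();
    let serving = env::var("ENABLE_FILE_SERVING").ok();
    let settings = settings_from(port.as_deref(), serving.as_deref());
    println!("{settings:?}");
}
```

Keeping the parsing in a pure function makes the defaults easy to test without touching the process environment.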

🧪 Development

Testing

# Run all tests
cargo test --workspace

# Run only the tests whose names contain test_name
cargo test test_name

Benchmarking

# Run benchmarks
cargo bench --workspace

# Run benchmark script
./scripts/benchmark.sh

Code Quality

# Run linter
cargo clippy --workspace -- -D warnings

# Format code
cargo fmt --all

Building with Scripts

# Full build script
./scripts/build.sh

# Deployment tests
./scripts/deploy-tests.sh

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.