A Rust application that parses files in a directory tree and generates AI embeddings using the Ollama API. The tool splits text files into chunks and generates vector embeddings that can be used for semantic search, RAG (Retrieval-Augmented Generation), and other AI applications.
- File Parsing: Recursively parses directories and extracts text content from supported file types
- Text Chunking: Intelligently splits text into overlapping chunks with configurable sizes
- Ollama Integration: Generates embeddings using local Ollama models
- Multiple Output Formats: Supports JSON, JSON Lines, and CSV output formats
- Configuration Management: Flexible configuration via files, environment variables, and CLI arguments
- Progress Tracking: Real-time progress updates during embedding generation
- Dry Run Mode: Analyze files without generating embeddings
- Error Handling: Comprehensive error handling and logging
- Text files (.txt)
- Markdown (.md)
- Source code files (.rs, .py, .js, .ts, .html, .css)
- Configuration files (.json, .yaml, .yml, .toml, .xml)
- Data files (.csv, .log)
- Document files (.pdf, .odt, .ods, .odp, .odg)
- Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, .xlsx)
- Rust 1.70+ installed
- Ollama running locally with an embedding model
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Start Ollama
ollama serve
# Pull an embedding model
ollama pull nomic-embed-text
# Clone and build
git clone <repository-url>
cd ollama-embeddings
cargo build --release
Before processing large directories, use the inspect-files.sh script to analyze your files:
# Analyze a directory structure
./inspect-files.sh /path/to/documents
# Example output shows:
# - File counts and sizes per directory
# - Supported vs unsupported file types
# - Processing time estimates
# - Suggested command parameters
Use the interactive menu for common operations:
./menu.sh
# Generate embeddings for a directory
./target/release/ollama-embeddings -d /path/to/documents
# Use a specific model
./target/release/ollama-embeddings -d /path/to/documents -m nomic-embed-text
# Save to file
./target/release/ollama-embeddings -d /path/to/documents -o embeddings.json
# Dry run to analyze files first
./target/release/ollama-embeddings -d /path/to/documents --dry-run
# Remove duplicate file references (symlinks, etc.)
./target/release/ollama-embeddings -d /path/to/documents --deduplicate-files
# Keep only newest version of files with same name
./target/release/ollama-embeddings -d /path/to/documents --detect-newest
# Combine both deduplication features
./target/release/ollama-embeddings -d /path/to/documents --deduplicate-files --detect-newest
You can retrain existing embeddings with a different model without re-parsing the original files:
# Retrain embeddings with a different model
./target/release/ollama-embeddings --retrain-from /path/to/existing/embeddings --model new-model-name
# Example: Switch from nomic-embed-text to llama2
./target/release/ollama-embeddings --retrain-from ./embeddings --model llama2
# Retrain with custom output location
./target/release/ollama-embeddings --retrain-from ./embeddings --model mistral -o retrained_embeddings.json
Retrain Features:
- Preserves original text chunks and metadata
- Generates new embeddings using the specified model
- Adds retrain metadata (original model, retrain timestamp)
- Creates new output files with model name suffix
- Supports both JSON and JSONL formats
- Maintains chunk relationships and document structure
# Use custom Ollama server
./target/release/ollama-embeddings -d /path/to/docs -u http://remote-ollama:11434
# Use configuration file
./target/release/ollama-embeddings -d /path/to/docs -c config.toml
# Verbose logging
./target/release/ollama-embeddings -d /path/to/docs -v
Command-line options:
- -d, --directory <DIR>: Input directory to process (required)
- -c, --config <FILE>: Configuration file path
- -o, --output <FILE>: Output file path
- -m, --model <MODEL>: Ollama model to use for embeddings
- -u, --url <URL>: Ollama server URL
- --dry-run: Analyze files without generating embeddings
- -v, --verbose: Enable verbose logging
- --deduplicate-files: Remove duplicate file references (symlinks, hardlinks) and sort processing order
- --detect-newest: When multiple files have the same name, keep only the newest based on modification time
--deduplicate-files: Removes duplicate file references by canonicalizing paths and detecting when the same physical file is accessed via different paths (e.g., through symlinks). Files are sorted for consistent processing order.
--detect-newest: When multiple files share the same filename (in different directories), this option keeps only the newest version based on modification timestamp. Useful for:
- Backup directories with multiple versions
- Archive folders with dated copies
- Development directories with old/new versions
These options can be used independently or together for comprehensive duplicate handling; the sketch below illustrates the underlying logic.
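For illustration, here is a minimal std-only Rust sketch of both behaviors. It is a sketch of the approach described above, not the tool's actual implementation:

```rust
use std::collections::{HashMap, HashSet};
use std::fs;
use std::path::PathBuf;
use std::time::SystemTime;

/// Collapse symlinked/duplicate paths to one canonical entry each,
/// then sort for a consistent processing order (--deduplicate-files).
fn deduplicate_files(paths: Vec<PathBuf>) -> Vec<PathBuf> {
    let mut seen = HashSet::new();
    let mut unique: Vec<PathBuf> = paths
        .into_iter()
        .filter_map(|p| fs::canonicalize(&p).ok()) // same physical file => same canonical path
        .filter(|canon| seen.insert(canon.clone()))
        .collect();
    unique.sort();
    unique
}

/// Among files sharing a file name, keep only the one with the newest
/// modification time (--detect-newest).
fn detect_newest(paths: Vec<PathBuf>) -> Vec<PathBuf> {
    let mut newest: HashMap<String, (SystemTime, PathBuf)> = HashMap::new();
    for path in paths {
        let Some(name) = path.file_name().map(|n| n.to_string_lossy().into_owned()) else {
            continue;
        };
        let Ok(mtime) = fs::metadata(&path).and_then(|m| m.modified()) else {
            continue;
        };
        match newest.get(&name) {
            Some((best, _)) if *best >= mtime => {} // existing entry is newer, keep it
            _ => {
                newest.insert(name, (mtime, path));
            }
        }
    }
    newest.into_values().map(|(_, p)| p).collect()
}
```

Canonicalization makes two paths to the same physical file compare equal, which is what lets symlinked duplicates collapse to a single entry.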
The project includes an interactive menu script that provides easy access to all build, test, and run commands:
./menu.sh
Menu Features:
- Cargo & Build Commands: Build, clean, format, lint, documentation
- Unit Testing: Run all tests, specific modules, with coverage
- Application Testing: Dry runs, custom directories, embedding generation
- System Information: Check dependencies, Ollama status, build status
- Quick Start: Automated build + test workflow
- Full Workflow: Complete clean + build + test + run cycle
The menu provides a user-friendly interface for developers and makes it easy to verify functionality without memorizing command-line options.
Create a config.toml file:
[ollama]
base_url = "http://localhost:11434"
model = "nomic-embed-text"
timeout_seconds = 300
[processing]
chunk_size = 1000
chunk_overlap = 200
min_chunk_size = 100
max_file_size = 10485760 # 10MB
supported_extensions = ["txt", "md", "rs", "py", "js", "ts"]
concurrent_requests = 5
deduplicate_files = false # Remove duplicate file references (symlinks, hardlinks)
detect_newest = false # Keep only newest file when multiple files have same name
[output]
format = "Json" # "Json", "JsonLines", or "Csv"
include_metadata = true
pretty_print = true
Configuration can also be supplied through environment variables:
- OLLAMA_BASE_URL: Ollama server URL
- OLLAMA_MODEL: Model name for embeddings
- OLLAMA_TIMEOUT_SECONDS: Request timeout
- CHUNK_SIZE: Text chunk size in characters
- CHUNK_OVERLAP: Overlap between chunks
- MIN_CHUNK_SIZE: Minimum chunk size to keep
- MAX_FILE_SIZE: Maximum file size to process
- CONCURRENT_REQUESTS: Number of concurrent API requests
- OUTPUT_FORMAT: Output format (json, jsonlines, csv)
- OUTPUT_FILE: Output file path
- INCLUDE_METADATA: Include file metadata (true/false)
- PRETTY_PRINT: Pretty print JSON output (true/false)
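As a rough sketch of how such overrides can be layered over file-based settings (the env_or helper and its defaults are illustrative, not the tool's actual API):

```rust
use std::env;
use std::str::FromStr;

/// Read an environment-variable override, falling back to the configured
/// default when the variable is unset or unparsable.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    let chunk_size: usize = env_or("CHUNK_SIZE", 1000);
    let base_url: String = env_or("OLLAMA_BASE_URL", "http://localhost:11434".to_string());
    println!("chunk_size={chunk_size}, base_url={base_url}");
}
```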
[
{
"chunk": {
"id": "chunk-uuid",
"content": "Text content of the chunk...",
"source_file": "/path/to/file.txt",
"chunk_index": 0,
"start_position": 0,
"end_position": 500,
"metadata": {
"file_path": "/path/to/file.txt",
"mime_type": "text/plain",
"file_size": "1024",
"chunk_size": "500",
"file_extension": "txt",
"language": "text"
}
},
"embedding": {
"id": "embedding-uuid",
"chunk_id": "chunk-uuid",
"vector": [0.1, 0.2, 0.3, ...],
"model": "nomic-embed-text",
"created_at": "2024-01-01T12:00:00Z"
}
}
]
Each line contains a complete embedding document:
{"chunk": {...}, "embedding": {...}}
{"chunk": {...}, "embedding": {...}}The application is structured into several modules:
- file_parser: Handles file discovery, reading, and content extraction
- embedding: Manages text chunking and embedding document creation
- ollama_client: HTTP client for Ollama API communication
- config: Configuration management and validation
- main: CLI interface and application orchestration
file_parser:
- Recursively traverses directories
- Filters files by extension and size
- Handles text encoding detection
- Validates text content
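A simplified, std-only sketch of the discovery step, assuming extension and size filters like those in the configuration above; the real module additionally handles encoding detection and content validation:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Recursively collect files whose extension is in `extensions` and whose
/// size does not exceed `max_size` bytes.
fn discover_files(dir: &Path, extensions: &[&str], max_size: u64, out: &mut Vec<PathBuf>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            // Descend into subdirectories.
            discover_files(&path, extensions, max_size, out);
        } else if path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| extensions.contains(&e))
        {
            // Keep the file only if it passes the size filter.
            if let Ok(meta) = entry.metadata() {
                if meta.len() <= max_size {
                    out.push(path);
                }
            }
        }
    }
}
```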
embedding:
- Splits text into overlapping chunks
- Finds natural breaking points (sentences, paragraphs)
- Adds metadata to chunks
- Detects programming languages
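A condensed sketch of the overlapping-chunk strategy, assuming character-based sizes as configured above; the actual break-point heuristics may differ:

```rust
/// Split `text` into chunks of at most `chunk_size` characters, carrying
/// `overlap` characters between consecutive chunks, and preferring to
/// break at a sentence or paragraph boundary.
fn chunk_text(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = (start + chunk_size).min(chars.len());
        // Look backwards for a natural break so sentences are not cut in half.
        if end < chars.len() {
            if let Some(pos) = chars[start..end]
                .iter()
                .rposition(|&c| matches!(c, '\n' | '.' | '!' | '?'))
            {
                // Only accept the break if the chunk stays reasonably large.
                if pos + 1 > chunk_size / 2 {
                    end = start + pos + 1;
                }
            }
        }
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        // Step forward, keeping `overlap` characters of context,
        // while guaranteeing progress.
        start = end.saturating_sub(overlap).max(start + 1);
    }
    chunks
}
```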
ollama_client:
- Communicates with Ollama API
- Handles model validation
- Supports batch processing
- Includes progress tracking
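For reference, a minimal example of one request against Ollama's /api/embeddings endpoint, assuming reqwest (with its json feature), serde, and a tokio async runtime:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct EmbeddingRequest<'a> {
    model: &'a str,
    prompt: &'a str,
}

#[derive(Deserialize)]
struct EmbeddingResponse {
    embedding: Vec<f32>,
}

/// Request an embedding vector for one chunk of text from an Ollama server.
async fn embed(base_url: &str, model: &str, text: &str) -> Result<Vec<f32>, reqwest::Error> {
    let client = reqwest::Client::new();
    let resp: EmbeddingResponse = client
        .post(format!("{base_url}/api/embeddings"))
        .json(&EmbeddingRequest { model, prompt: text })
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(resp.embedding)
}
```

The endpoint returns a JSON object containing an embedding array; batch processing wraps calls like this with concurrency controls like those described below.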
cargo test
RUST_LOG=debug cargo run -- -d /path/to/docs -v
To support new file types, add extensions to the supported_extensions list in your configuration or modify the default list in ProcessingConfig::default(), sketched below.
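The struct below is inferred from the configuration keys shown earlier and may not match the actual definition; it only illustrates where a new extension would be added:

```rust
pub struct ProcessingConfig {
    pub chunk_size: usize,
    pub chunk_overlap: usize,
    pub min_chunk_size: usize,
    pub max_file_size: u64,
    pub supported_extensions: Vec<String>,
    pub concurrent_requests: usize,
    pub deduplicate_files: bool,
    pub detect_newest: bool,
}

impl Default for ProcessingConfig {
    fn default() -> Self {
        Self {
            chunk_size: 1000,
            chunk_overlap: 200,
            min_chunk_size: 100,
            max_file_size: 10 * 1024 * 1024, // 10 MB
            // Add new extensions here, e.g. "org" for Org-mode files.
            supported_extensions: ["txt", "md", "rs", "py", "js", "ts"]
                .iter()
                .map(|s| s.to_string())
                .collect(),
            concurrent_requests: 5,
            deduplicate_files: false,
            detect_newest: false,
        }
    }
}
```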
- File Size: Large files are chunked to manage memory usage
- Concurrency: Batch processing with configurable concurrency limits (see the sketch after this list)
- Rate Limiting: Built-in delays between requests to avoid overwhelming Ollama
- Memory Usage: Streaming processing for large directories
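One common way to combine a concurrency cap with inter-request delays is a tokio semaphore. This is an illustrative pattern rather than the tool's actual code; embed_chunk and the 50 ms delay are stand-ins:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Stand-in for the real embedding request.
async fn embed_chunk(_text: &str) { /* call the Ollama API here */ }

/// Cap in-flight requests at `limit` and pause briefly between
/// submissions to avoid overwhelming the Ollama server.
async fn process_chunks(chunks: Vec<String>, limit: usize) {
    let semaphore = Arc::new(Semaphore::new(limit));
    let mut handles = Vec::new();
    for chunk in chunks {
        // Wait until one of the `limit` permits is free.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // held for the duration of the request
            embed_chunk(&chunk).await;
        }));
        tokio::time::sleep(Duration::from_millis(50)).await; // simple rate limiting
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```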
- Ollama not running: Ensure Ollama is started with ollama serve
- Model not available: Pull the required model with ollama pull <model-name>
- Connection timeout: Increase timeout_seconds in configuration
- Out of memory: Reduce chunk_size or max_file_size
Enable debug logging to see detailed processing information:
RUST_LOG=debug ./target/release/ollama-embeddings -d /path/to/docs -v
A "model not found" error occurs when the embedding model is not properly installed or is registered under a different name.
Solution:
- Check available models: ollama list
- If the model shows as nomic-embed-text:latest, use the full name: ./target/release/ollama-embeddings -d /path/to/docs --model nomic-embed-text:latest
- If the model is missing, install it: ollama pull nomic-embed-text
If the application cannot connect to the Ollama server:
- Start the Ollama service: ollama serve
- Check whether it is running: ps aux | grep ollama
- Test connectivity: curl http://localhost:11434/api/tags
- Use a custom URL: --url http://your-ollama-server:11434
If you run into permission errors:
- Make the binary executable: chmod +x target/release/ollama-embeddings
- Check directory permissions: ls -la /path/to/documents
- Use absolute paths instead of relative paths
If processing runs out of memory:
- Process smaller directories or use file filters
- Reduce chunk size in the configuration
- Use --dry-run to estimate memory usage first
If embedding generation is slow:
- Use faster embedding models (smaller parameter count)
- Reduce chunk overlap in configuration
- Process files in smaller batches
- Use SSD storage for better I/O performance
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- Support for PDF files ✅
- Support for OpenDocument files (ODT, ODS, ODP, ODG) ✅
- Support for Microsoft Office files (DOC, DOCX, PPT, PPTX, XLS, XLSX) ✅
- Database storage backends
- Web interface
- Docker containerization
- Kubernetes deployment manifests
- Integration with vector databases (Qdrant, Pinecone, etc.)