A Rust application that parses files in a directory tree and generates AI embeddings using the Ollama API. The tool splits text files into chunks and generates vector embeddings that can be used for semantic search, RAG (Retrieval-Augmented Generation), and other AI applications.
- File Parsing: Recursively parses directories and extracts text content from supported file types
- Text Chunking: Intelligently splits text into overlapping chunks with configurable sizes
- Ollama Integration: Generates embeddings using local Ollama models
- Multiple Output Formats: Supports JSON, JSON Lines, and CSV output formats
- Configuration Management: Flexible configuration via files, environment variables, and CLI arguments
- Progress Tracking: Real-time progress updates during embedding generation
- Dry Run Mode: Analyze files without generating embeddings
- Error Handling: Comprehensive error handling and logging
- Text files (.txt)
- Markdown (.md)
- Source code files (.rs, .py, .js, .ts, .html, .css)
- Configuration files (.json, .yaml, .yml, .toml, .xml)
- Data files (.csv, .log)
- Document files (.pdf, .odt, .ods, .odp, .odg)
- Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, .xlsx)
- Rust 1.70+ installed
- Ollama running locally with an embedding model
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Start Ollama
ollama serve
# Pull an embedding model
ollama pull nomic-embed-text
# Clone and build
git clone <repository-url>
cd ollama-embeddings
cargo build --release
Before processing large directories, use the inspect-files.sh script to analyze your files:
# Analyze a directory structure
./inspect-files.sh /path/to/documents
# Example output shows:
# - File counts and sizes per directory
# - Supported vs unsupported file types
# - Processing time estimates
# - Suggested command parameters
Use the interactive menu for common operations:
./menu.sh
# Generate embeddings for a directory
./target/release/ollama-embeddings -d /path/to/documents
# Use a specific model
./target/release/ollama-embeddings -d /path/to/documents -m nomic-embed-text
# Save to file
./target/release/ollama-embeddings -d /path/to/documents -o embeddings.json
# Dry run to analyze files first
./target/release/ollama-embeddings -d /path/to/documents --dry-run
# Remove duplicate file references (symlinks, etc.)
./target/release/ollama-embeddings -d /path/to/documents --deduplicate-files
# Keep only newest version of files with same name
./target/release/ollama-embeddings -d /path/to/documents --detect-newest
# Combine both deduplication features
./target/release/ollama-embeddings -d /path/to/documents --deduplicate-files --detect-newest
You can retrain existing embeddings with a different model without re-parsing the original files:
# Retrain embeddings with a different model
./target/release/ollama-embeddings --retrain-from /path/to/existing/embeddings --model new-model-name
# Example: Switch from nomic-embed-text to llama2
./target/release/ollama-embeddings --retrain-from ./embeddings --model llama2
# Retrain with custom output location
./target/release/ollama-embeddings --retrain-from ./embeddings --model mistral -o retrained_embeddings.json
Retrain Features:
- Preserves original text chunks and metadata
- Generates new embeddings using the specified model
- Adds retrain metadata (original model, retrain timestamp)
- Creates new output files with model name suffix
- Supports both JSON and JSONL formats
- Maintains chunk relationships and document structure
# Use custom Ollama server
./target/release/ollama-embeddings -d /path/to/docs -u http://remote-ollama:11434
# Use configuration file
./target/release/ollama-embeddings -d /path/to/docs -c config.toml
# Verbose logging
./target/release/ollama-embeddings -d /path/to/docs -v
Command-line options:
- -d, --directory <DIR>: Input directory to process (required)
- -c, --config <FILE>: Configuration file path
- -o, --output <FILE>: Output file path
- -m, --model <MODEL>: Ollama model to use for embeddings
- -u, --url <URL>: Ollama server URL
- --dry-run: Analyze files without generating embeddings
- -v, --verbose: Enable verbose logging
- --deduplicate-files: Remove duplicate file references (symlinks, hardlinks) and sort processing order
- --detect-newest: When multiple files have the same name, keep only the newest based on modification time
--deduplicate-files: Removes duplicate file references by canonicalizing paths and detecting when the same physical file is accessed via different paths (e.g., through symlinks). Files are sorted for consistent processing order.
--detect-newest: When multiple files share the same filename (in different directories), this option keeps only the newest version based on modification timestamp. Useful for:
- Backup directories with multiple versions
- Archive folders with dated copies
- Development directories with old/new versions
These options can be used independently or together for comprehensive duplicate handling; the sketch below illustrates the underlying logic.
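For illustration, here is a minimal std-only Rust sketch of both behaviors. It is a sketch of the approach described above, not the tool's actual implementation:

```rust
use std::collections::{HashMap, HashSet};
use std::fs;
use std::path::PathBuf;
use std::time::SystemTime;

/// Collapse symlinked/duplicate paths to one canonical entry each,
/// then sort for a consistent processing order (--deduplicate-files).
fn deduplicate_files(paths: Vec<PathBuf>) -> Vec<PathBuf> {
    let mut seen = HashSet::new();
    let mut unique: Vec<PathBuf> = paths
        .into_iter()
        .filter_map(|p| fs::canonicalize(&p).ok()) // same physical file => same canonical path
        .filter(|canon| seen.insert(canon.clone()))
        .collect();
    unique.sort();
    unique
}

/// Among files sharing a file name, keep only the one with the newest
/// modification time (--detect-newest).
fn detect_newest(paths: Vec<PathBuf>) -> Vec<PathBuf> {
    let mut newest: HashMap<String, (SystemTime, PathBuf)> = HashMap::new();
    for path in paths {
        let Some(name) = path.file_name().map(|n| n.to_string_lossy().into_owned()) else {
            continue;
        };
        let Ok(mtime) = fs::metadata(&path).and_then(|m| m.modified()) else {
            continue;
        };
        match newest.get(&name) {
            Some((best, _)) if *best >= mtime => {} // existing entry is newer, keep it
            _ => {
                newest.insert(name, (mtime, path));
            }
        }
    }
    newest.into_values().map(|(_, p)| p).collect()
}
```

Canonicalization makes two paths to the same physical file compare equal, which is what lets symlinked duplicates collapse to a single entry.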
The project includes an interactive menu script that provides easy access to all build, test, and run commands:
./menu.sh
Menu Features:
- Cargo & Build Commands: Build, clean, format, lint, documentation
- Unit Testing: Run all tests, specific modules, with coverage
- Application Testing: Dry runs, custom directories, embedding generation
- System Information: Check dependencies, Ollama status, build status
- Quick Start: Automated build + test workflow
- Full Workflow: Complete clean + build + test + run cycle
The menu provides a user-friendly interface for developers and makes it easy to verify functionality without memorizing command-line options.
Create a config.toml file:
[ollama]
base_url = "http://localhost:11434"
model = "nomic-embed-text"
timeout_seconds = 300
[processing]
chunk_size = 1000
chunk_overlap = 200
min_chunk_size = 100
max_file_size = 10485760 # 10MB
supported_extensions = ["txt", "md", "rs", "py", "js", "ts"]
concurrent_requests = 5
deduplicate_files = false # Remove duplicate file references (symlinks, hardlinks)
detect_newest = false # Keep only newest file when multiple files have same name
[output]
format = "Json" # "Json", "JsonLines", or "Csv"
include_metadata = true
pretty_print = true
Configuration can also be supplied through environment variables:
- OLLAMA_BASE_URL: Ollama server URL
- OLLAMA_MODEL: Model name for embeddings
- OLLAMA_TIMEOUT_SECONDS: Request timeout
- CHUNK_SIZE: Text chunk size in characters
- CHUNK_OVERLAP: Overlap between chunks
- MIN_CHUNK_SIZE: Minimum chunk size to keep
- MAX_FILE_SIZE: Maximum file size to process
- CONCURRENT_REQUESTS: Number of concurrent API requests
- OUTPUT_FORMAT: Output format (json, jsonlines, csv)
- OUTPUT_FILE: Output file path
- INCLUDE_METADATA: Include file metadata (true/false)
- PRETTY_PRINT: Pretty print JSON output (true/false)
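As a rough sketch of how such overrides can be layered over file-based settings (the env_or helper and its defaults are illustrative, not the tool's actual API):

```rust
use std::env;
use std::str::FromStr;

/// Read an environment-variable override, falling back to the configured
/// default when the variable is unset or unparsable.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    let chunk_size: usize = env_or("CHUNK_SIZE", 1000);
    let base_url: String = env_or("OLLAMA_BASE_URL", "http://localhost:11434".to_string());
    println!("chunk_size={chunk_size}, base_url={base_url}");
}
```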
[
{
"chunk": {
"id": "chunk-uuid",
"content": "Text content of the chunk...",
"source_file": "/path/to/file.txt",
"chunk_index": 0,
"start_position": 0,
"end_position": 500,
"metadata": {
"file_path": "/path/to/file.txt",
"mime_type": "text/plain",
"file_size": "1024",
"chunk_size": "500",
"file_extension": "txt",
"language": "text"
}
},
"embedding": {
"id": "embedding-uuid",
"chunk_id": "chunk-uuid",
"vector": [0.1, 0.2, 0.3, ...],
"model": "nomic-embed-text",
"created_at": "2024-01-01T12:00:00Z"
}
}
]
Each line contains a complete embedding document:
{"chunk": {...}, "embedding": {...}}
{"chunk": {...}, "embedding": {...}}The application is structured into several modules:
- file_parser: Handles file discovery, reading, and content extraction
- embedding: Manages text chunking and embedding document creation
- ollama_client: HTTP client for Ollama API communication
- config: Configuration management and validation
- main: CLI interface and application orchestration
file_parser:
- Recursively traverses directories
- Filters files by extension and size
- Handles text encoding detection
- Validates text content
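A simplified, std-only sketch of the discovery step, assuming extension and size filters like those in the configuration above; the real module additionally handles encoding detection and content validation:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Recursively collect files whose extension is in `extensions` and whose
/// size does not exceed `max_size` bytes.
fn discover_files(dir: &Path, extensions: &[&str], max_size: u64, out: &mut Vec<PathBuf>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            // Descend into subdirectories.
            discover_files(&path, extensions, max_size, out);
        } else if path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| extensions.contains(&e))
        {
            // Keep the file only if it passes the size filter.
            if let Ok(meta) = entry.metadata() {
                if meta.len() <= max_size {
                    out.push(path);
                }
            }
        }
    }
}
```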
embedding:
- Splits text into overlapping chunks
- Finds natural breaking points (sentences, paragraphs)
- Adds metadata to chunks
- Detects programming languages
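A condensed sketch of the overlapping-chunk strategy, assuming character-based sizes as configured above; the actual break-point heuristics may differ:

```rust
/// Split `text` into chunks of at most `chunk_size` characters, carrying
/// `overlap` characters between consecutive chunks, and preferring to
/// break at a sentence or paragraph boundary.
fn chunk_text(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = (start + chunk_size).min(chars.len());
        // Look backwards for a natural break so sentences are not cut in half.
        if end < chars.len() {
            if let Some(pos) = chars[start..end]
                .iter()
                .rposition(|&c| matches!(c, '\n' | '.' | '!' | '?'))
            {
                // Only accept the break if the chunk stays reasonably large.
                if pos + 1 > chunk_size / 2 {
                    end = start + pos + 1;
                }
            }
        }
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        // Step forward, keeping `overlap` characters of context,
        // while guaranteeing progress.
        start = end.saturating_sub(overlap).max(start + 1);
    }
    chunks
}
```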
ollama_client:
- Communicates with Ollama API
- Handles model validation
- Supports batch processing
- Includes progress tracking
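For reference, a minimal example of one request against Ollama's /api/embeddings endpoint, assuming reqwest (with its json feature), serde, and a tokio async runtime:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct EmbeddingRequest<'a> {
    model: &'a str,
    prompt: &'a str,
}

#[derive(Deserialize)]
struct EmbeddingResponse {
    embedding: Vec<f32>,
}

/// Request an embedding vector for one chunk of text from an Ollama server.
async fn embed(base_url: &str, model: &str, text: &str) -> Result<Vec<f32>, reqwest::Error> {
    let client = reqwest::Client::new();
    let resp: EmbeddingResponse = client
        .post(format!("{base_url}/api/embeddings"))
        .json(&EmbeddingRequest { model, prompt: text })
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(resp.embedding)
}
```

The endpoint returns a JSON object containing an embedding array; batch processing wraps calls like this with concurrency controls like those described below.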
cargo test
RUST_LOG=debug cargo run -- -d /path/to/docs -v
To support new file types, add extensions to the supported_extensions list in your configuration or modify the default list in ProcessingConfig::default(), sketched below.
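The struct below is inferred from the configuration keys shown earlier and may not match the actual definition; it only illustrates where a new extension would be added:

```rust
pub struct ProcessingConfig {
    pub chunk_size: usize,
    pub chunk_overlap: usize,
    pub min_chunk_size: usize,
    pub max_file_size: u64,
    pub supported_extensions: Vec<String>,
    pub concurrent_requests: usize,
    pub deduplicate_files: bool,
    pub detect_newest: bool,
}

impl Default for ProcessingConfig {
    fn default() -> Self {
        Self {
            chunk_size: 1000,
            chunk_overlap: 200,
            min_chunk_size: 100,
            max_file_size: 10 * 1024 * 1024, // 10 MB
            // Add new extensions here, e.g. "org" for Org-mode files.
            supported_extensions: ["txt", "md", "rs", "py", "js", "ts"]
                .iter()
                .map(|s| s.to_string())
                .collect(),
            concurrent_requests: 5,
            deduplicate_files: false,
            detect_newest: false,
        }
    }
}
```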
- File Size: Large files are chunked to manage memory usage
- Concurrency: Batch processing with configurable concurrency limits (see the sketch after this list)
- Rate Limiting: Built-in delays between requests to avoid overwhelming Ollama
- Memory Usage: Streaming processing for large directories
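One common way to combine a concurrency cap with inter-request delays is a tokio semaphore. This is an illustrative pattern rather than the tool's actual code; embed_chunk and the 50 ms delay are stand-ins:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Stand-in for the real embedding request.
async fn embed_chunk(_text: &str) { /* call the Ollama API here */ }

/// Cap in-flight requests at `limit` and pause briefly between
/// submissions to avoid overwhelming the Ollama server.
async fn process_chunks(chunks: Vec<String>, limit: usize) {
    let semaphore = Arc::new(Semaphore::new(limit));
    let mut handles = Vec::new();
    for chunk in chunks {
        // Wait until one of the `limit` permits is free.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // held for the duration of the request
            embed_chunk(&chunk).await;
        }));
        tokio::time::sleep(Duration::from_millis(50)).await; // simple rate limiting
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```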
- Ollama not running: Ensure Ollama is started with ollama serve
- Model not available: Pull the required model with ollama pull <model-name>
- Connection timeout: Increase timeout_seconds in configuration
- Out of memory: Reduce chunk_size or max_file_size
Enable debug logging to see detailed processing information:
RUST_LOG=debug ./target/release/ollama-embeddings -d /path/to/docs -v
A "model not found" error occurs when the embedding model is not properly installed or is registered under a different name.
Solution:
- Check available models: ollama list
- If the model shows as nomic-embed-text:latest, use the full name: ./target/release/ollama-embeddings -d /path/to/docs --model nomic-embed-text:latest
- If the model is missing, install it: ollama pull nomic-embed-text
If the application cannot connect to the Ollama server:
- Start the Ollama service: ollama serve
- Check whether it is running: ps aux | grep ollama
- Test connectivity: curl http://localhost:11434/api/tags
- Use a custom URL: --url http://your-ollama-server:11434
If you run into permission errors:
- Make the binary executable: chmod +x target/release/ollama-embeddings
- Check directory permissions: ls -la /path/to/documents
- Use absolute paths instead of relative paths
If processing runs out of memory:
- Process smaller directories or use file filters
- Reduce chunk size in the configuration
- Use --dry-run to estimate memory usage first
If embedding generation is slow:
- Use faster embedding models (smaller parameter count)
- Reduce chunk overlap in configuration
- Process files in smaller batches
- Use SSD storage for better I/O performance
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- Support for PDF files ✅
- Support for OpenDocument files (ODT, ODS, ODP, ODG) ✅
- Support for Microsoft Office files (DOC, DOCX, PPT, PPTX, XLS, XLSX) ✅
- Database storage backends
- Web interface
- Docker containerization
- Kubernetes deployment manifests
- Integration with vector databases (Qdrant, Pinecone, etc.)