- Project Requirements
- System Architecture Overview
- Project Directory Structure
- Technology Stack
- Quick Start
- Installation
- Development
- API Documentation
- Testing
- Performance Benchmarks
- Docker Deployment
- CI/CD Pipeline
- Monitoring & Observability
- Known Issues
- Contributing
Serving Service (FastAPI)
- Expose a POST /summarize endpoint
- Accept JSON text payload
- Forward text to Processing Service via gRPC
- Return processed result to client
Processing Service (gRPC)
- Expose a ProcessText gRPC method
- Perform NLP processing on input text
- Return processed result to Serving Service
NLP Features Implemented
- Tokenization (split into words/tokens)
- Sentence Splitting (split into sentences)
- Keyword Extraction (top N important words/phrases)
- Sentiment Analysis (positive/negative polarity)
- Part-of-Speech Tagging (identifying nouns, verbs, etc.)
- Named Entity Recognition (people, organizations, locations)
- Summarization (extractive)
- Text Classification (categorizing text)
Technical Requirements
- Python for both services
- FastAPI for HTTP service
- grpcio for gRPC service
- Async I/O where relevant
- Error handling and logging
- Containerized setup with Docker/docker-compose
- Health check endpoints
- Unit tests for core logic
- GitHub Actions for CI/CD
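The gRPC side of these requirements is small in practice. Below is a minimal, illustrative sketch of a ProcessText servicer using grpc.aio; the message and servicer names are assumptions (the actual definitions live in shared/protos/text_processing.proto), not the project's generated code.

```python
# Hypothetical sketch of the Processing Service entry point.
# Message/servicer names are illustrative, not the project's generated code.
import asyncio

import grpc

import text_processing_pb2       # assumed generated module
import text_processing_pb2_grpc  # assumed generated module


class TextProcessingServicer(text_processing_pb2_grpc.TextProcessingServicer):
    async def ProcessText(self, request, context):
        # Real code would dispatch to the classical/modern NLP pipelines here.
        processed = request.text.upper()  # placeholder for actual NLP work
        return text_processing_pb2.ProcessResponse(result=processed)


async def serve() -> None:
    server = grpc.aio.server()
    text_processing_pb2_grpc.add_TextProcessingServicer_to_server(
        TextProcessingServicer(), server
    )
    server.add_insecure_port("[::]:50051")  # port used throughout this README
    await server.start()
    await server.wait_for_termination()


if __name__ == "__main__":
    asyncio.run(serve())
```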
graph TB
subgraph "Client Layer"
C[HTTP Client]
end
subgraph "API Gateway Layer"
FS[FastAPI Serving Service<br/>:8000]
end
subgraph "Processing Layer"
PS[gRPC Processing Service<br/>:50051]
subgraph "NLP Pipelines"
CP[Classical Pipeline<br/>spaCy-based]
MP[Modern Pipeline<br/>Transformers-based]
end
end
subgraph "Observability Layer"
OT[OpenTelemetry<br/>Console Export]
end
C -->|POST /summarize| FS
C -->|POST /compare| FS
FS -->|gRPC ProcessText| PS
PS --> CP
PS --> MP
PS -->|Metrics & Traces| OT
FS -->|Metrics & Traces| OT
- Serving Service: FastAPI-based HTTP API gateway (Port 8000)
- Processing Service: gRPC-based NLP processing engine (Port 50051)
- Pipeline System: Pluggable NLP processors with strategy pattern
- Configuration Management: Pydantic-based type-safe configs
- Observability: OpenTelemetry instrumentation (console export)
- Deployment: Docker containers orchestrated via docker-compose
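A condensed sketch of how the gateway layer can forward a request to the processing layer; the request model and stub/message names below are illustrative assumptions, not the actual code under serving_service/.

```python
# Illustrative gateway flow only; SummarizeRequest and the gRPC stub/message
# names are assumptions, not the project's actual definitions.
import grpc
from fastapi import FastAPI
from pydantic import BaseModel

import text_processing_pb2       # assumed generated module
import text_processing_pb2_grpc  # assumed generated module

app = FastAPI()


class SummarizeRequest(BaseModel):
    text: str
    pipelines: list[str] = ["classical"]


@app.post("/summarize")
async def summarize(payload: SummarizeRequest) -> dict:
    # Forward the text to the gRPC Processing Service on port 50051.
    async with grpc.aio.insecure_channel("processing-service:50051") as channel:
        stub = text_processing_pb2_grpc.TextProcessingStub(channel)
        reply = await stub.ProcessText(
            text_processing_pb2.ProcessRequest(
                text=payload.text, pipelines=payload.pipelines
            )
        )
    return {"pipelines": payload.pipelines, "result": reply.result}
```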
text-processing-microservices/
├── serving_service/ # FastAPI HTTP Service
│ ├── main.py # FastAPI application
│ ├── api/
│ │ └── endpoints/
│ │ ├── summarize.py # /summarize and /compare endpoints
│ │ └── health.py # Health checks
│ ├── services/
│ │ └── grpc_client.py # gRPC client wrapper
│ ├── models/
│ │ ├── requests.py # Request models
│ │ └── responses.py # Response models
│ ├── middleware/
│ │ └── telemetry.py # OpenTelemetry middleware
│ └── config/
│ └── settings.py # Service configuration
│
├── processing_service/ # gRPC Processing Service
│ ├── main.py # gRPC server entry
│ ├── grpc_server/
│ │ ├── server.py # gRPC server implementation
│ │ └── servicer.py # ProcessText servicer
│ ├── processors/ # NLP Processors
│ │ ├── base.py # Abstract processor
│ │ ├── classical/ # Classical NLP
│ │ │ ├── tokenizer.py # Tokenization
│ │ │ ├── sentence_splitter.py # Sentence splitting
│ │ │ ├── pos_tagger.py # POS tagging
│ │ │ ├── ner_extractor.py # Named Entity Recognition
│ │ │ └── keyword_extractor.py # TF-IDF keywords
│ │ └── modern/ # Modern NLP
│ │ ├── summarizer.py # DistilBART summarization
│ │ ├── sentiment_analyzer.py # Sentiment analysis
│ │ └── text_classifier.py # Zero-shot classification
│ ├── pipelines/ # Pipeline Orchestrators
│ │ ├── classical_pipeline.py # Classical flow
│ │ ├── modern_pipeline.py # Modern flow
│ │ └── pipeline_manager.py # Pipeline selection & comparison
│ ├── interceptors/
│ │ └── telemetry.py # gRPC telemetry interceptor
│ └── utils/ # Utilities
│ ├── metrics_collector.py # Performance metrics
│ ├── pipeline_comparator.py # Pipeline comparison
│ └── results_aggregator.py # Results aggregation
│
├── shared/ # Shared Components
│ ├── protos/ # gRPC Definitions
│ │ └── text_processing.proto # Proto definitions
│ ├── interfaces/ # Abstract Interfaces
│ │ └── processor.py # Processor interface
│ ├── exceptions/ # Custom Exceptions
│ │ └── base.py # Exception hierarchy
│ ├── utils/ # Utilities
│ │ ├── logging.py # Structured logging
│ │ └── telemetry.py # OpenTelemetry setup
│ └── config/ # Shared Configuration
│ └── base.py # Base config models
│
├── infrastructure/ # Deployment & Operations
│ ├── docker/ # Docker Configuration
│ │ ├── serving.Dockerfile # FastAPI image
│ │ └── processing.Dockerfile # gRPC image
│ └── scripts/ # Utility Scripts
│ ├── generate_proto.sh # Proto generation
│ ├── benchmark.py # Performance benchmarking
│ └── download_models.sh # Model downloads
│
├── tests/ # Test Suite
│ ├── unit/ # Unit Tests
│ ├── integration/ # Integration Tests
│ └── fixtures/ # Test Data
│
├── .github/ # GitHub Configuration
│ └── workflows/
│ └── ci.yml # CI/CD pipeline
│
├── docker-compose.yml # Container orchestration
├── docker-compose.override.yml # Development overrides
├── .env.example # Environment template
├── .gitignore # Git ignores
├── requirements.txt # Python dependencies
├── requirements-dev.txt # Dev dependencies
├── pyproject.toml # Project metadata
├── Makefile # Build commands
└── README.md # This file
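The tree above references an abstract processor (processors/base.py, shared/interfaces/processor.py) and pluggable pipelines. The sketch below shows the general strategy-pattern shape this implies; names and return formats are illustrative, loosely based on the API examples later in this README.

```python
# Illustrative processor/pipeline shape; the real modules under
# processing_service/processors and pipelines/ may differ.
from abc import ABC, abstractmethod
from typing import Any


class TextProcessor(ABC):
    """Common interface every NLP processor implements."""

    name: str = "base"

    @abstractmethod
    def process(self, text: str, **params: Any) -> dict[str, Any]:
        """Return this processor's results for the given text."""


class Tokenizer(TextProcessor):
    name = "tokenizer"

    def process(self, text: str, **params: Any) -> dict[str, Any]:
        tokens = text.split()  # placeholder for the spaCy-based tokenizer
        return {"tokens": tokens, "count": len(tokens)}


class Pipeline:
    """Runs an ordered list of processors and aggregates their results."""

    def __init__(self, processors: list[TextProcessor]) -> None:
        self.processors = processors

    def run(self, text: str, params: dict[str, dict] | None = None) -> dict[str, Any]:
        params = params or {}
        return {p.name: p.process(text, **params.get(p.name, {})) for p in self.processors}


classical = Pipeline([Tokenizer()])
print(classical.run("The organization shall establish procedures."))
```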
- Language: Python 3.10.12
- HTTP Framework: FastAPI 0.116.1
- RPC Framework: gRPC 1.74.0 with Protocol Buffers
- Classical NLP: spaCy 3.8.7
- Modern NLP: HuggingFace Transformers 4.56.1
- Configuration: Pydantic 2.11.9
- Testing: pytest 8.4.2
- Containerization: Docker, docker-compose
- Observability: OpenTelemetry 1.37.0
- CI/CD: GitHub Actions
- Docker and docker-compose installed
- Python 3.10+ (for local development)
- Make (for convenience commands)
# Clone repository
git clone <repository-url>
cd text-processing-microservices
# Copy environment configuration
cp .env.example .env
# Start services
make docker-up
# Wait for model downloads (first run only, ~1GB)
sleep 60
# Test classical pipeline
curl -X POST http://localhost:8000/summarize \
-H "Content-Type: application/json" \
-d '{"text": "The organization shall establish procedures.", "pipelines": ["classical"]}'
# Test modern pipeline
curl -X POST http://localhost:8000/summarize \
-H "Content-Type: application/json" \
-d '{"text": "The organization shall establish procedures.", "pipelines": ["modern"]}'
# Compare pipelines
curl -X POST http://localhost:8000/compare \
-H "Content-Type: application/json" \
-d '{"text": "The organization shall establish comprehensive procedures."}'
# Check health
curl http://localhost:8000/health
# Stop services
make docker-down

# Setup Python environment
python3.10 -m venv venv
source venv/bin/activate
# Install dependencies
make install
# Generate proto files with relative imports
make proto
# Download spaCy model
python -m spacy download en_core_web_sm
# Format code
make format
# Run linting
make lint
# Start services locally
python processing_service/main.py &
python serving_service/main.py

- Python: 3.10.12 or higher
- Docker: 20.10+ (for containerized deployment)
- Memory: Minimum 4GB RAM (8GB recommended)
- Storage: ~7GB for Docker images, ~1GB for models
make help # Show available commands
make install # Install dependencies
make test # Run tests
make test-cov # Run tests with coverage
make lint # Run linters
make format # Format code
make clean # Clean cache files
make proto # Generate proto files
make docker-build # Build Docker images
make docker-up # Start services
make docker-down # Stop services
make docker-logs # View service logs

The project uses the following tools for code quality:
- black: Code formatting
- isort: Import sorting
- flake8: Linting
- mypy: Type checking
Run all checks:
make format
make lint

Proto files must be regenerated after any changes:
make proto

This command generates Python code from proto definitions and fixes imports to use relative paths.
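For reference, proto generation usually boils down to a single grpc_tools.protoc invocation; the paths below are assumptions based on the directory layout, and the actual infrastructure/scripts/generate_proto.sh may differ (it also rewrites the generated imports to be relative).

```python
# Rough Python equivalent of a proto-generation step (paths are assumptions).
from grpc_tools import protoc

protoc.main([
    "grpc_tools.protoc",                    # argv[0], ignored by protoc itself
    "-Ishared/protos",                      # proto include path
    "--python_out=shared/protos",           # generated message classes
    "--grpc_python_out=shared/protos",      # generated service stubs
    "shared/protos/text_processing.proto",
])
```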
POST /summarize

Process text through selected NLP pipelines.
Request:
{
"text": "Text to process",
"pipelines": ["classical", "modern"],
"return_metrics": true,
"processor_params": {
"keyword_extractor": {"n_keywords": 5}
}
}

Response:
{
"pipeline": "classical",
"results": {
"tokenizer": {"tokens": [...], "count": 15},
"sentence_splitter": {"sentences": [...], "count": 2},
"pos_tagger": {"pos_tags": [["The", "DT"], ...]},
"ner_extractor": {"entities": [...]},
"keyword_extractor": {"keywords": [["procedures", 0.95]]}
},
"metrics": {
"total_processing_time": 4.8,
"processor_count": 5
}
}

POST /compare

Compare multiple pipelines on the same text.
Request:
{
"text": "Text to compare",
"pipelines": ["classical", "modern"]
}

Response includes:
- Results from both pipelines
- Performance comparison
- Speed difference metrics
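If you prefer calling the API from Python instead of curl, a minimal example looks like this (it uses the requests library purely for illustration; requests is not a project dependency):

```python
# Minimal Python client example mirroring the curl calls above.
import requests

BASE_URL = "http://localhost:8000"
payload = {
    "text": "The organization shall establish comprehensive procedures.",
    "pipelines": ["classical", "modern"],
    "return_metrics": True,
}

summary = requests.post(f"{BASE_URL}/summarize", json=payload, timeout=120).json()
comparison = requests.post(
    f"{BASE_URL}/compare", json={"text": payload["text"]}, timeout=120
).json()

print(summary)
print(comparison)
```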
GET /health

Check service health status.
Response:
{
"status": "healthy",
"version": "1.0.0",
"grpc_connected": true,
"available_pipelines": ["classical", "modern"]
}

# All tests
make test
# With coverage
make test-cov
# Specific test file
pytest tests/unit/test_processors/classical/test_tokenizer.py -v
# Integration tests only
pytest tests/integration/ -v

- Total Tests: 129
- Coverage: >80%
- Execution Time: ~45 seconds (includes model loading)
- Unit tests for individual components
- Integration tests for service communication
- End-to-end tests for complete workflows
- Performance benchmarks
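A typical unit test in this suite might look roughly like the sketch below; the import path and asserted fields are assumptions based on the response format shown earlier, not the actual test file contents.

```python
# Hypothetical test sketch (see tests/unit/test_processors/classical/);
# the real tests may import and assert differently.
import pytest

from processing_service.processors.classical.tokenizer import Tokenizer  # assumed path


@pytest.fixture
def tokenizer() -> Tokenizer:
    return Tokenizer()


def test_tokenizer_reports_consistent_count(tokenizer: Tokenizer) -> None:
    result = tokenizer.process("The organization shall establish procedures.")
    assert result["count"] == len(result["tokens"])
    assert "organization" in [token.lower() for token in result["tokens"]]
```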
| Pipeline | Average Time | Min Time | Max Time | Speed Factor |
|---|---|---|---|---|
| Classical | 4.8s | 4.2s | 5.1s | 1x (baseline) |
| Modern (initial) | 12.6s | 11.8s | 13.2s | 0.38x |
| Modern (cached) | 0.28s | 0.25s | 0.31s | 17x |
python infrastructure/scripts/benchmark.py --iterations 5 --pipelines classical modern

- Classical vs Modern: Classical is ~48x faster on average (uncached)
- Time Savings: Classical saves ~98% processing time
- Model Loading: ~12 seconds for modern pipeline
- Cache Impact: 45x speedup with cached models
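Conceptually, the benchmark just times repeated /summarize calls per pipeline and aggregates the durations; a simplified sketch follows (the real infrastructure/scripts/benchmark.py collects more detail):

```python
# Simplified benchmark loop; illustrative only.
import statistics
import time

import requests  # used for illustration; not necessarily a project dependency

TEXT = "The organization shall establish comprehensive procedures."


def time_pipeline(pipeline: str, iterations: int = 5) -> list[float]:
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        requests.post(
            "http://localhost:8000/summarize",
            json={"text": TEXT, "pipelines": [pipeline]},
            timeout=300,
        )
        durations.append(time.perf_counter() - start)
    return durations


for name in ("classical", "modern"):
    runs = time_pipeline(name)
    print(f"{name}: avg={statistics.mean(runs):.2f}s min={min(runs):.2f}s max={max(runs):.2f}s")
```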
make docker-build

# Start all services
make docker-up
# View logs
make docker-logs
# Stop services
make docker-down

The project includes:
- docker-compose.yml: Production configuration
- docker-compose.override.yml: Development overrides with volume mounts
- Processing Service: 6.8GB (includes ML models)
- Serving Service: 935MB
- Model Cache: Persistent volume for model storage
The CI/CD pipeline runs on push and pull requests:
jobs:
  test:
    - Run 129 unit tests
    - Run integration tests
    - Coverage report (>80%)
  lint:
    - Black formatting
    - isort imports
    - flake8 linting
    - mypy type checking
  docker-build:
    - Build serving image
    - Build processing image
    - Cache layers for speed

While the full CI runs on GitHub Actions, you can run tests locally:
make test
make lint
make docker-build

The system includes OpenTelemetry instrumentation with:
- Distributed tracing across services
- Metrics collection (request rate, latency, errors)
- Structured JSON logging with trace correlation
- Console export for development
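The console-export setup amounts to a few lines of OpenTelemetry SDK wiring; a hedged sketch of what shared/utils/telemetry.py might do (the actual implementation may differ):

```python
# Illustrative OpenTelemetry console-export setup; not the project's exact code.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def setup_tracing(service_name: str) -> trace.Tracer:
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)


tracer = setup_tracing("serving-service")
with tracer.start_as_current_span("summarize-request"):
    pass  # spans wrapping real work get printed to the console
```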
Enable telemetry by setting in .env:
OTEL_ENABLED=true

Key metrics collected:
- Request rate/latency/errors (RED metrics)
- Pipeline processing times
- Model inference latency
- Token processing rate
- Pipeline comparison metrics
The system is prepared for production observability with:
- Prometheus metrics export
- Jaeger distributed tracing
- Grafana dashboards
- Log aggregation with Loki
- Docker Image Size: Processing service is 6.8GB due to ML models
  - Workaround: Use volume mounts for models in production
- PyTorch Compatibility: Requires typing-extensions==4.8.0
  - Solution: Install typing-extensions before PyTorch
- First Run Performance: Initial model download takes several minutes
  - Solution: Pre-download models or use cached images
- PyTorch CPU version is used to avoid CUDA dependencies
- spaCy models must be downloaded separately
- Transformer models are downloaded on first use