Distributed-Task-Scheduler

Case Study:

Executive Summary

Designed and implemented a high-performance, distributed background job processing system to solve the problem of handling long-running and asynchronous operations in a scalable and resilient manner. The system decouples task submission from execution, allowing main application threads to respond quickly while ensuring tasks are processed reliably across multiple worker nodes. This architecture is a foundational component of modern microservices, similar to systems like AWS Step Functions or Google Cloud Tasks.

The Problem

In monolithic applications, time-consuming processes like sending emails, generating reports, or processing files are often executed within the same request/response cycle. This leads to:

Poor User Experience: Users face slow response times as they wait for these processes to finish.
System Vulnerability: A failure in one of these processes (e.g., an external API being slow) can cause the entire application request to fail.
Lack of Scalability: It's difficult to scale these specific workloads independently from the main web application.

The Solution

A distributed task scheduling system built on a producer-consumer model using Spring Boot. The solution consists of three core components:

REST API (Producer): Accepts task submissions, persists them to a database, and places them on a message queue.
Redis Message Queue: Acts as a durable buffer, decoupling the API from the workers and allowing for asynchronous communication.
Worker Processes (Consumers): Independently scalable services that poll the queue, process tasks, and update their status, implementing retry logic for failures.

Technical Architecture & Implementation

Tech Stack: Java 17, Spring Boot 3.3, Spring Data JPA, Redis (Lettuce client), PostgreSQL, Docker, Docker Compose, Maven.

System Workflow:

sequenceDiagram
    participant C as Client
    participant API as Spring Boot API
    participant DB as PostgreSQL
    participant Q as Redis Queue
    participant W as Worker

    C->>API: POST /api/tasks {type, payload}
    API->>DB: Persist Task (status=PENDING)
    API->>Q: LPUSH task_id
    API->>C: 202 Accepted + task_id

    loop Polling
        W->>Q: BRPOP task_id
        W->>DB: Update status=PROCESSING
        W->>W: Execute Task Logic
        alt Success
            W->>DB: Update status=SUCCESS
        else Failure
            W->>DB: Increment retry_count
            alt Retries Available
                W->>Q: LPUSH task_id
            else Max Retries
                W->>DB: Update status=FAILED
            end
        end
    end

Key Features Implemented:

RESTful API: Provided clean endpoints for submitting and monitoring tasks, returning immediate acceptance (HTTP 202) for asynchronous processing.
Database Persistence: Used PostgreSQL with Spring Data JPA to reliably store all task metadata, ensuring no task is lost and its state is always traceable.
Redis for Decoupling: Implemented Redis lists as a simple, yet powerful, message queue to reliably pass tasks from the API to the workers.
Robust Worker Logic: Engineered idempotent workers that handle task processing, manage state transitions (PENDING → PROCESSING → SUCCESS/FAILED), and implement exponential backoff for retries on transient failures.
Containerization: Defined multi-container Docker environments using Docker Compose for seamless local development and testing, mimicking a production-like setup.

Challenges & Solutions

Challenge: Ensuring At-Least-Once Delivery
- Problem: A network glitch could cause a worker to crash after processing a task but before marking it as complete, leading to the task being redelivered and executed twice.
- Solution: Designed all task logic to be idempotent (able to be applied multiple times without changing the result beyond the initial application). This meant using unique IDs in operations and checking if an action had already been performed before doing it again.
Challenge: Coordinating Concurrent Workers
- Problem: Multiple workers pulling from the same queue could create race conditions or process the same task.
- Solution: Leveraged Redis's BRPOP command, which atomically removes a task from the list. This operation is atomic and thread-safe, guaranteeing that only one worker receives a given task.
Challenge: Monitoring and Debugging
- Problem: Understanding what's happening inside a distributed system is complex.
- Solution: Integrated Spring Boot Actuator for health checks and metrics. Implemented comprehensive logging throughout the task lifecycle and provided a simple API endpoint (GET /api/tasks/{id}) to check the status of any task.

Results & Impact

Theoretical Performance: The decoupled architecture allows the web API to remain responsive regardless of backend task load. Horizontal scaling is achieved simply by launching more worker containers, improving system throughput significantly.
Improved Reliability: The retry mechanism and persistent state management ensured task execution was resilient to temporary failures (e.g., network timeouts, third-party API downtime).
Foundation for Scale: This project demonstrated a critical software architecture pattern (asynchronous processing) that is essential for building scalable, cloud-native applications. It served as a practical implementation of key distributed systems concepts.

Lessons Learned

This project provided deep, hands-on experience with the challenges of distributed systems, including concurrency, state management, and message-based communication. It solidified understanding of how to use databases and queues together to build reliable systems and highlighted the importance of idempotency and careful logging in a microservices environment.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Distributed-Task-Scheduler

Case Study:

Executive Summary

The Problem

The Solution

Technical Architecture & Implementation

Challenges & Solutions

Results & Impact

Lessons Learned

About

Uh oh!

Releases

Packages

darpan02-cypher/Distributed-Task-Scheduler

Folders and files

Latest commit

History

Repository files navigation

Distributed-Task-Scheduler

Case Study:

Executive Summary

The Problem

The Solution

Technical Architecture & Implementation

Challenges & Solutions

Results & Impact

Lessons Learned

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages