mobs75 commented Nov 22, 2025

Summary

This PR adds Apache Spark integration to OpenServerless. Users can deploy and manage a standalone Spark cluster (Spark Master and History Server) alongside their serverless workloads for big data processing.

Architecture

The integration is implemented in the operator submodule and follows OpenServerless patterns:

  • Spark deployment managed by Kubernetes operator
  • Integration with existing OpenServerless components (MinIO, PostgreSQL, MongoDB, Redis)
  • Declarative configuration through Whisk CRD

Key Features

Spark Components

  • Spark Master: Standalone cluster manager with configurable resources
  • Spark History Server: Web UI for completed applications with S3-compatible storage
  • Spark Workers: Groundwork in place; dynamic scaling is planned (see Future Enhancements)

Technical Implementation

  • Resource Management: Converts memory quantities between Kubernetes format (1Gi) and JVM format (1g)
  • Service Discovery: Automatic DNS configuration for inter-component communication
  • Storage Integration: MinIO S3-compatible storage for Spark event logs
  • RBAC: Least-privilege security with proper ServiceAccount and Role bindings
  • Health Checks: Comprehensive readiness/liveness probes
  • Lifecycle Management: Owner references for automatic cleanup
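
To illustrate the memory format handling listed above, here is a minimal sketch of a Kubernetes-to-JVM conversion. The helper name `k8s_to_jvm_memory` and its behavior are assumptions made for illustration; this is not the code in nuvolaris/spark.py. It handles only whole-number binary quantities (Ki, Mi, Gi, Ti) and relies on Spark interpreting `k`, `m`, `g`, `t` as binary units.

```python
import re

# Hypothetical mapping from Kubernetes binary suffixes to JVM/Spark size suffixes.
# Illustrative only; not taken from nuvolaris/spark.py.
_BINARY_SUFFIXES = {"Ki": "k", "Mi": "m", "Gi": "g", "Ti": "t"}


def k8s_to_jvm_memory(quantity: str) -> str:
    """Convert a Kubernetes binary quantity such as '1Gi' to a JVM size such as '1g'."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", quantity.strip())
    if match is None:
        # Decimal suffixes (M, G) and fractional values are deliberately rejected
        # in this sketch, because they do not map one-to-one onto JVM sizes.
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return f"{number}{_BINARY_SUFFIXES[suffix]}"
```

With the configuration in this PR, `master.memory: 1Gi` would become `1g` for the Spark Master process under this sketch.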

Changes

Operator Submodule (commit afc74b4)

  • New Module: nuvolaris/spark.py - Complete Spark operator implementation
  • Templates: Kubernetes manifests for RBAC, ConfigMaps, Services, StatefulSets
  • Integration: Hooks into main operator workflow (patcher.py, main.py)

Configuration

apiVersion: nuvolaris.org/v1
kind: Whisk
metadata:
  name: controller
spec:
  components:
    spark: true
  spark:
    enabled: true
    mode: standalone
    image: apache/spark:3.5.0
    master:
      memory: 1Gi
      cpu: 1000m
    history:
      enabled: true
      backend: s3a
      s3a:
        bucket: spark-history
        endpoint: http://minio.nuvolaris.svc.cluster.local:9000
        secretRef: nuvolaris-minio
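
As a rough illustration of how the `history` block above could translate into Spark settings, the sketch below maps it to standard Spark and Hadoop S3A properties. The function name `history_spark_conf` is hypothetical, and the mapping is an assumption rather than the operator's actual behavior. Credentials from `secretRef` are intentionally left out, and path-style access is assumed because it is the usual setting for MinIO.

```python
def history_spark_conf(spark_spec: dict) -> dict:
    """Build Spark properties for event logging from a `spec.spark` mapping.

    Hypothetical helper for illustration; not part of the operator.
    """
    history = spark_spec.get("history", {})
    if not history.get("enabled") or history.get("backend") != "s3a":
        return {}
    s3a = history["s3a"]
    log_dir = f"s3a://{s3a['bucket']}/"
    return {
        # Applications write event logs here ...
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
        # ... and the History Server reads them from the same location.
        "spark.history.fs.logDirectory": log_dir,
        # S3A settings for an S3-compatible endpoint (assumed: MinIO, path-style).
        "spark.hadoop.fs.s3a.endpoint": s3a["endpoint"],
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Values copied from the example configuration in this PR.
example_spec = {
    "history": {
        "enabled": True,
        "backend": "s3a",
        "s3a": {
            "bucket": "spark-history",
            "endpoint": "http://minio.nuvolaris.svc.cluster.local:9000",
            "secretRef": "nuvolaris-minio",
        },
    }
}
conf = history_spark_conf(example_spec)
```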

Testing

Tested on MicroK8s cluster:

  • ✅ Spark Master deployment and healthy startup
  • ✅ History Server with MinIO integration
  • ✅ Resource limits properly applied
  • ✅ Service endpoints accessible (spark://spark-master:7077)
  • ✅ Web UI available on port 8080

Verification

kubectl -n nuvolaris get pods -l app=spark
NAME                            READY   STATUS    RESTARTS   AGE
spark-history-7b7d97c7d         1/1     Running   0          10m
spark-master-0                  1/1     Running   0          10m
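
For scripted verification, the table above could be checked automatically. The following is a minimal sketch that parses `kubectl get pods` text output and reports pods that are not fully ready; the function name `unready_pods` is hypothetical, and the sample text is the output shown in this PR.

```python
from typing import List

# Sample output copied from the Verification section of this PR.
SAMPLE_OUTPUT = """\
NAME                            READY   STATUS    RESTARTS   AGE
spark-history-7b7d97c7d         1/1     Running   0          10m
spark-master-0                  1/1     Running   0          10m
"""


def unready_pods(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running or not fully ready."""
    rows = [line.split() for line in kubectl_output.strip().splitlines()]
    header, pods = rows[0], rows[1:]
    name_i = header.index("NAME")
    ready_i = header.index("READY")
    status_i = header.index("STATUS")
    bad = []
    for pod in pods:
        ready, total = pod[ready_i].split("/")
        if pod[status_i] != "Running" or ready != total:
            bad.append(pod[name_i])
    return bad


result = unready_pods(SAMPLE_OUTPUT)
```

An empty list means every listed pod is Running with all containers ready.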

Use Cases

  • Data Processing: Run Spark jobs within OpenServerless environment
  • ETL Pipelines: Process large datasets stored in MinIO
  • Machine Learning: Train models using Spark MLlib
  • Analytics: Query and analyze data alongside serverless functions

Future Enhancements

  • Dynamic Spark Worker scaling
  • Spark application submission via operator API
  • Metrics integration with Prometheus
  • Support for Spark on Kubernetes mode
  • Jupyter notebook integration

Documentation

User documentation and examples will be added in follow-up PRs.


Related Issues: Closes #[issue-number]

Operator Submodule PR: mobs75/openserverless-operator#[pr-number]

- Add Spark operator build and test tasks in task project
- TaskfileBuild.yml for GHCR image building
- TaskfileTest.yml for SparkJob testing
- Sync with openserverless-task fork feature/enable-spark-in-whisk
- Update operator submodule to include Spark integration (commit afc74b4)
- Add comprehensive Spark deployment support
- Enable Spark Master, History Server, and Worker management
- Integrate with MinIO for Spark event logs storage
- Add Spark operator build tasks
- Add Spark testing workflows
- Update Whisk CRD with Spark configuration