Batch Job Stuck at "Starting" #9

Closed
xaviermerino opened this issue Jan 18, 2022 · 3 comments

@xaviermerino

Hello!

Thanks for your work on Lighter! I've been looking for a replacement for Livy for a while, and this is the closest thing I've found on the Internet!

I'm running Lighter on a minikube Kubernetes environment. I have a Spark cluster deployed through Helm charts. I also deployed PostgreSQL through Helm since it seems like Lighter needs it. Here's how my environment looks (in the default namespace):

NAME                           READY   STATUS    RESTARTS      AGE
lighter-6675c44b5b-jjpwk       1/1     Running   2 (20m ago)   16h
nfs-nfs-server-provisioner-0   1/1     Running   1 (21m ago)   16h
postgres-postgresql-0          1/1     Running   1 (21m ago)   16h
spark-data-pod                 1/1     Running   1 (21m ago)   16h
spark-master-0                 1/1     Running   1 (21m ago)   16h
spark-worker-0                 1/1     Running   1 (21m ago)   16h
spark-worker-1                 1/1     Running   1 (21m ago)   16h
spark-worker-2                 1/1     Running   1 (21m ago)   16h

Based on your instructions, I did:

1. Manifest for ServiceAccount, Role, and RoleBinding:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: lighter-spark
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "pods/log"]
  verbs: ["*"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: lighter-spark
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: lighter-spark
  apiGroup: rbac.authorization.k8s.io

2. Manifest for the Lighter Service:
apiVersion: v1
kind: Service
metadata:
    name: lighter
    namespace: default
    labels:
        run: lighter
spec:
    ports:
        -   name: api
            port: 8080
            protocol: TCP
        -   name: javagw
            port: 25333
            protocol: TCP
    selector:
        run: lighter

3. Manifest for the Lighter Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
    namespace: default
    name: lighter
spec:
    selector:
        matchLabels:
            run: lighter
    replicas: 1
    strategy:
        rollingUpdate:
            maxUnavailable: 0
            maxSurge: 1
    template:
        metadata:
            labels:
                run: lighter
        spec:
            containers:
                -   image: ghcr.io/exacaster/lighter:0.0.3-spark3.1.2
                    name: lighter
                    readinessProbe:
                        httpGet:
                            path: /health/readiness
                            port: 8080
                        initialDelaySeconds: 15
                        periodSeconds: 15
                    resources:
                        requests:
                            cpu: "0.25"
                            memory: "512Mi"
                    ports:
                        -   containerPort: 8080
                    env:
                        -   name: LIGHTER_STORAGE_JDBC_USERNAME
                            value: postgres
                        -   name: LIGHTER_STORAGE_JDBC_PASSWORD
                            value: secretpassword
                        -   name: LIGHTER_STORAGE_JDBC_URL
                            value: jdbc:postgresql://postgres-postgresql:5432/lighter
                        -   name: LIGHTER_STORAGE_JDBC_DRIVER_CLASS_NAME
                            value: org.postgresql.Driver
                        -   name: LIGHTER_SPARK_HISTORY_SERVER_URL
                            value: http://spark-master-svc:7077/spark-history
                        -   name: LIGHTER_MAX_RUNNING_JOBS
                            value: "15"
                        -   name: LIGHTER_KUBERNETES_CONTAINER_IMAGE
                            value: "ghcr.io/exacaster/spark:latest"
            serviceAccountName: spark

I did not bother with the ingress portion because I am just testing and use port-forwarding to see the Spark and Lighter UIs.
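
For reference, I just run something like this locally (the local port is arbitrary), and similarly for the Spark master UI service:

kubectl port-forward -n default svc/lighter 8081:8080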

I am sending a POST request to Lighter's Batch API with the following:

{
  "name": "Test",
  "file": "/data/spark-examples.jar",
  "args": ["/data/test.txt"],
  "files": ["/data/test.txt"],
  "className" : "org.apache.spark.examples.JavaWordCount"
}

After the request gets accepted, other fields get filled in automatically and the UI reflects the submission. However, the state of the job is always "Starting" (and I can't delete it by pressing the "X" button).

I'm not sure what is going on. I tried checking the pod's logs, but there is nothing related there. I was wondering if you could point me in the right direction to figure out what is missing or misconfigured for it to work properly. I'm attaching a screenshot of the UI below.
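
For reference, this is roughly what I checked (exact commands approximate), without finding anything Spark-related:

# Lighter pod logs
kubectl logs -n default deploy/lighter

# any driver pods that Lighter might have started
kubectl get pods -n default -l spark-role=driver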

[screenshot of the Lighter UI showing the batch stuck in "Starting"]

Thank you for your time!

@Minutis
Member

Minutis commented Jan 18, 2022

Hi! Glad to hear that you are trying out Lighter!

Since you changed all the namespaces in your manifests to default, you should also set LIGHTER_KUBERNETES_NAMESPACE to default in the Lighter deployment.

Another thing: you are using ghcr.io/exacaster/spark:latest for LIGHTER_KUBERNETES_CONTAINER_IMAGE, and I am quite sure that image does not exist. Keep in mind that the jar you are submitting ("file": "/data/spark-examples.jar") must exist inside the container image that you use.
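
For example, the relevant part of the env section in your Lighter Deployment could look something like this (the image value here is only an illustration of a public Spark image; what matters is that the image exists and contains /data/spark-examples.jar):

env:
    -   name: LIGHTER_KUBERNETES_NAMESPACE
        value: default
    -   name: LIGHTER_KUBERNETES_CONTAINER_IMAGE
        value: "bitnami/spark:3"   # example only; use an image that actually exists and has your jar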

@xaviermerino
Author

xaviermerino commented Jan 23, 2022

@Minutis thanks for your help and sorry I did not see this earlier!

I replaced ghcr.io/exacaster/spark:latest with bitnami/spark:3 and made sure to set LIGHTER_KUBERNETES_NAMESPACE to default. The logs are producing output now!

You were right about the data problems. I currently mount an NFS volume in all my workers (extra settings for the Helm chart):

service:
  type: ClusterIP
worker:
  replicaCount: 3
  extraVolumes:
    - name: spark-data
      persistentVolumeClaim:
        claimName: spark-data-pvc
  extraVolumeMounts:
    - name: spark-data
      mountPath: /data

but I guess that Lighter creates a new container without that extra volume. I sent the following request:

curl -X POST -H "Content-type: application/json" -d '{
  "name": "Test",
  "file": "/data/spark-examples.jar",
  "args": ["/data/test.txt"],
  "files": ["/data/test.txt"],
  "className" : "org.apache.spark.examples.JavaWordCount"
}' 'localhost:8081/lighter/api/batches'

and it was received correctly, but the logs show that it can't find the file.

java.lang.RuntimeException: Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/data/spark-examples.jar does not exist'.  Please specify one with --class.
        at org.apache.spark.launcher.OutputRedirector.redirect(OutputRedirector.java:67)
        at java.base/java.lang.Thread.run(Thread.java:834)

Is there any way to add this volume via Lighter's deployment configuration, environment variables, etc.? If not, what do you guys recommend for making files available in the new container?

I'm also getting an error after the exception shown above:

Exception in thread "launcher-proc-1" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.96.0.1/api/v1/namespaces/spark/pods?labelSelector=spark-app-tag%3Debae8134-6125-45a9-9343-0ba61f7bfb19%2Cspark-role%3Ddriver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods is forbidden: User "system:serviceaccount:default:spark" cannot list resource "pods" in API group "" in the namespace "spark".

I checked, and I'm not using a spark namespace anywhere.

Any ideas on what's going on?

Thanks for your help and sorry for the delay!

@pdambrauskas
Collaborator

To use volume mounts on your driver/executor pods, you'll have to follow the official Spark documentation on the topic:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes
and add the needed "conf" properties to your submit options.
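
For example, with the PVC from your Helm values, the extra properties could look roughly like this (assuming the batch request accepts a Livy-style "conf" map; the property names are taken from the Spark documentation above):

{
  "name": "Test",
  "file": "/data/spark-examples.jar",
  "className": "org.apache.spark.examples.JavaWordCount",
  "conf": {
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-data.mount.path": "/data",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-data.options.claimName": "spark-data-pvc",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-data.mount.path": "/data",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-data.options.claimName": "spark-data-pvc"
  }
}

If your Lighter version does not support passing "conf" this way, the pod template approach below achieves the same thing.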

Another option would be to use custom pod templates and attach your mounts there. To do that, you can create a ConfigMap mounted at the /home/app/k8s/driver_pod_template.yaml and /home/app/k8s/executor_pod_template.yaml paths (see the defaults: https://github.com/exacaster/lighter/tree/master/k8s).
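
For illustration, a driver pod template attaching the same claim could be as small as this (just a sketch; the executor template would look the same, and Spark merges it with the pod spec it generates):

apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: spark-data
      persistentVolumeClaim:
        claimName: spark-data-pvc
  containers:
    - name: driver   # placeholder name; unless configured otherwise, Spark uses the first container in the template
      volumeMounts:
        - name: spark-data
          mountPath: /data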

I'm not sure about your last exception. Are you sure you got it after changing LIGHTER_KUBERNETES_NAMESPACE, and not before?
