Batch Job Stuck at "Starting" #9

Closed
xaviermerino opened this issue Jan 18, 2022 · 3 comments

@xaviermerino

Hello!

Thanks for your work on Lighter! I've been looking for a replacement for Livy for a while, and this is the closest thing I've found on the Internet!

I'm running Lighter on a minikube Kubernetes environment. I have a Spark cluster deployed through Helm charts. I also deployed PostgreSQL through Helm since it seems like Lighter needs it. Here's how my environment looks (in the default namespace):

NAME                           READY   STATUS    RESTARTS      AGE
lighter-6675c44b5b-jjpwk       1/1     Running   2 (20m ago)   16h
nfs-nfs-server-provisioner-0   1/1     Running   1 (21m ago)   16h
postgres-postgresql-0          1/1     Running   1 (21m ago)   16h
spark-data-pod                 1/1     Running   1 (21m ago)   16h
spark-master-0                 1/1     Running   1 (21m ago)   16h
spark-worker-0                 1/1     Running   1 (21m ago)   16h
spark-worker-1                 1/1     Running   1 (21m ago)   16h
spark-worker-2                 1/1     Running   1 (21m ago)   16h

Based on your instructions, I did:

1. Manifest for ServiceAccount, Role, and RoleBinding:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: lighter-spark
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "pods/log"]
  verbs: ["*"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: lighter-spark
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: lighter-spark
  apiGroup: rbac.authorization.k8s.io

2. Manifest for the Lighter Service:
apiVersion: v1
kind: Service
metadata:
    name: lighter
    namespace: default
    labels:
        run: lighter
spec:
    ports:
        -   name: api
            port: 8080
            protocol: TCP
        -   name: javagw
            port: 25333
            protocol: TCP
    selector:
        run: lighter

3. Manifest for the Lighter Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
    namespace: default
    name: lighter
spec:
    selector:
        matchLabels:
            run: lighter
    replicas: 1
    strategy:
        rollingUpdate:
            maxUnavailable: 0
            maxSurge: 1
    template:
        metadata:
            labels:
                run: lighter
        spec:
            containers:
                -   image: ghcr.io/exacaster/lighter:0.0.3-spark3.1.2
                    name: lighter
                    readinessProbe:
                        httpGet:
                            path: /health/readiness
                            port: 8080
                        initialDelaySeconds: 15
                        periodSeconds: 15
                    resources:
                        requests:
                            cpu: "0.25"
                            memory: "512Mi"
                    ports:
                        -   containerPort: 8080
                    env:
                        -   name: LIGHTER_STORAGE_JDBC_USERNAME
                            value: postgres
                        -   name: LIGHTER_STORAGE_JDBC_PASSWORD
                            value: secretpassword
                        -   name: LIGHTER_STORAGE_JDBC_URL
                            value: jdbc:postgresql://postgres-postgresql:5432/lighter
                        -   name: LIGHTER_STORAGE_JDBC_DRIVER_CLASS_NAME
                            value: org.postgresql.Driver
                        -   name: LIGHTER_SPARK_HISTORY_SERVER_URL
                            value: http://spark-master-svc:7077/spark-history
                        -   name: LIGHTER_MAX_RUNNING_JOBS
                            value: "15"
                        -   name: LIGHTER_KUBERNETES_CONTAINER_IMAGE
                            value: "ghcr.io/exacaster/spark:latest"
            serviceAccountName: spark

I did not bother with the ingress portion because I am just testing and use port-forwarding to see the Spark and Lighter UIs.
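
For reference, I just run something like this locally (the local port is arbitrary), and similarly for the Spark master UI service:

kubectl port-forward -n default svc/lighter 8081:8080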

I am sending a POST request to Lighter's Batch API with the following:

{
  "name": "Test",
  "file": "/data/spark-examples.jar",
  "args": ["/data/test.txt"],
  "files": ["/data/test.txt"],
  "className" : "org.apache.spark.examples.JavaWordCount"
}

After the request gets accepted, other fields get filled in automatically and the UI reflects the submission. However, the state of the job is always "Starting" (and I can't delete it by pressing the "X" button).

I'm not sure what is going on. I tried checking the pod's logs, but there is nothing related there. I was wondering if you could point me in the right direction to figure out what is missing or misconfigured for it to work properly. I'm attaching a screenshot of the UI below.
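
For reference, this is roughly what I checked (exact commands approximate), without finding anything Spark-related:

# Lighter pod logs
kubectl logs -n default deploy/lighter

# any driver pods that Lighter might have started
kubectl get pods -n default -l spark-role=driver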

[screenshot of the Lighter UI showing the batch stuck in "Starting"]

Thank you for your time!

@Minutis
Member

Minutis commented Jan 18, 2022

Hi! Glad to hear that you are trying out Lighter!

Since you changed all the namespaces in your manifests to default, you should also set LIGHTER_KUBERNETES_NAMESPACE to default in the Lighter deployment.

Another thing: you are using ghcr.io/exacaster/spark:latest for LIGHTER_KUBERNETES_CONTAINER_IMAGE, and I am quite sure that image does not exist. Keep in mind that the jar you are submitting ("file": "/data/spark-examples.jar") must exist inside the container image that you use.
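
For example, the relevant part of the env section in your Lighter Deployment could look something like this (the image value here is only an illustration of a public Spark image; what matters is that the image exists and contains /data/spark-examples.jar):

env:
    -   name: LIGHTER_KUBERNETES_NAMESPACE
        value: default
    -   name: LIGHTER_KUBERNETES_CONTAINER_IMAGE
        value: "bitnami/spark:3"   # example only; use an image that actually exists and has your jar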

@xaviermerino
Author

xaviermerino commented Jan 23, 2022

@Minutis thanks for your help and sorry I did not see this earlier!

I replaced ghcr.io/exacaster/spark:latest with bitnami/spark:3 and made sure to set LIGHTER_KUBERNETES_NAMESPACE to default. The logs are producing output now!

You were right about the data problems. I currently mount an NFS volume in all my workers (extra settings for the Helm chart):

service:
  type: ClusterIP
worker:
  replicaCount: 3
  extraVolumes:
    - name: spark-data
      persistentVolumeClaim:
        claimName: spark-data-pvc
  extraVolumeMounts:
    - name: spark-data
      mountPath: /data

but I guess that Lighter creates a new container without that extra volume. I sent the following request:

curl -X POST -H "Content-type: application/json" -d '{
  "name": "Test",
  "file": "/data/spark-examples.jar",
  "args": ["/data/test.txt"],
  "files": ["/data/test.txt"],
  "className" : "org.apache.spark.examples.JavaWordCount"
}' 'localhost:8081/lighter/api/batches'

and it was received correctly, but the logs show that it can't find the file.

java.lang.RuntimeException: Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/data/spark-examples.jar does not exist'.  Please specify one with --class.
        at org.apache.spark.launcher.OutputRedirector.redirect(OutputRedirector.java:67)
        at java.base/java.lang.Thread.run(Thread.java:834)

Is there any way to add this volume via Lighter's deployment configuration, environment variables, etc.? If not, what do you guys recommend for making files available in the new container?

I'm also getting an error after the exception shown above:

Exception in thread "launcher-proc-1" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.96.0.1/api/v1/namespaces/spark/pods?labelSelector=spark-app-tag%3Debae8134-6125-45a9-9343-0ba61f7bfb19%2Cspark-role%3Ddriver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods is forbidden: User "system:serviceaccount:default:spark" cannot list resource "pods" in API group "" in the namespace "spark".

I checked, and I'm not using a spark namespace anywhere.

Any ideas on what's going on?

Thanks for your help and sorry for the delay!

@pdambrauskas
Collaborator

To use volume mounts on your driver/executor pods, you'll have to follow the official Spark documentation on the topic:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes
and add the needed "conf" properties to your submit options.
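
For example, with the PVC from your Helm values, the extra properties could look roughly like this (assuming the batch request accepts a Livy-style "conf" map; the property names are taken from the Spark documentation above):

{
  "name": "Test",
  "file": "/data/spark-examples.jar",
  "className": "org.apache.spark.examples.JavaWordCount",
  "conf": {
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-data.mount.path": "/data",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-data.options.claimName": "spark-data-pvc",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-data.mount.path": "/data",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-data.options.claimName": "spark-data-pvc"
  }
}

If your Lighter version does not support passing "conf" this way, the pod template approach below achieves the same thing.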

Another option would be to use custom pod templates and attach your mounts there. To do that, you can create a ConfigMap mounted at the /home/app/k8s/driver_pod_template.yaml and /home/app/k8s/executor_pod_template.yaml paths (see the defaults: https://github.com/exacaster/lighter/tree/master/k8s).
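
For illustration, a driver pod template attaching the same claim could be as small as this (just a sketch; the executor template would look the same, and Spark merges it with the pod spec it generates):

apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: spark-data
      persistentVolumeClaim:
        claimName: spark-data-pvc
  containers:
    - name: driver   # placeholder name; unless configured otherwise, Spark uses the first container in the template
      volumeMounts:
        - name: spark-data
          mountPath: /data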

I'm not sure about your last exception. Are you sure you got it after changing LIGHTER_KUBERNETES_NAMESPACE, and not before?
