
ATAC pipeline run on slurm report error #31

Closed

gmgitx opened this issue Sep 11, 2018 · 32 comments


gmgitx commented Sep 11, 2018

Hi, thanks for your wonderful work.
I ran the pipeline in /mypath/atac-seq-pipeline/ after source activate encode-atac-seq-pipeline:

java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm /my_path/local/bin/cromwell-34.jar run atac.wdl -i /my_path1/input.json -o /my_path2/atac-seq-pipeline/workflow_opts/slurm.json

But only one directory named "cromwell-workflow-logs" was left, and there is nothing in it:
Jenkinsfile LICENSE README.md atac.wdl backends conda **cromwell-workflow-logs** docker_image docs examples genome src test workflow_opts
What's more, while it was running it showed the following on the screen:

[2018-09-08 09:23:52,43] [info] Running with database db.url = jdbc:hsqldb:mem:a42fb754-58fc-418e-8224-01cd57b5b131;shutdown=false;hsqldb.tx=mvcc
[2018-09-08 09:24:01,66] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2018-09-08 09:24:01,67] [info] [RenameWorkflowOptionsInMetadata] 100%
[2018-09-08 09:24:01,78] [info] Running with database db.url = jdbc:hsqldb:mem:8c25714f-6a58-4b03-bf8d-b686ee8442fc;shutdown=false;hsqldb.tx=mvcc
[2018-09-08 09:24:02,13] [warn] This actor factory is deprecated. Please use cromwell.backend.google.pipelines.v1alpha2.PipelinesApiLifecycleActorFactory for
PAPI v1 or cromwell.backend.google.pipelines.v2alpha1.PipelinesApiLifecycleActorFactory for PAPI v2
[2018-09-08 09:24:02,16] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
[2018-09-08 09:24:02,16] [info] Using noop to send events.
[2018-09-08 09:24:02,44] [info] Slf4jLogger started
[2018-09-08 09:24:02,66] [info] Workflow heartbeat configuration:
{
  "cromwellId" : "cromid-d9e2d67",
  "heartbeatInterval" : "2 minutes",
  "ttl" : "10 minutes",
  "writeBatchSize" : 10000,
  "writeThreshold" : 10000
}
[2018-09-08 09:24:02,69] [info] Metadata summary refreshing every 2 seconds.
[2018-09-08 09:24:02,72] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-09-08 09:24:02,72] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-09-08 09:24:02,72] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2018-09-08 09:24:03,69] [info] JobExecutionTokenDispenser - Distribution rate: 50 per 1 seconds.
[2018-09-08 09:24:03,71] [info] SingleWorkflowRunnerActor: Version 34
[2018-09-08 09:24:03,71] [info] JES batch polling interval is 33333 milliseconds
[2018-09-08 09:24:03,71] [info] JES batch polling interval is 33333 milliseconds
[2018-09-08 09:24:03,71] [info] JES batch polling interval is 33333 milliseconds
[2018-09-08 09:24:03,71] [info] PAPIQueryManager Running with 3 workers
[2018-09-08 09:24:03,72] [info] SingleWorkflowRunnerActor: Submitting workflow
[2018-09-08 09:24:03,77] [info] Unspecified type (Unspecified version) workflow 1e03bf36-d64b-42a7-9857-a644de257de3 submitted
[2018-09-08 09:24:03,82] [info] SingleWorkflowRunnerActor: Workflow submitted 1e03bf36-d64b-42a7-9857-a644de257de3
[2018-09-08 09:24:03,82] [info] 1 new workflows fetched
[2018-09-08 09:24:03,82] [info] WorkflowManagerActor Starting workflow 1e03bf36-d64b-42a7-9857-a644de257de3
[2018-09-08 09:24:03,83] [warn] SingleWorkflowRunnerActor: received unexpected message: Done in state RunningSwraData
[2018-09-08 09:24:03,83] [info] WorkflowManagerActor Successfully started WorkflowActor-1e03bf36-d64b-42a7-9857-a644de257de3
[2018-09-08 09:24:03,83] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2018-09-08 09:24:03,85] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2018-09-08 09:24:03,89] [info] MaterializeWorkflowDescriptorActor [1e03bf36]: Parsing workflow as WDL draft-2
[2018-09-08 09:24:22,52] [error] WorkflowManagerActor Workflow 1e03bf36-d64b-42a7-9857-a644de257de3 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
Unexpected character ']' at input index 643 (line 13, position 5), expected JSON Value:
    ],
    ^


        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:200)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:170)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:165)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
        at akka.actor.FSM.processEvent(FSM.scala:670)
        at akka.actor.FSM.processEvent$(FSM.scala:667)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.akka$actor$LoggingFSM$$super$processEvent(MaterializeWorkflowDescriptorActor.scala:123)
        at akka.actor.LoggingFSM.processEvent(FSM.scala:806)
        at akka.actor.LoggingFSM.processEvent$(FSM.scala:788)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.processEvent(MaterializeWorkflowDescriptorActor.scala:123)
        at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:664)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:658)
        at akka.actor.Actor.aroundReceive(Actor.scala:517)
        at akka.actor.Actor.aroundReceive$(Actor.scala:515)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.aroundReceive(MaterializeWorkflowDescriptorActor.scala:123)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
        at akka.actor.ActorCell.invoke(ActorCell.scala:557)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


[2018-09-08 09:24:22,52] [info] WorkflowManagerActor WorkflowActor-1e03bf36-d64b-42a7-9857-a644de257de3 is in a terminal state: WorkflowFailedState
[2018-09-08 09:24:25,09] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2018-09-08 09:24:27,74] [info] Workflow polling stopped
[2018-09-08 09:24:27,76] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2018-09-08 09:24:27,76] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2018-09-08 09:24:27,76] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2018-09-08 09:24:27,77] [info] Aborting all running workflows.
[2018-09-08 09:24:27,77] [info] JobExecutionTokenDispenser stopped
[2018-09-08 09:24:27,77] [info] WorkflowStoreActor stopped
[2018-09-08 09:24:27,78] [info] WorkflowLogCopyRouter stopped
[2018-09-08 09:24:27,78] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2018-09-08 09:24:27,78] [info] WorkflowManagerActor All workflows finished
[2018-09-08 09:24:27,78] [info] WorkflowManagerActor stopped
[2018-09-08 09:24:27,78] [info] Connection pools shut down
[2018-09-08 09:24:27,78] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2018-09-08 09:24:27,79] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2018-09-08 09:24:27,79] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2018-09-08 09:24:27,79] [info] SubWorkflowStoreActor stopped
[2018-09-08 09:24:27,79] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2018-09-08 09:24:27,79] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2018-09-08 09:24:27,79] [info] JobStoreActor stopped
[2018-09-08 09:24:27,79] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2018-09-08 09:24:27,79] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2018-09-08 09:24:27,79] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2018-09-08 09:24:27,79] [info] CallCacheWriteActor stopped
[2018-09-08 09:24:27,79] [info] KvWriteActor Shutting down: 0 queued messages to process
[2018-09-08 09:24:27,79] [info] DockerHashActor stopped
[2018-09-08 09:24:27,79] [info] IoProxy stopped
[2018-09-08 09:24:27,79] [info] ServiceRegistryActor stopped
[2018-09-08 09:24:27,81] [info] Database closed
[2018-09-08 09:24:27,81] [info] Stream materializer shut down
Workflow 1e03bf36-d64b-42a7-9857-a644de257de3 transitioned to state Failed
[2018-09-08 09:24:27,85] [info] Automatic shutdown of the async connection
[2018-09-08 09:24:27,85] [info] Gracefully shutdown sentry threads.
[2018-09-08 09:24:27,85] [info] Shutdown finished.

I followed https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/tutorial_slurm.md, since I need to run on my school's SLURM cluster (not my local PC), although it is not Stanford's SLURM.

Would you have any advice about these two problems (the empty cromwell-workflow-logs directory and the error above)?


leepc12 commented Sep 11, 2018

Did you use examples/local/ENCSR356KRQ_subsampled.json as your input JSON?
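
The "Unexpected character ']'" message in the log above is a JSON syntax error in the input file, most often a trailing comma before a closing bracket (here around line 13). A quick way to pinpoint such an error, assuming python is available on the login node (a sketch, not the pipeline's own tooling):

# validates the input JSON and prints the line/column of the first syntax error
python -m json.tool /my_path1/input.json > /dev/null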


gmgitx commented Sep 11, 2018

Yes, I tried it, and it showed an error report, so I wondered whether it is because that JSON is formatted for "local".
Here is what it shows:

[2018-09-11 04:49:33,13] [info] Running with database db.url = jdbc:hsqldb:mem:58141743-77d8-458d-8454-0b7f4293431d;shutdown=false;hsqldb.tx=mvcc
[2018-09-11 04:49:43,58] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2018-09-11 04:49:43,60] [info] [RenameWorkflowOptionsInMetadata] 100%
[2018-09-11 04:49:43,73] [info] Running with database db.url = jdbc:hsqldb:mem:be828468-81d2-4957-8563-09c6b6c058d3;shutdown=false;hsqldb.tx=mvcc
[2018-09-11 04:49:44,09] [warn] This actor factory is deprecated. Please use cromwell.backend.google.pipelines.v1alpha2.PipelinesApiLifecycleActorFactory for PAPI v1 or cromwell.backend.google.pipelines.v2alpha1.PipelinesApiLifecycleActorFactory for PAPI v2
[2018-09-11 04:49:44,13] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
[2018-09-11 04:49:44,14] [info] Using noop to send events.
[2018-09-11 04:49:44,46] [info] Slf4jLogger started
[2018-09-11 04:49:44,71] [info] Workflow heartbeat configuration:
{
  "cromwellId" : "cromid-84c9232",
  "heartbeatInterval" : "2 minutes",
  "ttl" : "10 minutes",
  "writeBatchSize" : 10000,
  "writeThreshold" : 10000
}
[2018-09-11 04:49:44,76] [info] Metadata summary refreshing every 2 seconds.
[2018-09-11 04:49:44,82] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-09-11 04:49:44,82] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2018-09-11 04:49:44,82] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-09-11 04:49:45,95] [info] JobExecutionTokenDispenser - Distribution rate: 50 per 1 seconds.
[2018-09-11 04:49:45,97] [info] JES batch polling interval is 33333 milliseconds
[2018-09-11 04:49:45,97] [info] JES batch polling interval is 33333 milliseconds
[2018-09-11 04:49:45,97] [info] JES batch polling interval is 33333 milliseconds
[2018-09-11 04:49:45,98] [info] PAPIQueryManager Running with 3 workers
[2018-09-11 04:49:45,98] [info] SingleWorkflowRunnerActor: Version 34
[2018-09-11 04:49:45,98] [info] SingleWorkflowRunnerActor: Submitting workflow
[2018-09-11 04:49:46,04] [info] Unspecified type (Unspecified version) workflow 8823cc11-5e71-4004-b8d4-edf40eb38cd6 submitted
[2018-09-11 04:49:46,09] [info] SingleWorkflowRunnerActor: Workflow submitted 8823cc11-5e71-4004-b8d4-edf40eb38cd6
[2018-09-11 04:49:46,13] [info] 1 new workflows fetched
[2018-09-11 04:49:46,13] [info] WorkflowManagerActor Starting workflow 8823cc11-5e71-4004-b8d4-edf40eb38cd6
[2018-09-11 04:49:46,14] [warn] SingleWorkflowRunnerActor: received unexpected message: Done in state RunningSwraData
[2018-09-11 04:49:46,14] [info] WorkflowManagerActor Successfully started WorkflowActor-8823cc11-5e71-4004-b8d4-edf40eb38cd6
[2018-09-11 04:49:46,14] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2018-09-11 04:49:46,16] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2018-09-11 04:49:46,21] [info] MaterializeWorkflowDescriptorActor [8823cc11]: Parsing workflow as WDL draft-2
[2018-09-11 21:59:21,27] [info] MaterializeWorkflowDescriptorActor [2a0d0464]: Call-to-Backend assignments: atac.pool_ta_pr2 -> slurm, atac.filter -> slurm, atac.macs2_pr2 -> slurm, atac.overlap_pr -> slurm, atac.bowtie2 -> slurm, atac.pool_ta_pr1 -> slurm, atac.ataqc -> slurm, atac.reproducibility_overlap -> slurm, atac.overlap_ppr -> slurm, atac.spr -> slurm, atac.idr_ppr -> slurm, atac.xcor -> slurm, atac.qc_report -> slurm, atac.idr_pr -> slurm, atac.overlap -> slurm, atac.macs2_ppr2 -> slurm, atac.macs2_pr1 -> slurm, atac.macs2_ppr1 -> slurm, atac.read_genome_tsv -> slurm, atac.macs2_pooled -> slurm, atac.reproducibility_idr -> slurm, atac.pool_ta -> slurm, atac.macs2 -> slurm, atac.bam2ta -> slurm, atac.idr -> slurm, atac.trim_adapter -> slurm
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [preemptible, disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,40] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:21,41] [warn] slurm [2a0d0464]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-11 21:59:23,66] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Starting atac.read_genome_tsv
[2018-09-11 21:59:23,66] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Condition met: 'enable_idr'. Running conditional section
[2018-09-11 21:59:23,66] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Condition met: '!align_only && !true_rep_only && enable_idr'. Running conditional section
[2018-09-11 21:59:23,66] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Condition met: 'enable_idr'. Running conditional section
[2018-09-11 21:59:23,66] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Condition met: '!disable_xcor'. Running conditional section
[2018-09-11 21:59:23,67] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Condition met: '!true_rep_only'. Running conditional section
[2018-09-11 21:59:23,67] [info] WorkflowExecutionActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 [2a0d0464]: Condition met: '!align_only && !true_rep_only'. Running conditional section
[2018-09-11 21:59:25,00] [warn] DispatchedConfigAsyncJobExecutionActor [2a0d0464atac.read_genome_tsv:NA:1]: Unrecognized runtime attribute keys: disks
[2018-09-11 21:59:25,43] [info] DispatchedConfigAsyncJobExecutionActor [2a0d0464atac.read_genome_tsv:NA:1]: cat /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/inputs/378634365/hg19.tsv
[2018-09-11 21:59:25,49] [info] DispatchedConfigAsyncJobExecutionActor [2a0d0464atac.read_genome_tsv:NA:1]: executing: sbatch \
--export=ALL \
-J cromwell_2a0d0464_read_genome_tsv \
-D /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv \
-o /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/stdout \
-e /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/stderr \
-t 60 \
-n 1 \
--ntasks-per-node=1 \
--cpus-per-task=1 \
--mem=4000 \
 \
--account mengguo \
--wrap "/bin/bash /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/script"
[2018-09-11 21:59:26,04] [error] WorkflowManagerActor Workflow 2a0d0464-f14a-4c4d-a994-cf084e712a66 failed (during ExecutingWorkflowState): java.lang.RuntimeException: Unable to start job. Check the stderr file for possible errors: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/stderr.submit
        at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.$anonfun$execute$2(SharedFileSystemAsyncJobExecutionActor.scala:131)
        at scala.util.Either.fold(Either.scala:188)
        at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute(SharedFileSystemAsyncJobExecutionActor.scala:126)
        at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute$(SharedFileSystemAsyncJobExecutionActor.scala:121)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.execute(ConfigAsyncJobExecutionActor.scala:208)
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$executeAsync$1(StandardAsyncExecutionActor.scala:600)
        at scala.util.Try$.apply(Try.scala:209)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync(StandardAsyncExecutionActor.scala:600)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync$(StandardAsyncExecutionActor.scala:600)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeAsync(ConfigAsyncJobExecutionActor.scala:208)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:915)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:907)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeOrRecover(ConfigAsyncJobExecutionActor.scala:208)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
        at cromwell.core.retry.Retry$.withRetry(Retry.scala:37)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
        at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
        at akka.actor.Actor.aroundReceive(Actor.scala:517)
        at akka.actor.Actor.aroundReceive$(Actor.scala:515)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.aroundReceive(ConfigAsyncJobExecutionActor.scala:208)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
        at akka.actor.ActorCell.invoke(ActorCell.scala:557)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2018-09-11 21:59:26,04] [info] WorkflowManagerActor WorkflowActor-2a0d0464-f14a-4c4d-a994-cf084e712a66 is in a terminal state: WorkflowFailedState
[2018-09-11 21:59:35,67] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2018-09-11 21:59:36,55] [info] Workflow polling stopped
[2018-09-11 21:59:36,56] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2018-09-11 21:59:36,56] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2018-09-11 21:59:36,57] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2018-09-11 21:59:36,57] [info] JobExecutionTokenDispenser stopped
[2018-09-11 21:59:36,57] [info] Aborting all running workflows.
[2018-09-11 21:59:36,57] [info] WorkflowLogCopyRouter stopped
[2018-09-11 21:59:36,57] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2018-09-11 21:59:36,57] [info] WorkflowStoreActor stopped
[2018-09-11 21:59:36,57] [info] WorkflowManagerActor All workflows finished
[2018-09-11 21:59:36,57] [info] WorkflowManagerActor stopped
[2018-09-11 21:59:36,57] [info] Connection pools shut down
[2018-09-11 21:59:36,57] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2018-09-11 21:59:36,57] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2018-09-11 21:59:36,57] [info] SubWorkflowStoreActor stopped
[2018-09-11 21:59:36,57] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2018-09-11 21:59:36,57] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2018-09-11 21:59:36,57] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2018-09-11 21:59:36,57] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2018-09-11 21:59:36,57] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2018-09-11 21:59:36,57] [info] JobStoreActor stopped
[2018-09-11 21:59:36,57] [info] CallCacheWriteActor stopped
[2018-09-11 21:59:36,57] [info] KvWriteActor Shutting down: 0 queued messages to process
[2018-09-11 21:59:36,58] [info] DockerHashActor stopped
[2018-09-11 21:59:36,58] [info] IoProxy stopped
[2018-09-11 21:59:36,58] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2018-09-11 21:59:36,58] [info] ServiceRegistryActor stopped
[2018-09-11 21:59:36,60] [info] Database closed
[2018-09-11 21:59:36,60] [info] Stream materializer shut down
Workflow 2a0d0464-f14a-4c4d-a994-cf084e712a66 transitioned to state Failed
[2018-09-11 21:59:36,64] [info] Automatic shutdown of the async connection
[2018-09-11 21:59:36,64] [info] Gracefully shutdown sentry threads.
[2018-09-11 21:59:36,65] [info] Shutdown finished.

I never changed -Dconfig.file=backends/backend.conf -Dbackend.default=slurm; should it be adjusted?


leepc12 commented Sep 11, 2018

"local" here means running pipelines with downloaded (so locally existing) files.

This looks like a SLURM problem. Does your SLURM sbatch take --account or --partition?

Please post an example sbatch command or shell script template you use for submitting your own job to SLURM.

Also, can you run the following sbatch command and see what happens? Post any errors here and I will take a look.

sbatch \
--export=ALL \
-J cromwell_2a0d0464_read_genome_tsv \
-D /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv \
-o /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/stdout \
-e /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/stderr \
-t 60 \
-n 1 \
--ntasks-per-node=1 \
--cpus-per-task=1 \
--mem=4000 \
 \
--account mengguo \
--wrap "/bin/bash /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/2a0d0464-f14a-4c4d-a994-cf084e712a66/call-read_genome_tsv/execution/script"

BTW, I got your email but I cannot personally skype with you.


gmgitx commented Sep 12, 2018

I see, thanks!

I put this command line in a shell as in step 8 of the doc. I ran it in /mypath/atac-seq-pipeline/ after source activate encode-atac-seq-pipeline:
java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm /home/mengguo/local/bin/cromwell-34.jar run atac.wdl -i /mypath1/input.json -o /mypath2/atac-seq-pipeline/workflow_opts/slurm.json

When I "run the following sbatch command and see what happens", this is what happens:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
(encode-atac-seq-pipeline) Tue Sep 11 18:58:59 2018

slurm.json:
{
"default_runtime_attributes" : {
"slurm_account": "mengguo"
}
}
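
The "Invalid account or account/partition combination" error means SLURM does not accept the submitted --account/--partition pair for this user. A sketch of standard SLURM queries that show what the scheduler actually allows (output and field names vary by site):

# partitions visible to you
sinfo -s
# account/partition associations defined for your user
sacctmgr show assoc where user=$USER format=Account,Partition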


leepc12 commented Sep 14, 2018

@gmgitx: Your error says Invalid account or account/partition combination specified. Please post an example sbatch command or shell script template you use for submitting your own job to SLURM.


gmgitx commented Sep 14, 2018

After you mentioned it, I got the right account/partition from our IT, and that error is gone. But it still doesn't work with the ENCSR356KRQ data provided in the documentation.
This is the command I ran (I didn't add sbatch):
java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm /home/mengguo/local/bin/cromwell-34.jar run atac.wdl -i /mypath1/ENCSR356KRQ_subsampled.json -o /mypath2/atac-seq-pipeline/workflow_opts/slurm.json

Part of the warnings and errors:

[2018-09-14 17:11:22,17] [warn] This actor factory is deprecated. Please use cromwell.backend.google.pipelines.v1alpha2.PipelinesApiLifecycleActorFactory for PAPI v1 or cromwell.backend.google.pipelines.v2alpha1.PipelinesApiLifecycleActorFactory for PAPI v2
[2018-09-14 17:11:22,17] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
...
[2018-09-14 17:11:23,73] [warn] SingleWorkflowRunnerActor: received unexpected message: Done in state RunningSwraData
...
[2018-09-14 17:13:09,52] [info] MaterializeWorkflowDescriptorActor [00851093]: Call-to-Backend assignments: atac.overlap_pr -> slurm, atac.spr -> slurm, atac.qc_report -> slurm, atac.reproducibility_idr -> slurm, atac.reproducibility_overlap -> slurm, atac.pool_ta -> slurm, atac.macs2_pr2 -> slurm, atac.xcor -> slurm, atac.ataqc -> slurm, atac.overlap_ppr -> slurm, atac.filter -> slurm, atac.idr_ppr -> slurm, atac.idr_pr -> slurm, atac.bam2ta -> slurm, atac.overlap -> slurm, atac.bowtie2 -> slurm, atac.macs2_ppr1 -> slurm, atac.pool_ta_pr2 -> slurm, atac.read_genome_tsv -> slurm, atac.trim_adapter -> slurm, atac.macs2_ppr2 -> slurm, atac.macs2_pr1 -> slurm, atac.idr -> slurm, atac.pool_ta_pr1 -> slurm, atac.macs2 -> slurm, atac.macs2_pooled -> slurm
[2018-09-14 17:13:09,74] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,74] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,74] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,75] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,76] [warn] slurm [00851093]: Key/s [preemptible, disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,76] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,76] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,76] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-09-14 17:13:09,76] [warn] slurm [00851093]: Key/s [disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
...
[warn] DispatchedConfigAsyncJobExecutionActor [e0efd905atac.read_genome_tsv:NA:1]: Unrecognized runtime attribute keys: disks

...
[2018-09-14 20:35:09,32] [warn] Localization via hard link has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/018d7f41-9161-4d4d-8a85-30b884e414c1/call-trim_adapter/shard-1/inputs/1398708413/ENCFF193RRC.subsampled.400.fastq.gz -> /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz
[2018-09-14 20:35:09,32] [warn] Localization via copy has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz
[2018-09-14 20:35:09,32] [warn] Localization via hard link has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/018d7f41-9161-4d4d-8a85-30b884e414c1/call-trim_adapter/shard-1/inputs/1398708414/ENCFF886FSC.subsampled.400.fastq.gz -> /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz
[2018-09-14 20:35:09,33] [warn] Localization via copy has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz
[2018-09-14 20:35:09,33] [warn] Localization via hard link has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/018d7f41-9161-4d4d-8a85-30b884e414c1/call-trim_adapter/shard-1/inputs/1398708413/ENCFF366DFI.subsampled.400.fastq.gz -> /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz
[2018-09-14 20:35:09,33] [warn] Localization via copy has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz
[2018-09-14 20:35:09,33] [warn] Localization via hard link has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/cromwell-executions/atac/018d7f41-9161-4d4d-8a85-30b884e414c1/call-trim_adapter/shard-1/inputs/1398708414/ENCFF573UXK.subsampled.400.fastq.gz -> /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz
[2018-09-14 20:35:09,33] [warn] Localization via copy has failed: /project2/yangili1/mengguo/ASP/atac-seq-pipeline/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz
[2018-09-14 20:35:09,34] [error] DispatchedConfigAsyncJobExecutionActor [018d7f41atac.trim_adapter:1:1]: Error attempting to Execute
java.lang.Exception: Failed command instantiation

...
[2018-09-14 17:00:23,88] [error] DispatchedConfigAsyncJobExecutionActor [2386aadbatac.trim_adapter:1:1]: Error attempting to Execute
java.lang.Exception: Failed command instantiation
        at cromwell.backend.standard.StandardAsyncExecutionActor.instantiatedCommand(StandardAsyncExecutionActor.scala:537)
        at cromwell.backend.standard.StandardAsyncExecutionActor.instantiatedCommand$(StandardAsyncExecutionActor.scala:472)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.instantiatedCommand$lzycompute(ConfigAsyncJobExecutionActor.scala:208)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.instantiatedCommand(ConfigAsyncJobExecutionActor.scala:208)


leepc12 commented Sep 14, 2018

I didn't mean a pipeline command line. I just wanted to see an example sbatch command line that you usually use. Is there a wiki page for your cluster?

What is your sbatch command line to submit the following HelloWorld shell script hello_world.sh?

#!/bin/bash
echo Hello world
echo Sleep 60


gmgitx commented Sep 14, 2018

Sorry for my misunderstanding.

The guide for our cluster:
https://github.com/jdblischak/giladlab-midway-guide

Here is the sbatch command:
sbatch hello_world.sh
It gives me back a file slurm-[number].out.


leepc12 commented Sep 14, 2018

Are you sure that sbatch hello_world.sh works without any extra parameters? If so, remove account settings ("slurm_account": "mengguo") from workflow_opts/slurm.json and try again.


gmgitx commented Sep 14, 2018

Yes, slurm-[number].out contains:

Hello world
Sleep 60

I removed the account setting, but it seems to give the same warnings and error report.


leepc12 commented Sep 17, 2018

Please post a full log and also your workflow_opts/slurm.json.


gmgitx commented Sep 19, 2018

######command
sbatch --mem=8g --partition=broadwl run_atac.sh

#run_atac.sh

#!/bin/bash
java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm /home/name2/local/bin/cromwell-34.jar run atac.wdl -i /project2/name1/name2/DLDS/ENCSR356KRQ_subsampled.json -o /project2/name1/name2/ASP/atac-seq-pipeline/workflow_opts/slurm.json

######slurm.json

{
    "default_runtime_attributes" : {
        "slurm_partition": "broadwl"
    }
}

######ENCSR356KRQ_subsampled.json

#########################result

slurm-49805362.out


leepc12 commented Sep 19, 2018

I guess that you (or your partition) have a limited quota for resources on your cluster?

$ scontrol show partition broadwl

Do you have the privilege to use enough resources (memory>=16GB, cpu>=4, walltime>=48hr per task) on your partition?

Please run the following in the working directory where you ran the pipeline. It will make a tarball of all the log files; please upload it here. I need it for debugging:

$ find . -type f -name 'stdout' -or -name 'stderr' -or -name 'script' -or \
-name '*.qc' -or -name '*.txt' -or -name '*.log' -or -name '*.png' -or -name '*.pdf' \
| xargs tar -zcvf debug_issue_31.tar.gz
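
To check that the archive actually picked up files before uploading it, plain tar can list its contents (a sketch):

# list what went into the tarball; an empty listing means find matched nothing
tar -tzvf debug_issue_31.tar.gz | head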


gmgitx commented Sep 20, 2018

Thanks!
scontrol show partition broadwl

PartitionName=broadwl
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=midway2-[0002-0089,0103-0124,0137-0182,0221-0230,0258-0280,0282-0301,0312-0398,0400]
   PriorityJobFactor=20 PriorityTier=20 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=8316 TotalNodes=297 SelectTypeParameters=NONE
   DefMemPerCPU=2048 MaxMemPerCPU=9223372036854775807

tar zxvf debug_issue_31.tar.gz


leepc12 commented Sep 20, 2018

@gmgitx Your tarball does not have any file in it.


gmgitx commented Sep 20, 2018

Thanks!
After I executed that command in the working directory where I ran the pipeline, a debug_issue_31.tar.gz was left. So what does it mean if there is no file in it? Should there be files?


leepc12 commented Sep 20, 2018

Please send that file debug_issue_31.tar.gz to my email.


gmgitx commented Sep 21, 2018

I sent it.


leepc12 commented Sep 21, 2018

I got your log, but it includes outputs from too many pipeline runs. For the latest run, the first task of the pipeline worked fine, so you can keep using your partition broadwl. But the next step failed and I need to figure out why. I guess it was rejected by the cluster due to a resource quota.

What is the resource quota on your cluster? How many resources can your partition use, for example the maximum number of concurrent jobs, max CPUs per job, max memory per job, and max walltime per job? This information will be helpful for debugging.

Can you clean up your output directories (rm -rf cromwell-execution*) and run the pipeline again? If that rm -rf does not work, make a new directory and follow the steps in the documentation again. Then post both your screen log and a new tarball (please make a new one using the same command).
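
One way to capture the screen log while rerunning is to pipe the same command through tee (a sketch reusing the command from earlier in this thread; cromwell_run.log is just an illustrative file name):

java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm \
  /home/mengguo/local/bin/cromwell-34.jar run atac.wdl \
  -i /mypath1/ENCSR356KRQ_subsampled.json \
  -o /mypath2/atac-seq-pipeline/workflow_opts/slurm.json 2>&1 | tee cromwell_run.log
# tee prints the log to the screen and also saves a copy to cromwell_run.log for posting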


gmgitx commented Sep 21, 2018

Many thanks!
I sent debug_issue_31.tar.gz to your email, along with slurm-49925402.out.

According to what I know from IT, memory>=16GB and cpu>=4 are allowed, but the walltime must be under 36 hours in total.

my partition:
MaxCPUsPerUser 2800
MaxNodesPerUser 100
MaxJobsPerUser 100
MaxSubmitJobs 500
MaxWall 1-12:00:00
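
For reference, per-user limits like these can usually also be read from the scheduler itself (a sketch using common sacctmgr association fields; which fields are populated varies by site):

sacctmgr show assoc where user=$USER format=Account,MaxJobs,MaxSubmit,MaxWall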


leepc12 commented Sep 25, 2018

Default walltime for bowtie2 is 48 hours. I think this caused the problem. Please add the following to your input JSON and try again.

    "atac.bowtie2.mem_mb" : 10000,
    "atac.bowtie2.cpu" : 1,
    "atac.bowtie2.time_hr" : 12,

Also, reduce the number of concurrent jobs to 1 or 2 in backends/backend.conf.
https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/backends/backend.conf#L164
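
The setting referred to is Cromwell's concurrent-job-limit key inside the slurm provider block of that file (a sketch, assuming the ENCODE backend.conf uses the standard key name):

# find the line to edit
grep -n 'concurrent-job-limit' backends/backend.conf
# then lower the value in the slurm provider's config section by hand, e.g.
#   concurrent-job-limit = 2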


gmgitx commented Sep 25, 2018

Thanks for your kind help.
The folder "cromwell-executions" was not created after the run. I modified things as you advised this time and ran it with:
###
sbatch ./example.sbatch
#example.sbatch

#!/bin/bash
#SBATCH --job-name=example_sbatch
#SBATCH --output=example_sbatch.out
#SBATCH --error=example_sbatch.err
#SBATCH --time=36:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=20
source activate encode-atac-seq-pipeline
bash run_atac1.sh
source deactivate

###run_atac1.sh

#!/bin/bash
java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm /home/name2/local/bin/cromwell-34.jar run atac.wdl -i /project2/name1/name2/DLDS/ENCSR356KRQ_subsampled.json -o /project2/name1/name2/ASP/atac-seq-pipeline/workflow_opts/slurm.json

sbatch_report

Sorry about the errors in it.
Also, I sent debug_issue_31.tar.gz to your email.

Thank you again.


leepc12 commented Sep 26, 2018

Is your sbatch_report trimmed?


gmgitx commented Sep 26, 2018

I only combined the two files (example_sbatch.err and example_sbatch.out); I did no other processing.


leepc12 commented Sep 26, 2018

Please take a look at the ###example_sbatch.out part.

A log file in your tarball says that some of the sub-tasks (read_genome_tsv, trim_adapter) finished successfully, but they are not shown in your sbatch_report. I think it is indeed trimmed; it only shows some initialization stages of the pipeline.


gmgitx commented Sep 26, 2018

Yes, you are right; here is an example_sbatch.out. But I think that if the sub-tasks had finished successfully, the folder "cromwell-executions" should have been created, and here it was not.

###example_sbatch.out

[2018-09-25 18:19:47,44] [info] Running with database db.url = jdbc:hsqldb:mem:47a741e8-0324-4c29-a170-f4dd54d61b24;shutdown=false;hsqldb.tx=mvcc
[2018-09-25 18:19:55,46] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2018-09-25 18:19:55,47] [info] [RenameWorkflowOptionsInMetadata] 100%
[2018-09-25 18:19:55,56] [info] Running with database db.url = jdbc:hsqldb:mem:124b08c3-9730-416e-a798-3cf91acbf493;shutdown=false;hsqldb.tx=mvcc
[2018-09-25 18:19:55,88] [warn] This actor factory is deprecated. Please use cromwell.backend.google.pipelines.v1alpha2.PipelinesApiLifecycleActorFactory for PAPI v1 or cromwell.backend.google.pipelines.v2alpha1.PipelinesApiLifecycleActorFactory for PAPI v2
[2018-09-25 18:19:55,92] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
[2018-09-25 18:19:55,92] [info] Using noop to send events.
[2018-09-25 18:19:56,19] [info] Slf4jLogger started
[2018-09-25 18:19:56,37] [info] Workflow heartbeat configuration:
{
  "cromwellId" : "cromid-a9ac2b1",
  "heartbeatInterval" : "2 minutes",
  "ttl" : "10 minutes",
  "writeBatchSize" : 10000,
  "writeThreshold" : 10000
}
[2018-09-25 18:19:56,40] [info] Metadata summary refreshing every 2 seconds.
[2018-09-25 18:19:56,43] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2018-09-25 18:19:56,43] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-09-25 18:19:56,43] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-09-25 18:19:57,18] [info] JobExecutionTokenDispenser - Distribution rate: 50 per 1 seconds.
[2018-09-25 18:19:57,20] [info] SingleWorkflowRunnerActor: Version 34
[2018-09-25 18:19:57,20] [info] JES batch polling interval is 33333 milliseconds
[2018-09-25 18:19:57,20] [info] JES batch polling interval is 33333 milliseconds
[2018-09-25 18:19:57,20] [info] JES batch polling interval is 33333 milliseconds
[2018-09-25 18:19:57,20] [info] PAPIQueryManager Running with 3 workers
[2018-09-25 18:19:57,21] [info] SingleWorkflowRunnerActor: Submitting workflow
[2018-09-25 18:19:57,25] [info] Unspecified type (Unspecified version) workflow 40567000-f7d2-491b-b255-44cdcec9a54b submitted
[2018-09-25 18:19:57,30] [info] SingleWorkflowRunnerActor: Workflow submitted 40567000-f7d2-491b-b255-44cdcec9a54b
[2018-09-25 18:19:57,31] [info] 1 new workflows fetched
[2018-09-25 18:19:57,31] [info] WorkflowManagerActor Starting workflow 40567000-f7d2-491b-b255-44cdcec9a54b
[2018-09-25 18:19:57,31] [warn] SingleWorkflowRunnerActor: received unexpected message: Done in state RunningSwraData
[2018-09-25 18:19:57,31] [info] WorkflowManagerActor Successfully started WorkflowActor-40567000-f7d2-491b-b255-44cdcec9a54b
[2018-09-25 18:19:57,32] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2018-09-25 18:19:57,32] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2018-09-25 18:19:57,37] [info] MaterializeWorkflowDescriptorActor [40567000]: Parsing workflow as WDL draft-2


leepc12 commented Sep 26, 2018

Can you upload your modified input JSON here?


gmgitx commented Sep 26, 2018

Sure, thanks
####.../ENCSR356KRQ_subsampled.json

{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/project2/name1/name2/DLDS/process_data/hg19db/hg19.tsv",
    "atac.fastqs" : [
        [
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF341MYG.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF248EJF.subsampled.400.fastq.gz"],
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep1/pair1/ENCFF106QGY.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep1/pair2/ENCFF368TYI.subsampled.400.fastq.gz"]
        ],
        [
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF641SFZ.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF031ARQ.subsampled.400.fastq.gz"],
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF751XTV.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF590SYZ.subsampled.400.fastq.gz"],
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF927LSG.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF734PEQ.subsampled.400.fastq.gz"],
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF859BDM.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF007USV.subsampled.400.fastq.gz"],
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF193RRC.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF886FSC.subsampled.400.fastq.gz"],
            ["/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair1/ENCFF366DFI.subsampled.400.fastq.gz",
             "/project2/name1/name2/DLDS/test_sample/ENCSR356KRQ/fastq_subsampled/rep2/pair2/ENCFF573UXK.subsampled.400.fastq.gz"]
        ]
    ],

    "atac.paired_end" : true,
    "atac.multimapping" : 4,

    "atac.trim_adapter.auto_detect_adapter" : true,
    "atac.trim_adapter.cpu" : 1,

    "atac.bowtie2.mem_mb" : 10000,
    "atac.bowtie2.cpu" : 1,
    "atac.bowtie2.mem_hr" : 12,

    "atac.filter.cpu" : 1,
    "atac.filter.mem_mb" : 12000,

    "atac.macs2_mem_mb" : 16000,

    "atac.smooth_win" : 73,
    "atac.enable_idr" : true,
    "atac.idr_thresh" : 0.05,

    "atac.qc_report.name" : "ENCSR356KRQ (subsampled 1/400 reads)",
    "atac.qc_report.desc" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}

####...atac-seq-pipeline/workflow_opts/slurm.json

{
    "default_runtime_attributes" : {
        "slurm_partition": "broadwl"
    }
}


leepc12 commented Sep 27, 2018

I think this is a resource quota/limit problem on your cluster. Please play with some of the resource settings in your input JSON. You may need to somehow revert to the last configuration that was partially successful (for some tasks) and adjust the resource settings from there.

https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md#resource

The resource settings for one of your successful tasks (trim_adapter) were 2 cpu, 12000 mem_mb, 24 time_hr.


gmgitx commented Sep 28, 2018

Thanks! Although I only have the trim results so far, I'll continue to adjust the resource settings.
Here is the output from trim_adapter:

.../atac-seq-pipeline/cromwell-executions/atac/06bf6b3b-164f-4917-9507-d90a58a428e4/call-trim_adapter/shard-0/execution
merge_fastqs_R1_ENCFF341MYG.subsampled.400.trim.merged.fastq.gz
merge_fastqs_R2_ENCFF248EJF.subsampled.400.trim.merged.fastq.gz
.../atac-seq-pipeline/cromwell-executions/atac/06bf6b3b-164f-4917-9507-d90a58a428e4/call-trim_adapter/shard-1/execution
merge_fastqs_R1_ENCFF641SFZ.subsampled.400.trim.merged.fastq.gz
merge_fastqs_R2_ENCFF031ARQ.subsampled.400.trim.merged.fastq.gz

Could you confirm whether it is right, or whether something is wrong, that I only get trim results for the first two files of each replicate of ENCSR356KRQ?


leepc12 commented Sep 28, 2018

Yes, these fastqs (two for each replicate) look fine.


leepc12 commented Nov 19, 2018

Closing this issue due to long inactivity.

leepc12 closed this as completed Nov 19, 2018