
samtools error ([main_samview] fail to read the header from "-") during DE analysis ... empty bam file? #45

Closed
cea295933 opened this issue Nov 15, 2023 · 69 comments
Labels
question Further information is requested

Comments

@cea295933

cea295933 commented Nov 15, 2023


Hello,

My workflow seems to be failing as a result of what I think is a samtools error.

Command:
minimap2 -t 4 -ax splice -uf -p 1.0 "genome_index.mmi" "seqs.fastq.gz" | samtools view -Sb > "output.bam"
samtools sort -@ 4 "output.bam" -o "WTpoly2_reads_aln_sorted.bam"

Error:
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[main_samview] fail to read the header from "-".

I've taken a quick look at the files being manipulated here (in the offending work directory). The .mmi does not appear to be human readable but is at least not empty (I can provide it if you like ... GitHub won't allow me to upload it). And seqs.fastq.gz is certainly not empty: 5G compressed and 20G uncompressed (I am pasting just the first entry for your reference below). However, output.bam is empty, and I think this is what causes the issue for the samtools sort command. Any idea what could be causing this?

@4aabeb37-0ae5-4289-ab27-e6f246a7e21a runid=215283ee2a07610bc86fc4baad99de671b475048 read=74966 ch=2198 start_time=2023-09-23T00:45:55.980051-04:00 flow_cell_id=PAO05278 protocol_group_id=Vassar_smm092223 sample_id=NB_A11-H11 barcode=barcode84 barcode_alias=barcode84 parent_read_id=4aabeb37-0ae5-4289-ab27-e6f246a7e21a basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0
ATGTTATGTCCTGTACTTCGTTCAGTTACGTATTGCTAAGGTTAAGTAGTGGACCTAGAACCTGTGCCACAGCACCTCGCCTACCGTGACAAGAAAGTTGTCGGTGTCTTTGTGACTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTTTTTTTTTTTTTTTTTTTTTTTTGCCGATGTGATTCTAAAGATGTCATTCATATATTCAAGTTAAAAAATATATAGAATTGATAAGCCATATATAGAAACAATTACTAAAGAGAGAAAAAAAACCAAGTCTGGTAAAAGAAAAAGAAAAAGAGCCTAATATACGACACTTTGGTTCAAATAATTCACCAGAAGTCCTACGTTTTATCCTCCACTTATTATTATATCATCTTGTTAATTTTATTTCTTTAAAATTTAGCAAATTGCTTGTTGGCCTTACCGGCAGCAGCAGAGACCTTGACGACGTTGAATCTAACAGTCTTGGAGATTGGTCTACATTGACCAACGGTAACAATGTCACCAACTTGGACACGGAAAGCTGGGGAGACGTGGACTGGGACGTTCTTGTGTCTCTTTTCGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCTCTTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTATCAATGTAAGAACCTTCAATAGCGGTCTTTGGGGTCTTGAATCCCAAACCGGCATTCTTATACCATCTCTTGGTTCTCTTGGAAGTCTTGACCTTTGGATTGTTGAAGATGTGAGGTTGCTTTTGGAAAGCTCTTTAGATTGAACAGTTAATTCAGTGGACATCTTTTTTCCCTGGCTTGATACCCCAAAGTCCAAGGGCAATTCGAACTGGAAAGCAATATCAGCACCAACAGAAACACAAAGACACCGACAACTTTCTTGTCACC
+
%%&'*(++)%%(+,**++++235891122<><=>>FIDDCCFD=714211<BA?BBHGITFFEDDEEEE{HIPCBA@@?@ADJPHH{I{NHE=<<=BCECEIDGHGNHFJHI{{JHPFDICE{F@@AAFBCCBI=888>KHFNL{HJEGDDDEC=EHEHJ@D67>?ADRI{{{{{{{{{HBFCIJB?=.,,9{>:9<433267998:99:?>E{J{P{;:99@@b<>?DHME=>AAG>???C@?777;>@bw>@?AFQEJSIJHIHFGH{F{QIGEH{{FCCBB^PFHQJJFDIJKLHEGGHKF{{KGDJS{K{GFF{H{OO{I{KKN{HJIKHRMJ{DEJFFDFCBEIGHJG{GFNEEKEHEKLHDDB@==;:=DCABCDBCEEIIEEABBGKGUEHHJHKJHGABC@BC5546==>=F6666B@BFJFFDDCEGJEC5;;IFB==?A=76689FEGD@A@AIKGGMFEG@{9=<AC=BKFEGHGCDE{=444--,,02+**'('''+-/CDEGIHCA?@eee{HEFHE{GJ{IF{HPDFF@CA889=B?@coc;988,012677452003404476,***00+((/)6<;=::>@FFDEFBFEDHEIG{{{H6666@C.,,,<;FEGDC=77BBFGI8889FEDEDFCBCDEG<;@dlph@??GG:9:YI{JFFBE7(((788:BBFEFFFDKJE{LHI{GBDDF{@@>BDAD?>/../1011.-,-33=>@EELGJEG{HIHIFFBB@ALLGGKFHFFEDHGFHGGIJEDCHCJDEIGFJGHUIFXGHHFEIIOGMFFIGGCC:94&
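
A rough way to narrow down which stage produces the empty BAM (a sketch only; the aln.sam intermediate name is illustrative, and an uncompressed SAM of this size needs plenty of scratch space) is to split the pipe so minimap2's own exit code isn't hidden by samtools view:

# Split the pipe so a minimap2 failure (e.g. killed for memory) is visible directly
minimap2 -t 4 -ax splice -uf -p 1.0 "genome_index.mmi" "seqs.fastq.gz" > "aln.sam"
echo "minimap2 exit code: $?"
samtools view -Sb "aln.sam" > "output.bam"
# quickcheck reports whether the BAM has a valid header and EOF block
samtools quickcheck -v "output.bam" && echo "BAM ok" || echo "BAM empty or truncated"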

@cea295933 cea295933 added the question Further information is requested label Nov 15, 2023
@sarahjeeeze
Contributor

Hi, can you provide your full input cmd? Are these cdna or direct_rna reads?

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

Hi @cea295933

What is the exit code you are getting from the process that has the error fail to read the header from "-"? This step is a very memory-intensive process, and I have seen this same error message/empty BAM appear when the alignment process has run out of memory. Ensure the global config has high memory set, and consider setting --minimap2_index_opts to e.g. -w 25 or -w 50.

You can actually delete this section from any config you have

executor {
    $local {
        cpus = 4
        memory = "8 GB"
    }
}
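
For reference, the exit status of the failed task is recorded in the hidden .exitcode file of that task's Nextflow work directory (the path below is illustrative), and resources can also be raised with a small override config passed via -c rather than editing the shipped one; the values here are only a sketch:

# Exit code and stderr of the failing task (work-directory path is illustrative)
cat /path/to/work/ab/cdef1234/.exitcode
tail /path/to/work/ab/cdef1234/.command.err

# Sketch of an override config with more generous local resources, passed with -c
cat > custom.config <<'EOF'
executor {
    $local {
        cpus = 16
        memory = "64 GB"
    }
}
EOF
nextflow run epi2me-labs/wf-transcriptomes -c custom.config --minimap2_index_opts "-w 25"  # plus the usual workflow parameters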

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 16, 2023

Sorry, I see you have supplied the full input command above. What are the reference annotation and reference genome you are using, if publicly available?

@sarahjeeeze
Contributor

Also how many samples are you running?

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

If you have a reference transcriptome available and supply it to the ref_transcriptome parameter, does the workflow progress past this part? I'm trying to work out whether the reads, or something to do with the StringTie merged transcriptome we are using, could be causing this error.

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

Could you try decreasing the memory directive to something like 64GB? Is there any further message in the salmon error? I think the part about updating the version isn't what's causing it to break, as we have that same message in our automated tests, which don't fail.

@cea295933
Author

A few questions:

  1. You want me to decrease memory now? Via sbatch or config file, or both?
  2. Do you want me to run with transcriptome to produce salmon error? I can also provide those log files.
  3. what is the syntax of the minimap2_index_opts command? When I enter "minimap2_index_opts -w 25" within my sbatch file I get an error.

@cea295933
Author

Here's what I get if I run DE only, a la the GitHub test code (the DE-only test code from GitHub itself runs fine):

ERROR ~ Error executing process > 'pipeline:differential_expression:count_transcripts (1)'

Caused by:
Process pipeline:differential_expression:count_transcripts (1) terminated with an error exit status (1)

Command executed:

salmon quant --noErrorModel -p "4" -t "ammended.ref_transcriptome" -l SF -a "WTpoly1_reads_aln_sorted.bam" -o counts
mv counts/quant.sf "WTpoly1.transcript_counts.tsv"
seqkit bam "WTpoly1_reads_aln_sorted.bam" 2> "WTpoly1.seqkit.stats"

Command exit status:
1

Command output:
(empty)

Command error:
Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
# salmon (alignment-based) v1.9.0
# [ program ] => salmon
# [ command ] => quant
# [ noErrorModel ] => { }
# [ threads ] => { 4 }
# [ targets ] => { ammended.ref_transcriptome }
# [ libType ] => { SF }
# [ alignments ] => { WTpoly1_reads_aln_sorted.bam }
# [ output ] => { counts }
Logs will be written to counts/logs
[2023-11-14 22:10:17.514] [jointLog] [info] setting maxHashResizeThreads to 4
[2023-11-14 22:10:17.514] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
Library format { type:single end, relative orientation:none, strandedness:sense }
[2023-11-14 22:10:17.514] [jointLog] [info] numQuantThreads = 2
parseThreads = 2
Checking that provided alignment files have consistent headers . . . done
Populating targets from aln = "WTpoly1_reads_aln_sorted.bam", fasta = "ammended.ref_transcriptome" . . .done

@cea295933
Author

If I run without DE, no issues with these data or with test data.

@cea295933
Author

cea295933 commented Nov 16, 2023

One issue I can see with running with a supplied transcriptome is that there are no gold-standard (SGD, RefSeq, NCBI, etc.) yeast transcriptomes that contain good UTR info. In fact, most (if not all) are just ORFs, with just one isoform per gene. There are studies that have tried to investigate this (Pelechano et al., for example), but they do not report a curated transcriptome, and these have not been adopted by the big databases.

@cea295933
Author

Just to clarify:

  1. Thanks so much for your help!
  2. Happy to run whatever you need ... not trying to avoid that by supplying previous logs. Just let me know what you'd like to test. Our HPC is currently pretty wide open so I can run multiple jobs at a time.

@sarahjeeeze
Contributor

  1. Yes, sorry, try either decreasing it in the executor or just deleting the executor section:

executor {
    $local {
        cpus = 64
        memory = "32 GB"
    }
}

  2. Yes please, if you use this test in your environment do you get the salmon error?

wget -O differential_expression.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-isoforms/differential_expression.tar.gz && tar -xzvf differential_expression.tar.gz
nextflow run epi2me-labs/wf-transcriptomes \
--fastq  differential_expression/differential_expression_fastq \
--transcriptome-source precomputed \
--de_analysis \
--ref_genome differential_expression/hg38_chr20.fa \
--ref_annotation differential_expression/gencode.v22.annotation.chr20.gff3 \
--direct_rna --minimap2_index_opts '-k 15' \
--ref_transcriptome differential_expression/ref_transcriptome.fasta \
--sample_sheet test_data/sample_sheet.csv

  3. Put quotes around the index opts: --minimap2_index_opts "-w 25"

@cea295933
Author

Test DE analysis runs fine. Log below:

Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-153-generic x86_64)

System information as of Thu 16 Nov 2023 12:29:58 PM EST

System load: 0.0 Users logged in: 3
Usage of /: 17.4% of 1.79TB IPv4 address for docker0: 172.17.0.1
Memory usage: 17% IPv4 address for eth0: 192.168.0.5
Swap usage: 0% IPv4 address for eth1: 143.229.7.20
Processes: 608
Last login: Thu Nov 16 10:55:33 2023 from 143.229.41.79
caitken@ccas020:[12:29]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
caitken@ccas020:
[12:30]$ cd /work/epi2me/
caitken@ccas020:/work/epi2me[12:30]$ ls
50 nextflow
Aitken_epi2me_20231110_poly_emc_noDE.out outdir
Aitken_epi2me_20231110_total_emc2.out output
Aitken_epi2me_20231110_total_emcK10.out Poly_epi2me_noRNA_sbatch
Aitken_epi2me_20231110_total_emcK11.out Poly_epi2me_sbatch_emc_DEonly
Aitken_epi2me_20231110_total_emcK14.out Poly_epi2me_sbatch_emc_noDE.sh
Aitken_epi2me_20231110_total_emc_noDE.out test_data.tar.gz
Aitken_epi2me_20231110_total_emc.out testDE_sbatch
Aitken_epi2me_poly_emc_DEonly.out Total_epi2me_noRNA_sbatch
Aitken_epi2me_poly_noRNA2.out Total_epi2me_sbatch_emc
Aitken_epi2me_poly_noRNA.out Total_epi2me_sbatch_emc_DEonly
Aitken_epi2me_testDE.out Total_epi2me_sbatch_emc_K10
Aitken_epi2me_total_emc_DEonly.out Total_epi2me_sbatch_emc_K11
Aitken_epi2me_total_noRNA.out Total_epi2me_sbatch_emc_K14.sh
chr20 Total_epi2me_sbatch_emc_NCBI.sh
differential_expression Total_epi2me_sbatch_emc_noDE
differential_expression.tar.gz work
ERR6053095_chr20.fastq workspace_dir
caitken@ccas020:/work/epi2me[12:30]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:30]$ vi Poly_epi2me_noRNA_sbatch
caitken@ccas020:/work/epi2me[12:35]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:36]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36004
caitken@ccas020:/work/epi2me[12:37]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
caitken@ccas020:/work/epi2me[12:37]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36005 general augcccation fgoggins R 0:19 1 10 node1
caitken@ccas020:/work/epi2me[12:37]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:38]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36007
caitken@ccas020:/work/epi2me[12:38]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 0:16 1 10 node1
36005 general augcccation fgoggins R 0:49 1 10 node1
36007 emc Aitken_epi2me_po caitken R 0:01 1 128 node5
caitken@ccas020:/work/epi2me[12:38]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 0:21 1 10 node1
36005 general augcccation fgoggins R 0:54 1 10 node1
caitken@ccas020:/work/epi2me[12:38]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:39]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36008
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:04 1 10 node1
36005 general augcccation fgoggins R 1:37 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:01 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:08 1 10 node1
36005 general augcccation fgoggins R 1:41 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:05 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:10 1 10 node1
36005 general augcccation fgoggins R 1:43 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:07 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:13 1 10 node1
36005 general augcccation fgoggins R 1:46 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:10 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:20 1 10 node1
36005 general augcccation fgoggins R 1:53 1 10 node1
caitken@ccas020:/work/epi2me[12:39]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:40]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36009
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:33 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:01 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:35 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:03 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:36 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:04 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:38 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:06 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:40 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:08 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
caitken@ccas020:/work/epi2me[12:40]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:43]$ ls
25 nextflow
50 outdir
Aitken_epi2me_20231110_poly_emc_noDE.out output
Aitken_epi2me_20231110_total_emc2.out Poly_epi2me_noRNA_sbatch
Aitken_epi2me_20231110_total_emcK10.out Poly_epi2me_sbatch_emc_DEonly
Aitken_epi2me_20231110_total_emcK11.out Poly_epi2me_sbatch_emc_noDE.sh
Aitken_epi2me_20231110_total_emcK14.out test_data.tar.gz
Aitken_epi2me_20231110_total_emc_noDE.out testDE_sbatch
Aitken_epi2me_20231110_total_emc.out Total_epi2me_noRNA_sbatch
Aitken_epi2me_poly_emc_DEonly.out Total_epi2me_sbatch_emc
Aitken_epi2me_poly_noRNA2.out Total_epi2me_sbatch_emc_DEonly
Aitken_epi2me_poly_noRNA.out Total_epi2me_sbatch_emc_K10
Aitken_epi2me_testDE.out Total_epi2me_sbatch_emc_K11
Aitken_epi2me_total_emc_DEonly.out Total_epi2me_sbatch_emc_K14.sh
Aitken_epi2me_total_noRNA.out Total_epi2me_sbatch_emc_NCBI.sh
chr20 Total_epi2me_sbatch_emc_noDE
differential_expression work
differential_expression.tar.gz workspace_dir
ERR6053095_chr20.fastq
caitken@ccas020:/work/epi2me[12:43]$ vi Aitken_epi2me_poly_emc_DEonly.out
caitken@ccas020:/work/epi2me[12:46]$ vi Aitken_epi2me_20231110_poly_emc_noDE.out
caitken@ccas020:/work/epi2me[12:47]$ vi Aitken_epi2me_total_emc_DEonly.out
caitken@ccas020:/work/epi2me[12:56]$ vi Aitken_epi2me_testDE.out

[ee/9bb70f] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[06/6fef76] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[5a/06e947] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[f3/3d0571] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[29/44dba5] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[88/6a0146] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[d2/edc948] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[87/d5752d] process > pipeline:makeReport (1) [100%] 1 of 1 ✔
[59/5d6d71] process > output (22) [ 90%] 20 of 22

executor > local (58)
[25/87e83a] process > validate_sample_sheet [100%] 1 of 1 ✔
[cd/a8bb45] process > fastcat (5) [100%] 6 of 6 ✔
[e2/6e936a] process > pipeline:preprocess_ref_ann... [100%] 1 of 1 ✔
[c7/2a7592] process > pipeline:collectFastqIngres... [100%] 6 of 6 ✔
[88/616c92] process > pipeline:getVersions [100%] 1 of 1 ✔
[0e/9b4ea0] process > pipeline:getParams [100%] 1 of 1 ✔
[b8/e04e62] process > pipeline:preprocess_ref_tra... [100%] 1 of 1 ✔
[b1/35adc2] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[ee/9bb70f] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[06/6fef76] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[5a/06e947] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[f3/3d0571] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[29/44dba5] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[88/6a0146] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[d2/edc948] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[87/d5752d] process > pipeline:makeReport (1) [100%] 1 of 1 ✔
[bf/2e04ad] process > output (21) [100%] 22 of 22 ✔

Completed at: 14-Nov-2023 10:58:58
Duration : 2m 55s
CPU hours : 0.1
Succeeded : 58


@sarahjeeeze
Contributor

OK, and now try reference-guided:

nextflow run epi2me-labs/wf-transcriptomes \
--fastq  differential_expression/differential_expression_fastq \
--de_analysis \
--ref_genome differential_expression/hg38_chr20.fa --transcriptome-source reference-guided \
--ref_annotation differential_expression/gencode.v22.annotation.chr20.gtf \
--direct_rna --minimap2_index_opts '-k 15'  --sample_sheet test_data/sample_sheet.csv

@cea295933
Author

Running all three now ... btw, the --sample_sheet test_data/sample_sheet.csv gave me problems previously ... I don't remember if it's not included in the test data one downloads, or whether it's in a different directory or what ... I ended up having to create my own sample sheet (or move it to the right directory ... can't remember).

... perhaps you could look into that separately... it might be giving others trouble too

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 16, 2023

good idea, yes will add it to the downloadable test set

@cea295933
Author

Two jobs with my data appear to be running stably, and the test run ran without issues (see below). When I get the samtools error, it is after a long stable run (over 1 h), whereas when I get the salmon error, it's usually after 10 minutes of running or less. Both have been running now for at least 10 m.

N E X T F L O W ~ version 23.10.0
Launching https://github.com/epi2me-labs/wf-transcriptomes [boring_planck] DSL2 - revision: 0a8fe19 [master]

WARN: Found unexpected parameters:

  • --transcriptome-source: reference-guided
  • Ignore this warning: params.schema_ignore_params = "transcriptome-source"

wf-transcriptomes v0.4.1-g0a8fe19

Core Nextflow options
revision : master
runName : boring_planck
containerEngine : docker
container : [withLabel:isoforms:ontresearch/wf-transcriptomes:shae7c9f184996a384e99be68e790f0612f0c732867, withLabel:wf_common:ontresearch/wf-common:sha0a6dc21fac17291f4acb2e0f67bcdec7bf63e6b7]
launchDir : /work/epi2me
workDir : /work/epi2me/work
projectDir : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes
userName : caitken
profile : standard
configFiles : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes/nextflow.config

Input Options
fastq : differential_expression/differential_expression_fastq
ref_genome : differential_expression/hg38_chr20.fa
ref_annotation : differential_expression/gencode.v22.annotation.chr20.gtf
direct_rna : true

Sample Options
sample_sheet : differential_expression/sample_sheet.csv

Options for reference-based workflow
minimap2_index_opts: -k 15

Differential Expression Options
de_analysis : true

!! Only displaying parameters that differ from the pipeline defaults !!

If you use epi2me-labs/wf-transcriptomes for your analysis please cite:


This is epi2me-labs/wf-transcriptomes v0.4.1-g0a8fe19.

Checking fastq input.
executor > local (119)
[1b/620c1e] process > validate_sample_sheet [100%] 1 of 1 ✔
[91/df2f5c] process > fastcat (6) [100%] 6 of 6 ✔
[a2/4fc9fc] process > pipeline:preprocess_ref_annotation [100%] 1 of 1 ✔
[46/5e2ffe] process > pipeline:collectFastqIngressResultsInDir (6) [100%] 6 of 6 ✔
[90/bb2a11] process > pipeline:getVersions [100%] 1 of 1 ✔
[1e/02f132] process > pipeline:getParams [100%] 1 of 1 ✔
[c4/1f960d] process > pipeline:build_minimap_index (1) [100%] 1 of 1 ✔
[d8/c25222] process > pipeline:reference_assembly:map_reads (6) [100%] 6 of 6 ✔
[90/7a7229] process > pipeline:split_bam (6) [100%] 6 of 6 ✔
[29/517ddb] process > pipeline:assemble_transcripts (6) [100%] 6 of 6 ✔
[80/d44317] process > pipeline:merge_gff_bundles (6) [100%] 6 of 6 ✔
[2f/7955ff] process > pipeline:run_gffcompare (6) [100%] 6 of 6 ✔
[b9/bc2600] process > pipeline:get_transcriptome (6) [100%] 6 of 6 ✔
[2a/5f66bf] process > pipeline:merge_transcriptomes (1) [100%] 1 of 1 ✔
[bf/9fd364] process > pipeline:differential_expression:checkSampleSheetCond... [100%] 1 of 1 ✔
[11/844a2e] process > pipeline:differential_expression:build_minimap_index_... [100%] 1 of 1 ✔
[06/51b1ab] process > pipeline:differential_expression:map_transcriptome (2) [100%] 6 of 6 ✔
[d3/469d56] process > pipeline:differential_expression:count_transcripts (6) [100%] 6 of 6 ✔
[3b/4f7573] process > pipeline:differential_expression:mergeCounts [100%] 1 of 1 ✔
[70/2b6988] process > pipeline:differential_expression:mergeTPM [100%] 1 of 1 ✔
[61/5666c7] process > pipeline:differential_expression:deAnalysis (1) [100%] 1 of 1 ✔
[5b/9a1800] process > pipeline:differential_expression:plotResults (1) [100%] 1 of 1 ✔
[2a/cb76df] process > pipeline:makeReport (1) [100%] 1 of 1 ✔
[ad/9cda77] process > output (42) [100%] 46 of 46 ✔
Completed at: 16-Nov-2023 13:10:44
Duration : 5m 7s
CPU hours : 0.2
Succeeded : 119

@cea295933
Author

circle back once those have completed?

Thanks so much for all of your help!

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 16, 2023

No worries, sorry it's not working! For the salmon breakage: if you go into a work directory for the map_transcriptome process, do the BAMs contain alignments? I was able to recreate the error you had in the other ticket by inputting an empty BAM to that step. That's an error we should be catching and reporting to the user.

Perhaps, if you run the workflow with your inputs without the --de_analysis parameter, you could share the output report wf-transcriptome-report.html, just to check there are alignments between the reference genome and your reads.
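
A quicker spot check, if the sorted BAM from the alignment step is still in its work directory (file name as in the earlier salmon log), is to count how many reads actually mapped:

# Overall mapping summary for the sorted BAM
samtools flagstat WTpoly1_reads_aln_sorted.bam
# Count of primary mapped records only (excludes unmapped, secondary and supplementary)
samtools view -c -F 0x904 WTpoly1_reads_aln_sorted.bam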

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@cea295933
Author

tried to send .html reports but I can't attach in GitHub

@sarahjeeeze
Contributor

I think if you zip them then you can attach

@cea295933
Author

Reports.zip

@cea295933
Author

I have a meeting until later this afternoon ... so not ignoring if I'm radio silent. Will send along results of current runs once I have them. I also added a third run with the reference-guided (with our data) as a point of comparison

@sarahjeeeze
Contributor

Could you go to the work directory of merge_transcriptomes and, if it's small enough, perhaps zip up and send the final_non_redundant_transcriptome.fasta and, if possible, also the gtf file from that directory? Just to check it looks sensible.

@cea295933
Author

Files attached below.

Some updates on where things stand here (updating my posts above). I've been chasing the memory-constraint lead. I realized that the jobs I'm running on our HPC are running entirely on virtual memory and using almost no RSS. For example, the results of htop on the running node show almost every process at 48 GB of virtual memory while using only 0.4% of available memory (I can provide a screen grab of that if you want). In trying to chase this down, I realized that the memory limits within SLURM were not set properly. Unfortunately, correcting those (I can now ask SLURM to give me all 512 GB of RAM on the node and it will) still results in the same odd virtual memory usage (unless this is actually normal?) and the processes still fail in the same way. I also noticed that it seemed like nextflow/docker were ignoring the config file I was passing (I still saw what looked like 4 CPUs and 8 GB of RAM ... the default ... in the nextflow log). So I modified nextflow.log just to change those two lines to match what I'm requesting (128 CPUs and 512 GB of RAM). But still the same failure. So that brings me to my question: **Is this virtual memory usage, and the very low physical memory usage, expected? If not, do you have ideas of where to look? I'd like to make as much progress as possible investigating solutions on our end, as I know you're very busy!**

TXandGTF.zip

@sarahjeeeze
Contributor

OK, those files look reasonable. It's only a very small transcriptome, so although you have a lot of reads, I wouldn't expect this step to require that much memory. Could you possibly try to run the workflow on a laptop with docker or singularity? You could even try the desktop app EPI2ME https://labs.epi2me.io/downloads/. Asking my team if they have any ideas about your SLURM issues.
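
In the meantime, SLURM's own accounting should show what a finished job actually consumed; a high MaxVMSize with a much smaller MaxRSS is common for JVM-based tools like Nextflow and isn't necessarily a problem (the job ID below is just an example from your squeue output):

# MaxRSS = peak resident memory actually used; MaxVMSize = reserved virtual address space
sacct -j 36008 --units=G --format=JobID,JobName%20,Elapsed,MaxRSS,MaxVMSize,State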

@cea295933
Author

cea295933 commented Nov 20, 2023 via email

@cea295933
Author

OK ... I just did this with a further-subsetted dataset (only 10 fastq.gz files per barcode). I ran it on a local PC using the desktop GUI and received the exact same error (log below). I ran this further-subsetted version on our HPC and, again, the same error. I then tried on the local PC again, but using different genome files (I had previously tested this on our HPC, but that was before I realized I should not provide a transcriptome when using DE and should instead rely on the generated transcriptome).

The good news: it looks like it proceeds past the point where we've seen the error previously.
The bad news: it now fails at a separate step ... the error suggests it is looking for "+" and "-" in the gff/gtf GRanges object but doesn't find them (logs below ... labeled GenBank and RefSeq). Do you think we can move forward from here?

A few questions/ideas:

  1. The desktop GUI says that a reference transcriptome is required for DE analysis ... but it didn't complain that I did not provide one ... could this be an issue? My understanding was that if one chooses reference-guided, the workflow assembles a transcriptome that then guides the subsequent DE analysis (a question here ... is a separate transcriptome generated per sample, or is there a master merged transcriptome that is used for all samples in the DE analysis?)

  2. I had to change one of the options in the GUI version ... to say that alignments needed to be in at least 2 samples minimum (the default was three) ... because I only have two replicates of each ... could that be an issue?

nextflow.txt
RefSeq_epi2me.txt
GenBank_epi2me.txt

@cea295933
Author

Just checked on our HPC and the same holds true there ... it runs further with the GenBank files but crashes at the pipeline:differential_expression:deAnalysis (1) step (see below).

[b6/d0517e] process > output (16) [ 69%] 16 of 23
Retry deAnalysis with gff format setting and version removal.
Retry deAnalysis with gtf format setting and version removal.
[70/24e7e0] NOTE: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) -- Execution is retried (2)
[94/280d99] NOTE: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) -- Execution is retried (3)
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
Execution halted

@sarahjeeeze
Contributor

Ah, no, I will update that note about the reference transcriptome requirement. Oh, that's good; as mentioned in the documentation, we have only tested it with annotation files from Encode, Ensembl and NCBI.

I think just ensure the strand column in your ref annotation file only has + or - (I think . is OK as well). Let me know if you need any guidance on fixing it; feel free to send the ref_annotation you are using, or a link to where it's from, and I can check it. No, it should be OK to set that it needs to be present in 2 samples.
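
A quick way to see which strand values an annotation actually contains (column 7 of a GTF/GFF; annotation.gtf below stands in for whichever file is being checked):

# Tally the distinct strand values, skipping comment lines
grep -v '^#' annotation.gtf | cut -f7 | sort | uniq -c
# List a few of the offending unstranded records, if any
grep -v '^#' annotation.gtf | awk -F'\t' '$7 != "+" && $7 != "-"' | head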

One thing to try may be to reduce the threads parameter to one or two, which will make some steps take a bit longer but may reduce memory requirement at the transcriptome alignment step.

Also, in the output directory's execution folder, do you have any report.html and timeline.html files? Do they have any info about memory consumption before the workflow failure?

@cea295933
Author

Attaching the ones I tried that work ... one of these is NCBI_RefSeq so it should work ... but I see "." for entries like centromeres and telomeres ... suggestions on how to fix that (I don't remember my Perl for text editing all that well)? Or alternatives for yeast?

GenBank.zip
NCBI_RefSeq.zip

@cea295933
Author

cea295933 commented Nov 20, 2023

Here is the execution folder. Bear in mind this is the heavily downsampled data set (only 10 files per barcode), so it's likely not testing memory limits like the others. That said, it is still eating up 80 GB of virtual memory on the HPC ... and only 1 GB of RAM ... is that normal?

I will say that for the other jobs, it was only the preprocessing reads step that appeared hugely memory intensive, at least according to these reports. And that always worked fine on the HPC (whereas it takes ages on a local PC)

execution.zip

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 20, 2023

I think any "." entries are OK. Have you tried the genomic.gtf from the GenBank zip folder as the annotation file? Oh, I also noticed ### at the end of each of the annotation files; I have not seen this before. Perhaps remove those last lines.

@cea295933
Author

ok ... trying both of these

even now, I can find the "counts.tsv" in the working folder ... is that complete at this point of the pipeline? if so, I can use that to run DESeq2 in R myself ... (at least as long as this now works on the complete dataset ... haven't tried that yet)

@cea295933
Author

those don't appear to work ... attaching the files in the work directory in the event they spark any ideas
files.zip

@sarahjeeeze
Contributor

Hi, hmm, it looks like you have all novel transcripts. Does BK006934.2 match up with the reference file's FASTA header?

@sarahjeeeze
Contributor

Oh sorry, I see your transcript_ids are all 'unassigned_transcript_1234'; is that expected?

@cea295933
Author

Yeah, that's strange. On one of my previous runs I didn't see that ... I don't remember now whether it was RefSeq or GenBank (or it could be a gff vs. gtf thing) ... I'll try to figure out what the combination was, but that all_counts.tsv is below.
all_counts.txt

@cea295933
Author

OK ... if I run using the GenBank .fa and .gff, I get transcripts assigned. But if I replace it with the .gtf (or use any combination of the RefSeq files), I get either unassigned transcripts or MSTRG IDs (which I think come from StringTie). And I appear to get the same/similar error for all.

NCBI GenBank .fa and .gff
Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2
rna-YNCL0010C 1369.0 1222.0 1721.0 2023.0
rna-YNCL0012C 1148.0 1044.0 1435.0 1717.0
rna-YLR154C-G 274.0 212.0 214.0 369.0
rna-YNCH0007W 230.0 212.0 372.0 455.0
rna-YNCL0016C 208.0 165.0 276.0 299.0
MSTRG.1131.1 188.0 151.0 107.0 112.0
rna-YBR118W 175.0 147.0 156.0 147.0
rna-YGL103W 150.0 163.0 92.0 114.0
rna-YLR110C 142.0 152.0 162.0 196.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
Execution halted

Work dir:
/work/epi2me/work/d3/fd1f38bd3fe7e3c1162b91b2ac41c5

NCBI Genbank .fa and .gtf
Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2
unassigned_transcript_4097 1363.0 1221.0 1722.0 2025.0
unassigned_transcript_4099 1150.0 1042.0 1436.0 1719.0
unassigned_transcript_4110 278.0 215.0 209.0 365.0
unassigned_transcript_2712 230.0 212.0 372.0 455.0
unassigned_transcript_4104 208.0 166.0 277.0 298.0
MSTRG.1131.1 188.0 151.0 107.0 112.0
unassigned_transcript_350 175.0 147.0 156.0 147.0
unassigned_transcript_2160 150.0 163.0 92.0 114.0
unassigned_transcript_4051 142.0 152.0 162.0 196.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
Execution halted

Work dir:
/work/epi2me/work/9c/84eec819b9dfa17fda6c0626238884

NCBI Refseq .fa and .gff
Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2
MSTRG.263.1 188.0 151.0 107.0 112.0
MSTRG.289.1 159.0 175.0 96.0 121.0
MSTRG.457.1 152.0 130.0 94.0 119.0
MSTRG.511.1 148.0 157.0 167.0 198.0
MSTRG.32.1 140.0 134.0 114.0 106.0
MSTRG.768.1 127.0 133.0 81.0 81.0
MSTRG.61.1 113.0 96.0 111.0 100.0
MSTRG.198.1 112.0 82.0 39.0 38.0
MSTRG.515.1 109.0 107.0 74.0 72.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.
Filtering counts using DRIMSeq.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) :
mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted

Work dir:
/work/epi2me/work/7f/bbb4ae1a2db55fea2627f197d32994

NCBI Refseq .fa and .gtf
Reference MUTpoly2 WTpoly1 MUTpoly1 WTpoly2
MSTRG.263.1 188.0 107.0 151.0 112.0
MSTRG.289.1 159.0 96.0 175.0 121.0
MSTRG.457.1 152.0 94.0 130.0 119.0
MSTRG.511.1 148.0 167.0 157.0 198.0
MSTRG.32.1 140.0 115.0 135.0 106.0
MSTRG.768.1 127.0 81.0 133.0 81.0
MSTRG.61.1 113.0 111.0 96.0 100.0
MSTRG.198.1 112.0 39.0 82.0 38.0
MSTRG.515.1 109.0 74.0 107.0 72.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.
Filtering counts using DRIMSeq.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) :
mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted

Work dir:
/work/epi2me/work/01/c91d7314e0a6a9978ccec290d477f9

@sarahjeeeze
Contributor

Sorry, I feel like it's nearly there! Does it retry that DE_analysis step? I would expect that error, but it should retry that step 4 times, and hopefully one of the retries will work, as we iterate through a few different gtf2/gff3 R options. Check you are on the latest version within the app (if you are in the app), or try nextflow pull epi2me-labs/wf-transcriptomes if you are on the command line to check you have the latest version.

@cea295933
Author

cea295933 commented Nov 21, 2023 via email

@sarahjeeeze
Contributor

No, I'm sorry it's still not working!

OK, on the files in your files.zip I did this to find the two lines which have the strand set to '.':

grep -P '\t\.\t\.\t' annotation.gtf
BK006938.2      StringTie       transcript      1302119 1302626 1000    .       .       gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; 
BK006938.2      StringTie       exon    1302119 1302626 1000    .       .       gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; exon_number "1"; 

This is what is causing the error values in 'transcripts$tx_strand' must be "+" or "-". If I run the downstream DE_analysis without these lines, it seems to complete.

This is an error being created by StringTie; I've not seen it before, but I will add a step to remove any non-stranded annotations and hopefully release that by the end of the week. In the meantime, you could add these lines at line 92 of subworkflows/differential_expression.nf, which might fix it:

grep -v -P '\t\.\t\.\t' annotation.gtf > tmp.gtf
mv tmp.gtf annotation.gtf

@cea295933
Author

cea295933 commented Nov 21, 2023 via email

@cea295933
Author

OK ... I think I found it. Before line 92, right? I see the de_analysis.R script being called on line 92, and it appears to use annotation.gtf as an input, if I understand it correctly.

@cea295933
Author

ok, I changed differential_expression.nf as below:

"""
88 mkdir merged
89 mkdir de_analysis
90 mv $merged_tsv merged/all_counts.tsv
91 mv $sample_sheet de_analysis/coldata.tsv
92
93 grep -v -P '\t.\t.\t' annotation.gtf > tmp.gtf
94 mv tmp.gtf annotation.gtf
95
96 de_analysis.R annotation.gtf $params.min_samps_gene_expr $params.min_samps_feature_expr $params.min_gene_expr $params.min_feature_expr $annotation_type $strip_version

but now I get the below the moment I launch the workflow, so I think I made the edit incorrectly (sorry!)

ERROR ~ Module compilation error

  • file : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes/./subworkflows/differential_expression.nf
  • cause: token recognition error at: ' de_analysis/coldata.tsv\n\t\n\tgrep -v -P '\t.' @ line 91, column 21.
    mv $sample_sheet de_analysis/coldata.tsv

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 21, 2023

OK, sorry, I forgot to escape the '\'.

So it should have been:

 grep -v -P '\\t\\.\\t\\.\\t' annotation.gtf > tmp.gtf
 mv tmp.gtf annotation.gtf
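
Once the workflow is re-run with that edit, a quick sanity check in the deAnalysis work directory (the path will differ per run) is that no unstranded records survive the filter:

# Should print 0 if all '.'-stranded StringTie records were removed
grep -c -P '\t\.\t\.\t' annotation.gtf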

@cea295933
Author

completed (with the downsampled data set)!!! Huzzah! I just launched jobs for the full data set to see if those work.

Thank you!!!

@sarahjeeeze
Contributor

ah okay yes, now the challenge is the full data set - let me know how it goes

@cea295933
Author

cea295933 commented Nov 22, 2023

Completed! Thank you so much. One question: is there a meaningful difference in the pipeline with and without the minimap2 option we incorporated ("-w 50")? Trying to determine whether it's worth testing it without that flag or not, now that everything works.

@sarahjeeeze
Contributor

Good point! It may result in a few more unaligned reads, so you may be missing some transcripts. Now that it's working, it might be worth trying with the standard "-w 10" to see if it still works, or "-w 25".
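
If you do compare, the same invocation can just be relaunched with a different window size (or with the option omitted for minimap2's default); with -resume the early read-ingestion steps should be reused, though indexing and everything downstream will rerun. A sketch:

# Same run as before, only the minimizer window size changed
nextflow run epi2me-labs/wf-transcriptomes -resume --minimap2_index_opts "-w 10"  # plus your usual --fastq/--ref_genome/--ref_annotation options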

@sarahjeeeze
Contributor

Closing due to lack of response; the issue is solved.

@cea295933
Author

cea295933 commented Dec 12, 2023 via email
