
samtools error ([main_samview] fail to read the header from "-") during DE analysis ... empty bam file? #45

Closed
cea295933 opened this issue Nov 15, 2023 · 69 comments
Labels
question Further information is requested

Comments

@cea295933

cea295933 commented Nov 15, 2023


Hello,

My workflow seems to be failing as a result of what I think is a samtools error.

Command:
minimap2 -t 4 -ax splice -uf -p 1.0 "genome_index.mmi" "seqs.fastq.gz" | samtools view -Sb > "output.bam"
samtools sort -@ 4 "output.bam" -o "WTpoly2_reads_aln_sorted.bam"

Error:
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[main_samview] fail to read the header from "-".

I've taken a quick look at the files being manipulated here (in the offending work directory). The .mmi does not appear to be human readable but is at least not empty (I can provide it if you like ... GitHub won't allow me to upload it). And seqs.fastq.gz is certainly not empty: 5G compressed and 20G uncompressed (I am pasting just the first entry for your reference below). However, output.bam is empty, and I think this is what causes the issue for the samtools sort command. Any idea what could be causing this?

@4aabeb37-0ae5-4289-ab27-e6f246a7e21a runid=215283ee2a07610bc86fc4baad99de671b475048 read=74966 ch=2198 start_time=2023-09-23T00:45:55.980051-04:00 flow_cell_id=PAO05278 protocol_group_id=Vassar_smm092223 sample_id=NB_A11-H11 barcode=barcode84 barcode_alias=barcode84 parent_read_id=4aabeb37-0ae5-4289-ab27-e6f246a7e21a basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0
ATGTTATGTCCTGTACTTCGTTCAGTTACGTATTGCTAAGGTTAAGTAGTGGACCTAGAACCTGTGCCACAGCACCTCGCCTACCGTGACAAGAAAGTTGTCGGTGTCTTTGTGACTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTTTTTTTTTTTTTTTTTTTTTTTTGCCGATGTGATTCTAAAGATGTCATTCATATATTCAAGTTAAAAAATATATAGAATTGATAAGCCATATATAGAAACAATTACTAAAGAGAGAAAAAAAACCAAGTCTGGTAAAAGAAAAAGAAAAAGAGCCTAATATACGACACTTTGGTTCAAATAATTCACCAGAAGTCCTACGTTTTATCCTCCACTTATTATTATATCATCTTGTTAATTTTATTTCTTTAAAATTTAGCAAATTGCTTGTTGGCCTTACCGGCAGCAGCAGAGACCTTGACGACGTTGAATCTAACAGTCTTGGAGATTGGTCTACATTGACCAACGGTAACAATGTCACCAACTTGGACACGGAAAGCTGGGGAGACGTGGACTGGGACGTTCTTGTGTCTCTTTTCGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCTCTTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTATCAATGTAAGAACCTTCAATAGCGGTCTTTGGGGTCTTGAATCCCAAACCGGCATTCTTATACCATCTCTTGGTTCTCTTGGAAGTCTTGACCTTTGGATTGTTGAAGATGTGAGGTTGCTTTTGGAAAGCTCTTTAGATTGAACAGTTAATTCAGTGGACATCTTTTTTCCCTGGCTTGATACCCCAAAGTCCAAGGGCAATTCGAACTGGAAAGCAATATCAGCACCAACAGAAACACAAAGACACCGACAACTTTCTTGTCACC
+
%%&'*(++)%%(+,**++++235891122<><=>>FIDDCCFD=714211<BA?BBHGITFFEDDEEEE{HIPCBA@@?@ADJPHH{I{NHE=<<=BCECEIDGHGNHFJHI{{JHPFDICE{F@@AAFBCCBI=888>KHFNL{HJEGDDDEC=EHEHJ@D67>?ADRI{{{{{{{{{HBFCIJB?=.,,9{>:9<433267998:99:?>E{J{P{;:99@@b<>?DHME=>AAG>???C@?777;>@bw>@?AFQEJSIJHIHFGH{F{QIGEH{{FCCBB^PFHQJJFDIJKLHEGGHKF{{KGDJS{K{GFF{H{OO{I{KKN{HJIKHRMJ{DEJFFDFCBEIGHJG{GFNEEKEHEKLHDDB@==;:=DCABCDBCEEIIEEABBGKGUEHHJHKJHGABC@BC5546==>=F6666B@BFJFFDDCEGJEC5;;IFB==?A=76689FEGD@A@AIKGGMFEG@{9=<AC=BKFEGHGCDE{=444--,,02+**'('''+-/CDEGIHCA?@eee{HEFHE{GJ{IF{HPDFF@CA889=B?@coc;988,012677452003404476,***00+((/)6<;=::>@FFDEFBFEDHEIG{{{H6666@C.,,,<;FEGDC=77BBFGI8889FEDEDFCBCDEG<;@dlph@??GG:9:YI{JFFBE7(((788:BBFEFFFDKJE{LHI{GBDDF{@@>BDAD?>/../1011.-,-33=>@EELGJEG{HIHIFFBB@ALLGGKFHFFEDHGFHGGIJEDCHCJDEIGFJGHUIFXGHHFEIIOGMFFIGGCC:94&
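
A rough way to narrow down which stage produces the empty BAM (a sketch only; the aln.sam intermediate name is illustrative, and an uncompressed SAM of this size needs plenty of scratch space) is to split the pipe so minimap2's own exit code isn't hidden by samtools view:

# Split the pipe so a minimap2 failure (e.g. killed for memory) is visible directly
minimap2 -t 4 -ax splice -uf -p 1.0 "genome_index.mmi" "seqs.fastq.gz" > "aln.sam"
echo "minimap2 exit code: $?"
samtools view -Sb "aln.sam" > "output.bam"
# quickcheck reports whether the BAM has a valid header and EOF block
samtools quickcheck -v "output.bam" && echo "BAM ok" || echo "BAM empty or truncated"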

@cea295933 cea295933 added the question Further information is requested label Nov 15, 2023
@sarahjeeeze
Contributor

Hi, can you provide your full input cmd? Are these cdna or direct_rna reads?

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

Hi @cea295933

What is the exit code you are getting from the process that has the error fail to read the header from "-"? This step is a very memory-intensive process, and I have seen this same error message/empty BAM appear when the alignment process has run out of memory. Ensure the global config has high memory set, and consider setting --minimap2_index_opts to e.g. -w 25 or -w 50.

You can actually delete this section from any config you have

executor {
    $local {
        cpus = 4
        memory = "8 GB"
    }
}
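
For reference, the exit status of the failed task is recorded in the hidden .exitcode file of that task's Nextflow work directory (the path below is illustrative), and resources can also be raised with a small override config passed via -c rather than editing the shipped one; the values here are only a sketch:

# Exit code and stderr of the failing task (work-directory path is illustrative)
cat /path/to/work/ab/cdef1234/.exitcode
tail /path/to/work/ab/cdef1234/.command.err

# Sketch of an override config with more generous local resources, passed with -c
cat > custom.config <<'EOF'
executor {
    $local {
        cpus = 16
        memory = "64 GB"
    }
}
EOF
nextflow run epi2me-labs/wf-transcriptomes -c custom.config --minimap2_index_opts "-w 25"  # plus the usual workflow parameters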

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 16, 2023

Sorry, I see you have supplied the full input command above. What are the reference annotation and reference genome you are using, if publicly available?

@sarahjeeeze
Contributor

Also how many samples are you running?

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

If you have a reference transcriptome available and supply it to the ref_transcriptome parameter, does the workflow progress past this part? I'm trying to work out whether the reads, or something to do with the StringTie merged transcriptome we are using, could be causing this error.

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@sarahjeeeze
Contributor

Could you try decreasing the memory directive to something like 64GB? Is there any further message in the salmon error? I think the part about updating the version isn't what's causing it to break, as we have that same message in our automated tests, which don't fail.

@cea295933
Author

A few questions:

  1. You want me to decrease memory now? Via sbatch or config file, or both?
  2. Do you want me to run with transcriptome to produce salmon error? I can also provide those log files.
  3. what is the syntax of the minimap2_index_opts command? When I enter "minimap2_index_opts -w 25" within my sbatch file I get an error.

@cea295933
Author

Here's what I get if I run DE only, a la the GitHub test code (the DE-only test code from GitHub itself runs fine):

ERROR ~ Error executing process > 'pipeline:differential_expression:count_transcripts (1)'

Caused by:
Process pipeline:differential_expression:count_transcripts (1) terminated with an error exit status (1)

Command executed:

salmon quant --noErrorModel -p "4" -t "ammended.ref_transcriptome" -l SF -a "WTpoly1_reads_aln_sorted.bam" -o counts
mv counts/quant.sf "WTpoly1.transcript_counts.tsv"
seqkit bam "WTpoly1_reads_aln_sorted.bam" 2> "WTpoly1.seqkit.stats"

Command exit status:
1

Command output:
(empty)

Command error:
Version Info: ### PLEASE UPGRADE SALMON ###
### A newer version of salmon with important bug fixes and improvements is available. ####
###
The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.
###
Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
# salmon (alignment-based) v1.9.0
# [ program ] => salmon
# [ command ] => quant
# [ noErrorModel ] => { }
# [ threads ] => { 4 }
# [ targets ] => { ammended.ref_transcriptome }
# [ libType ] => { SF }
# [ alignments ] => { WTpoly1_reads_aln_sorted.bam }
# [ output ] => { counts }
Logs will be written to counts/logs
[2023-11-14 22:10:17.514] [jointLog] [info] setting maxHashResizeThreads to 4
[2023-11-14 22:10:17.514] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
Library format { type:single end, relative orientation:none, strandedness:sense }
[2023-11-14 22:10:17.514] [jointLog] [info] numQuantThreads = 2
parseThreads = 2
Checking that provided alignment files have consistent headers . . . done
Populating targets from aln = "WTpoly1_reads_aln_sorted.bam", fasta = "ammended.ref_transcriptome" . . .done

@cea295933
Author

If I run without DE, no issues with these data or with test data.

@cea295933
Author

cea295933 commented Nov 16, 2023

One issue I can see with running with a supplied transcriptome is that there are no gold-standard (SGD, RefSeq, NCBI, etc.) yeast transcriptomes that contain good UTR info. In fact, most (if not all) are just ORFs, with just one isoform per gene. There are studies that have tried to investigate this (Pelechano et al., for example), but they do not report a curated transcriptome, and these have not been adopted by the big databases.

@cea295933
Author

Just to clarify:

  1. Thanks so much for your help!
  2. Happy to run whatever you need ... not trying to avoid that by supplying previous logs. Just let me know what you'd like to test. Our HPC is currently pretty wide open so I can run multiple jobs at a time.

@sarahjeeeze
Contributor

  1. Yes, sorry, try either decreasing it in the executor or just deleting the executor section:

executor {
    $local {
        cpus = 64
        memory = "32 GB"
    }
}

  2. Yes please, if you use this test in your environment do you get the salmon error?

wget -O differential_expression.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-isoforms/differential_expression.tar.gz && tar -xzvf differential_expression.tar.gz
nextflow run epi2me-labs/wf-transcriptomes \
--fastq  differential_expression/differential_expression_fastq \
--transcriptome-source precomputed \
--de_analysis \
--ref_genome differential_expression/hg38_chr20.fa \
--ref_annotation differential_expression/gencode.v22.annotation.chr20.gff3 \
--direct_rna --minimap2_index_opts '-k 15' \
--ref_transcriptome differential_expression/ref_transcriptome.fasta \
--sample_sheet test_data/sample_sheet.csv

  3. Put quotes around the index opts: --minimap2_index_opts "-w 25"

@cea295933
Author

Test DE analysis runs fine. Log below:

Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-153-generic x86_64)

System information as of Thu 16 Nov 2023 12:29:58 PM EST

System load: 0.0 Users logged in: 3
Usage of /: 17.4% of 1.79TB IPv4 address for docker0: 172.17.0.1
Memory usage: 17% IPv4 address for eth0: 192.168.0.5
Swap usage: 0% IPv4 address for eth1: 143.229.7.20
Processes: 608
Last login: Thu Nov 16 10:55:33 2023 from 143.229.41.79
caitken@ccas020:[12:29]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
caitken@ccas020:
[12:30]$ cd /work/epi2me/
caitken@ccas020:/work/epi2me[12:30]$ ls
50 nextflow
Aitken_epi2me_20231110_poly_emc_noDE.out outdir
Aitken_epi2me_20231110_total_emc2.out output
Aitken_epi2me_20231110_total_emcK10.out Poly_epi2me_noRNA_sbatch
Aitken_epi2me_20231110_total_emcK11.out Poly_epi2me_sbatch_emc_DEonly
Aitken_epi2me_20231110_total_emcK14.out Poly_epi2me_sbatch_emc_noDE.sh
Aitken_epi2me_20231110_total_emc_noDE.out test_data.tar.gz
Aitken_epi2me_20231110_total_emc.out testDE_sbatch
Aitken_epi2me_poly_emc_DEonly.out Total_epi2me_noRNA_sbatch
Aitken_epi2me_poly_noRNA2.out Total_epi2me_sbatch_emc
Aitken_epi2me_poly_noRNA.out Total_epi2me_sbatch_emc_DEonly
Aitken_epi2me_testDE.out Total_epi2me_sbatch_emc_K10
Aitken_epi2me_total_emc_DEonly.out Total_epi2me_sbatch_emc_K11
Aitken_epi2me_total_noRNA.out Total_epi2me_sbatch_emc_K14.sh
chr20 Total_epi2me_sbatch_emc_NCBI.sh
differential_expression Total_epi2me_sbatch_emc_noDE
differential_expression.tar.gz work
ERR6053095_chr20.fastq workspace_dir
caitken@ccas020:/work/epi2me[12:30]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:30]$ vi Poly_epi2me_noRNA_sbatch
caitken@ccas020:/work/epi2me[12:35]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:36]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36004
caitken@ccas020:/work/epi2me[12:37]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
caitken@ccas020:/work/epi2me[12:37]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36005 general augcccation fgoggins R 0:19 1 10 node1
caitken@ccas020:/work/epi2me[12:37]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:38]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36007
caitken@ccas020:/work/epi2me[12:38]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 0:16 1 10 node1
36005 general augcccation fgoggins R 0:49 1 10 node1
36007 emc Aitken_epi2me_po caitken R 0:01 1 128 node5
caitken@ccas020:/work/epi2me[12:38]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 0:21 1 10 node1
36005 general augcccation fgoggins R 0:54 1 10 node1
caitken@ccas020:/work/epi2me[12:38]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:39]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36008
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:04 1 10 node1
36005 general augcccation fgoggins R 1:37 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:01 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:08 1 10 node1
36005 general augcccation fgoggins R 1:41 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:05 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:10 1 10 node1
36005 general augcccation fgoggins R 1:43 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:07 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:13 1 10 node1
36005 general augcccation fgoggins R 1:46 1 10 node1
36008 emc Aitken_epi2me_po caitken R 0:10 1 128 node5
caitken@ccas020:/work/epi2me[12:39]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 1:20 1 10 node1
36005 general augcccation fgoggins R 1:53 1 10 node1
caitken@ccas020:/work/epi2me[12:39]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:40]$ sbatch Poly_epi2me_noRNA_sbatch
Submitted batch job 36009
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:33 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:01 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:35 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:03 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:36 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:04 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:38 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:06 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
36006 general 631G fgoggins R 2:40 1 10 node1
36009 emc Aitken_epi2me_po caitken R 0:08 1 128 node5
caitken@ccas020:/work/epi2me[12:40]$ squeue
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
caitken@ccas020:/work/epi2me[12:40]$ vi Aitken_epi2me_poly_noRNA2.out
caitken@ccas020:/work/epi2me[12:43]$ ls
25 nextflow
50 outdir
Aitken_epi2me_20231110_poly_emc_noDE.out output
Aitken_epi2me_20231110_total_emc2.out Poly_epi2me_noRNA_sbatch
Aitken_epi2me_20231110_total_emcK10.out Poly_epi2me_sbatch_emc_DEonly
Aitken_epi2me_20231110_total_emcK11.out Poly_epi2me_sbatch_emc_noDE.sh
Aitken_epi2me_20231110_total_emcK14.out test_data.tar.gz
Aitken_epi2me_20231110_total_emc_noDE.out testDE_sbatch
Aitken_epi2me_20231110_total_emc.out Total_epi2me_noRNA_sbatch
Aitken_epi2me_poly_emc_DEonly.out Total_epi2me_sbatch_emc
Aitken_epi2me_poly_noRNA2.out Total_epi2me_sbatch_emc_DEonly
Aitken_epi2me_poly_noRNA.out Total_epi2me_sbatch_emc_K10
Aitken_epi2me_testDE.out Total_epi2me_sbatch_emc_K11
Aitken_epi2me_total_emc_DEonly.out Total_epi2me_sbatch_emc_K14.sh
Aitken_epi2me_total_noRNA.out Total_epi2me_sbatch_emc_NCBI.sh
chr20 Total_epi2me_sbatch_emc_noDE
differential_expression work
differential_expression.tar.gz workspace_dir
ERR6053095_chr20.fastq
caitken@ccas020:/work/epi2me[12:43]$ vi Aitken_epi2me_poly_emc_DEonly.out
caitken@ccas020:/work/epi2me[12:46]$ vi Aitken_epi2me_20231110_poly_emc_noDE.out
caitken@ccas020:/work/epi2me[12:47]$ vi Aitken_epi2me_total_emc_DEonly.out
caitken@ccas020:/work/epi2me[12:56]$ vi Aitken_epi2me_testDE.out

[ee/9bb70f] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[06/6fef76] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[5a/06e947] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[f3/3d0571] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[29/44dba5] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[88/6a0146] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[d2/edc948] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[87/d5752d] process > pipeline:makeReport (1) [100%] 1 of 1 ✔
[59/5d6d71] process > output (22) [ 90%] 20 of 22

executor > local (58)
[25/87e83a] process > validate_sample_sheet [100%] 1 of 1 ✔
[cd/a8bb45] process > fastcat (5) [100%] 6 of 6 ✔
[e2/6e936a] process > pipeline:preprocess_ref_ann... [100%] 1 of 1 ✔
[c7/2a7592] process > pipeline:collectFastqIngres... [100%] 6 of 6 ✔
[88/616c92] process > pipeline:getVersions [100%] 1 of 1 ✔
[0e/9b4ea0] process > pipeline:getParams [100%] 1 of 1 ✔
[b8/e04e62] process > pipeline:preprocess_ref_tra... [100%] 1 of 1 ✔
[b1/35adc2] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[ee/9bb70f] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[06/6fef76] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[5a/06e947] process > pipeline:differential_expre... [100%] 6 of 6 ✔
[f3/3d0571] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[29/44dba5] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[88/6a0146] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[d2/edc948] process > pipeline:differential_expre... [100%] 1 of 1 ✔
[87/d5752d] process > pipeline:makeReport (1) [100%] 1 of 1 ✔
[bf/2e04ad] process > output (21) [100%] 22 of 22 ✔

Completed at: 14-Nov-2023 10:58:58
Duration : 2m 55s
CPU hours : 0.1
Succeeded : 58


@sarahjeeeze
Contributor

OK, and now try reference-guided:

nextflow run epi2me-labs/wf-transcriptomes \
--fastq  differential_expression/differential_expression_fastq \
--de_analysis \
--ref_genome differential_expression/hg38_chr20.fa --transcriptome-source reference-guided \
--ref_annotation differential_expression/gencode.v22.annotation.chr20.gtf \
--direct_rna --minimap2_index_opts '-k 15'  --sample_sheet test_data/sample_sheet.csv

@cea295933
Author

Running all three now ... btw, the --sample_sheet test_data/sample_sheet.csv gave me problems previously ... I don't remember if it's not included in the test data one downloads, or whether it's in a different directory or what ... I ended up having to create my own sample sheet (or move it to the right directory ... can't remember).

... perhaps you could look into that separately... it might be giving others trouble too

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 16, 2023

good idea, yes will add it to the downloadable test set

@cea295933
Author

Two jobs with my data appear to be running stably, and the test run ran without issues (see below). When I get the samtools error, it is after a long stable run (over 1 h), whereas when I get the salmon error, it's usually after 10 minutes of running or less. Both have been running now for at least 10 m.

N E X T F L O W ~ version 23.10.0
Launching https://github.com/epi2me-labs/wf-transcriptomes [boring_planck] DSL2 - revision: 0a8fe19 [master]

WARN: Found unexpected parameters:

  • --transcriptome-source: reference-guided
  • Ignore this warning: params.schema_ignore_params = "transcriptome-source"

wf-transcriptomes v0.4.1-g0a8fe19

Core Nextflow options
revision : master
runName : boring_planck
containerEngine : docker
container : [withLabel:isoforms:ontresearch/wf-transcriptomes:shae7c9f184996a384e99be68e790f0612f0c732867, withLabel:wf_common:ontresearch/wf-common:sha0a6dc21fac17291f4acb2e0f67bcdec7bf63e6b7]
launchDir : /work/epi2me
workDir : /work/epi2me/work
projectDir : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes
userName : caitken
profile : standard
configFiles : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes/nextflow.config

Input Options
fastq : differential_expression/differential_expression_fastq
ref_genome : differential_expression/hg38_chr20.fa
ref_annotation : differential_expression/gencode.v22.annotation.chr20.gtf
direct_rna : true

Sample Options
sample_sheet : differential_expression/sample_sheet.csv

Options for reference-based workflow
minimap2_index_opts: -k 15

Differential Expression Options
de_analysis : true

!! Only displaying parameters that differ from the pipeline defaults !!

If you use epi2me-labs/wf-transcriptomes for your analysis please cite:


This is epi2me-labs/wf-transcriptomes v0.4.1-g0a8fe19.

Checking fastq input.
executor > local (119)
[1b/620c1e] process > validate_sample_sheet [100%] 1 of 1 ✔
[91/df2f5c] process > fastcat (6) [100%] 6 of 6 ✔
[a2/4fc9fc] process > pipeline:preprocess_ref_annotation [100%] 1 of 1 ✔
[46/5e2ffe] process > pipeline:collectFastqIngressResultsInDir (6) [100%] 6 of 6 ✔
[90/bb2a11] process > pipeline:getVersions [100%] 1 of 1 ✔
[1e/02f132] process > pipeline:getParams [100%] 1 of 1 ✔
[c4/1f960d] process > pipeline:build_minimap_index (1) [100%] 1 of 1 ✔
[d8/c25222] process > pipeline:reference_assembly:map_reads (6) [100%] 6 of 6 ✔
[90/7a7229] process > pipeline:split_bam (6) [100%] 6 of 6 ✔
[29/517ddb] process > pipeline:assemble_transcripts (6) [100%] 6 of 6 ✔
[80/d44317] process > pipeline:merge_gff_bundles (6) [100%] 6 of 6 ✔
[2f/7955ff] process > pipeline:run_gffcompare (6) [100%] 6 of 6 ✔
[b9/bc2600] process > pipeline:get_transcriptome (6) [100%] 6 of 6 ✔
[2a/5f66bf] process > pipeline:merge_transcriptomes (1) [100%] 1 of 1 ✔
[bf/9fd364] process > pipeline:differential_expression:checkSampleSheetCond... [100%] 1 of 1 ✔
[11/844a2e] process > pipeline:differential_expression:build_minimap_index_... [100%] 1 of 1 ✔
[06/51b1ab] process > pipeline:differential_expression:map_transcriptome (2) [100%] 6 of 6 ✔
[d3/469d56] process > pipeline:differential_expression:count_transcripts (6) [100%] 6 of 6 ✔
[3b/4f7573] process > pipeline:differential_expression:mergeCounts [100%] 1 of 1 ✔
[70/2b6988] process > pipeline:differential_expression:mergeTPM [100%] 1 of 1 ✔
[61/5666c7] process > pipeline:differential_expression:deAnalysis (1) [100%] 1 of 1 ✔
[5b/9a1800] process > pipeline:differential_expression:plotResults (1) [100%] 1 of 1 ✔
[2a/cb76df] process > pipeline:makeReport (1) [100%] 1 of 1 ✔
[ad/9cda77] process > output (42) [100%] 46 of 46 ✔
Completed at: 16-Nov-2023 13:10:44
Duration : 5m 7s
CPU hours : 0.2
Succeeded : 119

@cea295933
Author

circle back once those have completed?

Thanks so much for all of your help!

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 16, 2023

No worries, sorry it's not working! For the salmon breakage: if you go into a work directory for the map_transcriptome process, do the BAMs contain alignments? I was able to recreate the error you had in the other ticket by inputting an empty BAM to that step. That's an error we should be catching and reporting to the user.

Perhaps, if you run the workflow with your inputs without the --de_analysis parameter, you could share the output report wf-transcriptome-report.html, just to check there are alignments between the reference genome and your reads.
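
A quicker spot check, if the sorted BAM from the alignment step is still in its work directory (file name as in the earlier salmon log), is to count how many reads actually mapped:

# Overall mapping summary for the sorted BAM
samtools flagstat WTpoly1_reads_aln_sorted.bam
# Count of primary mapped records only (excludes unmapped, secondary and supplementary)
samtools view -c -F 0x904 WTpoly1_reads_aln_sorted.bam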

@cea295933
Author

cea295933 commented Nov 16, 2023 via email

@cea295933
Author

tried to send .html reports but I can't attach in GitHub

@sarahjeeeze
Contributor

I think if you zip them then you can attach

@cea295933
Author

Reports.zip

@cea295933
Author

I have a meeting until later this afternoon ... so not ignoring if I'm radio silent. Will send along results of current runs once I have them. I also added a third run with the reference-guided (with our data) as a point of comparison

@sarahjeeeze
Contributor

Could you go to the work directory of merge_transcriptomes and, if it's small enough, perhaps zip up and send the final_non_redundant_transcriptome.fasta and, if possible, also the gtf file from that directory? Just to check it looks sensible.

@cea295933
Author

Files attached below.

Some updates on where things stand here (updating my posts above). I've been chasing the memory-constraint lead. I realized that the jobs I'm running on our HPC are running entirely on virtual memory and using almost no RSS. For example, the results of htop on the running node show almost every process at 48 GB of virtual memory while using only 0.4% of available memory (I can provide a screen grab of that if you want). In trying to chase this down, I realized that the memory limits within SLURM were not set properly. Unfortunately, correcting those (I can now ask SLURM to give me all 512 GB of RAM on the node and it will) still results in the same odd virtual memory usage (unless this is actually normal?) and the processes still fail in the same way. I also noticed that it seemed like nextflow/docker were ignoring the config file I was passing (I still saw what looked like 4 CPUs and 8 GB of RAM ... the default ... in the nextflow log). So I modified nextflow.log just to change those two lines to match what I'm requesting (128 CPUs and 512 GB of RAM). But still the same failure. So that brings me to my question: **Is this virtual memory usage, and the very low physical memory usage, expected? If not, do you have ideas of where to look? I'd like to make as much progress as possible investigating solutions on our end, as I know you're very busy!**

TXandGTF.zip

@sarahjeeeze
Contributor

OK, those files look reasonable. It's only a very small transcriptome, so although you have a lot of reads, I wouldn't expect this step to require that much memory. Could you possibly try to run the workflow on a laptop with docker or singularity? You could even try the desktop app EPI2ME https://labs.epi2me.io/downloads/. Asking my team if they have any ideas about your SLURM issues.
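
In the meantime, SLURM's own accounting should show what a finished job actually consumed; a high MaxVMSize with a much smaller MaxRSS is common for JVM-based tools like Nextflow and isn't necessarily a problem (the job ID below is just an example from your squeue output):

# MaxRSS = peak resident memory actually used; MaxVMSize = reserved virtual address space
sacct -j 36008 --units=G --format=JobID,JobName%20,Elapsed,MaxRSS,MaxVMSize,State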

@cea295933
Author

cea295933 commented Nov 20, 2023 via email

@cea295933
Author

OK ... I just did this with a further-subsetted dataset (only 10 fastq.gz files per barcode). I ran it on a local PC using the desktop GUI and received the exact same error (log below). I ran this further-subsetted version on our HPC and, again, the same error. I then tried on the local PC again, but using different genome files (I had previously tested this on our HPC, but that was before I realized I should not provide a transcriptome when using DE and should instead rely on the generated transcriptome).

The good news: it looks like it proceeds past the point where we've seen the error previously.
The bad news: it now fails at a separate step ... the error suggests it is looking for "+" and "-" in the gff/gtf GRanges object but doesn't find them (logs below ... labeled GenBank and RefSeq). Do you think we can move forward from here?

A few questions/ideas:

  1. The desktop GUI says that a reference transcriptome is required for DE analysis ... but it didn't complain that I did not provide one ... could this be an issue? My understanding was that if one chooses reference-guided, the workflow assembles a transcriptome that then guides the subsequent DE analysis (a question here ... is a separate transcriptome generated per sample, or is there a master merged transcriptome that is used for all samples in the DE analysis?)

  2. I had to change one of the options in the GUI version ... to say that alignments needed to be in at least 2 samples minimum (the default was three) ... because I only have two replicates of each ... could that be an issue?

nextflow.txt
RefSeq_epi2me.txt
GenBank_epi2me.txt

@cea295933
Author

Just checked on our HPC and the same holds true there ... it runs further with the GenBank files but crashes at the pipeline:differential_expression:deAnalysis (1) step (see below).

[b6/d0517e] process > output (16) [ 69%] 16 of 23
Retry deAnalysis with gff format setting and version removal.
Retry deAnalysis with gtf format setting and version removal.
[70/24e7e0] NOTE: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) -- Execution is retried (2)
[94/280d99] NOTE: Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1) -- Execution is retried (3)
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
Execution halted

@sarahjeeeze
Contributor

Ah, no, I will update that note about the reference transcriptome requirement. Oh, that's good; as mentioned in the documentation, we have only tested it with annotation files from Encode, Ensembl and NCBI.

I think just ensure the strand column in your ref annotation file only has + or - (I think . is OK as well). Let me know if you need any guidance on fixing it; feel free to send the ref_annotation you are using, or a link to where it's from, and I can check it. No, it should be OK to set that it needs to be present in 2 samples.
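
A quick way to see which strand values an annotation actually contains (column 7 of a GTF/GFF; annotation.gtf below stands in for whichever file is being checked):

# Tally the distinct strand values, skipping comment lines
grep -v '^#' annotation.gtf | cut -f7 | sort | uniq -c
# List a few of the offending unstranded records, if any
grep -v '^#' annotation.gtf | awk -F'\t' '$7 != "+" && $7 != "-"' | head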

One thing to try may be to reduce the threads parameter to one or two, which will make some steps take a bit longer but may reduce memory requirement at the transcriptome alignment step.

Also, in the output directory's execution folder, do you have any report.html and timeline.html files? Do they have any info about memory consumption before the workflow failure?

@cea295933
Author

Attaching the ones I tried that work ... one of these is NCBI_RefSeq so it should work ... but I see "." for entries like centromeres and telomeres ... suggestions on how to fix that (I don't remember my Perl for text editing all that well)? Or alternatives for yeast?

GenBank.zip
NCBI_RefSeq.zip

@cea295933
Author

cea295933 commented Nov 20, 2023

Here is the execution folder. Bear in mind this is the heavily downsampled data set (only 10 files per barcode), so it's likely not testing memory limits like the others. That said, it is still eating up 80 GB of virtual memory on the HPC ... and only 1 GB of RAM ... is that normal?

I will say that for the other jobs, it was only the preprocessing reads step that appeared hugely memory intensive, at least according to these reports. And that always worked fine on the HPC (whereas it takes ages on a local PC)

execution.zip

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 20, 2023

I think any "." entries are OK. Have you tried the genomic.gtf from the GenBank zip folder as the annotation file? Oh, I also noticed ### at the end of each of the annotation files; I have not seen this before. Perhaps remove those last lines.

@cea295933
Author

ok ... trying both of these

even now, I can find the "counts.tsv" in the working folder ... is that complete at this point of the pipeline? if so, I can use that to run DESeq2 in R myself ... (at least as long as this now works on the complete dataset ... haven't tried that yet)

@cea295933
Author

those don't appear to work ... attaching the files in the work directory in the event they spark any ideas
files.zip

@sarahjeeeze
Contributor

Hi, hmm, it looks like you have all novel transcripts. Does BK006934.2 match up with the reference file's FASTA header?

@sarahjeeeze
Contributor

Oh sorry, I see your transcript_ids are all 'unassigned_transcript_1234'; is that expected?

@cea295933
Author

Yeah, that's strange. On one of my previous runs I didn't see that ... I don't remember now whether it was RefSeq or GenBank (or it could be a gff vs. gtf thing) ... I'll try to figure out what the combination was, but that all_counts.tsv is below.
all_counts.txt

@cea295933
Author

OK ... if I run using the GenBank .fa and .gff, I get transcripts assigned. But if I replace it with the .gtf (or use any combination of the RefSeq files), I get either unassigned transcripts or MSTRG IDs (which I think come from StringTie). And I appear to get the same/similar error for all.

NCBI GenBank .fa and .gff
Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2
rna-YNCL0010C 1369.0 1222.0 1721.0 2023.0
rna-YNCL0012C 1148.0 1044.0 1435.0 1717.0
rna-YLR154C-G 274.0 212.0 214.0 369.0
rna-YNCH0007W 230.0 212.0 372.0 455.0
rna-YNCL0016C 208.0 165.0 276.0 299.0
MSTRG.1131.1 188.0 151.0 107.0 112.0
rna-YBR118W 175.0 147.0 156.0 147.0
rna-YGL103W 150.0 163.0 92.0 114.0
rna-YLR110C 142.0 152.0 162.0 196.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
Execution halted

Work dir:
/work/epi2me/work/d3/fd1f38bd3fe7e3c1162b91b2ac41c5

NCBI Genbank .fa and .gtf
Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2
unassigned_transcript_4097 1363.0 1221.0 1722.0 2025.0
unassigned_transcript_4099 1150.0 1042.0 1436.0 1719.0
unassigned_transcript_4110 278.0 215.0 209.0 365.0
unassigned_transcript_2712 230.0 212.0 372.0 455.0
unassigned_transcript_4104 208.0 166.0 277.0 298.0
MSTRG.1131.1 188.0 151.0 107.0 112.0
unassigned_transcript_350 175.0 147.0 156.0 147.0
unassigned_transcript_2160 150.0 163.0 92.0 114.0
unassigned_transcript_4051 142.0 152.0 162.0 196.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .makeTxDb_normarg_transcripts(transcripts) :
values in 'transcripts$tx_strand' must be "+" or "-"
Calls: makeTxDbFromGFF ... makeTxDbFromGRanges -> makeTxDb -> .makeTxDb_normarg_transcripts
Execution halted

Work dir:
/work/epi2me/work/9c/84eec819b9dfa17fda6c0626238884

NCBI Refseq .fa and .gff
Reference MUTpoly2 MUTpoly1 WTpoly1 WTpoly2
MSTRG.263.1 188.0 151.0 107.0 112.0
MSTRG.289.1 159.0 175.0 96.0 121.0
MSTRG.457.1 152.0 130.0 94.0 119.0
MSTRG.511.1 148.0 157.0 167.0 198.0
MSTRG.32.1 140.0 134.0 114.0 106.0
MSTRG.768.1 127.0 133.0 81.0 81.0
MSTRG.61.1 113.0 96.0 111.0 100.0
MSTRG.198.1 112.0 82.0 39.0 38.0
MSTRG.515.1 109.0 107.0 74.0 72.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.
Filtering counts using DRIMSeq.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) :
mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted

Work dir:
/work/epi2me/work/7f/bbb4ae1a2db55fea2627f197d32994

NCBI Refseq .fa and .gtf
Reference MUTpoly2 WTpoly1 MUTpoly1 WTpoly2
MSTRG.263.1 188.0 107.0 151.0 112.0
MSTRG.289.1 159.0 96.0 175.0 121.0
MSTRG.457.1 152.0 94.0 130.0 119.0
MSTRG.511.1 148.0 167.0 157.0 198.0
MSTRG.32.1 140.0 115.0 135.0 106.0
MSTRG.768.1 127.0 81.0 133.0 81.0
MSTRG.61.1 113.0 111.0 96.0 100.0
MSTRG.198.1 112.0 39.0 82.0 38.0
MSTRG.515.1 109.0 74.0 107.0 72.0

Error
ERROR ~ Error executing process > 'pipeline:differential_expression:deAnalysis (1)'

Caused by:
Process pipeline:differential_expression:deAnalysis (1) terminated with an error exit status (1)

Command executed:

mkdir merged
mkdir de_analysis
mv de_transcript_counts.tsv merged/all_counts.tsv
mv BarcodesPoly.csv de_analysis/coldata.tsv
de_analysis.R annotation.gtf 3 1 10 3 gtf true

Command exit status:
1

Command output:
Loading counts, conditions and parameters.
Loading annotation database.
Filtering counts using DRIMSeq.

Command error:
Loading counts, conditions and parameters.
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'de_analysis/coldata.tsv'
Loading annotation database.
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
'select()' returned 1:many mapping between keys and columns
Filtering counts using DRIMSeq.
Error in dmDSdata(counts = counts, samples = coldata) :
mode(counts) %in% "numeric" is not TRUE
Calls: dmDSdata -> stopifnot
Execution halted

Work dir:
/work/epi2me/work/01/c91d7314e0a6a9978ccec290d477f9

@sarahjeeeze
Contributor

Sorry, I feel like it's nearly there! Does it retry that DE_analysis step? I would expect that error, but it should retry that step 4 times, and hopefully one of the retries will work, as we iterate through a few different gtf2/gff3 R options. Check you are on the latest version within the app (if you are in the app), or try nextflow pull epi2me-labs/wf-transcriptomes if you are on the command line to check you have the latest version.

@cea295933
Author

cea295933 commented Nov 21, 2023 via email

@sarahjeeeze
Contributor

No, I'm sorry it's still not working!

OK, on the files in your files.zip I did this to find the two lines which have the strand set to '.':

grep -P '\t\.\t\.\t' annotation.gtf
BK006938.2      StringTie       transcript      1302119 1302626 1000    .       .       gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; 
BK006938.2      StringTie       exon    1302119 1302626 1000    .       .       gene_id "MSTRG.844"; transcript_id "MSTRG.844.1"; exon_number "1"; 

This is what is causing the error values in 'transcripts$tx_strand' must be "+" or "-". If I run the downstream DE_analysis without these lines, it seems to complete.

This is an error being created by StringTie; I've not seen it before, but I will add a step to remove any non-stranded annotations and hopefully release that by the end of the week. In the meantime, you could add these lines at line 92 of subworkflows/differential_expression.nf, which might fix it:

grep -v -P '\t\.\t\.\t' annotation.gtf > tmp.gtf
mv tmp.gtf annotation.gtf

@cea295933
Author

cea295933 commented Nov 21, 2023 via email

@cea295933
Author

OK ... I think I found it. Before line 92, right? I see the de_analysis.R script being called on line 92, and it appears to use annotation.gtf as an input, if I understand it correctly.

@cea295933
Author

ok, I changed differential_expression.nf as below:

"""
88 mkdir merged
89 mkdir de_analysis
90 mv $merged_tsv merged/all_counts.tsv
91 mv $sample_sheet de_analysis/coldata.tsv
92
93 grep -v -P '\t.\t.\t' annotation.gtf > tmp.gtf
94 mv tmp.gtf annotation.gtf
95
96 de_analysis.R annotation.gtf $params.min_samps_gene_expr $params.min_samps_feature_expr $params.min_gene_expr $params.min_feature_expr $annotation_type $strip_version

but now I get the below the moment I launch the workflow, so I think I made the edit incorrectly (sorry!)

ERROR ~ Module compilation error

  • file : /home/caitken/.nextflow/assets/epi2me-labs/wf-transcriptomes/./subworkflows/differential_expression.nf
  • cause: token recognition error at: ' de_analysis/coldata.tsv\n\t\n\tgrep -v -P '\t.' @ line 91, column 21.
    mv $sample_sheet de_analysis/coldata.tsv

@sarahjeeeze
Contributor

sarahjeeeze commented Nov 21, 2023

OK, sorry, I forgot to escape the '\'.

So it should have been:

 grep -v -P '\\t\\.\\t\\.\\t' annotation.gtf > tmp.gtf
 mv tmp.gtf annotation.gtf
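
Once the workflow is re-run with that edit, a quick sanity check in the deAnalysis work directory (the path will differ per run) is that no unstranded records survive the filter:

# Should print 0 if all '.'-stranded StringTie records were removed
grep -c -P '\t\.\t\.\t' annotation.gtf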

@cea295933
Author

completed (with the downsampled data set)!!! Huzzah! I just launched jobs for the full data set to see if those work.

Thank you!!!

@sarahjeeeze
Contributor

ah okay yes, now the challenge is the full data set - let me know how it goes

@cea295933
Author

cea295933 commented Nov 22, 2023

Completed! Thank you so much. One question: is there a meaningful difference in the pipeline with and without the minimap2 option we incorporated ("-w 50")? Trying to determine whether it's worth testing it without that flag or not, now that everything works.

@sarahjeeeze
Contributor

Good point! It may result in a few more unaligned reads, so you may be missing some transcripts. Now that it's working, it might be worth trying with the standard "-w 10" to see if it still works, or "-w 25".
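
If you do compare, the same invocation can just be relaunched with a different window size (or with the option omitted for minimap2's default); with -resume the early read-ingestion steps should be reused, though indexing and everything downstream will rerun. A sketch:

# Same run as before, only the minimizer window size changed
nextflow run epi2me-labs/wf-transcriptomes -resume --minimap2_index_opts "-w 10"  # plus your usual --fastq/--ref_genome/--ref_annotation options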

@sarahjeeeze
Contributor

Closing due to lack of response; the issue is solved.

@cea295933
Author

cea295933 commented Dec 12, 2023 via email
