
Stringtie 1.3.3 errors when the option to output Deseq2/EdgeR is used #1322

Closed
jennaj opened this issue May 23, 2017 · 14 comments


@jennaj jennaj commented May 23, 2017

Switching to no additional output or to Ballgown works fine. All inputs appear to meet specification and this is reproducible across different input datasets/target genomes.

Error produced:

Fatal error: Exit code 1 ()
Traceback (most recent call last):
  File "/galaxy/main/deps/_conda/envs/__stringtie@1.3.3/bin/prepDE.py", line 186, in <module>
    t_dict.setdefault(t_id, {})
NameError: name 't_id' is not defined

Example job info

StringTie
Dataset Information
Number:	XX
Name:	StringTie on data X and data X: Assembled transcripts
Created:	Fri 19 May 2017 08:06:59 PM (UTC)
Filesize:	40.5 MB
Dbkey:	mm10
Format:	gtf
Job Information
Galaxy Tool ID:	toolshed.g2.bx.psu.edu/repos/iuc/stringtie/stringtie/1.3.3
Galaxy Tool Version:	1.3.3
Tool Version:	1.3.3
Tool Standard Output:	stdout
Tool Standard Error:	stderr
Tool Exit Code:	1
History Content API ID:	X
Job API ID:	X
History API ID:	X
UUID:	X
Full Path:	/galaxy-repl/main/files/X/X/dataset_X.dat

Tool Parameters
Input Parameter	Value	Note for rerun
Mapped reads to assemble transcripts from	X: SortSam on data X: BAM sorted in coordinate order	
Use GFF file to guide assembly	yes	
Reference annotation to use for guiding the assembly process	X: UCSC Main on Mouse: knownGene (genome)	
Perform abundance estimation only of input transcripts	False	
Output additional files for use in...	deseq2	
Average read length	75	
Whether to cluster genes that overlap with different gene IDs	False	
Options	default	
Job Resource Parameters	no	

Inheritance Chain
StringTie on data X and data X: Assembled transcripts

Command Line
mkdir -p ./special_de_output/sample1/ && ln -s '/galaxy-repl/main/files/X/X/dataset_X.dat' ./special_de_output/sample1/guide.gtf &&  stringtie '/galaxy-repl/main/files/X/X/dataset_X.dat'  -o "/galaxy-repl/main/files/X/X/dataset_X.dat" -p "${GALAXY_SLOTS:-1}" -C '/galaxy-repl/main/files/X/X/dataset_X.dat.dat' -G '/galaxy-repl/main/files/X/X/dataset_X.dat'  -b ./special_de_output/sample1/  && prepDE.py -i ./special_de_output/ -g gene_cout_matrix.tsv -t transcripts_count_matrix.tsv -l 75 && sed -i.bak 's/,/\t/g' transcripts_count_matrix.tsv && sed -i.bak 's/,/\t/g' gene_cout_matrix.tsv

Job Metrics

core

Cores Allocated	4
Job End Time	2017-05-19 15:13:40
Job Runtime (Wall Clock)	6 minutes
Job Start Time	2017-05-19 15:07:24

cpuinfo

Processor Count	32

meminfo

Total System Memory	123.0 GB
Total System Swap	976.6 MB

uname

Operating System	Linux roundup52.XXXX XXX #1 SMP Wed Jan 28 21:11:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Job Dependencies
Dependency	Dependency Type	Version
stringtie	conda	1.3.3

cc @davebx

@jennaj jennaj changed the title from "Stringtie 1.3.3 errors with the option to output Deseq2/EdgeR is used" to "Stringtie 1.3.3 errors when the option to output Deseq2/EdgeR is used" May 23, 2017

@jennaj jennaj commented May 23, 2017

When used at http://usegalaxy.org, there is another small warning that comes up. Not sure if related to current server issues. If "Ballgown" output is selected (same run as above), this is printed in the info comments for the green results datasets:

rm: cannot remove `working/special_de_output/sample1': Directory not empty


@jennaj jennaj commented May 23, 2017

This error occurred on jobs run yesterday. I can't reproduce it today. Maybe it was a cluster issue. Closing for now, but will reopen if I can figure out what triggers it.


@jennaj jennaj closed this May 23, 2017

@jennaj jennaj commented Jun 6, 2017

Reports of this error have now come in. I can also now reproduce the error again in a current Galaxy RNA-seq tutorial history, using the latest version of the tool. The last four stringtie jobs were run with the four different related options - three pass, one fails ("additional output files = DEseq/EdgeR").

This seems important to fix with priority. @MoHeydarian @davebx @martenson @natefoo


@jennaj jennaj reopened this Jun 6, 2017

@jennaj jennaj commented Jun 6, 2017

Specific problem:

Stringtie-merge performs summary calculations and produces new "transcript" lines in the result (and resets the "Source" and removes other content lines). Running just iGenomes/external GTF through the Stringtie-merge tool will groom it and allow a successful job.

Work-around for now to avoid the NameError: name 't_id' is not defined error:

If using an external GTF source with Stringtie, and not currently using Stringtie merge transcripts, running just the external GTF dataset through the Stringtie merge transcripts tool (as an "input_gtf", leaving the input for "guide_gff" with no dataset selected) will produce a groomed version of the reference annotation that can be used by Stringtie.

How to resolve?

One or more of these? Some are @MoHeydarian ideas and other solutions are likely possible.

  1. Explain this usage on both of the tool forms and in web help/tutorials
  2. Allow Stringtie to detect when a GTF does not have the "transcript" annotation lines and error with a message explaining that the input needs to be groomed/prepared with the merge tool first
  3. Have the primary Stringtie tool create the "transcript" lines if they are not present
  4. Have the primary Stringtie tool not require "transcript" lines if they are not actually used by the tool
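
Option 2 above could be approached with a small pre-flight check. The following is only a sketch: the function names and the wording of the message are hypothetical, and it assumes the input is a standard nine-column, tab-separated GTF in which the third column holds the feature type.

```python
def gtf_has_transcript_lines(gtf_lines):
    """Return True if any non-comment GTF line has feature type 'transcript'."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] == "transcript":
            return True
    return False


def check_guide_gtf(gtf_lines):
    """Raise a readable error instead of letting a later step fail obscurely."""
    if not gtf_has_transcript_lines(gtf_lines):
        # Hypothetical message text; adjust to whatever wording is agreed on.
        raise ValueError(
            "Input Error: the GTF has no 'transcript' lines; "
            "prepare it with Stringtie merge transcripts first."
        )
```

A wrapper could call `check_guide_gtf(open(path))` before launching the job, so the user sees the explanation rather than the `t_id` traceback.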

ping @bgruening

There is a related error when using a reference GTF and the DEseq2/EdgeR outputs, but it has not been reproduced. Tagging it here for reference.


@jennaj jennaj closed this Jun 6, 2017
@jennaj jennaj reopened this Jun 6, 2017

@davebx davebx commented Jun 6, 2017

One of the issues is that prepDE.py expects t_id to be set regardless of whether the provided GTF file contains any transcript lines. I also noticed that it only increments the transcript length by stop - start if the line currently being examined is an exon, not a transcript. All in all, I'm confused by this script.
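
To make that failure mode concrete, here is a minimal sketch of the pattern described above. It is not the actual prepDE.py source; the function name and parsing details are assumptions made for illustration. The point is that `t_id` is only bound when a "transcript" line is seen, so an exon-only GTF reaches the exon branch with the name still undefined.

```python
import re


def transcript_lengths(gtf_lines):
    """Accumulate per-transcript lengths the way the comment above describes."""
    t_dict = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        feature, start, stop = fields[2], int(fields[3]), int(fields[4])
        if feature == "transcript":
            # t_id is bound here and nowhere else
            t_id = re.search(r'transcript_id "([^"]+)"', fields[8]).group(1)
        elif feature == "exon":
            # With no preceding transcript line, t_id is unbound at this point.
            # Inside a function this raises UnboundLocalError, a subclass of
            # NameError; at module level it would be a plain NameError.
            t_dict[t_id] = t_dict.get(t_id, 0) + (stop - start)
    return t_dict
```

Feeding this an exon-only annotation (such as a UCSC knownGene export) raises the error on the first exon, while a GTF that has "transcript" lines runs through.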


@jennaj jennaj commented Aug 7, 2017

Still errors.

@bgruening Do you know if a correction for this is in progress or planned? ETA? Or is cycling/prepping through stringtie-merge the solution? If so, I can add that to the help. Thanks!


@bgruening bgruening commented Aug 7, 2017

Oops, sorry, I missed this completely. Will look into it tomorrow.


@jennaj jennaj commented Aug 11, 2017

Hi - any updates?

In short, the tool fails with external GTF files unless they are processed through Stringtie merge first.

Also - any chance that we could modify Stringtie merge to include the "merge" in the output dataset names (and have merge in bold as part of the core tool name) when other changes are made? The same change probably impacts both. Would help a great deal in distinguishing between the assembly vs merge tools in tool panel plus (more importantly for usage) the different output datasets in the history.

Thanks!!!


@bgruening bgruening commented Aug 14, 2017

@jennaj with primary Stringtie tool you mean the upstream tool, not the Galaxy tool? If so 3. and 4. are out of scope here, right?

Could you provide some text for 1) and 2) and I can get this into the tool.


@jennaj jennaj commented Oct 6, 2017

Sorry for delay. Yes, after Dave's detective work, 3 & 4 are confirmed to be beyond our control. So, our options are to trap the error and provide help about usage.

Current test history for an example: https://usegalaxy.org/u/jen/h/test-history-stringtie-and-stringtie-merge-133

Suggested text (is what I send to users):

  1. Improve error message when this error is encountered. Leave all original content, but make this new content very obvious to the user. Place at the top of the error report when clicking on a bug report (if possible, most tools do put this at the end). Definitely, put just this in the INFO field for the error dataset in the expanded view. (not in the COMMENTS field, as that is not seen by default).

Input Error: All input GTF datasets must be prepared with "Stringtie merge transcripts".

2a. "StringTie transcript assembly and quantification" tool form

  • Declare under GTF input area

All input GTF datasets must be prepared with Stringtie merge transcripts.

  • Declare in Help area

All StringTie transcript assembly and quantification input GTF datasets must be prepared with Stringtie merge transcripts. On the StringTie merge transcripts form, input the GTF as the "input_gtf" dataset when not performing a merge with other GTFs.

2b. "Stringtie merge transcripts" tool form

  • Declare under the "input_gtf" entry area:

If preparing a single GTF for StringTie transcript assembly and quantification, enter it here.

  • Declare in Help area

Prepares GTF datasets for use with StringTie transcript assembly and quantification. If preparing a single GTF, input as the "input_gtf" dataset.
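
For item 1, one way to surface the message at the top of the error report is to post-process the captured stderr. This is a sketch only: `annotate_stderr` is a hypothetical helper, not part of Galaxy or StringTie, and it assumes the traceback text is available as a string.

```python
# Signature of the failure reported in this issue
T_ID_ERROR = "NameError: name 't_id' is not defined"

# Message text taken from item 1 above
HINT = 'Input Error: All input GTF datasets must be prepared with "Stringtie merge transcripts".'


def annotate_stderr(stderr_text):
    """Prepend the usage hint when the known traceback is present."""
    if T_ID_ERROR in stderr_text:
        # Keep all original content, but put the hint first
        return HINT + "\n\n" + stderr_text
    return stderr_text
```

Unrelated errors pass through unchanged, so the hint only appears for this specific failure.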

--

This has to be detailed because the GTF is often the reference GTF. When doing a merge of 2 or more GTFs, and a reference GTF is used, it is entered in a different form field ("guide_gff"). This is entirely different usage and goes against the tool name, so is a bit confusing.

The tool could be renamed to something like: "Stringtie merge and prepare transcripts". It needs a rename anyway, and this has been discussed and agreed upon (gitter?), because both Stringtie tools have just the "Stringtie" part of the tool name highlighted as a link - Galaxy UI, Tool Shed, Workflow editor, etc. Core tool names are intended to be unique across tools.

If you want to coordinate, I could still do the form text edits, but I wouldn't know quite how to do the error trapping/message or changing the tool name (if we even want to, I vote yes still). Lmk.

ping @bgruening Thank you!!!


@NCEichner NCEichner commented Oct 11, 2017

Hi everyone!

We had the same issue with 't_id' not defined in prepDE.py. I think I can present some solution here but I don't know how to send commits. So maybe someone else can add the changes to the 'stringtie.xml'-file and test if that resolves the problem.
In general the problem is caused because the input of the python script 'prepDE.py' actually is not the stringtie-OUTPUT-file (in the history shown as 'Assembled transcripts') but the .gtf-file which was set in the '-G' option of stringtie (see 'Reference annotation to use for guiding the assembly process' in the GUI). So in the command-line the 'ln' command is at the wrong position and is referencing the wrong file.
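
To illustrate why the wrong file gets read, here is a sketch that assumes prepDE.py picks up the first `*.gtf` it finds in each sample sub-directory of the `-i` directory. That behaviour is an assumption for this example, and `pick_sample_gtfs` and `demo` are hypothetical names. With the layout from the failing command line, the only GTF in `sample1/` is the linked guide annotation.

```python
import glob
import os
import tempfile


def pick_sample_gtfs(input_dir):
    """Map each sample sub-directory to the first *.gtf found inside it."""
    picked = {}
    for sample in sorted(next(os.walk(input_dir))[1]):
        gtfs = sorted(glob.glob(os.path.join(input_dir, sample, "*.gtf")))
        if gtfs:
            picked[sample] = gtfs[0]
    return picked


def demo():
    """Recreate the failing layout: only guide.gtf lives in sample1/."""
    with tempfile.TemporaryDirectory() as root:
        sample_dir = os.path.join(root, "sample1")
        os.makedirs(sample_dir)
        # Stand-in for: ln -s <guide annotation> sample1/guide.gtf
        open(os.path.join(sample_dir, "guide.gtf"), "w").close()
        found = pick_sample_gtfs(root)
        return {name: os.path.basename(path) for name, path in found.items()}
```

Under that assumption, `demo()` reports `guide.gtf` for `sample1`, meaning the reference annotation rather than the assembled transcripts would be parsed.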

We patched the command section and the outputs section of stringtie.xml as follows:

    mkdir -p ./special_de_output/sample1/ &&

    #if $input_bam.metadata.ftype == 'sam':
        samtools sort -@ \${GALAXY_SLOTS:-1} '$input_bam' | stringtie
    #else
        stringtie '$input_bam'
    #end if

Above we deleted the 3 lines below the mkdir-command.


    #if str($guide.use_guide) == 'yes':
        #if $guide.special_outputs.special_outputs_select == 'deseq2':
            &&
            ln -s '$output_gtf' ./special_de_output/sample1/output.gtf
            &&
            prepDE.py
                -i ./special_de_output/
                -g gene_counts.tsv
                -t transcript_counts.tsv
                -l $guide.special_outputs.read_length
                #if str($option_set.options) == 'advanced':
                    -s '$option_set.name_prefix'
                #end if
                #if $guide.special_outputs.clustering:
                    -c
                    --legend legend.tsv
                    &&
                    sed -i.bak 's/,/\t/g' legend.tsv
                    &&
                    sed -i.bak 's/\r//g' legend.tsv
                    &&
                    mv -T legend.tsv "$legend"
                #end if
            &&
            sed -i.bak 's/,/\t/g' transcript_counts.tsv
            &&
            sed -i.bak 's/\r//g' transcript_counts.tsv
            &&
            mv -T transcript_counts.tsv "$transcript_counts"
            &&
            sed -i.bak 's/,/\t/g' gene_counts.tsv
            &&
            sed -i.bak 's/\r//g' gene_counts.tsv
            &&
            mv -T gene_counts.tsv "$gene_counts"
        #end if
    #end if

The 'ln' command was then added just before 'prepDE.py'. It links the output GTF from stringtie.
The additional commands clean up the 'prepDE.py' output (CSV to TSV; carriage returns removed before each LF).
The result files are moved to the database after editing.
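
For readers less familiar with sed, the clean-up step is roughly equivalent to the following Python sketch. `csv_to_tsv` is a hypothetical helper name. Like the sed commands, it is a blind replacement and would also convert commas inside quoted fields.

```python
def csv_to_tsv(text):
    """Drop carriage returns and turn every comma into a tab."""
    # Mirrors: sed 's/\r//g' and sed 's/,/\t/g'
    return text.replace("\r", "").replace(",", "\t")
```

This is safe only as long as the count matrices contain no quoted, comma-bearing values, which is the same limitation the sed version has.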


    <data name="gene_counts" format="tabular" label="${tool.name} on ${on_string}: Gene counts">
        <filter>guide['use_guide'] == 'yes' and guide['special_outputs']['special_outputs_select'] == 'deseq2'</filter>
    </data>
    <data name="transcript_counts" format="tabular" label="${tool.name} on ${on_string}: Transcript counts">
        <filter>guide['use_guide'] == 'yes' and guide['special_outputs']['special_outputs_select'] == 'deseq2'</filter>
    </data>
    <data name="legend" format="tabular" label="${tool.name} on ${on_string}: legend">
        <filter>guide['use_guide'] == 'yes' and guide['special_outputs']['special_outputs_select'] == 'deseq2' and guide['special_outputs']['clustering'] is True</filter>
    </data>

In the outputs section we cleaned up the references to the working directory because they are no longer needed.


@mblue9 mblue9 commented Oct 12, 2017

Hi @NCEichner I only learnt how to make commits here pretty recently and I have some notes I started making for our in-house documentation (on how to test Galaxy tools and submit to Github). I've attached them here in case they're any use to you, if you wanted to try making commits yourself sometime (it sounds like you've got good suggestions!).

GAD-TestingatoolusingPlanemo-121017-1229-18.pdf
GAD-SubmittingtoolstoIUC-121017-1242-20.pdf


@mblue9 mblue9 commented Oct 12, 2017

Oops, I just noticed something I had added to the wrong place in that planemo doc, corrected version is attached here:
GAD-TestingatoolusingPlanemo-121017-1256-22.pdf


NCEichner added a commit to NCEichner/tools-iuc that referenced this issue Oct 13, 2017
This is my current version of the updated stringtie.xml that should fix the error described in galaxyproject#1322.
Compared to my first post in galaxyproject#1322 I've added some 'cosmetic' changes regarding the temp-files

@jennaj jennaj commented Jun 7, 2018


@jennaj jennaj closed this Jun 7, 2018