-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
edgeR #1471
edgeR #1471
Conversation
tools/edger/edger.xml
Outdated
<requirement type="package" version="3.30.13">bioconductor-limma</requirement> | ||
<requirement type="package" version="1.4.29">r-statmod</requirement> | ||
<requirement type="package" version="0.4.1">r-scales</requirement> | ||
<requirement type="package" version="1.5_9.1">r-locfit</requirement> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is r-locfit
needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because otherwise you get an error about the locfit package being missing with the estimateDisp function (see here: https://support.bioconductor.org/p/75970/)
It's because these packages have some suggested but not required dependencies (see here: https://support.bioconductor.org/p/83334/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment <!-- comment -->
above the requirement explaining this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the requirements, see my new comment below.
tools/edger/edger.xml
Outdated
</when> | ||
</conditional> | ||
</section> | ||
<section name="ct" expanded="false" title="Specify groups to contrast"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops sorry! I've fixed those indentations now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be still broken unfortunately, forgot to push?
tools/edger/edger.xml
Outdated
<validator type="empty_field" /> | ||
<validator type="regex" message="Please only use letters, numbers or underscores">^[\w,-]+$</validator> | ||
</param> | ||
</section> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation
mkdir ./output_dir | ||
|
||
&& | ||
cp '$outReport.files_path'/*.tsv output_dir/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mv
is probably better than cp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cp
is used because the files are both pulled into a collection (from output_dir
) and included in the html report (so need to stay in $outReport.files_path
) ..a symlink could be a good alternative though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I'm using cp is because, if you use mv, on exporting the history there are no tsvs at all (neither from the report nor from the collection, are files in collections not included in exported histories?). So I am using cp so that a user who exports a history is not unknowingly missing the tsv files.
@shiltemann would a symlink work for including the files in an exported history?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mblue9 I am not entirely sure actually, maybe not ..but if the files can get big it could be worth testing out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think symlinks will work for object store backends and exported histories, I think we're bound to cp (but these files shouldn't become too big anyway, right ?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what is consider 'big' here but these are just text files of the diff exp results (1 per contrast), I don't think that it'd ever be more than MBs in total, nowhere near as big as the fastqs and bams used to generate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mblue9 yeah I would consider that small, if it were files on order of GBs it would be shame to duplicate them, but this should be fine (besides, I don't know if there is a good way around this at present anyway as Marius pointed out)
Hi @nsoranzo and @shiltemann thanks for the review! I also tests to check the report for which edgeR test was used. |
tools/edger/edger.xml
Outdated
<validator type="regex" message="Please only use letters, numbers or underscores">^[\w]+$</validator> | ||
</param> | ||
<param name="pfactLevel" type="text" label="Primary Factor Levels" | ||
help="Eg. WT,WT,WT,Mut,Mut,Mut NOTE: Please only use letters, numbers or underscores and ensure that the same levels are typed identically with cases matching."> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help="current help text (case sensitive)"
tools/edger/edger.xml
Outdated
This tool outputs a table of differentially expressed genes for each contrast of interest and a HTML report with plots and additional information. Optionally you can choose to output the normalised counts table and the RData file. | ||
]]></help> | ||
<citations> | ||
<citation type="doi">10.1093/bioinformatics/btp616</citation> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does limma need a citation too? (and maybe other dependencies you included?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to work out what to do with the citations. I've left it with just the edgeR citation here for the moment but I should add the citation for the limma-voom tool that this edgeR tool is based on, and probably limma too. I was going tor wait til I figured out how to combine, this is still WIP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mblue9 what do you mean with combine? You can add multiple citations if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bgruening by combine I mean that the limma-voom tool and this edgeR one share a lot of code so I was going to see if I could macrofy.
@mblue9 to answer one of your question from your TP. I think it is perfectly fine to do one compined repo with limma/voom. Whatever is best for the maintainer as long as we can create separate TS repos, which we can I think. |
Thanks for these reviews and feedback! I'll work on them (including adding the deseq2-style input) as soon as I can get a chance. |
I've made changes as suggested (and some others) if anyone wants to take a look to see where I'm up to. I haven't done any combining of the limma-voom and edgeR wrappers yet but I'm going to look into that. I'm sure some of the changes I've made could be done more elegantly. I'm happy to make more changes, remove sections etc. In this commit I've
|
If this is not a dependency of limma/edgeR than it is ok to not add it to conda and just here.
That would be super useful I think. |
Thanks @bgruening I'll add the annotation file option to the multiple file input. |
In this case, please feel free to add it to the conda package. |
They will have them listed twice yes. But the resolver will only install one. So this is fine. In a long run we could clean the other packages as well. |
- add annotation file input option to the multiple counts files input, also added test - add getopt library to parse options - add some error handling lines from DESeq2 R script (e.g. Sys.setlocale) - reorder some input options - add some more explanation
To finish this I still need to update the limma conda package and try macrofying, but other than that I'm done with changes for the moment. If people think I should make more just let me know. The changes I've made are below. For updating the limma conda package, I'm only going to add the statmod dependency. I'm leaving scales as a requirement here, as it's not limma that needs it in this tool, it's the alpha function used in the R script (for colour transparency in the MD plot).
|
@nsoranzo @bgruening this is also weirdly failing here even though it passes fine locally (and also passed here previously). Is it something to do with this new override channels line below? As it's failing right after that here and I don't see that line in the pizzly and edgeR builds that passed previously.
|
tools/edger/edger.xml
Outdated
<requirement type="package" version="0.4.1">r-scales</requirement> | ||
|
||
<!-- I will add r-statmod to limma bioconda package --> | ||
<requirement type="package" version="1.4.29">r-statmod</requirement> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conda create
also stuck for me because this is an old version from the bioconda channel, if you update the version to 1.4.30 it should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nsoranzo, I've changed it to use r-statmod 1.4.30 instead of 1.4.29. But why does 1.4.29 work fine for me locally yet fail here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be this conda bug: conda/conda#5536
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see...thanks for the info!
It's green again! |
@yhoogstrate do you have time to look over it? Thanks! |
tools/edger/edger.xml
Outdated
<inputs> | ||
|
||
<!-- Counts and Factors --> | ||
<section name="cnt" expanded="false" title="Input Counts and Factors"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to set expanded to true? It needs to be opened for every usecase anyway
tools/edger/edger.xml
Outdated
<valid initial="string.letters,string.digits"><add value="_" /></valid> | ||
</sanitizer> | ||
</param> | ||
<repeat name="rep_factorLevel" title="Factor Level" min="2" default="2"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 'factor level' based on concepts in R or is this how scientists use it? I was thinking of the word 'condition' but I know my English is not something to rely on. If anyone else also believes 'condition' is better you can change it, otherwise keep it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I think factor level is probably not the best term to use here for biologists. I don't think I can use condition though as that can also be used to describe one type of factor e.g. this quote from pg 49 limma user guide (https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf):
"The two experimental factors Condition and Tissue could be handled in many ways."
So calling factor levels conditions might be confusing to users?
What about just using group? I've changed it to that for the moment but let me know what you think?
tools/edger/edger.xml
Outdated
<valid initial="string.letters,string.digits"><add value="_" /></valid> | ||
</sanitizer> | ||
</param> | ||
<param name="countsFile" type="data" format="tabular" multiple="true" label="Counts file(s)"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shiltemann expression matrices get sniffed as 'mothur.freq':
GeneID WT3
11287 1601
11298 1834
Does it make sense to make the mothur sniffer less more stringent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yes it is quite a simple format that I don't know if I can make more stringent (it's just two columns of numbers with alphanumerical headers) ..maybe just remove sniffer altogether if it is producing too many false positives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being addressed in galaxyproject/galaxy#4781
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing that!
tools/edger/edger.xml
Outdated
</section> | ||
|
||
<!-- Contrasts --> | ||
<section name="ct" expanded="false" title="Specify Groups to Contrast"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section can be removed and the param can be put into the section above.
tools/edger/edger.xml
Outdated
<section name="ct" expanded="false" title="Specify Groups to Contrast"> | ||
<param name="contrast" type="text" label="Contrasts of Interest" help="Eg. Mut-WT,KD-Control"> | ||
<validator type="empty_field" /> | ||
<validator type="regex" message="Please only use letters, numbers or underscores">^[\w,-]+$</validator> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More info: https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf (chapter 8)
I see that comma's are accepted. Could you explain what they're for?
Does it make sense to put this param into a <repeat>
and not further allow comma's?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put the contrast parameter into a repeat. I also added more info and the link to the limma user guide to the param help, is that what you meant?
tools/edger/edger.xml
Outdated
|
||
<!-- Filter Options --> | ||
<section name="filter" expanded="false" title="Filter Low Counts"> | ||
<param name="cpmReq" type="float" value="0" min="0" label="Minimum CPM" help="Treat genes with very low expression as unexpressed and filter out. See the Filter Low Counts section below for more information. Default: 0"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're dealing with a nominal distribution, would it make sense to filter out based on the absolute read counts, e.g. something like this: rowSums(data$counts>= n_sample * 2 )
?
In very large genes, a low CPM can still correspond to a considerable amount of reads, with considerable statistical power, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I've added more flexible filtering. Now you can also filter on absolute read count, either total or per sample, see what you think.
When I try to model batch effects as provided in the example under Factor Information, I get the following error:
|
@yhoogstrate this is great! Thanks a lot for taking a look. I'll work my way through these and get back to you. |
- use same phrases for count file selection box and help section - remove section around inputs - change 'factor level' to 'group' - make contrast param repeat - add ability to filter on raw count values (total and per sample), with tests - add some more help and other text edits Please enter the commit message for your changes. Lines starting
@yhoogstrate thanks again for the last review! I've submitted new changes based on your suggestions and I'll address some of your comments directly above. I couldn't reproduce the batch effects error you got. Can you check if you get it with the new changes? How are you inputting the info? The tool has these 3 tests that use that same info under Factor Information and they all pass but maybe I'm not catching everything:
|
@yhoogstrate can you look at this once more and get it in if you think its ready? |
modelling batches works (but I still need to go over the rest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Truly amazing work
I fully agree with you @yhoogstrate! |
Thanks a lot @mblue9! |
Yay! Thanks @yhoogstrate and all! |
Hello everyone,
In this PR I've made a stab at that first draft of an IUC edgeR tool. As suggested in Gitter, first I took a look at the edgeR Galaxy tools already available and DESeq2, and then I discovered that the limma-voom tool that's already in IUC, could transform very easily into an edgeR tool, with just some small code changes. The limma-voom Galaxy tool was written by the group that are authors of limma and edgeR so I thought I'd start with that as a first version and see if that could be adapted. The code I've submitted here (both the xml and R script) is practically identical to the limma-voom tool (and uses the same test-data input) so I'm wondering if it might be good to make a combined limma-voom/edgeR tool or suite?
So far, I've only included the core functions of edgeR (which I would think could be enough for a first version of the tool?), so this version performs differential expression using the edgeR quasi-likelihood test (plus produces MDS, BCV and MD plots). But the user can also choose to run the likelihood ratio test instead. I have not added other edgeR functions yet (e.g. glmTreat, processAmplicons for CRISPR fastqs etc) as I don't know that they'd be needed in the first version.
In it's current form, this edgeR tool takes the same input as the limma-voom tool, so requires a count matrix, rather than the one file per sample like DESeq2 (like some of you said you like in Gitter). For limma-voom, I had been planning to use the Join by ID tool or something similar to create the input matrix from Featurecounts output, but do people think it's better if users could also input the individual files like Deseq2?
What do people think? I'm happy to change things (or have things changed)
P.S. I have not tried incorporating tags and metadata yet, or doing anything fancy with creating a design matrix. Similar to the limma-voom tool, people can either input the sample info from a file or enter it directly in the tool form.
P.P.S. I was intending to create a Galaxy tutorial for limma-voom anyway to use for teaching here (based on the R workshop material we created here http://combine-australia.github.io/RNAseq-R/). That dataset is the same one used in the edgeR workflow published here: https://f1000research.com/articles/5-1438 so maybe there could be a shared tut that can be used for edgeR and limma-voom. That tut also uses heatmap2 and I was going to see if I can include that, now that it's in IUC 馃槃