edgeR #1471

mblue9 · 2017-09-11T04:34:33Z

Hello everyone,

In this PR I've made a stab at a first draft of an IUC edgeR tool. As suggested in Gitter, I first took a look at DESeq2 and the edgeR Galaxy tools already available, and then I discovered that the limma-voom tool that's already in IUC could be turned into an edgeR tool very easily, with just some small code changes. The limma-voom Galaxy tool was written by the group who authored limma and edgeR, so I thought I'd start with that as a first version and see if it could be adapted. The code I've submitted here (both the XML and R script) is practically identical to the limma-voom tool (and uses the same test-data input), so I'm wondering if it might be good to make a combined limma-voom/edgeR tool or suite?

So far, I've only included the core functions of edgeR (which I would think could be enough for a first version of the tool?), so this version performs differential expression using the edgeR quasi-likelihood test (plus produces MDS, BCV and MD plots). But the user can also choose to run the likelihood ratio test instead. I have not added other edgeR functions yet (e.g. glmTreat, processAmplicons for CRISPR fastqs etc) as I don't know that they'd be needed in the first version.

In its current form, this edgeR tool takes the same input as the limma-voom tool, so it requires a count matrix rather than one file per sample like DESeq2 (which some of you said you like in Gitter). For limma-voom, I had been planning to use the Join by ID tool or something similar to create the input matrix from featureCounts output, but do people think it's better if users could also input the individual files like DESeq2?

What do people think? I'm happy to change things (or have things changed)

P.S. I have not tried incorporating tags and metadata yet, or doing anything fancy with creating a design matrix. Similar to the limma-voom tool, people can either input the sample info from a file or enter it directly in the tool form.

P.P.S. I was intending to create a Galaxy tutorial for limma-voom anyway to use for teaching here (based on the R workshop material we created here http://combine-australia.github.io/RNAseq-R/). That dataset is the same one used in the edgeR workflow published here: https://f1000research.com/articles/5-1438 so maybe there could be a shared tut that can be used for edgeR and limma-voom. That tut also uses heatmap2 and I was going to see if I can include that, now that it's in IUC 😄

nsoranzo · 2017-09-11T08:59:12Z

tools/edger/edger.xml

+        <requirement type="package" version="3.30.13">bioconductor-limma</requirement>
+        <requirement type="package" version="1.4.29">r-statmod</requirement>
+        <requirement type="package" version="0.4.1">r-scales</requirement>
+        <requirement type="package" version="1.5_9.1">r-locfit</requirement>


Is r-locfit needed?

Yes, because otherwise you get an error about the locfit package being missing with the estimateDisp function (see here: https://support.bioconductor.org/p/75970/)
It's because these packages have some suggested but not required dependencies (see here: https://support.bioconductor.org/p/83334/)

Can you add a comment above the requirement explaining this?

I've changed the requirements, see my new comment below.

nsoranzo · 2017-09-11T19:48:36Z

tools/edger/edger.xml

+                </when>
+            </conditional>
+            </section>
+            <section name="ct" expanded="false" title="Specify groups to contrast">  


Indentation

oops sorry! I've fixed those indentations now

This seems to be still broken unfortunately, forgot to push?

nsoranzo · 2017-09-11T19:48:59Z

tools/edger/edger.xml

+                <validator type="empty_field" />
+                <validator type="regex" message="Please only use letters, numbers or underscores">^[\w,-]+$</validator>
+            </param>
+            </section>


Indentation

nsoranzo · 2017-09-11T19:50:16Z

tools/edger/edger.xml

+mkdir ./output_dir
+
+&&
+cp '$outReport.files_path'/*.tsv output_dir/


mv is probably better than cp

I think cp is used because the files are both pulled into a collection (from output_dir) and included in the html report (so need to stay in $outReport.files_path) ..a symlink could be a good alternative though?

The main reason I'm using cp is because, if you use mv, on exporting the history there are no tsvs at all (neither from the report nor from the collection, are files in collections not included in exported histories?). So I am using cp so that a user who exports a history is not unknowingly missing the tsv files.
@shiltemann would a symlink work for including the files in an exported history?

@mblue9 I am not entirely sure actually, maybe not ..but if the files can get big it could be worth testing out?

I don't think symlinks will work for object store backends and exported histories, I think we're bound to cp (but these files shouldn't become too big anyway, right ?).

I don't know what is considered 'big' here, but these are just text files of the diff exp results (1 per contrast). I don't think it'd ever be more than MBs in total, nowhere near as big as the fastqs and bams used to generate them.

@mblue9 yeah I would consider that small. If it were files on the order of GBs it would be a shame to duplicate them, but this should be fine (besides, I don't know if there is a good way around this at present anyway, as Marius pointed out)
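
For readers following the cp vs mv discussion, here is a minimal Python sketch (not part of the PR, which does this with `mkdir` and `cp` in the tool's command block) of the copy step. The directory and file names are made up for illustration; the point is only that copying leaves the originals in the report folder while a second copy lands in `output_dir`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_tables(files_path: Path, output_dir: Path) -> List[str]:
    """Copy every .tsv from files_path into output_dir, leaving the originals in place."""
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for tsv in sorted(files_path.glob("*.tsv")):
        shutil.copy2(tsv, output_dir / tsv.name)
        copied.append(tsv.name)
    return copied


# Throwaway directories stand in for the report folder and the collection folder.
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "report_files"  # hypothetical stand-in for $outReport.files_path
    report_dir.mkdir()
    (report_dir / "edgeR_Mut-WT.tsv").write_text("GeneID\tlogFC\n")
    (report_dir / "report.html").write_text("<html></html>")

    names = copy_tables(report_dir, Path(tmp) / "output_dir")
    still_in_report = (report_dir / "edgeR_Mut-WT.tsv").exists()
    print(names, still_in_report)
```

With a move instead of a copy, `still_in_report` would be false, which is the situation described above where the report loses its tables.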

mblue9 · 2017-09-12T02:10:54Z

Hi @nsoranzo and @shiltemann thanks for the review!

I also added tests to check the report for which edgeR test was used.

yhoogstrate · 2017-09-12T13:51:57Z

tools/edger/edger.xml

+                        <validator type="regex" message="Please only use letters, numbers or underscores">^[\w]+$</validator>
+                    </param>
+                    <param name="pfactLevel" type="text" label="Primary Factor Levels"
+                        help="Eg. WT,WT,WT,Mut,Mut,Mut NOTE: Please only use letters, numbers or underscores and ensure that the same levels are typed identically with cases matching.">


help="current help text (case sensitive)"
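
To illustrate why the help text should say the levels are case sensitive, here is a small Python sketch. It is not code from the tool; `parse_levels` and `case_collisions` are hypothetical helpers that split a comma-separated levels string and report spellings that differ only by case, which would otherwise be treated as separate groups.

```python
from collections import defaultdict
from typing import Dict, List


def parse_levels(text: str) -> List[str]:
    """Split a comma-separated levels string, dropping empty entries."""
    return [item.strip() for item in text.split(",") if item.strip()]


def case_collisions(levels: List[str]) -> Dict[str, List[str]]:
    """Return spellings that differ only by case, keyed by their lowercase form."""
    seen = defaultdict(set)
    for level in levels:
        seen[level.lower()].add(level)
    return {key: sorted(variants) for key, variants in seen.items() if len(variants) > 1}


# "wt" is a typo for "WT"; a case-sensitive comparison sees three groups, not two.
levels = parse_levels("WT,WT,wt,Mut,Mut,Mut")
collisions = case_collisions(levels)
print(collisions)
```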

yhoogstrate · 2017-09-12T13:58:25Z

tools/edger/edger.xml

+This tool outputs a table of differentially expressed genes for each contrast of interest and a HTML report with plots and additional information. Optionally you can choose to output the normalised counts table and the RData file.
+    ]]></help>
+    <citations>
+        <citation type="doi">10.1093/bioinformatics/btp616</citation>


does limma need a citation too? (and maybe other dependencies you included?)

I need to work out what to do with the citations. I've left it with just the edgeR citation here for the moment, but I should add the citation for the limma-voom tool that this edgeR tool is based on, and probably limma too. I was going to wait until I figured out how to combine; this is still WIP.

@mblue9 what do you mean with combine? You can add multiple citations if you like.

@bgruening by combine I mean that the limma-voom tool and this edgeR one share a lot of code so I was going to see if I could macrofy.

bgruening · 2017-09-16T17:09:31Z

@mblue9 to answer one of your questions from your TP. I think it is perfectly fine to do one combined repo with limma/voom. Whatever is best for the maintainer, as long as we can create separate TS repos, which we can I think.

mblue9 · 2017-09-19T03:55:35Z

Thanks for these reviews and feedback! I'll work on them (including adding the deseq2-style input) as soon as I can get a chance.

mblue9 · 2017-09-25T07:46:05Z

I've made changes as suggested (and some others) if anyone wants to take a look to see where I'm up to. I haven't done any combining of the limma-voom and edgeR wrappers yet but I'm going to look into that. I'm sure some of the changes I've made could be done more elegantly. I'm happy to make more changes, remove sections etc.

In this commit I've

- added the option to input multiple files similar to the deseq2 wrapper (very similar, I took the code from the deseq2 wrapper)
- added a test for the multiple files input
- added rjson as a requirement (needed for the deseq2-style input)
- changed the requirements e.g. removed limma as it's already a dependency of the edgeR conda package. Am wondering if it would be better to add scales and statmod to the limma/edgeR conda packages rather than as dependencies here?
- added another plot (for QL dispersions)
- with the limma-voom tool you can provide an annotation file as input, I haven't added that to the multiple files input (so currently only possible with the matrix) but I could?
- added names to tests
- fixed all the indentations (I hope)

bgruening · 2017-09-26T22:37:23Z

> changed the requirements e.g. removed limma as it's already a dependency of the edgeR conda package. Am wondering if it would be better to add scales and statmod to the limma/edgeR conda packages rather than as dependencies here?

If this is not a dependency of limma/edgeR then it is OK to not add it to conda and just here.

> with the limma-voom tool you can provide an annotation file as input, I haven't added that to the multiple files input (so currently only possible with the matrix) but I could?

That would be super useful I think.

mblue9 · 2017-09-27T05:12:38Z

Thanks @bgruening I'll add the annotation file option to the multiple file input.
Re the dependencies - they're not "official" dependencies of limma but their omission does cause people problems, see here https://support.bioconductor.org/p/83334/ . From that post it seems they're omitted to reduce installation hassles so was wondering then if it would be better (for conda limma users) if they were in the limma conda package as dependencies?

bgruening · 2017-09-27T09:51:31Z

In this case, please feel free to add it to the conda package.

mblue9 · 2017-09-30T00:18:36Z

Ok thanks @bgruening I'll add r-scales and r-statmod to the limma conda package. Before I do though, just to check, will this change matter to the conda packages already existing that have limma and statmod/scales listed separately as dependencies? e.g. these packages below. Will they just end up having scales/statmod listed twice as dependencies (inside limma and explicitly)?

bgruening · 2017-09-30T08:22:34Z

They will have them listed twice, yes. But the resolver will only install one, so this is fine. In the long run we could clean up the other packages as well.

- add annotation file input option to the multiple counts files input, also added test
- add getopt library to parse options
- add some error handling lines from DESeq2 R script (e.g. Sys.setlocale)
- reorder some input options
- add some more explanation

mblue9 · 2017-10-03T21:30:11Z

To finish this I still need to update the limma conda package and try macrofying, but other than that I'm done with changes for the moment. If people think I should make more just let me know.

The changes I've made are below. For updating the limma conda package, I'm only going to add the statmod dependency. I'm leaving scales as a requirement here, as it's not limma that needs it in this tool, it's the alpha function used in the R script (for colour transparency in the MD plot).
I'm now using r-getopt and r-rjson like deseq2. @bgruening I saw that r-getopt and r-rjson are in the deseq2 bioconda packages, so I was wondering if I should also add them to the limma bioconda package or just leave them as requirements here? Btw thanks for your deseq2 wrapper, it's been really helpful for this! 😄

- add annotation file input option to the multiple counts files input, also added test
- add getopt library to parse options
- add some error handling lines from DESeq2 R script (e.g. Sys.setlocale)
- reorder some input options
- add some more explanation

mblue9 · 2017-10-04T00:30:26Z

@nsoranzo @bgruening this is also weirdly failing here even though it passes fine locally (and also passed here previously). Is it something to do with this new override channels line below? As it's failing right after that here and I don't see that line in the pizzly and edgeR builds that passed previously.

2017-10-03 21:58:55,146 DEBUG [galaxy.tools.deps.conda_util] Executing command: /home/travis/conda/bin/conda create -y --override-channels --channel iuc --channel bioconda --channel conda-forge --channel defaults --channel r --name mulled-v1-af8fafd35752d255b1c9e4fcf6ea7046af6cf96b731d4fe565ff86c3f9545a4e bioconductor-edger=3.16.5 r-scales=0.4.1 r-statmod=1.4.29 r-rjson=0.2.15 r-getopt=1.20.0
No output has been received in the last 20m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

nsoranzo · 2017-10-04T15:22:57Z

tools/edger/edger.xml

+        <requirement type="package" version="0.4.1">r-scales</requirement>
+
+        <!-- I will add r-statmod to limma bioconda package -->
+        <requirement type="package" version="1.4.29">r-statmod</requirement>


conda create also got stuck for me because this is an old version from the bioconda channel; if you update the version to 1.4.30 it should work.

Thanks @nsoranzo, I've changed it to use r-statmod 1.4.30 instead of 1.4.29. But why does 1.4.29 work fine for me locally yet fail here?

I think it may be this conda bug: conda/conda#5536

Ah I see...thanks for the info!

nsoranzo · 2017-10-04T20:19:41Z

It's green again!

bgruening · 2017-10-05T11:30:50Z

@yhoogstrate do you have time to look over it? Thanks!

yhoogstrate · 2017-10-10T06:36:16Z

Can we use exactly the same phrases in the selection box and the help section? And would it then make sense to make e.g. Separate Count Files and Single Count Matrix bold in the help section?

yhoogstrate · 2017-10-10T06:39:43Z

tools/edger/edger.xml

+    <inputs>
+
+        <!-- Counts and Factors -->
+        <section name="cnt" expanded="false" title="Input Counts and Factors">


Would it make sense to set expanded to true? It needs to be opened for every use case anyway.

yhoogstrate · 2017-10-10T06:44:18Z

tools/edger/edger.xml

+                            <valid initial="string.letters,string.digits"><add value="_" /></valid>
+                        </sanitizer>
+                        </param>
+                        <repeat name="rep_factorLevel" title="Factor Level" min="2" default="2">


Is 'factor level' based on concepts in R or is this how scientists use it? I was thinking of the word 'condition' but I know my English is not something to rely on. If anyone else also believes 'condition' is better you can change it, otherwise keep it as it is.

Good point! I think factor level is probably not the best term to use here for biologists. I don't think I can use condition though as that can also be used to describe one type of factor e.g. this quote from pg 49 limma user guide (https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf):

"The two experimental factors Condition and Tissue could be handled in many ways."

So calling factor levels conditions might be confusing to users?
What about just using group? I've changed it to that for the moment but let me know what you think?

yhoogstrate · 2017-10-10T06:46:48Z

tools/edger/edger.xml

+                                <valid initial="string.letters,string.digits"><add value="_" /></valid>
+                            </sanitizer>
+                            </param>
+                            <param name="countsFile" type="data" format="tabular" multiple="true" label="Counts file(s)"/>


@shiltemann expression matrices get sniffed as 'mothur.freq':

GeneID	WT3
11287	1601
11298	1834

Does it make sense to make the mothur sniffer ~~less~~ more stringent?

hmm, yes it is quite a simple format that I don't know if I can make more stringent (it's just two columns of numbers with alphanumerical headers) ..maybe just remove sniffer altogether if it is producing too many false positives?

This is being addressed in galaxyproject/galaxy#4781

Thanks for fixing that!

yhoogstrate · 2017-10-10T06:50:31Z

tools/edger/edger.xml

+        </section>
+
+        <!-- Contrasts -->
+        <section name="ct" expanded="false" title="Specify Groups to Contrast">


I think this section can be removed and the param can be put into the section above.

yhoogstrate · 2017-10-10T06:55:41Z

tools/edger/edger.xml

+        <section name="ct" expanded="false" title="Specify Groups to Contrast">
+            <param name="contrast" type="text" label="Contrasts of Interest" help="Eg. Mut-WT,KD-Control">
+                <validator type="empty_field" />
+                <validator type="regex" message="Please only use letters, numbers or underscores">^[\w,-]+$</validator>


More info: https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf (chapter 8)

~~I see that comma's are accepted. Could you explain what they're for?~~

Does it make sense to put this param into a <repeat> and no longer allow commas?

I put the contrast parameter into a repeat. I also added more info and the link to the limma user guide to the param help, is that what you meant?
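
For context on the comma question, a short Python sketch of how the original single text field behaves. The pattern is the one quoted in the diff (`^[\w,-]+$`), which accepts commas and hyphens in addition to the letters, numbers and underscores named in the validator message. The splitting logic in `split_contrasts` is an assumption for illustration, based on the `Mut-WT,KD-Control` help example, and is not taken from the tool's R script.

```python
import re
from typing import List, Tuple

# Pattern quoted from the validator in the diff; \w covers letters, digits and underscore.
CONTRAST_FIELD = re.compile(r"^[\w,-]+$")


def split_contrasts(field: str) -> List[Tuple[str, str]]:
    """Split 'A-B,C-D' into [('A', 'B'), ('C', 'D')] (hypothetical helper)."""
    if not CONTRAST_FIELD.match(field):
        raise ValueError("unexpected characters in contrast field: %r" % field)
    pairs = []
    for contrast in field.split(","):
        left, sep, right = contrast.partition("-")
        if not (left and sep and right):
            raise ValueError("contrast must look like GroupA-GroupB: %r" % contrast)
        pairs.append((left, right))
    return pairs


pairs = split_contrasts("Mut-WT,KD-Control")
print(pairs)
```

With one contrast per `<repeat>` block, the comma no longer needs to be an allowed character, which is the simplification suggested above.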

yhoogstrate · 2017-10-10T07:24:49Z

tools/edger/edger.xml

+
+        <!-- Filter Options -->
+        <section name="filter" expanded="false" title="Filter Low Counts">
+            <param name="cpmReq" type="float" value="0" min="0" label="Minimum CPM" help="Treat genes with very low expression as unexpressed and filter out. See the Filter Low Counts section below for more information. Default: 0"/>


Since we're dealing with a nominal distribution, would it make sense to filter out based on the absolute read counts, e.g. something like this: rowSums(data$counts>= n_sample * 2 )?

In very large genes, a low CPM can still correspond to a considerable amount of reads, with considerable statistical power, isn't it?

Good idea, I've added more flexible filtering. Now you can also filter on absolute read count, either total or per sample, see what you think.
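
A rough Python sketch of the three filtering styles discussed here: minimum CPM, minimum total count, and minimum count per sample. This is not the tool's R code; the function names, the toy counts and the "at least N samples" rule are assumptions made for illustration only.

```python
from typing import Dict, List

# Toy count matrix: gene -> raw read counts for four samples (made-up numbers).
counts: Dict[str, List[int]] = {
    "geneA": [0, 1, 0, 2],
    "geneB": [50, 60, 55, 70],
    "geneC": [5, 0, 4, 6],
}


def library_sizes(table: Dict[str, List[int]]) -> List[int]:
    """Total reads per sample (column sums)."""
    n_samples = len(next(iter(table.values())))
    return [sum(row[i] for row in table.values()) for i in range(n_samples)]


def keep_by_cpm(table: Dict[str, List[int]], min_cpm: float, min_samples: int) -> List[str]:
    """Keep genes whose counts-per-million reach min_cpm in at least min_samples samples."""
    sizes = library_sizes(table)
    kept = []
    for gene, row in table.items():
        cpm = [1e6 * count / size for count, size in zip(row, sizes)]
        if sum(value >= min_cpm for value in cpm) >= min_samples:
            kept.append(gene)
    return kept


def keep_by_total_count(table: Dict[str, List[int]], min_total: int) -> List[str]:
    """Keep genes whose raw counts summed over all samples reach min_total."""
    return [gene for gene, row in table.items() if sum(row) >= min_total]


def keep_by_sample_count(table: Dict[str, List[int]], min_count: int, min_samples: int) -> List[str]:
    """Keep genes with at least min_count raw reads in at least min_samples samples."""
    return [gene for gene, row in table.items()
            if sum(count >= min_count for count in row) >= min_samples]


by_cpm = keep_by_cpm(counts, min_cpm=80000, min_samples=2)
by_total = keep_by_total_count(counts, min_total=10)
by_sample = keep_by_sample_count(counts, min_count=4, min_samples=3)
print(by_cpm, by_total, by_sample)
```

The CPM threshold is scaled by library size while the raw-count thresholds are not, so the same gene can pass one filter and fail another, as geneC does in this toy example.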

yhoogstrate · 2017-10-10T08:19:13Z

When I try to model batch effects as provided in the example under Factor Information, I get the following error:

Fatal error: Exit code 1 ()
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.0

mblue9 · 2017-10-10T22:42:54Z

@yhoogstrate this is great! Thanks a lot for taking a look. I'll work my way through these and get back to you.

- use same phrases for count file selection box and help section
- remove section around inputs
- change 'factor level' to 'group'
- make contrast param repeat
- add ability to filter on raw count values (total and per sample), with tests
- add some more help and other text edits

mblue9 · 2017-10-23T07:13:22Z

@yhoogstrate thanks again for the last review! I've submitted new changes based on your suggestions and I'll address some of your comments directly above.

I couldn't reproduce the batch effects error you got. Can you check if you get it with the new changes? How are you inputting the info? The tool has these 3 tests that use that same info under Factor Information and they all pass but maybe I'm not catching everything:

- Test No. 4 (count matrix, factor info from tool form)
- Test No. 5 (count matrix, factor info from file)
- Test No. 8 (factor info entered with separate count files)

bgruening · 2017-11-06T07:56:22Z

@yhoogstrate can you look at this once more and get it in if you think it's ready?

yhoogstrate · 2017-11-07T08:42:19Z

modelling batches works (but I still need to go over the rest)

yhoogstrate

Truly amazing work

bgruening · 2017-11-07T12:55:07Z

I fully agree with you @yhoogstrate!

nsoranzo · 2017-11-07T14:09:27Z

Thanks a lot @mblue9!

mblue9 · 2017-11-07T19:49:23Z

Yay! Thanks @yhoogstrate and all!

edgeR - first version

e209d0f

nsoranzo reviewed Sep 11, 2017

fixes from review

dad9c1e

yhoogstrate reviewed Sep 12, 2017

fixes from feedback

7c3e0ba

nsoranzo reviewed Oct 4, 2017

use conda-forge statmod 1.4.30

d5b4d37

mblue9 mentioned this pull request Oct 8, 2017

bioconductor-limma - add statmod as dependency bioconda/bioconda-recipes#6247

Merged

5 tasks

yhoogstrate reviewed Oct 10, 2017

yhoogstrate self-assigned this Nov 7, 2017

yhoogstrate approved these changes Nov 7, 2017

bgruening merged commit eac022c into galaxyproject:master Nov 7, 2017

mblue9 deleted the edgeR branch January 19, 2018 04:58
