Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

edgeR #1471

Merged
merged 6 commits into from
Nov 7, 2017
Merged

edgeR #1471

merged 6 commits into from
Nov 7, 2017

Conversation

mblue9
Copy link
Contributor

@mblue9 mblue9 commented Sep 11, 2017

Hello everyone,

In this PR I've made a stab at that first draft of an IUC edgeR tool. As suggested in Gitter, first I took a look at the edgeR Galaxy tools already available and DESeq2, and then I discovered that the limma-voom tool that's already in IUC, could transform very easily into an edgeR tool, with just some small code changes. The limma-voom Galaxy tool was written by the group that are authors of limma and edgeR so I thought I'd start with that as a first version and see if that could be adapted. The code I've submitted here (both the xml and R script) is practically identical to the limma-voom tool (and uses the same test-data input) so I'm wondering if it might be good to make a combined limma-voom/edgeR tool or suite?

So far, I've only included the core functions of edgeR (which I would think could be enough for a first version of the tool?), so this version performs differential expression using the edgeR quasi-likelihood test (plus produces MDS, BCV and MD plots). But the user can also choose to run the likelihood ratio test instead. I have not added other edgeR functions yet (e.g. glmTreat, processAmplicons for CRISPR fastqs etc) as I don't know that they'd be needed in the first version.

In it's current form, this edgeR tool takes the same input as the limma-voom tool, so requires a count matrix, rather than the one file per sample like DESeq2 (like some of you said you like in Gitter). For limma-voom, I had been planning to use the Join by ID tool or something similar to create the input matrix from Featurecounts output, but do people think it's better if users could also input the individual files like Deseq2?

What do people think? I'm happy to change things (or have things changed)

P.S. I have not tried incorporating tags and metadata yet, or doing anything fancy with creating a design matrix. Similar to the limma-voom tool, people can either input the sample info from a file or enter it directly in the tool form.

P.P.S. I was intending to create a Galaxy tutorial for limma-voom anyway to use for teaching here (based on the R workshop material we created here http://combine-australia.github.io/RNAseq-R/). That dataset is the same one used in the edgeR workflow published here: https://f1000research.com/articles/5-1438 so maybe there could be a shared tut that can be used for edgeR and limma-voom. That tut also uses heatmap2 and I was going to see if I can include that, now that it's in IUC 馃槃

<requirement type="package" version="3.30.13">bioconductor-limma</requirement>
<requirement type="package" version="1.4.29">r-statmod</requirement>
<requirement type="package" version="0.4.1">r-scales</requirement>
<requirement type="package" version="1.5_9.1">r-locfit</requirement>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is r-locfit needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because otherwise you get an error about the locfit package being missing with the estimateDisp function (see here: https://support.bioconductor.org/p/75970/)
It's because these packages have some suggested but not required dependencies (see here: https://support.bioconductor.org/p/83334/)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment <!-- comment --> above the requirement explaining this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the requirements, see my new comment below.

</when>
</conditional>
</section>
<section name="ct" expanded="false" title="Specify groups to contrast">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops sorry! I've fixed those indentations now

Copy link
Member

@nsoranzo nsoranzo Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be still broken unfortunately, forgot to push?

<validator type="empty_field" />
<validator type="regex" message="Please only use letters, numbers or underscores">^[\w,-]+$</validator>
</param>
</section>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

mkdir ./output_dir

&&
cp '$outReport.files_path'/*.tsv output_dir/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mv is probably better than cp

Copy link
Member

@shiltemann shiltemann Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cp is used because the files are both pulled into a collection (from output_dir) and included in the html report (so need to stay in $outReport.files_path) ..a symlink could be a good alternative though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I'm using cp is because, if you use mv, on exporting the history there are no tsvs at all (neither from the report nor from the collection, are files in collections not included in exported histories?). So I am using cp so that a user who exports a history is not unknowingly missing the tsv files.
@shiltemann would a symlink work for including the files in an exported history?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mblue9 I am not entirely sure actually, maybe not ..but if the files can get big it could be worth testing out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think symlinks will work for object store backends and exported histories, I think we're bound to cp (but these files shouldn't become too big anyway, right ?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what is consider 'big' here but these are just text files of the diff exp results (1 per contrast), I don't think that it'd ever be more than MBs in total, nowhere near as big as the fastqs and bams used to generate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mblue9 yeah I would consider that small, if it were files on order of GBs it would be shame to duplicate them, but this should be fine (besides, I don't know if there is a good way around this at present anyway as Marius pointed out)

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 12, 2017

Hi @nsoranzo and @shiltemann thanks for the review!

I also tests to check the report for which edgeR test was used.

<validator type="regex" message="Please only use letters, numbers or underscores">^[\w]+$</validator>
</param>
<param name="pfactLevel" type="text" label="Primary Factor Levels"
help="Eg. WT,WT,WT,Mut,Mut,Mut NOTE: Please only use letters, numbers or underscores and ensure that the same levels are typed identically with cases matching.">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

help="current help text (case sensitive)"

This tool outputs a table of differentially expressed genes for each contrast of interest and a HTML report with plots and additional information. Optionally you can choose to output the normalised counts table and the RData file.
]]></help>
<citations>
<citation type="doi">10.1093/bioinformatics/btp616</citation>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does limma need a citation too? (and maybe other dependencies you included?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to work out what to do with the citations. I've left it with just the edgeR citation here for the moment but I should add the citation for the limma-voom tool that this edgeR tool is based on, and probably limma too. I was going tor wait til I figured out how to combine, this is still WIP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mblue9 what do you mean with combine? You can add multiple citations if you like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening by combine I mean that the limma-voom tool and this edgeR one share a lot of code so I was going to see if I could macrofy.

@bgruening
Copy link
Member

@mblue9 to answer one of your question from your TP. I think it is perfectly fine to do one compined repo with limma/voom. Whatever is best for the maintainer as long as we can create separate TS repos, which we can I think.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 19, 2017

Thanks for these reviews and feedback! I'll work on them (including adding the deseq2-style input) as soon as I can get a chance.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 25, 2017

I've made changes as suggested (and some others) if anyone wants to take a look to see where I'm up to. I haven't done any combining of the limma-voom and edgeR wrappers yet but I'm going to look into that. I'm sure some of the changes I've made could be done more elegantly. I'm happy to make more changes, remove sections etc.

In this commit I've

  • added the option to input multiple files similar to the deseq2 wrapper (very similar, I took the code from the deseq2 wrapper)
  • added a test for the multiple files input
  • added rjson as a requirement (needed for the deseq2-style input)
  • changed the requirements e.g. removed limma as it's already a dependency of the edgeR conda package. Am wondering if it would be better to add scales and statmod to the limma/edgeR conda packages rather than as dependencies here?
  • added another plot (for QL dispersions)
  • with the limma-voom tool you can provide an annotation file as input, I haven't added that to the multiple files input (so currently only possible with the matrix) but I could?
  • added names to tests
  • fixed all the indentations (I hope)

@bgruening
Copy link
Member

changed the requirements e.g. removed limma as it's already a dependency of the edgeR conda package. Am wondering if it would be better to add scales and statmod to the limma/edgeR conda packages rather than as dependencies here?

If this is not a dependency of limma/edgeR than it is ok to not add it to conda and just here.

with the limma-voom tool you can provide an annotation file as input, I haven't added that to the multiple files input (so currently only possible with the matrix) but I could?

That would be super useful I think.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 27, 2017

Thanks @bgruening I'll add the annotation file option to the multiple file input.
Re the dependencies - they're not "official" dependencies of limma but their omission does cause people problems, see here https://support.bioconductor.org/p/83334/ . From that post it seems they're omitted to reduce installation hassles so was wondering then if it would be better (for conda limma users) if they were in the limma conda package as dependencies?

@bgruening
Copy link
Member

In this case, please feel free to add it to the conda package.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 30, 2017

@bgruening
Copy link
Member

They will have them listed twice yes. But the resolver will only install one. So this is fine. In a long run we could clean the other packages as well.

- add annotation file input option to the multiple counts files input, also added test
- add getopt library to parse options
- add some error handling lines from DESeq2 R script (e.g. Sys.setlocale)
- reorder some input options
- add some more explanation
@mblue9
Copy link
Contributor Author

mblue9 commented Oct 3, 2017

To finish this I still need to update the limma conda package and try macrofying, but other than that I'm done with changes for the moment. If people think I should make more just let me know.

The changes I've made are below. For updating the limma conda package, I'm only going to add the statmod dependency. I'm leaving scales as a requirement here, as it's not limma that needs it in this tool, it's the alpha function used in the R script (for colour transparency in the MD plot).
I'm now using r-getopt and r-rjson like deseq2 @bgruening I saw that r-getopt and r-rjson are in the deseq2 bioconda packages so was wondering if I should also add them to the limma bioconda package or just leave them as requirements here? btw thanks for your deseq2 wrapper it's been really helpful for this! 馃槃

  • add annotation file input option to the multiple counts files input, also added test
  • add getopt library to parse options
  • add some error handling lines from DESeq2 R script (e.g. Sys.setlocale)
  • reorder some input options
  • add some more explanation

@mblue9
Copy link
Contributor Author

mblue9 commented Oct 4, 2017

@nsoranzo @bgruening this is also weirdly failing here even though it passes fine locally (and also passed here previously). Is it something to do with this new override channels line below? As it's failing right after that here and I don't see that line in the pizzly and edgeR builds that passed previously.

2017-10-03 21:58:55,146 DEBUG [galaxy.tools.deps.conda_util] Executing command: /home/travis/conda/bin/conda create -y --override-channels --channel iuc --channel bioconda --channel conda-forge --channel defaults --channel r --name mulled-v1-af8fafd35752d255b1c9e4fcf6ea7046af6cf96b731d4fe565ff86c3f9545a4e bioconductor-edger=3.16.5 r-scales=0.4.1 r-statmod=1.4.29 r-rjson=0.2.15 r-getopt=1.20.0 No output has been received in the last 20m0s, this potentially indicates a stalled build or something wrong with the build itself. Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received The build has been terminated

<requirement type="package" version="0.4.1">r-scales</requirement>

<!-- I will add r-statmod to limma bioconda package -->
<requirement type="package" version="1.4.29">r-statmod</requirement>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conda create also stuck for me because this is an old version from the bioconda channel, if you update the version to 1.4.30 it should work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nsoranzo, I've changed it to use r-statmod 1.4.30 instead of 1.4.29. But why does 1.4.29 work fine for me locally yet fail here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be this conda bug: conda/conda#5536

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see...thanks for the info!

@nsoranzo
Copy link
Member

nsoranzo commented Oct 4, 2017

It's green again!

@bgruening
Copy link
Member

@yhoogstrate do you have time to look over it? Thanks!

@yhoogstrate
Copy link
Member

image
Can we use exactly the same phrases in the selection box and the help section. And would it then make sense to make e.g. Separate Count Files and Single Count Matrix bold in the help section?

<inputs>

<!-- Counts and Factors -->
<section name="cnt" expanded="false" title="Input Counts and Factors">
Copy link
Member

@yhoogstrate yhoogstrate Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to set expanded to true? It needs to be opened for every usecase anyway

<valid initial="string.letters,string.digits"><add value="_" /></valid>
</sanitizer>
</param>
<repeat name="rep_factorLevel" title="Factor Level" min="2" default="2">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 'factor level' based on concepts in R or is this how scientists use it? I was thinking of the word 'condition' but I know my English is not something to rely on. If anyone else also believes 'condition' is better you can change it, otherwise keep it as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I think factor level is probably not the best term to use here for biologists. I don't think I can use condition though as that can also be used to describe one type of factor e.g. this quote from pg 49 limma user guide (https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf):

"The two experimental factors Condition and Tissue could be handled in many ways."

So calling factor levels conditions might be confusing to users?
What about just using group? I've changed it to that for the moment but let me know what you think?

<valid initial="string.letters,string.digits"><add value="_" /></valid>
</sanitizer>
</param>
<param name="countsFile" type="data" format="tabular" multiple="true" label="Counts file(s)"/>
Copy link
Member

@yhoogstrate yhoogstrate Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shiltemann expression matrices get sniffed as 'mothur.freq':

GeneID	WT3
11287	1601
11298	1834

Does it make sense to make the mothur sniffer less more stringent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yes it is quite a simple format that I don't know if I can make more stringent (it's just two columns of numbers with alphanumerical headers) ..maybe just remove sniffer altogether if it is producing too many false positives?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being addressed in galaxyproject/galaxy#4781

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing that!

</section>

<!-- Contrasts -->
<section name="ct" expanded="false" title="Specify Groups to Contrast">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section can be removed and the param can be put into the section above.

<section name="ct" expanded="false" title="Specify Groups to Contrast">
<param name="contrast" type="text" label="Contrasts of Interest" help="Eg. Mut-WT,KD-Control">
<validator type="empty_field" />
<validator type="regex" message="Please only use letters, numbers or underscores">^[\w,-]+$</validator>
Copy link
Member

@yhoogstrate yhoogstrate Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More info: https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf (chapter 8)

I see that comma's are accepted. Could you explain what they're for?

Does it make sense to put this param into a <repeat> and not further allow comma's?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the contrast parameter into a repeat. I also added more info and the link to the limma user guide to the param help, is that what you meant?


<!-- Filter Options -->
<section name="filter" expanded="false" title="Filter Low Counts">
<param name="cpmReq" type="float" value="0" min="0" label="Minimum CPM" help="Treat genes with very low expression as unexpressed and filter out. See the Filter Low Counts section below for more information. Default: 0"/>
Copy link
Member

@yhoogstrate yhoogstrate Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're dealing with a nominal distribution, would it make sense to filter out based on the absolute read counts, e.g. something like this: rowSums(data$counts>= n_sample * 2 )?

In very large genes, a low CPM can still correspond to a considerable amount of reads, with considerable statistical power, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I've added more flexible filtering. Now you can also filter on absolute read count, either total or per sample, see what you think.

@yhoogstrate
Copy link
Member

When I try to model batch effects as provided in the example under Factor Information, I get the following error:

Fatal error: Exit code 1 ()
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.0

@mblue9
Copy link
Contributor Author

mblue9 commented Oct 10, 2017

@yhoogstrate this is great! Thanks a lot for taking a look. I'll work my way through these and get back to you.

- use same phrases for count file selection box and help section
- remove section around inputs
- change 'factor level' to 'group'
- make contrast param repeat
- add ability to filter on raw count values (total and per sample), with tests
- add some more help and other text edits

 Please enter the commit message for your changes. Lines starting
@mblue9
Copy link
Contributor Author

mblue9 commented Oct 23, 2017

@yhoogstrate thanks again for the last review! I've submitted new changes based on your suggestions and I'll address some of your comments directly above.

I couldn't reproduce the batch effects error you got. Can you check if you get it with the new changes? How are you inputting the info? The tool has these 3 tests that use that same info under Factor Information and they all pass but maybe I'm not catching everything:

  • Test No. 4 (count matrix, factor info from tool form)
  • Test No. 5 (count matrix, factor info from from file),
  • Test No. 8 (factor info entered with separate count files)

@bgruening
Copy link
Member

@yhoogstrate can you look at this once more and get it in if you think its ready?

@yhoogstrate yhoogstrate self-assigned this Nov 7, 2017
@yhoogstrate
Copy link
Member

modelling batches works (but I still need to go over the rest)

Copy link
Member

@yhoogstrate yhoogstrate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truly amazing work

@bgruening
Copy link
Member

I fully agree with you @yhoogstrate!

@bgruening bgruening merged commit eac022c into galaxyproject:master Nov 7, 2017
@nsoranzo
Copy link
Member

nsoranzo commented Nov 7, 2017

Thanks a lot @mblue9!

@mblue9
Copy link
Contributor Author

mblue9 commented Nov 7, 2017

Yay! Thanks @yhoogstrate and all!

@mblue9 mblue9 deleted the edgeR branch January 19, 2018 04:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants