galaxyproject / tools-iuc Public
Tool request: Annotate an interval/bed dataset with genomic features from a BED12, one step #2370
Comments
What about chipseeker? That will annotate promoter regions, introns etc in 1 step using bed and gtf. Looks like chipseeker can also accept bedgraph input https://github.com/GuangchuangYu/ChIPseeker/blob/master/R/readPeakFile.R but don’t know if the galaxy version does.
Hey @mblue9 -- wow, that tool works great!! Dataset 21 is an example of output that looks to be "just right" (tabular version). Bedgraph inputs are still processing but haven't died yet. Bed/narrowpeaks did process OK (awesome). Test history: https://usegalaxy.eu:/u/jenj/h/test-chipseeker I'll update the Ghelp post and ask the user to try this tool at usegalaxy.eu + make a request to get this installed at usegalaxy.org. This would putatively work on any bed data, not just peaks ... as long as the output is interpreted with that in mind. Close this out?
Yeah I like it too
@jennaj bedgraph does work, I tested it with the first 10k lines from one of your files (from Dataset No. 7 in your history), see output here https://usegalaxy.eu/u/mblue9/h/imported-test-chipseeker. And now I see one of your bedgraph jobs has worked too but took > 4 hrs as there's > 22 million lines in the file. Btw I made a few small fixes to Chipseeker here bgruening/galaxytools#857
Agree - bedgraph works :) Nice catches with the patch! |
There isn't a single tool that does this right now. Or, please correct me if wrong!
Input1: simple bed (3-6 columns) or bedgraph
Input2: bed12 full reference annotation
Output: Input1 regions tagged with features from the bed12: Promoter, 5' or 3' UTR, exons, introns, intergenic, coding, non-coding, etc.
Optional output: gene_id, transcript_id .. not sure how to format that, but seems like a logical follow-up question (query region hit promoter region, but what gene is it based on?). The tool would already have access to that info. Output two datasets? One with the summary, one with the expanded hits: original input region + feature type, base overlap, % overlap query, % overlap target, source gene/transcript, unique hit or not, plus maybe some kind of query "index number" so all non-unique hits could be selected/grouped out in a subsequent query? Index number could be the original bed data all merged together into a single field, values separated by underscores or something else informative/parseable.
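To make the Input2/Output idea a bit more concrete, here is a minimal Python sketch of how one BED12 transcript line could be split into typed sub-intervals, plus the underscore-joined query "index number" mentioned above. Everything named here is hypothetical: the function names, the feature labels, and the 1000 bp upstream promoter window are assumptions, and it treats thickStart == thickEnd as "non-coding", which is a common convention rather than a guarantee.

```python
from typing import List, Tuple

Feature = Tuple[str, int, int]  # (feature_type, start, end), 0-based half-open


def bed12_features(line: str, promoter_up: int = 1000) -> List[Feature]:
    """Split one tab-separated BED12 transcript line into typed sub-intervals."""
    f = line.rstrip("\n").split("\t")
    tx_start, tx_end, strand = int(f[1]), int(f[2]), f[5]
    cds_start, cds_end = int(f[6]), int(f[7])  # thickStart / thickEnd
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    exons = [(tx_start + s, tx_start + s + n) for s, n in zip(starts, sizes)]

    coding = cds_end > cds_start  # assumption: equal thick coordinates = non-coding
    # On the minus strand the 5' end is the right-hand side of the transcript.
    left_utr, right_utr = ("3utr", "5utr") if strand == "-" else ("5utr", "3utr")

    feats: List[Feature] = []
    if strand == "-":
        promoter = ("promoter", tx_end, tx_end + promoter_up)
    else:
        promoter = ("promoter", max(0, tx_start - promoter_up), tx_start)
    if promoter[2] > promoter[1]:  # skip an empty window at the chromosome start
        feats.append(promoter)

    for i, (es, ee) in enumerate(exons):
        if not coding:
            feats.append(("noncoding_exon", es, ee))
        else:
            if es < cds_start:
                feats.append((left_utr, es, min(ee, cds_start)))
            if max(es, cds_start) < min(ee, cds_end):
                feats.append(("coding_exon", max(es, cds_start), min(ee, cds_end)))
            if ee > cds_end:
                feats.append((right_utr, max(es, cds_end), ee))
        if i + 1 < len(exons):  # the gap to the next exon is an intron
            feats.append(("intron", ee, exons[i + 1][0]))
    return feats


def query_id(fields: List[str]) -> str:
    """Merge the original BED fields into one parseable key for grouping hits."""
    return "_".join(fields)
```

The gene/transcript name is column 4 of the BED12 line, so an expanded-hits output could carry it along with each sub-interval without any extra lookup.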
Be nice to give the user some options about overlap requirements (number of bases, % of query or target length or both) and to pick which feature to tag with (one or more or all).
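As a rough sketch of what those overlap options could mean in practice (Python, with hypothetical function and parameter names, coordinates assumed 0-based half-open as in BED):

```python
from typing import Dict, Optional


def overlap_stats(q_start: int, q_end: int,
                  t_start: int, t_end: int) -> Optional[Dict[str, float]]:
    """Return overlap in bases and as % of query/target, or None if disjoint."""
    bases = min(q_end, t_end) - max(q_start, t_start)
    if bases <= 0:
        return None
    return {
        "bases": bases,
        "pct_query": 100.0 * bases / (q_end - q_start),
        "pct_target": 100.0 * bases / (t_end - t_start),
    }


def passes(stats: Optional[Dict[str, float]], min_bases: int = 1,
           min_pct_query: float = 0.0, min_pct_target: float = 0.0) -> bool:
    """Apply user thresholds; a threshold left at its default is effectively off."""
    if stats is None:
        return False
    return (stats["bases"] >= min_bases
            and stats["pct_query"] >= min_pct_query
            and stats["pct_target"] >= min_pct_target)
```

Setting only `min_pct_query`, only `min_pct_target`, or both covers the "query or target length or both" cases with one code path.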
Some decisions could be made by the tool if any particular query hits more than one feature, or those could be exposed to the user ("prioritize regions by .......").
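One simple way to do the "prioritize regions by ..." part is a ranked list that the user can override. A minimal Python sketch follows; the default order and the feature labels are assumptions for illustration, not something decided in this thread.

```python
from typing import List, Optional

# Hypothetical default order: earlier entries win when a query hits several features.
DEFAULT_PRIORITY = ["promoter", "5utr", "3utr", "coding_exon",
                    "noncoding_exon", "intron", "intergenic"]


def pick_feature(hits: List[str], priority: Optional[List[str]] = None) -> str:
    """Choose one feature label for a query region from all features it overlaps."""
    if not hits:
        return "intergenic"  # no overlap with any annotated feature
    order = DEFAULT_PRIORITY if priority is None else priority
    rank = {name: i for i, name in enumerate(order)}
    # Labels missing from the priority list sort last.
    return min(hits, key=lambda h: rank.get(h, len(rank)))
```

The summary output would then report `pick_feature(...)` per query, while the expanded output keeps every hit.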
The tool should be able to handle stranded and non-stranded inputs, ask if strand matters if given in the input, and not fail with expanded bed dataset formats (narrow peaks).
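For the strand and "don't fail on extra columns" points, a tolerant parser might look like the Python sketch below. The function names and the `use_strand` option are hypothetical. It relies on bedGraph having four columns with no strand and on BED6+ and narrowPeak keeping strand in column 6, where narrowPeak often carries ".".

```python
from typing import List, Tuple


def parse_query(line: str) -> Tuple[str, int, int, str, List[str]]:
    """Read chrom/start/end plus strand if present; keep all original fields."""
    f = line.rstrip("\n").split("\t")
    chrom, start, end = f[0], int(f[1]), int(f[2])
    # Only trust column 6 as a strand when it actually holds "+" or "-".
    strand = f[5] if len(f) >= 6 and f[5] in ("+", "-") else "."
    return chrom, start, end, strand, f


def strand_compatible(q_strand: str, t_strand: str, use_strand: bool = True) -> bool:
    """Unstranded records match anything; otherwise strands must agree."""
    if not use_strand or q_strand == "." or t_strand == ".":
        return True
    return q_strand == t_strand
```

Because the original fields are returned untouched, any extra columns (score, signalValue, and so on) can be written back to the output as-is.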
The goal is to answer a question like this one, and it turns out to not be quite so simple: https://help.galaxyproject.org/t/tools-to-get-precise-annotated-genomic-regions-for-bedgraph-input-data/1105
Seems like a reasonable query. This would permit annotating a BED dataset in a similar way to how BAM datasets are annotated by bedtools TagBed ("tag BAM alignments based on overlaps with interval files").
Thoughts?