snakemake workflow for effective genome sizes #3

daler · 2016-05-09T18:54:05Z

I wanted to use epic on mm10 and dm6 assemblies, and I saw that in config/genomes.py you are storing dictionaries of chromsizes and effective genome sizes for human. Since I at least had to do these two other genomes, I figured I'd build something to support a wider range of genomes and read lengths.

It's a snakemake workflow to download sequences and chromsizes and calculate effective genome sizes for any genomes supported by UCSC. I put it in the scripts dir for now, but I was thinking it would be convenient to add the chromsizes dir and effective_sizes dir to the package data so the data are installed. Then instead of config/genomes.py checking the lookup dicts, it can read the chromsizes files and effective genome size files. Supporting a new genome would then just mean adding the assembly name to the snakefile and running it, then re-running setup.py sdist.

landscape-bot · 2016-05-09T18:54:55Z

Repository health decreased by 2% when pulling b1edb90 on daler:master into 22943ca on endrebak:master.

11 new problems were found (including 0 errors and 4 code smells).
1 problem was fixed (including 0 errors and 1 code smell).

coveralls · 2016-05-09T18:58:00Z

Coverage remained the same at 17.523% when pulling b1edb90 on daler:master into 22943ca on endrebak:master.

endrebak · 2016-05-10T05:50:39Z

Nice work. I'll look into it tonight. I think your idea sounds better, when adding many genomes, reading them all into a dict might take a sec of startup-time.

What do you think of allowing users to enter their own egs on the CL? The jellyfish approach is a bit strict (no mismatches allowed) so the egs becomes somewhat too high and users might know a value that suits their data better.
And what do you think of allowing users to enter a fasta on their command line and having epic compute the egs and chrom sizes? I think this is suboptimal, since a user should rather submit the size/egs data to me so it can be added to epic once and for all. But this might have a use case.

Btw, I do not know how to use git collaboratively, but I think that it would be better if you were able to pull my recent changes and add your changes on top of them ("stashing"/"popping"?) in a feature branch. Some regressions are added if you see the landscape-bot.

daler · 2016-05-10T13:39:35Z

I wasn't worried so much about startup time, just figured it would be one less step to copy the results into a dictionary when supporting a new genome.

It would definitely be useful to specify egs and chromsizes on the CL. That would eliminate the need for genomes.py, but could still fall back on it for defaults. I was actually starting another PR for that but will hold off for now.

I don't think you need anything in the CLI for egs on top of the existing egs script. It took a lot of RAM to run jellyfish, so it would be good to keep it separate as a sort of infrequently-used helper rather than integrated into the tool proper. Plus the jellyfish dependencies get annoying on OSX, which is another reason to keep separate.

Since there are no conflicts, merging this PR will merge these changes into the changes you already have. I'll merge the master branch into this anyway though.

landscape-bot · 2016-05-10T14:04:42Z

Code quality remained the same when pulling 79ccc13 on daler:master into 22943ca on endrebak:master.

coveralls · 2016-05-10T14:07:04Z

Coverage remained the same at 17.523% when pulling 79ccc13 on daler:master into 22943ca on endrebak:master.

endrebak · 2016-05-10T16:06:31Z

Thanks. I would not mind making this an org-repo if that makes things easier for you or you see some other merit in it.

coveralls · 2016-05-10T16:09:34Z

Coverage remained the same at 17.523% when pulling 79ccc13 on daler:master into 22943ca on endrebak:master.

endrebak · 2016-05-10T18:30:26Z

One question: for epic to look up the egs, it needs to know the read length. How did you intend to get the read length in this system*? Compute them from the input file or ask the users to give a length on the command line?

*Is estimating the read length from the first 10-1000(?) reads okay?

daler · 2016-05-10T18:36:55Z

(not sure an org-repo is needed yet)

For read length, how about figure it out from the first 1000 reads, select the closest read length that has has a stored calculated egs, and emit a log message saying what read length was estimated. If the user notices that the read length wasn't correct or is too far off, they can supply an override on the command line.

endrebak · 2016-05-10T18:57:15Z

I thought something along those lines so then we are in agreement. Thanks.

endrebak · 2016-06-28T12:14:36Z

Something funny must have happened when you computed the hg38 genome sizes:

epic/scripts/effective_sizes/hg38_36.txt:Effective genome size:  1.35161359177
epic/scripts/effective_sizes/hg38_50.txt:Effective genome size:  1.42136215844
epic/scripts/effective_sizes/hg38_75.txt:Effective genome size:  1.48818043551
epic/scripts/effective_sizes/hg38_100.txt:Effective genome size:  0.244824913134

The problem is most likely that my code, for some reason, calculated that the length of the genome was 1889563403. Will recompute the length here to understand what went wrong.

Can't reproduce here, so I am guessing something went wrong with your download or conversion to fa :)

Thanks again for the script.

daler · 2016-06-28T13:46:12Z

Thanks, I will re-run too when I get a chance. I seem to remember having memory issues on cluster nodes with this, maybe some truncation was happening. I'll take a closer look.

endrebak · 2016-06-28T15:29:53Z

No need, we have plenty of ram here. I am running it right now.

On Tuesday, June 28, 2016, Ryan Dale notifications@github.com wrote:

Thanks, I will re-run too when I get a chance. I seem to remember having
memory issues on cluster nodes with this, maybe some truncation was
happening. I'll take a closer look.

—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#3 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AQ9I0vJOBv1-FCIAXrfmX74--nP9ATNVks5qQSWkgaJpZM4Iaa5g
.

daler added 10 commits May 9, 2016 13:28

add snakemake workflow for calculating effective genome sizes

377d843

add requirements.txt for snakemake workflow

d81086d

add effective genome size intermediate files to .gitignore

aeb00ea

add readme for snakemake workflow

08826a7

add biopython and jellyfish to reqs

30766f8

add chromsizes output

2bd5d1b

add effective size output files

5fdfeba

rm underscore-containing chroms

c7849a3

rm orig path

1acace0

TMPDIR in readme

b1edb90

Merge remote-tracking branch 'upstream/master'

79ccc13

endrebak closed this May 10, 2016

endrebak reopened this May 10, 2016

endrebak added the in progress label May 10, 2016

endrebak merged commit 608b202 into biocore-ntnu:master May 10, 2016

endrebak removed the in progress label May 10, 2016

endrebak mentioned this pull request May 10, 2016

Epic-effective: Automatically remove bad chromos. #1

Closed

endrebak mentioned this pull request May 10, 2016

Make it possible to enter fa/fai file on command line instead of genome #2

Closed

endrebak mentioned this pull request Jul 14, 2016

Using dm3 makes epic crash #24

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

snakemake workflow for effective genome sizes #3

snakemake workflow for effective genome sizes #3

daler commented May 9, 2016

landscape-bot commented May 9, 2016

coveralls commented May 9, 2016 •

edited

Loading

endrebak commented May 10, 2016 •

edited

Loading

daler commented May 10, 2016 •

edited

Loading

landscape-bot commented May 10, 2016

coveralls commented May 10, 2016 •

edited

Loading

endrebak commented May 10, 2016

coveralls commented May 10, 2016 •

edited

Loading

endrebak commented May 10, 2016

daler commented May 10, 2016

endrebak commented May 10, 2016

endrebak commented Jun 28, 2016 •

edited

Loading

daler commented Jun 28, 2016

endrebak commented Jun 28, 2016

snakemake workflow for effective genome sizes #3

snakemake workflow for effective genome sizes #3

Conversation

daler commented May 9, 2016

landscape-bot commented May 9, 2016

coveralls commented May 9, 2016 • edited Loading

endrebak commented May 10, 2016 • edited Loading

daler commented May 10, 2016 • edited Loading

landscape-bot commented May 10, 2016

coveralls commented May 10, 2016 • edited Loading

endrebak commented May 10, 2016

coveralls commented May 10, 2016 • edited Loading

endrebak commented May 10, 2016

daler commented May 10, 2016

endrebak commented May 10, 2016

endrebak commented Jun 28, 2016 • edited Loading

daler commented Jun 28, 2016

endrebak commented Jun 28, 2016

coveralls commented May 9, 2016 •

edited

Loading

endrebak commented May 10, 2016 •

edited

Loading

daler commented May 10, 2016 •

edited

Loading

coveralls commented May 10, 2016 •

edited

Loading

coveralls commented May 10, 2016 •

edited

Loading

endrebak commented Jun 28, 2016 •

edited

Loading