-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
snakemake workflow for effective genome sizes #3
Conversation
|
Nice work. I'll look into it tonight. I think your idea sounds better, when adding many genomes, reading them all into a dict might take a sec of startup-time.
Btw, I do not know how to use git collaboratively, but I think that it would be better if you were able to pull my recent changes and add your changes on top of them ("stashing"/"popping"?) in a feature branch. Some regressions are added if you see the landscape-bot. |
I wasn't worried so much about startup time, just figured it would be one less step to copy the results into a dictionary when supporting a new genome. It would definitely be useful to specify egs and chromsizes on the CL. That would eliminate the need for I don't think you need anything in the CLI for egs on top of the existing egs script. It took a lot of RAM to run jellyfish, so it would be good to keep it separate as a sort of infrequently-used helper rather than integrated into the tool proper. Plus the jellyfish dependencies get annoying on OSX, which is another reason to keep separate. Since there are no conflicts, merging this PR will merge these changes into the changes you already have. I'll merge the master branch into this anyway though. |
Thanks. I would not mind making this an org-repo if that makes things easier for you or you see some other merit in it. |
One question: for epic to look up the egs, it needs to know the read length. How did you intend to get the read length in this system*? Compute them from the input file or ask the users to give a length on the command line? *Is estimating the read length from the first 10-1000(?) reads okay? |
(not sure an org-repo is needed yet) For read length, how about figure it out from the first 1000 reads, select the closest read length that has has a stored calculated egs, and emit a log message saying what read length was estimated. If the user notices that the read length wasn't correct or is too far off, they can supply an override on the command line. |
I thought something along those lines so then we are in agreement. Thanks. |
Something funny must have happened when you computed the hg38 genome sizes:
The problem is most likely that my code, for some reason, calculated that the length of the genome was Can't reproduce here, so I am guessing something went wrong with your download or conversion to fa :) Thanks again for the script. |
Thanks, I will re-run too when I get a chance. I seem to remember having memory issues on cluster nodes with this, maybe some truncation was happening. I'll take a closer look. |
No need, we have plenty of ram here. I am running it right now. On Tuesday, June 28, 2016, Ryan Dale notifications@github.com wrote:
|
I wanted to use epic on mm10 and dm6 assemblies, and I saw that in
config/genomes.py
you are storing dictionaries of chromsizes and effective genome sizes for human. Since I at least had to do these two other genomes, I figured I'd build something to support a wider range of genomes and read lengths.It's a snakemake workflow to download sequences and chromsizes and calculate effective genome sizes for any genomes supported by UCSC. I put it in the
scripts
dir for now, but I was thinking it would be convenient to add the chromsizes dir and effective_sizes dir to the package data so the data are installed. Then instead ofconfig/genomes.py
checking the lookup dicts, it can read the chromsizes files and effective genome size files. Supporting a new genome would then just mean adding the assembly name to the snakefile and running it, then re-runningsetup.py sdist
.