
snakemake workflow for effective genome sizes #3

Merged 11 commits on May 10, 2016
Conversation

@daler (Contributor) commented May 9, 2016

I wanted to use epic on the mm10 and dm6 assemblies, and I saw that in config/genomes.py you are storing dictionaries of chromsizes and effective genome sizes for human. Since I needed these two other genomes anyway, I figured I'd build something to support a wider range of genomes and read lengths.

It's a snakemake workflow to download sequences and chromsizes and calculate effective genome sizes for any genomes supported by UCSC. I put it in the scripts dir for now, but I was thinking it would be convenient to add the chromsizes dir and effective_sizes dir to the package data so the data are installed. Then instead of config/genomes.py checking the lookup dicts, it can read the chromsizes files and effective genome size files. Supporting a new genome would then just mean adding the assembly name to the snakefile and running it, then re-running setup.py sdist.
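For illustration, reading a packaged chromsizes file on demand could be as simple as this (a sketch only; the function name and file layout are my own, not epic's actual code):

```python
def read_chromsizes(path):
    """Parse a UCSC-style chromsizes file (chrom<TAB>size per line) into a dict."""
    sizes = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                sizes[fields[0]] = int(fields[1])
    return sizes
```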

@landscape-bot

Code Health
Repository health decreased by 2% when pulling b1edb90 on daler:master into 22943ca on endrebak:master.

@coveralls commented May 9, 2016

Coverage Status

Coverage remained the same at 17.523% when pulling b1edb90 on daler:master into 22943ca on endrebak:master.

@endrebak (Member) commented May 10, 2016

Nice work. I'll look into it tonight. I think your idea sounds better; when adding many genomes, reading them all into a dict might add a second of startup time.

  1. What do you think of allowing users to enter their own egs on the command line? The jellyfish approach is a bit strict (no mismatches allowed), so the egs ends up somewhat too high, and users might know a value that suits their data better.

  2. And what do you think of allowing users to pass a fasta on the command line and having epic compute the egs and chrom sizes? I think this is suboptimal, since a user should rather submit the size/egs data to me so it can be added to epic once and for all, but it might have a use case.
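For point 1, a command-line override could be a single optional flag that takes precedence over the stored values. A minimal argparse sketch (the flag names here are hypothetical, not epic's actual interface):

```python
import argparse

def build_parser():
    # Sketch of optional overrides; packaged values would be used as fallbacks
    # whenever these flags are not given.
    p = argparse.ArgumentParser(prog="epic")
    p.add_argument("--effective-genome-size", type=float, default=None,
                   help="override the stored effective genome size")
    p.add_argument("--chromsizes", default=None,
                   help="UCSC-style chromsizes file to use instead of bundled data")
    return p
```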

Btw, I do not know much about using git collaboratively, but I think it would be better if you pulled my recent changes and added yours on top of them ("stashing"/"popping"?) in a feature branch. Judging by the landscape-bot, some regressions were introduced.

@daler (Contributor, Author) commented May 10, 2016

I wasn't worried so much about startup time, just figured it would be one less step to copy the results into a dictionary when supporting a new genome.

It would definitely be useful to specify egs and chromsizes on the CL. That would eliminate the need for genomes.py, but could still fall back on it for defaults. I was actually starting another PR for that but will hold off for now.

I don't think you need anything in the CLI for egs on top of the existing egs script. Running jellyfish took a lot of RAM, so it would be good to keep it as a sort of infrequently-used helper rather than integrating it into the tool proper. Plus, the jellyfish dependencies get annoying on OSX, which is another reason to keep it separate.

Since there are no conflicts, merging this PR will merge these changes into the changes you already have. I'll merge the master branch into this anyway though.

@landscape-bot

Code Health
Code quality remained the same when pulling 79ccc13 on daler:master into 22943ca on endrebak:master.

@coveralls commented May 10, 2016

Coverage Status

Coverage remained the same at 17.523% when pulling 79ccc13 on daler:master into 22943ca on endrebak:master.

@endrebak (Member)

Thanks. I would not mind making this an org-repo if that makes things easier for you or you see some other merit in it.

@endrebak merged commit 608b202 into biocore-ntnu:master on May 10, 2016
@endrebak (Member)

One question: for epic to look up the egs, it needs to know the read length. How did you intend to get the read length in this system*? Compute it from the input file, or ask the user to give a length on the command line?

*Is estimating the read length from the first 10-1000(?) reads okay?

@daler (Contributor, Author) commented May 10, 2016

(not sure an org-repo is needed yet)

For read length, how about figuring it out from the first 1000 reads, selecting the closest read length that has a stored calculated egs, and emitting a log message saying what read length was estimated? If the user notices that the read length wasn't correct or is too far off, they can supply an override on the command line.
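Something like this, as a pure sketch (the function names and the set of supported lengths are illustrative, not what epic ships):

```python
from statistics import median

def estimate_read_length(read_lengths, n=1000):
    """Estimate the read length as the median over the first n reads."""
    sample = [rl for _, rl in zip(range(n), read_lengths)]
    return int(median(sample))

def closest_supported_length(estimated, supported=(36, 50, 75, 100)):
    """Pick the read length that has a stored egs nearest the estimate."""
    return min(supported, key=lambda rl: abs(rl - estimated))
```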

@endrebak (Member)

I was thinking something along those lines, so we are in agreement. Thanks.

@endrebak (Member) commented Jun 28, 2016

Something funny must have happened when you computed the hg38 genome sizes:

epic/scripts/effective_sizes/hg38_36.txt:Effective genome size:  1.35161359177
epic/scripts/effective_sizes/hg38_50.txt:Effective genome size:  1.42136215844
epic/scripts/effective_sizes/hg38_75.txt:Effective genome size:  1.48818043551
epic/scripts/effective_sizes/hg38_100.txt:Effective genome size:  0.244824913134

The problem is most likely that my code, for some reason, calculated that the length of the genome was 1889563403. Will recompute the length here to understand what went wrong.
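For reference, recomputing the length can be done with a few lines of Python (my own sketch, not the script's actual code; whether N bases should be excluded depends on how the egs is defined):

```python
def genome_length(fasta_path, skip_n=True):
    """Sum sequence lengths across a fasta file, optionally excluding N bases."""
    total = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                continue
            seq = line.strip()
            if skip_n:
                seq = seq.replace("N", "").replace("n", "")
            total += len(seq)
    return total
```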

Can't reproduce here, so I am guessing something went wrong with your download or conversion to fa :)

Thanks again for the script.

@daler (Contributor, Author) commented Jun 28, 2016

Thanks, I will re-run too when I get a chance. I seem to remember having memory issues on cluster nodes with this, maybe some truncation was happening. I'll take a closer look.

@endrebak (Member)

No need, we have plenty of RAM here. I am running it right now.

