-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pyfaidx as the default SequenceFileDB backend, add support for Python 3 #25
Conversation
It's also worth noting that tests are Passing on @travisci for Pythons 2.6, 2.7, 3.4, 3.5. |
Adding the |
:) Completely unintentional. |
@@ -72,10 +73,10 @@ def get_transcript(name): | |||
# hgvs_name.ref_allele = 'A' | |||
# hgvs_name.alt_allele = 'G' | |||
|
|||
print (hgvs_name.transcript, | |||
print((hgvs_name.transcript, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit of a nit, and I realize this is technically valid Python 2 code, but to guarantee compatibility between Python 2.7 and Python 3, could you add a from __future__ import print_function
to the top of files that use print
as a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem - will do.
@@ -173,13 +173,13 @@ def _on_forward_strand(self): | |||
seq_3p = self.seq_3p | |||
self.seq_5p = revcomp(seq_3p) | |||
self.seq_3p = revcomp(seq_5p) | |||
self.alleles = map(revcomp, self.alleles) | |||
self.alleles = list(map(revcomp, self.alleles)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: a list comprehension would be more idiomatic (and prevent duplicating memory overhead in python 2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This is a really nice cleanup, @mdshw5! |
These changes LGTM from a code perspective. @mdrasmus: Do you have any comments about the |
This LGTM too. @mdrasmus could you take a look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had only a few questions and suggestions, otherwise this looks pretty good to me.
Thanks these contributions!
# Read genome sequence using pygr. | ||
genome = SequenceFileDB('hg19.fa') | ||
# Read genome sequence using pyfaidx. | ||
genome = Genome('/tmp/hg19.fa') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Genome()
allow relative paths? If so, I think the relative paths may be more convenient for people to try out the library. If it requires absolute paths, then this is a very reasonable change to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Genome
should actually be Fasta
from the pyfaidx library. I'm not sure why I write differently. Yes, Fasta
supports relative paths, as it's using python base library IO classes, so it works as you would expect.
""" | ||
Parse an HGVS name into (chrom, start, end, ref, alt) | ||
|
||
hgvs_name: HGVS name to parse. | ||
genome: pygr compatible genome object. | ||
transcript: Transcript corresponding to HGVS name. | ||
normalize: If True, normalize allele according to VCF standard. | ||
lazy: If True, discard version information from incoming transcript/gene. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see that this is convenient. So it SGTM. However, the user should be very careful though, because sequence can change quite a bit between transcript versions, especially indel changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree this is dangerous. I can't even remember what my use case for this was, so it can just as well be dumped.
""" | ||
hgvs = HGVSName(hgvs_name) | ||
|
||
# Determine transcript. | ||
if hgvs.kind == 'c' and not transcript: | ||
if '.' in hgvs.transcript and lazy: | ||
hgvs.transcript, version = hgvs.transcript.split('.') | ||
elif '.' in hgvs.gene and lazy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, can genes have versions? I don't believe I have seen a HUGO identifier with a .{version}
suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might have been a case where I was being too careful to consider all cases. According to Ensembl, genes to not appear to be versioned within annotation releases.
@@ -115,7 +119,7 @@ def read(self, filename): | |||
|
|||
filename: a filename string or file stream. | |||
""" | |||
if isinstance(filename, basestring): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucaswiman do you suggest a six
usage here that allows python 2 unicode filenames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good catch. Should probably use string_type
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's quite completely accurate. Both str
(textual) and bytes
(binary) can be used as filenames in both python 2 and python 3, but six.string_types
does not include bytes
in Python 3. IMO the most Pythonic way to write this would be to use duck typing:
if hasattr(filename, 'read'):
infile = filename
else:
with open(filename) as infile:
return self.read(infile)
|
||
def _trim_common_prefix(self): | ||
""" | ||
Trim the common prefix amongst all alleles. | ||
""" | ||
minlength = min(map(len, self.alleles)) | ||
minlength = min(list(map(len, self.alleles))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe min
can take an iterable. Did you find that list
was needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think list
was probably a holdover from running 2to3
, which tends to assume all iterables need to become sequence types.
Thanks for the code review! It seems like I've deleted my original branch, so if you're interested in making any of the changes feel free. If you need me to pull this PR, modify, and submit a new PR I can do that, but things are a bit tricky since the PR lives entirely within your repo now... |
No worries. I can follow up with the last few nits and merge. Thanks again @mdshw5. |
Hi, this looks good, any chance it can be merged into master? I have come to depend on this project but it appears that nobody is maintaining it and taking fixes. Any chance I could be given commit access? |
Sorry to have deleted my fork. You can get a patch from the PR using this URL: https://github.com/counsyl/hgvs/pull/25.patch. |
Thanks, yeah I actually implemented the same basic change (faidx & Python 3) myself, then when I ralised you did it, re-forked, merged your code so that at least it reduces the divergence from this project if it ever gets patched.... Personally I use a database to randomly load transcripts (without requiring loading a massive hash table to reduce startup time), but I'd also like to implement a quick tabix genePred version which I'm sure lots of people would find useful. Can you contact the authors and ask for commit access so you can merge this? Or if they're not interested, should we make a pyhgvs2? I will be happy to contribute my fixes to your version if you want to host it (or we can both have commit access) The other Python HGVS library (Invitae) won't install, and has open bugs about it. With the official branch of this project being Python2 only (which will die in a few months) and a few HGVS conversion bugs open this really needs an active maintainer IMO. |
REP-1044: Handle inversion event
@davmlaw @mdshw5 - I'm not 100% sure why we didn't push commits from these changes publicly before, but I think they're actually in the master branch of this repo now (would love a double check from you @davmlaw). Unfortunately Github marked this PR as "merged" because of difference in numbering between our internal and the remote one - HOWEVER, I think now all of these changes are actually in the public repo. We use the code currently on master in production at Myriad Women's Health. I'm also going to advocate for us to eliminate our internal hgvs repo so that we do all development going forward on the public one. |
Thanks @jtratner! Sounds like a bit of a headache - good luck! |
Also added an option to ignore transcript version numbers when they don't match with the HGVS representation. That was a minor change, and something I wouldn't mind throwing out as I don't think it's a good idea in practice.