Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pyfaidx as the default SequenceFileDB backend, add support for Python 3 #25

Merged
merged 25 commits into from
Nov 21, 2019
Merged

Conversation

mdshw5
Copy link
Contributor

@mdshw5 mdshw5 commented Apr 21, 2016

Also added an option to ignore transcript version numbers when they don't match with the HGVS representation. That was a minor change, and something I wouldn't mind throwing out as I don't think it's a good idea in practice.

@mdshw5
Copy link
Contributor Author

mdshw5 commented Apr 21, 2016

It's also worth noting that tests are Passing on @travisci for Pythons 2.6, 2.7, 3.4, 3.5.

@jdavisp3
Copy link
Contributor

Adding the venv2.7 directory was not intentional, right?

@mdshw5
Copy link
Contributor Author

mdshw5 commented Apr 21, 2016

:) Completely unintentional.

@@ -72,10 +73,10 @@ def get_transcript(name):
# hgvs_name.ref_allele = 'A'
# hgvs_name.alt_allele = 'G'

print (hgvs_name.transcript,
print((hgvs_name.transcript,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a nit, and I realize this is technically valid Python 2 code, but to guarantee compatibility between Python 2.7 and Python 3, could you add a from __future__ import print_function to the top of files that use print as a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem - will do.

@@ -173,13 +173,13 @@ def _on_forward_strand(self):
seq_3p = self.seq_3p
self.seq_5p = revcomp(seq_3p)
self.seq_3p = revcomp(seq_5p)
self.alleles = map(revcomp, self.alleles)
self.alleles = list(map(revcomp, self.alleles))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: a list comprehension would be more idiomatic (and prevent duplicating memory overhead in python 2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lucaswiman
Copy link

This is a really nice cleanup, @mdshw5!

@lucaswiman
Copy link

These changes LGTM from a code perspective. @mdrasmus: Do you have any comments about the pyfaidx changes before merge?

@naegelyd
Copy link
Contributor

naegelyd commented Mar 8, 2017

This LGTM too. @mdrasmus could you take a look at this?

Copy link
Contributor

@mdrasmus mdrasmus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had only a few questions and suggestions, otherwise this looks pretty good to me.

Thanks these contributions!

# Read genome sequence using pygr.
genome = SequenceFileDB('hg19.fa')
# Read genome sequence using pyfaidx.
genome = Genome('/tmp/hg19.fa')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Genome() allow relative paths? If so, I think the relative paths may be more convenient for people to try out the library. If it requires absolute paths, then this is a very reasonable change to make.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genome should actually be Fasta from the pyfaidx library. I'm not sure why I write differently. Yes, Fasta supports relative paths, as it's using python base library IO classes, so it works as you would expect.

"""
Parse an HGVS name into (chrom, start, end, ref, alt)

hgvs_name: HGVS name to parse.
genome: pygr compatible genome object.
transcript: Transcript corresponding to HGVS name.
normalize: If True, normalize allele according to VCF standard.
lazy: If True, discard version information from incoming transcript/gene.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that this is convenient. So it SGTM. However, the user should be very careful though, because sequence can change quite a bit between transcript versions, especially indel changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree this is dangerous. I can't even remember what my use case for this was, so it can just as well be dumped.

"""
hgvs = HGVSName(hgvs_name)

# Determine transcript.
if hgvs.kind == 'c' and not transcript:
if '.' in hgvs.transcript and lazy:
hgvs.transcript, version = hgvs.transcript.split('.')
elif '.' in hgvs.gene and lazy:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, can genes have versions? I don't believe I have seen a HUGO identifier with a .{version} suffix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might have been a case where I was being too careful to consider all cases. According to Ensembl, genes to not appear to be versioned within annotation releases.

@@ -115,7 +119,7 @@ def read(self, filename):

filename: a filename string or file stream.
"""
if isinstance(filename, basestring):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucaswiman do you suggest a six usage here that allows python 2 unicode filenames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. Should probably use string_type instead.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's quite completely accurate. Both str (textual) and bytes (binary) can be used as filenames in both python 2 and python 3, but six.string_types does not include bytes in Python 3. IMO the most Pythonic way to write this would be to use duck typing:

if hasattr(filename, 'read'):
    infile = filename
else:
    with open(filename) as infile:
        return self.read(infile)


def _trim_common_prefix(self):
"""
Trim the common prefix amongst all alleles.
"""
minlength = min(map(len, self.alleles))
minlength = min(list(map(len, self.alleles)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe min can take an iterable. Did you find that list was needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think list was probably a holdover from running 2to3, which tends to assume all iterables need to become sequence types.

@mdshw5
Copy link
Contributor Author

mdshw5 commented Mar 9, 2017

Thanks for the code review! It seems like I've deleted my original branch, so if you're interested in making any of the changes feel free. If you need me to pull this PR, modify, and submit a new PR I can do that, but things are a bit tricky since the PR lives entirely within your repo now...

@mdrasmus
Copy link
Contributor

mdrasmus commented Mar 9, 2017

Thanks for the code review! It seems like I've deleted my original branch, so if you're interested in making any of the changes feel free. If you need me to pull this PR, modify, and submit a new PR I can do that, but things are a bit tricky since the PR lives entirely within your repo now...

No worries. I can follow up with the last few nits and merge. Thanks again @mdshw5.

@davmlaw
Copy link

davmlaw commented Sep 25, 2019

Hi, this looks good, any chance it can be merged into master?

I have come to depend on this project but it appears that nobody is maintaining it and taking fixes. Any chance I could be given commit access?

@mdshw5
Copy link
Contributor Author

mdshw5 commented Sep 25, 2019

Sorry to have deleted my fork. You can get a patch from the PR using this URL: https://github.com/counsyl/hgvs/pull/25.patch.

@davmlaw
Copy link

davmlaw commented Sep 26, 2019

Thanks, yeah I actually implemented the same basic change (faidx & Python 3) myself, then when I ralised you did it, re-forked, merged your code so that at least it reduces the divergence from this project if it ever gets patched....

Personally I use a database to randomly load transcripts (without requiring loading a massive hash table to reduce startup time), but I'd also like to implement a quick tabix genePred version which I'm sure lots of people would find useful.

Can you contact the authors and ask for commit access so you can merge this? Or if they're not interested, should we make a pyhgvs2? I will be happy to contribute my fixes to your version if you want to host it (or we can both have commit access)

The other Python HGVS library (Invitae) won't install, and has open bugs about it. With the official branch of this project being Python2 only (which will die in a few months) and a few HGVS conversion bugs open this really needs an active maintainer IMO.

jtratner pushed a commit that referenced this pull request Nov 20, 2019
@jtratner jtratner merged commit 7f2f454 into counsyl:master Nov 21, 2019
@jtratner
Copy link
Contributor

@davmlaw @mdshw5 - I'm not 100% sure why we didn't push commits from these changes publicly before, but I think they're actually in the master branch of this repo now (would love a double check from you @davmlaw). Unfortunately Github marked this PR as "merged" because of difference in numbering between our internal and the remote one - HOWEVER, I think now all of these changes are actually in the public repo.

We use the code currently on master in production at Myriad Women's Health.

I'm also going to advocate for us to eliminate our internal hgvs repo so that we do all development going forward on the public one.

@mdshw5
Copy link
Contributor Author

mdshw5 commented Nov 21, 2019

Thanks @jtratner! Sounds like a bit of a headache - good luck!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants