Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequence Complement does not return the same results on Python3.3.3 as Python3.4 #518

Closed
pkMyt1 opened this issue Apr 16, 2015 · 9 comments

Comments

@pkMyt1
Copy link

pkMyt1 commented Apr 16, 2015

When complementing DNA during a search for mutations I found that when run on Python 3.3.3 I got about 3% fewer hits than on Python 3.4 using Seq(str(ref_seq+alt_seq)).complement(). When I wrote my own simple code to generate the complement this difference went away. The error is rare and not random. I have not been able to capture the actual sequence that is doing this. The difference in the two are significant.

@peterjc
Copy link
Member

peterjc commented Apr 16, 2015

This sounds very serious, but likely very difficult to identify without more information. Have you compared with Python 2.6 or 2.7? That might help work out if this is related to the string/unicode changes.

You said it is not random? Otherwise I might have guessed it was faulty RAM to blame.

@pkMyt1
Copy link
Author

pkMyt1 commented Apr 16, 2015

Not random in that multiple runs always yield the same result. My code will not run on Python 2.x so I cannot test that without some gymnastics. I have been chasing this for about a week now. When I run the code block below with a ref_seq of say 10,000 nucleotides and an alt_seq of say 10,000 also and filtering them so I know ref != alt; Python 3.3.3 I might get a count of 1,000 and Python 3.4 it will be 1,300. If I make the complement myself like in the last code block the results are the same. Both Python 3.3 and 3.4 should treat strings/unicode the same.

list = ['CT', 'CA', 'TA', 'TC']
count = 0
comp = Seq(str(ref_seq+alt_seq)).complement()
for i in list:
   if comp == i:
       count +=1

def complement(seq):
    comp_seq = ''
    for i in range(len(seq)):
        if seq[i] is "G":
            nuc = "C"
        elif seq[i] is "A":
            nuc = "T"
        elif seq[i] is "T":
            nuc = "A"
        elif seq[i] is "C":
            nuc = "G"
        else:
            nuc = "N"
        comp_seq += nuc
    return comp_seq

@pkMyt1
Copy link
Author

pkMyt1 commented Apr 16, 2015

I did some checking to try and find the root of the problem. I copied your code (shown below) to do the complementing and tested it outside of Biopython with both versions of Python. Worked like a charm. I also checked and the version test in your code and it is working fine. I am stumped at what else to check at this point.

    ambiguous_dna_complement = {
    "A": "T",
    "C": "G",
    "G": "C",
    "T": "A",
    "M": "K",
    "R": "Y",
    "W": "W",
    "S": "S",
    "Y": "R",
    "K": "M",
    "V": "B",
    "H": "D",
    "D": "H",
    "B": "V",
    "X": "X",
    "N": "N",
    }
    complement_mapping = ambiguous_dna_complement
    before = ''.join(complement_mapping.keys())
    after = ''.join(complement_mapping.values())
    before = before + before.lower()
    after = after + after.lower()
    ttable = str.maketrans(before, after)
    comp_seq = seq.translate(ttable)

    return comp_seq

@peterjc
Copy link
Member

peterjc commented Apr 16, 2015

I was suggesting comparing Python 2 and Python 3 (just in case this is Python 3 specific which would be a useful clue).

@etal
Copy link
Contributor

etal commented Apr 16, 2015

Can you verify that your Python 3.3.3 and Python 3.4 installations are both using the same version of Biopython? The Seq equality operation in Biopython recently changed semantics to compare string values instead of object identity.

Assuming that's all fine, then if is is being used for string comparison in some places (e.g. if seq[i] is "G"), then maybe Pythons 3.4 changed the way strings are interned. Using == or direct dictionary lookups should protect against that.

@pkMyt1
Copy link
Author

pkMyt1 commented Apr 17, 2015

The Biopython versions are the same (1.65). I was using "is", did not know
about the possibility you brought up so I changed all "is" to "==". It did
not help for this issue but probably makes my code better [?]. If there is
anything else I can do to help on this let me know. I keep looking at it
on my end as well.

On Thu, Apr 16, 2015 at 7:23 PM, Eric Talevich notifications@github.com
wrote:

Can you verify that your Python 3.3.3 and Python 3.4 installations are
both using the same version of Biopython? The Seq equality operation in
Biopython recently changed semantics to compare string values instead of
object identity.

Assuming that's all fine, then if is is being used for string comparison
in some places (e.g. if seq[i] is "G"), then maybe Pythons 3.4 changed
the way strings are interned. Using == or direct dictionary lookups
should protect against that.


Reply to this email directly or view it on GitHub
#518 (comment).

@peterjc
Copy link
Member

peterjc commented Apr 17, 2015

Good idea @etal on checking == versus is.

@pkMyt1 Would you be able to share the entire example (privately if need be)? Otherwise trying to cut this down to a small self contained example would be a sensible debugging step...

@pkMyt1
Copy link
Author

pkMyt1 commented Apr 17, 2015

The program is woven together in a way that cutting it down would be a big
effort. I can send you a zipped file of the program with notes pointing to
the block that is calling Biopython. I would prefer to send it privately.
Unfortunately my smallest BAM is 1.8 Gb. This will not work on a generic
BAM. I have modified the sequence headers in the file. Do you have access
to the hg19.fa reference?

On Fri, Apr 17, 2015 at 9:03 AM, Peter Cock notifications@github.com
wrote:

Good idea @etal https://github.com/etal on checking == versus is.

@pkMyt1 https://github.com/pkMyt1 Would you be able to share the entire
example (privately if need be)? Otherwise trying to cut this down to a
small self contained example would be a sensible debugging step...


Reply to this email directly or view it on GitHub
#518 (comment).

@peterjc
Copy link
Member

peterjc commented Jul 4, 2018

I'm going to close this issue now - I hope the problem went away, but without more information I don't see what else we can do.

@peterjc peterjc closed this as completed Jul 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants