# Generate the data for "mutation matrices"

This is done for both codons (e.g. ATG, TGC) and amino acids (e.g. M, C).

**NOTE 1:** Because this boils down to calling [`AlignedSegment.get_aligned_pairs()`](https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.get_aligned_pairs) once for every linear alignment to the selected MAGs, **the first part of this notebook is currently pretty slow**! I've optimized things to the point where this notebook takes around 20 hours on our cluster.

If necessary or desired, it should be possible to speed this up further, e.g. by parallelizing the work or by rewriting it in a faster language like C. There may also be methods in pysam I've overlooked that would help here (from what I can tell, the only way to do this in pysam involves calling `get_aligned_pairs()`, but maybe I've missed something...).

**NOTE 2:** This doesn't actually generate the figures for the matrices -- it just outputs JSON files to a folder named `matrix-jsons/`, and another notebook will generate the figures based on these JSON files. This is intended to make it easier to regenerate the figures using different styles / etc. without having to wait hours for this stuff to finish.

**NOTE 3:** We're redoing things so that the overall matrices are only based on individual MAGs -- so it should be possible to either parallelize this on the level of each MAG (and we could probably parallelize things even further), or only run this for individual MAGs. HOWEVER: for now, I'm going to keep generating this data for all three of the selected MAGs, because (i) we need this data for the Syn/NonSyn and Non-nonsense / Nonsense barplots, and (ii) I don't really want to rewrite this entire notebook since this is currently a one-off analysis...
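
A minimal sketch of what per-MAG parallelization could look like, using Python's standard `concurrent.futures`. This is not what the notebook currently does: `count_codons_for_seq` and `count_codons_in_parallel` are hypothetical names, and the worker below is just a placeholder for the per-sequence loop further down. I'm also assuming each worker would open its own BAM file handle instead of sharing one across processes.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_codons_for_seq(seq):
    """Hypothetical stand-in for the per-sequence codon counting done below.

    A real version would open its own BAM handle, loop over the alignments
    to `seq`, and return that sequence's nested codon-frequency dict.
    """
    return seq, {}


def count_codons_in_parallel(seqs, executor_cls=ProcessPoolExecutor):
    # One task per sequence (MAG); results are merged back into a single
    # dict keyed by sequence ID, like seq2gene2codon2alignedcodons.
    with executor_cls(max_workers=max(1, len(seqs))) as pool:
        return dict(pool.map(count_codons_for_seq, seqs))


if __name__ == "__main__":
    # Threads are used here only so the sketch runs anywhere (including
    # inside a notebook); CPU-bound work would want ProcessPoolExecutor.
    print(count_codons_in_parallel(["edge_6104", "edge_1671"], ThreadPoolExecutor))
```

Since the three MAGs don't share any state in this analysis, merging the per-sequence results back into one dict should be all the coordination that's needed.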

In [1]:
%run "Header.ipynb"
%run "GeneUtils.ipynb"

In [7]:
import copy
import time
import json
import pysam
import skbio
import pileup
from collections import defaultdict, Counter
from statistics import mean
from parse_sco import parse_sco

In [3]:
# This is a list of all 64 DNA 3-mers (codons). There's probably a fancier way of generating this than what
# we do here, but this is fine.
codons = []

# We figure out the reverse complements of each of the 64 3-mers in advance -- this avoids
# us having to call str(skbio.DNA(c).reverse_complement()) every time we see a codon, and saves a tiny
# amount of time per alignment (the skbio approach took ~9e-5 seconds every time; the new approach takes ~9e-7
# seconds every time). Considering we're going through well over a million alignments, this actually should save
# us maybe a minute or two. Not much, but it's something. Actually is the time I'm spending documenting this to
# make this code readable less than the amount of time we save here? Hmmm.
codon2revcomp = {}
nts = "ACGT"
for i in nts:
    for j in nts:
        for k in nts:
            c = "{}{}{}".format(i, j, k)
            codons.append(c)
            codon2revcomp[c] = str(skbio.DNA(c).reverse_complement())
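
For what it's worth, here is one possible "fancier way" of building the same two objects, using only the standard library (`itertools.product` and `str.translate`) instead of skbio. The `_alt` names are just there to avoid clobbering the variables defined above; this is a sketch, not a replacement the rest of the notebook depends on.

```python
from itertools import product

# Complement table for the four unambiguous DNA nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# All 64 codons, in the same order as the nested loops (AAA, AAC, AAG, ...).
codons_alt = ["".join(nts3) for nts3 in product("ACGT", repeat=3)]

# Reverse complement = complement each base, then reverse the string.
codon2revcomp_alt = {c: c.translate(COMPLEMENT)[::-1] for c in codons_alt}

print(len(codons_alt), codon2revcomp_alt["ATG"])  # 64 CAT
```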

## Go through all alignments to each genome and figure out which genes they intersect and which codons in these genes they fully cover

__This is the main bottleneck of this notebook__, at least as of writing.

Define a dict which we'll use to keep track of aligned codon frequencies for each codon, for each gene, for each genome.

- For each alignment, see which predicted genes (if any) this alignment intersects within the genome. Note that "intersects" doesn't mean "fully covers".

- For each of these genes, see which codons (if any) this alignment fully covers within the gene.

- Increment aligned codon frequencies for all codons accordingly.

The reason we do things this way, as opposed to iterating over just the alignments overlapping each codon in each gene, is that doing things that way is really slow! I'm pretty sure it's because "find out which alignments overlap this region" is a pretty slow operation when working with large datasets -- and also since these are long reads, doing this on the level of each codon means we're effectively doing a lot of redundant work (you can imagine that, for a given codon, the odds are pretty good that most alignments overlapping it will also overlap adjacent codon(s)).

**Note that we purposefully say "alignment" instead of "read"** since a single read can correspond to multiple distinct linear alignments (e.g. if this read is aligned to both the start and end of a genome, due to a chimeric / supplementary alignment). We assume that no two linear alignments from a read cover the _same_ region of the genome (the upstream filtering stuff should already guarantee this), but this is about as far as we go. (We could also probably filter out supplementary alignments entirely and I suspect that wouldn't change the results here much.)
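
To make the "intersects" vs. "fully covers" distinction from the list above concrete, here's a small sketch. The function names are made up for illustration. The coordinate conventions are the ones used in the code below: pysam's `reference_start` is 0-indexed and `reference_end` points one past the last aligned base, while the gene coordinates from the .sco file are 1-indexed and inclusive.

```python
def aln_to_1_indexed(reference_start, reference_end):
    """Convert pysam-style [start, end) 0-indexed coords to 1-indexed, inclusive coords."""
    # start: 0-indexed -> 1-indexed means adding 1.
    # end: already one past the last aligned base in 0-indexing, which is
    # exactly the last aligned base's 1-indexed position.
    return reference_start + 1, reference_end


def intersects(gene_left, gene_right, seg_left, seg_right):
    """True if the two closed intervals share at least one position."""
    return gene_right >= seg_left and gene_left <= seg_right


def fully_covers(gene_left, gene_right, seg_left, seg_right):
    """True if the alignment spans the entire gene."""
    return seg_left <= gene_left and seg_right >= gene_right


# A gene at [266, 712] and an alignment with reference_start=499, reference_end=1000
seg_left, seg_right = aln_to_1_indexed(499, 1000)
print(seg_left, seg_right)                           # 500 1000
print(intersects(266, 712, seg_left, seg_right))     # True
print(fully_covers(266, 712, seg_left, seg_right))   # False
```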

In [4]:
# Maps sequence IDs to genes (keyed by their Index in the .sco file) to codons (keyed by 1-indexed
# left end, i.e. the lower of the two positional boundaries of the codon, regardless of if its gene
# is on the + or - strand; the lookups below use the 1-indexed pair1_refpos) to observed aligned
# codon frequencies (keyed by just the triplet, e.g. "AAA").
#
# Example:
# {"edge_6104":                                Sequence
#     {1:                                      Gene index in the .sco file
#         {265:                                Left codon position
#             {"TTA": 1000, "TTT": 1, ... }    Aligned codon frequencies for this particular codon
#         }
#     }
# }
seq2gene2codon2alignedcodons = {}
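
As a toy illustration of how a single alignment ends up updating this structure, here's a simplified sketch for one + strand gene. It uses a hand-written list of `(read_pos, ref_pos)` tuples in the same shape as the output of `get_aligned_pairs(matches_only=True)`, so it doesn't need pysam or a BAM file. `count_covered_codons` is a hypothetical helper, and it skips the multi-gene handling, the reverse-complementing, and the ordering-of-checks optimizations the real loop below does.

```python
from collections import defaultdict


def count_covered_codons(aligned_pairs, read_seq, gene_left, gene_right):
    """Count codons of a single + strand gene that one alignment fully covers.

    aligned_pairs: list of (read_pos, ref_pos) tuples, both 0-indexed.
    gene_left / gene_right: 1-indexed, inclusive gene coordinates.
    Returns {1-indexed left codon position: {aligned codon: count}}.
    """
    codon2alignedcodons = defaultdict(lambda: defaultdict(int))
    for i in range(len(aligned_pairs) - 2):
        (r1, g1), (r2, g2), (r3, g3) = aligned_pairs[i : i + 3]
        refpos = g1 + 1  # convert to 1-indexed
        in_gene = gene_left <= refpos <= gene_right - 2
        in_frame = (refpos - gene_left) % 3 == 0
        # No insertions or deletions may interrupt the three positions.
        read_consecutive = (r2 == r1 + 1) and (r3 == r2 + 1)
        ref_consecutive = (g2 == g1 + 1) and (g3 == g2 + 1)
        if in_gene and in_frame and read_consecutive and ref_consecutive:
            codon2alignedcodons[refpos][read_seq[r1 : r1 + 3]] += 1
    return codon2alignedcodons


# Toy read aligned to reference positions 0-8, where the read base at
# index 4 is an insertion (no reference position is paired with it).
pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (5, 4), (6, 5), (7, 6), (8, 7), (9, 8)]
result = count_covered_codons(pairs, "ATGACAATTT", gene_left=1, gene_right=9)
print({pos: dict(counts) for pos, counts in result.items()})
# {1: {'ATG': 1}, 7: {'TTT': 1}}
```

The codon starting at position 4 isn't counted, since the insertion means the read positions paired with it aren't consecutive.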

In [5]:
bf = pysam.AlignmentFile("../main-workflow/output/fully-filtered-and-sorted-aln.bam", "rb")

tT1 = time.time()
for seq in SEQS:
    df = parse_sco("../seqs/genes/{}.sco".format(seq))
    
    # We don't actually store any results in this, but we do use it for a slight optimization
    gene2isrev = {}
    
    # Initialize some of the data structures
    # NOTE: this is kind of slow. However, it still finishes within a few seconds, so not the most
    # important thing to optimize
    seq2gene2codon2alignedcodons[seq] = {}
    for gene_data in df.itertuples():
        
        validate_gene_coords(gene_data)        
        seq2gene2codon2alignedcodons[seq][gene_data.Index] = {}
        gene2isrev[gene_data.Index] = (gene_data.Strand == "-")
        
        codon_positions = get_gene_left_codon_positions(gene_data)

        # For each codon in this gene, keep track of all the codons spanning it from the various
        # alignments to this genome.
        for cpleft in codon_positions:
            seq2gene2codon2alignedcodons[seq][gene_data.Index][cpleft] = defaultdict(int)
            
    print("Finished initialization for seq = {}".format(seq))
    alntimes = []
    
    for ri, aln in enumerate(bf.fetch(seq), 1):
        
        t1 = time.time()
        
        # Find all genes that this aln intersects in this genome
        
        # These are 0-indexed coordinates (and segright is offset to the right by one; see
        # https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.reference_end)
        segleft = aln.reference_start
        segright = aln.reference_end
        
        if segleft is None or segright is None:
            raise ValueError("Read {} is unmapped? This shouldn't happen!".format(aln.query_name))
        
        if segleft >= segright:
            raise ValueError("Read {}'s coordinates in pysam seem messed up: left = {}, right = {}".format(
                aln.query_name, segleft, segright
            ))
        
        # Convert aligned segment boundaries to 1-indexed coordinates to make comparing with gene
        # coordinates from the .sco file easier.
        # Since segright was already offset to the right by 1, we don't need to do anything for it.
        # (The gene coordinates are exact: a gene from [266, 712] starts at base 266 and ends at base 712,
        # using 1-indexing. So in order to make the aln boundaries match, we'd add 1 for segright and then
        # subtract 1 since segright was already 1 base off -- and n + 1 - 1 = n. ...math is hard)
        segleft += 1
        
        # Use vectorization to find genes overlapping this aln: see https://stackoverflow.com/a/17071908
        # for details on why parentheses, etc., and
        # https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
        # for justification on why this is useful (tldr: makes code go fast)
        genes_overlapping_aln = list(
            df.loc[(df["RightEnd"] >= segleft) & (df["LeftEnd"] <= segright)].itertuples()
        )
        # Note about the above thing that just happened: you may be shaking your fist and saying "wait
        # itertuples is slow!" And yeah, kinda. But for whatever reason I've tried multiple times to keep
        # genes_overlapping_aln as a DataFrame (and then later vectorize stuff like checking that a given
        # aligned pair covers a codon within the genes, etc) and the overhead costs seem to slow things down.
        # I am sure it's possible to speed things up more, but right now things seem good enough.

        # (Debugging code)
        # print("{} genes overlap aln {}".format(len(genes_overlapping_aln), ri))
        # print("Read {}, which ranges from {} to {}, overlaps these genes:".format(ri, segleft, segright))
        # print(genes_overlapping_aln)
                
        # If no genes overlap this aln, we are free to move on to the next aln.
        if len(genes_overlapping_aln) > 0:
            
            # Computing this is relatively slow, which is why we jump through so many hoops before we do this.
            # Each entry in get_aligned_pairs() is a tuple with 2 elements:
            # the first is the query/read pos and the second is the reference pos.
            # TODO: would it be possible to only do this for certain positions we care about? get_aligned_pairs()
            # returns a lot of stuff we don't need, e.g. regions of the aln that don't intersect with any genes.
            ap = aln.get_aligned_pairs(matches_only=True)
            
            # Doesn't look like getting this in advance saves much time, but I don't think it hurts.
            read_seq = aln.query_sequence
            
            # We only consider the leftmost position of each codon, so we don't need to bother checking the last
            # two pairs of positions (since neither could be the leftmost position of a codon that this aln
            # fully covers).
            for api, pair1 in enumerate(ap[:-2]):

                # Convert to 1-indexed position for ease of comparison with gene coordinates
                pair1_refpos = pair1[1] + 1
                  
                havent_checked_next_pairs = True
                for gene_data in genes_overlapping_aln:
                    gl = gene_data.LeftEnd
                    gr = gene_data.RightEnd
                    
                    # Check that this pair is located within this gene and is the leftmost position of a
                    # codon in the gene. (Note that this check works for both + or - strand genes.
                    # Whether the leftmost position is the "start" [i.e. CP 1] or "end" [i.e. CP 3] of
                    # the gene changes with the strand of the gene, but we'll account for that later on
                    # when we reverse-complement the codon if needed.)
                    if pair1_refpos >= gl and pair1_refpos <= gr - 2 and ((pair1_refpos - gl) % 3 == 0):
                        
                        # Nice! Looks like this aln fully covers this codon.
                        
                        # If we haven't yet, check that this aln doesn't skip over parts of the codon,
                        # or stuff like that. The reason this check is located *here* (and not
                        # before we loop over the genes) is that it seems like this is a faster strategy:
                        # only run these checks once we KNOW that this pair looks like it fully covers a
                        # codon, since many pairs might not meet that criteria.
                        #
                        # (And by recording that we've run this check once, in havent_checked_next_pairs,
                        # we can save the time cost of running the check multiple times.)
                        #
                        # I feel like an insane person trying to optimize this so much lmao.
                        if havent_checked_next_pairs:
                            # Check that the pairs are all consecutive (i.e. no "jumps" in the read,
                            # and no "jumps" in the reference)
                            # Since we don't consider the last two pairs in ap, pair2 and pair3 should
                            # always be available.
                            pair2 = ap[api + 1]
                            pair3 = ap[api + 2]

                            # For an aligned pair, [0] is the read pos and [1] is the reference pos.
                            # Result of caching this is probably negligible but... may as well
                            p10 = pair1[0]
                            p20 = pair2[0]
                            readpos_consecutive = p20 == (p10 + 1) and pair3[0] == (p20 + 1)
                            if not readpos_consecutive:
                                break

                            # (pair1_refpos is already off by 1 so no need to redo the addition operation)
                            p21 = pair2[1]
                            refpos_consecutive = p21 == pair1_refpos and pair3[1] == (p21 + 1)
                            if not refpos_consecutive:
                                break
                            havent_checked_next_pairs = False
                        
                        # Figure out what the read actually *says* in the alignment here.
                        # (It'll probably be a complete match most of the time, but there will
                        # be some occasional mismatches -- and seeing those is ... the whole point
                        # of this notebook.)

                        # We make sure to index the read by read coords, not reference coords!
                        aligned_codon = read_seq[p10: p10 + 3]

                        # Finally, update information about codon frequencies.
                        gi = gene_data.Index
                        if gene2isrev[gi]:
                            seq2gene2codon2alignedcodons[seq][gi][pair1_refpos][codon2revcomp[aligned_codon]] += 1
                        else:
                            seq2gene2codon2alignedcodons[seq][gi][pair1_refpos][aligned_codon] += 1

        t2 = time.time()
        alntimes.append(t2 - t1)
        # the downside to printing frequent messages is that it inflates the notebook filesize, which can
        # be annoying. hence less frequent messages here (every 10k alignments).
        if ri % 10000 == 0:
            print(f"Seen {ri:,} alignments so far in {seq}.")
            print(f"Total time taken thus far:               {t2-tT1:,.4f} sec.")
            print(f"Average time per alignment for this seq: {mean(alntimes):,.4f} sec.")

    # At this point, we've seen all the alignments to all the codons in this genome.
    # We can now "call" mutations based on the frequencies we've counted.
tT2 = time.time()

print("Figuring all that out took a total of {} seconds.".format(tT2 - tT1))

bf.close()
# TODO: It'd probably be safer to output this after each sequence, rather than at the very end.
with open("matrix-jsons/seq2gene2codon2alignedcodons.json", "w") as dumpster:
    dumpster.write(json.dumps(seq2gene2codon2alignedcodons))

Finished initialization for seq = edge_6104
Seen 10,000 alignments so far in edge_6104.
Total time taken thus far:               533.4696 sec.
Average time per alignment for this seq: 0.0531 sec.
Seen 20,000 alignments so far in edge_6104.
Total time taken thus far:               1,210.8409 sec.
Average time per alignment for this seq: 0.0604 sec.
Seen 30,000 alignments so far in edge_6104.
Total time taken thus far:               1,881.3421 sec.
Average time per alignment for this seq: 0.0626 sec.
Seen 40,000 alignments so far in edge_6104.
Total time taken thus far:               2,390.0748 sec.
Average time per alignment for this seq: 0.0596 sec.
Seen 50,000 alignments so far in edge_6104.
Total time taken thus far:               2,848.1760 sec.
Average time per alignment for this seq: 0.0569 sec.
Seen 60,000 alignments so far in edge_6104.
Total time taken thus far:               3,341.7460 sec.
Average time per alignment for this seq: 0.0556 sec.
Seen 70,000 alignments so far in e

## Using the information we just computed for each genome, "call" mutations and store this information in the frequency data structures we set up earlier

This section is kind of slow, but on the order of "takes a few minutes on the cluster" and not on the order of "takes literally hours to run" (the stuff we just finished doing takes way longer). We could make this next step more efficient if desired; most of my energy on optimization here thus far has been spent on the earlier step in this notebook.

Note that the main output of the above section (`seq2gene2codon2alignedcodons`) has already been written out to a JSON file -- in practice, it's useful to be able to start execution again from this bottom section after stopping the above section (e.g. if something goes wrong _here_, then we don't have to rerun the top part of this notebook again). Even if the entire notebook is run in a single shot, we still save and then load the `seq2gene2codon2alignedcodons` object anyway.
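
One side effect of this save-then-load step is worth spelling out: JSON object keys are always strings, so the integer gene indices and codon positions come back as strings after loading. That's why the code below indexes the loaded object using `str(...)`. A quick demonstration with a made-up entry:

```python
import json

# Integer keys, as in the in-memory structure built by the first section.
before = {"edge_6104": {1: {266: {"TTA": 1000, "TTT": 1}}}}

# json.dumps() converts the int keys to strings; json.loads() keeps them as strings.
after = json.loads(json.dumps(before))

print(after)
# {'edge_6104': {'1': {'266': {'TTA': 1000, 'TTT': 1}}}}
print(after["edge_6104"]["1"]["266"]["TTA"])
# 1000
```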

In [4]:
with open("matrix-jsons/seq2gene2codon2alignedcodons.json", "r") as loadster:
    seq2gene2codon2alignedcodons = json.load(loadster)

In [8]:
def generate_mutmatrix_data(seq):
    fasta = skbio.DNA.read("../seqs/{}.fasta".format(seq))
    df = parse_sco("../seqs/genes/{}.sco".format(seq))
    
    # 64x63 dict: each key is a triplet of {A, C, G, T}, and each value is another dict with all the other codons
    codon2codon2freq = {}

    # 21x20 dict: each key is a proteinogenic amino acid (A, C, D, E, F, ...), limited to just
    # stuff in the standard genetic code (i.e. ignoring selenocysteine and pyrrolysine) but including
    # "*", representing a stop codon.
    aa2aa2freq = {}

    # 64-key dict: maps each triplet to an integer indicating how frequently this triplet occurs in all genes
    # in this genome (i.e. not counting mutations into this triplet).
    codon2freq = {}

    # 21-key dict: maps amino acid/stop codon to integer indicating frequency across all genes.
    aa2freq = {}

    aas = set([])
    for c in codons:
        aas.add(str(skbio.DNA(c).translate()))

    # Initialize dicts to 0s
    for c1 in codons:
        codon2codon2freq[c1] = {c2: 0 for c2 in set(codons) - set([c1])}
        codon2freq[c1] = 0

    for aa1 in aas:
        aa2aa2freq[aa1] = {aa2: 0 for aa2 in set(aas) - set([aa1])}
        aa2freq[aa1] = 0   
    
    for gene_data in df.itertuples():
        print("On gene {} in seq {}.".format(gene_data.Index, seq))
        for cpleft in range(gene_data.LeftEnd, gene_data.RightEnd + 1, 3):
            
            # Make note of the codon sequence and amino acid encoded by this codon in the "reference" genome.
            # (Keep in mind that the gene data in the .sco file uses 1-indexed coords, so we need to convert
            # accordingly.)
            codon_dna = fasta[cpleft - 1: cpleft + 2]
            if gene_data.Strand == "-":
                codon_dna = codon_dna.reverse_complement()

            codon_seq = str(codon_dna)
            aa = str(codon_dna.translate())
            
            # Update frequencies accordingly.
            codon2freq[codon_seq] += 1
            aa2freq[aa] += 1
            
            # We can finally compute stats re: number of mismatching and matching codons.
            aligned_codons = seq2gene2codon2alignedcodons[seq][str(gene_data.Index)][str(cpleft)]
            
            # Ignore weird, low-coverage cases. We require that the ref codon is both present in the
            # aligned codons and is (one of) the most common aligned codons (it's ok if it is tied with
            # other codon(s) as the most common). See paper for a description of why this is useful; tldr,
            # it means we can pick the most common non-reference codon and be confident its relative freq is
            # in the range [0%, 50%].
            #
            # We also first account for the case where aligned_codons is empty (i.e. nothing's aligned to
            # this codon), since in that case codon_seq will not be in aligned_codons.
            if len(aligned_codons) > 0:
                max_ac_freq = max(aligned_codons.values())
                if codon_seq in aligned_codons and aligned_codons[codon_seq] == max_ac_freq:
                    num_aligned_codons = sum(aligned_codons.values())

                    # Only call a mutation using the max-freq alt codon, not the sum of all alternate codon freqs.
                    # The max(d, key=d.get) trick is from https://stackoverflow.com/a/280156 (I've used it a lot
                    # here :) Notably, this breaks ties arbitrarily.
                    alt_codons = {c: aligned_codons[c] for c in aligned_codons if c != codon_seq}
                    if len(alt_codons) > 0:
                        max_freq_alt_codon = max(alt_codons, key=alt_codons.get)
                        max_freq_alt_codon_freq = alt_codons[max_freq_alt_codon]
                    else:
                        # The only codon aligned to this position is the reference codon, so no mutation here.
                        # (Reset max_freq_alt_codon too, so a stale value from an earlier codon can't be reused.)
                        max_freq_alt_codon = None
                        max_freq_alt_codon_freq = 0

                    # print("sum of vals of ac is {}".format(sum(aligned_codons.values())))
                    # print("Codon {} from {} to {} in gene {} in seq {} has mutations: {}".format(
                    #     codon_seq, cpleft, cpleft + 2, gene_data.Index, seq, aligned_codons
                    # ))

                    if pileup.naively_call_mutation_directly(max_freq_alt_codon_freq, num_aligned_codons, 0.5):
                        codon2codon2freq[codon_seq][max_freq_alt_codon] += 1

                        #print(f"{codon_seq}, {aligned_codons}, is mutation! max freq alt codon is {max_freq_alt_codon}")

                        # NOTE: I guess you could argue that we should do this another way, where we actually compute
                        # the translations of all the alt codons and then pick the most common AA/stop codon from there?
                        #
                        # You could argue this either way: doing it based on just the mutated codon keeps the matrices
                        # consistent and lessens the impact of small errors, while taking into account all alt codon
                        # translations could help show weird things where multiple mutations have similar consequences.
                        # Hmm.
                        alt_codon_aa = str(skbio.DNA(max_freq_alt_codon).translate())
                        if alt_codon_aa != aa:
                            aa2aa2freq[aa][alt_codon_aa] += 1
                            # print("Is nonsyn mutation! Alt {} codes for {}; orig coded for {}".format(
                            #     max_freq_alt_codon, alt_codon_aa, aa
                            # ))

    # Write out stuff for further analysis / in case of crisis
    with open(f"matrix-jsons/{seq}-codon2codon2freq.json", "w") as dumpster:
        dumpster.write(json.dumps(codon2codon2freq))

    with open(f"matrix-jsons/{seq}-codon2freq.json", "w") as dumpster:
        dumpster.write(json.dumps(codon2freq))

    with open(f"matrix-jsons/{seq}-aa2aa2freq.json", "w") as dumpster:
        dumpster.write(json.dumps(aa2aa2freq))

    with open(f"matrix-jsons/{seq}-aa2freq.json", "w") as dumpster:
        dumpster.write(json.dumps(aa2freq))
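
To make the filtering logic in `generate_mutmatrix_data()` easier to follow, here is a standalone sketch of just the "pick the max-frequency alternate codon" step, run on made-up frequencies. `pick_max_freq_alt_codon` is a hypothetical helper (it isn't used by the function above), and it deliberately stops short of the actual mutation call, which is delegated to `pileup.naively_call_mutation_directly()` and isn't reproduced here.

```python
def pick_max_freq_alt_codon(ref_codon, aligned_codons):
    """Return (alt_codon, alt_freq, coverage), or None if this codon is skipped.

    A codon is skipped if nothing is aligned to it, or if the reference codon
    is not (one of) the most common aligned codons.
    """
    if len(aligned_codons) == 0:
        return None
    max_freq = max(aligned_codons.values())
    if ref_codon not in aligned_codons or aligned_codons[ref_codon] != max_freq:
        return None
    coverage = sum(aligned_codons.values())
    alt_codons = {c: f for c, f in aligned_codons.items() if c != ref_codon}
    if len(alt_codons) == 0:
        # Only the reference codon is aligned here, so there's no alt codon.
        return None, 0, coverage
    # Ties between alt codons are broken arbitrarily (first one seen wins).
    alt = max(alt_codons, key=alt_codons.get)
    return alt, alt_codons[alt], coverage


print(pick_max_freq_alt_codon("TTA", {"TTA": 1000, "TTT": 7, "CTA": 2}))
# ('TTT', 7, 1009)
print(pick_max_freq_alt_codon("TTA", {"TTA": 3, "TTT": 9}))
# None
print(pick_max_freq_alt_codon("TTA", {"TTA": 50}))
# (None, 0, 50)
```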

In [9]:
generate_mutmatrix_data("edge_1671")
generate_mutmatrix_data("edge_6104")
generate_mutmatrix_data("edge_2358")

On gene 1 in seq edge_1671.
On gene 2 in seq edge_1671.
On gene 3 in seq edge_1671.
On gene 4 in seq edge_1671.
On gene 5 in seq edge_1671.
On gene 6 in seq edge_1671.
On gene 7 in seq edge_1671.
On gene 8 in seq edge_1671.
On gene 9 in seq edge_1671.
On gene 10 in seq edge_1671.
On gene 11 in seq edge_1671.
On gene 12 in seq edge_1671.
On gene 13 in seq edge_1671.
On gene 14 in seq edge_1671.
On gene 15 in seq edge_1671.
On gene 16 in seq edge_1671.
On gene 17 in seq edge_1671.
On gene 18 in seq edge_1671.
On gene 19 in seq edge_1671.
On gene 20 in seq edge_1671.
On gene 21 in seq edge_1671.
On gene 22 in seq edge_1671.
On gene 23 in seq edge_1671.
On gene 24 in seq edge_1671.
On gene 25 in seq edge_1671.
On gene 26 in seq edge_1671.
On gene 27 in seq edge_1671.
On gene 28 in seq edge_1671.
On gene 29 in seq edge_1671.
On gene 30 in seq edge_1671.
On gene 31 in seq edge_1671.
On gene 32 in seq edge_1671.
On gene 33 in seq edge_1671.
On gene 34 in seq edge_1671.
On gene 35 in seq edge_

On gene 284 in seq edge_1671.
On gene 285 in seq edge_1671.
On gene 286 in seq edge_1671.
On gene 287 in seq edge_1671.
On gene 288 in seq edge_1671.
On gene 289 in seq edge_1671.
On gene 290 in seq edge_1671.
On gene 291 in seq edge_1671.
On gene 292 in seq edge_1671.
On gene 293 in seq edge_1671.
On gene 294 in seq edge_1671.
On gene 295 in seq edge_1671.
On gene 296 in seq edge_1671.
On gene 297 in seq edge_1671.
On gene 298 in seq edge_1671.
On gene 299 in seq edge_1671.
On gene 300 in seq edge_1671.
On gene 301 in seq edge_1671.
On gene 302 in seq edge_1671.
On gene 303 in seq edge_1671.
On gene 304 in seq edge_1671.
On gene 305 in seq edge_1671.
On gene 306 in seq edge_1671.
On gene 307 in seq edge_1671.
On gene 308 in seq edge_1671.
On gene 309 in seq edge_1671.
On gene 310 in seq edge_1671.
On gene 311 in seq edge_1671.
On gene 312 in seq edge_1671.
On gene 313 in seq edge_1671.
On gene 314 in seq edge_1671.
On gene 315 in seq edge_1671.
On gene 316 in seq edge_1671.
On gene 31

On gene 564 in seq edge_1671.
On gene 565 in seq edge_1671.
On gene 566 in seq edge_1671.
On gene 567 in seq edge_1671.
On gene 568 in seq edge_1671.
On gene 569 in seq edge_1671.
On gene 570 in seq edge_1671.
On gene 571 in seq edge_1671.
On gene 572 in seq edge_1671.
On gene 573 in seq edge_1671.
On gene 574 in seq edge_1671.
On gene 575 in seq edge_1671.
On gene 576 in seq edge_1671.
On gene 577 in seq edge_1671.
On gene 578 in seq edge_1671.
On gene 579 in seq edge_1671.
On gene 580 in seq edge_1671.
On gene 581 in seq edge_1671.
On gene 582 in seq edge_1671.
On gene 583 in seq edge_1671.
On gene 584 in seq edge_1671.
On gene 585 in seq edge_1671.
On gene 586 in seq edge_1671.
On gene 587 in seq edge_1671.
On gene 588 in seq edge_1671.
On gene 589 in seq edge_1671.
On gene 590 in seq edge_1671.
On gene 591 in seq edge_1671.
On gene 592 in seq edge_1671.
On gene 593 in seq edge_1671.
On gene 594 in seq edge_1671.
On gene 595 in seq edge_1671.
On gene 596 in seq edge_1671.
On gene 59

On gene 838 in seq edge_1671.
On gene 839 in seq edge_1671.
On gene 840 in seq edge_1671.
On gene 841 in seq edge_1671.
On gene 842 in seq edge_1671.
On gene 843 in seq edge_1671.
On gene 844 in seq edge_1671.
On gene 845 in seq edge_1671.
On gene 846 in seq edge_1671.
On gene 847 in seq edge_1671.
On gene 848 in seq edge_1671.
On gene 849 in seq edge_1671.
On gene 850 in seq edge_1671.
On gene 851 in seq edge_1671.
On gene 852 in seq edge_1671.
On gene 853 in seq edge_1671.
On gene 854 in seq edge_1671.
On gene 855 in seq edge_1671.
On gene 856 in seq edge_1671.
On gene 857 in seq edge_1671.
On gene 858 in seq edge_1671.
On gene 859 in seq edge_1671.
On gene 860 in seq edge_1671.
On gene 861 in seq edge_1671.
On gene 862 in seq edge_1671.
On gene 863 in seq edge_1671.
On gene 864 in seq edge_1671.
On gene 865 in seq edge_1671.
On gene 866 in seq edge_1671.
On gene 867 in seq edge_1671.
On gene 868 in seq edge_1671.
On gene 869 in seq edge_1671.
On gene 870 in seq edge_1671.
On gene 87

On gene 1107 in seq edge_1671.
On gene 1108 in seq edge_1671.
On gene 1109 in seq edge_1671.
On gene 1110 in seq edge_1671.
On gene 1111 in seq edge_1671.
On gene 1112 in seq edge_1671.
On gene 1113 in seq edge_1671.
On gene 1114 in seq edge_1671.
On gene 1115 in seq edge_1671.
On gene 1116 in seq edge_1671.
On gene 1117 in seq edge_1671.
On gene 1118 in seq edge_1671.
On gene 1119 in seq edge_1671.
On gene 1120 in seq edge_1671.
On gene 1121 in seq edge_1671.
On gene 1122 in seq edge_1671.
On gene 1123 in seq edge_1671.
On gene 1124 in seq edge_1671.
On gene 1125 in seq edge_1671.
On gene 1126 in seq edge_1671.
On gene 1127 in seq edge_1671.
On gene 1128 in seq edge_1671.
On gene 1129 in seq edge_1671.
On gene 1130 in seq edge_1671.
On gene 1131 in seq edge_1671.
On gene 1132 in seq edge_1671.
On gene 1133 in seq edge_1671.
On gene 1134 in seq edge_1671.
On gene 1135 in seq edge_1671.
On gene 1136 in seq edge_1671.
On gene 1137 in seq edge_1671.
On gene 1138 in seq edge_1671.
On gene 

On gene 1379 in seq edge_1671.
On gene 1380 in seq edge_1671.
On gene 1381 in seq edge_1671.
On gene 1382 in seq edge_1671.
On gene 1383 in seq edge_1671.
On gene 1384 in seq edge_1671.
On gene 1385 in seq edge_1671.
On gene 1386 in seq edge_1671.
On gene 1387 in seq edge_1671.
On gene 1388 in seq edge_1671.
On gene 1389 in seq edge_1671.
On gene 1390 in seq edge_1671.
On gene 1391 in seq edge_1671.
On gene 1392 in seq edge_1671.
On gene 1393 in seq edge_1671.
On gene 1394 in seq edge_1671.
On gene 1395 in seq edge_1671.
On gene 1396 in seq edge_1671.
On gene 1397 in seq edge_1671.
On gene 1398 in seq edge_1671.
On gene 1399 in seq edge_1671.
On gene 1400 in seq edge_1671.
On gene 1401 in seq edge_1671.
On gene 1402 in seq edge_1671.
On gene 1403 in seq edge_1671.
On gene 1404 in seq edge_1671.
On gene 1405 in seq edge_1671.
On gene 1406 in seq edge_1671.
On gene 1407 in seq edge_1671.
On gene 1408 in seq edge_1671.
On gene 1409 in seq edge_1671.
On gene 1410 in seq edge_1671.
On gene 

On gene 1645 in seq edge_1671.
On gene 1646 in seq edge_1671.
On gene 1647 in seq edge_1671.
On gene 1648 in seq edge_1671.
On gene 1649 in seq edge_1671.
On gene 1650 in seq edge_1671.
On gene 1651 in seq edge_1671.
On gene 1652 in seq edge_1671.
On gene 1653 in seq edge_1671.
On gene 1654 in seq edge_1671.
On gene 1655 in seq edge_1671.
On gene 1656 in seq edge_1671.
On gene 1657 in seq edge_1671.
On gene 1658 in seq edge_1671.
On gene 1659 in seq edge_1671.
On gene 1660 in seq edge_1671.
On gene 1661 in seq edge_1671.
On gene 1662 in seq edge_1671.
On gene 1663 in seq edge_1671.
On gene 1664 in seq edge_1671.
On gene 1665 in seq edge_1671.
On gene 1666 in seq edge_1671.
On gene 1667 in seq edge_1671.
On gene 1668 in seq edge_1671.
On gene 1669 in seq edge_1671.
On gene 1670 in seq edge_1671.
On gene 1671 in seq edge_1671.
On gene 1672 in seq edge_1671.
On gene 1673 in seq edge_1671.
On gene 1674 in seq edge_1671.
On gene 1675 in seq edge_1671.
On gene 1676 in seq edge_1671.
On gene 

On gene 168 in seq edge_6104.
On gene 169 in seq edge_6104.
On gene 170 in seq edge_6104.
On gene 171 in seq edge_6104.
On gene 172 in seq edge_6104.
On gene 173 in seq edge_6104.
On gene 174 in seq edge_6104.
On gene 175 in seq edge_6104.
On gene 176 in seq edge_6104.
On gene 177 in seq edge_6104.
On gene 178 in seq edge_6104.
On gene 179 in seq edge_6104.
On gene 180 in seq edge_6104.
On gene 181 in seq edge_6104.
On gene 182 in seq edge_6104.
On gene 183 in seq edge_6104.
On gene 184 in seq edge_6104.
On gene 185 in seq edge_6104.
On gene 186 in seq edge_6104.
On gene 187 in seq edge_6104.
On gene 188 in seq edge_6104.
On gene 189 in seq edge_6104.
On gene 190 in seq edge_6104.
On gene 191 in seq edge_6104.
On gene 192 in seq edge_6104.
On gene 193 in seq edge_6104.
On gene 194 in seq edge_6104.
On gene 195 in seq edge_6104.
On gene 196 in seq edge_6104.
On gene 197 in seq edge_6104.
On gene 198 in seq edge_6104.
On gene 199 in seq edge_6104.
On gene 200 in seq edge_6104.
...

On gene 458 in seq edge_6104.
On gene 459 in seq edge_6104.
On gene 460 in seq edge_6104.
On gene 461 in seq edge_6104.
On gene 462 in seq edge_6104.
On gene 463 in seq edge_6104.
On gene 464 in seq edge_6104.
On gene 465 in seq edge_6104.
On gene 466 in seq edge_6104.
On gene 467 in seq edge_6104.
On gene 468 in seq edge_6104.
On gene 469 in seq edge_6104.
On gene 470 in seq edge_6104.
On gene 471 in seq edge_6104.
On gene 472 in seq edge_6104.
On gene 473 in seq edge_6104.
On gene 474 in seq edge_6104.
On gene 475 in seq edge_6104.
On gene 476 in seq edge_6104.
On gene 477 in seq edge_6104.
On gene 478 in seq edge_6104.
On gene 479 in seq edge_6104.
On gene 480 in seq edge_6104.
On gene 481 in seq edge_6104.
On gene 482 in seq edge_6104.
On gene 483 in seq edge_6104.
On gene 484 in seq edge_6104.
On gene 485 in seq edge_6104.
On gene 486 in seq edge_6104.
On gene 487 in seq edge_6104.
On gene 488 in seq edge_6104.
On gene 489 in seq edge_6104.
On gene 490 in seq edge_6104.
...

On gene 737 in seq edge_6104.
On gene 738 in seq edge_6104.
On gene 739 in seq edge_6104.
On gene 740 in seq edge_6104.
On gene 741 in seq edge_6104.
On gene 742 in seq edge_6104.
On gene 743 in seq edge_6104.
On gene 744 in seq edge_6104.
On gene 745 in seq edge_6104.
On gene 746 in seq edge_6104.
On gene 747 in seq edge_6104.
On gene 748 in seq edge_6104.
On gene 749 in seq edge_6104.
On gene 750 in seq edge_6104.
On gene 751 in seq edge_6104.
On gene 752 in seq edge_6104.
On gene 753 in seq edge_6104.
On gene 754 in seq edge_6104.
On gene 755 in seq edge_6104.
On gene 756 in seq edge_6104.
On gene 757 in seq edge_6104.
On gene 758 in seq edge_6104.
On gene 759 in seq edge_6104.
On gene 760 in seq edge_6104.
On gene 761 in seq edge_6104.
On gene 762 in seq edge_6104.
On gene 763 in seq edge_6104.
On gene 764 in seq edge_6104.
On gene 765 in seq edge_6104.
On gene 766 in seq edge_6104.
On gene 767 in seq edge_6104.
On gene 768 in seq edge_6104.
On gene 769 in seq edge_6104.
...

On gene 1016 in seq edge_6104.
On gene 1017 in seq edge_6104.
On gene 1018 in seq edge_6104.
On gene 1019 in seq edge_6104.
On gene 1020 in seq edge_6104.
On gene 1021 in seq edge_6104.
On gene 1022 in seq edge_6104.
On gene 1023 in seq edge_6104.
On gene 1024 in seq edge_6104.
On gene 1025 in seq edge_6104.
On gene 1026 in seq edge_6104.
On gene 1027 in seq edge_6104.
On gene 1028 in seq edge_6104.
On gene 1029 in seq edge_6104.
On gene 1030 in seq edge_6104.
On gene 1031 in seq edge_6104.
On gene 1032 in seq edge_6104.
On gene 1033 in seq edge_6104.
On gene 1034 in seq edge_6104.
On gene 1035 in seq edge_6104.
On gene 1036 in seq edge_6104.
On gene 1037 in seq edge_6104.
On gene 1038 in seq edge_6104.
On gene 1039 in seq edge_6104.
On gene 1040 in seq edge_6104.
On gene 1041 in seq edge_6104.
On gene 1042 in seq edge_6104.
On gene 1043 in seq edge_6104.
On gene 1044 in seq edge_6104.
On gene 1045 in seq edge_6104.
On gene 1046 in seq edge_6104.
On gene 1047 in seq edge_6104.
...

On gene 1282 in seq edge_6104.
On gene 1283 in seq edge_6104.
On gene 1284 in seq edge_6104.
On gene 1285 in seq edge_6104.
On gene 1286 in seq edge_6104.
On gene 1287 in seq edge_6104.
On gene 1288 in seq edge_6104.
On gene 1289 in seq edge_6104.
On gene 1290 in seq edge_6104.
On gene 1291 in seq edge_6104.
On gene 1292 in seq edge_6104.
On gene 1293 in seq edge_6104.
On gene 1294 in seq edge_6104.
On gene 1295 in seq edge_6104.
On gene 1296 in seq edge_6104.
On gene 1297 in seq edge_6104.
On gene 1 in seq edge_2358.
On gene 2 in seq edge_2358.
On gene 3 in seq edge_2358.
On gene 4 in seq edge_2358.
On gene 5 in seq edge_2358.
On gene 6 in seq edge_2358.
On gene 7 in seq edge_2358.
On gene 8 in seq edge_2358.
On gene 9 in seq edge_2358.
On gene 10 in seq edge_2358.
On gene 11 in seq edge_2358.
On gene 12 in seq edge_2358.
On gene 13 in seq edge_2358.
On gene 14 in seq edge_2358.
On gene 15 in seq edge_2358.
On gene 16 in seq edge_2358.
On gene 17 in seq edge_2358.
...

On gene 287 in seq edge_2358.
On gene 288 in seq edge_2358.
On gene 289 in seq edge_2358.
On gene 290 in seq edge_2358.
On gene 291 in seq edge_2358.
On gene 292 in seq edge_2358.
On gene 293 in seq edge_2358.
On gene 294 in seq edge_2358.
On gene 295 in seq edge_2358.
On gene 296 in seq edge_2358.
On gene 297 in seq edge_2358.
On gene 298 in seq edge_2358.
On gene 299 in seq edge_2358.
On gene 300 in seq edge_2358.
On gene 301 in seq edge_2358.
On gene 302 in seq edge_2358.
On gene 303 in seq edge_2358.
On gene 304 in seq edge_2358.
On gene 305 in seq edge_2358.
On gene 306 in seq edge_2358.
On gene 307 in seq edge_2358.
On gene 308 in seq edge_2358.
On gene 309 in seq edge_2358.
On gene 310 in seq edge_2358.
On gene 311 in seq edge_2358.
On gene 312 in seq edge_2358.
On gene 313 in seq edge_2358.
On gene 314 in seq edge_2358.
On gene 315 in seq edge_2358.
On gene 316 in seq edge_2358.
On gene 317 in seq edge_2358.
On gene 318 in seq edge_2358.
On gene 319 in seq edge_2358.
...

On gene 579 in seq edge_2358.
On gene 580 in seq edge_2358.
On gene 581 in seq edge_2358.
On gene 582 in seq edge_2358.
On gene 583 in seq edge_2358.
On gene 584 in seq edge_2358.
On gene 585 in seq edge_2358.
On gene 586 in seq edge_2358.
On gene 587 in seq edge_2358.
On gene 588 in seq edge_2358.
On gene 589 in seq edge_2358.
On gene 590 in seq edge_2358.
On gene 591 in seq edge_2358.
On gene 592 in seq edge_2358.
On gene 593 in seq edge_2358.
On gene 594 in seq edge_2358.
On gene 595 in seq edge_2358.
On gene 596 in seq edge_2358.
On gene 597 in seq edge_2358.
On gene 598 in seq edge_2358.
On gene 599 in seq edge_2358.
On gene 600 in seq edge_2358.
On gene 601 in seq edge_2358.
On gene 602 in seq edge_2358.
On gene 603 in seq edge_2358.
On gene 604 in seq edge_2358.
On gene 605 in seq edge_2358.
On gene 606 in seq edge_2358.
On gene 607 in seq edge_2358.
On gene 608 in seq edge_2358.
On gene 609 in seq edge_2358.
On gene 610 in seq edge_2358.
On gene 611 in seq edge_2358.
...

On gene 855 in seq edge_2358.
On gene 856 in seq edge_2358.
On gene 857 in seq edge_2358.
On gene 858 in seq edge_2358.
On gene 859 in seq edge_2358.
On gene 860 in seq edge_2358.
On gene 861 in seq edge_2358.
On gene 862 in seq edge_2358.
On gene 863 in seq edge_2358.
On gene 864 in seq edge_2358.
On gene 865 in seq edge_2358.
On gene 866 in seq edge_2358.
On gene 867 in seq edge_2358.
On gene 868 in seq edge_2358.
On gene 869 in seq edge_2358.
On gene 870 in seq edge_2358.
On gene 871 in seq edge_2358.
On gene 872 in seq edge_2358.
On gene 873 in seq edge_2358.
On gene 874 in seq edge_2358.
On gene 875 in seq edge_2358.
On gene 876 in seq edge_2358.
On gene 877 in seq edge_2358.
On gene 878 in seq edge_2358.
On gene 879 in seq edge_2358.
On gene 880 in seq edge_2358.
On gene 881 in seq edge_2358.
On gene 882 in seq edge_2358.
On gene 883 in seq edge_2358.
On gene 884 in seq edge_2358.
On gene 885 in seq edge_2358.
On gene 886 in seq edge_2358.
On gene 887 in seq edge_2358.
...

On gene 1142 in seq edge_2358.
On gene 1143 in seq edge_2358.
On gene 1144 in seq edge_2358.
On gene 1145 in seq edge_2358.
On gene 1146 in seq edge_2358.
On gene 1147 in seq edge_2358.
On gene 1148 in seq edge_2358.
On gene 1149 in seq edge_2358.
On gene 1150 in seq edge_2358.
On gene 1151 in seq edge_2358.
On gene 1152 in seq edge_2358.
On gene 1153 in seq edge_2358.
On gene 1154 in seq edge_2358.
On gene 1155 in seq edge_2358.
On gene 1156 in seq edge_2358.
On gene 1157 in seq edge_2358.
On gene 1158 in seq edge_2358.
On gene 1159 in seq edge_2358.
On gene 1160 in seq edge_2358.
On gene 1161 in seq edge_2358.
On gene 1162 in seq edge_2358.
On gene 1163 in seq edge_2358.
On gene 1164 in seq edge_2358.
On gene 1165 in seq edge_2358.
On gene 1166 in seq edge_2358.
On gene 1167 in seq edge_2358.
On gene 1168 in seq edge_2358.
On gene 1169 in seq edge_2358.
On gene 1170 in seq edge_2358.
On gene 1171 in seq edge_2358.
On gene 1172 in seq edge_2358.
On gene 1173 in seq edge_2358.
...

On gene 1414 in seq edge_2358.
On gene 1415 in seq edge_2358.
On gene 1416 in seq edge_2358.
On gene 1417 in seq edge_2358.
On gene 1418 in seq edge_2358.
On gene 1419 in seq edge_2358.
On gene 1420 in seq edge_2358.
On gene 1421 in seq edge_2358.
On gene 1422 in seq edge_2358.
On gene 1423 in seq edge_2358.
On gene 1424 in seq edge_2358.
On gene 1425 in seq edge_2358.
On gene 1426 in seq edge_2358.
On gene 1427 in seq edge_2358.
On gene 1428 in seq edge_2358.
On gene 1429 in seq edge_2358.
On gene 1430 in seq edge_2358.
On gene 1431 in seq edge_2358.
On gene 1432 in seq edge_2358.
On gene 1433 in seq edge_2358.
On gene 1434 in seq edge_2358.
On gene 1435 in seq edge_2358.
On gene 1436 in seq edge_2358.
On gene 1437 in seq edge_2358.
On gene 1438 in seq edge_2358.
On gene 1439 in seq edge_2358.
On gene 1440 in seq edge_2358.
On gene 1441 in seq edge_2358.
On gene 1442 in seq edge_2358.
On gene 1443 in seq edge_2358.
On gene 1444 in seq edge_2358.
On gene 1445 in seq edge_2358.
...

On gene 1679 in seq edge_2358.
On gene 1680 in seq edge_2358.
On gene 1681 in seq edge_2358.
On gene 1682 in seq edge_2358.
On gene 1683 in seq edge_2358.
On gene 1684 in seq edge_2358.
On gene 1685 in seq edge_2358.
On gene 1686 in seq edge_2358.
On gene 1687 in seq edge_2358.
On gene 1688 in seq edge_2358.
On gene 1689 in seq edge_2358.
On gene 1690 in seq edge_2358.
On gene 1691 in seq edge_2358.
On gene 1692 in seq edge_2358.
On gene 1693 in seq edge_2358.
On gene 1694 in seq edge_2358.
On gene 1695 in seq edge_2358.
On gene 1696 in seq edge_2358.
On gene 1697 in seq edge_2358.
On gene 1698 in seq edge_2358.
On gene 1699 in seq edge_2358.
On gene 1700 in seq edge_2358.
On gene 1701 in seq edge_2358.
On gene 1702 in seq edge_2358.
On gene 1703 in seq edge_2358.
On gene 1704 in seq edge_2358.
On gene 1705 in seq edge_2358.
On gene 1706 in seq edge_2358.
On gene 1707 in seq edge_2358.
On gene 1708 in seq edge_2358.
On gene 1709 in seq edge_2358.
On gene 1710 in seq edge_2358.
...

On gene 1957 in seq edge_2358.
On gene 1958 in seq edge_2358.
On gene 1959 in seq edge_2358.
On gene 1960 in seq edge_2358.
On gene 1961 in seq edge_2358.
On gene 1962 in seq edge_2358.
On gene 1963 in seq edge_2358.
On gene 1964 in seq edge_2358.
On gene 1965 in seq edge_2358.
On gene 1966 in seq edge_2358.
On gene 1967 in seq edge_2358.
On gene 1968 in seq edge_2358.
On gene 1969 in seq edge_2358.
On gene 1970 in seq edge_2358.
On gene 1971 in seq edge_2358.
On gene 1972 in seq edge_2358.
On gene 1973 in seq edge_2358.
On gene 1974 in seq edge_2358.
On gene 1975 in seq edge_2358.
On gene 1976 in seq edge_2358.
On gene 1977 in seq edge_2358.
On gene 1978 in seq edge_2358.
On gene 1979 in seq edge_2358.
On gene 1980 in seq edge_2358.
On gene 1981 in seq edge_2358.
On gene 1982 in seq edge_2358.
On gene 1983 in seq edge_2358.
On gene 1984 in seq edge_2358.
On gene 1985 in seq edge_2358.
On gene 1986 in seq edge_2358.
On gene 1987 in seq edge_2358.
On gene 1988 in seq edge_2358.
...

On gene 2237 in seq edge_2358.
On gene 2238 in seq edge_2358.
On gene 2239 in seq edge_2358.
On gene 2240 in seq edge_2358.
On gene 2241 in seq edge_2358.
On gene 2242 in seq edge_2358.
On gene 2243 in seq edge_2358.
On gene 2244 in seq edge_2358.
On gene 2245 in seq edge_2358.
On gene 2246 in seq edge_2358.
On gene 2247 in seq edge_2358.
On gene 2248 in seq edge_2358.
On gene 2249 in seq edge_2358.
On gene 2250 in seq edge_2358.
On gene 2251 in seq edge_2358.
On gene 2252 in seq edge_2358.
On gene 2253 in seq edge_2358.
On gene 2254 in seq edge_2358.
On gene 2255 in seq edge_2358.
On gene 2256 in seq edge_2358.
On gene 2257 in seq edge_2358.
On gene 2258 in seq edge_2358.
On gene 2259 in seq edge_2358.
On gene 2260 in seq edge_2358.
On gene 2261 in seq edge_2358.
On gene 2262 in seq edge_2358.
On gene 2263 in seq edge_2358.
On gene 2264 in seq edge_2358.
On gene 2265 in seq edge_2358.
On gene 2266 in seq edge_2358.
On gene 2267 in seq edge_2358.
On gene 2268 in seq edge_2358.
...

On gene 2508 in seq edge_2358.
On gene 2509 in seq edge_2358.
On gene 2510 in seq edge_2358.
On gene 2511 in seq edge_2358.
On gene 2512 in seq edge_2358.
On gene 2513 in seq edge_2358.
On gene 2514 in seq edge_2358.
On gene 2515 in seq edge_2358.
On gene 2516 in seq edge_2358.
On gene 2517 in seq edge_2358.
On gene 2518 in seq edge_2358.
On gene 2519 in seq edge_2358.
On gene 2520 in seq edge_2358.
On gene 2521 in seq edge_2358.
On gene 2522 in seq edge_2358.
On gene 2523 in seq edge_2358.
On gene 2524 in seq edge_2358.
On gene 2525 in seq edge_2358.
On gene 2526 in seq edge_2358.
On gene 2527 in seq edge_2358.
On gene 2528 in seq edge_2358.
On gene 2529 in seq edge_2358.
On gene 2530 in seq edge_2358.
On gene 2531 in seq edge_2358.
On gene 2532 in seq edge_2358.
On gene 2533 in seq edge_2358.
On gene 2534 in seq edge_2358.
On gene 2535 in seq edge_2358.
On gene 2536 in seq edge_2358.
On gene 2537 in seq edge_2358.
On gene 2538 in seq edge_2358.
On gene 2539 in seq edge_2358.
...