# Highly Mutated Intergenic Region Tables

Based on the `HighlyMutatedGeneTable` notebook, but simpler, since we don't care about codons (or other gene-specific details) here.

- For every sequence:
  - Determine intergenic regions. This is doable by finding all positions that *aren't* in any genes, then
    converting this collection of positions to maximally-long contiguous blocks of positions.
  - Each intergenic region has a left end (inclusive), right end (inclusive), length, and mutation rate.
  - For every intergenic region in this sequence:
    - For every position in this region:
      - Check whether this position is mutated (mismatch frequency > 0.5%) and, if so, increment a count.
    - Compute the total mutation rate for this intergenic region (number of mutated positions / region length).
    - Update the list of intergenic regions with this mutation rate.
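
The "positions to maximally-long contiguous blocks" step above can be sketched on its own as follows. This is a minimal illustration, not the notebook's actual code: `positions_to_blocks` and the toy gene coordinates are hypothetical names and values chosen for the example.

```python
def positions_to_blocks(positions):
    """Collapse integer positions into maximal contiguous (left, right) blocks.

    Both ends of each block are inclusive. (Hypothetical helper for illustration.)
    """
    blocks = []
    left = right = None
    for pos in sorted(positions):
        if left is None:
            # First position seen: start the first block.
            left = right = pos
        elif pos == right + 1:
            # Position is adjacent to the current block: extend it.
            right = pos
        else:
            # Gap found: close the current block and start a new one.
            blocks.append((left, right))
            left = right = pos
    # The final block is never closed by a gap, so add it here.
    if left is not None:
        blocks.append((left, right))
    return blocks


# Toy data (made up): a 12 bp sequence with genes at [3, 5] and [8, 11], 1-indexed and inclusive.
seq_len = 12
gene_positions = set(range(3, 6)) | set(range(8, 12))
non_gene_positions = set(range(1, seq_len + 1)) - gene_positions
blocks = positions_to_blocks(non_gene_positions)
```

Here `blocks` holds one `(left, right)` pair per intergenic region, including the single-position region at the end of the toy sequence.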
    
TODO: export info about the number of intergenic regions so it can be included in Table 1.

TODO: include left/right positions in the gene tables so that regions are easier to match up?

In [1]:
%run "Header.ipynb"
%run "LoadMutationJSONData.ipynb"
%run "GeneUtils.ipynb"

## Define `IGRegion` class

This is a very pared down version of the `Gene` class.
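
As a rough sketch of the per-region quantity this class tracks: a region's total mutation rate is the fraction of its positions whose mismatch frequency exceeds `minfreq` (given as a percentage). The function below is a standalone illustration that takes plain lists of hypothetical per-position coverages and mismatch counts; it does not use the notebook's `get_val` helper, and treating zero-coverage positions as unmutated is an assumption made only for this example.

```python
def region_mutation_rate(coverages, mismatches, minfreq=0.5):
    """Fraction of positions whose mismatch frequency exceeds minfreq (a percentage).

    Hypothetical helper for illustration; inputs are per-position lists of equal length.
    """
    if len(coverages) != len(mismatches) or len(coverages) == 0:
        raise ValueError("Need equal-length, non-empty coverage and mismatch lists")
    threshold = minfreq / 100  # convert the percentage to a fraction
    num_mutated = 0
    for cov, mm in zip(coverages, mismatches):
        # Assumption: a position with zero coverage counts as unmutated.
        if cov > 0 and (mm / cov) > threshold:
            num_mutated += 1
    return num_mutated / len(coverages)


# Made-up data for a 4-position region; only the second position exceeds 0.5%.
covs = [1000, 1000, 2000, 500]
mms = [4, 6, 10, 0]
rate = region_mutation_rate(covs, mms)
```

Note that the comparison is strict, so a position sitting exactly at the threshold (10 mismatches out of 2,000 reads) is not counted as mutated.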

In [17]:
class IGRegion:
    def __init__(self, leftend, rightend, seq, minfreq=0.5):
        self.leftend = leftend
        self.rightend = rightend
        # Unlike in Genes, we allow IG regions to include only a single position (leftend == rightend).
        # Biologically this makes sense since there isn't any notion of codons -- and practically, this
        # is the case for e.g. position 30,036 (1-indexed) in the CAMP genome, at least right now.
        if self.leftend > self.rightend:
            raise ValueError("Left end of region must be <= right end of region")
            
        self.positions = range(leftend, rightend + 1)
        self.length = len(self.positions)
        
        self.seq = seq
        self.num_total_mutations = None
        self.total_mutation_rate = None
        
        self._count_mutations(minfreq)
        
    def _count_mutations(self, minfreq):
        self.num_total_mutations = 0
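        # minfreq is given as a percentage (e.g. 0.5 means 0.5%); convert it to a fraction.
        minfreq_pct = minfreq / 100
        for pos in self.positions:
            # Use self.seq, not a global seq, so the object doesn't depend on notebook-level state.
            is_mut = get_val(self.seq, pos, lambda cov, mismatches: 1 if (mismatches / cov) > minfreq_pct else 0)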
            if is_mut == 1:
                self.num_total_mutations += 1
                    
        self.total_mutation_rate = self.num_total_mutations / self.length
        
    def __str__(self):
        return f"IGRegion in {seq2name[self.seq]}: [{self.leftend}, {self.rightend}]"
    
    def __repr__(self):
        return self.__str__()

## Define (& count mutations in) intergenic regions; export tables

In [21]:
seq2intergenicregion_ct = {}
seq2mutated_intergenicregion_ct = {}

for seq in SEQS:
    df = parse_sco("../seqs/genes/{}.sco".format(seq))
    bases_in_genes = set()
    for gene_data in df.itertuples():
        bases_in_genes |= set(get_pos_interval_from_gene(gene_data))
    
    bases_not_in_genes = sorted(set(range(1, seq2len[seq] + 1)) - bases_in_genes)
    igregions = []
    # We want to create one entry in igregions for every maximum-length run of contiguous positions
    # in bases_not_in_genes. igregions will contain IGRegion objects, defined above; and curr_region is
    # a list of positions that will eventually be converted into an IGRegion.
    curr_region = []
    for pos in bases_not_in_genes:
        if len(curr_region) == 0 or pos == curr_region[-1] + 1:
            curr_region.append(pos)
        else:
            igregions.append(IGRegion(curr_region[0], curr_region[-1], seq))
            curr_region = [pos]
    
    # We'll end up having a "last" intergenic region that isn't interrupted by anything.
    # Make sure this is included.
    if len(curr_region) > 0:
        igregions.append(IGRegion(curr_region[0], curr_region[-1], seq))
    
    igregions_by_mutrate = sorted(igregions, key=lambda r: r.total_mutation_rate)
    
    with open("misc-text/{}-highest-mutated-igregions.tex".format(seq), "w") as tblfile:
        # Select top 10 regions for each sequence, going by total mutation rate.
        # We reverse the order so that the highest-mutation-rate region is written first, etc.
        # (like a "leaderboard")
        for r in igregions_by_mutrate[-10:][::-1]:
            tblfile.write("{:,} & {:,} & {:,} & {:.2f}\\% \\\\ \\hline\n".format(
                r.leftend, r.rightend, r.length, r.total_mutation_rate * 100
            ))
            
    # Save some extra info for the report
    seq2intergenicregion_ct[seq] = len(igregions)
    seq2mutated_intergenicregion_ct[seq] = len([r for r in igregions if r.num_total_mutations > 0])

## Also, output some stats about numbers of intergenic regions

In [20]:
with open("misc-text/gene-mutation-stats.tex", "w") as of:
    # The \endinput is needed to prevent LaTeX from inserting a bunch of space after this text is
    # included using \input{}. See https://tex.stackexchange.com/a/18018.
    # (For some reason, using \unskip removed ALL whitespace, which looked ugly -- this is the best
    # solution I've found thus far.)
    of.write(
        (
            "Out of a total of {:,} / {:,} / {:,} intergenic regions in the {} / {} / {} genomes, respectively, "
            "{:,} / {:,} / {:,} regions have nonzero mutation rates (using the threshold $minfreq=0.5\\%$).\\endinput"
        ).format(
            *[seq2intergenicregion_ct[seq] for seq in SEQS],
            *[seq2name[seq] for seq in SEQS],
            *[seq2mutated_intergenicregion_ct[seq] for seq in SEQS]
        )
    )