# Validating Bowtie Results

We want to confirm that the results we're getting from Bowtie make sense.

Since each gRNA row has a `gene_name`, we can check whether gRNAs targeting the same gene (1) appear on the same chromosome and, if so, whether they (2) are close to one another on that chromosome.

In [19]:
import numpy
from guide.dataset import GuideDataset
from collections import defaultdict

In [20]:
dataset = GuideDataset('data/example_guide_data_with_bowtie_with_mfold.tsv')
points = dataset.points

First let's group all of the rows of the TSV by `gene_name`:

In [21]:
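# Unique gene names, plus every datapoint grouped under its gene name.
genes = list(set(p.row['gene_name'] for p in points))
points_by_gene = defaultdict(list)
for p in points:
    points_by_gene[p.row['gene_name']].append(p)

# The two counts should agree: one group per unique gene name.
print(len(points_by_gene))
print(len(genes))
list(points_by_gene[genes[0]])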

17419
17419


[<guide.datapoint.GuideDatapoint at 0x12f881320>,
 <guide.datapoint.GuideDatapoint at 0x12f8813c8>,
 <guide.datapoint.GuideDatapoint at 0x12f881470>]
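The grouping step can be tried without the dataset. The following is a minimal sketch that assumes plain dictionaries in place of `GuideDatapoint` objects; the rows and the `guide_id` field are made up for illustration and are not part of the `guide` package.

```python
from collections import defaultdict

# Hypothetical rows standing in for the TSV; only `gene_name` matters here.
rows = [
    {"gene_name": "SYNE1", "guide_id": "g1"},
    {"gene_name": "VWCE", "guide_id": "g2"},
    {"gene_name": "SYNE1", "guide_id": "g3"},
]

# Group rows under their gene name; missing keys start as empty lists.
rows_by_gene = defaultdict(list)
for row in rows:
    rows_by_gene[row["gene_name"]].append(row)

# The dictionary keys are already the unique gene names, in first-seen order.
gene_names = list(rows_by_gene)
guides_per_gene = {gene: len(group) for gene, group in rows_by_gene.items()}

print(gene_names)
print(guides_per_gene)
```

Because the keys of `rows_by_gene` are the unique gene names, the length of the dictionary and the length of a separately built set of names should always match, which is what the two identical counts above confirm.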

Now let's check whether all points with the same `gene_name` have exact Bowtie matches on the same chromosome and, if so, how tightly those matches cluster within that chromosome.

In [22]:
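perfect_genes = []  # [gene, [sigma, mean, count]]: exact matches all on one chromosome
flawed_genes = []   # genes whose exact matches span more than one chromosome

def bowties(points):
    # Fetch each Bowtie result once and keep only those with an exact match.
    results = [p.bowtie_result() for p in points]
    return [b for b in results if b.exact_match()]

def chromosomes(bowtie_results):
    return [b.chromosome() for b in bowtie_results]

for gene in genes:
    # Use a separate name so the global `points` list is not overwritten.
    gene_points = points_by_gene[gene]
    _bowties = bowties(gene_points)
    chromes = set(chromosomes(_bowties))
    if len(chromes) == 1:
        indexes = [b.exact_match().index for b in _bowties]
        # numpy.std defaults to the population standard deviation (ddof=0).
        sigma, mean, count = numpy.std(indexes), numpy.mean(indexes), len(indexes)
        perfect_genes.append([gene, [sigma, mean, count]])
    elif len(chromes) > 1:
        flawed_genes.append(gene)
    # Genes with no exact matches at all end up in neither list.

print("genes for which all points' bowtie results were on the same chromosome:")
print(len(perfect_genes))
print("genes for which some points' bowtie results were on different chromosomes:")
print(len(flawed_genes))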

genes for which all points' bowtie results were on the same chromosome:
17385
genes for which some points' bowtie results were on different chromosomes:
2
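Note that 17385 + 2 = 17387, which is 32 fewer than the 17419 genes in the dataset: a gene whose guides have no exact match at all is counted in neither list. The sketch below makes that third bucket explicit. It assumes made-up `(gene_name, chromosome)` pairs, with `None` standing for "no exact match", rather than the real `bowtie_result()` objects.

```python
from collections import defaultdict

# Hypothetical (gene_name, chromosome) pairs; chromosome is None when the
# guide has no exact Bowtie match.
hits = [
    ("GENE_A", "chr1"), ("GENE_A", "chr1"),
    ("GENE_B", "chr2"), ("GENE_B", "chr7"),
    ("GENE_C", None),
]

chromosomes_by_gene = defaultdict(set)
for gene, chromosome in hits:
    chromosomes_by_gene[gene]  # touch the key so genes without hits are kept
    if chromosome is not None:
        chromosomes_by_gene[gene].add(chromosome)

def classify(chromosome_set):
    """Label a gene by how many distinct chromosomes its exact matches hit."""
    if not chromosome_set:
        return "no_exact_match"
    if len(chromosome_set) == 1:
        return "same_chromosome"
    return "mixed_chromosomes"

labels = {gene: classify(chroms) for gene, chroms in chromosomes_by_gene.items()}
print(labels)
```

Tracking the empty case separately makes the three counts add up to the total number of genes, which is a quick sanity check on the classification itself.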


In [23]:
perfect_genes[0:10]

[['SYNE1', [41624.312529524803, 152598848.66666666, 3]],
 ['TP53AIP1', [832.46756387138544, 128936502.57142857, 7]],
 ['VWCE', [1704.4467394142887, 61293252.75, 4]],
 ['EXPH5', [25888.26351240075, 108567561.75, 4]],
 ['GAPVD1', [1716.0270066348023, 125300706.75, 4]],
 ['HDLBP', [1402.7461236802617, 241264708.75, 4]],
 ['CCDC111', [2834.0441510322312, 184661016.5, 4]],
 ['PILRB', [8.5, 100358806.5, 2]],
 ['PHYHD1', [1217.5, 128935257.5, 2]],
 ['TMEM244', [1050.0162961698368, 129844300.33333333, 3]]]
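To get a feel for what these sigma values mean, it helps to work one by hand. With exactly two hits and `numpy.std`'s default `ddof=0`, the two positions sit at mean minus sigma and mean plus sigma, so sigma is half the distance between them. The sketch below applies that to the PILRB row above; the two positions are reconstructed from the reported mean and sigma, not read from the dataset.

```python
import numpy

# Values reported for PILRB above: sigma = 8.5, mean = 100358806.5, count = 2.
sigma, mean = 8.5, 100358806.5

# For two points, the population standard deviation is half their distance,
# so the hits must lie at mean - sigma and mean + sigma.
positions = [mean - sigma, mean + sigma]
distance = positions[1] - positions[0]

# Recompute sigma from the reconstructed positions as a consistency check.
recomputed_sigma = numpy.std(positions)

print(positions, distance, recomputed_sigma)
```

So the two PILRB guides with exact matches land 17 bp apart, while a sigma in the tens of thousands, as for SYNE1, points to hits spread over a much longer stretch of the chromosome.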

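For the most part, it looks like the Bowtie search results for gRNAs targeting the same gene make sense! Of the 17387 genes with at least one exact match, 17385 have all of their exact matches on a single chromosome and only 2 do not. (The remaining 32 of the 17419 genes have no exact matches and fall into neither group.) In the ten genes shown above, the matches are also fairly closely clustered within the chromosome, with standard deviations ranging from 8.5bp to roughly 42000bp, which we would expect if the gRNAs are truly targeting the same gene.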

In [24]:
mean_sigma = numpy.mean([pg[1][0] for pg in perfect_genes])
print(mean_sigma)

6800.31310316


Averaged over all genes whose exact matches share a chromosome, the standard deviation of the hit positions is about 6800bp. In other words, the hits for a given gene typically sit within a few kilobases of their mean position. Since [BioNumbers](http://bionumbers.hms.harvard.edu/bionumber.aspx?&id=104316&ver=1) puts the average gene length in the human genome at about 10000-15000bp, that spread makes sense and is consistent with the gRNAs actually landing on the same gene.
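One caveat on this average: a gene with a single exact match has a standard deviation of 0 by definition, so any such genes pull the mean down. Whether the dataset contains many of them is not shown here. The sketch below, using made-up positions and a hypothetical `spread_stats` helper, shows how the mean could be recomputed over genes with at least two hits, and how the span (largest minus smallest position) gives a number that compares more directly with gene length.

```python
import numpy

# Hypothetical exact-match positions per gene, in bp; not taken from the dataset.
positions_by_gene = {
    "GENE_A": [1000, 3000, 8000],
    "GENE_B": [50000],           # single hit: standard deviation is 0
    "GENE_C": [200000, 200400],
}

def spread_stats(positions):
    """Return population sigma, span, and hit count for one gene."""
    arr = numpy.asarray(positions)
    return {
        "sigma": float(numpy.std(arr)),
        "span": int(arr.max() - arr.min()),
        "count": int(arr.size),
    }

stats = {gene: spread_stats(p) for gene, p in positions_by_gene.items()}

# Mean sigma over every gene versus only genes with two or more hits.
mean_sigma_all = numpy.mean([s["sigma"] for s in stats.values()])
multi_hit = [s for s in stats.values() if s["count"] > 1]
mean_sigma_multi = numpy.mean([s["sigma"] for s in multi_hit])

print(stats)
print(mean_sigma_all, mean_sigma_multi)
```

In the notebook itself, the same filter could be applied to `perfect_genes` by keeping only entries whose count (`pg[1][2]`) is greater than 1 before averaging.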