In [1]:
import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.chrom_plots as cp
output_notebook(hide_banner=True)

## SNP Series Statistics

This notebook presents summary statistics for all SNPs and series in the autosome analysis data set. It also plots series data for chromosome 2 and for a smaller region of chromosome 2 that includes the variants responsible for lactase persistence.

In [2]:
import genomes_dnj.series_anal.chrom_series_stats as cs
stats_obj = cs.autosome_snp_series_stats_cls()
stats_obj.do_stats()
HTML(stats_obj.stats_html())

| chr | series | in_snps | not_in_snps | in_ratio | snp_mean | snp_med | snp_std | len_mean | len_med | len_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 74387 | 796158 | 647635 | 0.55 | 10 | 6 | 13.53 | 53616 | 23620 | 89920 |
| 2 | 81160 | 865501 | 682045 | 0.56 | 10 | 6 | 13.24 | 55164 | 25382 | 84828 |
| 3 | 68222 | 744533 | 556878 | 0.57 | 10 | 6 | 14.58 | 58394 | 25487 | 97904 |
| 4 | 69448 | 784085 | 545117 | 0.59 | 11 | 7 | 15.2 | 49059 | 25030 | 70950 |
| 5 | 61476 | 667469 | 502587 | 0.57 | 10 | 7 | 13.89 | 55855 | 26331 | 87644 |
| 6 | 61184 | 701990 | 483992 | 0.59 | 11 | 7 | 16.66 | 57924 | 25325 | 99090 |
| 7 | 56078 | 607361 | 476403 | 0.56 | 10 | 6 | 14.92 | 53050 | 22929 | 93440 |
| 8 | 53578 | 575287 | 451502 | 0.56 | 10 | 6 | 14.01 | 51636 | 22112 | 96151 |
| 9 | 41365 | 419307 | 381488 | 0.52 | 10 | 6 | 12.66 | 43501 | 20065 | 66119 |
| 10 | 47184 | 509139 | 410134 | 0.55 | 10 | 6 | 15.83 | 49812 | 21916 | 85663 |


A simple method was applied to the 1000 Genomes Project phase 3 data to group SNPs into series across all of the autosomes. Roughly, the criteria used to group the SNPs were:

    Each chromosome expressing the series needed to express 90% of the SNPs in it.
    
    At least 90% of the chromosomes expressing any SNP in the series had to express the series.
    
    The series had to include at least 4 SNPs.
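
The three criteria above can be read as a test applied to a candidate group of SNPs. The sketch below is an illustration only, not the code used to build the data set: the function name `is_series`, its parameters, and the boolean haplotype matrix layout are all assumptions. It checks whether a given candidate group satisfies the criteria; it does not show how candidate groups are found.

```python
import numpy as np

def is_series(hap, min_snps=4, snp_frac=0.9, chrom_frac=0.9):
    """Check a candidate SNP group against the three grouping criteria.

    hap is a hypothetical boolean array of shape (n_chromosomes, n_snps),
    True where a chromosome carries the variant allele of a candidate SNP.
    """
    hap = np.asarray(hap, dtype=bool)
    n_snps = hap.shape[1]
    # Criterion 3: the series must include at least min_snps SNPs.
    if n_snps < min_snps:
        return False
    carried = hap.sum(axis=1)                 # SNPs carried per chromosome
    expresses_any = carried > 0               # carries at least one SNP
    # Criterion 1: a chromosome expresses the series when it carries
    # at least snp_frac of the SNPs in the series.
    expresses_series = carried >= snp_frac * n_snps
    if not expresses_any.any():
        return False
    # Criterion 2: most chromosomes carrying any SNP express the series.
    return bool(expresses_series.sum() >= chrom_frac * expresses_any.sum())

# Ten chromosomes carry all four SNPs; one carries only the first SNP.
full = np.ones((10, 4), dtype=bool)
partial = np.array([[True, False, False, False]])
candidate = np.vstack([full, partial])
accepted = is_series(candidate)
```

Here 10 of the 11 chromosomes that carry any SNP express the series, which is above the 90% threshold, so the candidate is accepted. Adding a second partial carrier would drop the ratio to 10 of 12 and the candidate would be rejected.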
    
The table above shows that over 940,000 series were identified and that more than half of the SNPs were grouped into some series. The average series contained 10 SNPs, but there was substantial variation both in the number of SNPs and in the length of the chromosome covered by a series. The median values for both the number of SNPs and the length are substantially lower than the means because particularly long series with particularly large numbers of SNPs make a large contribution to the averages.
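
The gap between mean and median can be illustrated with a small synthetic example. The SNP counts below are made up for illustration and are not taken from the data set; they are chosen so that a single long series pulls the mean well above the median.

```python
import numpy as np

# Synthetic SNP counts for ten made-up series (not real data):
# nine short series and one very long one.
snp_counts = np.array([4, 4, 5, 5, 6, 6, 7, 8, 9, 46])

mean_count = snp_counts.mean()            # pulled upward by the long series
median_count = np.median(snp_counts)      # unaffected by the long series
trimmed_mean = snp_counts[:-1].mean()     # mean without the long series

print(mean_count, median_count, trimmed_mean)
```

With the long series included, the mean is 10 while the median is 6. Dropping that one series brings the mean down to 6 as well, which shows how a few large series can dominate the average.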

### Chromosomes Expressing Active SNP Series

Statistics were accumulated for the active series at each position in a chromosome where a series starts or ends. An active series is one that starts before the measurement position and ends after it. The plot below for chromosome 2 shows that it is common for almost all of the 5008 sampled chromosomes in the 1000 Genomes Project phase 3 data to express an active series. The label "LCT" marks the region of chromosome 2 where the lactase gene and the series of SNPs associated with lactase persistence are located.
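
The definition of an active series can be sketched as a count over sorted start and end positions. This is a minimal illustration, not the `genomes_dnj` implementation: the function name `active_series_counts` and the example coordinates are assumptions, and it counts active series only, since counting the chromosomes that express them would also need per-sample data.

```python
import numpy as np

def active_series_counts(starts, ends):
    """Count active series at every position where a series starts or ends.

    A series is active at a position when start < position < end.
    Assumes start < end for every series.
    """
    starts = np.sort(np.asarray(starts))
    ends = np.sort(np.asarray(ends))
    positions = np.unique(np.concatenate([starts, ends]))
    # Number of series that started strictly before each position.
    started_before = np.searchsorted(starts, positions, side="left")
    # Number of series that ended at or before each position.
    ended_by = np.searchsorted(ends, positions, side="right")
    return positions, started_before - ended_by

# Three hypothetical series given as (start, end) coordinates.
positions, counts = active_series_counts(
    starts=[100, 200, 300],
    ends=[500, 450, 600],
)
```

For these made-up coordinates the measurement positions are 100, 200, 300, 450, 500 and 600, and the active counts are 0, 1, 2, 2, 1 and 0. A series is not counted at its own start or end position because the definition requires a start before and an end after the position.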

In [3]:
plt0 = cp.chrom2_stats('active_series_allele_count')
show(plt0)

### Active SNP Series

This plot shows the number of active series at each measured location of chromosome 2.

In [4]:
plt1 = cp.chrom2_stats('active_series')
show(plt1)

### SNPs In Active Series

This plot shows the number of SNPs in active series at each measured location. The region associated with lactase persistence has the largest number on chromosome 2.
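
The SNP total at a measured location is the sum of the SNP counts of the series active there. The sketch below assumes hypothetical arrays of series starts, ends and SNP counts; the function name `snps_in_active_series` and all values are made up for illustration and do not come from the data set.

```python
import numpy as np

def snps_in_active_series(starts, ends, snp_counts, position):
    """Sum the SNP counts of all series active at the given position.

    A series is active when start < position < end.
    """
    active = (starts < position) & (ends > position)
    return int(snp_counts[active].sum())

# Three hypothetical series with made-up coordinates and SNP counts.
starts = np.array([100, 200, 300])
ends = np.array([500, 450, 600])
snp_counts = np.array([12, 4, 7])

# Evaluate at every position where a series starts or ends.
positions = np.unique(np.concatenate([starts, ends]))
totals = [snps_in_active_series(starts, ends, snp_counts, p) for p in positions]
peak_position = int(positions[int(np.argmax(totals))])
```

In this toy example the total peaks at position 450, where the 12-SNP and 7-SNP series are both active. The same kind of peak search over the real series would single out the location with the most SNPs in active series.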

In [5]:
plt2 = cp.chrom2_stats('active_series_snp_count')
show(plt2)

### Lactase Region Chromosomes Expressing Active Series

This plot shows the count of chromosomes expressing an active series in a small part of chromosome 2 that includes the region associated with lactase persistence. Note the regular pattern in which this count drops to zero, or at least near zero. The region associated with lactase persistence is exceptionally long, but it does appear to have some internal structure.

In [6]:
plt3 = cp.chrom2_intvl_stats('active_series_allele_count')
show(plt3)

### Lactase Region Active Series

This plot shows the number of active series in the same region of chromosome 2.

In [7]:
plt4 = cp.chrom2_intvl_stats('active_series')
show(plt4)

### Lactase Region SNPs In Active Series

This plot shows the count of SNPs in active series for that same part of chromosome 2.

In [8]:
plt5 = cp.chrom2_intvl_stats('active_series_snp_count')
show(plt5)