# Gazave et al. 2014 neutral region data

D. P. Rice

3/14/17

# Introduction

This is an initial exploration of the data from Gazave et al. 2014, which sequenced ~500 European individuals to high coverage at 15 putatively neutral loci. The data consist of:
1. __The target region definitions.__ I extracted these from Table S2 using OCR.
2. __The SNVs.__ I downloaded a VCF from Alon Keinan's website and extracted the relevant allele-count fields into a simple text file using `bcftools`.

The goal of this document is to perform the following quality checks:
1. The numbers in the target regions should match those in Table S2 (checking for OCR errors).
2. Regions should make sense: target length = end - start, covered bases <= length.
3. Region descriptive statistics should match main text and SI claims.
4. SNV positions should be within target regions.
5. Descriptive statistics of SNVs should match claims in the paper.
6. Site frequency spectrum should match Fig. 2.
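
Check 4 could be sketched as follows. This is a minimal illustration, not the code used for the analysis: the helper `snv_in_targets` and the example SNV positions are hypothetical, and treating both interval ends as inclusive is an assumption that would need checking against the VCF coordinates.

```python
def snv_in_targets(chrom, pos, regions):
    """Return True if (chrom, pos) lies inside any (chr, start, end) region.

    Assumes both interval ends are inclusive; adjust if they are half-open.
    """
    return any(c == chrom and start <= pos <= end for c, start, end in regions)

# The two chromosome 1 targets from the targets table: (chr, start, end)
regions = [(1, 237146097, 237165997), (1, 237360801, 237380801)]

# Hypothetical SNV positions, for illustration only
snvs = [(1, 237150000), (1, 237200000), (4, 237150000)]
flags = [snv_in_targets(c, p, regions) for c, p in snvs]
print(flags)  # [True, False, False]
```

Any SNV flagged `False` would point to either a coordinate mismatch or an OCR error in the region boundaries.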

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Target regions

In [8]:
targets = pd.read_table('../data/GazaveEtal/gazave_targets.txt', sep=r'\s+', thousands=',')
targets.head()

Unnamed: 0,chr,start,end,length,n20x_1_indiv,n20x_450_indiv,nSNVs
0,1,237146097,237165997,19900,19902,19764,180
1,1,237360801,237380801,20000,20004,19539,197
2,4,164000000,164020000,20000,17962,16117,96
3,4,164115673,164135673,20000,19768,19443,119
4,6,165342001,165355001,13000,11098,10659,86


In [9]:
targets.describe()

Unnamed: 0,chr,start,end,length,n20x_1_indiv,n20x_450_indiv,nSNVs
count,15.0,15.0,15.0,15.0,15.0,15.0,15.0
mean,6.8,136485100.0,136499500.0,14416.0,13667.666667,13065.866667,122.266667
std,3.233751,58546360.0,58547730.0,5279.405811,4936.985411,4731.881019,35.531609
min,1.0,49565730.0,49579730.0,5340.0,5285.0,5066.0,79.0
25%,5.0,83298060.0,83312060.0,11500.0,11051.0,10831.5,99.5
50%,7.0,133266100.0,133271400.0,14000.0,12577.0,12002.0,117.0
75%,10.0,165377700.0,165390200.0,20000.0,18865.0,17197.5,128.5
max,10.0,237360800.0,237380800.0,20000.0,20004.0,19764.0,197.0


## Target lengths and locations

The easiest way to check for OCR errors is to compare the column sums against the *Total* row of Table S2, which gives:
- length: 216,240
- 20x in 1 indiv: 205,015
- 20x in 450 indivs: 195,988
- number of SNVs: 1,834

In [11]:
print('length:', targets.length.sum())
print('1 indiv:', targets.n20x_1_indiv.sum())
print('450 ind:', targets.n20x_450_indiv.sum())
print('n snvs:', targets.nSNVs.sum())

length: 216240
1 indiv: 205015
450 ind: 195988
n snvs: 1834


This checks out. Compensating OCR errors within a column would leave the sums unchanged, but that seems unlikely.

Now, compare length to end - start:

In [12]:
targets.end - targets.start == targets.length

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
dtype: bool

This also checks out.

The SI states that target lengths range from 5,340 bp to 20,000 bp.

In [22]:
(targets.length.min(), targets.length.max())

(5340, 20000)

Checks out.

The SI states that the targets were selected in pairs and triplets so that they are in partial LD. It says:
>duplets or triplets cover physical distance of 195 kb on average (range: 93 kb and 430 kb)

We can try to identify the duplets and triplets by eye and check this.

In [26]:
print(targets[['chr', 'start', 'end']])

    chr      start        end
0     1  237146097  237165997
1     1  237360801  237380801
2     4  164000000  164020000
3     4  164115673  164135673
4     6  165342001  165355001
5     6  165413304  165425304
6     6  165470500  165482500
7     7   49565734   49579734
8     7   49645200   49659200
9    10   82981023   83001023
10   10   83207715   83215715
11   10   83388407   83408407
12   10  133143109  133150109
13   10  133231083  133242083
14   10  133266053  133271393


These fall into 6 clear sets: Chr1, Chr4, Chr6, Chr7, Chr10 around 83 Mb, and Chr10 around 133 Mb.

In [42]:
group_start = [0,2,4,7,9,12]
group_end = [1,3,6,8,11,14]
group_lengths = np.array(targets.end[group_end]) - np.array(targets.start[group_start])
print(group_lengths)
print('Min:', min(group_lengths))
print('Max:', max(group_lengths))
print('Avg:', np.mean(group_lengths))

[234704 135673 140499  93466 427384 128284]
Min: 93466
Max: 427384
Avg: 193335.0


The minimum matches the SI (93 kb), but the maximum (427 kb vs. 430 kb) and the average (193 kb vs. 195 kb) are both a little low. **Maybe should double check this.**
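
The by-eye grouping can be cross-checked programmatically. The sketch below assumes that targets in the same duplet or triplet lie on the same chromosome and are separated by at most 1 Mb. The helper `group_targets` and the `max_gap` threshold are illustrative choices, not taken from the paper.

```python
# (chr, start, end) for the 15 targets, copied from the printout above
target_list = [
    (1, 237146097, 237165997), (1, 237360801, 237380801),
    (4, 164000000, 164020000), (4, 164115673, 164135673),
    (6, 165342001, 165355001), (6, 165413304, 165425304),
    (6, 165470500, 165482500),
    (7, 49565734, 49579734), (7, 49645200, 49659200),
    (10, 82981023, 83001023), (10, 83207715, 83215715),
    (10, 83388407, 83408407),
    (10, 133143109, 133150109), (10, 133231083, 133242083),
    (10, 133266053, 133271393),
]

def group_targets(regions, max_gap=1_000_000):
    """Cluster regions on the same chromosome whose gap is <= max_gap (assumed)."""
    groups = []
    for chrom, start, end in sorted(regions):
        prev = groups[-1][-1] if groups else None
        if prev and prev[0] == chrom and start - prev[2] <= max_gap:
            groups[-1].append((chrom, start, end))
        else:
            groups.append([(chrom, start, end)])
    return groups

groups = group_targets(target_list)
sizes = [len(g) for g in groups]
# Span of each group: end of last target minus start of first target
spans = [g[-1][2] - g[0][1] for g in groups]
print(sizes)  # [2, 2, 3, 2, 3, 3]
print(spans)  # [234704, 135673, 140499, 93466, 427384, 128284]
```

Under this assumption the automatic grouping reproduces the six sets and the spans computed above, so the small discrepancy with the SI does not come from the choice of groups.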

## Coverage

For each target, the length should be at least as large as the number of bases with 20x coverage in one sample, which in turn should be at least as large as the number of bases with 20x coverage in 450 samples.

In [13]:
targets.length >= targets.n20x_1_indiv

0     False
1     False
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9      True
10    False
11     True
12    False
13    False
14     True
dtype: bool

In [14]:
targets.n20x_1_indiv >= targets.n20x_450_indiv

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
dtype: bool

In [17]:
targets.length >= targets.n20x_450_indiv

0      True
1      True
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13    False
14     True
dtype: bool

The counts for 20x coverage in at least 450 individuals never exceed the bp covered at 20x in at least 1 individual, as they should. But in several targets there are more bases covered than the target length. Presumably, this is some sort of overhang. **I should investigate and understand this.**

In [18]:
targets.length - targets.n20x_1_indiv

0       -2
1       -4
2     2038
3      232
4     1902
5       -2
6        0
7     1423
8        0
9      128
10      -4
11    5466
12      -3
13      -4
14      55
dtype: int64

In [20]:
targets.length - targets.n20x_450_indiv

0      136
1      461
2     3883
3      557
4     2341
5       -2
6        0
7     2817
8      221
9     1722
10     622
11    7227
12      -3
13      -4
14     274
dtype: int64

When considering only sites with 20x coverage in 450 individuals, there are never more than 4 extra bases, so we might be ok neglecting this.
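
To put a number on how small this is, the excess can be summed and compared with the total covered length. This is a small sketch using the differences printed above and the Table S2 total; the variable names are illustrative.

```python
# length - n20x_450_indiv for the 15 targets, copied from the output above
diff_450 = [136, 461, 3883, 557, 2341, -2, 0, 2817, 221, 1722,
            622, 7227, -3, -4, 274]

# Bases covered beyond the nominal target length (negative differences)
excess = [-d for d in diff_450 if d < 0]
total_excess = sum(excess)

# Total bases with 20x coverage in 450 individuals (Table S2 total)
total_covered = 195988
print(total_excess)                  # 9
print(total_excess / total_covered)  # fraction of covered bases in excess
```

Only 9 of the 195,988 covered bases fall outside the nominal target lengths.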

SNVs:
- 1,834 high quality SNVs
- genotyped in 450 individuals for 95% of SNVs (these are the ones used in the SFS figures.)
- downsampled to 900 random chromosomes for figures
- proportion of singletons = 38.4%
- Ti/Tv ratio = 2.22
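
The last two quantities could be recomputed along the following lines. This sketch assumes biallelic SNVs with single-base reference and alternate alleles, and takes a singleton to be a site with allele count 1. The function names and the toy inputs are hypothetical and are not the paper's data.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(variants):
    """Ratio of transitions to transversions for (ref, alt) pairs.

    Raises ZeroDivisionError if there are no transversions.
    """
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in variants)
    tv = len(variants) - ti
    return ti / tv

def singleton_fraction(allele_counts):
    """Fraction of sites whose allele count is exactly 1."""
    return sum(ac == 1 for ac in allele_counts) / len(allele_counts)

# Toy inputs, for illustration only
toy_variants = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
toy_counts = [1, 1, 2, 5, 1, 10, 3, 1]

print(ti_tv_ratio(toy_variants))       # 1.5
print(singleton_fraction(toy_counts))  # 0.5
```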

SFS figures:
- downsample to 900 chromosomes (hypergeometric?); remove SNPs with fewer called genotypes
- SFS bins: (1, 2, 3, 4, 5, 6-10, 11-20, 21-50, 51-100, 101-200, 201-450); normalize counts by number of categories
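
A possible implementation of these two steps is sketched below. Hypergeometric projection is one standard way to downsample allele counts, but whether the paper used it is not confirmed here. Reading "normalize by number of categories" as dividing each bin total by the number of count classes it contains is also an assumption. The bins stop at 450, which suggests minor-allele counts, but that too is a guess. The function names are illustrative.

```python
from math import comb

def project_count(i, n, m):
    """P(j copies in a subsample of m | i copies among n), for j = 0..m.

    Hypergeometric sampling without replacement.
    """
    denom = comb(n, m)
    return [comb(i, j) * comb(n - i, m - j) / denom for j in range(m + 1)]

# Inclusive (low, high) bounds of the SFS bins listed above
BINS = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 10), (11, 20),
        (21, 50), (51, 100), (101, 200), (201, 450)]

def binned_sfs(sfs):
    """Per-bin total divided by the number of count classes in the bin.

    sfs[j] is the (expected) number of sites with allele count j.
    """
    out = []
    for lo, hi in BINS:
        total = sum(sfs[j] for j in range(lo, min(hi, len(sfs) - 1) + 1))
        out.append(total / (hi - lo + 1))
    return out

# A doubleton in 4 chromosomes, subsampled to 2 chromosomes
probs = project_count(2, 4, 2)
print(probs)

# Toy flat spectrum: 10 sites in every count class from 1 to 450
flat = binned_sfs([0] + [10] * 450)
print(flat)
```

With the flat toy spectrum every normalized bin has the same height, which is the point of dividing by bin width: bins of different widths become directly comparable.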