## Analysis of population genetic data in Python with scikit-allel

This tutorial provides an introduction to the powerful [scikit-allel](https://scikit-allel.readthedocs.io/en/stable/#) package, which we'll use to manipulate and describe our variant (SNP) data. To begin, we'll import it, along with the widely used numerical library numpy:

In [4]:
import allel
import numpy as np

### Basic data structures and functions

Before we use empirical data, let's get a feel for scikit-allel's GenotypeArray objects using a simple example. Here, we set up a GenotypeArray with two individuals (as columns) and nine loci (as rows). Each integer in this array refers to an allele, where 0 indicates the reference allele, 1 the first alternate allele, 2 the second, etc. Any negative integer indicates missing data.

In [12]:
g = allel.GenotypeArray([[[0, 0], [0, 0]],
                         [[0, 0], [0, 1]],
                         [[0, 0], [1, 1]],
                         [[0, 1], [1, 1]],
                         [[1, 1], [1, 1]],
                         [[0, 0], [1, 2]],
                         [[0, 1], [1, 2]],
                         [[0, 1], [-1, -1]],
                         [[-1, -1], [-1, -1]]])

This is what that object looks like: 

In [13]:
g

| | 0 | 1 |
|---|---|---|
| 0 | 0/0 | 0/0 |
| 1 | 0/0 | 0/1 |
| 2 | 0/0 | 1/1 |
| ... | ... | ... |
| 6 | 0/1 | 1/2 |
| 7 | 0/1 | ./. |
| 8 | ./. | ./. |


Our GenotypeArray object `g` has attributes reporting its number of dimensions, its shape, its number of variants, its ploidy, and its number of samples. For example:

In [14]:
g.ndim

3

In [15]:
g.shape

(9, 2, 2)

In [16]:
g.n_variants

9

In [17]:
g.ploidy

2

In [18]:
g.n_samples

2
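These attributes follow directly from the shape of the underlying array. As a rough numpy-only sketch (the names `calls` and `is_missing` are illustrative and not part of scikit-allel), the same toy data can be held in a three-dimensional integer array whose axes are variants, samples, and allele copies:

```python
import numpy as np

# The same toy genotypes as above, stored in a plain numpy array with
# axes (variants, samples, ploidy). Variable names are illustrative only.
calls = np.array([[[0, 0], [0, 0]],
                  [[0, 0], [0, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [1, 2]],
                  [[0, 1], [1, 2]],
                  [[0, 1], [-1, -1]],
                  [[-1, -1], [-1, -1]]])

# The three axis lengths correspond to n_variants, n_samples and ploidy.
n_variants, n_samples, ploidy = calls.shape

# Treat a genotype call as missing if any of its alleles is negative.
is_missing = np.any(calls < 0, axis=2)

print(n_variants, n_samples, ploidy)  # 9 2 2
print(is_missing.sum(axis=1))         # number of missing calls per variant
```

Reading the shape as (variants, samples, ploidy) is worth keeping in mind, since most of the filtering below operates along the first axis.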

With this object, we can begin to get a feel for scikit-allel's functions for describing diversity and divergence. For example, we can calculate observed heterozygosity:

In [25]:
allel.heterozygosity_observed(g)

array([0. , 0.5, 0. , 0.5, 0. , 0.5, 1. , 1. , nan])
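To see where those numbers come from, here is a minimal numpy sketch of the same calculation, assuming observed heterozygosity is defined as the fraction of non-missing genotype calls that are heterozygous at each locus. It is a by-hand illustration, not scikit-allel's own code, and the variable names are made up for the example:

```python
import numpy as np

# Toy genotypes from above: axes are (variants, samples, ploidy).
calls = np.array([[[0, 0], [0, 0]],
                  [[0, 0], [0, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [1, 2]],
                  [[0, 1], [1, 2]],
                  [[0, 1], [-1, -1]],
                  [[-1, -1], [-1, -1]]])

a1 = calls[:, :, 0]
a2 = calls[:, :, 1]

# A call counts as "called" when both alleles are non-negative,
# and as heterozygous when it is called and the two alleles differ.
called = (a1 >= 0) & (a2 >= 0)
het = called & (a1 != a2)

n_called = called.sum(axis=1)
n_het = het.sum(axis=1)

# Divide only where at least one genotype was called; leave NaN elsewhere.
ho = np.full(len(calls), np.nan)
np.divide(n_het, n_called, out=ho, where=n_called > 0)

print(ho)  # same values as the array returned above
```

Note that the locus with one missing individual is scored on the single called genotype, and the locus with no calls at all comes out as `nan`.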

A natural next step is to see whether these observed values deviate from what we'd expect under Hardy-Weinberg equilibrium. To do so, we first calculate allele frequencies for each locus by chaining the `count_alleles()` and `to_frequencies()` methods:

In [28]:
af = g.count_alleles().to_frequencies()

In [29]:
af

array([[1.  , 0.  , 0.  ],
       [0.75, 0.25, 0.  ],
       [0.5 , 0.5 , 0.  ],
       [0.25, 0.75, 0.  ],
       [0.  , 1.  , 0.  ],
       [0.5 , 0.25, 0.25],
       [0.25, 0.5 , 0.25],
       [0.5 , 0.5 , 0.  ],
       [ nan,  nan,  nan]])

In this array, each row is a locus, and columns 0, 1, and 2 refer to the reference, first alternate, and second alternate alleles. Next, we can use these data to calculate expected heterozygosity:

In [27]:
allel.heterozygosity_expected(af, ploidy=2)

array([0.   , 0.375, 0.5  , 0.375, 0.   , 0.625, 0.625, 0.5  ,   nan])
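The two steps above (allele counts to frequencies, then frequencies to expected heterozygosity) can be sketched by hand in numpy. This assumes the usual diploid formula, one minus the sum of squared allele frequencies; the names `counts`, `freqs`, and `he` are illustrative:

```python
import numpy as np

# Toy genotypes from above: axes are (variants, samples, ploidy).
calls = np.array([[[0, 0], [0, 0]],
                  [[0, 0], [0, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [1, 2]],
                  [[0, 1], [1, 2]],
                  [[0, 1], [-1, -1]],
                  [[-1, -1], [-1, -1]]])

n_alleles = calls.max() + 1                  # alleles 0, 1 and 2
flat = calls.reshape(len(calls), -1)         # (variants, samples * ploidy)

# Count each allele per locus; negative (missing) alleles match nothing.
counts = np.stack([(flat == a).sum(axis=1) for a in range(n_alleles)], axis=1)

# Frequencies are counts divided by the number of called alleles.
# A locus with no calls gives 0/0, which we let become NaN.
totals = counts.sum(axis=1, keepdims=True)
with np.errstate(invalid="ignore", divide="ignore"):
    freqs = counts / totals

# Expected heterozygosity for diploids: 1 - sum(p_i ** 2).
he = 1 - np.sum(freqs ** 2, axis=1)

print(freqs)
print(he)
```

For example, at the sixth locus the frequencies are 0.5, 0.25, and 0.25, giving 1 - (0.25 + 0.0625 + 0.0625) = 0.625.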

Looks different to me! (With only two individuals, of course, this comparison is purely illustrative.)
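One common way to put a number on that difference is the inbreeding coefficient, F = 1 - Ho / He, which is 0 when observed and expected heterozygosity agree, positive with a deficit of heterozygotes, and negative with an excess. The sketch below simply plugs in the two arrays printed above; the variable names are illustrative, and F is left undefined (`nan`) wherever expected heterozygosity is zero or missing:

```python
import numpy as np

# Observed and expected heterozygosity, copied from the outputs above.
ho = np.array([0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 1.0, 1.0, np.nan])
he = np.array([0.0, 0.375, 0.5, 0.375, 0.0, 0.625, 0.625, 0.5, np.nan])

# Ratio Ho / He, computed only where He is positive; NaN elsewhere.
ratio = np.full_like(he, np.nan)
np.divide(ho, he, out=ratio, where=he > 0)

# F = 1 - Ho / He per locus.
f = 1 - ratio

for locus, value in enumerate(f):
    print(locus, value)
```

With a sample of two, these values swing wildly from locus to locus, which is exactly why such statistics are normally computed on many individuals.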

## Loading and examining variant data

Next up, we're going to pivot to looking at empirical data: specifically, whole-genome sequencing (WGS) data. These data are in Variant Call Format (VCF), and can be found as `syma.vcf.gz` in the tutorial folder.

To begin, we'll use scikit-allel to read the .vcf into a Python dictionary of `numpy` arrays. Here, it's important to pass the `'*'` wildcard to the `fields` argument, in order to extract all available fields from the file.

In [20]:
s = allel.read_vcf("syma.vcf.gz", fields='*')

The dictionary `s` has a set of keys representing aspects of the .vcf file. We can list these with the `.keys()` method, sorting by name:

In [19]:
sorted(s.keys())

['calldata/AD',
 'calldata/DP',
 'calldata/GQ',
 'calldata/GT',
 'calldata/PL',
 'samples',
 'variants/AC',
 'variants/AF',
 'variants/ALT',
 'variants/AN',
 'variants/BaseQRankSum',
 'variants/CHROM',
 'variants/DP',
 'variants/DS',
 'variants/Dels',
 'variants/ExcessHet',
 'variants/FILTER_LowQual',
 'variants/FILTER_PASS',
 'variants/FS',
 'variants/HaplotypeScore',
 'variants/ID',
 'variants/InbreedingCoeff',
 'variants/MLEAC',
 'variants/MLEAF',
 'variants/MQ',
 'variants/MQ0',
 'variants/MQRankSum',
 'variants/POS',
 'variants/QD',
 'variants/QUAL',
 'variants/REF',
 'variants/RPA',
 'variants/RU',
 'variants/ReadPosRankSum',
 'variants/SOR',
 'variants/STR',
 'variants/altlen',
 'variants/is_snp',
 'variants/numalt']
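The key names follow a `group/field` pattern, which a few lines of plain Python can make explicit. The sketch below uses a hand-copied subset of the keys listed above rather than the real dictionary, so it runs without the .vcf file:

```python
from collections import defaultdict

# A hand-copied subset of the keys listed above (not the full set).
keys = ['calldata/DP', 'calldata/GT', 'samples',
        'variants/CHROM', 'variants/DP', 'variants/POS', 'variants/QUAL']

# Split each key on the first "/" and collect field names by prefix.
groups = defaultdict(list)
for key in keys:
    prefix, _, field = key.partition('/')
    # "samples" has no "/" and therefore no field name; store the key itself.
    groups[prefix].append(field or key)

for prefix, fields in groups.items():
    print(prefix, fields)
```

Notice that `DP` appears under both prefixes: the same field name can exist once per site and once per sample.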

Here, anything beginning with "calldata" holds per-sample data from the genotype (FORMAT) columns of the .vcf, while "variants" refers to data associated with each SNP (or site, if you have invariant sites in your file), including the contents of the INFO field. (Some of this information, e.g. depth, is redundant, depending on your SNP caller.) To examine any of these in greater detail, we can select the relevant key. For example, to look at the position of these SNPs (relative to the start of their reference scaffold), we can select the `variants/POS` key:

In [160]:
s['variants/POS']

array([50757, 50772, 50774, ...,   696,   710,   714], dtype=int32)

Or a list of all sample IDs:

In [161]:
s['samples']

array(['EL10_toro', 'EL11_toro', 'EL13_toro', 'EL18_mega', 'EL19_mega',
       'EL1_mega', 'EL20_mega', 'EL21_toro', 'EL23_mega', 'EL24_mega',
       'EL27_mega', 'EL29_ochr', 'EL32_toro', 'EL39_toro', 'EL40_toro',
       'EL41_toro', 'EL42_toro', 'EL43_toro', 'EL44_toro', 'EL45_toro',
       'EL46_toro', 'EL47_toro', 'EL48_ochr', 'EL49_toro', 'EL4_mega',
       'EL51_toro', 'EL52_toro', 'EL53_toro', 'EL54_toro', 'EL55_toro',
       'EL59_toro', 'EL60_toro', 'EL6_mega', 'EL8_toro', 'EL9_toro'],
      dtype=object)

Besides simply *looking* at these data, we can use these keys to filter our .vcf based on certain criteria. We'll work up to dropping all sites below a certain quality score. First, though, how many SNPs do we actually have? We can evaluate this by taking the first entry from the `shape` attribute of the genotype array:

In [162]:
s['calldata/GT'].shape[0]

87923

Let's practice by selecting a single scaffold from the .vcf file. First, we'll pull out the scaffold name recorded for every site in the .vcf:

In [163]:
scaffolds = s['variants/CHROM']
scaffolds

array(['scaffold_0', 'scaffold_0', 'scaffold_0', ..., 'scaffold_3597',
       'scaffold_3597', 'scaffold_3597'], dtype=object)

Next, we'll generate a boolean array, indicating whether each site in the .vcf is found on scaffold_0 or not:

In [164]:
scaffold_0 = (scaffolds == 'scaffold_0')
scaffold_0

array([ True,  True,  True, ..., False, False, False])

As confirmation, we can check that the length of this array is equivalent to the number of sites in the .vcf:

In [165]:
len(scaffold_0)

87923
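The same idea on a made-up example: comparing an array of scaffold names against a single name yields one boolean per site, so the mask is always as long as the number of sites, while the number of `True` entries tells us how many sites would be kept. The scaffold names and variable names below are hypothetical:

```python
import numpy as np

# Hypothetical scaffold name for each of five sites.
chrom = np.array(['scaffold_0', 'scaffold_0', 'scaffold_1',
                  'scaffold_0', 'scaffold_2'])

# Element-wise comparison gives one boolean per site.
on_scaffold_0 = chrom == 'scaffold_0'

n_sites = len(on_scaffold_0)               # always equals the number of sites
n_kept = np.count_nonzero(on_scaffold_0)   # sites that pass the mask

print(on_scaffold_0)
print(n_sites, n_kept)  # 5 3
```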

Next, let's take the actual genotype calls from the vcf-turned-numpy-array and turn it into the GenotypeArray object we're familiar with:

In [166]:
gt = s['calldata/GT']
gt = allel.GenotypeArray(gt)
gt

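| | 0 | 1 | 2 | 3 | 4 | ... | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 1/1 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| 2 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87920 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 87921 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/1 | 0/0 | 0/0 | 0/0 |
| 87922 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/1 | 0/0 | 0/0 | 0/0 | 0/0 |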


We can then subset this using the GenotypeArray's `compress()` method (which mirrors the numpy function of the same name), with the `scaffold_0` object as our boolean mask, applied to the first (or 0th, thanks to Python's zero-based indexing) axis of the data, meaning the rows / sites:

In [168]:
gt_scaffold_0 = gt.compress(scaffold_0, axis=0)
gt_scaffold_0

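| | 0 | 1 | 2 | 3 | 4 | ... | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 1/1 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| 2 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 902 | 1/1 | 0/0 | 0/0 | 0/1 | 0/1 | ... | 0/0 | 0/0 | ./. | 0/0 | 0/0 |
| 903 | 1/1 | 0/0 | 0/0 | 0/0 | 0/1 | ... | 0/0 | 0/1 | ./. | 0/0 | 0/0 |
| 904 | 1/1 | 0/1 | 1/1 | 0/1 | 0/0 | ... | 0/0 | 0/0 | ./. | 0/1 | 0/0 |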


Only 905 sites now! Much more manageable. Next, let's filter sites based on quality scores. We'll first check the minimum quality score in our data:

In [171]:
q_scores = s['variants/QUAL']
min(q_scores)

30.03

While the minimum Q-score looks good here (because—full disclosure—this .vcf has been filtered before), we might as well boost this to 60 and see what happens to our total number of SNPs:

In [184]:
pass_Q = (q_scores > 60)

In [186]:
gt_q60 = gt.compress(pass_Q, axis=0)
gt_q60

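| | 0 | 1 | 2 | 3 | 4 | ... | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 1/1 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| 2 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83302 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 83303 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/1 | 0/0 | 0/0 | 0/0 |
| 83304 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/1 | 0/0 | 0/0 | 0/0 | 0/0 |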
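For intuition about what `compress()` with `axis=0` is doing, here is a small numpy sketch on invented data. It assumes the behaviour matches `numpy.compress`, which keeps the rows where the mask is `True` and is equivalent to plain boolean indexing on the first axis. The genotypes and quality scores are hypothetical:

```python
import numpy as np

# Hypothetical genotypes for 4 sites x 2 samples x 2 alleles.
calls = np.array([[[0, 0], [0, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [0, 0]]])

# Hypothetical quality score for each of the 4 sites.
qual = np.array([35.2, 75.0, 60.0, 120.5])

# Strictly greater than 60, so the site scoring exactly 60.0 is dropped.
keep = qual > 60

# Keep only the rows (sites) where the mask is True.
filtered = np.compress(keep, calls, axis=0)

# Boolean indexing on the first axis gives the same result.
same = calls[keep]

print(keep)
print(filtered.shape)  # (2, 2, 2)
```

Only the first axis shrinks; the sample and ploidy axes are untouched, which is why the filtered array still has every individual in it.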


That's all great, but we applied the quality filter to the original, unfiltered genotype array there, meaning we didn't apply both filters simultaneously. To do that, we can use numpy's `logical_and` function to find the sites that pass both filters:

In [199]:
passing = np.logical_and(pass_Q, scaffold_0)

In [202]:
final = gt.compress(passing, axis=0)

In [203]:
final

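| | 0 | 1 | 2 | 3 | 4 | ... | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 1/1 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| 2 | 0/1 | 0/1 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 847 | 1/1 | 0/0 | 0/0 | 0/1 | 0/1 | ... | 0/0 | 0/0 | ./. | 0/0 | 0/0 |
| 848 | 1/1 | 0/0 | 0/0 | 0/0 | 0/1 | ... | 0/0 | 0/1 | ./. | 0/0 | 0/0 |
| 849 | 1/1 | 0/1 | 1/1 | 0/1 | 0/0 | ... | 0/0 | 0/0 | ./. | 0/1 | 0/0 |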


As you can see, we've further reduced the number of sites on scaffold_0 (from 905 to 850) based on our quality filter threshold.
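A toy version of combining two masks, using invented scaffold names and quality scores, shows how the counts relate. For boolean arrays, `np.logical_and(a, b)` and the `&` operator give the same result, so either spelling works:

```python
import numpy as np

# Hypothetical per-site scaffold names and quality scores.
chrom = np.array(['scaffold_0', 'scaffold_0', 'scaffold_0',
                  'scaffold_1', 'scaffold_1'])
qual = np.array([30.03, 88.1, 61.5, 45.0, 99.9])

on_scaffold_0 = chrom == 'scaffold_0'
pass_q = qual > 60

# A site is kept only if it passes both filters.
both = np.logical_and(on_scaffold_0, pass_q)

# The & operator is an equivalent spelling for boolean arrays.
both_alt = on_scaffold_0 & pass_q

print(on_scaffold_0.sum(), pass_q.sum(), both.sum())  # 3 3 2
```

The combined mask can never keep more sites than either filter does on its own.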

### Principal Component Analysis

Popularized for population genetic data by Patterson, Price, and Reich in 2006, PCA has quickly become a ubiquitous tool for exploring genomic data. scikit-allel implements a fast version of this with its `allel.pca()` function.
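To give a sense of what goes on under the hood, here is a minimal numpy sketch of PCA on genotype data. It is an illustration and not scikit-allel's implementation. It assumes the input is a matrix of alternate allele counts (0, 1, or 2 per diploid genotype) with variants as rows and samples as columns, that there are no missing calls, and that each variant is centered and scaled by sqrt(p(1 - p)), the scaling described by Patterson et al. The data and variable names are hypothetical:

```python
import numpy as np

# Hypothetical alternate-allele counts: 5 variants (rows) x 6 samples
# (columns). Samples 0-2 and 3-5 are meant to be two distinct groups.
gn = np.array([[0, 0, 0, 2, 2, 2],
               [0, 1, 0, 2, 1, 2],
               [2, 2, 1, 0, 0, 0],
               [1, 0, 1, 1, 2, 1],
               [0, 0, 1, 2, 2, 1]], dtype=float)

# Per-variant mean count and alternate allele frequency (diploid).
mean = gn.mean(axis=1, keepdims=True)
p = mean / 2

# Drop invariant sites, which carry no information and would
# cause a division by zero in the scaling step.
variable = ((p > 0) & (p < 1)).ravel()
gn, mean, p = gn[variable], mean[variable], p[variable]

# Center each variant and scale by sqrt(p * (1 - p)).
scaled = (gn - mean) / np.sqrt(p * (1 - p))

# PCA via SVD, with samples as rows of the decomposed matrix.
u, s, vt = np.linalg.svd(scaled.T, full_matrices=False)
coords = u * s                        # sample coordinates on each PC
explained = s ** 2 / np.sum(s ** 2)   # proportion of variance per PC

pc1 = coords[:, 0]
print(pc1)
print(explained)
```

In this toy data set the two groups of samples fall on opposite sides of the first principal component, which is the kind of structure PCA is typically used to reveal.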