# Introduction to `scikit-allel`

The following Jupyter Notebook provides `Python` code to perform simple analyses of genome-wide single nucleotide polymorphism (SNP) data using the powerful `scikit-allel` package. As part of your final project in Conservation Genetics, you will use functions from this package to analyze empirical data; today, we will work with toy datasets to learn about the structure of data objects and how functions interact with them. To begin, you'll again need to copy this notebook and save an editable version to your own student directory. I also recommend spending some time reviewing the `scikit-allel` [documentation](https://scikit-allel.readthedocs.io/en/stable/), paying particular attention to what functions look like and how they operate.

**QUESTION 1**: Pick a function from the ["Statistics and plotting"](https://scikit-allel.readthedocs.io/en/stable/stats.html) section of the `scikit-allel` docs. What are its parameters, and what do they mean? Add a `Markdown` or `Raw` chunk below to write your answer.

We will first load `scikit-allel` and a few other packages:

In [20]:
import os
import allel
import numpy as np
import pandas as pd
np.set_printoptions(legacy='1.25')

## Basic data structures and functions


Let's get a feel for scikit-allel's `GenotypeArray` objects using a simple example. Here, we set up a `GenotypeArray` with two individuals (as columns) and nine loci (as rows). Each integer in this array refers to an allele, where 0 indicates the reference allele, 1 the first alternate allele, 2 the second, etc. Any negative integer indicates missing data. Notice that we are again using `.` after the package name (`allel`) to specify where a function comes from. This time, however, we are creating an instance of something called a `class`, which assigns a particular set of variables (technically "attributes") and associated functions (technically "methods") to the object at hand:

In [33]:
g = allel.GenotypeArray([[[0, 0], [0, 0]],
                         [[0, 0], [0, 1]],
                         [[0, 0], [1, 1]],
                         [[0, 1], [1, 1]],
                         [[1, 1], [1, 1]],
                         [[0, 0], [1, 1]],
                         [[0, 1], [1, 0]],
                         [[0, 1], [-1, -1]],
                         [[-1, -1], [-1, -1]]])
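To make the encoding concrete, here is a minimal sketch that stores the same calls in a plain NumPy array rather than a `GenotypeArray`. The names `calls`, `is_missing`, and `n_missing_per_sample` are introduced here for illustration only; the point is simply that the three axes are (variant, sample, allele copy) and that negative integers mark missing data.

```python
import numpy as np

# The same toy calls as above, held in a plain NumPy array
# (an illustration, not a scikit-allel object).
calls = np.array([[[0, 0], [0, 0]],
                  [[0, 0], [0, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 0]],
                  [[0, 1], [-1, -1]],
                  [[-1, -1], [-1, -1]]])

# Axis order is (variant, sample, allele copy), so this is the
# genotype of the second sample at the second variant.
one_call = calls[1, 1]

# Treat a call as missing if any of its allele codes is negative.
is_missing = (calls < 0).any(axis=2)

# Number of missing calls for each sample (column).
n_missing_per_sample = is_missing.sum(axis=0)
print(one_call, n_missing_per_sample)
```

The first sample is missing one call and the second sample is missing two, which matches the `-1` entries in the last two rows.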

This is what that object looks like:

In [34]:
g

|   | 0 | 1 |
|---|---|---|
| 0 | 0/0 | 0/0 |
| 1 | 0/0 | 0/1 |
| 2 | 0/0 | 1/1 |
| ... | ... | ... |
| 6 | 0/1 | 1/0 |
| 7 | 0/1 | ./. |
| 8 | ./. | ./. |


Our `GenotypeArray` object `g` has attributes reflecting its number of dimensions, its shape, its number of variants, its ploidy, and its sample size. For example:

In [35]:
g.ndim

3

In [36]:
g.shape

(9, 2, 2)

In [37]:
g.n_variants

9

In [38]:
g.ploidy

2

In [39]:
g.n_samples

2
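These attributes line up with the three axes of the underlying array. As a rough sketch (using a placeholder NumPy array called `placeholder`, a name invented here, with the same shape as `g`), the shape can be unpacked into the same three numbers:

```python
import numpy as np

# Placeholder array with the same shape as g; the values do not matter here.
placeholder = np.zeros((9, 2, 2), dtype=int)

# The axes are ordered (variants, samples, ploidy).
n_variants, n_samples, ploidy = placeholder.shape
print(placeholder.ndim, n_variants, n_samples, ploidy)
```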

With this object, we can begin to get a feel for the `scikit-allel` functions for describing diversity and divergence. For example, we can calculate observed heterozygosity, generating an array with one value per locus:

In [40]:
het_obs = allel.heterozygosity_observed(g)
het_obs

array([0. , 0.5, 0. , 0.5, 0. , 0. , 1. , 1. , nan])
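To see where these numbers come from, the sketch below recomputes them by hand with NumPy: at each locus, observed heterozygosity is the number of heterozygous calls divided by the number of non-missing calls. This is an illustration under that assumption, not scikit-allel's own implementation, and the variable names are invented here.

```python
import numpy as np

# Same toy calls as in g, as a plain NumPy array.
calls = np.array([[[0, 0], [0, 0]],
                  [[0, 0], [0, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 0]],
                  [[0, 1], [-1, -1]],
                  [[-1, -1], [-1, -1]]])

# A call counts as "called" only if both allele codes are non-negative.
called = (calls >= 0).all(axis=2)

# A called diploid genotype is heterozygous if its two alleles differ.
is_het = called & (calls[:, :, 0] != calls[:, :, 1])

n_called = called.sum(axis=1)
n_het = is_het.sum(axis=1)

# Loci with no called genotypes give 0/0, which becomes nan.
with np.errstate(divide="ignore", invalid="ignore"):
    het_obs_by_hand = n_het / n_called
print(het_obs_by_hand)
```

Note that the second-to-last locus has a value of 1 because its only called genotype is heterozygous, and the last locus is `nan` because neither sample has a call.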

Using the `nanmean()` function in `numpy` to ignore `NaN` values (here, the last locus, where both genotypes are missing), we can calculate the genome-wide average:

In [41]:
np.nanmean(het_obs)

np.float64(0.375)
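The difference between an ordinary mean and a `NaN`-aware mean is easy to check directly. The short sketch below retypes the values printed above into a new array (so it runs on its own) and compares `np.mean` with `np.nanmean`:

```python
import numpy as np

# Per-locus observed heterozygosity, copied from the output above.
het_obs_values = np.array([0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 1.0, 1.0, np.nan])

# An ordinary mean is "poisoned" by a single nan.
plain_mean = np.mean(het_obs_values)

# nanmean drops the nan and averages the remaining loci.
nan_aware_mean = np.nanmean(het_obs_values)

# Number of loci that actually contribute to the average.
n_used = np.count_nonzero(~np.isnan(het_obs_values))
print(plain_mean, nan_aware_mean, n_used)
```

Keep in mind that the average is taken over the eight loci with data, not over all nine.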

A natural next step is to see whether these observed heterozygosities deviate from what we would expect under Hardy-Weinberg equilibrium. To do so, we first calculate allele frequencies for each locus by chaining two methods, `count_alleles()` and `to_frequencies()`. (Note that this is an example of an object having a particular method: `count_alleles()` belongs to the `GenotypeArray` class, and `to_frequencies()` belongs to the allele counts object that it returns.)

In [42]:
af = g.count_alleles().to_frequencies()
af

array([[1.  , 0.  ],
       [0.75, 0.25],
       [0.5 , 0.5 ],
       [0.25, 0.75],
       [0.  , 1.  ],
       [0.5 , 0.5 ],
       [0.5 , 0.5 ],
       [0.5 , 0.5 ],
       [ nan,  nan]])
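As a cross-check, the same frequencies can be reproduced by counting alleles by hand. The sketch below assumes only two alleles (coded 0 and 1) and uses names invented for this example; it is meant to show the arithmetic, not to mirror how scikit-allel does it internally.

```python
import numpy as np

# Same toy calls as in g, as a plain NumPy array.
calls = np.array([[[0, 0], [0, 0]],
                  [[0, 0], [0, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 1]],
                  [[1, 1], [1, 1]],
                  [[0, 0], [1, 1]],
                  [[0, 1], [1, 0]],
                  [[0, 1], [-1, -1]],
                  [[-1, -1], [-1, -1]]])

# Count reference (0) and alternate (1) alleles at each locus,
# summing over samples and allele copies. Missing (-1) is ignored.
n_ref = (calls == 0).sum(axis=(1, 2))
n_alt = (calls == 1).sum(axis=(1, 2))
counts = np.stack([n_ref, n_alt], axis=1)

# Divide by the number of called alleles at each locus.
totals = counts.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    af_by_hand = counts / totals
print(af_by_hand)
```

The frequencies at the second-to-last locus are based on only two called alleles, so they are much less reliable than the others even though they look the same as 0.5/0.5 elsewhere.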

In this array, each row is a locus, and columns 0 and 1 refer to the reference and alternate alleles, respectively (though note that there can be more columns if the locus is not diallelic). Next, we can use these data to calculate expected heterozygosity:

In [46]:
allel.heterozygosity_expected(af, ploidy=2)

array([0.   , 0.375, 0.5  , 0.375, 0.   , 0.5  , 0.5  , 0.5  ,   nan])
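For a diploid locus, expected heterozygosity under Hardy-Weinberg proportions is $H_e = 1 - \sum_i p_i^2$, which reduces to $2pq$ when there are only two alleles. The sketch below applies that formula to the frequencies printed above (retyped into a new array named `af_values` for this example) and should reproduce the same numbers:

```python
import numpy as np

# Allele frequencies copied from the output above (rows = loci).
af_values = np.array([[1.0, 0.0],
                      [0.75, 0.25],
                      [0.5, 0.5],
                      [0.25, 0.75],
                      [0.0, 1.0],
                      [0.5, 0.5],
                      [0.5, 0.5],
                      [0.5, 0.5],
                      [np.nan, np.nan]])

# H_e = 1 - sum of squared allele frequencies (diploid case).
het_exp_by_hand = 1.0 - np.sum(af_values ** 2, axis=1)

# For two alleles this is the same as 2pq.
two_pq = 2 * af_values[:, 0] * af_values[:, 1]
print(het_exp_by_hand)
print(two_pq)
```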

Looks different to me!

**QUESTION 2**: What is the length of object `af`? Add a chunk of code below to answer this question.

We can select a single site using brackets. Because `Python` starts counting at 0, the following line of code gives us the frequencies of the two alleles at the second site in the genome:

In [50]:
af[1]

array([0.75, 0.25])
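Because `Python` counts from zero, index `1` picks out the second row, not the first. Here is a small sketch of a few common indexing patterns, using only the first three rows of the frequencies above (stored in a new array called `af_head`, a name made up for this example):

```python
import numpy as np

# First three rows of the allele frequency array shown earlier.
af_head = np.array([[1.0, 0.0],
                    [0.75, 0.25],
                    [0.5, 0.5]])

first_site = af_head[0]        # row 0 is the first site
second_site = af_head[1]       # row 1 is the second site
ref_freq_second = af_head[1, 0]  # reference allele frequency at the second site
first_two_sites = af_head[:2]  # a slice returns several rows at once
print(first_site, second_site, ref_freq_second)
```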

We can use the following function to calculate observed heterozygosity at one or more SNPs, again using brackets to select a particular location in the genome:

In [63]:
h_o = allel.heterozygosity_observed(g)[2]  # index 2 selects the third variant (indexing starts at 0)
h_o

np.float64(0.0)

The function `allel.heterozygosity_expected` works similarly, but it takes the allele frequency array as input rather than the `GenotypeArray`:

In [64]:
h_e = allel.heterozygosity_expected(af, ploidy=2)[2]  # index 2 selects the third variant (indexing starts at 0)
h_e

np.float64(0.5)

**QUESTION 3**: Recall that the inbreeding coefficient can be calculated as $F = 1 - \frac{H_o}{H_e}$. Add a chunk below to determine the inbreeding coefficient for the *second* variant. What can you infer from the answer? (This will take more than one line of `Python`.)