# Genotype data in FAPS

In most cases, researchers will have a sample of offspring, maternal and candidate paternal individuals typed at a set of markers. In this section we'll look in more detail at how FAPS deals with genotype data to build a matrix we can use for sibship inference. 
Currently, the tools for doing this in FAPS assume you are using biallelic, unlinked SNPs in a diploid species; we'll discuss what you can do if your system doesn't meet these criteria later in this section.

## Genotype objects

Genotype data are stored in a class of objects called a `genotypeArray`. We'll illustrate how these work with simulated data, since not all information is available for real-world data sets. We first generate a vector of population allele frequencies for 10 unlinked SNP markers, and use these to create a population of five adult individuals. The optional argument `family_name` allows you to name this generation.

In [48]:
from faps import *
import numpy as np

allele_freqs = np.random.uniform(0.3,0.5,10)
mypop = make_parents(5, allele_freqs, family_name='my_population')

The object we just created contains information about the genotypes of each of the five parent individuals. Genotypes are stored as a three-dimensional array of shape *N*x*L*x2, where *N* is the number of individuals, *L* is the number of loci, and the last axis holds the two allele copies at each locus. We can view the genotype for the first parent like so (recall that Python starts counting from zero, not one):

In [49]:
mypop.geno[0]

array([[1, 1],
       [1, 0],
       [0, 0],
       [0, 0],
       [1, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])
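
To make this array layout concrete, here is a minimal NumPy sketch, independent of FAPS, that simulates an array of the same shape. The helper `simulate_genotypes` is hypothetical and is not how `make_parents` is implemented; it assumes each allele copy is an independent draw with probability equal to the allele frequency.

```python
import numpy as np

def simulate_genotypes(n_individuals, allele_freqs, seed=None):
    """Hypothetical helper: draw diploid SNP genotypes from allele frequencies."""
    rng = np.random.RandomState(seed)
    allele_freqs = np.asarray(allele_freqs)
    # One uniform draw per individual, locus and allele copy.
    draws = rng.uniform(size=(n_individuals, len(allele_freqs), 2))
    # A copy carries allele 1 when its draw falls below that locus's frequency.
    return (draws < allele_freqs[np.newaxis, :, np.newaxis]).astype(int)

freqs = np.random.RandomState(1).uniform(0.3, 0.5, 10)
sim_geno = simulate_genotypes(5, freqs, seed=2)

print(sim_geno.shape)     # (individuals, loci, allele copies)
print(sim_geno[0].shape)  # all loci for the first individual

# Summing over the last axis gives the number of copies of allele 1 per locus.
allele_counts = sim_geno.sum(axis=2)
```

Summing over the last axis, as in the final line, converts the two allele copies into a single 0, 1 or 2 count per locus.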

You could subset the array by indexing the genotypes, for example by taking only the first two individuals and the first five loci:

In [53]:
mypop.geno[:2, :5]

array([[[1, 1],
        [1, 0],
        [0, 0],
        [0, 0],
        [1, 0]],

       [[1, 0],
        [0, 1],
        [1, 0],
        [0, 1],
        [0, 1]]])

For realistic examples with many more loci, this quickly becomes unwieldy. It's cleaner to supply a list of individuals to keep or remove to the `subset` and `drop` methods. These return a new `genotypeArray` for the individuals of interest.

In [48]:
print(mypop.subset([0,2]).names)
print(mypop.drop([0,2]).names)

['my_population_0' 'my_population_2']
['my_population_1' 'my_population_3' 'my_population_4']
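
Conceptually, keeping or dropping individuals amounts to indexing the first axis of the genotype array and the vector of names in the same way. The following is a rough NumPy sketch of that idea using placeholder data; it is not the FAPS source code, and the variable names are made up for illustration.

```python
import numpy as np

# Placeholder data standing in for a population of five individuals.
names = np.array(['my_population_%d' % i for i in range(5)])
geno = np.zeros((5, 10, 2), dtype=int)

chosen = [0, 2]

# Keeping individuals: index the first axis with the list of positions.
kept_names, kept_geno = names[chosen], geno[chosen]

# Dropping individuals: build a boolean mask that is False at those positions.
mask = np.ones(len(names), dtype=bool)
mask[chosen] = False
dropped_names, dropped_geno = names[mask], geno[mask]

print(kept_names)
print(dropped_names)
```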


A `genotypeArray` contains other useful information about the individuals:

In [46]:
print(mypop.names) # individual names
print(mypop.size)  # number of individuals
print(mypop.nloci) # number of loci typed

['my_population_0' 'my_population_1' 'my_population_2' 'my_population_3'
 'my_population_4']
5
10


`make_sibships` is a convenient way to generate a single half-sibling array from individuals in `mypop`. This code makes a half-sib array with individual 0 as the mother, and individuals 1, 2 and 3 contributing male gametes. Each father has four offspring.

In [49]:
progeny = make_sibships(mypop, 0, [1,2,3], 4, 'myprogeny')
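
For intuition about what such a cross involves, here is a minimal NumPy sketch of Mendelian inheritance at unlinked loci: every offspring receives one randomly chosen allele copy from the mother and one from its father at each locus. This is only an illustration under that assumption, not the code inside `make_sibships`; the helper `make_gametes` and the random parental genotypes are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(42)
# Placeholder parents: 5 individuals, 10 loci, 2 allele copies each.
parents = rng.randint(0, 2, size=(5, 10, 2))

def make_gametes(parent_geno, n, rng):
    """Hypothetical helper: transmit one allele copy at random at each locus."""
    nloci = parent_geno.shape[0]
    picks = rng.randint(0, 2, size=(n, nloci))   # which copy to transmit
    return parent_geno[np.arange(nloci), picks]  # shape (n, nloci)

mother, fathers, family_size = 0, [1, 2, 3], 4
families = []
for father in fathers:
    maternal = make_gametes(parents[mother], family_size, rng)
    paternal = make_gametes(parents[father], family_size, rng)
    # Pair the maternal and paternal alleles along a new last axis.
    families.append(np.stack([maternal, paternal], axis=2))

sim_progeny = np.concatenate(families)
print(sim_progeny.shape)  # (12, 10, 2)
```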

Because `progeny` is a generation of offspring, its `genotypeArray` holds some information that the parental generation lacks: the identity of each individual's parents, and the family structure.

In [50]:
print(progeny.fathers)
print(progeny.mothers)
print(progeny.families)
print(progeny.nfamilies)

['my_population_1' 'my_population_1' 'my_population_1' 'my_population_1'
 'my_population_2' 'my_population_2' 'my_population_2' 'my_population_2'
 'my_population_3' 'my_population_3' 'my_population_3' 'my_population_3']
['my_population_0' 'my_population_0' 'my_population_0' 'my_population_0'
 'my_population_0' 'my_population_0' 'my_population_0' 'my_population_0'
 'my_population_0' 'my_population_0' 'my_population_0' 'my_population_0']
['my_population_0/my_population_1' 'my_population_0/my_population_2'
 'my_population_0/my_population_3']
3


Of course with real data we would not normally know the identity of the father or the number of families, but this is useful for checking accuracy in simulations. It can also be useful to look up the positions of the parents in another list of names. This code finds the indices of the mothers and fathers of the offspring in the names listed in `mypop`.

In [51]:
print(progeny.parent_index('mother', mypop.names))
print(progeny.parent_index('father', mypop.names))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
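
The same kind of lookup can be sketched in plain Python with a dictionary that maps each name to its position. This is an illustration of the idea with made-up variable names, not the implementation of `parent_index`.

```python
import numpy as np

candidate_names = np.array(['my_population_%d' % i for i in range(5)])
offspring_fathers = np.array(['my_population_1'] * 4 +
                             ['my_population_2'] * 4 +
                             ['my_population_3'] * 4)

# Map each candidate name to its position once, then look up every father.
position = {name: i for i, name in enumerate(candidate_names)}
father_index = [position[name] for name in offspring_fathers]
print(father_index)
```

In this sketch a parent whose name is absent from the list raises a `KeyError`, which is one simple way to catch mismatched names.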


## Importing genotype data

You can import genotype data from a text or CSV (comma-separated text) file. Both can be easily exported from a spreadsheet program. Rows index individuals, and columns index each typed locus. More specifically:

1. Names of individuals should be given in the first column.
2. If the data are offspring, names of the mothers are given in the second column.
3. If known, names of fathers can be given as well.
4. Genotype information should be given *to the right* of columns indicating individual or parental names, with locus names in the column headers.

SNP genotype data must be biallelic; that is, each genotype can only be homozygous for the first allele, heterozygous, or homozygous for the second allele. These should be given as 0, 1 and 2 respectively. Missing genotypes should be entered as NA.
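
As an illustration of this layout, the sketch below parses a small invented offspring table with Python's standard `csv` module. The individual, mother and locus names are made up, and this is not how `read_genotypes` works internally; it only shows how the columns are arranged.

```python
import csv
import io

# Invented example: offspring name, mother name, then one column per locus.
example = """name,mother,locus_1,locus_2,locus_3
off_1,mum_A,0,1,2
off_2,mum_A,1,NA,0
off_3,mum_A,2,1,1
"""

rows = list(csv.reader(io.StringIO(example)))
header, records = rows[0], rows[1:]
genotype_col = 2  # zero-based index of the first genotype column

markers = header[genotype_col:]
names = [r[0] for r in records]
mothers = [r[1] for r in records]
# Convert genotype strings to integers, using None for missing values.
genotypes = [[None if g == 'NA' else int(g) for g in r[genotype_col:]]
             for r in records]

print(markers)
print(genotypes)
```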

The following code imports genotype information on real samples of offspring and candidate parents. Offspring are a half-sibling array of wild-pollinated snapdragon seedlings collected in the Spanish Pyrenees. The candidate parents are as many of the wild adult plants as we could find.

In [4]:
adults   = read_genotypes('data_files/parents_SNPs_2012.csv', genotype_col=1, delimiter=',')
offspring = read_genotypes('data_files/M0009_offspring.csv', genotype_col=2, mothers_col=1)

Again, Python starts counting from zero rather than one, so the first column is really column zero, and so on. Because these are CSV, there was no need to specify that data are delimited by commas, but this is included for illustration.

You can also get summaries for each locus of the proportion of individuals with missing data, and of the mean heterozygosity. These can be helpful in identifying dubious loci. This code prints data for the first nine loci.

In [5]:
print(offspring.missing_data[:9])
print(offspring.heterozyosity[:9])

[ 0.09756098  0.02439024  0.02439024  0.2195122   0.02439024  0.07317073
  0.12195122  0.          0.09756098]
[ 0.29268293  0.6097561   0.63414634  0.34146341  0.58536585  0.51219512
  0.26829268  0.09756098  0.56097561]
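
To show what such per-locus summaries measure, here is a small NumPy sketch on invented data coded as 0, 1 and 2, with `np.nan` for missing genotypes. It assumes both quantities are proportions of all individuals in the sample; whether FAPS computes heterozygosity over all individuals or only over those successfully typed is not something this sketch settles.

```python
import numpy as np

# Invented data: rows are individuals, columns are loci, np.nan is missing.
counts = np.array([[0, 1, 2],
                   [1, np.nan, 0],
                   [2, 1, 1],
                   [1, 1, np.nan]])

# Proportion of individuals with no genotype call at each locus.
prop_missing = np.isnan(counts).mean(axis=0)

# Heterozygotes carry exactly one copy of the second allele.
prop_het = (counts == 1).mean(axis=0)

print(prop_missing)
print(prop_het)
```

Loci with a high proportion of missing data or an unexpectedly extreme heterozygosity are candidates for closer inspection.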
