# Paternity arrays

Tom Ellis, March 2017

Paternity arrays are what sibship clustering is built on in FAPS. They contain information about the probability that each candidate male is the father of each individual offspring. This information is stored in a `paternityArray` object, along with other related information. A `paternityArray` can either be imported directly, or created from genotype data.

This notebook will examine how to:

1. Create a `paternityArray` from marker data.
2. Examine what information it contains.
3. Read and write a `paternityArray` to disk, or import a custom `paternityArray`.

Once you have made your `paternityArray`, the [next step](https://github.com/ellisztamas/faps/blob/master/docs/04%20Sibship%20clustering.ipynb) is to cluster the individuals in your array into full sibship groups.

## Creating a `paternityArray` from genotype data

To create a `paternityArray` from genotype data we need to specify `genotypeArray`s for the offspring, mothers and candidate males. Currently only biallelic SNP data are supported.

We will illustrate this with a small simulated example again with four adults and six offspring.

In [2]:
from faps import *
import numpy as np

np.random.seed(49) # this ensures you get exactly the same answers as I do.
allele_freqs = np.random.uniform(0.3,0.5,10)
mypop        = make_parents(4, allele_freqs, family_name='my_population')
progeny      = make_sibships(mypop, 0, [1,2], 3, 'myprogeny')

We need to supply a `genotypeArray` for the mothers. This needs to have an entry for every offspring, i.e. six replicates of the mother.

In [3]:
mum_index = progeny.parent_index('mother', mypop.names) # positions of the mothers in the array of adults
mothers   = mypop.subset(mum_index) # genotypeArray of the mothers

To create the `paternityArray` we also need to supply information on the genotyping error rate (mu), and population allele frequencies. For the latter, in this case we can either take the population allele frequencies defined above, or estimate them from the data, which will give slightly different answers. The function `paternity_array` creates an object of class `paternityArray`.

In [4]:
mu = 0.001
sample_af = mypop.allele_freqs()
patlik = paternity_array(offspring = progeny, mothers = mothers, males= mypop, mu=mu)

Note that if you try running the above cell but with mu=0, it will run, but throw a warning. The reason is that setting mu to zero tends to make sibship clustering unstable, so it automatically sets mu to a very small number.
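To make the allele frequency option mentioned above concrete: estimating frequencies from the data amounts to averaging allele copies across the sampled adults. The sketch below uses plain NumPy and a made-up array layout (individuals x loci x two allele copies, coded 0/1). The layout and the name `estimate_allele_freqs` are assumptions for illustration, not a description of how FAPS stores genotypes or implements `allele_freqs()`.

```python
import numpy as np

def estimate_allele_freqs(genotypes):
    """Frequency of the allele coded 1 at each locus.

    genotypes: array of shape (n_individuals, n_loci, 2) holding 0/1
    allele copies (a hypothetical layout chosen for this sketch).
    """
    # Average over individuals (axis 0) and the two allele copies (axis 2).
    return genotypes.mean(axis=(0, 2))

# Four hypothetical diploid adults genotyped at three loci.
adults = np.array([
    [[0, 1], [1, 1], [0, 0]],
    [[1, 1], [0, 1], [0, 0]],
    [[0, 0], [1, 1], [0, 1]],
    [[0, 1], [1, 1], [0, 0]],
])

sample_freqs = estimate_allele_freqs(adults)
print(sample_freqs)
```

With only four adults there are just eight allele copies per locus, so estimates move in steps of 1/8. That is one reason sample estimates give slightly different answers from the frequencies used to simulate the population.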

## `paternityArray` structure

A `paternityArray` inherits information about individuals found in a `genotypeArray`, for example:

In [5]:
print(patlik.candidates)
print(patlik.mothers)
print(patlik.offspring)

['my_population_0' 'my_population_1' 'my_population_2' 'my_population_3']
['my_population_0' 'my_population_0' 'my_population_0' 'my_population_0'
 'my_population_0' 'my_population_0']
['myprogeny_0' 'myprogeny_1' 'myprogeny_2' 'myprogeny_3' 'myprogeny_4'
 'myprogeny_5']


The most important part of the `paternityArray` is the likelihood array, which represents the log likelihood that each candidate male is the true father of each offspring individual. In this case it will be a 6x4 array (six offspring by four candidates), which we could view with:

In [6]:
patlik.lik_array

array([[-38.55110042, -28.89021394, -45.36535423, -34.87888561],
       [-38.55110042, -28.89021394, -33.16469526, -28.89021394],
       [-39.2422496 , -28.89171056, -46.05650341, -35.57003479],
       [-44.76308772, -29.57986649, -29.57887198, -30.26951904],
       [-35.10370984, -47.5737391 , -30.27451103, -36.61993904],
       [-33.0302623 , -40.53516318, -28.2010635 , -34.5464915 ]])

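Each row is an offspring and each column a candidate. The second candidate has the highest log likelihood for the first full sibship (rows one to three), and the third candidate has the highest for the second sibship (rows four to six), matching the true sires. Individuals one (the mother) and four never have a higher log likelihood than the true sire, although the fourth candidate ties with the true sire for the second offspring. Because the error rate is greater than zero, no candidate is excluded outright (an excluded candidate would have a log likelihood of negative infinity).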
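A quick way to read an array like this is to take the column with the highest log likelihood in each row. The sketch below does that with NumPy, using the values and candidate names copied from the outputs above; it is only a reading aid, not part of the FAPS API.

```python
import numpy as np

# Log likelihoods copied from the output above
# (rows: offspring, columns: candidates).
lik_array = np.array([
    [-38.55110042, -28.89021394, -45.36535423, -34.87888561],
    [-38.55110042, -28.89021394, -33.16469526, -28.89021394],
    [-39.2422496 , -28.89171056, -46.05650341, -35.57003479],
    [-44.76308772, -29.57986649, -29.57887198, -30.26951904],
    [-35.10370984, -47.5737391 , -30.27451103, -36.61993904],
    [-33.0302623 , -40.53516318, -28.2010635 , -34.5464915 ],
])
candidates = np.array(['my_population_0', 'my_population_1',
                       'my_population_2', 'my_population_3'])

# Index of the candidate with the highest log likelihood for each offspring.
# np.argmax returns the first index when two candidates tie.
best = lik_array.argmax(axis=1)
for offspring_index, candidate_index in enumerate(best):
    print(offspring_index, candidates[candidate_index])
```

Picking a single best candidate hides uncertainty: the second offspring has identical values for the second and fourth candidates, and the fourth offspring has two candidates that are almost tied.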

In this toy example the genotypes were simulated without error, so the true error rate is zero, and we supplied only a very small value for mu. In reality the error rate will almost never be zero, and we would usually need to incorporate a realistic error rate. Here's what happens to `patlik` if we use a slightly higher error rate.

In [7]:
mu = 0.0015
patlik = paternity_array(offspring = progeny, mothers = mothers, males= mypop, mu=mu)
patlik.lik_array

array([[-37.73555189, -28.88345936, -44.14239594, -34.46527484],
       [-37.73555189, -28.88345936, -32.75536552, -28.88345936],
       [-38.42570356, -28.88570176, -44.83254761, -35.15542651],
       [-43.54076683, -29.57136863, -29.56988098, -30.2592779 ],
       [-34.69094363, -46.34848767, -30.26675986, -36.20536912],
       [-32.62048862, -39.71672958, -28.19630485, -34.1349141 ]])

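No candidate can be excluded explicitly, but we can see that the true sires typically have log likelihoods several units higher than the imposters, and in some cases ten or more units higher. A difference of ten log units means that the true sire is around 22,000-fold more likely to be the true father than the other candidate. That's quite a lot! The separation is not always this clean, though: for the second and fourth offspring another candidate is about as likely as the true sire.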
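The conversion from a difference in log likelihood to a fold difference is just exponentiation. The short sketch below works through the ten-unit figure and, for comparison, one pair of values taken from the array above (first offspring, second candidate versus the mother).

```python
import math

# A difference of ten log-likelihood units corresponds to a ratio of e**10.
fold_ten_units = math.exp(10)
print(round(fold_ten_units))

# First offspring in the array above: second candidate (the true sire)
# versus the first candidate (the mother).
log_lik_sire = -28.88345936
log_lik_mother = -37.73555189
fold_first_offspring = math.exp(log_lik_sire - log_lik_mother)
print(round(fold_first_offspring))
```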

The `paternityArray` also includes information about the likelihood that the true sire is not in the sample of candidate males. In this case this is not helpful, because we know sampling is complete, but in real examples this is seldom the case. By default this is defined as the likelihood of generating the offspring genotypes given the known mother's genotype and alleles drawn from population allele frequencies. Here, values for the six offspring are higher than the likelihoods for many of the non-sires, indicating that those candidates are no more likely to be the true sire than a random unrelated individual.

In [8]:
patlik.lik_absent

array([-32.99492493, -35.48829256, -34.07749314, -35.1092052 ,
       -36.21397013, -34.4246085 ])

This becomes more informative when we combine likelihoods about sampled and unsampled fathers. For fractional analyses we really want to know the probability that the father was unsampled vs sampled, and how probable it is that a single candidate is the true sire. To do this, `patlik.lik_array` is concatenated with the values in `patlik.lik_absent`, and the resulting array is normalised so that values in each row sum to one. Printing the shape of the array demonstrates that we have gained a column.

In [9]:
patprob = patlik.prob_array
print(patprob.shape) # dimensions of the prob_array

(6, 5)


If we sum the rows we see that they do indeed add up to one now. Probabilities are stored as log probabilities, so we have to exponentiate first.

In [10]:
np.exp(patprob).sum(axis=1)

array([1., 1., 1., 1., 1., 1.])
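The concatenate-and-normalise step can be sketched in NumPy using the `lik_array` and `lik_absent` values printed above. This is an illustration of the arithmetic on the log scale under the assumption that sampled and unsampled fathers are weighted equally; it is not FAPS's own code.

```python
import numpy as np

# Values copied from the outputs above (mu = 0.0015).
lik_array = np.array([
    [-37.73555189, -28.88345936, -44.14239594, -34.46527484],
    [-37.73555189, -28.88345936, -32.75536552, -28.88345936],
    [-38.42570356, -28.88570176, -44.83254761, -35.15542651],
    [-43.54076683, -29.57136863, -29.56988098, -30.2592779 ],
    [-34.69094363, -46.34848767, -30.26675986, -36.20536912],
    [-32.62048862, -39.71672958, -28.19630485, -34.1349141 ],
])
lik_absent = np.array([-32.99492493, -35.48829256, -34.07749314,
                       -35.1092052 , -36.21397013, -34.4246085 ])

# Append the "father not sampled" likelihoods as a fifth column.
combined = np.column_stack([lik_array, lik_absent])

# Normalise each row on the log scale by subtracting log(sum(exp(row))).
row_totals = np.logaddexp.reduce(combined, axis=1, keepdims=True)
log_prob = combined - row_totals

print(log_prob.shape)
print(np.exp(log_prob).sum(axis=1))
```

Working with `np.logaddexp` rather than exponentiating first avoids underflow when log likelihoods are very negative, which becomes relevant with many more loci than in this toy example.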

### Modifying a `paternityArray`

We can alter the information in `patlik.prob_array` to reflect different prior beliefs about the dataset. (In contrast, it's seldom a good idea to manipulate the likelihoods from genetic data contained in `patlik.lik_array`).

For example, often the mother is included in the sample of candidate males, either because you are using the same array for multiple families, or self-fertilisation is a biological possibility. In a lot of cases though the mother cannot simultaneously be the sperm/pollen donor, and it is necessary to set the rate of self-fertilisation to zero (the natural logarithm of zero is negative infinity).

In [11]:
patlik.adjust_prob_array(selfing_rate=0)

array([[           -inf, -1.99493854e-02, -1.52788860e+01,
        -5.60176487e+00, -4.13141495e+00],
       [           -inf, -7.04172407e-01, -4.57607856e+00,
        -7.04172407e-01, -7.30900561e+00],
       [           -inf, -7.42725188e-03, -1.59542731e+01,
        -6.27715200e+00, -5.19921863e+00],
       [           -inf, -9.19505403e-01, -9.18017758e-01,
        -1.60741467e+00, -6.45734198e+00],
       [           -inf, -1.60869630e+01, -5.23518911e-03,
        -5.94384444e+00, -5.95244546e+00],
       [           -inf, -1.15250325e+01, -4.60778144e-03,
        -5.94321704e+00, -6.23291143e+00]])

The probabilities for the mother have changed to 0 (negative infinity on the log scale). You can set any selfing rate between zero and one if you have a good idea of what the value should be and how much it varies. Otherwise it may be better to estimate the selfing rate from the data or by some other means.

`adjust_prob_array` always refers back to the original `patlik.lik_array` and `patlik.lik_absent`, which remain unchanged. Calling `adjust_prob_array` will not alter the data stored for `patlik.prob_array` unless you assign it yourself.

In [12]:
patlik.prob_array = patlik.adjust_prob_array(selfing_rate=0)
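Setting the selfing rate to zero can be pictured as replacing the mother's column with log(0) before normalising. The NumPy sketch below does this for the first two offspring, using the `lik_array` and `lik_absent` values shown earlier. It is an assumption about the arithmetic rather than FAPS's implementation, although the results should agree with the first two rows of the output above to the precision checked here.

```python
import numpy as np

# First two offspring, values copied from the outputs above (mu = 0.0015).
lik_array = np.array([
    [-37.73555189, -28.88345936, -44.14239594, -34.46527484],
    [-37.73555189, -28.88345936, -32.75536552, -28.88345936],
])
lik_absent = np.array([-32.99492493, -35.48829256])

combined = np.column_stack([lik_array, lik_absent])

# The mother is the first candidate; log(0) is negative infinity.
mother_column = 0
combined[:, mother_column] = -np.inf

# Renormalise so that each row sums to one after exponentiating.
prob_no_selfing = combined - np.logaddexp.reduce(combined, axis=1, keepdims=True)
print(prob_no_selfing)
```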

You can also set likelihoods for particular individuals to zero manually. You might want to do this if you wanted to test the effects of incomplete sampling on your results, or if you had a good reason to suspect that some candidates could not possibly be the sire (for example, if the data are multigenerational, and the candidate was born after the offspring). This is done with the argument `purge`.

In [13]:
patlik.adjust_prob_array(purge=0)

array([[           -inf, -1.99493854e-02, -1.52788860e+01,
        -5.60176487e+00, -4.13141495e+00],
       [           -inf, -7.04172407e-01, -4.57607856e+00,
        -7.04172407e-01, -7.30900561e+00],
       [           -inf, -7.42725188e-03, -1.59542731e+01,
        -6.27715200e+00, -5.19921863e+00],
       [           -inf, -9.19505403e-01, -9.18017758e-01,
        -1.60741467e+00, -6.45734198e+00],
       [           -inf, -1.60869630e+01, -5.23518911e-03,
        -5.94384444e+00, -5.95244546e+00],
       [           -inf, -1.15250325e+01, -4.60778144e-03,
        -5.94321704e+00, -6.23291143e+00]])

This has removed the first individual (notice that this is identical to the previous example, because in this case the first individual is the mother). If you supply an integer or a list of integers to `purge`, these will be treated as indexes of individuals to be removed. Alternatively you can supply a float between zero and one, which will be interpreted as a proportion of the candidates to be removed at random, which can be useful for simulations.

In [14]:
patprob2 = patlik.adjust_prob_array(purge=0.4)
np.isinf(patprob2).mean(1) # proportion missing along each row.

array([0.4, 0.4, 0.4, 0.4, 0.4, 0.4])

Of course, rows still need to sum to one. Luckily `adjust_prob_array` does that automatically.

In [15]:
np.exp(patprob2).sum(1)

array([1., 1., 1., 1., 1., 1.])

You can specify the proportion $\theta$ of the population of candidate males which are missing with the option `missing_parents`. The likelihoods for non-sampled parents will be weighted by $\theta$, and likelihoods for sampled candidates by $1-\theta$.

In [16]:
patlik.adjust_prob_array(missing_parents=0.1)

array([[-8.85780563e+00, -5.71310609e-03, -1.52646497e+01,
        -5.58752859e+00, -6.31440325e+00],
       [-9.55574046e+00, -7.03647931e-01, -4.57555409e+00,
        -7.03647931e-01, -9.50570571e+00],
       [-9.54258126e+00, -2.57945901e-03, -1.59494253e+01,
        -6.27230421e+00, -7.39159542e+00],
       [-1.48875083e+01, -9.18110141e-01, -9.16622495e-01,
        -1.60601941e+00, -8.65317129e+00],
       [-4.43898384e+00, -1.60965279e+01, -1.48000687e-02,
        -5.95340932e+00, -8.15923491e+00],
       [-4.43892341e+00, -1.15351644e+01, -1.47396423e-02,
        -5.95334890e+00, -8.44026787e+00]])
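The weighting by $\theta$ and $1-\theta$ can be sketched on the log scale: add $\log(1-\theta)$ to the sampled candidates, add $\log\theta$ to the unsampled-father column, then normalise each row. The NumPy sketch below applies this to the first offspring using values from the earlier outputs. It is a reconstruction of the arithmetic, not FAPS's code, though it should agree with the first row of the output above.

```python
import numpy as np

theta = 0.1  # assumed proportion of candidate males that were not sampled

# First offspring, values copied from the outputs above (mu = 0.0015).
lik_array = np.array([[-37.73555189, -28.88345936, -44.14239594, -34.46527484]])
lik_absent = np.array([-32.99492493])

# Weight sampled candidates by (1 - theta) and the absent father by theta.
weighted = np.column_stack([
    lik_array + np.log(1 - theta),
    lik_absent + np.log(theta),
])

# Normalise each row so that probabilities sum to one.
prob_weighted = weighted - np.logaddexp.reduce(weighted, axis=1, keepdims=True)
print(prob_weighted)
```

A small $\theta$ pushes probability away from the unsampled-father column, which is why its value drops relative to the unweighted array.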

You might want to remove candidates who have an a priori very low probability of paternity, for example to reduce the memory requirements of the `paternityArray`. One simple rule is to exclude any candidates with more than some arbitrary number of loci with opposing homozygous genotypes relative to the offspring (you want to allow for a small number, in case there are genotyping errors). This is done with `max_clashes`.

In [17]:
patlik.adjust_prob_array(max_clashes=3)

array([[-8.87218216e+00, -2.00896315e-02, -1.52790262e+01,
        -5.60190511e+00, -4.13155520e+00],
       [-9.55633569e+00, -7.04243161e-01, -4.57614932e+00,
        -7.04243161e-01, -7.30907637e+00],
       [-9.54750043e+00, -7.49863389e-03, -1.59543445e+01,
        -6.27722339e+00, -5.19929001e+00],
       [-1.48889039e+01, -9.19505745e-01, -9.18018100e-01,
        -1.60741501e+00, -6.45734232e+00],
       [-4.44126987e+00, -1.60988139e+01, -1.70861033e-02,
        -5.95569536e+00, -5.96429637e+00],
       [-4.44064986e+00, -1.15368908e+01, -1.64660894e-02,
        -5.95507534e+00, -6.24476974e+00]])

The option `max_clashes` refers back to a matrix that counts the number of such incompatibilities for each offspring-candidate pair. When you create a `paternityArray` from `genotypeArray` objects, this matrix is created automatically and can be accessed with:

In [18]:
patlik.clashes

array([[0, 0, 2, 1],
       [0, 0, 1, 0],
       [0, 0, 2, 1],
       [0, 0, 0, 0],
       [0, 3, 0, 1],
       [0, 2, 0, 1]])

You can recreate this manually with:

In [19]:
incompatibilities(mypop, progeny)

array([[0, 0, 2, 1],
       [0, 0, 1, 0],
       [0, 0, 2, 1],
       [0, 0, 0, 0],
       [0, 3, 0, 1],
       [0, 2, 0, 1]])

Notice that this array has a row for each offspring, and a column for each candidate father. The first column is for the mother, which is why everything is zero.
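Counting opposing homozygous loci is simple to sketch with NumPy broadcasting. The example below assumes genotypes are coded as the number of copies of one allele (0, 1 or 2) per locus, which is a layout chosen for illustration; the function name `count_clashes` and the genotypes are hypothetical and not part of FAPS. In the `max_clashes=3` output earlier, a pair with three clashes kept a finite probability, which suggests candidates are dropped only when they have more clashes than the threshold, so the mask below uses a strict comparison.

```python
import numpy as np

def count_clashes(offspring, candidates):
    """Count opposing homozygous loci for every offspring-candidate pair.

    offspring:  (n_offspring, n_loci) array of allele counts (0, 1 or 2).
    candidates: (n_candidates, n_loci) array of allele counts (0, 1 or 2).
    Returns an (n_offspring, n_candidates) array of counts.
    """
    # A clash is 0 versus 2 (or 2 versus 0), i.e. an absolute difference of 2.
    diff = np.abs(offspring[:, None, :] - candidates[None, :, :])
    return (diff == 2).sum(axis=2)

# Hypothetical genotypes: two offspring and three candidates at four loci.
offspring = np.array([[0, 0, 2, 2],
                      [2, 1, 0, 0]])
candidates = np.array([[0, 1, 2, 2],
                       [2, 2, 0, 1],
                       [1, 1, 1, 1]])

clashes = count_clashes(offspring, candidates)
print(clashes)

# Pairs that would be excluded under a threshold of two clashes.
max_clashes = 2
too_many = clashes > max_clashes
print(too_many)
```

A fully heterozygous candidate (the third one here) can never clash, so this rule only ever removes candidates on the evidence of their homozygous loci.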

## Importing a `paternityArray`

Frequently you may wish to save an array and reload it. Alternatively, you may be working with a more exotic system than FAPS currently supports, such as microsatellite markers or a funky ploidy system. In this case you can create your own matrix of paternity likelihoods and import this directly as a `paternityArray`. Firstly, we can save the array we made before to disk by supplying a path to save to:

In [20]:
patlik.write('mypatlik.csv')

We can reimport it again using `read_paternity_array`. This function is similar to the function for importing a `genotypeArray`, and the data need to have a specific structure:

1. Offspring names should be given in the first column
2. Names of the mothers are usually given in the second column.
3. If known for some reason, names of fathers can be given as well.
4. Likelihood information should be given *to the right* of columns indicating individual or parental names, with candidates' names in the column headers.
5. The final column should specify a likelihood that the true sire of an individual has *not* been sampled. Usually this is given as the likelihood of drawing the paternal alleles from population allele frequencies.

In [23]:
patlik = read_paternity_array('mypatlik.csv', mothers_col=1, likelihood_col=2)
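For a custom array, the column layout described in the list above might look like the table built below. This sketch uses only the Python standard library; the individual names, the likelihood values and the header `missing_father` are all made up for illustration, and it assumes the values are log likelihoods as in `lik_array`. It does not claim to reproduce the exact file that `patlik.write` produces.

```python
import csv
import io

# Hypothetical candidate names used as column headers.
candidates = ['cand_0', 'cand_1', 'cand_2']

# One row per offspring: offspring name, mother name, one value per
# candidate, then the value for an unsampled father in the final column.
rows = [
    ['off_0', 'mum_0', -12.1, -3.4, -9.8, -7.5],
    ['off_1', 'mum_0', -11.7, -10.2, -2.9, -8.1],
]

# Write to an in-memory buffer; swap in open('my_array.csv', 'w', newline='')
# to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['offspring', 'mother'] + candidates + ['missing_father'])
writer.writerows(rows)

table_text = buffer.getvalue()
print(table_text)
```

Counting columns from zero, the mother's name is in column 1 and the likelihoods begin in column 2, which appears to correspond to `mothers_col=1` and `likelihood_col=2` in the call above.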


Of course, you can generate your own `paternityArray` and import it in the same way. This is especially useful if your study system has some specific marker type or genetic system not supported by FAPS.

One caveat with importing data is that the array of opposing homozygous loci is not imported automatically. You can either import this as a separate text file, or you can recreate this as above:

In [22]:
incompatibilities(mypop, progeny)

array([[0, 0, 2, 1],
       [0, 0, 1, 0],
       [0, 0, 2, 1],
       [0, 0, 0, 0],
       [0, 3, 0, 1],
       [0, 2, 0, 1]])