*This post has some examples of analysing a genetic cross, using [scikit-allel](@@) and standard scientific Python libraries ([NumPy](@@), [matplotlib](@@), etc.). As usual, if you spot any errors or have any suggestions, please drop a comment below.*

## Setup

In [15]:
import numpy as np
import pandas
import h5py
import allel

I'm going to use data from the [Ag1000G](@@) project [phase 1 data releases](@@), which include genotype calls for four genetic crosses. Each cross involves two parents (a mother and a father) and up to 20 offspring (progeny). These are mosquito crosses, but mosquitoes are diploid (like us), so the genetics are the same as for a cross or family of any other diploid species.

Here's some information about the crosses.

In [1]:
samples = pandas.read_csv('data/phase1.AR3.1/samples/cross.samples.meta.txt', sep='\t')
samples.head()

| | ox_code | cross | role | n_reads | median_cov | mean_cov | sex | colony_id |
|---|---|---|---|---|---|---|---|---|
| 0 | AD0231-C | 29-2 | parent | 451.762 | 20.0 | 19.375 | F | ghana |
| 1 | AD0232-C | 29-2 | parent | 572.326 | 25.0 | 24.37 | M | kisumu |
| 2 | AD0234-C | 29-2 | progeny | 489.057 | 16.0 | 15.742 | F | |
| 3 | AD0235-C | 29-2 | progeny | 539.649 | 17.0 | 17.364 | F | |
| 4 | AD0236-C | 29-2 | progeny | 537.237 | 17.0 | 17.284 | F | |


In [2]:
samples.cross.value_counts()

29-2    22
46-9    22
36-9    20
42-4    16
Name: cross, dtype: int64

So there are four crosses. The two largest (crosses '29-2' and '46-9') each have 22 individuals (2 parents, 20 progeny), and the smallest ('42-4') has 16 individuals (2 parents, 14 progeny).

All individuals in all crosses have been sequenced on Illumina HiSeq machines, and then have had genotypes called at variant sites discovered in a cohort of wild specimens. The genotype data were originally in [VCF format](@@); however, for ease of analysis we've converted the data to [HDF5 format](@@).

Open the file containing genotype data for chromosome arm 3R.

In [3]:
callset = h5py.File('data/phase1.AR3/variation/crosses/ar3/hdf5/ag1000g.crosses.phase1.ar3sites.3R.h5',
                    mode='r')
callset

<HDF5 file "ag1000g.crosses.phase1.ar3sites.3R.h5" (mode r)>

To analyse your own data using the examples shown below, you would need to convert the genotype data to either NumPy or HDF5 format. If you have the data in VCF format then you can use the [vcfnp](@@) utility to perform the conversion. There is some documentation in the [vcfnp README](@@), but please feel free to email me if you run into any difficulty. This data conversion is the painful step; once you are past it, the rest should be relatively plain sailing.

Here I am going to start from unphased genotype data. If you have already phased the data, that's fine: convert to NumPy or HDF5 as you would for unphased data.

In total I have genotype calls in 80 individuals at 22,632,425 SNPs on chromosome 3R. However, I'm only going to analyse data for a single cross, '29-2', between a mother from the 'Ghana' colony and a father from the 'Kisumu' colony. I can subset out the genotype data for just this cross, and keep only SNPs that are segregating in this cross. I'm also only going to keep SNPs that passed all quality filters.

In [6]:
genotypes = allel.GenotypeChunkedArray(callset['3R/calldata/genotype'])
genotypes

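| | 0 | 1 | 2 | 3 | 4 | ... | 75 | 76 | 77 | 78 | 79 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22632422 | ./. | ./. | ./. | ./. | ./. | ... | 0/0 | ./. | ./. | 0/0 | 0/0 |
| 22632423 | ./. | ./. | ./. | ./. | ./. | ... | 0/0 | ./. | ./. | ./. | 0/0 |
| 22632424 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | 1/1 |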


In [11]:
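# locate the indices of the samples within the callset (this assumes the rows of
# the samples table are in the same order as the sample columns in the callset)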
sample_indices = samples[samples.cross == '29-2'].index.values.tolist()
sample_indices

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]

In [12]:
# do an allele count to find segregating variants
ac = genotypes.count_alleles(max_allele=3, subpop=sample_indices)[:]
ac

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 19 | 25 | 0 | 0 |
| 1 | 44 | 0 | 0 | 0 |
| 2 | 44 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 22632422 | 0 | 0 | 0 | 0 |
| 22632423 | 0 | 0 | 0 | 0 |
| 22632424 | 0 | 0 | 0 | 0 |


In [13]:
# how many SNPs are segregating within the cross?
loc_seg = ac.is_segregating()
np.count_nonzero(loc_seg)

2142258
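
To make the idea of a "segregating" SNP concrete, here is a minimal NumPy sketch of the logic on a toy array. It is only an illustration, not scikit-allel's implementation: the helper `count_alleles_toy()` and the toy data are made up for this example, and missing alleles are assumed to be coded as -1.

```python
import numpy as np

# toy genotype array: 4 variants x 3 diploid samples; -1 marks a missing allele
g_toy = np.array([
    [[0, 0], [0, 1], [1, 1]],        # both alleles observed
    [[0, 0], [0, 0], [0, 0]],        # only the reference allele observed
    [[1, 1], [1, 1], [-1, -1]],      # only the alternate allele observed
    [[-1, -1], [-1, -1], [-1, -1]],  # every call missing
], dtype='i1')


def count_alleles_toy(g, max_allele):
    """Count how often each allele 0..max_allele occurs at each variant."""
    n_variants = g.shape[0]
    # flatten the samples and ploidy dimensions into one axis of alleles
    flat = g.reshape(n_variants, -1)
    ac = np.zeros((n_variants, max_allele + 1), dtype=int)
    for allele in range(max_allele + 1):
        ac[:, allele] = (flat == allele).sum(axis=1)
    return ac


ac_toy = count_alleles_toy(g_toy, max_allele=3)

# a variant is segregating if more than one allele has a non-zero count
loc_seg_toy = (ac_toy > 0).sum(axis=1) > 1
print(ac_toy)
print(loc_seg_toy)
```

Only the first toy variant is segregating. Variants that are fixed for one allele, or where every call is missing, are not.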

In [17]:
# locate SNPs that passed all quality filters
loc_pass = callset['3R/variants/FILTER_PASS'][:]

In [18]:
# perform the subset and load the results into memory uncompressed
genotypes_cross = genotypes.subset(loc_seg & loc_pass, sample_indices)[:]
genotypes_cross

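| | 0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ./. | 0/0 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | ./. | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | 0/1 | 0/0 | 0/1 | 0/1 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 709396 | 1/1 | 0/0 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 709397 | 0/0 | 1/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| 709398 | 0/0 | 1/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |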


Now I have an array of genotype calls at 709,399 segregating SNPs in 22 individuals. The mother is the first column, the father is the second column, and the progeny are the remaining columns. You'll notice that the mother's genotype call is missing at the first two SNPs: we could remove these, but we'll leave them in, just to check that the analyses are robust to some missing data.
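
If you are working with plain NumPy arrays rather than scikit-allel's array classes, the same kind of subsetting can be sketched with boolean masks and fancy indexing. The array, masks and column indices below are all made up for illustration.

```python
import numpy as np

# toy genotype array: 5 variants x 4 samples x ploidy 2
rng = np.random.RandomState(42)
g_all = rng.randint(0, 2, size=(5, 4, 2)).astype('i1')

# hypothetical boolean masks over variants
seg_mask = np.array([True, False, True, True, False])   # segregating?
pass_mask = np.array([True, True, False, True, True])   # passed filters?

# hypothetical column indices of the samples to keep
cols = [0, 2]

# keep variants where both conditions hold, then keep the chosen samples
keep = seg_mask & pass_mask
g_sub = g_all[keep][:, cols]
print(g_sub.shape)
```

Variants 0 and 3 satisfy both masks, so the result has two variants, two samples and two alleles per call.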

## Phasing by transmission

I'm starting from unphased genotype calls, so the first thing to do is phase the calls to generate haplotypes. There are several options for phasing a cross. Here I'm going to use the [`phase_by_transmission()`](http://scikit-allel.readthedocs.io/en/latest/stats/mendel.html#allel.stats.mendel.phase_by_transmission) function from [scikit-allel](@@), because it's convenient and fast (a couple of seconds). We've found this function works well for crosses with relatively large numbers of progeny. However, if you have a smaller family with only a couple of progeny, and/or you have a more complicated pedigree with multiple generations, try phasing with [SHAPEIT2 + DuoHMM](@@).

In [19]:
genotypes_cross_phased = allel.phase_by_transmission(genotypes_cross, window_size=100)
genotypes_cross_phased

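| | 0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ./. | 0/0 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | ./. | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | 0\|1 | 0\|0 | 1\|0 | 1\|0 | 0\|0 | ... | 0\|0 | 0\|0 | 0\|0 | 0\|0 | 0\|0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 709396 | 1\|1 | 0\|0 | 1\|0 | 1\|0 | 1\|0 | ... | 1\|0 | 1\|0 | 1\|0 | 1\|0 | 1\|0 |
| 709397 | 0\|0 | 1\|1 | 0\|1 | 0\|1 | 0\|1 | ... | 0\|1 | 0\|1 | 0\|1 | 0\|1 | 0\|1 |
| 709398 | 0\|0 | 1\|1 | 0\|1 | 0\|1 | 0\|1 | ... | 0\|1 | 0\|1 | 0\|1 | 0\|1 | 0\|1 |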


Notice that most of the genotype calls in the snippet shown above now have a pipe character ('`|`') as the allele separator, indicating the call is phased. The first two SNPs are not phased, however, because the genotype call for one of the parents is missing.
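
To give some intuition for what phasing by transmission means at a single SNP, here is a deliberately simplified sketch for one progeny. It is not how scikit-allel implements `phase_by_transmission()`, which also phases the parents. The function `phase_child()` is hypothetical, and it assumes alleles are given as pairs of integers with -1 for missing, and that the result is ordered (maternal allele, paternal allele).

```python
def phase_child(mother, father, child):
    """Return (maternal_allele, paternal_allele) for the child at one SNP,
    or None if the phase cannot be determined from this SNP alone."""
    if -1 in mother or -1 in father or -1 in child:
        return None  # a missing call means we cannot phase
    a, b = child
    candidates = set()
    # try both ways of assigning the child's two alleles to the two parents
    for mat, pat in ((a, b), (b, a)):
        if mat in mother and pat in father:
            candidates.add((mat, pat))
    if len(candidates) == 1:
        return candidates.pop()
    # either ambiguous (e.g. all three heterozygous) or a Mendelian error
    return None


# mother 0/1, father 0/0, child 0/1: the 1 allele must come from the mother
print(phase_child((0, 1), (0, 0), (0, 1)))
# all three heterozygous: ambiguous from this SNP alone
print(phase_child((0, 1), (0, 1), (0, 1)))
# mother's call is missing: cannot phase
print(phase_child((-1, -1), (0, 0), (0, 0)))
```

The first case corresponds to the third SNP in the output above, where the mother is heterozygous, the father is homozygous reference, and the first progeny is phased as `1|0`.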

## Visualising transmission

Now we have phased data, let's plot the transmission of alleles from parents to progeny. This is a useful diagnostic for assessing the quality of the phasing, and also gives an indication of how much recombination has occurred.

Before plotting, I'm going to separate out the data into maternal and paternal haplotypes. The maternal haplotypes are the two haplotypes carried by the mother, and the haplotype in each of the progeny inherited from the mother. The paternal haplotypes are the same but for the father.

In [39]:
# pull out mother's genotypes from the first column
genotypes_mother = genotypes_cross_phased[:, 0]
# convert to haplotype array
haplotypes_mother = genotypes_mother.to_haplotypes()
# pull out maternal haplotypes from the progeny
haplotypes_progeny_maternal = allel.HaplotypeArray(genotypes_cross_phased[:, 2:, 0])
# stack mother's haplotypes alongside haplotypes she transmitted to her progeny
haplotypes_maternal = haplotypes_mother.concatenate(haplotypes_progeny_maternal, axis=1)
haplotypes_maternal

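| | 0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | . | . | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 1 | . | . | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 709396 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 |
| 709397 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 709398 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |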
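
The indexing above may be easier to follow on a toy example using plain NumPy. This sketch assumes the same layout as the phased genotype array (variants, samples, ploidy), with the mother in column 0, the father in column 1, and the first allele of each phased progeny call being the maternally inherited one. The toy data are made up.

```python
import numpy as np

# toy phased genotypes: 3 variants x 4 samples (mother, father, 2 progeny) x 2 alleles
g_phased_toy = np.array([
    [[0, 1], [0, 0], [1, 0], [0, 0]],
    [[1, 1], [0, 0], [1, 0], [1, 0]],
    [[0, 0], [1, 1], [0, 1], [0, 1]],
], dtype='i1')

# the mother's two haplotypes: all variants, sample 0, both alleles
h_mother = g_phased_toy[:, 0, :]

# the maternally inherited haplotype in each progeny: first allele of each call
h_progeny_maternal = g_phased_toy[:, 2:, 0]

# the paternally inherited haplotype in each progeny: second allele of each call
h_progeny_paternal = g_phased_toy[:, 2:, 1]

# stack the mother's haplotypes alongside those she transmitted
h_maternal = np.concatenate([h_mother, h_progeny_maternal], axis=1)
print(h_maternal)
```

Each column of `h_maternal` is one haplotype: the first two belong to the mother and the rest are the gametes she passed on.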


Let's fix on the mother for a moment. The mother has two haplotypes. Because recombination occurs during gamete formation, each haplotype the mother passes on to her progeny is a unique mosaic of her own two haplotypes. For any SNP where the two maternal haplotypes carry a different allele, we can "paint" the maternal haplotypes within the progeny according to which of the mother's two alleles was inherited, using the [`paint_transmission()`](@@) function.

In [26]:
painting_maternal = allel.paint_transmission(haplotypes_mother, haplotypes_progeny_maternal)
painting_maternal

array([[6, 6, 6, ..., 6, 6, 6],
       [6, 6, 6, ..., 6, 6, 6],
       [2, 2, 1, ..., 1, 1, 1],
       ..., 
       [4, 4, 4, ..., 4, 4, 4],
       [3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3]], dtype=uint8)

This new "painting" array is an array of integer codes, where each code records how the allele at that SNP in that progeny haplotype relates to the parent's two alleles. The meanings are given in the help text for the [`paint_transmission()`](@@) function:

In [35]:
help(allel.paint_transmission)

Help on function paint_transmission in module allel.stats.mendel:

paint_transmission(parent_haplotypes, progeny_haplotypes)
    Paint haplotypes inherited from a single diploid parent according to
    their allelic inheritance.
    
    Parameters
    ----------
    parent_haplotypes : array_like, int, shape (n_variants, 2)
        Both haplotypes from a single diploid parent.
    progeny_haplotypes : array_like, int, shape (n_variants, n_progeny)
        Haplotypes found in progeny of the given parent, inherited from the
        given parent. I.e., haplotypes from gametes of the given parent.
    
    Returns
    -------
    painting : ndarray, uint8, shape (n_variants, n_progeny)
        An array of integers coded as follows: 1 = allele inherited from
        first parental haplotype; 2 = allele inherited from second parental
        haplotype; 3 = reference allele, also carried by both parental
        haplotypes; 4 = non-reference allele, also carried by both parental
        hapl

We are particularly interested in plotting the "1" and "2" values, because these occur where the mother's haplotypes carried two different alleles, and so we have information about which allele has been transmitted.
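
To make the coding scheme more tangible, here is a rough NumPy sketch that reproduces codes 1 to 4 as described in the help text, plus code 5 (an allele not found on either parental haplotype, which turns up in the paternal painting later in this post). It is a simplified illustration and not scikit-allel's implementation: the function `paint_toy()` is made up, missing alleles are assumed to be coded as -1, and any site with a missing allele is simply left as 0 here, whereas the real function uses further codes (the maternal painting above shows 6 at SNPs where the mother's call is missing).

```python
import numpy as np


def paint_toy(parent_h, progeny_h):
    """Simplified painting covering codes 1-5 only; anything else stays 0."""
    parent_h = np.asarray(parent_h)
    progeny_h = np.asarray(progeny_h)
    # add a trailing axis so parental alleles broadcast across progeny
    p1 = parent_h[:, 0, np.newaxis]
    p2 = parent_h[:, 1, np.newaxis]
    called = (p1 >= 0) & (p2 >= 0) & (progeny_h >= 0)
    het = called & (p1 != p2)  # parent carries two different alleles
    hom = called & (p1 == p2)  # parent carries the same allele twice
    painting = np.zeros(progeny_h.shape, dtype='u1')
    painting[het & (progeny_h == p1)] = 1
    painting[het & (progeny_h == p2)] = 2
    painting[hom & (progeny_h == p1) & (p1 == 0)] = 3
    painting[hom & (progeny_h == p1) & (p1 > 0)] = 4
    painting[called & (progeny_h != p1) & (progeny_h != p2)] = 5
    return painting


parent_toy = [[0, 1], [0, 0], [1, 1], [0, 0], [-1, -1]]
progeny_toy = [[0, 1, 1], [0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 0, 0]]
painting_toy = paint_toy(parent_toy, progeny_toy)
print(painting_toy)
```

Only the first toy SNP is informative about transmission, because it is the only one where the parent is heterozygous.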

We can do the same for the father.

In [37]:
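# pull out father's genotypes from the second column
genotypes_father = genotypes_cross_phased[:, 1]
# convert to haplotype array
haplotypes_father = genotypes_father.to_haplotypes()
# pull out paternal haplotypes from the progeny
haplotypes_progeny_paternal = allel.HaplotypeArray(genotypes_cross_phased[:, 2:, 1])
# stack father's haplotypes alongside haplotypes he transmitted to his progeny
haplotypes_paternal = haplotypes_father.concatenate(haplotypes_progeny_paternal, axis=1)
haplotypes_paternal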

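| | 0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 709396 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 709397 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 |
| 709398 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 |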


In [38]:
painting_paternal = allel.paint_transmission(haplotypes_father, haplotypes_progeny_paternal)
painting_paternal

array([[5, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3],
       ..., 
       [3, 3, 3, ..., 3, 3, 3],
       [4, 4, 4, ..., 4, 4, 4],
       [4, 4, 4, ..., 4, 4, 4]], dtype=uint8)

Notice the "5" code in this snippet. This code indicates an allele that is found on a progeny haplotype but not present on either of the parent's haplotypes. This is not possible according to the rules of Mendelian transmission, and hence is a form of "Mendelian error".
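
Because a painting is just an integer array, tallying these errors takes a couple of lines of NumPy. The toy painting below is made up for illustration; with real data you would use an array like `painting_paternal` instead.

```python
import numpy as np

# toy painting: 5 variants x 3 progeny haplotypes
painting_errors_toy = np.array([
    [5, 3, 3],
    [1, 2, 1],
    [4, 5, 4],
    [3, 3, 5],
    [2, 5, 2],
], dtype='u1')

# True wherever a progeny allele is not found in the parent
is_error = painting_errors_toy == 5

# number of Mendelian errors in each progeny (columns) and at each SNP (rows)
errors_per_progeny = is_error.sum(axis=0)
errors_per_variant = is_error.sum(axis=1)
print(errors_per_progeny)
print(errors_per_variant)
```

A progeny with many more errors than its siblings, or a cluster of SNPs with errors, would be worth a closer look.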

Now we have these "paintings", it is fairly straightforward to plot them.