Today we made the first [data release](https://www.malariagen.net/data/ag1000g-phase2-ar1) from phase 2 of the *Anopheles gambiae* 1000 genomes project ([Ag1000G](http://www.malariagen.net/ag1000g)). These data include variant calls and haplotypes for 1,142 wild-caught mosquitoes from 13 African countries, and 234 mosquitoes from 11 lab crosses. In this article I thought I would give a quick tour of the data release, summarizing some of the main features of the data.

The data are available for download from a [public FTP site](ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/). I have a copy of some of the files downloaded to a directory on my computer, so I'll be loading data from there. This article was written as a [Jupyter notebook](@@TODO), and I'll be using a bit of Python code to help inspect the data.

In [2]:
release_dir = 'data/ag1000g/phase2/AR1'

## Sample metadata

Some metadata about the mosquitoes included in Ag1000G phase 2 is available in the "samples" sub-directory. Let's look at the wild-caught mosquitoes first, then the lab crosses. 

### Wild-caught mosquitoes

Here are metadata for the mosquitoes sampled from natural populations:

In [16]:
import os
import pandas as pd
wild_samples = pd.read_csv(os.path.join(release_dir, 'samples', 'samples.meta.txt'),
                           sep='\t', index_col='ox_code')
wild_samples.head()

| ox_code | src_code | population | country | region | contributor | contact | year | m_s | sex | n_sequences | mean_coverage |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AA0040-C | Twifo_Praso__E2 | GHcol | Ghana | Twifo_Praso | David Weetman | | 2012.0 | M | F | 95033368 | 30.99 |
| AA0041-C | Twifo_Praso__H3 | GHcol | Ghana | Twifo_Praso | David Weetman | | 2012.0 | M | F | 95843804 | 31.7 |
| AA0042-C | Takoradi_C7 | GHcol | Ghana | Takoradi | David Weetman | | 2012.0 | M | F | 107420666 | 35.65 |
| AA0043-C | Takoradi_H8 | GHcol | Ghana | Takoradi | David Weetman | | 2012.0 | M | F | 95993752 | 29.46 |
| AA0044-C | Takoradi_D10 | GHcol | Ghana | Takoradi | David Weetman | | 2012.0 | M | F | 103044262 | 33.67 |


The "ox_code" column is the main identifier we use for each mosquito in our analyses. Just to confirm how many individuals:

In [17]:
len(wild_samples)

1142

Here's a breakdown of the number of mosquitoes by country:

In [18]:
wild_samples.country.value_counts()

Cameroon             297
Burkina Faso         167
Uganda               112
Guinea-Bissau         91
Angola                78
Cote d'Ivoire         71
Gabon                 69
Ghana                 67
Gambia, The           65
Kenya                 48
Guinea                44
France                24
Equatorial Guinea      9
Name: country, dtype: int64

The mosquitoes from "France" were collected on Mayotte Island, and the mosquitoes from Equatorial Guinea were collected on Bioko Island.

Ag1000G phase 2 includes mosquitoes from two species, *An. gambiae* and *An. coluzzii*. It also includes mosquitoes from populations that are hard to assign unambiguously to either species because of some apparent mixed ancestry. To aid with downstream analyses we have assigned each mosquito to one of 16 populations, based on country of origin and species. Here's a breakdown of the number of mosquitoes by population:

In [19]:
wild_samples.population.value_counts()

CMgam    297
UGgam    112
BFgam     92
GW        91
AOcol     78
BFcol     75
CIcol     71
GAgam     69
GM        65
GHcol     55
KE        48
GNgam     40
FRgam     24
GHgam     12
GQgam      9
GNcol      4
Name: population, dtype: int64

Each population identifier is formed by concatenating the two-letter country code (e.g., "CM" for Cameroon) with an abbreviation for the species ("gam" means *An. gambiae*, "col" means *An. coluzzii*). There are three populations (GW, GM, KE) where we have not divided by species because of mixed ancestry. For all other populations, the species assignment for each individual was based on the results of conventional PCR-based molecular tests.
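
To make the naming scheme concrete, here is a minimal sketch of how an identifier could be split back into its parts. The `parse_population` helper and the `SPECIES` lookup are hypothetical names of my own, not something shipped with the data release.

```python
# Hypothetical helper, not part of the data release: split a population
# identifier into its country code and species label.
SPECIES = {'gam': 'An. gambiae', 'col': 'An. coluzzii'}


def parse_population(pop_id):
    """Return (country_code, species) for an identifier such as 'CMgam'."""
    country_code, suffix = pop_id[:2], pop_id[2:]
    # Identifiers with no suffix (GW, GM, KE) were not divided by species,
    # so there is no species label to report for them.
    species = SPECIES[suffix] if suffix else None
    return country_code, species


for pop_id in ['CMgam', 'BFcol', 'GW']:
    print(pop_id, parse_population(pop_id))
```

An unexpected suffix raises a `KeyError` rather than being silently accepted, which seems the safer default for a lookup like this.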

### Lab crosses

There are also 11 lab crosses included in this release. Each cross comprises 2 parents and up to 20 progeny. 

In [20]:
cross_samples = pd.read_csv(os.path.join(release_dir, 'samples', 'cross.samples.meta.txt'),
                            sep='\t', usecols=range(1, 9), index_col='ox_code')
cross_samples.head()

| ox_code | cross | role | n_reads | median_cov | mean_cov | sex | colony_id |
|---|---|---|---|---|---|---|---|
| AD0142-C | 18-5 | parent | 60486753 | 26 | 25.824447 | F | Ghana |
| AD0143-C | 18-5 | parent | 58509103 | 19 | 18.800118 | M | Kisumu/G3 |
| AD0146-C | 18-5 | progeny | 101612499 | 44 | 43.494594 | | |
| AD0147-C | 18-5 | progeny | 50710020 | 16 | 16.284487 | | |
| AD0148-C | 18-5 | progeny | 59023991 | 19 | 18.978021 | | |


Each cross has been given an identifier like "18-5". These identifiers are arbitrary and carry no meaning. Here's a breakdown of the number of individuals by cross:

In [21]:
cross_samples.cross.value_counts()

46-9    22
29-2    22
18-5    22
36-9    22
37-3    22
45-1    22
47-6    22
80-2    22
78-2    21
73-2    21
42-4    16
Name: cross, dtype: int64

The parents of the crosses came from various commonly used lab colonies, e.g., "Mali" or "Pimperena". Because of the way the crosses were performed, in some cases we could not be completely certain of the parent colony, and these are labelled as ambiguous, e.g., "Kisumu/G3". Here's a count of which colonies were used for the parents:

In [22]:
cross_samples[cross_samples.role == 'parent'].colony_id.value_counts()

Mali            6
Kisumu          5
Ghana           4
Akron           2
Kisumu/Ghana    2
Pimperena       2
Kisumu/G3       1
Name: colony_id, dtype: int64
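
If you want to separate the ambiguous parents from the rest, something like the sketch below might do. It assumes that a "/" in the label separates the candidate colonies, which is my reading of labels like "Kisumu/G3" rather than a documented rule, and the helper names are hypothetical. The counts are copied from the output above.

```python
# Parent colony counts copied from the value_counts() output above.
parent_colonies = {
    'Mali': 6, 'Kisumu': 5, 'Ghana': 4, 'Akron': 2,
    'Kisumu/Ghana': 2, 'Pimperena': 2, 'Kisumu/G3': 1,
}


def candidate_colonies(colony_id):
    """Split a label into candidate colonies (assumes '/' is the separator)."""
    return colony_id.split('/')


def is_ambiguous(colony_id):
    """A label naming more than one candidate colony is treated as ambiguous."""
    return len(candidate_colonies(colony_id)) > 1


n_parents = sum(parent_colonies.values())
n_ambiguous = sum(n for label, n in parent_colonies.items() if is_ambiguous(label))
print(n_ambiguous, 'of', n_parents, 'parents have an ambiguous colony')
```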

## Variation data

We have run SNP discovery on the wild-caught individuals, using GATK UnifiedGenotyper. We have then annotated the variants discovered with various quality filters. The canonical VCF files containing these variation data are in the "variation/main/vcf/all" sub-folder, with one VCF file per chromosome arm.

In [43]:
from humanize import naturalsize
vcf_dir = os.path.join(release_dir, 'variation', 'main', 'vcf', 'all')
for file_name in os.listdir(vcf_dir):
    if file_name.endswith('.vcf.gz'):
        file_path = os.path.join(vcf_dir, file_name)
        print(file_name, ':', naturalsize(os.stat(file_path).st_size))

ag1000g.phase2.ar1.UNKN.vcf.gz : 44.4 GB
ag1000g.phase2.ar1.X.vcf.gz : 67.9 GB
ag1000g.phase2.ar1.3L.vcf.gz : 131.8 GB
ag1000g.phase2.ar1.2R.vcf.gz : 177.3 GB
ag1000g.phase2.ar1.Y_unplaced.vcf.gz : 24.1 MB
ag1000g.phase2.ar1.3R.vcf.gz : 181.1 GB
ag1000g.phase2.ar1.2L.vcf.gz : 152.3 GB


Some of these files are large, with each of the autosomal chromosome arms weighing in at over 100 GB. To make life a bit easier, various subsets of the variation data are also available, which may be more convenient for some analyses. We've also converted the variation data to HDF5 format files, which can be faster to process for some analyses.

I'm going to use some HDF5 files from the "variation/main/hdf5/lite" sub-folder to extract some summary statistics about the variation data. These files have been pared down to contain only the essential data needed for most analyses, and so are a bit smaller and easier to move around.

In [45]:
hdf5_dir = os.path.join(release_dir, 'variation', 'main', 'hdf5', 'lite')
for file_name in os.listdir(hdf5_dir):
    if file_name.endswith('.h5'):
        file_path = os.path.join(hdf5_dir, file_name)
        print(file_name, ':', naturalsize(os.stat(file_path).st_size))

ag1000g.phase2.ar1.pass.biallelic.lite.h5 : 4.7 GB
ag1000g.phase2.ar1.lite.h5 : 17.7 GB
ag1000g.phase2.ar1.pass.lite.h5 : 7.0 GB


Let's get a summary of how many variants were discovered:

In [53]:
import h5py
import numpy as np
callset = h5py.File(os.path.join(hdf5_dir, 'ag1000g.phase2.ar1.lite.h5'), mode='r')
chromosomes = '2L', '2R', '3L', '3R', 'UNKN', 'X', 'Y_unplaced'
n_variants_total = 0
n_variants_total_pass = 0
for chrom in chromosomes:
    n_variants = len(callset[chrom]['variants/POS'])
    n_variants_total += n_variants
    n_variants_pass = np.count_nonzero(callset[chrom]['variants/FILTER_PASS'][:])
    n_variants_total_pass += n_variants_pass
    print(chrom, ': {:,} SNPs; {:,} PASS ({:.1f}%)'
          .format(n_variants, n_variants_pass, n_variants_pass * 100 / n_variants))
print('Total : {:,} SNPs; {:,} PASS ({:.1f}%)'
      .format(n_variants_total, n_variants_total_pass, n_variants_total_pass * 100 / n_variants_total))


2L : 21,442,865 SNPs; 11,524,923 PASS (53.7%)
2R : 24,767,689 SNPs; 15,425,222 PASS (62.3%)
3L : 18,167,056 SNPs; 10,640,388 PASS (58.6%)
3R : 24,943,504 SNPs; 14,481,509 PASS (58.1%)
UNKN : 6,759,497 SNPs; 0 PASS (0.0%)
X : 9,389,639 SNPs; 5,765,843 PASS (61.4%)
Y_unplaced : 16,448 SNPs; 0 PASS (0.0%)
Total : 105,486,698 SNPs; 57,837,885 PASS (54.8%)


So there are about 105 million SNPs in the raw dataset, of which nearly 58 million (54.8%) passed all our quality filters.
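
The overall figure is pulled down by UNKN and Y_unplaced, which have no PASS variants at all in the output above. As a rough sketch, the arithmetic below recomputes the pass rate with and without them, using the counts copied from that output. The `pass_rate` helper is just an illustrative name.

```python
# (total SNPs, PASS SNPs) per chromosome, copied from the summary above.
counts = {
    '2L': (21442865, 11524923),
    '2R': (24767689, 15425222),
    '3L': (18167056, 10640388),
    '3R': (24943504, 14481509),
    'UNKN': (6759497, 0),
    'X': (9389639, 5765843),
    'Y_unplaced': (16448, 0),
}


def pass_rate(subset):
    """Percentage of SNPs that are PASS across the given chromosomes."""
    n_total = sum(n for n, _ in subset.values())
    n_pass = sum(p for _, p in subset.values())
    return n_pass * 100 / n_total


# Pass rate over everything, matching the "Total" line above.
overall = pass_rate(counts)

# Pass rate restricted to chromosomes that have at least one PASS variant.
with_pass = {chrom: c for chrom, c in counts.items() if c[1] > 0}
restricted = pass_rate(with_pass)

print('All chromosomes : {:.1f}% PASS'.format(overall))
print('{} only : {:.1f}% PASS'.format(', '.join(with_pass), restricted))
```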

An interesting feature of the data is how common multiallelic SNPs are. E.g., considering only PASS variants:

In [57]:
for chrom in chromosomes:
    variants = callset[chrom]['variants']
    num_alleles = variants['num_alleles'][:]
    filter_pass = variants['FILTER_PASS'][:]
    n_variants_pass = np.count_nonzero(filter_pass)
    if n_variants_pass:
        num_alleles_pass = num_alleles[filter_pass]
        allelism_count = np.bincount(num_alleles_pass)
        print(chrom, ':', 
              '{:.1f}% biallelic;'.format(allelism_count[2] * 100 / n_variants_pass),
              '{:.1f}% triallelic;'.format(allelism_count[3] * 100 / n_variants_pass),
              '{:.1f}% quadallelic;'.format(allelism_count[4] * 100 / n_variants_pass))

2L : 77.3% biallelic; 20.8% triallelic; 1.9% quadallelic;
2R : 78.1% biallelic; 20.0% triallelic; 1.9% quadallelic;
3L : 74.2% biallelic; 23.2% triallelic; 2.6% quadallelic;
3R : 74.3% biallelic; 23.1% triallelic; 2.6% quadallelic;
X : 77.6% biallelic; 20.5% triallelic; 2.0% quadallelic;
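
In case the tallying step in that loop isn't obvious, here is a toy illustration of what `np.bincount` is doing. It uses a small pure-Python stand-in instead of NumPy so it runs anywhere, and the ten allele counts are made up for the example, not taken from the callset.

```python
def bincount(values):
    """Pure-Python stand-in for np.bincount on small non-negative integers.

    Element k of the result is the number of times k occurs in values.
    """
    counts = [0] * (max(values) + 1)
    for v in values:
        counts[v] += 1
    return counts


# Made-up toy data: number of alleles at ten hypothetical PASS variants.
num_alleles_pass = [2, 2, 3, 2, 4, 2, 3, 2, 2, 2]
allelism_count = bincount(num_alleles_pass)
n_variants_pass = len(num_alleles_pass)

# Index 2 holds the biallelic count, index 3 triallelic, index 4 quadallelic.
for k, label in [(2, 'biallelic'), (3, 'triallelic'), (4, 'quadallelic')]:
    print('{:.1f}% {}'.format(allelism_count[k] * 100 / n_variants_pass, label))
```

One thing to watch with the real `np.bincount`: the result is only as long as the largest value seen, so indexing position 4 would fail on a chromosome with no quadallelic variants. Passing `minlength=5` avoids that.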
