Today we made the first [data release](https://www.malariagen.net/data/ag1000g-phase2-ar1) from phase 2 of the *Anopheles gambiae* 1000 genomes project ([Ag1000G](http://www.malariagen.net/ag1000g)). These data include variant calls and haplotypes for 1,142 wild-caught mosquitoes from 13 African countries, and 234 mosquitoes from 11 lab crosses. In this article I give a quick tour of the release and summarize some of the main features of the data.

The data are available for download from a [public FTP site](ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/). I have a copy of some of the files downloaded to a directory on my computer, so I'll be loading data from there.

In [1]:
release_dir = 'data/ag1000g/phase2/AR1'

## Population sampling

Some metadata about the mosquitoes we've sampled is available in the "samples" sub-directory. Let's load metadata for the wild-caught mosquitoes.

In [3]:
import os
import pandas as pd
samples = pd.read_csv(os.path.join(release_dir, 'samples', 'samples.meta.txt'),
                      sep='\t', index_col='ox_code')
samples.head()

| ox_code | src_code | population | country | region | contributor | contact | year | m_s | sex | n_sequences | mean_coverage |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AA0040-C | Twifo_Praso__E2 | GHcol | Ghana | Twifo_Praso | David Weetman | | 2012.0 | M | F | 95033368 | 30.99 |
| AA0041-C | Twifo_Praso__H3 | GHcol | Ghana | Twifo_Praso | David Weetman | | 2012.0 | M | F | 95843804 | 31.7 |
| AA0042-C | Takoradi_C7 | GHcol | Ghana | Takoradi | David Weetman | | 2012.0 | M | F | 107420666 | 35.65 |
| AA0043-C | Takoradi_H8 | GHcol | Ghana | Takoradi | David Weetman | | 2012.0 | M | F | 95993752 | 29.46 |
| AA0044-C | Takoradi_D10 | GHcol | Ghana | Takoradi | David Weetman | | 2012.0 | M | F | 103044262 | 33.67 |


The "ox_code" column is the main identifier we use for each mosquito in our analyses. To confirm the number of individuals:

In [12]:
len(samples)

1142

Here's a breakdown of the number of mosquitoes by country:

In [6]:
samples.country.value_counts()

Cameroon             297
Burkina Faso         167
Uganda               112
Guinea-Bissau         91
Angola                78
Cote d'Ivoire         71
Gabon                 69
Ghana                 67
Gambia, The           65
Kenya                 48
Guinea                44
France                24
Equatorial Guinea      9
Name: country, dtype: int64

The mosquitoes from "France" were collected on Mayotte Island, and the mosquitoes from Equatorial Guinea were collected on Bioko Island.

### Population definitions

Ag1000G phase 2 includes mosquitoes from two species, *An. gambiae* and *An. coluzzii*. It also includes mosquitoes from populations that are hard to assign unambiguously to either species because of apparent mixed ancestry. To aid downstream analyses we have assigned each mosquito to one of 16 populations, based on country of origin and species. Here's a breakdown of the number of mosquitoes by population:

In [17]:
samples.population.value_counts()

CMgam    297
UGgam    112
BFgam     92
GW        91
AOcol     78
BFcol     75
CIcol     71
GAgam     69
GM        65
GHcol     55
KE        48
GNgam     40
FRgam     24
GHgam     12
GQgam      9
GNcol      4
Name: population, dtype: int64

Each population identifier is formed by concatenating the two-letter country code (e.g., "CM" for Cameroon) with an abbreviation for the species ("gam" means *An. gambiae*, "col" means *An. coluzzii*). There are three populations (GW, GM, KE) which we have not divided by species because of mixed ancestry. For all other populations, the species assignment for each individual was based on the results of conventional PCR-based molecular tests.
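To make the naming scheme concrete, here is a minimal sketch that splits each identifier into its country code and species, then sums the population counts by country code. The counts are copied from the output above; the helper name `split_population` and the `SPECIES` lookup are my own illustrative choices, not part of the release.

```python
from collections import Counter

# Population counts copied from the value_counts() output above.
population_counts = {
    'CMgam': 297, 'UGgam': 112, 'BFgam': 92, 'GW': 91, 'AOcol': 78,
    'BFcol': 75, 'CIcol': 71, 'GAgam': 69, 'GM': 65, 'GHcol': 55,
    'KE': 48, 'GNgam': 40, 'FRgam': 24, 'GHgam': 12, 'GQgam': 9,
    'GNcol': 4,
}

# Species abbreviations as described in the text.
SPECIES = {'gam': 'An. gambiae', 'col': 'An. coluzzii'}


def split_population(pop_id):
    """Split an identifier into (country_code, species).

    Hypothetical helper for illustration. Species is None for
    populations that were not divided by species (GW, GM, KE).
    """
    country_code, suffix = pop_id[:2], pop_id[2:]
    return country_code, SPECIES.get(suffix)


# Sum the number of mosquitoes per two-letter country code.
by_country_code = Counter()
for pop_id, n in population_counts.items():
    country_code, _ = split_population(pop_id)
    by_country_code[country_code] += n

print(split_population('BFcol'))
print(by_country_code['BF'], by_country_code['GH'], by_country_code['GN'])
```

This prints `('BF', 'An. coluzzii')` followed by `167 67 44`, which agrees with the per-country totals for Burkina Faso, Ghana and Guinea shown earlier.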

### Lab crosses

There are also 11 lab crosses included in this release. Each cross comprises 2 parents and up to 20 progeny. 

In [39]:
crosses = pd.read_csv(os.path.join(release_dir, 'samples', 'cross.samples.meta.txt'),
                      sep='\t', usecols=range(1, 9), index_col='ox_code')
crosses.head()

| ox_code | cross | role | n_reads | median_cov | mean_cov | sex | colony_id |
|---|---|---|---|---|---|---|---|
| AD0142-C | 18-5 | parent | 60486753 | 26 | 25.824447 | F | Ghana |
| AD0143-C | 18-5 | parent | 58509103 | 19 | 18.800118 | M | Kisumu/G3 |
| AD0146-C | 18-5 | progeny | 101612499 | 44 | 43.494594 | | |
| AD0147-C | 18-5 | progeny | 50710020 | 16 | 16.284487 | | |
| AD0148-C | 18-5 | progeny | 59023991 | 19 | 18.978021 | | |


Each cross has been given an identifier like "18-5". These identifiers are arbitrary and carry no meaning. Here's a breakdown of the number of individuals by cross:

In [43]:
crosses.cross.value_counts()

37-3    22
46-9    22
47-6    22
80-2    22
45-1    22
36-9    22
29-2    22
18-5    22
73-2    21
78-2    21
42-4    16
Name: cross, dtype: int64
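These totals are consistent with 2 parents plus up to 20 progeny per cross. One way to check the split directly is to tabulate the "role" column by cross. The sketch below uses only the five rows shown in the preview above, so the progeny count is not the real total for cross 18-5. With the full file loaded, the same `pd.crosstab` call could be applied to the `crosses` data frame.

```python
import io

import pandas as pd

# Five rows copied from the preview above (cross and role columns only).
# This is an excerpt for illustration, not the full metadata file.
excerpt = io.StringIO(
    "ox_code\tcross\trole\n"
    "AD0142-C\t18-5\tparent\n"
    "AD0143-C\t18-5\tparent\n"
    "AD0146-C\t18-5\tprogeny\n"
    "AD0147-C\t18-5\tprogeny\n"
    "AD0148-C\t18-5\tprogeny\n"
)

# Read the cross identifier as a string so it is never interpreted
# as anything other than a label.
crosses_excerpt = pd.read_csv(excerpt, sep='\t', index_col='ox_code',
                              dtype={'cross': str})

# Rows are crosses, columns are roles, cells are numbers of individuals.
role_counts = pd.crosstab(crosses_excerpt['cross'], crosses_excerpt['role'])
print(role_counts)
```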

The parents of the crosses came from various commonly used lab colonies, e.g., "Mali" or "Pimperena". Because of the way the crosses were performed, in some cases we could not be completely certain of the parent colony, and these are labelled as ambiguous, e.g., "Kisumu/G3". Here's a count of which colonies were used for the parents:

In [45]:
crosses[crosses.role == 'parent'].colony_id.value_counts()

Mali            6
Kisumu          5
Ghana           4
Kisumu/Ghana    2
Akron           2
Pimperena       2
Kisumu/G3       1
Name: colony_id, dtype: int64

## Variation data

We have run SNP discovery on the cohort of wild-caught individuals using GATK UnifiedGenotyper, and then annotated the discovered variants with various quality filters. The canonical VCF files containing these variation data are in the "variation/main/vcf/all" sub-folder.

In [48]:
print(os.listdir(os.path.join(release_dir, 'variation', 'main', 'vcf', 'all')))

FileNotFoundError: [Errno 2] No such file or directory: 'data/ag1000g/phase2/AR1/variation/main/vcf/all'
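The error above most likely just means that this sub-directory is not among the files I have downloaded locally, since I only have a partial copy of the release. When working from a partial copy, a slightly more forgiving listing can be convenient. Here is a minimal sketch, where the helper name `list_if_present` is hypothetical:

```python
import os


def list_if_present(path):
    """Return the sorted contents of a directory.

    Returns an empty list (and prints a note) if the path is not an
    existing directory, e.g. because it was not downloaded.
    """
    if not os.path.isdir(path):
        print('not available locally:', path)
        return []
    return sorted(os.listdir(path))


release_dir = 'data/ag1000g/phase2/AR1'
vcf_dir = os.path.join(release_dir, 'variation', 'main', 'vcf', 'all')
vcf_files = list_if_present(vcf_dir)
print(vcf_files)
```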

Various subsets of the variation data are also available, which may be more convenient for some analyses. We've also converted the variation data to HDF5 format files, which can be faster to process for some analyses. 

Here are a few summary statistics about the variation data: