*This post gives an introduction to functions for extracting data from [Variant Call Format (VCF)](http://TODO) files and loading into [NumPy](http://TODO) arrays, [pandas](http://TODO) data frames or [HDF5](http://TODO) files for ease of analysis. These functions are available in [scikit-allel](http://TODO) version 1.1 or later. Any feedback or bug reports welcome.*

## Introduction

### Variant Call Format (VCF)

VCF is a widely-used file format for genetic variation data. Here is an example of a small VCF file, based on the example given in the [VCF specification](https://samtools.github.io/hts-specs/VCFv4.3.pdf):

In [41]:
with open('example.vcf', mode='r') as vcf:
    print(vcf.read(-1))

##fileformat=VCFv4.3
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	DP=14;AF=0.5;DB	GT:DP	0/0:1	0/1:8	1/1:5
20	17330	.	T	A	3	q10	DP=11;AF=0.017	GT:DP	0/0:3	0/1:5	0/0:41
20	1110696	rs6040355	A	G,T	67	PASS	DP=10;AF=0.333,0.667;DB	GT:DP	0/2:6	1/2:0	2/2:4
20	1230237	.	T	.	47	PASS	DP=13	GT:DP	0/0:7	0/0:4	0/0:.
20	1234567	

A VCF file begins with a number of meta-information lines, which start with two hash ('##') characters. Then there is a single header line beginning with a single hash ('#') character. After the header line there are data lines, with each data line describing a genetic variant at a particular position relative to the reference genome of whichever species you are studying. Each data line is divided into fields separated by tab characters. There are 9 fixed fields, labelled "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO" and "FORMAT". Following these are fields containing data about samples. 
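To make this layout concrete, here is a minimal pure-Python sketch that sorts the lines of a VCF file into meta-information lines, the header line and data lines, then splits each data line on tabs. The function name `split_vcf_lines` and the shortened example text are made up for illustration; this is not how scikit-allel parses VCF.

```python
def split_vcf_lines(text):
    """Split VCF text into (meta lines, header fields, data records)."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith('##'):
            # Meta-information line.
            meta.append(line)
        elif line.startswith('#'):
            # Single header line; drop the leading '#'.
            header = line[1:].split('\t')
        else:
            # Data line: one variant, fields separated by tabs.
            records.append(line.split('\t'))
    return meta, header, records


example = (
    '##fileformat=VCFv4.3\n'
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\n'
    '20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\tGT\t0/0\t0/1\n'
)

meta, header, records = split_vcf_lines(example)
# The first 9 header fields are fixed; the rest are sample identifiers.
fixed, samples = header[:9], header[9:]
print(len(meta), fixed, samples)
print(dict(zip(header, records[0])))
```

Everything after the ninth column is per-sample data, which is why the sample identifiers can be read straight off the header line.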

For example, the first data line in the file above describes a variant on chromosome 20 at position 14370 relative to the B36 assembly of the human genome. The reference allele is 'G' and the alternate allele is 'A', so this is a single nucleotide polymorphism (SNP). In this file there are three fields with data about samples labelled 'NA00001', 'NA00002' and 'NA00003'. The genotype call in the first sample is '0/0', meaning that individual 'NA00001' is homozygous for the reference allele at this position. The genotype call for the second sample is '0/1' (you may need to scroll across to see this), which means that individual 'NA00002' is heterozygous for the reference and alternate alleles at this position.
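The way genotype calls index into the alleles can be sketched in a few lines of plain Python: '0' refers to the REF allele, and '1', '2', ... refer to the comma-separated ALT alleles in order. The helper names `decode_genotype` and `is_heterozygous` are hypothetical and exist only for this illustration.

```python
def decode_genotype(gt, ref, alt):
    """Translate a GT string such as '0/1' into a list of allele sequences."""
    alleles = [ref] + alt.split(',')
    # '/' separates unphased alleles, '|' separates phased alleles.
    calls = gt.replace('|', '/').split('/')
    return [None if c == '.' else alleles[int(c)] for c in calls]


def is_heterozygous(gt):
    """True if the call contains at least two different non-missing alleles."""
    calls = [c for c in gt.replace('|', '/').split('/') if c != '.']
    return len(set(calls)) > 1


print(decode_genotype('0/0', 'G', 'A'))    # ['G', 'G']
print(decode_genotype('0/1', 'G', 'A'))    # ['G', 'A']
print(decode_genotype('1/2', 'A', 'G,T'))  # ['G', 'T']
print(is_heterozygous('0/0'), is_heterozygous('0/1'))
```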

### NumPy/pandas/HDF5/...

There are a number of software tools that can read VCF files and perform various analyses. However, if your dataset is large and/or you need to do some bespoke analysis, then it can be faster and more convenient to first extract the necessary data from the VCF file and load into a more efficient storage container.

For analysis and plotting of numerical data in Python, it is very convenient to load data into [NumPy arrays](@@TODO). A NumPy array is an in-memory data structure that provides support for fast arithmetic and data manipulation. For analysing tables of data, [pandas DataFrames](@@TODO) provide useful features such as querying, aggregation and joins. When data are too large to fit into main memory, [HDF5 files](@@TODO) and [Zarr arrays](@@TODO) can provide fast on-disk storage and retrieval of numerical arrays. 

[scikit-allel](@@TODO) is a Python package intended to enable exploratory analysis of large-scale genetic variation data. Version 1.1.0 of scikit-allel adds some new functions for extracting data from VCF files and loading the data into NumPy arrays, pandas DataFrames or HDF5 files. Once you have extracted these data, there are many analyses that can be run interactively on a commodity laptop or desktop computer, even with large-scale datasets from population resequencing studies. To give a flavour of what can be done, there are a few previous articles on my blog, including [variant and sample QC](@@TODO), [allele frequency differentiation](@@TODO), [population structure](@@TODO), and [recombination in genetic crosses](@@TODO).

## [`read_vcf()`](@@TODO)

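Let's start with the `read_vcf()` function, which reads data from a VCF file into NumPy arrays held in memory. It returns a Python dictionary mapping field names to arrays. By default, as shown below, the dictionary contains the sample identifiers, the fixed variant fields and the genotype calls. Genotype calls are loaded as a 3-dimensional integer array, where the dimensions correspond to variants, samples and ploidy. This array can be wrapped in a `GenotypeArray` for convenient display and analysis.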

In [5]:
import allel
print(allel.__version__)

1.1.0b5


In [6]:
callset = allel.read_vcf('example.vcf')

In [7]:
sorted(callset.keys())

['calldata/GT',
 'samples',
 'variants/ALT',
 'variants/CHROM',
 'variants/FILTER_PASS',
 'variants/ID',
 'variants/POS',
 'variants/QUAL',
 'variants/REF']

In [8]:
callset['samples']

array(['NA00001', 'NA00002', 'NA00003'], dtype=object)

In [9]:
callset['variants/CHROM']

array(['20', '20', '20', '20', '20'], dtype=object)

In [10]:
callset['variants/POS']

array([  14370,   17330, 1110696, 1230237, 1234567], dtype=int32)

In [11]:
callset['variants/QUAL']

array([ 29.,   3.,  67.,  47.,  50.], dtype=float32)

In [12]:
callset['calldata/GT']

array([[[0, 0],
        [0, 1],
        [1, 1]],

       [[0, 0],
        [0, 1],
        [0, 0]],

       [[0, 2],
        [1, 2],
        [2, 2]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 1],
        [0, 2],
        [1, 1]]], dtype=int8)

In [13]:
gt = allel.GenotypeArray(callset['calldata/GT'])
gt

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0/0 | 0/1 | 1/1 |
| 1 | 0/0 | 0/1 | 0/0 |
| 2 | 0/2 | 1/2 | 2/2 |
| 3 | 0/0 | 0/0 | 0/0 |
| 4 | 0/1 | 0/2 | 1/1 |


In [14]:
gt.is_het()

array([[False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False, False, False],
       [ True,  True, False]], dtype=bool)

In [15]:
gt.count_het(axis=1)

array([1, 1, 2, 0, 2])

In [34]:
ac = gt.count_alleles()
ac

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3 | 3 | 0 |
| 1 | 5 | 1 | 0 |
| 2 | 1 | 1 | 4 |
| 3 | 6 | 0 | 0 |
| 4 | 2 | 3 | 1 |
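The counting behind `is_het()`, `count_het()` and `count_alleles()` can be reproduced in plain Python for the five diploid variants in `example.vcf`. This is only a sketch of the arithmetic, not scikit-allel's implementation; the helper names are hypothetical, and the heterozygosity check assumes there are no missing alleles, which holds for this data.

```python
# Genotype calls copied from the 'calldata/GT' array shown earlier:
# one row per variant, one [allele, allele] pair per sample.
genotypes = [
    [[0, 0], [0, 1], [1, 1]],
    [[0, 0], [0, 1], [0, 0]],
    [[0, 2], [1, 2], [2, 2]],
    [[0, 0], [0, 0], [0, 0]],
    [[0, 1], [0, 2], [1, 1]],
]


def het_count(variant):
    """Number of samples whose call contains more than one distinct allele."""
    return sum(len(set(call)) > 1 for call in variant)


def allele_counts(variant, n_alleles):
    """Count how many times each allele index is observed at a variant."""
    counts = [0] * n_alleles
    for call in variant:
        for allele in call:
            if allele >= 0:  # negative values would mark missing alleles
                counts[allele] += 1
    return counts


# Number of distinct allele indices across the whole dataset.
n_alleles = 1 + max(a for v in genotypes for call in v for a in call)
het_per_variant = [het_count(v) for v in genotypes]
counts_per_variant = [allele_counts(v, n_alleles) for v in genotypes]
print(het_per_variant)     # [1, 1, 2, 0, 2]
print(counts_per_variant)
```

Each row of allele counts sums to 6, because there are three diploid samples and no missing calls.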


### Selecting fields

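By default, `read_vcf()` extracts only a subset of the fields in the file. Use the `fields` argument to choose which fields to extract. In the first example below, the bare name 'DP' is resolved to the INFO field and returned as 'variants/DP', while the 'calldata/' prefix selects a FORMAT field explicitly. Note that missing integer values are filled with -1. To extract every field, pass `fields='*'`.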

In [16]:
callset = allel.read_vcf('example.vcf', fields=['DP', 'calldata/HQ'])
sorted(callset.keys())

['calldata/HQ', 'variants/DP']

In [17]:
callset['variants/DP']

array([14, 11, 10, 13,  9], dtype=int32)

In [18]:
callset['calldata/HQ']

array([[[51, 51],
        [51, 51],
        [-1, -1]],

       [[58, 50],
        [65,  3],
        [-1, -1]],

       [[23, 27],
        [18,  2],
        [-1, -1]],

       [[56, 60],
        [51, 51],
        [-1, -1]],

       [[-1, -1],
        [-1, -1],
        [-1, -1]]], dtype=int8)

In [19]:
callset = allel.read_vcf('example.vcf', fields='*')
sorted(callset.keys())

['calldata/DP',
 'calldata/GQ',
 'calldata/GT',
 'calldata/HQ',
 'samples',
 'variants/AA',
 'variants/AF',
 'variants/ALT',
 'variants/CHROM',
 'variants/DB',
 'variants/DP',
 'variants/FILTER_PASS',
 'variants/FILTER_q10',
 'variants/FILTER_s50',
 'variants/H2',
 'variants/ID',
 'variants/NS',
 'variants/POS',
 'variants/QUAL',
 'variants/REF',
 'variants/is_snp',
 'variants/numalt',
 'variants/svlen']
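Fields such as 'variants/DP', 'variants/AF' and 'variants/DB' come from the INFO column of each data line. As a rough illustration of where those values originate, here is a pure-Python sketch that parses an INFO string into a dictionary. The function `parse_info` is hypothetical, and unlike scikit-allel it leaves values as strings instead of converting them to typed arrays.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'DP=14;AF=0.5;DB' into a dict."""
    parsed = {}
    if info == '.':
        # A lone '.' means the INFO column is empty.
        return parsed
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        # A key with no '=' is a flag (e.g. DB); record it as True.
        parsed[key] = value.split(',') if sep else True
    return parsed


info = parse_info('DP=10;AF=0.333,0.667;DB')
# Prefix with 'variants/' to mimic the key naming used by read_vcf().
print({'variants/' + key: value for key, value in info.items()})
```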

### Data type

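Each field is loaded with a default NumPy data type, which can be overridden via the `types` argument. Take care with fixed-width string types: a type that is too short will truncate values without warning, as in the last example below, where 'S1' turns the reference allele 'GTC' into b'G'.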

In [20]:
callset = allel.read_vcf('example.vcf', fields=['DP'])
callset['variants/DP']

array([14, 11, 10, 13,  9], dtype=int32)

In [21]:
callset = allel.read_vcf('example.vcf', fields=['DP'], types={'DP': 'int16'})
callset['variants/DP']

array([14, 11, 10, 13,  9], dtype=int16)

In [22]:
callset = allel.read_vcf('example.vcf', fields=['DP'], types={'DP': 'float32'})
callset['variants/DP']

array([ 14.,  11.,  10.,  13.,   9.], dtype=float32)

In [23]:
callset = allel.read_vcf('example.vcf', fields=['REF'])
callset['variants/REF']

array(['G', 'T', 'A', 'T', 'GTC'], dtype=object)

In [24]:
callset = allel.read_vcf('example.vcf', fields=['REF'], types={'REF': 'S3'})
callset['variants/REF']

array([b'G', b'T', b'A', b'T', b'GTC'], 
      dtype='|S3')

In [25]:
callset = allel.read_vcf('example.vcf', fields=['REF'], types={'REF': 'S1'})
callset['variants/REF']

array([b'G', b'T', b'A', b'T', b'G'], 
      dtype='|S1')
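The truncation seen with 'S1' is a general property of fixed-width byte strings, and can be reproduced with the standard library alone. The sketch below uses `struct.pack`, which truncates or null-pads a value to an exact width; it illustrates the storage model only and is not what scikit-allel or NumPy do internally. The helper name `to_fixed_width` is hypothetical.

```python
import struct


def to_fixed_width(value, width):
    """Encode a string into exactly `width` bytes, truncating or null-padding."""
    return struct.pack('%ds' % width, value.encode('ascii'))


refs = ['G', 'T', 'A', 'T', 'GTC']
# Width 3 holds every value; short values are padded with null bytes.
print([to_fixed_width(ref, 3) for ref in refs])
# Width 1 silently drops everything after the first character.
print([to_fixed_width(ref, 1) for ref in refs])
```

NumPy hides the trailing null bytes when displaying fixed-width strings, which is why the 'S3' output above shows b'G' and not b'G\x00\x00'.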

### Number of values

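Some fields, such as ALT, can hold a different number of values for each variant. NumPy arrays need a fixed shape, so these fields are loaded with a fixed number of columns and short rows are padded with empty values. As shown below, ALT is loaded with 3 values per variant by default. Use the `numbers` argument to change this. Any values beyond the requested number are dropped, as in the last example, where only the first alternate allele is kept.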

In [26]:
callset = allel.read_vcf('example.vcf', fields=['ALT'])
callset['variants/ALT']

array([['A', '', ''],
       ['A', '', ''],
       ['G', 'T', ''],
       ['', '', ''],
       ['G', 'GTCT', '']], dtype=object)

In [27]:
callset = allel.read_vcf('example.vcf', fields=['ALT'], numbers={'ALT': 5})
callset['variants/ALT']

array([['A', '', '', '', ''],
       ['A', '', '', '', ''],
       ['G', 'T', '', '', ''],
       ['', '', '', '', ''],
       ['G', 'GTCT', '', '', '']], dtype=object)

In [28]:
callset = allel.read_vcf('example.vcf', fields=['ALT'], numbers={'ALT': 1})
callset['variants/ALT']

array(['A', 'A', 'G', '', 'G'], dtype=object)
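The padding and truncation can be sketched in plain Python as follows. The function `alt_values` is hypothetical and only mimics the shape of the outputs shown above, using the ALT strings for the five variants in `example.vcf`.

```python
def alt_values(alt, number, fill=''):
    """Split an ALT string and force it to exactly `number` values."""
    values = [] if alt == '.' else alt.split(',')
    values = values[:number]                   # drop any extra values
    values += [fill] * (number - len(values))  # pad short rows
    return values


alts = ['A', 'A', 'G,T', '.', 'G,GTCT']
print([alt_values(alt, 3) for alt in alts])
# With a single value per variant, a flat list is enough.
print([alt_values(alt, 1)[0] for alt in alts])  # ['A', 'A', 'G', '', 'G']
```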

### Genotype ploidy

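By default, genotype calls are assumed to be diploid. If your samples have a different ploidy, use the `numbers` argument to set the number of alleles per call for the 'GT' field. Here is an example with tetraploid genotype calls.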

In [29]:
with open('example_polyploid.vcf', mode='r') as f:
    print(f.read(-1))

##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1	sample2	sample3
20	14370	.	G	A	.	.	.	GT	0/0/0/0	0/0/0/1	0/0/1/1
20	17330	.	T	A,C,G	.	.	.	GT	1/1/2/2	0/1/2/3	3/3/3/3



In [30]:
callset = allel.read_vcf('example_polyploid.vcf', fields=['calldata/GT'], numbers={'GT': 4})
callset['calldata/GT']

array([[[0, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1]],

       [[1, 1, 2, 2],
        [0, 1, 2, 3],
        [3, 3, 3, 3]]], dtype=int8)

In [32]:
gt = allel.GenotypeArray(callset['calldata/GT'])
gt

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0/0/0/0 | 0/0/0/1 | 0/0/1/1 |
| 1 | 1/1/2/2 | 0/1/2/3 | 3/3/3/3 |


In [33]:
gt.is_het()

array([[False,  True,  True],
       [ True,  True, False]], dtype=bool)

In [35]:
ac = gt.count_alleles()
ac

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 9 | 3 | 0 | 0 |
| 1 | 1 | 3 | 3 | 5 |
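To show how a GT string of any ploidy can become a fixed-length row of integers, here is a small pure-Python sketch. The function `parse_gt` is hypothetical. Using -1 for a missing allele follows the convention visible in the integer outputs earlier in this post, while padding short calls with -1 and dropping extra alleles are assumptions made for the sketch.

```python
import re


def parse_gt(gt, ploidy):
    """Convert a GT string into `ploidy` integers, using -1 for missing."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = [-1 if a == '.' else int(a) for a in re.split(r'[/|]', gt)]
    alleles = alleles[:ploidy]
    return alleles + [-1] * (ploidy - len(alleles))


row = ['1/1/2/2', '0/1/2/3', '3/3/3/3']
print([parse_gt(gt, 4) for gt in row])
print(parse_gt('0/1', 4))  # [0, 1, -1, -1]
print(parse_gt('./.', 2))  # [-1, -1]
```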


### Selecting a genome region

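Use the `region` argument to extract only the variants within a given genome region. The region is given as a string such as 'chrom:start-end'.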

In [37]:
callset = allel.read_vcf('example.vcf', region='20:1000000-1231000')
callset['variants/CHROM'], callset['variants/POS']

(array(['20', '20'], dtype=object), array([1110696, 1230237], dtype=int32))
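The region string can be read as a chromosome plus a 1-based, inclusive range of positions. The sketch below parses such a string and filters the positions from `example.vcf` by brute force. The helper names are hypothetical, and the sketch ignores details such as contig names that themselves contain a colon.

```python
def parse_region(region):
    """Parse 'chrom', 'chrom:start' or 'chrom:start-end' (1-based, inclusive)."""
    chrom, sep, span = region.partition(':')
    if not sep:
        # Whole chromosome requested.
        return chrom, 1, None
    start, sep, end = span.partition('-')
    return chrom, int(start), int(end) if sep else None


def in_region(chrom, pos, region):
    """True if the variant at (chrom, pos) falls inside the region."""
    r_chrom, start, end = parse_region(region)
    return chrom == r_chrom and pos >= start and (end is None or pos <= end)


positions = [14370, 17330, 1110696, 1230237, 1234567]
selected = [p for p in positions if in_region('20', p, '20:1000000-1231000')]
print(selected)  # [1110696, 1230237]
```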

### Selecting samples

Use the `samples` argument to extract data for only some of the samples in the file.

In [38]:
callset = allel.read_vcf('example.vcf', samples=['NA00001', 'NA00003'])
callset['samples']

array(['NA00001', 'NA00003'], dtype=object)

In [39]:
allel.GenotypeArray(callset['calldata/GT'])

|   | 0 | 1 |
|---|---|---|
| 0 | 0/0 | 1/1 |
| 1 | 0/0 | 0/0 |
| 2 | 0/2 | 2/2 |
| 3 | 0/0 | 0/0 |
| 4 | 0/1 | 1/1 |
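Selecting samples amounts to keeping only some of the sample columns. Here is a pure-Python sketch using the genotype strings from `example.vcf`. The function `select_samples` is hypothetical; returning samples in file order and raising an error for unknown sample names are choices made for this sketch.

```python
header_samples = ['NA00001', 'NA00002', 'NA00003']
calls = [
    ['0/0', '0/1', '1/1'],
    ['0/0', '0/1', '0/0'],
    ['0/2', '1/2', '2/2'],
    ['0/0', '0/0', '0/0'],
    ['0/1', '0/2', '1/1'],
]


def select_samples(header_samples, wanted, calls):
    """Keep only the call columns for the requested samples, in file order."""
    wanted = set(wanted)
    missing = wanted - set(header_samples)
    if missing:
        raise ValueError('samples not found: %s' % sorted(missing))
    keep = [i for i, name in enumerate(header_samples) if name in wanted]
    kept_names = [header_samples[i] for i in keep]
    kept_calls = [[row[i] for i in keep] for row in calls]
    return kept_names, kept_calls


names, subset = select_samples(header_samples, ['NA00001', 'NA00003'], calls)
print(names)
print(subset)
```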


## [`vcf_to_npz()`](@@TODO)

@@TODO

## [`vcf_to_hdf5()`](@@TODO)

@@TODO

## [`vcf_to_zarr()`](@@TODO)

@@TODO

## [`vcf_to_dataframe()`](@@TODO)

@@TODO

## [`vcf_to_csv()`](@@TODO)

@@TODO

## [`vcf_to_recarray()`](@@TODO)

@@TODO

## Worked examples

### Human 1000 genomes phase 3

@@TODO

### Pf3k (Plasmodium falciparum) release 5.1

@@TODO

## Other datasets

### Ag1000G (Anopheles gambiae) phase 1 AR3 release

@@TODO

## Further reading

@@TODO


## Post-script: changes from `vcfnp`

The new functions available in `scikit-allel` supersede [`vcfnp`](@@TODO), a package I previously wrote for extracting data from VCF files. I rewrote this functionality from the ground up and ported it to `scikit-allel` for two main reasons. Firstly, `vcfnp` was slow, so you needed a cluster to parse big VCF files, which is obviously a pain. The new functions in `scikit-allel` should be up to ~40 times faster. Secondly, the `vcfnp` API was somewhat complicated, requiring three separate steps to get data from VCF into an HDF5 file or Zarr store. The new functions in `scikit-allel` hopefully simplify this process, enabling data to be extracted from VCF and loaded into any of a variety of storage containers via a single function call.

If you previously used `vcfnp`, here are a few notes on some of the things that have changed.

* No need for separate function calls to extract data from variants and calldata fields; both can be extracted via a single call to `read_vcf()` or any of the `vcf_to_...()` functions described above.
* Data can be extracted from VCF and loaded into HDF5 with a single function call to `vcf_to_hdf5()`; i.e., no need to first extract parts of the data out to .npy files then load into HDF5.
* No need to use a cluster or do any parallelisation; it should be possible to run `vcf_to_hdf5()` or `vcf_to_zarr()` on a whole VCF on a half-decent desktop or laptop computer, although big VCF files might take a couple of hours and require a reasonably large hard disk.
* The default NumPy data type for string fields has changed to use 'object' dtype, which means that strings of any length will be stored automatically (i.e., no need to configure separate dtypes for each string field) and there will be no truncation of long strings.
* Previously in `vcfnp` the genotype calls were extracted, if requested, into a special field called 'genotype', separate from the 'GT' calldata field. In `scikit-allel` the default behaviour is to parse the 'GT' field as a 3-dimensional integer array and return it simply as 'calldata/GT'. If you really want to process the 'GT' field as a string then you can override this by setting the type for 'calldata/GT' to 'S3' or 'object'.
