*This post gives an introduction to functions for extracting data from [Variant Call Format (VCF)](https://samtools.github.io/hts-specs/VCFv4.3.pdf) files and loading into [NumPy](http://www.numpy.org/) arrays, [pandas](http://pandas.pydata.org/) data frames or [HDF5](https://support.hdfgroup.org/HDF5/) files for ease of analysis. These functions are available in [scikit-allel](http://scikit-allel.readthedocs.io/en/latest/) version 1.1 or later. Any feedback or bug reports welcome.*

## Introduction

### Variant Call Format (VCF)

VCF is a widely-used file format for genetic variation data. Here is an example of a small VCF file, based on the example given in the [VCF specification](https://samtools.github.io/hts-specs/VCFv4.3.pdf):

In [1]:
with open('example.vcf', mode='r') as vcf:
    print(vcf.read(-1))

##fileformat=VCFv4.3
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	DP=14;AF=0.5;DB	GT:DP	0/0:1	0/1:8	1/1:5
20	17330	.	T	A	3	q10	DP=11;AF=0.017	GT:DP	0/0:3	0/1:5	0/0:41
20	1110696	rs6040355	A	G,T	67	PASS	DP=10;AF=0.333,0.667;DB	GT:DP	0/2:6	1/2:0	2/2:4
20	1230237	.	T	.	47	PASS	DP=13	GT:DP	0/0:7	0/0:4	./.:.
20	1234567	

A VCF file begins with a number of meta-information lines, which start with two hash ('##') characters. Then there is a single header line beginning with a single hash ('#') character. After the header line there are data lines, with each data line describing a genetic variant at a particular position relative to the reference genome of whichever species you are studying. Each data line is divided into fields separated by tab characters. There are 9 fixed fields, labelled "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO" and "FORMAT". Following these are fields containing data about samples. 
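To make the layout described above concrete, here is a minimal sketch in plain Python that splits VCF text into its three kinds of line. The `split_vcf` helper and the inlined two-sample file are made up for illustration; this is not how scikit-allel parses VCF internally.

```python
# A tiny VCF, inlined so the sketch is self-contained (fields are tab-separated).
vcf_text = (
    "##fileformat=VCFv4.3\n"
    "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\tGT:DP\t0/0:1\t0/1:8\n"
)


def split_vcf(text):
    """Hypothetical helper: split VCF text into meta lines, header columns and data rows."""
    meta, header, rows = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)                 # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")     # the single header line
        elif line:
            rows.append(line.split("\t"))     # one data line per variant
    return meta, header, rows


meta, header, rows = split_vcf(vcf_text)
fixed_fields, sample_ids = header[:9], header[9:]
print(fixed_fields)
print(sample_ids)
print(dict(zip(header, rows[0])))
```

The first 9 header columns are the fixed fields; everything after them is a sample identifier.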

For example, the first data line in the file above describes a variant on chromosome 20 at position 14370 relative to the B36 assembly of the human genome. The reference allele is 'G' and the alternate allele is 'A', so this is a single nucleotide polymorphism (SNP). In this file there are three fields with data about samples labelled 'NA00001', 'NA00002' and 'NA00003'. The genotype call in the first sample is '0/0', meaning that individual 'NA00001' is homozygous for the reference allele at this position. The genotype call for the second sample is '0/1' (you may need to scroll across to see this), which means that individual 'NA00002' is heterozygous for the reference and alternate alleles at this position.
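As a rough illustration of how these genotype strings are read, here is a small sketch in plain Python. The `parse_gt` and `describe` helpers are hypothetical names invented for this example, and the labels are only my shorthand for the cases discussed above.

```python
def parse_gt(call):
    """Parse a GT string such as '0/1' or '1|1' into a tuple of allele indices.

    Missing alleles ('.') are returned as None.
    """
    alleles = call.replace("|", "/").split("/")
    return tuple(None if a == "." else int(a) for a in alleles)


def describe(call):
    """Give a short description of a genotype call."""
    alleles = parse_gt(call)
    if any(a is None for a in alleles):
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == 0 else "homozygous alternate"


# The three calls from the first data line of the example file.
for sample, call in zip(["NA00001", "NA00002", "NA00003"], ["0/0", "0/1", "1/1"]):
    print(sample, call, describe(call))
```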

### NumPy/pandas/HDF5/...

There are a number of software tools that can read VCF files and perform various analyses. However, if your dataset is large and/or you need to do some bespoke analysis, then it can be faster and more convenient to first extract the necessary data from the VCF file and load into a more efficient storage container.

For analysis and plotting of numerical data in Python, it is very convenient to load data into [NumPy arrays](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html). A NumPy array is an in-memory data structure that provides support for fast arithmetic and data manipulation. For analysing tables of data, [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/dsintro.html) provide useful features such as querying, aggregation and joins. When data are too large to fit into main memory, [HDF5 files](http://docs.h5py.org/en/latest/quick.html) and [Zarr arrays](http://zarr.readthedocs.io/en/latest/tutorial.html) can provide fast on-disk storage and retrieval of numerical arrays. 

[scikit-allel](http://scikit-allel.readthedocs.io/en/latest/) is a Python package intended to enable exploratory analysis of large-scale genetic variation data. Version 1.1.0 of scikit-allel adds some new functions for extracting data from VCF files and loading the data into NumPy arrays, pandas DataFrames or HDF5 files. Once you have extracted these data, there are many analyses that can be run interactively on a commodity laptop or desktop computer, even with large-scale datasets from population resequencing studies. To give a flavour of what analyses can be done, there are a few previous articles on my blog, touching on topics including [variant and sample QC](http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html), [allele frequency differentiation](http://alimanfoo.github.io/2015/09/21/estimating-fst.html), [population structure](http://alimanfoo.github.io/2015/09/28/fast-pca.html), and [genetic crosses](http://alimanfoo.github.io/2017/02/14/mendelian-transmission.html).

Until now, getting data out of VCF files and into NumPy etc. has been a bit of a pain point. Hopefully the new scikit-allel functions will make this a bit less of a hurdle. Let's take a look at the new functions...

## [`read_vcf()`](@@TODO)

Let's start with the scikit-allel function [`read_vcf()`](@@TODO). First, some imports:

In [2]:
# import scikit-allel
import allel
# check which version is installed
print(allel.__version__)

1.1.0b5


Read the example VCF file shown above, using default arguments:

In [3]:
callset = allel.read_vcf('example.vcf')

The `callset` object returned by `read_vcf()` is a Python dictionary (`dict`). It contains several NumPy arrays, each of which can be accessed via a key. Here are the available keys: 

In [4]:
sorted(callset.keys())

['calldata/GT',
 'samples',
 'variants/ALT',
 'variants/CHROM',
 'variants/FILTER_PASS',
 'variants/ID',
 'variants/POS',
 'variants/QUAL',
 'variants/REF']

The 'samples' array contains sample identifiers extracted from the header line in the VCF file.

In [5]:
callset['samples']

array(['NA00001', 'NA00002', 'NA00003'], dtype=object)

All arrays with keys beginning 'variants/' come from the fixed fields in the VCF file. For example, here is the data from the 'CHROM' field:

In [6]:
callset['variants/CHROM']

array(['20', '20', '20', '20', '20'], dtype=object)

Here is the data from the 'POS' field:

In [7]:
callset['variants/POS']

array([  14370,   17330, 1110696, 1230237, 1234567], dtype=int32)

Here is the data from the 'QUAL' field:

In [8]:
callset['variants/QUAL']

array([ 29.,   3.,  67.,  47.,  50.], dtype=float32)

All arrays with keys beginning 'calldata/' come from the sample fields in the VCF file. For example, here are the actual genotype calls from the 'GT' field:

In [9]:
callset['calldata/GT']

array([[[ 0,  0],
        [ 0,  1],
        [ 1,  1]],

       [[ 0,  0],
        [ 0,  1],
        [ 0,  0]],

       [[ 0,  2],
        [ 1,  2],
        [ 2,  2]],

       [[ 0,  0],
        [ 0,  0],
        [-1, -1]],

       [[ 0,  1],
        [ 0,  2],
        [ 1,  1]]], dtype=int8)

Note the -1 values for one of the genotype calls. By default scikit-allel uses -1 to indicate a missing value for any integer array (although you can change this if you want).
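If you are working with the raw array, one way to deal with the -1 convention is to build a boolean mask of missing calls. The following is a minimal sketch using plain NumPy on a hand-typed copy of two of the variants above, so it does not need the VCF file:

```python
import numpy as np

# Variants 0 and 3 from the example above; the last call of the second row is missing.
gt = np.array([[[0, 0], [0, 1], [1, 1]],
               [[0, 0], [0, 0], [-1, -1]]], dtype='int8')

# A call is treated as missing here if any of its alleles is negative.
is_missing = np.any(gt < 0, axis=2)

# Number of non-missing calls per variant.
n_called = np.count_nonzero(~is_missing, axis=1)

print(is_missing)
print(n_called)
```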

Because working with genotype calls is a very common task, scikit-allel has a special [`GenotypeArray`](@@TODO) class which adds some convenient functionality to an array of genotype calls. To use this class, pass the array into the class constructor, e.g.:

In [10]:
gt = allel.GenotypeArray(callset['calldata/GT'])
gt

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0/0 | 0/1 | 1/1 |
| 1 | 0/0 | 0/1 | 0/0 |
| 2 | 0/2 | 1/2 | 2/2 |
| 3 | 0/0 | 0/0 | ./. |
| 4 | 0/1 | 0/2 | 1/1 |


One of the things that the `GenotypeArray` class does is provide a slightly more visually-appealing representation when used in a Jupyter notebook, as can be seen above. There are also methods for making various computations over the genotype calls. For example, the `is_het()` method locates all heterozygous genotype calls:

In [11]:
gt.is_het()

array([[False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False, False, False],
       [ True,  True, False]], dtype=bool)

To give another example, the `count_het()` method will count heterozygous calls, summing over variants (axis=0) or samples (axis=1) if requested. E.g., to count the number of het calls per variant:

In [12]:
gt.count_het(axis=1)

array([1, 1, 2, 0, 2])
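To show roughly what these two methods compute for diploid calls, here is a sketch using plain NumPy on a hand-typed copy of the genotype array. It is an illustration of the idea under the assumption of diploid data, not the library's implementation:

```python
import numpy as np

# Hand-typed copy of the diploid genotype calls shown earlier.
gt = np.array([[[0, 0], [0, 1], [1, 1]],
               [[0, 0], [0, 1], [0, 0]],
               [[0, 2], [1, 2], [2, 2]],
               [[0, 0], [0, 0], [-1, -1]],
               [[0, 1], [0, 2], [1, 1]]], dtype='int8')

# A diploid call is heterozygous if both alleles are called and they differ.
called = np.all(gt >= 0, axis=2)
is_het = called & (gt[:, :, 0] != gt[:, :, 1])

het_per_variant = is_het.sum(axis=1)  # sum across samples
het_per_sample = is_het.sum(axis=0)   # sum across variants

print(het_per_variant)
print(het_per_sample)
```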

One more example: here is how to perform an allele count, i.e., count the number of times each allele (0=reference, 1=first alternate, 2=second alternate, etc.) is observed for each variant:

In [13]:
ac = gt.count_alleles()
ac

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 3 | 3 | 0 |
| 1 | 5 | 1 | 0 |
| 2 | 1 | 1 | 4 |
| 3 | 4 | 0 | 0 |
| 4 | 2 | 3 | 1 |
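If it helps to see what an allele count means in terms of the underlying array, here is a sketch of the same computation in plain NumPy, again on a hand-typed copy of the genotype calls. Missing alleles (-1) never match an allele index, so they are simply not counted:

```python
import numpy as np

# Hand-typed copy of the diploid genotype calls shown earlier.
gt = np.array([[[0, 0], [0, 1], [1, 1]],
               [[0, 0], [0, 1], [0, 0]],
               [[0, 2], [1, 2], [2, 2]],
               [[0, 0], [0, 0], [-1, -1]],
               [[0, 1], [0, 2], [1, 1]]], dtype='int8')

n_alleles = int(gt.max()) + 1
ac = np.zeros((gt.shape[0], n_alleles), dtype=int)
for allele in range(n_alleles):
    # Count occurrences of this allele over samples (axis 1) and ploidy (axis 2).
    ac[:, allele] = np.sum(gt == allele, axis=(1, 2))

print(ac)
```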


### Selecting fields

VCF files can often contain many fields of data, and you may only need to extract some of them to perform a particular analysis. You can select which fields to extract by passing a list of strings as the `fields` argument. For example, let's extract the 'DP' field from within the 'INFO' field, and let's also extract the 'DP' field from the sample data:

In [14]:
callset = allel.read_vcf('example.vcf', fields=['variants/DP', 'calldata/DP'])
sorted(callset.keys())

['calldata/DP', 'variants/DP']

I chose these two fields to illustrate the point that sometimes the same field name (e.g., 'DP') can be used both within the INFO field of a VCF and also within the sample data. When selecting fields, to make sure there is no ambiguity, you can include a prefix which is either 'variants/' or 'calldata/'. For example, if you provide 'variants/DP', then the `read_vcf()` function will look for an INFO field named 'DP'. If you provide 'calldata/DP' then `read_vcf()` will look for a FORMAT field named 'DP' within the sample data. 

If you are feeling lazy, you can drop the 'variants/' prefix, in which case `read_vcf()` will assume you mean a fixed field or an INFO field. E.g.:

In [15]:
callset = allel.read_vcf('example.vcf', fields=['DP', 'calldata/DP'])
sorted(callset.keys())

['calldata/DP', 'variants/DP']

Here is the data that we extracted:

In [16]:
callset['variants/DP']

array([14, 11, 10, 13,  9], dtype=int32)

In [17]:
callset['calldata/DP']

array([[ 1,  8,  5],
       [ 3,  5, 41],
       [ 6,  0,  4],
       [ 7,  4, -1],
       [ 4,  2,  3]], dtype=int16)
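The prefix convention can be summarised in a few lines of Python. The `normalize_field` helper below is hypothetical and only mirrors the behaviour described above; it is not a function provided by scikit-allel.

```python
def normalize_field(name):
    """Hypothetical helper: return the fully qualified key for a field name."""
    if name == 'samples' or name.startswith(('variants/', 'calldata/')):
        return name
    # No prefix given, so assume a fixed field or an INFO field.
    return 'variants/' + name


for name in ['DP', 'variants/DP', 'calldata/DP', 'POS']:
    print(name, '->', normalize_field(name))
```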

If you want to extract absolutely everything from a VCF file, then you can provide a special value ``'*'`` as the ``fields`` argument: 

In [18]:
callset = allel.read_vcf('example.vcf', fields='*')
sorted(callset.keys())

['calldata/DP',
 'calldata/GT',
 'samples',
 'variants/AF',
 'variants/ALT',
 'variants/CHROM',
 'variants/DB',
 'variants/DP',
 'variants/FILTER_PASS',
 'variants/FILTER_q10',
 'variants/FILTER_s50',
 'variants/ID',
 'variants/POS',
 'variants/QUAL',
 'variants/REF',
 'variants/is_snp',
 'variants/numalt',
 'variants/svlen']

### Data type

NumPy arrays can have various data types, including signed integers ('int8', 'int16', 'int32', 'int64'), unsigned integers ('uint8', 'uint16', 'uint32', 'uint64'), floating point numbers ('float32', 'float64'), variable length strings ('object') and fixed length strings (e.g., 'S4' for a 4-character ASCII string). scikit-allel will try to choose a sensible default data type for the fields you want to extract, based on the meta-information in the VCF file, but you can override these if you like by passing a dictionary as the `types` argument. 

For example, by default the 'DP' field is loaded into a 32-bit integer array:

In [22]:
callset = allel.read_vcf('example.vcf', fields=['DP'])
callset['variants/DP']

array([14, 11, 10, 13,  9], dtype=int32)

To save some memory, you might choose a 16-bit integer array instead:

In [23]:
callset = allel.read_vcf('example.vcf', fields=['DP'], types={'DP': 'int16'})
callset['variants/DP']

array([14, 11, 10, 13,  9], dtype=int16)
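To get a feel for the saving, and for the range you give up in exchange, here is a quick check in plain NumPy using the same values typed in by hand:

```python
import numpy as np

dp32 = np.array([14, 11, 10, 13, 9], dtype='int32')
dp16 = dp32.astype('int16')

# Half the bytes per value...
print(dp32.nbytes, dp16.nbytes)

# ...but a much smaller range, so check the data fit before downcasting.
int16_max = np.iinfo('int16').max
fits = bool(dp32.max() <= int16_max)
print(int16_max, fits)
```

For a field like depth, 16 bits will usually be plenty, but it is worth checking if your data include very high coverage regions.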

You can also choose a floating-point data type if you like, even for fields that are declared as type 'Integer' in the VCF meta-information, and vice versa. E.g.:

In [24]:
callset = allel.read_vcf('example.vcf', fields=['DP'], types={'DP': 'float32'})
callset['variants/DP']

array([ 14.,  11.,  10.,  13.,   9.], dtype=float32)

For fields containing textual data, there are two choices for data type. By default, scikit-allel will use an 'object' data type, which means that values are stored as an array of Python strings. E.g.:

In [23]:
callset = allel.read_vcf('example.vcf', fields=['REF'])
callset['variants/REF']

array(['G', 'T', 'A', 'T', 'GTC'], dtype=object)

The advantage of using 'object' dtype is that strings can be of any length. Alternatively, you can use a fixed-length string dtype, e.g.:

In [25]:
callset = allel.read_vcf('example.vcf', fields=['REF'], types={'REF': 'S3'})
callset['variants/REF']

array([b'G', b'T', b'A', b'T', b'GTC'], 
      dtype='|S3')

Note that fixed-length string dtypes will cause any string values longer than the requested number of characters to be truncated. I.e., there can be some data loss. E.g., if using a single-character string for the 'REF' field, the correct value of 'GTC' for the final variant will get truncated to 'G':

In [26]:
callset = allel.read_vcf('example.vcf', fields=['REF'], types={'REF': 'S1'})
callset['variants/REF']

array([b'G', b'T', b'A', b'T', b'G'], 
      dtype='|S1')
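The truncation is a property of NumPy fixed-length string dtypes rather than anything VCF-specific, and it happens silently. Here is a small sketch in plain NumPy, with the REF values typed in by hand, showing one way to check how long a string dtype needs to be before you commit to it:

```python
import numpy as np

values = ['G', 'T', 'A', 'T', 'GTC']

# Length of the longest value, i.e. the smallest safe fixed-length dtype.
longest = max(len(v) for v in values)
print(longest)

ref_s3 = np.array(values, dtype='S%d' % longest)  # long enough, nothing lost
ref_s1 = np.array(values, dtype='S1')             # too short, values are cut off

print(ref_s3)
print(ref_s1)
```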

### Number of values

Some fields like 'ALT', 'AC' and 'AF' can have a variable number of values. I.e., each variant may have a different number of data values for these fields. One trade-off you have to make when loading data into NumPy arrays is that you cannot have arrays with a variable number of items per row. Rather, you have to fix the maximum number of possible items. While you lose some flexibility, you gain speed of access.

For fields like 'ALT', scikit-allel will choose a default number of expected values, which is set at 3. E.g., here is what you get by default:

In [27]:
callset = allel.read_vcf('example.vcf', fields=['ALT'])
callset['variants/ALT']

array([['A', '', ''],
       ['A', '', ''],
       ['G', 'T', ''],
       ['', '', ''],
       ['G', 'GTCT', '']], dtype=object)

In this case, 3 is more than we need, because no variant has more than 2 ALT values. However, some VCF files (especially those including INDELs) may have more than 3 ALT values. 

If you need to increase or decrease the expected number of values for any field, you can provide a dictionary as the `numbers` argument. E.g., increase the number of ALT values to 5:

In [28]:
callset = allel.read_vcf('example.vcf', fields=['ALT'], numbers={'ALT': 5})
callset['variants/ALT']

array([['A', '', '', '', ''],
       ['A', '', '', '', ''],
       ['G', 'T', '', '', ''],
       ['', '', '', '', ''],
       ['G', 'GTCT', '', '', '']], dtype=object)

Some care is needed here, because if you choose a value that is lower than the maximum number of values in the VCF file, then any extra values will not get extracted. E.g., the following would be fine if all variants were biallelic, but would lose information for any multi-allelic variants:

In [29]:
callset = allel.read_vcf('example.vcf', fields=['ALT'], numbers={'ALT': 1})
callset['variants/ALT']

array(['A', 'A', 'G', '', 'G'], dtype=object)
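The trade-off can be illustrated without a VCF file at all. Here is a minimal sketch in which a hypothetical `pad_values` helper packs variable-length lists into a fixed-width array, filling gaps with empty strings and dropping anything beyond the requested number. Unlike the output above, this sketch always returns a 2-dimensional array, even when the number is 1.

```python
import numpy as np


def pad_values(rows, number, fill=''):
    """Hypothetical helper: pack variable-length rows into an (n, number) object array."""
    out = np.full((len(rows), number), fill, dtype=object)
    for i, values in enumerate(rows):
        for j, value in enumerate(values[:number]):  # extra values are dropped
            out[i, j] = value
    return out


# ALT values for the five variants in the example file, typed in by hand.
alt_rows = [['A'], ['A'], ['G', 'T'], [], ['G', 'GTCT']]

# The smallest number that loses nothing.
max_alt = max(len(row) for row in alt_rows)

print(max_alt)
print(pad_values(alt_rows, 3))
print(pad_values(alt_rows, 1))
```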

### Genotype ploidy

By default, scikit-allel assumes you are working with a diploid organism, and so expects to parse out 2 alleles for each genotype call. If you are working with an organism with some other ploidy, you can change the expected number of alleles via the `numbers` argument.

For example, here is a VCF file with tetraploid genotype calls:

In [30]:
with open('example_polyploid.vcf', mode='r') as f:
    print(f.read(-1))

##fileformat=VCFv4.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1	sample2	sample3
20	14370	.	G	A	.	.	.	GT	0/0/0/0	0/0/0/1	0/0/1/1
20	17330	.	T	A,C,G	.	.	.	GT	1/1/2/2	0/1/2/3	3/3/3/3



Here is how to indicate the ploidy:

In [31]:
callset = allel.read_vcf('example_polyploid.vcf', fields=['calldata/GT'], numbers={'GT': 4})
callset['calldata/GT']

array([[[0, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1]],

       [[1, 1, 2, 2],
        [0, 1, 2, 3],
        [3, 3, 3, 3]]], dtype=int8)

As shown earlier for diploid calls, the `GenotypeArray` class can provide some extra functionality, e.g.:

In [35]:
gt = allel.GenotypeArray(callset['calldata/GT'])
gt

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0/0/0/0 | 0/0/0/1 | 0/0/1/1 |
| 1 | 1/1/2/2 | 0/1/2/3 | 3/3/3/3 |


In [36]:
gt.is_het()

array([[False,  True,  True],
       [ True,  True, False]], dtype=bool)

In [37]:
ac = gt.count_alleles()
ac

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 9 | 3 | 0 | 0 |
| 1 | 1 | 3 | 3 | 5 |
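To see why the expected ploidy has to be fixed up front, here is a sketch of parsing genotype strings into a fixed-width integer array. The `parse_calls` helper is hypothetical; it only illustrates the idea that the array has one column per allele, so a ploidy that is too small drops alleles and one that is too large leaves -1 padding.

```python
import numpy as np


def parse_calls(calls, ploidy):
    """Hypothetical helper: parse GT strings into an int8 array of shape (n_samples, ploidy)."""
    out = np.full((len(calls), ploidy), -1, dtype='int8')
    for i, call in enumerate(calls):
        alleles = call.replace('|', '/').split('/')
        for j, allele in enumerate(alleles[:ploidy]):
            if allele != '.':
                out[i, j] = int(allele)
    return out


# Calls from the second variant in the tetraploid example.
calls = ['1/1/2/2', '0/1/2/3', '3/3/3/3']

tetra = parse_calls(calls, ploidy=4)      # all four alleles kept
too_small = parse_calls(calls, ploidy=2)  # in this sketch, extra alleles are dropped

print(tetra)
print(too_small)
```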


### Selecting a genome region

@@TODO

In [37]:
callset = allel.read_vcf('example.vcf', region='20:1000000-1231000')
callset['variants/CHROM'], callset['variants/POS']

(array(['20', '20'], dtype=object), array([1110696, 1230237], dtype=int32))
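If you have already loaded a whole callset, a similar selection can be made afterwards with a boolean mask. The sketch below uses plain NumPy on hand-typed copies of the CHROM and POS arrays. The `parse_region` helper is hypothetical, and I am assuming the region string uses 1-based coordinates with both ends inclusive.

```python
import numpy as np


def parse_region(region):
    """Hypothetical helper: split 'chrom:start-stop' into its three parts."""
    chrom, _, span = region.partition(':')
    start, _, stop = span.partition('-')
    return chrom, int(start), int(stop)


# Hand-typed copies of the arrays extracted from the example file.
chrom = np.array(['20', '20', '20', '20', '20'], dtype=object)
pos = np.array([14370, 17330, 1110696, 1230237, 1234567], dtype='int32')

region_chrom, start, stop = parse_region('20:1000000-1231000')
in_region = (chrom == region_chrom) & (pos >= start) & (pos <= stop)

print(pos[in_region])
```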

### Selecting samples

@@TODO

In [38]:
callset = allel.read_vcf('example.vcf', samples=['NA00001', 'NA00003'])
callset['samples']

array(['NA00001', 'NA00003'], dtype=object)

In [39]:
allel.GenotypeArray(callset['calldata/GT'])

|   | 0 | 1 |
|---|---|---|
| 0 | 0/0 | 1/1 |
| 1 | 0/0 | 0/0 |
| 2 | 0/2 | 2/2 |
| 3 | 0/0 | 0/0 |
| 4 | 0/1 | 1/1 |
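Sample selection can also be done after loading, by looking up the column index of each sample and indexing the second axis of the genotype array. Here is a sketch in plain NumPy on hand-typed copies of the 'samples' array and the first three variants of 'calldata/GT':

```python
import numpy as np

samples = np.array(['NA00001', 'NA00002', 'NA00003'], dtype=object)

# First three variants of the diploid genotype calls shown earlier.
gt = np.array([[[0, 0], [0, 1], [1, 1]],
               [[0, 0], [0, 1], [0, 0]],
               [[0, 2], [1, 2], [2, 2]]], dtype='int8')

wanted = ['NA00001', 'NA00003']

# Column index of each wanted sample, in the order requested.
sample_index = [samples.tolist().index(s) for s in wanted]

# Axis 1 of the genotype array is the samples axis.
gt_subset = gt[:, sample_index]

print(sample_index)
print(gt_subset.shape)
```

Selecting samples at read time avoids ever holding the unwanted columns in memory, which matters more as the number of samples grows.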


## [`vcf_to_npz()`](@@TODO)

@@TODO

## [`vcf_to_hdf5()`](@@TODO)

@@TODO

## [`vcf_to_zarr()`](@@TODO)

@@TODO

## [`vcf_to_dataframe()`](@@TODO)

@@TODO

## [`vcf_to_csv()`](@@TODO)

@@TODO

## [`vcf_to_recarray()`](@@TODO)

@@TODO

## Worked examples

### Human 1000 genomes phase 3

@@TODO

### Pf3k (Plasmodium falciparum) release 5.1

@@TODO

## Other datasets

### Ag1000G (Anopheles gambiae) phase 1 AR3 release

@@TODO

## Further reading

@@TODO


## Post-script: changes from `vcfnp`

The new functions available in `scikit-allel` supersede a package I previously wrote for extracting data from VCF files called [`vcfnp`](@@TODO). I rewrote this functionality from the ground up and ported it to `scikit-allel` for two main reasons. Firstly, `vcfnp` was slow and so you needed a cluster to parse big VCF files, which is obviously a pain. The new functions in `scikit-allel` should be up to ~40 times faster. Secondly, the `vcfnp` API was somewhat complicated, requiring three separate steps to get data from VCF into an HDF5 file or Zarr store. The new functions in `scikit-allel` hopefully simplify this process, enabling data to be extracted from VCF and loaded into any of a variety of storage containers via a single function call.

If you previously used `vcfnp`, here are a few notes on some of the things that have changed.

* No need for separate function calls to extract data from variants and calldata fields; both can be extracted via a single call to `read_vcf()` or any of the `vcf_to_...()` functions described above.
* Data can be extracted from VCF and loaded into HDF5 with a single function call to `vcf_to_hdf5()`; i.e., no need to first extract parts of the data out to .npy files then load into HDF5.
* No need to use a cluster or do any parallelisation. It should be possible to run `vcf_to_hdf5()` or `vcf_to_zarr()` on a whole VCF on a half-decent desktop or laptop computer, although big VCF files might take a couple of hours and require a reasonably large hard disk.
* The default NumPy data type for string fields has changed to use 'object' dtype, which means that strings of any length will be stored automatically (i.e., no need to configure separate dtypes for each string field) and there will be no truncation of long strings.
* Previously in `vcfnp` the genotype calls were extracted into a special field called 'genotype' separate from the 'GT' calldata field if requested. In `scikit-allel` the default behaviour is to parse the 'GT' field as a 3-dimensional integer array and return simply as 'calldata/GT'. If you really want to process the 'GT' field as a string then you can override this by setting the type for 'calldata/GT' to 'S3' or 'object'.
