*This post introduces functions for extracting data from [Variant Call Format (VCF)](http://TODO) files and loading the data into [NumPy](http://TODO) arrays, [pandas](http://TODO) data frames or [HDF5](http://TODO) files for ease of analysis. These functions are available in [scikit-allel](http://TODO) version 1.1 or later.*

## Variant Call Format (VCF)

If you are already familiar with the VCF format, please skip to the next section.

VCF is a widely-used file format for genetic variation data. Here is an example of a small VCF file, based on the example given in the [VCF specification](@@TODO) (TODO update):

In [3]:
with open('example.vcf', mode='r') as vcf:
    print(vcf.read(-1))

##fileformat=VCFv4.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Typ

A VCF file begins with a number of header lines, which start with a hash ('#') character. After the header lines comes the data, with each row describing a genetic variant at a particular position relative to the genome of whichever species you are studying. The data in each row are divided into fields separated by tab characters. There are 9 fixed fields, labelled "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO" and "FORMAT". Following these are variable fields containing data about samples. In this file there are three variable fields with data about samples labelled 'NA00001', 'NA00002' and 'NA00003'.
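To make the row layout concrete, here is a minimal sketch in plain Python that splits one data row into the 9 fixed fields and the per-sample fields. The helper `split_vcf_row` and the example row are hypothetical and written by hand for illustration; this is not how scikit-allel parses VCF files.

```python
# Names of the 9 fixed fields, in the order they appear in each data row.
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def split_vcf_row(line, samples):
    """Split one tab-delimited VCF data row into fixed and per-sample fields.

    Hypothetical helper for illustration only.
    """
    values = line.rstrip("\n").split("\t")
    n_fixed = len(FIXED_FIELDS)
    fixed = dict(zip(FIXED_FIELDS, values[:n_fixed]))
    calldata = dict(zip(samples, values[n_fixed:]))
    return fixed, calldata


# Illustrative row, hand-written rather than read from a file.
samples = ["NA00001", "NA00002", "NA00003"]
row = "19\t111\t.\tA\tC\t9.6\t.\t.\tGT\t0/0\t0/0\t0/1"

fixed, calldata = split_vcf_row(row, samples)
print(fixed["CHROM"], fixed["POS"], fixed["REF"], fixed["ALT"])
print(calldata)
```

Note that every value comes back as a string here. Converting each field to a sensible numeric type is part of the work that a dedicated VCF reader has to do.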

For example, the first row provides data about a variant on chromosome 19 at position 111 (i.e., the 111th nucleotide in the reference sequence). The reference allele is 'A' and the alternate allele is 'C', so this is a single nucleotide polymorphism (SNP). The genotype call in the first sample is '0/0', meaning that individual 'NA00001' is homozygous for the reference allele at this position. The genotype call for the third sample is '0/1' (you may need to scroll across to see this), which means that individual 'NA00003' is heterozygous for the reference and alternate alleles at this position.
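The way a genotype call maps onto alleles can be sketched in a few lines of Python. In a call such as '0/1', each number is an index into the list of alleles, where 0 is the reference allele and 1 is the first alternate allele. The function `describe_genotype` below is a hypothetical helper for illustration, not part of scikit-allel.

```python
import re


def describe_genotype(gt, ref, alts):
    """Describe a VCF genotype call such as '0/1' in terms of its alleles.

    Hypothetical helper for illustration only.
    """
    # Allele indices are separated by '/' (unphased) or '|' (phased).
    indices = re.split(r"[/|]", gt)
    if "." in indices:
        # A '.' means the allele was not called.
        return "missing"
    alleles = [ref] + list(alts)
    called = [alleles[int(i)] for i in indices]
    if len(set(called)) == 1:
        kind = "homozygous reference" if indices[0] == "0" else "homozygous alternate"
    else:
        kind = "heterozygous"
    return f"{kind} ({'/'.join(called)})"


print(describe_genotype("0/0", "A", ["C"]))  # homozygous reference (A/A)
print(describe_genotype("0/1", "A", ["C"]))  # heterozygous (A/C)
```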

## Why NumPy/pandas/HDF5?

There are a number of software tools that can read VCF files and perform common analyses. However, if your dataset is large and/or you need to do some bespoke analysis, then working with VCF files can be a bit slow and inconvenient.

For analysis and plotting of numerical data in Python, it is very convenient to load data into [NumPy arrays](@@TODO). A NumPy array is an in-memory data structure that provides support for fast arithmetic and data manipulation. For analysing tables of data, [pandas DataFrames](@@TODO) provide useful features such as querying, aggregation and joins. When data are too large to fit into main memory, [HDF5 files](@@TODO) can provide fast on-disk storage and retrieval of numerical arrays. 
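As a rough illustration of why arrays are convenient, the sketch below stores a handful of diploid genotype calls in a NumPy array and summarises them without any explicit loops. The values are made up for the example, and the layout (variants × samples × ploidy) is an assumption chosen for illustration.

```python
import numpy as np

# One row per variant, one column per sample; the last axis holds the two
# allele indices of a diploid call (0 = REF, 1 = first ALT).
# Values are illustrative, not read from a real file.
gt = np.array([
    [[0, 0], [0, 0], [0, 1]],
    [[0, 1], [1, 1], [0, 0]],
], dtype="i1")

# Count ALT alleles per variant across all samples in one vectorised step.
alt_count = (gt == 1).sum(axis=(1, 2))

# A call is heterozygous when its two alleles differ.
is_het = gt[:, :, 0] != gt[:, :, 1]

print(alt_count)           # ALT allele count for each variant
print(is_het.sum(axis=1))  # number of heterozygous samples for each variant
```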

[scikit-allel](@@TODO) is a Python package intended to enable exploratory analysis of large-scale genetic variation data. Version 1.1.0 of scikit-allel adds some new functions for extracting data from VCF files and loading the data into NumPy arrays, pandas DataFrames or HDF5 files. Once you have extracted these data, there are many analyses that can be run interactively on a commodity laptop or desktop computer, even with large-scale datasets from population resequencing studies. To give a flavour of what can be done, there are a few previous articles on my blog, covering topics including [variant and sample QC](@@TODO), [allele frequency differentiation](@@TODO), [population structure](@@TODO), and [recombination in genetic crosses](@@TODO).

## [`read_vcf()`](@@TODO)

@@TODO

### Selecting fields

@@TODO

### Data type

@@TODO

### Number of values

@@TODO

### Selecting a genome region

@@TODO

### Selecting samples

@@TODO

### Ploidy

@@TODO

## [`vcf_to_npz()`](@@TODO)

@@TODO

## [`vcf_to_hdf5()`](@@TODO)

@@TODO

## [`vcf_to_zarr()`](@@TODO)

@@TODO

## [`vcf_to_dataframe()`](@@TODO)

@@TODO

## [`vcf_to_csv()`](@@TODO)

@@TODO

## [`vcf_to_recarray()`](@@TODO)

@@TODO