# <span style="color:gray">ipyrad-analysis toolkit:</span> vcf_to_hdf5

This notebook demonstrates and validates the `vcf_to_hdf5` tool using simulated data. See the empirical example for a further description of the tool and application to empirical data. 

### Required software

In [1]:
# conda install ipyrad -c conda-forge -c bioconda
# conda install ipcoal -c conda-forge

In [2]:
import ipyrad.analysis as ipa
import ipcoal
import toytree

### Example 1: Simulate 1K short RAD loci

Here we simulate many short loci to emulate a RAD-seq data set.

In [3]:
# get a tree topology
tree = toytree.rtree.unittree(ntips=5, treeheight=1e5, seed=123)

# setup simulation model
mod = ipcoal.Model(tree=tree, Ne=1e5, nsamples=2)

# simulate loci
mod.sim_loci(nloci=1000, nsites=100)

# write vcf to file
mod.write_vcf(name="test-short", outdir="/tmp", diploid=True)

# show vcf as dataframe
vcfdf = mod.write_vcf(diploid=True)
vcfdf

wrote 1816 SNPs across 825 linkage blocks to /tmp/test-short.vcf


index,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,r0,r1,r2,r3,r4
0,1,19,.,G,"T,A",99,PASS,.,GT,1|0,0|0,0|0,2|0,0|0
1,1,33,.,C,G,99,PASS,.,GT,1|0,0|0,0|0,0|0,0|0
2,1,83,.,C,A,99,PASS,.,GT,0|0,0|0,0|0,0|1,0|0
3,2,16,.,A,G,99,PASS,.,GT,0|0,0|0,0|0,1|0,0|0
4,3,15,.,T,G,99,PASS,.,GT,0|0,0|0,1|0,0|0,0|0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1811,997,87,.,A,G,99,PASS,.,GT,1|0,0|0,0|0,0|0,0|0
1812,998,10,.,A,G,99,PASS,.,GT,1|1,0|0,0|0,0|0,0|0
1813,998,33,.,G,C,99,PASS,.,GT,0|0,1|1,1|1,1|1,1|1
1814,998,80,.,A,C,99,PASS,.,GT,0|0,0|0,1|0,0|0,0|0


For the conversion tool we can set `ld_block_size` to be larger than the length of the loci (scaffolds). This ensures that all SNPs on the same locus are treated as linked and that all loci are treated as independent (unlinked). The `chunksize` only affects how fast the data are processed. In this case we expect the number of linkage blocks to equal the number of scaffolds in the VCF (825 rather than 1000 here, presumably because loci without any SNPs do not appear in the VCF).

In [4]:
# setup converter tool
tool = ipa.vcf_to_hdf5(
    data="/tmp/test-short.vcf", 
    name="test-short", 
    workdir="/tmp",
    ld_block_size=50000,
    chunksize=10000,
)

# write converted file
tool.run(force=True)

Indexing VCF to HDF5 database file
VCF: 1816 SNPs; 825 scaffolds
[####################] 100% 0:00:04 | converting VCF to HDF5 
HDF5: 1816 SNPs; 825 linkage groups
SNP database written to /tmp/test-short.snps.hdf5
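
The expectation described above (one linkage block per scaffold when `ld_block_size` exceeds the scaffold length) can be illustrated with a minimal sketch. This is not the ipyrad implementation: `assign_linkage_blocks` is a hypothetical helper, and the rule of binning each SNP by integer division of its position is an assumption made only for illustration.

```python
from collections import defaultdict


def assign_linkage_blocks(snps, ld_block_size):
    """Group (scaffold, position) pairs into hypothetical linkage blocks.

    Assumption: a SNP's block is its position integer-divided by
    ld_block_size, and blocks never span two scaffolds.
    """
    blocks = defaultdict(list)
    for scaffold, pos in snps:
        blocks[(scaffold, pos // ld_block_size)].append(pos)
    return dict(blocks)


# (CHROM, POS) of the first five SNPs in the VCF table above
short_snps = [(1, 19), (1, 33), (1, 83), (2, 16), (3, 15)]
short_blocks = assign_linkage_blocks(short_snps, ld_block_size=50000)

# every locus is shorter than the block size, so each scaffold is one block
print(len(short_blocks))  # 3
```

Because every position is far below 50000, all SNPs on a scaffold fall into block 0 of that scaffold, so the block count matches the scaffold count.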


### Example 2: Simulate 5 large chromosomes and break into many linkage blocks

In this case the number of linkage blocks will be much greater than the number of scaffolds, because we treat every 5 kb block as independent of the other blocks on the same scaffold.


In [5]:
# get a tree topology
tree = toytree.rtree.unittree(ntips=5, treeheight=1e5, seed=123)

# setup simulation model
mod = ipcoal.Model(tree=tree, Ne=1e5, nsamples=2)

# simulate loci
mod.sim_loci(nloci=5, nsites=1e5)

# write vcf
mod.write_vcf(name="test-long", outdir="/tmp", diploid=True)

# get vcf as dataframe
vcfdf = mod.write_vcf(diploid=True)
vcfdf

wrote 9110 SNPs across 5 linkage blocks to /tmp/test-long.vcf


index,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,r0,r1,r2,r3,r4
0,1,18,.,T,C,99,PASS,.,GT,0|0,1|0,0|0,0|0,0|0
1,1,151,.,A,G,99,PASS,.,GT,0|0,0|0,0|0,0|1,0|0
2,1,274,.,A,G,99,PASS,.,GT,0|0,0|0,1|1,0|0,0|0
3,1,316,.,A,G,99,PASS,.,GT,0|0,0|0,0|0,0|0,0|1
4,1,389,.,C,A,99,PASS,.,GT,0|0,0|0,0|0,0|0,1|1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9105,5,99802,.,G,C,99,PASS,.,GT,0|0,1|1,0|1,0|1,0|0
9106,5,99810,.,A,T,99,PASS,.,GT,0|1,0|0,1|0,1|0,1|1
9107,5,99815,.,A,T,99,PASS,.,GT,0|0,0|0,0|0,0|1,0|0
9108,5,99849,.,A,C,99,PASS,.,GT,0|0,0|1,0|0,0|0,0|0


In this case the `ld_block_size` is smaller than the scaffolds, so each scaffold is split into several blocks and we end up with more linkage blocks than scaffolds. In the result here we find 100 linkage blocks: each of the 5 scaffolds is 100 kb long and is split into 20 blocks of 5 kb. This means that downstream analyses can sample 100 unlinked SNPs, and in each replicate analysis the particular SNP sampled from each block will vary depending on the random seed.

In [7]:
# setup converter tool
tool = ipa.vcf_to_hdf5(
    data="/tmp/test-long.vcf", 
    name="test-long", 
    workdir="/tmp",
    ld_block_size=5000,
    chunksize=10000,
)

# write converted file
tool.run(force=True)

Indexing VCF to HDF5 database file
VCF: 9110 SNPs; 5 scaffolds
[####################] 100% 0:00:00 | converting VCF to HDF5 
HDF5: 9110 SNPs; 100 linkage groups
SNP database written to /tmp/test-long.snps.hdf5
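
To make the idea of sampling one SNP per linkage block concrete, here is a minimal sketch on toy data. It does not use the ipyrad API: `sample_unlinked` is a hypothetical helper, the toy positions are invented for illustration, and the block assignment by integer division is an assumption.

```python
import random


def sample_unlinked(positions_by_scaffold, ld_block_size, seed):
    """Draw one SNP position per hypothetical linkage block.

    Assumption: blocks are formed by integer-dividing each position by
    ld_block_size within a scaffold; one position is drawn per block
    using a random generator seeded with `seed`.
    """
    rng = random.Random(seed)
    blocks = {}
    for scaffold, positions in positions_by_scaffold.items():
        for pos in positions:
            blocks.setdefault((scaffold, pos // ld_block_size), []).append(pos)
    return {key: rng.choice(members) for key, members in sorted(blocks.items())}


# toy data (hypothetical): two 20 kb scaffolds with a SNP every 1000 bp
toy = {chrom: list(range(500, 20000, 1000)) for chrom in (1, 2)}

draw_a = sample_unlinked(toy, ld_block_size=5000, seed=1)
draw_b = sample_unlinked(toy, ld_block_size=5000, seed=2)

# four 5 kb blocks on each of two scaffolds
print(len(draw_a))  # 8
```

Both draws contain exactly one SNP for each of the same eight blocks, and repeating a draw with the same seed returns the same SNPs. Different seeds may select different SNPs within a block, which is the source of variation among replicate analyses.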
