Spinning
To parse the 1000 Genomes Project VCF files, first download them from the project's FTP site with this command:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/*.gz
Next, download the sample ancestry metadata:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel -O samples.tab
You can use these files as input data to the Harvestman pre-processing step:
./HarvestmanConsole -a 37 -o harvestman-processed -v <path-to-downloaded-vcfs>
This parses the VCF data into a binary format that Harvestman can re-read quickly. The '-a' parameter specifies the reference genome assembly; here, 37 selects GRCh37.
Harvestman expects VCF files as the initial input. You can supply any number of VCF files, subject to two constraints. First, they must all be generated against the same version of the reference genome. The program has been tested with major builds GRCh37 and GRCh38; earlier builds and non-human reference genomes are not currently supported. You specify the assembly version with the '-a' flag (for example, './HarvestmanConsole spin -a 37'). Second, each patient/sample ID in the VCF headers must either appear in only one file or refer to the same sample in every file where it appears.
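The second constraint can be checked before running Harvestman. The following is a minimal Python sketch, not part of Harvestman itself, that reads the '#CHROM' header line of each VCF file (plain or gzip-compressed) and reports the sample IDs that occur in more than one file, so you can confirm they really refer to the same sample. The function names are hypothetical, and the sketch relies only on the standard VCF layout, in which sample IDs follow the nine fixed columns of the header line.

```python
import gzip
from collections import defaultdict

# Standard VCF header columns that precede the sample IDs:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
FIXED_COLUMNS = 9


def vcf_sample_ids(path):
    """Return the sample IDs listed on the #CHROM header line of a VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return line.rstrip("\r\n").split("\t")[FIXED_COLUMNS:]
            if not line.startswith("#"):
                break  # reached data rows without seeing a header line
    raise ValueError(f"no #CHROM header line found in {path}")


def shared_samples(paths):
    """Map each sample ID that occurs in more than one file to those files."""
    seen = defaultdict(list)
    for path in paths:
        for sample_id in vcf_sample_ids(path):
            seen[sample_id].append(str(path))
    return {sid: files for sid, files in seen.items() if len(files) > 1}
```

Calling shared_samples on the list of downloaded VCF paths returns an empty dictionary when every sample ID is unique to one file. Any IDs it does return are not necessarily errors; they are the ones to review against the "same sample across all files" requirement.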
You must also provide a tab-delimited file that contains the machine learning labels for the samples. The file must be named 'samples.tab' and placed in the root directory of the input or in any of its subdirectories. The first line is a header in the following format:
<id-column-label> <label-column-1> ... <label-column-n>
The data rows are in the format:
<sample-id> <label-value-1> ... <label-value-n>
For instance, the first lines of the 1000 Genomes sample file are:
sample pop super_pop gender
HG00096 GBR EUR male
HG00097 GBR EUR female
The first non-header line specifies that a sample 'HG00096' appears in one or more VCF files and that it comes from a male with British (GBR) / European (EUR) ancestry. You can reference the columns in the header (pop, super_pop, or gender) during the train phase to produce feature sets and models based around those labels.
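To inspect a 'samples.tab' file before training, it can be loaded with a few lines of Python. This is a minimal sketch under the format described above (one header row, then one row per sample, tab-delimited); it is not how Harvestman reads the file internally, and the function names are hypothetical. Empty trailing header fields are ignored as a precaution.

```python
import csv
from collections import Counter


def load_sample_labels(path):
    """Read a tab-delimited samples file.

    Returns (label_columns, labels), where labels maps each sample ID
    to a dictionary of {label_column: label_value}.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        # The first header field names the ID column; the rest name labels.
        label_columns = [name for name in header[1:] if name]
        labels = {}
        for row in reader:
            if not row or not row[0]:
                continue  # skip blank lines
            labels[row[0]] = dict(zip(label_columns, row[1:]))
    return label_columns, labels


def label_counts(labels, column):
    """Count how many samples carry each value of one label column."""
    return Counter(row[column] for row in labels.values() if column in row)
```

For example, label_counts(labels, "super_pop") gives the number of samples per super-population, which is a quick way to check class balance for a label before using it in the train phase.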