This notebook contains code for predicting 1000 Genomes populations using principal component analysis (PCA) and a random forest classifier.
The VCFs can be downloaded from here
The population information file can be downloaded from here:
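Once downloaded, the population file can be turned into a sample-to-population lookup. The following is a minimal sketch, assuming the file is tab-separated with a header row that includes `sample` and `pop` columns; the function name `read_panel` and the two example rows are made up for illustration, so adjust the column names to the actual file.

```python
import csv
import io

def read_panel(handle):
    """Map each sample ID to its population label.

    Assumes a tab-separated file with a header row that includes
    'sample' and 'pop' columns (an assumption about the file layout).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row["pop"] for row in reader}

# Made-up rows standing in for the real population file.
example = io.StringIO(
    "sample\tpop\tsuper_pop\tgender\n"
    "S1\tGBR\tEUR\tmale\n"
    "S2\tCHB\tEAS\tfemale\n"
)
populations = read_panel(example)
print(populations)
```

For the real file, pass an open file handle instead of the in-memory example, e.g. `with open(path, newline="") as fh: populations = read_panel(fh)`.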
Before using the 1000 Genomes VCF files, we first do a bit of pre-processing.
First, we concatenate the VCFs into a single VCF file using bcftools. The command for doing this is in the script
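The script itself is not reproduced here. As a rough sketch of what such a concatenation step could look like, the snippet below builds a `bcftools concat` call from Python. The per-chromosome input names (`chr1.vcf.gz` to `chr22.vcf.gz`) and the helper `build_concat_command` are hypothetical, and the output name is taken from the vcftools command used in the next step; this is not necessarily how the original script does it.

```python
def build_concat_command(inputs, output):
    """Build a `bcftools concat` call that writes one compressed VCF."""
    # -O z: write compressed VCF output; -o: output file path.
    return ["bcftools", "concat", "-O", "z", "-o", output, *inputs]

# Hypothetical per-chromosome file names, listed in chromosome order.
inputs = [f"chr{n}.vcf.gz" for n in range(1, 23)]
command = build_concat_command(
    inputs,
    "all.1kg.phase3_shapeit2_mvncall_integrated_v1b.20130502.vcf.gz",
)
print(" ".join(command))

# To actually run it (requires bcftools on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if file names contain unusual characters.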
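Then, we use vcftools to keep only a subset of the variants: --snps retains the SNPs whose IDs are listed in pruned_SS2_ids_out.txt, and --recode writes the result as a new VCF file (out.recode.vcf by default, since no --out prefix is given). The command for doing this is: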
vcftools --gzvcf all.1kg.phase3_shapeit2_mvncall_integrated_v1b.20130502.vcf.gz --snps pruned_SS2_ids_out.txt --recode
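To make the ID-based filter concrete, here is a plain-Python sketch of the same idea: header lines pass through, and a record is kept only if its ID column appears in a given list. It illustrates the logic only and is not a replacement for vcftools; the function name `filter_vcf_by_id` and the toy records are made up.

```python
def filter_vcf_by_id(lines, keep_ids):
    """Yield header lines plus records whose ID (3rd column) is in keep_ids."""
    keep = set(keep_ids)
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and column header pass through
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] in keep:
            yield line

# Made-up records for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trsA\tA\tG\t.\tPASS\t.\n",
    "1\t200\trsB\tC\tT\t.\tPASS\t.\n",
]
kept = list(filter_vcf_by_id(toy_vcf, ["rsB"]))
print("".join(kept), end="")
```

Using a set for the IDs keeps each lookup constant-time, which matters when the ID list and the VCF are both large.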
After these steps, the data are ready for processing.