# 1. Load 1000 Genomes Data

We'll work only with genotyping data for chromosome 21, which is the [smallest human chromosome](https://en.wikipedia.org/wiki/Chromosome_21). Its small size should make things a bit faster. Of course, restricting the analysis to a single chromosome is a limitation of this project, not a practice to follow in general.



## 1.1. Convert the chromosome 21 `vcf.gz` file to a simpler matrix format

Regardless of which dimensionality reduction method we use, the input is ultimately the same -- a matrix of SNPs and samples.

Here, we use a shell command **that was adapted from PSET 2 part 1** for converting a `vcf.gz` file to a tab-delimited matrix format. This command is contained in [`vcf2tab.sh`](https://github.com/fedarko/plink-182/blob/master/vcf2tab.sh). It's a moderately complex command that does a few things besides just converting this data, so we explain certain parts of it here for the sake of clarity:

### `bcftools query` options

* `-e'AF<0.01' -e'AF>0.99'`

  * These `bcftools query` options filter out SNPs whose listed allele frequency (`AF`) is less than 1% or greater than 99%. In either case, the minor allele frequency is below 1%.
  
* `-e'MULTI_ALLELIC=1'`

  * This option explicitly filters out multiallelic SNPs (`MULTI_ALLELIC` is a flag, and the `=1` check is [how bcftools expects expressions to test for flags being true](http://samtools.github.io/bcftools/bcftools.html#expressions)). We do this here rather than later on because **judging SNPs as multiallelic based only on whether any samples' genotypes contain `2` or `3` will silently fail to recognize multiallelic SNPs where every sample present only has alleles classified as `0` or `1`.**

  * This is demonstrably the case for PSET 2's data: **TODO: substantiate this claim beyond the md5sums being different, e.g. by writing a Python script that goes through the VCF and finds all lines where the SNP is tagged as `MULTI_ALLELIC` but none of the samples' genotypes at that SNP contain `2` or `3`.**
  
  * These sorts of corner cases are probably pretty rare, especially for datasets with lots of samples (where it becomes more and more likely that at least one sample has a `2` or `3` in its genotype). However, these corner cases do introduce some bias nonetheless, so accounting for them helps ensure that our downstream analyses use the highest-quality data we can provide.
  
* `-f "[\t%GT\t]\n"`

  * This option makes `bcftools query` output samples' genotypes for each SNP, where each genotype is preceded and followed by a tab and the SNPs are separated by newlines. (Each genotype is represented as something like `[tab]0|1[tab]`; see the "Extracting per-sample tags" section [here](https://samtools.github.io/bcftools/howtos/query.html).)

  * Since this is phased genotyping data, we can safely assume that samples' genotypes are represented with pipes (`0|0`) instead of slashes (`0/0`). See [section 1.4.2 of the VCF spec](https://samtools.github.io/hts-specs/VCFv4.2.pdf) for details.
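
The multiallelic corner case described above could be checked with a short script along these lines. This is only a sketch under a few assumptions: `hidden_multiallelic_sites` is a hypothetical helper name, the records in `example_vcf` are made up for illustration, `MULTI_ALLELIC` is assumed to appear as a bare flag in the INFO column, and `GT` is assumed to be the first per-sample subfield.

```python
import re


def hidden_multiallelic_sites(vcf_lines):
    """Yield (chrom, pos, id) for records flagged MULTI_ALLELIC whose
    sample genotypes only contain the allele indices 0 and 1."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, snp_id, info = fields[0], fields[1], fields[2], fields[7]
        if "MULTI_ALLELIC" not in info.split(";"):
            continue
        # Assumption: GT is the first colon-separated subfield of each sample
        genotypes = (sample.split(":")[0] for sample in fields[9:])
        # Accept both phased (|) and unphased (/) separators
        alleles = {a for gt in genotypes for a in re.split(r"[|/]", gt)}
        if alleles <= {"0", "1", "."}:
            yield chrom, pos, snp_id


# Made-up records for illustration only (not taken from the real data)
example_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "21\t100\trs1\tA\tG\t100\tPASS\tAF=0.5\tGT\t0|1\t1|1\n",
    "21\t200\trs2\tC\tT,G\t100\tPASS\tAF=0.3,0.01;MULTI_ALLELIC\tGT\t0|1\t0|0\n",
    "21\t300\trs3\tC\tG,A\t100\tPASS\tAF=0.1,0.2;MULTI_ALLELIC\tGT\t0|2\t1|0\n",
]

print(list(hidden_multiallelic_sites(example_vcf)))  # [('21', '200', 'rs2')]
```

To run this on a real `vcf.gz` file, the lines could be read with `gzip.open(path, "rt")` and passed to the same function. Any record it reports is one that a genotype-based check would have missed.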

### `sed` options

* `sed 's/0|0/0/g' | sed 's/0|1/1/g' | sed 's/1|0/1/g' | sed 's/1|1/2/g'`

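  * These calls to `sed` convert the genotypes output from `bcftools query` into numbers -- here, this is just the number of copies of the alternate allele (`1`) each sample has (0, 1, or 2). Because the allele frequency filter above keeps SNPs with `AF` up to 99%, the alternate allele is not necessarily the minor allele.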
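
For comparison, the same conversion can be sketched in Python by counting the copies of allele `1` in each genotype. This is an illustrative equivalent, not the command used in `vcf2tab.sh`; `genotype_counts` is a hypothetical name, and unlike `sed` (which leaves unmatched genotypes untouched) this version raises an error on anything other than the four expected genotypes.

```python
# Map each phased biallelic genotype to its count of allele 1
GENOTYPE_TO_COUNT = {"0|0": 0, "0|1": 1, "1|0": 1, "1|1": 2}


def genotype_counts(line):
    """Convert one line of tab-wrapped genotypes into a list of counts."""
    counts = []
    # Each genotype is wrapped in tabs, so splitting on whitespace
    # discards the empty fields between neighboring genotypes
    for genotype in line.split():
        if genotype not in GENOTYPE_TO_COUNT:
            raise ValueError(f"unexpected genotype: {genotype!r}")
        counts.append(GENOTYPE_TO_COUNT[genotype])
    return counts


example_line = "\t0|0\t\t0|1\t\t1|0\t\t1|1\t\n"
print(genotype_counts(example_line))  # [0, 1, 1, 2]
```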

### `grep` options (no longer applicable)

* PSET 2's version of this command included a `grep -v "|"` call after the `sed` calls, which filtered out any SNP lines still containing a `|`. This would have been the case for any SNP with a genotype containing alleles besides `0` or `1` (e.g. `0|2`), since such genotypes wouldn't have matched any of the patterns provided to `sed` above.

* This helped filter *some, but not all* multiallelic SNPs, as described above. Since we now filter out multiallelic SNPs "upstream," using `bcftools query`, we can safely remove this line.

In [5]:
%%bash

nohup ~/plink-182/vcf2tab.sh

### For reference, retrieve sample and SNP IDs using `bcftools query`
The sample retrieval command was also taken from PSET 2, part 1.

In [6]:
%%bash
CHR_21_VCF_GZ=/datasets/cs284s-sp20-public/1000Genomes/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

# Get sample IDs
bcftools query -l "$CHR_21_VCF_GZ" > ~/project/sample_ids.txt

# Get SNP IDs (use the same subset of SNPs as above)
bcftools query -e'AF<0.01' -e'AF>0.99' -e'MULTI_ALLELIC=1' -f "%ID\n" "$CHR_21_VCF_GZ" > ~/project/snp_ids.txt
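
Since the ID lists and the matrix come from separate `bcftools query` calls, it can be worth confirming that their dimensions agree. The sketch below assumes the matrix has one line per SNP and one whitespace-separated value per sample, as described above. `check_matrix_shape` is a hypothetical helper, and the matrix path in the usage comment is a placeholder, since the output location of `vcf2tab.sh` isn't shown here.

```python
def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path) as f:
        return sum(1 for _ in f)


def check_matrix_shape(matrix_path, sample_ids_path, snp_ids_path):
    """Check that the matrix has one row per SNP ID and one column per
    sample ID, and return (n_snps, n_samples) if so."""
    n_samples = count_lines(sample_ids_path)
    n_snps = count_lines(snp_ids_path)
    n_rows = 0
    with open(matrix_path) as f:
        for n_rows, line in enumerate(f, start=1):
            # split() with no arguments ignores the doubled tabs
            # between values and the tab at each end of the line
            n_cols = len(line.split())
            if n_cols != n_samples:
                raise ValueError(
                    f"row {n_rows} has {n_cols} values, expected {n_samples}"
                )
    if n_rows != n_snps:
        raise ValueError(f"matrix has {n_rows} rows, expected {n_snps}")
    return n_rows, n_samples


# Usage sketch (the matrix path below is a hypothetical placeholder):
# import os
# home = os.path.expanduser("~")
# check_matrix_shape(
#     os.path.join(home, "project", "matrix.tab"),
#     os.path.join(home, "project", "sample_ids.txt"),
#     os.path.join(home, "project", "snp_ids.txt"),
# )
```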