# Filter genotype data
This notebook was adapted from one written by Ryan Waples: 
https://github.com/rwaples/chum_populations

### Bring:
* raw genotypes in [MAP/PED format](http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml)
    * \*.ped file - individual pedigree information and genotype calls
    * \*.map file - variant information file
   
These files can be generated by the **`populations`** program within **`Stacks`** using the `--plink` flag.

### Take away:
* Set of genotypes passing filtering criteria.

### Programs used:
* [PLINK v1.9b](http://pngu.mgh.harvard.edu/~purcell/plink2/index.html)
    
### Steps
1. Setup
2. Filter based on missingness.
3. Test for Hardy-Weinberg equilibrium (HWE) and quantify minor allele frequency (MAF) within each population.
4. Apply filters based on HWE and MAF.
5. Keep only one SNP per locus.
6. Summary statistics

I'm not sure how Ryan ran all the above steps. I didn't run them here in this notebook; I used the command line. I ran the Python code as Python programs. For the plink commands, I either copied them into the command line or ran them as a batch file (.bat).

## Step 1
### Setup

Change to the directory with your map/ped files. I copied the plink program here too, because I have something else that also runs with the same command. The filtering process will generate many intermediate files, so I made a separate directory for them.

cd G:\Analysis\Pop_analysis\RenamedCatalog2\FilteringDips\plink_files

#### universal plink info

In [None]:
raw_genotypes  = "batch_3.plink"
universal_plink_commands = "--allow-extra-chr --allow-no-sex --write-snplist"

#### filtering parameters

In [None]:
pops_to_keep = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14']

# missingness
locus_max_miss = 0.25
ind_max_miss = 0.25

# MAF
local_MAF_threshold = 0.05
global_MAF_threshold = 0.05

# HWE
local_HWE_pval = .05
HWE_pop_threshold = 5

This command imports the raw genotypes and keeps only the populations listed in `pops_to_keep`.

`--family` defines one cluster per family ID, which here is the population. `--keep-cluster-names` takes a space-delimited sequence of cluster names on the command line; all samples not in one of those clusters are removed from the current analysis.

In [None]:
plink --file batch_3.plink --allow-extra-chr --allow-no-sex --write-snplist --family --keep-cluster-names 1 2 3 4 5 6 7 8 9 10 11 12 13 14 --make-bed --out filter_pops
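
The command above is typed out by hand, although the parameter cells already hold its pieces. As a minimal sketch, the same command string could be assembled from those variables. The helper `build_import_command` is a hypothetical name introduced here for illustration; it is not part of the original notebook.

```python
# Sketch: assemble the import command from the notebook's parameter variables.
raw_genotypes = "batch_3.plink"
universal_plink_commands = "--allow-extra-chr --allow-no-sex --write-snplist"
pops_to_keep = [str(i) for i in range(1, 15)]

def build_import_command(raw, pops, universal, out="filter_pops"):
    """Return the PLINK command that keeps only the listed populations.

    Hypothetical helper, not part of the original notebook.
    """
    return " ".join([
        "plink", "--file", raw, universal,
        "--family", "--keep-cluster-names", " ".join(pops),
        "--make-bed", "--out", out,
    ])

import_command = build_import_command(raw_genotypes, pops_to_keep, universal_plink_commands)
print(import_command)
```

Building the string this way keeps the population list and the shared flags in one place, so a change to `pops_to_keep` carries through to the command.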


## Step 2
### Filter based on missingness

#### remove samples with missing genotypes at more than 25% of loci (`--mind 0.25`)

In [None]:
plink --bfile filter_pops --mind 0.25 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out filter_inds

#### remove loci with missing genotypes in more than 25% of samples (`--geno 0.25`)

In [None]:
plink --bfile  filter_inds --geno 0.25 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out miss_final
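
To make the two missingness filters concrete, here is a small sketch of the same logic on a made-up genotype matrix. The data and the helper `missing_rate` are invented for illustration; PLINK does this on the binary files, and this is not its implementation.

```python
# Toy genotype matrix: rows are samples, columns are loci, None marks a missing call.
genotypes = [
    ["AA", "AG", None, "GG"],   # sample 0: 1 of 4 calls missing (0.25)
    [None, None, "CT", "GG"],   # sample 1: 2 of 4 calls missing (0.50)
    ["AA", "GG", "CC", "GG"],   # sample 2: no missing calls
    ["AA", "AG", "CT", None],   # sample 3: 1 of 4 calls missing (0.25)
]
ind_max_miss = 0.25
locus_max_miss = 0.25

def missing_rate(calls):
    """Fraction of calls that are missing (hypothetical helper)."""
    return sum(c is None for c in calls) / float(len(calls))

# Like --mind: drop samples whose missing rate exceeds the threshold.
kept_samples = [i for i, row in enumerate(genotypes)
                if missing_rate(row) <= ind_max_miss]

# Like --geno: among the remaining samples, drop loci whose missing rate
# exceeds the threshold.
n_loci = len(genotypes[0])
kept_loci = [j for j in range(n_loci)
             if missing_rate([genotypes[i][j] for i in kept_samples]) <= locus_max_miss]

print(kept_samples, kept_loci)
```

The order matters: the locus filter is computed only over samples that survived the sample filter, which matches running `--mind` first and `--geno` on its output.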

## Step 3
### HWE and MAF within each population.

#### Create a separate file for each population.

I ran this as a python script called pops_to_keep.py to give me all the plink commands I would need to make a batch file, instead of writing them out for each of the 14 populations by hand: 

In [None]:
pops_to_keep = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14']

# Create a separate file for each population
# format() is used so that no space is inserted between "fam_" and the population number
for x in pops_to_keep:
    print("plink --bfile miss_final --family --keep-cluster-names {0} --allow-extra-chr --allow-no-sex --write-snplist --autosome-num 50 --make-bed --out fam_{0}".format(x))


In [None]:
#Create a separate file for each population
#ran this as the plink.bat file, it has the windows line endings

plink --bfile  miss_final --family --keep-cluster-names 1 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_1
plink --bfile  miss_final --family --keep-cluster-names 2 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_2
plink --bfile  miss_final --family --keep-cluster-names 3 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_3
plink --bfile  miss_final --family --keep-cluster-names 4 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_4
plink --bfile  miss_final --family --keep-cluster-names 5 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_5
plink --bfile  miss_final --family --keep-cluster-names 6 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_6
plink --bfile  miss_final --family --keep-cluster-names 7 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_7
plink --bfile  miss_final --family --keep-cluster-names 8 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_8
plink --bfile  miss_final --family --keep-cluster-names 9 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_9
plink --bfile  miss_final --family --keep-cluster-names 10 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_10
plink --bfile  miss_final --family --keep-cluster-names 11 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_11
plink --bfile  miss_final --family --keep-cluster-names 12 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_12
plink --bfile  miss_final --family --keep-cluster-names 13 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_13
plink --bfile  miss_final --family --keep-cluster-names 14 --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_14
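
Rather than copying printed commands into a batch file, the commands could be written to the file directly. The following is a sketch under some assumptions: it needs Python 3 (for the `newline` argument of `open`), it writes to a temporary directory instead of the working directory, and `write_batch_file` is a hypothetical helper name.

```python
import os
import tempfile

pops_to_keep = [str(i) for i in range(1, 15)]
template = ("plink --bfile miss_final --family --keep-cluster-names {pop} "
            "--allow-extra-chr --allow-no-sex --write-snplist --make-bed --out fam_{pop}")

def write_batch_file(path, pops):
    """Write one PLINK command per population (hypothetical helper)."""
    # newline="\r\n" makes Python write Windows line endings on any platform
    with open(path, "w", newline="\r\n") as handle:
        for pop in pops:
            handle.write(template.format(pop=pop) + "\n")

# A temporary directory is used here only so the sketch leaves no files behind.
batch_path = os.path.join(tempfile.mkdtemp(), "plink.bat")
write_batch_file(batch_path, pops_to_keep)

# Read the file back in binary mode to confirm the line endings.
with open(batch_path, "rb") as handle:
    batch_lines = handle.read().decode("ascii").split("\r\n")
print(batch_lines[0])
```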

#### Calculate HWE statistics and MAF; list the SNPs passing HWE and the SNPs with MAF above the threshold in each pop
This Python program was also run to generate the commands for a batch file.

In [None]:
pops_to_keep = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14']

# Calculate HWE statistics, write a list of SNPs passing HWE in each pop
# format() is used so that no space is inserted around the population number
for x in pops_to_keep:
    print("plink --bfile fam_{0} --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist --autosome-num 50 --out fam_{0}.hwe".format(x))

# Calculate MAF, make a list of SNPs with MAF > threshold in each pop
for x in pops_to_keep:
    print("plink --bfile fam_{0} --maf 0.05 --allow-extra-chr --allow-no-sex --write-snplist --autosome-num 50 --out fam_{0}.maf".format(x))
    

In [None]:
#plink2.bat
#may 8, 2015

plink --bfile fam_1 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_1.hwe
plink --bfile fam_2 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_2.hwe
plink --bfile fam_3 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_3.hwe
plink --bfile fam_4 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_4.hwe
plink --bfile fam_5 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_5.hwe
plink --bfile fam_6 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_6.hwe
plink --bfile fam_7 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_7.hwe
plink --bfile fam_8 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_8.hwe
plink --bfile fam_9 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_9.hwe
plink --bfile fam_10 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_10.hwe
plink --bfile fam_11 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_11.hwe
plink --bfile fam_12 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_12.hwe
plink --bfile fam_13 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_13.hwe
plink --bfile fam_14 --hwe .05 midp --hardy midp --allow-extra-chr --allow-no-sex --write-snplist  --out fam_14.hwe
plink --bfile fam_1 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_1.maf
plink --bfile fam_2 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_2.maf
plink --bfile fam_3 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_3.maf
plink --bfile fam_4 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_4.maf
plink --bfile fam_5 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_5.maf
plink --bfile fam_6 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_6.maf
plink --bfile fam_7 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_7.maf
plink --bfile fam_8 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_8.maf
plink --bfile fam_9 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_9.maf
plink --bfile fam_10 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_10.maf
plink --bfile fam_11 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_11.maf
plink --bfile fam_12 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_12.maf
plink --bfile fam_13 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_13.maf
plink --bfile fam_14 --maf  0.05 --allow-extra-chr --allow-no-sex --write-snplist   --out fam_14.maf
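
For readers unfamiliar with the quantities behind these two filters, here is a rough sketch of how minor allele frequency and Hardy-Weinberg expected genotype counts follow from observed genotype counts. The counts are made up, `maf_and_hwe_expected` is a hypothetical helper, and the sketch does not compute the mid-p exact test p-value that `--hwe ... midp` uses.

```python
def maf_and_hwe_expected(n_aa, n_ab, n_bb):
    """Return (MAF, expected genotype counts under HWE) for one SNP.

    Hypothetical helper for illustration; PLINK's --hwe uses an exact test
    rather than a comparison of these expected counts.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)   # frequency of allele A
    q = 1.0 - p                         # frequency of allele B
    maf = min(p, q)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return maf, expected

# Made-up counts for one SNP in one population: 30 AA, 15 AB, 5 BB
maf, expected = maf_and_hwe_expected(30, 15, 5)

# --maf 0.05 removes SNPs whose MAF is below the threshold
passes_local_maf = maf >= 0.05
print(maf, expected, passes_local_maf)
```

A large gap between observed and expected heterozygote counts is the kind of deviation the HWE test is designed to flag.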


## Step 4
### Apply filters

In [None]:
#step4.py
#May 8, 2015

import collections

pops_to_keep = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14']

# Count, for each SNP, the number of populations in which it passed the local MAF filter
passing_MAF = collections.defaultdict(int)

for x in pops_to_keep:
    with open("fam_{}.maf.snplist".format(x)) as INFILE:
        for line in INFILE:
            passing_MAF[line.strip()] += 1

# Keep every SNP that passed the MAF filter in at least one population
with open("passing_MAF.snplist", 'w') as OUTFILE:
    for locus in passing_MAF.keys():
        OUTFILE.write(locus + "\n")

# Count, for each SNP, the number of populations in which it passed the HWE test
passing_HWE = collections.defaultdict(int)

for x in pops_to_keep:
    with open("fam_{}.hwe.snplist".format(x)) as INFILE:
        for line in INFILE:
            passing_HWE[line.strip()] += 1

# Keep SNPs that passed the HWE test in at least 5 populations (HWE_pop_threshold)
with open("passing_HWE.snplist", 'w') as OUTFILE:
    for locus, pass_cnt in passing_HWE.items():
        if pass_cnt >= 5:
            OUTFILE.write(locus + "\n")
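
The two filters in this script follow the same pattern: count how many populations each SNP passed in, then keep SNPs whose count reaches a minimum. A compact sketch of that pattern on in-memory lists follows; the SNP names and the helper `combine_snplists` are invented for illustration.

```python
from collections import Counter

def combine_snplists(per_pop_lists, min_pops=1):
    """Return SNPs that appear in at least min_pops of the per-population lists.

    Hypothetical helper mirroring the counting logic of step4.py.
    """
    counts = Counter()
    for snps in per_pop_lists:
        counts.update(set(snps))   # count each SNP once per population
    return sorted(snp for snp, n in counts.items() if n >= min_pops)

# Made-up per-population SNP lists (three populations)
maf_lists = [["10_5", "11_7"], ["11_7"], ["12_3"]]
hwe_lists = [["10_5", "11_7"], ["11_7", "12_3"], ["11_7"]]

# MAF: passing in any one population is enough
passing_maf = combine_snplists(maf_lists, min_pops=1)
# HWE: require a minimum number of populations (2 here; the script uses 5 of 14)
passing_hwe = combine_snplists(hwe_lists, min_pops=2)
print(passing_maf, passing_hwe)
```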

#### Remove loci failing the local HWE or MAF filters

In [None]:
plink --bfile  miss_final --extract passing_MAF.snplist --allow-extra-chr --allow-no-sex --write-snplist --make-bed --out filter_MAF
plink --bfile  filter_MAF --extract passing_HWE.snplist --allow-extra-chr --allow-no-sex --write-snplist  --make-bed --out filter_HWE

#### global MAF filter

In [None]:
plink --bfile  filter_HWE --maf .05 --allow-extra-chr --allow-no-sex --write-snplist --freq  --make-bed --out filter_HWE_MAF

## Step 5 
### Retain a single SNP at each locus
#### select based on maximum MAF

In [16]:
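import pandas as pd

# Step 5: retain a single SNP at each locus, selected by maximum MAF

# PLINK .frq files are space-padded, so skip the extra spaces when parsing
maf_frq = pd.read_csv("filter_HWE_MAF.frq", sep=" ", skipinitialspace = True)

# SNP names have the form <catalog ID>_<position>
maf_frq['catID']  =  [int(xx[0]) for xx in maf_frq['SNP'].str.split("_", n = 1)]
maf_frq['pos']  = [int(xx[1]) for xx in maf_frq['SNP'].str.split("_", n = 1)]
# sort so that the highest-MAF SNP comes first within each catID
maf_frq.sort_values(['catID', 'MAF'], ascending = False, inplace = True)
# drop_duplicates keeps the first row per catID, i.e. the SNP with the highest MAF
single_catID = maf_frq.drop_duplicates(subset = ['catID'])
# header = False keeps the column name out of the SNP list
single_catID['SNP'].to_csv('single_catID.snplist', index = False, header = False)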
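
The sort-then-drop-duplicates idiom above can be hard to follow at first. As an illustration of the same selection without pandas, the sketch below keeps the highest-MAF SNP per catalog locus using a plain dictionary. The SNP names and MAF values are made up, `pick_best_snp_per_locus` is a hypothetical helper, and ties are resolved here by keeping the first SNP seen, which may differ from the pandas version.

```python
def pick_best_snp_per_locus(snp_mafs):
    """Return one SNP name per catalog locus, choosing the highest MAF.

    snp_mafs is an iterable of (snp_name, maf) pairs, where snp_name has
    the form "<catID>_<position>". Hypothetical helper for illustration.
    """
    best = {}
    for snp, maf in snp_mafs:
        cat_id = int(snp.split("_", 1)[0])
        # Replace the stored SNP only if this one has a strictly higher MAF
        if cat_id not in best or maf > best[cat_id][1]:
            best[cat_id] = (snp, maf)
    return [best[cat_id][0] for cat_id in sorted(best)]

# Made-up frequencies: loci 101 and 202 each carry two SNPs
toy_frq = [("101_12", 0.10), ("101_45", 0.35),
           ("202_7", 0.22), ("202_30", 0.08),
           ("303_5", 0.41)]
selected = pick_best_snp_per_locus(toy_frq)
print(selected)
```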


In [None]:
plink --bfile  filter_HWE_MAF --extract single_catID.snplist --allow-extra-chr --allow-no-sex --write-snplist --make-bed --recode 12 A tabx --out complete

## Step 6
### Summary statistics

To get summary statistics on the final, filtered data set:

In [None]:
plink --bfile complete --family --missing --freq --het small-sample --ibc --fst --allow-extra-chr --allow-no-sex --write-snplist --out complete.1

plink --bfile complete --family --freqx --allow-extra-chr --allow-no-sex --write-snplist --out complete.2

plink --bfile complete --allow-extra-chr --allow-no-sex --write-snplist --recode A --out complete.3

plink --bfile complete --family --missing --freq --het small-sample --ibc --fst --allow-extra-chr --allow-no-sex --write-snplist --out complete.4

## PCA in plink

To make an individual PCA of the population data:

The first step is to generate the comparison matrix using the output of the final filtering step.

--make-rel is the primary interface to PLINK 1.9's realized relationship matrix and covariance matrix calculator.

In [None]:
plink --bfile complete --make-rel --allow-extra-chr --allow-no-sex --write-snplist --out complete_rel

plink --bfile complete  --pca --allow-extra-chr --allow-no-sex --write-snplist --out complete_pca

I took the results into R and plotted them with the ggplot2 package.

The eigenvector file was merged with a little file that had a list of the population names, their population numbers and the lineage classification as a way to plot them by category. 

In [None]:
population_number	population_name	lineage
1	PAMUR10	even
2	PAMUR11	odd
3	PHAYLY09	odd
4	PHAYLY10	even
5	PKOPE91T	odd
6	PKOPE96T	even
7	PKUSHI06	even
8	PKUSHI07	odd
9	PNOME91	odd
10	PNOME94	even
11	PSNOH03	odd
12	PSNOH96	even
13	PTAUY09	odd
14	PTAUY12	even

Below is the R code I used; it was written by Ryan.

In [None]:
#install.packages("ggplot2", type="source")

library(ggplot2)

eigenvec_table <- read.table("G:/Analysis/Pop_analysis/RenamedCatalog2/FilteringDips/complete_pca.eigenvec")
pop_key <- read.table("G:/Analysis/Pop_analysis/RenamedCatalog2/FilteringDips/pop_key.txt", header = TRUE)

tt = merge(eigenvec_table,pop_key, by.x = 'V1', by.y = 'population_number')

ggplot(data = tt) + geom_point(aes(x = V3, y = V4,color = population_name), alpha = .5, size = 4) +
   scale_color_discrete()
ggsave("pca_by_population.png")  # hypothetical file name; ggsave() requires one

ggplot(data = tt) + geom_point(aes(x = V3, y = V4,color = lineage), alpha = .5, size = 4) +
  scale_color_discrete()

ggplot(data = tt) + geom_point(aes(x = V4, y = V7,color = population_name), alpha = .5, size = 4)

ggplot(data = tt) + geom_point(aes(x = V3, y = V4,color = lineage), alpha = .5, size = 4)
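
For readers who would rather stay in Python, the merge step could look roughly like the sketch below. It assumes the `.eigenvec` layout the R code relies on (family ID, individual ID, then the principal components, whitespace-separated, no header). The sample IDs and component values are invented, and only the first two rows of the population key are reproduced.

```python
import csv
import io

# Made-up stand-in for complete_pca.eigenvec: FID, IID, PC1, PC2
eigenvec_text = "1 ind_a 0.12 -0.03\n2 ind_b -0.08 0.10\n1 ind_c 0.11 -0.02\n"

# First two rows of the tab-separated population key shown above
pop_key_text = ("population_number\tpopulation_name\tlineage\n"
                "1\tPAMUR10\teven\n"
                "2\tPAMUR11\todd\n")

# Index the key by population number, which matches the family ID column
pop_key = {row["population_number"]: row
           for row in csv.DictReader(io.StringIO(pop_key_text), delimiter="\t")}

merged = []
for line in eigenvec_text.splitlines():
    fid, iid, pc1, pc2 = line.split()
    key = pop_key[fid]
    merged.append({"sample": iid, "PC1": float(pc1), "PC2": float(pc2),
                   "population_name": key["population_name"],
                   "lineage": key["lineage"]})

for row in merged:
    print(row["sample"], row["population_name"], row["lineage"], row["PC1"], row["PC2"])
```

Each merged row carries both the population name and the lineage, so the same table can be colored by either category when plotting.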