<h2><span style="color:gray">ipyrad-analysis toolkit:</span> PCA</h2>

Principal components analysis (PCA) is a dimensionality reduction method that projects data points into a coordinate space of orthogonal PC axes, ordered by the amount of variance in the data that each axis explains.
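
As a rough illustration of the projection described above, here is a minimal NumPy sketch of PCA on a toy genotype matrix. The function name `pca_project`, the toy data, and the 0/1/2 genotype coding are assumptions made for this example; this is not the code that the ipyrad tool runs internally.

```python
import numpy as np

def pca_project(genos, n_components=2):
    """Project samples (rows) onto the top principal component axes."""
    X = np.asarray(genos, dtype=float)
    # center each column (SNP) so the axes describe variance around the mean
    X = X - X.mean(axis=0)
    # SVD of the centered matrix: the rows of Vt are the orthogonal PC axes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # coordinates of each sample on the first n_components axes
    coords = U[:, :n_components] * S[:n_components]
    # fraction of the total variance explained by each axis
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:n_components]

# toy matrix (assumed coding): 6 samples x 5 SNPs, genotypes as 0/1/2
genos = np.array([
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
])
coords, explained = pca_project(genos)
print(coords)
print(explained)
```

In this toy example the first three samples and the last three samples fall on opposite sides of PC1, which carries most of the variance.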

PCA is *very sensitive* to missing data. The `ipyrad.analysis.pca` tool is designed specifically to make it easy to work with genomic RAD-seq data that contain missing values. It does this by imputing missing genotypes before the analysis, using one of several available algorithms.

#### Imputation algorithms:
1. "sample" (preferred): Missing genotypes are imputed by randomly sampling alleles at each site based on the allele frequency at that site. If no `imap` is provided, the global frequency is used; otherwise the frequency within each population is used.

2. "simple": The most frequent genotype is imputed at each site (a tie goes to 0 over 1). If an `imap` is provided, the most frequent genotype within each population is used.

3. "kmeans": To use kmeans clustering, pass an integer as the `impute_method` argument. The data are first imputed using "simple", then clustered into that number of clusters, and then re-imputed using the most frequent base within each cluster.

4. None: No imputation of missing values (missing values are simply set to 0).
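
To make the "sample" and "simple" options above more concrete, here is a small NumPy sketch of both ideas. Everything in it is an assumption made for illustration: the function names, the 0/1/2 genotype coding, the use of `9` as a missing-data code, and the choice to draw two alleles per imputed genotype. It is not the ipyrad implementation.

```python
import numpy as np

MISSING = 9  # hypothetical code for a missing genotype call

def impute_simple(genos, missing=MISSING):
    """Fill missing calls with the most frequent genotype at each site."""
    out = genos.copy()
    for site in range(out.shape[1]):
        col = out[:, site]  # a view, so assignments update `out`
        mask = col == missing
        counts = np.bincount(col[~mask], minlength=3)
        # argmax returns the lowest index on ties, so 0 wins over 1
        col[mask] = np.argmax(counts)
    return out

def impute_sample(genos, rng, missing=MISSING):
    """Fill missing calls by sampling alleles at the site's allele frequency."""
    out = genos.copy()
    for site in range(out.shape[1]):
        col = out[:, site]
        mask = col == missing
        if not mask.any():
            continue
        observed = col[~mask]
        # frequency of the "1" allele among the observed diploid genotypes
        freq = observed.sum() / (2 * observed.size) if observed.size else 0.0
        # each imputed genotype is the sum of two random allele draws
        col[mask] = rng.binomial(2, freq, size=mask.sum())
    return out

def impute_by_population(genos, pops, impute_func):
    """Apply an imputation function separately to the rows of each population."""
    out = genos.copy()
    for rows in pops.values():
        out[rows] = impute_func(genos[rows])
    return out

# toy data: 6 samples x 3 sites, two populations of three samples each
genos = np.array([
    [0, 2, 9],
    [0, 2, 1],
    [9, 2, 1],
    [2, 9, 0],
    [2, 0, 0],
    [2, 0, 9],
])
pops = {"A": [0, 1, 2], "B": [3, 4, 5]}
rng = np.random.default_rng(123)

simple_global = impute_simple(genos)
simple_by_pop = impute_by_population(genos, pops, impute_simple)
sampled_by_pop = impute_by_population(genos, pops, lambda g: impute_sample(g, rng))
```

Comparing `simple_global` with `simple_by_pop` shows why the population assignments matter: the missing call for the third sample at the first site is filled with 2 when global frequencies are used, but with 0 when only its own population is considered.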

### Required software

In [1]:
# conda install ipyrad -c bioconda
# conda install scikit-learn -c bioconda
# conda install toyplot -c eaton-lab

In [2]:
import ipyrad.analysis as ipa
import toyplot

### Short Tutorial:


In [3]:
# the path to your .snps.hdf5 database file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"

In [4]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CRL0001", "CUCA4", "CUSV6", "CUMM5"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017", "CRL0001"],
}

# minimum number of samples from each group that must have data at a SNP
minmap = {
    "virg": 3,
    "mini": 2,
    "gemi": 2,
    "bran": 2,
    "fusi": 2,
    "sagr": 2,
    "oleo": 3,
}

In [5]:
# init pca object with input data and (optional) parameter options
pca = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method="sample",
)

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 30459
Filtered (minmap): 99517
Filtered (combined): 108292
Sites after filtering: 241622
Sites containing missing values: 231436 (95.78%)
Missing values in SNP matrix: 905662 (12.93%)
Imputation (sampled by freq. within pops): 77.0%, 11.1%, 11.9%


In [9]:
# run PCA model fit and plot the resulting axes
pca.run_and_plot_2D(1, 2, seed=123);

Subsampling SNPs: 30621/241622


## Cookbook

### The other imputation algorithms

Below I ran the None, "simple", and kmeans methods on the same data set. In general I think "sample" will almost always be the best choice, and None is almost always a bad choice. You should spend some time exploring these options with your own data.

In [10]:
# no imputation
no_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method=None,
)

# simple imputation
simple_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method="simple",
)

# kmeans imputation (this is slowest)
kmeans_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method=3,
)

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 30459
Filtered (minmap): 99517
Filtered (combined): 108292
Sites after filtering: 241622
Sites containing missing values: 231436 (95.78%)
Missing values in SNP matrix: 905662 (12.93%)
Imputation (null; sets to 0): 0:905662, 1:0, 2:0
Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 30459
Filtered (minmap): 99517
Filtered (combined): 108292
Sites after filtering: 241622
Sites containing missing values: 231436 (95.78%)
Missing values in SNP matrix: 905662 (12.93%)
Imputation (simple; most freq. within pops): 85.4%, 4.2%, 10.3%
Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 30459
Filtered (minmap): 99517
Filtered (combined): 108292
Sites after filtering: 241622
Sites containing missing values: 231436 (95.78%)
Missing values in SNP matrix: 905662 (12.93%)
Imput

In [13]:
no_imputation.run_and_plot_2D(1, 2, seed=123);
simple_imputation.run_and_plot_2D(1, 2, seed=123);
kmeans_imputation.run_and_plot_2D(1, 2, seed=123);

Subsampling SNPs: 30621/241622
Subsampling SNPs: 30621/241622
Subsampling SNPs: 30621/241622
