<h2><span style="color:gray">ipyrad-analysis toolkit:</span> PCA</h2>

Principal component analysis (PCA) is a dimensionality reduction method that projects data points onto a small number of orthogonal axes, ordered so that each successive axis explains the greatest remaining share of the variance in the data.
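To make the idea concrete, here is a minimal sketch of PCA on a toy genotype matrix using only numpy. The matrix, the 0/1/2 genotype coding, and the variable names are assumptions made up for illustration; this is not how the ipyrad tool is implemented internally.

```python
import numpy as np

# toy genotype matrix (hypothetical): 6 samples x 5 SNPs,
# coded as 0/1/2 copies of the derived allele
genos = np.array([
    [0, 0, 1, 2, 0],
    [0, 1, 1, 2, 0],
    [0, 0, 2, 2, 1],
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 2],
    [2, 2, 0, 1, 2],
], dtype=float)

# center each SNP (column) on its mean
centered = genos - genos.mean(axis=0)

# singular value decomposition; the rows of vt are the orthogonal PC axes
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# coordinates of each sample on each PC axis
coords = u * s

# proportion of variance explained by each axis (largest first)
explained = s ** 2 / np.sum(s ** 2)

print(coords[:, :2].round(2))
print(explained.round(3))
```

In this toy example the first three samples and the last three samples fall on opposite sides of the first axis, which is the kind of grouping the plots in this notebook are meant to reveal.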

PCA is *very sensitive* to missing data. The `ipyrad.analysis.pca` tool makes it easy to perform PCA on RAD-seq data by filtering and/or imputing missing data, and by allowing easy subsampling of the individuals to include in an analysis.

### Required software

In [1]:
# conda install ipyrad -c bioconda
# conda install scikit-learn -c bioconda
# conda install toyplot -c eaton-lab

In [2]:
import ipyrad.analysis as ipa
import toyplot

### Short Tutorial:


### Imputation algorithms:

We offer three algorithms for *imputing* missing data:

1. **sample**: Randomly sample genotypes based on the frequency of alleles within (user-defined) populations (imap).   


2. **kmeans**: Randomly sample genotypes based on the frequency of alleles in (kmeans cluster-generated) populations. 


3. **None**: All missing values are imputed with zeros (ancestral allele).
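The "sample" option described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name `impute_by_pop_freq`, the 0/1/2 genotype coding, and the missing-data code `9` are all hypothetical choices for this example.

```python
import numpy as np

MISSING = 9  # assumed missing-data code for this sketch


def impute_by_pop_freq(genos, names, imap, seed=None):
    """Fill missing genotypes by sampling from within-population allele frequencies."""
    rng = np.random.default_rng(seed)
    out = genos.copy()
    row_of = {name: i for i, name in enumerate(names)}
    for pop, members in imap.items():
        rows = np.array([row_of[m] for m in members])
        block = genos[rows]
        observed = block != MISSING
        n_obs = observed.sum(axis=0)
        allele_sum = np.where(observed, block, 0).sum(axis=0)
        # derived-allele frequency per SNP; 0 where the population has no data
        freq = np.divide(
            allele_sum, 2 * n_obs,
            out=np.zeros(block.shape[1]), where=n_obs > 0,
        )
        # draw a diploid genotype (0, 1, or 2) for every cell, keep observed calls
        draws = rng.binomial(2, freq, size=block.shape)
        out[rows] = np.where(observed, block, draws)
    return out


names = ["a1", "a2", "a3", "b1", "b2", "b3"]
toy_imap = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
toy = np.array([
    [0, 0, 2],
    [0, 9, 2],
    [9, 0, 2],
    [2, 2, 0],
    [2, 9, 0],
    [2, 2, 9],
])
filled = impute_by_pop_freq(toy, names, toy_imap, seed=123)
print(filled)
```

Because each toy population is fixed for one allele at every SNP, the missing calls here are filled with that population's genotype. With intermediate allele frequencies the filled values would vary between random seeds.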

#### Input data file and population assignments
If you are using the "sample" imputation method, population assignments (the imap dictionary) are used for filtering, color-coding plots, and imputation. If you are using the "kmeans" imputation method, population assignments are used only for filtering and color-coding plots.

In [4]:
# the path to your .snps.hdf5 database file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"

In [5]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CRL0001", "CUCA4", "CUSV6", "CUMM5"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017", "CRL0001"],
}

# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}

#### Enter data file and params
The `pca` analysis object takes as input the *.snps.hdf5* file produced by ipyrad. All other parameters are optional. The **imap** dictionary groups individuals into populations, and **minmap** can be used to filter SNPs to include only those that have data for at least some proportion of samples in every group. The **mincov** option works similarly: it filters out SNPs that are shared across less than some proportion of all samples (in contrast to minmap, it does not use the imap groupings).

When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining amount of missing data after filtering. These remaining missing values are the ones that will be filled with imputation. 
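As a rough sketch of how the two coverage filters differ, the example below computes both masks on a toy matrix. The helper name `coverage_filters` and the missing-data code `9` are assumptions for illustration only, not part of the ipyrad API.

```python
import numpy as np

MISSING = 9  # assumed missing-data code for this sketch


def coverage_filters(genos, names, imap, minmap, mincov):
    """Return boolean keep-masks over SNPs for the mincov and minmap filters."""
    present = genos != MISSING
    # mincov: proportion of ALL samples with data at each SNP
    keep_mincov = present.mean(axis=0) >= mincov
    # minmap: EVERY group must reach its own proportion of samples with data
    row_of = {name: i for i, name in enumerate(names)}
    keep_minmap = np.ones(genos.shape[1], dtype=bool)
    for pop, members in imap.items():
        idx = [row_of[m] for m in members]
        keep_minmap &= present[idx].mean(axis=0) >= minmap[pop]
    return keep_mincov, keep_minmap


names = ["a1", "a2", "b1", "b2"]
toy_imap = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
toy_minmap = {pop: 0.5 for pop in toy_imap}
toy = np.array([
    [0, 9, 9, 0],
    [1, 9, 0, 9],
    [2, 0, 9, 9],
    [2, 1, 9, 9],
])
keep_mincov, keep_minmap = coverage_filters(toy, names, toy_imap, toy_minmap, mincov=0.5)
print(keep_mincov, keep_minmap, keep_mincov & keep_minmap)
```

The second SNP passes mincov (half of all samples have data) but fails minmap because group A has no data at all. A SNP can also fail both filters at once, which is why the "combined" count in the printed output can be smaller than the sum of the individual filter counts.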

In [7]:
# init pca object with input data and (optional) parameter options
self = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.25,
    impute_method="sample",
)

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 10391
Filtered (minmap): 115543
Filtered (combined): 123702
Sites after filtering: 226212
Sites containing missing values: 216026 (95.50%)
Missing values in SNP matrix: 788976 (12.03%)
Imputation (sampled by freq. within pops): 77.1%, 11.1%, 11.8%


#### Run PCA and plot results. 
When you call `.run()`, a PCA model is fit to the data and two results are returned: (1) sample weightings on the component axes; (2) the proportion of variance explained by each axis. Since the next step is typically to plot these values, you can use the function `.run_and_plot_2D()` to return the results as a toyplot plot. The first two arguments are the two axes to be plotted.

In [17]:
# plot PC axes 0 and 2 with no subsampling
self.run_and_plot_2D(0, 2, subsample=False)

Subsampling SNPs: 226212/226212


#### Subsampling SNPs
By default, `run()` randomly subsamples one SNP per RAD locus to reduce the effect of linkage on your results. This can be turned off by setting `subsample=False`, as in the plot above. When subsampling, you can set the random seed to make your results repeatable. Here subsampling reduces the data from 226K SNPs to 29K, but the final results are quite similar.
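The idea of drawing one SNP per locus with a fixed seed can be sketched like this. The function `subsample_one_per_locus` and the `locus_ids` array are hypothetical names for this example; the tool's own bookkeeping of which SNP belongs to which locus may differ.

```python
import numpy as np


def subsample_one_per_locus(locus_ids, seed=None):
    """Pick one random SNP column index from each locus."""
    rng = np.random.default_rng(seed)
    locus_ids = np.asarray(locus_ids)
    picks = []
    for locus in np.unique(locus_ids):
        cols = np.flatnonzero(locus_ids == locus)
        picks.append(rng.choice(cols))
    return np.array(picks)


# locus membership of 7 SNP columns (hypothetical): 4 loci in total
locus_ids = np.array([0, 0, 0, 1, 2, 2, 3])

picks = subsample_one_per_locus(locus_ids, seed=123)
repeat = subsample_one_per_locus(locus_ids, seed=123)
print(picks, repeat)
```

Calling the function twice with the same seed returns the same columns, which is what makes a subsampled analysis repeatable.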

In [26]:
# plot PC axes 0 and 2, subsampling one SNP per locus with a fixed seed
self.run_and_plot_2D(0, 2, seed=123, subsample=True);

Subsampling SNPs: 29560/226212


## Cookbook

### Other imputation options

#### No imputation
The None option will almost always be a *bad choice* when there is an appreciable amount of missing data. Missing values will all be filled with zeros (the ancestral allele). I show it here for comparison with the imputed results. The two points near the top of the plot are the samples with the most missing data, which are erroneously grouped together. The rest of the samples also form much less clear clusters than in the other examples, where we use imputation or stricter filtering options.
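A tiny made-up example shows why zero-filling pulls samples with a lot of missing data toward each other. The two genotype vectors and the missing-data code `9` are assumptions for illustration.

```python
import numpy as np

MISSING = 9  # assumed missing-data code for this sketch

# two samples whose true genotypes differ at every site
a = np.array([2, 2, 2, 2, 0, 0, 0, 0])
b = np.array([0, 0, 0, 0, 2, 2, 2, 2])

# mask three derived-allele calls in each sample
a_obs = a.copy()
a_obs[1:4] = MISSING
b_obs = b.copy()
b_obs[5:8] = MISSING

# the None option: every missing call becomes 0
a_zero = np.where(a_obs == MISSING, 0, a_obs)
b_zero = np.where(b_obs == MISSING, 0, b_obs)

true_dist = np.linalg.norm(a - b)
zero_dist = np.linalg.norm(a_zero - b_zero)
print(round(true_dist, 2), round(zero_dist, 2))
```

After zero-filling, both samples look mostly ancestral, so the distance between them shrinks even though their true genotypes are completely different.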

In [19]:
# init pca object without imputation (missing values are set to 0)
no_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.25,
    impute_method=None,
)
no_imputation.run_and_plot_2D(0, 2, seed=123, subsample=False);

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 10391
Filtered (minmap): 115543
Filtered (combined): 123702
Sites after filtering: 226212
Sites containing missing values: 216026 (95.50%)
Missing values in SNP matrix: 788976 (12.03%)
Imputation (null; sets to 0): 100.0%, 0.0%, 0.0%
Subsampling SNPs: 226212/226212


#### No imputation but stricter filtering
Here I do not allow any missing data (`mincov=1.0`). You can see that this reduces the total number of SNPs from 349K to 10K. The final result is not too different from our first example, but it seems a little less smooth. In most data sets it is probably better to include more data by imputing some values, though.

In [24]:
# init pca object allowing no missing data (mincov=1.0), so nothing is imputed
strict_filtering = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=1.0,
    impute_method=None,
)
strict_filtering.run_and_plot_2D(0, 2, seed=123, subsample=False);

Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 339412
Filtered (minmap): 115543
Filtered (combined): 339728
Sites after filtering: 10186
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)
Subsampling SNPs: 10186/10186


### Kmeans imputation

The *kmeans* clustering method imputes values based on population allele frequencies (like the *sample* method) but without having to assign individuals to populations *a priori*. In other words, it is meant to reduce the bias introduced by assigning individuals yourself. Instead, this method uses kmeans clustering to group individuals into "populations" and then imputes values based on those population assignments. This is accomplished through **iterative clustering**: the first iteration uses only SNPs that are present across 90% of all samples (this can be changed with the `topcov` parameter), and each later iteration allows more missing data until the `mincov` parameter value is reached.

This method works especially well if you have a lot of missing data and are concerned that user-defined population assignments will bias your results. Here it gives very similar results to our first plots using the "sample" imputation method, suggesting that our population assignments are not greatly biasing the results.
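The iterative scheme can be sketched with scikit-learn's `KMeans` as below. This is a simplified illustration and not the tool's implementation: the function names, the `niters` argument, the evenly spaced coverage schedule, and the missing-data code `9` are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

MISSING = 9  # assumed missing-data code for this sketch


def impute_by_groups(genos, groups, rng):
    """Fill missing calls by sampling from allele frequencies within each group."""
    out = genos.copy()
    for g in np.unique(groups):
        rows = np.flatnonzero(groups == g)
        block = genos[rows]
        observed = block != MISSING
        n_obs = observed.sum(axis=0)
        allele_sum = np.where(observed, block, 0).sum(axis=0)
        freq = np.divide(allele_sum, 2 * n_obs,
                         out=np.zeros(block.shape[1]), where=n_obs > 0)
        draws = rng.binomial(2, freq, size=block.shape)
        out[rows] = np.where(observed, block, draws)
    return out


def kmeans_impute(genos, k, topcov=0.9, mincov=0.5, niters=5, seed=None):
    """Alternate clustering and imputation while relaxing the coverage cutoff."""
    rng = np.random.default_rng(seed)
    coverage = (genos != MISSING).mean(axis=0)
    groups = np.zeros(genos.shape[0], dtype=int)  # start from one global group
    for cov in np.linspace(topcov, mincov, niters):
        keep = coverage >= cov
        filled = impute_by_groups(genos[:, keep], groups, rng)
        groups = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(filled)
    keep = coverage >= mincov
    return impute_by_groups(genos[:, keep], groups, rng), groups


# toy data: two groups of three samples with opposite genotypes
truth = np.array([[0] * 5 + [2] * 5] * 3 + [[2] * 5 + [0] * 5] * 3)
toy = truth.copy()
for r, c in [(0, 1), (3, 2), (4, 2), (1, 6), (5, 7), (2, 8), (3, 8),
             (0, 9), (1, 9), (4, 9), (5, 9)]:
    toy[r, c] = MISSING

filled, groups = kmeans_impute(toy, k=2, seed=123)
print(groups)
print(filled)
```

In this toy case the last SNP is dropped because it falls below `mincov`, the clusters recover the two true groups, and the remaining missing calls are filled from within-cluster frequencies. Real data are noisier, so the clusters and imputed values will depend on K and on the random seed.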

In [27]:
# kmeans imputation (impute_method=7 corresponds to K=7 in the log below)
kmeans_imputation = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method=7,
)
kmeans_imputation.run_and_plot_2D(0, 2, seed=123);

Kmeans clustering: iter=0, K=7, mincov=0.9, minmap={'global': 0.5}
Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 256313
Filtered (minmap): 30459
Filtered (combined): 259135
Sites after filtering: 90779
Sites containing missing values: 80593 (88.78%)
Missing values in SNP matrix: 127348 (4.84%)
Imputation (sampled by freq. within pops): 77.8%, 14.2%, 8.1%
{0: ['CUCA4'], 1: ['BZBB1', 'CRL0001', 'CRL0030', 'CUMM5', 'CUSV6', 'CUVN10', 'HNDA09', 'MXSA3017'], 2: ['FLAB109', 'FLCK18', 'FLCK216', 'FLMO62', 'FLSA185', 'FLSF47', 'FLSF54', 'FLWO6'], 3: ['MXED8', 'MXGT4'], 4: ['BJSB3', 'BJSL25', 'BJVL19'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['TXGR3', 'TXMD3']}

Kmeans clustering: iter=1, K=7, mincov=0.8, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 29
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13379
Filtered (mincov): 151647
Filtered (minmap): 143699
Filter

### Save plot to PDF
You can save the figure as a PDF or SVG using a toyplot render function, as shown below.

In [31]:
import toyplot.pdf

# save returned values of the plot command
canvas, axes, mark = kmeans_imputation.run_and_plot_2D(0, 2, seed=123)

# pass the canvas object to render function
toyplot.pdf.render(canvas, "PCA-kmeans-7.pdf")

Subsampling SNPs: 30082/234536


### View proportion of missing data by samples
You can view the proportion of missing data per sample by accessing the `.missing` data table from your `pca` analysis object. You can see that most samples in this data set had 10% missing data or less, but a few had 20-70% missing data. You can hover your cursor over the plot above to see the sample names. It seems pretty clear that samples with huge amounts of missing data do not stand out as outliers in these plots like they did in the no-imputation plot, which is great!

In [38]:
# .missing is a pandas DataFrame
kmeans_imputation.missing.sort_values(by="missing")

|  | missing |
|---|---|
| BJSL25 | 0.02 |
| BJVL19 | 0.03 |
| CRL0001 | 0.03 |
| FLBA140 | 0.03 |
| FLSF54 | 0.04 |
| CRL0030 | 0.04 |
| LALC2 | 0.04 |
| CUVN10 | 0.06 |
| FLAB109 | 0.06 |
| MXGT4 | 0.07 |
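For intuition, a table like this could be computed from a genotype matrix as in the sketch below. The sample names, the toy matrix, and the missing-data code `9` are assumptions for illustration; the tool builds its own `.missing` table internally.

```python
import numpy as np
import pandas as pd

MISSING = 9  # assumed missing-data code for this sketch

names = ["s1", "s2", "s3"]
genos = np.array([
    [0, 1, 2, 0],
    [0, 9, 2, 0],
    [9, 9, 2, 9],
])

# proportion of SNPs with no call in each sample (row)
prop = (genos == MISSING).mean(axis=1)
missing = pd.DataFrame({"missing": prop.round(2)}, index=names)

# list the samples with the most missing data first
ranked = missing.sort_values(by="missing", ascending=False)
print(ranked)
```

Sorting in descending order puts the samples most likely to be affected by imputation at the top, which makes them easy to check against the PCA plot.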
