# Dual PCA

This notebook uses dual PCA to reduce the dimensionality of the genomic data from the 1000 Genomes Project. The only input it requires is the pairwise squared Euclidean distance matrix \\(D_{\mathrm{sq}}\\) between the samples.

## Define Dual PCA

Let \\(X\\) be the column-centered data matrix with \\(m\\) rows. The Gram matrix \\(XX^T\\) for the linear kernel can be recovered from \\(D_{\mathrm{sq}}\\) by double centering:

\\[
XX^T = -\left(I_m - \frac{\mathbf 1_{m\times m}}{m}\right) \left(\frac{D_\mathrm{sq}}{2}\right) \left(I_m - \frac{\mathbf 1_{m\times m}}{m}\right),
\\]

where \\(I_m \in \mathbb R^{m \times m}\\) is the identity matrix and \\(\mathbf 1_{m \times m} \in \mathbb R^{m \times m}\\) is the matrix whose entries are all equal to 1.
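
As a sanity check on the identity, the sketch below compares the double-centered matrix \\(-\tfrac12 C D_{\mathrm{sq}} C\\), with \\(C = I_m - \mathbf 1_{m\times m}/m\\), against the Gram matrix of column-centered data computed directly. The toy data and the names `X_toy`, `C`, `gram_from_distances`, and `gram_direct` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

# Hypothetical toy data: 6 samples with 4 features each.
rng = np.random.default_rng(0)
m, n = 6, 4
X_toy = rng.normal(size=(m, n))

# Center each column so the Gram matrix refers to mean-centered data.
X_centered = X_toy - X_toy.mean(axis=0)

# Pairwise squared Euclidean distances between the rows of X_toy.
diff = X_toy[:, None, :] - X_toy[None, :, :]
D_sq_toy = (diff ** 2).sum(axis=-1)

# Centering matrix C = I_m - (1/m) * ones((m, m)).
C = np.eye(m) - np.ones((m, m)) / m

# Double centering of -D_sq / 2 versus the Gram matrix computed directly.
gram_from_distances = -C @ (D_sq_toy / 2) @ C
gram_direct = X_centered @ X_centered.T

print(np.allclose(gram_from_distances, gram_direct))
```

The distances are unchanged by centering, which is why the check can build `D_sq_toy` from the uncentered rows.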

In [None]:
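import numpy as np
from numpy.linalg import svd

def dual_pca(D_sq):
    """Embed m samples given their (m, m) pairwise squared Euclidean distance matrix."""
    m = D_sq.shape[0]
    # Centering matrix I_m - (1/m) * ones((m, m)); the scalar 1/m broadcasts over every entry.
    A = np.eye(m) - 1 / m
    # Double-center -D_sq / 2 to recover the Gram matrix of the centered data.
    gram = -A @ (D_sq / 2) @ A
    # gram is symmetric positive semidefinite, so its singular values are its
    # eigenvalues, i.e. the squared singular values of the centered data matrix.
    U, Sigma_sq, _ = svd(gram)
    # Scale each column of U by the matching singular value (same as U @ diag(sqrt(Sigma_sq))).
    return U * np.sqrt(Sigma_sq)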
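
A minimal usage sketch on synthetic data, assuming nothing beyond NumPy: it repeats the steps of `dual_pca` so that it runs on its own, then checks that the embedding reproduces the input distances and that only as many columns as the data has features are non-negligible. The toy matrix `X_toy` and the helper `pairwise_sq_dists` are hypothetical names introduced here for illustration.

```python
import numpy as np
from numpy.linalg import svd

def dual_pca(D_sq):
    # Same steps as the cell above, repeated so this sketch is self-contained.
    m = D_sq.shape[0]
    A = np.eye(m) - 1 / m
    gram = -A @ (D_sq / 2) @ A
    U, Sigma_sq, _ = svd(gram)
    return U @ np.diag(np.sqrt(Sigma_sq))

def pairwise_sq_dists(Y):
    # Hypothetical helper: squared Euclidean distances between the rows of Y.
    diff = Y[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Hypothetical toy data: 8 samples with 3 features each.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(8, 3))
D_sq_toy = pairwise_sq_dists(X_toy)

embedded_toy = dual_pca(D_sq_toy)

# The full embedding should preserve all pairwise squared distances.
print(np.allclose(pairwise_sq_dists(embedded_toy), D_sq_toy))

# With 3 features, columns beyond the third should be numerically zero.
print(np.allclose(embedded_toy[:, 3:], 0, atol=1e-5))
```

Because the columns come out ordered by decreasing singular value, keeping the first few columns of the embedding gives a lower-dimensional representation.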

## All Chromosomes

Load the pairwise squared Euclidean distance matrix for all chromosomes (including sex chromosomes). Compute the complete PCA embedding and save it to disk.

In [None]:
PATH = "/home/ubuntu/one-k-genomes/"
D_sq = np.load(PATH + "data/pdist/summed_mats/pdist_all.npy")
embedded = dual_pca(D_sq)
save_file = PATH + "data/dim_reduc/complete_pca/embedded_all.npy"
np.save(save_file, embedded)

## Omit Sex Chromosomes

Repeat the computation above, this time with the distance matrix that omits the sex chromosomes.

In [None]:
D_sq = np.load(PATH + "data/pdist/summed_mats/pdist_num.npy")
embedded = dual_pca(D_sq)
save_file = PATH + "data/dim_reduc/complete_pca/embedded_num.npy"
np.save(save_file, embedded)