argpca is a Python implementation of Adaptive Reference-Guided PCA (ARG-PCA),
a method designed for high-dimension, low-sample-size (HDLSS) data when prior information about the true PC subspace is available, proposed in
Yoon and Jung (2025).
Classical PCA performs poorly in HDLSS settings: the sample PC subspace is inconsistent,
and the principal angles between the sample and true PC subspaces converge to a non-zero
random limit. When prior information is available---such as domain-specific directions
known (or believed) to be aligned with the true PC subspace---ARG-PCA leverages this to
improve estimation accuracy. A representative example is the normalized vector of ones,
often used in financial applications to reflect the common market factor,
as in the capital asset pricing model. The ARG PC subspace estimator asymptotically
outperforms the naive PCA-based estimator, and ARG-PCA is built on top of it.
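The inconsistency of classical PCA described above can be seen in a small experiment. The following is a minimal sketch under a toy single-spike model chosen here for illustration (it is not taken from the paper): the true PC direction is the normalized vector of ones, the spike variance grows with p, and the helper `principal_angles` is a hypothetical name, not part of argpca. Only NumPy is used.

```python
import numpy as np


def principal_angles(A, B):
    """Principal angles (in radians) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))


rng = np.random.default_rng(0)
n, p = 40, 2000  # far fewer samples than dimensions

# Assumed toy model: one spike along the normalized vector of ones,
# with spike variance equal to p, plus isotropic unit-variance noise.
u = np.ones(p) / np.sqrt(p)
factor = rng.normal(size=(n, 1)) * np.sqrt(p)
X = factor * u + rng.normal(size=(n, p))

# Classical PCA: leading right singular vector of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

angle_deg = np.degrees(principal_angles(pc1[:, None], u[:, None]))[0]
print(f"angle between sample PC1 and true direction: {angle_deg:.2f} degrees")
```

In this setup the angle does not shrink toward zero as p grows with n fixed, which is the gap that a reference direction such as the vector of ones is meant to help close.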
- ARG PC subspace estimator (compute_arg_pc_subspace)
- ARG-PCA (ARGPCA) with a scikit-learn–compatible API
- Fast PCA for HDLSS settings via Gram-matrix eigen-decomposition
- Simulation and real-data examples reproducing the empirical results in Yoon and Jung (2025)
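To illustrate the Gram-matrix idea mentioned in the list above, here is a minimal NumPy sketch of the general technique. It is not the package's implementation: `gram_pca` is a hypothetical function written for this example, and the utilities in argpca.utils may differ in signature and behavior. When n is much smaller than p, eigendecomposing the n x n Gram matrix is far cheaper than working with the p x p covariance matrix.

```python
import numpy as np


def gram_pca(X, n_components):
    """PCA through the (n, n) Gram matrix; assumes n_components <= n - 1."""
    Xc = X - X.mean(axis=0)               # center each column
    G = Xc @ Xc.T                         # (n, n) Gram matrix
    evals, evecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]
    # Map n-dimensional eigenvectors back to p-dimensional PC directions.
    # Dividing by sqrt(eigenvalue) makes each direction unit-norm.
    components = (Xc.T @ evecs / np.sqrt(evals)).T   # (n_components, p)
    explained_variance = evals / (X.shape[0] - 1)
    return components, explained_variance


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))            # n = 20 samples, p = 500 features
components, explained_variance = gram_pca(X, n_components=3)
print(components.shape)
```

The directions agree, up to sign, with the leading right singular vectors of the centered data matrix, so the result can be checked against `np.linalg.svd` on small inputs.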
Clone the repository and install dependencies:
git clone https://github.com/dsyoonstat/argpca.git
cd argpca
pip install ".[examples]"

import numpy as np
from argpca.pca import ARGPCA
# X: (n, p) data matrix
# V: (r, p) reference vectors (rows)
X = ...
V = ...
model = ARGPCA(n_components=2)
model.fit(X, reference_vectors=V)
scores = model.transform(X) # (n, 2)
components = model.components_ # (2, p)

from argpca.pca import compute_arg_pc_subspace
U_arg = compute_arg_pc_subspace(
samples=X,
reference_vectors=V,
n_components=2,
)

from argpca import ARGPCA, compute_arg_pc_subspace
# advanced / low-level utilities
from argpca import utils
# or, more explicitly:
from argpca.utils import GramSpectrum, compute_gram_spectrum, recover_spike_directions

argpca/
├── src/argpca/
│ ├── pca.py # ARGPCA + subspace logic
│ ├── utils.py # GramSpectrum, Gram PCA utilities
│ └── __init__.py
│
├── examples/
│ ├── simulation/ # Monte Carlo experiments
│ │ ├── simulation.py
│ │ ├── dgps.py
│ │ └── metrics.py
│ ├── realdata/ # NASDAQ 2024-12 analysis
│ │ ├── realdata.py
│ │ └── *.csv
│
├── tests/ # pytest unit tests
└── pyproject.toml
To reproduce the simulation results of the paper, run:
python examples/simulation/simulation.py

Results are stored under:
examples/simulation/results/
To reproduce the real data analysis results of the paper, run:
python examples/realdata/realdata.py

This script generates a scatter plot comparing:
- ARGPCA PC1–PC2 scores
- Standard PCA PC1–PC2 scores
Saved as:
examples/realdata/nasdaq_2024_12_pc_scores.png
The reference vectors used are the normalized vector of ones and the 2024 mean log-returns.
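As a rough illustration of how two such reference vectors could be assembled into the (r, p) array expected as reference vectors, consider the sketch below. Everything in it is an assumption made for the example: the returns are synthetic rather than the NASDAQ data, the names `log_returns` and `unit` are hypothetical, and scaling the mean log-return vector to unit length is a choice made here, not necessarily what realdata.py does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_stocks = 250, 100

# Synthetic stand-in for a (days, stocks) matrix of daily log-returns.
log_returns = rng.normal(loc=0.0005, scale=0.02, size=(n_days, n_stocks))


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)


ones_ref = unit(np.ones(n_stocks))           # common market factor direction
mean_ref = unit(log_returns.mean(axis=0))    # per-stock mean log-return direction

# Stack the references as rows: shape (r, p) with r = 2 and p = n_stocks.
V = np.vstack([ones_ref, mean_ref])
print(V.shape)
```

The two rows are generally not orthogonal to each other; whether they need to be is determined by the estimator, not by this sketch.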
This project is released under the MIT License.
If you use this package in academic work, please cite:
@article{Yoon2025,
author = {Yoon, Dongsun and Jung, Sungkyu},
title = {Adaptive Reference-Guided Estimation of Principal Component Subspace in High Dimensions},
journal = {Stat},
volume = {14},
number = {3},
pages = {e70081},
year = {2025},
doi = {10.1002/sta4.70081},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.70081}
}

Pull requests and bug reports are welcome.
Please use GitHub Issues for questions or feature requests.