# 3. Evaluate dimensionality reduction results for representing population stratification
Here, we sift through the results from the previous notebook to see which combinations of dimensionality reduction method and hyperparameters (abbreviated here as MHCs) perform best for "representing" population stratification.

There are _lots_ of ways we could judge performance here. So this analysis should not be taken as gospel, but rather as just one way of evaluating these results.

Note that, for PCA and PCoA, we only consider the first two PCs here -- since we're evaluating how well these results serve as visualizations. (We could also conceivably look at the first three PCs, but the t-SNE and UMAP results were explicitly projected onto two rather than three dimensions, so we'd need to rerun things accordingly.)

## First off: How "far apart" should any two samples be?

We don't have a full pedigree for all of the 2,504 samples represented in our dataset -- the best we have is [population information](https://www.internationalgenome.org/category/population/) (e.g. `CEU`). When evaluating a dimensionality reduction method's 2D visualization of these samples, it seems like a reasonable starting point to say that **samples from the same "population," as defined by 1000 Genomes, should generally be located closer together in the visualization.**

Of course, if that were our only metric, then a visualization with every sample placed at the same point would get a perfect score. So we'll have to be more specific. We might add that **samples from different populations should be located farther apart in the visualization.**

This is one of the many places where things get fuzzy. It should be clear that treating each population as "uniformly different" from every other population is problematic. As an example, the 1000 Genomes data treats CHB (`Han Chinese in Beijing, China`), CHS (`Southern Han Chinese`), and YRI (`Yoruba in Ibadan, Nigeria`) as three different populations, but of course we'd generally expect individuals from the first two populations to be more closely related to each other than to individuals from the YRI population.
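
The two criteria above could be folded into a single number, for example the mean distance between same-population samples divided by the mean distance between different-population samples. The following is only a minimal sketch of that idea, not something this notebook actually computes: `within_between_ratio`, `coords`, and `pop_labels` are hypothetical names, the toy data is made up, and it uses only the standard library. It also treats all populations as "uniformly different", which is exactly the simplification questioned above.

```python
from itertools import combinations
from math import dist


def within_between_ratio(coords, pop_labels):
    """Mean same-population distance divided by mean different-population distance.

    coords: sequence of (x, y) points, one per sample.
    pop_labels: population code for each sample, in the same order.
    Values well below 1 mean same-population samples sit closer together
    than different-population samples do.
    """
    within, between = [], []
    for i, j in combinations(range(len(coords)), 2):
        d = dist(coords[i], coords[j])
        (within if pop_labels[i] == pop_labels[j] else between).append(d)
    if not within or not between:
        raise ValueError("need both same- and different-population pairs")
    mean_between = sum(between) / len(between)
    if mean_between == 0:
        # Every population collapsed onto one point: score it as the worst
        # case (an assumption of this sketch) instead of dividing by zero.
        return float("inf")
    return (sum(within) / len(within)) / mean_between


# Toy data (made up): two tight, well-separated groups.
toy_coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = ["CEU", "CEU", "YRI", "YRI"]
ratio = within_between_ratio(toy_coords, toy_labels)
```

With this convention, the degenerate "every sample at the same point" visualization gets the worst possible score rather than a perfect one.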

## Use HDBSCAN\* to cluster samples based on their 2D representations

HDBSCAN\* was [recommended by UMAP's team](https://umap-learn.readthedocs.io/en/latest/faq.html#can-i-cluster-the-results-of-umap) as one clustering method that could be used with its outputs. (We recognize up front that using HDBSCAN\* here might therefore bias things somewhat towards UMAP.)

HDBSCAN\* is useful for us for a number of reasons -- one of the big ones is that it's pretty fast, so we can run it on all of our dimensionality reduction (DR) outputs without exploding DataHub.

In [None]:
import numpy as np
import os
import hdbscan

PREFIX = os.path.join(os.environ["HOME"], "plink-182")
DATA_PREFIX = os.path.join(PREFIX, "data")
DR_OUT_PREFIX = os.path.join(PREFIX, "dr_outputs")

dr_output_filename2score = {}

for dr_output_filename in os.listdir(DR_OUT_PREFIX):
    
    # Load data and prepare for clustering
    reduced_data = np.loadtxt(os.path.join(DR_OUT_PREFIX, dr_output_filename))
    # Just get the first two PCs / axes of the data (if not already in that format)
    # This approach to doing this is taken from https://stackoverflow.com/a/10625149/10730311 --
    # I never remember how to work with ndarrays :)
    rd_2dimensional = reduced_data[:, :2]
    
    # Run clustering with the library's default parameters.
    # See https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#the-simple-case.
    pop_clusterer = hdbscan.HDBSCAN()
    pop_clusterer.fit(rd_2dimensional)
    
    # Go through clusters, find majority superpopulation (use sample_metadata.txt, I think -- load w pandas).
    # Compute "error" score, where samples clustered in another majority superpop are penalized.
    # (TODO)
    score = 0
    for sample_assigned_cluster in pop_clusterer.labels_:
        pass
    
    # Store the error score (we associate each output filename with its "error" because it's a convenient
    # and human-readable unique identifier).
    dr_output_filename2score[dr_output_filename] = score
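
One way the TODO above might eventually be filled in is sketched below. This is an assumption about what the "error" score could look like, not the notebook's actual scoring: `majority_label_error` and `sample_labels` are hypothetical names, and the sketch assumes the known labels are listed in the same order as the rows of each DR output. HDBSCAN\* labels noise points `-1`; counting them as errors is a choice made here for illustration.

```python
from collections import Counter, defaultdict


def majority_label_error(cluster_labels, sample_labels, noise_label=-1):
    """Count samples whose label differs from their cluster's majority label.

    cluster_labels: one cluster ID per sample (e.g. the clusterer's labels_).
    sample_labels: the known (super)population of each sample, same order.
    Samples marked as noise are all counted as errors (an assumption).
    """
    if len(cluster_labels) != len(sample_labels):
        raise ValueError("cluster_labels and sample_labels differ in length")

    # Group the known labels by assigned cluster.
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_labels, sample_labels):
        members[cluster_id].append(label)

    errors = 0
    for cluster_id, labels in members.items():
        if cluster_id == noise_label:
            errors += len(labels)
            continue
        # Everyone outside the cluster's most common label is an error.
        majority_count = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - majority_count
    return errors


# Toy example (made-up labels): one mismatch in cluster 0, plus one noise point.
toy_clusters = [0, 0, 0, 1, 1, -1]
toy_superpops = ["EUR", "EUR", "AFR", "AFR", "AFR", "EAS"]
toy_error = majority_label_error(toy_clusters, toy_superpops)
```

A score like this rewards many small, pure clusters (one cluster per sample would score zero), so it would probably need to be read alongside the number of clusters found.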