# Example notebook for UMAP plots

This notebook is a guide to using the `IDEAL-GENOM-QC` library to draw and analyse UMAP plots.

Let us import the required libraries.

In [1]:
import sys
import os

from pathlib import Path

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

from ideal_genom_qc.PopStructure import UMAPplot

In the next cell the path variables associated with the project are set.

In [2]:
input_path  = Path('/media/luis/LaCie/data1/qc1000genomes/inputData')
input_name  = 'all_hg38-no_dup'
output_path = Path('/media/luis/LaCie/data1/qc1000genomes/outputData')
output_name = 'umap-alg'
ld_file     = Path('path/to/ld_file')  # Path to LD file, if needed
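Before running the pipeline it can help to confirm that the input files are where the path variables say they are. The sketch below assumes the input is a PLINK binary fileset (`.bed`, `.bim`, `.fam`); the notebook only mentions the `.fam` file explicitly, so the other two extensions are an assumption. The helper `missing_plink_files` and the `toy` prefix are hypothetical and not part of `IDEAL-GENOM-QC`.

```python
import tempfile
from pathlib import Path


def missing_plink_files(input_path: Path, input_name: str) -> list:
    """Return the expected fileset members that are not found on disk.

    Assumption: the input is a PLINK binary fileset (.bed/.bim/.fam).
    """
    expected = [input_path / f"{input_name}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [p for p in expected if not p.is_file()]


# Toy demonstration in a temporary directory: only the .bed file exists.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "toy.bed").touch()
    missing = [p.name for p in missing_plink_files(tmp_path, "toy")]

print(missing)
```

With the variables of this notebook, the call would be `missing_plink_files(input_path, input_name)`; an empty list means all three files were found.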

In the next cell we define a dictionary containing the parameters to generate UMAP plots.

Firstly, we need a set of parameters to generate the PC decomposition.

1. `umap_maf`: Parameter `--maf` of **PLINK1.9**.
2. `umap_mind`: Parameter `--mind` of **PLINK1.9**.
3. `umap_geno`: Parameter `--geno` of **PLINK1.9**.
4. `umap_hwe`: Parameter `--hwe` of **PLINK1.9**.
5. `umap_ind_pair`: Parameter `--indep-pairwise` of **PLINK1.9**.
6. `umap_pca`: Parameter `--pca` of **PLINK1.9**. It determines the number of components that will be used to compute the UMAP projections.
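To make the correspondence between the dictionary keys and the **PLINK1.9** flags concrete, here is a minimal sketch that turns the filtering and pruning parameters into a list of command-line arguments. The function `plink_filter_args` is hypothetical and only illustrates the mapping; how `IDEAL-GENOM-QC` actually assembles its PLINK calls is not shown in this notebook and may differ.

```python
def plink_filter_args(params: dict) -> list:
    """Map the filtering/pruning parameters to PLINK 1.9 flags.

    Illustration only: the library's real command may use other
    flags or a different order.
    """
    # --indep-pairwise takes three values: window size, step size, r^2 threshold
    window, step, r2 = params["umap_ind_pair"]
    return [
        "--maf", str(params["umap_maf"]),
        "--mind", str(params["umap_mind"]),
        "--geno", str(params["umap_geno"]),
        "--hwe", str(params["umap_hwe"]),
        "--indep-pairwise", str(window), str(step), str(r2),
    ]


# Same values as the parameter dictionary used in this notebook
example_params = {
    "umap_maf": 0.01,
    "umap_mind": 0.2,
    "umap_geno": 0.1,
    "umap_hwe": 5e-8,
    "umap_ind_pair": [20000, 2000, 0.2],
}

args = plink_filter_args(example_params)
print(" ".join(args))
```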

The second group of parameters is needed to generate the UMAP projections. Note that `umap-learn` functions do not require lists of parameter values; we ask for lists so that a grid of parameter combinations can be built and explored to find the best settings for the UMAP projection. For a detailed description of the UMAP parameters, please refer to https://umap-learn.readthedocs.io/en/latest/

7. `n_neighbors`: Parameter of the same name in `umap-learn`.
8. `metric`: Parameter of the same name in `umap-learn`.
9. `min_dist`: Parameter of the same name in `umap-learn`.
10. `random_state`: Parameter of the same name in `umap-learn`. If set to `None`, the resulting plot for each combination of parameters may change between runs. Users who want full reproducibility should set a value for `random_state`.
11. `umap_kwargs`: Dictionary for further customization of UMAP plots, if needed. See the UMAP documentation for further details.
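The idea of a parameter grid can be sketched with the standard library alone. The snippet below expands three lists into every combination of `n_neighbors`, `metric`, and `min_dist`, using the same values as this notebook. It illustrates the concept only; it is an assumption that the library builds its grid in a similar way.

```python
from itertools import product

# Same candidate values as in the parameter dictionary of this notebook
n_neighbors = [5, 10, 15, 20, 25, 30]
metric = ["euclidean", "chebyshev"]
min_dist = [0.01, 0.1, 0.2]

# One dictionary per combination; each would correspond to one UMAP projection
grid = [
    {"n_neighbors": n, "metric": m, "min_dist": d}
    for n, m, d in product(n_neighbors, metric, min_dist)
]

print(len(grid))  # 6 * 2 * 3 = 36 combinations
print(grid[0])
```

Because the grid grows multiplicatively, adding one more value to any list adds a whole slice of projections, so short lists keep the run time manageable.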

Finally, there are some parameters intended for plot customization.

12. `case_control_marker`: If set to `True`, different markers are used for patients and controls. If `color_hue_file` is not set, the function uses colors instead of different markers.
13. `color_hue_file`: Path to a tab-separated file with three columns. The first two coincide with the ID columns of the `.fam` file. The third is a categorical variable that serves as the hue for the scatter plot.
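As an illustration of the layout expected for `color_hue_file`, the sketch below parses a tiny tab-separated table with two ID columns and one categorical column. The header names `FID` and `IID` and the sample IDs are hypothetical; `SuperPop` is the column name that appears in the log output of this notebook. Whether the library requires a header row is an assumption here.

```python
import csv
import io

# Hypothetical file content: two ID columns matching the .fam file,
# followed by a categorical column used as the hue.
hue_text = (
    "FID\tIID\tSuperPop\n"
    "fam1\tind1\tAFR\n"
    "fam2\tind2\tEUR\n"
)

# Parse the tab-separated text, keyed by the header row
rows = list(csv.DictReader(io.StringIO(hue_text), delimiter="\t"))

# Index the hue category by the (family ID, individual ID) pair
hue_by_sample = {(r["FID"], r["IID"]): r["SuperPop"] for r in rows}
print(hue_by_sample)
```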

In [3]:
umap_params = {
    'umap_maf': 0.01,
    'umap_mind': 0.2,
    'umap_geno': 0.1,
    'umap_hwe': 5e-8,
    'umap_ind_pair': [20000, 2000, 0.2],
    'umap_pca': 25,
    'n_neighbors': [5, 10, 15, 20, 25, 30],
    'metric': ['euclidean', 'chebyshev'],
    'min_dist': [0.01, 0.1, 0.2],
    'random_state': 42,
    'case_control_marker': False,
    'color_hue_file': Path('/media/luis/LaCie/data1/qc1000genomes/dependables/population_info.csv'), # if needed
    'umap_kwargs': {}
}

In [4]:
umap_plots = UMAPplot(
    input_path   =input_path, 
    input_name   =input_name, 
    high_ld_file =ld_file,
    output_path  =output_path,
    recompute_pca=False
)

2025-06-25 13:24:24,082 - INFO - High LD file not found at path/to/ld_file
2025-06-25 13:24:24,082 - INFO - High LD file will be fetched from the package
2025-06-25 13:24:24,083 - INFO - High LD file fetched from the package and saved at /home/luis/CGE/ideal-genom-qc/data/ld_regions_files/high-LD-regions_GRCH38.txt


In [5]:

umap_steps = {
    'ld_pruning': (umap_plots.ld_pruning, {
        'maf': umap_params['umap_maf'], 
        'mind': umap_params['umap_mind'], 
        'geno': umap_params['umap_geno'], 
        'hwe': umap_params['umap_hwe'], 
        'ind_pair': umap_params['umap_ind_pair']
    }),
    'comp_pca'  : (umap_plots.compute_pcas, {'pca': umap_params['umap_pca']}),
    'draw_plots': (umap_plots.generate_plots, {
        'color_hue_file': umap_params['color_hue_file'],
        'case_control_markers': umap_params['case_control_marker'],
        'n_neighbors': umap_params['n_neighbors'],
        'metric': umap_params['metric'],
        'min_dist': umap_params['min_dist'],
        'random_state': umap_params['random_state'],
        'umap_kwargs': umap_params['umap_kwargs'],
    })
}

umap_step_description = {
    'ld_pruning': 'LD pruning',
    'comp_pca'  : 'Compute PCAs',
    'draw_plots': 'Generate UMAP plots'
}

for name, (func, params) in umap_steps.items():
    print(f"\033[34m{umap_step_description[name]}.\033[0m")
    func(**params)

2025-06-25 13:24:24,088 - INFO - `recompuite_pca` is set to False. LD pruning will be skipped.
2025-06-25 13:24:24,089 - INFO - LD pruning already performed. Skipping this step.
2025-06-25 13:24:24,090 - INFO - `recompuite_pca` is set to False. PCA will be skipped.
2025-06-25 13:24:24,090 - INFO - PCA already performed. Skipping this step.
2025-06-25 13:24:24,090 - INFO - Generating UMAP plots with the following parameters:
2025-06-25 13:24:24,090 - INFO - UMAP parameters: n_neighbors=[5, 10, 15, 20, 25, 30], min_dist=[0.01, 0.1, 0.2], metric=['euclidean', 'chebyshev']
2025-06-25 13:24:24,091 - INFO - Random state: 42
2025-06-25 13:24:24,091 - INFO - Color hue file: /media/luis/LaCie/data1/qc1000genomes/dependables/population_info.csv
2025-06-25 13:24:24,091 - INFO - Case control markers: False
2025-06-25 13:24:24,094 - INFO - Color hue file loaded from /media/luis/LaCie/data1/qc1000genomes/dependables/population_info.csv
2025-06-25 13:24:24,095 - INFO - Column SuperPop will be used fo

LD pruning.
Compute PCAs.
Generate UMAP plots.


2025-06-25 13:24:31,140 - INFO - Eigenvector file loaded from /media/luis/LaCie/data1/qc1000genomes/outputData/umap_results/all_hg38-no_dup.eigenvec
2025-06-25 13:24:31,140 - INFO - Eigenvector file has 3202 rows and 27 columns
2025-06-25 13:24:31,142 - INFO - Metadata file merged with eigenvector file
2025-06-25 13:24:31,142 - INFO - Merged file has 3202 rows and 3 columns
2025-06-25 13:24:35,646 - INFO - Eigenvector file loaded from /media/luis/LaCie/data1/qc1000genomes/outputData/umap_results/all_hg38-no_dup.eigenvec
2025-06-25 13:24:35,647 - INFO - Eigenvector file has 3202 rows and 27 columns
2025-06-25 13:24:35,649 - INFO - Metadata file merged with eigenvector file
2025-06-25 13:24:35,650 - INFO - Merged file has 3202 rows and 3 columns
2025-06-25 13:24:40,305 - INFO - Eigenvector file loaded from /media/luis/LaCie/data1/qc1000genomes/outputData/umap_results/all_hg38-no_dup.eigenvec
2025-06-25 13:24:40,305 - INFO - Eigenvector file has 3202 rows and 27 columns
2025-06-25 13:24:4

<Figure size 1000x1000 with 0 Axes>