# Example notebook for UMAP plots

This notebook is a guide on how to use the `IDEAL-GENOM-QC` library to draw and analyse UMAP plots.

Let us import the required libraries.

In [None]:
import sys
import os

from pathlib import Path

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

from ideal_genom_qc.PopStructure import UMAPplot

In the next cell the path variables associated with the project are set.

In [None]:
input_path = Path('path/to/inputData')
input_name = 'input_prefix'
output_path = Path('path/to/outputData')
output_name = 'output_prefix'
ld_file = Path('path/to/ld_file')  # Path to LD file, if needed
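
Before running the pipeline it can help to confirm that the input files are where the path variables say they are. The sketch below assumes the input is a PLINK binary fileset (`.bed`, `.bim`, `.fam`) named after `input_name` inside `input_path`; the helper `missing_plink_files` is hypothetical and not part of `IDEAL-GENOM-QC`.

```python
import tempfile
from pathlib import Path


def missing_plink_files(folder: Path, prefix: str) -> list:
    """Return the names of the expected PLINK binary files absent from `folder`."""
    # Assumption: the fileset is <prefix>.bed, <prefix>.bim and <prefix>.fam
    expected = [folder / f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [path.name for path in expected if not path.is_file()]


# Demonstration on a throwaway directory that only holds a .fam file
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "input_prefix.fam").touch()
    demo_missing = missing_plink_files(demo_dir, "input_prefix")

print(demo_missing)
```

With the real project one would call `missing_plink_files(input_path, input_name)` and stop if the returned list is not empty.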

In the next cell we define a dictionary containing the parameters to generate UMAP plots.

Firstly, we need a set of parameters to generate the PC decomposition.

1. `umap_maf`: Parameter `--maf` of **PLINK1.9**.
2. `umap_mind`: Parameter `--mind` of **PLINK1.9**.
3. `umap_geno`: Parameter `--geno` of **PLINK1.9**.
4. `umap_hwe`: Parameter `--hwe` of **PLINK1.9**.
5. `umap_ind_pair`: Parameter `--indep-pairwise` of **PLINK1.9**, given as a list with the window size, the step size and the r² threshold.
6. `umap_pca`: Parameter `--pca` of **PLINK1.9**. It determines the number of components that will be used to compute the UMAP projections.
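
To make the mapping between these parameters and PLINK concrete, here is a minimal sketch that turns them into command-line arguments. It only illustrates which value goes to which flag (PLINK 1.9 spells the pruning flag `--indep-pairwise`); how the library actually assembles its PLINK calls is not shown in this notebook, and the helper `plink_qc_args` is hypothetical.

```python
def plink_qc_args(maf, mind, geno, hwe, ind_pair):
    """Build the PLINK 1.9 filter and pruning arguments as a list of strings."""
    # ind_pair holds the window size, the step size and the r^2 threshold
    window, step, r2 = ind_pair
    return [
        "--maf", str(maf),
        "--mind", str(mind),
        "--geno", str(geno),
        "--hwe", str(hwe),
        "--indep-pairwise", str(window), str(step), str(r2),
    ]


# Illustrative values for the filtering and pruning parameters
args = plink_qc_args(maf=0.01, mind=0.2, geno=0.1, hwe=5e-8, ind_pair=[50, 5, 0.2])

# The number of principal components is passed separately through --pca
pca_args = ["--pca", str(10)]

print(" ".join(args))
print(" ".join(pca_args))
```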

The second group of parameters is the one needed to generate the UMAP projections. Note that `umap-learn` functions do not require lists of values, but we ask for lists in order to build a grid of parameters and explore which combination gives the best UMAP projection. For a detailed description of the UMAP parameters, please refer to https://umap-learn.readthedocs.io/en/latest/

7. `n_neighbors`: Homonymous parameter from `umap-learn`.
8. `metric`: Homonymous parameter from `umap-learn`.
9. `min_dist`: Homonymous parameter from `umap-learn`.
10. `random_state`: Homonymous parameter from `umap-learn`. If set to `None`, the resulting plot for each collection of parameters may change between runs. Users who want full reproducibility should set a value for `random_state`.
11. `umap_kwargs`: Dictionary for further customization of UMAP plots if needed. See UMAP documentation for further details.
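
As a rough illustration of what a grid of parameters means, the sketch below enumerates every combination of three lists with `itertools.product`. It assumes the grid is the Cartesian product of `n_neighbors`, `metric` and `min_dist`; the order in which the library visits the combinations is not documented here and may differ.

```python
from itertools import product

# Candidate values for each UMAP parameter
n_neighbors = [5, 10, 15]
metric = ["euclidean", "chebyshev"]
min_dist = [0.01, 0.1, 0.2]

# One dictionary per combination of parameter values
grid = [
    {"n_neighbors": n, "metric": m, "min_dist": d}
    for n, m, d in product(n_neighbors, metric, min_dist)
]

print(len(grid))  # 3 * 2 * 3 = 18 combinations
print(grid[0])
```

Under this assumption, each extra value in any list multiplies the number of projections to compute, so short lists keep the exploration manageable.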

Finally, there are some parameters intended for plot customization.

12. `case_control_marker`: If set to `True`, it will use different markers for patients and controls. If `color_hue_file` is not set, the function will use colors instead of different markers.
13. `color_hue_file`: Path to a tab-separated file with three columns. The first two coincide with the ID columns of the `.fam` file. The third is a categorical variable that serves as the hue for the scatter plot.
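
The layout of the `color_hue_file` can be sketched with the standard `csv` module. The IDs and group labels below are made up for illustration, and the sketch writes no header line; whether the library expects one is not stated in this notebook, so treat that as an assumption to check.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical family IDs, individual IDs and group labels, for illustration only
rows = [
    ("FAM001", "IND001", "GroupA"),
    ("FAM002", "IND002", "GroupB"),
    ("FAM003", "IND003", "GroupA"),
]

with tempfile.TemporaryDirectory() as tmp:
    hue_file = Path(tmp) / "color_hue_file.txt"

    # Write the three columns separated by tabs, without a header line
    with hue_file.open("w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)

    # Read the file back to confirm every line has exactly three fields
    with hue_file.open(newline="") as handle:
        read_back = [tuple(line) for line in csv.reader(handle, delimiter="\t")]

print(read_back == rows)
```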

In [None]:
umap_params = {
    'umap_maf': 0.01,
    'umap_mind': 0.2,
    'umap_geno': 0.1,
    'umap_hwe': 5e-8,
    'umap_ind_pair': [50, 5, 0.2],
    'umap_pca': 10,
    'n_neighbors': [5, 10, 15],
    'metric': ['euclidean', 'chebyshev'],
    'min_dist': [0.01, 0.1, 0.2],
    'random_state': None,
    'case_control_marker': True,
    'color_hue_file': Path('path/to/color_hue_file.txt'), # if needed
    'umap_kwargs': {}
}

In [None]:
umap_plots = UMAPplot(
    input_path=input_path,
    input_name=input_name,
    high_ld_file=ld_file,
    output_path=output_path,
    recompute_pca=False
)

In [None]:

umap_steps = {
    'ld_pruning': (umap_plots.ld_pruning, {
        'maf': umap_params['umap_maf'], 
        'mind': umap_params['umap_mind'], 
        'geno': umap_params['umap_geno'], 
        'hwe': umap_params['umap_hwe'], 
        'ind_pair': umap_params['umap_ind_pair']
    }),
    'comp_pca'  : (umap_plots.compute_pcas, {'pca': umap_params['umap_pca']}),
    'draw_plots': (umap_plots.generate_plots, {
        'color_hue_file': umap_params['color_hue_file'],
        'case_control_markers': umap_params['case_control_marker'],
        'n_neighbors': umap_params['n_neighbors'],
        'metric': umap_params['metric'],
        'min_dist': umap_params['min_dist'],
        'random_state': umap_params['random_state'],
        'umap_kwargs': umap_params['umap_kwargs'],
    })
}

umap_step_description = {
    'ld_pruning': 'LD pruning',
    'comp_pca'  : 'Compute PCAs',
    'draw_plots': 'Generate UMAP plots'
}

# Run the steps in order, printing each description in blue (ANSI escape codes)
for name, (func, params) in umap_steps.items():
    print(f"\033[34m{umap_step_description[name]}.\033[0m")
    func(**params)