# Example notebook of an Ancestry QC analysis

This notebook serves as a guide on how to use the `IDEAL-GENOM-QC` library to perform ancestry quality control. It shows one possible use, since each user can adapt it to their particular needs. At the moment, the library can only detect outliers from a homogeneous population that overlaps with one of the `SuperPop` groups present in the **1000 Genomes** data.

In this notebook the ancestry quality control procedure is described in more detail, so that the user can get a deeper understanding of all the steps executed in this part of the pipeline.

Let us import the required libraries.

In [None]:
import sys
import os

from pathlib import Path

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

library_path = Path(library_path)

from ideal_genom_qc.AncestryQC import AncestryQC

In the next cell the path variables associated with the project are set.

Moreover, since users may prefer slightly different choices for the high-LD regions, each user can provide their own file. Alternatively, the library can automatically fetch high-LD regions for builds **GRCh37** and **GRCh38**.

Also, the user can provide the path to the reference genome files or let the library fetch the data automatically.

When giving the path to the input files, the user should take into account that the input files of the ancestry check must be the output of the sample QC.

In [None]:
input_path = Path('/path/to/input/data/')
input_name = 'input-prefix'
output_path = Path('/path/to/output/data/')
output_name = 'output-prefix'
high_ld_file = Path('path/to/ld-regions/file') # if not available, set to Path()
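
Before going further, it can be useful to confirm that the input files are actually where the variables above say they are. The following is a minimal sketch, assuming the sample QC output is a PLINK binary fileset (`.bed`, `.bim`, `.fam`); that file layout is an assumption, and `missing_plink_files` is a hypothetical helper that is not part of `IDEAL-GENOM-QC`.

```python
from pathlib import Path
from typing import List


def missing_plink_files(folder: Path, prefix: str) -> List[Path]:
    """Return the expected PLINK binary files that do not exist.

    Hypothetical helper, not part of the library. It assumes the input
    is a PLINK binary fileset made of <prefix>.bed, <prefix>.bim and
    <prefix>.fam.
    """
    # Build names by string concatenation so that prefixes containing
    # dots (e.g. 'study.v1') are not truncated.
    expected = [folder / f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [path for path in expected if not path.is_file()]
```

Calling `missing_plink_files(input_path, input_name)` returns an empty list when all three files are present, so a non-empty result can be reported before any computation starts.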

In the next cell we define a dictionary with the parameters to execute the ancestry QC.

The parameters are as follows:

1. `ind_pair`: the three arguments of the **PLINK1.9** option `--indep-pairwise` (window size, step size, and r² threshold).
2. `pca`: number of principal components to be computed, parameter `--pca` from **PLINK1.9**.
3. `maf`: minor allele frequency, parameter `--maf` of **PLINK1.9**.
4. `ref_threshold`: number of standard deviations from the mean of the reference panel `SuperPop` beyond which a sample is considered a possible outlier.
5. `stu_threshold`: number of standard deviations from the mean of the study population beyond which a sample is flagged as a possible outlier.
6. `reference_pop`: Super population from the reference panel considered as reference for the study.
7. `num_pcs`: number of principal components used to flag a sample as outlier.
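
To make the role of the thresholds and of `num_pcs` more concrete, here is a minimal sketch of a standard-deviation rule on principal components. It is an illustration only and not necessarily how the library decides: the function `flag_outliers`, the toy arrays, and the choice of flagging a sample when any single component exceeds the threshold are all assumptions.

```python
import numpy as np


def flag_outliers(pcs, center_on, threshold, num_pcs):
    """Flag rows of `pcs` that lie too far from the mean of `center_on`.

    Hypothetical helper, not the library's implementation. A row is
    flagged when, on any of the first `num_pcs` components, it is more
    than `threshold` standard deviations away from the mean of
    `center_on` (assumed to have non-zero spread on each component).
    """
    ref = np.asarray(center_on, dtype=float)[:, :num_pcs]
    mean = ref.mean(axis=0)
    std = ref.std(axis=0, ddof=1)  # sample standard deviation
    z_scores = np.abs((np.asarray(pcs, dtype=float)[:, :num_pcs] - mean) / std)
    return (z_scores > threshold).any(axis=1)


# Toy principal components (two PCs) for a reference group and a study.
reference_pcs = np.array(
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [0.5, -0.5], [-0.5, 0.5]]
)
study_pcs = np.array([[0.1, 0.2], [10.0, 0.0]])

# Analogue of `ref_threshold`: distance to the reference group mean.
flags = flag_outliers(study_pcs, reference_pcs, threshold=4, num_pcs=2)
print(flags)
```

Passing the study components as `center_on` instead gives the analogue of `stu_threshold`, where samples are compared against the mean of the study population itself.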

In [None]:
ancestry_params = {
    "ind_pair"     : [50, 5, 0.2],
    "pca"          : 10,
    "maf"          : 0.01,
    "ref_threshold": 4,
    "stu_threshold": 4,
    "reference_pop": "SAS",
    "num_pcs"      : 10,
}
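
For readers less familiar with **PLINK1.9**, the sketch below shows how values like the ones above correspond to command-line options. It only builds and prints the argument lists and does not run anything. The executable name `plink`, the file prefixes, and the helpers `pruning_args` and `pca_args` are assumptions made for illustration; how the library actually invokes PLINK internally is not shown in this notebook.

```python
import shlex
from typing import List, Sequence


def pruning_args(bfile: str, out: str, ind_pair: Sequence) -> List[str]:
    """Build a PLINK 1.9 LD-pruning command (hypothetical helper)."""
    window, step, r2 = ind_pair  # window size, step size, r^2 threshold
    return [
        "plink", "--bfile", bfile,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out,
    ]


def pca_args(bfile: str, out: str, pca: int, maf: float) -> List[str]:
    """Build a PLINK 1.9 PCA command with a MAF filter (hypothetical helper)."""
    return [
        "plink", "--bfile", bfile,
        "--maf", str(maf),
        "--pca", str(pca),
        "--out", out,
    ]


# Prefixes below are placeholders, not files produced by the library.
print(shlex.join(pruning_args("study", "study-pruned", [50, 5, 0.2])))
print(shlex.join(pca_args("merged", "merged-pca", pca=10, maf=0.01)))
```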

In [None]:
ancestry_qc = AncestryQC(
    input_path = input_path, 
    input_name = input_name, 
    output_path= output_path, 
    output_name= output_name, 
    high_ld_file= high_ld_file,
    recompute_merge=True, # if True, it will recompute the merge of the input files
    built='38', # '38' is the default value
)

In [None]:
ancestry_qc_steps = {
    'merge_study_reference'    : (ancestry_qc.merge_reference_study, {"ind_pair":ancestry_params['ind_pair']}),
    #'delete_intermediate_files': (ancestry_qc._clean_merging_dir, {}),
    'pca_analysis'             : (ancestry_qc.run_pca, 
        {
            "ref_population": ancestry_params['reference_pop'],
            "pca":ancestry_params['pca'],
            "maf":ancestry_params['maf'],
            "num_pca":ancestry_params['num_pcs'],
            "ref_threshold":ancestry_params['ref_threshold'],
            "stu_threshold":ancestry_params['stu_threshold'],
        }
    ),
}

step_description = {
    'merge_study_reference'    : "Merge reference genome with study genome",
    #'delete_intermediate_files': "Delete intermediate files generated during merging",
    'pca_analysis'             : "Run a PCA analysis to perform ancestry QC"
}

for name, (func, params) in ancestry_qc_steps.items():
    print(f"\033[1m{step_description[name]}.\033[0m")
    func(**params)
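
The loop above relies on a simple pattern: each step is stored as a `(function, keyword arguments)` pair and called with `func(**params)`, in the insertion order of the dictionary. As a possible variation, the sketch below wraps the same pattern in a function that also records how long each step takes. The name `run_steps` and the stand-in steps are hypothetical and are not part of the library.

```python
import time


def run_steps(steps, descriptions):
    """Run each (func, params) pair in order and time it.

    Hypothetical helper. If a step raises, the exception propagates
    and the remaining steps are not executed.
    """
    timings = {}
    for name, (func, params) in steps.items():
        print(f"\033[1m{descriptions[name]}.\033[0m")
        start = time.perf_counter()
        func(**params)
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in steps used only to illustrate the dispatch pattern.
def merge_demo(ind_pair):
    print("  merging with ind_pair =", ind_pair)


def pca_demo(pca, maf):
    print("  running PCA with pca =", pca, "and maf =", maf)


demo_steps = {
    "merge": (merge_demo, {"ind_pair": [50, 5, 0.2]}),
    "pca": (pca_demo, {"pca": 10, "maf": 0.01}),
}
demo_descriptions = {"merge": "Merge step", "pca": "PCA step"}

demo_timings = run_steps(demo_steps, demo_descriptions)
```

With the real objects, the call would be `run_steps(ancestry_qc_steps, step_description)`.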