# Example notebook of a Variant QC analysis

This notebook is a guide on how to use the `IDEAL-GENOM-QC` library to perform variant quality control. It shows one possible workflow, which each user can adapt to their particular needs.

Here the variant quality control procedure is broken down step by step, so the user can get a deeper understanding of everything executed in this part of the pipeline.

Let us import the required libraries.

In [None]:
import sys
import os

from pathlib import Path

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

from ideal_genom_qc.VariantQC import VariantQC

In the next cell the path variables associated with the project are set.

In [None]:
input_path = Path('path/to/inputData')
input_name = 'input_prefix'
output_path = Path('path/to/outputData')
output_name = 'output_prefix'
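
Before running the pipeline it can help to confirm that the input files are where the path variables say they are. The sketch below assumes the input is a PLINK binary fileset (`.bed`, `.bim`, `.fam`) sharing the prefix `input_name`; the helper `missing_plink_files` is hypothetical and is not part of `IDEAL-GENOM-QC`.

```python
from pathlib import Path
from typing import List


def missing_plink_files(folder: Path, prefix: str) -> List[Path]:
    """Return the expected .bed/.bim/.fam files that are absent in `folder`."""
    # Build the names by concatenation so prefixes containing dots are kept intact.
    expected = [folder / f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [path for path in expected if not path.is_file()]


# Placeholder values mirroring the cell above; replace them with real paths.
missing = missing_plink_files(Path('path/to/inputData'), 'input_prefix')
if missing:
    print("Missing input files:", ", ".join(str(p) for p in missing))
```

An empty list means all three files were found.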

In the next cell we define a dictionary containing the parameters used to execute the variant QC pipeline.

The parameters are the following:

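1. `chr_y`: identifier of the Y chromosome in the PLINK binary files.
2. `miss_data_rate`: threshold on the missing data rate of a variant.
3. `diff_genotype_rate`: threshold for the case/control nonrandom missingness test (different genotype call rate).
4. `geno`: parameter `--geno` of **PLINK1.9** (maximum missing call rate allowed per variant).
5. `hwe`: parameter `--hwe` of **PLINK1.9** (p-value threshold of the Hardy-Weinberg equilibrium test).
6. `maf`: parameter `--maf` of **PLINK1.9** (minimum minor allele frequency).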
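
A quick sanity check on the dictionary can catch typos before any step runs. The following is a minimal sketch, assuming every threshold is a proportion or p-value in the interval [0, 1]; the function `check_variant_params` is hypothetical and does not belong to the library.

```python
from typing import Dict, List


def check_variant_params(params: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if 'chr_y' not in params:
        problems.append("missing parameter: chr_y")
    # Assumption: all thresholds are rates or p-values, hence within [0, 1].
    for key in ('miss_data_rate', 'diff_genotype_rate', 'geno', 'hwe', 'maf'):
        if key not in params:
            problems.append(f"missing parameter: {key}")
        elif not 0 <= params[key] <= 1:
            problems.append(f"{key}={params[key]} is outside [0, 1]")
    return problems


example_params = {
    'chr_y': 24,
    'miss_data_rate': 0.2,
    'diff_genotype_rate': 1e-5,
    'geno': 0.1,
    'maf': 5e-8,
    'hwe': 5e-8,
}
print(check_variant_params(example_params))
```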

In [None]:
variant_params = {
    'chr_y': 24,
    'miss_data_rate': 0.2,
    'diff_genotype_rate': 1e-5,
    'geno': 0.1,
    'maf': 5e-8,
    'hwe': 5e-8,
}

Initialize the class `VariantQC`.

In [None]:
variant = VariantQC(
    input_path=input_path,
    input_name=input_name,
    output_path=output_path,
    output_name=output_name,
)

In [None]:
variant_qc_steps = {
    'Missing data rate': (variant.execute_missing_data_rate, {'chr_y': variant_params['chr_y']}),
    'Different genotype': (variant.execute_different_genotype_call_rate, {}),
    'Hardy-Weinberg equilibrium': (variant.execute_hwe_test, {}),
}

step_description = {
    'Missing data rate': 'Compute missing data rate for males and females',
    'Different genotype': 'Case/control nonrandom missingness test',
    'Hardy-Weinberg equilibrium': 'Hardy-Weinberg equilibrium test',
}

for name, (func, params) in variant_qc_steps.items():
    print(f"\033[1m{step_description[name]}.\033[0m")
    func(**params)
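
The loop above relies on a simple dispatch pattern: a dictionary maps each step name to a `(function, keyword arguments)` pair, and the steps run in insertion order. The sketch below reproduces that pattern with stand-in functions and adds timing, which may be useful on large datasets. The names `run_steps`, `fake_missing_rate` and `fake_hwe` are hypothetical and only illustrate the mechanism; they do not call the library.

```python
import time
from typing import Callable, Dict, Tuple


def run_steps(steps: Dict[str, Tuple[Callable, dict]]) -> Dict[str, float]:
    """Run each (func, kwargs) pair in insertion order and record its duration in seconds."""
    durations = {}
    for name, (func, kwargs) in steps.items():
        start = time.perf_counter()
        func(**kwargs)
        durations[name] = time.perf_counter() - start
    return durations


calls = []  # records which stand-in steps ran and with what argument


def fake_missing_rate(chr_y):
    calls.append(('missing', chr_y))


def fake_hwe():
    calls.append(('hwe', None))


demo_steps = {
    'Missing data rate': (fake_missing_rate, {'chr_y': 24}),
    'Hardy-Weinberg equilibrium': (fake_hwe, {}),
}

timings = run_steps(demo_steps)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```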

The next cell displays a small dashboard reporting the variants that fail each QC step.

In [None]:
variant.get_fail_variants(
    marker_call_rate_thres=variant_params['miss_data_rate'], 
    case_controls_thres=variant_params['diff_genotype_rate'],
    hwe_threshold=variant_params['hwe'],
)

In [None]:
variant.execute_drop_variants(
    geno=variant_params['geno'],
    hwe=variant_params['hwe'],
    maf=variant_params['maf'],
)
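
Since `geno`, `hwe` and `maf` correspond to PLINK 1.9 flags, it may help to see what a comparable command line looks like. The sketch below only builds the argument list and does not run anything. It is an assumption about the general shape of such a call, not a description of what `execute_drop_variants` does internally, and the helper `build_plink_filter_command` is hypothetical.

```python
from pathlib import Path
from typing import List


def build_plink_filter_command(
    input_path: Path,
    input_name: str,
    output_path: Path,
    output_name: str,
    geno: float,
    hwe: float,
    maf: float,
) -> List[str]:
    """Build (but do not execute) a PLINK 1.9 argument list applying the three filters."""
    return [
        "plink",
        "--bfile", str(input_path / input_name),   # prefix of the .bed/.bim/.fam fileset
        "--geno", str(geno),                       # drop variants above this missing call rate
        "--hwe", str(hwe),                         # drop variants below this HWE p-value
        "--maf", str(maf),                         # drop variants below this minor allele frequency
        "--make-bed",                              # write a new binary fileset
        "--out", str(output_path / output_name),
    ]


cmd = build_plink_filter_command(
    Path('path/to/inputData'), 'input_prefix',
    Path('path/to/outputData'), 'output_prefix',
    geno=0.1, hwe=5e-8, maf=5e-8,
)
print(" ".join(cmd))
```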