# Example notebook of a Sample QC analysis

This notebook is a guide on how to use the `IDEAL-GENOM-QC` library to perform sample quality control. It shows one possible workflow; each user can adapt it to their particular needs.

The sample quality control procedure is presented here in more detail, so the user can get a deeper understanding of every step executed in this part of the pipeline.

Let us import the required libraries.

In [None]:
import sys
import os

import pandas as pd

from pathlib import Path

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

from ideal_genom_qc.SampleQC import SampleQC

In the next cell the path variables associated with the project are set.

Moreover, since users may prefer slightly different definitions of the high-LD regions, a user can provide their own file. If no file is available, the library can automatically fetch the high-LD regions for builds **GRCh37** and **GRCh38**.

Let us set the path parameters to execute the sample QC.

In [None]:
input_path = Path('path/to/inputData')
input_name = 'input_prefix'
output_path = Path('path/to/outputData')
output_name = 'output_prefix'
high_ld_file = Path('path/to/ld-regions/file') # if not available, set to None
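
Before running anything, it can be useful to confirm that the input files are where the path variables say they are. The sketch below assumes the input is a PLINK binary fileset (`.bed`, `.bim`, `.fam`) sharing one prefix; the helper `missing_plink_files` is hypothetical and not part of `ideal_genom_qc`.

```python
from pathlib import Path


def missing_plink_files(folder, prefix, extensions=(".bed", ".bim", ".fam")):
    """Return the expected fileset members that are not present in `folder`."""
    folder = Path(folder)
    # Build the names by concatenation so prefixes containing dots are preserved.
    expected = [folder / f"{prefix}{ext}" for ext in extensions]
    return [path for path in expected if not path.is_file()]
```

Calling `missing_plink_files(input_path, input_name)` returns an empty list when the fileset is complete.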

In the next cell we define a dictionary containing the parameters to execute the sample QC pipeline.

The parameters are as follows:

1. `rename_snp`: if `True` the pipeline will change the SNPs identifiers to the format chr_pos_a1_a2.
2. `hh_to_missing`: if `True` the pipeline sets heterozygous haploid genotypes to missing; see the **PLINK1.9** flag `--set-hh-missing` for more details.
3. `use_kingship`: if `True` the pipeline will use the KING kinship estimator for the duplicates and relatedness check.
4. `ind_pair`: parameters of **PLINK1.9** `--indep-pairwise`.
5. `mind`: parameter of **PLINK1.9** `--mind`.
6. `sex_check`: parameters of **PLINK1.9** `--sex-check`.
7. `maf`: minor allele frequency, parameter of **PLINK1.9** `--maf`.
8. `het_deviation`: number of standard deviations from the mean heterozygosity rate beyond which a sample is flagged.
9. `kingship`: kinship coefficient cutoff, parameter of **PLINK2** `--king-cutoff`.
10. `ibd_threshold`: threshold to filter samples according to IBD.

In [None]:
sample_params = {
    "rename_snp"   : True,
    "hh_to_missing": True,
    "use_kingship" : True,
    "ind_pair"     : [50, 5, 0.2],
    "mind"         : 0.2,
    "sex_check"    : [0.2, 0.8],
    "maf"          : 0.01,
    "het_deviation": 3,
    "kingship"     : 0.354,
    "ibd_threshold": 0.185
}
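
A quick sanity check of the dictionary can catch typos before any long-running step starts. The following is a minimal sketch: the function `check_sample_params` is hypothetical, and the ranges are assumptions derived from the meaning of each quantity (proportions lie in [0, 1], a minor allele frequency and a kinship coefficient cannot exceed 0.5), not limits enforced by the library. It also assumes that `sex_check` always holds two values.

```python
def check_sample_params(params):
    """Return a list of problems found in a sample QC parameter dictionary."""
    problems = []

    # Assumed valid ranges for proportions, frequencies and coefficients.
    ranges = {
        "mind": (0.0, 1.0),
        "maf": (0.0, 0.5),
        "kingship": (0.0, 0.5),
        "ibd_threshold": (0.0, 1.0),
    }
    for key, (low, high) in ranges.items():
        if not low <= params[key] <= high:
            problems.append(f"{key}={params[key]} is outside [{low}, {high}]")

    # --indep-pairwise takes a window size, a step size and an r^2 threshold.
    if len(params["ind_pair"]) != 3:
        problems.append("ind_pair needs three values: window, step and r^2 threshold")

    sex_check = params["sex_check"]
    if len(sex_check) != 2 or not sex_check[0] < sex_check[1]:
        problems.append("sex_check needs two increasing values")

    if params["het_deviation"] <= 0:
        problems.append("het_deviation must be positive")

    return problems
```

An empty list from `check_sample_params(sample_params)` means no problem was detected under these assumptions.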

Initialize the class `SampleQC`.

In [None]:
sample = SampleQC(
    input_path      =input_path,
    input_name      =input_name,
    output_path     =output_path,
    output_name     =output_name,
    high_ld_file    =high_ld_file,
    built           ='38', # '38' is the default value
)

Execute the steps of the sample quality control pipeline. Since the idea of a notebook is to provide a more interactive interface, the next steps do not drop the samples failing QC; they only identify them.

In [None]:
sample_qc_steps = {
    'rename SNPs'           : (sample.execute_rename_snpid, {"rename": sample_params['rename_snp']}),
    'hh_to_missing'         : (sample.execute_haploid_to_missing, {"hh_to_missing": sample_params['hh_to_missing']}),
    'ld_pruning'            : (sample.execute_ld_pruning, {"ind_pair": sample_params['ind_pair']}),
    'miss_genotype'         : (sample.execute_miss_genotype, { "mind": sample_params['mind']}),
    'sex_check'             : (sample.execute_sex_check, {"sex_check": sample_params['sex_check']}),
    'heterozygosity'        : (sample.execute_heterozygosity_rate, {"maf": sample_params['maf']}),
    'duplicates_relatedness': (sample.execute_duplicate_relatedness, {"kingship": sample_params['kingship'], "use_king": sample_params['use_kingship']})
}

step_description = {
    'rename SNPs'           : 'Rename SNPs to chr:pos:ref:alt',
    'hh_to_missing'         : 'Solve hh warnings by setting to missing',
    'ld_pruning'            : 'Perform LD pruning',
    'miss_genotype'         : 'Get samples with high missing rate',
    'sex_check'             : 'Get samples with discordant sex information',
    'heterozygosity'        : 'Get samples with high heterozygosity rate',
    'duplicates_relatedness': 'Get samples with high relatedness rate or duplicates'
}

for name, (func, params) in sample_qc_steps.items():
    print(f"\033[1m{step_description[name]}.\033[0m")
    func(**params)
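
If later steps rely on files written by earlier ones (an assumption about the pipeline, not something verified here), it may be preferable to stop at the first failure and report which step raised. A minimal sketch with toy steps follows; `run_steps`, `ok_step` and `failing_step` are hypothetical names used only for illustration.

```python
def run_steps(steps, descriptions):
    """Run the steps in order, stop at the first failure, return completed names."""
    completed = []
    for name, (func, params) in steps.items():
        print(f"{descriptions[name]}.")
        try:
            func(**params)
        except Exception as exc:
            # Report the failing step and skip everything that follows it.
            print(f"Step '{name}' failed: {exc}")
            break
        completed.append(name)
    return completed


def ok_step(value):
    return value


def failing_step(value):
    raise RuntimeError("simulated failure")


toy_steps = {
    "first": (ok_step, {"value": 1}),
    "second": (failing_step, {"value": 2}),
    "third": (ok_step, {"value": 3}),
}
toy_descriptions = {name: f"Toy step {name}" for name in toy_steps}

completed = run_steps(toy_steps, toy_descriptions)
```

The same function accepts `sample_qc_steps` and `step_description` unchanged, because it expects the same `name -> (function, kwargs)` layout.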

Here a small dashboard reporting the call rate missingness is shown. The cap on the Y-axis can be changed without re-running the whole pipeline, so each user can set it as needed. Moreover, the plots can help to choose the best call rate threshold for the data.

In [None]:
fail_call_rate = sample.report_call_rate(
            directory    =sample.results_dir, 
            filename     =sample.call_rate_miss, 
            threshold    =sample_params['mind'],
            plots_dir    =sample.plots_dir, 
            y_axis_cap   =10
        )
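
To make the role of the `mind` threshold concrete, the sketch below applies it to a synthetic table that mimics the `FID`, `IID` and `F_MISS` columns of a PLINK `.imiss` file. The data, the function `flag_low_call_rate` and the `'Call rate'` label are illustrative assumptions; the library's own report may be implemented differently.

```python
import pandas as pd

# Synthetic stand-in for the per-sample missingness table.
toy_imiss = pd.DataFrame({
    "FID": ["F1", "F2", "F3", "F4"],
    "IID": ["S1", "S2", "S3", "S4"],
    "F_MISS": [0.01, 0.25, 0.05, 0.40],
})


def flag_low_call_rate(imiss, threshold):
    """Return the samples whose missing genotype rate exceeds `threshold`."""
    failed = imiss.loc[imiss["F_MISS"] > threshold, ["FID", "IID"]].copy()
    failed["Failure"] = "Call rate"
    return failed.reset_index(drop=True)


toy_fail_call_rate = flag_low_call_rate(toy_imiss, threshold=0.2)
```

Lowering `threshold` flags more samples, which is the trade-off the dashboard plots help to judge.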

Now, the samples failing sex check are collected and a plot is shown where the user can check the number of problematic samples.

In [None]:
fail_sexcheck = sample.report_sex_check(
            directory          =sample.results_dir, 
            sex_check_filename =sample.sexcheck_miss, 
            xchr_imiss_filename=sample.xchr_miss,
            plots_dir          =sample.plots_dir
        )
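
The two `sex_check` values can be read as the maximum X-chromosome inbreeding coefficient F for a female call and the minimum F for a male call, which is how PLINK 1.9 `--check-sex` interprets its two optional arguments. The sketch below reproduces that logic on synthetic data; `infer_sex` is a hypothetical helper, and sex is coded as in PLINK (1 male, 2 female, 0 undetermined).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a .sexcheck table: reported sex and F estimate.
toy_sexcheck = pd.DataFrame({
    "FID": ["F1", "F2", "F3", "F4"],
    "IID": ["S1", "S2", "S3", "S4"],
    "PEDSEX": [1, 2, 1, 2],
    "F": [0.95, 0.05, 0.10, 0.50],
})


def infer_sex(f_values, female_max, male_min):
    """Call female (2) below `female_max`, male (1) above `male_min`, else 0."""
    return np.select([f_values < female_max, f_values > male_min], [2, 1], default=0)


toy_sexcheck["SNPSEX"] = infer_sex(toy_sexcheck["F"], female_max=0.2, male_min=0.8)

# A sample is problematic when the inferred sex disagrees with the reported one.
toy_problems = toy_sexcheck[toy_sexcheck["SNPSEX"] != toy_sexcheck["PEDSEX"]]
```

In this toy table `S3` is reported male but looks female, and `S4` falls between the two thresholds, so both end up in `toy_problems`.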

Here a small dashboard reporting the heterozygosity is shown. The cap on the Y-axis can be changed without re-running the whole pipeline, so each user can set it as needed. Moreover, the plots can help to choose a different deviation from the mean heterozygosity rate. Notice that the analysis is split between SNPs with a MAF below 1% and those above that threshold.

In [None]:
fail_het_greater = sample.report_heterozygosity_rate(
            directory           = sample.results_dir, 
            summary_ped_filename= sample.summary_greater, 
            autosomal_filename  = sample.maf_greater_miss, 
            std_deviation_het   = sample_params['het_deviation'],
            maf                 = sample_params['maf'],
            split               = '>',
            plots_dir           = sample.plots_dir
        )

In [None]:
fail_het_gless = sample.report_heterozygosity_rate(
            directory           = sample.results_dir, 
            summary_ped_filename= sample.summary_less, 
            autosomal_filename  = sample.maf_less_miss, 
            std_deviation_het   = sample_params['het_deviation'],
            maf                 = sample_params['maf'],
            split               = '<',
            plots_dir           = sample.plots_dir
        )
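
The `het_deviation` cutoff can be illustrated with a few lines of pandas: a sample is flagged when its heterozygosity rate lies more than a given number of standard deviations away from the mean. This is a sketch on synthetic rates; the column name `het_rate` and the function `flag_het_outliers` are hypothetical, and the library may compute the rate and the cutoff differently.

```python
import pandas as pd

# Twenty samples with typical rates plus one clear outlier (the last one).
rates = [0.30, 0.31] * 10 + [0.45]
toy_het = pd.DataFrame({
    "FID": [f"F{i}" for i in range(len(rates))],
    "IID": [f"S{i}" for i in range(len(rates))],
    "het_rate": rates,
})


def flag_het_outliers(het, n_std):
    """Return samples whose rate deviates more than `n_std` SDs from the mean."""
    mean = het["het_rate"].mean()
    std = het["het_rate"].std()
    outside = (het["het_rate"] - mean).abs() > n_std * std
    return het.loc[outside, ["FID", "IID"]].reset_index(drop=True)


toy_fail_het = flag_het_outliers(toy_het, n_std=3)
```

Note that the outlier itself inflates the standard deviation, so with very few samples no value can ever exceed three standard deviations; the cutoff is only meaningful for reasonably large cohorts.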

In [None]:
if sample.use_king:
    fail_dup_relatednes = pd.read_csv(
                sample.kinship_miss,
                sep=r'\s+',
                engine='python'
            )

    # name the columns and label every listed sample with the reason for failure
    fail_dup_relatednes.columns = ['FID', 'IID']
    fail_dup_relatednes['Failure'] = 'Duplicates and relatedness'

In [None]:
if not sample_params['use_kingship']:

    sample.use_king = False

    fail_dup_relatednes = sample.report_ibd_analysis(ibd_threshold=sample_params['ibd_threshold'])

    fail_dup_relatednes
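
For intuition about the IBD-based alternative, the sketch below filters a synthetic pairwise table on `PI_HAT`, the proportion of IBD reported in a PLINK `.genome` file. Treating `ibd_threshold` as a cutoff on `PI_HAT`, and choosing the second member of each pair as the sample to flag, are assumptions made for illustration; the library may select which sample to drop by another rule.

```python
import pandas as pd

# Synthetic stand-in for a pairwise IBD table.
toy_genome = pd.DataFrame({
    "FID1": ["F1", "F1", "F3"],
    "IID1": ["S1", "S1", "S3"],
    "FID2": ["F2", "F3", "F4"],
    "IID2": ["S2", "S3", "S4"],
    "PI_HAT": [1.00, 0.05, 0.50],
})

ibd_threshold = 0.185

# Pairs above the cutoff are treated as duplicates or close relatives.
related_pairs = toy_genome[toy_genome["PI_HAT"] > ibd_threshold]

# Illustrative rule only: flag the second sample of each related pair.
toy_fail_ibd = (
    related_pairs[["FID2", "IID2"]]
    .rename(columns={"FID2": "FID", "IID2": "IID"})
    .drop_duplicates()
    .reset_index(drop=True)
)
```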

All failing samples are concatenated into a single pandas DataFrame and saved. A summary of the number of samples failing each step is shown.

In [None]:
df_fails = pd.concat(
    [fail_call_rate, fail_sexcheck, fail_het_greater, fail_het_gless, fail_dup_relatednes],
    ignore_index=True
)

total_fails = df_fails.shape[0]
duplicates = df_fails.duplicated(subset=['FID', 'IID']).sum()
summary = df_fails['Failure'].value_counts().reset_index()

df_fails = df_fails.drop_duplicates(subset=['FID', 'IID'])

df_fails.to_csv(
    os.path.join(sample.fails_dir, 'fail_samples.txt'), sep='\t',
    index=False
)

In [None]:
summary

In [None]:
print('Total samples failing QC: ', total_fails)
print('Number of duplicated samples: ', duplicates)
print('Unique samples failing QC: ', total_fails-duplicates)
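
Dropping duplicated `FID`/`IID` rows keeps only the first failure recorded for each sample. When it is useful to see every reason a sample failed, the reasons can be joined per sample instead. The sketch below uses a synthetic table with the same three columns; the failure labels are made up for illustration.

```python
import pandas as pd

toy_fails = pd.DataFrame({
    "FID": ["F1", "F2", "F1", "F3"],
    "IID": ["S1", "S2", "S1", "S3"],
    "Failure": ["Call rate", "Sex check", "Heterozygosity", "Sex check"],
})

# One row per sample, with all of its failure reasons joined in order of appearance.
reasons = (
    toy_fails
    .groupby(["FID", "IID"], sort=False)["Failure"]
    .agg("; ".join)
    .reset_index()
)
```

The resulting table has as many rows as there are unique failing samples, so it matches the last count printed above.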

Finally, the failing samples are dropped and cleaned `PLINK` files are generated.

In [None]:
sample.execute_drop_samples()

Some intermediate files are deleted to save space.

In [None]:
sample.clean_input_folder()
sample.clean_result_folder()