# Process Imputed Files

This notebook is a guide to using the `IDEAL-GENOM` library to process imputed files. It shows one possible workflow, which each user can adapt to their particular needs.

The first step is to import the required libraries.

In [1]:
import sys
import os

import pandas as pd

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

from ideal_genom.preprocessing.post_imputation import PostImputation

In the widgets below, enter the paths and file names needed to process the imputed data.

1. `input_path`: folder with the input data. The pipeline assumes that the output of imputation is a collection of 22 zip files (one for each chromosome) with names `chr*.zip`;
2. `dependables_path`: folder with the external files needed to process the data, for example the file with the high-LD regions, which must be named `high-LD-regions.txt`;
3. `output_path`: folder to output the results;
4. `output_name`: the prefix of the PLINK binary files.
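
Before running the pipeline it can help to confirm that the folders match the layout described above. The following is a minimal sketch and not part of `IDEAL-GENOM`: the helper name `check_input_layout` is hypothetical, and it relies only on what the list states, namely `chr*.zip` archives in the input folder and a `high-LD-regions.txt` file in the dependables folder.

```python
from pathlib import Path


def check_input_layout(input_dir, dependables_dir, expected_archives=22):
    """Return a list of problems found; an empty list means the layout looks fine.

    Hypothetical helper for illustration, not part of ideal_genom.
    """
    problems = []

    # The pipeline expects one chr*.zip archive per chromosome
    archives = sorted(Path(input_dir).glob('chr*.zip'))
    if len(archives) != expected_archives:
        problems.append(
            f"expected {expected_archives} chr*.zip files, found {len(archives)}"
        )

    # The LD regions file must carry this exact name
    ld_file = Path(dependables_dir) / 'high-LD-regions.txt'
    if not ld_file.is_file():
        problems.append(f"missing file: {ld_file}")

    return problems
```

Calling `check_input_layout(input_path.value, dependables_path.value)` after filling in the widgets would return an empty list when both conditions hold.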

In [2]:
import ipywidgets as widgets
from IPython.display import display

# Create interactive widgets for input
input_path = widgets.Text(
    value='/media/luis/LaCie/valente_gwas/inputData/',
    description='Path to input zip files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

dependables_path = widgets.Text(
    value='/media/luis/LaCie/valente_gwas/dependables/',
    description='Path to dependable files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

output_path = widgets.Text(
    value='/media/luis/LaCie/valente_gwas/outputData/',
    description='Path to output files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)
output_name = widgets.Text(
    value='test_valente',
    description='Name of the resulting files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)
# Display the widgets
display(input_path, dependables_path, output_path, output_name)

# Function to get the text parameter values
def get_params():
    return input_path.value, dependables_path.value, output_path.value, output_name.value

Text(value='/media/luis/LaCie/valente_gwas/inputData/', description='Path to input zip files:', layout=Layout(…

Text(value='/media/luis/LaCie/valente_gwas/dependables/', description='Path to dependable files:', layout=Layo…

Text(value='/media/luis/LaCie/valente_gwas/outputData/', description='Path to output files:', layout=Layout(wi…

Text(value='test_valente', description='Name of the resulting files:', layout=Layout(width='50%'), style=TextS…

In [3]:
path_params = get_params()
print('input_path: ', path_params[0])
print('dependables: ', path_params[1])
print('output_path: ', path_params[2])
print('output_name: ', path_params[3])

input_path:  /media/luis/LaCie/valente_gwas/inputData/
dependables:  /media/luis/LaCie/valente_gwas/dependables/
output_path:  /media/luis/LaCie/valente_gwas/outputData/
output_name:  test_valente


With this information we can initialize the `PostImputation` class.

In [4]:
post_imp = PostImputation(
    input_path =path_params[0],
    dependables=path_params[1],
    output_path=path_params[2],
    output_name=path_params[3]
)

In the next widgets, please provide the parameters needed to execute the pipeline.

1. `zip_password`: password to unzip the imputed data;
2. `r2_threshold`: threshold to filter the imputed data according to $R^2$;
3. `ref_genome`: name of the file with the information to normalize the data;
4. `ref_annotation`: name of the file with the information needed to annotate the SNPs.
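
Since these values are typed by hand, a quick sanity check before launching long-running steps may save time. The sketch below is an illustration under stated assumptions, not library code: `validate_post_imputation_params` is a hypothetical name, and the only rules it enforces are that the $R^2$ threshold lies in $[0, 1]$ and that the two reference file names are not blank.

```python
def validate_post_imputation_params(params):
    """Raise ValueError if a parameter is clearly unusable, else return params.

    Hypothetical helper for illustration, not part of ideal_genom.
    """
    r2 = float(params['r2_threshold'])
    # An imputation R^2 is a squared correlation, so a threshold
    # outside [0, 1] is not meaningful
    if not 0.0 <= r2 <= 1.0:
        raise ValueError(f"r2_threshold must lie in [0, 1], got {r2}")

    # The reference file names must be non-blank strings
    for key in ('ref_genome', 'ref_annotation'):
        value = params[key]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{key} must be a non-empty string")

    return params
```

The function takes a dictionary with the four keys listed above and leaves the password unchecked, since this sketch makes no assumption about its format.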

In [5]:
zip_password = widgets.Text(
    value='dummypwd',
    description='Password for the zip file:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

r2_threshold = widgets.FloatText(
    value=0.3,  # Default value
    description='R2 threshold (float):',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='25%')
)

ref_genome = widgets.Text(
    value='hs37d5.fa',
    description='Reference genome file name:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

ref_annotation = widgets.Text(
    value='ensembl_concat.GRCh37.vcf.gz',
    description='Reference annotation file name:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

display(zip_password, r2_threshold, ref_genome, ref_annotation)

def get_post_imputation_params():
    """Collect the current widget values into a dictionary."""
    return {
        'zip_password'  : zip_password.value,
        'r2_threshold'  : r2_threshold.value,
        'ref_genome'    : ref_genome.value,
        'ref_annotation': ref_annotation.value,
    }

Text(value='dummypwd', description='Password for the zip file:', layout=Layout(width='50%'), style=TextStyle(d…

FloatText(value=0.3, description='R2 threshold (float):', layout=Layout(width='25%'), style=DescriptionStyle(d…

Text(value='hs37d5.fa', description='Reference genome file name:', layout=Layout(width='50%'), style=TextStyle…

Text(value='ensembl_concat.GRCh37.vcf.gz', description='Reference annotation file name:', layout=Layout(width=…

In [6]:
post_imputation_params = get_post_imputation_params()
post_imputation_params

{'zip_password': 'dummypwd',
 'r2_threshold': 0.3,
 'ref_genome': 'hs37d5.fa',
 'ref_annotation': 'ensembl_concat.GRCh37.vcf.gz'}

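Execute the pipeline steps. In the cell below the first five steps are commented out, so only the concatenation and the conversion to PLINK binary files are executed.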
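
The steps are driven by a dictionary that maps each step name to a pair of a callable and a tuple of arguments. As a stand-alone sketch of that dispatch pattern, the example below uses placeholder steps instead of the `PostImputation` methods; `run_steps` and the per-step timing are illustrative additions and not part of the library.

```python
import time


def run_steps(steps, descriptions):
    """Run each (function, args) pair in insertion order.

    Returns a dictionary with the elapsed seconds of each step.
    """
    timings = {}
    for name, (func, args) in steps.items():
        print(f"{descriptions[name]}.")
        start = time.perf_counter()
        func(*args)  # unpack the argument tuple, which may be empty
        timings[name] = time.perf_counter() - start
    return timings


# Placeholder steps standing in for the real pipeline methods
log = []
demo_steps = {
    'first' : (log.append, ('first done',)),
    'second': (log.append, ('second done',)),
}
demo_descriptions = {
    'first' : 'First placeholder step',
    'second': 'Second placeholder step',
}

timings = run_steps(demo_steps, demo_descriptions)
```

Because a dictionary preserves insertion order, the steps run in the order in which they are listed.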

In [7]:
post_imp_steps = {
    #'unzip_chrom' : (post_imp.execute_unzip_chromosome_files, (post_imputation_params['zip_password'],)),
    #'filter_by_R2': (post_imp.execute_filter_variants, (post_imputation_params['r2_threshold'],)),
    #'normalize'   : (post_imp.execute_normalize_vcf, (post_imputation_params['ref_genome'],)),
    #'index'       : (post_imp.execute_index_vcf, ()),
    #'annotate'    : (post_imp.execute_annotate_vcf, (post_imputation_params['ref_annotation'],)),
    'concatenate' : (post_imp.execute_concat_vcf, ()),
    'get_plink'   : (post_imp.get_plink_files, ()),
}

step_description = {
    #'unzip_chrom' : 'Unzip chromosome files',
    #'filter_by_R2': 'Filter imputed variants by R2',
    #'normalize'   : 'Normalize VCF files',
    #'index'       : 'Index VCF files',
    #'annotate'    : 'Annotate VCF files',
    'concatenate' : 'Concatenate VCF files',
    'get_plink'   : 'Get PLINK files',
}

for name, (func, params) in post_imp_steps.items():
    print(f"\033[1m{step_description[name]}.\033[0m")
    func(*params)

Concatenate VCF files.


Checking the headers and starting positions of 22 files
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr1.dose.vcf.gz	40.153340 seconds
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr2.dose.vcf.gz	43.700640 seconds
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr3.dose.vcf.gz	36.183638 seconds
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr4.dose.vcf.gz	36.764294 seconds
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr5.dose.vcf.gz	31.360012 seconds
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr6.dose.vcf.gz	32.354308 seconds
Concatenating /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_chr7.dose.vcf.gz	29.996464 seconds
Concatenating /media/luis/LaCie/valente_gwas/out

Successfully concatenated and outputted to: /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_combined_1_22.vcf.gz
Get PLINK files.
PLINK v2.00a6LM 64-bit Intel (18 Aug 2024)     www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /media/luis/LaCie/valente_gwas/outputData/analysis_ready/test_valente.log.
Options in effect:
  --make-bed
  --memory 38865.0
  --out /media/luis/LaCie/valente_gwas/outputData/analysis_ready/test_valente
  --snps-only just-acgt
  --threads 30
  --vcf /media/luis/LaCie/valente_gwas/outputData/post_imputation/annotated_normalized_combined_1_22.vcf.gz

Start time: Mon Jan 20 16:12:08 2025
63863 MiB RAM detected, ~58293 available; reserving 38865 MiB for main
workspace.
Using up to 30 threads (change this with --threads).
--vcf: 13682293 variants scanned.
--vcf: 13631k variants converted.
/media/luis/LaCie/valente_gwas/outputData/analysis_ready/test_valente-