# Process concatenated PLINK binary files

This notebook is a guide to using the `IDEAL-GENOM` library to process the PLINK binary files obtained after processing the imputed files. It shows one possible workflow; each user can adapt it to their particular needs.

The first step is to import the required libraries.

In [1]:
import sys
import os

# add parent directory to path
library_path = os.path.abspath('..')
if library_path not in sys.path:
    sys.path.append(library_path)

from ideal_genom.preprocessing.preparatory import Preparatory

In the widgets below, enter the paths and file names needed to process the `PLINK` binary files.

1. `input_path`: folder with the input data. The pipeline assumes that the input consists of `.bed`, `.bim`, and `.fam` files;
2. `input_name`: prefix of the input `PLINK` binary files;
3. `dependables_path`: folder with the external files needed to process the data, for example the file with the high-LD regions, which must be named `high-LD-regions.txt`;
4. `output_path`: folder where the results are written;
5. `output_name`: prefix of the output `PLINK` binary files.
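
Before launching a long run it can help to confirm that the files listed above are actually in place. The following is a minimal sketch using only the standard library; the helper names `missing_plink_files` and `check_inputs` are hypothetical and are not part of `IDEAL-GENOM`.

```python
from pathlib import Path


def missing_plink_files(folder, prefix):
    """Return the expected PLINK binary files that are absent from `folder`."""
    folder = Path(folder)
    # Build names by concatenation so prefixes that contain dots are kept intact.
    expected = [folder / f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [str(p) for p in expected if not p.is_file()]


def check_inputs(input_path, input_name, dependables_path):
    """Collect every missing input so that all problems are reported at once."""
    missing = missing_plink_files(input_path, input_name)
    # File name required by the pipeline, as stated in the list above.
    ld_file = Path(dependables_path) / "high-LD-regions.txt"
    if not ld_file.is_file():
        missing.append(str(ld_file))
    return missing
```

An empty list from `check_inputs(...)` means that the three PLINK files and the high-LD regions file were found; otherwise the list names each missing path.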

In [2]:
import ipywidgets as widgets
from IPython.display import display

# Create interactive widgets for input
input_path = widgets.Text(
    value='/media/luisggon/LaCie/data1/LuxGiantimputed/outputData/post_imputation/analysis_ready/',
    description='Path to input PLINK binary files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

input_name = widgets.Text(
    value='luxgiant_imputed_noprobID',
    description='Prefix of PLINK binary files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

dependables_path = widgets.Text(
    value='/media/luisggon/LaCie/data1/LuxGiantimputed/dependables/',
    description='Path to dependable files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

output_path = widgets.Text(
    value='/media/luisggon/LaCie/data1/LuxGiantimputed/outputData/',
    description='Path to output files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)
output_name = widgets.Text(
    value='luxgiant_imputed_noprobID_processed',
    description='Prefix of the output files:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)
# Display the widgets
display(input_path, input_name, dependables_path, output_path, output_name)

# Function to get the text parameter values
def get_params():
    return input_path.value, input_name.value, dependables_path.value, output_path.value, output_name.value

In [3]:
path_params = get_params()
print('input_path: ', path_params[0])
print('input_name: ', path_params[1])
print('dependables: ', path_params[2])
print('output_path: ', path_params[3])
print('output_name: ', path_params[4])

input_path:  /media/luisggon/LaCie/data1/LuxGiantimputed/outputData/post_imputation/analysis_ready/
input_name:  luxgiant_imputed_noprobID
dependables:  /media/luisggon/LaCie/data1/LuxGiantimputed/dependables/
output_path:  /media/luisggon/LaCie/data1/LuxGiantimputed/outputData/
output_name:  luxgiant_imputed_noprobID_processed


With these values we can initialize the `Preparatory` class.

In [None]:
preps = Preparatory(
    input_path=path_params[0],
    input_name=path_params[1],
    dependables=path_params[2],
    output_path=path_params[3],
    output_name=path_params[4]
)

In the widgets below, provide the parameters needed to run the pipeline.

1. `maf`: minor allele frequency threshold;
2. `geno`: maximum missing-genotype rate per variant;
3. `hwe`: p-value threshold for the Hardy-Weinberg equilibrium test;
4. `mind`: maximum missing-genotype rate per individual;
5. `ind_pair`: window size, step size, and r² threshold for PLINK's `--indep-pairwise` LD pruning;
6. `pca`: number of principal components to compute.
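
To make the meaning of these parameters concrete, the sketch below shows how they correspond to PLINK 1.9 command-line options. It is only an illustration: the function names `plink_qc_flags` and `plink_pca_flags` are hypothetical, and we do not claim that `Preparatory` assembles its PLINK call in this way.

```python
def plink_qc_flags(maf, geno, hwe, mind, ind_pair):
    """Build PLINK 1.9 arguments for the filtering and pruning thresholds."""
    # ind_pair holds (window size, step size, r^2 threshold).
    window, step, r2 = ind_pair
    return [
        "--maf", str(maf),    # minor allele frequency threshold
        "--geno", str(geno),  # per-variant missingness threshold
        "--hwe", str(hwe),    # Hardy-Weinberg test p-value threshold
        "--mind", str(mind),  # per-individual missingness threshold
        "--indep-pairwise", str(window), str(step), str(r2),
    ]


def plink_pca_flags(pca):
    """Build the PLINK 1.9 argument that requests `pca` principal components."""
    return ["--pca", str(pca)]


flags = plink_qc_flags(maf=0.05, geno=0.1, hwe=5e-8, mind=0.1, ind_pair=[50, 5, 0.2])
print(" ".join(flags + plink_pca_flags(10)))
```

Returning the arguments as a list, rather than one string, keeps them ready to pass to `subprocess.run` without shell quoting issues.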

In [5]:
maf = widgets.FloatText(
    value=0.05,
    description='Minor Allele Frequency:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

geno = widgets.FloatText(
    value=0.1,
    description='Genotype missing rate:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

hwe = widgets.FloatText(
    value=5e-8,
    description='Hardy-Weinberg Equilibrium:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

mind = widgets.FloatText(
    value=0.1,
    description='Individual missing rate:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

ind_par = widgets.Textarea(
    value='50, 5, 0.2',
    description='indep pairwise (comma-separated):',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='25%')
)

pca = widgets.IntText(
    value=10,
    description='Number of Principal Components:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

display(maf, geno, hwe, mind, ind_par, pca)

def get_preps_params():

    preps_params = dict()

    indep = ind_par.value.split(',')

    preps_params['maf']     = maf.value
    preps_params['geno']    = geno.value
    preps_params['hwe']     = hwe.value
    preps_params['mind']    = mind.value
    preps_params['ind_pair']= [int(indep[0]), int(indep[1]), float(indep[2])]
    preps_params['pca']     = pca.value

    return preps_params

In [6]:
preps_params = get_preps_params()
preps_params

{'maf': 0.05,
 'geno': 0.1,
 'hwe': 5e-08,
 'mind': 0.1,
 'ind_pair': [50, 5, 0.2],
 'pca': 10}
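
The `get_preps_params` function indexes the comma-separated `indep pairwise` entry directly, so a malformed value (for example, only two numbers) fails with an unhelpful `IndexError`. A more defensive parser could look like the sketch below. The name `parse_indep_pairwise` is hypothetical, and the accepted ranges (positive window and step, r² in (0, 1]) are our assumption rather than something enforced by the library.

```python
def parse_indep_pairwise(text):
    """Parse 'window, step, r2' into [int, int, float], validating each field."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 comma-separated values, got {len(parts)}: {text!r}"
        )
    # int() and float() raise ValueError themselves on non-numeric input.
    window, step, r2 = int(parts[0]), int(parts[1]), float(parts[2])
    if window <= 0 or step <= 0:
        raise ValueError("window size and step size must be positive integers")
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r^2 threshold must lie in the interval (0, 1]")
    return [window, step, r2]
```

With the default widget value, `parse_indep_pairwise('50, 5, 0.2')` yields the same list that appears under `ind_pair` in the output above.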

Finally, execute the pipeline steps: LD pruning followed by the principal component decomposition.

In [7]:
prep_steps = {
    'ld_prune': (preps.execute_ld_prunning, {
        'maf': preps_params['maf'], 
        'geno': preps_params['geno'], 
        'mind': preps_params['mind'], 
        'hwe': preps_params['hwe'], 
        'ind_pair': preps_params['ind_pair'],
    }),
    'pca': (preps.execute_pc_decomposition, {
        'pca': preps_params['pca']
    }),
}

step_description = {
    'ld_prune': 'Linkage Disequilibrium Pruning',
    'pca'     : 'Principal Component Analysis'
}

# Run each step in order, printing its description in bold first
for name, (func, params) in prep_steps.items():
    print(f"\033[1m{step_description[name]}.\033[0m")
    func(**params)

Linkage Disequilibrium Pruning.
PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)            cog-genomics.org/plink/1.9/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /media/luisggon/LaCie/data1/LuxGiantimputed/outputData/preparatory/luxgiant_imputed_noprobID_processed_prunning.log.
Options in effect:
  --bfile /media/luisggon/LaCie/data1/LuxGiantimputed/outputData/post_imputation/analysis_ready/luxgiant_imputed_noprobID
  --chr 1-22
  --exclude /media/luisggon/LaCie/data1/LuxGiantimputed/dependables/high-LD-regions.txt
  --geno 0.1
  --hwe 5e-08
  --indep-pairwise 50 5 0.2
  --maf 0.05
  --make-bed
  --out /media/luisggon/LaCie/data1/LuxGiantimputed/outputData/preparatory/luxgiant_imputed_noprobID_processed_prunning
  --range
  --threads 10

Note: --range flag deprecated.  Use e.g. "--extract range <filename>".
13795 MB RAM detected; reserving 6897 MB for main workspace.
23169127 variants loaded from .bim file.
11170 people (7458 males, 3712 