# Scalable pipeline for computing LD matrices for large-sample phenotypes

## Aim

To extract the summary statistics and genotypes for specific genomic regions and calculate their LD matrices.

## Pre-requisites

### Two ways to use this pipeline on csglogin

`export PATH=/home/yh3455/miniconda3/bin:$PATH`

### Or install the following packages in your env

Make sure you install the prerequisites before running this notebook:

```
pip install LDtoolsets
```

### Input

- `--region-file`, including a list of regions
    - Each locus will be represented by one line in the region file with 3 columns chr, start, and end. e.g. `7 27723990 28723990`
- `--geno-path`, the path of a genotype inventory, which lists the paths of all genotype files in `bgen` or `plink` format.
    - The list is a file with 2 columns: `chr genotype_file_chr.ext`.
    - The first column is the chromosome ID; the second column is the genotype file for that chromosome.
    - When the chromosome ID is 0, the genotype file contains the genotypes for all chromosomes.
- `--pheno-path`, the path of a phenotype file. Only used when there is a single genotype data set. If `None`, only `pld` will be calculated.
    - The phenotype file should have a column with the name `IID`, which is used to represent the sample ID.
- `--sumstats-path`, the path of the GWAS file, including all summary statistics (e.g., $\hat{\beta}$, $SE(\hat{\beta})$ and p-values)
    - These summary statistics should contain at least these columns: `chrom, pos, ref, alt, snp_id, bhat, sbhat, p`
- `--unrelated-samples`, the file path of unrelated samples with a column named `IID`. If `None`, all samples will be considered unrelated.
- `--cwd`, the path of output directory


- `--imp-geno-path`, the path of the genotype inventory for the imputed data, which lists the paths of all imputed genotype files in `bgen` or `plink` format.
    - The list is a file with 2 columns: `chr genotype_file_chr.ext`.
    - The first column is the chromosome ID; the second column is the genotype file for that chromosome.
    - When the chromosome ID is 0, the genotype file contains the genotypes for all chromosomes.
- `--imp-sumstats-path`, the path of the GWAS file for the imputed data, including all summary statistics (e.g., $\hat{\beta}$, $SE(\hat{\beta})$ and p-values)
    - These summary statistics should contain at least these columns: `chrom, pos, ref, alt, snp_id, bhat, sbhat, p`
- `--imp-ref`, the reference genome of the imputed genotype data, needed when the exome and imputed genotypes use different reference genomes. If `None`, the two genotype data sets will be considered to be on the same reference genome.
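
As an illustration of the two plain-text inputs described above, the sketch below parses a region file and a genotype inventory, then resolves the genotype file for a region, falling back to the `0` entry when the chromosome is not listed. The function names and the toy file contents are hypothetical; the lookup follows the same idea as the workflow code but is not part of `LDtoolsets`.

```python
def parse_regions(lines):
    """Return a sorted, de-duplicated list of (chr, start, end) tuples."""
    regions = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"Expected 3 columns (chr start end), got: {line!r}")
        chrom, start, end = fields
        regions.add((chrom, int(start), int(end)))
    return sorted(regions)


def parse_inventory(lines):
    """Map chromosome ID -> genotype file path from a 2-column inventory."""
    inventory = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        chrom, path = fields
        inventory[chrom] = path
    return inventory


def genotype_file_for(chrom, inventory):
    """Pick the per-chromosome file, or the '0' entry holding all genotypes."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    if chrom in inventory:
        return inventory[chrom]
    return inventory["0"]


# Toy inputs (hypothetical file names).
region_lines = ["7 27723990 28723990", "", "7 27723990 28723990", "chr21 100 200"]
inventory_lines = ["21 geno_chr21.bed", "0 geno_all.bed"]

regions = parse_regions(region_lines)
inventory = parse_inventory(inventory_lines)
for chrom, start, end in regions:
    print(chrom, start, end, genotype_file_for(chrom, inventory))
```

Converting `start` and `end` to integers while parsing is a choice made in this sketch so that malformed coordinates fail early.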

### Output
- `rg_stat`, the regional summary stats
    - The rowname is the variant ID.
    - It should contain at least the following columns: `CHR, BP, SNP, ALT, REF, BETA, SE, Z, P`.
- `rg_geno`, the regional genotypes
    - The rowname is the variant ID, which should match the rowname of `rg_stat`.
    - The column name is the sample's IID, sorted in the order of the samples in the phenotype file.
- `pld`, the regional approximate population LD calculated from unrelated individuals
- `sld`, the regional approximate sample LD calculated from unrelated individuals in a phenotype.
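
A quick consistency check on the outputs could look like the sketch below. It assumes the summary statistics and the LD matrix are gzip-compressed, tab-separated tables with a header row and no index column, which is how the workflow code writes them with `to_csv`. The file names, variant IDs, and numbers are made up for illustration.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, header, rows):
    """Write a gzip-compressed, tab-separated table with a header row."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed, tab-separated table into (header, rows)."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for row in reader]
    return header, rows


# Create two toy output files (hypothetical names and values).
tmpdir = tempfile.mkdtemp()
sumstats_file = os.path.join(tmpdir, "toy.sumstats.gz")
ld_file = os.path.join(tmpdir, "toy.ld.gz")

write_tsv_gz(
    sumstats_file,
    ["CHR", "BP", "SNP", "ALT", "REF", "BETA", "SE", "Z", "P"],
    [
        [7, 27724000, "7:27724000_A_G", "G", "A", 0.12, 0.04, 3.0, 0.0027],
        [7, 27725000, "7:27725000_C_T", "T", "C", -0.05, 0.05, -1.0, 0.3173],
    ],
)
write_tsv_gz(
    ld_file,
    ["7:27724000_A_G", "7:27725000_C_T"],
    [[1.0, 0.35], [0.35, 1.0]],
)

# Read both files back and check that they describe the same variants.
ss_header, ss_rows = read_tsv_gz(sumstats_file)
ld_header, ld_rows = read_tsv_gz(ld_file)

snp_ids = [row[ss_header.index("SNP")] for row in ss_rows]
ld = [[float(value) for value in row] for row in ld_rows]

ids_match = snp_ids == ld_header
is_square = len(ld) == len(ld_header) and all(len(row) == len(ld_header) for row in ld)
is_symmetric = all(
    ld[i][j] == ld[j][i] for i in range(len(ld)) for j in range(len(ld))
)
print(ids_match, is_square, is_symmetric)
```

Checking that the `SNP` column and the LD header agree in both content and order matters because the LD matrix is written without row names.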

## Example command

```
sos run /home/yh3455/Github/bioworkflows/GWAS/LD_merged_exo_imp.ipynb default \
    --cwd /home/yh3455/Github/bioworkflows/GWAS/test \
    --region-file /home/dmc2245/UKBiobank/results/LD_clumping/092321_f3393_200Kexomes/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.snp_stats.clumped_region \
    --pheno-path /home/dmc2245/UKBiobank/phenotype_files/hearing_impairment/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv \
    --geno-path /home/dmc2245/UKBiobank/data/exome_files/project_VCF/072721_run/plink/092321_UKBB_qc_exome_geno_path.txt \
    --sumstats-path /home/dmc2245/UKBiobank/results/REGENIE_results/results_exome_data/090921_f3393_hearing_aid_200K/*.snp_stats.gz \
    --unrelated-samples /home/dmc2245/UKBiobank/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.prune.txt \
    --job-size 1
```

## Workflow codes

In [1]:
[global]
# Work directory where output will be saved to
parameter: cwd = path
# Region specifications
parameter: region_file = path
# Genotype file inventory
parameter: geno_path = path
# Phenotype path
parameter: pheno_path = path
# Sample file path, for bgen format
parameter: bgen_sample_path = path('.')
# Path to summary stats file
parameter: sumstats_path = path
# Path to summary stats format configuration
parameter: format_config_path = path('.')
# Path to samples of unrelated individuals
parameter: unrelated_samples = path
# Number of tasks to run in each job on cluster
parameter: job_size = int
# Imputed genotype file inventory
parameter: imp_geno_path = path
# Path to the summary stats file for the imputed genotype data
parameter: imp_sumstats_path = path
# The reference genome of imputed genotype data
parameter: imp_ref = str
# Wall clock time requested for each job
parameter: walltime = '12h'
# Memory requested for each job
parameter: mem = '60G'
fail_if(not region_file.is_file(), msg = 'Cannot find regions to extract. Please specify them using ``--region-file`` option.')
# Load all regions of interest. Each item in the list will be a region: (chr, start, end)
regions = list(set([tuple(x.strip().split()) for x in open(region_file).readlines() if x.strip()]))

In [1]:
[default_1 (export utils script)]
depends: Py_Module('pandas'), Py_Module('numpy'), Py_Module('dask'), Py_Module('LDtools')
parameter: scan_window = 500000
output: f'{cwd:a}/utils.py'
report: expand = '${ }', output=f'{cwd:a}/utils.py'
    import pandas as pd
    import numpy as np
    import dask.array as da
    from LDtools.liftover import *
    from LDtools.genodata import *
    from LDtools.sumstat import *
    from LDtools.ldmatrix import *
    from LDtools.utils import *


    def main(region,geno_path,sumstats_path,pheno_path,unr_path,imp_geno_path,imp_sumstats_path,imp_ref,output_sumstats,output_LD,bgen_sample_path):

        print('1. Preprocess sumstats (regenie format) and extract it from a region')
        if pheno_path is not None:
            # Load phenotype file
            pheno = pd.read_csv(pheno_path, header=0, delim_whitespace=True, quotechar='"')
        if unr_path is not None:
            # Load unrelated sample file
            unr = pd.read_csv(unr_path, header=0, delim_whitespace=True, quotechar='"')  
        # Load the file of summary statistics and standardize it.
        exome_sumstats = Sumstat(sumstats_path)
        exome_geno = Genodata(geno_path,bgen_sample_path)

        print('1.1. Region extraction')
        exome_sumstats.extractbyregion(region)
        exome_geno.extractbyregion(region)
        exome_sumstats.match_ss(exome_geno.bim)
        exome_geno.geno_in_stat(exome_sumstats.ss)

        if imp_geno_path is not None:
            #two genotype data
            imput_sumstats = Sumstat(imp_sumstats_path)
            imput_geno = Genodata(imp_geno_path,bgen_sample_path)

            if imp_ref is None:
                imput_sumstats.extractbyregion(region)
                imput_geno.extractbyregion(region)
                imput_sumstats.match_ss(imput_geno.bim)
                imput_geno.geno_in_stat(imput_sumstats.ss)
            else:
                print('1.2. LiftOver the region')
                hg38toimpref = Liftover('hg38',imp_ref)
                imp_region = hg38toimpref.region_liftover(region)
                imput_sumstats.extractbyregion(imp_region)
                imput_geno.extractbyregion(imp_region)
                imput_sumstats.match_ss(imput_geno.bim)
                imput_geno.geno_in_stat(imput_sumstats.ss)

                print('1.3. Regional SNPs Liftover')
                impreftohg38 = Liftover(imp_ref,'hg38') # opposite direction of hg38toimpref
                imput_geno.bim = impreftohg38.bim_liftover(imput_geno.bim)
                imput_sumstats.ss.POS = list(imput_geno.bim.pos)
                imput_sumstats.ss.SNP = list(imput_geno.bim.snp)

            print('1.1.1 Get exome unique sumstats and geno and Combine sumstats')
            snp_match = compare_snps(exome_sumstats.ss,imput_sumstats.ss)
            exome_sumstats.ss = exome_sumstats.ss.loc[snp_match.qidx[snp_match.exact==False].drop_duplicates()] #remove by exact match. can be improved.
            exome_geno.geno_in_stat(exome_sumstats.ss)
            sumstats = pd.concat([exome_sumstats.ss,imput_sumstats.ss])
        else:
            # one genotype data: use the sumstats table itself, since the output step expects a DataFrame
            sumstats = exome_sumstats.ss

        print('2. Remove related samples')
        if unr_path is not None:
            exome_geno.geno_in_unr(unr)
            if imp_geno_path is not None:
                imput_geno.geno_in_unr(unr)
        else:
            print('Warning: No file of unrelated samples was provided. All samples are included in computing the LD matrix')

        if pheno_path is not None:
            print('Warning: This function has not been implemented yet.')
            pass #sld and pld

        print('3. Calculate LD matrix')
        if imp_geno_path is None:
            cor_da = geno_corr(exome_geno.bed.T)
        else:
            xx = geno_corr(exome_geno.bed.T)
            yy = geno_corr(imput_geno.bed.T,step=500)

            imput_fam = imput_geno.fam.copy()
            imput_fam.index = list(imput_fam.iid.astype(str))
            imput_fam['i'] = list(range(imput_fam.shape[0]))
            imput_fam_comm = imput_fam.loc[list(exome_geno.fam.iid.astype(str))]
            imput_geno.extractbyidx(list(imput_fam_comm.i),row=False)
            xy = geno_corr(exome_geno.bed.T,imput_geno.bed.T,step=500)
            cor_da = da.concatenate([da.concatenate([xx,xy],axis=1),da.concatenate([xy.T,yy],axis=1)],axis=0)

        print('4. Output sumstats and LD matrix')
        index = list(sumstats.SNP.apply(shorten_id))
        sumstats.SNP = index
        sumstats.index = list(range(sumstats.shape[0]))
        sumstats.to_csv(output_sumstats, sep = "\t", header = True, index = False,compression='gzip')

        corr = cor_da.compute()
        np.fill_diagonal(corr, 1)
        corr = pd.DataFrame(corr, columns=index)
        corr.to_csv(output_LD, sep = "\t", header = True, index = False,compression='gzip')
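
The block structure assembled in step 3 can be illustrated with plain NumPy. The sketch below is not the `geno_corr` implementation from `LDtools`; it computes Pearson correlations directly on toy dosage matrices (variants in rows and samples in columns, an assumed layout), reorders the imputed samples to the exome sample order, and stacks the blocks as `[[xx, xy], [xy.T, yy]]`. All genotypes and sample IDs are invented, and for simplicity every block is computed on the shared samples only.

```python
import numpy as np


def standardize(geno):
    """Center each variant (row) and scale it to unit Euclidean norm."""
    centered = geno - geno.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


# Toy data: 2 exome variants and 3 imputed variants (hypothetical values).
exome_iids = ["s1", "s2", "s3", "s4", "s5", "s6"]
imput_iids = ["s4", "s1", "s6", "s2", "s5", "s3", "s7"]  # other order, one extra sample

exome_geno = np.array([[0, 1, 2, 1, 0, 2],
                       [1, 1, 0, 2, 0, 1]], dtype=float)
imput_geno = np.array([[1, 0, 2, 1, 0, 2, 1],
                       [2, 1, 1, 1, 0, 0, 2],
                       [0, 0, 1, 2, 1, 1, 0]], dtype=float)

# Reorder imputed samples (columns) so they line up with the exome samples.
column_of = {iid: i for i, iid in enumerate(imput_iids)}
order = [column_of[iid] for iid in exome_iids]
imput_aligned = imput_geno[:, order]

# With rows standardized this way, a dot product equals the Pearson correlation.
zx = standardize(exome_geno)
zy = standardize(imput_aligned)
xx = zx @ zx.T  # exome x exome
yy = zy @ zy.T  # imputed x imputed
xy = zx @ zy.T  # exome x imputed

# Assemble the merged LD matrix from its four blocks.
ld = np.block([[xx, xy], [xy.T, yy]])
np.fill_diagonal(ld, 1.0)
print(ld.round(3))
```

Aligning the samples before computing `xy` is the essential step: the cross-block is only meaningful when column `j` refers to the same individual in both genotype matrices.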


## Extract data

This step runs in parallel for all loci listed in the region file (via `for_each`).
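
To make the per-region layout concrete, the sketch below builds the output prefix for each region in the same pattern as the `output:` statement of this step: one sub-directory per region, with file names that combine the summary statistics base name and the region. The helper name is hypothetical, and the SoS `:bn` format is emulated with `os.path.basename` plus `os.path.splitext`, which is an assumption about SoS path formatting rather than something stated in this notebook.

```python
import os


def region_output_prefix(cwd, sumstats_path, region):
    """Return <cwd>/<chr>_<start>_<end>/<sumstats base name>_<chr>_<start>_<end>."""
    chrom, start, end = region
    tag = f"{chrom}_{start}_{end}"
    # Assumed equivalent of SoS ':bn': base name with the last extension removed.
    base = os.path.splitext(os.path.basename(sumstats_path))[0]
    return os.path.join(os.path.abspath(cwd), tag, f"{base}_{tag}")


# Toy regions and paths (hypothetical).
regions = [("7", "27723990", "28723990"), ("21", "100", "200")]
prefixes = [
    region_output_prefix("output", "results/toy.snp_stats.gz", region)
    for region in regions
]
for prefix in prefixes:
    print(prefix + ".sumstats.gz")
```

Because each region gets its own directory and file prefix, the parallel tasks never write to the same path.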

In [3]:
[default_2 (extract genotypes)]
depends: f'{cwd:a}/utils.py'
input: geno_path, pheno_path, sumstats_path, unrelated_samples, imp_geno_path,imp_sumstats_path, bgen_sample_path, for_each = 'regions'
output: sumstats = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.sumstats.gz',
        genotype = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.genotype.gz',
        pld = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.pre_pop_ld.pickle'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
python: expand = '${ }', input = f'{cwd:a}/utils.py', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    

    import os
    # output path files that we will need in our final version
    output_sumstats = ${_output['sumstats']:r}
    output_genotype = ${_output['genotype']:r}
    output_pld = ${_output['pld']:r}

    # this general path is used to create other temporary files that we need to calculate the ld matrices later on
    cwd = os.getcwd()
    output_general = '${cwd}/${_regions[0]}_${_regions[1]}_${_regions[2]}/${sumstats_path:bn}_${_regions[0]}_${_regions[1]}_${_regions[2]}'

    input_geno_path = ${_input[0]:r}
    input_pheno_path = ${_input[1]:r}
    input_sumstats_path = ${_input[2]:r}
    input_unrelated_samples = ${_input[3]:r}
    imp_geno_path = ${_input[4]:r}
    imp_sumstats_path = ${_input[5]:r}
    bgen_sample_path = ${_input[6]:r}
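    # '--imp-ref None' arrives as the string 'None'; convert it so that main() can test `imp_ref is None`
    imp_ref = None if '${imp_ref}' == 'None' else '${imp_ref}'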
    
    input_format_config = ${format_config_path:r} if ${format_config_path.is_file()} else None

    
    # Load genotype file for the region of interest
    geno_inventory = dict([x.strip().split() for x in open(input_geno_path).readlines() if x.strip()])
    imp_geno_inventory = dict([x.strip().split() for x in open(imp_geno_path).readlines() if x.strip()])
    chrom = "${_regions[0]}"
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    if chrom not in geno_inventory:
        geno_file = geno_inventory['0']
        imp_geno_path = imp_geno_inventory['0']
    else:
        geno_file = geno_inventory[chrom]
        imp_geno_path = imp_geno_inventory[chrom]

    if not os.path.isfile(geno_file):
        # relative path
        if not os.path.isfile('${_input[0]:ad}/' + geno_file):
            raise ValueError(f"Cannot find genotype file {geno_file}")
        else:
            geno_file = '${_input[0]:ad}/' + geno_file


    region = (int(chrom), ${_regions[1]}, ${_regions[2]})

    print(region, geno_file, input_sumstats_path, input_pheno_path, input_unrelated_samples,imp_geno_path,imp_sumstats_path,imp_ref,
                                output_sumstats, output_pld,bgen_sample_path)
    main(region, geno_file, input_sumstats_path, input_pheno_path, input_unrelated_samples,imp_geno_path,imp_sumstats_path,imp_ref,
                                output_sumstats, output_pld,bgen_sample_path)