# Extracting data for genomic regions of interest

## Aim

To extract the summary statistics and genotypes for specific genomic regions and calculate the corresponding LD matrices.

## Pre-requisites

Make sure you install the prerequisites before running this notebook:

```
pip install pybgen
pip install pandas_plink
pip install scipy
```

## Input and Output

### Input

- `--region-file`, a file listing the regions of interest
    - Each locus is represented by one line in the region file with 3 columns: chr, start, and end, e.g. `7 27723990 28723990`
- `--geno-path`, the path of a genotype inventory, which lists the paths of all genotype files in `bgen` format or in `plink` format.
    - The list is a file with 2 columns: `chr genotype_file_chr.ext`.
    - The first column is the chromosome ID; the second column is the genotype file for that chromosome.
    - A chromosome ID of 0 indicates that the genotype file contains the genotypes for all chromosomes.
- `--pheno-path`, the path of a phenotype file.
    - The phenotype file should have a column named `IID`, which holds the sample ID.
- `--bgen-sample-path`, the path of the sample file listing the samples in the `bgen` files.
    - Provide this path when the genotypes are in `bgen` format and no `.sample` file with the same base name sits next to the `.bgen` file.
- `--sumstats-path`, the path of the GWAS file, including all summary statistics (e.g. $\hat{\beta}$, $SE(\hat{\beta})$ and p-values)
    - These summary statistics should contain at least these columns: `chrom, pos, a0, a1, snp_id, bhat, sbhat, p`
- `--unrelated-samples`, the path of a file listing unrelated samples, with a column named `IID`.
- `--cwd`, the path of the output directory
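
As a rough illustration of the two list-style inputs above, the sketch below parses a region file and a genotype inventory and resolves the genotype file for a chromosome, falling back to the chromosome `0` entry. The helper names (`parse_regions`, `parse_inventory`, `genotype_file_for`) and the file names are hypothetical and not part of the workflow, which does the equivalent parsing inline.

```python
import io

def parse_regions(handle):
    """Return (chrom, start, end) tuples from a whitespace-delimited region file."""
    regions = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Drop an optional 'chr' prefix so 'chr7' and '7' are treated alike
        chrom = fields[0][3:] if fields[0].startswith('chr') else fields[0]
        regions.append((chrom, int(fields[1]), int(fields[2])))
    return regions

def parse_inventory(handle):
    """Return a dict mapping chromosome ID to genotype file path."""
    inventory = {}
    for line in handle:
        fields = line.split()
        if fields:
            inventory[fields[0]] = fields[1]
    return inventory

def genotype_file_for(chrom, inventory):
    """Pick the per-chromosome file, else the chromosome '0' file holding all genotypes."""
    if chrom in inventory:
        return inventory[chrom]
    return inventory['0']

# Hypothetical file contents, used only for illustration
regions = parse_regions(io.StringIO("7 27723990 28723990\nchr21 100 200\n"))
inventory = parse_inventory(io.StringIO("7 chr7.bgen\n0 all.bed\n"))
```

With these toy inputs, a region on chromosome 7 resolves to the per-chromosome file, while a region on chromosome 21 falls back to the file registered under chromosome `0`.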

### Output
- `rg_stat`, the regional summary statistics
    - The row names are the variant IDs.
    - It should contain at least the following columns: `CHR, BP, SNP, ALT, REF, BETA, SE, Z, P`.
- `rg_geno`, the regional genotypes
    - The row names are the variant IDs, which should match the row names of `rg_stat`.
    - The column names are the sample IIDs, ordered as in the phenotype file.
- `pld`, the regional approximate population LD, calculated from unrelated individuals
- `sld`, the regional approximate sample LD, calculated from unrelated individuals who have the phenotype.
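
The two LD outputs can be pictured with a small sketch: LD is taken here as the Pearson correlation between variants, computed once over all unrelated individuals with genotypes (`pld`) and once over the unrelated individuals who also have the phenotype (`sld`). The genotype values and sample IDs below are made up, and `ld_matrix` is an illustrative stand-in for the workflow's `LD_matrix` helper.

```python
import numpy as np
import pandas as pd

# Toy genotype matrix: rows are variants, columns are sample IIDs (values are made up)
geno = pd.DataFrame(
    [[0, 1, 2, 1, 0],
     [0, 1, 2, 1, 0],
     [2, 1, 0, 1, 2]],
    index=['rs1', 'rs2', 'rs3'],
    columns=['s1', 's2', 's3', 's4', 's5'],
    dtype=float,
)
unrelated = pd.Index(['s1', 's2', 's3', 's5', 's9'])  # s9 has no genotype
phenotyped = pd.Index(['s5', 's3', 's1'])             # samples with the phenotype

def ld_matrix(g):
    # Pearson correlation between variants (rows of g), computed across samples
    return g.T.corr()

# Population LD: all unrelated individuals that have genotypes
pop_ids = geno.columns.intersection(unrelated)
pld = ld_matrix(geno[pop_ids])

# Sample LD: unrelated individuals that also have the phenotype
sld = ld_matrix(geno[pop_ids.intersection(phenotyped)])
```

Both matrices are indexed by variant ID on rows and columns, so they line up with the row names of `rg_stat` and `rg_geno`.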

## Workflow usage

Using our minimal working example dataset, for which fastGWA results have already been generated:

```
sos run Region_Extraction.ipynb \
    --cwd candidate_loci \
    --region-file data/regions.txt \
    --pheno-path data/phenotypes.txt \
    --geno-path data/genotype_inventory.txt \
    --bgen-sample-path data/imputed_genotypes.sample \
    --sumstats-path output/phenotypes_BMI.fastGWA.snp_stats.gz \
    --unrelated-samples data/unrelated_samples.txt
```
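
The workflow also declares an optional `format_config_path` parameter, which `read_sumstat` uses to select and rename summary-statistics columns: the values of the configuration are the column names in the input file, and the keys become the new column names. The sketch below mimics that renaming with a plain dictionary instead of a YAML file. The mapping and the raw column names are assumptions for illustration only, not the actual layout of the example file. Because the extraction code looks up the chromosome in the first column and the position in the second, the mapping lists them in that order.

```python
import pandas as pd

# Hypothetical mapping: keys are the standardized names, values are the raw column names
config = {
    'CHR': 'CHR', 'BP': 'POS', 'SNP': 'SNP', 'ALT': 'A1',
    'REF': 'A2', 'BETA': 'BETA', 'SE': 'SE', 'P': 'P',
}

# Made-up raw summary statistics with two extra columns that get dropped
raw = pd.DataFrame({
    'CHR': [7], 'SNP': ['rs1'], 'POS': [27724000], 'A1': ['A'], 'A2': ['G'],
    'N': [1000], 'AF1': [0.3], 'BETA': [0.02], 'SE': [0.01], 'P': [0.0455],
})

def standardize(sumstats, config):
    """Select the configured columns and rename them to the standardized names."""
    missing = [c for c in config.values() if c not in sumstats.columns]
    if missing:
        raise ValueError(f'Input summary statistics lack the columns: {missing}')
    out = sumstats.loc[:, list(config.values())].copy()
    out.columns = list(config.keys())
    return out

std = standardize(raw, config)
```

The column order of the result follows the order of the mapping, not the order of the input file.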

## Workflow code

[global]
# Work directory where output will be saved to
parameter: cwd = path
# Region specifications
parameter: region_file = path
# Genotype file inventory
parameter: geno_path = path
# Phenotype path
parameter: pheno_path = path
# Sample file path, for bgen format
parameter: bgen_sample_path = path('.')
# Path to summary stats file
parameter: sumstats_path = path
# Path to summary stats format configuration
parameter: format_config_path = path('.')
# Path to samples of unrelated individuals
parameter: unrelated_samples = path
# Number of tasks to run in each job on cluster (defaults to 1 so that the usage example runs without ``--job-size``)
parameter: job_size = 1

fail_if(not region_file.is_file(), msg = 'Cannot find regions to extract. Please specify them using ``--region-file`` option.')
# Load all regions of interest. Each item in the list will be a region: (chr, start, end)
regions = [x.strip().split() for x in open(region_file).readlines() if x.strip()]

## Some utility functions

- `plink_slice`: extracts the genotypes of a region from PLINK data
   - p: tuple of (bim, fam, bed)
   - pb: row indices of the variants to keep in bim
   - pf: row indices of the samples to keep in fam
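
The slicing logic can be tried out on small in-memory tables. This is only a sketch: a NumPy array stands in for the lazily loaded genotype array that `pandas_plink` returns, so the final `compute` call of `plink_slice` is left out, and the function name `slice_genotypes` and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the (bim, fam, bed) triple; rows of bed are variants, columns are samples
bim = pd.DataFrame({'chrom': ['7', '7', '8'], 'snp': ['rs1', 'rs2', 'rs3'], 'pos': [100, 200, 300]})
fam = pd.DataFrame({'iid': ['s1', 's2', 's3', 's4']})
bed = np.arange(12, dtype=float).reshape(3, 4)

def slice_genotypes(p, pb=None, pf=None):
    """Subset variants (pb) and/or samples (pf) consistently across bim, fam and bed."""
    bim, fam, bed = p
    if pb is not None:
        bim = bim.iloc[pb]
        bed = bed[pb, :]   # keep the selected variant rows
    if pf is not None:
        fam = fam.iloc[pf]
        bed = bed[:, pf]   # keep the selected sample columns
    return bim, fam, bed

# Keep the first two variants and the second and fourth samples
rg_bim, rg_fam, rg_bed = slice_genotypes((bim, fam, bed), pb=[0, 1], pf=[1, 3])
```

Passing `None` for either index leaves that dimension untouched, which is how the workflow extracts all samples for the variants of a region.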

[utils]
output: f'{cwd:a}/utils.py'
report: expand = '${ }', output=_output
    import numpy as np
    import pandas as pd

    def read_sumstat(file, config_file):
        sumstats = pd.read_csv(file, compression='gzip', header=0, sep='\t', quotechar='"')
        if config_file is not None:
            import yaml
            # Keys are the standardized column names, values are the column names in the input file
            with open(config_file, 'r') as f:
                config = yaml.safe_load(f)
            try:
                sumstats = sumstats.loc[:,list(config.values())]
            except KeyError:
                raise ValueError(f'According to {config_file}, input summary statistics should have the following columns: {list(config.values())}.')
            sumstats.columns = list(config.keys())
        return sumstats

    def region_index(bim,region, chrom_col=0, pos_col=3):
        chr_bool = bim.iloc[:,chrom_col].astype(int) == region[0]
        chr_ind = chr_bool[chr_bool].index
        reg_bool = (bim.iloc[:,pos_col][chr_ind]>region[1]) & (bim.iloc[:,pos_col][chr_ind]<region[2])
        return chr_ind[reg_bool]     

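    from scipy.stats import norm
    def p2z(pval,beta,twoside=True):
        # Convert p-values to z-scores that carry the sign of the effect estimate
        if twoside:
            # A two-sided p-value puts half of its mass in each tail
            pval = pval/2
        z = [abs(norm.ppf(p)) if b>0 else -abs(norm.ppf(p)) for p,b in zip(pval,beta)]
        return z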

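    def regional_stat(ss,ind):
        # Subset first, so that z-scores are computed only for variants in the region
        # and the genome-wide table is left unmodified
        rg = ss.iloc[ind,:].copy()
        rg['Z'] = p2z(rg.P,rg.BETA)
        return rg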

    def plink_slice(p,pb=None,pf=None):
        (bim,fam,bed)=p
        if pb is not None:
            bim = bim.iloc[pb]
            bed = bed[pb,:]
        if pf is not None:
            fam = fam.iloc[pf]
            bed = bed[:,pf]
        bed = bed.compute(num_workers=1)
        return(bim,fam,bed)

    def LD_matrix(bed):
        snps = pd.DataFrame(bed.transpose())
        ld = snps.corr()
        return ld

    def bgen_region(region,geno,dtype='float16'):
        snps,genos=[],[]
        i=0
        for t,g in geno[0].iter_variants_in_region('0'+str(region[0]) if region[0]<10 else str(region[0]),region[1],region[2]):
            snps.append([int(t.chrom),t.name,0.0,t.pos,t.a1,t.a2,i])
            genos.append(g.astype(dtype))
            i+=1
        return(pd.DataFrame(snps,columns=['chrom','snp','cm','pos','a0','a1','i']),np.array(genos))

    def extract_region(region,gwas,geno,pheno,unr,plink=True):

        # Extract the summary stat
        print('Extracting summary statistics ...')
        ss_ind = region_index(gwas,region, chrom_col=0, pos_col=1)
        rg_stat = regional_stat(gwas,ss_ind)
        rg_stat.index = rg_stat.SNP
        # Extract the genotypes
        print("Extracting genotypes ...")
        if plink:
            print("The genotype is plink format")
            rg_bim,rg_fam,rg_bed = plink_slice(geno,pb=list(region_index(geno[0],region,chrom_col=0,pos_col=3)))
        else:
            print("The genotype is bgen format")
            rg_bim,rg_bed=bgen_region(region,geno,dtype='float16')
            rg_fam = geno[1]
        rg_bim.index = rg_bim.snp
        rg_fam.index = rg_fam.iid.astype(str)
        rg_bed = pd.DataFrame(rg_bed,index=rg_bim.index,columns=rg_fam.index)
        if not list(rg_stat.index)==list(rg_bim.index):
            print('The regional genotype file and the regional result file do not match with each other. The overlapping variants will be selected.')
            #overlapping variants
            com_row_idx = rg_stat.index.intersection(rg_bim.index,sort=False)
            if len(com_row_idx) == 0:
                raise ValueError("Variant IDs do not match between summary statistics and reference genotype")
            rg_stat = rg_stat.loc[com_row_idx,:]
            rg_bed = rg_bed.loc[com_row_idx,:]
        
        # Calculate the LD matrix based on unrelated individuals
        print("Calculating LD matrix ...")
        iid_unr = rg_fam.index.intersection(pd.Index(unr.IID.astype(str))) #order based on rg_fam
        pop_ld_approx = LD_matrix(rg_bed.loc[:,iid_unr])
        pheno.index = pheno.IID.astype(str)
        iid_ph = pheno.index.intersection(rg_fam.index) #order based on pheno
        sample_ld_approx = LD_matrix(rg_bed.loc[:,iid_unr.intersection(iid_ph)])
        
        # genotypes in the sample of a specific phenotype with ordering match
        if not list(iid_ph)==list(pheno.IID.astype(str)):
            print('Warning: Some samples with phenotype do not have genotypes')
        rg_bed = rg_bed.loc[:,iid_ph]
        print("Data extraction complete!")
        return dict(stats=rg_stat,geno=rg_bed,pld=pop_ld_approx,sld=sample_ld_approx)

## Extract data

This step runs in parallel for all loci listed in the region file (via `for_each`).
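
For each region, the extraction keeps only the variants found in both the summary statistics and the genotypes, so that the two outputs share the same row names. A minimal sketch of that alignment on made-up tables:

```python
import pandas as pd

# Made-up regional summary statistics, indexed by variant ID
stats = pd.DataFrame(
    {'SNP': ['rs1', 'rs2', 'rs4'], 'BETA': [0.1, -0.2, 0.3]}
).set_index('SNP', drop=False)

# Made-up regional genotypes: rows are variants, columns are samples
geno = pd.DataFrame(
    [[0, 1], [1, 2], [2, 0]],
    index=['rs2', 'rs3', 'rs1'],
    columns=['s1', 's2'],
)

# Variants present in both tables
common = stats.index.intersection(geno.index, sort=False)
if len(common) == 0:
    raise ValueError('No variant ID is shared between summary statistics and genotypes')

# Reindex both tables with the same variant list so their rows line up
stats = stats.loc[common, :]
geno = geno.loc[common, :]
```

Variants present in only one table (`rs4` and `rs3` here) are dropped from both outputs.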

[default (extract genotypes)]
depends: Py_Module('pandas_plink'), Py_Module('pybgen')
input: geno_path, pheno_path, sumstats_path, unrelated_samples, output_from('utils'), for_each = 'regions'
output: sumstats = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.sumstats.gz',
        genotype = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.genotype.gz',
        pld = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.population_ld.gz',
        sld = f'{cwd:a}/{_regions[0]}_{_regions[1]}_{_regions[2]}/{sumstats_path:bn}_{_regions[0]}_{_regions[1]}_{_regions[2]}.sample_ld.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = 1, tags = f'{step_name}_{_output[0]:bn}'
python: expand = '${ }', input = _input[4], stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    
    # Load the file of summary statistics and standardize it.
    sumstats = read_sumstat(${_input[2]:r}, ${format_config_path:r} if ${format_config_path.is_file()} else None)

    # Load phenotype file
    pheno = pd.read_csv(${_input[1]:r}, header=0, delim_whitespace=True, quotechar='"')
    # Load unrelated sample file
    unr = pd.read_csv(${_input[3]:r}, header=0, delim_whitespace=True, quotechar='"')
    
    # Load genotype file for the region of interest
    geno_inventory = dict([x.strip().split() for x in open(${_input[0]:r}).readlines() if x.strip()])
    chrom = "${_regions[0]}"
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    if chrom not in geno_inventory:
        geno_file = geno_inventory['0']
    else:
        geno_file = geno_inventory[chrom]
    import os
    if not os.path.isfile(geno_file):
        # relative path
        if not os.path.isfile('${_input[0]:ad}/' + geno_file):
            raise ValueError(f"Cannot find genotype file {geno_file}")
        else:
            geno_file = '${_input[0]:ad}/' + geno_file
    if geno_file.endswith('.bed'):
        plink = True
        from pandas_plink import read_plink
        geno = read_plink(geno_file)
    elif geno_file.endswith('.bgen'):
        plink = False
        from pybgen import PyBGEN
        bgen = PyBGEN(geno_file)
        sample_file = geno_file.replace('.bgen', '.sample')
        if not os.path.isfile(sample_file):
            if not os.path.isfile(${bgen_sample_path:r}):
                raise ValueError(f"Cannot find the matching sample file ``{sample_file}`` for ``{geno_file}``.\nYou can specify path to sample file for all BGEN files using ``--bgen-sample-path``.")
            else:
                sample_file = ${bgen_sample_path:r}
        bgen_fam = pd.read_csv(sample_file, header=0, delim_whitespace=True, quotechar='"',skiprows=1)
        bgen_fam.columns = ['fid','iid','missing','sex']
        geno = [bgen,bgen_fam]
    else:
        raise ValueError('Please provide genotype files in PLINK binary format or BGEN format')
    
    # Use the chromosome ID with any 'chr' prefix removed, so that regions such as 'chr7' are handled
    rg_info = extract_region((int(chrom), ${_regions[1]}, ${_regions[2]}), sumstats, geno, pheno, unr, plink)
    rg_info['stats'].to_csv(${_output['sumstats']:r}, sep = "\t", header = True, index = True)
    rg_info['geno'].to_csv(${_output['genotype']:r}, sep = "\t", header = True, index = True)
    rg_info['pld'].to_csv(${_output['pld']:r}, sep = "\t", header = True, index = True)
    rg_info['sld'].to_csv(${_output['sld']:r}, sep = "\t", header = True, index = True)