# Liftover Pipeline

## Aim
Lift over a file with chromosome coordinates from one reference genome to another, for example hg19 -> hg38.

## Pre-requisites

Make sure you install the prerequisites before running this notebook:

```
pip install cugg -U
```

## Input

- `--cwd`, working directory where the output will be saved
- `--input_file`, path of the input file, which can be in plink, gvcf/vcf, or sumstat format.
    - for plink format, provide the path of the `bim` file
    - for gvcf/vcf format, the file name must have `.gvcf` or `.vcf` among its suffixes
    - any other file is treated as sumstat format, whose header should have CHR, POS, A0 and A1 columns
- `--yml_file`, if the sumstat header doesn't have CHR, POS, A0 and A1 columns, you need to provide a yaml file describing the format of your file, such as the following. The first 5 rows are required. **ID is a combination of the keywords that appear before `:` in the yml file.**
```
ID: CHR,POS,A0,A1
CHR: CHR
POS: POS
A0: REF
A1: ALT
SNP: SNP
STAT: BETA
SE: SE
P: P
```
- `--output_file`, the name of the output file, which will be saved under the `cwd` path
- `--fr`, source reference genome, default is `hg19`
- `--to`, target reference genome, default is `hg38`
- `--remove-missing`, boolean, remove SNPs that failed to lift over (defaults to False)
- `--rename`, boolean, rename variant IDs (defaults to True). **Only implemented for sumstat liftover**
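
To illustrate how the yml mapping shown in this list is read, the sketch below applies it to a single sumstat record using plain Python. It is not the `cugg` implementation: the helper name `standardize_record`, the example record, and the `:` separator used to build the ID are assumptions made for illustration only.

```python
# Mapping as it would be loaded from the yml file:
# standardized key -> column name in the user's file.
mapping = {
    "ID": "CHR,POS,A0,A1",
    "CHR": "CHR",
    "POS": "POS",
    "A0": "REF",
    "A1": "ALT",
    "SNP": "SNP",
    "STAT": "BETA",
    "SE": "SE",
    "P": "P",
}


def standardize_record(record, mapping, sep=":"):
    """Rename one row to the standardized keys and build its ID (hypothetical helper)."""
    # Every key except ID maps directly to one column of the input file.
    out = {key: record[col] for key, col in mapping.items() if key != "ID"}
    # ID lists standardized keys, i.e. the words before ':' in the yml file.
    id_keys = [key.strip() for key in mapping["ID"].split(",")]
    # The separator is an assumption; the real pipeline may format IDs differently.
    out["ID"] = sep.join(str(out[key]) for key in id_keys)
    return out


# A made-up record whose allele columns are named REF/ALT instead of A0/A1.
row = {"CHR": 1, "POS": 12345, "REF": "A", "ALT": "G",
       "SNP": "rs1", "BETA": 0.12, "SE": 0.03, "P": 1e-4}
standardized = standardize_record(row, mapping)
print(standardized["ID"])
```

The point of the sketch is that `ID` refers to the standardized keys (`A0`, `A1`), not to the original column names (`REF`, `ALT`).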

## Output

A new file whose chromosomes and positions have been lifted over from the old reference to the new reference.
If `--remove-missing` is set, SNPs that failed to lift over are removed from the output. Otherwise, their chr and pos are replaced by 0 and 0.

**For plink files, if `--remove-missing` is not set, only the bim file will be updated.**
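
The effect of `--remove-missing` can be sketched with plain Python records. This is a simplified illustration, not the pipeline's code: the helper `handle_failed` and the example records are hypothetical, and the only behavior taken from the description above is that failed SNPs carry chr 0 and pos 0.

```python
def handle_failed(records, remove_missing=False):
    """Drop or keep records that failed to lift over (hypothetical helper).

    A record is treated as failed when its CHR was set to 0.
    Returns the records to write and the number of failed records.
    """
    n_failed = sum(1 for r in records if r["CHR"] == 0)
    if remove_missing:
        kept = [r for r in records if r["CHR"] != 0]
    else:
        # Failed records stay in the output with CHR = 0 and POS = 0.
        kept = list(records)
    return kept, n_failed


# Made-up liftover result: rs2 could not be mapped to the new reference.
lifted = [
    {"SNP": "rs1", "CHR": 1, "POS": 1000},
    {"SNP": "rs2", "CHR": 0, "POS": 0},
    {"SNP": "rs3", "CHR": 2, "POS": 5000},
]

kept, n_failed = handle_failed(lifted, remove_missing=True)
kept_all, _ = handle_failed(lifted, remove_missing=False)
print(len(kept), len(kept_all), n_failed)
```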

## Examples

```
sos run liftover.ipynb --input_file file.vcf.gz --output_file new_file.vcf.gz --cwd output
```
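
How a file name such as `file.vcf.gz` is routed to a format follows the suffix rules listed under Input. The sketch below reproduces those rules with `pathlib`; the function name `detect_format` and the example file names are hypothetical and only serve to show which branch a given name would take.

```python
from pathlib import Path


def detect_format(filename):
    """Guess the input format from the file name (hypothetical helper)."""
    path = Path(filename)
    # plink filesets are recognized by their last suffix.
    if path.suffix in {".bim", ".bed", ".fam"}:
        return "plink"
    # gvcf/vcf files may be compressed, so check all suffixes, not only the last.
    if {".gvcf", ".vcf"} & set(path.suffixes):
        return "vcf"
    # Everything else is treated as a summary statistics file.
    return "sumstat"


for name in ["file.vcf.gz", "cohort.bim", "gwas_results.tsv.gz"]:
    print(name, "->", detect_format(name))
```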

## Debugging
The first time you run the pipeline, if you get an error related to downloading the chain file (the download usually works), you can download the chain files manually from the following links.
- hg19 -> hg38: https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz
- hg38 -> hg19: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/hg38ToHg19.over.chain.gz

Then move the chain file to the `~/.liftover` folder.
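
The manual download can also be scripted. The sketch below builds the URL from the naming pattern of the two links above and saves the file under `~/.liftover`. The function names are hypothetical, the pattern is only confirmed for the hg19/hg38 pair listed here, and it is assumed that the file should keep its original name in that folder.

```python
from pathlib import Path
import urllib.request


def chain_url(fr="hg19", to="hg38"):
    """Build the UCSC chain file URL, following the pattern of the links above."""
    # The target genome is capitalized in the file name, e.g. hg19ToHg38.
    name = f"{fr}To{to[0].upper()}{to[1:]}.over.chain.gz"
    return f"https://hgdownload.cse.ucsc.edu/goldenpath/{fr}/liftOver/{name}"


def chain_destination(fr="hg19", to="hg38", folder=None):
    """Return where the chain file would be stored (default: ~/.liftover)."""
    folder = Path(folder) if folder is not None else Path.home() / ".liftover"
    return folder / chain_url(fr, to).rsplit("/", 1)[-1]


def download_chain(fr="hg19", to="hg38", folder=None):
    """Download the chain file; requires network access."""
    dest = chain_destination(fr, to, folder)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(chain_url(fr, to), str(dest))
    return dest


# Only print the URL and target path here; call download_chain() to fetch the file.
print(chain_url("hg19", "hg38"))
print(chain_destination("hg19", "hg38"))
```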

In [1]:
[global]
# Working directory where the output will be saved
parameter: cwd = path
# Input file, which can be in plink, gvcf/vcf, or sumstat format.
parameter: input_file = path
# Path of the yaml file describing the input file format, only for sumstat files.
parameter: yml_file = path('.')
# The name of the output file, which will be saved under the cwd path
parameter: output_file = path
# Source reference genome, default is hg19
parameter: fr = 'hg19'
# Target reference genome, default is hg38
parameter: to = 'hg38'
# Remove SNPs that failed to lift over (defaults to False)
parameter: remove_missing = False
# Rename variant IDs
parameter: rename = True
# Container
parameter: container = ''

In [None]:
[default_1 (export utils script)]
depends:  Py_Module('cugg'), Py_Module('pathlib'),Py_Module('pandas')
output: f'{cwd:a}/utils.py'
report: expand = '${ }', output=f'{cwd:a}/utils.py'

    import pandas as pd
    from pathlib import Path
    from cugg.genodata import *
    from cugg.sumstat import Sumstat
    from cugg.liftover import Liftover
    def liftover(input_path,output_path,yml=None,fr='hg19',to='hg38',remove_missing=True,rename=True):
        lf = Liftover(fr,to)
        print("liftover from " + fr +" to " +to)
        print("Removing SNPs failed to liftover is", remove_missing)
        # File type detection: sumstat, plink, vcf, gvcf (bgen may be supported in the future)
        input_path = Path(input_path)
        input_suffixes = set(input_path.suffixes)
        output_path = Path(output_path)
        if not input_path.exists():
            # Stop early instead of failing later with a less informative error
            raise FileNotFoundError("The file does not exist: " + str(input_path))
        if input_path.suffix in ['.bim','.bed','.fam']:
            if remove_missing:
                geno = Genodata(str(input_path.with_suffix('.bed')))
                bim = geno.bim
            else:
                bim = read_bim(input_path)
            new_bim = lf.bim_liftover(bim)
            idx = new_bim.chrom == 0
            if idx.all():
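                # Raising a plain string is a TypeError in Python 3, so raise a proper exception
                raise ValueError("No SNP was lifted over from the old reference to the new reference. The reference genome is probably wrong!")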
            if remove_missing:
                geno.bim = new_bim
                geno.extractbyidx(~idx)
                geno.export_plink(output_path.with_suffix('.bed'))
                print("Total number SNPs ",new_bim.shape[0],". Removing SNPs failed to liftover ", sum(idx))
            else:
                write_bim(output_path.with_suffix('.bim'),new_bim)
                print("Total number SNPs ",new_bim.shape[0],". The number of SNPs failed to liftover ", sum(idx),". Their chr and pos are replaced with 0, 0")
        elif len(input_suffixes.intersection(['.gvcf','.vcf']))>0:
            lf.vcf_liftover(input_path,output_path,remove_missing)
        else:
            print("This file is considered as sumstat format file")
            sums = Sumstat(input_path,config_file=yml,rename=rename)
            new_sums = lf.sumstat_liftover(sums.ss,rename)
            idx = new_sums.CHR == 0
            if remove_missing:
                new_sums[~idx].to_csv(output_path, compression='gzip', sep = "\t", header = True, index = False)
                print("Total number SNPs ",new_sums.shape[0],". Removing SNPs failed to liftover ", sum(idx))
            else:
                new_sums.to_csv(output_path, compression='gzip', sep = "\t", header = True, index = False)
                print("Total number SNPs ",new_sums.shape[0],". The number of SNPs failed to liftover ", sum(idx),". Their chr and pos are replaced with 0, 0")
    def read_bim(fn):
        header = ["chrom", "snp", "cm","pos","a0", "a1"]
        # sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas versions
        df = pd.read_csv(fn,sep=r"\s+",header=None,names=header,compression=None,engine="c",iterator=False)
        df["i"] = range(df.shape[0])
        return df

In [None]:
[default_2 (do liftover)]
depends: f'{cwd:a}/utils.py'
input: input_file
output: f'{cwd}/{output_file}'
python: input = f'{cwd:a}/utils.py', expand = '${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    
    
    import os.path
    input_path=${_input[0]:r}
    output_path=${_output[0]:r}
    fr = '${fr}'
    to = '${to}'
    remove_missing=${remove_missing}
    rename = ${rename}
    yml_file = '${yml_file}'
    if not os.path.isfile(yml_file):
        yml_file = None
    print(fr,to,remove_missing)
    liftover(input_path,output_path,yml_file,fr,to,remove_missing,rename)