# Genome-wide Linkage Analysis

## Aim

To phase haplotypes from VCF files and run genome-wide linkage analysis.

### Input


- `--cwd`, the working directory where output is saved.
- `--chrom`, a list of chromosomes to process.
    - e.g. `1 2 3`
- `--fam-path`, the path of the fam file.
- `--vcf-path`, the path of a genotype file in `vcf` format.
- `--anno-path`, the path of the directory containing the per-chromosome annotation files.
- `--pop-path`, the path of a sample source file.
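
The `seqlink` step of this workflow reads one annotation file per chromosome, named `EFIGA_NIALOAD_chr<chrom>.hg38.hg38_multianno.csv`, from the annotation directory. A pre-flight check along the following lines may catch missing files before any job is submitted. This is only a sketch: `annotation_file` and `missing_annotations` are hypothetical helper names, not part of SEQLinkage or SoS.

```python
from pathlib import Path

# File name pattern taken from the seqlink step of this workflow.
ANNO_PATTERN = "EFIGA_NIALOAD_chr{chrom}.hg38.hg38_multianno.csv"

def annotation_file(anno_path, chrom):
    """Return the annotation file the seqlink step would read for one chromosome."""
    return Path(anno_path) / ANNO_PATTERN.format(chrom=chrom)

def missing_annotations(anno_path, chroms):
    """List the chromosomes whose annotation file does not exist (hypothetical helper)."""
    return [c for c in chroms if not annotation_file(anno_path, c).is_file()]

if __name__ == "__main__":
    # Example values copied from the example commands; adjust to your data.
    print(missing_annotations("MWE/annotation", [9, 10]))
```

An empty list means every requested chromosome has its annotation file in place.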

### Output
- phased haplotypes
- LOD scores
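
The `linkage` step pickles its summarized LODs to files named `<prefix>cutoff<cutoff>.result`, written beside the cached `.pickle` inputs (so `...cutoffNone.result` with the default `cutoff=None`). A minimal sketch for reading them back is given here, assuming that directory layout. `load_lod_results` is a hypothetical helper name, and each loaded object is simply whatever `sum_variant_lods` returned, so unpickling may require the same packages that were installed when the workflow ran.

```python
import glob
import os
import pickle

def load_lod_results(cache_dir, cutoff=None):
    """Load every '*cutoff<cutoff>.result' pickle in cache_dir, keyed by file name."""
    pattern = os.path.join(cache_dir, f"*cutoff{cutoff}.result")
    results = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as handle:
            results[os.path.basename(path)] = pickle.load(handle)
    return results

if __name__ == "__main__":
    # Path assembled from the example commands (--cwd data/wg20220316, chromosome 9).
    lods = load_lod_results("data/wg20220316/chr9test/tmp/CACHE")
    print(f"loaded {len(lods)} result file(s)")
```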

## Command Interface

In [None]:
sos run seqlink_sos.ipynb -h


## Example commands

### seqlink
```
sos run nbs/seqlink_sos.ipynb seqlink --cwd data/wg20220316 --fam_path data/new_trim_ped_famless17_no:xx.fam --vcf_path /mnt/mfs/statgen/alzheimers-family/linkage_files/geno/full_sample/vcf/full_sample.vcf.gz --anno_path MWE/annotation --pop_path data/full_sample_fam_pop.txt --chrom 9 10
```
### linkage
```
sos run nbs/seqlink_sos.ipynb linkage --cwd data/wg20220316 --fam_path data/new_trim_ped_famless17_no:xx.fam --chrom 1 2 3 4 5 6 7 8
```
### seqlink and linkage
```
sos run nbs/seqlink_sos.ipynb --cwd data/wg20220316 --fam_path data/new_trim_ped_famless17_no:xx.fam --vcf_path /mnt/mfs/statgen/alzheimers-family/linkage_files/geno/full_sample/vcf/full_sample.vcf.gz --anno_path MWE/annotation --pop_path data/full_sample_fam_pop.txt --chrom 9 10
```

## Workflow code

In [None]:
[global]
# Working directory where output is saved
parameter: cwd = path
# Fam file
parameter: fam_path = path
# Chromosomes to process
parameter: chrom = list
# Wall-clock time limit per task
parameter: walltime = '24h'
# Memory requested per task
parameter: mem = '100G'

In [None]:
[seqlink (phasing haps)]
# VCF file
parameter: vcf_path = path
# Annotation directory (one multianno CSV per chromosome)
parameter: anno_path = path
# Sample source file path
parameter: pop_path = path
input: fam_path, vcf_path, anno_path, pop_path, for_each = 'chrom'
output: f'{cwd:a}/chr{_chrom}test'
task: walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    
    echo "start"
    seqlink --fam ${fam_path} --vcf ${vcf_path} \
    --anno '${anno_path:a}/EFIGA_NIALOAD_chr${_chrom}.hg38.hg38_multianno.csv' \
    --pop ${pop_path} \
    -o ${_output}  \
    -f 'MERLIN' --build 'hg38' --freq 'AF' --bin 1 --maf-cutoff 0.05 --jobs 1

In [None]:
[linkage (linkage analysis)]
input: cwd, fam_path, for_each = 'chrom'
output: f'{cwd:a}/chr{_chrom}test/tmp/CACHE/chr{_chrom}test'
task: walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'

    import glob
    import pandas as pd
    import numpy as np
    import pickle
    from SEQLinkage.linkage import *

    def run_gene_lods(file,fam,cutoff=None):
        # Load the pickled genes cached by the seqlink step
        with open(file+'.pickle', 'rb') as handle:
            genes = pickle.load(handle)
        gene_variants,gene_fam_haps = format_haps_bunch(genes,fam)
        if cutoff is not None:
            # Keep the first 6 columns, plus two columns per variant
            # whose frequency exceeds the cutoff
            for f,variants in gene_variants.items():
                keep = list(np.repeat((variants.freqs>cutoff)[variants.uniq],2))
                gene_fam_haps[f]=gene_fam_haps[f].loc[:,[True]*6+keep]
        res = parallel_lods(gene_fam_haps.values())
        smy_res = sum_variant_lods(res)
        # Save the summarized LODs next to the input pickle
        with open(file+'cutoff'+str(cutoff)+'.result','wb') as handle:
            pickle.dump(smy_res, handle, protocol=pickle.HIGHEST_PROTOCOL)

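    # Read the six-column, whitespace-delimited fam file
    fam17 = pd.read_csv('${fam_path}',sep=r'\s+',header=None,names=['fid','iid','fathid','mothid','sex','ad'])
    fam17.index = list(fam17.iid)
    # Recode the missing phenotype code (-9) as 0; .loc avoids chained assignment
    fam17.loc[fam17.ad==-9,'ad']=0
    # Split the pedigree table into one data frame per family
    fam17_d = {}
    for i in fam17.fid.unique():
        fam17_d[i] = fam17[fam17.fid==i]
    inputs=glob.glob('${_input[0]}/chr${_chrom}test/tmp/CACHE/chr${_chrom}test*.pickle')
    for i in inputs:
        # Strip the '.pickle' suffix to get the file prefix
        prefix = i[:-len('.pickle')]
        print(prefix)
        run_gene_lods(prefix,fam17_d)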
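
The fam handling in the `linkage` step (read six whitespace-delimited columns, recode a phenotype of `-9` as `0`, then split the table by family ID) can be illustrated without pandas. The following is a dependency-free sketch of that logic, not the workflow's own code. `split_fam_by_family` is a hypothetical name, the sample records are made up, and values stay as strings here, whereas pandas would infer numeric types.

```python
from collections import defaultdict

# Column names used by the linkage step when reading the fam file.
FAM_COLUMNS = ("fid", "iid", "fathid", "mothid", "sex", "ad")

def split_fam_by_family(lines):
    """Group fam records by family ID, recoding a phenotype of -9 as 0."""
    families = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        record = dict(zip(FAM_COLUMNS, fields))
        if record["ad"] == "-9":
            record["ad"] = "0"
        families[record["fid"]].append(record)
    return dict(families)

if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        "F1 P1 0 0 1 -9",
        "F1 P2 0 0 2 1",
        "F2 P3 0 0 1 2",
    ]
    for fid, members in split_fam_by_family(example).items():
        print(fid, [(m["iid"], m["ad"]) for m in members])
```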