In [1]:
import os, sys

rootdir = os.path.dirname(os.path.dirname(os.getcwd()))
print(rootdir)
sys.path.insert(0, rootdir)

/home/camille/1000Genomes/src


# Pooling-imputation performance from real bead chip data with high-density markers

Performs imputation on unpooled data from chromosome 20 of the 1000 Genomes Project (1000GP).
The 1KGP data spans from POS 60,343 to POS 62,965,354, i.e. a size of ca. 63 Mb.

The target and the reference markers are taken from the intersection between the real bead chip Illumina Infinium OmniExpress-24 Kit
 and the chr20 1000GP data. These are ca. 17,791 markers, which corresponds to a density of approx. 1 SNP per 3.5 kb
 (bibliography: Beagle07 mentions 1 SNP per 3 kb for low-density coverage).
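
As a quick sanity check, the marker density can be recomputed from the positions and the marker count quoted above. This is only a sketch: the variable names are illustrative, and the marker count is the approximate figure from the text, not a value read from the VCF.

```python
# Figures quoted in the text above; the variable names are illustrative only.
first_pos = 60_343       # first 1KGP position on chr20
last_pos = 62_965_354    # last 1KGP position on chr20
n_markers = 17_791       # approximate size of the chip/1000GP intersection

span_bp = last_pos - first_pos             # covered span in base pairs
kb_per_snp = span_bp / n_markers / 1_000   # average spacing between markers

print(f"span: {span_bp / 1e6:.1f} Mb")
print(f"density: 1 SNP per {kb_per_snp:.2f} kb")
```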
 
The samples are assigned to the reference panel or the study population in the same layout as when imputing pooled samples
  on high-density genotyped markers (see 20200709).

Apply the pooling simulation to this data, then impute with Phaser for 1 individual (dev version):
**pool and impute bead chip markers only, compute metrics and plot statistics**

In [2]:
try:
    os.mkdir('/home/camille/PoolImpHuman/data/20200801')
except FileExistsError:
    pass
os.chdir('/home/camille/PoolImpHuman/data/20200801')

In [3]:
print('Configure directory')
%sx ln -s ~/1000Genomes/src/VCFPooling/python/omniexpress24chr20.ipynb ./
%sx ln -s ~/1000Genomes/src/VCFPooling/python/omniexpress_20200801.ipynb ./
%sx ln -s ../omniexpress24/InfiniumOmniExpress-chr20-CHROM-POS.txt ./

# Use same target population as with imputation from pooled data
%sx ln -s ../20200709/study.population
%sx ln -s ../20200709/reference.panel
%sx ln -s ../20200709/IMP.chr20.snps.gt.vcf.gz

Configure directory


[]
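
The `ln -s` calls above typically fail with a "File exists" error when the cell is re-run in an already configured directory. A minimal Python sketch of the same step that skips existing links could look as follows. The helper name `link_into` is hypothetical and not part of VCFPooling.

```python
import os
from pathlib import Path


def link_into(target, dest_dir="."):
    """Create dest_dir/<basename of target> -> target, skipping existing entries.

    Hypothetical helper: returns True if a link was created, False otherwise.
    """
    link = Path(dest_dir) / Path(target).name
    if link.is_symlink() or link.exists():
        return False           # already configured, nothing to do
    os.symlink(target, link)   # same effect as `ln -s target dest_dir/`
    return True
```

Called as `link_into('../20200709/study.population')`, it would mirror the corresponding `ln -s` line and leave an existing link untouched.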

### Prepare experimental VCF files (low-density coverage)

In [4]:
print('Create/copy individual files (true and imputed)')
%sx cp ./phaser/omniexpressHD/HDHD/sHG01063.IMP.chr20.pooled.snps.gl.full.genos.vcf.gz ./sHG01063.IMP.chr20.pooled.imputed.vcf.gz
%sx bcftools view -Oz -o sHG01063.IMP.chr20.snps.gt.vcf.gz -s HG01063 IMP.chr20.snps.gt.vcf.gz

Create/copy individual files (true and imputed)


[]

In [5]:
print('Index files')
%sx bcftools index sHG01063.IMP.chr20.pooled.imputed.vcf.gz
%sx bcftools index sHG01063.IMP.chr20.snps.gt.vcf.gz

Index files


[]

In [6]:
print('Check number of samples')
%sx bcftools query -l sHG01063.IMP.chr20.pooled.imputed.vcf.gz | wc -l

Check number of samples


['1']

In [7]:
print('Check number of intersected markers')
%sx bcftools view -H sHG01063.IMP.chr20.pooled.imputed.vcf.gz | wc -l

Check number of intersected markers


["[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 '52697']
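
The warning above indicates that the header of the imputed file does not declare the `GT` FORMAT field. The two counts and the header check can also be done in one pass with the standard library, as sketched below. The function name `summarize_vcf` is hypothetical, and the sketch assumes a well-formed, tab-separated VCF compressed with gzip or bgzip.

```python
import gzip


def summarize_vcf(path):
    """Return (n_samples, n_records, gt_declared) for a gzip/bgzip-compressed VCF."""
    n_samples = 0
    n_records = 0
    gt_declared = False
    with gzip.open(path, "rt") as vcf:
        for line in vcf:
            if line.startswith("##"):
                # Meta-information line: look for the GT FORMAT declaration
                if line.startswith("##FORMAT=<ID=GT,"):
                    gt_declared = True
            elif line.startswith("#CHROM"):
                # 9 fixed columns (CHROM..FORMAT) precede the sample names
                n_samples = max(len(line.rstrip("\n").split("\t")) - 9, 0)
            elif line.strip():
                n_records += 1   # one data line per marker
    return n_samples, n_records, gt_declared
```

If `gt_declared` comes back as `False`, the missing header line is the likely source of the `vcf_parse_format` warning.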

In [8]:
%sx pwd
%sx ls -la

['total 2208',
 'drwxr-xr-x  3 camille camille    4096 aug  2 22:48 .',
 'drwxr-xr-x 32 camille camille    4096 aug  2 22:03 ..',
 'lrwxrwxrwx  1 camille camille      36 aug  2 22:48 IMP.chr20.snps.gt.vcf.gz -> ../20200709/IMP.chr20.snps.gt.vcf.gz',
 'lrwxrwxrwx  1 camille camille      56 aug  2 22:48 InfiniumOmniExpress-chr20-CHROM-POS.txt -> ../omniexpress24/InfiniumOmniExpress-chr20-CHROM-POS.txt',
 'lrwxrwxrwx  1 camille camille      74 aug  2 22:48 omniexpress_20200801.ipynb -> /home/camille/1000Genomes/src/VCFPooling/python/omniexpress_20200801.ipynb',
 'lrwxrwxrwx  1 camille camille      72 aug  2 22:48 omniexpress24chr20.ipynb -> /home/camille/1000Genomes/src/VCFPooling/python/omniexpress24chr20.ipynb',
 'drwxr-sr-x  5 camille camille    4096 aug  2 22:15 phaser',
 'lrwxrwxrwx  1 camille camille      27 aug  2 22:48 reference.panel -> ../20200709/reference.panel',
 '-rw-r--r--  1 camille camille  393457 aug  2 22:48 sHG01063.IMP.chr20.pooled.imputed.vcf.gz',
 '-rw-rw-r--  1 cam

### Run pooling


### Run imputation
With Phaser, on Rackham. Use the gcc-compiled version, built without `-march`.
Steps:

1. Prepare phaser folders and scripts
```bash
cd phaser/
mkdir omniexpressLD && cd omniexpressLD
cp ../omniexpress/run_omniHD_snowy.s* ./
mv run_omniHD_snowy.sh run_omniLD_snowy.sh
mv run_omniHD_snowy.sbatch run_omniLD_snowy.sbatch
```

Modify contents of `.sh` and `.sbatch` scripts
```bash
nano run_omniLD_snowy.sh
```
set to 
```
# all paths are relative to phaser/ (this script is assumed to be executed from phaser/omniexpressLD)
cd ..

ne=11418
#error=0.001
error=0.00001
mapfile=examples/5_snps_interpolated_HapMap2_map_20
indir=omniexpressLD/
sample=$SLURM_ARRAY_TASK_ID
samples_file=IMP.chr20.pooled.snps.gl.vcf.gz
ref_file=omniexpressLD/REF.chr20.snps.gt.vcf.gz
results_directory=omniexpressLD/LDLD/

```

```bash
nano run_omniLD_snowy.sbatch
```

set to

```
#!/bin/bash -l
#SBATCH -A snic2019-8-216
#SBATCH -p node
#SBATCH -C mem128GB
#SBATCH -n 16
#SBATCH -t 2-00:00:00
#SBATCH -J omni_LD
### assumes current directory is /crex/proj/snic2019-8-216/private/phaser/omniexpressLD
module load bioinfo-tools
module load bcftools/1.9
module load tabix/0.2.6
bash run_omniLD_snowy.sh # should use 16 OMP threads
```

2. Transfer experimental VCF files from local
```bash
~/PoolImpHuman/data/20200728$ scp *.vcf.gz* camcl609@rackham.uppmax.uu.se:/crex/proj/snic2019-8-216/private/phaser/omniexpressLD
```

In [9]:
print('Impute missing genotypes in the low-density coverage file')
# TODO: move the description above to the Joplin Lab Notebook

Impute missing genotypes in the low-density coverage file


### Compute results with customized metrics

In [3]:
os.chdir('/home/camille/PoolImpHuman/data/20200801')
#%sx python3 -u /home/camille/1000Genomes/src/VCFPooling/poolSNPs/imputation_quality.py ./ sHG01063.IMP.chr20.snps.gt.vcf.gz sHG01063.IMP.chr20.pooled.imputed.vcf.gz /home/camille/1000Genomes/src/VCFPooling/bin/gt_to_gl.sh "chrom:pos"  

['/home/camille/PoolImpHuman',
 "[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 "[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 "[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 "[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 "[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 "[W::vcf_parse_format] FORMAT 'GT' is not defined in the header, assuming Type=String",
 'Traceback (most recent call last):',
 '  File "/home/camille/1000Genomes/src/VCFPooling/poolSNPs/imputation_quality.py", line 58, in <module>',
 '    qbeaglegt.pearsoncorrelation(),',
 '  File "/home/camille/1000Genomes/src/VCFPooling/poolSNPs/metrics/quality.py", line 155, in pearsoncorrelation',
 '    score = list(map(scorer, zip(true, imputed)))',
 '  File "/home/camille/1000Genomes/src/VCFPooling/poolSNPs/metrics/quality.py",

In [11]:
# Verify files created at the different phasing and imputation steps
assert os.path.exists('imputation_quality_gtgl.png')


AssertionError: 
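
The bare `assert` fails with an empty message, presumably because the `imputation_quality.py` call in the previous cell is commented out and the plot was never written. A sketch of a more informative check is given below. The helper `missing_outputs` is hypothetical, and the list holds only the one file name that appears in this notebook.

```python
from pathlib import Path


def missing_outputs(expected, directory="."):
    """Return the expected file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).exists()]


# Only this file name appears in the notebook; extend the list as needed.
expected = ["imputation_quality_gtgl.png"]
missing = missing_outputs(expected)
if missing:
    print("Missing output files:", ", ".join(missing))
```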

In [4]:
# if bcftools is configured for python 2.7 usage 
print('Plotting results with bcftools stats')
%sx deactivate
# plot-vcfstats runs a generated Python script (plot.py), which needs python 2.7 in this setup
%sx bcftools stats --af-bins 0.01,0.02,0.04,0.08,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.98 --collapse snps -s HG01063 sHG01063.IMP.chr20.pooled.imputed.vcf.gz sHG01063.IMP.chr20.snps.gt.vcf.gz > filestats.vchk
%sx plot-vcfstats -p bcftoolstats -s filestats.vchk


Plotting results with bcftools stats


['Parsing bcftools stats output: filestats.vchk',
 '\texpected: # PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets\t[7]nTransitions\t[8]nTransversions\t[9]nIndels\t[10]average depth\t[11]nSingletons',
 '\tfound:    # PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets\t[7]nTransitions\t[8]nTransversions\t[9]nIndels\t[10]average depth\t[11]nSingletons\t[12]nHapRef\t[13]nHapAlt\t[14]nMissing',
 'Plotting graphs: python plot.py',
 'Invalid limit will be ignored.',
 '  ax1.set_xlim(0,1.01)',
 'Invalid limit will be ignored.',
 '  ax1.set_xlim(0,1.01)',
 'Creating PDF: pdflatex summary.tex >plot-vcfstats.log 2>&1',
 'Finished: bcftoolstats/summary.pdf']
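
The `found:` header above lists the columns of the per-sample counts (`PSC`) section as written by this bcftools version, including the three columns that `plot-vcfstats` did not expect. A minimal sketch for reading these rows directly from `filestats.vchk` follows. The function name `parse_psc` is hypothetical, the column names are copied from the header shown in the output, and all values are kept as strings.

```python
import os

# Column names copied from the "found:" PSC header above
PSC_FIELDS = [
    "id", "sample", "nRefHom", "nNonRefHom", "nHets", "nTransitions",
    "nTransversions", "nIndels", "average_depth", "nSingletons",
    "nHapRef", "nHapAlt", "nMissing",
]


def parse_psc(lines):
    """Yield one dict per `PSC` row of a bcftools stats report."""
    for line in lines:
        if not line.startswith("PSC\t"):
            continue   # skip comment lines and all other sections
        values = line.rstrip("\n").split("\t")[1:]
        yield dict(zip(PSC_FIELDS, values))


# Usage, only if the report from the cell above is present
if os.path.exists("filestats.vchk"):
    with open("filestats.vchk") as report:
        for row in parse_psc(report):
            print(row["id"], row["sample"], row["nHets"], row["nMissing"])
```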