# Pooling-imputation performance from real bead chip data with high density markers

Performs imputation on chromosome 20 data from the 1000 Genomes Project (1000GP), on non-pooled data.
The 1000GP data span POS 60,343 to POS 62,965,354, i.e. a region of ca. 63 Mb.

The target and reference markers are taken from the intersection between the real bead chip Illumina Infinium OmniExpress-24 Kit
 and the chr20 1000GP data. This yields 17,791 markers, which corresponds to a density of approx. 1 SNP per 3.54 kb
 (bibliography: Beagle07 mentions 1 SNP per 3 kb for low-density coverage).
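
As a quick check of the density figure, the span and the marker count quoted above can be divided directly. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Marker density over the chr20 span quoted above
start_pos = 60_343
end_pos = 62_965_354
n_markers = 17_791

span_bp = end_pos - start_pos
density_kb = span_bp / n_markers / 1000  # average spacing in kb per SNP
print(f"span = {span_bp / 1e6:.1f} Mb, density = 1 SNP per {density_kb:.2f} kb")
```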
 
The samples are assigned to the reference panel or the study population in the same layout as when imputing pooled samples
  on high-density genotyped markers (see 20200709).

Apply the pooling simulation to these data, then run imputation (parallel execution):
**pool and impute bead chip markers only, compute metrics and plot statistics**

In [1]:
import os

# Create the working directory for this experiment if needed, then move into it
workdir = '/home/camille/PoolImpHuman/data/20200728'
os.makedirs(workdir, exist_ok=True)
os.chdir(workdir)

In [2]:
print('Configure directory')
%sx ln -s ~/1000Genomes/src/VCFPooling/python/omniexpress24chr20.ipynb ./
%sx ln -s ~/1000Genomes/src/VCFPooling/python/omniexpress_20200728.ipynb ./
%sx ln -s ../omniexpress24/InfiniumOmniExpress-chr20-CHROM-POS.txt ./
%sx ln -s ~/1000Genomes/src/VCFPooling/python/parallel_pooling_20200624.py ./

# Use same target population as with imputation from pooled data
%sx ln -s ../20200709/study.population
%sx ln -s ../20200709/reference.panel

Configure directory


["ln: failed to create symbolic link './reference.panel': File exists"]
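
The `File exists` message appears because `ln -s` refuses to overwrite a link left by an earlier run. A minimal sketch of an idempotent alternative in Python follows; `link_if_missing` is a hypothetical helper, not part of the VCFPooling code.

```python
import os

def link_if_missing(target, link_dir="."):
    """Create a symlink to `target` inside `link_dir` unless one already exists.

    Returns True if a new link was created, False if the name was already taken.
    """
    link_path = os.path.join(link_dir, os.path.basename(target))
    if os.path.lexists(link_path):  # lexists also detects dangling symlinks
        return False
    os.symlink(target, link_path)
    return True
```

Called as `link_if_missing('../20200709/reference.panel')`, this would leave an existing link untouched instead of raising an error on re-execution of the cell.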

### Prepare experimental VCF files (low-density coverage)

In [3]:
print('Create file')
%sx bcftools view -Oz -o ALL.chr20.snps.gt.vcf.gz -R InfiniumOmniExpress-chr20-CHROM-POS.txt ../main/ALL.chr20.snps.gt.vcf.gz

Create file


[]
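
For SNP records, the effect of the `-R` filter above can be pictured as a membership test on (CHROM, POS) pairs. The sketch below assumes the regions file is tab-delimited with CHROM and POS columns, as its name suggests; the positions and function names are invented for illustration, and the real `bcftools` region matching is more general than this.

```python
def read_targets(lines):
    """Parse tab-delimited CHROM<TAB>POS lines into a set of (chrom, pos) pairs."""
    targets = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        targets.add((chrom, int(pos)))
    return targets

def keep_targets(records, targets):
    """Keep only the records whose (chrom, pos) is in the target set."""
    return [rec for rec in records if (rec[0], rec[1]) in targets]

# Toy data: positions are invented for illustration only
chip = read_targets(["20\t100", "20\t250", "20\t900"])
panel = [("20", 100, "A", "G"), ("20", 180, "C", "T"), ("20", 250, "G", "A")]
kept = keep_targets(panel, chip)
```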

In [4]:
print('Index file')
%sx bcftools index ALL.chr20.snps.gt.vcf.gz

Index file


[]

In [5]:
print('Check number of samples')
%sx bcftools query -l ALL.chr20.snps.gt.vcf.gz | wc -l

Check number of samples


['2504']

In [6]:
print('Check number of intersected markers')
%sx bcftools view -H ALL.chr20.snps.gt.vcf.gz | wc -l

Check number of intersected markers


['17791']

In [7]:
print('Create REF and IMP populations')
# NB: IMP.chr20.pooled.snps.gl.vcf.gz is not GL-formatted, but the name must match the one expected by the bash script for phasing + imputation
%sx bcftools view -Oz -S study.population -o IMP.chr20.snps.gt.vcf.gz ALL.chr20.snps.gt.vcf.gz
%sx bcftools index -f IMP.chr20.snps.gt.vcf.gz
%sx bcftools view -Oz -S reference.panel -o REF.chr20.snps.gt.vcf.gz ALL.chr20.snps.gt.vcf.gz
%sx bcftools index -f REF.chr20.snps.gt.vcf.gz

Create REF and IMP populations


[]
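
Since the two sample lists are used to split one VCF file, it can be worth checking that they are disjoint and that no sample is left unassigned. The following is a sketch with invented sample IDs; `check_partition` is a hypothetical helper, and in practice the lists would come from `bcftools query -l` and the `study.population` and `reference.panel` files.

```python
def check_partition(all_samples, study, reference):
    """Report samples that break a clean study/reference split."""
    all_set, study_set, ref_set = set(all_samples), set(study), set(reference)
    return {
        "overlap": sorted(study_set & ref_set),              # assigned to both groups
        "unknown": sorted((study_set | ref_set) - all_set),  # not present in the VCF
        "unassigned": sorted(all_set - study_set - ref_set), # in neither group
    }

# Toy sample IDs, invented for illustration
report = check_partition(
    all_samples=["S1", "S2", "S3", "S4"],
    study=["S1"],
    reference=["S2", "S3"],
)
```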

In [8]:
%sx pwd
%sx ls -la

['total 32364',
 'drwxrwxr-x  2 camille camille     4096 jul 28 12:13 .',
 'drwxr-xr-x 30 camille camille     4096 jul 28 11:52 ..',
 '-rw-rw-r--  1 camille camille 15800341 jul 28 12:24 ALL.chr20.snps.gt.vcf.gz',
 '-rw-rw-r--  1 camille camille    32033 jul 28 12:24 ALL.chr20.snps.gt.vcf.gz.csi',
 'lrwxrwxrwx  1 camille camille       59 jul 28 12:13 bcfchunkpara.sh -> /home/camille/PoolImpHuman/bin/bash-scripts/bcfchunkpara.sh',
 '-rw-rw-r--  1 camille camille  2298507 jul 28 12:25 IMP.chr20.snps.gt.vcf.gz',
 '-rw-rw-r--  1 camille camille    26842 jul 28 12:25 IMP.chr20.snps.gt.vcf.gz.csi',
 'lrwxrwxrwx  1 camille camille       56 jul 28 11:52 InfiniumOmniExpress-chr20-CHROM-POS.txt -> ../omniexpress24/InfiniumOmniExpress-chr20-CHROM-POS.txt',
 'lrwxrwxrwx  1 camille camille       74 jul 28 11:52 omniexpress_20200728.ipynb -> /home/camille/1000Genomes/src/VCFPooling/python/omniexpress_20200728.ipynb',
 'lrwxrwxrwx  1 camille camille       72 jul 28 11:52 omniexpress24chr20.ipynb -> /

### Run pooling

In [9]:
print('Clean data/tmp directory')
%sx rm ~/PoolImpHuman/data/tmp/*

Clean data/tmp directory


[]

In [10]:
print('Chunk the file to be imputed')
%sx ln -s /home/camille/PoolImpHuman/bin/bash-scripts/bcfchunkpara.sh ./
%sx bash bcfchunkpara.sh IMP.chr20.snps.gt.vcf.gz ~/PoolImpHuman/data/tmp 1000
# NB: file_in parameter cannot have path prefix, it must be a file name only

Chunk the file to be imputed


['Counting lines in IMP.chr20.snps.gt.vcf.gz',
 'Number of files to pack =  18',
 'Worker 0 GO!',
 'Worker 1 GO!',
 'Worker 3 GO!',
 'Worker 2 GO!',
 'Worker 1: Packing and writing chunk 1',
 'Starts at POS 2441080 and ends at POS 5523754',
 'Worker 3: Packing and writing chunk 3',
 'Starts at POS 8566072 and ends at POS 11464227',
 'Worker 2: Packing and writing chunk 2',
 'Starts at POS 5524247 and ends at POS 8565782',
 'Worker 0: Packing and writing chunk 0',
 'Starts at POS 63244 and ends at POS 2440845',
 "    Worker 1: Chunk '1' OK!",
 "    Worker 3: Chunk '3' OK!",
 "    Worker 2: Chunk '2' OK!",
 "    Worker 0: Chunk '0' OK!",
 'Worker 1: Packing and writing chunk 5',
 'Starts at POS 15056675 and ends at POS 17342346',
 'Worker 3: Packing and writing chunk 7',
 'Starts at POS 20169162 and ends at POS 24187784',
 'Worker 2: Packing and writing chunk 6',
 'Starts at POS 17342611 and ends at POS 20158093',
 "    Worker 1: Chunk '5' OK!",
 'Worker 0: Packing and writing chunk 4',


In [2]:
print('Pool the chunks')
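# NB: each %sx call runs in its own subshell, so this activation does not carry over to the next command
%sx source ~/1000Genomes/venv3.6/bin/activate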
%sx python3 -u parallel_pooling_20200624.py /home/camille/PoolImpHuman/data/20200728/IMP.chr20.snps.gt.vcf.gz /home/camille/PoolImpHuman/data/20200728/IMP.chr20.pooled.snps.gl.vcf.gz 4

Pool the chunks


['SNIC PROJ: /home/camille/PoolImpHuman/data',
 '',
 '*******************************************************************************',
 'Number of cpu to be used = 4',
 'Input file = /home/camille/PoolImpHuman/data/20200728/IMP.chr20.snps.gt.vcf.gz',
 'Output file = /home/camille/PoolImpHuman/data/20200728/IMP.chr20.pooled.snps.gl.vcf.gz',
 '*******************************************************************************',
 '',
 '',
 '18 files found will be pooled.................................................',
 "['/home/camille/PoolImpHuman/data/tmp/pack13.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack1.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack4.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack11.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack0.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack8.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack14.IMP.chr20.s

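The chunk files are listed in arbitrary order in the log above (`pack13`, `pack1`, `pack4`, ...). If the pooled chunks need to be handled in genomic order, sorting on the numeric index is safer than a plain string sort. This is a sketch only; whether `parallel_pooling_20200624.py` sorts its inputs this way is not shown here, and `pack_index` is a hypothetical helper.

```python
import os
import re

def pack_index(path):
    """Extract the integer N from a file name of the form packN.<rest>."""
    match = re.match(r"pack(\d+)\.", os.path.basename(path))
    if match is None:
        raise ValueError(f"not a pack file: {path}")
    return int(match.group(1))

# File names follow the pattern visible in the log above
files = [
    "/home/camille/PoolImpHuman/data/tmp/pack13.IMP.chr20.snps.gt.vcf.gz",
    "/home/camille/PoolImpHuman/data/tmp/pack1.IMP.chr20.snps.gt.vcf.gz",
    "/home/camille/PoolImpHuman/data/tmp/pack4.IMP.chr20.snps.gt.vcf.gz",
    "/home/camille/PoolImpHuman/data/tmp/pack11.IMP.chr20.snps.gt.vcf.gz",
]
ordered = sorted(files, key=pack_index)
```

A plain `sorted(files)` would place `pack11` and `pack13` before `pack4`, which is why the numeric key is used.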
### Run imputation
With Phaser, on Rackham. Use the gcc-compiled version, without `--march`.

Steps:

1. Prepare phaser folders and scripts
```bash
cd phaser/
mkdir omniexpressLD && cd omniexpressLD
cp ../omniexpress/run_omniHD_snowy.s* ./
mv run_omniHD_snowy.sh run_omniLD_snowy.sh
mv run_omniHD_snowy.sbatch run_omniLD_snowy.sbatch
```

Modify contents of `.sh` and `.sbatch` scripts
```bash
nano run_omniLD_snowy.sh
```
set to 
```
# all paths are relative to phaser/   (assuming this script is executed from phaser/omniexpressLD)
cd ..

ne=11418
#error=0.001
error=0.00001
mapfile=examples/5_snps_interpolated_HapMap2_map_20
indir=omniexpressLD/
sample=$SLURM_ARRAY_TASK_ID
samples_file=IMP.chr20.pooled.snps.gl.vcf.gz
ref_file=omniexpressLD/REF.chr20.snps.gt.vcf.gz
results_directory=omniexpressLD/LDLD/

```

```bash
nano run_omniLD_snowy.sbatch
```

set to

```
#!/bin/bash -l
#SBATCH -A snic2019-8-216
#SBATCH -p node
#SBATCH -C mem128GB
#SBATCH -n 16
#SBATCH -t 2-00:00:00
#SBATCH -J omni_LD
### assumes current directory is /crex/proj/snic2019-8-216/private/phaser/omniexpressLD
module load bioinfo-tools
module load bcftools/1.9
module load tabix/0.2.6
bash run_omniLD_snowy.sh # should use 16 OMP threads
```

2. Transfer experimental VCF files from local
```bash
~/PoolImpHuman/data/20200728$ scp *.vcf.gz* camcl609@rackham.uppmax.uu.se:/crex/proj/snic2019-8-216/private/phaser/omniexpressLD
```

In [12]:
print('Impute missing genotypes in the low-density coverage file')

Impute missing genotypes in the low-density coverage file


['Contigs in the reference file',
 '.................................................................................',
 'Chromosome  20    Startpos = 63244    Endpos = 62912463',
 '',
 '',
 'Check FORMAT field in files for imputation',
 '.................................................................................',
 'FORMAT in reference panel:  GT',
 '[E::hts_open_format] Failed to open file IMP.chr20.pooled.snps.gl.vcf.gz',
 'Failed to open IMP.chr20.pooled.snps.gl.vcf.gz: No such file or directory',
 'FORMAT in target: ',
 '',
 '',
 'Check number of samples and number of markers in files for imputation',
 '.................................................................................',
 'reference:',
 '2264',
 '',
 'target:',
 '[E::hts_open_format] Failed to open file IMP.chr20.pooled.snps.gl.vcf.gz',
 'Failed to open IMP.chr20.pooled.snps.gl.vcf.gz: No such file or directory',
 '0',
 '',
 '',
 'Phase reference and target with BEAGLE',
 '.....................................
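
The run above fails because the target file `IMP.chr20.pooled.snps.gl.vcf.gz` cannot be opened. A small pre-flight check before launching the pipeline would surface this earlier. The sketch below is illustrative; `missing_inputs` is a hypothetical helper and the file list would need to match the paths used by the actual script.

```python
import os

def missing_inputs(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]

required = [
    "IMP.chr20.pooled.snps.gl.vcf.gz",  # target file named in the log above
    "REF.chr20.snps.gt.vcf.gz",         # reference panel
]
missing = missing_inputs(required)
if missing:
    print("Missing input files:", ", ".join(missing))
```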

In [13]:
print('Plotting results with bcftools stats')
%sx deactivate
# bcftools stats needs python 2.7
%sx bcftools stats --af-bins 0.01,0.02,0.04,0.08,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.98 --collapse snps -S study.population IMP.chr20.pooled.imputed.vcf.gz IMP.chr20.snps.gt.vcf.gz > filestats.vchk
%sx plot-vcfstats -p bcftoolstats -s filestats.vchk

Plotting results with bcftools stats


['Parsing bcftools stats output: filestats.vchk',
 'Sanity check failed: was this file generated by bcftools stats? at /usr/bin/plot-vcfstats line 99.',
 '\tmain::error("Sanity check failed: was this file generated by bcftools stats?") called at /usr/bin/plot-vcfstats line 585',
 '\tmain::parse_vcfstats1(HASH(0x5644e32cfb78), 0) called at /usr/bin/plot-vcfstats line 294',
 '\tmain::parse_vcfstats(HASH(0x5644e32cfb78)) called at /usr/bin/plot-vcfstats line 47']
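
The sanity check most likely fails because `filestats.vchk` is empty or incomplete, which would happen if `bcftools stats` could not read the imputed file. The sketch below tests a stats file before plotting. It assumes that `bcftools stats` output starts with a `# This file was produced by` header line; that prefix is an assumption here, and `looks_like_vchk` is a hypothetical helper.

```python
def looks_like_vchk(path):
    """Return True if `path` is non-empty and starts like bcftools stats output."""
    try:
        with open(path) as handle:
            first_line = handle.readline()
    except FileNotFoundError:
        return False
    # Assumed header convention of bcftools stats output
    return first_line.startswith("# This file was produced by")
```

Running `looks_like_vchk('filestats.vchk')` before `plot-vcfstats` would distinguish a plotting problem from a missing or empty input.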

### Compute results with customized metrics

In [14]:
%sx python3 -u ../poolSNPs/imputation_quality.py ./ IMP.chr20.snps.gt.vcf.gz IMP.chr20.pooled.imputed.vcf.gz bin/gt_to_gl.sh  

["python3: can't open file '../poolSNPs/imputation_quality.py': [Errno 2] No such file or directory"]
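
The metrics computed by `imputation_quality.py` are not shown in this notebook. As a generic illustration only, and not the script's implementation, genotype concordance between the true and the imputed genotypes can be sketched as below; the function name and the toy genotypes are invented.

```python
def genotype_concordance(true_gt, imputed_gt):
    """Fraction of genotypes (coded 0, 1, 2; None = missing) that match."""
    pairs = [(t, i) for t, i in zip(true_gt, imputed_gt)
             if t is not None and i is not None]
    if not pairs:
        return float("nan")
    return sum(t == i for t, i in pairs) / len(pairs)

# Toy genotypes for one marker across six samples, invented for illustration
true_gt = [0, 1, 2, 0, 1, None]
imputed_gt = [0, 1, 1, 0, 2, 0]
score = genotype_concordance(true_gt, imputed_gt)
```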

In [15]:
# Verify files created at the different phasing and imputation steps
assert os.path.exists('imputation_quality_gtgl.png')


AssertionError: 

In [None]:
# # if bcftools is configured for python 2.7 usage 
# print('Plotting results with bcftools stats')
# %sx deactivate
# # bcftools stats needs python 2.7
# %sx bcftools stats --af-bins 0.01,0.02,0.04,0.08,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.98 --collapse snps -S study.population IMP.chr20.pooled.imputed.vcf.gz IMP.chr20.snps.gt.vcf.gz > filestats.vchk
# %sx plot-vcfstats -p bcftoolstats -s filestats.vchk
