# Pooling-imputation performance from real bead chip data with high density markers

Performs pooling simulation and imputation on chromosome 20 data from the 1000 Genomes Project (1000GP).
Markers are the intersection between the real bead chip Illumina Infinium OmniExpress2.5-8 Kit
 (https://support.illumina.com/array/array_kits/humanomniexpress2.5-8-beadchip-kit/downloads.html) 
 and the chr20 1000GP data. The samples are randomly assigned to either the reference panel or the study population.

Apply pooling simulation and then imputation to this data (run in parallel): 
**pool and impute bead chip markers only, compute metrics and plot statistics** 
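
The random assignment of samples to the reference panel or the study population described above can be sketched as follows. This is only an illustration: the function name `split_samples`, the seed, the placeholder sample IDs and the study size of 240 are assumptions, not values taken from this notebook. Only the total of 2504 samples comes from the checks further down.

```python
import random


def split_samples(sample_ids, n_study, seed=0):
    """Randomly split sample IDs into (reference panel, study population)."""
    if not 0 <= n_study <= len(sample_ids):
        raise ValueError("n_study must be between 0 and the number of samples")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    study = sorted(shuffled[:n_study])
    reference = sorted(shuffled[n_study:])
    return reference, study


# Placeholder IDs; the real IDs would come from `bcftools query -l`
samples = [f"S{i:04d}" for i in range(2504)]
ref, imp = split_samples(samples, n_study=240)  # 240 is a hypothetical size
print(len(ref), len(imp))
```

Each sample lands in exactly one of the two groups, so the reference panel and the study population never overlap.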

In [1]:
import os

# Create the working directory for this run (no error if it already exists)
workdir = '/home/camille/PoolImpHuman/data/20200812'
os.makedirs(workdir, exist_ok=True)
os.chdir(workdir)

In [2]:
print('Configure directory')
%sx ln -s ~/1000Genomes/src/VCFPooling/python/omniexpress8chr20.ipynb ./
%sx ln -s ~/1000Genomes/src/VCFPooling/python/omniexpress_20200709.ipynb ./
%sx ln -s ~/1000Genomes/src/VCFPooling/python/parallel_pooling_20200812.py ./
%sx ln -s /home/camille/PoolImpHuman/data/20200709/ALL.chr20.snps.gt.vcf.gz

Configure directory


[]

### Prepare experimental VCF file

The IMP and REF files are identical to those of the 20200709 run (imputation with default parameters).

In [3]:
%sx ln -s /home/camille/PoolImpHuman/data/20200709/IMP.chr20.snps.gt.vcf.gz
%sx ln -s /home/camille/PoolImpHuman/data/20200709/REF.chr20.snps.gt.vcf.gz
%sx bcftools index -f IMP.chr20.snps.gt.vcf.gz
%sx bcftools index -f REF.chr20.snps.gt.vcf.gz

[]

In [4]:
print('Check number of samples')
%sx bcftools query -l ALL.chr20.snps.gt.vcf.gz | wc -l

Check number of samples


['2504']

In [5]:
print('Check number of intersected markers')
%sx bcftools view -H ALL.chr20.snps.gt.vcf.gz | wc -l

Check number of intersected markers


['52697']
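
The two counts above can be cross-checked without bcftools. The sketch below assumes the file is a gzip-readable VCF (bgzip output can be read by Python's `gzip` module); the helper name `count_samples_and_records` and the toy file are made up for illustration and are not part of the VCFPooling code.

```python
import gzip
import os
import tempfile


def count_samples_and_records(vcf_gz_path):
    """Return (number of samples, number of variant records) in a gzipped VCF."""
    n_samples = 0
    n_records = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # 9 fixed columns (CHROM..FORMAT) precede the sample columns
                n_samples = max(len(line.rstrip("\n").split("\t")) - 9, 0)
            elif line.strip():
                n_records += 1
    return n_samples, n_records


# Toy file for demonstration only (not the notebook's data)
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "20\t100\t.\tC\tA\t.\tPASS\t.\tGT\t0|0\t0|1\n"
    "20\t200\t.\tG\tT\t.\tPASS\t.\tGT\t1|1\t0|0\n"
)
with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, "toy.vcf.gz")
    with gzip.open(toy_path, "wt") as out:
        out.write(toy_vcf)
    print(count_samples_and_records(toy_path))
```

Run on `ALL.chr20.snps.gt.vcf.gz`, the function should return the same pair of numbers as the two bcftools checks, provided the file has a standard header.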

In [6]:
print('Clean data/tmp directory')
%sx rm ~/PoolImpHuman/data/tmp/*

Clean data/tmp directory


[]

In [7]:
print('Chunk the file to be imputed')
%sx ln -s /home/camille/PoolImpHuman/bin/bash-scripts/bcfchunkpara.sh ./
%sx bash bcfchunkpara.sh IMP.chr20.snps.gt.vcf.gz ~/PoolImpHuman/data/tmp 1000
# NB: the file_in parameter must be a bare file name, without any path prefix

Chunk the file to be imputed


['Counting lines in IMP.chr20.snps.gt.vcf.gz',
 'Number of files to pack =  53',
 'Worker 0 GO!',
 'Worker 2 GO!',
 'Worker 3 GO!',
 'Worker 1 GO!',
 'Worker 2: Packing and writing chunk 2',
 'Starts at POS 1823326 and ends at POS 2827243',
 'Worker 3: Packing and writing chunk 3',
 'Starts at POS 2827468 and ends at POS 4074065',
 'Worker 0: Packing and writing chunk 0',
 'Starts at POS 61651 and ends at POS 863706',
 'Worker 1: Packing and writing chunk 1',
 'Starts at POS 863744 and ends at POS 1821917',
 "    Worker 3: Chunk '3' OK!",
 "    Worker 2: Chunk '2' OK!",
 "    Worker 1: Chunk '1' OK!",
 "    Worker 0: Chunk '0' OK!",
 'Worker 2: Packing and writing chunk 6',
 'Starts at POS 5756281 and ends at POS 6693076',
 'Worker 1: Packing and writing chunk 5',
 'Starts at POS 4790612 and ends at POS 5754542',
 'Worker 0: Packing and writing chunk 4',
 'Starts at POS 4075290 and ends at POS 4790574',
 'Worker 3: Packing and writing chunk 7',
 'Starts at POS 6693128 and ends at POS 7
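
The reported number of files can be checked by hand. Assuming the third argument of `bcfchunkpara.sh` (1000) is the number of variants per chunk, which the script output suggests but this notebook does not state, 52697 markers split into 53 chunks. The helper `chunk_ranges` is a hypothetical illustration, not the script's actual logic.

```python
import math


def chunk_ranges(n_records, chunk_size):
    """Return half-open (start, stop) record index ranges, one per chunk."""
    n_chunks = math.ceil(n_records / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_records))
        for i in range(n_chunks)
    ]


ranges = chunk_ranges(52697, 1000)
print(len(ranges))   # 53
print(ranges[-1])    # (52000, 52697)
```

Under this assumption the last chunk is shorter than the others, holding the remaining 697 markers.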

In [1]:
print('Pool the chunks')
%sx source ~/1000Genomes/venv3.6/bin/activate
%sx python3 -u parallel_pooling_20200812.py /home/camille/PoolImpHuman/data/20200812/IMP.chr20.snps.gt.vcf.gz /home/camille/PoolImpHuman/data/20200812/IMP.chr20.pooled.snps.gt.vcf.gz 4

Pool the chunks


['SNIC PROJ: /home/camille/PoolImpHuman/data',
 '',
 '*******************************************************************************',
 'Number of cpu to be used = 4',
 'Input file = /home/camille/PoolImpHuman/data/20200812/IMP.chr20.snps.gt.vcf.gz',
 'Output file = /home/camille/PoolImpHuman/data/20200812/IMP.chr20.pooled.snps.gt.vcf.gz',
 '*******************************************************************************',
 '',
 '',
 '53 files found will be pooled.................................................',
 "['/home/camille/PoolImpHuman/data/tmp/pack39.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack27.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack13.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack44.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack32.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack49.IMP.chr20.snps.gt.vcf.gz', '/home/camille/PoolImpHuman/data/tmp/pack24.IMP.chr

```text
'Time for pooling 1000 variants = 7.668583088001469 sec',
 '',
 'Time elapsed -->  277.8271336200014',
 '/home/camille/PoolImpHuman/data/tmp/IMP.chr20.snps.gt.vcf:',
 ' File created? -> True',
 'Writing to /tmp/bcftools-sort.LJmPJQ',
 'Merging 1 temporary files',
 'Cleaning',
 'Done',
 '/home/camille/PoolImpHuman/data/tmp/IMP.chr20.snps.gt.vcf:',
 ' File sorted? -> True',
 '/home/camille/PoolImpHuman/data/tmp/IMP.chr20.pooled.snps.gt.vcf.gz:',
 ' File created? -> True',
 '/home/camille/PoolImpHuman/data/tmp/IMP.chr20.pooled.snps.gt.vcf.gz:',
 ' File indexed? -> True',
 '',
 'Time elapsed -->  282.233043911001'
```
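
The timings in this log can be turned into a rough throughput figure. The sketch below assumes that the `Time elapsed` values are in seconds, that the first one covers the pooling of all 52697 markers, and that the second also includes sorting, compressing and indexing. The log order suggests this reading, but the script does not state it. The helper `elapsed_seconds` is made up for illustration.

```python
import re

ELAPSED_RE = re.compile(r"Time elapsed -->\s*([0-9.]+)")


def elapsed_seconds(log_lines):
    """Extract every 'Time elapsed -->' value from a list of log lines."""
    values = []
    for line in log_lines:
        match = ELAPSED_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


# Lines copied from the output above
log = [
    "Time for pooling 1000 variants = 7.668583088001469 sec",
    "Time elapsed -->  277.8271336200014",
    "Time elapsed -->  282.233043911001",
]
times = elapsed_seconds(log)
n_markers = 52697  # from the marker count check earlier in the notebook
print(f"{n_markers / times[0]:.1f} variants per second for pooling")
print(f"{times[1] - times[0]:.1f} s between the two timestamps")
```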