[MosaicForecast](https://github.com/parklab/MosaicForecast) is a somatic variant caller recently developed by Yanmei Dou in the Park Lab at Harvard. I extended MosaicForecast with a converter, `mf2vcf.py`, which converts the TSV-formatted output of MosaicForecast to VCF. Here I demonstrate the use of [mf2vcf.py](https://github.com/attilagk/MosaicForecast/blob/master/mf2vcf.py).

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import attila_utils
import matplotlib
matplotlib.rcParams['figure.dpi'] = 150
from IPython.display import set_matplotlib_formats

## Input files

In [19]:
%%bash
cd ~/projects/bsm/results/2019-08-20-mosaicforecast-vcf/
rm *.vcf.gz*
wc -l *

   2122247 Mix1A.MF.predictions
       200 test0.MF.predictions
   2122447 total


`Mix1A.MF.predictions` contains more than two million calls for the whole genome. A lightweight sample file, `test0.MF.predictions`, with only 200 lines (a header plus 199 calls), was extracted from `Mix1A.MF.predictions`. Its content has the familiar structure of a TSV file. (Only the first 6 columns and first 10 lines are shown.)

In [20]:
%%bash
cd ~/projects/bsm/results/2019-08-20-mosaicforecast-vcf/
head -n10 test0.MF.predictions | cut -f1-6

id	conflict_num	mappability	type	length	GCcontent
Mix1A.BQSR~1~13302~C~T	0	0.0416667	SNP	0	0.666666666666667
Mix1A.BQSR~1~14933~G~A	0	0.916667	SNP	0	0.523809523809524
Mix1A.BQSR~1~14948~G~A	0	0.458333	SNP	0	0.904761904761905
Mix1A.BQSR~1~16103~T~G	0	0.666667	SNP	0	0.666666666666667
Mix1A.BQSR~1~16257~G~C	0	0.333333	SNP	0	0.619047619047619
Mix1A.BQSR~1~16288~C~G	0	0.666667	SNP	0	0.571428571428571
Mix1A.BQSR~1~19776~A~G	0	0.916667	SNP	0	0.666666666666667
Mix1A.BQSR~1~20129~C~T	0	0.666667	SNP	0	0.666666666666667
Mix1A.BQSR~1~20136~T~C	0	0.583333	SNP	0	0.80952380952381
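
Judging from the listing above, the `id` column appears to pack five `~`-separated fields: sample name, chromosome, position, reference allele, and alternative allele. The following is a minimal sketch of unpacking it, assuming that layout holds for every row. The helper `split_id` and the key names are illustrative and are not taken from `mf2vcf.py`; the sketch also uses only the standard library, whereas `mf2vcf.py` itself relies on pandas.

```python
import csv
import io

# Two rows copied from the listing above (first six columns only).
TSV = (
    "id\tconflict_num\tmappability\ttype\tlength\tGCcontent\n"
    "Mix1A.BQSR~1~13302~C~T\t0\t0.0416667\tSNP\t0\t0.666666666666667\n"
    "Mix1A.BQSR~1~14933~G~A\t0\t0.916667\tSNP\t0\t0.523809523809524\n"
)


def split_id(call_id):
    """Split 'sample~chrom~pos~ref~alt' into named parts (assumed layout)."""
    # Split from the right so that a '~' inside the sample name is tolerated.
    sample, chrom, pos, ref, alt = call_id.rsplit("~", 4)
    return {"sample": sample, "CHROM": chrom, "POS": int(pos), "REF": ref, "ALT": alt}


reader = csv.DictReader(io.StringIO(TSV), delimiter="\t")
records = [split_id(row["id"]) for row in reader]
for rec in records:
    print(rec)
```

The remaining TSV columns (`mappability`, `GCcontent`, and so on) are natural candidates for per-variant annotations in the VCF.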


## The `mf2vcf.py` script

Calling `mf2vcf.py` without arguments prints the help message:

In [1]:
%%bash
mf2vcf.py


    Convert MosaicForecast output MF.predictions into a sorted, gzipped, indexed VCF

    Usage:

    mf2vcf.py <input.MF.predictions refseq.fa output.vcf.gz
    or
    cat input.MF.predictions | mf2vcf.py refseq.fa output.vcf.gz


    Details:

    Currently gzip compressed VCF is the only supported output type.

    For each INFO, FILTER, and FORMAT field the appropriate ID,
    Number, Type, and 'Description' must be specified by editing the __info__,
    __filter__, and __format__ dictionaries in mf2vcf.py.

    The pandas Python package must be installed.
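
To illustrate what the help message means by ID, Number, Type, and Description, here is a rough sketch of how one such field definition could be rendered as a VCF meta-information line. The dictionary layout, the function `header_line`, and the description strings are all hypothetical; the actual `__info__`, `__filter__`, and `__format__` dictionaries in `mf2vcf.py` may be organized differently. In VCF, INFO and FORMAT lines carry all four attributes, while FILTER lines carry only ID and Description.

```python
# Hypothetical field definitions keyed by ID; the descriptions are placeholders.
info_fields = {
    "mappability": {"Number": "1", "Type": "Float", "Description": "placeholder description"},
    "GCcontent": {"Number": "1", "Type": "Float", "Description": "placeholder description"},
}


def header_line(kind, field_id, spec):
    """Render one VCF meta-information line, e.g. ##INFO=<ID=...,...>."""
    attrs = ["ID=" + field_id]
    # Number and Type apply to INFO and FORMAT lines but not to FILTER lines.
    for key in ("Number", "Type"):
        if key in spec:
            attrs.append("{}={}".format(key, spec[key]))
    attrs.append('Description="{}"'.format(spec["Description"]))
    return "##{}=<{}>".format(kind, ",".join(attrs))


for field_id, spec in info_fields.items():
    print(header_line("INFO", field_id, spec))
```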
    


`mf2vcf.py` was designed to read the TSV from STDIN because storing the TSV in a file may be unnecessary once a sorted, gzipped, and indexed VCF file is available. The TSV is both unsorted and uncompressed, so it can be very large.
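
The streaming idea can be sketched as a generator that consumes any file-like object row by row, so the TSV never has to be written to disk. This is a simplified illustration and not the code of `mf2vcf.py`: the function name `tsv_to_vcf_records` is made up, the `id` layout is assumed from the sample rows above, and the header, sorting, compression, and indexing steps are omitted.

```python
import csv
import io


def tsv_to_vcf_records(stream):
    """Yield unsorted VCF data lines from a MosaicForecast-style TSV stream."""
    for row in csv.DictReader(stream, delimiter="\t"):
        # Assumed id layout: sample~chrom~pos~ref~alt
        _sample, chrom, pos, ref, alt = row["id"].rsplit("~", 4)
        # Columns: CHROM POS ID REF ALT QUAL FILTER INFO ('.' means missing)
        yield "\t".join([chrom, pos, ".", ref, alt, ".", ".", "."])


# In a pipeline the stream would be sys.stdin; an in-memory buffer stands in here.
DEMO = (
    "id\tconflict_num\tmappability\n"
    "Mix1A.BQSR~1~13302~C~T\t0\t0.0416667\n"
)
for line in tsv_to_vcf_records(io.StringIO(DEMO)):
    print(line)
```

Because the generator holds only one row at a time, memory use stays flat regardless of how many calls are piped in.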

## Testing the converter

In [24]:
%%bash
cd ~/projects/bsm/results/2019-08-20-mosaicforecast-vcf/
mf2vcf.py <test0.MF.predictions $REFSEQ test0.vcf.gz

Writing to /tmp/bcftools-sort.IzD9c7
Merging 1 temporary files
Cleaning
Done


The output VCF is shown below. For brevity, only the header and the first and last ten records are printed.

In [27]:
%%bash
cd ~/projects/bsm/results/2019-08-20-mosaicforecast-vcf/
bcftools view -h test0.vcf.gz
bcftools view -H test0.vcf.gz | head -n10
for i in {1..10}; do
    echo '...'
done
bcftools view -H test0.vcf.gz | tail -n10

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##source=MosaicForecast
##reference=file:///big/data/refgenome/GRCh37/hs37d5/hs37d5.fa
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=

Note the following points:

