Taejeong processed the MSSM_118_NeuN_pl sample with the master branch of the bsmn-pipeline on an AWS EC2 parallel cluster of the Abyzov lab.  Here I process the same sample with the installfix branch on a cluster of the Chess lab.

In [2]:
%load_ext autoreload
%autoreload 2
import synapseclient
import synapseutils
import pandas as pd
import os
import os.path

The [WGS.SCZ.Chess](https://www.synapse.org/#!Synapse:syn21897893) (syn21897893) Synapse folder contains GATK HaplotypeCaller (HC) callsets that Taejeong produced by passing the `--run-gatk-hc 50` option to `run_genome_mapping.sh`.  The smallest callset is `MSSM_118_brain.ploidy_50.filtered.vcf` (syn21898502), which corresponds to the MSSM_118_NeuN_pl sample.

All fastq files for this sample were previously uploaded to Synapse folder MSSM_118_NeuN_pl [syn21966777](https://www.synapse.org/#!Synapse:syn21966777).  Now let's create a `sample_list.txt` file for these fastq files and upload it to the bsmn parallel cluster of the Chess lab!

In [3]:
syn = synapseclient.login()

Welcome, Attila Jones!



In [4]:
flist = list(synapseutils.walk(syn, 'syn21966777'))[0][2]
flist = pd.DataFrame(flist, columns = ['file_name', 'location'], dtype='str')
flist['sample_id'] = 'MSSM_118_NeuN_pl'
flist = pd.concat((flist.iloc[:, -1], flist.iloc[:, :-1]), axis=1)
flist.head()

|   | sample_id | file_name | location |
|---|---|---|---|
| 0 | MSSM_118_NeuN_pl | MSSM118_NeuN_pl_USPD16080281-D712_H5LNVALXX_L5... | syn21966882 |
| 1 | MSSM_118_NeuN_pl | MSSM118_NeuN_pl_USPD16080281-D712_H5LNVALXX_L5... | syn21966905 |
| 2 | MSSM_118_NeuN_pl | MSSM118_NeuN_pl_USPD16080281-D712_H5LNVALXX_L6... | syn21966883 |
| 3 | MSSM_118_NeuN_pl | MSSM118_NeuN_pl_USPD16080281-D712_H5LNVALXX_L6... | syn21966929 |
| 4 | MSSM_118_NeuN_pl | MSSM118_NeuN_pl_USPD16080281-D712_H5LNVALXX_L7... | syn21966895 |


Write the table to a sample list file (tab-separated, no header, no index column)

In [5]:
flistpath = '/home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/sample_list.txt'
flist.to_csv(flistpath, sep='\t', header=False, index=False)
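
Before uploading, the file can be read back to confirm that it has three tab-separated columns and no header. The sketch below does this on a made-up two-row table written to a temporary file. The helper name `check_sample_list`, the placeholder file names, and the placeholder Synapse IDs are all hypothetical, and the three-column layout is inferred from the table above rather than from the pipeline's documentation.

```python
import os
import tempfile

import pandas as pd


def check_sample_list(path):
    """Read a headerless, tab-separated sample list and validate its shape."""
    cols = ['sample_id', 'file_name', 'location']
    df = pd.read_csv(path, sep='\t', header=None, names=cols, dtype=str)
    # Every row needs all three fields.
    if df.isna().any().any():
        raise ValueError('sample list has missing fields')
    # Assumption: the location column holds Synapse IDs of the form "syn<digits>".
    bad = df.loc[~df['location'].str.match(r'syn\d+$'), 'location']
    if not bad.empty:
        raise ValueError(f'unexpected location values: {bad.tolist()}')
    return df


# Toy stand-in for `flist`; file names and IDs are placeholders, not real entities.
toy = pd.DataFrame({
    'sample_id': ['SAMPLE_A', 'SAMPLE_A'],
    'file_name': ['SAMPLE_A_L1_1.fq.gz', 'SAMPLE_A_L1_2.fq.gz'],
    'location': ['syn0000001', 'syn0000002'],
})

with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, 'sample_list.txt')
    # Same call as in the cell above: tab-separated, no header, no index.
    toy.to_csv(toy_path, sep='\t', header=False, index=False)
    checked = check_sample_list(toy_path)

print(checked.shape)
```

Calling `check_sample_list(flistpath)` on the real file would apply the same checks to the actual sample list.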

Upload that file to the bsmn parallel cluster

In [6]:
%%bash
if false; then
key=~/AWS-accounts/chesslab/id_rsa-aws-user-024812372148 # my private SSH key
dns=ec2-3-12-140-74.us-east-2.compute.amazonaws.com # the master instance of the cluster
fpath=/home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/sample_list.txt
scp -i $key $fpath ec2-user@$dns:/efs/tests/MSSM_118/
fi

Process sample with the bsmn-pipeline using the following command:
```
[ec2-user@ip-172-31-52-76 MSSM_118]$ /shared/bsmn-pipeline/run_genome_mapping.sh --sample-list sample_list.txt --run-gatk-hc 50 --upload syn21966777 > STDOUT-1 2> STDERR-1
```

## Consistency check

Download the callset for MSSM_118_NeuN_pl that I (Attila) produced on the Chess lab's bsmn cluster

In [7]:
%%bash
host=ec2-user@ec2-3-12-140-74.us-east-2.compute.amazonaws.com
key=~/AWS-accounts/chesslab/id_rsa-aws-user-024812372148
srcdir=/efs/tests/MSSM_118/MSSM_118_NeuN_pl/gatk-hc
dstdir=~/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc
if false; then
mkdir -p $dstdir
scp -r -i $key $host:$srcdir $dstdir/attila
fi

Now download the corresponding callset produced by Taejeong

In [8]:
resdir = '/home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc'
dstdir = resdir + os.sep + 'taejeong'
# Create the target directory (and any missing parents); no error if it already exists
os.makedirs(dstdir, exist_ok=True)
e = syn.get('syn20815202', downloadLocation=dstdir)
e1 = syn.get('syn20815203', downloadLocation=dstdir)

Let's look at the files downloaded

In [9]:
%%bash
ls -R /home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc

/home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc:
isec
report
taejeong

/home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc/isec:
0000.vcf
0001.vcf
README.txt
report

/home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc/taejeong:
MSSM_118_NeuN_pl.ploidy_50.vcf.gz
MSSM_118_NeuN_pl.ploidy_50.vcf.gz.tbi


### Results

In [13]:
%%bash
cd /home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc
if test ! -d isec; then
taejeong=taejeong/MSSM_118_NeuN_pl.ploidy_50.vcf.gz
attila=attila/MSSM_118_NeuN_pl.ploidy_50.vcf.gz
bcftools isec $taejeong $attila -p isec
fi
cd isec
if test ! -e report; then
for vcf in 000[012].vcf; do
bcftools view -H $vcf | wc -l | tr '\n' ' '
sed -n "/$vcf/ { s/^.*\(\(private to\|shared by\).*$\)/\1/; p }" README.txt
done > report
fi
rm 000[23].vcf 2> /dev/null # remove the large files
cd ..; rm -rf attila taejeong 2> /dev/null
cat isec/report

3 private to	taejeong/MSSM_118_NeuN_pl.ploidy_50.vcf.gz
1 private to	attila/MSSM_118_NeuN_pl.ploidy_50.vcf.gz
5085446 shared by both	taejeong/MSSM_118_NeuN_pl.ploidy_50.vcf.gz attila/MSSM_118_NeuN_pl.ploidy_50.vcf.gz


So out of more than 5 million records, only 3 are private to the taejeong callset and only 1 is private to the attila callset.
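
Conceptually, the three counts in the report correspond to a partition of the two callsets into set differences and an intersection. The following is a minimal sketch of that idea on made-up variant keys. It assumes that two records match when chromosome, position, REF and ALT are all identical. The matching that `bcftools isec` actually performs depends on its options, so this illustrates the concept and does not reimplement the tool. The function name `isec_partition` is hypothetical.

```python
def isec_partition(records_a, records_b):
    """Split two collections of variant keys into private-to-A, private-to-B, shared."""
    set_a, set_b = set(records_a), set(records_b)
    return set_a - set_b, set_b - set_a, set_a & set_b


# Toy variant keys (CHROM, POS, REF, ALT); made up for illustration only.
callset_a = [('1', 100, 'A', 'G'), ('1', 200, 'C', 'T'), ('16', 300, 'T', 'G')]
callset_b = [('1', 100, 'A', 'G'), ('1', 200, 'C', 'T'), ('16', 400, 'G', 'A')]

only_a, only_b, shared = isec_partition(callset_a, callset_b)
print(len(only_a), 'private to A')
print(len(only_b), 'private to B')
print(len(shared), 'shared by both')
```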

Let's look at these differing records!

In [11]:
%%bash
cd /home/attila/projects/bsm/results/2020-04-21-bsm-pipeline-mssm-118/gatk-hc/isec/
echo -e private to taejeong/...'\n'
bcftools view -H 0000.vcf
echo -e '\n\n'private to attila/...'\n'
bcftools view -H 0001.vcf

private to taejeong/...

16	46435547	.	T	G	1317.38	VQSRTrancheSNP99.90to100.00	AC=6;AF=0.12;AN=50;BaseQRankSum=0.86;DP=1319;FS=22.44;MLEAC=6;MLEAF=0.12;MQ=45.82;MQRankSum=-8.2;QD=1.03;ReadPosRankSum=7.31;SOR=3.89;VQSLOD=-1611;culprit=DP	GT:AD:DP:GQ:PL	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1:1049,226:1275:3:1300,466,241,120,51,14,0,3,20,49,88,135,191,255,325,402,486,577,674,777,887,1004,1127,1258,1397,1543,1697,1861,2033,2216,2410,2616,2835,3068,3318,3585,3873,4184,4522,4891,5298,5749,6255,6831,7498,8289,9258,10508,12266,15251,36814
16	80908579	.	C	A	1647.45	PASS	AC=50;AF=1;AN=50;DP=46;FS=0;MLEAC=44;MLEAF=0.88;MQ=60;QD=28.02;SOR=0.874;VQSLOD=17.57;culprit=MQ	GT:AD:DP:GQ:PL	1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1:0,46:46:4:1393,747,621,545,491,449,414,385,359,336,316,297,280,265,250,237,224,212,201,191,181,171,162,153,145,137,129,122,115,108,101,94,88,82,76,71,65,60,54,

Interestingly, all of the differing records lie on chromosome 16.
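
The long VCF lines above are easier to compare once the fixed columns and the INFO field are split into named pieces. Below is a minimal sketch of such a parser, assuming the standard eight tab-separated fixed columns of a VCF data line. The function name `parse_vcf_site` is hypothetical, and the example line is the first record above with its INFO field abbreviated and the FORMAT and sample columns dropped.

```python
def parse_vcf_site(line):
    """Parse the eight fixed VCF columns of one data line into a dict."""
    fields = line.rstrip('\n').split('\t')
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        # Flag-type INFO entries have no '=' and are stored as True.
        info_dict[key] = value if sep else True
    return {
        'CHROM': chrom,
        'POS': int(pos),
        'ID': vid,
        'REF': ref,
        'ALT': alt,
        'QUAL': None if qual == '.' else float(qual),
        'FILTER': flt,
        'INFO': info_dict,
    }


# First record private to taejeong/..., INFO abbreviated for readability.
line = '\t'.join([
    '16', '46435547', '.', 'T', 'G', '1317.38',
    'VQSRTrancheSNP99.90to100.00',
    'AC=6;AF=0.12;AN=50;DP=1319;VQSLOD=-1611;culprit=DP',
])

site = parse_vcf_site(line)
print(site['CHROM'], site['POS'], site['FILTER'], site['INFO']['VQSLOD'])
```

Parsed this way, the first record shown above carries the filter value `VQSRTrancheSNP99.90to100.00` with `culprit=DP`, whereas the second one is marked `PASS`.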

## Conclusion
There's a negligible difference between the callset produced by Taejeong and the one produced by me (Attila).  It's not clear what exactly led to that minor difference.  It may be a different state of a random number generator used by GATK's HaplotypeCaller, or different versions of the installed software tools.
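
To put a number on "negligible", the counts from `isec/report` can be turned into a discordance fraction. This is a small arithmetic sketch and the variable names are mine. The denominator is assumed to be the number of records present in at least one of the two callsets.

```python
# Counts taken from isec/report
private_taejeong = 3
private_attila = 1
shared = 5085446

discordant = private_taejeong + private_attila
total = discordant + shared  # records present in at least one callset
fraction = discordant / total

print(f'{discordant} of {total} records differ ({fraction:.2e})')
```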

