### Preliminary setup

This notebook uses R, so you will need an R interface for Jupyter such as *rpy2* installed for it to work properly. If you need to install this interface:

In [15]:
%%bash
pip3 install rpy2



Next, load the extension so that this Jupyter notebook can use the R interface. The code cells that follow start with a `%%` cell magic (`%%bash` or `%%R`) specifying whether the cell should be run with bash or R.

In [8]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
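
If you are not sure whether rpy2 is already present, a quick check from a plain Python cell avoids reinstalling it. This is a minimal sketch using only the standard library; the helper name `is_installed` is my own and is not part of rpy2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the rpy2 package is visible to this kernel.
print("rpy2 available:", is_installed("rpy2"))
```

If this reports that rpy2 is not available, run the `pip3 install rpy2` cell above and restart the kernel before loading the extension.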


### VEP (Variant Effect Predictor) - Cache Installation

The first step to using VEP is to install a cache. **Most users will not have to bother with this**: if you are fine with the cache and VEP versions already downloaded to storage1/compute1, you can skip this section. Incidentally, the cache already on storage1 is at `/storage1/fs1/bga/Active/gmsroot/gc2560/core/cwl/inputs/VEP_cache`.

If, however, you need a species, assembly, or VEP version not provided by the support infrastructure, you will need to install a cache yourself.
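
Before installing anything, it can be worth checking whether the cache you need is already there. The sketch below assumes the cache is laid out as `<cache root>/<species>/<version>_<assembly>`, which is what the `homo_sapiens/101_GRCh38` directory on storage1 suggests. The function names are hypothetical helpers, not part of VEP.

```python
from pathlib import Path


def cache_dir_for(cache_root, species, version, assembly):
    """Build the expected cache path, assuming <root>/<species>/<version>_<assembly>."""
    return Path(cache_root) / species / f"{version}_{assembly}"


def cache_is_installed(cache_root, species, version, assembly):
    """Return True if the expected cache directory exists."""
    return cache_dir_for(cache_root, species, version, assembly).is_dir()


# Shared cache location mentioned above; this only exists on storage1.
shared_cache = "/storage1/fs1/bga/Active/gmsroot/gc2560/core/cwl/inputs/VEP_cache"
for species, assembly in [("homo_sapiens", "GRCh38"),
                          ("canis_lupus_familiaris", "CanFam3.1")]:
    found = cache_is_installed(shared_cache, species, 101, assembly)
    print(species, assembly, "installed" if found else "not found")
```

Run this somewhere that can see storage1 (for example inside an interactive job on compute1); on a laptop every entry will simply report "not found".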

As a bit of a tangent, you can run VEP in "online" mode, meaning that VEP queries the Ensembl servers instead of a local cache. I don't recommend this, as it will be extremely slow.

To start, I SSH into compute1 (my SSH keys for compute1 are named after marine mammals, so yours will be different).

From there I start an interactive VEP session and run the `INSTALL.pl` Perl script for VEP, located under `/opt/vep/src`.

Now let's talk about some of the parameters. The big one is `-c`, which tells the installer where to download the cache. I use `-a` to specify what I want installed, which for pretty much everyone will be the FASTA file and the cache itself, denoted by `f` and `c` respectively. `-s` tells the installer I want this for the `canis_lupus_familiaris` species. `-l` and `--NO_BIOPERL` skip additional packages that are not needed (and are a pain to get installed). Finally, `--ASSEMBLY` says I want the CanFam3.1 reference assembly.

All of this typically takes a couple of hours, so be prepared to wait (which is also why I have the install command commented out).

As a side note, if you run the installer without the `-a` parameter, it will walk you through what you want to install interactively. This might be more convenient for some users.



In [19]:
%%bash
ssh beluga
LSF_DOCKER_PRESERVE_ENVIRONMENT=false bsub -M 8000000 -R 'select[mem>8000] span[hosts=1] rusage[mem=8000]' -n 8 -G compute-obigriffith -Is -a 'docker(ensemblorg/ensembl-vep:release_101.0)' /bin/bash
perl /opt/vep/src/ensembl-vep/INSTALL.pl --help
#perl /opt/vep/src/ensembl-vep/INSTALL.pl -c /storage1/fs1/obigriffith/Active/common/vepCache/ -a fc -s canis_lupus_familiaris -l --NO_BIOPERL --ASSEMBLY CanFam3.1
exit

Job <456732> is submitted to default queue <general-interactive>.
release_101.0: Pulling from ensemblorg/ensembl-vep

Pseudo-terminal will not be allocated because stdin is not a terminal.
You are connecting to RIS Compute services.
Membership in a compute-* AD group is required.

Users are responsible for acting in accordance with
policies applicable to Washington University St. Louis.

https://confluence.ris.wustl.edu/display/RSUM/RIS+Compute+%3A+User+Agreement
<<Waiting for dispatch ...>>
<<Starting on compute1-exec-162.ris.wustl.edu>>


### Uploading Preliminary Data for VEP

Now on to the fun stuff: actually annotating our data with useful information. For this we'll use a copy of the data from last week, available in this repo, and push it up to our home directory on compute1. We can use `scp` for this; as a reminder, your SSH keys are likely different from mine.

In [1]:
%%bash
cd ~/git/bfx-workshop/week_10
scp mutect.filtered.decomposed.readcount_snvs_indel.vcf.gz beluga:~



Next we'll actually run VEP on our VCF file

In [7]:
%%bash
ssh beluga

LSF_DOCKER_VOLUMES=/storage1/fs1/bga/Active:/storage1/fs1/bga/Active \
LSF_DOCKER_PRESERVE_ENVIRONMENT=false \
bsub -oo vep.log -q general -M 8000000 \
    -R 'select[mem>8000] span[hosts=1] rusage[mem=8000]' \
    -n 8 -G compute-oncology \
    -a 'docker(ensemblorg/ensembl-vep:release_101.0)' \
    /opt/vep/src/ensembl-vep/vep \
    --cache --offline \
    --dir_cache=/storage1/fs1/bga/Active/gmsroot/gc2560/core/cwl/inputs/VEP_cache/ \
    --input_file=mutect.filtered.decomposed.readcount_snvs_indel.vcf.gz \
    --output_file=mutect.filtered.decomposed.readcount_snvs_indel.flag_pick.vcf \
    --fasta=/storage1/fs1/bga/Active/gmsroot/gc2560/core/cwl/inputs/VEP_cache/homo_sapiens/101_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
    --assembly=GRCh38 \
    --everything --vcf --terms SO --pick \
    --force_overwrite --no_check_variants_order --transcript_version

Job <458380> is submitted to queue <general>.




While we're waiting for that job to complete, let's talk about some of the options we specified that aren't so obvious. First, as mentioned previously, we need to supply the path to an appropriate VEP cache from which annotations can be pulled (`--dir_cache`).

The `--everything` flag tells VEP to annotate a fairly comprehensive set of items, including the transcript support level in Ensembl, HGVS notations, gene symbols, etc.

The `--vcf` flag tells VEP to output in VCF format.

`--force_overwrite` will overwrite VEP results if they're already present in the directory (this is especially useful if you decide you want more annotations and don't want to remove files to re-run VEP).

`--terms SO` tells VEP to use Sequence Ontology terms to describe variant consequences, e.g. `stop_gained`, `frameshift_variant`, etc.

`--no_check_variants_order` means that VEP will be able to run on unsorted input files (though this comes at a significant computational cost).

Finally, `--pick` tells VEP to output only one annotation per variant. For example, a variant could be both a missense variant in one gene and upstream of another gene; with this option VEP reports a single annotation chosen by a ranked set of criteria that includes consequence severity, so here you would typically get the missense annotation.

This just scratches the surface of options available for VEP, but is typically what I run as a "default" and is probably a good place to start. For a more complete list of params there is extensive documentation from ensembl available here: https://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html#basic

I would like to note that many of VEP's defaults are designed around *H. sapiens*. It is quite simple to run on a different species, but you will need that species in your cache and you will need to specify it with the `--species` parameter.
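
To keep track of which option does what, it can help to assemble the command from a commented list rather than one long line. The following is a rough sketch of that idea using only the flags from the command above plus `--species`. The function `build_vep_command` and its parameter names are my own invention, and the `bsub` wrapper is deliberately left out.

```python
import shlex


def build_vep_command(input_vcf, output_vcf, cache_dir, fasta,
                      assembly="GRCh38", species=None):
    """Return the VEP invocation as a list of arguments."""
    args = [
        "/opt/vep/src/ensembl-vep/vep",
        "--cache", "--offline",        # use the local cache only
        f"--dir_cache={cache_dir}",    # where the cache lives
        f"--input_file={input_vcf}",
        f"--output_file={output_vcf}",
        f"--fasta={fasta}",            # reference FASTA
        f"--assembly={assembly}",
        "--everything",                # broad set of annotations
        "--vcf",                       # write VCF output
        "--terms", "SO",               # Sequence Ontology consequence terms
        "--pick",                      # one annotation per variant
        "--force_overwrite",           # replace existing output
        "--no_check_variants_order",   # tolerate unsorted input
        "--transcript_version",
    ]
    if species is not None:
        # Only needed when annotating something other than human.
        args += ["--species", species]
    return args


cache = "/storage1/fs1/bga/Active/gmsroot/gc2560/core/cwl/inputs/VEP_cache/"
cmd = build_vep_command(
    "mutect.filtered.decomposed.readcount_snvs_indel.vcf.gz",
    "mutect.filtered.decomposed.readcount_snvs_indel.flag_pick.vcf",
    cache,
    cache + "homo_sapiens/101_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
)
print(shlex.join(cmd))
```

The printed string can be pasted after the `bsub ... -a 'docker(...)'` portion of the submission. For a non-human run you would pass `species=` and `assembly=` along with a cache and FASTA for that species.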


### Downloading and viewing VEP results

VEP outputs a number of files of potential interest, including an HTML report. The file you probably care about most, though, is the VCF file we had VEP write in the command above. Let's go ahead and pull it down and take a look; here I use `grep` with `-B` and `-A` to print the lines surrounding the `#CHROM` header line. Importantly, for each variant our annotations are stored in the `CSQ` entry of the "INFO" column, and the `##INFO=<ID=CSQ...>` meta line in the header tells us what each `|`-delimited field represents. Unfortunately this is quite hard to read by eye, so we will use R to extract some of this information and have a short plotting lesson in this tutorial.

In [12]:
%%bash
cd ~/git/bfx-workshop/week_10

scp beluga:~/mutect.filtered.decomposed.readcount_snvs_indel.flag_pick.vcf ./
    
grep -A 4 -B 2 "#CHROM" mutect.filtered.decomposed.readcount_snvs_indel.flag_pick.vcf

##VEP="v101" time="2020-11-20 21:23:16" cache="/storage1/fs1/bga/Active/gmsroot/gc2560/core/cwl/inputs/VEP_cache/homo_sapiens/101_GRCh38" ensembl-variation=101.50e7372 ensembl-io=101.943b6c2 ensembl-funcgen=101.b918a49 ensembl=101.856c8e8 1000genomes="phase3" COSMIC="90" ClinVar="202003" ESP="V2-SSA137" HGMD-PUBLIC="20194" assembly="GRCh38.p13" dbSNP="153" gencode="GENCODE 35" genebuild="2014-07" gnomAD="r2.1" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_

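
To make the relationship between the `CSQ` header line and the `|`-delimited annotations concrete, here is a small Python sketch of the parsing idea (the tutorial itself does this in R). The header and INFO strings below are shortened toy examples I made up, with only four fields and a placeholder gene name; the function names are hypothetical.

```python
import re


def csq_fields(header_lines):
    """Return the ordered CSQ field names declared in the VCF header."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                return match.group(1).split("|")
    raise ValueError("no CSQ header line found")


def parse_csq(info, fields):
    """Turn the CSQ entry of one INFO string into a list of dicts."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            # Multiple annotations for one variant are comma separated.
            return [dict(zip(fields, ann.split("|")))
                    for ann in entry[len("CSQ="):].split(",")]
    return []


# Toy data: a shortened header line and a made-up INFO value.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: '
          'Allele|Consequence|IMPACT|SYMBOL">')
info = "DP=50;CSQ=T|missense_variant|MODERATE|GENE1"

fields = csq_fields([header])
for annotation in parse_csq(info, fields):
    print(annotation["SYMBOL"], annotation["Consequence"], annotation["IMPACT"])
```

Because we ran VEP with `--pick`, each variant should carry a single annotation, so the list returned by `parse_csq` would normally have one element.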
