# AnnotatedvcfParser.py

Script to parse annotated VCF files (VEP, snpEff, ANNOVAR) into a CSV table.

--------------------------------------------------------------------------------------------

In [1]:
import AnnotatedvcfParser as vcf
import pandas as pd
import numpy as np

--------------------------------------------------------------------------------------------

## AnnotatedvcfParser - readVCF()
*datatype : dictionary*

readVCF() takes a VCF file as input and returns a dictionary of lists into which the VCF is parsed.
The INFO field is further parsed by internal functions of AnnotatedvcfParser.py.
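
To make the "dictionary of lists" layout concrete, here is a minimal sketch of how such a structure could be built. It is not the code of readVCF(): the function name `read_vcf_sketch`, the `"True"` value for flag entries and the `"."` placeholder for missing keys are assumptions made for illustration.

```python
import io


def read_vcf_sketch(handle):
    """Parse a VCF stream into a dict of lists (illustrative only)."""
    columns, rows = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank and meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # the #CHROM header line
            continue
        rows.append(line.split("\t"))

    # One list per fixed VCF column, one entry per variant.
    table = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

    # Split INFO into key/value pairs, remembering every key seen.
    parsed_info, info_keys = [], []
    for info_field in table["INFO"]:
        entry = {}
        for item in info_field.split(";"):
            if item in ("", "."):
                continue
            key, sep, value = item.partition("=")
            entry[key] = value if sep else "True"  # assumption: flags become "True"
            if key not in info_keys:
                info_keys.append(key)
        parsed_info.append(entry)

    # Pad with "." so every list keeps one entry per variant.
    for key in info_keys:
        table[key] = [entry.get(key, ".") for entry in parsed_info]
    return table


demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tT\t.\tPASS\tDP=35;STR\n"
    "chr2\t200\t.\tG\tC\t.\tPASS\tDP=12\n"
)
parsed = read_vcf_sketch(demo)
print(parsed["DP"])   # ['35', '12']
print(parsed["STR"])  # ['True', '.']
```

Because every list has the same length, the dictionary can later be passed directly to `pandas.DataFrame`.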

--------------------------------------------------------------------------------------------

We start by opening the VCF file to parse.

In [30]:
vcf_file_path = "/home/yabili/results/08_cosmic_filter/short_T_vs_N_filtered_cosmic_filter.vcf"
vcf_file = open(vcf_file_path, "r")

In [31]:
vcfFile = vcf.readVCF(vcf_file)
type(vcfFile)

dict

Let's print the dictionary keys together with the length of each list. N.B.: the INFO field has been parsed into separate keys.

In [32]:
for key, value in vcfFile.items() :
    print (key, len(value))

CHROM 35
POS 35
ID 35
REF 35
ALT 35
QUAL 35
FILTER 35
INFO 35
FORMAT 35
NORMAL 35
TUMOR 35
AS_FilterStatus 35
AS_SB_TABLE 35
AS_UNIQ_ALT_READ_COUNT 35
CONTQ 35
DP 35
ECNT 35
GERMQ 35
MBQ 35
MFRL 35
MMQ 35
MPOS 35
NALOD 35
NCount 35
NLOD 35
OCM 35
PON 35
POPAF 35
ROQ 35
RPA 35
RU 35
SEQQ 35
STR 35
STRANDQ 35
STRQ 35
TLOD 35
ANN 35
LOF 35
NMD 35
CSQ 35
ANNOVAR_DATE 35
Func.refGene 35
Gene.refGene 35
GeneDetail.refGene 35
ExonicFunc.refGene 35
AAChange.refGene 35
cytoBand 35
ExAC_ALL 35
ExAC_AFR 35
ExAC_AMR 35
ExAC_EAS 35
ExAC_FIN 35
ExAC_NFE 35
ExAC_OTH 35
ExAC_SAS 35
avsnp147 35
SIFT_score 35
SIFT_pred 35
Polyphen2_HDIV_score 35
Polyphen2_HDIV_pred 35
Polyphen2_HVAR_score 35
Polyphen2_HVAR_pred 35
LRT_score 35
LRT_pred 35
MutationTaster_score 35
MutationTaster_pred 35
MutationAssessor_score 35
MutationAssessor_pred 35
FATHMM_score 35
FATHMM_pred 35
PROVEAN_score 35
PROVEAN_pred 35
VEST3_score 35
CADD_raw 35
CADD_phred 35
DANN_score 35
fathmm-MKL_coding_score 35
fathmm-MKL_coding_pred 35

Let's print one list of values from the dictionary:

In [35]:
print(vcfFile["ICGC_Id"])

['.', '.', '.', 'MU70470', 'MU40745', 'MU109751', 'MU91776616', '.', '.', '.', 'MU1305116', '.', '.', '.', '.', '.', '.', '.', '.', 'MU1856183', '.', 'MU114312846', '.', '.', '.', '.', '.', '.', '.', '.', '.', 'MU131190896', 'MU591920', '.', '.']


--------------------------------------------------------------------------------------------

## AnnotatedvcfParser - transcipts2ListVepSnpEff()
*datatype : dictionary*

This function parses the CSQ and ANN fields, adding a new key (VepCSQSplit/snpeffANNSplit) to the dictionary,
in which the different transcripts are split into a list.
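
As a rough illustration of this step: in both ANN and CSQ the per-transcript annotations of one variant are separated by commas, so the split can be sketched as below. The helper name `transcripts_to_list_sketch`, its signature and the empty list for missing annotations are hypothetical and not taken from AnnotatedvcfParser.py.

```python
def transcripts_to_list_sketch(table, source_key, target_key):
    """Return a copy of `table` with `target_key` holding one list per variant."""
    result = dict(table)  # shallow copy; the original string field is kept
    result[target_key] = [
        [] if value in (".", "") else value.split(",")
        for value in table[source_key]
    ]
    return result


demo_table = {
    "ANN": [
        "T|missense_variant|MODERATE|GENE1,T|intron_variant|MODIFIER|GENE1",
        ".",
    ]
}
listed = transcripts_to_list_sketch(demo_table, "ANN", "snpeffANNSplit")
print(type(listed["ANN"][0]).__name__)    # str
print(len(listed["snpeffANNSplit"][0]))   # 2
```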

--------------------------------------------------------------------------------------------

**Example of ANN to snpeffANNSplit where different transcripts are parsed to a list**

In [36]:
vcfFileTranslisted = vcf.transcipts2ListVepSnpEff(vcfFile, "snpeffANNSplit")

We can check that the ANN field from snpEff has been transformed into a list of transcripts under snpeffANNSplit

In [41]:
#print(vcfFileTranslisted["ANN"][1])
print(type(vcfFileTranslisted["ANN"][1]))
print("The string is: " + str(len(vcfFileTranslisted["ANN"][1])) + " characters")

<class 'str'>
The string is: 1813 characters


In [42]:
#print(vcfFileTranslisted["ANN"][1])
print(type(vcfFileTranslisted["snpeffANNSplit"][1]))
print("The list contains: " + str(len(vcfFileTranslisted["snpeffANNSplit"][1])) + " transcripts")

<class 'list'>
The list contains: 14 transcripts


--------------------------------------------------------------------------------------------

**Example of CSQ to VepCSQSplit where different transcripts are parsed to a list**

In [10]:
vcfFileTranslisted = vcf.transcipts2ListVepSnpEff(vcfFile, "VepCSQSplit")

We can check that the CSQ field from VEP has been transformed into a list of transcripts under VepCSQSplit

In [11]:
print(type(vcfFileTranslisted["CSQ"][1]))
print("The string is: " + str(len(vcfFileTranslisted["CSQ"][1])) + " characters")

<class 'str'>
The string is: 1017 characters


In [12]:
print(type(vcfFileTranslisted["VepCSQSplit"][1]))
print("The list contains: " + str(len(vcfFileTranslisted["VepCSQSplit"][1])) + " transcripts")

<class 'list'>
The list contains: 4 transcripts


--------------------------------------------------------------------------------------------

**ANN and CSQ can both be parsed in a single step**

In [13]:
vcfFileTranslisted = vcf.transcipts2ListVepSnpEff(vcfFile, "snpeffANNSplit,VepCSQSplit")

In [14]:
for key, value in vcfFileTranslisted.items() :
    print (key, len(value))

CHROM 35
POS 35
ID 35
REF 35
ALT 35
QUAL 35
FILTER 35
INFO 35
FORMAT 35
NORMAL 35
TUMOR 35
AS_FilterStatus 35
AS_SB_TABLE 35
AS_UNIQ_ALT_READ_COUNT 35
CONTQ 35
DP 35
ECNT 35
GERMQ 35
MBQ 35
MFRL 35
MMQ 35
MPOS 35
NALOD 35
NCount 35
NLOD 35
OCM 35
PON 35
POPAF 35
ROQ 35
RPA 35
RU 35
SEQQ 35
STR 35
STRANDQ 35
STRQ 35
TLOD 35
ANN 35
LOF 35
NMD 35
CSQ 35
ANNOVAR_DATE 35
Func.refGene 35
Gene.refGene 35
GeneDetail.refGene 35
ExonicFunc.refGene 35
AAChange.refGene 35
cytoBand 35
ExAC_ALL 35
ExAC_AFR 35
ExAC_AMR 35
ExAC_EAS 35
ExAC_FIN 35
ExAC_NFE 35
ExAC_OTH 35
ExAC_SAS 35
avsnp147 35
SIFT_score 35
SIFT_pred 35
Polyphen2_HDIV_score 35
Polyphen2_HDIV_pred 35
Polyphen2_HVAR_score 35
Polyphen2_HVAR_pred 35
LRT_score 35
LRT_pred 35
MutationTaster_score 35
MutationTaster_pred 35
MutationAssessor_score 35
MutationAssessor_pred 35
FATHMM_score 35
FATHMM_pred 35
PROVEAN_score 35
PROVEAN_pred 35
VEST3_score 35
CADD_raw 35
CADD_phred 35
DANN_score 35
fathmm-MKL_coding_score 35
fathmm-MKL_coding_pred 35

--------------------------------------------------------------------------------------------

## AnnotatedvcfParser - splitransciptsVepSnpEff()
*datatype : pandas.DataFrame*

This function takes the output of transcipts2ListVepSnpEff() as input and places each transcript on its own row. It returns a pandas.DataFrame.
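
The one-row-per-transcript idea can be sketched with plain pandas. The example uses `DataFrame.explode` on a toy dictionary; whether splitransciptsVepSnpEff() works this way internally is an assumption, and the data is made up.

```python
import pandas as pd

# Toy stand-in for the dictionary returned by transcipts2ListVepSnpEff().
listed_demo = {
    "CHROM": ["chr1", "chr2"],
    "POS": ["100", "200"],
    "snpeffANNSplit": [
        ["T|missense_variant", "T|intron_variant"],
        ["C|synonymous_variant"],
    ],
}

frame = pd.DataFrame(listed_demo)
# explode() repeats the other columns once per list element.
split_rows = frame.explode("snpeffANNSplit", ignore_index=True)
print(len(split_rows))  # 3
```

The two variants become three rows, because the first variant carries two transcripts.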

--------------------------------------------------------------------------------------------

**Splitting ANN to snpeffANNSplit list of transcripts**

In [25]:
vcfFileTransplitted = vcf.splitransciptsVepSnpEff(vcfFileTranslisted, "snpeffANNSplit")
print("After splitting each transcript to a different row, the number of rows is : " + str(len(vcfFileTransplitted)))

After splitting each transcript to a different row, the number of rows is : 311


--------------------------------------------------------------------------------------------

**Splitting CSQ to VepCSQSplit list of transcripts**

In [26]:
vcfFileTransplitted = vcf.splitransciptsVepSnpEff(vcfFileTranslisted, "VepCSQSplit")
print("After splitting each transcript to a different row, the number of rows is : " + str(len(vcfFileTransplitted)))

After splitting each transcript to a different row, the number of rows is : 74


--------------------------------------------------------------------------------------------

**ANN and CSQ can both be split in a single step**

*N.B.: splitting ANN and CSQ together can greatly increase the number of duplicated rows.*

In [27]:
vcfFileTransplitted = vcf.splitransciptsVepSnpEff(vcfFileTranslisted, "snpeffANNSplit,VepCSQSplit")
print("After splitting each transcript to a different row, the number of rows is : " + str(len(vcfFileTransplitted)))

After splitting each transcript to a different row, the number of rows is : 789
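
One plausible explanation for the jump in row count is that every ANN transcript of a variant is paired with every CSQ transcript of the same variant. The combined count above (789) exceeds the sum of the two separate counts (311 + 74 = 385), which is consistent with rows being multiplied rather than appended. The sketch below reproduces that effect on toy data with two successive `explode` calls; it is an assumption about the behaviour, not the library's code.

```python
import pandas as pd

# One variant with 3 snpEff transcripts and 2 VEP transcripts (toy values).
frame = pd.DataFrame(
    {
        "POS": ["100"],
        "snpeffANNSplit": [["a1", "a2", "a3"]],
        "VepCSQSplit": [["c1", "c2"]],
    }
)

both = (
    frame.explode("snpeffANNSplit", ignore_index=True)
    .explode("VepCSQSplit", ignore_index=True)
)
print(len(both))  # 6, i.e. 3 x 2 combinations for a single variant
```

If only one annotation source is needed downstream, splitting that source alone keeps the table smaller.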


--------------------------------------------------------------------------------------------

## AnnotatedvcfParser - pipe2Col()
*datatype : pandas.DataFrame*

This function takes the output of splitransciptsVepSnpEff() as input and separates **snpeffANNSplit/VepCSQSplit** into different columns, using the internal "|" separator. It returns a pandas.DataFrame object.
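
A minimal sketch of the pipe-to-column idea follows. VEP lists the sub-field names in the CSQ header line after `Format:`, and pipe2Col() presumably needs the VCF file for a similar reason, although that is a guess. The shortened header line, the toy rows and the `CSQ_` column prefix are all assumptions for illustration.

```python
import re

import pandas as pd

# Shortened example of a VEP header line (real files list many more fields).
header_line = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'
)
match = re.search(r'Format: ([^"]+)"', header_line)
field_names = match.group(1).split("|")

frame = pd.DataFrame(
    {
        "POS": ["100", "100"],
        "VepCSQSplit": [
            "T|missense_variant|MODERATE|GENE1",
            "T|intron_variant|MODIFIER|GENE1",
        ],
    }
)

# Split each transcript string on "|" into one column per sub-field.
parts = frame["VepCSQSplit"].str.split("|", expand=True)
parts.columns = ["CSQ_" + name for name in field_names]  # hypothetical prefix
wide = pd.concat([frame.drop(columns="VepCSQSplit"), parts], axis=1)
print(list(wide.columns))
```

Prefixing the new columns avoids name clashes when ANN and CSQ share sub-field names such as `Allele`.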

--------------------------------------------------------------------------------------------

**Splitting snpeffANNSplit into different columns based on the pipe internal separation**

In [18]:
vcfFileTransplittedPipe = vcf.pipe2Col(vcf_file, vcfFileTransplitted, "snpeffANNSplit")

In [19]:
#vcfFileTransplittedPipe.info(verbose=True)
print("After splitting each internal pipe of ANN to a different column, the number of column is : ")
len(vcfFileTransplittedPipe.columns)

After splitting each internal pipe of ANN to a different column, the number of column is : 


136

--------------------------------------------------------------------------------------------

**Splitting VepCSQSplit into different columns based on the pipe internal separation**

In [20]:
vcfFileTransplittedPipe = vcf.pipe2Col(vcf_file, vcfFileTransplitted, "VepCSQSplit")

In [21]:
#vcfFileTransplittedPipe.info(verbose=True)
print("After splitting each internal pipe of CSQ to a different column, the number of column is : ")
len(vcfFileTransplittedPipe.columns)

After splitting each internal pipe of CSQ to a different column, the number of column is : 


188

--------------------------------------------------------------------------------------------

**snpeffANNSplit and VepCSQSplit can both be split in a single step**

In [22]:
vcfFileTransplittedPipe = vcf.pipe2Col(vcf_file, vcfFileTransplitted, "snpeffANNSplit,VepCSQSplit")

In [23]:
print("After splitting each internal pipe of ANN and CSQ to a different column, the number of column is : ")
len(vcfFileTransplittedPipe.columns)

After splitting each internal pipe of ANN and CSQ to a different column, the number of column is : 


204

These are the columns:

In [28]:
vcfFileTransplittedPipe.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789 entries, 0 to 788
Data columns (total 204 columns):
 #    Column                         Dtype 
---   ------                         ----- 
 0    CHROM                          object
 1    POS                            object
 2    ID                             object
 3    REF                            object
 4    ALT                            object
 5    QUAL                           object
 6    FILTER                         object
 7    INFO                           object
 8    FORMAT                         object
 9    NORMAL                         object
 10   TUMOR                          object
 11   AS_FilterStatus                object
 12   AS_SB_TABLE                    object
 13   AS_UNIQ_ALT_READ_COUNT         object
 14   CONTQ                          object
 15   DP                             object
 16   ECNT                           object
 17   GERMQ                          object
 18   MBQ     

--------------------------------------------------------------------------------------------

## Save results to a CSV file

The parsed DataFrame can then be saved to a CSV file:

**>> vcfFileTransplittedPipe.to_csv("/your/local/path/results/vcfparsed.csv")**
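
As a small usage sketch, the snippet below writes a toy DataFrame without the numeric row index and reads it back as strings, which matches the `object` dtype shown by `info()` above. The temporary path and the toy values are assumptions; only `to_csv` and `read_csv` from pandas are used.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"CHROM": ["chr1"], "POS": ["100"], "SIFT_score": ["."]})

out_path = os.path.join(tempfile.mkdtemp(), "vcfparsed.csv")
frame.to_csv(out_path, index=False)  # index=False drops the 0..N-1 row index

# dtype=str keeps positions and scores as text instead of guessing numeric types.
restored = pd.read_csv(out_path, dtype=str)
print(restored.values.tolist())  # [['chr1', '100', '.']]
```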