## Parsing ExAC version release 0.3
#### Parses the ExAC VCF file and extracts many fields of metadata and variant-calling data.

&copy; Anat Etzion-Fuchs, Feb 2017

---
### Instructions:
1. Download the ExAC database from: [ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/ExAC.r0.3.sites.vep.vcf.gz](ftp://broadinstitute.org/pub/ExAC_release/release0.3/)
2. Save this script and the .vcf.gz in the same dir (or change the `path` in global vars to where you saved the data file). No need to unzip!
3. Run all code cells one after the other

### Code processing and output:
1) Parse the VCF metadata at the beginning of the file. It includes metadata of the following types: ALT, FILTER, FORMAT, INFO, contig, reference. 

**Output:**  
1. `metadata_dict`: a dictionary with all the metadata, keyed by the types listed above.  
2. `info_df`: a data frame with the INFO fields. This is where ExAC describes its different data fields and their meaning. A printout of this data frame is in the last cell.

2) Parse the VCF data lines, see format at: [VCFv4.1.pdf](https://samtools.github.io/hts-specs/VCFv4.1.pdf)  
The header consists of the following columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.  
INFO is the field that contains most of the information.  
The code extracts some of the fields related to the allele frequency.   
After the frequency fields, there can be several CSQ entries.  
Each CSQ entry holds the data of a specific `Feature_type`: different transcripts or non-coding variants. Each CSQ entry carries its own Ensembl identifiers.  
A printout of the specific CSQ format is in the second-to-last cell.  
The meaning of the different fields is described here: [http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html?redirect=no](http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html?redirect=no)

**Output:** a `parsed` dir is created, containing one file per chromosome in .csv format (tab-delimited).   
Each chromosome file is a table with one row per CSQ entry: for each VCF data line there is a separate row for each CSQ, containing the fields that all the CSQs share as well as the CSQ-specific fields.
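
To make the one-row-per-CSQ layout concrete, here is a minimal sketch (Python 3, standard library only) that splits a CSQ value into one dictionary per annotation. It is not the code used in the cells of this notebook: the helper name `split_csq`, the shortened field list, and the sample string are all made up for illustration. The real field order comes from the `Format:` part of the CSQ description in the VCF header.

```python
# Illustrative subset of the CSQ layout; the full list comes from the
# "Format:" part of the CSQ INFO description in the VCF header.
CSQ_FIELDS = ["Allele", "Gene", "Feature", "Feature_type", "Consequence"]

def split_csq(csq_value, field_names=CSQ_FIELDS):
    """Return one dict per comma-separated CSQ entry (hypothetical helper)."""
    rows = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        rows.append(dict(zip(field_names, values)))
    return rows

# Made-up CSQ value with two annotations for the same variant
example = ("A|ENSG01|ENST01|Transcript|missense_variant,"
           "A|ENSG01|ENST02|Transcript|synonymous_variant")
rows = split_csq(example)
for row in rows:
    print(row["Feature"], row["Consequence"])
```

Each dictionary corresponds to one output row; the fields shared by the whole VCF line (chrom, pos, ref, and so on) would be repeated on every such row.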

#### Import packages

In [1]:
import gzip
import pandas as pd
import numpy as np
from collections import defaultdict
from IPython.core.display import HTML
HTML("<style>.container { width:100% !important; }</style>");

#### Global Vars

In [2]:
curr_dir = !pwd
path = curr_dir[0]+"/" #Linux path
#myFile = "ExAC.r0.3.nonTCGA.sites.vep.vcf.gz"
myFile = "ExAC.r0.3.sites.vep.vcf.gz"

#Positions in the VCF record
CHROM = 0
POS = 1
ID = 2
REF = 3
ALT = 4
QUAL = 5
FILTER = 6
INFO = 7

#fields for extraction (it's important that those names will match *exactly* the dictionary keys in "data_dict")
headlines = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "AC", "AC_AFR", "AC_AMR", "AC_Adj", "AC_EAS", "AC_FIN", "AC_Het", "AC_Hom", "AC_NFE",
             "AC_OTH", "AC_SAS", "AF", "AN", "AN_AFR", "AN_AMR", "AN_Adj", "AN_EAS", "AN_FIN", "AN_NFE", "AN_OTH", "AN_SAS", "DP", "gene", "feature", 
             "feature_type", "conseq", "prot_pos", "amino_acids", "codons", "strand","ENSP", "SWISSPROT", "SIFT", "PolyPhen", "exon", "intron", "domains", "clin_sig"]

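#CSQ positions
#Each value is the 0-based index of the "|" delimiter that precedes the field
#(i.e. the field's index in the CSQ format string minus 1)
#http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html?redirect=no
GENE = 0        #Ensembl stable ID of affected gene
FEATURE = 1     #Ensembl stable ID of feature
FEATURE_TYPE = 2 #type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature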
CONSEQ = 3      #consequence type of this variant
PROT_POS = 6    #relative position of amino acid in protein
AMINO_ACIDS = 7 #the change. only given if the variant affects the protein-coding sequence
CODONS = 8      #the alternative codons with the variant base in upper case
ALLELE_NUM = 10 #Allele number from input; 0 is reference, 1 is first alternate etc
STRAND = 12     #the DNA strand (1 or -1) on which the transcript/feature lies
ENSP = 19       #the Ensembl protein identifier of the affected transcript
SWISSPROT = 20  #UniProtKB/Swiss-Prot identifier of protein product
SIFT = 23       #the SIFT prediction and/or score, with both given as prediction(score)
POLYPHEN = 24   #the PolyPhen prediction and/or score
EXON = 25       #the exon number (out of total number)
INTRON = 26     #the intron number (out of total number)
DOMAINS = 27    #the source and identifier of any overlapping protein domains
GMAF = 30       #Non-reference allele and frequency of existing variant in 1000 Genomes
CLIN_SIG = 37   #ClinVar Clinical significance of variant from dbSNP: http://varianttools.sourceforge.net/Annotation/DbSNP
                #Variant Clinical Significance, 0 - unknown, 1 -
                #untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 
                #5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other
            

#Defined here: https://macarthurlab.org/2016/03/17/reproduce-all-the-figures-a-users-guide-to-exac-part-2/#multi-allelic-enriched-regions
multi_allelic_regions = {'14': [[106329000, 106331000], [107178000, 107180000]], 
                         '2': [[89160000, 89162000]],
                        '17': [[18967000, 18968000], [19091000, 19092000]],
                        '22': [[23223000, 23224000]],
                        '1': [[152975000, 152976000]]}
#A flag for removing the multi-allelic regions
removeMultiAllelic = True
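
The region table above maps each chromosome to a list of inclusive `[start, end]` intervals. As a sketch of how such a table can be queried, the following Python 3 snippet defines a small membership test. The helper name `in_excluded_region` and the shortened copy of the table are assumptions made for this example; the notebook itself performs the check inline in the main loop.

```python
# Same structure as multi_allelic_regions: chromosome -> list of [start, end]
regions_by_chrom = {
    "14": [[106329000, 106331000], [107178000, 107180000]],
    "2": [[89160000, 89162000]],
}

def in_excluded_region(chrom, pos, regions=regions_by_chrom):
    """Return True if pos lies inside any inclusive [start, end] interval on chrom."""
    return any(start <= pos <= end for start, end in regions.get(chrom, []))

inside = in_excluded_region("14", 106330000)     # within the first chr14 interval
outside = in_excluded_region("14", 100)          # chr14, but far from both intervals
other_chrom = in_excluded_region("3", 89160000)  # chromosome not in the table
print(inside, outside, other_chrom)
```

Using `regions.get(chrom, [])` means chromosomes without any listed region need no special handling.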

#### Open the file and process the meta-data rows
Processed meta-data is stored in the dictionary: `metadata_dict`.  
Metadata of type "INFO" is saved into the data frame: `info_df`.

In [3]:
#Read the file
vcf_file = gzip.open(path+myFile,'r')

#Process meta-data
metadata_dict = {}
data_flag = False
for line in vcf_file:
    if line[0:2] == "##":
        #assign keys according to the format
        key = line[2:line.index('=')]
        if key == "ALT":
            val = dict.fromkeys(["ID","Description"])
        elif key == "FILTER":
            val = dict.fromkeys(["ID","Description"])
        elif key == "FORMAT":
            val = dict.fromkeys(["ID","Number","Type","Description"])
        elif key == "INFO":
            val = dict.fromkeys(["ID","Number", "Type", "Description"])
        elif key == "contig":
            val = dict.fromkeys(["ID","length"])
        elif key == "reference":
            val = dict.fromkeys(["file"])
        #Not processing other metadata types
        else:
            continue
            
        #fill in the data
        for f in val.keys():
            f_key = line.find(f)
            f_beg = line.find("=", f_key)
            if (f_beg < 0):
                f_beg = line.find(":", f_key) #When parsing reference line
            f_end = line.find(",", f_beg)
            if (f_end < 0):
                f_end = line.find(">")
            if (f_end < 0): #When parsing reference line
                f_end = line.find("\n")
            val[f] = line[f_beg + 1:f_end]
            
        #Adding to the metadata dictionary
        if key not in metadata_dict:
            metadata_dict[key] = [val]
        else:
            metadata_dict[key].append(val)
            
    #Processing the data starting the next line
    elif line[0:6] == "#CHROM":
        data_flag = True
        break

#Arrange the INFO metadata to a data-frame
info_df = pd.DataFrame(metadata_dict["INFO"])
info_df = info_df.sort_values("ID")
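
Note that the `find`-based loop above ends each value at the first comma after the key, so a `Description` that itself contains a comma is cut short. A possible alternative, shown here only as a sketch in Python 3, is a regular expression that treats a double-quoted value as a single unit. The helper name `parse_meta_line` and the sample header line are made up for this example and are not taken from the ExAC file.

```python
import re

# key=value pairs where the value is either a double-quoted string or a run
# of characters up to the next comma or closing bracket
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,>]*)')

def parse_meta_line(line):
    """Parse '##KEY=<a=b,c="d, e">' into (KEY, dict). Hypothetical helper."""
    key, _, rest = line.strip()[2:].partition("=")
    body = rest.strip("<>")
    fields = {name: value.strip('"') for name, value in PAIR_RE.findall(body)}
    return key, fields

# Made-up metadata line whose description contains a comma
sample = '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count, one per ALT allele">'
key, fields = parse_meta_line(sample)
print(key, fields["ID"], fields["Description"])
```

This sketch assumes that quoted values contain no escaped double quotes and no `name=value` text of their own, which would need extra handling.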

#### Process the variant calling data rows
Output: `parsed/` dir with one file for each parsed chromosome: `parsed_chrom#.csv`.

In [4]:
def find_nth(s, x, n=0, overlap=False):
    """A function to find the position of the nth (0-based) occurrence of the substring x in the string s.
    Returns -1 if there are fewer than n+1 occurrences."""
    
    l = 1 if overlap else len(x)
    i = -l
    for c in range(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i
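
The following Python 3 sketch shows how `find_nth` is used later to cut one field out of a CSQ entry, and that the result matches a plain `split`. The function is restated without the `overlap` option so the snippet runs on its own, and the CSQ entry is made up.

```python
def find_nth(s, x, n=0):
    """Index of the n-th (0-based) non-overlapping occurrence of x in s, or -1."""
    i = -len(x)
    for _ in range(n + 1):
        i = s.find(x, i + len(x))
        if i < 0:
            break
    return i

csq = "A|ENSG01|ENST01|Transcript|missense_variant"  # made-up entry

# Field that follows delimiter number 2 (0-based), i.e. Feature_type
beg = find_nth(csq, "|", 2)
end = csq.find("|", beg + 1)
by_index = csq[beg + 1:end]

# The same field via split: delimiter n is followed by list element n + 1
by_split = csq.split("|")[2 + 1]

print(by_index, by_split)
```

One caveat of the slice-based approach: for the last field of an entry, `find` returns -1 for the end position, so the slice would drop the final character. `split` has no such edge case.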

In [5]:
def data_to_df_csv(data_dict, headlines, chrom_num):
    """A function that saves the data dictionary to DataFrame and then to .csv"""
    
    #Creating a data_frame from all the parsed values of the chromosome
    df = pd.DataFrame([data_dict[h] for h in headlines])
    df = df.transpose()
    df.columns = headlines
    
    !mkdir -p $path"parsed"
    #Saving the df to a file
    df.to_csv(path+"parsed/parsed_chrom"+chrom_num+".csv", sep='\t')
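
Because `to_csv` is called with `sep='\t'`, the output files are tab-delimited despite the `.csv` extension, and the DataFrame index is written as an unnamed first column. A minimal Python 3 sketch of reading such content back with the standard library follows; the in-memory text stands in for a real `parsed_chrom` file and its values are made up.

```python
import csv
import io

# Stand-in for the contents of a parsed_chrom file: tab-separated, with the
# DataFrame index written as an unnamed first column (made-up values)
text = "\tchrom\tpos\tref\talt\n0\t1\t13372\tG\tC\n1\t1\t13380\tC\tG\n"

reader = csv.DictReader(io.StringIO(text), delimiter="\t")
rows = list(reader)
print(rows[0]["chrom"], rows[0]["pos"], rows[0]["alt"])
```

With pandas, the equivalent would be `pd.read_csv(filename, sep='\t', index_col=0)`.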

In [6]:
def update_main_fields(line_parts, data_dict):
    """A function that extracts the main fields from line_parts and adds them to the data dictionary.
    Returns the per-allele lists (AC*, AF) extracted from the info field for further processing"""
    
    #Extracting Chromosome number
    data_dict["chrom"].append(line_parts[CHROM])
    
    #Extracting position
    data_dict["pos"].append(int(line_parts[POS]))
    
    #Extracting id
    data_dict["id"].append(line_parts[ID])
    
    #Extracting ref
    data_dict["ref"].append(line_parts[REF])
    
    #Extracting quality
    data_dict["qual"].append(line_parts[QUAL])
    
    #Extracting filter
    data_dict["filter"].append(line_parts[FILTER])
    
    #Extracting fields from the info
    info = line_parts[INFO]
    
    #AC = Allele Count
    AC_beg = info.find("AC=")
    AC_end = info.find(";", AC_beg)
    AC_list = (info[AC_beg+3:AC_end]).split(",")
    
    #AC_AFR = Allele Count African population
    AC_AFR_beg = info.find("AC_AFR=")
    AC_AFR_end = info.find(";", AC_AFR_beg)
    AC_AFR_list = (info[AC_AFR_beg+7:AC_AFR_end]).split(",")
    
    #AC_AMR = Allele count American population
    AC_AMR_beg = info.find("AC_AMR=")
    AC_AMR_end = info.find(";", AC_AMR_beg)
    AC_AMR_list = (info[AC_AMR_beg+7:AC_AMR_end]).split(",")
    
    #AC_adjusted = Adjusted Allele Count
    AC_adj_beg = info.find("AC_Adj=")
    AC_adj_end = info.find(";", AC_adj_beg)
    AC_adj_list = (info[AC_adj_beg+7:AC_adj_end]).split(",")
    
    #AC_EAS = Allele Count East Asian population
    AC_EAS_beg = info.find("AC_EAS=")
    AC_EAS_end = info.find(";", AC_EAS_beg)
    AC_EAS_list = (info[AC_EAS_beg+7:AC_EAS_end]).split(",")
    
    #AC_FIN = Allele Count Finnish population
    AC_FIN_beg = info.find("AC_FIN=")
    AC_FIN_end = info.find(";", AC_FIN_beg)
    AC_FIN_list = (info[AC_FIN_beg+7:AC_FIN_end]).split(",")
    
    #AC_Het = Allele Count (adjusted) Heterozygous
    AC_Het_beg = info.find("AC_Het=")
    AC_Het_end = info.find(";", AC_Het_beg)
    AC_Het_list = (info[AC_Het_beg+7:AC_Het_end]).split(",")
    
    #AC_Hom = Allele Count (adjusted) Homozygous
    AC_Hom_beg = info.find("AC_Hom=")
    AC_Hom_end = info.find(";", AC_Hom_beg)
    AC_Hom_list = (info[AC_Hom_beg+7:AC_Hom_end]).split(",")
    
    #AC_NFE = Allele Count Non-Finnish European population
    AC_NFE_beg = info.find("AC_NFE=")
    AC_NFE_end = info.find(";", AC_NFE_beg)
    AC_NFE_list = (info[AC_NFE_beg+7:AC_NFE_end]).split(",")
    
    #AC_OTH = Allele Count Other populations
    AC_OTH_beg = info.find("AC_OTH=")
    AC_OTH_end = info.find(";", AC_OTH_beg)
    AC_OTH_list = (info[AC_OTH_beg+7:AC_OTH_end]).split(",")
    
    #AC_SAS = Allele Count South Asian population
    AC_SAS_beg = info.find("AC_SAS=")
    AC_SAS_end = info.find(";", AC_SAS_beg)
    AC_SAS_list = (info[AC_SAS_beg+7:AC_SAS_end]).split(",")
    
    #AF = Allele Frequency 
    AF_beg = info.find("AF=")
    AF_end = info.find(";", AF_beg)
    AF_list = (info[AF_beg+3:AF_end]).split(",")
    
    #AN = Allele Number
    AN_beg = info.find("AN=")
    AN_end = info.find(";", AN_beg)
    data_dict["AN"].append(info[AN_beg+3:AN_end])
    
    #AN_AFR = Allele Number African population
    AN_AFR_beg = info.find("AN_AFR=")
    AN_AFR_end = info.find(";", AN_AFR_beg)
    data_dict["AN_AFR"].append(info[AN_AFR_beg+7:AN_AFR_end])
    
    #AN_AMR = Allele Number American population
    AN_AMR_beg = info.find("AN_AMR=")
    AN_AMR_end = info.find(";", AN_AMR_beg)
    data_dict["AN_AMR"].append(info[AN_AMR_beg+7:AN_AMR_end])
    
    #AN_adj = Adjusted Allele Number
    AN_adj_beg = info.find("AN_Adj=")
    AN_adj_end = info.find(";", AN_adj_beg)
    data_dict["AN_Adj"].append(info[AN_adj_beg+7:AN_adj_end])
    
    #AN_EAS = Allele Number East Asian population
    AN_EAS_beg = info.find("AN_EAS=")
    AN_EAS_end = info.find(";", AN_EAS_beg)
    data_dict["AN_EAS"].append(info[AN_EAS_beg+7:AN_EAS_end])
    
    #AN_FIN = Allele Number Finnish population
    AN_FIN_beg = info.find("AN_FIN=")
    AN_FIN_end = info.find(";", AN_FIN_beg)
    data_dict["AN_FIN"].append(info[AN_FIN_beg+7:AN_FIN_end])
    
    #AN_NFE = Allele Number Non-Finnish European population
    AN_NFE_beg = info.find("AN_NFE=")
    AN_NFE_end = info.find(";", AN_NFE_beg)
    data_dict["AN_NFE"].append(info[AN_NFE_beg+7:AN_NFE_end])
    
    #AN_OTH = Allele Number other populations
    AN_OTH_beg = info.find("AN_OTH=")
    AN_OTH_end = info.find(";", AN_OTH_beg)
    data_dict["AN_OTH"].append(info[AN_OTH_beg+7:AN_OTH_end])
    
    #AN_SAS = Allele Number South Asian population
    AN_SAS_beg = info.find("AN_SAS=")
    AN_SAS_end = info.find(";", AN_SAS_beg)
    data_dict["AN_SAS"].append(info[AN_SAS_beg+7:AN_SAS_end])
    
    #DP = Approximate read depth
    DP_beg = info.find("DP=")
    DP_end = info.find(";", DP_beg)
    data_dict["DP"].append(info[DP_beg+3:DP_end])
    
    return (AC_list, AC_AFR_list, AC_AMR_list, AC_adj_list, AC_EAS_list, AC_FIN_list, AC_Het_list, AC_Hom_list, AC_NFE_list, AC_OTH_list, AC_SAS_list, AF_list)
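
The function above locates every key with a separate `find` call. As a more compact alternative, sketched here in Python 3 and not the approach used in this notebook, the INFO string can be split once into a dictionary and the per-allele values looked up by exact key. The helper name `parse_info` and the sample INFO string are made up, and the sketch assumes that no value contains a `;`.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

# Made-up INFO string for a site with two ALT alleles
info = "AC=3,1;AC_AFR=0,1;AF=2.5e-05,8.3e-06;AN=120000;DB;DP=1500000"
parsed = parse_info(info)
ac_per_allele = parsed["AC"].split(",")   # one value per ALT allele
print(ac_per_allele, parsed["AN"], parsed["DB"])
```

Exact-key lookup also removes any risk of a short pattern such as `AF=` matching the tail of a longer key name, which substring search cannot rule out in general.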


In [7]:
def fill_empty_fields(line_parts, alt_list, data_dict):
    """A function that updates empty strings for CSQ fields when there's no CSQ data."""
    
    #Adding one line per each variation (can be several even when there's no CSQ)
    for i in range(len(alt_list)):
        
        #This command has to be inside the loop, it adds elements to some dictionary fields
        (AC_list, AC_AFR_list, AC_AMR_list, AC_adj_list, AC_EAS_list, AC_FIN_list, 
         AC_Het_list, AC_Hom_list, AC_NFE_list, AC_OTH_list, AC_SAS_list, AF_list) = update_main_fields(line_parts, data_dict)
        
        #Update the main fields
        data_dict["AC"].append(AC_list[i])
        data_dict["AC_AFR"].append(AC_AFR_list[i])
        data_dict["AC_AMR"].append(AC_AMR_list[i])
        data_dict["AC_Adj"].append(AC_adj_list[i])
        data_dict["AC_EAS"].append(AC_EAS_list[i])
        data_dict["AC_FIN"].append(AC_FIN_list[i])
        data_dict["AC_Het"].append(AC_Het_list[i])
        data_dict["AC_Hom"].append(AC_Hom_list[i])
        data_dict["AC_NFE"].append(AC_NFE_list[i])
        data_dict["AC_OTH"].append(AC_OTH_list[i])
        data_dict["AC_SAS"].append(AC_SAS_list[i])
        data_dict["AF"].append(AF_list[i])
    
        #Update the alt
        data_dict["alt"].append(alt_list[i])
    
        #Update the rest of fields with empty string
        data_dict["gene"].append("")
        data_dict["feature"].append("")
        data_dict["feature_type"].append("")
        data_dict["conseq"].append("")
        data_dict["prot_pos"].append("")
        data_dict["amino_acids"].append("")
        data_dict["codons"].append("")
        data_dict["strand"].append("")
        data_dict["ENSP"].append("")
        data_dict["SWISSPROT"].append("")
        data_dict["SIFT"].append("")
        data_dict["PolyPhen"].append("")
        data_dict["exon"].append("")
        data_dict["intron"].append("")
        data_dict["domains"].append("")
        data_dict["clin_sig"].append("")

In [8]:
#Process the data records of the vcf and save each chromosome to a separate file
chromosome_iter = '1'
data_dict = defaultdict(list)
multi_allele_skipped = 0

for line in vcf_file:
    line_parts = line.split("\t")
    
    #Excluding multi-allelic regions
    if removeMultiAllelic:
        chrom = line_parts[CHROM]
        if chrom in multi_allelic_regions:
            pos = int(line_parts[POS])
            in_multi_allelic = False
            for region in multi_allelic_regions[chrom]:
                if region[0] <= pos <= region[1]:
                    in_multi_allelic = True
                    break
            if in_multi_allelic:
                #Multi-allelic region - excluding from analysis
                #(a `continue` inside the region loop would only move on to the next region,
                #so the record itself would still be processed)
                multi_allele_skipped += 1
                continue
    
    #If the next line belongs to a different chromosome - saving to file
    if line_parts[CHROM] != chromosome_iter:
        data_to_df_csv(data_dict, headlines, chromosome_iter)
        #Initializing the data dictionary
        data_dict = defaultdict(list)
        print "finished chromosome"+chromosome_iter
        chromosome_iter = line_parts[CHROM]
    
    #Extracting alt
    alt_list = line_parts[ALT].split(",")
    
    #Extracting fields from the info
    info = line_parts[INFO]
    
    #CSQ = Consequence type as predicted by VEP
    CSQ_beg = info.find("CSQ=")
    if (CSQ_beg == -1):
        #NO CSQ data: just fill in empty strings instead
        fill_empty_fields(line_parts, alt_list, data_dict)
    else:    
        CSQ_data = info[CSQ_beg+4:]
        CSQ_features = CSQ_data.split(",")
    
        for CSQ in CSQ_features:
            #Update the main fields for each CSQ feature (so each CSQ will appear in a different line)
            (AC_list, AC_AFR_list, AC_AMR_list, AC_adj_list, AC_EAS_list, AC_FIN_list,
             AC_Het_list, AC_Hom_list, AC_NFE_list, AC_OTH_list, AC_SAS_list, AF_list) = update_main_fields(line_parts, data_dict)

            #Allele_num for deciding which alt, AC and AF to add
            allele_num_beg = find_nth(CSQ, "|", ALLELE_NUM)
            allele_num_end = CSQ.find("|", allele_num_beg+1)
            allele_num = int(CSQ[allele_num_beg+1:allele_num_end])
            #Adding the corresponding alt
            if (allele_num == 0):
                print "allele num = 0 "+line_parts[POS] #Making sure all the features correspond to an alt (0 = ref)
            else:
                data_dict["alt"].append(alt_list[allele_num-1])
                data_dict["AC"].append(AC_list[allele_num-1])
                data_dict["AC_AFR"].append(AC_AFR_list[allele_num-1])
                data_dict["AC_AMR"].append(AC_AMR_list[allele_num-1])
                data_dict["AC_Adj"].append(AC_adj_list[allele_num-1])
                data_dict["AC_EAS"].append(AC_EAS_list[allele_num-1])
                data_dict["AC_FIN"].append(AC_FIN_list[allele_num-1])
                data_dict["AC_Het"].append(AC_Het_list[allele_num-1])
                data_dict["AC_Hom"].append(AC_Hom_list[allele_num-1])
                data_dict["AC_NFE"].append(AC_NFE_list[allele_num-1])
                data_dict["AC_OTH"].append(AC_OTH_list[allele_num-1])
                data_dict["AC_SAS"].append(AC_SAS_list[allele_num-1])
                data_dict["AF"].append(AF_list[allele_num-1])

            #Gene
            gene_beg = find_nth(CSQ, "|", GENE)
            gene_end = CSQ.find("|", gene_beg+1)
            data_dict["gene"].append(CSQ[gene_beg+1:gene_end])
            #Feature
            feature_beg = find_nth(CSQ, "|", FEATURE)
            feature_end = CSQ.find("|", feature_beg+1)
            data_dict["feature"].append(CSQ[feature_beg+1:feature_end])
            #Feature Type
            feature_type_beg = find_nth(CSQ, "|", FEATURE_TYPE)
            feature_type_end = CSQ.find("|", feature_type_beg+1)
            data_dict["feature_type"].append(CSQ[feature_type_beg+1:feature_type_end])
            #Consequence
            conseq_beg = find_nth(CSQ, "|", CONSEQ)
            conseq_end = CSQ.find("|", conseq_beg+1)
            data_dict["conseq"].append(CSQ[conseq_beg+1:conseq_end])
            #Protein_pos
            prot_pos_beg = find_nth(CSQ, "|", PROT_POS)
            prot_pos_end = CSQ.find("|", prot_pos_beg+1)
            data_dict["prot_pos"].append(CSQ[prot_pos_beg+1:prot_pos_end])
            #Amino Acids
            aa_beg = find_nth(CSQ, "|", AMINO_ACIDS)
            aa_end = CSQ.find("|", aa_beg+1)
            data_dict["amino_acids"].append(CSQ[aa_beg+1:aa_end])
            #Codons
            codons_beg = find_nth(CSQ, "|", CODONS)
            codons_end = CSQ.find("|", codons_beg+1)
            data_dict["codons"].append(CSQ[codons_beg+1:codons_end])
            #Strand
            strand_beg = find_nth(CSQ, "|", STRAND)
            strand_end = CSQ.find("|", strand_beg+1)
            data_dict["strand"].append(CSQ[strand_beg+1:strand_end])
            #ENSP
            ENSP_beg = find_nth(CSQ, "|", ENSP)
            ENSP_end = CSQ.find("|", ENSP_beg+1)
            data_dict["ENSP"].append(CSQ[ENSP_beg+1:ENSP_end])
            #Swissprot
            swiss_beg = find_nth(CSQ, "|", SWISSPROT)
            swiss_end = CSQ.find("|", swiss_beg+1)
            data_dict["SWISSPROT"].append(CSQ[swiss_beg+1:swiss_end])
            #SIFT
            sift_beg = find_nth(CSQ, "|", SIFT)
            sift_end = CSQ.find("|", sift_beg+1)
            data_dict["SIFT"].append(CSQ[sift_beg+1:sift_end])
            #PolyPhen
            polyphen_beg = find_nth(CSQ, "|", POLYPHEN)
            polyphen_end = CSQ.find("|", polyphen_beg+1)
            data_dict["PolyPhen"].append(CSQ[polyphen_beg+1:polyphen_end])
            #Exon
            exon_beg = find_nth(CSQ, "|", EXON)
            exon_end = CSQ.find("|", exon_beg+1)
            data_dict["exon"].append(CSQ[exon_beg+1:exon_end])
            #Intron
            intron_beg = find_nth(CSQ, "|", INTRON)
            intron_end = CSQ.find("|", intron_beg+1)
            data_dict["intron"].append(CSQ[intron_beg+1:intron_end])
            #Domains
            domains_beg = find_nth(CSQ, "|", DOMAINS)
            domains_end = CSQ.find("|", domains_beg+1)
            data_dict["domains"].append(CSQ[domains_beg+1:domains_end])
            #clin_sig
            clin_sig_beg = find_nth(CSQ, "|", CLIN_SIG)
            clin_sig_end = CSQ.find("|", clin_sig_beg+1)
            data_dict["clin_sig"].append(CSQ[clin_sig_beg+1:clin_sig_end])

#Saving the data of the last chromosome
data_to_df_csv(data_dict, headlines, chromosome_iter)
print "finished chromosome"+chromosome_iter

vcf_file.close()

finished chromosome1
finished chromosome2
finished chromosome3
finished chromosome4
finished chromosome5
finished chromosome6
finished chromosome7
finished chromosome8
finished chromosome9
finished chromosome10
finished chromosome11
finished chromosome12
finished chromosome13
finished chromosome14
finished chromosome15
finished chromosome16
finished chromosome17
finished chromosome18
finished chromosome19
finished chromosome20
finished chromosome21
finished chromosome22
finished chromosomeX
finished chromosomeY
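
The main loop writes a file whenever the CHROM value changes, which relies on the VCF being sorted by chromosome and starting with chromosome 1. The same "flush on change" pattern can be expressed with `itertools.groupby`, as in this Python 3 sketch. The records are made up and the file writing is replaced by a simple count.

```python
from itertools import groupby

# Made-up (chrom, pos) records, already sorted by chromosome as in a VCF
records = [("1", 100), ("1", 250), ("2", 40), ("X", 7), ("X", 9)]

counts = {}
for chrom, group in groupby(records, key=lambda rec: rec[0]):
    rows = list(group)            # all consecutive records of this chromosome
    counts[chrom] = len(rows)     # here one would write rows to parsed_chrom<chrom>.csv
    print("finished chromosome" + chrom)
```

`groupby` only groups consecutive items, so an unsorted input would produce several groups for the same chromosome, just as the loop above would overwrite an earlier file.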


In [9]:
print multi_allele_skipped

2779


#### Printouts for clarification - you don't have to run this!

In [15]:
#The different fields of the CSQ
pd.options.display.max_colwidth =600
info_df[info_df['ID'] == "CSQ"]

Unnamed: 0,Description,ID,Number,Type
70,"""Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|BIOTYPE|CANONICAL|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SIFT|PolyPhen|EXON|INTRON|DOMAINS|HGVSc|HGVSp|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF_info|LoF_flags|LoF_filter|LoF""",CSQ,.,String


In [17]:
#The info fields of the meta-data
pd.options.display.max_colwidth =100
info_df

Unnamed: 0,Description,ID,Number,Type
0,"""Allele count in genotypes",AC,A,Integer
1,"""African/African American Allele Counts""",AC_AFR,A,Integer
2,"""American Allele Counts""",AC_AMR,A,Integer
3,"""Adjusted Allele Counts""",AC_Adj,A,Integer
4,"""East Asian Allele Counts""",AC_EAS,A,Integer
5,"""Finnish Allele Counts""",AC_FIN,A,Integer
6,"""Adjusted Hemizygous Counts""",AC_Hemi,A,Integer
7,"""Adjusted Heterozygous Counts""",AC_Het,A,Integer
8,"""Adjusted Homozygous Counts""",AC_Hom,A,Integer
9,"""Non-Finnish European Allele Counts""",AC_NFE,A,Integer
