
Format_english_specification v20151001

chenyu600 edited this page Jul 31, 2017 · 10 revisions

Format specification for candidates list

Position and variant information

  • Example
CHR POS ID REF ALT VarType Allele
1 987189 rs540580770 C T snv T
  • Specifications
Column Specification
CHR Chromosome
POS Position
ID Identifiers derived from dbSNP or COSMIC database, . represents that there is no record for this position
REF Reference base(s)
ALT Comma separated list of alternate non-reference alleles called on at least one of the samples
VarType Variant type
Allele One of the alleles listed in ALT
  • Common practice
  1. Filter on VarType with Excel's built-in filter to obtain variants of a specific type;
  2. Filter on CHR with Excel's built-in filter to obtain variants located on specific chromosomes;
  3. Filter on POS with Excel's built-in filter to obtain variants at specific positions;
  4. Filter on ID with Excel's built-in filter to obtain variants with specific identifiers;

Figure: Position and variant information
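The same filters can be applied outside Excel. The following is a minimal sketch, assuming the candidate list has been exported as tab-separated text with the column names shown in the example; the second data row and the helper name `filter_candidates` are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical excerpt of a candidate list. The first data row is the example
# from this page; the second row is invented for illustration.
CANDIDATES_TSV = (
    "CHR\tPOS\tID\tREF\tALT\tVarType\tAllele\n"
    "1\t987189\trs540580770\tC\tT\tsnv\tT\n"
    "2\t1000\t.\tAT\tA\tindel\tA\n"
)


def filter_candidates(rows, var_type=None, chroms=None, has_id=None):
    """Keep rows that satisfy every criterion that is not None."""
    kept = []
    for row in rows:
        if var_type is not None and row["VarType"] != var_type:
            continue
        if chroms is not None and row["CHR"] not in chroms:
            continue
        # '.' in ID means there is no dbSNP/COSMIC record for the position.
        if has_id is not None and (row["ID"] != ".") != has_id:
            continue
        kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(CANDIDATES_TSV), delimiter="\t"))
snvs_on_chr1 = filter_candidates(rows, var_type="snv", chroms={"1"})
without_id = filter_candidates(rows, has_id=False)
```

Passing `None` for a criterion leaves that column unfiltered, which mirrors leaving an Excel column filter untouched.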

Note

  • When assigning identifiers in ID, the pipeline considers only the position of the variant, not the bases listed in ALT. If the record is an InDel, an identifier is reported as long as its position falls within the InDel region;
  • Genotypes in the candidate list follow the VCF format. The allele values are 0 for the reference allele (what is in REF), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. If a call cannot be made for a sample at a given locus, '.' is specified for each missing allele;
  • A position has several records if more than one alternate allele is found (comma-separated list of alternate non-reference alleles in ALT).
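
The allele numbering described in the note can be illustrated with a short sketch. The function name `decode_genotype` is hypothetical; it only translates the indices of a VCF-style genotype into the corresponding REF/ALT strings.

```python
def decode_genotype(gt, ref, alt):
    """Translate a VCF-style genotype such as '0/1' into allele strings.

    Index 0 is REF, index 1 is the first ALT allele, index 2 the second,
    and so on. A missing allele ('.') is returned as None.
    """
    alleles = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"
    return [None if idx == "." else alleles[int(idx)] for idx in gt.split(separator)]


het = decode_genotype("0/1", "C", "T")
multi = decode_genotype("1/2", "C", "T,G")
missing = decode_genotype("./.", "C", "T")
```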

Allele frequency in various databases

  • Example
# of databases (AF>0.005) KG ESP PVFD IN-HOUSE-1 IN-HOUSE-2 IN-HOUSE-3 CG HapMap Wellderly ExAC
0 0.0006 0.0002 0.0005 0.0005 -1 -1 -1 -1 -1 0.00074549
  • Specifications
Column Specification
# of databases (AF>0.005) Number of databases in which the frequency of the allele (given in the preceding Allele column) is >0.005. 10 databases (both public and in-house) are used in this pipeline. If no frequency is available, '-1' is shown in the corresponding column.
KG Public database, ~2500 WES/WGS
ESP Public database, ~6500 WES
PVFD BGI in-house database, ~1000 WES/WGS
IN-HOUSE-1 BGI in-house database, ~1000 WES
IN-HOUSE-2 BGI in-house database, ~1000 WES
IN-HOUSE-3 BGI in-house database, ~200 WGS
CG Public database, 51 WGS
HapMap Public database, 270 WGS
Wellderly Public database, 454 WGS
ExAC Public database, ~60,000 WES
  • Recommended manipulations
  1. Quick filter: Filter out records whose # of databases (AF>0.005) is greater than N. N could be 0, 1, 2, or 3.
  2. Advanced filter: Use different thresholds for different databases. The most frequently used threshold is 0.005. Researchers are encouraged to adjust the threshold based on the prevalence of the specific disease.

Figure: Population frequency databases
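Both filters can be sketched in a few lines. This is an illustrative sketch, not the pipeline's own code: the function name `count_common` and the per-database override dictionary are assumptions, while the column names, the 0.005 default, and the '-1' convention come from the table above.

```python
AF_COLUMNS = [
    "KG", "ESP", "PVFD", "IN-HOUSE-1", "IN-HOUSE-2",
    "IN-HOUSE-3", "CG", "HapMap", "Wellderly", "ExAC",
]


def count_common(record, thresholds=None, default=0.005):
    """Count databases in which the allele frequency exceeds its threshold.

    A value of -1 means that no frequency is available and is skipped.
    `thresholds` optionally overrides the default for individual databases.
    """
    thresholds = thresholds or {}
    count = 0
    for db in AF_COLUMNS:
        af = float(record[db])
        if af < 0:
            continue
        if af > thresholds.get(db, default):
            count += 1
    return count


# Frequencies from the example row on this page.
example = dict(zip(
    AF_COLUMNS,
    [0.0006, 0.0002, 0.0005, 0.0005, -1, -1, -1, -1, -1, 0.00074549],
))

quick = count_common(example)                       # default threshold everywhere
advanced = count_common(example, {"ExAC": 0.0005})  # stricter threshold for ExAC only
keep_record = quick <= 0                            # quick filter with N = 0
```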

BGI in-house capture chips databases

  • Example
SureSelect_Human_All_Exon_V1 SureSelect_Human_All_Exon_V2 SureSelect_Human_All_Exon_50Mb SureSelect_Human_All_Exon_V4 SureSelect_Human_All_Exon_V4+UTRs SeqCapEZ_Exome_v2.0 SeqCapEZ_Exome_v3.0
19 702 274 21 6 643 48
  • Specifications
Column Capture Platforms Capture chips # of samples
SureSelect_Human_All_Exon_V1 Agilent 38Mbp 71
SureSelect_Human_All_Exon_V2 Agilent 44Mbp 1103
SureSelect_Human_All_Exon_50Mb Agilent 50Mbp 390
SureSelect_Human_All_Exon_V4 Agilent 51Mbp 27
SureSelect_Human_All_Exon_V4+UTRs Agilent 71Mbp 7
SeqCapEZ_Exome_v2.0 NimbleGen 44.1Mbp 1026
SeqCapEZ_Exome_v3.0 NimbleGen 64Mbp 82
  • Recommended manipulations
  1. Choose different thresholds for different capture chips. A threshold of 20 is a common choice.

Figure: BGI in-house capture chip information

Note

  • In-house capture chip databases are used to remove systematic errors. If a variant has a low frequency in the population but is frequently detected by NGS, it is more likely to be a systematic error than a true variant;
  • In-house capture chip databases are used only in non-profit projects.
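
A per-chip threshold check might look like the sketch below. The page does not state the direction of the comparison, so treating a count strictly greater than the threshold as suspicious is an assumption here, as is the helper name `chips_over_threshold`. The counts are those of the example row.

```python
CHIP_COLUMNS = [
    "SureSelect_Human_All_Exon_V1",
    "SureSelect_Human_All_Exon_V2",
    "SureSelect_Human_All_Exon_50Mb",
    "SureSelect_Human_All_Exon_V4",
    "SureSelect_Human_All_Exon_V4+UTRs",
    "SeqCapEZ_Exome_v2.0",
    "SeqCapEZ_Exome_v3.0",
]


def chips_over_threshold(counts, thresholds=None, default=20):
    """Return the chips on which the variant was seen more often than allowed.

    Assumption: a count strictly greater than the threshold is flagged.
    `thresholds` optionally overrides the default for individual chips.
    """
    thresholds = thresholds or {}
    return [
        chip for chip in CHIP_COLUMNS
        if counts[chip] > thresholds.get(chip, default)
    ]


# Counts from the example row on this page.
example_counts = dict(zip(CHIP_COLUMNS, [19, 702, 274, 21, 6, 643, 48]))

flagged = chips_over_threshold(example_counts)
flagged_relaxed = chips_over_threshold(
    example_counts, {"SureSelect_Human_All_Exon_V4": 25}
)
```

Because the number of samples differs widely between chips (from 7 to 1103 in the table above), a single absolute threshold is more demanding on small databases, which is one reason to set thresholds per chip.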

Prediction of deleteriousness and conservation

  • Example
# of tools (predicted harmful or conserved) SIFT PolyPhen2_HDIV PolyPhen2_HVAR LRT MutationTaster MutationAssessor FATHMM GERP_plus PhyloP SiPhy Gerp PhastCons GWAWA
6 0.03 0.979 0.92 0.000074 0 4.06 2.16 3.74 0.783 11.0086 3.74 0.997 0.42
  • Specifications
Column Specification
# of tools (predicted harmful or conserved) The number of hazard prediction tools that have predicted the variant as a deleterious mutation
SIFT Deleterious(<0.05)
PolyPhen2_HDIV Probably damaging (>=0.957), possibly damaging (0.453<=pp2_hdiv<=0.956); benign (<=0.452)
PolyPhen2_HVAR Probably damaging (>=0.909), possibly damaging (0.447<=pp2_hvar<=0.908); benign (<=0.446)
LRT -
MutationTaster -
MutationAssessor Deleterious(>1.938)
FATHMM Deleterious(<-1.5)
GERP++ Deleterious(>3)
PhyloP Deleterious(>2.5)
SiPhy -
Gerp
PhastCons Deleterious(>0.6)
GWAWA -
  • Recommended manipulations
  1. Quick filter: Filter out records whose # of tools (predicted harmful or conserved) is less than N. Since hazard prediction tools have high false positive and false negative rates, this information is for reference only. Therefore, N could be 0;
  2. Advanced filter: Use different thresholds for various tools.

Figure: Deleteriousness prediction
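The advanced filter can be sketched by applying the cutoffs from the table to each score. This is only an illustration: the pipeline's own counting rule is not documented on this page, tools listed without a cutoff are skipped, and using only the "probably damaging" cutoff for PolyPhen2 is an assumption. The dictionary name `RULES` and the function name `harmful_tools` are hypothetical.

```python
import operator

# Cutoffs copied from the specification table; tools without a listed
# cutoff (LRT, MutationTaster, SiPhy, Gerp, GWAWA) are not evaluated.
RULES = {
    "SIFT": (operator.lt, 0.05),
    "PolyPhen2_HDIV": (operator.ge, 0.957),   # "probably damaging" only (assumption)
    "PolyPhen2_HVAR": (operator.ge, 0.909),   # "probably damaging" only (assumption)
    "MutationAssessor": (operator.gt, 1.938),
    "FATHMM": (operator.lt, -1.5),
    "GERP_plus": (operator.gt, 3),
    "PhyloP": (operator.gt, 2.5),
    "PhastCons": (operator.gt, 0.6),
}


def harmful_tools(scores):
    """Return the tools whose score passes the cutoff listed in RULES."""
    hits = []
    for tool, (compare, cutoff) in RULES.items():
        value = scores.get(tool)
        if value is None:
            continue
        if compare(value, cutoff):
            hits.append(tool)
    return hits


# Scores from the example row on this page.
example_scores = {
    "SIFT": 0.03, "PolyPhen2_HDIV": 0.979, "PolyPhen2_HVAR": 0.92,
    "MutationAssessor": 4.06, "FATHMM": 2.16, "GERP_plus": 3.74,
    "PhyloP": 0.783, "PhastCons": 0.997,
}

hits = harmful_tools(example_scores)
```

For the example row this sketch flags six tools, which agrees with the value 6 in the first column of the example.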

Annotations for genes

  • Example
OMIM GeneTag GO_BP GO_MF GO_CC KEGG_Pathway Proteins_Expression_profiles_of_Normal_Tissue
Neutrophilia, hereditary, 162830 (3) novel GO:0006952,defense response|GO:0007155,cell adhesion|GO:0007165,signal transduction GO:0004872,receptor activity|GO:0004896,cytokine receptor activity GO:0005576,extracellular region|GO:0005886,plasma membrane|GO:0005887,integral to plasma membrane hsa04060,Cytokine-cytokine receptor interaction|hsa04630,Jak-STAT signaling pathway|hsa04640,Hematopoietic cell lineage|hsa05200,Pathways in cancer "ENSG00000119535" "skin 2" "epidermal cells" "Not detected" "APE" "Supportive"
  • Specifications
Column Specification
OMIM Annotations derived from OMIM database for the mutated gene
GeneTag Whether the mutated gene has been reported for the corresponding disease before. If the mutation is in a known causal gene for the disease under study, this column shows "known" for that variant; otherwise it shows "novel". This feature depends on the information (HGNC symbols of known causative genes) provided to the pipeline
GO BP Gene ontology annotation in terms of biological process
GO MF Gene ontology annotation in terms of molecular function
GO CC Gene ontology annotation in terms of cellular component
KEGG
The Human Protein Atlas
  • Recommended manipulations
  1. Use Excel's built-in filter to view variants in known causative genes by selecting the tag 'known' in GeneTag.
  2. If you are interested in genes with information in OMIM, filter on OMIM.
  3. If you are interested in genes expressed in a certain tissue, filter on The Human Protein Atlas.
  4. If you are interested in genes with GO or KEGG annotation, filter on the corresponding columns.

Figure: Gene-level annotations

Annotations for transcripts

Due to alternative splicing, there may be multiple transcripts for a single gene, and the same variant may have different consequences in different transcripts. For each variant, the pipeline provides annotations for two transcripts that contain the variant: the transcript whose function is most affected by the mutation, and the canonical transcript of the gene as defined by Ensembl.

  • Example
CLIN_SIG IMPACT Consequence HGNC(SYMBOL) Feature BIOTYPE HGVSc HGVSp EXON INTRON DOMAINS SWISSPROT TREMBL UNIPARC SIFT PolyPhen
. MED:11 missense_variant EPHA2 ENST00000358432 protein_coding ENST00000358432.5:c.71C>T ENSP00000351209.5:p.Ala24Val '1/17' '.' Low_complexity_(Seg):Seg&Cleavage_site_(Signalp):Sigp&PIRSF_domain:PIRSF000666 EPHA2_HUMAN Q96HF4_HUMAN&Q8IZL0_HUMAN UPI00000731AB tolerated(0.56) possibly_damaging(0.75)
  • Specifications
Column Specification
CLIN_SIG Clinical significance of the variant from dbSNP. It is one of the following values: [unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other]
IMPACT Impact of variant, ranging from 1 to 34, with "1" meaning the most serious impact to the function
Consequence Consequence type caused by the corresponding variant; for details, refer to the VEP consequences documentation
HGNC(SYMBOL) Gene symbol from HGNC
Feature Ensembl stable ID of feature
BIOTYPE Biotype of transcript or regulatory feature
HGVSc The HGVS coding sequence name
HGVSp The HGVS protein sequence name
Exon The exon number involved in the mutation (out of total number)
Intron The intron number involved in the mutation (out of total number)
DOMAINS The source and identifier of any overlapping protein domains
SWISSPROT UniProtKB/Swiss-Prot identifier of protein product
TREMBL UniProtKB/TrEMBL identifier of protein product
UNIPARC UniParc identifier of protein product
SIFT Prediction and/or score from the SIFT hazard prediction tool
PolyPhen Prediction and/or score from the PolyPhen hazard prediction tool
  • Recommended manipulations
  1. Filter based on information of the transcript whose function is most affected.
  2. Filter based on values in IMPACT.
  3. Use prediction from SIFT and PolyPhen as references.
  4. Filter based on information of the canonical transcript.
  5. Use information such as HGVS, Exon, Intron, and DOMAINS as references.

Figure: Transcript-level annotations
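To filter on IMPACT or on the SIFT and PolyPhen columns programmatically, the text values first have to be split into a label and a number. The sketch below assumes the formats seen in the example row, namely `label(score)` for predictions and `LABEL:rank` for IMPACT; the function names are hypothetical.

```python
import re

# Matches values such as 'tolerated(0.56)' or 'possibly_damaging(0.75)'.
_PREDICTION = re.compile(r"^(?P<label>[A-Za-z_]+)\((?P<score>[0-9.]+)\)$")


def parse_prediction(text):
    """Split 'label(score)' into (label, score); return None if not in that form."""
    match = _PREDICTION.match(text)
    if match is None:
        return None
    return match.group("label"), float(match.group("score"))


def parse_impact(text):
    """Split an IMPACT value such as 'MED:11' into (label, rank).

    Assumption: the part after the colon is the 1-34 rank described above.
    """
    label, rank = text.split(":")
    return label, int(rank)


sift = parse_prediction("tolerated(0.56)")
polyphen = parse_prediction("possibly_damaging(0.75)")
no_value = parse_prediction(".")
impact = parse_impact("MED:11")
```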

Quality control

  • Example
Filter
ACC,TR,PASS
ACC,TR,VQSRTrancheSNP99.00to99.90
  • Specifications
Field Specification Values
Concordance Two variant calling strategies are applied in parallel. The first is the common practice of using GATK HaplotypeCaller, and the second calls variants on cohorts of samples using HaplotypeCaller in GVCF mode. The results of these two strategies can differ. Because it is currently not possible to tell which result is correct, the tag in the first field indicates whether the two results are concordant: "ACC" means the results of the two strategies are concordant, while "DIFF" means they are not. Values: ACC, DIFF
Target region or not Whether the variant is inside ("TR") or outside ("FLANK") the target region of the capture chip used for this project. Values: TR, FLANK
GATK filters Whether the corresponding variant has passed GATK Variant Quality Score Recalibration (VQSR). Only the tag "PASS" means that the variant is reliable according to the recalibration procedure. However, false positives and false negatives do occur. Values: PASS, LowQual, VQSRTrancheSNP99.00to99.90, VQSRTrancheINDEL99.00to99.90, VQSRTrancheSNP99.90to100.00, VQSRTrancheINDEL99.90to100.00
  • Note
  1. Target regions are those covered exactly by the designed probes. Generally speaking, such regions should be well sequenced. Flanking regions are the +/-200bp around each target region. Although flanking regions might not be sequenced as well as the target regions, it is recommended to make the best use of them.
  2. Different variant calling tools might produce different genotypes for the same position. The pipeline has integrated four calling tools (GATK, SOAPsnp, SAMtools, and Platypus). It is not uncommon for these tools to produce discordant genotypes for the same position. The pipeline modifies the genotype when there is a discrepancy. However, not all modifications are correct. Sanger sequencing has the final say in such situations.
  3. The pipeline performs GATK VQSR on variants called by GATK. Briefly, VQSR uses a Gaussian mixture model to score variant quality for filtering purposes. Variants with "PASS" are treated as reliable.

Figure: Quality control annotations
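The three comma-separated fields of the Filter column can be split and checked as in the sketch below. The names `FilterTags`, `parse_filter`, and `is_reliable` are hypothetical, and the rule combining the three tags is one possible choice rather than a documented recommendation.

```python
from typing import NamedTuple


class FilterTags(NamedTuple):
    concordance: str  # ACC or DIFF
    region: str       # TR or FLANK
    gatk: str         # PASS, LowQual, or a VQSR tranche tag


def parse_filter(value):
    """Split a Filter value such as 'ACC,TR,PASS' into its three fields."""
    concordance, region, gatk = value.split(",", 2)
    return FilterTags(concordance, region, gatk)


def is_reliable(tags, allow_flank=True):
    """Keep concordant calls that passed VQSR, optionally restricted to TR."""
    if tags.concordance != "ACC" or tags.gatk != "PASS":
        return False
    return allow_flank or tags.region == "TR"


first = parse_filter("ACC,TR,PASS")
second = parse_filter("ACC,TR,VQSRTrancheSNP99.00to99.90")
```

`allow_flank` defaults to `True` because the note above recommends making use of flanking regions.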

Cosegregation

  • Example
AD AR XL
AD:YY:2:3:0 AR:YN:0:3:0 XL:YN:0:0:0
  • Specifications
Field Specification Value
Mode of inheritance AD/AR/XL/Compound Heterozygous
Genotype Y/N
Cosegregate Y/N
Number of cases that fit cosegregation [0, #cases]
Number of controls that fit cosegregation [0, #controls]
Number of individuals with unknown phenotype that fit cosegregation [0, #individuals with unknown phenotype]
  • Recommended manipulations
  1. Since mutations causing rare diseases tend to affect gene function directly, we usually assume that causal mutations cosegregate with affected individuals within a family. The figure below shows how to filter based on cosegregation when the disease under study follows AR inheritance. There are 2 affected samples (cases) and 3 unaffected samples (controls). "AR:YY:2:3:0" means that 2 cases and 3 controls have been successfully genotyped at the corresponding position, and all of their genotypes follow the cosegregation. Records with such tags should be kept for further investigation. "AR:NY:2:2:0" indicates that 1 control has not been genotyped, but the samples that have been successfully genotyped follow the cosegregation. Variants of this kind should also be kept to avoid false negatives. Sanger sequencing can be a remedy when NGS and bioinformatic analysis fail to genotype samples.

Figure: Pedigree cosegregation information
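A cosegregation tag can be split into its fields as in the sketch below. Reading the first flag as "all samples were genotyped" follows the worked example above ("AR:NY:2:2:0") and is an interpretation; the names `Coseg` and `parse_coseg` are hypothetical.

```python
from typing import NamedTuple


class Coseg(NamedTuple):
    mode: str            # AD, AR, XL, ...
    all_genotyped: bool  # first Y/N flag (Genotype)
    cosegregates: bool   # second Y/N flag (Cosegregate)
    cases: int           # cases that fit cosegregation
    controls: int        # controls that fit cosegregation
    unknown: int         # individuals with unknown phenotype that fit


def parse_coseg(tag):
    """Parse a tag such as 'AR:YY:2:3:0' into a Coseg record."""
    mode, flags, cases, controls, unknown = tag.split(":")
    return Coseg(
        mode,
        flags[0] == "Y",
        flags[1] == "Y",
        int(cases),
        int(controls),
        int(unknown),
    )


complete = parse_coseg("AR:YY:2:3:0")
partial = parse_coseg("AR:NY:2:2:0")
rejected = parse_coseg("XL:YN:0:0:0")

# Keep every record whose genotyped samples follow the cosegregation,
# including those with incomplete genotyping.
kept = [c for c in (complete, partial, rejected) if c.cosegregates]
```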

Detailed sample information

  • Example
Detail_INFO_Format detail-case-1 detail-case-2 detail-control-3 detail-control-4 detail-control-5 case-1 case-2 control-3 control-4 control-5
BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ 'd(M-P)|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:2,6:45' 'd(M-P)|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:7,3:86' 'gd-unknown:DNM-unknown:0/0:ROH-lt5M:0/0:11,0:21' 'gd-unknown:DNM-unknown:0/0:ROH-lt5M:0/0:6,0:.' 'g(B-N):DNM-unknown:0/0:ROH-Unkown:0/0:7,0:21' '0/1' '0/1' '0/0' '0/0' '0/0'
  • Specification
Column / Field Specification Value
Detail_INFO_Format This column specifies the tag types and their order (colon-separated). It is followed by one column per sample, holding the colon-separated data for that sample. Seven keywords are present. Value: BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ
BGI_GD Tag given to each sample based on the genotypes of his/her parents at the corresponding position. It describes from whom each allele is inherited. Values: g(B-N), d(M-P)|d(P-M), g(B-N), d(M-P), d(P-M), d(N-B)fp, d(M-P)|d(N-B)fp, g(B-N)|d(P-M), g(B-N)|d(P-M), g(B-N)|d(M-P), d(P-M)|d(N-B)fp, g(B-N)|d(M-P), gd-unknown, g(B-N), g(B-N)
DNM Result of a random forest classifier that distinguishes true de novo SNVs. Values: DNM-TP, DNM-FP
SL_GT As mentioned before, there are two calling strategies in this pipeline: one at sample level and the other at family level. "SL_GT" is the genotype from GATK at sample level
ROH Runs of homozygosity. Values: ROH-unknown, ROH-lt5M, ROH-gt5M
FL_GT Genotype from GATK at family level
AD Allelic read depths for the reference and alternate alleles
GQ Genotype quality score derived from GATK at family level. Values: 0~99
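
The per-sample detail columns can be decoded by pairing them with the keys in Detail_INFO_Format. The following is a minimal sketch; the function name `parse_detail` is hypothetical, and stripping the surrounding single quotes assumes that the values are quoted as in the example row.

```python
def parse_detail(fmt, value):
    """Pair the keys in Detail_INFO_Format with one sample's detail string."""
    keys = fmt.split(":")
    fields = value.strip("'").split(":")
    if len(keys) != len(fields):
        raise ValueError(f"expected {len(keys)} fields, got {len(fields)}: {value!r}")
    return dict(zip(keys, fields))


FORMAT = "BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ"

# detail-case-1 from the example row on this page.
case_1 = parse_detail(FORMAT, "'d(M-P)|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:2,6:45'")

# AD holds the read depths for the reference and alternate alleles.
ref_depth, alt_depth = (int(n) for n in case_1["AD"].split(","))
```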
  • Recommended manipulations

ROH: When the pedigree under study involves a consanguineous union, homozygosity mapping is of great help. The ROH tag in each sample's detailed information column provides this information. As shown below, for an AR disease in a consanguineous family, first filter out records that do not follow the AR inheritance pattern, then keep those with ROH-gt5M. Larger ROH regions are more likely to be identical haplotypes that both parents inherited from a common ancestor.

Figure: Sample information (ROH)

  • Recommended manipulations

DNM: The BGI_GD and DNM tags can both be used to find de novo mutations. The figure below shows how to first keep records that cosegregate under the corresponding inheritance mode (usually AD), then use the DNM tag to locate de novo SNVs that the random forest classifier considers true positives.

Figure: Sample information (DNM)

BGI_GD

  • Format:
    "a(b-c)" or "gd-unknown"
  • Specification:
    This tag is given to each sample based on the genotypes of his/her parents at the corresponding position. It describes from whom each allele is inherited. 'a' indicates whether the allele is germline or a de novo mutation. Its value is one of [g, d], representing germline and de novo, respectively.
    (b-c): the '-' inside the brackets separates two characters that describe the source of each allele. Values for 'b' and 'c' are one of [M, P, B, N], representing maternal, paternal, both, and none.
    gd-unknown means that the pipeline cannot tell the source of each allele for that sample. This happens when the parents' genotypes are unknown.
  • Example:
    g(B-N) means that both alleles of the sample are germline;
    d(M-P) means that the allele from the paternal side is a de novo mutation.
  • Application:
    1. Filter out records with d(b-c) in any of the case detailed information columns when germline genotypes are of primary interest;
    2. Filter out records with g(b-c) in any of the case detailed information columns when de novo mutations are of primary interest. If the variants are de novo SNVs, use the DNM tag to tell which are more reliable. If you are interested in de novo InDels, first keep records with "indel" in VarType and then focus on variants with a d(b-c) tag for interpretation.
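
The tag format can be parsed with a regular expression, which makes the two filters above easy to apply. This is a sketch under assumptions: the '|' separator and the 'fp' suffix appear in the value list for BGI_GD but are not explained on this page, so the code only records them without interpreting them. The function names are hypothetical.

```python
import re

# One 'a(b-c)' component, optionally followed by the undocumented 'fp' suffix.
_COMPONENT = re.compile(r"^(?P<a>[gd])\((?P<b>[MPBN])-(?P<c>[MPBN])\)(?P<fp>fp)?$")


def parse_bgi_gd(tag):
    """Parse a BGI_GD tag into a list of (a, b, c, has_fp_suffix) tuples.

    'gd-unknown' yields an empty list. Components are separated by '|'.
    """
    if tag == "gd-unknown":
        return []
    components = []
    for piece in tag.split("|"):
        match = _COMPONENT.match(piece)
        if match is None:
            raise ValueError(f"unrecognised BGI_GD component: {piece!r}")
        components.append(
            (match.group("a"), match.group("b"), match.group("c"),
             match.group("fp") is not None)
        )
    return components


def has_de_novo(tag):
    """True if any component of the tag is marked as de novo ('d')."""
    return any(component[0] == "d" for component in parse_bgi_gd(tag))


germline = parse_bgi_gd("g(B-N)")
ambiguous = parse_bgi_gd("d(M-P)|d(P-M)")
suffixed = parse_bgi_gd("d(N-B)fp")
unknown = parse_bgi_gd("gd-unknown")
```

When germline genotypes are of primary interest, records for which `has_de_novo` is true in any case column would be dropped; when de novo mutations are of interest, the opposite selection applies.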