Format_english_specification v20151001

Format specification for candidates list

Position and variant information

Example

CHR	POS	ID	REF	ALT	VarType	Allele
1	987189	rs540580770	C	T	snv	T

Specifications

Column	Specification
CHR	Chromosome
POS	Position
ID	Identifiers derived from the dbSNP or COSMIC database; `.` means there is no record for this position
REF	Reference base(s)
ALT	Comma separated list of alternate non-reference alleles called on at least one of the samples
VarType	Variant type
Allele	One of the bases listed in `ALT`

Common practice

Use Excel's built-in filter on VarType to select variants of a specific variant type;
Use Excel's built-in filter on CHR to select variants located on specific chromosomes;
Use Excel's built-in filter on POS to select variants at specific positions;
Use Excel's built-in filter on ID to select variants with specific identifiers.

Figure: Position and variant information
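The Common practice filters above can also be applied outside Excel. The following is a minimal sketch, assuming the candidate list is a tab-separated text file with the header shown in the Example. The helper `filter_rows` and the second record are hypothetical and only serve the illustration.

```python
import csv
import io

# Two-record excerpt: the first record is the example from this section,
# the second is made up for illustration only.
SAMPLE = (
    "CHR\tPOS\tID\tREF\tALT\tVarType\tAllele\n"
    "1\t987189\trs540580770\tC\tT\tsnv\tT\n"
    "2\t12345\t.\tAT\tA\tindel\tA\n"
)


def filter_rows(rows, **criteria):
    """Keep rows whose columns equal every requested value, e.g. VarType='snv'."""
    return [r for r in rows if all(r[col] == val for col, val in criteria.items())]


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
snvs_on_chr1 = filter_rows(rows, CHR="1", VarType="snv")
print([r["ID"] for r in snvs_on_chr1])  # ['rs540580770']
```

All values are compared as strings, which mirrors what an exact-match filter in a spreadsheet does.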

Note

When assigning identifiers in ID, the pipeline considers only the position of the variant, not the bases listed in ALT. If the record is an InDel, an identifier is reported as long as its position lies within the InDel region;

Genotypes in the candidate list follow the VCF format. The allele values are 0 for the reference allele (what is in REF), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. If a call cannot be made for a sample at a given locus, '.' is specified for each missing allele;

A position has several records if more than one alternate allele is found there (comma-separated list of alternate non-reference alleles in ALT).
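The genotype encoding described in the note can be illustrated with a short sketch. The function name `decode_genotype` is hypothetical; it assumes the usual VCF separators `/` (unphased) and `|` (phased).

```python
def decode_genotype(gt, ref, alt):
    """Translate a VCF-style genotype such as '0/1' into actual bases.

    Index 0 is REF, index 1 the first ALT allele, index 2 the second, and so on.
    A missing allele ('.') is returned as None.
    """
    alleles = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"
    return [None if idx == "." else alleles[int(idx)] for idx in gt.split(separator)]


print(decode_genotype("0/1", "C", "T"))    # ['C', 'T']
print(decode_genotype("1/2", "C", "T,G"))  # ['T', 'G']
print(decode_genotype("./.", "C", "T"))    # [None, None]
```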

Allele frequency in various databases

Example

# of databases (AF>0.005)	KG	ESP	PVFD	IN-HOUSE-1	IN-HOUSE-2	IN-HOUSE-3	CG	HapMap	Wellderly	ExAC
0	0.0006	0.0002	0.0005	0.0005	-1	-1	-1	-1	-1	0.00074549

Specifications

Column	Specification
# of databases (AF>0.005)	Number of databases in which the frequency of the allele (given in the Allele column) is >0.005. Ten databases (public and in-house) are used in this pipeline. If no frequency is available, '-1' is shown in the corresponding column.
KG	Public database, ~2500 WES/WGS
ESP	Public database, ~6500 WES
PVFD	BGI in-house database, ~1000 WES/WGS
IN-HOUSE-1	BGI in-house database, ~1000 WES
IN-HOUSE-2	BGI in-house database, ~1000 WES
IN-HOUSE-3	BGI in-house database, ~200 WGS
CG	Public database, 51 WGS
HapMap	Public database, 270 WGS
Wellderly	Public database, 454 WGS
ExAC	Public database, ~60,000 WES

Recommended manipulations

Quick filter: Filter out records whose # of databases (AF>0.005) is greater than N. N could be 0, 1, 2, or 3.
Advanced filter: Use different thresholds for different databases. The most frequently used threshold is 0.005. Researchers are encouraged to adjust the threshold based on the prevalence of the specific disease.

Figure: Population frequency databases
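Both the quick filter and the advanced filter amount to counting the databases in which the allele frequency exceeds a threshold. The sketch below assumes the ten column names from the Example and treats '-1' as "no frequency available"; the names `count_common` and `keep_record` are hypothetical.

```python
DATABASES = [
    "KG", "ESP", "PVFD", "IN-HOUSE-1", "IN-HOUSE-2",
    "IN-HOUSE-3", "CG", "HapMap", "Wellderly", "ExAC",
]


def count_common(record, thresholds=None, default=0.005):
    """Count databases whose allele frequency exceeds the (per-database) threshold."""
    thresholds = thresholds or {}
    count = 0
    for db in DATABASES:
        af = float(record[db])
        if af < 0:  # '-1' marks a database without a frequency for this allele
            continue
        if af > thresholds.get(db, default):
            count += 1
    return count


def keep_record(record, n=0, thresholds=None):
    """Quick filter: keep the record unless more than n databases exceed the threshold."""
    return count_common(record, thresholds) <= n


# Example row from this section.
record = dict(zip(DATABASES, [
    "0.0006", "0.0002", "0.0005", "0.0005", "-1",
    "-1", "-1", "-1", "-1", "0.00074549",
]))
print(count_common(record))                    # 0
print(count_common(record, {"ExAC": 0.0005}))  # 1
```

Passing a `thresholds` dictionary corresponds to the advanced filter, where a stricter or looser cutoff is chosen for individual databases.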

BGI in-house capture chips databases

Example

SureSelect_Human_All_Exon_V1	SureSelect_Human_All_Exon_V2	SureSelect_Human_All_Exon_50Mb	SureSelect_Human_All_Exon_V4	SureSelect_Human_All_Exon_V4+UTRs	SeqCapEZ_Exome_v2.0	SeqCapEZ_Exome_v3.0
19	702	274	21	6	643	48

Specifications

Column	Capture Platforms	Capture chips	# of samples
SureSelect_Human_All_Exon_V1	Agilent	38Mbp	71
SureSelect_Human_All_Exon_V2	Agilent	44Mbp	1103
SureSelect_Human_All_Exon_50Mb	Agilent	50Mbp	390
SureSelect_Human_All_Exon_V4	Agilent	51Mbp	27
SureSelect_Human_All_Exon_V4+UTRs	Agilent	71Mbp	7
SeqCapEZ_Exome_v2.0	NimbleGen	44.1Mbp	1026
SeqCapEZ_Exome_v3.0	NimbleGen	64Mbp	82

Recommended manipulations

Choose a threshold for each capture chip. A threshold of 20 is a common choice.

Figure: BGI in-house capture chip information
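A per-chip threshold check could look like the sketch below. It assumes that each column holds the number of in-house samples on that chip in which the variant was detected, and that a count above the threshold flags the record as a possible systematic error; the document does not state the comparison direction explicitly, so treat this as an assumption. The name `chips_over_threshold` is hypothetical.

```python
CHIPS = [
    "SureSelect_Human_All_Exon_V1", "SureSelect_Human_All_Exon_V2",
    "SureSelect_Human_All_Exon_50Mb", "SureSelect_Human_All_Exon_V4",
    "SureSelect_Human_All_Exon_V4+UTRs", "SeqCapEZ_Exome_v2.0",
    "SeqCapEZ_Exome_v3.0",
]


def chips_over_threshold(counts, thresholds=None, default=20):
    """Return the chips on which the variant count exceeds the chosen threshold."""
    thresholds = thresholds or {}
    return [chip for chip in CHIPS if counts[chip] > thresholds.get(chip, default)]


# Example row from this section.
counts = dict(zip(CHIPS, [19, 702, 274, 21, 6, 643, 48]))
flagged = chips_over_threshold(counts)
print(len(flagged))  # 5
```

With the default of 20, only SureSelect_Human_All_Exon_V1 (19) and SureSelect_Human_All_Exon_V4+UTRs (6) stay below the threshold for the example row.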

Note

In-house capture chip databases are used to remove systematic errors. If a variant has a low frequency in the population but is frequently detected by NGS, it is more likely to be a systematic error than a true variant;

In-house capture chip databases are used only in non-profit projects.

Prediction of deleteriousness and conservation

Example

# of tools (predicted harmful or conserved)	SIFT	PolyPhen2_HDIV	PolyPhen2_HVAR	LRT	MutationTaster	MutationAssessor	FATHMM	GERP_plus	PhyloP	SiPhy	Gerp	PhastCons	GWAWA
6	0.03	0.979	0.92	0.000074	0	4.06	2.16	3.74	0.783	11.0086	3.74	0.997	0.42

Specifications

Column	Specification
# of tools (predicted harmful or conserved)	The number of hazard prediction tools that have predicted the variant as a deleterious mutation
SIFT	Deleterious(<0.05)
PolyPhen2_HDIV	Probably damaging (>=0.957), possibly damaging (0.453<=pp2_hdiv<=0.956); benign (<=0.452)
PolyPhen2_HVAR	Probably damaging (>=0.909), possibly damaging (0.447<=pp2_hvar<=0.908); benign (<=0.446)
LRT	-
MutationTaster	-
MutationAssessor	Deleterious(>1.938)
FATHMM	Deleterious(<-1.5)
GERP_plus (GERP++)	Deleterious(>3)
PhyloP	Deleterious(>2.5)
SiPhy	-
Gerp
PhastCons	Deleterious(>0.6)
GWAWA	-

Recommended manipulations

Quick filter: Filter out records whose # of tools (predicted harmful or conserved) is less than N. Because hazard prediction tools have high false-positive and false-negative rates, this information is for reference only; N can therefore be 0;
Advanced filter: Use different thresholds for different tools.

Figure: Deleteriousness prediction
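The advanced filter can be sketched as a table of per-tool cutoffs. The rules below are taken from the Specifications table, using only the strictest category for PolyPhen2 ("probably damaging") and skipping tools listed without a cutoff. This is an assumption about how to apply the table, not the pipeline's documented counting rule, although it happens to give 6 for the example row. The name `harmful_tools` is hypothetical.

```python
import operator

# (comparison, cutoff) per tool, copied from the Specifications table.
RULES = {
    "SIFT": (operator.lt, 0.05),
    "PolyPhen2_HDIV": (operator.ge, 0.957),
    "PolyPhen2_HVAR": (operator.ge, 0.909),
    "MutationAssessor": (operator.gt, 1.938),
    "FATHMM": (operator.lt, -1.5),
    "GERP_plus": (operator.gt, 3),
    "PhyloP": (operator.gt, 2.5),
    "PhastCons": (operator.gt, 0.6),
}


def harmful_tools(scores, rules=RULES):
    """List the tools whose score crosses its deleterious/conserved cutoff."""
    return [
        tool
        for tool, (compare, cutoff) in rules.items()
        if tool in scores and compare(scores[tool], cutoff)
    ]


# Example row from this section.
scores = {
    "SIFT": 0.03, "PolyPhen2_HDIV": 0.979, "PolyPhen2_HVAR": 0.92,
    "LRT": 0.000074, "MutationTaster": 0, "MutationAssessor": 4.06,
    "FATHMM": 2.16, "GERP_plus": 3.74, "PhyloP": 0.783, "SiPhy": 11.0086,
    "Gerp": 3.74, "PhastCons": 0.997, "GWAWA": 0.42,
}
print(len(harmful_tools(scores)))  # 6
```

Changing a cutoff, or adding a rule for a tool without one, only requires editing `RULES`.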

Annotations for genes

Example

OMIM	GeneTag	GO_BP	GO_MF	GO_CC	KEGG_Pathway	Proteins_Expression_profiles_of_Normal_Tissue
Neutrophilia, hereditary, 162830 (3)	novel	GO:0006952,defense response\|GO:0007155,cell adhesion\|GO:0007165,signal transduction	GO:0004872,receptor activity\|GO:0004896,cytokine receptor activity	GO:0005576,extracellular region\|GO:0005886,plasma membrane\|GO:0005887,integral to plasma membrane	hsa04060,Cytokine-cytokine receptor interaction\|hsa04630,Jak-STAT signaling pathway\|hsa04640,Hematopoietic cell lineage\|hsa05200,Pathways in cancer	"ENSG00000119535" "skin 2" "epidermal cells" "Not detected" "APE" "Supportive"

Specifications

Column	Specification
OMIM	Annotations derived from OMIM database for the mutated gene
GeneTag	Whether the mutated gene has been reported for the corresponding disease before. If the mutation is in a known causal gene for the disease under study, this column shows "known" for that variant; otherwise it shows "novel". This depends on the information (HGNC symbols of known causative genes) provided to the pipeline
GO_BP	Gene Ontology annotation in terms of biological process
GO_MF	Gene Ontology annotation in terms of molecular function
GO_CC	Gene Ontology annotation in terms of cellular component
KEGG_Pathway	KEGG pathway annotation
Proteins_Expression_profiles_of_Normal_Tissue	Protein expression profiles in normal tissue from The Human Protein Atlas

Recommended manipulations

Use Excel's built-in filter to view variants in known causative genes via the tag 'known' in GeneTag.
If you are interested in genes with information in OMIM, filter on OMIM.
If you are interested in genes expressed in a certain tissue, filter on Proteins_Expression_profiles_of_Normal_Tissue (The Human Protein Atlas).
If you are interested in genes with GO or KEGG annotation, filter on the corresponding columns.

Figure: Gene-level annotations
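The GO and KEGG cells pack several terms into one value. A minimal sketch for splitting them follows, assuming terms are separated by `|` (shown as `\|` in the example above) and that each term is an identifier followed by a comma and a name. Treating an empty cell or `.` as "no annotation" is also an assumption, and `parse_terms` is a hypothetical name.

```python
def parse_terms(cell):
    """Split a GO_* or KEGG_Pathway cell into (identifier, name) pairs."""
    if cell.strip() in ("", "."):
        return []
    terms = []
    # Accept both a plain '|' and the escaped form '\|'.
    for item in cell.replace("\\|", "|").split("|"):
        identifier, _, name = item.partition(",")
        terms.append((identifier.strip(), name.strip()))
    return terms


go_mf = "GO:0004872,receptor activity|GO:0004896,cytokine receptor activity"
print(parse_terms(go_mf))
# [('GO:0004872', 'receptor activity'), ('GO:0004896', 'cytokine receptor activity')]

kegg = "hsa04060,Cytokine-cytokine receptor interaction|hsa04630,Jak-STAT signaling pathway"
print([identifier for identifier, _ in parse_terms(kegg)])  # ['hsa04060', 'hsa04630']
```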

Annotations for transcripts

Due to alternative splicing, a single gene may have multiple transcripts, and the same variant may have different consequences in different transcripts. For each variant, the pipeline provides annotations for two transcripts that contain the variant: the transcript whose function is most affected by the mutation, and the canonical transcript of the gene as defined by Ensembl.

Example

CLIN_SIG	IMPACT	Consequence	HGNC(SYMBOL)	Feature	BIOTYPE	HGVSc	HGVSp	EXON	INTRON	DOMAINS	SWISSPROT	TREMBL	UNIPARC	SIFT	PolyPhen
.	MED:11	missense_variant	EPHA2	ENST00000358432	protein_coding	ENST00000358432.5:c.71C>T	ENSP00000351209.5:p.Ala24Val	'1/17'	'.'	Low_complexity_(Seg):Seg&Cleavage_site_(Signalp):Sigp&PIRSF_domain:PIRSF000666	EPHA2_HUMAN	Q96HF4_HUMAN&Q8IZL0_HUMAN	UPI00000731AB	tolerated(0.56)	possibly_damaging(0.75)

Specifications

Column	Specification
CLIN_SIG	Clinical significance of the variant from dbSNP. It can be one of the following values: [unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other]
IMPACT	Impact of variant, ranging from 1 to 34, with "1" meaning the most serious impact to the function
Consequence	Consequence type caused by the corresponding variant; for details, refer to the VEP consequences documentation
HGNC(SYMBOL)	Gene symbol from HGNC
Feature	Ensembl stable ID of feature
BIOTYPE	Biotype of transcript or regulatory feature
HGVSc	The HGVS coding sequence name
HGVSp	The HGVS protein sequence name
EXON	The exon number involved in the mutation (out of total number)
INTRON	The intron number involved in the mutation (out of total number)
DOMAINS	The source and identifier of any overlapping protein domains
SWISSPROT	UniProtKB/Swiss-Prot identifier of protein product
TREMBL	UniProtKB/TrEMBL identifier of protein product
UNIPARC	UniParc identifier of protein product
SIFT	Prediction and/or score from the hazard prediction tool SIFT
PolyPhen	Prediction and/or score from the hazard prediction tool PolyPhen

Recommended manipulations

Filter based on information of the transcript whose function is most affected.
Filter based on values in IMPACT.
Use prediction from SIFT and PolyPhen as references.
Filter based on information of the canonical transcript.
Use information such as HGVS, Exon, Intron, and DOMAINS as references.

Figure: Transcript-level annotations

Quality control

Filter
ACC,TR,PASS
ACC,TR,VQSRTrancheSNP99.00to99.90

Specifications

Field	Specification	Values
Concordance	Two variant-calling strategies are applied in parallel. The first is the common practice of using GATK HaplotypeCaller, and the second calls variants on cohorts of samples using HaplotypeCaller in GVCF mode. The results of the two strategies can differ. Since it is not yet possible to tell which is correct, the tag in the first field indicates whether the two results are concordant: "ACC" means they are concordant, while "DIFF" means they are not	ACC, DIFF
Target region or not	Whether the variant is inside ("TR") or outside ("FLANK") the target region of the capture chip used for this project	TR, FLANK
GATK filters	Whether the corresponding variant has passed GATK Variant Quality Score Recalibration (VQSR). Only the tag "PASS" means the variant is reliable according to the recalibration procedure. However, false positives and false negatives remain possible.	PASS, LowQual, VQSRTrancheSNP99.00to99.90, VQSRTrancheINDEL99.00to99.90, VQSRTrancheSNP99.90to100.00, VQSRTrancheINDEL99.90to100.00

Note

Target regions are those covered exactly by the designed probes. Generally speaking, such regions should have been well sequenced. Flanking regions are the +/-200bp around each target region. Although flanking regions might not be sequenced as well as target regions, it is recommended to make the best use of them.
Different variant calling tools might produce different genotypes for the same position. The pipeline integrates four calling tools (GATK, SOAPsnp, SAMtools, and Platypus). It is not uncommon for these tools to produce discordant genotypes for the same position. The pipeline modifies the genotype when there is a discrepancy. However, not all modifications are correct. Sanger sequencing has the final say in such situations.
The pipeline performs GATK VQSR on variants called by GATK. Briefly, VQSR uses a Gaussian mixture model to score variant quality for filtering purposes. Variants with "PASS" are treated as reliable.
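The three comma-separated tags in the Filter column can be split into named fields before filtering. The sketch below assumes the field order given in the Specifications table (concordance, target region, GATK filter); the names `QC`, `parse_qc`, and `is_reliable` are hypothetical, and the reliability rule is one possible strict choice rather than a pipeline default.

```python
from typing import NamedTuple


class QC(NamedTuple):
    concordance: str  # ACC or DIFF
    region: str       # TR or FLANK
    gatk_filter: str  # PASS, LowQual, or a VQSR tranche tag


def parse_qc(value):
    """Split a Filter value such as 'ACC,TR,PASS' into its three tags."""
    concordance, region, gatk_filter = value.split(",", 2)
    return QC(concordance, region, gatk_filter)


def is_reliable(qc):
    """Strict choice: both calling strategies agree and VQSR passed."""
    return qc.concordance == "ACC" and qc.gatk_filter == "PASS"


print(is_reliable(parse_qc("ACC,TR,PASS")))                        # True
print(is_reliable(parse_qc("ACC,TR,VQSRTrancheSNP99.00to99.90")))  # False
```

Because flanking regions are still worth using, the sketch deliberately does not reject `FLANK` records.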

Figure: Quality control annotations

Cosegregation

Example

AD	AR	XL
AD:YY:2:3:0	AR:YN:0:3:0	XL:YN:0:0:0

Specifications

Field	Specification	Value
Mode of inheritance		AD/AR/XL/Compound Heterozygous
Genotype		Y/N
Cosegregate		Y/N
Number of cases that fit cosegregation		[0, #cases]
Number of controls that fit cosegregation		[0, #controls]
Number of individuals with unknown phenotype that fit cosegregation		[0, #individuals with unknown phenotype]

Recommended manipulations

Since mutations causing rare diseases tend to affect gene function directly, we usually assume that causal mutations cosegregate with affected individuals within a family. The figure below shows how to filter on cosegregation when the disease under study follows AR inheritance. There are 2 affected samples (cases) and 3 unaffected samples (controls). "AR:YY:2:3:0" means that 2 cases and 3 controls have been successfully genotyped at the corresponding position, and all of their genotypes fit the cosegregation. Records with such tags should be kept for further investigation. "AR:NY:2:2:0" indicates that 1 control has not been genotyped, but the samples that were successfully genotyped fit the cosegregation. Variants of this kind should also be kept to avoid false negatives. Sanger sequencing can be a remedy when NGS and bioinformatic analysis fail to genotype samples.

Figure: Pedigree segregation information
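The cosegregation tag can be parsed into its fields before filtering. The sketch below assumes the colon-separated order from the Specifications table and reads the two-letter flag as "all samples genotyped" followed by "genotyped samples cosegregate", which is inferred from the "AR:YY" and "AR:NY" examples above. The names `Coseg` and `parse_coseg` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Coseg:
    mode: str            # AD, AR, XL, ...
    all_genotyped: bool  # first flag letter
    cosegregates: bool   # second flag letter
    cases: int
    controls: int
    unknown: int


def parse_coseg(tag):
    """Parse a tag such as 'AR:YY:2:3:0'."""
    mode, flags, cases, controls, unknown = tag.split(":")
    return Coseg(mode, flags[0] == "Y", flags[1] == "Y",
                 int(cases), int(controls), int(unknown))


# Keep records whose genotyped samples fit the AR pattern,
# including those where some samples could not be genotyped.
tags = ["AR:YY:2:3:0", "AR:NY:2:2:0", "AR:YN:0:3:0"]
kept = [t for t in tags if parse_coseg(t).cosegregates]
print(kept)  # ['AR:YY:2:3:0', 'AR:NY:2:2:0']
```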

Detailed sample information

Example

Detail_INFO_Format	detail-case-1	detail-case-2	detail-control-3	detail-control-4	detail-control-5	case-1	case-2	control-3	control-4	control-5
BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ	'd(M-P)\|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:2,6:45'	'd(M-P)\|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:7,3:86'	'gd-unknown:DNM-unknown:0/0:ROH-lt5M:0/0:11,0:21'	'gd-unknown:DNM-unknown:0/0:ROH-lt5M:0/0:6,0:.'	'g(B-N):DNM-unknown:0/0:ROH-Unkown:0/0:7,0:21'	'0/1'	'0/1'	'0/0'	'0/0'	'0/0'

Specification

Column / Field	Specification	Value
Detail_INFO_Format	This column specifies the tag types and their order (colon-separated). It is followed by one column per sample containing the colon-separated data in that order. Seven keywords are used.	BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ
BGI_GD	Tag given to each sample based on the genotypes of his/her parents at the corresponding position. It describes from whom each allele comes.	g(B-N), d(M-P)\|d(P-M), g(B-N), d(M-P), d(P-M), d(N-B)fp, d(M-P)\|d(N-B)fp, g(B-N)\|d(P-M), g(B-N)\|d(P-M), g(B-N)\|d(M-P), d(P-M)\|d(N-B)fp, g(B-N)\|d(M-P), gd-unknown, g(B-N), g(B-N)
DNM	Result of a random forest classifier that distinguishes true de novo SNVs	DNM-TP, DNM-FP, DNM-unknown
SL_GT	As mentioned before, there are two calling strategies in this pipeline: one at the sample level, the other at the family level. "SL_GT" is the genotype result from GATK at the sample level
ROH	Runs of homozygosity	ROH-unknown, ROH-lt5M, ROH-gt5M
FL_GT	Genotype result from GATK at family level
AD	Allelic read depths for the reference and alternate alleles
GQ	Genotype quality score derived from GATK at family level	0~99
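A per-sample detail cell can be decoded by pairing it with the Detail_INFO_Format string. The sketch below assumes cells may be wrapped in single quotes as in the Example and that `|` inside BGI_GD is the plain character (shown as `\|` above); `parse_detail` and `allelic_depths` are hypothetical names.

```python
def parse_detail(fmt, cell):
    """Pair each keyword of Detail_INFO_Format with the matching value of a sample cell."""
    return dict(zip(fmt.split(":"), cell.strip("'").split(":")))


def allelic_depths(detail):
    """Return the AD field as a list of integers (reference depth first)."""
    return [int(depth) for depth in detail["AD"].split(",")]


FORMAT = "BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ"
case_1 = parse_detail(FORMAT, "'d(M-P)|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:2,6:45'")

print(case_1["BGI_GD"])        # d(M-P)|d(P-M)
print(case_1["FL_GT"])         # 0/1
print(allelic_depths(case_1))  # [2, 6]
```

Values stay as strings, so a missing GQ ('.') is kept as-is instead of being converted.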

Recommended manipulations

ROH: When the pedigree under study comes from a consanguineous union, homozygosity mapping analysis is of great help. The ROH tag in each sample's detailed information column carries this information. As shown below, for an AR disease in a consanguineous family, first filter out records that do not follow the AR inheritance pattern, then keep those with ROH-gt5M. Larger ROH regions are more likely to be identical haplotypes that both parents inherited from a common ancestor.

Figure: Sample information, ROH
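The ROH step can be sketched as a check on the detail cells of the affected samples. It assumes the Detail_INFO_Format order shown earlier and requires every case to carry ROH-gt5M, which is one reasonable reading of "keep those with ROH-gt5M". The function names and the ROH-gt5M cells are hypothetical.

```python
FORMAT = "BGI_GD:DNM:SL_GT:ROH:FL_GT:AD:GQ"


def roh_tag(cell, fmt=FORMAT):
    """Extract the ROH tag from one sample's detail cell."""
    fields = dict(zip(fmt.split(":"), cell.strip("'").split(":")))
    return fields["ROH"]


def all_cases_in_long_roh(case_cells):
    """True when every case lies in a run of homozygosity tagged ROH-gt5M."""
    return all(roh_tag(cell) == "ROH-gt5M" for cell in case_cells)


# Case cells from the Example above (both tagged ROH-lt5M).
example_cases = [
    "'d(M-P)|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:2,6:45'",
    "'d(M-P)|d(P-M):DNM-FP:0/1:ROH-lt5M:0/1:7,3:86'",
]
# Made-up cells for a homozygous variant inside a long ROH.
hypothetical_cases = [
    "'g(B-N):DNM-unknown:1/1:ROH-gt5M:1/1:0,12:36'",
    "'g(B-N):DNM-unknown:1/1:ROH-gt5M:1/1:0,9:27'",
]
print(all_cases_in_long_roh(example_cases))       # False
print(all_cases_in_long_roh(hypothetical_cases))  # True
```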

Recommended manipulations

DNM: The BGI_GD and DNM tags can both be used to find de novo mutations. As the figure below shows, first keep records that cosegregate with the corresponding inheritance mode (usually AD), then use the DNM tag to locate de novo SNVs that the random forest classifier considers true positives.

Figure: Sample information, DNM

BGI_GD

Format:
"a(b-c)" or "gd-unknown"

Specification:
This tag is given to each sample based on the genotypes of his/her parents at the corresponding position. It describes from whom each allele is inherited.
a indicates whether the allele is germline or a de novo mutation. Its value is one of [g, d], which represent germline and de novo, respectively.
(b-c): the '-' inside the brackets separates two characters that describe the source of the allele. The values of 'b' and 'c' are drawn from [M, P, B, N], which represent maternal, paternal, both, and none.
gd-unknown means that the pipeline cannot tell the source of each allele for that sample. This happens when the parents' genotypes are unknown.

Example:
g(B-N) means that both alleles of the sample are germline;
d(M-P) means that the allele from the paternal side is a de novo mutation.

Application:
1. Filter out records with d(b-c) in any of the case detailed information columns when germline genotypes are of primary interest;
2. Filter out records with g(b-c) in any of the case detailed information columns when de novo mutations are of primary interest. If the variants are de novo SNVs, use the DNM tag to tell which are more reliable. For de novo InDels, first keep records with "indel" in VarType, then focus on variants with a d(b-c) tag for interpretation.
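The BGI_GD format lends itself to a small parser. The sketch below assumes that several a(b-c) parts may be joined with `|` and that a part may carry the suffix `fp`, as seen in the value list of the Specification table; the meaning of `fp` is not documented here, so it is only captured, not interpreted. The names `parse_bgi_gd` and `has_de_novo` are hypothetical.

```python
import re

# One "a(b-c)" part, optionally followed by the undocumented "fp" suffix.
_PART = re.compile(r"([gd])\(([MPBN])-([MPBN])\)(fp)?")


def parse_bgi_gd(tag):
    """Parse a BGI_GD tag into its parts; 'gd-unknown' yields an empty list."""
    if tag == "gd-unknown":
        return []
    parts = []
    for piece in tag.split("|"):
        match = _PART.fullmatch(piece)
        if match is None:
            raise ValueError(f"unrecognised BGI_GD part: {piece!r}")
        parts.append({
            "origin": match.group(1),        # 'g' germline or 'd' de novo
            "b": match.group(2),
            "c": match.group(3),
            "suffix": match.group(4) or "",
        })
    return parts


def has_de_novo(tag):
    """True when any part of the tag is marked de novo ('d')."""
    return any(part["origin"] == "d" for part in parse_bgi_gd(tag))


print(has_de_novo("d(M-P)|d(P-M)"))  # True
print(has_de_novo("g(B-N)"))         # False
print(has_de_novo("gd-unknown"))     # False
```

For germline analysis, records where `has_de_novo` is true for any case would be filtered out; for de novo analysis, the opposite applies.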

For feedback, please contact the rdscreening development group.
