tsv: creating a spreadsheet from a filtered VCF

Brent Pedersen edited this page Jun 27, 2019 · 4 revisions

slivar provides flexible filtering of VCFs, but for a final variant-by-variant analysis, clinicians and analysts usually prefer to have the data in a spreadsheet.

slivar tsv enables this.

Human-readable output

In order to get these VCFs into a spreadsheet format that a clinician might use, one can use the slivar tsv subcommand. This command can also use the gene annotations from VEP or bcftools and add other annotations using the gene name. For example, we can create a gene -> pLI lookup with this command:

wget -qO - https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz \
       | zcat \
       | cut -f 1,21,24 | tail -n+2 \
       | awk '{ printf("%s\tpLI=%.3g;oe_lof=%.5g\n", $1, $2, $3)}' > pli.lookup
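To make the lookup format concrete, here is a minimal sketch that runs the same awk formatting step on two made-up rows. The gene names and values are hypothetical and the file name example.lookup is an assumption for illustration; the real input comes from the gnomAD download above.

```shell
# Hypothetical three-column input (gene, pLI, oe_lof); values are invented.
printf 'GENE1\t0.98765\t0.123456\nGENE2\t0.00012\t0.98\n' \
  | awk '{ printf("%s\tpLI=%.3g;oe_lof=%.5g\n", $1, $2, $3) }' > example.lookup

# Each line is: gene name, a tab, then the string to attach to that gene.
cat example.lookup
```

Each output line holds the gene name, a tab, and the text to attach to that gene, for example `pLI=0.988;oe_lof=0.12346` for GENE1. Note that awk formats a non-numeric field (such as `NA`) as 0, so it may be worth checking whether the source file contains such entries.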

The slivar tsv command allows specifying many of these gene -> value lookups. For example, it's often useful to have the gene description:

wget -qO - ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id \
    | cut -f 2,5 \
    | grep -v ^$'\t' > clinvar_gene_desc.txt
# -s: INFO fields that were added in previous slivar commands
# -i: any other INFO fields to add as columns
# -c: the consequence field (BCSQ from bcftools, or CSQ if VEP was used)
# -g: gene lookups; the pLI and description are looked up by gene,
#     and a column is added for each
slivar tsv \
  -s denovo \
  -s x_denovo \
  -s recessive \
  -s x_recessive \
  -i gnomad_popmax_af -i gnomad_popmax_af_filter -i gnomad_nhomalt \
  -c BCSQ \
  -g pli.lookup \
  -g clinvar_gene_desc.txt \
  -p $ped \
  vcfs/$cohort.vcf > $cohort-variants.tsv
   
# repeat for compound-hets VCF
slivar tsv \
  -s slivar_comphet \
  -i gnomad_popmax_af -i gnomad_popmax_af_filter -i gnomad_nhomalt \
  -c BCSQ \
  -g pli.lookup \
  -g clinvar_gene_desc.txt \
  -p $ped \
  vcfs/$cohort.ch.vcf > $cohort-compound-hets.tsv

These two files contain the same columns, so they can be concatenated as needed.
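One way to concatenate them is sketched below. It assumes each file starts with exactly one header line; the stand-in files, their column names, and the output name `$cohort-all.tsv` are hypothetical and exist only to make the example self-contained.

```shell
# Hypothetical stand-ins for the two slivar tsv outputs (invented columns).
cohort=example
printf '#mode\tgene\ndenovo\tGENE1\n' > "$cohort-variants.tsv"
printf '#mode\tgene\nslivar_comphet\tGENE2\n' > "$cohort-compound-hets.tsv"

# Keep the first file whole (header included), then append only the
# data rows of the second file by skipping its first line.
{
  cat "$cohort-variants.tsv"
  tail -n +2 "$cohort-compound-hets.tsv"
} > "$cohort-all.tsv"
```

Skipping the second header keeps the combined file loadable as a single table in a spreadsheet program.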
