Sex quality control for Next Generation Sequencing data.

genomicsITER/sexQC-for-NGS-data

Sex Inference as a Quality Control for NGS data

Inferring the genetic sex of a sample from its Next Generation Sequencing (NGS) data is a mandatory quality control step: it helps detect errors in the provided metadata and contributes to sample traceability. Here, we explain how to use the self-reported sex of an individual and two different bioinformatic approaches, Somalier and a Sexi-Heuristic based analysis, to infer the genetic sex of a sample and perform a quality control analysis.


Approach 1: sex inference with Somalier

The first approach uses Somalier v0.2.15 (Pedersen et al. 2019), which lets us infer the sex of a sample from the read depth on the X and Y chromosomes at selected positions in the genome. The tool uses a total of 17,766 positions in coding regions so that it can work with data from different types of experiments (whole-exome and whole-genome sequencing, RNA-Seq, etc.). These positions meet several requirements: (i) they are frequently sequenced with high quality, (ii) their population allele frequency is around 0.5, and (iii) they lie outside segmental duplications, low-complexity regions, and regions near insertions and deletions. Relatedness among samples is calculated from the allelic concordance of single nucleotide variants at these positions (genotypes are classified as homozygous reference, heterozygous, or homozygous alternative).

Somalier uses a genomic VCF file (gVCF) to extract variant and non-variant information from these positions. The somalier extract command writes the data for these positions to a binary file. In a second step, somalier relate computes the results and creates an HTML file for visualization. Example code is shown below:

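# Path to Somalier binary
SOMALIER="/path/to/somalier_bin"

# Path to Sites file
sites="/path/to/sites.hg19.vcf.gz"

# Path to reference genome
ref="/path/to/ucsc.hg19.fasta"

# Path to your input VCF file
infile="/path/to/VCF_file"

# Output directory for the extracted .somalier files (placeholder)
outdir="/path/to/output_dir"

# Prefix for the output files of somalier relate (placeholder)
outname="/path/to/output_prefix"

# Run these commands
"${SOMALIER}" extract -d "${outdir}" --sites "${sites}" -f "${ref}" "${infile}"
"${SOMALIER}" relate -o "${outname}" "${outdir}"/*.somalier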

Visit the repository of Somalier at GitHub
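Once somalier relate has finished, the per-sample table it writes can be screened from the command line. The following is a minimal sketch, not part of the original workflow. It assumes that this table is a tab-separated file whose header includes columns named sample_id, X_depth_mean, and Y_depth_mean; these column names, the file name in the usage line, and the helper function somalier_xy_depth are assumptions that should be checked against the output of your Somalier version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print sample ID, mean X depth, and mean Y depth from a
# Somalier per-sample table. The column names are assumed, so they are looked
# up in the header instead of being hard-coded by position.
somalier_xy_depth() {
    local samples_tsv="$1"
    awk -F'\t' '
        NR == 1 {
            sub(/^#/, "", $1)                     # header may start with "#"
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("sample_id" in col) || !("X_depth_mean" in col) || !("Y_depth_mean" in col)) {
                print "expected columns not found in header" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            printf "%s\t%s\t%s\n", $(col["sample_id"]), $(col["X_depth_mean"]), $(col["Y_depth_mean"])
        }
    ' "$samples_tsv"
}

# Usage (assumed file name): somalier_xy_depth "${outname}.samples.tsv"
```

Samples with a Y depth close to zero would be expected to be female, which can then be compared with the self-reported sex.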


Approach 2: sex inference with a Heuristic

The second approach uses an in-house Sexi-Heuristic algorithm coded in Bash. The script lets us analyze the coverage, or vertical depth, of 11 selected genes on each of the X and Y chromosomes, all located in the non-pseudoautosomal regions (NPAR) (Table 1). We then assess the depth distribution across chromosomes X and Y in order to identify highly covered genes suitable for sex classification based on read depth.

Briefly, the Sexi-Heuristic follows this algorithm:

  1. Compute the coverage distribution in NPAR of chromosomes X and Y.
  2. Identify genes with high coverage.
  3. Extract selected gene regions from the corresponding BAM files.
  4. Filter out reads with a mapping quality lower than 50 (MQ<50).
  5. Obtain the number of reads per chromosome.
  6. Obtain the number of reads per gene.
  7. Create a TSV file with reads per sample, chromosome, and gene.
  8. Compute the normalized fraction of reads in chromosomes X and Y according to [Equation 1].
  9. Plot results.
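Steps 4, 5, and 7 can be sketched as a small Bash function. This is an illustration under stated assumptions, not the Sexi-Heuristic script itself: the helper name count_reads_per_chrom, the BED file genes.bed, and the BAM file name are hypothetical. The function reads SAM records from standard input, so it can be tested without a BAM file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read SAM records on stdin and print one TSV line per
# chromosome (sample, chromosome, read count), keeping only reads whose
# mapping quality (SAM column 5) is at least min_mapq. The chromosome name
# is taken from SAM column 3.
count_reads_per_chrom() {
    local sample="$1" min_mapq="${2:-50}"
    awk -F'\t' -v sample="$sample" -v q="$min_mapq" '
        /^@/ { next }                     # skip SAM header lines
        $5 + 0 >= q + 0 { n[$3]++ }       # drop reads with MQ below the cut-off
        END { for (chr in n) printf "%s\t%s\t%d\n", sample, chr, n[chr] }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k2,2
}

# Usage (assumed file names; samtools view -L restricts output to reads
# overlapping the regions of a BED file):
#   samtools view -L genes.bed sample.bam | count_reads_per_chrom sample
```

Restricting the input to the region of a single gene would give the per-gene counts of step 6 in the same way.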

Table 1. List of genes assessed in sex classification in both X and Y chromosomes.

| Chromosome | Genes |
| --- | --- |
| ChrX | RAB39B, ACTRT1, SSX1, F8, UBE2E4P, SSX9P, CMC4, FAM47B, SSX3, TEX13A, PPP1R2P9 |
| ChrY | PRORY, TBL1Y, EIF1AY, KDM5D, FAM41AY2, RPS4Y2, AMELY, XKRY, DAZ4, TSPY2, TXLNGY |

Equation 1. Fraction of reads per chromosome.

$$f_X = {Nreads_{chrX} \over Nreads_{chrX} + Nreads_{chrY}}$$

$$f_Y = {Nreads_{chrY} \over Nreads_{chrY} + Nreads_{chrX}}$$
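As a worked example of Equation 1, the sketch below computes both fractions from a pair of read counts. The helper name read_fractions and the counts in the example call are made up for illustration. A sample without any reads is reported as NA, which is a choice of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the read counts on chrX and chrY, print fX and
# fY (Equation 1) separated by a tab and rounded to three decimals.
read_fractions() {
    local n_x="$1" n_y="$2"
    awk -v x="$n_x" -v y="$n_y" 'BEGIN {
        total = x + y
        if (total == 0) { print "NA\tNA"; exit }   # no reads: fractions undefined
        printf "%.3f\t%.3f\n", x / total, y / total
    }'
}

read_fractions 9900 100   # prints fX=0.990 and fY=0.010 (tab-separated)
```

By construction, the two fractions always add up to 1, so either one is enough to place a sample.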


Theoretically, female samples have all reads mapped to chromosome X and none to chromosome Y, so fX=1 and fY=0. In practice, female samples show nearly all reads mapped to chromosome X and almost none to chromosome Y, resulting in fractions of fX≃1 and fY≃0.

In our testing and validation datasets (n=835 whole-exomes; publication in preparation), we observed that, in female samples, the normalized fraction of reads mapped to chromosome X, fX, lies within the interval [0.988, 1], and the normalized fraction of reads mapped to chromosome Y, fY, lies within the interval [0, 0.012].

Male samples have their mapped reads split between genes on the X and Y chromosomes. As a result, we observed a fraction of reads mapped to chromosome X within the interval [0.176, 0.593] and a fraction of reads mapped to chromosome Y within the interval [0.407, 0.824].

Male samples tend to show a higher dispersion than female samples in the calculated fractions, as a consequence of the observed variation in read depth between the X and Y chromosomes.

Figure 1 shows a scatter plot with the fraction of reads on chromosome X on the x-axis and the fraction of reads on chromosome Y on the y-axis for each sample. This plot shows three different clusters: one for female samples, one for male samples, and a third for samples with an uncertain sex assignment. Uncertainties in sex inference may arise from sample contamination (i.e. contamination with a sample of a different sex), sample swapping, sample labeling errors, etc.

Figure 1. Sex inference of multiple samples based on our Sexi-Heuristic analysis.
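The three clusters can be approximated with a simple rule on fX. The sketch below uses the intervals reported above for our validation dataset as cut-offs. Treating these observed ranges as hard thresholds is an assumption of this example, and the helper name classify_sex is hypothetical; other datasets may need different limits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label a sample from its fraction of reads on chrX (fX).
# The cut-offs are the ranges observed in the validation dataset; using them
# as hard thresholds is an assumption of this sketch.
classify_sex() {
    local f_x="$1"
    awk -v fx="$f_x" 'BEGIN {
        fx += 0                                  # force numeric comparison
        if (fx >= 0.988 && fx <= 1)              print "female"
        else if (fx >= 0.176 && fx <= 0.593)     print "male"
        else                                     print "uncertain"
    }'
}

# Usage: classify_sex 0.995
```

Samples labeled as uncertain are the ones worth reviewing for contamination, swaps, or labeling errors.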


Code for the Heuristic approach

  Detailed code with command usage for Sexi-Heuristic.


License and Attribution

This repository and data exports are released under the CC BY 4.0 license. Please acknowledge the authors of this repository, and the open source software used in this work (third-party copyrights and licenses may apply).

Please cite as: "Sex Inference as a Quality Control for NGS data, GitHub repository at https://github.com/genomicsITER/sexQC-for-NGS-data (accessed on YYYY-MM-DD)".


Participating

Want to share your relevant links? Send a direct message to @labcflores or @resocios on Twitter (see below).

By JMLS @resocios

Follow us on Twitter @labcflores


Update logs

September 2, 2022. The repository becomes public.
