Skip to content

Commit

Permalink
ADD: result.rst: Finish analysis topic
Browse files Browse the repository at this point in the history
  • Loading branch information
thiago-miller committed Dec 29, 2019
1 parent 1d4a3a4 commit ef20dee
Show file tree
Hide file tree
Showing 2 changed files with 187 additions and 6 deletions.
1 change: 1 addition & 0 deletions docs/meson.build
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ doc_images = files(
'images/logo_sideRETRO.png',
'images/orientation_opposite_strand.png',
'images/orientation_same_strand.png',
'images/result_confusion.png',
'images/result_heatmap.png',
'images/retrocopy.png'
)
Expand Down
192 changes: 186 additions & 6 deletions docs/result.rst
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@ Simulation
We used the **SANDY** tool (version v0.23), *A straightforward and complete next-generation
sequencing read simulator* [1]_, for simulate all 5 genomes according to the structural
variations that we desgined and according to the sampling. SANDY demands 2 steps for the
task: First we indexed all retrocopies for each individual and next we could simulate all
task: First we indexed all retrocopies for each individual and then we could simulate all
whole-genome sequencing.

.. code-block:: sh
Expand Down Expand Up @@ -163,6 +163,33 @@ whole-genome sequencing.
$REF_FASTA
done
As result we have a pair of FASTQ files (forward and reverse complement) for
each simulated individual. Next it is required to align our sequencing data
against the human reference genome in order to generate mapped files in SAM
format. We used BWA aligner (version 0.7.9) [2]_ for accomplish this task.

.. code-block:: sh
# Individual directories with the
# simulated data
IND_DIR=(ind1 ind2 ind3 ind4 ind5)
# Reference genome
REF_FASTA=hg38.fa
# Index reference genome
bwa index $REF_FASTA
# Alignment
for ind in "${IND[@]}"; do
bwa mem \
-t 10 \
$REF_FASTA \
$ind/out_R1_001.fastq.gz \
$ind/out_R2_001.fastq.gz > $ind/out.sam
done
Running sideRETRO
=================

Expand All @@ -171,8 +198,8 @@ to detect the designed somatic retrocopies.

.. code-block:: sh
# Our simulated BAM files list
LIST=(ind1/out.bam ind2/out.bam ind3/out.bam ind4/out.bam ind5/out.bam)
# Our simulated SAM files list
LIST=(ind1/out.sam ind2/out.sam ind3/out.sam ind4/out.sam ind5/out.sam)
# GENCODE annotation v31
ANNOTATION=gencode.v31.annotation.gff3.gz
Expand Down Expand Up @@ -207,20 +234,173 @@ to detect the designed somatic retrocopies.
result/out.db
# Finally run make-vcf
sider make-vcf --reference-file=$REF_FASTA result/out.db
sider make-vcf \
--reference-file=$REF_FASTA \
--prefix=simulation \
result/out.db
Download
========

Our output :file:`simulation.vcf`, as well as the files to be indexed by SANDY,
:file:`ind1.txt`, :file:`ind2.txt`, :file:`ind3.txt`, :file:`ind4.txt` and
:file:`ind5.txt`, can be downloaded at
:download:`simulation.tar.gz <data/simulation.tar.gz>`.

Analysis
========

All retrocopies were detected except for 5 retrocopies from the following parental
genes that were not detected in any individual: :file:`KLHL17`, :file:`MECP2`,
:file:`NPHP4`, :file:`OR4F5` and :file:`SMARCE1`. In a manual check, those
retrocopies do not present :ref:`abnormal reads <chap_methodology>` in any
individual - which could enable their recognition by our pipeline. However to make a
fair and reproducible analysis, we will consider it as a methodological error of our
tool. So sideRETRO found :math:`\frac{23}{28}` retrocopies, which means **82%**.

In more detail, we will show that result and more concerning:

1) `Genomic coordinate`_
2) `Zygosity`_

Genomic coordinate
------------------

Regarding the genomic coordinate, sideRETRO got it right **100%** for **chromosome**
and **strand**. The performance in detecting the insertion point position was
satisfactory with a MEDSE of **1 base**.

.. table:: Predicted position. Error = :math:`|actualPos - predictedPos|`

+---------------+------------------------+-------------------------+-------+
| Parental gene | Actual position | Predicted position | Error |
+===============+=======+===========+====+========+===========+====+=======+
| AMY2B | chr5 | 122895832 | \- | chr5 | 122895831 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| ATM | chr19 | 16443740 | \+ | chr19 | 16443738 | \+ | 2 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| BCL2 | chr4 | 146453786 | \+ | chr4 | 146453783 | \+ | 3 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| BRAF | chr18 | 22375812 | \- | chr18 | 22375811 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| BRCA1 | chr7 | 102911369 | \- | chr7 | 102911368 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| BRIP1 | chr12 | 113357216 | \- | chr12 | 113357216 | \- | 0 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| DVL1 | chr5 | 54105196 | \- | chr5 | 54105195 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| FANCD2 | chr3 | 121339674 | \+ | chr3 | 121339673 | \+ | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| FAT1 | chr16 | 65659952 | \- | chr16 | 65659951 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| GNB1 | chr12 | 70734022 | \+ | chr12 | 70734022 | \+ | 0 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| MUTYH | chr4 | 2716745 | \+ | chr4 | 2716760 | \+ | 15 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| PBX1 | chr21 | 22718521 | \- | chr21 | 22718520 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| PRCC | chr1 | 190777903 | \- | chr1 | 190777902 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| PTEN | chr2 | 140252560 | \- | chr2 | 140252559 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| RER1 | chr13 | 55179109 | \+ | chr13 | 55179108 | \+ | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| SAMD11 | chr4 | 14585135 | \+ | chr4 | 14585134 | \+ | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| SET | chr7 | 154178578 | \- | chr7 | 154178578 | \- | 0 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| SIRT1 | chrX | 120716688 | \- | chrX | 120716687 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| SOX2 | chr10 | 88689163 | \- | chr10 | 88689162 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| TMEM52 | chr1 | 82897536 | \+ | chr1 | 82897534 | \+ | 2 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| TP53 | chr5 | 42938944 | \- | chr5 | 42938943 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| TPM3 | chr7 | 3488208 | \- | chr7 | 3488207 | \- | 1 |
+---------------+-------+-----------+----+--------+-----------+----+-------+
| XPO1 | chr10 | 33172062 | \- | chr10 | 33172056 | \- | 6 |
+---------------+-------+-----------+----+--------+-----------+----+-------+

.. table:: Genomic coordinate detection summary.
MSE (Mean Squared Error), MEDSE (Median Squared Error).

+------------+--------+---------------------+
| | | Position |
| Chromosome | Strand +------------+--------+
| | | MSE | MEDSE |
+============+========+============+========+
| 100% | 100% | 12.7 bases | 1 base |
+------------+--------+------------+--------+

.. note:: The retrocopies from :file:`MUTYH` and :file:`XPO1` do not present
:ref:`splitted reads <chap_methodology>`, so they are annotated as *IMPRECISE*
at :file:`simulation.vcf` file. So it can explain why their insertion point position
detection has the biggest absolute error, which pulls the MSE up.

Zygosity
--------

With the exception of the 5 undetected retrocopies (parental genes :file:`KLHL17`,
:file:`MECP2`, :file:`NPHP4`, :file:`OR4F5` and :file:`SMARCE1`), it was possible
to accurately calculate the zygosity. Only for the :file:`MUTYH`'s retrocopy, the
zygosity was wrongly assigned as heterozygous for the individual :file:`IND2` - when
the true value was *homozygous alternate*. As :file:`MUTYH`'s retrocopy was
*imprecisely* annotated for the lack of splitted reads, we may expect a reference
allele depth overestimation on the predicted site - which explains the heterozygous
attribution.

.. figure:: images/result_heatmap.png
:scale: 75%
:align: center
:scale: 100%
:figwidth: 100%
:align: left

Heatmap summarizing the result for each individual

To calculate the performance in terms of reference/alternate allele detection, we
made a confusion matrix for each individual and an overall matrix with the junction
of all simulations:

.. figure:: images/result_confusion.png
:scale: 100%
:figwidth: 100%
:align: left

Confusion matrix of alleles detection performance

+-----+----------------------------------------+
| TP | True Positive |
+-----+----------------------------------------+
| FP | False Positive |
+-----+----------------------------------------+
| TN | True Negative |
+-----+----------------------------------------+
| FN | False Negative |
+-----+----------------------------------------+
| TPR | True Positive Rate or Sensitivity |
+-----+----------------------------------------+
| TNR | True Negative Rate or Specificity |
+-----+----------------------------------------+
| PPV | Positive Predictive Value or Precision |
+-----+----------------------------------------+
| NPV | Negative Predictive Value |
+-----+----------------------------------------+
| ACC | Accuracy |
+-----+----------------------------------------+

There was no *false positive* at all - as can be seen by the **precision** and
**specificity** with 100% value for all individuals. The worst **accuracy** was
at individual 2, with 91.07% value, and best was at individual 3, with 96.43%
value. The overall **sensitivity**, **specificity** and **accuracy** were: 81%,
100% and 93.21% value consecutively.

References and Further Reading
==============================

.. [1] Miller, Thiago et al. (2019).
galantelab/sandy: Release v0.23 (Version v0.23).
Zenodo. http://doi.org/10.5281/zenodo.2589575.
.. [2] Li H. and Durbin R. (2009).
Fast and accurate short read alignment with Burrows-Wheeler Transform.
Bioinformatics, 25:1754-60. [PMID: 19451168].

0 comments on commit ef20dee

Please sign in to comment.