Skip to content

Commit

Permalink
update QC-preseq and assembly documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
Michal Sakin committed Mar 29, 2021
1 parent c987342 commit 192dbc0
Show file tree
Hide file tree
Showing 5 changed files with 146 additions and 7 deletions.
107 changes: 100 additions & 7 deletions docs/source/assembly.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,9 @@ Additional Dependencies
+++++++++++++++++++++++

`cutadapt <https://cutadapt.readthedocs.io/en/stable/>`_ - sudo apt install cutadapt

`meryl <https://github.com/marbl/meryl>`_

`merqury <https://github.com/marbl/merqury/wiki>`_


Expand All @@ -20,9 +22,9 @@ In preparation
Kmer analysis, QV and completness
---------------------------------

In this section we will not use the long distance infomation that is captured by the Omni-C proximity ligation libraries. Instead, we take advandage of the uniform coverafe of Omni-C libraries and demonstrate how, after trimming, the reads can be used just as shotgun sequencing, for kmer analysis, measuring assembly completness and QV.
In this section we will not use the long distance infomation that is captured by the Omni-C proximity ligation libraries. Instead, we take advandage of the uniform coverage of Omni-C libraries and demonstrate how, after trimming, the reads can be used just as shotgun sequencing, for kmer analysis, measuring assembly completness and QV.

In short, kmer composition of the assembly and the reads will be used to evaluate the comletness of the assembly and bp quality.
In short, kmer composition of the assembly and the reads will be used to evaluate the completness of the assembly and bp quality.

Trimming
++++++++
Expand Down Expand Up @@ -58,21 +60,112 @@ Command:

.. code-block:: console
cutadapt -j 48 -b <bridge sequence> -B <bridge sequence> -o <trimmed_output_R1.fastq> -p <trimmed_output_R2.fastq> <input_R1.fastq> <input_R2.fastq>
cutadapt -j <cores> -b <bridge sequence> -B <bridge sequence> -o <trimmed_output_R1.fastq> -p <trimmed_output_R2.fastq> <input_R1.fastq> <input_R2.fastq>
Example:

.. code-block:: console
cutadapt -j 16 -b GGTTCGTCCA -B GGTTCGTCCA -o trim_OmniC_800M_R1.fastq -p trim_OmniC_800M_R2.fastq OmniC_800M_R1.fastq OmniC_800M_R2.fastq
Generate kmer databases
+++++++++++++++++++++++

If you don't know the optimal kmer size for your genome, run the ``best_k.sh`` script from the merqury repository

Command:

.. code-block:: console
./best_k.sh <genome_size>
Example:

.. code-block:: console
./best_k.sh 3100000000
In this example, human genome size the optimal kmer size is 21.

Next, use the ``count`` utility of ``meryl`` tool to generate a meryl database from each one of the fastq files

Command:

.. code-block:: console
meryl k=<k size> count output <output name> <input file>
Example:

.. code-block:: console
meryl k=21 count output R1_21.meryl trim_OmniC_800M_R1.fastq
meryl k=21 count output R2_21.meryl trim_OmniC_800M_R2.fastq
The output from the example above is two directories: R1_21.meryl and R2_21.meryl, each one containing kmer database generated based on one fastq file. For downstream steps, all the kmer databases need to be merged into one database using the ``union-sum`` utility of ``meryl``.

Command:

.. code-block:: console
meryl union-sum output <output name> <path to inputs>
Example:

.. code-block:: console
meryl union-sum output reads_21.meryl *meryl
Evaluate assembly completness and QV
++++++++++++++++++++++++++++++++++++

Merqury was developed to allow reference free assembly evaluation, based on k-mer spectrum of low error rate reads (Omni-C in this case). For more details see `merqury documentation <https://github.com/marbl/merqury/wiki/2.-Overall-k-mer-evaluation>`_ . Merqury accept as an input the assembly of interest and meryl kmer database and outputs information on reads and assembly spectrum, assembly kmer completness, and assembly base level QV estimation.

Command:

.. code-block:: console
merqury.sh <kmer DB> <assembly> <output prefix>
Example:

.. code-block:: console
cutadapt -j 48 -b GGTTCGTCCA -B GGTTCGTCCA -o trim_OmniC_800M_R1.fastq -p trim_OmniC_800M_R2.fastq OmniC_800M_R1.fastq OmniC_800M_R2.fastq
merqury.sh reads_21.meryl asm.fasta merqury_out
kmer database
=============
Detailed describtion of the outputs can be found `here <https://github.com/marbl/merqury/wiki/2.-Overall-k-mer-evaluation>`_. Here are some highlights, the example below is using OmniC library of human sample HG002:

completeness.stats - details the assembly name (column #1), assembly k-mers used in the analysis (column #3), reads k-mers used in the analysis (column #4) and % kmer completness in the assembly (column #5).

Example:

.. code-block:: console
HG002.pri.ctg all 2243267808 2305938845 97.2822
K-mers that are found only in the assembly are assumed to be bp errors, the <output prefix>.qv summarize the QV results across the assembly. An additional file, <output prefix><assembly>.qv details QV values for each scaffold in the assembky. Assembly summary QV file include the following details: assembly name (column #1), assembly unique kmers (column #2), kmers shared in assembly and reads (column #3), QV (column #4), Error rate (column #5). See example below:

Example:

.. code-block:: console
HG002.ctg 726218 3063065775 49.4726 1.12912e-05
meryl union-sum output PUR1715_reads.meryl *meryl
1 change: 1 addition & 0 deletions docs/source/before_you_begin.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ Make sure that the following dependencies are installed:
- `bwa <https://github.com/lh3/bwa>`_
- `pairtools <https://github.com/open2c/pairtools>`_
- `samtools <https://github.com/samtools/samtools>`_
- `preseq <https://github.com/smithlabcode/preseq>`_

If you are facing any issues with the installation of any of the dependencies, please contact the supporter of the relevant package.

Expand Down
Binary file added docs/source/images/preseq.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
38 changes: 38 additions & 0 deletions docs/source/library_qc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -44,3 +44,41 @@ After the script completes, it will print:
Library Complexity
==================

If you preformed a shallow sequencing experiment (e.g. 2M reads) and running a QC analysis to decide which library to use for deep sequencing (DS), it is recommended to evaluate the complexity of the library before moving to DS.

The `lc_extrap` utility of the `preseq` package aims to predict the complexity of sequencing libraries.


``pairtools parse`` options:


.. csv-table::
:file: tables/preseq.csv
:header-rows: 1
:widths: 20 20 60
:class: tight-table


``preseq lc_extrap`` command example for extrapolating library complexity:

**Command:**

.. code-block:: console
preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output <output file> <input bam file>
**Example:**

.. code-block:: console
preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output out.preseq mapped.PT.bam
In this example the output file `out.preseq` will detail the extrapolated complexity curve of your library, with the number of reads in the first column and the expected distinct read value in the second column. For a typical experiment (human sample) check the expected complexity at 300M reads (to show the content of the file, type **cat out.preseq**). Expected unique pairs at 300M sequencing is at least ~ 120 million.

.. image:: /images/preseq.png

7 changes: 7 additions & 0 deletions docs/source/tables/preseq.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
"Parameter ",Value,Function
bam,,Specifies that the input file type is bam. Please note that for a bam file to be a recognized input file htslib sould be installed as well and preseq should be built with htslib support (for more details see preseq documentation or our installDep.sh script as example)
pe,,Specifies that paired end data is used
extrap,2.10E+09,Maximum extrapolation
step,1.00E+08,Extrapolation step size
seg_len,1000000000,maximum segment length when merging paired end bam
output,,output file

0 comments on commit 192dbc0

Please sign in to comment.