Merge pull request #330 from gagneurlab/dev

Dev
gagneurlab · Jun 23, 2022 · 2833400
2 parents b86f008 + 233bd1c
commit 2833400

Showing 92 changed files with 2,731 additions and 619 deletions.
diff --git a/.gitignore b/.gitignore
@@ -12,6 +12,7 @@ __pycache__*
 *.egg-info*
 .eggs*
 dist/*
+*.pyc
 
 # typical latex tmp files
 *.aux

diff --git a/README.md b/README.md
@@ -3,6 +3,8 @@
 [![Version](https://img.shields.io/github/v/release/gagneurlab/drop?include_prereleases)](https://github.com/gagneurlab/drop/releases)
 [![Version](https://readthedocs.org/projects/gagneurlab-drop/badge/?version=latest)](https://gagneurlab-drop.readthedocs.io/en/latest)
 
+The Detection of RNA Outliers Pipeline (DROP) is an integrative workflow to detect aberrant expression, aberrant splicing, and mono-allelic expression from raw sequencing files. Since version 1.2.0, it also includes a module for RNA-seq variant calling.
+
 The manuscript is available in [Nature Protocols](https://www.nature.com/articles/s41596-020-00462-5). [SharedIt link.](https://rdcu.be/cdMmF)
 
 <img src="drop_sticker.png" alt="drop logo" width="200" class="center"/>
@@ -11,7 +13,7 @@ The manuscript is available in [Nature Protocols](https://www.nature.com/article
 DROP is available on [bioconda](https://anaconda.org/bioconda/drop).
 We recommend using a dedicated conda environment. (installation time: ~ 10min)
 ```
-mamba install -c conda-forge -c bioconda drop
+mamba create -n drop_env -c conda-forge -c bioconda drop
 ```
 
 Test installation with demo project
@@ -49,6 +51,14 @@ This shows you the rules of all subworkflows. Omit `-n` and specify the number o
 snakemake aberrantExpression --cores 10
 ```
 
+## Citation
+
+If you use DROP in research, please cite our [manuscript](https://www.nature.com/articles/s41596-020-00462-5).
+
+Furthermore, if you use the aberrant expression module, also cite [OUTRIDER](https://doi.org/10.1016/j.ajhg.2018.10.025); if you use the aberrant splicing module, also cite [FRASER](https://www.nature.com/articles/s41467-020-20573-7); and if you use the MAE module, also cite the [Kremer, Bader et al study](https://www.nature.com/articles/ncomms15824) and [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8).
+
+For the complete set of tools used by DROP (e.g. for counting), see the [manuscript](https://www.nature.com/articles/s41596-020-00462-5).
+
 ## Datasets
 The following publicly-available datasets of gene counts can be used as controls.
 Please cite as instructed for each dataset.

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -23,7 +23,7 @@
 author = 'Michaela Müller'
 
 # The full version, including alpha/beta/rc tags
-release_ = '1.1.4'
+release_ = '1.2.0'
 
 
 
@@ -36,6 +36,7 @@
 # ones.
 extensions = [
     "sphinx_rtd_theme",
+    'sphinx.ext.autosectionlabel'
 ]
 
 # Add any paths that contain templates here, relative to this directory.

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -2,7 +2,7 @@ DROP - Detection of RNA Outliers Pipeline
 ==========================================
 
 DROP is intended to help researchers use RNA-Seq data in order to detect genes with aberrant expression,
-aberrant splicing and mono-allelic expression. It consists of three independent modules for each of those strategies.
+aberrant splicing, mono-allelic expression, and RNA-Seq variant calling. It consists of four independent modules, one for each of those strategies.
 After installing DROP, the user needs to fill in the config file and sample annotation table (:doc:`prepare`).
 Then, DROP can be executed in multiple ways (:doc:`pipeline`).
 
@@ -13,6 +13,7 @@ Then, DROP can be executed in multiple ways (:doc:`pipeline`).
    installation
    prepare
    pipeline
+   output
    license
    help
 
@@ -21,10 +22,11 @@ Quickstart
 
 DROP is available on `bioconda <https://anaconda.org/bioconda/drop>`_.
 We recommend using a dedicated conda environment. (installation time: ~ 10min)
+Use ``mamba`` instead of ``conda`` as it provides more reliable and faster dependency solving.
 
 .. code-block:: bash
 
-    mamba install -c conda-forge -c bioconda drop
+    mamba create -n drop -c conda-forge -c bioconda drop
 
 Test installation with demo project
 

diff --git a/docs/source/installation.rst b/docs/source/installation.rst
@@ -9,6 +9,8 @@ In case the conda channel priority is set to ``strict``, it should be reset to `
     conda config --set channel_priority true
 
 We recommend using a dedicated conda environment (here: ``drop_env``) for installing drop.
+For installing, use ``mamba`` instead of ``conda`` as it provides more reliable and faster dependency solving.
+
 
 .. code-block:: bash
 
@@ -27,6 +29,7 @@ Test whether the pipeline runs through by setting up the demo dataset in an empt
     drop demo
 
 The pipeline can be run using `snakemake <https://snakemake.readthedocs.io/>`_ commands.
+Run time: ~25min
 
 .. code-block:: bash
 
@@ -111,11 +114,11 @@ Alternatively, DROP can be installed without ``conda``. In this case the followi
 
     * `tabix <https://www.htslib.org/download/>`_
 
-    * `samtools <https://www.htslib.org/download/>`_ >= 1.7
+    * `samtools <https://www.htslib.org/download/>`_ >= 1.9
 
-    * `bcftools <https://github.com/samtools/bcftools>`_ >= 1.7
+    * `bcftools <https://github.com/samtools/bcftools>`_ >= 1.9
 
-    * `GATK <https://software.broadinstitute.org/gatk/>`_ >= 4.0.4
+    * `GATK <https://software.broadinstitute.org/gatk/>`_ >= 4.1.8
 
     * `graphviz <https://www.graphviz.org/>`_
 
@@ -134,4 +137,3 @@ As this is a lengthy process, it might be desirable to install them in advance,
 
     # optional
     Rscript <path/to/drop/repo>/drop/installRPackages.R drop/requirementsR.txt
-
diff --git a/docs/source/output.rst b/docs/source/output.rst
@@ -0,0 +1,110 @@
+Results and Output of DROP
+===========================
+
+DROP is intended to help researchers use RNA-Seq data in order to detect genes with aberrant expression,
+aberrant splicing and mono-allelic expression. By simplifying the workflow process we hope to provide
+easy-to-read HTML files and output files. This section explains the results files. The paths of the output
+files correspond to the ones from the demo (that can be run with the following code snippet)::
+
+    #install drop
+    mamba create -n drop_env -c conda-forge -c bioconda drop
+    conda activate drop_env
+    
+    mkdir drop_demo
+    cd drop_demo
+    drop demo
+    
+    snakemake -c1
+
+Aberrant Expression
++++++++++++++++++++
+
+HTML file
+#########
+Looking at the resulting ``Output/html/drop_demo_index.html`` we can see the ``AberrantExpression`` 
+tab at the top of the screen. The Overview tab contains links to the:  
+
+* Counts Summaries for each aberrant expression group
+    * number of local and external samples
+    * Mapped reads and size factors for each sample
+    * histograms showing the mean count distribution with different conditions
+    * expressed genes within each sample and as a dataset
+* Outrider Summaries for each aberrant expression group
+    * aberrantly expressed genes per sample
+    * correlation between samples before and after the autoencoder
+    * biological coefficient of variation
+    * aberrant samples
+    * results table
+* Files for each aberrant expression group
+    * OUTRIDER datasets 
+        * Follow the `OUTRIDER vignette <https://www.bioconductor.org/packages/devel/bioc/vignettes/OUTRIDER/inst/doc/OUTRIDER.pdf>`_ for individual OUTRIDER object file (ods) analysis.
+    * Results tables
+        * ``results.tsv`` this text file contains only the significant genes and samples that meet the cutoffs defined in the config file for ``padjCutoff`` and ``zScoreCutoff``
+
+Local result files
+##################
+Additionally the ``aberrantExpression`` module creates the file ``Output/processed_results/aberrant_expression/{annotation}/outrider/{drop_group}/OUTRIDER_results_all.Rds``. This file contains the entire OUTRIDER results table regardless of significance.
+
+Aberrant Splicing
++++++++++++++++++
+
+HTML file
+##########
+Looking at the resulting ``Output/html/drop_demo_index.html`` we can see the ``AberrantSplicing`` 
+tab at the top of the screen. The Overview tab contains links to the:  
+
+* Counting Summaries for each aberrant splicing group
+    * number of local and external samples
+    * number of introns/splice sites before and after merging
+    * comparison of local and external mean counts
+    * histograms showing the junction expression before and after filtering and variability
+* FRASER Summaries for each aberrant splicing group
+    * the number of samples, introns, and splice sites 
+    * correlation between samples before and after the autoencoder
+    * results table
+* Files for each aberrant splicing group
+    * FRASER datasets (fds)
+        * Follow the `FRASER vignette <https://www.bioconductor.org/packages/devel/bioc/vignettes/FRASER/inst/doc/FRASER.pdf>`_ for individual FRASER object file (fds) analysis.
+    * Results tables
+        * ``results_per_junction.tsv`` this text file contains only significant junctions that meet the cutoffs defined in the config file. 
+
+Local result files
+##################
+Additionally the ``aberrantSplicing`` module creates the following file ``Output/processed_results/aberrant_splicing/results/{annotation}/fraser/{drop_group}/results.tsv``.
+This text file contains only significant junctions that meet the cutoffs defined in the config file, aggregated at the gene level. Any sample/gene pair is represented by only the most significant junction.
+
+Mono-allelic Expression
++++++++++++++++++++++++
+
+HTML file
+##########
+Looking at the resulting ``Output/html/drop_demo_index.html`` we can see the ``MonoallelicExpression`` 
+tab at the top of the screen. The Overview tab contains links to the:  
+
+* Results for each mae group
+    * number of samples, genes, and mono-allelically expressed heterozygous SNVs
+    * a cascade plot that shows additional filters
+    * histogram of inner cohort frequency
+    * summary of the cascade plot and results table
+* Files for each mae group
+    * Allelic counts
+        * a directory containing the allelic counts of heterozygous variants
+    * Results data tables of each sample (.Rds)
+        * Rds objects containing the full results table regardless of MAE status
+    * Significant MAE results tables
+        * a link to the results file
+        * contains only results with significant MAE for the alternative allele that pass the config file cutoffs
+* Quality Control
+    * QC Overview
+        * For each mae group QC checks for DNA/RNA matching
+
+Local result files
+##################
+Additionally the ``mae`` module creates the following files:
+
+* ``Output/processed_results/mae/{drop_group}/MAE_results_all_{annotation}.tsv.gz``
+    * this file contains the MAE results of all heterozygous SNVs regardless of significance
+* ``Output/processed_results/mae/{drop_group}/MAE_results_{annotation}.tsv``
+    * this is the file linked in the HTML document and described above
+* ``Output/processed_results/mae/{drop_group}/MAE_results_{annotation}_rare.tsv``
+    * this file is a subset of ``MAE_results_{annotation}.tsv`` with only the variants that pass the allele frequency cutoffs. If ``add_AF`` is set to ``true`` in the config file, a variant must pass the allele frequency cutoff set by ``max_AF``. Additionally, the inner-cohort frequency must meet the ``maxVarFreqCohort`` cutoff.
diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
@@ -6,13 +6,13 @@ DROP is `Snakemake <https://snakemake.readthedocs.io/en/stable/executing/cli.htm
 Dry run
 -------
 
-Open a terminal in your project repository. Execute 
+Open a terminal in your project repository. Execute
 
 .. code-block:: bash
-    
-    snakemake --cores 1 -n 
 
-This will perform a *dry-run*, which means it will display all the steps (or rules) that need to be executed. To also display the reason why those rules need to be exeucted, run 
+    snakemake --cores 1 -n
+
+This will perform a *dry-run*, which means it will display all the steps (or rules) that need to be executed. To also display the reason why those rules need to be executed, run
 
 .. code-block:: bash
 
@@ -23,7 +23,7 @@ Finally, a simplified dry-run can be achieved by executing
 .. code-block:: bash
 
     snakemake --cores 1 -nq
-    
+
 Calling ``snakemake --cores 1`` without any additional parameters will execute the whole workflow. Snakemake requires you to designate the number of cores when running the ``snakemake`` command.
 
 
@@ -45,16 +45,17 @@ Every single module can be called independently.
 .. code-block:: bash
 
     snakemake <subworkflow>
-    
+
 ========================  =======================================================================
-Subworkflow                Description                                                       
+Subworkflow                Description
 ========================  =======================================================================
 ``aberrantExpression``     Aberrant expression pipeline
 ``aberrantSplicing``       Aberrant splicing pipeline
 ``mae``                    Monoallelic expression pipeline
+``rnaVariantCalling``      RNA Variant Calling pipeline
 ========================  =======================================================================
 
-An example for calling the aberrant expression pipeline with 10 cores would be 
+An example for calling the aberrant expression pipeline with 10 cores would be
 
 .. code-block:: bash
 
@@ -69,7 +70,7 @@ When DROP is updated or jobs fail, the following commands can be used to rerun a
 Unlocking the pipeline
 ++++++++++++++++++++++
 
-While running, Snakemake *locks* the directory. If, for whatever reason, the pipeline was interrupted, the directory might be kept locked. Therefore, call 
+While running, Snakemake *locks* the directory. If, for whatever reason, the pipeline was interrupted, the directory might be kept locked. Therefore, call
 
 .. code-block:: bash
 
@@ -83,7 +84,8 @@ Updating DROP
 +++++++++++++
 Every time a project is initialized, a temporary folder ``.drop`` will be created in the project folder.
 If a new version of drop is installed, the ``.drop`` folder has to be updated for each project that has been
-initialized using an older version.
+initialized using an older version. ``drop update`` will also reset the local project's ``Scripts/`` directory to match the installed version, so be sure to save any additional scripts or analyses in another location.
+
 To do this run:
 
 .. code-block:: bash
@@ -96,14 +98,13 @@ Skipping recomputation of files
 If snakemake is interrupted and restarted, it will continue with the last unsuccessful job in the job graph. If a script is updated with a minor change, e.g. when calling ``drop update``, all jobs of the modified script and its downstream steps will be rerun. However, in some cases one might want to keep the intermediate files instead and continue with the missing files. In order to do so, first execute
 
 .. code-block:: bash
-   
+
    snakemake <rule> --touch
 
-for whichever rule or module you want to continue the computation. The ``--touch`` command touches all output files required by the pipeline that have already been computed. Omitting the rule will lead to accessing the complete pipeline. Afterwards, use 
+for whichever rule or module you want to continue the computation. The ``--touch`` command touches all output files required by the pipeline that have already been computed. Omitting the rule will lead to accessing the complete pipeline. Afterwards, use
 
 .. code-block:: bash
 
     snakemake unlock
-    
-to unlock the submodules, so that the jobs that need to be computed can be identified.
 
+to unlock the submodules, so that the jobs that need to be computed can be identified.
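
The new ``docs/source/output.rst`` in this diff documents several result paths that contain ``{drop_group}`` and ``{annotation}`` placeholders. As a minimal sketch, not part of DROP itself, the templates can be filled in with a small helper. The function name ``result_path``, the dictionary keys, and the example values ``"mae"`` and ``"v29"`` are illustrative assumptions; only the path templates are copied from the documentation.

```python
from pathlib import PurePosixPath

# Path templates copied verbatim from docs/source/output.rst.
# The dictionary keys are hypothetical labels chosen for this sketch.
RESULT_TEMPLATES = {
    "outrider_all": "Output/processed_results/aberrant_expression/{annotation}/outrider/{drop_group}/OUTRIDER_results_all.Rds",
    "fraser_gene": "Output/processed_results/aberrant_splicing/results/{annotation}/fraser/{drop_group}/results.tsv",
    "mae_all": "Output/processed_results/mae/{drop_group}/MAE_results_all_{annotation}.tsv.gz",
    "mae_significant": "Output/processed_results/mae/{drop_group}/MAE_results_{annotation}.tsv",
    "mae_rare": "Output/processed_results/mae/{drop_group}/MAE_results_{annotation}_rare.tsv",
}


def result_path(kind, drop_group, annotation, root="."):
    """Fill in one documented path template, relative to the project root.

    `kind` must be a key of RESULT_TEMPLATES; an unknown key raises KeyError.
    `drop_group` and `annotation` are whatever names the project's own
    sample annotation table and config file define.
    """
    template = RESULT_TEMPLATES[kind]
    relative = template.format(drop_group=drop_group, annotation=annotation)
    return PurePosixPath(root) / relative


if __name__ == "__main__":
    # Example values are placeholders, not names guaranteed by the demo.
    print(result_path("mae_rare", drop_group="mae", annotation="v29", root="drop_demo"))
```

Keeping the templates in one dictionary means that a future change to the output layout only needs to be updated in a single place.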