diff --git a/README.rst b/README.rst index 4e556ae..6d71c1b 100644 --- a/README.rst +++ b/README.rst @@ -8,7 +8,7 @@ :alt: JOSS (journal of open source software) DOI .. image:: https://github.com/sequana/rnaseq/actions/workflows/main.yml/badge.svg - :target: https://github.com/sequana/rnaseq/actions/workflows/main.yaml + :target: https://github.com/sequana/rnaseq/actions/workflows/main.yaml @@ -17,7 +17,7 @@ This is is the **RNA-seq** pipeline from the `Sequana 5 pip install sequana @@ -100,22 +100,22 @@ To use apptainer, initialise the pipeline with the --use-singularity option and Details ~~~~~~~~~ -This pipeline runs a **RNA-seq** analysis of sequencing data. It runs in -parallel on a set of input FastQ files (paired or not). +This pipeline runs a **RNA-seq** analysis of sequencing data. It runs in +parallel on a set of input FastQ files (paired or not). A brief HTML report is produced together with a MultiQC report. This pipeline is complex and requires some expertise for the interpretation. -Many online-resources are available and should help you deciphering the output. +Many online-resources are available and should help you deciphering the output. Yet, it should be quite straigtforward to execute it as shown above. The -pipeline uses bowtie1 to look for ribosomal contamination (rRNA). Then, +pipeline uses bowtie1 to look for ribosomal contamination (rRNA). Then, it cleans the data with cutapdat if you say so (your data may already be -pre-processed). If no adapters are provided (default), reads are -trimmed for low quality bases only. Then, mapping is performed with standard mappers such as +pre-processed). If no adapters are provided (default), reads are +trimmed for low quality bases only. Then, mapping is performed with standard mappers such as star or bowtie2 (--aligner option). Finally, feature counts are extracted from the previously generated BAM files. We guess the strand and save the feature counts into the directoy -./rnadiff/feature_counts. +./rnadiff/feature_counts. The pipelines stops there. However, RNA-seq analysis are followed by a different analysis (DGE hereafter). Although the DGE is not part of the pipeline, you can @@ -138,7 +138,7 @@ Rules and configuration details ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is the `latest documented configuration file `_ -to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. +to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file. .. warning:: the RNAseQC rule is switch off and is not currently functional in @@ -158,6 +158,9 @@ Changelog ========= ==================================================================== Version Description ========= ==================================================================== +0.19.1 * add rnaseqc container. + * Update rseqc rules (redirection) + * cleanup onsuccess rule 0.19.0 * Refactorisation to use click 0.18.1 * fastp multiqc regression. Fixed missing sample names by updating multiqc_config and adding sample names in the output filename @@ -166,28 +169,28 @@ Version Description * BUG: Fix missing params (options) in star_mapping rule not taken into account 0.17.1 * use new rulegraph / graphviz apptainer -0.17.0 * fastp step changed to use sequana-wrappers. Slight change in +0.17.0 * fastp step changed to use sequana-wrappers. Slight change in config file. The reverse and forward adapter options called rev and fwd have been dropped in favor of a single adapters option. - v0.17.0 config and schema are not compatible with previous + v0.17.0 config and schema are not compatible with previous versions. * Update singularity containers and add new one for fastp -0.16.1 * fix bug in feature counts automatic strand balance detection. Was +0.16.1 * fix bug in feature counts automatic strand balance detection. Was always using the stranded case (2). * add singularity workflow for testing * fix documentation in config.yaml -0.16.0 * star, salmon, bam_coverage are now in sequana wrappers, updated +0.16.0 * star, salmon, bam_coverage are now in sequana wrappers, updated the pipeline accordingly - * updated config file and schema to include resources inside the + * updated config file and schema to include resources inside the config file (so as to use new --profile option) * set singularity images in all rules - * star wrappers has changed significantly to use star + * star wrappers has changed significantly to use star recommandation. To keep using previous way, a legacy option is available and set to True in this version. * bamCoverage renamed in bam_coverage in the config file * multiqc_config removed redundant information and ordered the output in a coherent way (QC and then analysis) -0.15.2 * Fix bowtie2 rule to use new wrappers. Use wrappers in +0.15.2 * Fix bowtie2 rule to use new wrappers. Use wrappers in add_read_group and mark_duplicates 0.15.1 * Adapt to new bowtie2 align wrapper 0.15.0 * fix typo reported in https://github.com/sequana/rnaseq/issues/12 @@ -199,7 +202,7 @@ Version Description same genome directory. * Ribosomal is now estimated on the first 100,000 reads to speed up analysis - * --indexing and --force-indexing options not required anymore. + * --indexing and --force-indexing options not required anymore. Indexing will be done automatically and not redone if present. * Use of the new sequana-wrappers repository 0.13.0 * Major update to use the new sequana version and the RNADiff tools. @@ -210,7 +213,7 @@ Version Description * user interface has now a --skip-gff-check option. Better handling of input gff with more meaningful messages * integration of rseqc tool -0.12.1 * indexing was always set to True in the config after 0.9.16 update. +0.12.1 * indexing was always set to True in the config after 0.9.16 update. 0.12.0 * BUG fix: Switch mark_duplicates correctly beore feature counts 0.11.0 * rnadiff one factor is simplified * When initiating the pipeline, provide information about the GFF @@ -226,7 +229,7 @@ Version Description created and used * fix the --do-igvtools and --do-bam-coverage with better doc 0.10.0 * 9/12/2020 - * Fixed bug in sequana/star_indexing for small genomes (v0.9.7). + * Fixed bug in sequana/star_indexing for small genomes (v0.9.7). Changed the rnaseq requirements to benefit from this bug-fix that could lead to seg fault with star aligner for small genomes. * Report improved with strand guess and plot @@ -235,32 +238,32 @@ Version Description * In config file, bowtie section 'do' option is removed. This is now set automatically if rRNA_feature or rRNA_file is provided. This allows us to skip the rRNA mapping entirely if needed. - * fastq_screen should be functional. Default behaviour is off. If + * fastq_screen should be functional. Default behaviour is off. If set only phiX174 will be search for. Users should build their own configuration file. - * star/bowtie1/bowtie2 have now their own sub-directories in the - genome directory. + * star/bowtie1/bowtie2 have now their own sub-directories in the + genome directory. * added --run option to start pipeline automatically (if you know what you are doing) * rnadiff option has now a default value (one_factor) * add strandness plot in the HTML summary page -0.9.19 * Remove the try/except around tolerance (guess of strandness) to +0.9.19 * Remove the try/except around tolerance (guess of strandness) to make sure this is provided by the user. Final onsuccess benefits from faster GFF function (sequana 0.9.4) -0.9.18 * Fix typo (regression bug) + add tolerance in schema + generic +0.9.18 * Fix typo (regression bug) + add tolerance in schema + generic title in multiqc_config. (oct 2020) 0.9.17 * add the *tolerance* parameter in the feature_counts rule as a user - parameter (config and pipeline). -0.9.16 * Best feature_counts is now saved into rnadiff/feature_counts + parameter (config and pipeline). +0.9.16 * Best feature_counts is now saved into rnadiff/feature_counts directory and rnadiff scripts have been updated accordingly * the most probable feature count option is now computed more effectivily and incorporated inside the Snakemake pipeline (not in - the onsuccess) so that multiqc picks the best one (not the 3 + the onsuccess) so that multiqc picks the best one (not the 3 results) * the target.txt file can be generated inside the pipeline if user fill the rnadiff/conditions section in the config file * indexing options are filled automatically when calling - sequana_rnaseq based on the presence/absence of the index + sequana_rnaseq based on the presence/absence of the index of the aligner being used. * salmon now integrated and feature counts created (still WIP in sequana) @@ -283,13 +286,13 @@ Version Description analysis 0.9.11 * Automatic guessing of the strandness of the experiment 0.9.10 * Fix multiqc for RNAseQC rule -0.9.9 * Fix RNAseQC rule, which is now available. +0.9.9 * Fix RNAseQC rule, which is now available. * Fix ability to use existing rRNA file as input 0.9.8 * Fix indexing for bowtie1 to not be done if aligner is different * add new options: --feature-counts-options and --do-rnaseq-qc, --rRNA-feature * Based on the input GFF, we now check the validity of the rRNA - feature and feature counts options to check whether the feature + feature and feature counts options to check whether the feature exists in the GFF * schema is now used to check the config file values * add a data test for testing and documentation @@ -298,25 +301,25 @@ Version Description * Possiblity to switch off cutadapt section * Fixing bowtie2 rule in sequana and update the pipeline accordingly * Include a schema file - * output-directory parameter renamed into output_directory (multiqc + * output-directory parameter renamed into output_directory (multiqc section) * handle stdout correctly in fastqc, bowtie1, bowtie2 rules 0.9.5 * Fixed https://github.com/sequana/sequana/issues/571 * More cutadapt commands and sanity checks * Fixed bowtie2 options import in rnaseq.rules -0.9.4 -0.9.3 if a fastq_screen.conf is provided, we switch the fastqc_screen +0.9.4 +0.9.3 if a fastq_screen.conf is provided, we switch the fastqc_screen section ON automatically 0.9.0 **Major refactorisation.** - * remove sartools, kraken rules. + * remove sartools, kraken rules. * Indexing is now optional and can be set in the configuration. * Configuration file is simplified with a general section to enter - the genome location and aligner. + the genome location and aligner. * Fixed rules in sequana (0.8.0) that were not up-to-date with several executables used in the pipeline including picard, fastq_screen, etc. See Sequana Changelog for details with respect - to rules changes. - * Copying the feature counts in main directory ready to use for + to rules changes. + * Copying the feature counts in main directory ready to use for a differential analysis. ========= ==================================================================== diff --git a/pyproject.toml b/pyproject.toml index 40bade0..3fb3061 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api" [tool.poetry] name = "sequana-rnaseq" -version = "0.19.0" +version = "0.19.1" description = "A RNAseq pipeline from raw reads to feature counts" authors = ["Sequana Team"] license = "BSD-3" diff --git a/sequana_pipelines/rnaseq/cluster_config.json b/sequana_pipelines/rnaseq/cluster_config.json deleted file mode 100755 index 48bae56..0000000 --- a/sequana_pipelines/rnaseq/cluster_config.json +++ /dev/null @@ -1,34 +0,0 @@ -{ - "__default__" : - { - "ram" : "5G" - }, - - "star_mapping" : - { - "ram" : "34G" - }, - - "star_index" : - { - "ram" : "34G" - }, - - "bowtie2_index" : - { - "ram" : "20G" - }, - "RNAseQC" : - { - "ram" : "34G" - }, - - "bamCoverage" : - { - "ram" : "20G" - }, - "mark_duplicates" : - { - "ram" : "34G" - } -} diff --git a/sequana_pipelines/rnaseq/config.yaml b/sequana_pipelines/rnaseq/config.yaml index 79f83b9..613259b 100644 --- a/sequana_pipelines/rnaseq/config.yaml +++ b/sequana_pipelines/rnaseq/config.yaml @@ -26,7 +26,7 @@ apptainers: igvtools: "https://zenodo.org/record/7022635/files/igvtools_2.12.0.img" graphviz: "https://zenodo.org/record/7928262/files/graphviz_7.0.5.img" multiqc: "https://zenodo.org/record/10205070/files/multiqc_1.16.0.img" - + rnaseqc: "https://zenodo.org/record/5799564/files/rnaseqc_2.35.0.img" # =========================================== Sections for the users @@ -370,6 +370,8 @@ rnaseqc: do: false gtf_file: options: --coverage + resources: + mem: 8G # if be_file not provided, try to create one on the fly diff --git a/sequana_pipelines/rnaseq/main.py b/sequana_pipelines/rnaseq/main.py index d1db8af..9365b69 100755 --- a/sequana_pipelines/rnaseq/main.py +++ b/sequana_pipelines/rnaseq/main.py @@ -13,41 +13,38 @@ # documentation: http://sequana.readthedocs.io # ############################################################################## -import sys import os import shutil import subprocess +import sys -import rich_click as click import click_completion - - -from sequana_pipetools.options import * +import rich_click as click from sequana_pipetools import SequanaManager +from sequana_pipetools.options import * click_completion.init() NAME = "rnaseq" - help = init_click( NAME, groups={ "Pipeline Specific": [ - "--aligner", - "--contaminant-file", - "--do-igvtools", - "--do-bam-coverage", - "--do-mark-duplicates", - "--do-rnaseqc", - "--do-rseqc", - "--genome-directory", - "--rnaseqc-gtf-file", - "--rRNA-feature", - "--rseqc-bed-file", - "--skip-rRNA", - "--skip-gff-check", - "--trimming-quality", + "--aligner", + "--contaminant-file", + "--do-igvtools", + "--do-bam-coverage", + "--do-mark-duplicates", + "--do-rnaseqc", + "--do-rseqc", + "--genome-directory", + "--rnaseqc-gtf-file", + "--rRNA-feature", + "--rseqc-bed-file", + "--skip-rRNA", + "--skip-gff-check", + "--trimming-quality", ], }, ) @@ -60,11 +57,14 @@ @include_options_from(ClickGeneralOptions) @include_options_from(ClickTrimmingOptions) @include_options_from(ClickFeatureCountsOptions) -@click.option("--genome-directory", - "genome_directory", default=".", +@click.option( + "--genome-directory", + "genome_directory", + default=".", show_default=True, type=click.Path(dir_okay=True, file_okay=False), - required=True) + required=True, +) @click.option( "--aligner", "aligner", @@ -78,12 +78,12 @@ help="""Feature name corresponding to the rRNA to be identified in the input GFF/GTF files. Must exist and be valid. If you do not have any, you may skip this step using --skip-rRNA or provide a fasta file using --contaminant-file""", - ) +) @click.option( - "--skip-rRNA", - "skip_rRNA", - is_flag=True, - help="""skip the mapping on rRNA feature. ignored if --contaminant-file is provided""", + "--skip-rRNA", + "skip_rRNA", + is_flag=True, + help="""skip the mapping on rRNA feature. ignored if --contaminant-file is provided""", ) @click.option( "--contaminant-file", @@ -99,7 +99,7 @@ default=False, show_default=True, help="""By default we check the coherence between the input -GFF file and related options (e.g. --feature_counts_feature_type and +GFF file and related options (e.g. --feature_counts_feature_type and --feature_counts_attribute options). This may take time e.g. for mouse or human. Using this option skips the sanity checks""", ) @@ -109,7 +109,7 @@ help="""if set, this will compute TDF files that can be imported in IGV browser. TDF file allows to quickly visualise the coverage of the mapped reads.""", - ) +) @click.option( "--do-bam-coverage", is_flag=True, @@ -120,15 +120,11 @@ is_flag=True, help="""Mark duplicates. To be used e.g. with QCs""", ) -@click.option( - "--do-rnaseqc", - is_flag=True, - help="do RNA-seq QC using RNAseQC v2" -) +@click.option("--do-rnaseqc", is_flag=True, help="do RNA-seq QC using RNAseQC v2") @click.option( "--rnaseqc-gtf-file", help="""The GTF file to be used for RNAseQC. Without a valid GTF, - RNAseqQC will not work. You may try sequana.gff3 module to build the gtf from the GFF file""", + RNAseqQC will not work. You may try sequana gff-to-gtf application.""", ) @click.option( "--do-rseqc", @@ -137,12 +133,12 @@ corresponding to your GFF file. For prokaryotes, the BED file is created on the fly.""", ) -@click.option( - "--rseqc-bed-file", - help="""The rseQC input bed file.""" -) +@click.option("--rseqc-bed-file", help="""The rseQC input bed file.""") def main(**options): + if options["from_project"]: + click.echo("--from-project Not yet implemented") + sys.exit(1) # the real stuff is here manager = SequanaManager(options, NAME) manager.setup() @@ -151,10 +147,9 @@ def main(**options): options = manager.options cfg = manager.config.config - from sequana_pipetools import logger - logger.setLevel(options.level) + logger.setLevel(options.level) manager.fill_data_options() # --------------------------------------------------------- general @@ -176,9 +171,7 @@ def main(**options): # mutually exclusive options if options.contaminant_file: cfg.general.contaminant_file = os.path.abspath(options.contaminant_file) - logger.warning( - "You are using a custom FASTA --contaminant_file so --rRNA-feature will be ignored" - ) + logger.warning("You are using a custom FASTA --contaminant_file so --rRNA-feature will be ignored") cfg.general.rRNA_feature = None elif options.skip_rRNA: cfg.general.rRNA_feature = None @@ -211,7 +204,6 @@ def main(**options): cfg.fastp.disable_quality_filtering = False cfg.fastp.disable_adapter_trimming = False - # ----------------------------------------------------- feature counts cfg.feature_counts.options = options.feature_counts_options cfg.feature_counts.strandness = options.feature_counts_strandness @@ -239,7 +231,8 @@ def main(**options): ) cfg.rnaseqc.do = False if options.aligner in ["salmon"]: - logger.info("WARNING" + logger.info( + "WARNING" "You asked for RNA_seqc QC assessements but no" " BAM will be generated by the salmon aligner. Switching off this option. " ) @@ -252,7 +245,7 @@ def main(**options): # -------------------------------------------------------- RNAdiff - #import sequana_pipelines.rnaseq + # import sequana_pipelines.rnaseq # SANITY CHECKS # -------------------------------------- do we find rRNA feature in the GFF ? @@ -275,10 +268,7 @@ def main(**options): valid_attributes = gff.attributes # about 10 seconds # first check the rRNA feature - if ( - cfg["general"]["rRNA_feature"] - and cfg["general"]["rRNA_feature"] not in valid_features - ): + if cfg["general"]["rRNA_feature"] and cfg["general"]["rRNA_feature"] not in valid_features: logger.error( "rRNA feature not found in the input GFF ({})".format(gff_file) @@ -321,48 +311,36 @@ def main(**options): ) sys.exit() else: - unique = set( - [x[fc_attr] for k, x in dd.attributes.items() if fc_attr in x] - ) - logger.info( - f"Found {S} '{fc_attr}' entries for the attribute [{len(unique)} unique entries]" - ) + unique = set([x[fc_attr] for k, x in dd.attributes.items() if fc_attr in x]) + logger.info(f"Found {S} '{fc_attr}' entries for the attribute [{len(unique)} unique entries]") if S != len(unique): - logger.warning( - "Attribute non-unique. Feature counts should handle it" - ) + logger.warning("Attribute non-unique. Feature counts should handle it") if options.feature_counts_extra_attributes: for extra_attr in cfg.feature_counts.extra_attributes.split(","): if extra_attr not in set(attributes): - logger.error( - "{extra_attr} not found in the GFF attributes. Try one of {set(attributes)}" - ) + logger.error("{extra_attr} not found in the GFF attributes. Try one of {set(attributes)}") sys.exit() - # need to move the custom file into the working directoty if "," in cfg.feature_counts.feature: - logger.info( - "Building a custom GFF file (custom.gff) using Sequana. Please wait" - ) + logger.info("Building a custom GFF file (custom.gff) using Sequana. Please wait") genome_directory = os.path.abspath(cfg.general.genome_directory) genome_name = genome_directory.rsplit("/", 1)[1] prefix_name = genome_directory + "/" + genome_name from sequana import GFF3 + gff = GFF3(prefix_name + ".gff") fc_types = cfg.feature_counts.feature.strip().split(",") gff.save_gff_filtered(features=fc_types, filename="custom.gff") cfg.general.custom_gff = "custom.gff" - # finalise the command and save it; copy the snakemake. update the config # file and save it. manager.teardown() - try: # option added in latest version if cfg.general.custom_gff: shutil.copy(cfg.general.custom_gff, options.workdir) @@ -370,6 +348,5 @@ def main(**options): pass - if __name__ == "__main__": main() diff --git a/sequana_pipelines/rnaseq/rnaseq.rules b/sequana_pipelines/rnaseq/rnaseq.rules index 6c0797d..4d7cb11 100644 --- a/sequana_pipelines/rnaseq/rnaseq.rules +++ b/sequana_pipelines/rnaseq/rnaseq.rules @@ -63,7 +63,7 @@ __gff_file__ = f"{__prefix_name__}.gff" # if manager.config['general']['custom_gff']: __gff_file__ = manager.config['general']['custom_gff'] -assert os.path.exists(__gff_file__) +assert os.path.exists(__gff_file__), f"GFF file {__gff_file__} does not exist" # check existence of fasta and gff before starting; for this in [__fasta_file__, __gff_file__]: @@ -858,23 +858,21 @@ expected_output.append("post_analysis/all_features.out") if config['rseqc']['do']: - if config['rseqc']['bed_file']: - __bed__file = config['rseqc']['bed_file'] - else: - __bed__file = "outputs/temp.bed" - rule gff2bed: - input: __gff_file__ - output: __bed__file - run: - from sequana import GFF3 - g = GFF3(input[0]) - g.to_bed(output[0], 'Name') + rule gff2bed: + input: + gff=__gff_file__ + output: + bed="tmp/temp.bed" # config['rseqc'].get('bed_file', "tmp/temp.bed") + message: "Build BED file from GFF using Sequana" + run: + from sequana import GFF3 + g = GFF3(input[0]) + g.to_bed(output[0], 'Name') rule rseqc: input: bam=__final_bam__, - bed=__bed__file - + bed=rules.gff2bed.output.bed # no need to put all outputs output: bam_stat= "{sample}/rseqc/{sample}_bam_stat.txt", @@ -889,27 +887,27 @@ if config['rseqc']['do']: shell: """ # for paired data only - inner_distance.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} + inner_distance.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} &>{log} # For now GC not very useful in the output so commented - # read_GC.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} + # read_GC.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} &>{log} # genebody coverage - geneBody_coverage.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} 1>{log}2>{log} + geneBody_coverage.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} &>{log} # uses bigwig redundant with geneBody_coverage - # geneBody_coverage2.py -i {wildcards.sample}/bamCoverage/{wildcards.sample}.norm.bw -o {wildcards}/rseq/{wildcards.sample} -r test.bed + # geneBody_coverage2.py -i {wildcards.sample}/bamCoverage/{wildcards.sample}.norm.bw -o {wildcards}/rseq/{wildcards.sample} -r test.bed &>{log} # Not included in the multiqc module so commented for now #clipping_profile.py -i {input.bam} -s {params.paired} -o {wildcards.sample}/rseqc/{wildcards.sample} - read_duplication.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} - junction_annotation.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} - junction_saturation.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} - infer_experiment.py -i {input.bam} -r {input.bed} > {wildcards.sample}/rseqc/{wildcards.sample}.infer.txt + read_duplication.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} &>{log} + junction_annotation.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} &>{log} + junction_saturation.py -i {input.bam} -o {wildcards.sample}/rseqc/{wildcards.sample} -r {input.bed} &>{log} + infer_experiment.py -i {input.bam} -r {input.bed} > {wildcards.sample}/rseqc/{wildcards.sample}.infer.txt &>{log} # bam stats KEEP last since that is the expected output to make sure # previous files (not listed in output:) are computed first. - bam_stat.py -i {input.bam} > {output.bam_stat} + bam_stat.py -i {input.bam} > {output.bam_stat} &>{log} """ # one series is enough. if bam_stats is created, others are also created @@ -938,10 +936,26 @@ if config["rnaseqc"]["do"] and config['general']['aligner'] != 'salmon': logs.critical(f"{__gtf_file__} not found in {gdir}. Trying the GFF file") __gtf_file__ = __prefix_name__ + ".gff" + + rule rnaseqc_fixup: + input: + gtf = __gtf_file__ + output: + gtf = temp("tmp/test.gtf") + run: + # If input GTF has no exon or genes, an error message is printed and + # no files are created. This seems to be an issue in rnaseqc. + # So, we create dummy gene and exon + with open(output.gtf, "w") as ff: + ff.write(open(input['gtf'], "r").read()) + ff.write('myCHR\tSGD\tgene\t0\t0\t.\t+\t0\tgene_id "dummy"\n') + ff.write('myCHR\tSGD\texon\t0\t0\t.\t+\t0\texon_id "dummy"\n') + ff.close() + rule rnaseqc: input: bam = __final_bam__, - gtf = __gtf_file__ + gtf = rules.rnaseqc_fixup.output.gtf output: metrics = "{sample}/rnaseqc/{sample}.metrics.tsv" log: @@ -949,18 +963,14 @@ if config["rnaseqc"]["do"] and config['general']['aligner'] != 'salmon': params: directory = "{sample}/rnaseqc", options= config['rnaseqc']['options'] - run: - # If input GTF has no exon or genes, an error message is printed and - # no files are created. This seems to be an issue in rnaseqc. - # So, we create dummy gene and exon - from easydev import TempFile - with TempFile(suffix=".gtf") as fout: - ff = open(fout.name, "w") - ff.write('myCHR\tSGD\tgene\t0\t0\t.\t+\t0\tgene_id "dummy";') - ff.write('myCHR\tSGD\texon\t0\t0\t.\t+\t0\texon_id "dummy";') - ff.write(open(input['gtf'], "r").read()) - ff.close() - shell("rnaseqc " + fout.name + " {input.bam} {params.directory} -s {wildcards.sample} {params.options} &>{log}") + resources: + **config["rnaseqc"]["resources"] + container: + config["apptainers"]["rnaseqc"] + shell: + """ + rnaseqc {input.gtf} {input.bam} {params.directory} -s {wildcards.sample} {params.options} &>{log} + """ expected_output.extend(expand("{sample}/rnaseqc/{sample}.metrics.tsv", sample=manager.samples)) @@ -1037,7 +1047,7 @@ rule prepare_DGE_analysis: with open("post_analysis/README.rst", "w") as fout: fout.write(f"""{msg} The design.csv file must be formatted as follows (for 2 conditions with 3 replicates each): - + label,condition samplename_1,condition_name_1 samplename_2,condition_name_1 @@ -1046,7 +1056,7 @@ rule prepare_DGE_analysis: samplename_5,condition_name_2 samplename_6,condition_name_2 """) - + # 2. save the script with open(output.rnadiff, "w") as fout: attribute = config['feature_counts']['attribute'] @@ -1064,7 +1074,6 @@ localrules: rulegraph, prepare_DGE_analysis onsuccess: # Create plots about stats from sequana import logger as log - from sequana import BAM from sequana.modules_report.summary import SequanaReport import colorlog @@ -1101,6 +1110,7 @@ onsuccess: rRNA_done = True + intro += """

ribosomal / contaminant content

""" try: if os.path.exists("multiqc/multiqc_report_data/multiqc_bowtie1.txt"): df = pd.read_csv("multiqc/multiqc_report_data/multiqc_bowtie1.txt", sep='\t') @@ -1108,24 +1118,24 @@ onsuccess: df = pd.read_csv("multiqc/multiqc_data/multiqc_bowtie1.txt", sep='\t') else: rRNA_done = False - raise Exception - intro += """

ribosomal / contaminant content

""" - if "reads_aligned_percentage" in df.columns: - rRNA = int(df.reads_aligned_percentage.mean()*100)/100 + if rRNA_done: + if "reads_aligned_percentage" in df.columns: + rRNA = int(df.reads_aligned_percentage.mean()*100)/100 + else: + rRNA = int((100 - df.not_aligned_percentage.mean())*100)/100 + + if rRNA < 10: + intro += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is low (as expected). " + elif rRNA < 20: + intro += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is moderately low. " + elif rRNA > 20: + rRNA += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is relatively high. " + elif rRNA > 50: + rRNA += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is very high high. " else: - rRNA = int((100 - df.not_aligned_percentage.mean())*100)/100 - - if rRNA < 10: - intro += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is low (as expected). " - elif rRNA < 20: - intro += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is moderately low. " - elif rRNA > 20: - rRNA += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is relatively high. " - elif rRNA > 50: - rRNA += f"

rRNA content (or contaminant provided) represents {rRNA}%, which is very high high. " - - intro += """Please see the section of the multiqc report for details. Here below

""" + intro += f"

rRNA content not computed (no rRNA gene or contaminant provided)

" + except Exception as err: print(err) pass @@ -1144,7 +1154,7 @@ onsuccess: except Exception as err: print(err) - + # Include the bowtie plot intro += """

Mapping rate

""" if config['general']['aligner'] == "bowtie2": @@ -1204,22 +1214,12 @@ onsuccess: intro+= """

Differential analysis

- Differentially expressed genes analysis is not performed automatically with this pipeline. However, information, feature counts, and other materials can be found in the directory ./post_analysis/. Most probably an analysis is present. If so, please open the directory in rnadiff report. + Differentially expressed genes analysis is not performed automatically with this pipeline. However, information, feature counts, and other materials can be found in the directory post_analysis where the standalone 'sequana rnadiff' can be used with DeSEq2. Most probably an analysis is present. If so, please open the directory in rnadiff report.

""" # Now the final report. add the original command in the HTML report - try: - command = "" - with open(".sequana/info.txt", "r") as fin: - for line in fin: - if not line.startswith("#"): - command += line - intro += f"

Command used

{command}" - except Exception: - pass - data = manager.getmetadata() s = SequanaReport(data, intro) diff --git a/sequana_pipelines/rnaseq/schema.yaml b/sequana_pipelines/rnaseq/schema.yaml index 3a27b14..5921a78 100644 --- a/sequana_pipelines/rnaseq/schema.yaml +++ b/sequana_pipelines/rnaseq/schema.yaml @@ -285,6 +285,9 @@ mapping: type: str "options": type: str + "resources": + type: any + required: true 'rseqc': type: map mapping: diff --git a/test/data/Saccer3/Saccer3_rRNA.fa b/test/data/Saccer3/Saccer3_rRNA.fa new file mode 100644 index 0000000..abe42a0 --- /dev/null +++ b/test/data/Saccer3/Saccer3_rRNA.fa @@ -0,0 +1,2 @@ +>empty +AAAAAAAAAAAAAA diff --git a/test/test_main.py b/test/test_main.py index 504acdc..34bcf90 100644 --- a/test/test_main.py +++ b/test/test_main.py @@ -1,22 +1,23 @@ import os -import tempfile import subprocess import sys -from sequana_pipelines.rnaseq.main import main +import tempfile + from click.testing import CliRunner -from . import test_dir +from sequana_pipelines.rnaseq.main import main +from . import test_dir sharedir = f"{test_dir}/data" saccer3 = f"{test_dir}/data/Saccer3/" +conta = f"{test_dir}/data/Saccer3/Saccer3_rRNA.fa" # fast def test_standalone_subprocess(): directory = tempfile.TemporaryDirectory() - cmd = """sequana_rnaseq --input-directory {} --working-directory {} """.format( - sharedir, directory.name) + cmd = """sequana_rnaseq --input-directory {} --working-directory {} """.format(sharedir, directory.name) subprocess.call(cmd.split()) @@ -25,21 +26,48 @@ def test_standalone_script(): directory = tempfile.TemporaryDirectory() runner = CliRunner() - results = runner.invoke(main, ["--input-directory", sharedir, "--genome-directory", - saccer3, "--force", "--aligner", "bowtie2", - "--feature-counts-feature-type", 'gene,tRNA', - "--rRNA-feature", "rRNA_gene"]) # ideally should be rRNA but current + results = runner.invoke( + main, + [ + "--input-directory", + sharedir, + "--genome-directory", + saccer3, + "--force", + "--aligner", + "bowtie2", + "--feature-counts-feature-type", + "gene,tRNA", + "--working-directory", + directory.name, + "--rRNA-feature", + "rRNA_gene", + ], + ) # ideally should be rRNA but current assert results.exit_code == 0 def test_standalone_script_contaminant(): directory = tempfile.TemporaryDirectory() runner = CliRunner() - results = runner.invoke(main, ["--input-directory", sharedir, "--genome-directory", - saccer3, "--force", "--aligner", "bowtie2", - "--feature-counts-feature-type", 'gene,tRNA', - "--contaminant-file", "test.fa", - "--rRNA-feature", "rRNA_gene"]) # ideally should be rRNA but current + results = runner.invoke( + main, + [ + "--input-directory", + sharedir, + "--genome-directory", + saccer3, + "--force", + "--aligner", + "bowtie2", + "--feature-counts-feature-type", + "gene", + "--contaminant-file", + conta, + "--working-directory", + directory.name, + ], + ) assert results.exit_code == 0 @@ -53,36 +81,77 @@ def test_version(): def test_standalone_script_wrong_feature(): directory = tempfile.TemporaryDirectory() import sequana_pipelines.rnaseq.main as m - sys.argv = ["test", "--input-directory", sharedir, "--genome-directory", - saccer3, "--force", "--aligner", "bowtie2", - "--feature-counts-feature-type", 'dummy', - "--rRNA-feature", "rRNA_gene"] # ideally should be rRNA but current + + sys.argv = [ + "test", + "--input-directory", + sharedir, + "--genome-directory", + saccer3, + "--force", + "--aligner", + "bowtie2", + "--feature-counts-feature-type", + "dummy", + "--working-directory", + directory.name, + "--rRNA-feature", + "rRNA_gene", + ] # ideally should be rRNA but current try: m.main() assert False except: assert True + # fast def test_standalone_script_wrong_reference(): directory = tempfile.TemporaryDirectory() import sequana_pipelines.rnaseq.main as m - sys.argv = ["test", "--input-directory", sharedir, "--genome-directory", - "dummy", "--force", "--aligner", "bowtie2", - "--rRNA-feature", "rRNA_gene"] # ideally should be rRNA but current + + sys.argv = [ + "test", + "--input-directory", + sharedir, + "--genome-directory", + "dummy", + "--force", + "--aligner", + "bowtie2", + "--working-directory", + directory.name, + "--rRNA-feature", + "rRNA_gene", + ] # ideally should be rRNA but current try: m.main() assert False except: assert True + # fast def test_standalone_script_wrong_triming(): directory = tempfile.TemporaryDirectory() import sequana_pipelines.rnaseq.main as m - sys.argv = ["test", "--input-directory", sharedir, "--genome-directory", - saccer3, "--force", "--aligner", "bowtie2", "--software-choice", "dummy", - "--rRNA-feature", "rRNA_gene"] # ideally should be rRNA but current + + sys.argv = [ + "test", + "--input-directory", + sharedir, + "--genome-directory", + saccer3, + "--force", + "--aligner", + "bowtie2", + "--software-choice", + "dummy", + "--working-directory", + directory.name, + "--rRNA-feature", + "rRNA_gene", + ] # ideally should be rRNA but current try: m.main() assert False @@ -99,7 +168,6 @@ def test_full(): cmd = f"sequana_rnaseq --input-directory {sharedir} --genome-directory {saccer3} --aligner bowtie2 --working-directory {wk} --force" subprocess.call(cmd.split()) - cmd = "snakemake -s rnaseq.rules --wrapper-prefix https://raw.githubusercontent.com/sequana/sequana-wrappers/ -p --cores 2 " stat = subprocess.call(cmd.split(), cwd=wk) @@ -107,6 +175,7 @@ def test_full(): assert os.path.exists(wk + "/summary.html") assert os.path.exists(wk + "/multiqc/multiqc_report.html") + # slow def test_full_star(): @@ -116,7 +185,6 @@ def test_full_star(): cmd = f"sequana_rnaseq --input-directory {sharedir} --genome-directory {saccer3} --aligner star --working-directory {wk} --force" subprocess.call(cmd.split()) - cmd = "snakemake -s rnaseq.rules --wrapper-prefix https://raw.githubusercontent.com/sequana/sequana-wrappers/ -p --cores 2 " stat = subprocess.call(cmd.split(), cwd=wk) @@ -124,6 +192,7 @@ def test_full_star(): assert os.path.exists(wk + "/summary.html") assert os.path.exists(wk + "/multiqc/multiqc_report.html") + # slow def __test_full_salmon(): @@ -133,11 +202,9 @@ def __test_full_salmon(): cmd = f"sequana_rnaseq --input-directory {sharedir} --genome-directory {saccer3} --aligner salmon --working-directory {wk} --force" subprocess.call(cmd.split()) - cmd = "snakemake -s rnaseq.rules --wrapper-prefix https://raw.githubusercontent.com/sequana/sequana-wrappers/ -p --cores 2 " stat = subprocess.call(cmd.split(), cwd=wk) assert os.path.exists(wk + "/summary.html") assert os.path.exists(wk + "/multiqc/multiqc_report.html") -