v1.0.0 #1

@sreichl

Description
  • Consider building from scratch by reusing parts of both the Snakemake and the BSF pipeline, instead of forking and deleting, while still mentioning and giving credit. -> no, continue with current codebase and approach
  • Remove all downstream analyses, e.g., DESeq & PCA
  • change script count-matrix.py line 40 from sep="\t" to sep=","
  • switch to one sequencing unit CSV file
  • change to raw BAM files as input
    • alignment + downstream first
    • trimming
      • cutadapt: doesn't support unaligned paired end BAMs: https://cutadapt.readthedocs.io/en/stable/reference.html#supported-file-formats
      • check if paired end does not require trimming?! -> requires trimming
      • trimming necessary? See "STAR trimming and quality control" (alexdobin/STAR#455) -> yes
      • check if and how we trim in ATACseq
        • Step 1: A loop runs samtools fastq on each raw BAM file (from {input}) to convert them into FASTQ format.
          Pipe 1: The combined FASTQ output is piped to fastp for quality filtering and adapter trimming, which reads from standard input and writes processed FASTQ to standard output.
          Pipe 2: The filtered reads are piped to bowtie2 for alignment, which uses the processed reads as input.
      • check if and how BSF trims/processes RNAseq data -> very complicated: (http://cemmgit.int.cemm.at/mschuster/bsfpython)
        • The fun starts at bsf.workflows.rnaseq_deseq, which imports bsf.workflows.star, which imports bsf.workflows.picard_sam_to_fastq. The driver scripts are generated via pip at installation from the console_run() class methods.
      • What if it were easy and I could trim in the same step as alignment using pipes, i.e., convert to FASTQ, trim, align, and output aligned BAMs?
      • alternative trimmers
        • trimgalore
        • fastp
        • trimmomatic (used by BSF)
  • processing.smk
    • double check all commands and parameters in conjunction with their manuals (manual & AI)
    • find out about necessity of RG and if I should set that actually
    • make sure not only marked but actually trimmed and filtered output
    • enable auto adapter trimming in fastp for paired end (check log about this)?
      Adapter auto-detection is disabled for STDIN mode
  • Consider adding back adapter.fa in config and define as conditional input(!) to trimming so that it is tracked and required.
  • MultiQC report
    • think about redirecting STAR logging into result folder for multiQC? -> star result folder is input
    • Check how and what MultiQC wants from where? rseqc also logs in log folder, but samtools/fastp etc I put somewhere else -> makes sense?!
    • add fastp and samtools logs to multiqc
    • upgrade MultiQC?
  • consider moving logs into result directory (multiqc needs to know where, ie have to change that as well)
  • rule star_index takes very long. check if parallelizable
  • gene annotation: including GC content and length for downstream analysis e.g., CQN normalization (code exists)
  • sample annotation: (MultiQC) report with sequencing statistics and export as annotation (like in ATACseq)
  • strandedness: understand when, where and why used/necessary
  • consider removing threads from config
  • check for latest software versions and add to env yaml files
  • adapt or remove validation schemas for annot and config
  • RtH: build in a check for paired/single-end
    • Check for single end or paired could result in success/fail.log files (name read_type_correct.log)
  • add export.smk
  • protect main branch
  • test with multiple input bam files per sample(!) and check file sizes and counts (should be doubled)
  • why are the temp FASTQ files not removed? -> Fix
  • test run on large dataset
  • make README
    • Methods can be taken from macroStim
    • reference/cite the original Snakemake RNA pipeline
      • citing: If you use this workflow in a publication, please…cite this AND the original pipeline here DOI
    • refer to the genome_tracks module, as done in the ATAC-seq pipeline (DRY principle). If the group variable is the sample name, then the bigWigs are sample-wise.
  • DR: I forgot to clean resources folder and everything kept aligning to the downloaded genome there - important check for your version of the rna processing
    • Reference data could contain ref info eg species etc -> but more complexity
    • put in docs the location of resources and also a reminder to delete them if parameters changed
  • send out for testing (BockBots) & review (BSF)
  • make MrBiomics compatible (CITATION.cff etc)
    • citation of original pipeline in CITATION.cff as resource
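
Sketch for the "check if and how we trim in ATACseq" item: a minimal Python helper that assembles the three-stage shell pipeline described there (samtools fastq loop, fastp, bowtie2). This is an illustration under assumptions, not the actual rule from either pipeline: the function name, the thread handling and the paired-end/interleaved setup are hypothetical, and the RNA-seq version would need its own aligner step. As the fastp log quoted above says, adapter auto-detection is disabled in STDIN mode.

```python
import shlex


def build_trim_align_command(bam_files, bowtie2_index, threads=4):
    """Return a shell pipeline string: samtools fastq -> fastp -> bowtie2.

    Hypothetical helper; assumes paired-end reads that samtools writes
    to stdout as interleaved FASTQ.
    """
    # Step 1: loop over the raw BAM files and emit FASTQ on stdout.
    bams = " ".join(shlex.quote(str(b)) for b in bam_files)
    to_fastq = f'for bam in {bams}; do samtools fastq -@ {threads} "$bam"; done'

    # Pipe 1: fastp reads interleaved FASTQ from stdin and writes the
    # filtered/trimmed reads to stdout.
    trim = f"fastp --stdin --interleaved_in --stdout --thread {threads}"

    # Pipe 2: bowtie2 aligns the interleaved reads it receives on stdin.
    align = (
        f"bowtie2 --threads {threads} -x {shlex.quote(str(bowtie2_index))} "
        "--interleaved -"
    )

    # The subshell combines the output of all loop iterations into one stream.
    return f"( {to_fastq} ) | {trim} | {align}"


if __name__ == "__main__":
    print(build_trim_align_command(["run1.bam", "run2.bam"], "resources/genome"))
```

The returned string could be dropped into the shell section of a Snakemake rule; keeping everything in one pipe avoids writing temporary FASTQ files.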
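
Sketch for the "RtH: build in a check for paired/single-end" item: one possible way to decide the read type from SAM FLAG values (bit 0x1 is set when a read is part of a pair) and write a success/fail line to a log such as read_type_correct.log. The function names and the log line format are hypothetical; the flags are assumed to come from a sample of records, e.g. the second column of samtools view output.

```python
from pathlib import Path

PAIRED_FLAG = 0x1  # SAM FLAG bit: template has multiple segments (paired)


def detect_read_type(flags):
    """Classify an iterable of SAM FLAG integers as paired, single or mixed."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads to inspect")
    paired = [bool(flag & PAIRED_FLAG) for flag in flags]
    if all(paired):
        return "paired"
    if not any(paired):
        return "single"
    return "mixed"


def write_read_type_log(flags, expected, log_path):
    """Compare observed and expected read type and record the outcome.

    Returns True on a match; the log format is an assumption for illustration.
    """
    observed = detect_read_type(flags)
    ok = observed == expected
    status = "success" if ok else "fail"
    Path(log_path).write_text(
        f"expected={expected} observed={observed} status={status}\n"
    )
    return ok
```

A workflow rule could call write_read_type_log per sample and fail early when the configured read type does not match the data.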
