v1.0.0

- [x] Consider building from scratch by reusing parts of the Snakemake and BSF pipelines, instead of forking and deleting, while still mentioning and crediting both. -> no, continue with current codebase and approach
- [x] Remove all downstream analyses e.g., DeSeq & PCA
- [x] change script `count-matrix.py` line 40 from `sep="\t"` to `sep=","`
- [x] switch to one sequencing unit CSV file
- [x]  change to raw BAM files as input
  - [x] alignment + downstream first
  - [x] trimming
    - cutadapt: doesn't support unaligned paired end BAMs: https://cutadapt.readthedocs.io/en/stable/reference.html#supported-file-formats
    - [x] check if paired end does not require trimming?! -> requires trimming
    - [x] trimming necessary? https://github.com/alexdobin/STAR/issues/455 -> yes
    - [x] check if and how we trim in ATACseq
      - Step 1: A loop runs `samtools fastq` on each raw BAM file (from `{input}`) to convert them into FASTQ format.
      - Pipe 1: The combined FASTQ output is piped to `fastp` for quality filtering and adapter trimming, which reads from standard input and writes processed FASTQ to standard output.
      - Pipe 2: The filtered reads are piped to `bowtie2` for alignment, which uses the processed reads as input.
    - [x] check how BSF trims/processes RNAseq data -> very complicated (http://cemmgit.int.cemm.at/mschuster/bsfpython)
      - The fun starts at bsf.workflows.rnaseq_deseq, which imports bsf.workflows.star, which imports bsf.workflows.picard_sam_to_fastq. The driver scripts are generated via pip at installation from the console_run() class methods.
    - [x] What if it were easy and I could trim in the same step as alignment using pipes, i.e., convert to FASTQ, trim, align, and provide aligned BAMs?
    - alternative trimmers
      - trimgalore
      - fastp
      - trimmomatic (used by BSF)
- [x] `processing.smk`
  - [x] double check all commands and parameters in conjunction with their manuals (manual & AI)
  - [x] find out about necessity of RG and if I should set that actually
  - [x] make sure not only marked but actually trimmed and filtered output
  - [x] enable auto adapter trimming in `fastp` for paired end (check log about this)?
        `Adapter auto-detection is disabled for STDIN mode`
- [x] Consider adding back `adapter.fa` in `config` and define as conditional input(!) to trimming so that it is tracked and required. 
- [x] MultiQC report
  - [x] think about redirecting STAR logging into result folder for multiQC? -> star result folder is input
  - [x] Check how and what MultiQC wants from where? rseqc also logs in log folder, but samtools/fastp etc I put somewhere else -> makes sense?!
  - [x] add fastp and samtools logs to multiqc
  - [x] upgrade MultiQC?
- [x] consider moving logs into result directory (multiqc needs to know where, ie have to change that as well)
- [x] rule `star_index` takes very long. check if parallelizable
- [x]  **gene annotation:** including GC content and length for downstream analysis e.g., CQN normalization (code exists)
- [x]  **sample annotation:** (MultiQC) report with sequencing statistics and export as annotation (like in ATACseq)
- [x] **strandedness**: understand when, where and why used/necessary
- [x] consider removing threads from config
- [x] check for latest software versions and add to env yaml files
- [x] adapt or remove validation schemas for annot and config
- [x]  RtH: build in a check for paired/single-end
  - The check for single-end vs. paired-end could result in success/fail log files (name: read_type_correct.log)
- [x] add `export.smk`
- [x] protect main branch
- [x] test with multiple input bam files per sample(!) and check file sizes and counts (should be doubled)
- [x] why are the temp FASTQ files not removed? -> Fix
- [x] test run on large dataset
- [x] make README
  - [x] Methods can be taken from macroStim
  - [x] reference/cite the original Snakemake RNA pipeline
    - [x] citing: **If you use this workflow in a publication, please…cite this AND the original pipeline here DOI**
  - [x] refer to genome_tracks module, like done in ATAC-seq pipeline (DRY principle). If the group variable is the sample name, then the bigWigs are sample-wise.
- [x] DR: I forgot to clean the resources folder and everything kept aligning to the downloaded genome there - an important check for your version of the RNA processing
  - Reference data could contain ref info eg species etc -> but more complexity
  - put in docs the location of resources and also a reminder to delete them if parameters changed 
- [x] send out for testing (BockBots) & review (BSF)
- [x] make [MrBiomics compatible](https://github.com/epigen/MrBiomics/wiki/Sustainability-%26-Reproducibility) (CITATION.cff etc)
    - [x] citation of original pipeline in CITATION.cff as resource
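
Sketch of the "convert, trim, align in one piped step" idea from the trimming notes above (Step 1 / Pipe 1 / Pipe 2). This is only a minimal illustration, not the code in `processing.smk`: the function name `build_trim_align_command` and its parameters are hypothetical, it assumes unaligned paired-end BAMs whose mates are adjacent (so `samtools fastq` emits interleaved FASTQ on stdout), and it uses `bowtie2` as in the ATAC-seq description. Since `fastp` reports `Adapter auto-detection is disabled for STDIN mode`, an adapter FASTA can optionally be passed. All flags should be double-checked against the tool manuals before use.

```python
import shlex


def build_trim_align_command(bams, bowtie2_index, threads=4, adapter_fasta=None):
    """Return a shell pipeline string: samtools fastq | fastp | bowtie2.

    Hypothetical helper for illustration; nothing is executed here.
    """
    if not bams:
        raise ValueError("at least one BAM file is required")

    # Step 1: one `samtools fastq` call per raw BAM, grouped in a subshell
    # so that the combined FASTQ stream goes into a single pipe.
    convert = "; ".join(f"samtools fastq {shlex.quote(bam)}" for bam in bams)

    # Pipe 1: fastp reads interleaved FASTQ from stdin and writes to stdout.
    fastp = ["fastp", "--stdin", "--stdout", "--interleaved_in",
             "--thread", str(threads)]
    if adapter_fasta is not None:
        # Auto-detection is off in STDIN mode, so adapters are given explicitly.
        fastp += ["--adapter_fasta", adapter_fasta]

    # Pipe 2: bowtie2 reads the interleaved, trimmed reads from stdin ("-").
    align = ["bowtie2", "-p", str(threads), "-x", bowtie2_index,
             "--interleaved", "-"]

    def join(args):
        return " ".join(shlex.quote(arg) for arg in args)

    return f"( {convert} ) | {join(fastp)} | {join(align)}"


if __name__ == "__main__":
    print(build_trim_align_command(["a.bam", "b.bam"], "idx/genome"))
```

Building the command as a string keeps it easy to drop into a Snakemake `shell:` directive and to inspect in logs; with multiple BAMs per sample the read count after Step 1 should be the sum over all inputs.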

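Sketch of the paired/single-end check (RtH item above). A minimal illustration only, assuming the reads are available as SAM text lines, e.g. from `samtools view` on the raw BAM; the function name `detect_read_type` and the `max_reads` cutoff are hypothetical. It relies on SAM FLAG bit `0x1`, which marks a read as part of a multi-segment (paired) template.

```python
def detect_read_type(sam_lines, max_reads=10000):
    """Classify reads as 'paired', 'single', 'mixed' or 'unknown'.

    Inspects the FLAG column (second field) of up to `max_reads` SAM records.
    """
    paired = 0
    single = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        flag = int(fields[1])
        if flag & 0x1:  # 0x1: template has multiple segments (paired-end)
            paired += 1
        else:
            single += 1
        if paired + single >= max_reads:
            break
    if paired and single:
        return "mixed"
    if paired:
        return "paired"
    if single:
        return "single"
    return "unknown"


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6\tSO:unsorted",
        "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "read1\t141\t*\t0\t0\t*\t*\t0\t0\tTGCA\tIIII",
    ]
    print(detect_read_type(example))
```

The result could be compared with the read type given in the annotation and written to `read_type_correct.log`, failing the rule on a mismatch.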
