Closed
Description
- Consider building from scratch by reusing parts from Snakemake and BSF pipeline each. Instead of forking and deleting. But mention and give credit. -> no, continue with current codebase and approach
- Remove all downstream analyses, e.g., DeSeq & PCA
- change script count-matrix.py line 40 from sep="\t" to sep=","
- switch to one sequencing unit CSV file
- change to raw BAM files as input
- alignment + downstream first
- trimming
- cutadapt: doesn't support unaligned paired end BAMs: https://cutadapt.readthedocs.io/en/stable/reference.html#supported-file-formats
- check if paired end does not require trimming?! -> requires trimming
- trimming necessary? STAR trimming and quality control alexdobin/STAR#455 -> yes
- check if and how we trim in ATACseq
- Step 1: A loop runs samtools fastq on each raw BAM file (from {input}) to convert it into FASTQ format.
- Pipe 1: The combined FASTQ output is piped to fastp for quality filtering and adapter trimming; fastp reads from standard input and writes the processed FASTQ to standard output.
- Pipe 2: The filtered reads are piped to bowtie2, which aligns them.
- check how BSF trims/processes RNAseq data -> very complicated: (http://cemmgit.int.cemm.at/mschuster/bsfpython)
- The fun starts at bsf.workflows.rnaseq_deseq, which imports bsf.workflows.star, which imports bsf.workflows.picard_sam_to_fastq. The driver scripts are generated via Pip at installation from the console_run() class methods.
- What if it were easy and I could trim in the same step as alignment using pipes, i.e., convert to FASTQ, trim, align, provide aligned BAMs?
- alternative trimmers
- trimgalore
- fastp
- trimmomatic (used by BSF)
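The convert, trim, align idea above could be sketched as a small helper that assembles the shell pipeline for one sample. This is only an illustration under assumptions: the function name `build_pipeline`, its parameters, and the final `samtools view` stage are hypothetical, the reads are assumed to be paired-end and name-grouped so that `samtools fastq` emits interleaved FASTQ, and bowtie2 is used as the aligner because that is what the pipe description names. All flags should still be checked against the tool manuals.

```python
import shlex


def build_pipeline(bams, index, out_bam, threads=4):
    """Assemble a convert | trim | align | compress shell pipeline (hypothetical helper).

    Assumes paired-end, name-grouped unaligned BAMs, so `samtools fastq`
    writes interleaved FASTQ to standard output.
    """
    if not bams:
        raise ValueError("at least one input BAM is required")

    # Step 1: convert every raw BAM of the sample to FASTQ, one after another.
    convert = "; ".join(f"samtools fastq {shlex.quote(b)}" for b in bams)

    stages = [
        f"( {convert} )",
        # Pipe 1: fastp reads interleaved FASTQ from stdin and writes to stdout.
        # Note from the issue: adapter auto-detection is disabled in STDIN mode.
        "fastp --stdin --stdout --interleaved_in",
        # Pipe 2: bowtie2 aligns the interleaved reads arriving on stdin.
        f"bowtie2 -p {int(threads)} -x {shlex.quote(index)} --interleaved -",
        # Assumed final stage: compress the SAM stream into a BAM file.
        f"samtools view -b -o {shlex.quote(out_bam)} -",
    ]
    return " | ".join(stages)


if __name__ == "__main__":
    print(build_pipeline(["lane1.bam", "lane2.bam"], "genome_index", "sample.bam"))
```

Because several input BAMs are simply concatenated in Step 1, the same command covers the "multiple input BAM files per sample" case; whether STAR can be substituted for bowtie2 in such a pipe is left open here.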
- processing.smk
- double check all commands and parameters in conjunction with their manuals (manual & AI)
- find out about necessity of RG and if I should set that actually
- make sure not only marked but actually trimmed and filtered output
- enable auto adapter trimming in fastp for paired end (check log about this)? Log message: "Adapter auto-detection is disabled for STDIN mode"
- Consider adding back adapter.fa in config and define as conditional input(!) to trimming so that it is tracked and required.
- MultiQC report
- think about redirecting STAR logging into result folder for multiQC? -> star result folder is input
- Check how and what MultiQC wants from where? rseqc also logs in log folder, but samtools/fastp etc I put somewhere else -> makes sense?!
- add fastp and samtools logs to multiqc
- upgrade MultiQC?
- consider moving logs into result directory (multiqc needs to know where, ie have to change that as well)
- rule star_index takes very long. check if parallelizable
- gene annotation: including GC content and length for downstream analysis e.g., CQN normalization (code exists)
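For the gene annotation item, the two per-gene covariates could be computed roughly as follows. This is a minimal sketch, not the existing code mentioned above (which is not shown here); the function name `gc_content_and_length` is hypothetical, and ambiguous bases such as N are assumed to count toward the length but not toward GC.

```python
def gc_content_and_length(sequence):
    """Return (GC fraction, length) for one gene sequence (hypothetical helper)."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        # Avoid division by zero for empty sequences.
        return 0.0, 0
    gc = seq.count("G") + seq.count("C")
    return gc / length, length


if __name__ == "__main__":
    fraction, length = gc_content_and_length("ATGCGC")
    print(f"GC fraction: {fraction:.3f}, length: {length}")
```

Which sequence is used per gene (e.g., merged exons versus the full gene body) is a design choice that affects both values and is not settled by this sketch.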
- sample annotation: (MultiQC) report with sequencing statistics and export as annotation (like in ATACseq)
- strandedness: understand when, where and why used/necessary
- consider removing threads from config
- check for latest software versions and add to env yaml files
- adapt or remove validation schemas for annot and config
- RtH: build in a check for paired/single-end
- Check for single end or paired could result in success/fail.log files (name read_type_correct.log)
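One way the paired/single-end check could work is to inspect the SAM FLAG field of a sample of reads: bit 0x1 is set when a read belongs to a template with multiple segments, i.e. is paired. The sketch below is an assumption about how such a check might look; the function names and the "mixed" category are hypothetical, and the flags are assumed to have been extracted beforehand (for example from the second column of `samtools view` output).

```python
PAIRED_FLAG = 0x1  # SAM FLAG bit: template has multiple segments (paired read)


def detect_read_type(flags):
    """Classify a collection of SAM FLAG integers as 'paired', 'single' or 'mixed'."""
    flags = list(flags)
    if not flags:
        raise ValueError("no flags given; cannot determine read type")
    paired = sum(1 for flag in flags if flag & PAIRED_FLAG)
    if paired == 0:
        return "single"
    if paired == len(flags):
        return "paired"
    return "mixed"


def read_type_is_correct(flags, expected):
    """Return True if the detected read type matches the annotated one."""
    return detect_read_type(flags) == expected


if __name__ == "__main__":
    # 77 and 141 are the flags of an unaligned read pair; 4 is an unaligned single read.
    print(detect_read_type([77, 141, 77, 141]))
    print(read_type_is_correct([4, 4, 4], "paired"))
```

The boolean result could then decide whether a success or fail log such as read_type_correct.log is written.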
- add export.smk
- protect main branch
- test with multiple input bam files per sample(!) and check file sizes and counts (should be doubled)
- why are the temp FASTQ files not removed? -> Fix
- test run on large dataset
- make README
- Methods can be taken from macroStim
- reference/cite the original Snakemake RNA pipeline
- citing: If you use this workflow in a publication, please…cite this AND the original pipeline here DOI
- refer to genome_tracks module, as done in the ATAC-seq pipeline (DRY principle). If the group variable is the sample name, then it produces sample-wise bigWigs.
- DR: I forgot to clean resources folder and everything kept aligning to the downloaded genome there - important check for your version of the rna processing
- Reference data could contain ref info eg species etc -> but more complexity
- put in docs the location of resources and also a reminder to delete them if parameters changed
- send out for testing (BockBots) & review (BSF)
- make MrBiomics compatible (CITATION.cff etc)
- citation of original pipeline in CITATION.CFF as resource