# Computational Workflows for Biomedical Data

Welcome to the course Computational Workflows for Biomedical Data. Over the next two weeks, you will learn how to leverage nf-core pipelines to analyze biomedical data and gain hands-on experience in creating your own pipelines, with a strong emphasis on Nextflow and nf-core.

Course Structure:

- Week 1: You will use a variety of nf-core pipelines to analyze a publicly available biomedical study.
- Week 2: We will shift focus to learning the basics of Nextflow, enabling you to design and implement your own computational workflows.
- Final Project: In the last couple of days, you will apply your knowledge to create a custom pipeline for analyzing biomedical data using Nextflow and the nf-core template.

## Basics

If you have not yet installed all required software, please do so as soon as possible.

If you have already installed everything, go ahead and start answering the questions in this notebook. If you have any questions, don't hesitate to approach us.

1. What is nf-core?

nf-core is a global community that collaborates to build open-source Nextflow components and pipelines.

2. How many pipelines are there currently in nf-core?

Based on the Pipeline page: 139

![image.png](attachment:image.png)

3. Are there any non-bioinformatic pipelines in nf-core?

Yes, for example nf-core/meerpipe, which is an astronomy pipeline.

4. Let's go back a couple of steps. What is a pipeline and what do we use it for?


A pipeline is a series of automated software and/or algorithmic steps that run in a systematic, fixed order to turn raw input into interpretable output. In bioinformatics, pipelines are used to process large volumes of raw biological data, such as DNA or RNA sequences. Because every step is automated, the results are consistent and reproducible.
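The idea of chaining automated steps can be sketched in a few lines of Python. This is only a toy illustration, not how Nextflow or nf-core implement pipelines: the step names (`quality_filter`, `gc_content`, `summarize`) and the example reads are hypothetical.

```python
from functools import reduce


def quality_filter(reads):
    """Keep only reads without ambiguous bases (N)."""
    return [r for r in reads if "N" not in r]


def gc_content(reads):
    """Compute the GC fraction of each read."""
    return [(r.count("G") + r.count("C")) / len(r) for r in reads]


def summarize(values):
    """Reduce the per-read values to one interpretable number."""
    return round(sum(values) / len(values), 3)


# The pipeline is just the ordered list of steps (hypothetical example).
PIPELINE = [quality_filter, gc_content, summarize]


def run_pipeline(data, steps=PIPELINE):
    """Feed the output of each step into the next one, always in the same order."""
    return reduce(lambda result, step: step(result), steps, data)


print(run_pipeline(["ACGT", "GGCC", "ANNT"]))  # mean GC fraction of the kept reads
```

Because the steps and their order are fixed in code, running the same input twice gives the same result, which is the reproducibility argument made above.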

5. Why do you think nf-core adheres to strict guidelines?

nf-core tries to follow the FAIR principles. These are guidelines to improve the Findability, Accessibility, Interoperability, and Reusability of digital assets. Enforcing strict guidelines across all pipelines is how nf-core works toward these goals.

6. What are the main features of nf-core pipelines?

- extensive documentation
- stable releases
- open source: All nf-core code is licensed under the MIT license
- CI testing
- run anywhere
- packaged software

## Let's start using the pipelines

1. Find the nf-core pipeline used to measure differential abundance of genes

nf-core/differentialabundance

In [None]:
# run the pipeline in a cell 
# to run bash in jupyter notebooks, simply use ! before the command
# e.g.

!pwd

# For the tasks in the first week, please use the command line to run your commands and simply paste the commands you used in the respective cells!


/Users/leoniewehnert/pCloud Drive/Master/Semester 4/Nextflow_Prak/computational-workflows-2025/notebooks/day_01


In [2]:
# run the pipeline in the test profile using docker containers
# make sure to specify the version you want to use (use the latest one)

#!nextflow run nf-core/differentialabundance

!nextflow run nf-core/differentialabundance -r 1.5.0 -profile test,docker --outdir test


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/differentialabundance` [0;2m[[0;1;36mnostalgic_goldberg[0;2m] DSL2 - [36mrevision: [0;36m3dd360fed0 [1.5.0][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/differentialabundance v1.5.0-g3dd360f[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision                    : [0;32m1.5.0[0m
  [0;

In [None]:
# repeat the run. What changed?
# First run:  Duration    : 11m 33, CPU hours   : 0.2, Succeeded   : 21
# Second run: Duration    : 10m 30s, CPU hours   : 0.2, Succeeded   : 21
# The hashes on the left side are different, and the order changed.

!nextflow run nf-core/differentialabundance -r 1.5.0 -profile test,docker --outdir test


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/differentialabundance` [0;2m[[0;1;36mdesperate_bell[0;2m] DSL2 - [36mrevision: [0;36m3dd360fed0 [1.5.0][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/differentialabundance v1.5.0-g3dd360f[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision                    : [0;32m1.5.0[0m
  [0;34mr

In [None]:
# now add -resume to the command. What changed?
# The runtime changed: the run is much faster, around 1 min.
!nextflow run nf-core/differentialabundance -r 1.5.0 -profile test,docker -resume --outdir test


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/differentialabundance` [0;2m[[0;1;36msilly_babbage[0;2m] DSL2 - [36mrevision: [0;36m3dd360fed0 [1.5.0][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/differentialabundance v1.5.0-g3dd360f[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision                    : [0;32m1.5.0[0m
  [0;34mru

Check out the current directory. Next to the outdir you specified, what else has changed?

With -resume, the hash values on the left side are the same as in the previous run, which is why the run finishes so much faster.

Next to the output directory, Nextflow maintains a work/ directory that caches the results of all previously run tasks. With -resume, only new or modified tasks run.
Identical tasks are skipped and their cached outputs are re-linked, which is almost instantaneous.
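The caching idea can be illustrated with a small Python sketch. This is a deliberately simplified model and not Nextflow's actual implementation: the functions `task_hash` and `run_task`, the `done` marker file, and the task and file names are all hypothetical, and the toy hash covers only the task name, the script text, and the input names.

```python
import hashlib
import tempfile
from pathlib import Path


def task_hash(name, script, inputs):
    """Derive a stable key from everything that defines a task."""
    digest = hashlib.sha256()
    for part in (name, script, *inputs):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator, so ("ab", "c") and ("a", "bc") differ
    return digest.hexdigest()


def run_task(work, name, script, inputs, resume):
    """Return "cached" if a finished task directory can be reused, else "executed"."""
    key = task_hash(name, script, inputs)
    task_dir = Path(work) / key[:2] / key[2:]  # toy layout: work/<2 chars>/<rest>
    marker = task_dir / "done"
    if resume and marker.exists():
        return "cached"
    task_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text("0")  # stand-in for actually running the script
    return "executed"


def demo():
    """Run the same toy task twice, then once with a changed input."""
    with tempfile.TemporaryDirectory() as work:
        return [
            run_task(work, "TOY_TASK", "echo hello", ["counts.tsv"], resume=True),
            run_task(work, "TOY_TASK", "echo hello", ["counts.tsv"], resume=True),
            run_task(work, "TOY_TASK", "echo hello", ["other.tsv"], resume=True),
        ]


print(demo())  # ['executed', 'cached', 'executed']
```

The second call is skipped because its key already exists in the toy work directory, while the changed input produces a new key and therefore a new execution.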

In [8]:
# delete the work directory and run the pipeline again using -resume. What changed?
!nextflow run nf-core/differentialabundance -r 1.5.0 -profile test,docker -resume --outdir test


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/differentialabundance` [0;2m[[0;1;36mgloomy_ptolemy[0;2m] DSL2 - [36mrevision: [0;36m3dd360fed0 [1.5.0][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/differentialabundance v1.5.0-g3dd360f[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision                    : [0;32m1.5.0[0m
  [0;34mr

What changed?

The runtime increases again, and the order of the hashes is new as well. Because the work directory is gone, there is nothing to reuse, so every process is recomputed from scratch even with -resume.
The run behaves just like a fresh one.

## Let's look at the results

### What is differential abundance analysis?
Differential abundance analysis statistically compares feature abundances (for example, gene expression levels) across groups to identify features that change significantly.
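As a rough illustration of such a comparison, the sketch below computes a log₂ fold change and Welch's t statistic for two features. It is a simplified stand-in and not the statistical model used in the pipeline; the function names, the pseudocount of 1, and the count values are made up for this example.

```python
import math
from statistics import mean, variance


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of the group means; the pseudocount avoids division by zero."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def welch_t(treated, control):
    """Welch's t statistic for two groups with possibly unequal variances."""
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return (mean(treated) - mean(control)) / se


# Hypothetical abundances for two features in two groups of three samples.
counts = {
    "geneA": {"control": [10, 12, 11], "treated": [45, 50, 43]},
    "geneB": {"control": [100, 98, 105], "treated": [101, 99, 103]},
}

for gene, groups in counts.items():
    lfc = log2_fold_change(groups["treated"], groups["control"])
    t = welch_t(groups["treated"], groups["control"])
    print(f"{gene}: log2FC={lfc:.2f}, t={t:.2f}")
```

In this toy data, geneA shows a large shift between the groups, while geneB has identical group means and therefore a fold change and t statistic of zero.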

Give the most important plots from the report:
- MA plots (before/after shrinkage): Show the relationship between log₂ fold change and mean abundance. They help to evaluate whether normalization worked correctly and whether the statistical assumptions of the model are reasonable.
- Preprocessing: Boxplots of the different samples/treatments provide an overview of the distribution of abundances across samples. They are useful for spotting outliers, distribution shifts, or normalization issues.
- Data reduction: Principal component analysis (PCA) highlights the major axes of variation among samples. This makes it possible to check whether samples cluster according to biological groups, or whether there are batch effects or other artifacts.
- Final analysis: Volcano plots display log₂ fold change versus -log₁₀ p-value, indicating which features (genes/taxa) are significantly up- or down-regulated between contrasts.
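The coordinates behind the MA and volcano plots can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the pseudocount, and the cutoffs (|log₂ fold change| ≥ 1, p < 0.05) are chosen for this example and are not taken from the pipeline report.

```python
import math


def ma_point(mean_a, mean_b, pseudocount=1.0):
    """Return (A, M): average log2 abundance and log2 fold change of b versus a."""
    log_a = math.log2(mean_a + pseudocount)
    log_b = math.log2(mean_b + pseudocount)
    return (log_a + log_b) / 2, log_b - log_a


def volcano_point(log2fc, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one feature on a volcano plot."""
    y = -math.log10(p_value)
    if p_value < p_cutoff and log2fc >= lfc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2fc <= -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2fc, y, label


# Hypothetical features: (log2 fold change, p-value)
features = {"geneA": (2.0, 0.001), "geneB": (-1.5, 0.01), "geneC": (3.0, 0.2)}
for name, (lfc, p) in features.items():
    print(name, volcano_point(lfc, p))

print(ma_point(3, 15))  # A and M for group means 3 and 15
```

Features with a large fold change but a high p-value, such as the hypothetical geneC, end up in the lower corners of a volcano plot and are not called significant.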

![image.png](attachment:image.png)

![image-3.png](attachment:image-3.png)

![image-2.png](attachment:image-2.png)