# UMR MBT Microbiome Praktikum Session 1


---

### 1.1 Setup and introduction to the command line for Bioinformatics

First, let's run the following cell to install the `conda` package manager (about 2 minutes):


In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

Checking that the install is correct:

In [None]:
import condacolab
condacolab.check()

Afterwards we install all the programs we need today using `conda` (usually takes up to 5 minutes):

In [None]:
! conda install -y -c bioconda fastqc pandaseq kraken2 kraken-biom >/dev/null 2>&1

Let's verify our installs by running **_one_** of the following commands:

In [None]:
! fastqc -h

In [None]:
! pandaseq -h

In [None]:
! kraken2

To conclude the setup, let's retrieve our course repository to a new folder called "**course**"

In [None]:
! git clone https://github.com/barrantesisrael/mbtmicrobiome2023 course

Then retrieve the 16S Greengenes database: first change to the materials folder, then run the `getgg.sh` script as follows:

In [None]:
%cd course/data2025

In [None]:
! bash getgg.sh

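Notice that we used `%` instead of `!` to run our command. A command started with `!` runs in a temporary subshell, so a directory change made there is lost as soon as the cell finishes; the `%cd` magic instead changes the working directory of the notebook itself, so it stays in effect for all later cells. Now check the contents of the current directory: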

In [None]:
! ls

---

## Background

#### _Why do we need to use the command line in Bioinformatics?_

- Reproducibility: Command-line tools and programming enable the creation of reproducible workflows. By writing scripts or workflows, researchers can document and automate their analyses, ensuring that others can replicate their results.
- Pipelines: programs talking to each other (pipes)
- Redirection: programs write and read to files
- Text Streams: Allow us to both couple programs together and process data without storing huge amounts of data in our computers’ memory
- Modularity
  - Modular workflows allow us to experiment with alternate methods and approaches, since independent components can be easily swapped out
  - In a modular workflow each component is independent, which makes it easier to inspect intermediate results for inconsistencies and isolate problematic steps
  - Modular components allow us to choose tools and languages that are appropriate for specific tasks
  - Modular programs are reusable and applicable to many types of data
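
The idea of pipes and text streams from the list above can be sketched in Python with generators. This is only an illustration under stated assumptions: the helper names `grep` and `head` are hypothetical stand-ins that mimic the Unix tools, and the input is toy data. Each stage handles one line at a time, so the whole file never has to sit in memory.

```python
import io
from typing import Iterable, Iterator


def grep(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield only the lines containing `pattern`, one at a time (like `grep`)."""
    for line in lines:
        if pattern in line:
            yield line


def head(lines: Iterable[str], n: int = 10) -> Iterator[str]:
    """Yield at most the first `n` lines, then stop reading (like `head`)."""
    for i, line in enumerate(lines):
        if i >= n:
            break
        yield line


# Toy input standing in for a large text file (hypothetical data).
stream = io.StringIO("ACGT\nAATATT\nGGGG\nTTAATATTCC\n")

# Similar in spirit to: grep "AATATT" file | head -n 1
first_hit = list(head(grep(stream, "AATATT"), n=1))
print(first_hit)
```

Because each stage is an independent function, one stage can be swapped for another without touching the rest, which is the modularity argument made above.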

Reference: [Buffalo, V. (2015) _Bioinformatics Data Skills_. O’Reilly Media, Sebastopol CA](https://www.oreilly.com/library/view/bioinformatics-data-skills/9781449367480/)

---

#### _Reproducibility_

- Literate programming: Chunks of programming (analytical code) with human-readable text (comments)
- Version control: Tracking changes made to sets of files (of a project), typically program source code, scripts and documentation
- Environment control: Versions of all programs (plus libraries, packages, OS) used; archival copies for future reference. Example: `sessionInfo()` in R
- Persistent data sharing: collaborative, transparent, accessible science
- Documentation, e.g. README file
- Project (data + code + …) validation
- Command-line tools and programming enable the creation of reproducible workflows (aka pipelines). By writing scripts or workflows, researchers can document and automate their analyses, ensuring that others can replicate their results.
- Modularity of the command-line: Possibility of running long pipelines of programs, one after another + piping
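
As a small illustration of the environment control point above, the following Python sketch records the interpreter version, the platform, and the versions of selected packages, loosely similar to what `sessionInfo()` reports in R. The function name `session_info` and the package names passed to it are assumptions made for this example.

```python
import platform
import sys
from importlib import metadata


def session_info(packages=()):
    """Collect Python, platform and package versions into a dictionary."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            info[name] = "not installed"
    return info


# Example call; the package names are placeholders for whatever you use.
for key, value in session_info(["pip", "condacolab"]).items():
    print(f"{key}: {value}")
```

Saving this output next to your results makes it easier to reconstruct the environment later.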

Reference: [Ziemann et al. _Brief Bioinform_. 2023 Sep 22;24(6):bbad375](https://academic.oup.com/bib/article/24/6/bbad375/7326135)

---

#### _The 16S rRNA gene for community profiling_

- Operational concepts of classification: OTUs and ASVs
  - Operational taxonomic units (OTU): Consensus sequences from clustering
  - Amplicon Sequence Variants (ASV): Exact sequences
- Community profiling: Identifying OTUs and/or ASVs in samples
- The 16S rRNA is part of the 30S small subunit (SSU) of the prokaryotic ribosome
  - All prokaryotes have at least one copy of this gene
  - This is NOT true for any protein-coding gene
- Different parts of the gene exhibit different levels of conservation
  - More conserved regions can be used to analyse distantly related species
  - More variable regions can be used to analyse more closely related species
- Gene ~1540 nt
  - Not too short so as to be uninformative; not too long so as to be unmanageable
- Most of the 16S rRNA gene is highly conserved between different species
  - Hypervariable regions: nine regions (V1–V9) are much less conserved

![16SrRNA](https://bioinformatics.ccr.cancer.gov/docs/qiime2/images/16SrRNA.png)

Figure source: Fukuda et al. Molecular Approaches to Studying Microbial Communities: Targeting the 16S Ribosomal RNA Gene. J UOEH. 2016 Sep;38(3):223-32. doi: [10.7888/juoeh.38.223](https://doi.org/10.7888/juoeh.38.223)

---

#### _Microbiome analysis_

---

#### _Sequence data formats_

---

Replace the name of the FASTQ file with your own identifier, and look at the first ten lines. What do you notice?

In [None]:
! head Platz1_R1.head.fastq


Count the total number of lines with the command below. How many reads are in this FASTQ file? Hint: each read in FASTQ format consists of four lines.


In [None]:
! wc -l Platz1_R1.head.fastq
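
The hint above can be written out as a short Python sketch. It assumes an uncompressed FASTQ file with exactly four lines per read (no wrapped sequence lines); the function name `count_fastq_reads` and the toy records are made up for this illustration.

```python
import io


def count_fastq_reads(handle) -> int:
    """Count reads in a FASTQ stream, assuming four lines per read."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated or wrapped file breaks the four-line assumption.
        raise ValueError(f"{n_lines} lines is not a multiple of four")
    return n_lines // 4


# Two toy reads (hypothetical data): header, sequence, '+', qualities.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
print(count_fastq_reads(demo))  # 2

# With a real file you would pass an open file handle instead, e.g.
# with open("Platz1_R1.head.fastq") as fh:
#     print(count_fastq_reads(fh))
```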

Find specific nucleotide combinations, e.g. "AATATT":

In [None]:
! grep "AATATT" Platz1_R1.head.fastq | head

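Count how many lines of your Illumina data contain the hexanucleotide "AATATT". Note that `grep -c` counts matching lines, not individual occurrences, so a read that contains the motif twice is counted only once.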

In [None]:
! grep -c "AATATT" Platz1_R1.head.fastq
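
To make the difference between matching lines and individual occurrences concrete, here is a minimal Python sketch. It assumes four-line FASTQ records and looks only at the sequence line of each record; the function name `count_motif` and the toy reads are hypothetical.

```python
import io


def count_motif(handle, motif: str):
    """Return (reads containing motif, total occurrences) for sequence lines."""
    reads_with_motif = 0
    total_occurrences = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        seq = line.strip().upper()
        # Count every start position, so overlapping hits are included.
        hits = sum(
            1
            for j in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, j)
        )
        if hits:
            reads_with_motif += 1
            total_occurrences += hits
    return reads_with_motif, total_occurrences


# Three toy reads (hypothetical data); the first contains the motif twice.
demo = io.StringIO(
    "@r1\nAATATTGGAATATT\n+\nIIIIIIIIIIIIII\n"
    "@r2\nCCCC\n+\nIIII\n"
    "@r3\nTAATATTA\n+\nIIIIIIII\n"
)
print(count_motif(demo, "AATATT"))  # (2, 3)
```

The first number corresponds to what `grep -c` reports when the motif occurs only in sequence lines; the second is the number of occurrences.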

### 1.2 Quality control with the [FASTQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) tool

Replace the `Platz` number with your own in the following command before running

In [None]:
! fastqc --quiet Platz1_R1.head.fastq

To display the output, download the HTML report from the `Files` view in the side menu to your local computer (folder: `course/data2025`, the directory we changed into earlier), and open it in a web browser.

---

#### _Sequencing and assembly_

---

### 1.3 Amplicon assembly with [pandaseq](https://github.com/neufeld/pandaseq)

Again, replace the `Platz` number with your own before running the following command:

In [None]:
! pandaseq -f Platz10_R1.head.fastq -r Platz10_R2.head.fastq -w Platz10.fa -g log.txt

Observe the first ten lines of the FASTA output:

In [None]:
! head Platz10.fa

Q: _How many sequences are in this FASTA output?_

Hint: count the total number of HEADER line symbols (">") with the command:


In [None]:
! grep -c ">" Platz10.fa

Q: _What is the ratio of FASTA sequences to FASTQ reads? And what does this tell us about the quality and efficiency of our sequencing and assembly?_
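
One way to compute this ratio is sketched below in Python. It assumes that each assembled FASTA sequence comes from one read pair, so the number of reads in the R1 file equals the number of pairs; the function names and the toy data are illustrative only.

```python
import io


def count_fastq_reads(handle) -> int:
    """Count reads in a FASTQ stream, assuming four lines per read."""
    return sum(1 for _ in handle) // 4


def count_fasta_sequences(handle) -> int:
    """Count FASTA records by their header lines, which start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def assembly_ratio(n_assembled: int, n_read_pairs: int) -> float:
    """Fraction of read pairs that produced an assembled sequence."""
    if n_read_pairs == 0:
        raise ValueError("no read pairs to compare against")
    return n_assembled / n_read_pairs


# Toy data (hypothetical): four read pairs, three assembled sequences.
r1 = io.StringIO("@r\nACGT\n+\nIIII\n" * 4)
fa = io.StringIO(">s1\nACGT\n>s2\nACGT\n>s3\nACGT\n")
ratio = assembly_ratio(count_fasta_sequences(fa), count_fastq_reads(r1))
print(f"{ratio:.2f}")  # 0.75
```

A ratio close to 1 means that almost every read pair could be merged into a single amplicon sequence.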



### 1.4 OTU Assignment with [kraken2](https://ccb.jhu.edu/software/kraken2/) against the 16S [Greengenes](https://greengenes.secondgenome.com/) database

Run your own samples against Greengenes; the example here uses the `Platz10` run, so replace the FASTQ filenames with your own as before.



In [None]:
! kraken2 --db 16S_Greengenes_k2db --use-names --output output.txt --report report.txt --paired Platz10_R1.head.fastq Platz10_R2.head.fastq

Inspect your individual results (file: `report.txt`) with the following command:

In [None]:
! cat report.txt

Question: _What are the most predominant genera in your personal Illumina runs?_

The output report from Kraken2 consists of the following fields:

- Percentage of fragments covered by the clade rooted at this taxon
- Number of fragments covered by the clade rooted at this taxon
- Number of fragments assigned directly to this taxon
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
- NCBI taxonomic ID number
- Indented scientific name

See the program [documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format) for details.
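
Given the six fields listed above, the most abundant genera can be pulled out of a report with a few lines of Python. This sketch assumes the standard tab-separated report with exactly six columns; the function name `top_genera` and the toy report lines (including the genus names and IDs) are invented for illustration and are not real results.

```python
import io


def top_genera(handle, n: int = 5):
    """Return up to `n` genus rows as (name, clade fragments, percent)."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:  # skip anything that is not a standard row
            continue
        percent, clade_count, _direct_count, rank, _taxid, name = fields
        if rank == "G":
            # Names are indented with spaces to show the taxonomy depth.
            rows.append((name.strip(), int(clade_count), float(percent)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Toy report (hypothetical values), same column order as described above.
demo = io.StringIO(
    " 10.00\t10\t10\tU\t0\tunclassified\n"
    " 90.00\t90\t0\tR\t1\troot\n"
    " 30.00\t30\t30\tG\t200\t          GenusB\n"
    " 60.00\t60\t5\tG\t100\t          GenusA\n"
)
for name, count, percent in top_genera(demo):
    print(f"{name}\t{count}\t{percent}")

# With a real report: with open("report.txt") as fh: top_genera(fh)
```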


Finally, to generate our OTU table, we first need to run kraken2 on all Illumina sequencing pairs; here this is simplified by calling the script `getkrakenreports.sh`:

In [None]:
! bash getkrakenreports.sh

followed by merging all reports with the [`kraken-biom`](https://github.com/smdabdoub/kraken-biom) tool:

In [None]:
! kraken-biom *.report --fmt tsv -o mbtmicrobiome20251022.tsv

Now let's take a look at our OTU table:

In [None]:
! cat mbtmicrobiome20251022.tsv
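
As a first look at the table beyond `cat`, the following Python sketch reads the TSV and sums the counts per sample. It assumes the usual BIOM-style TSV layout (an optional comment line, then a header line starting with `#OTU ID`, then one row per OTU); check your own file, since this layout is an assumption here. The function names, sample names and counts are made up.

```python
import io


def read_otu_table(handle):
    """Parse a BIOM-style TSV into (sample names, {otu_id: {sample: count}})."""
    samples = None
    table = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#OTU ID"):
            samples = line.split("\t")[1:]  # header row with sample names
            continue
        if line.startswith("#"):
            continue  # other comment lines
        if samples is None:
            raise ValueError("data row found before the '#OTU ID' header")
        otu_id, *values = line.split("\t")
        table[otu_id] = dict(zip(samples, (float(v) for v in values)))
    return samples, table


def totals_per_sample(samples, table):
    """Sum the counts of all OTUs for each sample."""
    return {s: sum(row[s] for row in table.values()) for s in samples}


# Toy table (hypothetical sample names and counts).
demo = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\tPlatz1\tPlatz2\n"
    "527\t3.0\t0.0\n"
    "1234\t7.0\t5.0\n"
)
samples, table = read_otu_table(demo)
print(totals_per_sample(samples, table))
```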

OTU taxonomic information can be retrieved from the [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) database, e.g. the id number 527 corresponds to [_Acidiphilium_ sp.](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=527)

The full version of this table (e.g. see [`mbtmicrobiome2024.tsv`](https://github.com/barrantesisrael/mbtmicrobiome2023/blob/main/data2025/mbtmicrobiome2024.tsv)) will be analyzed in the next session, together with the [`metadata`](https://docs.google.com/spreadsheets/d/1cmch5QirBpVdN67B-8XmMPAtNNMhrGgylskox9nsuVw/edit?usp=sharing) file.

---

## Contact

Dr. rer. nat. Israel Barrantes <br>
Junior Research Group Translational Bioinformatics (head)<br>
Institute for Biostatistics and Informatics in Medicine and Ageing Research, Office 3017<br>
Rostock University Medical Center<br>
Ernst-Heydemann-Str. 8<br>
18057 Rostock, Germany<br>

Email: israel.barrantes[bei]uni-rostock.de

---
Last update 2025/10/13