# UMR MBT Microbiome Praktikum

Scripts and tutorials for analyzing microbiome data. Lab practice for the lecture: [Moderne molekulare und Hochdurchsatz-Technologien in der medizinischen Grundlagenforschung](https://lsf.uni-rostock.de/qisserver/rds?state=verpublish&status=init&vmfile=no&publishid=144145&moduleCall=webInfo&publishConfFile=webInfo&publishSubDir=veranstaltung)

## Description

These sessions will cover the use of a variety of software tools needed for the analysis of microbiome data, from the handling of the Illumina sequencing data, to the processing of 16S rRNA amplicon data. During these sessions, the students will be able to:

* Evaluate the quality of an Illumina sequencing run, including data filtering;
* Carry out assemblies of 16S rRNA amplicons;
* Assign Illumina runs to OTUs from 16S rRNA databases;
* Learn the basics of the R programming language and environment; and
* State how to manipulate microbiome data including count tables and sample metadata.


#### Venues and Dates


* Wednesday October 18th and 25th 2023 (10:00 - 11:30), at the SR2 - ZIM (Zentrum für Innere Medizin, Ernst-Heydemann-Str. 6)

#### Presentations

* Slides: [MBTPraktikum2023V01.pdf](https://drive.google.com/file/d/1IvdyRI0kiJNK5ECtGWHwOgrY763N10QR/view?usp=share_link)

#### Software

* All required software, packages and data are accessible through our virtual Binder environment

---

## Session 1


### 1.1 Setup and introduction to the command line for Bioinformatics

First, let's run the following cell to install the `conda` package manager:


In [None]:
! wget https://raw.githubusercontent.com/barrantesisrael/mbtmicrobiome2023/main/materials/setup_conda
%run setup_conda

--2023-10-24 15:38:20--  https://raw.githubusercontent.com/barrantesisrael/mbtmicrobiome2023/main/materials/setup_conda
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2219 (2.2K) [text/plain]
Saving to: ‘setup_conda’


2023-10-24 15:38:20 (31.7 MB/s) - ‘setup_conda’ saved [2219/2219]



Afterwards we install all the programs we need today using `conda` (usually takes up to 5 minutes):

In [None]:
! conda install -y -c bioconda fastqc pandaseq kraken2 >/dev/null 2>&1

Let's verify our installs by running **_one_** of the following commands:

In [None]:
! fastqc -h

In [None]:
! pandaseq -h

In [None]:
! kraken2

Need to specify input filenames!
Usage: kraken2 [options] <filename(s)>

Options:
  --db NAME               Name for Kraken 2 DB
                          (default: none)
  --threads NUM           Number of threads (default: 1)
  --quick                 Quick operation (use first hit or hits)
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "-" will
                          suppress normal output
  --confidence FLOAT      Confidence score threshold (default: 0.0); must be
                          in [0, 1].
  --minimum-base-quality NUM
                          Minimum base quality used in classification (def: 0,
                          only effective with FASTQ input).
  --report FILENAME       Print a report with aggregrate counts/clade to file
  --use-mpa-style         Wi
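
The same check can also be done from a Python cell. The following is a minimal sketch, assuming the three tools were installed under the executable names used above; the helper name `missing_tools` is only illustrative and not part of the course materials.

```python
import shutil

# Executable names as used in the conda install command above.
TOOLS = ["fastqc", "pandaseq", "kraken2"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All tools found.")
```

If any tool is reported as missing, re-running the `conda install` cell without the output redirection (`>/dev/null 2>&1`) should show the underlying error message.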

To conclude the setup, let's clone our course repository into a new folder called "**course**":

In [None]:
! git clone https://github.com/barrantesisrael/mbt.microbiome.2021 course

Cloning into 'course'...
remote: Enumerating objects: 695, done.
remote: Counting objects: 100% (260/260), done.
remote: Compressing objects: 100% (156/156), done.
remote: Total 695 (delta 151), reused 186 (delta 103), pack-reused 435
Receiving objects: 100% (695/695), 5.38 MiB | 7.09 MiB/s, done.
Resolving deltas: 100% (345/345), done.


Next, retrieve the 16S Greengenes database: first change to the `data2023` folder, then run the `getgg.sh` script as follows:

In [None]:
%cd course/data2023

/content/course/data2023


In [None]:
! bash getgg.sh

Notice that we used `%` instead of `!` for the `cd` command above. With `%cd`, the directory change persists for the following cells, whereas `!cd` would only apply inside a temporary subshell. Now check the contents of the current directory:

In [None]:
! ls

16S_Greengenes_k2db	      Platz15_R2.head.fastq  Platz25_R1.head.fastq
bash_session_20231018_V01.sh  Platz16_R1.head.fastq  Platz25_R2.head.fastq
emptyfile		      Platz16_R2.head.fastq  Platz2_R1.head.fastq
getgg.sh		      Platz17_R1.head.fastq  Platz2_R2.head.fastq
KontrolleA_R1.head.fastq      Platz17_R2.head.fastq  Platz3_R1.head.fastq
KontrolleA_R2.head.fastq      Platz18_R1.head.fastq  Platz3_R2.head.fastq
KontrolleK_R1.head.fastq      Platz18_R2.head.fastq  Platz4_R1.head.fastq
KontrolleK_R2.head.fastq      Platz19_R1.head.fastq  Platz4_R2.head.fastq
mbtmicrobiome2023.biom	      Platz19_R2.head.fastq  Platz5_R1.head.fastq
mbtmicrobiome2023.tsv	      Platz1_R1.head.fastq   Platz5_R2.head.fastq
Platz10_R1.head.fastq	      Platz1_R2.head.fastq   Platz6_R1.head.fastq
Platz10_R2.head.fastq	      Platz20_R1.head.fastq  Platz6_R2.head.fastq
Platz11_R1.head.fastq	      Platz20_R2.head.fastq  Platz7_R1.head.fastq
Platz11_R2.head.fastq	      Platz21_R1.head.fastq  Platz7_R2.head.fastq
Plat

Replace the name of the FASTQ file with your own identifier, and observe the first ten lines. What do you notice?

In [None]:
! head Platz1_R1.head.fastq

@M02093:262:000000000-L53B2:1:1101:19849:1170 1:N:0:ACTCGCTA+CTCTCTAT
CCTACGGGCTGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGAAGAAGTATTTCGGTATGTAAACTTCTATCAGCAGGGAAGATAGTGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTGGCAAGTCTGATGTGAAAGGCATGGGCTCAACCTGTGGACTGCATTGGAAACTGTCATACTTGAGTGCCGGAGGG
+
ACCCCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGG>EGGGGFGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFEFGGFDDBGEFFEDEFGFF<?AF<BF<ADFFFFF>).1F????DFFBB>0:A04?BFFFF>3=:8)).))6<A)44<>F24<9B96>B
@M02093:262:000000000-L53B2:1:1101:19552:1182 1:N:0:ACTCGCTA+CTCTCTAT
CCTACAGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGTGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCTGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGTGCGGAGGTGGTATGGCAAGTCAGAAGTGAAAACCCAGGGCT


Count the total number of lines with the command below. How many READS are in this FASTQ file? Hint: each read in FASTQ format consists of four lines.


In [None]:
! wc -l Platz1_R1.head.fastq

400 Platz1_R1.head.fastq
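
The four-lines-per-read rule from the hint can be written as a small Python helper. This is a minimal sketch, assuming an uncompressed FASTQ file in which no sequence or quality string is wrapped over several lines; the function name `count_fastq_reads` and the example record are made up for illustration.

```python
def count_fastq_reads(lines):
    """Count reads in FASTQ text, assuming exactly four lines per read."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        # A remainder usually means a truncated or malformed file.
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Tiny made-up record, only to show the call.
example = ["@read1", "ACGT", "+", "IIII"]
print(count_fastq_reads(example))  # prints 1
```

To apply it to a real file, pass an open file handle, e.g. `with open("Platz1_R1.head.fastq") as fh: print(count_fastq_reads(fh))`.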


Find specific nucleotide combinations, e.g. "AATATT":

In [None]:
! grep "AATATT" Platz1_R1.head.fastq | head

CCTACGGGCTGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGAAGAAGTATTTCGGTATGTAAACTTCTATCAGCAGGGAAGATAGTGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTGGCAAGTCTGATGTGAAAGGCATGGGCTCAACCTGTGGACTGCATTGGAAACTGTCATACTTGAGTGCCGGAGGG
CCTACAGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGTGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCTGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGTGCGGAGGTGGTATGGCAAGTCAGAAGTGAAAACCCAGGGCTTAACTCTGGGACTGCTTTTGAAACTGTCAGACTGGAGTGCAGGAGAG
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGACGAGAGTCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATACGGGAATAAAGTGAGGCACGTGTGCCTTTTTGTATGTACCGTATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGG
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGCCGAGAGGCTGAACCAGCCAAGTCGCGTGAAGGAAGAAGGATCTATGGTTTGTAAAC

Count how many lines of your Illumina data contain the hexanucleotide "AATATT" (note that `grep -c` counts matching lines, not individual occurrences):

In [None]:
! grep -c "AATATT" Platz1_R1.head.fastq

87
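
Because `grep -c` reports matching lines, a line with two copies of the motif is still counted once. The sketch below illustrates the difference in Python; the helper name `count_motif` and the example sequences are made up, and `str.count` counts non-overlapping matches only.

```python
def count_motif(lines, motif):
    """Return (matching_lines, total_occurrences) of a motif."""
    matching_lines = 0
    occurrences = 0
    for line in lines:
        n = line.count(motif)  # non-overlapping occurrences in this line
        if n:
            matching_lines += 1
            occurrences += n
    return matching_lines, occurrences


# Made-up sequences: the first contains the motif twice, the second never.
example = ["AATATTGGAATATT", "CCCCGGGG"]
print(count_motif(example, "AATATT"))  # prints (1, 2)
```

Like `grep`, this looks at every line it is given, so when run on a whole FASTQ file it also searches the header and quality lines, not only the sequences.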


### 1.2 Quality control with the [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) tool

Replace the `Platz` number with your own in the following command before running

In [None]:
! fastqc --quiet Platz1_R1.head.fastq

null


To display the output, download the HTML output from the `Files` view on the side menu to your local computer, and open it in a web browser.
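
The quality values summarised in the FastQC report come from the fourth line of each FASTQ record, where every character encodes one Phred score. As a rough illustration, the sketch below decodes such a string, assuming the Phred+33 encoding used by current Illumina instruments; the function name `mean_phred` is hypothetical and this is not how FastQC itself is implemented.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        raise ValueError("empty quality string")
    # Each character's ASCII code minus the offset is the Phred score.
    return sum(ord(ch) - offset for ch in quality) / len(quality)


# 'G' has ASCII code 71, so each base scores 71 - 33 = 38.
print(mean_phred("GGGG"))  # prints 38.0
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 30 means roughly one expected error per 1000 bases.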



### 1.3 Amplicon assembly with [pandaseq](https://github.com/neufeld/pandaseq)

Again, replace the `Platz` number with your own before running the following command:

In [None]:
! pandaseq -f Platz10_R1.head.fastq -r Platz10_R2.head.fastq -w Platz10.fa -g log.txt

Observe the first ten lines of the FASTA output:

In [None]:
! head Platz10.fa

>M02093:262:000000000-L53B2:1:1101:10955:1184:ACTCGCTA+ACTGCATA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGAAGAAGTATCTCGGTATGTAAACTTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTTGCAAGTCTGATGTGAAAGACGGGGGCTCAACCCCTGGACTGCATTGGAAACTGTGATACTCGAGTGCCGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCGTGTAGTC
>M02093:262:000000000-L53B2:1:1101:16471:1188:ACTCGCTA+ACTGCATA
CCTAGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCAACGCCGCGTGAGTGATGACGGCCTTCGGGTTGTAAAGCTCTGTCTTCAGGGACGATAATGACGGTACCTGAGGAGGAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTATGTCGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGCGTCTAGGCGGTCTGGTAAGTCTGATGTGGAAATGCGGGGCTCAACTCCGTATTGCGTTGGAAACTGCCAGACTAGAGTACTGGAGAGGTGGGCGGAACTACAAGTGTAGAGGTGAAAGTCGTAGATATTTGTAGGAATGCCGATAGAGAAGTCAGCTCACTGGACAGATACTGACGCTGAAGCGCGAAAGCATGGGAGCAAACAGGATTAGATACCCTC

Q: _How many sequences are in this FASTA output?_

Hint: count the total number of HEADER line symbols (">") with the command:


In [None]:
! grep -c ">" Platz10.fa

97


Q: _What is the ratio of FASTA sequences to FASTQ reads? And what does this tell us about our sequencing and assembly quality and efficiency?_
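
One way to put a number on this is the fraction of read pairs that were merged into an amplicon. The following is a minimal sketch with illustrative helper names (`count_fasta_records`, `merged_fraction`); the 97 is the header count shown above for `Platz10.fa`, and the 100 assumes that the `Platz10` `.head.fastq` files hold 100 read pairs.

```python
def count_fasta_records(lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def merged_fraction(n_merged, n_read_pairs):
    """Fraction of read pairs that were assembled into one sequence."""
    if n_read_pairs <= 0:
        raise ValueError("n_read_pairs must be positive")
    return n_merged / n_read_pairs


# 97 merged sequences out of an assumed 100 read pairs.
print(f"{merged_fraction(97, 100):.0%}")  # prints 97%
```

Pairs that are not merged are typically those whose forward and reverse reads overlap too poorly, so a low fraction points to quality problems towards the read ends.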



### 1.4 OTU Assignment with [kraken2](https://ccb.jhu.edu/software/kraken2/) against the 16S [Greengenes](https://greengenes.secondgenome.com/) database

Run your own samples against Greengenes. The example here uses the `Platz10` run, so replace the FASTQ filenames with your own, as before.



In [None]:
! kraken2 --db 16S_Greengenes_k2db --use-names --output output.txt --report report.txt --paired Platz10_R1.head.fastq Platz10_R2.head.fastq

Loading database information... done.
Processed 100 sequences (60094 bp) ...100 sequences (0.06 Mbp) processed in 0.012s (511.4 Kseq/m, 307.31 Mbp/m).
  100 sequences classified (100.00%)
  0 sequences unclassified (0.00%)


Inspect your individual results (file: `report.txt`) with the following command:

In [None]:
! cat report.txt

100.00	100	0	R	1	root
100.00	100	0	D	3	  Bacteria
 48.00	48	0	P	34	    Firmicutes
 43.00	43	0	C	226	      Clostridia
 43.00	43	6	O	527	        Clostridiales
 25.00	25	11	F	1017	          Lachnospiraceae
  5.00	5	5	G	1800	            Coprococcus
  4.00	4	4	G	1797	            Blautia
  4.00	4	4	G	1801	            Dorea
  1.00	1	0	G	1810	            [Ruminococcus]
  1.00	1	1	S	2783	              [Ruminococcus] gnavus
  7.00	7	2	F	1020	          Ruminococcaceae
  4.00	4	4	G	1830	            Oscillospira
  1.00	1	1	G	1831	            Ruminococcus
  3.00	3	1	F	1010	          Clostridiaceae
  2.00	2	0	G	1777	            Clostridium
  2.00	2	2	S	2770	              Clostridium perfringens
  2.00	2	0	F	1025	          Veillonellaceae
  2.00	2	2	G	1840	            Dialister
  5.00	5	0	C	225	      Bacilli
  5.00	5	0	O	524	        Lactobacillales
  4.00	4	0	F	1003	          Enterococcaceae
  4.00	4	4	G	1756	            Enterococcus
  1.00	1	0	F	1006	          Streptococcaceae
  1.00	1	1	G	1765	     

Q: _What are the predominant genera in your personal Illumina runs?_

The output report from Kraken2 consists of the following fields:

- Percentage of fragments covered by the clade rooted at this taxon
- Number of fragments covered by the clade rooted at this taxon
- Number of fragments assigned directly to this taxon
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
- NCBI taxonomic ID number
- Indented scientific name

See the program [documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format) for details.
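
The six tab-separated fields listed above can be read into Python to rank taxa without scanning the report by eye. This is a minimal sketch, not part of Kraken2: the names `ReportRow`, `parse_report` and `top_taxa` are made up for illustration, and the example lines are copied from the `Platz10` report shown earlier.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # % of fragments in the clade rooted at this taxon
    clade_count: int    # fragments in the clade rooted at this taxon
    direct_count: int   # fragments assigned directly to this taxon
    rank: str           # rank code, e.g. "G" for genus
    taxid: int          # taxonomic ID
    name: str           # scientific name, indentation removed


def parse_report(lines):
    """Parse Kraken2 report lines into ReportRow tuples."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip blank or truncated lines
        percent, clade, direct, rank, taxid, name = fields
        rows.append(ReportRow(float(percent), int(clade), int(direct),
                              rank, int(taxid), name.strip()))
    return rows


def top_taxa(rows, rank="G", n=5):
    """Return up to n rows of the given rank, largest clade first."""
    selected = [row for row in rows if row.rank == rank]
    return sorted(selected, key=lambda row: row.clade_count, reverse=True)[:n]


example = [
    " 25.00\t25\t11\tF\t1017\t          Lachnospiraceae",
    "  5.00\t5\t5\tG\t1800\t            Coprococcus",
    "  4.00\t4\t4\tG\t1797\t            Blautia",
]
for row in top_taxa(parse_report(example)):
    print(row.name, row.clade_count)
```

For a full report, pass an open file handle instead of the example list, e.g. `with open("report.txt") as fh: rows = parse_report(fh)`.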

