# Taxonomic Profiling in Kraken2

## Running Kraken2 on Paired-End Sequencing Data
For each sample, the cell below classifies the paired-end reads (`${sample}_R1_trimmed.fastq.gz` and `${sample}_R2_trimmed.fastq.gz`) with Kraken2, using the following settings:

- Database Path: `--db /home/jovyan/kraken2-db` is the Kraken2 database.
- Output Files:
  - `--output` writes the detailed per-read classification results to `/home/jovyan/results/${sample}.kraken2`.
  - `--report` writes a summarized report to `/home/jovyan/results/${sample}.tab`.

In [7]:
!mkdir -p /home/jovyan/results
!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do \
    echo "Processing sample: ${sample}"; \
    kraken2 --db /home/jovyan/kraken2-db --paired --output /home/jovyan/results/${sample}.kraken2 --report /home/jovyan/results/${sample}.tab \
    /home/jovyan/data/${sample}_R1_trimmed.fastq.gz /home/jovyan/data/${sample}_R2_trimmed.fastq.gz;\
done


Processing sample: FH1_S162
Loading database information... done.
3129692 sequences (855.69 Mbp) processed in 316.891s (592.6 Kseq/m, 162.02 Mbp/m).
  1409993 sequences classified (45.05%)
  1719699 sequences unclassified (54.95%)
Processing sample: FH2_S163
Loading database information... done.
3129692 sequences (945.17 Mbp) processed in 249.631s (752.2 Kseq/m, 227.18 Mbp/m).
  1305174 sequences classified (41.70%)
  1824518 sequences unclassified (58.30%)
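
The `ls | sed` pipeline in the loop above reduces each R1 file name to a sample ID. A minimal Python sketch of the same derivation follows, assuming file names follow the `<sample>_R1_trimmed.fastq.gz` pattern used here. The helper name `sample_ids` is hypothetical and not part of the notebook.

```python
import re
from pathlib import PurePosixPath

def sample_ids(paths):
    """Return sorted, unique sample IDs from R1 FASTQ paths.

    Mirrors the two sed steps: drop the directory part, then drop
    everything from the first "_R1_" onward.
    """
    ids = set()
    for p in paths:
        name = PurePosixPath(p).name          # strip the directory part
        ids.add(re.sub(r"_R1_.*", "", name))  # strip from the first "_R1_" onward
    return sorted(ids)

r1_files = [
    "/home/jovyan/data/FH1_S162_R1_trimmed.fastq.gz",
    "/home/jovyan/data/FH2_S163_R1_trimmed.fastq.gz",
]
print(sample_ids(r1_files))
```

Deriving the IDs once and reusing the list would avoid repeating the same pipeline in every later cell.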


#### Initial Output from Kraken2

In [50]:
!head  /home/jovyan/results/FH2_S163.tab

 58.30	1824518	1824518	U	0	unclassified
 41.70	1305174	383	R	1	root
 41.68	1304328	63202	R1	131567	  cellular organisms
 38.60	1207946	47320	D	2	    Bacteria
 26.13	817718	24022	P	1224	      Proteobacteria
 10.62	332446	5911	C	28216	        Betaproteobacteria
  8.77	274500	13693	O	80840	          Burkholderiales
  7.17	224399	31842	F	80864	            Comamonadaceae
  3.30	103320	12602	G	12916	              Acidovorax
  1.65	51785	51785	S	358220	                Acidovorax sp. KKS102
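
The report above is tab-separated with six columns: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and the name indented by two spaces per level. The sketch below parses a few of the rows shown into named fields; `ReportRow` and `parse_report_line` are illustrative names, not part of Kraken2.

```python
from typing import NamedTuple

class ReportRow(NamedTuple):
    percent: float     # % of reads in the clade rooted at this taxon
    clade_reads: int   # reads assigned to this taxon or its descendants
    direct_reads: int  # reads assigned directly to this taxon
    rank: str          # rank code, e.g. U, R, D, P, C, O, F, G, S (or R1)
    taxid: int         # NCBI taxonomy ID
    name: str          # scientific name with indentation removed
    depth: int         # nesting depth (two spaces of indentation per level)

def parse_report_line(line: str) -> ReportRow:
    pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
    stripped = name.lstrip(" ")
    depth = (len(name) - len(stripped)) // 2
    return ReportRow(float(pct), int(clade), int(direct), rank,
                     int(taxid), stripped, depth)

# Rows copied from the FH2_S163 report above
lines = [
    " 58.30\t1824518\t1824518\tU\t0\tunclassified",
    " 41.70\t1305174\t383\tR\t1\troot",
    " 38.60\t1207946\t47320\tD\t2\t    Bacteria",
    "  3.30\t103320\t12602\tG\t12916\t              Acidovorax",
]
rows = [parse_report_line(l) for l in lines]
for r in rows:
    print(r.rank, r.name, r.clade_reads, r.depth)
```

The unclassified and root clade counts together account for every read in the sample (1824518 + 1305174 = 3129692).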


## Generating Abundance Reports with Bracken from Kraken2 Reports
This code block processes the Kraken2 report for each sample with Bracken, which re-estimates taxon abundances from Kraken2 reports (here at the genus level). The read length range (50-150 bp) is based on the prior FASTQC quality check results.

Bracken Parameters:

- Database Path: `-d /home/jovyan/kraken2-db` points to the Kraken2 database used in the previous step.
- Input File: `-i /home/jovyan/results/${sample}.tab` specifies the Kraken2 report for the sample.
- Level: `-l G` sets the taxonomic level to genus.
- Threshold: `-t 150` sets the minimum number of reads a genus must have to be kept for abundance estimation.
- Read Length: `-r 100` sets the read length to 100 bp, chosen from the FASTQC results (read lengths ranged from 50 to 150 bp). Bracken uses this value to select the k-mer distribution file `database100mers.kmer_distrib`.
- Output File: `-o /home/jovyan/results/${sample}.bracken` specifies the output file path for the abundance table.

Note: Bracken produces two output files per sample: `/home/jovyan/results/${sample}.bracken` and the report `/home/jovyan/results/${sample}_bracken_genuses.tab`.
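
As a rough sketch, the parameters listed above can be assembled into a command line in Python. Only the flags used in this notebook appear. The helper names are hypothetical, and the k-mer distribution file name pattern is inferred from the Bracken log in this notebook rather than from Bracken's documentation.

```python
import shlex

def kmer_distrib_path(db: str, read_len: int) -> str:
    # Assumed pattern, inferred from the log line
    # "-k /home/jovyan/kraken2-db/database100mers.kmer_distrib"
    return f"{db}/database{read_len}mers.kmer_distrib"

def bracken_command(sample, db="/home/jovyan/kraken2-db",
                    results="/home/jovyan/results",
                    level="G", threshold=150, read_len=100):
    """Build (but do not run) the Bracken call for one sample."""
    return [
        "bracken",
        "-d", db,
        "-i", f"{results}/{sample}.tab",
        "-l", level,
        "-t", str(threshold),
        "-r", str(read_len),
        "-o", f"{results}/{sample}.bracken",
    ]

cmd = bracken_command("FH1_S162")
print(shlex.join(cmd))
print(kmer_distrib_path("/home/jovyan/kraken2-db", 100))
```

Checking that the k-mer distribution file for the chosen `-r` value exists in the database directory before looping over samples would surface a missing file early.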

In [46]:
!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do \
    bracken -d /home/jovyan/kraken2-db -i /home/jovyan/results/${sample}.tab -l G -t 150 -r 100 -o /home/jovyan/results/${sample}.bracken;\
done

 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i /home/jovyan/results/FH1_S162.tab -o /home/jovyan/results/FH1_S162.bracken -k /home/jovyan/kraken2-db/database100mers.kmer_distrib -l G -t 150
PROGRAM START TIME: 01-11-2025 13:44:11
>> Checking report file: /home/jovyan/results/FH1_S162.tab
BRACKEN SUMMARY (Kraken report: /home/jovyan/results/FH1_S162.tab)
    >>> Threshold: 150 
    >>> Number of genuses in sample: 1165 
	  >> Number of genuses with reads > threshold: 208 
	  >> Number of genuses with reads < threshold: 957 
    >>> Total reads in sample: 3129692
	  >> Total reads kept at genuses level (reads > threshold): 1183840
	  >> Total reads discarded (genuses reads < threshold): 23070
	  >> Reads distributed: 195558
	  >> Reads not distributed (eg. no genuses above threshold): 7525
	  >> Unclassified reads: 1719699
BRACKEN OUTPUT PRODUCED: /home/jovyan/results/FH1_S162.bracken
PROGRAM END TIME: 01-11-2025 13:44:11
  Bracken complete
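
The read counts in the Bracken summary above can be checked against the sample total. The sketch below uses the FH1_S162 numbers from the log; the comments describing each category are a reading of the log labels, not Bracken's own definitions.

```python
# Values copied from the Bracken summary for FH1_S162
summary = {
    "kept": 1183840,          # reads in genera above the threshold
    "discarded": 23070,       # reads in genera below the threshold
    "distributed": 195558,    # reads redistributed to genus level
    "not_distributed": 7525,  # reads that could not be redistributed
    "unclassified": 1719699,  # reads Kraken2 left unclassified
}
total_reads = 3129692

accounted = sum(summary.values())
classified = total_reads - summary["unclassified"]

print("accounted for:", accounted, "of", total_reads)
print("classified by Kraken2:", classified)
print("genera total:", 208 + 957)
```

The five categories add up to the 3129692 reads in the sample, and the classified remainder (1409993) matches the Kraken2 output for FH1_S162.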

#### Output of Bracken

In [51]:
!head  /home/jovyan/results/FH2_S163_bracken_genuses.tab

100.00	1274246	0	R	1	root
100.00	1274246	0	R1	131567	  cellular organisms
96.19	1225643	0	D	2	    Bacteria
69.21	881885	0	P	1224	      Proteobacteria
31.42	400364	0	C	28216	        Betaproteobacteria
27.13	345649	0	O	80840	          Burkholderiales
19.15	244018	0	F	80864	            Comamonadaceae
10.91	138956	0	G	12916	              Acidovorax
1.38	17557	0	G	283	              Comamonas
1.52	19431	0	G	201096	              Alicycliphilus
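
To see how the re-estimation changed the counts, the clade read counts for FH2_S163 in the Kraken2 report can be compared with the Bracken report above. This is a small worked comparison using only the taxa that appear in both `head` outputs.

```python
# Clade read counts for FH2_S163, copied from the two reports in this notebook
kraken2_reads = {
    "Bacteria": 1207946,
    "Proteobacteria": 817718,
    "Betaproteobacteria": 332446,
    "Burkholderiales": 274500,
    "Comamonadaceae": 224399,
    "Acidovorax": 103320,
}
bracken_reads = {
    "Bacteria": 1225643,
    "Proteobacteria": 881885,
    "Betaproteobacteria": 400364,
    "Burkholderiales": 345649,
    "Comamonadaceae": 244018,
    "Acidovorax": 138956,
}

gains = {t: bracken_reads[t] - kraken2_reads[t] for t in kraken2_reads}
for taxon, gain in gains.items():
    ratio = bracken_reads[taxon] / kraken2_reads[taxon]
    print(f"{taxon:20s} {kraken2_reads[taxon]:>9d} -> "
          f"{bracken_reads[taxon]:>9d}  ({gain:+d}, x{ratio:.2f})")
```

Every taxon listed gains reads after Bracken. Note also that the Bracken report percentages are relative to the 1274246 reads at the root rather than to all reads in the sample.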


## Converting Kraken2 Reports to MetaPhlAn-Style Format
This code block converts the Bracken-adjusted, Kraken-style report for each sample (`${sample}_bracken_genuses.tab`) into a MetaPhlAn-style format using the kreport2mpa.py script. The MetaPhlAn-style format is commonly used for downstream analyses and visualization of taxonomic profiles.

The MetaPhlAn-style format writes each taxon on a single line as its full lineage: ranks are separated by `|`, each name carries a rank prefix (e.g., `k__` for kingdom, `p__` for phylum, `g__` for genus), and the abundance value follows the lineage. In the kreport2mpa.py output shown later in this notebook, the top rank is labeled `d__` (domain).

Example: k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   0.45
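
A lineage line in this format can be split into its ranks with a few lines of Python. The following is a minimal sketch; `parse_mpa_line` and the `RANKS` mapping are illustrative names, and the sample line is taken from the normalized profile shown later in this notebook.

```python
RANKS = {"d": "domain", "k": "kingdom", "p": "phylum", "c": "class",
         "o": "order", "f": "family", "g": "genus", "s": "species"}

def parse_mpa_line(line):
    """Split 'lineage<whitespace>value' into ({rank: name}, value)."""
    # Split on the last run of whitespace so tabs or spaces both work
    lineage, value = line.rstrip("\n").rsplit(None, 1)
    taxa = {}
    for part in lineage.split("|"):
        prefix, name = part.split("__", 1)
        taxa[RANKS.get(prefix, prefix)] = name
    return taxa, float(value)

line = ("d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|"
        "o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia\t5.14588")
taxa, value = parse_mpa_line(line)
print(taxa["genus"], value)
print(list(taxa))
```

Because each line carries its whole lineage, filtering to one rank is just a matter of checking which prefix the last lineage element has.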



In [36]:
!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do  \
    kreport2mpa.py -r /home/jovyan/results/${sample}_bracken_genuses.tab -o /home/jovyan/results/${sample}_mpa.tab;\
done

#### Normalize Taxonomic Profiles for Abundance Calculation
This script processes the MetaPhlAn-style taxonomic profile file for each sample and rescales the abundance values so that the top-level entries (lineages without a `|`) sum to 100%.

In [37]:

!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do \
    sum=$(grep -vP "\\|" /home/jovyan/results/${sample}_mpa.tab | cut -f 2 | awk '{sum += $1} END {printf "%.2f", sum/100}'); \
    awk -v sum="$sum" 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2/sum}' /home/jovyan/results/${sample}_mpa.tab > /home/jovyan/results/${sample}_profile.tab; \
done

In [38]:
!head /home/jovyan/results/FH1_S162_profile.tab

d__Bacteria	96.5527
d__Bacteria|p__Proteobacteria	68.5
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria	54.841
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales	25.9388
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae	23.5621
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Raoultella	6.52018
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Enterobacter	6.09267
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Klebsiella	3.11332
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia	5.14588
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Citrobacter	1.58606
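
The normalization step can be sketched in Python as follows: sum the values of the top-level lineages (those without `|`), then scale every row so that this sum becomes 100. The function name and the input numbers are made up for illustration. Unlike the shell version, this sketch does not round the divisor to two decimals.

```python
def normalize_profile(rows):
    """rows: list of (lineage, value); returns rows scaled to percentages.

    Only top-level lineages (no '|') contribute to the total, because
    deeper lineages are already counted inside their parents.
    """
    top_total = sum(v for lineage, v in rows if "|" not in lineage)
    if top_total == 0:
        raise ValueError("no top-level abundance to normalize against")
    return [(lineage, 100.0 * v / top_total) for lineage, v in rows]

# Hypothetical read counts, not taken from the samples
rows = [
    ("d__Bacteria", 900.0),
    ("d__Bacteria|p__Proteobacteria", 600.0),
    ("d__Archaea", 100.0),
]
normalized = normalize_profile(rows)
for lineage, pct in normalized:
    print(f"{lineage}\t{pct:g}")
```

After scaling, the top-level rows sum to 100 while each child keeps its share of its parent.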


#### Merging Normalized Taxonomic Profiles into a Single Table
This command uses the merge_metaphlan_tables.py script from MetaPhlAn2 to combine the individual sample taxonomic profiles into a single merged abundance table.

In [39]:
!merge_metaphlan_tables.py /home/jovyan/results/*_profile.tab > /home/jovyan/results/merged_metakraken_abundance_table.txt

In [40]:
!head /home/jovyan/results/merged_metakraken_abundance_table.txt

ID	FH1_S162_profile	FH2_S163_profile
d__Archaea	0.0188488	0.095743
d__Archaea|p__Crenarchaeota	0.0	0.0805183
d__Archaea|p__Crenarchaeota|c__Thermoprotei	0.0	0.0805183
d__Archaea|p__Crenarchaeota|c__Thermoprotei|o__Sulfolobales	0.0	0.0805183
d__Archaea|p__Crenarchaeota|c__Thermoprotei|o__Sulfolobales|f__Sulfolobaceae	0.0	0.0805183
d__Archaea|p__Crenarchaeota|c__Thermoprotei|o__Sulfolobales|f__Sulfolobaceae|g__Sulfolobus	0.0	0.0805183
d__Archaea|p__Euryarchaeota	0.0188488	0.0151462
d__Archaea|p__Euryarchaeota|c__Methanobacteria	0.0188488	0.0151462
d__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales	0.0188488	0.0151462
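
The merged table above behaves like an outer join on the lineage, with 0.0 where a sample lacks a taxon. The sketch below reproduces that behavior in plain Python. It is not the merge_metaphlan_tables.py implementation, and `merge_profiles` is a hypothetical helper. The input values are copied from the merged table above.

```python
def merge_profiles(profiles):
    """profiles: {sample: {lineage: abundance}} -> (header, rows).

    Every lineage seen in any sample gets a row; missing values become 0.0.
    """
    samples = sorted(profiles)
    lineages = sorted(set().union(*(p.keys() for p in profiles.values())))
    header = ["ID"] + samples
    rows = [[lin] + [profiles[s].get(lin, 0.0) for s in samples]
            for lin in lineages]
    return header, rows

profiles = {
    "FH1_S162_profile": {
        "d__Archaea": 0.0188488,
        "d__Archaea|p__Euryarchaeota": 0.0188488,
    },
    "FH2_S163_profile": {
        "d__Archaea": 0.095743,
        "d__Archaea|p__Crenarchaeota": 0.0805183,
        "d__Archaea|p__Euryarchaeota": 0.0151462,
    },
}
header, rows = merge_profiles(profiles)
print("\t".join(header))
for row in rows:
    print("\t".join(str(x) for x in row))
```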


#### Reformat Sample Name Headers to Exclude _profile

In [41]:
!sed -i '1s/_profile//g' /home/jovyan/results/merged_metakraken_abundance_table.txt
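
For reference, the same header cleanup can be expressed in Python. This is a small sketch that edits only the first line, as the `1s/.../g` address in the sed command does; `strip_profile_suffix` is a hypothetical name.

```python
def strip_profile_suffix(table_text: str) -> str:
    """Remove every '_profile' from the header line only."""
    header, sep, rest = table_text.partition("\n")
    return header.replace("_profile", "") + sep + rest

table = ("ID\tFH1_S162_profile\tFH2_S163_profile\n"
         "d__Archaea\t0.0188488\t0.095743\n")
print(strip_profile_suffix(table), end="")
```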