# Taxonomic Profiling with Kraken2

## Running Kraken2 on Paired-End Sequencing Data
For each sample, Kraken2 classifies the paired-end reads (`_R1_trimmed.fastq.gz` and `_R2_trimmed.fastq.gz`) with the following settings:

- Database Path: `/home/jovyan/kraken2-db` is used as the Kraken2 database.
- Output Files:
  - The detailed classification results for each sample are written to `/home/jovyan/results/${sample}.kraken2`.
  - A summarized report is saved as `/home/jovyan/results/${sample}.tab`.
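
The sample names used in these paths come from the R1 file names. As a minimal sketch of that derivation in Python, mirroring the two `sed` substitutions in the cell below: the function names `sample_name` and `kraken2_paths` are hypothetical helpers introduced here for illustration and are not part of Kraken2.

```python
import re


def sample_name(r1_path: str) -> str:
    """Derive a sample name from an R1 FASTQ path.

    Mirrors the notebook's pipeline: sed 's/.*\\///' | sed 's/_R1_.*//'
    """
    basename = re.sub(r".*/", "", r1_path, count=1)   # drop the directory part
    return re.sub(r"_R1_.*", "", basename, count=1)   # drop '_R1_' and what follows


def kraken2_paths(sample: str,
                  data_dir: str = "/home/jovyan/data",
                  results_dir: str = "/home/jovyan/results") -> dict:
    """Build the input and output paths used for one sample (hypothetical helper)."""
    return {
        "r1": f"{data_dir}/{sample}_R1_trimmed.fastq.gz",
        "r2": f"{data_dir}/{sample}_R2_trimmed.fastq.gz",
        "output": f"{results_dir}/{sample}.kraken2",
        "report": f"{results_dir}/{sample}.tab",
    }


name = sample_name("/home/jovyan/data/BFH10_S128_R1_trimmed.fastq.gz")
print(name)
print(kraken2_paths(name)["report"])
```

Keeping the name derivation in one place makes it easier to check that every later step reads and writes the same per-sample files.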

In [1]:
!mkdir -p /home/jovyan/results
!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do \
    echo "Processing sample: ${sample}"; \
    kraken2 --db /home/jovyan/kraken2-db --paired --output /home/jovyan/results/${sample}.kraken2 --report /home/jovyan/results/${sample}.tab \
    /home/jovyan/data/${sample}_R1_trimmed.fastq.gz /home/jovyan/data/${sample}_R2_trimmed.fastq.gz;\
done


Processing sample: BFH10_S128
Loading database information... done.
3129692 sequences (945.17 Mbp) processed in 106.034s (1771.0 Kseq/m, 534.83 Mbp/m).
  865436 sequences classified (27.65%)
  2264256 sequences unclassified (72.35%)
Processing sample: BFH33_S151
Loading database information... done.
3129692 sequences (945.17 Mbp) processed in 62.497s (3004.7 Kseq/m, 907.40 Mbp/m).
  1520930 sequences classified (48.60%)
  1608762 sequences unclassified (51.40%)
Processing sample: BH02_S77
Loading database information... done.
3129692 sequences (945.17 Mbp) processed in 163.693s (1147.2 Kseq/m, 346.44 Mbp/m).
  1090615 sequences classified (34.85%)
  2039077 sequences unclassified (65.15%)
Processing sample: BH03_S78
Loading database information... done.
3129692 sequences (945.17 Mbp) processed in 126.404s (1485.6 Kseq/m, 448.64 Mbp/m).
  963977 sequences classified (30.80%)
  2165715 sequences unclassified (69.20%)
Processing sample: FH1_S162
Loading database information... done.
31296

## Generating Abundance Reports with Bracken from Kraken2 Reports
This code block runs Bracken on the Kraken2 report of each sample. Bracken re-estimates abundance at a chosen taxonomic level (genus here) from the Kraken2 report. The read length setting is based on the prior FastQC quality check, in which read lengths ranged from 50 to 150 bp.

Bracken Parameters:

- Database Path: -d `/home/jovyan/kraken2-db` points to the Kraken2 database used in the previous step.
- Input File: -i `/home/jovyan/results/${sample}.tab` specifies the Kraken2 report for the sample.
- Level: -l G sets the taxonomic level to "Genus."
- Threshold: -t 150 sets a threshold of 150 reads; genera with fewer reads are discarded before abundance estimation.
- Read Length: -r 100 sets the read length to 100 bp, an estimate of the average based on FastQC results (read lengths ranged between 50-150 bp). Bracken uses it to select the k-mer distribution file (`database100mers.kmer_distrib` in the log below).
- Output File: -o `/home/jovyan/results/${sample}.breport` specifies the output file path for the abundance report.
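
To make the mapping from parameters to the command line explicit, here is a small sketch that assembles the Bracken invocation for one sample. It only uses the flags listed above; the function name `bracken_command` and its keyword arguments are hypothetical, and the sketch prints the command rather than running it.

```python
import shlex


def bracken_command(sample: str,
                    db: str = "/home/jovyan/kraken2-db",
                    results_dir: str = "/home/jovyan/results",
                    level: str = "G",
                    threshold: int = 150,
                    read_length: int = 100) -> list:
    """Return the Bracken argument list for one sample (hypothetical helper)."""
    return [
        "bracken",
        "-d", db,                                   # Kraken2 database directory
        "-i", f"{results_dir}/{sample}.tab",        # Kraken2 report
        "-l", level,                                # taxonomic level (G = genus)
        "-t", str(threshold),                       # read threshold
        "-r", str(read_length),                     # read length
        "-o", f"{results_dir}/{sample}.breport",    # abundance report
    ]


cmd = bracken_command("BFH10_S128")
print(shlex.join(cmd))
```

An argument list like this could be passed to `subprocess.run(cmd, check=True)` if the loop were driven from Python instead of the shell.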

In [2]:
!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do \
    bracken -d /home/jovyan/kraken2-db -i /home/jovyan/results/${sample}.tab -l G -t 150 -r 100 -o /home/jovyan/results/${sample}.breport;\
done

 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i /home/jovyan/results/BFH10_S128.tab -o /home/jovyan/results/BFH10_S128.breport -k /home/jovyan/kraken2-db/database100mers.kmer_distrib -l G -t 150
PROGRAM START TIME: 12-20-2024 00:44:23
>> Checking report file: /home/jovyan/results/BFH10_S128.tab
BRACKEN SUMMARY (Kraken report: /home/jovyan/results/BFH10_S128.tab)
    >>> Threshold: 150 
    >>> Number of genuses in sample: 1156 
	  >> Number of genuses with reads > threshold: 164 
	  >> Number of genuses with reads < threshold: 992 
    >>> Total reads in sample: 3129692
	  >> Total reads kept at genuses level (reads > threshold): 351634
	  >> Total reads discarded (genuses reads < threshold): 27166
	  >> Reads distributed: 484490
	  >> Reads not distributed (eg. no genuses above threshold): 2146
	  >> Unclassified reads: 2264256
BRACKEN OUTPUT PRODUCED: /home/jovyan/results/BFH10_S128.breport
PROGRAM END TIME: 12-20-2024 00:44:23
  Bracken
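
As a sanity check on the summary above, the read categories reported for BFH10_S128 should add up to the total number of reads. The sketch below copies the numbers from the log; the dictionary keys are labels chosen here, not Bracken field names.

```python
# Values copied from the Bracken summary for BFH10_S128.
summary = {
    "total_reads": 3129692,
    "kept": 351634,             # reads kept at genus level (above threshold)
    "discarded": 27166,         # reads in genera below the threshold
    "distributed": 484490,      # reads redistributed to genera
    "not_distributed": 2146,    # reads that could not be redistributed
    "unclassified": 2264256,
}

classified = (summary["kept"] + summary["discarded"]
              + summary["distributed"] + summary["not_distributed"])

# The four classified categories plus the unclassified reads cover every read.
assert classified + summary["unclassified"] == summary["total_reads"]

# The genus counts in the log are consistent too: 164 above + 992 below = 1156.
assert 164 + 992 == 1156

percent_classified = 100 * classified / summary["total_reads"]
print(classified, f"{percent_classified:.2f}%")
```

The classified total (865436 reads, 27.65%) matches the figure Kraken2 printed for the same sample in the previous step.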

## Converting Kraken2 Reports to MetaPhlAn-Style Format
This code block processes Kraken2 summary reports for each sample and converts them into a MetaPhlAn-style format using the kreport2mpa.py script. The MetaPhlAn-style format is commonly used for downstream analyses and visualization of taxonomic profiles. 

In the MetaPhlAn-style format, each line holds the full lineage of one taxon followed by its value. Ranks are joined with `|`, and each name carries a rank prefix (e.g., `k__` for kingdom, `p__` for phylum, `g__` for genus). The files generated in this notebook use `d__` (domain) as the top-level prefix.

Example: k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   0.45
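
To illustrate how such a line breaks down, here is a minimal parsing sketch. The function `parse_mpa_line` and the `RANKS` mapping are hypothetical helpers written for this example; they assume taxon names contain no whitespace and that the value is the last whitespace-separated field.

```python
# Rank prefixes seen in MetaPhlAn-style lineages (d__ or k__ at the top level).
RANKS = {
    "d": "domain", "k": "kingdom", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus", "s": "species",
}


def parse_mpa_line(line: str):
    """Split one MetaPhlAn-style line into ({rank: name}, value)."""
    lineage, value = line.strip().rsplit(None, 1)   # value is the last field
    taxa = {}
    for field in lineage.split("|"):                # one field per rank
        prefix, name = field.split("__", 1)         # 'g__Escherichia' -> ('g', 'Escherichia')
        taxa[RANKS.get(prefix, prefix)] = name
    return taxa, float(value)


example = ("k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|"
           "o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   0.45")
taxa, value = parse_mpa_line(example)
print(taxa["genus"], value)
```

The deepest rank present in a line tells you which taxonomic level that row describes, which is useful when filtering a profile to a single level.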



In [3]:
!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do  \
    kreport2mpa.py -r /home/jovyan/results/${sample}.tab -o /home/jovyan/results/${sample}_mpa.tab;\
done

#### Normalize Taxonomic Profiles for Abundance Calculation
This script processes the MetaPhlAn-style taxonomic profile file of each sample and rescales the values so that the top-level entries (lines without a `|`) sum to 100%. Every line, at every rank, is divided by the same factor.

In [9]:

!for sample in $(ls /home/jovyan/data/*_R1_*.fastq.gz | sed 's/.*\///' | sed 's/_R1_.*//'); do \
    sum=$(grep -vP "\\|" /home/jovyan/results/${sample}_mpa.tab | cut -f 2 | awk '{sum += $1} END {printf "%.2f", sum/100}'); \
    if [ -z "$sum" ] || [ "$sum" = "0.00" ]; then \
        sum=0.00001; \
        echo "Warning: sum is zero or undefined, setting sum to 0.00001"; \
    fi; \
    awk -v sum="$sum" 'BEGIN {FS="\t"; OFS="\t"} {print $1, $2/sum}' /home/jovyan/results/${sample}_mpa.tab > /home/jovyan/results/${sample}_profile.tab; \
done
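
The same normalization can be sketched in Python to make the arithmetic explicit. The function `normalize_profile` is a hypothetical helper, and the input rows are toy values invented for illustration, not taken from the samples. Unlike the shell version, this sketch does not round the divisor to two decimals, and it raises an error instead of substituting a small constant when the top-level total is zero.

```python
def normalize_profile(rows):
    """Rescale (lineage, value) rows so that top-level rows sum to 100.

    Top-level rows are those whose lineage contains no '|'.
    """
    top_total = sum(value for lineage, value in rows if "|" not in lineage)
    if top_total == 0:
        raise ValueError("no top-level abundance to normalize against")
    scale = top_total / 100                      # same divisor as the awk step
    return [(lineage, value / scale) for lineage, value in rows]


# Toy input (hypothetical values).
rows = [
    ("d__Bacteria", 900.0),
    ("d__Bacteria|p__Proteobacteria", 600.0),
    ("d__Archaea", 100.0),
]
for lineage, value in normalize_profile(rows):
    print(lineage, value, sep="\t")
```

Because lower ranks are divided by the same factor, each child keeps its proportion relative to its parent.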

In [10]:
!head /home/jovyan/results/FH1_S162_profile.tab

d__Bacteria	96.4236
d__Bacteria|p__Proteobacteria	67.1025
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria	52.3122
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales	23.9919
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae	20.4287
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Raoultella	5.98904
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Raoultella|s__Raoultella_ornithinolytica	1.58662
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Raoultella|s__Raoultella_planticola	0.0736675
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Raoultella|s__Raoultella_sp._X13	0.0366204
d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Enterobacter	5.10304


#### Merging Normalized Taxonomic Profiles into a Single Table
This command uses the merge_metaphlan_tables.py script from MetaPhlAn2 to combine individual sample taxonomic profiles into a single merged abundance table.

In [14]:
!merge_metaphlan_tables.py /home/jovyan/results/*_profile.tab > /home/jovyan/results/merged_metakraken_abundance_table.txt
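
Conceptually, the merge is an outer join of the per-sample profiles on the lineage string, with missing taxa filled with zero. The sketch below illustrates that idea only; it is not the implementation of merge_metaphlan_tables.py, and `merge_profiles` is a hypothetical helper. The input values are a few entries copied from the merged table shown below.

```python
def merge_profiles(profiles):
    """Merge {sample: {lineage: value}} into (header, rows).

    Taxa absent from a sample get 0.0 in that sample's column.
    """
    samples = sorted(profiles)
    lineages = sorted({lineage for p in profiles.values() for lineage in p})
    header = ["ID"] + samples
    rows = [
        [lineage] + [profiles[s].get(lineage, 0.0) for s in samples]
        for lineage in lineages
    ]
    return header, rows


profiles = {
    "BFH10_S128": {
        "d__Archaea": 2.53983,
        "d__Archaea|p__Candidatus_Korarchaeota": 0.000117059,
    },
    "BFH33_S151": {
        "d__Archaea": 0.426819,
        "d__Archaea|p__Candidatus_Korarchaeota": 6.64309e-05,
        "d__Archaea|p__Candidatus_Micrarchaeota": 0.000132862,
    },
}
header, rows = merge_profiles(profiles)
print("\t".join(header))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```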

In [17]:
!head /home/jovyan/results/merged_metakraken_abundance_table.txt

ID	BFH10_S128	BFH33_S151	BH02_S77	BH03_S78	FH1_S162	FH2_S163
d__Archaea	2.53983	0.426819	0.133947	0.316336	0.0479976	0.122828
d__Archaea|p__Candidatus_Korarchaeota	0.000117059	6.64309e-05	0.0	0.0	0.0	0.0
d__Archaea|p__Candidatus_Korarchaeota|g__Candidatus_Korarchaeum	0.000117059	6.64309e-05	0.0	0.0	0.0	0.0
d__Archaea|p__Candidatus_Korarchaeota|g__Candidatus_Korarchaeum|s__Candidatus_Korarchaeum_cryptofilum	0.000117059	6.64309e-05	0.0	0.0	0.0	0.0
d__Archaea|p__Candidatus_Micrarchaeota	0.0	0.000132862	0.0	0.0	0.0	0.0
d__Archaea|p__Candidatus_Micrarchaeota|s__Candidatus_Micrarchaeota_archaeon_Mia14	0.0	0.000132862	0.0	0.0	0.0	0.0
d__Archaea|p__Crenarchaeota	0.130638	0.0751998	0.0432623	0.126156	0.000497753	0.0824762
d__Archaea|p__Crenarchaeota|c__Thermoprotei	0.130638	0.0751998	0.0432623	0.126156	0.000497753	0.0824762
d__Archaea|p__Crenarchaeota|c__Thermoprotei|o__Acidilobales	0.0	6.64309e-05	0.0	0.000524777	0.0	0.0


#### Reformat Sample Name Headers to Exclude _profile

In [16]:
!sed -i '1s/_profile//g' /home/jovyan/results/merged_metakraken_abundance_table.txt
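
For reference, the effect of the `sed` command (remove every `_profile` on line 1 only) can be sketched in Python. The function `strip_profile_suffix` is a hypothetical helper that works on the table text in memory; it does not edit the file in place as `sed -i` does.

```python
def strip_profile_suffix(table_text: str) -> str:
    """Remove '_profile' from the header line only, leaving the body untouched."""
    header, newline, body = table_text.partition("\n")
    return header.replace("_profile", "") + newline + body


table = "ID\tBFH10_S128_profile\tBFH33_S151_profile\nd__Archaea\t2.53983\t0.426819\n"
print(strip_profile_suffix(table), end="")
```

Restricting the substitution to the header avoids altering any taxon name that might contain the same substring.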