# Taxonomic Profiling using MetaPhlAn 4

Different versions of MetaPhlAn marker databases are available here:

http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/

Ideally we would install the latest version (vJun 23), but given limited computing resources, we will use a toy database instead (vJan 21 TOY). It can be installed with the following command:
```
metaphlan --install --index mpa_vJan21_TOY_CHOCOPhlAnSGB_202103 --bowtie2db data/metaphlan_databases/
```
The database has already been installed and is available in the `data/metaphlan_databases` folder, so this step can be skipped.

Let's prepare a folder to hold the output files that MetaPhlAn will produce.

In [10]:
! mkdir -p results/metaphlan_output

The command to run `metaphlan` on a single sample is:
```
metaphlan data/metagenome_samples/FH1_1.fastq.gz,data/metagenome_samples/FH1_2.fastq.gz --bowtie2db data/metaphlan_databases --index mpa_vJan21_TOY_CHOCOPhlAnSGB_202103 --input_type fastq --bowtie2out results/metaphlan_output/FH1.bowtie2.bz2 --output results/metaphlan_output/FH1_metaphlan_profile
```
This command points `metaphlan` to the pair of read files for one metagenome sample and to the installed database. MetaPhlAn uses `Bowtie2` to map the reads to marker genes in the database. The read assignments are written to the `--bowtie2out` path. The output we are interested in, the relative abundances, is written to the path given by `--output`.

Instead of running this command separately for each sample, the following script automates the process by looping over all the samples.

In [11]:
!for sample in $(ls data/metagenome_samples/*_1.fastq.gz | sed 's/.*\///' | sed 's/_1.*//'); do \
    echo "Processing sample: ${sample}"; \
    metaphlan data/metagenome_samples/${sample}_1.fastq.gz,data/metagenome_samples/${sample}_2.fastq.gz \
    --bowtie2db data/metaphlan_databases --index mpa_vJan21_TOY_CHOCOPhlAnSGB_202103 --input_type fastq \
    --bowtie2out results/metaphlan_output/${sample}.bowtie2.bz2 --output results/metaphlan_output/${sample}_metaphlan_profile;\
done


Processing sample: BH1
BowTie2 output file detected: results/metaphlan_output/BH1.bowtie2.bz2
Please use it as input or remove it if you want to re-perform the BowTie2 run.
Exiting...

Processing sample: BH2
BowTie2 output file detected: results/metaphlan_output/BH2.bowtie2.bz2
Please use it as input or remove it if you want to re-perform the BowTie2 run.
Exiting...

Processing sample: BH3
BowTie2 output file detected: results/metaphlan_output/BH3.bowtie2.bz2
Please use it as input or remove it if you want to re-perform the BowTie2 run.
Exiting...

Processing sample: BH4
BowTie2 output file detected: results/metaphlan_output/BH4.bowtie2.bz2
Please use it as input or remove it if you want to re-perform the BowTie2 run.
Exiting...

Processing sample: FH1
BowTie2 output file detected: results/metaphlan_output/FH1.bowtie2.bz2
Please use it as input or remove it if you want to re-perform the BowTie2 run.
Exiting...

Processing sample: FH2
BowTie2 output file detected: results/metaphlan_outp
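
The log above shows MetaPhlAn exiting for every sample because a Bowtie2 output file from an earlier run already exists. One way to make the loop safe to re-run is to process only the samples that do not yet have a profile. The following is a minimal sketch under that assumption, using the same file layout as the loop above. `list_samples` and `pending_samples` are hypothetical helper names introduced here for illustration; they are not part of MetaPhlAn.

```shell
# Hypothetical helpers (not part of MetaPhlAn), assuming paired reads are
# named <sample>_1.fastq.gz / <sample>_2.fastq.gz as in this notebook.

# Print one sample name per line, derived from the *_1.fastq.gz files in a directory.
list_samples() {
    local dir="$1" f name
    for f in "$dir"/*_1.fastq.gz; do
        [ -e "$f" ] || continue               # the glob matched nothing
        name="${f##*/}"                       # strip the directory part
        printf '%s\n' "${name%_1.fastq.gz}"   # strip the read-pair suffix
    done
}

# Print only the samples that have no <sample>_metaphlan_profile in the output directory.
pending_samples() {
    local in_dir="$1" out_dir="$2" sample
    list_samples "$in_dir" | while read -r sample; do
        [ -e "$out_dir/${sample}_metaphlan_profile" ] || printf '%s\n' "$sample"
    done
}
```

Looping over `$(pending_samples data/metagenome_samples results/metaphlan_output)` instead of the `ls | sed` pipeline would then call `metaphlan` only for samples that still lack a profile. If a sample has a Bowtie2 output file but no profile (for example after an interrupted run), the message in the log still applies: the Bowtie2 file has to be removed or used as input.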

Taxonomic profiles of all the samples can be merged into a single file by running the following:

In [14]:
! rm -f results/metaphlan_output/merged_metaphlan_abundance_table.txt
! merge_metaphlan_tables.py results/metaphlan_output/*profile >> results/metaphlan_output/merged_metaphlan_abundance_table.txt
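
The merged table contains one row per clade at every taxonomic rank, so it is often convenient to reduce it to species-level rows before comparing samples. The sketch below assumes the usual layout of the merged table: comment lines starting with `#`, a tab-separated header row whose first field is `clade_name`, and clade names written as `|`-separated ranks (`k__...|...|s__...`, with an extra `t__` level for SGBs). `species_table` is a hypothetical helper name, not a MetaPhlAn command.

```shell
# Hypothetical helper (not part of MetaPhlAn): print the header row and the
# species-level rows of a merged abundance table, with clade names shortened
# to their last rank (s__...).
species_table() {
    awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }                          # skip comment lines (e.g. database version)
        $1 == "clade_name" { print; next }     # keep the column header unchanged
        $1 ~ /s__/ && $1 !~ /t__/ {            # species rows, excluding SGB-level (t__) rows
            n = split($1, ranks, "[|]")
            $1 = ranks[n]                      # keep only the last rank in the lineage
            print
        }
    ' "$1"
}
```

It could be applied as `species_table results/metaphlan_output/merged_metaphlan_abundance_table.txt > results/metaphlan_output/merged_species_table.txt`, where the output file name is only an example.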


Further analysis requires R, so we will move to a different [notebook](statistical_analysis.ipynb) that runs an R kernel. Before moving to that notebook, though, let's also obtain a [taxonomic profile using Kraken2](kraken2.ipynb) so that we can compare the two approaches later.