# Workshop 8 (Week beginning May 18)
# Introduction to Metagenome Analysis

In this workshop we will attempt to produce a metagenome-assembled genome (MAG) from a microbial community associated with the human body, i.e. part of the human microbiome.

## Background

The Human Microbiome Project was a major driver of methods development for the analysis of microbial communities (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3564958/). Many sites around the body host microbial communities. The graph below summarises the samples analysed in the initial phase of the Human Microbiome Project.

<img src="data/workshop_8/hmgp_sample_sites.png">

## Kraken Reports

Some samples, such as faeces, have complex microbiomes (bacteria from many different species), which can make it more difficult to produce a MAG. 

The 33 samples used in today’s activity are from a less complex human microbial community. These samples were prepared from a series of vaginal swabs.

`kraken2` was used to produce a profile (`--report`) of the microbial community from the Illumina readset for each sample (paired-end, 100 base reads). Each report (.tab file) contains tab-separated information about the number of reads classified at each taxonomic level. The columns are, in order: (1) percentage of reads assigned to the clade rooted at the taxon, (2) number of reads assigned to the clade rooted at the taxon, (3) number of reads assigned directly to the taxon, (4) encoded taxonomic level (for example U = unclassified, R = root, S = species), (5) NCBI TaxID, (6) taxonomic classification (scientific name).

Open one of the reports and examine some of the classifications.
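As one way to examine a report, the sketch below lists the species-level rows with the most reads. The function name `top_species` and the header labels are hypothetical and chosen here for readability; the report files themselves have no header line.

```shell
# Print the n species-level rows (rank code S) with the most clade reads,
# under a header line. The header labels are made up for readability;
# they are not part of the report file.
top_species() {
    local report="$1" n="${2:-10}"
    local tab
    tab="$(printf '\t')"
    printf 'pct\tclade_reads\ttaxon_reads\trank\ttaxid\tname\n'
    # Keep rows whose 4th column is exactly "S", sort by column 2 (descending).
    awk -F'\t' '$4 == "S"' "$report" | sort -t "$tab" -k2,2nr | head -n "$n"
}
```

For example, `top_species 1119_21M.tab 5` would show the five most abundant species in that report, assuming the report file is named after the sample.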

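Produce an overview of the number of reads in each sample with the command below. It prints the unclassified line (rank code U) and, through `-A 1`, the line that follows it, which is the root line giving the total number of classified reads.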

`grep -P '\tU\t' -A 1 -H *.tab`
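The same numbers can be collected into one table with a line per sample. This is a minimal sketch, assuming the unclassified row has rank code `U` and the root row has rank code `R` in column 4; the function name `read_summary` is hypothetical.

```shell
# Print one line per report: sample name, unclassified reads, classified reads.
# The sample name is taken from the file name with the .tab suffix removed.
read_summary() {
    local report
    printf 'sample\tunclassified\tclassified\n'
    for report in "$@"; do
        awk -F'\t' -v s="$(basename "$report" .tab)" '
            $4 == "U" { u = $2 }   # reads with no classification
            $4 == "R" { c = $2 }   # reads in the clade rooted at "root"
            END { printf "%s\t%d\t%d\n", s, u, c }' "$report"
    done
}
```

Run it as `read_summary *.tab` in the directory that holds the reports.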

The version of the minikraken database used contains the human genome sequence. Human sequences represent an important potential ‘contaminant’ in the analysis of human microbiomes. Produce an overview of the human sequences in each sample.

`grep -P 'Homo sapiens' -H *.tab`

How many samples have less than 10% human reads?
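One way to answer this without reading the `grep` output by eye is to pull the percentage column for the *Homo sapiens* row out of each report. The sketch below assumes the species row has rank code `S` and a name ending in `Homo sapiens`; `human_pct` is a hypothetical name, and a sample with no human row is reported as 0.00.

```shell
# Print the percentage of reads assigned to the Homo sapiens clade per report.
human_pct() {
    local report
    printf 'sample\thuman_pct\n'
    for report in "$@"; do
        awk -F'\t' -v s="$(basename "$report" .tab)" '
            # Column 1 is padded with spaces; adding 0 converts it to a number.
            $4 == "S" && $6 ~ /Homo sapiens$/ { pct = $1 + 0 }
            END { printf "%s\t%.2f\n", s, pct }' "$report"
    done
}
```

The samples below the threshold can then be counted with `human_pct *.tab | awk -F'\t' 'NR > 1 && $2 < 10' | wc -l`.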

The organism we would like to produce a MAG for is *Gardnerella vaginalis*. For a reasonable draft genome sequence we will need at least 15x depth. The estimated genome size for *Gardnerella vaginalis* can be found here: https://www.ncbi.nlm.nih.gov/genome/genomes/1967

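List the species-level (rank code S) *Gardnerella* classifications in each sample with the loop below.

`for i in *.tab; do grep -H -P '\tS\t' "$i" | sort -k 2 -n | grep Gardnerella; done`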

Which samples have enough *Gardnerella vaginalis* reads to produce an acceptable draft genome sequence? 
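Depth can be estimated as (number of reads × bases per read) / genome size. The sketch below applies that formula to the *Gardnerella vaginalis* species row of each report. The function name `gv_depth` is hypothetical, and both numeric arguments are assumptions you must supply: take the genome size from the NCBI page linked above, and use 100 bases per read if the report counts individual reads, or 200 if it counts read pairs.

```shell
# Estimate sequencing depth for Gardnerella vaginalis in each report.
# Usage: gv_depth GENOME_SIZE BASES_PER_READ report.tab [report.tab ...]
gv_depth() {
    local genome_size="$1" bases_per_read="$2" report
    shift 2
    printf 'sample\tgv_reads\tdepth\n'
    for report in "$@"; do
        awk -F'\t' -v s="$(basename "$report" .tab)" \
            -v g="$genome_size" -v b="$bases_per_read" '
            # Column 2 holds reads assigned to the species clade (strains included).
            $4 == "S" && $6 ~ /Gardnerella vaginalis$/ { reads = $2 }
            END { printf "%s\t%d\t%.1f\n", s, reads, reads * b / g }' "$report"
    done
}
```

For example, `gv_depth 1650000 100 *.tab` uses a placeholder genome size of 1.65 Mb; samples with a depth of 15 or more meet the requirement stated above. The estimate is an upper bound, because not every classified read will end up in the assembly.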

## Create a MAG

Reads for sample 1119_21M have been assembled using `MEGAHIT`. The metagenome assembly (1119_21M.fa) contains 4716 contigs (minimum length 500 bases) and a total of 22,543,734 bases. Remember that this assembly contains sequences from multiple species.

Identify the contigs with similarity to the genome sequence of *Gardnerella vaginalis* strain 409-05 (NC_013721.1; 409-05.fa) using BLAST.
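A possible way to run this comparison is sketched below: `blast_contigs` aligns every contig (query) against the reference (subject) and writes tabular output (`-outfmt 6`), and `summarise_hits` reports the number of alignments and the total aligned bases per contig. Both function names and the e-value cut-off of 1e-10 are assumptions made for illustration, not settings prescribed by this workshop.

```shell
# Align the assembly against the reference genome with blastn.
# Tabular columns: qseqid sseqid pident length mismatch gapopen
#                  qstart qend sstart send evalue bitscore
blast_contigs() {
    local assembly="$1" reference="$2" out="$3"
    blastn -query "$assembly" -subject "$reference" \
           -outfmt 6 -evalue 1e-10 -out "$out"
}

# Per contig: number of alignments and summed alignment length,
# sorted so the contigs with the most aligned bases come first.
# Overlapping alignments are counted more than once.
summarise_hits() {
    local hits="$1" tab
    tab="$(printf '\t')"
    awk -F'\t' '{ n[$1]++; len[$1] += $4 }
                END { for (c in n) printf "%s\t%d\t%d\n", c, n[c], len[c] }' "$hits" |
        sort -t "$tab" -k3,3nr
}
```

Typical use would be `blast_contigs 1119_21M.fa 409-05.fa hits.tsv` followed by `summarise_hits hits.tsv | less`.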

Extract the relevant contigs from the metagenome assembly using `samtools` based on your BLAST results (Hint: use `samtools faidx -r` with a regions file, in which each region, here a contig name, is on its own line).
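The sketch below turns tabular BLAST output (`-outfmt 6`) into a regions file and passes it to `samtools faidx`. The function names are hypothetical, and the thresholds (at least 90% identity over at least 500 aligned bases) are assumptions for illustration; choose values that suit your own BLAST results.

```shell
# Write the names of contigs that have at least one alignment passing
# both thresholds, one name per line, without duplicates.
# Column 1 = contig name, column 3 = % identity, column 4 = alignment length.
make_regions() {
    local hits="$1" min_ident="${2:-90}" min_len="${3:-500}"
    awk -F'\t' -v id="$min_ident" -v len="$min_len" \
        '$3 >= id && $4 >= len { print $1 }' "$hits" | sort -u
}

# Pull the listed contigs out of the assembly into a new FASTA file.
extract_contigs() {
    local assembly="$1" regions="$2" out="$3"
    samtools faidx "$assembly" -r "$regions" > "$out"
}
```

For example: `make_regions hits.tsv > gv_regions.txt` and then `extract_contigs 1119_21M.fa gv_regions.txt gv_mag.fa`, where `hits.tsv` is your BLAST output and the other file names are placeholders.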

What metrics could you use to assess the *Gardnerella vaginalis* MAG you produced from 1119_21M?
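As a starting point, some basic summary statistics (number of contigs, total bases, largest contig and N50) can be computed directly from the FASTA file and compared with the expected genome size. The sketch below is one way to do this with standard tools; `assembly_stats` is a hypothetical name, and these statistics say nothing about contamination or completeness, which need separate checks.

```shell
# Report contig count, total bases, largest contig and N50 for a FASTA file.
assembly_stats() {
    local fasta="$1"
    # First awk: print the length of each sequence (handles wrapped lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$fasta" |
    sort -nr |
    # Second awk: lengths arrive longest first; N50 is the length of the
    # contig at which the running total reaches half of all bases.
    awk '{ l[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += l[i]
                 if (run >= total / 2) { n50 = l[i]; break }
             }
             printf "contigs\t%d\ntotal_bases\t%d\nlargest\t%d\nN50\t%d\n",
                    NR, total, l[1], n50
         }'
}
```

Running `assembly_stats gv_mag.fa` on your extracted contigs (the file name is a placeholder) and `assembly_stats 409-05.fa` on the reference gives two sets of numbers to compare.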

An alternative approach: given that *Gardnerella vaginalis* is the dominant organism in the sample, produce a list of contigs with high read coverage (say `multi` > 50) and use this list to extract the relevant contigs.

The ‘contig coverage’ (multi=) is contained in the header of each of the contigs in 1119_21M.fa.

`grep '>' 1119_21M.fa | less`
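The coverage filter can be sketched as follows, assuming each header has the form `>NAME ... multi=VALUE ...` with space-separated fields, as seen in the `grep` output. The function name `high_coverage_contigs` is hypothetical, and the default threshold of 50 is the example value suggested above.

```shell
# Print the names of contigs whose multi= value is greater than a threshold.
high_coverage_contigs() {
    local assembly="$1" min_multi="${2:-50}"
    grep '^>' "$assembly" |
    awk -v min="$min_multi" '{
        name = substr($1, 2)              # drop the leading ">"
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^multi=/) {
                split($i, kv, "=")
                if (kv[2] + 0 > min) print name
            }
        }
    }'
}
```

The output is already in the one-name-per-line format that `samtools faidx -r` expects, for example `high_coverage_contigs 1119_21M.fa 50 > high_cov.txt` followed by `samtools faidx 1119_21M.fa -r high_cov.txt > high_cov.fa` (output file names are placeholders).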

Thank you to Dr. Dieter Bulach and Dharmesh Bhuva for developing the tutorial material. Updated by Steven Morgan.