# Now, for something completely different

I have talked a lot about sequencing data analysis. This was assuming you have a nice sequencing library, containing human or animal or plant DNA.

Then, I have talked about population genomics, assuming you have genotypes from such clades, with enough resolution to allow statistical approaches like PCA, f-statistics etc.

However, what if you have a sample coming from the environment? Or from sediments? Or ancient DNA of rather poor quality?

# Metagenomics

Then, taxonomic profiling would be a good idea!

## kraken2

Let's try a tool called [kraken2](https://doi.org/10.1186/s13059-019-1891-0)!

What does this do? It tries to match short pieces (k-mers) of the sequencing reads to a database of possible genomes. This is not an easy task; it is basically a shortcut for mapping complete sequencing reads to a large number of complete genomes. The output can be a per-read listing of the taxon each read is assigned to (essentially creating another copy of your data), plus a summary report telling you what is there.

For now, let's focus on this summary report! To make things easy, I provide a (rather small) standard database for `kraken2`. Depending on your question, you may use [different ones](https://benlangmead.github.io/aws-indexes/k2) or create your own.

```
WDIR=/lisc/scratch/course/2024w550001/

module load kraken2

kraken2 --db ../share/krakendb  --output - test3_p.fastq.gz --report=ktest3.txt 
```

This may take a while, and it can be quite memory intensive, especially when using large databases. Note that it is better to use files with the adapter sequences already removed: the preprocessing step is the same as it was for mapping!

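The report is just a tab-separated text table. For each taxon it lists the percentage and number of reads in that clade, the number of reads assigned directly to the taxon, a rank code (for example `S` for species), the NCBI taxonomy ID, and the name. Let's inspect it!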
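One way to get a first overview is to sort the rows by read count. This is only a minimal sketch: `report_top` is a hypothetical helper (not part of `kraken2`), and it assumes the default report layout with the clade read count in the second tab-separated column.

```shell
# Hypothetical helper: show the N rows of a kraken2 report with the most reads.
# Assumes the default six-column, tab-separated report layout.
report_top() {
    report=$1
    n=${2:-10}    # number of rows to show, default 10
    # Header line so the columns are easier to read
    printf 'percent\tclade_reads\tdirect_reads\trank\ttaxid\tname\n'
    # Sort numerically by column 2 (reads in clade), largest first
    sort -t "$(printf '\t')" -k2,2nr "$report" | head -n "$n"
}
```

For example, `report_top ktest3.txt 15` would show the 15 most abundant rows. Keep in mind that higher ranks include the reads of everything below them, so the top rows are usually very broad categories.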

If we try a different set of data, what do we see? Where might this sample come from?

```
kraken2 --db ../share/krakendb  --output - test_p.fastq.gz --report=ktest.txt 
```

Also, we can subset to things that are more common. Here again, it is very handy to have some knowledge of basic file processing:

```
awk '$2>1000 && $4=="S"' ktest.txt
```

Now let's interpret this! Where could this DNA potentially come from?

Of course, different genomes have different sizes: a large genome attracts more reads than a small one at the same abundance, so the read count or percentage by itself is of limited use. Once something interesting is identified, one might need to account for genome size etc.
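To make the genome-size point concrete, one simple option is to express counts as reads per megabase of genome. This is a sketch with a hypothetical helper (`reads_per_mb`), and the genome sizes in the example are rough, illustrative figures that you would have to look up yourself; they are not taken from the `kraken2` database.

```shell
# Hypothetical helper: reads per megabase of genome.
# Usage: reads_per_mb READ_COUNT GENOME_SIZE_IN_BP
reads_per_mb() {
    awk -v r="$1" -v g="$2" 'BEGIN { printf "%.3f\n", r / (g / 1e6) }'
}
```

For example, `reads_per_mb 1000 5000000` (1000 reads on a 5 Mb bacterial genome) gives 200 reads per Mb, while `reads_per_mb 1000 3100000000` (the same 1000 reads on a roughly 3100 Mb human genome) gives about 0.3. The same read count can thus mean very different things.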

Also, using a different `kraken2` database can help once you have an idea what is going on in the data (or, of course, because you know that it is a sample of a specific type).



## A little challenge

Good, this is great so far!

Now, let's look at some other data!

In the `share` directory, there are two more `fastq.gz` files: `test5.fastq.gz` and `test6.fastq.gz`.

* Preprocess and screen them!

* Maybe also map them to the human genome? What does that look like?

* What do you see here?


## BLAST

Another commonly used way of classifying is BLAST. This is a very different algorithm, and it can query against a very large search space. Results might be quite different and subject to interpretation, with a major problem being the format of the output.

In principle, `blast` can be run on the command line, but the resources in this environment are rather limited. Instead, we will do it through the NCBI BLAST web interface.

Another problem is that `blast` does not accept `fastq` format files, so they have to be converted into `fasta` files (the same format that reference genomes have). Given the time, we may just take a very small subset of 50 reads. There are many ways to do this; one would be with `awk`:

```
# each read gives 2 fasta lines (header + sequence), so 100 lines = 50 reads
zcat test5_p.fastq.gz | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' | head -n 100 > testforfa5.fsa
```

To be honest, I searched the internet for the easiest way to do this. Then, we download the file and query it through the [web interface](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome).

This procedure works well for single reads, but it doesn't scale easily. However, if `kraken2` identified a handful of reads as belonging to a virus, and you are interested in viruses, you may try blasting just those reads.
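Pulling out the reads of one taxon could be sketched like this. It assumes that you ran `kraken2` again with `--output` pointing to a file instead of `-`, and that this per-read file has the default layout (classification flag, read ID and taxonomy ID in the first three tab-separated columns). Both function names are hypothetical helpers, and only reads assigned exactly to the given taxonomy ID are caught, not reads assigned to its sub-taxa.

```shell
# Hypothetical helper: IDs of classified reads assigned exactly to one taxonomy ID.
# Usage: reads_for_taxid KRAKEN_PER_READ_OUTPUT TAXID
reads_for_taxid() {
    awk -F'\t' -v t="$2" '$1 == "C" && $3 == t { print $2 }' "$1"
}

# Hypothetical helper: write the listed reads from a gzipped fastq as fasta.
# Usage: fastq_subset_to_fasta ID_LIST FASTQ_GZ
fastq_subset_to_fasta() {
    gzip -dc "$2" | awk -v ids="$1" '
        # Load the wanted read IDs, one per line
        BEGIN { while ((getline line < ids) > 0) keep[line] = 1 }
        # Header line: the ID is the first word without the leading "@"
        NR % 4 == 1 { name = substr($1, 2); want = (name in keep) }
        NR % 4 == 1 && want { print ">" name }
        # Sequence line of a wanted read
        NR % 4 == 2 && want { print }
    '
}
```

With a per-read file called, say, `ktest.kraken` (a made-up name), you would run `reads_for_taxid ktest.kraken TAXID > ids.txt` with a taxonomy ID taken from the fifth column of the report, and then `fastq_subset_to_fasta ids.txt test_p.fastq.gz > selected.fsa` to get a file for the web interface.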



## Competitive mapping

Another strategy is competitive mapping. Here, you use multiple reference genomes and try to map your reads to all of them at once. Then you check which one "attracts" the most reads.

There are some tools which can do this; here we will use [bbmap](https://sourceforge.net/projects/bbmap/). It has a function to take multiple reference genomes and find the best hit for each read. It's a nice tool because it does not require a tedious installation, so I put a copy into the `share` directory. Assume the data from `test5` is likely from a primate, but we are not certain if it is human.

On the first run, with the following code, a reference set of multiple genomes is built. This took several hours, so please do not repeat it yourselves.

```
#$WDIR/share/bbmap/bbsplit.sh in=test5_p.fastq.gz ref=$WDIR/ref_gen/hg19.fa,$WDIR/ref_gen/Pan_paniscus.panpan1.1.dna.toplevel.fa,$WDIR/ref_gen/Papio_anubis.Panubis1.0.dna.toplevel.fa,$WDIR/ref_gen/Pongo_abelii.Susie_PABv2.dna.toplevel.fa out=clean1.fq refstats=test5_bbstats.txt
```

Now, you can actually run:

```
$WDIR/share/bbmap/bbsplit.sh -Xmx20g in=test5_p.fastq.gz ref=$WDIR/martin/ref out=clean1.fq refstats=test5_bbstats.txt
```

Now, let's inspect the output with the refstats!
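A quick way to rank the references could look like the sketch below. It assumes that the refstats file is tab-separated with a header line, and that one of its columns is called `unambiguousReads`; please check the header of your own file first. `rank_refs` is a hypothetical helper, not part of `bbmap`.

```shell
# Hypothetical helper: rank reference genomes by one column of the refstats file.
# Usage: rank_refs REFSTATS_FILE [COLUMN_NAME]
# Assumes a tab-separated file whose first line is a header.
rank_refs() {
    awk -F'\t' -v col="${2:-unambiguousReads}" '
        NR == 1 {
            # Find the requested column by its name in the header
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        # Print the value first, then the reference name
        { print $c "\t" $1 }
    ' "$1" | sort -k1,1nr
}
```

Running `rank_refs test5_bbstats.txt` would then list the genomes with the most uniquely assigned reads on top. Reads that fit several genomes equally well say little about the species, which is why the unambiguous ones are the more informative number here.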

This in principle works for other organisms/genomes as well.
