# Filtering Nanopore data and assigning taxonomy to reads using Epi2Me

There are a variety of tools that can be used to filter out low-quality Nanopore reads and reads of unsuitable length, and to match reads to their most likely species. We're going to focus on one set of tools that is very easy to use: [Epi2me Labs](https://labs.epi2me.io). This is a bioinformatics workbench run by Oxford Nanopore that offers a variety of workflows built on existing tools.

In our case, we're going to use the metagenomics workflow. This workflow allows you to filter by read length and quality, filter out sequences from a host genome, detect antimicrobial resistance genes (in the case of shotgun metagenomic sequencing), and assign taxonomy to reads. There are two options for assigning taxonomy. The first is Kraken2/Bracken: Kraken2 breaks each read into k-mers, maps each k-mer to the lowest common ancestor (LCA) of all the genomes that contain it, and uses the pattern of k-mer hits to decide which taxon is the best match for the read; Bracken then converts these read assignments into species abundance estimates. Below is a visualization of the Kraken workflow.

The other option is mapping the reads to a provided database of specific genomes using minimap2. This is useful if you already know what should be in your sample, but that's rarely the case for microbiome research.

### Kraken workflow

<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fgb-2014-15-3-r46/MediaObjects/13059_2013_Article_3351_Fig1_HTML.jpg?as=webp" alt="Diagram of the Kraken sequence classification algorithm"  width="800" height="500" />
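
To make the k-mer idea concrete, here is a minimal sketch that lists every k-mer in a short sequence. The function name `kmers` and the toy sequence are made up for illustration, and this is only the first step of what Kraken2 does: the real program goes on to look each k-mer up in its database.

```shell
# Print every overlapping k-mer of a sequence, one per line.
# $1 = sequence, $2 = k (both are illustrative inputs, not workflow options)
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        # slide a window of width k along the sequence, one base at a time
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }'
}

kmers ACGTAC 3
```

A read of length L contains L - k + 1 overlapping k-mers, so a full-length 16S read contributes well over a thousand database lookups.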


### Using Epi2me Labs

Epi2me Labs has a [desktop application](https://labs.epi2me.io/downloads/) that is very easy and intuitive to use. However, it's not possible to use the desktop app with the way these computing resources are set up, so we're going to run it from the command line instead (it's kind of easier to see all your options this way anyway).

Epi2me workflows run using Nextflow, a workflow manager (with its own Groovy-based language) for chaining multiple programs together. It works with Docker, a way of packaging computing environments as containers. It's not important to know these things, but it IS important to know that I've pre-installed them here, and you will need to install them yourself if you want to run Epi2me Labs on your computer.
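
If you are setting this up on your own computer, a quick way to see whether both tools are installed is to ask the shell where they live. This is a minimal sketch; the helper name `check_tools` is made up for illustration, and it only checks that the commands exist on your `PATH`, not that they are configured correctly.

```shell
# Report whether each named command can be found on PATH.
# Returns a non-zero status if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools nextflow docker || echo "Install the missing tools before running the workflow."
```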

Run the command below to see all the options for the metagenomics workflow. In this case we need to use the flag "--help" rather than "-h", because "-h" brings up Nextflow's own, more general help menu.

In [2]:
nextflow run epi2me-labs/wf-metagenomics --help

N E X T F L O W  ~  version 23.04.3
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [4397263427]
Launching `https://github.com/epi2me-labs/wf-metagenomics` [reverent_euclid] DSL2 - revision: 54e6d3e743 [master]
WARN: NEXTFLOW RECURSION IS A PREVIEW FEATURE - SYNTAX AND FUNCTIONALITY CAN CHANGE IN FUTURE RELEASES

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-metagenomics v2.5.0-g54e6d3e
--------------------------------------------------------------------------------
Typical pipeline command:

  nex

                                          * ecoh
  --amr_minid                   [integer] Threshold of required identity to report a match between a gene in the database and fastq reads. Valid interval: 
                                          0-100 [default: 80] 
  --amr_mincov                  [integer] Minimum coverage (breadth-of) threshold required to report a match between a gene in the  database and fastq reads. 
                                          Valid interval: 0-100 [default: 80] 

Report Options
  --abundance_threshold         [number]  Remove those taxa whose abundance is lower than the chosen value.
  --n_taxa_barplot              [integer] Number of most abundance taxa to be displayed in the barplot. The rest of taxa will be grouped under the "Other" 
                                          category. [default: 8] 

Output Options
  --out_dir                     [string]

Let's try running one of the mock datasets.

In [3]:
ls ~/data/

Jessica_Zymo_DNA_16S_flongle_test  Zymo_16S_SRR25400687_Nanoplot_2
SRR25400687_unfiltered.biom	   Zymo_16S_table.biom
Zymo_16S_SRR25400687.fastq	   zymo_standard.csv
Zymo_16S_SRR25400687_Nanoplot	   zymo_standard_bacteriaonly.csv



Let's run the Institut Pasteur de Lille 16S dataset through Epi2me's metagenomics workflow. We'll need to provide the program with the input data (the FASTQ file that we input into NanoPlot), the name of a directory to store our results, and the name of the database that Epi2me (and therefore Kraken2, mentioned above) should use for assigning taxonomy. Since we are using 16S sequences (recall from the lectures yesterday that this is the bacterial 'barcode' gene), we should use a database of 16S sequences. We'll use the database 'ncbi_16s_18s'.

In [6]:
nextflow run epi2me-labs/wf-metagenomics --fastq data/Zymo_16S_SRR25400687.fastq \
    --out_dir Epi2me_results/Zymo_16S_Epi2me --database_set 'ncbi_16s_18s'


N E X T F L O W  ~  version 23.04.3
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [4397263427]
Launching `https://github.com/epi2me-labs/wf-metagenomics` [modest_joliot] DSL2 - revision: 54e6d3e743 [master]
WARN: NEXTFLOW RECURSION IS A PREVIEW FEATURE - SYNTAX AND FUNCTIONALITY CAN CHANGE IN FUTURE RELEASES

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-metagenomics v2.5.0-g54e6d3e
--------------------------------------------------------------------------------
Core Nextflow options
  re

Staging foreign file: https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_kraken2.tar.gz
executor >  local (4)
[b7/3aed0e] process > fastcat (1)                    [  0%] 0 of 1
[-        ] process > kraken_pipeline:unpackTaxonomy -
[-        ] process > kraken_pipeline:unpackDatabase -
[-        ] process > kraken_pipeline:determine_b... -
[-        ] process > kraken_pipeline:kraken_server  -
[38/b6e6d8] process > kraken_pipeline:run_common:... [  0%] 0 of 1
[cf/6ae75c] process > kraken_pipeline:run_common:... [100%] 1 of 1 ✔
[-        ] process > kraken_pipeline:kraken2_client -
[-        ] process > kraken_pipeline:progressive... -
[-        ] process > kraken_pipeline:progressive... -
[-        ] process > kraken_pipeline:progressive... -
[-        ] process > kraken_pipeline:makeReport     -
[d4/fe696f] process > kraken_pipeline:output (1)     [  0%] 0 of 1
[-        ] proces

[cf/6ae75c] process > kraken_pipeline:run_common:... [100%] 1 of 1 ✔
[-        ] process > kraken_pipeline:kraken2_client -
[-        ] process > kraken_pipeline:progressive... -
[-        ] process > kraken_pipeline:progressive... -
[-        ] process > kraken_pipeline:progressive... -
[-        ] process > kraken_pipeline:makeReport     -
[ca/98d128] process > kraken_pipeline:output (2)     [100%] 2 of 2
[-        ] process > kraken_pipeline:stop_kraken... -
Staging foreign file: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2023-01-01.zip
Staging foreign file: https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_kraken2.tar.gz
Staging foreign file: https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/database1000mers.kmer_distrib
executor >  local (8)
[b7/3aed0e] process > fastcat (1)                    [  0%] 0 of 1
[-        ] pr

Let's navigate to the results folder in the file menu on the left and see what our results look like.

### Quality filtering and length filtering 

Epi2me's workflow provides some options for filtering the data by length and by the quality score of the reads. What difference does filtering by length, by quality, or both, make in terms of what species we detect?

  --min_len                     [integer] Specify read length lower limit.
  
  --min_read_qual               [number]  Specify read quality lower limit.
  
  --max_len                     [integer] Specify read length upper limit

#### Filtering by length 

We know from yesterday's discussion that the 16S gene should be around 1,500 bp long. Let's try filtering the data so that the only sequences we keep are between 1,400 and 1,700 bp.

In [None]:
nextflow run epi2me-labs/wf-metagenomics --fastq data/Zymo_16S_SRR25400687.fastq \
    --out_dir Epi2me_results/Zymo_16S_Epi2me_lengthfiltered --database_set 'ncbi_16s_18s'\
    --min_len 1400 --max_len 1700
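
To see what a length filter does to individual reads, here is a minimal sketch of the same idea in plain `awk`. It is not how the workflow implements `--min_len` and `--max_len`; the function name `filter_by_length` is made up, it assumes a simple FASTQ file with exactly four lines per read, and treating both limits as inclusive is an assumption.

```shell
# Keep only reads whose sequence length is between min and max (inclusive).
# $1 = min length, $2 = max length; FASTQ on stdin, filtered FASTQ on stdout.
filter_by_length() {
    awk -v min="$1" -v max="$2" '
        NR % 4 == 1 { header = $0 }   # line 1 of a record: read name
        NR % 4 == 2 { bases = $0 }    # line 2: the sequence itself
        NR % 4 == 3 { plus = $0 }     # line 3: separator
        NR % 4 == 0 {                 # line 4: quality string, record complete
            n = length(bases)
            if (n >= min && n <= max) {
                print header; print bases; print plus; print $0
            }
        }'
}

# Toy example: a 4-base read is dropped, an 8-base read is kept.
printf '@short\nACGT\n+\nIIII\n@ok\nACGTACGT\n+\nIIIIIIII\n' | filter_by_length 6 10
```

Feeding the real FASTQ file through the function with limits of 1400 and 1700, and counting the output lines divided by four, gives a rough idea of how many reads a length filter like this keeps.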


#### Filtering by quality

Now let's try filtering out all the sequences with an average quality score below 14. 

In [None]:
nextflow run epi2me-labs/wf-metagenomics --fastq data/Zymo_16S_SRR25400687.fastq \
    --out_dir Epi2me_results/Zymo_16S_Epi2me_qualityfiltered --database_set 'ncbi_16s_18s'\
    --min_read_qual 14
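
The sketch below shows one common way a mean read quality can be computed: convert each base's Phred score to an error probability, average the probabilities, and convert back to a Phred score. Whether the workflow calculates `--min_read_qual` in exactly this way is an assumption; the function name `filter_by_mean_quality` is made up, and the code assumes four-line FASTQ records with Phred+33 quality encoding.

```shell
# Keep only reads whose mean quality is at least the given threshold.
# $1 = minimum mean quality; FASTQ on stdin, filtered FASTQ on stdout.
filter_by_mean_quality() {
    awk -v minq="$1" '
        BEGIN {
            # lookup table: quality character -> Phred score (Phred+33 encoding)
            for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33
        }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { bases = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            n = length($0); perr = 0
            # sum the per-base error probabilities, 10^(-Q/10)
            for (i = 1; i <= n; i++) perr += 10 ^ (-phred[substr($0, i, 1)] / 10)
            # convert the mean error probability back to a Phred score
            meanq = (n > 0) ? -10 * log(perr / n) / log(10) : 0
            if (meanq >= minq) { print header; print bases; print plus; print $0 }
        }'
}

# Toy example: "I" encodes Q40 and "#" encodes Q2, so only the first read passes Q14.
printf '@good\nACGT\n+\nIIII\n@bad\nACGT\n+\n####\n' | filter_by_mean_quality 14
```

Averaging in probability space means a few very poor bases pull the mean down much more than a plain average of the Phred scores would.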


#### Filtering by both

In [5]:
nextflow run epi2me-labs/wf-metagenomics --fastq data/Zymo_16S_SRR25400687.fastq \
    --out_dir Epi2me_results/Zymo_16S_Epi2me_qualityandlengthfiltered --database_set 'ncbi_16s_18s' \
    --min_len 1400 --max_len 1700 --min_read_qual 14



nextflow run epi2me-labs/wf-metagenomics --fastq ~/data/Zymo_16S_SRR25400687.fastq \
    --out_dir ~/backup/Zymo_16S_SRR25400687_Epi2me_filtered --database_set 'ncbi_16s_18s'\
    --min_len 1390 --max_len 1780 --min_read_qual 14
N E X T F L O W  ~  version 23.04.3
Launching `https://github.com/epi2me-labs/wf-metagenomics` [astonishing_hodgkin] DSL2 - revision: 54e6d3e743 [master]
WARN: NEXTFLOW RECURSION IS A PREVIEW FEATURE - SYNTAX AND FUNCTIONALITY CAN CHANGE IN FUTURE RELEASES

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-metagenomics v2.5.0-g54e6d3e
-------------


Now try your own way of filtering the data! Name the output directory "Epi2me_results/Zymo_16S_Epi2me_myfilter".
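
If you want to double-check your command before launching it, one option is to print it first. The helper below is a hypothetical sketch (the name `build_filter_cmd` and the example thresholds are placeholders, not recommendations); it only assembles the flags already used in this notebook and echoes the resulting command without running anything.

```shell
# Print the workflow command for a given set of filters without running it.
# $1 = label for the output directory, $2 = min length,
# $3 = max length, $4 = min mean read quality
build_filter_cmd() {
    printf '%s\n' "nextflow run epi2me-labs/wf-metagenomics --fastq data/Zymo_16S_SRR25400687.fastq --out_dir Epi2me_results/Zymo_16S_Epi2me_$1 --database_set ncbi_16s_18s --min_len $2 --max_len $3 --min_read_qual $4"
}

# Placeholder thresholds: swap in your own values.
build_filter_cmd myfilter 1300 1800 12
```

Once the printed command looks right, copy it into a new cell to run it.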

Let's navigate to the left-hand menu and look at the resulting file reports together.