# Organizing and visualizing Nanopore data

1. Navigating the results of a Nanopore run (including combining fastq files from a run)
2. Visualizing and examining Nanopore data with NanoPlot

## 1. Navigating the results of a Nanopore run

The software on an Mk1C, or MinKNOW on a computer when running an Mk1B, allows you to look at how a Nanopore sequencing run is going in real time. Once a Nanopore run is finished, a report is generated that you can inspect at your leisure, along with the fast5 files and the resulting fastq files from the run. We will inspect each element.

As part of her summer work, Jessica was trying out sequencing on a Flongle flow cell using Zymo's mock microbial DNA as the DNA input. She used the 16S amplicon library prep kit to generate full-length 16S sequences. She used barcodes to differentiate between multiple samples; for each barcode she used a different concentration of DNA. The purpose of this was to test whether the Flongle flow cell could accurately sequence small amounts of DNA.

In this case, the Flongle flow cells sat in the Texas heat for multiple days before they made it to the lab. As we've discussed, Nanopore flow cells do not appreciate that! Jessica decided to use one of them to practice on and see if she could get good data. Did she? We'll be the judge of that.

Let's navigate to the folder Jessica_Zymo_DNA_16S_flongle_test and see what's in it:

In [None]:
ls

In [4]:
cd data/Jessica_Zymo_DNA_16S_flongle_test
ls

20230814_1517_MN35209_NEW001_ada67482


There is only one folder. Its name follows the pattern

"[start_time]\_[device_ID]\_[flow_cell_id]\_[short_protocol_run_id]"

so it records when the run started, which device and flow cell were used, and a short run ID. That's the results folder for the run. Let's navigate further inside.

In [7]:
cd 20230814_1517_MN35209_NEW001_ada67482
ls

barcode_alignment_NEW001_ada67482_8572d6d5.tsv
fast5_fail
fast5_pass
fast5_skip
fastq_fail
fastq_pass
final_summary_NEW001_ada67482_8572d6d5.txt
other_reports
pore_activity_NEW001_ada67482_8572d6d5.csv
report_NEW001_20230814_1522_ada67482.html
report_NEW001_20230814_1522_ada67482.json
report_NEW001_20230814_1522_ada67482.md
sample_sheet_NEW001_20230814_1522_ada67482.csv
sequencing_summary_NEW001_ada67482_8572d6d5.txt
throughput_NEW001_ada67482_8572d6d5.csv
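
Before digging into these files: the run folder's name can be pulled apart on the command line. Here is a minimal sketch using `cut`, assuming the name always follows the underscore-separated pattern described above (the variable names are made up for illustration):

```bash
# Folder name copied from the ls output earlier
run_dir="20230814_1517_MN35209_NEW001_ada67482"

# cut -d_ splits the text on underscores; -f picks which field(s) to keep
start_time=$(echo "$run_dir" | cut -d_ -f1,2)   # date and time the run started
device_id=$(echo "$run_dir" | cut -d_ -f3)      # ID of the sequencing device
flow_cell_id=$(echo "$run_dir" | cut -d_ -f4)   # ID of the flow cell
run_id=$(echo "$run_dir" | cut -d_ -f5)         # short protocol run ID

echo "Started:   $start_time"
echo "Device:    $device_id"
echo "Flow cell: $flow_cell_id"
echo "Run ID:    $run_id"
```

This kind of splitting comes in handy when you have many run folders and want to sort or label them by date or device.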


***What is in the result folder from a Nanopore run?***

Record of the barcodes used in this experiment
1. barcode_alignment_NEW001_ada67482_8572d6d5.tsv
***
Directories containing fast5 files - these hold the raw signals from the pores. MinKNOW acquires these data in chunks from the device, decides where reads begin and end, and saves them to fast5 files. (Newer versions have switched to the pod5 format, which is an upgraded file storage system, but that's too rich for our blood so far.)

2. fast5_fail <-- reads with quality below the threshold set by the software (default is a Q score of 7)
3. fast5_pass <-- reads with quality above the threshold
4. fast5_skip <-- reads that real-time basecalling was not fast enough to deal with, so MinKNOW puts them aside here
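
To get a feel for what a Q score threshold means: a Phred-style quality score Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below prints that probability for a few Q scores; the helper name `q_to_error` and the list of scores are just for illustration.

```bash
# Convert a Phred quality score into an estimated per-base error probability
q_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.3f\n", 10^(-q/10) }'
}

for q in 7 10 15 20; do
    echo "Q$q -> error probability $(q_to_error "$q")"
done
```

So a Q score of 7 corresponds to an estimated error rate of about 20%, Q10 to 10%, and Q20 to 1%.
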
***
Directories containing fastq files - this is your raw sequencing data. Basecalling software (in this case Guppy, which is part of MinKNOW) determines how the 'squiggles' - i.e. the raw signals from the pores - translate into nucleotides. Fastq files contain both the nucleotide sequences and the quality score associated with each base, while fasta files contain only nucleotide sequences.

5. fastq_fail <-- bad quality reads (below the quality threshold)
6. fastq_pass <-- good quality reads (or at least reads above the quality threshold)
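
The difference between fastq and fasta is easy to see on a tiny example. The sketch below builds a two-read fastq file from made-up reads, counts the reads, and writes a fasta version. It assumes every read takes exactly four lines (header, sequence, `+`, qualities), which is the usual layout for basecaller output but is not guaranteed for every fastq file you might meet.

```bash
# Build a tiny two-read fastq file in a scratch folder (made-up reads)
demo_dir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCCTTAA\n+\n55555555\n' > "$demo_dir/demo.fastq"

# Each read takes 4 lines, so the number of reads is the line count divided by 4
n_reads=$(awk 'END { print NR / 4 }' "$demo_dir/demo.fastq")
echo "Reads in file: $n_reads"

# A fasta version keeps only the header (with '@' swapped for '>') and the sequence
awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' \
    "$demo_dir/demo.fastq" > "$demo_dir/demo.fasta"
cat "$demo_dir/demo.fasta"
```

Notice that the quality lines are simply dropped in the fasta file, which is why you can go from fastq to fasta but not back again.
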
***
A variety of reports showing different metrics of how the sequencing run went.

7. final_summary_NEW001_ada67482_8572d6d5.txt
8. other_reports
9. pore_activity_NEW001_ada67482_8572d6d5.csv
10. report_NEW001_20230814_1522_ada67482.html
11. report_NEW001_20230814_1522_ada67482.json
12. report_NEW001_20230814_1522_ada67482.md
13. sample_sheet_NEW001_20230814_1522_ada67482.csv
14. sequencing_summary_NEW001_ada67482_8572d6d5.txt
15. throughput_NEW001_ada67482_8572d6d5.csv

The fastq_pass folder is where the sequencing results that passed MinKNOW's original quality filter are located - let's check it out.

In [10]:
ls fastq_pass

barcode02			barcode07
barcode04			my_happy_sequencing_data.fastq
barcode05			unclassified
barcode06


As we can see, it's full of subdirectories with the names of the different barcodes. MinKNOW automatically checks for the barcode tag on each sequence and puts it in its proper folder. If it can't figure out which barcode it is, or if it can't find a barcode (it's not a 100% perfect process), the sequence goes in the "unclassified" folder.
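
A quick way to see how the reads were split up is to loop over the barcode folders and count the reads in each one. Here is a minimal sketch on a miniature, made-up copy of the fastq_pass layout (the file names and reads are invented; it also assumes uncompressed fastq files with four lines per read):

```bash
# Recreate a miniature fastq_pass layout in a scratch folder (made-up reads)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/fastq_pass/barcode02" "$demo_dir/fastq_pass/unclassified"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_dir/fastq_pass/barcode02/demo_0.fastq"
printf '@r3\nCCGA\n+\nIIII\n' > "$demo_dir/fastq_pass/unclassified/demo_0.fastq"

# Loop over every subfolder and count its reads (4 lines per read)
for folder in "$demo_dir"/fastq_pass/*/; do
    n_reads=$(cat "$folder"*.fastq | awk 'END { print NR / 4 }')
    echo "$(basename "$folder"): $n_reads reads"
done
```

If the files in a real run folder are still gzipped, swapping `cat` for `gunzip -c` (and `*.fastq` for `*.fastq.gz`) should give the same kind of summary.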

Now, let's examine what's inside one of these barcode folders:

In [11]:
cd fastq_pass/barcode02

In [12]:
ls 

NEW001_pass_barcode02_ada67482_8572d6d5_0.fastq


Looks like there is only one file of sequencing results, meaning there wasn't a lot of data. As Nanopore sequencing continues, MinKNOW writes the sequencing results to a file until the amount of data it has collected reaches a certain threshold, then it starts a new file. See how this file ends in 0? If there had been more sequencing results, you would see files labeled with 1, 2, 3, etc., sometimes into the hundreds.

Let's unzip the file:

In [17]:
gunzip NEW001_pass_barcode02_ada67482_8572d6d5_0.fastq.gz

In [19]:
ls

NEW001_pass_barcode02_ada67482_8572d6d5_0.fastq


Now let's navigate to the barcode04 folder:

In [15]:
cd ../barcode04

In [16]:
ls

NEW001_pass_barcode04_ada67482_8572d6d5_0.fastq


Hmm, only one sequencing file there too. Let's unzip it:

In [28]:
gunzip NEW001_pass_barcode04_ada67482_8572d6d5_0.fastq.gz

Now, let's move back up to the fastq_pass folder and try concatenating the files from these two barcodes into a single fastq file, called "my_happy_sequencing_data.fastq":

In [33]:
cd ..
cat barcode02/NEW001_pass_barcode02_ada67482_8572d6d5_0.fastq barcode04/NEW001_pass_barcode04_ada67482_8572d6d5_0.fastq > my_happy_sequencing_data.fastq

In [34]:
ls

barcode02  barcode05  barcode07			      unclassified
barcode04  barcode06  my_happy_sequencing_data.fastq


Why did we do that? Some of the programs you might wish to use on your Nanopore data require all of the sequencing data for a sample to be in a single file. The cat command lets you combine the contents of multiple files into one. (Here we combined two different barcodes just for practice; in a real analysis you would usually combine only the files that belong to the same sample.)
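
When a barcode folder holds many numbered files, typing every name gets old fast. A wildcard (`*`) lets `cat` pick them all up at once, and gzipped files can be concatenated without unzipping them first. The sketch below demonstrates this on two made-up chunks; the file names are invented to resemble MinKNOW's numbering.

```bash
# Make two small gzipped fastq "chunks" in a scratch folder (made-up reads)
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$demo_dir/demo_pass_barcode02_0.fastq"
printf '@r2\nTTGA\n+\nIIII\n' > "$demo_dir/demo_pass_barcode02_1.fastq"
gzip "$demo_dir"/demo_pass_barcode02_*.fastq

# The * wildcard matches every chunk, so one cat command combines them all.
# Gzipped files can be concatenated directly; the result is still valid gzip.
cat "$demo_dir"/demo_pass_barcode02_*.fastq.gz > "$demo_dir/barcode02_all.fastq.gz"

# Sanity check: the combined file should hold every read from every chunk
n_reads=$(gunzip -c "$demo_dir/barcode02_all.fastq.gz" | awk 'END { print NR / 4 }')
echo "Reads in combined file: $n_reads"
```

Give the combined file a name that the wildcard does not match (as done here), so that `cat` never tries to read the file it is writing.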

#### Exploring the output files of the run

Now we will look at the report from this sequencing run. In the menu on the left-hand side, navigate to Jessica's sequencing results folder and click on 'report_NEW001_20230814_1522_ada67482.html'. We will go through the results of this run together. (Did this run go well?)

## 2. Visualizing and examining Nanopore data with NanoPlot

Let's go back to the main data folder. Now we're going to use a program called NanoPlot (already installed for you) to explore and visualize some data from a Nanopore run.

In [17]:
cd ~/data
ls



#### Using flags

Most programs that you run on the command line use 'flags' - input options that start with a dash followed by a letter or a word, which tell the program what kind of information or behavior you are asking for. The short commands you learned earlier today in the intro to bash all have flags you can use if you want to make your commands more specific or useful.

For standard Unix commands like ls, cd, mv, etc. you can use the word 'man' (short for manual) followed by the command to see all of the possible flags/options. Try looking at the manual for 'ls' below. To refresh your memory, ls lists the contents of a directory. But there are lots of ways to list the contents!

In [None]:
man ls

That's a lot, right? Most of that you likely won't ever use. 

However, you'll need to know how to use flags to be able to input data into a program in the command prompt.
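
To make flags concrete, here are a few of the `ls` flags you are most likely to use, tried out on a scratch folder with made-up files:

```bash
# Set up a scratch folder with a few files, including a hidden one
demo_dir=$(mktemp -d)
printf 'a small file\n' > "$demo_dir/small.txt"
printf 'a somewhat bigger file with more text in it\n' > "$demo_dir/bigger.txt"
touch "$demo_dir/.hidden_file"

ls "$demo_dir"        # plain listing: names only, hidden files not shown
ls -a "$demo_dir"     # -a: also show hidden files (names starting with a dot)
ls -l "$demo_dir"     # -l: long format with permissions, size and date
ls -lh "$demo_dir"    # flags can be combined: -h makes the sizes human-readable
```

The same pattern (program name, then flags, then what to act on) carries over to the bioinformatics programs used below.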

Most of the programs you'll be using for bioinformatics purposes allow you to look at the input options using the flag '-h' or '--help' - the help menu, basically. Let's explore the input options for the program "fastANI", which is already installed, by using the -h flag. (fastANI calculates average nucleotide identity between two genomes - we won't be using it today, but I'm using it as an example of a nice, simple, well-written help menu!)


In [4]:
fastANI -h

-----------------
fastANI is a fast alignment-free implementation for computing whole-genome Average Nucleotide Identity (ANI) between genomes
-----------------
Example usage:
$ fastANI -q genome1.fa -r genome2.fa -o output.txt
$ fastANI -q genome1.fa --rl genome_list.txt -o output.txt

SYNOPSIS
--------
fastANI [-h] [-r <value>] [--rl <value>] [-q <value>] [--ql <value>] [-k
        <value>] [-t <value>] [--fragLen <value>] [--minFraction <value>]
        [--visualize] [--matrix] [-o <value>] [-v]

OPTIONS
--------
-h, --help
     print this help page

-r, --ref <value>
     reference genome (fasta/fastq)[.gz]

--rl, --refList <value>
     a file containing list of reference genome files, one genome per line

-q, --query <value>
     query genome (fasta/fastq)[.gz]

--ql, --queryList <value>
     a file containing list of query genome files, one genome per line

-k, --kmer <value>
     kmer size <= 16 [default : 16]

-t, --threads <value>
     thread count for parallel execution [defa

Of interest: Based on the example, it looks like we would need an input genome to be the query (-q), an input genome to be the reference (-r), and the name of an output file (-o). That seems to be all we need to run the program - as the example in the help menu shows.

fastANI -q genome1.fa -r genome2.fa -o output.txt

Of note: the -t flag. This specifies the number of threads (i.e. parallel processes) the program can use. This is very important in bioinformatics because a process will often take a very, very long time unless you give it multiple threads. It used to be that personal computers didn't even have multiple threads to use, although that is changing now. Right now, though, we are logged onto a Linux server, which does support multithreading.
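
Before handing a program a thread count, it helps to know how many the machine actually has. A minimal sketch, assuming either `nproc` (found on most Linux systems) or `getconf` is available:

```bash
# Ask the system how many CPU threads are available.
# nproc exists on most Linux systems; getconf is a fallback for others (e.g. macOS).
threads=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "This machine reports $threads CPU threads"
```

On a shared server, it is good manners to use only a portion of the available threads so that other people's jobs can still run.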

#### NanoPlot

Now let's check out the options for NanoPlot, a program for filtering and visualizing Nanopore data. Let's look at the help menu:

In [3]:
NanoPlot -h

usage: Nanoplot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge]
                [-o OUTDIR] [--no_static] [-p PREFIX] [--tsv_stats]
                [--info_in_report] [--maxlength N] [--minlength N]
                [--drop_outliers] [--downsample N] [--loglength]
                [--percentqual] [--alength] [--minqual N] [--runtime_until N]
                [--readtype {1D,2D,1D2}] [--barcoded] [--no_supplementary]
                [-c COLOR] [-cm COLORMAP]
                [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]]
                [--plots [{kde,hex,dot} ...]] [--legacy [{kde,dot,hex} ...]]
                [--listcolors] [--listcolormaps] [--no-N50] [--N50]
                [--title TITLE] [--font_scale FONT_SCALE] [--dpi DPI]
                [--hide_stats]
                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --ubam file [file ...] | --cram

Again, that is a LOT. However, we'll do a very simple run - we'll give it a single input file (a fastq file - raw sequencing data). We'll give the program an output folder name, tell it that we want the format of any images generated to be a .png, and tell it that it can use 10 threads.

Here, we'll run NanoPlot on a dataset generated at the Institut Pasteur de Lille. This research group sequenced the ZymoBIOMICS microbial community DNA standard using Oxford Nanopore's 16S kit. This kit is designed to amplify the entire 16S rRNA gene of bacterial (and archaeal) genomes, which is roughly 1,500 base pairs long.

Run the command below. The input is the concatenated fastq file of the sequencing run (meaning that all the fastq files generated during the sequencing experiment were combined into one file), called "Zymo_16S_SRR25400687.fastq". The output is a directory that NanoPlot will fill with various files visualizing the sequencing data; we will supply a name for this output directory, "Zymo_16S_SRR25400687_Nanoplot".

In [None]:
NanoPlot --fastq Zymo_16S_SRR25400687.fastq -o Zymo_16S_SRR25400687_Nanoplot -f png -t 10
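
If you plan to run this on more than one file, one option is to store the input name in a variable and build the output folder name from it, so the two always match. The sketch below uses the same flags as the command above and only launches NanoPlot if both the input file and the program can be found; the variable names are made up for illustration.

```bash
input="Zymo_16S_SRR25400687.fastq"
outdir="${input%.fastq}_Nanoplot"   # strip the .fastq ending, then add a suffix

# Only launch the run if the input file exists and NanoPlot is installed
if [ -f "$input" ] && command -v NanoPlot >/dev/null 2>&1; then
    NanoPlot --fastq "$input" -o "$outdir" -f png -t 10
else
    echo "Skipping: need both $input and NanoPlot; output would go to $outdir"
fi
```

Changing the first line is then all it takes to point the same command at a different fastq file.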


Now let's enter our output folder Zymo_16S_SRR25400687_Nanoplot within the data folder and list the contents:

In [39]:
cd ~/data/Zymo_16S_SRR25400687_Nanoplot
ls

LengthvsQualityScatterPlot_dot.html
LengthvsQualityScatterPlot_dot.png
LengthvsQualityScatterPlot_kde.html
LengthvsQualityScatterPlot_kde.png
NanoPlot-report.html
NanoPlot_20230909_2225.log
NanoStats.txt
Non_weightedHistogramReadlength.html
Non_weightedHistogramReadlength.png
Non_weightedLogTransformed_HistogramReadlength.html
Non_weightedLogTransformed_HistogramReadlength.png
WeightedHistogramReadlength.html
WeightedHistogramReadlength.png
WeightedLogTransformed_HistogramReadlength.html
WeightedLogTransformed_HistogramReadlength.png
Yield_By_Length.html
Yield_By_Length.png


Let's take a look at the general stats file:

In [40]:
cat NanoStats.txt

General summary:         
Mean read length:                1,365.0
Mean read quality:                   9.4
Median read length:              1,502.0
Median read quality:                10.7
Number of reads:               500,000.0
Read length N50:                 1,507.0
STDEV read length:                 381.3
Total bases:               682,499,554.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:	496854 (99.4%) 678.1Mb
>Q7:	468390 (93.7%) 644.1Mb
>Q10:	306987 (61.4%) 440.4Mb
>Q12:	129272 (25.9%) 189.0Mb
>Q15:	1516 (0.3%) 2.1Mb
Top 5 highest mean basecall quality scores and their read lengths
1:	18.4 (205)
2:	18.3 (163)
3:	18.2 (168)
4:	17.9 (156)
5:	17.8 (149)
Top 5 longest reads and their mean basecall quality score
1:	39751 (4.0)
2:	4089 (5.2)
3:	3853 (4.0)
4:	3815 (6.5)
5:	3788 (4.2)


Looks like the mean and median read lengths are good: at 1,365 and 1,502 base pairs, they are close to the ~1,500 base pairs we expect for full-length 16S. The Q scores (mean 9.4, median 10.7) are also quite good for R9 flow cells run on a MinION.
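
Two of these numbers are easy to reproduce by hand, which helps in understanding what they mean. The mean read length is just total bases divided by number of reads (here 682,499,554 / 500,000, which is about 1,365). The N50 is the read length at which, going from the longest read downward, you have covered half of all the bases. The sketch below computes both on five made-up reads; it illustrates the definitions and is not NanoPlot's own code.

```bash
# Five made-up reads with lengths 10, 8, 6, 4 and 2 (for illustration only)
demo_dir=$(mktemp -d)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n@r3\nACGTAC\n+\nIIIIII\n@r4\nACGT\n+\nIIII\n@r5\nAC\n+\nII\n' > "$demo_dir/demo.fastq"

# Pull out the length of every sequence line (line 2 of each 4-line record)
awk 'NR % 4 == 2 { print length($0) }' "$demo_dir/demo.fastq" > "$demo_dir/lengths.txt"

# Mean read length = total bases / number of reads
mean_len=$(awk '{ total += $1 } END { printf "%.1f", total / NR }' "$demo_dir/lengths.txt")

# N50: sort longest first, then walk down until half of all bases are covered
n50=$(sort -rn "$demo_dir/lengths.txt" | awk '
    { len[NR] = $1; total += $1 }
    END {
        for (i = 1; i <= NR; i++) {
            running += len[i]
            if (running >= total / 2) { print len[i]; exit }
        }
    }')

echo "Mean read length: $mean_len"
echo "Read length N50:  $n50"
```

With these five reads there are 30 bases in total, so the mean is 6.0, and the two longest reads (10 + 8 = 18 bases) already cover half of the data, giving an N50 of 8.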

Now, navigate to our results folder in the left-hand file display, and let's look at some of the results we just generated together.