### WIMP (What's in My Pot?) - Oxford Nanopore's metagenomic classifier.
    

In [None]:
# Import all the modules that we will need
import os, h5py, subprocess
import numpy as np
import pandas
%load_ext rpy2.ipython
from IPython.core.display import display, HTML

from ete3 import NCBITaxa
ncbi = NCBITaxa()

In [None]:
# Now we need to set our directories:
HOME_DIR = os.environ.get('PORECAMPAU_ANALYSIS_PATH') + "/"
PORECAMPAU_DATA_DIR = os.environ.get('PORECAMPAU_DATA_PATH') + "/"

LAKE_HILLIER_FAST5_DIR = PORECAMPAU_DATA_DIR + "fast5/"
METAGENOMICS_DIR = PORECAMPAU_DATA_DIR + "metagenomics/"

LOCAL_LAKE_HILLIER_DIR = HOME_DIR + "lake_hillier/"
    
# Create the local LAKE HILLIER DIRECTORY
if not os.path.isdir(LOCAL_LAKE_HILLIER_DIR):
    os.mkdir(LOCAL_LAKE_HILLIER_DIR)

This dataset has been returned from the Metrichor cloud.
Alongside the fastq, mux and hairpin attributes of normal 2D reads, each file also contains a classification summary attribute.
This holds the taxonomic ID that was assigned to the read.

Let's extract that information from each fast5 file in the list and write it to a tsv file
(a text file with two tab-separated columns: one for the read name, another for the taxonomic ID).
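
To make the file layout concrete, here is a minimal sketch of the two-column format. The read names, taxids and helper functions (`to_tsv_line`, `from_tsv_line`) are made up for illustration and are not part of the notebook.

```python
# Hypothetical (read name, taxid) pairs, invented purely for illustration.
rows = [("read_001.fast5", 2242), ("read_002.fast5", 1883)]


def to_tsv_line(read_name, taxid):
    """Format one read as 'read_name<TAB>taxid<NEWLINE>'."""
    return "%s\t%s\n" % (read_name, taxid)


def from_tsv_line(line):
    """Split one tsv line back into a (read_name, taxid) pair."""
    read_name, taxid = line.rstrip("\n").split("\t")
    return read_name, int(taxid)


# Build the tsv text, then parse it back into the original pairs.
tsv_text = "".join(to_tsv_line(name, taxid) for name, taxid in rows)
parsed = [from_tsv_line(line) for line in tsv_text.splitlines()]
```

Keeping the separator a bare tab, with no padding spaces, means each line splits cleanly back into its two columns.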

In [None]:
fast5_files = [LAKE_HILLIER_FAST5_DIR + fast5_file for fast5_file in os.listdir(LAKE_HILLIER_FAST5_DIR)
                if fast5_file.endswith(".fast5")]
hdf5_workflow_path = "Analyses/Classification_000/Summary"
hdf5_taxid_path = hdf5_workflow_path + "/classification_2d"
tsv_file = LOCAL_LAKE_HILLIER_DIR + "my_taxid_tsv.txt"

In [None]:
# Open the tsv handler
tsv_handler = open(tsv_file, 'w')
for fast5_file in fast5_files:
    # The try block lets us catch errors, so the loop does not stop when
    # we encounter a read that lacks the classification summary attribute.
    try:
        # Read the fast5 file into a handler variable ('r' is for read).
        # The with statement closes the file for us, even when we 'continue'.
        with h5py.File(fast5_file, 'r') as f:
            # list() is needed because values() cannot be indexed in Python 3.
            workflow_status = list(f[hdf5_workflow_path].attrs.values())[0]
            if not workflow_status == "Workflow successful":
                continue  # No sequence classified, let's move onto the next sequence.
            # We don't need an else statement since we've used 'continue' above.
            taxid = list(f[hdf5_taxid_path].attrs.values())[0]
        # Write file name and taxid to our tsv file.
        # The \t means tab while the \n means new line.
        tsv_handler.write("%s\t%s\n" % (fast5_file, taxid))
    except KeyError:
        print("%s does not have the classification attribute" % fast5_file)

# Close the file. Programmers who don't close files are as bad as housemates who don't
# shut the front door of your house.
tsv_handler.close()

Great work! Now open this tsv file with the file manager to get a grasp of what is happening.

There were 100 fast5 files in the folder but only 70 lines here. It seems that some of these reads weren't successfully classified against the WIMP database.
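
One way to see which reads are missing is to compare the fast5 file names against the first column of the tsv. The sketch below uses invented file names and a hypothetical helper, `unclassified_reads`, purely to show the idea.

```python
def unclassified_reads(all_files, tsv_lines):
    """Return the files in all_files that have no row in tsv_lines, sorted."""
    # The first tab-separated column of each non-empty line is the read name.
    classified = set(line.split("\t")[0].strip()
                     for line in tsv_lines if line.strip())
    return sorted(set(all_files) - classified)


# Hypothetical file names and tsv rows, invented for illustration only.
all_files = ["a.fast5", "b.fast5", "c.fast5"]
tsv_lines = ["a.fast5\t2242\n", "c.fast5\t1883\n"]
missing = unclassified_reads(all_files, tsv_lines)
```

In the notebook, `fast5_files` and the lines read from `tsv_file` could be passed in place of the example lists.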

Now let's grab the tsv from the entire set of pass reads. We might be able to make use of that. 

In [None]:
# Let's get the full array and load that into python.
# We can then pass that array into R to return a summary of the list.
full_tsv = METAGENOMICS_DIR
full_tsv += "lake_hillier_taxids_by_WIMP.tsv"

taxids = np.loadtxt(full_tsv, delimiter="\t", usecols=[1])

In [None]:
%%R -i taxids -o taxids_summary
# Generates a table of the taxids and how many reads matched each id.
taxids_summary <- as.data.frame(table(taxids))

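For readers less familiar with R, `table()` simply counts how often each value occurs. A rough pure-Python equivalent, using `collections.Counter` on made-up taxids rather than the real data, might look like this:

```python
from collections import Counter

# Hypothetical taxids, one per read, invented for illustration only.
example_taxids = [2242, 1883, 2242, 2242, 1883, 562]

# Count how many reads were assigned to each taxid.
taxid_counts = Counter(example_taxids)

# List (taxid, count) pairs, most frequent first.
most_common = taxid_counts.most_common()
```

This is only a sketch of the counting step; the notebook itself keeps the R data frame, whose `taxids` and `Freq` columns are used in the next cell.
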
In [None]:
# We can then use the ete3 package to generate the lineage for each taxid.

# Write the summarised version to file.
lineage_file = LOCAL_LAKE_HILLIER_DIR + "lake_hillier_lineages.tsv"
lineage_file_h = open(lineage_file, 'w')

for index, row in taxids_summary.iterrows():
    taxid = row['taxids']
    freq = str(row['Freq'])  # Freq is the column name automatically generated by R table
    # Generate the lineage for the given taxid
    # Returns a list of taxids.
    try:
        lineage = ncbi.get_lineage(taxid)
    except ValueError:  # Can occur with different NCBI database versions
        print("Warning: %s not found in database" % taxid)
        continue  # Skip this taxid; otherwise 'lineage' would be undefined or stale.

    names = ncbi.get_taxid_translator(lineage)
    lineage_names = [names[lineage_taxid] for lineage_taxid in lineage]
    # We slice with [1:] so that the 'root' node is not written in the lineage.
    lineage_file_h.write(freq + "\t" + "\t".join(lineage_names[1:]) + "\n")

lineage_file_h.close()

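Each line written above is a count followed by the lineage levels, all separated by tabs. That is the layout `ktImportText` reads: a quantity first, then the hierarchy. The sketch below rebuilds one such line from an invented count and lineage; `krona_line` is a hypothetical helper, not part of the notebook.

```python
def krona_line(count, lineage_names):
    """Build one row: the count, then each lineage level, tab-separated."""
    return "\t".join([str(count)] + list(lineage_names)) + "\n"


# Hypothetical count and lineage, invented for illustration only.
example = krona_line(12, ["cellular organisms", "Bacteria", "Proteobacteria"])
```
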
Have a look at the lineage_file before moving on, to get an appreciation of the input that Krona expects. Then run the following command to generate a Krona plot of Lake Hillier.

In [None]:
krona_output_file = LOCAL_LAKE_HILLIER_DIR + "krona_output_lake_hillier.html"
# We want the name variable to go onto the command line with the double quotes,
# so we wrap it in single quotes first.
krona_output_name = '"Lake Hillier - Sediment, MinION R7.3 Q2 2015"'

ktImportCommand = "ktImportText -o %s -n %s %s" % (krona_output_file, krona_output_name, lineage_file)
# stderr=subprocess.STDOUT merges stderr into stdout, so this holds both streams.
kt_output = subprocess.check_output(ktImportCommand, shell=True, stderr=subprocess.STDOUT)
print("ktImportText output = %s" % kt_output)

krona_output_file_quotes = '"' + krona_output_file.replace(HOME_DIR, "../") + '"'
print("Add the following to the src variable in the html plot")
print(krona_output_file_quotes)

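An alternative to quoting by hand is to pass the command as a list without `shell=True`, so that each element reaches the program as exactly one argument. The sketch below shows this with a stand-in child process that echoes its argument. It does not call `ktImportText`, and the plot name is a shortened, hypothetical example.

```python
import subprocess
import sys

# Hypothetical plot name containing spaces, invented for illustration.
plot_name = "Lake Hillier - Sediment"

# Stand-in command: a tiny Python child that prints its first argument.
# In the notebook this would be ktImportText with its -o and -n options.
command = [sys.executable, "-c", "import sys; print(sys.argv[1])", plot_name]

# With a list and no shell=True, no extra quoting of plot_name is needed.
output = subprocess.check_output(command).decode().strip()
```
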
In [None]:
%%HTML -i krona_output_file_quotes
<iframe width="200%" height="700" src="../lake_hillier/krona_output_lake_hillier.html"></iframe>