<a href="https://colab.research.google.com/github/epi2me-labs/tutorials/blob/master/Analysis_of_EPI2ME_16S_CSV_Output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Analysis of EPI2ME 16S CSV Output</h1>

The EPI2ME 16S (and WIMP) analyses allow a summary table of the results to be downloaded. However, this table does not contain full lineage information, so it cannot be used directly to create the Sankey tree diagrams that EPI2ME displays in its web interface.

The following short code fragments demonstrate how to decorate the EPI2ME data table with lineage information and aggregate the read counts.


An example 16S report can be found [here](https://epi2me.nanoporetech.com/workflow_instance/242318?token=6FF8C4B6-D055-11EA-A7B7-23477BC332F9). The code box below will download the corresponding CSV output file.

In [5]:
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)

filename = "242318_classification_16s_barcode-v1.csv"

!echo "Downloading sample data"
!wget -q $site/misc/$filename \
    && echo "Download complete" || echo "Download failed"
!echo
!head $filename || echo "File not readable"

Downloading sample data
Download complete

readid,runid,barcode,exit_status,taxid,species_taxid,species,accuracy,genus_taxid,genus,lca
0003c930-069c-4c63-a8df-1a0ca50992c4,711db73c212e422fad11ca3c0ed596fc,NA,Classification successful,404937,404937,Anoxybacillus thermarum,93.88,150247,Anoxybacillus,0
000d4674-7d99-4fed-8ba6-ab5781ab95b7,711db73c212e422fad11ca3c0ed596fc,NA,Classification successful,43657,43657,Pseudoalteromonas luteoviolacea,93.37,53246,Pseudoalteromonas,0
00248183-faee-4f5b-a9fe-feb8e241db92,711db73c212e422fad11ca3c0ed596fc,NA,Classification successful,1855725,1855725,Mucilaginibacter antarcticus,95.5,423349,Mucilaginibacter,0
0029656e-5028-48e5-8822-0ba770d9ccf6,711db73c212e422fad11ca3c0ed596fc,NA,Classification successful,878213,878213,Actinomycetospora iriomotensis,92.25,402649,Actinomycetospora,0
002b9804-6cd0-42cd-b86e-6c71cf01bf4e,711db73c212e422fad11ca3c0ed596fc,NA,Classification successful,2027860,2027860,Mucilaginibacter rubeus,93.07,423349,Mucilaginibacter,0
0

In order to decorate the file with lineage information we will use the [taxonkit](https://bioinf.shenwei.me/taxonkit/) tool. The code box below will download this tool, along with `csvkit`, which we will use to convert the EPI2ME file from a comma-separated file to a tab-separated file (the format taxonkit requires).

In [None]:
!pip install csvkit
!wget https://github.com/shenwei356/taxonkit/releases/download/v0.6.0/taxonkit_linux_amd64.tar.gz
!tar -xzvf taxonkit_linux_amd64.tar.gz

Taxonkit requires the NCBI taxonomy database to function, so let's download and decompress that:

In [None]:
!mkdir taxdump
%cd taxdump
!wget http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
!tar -xzvf taxdump.tar.gz
%cd ..

We now have all we need to get going:
- the EPI2ME results
- taxonkit
- the NCBI taxonomy

Let's first write a function to read in the EPI2ME file and perform several conversions on it:

In [13]:
import pandas as pd
def parse_epi2me(fname):
    fnametsv = fname + '.tsv'
    print("Converting to TSV")
    !csvformat -T $fname > $fnametsv
    print("Running lineage")
    !./taxonkit lineage --data-dir taxdump $fnametsv -i 5 > epi2me2lineage.tmp
    print("Running reformat")
    !./taxonkit reformat --data-dir taxdump -i 12 epi2me2lineage.tmp > epi2me2lineage
    print("Munging data")
    epi2me = pd.read_csv("epi2me2lineage", sep='\t')
    # rename some columns so they don't clash with the lineage info;
    # the last two columns are the ones appended by taxonkit lineage and reformat
    epi2me.columns = epi2me.columns[:-2].to_list() + ['_lineage', 'lineage']
    epi2me = epi2me.rename(columns={'species': 'species_name', 'genus':'genus_name'})
    # extract the lineage info into its own columns in the table
    lineage = epi2me['lineage'].str.split(";", expand=True)
    lineage.columns = ['kingdom', 'phylum', 'class', 'order', 'family','genus', 'species']
    epi2me = pd.concat((epi2me, lineage), axis=1)
    return epi2me
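
To make the lineage-splitting step inside `parse_epi2me` concrete, here is a minimal sketch that applies the same `str.split` and `concat` calls to a small hand-made table. The `toy` table, its `read-1`/`read-2` identifiers and the `RANKS` constant are illustrative assumptions; the two lineage strings are taken from the example results shown in this notebook.

```python
import pandas as pd

# Rank names in the order they appear in the reformatted lineage string
RANKS = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

# Hypothetical two-row table standing in for the taxonkit output
toy = pd.DataFrame({
    'readid': ['read-1', 'read-2'],
    'lineage': [
        'Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;'
        'Anoxybacillus;Anoxybacillus thermarum',
        'Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;'
        'Pseudoalteromonadaceae;Pseudoalteromonas;Pseudoalteromonas luteoviolacea',
    ],
})

# Split the semicolon-delimited lineage into one column per rank
split = toy['lineage'].str.split(';', expand=True)
split.columns = RANKS

# Attach the rank columns to the original rows
toy_ranks = pd.concat((toy, split), axis=1)
print(toy_ranks[['readid', 'phylum', 'genus']])
```

Assigning `split.columns` only works when every lineage string yields exactly seven fields, which is why the function relies on `taxonkit reformat` to produce a fixed-rank lineage first.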

And run the function on our example EPI2ME results:

In [14]:
epi2me = parse_epi2me(filename)
display(epi2me.head())

Converting to TSV
Running lineage
Running reformat
Munging data


Unnamed: 0,readid,runid,barcode,exit_status,taxid,species_taxid,species_name,accuracy,genus_taxid,genus_name,lca,_lineage,lineage,kingdom,phylum,class,order,family,genus,species
0,0003c930-069c-4c63-a8df-1a0ca50992c4,711db73c212e422fad11ca3c0ed596fc,,Classification successful,404937.0,404937.0,Anoxybacillus thermarum,93.88,150247.0,Anoxybacillus,0.0,cellular organisms;Bacteria;Terrabacteria grou...,Bacteria;Firmicutes;Bacilli;Bacillales;Bacilla...,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Anoxybacillus,Anoxybacillus thermarum
1,000d4674-7d99-4fed-8ba6-ab5781ab95b7,711db73c212e422fad11ca3c0ed596fc,,Classification successful,43657.0,43657.0,Pseudoalteromonas luteoviolacea,93.37,53246.0,Pseudoalteromonas,0.0,cellular organisms;Bacteria;Proteobacteria;Gam...,Bacteria;Proteobacteria;Gammaproteobacteria;Al...,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Pseudoalteromonadaceae,Pseudoalteromonas,Pseudoalteromonas luteoviolacea
2,00248183-faee-4f5b-a9fe-feb8e241db92,711db73c212e422fad11ca3c0ed596fc,,Classification successful,1855725.0,1855725.0,Mucilaginibacter antarcticus,95.5,423349.0,Mucilaginibacter,0.0,cellular organisms;Bacteria;FCB group;Bacteroi...,Bacteria;Bacteroidetes;Sphingobacteriia;Sphing...,Bacteria,Bacteroidetes,Sphingobacteriia,Sphingobacteriales,Sphingobacteriaceae,Mucilaginibacter,Mucilaginibacter antarcticus
3,0029656e-5028-48e5-8822-0ba770d9ccf6,711db73c212e422fad11ca3c0ed596fc,,Classification successful,878213.0,878213.0,Actinomycetospora iriomotensis,92.25,402649.0,Actinomycetospora,0.0,cellular organisms;Bacteria;Terrabacteria grou...,Bacteria;Actinobacteria;Actinobacteria;Pseudon...,Bacteria,Actinobacteria,Actinobacteria,Pseudonocardiales,Pseudonocardiaceae,Actinomycetospora,Actinomycetospora iriomotensis
4,002b9804-6cd0-42cd-b86e-6c71cf01bf4e,711db73c212e422fad11ca3c0ed596fc,,Classification successful,2027860.0,2027860.0,Mucilaginibacter rubeus,93.07,423349.0,Mucilaginibacter,0.0,cellular organisms;Bacteria;FCB group;Bacteroi...,Bacteria;Bacteroidetes;Sphingobacteriia;Sphing...,Bacteria,Bacteroidetes,Sphingobacteriia,Sphingobacteriales,Sphingobacteriaceae,Mucilaginibacter,Mucilaginibacter rubeus


With this data table we can now extract counts of reads at any of the taxonomic ranks, for example:

In [15]:
epi2me['phylum'].value_counts()

Proteobacteria           181030
Actinobacteria           123767
Firmicutes                87991
Bacteroidetes             56287
Euryarchaeota             20693
Tenericutes                5217
Spirochaetes               3758
Deinococcus-Thermus        3737
Crenarchaeota              3019
Cyanobacteria              2782
Thermotogae                1979
Verrucomicrobia            1886
                           1505
Acidobacteria              1471
Fusobacteria               1434
Planctomycetes             1352
Chloroflexi                1218
Aquificae                  1008
Chlamydiae                  931
Synergistetes               519
Deferribacteres             439
Thermodesulfobacteria       430
Balneolaeota                404
Nitrospirae                 287
Chlorobi                    282
Fibrobacteres               196
Chrysiogenetes              179
Thaumarchaeota              166
Gemmatimonadetes            162
Ignavibacteriae             137
Lentisphaerae               122
Rhodothe
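
A Sankey tree needs more than per-rank totals: it needs the number of reads flowing from each taxon to its children at the next rank. The sketch below shows one possible way to aggregate such parent-to-child counts with `groupby`. The function name `sankey_links`, the `source`/`target`/`reads` column names and the toy table are assumptions made for illustration; this is not how EPI2ME itself builds its diagrams.

```python
import pandas as pd

RANKS = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

def sankey_links(table, ranks=RANKS):
    """Count reads flowing between each pair of adjacent ranks."""
    links = []
    for parent, child in zip(ranks[:-1], ranks[1:]):
        counts = (
            table.groupby([parent, child])
            .size()                        # number of reads per (parent, child) pair
            .reset_index(name='reads')
            .rename(columns={parent: 'source', child: 'target'})
        )
        links.append(counts)
    return pd.concat(links, ignore_index=True)

# Hypothetical three-read table restricted to three ranks for brevity
toy = pd.DataFrame({
    'kingdom': ['Bacteria', 'Bacteria', 'Bacteria'],
    'phylum': ['Firmicutes', 'Firmicutes', 'Proteobacteria'],
    'class': ['Bacilli', 'Bacilli', 'Gammaproteobacteria'],
})
links = sankey_links(toy, ranks=['kingdom', 'phylum', 'class'])
print(links)
```

With the full table this would be called as `sankey_links(epi2me)`. Note that names can repeat across ranks (Actinobacteria appears as both a phylum and a class in the example results), so a real diagram would probably want rank-prefixed node labels.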

### Some notes

The EPI2ME table provides a `taxid`, a `species_taxid` and a `genus_taxid`. EPI2ME performs some sanity checking on its classifications: if the top hits of a read are from different genera, the `taxid` will be empty, that is to say the read is "Unclassified". The code above uses the value of the `taxid` field to derive the lineage information.
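
Because the lineage is derived from `taxid`, it can be useful to count or set aside the unclassified reads before aggregating. The following is a minimal sketch on a toy table, assuming that an empty `taxid` field is read by pandas as a missing value (`NaN`), which is the default behaviour of `pd.read_csv` for empty numeric fields. The `toy` table and its read identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical table: the second read has no taxid, i.e. it is "Unclassified"
toy = pd.DataFrame({
    'readid': ['read-1', 'read-2', 'read-3'],
    'taxid': [404937.0, None, 43657.0],
})

# Boolean mask marking reads with a missing taxid
unclassified = toy['taxid'].isna()
print("Unclassified reads:", int(unclassified.sum()))

# Keep only the reads that received a taxid
classified = toy[~unclassified]
print(classified)
```

The same mask could be computed on the full table with `epi2me['taxid'].isna()`.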