# Summary:

This notebook is for visualizing antibiotic resistance gene tables generated by [ABRicate](https://github.com/tseemann/abricate) and [SRST2](https://github.com/katholt/srst2). 

    
# Example Use Case:

In this example, the complete Shakya et al. 2013 metagenome is compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering and assembly with SPAdes or MEGAHIT. The datasets are named according to their metagenome content, relative degree of read filtering, and, where appropriate, the assembler used. ABRicate is appropriate for analysis of antibiotic resistance genes (ARGs) in contigs, while SRST2 is useful for analysis of ARGs in reads.

* SRR606249 = Accession number for the complete Shakya et al. 2013 metagenome
* subset50 = 50% of the complete Shakya et al. 2013 metagenome
* subset25 = 25% of the complete Shakya et al. 2013 metagenome
* subset10 = 10% of the complete Shakya et al. 2013 metagenome
* pe.trim2 = Conservative read filtering
* pe.trim30 = Aggressive read filtering
* megahit = MEGAHIT assembly
* spades = SPAdes assembly 
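
The naming scheme above can be unpacked programmatically, which helps when grouping results by subset, filtering level, or assembler. The following is a minimal sketch: `parse_dataset_name` and its regular expression are hypothetical helpers, not part of the original notebook, and the pattern assumes filenames shaped like the ones shown in the outputs further down (for example `SRR606249_subset25_1.trim30_megahit_abricate.tab`).

```python
import re

# Hypothetical pattern: accession, optional subset, read-file number,
# trimming level, and optional assembler, in that order.
PATTERN = re.compile(
    r"^(?P<accession>SRR\d+)"
    r"(?:_(?P<subset>subset\d+))?"
    r"_\d+\.(?P<trim>trim\d+)"
    r"(?:_(?P<assembler>megahit|spades))?"
)

def parse_dataset_name(filename):
    """Split a result filename into its naming components (None if no match)."""
    match = PATTERN.match(filename.split("/")[-1])
    if match is None:
        return None
    parts = match.groupdict()
    # Files without a subset tag come from the complete metagenome.
    parts["subset"] = parts["subset"] or "complete"
    return parts

print(parse_dataset_name("SRR606249_subset25_1.trim30_megahit_abricate.tab"))
```

Read-based results have no assembler in their name, so `assembler` is `None` for those files.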


# Objectives:

* Create table with all of the genes found 
* Count the total number of genes found for each dataset
* Count the number of unique genes found per dataset
* Compare unique genes found using a presence/absence table
* Compare results from reads and assemblies

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
#from pickle import load
import pandas as pd
import glob
#%matplotlib inline

# Analysis of ABRicate results

In [2]:
def concat_files(filenames):
    x = glob.glob(filenames)
    list_of_dfs = [pd.read_table(filename, header = 1) for filename in x]
    for dataframe, filename in zip(list_of_dfs, x):
        dataframe['filename'] = filename
    combined_df = pd.concat(list_of_dfs, ignore_index=True)
    return combined_df
concat_files("*tab")

Unnamed: 0,#FILE,SEQUENCE,START,END,GENE,COVERAGE,COVERAGE_MAP,GAPS,%COVERAGE,%IDENTITY,DATABASE,ACCESSION,filename
0,Processing: /data/SRR606249_1.trim30.fq.gz_spa...,,,,,,,,,,,,SRR606249_1.trim30_spades_abricate.tab
1,Found 15 genes in /data/SRR606249_1.trim30.fq....,,,,,,,,,,,,SRR606249_1.trim30_spades_abricate.tab
2,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_1077_length_38551_cov_20.554681,14537.0,14715.0,vat(A)_1,346-524/660,.......=/====..,2/2,26.97,75.556,resfinder,L07778,SRR606249_1.trim30_spades_abricate.tab
3,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_1288_length_30418_cov_6.642328,13224.0,13332.0,oqxB_1,1168-1276/3153,.....==........,0/0,3.46,83.486,resfinder,EU370913,SRR606249_1.trim30_spades_abricate.tab
4,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_14_length_460229_cov_14.415215,103641.0,105137.0,lsa(A)_2,1-1497/1497,===============,0/0,100.00,99.933,resfinder,AY58982,SRR606249_1.trim30_spades_abricate.tab
5,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_16351_length_1140_cov_6.075576,1.0,581.0,blaOXA-181_1,123-703/798,..============.,0/0,72.81,81.928,resfinder,HM992946,SRR606249_1.trim30_spades_abricate.tab
6,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_16931_length_1080_cov_2.175610,904.0,1025.0,aph(6)-Ic_1,454-575/801,........===....,0/0,15.23,78.689,resfinder,X01702,SRR606249_1.trim30_spades_abricate.tab
7,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_26607_length_447_cov_6.528061,35.0,447.0,oqxB_1,2659-3071/3153,......../...===,4/4,13.04,75.422,resfinder,EU370913,SRR606249_1.trim30_spades_abricate.tab
8,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_27739_length_407_cov_3.707386,15.0,137.0,vat(F)_1,99-221/666,..===..........,0/0,18.47,77.236,resfinder,AF170730,SRR606249_1.trim30_spades_abricate.tab
9,/data/SRR606249_1.trim30.fq.gz_spades_output/c...,NODE_28254_length_391_cov_2.407738,1.0,169.0,msr(D)_2,1296-1464/1464,.............==,0/0,11.54,100.000,resfinder,AF274302,SRR606249_1.trim30_spades_abricate.tab


In [3]:
# Calculate the total number of genes reported by ABRicate for each file
def calc_total_genes():
    combined_df = concat_files('*tab')
    # count() ignores rows where GENE is missing
    x = combined_df.groupby('filename').GENE.count()
    y = x.to_frame()
    bingo = y.sort_values('GENE',ascending=False)
    return bingo
calc_total_genes()

Unnamed: 0_level_0,GENE
filename,Unnamed: 1_level_1
SRR606249_1.trim30_spades_abricate.tab,15
SRR606249_subset10_1.trim2_spades_abricate.tab,15
SRR606249_subset25_1.trim2_spades_abricate.tab,15
SRR606249_1.trim2_spades_abricate.tab,14
SRR606249_subset25_1.trim30_megahit_abricate.tab,14
SRR606249_subset25_1.trim30_spades_abricate.tab,14
SRR606249_1.trim30_megahit_abricate.tab,13
SRR606249_subset25_1.trim2_megahit_abricate.tab,13
SRR606249_subset10_1.trim30_spades_abricate.tab,12
SRR606249_subset50_1.trim2_spades_abricate.tab,12


In [4]:
# Calculate the number of unique genes reported by ABRicate for each file

def calculate_unique_genes():
    combined_df = concat_files("*tab")
    # nunique() counts each distinct gene name once per file
    x = combined_df.groupby('filename').GENE.nunique()
    y = x.to_frame()
    bingo = y.sort_values('GENE',ascending=False)
    return bingo
calculate_unique_genes()

Unnamed: 0_level_0,GENE
filename,Unnamed: 1_level_1
SRR606249_1.trim2_spades_abricate.tab,12
SRR606249_1.trim30_spades_abricate.tab,12
SRR606249_subset25_1.trim30_spades_abricate.tab,12
SRR606249_subset25_1.trim2_spades_abricate.tab,11
SRR606249_subset25_1.trim30_megahit_abricate.tab,11
SRR606249_1.trim30_megahit_abricate.tab,10
SRR606249_subset25_1.trim2_megahit_abricate.tab,10
SRR606249_subset50_1.trim2_spades_abricate.tab,10
SRR606249_1.trim2_megahit_abricate.tab,9
SRR606249_subset50_1.trim30_spades_abricate.tab,9
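
The two summaries above differ only in `count` versus `nunique`: the first counts every hit, the second counts each gene name once per file. As a sketch, both can be computed in a single pass. The toy table below is invented for illustration and does not come from the ABRicate results.

```python
import pandas as pd

# Invented example data: one gene is hit twice in a.tab, b.tab has no gene.
toy = pd.DataFrame({
    "filename": ["a.tab", "a.tab", "a.tab", "b.tab"],
    "GENE": ["oqxB_1", "oqxB_1", "vat(A)_1", None],
})

# One groupby, two aggregations: total hits and distinct genes per file.
summary = toy.groupby("filename")["GENE"].agg(["count", "nunique"])
print(summary)
```

A gap between the two columns for a file means the same gene was reported on more than one contig or region.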


In [5]:
# Create a concatenated table from all ABRicate result files
combined_df = concat_files('*tab')
# Drop rows that contain any NA values, such as the "Processing:" and "Found N genes" lines
combined_df.dropna(axis=0, inplace=True)
g = combined_df.groupby('GENE')
ug = list(set(combined_df['GENE']))

a = []
for GENE in ug:
    gene_group = g.get_group(GENE)
    if len(gene_group['filename'])>1:
        a.append(gene_group[['filename', 'GENE']])
#a

[                                             filename      GENE
 49     SRR606249_subset25_1.trim2_spades_abricate.tab  otr(C)_1
 122   SRR606249_subset25_1.trim30_spades_abricate.tab  otr(C)_1
 146  SRR606249_subset25_1.trim30_megahit_abricate.tab  otr(C)_1,
                                             filename      GENE
 10            SRR606249_1.trim30_spades_abricate.tab  tet(O)_3
 27   SRR606249_subset50_1.trim30_spades_abricate.tab  tet(O)_3
 181   SRR606249_subset50_1.trim2_spades_abricate.tab  tet(O)_3
 220            SRR606249_1.trim2_spades_abricate.tab  tet(O)_3,
                                             filename      GENE
 9             SRR606249_1.trim30_spades_abricate.tab  msr(D)_2
 11            SRR606249_1.trim30_spades_abricate.tab  msr(D)_2
 26   SRR606249_subset50_1.trim30_spades_abricate.tab  msr(D)_2
 60    SRR606249_subset25_1.trim2_spades_abricate.tab  msr(D)_2
 86            SRR606249_1.trim2_megahit_abricate.tab  msr(D)_2
 128  SRR606249_subset25_1.trim30_

In [6]:
from collections import defaultdict

gene_filenames = defaultdict(list)

for line in a:
    gene_filenames[line['GENE'].iloc[0]].extend(line['filename'].tolist())

In [8]:
filenames = set()
for files in gene_filenames.values():
    filenames.update(files)

In [9]:
filenames = list(filenames)

In [10]:
data = {}
for gene, files in gene_filenames.items():
    data[gene] = [file in files for file in filenames]
dense_df = pd.DataFrame.from_dict(data, orient='index', columns=filenames)
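
For comparison, a presence/absence table can also be built directly from the long table with `pd.crosstab`. This is a sketch of an alternative, not the notebook's method: unlike the cells above, it keeps genes that appear in only one row, and the toy data is invented for illustration.

```python
import pandas as pd

# Invented long-format hits: one row per (gene, file) observation.
hits = pd.DataFrame({
    "GENE": ["tet(O)_3", "tet(O)_3", "msr(D)_2", "msr(D)_2", "msr(D)_2"],
    "filename": ["x.tab", "y.tab", "x.tab", "x.tab", "z.tab"],
})

# crosstab counts hits per gene and file; comparing with 0 turns
# the counts into a boolean presence/absence matrix.
presence = pd.crosstab(hits["GENE"], hits["filename"]) > 0
print(presence)
```

Dropping the `> 0` comparison keeps the hit counts, which may be useful when a gene is found more than once in the same assembly.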

In [12]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [13]:
import numpy as np
import pandas as pd
from clustergrammer_widget import *
np.version.version

'1.14.5'

In [14]:
# initialize network object
net = Network(clustergrammer_widget)
# load dataframe
net.load_df(dense_df)
# cluster using default parameters
net.cluster(enrichrgram=False)
# make the visualization
net.widget()

clustergrammer_widget(network='{"row_nodes": [{"name": "otr(C)_1", "ini": 18, "clust": 1, "rank": 3, "rankvar"…

In [15]:
import qgrid

In [16]:
qgrid.show_grid(dense_df, show_toolbar=True)

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

In [None]:
# Analysis of SRST2 results

In [24]:
def concat_txt(filenames):
    x = glob.glob(filenames)
    list_of_dfs = [pd.read_table(filename, header = 0) for filename in x]
    for dataframe, filename in zip(list_of_dfs, x):
        dataframe['filename'] = filename
    combined_df = pd.concat(list_of_dfs, ignore_index=True, sort=True)
    return combined_df
concat_txt("srst2/*results.txt")

Unnamed: 0,DB,LsaA_MLS,MphD_MLS,Sample,TEM-1D_Bla,allele,annotation,clusterid,coverage,depth,diffs,divergence,filename,gene,length,maxMAF,seqid,uncertainty
0,ARGannot.r1,,,SRR606249_subset10,,MphD_1613,no;no;MphD;MLS;NC_017312;2292413-2291580;834,228.0,90.528,3.798,5snp79holes,0.662,srst2/SRR606249_subset10_1.trim2.fq.gz__fullge...,MphD_MLS,834.0,0.25,1613.0,edge0.0
1,ARGannot.r1,,,SRR606249_subset50,,LsaA_298,no;no;LsaA;MLS;AY225127;41-1537;1497,33.0,100.0,17.648,26snp,1.737,srst2/SRR606249_subset50_1.trim30.fq.gz__fullg...,LsaA_MLS,1497.0,0.071,298.0,
2,ARGannot.r1,,,SRR606249_subset50,,MphD_1613,no;no;MphD;MLS;NC_017312;2292413-2291580;834,228.0,100.0,17.692,6snp,0.719,srst2/SRR606249_subset50_1.trim30.fq.gz__fullg...,MphD_MLS,834.0,0.04,1613.0,
3,,LsaA_298*,MphD_1613*,SRR606249_subset50,,,,,,,,,srst2/SRR606249_subset50_1.trim30.fq.gz__genes...,,,,,
4,,,,SRR606249_subset10,,,,,,,,,srst2/SRR606249_subset10_1.trim30.fq.gz__genes...,,,,,
5,,LsaA_298*,MphD_1613*,SRR606249,TEM-116_967*?,,,,,,,,srst2/SRR606249_1.trim30.fq.gz__genes__ARGanno...,,,,,
6,,LsaA_298*,MphD_1613*,SRR606249,TEM-116_967*?,,,,,,,,srst2/SRR606249_1.trim2.fq.gz__genes__ARGannot...,,,,,
7,ARGannot.r1,,,SRR606249_subset25,,LsaA_298,no;no;LsaA;MLS;AY225127;41-1537;1497,33.0,100.0,8.917,26snp,1.737,srst2/SRR606249_subset25_1.trim2.fq.gz__fullge...,LsaA_MLS,1497.0,0.5,298.0,
8,ARGannot.r1,,,SRR606249_subset25,,MphD_1613,no;no;MphD;MLS;NC_017312;2292413-2291580;834,228.0,100.0,9.616,6snp,0.719,srst2/SRR606249_subset25_1.trim2.fq.gz__fullge...,MphD_MLS,834.0,0.167,1613.0,
9,ARGannot.r1,,,SRR606249_subset25,,LsaA_298,no;no;LsaA;MLS;AY225127;41-1537;1497,33.0,100.0,8.01,26snp,1.737,srst2/SRR606249_subset25_1.trim30.fq.gz__fullg...,LsaA_MLS,1497.0,0.125,298.0,
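
The combined table is sparse because the glob `srst2/*results.txt` picks up two kinds of report, whose filenames contain `__fullgenes__` or `__genes__` and whose columns differ. One option, sketched below under that assumption, is to separate the two kinds before concatenating. The sketch uses `fnmatch` on two filenames taken from the outputs in this notebook, so it runs without the files being present.

```python
from fnmatch import fnmatch

# Two filenames as they appear in the SRST2 outputs of this notebook.
names = [
    "srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt",
    "srst2/SRR606249_1.trim2.fq.gz__genes__ARGannot.r1__results.txt",
]

# Split by report type; "__genes__" does not match inside "__fullgenes__".
fullgenes = [n for n in names if fnmatch(n, "srst2/*__fullgenes__*results.txt")]
genes = [n for n in names if fnmatch(n, "srst2/*__genes__*results.txt")]
print(len(fullgenes), len(genes))
```

The same patterns could be passed to `concat_txt` in place of `srst2/*results.txt` to load each report type into its own table.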


In [25]:
# Calculate the total number of genes reported by SRST2 for each file
def calc_total_genes():
    combined_df = concat_txt('srst2/*results.txt')
    # count() ignores rows where gene is missing
    x = combined_df.groupby('filename').gene.count()
    y = x.to_frame()
    bingo = y.sort_values('gene',ascending=False)
    return bingo
calc_total_genes()

Unnamed: 0_level_0,gene
filename,Unnamed: 1_level_1
srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,3
srst2/SRR606249_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,3
srst2/SRR606249_subset25_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset25_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset50_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset50_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset10_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,1
srst2/SRR606249_1.trim2.fq.gz__genes__ARGannot.r1__results.txt,0
srst2/SRR606249_1.trim30.fq.gz__genes__ARGannot.r1__results.txt,0
srst2/SRR606249_subset10_1.trim2.fq.gz__genes__ARGannot.r1__results.txt,0


In [43]:
# Calculate the number of unique genes reported by SRST2 for each file
def calculate_unique_genes():
    combined_df = concat_txt('srst2/*results.txt')
    x = combined_df.groupby('filename').gene.nunique()
    y = x.to_frame()
    bingo = y.sort_values('gene',ascending=False)
    return bingo
calculate_unique_genes()

Unnamed: 0_level_0,gene
filename,Unnamed: 1_level_1
srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,3
srst2/SRR606249_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,3
srst2/SRR606249_subset25_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset25_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset50_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset50_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,2
srst2/SRR606249_subset10_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,1
srst2/SRR606249_1.trim2.fq.gz__genes__ARGannot.r1__results.txt,0
srst2/SRR606249_1.trim30.fq.gz__genes__ARGannot.r1__results.txt,0
srst2/SRR606249_subset10_1.trim2.fq.gz__genes__ARGannot.r1__results.txt,0


In [51]:
# Create a concatenated table from all SRST2 result files
combined_df = concat_txt('srst2/*results.txt')
combined_df

Unnamed: 0,DB,LsaA_MLS,MphD_MLS,Sample,TEM-1D_Bla,allele,annotation,clusterid,coverage,depth,diffs,divergence,filename,gene,length,maxMAF,seqid,uncertainty
0,ARGannot.r1,,,SRR606249_subset10,,MphD_1613,no;no;MphD;MLS;NC_017312;2292413-2291580;834,228.0,90.528,3.798,5snp79holes,0.662,srst2/SRR606249_subset10_1.trim2.fq.gz__fullge...,MphD_MLS,834.0,0.25,1613.0,edge0.0
1,ARGannot.r1,,,SRR606249_subset50,,LsaA_298,no;no;LsaA;MLS;AY225127;41-1537;1497,33.0,100.0,17.648,26snp,1.737,srst2/SRR606249_subset50_1.trim30.fq.gz__fullg...,LsaA_MLS,1497.0,0.071,298.0,
2,ARGannot.r1,,,SRR606249_subset50,,MphD_1613,no;no;MphD;MLS;NC_017312;2292413-2291580;834,228.0,100.0,17.692,6snp,0.719,srst2/SRR606249_subset50_1.trim30.fq.gz__fullg...,MphD_MLS,834.0,0.04,1613.0,
3,,LsaA_298*,MphD_1613*,SRR606249_subset50,,,,,,,,,srst2/SRR606249_subset50_1.trim30.fq.gz__genes...,,,,,
4,,,,SRR606249_subset10,,,,,,,,,srst2/SRR606249_subset10_1.trim30.fq.gz__genes...,,,,,
5,,LsaA_298*,MphD_1613*,SRR606249,TEM-116_967*?,,,,,,,,srst2/SRR606249_1.trim30.fq.gz__genes__ARGanno...,,,,,
6,,LsaA_298*,MphD_1613*,SRR606249,TEM-116_967*?,,,,,,,,srst2/SRR606249_1.trim2.fq.gz__genes__ARGannot...,,,,,
7,ARGannot.r1,,,SRR606249_subset25,,LsaA_298,no;no;LsaA;MLS;AY225127;41-1537;1497,33.0,100.0,8.917,26snp,1.737,srst2/SRR606249_subset25_1.trim2.fq.gz__fullge...,LsaA_MLS,1497.0,0.5,298.0,
8,ARGannot.r1,,,SRR606249_subset25,,MphD_1613,no;no;MphD;MLS;NC_017312;2292413-2291580;834,228.0,100.0,9.616,6snp,0.719,srst2/SRR606249_subset25_1.trim2.fq.gz__fullge...,MphD_MLS,834.0,0.167,1613.0,
9,ARGannot.r1,,,SRR606249_subset25,,LsaA_298,no;no;LsaA;MLS;AY225127;41-1537;1497,33.0,100.0,8.01,26snp,1.737,srst2/SRR606249_subset25_1.trim30.fq.gz__fullg...,LsaA_MLS,1497.0,0.125,298.0,


In [53]:
# Drop rows that have no value in the 'gene' column
combined_df.dropna(axis=0, subset=['gene'], inplace=True)
g = combined_df.groupby('gene')
ug = list(set(combined_df['gene']))

a = []
for gene in ug:
    gene_group = g.get_group(gene)
    if len(gene_group['filename'])>1:
        a.append(gene_group[['filename', 'gene']])
a

[                                             filename      gene
 0   srst2/SRR606249_subset10_1.trim2.fq.gz__fullge...  MphD_MLS
 2   srst2/SRR606249_subset50_1.trim30.fq.gz__fullg...  MphD_MLS
 8   srst2/SRR606249_subset25_1.trim2.fq.gz__fullge...  MphD_MLS
 10  srst2/SRR606249_subset25_1.trim30.fq.gz__fullg...  MphD_MLS
 13  srst2/SRR606249_subset50_1.trim2.fq.gz__fullge...  MphD_MLS
 18  srst2/SRR606249_1.trim30.fq.gz__fullgenes__ARG...  MphD_MLS
 21  srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGa...  MphD_MLS,
                                              filename      gene
 1   srst2/SRR606249_subset50_1.trim30.fq.gz__fullg...  LsaA_MLS
 7   srst2/SRR606249_subset25_1.trim2.fq.gz__fullge...  LsaA_MLS
 9   srst2/SRR606249_subset25_1.trim30.fq.gz__fullg...  LsaA_MLS
 12  srst2/SRR606249_subset50_1.trim2.fq.gz__fullge...  LsaA_MLS
 16  srst2/SRR606249_1.trim30.fq.gz__fullgenes__ARG...  LsaA_MLS
 19  srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGa...  LsaA_MLS,
                       

In [54]:
from collections import defaultdict

gene_filenames = defaultdict(list)

for line in a:
    gene_filenames[line['gene'].iloc[0]].extend(line['filename'].tolist())

In [55]:
gene_filenames

defaultdict(list,
            {'MphD_MLS': ['srst2/SRR606249_subset10_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset50_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset25_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset25_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset50_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt'],
             'LsaA_MLS': ['srst2/SRR606249_subset50_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset25_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset25_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt',
              'srst2/SRR606249_subset50_

In [56]:
filenames = set()
for files in gene_filenames.values():
    filenames.update(files)

In [57]:
filenames = list(filenames)

In [58]:
data = {}
for gene, files in gene_filenames.items():
    data[gene] = [file in files for file in filenames]
dense_df = pd.DataFrame.from_dict(data, orient='index', columns=filenames)
dense_df

Unnamed: 0,srst2/SRR606249_subset50_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,srst2/SRR606249_subset25_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,srst2/SRR606249_subset25_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,srst2/SRR606249_subset10_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,srst2/SRR606249_1.trim2.fq.gz__fullgenes__ARGannot.r1__results.txt,srst2/SRR606249_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt,srst2/SRR606249_subset50_1.trim30.fq.gz__fullgenes__ARGannot.r1__results.txt
MphD_MLS,True,True,True,True,True,True,True
LsaA_MLS,True,True,True,False,True,True,True
TEM-1D_Bla,False,False,False,False,True,True,False


In [59]:
# initialize network object
net = Network(clustergrammer_widget)
# load dataframe
net.load_df(dense_df)
# cluster using default parameters
net.cluster(enrichrgram=False)
# make the visualization
net.widget()

clustergrammer_widget(network='{"row_nodes": [{"name": "MphD_MLS", "ini": 3, "clust": 1, "rank": 2, "rankvar":…

In [60]:
qgrid.show_grid(dense_df, show_toolbar=True)

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

# Conclusions: 
