# Summary:

This notebook visualizes contig annotations produced by Prokka.

    
# Example Use Case:

In this example, the complete Shakya et al. 2013 metagenome is being compared to small, medium, and large subsamples of itself after conservative or aggressive read filtering and assembly with SPAdes or MEGAHIT. The datasets used in this example are named according to their metagenome content, relative degree of read filtering, and assembler used:

* SRR606249 = Accession number for the complete Shakya et al. 2013 metagenome
* subset50 = 50% of the complete Shakya et al. 2013 metagenome
* subset25 = 25% of the complete Shakya et al. 2013 metagenome
* subset10 = 10% of the complete Shakya et al. 2013 metagenome
* pe.trim2 = Conservative read filtering
* pe.trim30 = Aggressive read filtering
* megahit = MEGAHIT assembly
* spades = SPAdes assembly 
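
The naming scheme above can be split back into its parts programmatically. The following is a minimal sketch, assuming file names shaped like the ones in the outputs of this notebook (for example `SRR606249_subset25_1.trim30_megahit_.tsv`). The helper `parse_dataset_name` and the label `complete` for the unsubsampled metagenome are hypothetical and not part of the original analysis.

```python
import re

# Hypothetical pattern for names such as
# 'SRR606249_subset25_1.trim30_megahit_.tsv' or 'SRR606249_1.trim2_spades_.tsv'.
NAME_PATTERN = re.compile(
    r"^(?P<accession>SRR\d+)"         # accession, e.g. SRR606249
    r"(?:_(?P<subset>subset\d+))?"    # optional subsample, e.g. subset25
    r"_\d+\.(?P<trim>trim\d+)"        # read filtering level, e.g. trim30
    r"_(?P<assembler>[a-z]+)_\.tsv$"  # assembler, e.g. megahit
)


def parse_dataset_name(filename):
    """Return the accession, subset, trim level, and assembler as a dict."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized file name: {filename}")
    parts = match.groupdict()
    # Files without a 'subsetNN' token are treated as the complete metagenome.
    parts["subset"] = parts["subset"] or "complete"
    return parts


print(parse_dataset_name("SRR606249_subset25_1.trim30_megahit_.tsv"))
print(parse_dataset_name("SRR606249_1.trim2_spades_.tsv"))
```

Parsed labels like these could be added as extra columns to the combined table so that results can be grouped by assembler or trimming level instead of by raw file name.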


# Objectives:

* Load the Prokka annotations
* Count the total number of genes per dataset
* Count the number of unique genes per dataset
* Compare samples and report the unique genes

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
from pickle import load
import pandas as pd
import glob
%matplotlib inline

In [2]:
def concat_files(filenames):
    # Find every file matching the glob pattern, e.g. "*tsv"
    paths = glob.glob(filenames)
    list_of_dfs = [pd.read_table(path) for path in paths]
    # Record the source file of each row so datasets can be compared later
    for dataframe, path in zip(list_of_dfs, paths):
        dataframe['filename'] = path
    combined_df = pd.concat(list_of_dfs, ignore_index=True)
    return combined_df
concat_files("*tsv")

Unnamed: 0,locus_tag,ftype,gene,EC_number,product,filename
0,OBGPGDIN_00001,CDS,drrB_1,Daunorubicin/doxorubicin resistance ABC transp...,,SRR606249_subset25_1.trim30_megahit_.tsv
1,OBGPGDIN_00002,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
2,OBGPGDIN_00003,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
3,OBGPGDIN_00004,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
4,OBGPGDIN_00005,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
5,OBGPGDIN_00006,CDS,valS_1,6.1.1.9,Valine--tRNA ligase,SRR606249_subset25_1.trim30_megahit_.tsv
6,OBGPGDIN_00007,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
7,OBGPGDIN_00008,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
8,OBGPGDIN_00009,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv
9,OBGPGDIN_00010,CDS,hypothetical protein,,,SRR606249_subset25_1.trim30_megahit_.tsv


In [19]:
# Calculate the total number of genes annotated with Prokka
def calc_total_genes():
    combined_df = concat_files("*tsv")
    # count() ignores rows whose 'gene' value is missing
    gene_counts = combined_df.groupby('filename').gene.count().to_frame()
    return gene_counts.sort_values('gene', ascending=False)
calc_total_genes()

Unnamed: 0_level_0,gene
filename,Unnamed: 1_level_1
SRR606249_1.trim2_megahit_.tsv,195733
SRR606249_1.trim30_megahit_.tsv,195340
SRR606249_subset50_1.trim2_megahit_.tsv,193931
SRR606249_1.trim2_spades_.tsv,192008
SRR606249_subset50_1.trim30_megahit_.tsv,191005
SRR606249_1.trim30_spades_.tsv,190777
SRR606249_subset50_1.trim30_spades_.tsv,184330
SRR606249_subset25_1.trim2_megahit_.tsv,182300
SRR606249_subset25_1.trim2_spades_.tsv,177824
SRR606249_subset25_1.trim30_megahit_.tsv,172574


In [4]:
# Calculate the total number of unique genes annotated with Prokka
def calculate_unique_genes():
    combined_df = concat_files("*tsv")
    # nunique() counts distinct non-missing 'gene' values per file
    unique_counts = combined_df.groupby('filename').gene.nunique().to_frame()
    return unique_counts.sort_values('gene', ascending=False)
calculate_unique_genes()

Unnamed: 0_level_0,gene
filename,Unnamed: 1_level_1
SRR606249_1.trim2_megahit_.tsv,94112
SRR606249_1.trim2_spades_.tsv,93457
SRR606249_1.trim30_megahit_.tsv,91514
SRR606249_1.trim30_spades_.tsv,90661
SRR606249_subset50_1.trim2_megahit_.tsv,89970
SRR606249_subset50_1.trim30_megahit_.tsv,85913
SRR606249_subset50_1.trim30_spades_.tsv,85258
SRR606249_subset25_1.trim2_megahit_.tsv,77010
SRR606249_subset25_1.trim2_spades_.tsv,76953
SRR606249_subset25_1.trim30_spades_.tsv,71905


In [5]:
# Calculate the intersection between the unique genes in each dataset
combined_df = concat_files('*tsv')
combined_df.dropna(axis=0, inplace=True)
#combined_df.head()
g = combined_df.groupby('gene')
ug = list(set(combined_df['gene']))

In [6]:
#g.get_group(ug[0])

In [20]:
# Create the concatenated table from all tsv files
combined_df = concat_files('*tsv')
# Keep only the 'gene' and 'filename' columns and drop rows with missing values
new_combined_df = combined_df[['gene', 'filename']].dropna()
g = combined_df.groupby('gene')
ug = list(set(new_combined_df['gene']))

# Collect every gene that appears in more than one row of the combined table
a = []
for gene in ug:
    gene_group = g.get_group(gene)
    if len(gene_group['filename'])>1:
        a.append(gene_group[['filename', 'gene']])
a

[                                         filename    gene
 21615    SRR606249_subset25_1.trim30_megahit_.tsv  nqrB_2
 207185     SRR606249_subset25_1.trim2_spades_.tsv  nqrB_2
 375593    SRR606249_subset25_1.trim30_spades_.tsv  nqrB_2
 546723    SRR606249_subset25_1.trim2_megahit_.tsv  nqrB_2
 740327              SRR606249_1.trim2_spades_.tsv  nqrB_2
 957164            SRR606249_1.trim30_megahit_.tsv  nqrB_2
 1113812   SRR606249_subset50_1.trim30_spades_.tsv  nqrB_2
 1318786   SRR606249_subset50_1.trim2_megahit_.tsv  nqrB_2
 1500950            SRR606249_1.trim30_spades_.tsv  nqrB_2
 1739464            SRR606249_1.trim2_megahit_.tsv  nqrB_2
 1870309    SRR606249_subset10_1.trim2_spades_.tsv  nqrB_2
 2021046  SRR606249_subset50_1.trim30_megahit_.tsv  nqrB_2
 2189185   SRR606249_subset10_1.trim30_spades_.tsv  nqrB_2
 2365971  SRR606249_subset10_1.trim30_megahit_.tsv  nqrB_2
 2477347   SRR606249_subset10_1.trim2_megahit_.tsv  nqrB_2,
                                          filename    g

In [8]:
from collections import defaultdict

gene_filenames = defaultdict(list)

for line in a:
    gene_filenames[line['gene'].iloc[0]].extend(line['filename'].tolist())

In [9]:
filenames = set()
for files in gene_filenames.values():
    filenames.update(files)

In [10]:
filenames = list(filenames)

In [11]:
data = {}
for gene, files in gene_filenames.items():
    data[gene] = [file in files for file in filenames]
dense_df = pd.DataFrame.from_dict(data, orient='index', columns=filenames)
#dense_df

In [12]:
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [13]:
int_dense_df = dense_df.astype(int)#.to_csv("con_annot.txt",sep='\t')
#int_dense_df

In [14]:
import qgrid
#qgrid 

In [15]:
qgrid.show_grid(dense_df, show_toolbar=True)

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…

# Conclusions:

In this example, we analyzed the gene annotations for the complete Shakya et al. 2013 metagenome and for small, medium, and large subsamples of it that were quality trimmed conservatively or aggressively and assembled with SPAdes or MEGAHIT. We observed that more genes (~3,000-4,000) were annotated when MEGAHIT was used as the assembler; however, the majority of those genes were hypothetical. The number of genes detected did not drop off significantly when the dataset was subsampled by 50%. Aggressive trimming decreased the number of annotations with both assemblers.
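
The assembler comparison can be checked directly against the gene counts printed by `calc_total_genes()`. The sketch below is an illustration only: it hard-codes the counts from that output for the dataset and trimming combinations where both assemblers appear in the displayed rows, and the column names `dataset`, `trim`, and `assembler` are assumptions introduced here.

```python
import pandas as pd

# Gene counts copied from the calc_total_genes() output in this notebook.
# Only combinations with both a MEGAHIT and a SPAdes row in the displayed
# table are included.
counts = pd.DataFrame(
    [
        ("SRR606249", "trim2", "megahit", 195733),
        ("SRR606249", "trim2", "spades", 192008),
        ("SRR606249", "trim30", "megahit", 195340),
        ("SRR606249", "trim30", "spades", 190777),
        ("subset50", "trim30", "megahit", 191005),
        ("subset50", "trim30", "spades", 184330),
        ("subset25", "trim2", "megahit", 182300),
        ("subset25", "trim2", "spades", 177824),
    ],
    columns=["dataset", "trim", "assembler", "genes"],
)

# One row per (dataset, trim) pair and one column per assembler.
by_assembler = (
    counts.set_index(["dataset", "trim", "assembler"])["genes"]
    .unstack("assembler")
)

# Positive values mean MEGAHIT produced more annotated genes than SPAdes.
by_assembler["megahit_minus_spades"] = (
    by_assembler["megahit"] - by_assembler["spades"]
)
print(by_assembler)
```

Because the displayed table is truncated, this covers only part of the comparison; running the same steps on the full output of `calc_total_genes()` would give the complete picture.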


To do:
- Create a linkage analysis figure to visualize how many genes are shared among samples
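
One possible starting point for the to-do item above is a file-by-file matrix of shared gene counts, computed from a presence/absence table. This is a minimal sketch on made-up toy data, not the notebook's results. The file names `a.tsv`, `b.tsv`, and `c.tsv` are placeholders, and the matrix product is a suggested technique rather than the method used in this analysis.

```python
import pandas as pd

# Toy stand-in for the notebook's 'gene' / 'filename' table (values are made up).
toy = pd.DataFrame(
    {
        "gene": ["nqrB_2", "nqrB_2", "valS_1", "valS_1", "valS_1", "drrB_1"],
        "filename": ["a.tsv", "b.tsv", "a.tsv", "b.tsv", "c.tsv", "c.tsv"],
    }
)

# Presence/absence matrix: rows are genes, columns are files, values are 0 or 1.
presence = (pd.crosstab(toy["gene"], toy["filename"]) > 0).astype(int)

# Entry (i, j) counts the genes present in both file i and file j.
# The diagonal holds the number of distinct genes in each file.
shared = presence.T.dot(presence)
print(shared)
```

With the notebook's own data, the same product applied to `int_dense_df` (`int_dense_df.T.dot(int_dense_df)`) should give the corresponding matrix, which could then be passed to `sns.heatmap` for plotting.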