## Assigning essentiality labels using E. coli K-12 datasets

Challenge: I generated transposon sequencing data for E. coli B REL606, a strain that is less commonly used than E. coli K-12. Since there is no well-annotated ground truth of gene essentiality for REL606, I will use two datasets from E. coli K-12 as benchmarks:

1. The E. coli K-12 Keio collection: in this experiment, the authors attempted to make gene deletions for every gene in the K-12 genome. They were unable to make deletions in ~300 genes, which are considered essential. 
2. The E. coli K-12 TraDIS collection: a transposon sequencing experiment where the authors did a manual examination of several edge cases and annotated essentiality based on both sequencing data and prior results in the literature.

Challenge: there are several unannotated genes in the E. coli genome, and the two strains can differ in their gene content. To get around these issues, I will identify all the genes that are shared between the REL606 and K-12 genomes and extract essentiality labels only for those.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path

In [None]:
#current working directory
repo = os.getcwd()
print(repo)

In [None]:
metadata_path = repo +'/Metadata/'

### Loading metadata

Another confounding factor is that the same gene can have multiple names in E. coli K-12, and the gene names used by the authors in Goodall et al. and Baba et al. may not match the names assigned when the E. coli REL606 reference genome was annotated with Prokka.

To get around this, I will use information from the EcoCyc database, which contains up to four synonyms for every gene.
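The synonym idea can be sketched on a toy table. This is a minimal illustration, not the EcoCyc file itself: the column names (`NAME`, `SYN1`, `SYN2`), the gene names, and the helper `build_synonym_map` are all invented for the example.

```python
import pandas as pd

# Toy stand-in for a synonym table; column and gene names are invented.
toy_syn = pd.DataFrame({
    "NAME": ["geneA", "geneB", "geneC"],
    "SYN1": ["altA1", None, "altC1"],
    "SYN2": ["altA2", None, None],
})

def build_synonym_map(table, name_cols):
    """Map every name found in a row to the set of the other names in that row."""
    syn_map = {}
    for _, row in table[name_cols].iterrows():
        # keep only real strings, which drops None/NaN placeholders
        row_names = {n for n in row if isinstance(n, str)}
        for n in row_names:
            syn_map.setdefault(n, set()).update(row_names - {n})
    return syn_map

syn_map = build_synonym_map(toy_syn, ["NAME", "SYN1", "SYN2"])
print(syn_map["altA1"])
```

Building the map once makes each later lookup a dictionary access, instead of scanning the whole table for every gene.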

In [None]:
eco_syn = pd.read_csv(metadata_path+'ecoli_genes.col',on_bad_lines='skip',skiprows=28,sep='\t')

In [None]:
#opening the pandas file with all the metadata
all_data = pd.read_csv(metadata_path+"all_metadata_REL606.txt", sep="\t")
names = all_data.iloc[:,0]
gene_start = all_data.iloc[:,3]
gene_end = all_data.iloc[:,4]
strand = all_data.iloc[:,5]
locations = np.transpose(np.vstack([gene_start,gene_end,strand]))
k12_tags = all_data.iloc[:,2]
uniprot_rel606 = all_data.iloc[:,6]
product = all_data.iloc[:,-1]

### Loading K-12 data

In [None]:
tradis = pd.read_excel(metadata_path+'tradis_k12_essentiality.xlsx',skiprows=1)

In [None]:
keio_ess = pd.read_csv(metadata_path+'genes_essential_keio.csv')
keio_all = pd.read_csv(metadata_path+'genes_all_keio.csv')

### Identifying overlap of gene names between these datasets and REL606

In [None]:
print('Number of overlapping gene names with TraDIS:', len(set(tradis['Gene'])&(set(names))))

In [None]:
print('Number of overlapping gene names with Keio:', len(set(keio_all['gene'])&(set(names))))

In [None]:
a = []
if a:
    print(k)

For every gene in my dataset that doesn't share a gene name with TraDIS (or the Keio collection), I will find all of its synonyms and check whether any of them are present in the TraDIS (or Keio) data.
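The lookup order (direct name first, then synonyms) can be sketched as a small function. The helper name `resolve_name`, the toy gene names, and the toy synonym map are assumptions made for illustration only.

```python
def resolve_name(gene, reference_names, syn_map):
    """Return the name under which `gene` appears in the reference set, else None."""
    if gene in reference_names:
        return gene
    # sorted() makes the choice deterministic when several synonyms match
    for alt in sorted(syn_map.get(gene, ())):
        if alt in reference_names:
            return alt
    return None

# Invented names, for illustration only
reference_names = {"geneA", "altB1"}
syn_map = {"geneB": {"altB1", "altB2"}, "geneC": {"altC1"}}

resolved = {g: resolve_name(g, reference_names, syn_map)
            for g in ["geneA", "geneB", "geneC"]}
print(resolved)
```

One design note: when a list is built from a set intersection, its element order is arbitrary, so taking the first element can pick a different synonym between runs if several match. Sorting the candidates, as in this sketch, is one way to keep the choice reproducible.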

In [None]:
synonym_dict_tradis = {}
for gene in set(names)-set(tradis['Gene']):
    if 'FJKNNBLA' not in gene:   #if there is no known gene, Prokka will assign the locus name 
        #(which starts with eight random letters) as gene name
        #check if gene is present in the eco_syn database
        if gene in eco_syn.values:
#             count += 1
            #identify relevant row in the dataframe
            row_num = np.where(eco_syn.eq(gene))[0][0]
            col_num = np.where(eco_syn.eq(gene))[1][0]
            #data to extract:
            extract_cols = list(set([2,8,9,10,11]) - set([col_num])) #extract all the other names for gene
            dat = set(eco_syn.iloc[row_num, extract_cols])
            dat = list({x for x in dat if x==x}) #remove nans
            #check if any of the synonyms are in the tradis dataset
            overlap = list(set(dat) & set(tradis['Gene']))
            if overlap:
                synonym_dict_tradis[gene]=overlap

In [None]:
synonym_dict_keio = {}
for gene in set(names)-set(keio_all['gene']):
    if 'FJKNNBLA' not in gene:   #if there is no known gene, Prokka will assign the locus name 
        #(which starts with eight random letters) as gene name
        #check if gene is present in the eco_syn database
        if gene in eco_syn.values:
            #identify relevant row in the dataframe
            row_num = np.where(eco_syn.eq(gene))[0][0]
            col_num = np.where(eco_syn.eq(gene))[1][0]
            #data to extract:
            extract_cols = list(set([2,8,9,10,11]) - set([col_num])) #extract all the other names for gene
            dat = set(eco_syn.iloc[row_num, extract_cols])
            dat = list({x for x in dat if x==x}) #remove nans
            #check if any of the synonyms are in the Keio dataset
            overlap = list(set(dat) & set(keio_all['gene']))
            if overlap:
                synonym_dict_keio[gene]=overlap

### Loading the tnseq features file

Not all genes have reads mapping to the interior of the gene, so we are only interested in labels for the genes that do have reads mapped.

In [None]:
tnseq_feats = pd.read_csv('tnseq_features_REL606.csv', index_col=0)
genes_int = tnseq_feats['Gene']

### Examining the distribution of name counts reveals that over 80 gene names appear multiple times in the genome.

This can happen for two reasons:
- these genes are actually duplicated in E. coli B REL606
- the annotation tool assigns the same gene name to proteins that are similar enough

I think this observation is likely due to a combination of both factors. Because these genes may well be paralogs, I will exclude them from the main project of predicting gene essentiality with machine learning classification algorithms. The rationale is that even if such a gene is essential, an extra copy in the genome can compensate when one copy is disrupted, so the gene won't show up as essential in the TnSeq data.

In [None]:
unique, counts = np.unique(names, return_counts=True)

In [None]:
sns.countplot(x=counts)
plt.yscale('symlog')

In [None]:
## creating an array which indicates if the corresponding gene in our dataset is duplicated or not.
multiple = np.zeros_like(names)
for gene in range(len(names)):
    if np.size(np.where(names==names[gene])[0])>1:
        multiple[gene] = 1
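As an alternative to the explicit loop, the same duplicate flag can be sketched with pandas' `Series.duplicated`. This is only an illustration on invented gene names; `keep=False` marks every occurrence of a repeated name, not just the later ones.

```python
import pandas as pd

names_toy = pd.Series(["geneA", "geneB", "geneA", "geneC"])  # invented names

# True for every row whose name occurs more than once
is_duplicated = names_toy.duplicated(keep=False)

# positional indices of genes whose name is unique
unique_idx = names_toy.index[~is_duplicated].tolist()
print(is_duplicated.tolist(), unique_idx)
```

This avoids comparing every name against the full array inside a Python loop, which matters little at genome scale here but keeps the intent explicit.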

### Compiling the essentiality labels (after excluding the duplicated genes)

In [None]:
tradis_dict = {0: 'Essential', 1: 'Nonessential', 2: 'Unclear'}

In [None]:
genes_included = np.sort(list(set(genes_int)-set(np.where(multiple)[0])))

In [None]:
print('Fraction of genes retained after removing genes with potential duplicates\n',len(genes_included)/len(names))

#### Counting the genes that are present in the TraDIS/Keio datasets, either directly or through a synonym

In [None]:
count1=0
count2=0
count3=0
for gene in genes_included:
    if names[gene] in list(tradis['Gene']):
        count1+=1
    elif names[gene] in synonym_dict_tradis.keys():
        count2+=1
#         print(names[gene])
    else:
        count3+=1
print('Gene names present in TraDIS data:', count1)
print('Synonym of gene names present in TraDIS data:', count2)
print('Not present in TraDIS data:', count3)

In [None]:
count1=0
count2=0
count3=0
for gene in genes_included:
    if names[gene] in list(keio_all['gene']):
        count1+=1
    elif names[gene] in synonym_dict_keio.keys():
        count2+=1
    else:
        count3+=1
print('Gene names present in Keio data:', count1)
print('Synonym of gene names present in Keio data:', count2)
print('Not present in Keio data:', count3)

Overall, we end up with ~3600 genes which are shared (with high confidence) between the TraDIS/Keio collection datasets, and our TnSeq data.

### Now, I'll assign an essentiality label to each gene

In [None]:
labels_tradis = []
indices_tradis = []

row_list = []

for gene in genes_included:
    row_dict = {}
    row_dict['Gene'] = gene
    
    ### step 1: essentiality as predicted in the TraDIS dataset
    if names[gene] in list(tradis['Gene']):
        indices_tradis.append(gene)
        #identifying the row in the tradis data corresponding to this gene, extracting only the last three columns
        #which contain information about gene essentiality
        key = np.where(list(tradis.iloc[np.where(tradis['Gene']==names[gene])[0][0], -3:]))[0][0]
        #the idea behind np.where is that one of the columns is necessarily true, and the above line will return 
        #which of the columns is true. This is then mapped to essentiality status
        row_dict['TraDIS'] = tradis_dict[key]
        
    elif names[gene] in synonym_dict_tradis.keys():
        key = np.where(list(tradis.iloc[np.where(tradis['Gene']==synonym_dict_tradis[names[gene]][0])[0][0], -3:]))[0][0]
        row_dict['TraDIS'] = tradis_dict[key]
        
    else:
        row_dict['TraDIS'] = 'NA'
    
    ### step 2: essentiality as predicted in the Keio dataset
    if names[gene] in list(keio_all['gene']):
        if names[gene] in list(keio_ess['Gene']):
            row_dict['Keio'] = 'Essential'
        else:
            row_dict['Keio'] = 'Nonessential'
    
    elif names[gene] in synonym_dict_keio.keys():
        search = synonym_dict_keio[names[gene]][0]
        if search in list(keio_ess['Gene']):
            row_dict['Keio'] = 'Essential'
        else:
            row_dict['Keio'] = 'Nonessential'
    
    else:
        row_dict['Keio'] = 'NA'
        
    #finally add this dictionary to the row_list
    row_list.append(row_dict)
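The `np.where` lookup in the loop above relies on exactly one of the three flag columns being true for each gene. A minimal sketch of that mapping on an invented flag matrix, with the assumption checked explicitly, might look like this (the column order Essential / Nonessential / Unclear is taken from `tradis_dict`; the data are made up):

```python
import numpy as np

tradis_dict = {0: 'Essential', 1: 'Nonessential', 2: 'Unclear'}

# Invented flags: one row per gene, one column per essentiality class
flags = np.array([
    [True, False, False],
    [False, True, False],
    [False, False, True],
])

# the position-to-label mapping is only meaningful if each row has a single True
assert (flags.sum(axis=1) == 1).all()

# argmax returns the position of the first True in each row
labels = [tradis_dict[int(k)] for k in flags.argmax(axis=1)]
print(labels)
```

If a row had no true flag, `np.where(...)[0][0]` would raise an `IndexError`, while `argmax` would silently return 0, so the explicit check is worth keeping in either version.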

In [None]:
essentiality_labels = pd.DataFrame(row_list)

In [None]:
essentiality_labels = essentiality_labels.set_index('Gene', drop=True)

### Now for merging both the features and gene essentiality data into the same dataframe

In [None]:
features = pd.read_csv('tnseq_features_REL606.csv')

In [None]:
features = features.set_index('Gene', drop=True)

In [None]:
merged_data = pd.merge(features, essentiality_labels, left_index=True, right_index=True)

In [None]:
merged_data.to_csv('tnseq_features_essentiality.csv')
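Because `pd.merge` defaults to an inner join, genes that have features but no row in the label table are dropped from the merged result. One way to see which genes fall out is an outer merge with `indicator=True`. The sketch below uses invented data, and the `reads` column name is an assumption.

```python
import pandas as pd

# Invented data; "reads" is a placeholder feature column
features_toy = pd.DataFrame({"reads": [10, 0, 25]},
                            index=pd.Index([0, 1, 2], name="Gene"))
labels_toy = pd.DataFrame({"TraDIS": ["Essential", "Nonessential"]},
                          index=pd.Index([0, 2], name="Gene"))

# outer join keeps every gene; the _merge column records where each row came from
merged_toy = pd.merge(features_toy, labels_toy,
                      left_index=True, right_index=True,
                      how="outer", indicator=True, validate="one_to_one")

# genes an inner join would have discarded
dropped = merged_toy.index[merged_toy["_merge"] != "both"].tolist()
print(dropped)
```

`validate="one_to_one"` raises an error if either index contains repeated keys, which guards against accidental row duplication during the merge.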

This is what will go into the machine learning models for predicting gene essentiality.

In [204]:
np.sum(merged_data['TraDIS']=='Essential')    #number of genes called essential in the tradis data

333

In [205]:
np.sum(merged_data['Keio']=='Essential')      #number of genes called essential in the keio collection

285