## Reanalysis of DeJesus et al. (2017) HMM results - Mycobacterium tuberculosis only

The aim of this script is to use the published results of the HMM from __DeJesus et al. (2017)__ to assign an essentiality call to each genetic feature identified in the _M. tuberculosis_ genome. This effectively recreates the __.genes.txt file__ produced by the TRANSIT HMM.  
This is necessary because the HMM used in DeJesus et al. (2017) is not publicly available, so the data must be adapted directly from the paper.  
This step is unnecessary for _M. bovis_, as its .genes.txt file is produced directly by the TRANSIT HMM.

In [1]:
import pandas as pd
import gffpandas.gffpandas as gffpd

Load Supplementary Table 2 from DeJesus et al. (2017) as a dataframe. This is equivalent to the .sites.txt output file from the HMM used in the paper.   
  
  _NOTE:_ This table has already been updated so that TA sites are mapped to the genetic features in 'full genetic features assembly.gff3'. This was done by running all data through the older HMM and then copying over only the column that maps TA sites to genetic features.

In [2]:
sites_file = '../data/Mycobacterium tuberculosis/mbo002173137st2_updated.xlsx'
sites_df = pd.read_excel(sites_file, skiprows = 1)

Adjust the dataframe to only contain necessary information.

In [3]:
# Remove unnecessary columns
sites_df_crop = sites_df.drop(['tRNA', 'rRNA','DNA-Methylation-Site','TRIT-Site','5pUTR','Promoter',
                               'Num. of Datasets with Insertions','Sum of Normalized Read Counts'], axis=1)

# Replace NA values (assign the result back; calling fillna in place on a column selection is deprecated)
sites_df_crop['Matches Non Permissive Motif'] = sites_df_crop['Matches Non Permissive Motif'].fillna(0)

Create a dictionary to store each ORF (gene, domain, unannotated region) and the number of TA sites in each essentiality state predicted within that ORF. Iterate through the sites_df_crop dataframe to collect all information.

In [4]:
feature_summary = {}

# For each TA site within the genome (each row)...
for index, row in sites_df_crop.iterrows():

    # Find the associated orf
    orf = row['ORF ID']

    if pd.notna(orf):
        
        # Store orf(s) in a list
        orf_ids = row['ORF ID'].split(',')

        for orf_id in orf_ids:
            # Create a new key in the feature_summary for each new ORF found
            if orf_id not in feature_summary:
                feature_summary[orf_id] = {
                    'Name': orf_id,
                    'Number of TA Sites': 0,
                    'Number of Permissive (P) Sites': 0,
                    'Number of Non-Permissive (NP) Sites': 0,
                    'Number of Sites Belonging to Essential State': 0,
                    'Number of Sites Belonging to Growth-Defect State': 0,
                    'Number of Sites Belonging to Non-Essential State': 0,
                    'Number of Sites Belonging to Growth-Advantage State': 0,
                    'Final Call': ''}
            # Increase number of TA sites within the associated ORF by 1
            feature_summary[orf_id]['Number of TA Sites'] += 1
            
            # Record if the TA site is permissive or non-permissive by increasing count by 1
            if row['Matches Non Permissive Motif'] == 1:
                feature_summary[orf_id]['Number of Non-Permissive (NP) Sites'] += 1
            else:
                feature_summary[orf_id]['Number of Permissive (P) Sites'] += 1

            # Increment the count for the essentiality state predicted at this TA site
            if row['Essentiality State'] == 'ES':
                feature_summary[orf_id]['Number of Sites Belonging to Essential State'] += 1
            elif row['Essentiality State'] == 'GD':
                feature_summary[orf_id]['Number of Sites Belonging to Growth-Defect State'] += 1
            elif row['Essentiality State'] == 'NE':
                feature_summary[orf_id]['Number of Sites Belonging to Non-Essential State'] += 1
            elif row['Essentiality State'] == 'GA':
                feature_summary[orf_id]['Number of Sites Belonging to Growth-Advantage State'] += 1
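
The per-site counting above can also be cross-checked with vectorized pandas operations. The following is a minimal sketch on invented toy data, not the method used in this notebook; the names `toy_sites`, `exploded`, `state_counts` and `np_counts` are hypothetical, and only the three column names are taken from the dataframe above.

```python
import pandas as pd

# Toy stand-in for sites_df_crop; all values are invented for illustration
toy_sites = pd.DataFrame({
    'ORF ID': ['Rv0001', 'Rv0001,Rv0002', None, 'Rv0002'],
    'Matches Non Permissive Motif': [0, 1, 0, 0],
    'Essentiality State': ['ES', 'ES', 'NE', 'NE'],
})

# Drop sites with no ORF, then give each ORF of a multi-ORF site its own row
exploded = (toy_sites.dropna(subset=['ORF ID'])
            .assign(orf=lambda d: d['ORF ID'].str.split(','))
            .explode('orf')
            .reset_index(drop=True))

# Count TA sites per ORF and essentiality state, keeping a fixed column order
state_counts = (pd.crosstab(exploded['orf'], exploded['Essentiality State'])
                .reindex(columns=['ES', 'GD', 'NE', 'GA'], fill_value=0))

# Count non-permissive TA sites per ORF
np_counts = (exploded['Matches Non Permissive Motif'].eq(1)
             .groupby(exploded['orf']).sum())

print(state_counts)
print(np_counts)
```

Comparing such a table against `feature_summary` on the real data would be one way to confirm that sites shared between several ORFs are counted once per ORF.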

Convert the feature_summary dictionary into a dataframe with relevant columns.

In [5]:
# Convert the feature_summary dictionary to a DataFrame
features_df = pd.DataFrame.from_dict(feature_summary, orient='index')

# Reorder columns to match the desired output
features_df = features_df[['Name', 'Number of TA Sites', 'Number of Permissive (P) Sites',
     'Number of Non-Permissive (NP) Sites', 'Number of Sites Belonging to Essential State',
     'Number of Sites Belonging to Growth-Defect State', 'Number of Sites Belonging to Non-Essential State',
     'Number of Sites Belonging to Growth-Advantage State', 'Final Call']]

Determine which essentiality state has the highest number of TA sites for each ORF and add as a column in the feature dataframe.

In [6]:
# Select the subset of essentiality columns
columns_subset = features_df.iloc[:, 4:8]  # Select columns by their integer index positions

# Find the column name with the maximum value for each row
max_TA_index = columns_subset.idxmax(axis=1)

# Define the mapping dictionary for column name to abbreviation
column_abbreviations = {'Number of Sites Belonging to Essential State': 'ES',
                  'Number of Sites Belonging to Growth-Defect State': 'GD',
                  'Number of Sites Belonging to Non-Essential State': 'NE',
                  'Number of Sites Belonging to Growth-Advantage State': 'GA'}

# Map the column names to their abbreviations
max_TA_index_abbreviation = max_TA_index.map(column_abbreviations)

# Add the abbreviations to features_df
features_df['Final Call'] = max_TA_index_abbreviation
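
One property of this approach worth keeping in mind is that `idxmax` returns the first column holding the maximum, so an ORF with equal counts in two states is assigned whichever state comes first in column order. The sketch below illustrates this on invented counts and shows one possible way to flag tied rows; `toy_counts`, `toy_calls` and `is_tied` are hypothetical names, and the short column labels stand in for the full column names used above.

```python
import pandas as pd

# Invented per-ORF state counts; orf_b is tied between ES and GD
toy_counts = pd.DataFrame(
    {'ES': [3, 2, 0], 'GD': [0, 2, 0], 'NE': [1, 0, 5], 'GA': [0, 0, 0]},
    index=['orf_a', 'orf_b', 'orf_c'])

# idxmax picks the first column that holds the row maximum
toy_calls = toy_counts.idxmax(axis=1)

# Flag rows where more than one state shares the maximum count
is_tied = toy_counts.eq(toy_counts.max(axis=1), axis=0).sum(axis=1) > 1

print(toy_calls)
print(is_tied)
```

Whether tied ORFs need separate handling depends on how the downstream analysis uses the final call; the sketch only makes the ties visible.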

Export the dataframe as a .genes.txt file, mimicking the .genes.txt output from the HMM.

In [7]:
# Create header necessary for .genes.txt file
header = "#HMM - Genes\n#command line:\n#summary of gene calls:\n#key: ES=essential, GD=insertions cause growth-defect, NE=non-essential, GA=insertions confer growth-advantage, N/A=not analyzed (genes with 0 TA sites)\n#ORF	TAs	Permissive TAs	Non-permissive TAs	ES sites	GD sites	NE sites	GA sites	call\n"

output_filename = '../data/Mycobacterium tuberculosis/NC_018143.1 HMM feature essentiality predictions.genes.txt'

# Write the header and dataFrame to the file
with open(output_filename, 'w') as f:
    f.write(header)
    features_df.to_csv(f, sep='\t', index=False, header=False)
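
A file written this way can be read back to check its layout. The sketch below is an assumption-laden illustration: it parses a small in-memory stand-in with invented rows and a shortened three-column layout rather than the real output file, and the names `toy_file` and `toy_genes` are hypothetical. Because every header line starts with `#`, `comment='#'` skips them, so the column names have to be supplied explicitly.

```python
import io
import pandas as pd

# In-memory stand-in for a .genes.txt file; rows and columns are invented
toy_file = io.StringIO(
    "#HMM - Genes\n"
    "#ORF\tTAs\tcall\n"
    "Rv0001\t5\tES\n"
    "Rv0002\t12\tNE\n")

# Lines starting with '#' are skipped, so no header row is read from the file
toy_genes = pd.read_csv(toy_file, sep='\t', comment='#',
                        header=None, names=['ORF', 'TAs', 'call'])

print(toy_genes)
```

Note that `comment='#'` also truncates any data line at a `#` character, so this only works if no feature name contains one.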