## Predicting genes that contain essential regions

This script uses the .genes.txt output of the TRANSIT HMM model to determine which genes may contain essential regions. These are genes that contain a __mix of at least 1 essential region and at least 1 non-essential region (nonessential, growth advantage, or growth disadvantage)__. To be considered in the analysis, a region must have at least 3 TA sites.

In [57]:
import pandas as pd

First, find the corresponding .genes.txt file, depending on the _Mycobacterium_ species. This should have already been generated using the TRANSIT HMM model.

In [58]:
# For Mycobacterium bovis:
input_file = '../data/Mycobacterium bovis/LT708304 HMM feature essentiality predictions.genes.txt'
input_cols = ['ORF', 'gene', 'annotation', 'TAs', 'ES sites', 'GD sites', 'NE sites', 
        'GA sites', 'saturation', 'NZmean', 'call']
output_gene_list = '../data/Mycobacterium bovis/LT708304 predicted essential region containing genes.txt'

In [59]:
# For Mycobacterium tuberculosis:
input_file = '../data/Mycobacterium tuberculosis/NC_018143.1 HMM feature essentiality predictions.genes.txt'
input_cols = ['ORF', 'TAs', 'Permissive TAs', 'Non-permissive TAs', 'ES sites', 'GD sites', 'NE sites', 
               'GA sites', 'call']
output_gene_list = '../data/Mycobacterium tuberculosis/NC_018143.1 predicted essential region containing genes.txt'

Read in the .genes.txt file as a dataframe and adjust to only contain necessary data.

In [60]:
calls_df = pd.read_table(input_file, skiprows = 5, names = input_cols)

In [61]:
# Drop rows where 'ORF' does not contain ':' (copy so the column assignments below act on an independent dataframe)
calls_df = calls_df[calls_df['ORF'].str.contains(':')].copy()

calls_df['locus_tag'] = None
calls_df['coords'] = None

# Split the names into locus_tag and coordinates
calls_df[['locus_tag', 'coords']] = calls_df['ORF'].str.split(':', n=1, expand = True)

# Remove any rows with no locus_tag
calls_df = calls_df[calls_df['locus_tag'] != 'nan']

# Normalize locus tag case (str.capitalize makes the first character upper case and the rest lower case)
calls_df['locus_tag'] = calls_df['locus_tag'].str.capitalize()
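To make the effect of the split and the case normalization concrete, here is a minimal sketch on made-up data. The `ORF` strings below are hypothetical and only assume the `locus_tag:coords` shape implied by the code above; the real TRANSIT output may format these names differently.

```python
import pandas as pd

# Hypothetical ORF names: two with a ':' separator, one without
toy = pd.DataFrame({'ORF': ['MB0001:1-1524', 'mb0002c:2052-3260', 'no_separator']})

# Keep only rows whose name contains ':' and work on a copy
toy = toy[toy['ORF'].str.contains(':')].copy()

# Split once on the first ':' into locus tag and coordinates
toy[['locus_tag', 'coords']] = toy['ORF'].str.split(':', n=1, expand=True)

# First character upper case, everything else lower case
toy['locus_tag'] = toy['locus_tag'].str.capitalize()

print(toy[['locus_tag', 'coords']])
```

Note that `str.capitalize()` also lower-cases every character after the first, so tags that differ only in case end up grouped under the same locus tag in the later steps.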

Keep only features with at least 3 total TA sites, as this is the criterion for consideration.

In [62]:
# Filter for calls with at least 3 TA sites
calls_df_3TA = calls_df[calls_df['TAs'] >= 3]

Generate a pivot table in which each row is a locus tag and each column is an essentiality call, with each cell holding the number of features (genes, domains, or unannotated regions) that received that call.

In [63]:
# Create a pivot table for each locus_tag
pivot_df = calls_df_3TA.pivot_table(index='locus_tag', columns='call', aggfunc='size', fill_value=0)

# Reset the index to make 'locus_tag' a column again
pivot_df = pivot_df.reset_index()
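As an illustration of the resulting table shape, the following sketch runs the same `pivot_table` call on a small made-up dataframe. The locus tags, TA counts, and calls are hypothetical and are not taken from the real data.

```python
import pandas as pd

# Hypothetical features, one row per feature, already filtered to >= 3 TA sites
toy_calls = pd.DataFrame({
    'locus_tag': ['Rv0001', 'Rv0001', 'Rv0002', 'Rv0003', 'Rv0003'],
    'TAs':       [5, 4, 10, 3, 6],
    'call':      ['ES', 'NE', 'NE', 'ES', 'ES'],
})

# Count the features per locus tag and call; absent combinations become 0
toy_pivot = toy_calls.pivot_table(index='locus_tag', columns='call',
                                  aggfunc='size', fill_value=0)

# Make 'locus_tag' an ordinary column again
toy_pivot = toy_pivot.reset_index()

print(toy_pivot)
```

Only calls that actually occur in the input become columns. In this toy table there are `ES` and `NE` columns but no `GA` or `GD` column.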

Filter for potential essential region containing genes. These are genes with at least one essential feature and at least one nonessential / growth advantage / growth disadvantage feature.

In [64]:
# Filter for at least 1 GA, GD, or NE feature
pivot_df_filter = pivot_df[(pivot_df['GA'] > 0) | (pivot_df['GD'] > 0) | (pivot_df['NE'] > 0)]
# Additionally filter for at least one ES feature (build the mask from the filtered dataframe so the indexes match)
pivot_df_filter = pivot_df_filter[pivot_df_filter['ES'] > 0]
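
The filter above assumes that every call (`ES`, `GD`, `NE`, `GA`) occurs at least once in the data, because a call that never occurs has no column in the pivot table and indexing it raises a `KeyError`. One possible way to guard against that is sketched below on made-up data. The helper name `filter_mixed_genes` and the toy table are hypothetical and not part of the original analysis.

```python
import pandas as pd

CALLS = ['ES', 'GD', 'NE', 'GA']

def filter_mixed_genes(pivot):
    """Return rows with >= 1 ES feature and >= 1 GD, NE, or GA feature."""
    # Add any missing call column as zeros so the comparisons cannot raise KeyError
    counts = pivot.reindex(columns=['locus_tag'] + CALLS, fill_value=0)
    has_essential = counts['ES'] > 0
    has_other = counts[['GD', 'NE', 'GA']].sum(axis=1) > 0
    return counts[has_essential & has_other]

# Hypothetical pivot table that lacks the 'GA' and 'GD' columns
toy_pivot = pd.DataFrame({
    'locus_tag': ['Rv0001', 'Rv0002', 'Rv0003'],
    'ES': [1, 0, 2],
    'NE': [1, 1, 0],
})

mixed = filter_mixed_genes(toy_pivot)
print(mixed['locus_tag'].tolist())
```

Combining both conditions into a single boolean mask also avoids indexing a filtered dataframe with a mask built from the unfiltered one.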

Create a list of predicted essential region containing genes and export as a .txt file.

In [65]:
# Collect the locus tags of the filtered genes
essential_region_gene_list = pivot_df_filter['locus_tag'].tolist()

In [66]:
with open(output_gene_list, 'w') as output:
    for locus_tag in essential_region_gene_list:
        output.write(str(locus_tag) + '\n')