## Assemble essentiality state GFF3

This notebook converts regions of the genome predicted as essential (ES), non-essential (NE), growth advantage (GA), or growth defect (GD) into a GFF3 file in which each region is a feature. The resulting file can then be loaded into JBrowse 2 to visualise essentiality on a feature track.

In [12]:
import pandas as pd
import gffpandas.gffpandas as gffpd

First, read in the sites file for the _Mycobacterium_ species being processed. Run only the cell for that species, because both cells set the same variables. For _M. bovis_, the .sites.txt file should already have been generated with the TRANSIT HMM method.

In [13]:
# For Mycobacterium bovis
input_file = '../data/Mycobacterium bovis/LT708304 HMM feature essentiality predictions.sites.txt'
input_cols = ['Coordinate', 'Num. insertions', '% ES', '% GD', '% NE', '% GA', 'Essentiality State', 'ORF ID']

essentiality_df = pd.read_table(input_file, skiprows = 18, names = input_cols)

species = 'LT708304.1'
gff3_header = 'LT708304.1 1 4349904'
gff3_output = '../data/Mycobacterium bovis/LT708304 essentiality states assembly.gff3'

In [14]:
# For Mycobacterium tuberculosis
input_file = '../data/Mycobacterium tuberculosis/mbo002173137st2_updated.xlsx' # This is the equivalent file from DeJesus et al. (2017)
input_cols = ['Coordinate','Matches Non Permissive Motif','Num. of Datasets with Insertions',
              'Sum of Normalized Read Counts','Mean Normalized Read Count Among Replicates With Insertions',
              'Low Coverage Site','ORF ID','tRNA','rRNA','DNA-Methylation-Site','TRIT-Site','5pUTR','Promoter',
              'Essentiality State']

essentiality_df = pd.read_excel(input_file, names = input_cols)

species = 'NC_018143.1'
gff3_header = 'NC_018143.1 1 4411708'
gff3_output = '../data/Mycobacterium tuberculosis/NC_018143.1 essentiality states assembly.gff3'

Keep only the necessary columns.

In [15]:
cols_of_interest = ['Coordinate', 'Essentiality State']
essentiality_df = essentiality_df[cols_of_interest]
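
The region-finding function that follows walks the sites in order, so it can help to confirm first that the coordinates are sorted and that only the four expected state codes appear. The following is a minimal sketch on a toy table; `check_sites` and `EXPECTED_STATES` are hypothetical names introduced here for illustration, and the assumption that no other state codes occur may not hold for every input file.

```python
import pandas as pd

# Assumption: these are the only state codes expected in the table
EXPECTED_STATES = {"ES", "NE", "GA", "GD"}

def check_sites(df):
    """Return a list of problems found in a sites table (empty if none)."""
    problems = []
    # The region finder relies on ascending coordinates
    if not df["Coordinate"].is_monotonic_increasing:
        problems.append("coordinates are not sorted in ascending order")
    # Flag any state code outside the expected set
    unexpected = set(df["Essentiality State"].dropna()) - EXPECTED_STATES
    if unexpected:
        problems.append(f"unexpected states: {sorted(unexpected)}")
    return problems

toy_df = pd.DataFrame({
    "Coordinate": [10, 25, 31, 60],
    "Essentiality State": ["NE", "ES", "ES", "NE"],
})
print(check_sites(toy_df))
```

On the real data this would be called as `check_sites(essentiality_df)`.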

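Create a function to find the coordinates of every region in a given state (ES, NE, GA, or GD). For each region it returns the definite bounds (the first and last sites called in that state) and the potential bounds (from one past the preceding site to one before the next site in a different state):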

In [16]:
def find_state_coordinates(df, state):

    # Initialise lists for the definite regions and the potential regions
    state_regions = []
    potential_state_regions = []

    in_region = False
    current_coord = 0

    for index, row in df.iterrows():
        # One past the previous site: the earliest point a new region could start
        previous_coord = current_coord + 1
        current_coord = row['Coordinate']
        if row['Essentiality State'] == state:
            # If this is the first site of that state in a region...
            if not in_region:
                # The state region could potentially start from the previous coord + 1
                potential_start_coord = previous_coord
                # The state region definitely starts with the current coord
                start_coord = current_coord
                in_region = True
            # Keep updating the end_coord as the current_coord
            end_coord = current_coord
        elif in_region:
            # End of the region: it could have potentially ended at the current coord - 1
            potential_end_coord = current_coord - 1
            # Record the coordinates of the region found
            potential_state_regions.append((potential_start_coord, potential_end_coord))
            state_regions.append((start_coord, end_coord))
            in_region = False

    # Record a region that runs to the last row, which the loop alone would drop.
    # No later site bounds it, so its potential end is taken to be its definite end.
    if in_region:
        potential_state_regions.append((potential_start_coord, end_coord))
        state_regions.append((start_coord, end_coord))

    return state_regions, potential_state_regions
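
To make the difference between definite and potential bounds concrete, here is a small standalone sketch on toy data. It is not the function above: it uses `itertools.groupby` on a plain list of `(coordinate, state)` tuples, and `find_runs` is a hypothetical name. One assumption is labelled in the code: a region that reaches the last site has no later site to bound it, so its potential end is set to its definite end.

```python
from itertools import groupby

def find_runs(sites, state):
    """Return (definite, potential) region lists for consecutive runs of `state`.

    `sites` is a list of (coordinate, state) tuples sorted by coordinate.
    """
    # Collapse the sites into runs of identical state, keeping their coordinates
    runs = [(run_state, [coord for coord, _ in group])
            for run_state, group in groupby(sites, key=lambda site: site[1])]

    definite, potential = [], []
    for i, (run_state, coords) in enumerate(runs):
        if run_state != state:
            continue
        start, end = coords[0], coords[-1]
        # Potential start: one past the last site of the preceding run (or 1)
        potential_start = runs[i - 1][1][-1] + 1 if i > 0 else 1
        # Potential end: one before the first site of the following run.
        # Assumption: with no following run, fall back to the definite end.
        potential_end = runs[i + 1][1][0] - 1 if i + 1 < len(runs) else end
        definite.append((start, end))
        potential.append((potential_start, potential_end))
    return definite, potential

toy_sites = [(10, "NE"), (25, "ES"), (31, "ES"), (60, "NE"), (72, "NE"), (90, "GD")]
print(find_runs(toy_sites, "ES"))  # ([(25, 31)], [(11, 59)])
```

Here the essential region definitely spans sites 25 to 31, but could extend anywhere from 11 (one past the NE site at 10) to 59 (one before the NE site at 60).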

Individually find all of the coordinates for each essentiality state.

In [17]:
rows = []

# Display names for each essentiality state code
state_names = {
    "ES": "Essential",
    "NE": "Non-essential",
    "GA": "Growth advantage",
    "GD": "Growth defect",
}

# For each state, find its regions and generate the fields required for a GFF3 file.
for state_code, state_name in state_names.items():
    regions, potential_regions = find_state_coordinates(essentiality_df, state_code)
    for index, (definite, potential) in enumerate(zip(regions, potential_regions)):
        name = f"{state_name} region {index+1}"
        # The definite region carries the ID; the potential region refers to it as its Parent
        definite_row = {'start': definite[0], 'end': definite[1], 'type': "State",
                        'attributes': f"ID={name};state={state_name}"}
        potential_row = {'start': potential[0], 'end': potential[1], 'type': "Potential_state",
                         'attributes': f"Parent={name};state={state_name}"}
        rows.append(definite_row)
        rows.append(potential_row)

# Generate a dataframe holding only the fields computed so far
essentiality_states_df = pd.DataFrame(rows, columns=['start', 'end', 'type', 'attributes'])

Expand the dataframe to contain all information required for a GFF3 file.

In [18]:
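# The nine GFF3 columns, in the order they must appear in the file
col_names = ['seq_id', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']

essentiality_states_df['seq_id'] = species
essentiality_states_df['source'] = "HMM states"
essentiality_states_df['score'] = "."
essentiality_states_df['strand'] = "."
essentiality_states_df['phase'] = "."

# Reorder the columns so that they are written out in GFF3 order
essentiality_states_df = essentiality_states_df[col_names]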
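
For reference, a GFF3 feature line consists of nine tab-separated columns in a fixed order: seqid, source, type, start, end, score, strand, phase, and attributes. The sketch below shows how one row of fields maps onto such a line regardless of the order in which the fields were created. `format_gff3_line` and `GFF3_COLUMNS` are hypothetical names used only for illustration, and the example values are made up.

```python
# The nine GFF3 columns in the order the format requires
GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

def format_gff3_line(row):
    """Join a dict of GFF3 fields into one tab-separated feature line."""
    return "\t".join(str(row[column]) for column in GFF3_COLUMNS)

# Fields deliberately listed out of GFF3 order, as they might be when built up step by step
example_row = {
    "start": 25, "end": 31, "type": "State",
    "attributes": "ID=Essential region 1;state=Essential",
    "seq_id": "LT708304.1", "source": "HMM states",
    "score": ".", "strand": ".", "phase": ".",
}
line = format_gff3_line(example_row)
print(line)
```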

Convert to GFF3 dataframe and export GFF3.

In [19]:
header = f"##gff-version 3\n##sequence-region {gff3_header}\n"
essentiality_states_gff = gffpd.Gff3DataFrame(input_df=essentiality_states_df, input_header=header)

In [20]:
# Save the GFF3 dataframe as a file
essentiality_states_gff.to_gff3(gff3_output)
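
Before loading the file into JBrowse 2, a quick structural check can catch problems such as misordered columns. The following is a minimal sketch using only the standard library; `check_gff3_text` is a hypothetical helper that tests just a few basics (version header, nine columns, integer start and end with start <= end) and is not a full GFF3 validator.

```python
def check_gff3_text(text):
    """Return a list of (line_number, message) problems found in GFF3 text."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("##gff-version 3"):
        problems.append((1, "missing '##gff-version 3' header"))
    for number, line in enumerate(lines, start=1):
        # Skip blank lines, directives, and comments
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append((number, f"expected 9 columns, found {len(fields)}"))
            continue
        start, end = fields[3], fields[4]
        if not (start.isdigit() and end.isdigit() and int(start) <= int(end)):
            problems.append((number, "start/end must be integers with start <= end"))
    return problems

sample = (
    "##gff-version 3\n"
    "##sequence-region LT708304.1 1 4349904\n"
    "LT708304.1\tHMM states\tState\t25\t31\t.\t.\t.\tID=Essential region 1;state=Essential\n"
)
print(check_gff3_text(sample))
```

To check the exported file, the text could be read with `open(gff3_output).read()` and passed to the same function; an empty list means none of these checks failed.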