## Update gene assembly - *Mycobacterium tuberculosis only*

This script reconciles the gene assemblies __NC_018143.1__ and __NC_000962.3__ to mimic the gene assembly used in the DeJesus et al. (2017) paper.
The updated gene assembly keeps the coordinate positions from NC_018143.1 and pairs them with the gene locus IDs and names from NC_000962.3.

In [10]:
import pandas as pd
import gffpandas.gffpandas as gffpd

First, read in the gene assembly GFF3 files.

In [11]:
coordinates_gff = gffpd.read_gff3('../data/Mycobacterium tuberculosis/NC_018143.1 gene assembly.gff3')
genename_gff = gffpd.read_gff3('../data/Mycobacterium tuberculosis/NC_000962.3 gene assembly.gff3')

Convert the GFF3s to DataFrames and merge them together.

In [12]:
# Convert to dataframe where each gene attribute is a separate column
coordinates_df = coordinates_gff.attributes_to_columns()
genename_df = genename_gff.attributes_to_columns()

In [13]:
# Modify the locus tag naming schemes to match
# NC_018143.1 (coordinates_df) has a different abbreviation at the beginning, 'RVBD_' which is converted to 'Rv'
coordinates_df['locus_tag'] = coordinates_df['locus_tag'].str.replace('RVBD_', 'Rv')

# Create a dataframe of only key information of the gene IDs from genename_df
genename_df_short = genename_df[['locus_tag', 'gene', 'Name']]
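
The locus tag conversion can be checked on a few values before it is applied to the full table. The sketch below uses made-up tags in the `RVBD_` style rather than the real GFF3 data, and passes `regex=False` so that the prefix is matched as a literal string regardless of the pandas version in use.

```python
import pandas as pd

# Toy locus tags in the NC_018143.1 style (illustrative values, not read from the GFF3)
tags = pd.Series(['RVBD_0001', 'RVBD_0002'])

# regex=False treats 'RVBD_' as a literal string rather than a pattern
converted = tags.str.replace('RVBD_', 'Rv', regex=False)

print(converted.tolist())  # ['Rv0001', 'Rv0002']
```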

In [14]:
# Merge the dataframes on the locus_tag column
# Columns present in both dataframes (e.g. Name) get the suffixes _x and _y
merged_df = pd.merge(coordinates_df, genename_df_short, on='locus_tag')
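
`pd.merge` performs an inner join by default, so any locus tag found in only one of the two assemblies is silently dropped. The sketch below shows one way to list such tags using `how='outer'` and `indicator=True`. The frames `coords` and `names` are toy stand-ins for `coordinates_df` and `genename_df_short`, and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for coordinates_df and genename_df_short (values are made up)
coords = pd.DataFrame({'locus_tag': ['Rv0001', 'Rv0002', 'Rv9999'],
                       'Name': ['a', 'b', 'c'],
                       'start': [1, 2052, 5000]})
names = pd.DataFrame({'locus_tag': ['Rv0001', 'Rv0002', 'Rv0003'],
                      'Name': ['dnaA', 'dnaN', 'recF']})

# The default inner merge keeps only locus tags present in both tables;
# the shared 'Name' column is split into Name_x (left) and Name_y (right)
inner = pd.merge(coords, names, on='locus_tag')

# An outer merge with indicator=True records which table each row came from
check = pd.merge(coords, names, on='locus_tag', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']

print(len(inner))
print(sorted(unmatched['locus_tag']))
```

Here two of the three tags survive the inner merge, and `unmatched` holds the one tag unique to each table. Running the same check on the real dataframes would show how many genes are lost in the merge.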

Now, create a new 'attributes' string to replace the existing attributes column.

In [15]:
for index, row in merged_df.iterrows():

    # Generate an attribute string based on attributes of interest
    attribute_str = f"ID=gene-{row['locus_tag']};Name={row['Name_y']};locus_tag={row['locus_tag']}"

    # If a gene name is available, add to the attributes string
    if pd.notna(row['gene']):
        attribute_str += f";gene={row['gene']}"

    # Replace the existing attributes string with the new one
    merged_df.at[index, 'attributes'] = attribute_str
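
As an alternative sketch, the same attribute string can be built in a small function and applied row by row with `DataFrame.apply`. The helper name `build_attributes` and the toy dataframe are hypothetical and used only for illustration. The column names `locus_tag`, `Name_y`, and `gene` follow the loop above.

```python
import pandas as pd

def build_attributes(row):
    """Return a GFF3 attributes string for one merged row (hypothetical helper)."""
    parts = [f"ID=gene-{row['locus_tag']}",
             f"Name={row['Name_y']}",
             f"locus_tag={row['locus_tag']}"]
    # Only append the gene name when one is present
    if pd.notna(row['gene']):
        parts.append(f"gene={row['gene']}")
    return ';'.join(parts)

# Toy rows: one with a gene name, one without (values are made up)
toy = pd.DataFrame({'locus_tag': ['Rv0001', 'Rv0007'],
                    'Name_y': ['dnaA', 'Rv0007'],
                    'gene': ['dnaA', None]})

toy['attributes'] = toy.apply(build_attributes, axis=1)
print(toy['attributes'].tolist())
```

Keeping the string construction in one function makes it easy to test on a handful of rows before it is run on the whole merged table.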

Convert the merged_df to a GFF3 DataFrame and export the GFF3 file.

In [16]:
# Keep only the first nine columns (for standard GFF3 format)
merged_df = merged_df.iloc[:, :9]
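
Selecting by position relies on the nine GFF3 columns coming first. A minimal sketch of selecting them by name instead is shown below. It assumes the column names are `seq_id`, `source`, `type`, `start`, `end`, `score`, `strand`, `phase`, and `attributes`, which should be confirmed against `merged_df.columns`. The single toy row is made up for illustration.

```python
import pandas as pd

# Assumed names of the nine standard GFF3 columns; confirm against merged_df.columns
GFF3_COLUMNS = ['seq_id', 'source', 'type', 'start', 'end',
                'score', 'strand', 'phase', 'attributes']

# Toy row with two extra attribute columns appended (values are made up)
toy = pd.DataFrame(
    [['NC_018143.1', 'RefSeq', 'gene', 1, 1524, '.', '+', '.',
      'ID=gene-Rv0001', 'Rv0001', 'dnaA']],
    columns=GFF3_COLUMNS + ['locus_tag', 'gene'])

# Selecting by name does not depend on the column order
by_name = toy[GFF3_COLUMNS]
by_position = toy.iloc[:, :9]

print(by_name.equals(by_position))
```

When the nine columns do come first, both selections give the same result. Selecting by name raises a `KeyError` if a column is missing, rather than silently keeping the wrong one.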

In [17]:
# Convert the merged dataframe back to a GFF3 dataframe
header = "##gff-version 3\n##sequence-region NC_018143.1 1 4411708\n"
merged_gff = gffpd.Gff3DataFrame(input_df=merged_df, input_header=header)

In [18]:
# Save the GFF3 dataframe as a file
merged_gff.to_gff3('../data/Mycobacterium tuberculosis/NC_018143.1 gene assembly updated.gff3')