# GFF Processing

Both NIH (https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_10/) and Ensembl (https://ftp.ensembl.org/pub/release-107/gff3/homo_sapiens/) provide GFF files annotating the human genome. However, the two annotation sets are not identical, nor are they limited to protein-coding features. We therefore process them here to produce an annotation file that our package can use.

# TODO: Add gene length information (not just transcript length)?

# TODO: Add isoform information

# Libraries

In [3]:
# Libraries


# Pandas
import pandas as pd

In [4]:
# Set the data folder.
data_folder = '/media/apollo/Samsung_T5/transfer/mayur/annotations/'
# data_folder = '/home/mad1188/rvallsamples/'

# NIH Processing

First, read in the GFF file from NIH.

In [5]:
# Set the file name.
filename = 'GCF_000001405.40_GRCh38.p14_genomic.gff'

In [6]:
# NIH file.

# Skip the metadata lines (1-9).
df = pd.read_csv(
    data_folder + filename, 
    header = None,
    sep = '\t', 
    skiprows = 9
)

There is a slight quirk because of how Pandas reads in the data, so remove the last row.

In [7]:
# Drop the last row.
df = df.iloc[:-1]
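The trailing row is plausibly a `#`-prefixed directive line rather than a feature record, although that has not been verified against this file. Under that assumption, a minimal sketch of letting pandas skip such lines at read time with the `comment` parameter of `pd.read_csv` is shown below. The three-line GFF excerpt is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical GFF excerpt: one header directive, two features,
# and a closing '###' directive.
sample_gff = (
    "##gff-version 3\n"
    "NC_000001.11\tRefSeq\tregion\t1\t248956422\t.\t+\t.\tID=NC_000001.11:1..248956422\n"
    "NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tID=exon-NR_046018.2-1\n"
    "###\n"
)

# Lines that start with '#' are ignored, so no skiprows
# or last-row removal is needed for this sample.
sample_df = pd.read_csv(
    io.StringIO(sample_gff),
    header = None,
    sep = '\t',
    comment = '#'
)

print(sample_df.shape)
print(sample_df[3].dtype)
```

One caveat: `comment` also truncates a data line at the first `#` it contains, so this is only safe if no attribute value in the file contains a literal `#`.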

Set the column names.

In [8]:
# Column names.
df.columns = ['nih_molecule_accession', 'source', 'category', 'start', 'stop', 'dummy_one', 'strand', 'dummy_two', 'information']

In [9]:
df

Unnamed: 0,nih_molecule_accession,source,category,start,stop,dummy_one,strand,dummy_two,information
0,NC_000001.11,RefSeq,region,1.0,248956422.0,.,+,.,ID=NC_000001.11:1..248956422;Dbxref=taxon:9606...
1,NC_000001.11,BestRefSeq,pseudogene,11874.0,14409.0,.,+,.,"ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:H..."
2,NC_000001.11,BestRefSeq,transcript,11874.0,14409.0,.,+,.,ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=...
3,NC_000001.11,BestRefSeq,exon,11874.0,12227.0,.,+,.,ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;D...
4,NC_000001.11,BestRefSeq,exon,12613.0,12721.0,.,+,.,ID=exon-NR_046018.2-2;Parent=rna-NR_046018.2;D...
...,...,...,...,...,...,...,...,...,...
4901538,NC_012920.1,RefSeq,exon,15888.0,15953.0,.,+,.,ID=exon-TRNT-1;Parent=rna-TRNT;Dbxref=GeneID:4...
4901539,NC_012920.1,RefSeq,gene,15956.0,16023.0,.,-,.,"ID=gene-TRNP;Dbxref=GeneID:4571,HGNC:HGNC:7494..."
4901540,NC_012920.1,RefSeq,tRNA,15956.0,16023.0,.,-,.,ID=rna-TRNP;Parent=gene-TRNP;Dbxref=GeneID:457...
4901541,NC_012920.1,RefSeq,exon,15956.0,16023.0,.,-,.,ID=exon-TRNP-1;Parent=rna-TRNP;Dbxref=GeneID:4...


Immediately drop the dummy columns.

In [10]:
# No dummy columns required.
df = df.drop(columns = ['dummy_one', 'dummy_two'])

In [11]:
# set(list(df['category']))

Because we are only concerned with high-quality annotations, we restrict the records to "Complete genome" molecules, whose RefSeq accessions begin with `NC_` (see https://www.ncbi.nlm.nih.gov/assembly/help/#resultsfields).

In [12]:
# Complete genome only.
# Source: https://stackoverflow.com/a/11531402
# Match the 'NC_' prefix rather than 'NC' anywhere in the value.
df = df[df['nih_molecule_accession'].str.startswith('NC_')]
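To illustrate why the match is worth anchoring to the start of the accession, here is a small sketch comparing a substring match on `NC` with a prefix test on `NC_`. The values are illustrative only; in particular, the directive-style last entry is an assumption about what a non-feature line might look like.

```python
import pandas as pd

# Illustrative values: three RefSeq-style accessions plus a
# directive-style line that merely mentions an NC_ accession.
accessions = pd.Series([
    'NC_000001.11',
    'NT_187361.1',
    'NW_009646194.1',
    '##sequence-region NC_012920.1 1 16569',
])

# Substring match: true wherever 'NC' appears in the value.
contains_nc = accessions.str.contains('NC')

# Prefix match: true only when the value starts with 'NC_'.
starts_with_nc = accessions.str.startswith('NC_')

print(list(contains_nc))
print(list(starts_with_nc))
```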

Restrict to only exons.

In [13]:
# Basic restriction with a copy so we can add
# columns below.
restricted = df[df.category.isin(['exon'])].copy()

In [14]:
restricted

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information
3,NC_000001.11,BestRefSeq,exon,11874.0,12227.0,+,ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;D...
4,NC_000001.11,BestRefSeq,exon,12613.0,12721.0,+,ID=exon-NR_046018.2-2;Parent=rna-NR_046018.2;D...
5,NC_000001.11,BestRefSeq,exon,13221.0,14409.0,+,ID=exon-NR_046018.2-3;Parent=rna-NR_046018.2;D...
8,NC_000001.11,BestRefSeq,exon,29321.0,29370.0,-,ID=exon-NR_024540.1-1;Parent=rna-NR_024540.1;D...
9,NC_000001.11,BestRefSeq,exon,24738.0,24891.0,-,ID=exon-NR_024540.1-2;Parent=rna-NR_024540.1;D...
...,...,...,...,...,...,...,...
4901527,NC_012920.1,RefSeq,exon,14149.0,14673.0,-,ID=exon-ND6-1;Parent=rna-ND6;Dbxref=GeneID:454...
4901531,NC_012920.1,RefSeq,exon,14674.0,14742.0,-,ID=exon-TRNE-1;Parent=rna-TRNE;Dbxref=GeneID:4...
4901534,NC_012920.1,RefSeq,exon,14747.0,15887.0,+,ID=exon-CYTB-1;Parent=rna-CYTB;Dbxref=GeneID:4...
4901538,NC_012920.1,RefSeq,exon,15888.0,15953.0,+,ID=exon-TRNT-1;Parent=rna-TRNT;Dbxref=GeneID:4...


Take a look at the string structure to help us parse below.

In [15]:
list(restricted.information[:1])

['ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2']

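Go row-by-row and split on the semicolon, keeping the exon ID, the parent, the `gbkey`, the gene, the pseudogene status, and the transcript ID. Records that carry an `experiment=` attribute or a `pseudo` flag are marked `unusable` so they can be dropped later. This is **expensive** to do because the row format is not consistent across records, so each row has to be parsed individually.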

In [16]:
# No need for speculative annotations.
# Source: https://stackoverflow.com/a/39949288
# patternDel = "experiment="
# filter = restricted[3].str.contains(patternDel)

# restricted = restricted[~filter]

In [17]:
# restricted

In [18]:
# Make a dictionary to hold all the lists for the DataFrame.
dataframe_dict = {
    'ID': [],
    'exon_parent': [],
    'exon_gbkey': [],
    'exon_gene': [],
    'exon_transcript_id': []
}

# Go row-by-row.
for row in list(restricted.information):
    
    # Treat records with an 'experiment=' attribute as conjectural:
    # they are marked unusable in the else branch below.
    if 'experiment=' not in row:
    
        # Split the row.
        split_up = row.split(';')

        # Define variables to hold the information.
        ID = ''
        exon_parent = ''
        exon_gbkey = ''
        exon_gene = ''
        exon_pseudo = False
        exon_transcript_id = ''

        # Now go over each element in the split.
        for su in split_up:

            # Split on the first '=' only, since a value may contain '='.
            sub_split_up = su.split('=', 1)

            # Anything useful?
            if sub_split_up[0] == 'ID':
                ID = sub_split_up[1]
            elif sub_split_up[0] == 'Parent':
                exon_parent = sub_split_up[1]
            elif sub_split_up[0] == 'gbkey':
                exon_gbkey = sub_split_up[1]
            elif sub_split_up[0] == 'gene':
                exon_gene = sub_split_up[1]
            elif sub_split_up[0] == 'pseudo':
                exon_pseudo = True
            elif sub_split_up[0] == 'transcript_id':
                exon_transcript_id = sub_split_up[1]

        # If not a pseudogene, record the information.
        if not exon_pseudo:
            dataframe_dict['ID'].append(ID)
            dataframe_dict['exon_parent'].append(exon_parent)
            dataframe_dict['exon_gbkey'].append(exon_gbkey)
            dataframe_dict['exon_gene'].append(exon_gene)
            dataframe_dict['exon_transcript_id'].append(exon_transcript_id)
        else:
            dataframe_dict['ID'].append('unusable')
            dataframe_dict['exon_parent'].append('unusable')
            dataframe_dict['exon_gbkey'].append('unusable')
            dataframe_dict['exon_gene'].append('unusable')
            dataframe_dict['exon_transcript_id'].append('unusable')
    
    else:
        
        # Record unusable.
        dataframe_dict['ID'].append('unusable')
        dataframe_dict['exon_parent'].append('unusable')
        dataframe_dict['exon_gbkey'].append('unusable')
        dataframe_dict['exon_gene'].append('unusable')
        dataframe_dict['exon_transcript_id'].append('unusable')

# Create the DataFrame.
restricted_df = pd.DataFrame(
    data = dataframe_dict
)
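The per-field logic in the loop above can also be expressed as a small helper that turns one attribute string into a dictionary. This is a sketch, not a replacement for the cell above: `parse_attributes` is a hypothetical name, and the example string is the one printed in In [15].

```python
def parse_attributes(information):
    """Split a GFF attribute string into a {key: value} dictionary."""
    attributes = {}
    for field in information.split(';'):
        # Skip empty fields or fields without a key=value shape.
        if '=' not in field:
            continue
        # Split on the first '=' only so values may contain '='.
        key, value = field.split('=', 1)
        attributes[key] = value
    return attributes


example = (
    'ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;'
    'Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;'
    'gbkey=misc_RNA;gene=DDX11L1;'
    'product=DEAD/H-box helicase 11 like 1 (pseudogene);'
    'pseudo=true;transcript_id=NR_046018.2'
)

parsed = parse_attributes(example)

# Missing keys fall back to an empty string, as in the loop above.
print(parsed.get('gene', ''), parsed.get('transcript_id', ''))
print('pseudo' in parsed)
```

With a dictionary in hand, missing keys can be handled with `.get(key, '')`, which mirrors the empty-string defaults used in the loop.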

In [19]:
restricted_df

Unnamed: 0,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id
0,unusable,unusable,unusable,unusable,unusable
1,unusable,unusable,unusable,unusable,unusable
2,unusable,unusable,unusable,unusable,unusable
3,unusable,unusable,unusable,unusable,unusable
4,unusable,unusable,unusable,unusable,unusable
...,...,...,...,...,...
2111872,exon-ND6-1,rna-ND6,mRNA,ND6,
2111873,exon-TRNE-1,rna-TRNE,tRNA,TRNE,
2111874,exon-CYTB-1,rna-CYTB,mRNA,CYTB,
2111875,exon-TRNT-1,rna-TRNT,tRNA,TRNT,


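Do a quick and dirty "join" which avoids the headache of the index by assigning the new columns as plain lists. This relies on `restricted_df` having exactly the same row order as `restricted`.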

In [20]:
restricted['ID'] = list(restricted_df['ID'])
restricted['exon_parent'] = list(restricted_df['exon_parent'])
restricted['exon_gbkey'] = list(restricted_df['exon_gbkey'])
restricted['exon_gene'] = list(restricted_df['exon_gene'])
restricted['exon_transcript_id'] = list(restricted_df['exon_transcript_id'])
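An alternative to assigning lists is to give the parsed frame the same index as the original and then join on it. The sketch below uses toy frames with made-up values (`left` and `right` are hypothetical stand-ins for `restricted` and `restricted_df`); it assumes, as the list assignment does, that both frames have the same number of rows in the same order.

```python
import pandas as pd

# Toy stand-ins: 'left' keeps a non-contiguous index after
# filtering, 'right' has a fresh 0..n-1 index.
left = pd.DataFrame(
    {'start': [11874, 12613, 13221]},
    index = [3, 4, 5]
)
right = pd.DataFrame(
    {'ID': ['a', 'b', 'c'], 'exon_gene': ['g1', 'g1', 'g2']}
)

# Reuse the left index so the rows align by label.
right.index = left.index

# Join on the shared index.
joined = left.join(right)
print(joined)
```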

In [21]:
restricted

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id
3,NC_000001.11,BestRefSeq,exon,11874.0,12227.0,+,ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;D...,unusable,unusable,unusable,unusable,unusable
4,NC_000001.11,BestRefSeq,exon,12613.0,12721.0,+,ID=exon-NR_046018.2-2;Parent=rna-NR_046018.2;D...,unusable,unusable,unusable,unusable,unusable
5,NC_000001.11,BestRefSeq,exon,13221.0,14409.0,+,ID=exon-NR_046018.2-3;Parent=rna-NR_046018.2;D...,unusable,unusable,unusable,unusable,unusable
8,NC_000001.11,BestRefSeq,exon,29321.0,29370.0,-,ID=exon-NR_024540.1-1;Parent=rna-NR_024540.1;D...,unusable,unusable,unusable,unusable,unusable
9,NC_000001.11,BestRefSeq,exon,24738.0,24891.0,-,ID=exon-NR_024540.1-2;Parent=rna-NR_024540.1;D...,unusable,unusable,unusable,unusable,unusable
...,...,...,...,...,...,...,...,...,...,...,...,...
4901527,NC_012920.1,RefSeq,exon,14149.0,14673.0,-,ID=exon-ND6-1;Parent=rna-ND6;Dbxref=GeneID:454...,exon-ND6-1,rna-ND6,mRNA,ND6,
4901531,NC_012920.1,RefSeq,exon,14674.0,14742.0,-,ID=exon-TRNE-1;Parent=rna-TRNE;Dbxref=GeneID:4...,exon-TRNE-1,rna-TRNE,tRNA,TRNE,
4901534,NC_012920.1,RefSeq,exon,14747.0,15887.0,+,ID=exon-CYTB-1;Parent=rna-CYTB;Dbxref=GeneID:4...,exon-CYTB-1,rna-CYTB,mRNA,CYTB,
4901538,NC_012920.1,RefSeq,exon,15888.0,15953.0,+,ID=exon-TRNT-1;Parent=rna-TRNT;Dbxref=GeneID:4...,exon-TRNT-1,rna-TRNT,tRNA,TRNT,


Get rid of anything unusable.

In [22]:
# Only usable material.
restricted = restricted[restricted.ID != 'unusable'].copy()

In [23]:
restricted

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id
21,NC_000001.11,BestRefSeq,exon,17369.0,17436.0,-,ID=exon-NR_106918.1-1;Parent=rna-NR_106918.1;D...,exon-NR_106918.1-1,rna-NR_106918.1,precursor_RNA,MIR6859-1,NR_106918.1
23,NC_000001.11,BestRefSeq,exon,17369.0,17391.0,-,ID=exon-MIR6859-1-1;Parent=rna-MIR6859-1;Dbxre...,exon-MIR6859-1-1,rna-MIR6859-1,ncRNA,MIR6859-1,
25,NC_000001.11,BestRefSeq,exon,17409.0,17431.0,-,ID=exon-MIR6859-1-2-1;Parent=rna-MIR6859-1-2;D...,exon-MIR6859-1-2-1,rna-MIR6859-1-2,ncRNA,MIR6859-1,
30,NC_000001.11,Gnomon,exon,29774.0,30667.0,+,ID=exon-XR_007065314.1-1;Parent=rna-XR_0070653...,exon-XR_007065314.1-1,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1
31,NC_000001.11,Gnomon,exon,30976.0,31093.0,+,ID=exon-XR_007065314.1-2;Parent=rna-XR_0070653...,exon-XR_007065314.1-2,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1
...,...,...,...,...,...,...,...,...,...,...,...,...
4901527,NC_012920.1,RefSeq,exon,14149.0,14673.0,-,ID=exon-ND6-1;Parent=rna-ND6;Dbxref=GeneID:454...,exon-ND6-1,rna-ND6,mRNA,ND6,
4901531,NC_012920.1,RefSeq,exon,14674.0,14742.0,-,ID=exon-TRNE-1;Parent=rna-TRNE;Dbxref=GeneID:4...,exon-TRNE-1,rna-TRNE,tRNA,TRNE,
4901534,NC_012920.1,RefSeq,exon,14747.0,15887.0,+,ID=exon-CYTB-1;Parent=rna-CYTB;Dbxref=GeneID:4...,exon-CYTB-1,rna-CYTB,mRNA,CYTB,
4901538,NC_012920.1,RefSeq,exon,15888.0,15953.0,+,ID=exon-TRNT-1;Parent=rna-TRNT;Dbxref=GeneID:4...,exon-TRNT-1,rna-TRNT,tRNA,TRNT,


See what actually has a transcript.

In [24]:
# Only concerned with known transcripts.
has_transcript = restricted[restricted.exon_transcript_id != ''].copy()

In [25]:
has_transcript

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id
21,NC_000001.11,BestRefSeq,exon,17369.0,17436.0,-,ID=exon-NR_106918.1-1;Parent=rna-NR_106918.1;D...,exon-NR_106918.1-1,rna-NR_106918.1,precursor_RNA,MIR6859-1,NR_106918.1
30,NC_000001.11,Gnomon,exon,29774.0,30667.0,+,ID=exon-XR_007065314.1-1;Parent=rna-XR_0070653...,exon-XR_007065314.1-1,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1
31,NC_000001.11,Gnomon,exon,30976.0,31093.0,+,ID=exon-XR_007065314.1-2;Parent=rna-XR_0070653...,exon-XR_007065314.1-2,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1
32,NC_000001.11,Gnomon,exon,34168.0,35418.0,+,ID=exon-XR_007065314.1-3;Parent=rna-XR_0070653...,exon-XR_007065314.1-3,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1
35,NC_000001.11,BestRefSeq,exon,30366.0,30503.0,+,ID=exon-NR_036051.1-1;Parent=rna-NR_036051.1;D...,exon-NR_036051.1-1,rna-NR_036051.1,precursor_RNA,MIR1302-2,NR_036051.1
...,...,...,...,...,...,...,...,...,...,...,...,...
4464014,NC_000024.10,Gnomon,exon,57192325.0,57192537.0,+,ID=exon-XM_017030055.2-1;Parent=rna-XM_0170300...,exon-XM_017030055.2-1,rna-XM_017030055.2,mRNA,IL9R,XM_017030055.2
4464015,NC_000024.10,Gnomon,exon,57194043.0,57194127.0,+,ID=exon-XM_017030055.2-2;Parent=rna-XM_0170300...,exon-XM_017030055.2-2,rna-XM_017030055.2,mRNA,IL9R,XM_017030055.2
4464016,NC_000024.10,Gnomon,exon,57196336.0,57197337.0,+,ID=exon-XM_017030055.2-3;Parent=rna-XM_0170300...,exon-XM_017030055.2-3,rna-XM_017030055.2,mRNA,IL9R,XM_017030055.2
4464022,NC_000024.10,BestRefSeq,exon,57203182.0,57203350.0,-,ID=exon-NR_138048.1-2-1;Parent=rna-NR_138048.1...,exon-NR_138048.1-2-1,rna-NR_138048.1-2,ncRNA,WASIR1,NR_138048.1


Pick a gene to check our work.

In [26]:
# TET2 check.

# Note that for predicted products we do not have
# start, stop, or strand information even though
# the gene is known.
restricted[restricted['exon_gene'] == 'TET2']

# with pd.option_context(
#     'display.max_rows', None,
#     'display.max_columns', None,
#     'display.precision', 3,
# ):
#     print(
#         restricted[restricted['exon_gene'] == 'TET2']
#     )

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id
1233582,NC_000004.12,Gnomon,exon,105146876.0,105146979.0,+,ID=exon-XR_007057933.1-1;Parent=rna-XR_0070579...,exon-XR_007057933.1-1,rna-XR_007057933.1,misc_RNA,TET2,XR_007057933.1
1233583,NC_000004.12,Gnomon,exon,105190360.0,105190505.0,+,ID=exon-XR_007057933.1-2;Parent=rna-XR_0070579...,exon-XR_007057933.1-2,rna-XR_007057933.1,misc_RNA,TET2,XR_007057933.1
1233584,NC_000004.12,Gnomon,exon,105233897.0,105237351.0,+,ID=exon-XR_007057933.1-3;Parent=rna-XR_0070579...,exon-XR_007057933.1-3,rna-XR_007057933.1,misc_RNA,TET2,XR_007057933.1
1233585,NC_000004.12,Gnomon,exon,105241339.0,105242098.0,+,ID=exon-XR_007057933.1-4;Parent=rna-XR_0070579...,exon-XR_007057933.1-4,rna-XR_007057933.1,misc_RNA,TET2,XR_007057933.1
1233586,NC_000004.12,Gnomon,exon,105242834.0,105242916.0,+,ID=exon-XR_007057933.1-5;Parent=rna-XR_0070579...,exon-XR_007057933.1-5,rna-XR_007057933.1,misc_RNA,TET2,XR_007057933.1
...,...,...,...,...,...,...,...,...,...,...,...,...
1233725,NC_000004.12,Gnomon,exon,105146876.0,105146979.0,+,ID=exon-XM_047415842.1-1;Parent=rna-XM_0474158...,exon-XM_047415842.1-1,rna-XM_047415842.1,mRNA,TET2,XM_047415842.1
1233726,NC_000004.12,Gnomon,exon,105190360.0,105190505.0,+,ID=exon-XM_047415842.1-2;Parent=rna-XM_0474158...,exon-XM_047415842.1-2,rna-XM_047415842.1,mRNA,TET2,XM_047415842.1
1233727,NC_000004.12,Gnomon,exon,105233897.0,105237351.0,+,ID=exon-XM_047415842.1-3;Parent=rna-XM_0474158...,exon-XM_047415842.1-3,rna-XM_047415842.1,mRNA,TET2,XM_047415842.1
1233728,NC_000004.12,Gnomon,exon,105238514.0,105238713.0,+,ID=exon-XM_047415842.1-4;Parent=rna-XM_0474158...,exon-XM_047415842.1-4,rna-XM_047415842.1,mRNA,TET2,XM_047415842.1


Now we can annotate confirmed vs. predicted RNAs.

In [27]:
# Predicted - yes or no?

# Source: https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly
predicted = []

for i in list(restricted['exon_transcript_id']):
    
    # Anything there?
    if len(i) > 0:
    
        # Split and identify.
        split_up = i.split('_')[0]

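        if split_up in ['NM', 'NR']:
            predicted.append('no')
        elif split_up in ['XM', 'XR']:
            predicted.append('yes')
        else:
            # Unknown prefix: still append a value so the list
            # length matches the DataFrame.
            predicted.append('not_available')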
    
    else:
        
        # pd.NA has inconsistent behavior here,
        # so set a manual missing value.
        predicted.append('not_available')

# Add to the DataFrame.
restricted['predicted'] = predicted
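The same labelling can be sketched without an explicit loop by mapping the accession prefix through a dictionary. The sample Series and the name `prefix_to_label` are illustrative; the prefixes and labels are the ones used in the cell above, and any unmatched or empty prefix falls back to `'not_available'`.

```python
import pandas as pd

# Sample transcript IDs, including an empty one.
transcript_ids = pd.Series([
    'NM_017628.4',
    'XR_007057933.1',
    '',
    'NR_046018.2',
    'XM_047415842.1',
])

# Curated prefixes are 'no', model prefixes are 'yes'.
prefix_to_label = {'NM': 'no', 'NR': 'no', 'XM': 'yes', 'XR': 'yes'}

# Take the text before the first underscore, then map it.
prefixes = transcript_ids.str.split('_').str[0]
labels = prefixes.map(prefix_to_label).fillna('not_available')

print(list(labels))
```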

In [28]:
restricted

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id,predicted
21,NC_000001.11,BestRefSeq,exon,17369.0,17436.0,-,ID=exon-NR_106918.1-1;Parent=rna-NR_106918.1;D...,exon-NR_106918.1-1,rna-NR_106918.1,precursor_RNA,MIR6859-1,NR_106918.1,no
23,NC_000001.11,BestRefSeq,exon,17369.0,17391.0,-,ID=exon-MIR6859-1-1;Parent=rna-MIR6859-1;Dbxre...,exon-MIR6859-1-1,rna-MIR6859-1,ncRNA,MIR6859-1,,not_available
25,NC_000001.11,BestRefSeq,exon,17409.0,17431.0,-,ID=exon-MIR6859-1-2-1;Parent=rna-MIR6859-1-2;D...,exon-MIR6859-1-2-1,rna-MIR6859-1-2,ncRNA,MIR6859-1,,not_available
30,NC_000001.11,Gnomon,exon,29774.0,30667.0,+,ID=exon-XR_007065314.1-1;Parent=rna-XR_0070653...,exon-XR_007065314.1-1,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1,yes
31,NC_000001.11,Gnomon,exon,30976.0,31093.0,+,ID=exon-XR_007065314.1-2;Parent=rna-XR_0070653...,exon-XR_007065314.1-2,rna-XR_007065314.1,ncRNA,MIR1302-2HG,XR_007065314.1,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4901527,NC_012920.1,RefSeq,exon,14149.0,14673.0,-,ID=exon-ND6-1;Parent=rna-ND6;Dbxref=GeneID:454...,exon-ND6-1,rna-ND6,mRNA,ND6,,not_available
4901531,NC_012920.1,RefSeq,exon,14674.0,14742.0,-,ID=exon-TRNE-1;Parent=rna-TRNE;Dbxref=GeneID:4...,exon-TRNE-1,rna-TRNE,tRNA,TRNE,,not_available
4901534,NC_012920.1,RefSeq,exon,14747.0,15887.0,+,ID=exon-CYTB-1;Parent=rna-CYTB;Dbxref=GeneID:4...,exon-CYTB-1,rna-CYTB,mRNA,CYTB,,not_available
4901538,NC_012920.1,RefSeq,exon,15888.0,15953.0,+,ID=exon-TRNT-1;Parent=rna-TRNT;Dbxref=GeneID:4...,exon-TRNT-1,rna-TRNT,tRNA,TRNT,,not_available


Pick a gene to check our work.

In [29]:
# TET2 check.

# Note that for predicted products we do not have
# start, stop, or strand information even though
# the gene is known.
tet_2_check = restricted[restricted['exon_gene'] == 'TET2']
tet_2_check = tet_2_check[tet_2_check['predicted'] == 'no']

# with pd.option_context(
#     'display.max_rows', None,
#     'display.max_columns', None,
#     'display.precision', 3,
# ):
#     print(
#         restricted[restricted['exon_gene'] == 'TET2']
#     )

In [30]:
tet_2_check

Unnamed: 0,nih_molecule_accession,source,category,start,stop,strand,information,ID,exon_parent,exon_gbkey,exon_gene,exon_transcript_id,predicted
1233609,NC_000004.12,BestRefSeq,exon,105145875.0,105146086.0,+,ID=exon-NM_017628.4-1;Parent=rna-NM_017628.4;D...,exon-NM_017628.4-1,rna-NM_017628.4,mRNA,TET2,NM_017628.4,no
1233610,NC_000004.12,BestRefSeq,exon,105190360.0,105190505.0,+,ID=exon-NM_017628.4-2;Parent=rna-NM_017628.4;D...,exon-NM_017628.4-2,rna-NM_017628.4,mRNA,TET2,NM_017628.4,no
1233611,NC_000004.12,BestRefSeq,exon,105233897.0,105242771.0,+,ID=exon-NM_017628.4-3;Parent=rna-NM_017628.4;D...,exon-NM_017628.4-3,rna-NM_017628.4,mRNA,TET2,NM_017628.4,no
1233636,NC_000004.12,BestRefSeq,exon,105146876.0,105146979.0,+,ID=exon-NM_001127208.3-1;Parent=rna-NM_0011272...,exon-NM_001127208.3-1,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no
1233637,NC_000004.12,BestRefSeq,exon,105190360.0,105190505.0,+,ID=exon-NM_001127208.3-2;Parent=rna-NM_0011272...,exon-NM_001127208.3-2,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no
1233638,NC_000004.12,BestRefSeq,exon,105233897.0,105237351.0,+,ID=exon-NM_001127208.3-3;Parent=rna-NM_0011272...,exon-NM_001127208.3-3,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no
1233639,NC_000004.12,BestRefSeq,exon,105241339.0,105241429.0,+,ID=exon-NM_001127208.3-4;Parent=rna-NM_0011272...,exon-NM_001127208.3-4,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no
1233640,NC_000004.12,BestRefSeq,exon,105242834.0,105242927.0,+,ID=exon-NM_001127208.3-5;Parent=rna-NM_0011272...,exon-NM_001127208.3-5,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no
1233641,NC_000004.12,BestRefSeq,exon,105243570.0,105243778.0,+,ID=exon-NM_001127208.3-6;Parent=rna-NM_0011272...,exon-NM_001127208.3-6,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no
1233642,NC_000004.12,BestRefSeq,exon,105259619.0,105259769.0,+,ID=exon-NM_001127208.3-7;Parent=rna-NM_0011272...,exon-NM_001127208.3-7,rna-NM_001127208.3,mRNA,TET2,NM_001127208.3,no


Everything looks good, so write to file.

In [None]:
# Write it out.
restricted.to_csv(
    data_folder + filename + '.exon.processed',
    sep = '\t',
    index = False
)