Author: Ronny F. Pacheco Date: Sep 2024
Copyright: © 2024 Ronny Pacheco License: MIT License

---

MIT License

Copyright (c) 2024 Ronny Pacheco

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

# Needed modules

In [1]:
# Load the needed libraries
import pickle
import os

import pandas as pd
import numpy as np
import json

In [2]:
# https://kioku-space.com/en/jupyter-skip-execution/
from IPython.core.magic import register_cell_magic # type: ignore


@register_cell_magic
def skip(line, cell):
    print('Skipping cell')
    if line and cell:
        pass
    return

# Pickle save

In [3]:
%%skip
# =============================================================================
# main function
# =============================================================================
def data_save_load(option):
    """
    This function is used to save or load data for the jupyter notebook
    """
    path_folder = "ipynb_db"  # Folder to save variables
    os.makedirs(path_folder, exist_ok=True)  # Create folder if not exist
    notebook_name = os.path.basename(os.path.abspath(''))
    path_file = os.path.join(path_folder, f"{notebook_name}.variables.pkl") # Path to save the variables

    if option == "save":
        with open(path_file, "wb") as pickle_file:
            dict_variables = {
                "neg_to_filter" : neg_to_filter
            }
            pickle.dump(dict_variables, pickle_file)
    elif option == "load":
        with open(path_file, "rb") as pickle_file:
            variables = pickle.load(pickle_file)
        # Now load the variables
        for pickle_key, pickle_value in variables.items():
            print(f"* Loading variable: {pickle_key}")
            globals()[pickle_key] = pickle_value
# =============================================================================
# Call the function
# =============================================================================
data_save_load(option="load")

* Loading variable: neg_to_filter


# Prepare Data

## Load data

### Load negative (or rejected) elements

In [81]:
# Let's load the negative data
neg_df = pd.read_csv("./data/negative_database_nomatch_corrected_named.csv", sep=",", header=0)
print(neg_df.shape)
print(neg_df.dtypes)
neg_df.head()

(682, 6)
sseqid     object
sstart      int64
send        int64
sstrand    object
sseq       object
name       object
dtype: object


Unnamed: 0,sseqid,sstart,send,sstrand,sseq,name
0,LinJ.01,36103,36242,plus,AGACAGACCGACACACGCAGCCGTGTGATGCCGCCGCCGAGGGCAG...,rejected_noCDS_c01.10
1,LinJ.01,113760,114388,plus,CAGCGCCATGCACGACATGGCCGCTGACGTCCGTAGCCCTAACTCG...,rejected_noCDS_c01.20
2,LinJ.01,146412,146530,plus,GCGAATTGTGTTCTGCGCATGCCTCTTCTCTGCCGTGCAGCATGCG...,rejected_noCDS_c01.30
3,LinJ.01,261866,262439,plus,CGGACTTGGCAAGTGGCCGCCATCGATGAAAACGCACCATGCCTTT...,rejected_noCDS_c01.40
4,LinJ.01,271363,271650,plus,CGAACGCCGCCCTCAATCGCGCGCTGAACTTCACGCGGCGGTCGAC...,rejected_noCDS_c01.50


### GTF data
This one will be **harder** to prepare

In [82]:
# Load data
gtf_df = pd.read_csv("./data/20240703111001_LINF-Tabla_maestra_v3-20244_RP_v0.8.gtf", sep="\t", header=None) 
print(gtf_df.shape)
print(gtf_df.dtypes)
gtf_df.head()

(45368, 9)
0    object
1    object
2    object
3     int64
4     int64
5    object
6    object
7    object
8    object
dtype: object


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,LinJ.01,CBM,gene,1520,5066,.,-,.,"gene_id ""LINF_010005000""; gene_name ""LINF_0100..."
1,LinJ.01,CBM,transcript,1520,5066,.,-,.,"parent_id ""LINF_010005000""; transcript_id ""LIN..."
2,LinJ.01,CBM,CDS,3710,4711,.,-,.,"parent_id ""LINF_01T0005000""; transcript_id ""LI..."
3,LinJ.01,CBM,3utr,1520,3709,.,-,.,"parent_id ""LINF_01T0005000"";"
4,LinJ.01,CBM,5utr,4712,5066,.,-,.,"parent_id ""LINF_01T0005000"";"


From `gtf_df` I only need columns 0, 2, 3, 4, 6 and 8

In [83]:
# Get from `gtf_df` the needed columns [0, 2, 3, 4, 6, 8]
gtf_df = gtf_df[[0, 2, 3, 4, 6, 8]]
gtf_df.columns = ["chrom", "feature", "start", "end", "strand", "attributes"]
print(gtf_df.shape)
print(gtf_df.dtypes)
gtf_df.head()

(45368, 6)
chrom         object
feature       object
start          int64
end            int64
strand        object
attributes    object
dtype: object


Unnamed: 0,chrom,feature,start,end,strand,attributes
0,LinJ.01,gene,1520,5066,-,"gene_id ""LINF_010005000""; gene_name ""LINF_0100..."
1,LinJ.01,transcript,1520,5066,-,"parent_id ""LINF_010005000""; transcript_id ""LIN..."
2,LinJ.01,CDS,3710,4711,-,"parent_id ""LINF_01T0005000""; transcript_id ""LI..."
3,LinJ.01,3utr,1520,3709,-,"parent_id ""LINF_01T0005000"";"
4,LinJ.01,5utr,4712,5066,-,"parent_id ""LINF_01T0005000"";"


The `attributes` field is separated by ";" and each entry has the format `header "data"`. We are going to split the `attributes` column into multiple columns.
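As a minimal sketch of that `header "data"` layout, the helper below splits one attribute string into a dictionary. The function name `parse_gtf_attributes` and the example string are assumptions made for illustration (the string is modeled on the rows shown above); this is not the code the notebook uses.

```python
def parse_gtf_attributes(attributes: str) -> dict:
    """Split a GTF-style attribute string into a {key: value} dict."""
    parsed = {}
    for field in attributes.split(";"):
        field = field.strip()
        if not field:  # skip the empty piece left after the trailing ";"
            continue
        # Split only on the first space so a value containing spaces stays whole
        key, _, value = field.partition(" ")
        parsed[key] = value.strip().strip('"')
    return parsed


# Hypothetical example string, modeled on the rows shown above
example = 'parent_id "LINF_010005000"; transcript_id "LINF_01T0005000";'
print(parse_gtf_attributes(example))
```

Splitting on the first space only is a design choice: it keeps a value intact even if it contains spaces, which a plain `split(" ")[1]` would truncate.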

#### Transforming columns

First, get all the keys that appear in the `attributes` column

In [84]:
# Let's count first the number of elements in the `attributes` column
atr_dict = {}
for index, row in gtf_df.iterrows():
    # print(index, ":", sep="")
    for atr in row["attributes"].split(";"):
        atr = atr.strip()  # Remove leading and trailing whitespaces
        if len(atr.strip()) == 0:  # Skip empty attribute ""
            continue
        # print(f"\t{'-'*50}")
        # print(f"\tattribute: {atr.strip()}")
        key = atr.split(" ")[0] 
        if key not in atr_dict:
            atr_dict[key] = 1

        else:
            atr_dict[key] += 1
        # print(f"\t{atr_dict}")
print(atr_dict)

{'gene_id': 9861, 'gene_name': 9861, 'biotype': 17295, 'notes': 17319, 'parent_id': 35507, 'transcript_id': 18215, 'transcript_name': 9660, 'pseudogen': 49}


In [85]:
# get a list with the keys of atr_dict
atr_keys = list(atr_dict.keys())
print(atr_keys)

['gene_id', 'gene_name', 'biotype', 'notes', 'parent_id', 'transcript_id', 'transcript_name', 'pseudogen']


Now we have a list with all the attribute keys. When iterating over each row in the next steps, we can check whether each of these keys appears and, if not, add it with a `None` value.
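The same idea can be sketched on a toy input: count the keys with `collections.Counter`, then let pandas fill in the keys a row lacks. The `records` list below is invented for illustration, and pandas marks missing values as `NaN` rather than `None`, so this is an alternative sketch and not the notebook's own loop.

```python
from collections import Counter

import pandas as pd

# Toy records standing in for the parsed attributes of two GTF rows (made up)
records = [
    {"gene_id": "LINF_010005000", "biotype": "protein_coding"},
    {"parent_id": "LINF_01T0005000"},
]

# Count how many rows carry each key
key_counts = Counter(key for record in records for key in record)
all_keys = list(key_counts)

# Building a DataFrame from dicts fills absent keys with NaN;
# reindex makes the column order explicit
attr_df = pd.DataFrame(records).reindex(columns=all_keys)
print(key_counts)
print(attr_df)
```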

In [86]:
# Now that we have the attributes count, let's create a dict for each row in `gtf_df` with the attributes separated
new_col_df = []
for index, row in gtf_df.iterrows():
    # print(index, ":", sep="")
    pre_data = []
    for atr in row["attributes"].split(";"):
        atr = atr.strip()  # Remove leading and trailing whitespaces
        if len(atr.strip()) == 0:  # Skip empty attribute ""
            continue
        key = atr.split(" ")[0]
        value = atr.split(" ")[1].replace('"', "")
        pre_data.append({key: value})
    
    for elem in atr_keys: # type: ignore  # Checking if the elements from atr_keys
        if elem not in [list(elem.keys())[0] for elem in pre_data]:  # If the element is not in pre_data, add it with value None
            # noinspection PyUnresolvedReferences
            pre_data.append({elem: None})

    flattened_data = {key: value for sublist in pre_data for key, value in sublist.items()}
    new_col_df.append(flattened_data)

In [87]:
# Checking how it worked
new_col_df  

[{'gene_id': 'LINF_010005000',
  'gene_name': 'LINF_010005000',
  'biotype': 'protein_coding',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'parent_id': None,
  'transcript_id': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_010005000',
  'transcript_id': 'LINF_01T0005000',
  'transcript_name': 'LINF_01T0005000',
  'biotype': 'protein_coding',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'gene_id': None,
  'gene_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'transcript_id': 'LINF_01T0005000',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'notes': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'notes': None,
  'transcript_id': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'notes': None,
  'transcri

In [88]:
# Transforming the list of dicts into a DataFrame
new_col_df = pd.DataFrame(new_col_df)
new_col_df

Unnamed: 0,gene_id,gene_name,biotype,notes,parent_id,transcript_id,transcript_name,pseudogen
0,LINF_010005000,LINF_010005000,protein_coding,Protein_of_unknown_function_(DUF2946),,,,
1,,,protein_coding,Protein_of_unknown_function_(DUF2946),LINF_010005000,LINF_01T0005000,LINF_01T0005000,
2,,,,,LINF_01T0005000,LINF_01T0005000,,
3,,,,,LINF_01T0005000,,,
4,,,,,LINF_01T0005000,,,
...,...,...,...,...,...,...,...,...
45363,,,,,LINF_36T0082400,LINF_36T0082400,,
45364,,,,,LINF_36T0082400,,,
45365,,,,,LINF_36T0082400,,,
45366,LINF_360082500,LINF_360082500,,,,,,


In [89]:
# Let's re-order the columns
new_col_df = new_col_df[["gene_id", "gene_name", "transcript_id", "transcript_name", "biotype", "parent_id", "pseudogen", "notes"]]
new_col_df

Unnamed: 0,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,,,LINF_01T0005000,,,LINF_01T0005000,,
3,,,,,,LINF_01T0005000,,
4,,,,,,LINF_01T0005000,,
...,...,...,...,...,...,...,...,...
45363,,,LINF_36T0082400,,,LINF_36T0082400,,
45364,,,,,,LINF_36T0082400,,
45365,,,,,,LINF_36T0082400,,
45366,LINF_360082500,LINF_360082500,,,,,,


In [90]:
# Concatenating the new DataFrame with the original `gtf_df` and dropping the `attributes` column
gtf_df = pd.concat([gtf_df, new_col_df], axis=1)
gtf_df.drop(columns="attributes", inplace=True)
gtf_df

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45363,LinJ.36,CDS,2739458,2740183,-,,,LINF_36T0082400,,,LINF_36T0082400,,
45364,LinJ.36,3utr,2738595,2739457,-,,,,,,LINF_36T0082400,,
45365,LinJ.36,5utr,2740184,2740374,-,,,,,,LINF_36T0082400,,
45366,LinJ.36,gene,2740760,2742268,-,LINF_360082500,LINF_360082500,,,,,,


# Compare coordinates

In this next part we are going to compare coordinates to find which elements in **neg_df** lie inside which elements in **gtf_df**.

## Fail proof the data

In [91]:
# Copy data to make some fail-proof analysis
neg_df_test = neg_df.copy()
gtf_df_test = gtf_df.copy()

# Let's check the shapes
print(f"Shape of neg_df: {neg_df.shape}")
print(f"Shape of gtf_df: {gtf_df.shape}")

Shape of neg_df: (682, 6)
Shape of gtf_df: (45368, 13)


In [92]:
# Let's start by inspecting gtf_df_test:
gtf_df_test.head()

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,


In [93]:
# Check elements where start < end
num_elements_start_less_end = (gtf_df_test['start'] < gtf_df_test['end']).sum()  # type: ignore
print(f"There are {num_elements_start_less_end} elements where start < end.")

# Check elements where start > end
num_elements_start_greater_end = (gtf_df_test['start'] > gtf_df_test['end']).sum()  # type: ignore
print(f"There are {num_elements_start_greater_end} elements where start > end.")

# Check elements where start == end
num_elements_start_equal_end = (gtf_df_test['start'] == gtf_df_test['end']).sum()  # type: ignore
print(f"There are {num_elements_start_equal_end} elements where start == end.")


There are 45366 elements where start < end.
There are 0 elements where start > end.
There are 2 elements where start == end.


In [94]:
# Let's check the rows where start == end
gtf_df_test[gtf_df_test['start'] == gtf_df_test['end']]


Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
495,LinJ.02,5utr,28840,28840,-,,,,,,LINF_02T0005800,,
33221,LinJ.32,5utr,1041948,1041948,-,,,,,,LINF_32T0033400,,


In [95]:
# Let's check the rows where transcript_id, transcript_name or parent_id is "LINF_02T0005800", or where gene_id is the same ID without the "T"
gtf_df_test[(gtf_df_test['transcript_id'] == "LINF_02T0005800") | 
            (gtf_df_test['transcript_name'] == "LINF_02T0005800") | 
            (gtf_df_test['parent_id'] == "LINF_02T0005800") |
            (gtf_df_test['gene_id'] == "LINF_020005800")]

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
491,LinJ.02,gene,27302,28840,-,LINF_020005800,LINF_020005800,,,protein_coding,,,hypothetical_protein_-_conserved
492,LinJ.02,transcript,27302,28840,-,,,LINF_02T0005800,LINF_02T0005800,protein_coding,LINF_020005800,,hypothetical_protein_-_conserved
493,LinJ.02,CDS,27895,28839,-,,,LINF_02T0005800,,,LINF_02T0005800,,
494,LinJ.02,3utr,27302,27894,-,,,,,,LINF_02T0005800,,
495,LinJ.02,5utr,28840,28840,-,,,,,,LINF_02T0005800,,


We can see that the CDS extends all the way to the end of the transcript except for one base. That single base is assigned to the 5'UTR.
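The arithmetic can be checked directly. GTF coordinates are 1-based and inclusive at both ends, so a feature's length is `end - start + 1`. The sketch below uses the coordinates of the `LINF_02T0005800` rows shown above; the helper name `feature_length` is an assumption made for illustration.

```python
# Coordinates copied from the LINF_02T0005800 rows shown above
features = {
    "transcript": (27302, 28840),
    "CDS": (27895, 28839),
    "3utr": (27302, 27894),
    "5utr": (28840, 28840),
}


def feature_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


lengths = {name: feature_length(*coords) for name, coords in features.items()}
print(lengths)
# → {'transcript': 1539, 'CDS': 945, '3utr': 593, '5utr': 1}

# The three parts tile the transcript exactly
assert lengths["3utr"] + lengths["CDS"] + lengths["5utr"] == lengths["transcript"]
```

A row with `start == end` is therefore a valid feature that is one base long, not an empty one.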

In [96]:
# Check the "feature" elements:
condition =(gtf_df_test['start'] < gtf_df_test['end'])
gtf_df_test[condition]['feature'].value_counts()

feature
gene          9861
transcript    9660
CDS           8744
3utr          8554
5utr          8547
Name: count, dtype: int64

Interesting: there should be the same number of 5utr and 3utr features, but the counts differ.

In [97]:
# Checking without condition
gtf_df_test['feature'].value_counts()

feature
gene          9861
transcript    9660
CDS           8744
3utr          8554
5utr          8549
Name: count, dtype: int64

In [98]:
# Checking which elements lack a 3utr or a 5utr
parent_feature_dict = gtf_df_test.groupby('parent_id')['feature'].apply(list).to_dict()
filtered_dict = {k: v for k, v in parent_feature_dict.items() if v not in (['transcript'], 
                                                                           ['CDS'], 
                                                                           ['CDS', '5utr', '3utr'], 
                                                                           ['CDS', '3utr', '5utr'],
                                                                           ['CDS', '3utr', '5utr', 'CDS', '3utr', '5utr'],
                                                                           ['transcript', 'transcript'],
                                                                           ['CDS', '5utr', '3utr', 'CDS', '5utr', '3utr'])}
filtered_dict

{'LINF_27T0013600': ['CDS', '5utr'],
 'LINF_30T0006850': ['CDS', '3utr'],
 'LINF_31T0037100': ['CDS', '3utr'],
 'LINF_31T0039200': ['CDS', '3utr'],
 'LINF_36T0017400': ['CDS', '3utr'],
 'LINF_36T0036000': ['CDS', '3utr'],
 'LINF_36T0071100': ['CDS', '3utr']}

<span style="color:red">These are the elements without a 3utr or 5utr</span>

We should be careful with LINF_270013600 (the only entry above that has a 5utr but no 3utr).
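The list-based filter above depends on the order in which features appear for each parent. An order-independent variant can be sketched with sets. The toy table and the names `toy_gtf`, `utr_types` and `one_utr_only` are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy table: T1 has both UTRs, T2 lacks a 5utr, T3 has only a CDS
toy_gtf = pd.DataFrame({
    "parent_id": ["T1", "T1", "T1", "T2", "T2", "T3"],
    "feature": ["CDS", "3utr", "5utr", "CDS", "3utr", "CDS"],
})

# Collect the set of feature types under each parent (order no longer matters)
features_per_parent = toy_gtf.groupby("parent_id")["feature"].apply(set)

# Keep parents that have exactly one of the two UTR types
utr_types = {"3utr", "5utr"}
n_utr_types = features_per_parent.apply(lambda feats: len(feats & utr_types))
one_utr_only = features_per_parent[n_utr_types == 1]
print(one_utr_only)
```

Using sets also covers parents whose features are listed twice, so the duplicated patterns do not need to be enumerated by hand.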


In [99]:
gtf_df[
    (
            (gtf_df[['gene_id', 'transcript_id', 'parent_id']].isin(filtered_dict.keys()).any(axis=1)) | 
            (gtf_df['gene_id'].isin([elem.replace("T","") for elem in list(filtered_dict.keys())]))
     )
]

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
23181,LinJ.27,gene,327990,328645,+,LINF_270013600,LINF_270013600,,,protein_coding,,unknown,Stress_responsive_A/B_Barrel_domain-containing...
23182,LinJ.27,transcript,327990,328645,+,,,LINF_27T0013600,LINF_27T0013600,protein_coding,LINF_270013600,,Stress_responsive_A/B_Barrel_domain-containing...
23183,LinJ.27,CDS,328114,328645,+,,,LINF_27T0013600,,,LINF_27T0013600,,
23184,LinJ.27,5utr,327990,328113,+,,,,,,LINF_27T0013600,,
27827,LinJ.30,gene,56144,57262,-,LINF_300006850,LINF_300006850,,,protein_coding,,unknown,polynucleotide_kinase_3'-phosphatase-_putative...
27828,LinJ.30,transcript,56144,57262,-,,,LINF_30T0006850,LINF_30T0006850,protein_coding,LINF_300006850,,polynucleotide_kinase_3'-phosphatase-_putative...
27829,LinJ.30,CDS,56787,57262,-,,,LINF_30T0006850,,,LINF_30T0006850,,
27830,LinJ.30,3utr,56144,56786,-,,,,,,LINF_30T0006850,,
31523,LinJ.31,gene,1404369,1405546,-,LINF_310037100,LINF_310037100,,,protein_coding,,,protein_of_unknown_function_-_conserved
31524,LinJ.31,transcript,1404369,1405546,-,,,LINF_31T0037100,LINF_31T0037100,protein_coding,LINF_310037100,,protein_of_unknown_function_-_conserved


## Negative elements inside GTF elements

Let's look at how the data will be stored in the dictionary.

This way we can find every NEGATIVE ELEMENT that is inside each GTF element.

In [100]:
# Create the boolean columns for each category in "feature"
boolean_df = pd.get_dummies(gtf_df['feature'], prefix='', prefix_sep='').astype(bool)

gtf_df = pd.concat([gtf_df, boolean_df], axis=1)
gtf_df.head()

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes,3utr,5utr,CDS,gene,transcript
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946),False,False,False,True,False
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946),False,False,False,False,True
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,,False,False,True,False,False
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,,True,False,False,False,False
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,,False,True,False,False,False


In [101]:
# Let's drop the original "feature" column and reorder the columns
gtf_df.drop(columns="feature", inplace=True)
gtf_df = gtf_df[["chrom", "start", "end", "strand", "gene_id", "transcript_id", "parent_id", "gene", "transcript", "CDS", "3utr", "5utr", "pseudogen", "notes"]]
gtf_df.head()

Unnamed: 0,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes
0,LinJ.01,1520,5066,-,LINF_010005000,,,True,False,False,False,False,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,1520,5066,-,,LINF_01T0005000,LINF_010005000,False,True,False,False,False,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,3710,4711,-,,LINF_01T0005000,LINF_01T0005000,False,False,True,False,False,,
3,LinJ.01,1520,3709,-,,,LINF_01T0005000,False,False,False,True,False,,
4,LinJ.01,4712,5066,-,,,LINF_01T0005000,False,False,False,False,True,,


Let's make sure that we use the same column names for `gtf_df` and `neg_df`

In [102]:
print(gtf_df.columns)
print(gtf_df.shape)
gtf_df.head()

Index(['chrom', 'start', 'end', 'strand', 'gene_id', 'transcript_id',
       'parent_id', 'gene', 'transcript', 'CDS', '3utr', '5utr', 'pseudogen',
       'notes'],
      dtype='object')
(45368, 14)


Unnamed: 0,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes
0,LinJ.01,1520,5066,-,LINF_010005000,,,True,False,False,False,False,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,1520,5066,-,,LINF_01T0005000,LINF_010005000,False,True,False,False,False,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,3710,4711,-,,LINF_01T0005000,LINF_01T0005000,False,False,True,False,False,,
3,LinJ.01,1520,3709,-,,,LINF_01T0005000,False,False,False,True,False,,
4,LinJ.01,4712,5066,-,,,LINF_01T0005000,False,False,False,False,True,,


### Clean "notes" column

In [103]:
gtf_functions = gtf_df["notes"].value_counts()
gtf_functions

notes
hypothetical_protein_-_conserved                                            4043
protein_of_unknown_function_-_conserved                                     1536
hypothetical_protein                                                         372
protein_kinase                                                               140
hypothetical_protein_-__conserved                                             86
                                                                            ... 
tb-292_membrane_associated_protein-like_protein_conflicted_zone_in_study       1
tRNA                                                                           1
tRNA-seC                                                                       1
tRNA-val|Anticodon_gac                                                         1
tRNA-Cys                                                                       1
Name: count, Length: 4436, dtype: int64

In [104]:
# In `gtf_functions` filter all names with "protein" and "hypothetical" inside using a regex:
gtf_functions_protein = gtf_functions[gtf_functions.index.str.contains(r"(?=.*protein)(?=.*hypothetical)", case=False)]
gtf_functions_protein

notes
hypothetical_protein_-_conserved                             4043
hypothetical_protein                                          372
hypothetical_protein_-__conserved                              86
hypothetical_protein_-_unknown_function                        10
Hypothetical_protein                                            8
hypothetical_protein_-_conserved_                               4
Conserved_hypothetical_ATP_binding_protein_-_putative           4
hypothetical_protein_-_conserved__                              2
hypothetical_protein_-_conserved_conflicted_zone_in_study       2
hypothetical_protein_-_conserved|GF1                            2
hypothetical_protein,_conserved                                 2
hypothetical_protein_pseudogene                                 2
hypothetical_protein_conserved                                  2
hypothetical_protein-unknown_function                           2
hypothetical_protein_conflicted_zone_in_study                   1
Name
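The pattern used above relies on two lookaheads, so a name matches only when it contains both words, in either order. A small sketch with the standard `re` module shows the behavior. `Series.str.contains` applies the pattern with `re.search` semantics, so the same logic carries over. The `notes` list is a hand-picked sample of values that appear in the outputs above.

```python
import re

# Both lookaheads must succeed from the same position, so both words are required
pattern = re.compile(r"(?=.*protein)(?=.*hypothetical)", re.IGNORECASE)

# Hand-picked sample of values seen in the outputs above
notes = [
    "hypothetical_protein_-_conserved",
    "Conserved_hypothetical_ATP_binding_protein_-_putative",
    "protein_kinase",
    "tRNA-Cys",
]

matches = [note for note in notes if pattern.search(note)]
print(matches)
```

Only the first two names match: `protein_kinase` lacks "hypothetical" and `tRNA-Cys` lacks both words.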

In [105]:
# Let's change 

In [106]:
# Let's rename in `neg_df` the next columns:
# 'sseqid' to 'chrom'
# 'sstart' to 'start'
# 'send' to 'end'
neg_df.rename(columns={"sseqid": "chrom", "sstart": "start", "send": "end"}, inplace=True)
print(neg_df.columns)

Index(['chrom', 'start', 'end', 'sstrand', 'sseq', 'name'], dtype='object')


Now let's repeat the dictionary process.
For each NEGATIVE ELEMENT, the next dictionary will list only the GTF elements that COMPLETELY contain it.
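The containment test can also be sketched without a Python-level loop by merging the two tables on `chrom` and filtering the resulting pairs. The toy frames and the names `neg`, `gtf`, `pairs` and `inside` are assumptions made for illustration. The merge creates every same-chromosome pair, which is reasonable for small tables but can get memory-hungry for large ones.

```python
import pandas as pd

# Toy stand-ins for neg_df and gtf_df (values are made up)
neg = pd.DataFrame({
    "name": ["neg_a", "neg_b"],
    "chrom": ["chr1", "chr1"],
    "start": [100, 500],
    "end": [200, 650],
})
gtf = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [50, 600, 50],
    "end": [300, 700, 300],
    "gene_id": ["g1", "g2", "g3"],
})

# Pair every negative element with every GTF element on the same chromosome
pairs = neg.merge(gtf, on="chrom", suffixes=("_neg", "_gtf"))

# Keep pairs where the GTF interval fully contains the negative interval
inside = pairs[
    (pairs["start_gtf"] <= pairs["start_neg"])
    & (pairs["end_gtf"] >= pairs["end_neg"])
]
print(inside[["name", "gene_id"]])
```

Here `neg_a` lies inside `g1`, while `neg_b` only partially overlaps `g2` and is therefore excluded.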

In [107]:
# Create interval columns
neg_df["interval"] = pd.IntervalIndex.from_arrays(neg_df["start"], neg_df["end"], closed="both")
gtf_df["interval"] = pd.IntervalIndex.from_arrays(gtf_df["start"], gtf_df["end"], closed="both")

# initialize dict
neg_gtf_dict = {neg_name: [] for neg_name in neg_df["name"].unique()}

# Find elements in neg_df that are inside gtf_df
# Find contains using boolean indexing
for i, neg_row in neg_df.iterrows():
    # Boolean mask for intervals that contain the neg_df interval
    print(f"Analyzing elem {i + 1}/{neg_df.shape[0]}")

    mask = (gtf_df['chrom'] == neg_row['chrom']) & \
           (gtf_df['start'] <= neg_row['start']) & \
           (gtf_df['end'] >= neg_row['end'])
    contains = gtf_df[mask]
    for j, gtf_row in contains.iterrows():
        neg_gtf_dict[neg_row['name']].append(gtf_row.to_dict())

Analyzing elem 1/682
Analyzing elem 2/682
Analyzing elem 3/682
Analyzing elem 4/682
Analyzing elem 5/682
Analyzing elem 6/682
Analyzing elem 7/682
Analyzing elem 8/682
Analyzing elem 9/682
Analyzing elem 10/682
Analyzing elem 11/682
Analyzing elem 12/682
Analyzing elem 13/682
Analyzing elem 14/682
Analyzing elem 15/682
Analyzing elem 16/682
Analyzing elem 17/682
Analyzing elem 18/682
Analyzing elem 19/682
Analyzing elem 20/682
Analyzing elem 21/682
Analyzing elem 22/682
Analyzing elem 23/682
Analyzing elem 24/682
Analyzing elem 25/682
Analyzing elem 26/682
Analyzing elem 27/682
Analyzing elem 28/682
Analyzing elem 29/682
Analyzing elem 30/682
Analyzing elem 31/682
Analyzing elem 32/682
Analyzing elem 33/682
Analyzing elem 34/682
Analyzing elem 35/682
Analyzing elem 36/682
Analyzing elem 37/682
Analyzing elem 38/682
Analyzing elem 39/682
Analyzing elem 40/682
Analyzing elem 41/682
Analyzing elem 42/682
Analyzing elem 43/682
Analyzing elem 44/682
Analyzing elem 45/682
Analyzing elem 46/6

In [108]:
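# Prepare a pre-JSON copy so that the original dict is not altered
# (plain assignment would only create a second name for the same dict)
import copy

neg_gtf_relation_pre_json = copy.deepcopy(neg_gtf_dict)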

# Let's check the data
# print(neg_gtf_relation_pre_json)

In [109]:
# Let's count the data
counter_neg_inside = 0
counter_neg_not_inside = 0
for key, value in neg_gtf_dict.items():
    print("="*50)
    print(f"{key}:")
    if len(value) > 0:
        counter_neg_inside += 1
        for elem in value:
            print(f"\t{elem}")
    else:
        counter_neg_not_inside += 1

rejected_noCDS_c01.10:
	{'chrom': 'LinJ.01', 'start': 34736, 'end': 37218, 'strand': '-', 'gene_id': 'LINF_010006300', 'transcript_id': None, 'parent_id': None, 'gene': True, 'transcript': False, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'hypothetical_protein_-_conserved__', 'interval': Interval(34736, 37218, closed='both')}
	{'chrom': 'LinJ.01', 'start': 34736, 'end': 37218, 'strand': '-', 'gene_id': None, 'transcript_id': 'LINF_01T0006300', 'parent_id': 'LINF_010006300', 'gene': False, 'transcript': True, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'hypothetical_protein_-_conserved__', 'interval': Interval(34736, 37218, closed='both')}
	{'chrom': 'LinJ.01', 'start': 34736, 'end': 36818, 'strand': '-', 'gene_id': None, 'transcript_id': None, 'parent_id': 'LINF_01T0006300', 'gene': False, 'transcript': False, 'CDS': False, '3utr': True, '5utr': False, 'pseudogen': None, 'notes': None, 'interval': Interval(34736, 36818, closed='b

In [110]:
print(f"From the total of {len(neg_gtf_dict)} NEGATIVE ELEMENTS, {counter_neg_inside} are inside GTF elements and {counter_neg_not_inside} are not inside GTF elements.")

From the total of 681 NEGATIVE ELEMENTS, 503 are inside GTF elements and 178 are not inside GTF elements.


Let's split the elements into two dictionaries, depending on whether their list of matches is empty or not:

In [111]:
# Get the elements whose list of matches is not empty
neg_inside_gtf_dict = {key: value for key, value in neg_gtf_dict.items() if len(value) > 0}
print(len(neg_inside_gtf_dict))

# Get the elements whose list of matches is empty
neg_not_inside_gtf_dict = {key: value for key, value in neg_gtf_dict.items() if len(value) == 0}
print(len(neg_not_inside_gtf_dict))

503
178


Let's transform it into a DataFrame

In [112]:
neg_inside_gtf_list = []
for key, value in neg_inside_gtf_dict.items():
    for elem in value:
        new_record = {'neg_name' : key}  # Create dict of 1 element
        new_record.update(elem)  # Update the dict with the values from elem, this way "neg_name" goes first
        neg_inside_gtf_list.append(new_record)

neg_inside_gtf_df = pd.DataFrame(neg_inside_gtf_list)

# Let's check the df
print(neg_inside_gtf_df.shape)
print(neg_inside_gtf_df.dtypes)
print(neg_inside_gtf_df['neg_name'].nunique())
neg_inside_gtf_df.head()

(1358, 16)
neg_name                        object
chrom                           object
start                            int64
end                              int64
strand                          object
gene_id                         object
transcript_id                   object
parent_id                       object
gene                              bool
transcript                        bool
CDS                               bool
3utr                              bool
5utr                              bool
pseudogen                       object
notes                           object
interval         interval[int64, both]
dtype: object
503


Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
0,rejected_noCDS_c01.10,LinJ.01,34736,37218,-,LINF_010006300,,,True,False,False,False,False,,hypothetical_protein_-_conserved__,"[34736, 37218]"
1,rejected_noCDS_c01.10,LinJ.01,34736,37218,-,,LINF_01T0006300,LINF_010006300,False,True,False,False,False,,hypothetical_protein_-_conserved__,"[34736, 37218]"
2,rejected_noCDS_c01.10,LinJ.01,34736,36818,-,,,LINF_01T0006300,False,False,False,True,False,,,"[34736, 36818]"
3,rejected_noCDS_c01.30,LinJ.01,145613,146868,+,LINF_010010350,,,True,False,False,False,False,,,"[145613, 146868]"
4,rejected_noCDS_c01.30,LinJ.01,145613,146868,+,,LINF_01T0010350,LINF_010010350,False,True,False,False,False,,,"[145613, 146868]"


Now with these data we can analyze a lot of things

## NEGATIVE ELEMENTS intergenic

Now we will take the NEGATIVE ELEMENTS that are not completely inside any GTF element, stored in `neg_not_inside_gtf_dict`, and check whether they partially overlap one.

In [113]:
# Let's check again the dict:
print(len(neg_not_inside_gtf_dict))
neg_not_inside_gtf_dict

178


{'rejected_noCDS_c01.20': [],
 'rejected_noCDS_c01.40': [],
 'rejected_noCDS_c01.50': [],
 'rejected_noCDS_c01.60A': [],
 'rejected_noCDS_c02.140': [],
 'rejected_noCDS_c02.150': [],
 'rejected_noCDS_c02.170': [],
 'rejected_noCDS_c03.180': [],
 'rejected_noCDS_c04.200C': [],
 'rejected_noCDS_c04.210C': [],
 'rejected_noCDS_c04.230': [],
 'rejected_noCDS_c05.280': [],
 'rejected_noCDS_c05.300D': [],
 'rejected_noCDS_c05.320D': [],
 'rejected_noCDS_c05.350': [],
 'rejected_noCDS_c05.360F': [],
 'rejected_noCDS_c06.380': [],
 'rejected_noCDS_c06.400': [],
 'rejected_noCDS_c07.430G': [],
 'rejected_noCDS_c07.510': [],
 'rejected_noCDS_c08.550': [],
 'rejected_noCDS_c08.560I': [],
 'rejected_noCDS_c08.590I': [],
 'rejected_noCDS_c08.620I': [],
 'rejected_noCDS_c08.640I': [],
 'rejected_noCDS_c08.670': [],
 'rejected_noCDS_c08.680K': [],
 'rejected_noCDS_c08.710K': [],
 'rejected_noCDS_c08.720N': [],
 'rejected_noCDS_c08.760K': [],
 'rejected_noCDS_c08.770N': [],
 'rejected_noCDS_c08.800K':

In [114]:
# Let's take the keys from the dict as a list:
neg_not_inside_gtf_dict_list = list(neg_not_inside_gtf_dict.keys())
print(neg_not_inside_gtf_dict_list)

['rejected_noCDS_c01.20', 'rejected_noCDS_c01.40', 'rejected_noCDS_c01.50', 'rejected_noCDS_c01.60A', 'rejected_noCDS_c02.140', 'rejected_noCDS_c02.150', 'rejected_noCDS_c02.170', 'rejected_noCDS_c03.180', 'rejected_noCDS_c04.200C', 'rejected_noCDS_c04.210C', 'rejected_noCDS_c04.230', 'rejected_noCDS_c05.280', 'rejected_noCDS_c05.300D', 'rejected_noCDS_c05.320D', 'rejected_noCDS_c05.350', 'rejected_noCDS_c05.360F', 'rejected_noCDS_c06.380', 'rejected_noCDS_c06.400', 'rejected_noCDS_c07.430G', 'rejected_noCDS_c07.510', 'rejected_noCDS_c08.550', 'rejected_noCDS_c08.560I', 'rejected_noCDS_c08.590I', 'rejected_noCDS_c08.620I', 'rejected_noCDS_c08.640I', 'rejected_noCDS_c08.670', 'rejected_noCDS_c08.680K', 'rejected_noCDS_c08.710K', 'rejected_noCDS_c08.720N', 'rejected_noCDS_c08.760K', 'rejected_noCDS_c08.770N', 'rejected_noCDS_c08.800K', 'rejected_noCDS_c08.850', 'rejected_noCDS_c08.860', 'rejected_noCDS_c10.980O', 'rejected_noCDS_c10.1010', 'rejected_noCDS_c11.1070', 'rejected_noCDS_c11.1

Let's find the overlapping elements:

In [115]:
# Call the old code but only using the keys from the list
# initialize dict
neg_gtf_dict_2 = {name: [] for name in neg_not_inside_gtf_dict_list}

# Find elements in neg_df that overlap with an element in gtf_df
for i, neg_row in neg_df.iterrows():
    print(f"Analyzing elem {i+1}/{neg_df.shape[0]}")

    if neg_row['name'] not in neg_not_inside_gtf_dict_list:
        continue

    # Boolean mask for the GTF intervals that overlap the neg interval
    mask = (gtf_df['chrom'] == neg_row['chrom']) & \
            (
                ((neg_row['start'] >= gtf_df['start']) & (neg_row['start'] <= gtf_df['end'])) |
                ((neg_row['end'] >= gtf_df['start']) & (neg_row['end'] <= gtf_df['end'])) |
                ((gtf_df['start'] >= neg_row['start']) & (gtf_df['start'] <= neg_row['end'])) |
                ((gtf_df['end'] >= neg_row['start']) & (gtf_df['end'] <= neg_row['end']))
                )
    
    overlaps = gtf_df[mask]
    
    for j, gtf_row in overlaps.iterrows():
        neg_gtf_dict_2[neg_row['name']].append(gtf_row.to_dict())

Analyzing elem 1/682
Analyzing elem 2/682
Analyzing elem 3/682
Analyzing elem 4/682
Analyzing elem 5/682
Analyzing elem 6/682
Analyzing elem 7/682
Analyzing elem 8/682
Analyzing elem 9/682
Analyzing elem 10/682
Analyzing elem 11/682
Analyzing elem 12/682
Analyzing elem 13/682
Analyzing elem 14/682
Analyzing elem 15/682
Analyzing elem 16/682
Analyzing elem 17/682
Analyzing elem 18/682
Analyzing elem 19/682
Analyzing elem 20/682
Analyzing elem 21/682
Analyzing elem 22/682
Analyzing elem 23/682
Analyzing elem 24/682
Analyzing elem 25/682
Analyzing elem 26/682
Analyzing elem 27/682
Analyzing elem 28/682
Analyzing elem 29/682
Analyzing elem 30/682
Analyzing elem 31/682
Analyzing elem 32/682
Analyzing elem 33/682
Analyzing elem 34/682
Analyzing elem 35/682
Analyzing elem 36/682
Analyzing elem 37/682
Analyzing elem 38/682
Analyzing elem 39/682
Analyzing elem 40/682
Analyzing elem 41/682
Analyzing elem 42/682
Analyzing elem 43/682
Analyzing elem 44/682
Analyzing elem 45/682
Analyzing elem 46/6
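
The four-clause mask used above is one way to write a closed-interval overlap test. As a minimal sketch on made-up coordinates (the `gtf_toy` frame and both helper functions are hypothetical, not part of this notebook), the same rows can be selected with two comparisons, because two closed intervals overlap exactly when each one starts no later than the other ends:

```python
import pandas as pd

# Toy annotation table (invented coordinates, same column names as gtf_df)
gtf_toy = pd.DataFrame({
    "chrom": ["LinJ.01", "LinJ.01", "LinJ.01", "LinJ.02"],
    "start": [100, 300, 150, 100],
    "end":   [200, 400, 160, 200],
})


def overlap_mask(gtf, chrom, start, end):
    """Closed-interval overlap: same chromosome, and neither interval ends before the other starts."""
    return (gtf["chrom"] == chrom) & (gtf["start"] <= end) & (gtf["end"] >= start)


def four_clause_mask(gtf, chrom, start, end):
    """The mask written as in the cell above, kept here only for comparison."""
    return (gtf["chrom"] == chrom) & (
        ((start >= gtf["start"]) & (start <= gtf["end"]))
        | ((end >= gtf["start"]) & (end <= gtf["end"]))
        | ((gtf["start"] >= start) & (gtf["start"] <= end))
        | ((gtf["end"] >= start) & (gtf["end"] <= end))
    )


# A query interval that straddles the first feature and fully contains the third
query = ("LinJ.01", 140, 250)
short = overlap_mask(gtf_toy, *query)
long_ = four_clause_mask(gtf_toy, *query)
print(gtf_toy[short])
```

On this toy input both masks select the same rows. The shorter form may be easier to read, but the result stored in `neg_gtf_dict_2` should not change.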

In [116]:
counter_neg_overlaps = 0
counter_neg_not_overlaps = 0
for key, value in neg_gtf_dict_2.items():
    print(f"{'='*50}")
    print(f"{key}:")
    if len(value) > 0:
        counter_neg_overlaps += 1
        for elem in value:
            print(f"\t{elem}")
    else:
        counter_neg_not_overlaps += 1

rejected_noCDS_c01.20:
	{'chrom': 'LinJ.01', 'start': 114146, 'end': 116224, 'strand': '+', 'gene_id': 'LINF_010009600', 'transcript_id': None, 'parent_id': None, 'gene': True, 'transcript': False, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'nicotinamidase|PNC1', 'interval': Interval(114146, 116224, closed='both')}
	{'chrom': 'LinJ.01', 'start': 114146, 'end': 116224, 'strand': '+', 'gene_id': None, 'transcript_id': 'LINF_01T0009600', 'parent_id': 'LINF_010009600', 'gene': False, 'transcript': True, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'nicotinamidase|PNC1', 'interval': Interval(114146, 116224, closed='both')}
	{'chrom': 'LinJ.01', 'start': 114146, 'end': 114423, 'strand': '+', 'gene_id': None, 'transcript_id': None, 'parent_id': 'LINF_01T0009600', 'gene': False, 'transcript': False, 'CDS': False, '3utr': False, '5utr': True, 'pseudogen': None, 'notes': None, 'interval': Interval(114146, 114423, closed='both')}
rejected_no

In [117]:
print(f"From the total of {len(neg_not_inside_gtf_dict_list)} NEGATIVE ELEMENTS, {counter_neg_overlaps} are overlapping the GTF elements and {counter_neg_not_overlaps} are not overlapping the GTF elements")

From the total of 178 NEGATIVE ELEMENTS, 137 are overlapping the GTF elements and 41 are not overlapping the GTF elements


Let's join the two "neg_gtf_dict" dictionaries to save them as a JSON file

In [118]:
# dict.copy() is shallow: copy each list too, so that extending below does not modify neg_gtf_dict
dict_neg_full_inside = {key: list(value) for key, value in neg_gtf_dict.items()}
dict_neg_overlap = neg_gtf_dict_2.copy()

# Let's join them
for key, value in dict_neg_overlap.items():
    if key in dict_neg_full_inside and len(value) > 0:
        dict_neg_full_inside[key].extend(value)


In [119]:
# Save the data to a json file
path_gtf_neg_relation_json = "./data/neg_gtf_relation.json" # Path to save the json file

# The problem will be the pandas Interval type and the JSON package. We need to create a custom serializer
def custom_serializer(obj):
    if isinstance(obj, pd.Interval):
        return {
            'left': int(obj.left) if isinstance(obj.left, np.integer) else obj.left,  # The JSON package can't process int64; converting it
            'right': int(obj.right) if isinstance(obj.right, np.integer) else obj.right,  # The JSON package can't process int64; converting it
            'closed': obj.closed
        }
    elif isinstance(obj, np.integer):  # Check for numpy integer types
        return int(obj)  # Convert to a standard Python int
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Save the data to a json file
with open(path_gtf_neg_relation_json, "w") as f:
    json.dump(dict_neg_full_inside, f, default=custom_serializer)
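
The serializer above stores each `pd.Interval` as a plain dictionary, so reading the file back gives dictionaries rather than intervals. The sketch below shows one possible round trip on a toy record. `restore_interval` is a hypothetical helper, and the serializer is a simplified copy that assumes integer bounds:

```python
import json

import numpy as np
import pandas as pd


def custom_serializer(obj):
    """Simplified version of the serializer above, assuming integer interval bounds."""
    if isinstance(obj, pd.Interval):
        return {"left": int(obj.left), "right": int(obj.right), "closed": obj.closed}
    if isinstance(obj, np.integer):
        return int(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def restore_interval(record):
    """Hypothetical object_hook: rebuild a pd.Interval from its serialized form."""
    if set(record) == {"left", "right", "closed"}:
        return pd.Interval(record["left"], record["right"], closed=record["closed"])
    return record


# Toy record with the same kind of values as the real dictionaries
toy = {"neg_1": [{"start": np.int64(10), "end": np.int64(20),
                  "interval": pd.Interval(10, 20, closed="both")}]}

text = json.dumps(toy, default=custom_serializer)
loaded = json.loads(text, object_hook=restore_interval)
print(loaded["neg_1"][0]["interval"])
```

`object_hook` is called for every decoded JSON object, so the key check is what separates serialized intervals from ordinary records.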

In [120]:
# Let's get the elements in different dictionaries
neg_overlaps_gtf_dict = {key: value for key, value in neg_gtf_dict_2.items() if len(value) > 0}

# And now for the INTERGENIC elements
neg_intergenic_gtf_dict = {key: value for key, value in neg_gtf_dict_2.items() if len(value) == 0}

In [121]:
print(len(neg_intergenic_gtf_dict))
list(neg_intergenic_gtf_dict.keys())

41


['rejected_noCDS_c01.50',
 'rejected_noCDS_c01.60A',
 'rejected_noCDS_c02.170',
 'rejected_noCDS_c04.200C',
 'rejected_noCDS_c05.360F',
 'rejected_noCDS_c06.400',
 'rejected_noCDS_c08.850',
 'rejected_noCDS_c08.860',
 'rejected_noCDS_c10.980O',
 'rejected_noCDS_c12.1120',
 'rejected_noCDS_c12.1170',
 'rejected_noCDS_c14.1420',
 'rejected_noCDS_c14.1530',
 'rejected_noCDS_c16.1720',
 'rejected_noCDS_c17.1910',
 'rejected_noCDS_c18.1980',
 'rejected_noCDS_c19.2160',
 'rejected_noCDS_c19.2230',
 'rejected_noCDS_c23.2640AI',
 'rejected_noCDS_c23.2650AI',
 'rejected_noCDS_c26.3150',
 'rejected_noCDS_BLAST_ERROR',
 'rejected_noCDS_c29.3610AY',
 'rejected_noCDS_c29.3640AY',
 'rejected_noCDS_c29.3670AY',
 'rejected_noCDS_c29.3830',
 'rejected_noCDS_c31.4110',
 'rejected_noCDS_c31.4220',
 'rejected_noCDS_c31.4230BF',
 'rejected_noCDS_c31.4240BG',
 'rejected_noCDS_c33.4880',
 'rejected_noCDS_c33.4910',
 'rejected_noCDS_c33.5030AK',
 'rejected_noCDS_c34.5340BU',
 'rejected_noCDS_c34.5390BU',
 're

## NEG ELEMENTS OVERLAPPING

In [122]:
# Check the Dict
print(len(neg_overlaps_gtf_dict))
neg_overlaps_gtf_dict

137


{'rejected_noCDS_c01.20': [{'chrom': 'LinJ.01',
   'start': 114146,
   'end': 116224,
   'strand': '+',
   'gene_id': 'LINF_010009600',
   'transcript_id': None,
   'parent_id': None,
   'gene': True,
   'transcript': False,
   'CDS': False,
   '3utr': False,
   '5utr': False,
   'pseudogen': None,
   'notes': 'nicotinamidase|PNC1',
   'interval': Interval(114146, 116224, closed='both')},
  {'chrom': 'LinJ.01',
   'start': 114146,
   'end': 116224,
   'strand': '+',
   'gene_id': None,
   'transcript_id': 'LINF_01T0009600',
   'parent_id': 'LINF_010009600',
   'gene': False,
   'transcript': True,
   'CDS': False,
   '3utr': False,
   '5utr': False,
   'pseudogen': None,
   'notes': 'nicotinamidase|PNC1',
   'interval': Interval(114146, 116224, closed='both')},
  {'chrom': 'LinJ.01',
   'start': 114146,
   'end': 114423,
   'strand': '+',
   'gene_id': None,
   'transcript_id': None,
   'parent_id': 'LINF_01T0009600',
   'gene': False,
   'transcript': False,
   'CDS': False,
   '3utr'

In [123]:
# Transform it into a DataFrame
neg_overlaps_gtf_list = []
for key, value in neg_overlaps_gtf_dict.items():
    for elem in value:
        new_record = {'neg_name': key}  # Create dict of 1 element
        new_record.update(elem)  # Update the dict with the values from elem, this way "neg_name" goes first
        neg_overlaps_gtf_list.append(new_record)

neg_overlaps_gtf_df = pd.DataFrame(neg_overlaps_gtf_list)

In [124]:
print(neg_overlaps_gtf_df.shape)
print(neg_overlaps_gtf_df.dtypes)
print(neg_overlaps_gtf_df['neg_name'].nunique())
neg_overlaps_gtf_df.head()

(493, 16)
neg_name                        object
chrom                           object
start                            int64
end                              int64
strand                          object
gene_id                         object
transcript_id                   object
parent_id                       object
gene                              bool
transcript                        bool
CDS                               bool
3utr                              bool
5utr                              bool
pseudogen                       object
notes                           object
interval         interval[int64, both]
dtype: object
137


Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
0,rejected_noCDS_c01.20,LinJ.01,114146,116224,+,LINF_010009600,,,True,False,False,False,False,,nicotinamidase|PNC1,"[114146, 116224]"
1,rejected_noCDS_c01.20,LinJ.01,114146,116224,+,,LINF_01T0009600,LINF_010009600,False,True,False,False,False,,nicotinamidase|PNC1,"[114146, 116224]"
2,rejected_noCDS_c01.20,LinJ.01,114146,114423,+,,,LINF_01T0009600,False,False,False,False,True,,,"[114146, 114423]"
3,rejected_noCDS_c01.40,LinJ.01,262069,262979,+,LINF_010013350,,,True,False,False,False,False,,,"[262069, 262979]"
4,rejected_noCDS_c01.40,LinJ.01,262069,262979,+,,LINF_01T0013350,LINF_010013350,False,True,False,False,False,,,"[262069, 262979]"


### Divide "overlapping" and "overextended" elements
Some elements in `neg_overlaps_gtf_df` touch more than one GTF feature, for example:
* 3'UTR, CDS
* 3'UTR transcript_1, 5'UTR transcript_2
* etc.

Other elements touch only one GTF feature and overextend into an intergenic zone.

In [125]:
# Let's group the elements by "neg_name"
groupy_neg_overlaps_gtf_df = neg_overlaps_gtf_df.groupby('neg_name')

In [126]:
# Create the pre-list to save the elements
true_overlaps_gtf = []
overextend_elements_gtf = []

# Iterate over the groupby object
for name, group in groupy_neg_overlaps_gtf_df:
    location = group[["gene", "transcript", "CDS", "3utr", "5utr"]].sum()
    
    # Get the elements that extend by 3'utr or 5'utr
    if location["gene"] == 1 and (location["3utr"] == 1 or location["5utr"] == 1):
        overextend_elements_gtf.extend(elem for _, elem in group.iterrows())
    elif location["gene"] == 1 and location["3utr"] == 0 and location["5utr"] == 0:  # Elements that overextend, but these are non-coding genes.
        overextend_elements_gtf.extend(elem for _, elem in group.iterrows())
    else:  # The rest will be the elements that are truly overlapping more than one GTF element
        true_overlaps_gtf.extend(elem for _, elem in group.iterrows())

# Let's create the DataFrames
true_overlaps_gtf_df = pd.DataFrame(true_overlaps_gtf, columns=neg_overlaps_gtf_df.columns)
overextend_elements_gtf_df = pd.DataFrame(overextend_elements_gtf, columns=neg_overlaps_gtf_df.columns)

In [127]:
print(f"True Overlaps: {true_overlaps_gtf_df.shape}"
      f"\n\tUnique negs: {true_overlaps_gtf_df['neg_name'].nunique()}")
print(f"Overextend Elements: {overextend_elements_gtf_df.shape}"
      f"\n\tUnique negs: {overextend_elements_gtf_df['neg_name'].nunique()}")

True Overlaps: (288, 16)
	Unique negs: 55
Overextend Elements: (205, 16)
	Unique negs: 82
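
The split above depends on how many rows of each feature type a negative element touches. As a rough sketch of the same rule on a toy table (the `toy` frame and the `neg_A`/`neg_B` names are made up), the per-group sums can be computed once and the three branches written as boolean columns:

```python
import pandas as pd

flags = ["gene", "transcript", "CDS", "3utr", "5utr"]

# neg_A touches one gene and its 3'UTR; neg_B touches two different genes
toy = pd.DataFrame({
    "neg_name":   ["neg_A", "neg_A", "neg_A", "neg_B", "neg_B", "neg_B", "neg_B"],
    "gene":       [True, False, False, True, False, True, False],
    "transcript": [False, True, False, False, True, False, True],
    "CDS":        [False] * 7,
    "3utr":       [False, False, True, False, False, False, False],
    "5utr":       [False] * 7,
})

# Count how many rows of each feature type every negative element touches
counts = toy.groupby("neg_name")[flags].sum()

# Same conditions as the if / elif branches in the loop above
one_gene = counts["gene"].eq(1)
single_utr = counts["3utr"].eq(1) | counts["5utr"].eq(1)
no_utr = counts["3utr"].eq(0) & counts["5utr"].eq(0)
overextended = one_gene & (single_utr | no_utr)

label = overextended.map({True: "overextended", False: "true_overlap"})
print(label)
```

Here `neg_A` falls in the "overextended" group and `neg_B` in the "true overlap" group, because it touches two genes.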


# Analyze results

* **Total NEGATIVE ELEMENTS**: 681
* A) `neg_inside_gtf_df` ==> Data frame of the 503 negative elements that lie inside the coordinates of a GTF element.
* **Not completely inside GTF:** 178
    * B) `neg_overlaps_gtf_df` ==> Data frame of the 137 elements that overlap GTF elements.
        * B.1) `true_overlaps_gtf_df` ==> 55 elements
        * B.2) `overextend_elements_gtf_df` ==> 82 elements
    * C) `neg_intergenic_gtf_dict` ==> Dictionary with the 41 INTERGENIC elements
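
As a quick sanity check, the counts in this summary can be verified against each other. The numbers below are copied from the list above, and the variable names are only illustrative:

```python
# Counts reported in the summary above
total_neg = 681
inside_gtf = 503
not_inside = 178
overlapping = 137
true_overlaps = 55
overextended = 82
intergenic = 41

# Each level of the breakdown should add up to its parent
assert inside_gtf + not_inside == total_neg
assert overlapping + intergenic == not_inside
assert true_overlaps + overextended == overlapping
print("Summary counts are consistent")
```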

## Prepare data

First, let's join the data frames `neg_inside_gtf_df` (503 elements) with `neg_overlaps_gtf_df` (137 elements) for a total of 640 elements

In [21]:
# Let's join `neg_inside_gtf_df` with `neg_overlaps_gtf_df`
neg_to_filter = pd.concat([neg_inside_gtf_df, neg_overlaps_gtf_df])

# Now let's sort them by chrom and then by start
neg_to_filter.sort_values(by=["chrom", "start"], inplace=True)

# Let's do some descriptive statistics
print(neg_to_filter.shape)
print(neg_to_filter['neg_name'].nunique())
print(neg_to_filter.dtypes)
neg_to_filter

NameError: name 'neg_inside_gtf_df' is not defined

## Cleaning phase

### Defining functions

In [22]:
def search_string(data_frame, searching_string):
    """
    :param data_frame: The DataFrame to search within. Must contain a 'notes' column to perform string matching.
    :param searching_string: The string to search for within the 'notes' column of the DataFrame.
    :return: A filtered DataFrame that contains only rows where 'notes' contains the searching_string, ignoring case.
    """
    filtered_df = data_frame[data_frame['notes'].fillna('').str.contains(searching_string, case=False)]
    print(f"The of filtered data: {filtered_df.shape}")
    print(f"The unique values in column 'neg_name': {filtered_df['neg_name'].nunique()}")
    return filtered_df

def checking_data(data_frame):
    """
    :param data_frame: pandas DataFrame that is being checked
    :return: None
    """
    print(f"Shape of the data frame is: {data_frame.shape}")
    print(f"Number of unique values in column 'neg_name': {data_frame['neg_name'].nunique()} ")

def group_and_count(data_frame, group_column):
    """
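    :param data_frame: The input DataFrame containing the data to be grouped and counted. Must contain the 'notes' column and the location columns ("gene", "transcript", "CDS", "3utr", "5utr").
    :param group_column: The column name used to group the data_frame.
    :return: A DataFrame of notes sorted by count, where each note is counted once per group. The number of groups touching each location type is printed.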
    """
    grouped_df = data_frame.groupby(group_column)
    grouped_column_counter = {}
    loc_counter_global = {}
    for _, group_data in grouped_df:
        notes = group_data['notes'].unique()
        for element in notes:
            if element is not None:
                if element not in grouped_column_counter:
                    grouped_column_counter[element] = 1
                else:
                    grouped_column_counter[element] += 1
                    
        loc_data = group_data[["gene", "transcript", "CDS", "3utr", "5utr"]].sum()
        for loc_element in loc_data.index:
            if loc_data[loc_element] > 0:
                if loc_element not in loc_counter_global:
                    loc_counter_global[loc_element] = 1
                else:
                    loc_counter_global[loc_element] += 1
        
    notes_counter_global_sorted = dict(sorted(grouped_column_counter.items(), key=lambda x: x[1], reverse=True))
    notes_counter_global_sorted_df = pd.DataFrame(notes_counter_global_sorted.items(), columns=["notes", "count"])
    print(loc_counter_global)
    return notes_counter_global_sorted_df
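
The counting done by `group_and_count` can also be expressed with pandas operations instead of explicit loops. The following is only a sketch on a toy frame, not a drop-in replacement. It assumes missing notes are `None`/`NaN`, and unlike the function above it also lists location types with a count of zero:

```python
import pandas as pd

flags = ["gene", "transcript", "CDS", "3utr", "5utr"]

toy = pd.DataFrame({
    "neg_name":   ["neg_1", "neg_1", "neg_2", "neg_2", "neg_3"],
    "notes":      ["kinase", "kinase", "kinase", None, "transporter"],
    "gene":       [True, False, True, False, True],
    "transcript": [False, True, False, False, False],
    "CDS":        [False] * 5,
    "3utr":       [False, False, False, True, False],
    "5utr":       [False] * 5,
})

# Each note counts once per negative element, however many rows repeat it
notes_count = (
    toy.dropna(subset=["notes"])
       .drop_duplicates(subset=["neg_name", "notes"])["notes"]
       .value_counts()
)

# Number of negative elements that touch each location type at least once
loc_count = toy.groupby("neg_name")[flags].any().sum()

print(notes_count)
print(loc_count)
```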

### Check function and location


In [54]:
checking_data(neg_to_filter)
neg_to_filter.head()

Shape of the data frame is: (1851, 16)
Number of unique values in column 'neg_name': 640 


Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
0,rejected_noCDS_c01.10,LinJ.01,34736,37218,-,LINF_010006300,,,True,False,False,False,False,,hypothetical_protein_-_conserved__,"[34736, 37218]"
1,rejected_noCDS_c01.10,LinJ.01,34736,37218,-,,LINF_01T0006300,LINF_010006300,False,True,False,False,False,,hypothetical_protein_-_conserved__,"[34736, 37218]"
2,rejected_noCDS_c01.10,LinJ.01,34736,36818,-,,,LINF_01T0006300,False,False,False,True,False,,,"[34736, 36818]"
0,rejected_noCDS_c01.20,LinJ.01,114146,116224,+,LINF_010009600,,,True,False,False,False,False,,nicotinamidase|PNC1,"[114146, 116224]"
1,rejected_noCDS_c01.20,LinJ.01,114146,116224,+,,LINF_01T0009600,LINF_010009600,False,True,False,False,False,,nicotinamidase|PNC1,"[114146, 116224]"
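
The repeated row labels in the preview above (0, 1, 2, 0, 1) are what `pd.concat` produces when each input frame keeps its own index. If unique labels are preferred, `ignore_index=True` is one option. A small sketch with made-up frames (`inside_toy` and `overlap_toy` are hypothetical stand-ins for the two joined data frames):

```python
import pandas as pd

inside_toy = pd.DataFrame({
    "neg_name": ["neg_1", "neg_1"],
    "chrom": ["LinJ.01", "LinJ.01"],
    "start": [10, 10],
})
overlap_toy = pd.DataFrame({
    "neg_name": ["neg_2", "neg_2"],
    "chrom": ["LinJ.01", "LinJ.01"],
    "start": [5, 5],
})

# Default behaviour: every frame keeps its own labels, so labels repeat
kept = pd.concat([inside_toy, overlap_toy])

# ignore_index=True renumbers the rows from 0 to n-1
fresh = pd.concat([inside_toy, overlap_toy], ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
```

Repeated labels do not affect the `isin` filters used in this section, since those work on the `neg_name` column. They only matter for label-based access such as `.loc`.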


In [55]:
group_and_count(neg_to_filter, "neg_name")

{'gene': 640, 'transcript': 640, '3utr': 400, '5utr': 52, 'CDS': 4}


Unnamed: 0,notes,count
0,hypothetical_protein_-_conserved,88
1,protein_of_unknown_function_-_conserved,31
2,amastin_surface_glycoprotein_-_putative,23
3,hypothetical_protein,12
4,glucose_transporter,9
...,...,...
209,Haloacid_dehalogenase-like_hydrolase,1
210,protein-l-isoaspartate_o-methyltransferase_-_p...,1
211,2′-O-ribose_methyltransferase|MTr1,1
212,mitochondrial_carrier_protein_-_putative,1


### Checking: Hypothetical protein

Let's decide what to do with this kind of data. Since we don't know what these genes do, we will probably leave these elements as they are.

In [56]:
# Let's search for "hypothetical_protein" and add them to a new data frame
filter_data = search_string(neg_to_filter, "hypothetical_protein")

The of filtered data: (208, 16)
The unique values in column 'neg_name': 104


In [57]:
# Now let's take from "neg_to_filter" all the rows that have the same "neg_name" as in "filter_data"
good_negatives = neg_to_filter[neg_to_filter['neg_name'].isin(filter_data['neg_name'])]
checking_data(good_negatives)

Shape of the data frame is: (349, 16)
Number of unique values in column 'neg_name': 104 


In [58]:
# Remove them now:
neg_filtered = neg_to_filter[~neg_to_filter['neg_name'].isin(good_negatives['neg_name'])]
# Let's check the data frame now
checking_data(neg_filtered)
group_and_count(neg_filtered, "neg_name")

Shape of the data frame is: (1502, 16)
Number of unique values in column 'neg_name': 536 
{'gene': 536, 'transcript': 536, '5utr': 34, '3utr': 307, 'CDS': 4}


Unnamed: 0,notes,count
0,protein_of_unknown_function_-_conserved,31
1,amastin_surface_glycoprotein_-_putative,23
2,glucose_transporter,9
3,amastin-like_protein,8
4,UDP-galactose_transporter|LPG5A,7
...,...,...
199,Haloacid_dehalogenase-like_hydrolase,1
200,protein-l-isoaspartate_o-methyltransferase_-_p...,1
201,2′-O-ribose_methyltransferase|MTr1,1
202,mitochondrial_carrier_protein_-_putative,1


### Checking: Protein of unknown function

In [59]:
filter_data = search_string(neg_filtered, "protein_of_unknown_function")
filter_data['notes'].value_counts()

The of filtered data: (84, 16)
The unique values in column 'neg_name': 39


notes
protein_of_unknown_function_-_conserved                                                                                                 64
protein_of_unknown_function_(DUF3184)                                                                                                    4
Protein_of_unknown_function_(DUF962),_putative                                                                                           2
Protein_of_unknown_function_(DUF2946)_-_putative                                                                                         2
Protein_of_unknown_function_(DUF775)_-_putative                                                                                          2
Protein_of_unknown_function_-_conserved_(L1p/L10e_family)                                                                                2
Protein_of_unknown_function_N-terminal_domain_(DUF2450)/Sec8_exocyst_complex_component_specific_domain_containing_protein_-_putative     2
Protein_of_unknown_fu

In [62]:
filter_data[filter_data["gene"] == True]

Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
28,rejected_noCDS_c03.190,LinJ.03,235179,238900,+,LINF_030011300,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[235179, 238900]"
44,rejected_noCDS_c06.380,LinJ.06,59844,62540,-,LINF_060006900,,,True,False,False,False,False,,kinesin-like_protein,"[59844, 62540]"
47,rejected_noCDS_c06.380,LinJ.06,62983,65883,-,LINF_060007000,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[62983, 65883]"
63,rejected_noCDS_c06.410,LinJ.06,402727,406088,+,LINF_060014900,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[402727, 406088]"
95,rejected_noCDS_c12.1180S,LinJ.12,372226,374050,+,LINF_120012350,,,True,False,False,False,False,,"Protein_of_unknown_function_(DUF962),_putative","[372226, 374050]"
98,rejected_noCDS_c12.1180S,LinJ.12,374051,377775,+,LINF_120012400,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[374051, 377775]"
101,rejected_noCDS_c12.1190,LinJ.12,374051,377775,+,LINF_120012400,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[374051, 377775]"
121,rejected_noCDS_c12.1240S,LinJ.12,425243,429221,+,LINF_120013700,,,True,False,False,False,False,,surface_antigen_protein_2_-_putative,"[425243, 429221]"
124,rejected_noCDS_c12.1240S,LinJ.12,429222,431627,+,LINF_120013800,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[429222, 431627]"
206,rejected_noCDS_c12.1290T,LinJ.12,528652,535332,+,LINF_120016200,,,True,False,False,False,False,,protein_of_unknown_function_-_conserved,"[528652, 535332]"


In [60]:
filter_data[
    (filter_data["gene"] == True) & (filter_data["notes"] == "protein_of_unknown_function_(DUF3184)")
]

Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
314,rejected_noCDS_c16.1760Y,LinJ.16,371773,376703,+,LINF_160015800,,,True,False,False,False,False,,protein_of_unknown_function_(DUF3184),"[371773, 376703]"
317,rejected_noCDS_c16.1770Y,LinJ.16,377060,383329,+,LINF_160015820,,,True,False,False,False,False,,protein_of_unknown_function_(DUF3184),"[377060, 383329]"
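
The cell above matches the DUF3184 note with `==` rather than with `search_string`. One likely reason to be careful here is that `Series.str.contains` treats its pattern as a regular expression by default, so parentheses act as a group instead of literal characters. A short sketch of the difference (the `notes` series is invented, and the second entry is deliberately written without parentheses):

```python
import pandas as pd

notes = pd.Series([
    "protein_of_unknown_function_(DUF3184)",
    "protein_of_unknown_function_DUF3184",
    None,
])

# Default (regex=True): "(DUF3184)" is a capture group, so it matches the text
# "DUF3184" with or without literal parentheses around it
as_regex = notes.fillna("").str.contains("(DUF3184)", case=False)

# regex=False compares the text literally, parentheses included
as_literal = notes.fillna("").str.contains("(DUF3184)", case=False, regex=False)

print(as_regex.tolist(), as_literal.tolist())
```

pandas may emit a `UserWarning` about match groups for the first call. Passing `regex=False`, or escaping the pattern with `re.escape`, avoids both the warning and the looser match.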


In [63]:
# Now let's take all the elements with the same "neg_name" as `filter_data`
filter_data = neg_filtered[neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(filter_data)

Shape of the data frame is: (136, 16)
Number of unique values in column 'neg_name': 39 


In [64]:
# Now let's add the lines in `filter_data` to `good_negatives`
good_negatives = pd.concat([good_negatives, filter_data])
checking_data(good_negatives)

Shape of the data frame is: (485, 16)
Number of unique values in column 'neg_name': 143 


In [65]:
# Now let's remove the `filter_data` data from `neg_filtered`
neg_filtered = neg_filtered[~neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(neg_filtered)


Shape of the data frame is: (1366, 16)
Number of unique values in column 'neg_name': 497 
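
The same three steps (search the notes, expand to every row sharing a `neg_name`, then drop those rows from `neg_filtered`) repeat for each keyword in this section. They could be wrapped in a single helper. `split_by_note` below is a hypothetical name, shown on toy data as a sketch under the assumption that the frame has `notes` and `neg_name` columns:

```python
import pandas as pd


def split_by_note(data_frame, pattern):
    """Hypothetical helper: return (matched, remaining), split by whole neg_name groups."""
    # Rows whose note contains the pattern (case-insensitive, missing notes ignored)
    hits = data_frame["notes"].fillna("").str.contains(pattern, case=False)
    # Expand to every row that shares a neg_name with a matching row
    names = data_frame.loc[hits, "neg_name"].unique()
    in_group = data_frame["neg_name"].isin(names)
    return data_frame[in_group], data_frame[~in_group]


toy = pd.DataFrame({
    "neg_name": ["neg_1", "neg_1", "neg_2", "neg_3", "neg_3"],
    "notes": ["hypothetical_protein", None, "kinase", "Protein_of_unknown_function", None],
})

matched, remaining = split_by_note(toy, "hypothetical_protein")
print(matched["neg_name"].unique(), remaining["neg_name"].unique())
```

Returning both halves at once keeps the "matched" and "remaining" frames in sync, which is what the separate `isin` and `~isin` cells achieve by hand.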


In [66]:
# Check the global functions:
group_and_count(neg_filtered, "neg_name")

{'gene': 497, 'transcript': 497, '5utr': 26, '3utr': 271, 'CDS': 4}


Unnamed: 0,notes,count
0,amastin_surface_glycoprotein_-_putative,23
1,glucose_transporter,9
2,amastin-like_protein,8
3,UDP-galactose_transporter|LPG5A,7
4,amastin-like_surface_protein_-_putative,6
...,...,...
187,Haloacid_dehalogenase-like_hydrolase,1
188,protein-l-isoaspartate_o-methyltransferase_-_p...,1
189,2′-O-ribose_methyltransferase|MTr1,1
190,mitochondrial_carrier_protein_-_putative,1


### Checking: amastin

In [67]:
# Checking contents with amastin
filter_data = search_string(neg_filtered, "amastin")
filter_data['notes'].value_counts()

The of filtered data: (86, 16)
The unique values in column 'neg_name': 43


notes
amastin_surface_glycoprotein_-_putative    46
amastin-like_protein                       16
amastin-like_surface_protein_-_putative    12
amastin_surface_glycoprotein                8
amastin_surface_protein                     2
amastin-like_surface_protein                2
Name: count, dtype: int64

In [68]:
filter_data[filter_data["gene"] == True]

Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
78,rejected_noCDS_c08.770N,LinJ.08,337654,340312,+,LINF_080012900,,,True,False,False,False,False,,amastin-like_protein,"[337654, 340312]"
726,rejected_noCDS_c30.3870,LinJ.30,274753,277962,+,LINF_300014200,,,True,False,False,False,False,,amastin_surface_protein,"[274753, 277962]"
790,rejected_noCDS_c31.4130,LinJ.31,150123,152831,-,LINF_310009800,,,True,False,False,False,False,,amastin_surface_glycoprotein,"[150123, 152831]"
328,rejected_noCDS_c31.4120,LinJ.31,150123,152831,-,LINF_310009800,,,True,False,False,False,False,,amastin_surface_glycoprotein,"[150123, 152831]"
993,rejected_noCDS_c34.5090BR,LinJ.34,423724,426392,+,LINF_340015400,,,True,False,False,False,False,,amastin-like_surface_protein_-_putative,"[423724, 426392]"
998,rejected_noCDS_c34.5110BR,LinJ.34,432504,435210,+,LINF_340015500,,,True,False,False,False,False,,amastin-like_protein,"[432504, 435210]"
1001,rejected_noCDS_c34.5120BR,LinJ.34,445685,448392,+,LINF_340015800,,,True,False,False,False,False,,amastin-like_protein,"[445685, 448392]"
1004,rejected_noCDS_c34.5130BR,LinJ.34,450104,453599,+,LINF_340015900,,,True,False,False,False,False,,amastin-like_protein,"[450104, 453599]"
381,rejected_noCDS_c34.5140BS,LinJ.34,513172,516089,-,LINF_340017500,,,True,False,False,False,False,,amastin-like_protein,"[513172, 516089]"
1015,rejected_noCDS_c34.5190,LinJ.34,739819,743551,-,LINF_340022900,,,True,False,False,False,False,,amastin-like_protein,"[739819, 743551]"


Since we checked that these elements all come from the repetitive "amastin" elements, let's delete them

In [70]:
# Now let's take all the elements with the same "neg_name" as `filter_data`
filter_data = neg_filtered[neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(filter_data)

Shape of the data frame is: (141, 16)
Number of unique values in column 'neg_name': 43 


In [72]:
# Let's remove them from `neg_filtered`
neg_filtered = neg_filtered[~neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(neg_filtered)
group_and_count(neg_filtered, "neg_name")

Shape of the data frame is: (1225, 16)
Number of unique values in column 'neg_name': 454 
{'gene': 454, 'transcript': 454, '5utr': 25, '3utr': 229, 'CDS': 4}


Unnamed: 0,notes,count
0,glucose_transporter,9
1,UDP-galactose_transporter|LPG5A,7
2,Tripartite_attachment_complex_40|TAC40,6
3,phosphoglycan_beta_1-3_galactosyltransferase,5
4,Ketoacyl-CoA_synthase|Fatty_acid_elongase|ELO-3,5
...,...,...
181,Haloacid_dehalogenase-like_hydrolase,1
182,protein-l-isoaspartate_o-methyltransferase_-_p...,1
183,2′-O-ribose_methyltransferase|MTr1,1
184,mitochondrial_carrier_protein_-_putative,1


### Checking: glucose

In [74]:
filter_data = search_string(neg_filtered, "glucose")
filter_data['notes'].value_counts()

The of filtered data: (18, 16)
The unique values in column 'neg_name': 9


notes
glucose_transporter    18
Name: count, dtype: int64

In [75]:
filter_data[filter_data["gene"] == True]

Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
466,rejected_noCDS_c36.6650CS,LinJ.36,2456799,2460093,-,LINF_360072900,,,True,False,False,False,False,,glucose_transporter,"[2456799, 2460093]"
469,rejected_noCDS_c36.6660CS,LinJ.36,2460371,2463651,-,LINF_360073000,,,True,False,False,False,False,,glucose_transporter,"[2460371, 2463651]"
472,rejected_noCDS_c36.6670CS,LinJ.36,2463918,2467210,-,LINF_360073100,,,True,False,False,False,False,,glucose_transporter,"[2463918, 2467210]"
475,rejected_noCDS_c36.6680CS,LinJ.36,2467477,2470767,-,LINF_360073200,,,True,False,False,False,False,,glucose_transporter,"[2467477, 2470767]"
478,rejected_noCDS_c36.6690CS,LinJ.36,2471045,2474323,-,LINF_360073300,,,True,False,False,False,False,,glucose_transporter,"[2471045, 2474323]"
481,rejected_noCDS_c36.6700CS,LinJ.36,2474601,2477880,-,LINF_360073400,,,True,False,False,False,False,,glucose_transporter,"[2474601, 2477880]"
484,rejected_noCDS_c36.6710CS,LinJ.36,2478158,2481440,-,LINF_360073500,,,True,False,False,False,False,,glucose_transporter,"[2478158, 2481440]"
1341,rejected_noCDS_c36.6720CS,LinJ.36,2481878,2486596,-,LINF_360073600,,,True,False,False,False,False,,glucose_transporter,"[2481878, 2486596]"
1344,rejected_noCDS_c36.6730,LinJ.36,2481878,2486596,-,LINF_360073600,,,True,False,False,False,False,,glucose_transporter,"[2481878, 2486596]"


This is the whole CS family plus 6730. Let's remove them

In [76]:
# Take all elements with those names
filter_data = neg_filtered[neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(filter_data)

Shape of the data frame is: (27, 16)
Number of unique values in column 'neg_name': 9 


In [77]:
# Let's remove them from `neg_filtered`
neg_filtered = neg_filtered[~neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(neg_filtered)
group_and_count(neg_filtered, "neg_name")

Shape of the data frame is: (1198, 16)
Number of unique values in column 'neg_name': 445 
{'gene': 445, 'transcript': 445, '5utr': 25, '3utr': 220, 'CDS': 4}


Unnamed: 0,notes,count
0,UDP-galactose_transporter|LPG5A,7
1,Tripartite_attachment_complex_40|TAC40,6
2,phosphoglycan_beta_1-3_galactosyltransferase,5
3,Ketoacyl-CoA_synthase|Fatty_acid_elongase|ELO-3,5
4,alpha/beta_hydrolase,3
...,...,...
180,Haloacid_dehalogenase-like_hydrolase,1
181,protein-l-isoaspartate_o-methyltransferase_-_p...,1
182,2′-O-ribose_methyltransferase|MTr1,1
183,mitochondrial_carrier_protein_-_putative,1


### Checking: galactose

In [79]:
filter_data = search_string(neg_filtered, "galactose")
filter_data['notes'].value_counts()

The of filtered data: (16, 16)
The unique values in column 'neg_name': 7


notes
UDP-galactose_transporter|LPG5A    16
Name: count, dtype: int64

In [80]:
filter_data[filter_data["gene"] == True]

Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
506,rejected_noCDS_c24.2720,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
509,rejected_noCDS_c24.2730,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
512,rejected_noCDS_c24.2740,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
515,rejected_noCDS_c24.2750,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
518,rejected_noCDS_c24.2760,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
221,rejected_noCDS_c24.2710AJ,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
224,rejected_noCDS_c24.2770AJ,LinJ.24,110205,117720,+,LINF_240008300,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[110205, 117720]"
227,rejected_noCDS_c24.2770AJ,LinJ.24,117721,122337,+,LINF_240008400,,,True,False,False,False,False,,UDP-galactose_transporter|LPG5A,"[117721, 122337]"


In [81]:
# Check all rows with the "neg_names"
filter_data = neg_filtered[neg_filtered['neg_name'].isin(filter_data['neg_name'])]
checking_data(filter_data)

Shape of the data frame is: (26, 16)
Number of unique values in column 'neg_name': 7 
