Author: Ronny F. Pacheco Date: Sep 2024
Copyright: © 2024 Ronny Pacheco License: MIT License

---

MIT License

Copyright (c) 2024 Ronny Pacheco

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

# Needed modules

In [5]:
# Load the needed libraries
import pickle
import os

import pandas as pd
import numpy as np
import json

In [6]:
# https://kioku-space.com/en/jupyter-skip-execution/
from IPython.core.magic import register_cell_magic # type: ignore


@register_cell_magic
def skip(line, cell):
    print('Skipping cell')
    if line and cell:
        pass
    return

# Pickle save

In [7]:
# %%skip
# =============================================================================
# main function
# =============================================================================
def data_save_load(option):
    """
    This function is used to save or load data for the jupyter notebook
    """
    path_folder = "ipynb_db"  # Folder to save variables
    os.makedirs(path_folder, exist_ok=True)  # Create folder if not exist
    # Note: this returns the name of the current working directory, which is used as the file prefix
    notebook_name = os.path.basename(os.path.abspath(''))
    path_file = os.path.join(path_folder, f"{notebook_name}.variables.pkl") # Path to save the variables

    if option == "save":
        with open(path_file, "wb") as pickle_file:
            dict_variables = {
                "neg_to_filter" : neg_to_filter
            }
            pickle.dump(dict_variables, pickle_file)
    elif option == "load":
        with open(path_file, "rb") as pickle_file:
            variables = pickle.load(pickle_file)
        # Now load the variables
        for pickle_key, pickle_value in variables.items():
            print(f"* Loading variable: {pickle_key}")
            globals()[pickle_key] = pickle_value
# =============================================================================
# Call the function
# =============================================================================
data_save_load(option="load")

* Loading variable: neg_to_filter


# Prepare Data

## Load data

### Load negative (or rejected) elements

In [8]:
# Let's load the negative data
neg_df = pd.read_csv("./data_1.1/lre_named.csv", sep=",", header=0)
print(neg_df.shape)
print(neg_df.dtypes)
neg_df.head()

(324, 6)
sseqid     object
sstart      int64
send        int64
sstrand    object
sseq       object
name       object
dtype: object


Unnamed: 0,sseqid,sstart,send,sstrand,sseq,name
0,LinJ.01,272907,275894,plus,CTTTCTCTGTCTTCACTTCCTCGGTGCGTCTGGTGGTGGTTGCGCC...,lre_1.1
1,LinJ.02,95421,95743,plus,TCCGCGATCCGTGCAGTTGGCGCCGGCCCCTCCTTCACTGCCGATG...,lre_2.1
2,LinJ.02,138891,139213,plus,TCCGCGATCCGTGCAGTTGGCGCCGGCCCCTCCTTCACTGCCGATG...,lre_2.2
3,LinJ.02,198554,198709,plus,TCCGCGATCCGTGCAGTTGGCGCCGGCCCCTCCTTCACTGCCGATG...,lre_2.3
4,LinJ.04,121451,121728,plus,CCCCCCCATCCCTGCCACCATTCCCCCATTGCCGAACCACCCCTCA...,lre_3.1


### GTF data
This one will be **harder** to prepare

In [9]:
# Load data
gtf_df = pd.read_csv("./data/20240703111001_LINF-Tabla_maestra_v3-20244_RP_v0.8.gtf", sep="\t", header=None) 
print(gtf_df.shape)
print(gtf_df.dtypes)
gtf_df.head()

(45368, 9)
0    object
1    object
2    object
3     int64
4     int64
5    object
6    object
7    object
8    object
dtype: object


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,LinJ.01,CBM,gene,1520,5066,.,-,.,"gene_id ""LINF_010005000""; gene_name ""LINF_0100..."
1,LinJ.01,CBM,transcript,1520,5066,.,-,.,"parent_id ""LINF_010005000""; transcript_id ""LIN..."
2,LinJ.01,CBM,CDS,3710,4711,.,-,.,"parent_id ""LINF_01T0005000""; transcript_id ""LI..."
3,LinJ.01,CBM,3utr,1520,3709,.,-,.,"parent_id ""LINF_01T0005000"";"
4,LinJ.01,CBM,5utr,4712,5066,.,-,.,"parent_id ""LINF_01T0005000"";"


From `gtf_df` I only need columns 0, 2, 3, 4, 6 and 8

In [10]:
# Get from `gtf_df` the needed columns [0, 2, 3, 4, 6, 8]
gtf_df = gtf_df[[0, 2, 3, 4, 6, 8]]
gtf_df.columns = ["chrom", "feature", "start", "end", "strand", "attributes"]
print(gtf_df.shape)
print(gtf_df.dtypes)
gtf_df.head()

(45368, 6)
chrom         object
feature       object
start          int64
end            int64
strand        object
attributes    object
dtype: object


Unnamed: 0,chrom,feature,start,end,strand,attributes
0,LinJ.01,gene,1520,5066,-,"gene_id ""LINF_010005000""; gene_name ""LINF_0100..."
1,LinJ.01,transcript,1520,5066,-,"parent_id ""LINF_010005000""; transcript_id ""LIN..."
2,LinJ.01,CDS,3710,4711,-,"parent_id ""LINF_01T0005000""; transcript_id ""LI..."
3,LinJ.01,3utr,1520,3709,-,"parent_id ""LINF_01T0005000"";"
4,LinJ.01,5utr,4712,5066,-,"parent_id ""LINF_01T0005000"";"


The `attributes` field is separated by ";" and each entry has the format `header "data"`. We are going to split the "attributes" column into multiple columns, one per header.
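As a minimal sketch of that format, the helper below splits one attribute string into a dictionary. The name `parse_attributes` is hypothetical (it is not part of this notebook), and the example string imitates the first row of `gtf_df`. It splits on the first space only, which assumes the key never contains a space.

```python
def parse_attributes(attributes: str) -> dict:
    """Split a GTF-style attribute string into a {key: value} dict."""
    parsed = {}
    for field in attributes.split(";"):
        field = field.strip()
        if not field:  # skip the empty piece left after the trailing ";"
            continue
        # Split on the first space only, so a value containing spaces stays whole
        key, _, value = field.partition(" ")
        parsed[key] = value.strip().strip('"')
    return parsed


# Example string imitating the first row of the GTF file
example = 'gene_id "LINF_010005000"; gene_name "LINF_010005000";'
print(parse_attributes(example))
```
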

#### Transforming columns

First, get all the keys that appear in the `attributes` column

In [11]:
# Let's count first the number of elements in the `attributes` column
atr_dict = {}
for index, row in gtf_df.iterrows():
    # print(index, ":", sep="")
    for atr in row["attributes"].split(";"):
        atr = atr.strip()  # Remove leading and trailing whitespaces
        if len(atr.strip()) == 0:  # Skip empty attribute ""
            continue
        # print(f"\t{'-'*50}")
        # print(f"\tattribute: {atr.strip()}")
        key = atr.split(" ")[0] 
        if key not in atr_dict:
            atr_dict[key] = 1

        else:
            atr_dict[key] += 1
        # print(f"\t{atr_dict}")
print(atr_dict)

{'gene_id': 9861, 'gene_name': 9861, 'biotype': 17295, 'notes': 17319, 'parent_id': 35507, 'transcript_id': 18215, 'transcript_name': 9660, 'pseudogen': 49}


In [12]:
# get a list with the keys of atr_dict
atr_keys = list(atr_dict.keys())
print(atr_keys)

['gene_id', 'gene_name', 'biotype', 'notes', 'parent_id', 'transcript_id', 'transcript_name', 'pseudogen']


Now we have a list with all the attribute keys. When iterating over each row in the next steps, we can check whether each of these keys appears, and if not, add it with a `None` value.
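A compact way to express the same idea, shown here only as a sketch on toy data: start every record from `dict.fromkeys`, which sets all keys to `None`, and then overwrite the keys that are present. The names `all_keys`, `parsed_rows`, `records` and `filled_df` are made up for this illustration.

```python
import pandas as pd

# Hypothetical subset of the attribute keys found above
all_keys = ["gene_id", "parent_id", "transcript_id"]

# Toy parsed rows: each row carries only some of the keys
parsed_rows = [
    {"gene_id": "LINF_010005000"},
    {"parent_id": "LINF_010005000", "transcript_id": "LINF_01T0005000"},
]

records = []
for parsed in parsed_rows:
    record = dict.fromkeys(all_keys)  # every key starts as None
    record.update(parsed)             # overwrite with the values that exist
    records.append(record)

filled_df = pd.DataFrame(records, columns=all_keys)
print(filled_df)
```

Every record ends up with the same keys, so the resulting DataFrame has one column per attribute and missing values where an attribute was absent.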

In [13]:
# Now that we have the attributes count, let's create a dict for each row in `gtf_df` with the attributes separated
new_col_df = []
for index, row in gtf_df.iterrows():
    # print(index, ":", sep="")
    pre_data = []
    for atr in row["attributes"].split(";"):
        atr = atr.strip()  # Remove leading and trailing whitespaces
        if len(atr.strip()) == 0:  # Skip empty attribute ""
            continue
        key = atr.split(" ")[0]
        value = atr.split(" ")[1].replace('"', "")
        pre_data.append({key: value})
    
    for elem in atr_keys: # type: ignore  # Checking if the elements from atr_keys
        if elem not in [list(elem.keys())[0] for elem in pre_data]:  # If the element is not in pre_data, add it with value None
            # noinspection PyUnresolvedReferences
            pre_data.append({elem: None})

    flattened_data = {key: value for sublist in pre_data for key, value in sublist.items()}
    new_col_df.append(flattened_data)

In [14]:
# Checking how it worked
new_col_df  

[{'gene_id': 'LINF_010005000',
  'gene_name': 'LINF_010005000',
  'biotype': 'protein_coding',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'parent_id': None,
  'transcript_id': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_010005000',
  'transcript_id': 'LINF_01T0005000',
  'transcript_name': 'LINF_01T0005000',
  'biotype': 'protein_coding',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'gene_id': None,
  'gene_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'transcript_id': 'LINF_01T0005000',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'notes': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'notes': None,
  'transcript_id': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'notes': None,
  'transcri

In [15]:
# Transforming the list of dicts into a DataFrame
new_col_df = pd.DataFrame(new_col_df)
new_col_df

Unnamed: 0,gene_id,gene_name,biotype,notes,parent_id,transcript_id,transcript_name,pseudogen
0,LINF_010005000,LINF_010005000,protein_coding,Protein_of_unknown_function_(DUF2946),,,,
1,,,protein_coding,Protein_of_unknown_function_(DUF2946),LINF_010005000,LINF_01T0005000,LINF_01T0005000,
2,,,,,LINF_01T0005000,LINF_01T0005000,,
3,,,,,LINF_01T0005000,,,
4,,,,,LINF_01T0005000,,,
...,...,...,...,...,...,...,...,...
45363,,,,,LINF_36T0082400,LINF_36T0082400,,
45364,,,,,LINF_36T0082400,,,
45365,,,,,LINF_36T0082400,,,
45366,LINF_360082500,LINF_360082500,,,,,,


In [16]:
# Let's re-order the columns
new_col_df = new_col_df[["gene_id", "gene_name", "transcript_id", "transcript_name", "biotype", "parent_id", "pseudogen", "notes"]]
new_col_df

Unnamed: 0,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,,,LINF_01T0005000,,,LINF_01T0005000,,
3,,,,,,LINF_01T0005000,,
4,,,,,,LINF_01T0005000,,
...,...,...,...,...,...,...,...,...
45363,,,LINF_36T0082400,,,LINF_36T0082400,,
45364,,,,,,LINF_36T0082400,,
45365,,,,,,LINF_36T0082400,,
45366,LINF_360082500,LINF_360082500,,,,,,


In [17]:
# Concatenating the new DataFrame with the original `gtf_df` and dropping the `attributes` column
gtf_df = pd.concat([gtf_df, new_col_df], axis=1)
gtf_df.drop(columns="attributes", inplace=True)
gtf_df

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45363,LinJ.36,CDS,2739458,2740183,-,,,LINF_36T0082400,,,LINF_36T0082400,,
45364,LinJ.36,3utr,2738595,2739457,-,,,,,,LINF_36T0082400,,
45365,LinJ.36,5utr,2740184,2740374,-,,,,,,LINF_36T0082400,,
45366,LinJ.36,gene,2740760,2742268,-,LINF_360082500,LINF_360082500,,,,,,


# Compare coordinates

In this next part we are going to compare coordinates, to find which elements in **neg_df** lie inside which elements in **gtf_df**.

## Fail proof the data

In [18]:
# Copy data to make some fail-proof analysis
neg_df_test = neg_df.copy()
gtf_df_test = gtf_df.copy()

# Let's check the shapes
print(f"Shape of neg_df: {neg_df.shape}")
print(f"Shape of gtf_df: {gtf_df.shape}")

Shape of neg_df: (324, 6)
Shape of gtf_df: (45368, 13)


In [19]:
# Let's start by looking at gtf_df_test:
gtf_df_test.head()

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,


In [20]:
# Check elements where start < end
num_elements_start_less_end = (gtf_df_test['start'] < gtf_df_test['end']).sum()  # type: ignore
print(f"There are {num_elements_start_less_end} elements where start < end.")

# Check elements where start > end
num_elements_start_greater_end = (gtf_df_test['start'] > gtf_df_test['end']).sum()  # type: ignore
print(f"There are {num_elements_start_greater_end} elements where start > end.")

# Check elements where start == end
num_elements_start_equal_end = (gtf_df_test['start'] == gtf_df_test['end']).sum()  # type: ignore
print(f"There are {num_elements_start_equal_end} elements where start == end.")


There are 45366 elements where start < end.
There are 0 elements where start > end.
There are 2 elements where start == end.


In [21]:
# Let's check the rows where start == end
gtf_df_test[gtf_df_test['start'] == gtf_df_test['end']]


Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
495,LinJ.02,5utr,28840,28840,-,,,,,,LINF_02T0005800,,
33221,LinJ.32,5utr,1041948,1041948,-,,,,,,LINF_32T0033400,,


In [22]:
# Let's check the elements where transcript_id, transcript_name, parent_id can be "LINF_02T0005800" and "gene_id" the same but without "T"
gtf_df_test[(gtf_df_test['transcript_id'] == "LINF_02T0005800") | 
            (gtf_df_test['transcript_name'] == "LINF_02T0005800") | 
            (gtf_df_test['parent_id'] == "LINF_02T0005800") |
            (gtf_df_test['gene_id'] == "LINF_020005800")]

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
491,LinJ.02,gene,27302,28840,-,LINF_020005800,LINF_020005800,,,protein_coding,,,hypothetical_protein_-_conserved
492,LinJ.02,transcript,27302,28840,-,,,LINF_02T0005800,LINF_02T0005800,protein_coding,LINF_020005800,,hypothetical_protein_-_conserved
493,LinJ.02,CDS,27895,28839,-,,,LINF_02T0005800,,,LINF_02T0005800,,
494,LinJ.02,3utr,27302,27894,-,,,,,,LINF_02T0005800,,
495,LinJ.02,5utr,28840,28840,-,,,,,,LINF_02T0005800,,


We can see that the CDS runs all the way to the end of the transcript except for one base. That base is taken by the 5'UTR.
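A quick arithmetic check of that observation, as a sketch: GTF coordinates are 1-based and inclusive, so a feature's length is `end - start + 1`. The helper name `feature_length` is hypothetical; the coordinates are the ones shown above for LINF_02T0005800.

```python
def feature_length(start: int, end: int) -> int:
    """Length of a feature with 1-based, inclusive coordinates."""
    return end - start + 1


# Coordinates of LINF_02T0005800 taken from the table above
transcript = feature_length(27302, 28840)
cds = feature_length(27895, 28839)
utr3 = feature_length(27302, 27894)
utr5 = feature_length(28840, 28840)  # start == end, a single base

print(f"transcript={transcript}, CDS={cds}, 3utr={utr3}, 5utr={utr5}")
# The three parts should add up to the whole transcript
print(cds + utr3 + utr5 == transcript)
```
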

In [23]:
# Check the "feature" elements:
condition =(gtf_df_test['start'] < gtf_df_test['end'])
gtf_df_test[condition]['feature'].value_counts()

feature
gene          9861
transcript    9660
CDS           8744
3utr          8554
5utr          8547
Name: count, dtype: int64

Interesting: there should be as many 5utr as 3utr elements, but the counts differ (8547 vs. 8554).

In [24]:
# Checking without condition
gtf_df_test['feature'].value_counts()

feature
gene          9861
transcript    9660
CDS           8744
3utr          8554
5utr          8549
Name: count, dtype: int64

In [25]:
# Checking which elements lack a 3utr or a 5utr
parent_feature_dict = gtf_df_test.groupby('parent_id')['feature'].apply(list).to_dict()
filtered_dict = {k: v for k, v in parent_feature_dict.items() if v not in (['transcript'], 
                                                                           ['CDS'], 
                                                                           ['CDS', '5utr', '3utr'], 
                                                                           ['CDS', '3utr', '5utr'],
                                                                           ['CDS', '3utr', '5utr', 'CDS', '3utr', '5utr'],
                                                                           ['transcript', 'transcript'],
                                                                           ['CDS', '5utr', '3utr', 'CDS', '5utr', '3utr'])}
filtered_dict

{'LINF_27T0013600': ['CDS', '5utr'],
 'LINF_30T0006850': ['CDS', '3utr'],
 'LINF_31T0037100': ['CDS', '3utr'],
 'LINF_31T0039200': ['CDS', '3utr'],
 'LINF_36T0017400': ['CDS', '3utr'],
 'LINF_36T0036000': ['CDS', '3utr'],
 'LINF_36T0071100': ['CDS', '3utr']}

<span style="color:red">These are the elements without a 3utr or 5utr</span>

We should be careful with LINF_270013600: it is the only one missing a 3utr, while the rest are missing a 5utr.
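The check above lists every accepted ordering of the features by hand. As a sketch of an alternative, the features of each parent can be compared as a set, which ignores order and repetition. The toy data and the names `toy`, `expected` and `missing` are made up for this illustration. Unlike the cell above, this version also reports parents that only have a CDS.

```python
import pandas as pd

# Toy data: T1 is complete, T2 lacks a 3utr, T3 lacks a 5utr
toy = pd.DataFrame({
    "parent_id": ["T1", "T1", "T1", "T2", "T2", "T3", "T3"],
    "feature":   ["CDS", "3utr", "5utr", "CDS", "5utr", "CDS", "3utr"],
})

expected = {"CDS", "3utr", "5utr"}

# One set of features per parent, so order and duplicates do not matter
features_per_parent = toy.groupby("parent_id")["feature"].apply(set)

missing = {
    parent: sorted(expected - found)
    for parent, found in features_per_parent.items()
    if "CDS" in found and found != expected
}
print(missing)
```
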


In [26]:
gtf_df[
    (
            (gtf_df[['gene_id', 'transcript_id', 'parent_id']].isin(filtered_dict.keys()).any(axis=1)) | 
            (gtf_df['gene_id'].isin([elem.replace("T","") for elem in list(filtered_dict.keys())]))
     )
]

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
23181,LinJ.27,gene,327990,328645,+,LINF_270013600,LINF_270013600,,,protein_coding,,unknown,Stress_responsive_A/B_Barrel_domain-containing...
23182,LinJ.27,transcript,327990,328645,+,,,LINF_27T0013600,LINF_27T0013600,protein_coding,LINF_270013600,,Stress_responsive_A/B_Barrel_domain-containing...
23183,LinJ.27,CDS,328114,328645,+,,,LINF_27T0013600,,,LINF_27T0013600,,
23184,LinJ.27,5utr,327990,328113,+,,,,,,LINF_27T0013600,,
27827,LinJ.30,gene,56144,57262,-,LINF_300006850,LINF_300006850,,,protein_coding,,unknown,polynucleotide_kinase_3'-phosphatase-_putative...
27828,LinJ.30,transcript,56144,57262,-,,,LINF_30T0006850,LINF_30T0006850,protein_coding,LINF_300006850,,polynucleotide_kinase_3'-phosphatase-_putative...
27829,LinJ.30,CDS,56787,57262,-,,,LINF_30T0006850,,,LINF_30T0006850,,
27830,LinJ.30,3utr,56144,56786,-,,,,,,LINF_30T0006850,,
31523,LinJ.31,gene,1404369,1405546,-,LINF_310037100,LINF_310037100,,,protein_coding,,,protein_of_unknown_function_-_conserved
31524,LinJ.31,transcript,1404369,1405546,-,,,LINF_31T0037100,LINF_31T0037100,protein_coding,LINF_310037100,,protein_of_unknown_function_-_conserved


## Negative elements inside GTF elements

Let's check how the data looks in the dictionary.

This way we can find every NEGATIVE ELEMENT that is inside each GTF element.

In [27]:
# Create the boolean columns for each category in "feature"
boolean_df = pd.get_dummies(gtf_df['feature'], prefix='', prefix_sep='').astype(bool)

gtf_df = pd.concat([gtf_df, boolean_df], axis=1)
gtf_df.head()

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes,3utr,5utr,CDS,gene,transcript
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946),False,False,False,True,False
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946),False,False,False,False,True
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,,False,False,True,False,False
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,,True,False,False,False,False
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,,False,True,False,False,False


In [28]:
# Let's drop the original "feature" column and reorder the columns
gtf_df.drop(columns="feature", inplace=True)
gtf_df = gtf_df[["chrom", "start", "end", "strand", "gene_id", "transcript_id", "parent_id", "gene", "transcript", "CDS", "3utr", "5utr", "pseudogen", "notes"]]
gtf_df.head()

Unnamed: 0,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes
0,LinJ.01,1520,5066,-,LINF_010005000,,,True,False,False,False,False,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,1520,5066,-,,LINF_01T0005000,LINF_010005000,False,True,False,False,False,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,3710,4711,-,,LINF_01T0005000,LINF_01T0005000,False,False,True,False,False,,
3,LinJ.01,1520,3709,-,,,LINF_01T0005000,False,False,False,True,False,,
4,LinJ.01,4712,5066,-,,,LINF_01T0005000,False,False,False,False,True,,


Let's make sure that we use the same column names for `gtf_df` and `neg_df`

In [29]:
print(gtf_df.columns)
print(gtf_df.shape)
gtf_df.head()

Index(['chrom', 'start', 'end', 'strand', 'gene_id', 'transcript_id',
       'parent_id', 'gene', 'transcript', 'CDS', '3utr', '5utr', 'pseudogen',
       'notes'],
      dtype='object')
(45368, 14)


Unnamed: 0,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes
0,LinJ.01,1520,5066,-,LINF_010005000,,,True,False,False,False,False,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,1520,5066,-,,LINF_01T0005000,LINF_010005000,False,True,False,False,False,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,3710,4711,-,,LINF_01T0005000,LINF_01T0005000,False,False,True,False,False,,
3,LinJ.01,1520,3709,-,,,LINF_01T0005000,False,False,False,True,False,,
4,LinJ.01,4712,5066,-,,,LINF_01T0005000,False,False,False,False,True,,


### Clean "notes" column

In [30]:
gtf_functions = gtf_df["notes"].value_counts()
gtf_functions

notes
hypothetical_protein_-_conserved                                            4043
protein_of_unknown_function_-_conserved                                     1536
hypothetical_protein                                                         372
protein_kinase                                                               140
hypothetical_protein_-__conserved                                             86
                                                                            ... 
tb-292_membrane_associated_protein-like_protein_conflicted_zone_in_study       1
tRNA                                                                           1
tRNA-seC                                                                       1
tRNA-val|Anticodon_gac                                                         1
tRNA-Cys                                                                       1
Name: count, Length: 4436, dtype: int64

In [31]:
# In `gtf_functions` filter all names with "protein" and "hypothetical" inside using a regex:
gtf_functions_protein = gtf_functions[gtf_functions.index.str.contains(r"(?=.*protein)(?=.*hypothetical)", case=False)]
gtf_functions_protein

notes
hypothetical_protein_-_conserved                             4043
hypothetical_protein                                          372
hypothetical_protein_-__conserved                              86
hypothetical_protein_-_unknown_function                        10
Hypothetical_protein                                            8
hypothetical_protein_-_conserved_                               4
Conserved_hypothetical_ATP_binding_protein_-_putative           4
hypothetical_protein_-_conserved__                              2
hypothetical_protein_-_conserved_conflicted_zone_in_study       2
hypothetical_protein_-_conserved|GF1                            2
hypothetical_protein,_conserved                                 2
hypothetical_protein_pseudogene                                 2
hypothetical_protein_conserved                                  2
hypothetical_protein-unknown_function                           2
hypothetical_protein_conflicted_zone_in_study                   1
Name

In [32]:
# Let's change 

In [33]:
# Let's rename in `neg_df` the next columns:
# 'sseqid' to 'chrom'
# 'sstart' to 'start'
# 'send' to 'end'
neg_df.rename(columns={"sseqid": "chrom", "sstart": "start", "send": "end"}, inplace=True)
print(neg_df.columns)

Index(['chrom', 'start', 'end', 'sstrand', 'sseq', 'name'], dtype='object')


Now let's repeat the dictionary process.
The next dictionary will only hold the GTF elements that COMPLETELY contain each negative element.
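As a sketch of the containment rule on toy data: a negative element is completely inside a GTF element when both are on the same chromosome, the GTF start is at or before the negative start, and the GTF end is at or after the negative end. The frames `neg_toy` and `gtf_toy` and their values are made up. This version uses a merge instead of a row loop, which is only one possible way to do it.

```python
import pandas as pd

neg_toy = pd.DataFrame({
    "name": ["neg_a", "neg_b"],
    "chrom": ["chr1", "chr1"],
    "start": [100, 450],
    "end": [200, 600],
})
gtf_toy = pd.DataFrame({
    "feature_id": ["gene_1", "gene_2"],
    "chrom": ["chr1", "chr1"],
    "start": [50, 400],
    "end": [300, 500],
})

# Pair every negative element with every GTF element on the same chromosome
pairs = neg_toy.merge(gtf_toy, on="chrom", suffixes=("_neg", "_gtf"))

# Keep the pairs where the GTF element fully contains the negative element
inside = pairs[
    (pairs["start_gtf"] <= pairs["start_neg"])
    & (pairs["end_gtf"] >= pairs["end_neg"])
]
print(inside[["name", "feature_id"]])
```

Here only `neg_a` is contained (in `gene_1`); `neg_b` starts inside `gene_2` but ends after it, so it only overlaps.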

In [34]:
# Create interval columns
neg_df["interval"] = pd.IntervalIndex.from_arrays(neg_df["start"], neg_df["end"], closed="both")
gtf_df["interval"] = pd.IntervalIndex.from_arrays(gtf_df["start"], gtf_df["end"], closed="both")

# initialize dict
neg_gtf_dict = {neg_name: [] for neg_name in neg_df["name"].unique()}

# Find elements in neg_df that are inside gtf_df
# Find contains using boolean indexing
for i, neg_row in neg_df.iterrows():
    # Boolean mask for intervals that contain the neg_df interval
    print(f"Analyzing elem {i + 1}/{neg_df.shape[0]}")

    mask = (gtf_df['chrom'] == neg_row['chrom']) & \
           (gtf_df['start'] <= neg_row['start']) & \
           (gtf_df['end'] >= neg_row['end'])
    contains = gtf_df[mask]
    for j, gtf_row in contains.iterrows():
        neg_gtf_dict[neg_row['name']].append(gtf_row.to_dict())

Analyzing elem 1/324
Analyzing elem 2/324
Analyzing elem 3/324
Analyzing elem 4/324
Analyzing elem 5/324
Analyzing elem 6/324
Analyzing elem 7/324
Analyzing elem 8/324
Analyzing elem 9/324
Analyzing elem 10/324
Analyzing elem 11/324
Analyzing elem 12/324
Analyzing elem 13/324
Analyzing elem 14/324
Analyzing elem 15/324
Analyzing elem 16/324
Analyzing elem 17/324
Analyzing elem 18/324
Analyzing elem 19/324
Analyzing elem 20/324
Analyzing elem 21/324
Analyzing elem 22/324
Analyzing elem 23/324
Analyzing elem 24/324
Analyzing elem 25/324
Analyzing elem 26/324
Analyzing elem 27/324
Analyzing elem 28/324
Analyzing elem 29/324
Analyzing elem 30/324
Analyzing elem 31/324
Analyzing elem 32/324
Analyzing elem 33/324
Analyzing elem 34/324
Analyzing elem 35/324
Analyzing elem 36/324
Analyzing elem 37/324
Analyzing elem 38/324
Analyzing elem 39/324
Analyzing elem 40/324
Analyzing elem 41/324
Analyzing elem 42/324
Analyzing elem 43/324
Analyzing elem 44/324
Analyzing elem 45/324
Analyzing elem 46/3

In [35]:
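# Prepare a pre JSON dict so the original dict is not altered
# (a plain assignment would only create a second name for the same object)
import copy
neg_gtf_relation_pre_json = copy.deepcopy(neg_gtf_dict)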

# Let's check the data
# print(neg_gtf_relation_pre_json)

In [36]:
# Let's count the data
counter_neg_inside = 0
counter_neg_not_inside = 0
for key, value in neg_gtf_dict.items():
    print("="*50)
    print(f"{key}:")
    if len(value) > 0:
        counter_neg_inside += 1
        for elem in value:
            print(f"\t{elem}")
    else:
        counter_neg_not_inside += 1

lre_1.1:
lre_2.1:
	{'chrom': 'LinJ.02', 'start': 93941, 'end': 102762, 'strand': '-', 'gene_id': 'LINF_020007000', 'transcript_id': None, 'parent_id': None, 'gene': True, 'transcript': False, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'phosphoglycan_beta_1-3_galactosyltransferase', 'interval': Interval(93941, 102762, closed='both')}
	{'chrom': 'LinJ.02', 'start': 93941, 'end': 102762, 'strand': '-', 'gene_id': None, 'transcript_id': 'LINF_02T0007000', 'parent_id': 'LINF_020007000', 'gene': False, 'transcript': True, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'phosphoglycan_beta_1-3_galactosyltransferase', 'interval': Interval(93941, 102762, closed='both')}
	{'chrom': 'LinJ.02', 'start': 93941, 'end': 99642, 'strand': '-', 'gene_id': None, 'transcript_id': None, 'parent_id': 'LINF_02T0007000', 'gene': False, 'transcript': False, 'CDS': False, '3utr': True, '5utr': False, 'pseudogen': None, 'notes': None, 'interval': Interval(9394

In [37]:
print(f"From the total of {len(neg_gtf_dict)} NEGATIVE ELEMENTS, {counter_neg_inside} are inside GTF elements and {counter_neg_not_inside} are not inside GTF elements.")

From the total of 324 NEGATIVE ELEMENTS, 230 are inside GTF elements and 94 are not inside GTF elements.


Let's split the elements into two dictionaries, depending on whether the length of their "values" list is > 0 or not:

In [38]:
# Get the elements whose list of values is not empty
neg_inside_gtf_dict = {key: value for key, value in neg_gtf_dict.items() if len(value) > 0}
print(len(neg_inside_gtf_dict))

# Get the elements whose list of values is empty
neg_not_inside_gtf_dict = {key: value for key, value in neg_gtf_dict.items() if len(value) == 0}
print(len(neg_not_inside_gtf_dict))

230
94


Let's transform it into a DataFrame

In [39]:
neg_inside_gtf_list = []
for key, value in neg_inside_gtf_dict.items():
    for elem in value:
        new_record = {'neg_name' : key}  # Create dict of 1 element
        new_record.update(elem)  # Update the dict with the values from elem, this way "neg_name" goes first
        neg_inside_gtf_list.append(new_record)

neg_inside_gtf_df = pd.DataFrame(neg_inside_gtf_list)

# Let's check the df
print(neg_inside_gtf_df.shape)
print(neg_inside_gtf_df.dtypes)
print(neg_inside_gtf_df['neg_name'].nunique())
neg_inside_gtf_df.head()

(615, 16)
neg_name                        object
chrom                           object
start                            int64
end                              int64
strand                          object
gene_id                         object
transcript_id                   object
parent_id                       object
gene                              bool
transcript                        bool
CDS                               bool
3utr                              bool
5utr                              bool
pseudogen                       object
notes                           object
interval         interval[int64, both]
dtype: object
230


Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
0,lre_2.1,LinJ.02,93941,102762,-,LINF_020007000,,,True,False,False,False,False,,phosphoglycan_beta_1-3_galactosyltransferase,"[93941, 102762]"
1,lre_2.1,LinJ.02,93941,102762,-,,LINF_02T0007000,LINF_020007000,False,True,False,False,False,,phosphoglycan_beta_1-3_galactosyltransferase,"[93941, 102762]"
2,lre_2.1,LinJ.02,93941,99642,-,,,LINF_02T0007000,False,False,False,True,False,,,"[93941, 99642]"
3,lre_2.2,LinJ.02,137296,139716,-,LINF_020007950,,,True,False,False,False,False,,,"[137296, 139716]"
4,lre_2.2,LinJ.02,137296,139716,-,,LINF_02T0007950,LINF_020007950,False,True,False,False,False,,,"[137296, 139716]"


With this DataFrame (615 rows covering 230 negative elements) we can now analyze which GTF features contain each negative element.

## NEGATIVE ELEMENTS intergenic

Now we will check the NEGATIVE ELEMENTS that are not completely inside any GTF element, using `neg_not_inside_gtf_dict`, to see whether they at least partially overlap one.

In [40]:
# Let's check again the dict:
print(len(neg_not_inside_gtf_dict))
neg_not_inside_gtf_dict

94


{'lre_1.1': [],
 'lre_3.1': [],
 'lre_3.2': [],
 'lre_4.1': [],
 'lre_4.2': [],
 'lre_6.1': [],
 'lre_7.1': [],
 'lre_9.1': [],
 'lre_9.2': [],
 'lre_9.3': [],
 'lre_9.4': [],
 'lre_11.1': [],
 'lre_11.2': [],
 'lre_14.1': [],
 'lre_11.3': [],
 'lre_14.2': [],
 'lre_11.4': [],
 'lre_15.2': [],
 'lre_19.1': [],
 'lre_19.2': [],
 'lre_19.3': [],
 'lre_19.4': [],
 'lre_19.5': [],
 'lre_19.6': [],
 'lre_23.1': [],
 'lre_24.1': [],
 'lre_24.2': [],
 'lre_26.1': [],
 'lre_27.1': [],
 'lre_26.2': [],
 'lre_30.1': [],
 'lre_30.2': [],
 'lre_26.3': [],
 'lre_1.3': [],
 'lre_35.1': [],
 'lre_35.2': [],
 'lre_36.1': [],
 'lre_36.2': [],
 'lre_37.1': [],
 'lre_45.1': [],
 'lre_45.2': [],
 'lre_26.4': [],
 'lre_50.1': [],
 'lre_52.1': [],
 'lre_50.2': [],
 'lre_51.2': [],
 'lre_52.2': [],
 'lre_50.3': [],
 'lre_51.3': [],
 'lre_52.3': [],
 'lre_50.4': [],
 'lre_51.4': [],
 'lre_55.1': [],
 'lre_58.4': [],
 'lre_59.1': [],
 'lre_15.3': [],
 'lre_67.1': [],
 'lre_55.2': [],
 'lre_69.1': [],
 'lre_69.

In [41]:
# Let's take the keys from the dict as a list:
neg_not_inside_gtf_dict_list = list(neg_not_inside_gtf_dict.keys())
print(neg_not_inside_gtf_dict_list)

['lre_1.1', 'lre_3.1', 'lre_3.2', 'lre_4.1', 'lre_4.2', 'lre_6.1', 'lre_7.1', 'lre_9.1', 'lre_9.2', 'lre_9.3', 'lre_9.4', 'lre_11.1', 'lre_11.2', 'lre_14.1', 'lre_11.3', 'lre_14.2', 'lre_11.4', 'lre_15.2', 'lre_19.1', 'lre_19.2', 'lre_19.3', 'lre_19.4', 'lre_19.5', 'lre_19.6', 'lre_23.1', 'lre_24.1', 'lre_24.2', 'lre_26.1', 'lre_27.1', 'lre_26.2', 'lre_30.1', 'lre_30.2', 'lre_26.3', 'lre_1.3', 'lre_35.1', 'lre_35.2', 'lre_36.1', 'lre_36.2', 'lre_37.1', 'lre_45.1', 'lre_45.2', 'lre_26.4', 'lre_50.1', 'lre_52.1', 'lre_50.2', 'lre_51.2', 'lre_52.2', 'lre_50.3', 'lre_51.3', 'lre_52.3', 'lre_50.4', 'lre_51.4', 'lre_55.1', 'lre_58.4', 'lre_59.1', 'lre_15.3', 'lre_67.1', 'lre_55.2', 'lre_69.1', 'lre_69.2', 'lre_37.2', 'lre_37.3', 'lre_71.2', 'lre_71.3', 'lre_73.1', 'lre_73.2', 'lre_73.3', 'lre_73.4', 'lre_72.9', 'lre_73.5', 'lre_73.6', 'lre_73.7', 'lre_72.10', 'lre_73.8', 'lre_73.10', 'lre_72.11', 'lre_81.3', 'lre_81.4', 'lre_81.5', 'lre_81.6', 'lre_88.1', 'lre_90.1', 'lre_88.2', 'lre_93.2', 

Let's find the overlapping elements:

In [42]:
# Call the old code but only using the keys from the list
# initialize dict
neg_gtf_dict_2 = {name: [] for name in neg_not_inside_gtf_dict_list}

# Find the gtf_df rows that overlap each element of neg_df, using boolean indexing
for i, neg_row in neg_df.iterrows():
    print(f"Analyzing elem {i+1}/{neg_df.shape[0]}")

    # Skip the elements that are not in the "not completely inside GTF" list
    if neg_row['name'] not in neg_not_inside_gtf_dict_list:
        continue

    # Boolean mask for the GTF intervals that overlap the neg interval
    mask = (gtf_df['chrom'] == neg_row['chrom']) & \
            (
                ((neg_row['start'] >= gtf_df['start']) & (neg_row['start'] <= gtf_df['end'])) |
                ((neg_row['end'] >= gtf_df['start']) & (neg_row['end'] <= gtf_df['end'])) |
                ((gtf_df['start'] >= neg_row['start']) & (gtf_df['start'] <= neg_row['end'])) |
                ((gtf_df['end'] >= neg_row['start']) & (gtf_df['end'] <= neg_row['end']))
                )
    
    overlaps = gtf_df[mask]
    
    for j, gtf_row in overlaps.iterrows():
        neg_gtf_dict_2[neg_row['name']].append(gtf_row.to_dict())

Analyzing elem 1/324
Analyzing elem 2/324
Analyzing elem 3/324
Analyzing elem 4/324
Analyzing elem 5/324
Analyzing elem 6/324
Analyzing elem 7/324
Analyzing elem 8/324
Analyzing elem 9/324
Analyzing elem 10/324
Analyzing elem 11/324
Analyzing elem 12/324
Analyzing elem 13/324
Analyzing elem 14/324
Analyzing elem 15/324
Analyzing elem 16/324
Analyzing elem 17/324
Analyzing elem 18/324
Analyzing elem 19/324
Analyzing elem 20/324
Analyzing elem 21/324
Analyzing elem 22/324
Analyzing elem 23/324
Analyzing elem 24/324
Analyzing elem 25/324
Analyzing elem 26/324
Analyzing elem 27/324
Analyzing elem 28/324
Analyzing elem 29/324
Analyzing elem 30/324
Analyzing elem 31/324
Analyzing elem 32/324
Analyzing elem 33/324
Analyzing elem 34/324
Analyzing elem 35/324
Analyzing elem 36/324
Analyzing elem 37/324
Analyzing elem 38/324
Analyzing elem 39/324
Analyzing elem 40/324
Analyzing elem 41/324
Analyzing elem 42/324
Analyzing elem 43/324
Analyzing elem 44/324
Analyzing elem 45/324
Analyzing elem 46/3
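
The four-clause mask in the cell above is one way to test overlap between closed intervals. As a sketch, the same test can be written with two comparisons, because two closed intervals share a position exactly when each one starts no later than the other ends. The `gtf_toy` table and the `overlap_mask` and `four_clause_mask` helpers are hypothetical names used only for illustration; they are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy annotation table; coordinates are closed on both ends,
# matching the Interval(..., closed='both') values in the notebook output.
gtf_toy = pd.DataFrame({
    "chrom": ["LinJ.04", "LinJ.04", "LinJ.05"],
    "start": [100, 500, 100],
    "end":   [200, 600, 200],
})


def overlap_mask(gtf, chrom, start, end):
    """Boolean mask of rows in `gtf` sharing at least one position with [start, end]."""
    return (gtf["chrom"] == chrom) & (gtf["start"] <= end) & (gtf["end"] >= start)


def four_clause_mask(gtf, chrom, start, end):
    """The same test spelled out clause by clause, kept for comparison."""
    return (gtf["chrom"] == chrom) & (
        ((start >= gtf["start"]) & (start <= gtf["end"]))
        | ((end >= gtf["start"]) & (end <= gtf["end"]))
        | ((gtf["start"] >= start) & (gtf["start"] <= end))
        | ((gtf["end"] >= start) & (gtf["end"] <= end))
    )


# A query that starts inside the first feature and runs past its end.
mask = overlap_mask(gtf_toy, "LinJ.04", 150, 300)
print(gtf_toy[mask])
```

Both helpers assume `start <= end` for every interval. Under that assumption they select the same rows, so the shorter form is only a readability choice.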

In [43]:
counter_neg_overlaps = 0
counter_neg_not_overlaps = 0
for key, value in neg_gtf_dict_2.items():
    print(f"{'='*50}")
    print(f"{key}:")
    if len(value) > 0:
        counter_neg_overlaps += 1
        for elem in value:
            print(f"\t{elem}")
    else:
        counter_neg_not_overlaps += 1

lre_1.1:
lre_3.1:
lre_3.2:
	{'chrom': 'LinJ.04', 'start': 126105, 'end': 126374, 'strand': '-', 'gene_id': 'LINF_040008875', 'transcript_id': None, 'parent_id': None, 'gene': True, 'transcript': False, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': None, 'interval': Interval(126105, 126374, closed='both')}
	{'chrom': 'LinJ.04', 'start': 126105, 'end': 126374, 'strand': '-', 'gene_id': None, 'transcript_id': 'LINF_04T0008875', 'parent_id': 'LINF_040008875', 'gene': False, 'transcript': True, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': None, 'interval': Interval(126105, 126374, closed='both')}
lre_4.1:
	{'chrom': 'LinJ.05', 'start': 66289, 'end': 73912, 'strand': '+', 'gene_id': 'LINF_050007100', 'transcript_id': None, 'parent_id': None, 'gene': True, 'transcript': False, 'CDS': False, '3utr': False, '5utr': False, 'pseudogen': None, 'notes': 'dual_specificity_phosphatase-like_protein', 'interval': Interval(66289, 73912, closed='both')

In [44]:
print(f"From the total of {len(neg_not_inside_gtf_dict_list)} NEGATIVE ELEMENTS, {counter_neg_overlaps} are overlapping the GTF elements and {counter_neg_not_overlaps} are not overlapping the GTF elements")

From the total of 94 NEGATIVE ELEMENTS, 77 are overlapping the GTF elements and 17 are not overlapping the GTF elements


Let's join the two "neg_gtf_dict" dictionaries to save them as a JSON file

In [45]:
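# dict.copy() is shallow, so copy each list as well; otherwise extend() below
# would also modify the lists held by neg_gtf_dict
dict_neg_full_inside = {key: list(value) for key, value in neg_gtf_dict.items()}
dict_neg_overlap = neg_gtf_dict_2.copy()

# Let's join them
for key, value in dict_neg_overlap.items():
    if key in dict_neg_full_inside and len(value) > 0:
        dict_neg_full_inside[key].extend(value)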


In [46]:
# Save the data to a json file
path_gtf_neg_relation_json = "./data/neg_gtf_relation.json" # Path to save the json file

# The json module cannot serialize pandas Interval objects or NumPy integers, so we need a custom serializer
def custom_serializer(obj):
    if isinstance(obj, pd.Interval):
        return {
            'left': int(obj.left) if isinstance(obj.left, np.integer) else obj.left,  # json can't process np.int64; cast it to a Python int
            'right': int(obj.right) if isinstance(obj.right, np.integer) else obj.right,  # json can't process np.int64; cast it to a Python int
            'closed': obj.closed
        }
    elif isinstance(obj, np.integer):  # Check for numpy integer types
        return int(obj)  # Convert to a standard Python int
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Save the data to a json file
with open(path_gtf_neg_relation_json, "w") as f:
    json.dump(dict_neg_full_inside, f, default=custom_serializer)
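
To read the file back, the interval dictionaries written by `custom_serializer` have to be turned into `pd.Interval` objects again. The following is a minimal round-trip sketch, assuming integer coordinates and the same `left`/`right`/`closed` layout. `to_jsonable` and `restore_interval` are hypothetical helper names, and the toy record is illustrative rather than taken from the data.

```python
import json

import numpy as np
import pandas as pd


def to_jsonable(obj):
    """Fallback for json.dumps: handle pandas Interval and NumPy integers."""
    if isinstance(obj, pd.Interval):
        return {"left": int(obj.left), "right": int(obj.right), "closed": obj.closed}
    if isinstance(obj, np.integer):
        return int(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def restore_interval(record):
    """object_hook for json.loads: rebuild an Interval from its dict form."""
    if set(record) == {"left", "right", "closed"}:
        return pd.Interval(record["left"], record["right"], closed=record["closed"])
    return record


# Hypothetical record with the value types that trip up the json module.
toy = {"lre_x": [{"start": np.int64(10), "end": np.int64(20),
                  "interval": pd.Interval(10, 20, closed="both")}]}

text = json.dumps(toy, default=to_jsonable)
restored = json.loads(text, object_hook=restore_interval)
print(restored["lre_x"][0]["interval"])
```

`object_hook` is called for every decoded JSON object, so the key check in `restore_interval` is what separates interval dictionaries from ordinary records.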

In [47]:
# Let's get the elements in different dictionaries
neg_overlaps_gtf_dict = {key: value for key, value in neg_gtf_dict_2.items() if len(value) > 0}

# And now for the INTERGENIC elements
neg_intergenic_gtf_dict = {key: value for key, value in neg_gtf_dict_2.items() if len(value) == 0}

In [48]:
print(len(neg_intergenic_gtf_dict))
list(neg_intergenic_gtf_dict.keys())

17


['lre_1.1',
 'lre_3.1',
 'lre_6.1',
 'lre_15.2',
 'lre_35.1',
 'lre_35.2',
 'lre_51.2',
 'lre_51.3',
 'lre_51.4',
 'lre_58.4',
 'lre_59.1',
 'lre_37.2',
 'lre_73.6',
 'lre_73.10',
 'lre_81.6',
 'lre_90.1',
 'lre_99.2']

## NEG ELEMENTS OVERLAPPING

In [49]:
# Check the Dict
print(len(neg_overlaps_gtf_dict))
neg_overlaps_gtf_dict

77


{'lre_3.2': [{'chrom': 'LinJ.04',
   'start': 126105,
   'end': 126374,
   'strand': '-',
   'gene_id': 'LINF_040008875',
   'transcript_id': None,
   'parent_id': None,
   'gene': True,
   'transcript': False,
   'CDS': False,
   '3utr': False,
   '5utr': False,
   'pseudogen': None,
   'notes': None,
   'interval': Interval(126105, 126374, closed='both')},
  {'chrom': 'LinJ.04',
   'start': 126105,
   'end': 126374,
   'strand': '-',
   'gene_id': None,
   'transcript_id': 'LINF_04T0008875',
   'parent_id': 'LINF_040008875',
   'gene': False,
   'transcript': True,
   'CDS': False,
   '3utr': False,
   '5utr': False,
   'pseudogen': None,
   'notes': None,
   'interval': Interval(126105, 126374, closed='both')}],
 'lre_4.1': [{'chrom': 'LinJ.05',
   'start': 66289,
   'end': 73912,
   'strand': '+',
   'gene_id': 'LINF_050007100',
   'transcript_id': None,
   'parent_id': None,
   'gene': True,
   'transcript': False,
   'CDS': False,
   '3utr': False,
   '5utr': False,
   'pseudogen

In [50]:
# Transform it into a DataFrame
neg_overlaps_gtf_list = []
for key, value in neg_overlaps_gtf_dict.items():
    for elem in value:
        new_record = {'neg_name': key}  # Create dict of 1 element
        new_record.update(elem)  # Update the dict with the values from elem, this way "neg_name" goes first
        neg_overlaps_gtf_list.append(new_record)

neg_overlaps_gtf_df = pd.DataFrame(neg_overlaps_gtf_list)
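
The loop above flattens a dictionary of lists into one record per (negative element, GTF row) pair. As a sketch, the same reshaping can be written as a single comprehension. `toy_dict` is a hypothetical stand-in with the same shape as `neg_overlaps_gtf_dict`.

```python
import pandas as pd

# Hypothetical toy dictionary: one key per negative element,
# one dict per overlapped GTF row.
toy_dict = {
    "neg_a": [{"chrom": "chr1", "start": 1, "end": 10},
              {"chrom": "chr1", "start": 5, "end": 20}],
    "neg_b": [{"chrom": "chr2", "start": 3, "end": 8}],
}

# Putting "neg_name" first in the dict literal makes it the first column
# of the resulting DataFrame.
records = [{"neg_name": name, **row} for name, rows in toy_dict.items() for row in rows]
toy_df = pd.DataFrame(records)
print(toy_df)
```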

In [51]:
print(neg_overlaps_gtf_df.shape)
print(neg_overlaps_gtf_df.dtypes)
print(neg_overlaps_gtf_df['neg_name'].nunique())
neg_overlaps_gtf_df.head()

(266, 16)
neg_name                        object
chrom                           object
start                            int64
end                              int64
strand                          object
gene_id                         object
transcript_id                   object
parent_id                       object
gene                              bool
transcript                        bool
CDS                               bool
3utr                              bool
5utr                              bool
pseudogen                       object
notes                           object
interval         interval[int64, both]
dtype: object
77


Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
0,lre_3.2,LinJ.04,126105,126374,-,LINF_040008875,,,True,False,False,False,False,,,"[126105, 126374]"
1,lre_3.2,LinJ.04,126105,126374,-,,LINF_04T0008875,LINF_040008875,False,True,False,False,False,,,"[126105, 126374]"
2,lre_4.1,LinJ.05,66289,73912,+,LINF_050007100,,,True,False,False,False,False,,dual_specificity_phosphatase-like_protein,"[66289, 73912]"
3,lre_4.1,LinJ.05,66289,73912,+,,LINF_05T0007100,LINF_050007100,False,True,False,False,False,,dual_specificity_phosphatase-like_protein,"[66289, 73912]"
4,lre_4.1,LinJ.05,66289,67362,+,,,LINF_05T0007100,False,False,False,False,True,,,"[66289, 67362]"


### Divide "overlapping" and "overextended" elements
Some elements in `neg_overlaps_gtf_df` touch more than one GTF element, for example:
* 3'UTR, CDS
* 3'UTR transcript_1, 5'UTR transcript_2
* etc.

On the other hand, some elements touch only one GTF element and overextend into an intergenic zone.

In [52]:
# Let's group the elements by "neg_name"
groupy_neg_overlaps_gtf_df = neg_overlaps_gtf_df.groupby('neg_name')

In [53]:
# Create the pre-list to save the elements
true_overlaps_gtf = []
overextend_elements_gtf = []

# Iterate over the groupby object
for name, group in groupy_neg_overlaps_gtf_df:
    # Count how many rows of each feature type the element touches
    location = group[["gene", "transcript", "CDS", "3utr", "5utr"]].sum()
    rows = [elem for _, elem in group.iterrows()]

    # Elements that touch one gene and overextend through its 3'UTR or 5'UTR
    if location["gene"] == 1 and (location["3utr"] == 1 or location["5utr"] == 1):
        overextend_elements_gtf.extend(rows)
    # Elements that overextend from one gene without UTRs (non-coding genes)
    elif location["gene"] == 1 and location["3utr"] == 0 and location["5utr"] == 0:
        overextend_elements_gtf.extend(rows)
    else:  # The rest are the elements that truly overlap more than one GTF element
        true_overlaps_gtf.extend(rows)

# Let's create the DataFrames
true_overlaps_gtf_df = pd.DataFrame(true_overlaps_gtf, columns=neg_overlaps_gtf_df.columns)
overextend_elements_gtf_df = pd.DataFrame(overextend_elements_gtf, columns=neg_overlaps_gtf_df.columns)

In [54]:
print(f"True Overlaps: {true_overlaps_gtf_df.shape}"
      f"\n\tUnique negs: {true_overlaps_gtf_df['neg_name'].nunique()}")
print(f"Overextend Elements: {overextend_elements_gtf_df.shape}"
      f"\n\tUnique negs: {overextend_elements_gtf_df['neg_name'].nunique()}")

True Overlaps: (144, 16)
	Unique negs: 28
Overextend Elements: (122, 16)
	Unique negs: 49
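
The rule in the loop above relies on per-element sums of the boolean feature flags: an element counts as overextended when it touches exactly one gene and either a single UTR or no UTR at all. A toy sketch of that rule in vectorized form follows. The `toy` table and its element names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: "neg_a" touches one gene and its 3'UTR (overextended),
# "neg_b" touches two different genes (a true overlap).
toy = pd.DataFrame({
    "neg_name":   ["neg_a", "neg_a", "neg_a", "neg_b", "neg_b"],
    "gene":       [True, False, False, True, True],
    "transcript": [False, True, False, False, False],
    "CDS":        [False, False, False, False, False],
    "3utr":       [False, False, True, False, False],
    "5utr":       [False, False, False, False, False],
})

# Summing booleans counts how many rows of each feature type an element touches.
counts = toy.groupby("neg_name")[["gene", "transcript", "CDS", "3utr", "5utr"]].sum()

one_gene = counts["gene"] == 1
one_utr = (counts["3utr"] == 1) | (counts["5utr"] == 1)
no_utr = (counts["3utr"] == 0) & (counts["5utr"] == 0)

# Same rule as the loop: exactly one gene, plus either a single UTR or no UTR.
is_overextended = one_gene & (one_utr | no_utr)
labels = is_overextended.map({True: "overextended", False: "true_overlap"})
print(labels)
```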


# Analyze results

* **Total NEGATIVE ELEMENTS**: 324
* A) `neg_inside_gtf_df` ==> Data frame of the 230 negs that are inside the coordinates of a GTF element.
* **Not completely inside GTF:** 94
    * B) `neg_overlaps_gtf_df` ==> Data frame of the 77 elements that overlap GTF elements.
        * B.1) `true_overlaps_gtf_df` ==> 28 elements
        * B.2) `overextend_elements_gtf_df` ==> 49 elements
    * C) `neg_intergenic_gtf_dict` ==> Dictionary with the 17 INTERGENIC elements
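
The counts in this breakdown can be cross-checked with a few lines of arithmetic. The variable names below are hypothetical; the numbers are the ones reported in the cell outputs.

```python
# Counts reported in the notebook outputs and in the summary above.
total_negs = 324
inside_gtf = 230
not_fully_inside = 94
overlapping = 77
true_overlaps = 28
overextended = 49
intergenic = 17

assert inside_gtf + not_fully_inside == total_negs
assert overlapping + intergenic == not_fully_inside
assert true_overlaps + overextended == overlapping

# Elements kept for filtering: everything except the intergenic ones.
kept = inside_gtf + overlapping
print(kept)  # → 307
```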

## Prepare data

First, let's join the data frames `neg_inside_gtf_df` (230 negative elements) with `neg_overlaps_gtf_df` (77 negative elements) for a total of 307 negative elements (881 rows, one per overlapped GTF row).

In [None]:
# Let's join `neg_inside_gtf_df` with `neg_overlaps_gtf_df`. Only removing the intergenic ones.
neg_to_filter = pd.concat([neg_inside_gtf_df, neg_overlaps_gtf_df])

# Now let's sort them by chrom and then by start
neg_to_filter.sort_values(by=["chrom", "start"], inplace=True)

# Let's do some descriptive statistics

In [55]:
print(neg_to_filter.shape)
print(neg_to_filter['neg_name'].nunique())
print(neg_to_filter.dtypes)
neg_to_filter

(881, 16)
307
neg_name                        object
chrom                           object
start                            int64
end                              int64
strand                          object
gene_id                         object
transcript_id                   object
parent_id                       object
gene                              bool
transcript                        bool
CDS                               bool
3utr                              bool
5utr                              bool
pseudogen                       object
notes                           object
interval         interval[int64, both]
dtype: object


Unnamed: 0,neg_name,chrom,start,end,strand,gene_id,transcript_id,parent_id,gene,transcript,CDS,3utr,5utr,pseudogen,notes,interval
0,lre_2.1,LinJ.02,93941,102762,-,LINF_020007000,,,True,False,False,False,False,,phosphoglycan_beta_1-3_galactosyltransferase,"[93941, 102762]"
1,lre_2.1,LinJ.02,93941,102762,-,,LINF_02T0007000,LINF_020007000,False,True,False,False,False,,phosphoglycan_beta_1-3_galactosyltransferase,"[93941, 102762]"
2,lre_2.1,LinJ.02,93941,99642,-,,,LINF_02T0007000,False,False,False,True,False,,,"[93941, 99642]"
3,lre_2.2,LinJ.02,137296,139716,-,LINF_020007950,,,True,False,False,False,False,,,"[137296, 139716]"
4,lre_2.2,LinJ.02,137296,139716,-,,LINF_02T0007950,LINF_020007950,False,True,False,False,False,,,"[137296, 139716]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
610,lre_100.2,LinJ.36,2664678,2668468,-,LINF_360079500,,,True,False,False,False,False,,adaptin_complex_1_subunit|beta_adaptin,"[2664678, 2668468]"
611,lre_100.2,LinJ.36,2664678,2668468,-,,LINF_36T0079500,LINF_360079500,False,True,False,False,False,,adaptin_complex_1_subunit|beta_adaptin,"[2664678, 2668468]"
612,lre_100.2,LinJ.36,2664678,2665863,-,,,LINF_36T0079500,False,False,False,True,False,,,"[2664678, 2665863]"
613,lre_99.3,LinJ.36,2706448,2707984,-,LINF_360080850,,,True,False,False,False,False,,,"[2706448, 2707984]"
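
`pd.concat` keeps each frame's original row labels, so the combined table may carry duplicated index values and, after sorting, an index that is no longer in order. This does not affect a CSV written with `index=False`, but it matters for label-based lookups. A small sketch with hypothetical toy frames `a` and `b`:

```python
import pandas as pd

# Hypothetical toy frames standing in for the two tables being joined.
a = pd.DataFrame({"chrom": ["LinJ.02", "LinJ.01"], "start": [50, 10]})
b = pd.DataFrame({"chrom": ["LinJ.01"], "start": [5]})

# Plain concat keeps the labels of each input: 0, 1, 0.
stacked = pd.concat([a, b])

# ignore_index plus reset_index gives a fresh 0..n-1 index in sorted order.
tidy = (
    pd.concat([a, b], ignore_index=True)
    .sort_values(by=["chrom", "start"])
    .reset_index(drop=True)
)
print(stacked.index.tolist(), tidy.index.tolist())
```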


In [56]:
neg_to_filter.to_csv('./data_1.1/lre_to_filter.csv', header=True, index=False)