Author: Ronny F. Pacheco Date: Sep 2024
Copyright: © 2024 Ronny Pacheco License: MIT License

---

MIT License

Copyright (c) 2024 Ronny Pacheco

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

# Needed modules

In [14]:
# Load the needed libraries
import pickle
import os
import pandas as pd

In [15]:
# https://kioku-space.com/en/jupyter-skip-execution/
from IPython.core.magic import register_cell_magic # type: ignore

@register_cell_magic
def skip(line, cell):
    return

# Pickle save

In [16]:
%%skip
# =============================================================================
# main function
# =============================================================================
def data_save_load(option):
    """
    This function is used to save or load data for the Jupyter notebook.
    
    Parameters:
    option (str): Either 'save' or 'load' to save or load variables.
    
    Returns:
    dict: Dictionary of loaded variables (only when option is 'load').
    """
    path_folder = "ipynb_db"  # Folder to save variables
    os.makedirs(path_folder, exist_ok=True)  # Create folder if not exist
    path_file = os.path.join(path_folder, "variables.pkl") # Path to save the variables

    if option == "save":
        variables_dict = {}  # Dictionary to save the variables (add them here)
        with open(path_file, "wb") as f:
            pickle.dump(variables_dict, f)
    elif option == "load":
        with open(path_file, "rb") as f:
            variables = pickle.load(f)
        # Return the loaded variables, as described in the docstring
        return variables

# =============================================================================
# Call the function
# =============================================================================
data_save_load(option="save")
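
As written, the function always pickles an empty dictionary. A minimal sketch of the same save/load idea where the variables are passed in explicitly could look like the following. The helper names `save_variables` and `load_variables`, the temporary folder, and the demo values are assumptions made for illustration; they are not part of the notebook.

```python
import os
import pickle
import tempfile


def save_variables(variables, path_file):
    """Pickle a dict of variables to `path_file` (hypothetical helper)."""
    os.makedirs(os.path.dirname(path_file), exist_ok=True)
    with open(path_file, "wb") as f:
        pickle.dump(variables, f)


def load_variables(path_file):
    """Return the dict of variables stored in `path_file` (hypothetical helper)."""
    with open(path_file, "rb") as f:
        return pickle.load(f)


# Demo in a temporary folder so nothing in the project is overwritten
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "ipynb_db", "variables.pkl")

save_variables({"n_rows": 682, "chroms": ["LinJ.01", "LinJ.02"]}, demo_path)
restored = load_variables(demo_path)
print(restored)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.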

# 1. Prepare Data

## 1. Load data

### 1.1 Load negative (or rejected) elements

In [17]:
# Let's load the negative data
neg_df = pd.read_csv("./data/negative_database_nomatch_corrected_named.csv", sep=",", header=0)
print(neg_df.shape)
print(neg_df.dtypes)
neg_df.head()

(682, 6)
sseqid     object
sstart      int64
send        int64
sstrand    object
sseq       object
name       object
dtype: object


Unnamed: 0,sseqid,sstart,send,sstrand,sseq,name
0,LinJ.01,36103,36242,plus,AGACAGACCGACACACGCAGCCGTGTGATGCCGCCGCCGAGGGCAG...,rejected_noCDS_c01.10
1,LinJ.01,113760,114388,plus,CAGCGCCATGCACGACATGGCCGCTGACGTCCGTAGCCCTAACTCG...,rejected_noCDS_c01.20
2,LinJ.01,146412,146530,plus,GCGAATTGTGTTCTGCGCATGCCTCTTCTCTGCCGTGCAGCATGCG...,rejected_noCDS_c01.30
3,LinJ.01,261866,262439,plus,CGGACTTGGCAAGTGGCCGCCATCGATGAAAACGCACCATGCCTTT...,rejected_noCDS_c01.40
4,LinJ.01,271363,271650,plus,CGAACGCCGCCCTCAATCGCGCGCTGAACTTCACGCGGCGGTCGAC...,rejected_noCDS_c01.50


Now we need to get the **sider_name** without the 'ID=' element

### 1.2 GTF data
This one will be **harder** to prepare

In [18]:
# Load data
gtf_df = pd.read_csv("./data/20240703111001_LINF-Tabla_maestra_v3-20244_RP_v0.8.gtf", sep="\t", header=None)
print(gtf_df.shape)
print(gtf_df.dtypes)
gtf_df.head()

(45179, 9)
0    object
1    object
2    object
3     int64
4     int64
5    object
6    object
7    object
8    object
dtype: object


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,LinJ.01,CBM,gene,1520,5066,.,-,.,"gene_id ""LINF_010005000""; gene_name ""LINF_0100..."
1,LinJ.01,CBM,transcript,1520,5066,.,-,.,"parent_id ""LINF_010005000""; transcript_id ""LIN..."
2,LinJ.01,CBM,CDS,3710,4711,.,-,.,"parent_id ""LINF_01T0005000""; transcript_id ""LI..."
3,LinJ.01,CBM,3utr,1520,3709,.,-,.,"parent_id ""LINF_01T0005000""; notes ""Protein_of..."
4,LinJ.01,CBM,5utr,4712,5066,.,-,.,"parent_id ""LINF_01T0005000""; notes ""Protein_of..."


From `gtf_df` I only need columns 0, 2, 3, 4, 6 and 8

In [19]:
# Get from `gtf_df` the needed columns [0, 2, 3, 4, 6, 8]
gtf_df = gtf_df[[0, 2, 3, 4, 6, 8]]
gtf_df.columns = ["chrom", "feature", "start", "end", "strand", "attributes"]
print(gtf_df.shape)
print(gtf_df.dtypes)
gtf_df.head()

(45179, 6)
chrom         object
feature       object
start          int64
end            int64
strand        object
attributes    object
dtype: object


Unnamed: 0,chrom,feature,start,end,strand,attributes
0,LinJ.01,gene,1520,5066,-,"gene_id ""LINF_010005000""; gene_name ""LINF_0100..."
1,LinJ.01,transcript,1520,5066,-,"parent_id ""LINF_010005000""; transcript_id ""LIN..."
2,LinJ.01,CDS,3710,4711,-,"parent_id ""LINF_01T0005000""; transcript_id ""LI..."
3,LinJ.01,3utr,1520,3709,-,"parent_id ""LINF_01T0005000""; notes ""Protein_of..."
4,LinJ.01,5utr,4712,5066,-,"parent_id ""LINF_01T0005000""; notes ""Protein_of..."


The `attributes` field is a list of entries separated by ";", and each entry has the format `header "data"`. We are going to split the `attributes` column into multiple columns, one per header.
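
As a minimal sketch of that transformation on a single value, the helper below turns one attributes string into a dictionary. The function name `parse_attributes` and the example string are illustrative assumptions (the string is assembled from values shown in the outputs of this notebook, not copied from the file). It assumes that no ";" appears inside a quoted value.

```python
def parse_attributes(attributes):
    """Split a GTF-style attributes string into a {header: data} dict."""
    parsed = {}
    for field in attributes.split(";"):
        field = field.strip()
        if not field:  # Skip the empty piece after the last ";"
            continue
        # Split only on the first space: header on the left, quoted data on the right
        key, _, value = field.partition(" ")
        parsed[key] = value.strip().strip('"')
    return parsed


# Illustrative string, not read from the GTF file
example = 'gene_id "LINF_010005000"; gene_name "LINF_010005000"; biotype "protein_coding";'
parsed_example = parse_attributes(example)
print(parsed_example)
```

Splitting on the first space only keeps a value intact even if it contains spaces, whereas taking the second element of `split(" ")` would cut it at the first one.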

#### 1.2.1 Transforming columns

First, get all the keys that appear in the `attributes` column

In [20]:
# Let's count first the number of elements in the `attributes` column
atr_dict = {}
for index, row in gtf_df.iterrows():
    # print(index, ":", sep="")
    for atr in row["attributes"].split(";"):
        atr = atr.strip()  # Remove leading and trailing whitespaces
        if len(atr.strip()) == 0:  # Skip empty attribute ""
            continue
        # print(f"\t{'-'*50}")
        # print(f"\tattribute: {atr.strip()}")
        key = atr.split(" ")[0] 
        if key not in atr_dict:
            atr_dict[key] = 1

        else:
            atr_dict[key] += 1
        # print(f"\t{atr_dict}")
print(atr_dict)

{'gene_id': 9861, 'gene_name': 9861, 'biotype': 17106, 'notes': 42974, 'parent_id': 35318, 'transcript_id': 18215, 'transcript_name': 9660, 'pseudogen': 49}


In [21]:
# get a list with the keys of atr_dict
atr_keys = list(atr_dict.keys())
print(atr_keys)

['gene_id', 'gene_name', 'biotype', 'notes', 'parent_id', 'transcript_id', 'transcript_name', 'pseudogen']


Now we have a list with all the attribute keys. When iterating over the rows in the next step, we can check whether each of these keys appears and, if not, add it to the row with a `None` value.
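
For reference, pandas also fills in missing keys on its own when a DataFrame is built from a list of dictionaries. The small sketch below, with made-up rows, shows that behaviour. Missing entries come out as `NaN` rather than `None`, so adding the keys explicitly is still useful when a fixed set of columns or `None` values are wanted.

```python
import pandas as pd

# Two illustrative rows with different sets of keys
rows = [
    {"gene_id": "LINF_010005000", "biotype": "protein_coding"},
    {"parent_id": "LINF_010005000", "biotype": "protein_coding"},
]

demo_df = pd.DataFrame(rows)
print(demo_df)
print(demo_df.isna())
```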

In [22]:
# Now that we have the attribute keys, build one dict per row of `gtf_df` with the attributes separated
new_col_df = []
for index, row in gtf_df.iterrows():
    row_data = {}
    for atr in row["attributes"].split(";"):
        atr = atr.strip()  # Remove leading and trailing whitespaces
        if len(atr) == 0:  # Skip empty attribute ""
            continue
        parts = atr.split(" ")
        key = parts[0]
        value = parts[1].replace('"', "")  # Drop the surrounding quotes
        row_data[key] = value

    for elem in atr_keys:  # Add the keys missing from this row with value None
        if elem not in row_data:
            row_data[elem] = None

    new_col_df.append(row_data)

In [23]:
# Checking how it worked
new_col_df  

[{'gene_id': 'LINF_010005000',
  'gene_name': 'LINF_010005000',
  'biotype': 'protein_coding',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'parent_id': None,
  'transcript_id': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_010005000',
  'transcript_id': 'LINF_01T0005000',
  'transcript_name': 'LINF_01T0005000',
  'biotype': 'protein_coding',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'gene_id': None,
  'gene_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'transcript_id': 'LINF_01T0005000',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'notes': 'Protein_of_unknown_function_(DUF2946)',
  'gene_id': None,
  'gene_name': None,
  'biotype': None,
  'transcript_id': None,
  'transcript_name': None,
  'pseudogen': None},
 {'parent_id': 'LINF_01T0005000',
  'notes': 'Prote

In [24]:
# Transforming the list of dicts into a DataFrame
new_col_df = pd.DataFrame(new_col_df)
new_col_df

Unnamed: 0,gene_id,gene_name,biotype,notes,parent_id,transcript_id,transcript_name,pseudogen
0,LINF_010005000,LINF_010005000,protein_coding,Protein_of_unknown_function_(DUF2946),,,,
1,,,protein_coding,Protein_of_unknown_function_(DUF2946),LINF_010005000,LINF_01T0005000,LINF_01T0005000,
2,,,,Protein_of_unknown_function_(DUF2946),LINF_01T0005000,LINF_01T0005000,,
3,,,,Protein_of_unknown_function_(DUF2946),LINF_01T0005000,,,
4,,,,Protein_of_unknown_function_(DUF2946),LINF_01T0005000,,,
...,...,...,...,...,...,...,...,...
45174,,,,Hypothetical_protein_-_conserved,LINF_36T0082400,LINF_36T0082400,,
45175,,,,Hypothetical_protein_-_conserved,LINF_36T0082400,,,
45176,,,,Hypothetical_protein_-_conserved,LINF_36T0082400,,,
45177,LINF_360082500,LINF_360082500,,,,,,


In [25]:
# Let's re-order the columns
new_col_df = new_col_df[["gene_id", "gene_name", "transcript_id", "transcript_name", "biotype", "parent_id", "pseudogen", "notes"]]
new_col_df

Unnamed: 0,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,,,LINF_01T0005000,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
3,,,,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
4,,,,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
...,...,...,...,...,...,...,...,...
45174,,,LINF_36T0082400,,,LINF_36T0082400,,Hypothetical_protein_-_conserved
45175,,,,,,LINF_36T0082400,,Hypothetical_protein_-_conserved
45176,,,,,,LINF_36T0082400,,Hypothetical_protein_-_conserved
45177,LINF_360082500,LINF_360082500,,,,,,


In [26]:
# Concatenating the new DataFrame with the original `gtf_df` and dropping the `attributes` column
gtf_df = pd.concat([gtf_df, new_col_df], axis=1)
gtf_df.drop(columns="attributes", inplace=True)
gtf_df

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45174,LinJ.36,CDS,2739458,2740183,-,,,LINF_36T0082400,,,LINF_36T0082400,,Hypothetical_protein_-_conserved
45175,LinJ.36,3utr,2738595,2739457,-,,,,,,LINF_36T0082400,,Hypothetical_protein_-_conserved
45176,LinJ.36,5utr,2740184,2740374,-,,,,,,LINF_36T0082400,,Hypothetical_protein_-_conserved
45177,LinJ.36,gene,2740760,2742268,-,LINF_360082500,LINF_360082500,,,,,,


## 2. Compare coordinates

In this part we compare coordinates to find which elements of **sider_df** lie inside which elements of **gtf_df**.
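
One possible way to express this containment check is sketched below on tiny demo tables. The column names follow `neg_df` and `gtf_df`, but the element `demo_inside` and its coordinates are made up, and strand is ignored. This is only a sketch of the idea, not necessarily the approach used later.

```python
import pandas as pd

# Demo elements (same column names as `neg_df`); `demo_inside` is made up
elements = pd.DataFrame({
    "sseqid": ["LinJ.01", "LinJ.01"],
    "sstart": [2000, 36103],
    "send": [2500, 36242],
    "name": ["demo_inside", "demo_outside"],
})

# Demo annotation rows (same column names as `gtf_df`)
features = pd.DataFrame({
    "chrom": ["LinJ.01", "LinJ.01"],
    "feature": ["gene", "3utr"],
    "start": [1520, 1520],
    "end": [5066, 3709],
})

# Pair every element with every feature on the same chromosome
pairs = elements.merge(features, left_on="sseqid", right_on="chrom", how="inner")

# Keep the pairs where the element lies completely inside the feature
inside = pairs[(pairs["sstart"] >= pairs["start"]) & (pairs["send"] <= pairs["end"])]
hits = inside[["name", "feature", "start", "end"]].reset_index(drop=True)
print(hits)
```

The merge builds every element/feature pair per chromosome, which is fine for a few hundred elements but can get large for bigger tables.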

### 2.1 Fail-proof the data

In [31]:
# Copy data to make some fail-proof analysis
neg_df_test = neg_df.copy()
gtf_df_test = gtf_df.copy()

# Let's check the shapes
print(f"Shape of neg_df: {neg_df.shape}")
print(f"Shape of gtf_df: {gtf_df.shape}")

Shape of neg_df: (682, 6)
Shape of gtf_df: (45179, 13)


In [32]:
# Let's start by looking at gtf_df_test:
gtf_df_test.head()

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
0,LinJ.01,gene,1520,5066,-,LINF_010005000,LINF_010005000,,,protein_coding,,,Protein_of_unknown_function_(DUF2946)
1,LinJ.01,transcript,1520,5066,-,,,LINF_01T0005000,LINF_01T0005000,protein_coding,LINF_010005000,,Protein_of_unknown_function_(DUF2946)
2,LinJ.01,CDS,3710,4711,-,,,LINF_01T0005000,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
3,LinJ.01,3utr,1520,3709,-,,,,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)
4,LinJ.01,5utr,4712,5066,-,,,,,,LINF_01T0005000,,Protein_of_unknown_function_(DUF2946)


In [45]:
# Check elements where start < end
num_elements_start_less_end = (gtf_df_test['start'] < gtf_df_test['end']).sum()
print(f"There are {num_elements_start_less_end} elements where start < end.")

# Check elements where start > end
num_elements_start_greater_end = (gtf_df_test['start'] > gtf_df_test['end']).sum()
print(f"There are {num_elements_start_greater_end} elements where start > end.")

# Check elements where start == end
num_elements_start_equal_end = (gtf_df_test['start'] == gtf_df_test['end']).sum()
print(f"There are {num_elements_start_equal_end} elements where start == end.")


There are 45177 elements where start < end.
There are 0 elements where start > end.
There are 2 elements where start == end.


In [46]:
# Let's check the rows where start == end
gtf_df_test[gtf_df_test['start'] == gtf_df_test['end']]


Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
495,LinJ.02,5utr,28840,28840,-,,,,,,LINF_02T0005800,,hypothetical_protein_-_conserved
33059,LinJ.32,5utr,1041948,1041948,-,,,,,,LINF_32T0033400,,SpoU_rRNA_Methylase_family


In [47]:
# Let's check every row linked to "LINF_02T0005800" through transcript_id, transcript_name or parent_id, plus its gene (same ID without the "T")
gtf_df_test[(gtf_df_test['transcript_id'] == "LINF_02T0005800") | 
            (gtf_df_test['transcript_name'] == "LINF_02T0005800") | 
            (gtf_df_test['parent_id'] == "LINF_02T0005800") |
            (gtf_df_test['gene_id'] == "LINF_020005800")]

Unnamed: 0,chrom,feature,start,end,strand,gene_id,gene_name,transcript_id,transcript_name,biotype,parent_id,pseudogen,notes
491,LinJ.02,gene,27302,28840,-,LINF_020005800,LINF_020005800,,,protein_coding,,,hypothetical_protein_-_conserved
492,LinJ.02,transcript,27302,28840,-,,,LINF_02T0005800,LINF_02T0005800,protein_coding,LINF_020005800,,hypothetical_protein_-_conserved
493,LinJ.02,CDS,27895,28839,-,,,LINF_02T0005800,,,LINF_02T0005800,,hypothetical_protein_-_conserved
494,LinJ.02,3utr,27302,27894,-,,,,,,LINF_02T0005800,,hypothetical_protein_-_conserved
495,LinJ.02,5utr,28840,28840,-,,,,,,LINF_02T0005800,,hypothetical_protein_-_conserved


We can see that the CDS runs to the end of the transcript except for one base (28840). That base is assigned to the 5'UTR, which is why this 5utr has start == end.
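
The arithmetic can be checked with the coordinates of this transcript from the table above, assuming the file follows the usual GTF convention of 1-based, inclusive coordinates, so that a feature's length is `end - start + 1`:

```python
import pandas as pd

# Coordinates of LINF_02T0005800 taken from the output above
demo = pd.DataFrame({
    "feature": ["transcript", "CDS", "3utr", "5utr"],
    "start": [27302, 27895, 27302, 28840],
    "end": [28840, 28839, 27894, 28840],
})

# With 1-based inclusive coordinates, start == end means a length of one base
demo["length"] = demo["end"] - demo["start"] + 1
lengths = dict(zip(demo["feature"], demo["length"]))
print(lengths)
```

Under that convention the CDS, 3utr and 5utr lengths add up to the transcript length, so the single-base 5utr is consistent with the rest of the annotation.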

In [40]:
# Check the "feature" elements:
condition = (gtf_df_test['start'] < gtf_df_test['end'])
gtf_df_test[condition]['feature'].value_counts()

feature
gene          9861
transcript    9660
CDS           8555
3utr          8554
5utr          8547
Name: count, dtype: int64

Interesting: we would expect as many 5utr as 3utr features, but here there are 8554 3utr and only 8547 5utr.

In [41]:
# Checking without condition
gtf_df_test['feature'].value_counts()

feature
gene          9861
transcript    9660
CDS           8555
3utr          8554
5utr          8549
Name: count, dtype: int64

In [48]:
# Find the parent_ids whose children do not include both a 3utr and a 5utr
parent_feature_dict = gtf_df_test.groupby('parent_id')['feature'].apply(list).to_dict()
filtered_dict = {k: v for k, v in parent_feature_dict.items() if v not in (['transcript'], 
                                                                           ['CDS'], 
                                                                           ['CDS', '5utr', '3utr'], 
                                                                           ['CDS', '3utr', '5utr'],
                                                                           ['CDS', '3utr', '5utr', 'CDS', '3utr', '5utr'],
                                                                           ['transcript', 'transcript'],
                                                                           ['CDS', '5utr', '3utr', 'CDS', '5utr', '3utr'])}
filtered_dict

{'LINF_27T0013600': ['CDS', '5utr'],
 'LINF_30T0006850': ['CDS', '3utr'],
 'LINF_31T0037100': ['CDS', '3utr'],
 'LINF_31T0039200': ['CDS', '3utr'],
 'LINF_36T0017400': ['CDS', '3utr'],
 'LINF_36T0036000': ['CDS', '3utr'],
 'LINF_36T0071100': ['CDS', '3utr']}

<span style="color:red">These are the elements without a 3utr or 5utr</span>

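We should be careful with LINF_270013600: its transcript (LINF_27T0013600) is the only one missing the 3utr, while the others are missing the 5utr.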
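
The filter above depends on listing every order in which the features can appear. A sketch of an order-independent alternative compares sets instead. The demo table and the IDs `tx_a`, `tx_b`, `tx_c` and `gene_a` are made up for illustration.

```python
import pandas as pd

# Made-up demo rows with the same two columns used in the groupby above
demo = pd.DataFrame({
    "parent_id": ["tx_a", "tx_a", "tx_a", "tx_b", "tx_b", "tx_c", "tx_c", "gene_a"],
    "feature": ["CDS", "3utr", "5utr", "CDS", "5utr", "CDS", "3utr", "transcript"],
})

# Set of child features for each parent, so order and repeats do not matter
features_per_parent = demo.groupby("parent_id")["feature"].apply(set)

# Only parents that have a CDS are expected to carry UTRs
has_cds = features_per_parent.apply(lambda feats: "CDS" in feats)

# For each of those parents, list which UTRs are missing (if any)
missing_utr = {
    parent: sorted({"3utr", "5utr"} - feats)
    for parent, feats in features_per_parent[has_cds].items()
    if {"3utr", "5utr"} - feats
}
print(missing_utr)
```

Unlike the filter above, this version would also report parents that have a CDS and no UTR at all, so the two are not strictly equivalent.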
