# Pending
1. ...
1. ...
1. ...

# Prepare dataset
- Input: None. This notebook downloads the dataset by itself.
- Output:
    - Each transformation method is applied to the "eight strings", and the results are saved to a CSV file.
    - The "eight strings" are as follows:
        - 5p_cleav, 5p_cleav_compl
        - 5p_non_cleav, 5p_non_cleav_compl
        - 3p_cleav, 3p_cleav_compl
        - 3p_non_cleav, 3p_non_cleav_compl
    - P.S. 1. (5p_cleav, 5p_non_cleav, 3p_cleav, 3p_non_cleav) are constructed from the same pre-miRNA. We then use the secondary structure information to construct the complementary strand of each of them.
    - P.S. 2. For some transformation methods, the complementary strand (compl) is represented as two time series.

## 1. Download data from miRBase
Download the microRNA database from [miRBase](https://mirbase.org/). 

Go to [Download page](https://mirbase.org/download/).
Download [miRNA.dat](https://mirbase.org/download/miRNA.dat) (All published miRNA data in EMBL format).

In [2]:
from pathlib import Path
dataset_dir = Path("../data")
# Check for the file itself, not only the directory, so that a partial setup is detected
if not (dataset_dir / "miRNA.dat").is_file():
    print("[INFO] Can't find an existing 'miRBase' database in ../data, downloading...")

    # Download
    # https://stackoverflow.com/questions/33886917/how-to-install-wget-in-macos
    !wget https://mirbase.org/download/miRNA.dat
    # Ensure a data directory exists and move the downloaded database there
    !mkdir -p ../data
    !mv miRNA.dat ../data
    !mkdir -p ../data/notebook_sessions # For the save points
    print(f"[INFO] Current data dir: {dataset_dir}")
else:
    # If the database file already exists, we don't need to download it
    print("[INFO] 'miRBase' database exists, feel free to proceed!")
    print(f"[INFO] Current data dir: {dataset_dir}")

[INFO] 'miRBase' database exists, feel free to proceed!
[INFO] Current data dir: ../data


P.S.
https://stackoverflow.com/questions/7591240/what-does-dot-slash-refer-to-in-terms-of-an-html-file-path-location
- / means the root of the current drive;
- ./ means the current directory;
- ../ means the parent of the current directory.
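
As a minimal sketch of how `../data` is resolved, the snippet below uses a hypothetical notebook location (`/project/notebooks` is an assumption for illustration, not the actual path of this notebook):

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical directory in which the notebook is running
notebook_dir = PurePosixPath("/project/notebooks")

# "../data" means: go up to the parent of the current directory, then into "data"
data_dir = posixpath.normpath(notebook_dir / "../data")
print(data_dir)
```

With this assumed location, `data_dir` is `/project/data`, i.e. a sibling of the notebook directory.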

The downloaded miRNA.dat is Release 22.1.

In [4]:
# pip install biopython
# pip install pandas
from Bio import SeqIO
import pandas as pd
records_data = []
with open('../data/miRNA.dat', 'r') as file:
    for record in SeqIO.parse(file, 'embl'):
        record_dict = {
            'Name': record.name,
            'Accession': record.id,
            # 'Description': record.description,
            # Keep only the first two words (i.e., the organism)
            'Organism': ' '.join(record.description.split()[:2]),  
            'Sequence': str(record.seq),  # Convert sequence to string
            'miRNA_1_Product': None,
            'miRNA_1_Location': None,
            'miRNA_1_Evidence': None,
            'miRNA_2_Product': None,
            'miRNA_2_Location': None,
            'miRNA_2_Evidence': None,
        }
        # Try to retrieve the 1st feature
        try:
            feature = record.features[0]  # Raises IndexError if the feature is absent
            print(f"[INFO] {record.name}'s 1st feature exists!")
            if feature.type == 'miRNA':
                # Extracting more features from EMBL files with Biopython
                # https://www.biostars.org/p/151783/
                record_dict['miRNA_1_Product'] = feature.qualifiers.get('product', [''])[0]
                record_dict['miRNA_1_Location'] = str(feature.location)
                record_dict['miRNA_1_Evidence'] = feature.qualifiers.get('evidence', [''])[0]
        except IndexError:
            print(f"[INFO] {record.name}'s 1st feature does not exist!")
        # Try to retrieve the 2nd feature
        try:
            feature = record.features[1]  # Raises IndexError if the feature is absent
            print(f"[INFO] {record.name}'s 2nd feature exists!")
            # Check the type of the 2nd feature itself (not the 1st one)
            if feature.type == 'miRNA':
                record_dict['miRNA_2_Product'] = feature.qualifiers.get('product', [''])[0]
                record_dict['miRNA_2_Location'] = str(feature.location)
                record_dict['miRNA_2_Evidence'] = feature.qualifiers.get('evidence', [''])[0]
        except IndexError:
            print(f"[INFO] {record.name}'s 2nd feature does not exist!")
        # Append only after both features have been processed
        records_data.append(record_dict)
# Create a DataFrame
df = pd.DataFrame(records_data)



[INFO] cel-let-7's 1st feature exists!
[INFO] cel-let-7's 2nd feature exists!
[INFO] cel-lin-4's 1st feature exists!
[INFO] cel-lin-4's 2nd feature exists!
[INFO] cel-mir-1's 1st feature exists!
[INFO] cel-mir-1's 2nd feature exists!
[INFO] cel-mir-2's 1st feature exists!
[INFO] cel-mir-2's 2nd feature exists!
[INFO] cel-mir-34's 1st feature exists!
[INFO] cel-mir-34's 2nd feature exists!
[INFO] cel-mir-35's 1st feature exists!
[INFO] cel-mir-35's 2nd feature exists!
[INFO] cel-mir-36's 1st feature exists!
[INFO] cel-mir-36's 2nd feature exists!
[INFO] cel-mir-37's 1st feature exists!
[INFO] cel-mir-37's 2nd feature exists!
[INFO] cel-mir-38's 1st feature exists!
[INFO] cel-mir-38's 2nd feature exists!
[INFO] cel-mir-39's 1st feature exists!
[INFO] cel-mir-39's 2nd feature exists!
[INFO] cel-mir-40's 1st feature exists!
[INFO] cel-mir-40's 2nd feature exists!
[INFO] cel-mir-41's 1st feature exists!
[INFO] cel-mir-41's 2nd feature exists!
[INFO] cel-mir-42's 1st feature exists!
[INFO] c

In [6]:
df


Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence
0,cel-let-7,MI0000001,Caenorhabditis elegans,UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAU...,cel-let-7-5p,[16:38](+),experimental,cel-let-7-3p,[59:81](+),experimental
1,cel-lin-4,MI0000002,Caenorhabditis elegans,AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUU...,cel-lin-4-5p,[15:36](+),experimental,cel-lin-4-3p,[54:76](+),experimental
2,cel-mir-1,MI0000003,Caenorhabditis elegans,AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAU...,cel-miR-1-5p,[20:42](+),experimental,cel-miR-1-3p,[60:81](+),experimental
3,cel-mir-2,MI0000004,Caenorhabditis elegans,UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCA...,cel-miR-2-5p,[19:41](+),experimental,cel-miR-2-3p,[60:83](+),experimental
4,cel-mir-34,MI0000005,Caenorhabditis elegans,CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCC...,cel-miR-34-5p,[15:37](+),experimental,cel-miR-34-3p,[52:74](+),experimental
...,...,...,...,...,...,...,...,...,...,...
38584,smc-mir-12461,MI0041070,Symbiodinium microadriaticum,GAGGAUGCUGAUCAUUCACUGGCCCCCUGUGGACACGUGUGUUGCA...,smc-miR-12461-5p,[0:22](+),experimental,smc-miR-12461-3p,[65:87](+),experimental
38585,hsa-mir-9902-2,MI0041071,Homo sapiens,GCAGGGAAAGGGAACCCAGAAAUCUGGUAUGCCAGCAAAGAGAGUA...,hsa-miR-9902,[14:36](+),experimental,,,
38586,gga-mir-1784b,MI0041072,Gallus gallus,UUCUGCUCCUAUUUAAGUCAAUGGCAGAACUCUCACUGAUUUCAAU...,gga-miR-1784b-5p,[0:22](+),experimental,gga-miR-1784b-3p,[36:58](+),experimental
38587,mdo-mir-7385g-1,MI0041073,Monodelphis domestica,UAGUCUGAUAUUCCAUGUUUCUAUGUCAUGAAACUUGGAGCAUAGA...,mdo-miR-7385g-5p,[0:22](+),experimental,mdo-miR-7385g-3p,[38:59](+),experimental


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38589 entries, 0 to 38588
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Name              38589 non-null  object
 1   Accession         38589 non-null  object
 2   Organism          38589 non-null  object
 3   Sequence          38589 non-null  object
 4   miRNA_1_Product   38582 non-null  object
 5   miRNA_1_Location  38582 non-null  object
 6   miRNA_1_Evidence  38582 non-null  object
 7   miRNA_2_Product   14303 non-null  object
 8   miRNA_2_Location  14303 non-null  object
 9   miRNA_2_Evidence  14303 non-null  object
dtypes: object(10)
memory usage: 2.9+ MB


### 1.1 Inspecting the datatypes

In [5]:
print(pd.api.types.is_string_dtype(df["Organism"]))
print(pd.api.types.is_string_dtype(df["Sequence"]))
print(pd.api.types.is_string_dtype(df["miRNA_1_Evidence"]))
print(pd.api.types.is_string_dtype(df["miRNA_1_Location"]))

True
True
False
False


In [6]:
df["miRNA_1_Evidence"].dtype, df["miRNA_1_Evidence"].dtype.name, df["miRNA_1_Location"].dtype, df["miRNA_1_Location"].dtype.name

(dtype('O'), 'object', dtype('O'), 'object')

In [7]:
# We cast all the columns to the datatype we want.
df["Name"] = df["Name"].astype("string")
df["Accession"] = df["Accession"].astype("string")
df["Organism"] = df["Organism"].astype("category")
df["Sequence"] = df["Sequence"].astype("string")
df["miRNA_1_Product"] = df["miRNA_1_Product"].astype("string")
df["miRNA_1_Location"] = df["miRNA_1_Location"].astype("string")
df["miRNA_1_Evidence"] = df["miRNA_1_Evidence"].astype("category")
df["miRNA_2_Product"] = df["miRNA_2_Product"].astype("string")
df["miRNA_2_Location"] = df["miRNA_2_Location"].astype("string")
df["miRNA_2_Evidence"] = df["miRNA_2_Evidence"].astype("category")

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38589 entries, 0 to 38588
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Name              38589 non-null  string  
 1   Accession         38589 non-null  string  
 2   Organism          38589 non-null  category
 3   Sequence          38589 non-null  string  
 4   miRNA_1_Product   38582 non-null  string  
 5   miRNA_1_Location  38582 non-null  string  
 6   miRNA_1_Evidence  38582 non-null  category
 7   miRNA_2_Product   14303 non-null  string  
 8   miRNA_2_Location  14303 non-null  string  
 9   miRNA_2_Evidence  14303 non-null  category
dtypes: category(3), string(7)
memory usage: 2.2 MB


In [9]:
df.head()

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence
0,cel-let-7,MI0000001,Caenorhabditis elegans,UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAU...,cel-let-7-5p,[16:38](+),experimental,cel-let-7-3p,[59:81](+),experimental
1,cel-lin-4,MI0000002,Caenorhabditis elegans,AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUU...,cel-lin-4-5p,[15:36](+),experimental,cel-lin-4-3p,[54:76](+),experimental
2,cel-mir-1,MI0000003,Caenorhabditis elegans,AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAU...,cel-miR-1-5p,[20:42](+),experimental,cel-miR-1-3p,[60:81](+),experimental
3,cel-mir-2,MI0000004,Caenorhabditis elegans,UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCA...,cel-miR-2-5p,[19:41](+),experimental,cel-miR-2-3p,[60:83](+),experimental
4,cel-mir-34,MI0000005,Caenorhabditis elegans,CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCC...,cel-miR-34-5p,[15:37](+),experimental,cel-miR-34-3p,[52:74](+),experimental


The DataFrame looks fine, so we proceed.

In [10]:

# Function to split and convert the location string
def split_location(location):
    if pd.notnull(location):
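        # Strip the brackets and the strand marker, e.g. "[16:38](+)" -> "16:38"
        # ("-" is included so that a minus-strand location "(-)" is handled too)
        cleaned_str = location.strip("[]()+-")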
        parts = cleaned_str.split(":")
        return int(parts[0]), int(parts[1])
    else:
        return -1, -1

-  [Updated] I have switched back to the original convention, i.e., the locations are kept exactly as Biopython reports them.
    -  Adding one to the start position is only needed to recover the 1-based coordinates written in the EMBL file, because coordinates in Python are 0-based and half-open.

See [Reading location of a feature of a miRNA entry in miRBase in EMBL format](https://www.biostars.org/p/9608240/) for more information.
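
The two coordinate conventions can be illustrated with a small standalone sketch. It uses the first 27 bases of "hsa-let-7a-1" and its 5p location `[5:27](+)` as they appear later in this notebook; `parse_location` and `to_one_based_inclusive` are hypothetical helpers written for this illustration and are not part of Biopython.

```python
import re

# Matches location strings such as "[5:27](+)"
LOCATION_RE = re.compile(r"\[(\d+):(\d+)\]\(([+-])\)")

def parse_location(location):
    """Return (start, end, strand) in 0-based, half-open coordinates."""
    match = LOCATION_RE.fullmatch(location)
    if match is None:
        raise ValueError(f"unrecognised location string: {location!r}")
    start, end, strand = match.groups()
    return int(start), int(end), strand

def to_one_based_inclusive(start, end):
    """Convert 0-based, half-open [start, end) to 1-based, inclusive (start+1, end)."""
    return start + 1, end

# First 27 bases of hsa-let-7a-1
prefix = "UGGGAUGAGGUAGUAGGUUGUAUAGUU"
start, end, strand = parse_location("[5:27](+)")

# A 0-based, half-open location can be used directly as a Python slice
five_p = prefix[start:end]
print(five_p, len(five_p))
print(to_one_based_inclusive(start, end))
```

Because the interval is half-open, the length of the product is simply `end - start`, and only the start changes when converting to the 1-based, inclusive convention.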

In [11]:
df[['miRNA_1_Start', 'miRNA_1_End']] = df['miRNA_1_Location'].apply(lambda x: pd.Series(split_location(x)))
# df["miRNA_1_Start"] = df["miRNA_1_Start"]+1
df[['miRNA_2_Start', 'miRNA_2_End']] = df['miRNA_2_Location'].apply(lambda x: pd.Series(split_location(x)))
# df["miRNA_2_Start"] = df["miRNA_2_Start"]+1

In [12]:
df

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
0,cel-let-7,MI0000001,Caenorhabditis elegans,UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAU...,cel-let-7-5p,[16:38](+),experimental,cel-let-7-3p,[59:81](+),experimental,16,38,59,81
1,cel-lin-4,MI0000002,Caenorhabditis elegans,AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUU...,cel-lin-4-5p,[15:36](+),experimental,cel-lin-4-3p,[54:76](+),experimental,15,36,54,76
2,cel-mir-1,MI0000003,Caenorhabditis elegans,AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAU...,cel-miR-1-5p,[20:42](+),experimental,cel-miR-1-3p,[60:81](+),experimental,20,42,60,81
3,cel-mir-2,MI0000004,Caenorhabditis elegans,UAAACAGUAUACAGAAAGCCAUCAAAGCGGUGGUUGAUGUGUUGCA...,cel-miR-2-5p,[19:41](+),experimental,cel-miR-2-3p,[60:83](+),experimental,19,41,60,83
4,cel-mir-34,MI0000005,Caenorhabditis elegans,CGGACAAUGCUCGAGAGGCAGUGUGGUUAGCUGGUUGCAUAUUUCC...,cel-miR-34-5p,[15:37](+),experimental,cel-miR-34-3p,[52:74](+),experimental,15,37,52,74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38584,smc-mir-12461,MI0041070,Symbiodinium microadriaticum,GAGGAUGCUGAUCAUUCACUGGCCCCCUGUGGACACGUGUGUUGCA...,smc-miR-12461-5p,[0:22](+),experimental,smc-miR-12461-3p,[65:87](+),experimental,0,22,65,87
38585,hsa-mir-9902-2,MI0041071,Homo sapiens,GCAGGGAAAGGGAACCCAGAAAUCUGGUAUGCCAGCAAAGAGAGUA...,hsa-miR-9902,[14:36](+),experimental,,,,14,36,-1,-1
38586,gga-mir-1784b,MI0041072,Gallus gallus,UUCUGCUCCUAUUUAAGUCAAUGGCAGAACUCUCACUGAUUUCAAU...,gga-miR-1784b-5p,[0:22](+),experimental,gga-miR-1784b-3p,[36:58](+),experimental,0,22,36,58
38587,mdo-mir-7385g-1,MI0041073,Monodelphis domestica,UAGUCUGAUAUUCCAUGUUUCUAUGUCAUGAAACUUGGAGCAUAGA...,mdo-miR-7385g-5p,[0:22](+),experimental,mdo-miR-7385g-3p,[38:59](+),experimental,0,22,38,59


In [2]:
# Save Point
# https://stackoverflow.com/questions/34342155/how-to-pickle-or-store-jupyter-ipython-notebook-session-for-later
# pip install dill
import dill
dill.dump_session('../data/notebook_sessions/1_prepare_dataset_1.db')

In [None]:
import dill
dill.load_session('../data/notebook_sessions/1_prepare_dataset_1.db')

The sequence of "cel-let-7":

In [17]:
temp = len(df.loc[0].Sequence)
print("Length:", temp)
print(df.loc[0].Sequence[0:5] + "..." + df.loc[0].Sequence[temp-5:temp])

Length: 99
UACAC...UUCGA


## 2. Selecting the rows related to Homo sapiens

In [1]:
df = df[df['Organism']=='Homo sapiens']

In [19]:
df

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
57,hsa-let-7a-1,MI0000060,Homo sapiens,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,hsa-let-7a-5p,[5:27](+),experimental,hsa-let-7a-3p,[56:77](+),experimental,5,27,56,77
58,hsa-let-7a-2,MI0000061,Homo sapiens,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,hsa-let-7a-5p,[4:26](+),experimental,hsa-let-7a-2-3p,[49:71](+),experimental,4,26,49,71
59,hsa-let-7a-3,MI0000062,Homo sapiens,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,hsa-let-7a-5p,[3:25](+),experimental,hsa-let-7a-3p,[51:72](+),experimental,3,25,51,72
60,hsa-let-7b,MI0000063,Homo sapiens,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,hsa-let-7b-5p,[5:27](+),experimental,hsa-let-7b-3p,[59:81](+),experimental,5,27,59,81
61,hsa-let-7c,MI0000064,Homo sapiens,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,hsa-let-7c-5p,[10:32](+),experimental,hsa-let-7c-3p,[55:77](+),experimental,10,32,55,77
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37307,hsa-mir-12132,MI0039734,Homo sapiens,UUAACAUCUUUUCCAUCAUAAUUCUCAUAGUAAUAAUAGUAAUGUU...,hsa-miR-12132,[65:87](+),experimental,,,,65,87,-1,-1
37308,hsa-mir-12133,MI0039735,Homo sapiens,GAAGUGUACUUUUUAAUGGUGCCAAACAGCAGUUGAUCUAUAAUAA...,hsa-miR-12133,[52:74](+),experimental,,,,52,74,-1,-1
37312,hsa-mir-12135,MI0039739,Homo sapiens,UGUGGAUAUUCUUUUUUGAUACUACAGCAAAACUCAGCAAGUUGUA...,hsa-miR-12135,[52:70](+),experimental,,,,52,70,-1,-1
37313,hsa-mir-12136,MI0039740,Homo sapiens,GAAAAAGUCAUGGAGGCCAUGGGGUUGGCUUGAAACCAGCUUUGGG...,hsa-miR-12136,[0:18](+),experimental,,,,0,18,-1,-1


The sequence of "hsa-let-7a-1":

In [20]:
temp = len(df.loc[57].Sequence)
print("Length:", temp)
print(df.loc[57].Sequence[0:5] + "..." + df.loc[57].Sequence[temp-5:temp])

Length: 80
UGGGA...UCCUA


## 3. Selecting the rows with experimental support for both the 5p and 3p products

In [21]:
df[(df['miRNA_2_Evidence']=='experimental')]

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
57,hsa-let-7a-1,MI0000060,Homo sapiens,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,hsa-let-7a-5p,[5:27](+),experimental,hsa-let-7a-3p,[56:77](+),experimental,5,27,56,77
58,hsa-let-7a-2,MI0000061,Homo sapiens,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,hsa-let-7a-5p,[4:26](+),experimental,hsa-let-7a-2-3p,[49:71](+),experimental,4,26,49,71
59,hsa-let-7a-3,MI0000062,Homo sapiens,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,hsa-let-7a-5p,[3:25](+),experimental,hsa-let-7a-3p,[51:72](+),experimental,3,25,51,72
60,hsa-let-7b,MI0000063,Homo sapiens,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,hsa-let-7b-5p,[5:27](+),experimental,hsa-let-7b-3p,[59:81](+),experimental,5,27,59,81
61,hsa-let-7c,MI0000064,Homo sapiens,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,hsa-let-7c-5p,[10:32](+),experimental,hsa-let-7c-3p,[55:77](+),experimental,10,32,55,77
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31319,hsa-mir-10399,MI0033423,Homo sapiens,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,hsa-miR-10399-5p,[0:21](+),experimental,hsa-miR-10399-3p,[37:58](+),experimental,0,21,37,58
31320,hsa-mir-10400,MI0033424,Homo sapiens,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,hsa-miR-10400-5p,[0:21](+),experimental,hsa-miR-10400-3p,[33:55](+),experimental,0,21,33,55
31321,hsa-mir-10401,MI0033425,Homo sapiens,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,hsa-miR-10401-5p,[0:20](+),experimental,hsa-miR-10401-3p,[35:56](+),experimental,0,20,35,56
31322,hsa-mir-10396b,MI0033426,Homo sapiens,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,hsa-miR-10396b-5p,[0:20](+),experimental,hsa-miR-10396b-3p,[29:51](+),experimental,0,20,29,51


In [22]:
df[(df['miRNA_1_Evidence']=='experimental') &  (df['miRNA_2_Evidence']!='experimental')]

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
105,hsa-mir-107,MI0000114,Homo sapiens,CUCUCUGCUUUCAGCUUCUUUACAGUGUUGCCUUGUGGCAUGGAGU...,hsa-miR-107,[49:72](+),experimental,,,,49,72,-1,-1
226,hsa-mir-196a-1,MI0000238,Homo sapiens,GUGAAUUAGGUAGUUUCAUGUUGUUGGGCCUGGGUUUCUGAACACA...,hsa-miR-196a-5p,[6:28](+),experimental,hsa-miR-196a-1-3p,[44:65](+),not_experimental,6,28,44,65
228,hsa-mir-198,MI0000240,Homo sapiens,UCAUUGGUCCAGAGGGGAGAUAGGUUCCUGUGAUUUUUCCUUCUUC...,hsa-miR-198,[5:27](+),experimental,,,,5,27,-1,-1
251,hsa-mir-7-3,MI0000265,Homo sapiens,AGAUUAGAGUGGCUGUGGUCUAGUGCUGUGUGGAAGACUAGUGAUU...,hsa-miR-7-5p,[30:54](+),experimental,,,,30,54,-1,-1
256,hsa-mir-181b-1,MI0000270,Homo sapiens,CCUGUGCAGAGAUUAUUUUUUAAAAGGUCACAAUCAACAUUCAUUG...,hsa-miR-181b-5p,[35:58](+),experimental,hsa-miR-181b-3p,[75:96](+),not_experimental,35,58,75,96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37307,hsa-mir-12132,MI0039734,Homo sapiens,UUAACAUCUUUUCCAUCAUAAUUCUCAUAGUAAUAAUAGUAAUGUU...,hsa-miR-12132,[65:87](+),experimental,,,,65,87,-1,-1
37308,hsa-mir-12133,MI0039735,Homo sapiens,GAAGUGUACUUUUUAAUGGUGCCAAACAGCAGUUGAUCUAUAAUAA...,hsa-miR-12133,[52:74](+),experimental,,,,52,74,-1,-1
37312,hsa-mir-12135,MI0039739,Homo sapiens,UGUGGAUAUUCUUUUUUGAUACUACAGCAAAACUCAGCAAGUUGUA...,hsa-miR-12135,[52:70](+),experimental,,,,52,70,-1,-1
37313,hsa-mir-12136,MI0039740,Homo sapiens,GAAAAAGUCAUGGAGGCCAUGGGGUUGGCUUGAAACCAGCUUUGGG...,hsa-miR-12136,[0:18](+),experimental,,,,0,18,-1,-1


The sequence of "hsa-mir-107":

In [24]:
temp = len(df.loc[105].Sequence)
print("Length:", temp)
print(df.loc[105].Sequence[0:5] + "..." + df.loc[105].Sequence[temp-5:temp])

Length: 81
CUCUC...ACAGA


The sequence of "hsa-mir-196a-1":

In [25]:
temp = len(df.loc[226].Sequence)
print("Length:", temp)
print(df.loc[226].Sequence[0:5] + "..." + df.loc[226].Sequence[temp-5:temp])

Length: 70
GUGAA...UUCAC


In [26]:
df[(df['miRNA_1_Evidence']!='experimental') &  (df['miRNA_2_Evidence']=='experimental')]

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
101,hsa-mir-103a-1,MI0000109,Homo sapiens,UACUGCCCUCGGCUUCUUUACAGUGCUGCCUUGUUGCAUAUGGAUC...,hsa-miR-103a-1-5p,[10:33](+),not_experimental,hsa-miR-103a-3p,[47:70](+),experimental,10,33,47,70
227,hsa-mir-197,MI0000239,Homo sapiens,GGCUGUGCCGGGUAGAGAGGGCAGUGGGAGGUAAGAGCUCUUCACC...,hsa-miR-197-5p,[8:31](+),not_experimental,hsa-miR-197-3p,[47:69](+),experimental,8,31,47,69
264,hsa-mir-203a,MI0000283,Homo sapiens,GUGUUGGGGACUCGCGCGCUGGGUCCAGUGGUUCUUAACAGUUCAA...,hsa-miR-203a-5p,[26:51](+),not_experimental,hsa-miR-203a-3p,[64:86](+),experimental,26,51,64,86
422,hsa-mir-137,MI0000454,Homo sapiens,GGUCCUCUGACUCUCUUCGGUGACGGGUAUUCUUGGGUGGAUAAUA...,hsa-miR-137-5p,[22:45](+),not_experimental,hsa-miR-137-3p,[58:81](+),experimental,22,45,58,81
506,hsa-mir-320a,MI0000542,Homo sapiens,CUCCCCUCCGCCUUCUCUUCCCGGUUCUUCCCGGAGUCGGGAAAAG...,hsa-miR-320a-5p,[9:31](+),not_experimental,hsa-miR-320a-3p,[41:63](+),experimental,9,31,41,63
611,hsa-mir-1-1,MI0000651,Homo sapiens,UGGGAAACAUACUUCUUUAUAUGCCCAUAUGGACCUGCUAAGCUAU...,hsa-miR-1-5p,[6:28](+),not_experimental,hsa-miR-1-3p,[45:67](+),experimental,6,28,45,67
696,hsa-mir-101-2,MI0000739,Homo sapiens,ACUGUCCUUUUUCGGUUAUCAUGGUACCGAUGCUGUAUAUCUGAAA...,hsa-miR-101-2-5p,[11:33](+),not_experimental,hsa-miR-101-3p,[48:69](+),experimental,11,33,48,69
699,hsa-mir-34b,MI0000742,Homo sapiens,GUGCUCGGUUUGUAGGCAGUGUCAUUAGCUGAUUGUACUGUGGUGG...,hsa-miR-34b-5p,[12:35](+),not_experimental,hsa-miR-34b-3p,[49:71](+),experimental,12,35,49,71
702,hsa-mir-301a,MI0000745,Homo sapiens,ACUGCUAACGAAUGCUCUGACUUUAUUGCACUACUGUACUUUACAG...,hsa-miR-301a-5p,[13:35](+),not_experimental,hsa-miR-301a-3p,[50:73](+),experimental,13,35,50,73
722,hsa-mir-365a,MI0000767,Homo sapiens,ACCGCAGGGAAAAUGAGGGACUUUUGGGGGCAGAUGUGUUUCCAUU...,hsa-miR-365a-5p,[15:38](+),not_experimental,hsa-miR-365a-3p,[55:77](+),experimental,15,38,55,77


In [27]:
df = df[(df['miRNA_1_Evidence']=='experimental') & (df['miRNA_2_Evidence']=='experimental')]

In [28]:
df.describe()

Unnamed: 0,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
count,827.0,827.0,827.0,827.0
mean,10.592503,32.649335,50.354293,72.110036
std,7.494604,7.513209,9.642596,9.745738
min,0.0,18.0,28.0,49.0
25%,5.0,27.0,44.0,66.0
50%,9.0,31.0,49.0,71.0
75%,14.0,37.0,55.0,77.0
max,52.0,73.0,138.0,160.0


In these four columns, a null location is represented as -1.
Since the minimum of every column is now greater than -1, none of the remaining rows has a null location.
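
The sentinel check can be sketched on plain Python data. The tuples below are the start/end values of three rows shown earlier in this notebook; `has_missing` is a hypothetical helper used only for this illustration.

```python
MISSING = -1  # Sentinel returned by split_location for a null location

def has_missing(rows):
    """Return True if any coordinate in any row equals the sentinel."""
    return any(MISSING in row for row in rows)

# (miRNA_1_Start, miRNA_1_End, miRNA_2_Start, miRNA_2_End)
both_products = [(5, 27, 56, 77), (4, 26, 49, 71)]  # hsa-let-7a-1, hsa-let-7a-2
one_product = [(49, 72, -1, -1)]                    # hsa-mir-107

print(has_missing(both_products))
print(has_missing(one_product))
```

Valid coordinates are never negative, so checking that the column minimum is greater than -1 is equivalent to checking that the sentinel does not occur.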



In [29]:
df

Unnamed: 0,Name,Accession,Organism,Sequence,miRNA_1_Product,miRNA_1_Location,miRNA_1_Evidence,miRNA_2_Product,miRNA_2_Location,miRNA_2_Evidence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
57,hsa-let-7a-1,MI0000060,Homo sapiens,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,hsa-let-7a-5p,[5:27](+),experimental,hsa-let-7a-3p,[56:77](+),experimental,5,27,56,77
58,hsa-let-7a-2,MI0000061,Homo sapiens,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,hsa-let-7a-5p,[4:26](+),experimental,hsa-let-7a-2-3p,[49:71](+),experimental,4,26,49,71
59,hsa-let-7a-3,MI0000062,Homo sapiens,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,hsa-let-7a-5p,[3:25](+),experimental,hsa-let-7a-3p,[51:72](+),experimental,3,25,51,72
60,hsa-let-7b,MI0000063,Homo sapiens,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,hsa-let-7b-5p,[5:27](+),experimental,hsa-let-7b-3p,[59:81](+),experimental,5,27,59,81
61,hsa-let-7c,MI0000064,Homo sapiens,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,hsa-let-7c-5p,[10:32](+),experimental,hsa-let-7c-3p,[55:77](+),experimental,10,32,55,77
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31319,hsa-mir-10399,MI0033423,Homo sapiens,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,hsa-miR-10399-5p,[0:21](+),experimental,hsa-miR-10399-3p,[37:58](+),experimental,0,21,37,58
31320,hsa-mir-10400,MI0033424,Homo sapiens,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,hsa-miR-10400-5p,[0:21](+),experimental,hsa-miR-10400-3p,[33:55](+),experimental,0,21,33,55
31321,hsa-mir-10401,MI0033425,Homo sapiens,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,hsa-miR-10401-5p,[0:20](+),experimental,hsa-miR-10401-3p,[35:56](+),experimental,0,20,35,56
31322,hsa-mir-10396b,MI0033426,Homo sapiens,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,hsa-miR-10396b-5p,[0:20](+),experimental,hsa-miR-10396b-3p,[29:51](+),experimental,0,20,29,51


In [30]:
# Verify that all the entries in miRNA_1_Product end with -5p
if len(df) == len(df[df['miRNA_1_Product'].str.endswith('-5p')]):
    print("All the entries in miRNA_1_Product end with -5p")
else:
    print("Not all the entries in miRNA_1_Product end with -5p")
# Verify that all the entries in miRNA_2_Product end with -3p
if len(df) == len(df[df['miRNA_2_Product'].str.endswith('-3p')]):
    print("All the entries in miRNA_2_Product end with -3p")
else:
    print("Not all the entries in miRNA_2_Product end with -3p")

All the entries in miRNA_1_Product end with -5p
All the entries in miRNA_2_Product end with -3p


## 4. Delete the irrelevant columns

In [31]:
# https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe
df = df[['Name', 'Sequence', 'miRNA_1_Start', 'miRNA_1_End', 'miRNA_2_Start', 'miRNA_2_End']]

In [32]:
df

Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End
57,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77
58,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71
59,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72
60,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81
61,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77
...,...,...,...,...,...,...
31319,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58
31320,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55
31321,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56
31322,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51


In [33]:
# Save Point
dill.dump_session('../data/notebook_sessions/1_prepare_dataset_4.db')


In [34]:
dill.load_session('../data/notebook_sessions/1_prepare_dataset_4.db')

Display the whole sequence of "hsa-let-7a-1" in three rows

In [37]:
len(df.loc[57].Sequence)

80

In [41]:

len(df.loc[57].Sequence)/3, df.loc[57].Sequence[0:27], df.loc[57].Sequence[27:27+27], df.loc[57].Sequence[27+27:27+27+26]


(26.666666666666668,
 'UGGGAUGAGGUAGUAGGUUGUAUAGUU',
 'UUAGGGUCACACCCACCACUGGGAGAU',
 'AACUAUACAAUCUACUGUCUUUCCUA')
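
The manual slicing above can be generalised. The following is a minimal sketch, assuming we want a fixed number of rows of nearly equal width; `split_into_rows` is a hypothetical helper, and the sequence is rebuilt from the three pieces printed above.

```python
import math

def split_into_rows(seq, n_rows):
    """Split seq into n_rows consecutive pieces; all but the last have equal width."""
    width = math.ceil(len(seq) / n_rows)
    return [seq[i:i + width] for i in range(0, len(seq), width)]

# hsa-let-7a-1, reassembled from the three rows shown above
seq = ("UGGGAUGAGGUAGUAGGUUGUAUAGUU"
       "UUAGGGUCACACCCACCACUGGGAGAU"
       "AACUAUACAAUCUACUGUCUUUCCUA")

rows = split_into_rows(seq, 3)
for row in rows:
    print(row)
print([len(row) for row in rows])
```

Using the ceiling of `len(seq) / n_rows` as the row width reproduces the 27, 27, 26 split chosen by hand for this 80-base sequence.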

In [42]:
print("5p: " + df.loc[57].Sequence[5:27] + " 3p: " + df.loc[57].Sequence[56:77])

5p: UGAGGUAGUAGGUUGUAUAGUU 3p: CUAUACAAUCUACUGUCUUUC


## 5. Predict the secondary structure of `Sequence` with an external folding program

In [43]:
import ViennaRNA
# https://viennarna.readthedocs.io/en/latest/api_python.html#usage
import RNA

### 5.1 Add the secondary structure (ss) in Dot-Bracket Notation for each row in df

In [45]:
# Predict the secondary structure (ss) for input seq
def predict_ss(seq):
    fc  = RNA.fold_compound(seq)
    (ss, mfe) = fc.mfe()
    return ss

In [46]:
# Assigning a new column to a filtered DataFrame can trigger SettingWithCopyWarning,
# so we take an explicit copy first and then add the column.
df = df.copy()
df['SS'] = df['Sequence'].apply(predict_ss)

In [47]:
df

Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS
57,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......
58,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...
59,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........
60,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...
61,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...
...,...,...,...,...,...,...,...
31319,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58,..((((((.((((((.(((((.(((......)))..))))).))))...
31320,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55,.(((..((.((.(((.....))).)).))..)))......((((.....
31321,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56,((.((((((.(((((.((((.(((...))))))).))...))).))...
31322,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51,(((((((((((((.(((.(((....))).))).)))))))).)).)...


In [48]:
# https://stackoverflow.com/questions/20490274/how-to-reset-index-in-a-pandas-dataframe
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...
...,...,...,...,...,...,...,...
822,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58,..((((((.((((((.(((((.(((......)))..))))).))))...
823,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55,.(((..((.((.(((.....))).)).))..)))......((((.....
824,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56,((.((((((.(((((.((((.(((...))))))).))...))).))...
825,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51,(((((((((((((.(((.(((....))).))).)))))))).)).)...


### 5.2 Construct the complementary strand of each sequence in df from its secondary structure
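
Before the notebook's own implementation, here is a minimal, dependency-free sketch of the underlying idea: matching brackets in dot-bracket notation with a stack and writing each base's pairing partner into the complementary strand, with `_` for unpaired positions. The toy sequence and structure are made up for illustration, and `pair_table` and `complement_from_structure` are hypothetical helper names; no folding or pair probabilities are computed here.

```python
def pair_table(ss):
    """Return partner[i] = index paired with position i, or -1 if unpaired."""
    partner = [-1] * len(ss)
    stack = []  # Indices of '(' that have not been closed yet
    for i, symbol in enumerate(ss):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            j = stack.pop()  # The most recent unmatched '(' pairs with this ')'
            partner[i], partner[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return partner

def complement_from_structure(seq, ss):
    """At each position, write the base it pairs with, or '_' if it is unpaired."""
    return "".join(seq[j] if j >= 0 else "_" for j in pair_table(ss))

# Toy hairpin (made up): a 3-bp stem closing a 4-nt loop
toy_seq = "GGGAAAUCCC"
toy_ss = "(((....)))"
print(complement_from_structure(toy_seq, toy_ss))
```

A stack fits here because dot-bracket notation is properly nested: every `)` closes the most recently opened `(`.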

In [86]:
# # Construct complementary string for input seq
# def construct_compl(seq, ss):
#     i = 0
#     j = len(seq)-1
#     # https://stackoverflow.com/questions/1228299/changing-a-character-in-a-string
#     seq_compl = ['i'] * len(seq)
#     while i <= j:
#         print(i, j)
#         if ss[i] == '(':
#             if ss[j] == ')':
#                 seq_compl[i] = seq[j]
#                 seq_compl[j] = seq[i]
#                 print("The",i,"elt of input seq",seq[i],"is mapped to the ",j," element which is",seq_compl[i])
#                 i += 1
#                 j -= 1
#             elif ss[j] == '.':
#                 seq_compl[j] = '_'
#                 print("The",j,"elt of input seq",seq_compl[i],"is mapped to the unpairing character _")
#                 j -= 1
#         elif ss[i] == '.':
#             seq_compl[i] = '_'
#             print("The",i,"elt of input seq",seq[i],"is mapped to the null character _")
#             i += 1
#         else:
#             # https://stackoverflow.com/questions/2052390/manually-raising-throwing-an-exception-in-python
#             raise ValueError('A very specific bad thing happened.')    
#     return "".join(seq_compl)
  

In [49]:
# Construct the complementary strand and the per-base pairing probabilities for the input seq, using a stack
# https://www.geeksforgeeks.org/stack-in-python/
# https://github.com/ViennaRNA/ViennaRNA/issues/46
def construct_compl(seq, ss):
    # Strings are immutable, so build the result as a list of characters
    # https://stackoverflow.com/questions/1228299/changing-a-character-in-a-string
    seq_compl = ['i'] * len(seq)
    # -1 marks an unpaired base
    seq_compl_prob = [-1] * len(seq)
    # Compute the partition function first; RNA.get_pr reads the resulting pair probabilities
    (propensity, ensemble_energy) = RNA.pf_fold(seq)
    parenthesis_stack = []
    for i in range(len(seq)):
        if ss[i] == '(':
            parenthesis_stack.append(i)
        elif ss[i] == '.':
            seq_compl[i] = '_'
        elif ss[i] == ')':
            # j is the position of the matching '('
            j = parenthesis_stack.pop()
            seq_compl[j], seq_compl[i] = seq[i], seq[j]
            # Positions are shifted by 1 when calling RNA.get_pr
            pair_prob = RNA.get_pr(i + 1, j + 1)
            seq_compl_prob[i] = pair_prob
            seq_compl_prob[j] = pair_prob
        else:
            # https://stackoverflow.com/questions/2052390/manually-raising-throwing-an-exception-in-python
            raise ValueError(f"Unexpected character {ss[i]!r} in the secondary structure at position {i}")
    # https://stackoverflow.com/questions/47969756/pandas-apply-function-that-returns-two-new-columns
    return pd.Series(["".join(seq_compl), seq_compl_prob])
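
The stack-based matching in `construct_compl` can be tried in isolation, without ViennaRNA. The following is a minimal sketch, assuming a dot-bracket string of the same length as the sequence; the name `complement_from_structure` is hypothetical, and the pairing probabilities are left out on purpose.

```python
def complement_from_structure(seq, ss):
    """Return the complementary strand implied by a dot-bracket structure.

    Hypothetical helper for illustration: paired positions swap bases,
    unpaired positions become '_'.
    """
    if len(seq) != len(ss):
        raise ValueError("seq and ss must have the same length")
    compl = [None] * len(seq)
    stack = []  # indices of '(' still waiting for their ')'
    for i, symbol in enumerate(ss):
        if symbol == '(':
            stack.append(i)
        elif symbol == '.':
            compl[i] = '_'
        elif symbol == ')':
            if not stack:
                raise ValueError(f"Unmatched ')' at position {i}")
            j = stack.pop()
            # Each partner receives the base of the other one
            compl[j], compl[i] = seq[i], seq[j]
        else:
            raise ValueError(f"Unexpected character {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return "".join(compl)


print(complement_from_structure("GGGAAACCC", "(((...)))"))
```

For this toy hairpin the call prints `CCC___GGG`: the three loop bases are unpaired, and each stem base is replaced by its partner.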

In [50]:
# Test the construct_compl(seq, ss)
seq = df.loc[0].Sequence
ss = df.loc[0].SS
seq_compl, seq_compl_prob = construct_compl(seq, ss)
seq, ss, seq_compl, seq_compl_prob


('UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA',
 '(((((.(((((((((((((((((((((.....(((...((((....)))).)))))))))))))))))))))))))))))',
 'AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____ACCC_CUGUUGAUAUGUUGGAUGAUGGAGAGGGU',
 [0.5493786436077366,
  0.94554699810418,
  0.986955654826434,
  0.9871518080317534,
  0.9039272199945173,
  -1,
  0.8410709646198647,
  0.9736948543001552,
  0.9813040597206824,
  0.8898251061201992,
  0.9291679985726867,
  0.9975284750288336,
  0.9999419082190152,
  0.9994870750249327,
  0.9993990065365326,
  0.9998536169074631,
  0.9985678063762455,
  0.9987582701160881,
  0.998882802020682,
  0.999948614713236,
  0.9994679517739496,
  0.9989972739174591,
  0.9989705602761316,
  0.9993176437187281,
  0.9997423393655187,
  0.981519566130822,
  0.9144092768903046,
  -1,
  -1,
  -1,
  -1,
  -1,
  0.7931920086071264,
  0.8071775998760317,
  0.8068907396417209,
  -1,
  -1,
  -1,
  0.8433408244527671,
  0.8451606782098134,
  0.84485702034

Print the secondary structure of "hsa-let-7a-1" in three rows

In [57]:
len(df.loc[0].SS)/3, df.loc[0].SS[0:27], df.loc[0].SS[27:27+27], df.loc[0].SS[27+27:27+27+26]

(26.666666666666668,
 '(((((.(((((((((((((((((((((',
 '.....(((...((((....)))).)))',
 '))))))))))))))))))))))))))')
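
The three-row split above hard-codes the slice boundaries. A small sketch of the same idea for any row width follows; `split_rows` is a hypothetical helper, and the 80-character structure is built only for illustration.

```python
def split_rows(text, width):
    """Split text into consecutive rows of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [text[start:start + width] for start in range(0, len(text), width)]


# An 80-character dot-bracket string, built here only for illustration
structure = "(" * 30 + "." * 20 + ")" * 30
for row in split_rows(structure, 27):
    print(len(row), row)
```

With a width of 27, an 80-character string yields rows of 27, 27 and 26 characters, the same layout as above.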

In [58]:
def print_list_3dp(my_list):
    my_formatted_list = [ '%.3f' % elem for elem in  my_list]
    # https://stackoverflow.com/questions/5326112/how-to-round-each-item-in-a-list-of-floats-to-2-decimal-places
    # https://stackoverflow.com/questions/44639357/print-python-list-without-quotation-marks-or-space-after-commas
    print("time series = ",  ', '.join(my_formatted_list))

In [60]:
def print_list_3dp_vertically(my_list):
    my_formatted_list = [ '%.3f' % elem for elem in  my_list]
    # https://stackoverflow.com/questions/5982206/how-to-print-a-linebreak-in-a-python-function
    print('\n'.join(my_formatted_list))

In [61]:
print_list_3dp(seq_compl_prob)

time series =  0.549, 0.946, 0.987, 0.987, 0.904, -1.000, 0.841, 0.974, 0.981, 0.890, 0.929, 0.998, 1.000, 0.999, 0.999, 1.000, 0.999, 0.999, 0.999, 1.000, 0.999, 0.999, 0.999, 0.999, 1.000, 0.982, 0.914, -1.000, -1.000, -1.000, -1.000, -1.000, 0.793, 0.807, 0.807, -1.000, -1.000, -1.000, 0.843, 0.845, 0.845, 0.613, -1.000, -1.000, -1.000, -1.000, 0.613, 0.845, 0.845, 0.843, -1.000, 0.807, 0.807, 0.793, 0.914, 0.982, 1.000, 0.999, 0.999, 0.999, 0.999, 1.000, 0.999, 0.999, 0.999, 1.000, 0.999, 0.999, 1.000, 0.998, 0.929, 0.890, 0.981, 0.974, 0.841, 0.904, 0.987, 0.987, 0.946, 0.549


For better coloring, we map the unpaired marker (-1) to 0.7 and raise every pairing probability that is at most 0.9 up to 0.9, so that after min-max normalization the variation among probabilities above 0.9 stands out.

In [62]:
import numpy as np
# https://stackoverflow.com/questions/2612802/how-do-i-clone-a-list-so-that-it-doesnt-change-unexpectedly-after-assignment
temp = seq_compl_prob.copy()
for i in range(len(temp)):
    if temp[i] == -1:
        # Unpaired base
        temp[i] = 0.7
    elif temp[i] <= 0.9:
        temp[i] = 0.9
# Min-max normalization to [0, 1]
# https://stackoverflow.com/questions/18380419/normalization-to-bring-in-the-range-of-0-1
temp = (temp - np.min(temp)) / (np.max(temp) - np.min(temp))
print_list_3dp_vertically(temp)

0.667
0.819
0.957
0.957
0.680
0.000
0.667
0.912
0.938
0.667
0.764
0.992
1.000
0.998
0.998
1.000
0.995
0.996
0.996
1.000
0.998
0.997
0.997
0.998
0.999
0.939
0.715
0.000
0.000
0.000
0.000
0.000
0.667
0.667
0.667
0.000
0.000
0.000
0.667
0.667
0.667
0.667
0.000
0.000
0.000
0.000
0.667
0.667
0.667
0.667
0.000
0.667
0.667
0.667
0.715
0.939
0.999
0.998
0.997
0.997
0.998
1.000
0.996
0.996
0.995
1.000
0.998
0.998
1.000
0.992
0.764
0.667
0.938
0.912
0.667
0.680
0.957
0.957
0.819
0.667


In [63]:
# For checking the index
# https://stackoverflow.com/questions/34753872/how-do-i-display-the-index-of-a-list-element-in-python
for (i, item) in enumerate(seq_compl, start=0):
    print(i, item)

0 A
1 U
2 C
3 C
4 U
5 _
6 U
7 U
8 C
9 U
10 G
11 U
12 C
13 A
14 U
15 C
16 U
17 A
18 A
19 C
20 A
21 U
22 A
23 U
24 C
25 A
26 A
27 _
28 _
29 _
30 _
31 _
32 U
33 A
34 G
35 _
36 _
37 _
38 G
39 G
40 G
41 U
42 _
43 _
44 _
45 _
46 A
47 C
48 C
49 C
50 _
51 C
52 U
53 G
54 U
55 U
56 G
57 A
58 U
59 A
60 U
61 G
62 U
63 U
64 G
65 G
66 A
67 U
68 G
69 A
70 U
71 G
72 G
73 A
74 G
75 A
76 G
77 G
78 G
79 U


In [64]:
# Test for computing the max pair probability for a given nucleotide
seq = df.loc[57].Sequence
(propensity,ensemble_energy) = RNA.pf_fold(seq)
# Find the max pair probability for nucleotide i
i = 8
prob_max = 0
for j in range(0, len(seq)):
    if RNA.get_pr(i, j+1) > prob_max:
        prob_max = RNA.get_pr(i, j+1)
print(prob_max)

0.9688683891307968


In [65]:
df.head()

Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...


In [66]:
# Test seq_compl and seq_compl_prob for all rows in the df
for i in range(len(df)):
    seq = df.loc[i].Sequence
    ss = df.loc[i].SS
    seq_compl, seq_compl_prob = construct_compl(seq, ss)
    print(i)
    print(seq)
    print(seq_compl)
    print(len(seq_compl_prob), seq_compl_prob)

0
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA
AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____ACCC_CUGUUGAUAUGUUGGAUGAUGGAGAGGGU
80 [0.5493786436077366, 0.94554699810418, 0.986955654826434, 0.9871518080317534, 0.9039272199945173, -1, 0.8410709646198647, 0.9736948543001552, 0.9813040597206824, 0.8898251061201992, 0.9291679985726867, 0.9975284750288336, 0.9999419082190152, 0.9994870750249327, 0.9993990065365326, 0.9998536169074631, 0.9985678063762455, 0.9987582701160881, 0.998882802020682, 0.999948614713236, 0.9994679517739496, 0.9989972739174591, 0.9989705602761316, 0.9993176437187281, 0.9997423393655187, 0.981519566130822, 0.9144092768903046, -1, -1, -1, -1, -1, 0.7931920086071264, 0.8071775998760317, 0.8068907396417209, -1, -1, -1, 0.8433408244527671, 0.8451606782098134, 0.8448570203446218, 0.6132957407489772, -1, -1, -1, -1, 0.6132957407489772, 0.8448570203446218, 0.8451606782098134, 0.8433408244527671, -1, 0.8068907396417209, 0.8071775998760317, 0.

In [67]:
# https://stackoverflow.com/questions/34279378/apply-function-with-two-arguments-to-columns
df[['Seq_Compl', 'Pair_Prob']] = df.apply(lambda x: construct_compl(x['Sequence'], x['SS']), axis=1)

In [69]:
df

Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."
...,...,...,...,...,...,...,...,...,...
822,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58,..((((((.((((((.(((((.(((......)))..))))).))))...,__GAUGUC_AACAGG_UCUCU_UUG______CAA__AGAGA_UCUG...,"[-1, -1, 0.5536799172926145, 0.952238919161697..."
823,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55,.(((..((.((.(((.....))).)).))..)))......((((.....,_CCG__GC_GC_GAG_____CUC_GC_GC__CGG______GGGC__...,"[-1, 0.2829866540884834, 0.286620168215649, 0...."
824,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56,((.((((((.(((((.((((.(((...))))))).))...))).))...,GC_CGCCCU_CCGCA_CCCG_GCC___GGCUGGG_UG___CGG_AG...,"[0.6978111842677427, 0.7779373716843502, -1, 0..."
825,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51,(((((((((((((.(((.(((....))).))).)))))))).)).)...,GCCGCCCCGGGCC_CGG_CCG____CGG_CCG_GGCUCGGG_GC_G...,"[0.9707186076954393, 0.9994799857404904, 0.999..."


In [70]:
df.head()

Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."


In [55]:
df = df.rename(columns={'Sequence': 'Seq'})

In [58]:
# https://stackoverflow.com/questions/13411544/delete-a-column-from-a-pandas-dataframe
df = df.drop('SS', axis=1)
df

Unnamed: 0,Name,Seq,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."
...,...,...,...,...,...,...,...,...
822,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58,__GAUGUC_AACAGG_UCUCU_UUG______CAA__AGAGA_UCUG...,"[-1, -1, 0.5536799172926145, 0.952238919161697..."
823,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55,_CCG__GC_GC_GAG_____CUC_GC_GC__CGG______GGGC__...,"[-1, 0.2829866540884834, 0.286620168215649, 0...."
824,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56,GC_CGCCCU_CCGCA_CCCG_GCC___GGCUGGG_UG___CGG_AG...,"[0.6978111842677427, 0.7779373716843502, -1, 0..."
825,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51,GCCGCCCCGGGCC_CGG_CCG____CGG_CCG_GGCUCGGG_GC_G...,"[0.9707186076954393, 0.9994799857404904, 0.999..."


In [71]:
# Save Point
dill.dump_session('../data/notebook_sessions/1_prepare_dataset_5.db')

In [72]:
dill.load_session('../data/notebook_sessions/1_prepare_dataset_5.db')

## 6. Time Series Transformation (a.k.a. Genomic Signal Processing (GSP))

In [73]:
test = "AAAACGGUU"
test_prob = [0.9, 0.8, 0.8, 0.8, 0.7, 0.6, 0.8, 0.8, 0.9]
test, len(test), test_prob, len(test_prob)

('AAAACGGUU', 9, [0.9, 0.8, 0.8, 0.8, 0.7, 0.6, 0.8, 0.8, 0.9], 9)

In [74]:
test2 = "AAA_ACG__GUU"
test2_prob = [0.9, 0.8, 0.8, -1, 0.8, 0.7, 0.6, -1, -1, 0.8, 0.8, 0.9]
test2, len(test2), test2_prob, len(test2_prob)

('AAA_ACG__GUU',
 12,
 [0.9, 0.8, 0.8, -1, 0.8, 0.7, 0.6, -1, -1, 0.8, 0.8, 0.9],
 12)

### transform_original

In [75]:
# https://stackoverflow.com/questions/37130146/is-it-possible-to-detect-the-number-of-return-values-of-a-function-in-python
def transform_original(seq, prob_seq, use_prob_seq):
    # prob_seq and use_prob_seq are unused here; they are kept so that all transformations
    # share the same input parameters and can be called from a list of functions
    return seq, None

### transform_single

In [76]:
def transform_single(seq, prob_seq, use_prob_seq): # ts: time series
    # Map each base to a value (A: 2, G: 1, C: -1, U: -2, '_': 0),
    # scaled by its pairing probability when use_prob_seq is True
    ts = [None] * len(seq)
    for i in range(len(seq)):
        if use_prob_seq:
            prob = prob_seq[i]
        else:
            prob = 1
        if seq[i] == 'A':
            ts[i] = 2 * prob
        elif seq[i] == 'G':
            ts[i] = 1 * prob
        elif seq[i] == 'C':
            ts[i] = -1 * prob
        elif seq[i] == 'U':
            ts[i] = -2 * prob
        elif seq[i] == '_':
            ts[i] = 0
        else:
            raise ValueError('The sequence contains invalid characters')  
    return ts, None
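
The per-base mapping in `transform_single` can also be written as a lookup table. This is only a sketch: `BASE_VALUE` and `transform_single_lookup` are hypothetical names, and passing `prob_seq=None` stands in for `use_prob_seq=False`.

```python
# Per-base values used by transform_single above; '_' marks an unpaired position
BASE_VALUE = {'A': 2, 'G': 1, 'C': -1, 'U': -2, '_': 0}


def transform_single_lookup(seq, prob_seq=None):
    """Hypothetical dict-based variant; prob_seq=None means 'do not weight'."""
    ts = []
    for i, base in enumerate(seq):
        if base not in BASE_VALUE:
            raise ValueError(f"Invalid character {base!r} at position {i}")
        # Unpaired positions stay at 0 regardless of the -1 marker in prob_seq
        weight = 1 if prob_seq is None or base == '_' else prob_seq[i]
        ts.append(BASE_VALUE[base] * weight)
    return ts


print(transform_single_lookup("AAA_ACG__GUU"))
```

For `test2` this prints `[2, 2, 2, 0, 2, -1, 1, 0, 0, 1, -2, -2]`, the same unweighted series as `transform_single(test2, test2_prob, False)`.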

In [77]:
temp_1, temp_2 = transform_single(test, test_prob, False)

In [78]:
temp_1, temp_2 = transform_single(test2, test2_prob, False)

In [79]:
print(temp_1, temp_2)

[2, 2, 2, 0, 2, -1, 1, 0, 0, 1, -2, -2] None


In [80]:
if temp_2:
    print(temp_1, temp_2)
else:
    print(temp_1)


[2, 2, 2, 0, 2, -1, 1, 0, 0, 1, -2, -2]


### transform_cum

In [81]:
def transform_cum(seq, prob_seq, use_prob_seq): #ts: time series
    ts = [None] * (len(seq)+1)
    ts[0] = 0
    for i in range(len(seq)):
        if use_prob_seq:
            prob = prob_seq[i]
        else:
            prob = 1
        if seq[i] == 'A':
            ts[i+1] = ts[i] + 2 * prob
        elif seq[i] == 'G':
            ts[i+1] = ts[i] + 1 * prob
        elif seq[i] == 'C':
            ts[i+1] = ts[i] - 1 * prob
        elif seq[i] == 'U':
            ts[i+1] = ts[i] - 2 * prob
        elif seq[i] == '_':
            ts[i+1] = ts[i]
        else:
            raise ValueError('The sequence contains invalid characters')  
    return ts, None
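
`transform_cum` is the running sum of the per-base values, starting from 0. A minimal sketch of the unweighted case with `itertools.accumulate` (the `initial` argument needs Python 3.8 or later); `BASE_STEP` and `transform_cum_accumulate` are hypothetical names.

```python
from itertools import accumulate

# Same per-base steps as in transform_cum above
BASE_STEP = {'A': 2, 'G': 1, 'C': -1, 'U': -2, '_': 0}


def transform_cum_accumulate(seq):
    """Hypothetical unweighted variant of transform_cum built on accumulate."""
    steps = [BASE_STEP[base] for base in seq]  # KeyError on an invalid character
    # initial=0 prepends the starting value, so the result has len(seq) + 1 points
    return list(accumulate(steps, initial=0))


print(transform_cum_accumulate("GAGAUAACUA"))
```

For `GAGAUAACUA` this prints `[0, 1, 3, 4, 6, 4, 6, 8, 7, 5, 7]`.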

### transform_cum_multi_samelen

In [82]:
def transform_cum_multi_samelen(seq, prob_seq, use_prob_seq): #ts: time series
    ts_1 = [None] * (len(seq)+1)
    ts_2= [None] * (len(seq)+1)
    ts_1[0] = 0
    ts_2[0] = 0
    for i in range(len(seq)):
        if use_prob_seq:
            prob = prob_seq[i]
        else:
            prob = 1
        if seq[i] == 'A':
            ts_1[i+1] = ts_1[i] + 1 * prob
            ts_2[i+1] = ts_2[i] + 0
        elif seq[i] == 'G':
            ts_1[i+1] = ts_1[i] - 1 * prob
            ts_2[i+1] = ts_2[i] + 0
        elif seq[i] == 'C':
            ts_1[i+1] = ts_1[i] + 0
            ts_2[i+1] = ts_2[i] + 1 * prob
        elif seq[i] == 'U':
            ts_1[i+1] = ts_1[i] + 0
            ts_2[i+1] = ts_2[i] - 1 * prob
        elif seq[i] == '_':
            ts_1[i+1] = ts_1[i] + 0
            ts_2[i+1] = ts_2[i] + 0
        else:
            raise ValueError('The sequence contains invalid characters')  
    return ts_1, ts_2
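
In `transform_cum_multi_samelen`, purines (A, G) move only the first series and pyrimidines (C, U) move only the second, while both series advance at every position. A compact sketch of the unweighted case, with the hypothetical names `AXIS_STEP` and `two_series_walk`:

```python
# (step on series 1, step on series 2) for each symbol
AXIS_STEP = {'A': (1, 0), 'G': (-1, 0), 'C': (0, 1), 'U': (0, -1), '_': (0, 0)}


def two_series_walk(seq):
    """Hypothetical unweighted variant: two cumulative series of equal length."""
    ts_1, ts_2 = [0], [0]
    for base in seq:
        step_1, step_2 = AXIS_STEP[base]  # KeyError on an invalid character
        ts_1.append(ts_1[-1] + step_1)
        ts_2.append(ts_2[-1] + step_2)
    return ts_1, ts_2


ts_1, ts_2 = two_series_walk("GAGAUAACUA")
print(ts_1)
print(ts_2)
```

Both series always have `len(seq) + 1` points, which keeps them aligned position by position.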

### transform_cum_multi_difflen

In [83]:
def transform_cum_multi_difflen(seq, prob_seq, use_prob_seq): #ts: time series
    ts_1 = [None] * (len(seq)+1)
    ts_2= [None] * (len(seq)+1)
    j = 0
    k = 0
    ts_1[j] = 0
    ts_2[k] = 0
    for i in range(len(seq)):
        if use_prob_seq:
            prob = prob_seq[i]
        else:
            prob = 1
        if seq[i] == 'A':
            ts_1[j+1] = ts_1[j] + 1 * prob
            j += 1
        elif seq[i] == 'G':
            ts_1[j+1] = ts_1[j] - 1 * prob
            j += 1
        elif seq[i] == 'C':
            ts_2[k+1] = ts_2[k] + 1 * prob
            k += 1
        elif seq[i] == 'U':
            ts_2[k+1] = ts_2[k] - 1 * prob
            k += 1
        elif seq[i] == '_':
            # Do nothing
            pass
        else:
            raise ValueError('The sequence contains invalid characters')  
    return ts_1[0:j+1], ts_2[0:k+1]
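
Unlike the same-length variant, `transform_cum_multi_difflen` advances a series only when a base of its own kind occurs, so the two outputs generally differ in length and `_` is skipped. A sketch of the unweighted case; `split_walks` is a hypothetical name.

```python
def split_walks(seq):
    """Hypothetical unweighted variant: one walk for A/G, one for C/U."""
    purine_ts, pyrimidine_ts = [0], [0]
    for i, base in enumerate(seq):
        if base == 'A':
            purine_ts.append(purine_ts[-1] + 1)
        elif base == 'G':
            purine_ts.append(purine_ts[-1] - 1)
        elif base == 'C':
            pyrimidine_ts.append(pyrimidine_ts[-1] + 1)
        elif base == 'U':
            pyrimidine_ts.append(pyrimidine_ts[-1] - 1)
        elif base != '_':
            raise ValueError(f"Invalid character {base!r} at position {i}")
    return purine_ts, pyrimidine_ts


print(split_walks("GAGAUAACUA"))
print(split_walks("C_CUGUUGAU"))
```

The first series has one point per purine plus the initial 0, and the second has one point per pyrimidine plus the initial 0.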

In [84]:
# Save Point
dill.dump_session('../data/notebook_sessions/1_prepare_dataset_6.db')


In [85]:
dill.load_session('../data/notebook_sessions/1_prepare_dataset_6.db')

## 7. Extract the four patterns (with their compl) and obtain the time series representation
- The four patterns: 5p-cleav, 3p-cleav, 5p-non-cleav, 3p-non-cleav
- The corresponding complementary (compl) strings are named xxx-compl, e.g., 5p-non-cleav-compl
- Their time series (ts) representations are named xxx-ts, e.g., 5p-non-cleav-compl-ts

In [88]:
df.head()


Unnamed: 0,Name,Sequence,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."


In [90]:
# Change column names in df
# https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas
df = df.rename(columns={'Sequence': 'Seq'})
df

Unnamed: 0,Name,Seq,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."
...,...,...,...,...,...,...,...,...,...
822,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58,..((((((.((((((.(((((.(((......)))..))))).))))...,__GAUGUC_AACAGG_UCUCU_UUG______CAA__AGAGA_UCUG...,"[-1, -1, 0.5536799172926145, 0.952238919161697..."
823,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55,.(((..((.((.(((.....))).)).))..)))......((((.....,_CCG__GC_GC_GAG_____CUC_GC_GC__CGG______GGGC__...,"[-1, 0.2829866540884834, 0.286620168215649, 0...."
824,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56,((.((((((.(((((.((((.(((...))))))).))...))).))...,GC_CGCCCU_CCGCA_CCCG_GCC___GGCUGGG_UG___CGG_AG...,"[0.6978111842677427, 0.7779373716843502, -1, 0..."
825,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51,(((((((((((((.(((.(((....))).))).)))))))).)).)...,GCCGCCCCGGGCC_CGG_CCG____CGG_CCG_GGCUCGGG_GC_G...,"[0.9707186076954393, 0.9994799857404904, 0.999..."


Check df to see whether everything is going right.

In [91]:
test = df.loc[0] 
print(test.Name)
print("Seq", test.Seq, len(test.Seq))
print("Seq_Compl", test.Seq_Compl, len(test.Seq_Compl))
print("Pair_Prob", test.Pair_Prob)
print(len(test.Pair_Prob))
five_p_cleav = test.Seq[test.miRNA_1_Start:test.miRNA_1_End]
print("5p string", five_p_cleav, "len", len(five_p_cleav))

hsa-let-7a-1
Seq UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA 80
Seq_Compl AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____ACCC_CUGUUGAUAUGUUGGAUGAUGGAGAGGGU 80
Pair_Prob [0.5493786436077366, 0.94554699810418, 0.986955654826434, 0.9871518080317534, 0.9039272199945173, -1, 0.8410709646198647, 0.9736948543001552, 0.9813040597206824, 0.8898251061201992, 0.9291679985726867, 0.9975284750288336, 0.9999419082190152, 0.9994870750249327, 0.9993990065365326, 0.9998536169074631, 0.9985678063762455, 0.9987582701160881, 0.998882802020682, 0.999948614713236, 0.9994679517739496, 0.9989972739174591, 0.9989705602761316, 0.9993176437187281, 0.9997423393655187, 0.981519566130822, 0.9144092768903046, -1, -1, -1, -1, -1, 0.7931920086071264, 0.8071775998760317, 0.8068907396417209, -1, -1, -1, 0.8433408244527671, 0.8451606782098134, 0.8448570203446218, 0.6132957407489772, -1, -1, -1, -1, 0.6132957407489772, 0.8448570203446218, 0.8451606782098134, 0.8433408244527671, -1, 0.806

It looks alright.

Try to extract the four strings (two cleavage patterns and two non-cleavage patterns) from Seq in df

In [120]:
# 5p-cleav
test.Seq[test.miRNA_1_End-7: test.miRNA_1_End-7+14], test.Seq_Compl[test.miRNA_1_End-7: test.miRNA_1_End-7+14]

('UAUAGUUUUAGGGU', 'AUAUCAA_____UA')

In [121]:
# 3p-cleav
test.Seq[test.miRNA_2_Start-7: test.miRNA_2_Start-7+14], test.Seq_Compl[test.miRNA_2_Start-7: test.miRNA_2_Start-7+14]

('GAGAUAACUAUACA', 'C_CUGUUGAUAUGU')

In [122]:
# 5p-non-cleav
test.Seq[test.miRNA_1_End-7-6: test.miRNA_1_End-7+14-6], test.Seq_Compl[test.miRNA_1_End-7-6: test.miRNA_1_End-7+14-6]

('AGGUUGUAUAGUUU', 'UCUAACAUAUCAA_')

In [123]:
# 3p-non-cleav
test.Seq[test.miRNA_2_Start-7+6: test.miRNA_2_Start-7+14+6], test.Seq_Compl[test.miRNA_2_Start-7+6: test.miRNA_2_Start-7+14+6]

('ACUAUACAAUCUAC', 'UGAUAUGUUGGAUG')
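
The four slices above follow one pattern: a 14-nt window centered on the cleavage site (`miRNA_1_End` for 5p, `miRNA_2_Start` for 3p), and the same window shifted by 6 nt into the mature miRNA for the non-cleavage pattern. A sketch that collects the offsets in one place; `cleavage_windows` and its `width` and `shift` parameters are hypothetical, and the bounds check is an added assumption.

```python
def cleavage_windows(seq, mirna_1_end, mirna_2_start, width=14, shift=6):
    """Return the four windows as a dict (hypothetical helper)."""
    half = width // 2
    five_p = mirna_1_end - half      # window start around the 5p cleavage site
    three_p = mirna_2_start - half   # window start around the 3p cleavage site
    starts = {
        "5p-cleav": five_p,
        "3p-cleav": three_p,
        "5p-non-cleav": five_p - shift,
        "3p-non-cleav": three_p + shift,
    }
    windows = {}
    for name, start in starts.items():
        # Assumption: a window that leaves the sequence is an error, not a short slice
        if start < 0 or start + width > len(seq):
            raise ValueError(f"{name} window [{start}, {start + width}) is out of range")
        windows[name] = seq[start:start + width]
    return windows


# hsa-let-7a-1 from row 0 of df (miRNA_1_End = 27, miRNA_2_Start = 56)
let_7a_1 = "UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAUACAAUCUACUGUCUUUCCUA"
for name, window in cleavage_windows(let_7a_1, 27, 56).items():
    print(name, window)
```

The same function can be applied to `Seq_Compl` to obtain the four complementary windows.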

The extracted strings agree with the result in "ReCGBM: a gradient boosting-based method for predicting human dicer cleavage sites"! Getting the offsets right is tedious, though, and takes some time.

Table: Time series transformation for RNA string s

In [96]:
s=test.Seq[test.miRNA_2_Start-7: test.miRNA_2_Start-7+14]
s = s[0:10]
s_compl=test.Seq_Compl[test.miRNA_2_Start-7: test.miRNA_2_Start-7+14]
s_compl = s_compl[0:10]
s_prob = test.Pair_Prob[test.miRNA_2_Start-7: test.miRNA_2_Start-7+14]
s_prob = s_prob[0:10]
s, len(s), s_prob, s_compl, len(s_compl), len(s_prob)

('GAGAUAACUA',
 10,
 [0.8433408244527671,
  -1,
  0.8068907396417209,
  0.8071775998760317,
  0.7931920086071264,
  0.9144092768903046,
  0.981519566130822,
  0.9997423393655187,
  0.9993176437187281,
  0.9989705602761316],
 'C_CUGUUGAU',
 10,
 10)

Original string

In [97]:
transform_single(s,s_prob, False), transform_cum(s,s_prob, False), transform_cum_multi_samelen(s,s_prob, False), transform_cum_multi_difflen(s,s_prob, False)


(([1, 2, 1, 2, -2, 2, 2, -1, -2, 2], None),
 ([0, 1, 3, 4, 6, 4, 6, 8, 7, 5, 7], None),
 ([0, -1, 0, -1, 0, 0, 1, 2, 2, 2, 3], [0, 0, 0, 0, 0, -1, -1, -1, 0, -1, -1]),
 ([0, -1, 0, -1, 0, 1, 2, 3], [0, -1, 0, -1]))

Complementary string (Without pairwise probability)

In [98]:
transform_single(s_compl,s_prob, False), transform_cum(s_compl,s_prob, False), transform_cum_multi_samelen(s_compl,s_prob, False), transform_cum_multi_difflen(s_compl,s_prob, False)


(([-1, 0, -1, -2, 1, -2, -2, 1, 2, -2], None),
 ([0, -1, -1, -2, -4, -3, -5, -7, -6, -4, -6], None),
 ([0, 0, 0, 0, 0, -1, -1, -1, -2, -1, -1],
  [0, 1, 1, 2, 1, 1, 0, -1, -1, -1, -2]),
 ([0, -1, -2, -1], [0, 1, 2, 1, 0, -1, -2]))

Complementary string (With pairwise probability)

In [99]:
transform_single(s_compl,s_prob, True), transform_cum(s_compl,s_prob, True), transform_cum_multi_samelen(s_compl,s_prob, True), transform_cum_multi_difflen(s_compl,s_prob, True)


(([-0.8433408244527671,
   0,
   -0.8068907396417209,
   -1.6143551997520633,
   0.7931920086071264,
   -1.8288185537806092,
   -1.963039132261644,
   0.9997423393655187,
   1.9986352874374562,
   -1.9979411205522632],
  None),
 ([0,
   -0.8433408244527671,
   -0.8433408244527671,
   -1.650231564094488,
   -3.2645867638465513,
   -2.471394755239425,
   -4.300213309020034,
   -6.2632524412816775,
   -5.263510101916159,
   -3.264874814478703,
   -5.262815935030966],
  None),
 ([0,
   0,
   0,
   0,
   0,
   -0.7931920086071264,
   -0.7931920086071264,
   -0.7931920086071264,
   -1.7929343479726452,
   -0.7936167042539171,
   -0.7936167042539171],
  [0,
   0.8433408244527671,
   0.8433408244527671,
   1.650231564094488,
   0.8430539642184564,
   0.8430539642184564,
   -0.07135531267184825,
   -1.0528748788026703,
   -1.0528748788026703,
   -1.0528748788026703,
   -2.051845439078802]),
 ([0, -0.7931920086071264, -1.7929343479726452, -0.7936167042539171],
  [0,
   0.8433408244527671,
   1.6

In [100]:
print_list_3dp(transform_single(s_compl,s_prob, True)[0])

time series =  -0.843, 0.000, -0.807, -1.614, 0.793, -1.829, -1.963, 1.000, 1.999, -1.998


In [101]:
print_list_3dp(transform_cum(s_compl,s_prob, True)[0])

time series =  0.000, -0.843, -0.843, -1.650, -3.265, -2.471, -4.300, -6.263, -5.264, -3.265, -5.263


In [102]:
print_list_3dp(transform_cum_multi_samelen(s_compl,s_prob, True)[0])

time series =  0.000, 0.000, 0.000, 0.000, 0.000, -0.793, -0.793, -0.793, -1.793, -0.794, -0.794


In [103]:
print_list_3dp(transform_cum_multi_samelen(s_compl,s_prob, True)[1])

time series =  0.000, 0.843, 0.843, 1.650, 0.843, 0.843, -0.071, -1.053, -1.053, -1.053, -2.052


In [104]:
print_list_3dp(transform_cum_multi_difflen(s_compl,s_prob, True)[0])

time series =  0.000, -0.793, -1.793, -0.794


In [105]:
print_list_3dp(transform_cum_multi_difflen(s_compl,s_prob, True)[1])

time series =  0.000, 0.843, 1.650, 0.843, -0.071, -1.053, -2.052


In [106]:
def transform_and_save(df, func, use_prob_seq):
    print("Applying", func.__name__)
    df_temp = df.copy()
    df_temp[["five_p_cleav", "five_p_cleav_compl", 
             "five_p_non_cleav", "five_p_non_cleav_compl", 
             "three_p_cleav", "three_p_cleav_compl", 
             "three_p_non_cleav", "three_p_non_cleav_compl"]] = df_temp.apply(lambda row: extract_eight_strings(row), axis=1, result_type="expand")
    # Handle 5p-cleav and its -compl
    df_temp[["five_p_cleav_1", "five_p_cleav_2"]] = df_temp.apply(lambda row: func(row["five_p_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["five_p_cleav_compl_1", "five_p_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["five_p_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Handle 5p-non-cleav and its -compl
    df_temp[["five_p_non_cleav_1", "five_p_non_cleav_2"]] = df_temp.apply(lambda row: func(row["five_p_non_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["five_p_non_cleav_compl_1", "five_p_non_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["five_p_non_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Handle 3p-cleav and its -compl
    df_temp[["three_p_cleav_1", "three_p_cleav_2"]] = df_temp.apply(lambda row: func(row["three_p_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["three_p_cleav_compl_1", "three_p_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["three_p_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Handle 3p-non-cleav and its -compl
    df_temp[["three_p_non_cleav_1", "three_p_non_cleav_2"]] = df_temp.apply(lambda row: func(row["three_p_non_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["three_p_non_cleav_compl_1", "three_p_non_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["three_p_non_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Save it
    # https://stackoverflow.com/questions/33659139/apply-multiple-functions-to-the-same-argument-in-functional-python
    # https://www.reddit.com/r/learnpython/comments/1c71zga/better_way_to_pass_the_same_argument_to_multiple/
    # https://stackoverflow.com/questions/251464/how-to-get-a-function-name-as-a-string
    # print(df_temp)
    # https://note.nkmk.me/en/python-str-remove-strip/
    if use_prob_seq:
        filePath = "../data/01_" + func.__name__.replace('transform_', '') + "_prob.csv"
    else:
        filePath = "../data/01_" + func.__name__.replace('transform_', '') +".csv"
    if not df_temp["five_p_cleav_2"].loc[0]:
        print("single ts")
        df_temp[["five_p_cleav_1", "five_p_cleav_compl_1", 
             "five_p_non_cleav_1", "five_p_non_cleav_compl_1", 
             "three_p_cleav_1", "three_p_cleav_compl_1", 
             "three_p_non_cleav_1", "three_p_non_cleav_compl_1"]].to_csv(filePath, index=False)
    else:
        print("two ts")
        df_temp[["five_p_cleav_1", "five_p_cleav_compl_1", "five_p_cleav_2", "five_p_cleav_compl_2",
             "five_p_non_cleav_1", "five_p_non_cleav_compl_1", "five_p_non_cleav_2", "five_p_non_cleav_compl_2", 
             "three_p_cleav_1", "three_p_cleav_compl_1", "three_p_cleav_2", "three_p_cleav_compl_2", 
             "three_p_non_cleav_1", "three_p_non_cleav_compl_1", "three_p_non_cleav_2", "three_p_non_cleav_compl_2"]].to_csv(filePath, index=False)
    # Testing
    # return df_temp
    # df_temp = df_temp.drop(columns=["five_p_cleav", "five_p_cleav_compl", "five_p_non_cleav", "five_p_non_cleav_compl" ,"three_p_cleav", "three_p_cleav_compl", "three_p_non_cleav", "three_p_non_cleav_compl"])

In [107]:
df.head()

Unnamed: 0,Name,Seq,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."


In [108]:
def extract_eight_strings(r):
    # 5p-cleav
    five_p_cleav = r.Seq[r.miRNA_1_End-7: r.miRNA_1_End-7+14]
    five_p_cleav_compl = r.Seq_Compl[r.miRNA_1_End-7: r.miRNA_1_End-7+14]
    # 5p-non-cleav
    five_p_non_cleav = r.Seq[r.miRNA_1_End-7-6: r.miRNA_1_End-7+14-6]
    five_p_non_cleav_compl = r.Seq_Compl[r.miRNA_1_End-7-6: r.miRNA_1_End-7+14-6]
    # 3p-cleav
    three_p_cleav = r.Seq[r.miRNA_2_Start-7: r.miRNA_2_Start-7+14]
    three_p_cleav_compl = r.Seq_Compl[r.miRNA_2_Start-7: r.miRNA_2_Start-7+14]
    # 3p-non-cleav
    three_p_non_cleav = r.Seq[r.miRNA_2_Start-7+6: r.miRNA_2_Start-7+14+6]
    three_p_non_cleav_compl =  r.Seq_Compl[r.miRNA_2_Start-7+6: r.miRNA_2_Start-7+14+6]
    return five_p_cleav, five_p_cleav_compl, five_p_non_cleav, five_p_non_cleav_compl, three_p_cleav, three_p_cleav_compl, three_p_non_cleav, three_p_non_cleav_compl
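
As a sanity check on the slicing above, here is a minimal sketch that rebuilds the two 5p windows for the first row (hsa-let-7a-1, `miRNA_1_End = 27`) from the visible prefix of its `Seq` value. The helper `window` and its parameters `anchor`, `shift`, and `flank` are hypothetical names introduced only for this illustration; the notebook itself slices inline.

```python
def window(seq: str, anchor: int, shift: int = 0, flank: int = 7) -> str:
    """Return a 2*flank-long slice that starts `flank` positions before
    `anchor`, optionally moved by `shift` positions (hypothetical helper)."""
    start = anchor - flank + shift
    return seq[start:start + 2 * flank]

# Visible prefix of the Seq column for hsa-let-7a-1 (the table truncates the rest).
seq_prefix = "UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC"
mirna_1_end = 27

five_p_cleav = window(seq_prefix, mirna_1_end)                # Seq[20:34]
five_p_non_cleav = window(seq_prefix, mirna_1_end, shift=-6)  # Seq[14:28]

print(five_p_cleav)      # UAUAGUUUUAGGGU
print(five_p_non_cleav)  # AGGUUGUAUAGUUU
```

Both strings agree with the `five_p_cleav` and `five_p_non_cleav` values shown for row 0 in the output of the next cells. Note that a negative start index would make Python count from the end of the string, so the windows are only meaningful when the anchor is at least `flank - shift` positions into the sequence.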

In [109]:
# https://stackoverflow.com/questions/11736407/apply-list-of-functions-on-an-object-in-python
# The list of our Transformations
func_list_return_one_ts = [transform_original, transform_single, transform_cum]
func_list_return_two_ts = [transform_cum_multi_samelen, transform_cum_multi_difflen]
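
The cells further down call `transform_and_save` by hand for each transformation. As a rough sketch of how the two lists above could drive those calls instead, the snippet below builds the same sequence of `(function, use_prob_seq)` pairs. The helper `plan_jobs` and its `no_prob_variant` parameter are hypothetical, the stub functions are placeholders for the real transformations defined earlier in the notebook, and the assumption that `transform_original` has no probability variant is inferred only from the calls below.

```python
# Placeholder stubs so the sketch runs on its own; in the notebook the real
# transform_* functions are already defined and these would be omitted.
def transform_original(seq, pair_prob, use_prob_seq): return seq, None
def transform_single(seq, pair_prob, use_prob_seq): return seq, None
def transform_cum(seq, pair_prob, use_prob_seq): return seq, None
def transform_cum_multi_samelen(seq, pair_prob, use_prob_seq): return seq, seq
def transform_cum_multi_difflen(seq, pair_prob, use_prob_seq): return seq, seq

func_list_return_one_ts = [transform_original, transform_single, transform_cum]
func_list_return_two_ts = [transform_cum_multi_samelen, transform_cum_multi_difflen]

def plan_jobs(one_ts_funcs, two_ts_funcs, no_prob_variant=("transform_original",)):
    """List every (function, use_prob_seq) pair to run (hypothetical helper)."""
    jobs = []
    for func in [*one_ts_funcs, *two_ts_funcs]:
        # Assumption: functions named in no_prob_variant are run without probabilities only.
        flags = (False,) if func.__name__ in no_prob_variant else (False, True)
        jobs.extend((func, flag) for flag in flags)
    return jobs

jobs = plan_jobs(func_list_return_one_ts, func_list_return_two_ts)
for func, use_prob_seq in jobs:
    print(func.__name__, use_prob_seq)
    # In the notebook this line would be: transform_and_save(df, func, use_prob_seq)
```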

In [110]:
df_temp = df.copy()
df_temp[["five_p_cleav", "five_p_cleav_compl", 
             "five_p_non_cleav", "five_p_non_cleav_compl", 
             "three_p_cleav", "three_p_cleav_compl", 
             "three_p_non_cleav", "three_p_non_cleav_compl"]] = df_temp.apply(lambda row: extract_eight_strings(row), axis=1, result_type="expand")
df_temp[["five_p_cleav_1", "five_p_cleav_2"]] = df_temp.apply(lambda row: transform_cum(row["five_p_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
df_temp.head()

Unnamed: 0,Name,Seq,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob,five_p_cleav,five_p_cleav_compl,five_p_non_cleav,five_p_non_cleav_compl,three_p_cleav,three_p_cleav_compl,three_p_non_cleav,three_p_non_cleav_compl,five_p_cleav_1,five_p_cleav_2
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695...",UAUAGUUUUAGGGU,AUAUCAA_____UA,AGGUUGUAUAGUUU,UCUAACAUAUCAA_,GAGAUAACUAUACA,C_CUGUUGAUAUGU,ACUAUACAAUCUAC,UGAUAUGUUGGAUG,"[0, -2, 0, -2, 0, 1, -1, -3, -5, -7, -5, -4, -...",
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750...",UAUAGUUUAGAAUU,AUGUCAA_______,AGGUUGUAUAGUUU,UCCGACAUGUCAA_,GAGAUAACUGUACA,__CUAUUGAUAUGU,ACUGUACAGCCUCC,UGAUAUGUUGGA_G,"[0, -2, 0, -2, 0, 1, -1, -3, -5, -3, -2, 0, 2,...",
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906...",UAUAGUUUGGGGCU,AUAUCAAGUCCCG_,AGGUUGUAUAGUUU,UCUAACAUAUCAAG,GGGAUAACUAUACA,_____UUGAUAUGU,ACUAUACAAUCUAC,UGAUAUGUUGGAUG,"[0, -2, 0, -2, 0, 1, -1, -3, -5, -4, -3, -2, -...",
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986...",UGUGGUUUCAGGGC,AUAUCAAGG_CCCG,AGGUUGUGUGGUUU,UCCAACAUAUCAAG,AAGAUAACUAUACA,U_____UGGUGUGU,ACUAUACAACCUAC,UGGUGUGUUGGAUG,"[0, -2, -1, -3, -2, -1, -3, -5, -7, -8, -6, -5...",
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0...",UAUGGUUUAGAGUU,AUGUCAA__UU_A_,AGGUUGUAUGGUUU,UCCAACAUGUCAA_,GAGUUAACUGUACA,_U_AGUUGGUAUGU,ACUGUACAACCUUC,UGGUAUGUUGGA_G,"[0, -2, 0, -2, -1, 0, -2, -4, -6, -4, -3, -1, ...",


In [111]:
def transform_and_save(df, func, use_prob_seq):
    print("Applying", func.__name__)
    df_temp = df.copy()
    df_temp[["five_p_cleav", "five_p_cleav_compl", 
             "five_p_non_cleav", "five_p_non_cleav_compl", 
             "three_p_cleav", "three_p_cleav_compl", 
             "three_p_non_cleav", "three_p_non_cleav_compl"]] = df_temp.apply(lambda row: extract_eight_strings(row), axis=1, result_type="expand")
    # Handle 5p-cleav and its -compl
    df_temp[["five_p_cleav_1", "five_p_cleav_2"]] = df_temp.apply(lambda row: func(row["five_p_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["five_p_cleav_compl_1", "five_p_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["five_p_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Handle 5p-non-cleav and its -compl
    df_temp[["five_p_non_cleav_1", "five_p_non_cleav_2"]] = df_temp.apply(lambda row: func(row["five_p_non_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["five_p_non_cleav_compl_1", "five_p_non_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["five_p_non_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Handle 3p-cleav and its -compl
    df_temp[["three_p_cleav_1", "three_p_cleav_2"]] = df_temp.apply(lambda row: func(row["three_p_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["three_p_cleav_compl_1", "three_p_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["three_p_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Handle 3p-non-cleav and its -compl
    df_temp[["three_p_non_cleav_1", "three_p_non_cleav_2"]] = df_temp.apply(lambda row: func(row["three_p_non_cleav"], row["Pair_Prob"], False), axis=1, result_type="expand")
    df_temp[["three_p_non_cleav_compl_1", "three_p_non_cleav_compl_2"]] = df_temp.apply(lambda row: func(row["three_p_non_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1, result_type="expand")
    # Save it
    # https://stackoverflow.com/questions/33659139/apply-multiple-functions-to-the-same-argument-in-functional-python
    # https://www.reddit.com/r/learnpython/comments/1c71zga/better_way_to_pass_the_same_argument_to_multiple/
    # https://stackoverflow.com/questions/251464/how-to-get-a-function-name-as-a-string
    # print(df_temp)
    # https://note.nkmk.me/en/python-str-remove-strip/
    if use_prob_seq:
        filePath = "../data/01_" + func.__name__.replace('transform_', '') + "_prob.csv"
    else:
        filePath = "../data/01_" + func.__name__.replace('transform_', '') +".csv"
    # https://stackoverflow.com/questions/3965104/not-none-test-in-python
    if not df_temp["five_p_cleav_2"].loc[0]: # The first element (loc[0]) will be None if the transformation only generates one ts
        print("single ts")
        df_temp[["five_p_cleav_1", "five_p_cleav_compl_1", 
             "five_p_non_cleav_1", "five_p_non_cleav_compl_1", 
             "three_p_cleav_1", "three_p_cleav_compl_1", 
             "three_p_non_cleav_1", "three_p_non_cleav_compl_1"]].to_csv(filePath, index=False)
    else:
        print("two ts")
        df_temp[["five_p_cleav_1", "five_p_cleav_compl_1", "five_p_cleav_2", "five_p_cleav_compl_2",
             "five_p_non_cleav_1", "five_p_non_cleav_compl_1", "five_p_non_cleav_2", "five_p_non_cleav_compl_2", 
             "three_p_cleav_1", "three_p_cleav_compl_1", "three_p_cleav_2", "three_p_cleav_compl_2", 
             "three_p_non_cleav_1", "three_p_non_cleav_compl_1", "three_p_non_cleav_2", "three_p_non_cleav_compl_2"]].to_csv(filePath, index=False)
    # Testing
    # return df_temp
    # df_temp = df_temp.drop(columns=["five_p_cleav", "five_p_cleav_compl", "five_p_non_cleav", "five_p_non_cleav_compl" ,"three_p_cleav", "three_p_cleav_compl", "three_p_non_cleav", "three_p_non_cleav_compl"])
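
The output file name in `transform_and_save` is derived from the transformation's function name plus an optional `_prob` suffix. A minimal sketch of that naming rule, isolated into a hypothetical helper `output_path` (the helper, its `data_dir` parameter, and the stub function are assumptions made for illustration, not part of the notebook):

```python
from pathlib import Path

def output_path(func, use_prob_seq, data_dir="../data"):
    """Build the CSV path for a transformation (hypothetical helper)."""
    stem = func.__name__.replace("transform_", "")
    suffix = "_prob" if use_prob_seq else ""
    return Path(data_dir) / f"01_{stem}{suffix}.csv"

# Placeholder with the same name as one of the notebook's transformations.
def transform_cum_multi_samelen(seq, pair_prob, use_prob_seq):
    return seq, seq

print(output_path(transform_cum_multi_samelen, False).as_posix())
print(output_path(transform_cum_multi_samelen, True).as_posix())
```

The first path is the file that a later cell reads back with `pd.read_csv`.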

In [None]:
# def transform_and_save(df, func, use_prob_seq):
#     print("Applying", func.__name__)
#     df_temp = df.copy()
#     df_temp[["five_p_cleav", "five_p_cleav_compl", "five_p_non_cleav", "five_p_non_cleav_compl" ,"three_p_cleav", "three_p_cleav_compl", "three_p_non_cleav", "three_p_non_cleav_compl"]] = df_temp.apply(lambda row: extract_eight_strings(row), axis=1, result_type="expand")
#     # Handle 5p-cleav and its -compl
#     df_temp.loc[:, "five_p_cleav"] = df_temp.apply(lambda row: func(row["five_p_cleav"], row["Pair_Prob"], False), axis=1)
#     df_temp.loc[:, "five_p_cleav_compl"] = df_temp.apply(lambda row: func(row["five_p_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1)
#     # Handle 5p-non-cleav and its -compl
#     df_temp.loc[:, "five_p_non_cleav"] = df_temp.apply(lambda row: func(row["five_p_non_cleav"], row["Pair_Prob"], False), axis=1)
#     df_temp.loc[:, "five_p_non_cleav_compl"] = df_temp.apply(lambda row: func(row["five_p_non_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1)
#     # Handle 3p-cleav and its -compl
#     df_temp.loc[:, "three_p_cleav"] = df_temp.apply(lambda row: func(row["three_p_cleav"], row["Pair_Prob"], False), axis=1)
#     df_temp.loc[:, "three_p_cleav_compl"] = df_temp.apply(lambda row: func(row["three_p_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1)
#     # Handle 3p-non-cleav and its -compl
#     df_temp.loc[:, "three_p_non_cleav"] = df_temp.apply(lambda row: func(row["three_p_non_cleav"], row["Pair_Prob"], False), axis=1)
#     df_temp.loc[:, "three_p_non_cleav_compl"] = df_temp.apply(lambda row: func(row["three_p_non_cleav_compl"], row["Pair_Prob"], use_prob_seq), axis=1)
#     # Save it
#     # https://stackoverflow.com/questions/33659139/apply-multiple-functions-to-the-same-argument-in-functional-python
#     # https://www.reddit.com/r/learnpython/comments/1c71zga/better_way_to_pass_the_same_argument_to_multiple/
#     # https://stackoverflow.com/questions/251464/how-to-get-a-function-name-as-a-string
#     # print(df_temp)
#     # https://note.nkmk.me/en/python-str-remove-strip/
#     if use_prob_seq:
#         filePath = "../data/01_" + func.__name__.replace('transform_', '') + "_prob.csv"
#     else:
#         filePath = "../data/01_" + func.__name__.replace('transform_', '') +".csv"
#     df_temp[["five_p_cleav", "five_p_cleav_compl", "five_p_non_cleav", "five_p_non_cleav_compl" ,"three_p_cleav", "three_p_cleav_compl", "three_p_non_cleav", "three_p_non_cleav_compl"]].to_csv(filePath,
#               index=False)
#     # Testing
#     # return df_temp
#     # df_temp = df_temp.drop(columns=["five_p_cleav", "five_p_cleav_compl", "five_p_non_cleav", "five_p_non_cleav_compl" ,"three_p_cleav", "three_p_cleav_compl", "three_p_non_cleav", "three_p_non_cleav_compl"])

In [112]:
df

Unnamed: 0,Name,Seq,miRNA_1_Start,miRNA_1_End,miRNA_2_Start,miRNA_2_End,SS,Seq_Compl,Pair_Prob
0,hsa-let-7a-1,UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCAC...,5,27,56,77,(((((.(((((((((((((((((((((.....(((...((((.......,AUCCU_UUCUGUCAUCUAACAUAUCAA_____UAG___GGGU____...,"[0.5493786436077366, 0.94554699810418, 0.98695..."
1,hsa-let-7a-2,AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGA...,4,26,49,71,(((..(((.(((.(((((((((((((.........(((......))...,UCC__UUC_AUC_UCCGACAUGUCAA_________UAG______CU...,"[0.7077965199915421, 0.975100070400267, 0.9750..."
2,hsa-let-7a-3,GGGUGAGGUAGUAGGUUGUAUAGUUUGGGGCUCUGCCCUGCUAUGG...,3,25,51,72,(((.(((((((((((((((((((((((((((...)))))).........,UCC_UUCUGUCAUCUAACAUAUCAAGUCCCG___CGGGGU______...,"[0.6570633923904023, 0.9052413524746695, 0.906..."
3,hsa-let-7b,CGGGGUGAGGUAGUAGGUUGUGUGGUUUCAGGGCAGUGAUGUUGCC...,5,27,59,81,(((((.(((((((((((((((((((((((.((((((.....)))))...,GUCCC_UUCCGUCAUCCAACAUAUCAAGG_CCCGUU_____GACGG...,"[0.8325127972409344, 0.9705872421262327, 0.986..."
4,hsa-let-7c,GCAUCCGGGUUGAGGUAGUAGGUUGUAUGGUUUAGAGUUACACCCU...,10,32,55,77,((.((((((..(((.(((.(((((((((((((..((.(..((...)...,CG_AGGUUC__UUC_AUC_UCCAACAUGUCAA__UU_A__GU___A...,"[0.8770356651792834, 0.8954524321214109, -1, 0..."
...,...,...,...,...,...,...,...,...,...
822,hsa-mir-10399,AAUUACAGAUUGUCUCAGAGAAAACAAAUGAGUUACUCUCUCGGAC...,0,21,37,58,..((((((.((((((.(((((.(((......)))..))))).))))...,__GAUGUC_AACAGG_UCUCU_UUG______CAA__AGAGA_UCUG...,"[-1, -1, 0.5536799172926145, 0.952238919161697..."
823,hsa-mir-10400,CGGCGGCGGCGGCUCUGGGCGAGGCGGCGGGGCCUGGGCUCCCGGA...,0,21,33,55,.(((..((.((.(((.....))).)).))..)))......((((.....,_CCG__GC_GC_GAG_____CUC_GC_GC__CGG______GGGC__...,"[-1, 0.2829866540884834, 0.286620168215649, 0...."
824,hsa-mir-10401,CGUGUGGGAAGGCGUGGGGUGCGGACCCCGGCCCGACCUCGCCGUC...,0,20,35,56,((.((((((.(((((.((((.(((...))))))).))...))).))...,GC_CGCCCU_CCGCA_CCCG_GCC___GGCUGGG_UG___CGG_AG...,"[0.6978111842677427, 0.7779373716843502, -1, 0..."
825,hsa-mir-10396b,CGGCGGGGCUCGGAGCCGGGCUUCGGCCGGGCCCCGGGCCCUCGAC...,0,20,29,51,(((((((((((((.(((.(((....))).))).)))))))).)).)...,GCCGCCCCGGGCC_CGG_CCG____CGG_CCG_GGCUCGGG_GC_G...,"[0.9707186076954393, 0.9994799857404904, 0.999..."


In [113]:
transform_and_save(df, transform_original, False)
transform_and_save(df, transform_single, False)
transform_and_save(df, transform_single, True)
transform_and_save(df, transform_cum, False)
transform_and_save(df, transform_cum, True)

Applying transform_original
single ts
Applying transform_single
single ts
Applying transform_single
single ts
Applying transform_cum
single ts
Applying transform_cum
single ts


In [114]:
transform_and_save(df, transform_cum_multi_samelen, False)
transform_and_save(df, transform_cum_multi_samelen, True)
transform_and_save(df, transform_cum_multi_difflen, False)
transform_and_save(df, transform_cum_multi_difflen, True)



Applying transform_cum_multi_samelen
two ts
Applying transform_cum_multi_samelen
two ts
Applying transform_cum_multi_difflen
two ts
Applying transform_cum_multi_difflen
two ts


In [115]:
df_temp = pd.read_csv("../data/01_cum_multi_samelen.csv", low_memory=False)
df_temp

Unnamed: 0,five_p_cleav_1,five_p_cleav_compl_1,five_p_cleav_2,five_p_cleav_compl_2,five_p_non_cleav_1,five_p_non_cleav_compl_1,five_p_non_cleav_2,five_p_non_cleav_compl_2,three_p_cleav_1,three_p_cleav_compl_1,three_p_cleav_2,three_p_cleav_compl_2,three_p_non_cleav_1,three_p_non_cleav_compl_1,three_p_non_cleav_2,three_p_non_cleav_compl_2
0,"[0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 0, -1, -1]","[0, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5]","[0, -1, -1, -2, -2, -2, -3, -4, -5, -6, -6, -6...","[0, 0, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1,...","[0, 1, 0, -1, -1, -1, -2, -2, -1, -1, 0, -1, -...","[0, 0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 5, 6, 6]","[0, 0, 0, 0, -1, -2, -2, -3, -3, -4, -4, -4, -...","[0, -1, 0, -1, -1, -1, 0, 0, -1, -1, -2, -1, -...","[0, -1, 0, -1, 0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5]","[0, 0, 0, 0, 0, -1, -1, -1, -2, -1, -1, 0, 0, ...","[0, 0, 0, 0, 0, -1, -1, -1, 0, -1, -1, -2, -2,...","[0, 1, 1, 2, 1, 1, 0, -1, -1, -1, -2, -2, -3, ...","[0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6]","[0, 0, -1, 0, 0, 1, 1, 0, 0, 0, -1, -2, -1, -1...","[0, 0, 1, 0, 0, -1, -1, 0, 0, 0, -1, 0, -1, -1...","[0, -1, -1, -1, -2, -2, -3, -3, -4, -5, -5, -5..."
1,"[0, 0, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 3, 3, 3]","[0, 1, 1, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2]","[0, -1, -1, -2, -2, -2, -3, -4, -5, -5, -5, -5...","[0, 0, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1,...","[0, 1, 0, -1, -1, -1, -2, -2, -1, -1, 0, -1, -...","[0, 0, 0, 0, -1, 0, 0, 1, 1, 0, 0, 0, 1, 2, 2]","[0, 0, 0, 0, -1, -2, -2, -3, -3, -4, -4, -4, -...","[0, -1, 0, 1, 1, 1, 2, 2, 1, 1, 0, 1, 1, 1, 1]","[0, -1, 0, -1, 0, 0, 1, 2, 2, 2, 1, 1, 2, 2, 3]","[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 2, 2, 1, 1]","[0, 0, 0, 0, 0, -1, -1, -1, 0, -1, -1, -2, -2,...","[0, 0, 0, 1, 0, 0, -1, -2, -2, -2, -3, -3, -4,...","[0, 1, 1, 1, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1]","[0, 0, -1, 0, 0, 1, 1, 0, 0, 0, -1, -2, -1, -1...","[0, 0, 1, 0, 0, -1, -1, 0, 0, 0, 1, 2, 1, 2, 3]","[0, -1, -1, -1, -2, -2, -3, -3, -4, -5, -5, -5..."
2,"[0, 0, 1, 1, 2, 1, 1, 1, 1, 0, -1, -2, -3, -3,...","[0, 1, 1, 2, 2, 2, 3, 4, 3, 3, 3, 3, 3, 2, 2]","[0, -1, -1, -2, -2, -2, -3, -4, -5, -5, -5, -5...","[0, 0, -1, -1, -2, -1, -1, -1, -1, -2, -1, 0, ...","[0, 1, 0, -1, -1, -1, -2, -2, -1, -1, 0, -1, -...","[0, 0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 5, 6, 5]","[0, 0, 0, 0, -1, -2, -2, -3, -3, -4, -4, -4, -...","[0, -1, 0, -1, -1, -1, 0, 0, -1, -1, -2, -1, -...","[0, -1, -2, -3, -2, -2, -1, 0, 0, 0, 1, 1, 2, ...","[0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 1, 1, 0, 0]","[0, 0, 0, 0, 0, -1, -1, -1, 0, -1, -1, -2, -2,...","[0, 0, 0, 0, 0, 0, -1, -2, -2, -2, -3, -3, -4,...","[0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6]","[0, 0, -1, 0, 0, 1, 1, 0, 0, 0, -1, -2, -1, -1...","[0, 0, 1, 0, 0, -1, -1, 0, 0, 0, -1, 0, -1, -1...","[0, -1, -1, -1, -2, -2, -3, -3, -4, -5, -5, -5..."
3,"[0, 0, -1, -1, -2, -3, -3, -3, -3, -3, -2, -3,...","[0, 1, 1, 2, 2, 2, 3, 4, 3, 2, 2, 2, 2, 2, 1]","[0, -1, -1, -2, -2, -2, -3, -4, -5, -4, -4, -4...","[0, 0, -1, -1, -2, -1, -1, -1, -1, -1, -1, 0, ...","[0, 1, 0, -1, -1, -1, -2, -2, -3, -3, -4, -5, ...","[0, 0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4, 5, 6, 5]","[0, 0, 0, 0, -1, -2, -2, -3, -3, -4, -4, -4, -...","[0, -1, 0, 1, 1, 1, 2, 2, 1, 1, 0, 1, 1, 1, 1]","[0, 1, 2, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 6, 7]","[0, 0, 0, 0, 0, 0, 0, 0, -1, -2, -2, -3, -3, -...","[0, 0, 0, 0, 0, -1, -1, -1, 0, -1, -1, -2, -2,...","[0, -1, -1, -1, -1, -1, -1, -2, -2, -2, -3, -3...","[0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6]","[0, 0, -1, -2, -2, -3, -3, -4, -4, -4, -5, -6,...","[0, 0, 1, 0, 0, -1, -1, 0, 0, 0, 1, 2, 1, 1, 2]","[0, -1, -1, -1, -2, -2, -3, -3, -4, -5, -5, -5..."
4,"[0, 0, 1, 1, 0, -1, -1, -1, -1, 0, -1, 0, -1, ...","[0, 1, 1, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 3, 3]","[0, -1, -1, -2, -2, -2, -3, -4, -5, -5, -5, -5...","[0, 0, -1, -1, -2, -1, -1, -1, -1, -1, -2, -3,...","[0, 1, 0, -1, -1, -1, -2, -2, -1, -1, -2, -3, ...","[0, 0, 0, 0, 1, 2, 2, 3, 3, 2, 2, 2, 3, 4, 4]","[0, 0, 0, 0, -1, -2, -2, -3, -3, -4, -4, -4, -...","[0, -1, 0, 1, 1, 1, 2, 2, 1, 1, 0, 1, 1, 1, 1]","[0, -1, 0, -1, -1, -1, 0, 1, 1, 1, 0, 0, 1, 1, 2]","[0, 0, 0, 0, 1, 0, 0, 0, -1, -2, -2, -1, -1, -...","[0, 0, 0, 0, -1, -2, -2, -2, -1, -2, -2, -3, -...","[0, 0, -1, -1, -1, -1, -2, -3, -3, -3, -4, -4,...","[0, 1, 1, 1, 0, 0, 1, 1, 2, 3, 3, 3, 3, 3, 3]","[0, 0, -1, -2, -2, -1, -1, -2, -2, -2, -3, -4,...","[0, 0, 1, 0, 0, -1, -1, 0, 0, 0, 1, 2, 1, 0, 1]","[0, -1, -1, -1, -2, -2, -3, -3, -4, -5, -5, -5..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
822,"[0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4, 4, 5, 6, 7]","[0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2...","[0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]","[0, 0, 0, -1, 0, -1, 0, -1, -1, -2, -3, -3, -3...","[0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 2]","[0, 0, 1, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1]","[0, 0, -1, -2, -2, -3, -2, -3, -2, -2, -2, -2,...","[0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0]","[0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, -1]","[0, 0, 0, 1, 2, 2, 2, 3, 2, 3, 2, 3, 3, 3, 3]","[0, 0, 0, -1, -2, -2, -1, -2, -1, -2, -1, -2, ...","[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]","[0, 0, 0, 0, 0, 0, 0, -1, -2, -1, -1, 0, 1, 0, 0]","[0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, -1]","[0, -1, 0, -1, 0, -1, 0, 0, 0, 0, 1, 1, 1, 1, 2]","[0, 0, 0, 0, 0, 0, 0, -1, 0, -1, -1, -2, -3, -..."
823,"[0, 0, 0, -1, -2, -3, -3, -4, -3, -4, -5, -5, ...","[0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2...","[0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 2, 2, 2]","[0, -1, -1, -2, -3, -3, -3, -3, -3, -4, -5, -6...","[0, 0, -1, -1, -1, -2, -1, -2, -2, -2, -2, -2,...","[0, 0, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2]","[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]","[0, -1, -1, -2, -3, -4, -5, -5, -5, -5, -6, -7...","[0, 0, -1, -1, -1, -1, -1, -2, -3, -3, -3, -3,...","[0, 0, 1, 1, 1, 1, 1, 2, 3, 2, 2, 2, 2, 3, 2]","[0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]","[0, 0, 0, 0, -1, -2, -3, -3, -3, -3, -3, -3, -...","[0, -1, -2, -2, -2, -2, -2, -2, -2, -3, -4, -5...","[0, 1, 2, 1, 1, 1, 1, 2, 1, 2, 3, 4, 4, 4, 4]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]"
824,"[0, -1, -1, -2, -3, -4, -5, -5, -6, -6, -7, -8...","[0, 0, 1, 1, 1, 1, 1, 0, 0, -1, -1, -1, -1, -1...","[0, 0, -1, -1, -1, -1, -1, -2, -2, -1, -1, -1,...","[0, 1, 1, 1, 2, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6]","[0, -1, 0, 1, 0, -1, -1, -2, -2, -3, -4, -5, -...","[0, 0, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, -1]","[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, -1, -1]","[0, 1, 0, 0, 1, 2, 2, 3, 3, 3, 4, 5, 6, 6, 6]","[0, 0, -1, -2, -2, -2, -2, -3, -2, -2, -2, -2,...","[0, -1, -1, -1, -2, -3, -4, -4, -4, -5, -5, -5...","[0, 1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 5, 6, 6, 7]","[0, 0, 1, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, 0...","[0, -1, 0, 0, 0, 0, 0, -1, -1, -1, -2, -2, -2,...","[0, 0, 0, -1, -1, -1, -1, -1, -2, -3, -3, -2, ...","[0, 0, 0, 1, 2, 1, 2, 2, 3, 4, 4, 3, 4, 5, 6]","[0, 0, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0..."
825,"[0, 1, 0, 0, 0, -1, -2, -3, -3, -3, -3, -3, -4...","[0, 0, 0, -1, -2, -2, -2, -2, -3, -3, -3, -3, ...","[0, 0, 0, 1, 2, 2, 2, 2, 3, 2, 1, 2, 2, 2, 3]","[0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 4, 4]","[0, -1, -1, -1, -1, -2, -3, -2, -3, -3, -3, -4...","[0, 0, -1, -2, -3, -3, -3, -3, -3, -4, -5, -5,...","[0, 0, 1, 0, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 4]","[0, 1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6]","[0, 0, 0, -1, -2, -2, -2, -3, -4, -5, -5, -5, ...","[0, 0, 0, 0, 0, -1, -2, -2, -2, -2, -3, -3, -4...","[0, -1, 0, 0, 0, 1, 2, 2, 2, 2, 3, 4, 5, 6, 6]","[0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4]","[0, -1, -2, -3, -3, -3, -3, -3, -4, -5, -6, -6...","[0, 0, 0, 0, -1, -1, -2, -3, -3, -3, -3, -4, -...","[0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 6]","[0, 0, 1, 2, 2, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3]"
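
Because `to_csv` writes each Python list as its string representation, the cells read back above are strings such as `"[0, 0, 1, ...]"`, not lists. A minimal sketch of turning them back into lists with `ast.literal_eval`, using only the standard library and a two-column toy CSV (the toy data is made up for illustration); with pandas, the same function could presumably be passed through the `converters` argument of `pd.read_csv`.

```python
import ast
import csv
import io

# Toy CSV in the same shape as the saved files: each cell is a quoted list literal.
csv_text = (
    "five_p_cleav_1,five_p_cleav_compl_1\n"
    '"[0, 0, 1, 1, 2]","[0, 1, 1, 2, 2]"\n'
)

rows = [
    # literal_eval safely parses the list literal without executing code.
    {col: ast.literal_eval(cell) for col, cell in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]

print(rows[0]["five_p_cleav_1"])        # [0, 0, 1, 1, 2]
print(type(rows[0]["five_p_cleav_1"]))  # <class 'list'>
```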


In [116]:
# Save Point
dill.dump_session('../data/notebook_sessions/1_prepare_dataset_7.db')


In [117]:
dill.load_session('../data/notebook_sessions/1_prepare_dataset_7.db')

## End of this Notebook

In [376]:
import datetime
print(f"This notebook's last end-to-end run was on: {datetime.datetime.now()}\n")

This notebook's last end-to-end run was on: 2025-02-07 18:38:31.035878

