### This notebook encodes, or characterizes, sequences by their properties

This notebook:

1. Reads the sequence dataset
2. Handles the sequences and draws random samples from each class
3. Characterizes the sequences using physicochemical properties
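
The per-class random sampling mentioned in the list above could look roughly like the sketch below. It is only an illustration: the helper name `sample_per_class`, its parameters, and the toy frame are assumptions rather than code from this notebook; the sequences and labels are copied from the dataset preview shown later.

```python
import pandas as pd


def sample_per_class(df, label_col="label", n_per_class=2, seed=42):
    """Draw the same number of random rows from every class (hypothetical helper)."""
    # groupby(...).sample draws n rows independently inside each group
    return df.groupby(label_col).sample(n=n_per_class, random_state=seed)


# Toy frame reusing sequences and labels that appear in this notebook's output
toy = pd.DataFrame(
    {
        "sequence": [
            "QEDCELCINVACTGC",
            "SKGKKANKDVELARG",
            "ADLEVVAATYVLVA",
            "STQEVSGHPEHHLV",
            "MFRRAAFIKPRLTGFIRFN",
        ],
        "label": [0, 1, 1, 0, 0],
    }
)

balanced = sample_per_class(toy, n_per_class=2)
print(balanced["label"].value_counts())
```

Fixing `random_state` keeps the draw reproducible between runs; every class must contain at least `n_per_class` rows, otherwise pandas raises an error.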

In [6]:
pip install pandas modlamp

Collecting modlamp
  Using cached modlamp-4.3.2-py3-none-any.whl.metadata (13 kB)
Collecting nose>=1.3.7 (from modlamp)
  Using cached nose-1.3.7-py3-none-any.whl.metadata (1.7 kB)
Collecting scipy>=0.17.0 (from modlamp)
  Using cached scipy-1.16.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (61 kB)
Collecting matplotlib>=1.5.1 (from modlamp)
  Using cached matplotlib-3.10.5-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting scikit-learn>=0.18.0 (from modlamp)
  Using cached scikit_learn-1.7.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting requests>=2.11.1 (from modlamp)
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting lxml>=3.6.4 (from modlamp)
  Using cached lxml-6.0.1-cp313-cp313-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Collecting joblib>=0.15.1 (from modlamp)
  Using cached joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting

- Section: import libraries/modules

In [5]:
import pandas as pd
import numpy as np
from modlamp.descriptors import GlobalDescriptor

- Section: helper function implementations

In [6]:
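def mw(sequence):
    """Molecular weight of a single sequence, rounded to 5 decimals."""
    gd = GlobalDescriptor([sequence])
    gd.calculate_MW(amide=True)  # amide=True: treat the C-terminus as amidated
    val = float(np.round(gd.descriptor[0][0], 5))
    return val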

In [8]:
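def isoelectric_point(sequence):
    """Isoelectric point of a single sequence, rounded to 5 decimals."""
    gd = GlobalDescriptor([sequence])
    gd.isoelectric_point(amide=True)  # amide=True: treat the C-terminus as amidated
    val = float(np.round(gd.descriptor[0][0], 5))
    return val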

In [9]:
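def charge_density(sequence):
    """Charge density (net charge over molecular weight), rounded to 5 decimals."""
    gd = GlobalDescriptor([sequence])
    gd.charge_density(amide=True)  # uses modlamp's default pH
    val = float(np.round(gd.descriptor[0][0], 5))
    return val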

In [26]:
def instability_index(sequence):
    """Instability index of a single sequence, rounded to 5 decimals."""
    gd = GlobalDescriptor([sequence])
    gd.instability_index()
    val = float(np.round(gd.descriptor[0][0], 5))
    return val

In [24]:
def boman_index(sequence):
    """Boman index of a single sequence, rounded to 5 decimals."""
    gd = GlobalDescriptor([sequence])
    gd.boman_index()
    val = float(np.round(gd.descriptor[0][0], 5))
    return val

In [17]:
def hydrophobic_ratio(sequence):
    """Fraction of hydrophobic residues in a single sequence, rounded to 5 decimals."""
    gd = GlobalDescriptor([sequence])
    gd.hydrophobic_ratio()
    val = float(np.round(gd.descriptor[0][0], 5))
    return val
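
To make one of these descriptors more concrete, the sketch below recomputes the hydrophobic ratio without modlamp. It assumes the ratio is the fraction of residues in the set A, C, F, I, L, M, V; that assumption is consistent with the `hydrophobic_ratio` values in this notebook's output, but the function name `hydrophobic_fraction` and the constant are hypothetical and this is not modlamp's own implementation.

```python
# Assumed hydrophobic residue set (not taken from modlamp's source code)
HYDROPHOBIC_RESIDUES = set("ACFILMV")


def hydrophobic_fraction(sequence):
    """Fraction of residues in the assumed hydrophobic set, rounded to 5 decimals."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    hits = sum(1 for aa in sequence.upper() if aa in HYDROPHOBIC_RESIDUES)
    return round(hits / len(sequence), 5)


# Sequences copied from the dataset preview in this notebook
print(hydrophobic_fraction("ADLEVVAATYVLVA"))   # 10 of 14 residues -> 0.71429
print(hydrophobic_fraction("QEDCELCINVACTGC"))  # 8 of 15 residues -> 0.53333
```

A hand-rolled check like this is mainly useful for spot-verifying a few rows of the feature table; the modlamp helpers above remain the source of the values stored in the dataframe.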

In [13]:
df = pd.read_csv("../raw_data/demo_amp.csv")
df.head(5)

Unnamed: 0,sequence,label
0,QEDCELCINVACTGC,0
1,MAATTTATSLFSSRLHFQNQNQGYGFPAKTPNSLQVNQIIDGRKMR...,0
2,SKGKKANKDVELARG,1
3,ADLEVVAATYVLVA,1
4,MAESPSESTSDSLSTTTSTKPAQSGTVSISSPQSHHVVFPEIPIEIVS,0


In [14]:
df.shape

(21220, 2)

In [15]:
df["label"].value_counts()

label
0    10610
1    10610
Name: count, dtype: int64

In [22]:
df[:10]

Unnamed: 0,sequence,label,mw,charge_density,hydrophobic_ratio
0,QEDCELCINVACTGC,0,1599.84,-0.00142,0.53333
1,MAATTTATSLFSSRLHFQNQNQGYGFPAKTPNSLQVNQIIDGRKMR...,0,5854.58,0.00087,0.37037
2,SKGKKANKDVELARG,1,1599.84,0.0025,0.26667
3,ADLEVVAATYVLVA,1,1432.67,-0.0007,0.71429
4,MAESPSESTSDSLSTTTSTKPAQSGTVSISSPQSHHVVFPEIPIEIVS,0,4971.4,-0.00056,0.27083
5,MLRFTHVLNNGAKRSALSLGRSYLRGFGSMHGPRVA,0,3957.61,0.00182,0.38889
6,MFRRAAFIKPRLTGFIRFN,0,2340.84,0.00256,0.52632
7,ISIGIKCSPSIDLCEGQCRIRKYFTGYCSGDTCHCSG,0,4001.61,0.00042,0.35135
8,AISCGQVSSALSPCISYARGNGAKPPVACCSGVKRLAGAAQSTADK...,1,9255.82,0.00103,0.43011
9,STQEVSGHPEHHLV,0,1555.66,-0.00045,0.21429


In [18]:
matrix_data = []

for sequence in df["sequence"].values:
    row = [
        mw(sequence),
        charge_density(sequence),
        hydrophobic_ratio(sequence)
    ]
    matrix_data.append(row)

df_1 = pd.DataFrame(data=matrix_data, columns=["mw", "charge_density", "hydrophobic_ratio"])
df_1

Unnamed: 0,mw,charge_density,hydrophobic_ratio
0,1599.84,-0.00142,0.53333
1,5854.58,0.00087,0.37037
2,1599.84,0.00250,0.26667
3,1432.67,-0.00070,0.71429
4,4971.40,-0.00056,0.27083
...,...,...,...
21215,5109.73,-0.00076,0.39583
21216,3794.47,0.00134,0.44118
21217,2823.28,-0.00121,0.57692
21218,2991.63,0.00064,0.53571


In [19]:
df = pd.concat([df, df_1], axis=1)
df.head(5)

Unnamed: 0,sequence,label,mw,charge_density,hydrophobic_ratio
0,QEDCELCINVACTGC,0,1599.84,-0.00142,0.53333
1,MAATTTATSLFSSRLHFQNQNQGYGFPAKTPNSLQVNQIIDGRKMR...,0,5854.58,0.00087,0.37037
2,SKGKKANKDVELARG,1,1599.84,0.0025,0.26667
3,ADLEVVAATYVLVA,1,1432.67,-0.0007,0.71429
4,MAESPSESTSDSLSTTTSTKPAQSGTVSISSPQSHHVVFPEIPIEIVS,0,4971.4,-0.00056,0.27083


In [27]:
df["boman_index"] = df["sequence"].apply(boman_index)
df["isoelectric_point"] = df["sequence"].apply(isoelectric_point)
df["instability_index"] = df["sequence"].apply(instability_index)

In [28]:
df.to_csv("../results/procesed_data.csv", index=False)