# **Data Preparation**
Author: Christian Gabriel Lara López
 
Repo: [GitLab](https://git.uclv.edu.cu/clara/ampeptides/)


# **Loading the Peptide Dataset**

In [None]:
! wget https://github.com/chrislara01/AMP/blob/master/TR_starPep_AMP.fasta
! wget https://github.com/chrislara01/AMP/blob/master/EX_starPep_AMP.fasta
! wget https://github.com/chrislara01/AMP/blob/master/TS_starPep_AMP.fasta

--2024-01-13 05:48:34--  https://github.com/chrislara01/AMP/blob/master/TR_starPep_AMP.fasta
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4260 (4.2K) [text/plain]
Saving to: ‘TR_starPep_AMP.fasta’


2024-01-13 05:48:34 (859 KB/s) - ‘TR_starPep_AMP.fasta’ saved [4260/4260]



# **Removing Redundant Sequences with CD-HIT**

In [95]:
! cd-hit -i EX_starPep_AMP.fasta -o EX_starPep_AMP_cdhit.txt -c 0.99

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i EX_starPep_AMP.fasta -o
         EX_starPep_AMP_cdhit.txt -c 0.99

Started: Sat Jan 13 06:41:47 2024
                            Output                              
----------------------------------------------------------------
total seq: 15318
longest and shortest : 100 and 11
Total letters: 507196
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 2M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 79M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90119636

comparing sequences from          0  to      15318
....................    10000  finished       9567  clusters
.....
    15318  finished      14296  clusters

Approximated maximum memory consumption: 82M
writing new database
writing clustering information
program completed !

Total CPU time 0.29


In [96]:
! cd-hit -i TR_starPep_AMP.fasta -o TR_starPep_AMP_cdhit.txt -c 0.99

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i TR_starPep_AMP.fasta -o
         TR_starPep_AMP_cdhit.txt -c 0.99

Started: Sat Jan 13 06:42:24 2024
                            Output                              
----------------------------------------------------------------
Discarding invalid sequence or sequence without identifier and description!

total seq: 0
longest and shortest : 0 and 18446744073709551615
Total letters: 0
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90520634


        0  finished          0  clusters

Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !

Total CPU time 0.09
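
The log above reports `total seq: 0` for `TR_starPep_AMP.fasta`, meaning CD-HIT found no valid record in that input. A quick sanity check before clustering can surface this earlier. The sketch below is one possible check and is not part of the original notebook: the helper name `count_fasta_records` and the toy strings are hypothetical, and it only assumes that each record is a `>` header line followed by at least one sequence line.

```python
def count_fasta_records(text):
    """Count records made of a '>' header followed by at least one sequence line."""
    count = 0
    header_seen = False
    has_sequence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Close the previous record if it was complete
            if header_seen and has_sequence:
                count += 1
            header_seen = True
            has_sequence = False
        elif header_seen:
            has_sequence = True
    # Close the last record
    if header_seen and has_sequence:
        count += 1
    return count


# Toy inputs, made up for illustration
valid = ">pep1\nGLFDIVKK\n>pep2\nKWKLFKKI\n"
not_fasta = "<!DOCTYPE html>\n<html><body>not a FASTA file</body></html>\n"

print(count_fasta_records(valid))
print(count_fasta_records(not_fasta))
```

For a real file, the text could be read with `open(path).read()` and the pipeline stopped when the count is 0.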


In [97]:
! cd-hit -i TS_starPep_AMP.fasta -o TS_starPep_AMP_cdhit.txt -c 0.99

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i TS_starPep_AMP.fasta -o
         TS_starPep_AMP_cdhit.txt -c 0.99

Started: Sat Jan 13 06:42:28 2024
                            Output                              
----------------------------------------------------------------
total seq: 4452
longest and shortest : 100 and 11
Total letters: 157377
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 76M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90401149

comparing sequences from          0  to       4452
....
     4452  finished       4244  clusters

Approximated maximum memory consumption: 77M
writing new database
writing clustering information
program completed !

Total CPU time 0.15
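
According to the logs above, clustering at `-c 0.99` keeps 14296 representatives out of 15318 sequences for the external set and 4244 out of 4452 for the test set. CD-HIT also writes a cluster report next to the output file (with a `.clstr` suffix). The following sketch shows how such a report could be summarized; it assumes the usual layout in which each cluster starts with a `>Cluster N` line and each member occupies one line. The function name `cluster_sizes` and the toy report are hypothetical.

```python
def cluster_sizes(clstr_text):
    """Return the number of member lines in each cluster of a .clstr report."""
    sizes = []
    for line in clstr_text.splitlines():
        if line.startswith('>Cluster'):
            sizes.append(0)          # a new cluster begins
        elif line.strip() and sizes:
            sizes[-1] += 1           # one member of the current cluster
    return sizes


# Toy report, made up for illustration
toy_clstr = (
    ">Cluster 0\n"
    "0\t30aa, >pep1... *\n"
    "1\t30aa, >pep2... at 100.00%\n"
    ">Cluster 1\n"
    "0\t12aa, >pep3... *\n"
)
sizes = cluster_sizes(toy_clstr)
print(len(sizes), sizes)

# Share of sequences kept as representatives, using the counts from the logs
for name, total, clusters in [("EX", 15318, 14296), ("TS", 4452, 4244)]:
    print(name, f"{clusters / total:.1%}")
```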


# **Implementing the Feature-Calculation Functions**

In [1]:
import pandas as pd
from Pfeature.pfeature import *

In [2]:
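def split_np(file):
    """Split a FASTA-formatted .txt file into positive and negative sets.

    Records whose text contains 'nonAMP' are written to '<name>_neg.txt';
    all other records are written to '<name>_pos.txt'.
    Returns the list of raw records.
    """
    with open(file, "r") as fasta:
        # Each record starts with '>'; anything before the first '>' is discarded
        array = fasta.read().split('>')[1:]
    pos = ''
    neg = ''
    for record in array:
        if 'nonAMP' in record:
            neg += '>' + record
        else:
            pos += '>' + record
    # Drop the trailing newline (str.removesuffix requires Python 3.9+)
    pos = pos.removesuffix('\n')
    neg = neg.removesuffix('\n')
    with open(file.replace('.txt', '_pos.txt'), "w") as pos_file:
        pos_file.write(pos)
    with open(file.replace('.txt', '_neg.txt'), "w") as neg_file:
        neg_file.write(neg)
    return array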
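
The splitting rule used by `split_np` can be tried on a small in-memory example without touching the file system. This is only an illustrative sketch: the helper `split_records` and the record names are hypothetical, and unlike `split_np` it looks for the `nonAMP` tag in the header line only.

```python
def split_records(fasta_text, negative_tag='nonAMP'):
    """Split FASTA text into (positives, negatives) based on a tag in the header."""
    positives, negatives = [], []
    for record in fasta_text.split('>')[1:]:
        header = record.split('\n', 1)[0]
        target = negatives if negative_tag in header else positives
        target.append('>' + record.strip())
    return positives, negatives


# Toy records, made up for illustration
toy = ">AMP_1\nGLFDIVKK\n>nonAMP_1\nMSTNPKPQ\n>AMP_2\nKWKLFKKI\n"
pos, neg = split_records(toy)
print(len(pos), "positive records")
print(len(neg), "negative records")
```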

In [3]:
# Functions that compute each feature set for a dataset file.
# Each one writes '<name>.<feature>.csv' next to the input file and
# returns its contents as a DataFrame.
# removesuffix('txt') keeps the trailing dot, e.g. 'data.txt' -> 'data.'

# Amino acid composition
def aac(fasta_path):
  output = fasta_path.removesuffix('txt') + 'aac.csv'
  aac_wp(fasta_path, output)
  return pd.read_csv(output)

# Physico-chemical properties composition
def pcp(fasta_path):
  output = fasta_path.removesuffix('txt') + 'pcp.csv'
  pcp_wp(fasta_path, output)
  return pd.read_csv(output)

# Atom composition
def atc(fasta_path):
  output = fasta_path.removesuffix('txt') + 'atc.csv'
  atc_wp(fasta_path, output)
  return pd.read_csv(output)

# Repetitive residue information
def rri(fasta_path):
  output = fasta_path.removesuffix('txt') + 'rri.csv'
  rri_wp(fasta_path, output)
  return pd.read_csv(output)

# Distance distribution of residues
def ddr(fasta_path):
  output = fasta_path.removesuffix('txt') + 'ddr.csv'
  ddr_wp(fasta_path, output)
  return pd.read_csv(output)

# Shannon entropy at protein level
def sep(fasta_path):
  output = fasta_path.removesuffix('txt') + 'sep.csv'
  sep_wp(fasta_path, output)
  return pd.read_csv(output)

# Shannon entropy at residue level
def ser(fasta_path):
  output = fasta_path.removesuffix('txt') + 'ser.csv'
  ser_wp(fasta_path, output)
  return pd.read_csv(output)

# Shannon entropy of physico-chemical properties
def spc(fasta_path):
  output = fasta_path.removesuffix('txt') + 'spc.csv'
  spc_wp(fasta_path, output)
  return pd.read_csv(output)

# Pseudo amino acid composition
def pac(fasta_path):
  output = fasta_path.removesuffix('txt') + 'pac.csv'
  paac_wp(fasta_path, output, 1, 0.5)
  return pd.read_csv(output)

# Amphiphilic pseudo amino acid composition
def apac(fasta_path):
  output = fasta_path.removesuffix('txt') + 'apac.csv'
  apaac_wp(fasta_path, output, 1, 0.5)
  return pd.read_csv(output)

# Quasi-sequence order
def qqos(fasta_path):
  output = fasta_path.removesuffix('txt') + 'qqos.csv'
  qos_wp(fasta_path, output, 1, 0.5)
  return pd.read_csv(output)
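
To make the simplest of these descriptors concrete, the sketch below computes an amino acid composition by hand. It is an illustration rather than Pfeature's implementation: it assumes the composition is reported as a percentage of the sequence length, which is consistent with values such as 7.69 (1/13 as a percentage) in the dataset previews later in this notebook. The function name and the example peptide are hypothetical.

```python
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'


def amino_acid_composition(sequence):
    """Return the percentage of each of the 20 standard residues in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    length = len(sequence)
    return {
        f'AAC_{aa}': round(100 * sequence.count(aa) / length, 2)
        for aa in AMINO_ACIDS
    }


# Toy 13-residue peptide, made up for illustration
comp = amino_acid_composition('GLFDIIKKIAESF')
print(comp['AAC_F'], comp['AAC_I'], comp['AAC_D'])
```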


In [4]:
# Concatenate the features computed for the positive and negative datasets

def feature_calc(po, ne, feature_name):
  """Compute one feature set for both files and return a labelled DataFrame.

  `feature_name` is the feature function itself (e.g. `aac`), not a string.
  """
  # Compute the feature
  po_feature = feature_name(po)
  ne_feature = feature_name(ne)
  # Build the class column
  po_class = pd.Series(['positive'] * len(po_feature))
  ne_class = pd.Series(['negative'] * len(ne_feature))
  # Stack positive and negative rows
  po_ne_class = pd.concat([po_class, ne_class], axis=0)
  po_ne_class.name = 'class'
  po_ne_feature = pd.concat([po_feature, ne_feature], axis=0)
  # Join features and class column side by side
  df = pd.concat([po_ne_feature, po_ne_class], axis=1)
  return df
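
The stacking pattern used in `feature_calc` can be tried on a toy example. The sketch below uses made-up feature values and one optional variation that is not in the original function: passing `ignore_index=True` when stacking rows, so the combined frame gets unique row labels from 0 to n-1 instead of repeating the labels of each input.

```python
import pandas as pd

# Toy feature tables, made up for illustration
pos = pd.DataFrame({'AAC_A': [10.0, 0.0], 'AAC_K': [20.0, 5.0]})
neg = pd.DataFrame({'AAC_A': [3.0], 'AAC_K': [1.0]})

# Stack the rows and renumber them 0..n-1
features = pd.concat([pos, neg], axis=0, ignore_index=True)

# One label per row, in the same order as the stacked rows
labels = pd.Series(
    ['positive'] * len(pos) + ['negative'] * len(neg),
    name='class',
)

# Join features and labels side by side (aligned on the shared row labels)
dataset = pd.concat([features, labels], axis=1)
print(dataset)
```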

In [5]:
# Wrap the split and concatenation steps of the feature calculation
# for .txt files

def fastaFeatureCalc(fasta_file, feature):
    """Split the file by class and compute a single feature set."""
    split_np(fasta_file)
    pos = fasta_file.replace('.txt', '_pos.txt')
    neg = fasta_file.replace('.txt', '_neg.txt')
    return feature_calc(pos, neg, feature)

def fastaFeatureListCalc(fasta_file, features):
    """Split the file by class and compute every feature set in `features`."""
    split_np(fasta_file)
    pos = fasta_file.replace('.txt', '_pos.txt')
    neg = fasta_file.replace('.txt', '_neg.txt')
    featureslist = []
    for feature in features:
        featureslist.append(feature_calc(pos, neg, feature))
    return featureslist

# **Data Preprocessing**

## **Computing the Features**

In [8]:
feat_list = [aac, pcp, atc, rri, ddr, sep, ser, spc, pac, apac, qqos]
feature_list_tr = fastaFeatureListCalc('TR_starPep_AMP_cdhit.txt', feat_list)
feature_list_ts = fastaFeatureListCalc('TS_starPep_AMP_cdhit.txt', feat_list)
feature_list_ex = fastaFeatureListCalc('EX_starPep_AMP_cdhit.txt', feat_list)

## **Mapping the Nominal Class Values to Integers**

In [9]:
for feature in feature_list_tr:
    feature['class'] = feature['class'].map({"positive": 1, "negative": 0})
    
for feature in feature_list_ts:
    feature['class'] = feature['class'].map({"positive": 1, "negative": 0}) 

for feature in feature_list_ex:
    feature['class'] = feature['class'].map({"positive": 1, "negative": 0}) 

## **Feature Selection with a Variance Threshold**

In [81]:
# Feature selection using a variance threshold
from sklearn.feature_selection import VarianceThreshold

fs = VarianceThreshold(threshold=0.1)

selec_features_func = []
# Train dataset
X_array_tr = []
y_tr = feature_list_tr[0]['class']
for feature in feature_list_tr:
    X_array_tr.append(feature.drop('class', axis=1))

transf_feat_list_tr = []
for feature in X_array_tr:
    # Fit on the training data only; the mask is reused for the other sets
    fs.fit(feature)
    support = fs.get_support()
    transf_feat_list_tr.append(feature.loc[:, support])
    selec_features_func.append(support)

# Test dataset
X_array_ts = []
y_ts = feature_list_ts[0]['class']
for feature in feature_list_ts:
    X_array_ts.append(feature.drop('class', axis=1))

transf_feat_list_ts = []
for feature, support in zip(X_array_ts, selec_features_func):
    transf_feat_list_ts.append(feature.loc[:, support])

# External dataset
X_array_ex = []
y_ex = feature_list_ex[0]['class']

for feature in feature_list_ex:
    X_array_ex.append(feature.drop('class', axis=1))

transf_feat_list_ex = []
for feature, support in zip(X_array_ex, selec_features_func):
    transf_feat_list_ex.append(feature.loc[:, support])
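
The cell above fits the variance filter on the training features and reuses the resulting column mask for the test and external sets, so all three end up with the same columns. The plain-Python sketch below illustrates that rule on toy numbers. It is not scikit-learn's code: it assumes the filter keeps a column when its population variance on the training data is strictly greater than the threshold, and the helper names `variance_mask` and `apply_mask` are hypothetical.

```python
from statistics import pvariance


def variance_mask(rows, threshold):
    """Return one boolean per column: True if its variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [pvariance(col) > threshold for col in columns]


def apply_mask(rows, mask):
    """Keep only the columns flagged True in the mask."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


# Toy data, made up for illustration
train = [[0.0, 1.0, 5.0],
         [0.0, 2.0, 5.1],
         [0.0, 3.0, 4.9]]
test = [[0.0, 9.0, 5.0]]

mask = variance_mask(train, threshold=0.1)   # computed from the training set only
train_selected = apply_mask(train, mask)
test_selected = apply_mask(test, mask)       # same columns, no refitting

print(mask)
print(test_selected)
```

Reusing the training mask avoids selecting a different set of columns for each dataset, which would make the resulting tables incompatible.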


## **Combining the Results**

In [82]:
# Concatenate the features selected for each feature function
comb_feat_tr = pd.concat(transf_feat_list_tr, axis=1)
comb_feat_ts = pd.concat(transf_feat_list_ts, axis=1)
comb_feat_ex = pd.concat(transf_feat_list_ex, axis=1)


In [83]:
# Append the class column
final_dataset_tr = pd.concat([comb_feat_tr, y_tr], axis=1)
final_dataset_ts = pd.concat([comb_feat_ts, y_ts], axis=1)
final_dataset_ex = pd.concat([comb_feat_ex, y_ex], axis=1)

In [84]:
final_dataset_tr

Unnamed: 0,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,APAAC1_T,APAAC1_V,APAAC1_W,APAAC1_Y,QSO1_SC_A,QSO1_SC_K,QSO1_SC_L,QSO1_SC_R,QSO1_SC_S,class
0,0.00,0.00,7.69,0.00,15.38,15.38,0.00,23.08,7.69,15.38,...,0.00,0.00,0.0,0.00,0.0000,0.2982,0.5963,0.0000,0.2982,1
1,15.38,0.00,0.00,0.00,7.69,7.69,0.00,15.38,0.00,30.77,...,0.00,0.00,0.0,0.00,1.0819,0.0000,2.1638,0.0000,1.0819,1
2,15.38,0.00,0.00,0.00,15.38,7.69,0.00,0.00,15.38,38.46,...,0.00,0.00,0.0,0.00,0.6483,0.6483,1.6209,0.0000,0.3242,1
3,0.00,0.00,0.00,5.88,0.00,0.00,5.88,5.88,0.00,29.41,...,5.88,5.88,0.0,5.88,0.0000,0.0000,1.7396,0.0000,0.0000,1
4,0.00,0.00,7.69,0.00,7.69,7.69,0.00,7.69,0.00,23.08,...,15.38,7.69,0.0,0.00,0.0000,0.0000,1.1310,0.3770,0.0000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7656,7.78,5.56,10.00,8.89,4.44,4.44,1.11,10.00,6.67,12.22,...,2.22,4.44,0.0,1.11,0.4196,0.3597,0.6594,0.1199,0.4795,0
7657,0.00,0.00,0.00,0.00,10.53,21.05,5.26,0.00,5.26,10.53,...,15.79,5.26,0.0,0.00,0.0000,0.2677,0.5354,0.0000,0.2677,0
7658,0.00,10.00,0.00,5.00,5.00,10.00,5.00,0.00,10.00,5.00,...,15.00,5.00,0.0,5.00,0.0000,0.5855,0.2927,0.0000,0.0000,0
7659,6.06,6.06,4.55,9.09,3.03,1.52,3.03,6.06,9.09,4.55,...,3.03,6.06,0.0,4.55,0.2923,0.4385,0.2192,0.2192,0.4385,0


In [85]:
final_dataset_ts

Unnamed: 0,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,APAAC1_T,APAAC1_V,APAAC1_W,APAAC1_Y,QSO1_SC_A,QSO1_SC_K,QSO1_SC_L,QSO1_SC_R,QSO1_SC_S,class
0,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,8.33,25.00,...,0.00,8.33,0.00,8.33,0.0000,0.3983,1.1950,0.0000,1.1950,1
1,0.00,0.0,0.00,0.00,8.82,11.76,0.00,2.94,5.88,2.94,...,2.94,5.88,2.94,5.88,0.0000,0.3384,0.1692,0.3384,0.0000,1
2,0.00,0.0,0.00,0.00,6.25,0.00,0.00,12.50,18.75,12.50,...,12.50,0.00,0.00,6.25,0.0000,0.7313,0.4875,0.2438,0.4875,1
3,9.52,0.0,4.76,0.00,0.00,9.52,4.76,4.76,0.00,0.00,...,9.52,4.76,4.76,19.05,0.4290,0.0000,0.0000,0.2145,0.6435,1
4,55.56,0.0,0.00,0.00,0.00,5.56,0.00,0.00,33.33,0.00,...,0.00,0.00,0.00,5.56,1.6558,0.9935,0.0000,0.0000,0.0000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2184,6.25,0.0,6.25,0.00,0.00,12.50,0.00,6.25,0.00,31.25,...,12.50,6.25,0.00,0.00,0.2826,0.0000,1.4130,0.2826,0.0000,0
2185,2.20,2.2,3.30,7.69,8.79,7.69,4.40,3.30,5.49,10.99,...,2.20,4.40,2.20,4.40,0.1235,0.3087,0.6173,0.2469,0.3704,0
2186,23.08,0.0,15.38,7.69,0.00,0.00,0.00,7.69,15.38,0.00,...,7.69,7.69,0.00,0.00,0.6797,0.4531,0.0000,0.2266,0.0000,0
2187,25.00,5.0,0.00,20.00,0.00,0.00,0.00,0.00,15.00,5.00,...,5.00,10.00,0.00,0.00,1.1196,0.6718,0.2239,0.0000,0.0000,0


In [86]:
final_dataset_ex

Unnamed: 0,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,APAAC1_T,APAAC1_V,APAAC1_W,APAAC1_Y,QSO1_SC_A,QSO1_SC_K,QSO1_SC_L,QSO1_SC_R,QSO1_SC_S,class
0,0.00,0.00,0.00,0.00,0.00,16.67,0.00,16.67,25.00,16.67,...,0.00,0.00,8.33,0.00,0.0000,0.6988,0.4659,0.2329,0.2329,1
1,23.81,0.00,0.00,0.00,0.00,19.05,0.00,14.29,14.29,9.52,...,0.00,4.76,0.00,0.00,1.1499,0.6899,0.4600,0.0000,0.4600,1
2,5.41,0.00,5.41,0.00,10.81,10.81,0.00,5.41,21.62,5.41,...,2.70,10.81,0.00,0.00,0.1956,0.7822,0.1956,0.1956,0.3911,1
3,11.11,11.11,5.56,0.00,5.56,5.56,0.00,0.00,5.56,5.56,...,0.00,0.00,0.00,0.00,0.5148,0.2574,0.2574,0.5148,1.0295,1
4,16.28,0.00,2.33,4.65,2.33,6.98,0.00,6.98,16.28,4.65,...,4.65,6.98,6.98,0.00,0.7621,0.7621,0.2177,0.0000,0.3266,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10720,8.77,3.51,1.75,10.53,3.51,7.02,5.26,1.75,3.51,19.30,...,0.00,0.00,3.51,7.02,0.3453,0.1381,0.7598,0.3453,0.1381,0
10721,16.67,2.78,2.78,5.56,5.56,5.56,8.33,5.56,5.56,19.44,...,5.56,0.00,2.78,2.78,0.8789,0.2930,1.0253,0.0000,0.2930,0
10722,6.45,0.00,3.23,0.00,12.90,3.23,3.23,16.13,6.45,9.68,...,3.23,6.45,0.00,3.23,0.3988,0.3988,0.5983,0.1994,0.3988,0
10723,2.33,0.00,4.65,2.33,2.33,13.95,6.98,4.65,4.65,9.30,...,2.33,4.65,0.00,6.98,0.1075,0.2151,0.4302,0.1075,0.5377,0


# **Exporting the Preprocessing Results**

In [87]:
from sklearn.utils import shuffle

final_dataset_tr_shuffled = shuffle(final_dataset_tr)
final_dataset_ts_shuffled = shuffle(final_dataset_ts)
final_dataset_ex_shuffled = shuffle(final_dataset_ex)
final_dataset_tr_shuffled.to_csv('train.csv', index=False, header=True)
final_dataset_ts_shuffled.to_csv('test.csv', index=False, header=True)
final_dataset_ex_shuffled.to_csv('external.csv', index=False, header=True)
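
The shuffle in this cell is not seeded, so each run writes the rows in a different order. `sklearn.utils.shuffle` accepts a `random_state` argument for that purpose. The sketch below illustrates the same idea with only the standard library; the helper name `shuffled_rows`, the toy rows, and the seed value 42 are arbitrary assumptions.

```python
import random


def shuffled_rows(rows, seed):
    """Return a shuffled copy of rows; the same seed always gives the same order."""
    rng = random.Random(seed)
    result = list(rows)      # copy so the input stays untouched
    rng.shuffle(result)
    return result


# Toy rows (feature value, class), made up for illustration
rows = [(0.1, 1), (0.2, 1), (0.3, 0), (0.4, 0), (0.5, 1)]

first = shuffled_rows(rows, seed=42)
second = shuffled_rows(rows, seed=42)

print(first == second)                 # same seed, same order
print(sorted(first) == sorted(rows))   # no row lost or duplicated
```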