# **Data Preparation**
Author: Christian Gabriel Lara López
 
Repo: [GitLab](https://git.uclv.edu.cu/clara/ampeptides/)


# **Loading the Peptide Dataset**

In [None]:
! wget https://github.com/chrislara01/AMP/blob/master/TR_starPep_AMP.fasta
! wget https://github.com/chrislara01/AMP/blob/master/EX_starPep_AMP.fasta
! wget https://github.com/chrislara01/AMP/blob/master/TS_starPep_AMP.fasta

--2024-01-13 05:48:34--  https://github.com/chrislara01/AMP/blob/master/TR_starPep_AMP.fasta
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4260 (4.2K) [text/plain]
Saving to: ‘TR_starPep_AMP.fasta’


2024-01-13 05:48:34 (859 KB/s) - ‘TR_starPep_AMP.fasta’ saved [4260/4260]
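
The three URLs above use GitHub's `/blob/` path, which normally addresses the web file viewer rather than the file contents. That may explain the 4.2K download logged here and CD-HIT later reading zero sequences from `TR_starPep_AMP.fasta`. Below is a minimal sketch, assuming the repository is public and keeps these files on `master`, of rewriting such a URL to its `/raw/` form before downloading. The helper name `to_raw_url` is hypothetical and not part of the original notebook.

```python
def to_raw_url(blob_url: str) -> str:
    """Rewrite a github.com '/blob/' URL to the matching '/raw/' URL.

    Hypothetical helper: only the first '/blob/' segment is replaced, and
    URLs without one are returned unchanged.
    """
    return blob_url.replace("/blob/", "/raw/", 1)


files = ["TR_starPep_AMP.fasta", "EX_starPep_AMP.fasta", "TS_starPep_AMP.fasta"]
base = "https://github.com/chrislara01/AMP/blob/master/"
raw_urls = [to_raw_url(base + name) for name in files]
for url in raw_urls:
    print(url)
```

The printed URLs can be passed to `wget` in place of the `/blob/` ones. The downloaded files should then begin with a `>` header line.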



# **Removing Redundant Sequences with CD-HIT**

In [95]:
! cd-hit -i EX_starPep_AMP.fasta -o EX_starPep_AMP_cdhit.txt -c 0.99

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i EX_starPep_AMP.fasta -o
         EX_starPep_AMP_cdhit.txt -c 0.99

Started: Sat Jan 13 06:41:47 2024
                            Output                              
----------------------------------------------------------------
total seq: 15318
longest and shortest : 100 and 11
Total letters: 507196
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 2M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 79M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90119636

comparing sequences from          0  to      15318
....................    10000  finished       9567  clusters
.....
    15318  finished      14296  clusters

Approximated maximum memory consumption: 82M
writing new database
writing clustering information
program completed !

Total CPU time 0.29


In [96]:
! cd-hit -i TR_starPep_AMP.fasta -o TR_starPep_AMP_cdhit.txt -c 0.99

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i TR_starPep_AMP.fasta -o
         TR_starPep_AMP_cdhit.txt -c 0.99

Started: Sat Jan 13 06:42:24 2024
                            Output                              
----------------------------------------------------------------
Discarding invalid sequence or sequence without identifier and description!

total seq: 0
longest and shortest : 0 and 18446744073709551615
Total letters: 0
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 75M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90520634


        0  finished          0  clusters

Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !

Total CPU time 0.09
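
The run above read no sequences (`total seq: 0`). The reported shortest length, 18446744073709551615, is the largest unsigned 64-bit value, which suggests a minimum that was never set rather than a real measurement. Below is a minimal sketch of a check that could run before clustering to confirm a file looks like FASTA. The function name and the two sample inputs are illustrative assumptions.

```python
def count_fasta_records(text: str) -> int:
    """Count records, i.e. lines that start with '>' followed by an identifier."""
    return sum(
        1
        for line in text.splitlines()
        if line.startswith(">") and line[1:].strip()
    )


# Hypothetical inputs: a tiny FASTA snippet and an HTML page saved by mistake
good = ">pep_1\nGLFDIVKKVV\n>pep_2_nonAMP\nAAGGSS\n"
bad = "<!DOCTYPE html>\n<html><head></head><body>...</body></html>\n"

print(count_fasta_records(good))
print(count_fasta_records(bad))
```

A count of zero for a downloaded file is a cue to inspect it before passing it to `cd-hit`.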


In [97]:
! cd-hit -i TS_starPep_AMP.fasta -o TS_starPep_AMP_cdhit.txt -c 0.99

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47
Command: cd-hit -i TS_starPep_AMP.fasta -o
         TS_starPep_AMP_cdhit.txt -c 0.99

Started: Sat Jan 13 06:42:28 2024
                            Output                              
----------------------------------------------------------------
total seq: 4452
longest and shortest : 100 and 11
Total letters: 157377
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 1 X 10M = 10M
Table           : 1 X 65M = 65M
Miscellaneous   : 0M
Total           : 76M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90401149

comparing sequences from          0  to       4452
....
     4452  finished       4244  clusters

Approximated maximum memory consumption: 77M
writing new database
writing clustering information
program completed !

Total CPU time 0.15
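
Each CD-HIT cluster keeps one representative sequence, so the number of sequences removed is the input count minus the cluster count. The short sketch below does that arithmetic for the two runs that read sequences, with the numbers copied from the logs above.

```python
# Sequence and cluster counts copied from the CD-HIT logs above
runs = {
    "EX_starPep_AMP": (15318, 14296),
    "TS_starPep_AMP": (4452, 4244),
}

removed = {}
for name, (total, clusters) in runs.items():
    # One representative survives per cluster; the remainder is discarded
    removed[name] = total - clusters
    share = 100 * removed[name] / total
    print(f"{name}: {removed[name]} removed ({share:.2f}% of {total})")
```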


# **Implementing Feature Calculation Functions**

In [1]:
import pandas as pd
from Pfeature.pfeature import *

In [2]:
def split_np(file):
    """Split a FASTA-formatted .txt file into positive and negative sets."""
    with open(file, "r") as fasta:
        # Drop the empty chunk before the first '>'; one entry per record remains
        records = fasta.read().split('>')[1:]
    # A record is negative when its text contains the 'nonAMP' label
    pos = ''.join('>' + rec for rec in records if 'nonAMP' not in rec)
    neg = ''.join('>' + rec for rec in records if 'nonAMP' in rec)
    # str.removesuffix requires Python 3.9 or later
    with open(file.replace('.txt', '_pos.txt'), "w") as out:
        out.write(pos.removesuffix('\n'))
    with open(file.replace('.txt', '_neg.txt'), "w") as out:
        out.write(neg.removesuffix('\n'))
    return records
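
The split above depends on the label `nonAMP` appearing in negative records. The check has to look for `nonAMP` rather than `AMP`, because `AMP` is a substring of `nonAMP`. Below is a minimal in-memory sketch of the same rule. The function name `split_records` and the sample headers are hypothetical, and real starPep headers may be formatted differently.

```python
def split_records(fasta_text: str):
    """Return (positive, negative) record lists from FASTA text."""
    records = [">" + rec for rec in fasta_text.split(">")[1:]]
    neg = [rec for rec in records if "nonAMP" in rec]
    pos = [rec for rec in records if "nonAMP" not in rec]
    return pos, neg


# Hypothetical records used only to exercise the rule
sample = ">seq1_AMP\nGIGKFLHSAK\n>seq2_nonAMP\nMSDNEQ\n>seq3_AMP\nKWKLFKKI\n"
pos, neg = split_records(sample)
print(len(pos), len(neg))
```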

In [24]:
# Functions for computing features from the datasets

# Amino acid composition
def aac(input):
  a = input.rstrip('txt')
  output = a + 'aac.csv'
  df_out = aac_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Physico-Chemical properties composition
def pcp(input):
  a = input.rstrip('txt')
  output = a + 'pcp.csv'
  df_out = pcp_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Atom composition
def atc(input):
  a = input.rstrip('txt')
  output = a + 'atc.csv'
  df_out = atc_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Bond composition
def btc(input):
  a = input.rstrip('txt')
  output = a + 'btc.csv'
  df_out = btc_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Repetitive residue information
def rri(input):
  a = input.rstrip('txt')
  output = a + 'rri.csv'
  df_out = rri_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Distance distribution of residues
def ddr(input):
  a = input.rstrip('txt')
  output = a + 'ddr.csv'
  df_out = ddr_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Shannon entropy at protein level
def sep(input):
  a = input.rstrip('txt')
  output = a + 'sep.csv'
  df_out = sep_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Shannon entropy at residue level
def ser(input):
  a = input.rstrip('txt')
  output = a + 'ser.csv'
  df_out = ser_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Shannon Entropy of Physicochemical Property
def spc(input):
  a = input.rstrip('txt')
  output = a + 'spc.csv'
  df_out = spc_wp(input, output)
  df_in = pd.read_csv(output)
  return df_in

# Pseudo amino acid composition
def pac(input):
  a = input.rstrip('txt')
  output = a + 'pac.csv'
  df_out = paac_wp(input, output, 1, 0.5)
  df_in = pd.read_csv(output)
  return df_in

# Amphiphilic Pseudo Amino Acid Composition
def apac(input):
  a = input.rstrip('txt')
  output = a + 'apac.csv'
  df_out = apaac_wp(input, output, 1, 0.5)
  df_in = pd.read_csv(output)
  return df_in

# Quasi-sequence order
def qqos(input):
  a = input.rstrip('txt')
  output = a + 'qqos.csv'
  df_out = qos_wp(input, output, 1, 0.5)
  df_in = pd.read_csv(output)
  return df_in
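
The wrappers above all follow one pattern: derive an output path from the input, call a Pfeature writer, and read the CSV back. Below is a sketch of how that pattern could be factored into a single factory. The names `feature_output_path`, `make_feature` and `fake_wp` are hypothetical, and `fake_wp` stands in for a Pfeature writer so the sketch runs without Pfeature installed.

```python
import os


def feature_output_path(fasta_path: str, tag: str) -> str:
    """Build '<stem>.<tag>.csv' next to the input, e.g. 'x.txt' -> 'x.aac.csv'."""
    stem, _ext = os.path.splitext(fasta_path)
    return f"{stem}.{tag}.csv"


def make_feature(wp_func, tag, *extra_args):
    """Wrap a writer that is called as wp_func(input, output, *extra_args)."""
    def feature(fasta_path):
        output = feature_output_path(fasta_path, tag)
        wp_func(fasta_path, output, *extra_args)
        # The notebook's wrappers would return pd.read_csv(output) at this point
        return output
    return feature


# Stand-in writer (assumption): records its arguments instead of writing a CSV
calls = []
def fake_wp(inp, out, *args):
    calls.append((inp, out, args))

pac_like = make_feature(fake_wp, "pac", 1, 0.5)
print(pac_like("TS_starPep_AMP_cdhit_pos.txt"))
```

`os.path.splitext` produces the same file names as `input.rstrip('txt')` does for `.txt` inputs. It also handles other extensions, which `rstrip` would leave attached.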


In [28]:
# Concatenate the features computed for the positive and negative datasets

def feature_calc(po, ne, feature_name):
  # Compute the feature (feature_name is one of the functions defined above)
  po_feature = feature_name(po)
  ne_feature = feature_name(ne)
  # Build the class column
  po_class = pd.Series(['positive'] * len(po_feature))
  ne_class = pd.Series(['negative'] * len(ne_feature))
  # Stack positive and negative sequences; ignore_index avoids duplicate row labels
  po_ne_class = pd.concat([po_class, ne_class], axis=0, ignore_index=True)
  po_ne_class.name = 'class'
  po_ne_feature = pd.concat([po_feature, ne_feature], axis=0, ignore_index=True)
  # Join features and class side by side
  df = pd.concat([po_ne_feature, po_ne_class], axis=1)
  return df

In [29]:
# Wrap the split and concatenation steps around feature calculation
# for txt files

def fastaFeatureCalc(input, feature):
    split_np(input)
    pos = input.replace('.txt', '_pos.txt')
    neg = input.replace('.txt', '_neg.txt')
    return feature_calc(pos, neg, feature)

def fastaFeatureListCalc(input, features):
    split_np(input)
    pos = input.replace('.txt', '_pos.txt')
    neg = input.replace('.txt', '_neg.txt')
    featureslist = []
    for feature in features:
        featureslist.append(feature_calc(pos, neg, feature))
    return featureslist

# **Data Preprocessing**

In [97]:
# Compute features
feat_list = [aac, pcp]
feature_list = fastaFeatureListCalc('TS_starPep_AMP_cdhit.txt', feat_list)

In [100]:
for feature in feature_list:
    feature['class'] = feature['class'].map({"positive": 1, "negative": 0}) 

In [104]:
# Feature selection with a variance threshold
from sklearn.feature_selection import VarianceThreshold
# Every feature table comes from the same files in the same row order,
# so the class column of the first table serves all of them
y = feature_list[0]['class']
X_array = [feature.drop('class', axis=1) for feature in feature_list]
fs = VarianceThreshold(threshold=0.1)
transf_feat_list = []
for feature in X_array:
    fs.fit(feature)  # only the fitted mask is needed, not the transformed array
    transf_feat_list.append(feature.loc[:, fs.get_support()])
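
To make the selection rule concrete, the sketch below reproduces it with the standard library only. It assumes, as scikit-learn documents for `VarianceThreshold`, that a column is kept when its population variance is strictly greater than the threshold. The function name `select_by_variance` and the toy columns are hypothetical and are not taken from the peptide data.

```python
from statistics import pvariance


def select_by_variance(columns: dict, threshold: float) -> list:
    """Return the names of columns whose population variance exceeds threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]


# Toy columns chosen so that only one clears the 0.1 threshold
toy = {
    "constant": [5.0, 5.0, 5.0, 5.0],      # variance 0
    "low_var":  [0.0, 0.1, 0.0, 0.1],      # variance about 0.0025
    "high_var": [0.0, 10.0, 0.0, 10.0],    # variance 25
}
kept = select_by_variance(toy, threshold=0.1)
print(kept)
```

Variance depends on scale, so a fixed threshold of 0.1 treats percentage-valued and fraction-valued features very differently.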

In [108]:
# Concatenate the features selected from each function
comb_feat = pd.concat(transf_feat_list, axis=1)

In [109]:
# Add the class column
final_dataset = pd.concat([comb_feat, y], axis=1)

# **Exporting the Preprocessing Results**

In [113]:
from sklearn.utils import shuffle

final_dataset_shuffled = shuffle(final_dataset)
final_dataset_shuffled.to_csv('test.csv', index=False, header=True)
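
The shuffle above draws a new permutation on every run, so the row order of `test.csv` will differ between executions. `sklearn.utils.shuffle` accepts a `random_state` argument for repeatable results. The sketch below illustrates the same idea with the standard library. The helper `shuffled_rows` and the sample rows are hypothetical.

```python
import random


def shuffled_rows(rows, seed):
    """Return a shuffled copy of rows; the same seed yields the same order."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out


rows = [("pep_%d" % i, i % 2) for i in range(6)]  # hypothetical (id, class) rows
first = shuffled_rows(rows, seed=0)
second = shuffled_rows(rows, seed=0)
print(first == second)
```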