


# Generating structure embeddings with ProstT5

To generate protein structure embeddings with ProstT5, the structures must first be preprocessed into 3Di format. This can be done with Foldseek; that step was carried out in the notebook `dev/embeddings/Foldseek_3di.py` of this repository.

In [None]:
# Load the required libraries
import h5py
import numpy as np
import os
from itertools import islice


In [3]:
# Clone the ProstT5 repository
!git clone https://github.com/mheinzinger/ProstT5.git

Cloning into 'ProstT5'...


In [1]:
# Unzip the archive
! unzip ProstT5-main.zip

Archive:  ProstT5-main.zip
d9858ad5eb774d5bc7ca5dc31d8d364049ccc87b
   creating: ProstT5-main/
  inflating: ProstT5-main/LICENSE    
  inflating: ProstT5-main/README.md  
   creating: ProstT5-main/cnn_chkpnt/
  inflating: ProstT5-main/cnn_chkpnt/README.md  
  inflating: ProstT5-main/cnn_chkpnt/model.pt  
   creating: ProstT5-main/cnn_chkpnt_AA_CNN/
  inflating: ProstT5-main/cnn_chkpnt_AA_CNN/README.md  
  inflating: ProstT5-main/cnn_chkpnt_AA_CNN/model.pt  
   creating: ProstT5-main/notebooks/
  inflating: ProstT5-main/notebooks/ProstT5_inverseFolding.ipynb  
  inflating: ProstT5-main/prostt5_sketch2.png  
   creating: ProstT5-main/scripts/
  inflating: ProstT5-main/scripts/README.md  
  inflating: ProstT5-main/scripts/embed.py  
  inflating: ProstT5-main/scripts/finetune_prostt5_lora_script.py  
  inflating: ProstT5-main/scripts/generate_foldseek_db.py  
  inflating: ProstT5-main/scripts/predict_3Di_encoderOnly.py  
  inflating: ProstT5-main/scripts/predict_AA_encoderOnly.py  
   crea

## Generating the embeddings

This section uses the 3Di files of the protein structures, which are stored in this repository under `data/embeddings/estructura/3Dmi`.
The following files were used:
* Training data: train_3di.fasta
* Validation data: val_3di.fasta
* Test data: test_3di.fasta
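
Before embedding, it can help to confirm that each FASTA file parses cleanly and contains only 3Di state letters. The sketch below is a minimal check and is not part of ProstT5: `parse_fasta` and `invalid_3di_chars` are hypothetical helpers, the example records are made up, and the code assumes that 3Di states are written with the same 20 letters as the standard amino acids, in either case.

```python
from typing import Dict, Iterable, Set

# Assumption: 3Di states are encoded with the 20 standard amino-acid letters.
THREE_DI_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def invalid_3di_chars(sequence: str) -> Set[str]:
    """Return the characters of `sequence` that are not 3Di state letters."""
    return set(sequence.upper()) - THREE_DI_ALPHABET


# Made-up records: P1 is wrapped over two lines, P2 contains invalid letters.
example = [">P1", "DVVLVVCC", "VVSD", ">P2", "dpxz"]
records = parse_fasta(example)
for protein_id, seq in records.items():
    bad = invalid_3di_chars(seq)
    status = "ok" if not bad else f"unexpected characters {sorted(bad)}"
    print(f"{protein_id}: length {len(seq)}, {status}")
```

For the real files, the same functions could be applied to an open file handle, for example `parse_fasta(open("3di_data/train_3di.fasta"))`.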

The embeddings were generated by running the following command:

`! python ProstT5-main/scripts/embed.py --input ruta/datos.fasta --output ProstT5_output/structure_embeddings_datos.h5 --half 1 --is_3Di 1 --per_protein 1`

This command generates one embedding per protein by averaging the per-residue embeddings of that protein.
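
The averaging step can be illustrated with a minimal NumPy sketch of mean pooling. This is not the code from `embed.py`: `mean_pool` is a hypothetical name, and the random array only stands in for the per-residue embeddings a model would produce.

```python
import numpy as np


def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Average an (L, D) matrix of per-residue embeddings into a (D,) vector."""
    if residue_embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (residues, dimensions)")
    return residue_embeddings.mean(axis=0)


# Toy input: 3 residues with 4-dimensional embeddings (random stand-in values).
rng = np.random.default_rng(0)
per_residue = rng.normal(size=(3, 4)).astype(np.float32)
per_protein = mean_pool(per_residue)
print(per_residue.shape, "->", per_protein.shape)
```

Averaging over the residue axis gives every protein a vector of the same length regardless of sequence length, which is what makes the per-protein embeddings directly comparable.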



### Training set

In [3]:
!python ProstT5-main/scripts/embed.py --input 3di_data/train_3di.fasta --output ProstT5_output/structure_embeddings_train.h5 --half 1 --is_3Di 1 --per_protein 1

2025-05-04 13:05:36.736165: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-04 13:05:36.758789: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using device: cuda:0
Loading T5 from: Rostlab/ProstT5
  return self.fget.__get__(instance, owner)()
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. T

In [21]:
# Count the embeddings generated for this dataset
with h5py.File('ProstT5_output/structure_embeddings_train.h5', 'r') as f:
    total_proteins = len(f.keys())
    print(f"Total proteins in file: {total_proteins}")

Total proteins in file: 18256


In [19]:
# Check the dimension of the embeddings
with h5py.File('ProstT5_output/structure_embeddings_train.h5', 'r') as f:
    # Iterate over the first 10 protein IDs only
    for protein_id in islice(f.keys(), 10):
        # Read the embedding
        embedding = f[protein_id][:]
        print(f"{protein_id}: shape {embedding.shape}")

A0A023PXQ4: shape (1024,)
A0A023T778: shape (1024,)
A0A061ACF5: shape (1024,)
A0A061ACH8: shape (1024,)
A0A061ACH9: shape (1024,)
A0A061ACL6: shape (1024,)
A0A061ACM7: shape (1024,)
A0A061ACQ8: shape (1024,)
A0A061ACX4: shape (1024,)
A0A061AD29: shape (1024,)


In [20]:
# Check the data type of the embeddings
with h5py.File('ProstT5_output/structure_embeddings_train.h5', 'r') as f:
    # Iterate over the first 10 protein IDs only
    for protein_id in islice(f.keys(), 10):
        # Read the embedding
        embedding = f[protein_id][:]
        print(f"{protein_id}: type {type(embedding)}")

A0A023PXQ4: type <class 'numpy.ndarray'>
A0A023T778: type <class 'numpy.ndarray'>
A0A061ACF5: type <class 'numpy.ndarray'>
A0A061ACH8: type <class 'numpy.ndarray'>
A0A061ACH9: type <class 'numpy.ndarray'>
A0A061ACL6: type <class 'numpy.ndarray'>
A0A061ACM7: type <class 'numpy.ndarray'>
A0A061ACQ8: type <class 'numpy.ndarray'>
A0A061ACX4: type <class 'numpy.ndarray'>
A0A061AD29: type <class 'numpy.ndarray'>
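
For downstream use it is often convenient to hold all per-protein vectors in a single matrix. The sketch below is one possible way to do that, under the assumption that every entry has the same shape; `stack_embeddings` is a hypothetical helper, and a plain dictionary with made-up IDs stands in for the HDF5 file.

```python
from typing import List, Mapping, Tuple

import numpy as np


def stack_embeddings(store: Mapping[str, np.ndarray]) -> Tuple[List[str], np.ndarray]:
    """Return sorted protein IDs and an (N, D) matrix with one row per ID."""
    ids = sorted(store.keys())
    # np.asarray reads each entry into memory; all entries must share one shape.
    matrix = np.stack([np.asarray(store[pid]) for pid in ids])
    return ids, matrix


# Stand-in for an HDF5 file: made-up IDs mapped to 4-dimensional vectors.
store = {
    "B": np.ones(4, dtype=np.float16),
    "A": np.zeros(4, dtype=np.float16),
}
ids, matrix = stack_embeddings(store)
print(ids, matrix.shape, matrix.dtype)
```

An open `h5py.File` behaves like a mapping from keys to datasets, so the same function could be called inside `with h5py.File('ProstT5_output/structure_embeddings_train.h5', 'r') as f:`. Printing `matrix.dtype` is also more informative than `type(embedding)`, which is `numpy.ndarray` for any dataset read with `[:]`.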


### Validation set

In [1]:
!python ProstT5-main/scripts/embed.py --input 3di_data/val_3di.fasta --output ProstT5_output/structure_embeddings_val.h5 --half 1 --is_3Di 1 --per_protein 1

2025-05-04 13:00:31.363521: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-04 13:00:31.386079: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using device: cuda:0
Loading T5 from: Rostlab/ProstT5
  return self.fget.__get__(instance, owner)()
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. T

In [2]:
# Count the embeddings generated for this dataset
import h5py
from itertools import islice
with h5py.File('structure_embeddings/structure_embeddings_val.h5', 'r') as f:
    total_proteins = len(f.keys())
    print(f"Total proteins in file: {total_proteins}")

Total proteins in file: 513


In [16]:
# Check the dimension of the embeddings
with h5py.File('ProstT5_output/structure_embeddings_val.h5', 'r') as f:
    # Iterate over the first 10 protein IDs only
    for protein_id in islice(f.keys(), 10):
        # Read the embedding
        embedding = f[protein_id][:]
        print(f"{protein_id}: shape {embedding.shape}")

A0A087WRJ2: shape (1024,)
A0A0B5JS55: shape (1024,)
A0A0G2JE97: shape (1024,)
A0A0G2KCY3: shape (1024,)
A0A0G2L325: shape (1024,)
A0A0K2H545: shape (1024,)
A0A0R4IBM8: shape (1024,)
A0A0R4IEZ3: shape (1024,)
A0A0R4IKF5: shape (1024,)
A0A0R4IP63: shape (1024,)


In [17]:
# Check the data type of the embeddings
with h5py.File('ProstT5_output/structure_embeddings_val.h5', 'r') as f:
    # Iterate over the first 10 protein IDs only
    for protein_id in islice(f.keys(), 10):
        # Read the embedding
        embedding = f[protein_id][:]
        print(f"{protein_id}: type {type(embedding)}")

A0A087WRJ2: type <class 'numpy.ndarray'>
A0A0B5JS55: type <class 'numpy.ndarray'>
A0A0G2JE97: type <class 'numpy.ndarray'>
A0A0G2KCY3: type <class 'numpy.ndarray'>
A0A0G2L325: type <class 'numpy.ndarray'>
A0A0K2H545: type <class 'numpy.ndarray'>
A0A0R4IBM8: type <class 'numpy.ndarray'>
A0A0R4IEZ3: type <class 'numpy.ndarray'>
A0A0R4IKF5: type <class 'numpy.ndarray'>
A0A0R4IP63: type <class 'numpy.ndarray'>


### Test set

In [2]:

!python ProstT5-main/scripts/embed.py --input 3di_data/test_3di.fasta --output ProstT5_output/structure_embeddings_test.h5 --half 1 --is_3Di 1 --per_protein 1

2025-05-04 13:04:28.744354: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-04 13:04:28.766886: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using device: cuda:0
Loading T5 from: Rostlab/ProstT5
  return self.fget.__get__(instance, owner)()
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. T

In [14]:
# Count the embeddings generated for this dataset
with h5py.File('ProstT5_output/structure_embeddings_test.h5', 'r') as f:
    total_proteins = len(f.keys())
    print(f"Total proteins in file: {total_proteins}")

Total proteins in file: 523


In [12]:
# Check the dimension of the embeddings
with h5py.File('ProstT5_output/structure_embeddings_test.h5', 'r') as f:
    # Iterate over the first 10 protein IDs only
    for protein_id in islice(f.keys(), 10):
        # Read the embedding
        embedding = f[protein_id][:]
        print(f"{protein_id}: shape {embedding.shape}")

A0A0K2H416: shape (1024,)
A0A0K2H4T3: shape (1024,)
A0A0K2H4Y0: shape (1024,)
A0A0K2H571: shape (1024,)
A0A0K2H597: shape (1024,)
A0A0K2H599: shape (1024,)
A0A0K2H5Z1: shape (1024,)
A0A0K2H6J8: shape (1024,)
A0A0K2H6X7: shape (1024,)
A0A0K2H776: shape (1024,)


In [13]:
# Check the data type of the embeddings
with h5py.File('ProstT5_output/structure_embeddings_test.h5', 'r') as f:
    # Iterate over the first 10 protein IDs only
    for protein_id in islice(f.keys(), 10):
        # Read the embedding
        embedding = f[protein_id][:]
        print(f"{protein_id}: type {type(embedding)}")

A0A0K2H416: type <class 'numpy.ndarray'>
A0A0K2H4T3: type <class 'numpy.ndarray'>
A0A0K2H4Y0: type <class 'numpy.ndarray'>
A0A0K2H571: type <class 'numpy.ndarray'>
A0A0K2H597: type <class 'numpy.ndarray'>
A0A0K2H599: type <class 'numpy.ndarray'>
A0A0K2H5Z1: type <class 'numpy.ndarray'>
A0A0K2H6J8: type <class 'numpy.ndarray'>
A0A0K2H6X7: type <class 'numpy.ndarray'>
A0A0K2H776: type <class 'numpy.ndarray'>
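
With all three files generated, one further sanity check is that no protein ID appears in more than one split. The sketch below shows one way this could be done; `split_overlaps` is a hypothetical helper and the IDs in `toy` are made up for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def split_overlaps(splits: Dict[str, Iterable[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the IDs shared by each pair of splits, omitting empty overlaps."""
    id_sets = {name: set(ids) for name, ids in splits.items()}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for a, b in combinations(id_sets, 2):
        shared = id_sets[a] & id_sets[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Made-up IDs: "A3" is deliberately placed in both train and test.
toy = {
    "train": ["A1", "A2", "A3"],
    "val": ["B1"],
    "test": ["A3", "C1"],
}
overlaps = split_overlaps(toy)
print(overlaps)
```

For the real data, each value could be the list of keys read from the corresponding HDF5 file, for example `list(f.keys())` inside a `with h5py.File(...)` block; an empty result would mean the splits are disjoint.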
