<a href="https://colab.research.google.com/github/amoyag/Bioquimica_Ing_Proteinas/blob/main/1-Intro_PyRosetta/clase1-pose.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<!--NOTEBOOK_HEADER-->
*This notebook contains material from [PyRosetta](https://RosettaCommons.github.io/PyRosetta.notebooks);
content is available [on Github](https://github.com/RosettaCommons/PyRosetta.notebooks.git).*

# Working with poses
Keywords: pose_from_pdb(), sequence(), cleanATOM, annotated_sequence()

We will get practice working with the `Pose` class in PyRosetta. We will load a protein from a PDB file and use the `Pose` class to learn about its sequence.

On the corresponding `Pose` lab found on the PyRosetta website, you will find various useful commands to interrogate poses; these may come in handy for the exercises.

In [None]:
# PyRosetta needs to be installed again

!pip install pyrosettacolabsetup
import pyrosettacolabsetup; pyrosettacolabsetup.install_pyrosetta()
import pyrosetta; pyrosetta.init()

In [None]:
from pyrosetta import *
init()
import os
notebook_path = os.path.abspath("clase1-pose.ipynb")

┌───────────────────────────────────────────────────────────────────────────────┐
│                                  PyRosetta-4                                  │
│               Created in JHU by Sergey Lyskov and PyRosetta Team              │
│               (C) Copyright Rosetta Commons Member Institutions               │
│                                                                               │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRES PURCHASE OF A LICENSE │
│          See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└───────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2025 [Rosetta PyRosetta4.MinSizeRel.python312.ubuntu 2025.37+release.df75a9c48e763e52a7aa3f5dfba077f4da88dbf5 2025-09-03T12:23:30] retrieved from: http://www.pyrosetta.org
core.init: Checking for fconfig files in pwd and ./rosetta/flags
core.init: Rosetta version: PyRosetta4.MinSizeRel.python312.ubuntu r408 2025.37+release.df75a9c

In [None]:
# The path is absolute, so os.path.join() would ignore the notebook directory anyway.
pdb_file = "/content/google_drive/MyDrive/BIP_25-26/1-Intro_PyRosetta/5tj3.pdb"
pose = pose_from_pdb(pdb_file)

core.import_pose.import_pose: File '/content/google_drive/MyDrive/BIP_25-26/1-Intro_PyRosetta/5tj3.pdb' automatically determined to be of type PDB from contents.
core.pack.pack_missing_sidechains: packing residue number 233 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 350 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 353 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 354 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 382 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 454 because of missing atom number 6 atom name  CG
core.pack.task: Packer task: initialize from command line()
core.scoring.ScoreFunctionFactory: SCOREFUNCTION: ref2015
core.pack.pack_rotamers: built 90 rotamers at 6 positions.
core.pa

## What is a pose?

The Pose class includes various types of information that describe a structure. Some of the core components include the Energies, PDBInfo, and Conformation. See the Rosetta3 paper to learn more: https://www.sciencedirect.com/science/article/pii/B9780123812704000196

**Exercise**

Do a little research to find out a definition of `Pose`. Write down your definition below.

As an example, let's use our pose to look at the sequence of 5TJ3:
`pose.sequence()`

In [None]:
pose.sequence()

'NAVPRPKLVVGLVVDQMRWDYLYRYYSKYGEGGFKRMLNTGYSLNNVHIDYVPTVTAIGHTSIFTGSVPSIHGIAGNDWYDKELGKSVYCTSDETVQPVGTTSNSVGQHSPRNLWSTTVTDQLGLATNFTSKVVGVSLKDRASILPAGHNPTGAFWFDDTTGKFITSTYYTKELPKWVNDFNNKNVPAQLVANGWNTLLPINQYTESSEDNVEWEGLLGSKKTPTFPYTDLAKDYEAKKGLIRTTPFGNTLTLQMADAAIDGNQMGVDDITDFLTVNLASTDYVGHNFGPNSIEVEDTYLRLDRDLADFFNNLDKKVGKGNYLVFLSADHGAAHSVGFMQAHKMPTGFFDMKKEMNAKLKQKFGADNIIAAAMNYQVYFDRKVLADSKLELDDVRDYVMTELKKEPSVLYVLSTDEIWESSIPEPIKSRVINGYNWKRSGDIQIISKDGYLSAYSKKGTTHSVWNSYDSHIPLLFMGWGIKQGESNQPYHMTDIAPTVSSLLKIQFPSGAVGKPITEVIGZZZZ'

Sometimes PDB files do not conform to the standard and need to be cleaned before PyRosetta can load them. One way to make sure the file loads is to keep only the ATOM lines from the PDB file. Alternatively, you could use the `cleanATOM` function in `pyrosetta.toolbox` to achieve the same:

In [None]:
from pyrosetta.toolbox import cleanATOM
cleanATOM(pdb_file)

This function will create a cleaned `5tj3.clean.pdb` file for you. Let's load this into PyRosetta as well:

In [None]:
# The path is absolute, so os.path.join() would ignore the notebook directory anyway.
pdb_file_clean = "/content/google_drive/MyDrive/BIP_25-26/1-Intro_PyRosetta/5tj3.clean.pdb"
pose_clean = pose_from_pdb(pdb_file_clean)

core.import_pose.import_pose: File '/content/google_drive/MyDrive/BIP_25-26/1-Intro_PyRosetta/5tj3.clean.pdb' automatically determined to be of type PDB from contents.
core.pack.pack_missing_sidechains: packing residue number 232 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 349 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 352 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 353 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 381 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 453 because of missing atom number 6 atom name  CG
core.pack.task: Packer task: initialize from command line()
core.scoring.ScoreFunctionFactory: SCOREFUNCTION: ref2015
core.pack.pack_rotamers: built 90 rotamers at 6 positions.
c
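To make the "ATOM lines only" idea concrete, here is a minimal sketch in plain Python. It assumes that keeping the records that start with `ATOM` or `TER` is a fair approximation of what `cleanATOM` does; the function name `keep_atom_lines` and the tiny PDB fragment (with invented coordinates) are made up for illustration and are not part of PyRosetta.

```python
def keep_atom_lines(pdb_text):
    """Return only the ATOM and TER records from PDB-formatted text."""
    kept = [
        line for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "TER"))
    ]
    return "\n".join(kept) + "\n"


# Hypothetical PDB fragment: two protein atoms, one modified residue and one
# zinc ion stored as HETATM records. The coordinates are invented.
example = "\n".join([
    "HEADER    HYDROLASE",
    "ATOM      1  N   ASN A  25      11.104   6.134  -6.504  1.00 20.00           N",
    "ATOM      2  CA  ASN A  25      11.639   6.071  -5.147  1.00 20.00           C",
    "HETATM    3  P   TPO A  72      12.000   7.000  -4.000  1.00 20.00           P",
    "TER",
    "HETATM    4 ZN    ZN A 601      15.000   9.000  -1.000  1.00 20.00          ZN",
    "END",
])

cleaned = keep_atom_lines(example)
print(cleaned)
```

Anything stored in `HETATM` records, such as metal ions or modified residues, is absent from the cleaned text, which is one way a cleaned pose can end up shorter than the original.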

In our case, we could load in the PDB file for 5tj3 without cleaning it. In fact, we've lost some residues when cleaning the PDB file with cleanATOM. What is the difference in the `sequence` of the `pose_clean` now, compared to before?

In [None]:
# print out the sequence of pose_clean
pose_clean.sequence()

'NAVPRPKLVVGLVVDQMRWDYLYRYYSKYGEGGFKRMLNTGYSLNNVHIDYVPTVAIGHTSIFTGSVPSIHGIAGNDWYDKELGKSVYCTSDETVQPVGTTSNSVGQHSPRNLWSTTVTDQLGLATNFTSKVVGVSLKDRASILPAGHNPTGAFWFDDTTGKFITSTYYTKELPKWVNDFNNKNVPAQLVANGWNTLLPINQYTESSEDNVEWEGLLGSKKTPTFPYTDLAKDYEAKKGLIRTTPFGNTLTLQMADAAIDGNQMGVDDITDFLTVNLASTDYVGHNFGPNSIEVEDTYLRLDRDLADFFNNLDKKVGKGNYLVFLSADHGAAHSVGFMQAHKMPTGFFDMKKEMNAKLKQKFGADNIIAAAMNYQVYFDRKVLADSKLELDDVRDYVMTELKKEPSVLYVLSTDEIWESSIPEPIKSRVINGYNWKRSGDIQIISKDGYLSAYSKKGTTHSVWNSYDSHIPLLFMGWGIKQGESNQPYHMTDIAPTVSSLLKIQFPSGAVGKPITEVIG'

With the `annotated_sequence` function below, we can see in more detail what the differences are. Note that non-canonical amino acids and HETATM entries are now spelled out explicitly.
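Judging from the outputs in this notebook, an annotated sequence is a run of one-letter codes, some of which are followed by a bracketed residue type and variant, for example `T[THR:phosphorylated]` or `Z[ZN]`. The sketch below splits such a string into per-residue tokens so that the annotated positions are easy to list. The helper `split_annotated`, the regular expression, and the shortened example string are assumptions made for illustration; they are not part of the PyRosetta API.

```python
import re

# One residue: a one-letter code, optionally followed by "[...]".
TOKEN = re.compile(r"([A-Za-z])(?:\[([^\]]+)\])?")


def split_annotated(annotated):
    """Split an annotated sequence into (letter, annotation or None) pairs."""
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(annotated)]


# Shortened, hand-written string in the style of annotated_sequence() output.
example = "N[ASN:NtermProteinFull]AVT[THR:phosphorylated]G[GLY:CtermProteinFull]Z[ZN]"

tokens = split_annotated(example)
for position, (letter, annotation) in enumerate(tokens, start=1):
    if annotation is not None:
        print(position, letter, annotation)
```

Because each token corresponds to one residue, the position printed here is the residue's index in the string, counted from 1, rather than a character offset.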

**Exercise**

Visually inspect the sequences to find the difference(s) between the `pose_clean.annotated_sequence()` and `pose.annotated_sequence()`. Were residues removed? Which ones?

In [None]:

pose.annotated_sequence()

'N[ASN:NtermProteinFull]AVPRPKLVVGLVVDQMRWDYLYRYYSKYGEGGFKRMLNTGYSLNNVHIDYVPTVT[THR:phosphorylated]AIGHTSIFTGSVPSIHGIAGNDWYDKELGKSVYCTSDETVQPVGTTSNSVGQHSPRNLWSTTVTDQLGLATNFTSKVVGVSLKDRASILPAGHNPTGAFWFDDTTGKFITSTYYTKELPKWVNDFNNKNVPAQLVANGWNTLLPINQYTESSEDNVEWEGLLGSKKTPTFPYTDLAKDYEAKKGLIRTTPFGNTLTLQMADAAIDGNQMGVDDITDFLTVNLASTDYVGHNFGPNSIEVEDTYLRLDRDLADFFNNLDKKVGKGNYLVFLSADHGAAHSVGFMQAHKMPTGFFDMKKEMNAKLKQKFGADNIIAAAMNYQVYFDRKVLADSKLELDDVRDYVMTELKKEPSVLYVLSTDEIWESSIPEPIKSRVINGYNWKRSGDIQIISKDGYLSAYSKKGTTHSVWNSYDSHIPLLFMGWGIKQGESNQPYHMTDIAPTVSSLLKIQFPSGAVGKPITEVIG[GLY:CtermProteinFull]Z[ZN]Z[ZN]Z[ZN]Z[ZN]'

In [None]:
pose_clean.annotated_sequence()

'N[ASN:NtermProteinFull]AVPRPKLVVGLVVDQMRWDYLYRYYSKYGEGGFKRMLNTGYSLNNVHIDYVPTVAIGHTSIFTGSVPSIHGIAGNDWYDKELGKSVYCTSDETVQPVGTTSNSVGQHSPRNLWSTTVTDQLGLATNFTSKVVGVSLKDRASILPAGHNPTGAFWFDDTTGKFITSTYYTKELPKWVNDFNNKNVPAQLVANGWNTLLPINQYTESSEDNVEWEGLLGSKKTPTFPYTDLAKDYEAKKGLIRTTPFGNTLTLQMADAAIDGNQMGVDDITDFLTVNLASTDYVGHNFGPNSIEVEDTYLRLDRDLADFFNNLDKKVGKGNYLVFLSADHGAAHSVGFMQAHKMPTGFFDMKKEMNAKLKQKFGADNIIAAAMNYQVYFDRKVLADSKLELDDVRDYVMTELKKEPSVLYVLSTDEIWESSIPEPIKSRVINGYNWKRSGDIQIISKDGYLSAYSKKGTTHSVWNSYDSHIPLLFMGWGIKQGESNQPYHMTDIAPTVSSLLKIQFPSGAVGKPITEVIG[GLY:CtermProteinFull]'

**Advanced exercise**

Write a program to automatically find the differences between these two sequences.

In [None]:
# 1. Load the PDB file and build the original pose.
# 'pose_from_pdb' reads the file whose path is already stored in 'pdb_file'.
pose = pose_from_pdb(pdb_file)
print("Original pose loaded.")

# 2. Clean the PDB file with cleanATOM.
# 'cleanATOM' from the PyRosetta toolbox writes a cleaned PDB file;
# its path is already stored in 'pdb_file_clean'.
cleanATOM(pdb_file)
print(f"Clean PDB file created: {pdb_file_clean}")

# Load the clean pose from the new PDB file.
pose_clean = pose_from_pdb(pdb_file_clean)
print("Clean pose loaded.")

# 3. Get the sequence of each pose (original and clean).
# 'sequence()' returns the plain one-letter sequence, not the annotated one.
sequence_original = pose.sequence()
sequence_clean = pose_clean.sequence()

print("\nSequence of the original pose:")
print(sequence_original)

print("\nSequence of the clean pose:")
print(sequence_clean)

# 4. Compare the sequences and report the differences.
# The two sequences are compared character by character.
print("\nComparing sequences and showing differences:")

# Iterate only up to the end of the shorter sequence.
min_len = min(len(sequence_original), len(sequence_clean))

differences_found = False
for i in range(min_len):
    if sequence_original[i] != sequence_clean[i]:
        print(f"Difference at position {i} (0-based):")
        print(f"  Original: {sequence_original[i]}")
        print(f"  Clean:    {sequence_clean[i]}")
        differences_found = True

# If one sequence is longer than the other, report that as a difference too.
if len(sequence_original) > min_len:
    print("\nThe original sequence is longer. Extra characters:")
    print(sequence_original[min_len:])
    differences_found = True
elif len(sequence_clean) > min_len:
    print("\nThe clean sequence is longer. Extra characters:")
    print(sequence_clean[min_len:])
    differences_found = True

if not differences_found:
    print("No differences were found between the sequences.")

# Note: a character-by-character comparison is not ideal for complex differences
# such as the removal of whole residues, because every position after a deletion
# is shifted. It still shows where the two strings start to differ.
# The 'annotated_sequence()' outputs already indicate which residues were removed or modified.

core.import_pose.import_pose: File '/content/google_drive/MyDrive/BIP_25-26/1-Intro_PyRosetta/5tj3.pdb' automatically determined to be of type PDB from contents.
core.pack.pack_missing_sidechains: packing residue number 233 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 350 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 353 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 354 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 382 because of missing atom number 6 atom name  CG
core.pack.pack_missing_sidechains: packing residue number 454 because of missing atom number 6 atom name  CG
core.pack.task: Packer task: initialize from command line()
core.scoring.ScoreFunctionFactory: SCOREFUNCTION: ref2015
core.pack.pack_rotamers: built 90 rotamers at 6 positions.
core.pa
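The note at the end of the cell above points out that a position-by-position comparison copes poorly with deleted residues. One possible alternative, sketched here with Python's standard `difflib` module, is to align the two strings first and report only the blocks that differ. The helper name `sequence_differences` and the short stand-in strings are hypothetical; in the notebook you would pass `pose.sequence()` and `pose_clean.sequence()` instead.

```python
from difflib import SequenceMatcher


def sequence_differences(original, cleaned):
    """Return (operation, start, original_segment, cleaned_segment) tuples.

    'start' is the 1-based position in the original sequence where the
    differing block begins.
    """
    matcher = SequenceMatcher(None, original, cleaned, autojunk=False)
    return [
        (tag, i1 + 1, original[i1:i2], cleaned[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Short made-up strings standing in for the two real sequences.
original = "NAVPTVTAIGHEVIGZZZZ"
cleaned = "NAVPTVAIGHEVIG"

diffs = sequence_differences(original, cleaned)
for tag, start, seg_original, seg_cleaned in diffs:
    print(f"{tag} at position {start}: {seg_original!r} -> {seg_cleaned!r}")
```

Each deletion is reported once, with its position in the original sequence, instead of every downstream residue being flagged as a mismatch.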