## Overview
This is the README from ProteinMPNN describing the training dataset:
[train_dataset](https://github.com/dauparas/ProteinMPNN/blob/main/training/README.md)

Training set for ProteinMPNN curated by Ivan Anishchanko.

Each PDB entry is represented as a collection of .pt files:
    PDBID_CHAINID.pt - contains CHAINID chain from PDBID
    PDBID.pt         - metadata and information on biological assemblies

PDBID_CHAINID.pt has the following fields:
    seq  - amino acid sequence (string)
    xyz  - atomic coordinates [L,14,3]
    mask - boolean mask [L,14]
    bfac - temperature factors [L,14]
    occ  - occupancy [L,14] (is 1 for most atoms, <1 if alternative conformations are present)

PDBID.pt:
    method        - experimental method (str)
    date          - deposition date (str)
    resolution    - resolution (float)
    chains        - list of CHAINIDs (there is a corresponding PDBID_CHAINID.pt file for each of these)
    tm            - pairwise similarity between chains (TM-score,seq.id.,rmsd from TM-align) [num_chains,num_chains,3]
    asmb_ids      - biounit IDs as in the PDB (list of str)
    asmb_details  - how the assembly was identified: author, software, or something else (list of str)
    asmb_method   - PISA or something else (list of str)

    asmb_chains    - list of chains which each biounit is composed of (list of str, each str contains comma separated CHAINIDs)
    asmb_xformIDX  - (one per biounit) xforms to be applied to chains from asmb_chains[IDX], [n,4,4]
                     [n,:3,:3] - rotation matrices
                     [n,:3,3] - translation vectors

list.csv:
   CHAINID    - chain label, PDBID_CHAINID
   DEPOSITION - deposition date
   RESOLUTION - structure resolution
   HASH       - unique 6-digit hash for the sequence
   CLUSTER    - sequence cluster the chain belongs to (clusters were generated at seqID=30%)
   SEQUENCE   - reference amino acid sequence

valid_clusters.txt - clusters used for validation

test_clusters.txt - clusters used for testing
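
The cluster files suggest a simple split rule: a chain goes to validation or test when its CLUSTER value is listed in `valid_clusters.txt` or `test_clusters.txt`, and to training otherwise. The block below is a minimal sketch of that rule, not the ProteinMPNN loader. It assumes `list.csv` has a header row with the column names listed above and that the cluster files hold one cluster ID per line. The function name `split_by_cluster` and all rows in the example are made-up placeholders.

```python
import csv
import io

def split_by_cluster(list_csv, valid_txt, test_txt):
    """Group chain IDs into train/valid/test according to their CLUSTER column."""
    valid = {line.strip() for line in valid_txt.splitlines() if line.strip()}
    test = {line.strip() for line in test_txt.splitlines() if line.strip()}
    splits = {"train": [], "valid": [], "test": []}
    for row in csv.DictReader(io.StringIO(list_csv)):
        cluster = row["CLUSTER"].strip()
        if cluster in valid:
            splits["valid"].append(row["CHAINID"])
        elif cluster in test:
            splits["test"].append(row["CHAINID"])
        else:
            splits["train"].append(row["CHAINID"])
    return splits

# Placeholder rows (not real dataset entries), only to exercise the function.
example_csv = """CHAINID,DEPOSITION,RESOLUTION,HASH,CLUSTER,SEQUENCE
xxxx_A,2000-01-01,2.0,000001,10,ACDE
yyyy_B,2001-01-01,1.5,000002,20,FGHI
zzzz_C,2002-01-01,3.0,000003,30,KLMN
"""
splits = split_by_cluster(example_csv, valid_txt="20\n", test_txt="30\n")
print(splits)
```

Splitting by cluster rather than by chain keeps chains from the same sequence cluster out of different splits.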

## Subjects of this notebook

1. Download data
2. Visualize data
3. Build a module function to regenerate the training datasets

In [1]:
!wget https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02_sample.tar.gz
!tar xvf "pdb_2021aug02_sample.tar.gz"

--2024-12-14 22:14:31--  https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02_sample.tar.gz
Connecting to 128.59.114.167:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 49690915 (47M) [application/octet-stream]
Saving to: ‘pdb_2021aug02_sample.tar.gz’


2024-12-14 22:14:34 (18.1 MB/s) - ‘pdb_2021aug02_sample.tar.gz’ saved [49690915/49690915]

./pdb_2021aug02_sample/
./pdb_2021aug02_sample/README
./pdb_2021aug02_sample/list.csv
./pdb_2021aug02_sample/pdb/
./pdb_2021aug02_sample/pdb/l3/
./pdb_2021aug02_sample/pdb/l3/5l3p.pt
./pdb_2021aug02_sample/pdb/l3/5l3g_A.pt
./pdb_2021aug02_sample/pdb/l3/5l3f.pt
./pdb_2021aug02_sample/pdb/l3/5l3r_B.pt
./pdb_2021aug02_sample/pdb/l3/4l3o_G.pt
./pdb_2021aug02_sample/pdb/l3/1l3b_E.pt
./pdb_2021aug02_sample/pdb/l3/3l3t_C.pt
./pdb_2021aug02_sample/pdb/l3/6l3y_A.pt
./pdb_2021aug02_sample/pdb/l3/5l3p_DB.pt
./pdb_2021aug02_sample/pdb/l3/3l36_B.pt
./pdb_2021aug02_sample/pdb/l3/2l3y_A.pt
./pdb_2021aug02_sample/pdb/l3/2l36.pt
./pd
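
Judging from the listing above, each file sits in a subdirectory named after the two middle characters of its PDB ID (for example `l3` for `5l3p`). The helper below is a small sketch built on that observation. The name `pt_path` is hypothetical, and the layout rule is inferred from this sample only.

```python
from pathlib import PurePosixPath

def pt_path(root, pdbid, chain=None):
    """Build the expected path of PDBID.pt or PDBID_CHAINID.pt under root."""
    pdbid = pdbid.lower()
    name = f"{pdbid}.pt" if chain is None else f"{pdbid}_{chain}.pt"
    # Inferred layout: <root>/pdb/<characters 2-3 of the PDB ID>/<file>
    return PurePosixPath(root) / "pdb" / pdbid[1:3] / name

print(pt_path("pdb_2021aug02_sample", "5l3p"))
print(pt_path("pdb_2021aug02_sample", "5L3G", "A"))
```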

In [4]:
# Data visualization: load the per-entry metadata file and print every field
import torch
data=torch.load('./pdb_2021aug02_sample/pdb/l3/1l30.pt')
for key, value in data.items():
    print(f"{key}: {value}")


method: X-RAY_DIFFRACTION
date: 1989-05-01
resolution: 1.7
chains: ['A']
seq: [['MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKLVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKLVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']]
id: 1L30
asmb_chains: ['A,B']
asmb_details: ['author_defined_assembly']
asmb_method: ['?']
asmb_ids: ['1']
asmb_xform0: tensor([[[1., 0., 0., 0.],
         [0., 1., 0., 0.],
         [0., 0., 1., 0.],
         [0., 0., 0., 1.]]])
tm: tensor([[[1., 1., 0.]]])
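
The `asmb_xform0` printed above is a single identity matrix, so applying it leaves the chain where it is. For other entries, a transform can be applied as a rotation followed by a translation. The block below is a NumPy sketch on synthetic coordinates, not the ProteinMPNN implementation. The name `apply_xforms` is hypothetical, and it assumes the usual homogeneous-matrix convention with the rotation in `[:3,:3]` and the translation in the last column `[:3,3]`.

```python
import numpy as np

def apply_xforms(xyz, xforms):
    """Apply n homogeneous transforms [n,4,4] to coordinates [L,14,3].

    Returns an array of shape [n,L,14,3]; NaN (missing) atoms stay NaN.
    """
    rot = xforms[:, :3, :3]    # [n,3,3] rotation matrices
    trans = xforms[:, :3, 3]   # [n,3] translation vectors (assumed last column)
    moved = np.einsum("nij,laj->nlai", rot, xyz)
    return moved + trans[:, None, None, :]

# Synthetic example: identity, plus a 90-degree rotation about z with a shift in x.
xyz = np.zeros((2, 14, 3))
xyz[:, 0] = [1.0, 0.0, 0.0]
identity = np.eye(4)
rot_z = np.array([[0., -1., 0., 5.],
                  [1.,  0., 0., 0.],
                  [0.,  0., 1., 0.],
                  [0.,  0., 0., 1.]])
out = apply_xforms(xyz, np.stack([identity, rot_z]))
print(out.shape)
print(out[1, 0, 0])
```

Tensors loaded from the `.pt` files would first need converting to arrays, for example with `.numpy()`, before being passed to this sketch.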


In [15]:
# Besides the dictionary above, which holds overall information about the entry,
# the details of each individual chain are saved in a separate dictionary.
data_A=torch.load('./pdb_2021aug02_sample/pdb/l3/1l30_A.pt')
for key, value in data_A.items():
    if hasattr(value, 'shape'):  # Tensors, NumPy arrays or similar objects: print the shape
        print(f"{key}: {value.shape}")
    elif isinstance(value, (list, tuple)):  # For lists or tuples: print the length as a 1-tuple
        print(f"{key}: ({len(value)},)")
    elif isinstance(value, str):  # For strings: print the number of characters
        print(f"{key}: {len(value)}")

seq: 164
xyz: torch.Size([164, 14, 3])
mask: torch.Size([164, 14])
bfac: torch.Size([164, 14])
occ: torch.Size([164, 14])


## Data comments

164: length of the protein chain (number of residues) <br>
xyz: each atom has [x, y, z] coordinates, hence the last dimension of size 3 <br>
The following cells show what the dimension of size 14 contains.

In [16]:
# This is how amino acid (AA) structures are hardcoded.
# The mapping from each atom in a residue to its index comes from these tables.
RES_NAMES = [
    'ALA','ARG','ASN','ASP','CYS',
    'GLN','GLU','GLY','HIS','ILE',
    'LEU','LYS','MET','PHE','PRO',
    'SER','THR','TRP','TYR','VAL'
]

RES_NAMES_1 = 'ARNDCQEGHILKMFPSTWYV'

to1letter = {aaa:a for a,aaa in zip(RES_NAMES_1,RES_NAMES)}
to3letter = {a:aaa for a,aaa in zip(RES_NAMES_1,RES_NAMES)}

# each AA is represented only by its backbone and side-chain heavy atoms
ATOM_NAMES = [
    ("N", "CA", "C", "O", "CB"), # ala
    ("N", "CA", "C", "O", "CB", "CG", "CD", "NE", "CZ", "NH1", "NH2"), # arg
    ("N", "CA", "C", "O", "CB", "CG", "OD1", "ND2"), # asn
    ("N", "CA", "C", "O", "CB", "CG", "OD1", "OD2"), # asp
    ("N", "CA", "C", "O", "CB", "SG"), # cys
    ("N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "NE2"), # gln
    ("N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "OE2"), # glu
    ("N", "CA", "C", "O"), # gly
    ("N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2"), # his
    ("N", "CA", "C", "O", "CB", "CG1", "CG2", "CD1"), # ile
    ("N", "CA", "C", "O", "CB", "CG", "CD1", "CD2"), # leu
    ("N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"), # lys
    ("N", "CA", "C", "O", "CB", "CG", "SD", "CE"), # met
    ("N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"), # phe
    ("N", "CA", "C", "O", "CB", "CG", "CD"), # pro
    ("N", "CA", "C", "O", "CB", "OG"), # ser
    ("N", "CA", "C", "O", "CB", "OG1", "CG2"), # thr
    ("N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE2", "CE3", "NE1", "CZ2", "CZ3", "CH2"), # trp
    ("N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "OH"), # tyr
    ("N", "CA", "C", "O", "CB", "CG1", "CG2") # val
]

# the longest one is Trp
print(len(("N", "CA", "C", "O", "CB", "CG", "CD1", "CD2", "CE2", "CE3", "NE1", "CZ2", "CZ3", "CH2")))

14
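
Since every residue is padded to 14 slots, only the first few slots of most residues correspond to real atoms. The sketch below derives which slots a residue type defines. The per-residue counts are the lengths of the tuples in `ATOM_NAMES` above, and `N_HEAVY` and `expected_slots` are hypothetical names. This is not necessarily how the dataset's `mask` field was computed, since that field may also reflect atoms missing from the deposited structure.

```python
# Heavy-atom counts per residue, in the same order as RES_NAMES_1
# (obtained by taking len() of each tuple in ATOM_NAMES).
RES_NAMES_1 = "ARNDCQEGHILKMFPSTWYV"
N_HEAVY = dict(zip(
    RES_NAMES_1,
    [5, 11, 8, 8, 6, 9, 9, 4, 10, 8, 8, 9, 8, 11, 7, 6, 7, 14, 12, 7],
))
MAX_ATOMS = max(N_HEAVY.values())  # slot count, set by the largest residue (Trp)

def expected_slots(seq):
    """Return an [L][MAX_ATOMS] list of booleans: True where the residue type defines an atom."""
    return [[j < N_HEAVY.get(aa, 0) for j in range(MAX_ATOMS)] for aa in seq]

slots = expected_slots("MGW")
for aa, row in zip("MGW", slots):
    print(aa, sum(row))
```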


In [18]:
# This is how atoms in residues are indexed

idx2ra = {(RES_NAMES_1[i],j):(RES_NAMES[i],a) for i in range(20) for j,a in enumerate(ATOM_NAMES[i])}
for key, value in idx2ra.items():
    print(f"{key}: {value}")

('A', 0): ('ALA', 'N')
('A', 1): ('ALA', 'CA')
('A', 2): ('ALA', 'C')
('A', 3): ('ALA', 'O')
('A', 4): ('ALA', 'CB')
('R', 0): ('ARG', 'N')
('R', 1): ('ARG', 'CA')
('R', 2): ('ARG', 'C')
('R', 3): ('ARG', 'O')
('R', 4): ('ARG', 'CB')
('R', 5): ('ARG', 'CG')
('R', 6): ('ARG', 'CD')
('R', 7): ('ARG', 'NE')
('R', 8): ('ARG', 'CZ')
('R', 9): ('ARG', 'NH1')
('R', 10): ('ARG', 'NH2')
('N', 0): ('ASN', 'N')
('N', 1): ('ASN', 'CA')
('N', 2): ('ASN', 'C')
('N', 3): ('ASN', 'O')
('N', 4): ('ASN', 'CB')
('N', 5): ('ASN', 'CG')
('N', 6): ('ASN', 'OD1')
('N', 7): ('ASN', 'ND2')
('D', 0): ('ASP', 'N')
('D', 1): ('ASP', 'CA')
('D', 2): ('ASP', 'C')
('D', 3): ('ASP', 'O')
('D', 4): ('ASP', 'CB')
('D', 5): ('ASP', 'CG')
('D', 6): ('ASP', 'OD1')
('D', 7): ('ASP', 'OD2')
('C', 0): ('CYS', 'N')
('C', 1): ('CYS', 'CA')
('C', 2): ('CYS', 'C')
('C', 3): ('CYS', 'O')
('C', 4): ('CYS', 'CB')
('C', 5): ('CYS', 'SG')
('Q', 0): ('GLN', 'N')
('Q', 1): ('GLN', 'CA')
('Q', 2): ('GLN', 'C')
('Q', 3): ('GLN', 'O')
('Q

In [19]:
aa2idx = {(r,a):i for r,atoms in zip(RES_NAMES,ATOM_NAMES)
          for i,a in enumerate(atoms)}
for key, value in aa2idx.items():
    print(f"{key}: {value}")


('ALA', 'N'): 0
('ALA', 'CA'): 1
('ALA', 'C'): 2
('ALA', 'O'): 3
('ALA', 'CB'): 4
('ARG', 'N'): 0
('ARG', 'CA'): 1
('ARG', 'C'): 2
('ARG', 'O'): 3
('ARG', 'CB'): 4
('ARG', 'CG'): 5
('ARG', 'CD'): 6
('ARG', 'NE'): 7
('ARG', 'CZ'): 8
('ARG', 'NH1'): 9
('ARG', 'NH2'): 10
('ASN', 'N'): 0
('ASN', 'CA'): 1
('ASN', 'C'): 2
('ASN', 'O'): 3
('ASN', 'CB'): 4
('ASN', 'CG'): 5
('ASN', 'OD1'): 6
('ASN', 'ND2'): 7
('ASP', 'N'): 0
('ASP', 'CA'): 1
('ASP', 'C'): 2
('ASP', 'O'): 3
('ASP', 'CB'): 4
('ASP', 'CG'): 5
('ASP', 'OD1'): 6
('ASP', 'OD2'): 7
('CYS', 'N'): 0
('CYS', 'CA'): 1
('CYS', 'C'): 2
('CYS', 'O'): 3
('CYS', 'CB'): 4
('CYS', 'SG'): 5
('GLN', 'N'): 0
('GLN', 'CA'): 1
('GLN', 'C'): 2
('GLN', 'O'): 3
('GLN', 'CB'): 4
('GLN', 'CG'): 5
('GLN', 'CD'): 6
('GLN', 'OE1'): 7
('GLN', 'NE2'): 8
('GLU', 'N'): 0
('GLU', 'CA'): 1
('GLU', 'C'): 2
('GLU', 'O'): 3
('GLU', 'CB'): 4
('GLU', 'CG'): 5
('GLU', 'CD'): 6
('GLU', 'OE1'): 7
('GLU', 'OE2'): 8
('GLY', 'N'): 0
('GLY', 'CA'): 1
('GLY', 'C'): 2
('GLY', '

In [None]:
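# First residue of chain A and the coordinates of its 14 atom slots
print(data_A['seq'][0])
data_A['xyz'][0,:]
# Atom order for Met: ("N", "CA", "C", "O", "CB", "CG", "SD", "CE"); the remaining 6 slots are NaN padding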

M


tensor([[ 36.5060, -24.8400,   8.9400],
        [ 36.8400, -23.4580,   9.0180],
        [ 35.5980, -22.7920,   9.5260],
        [ 34.5200, -23.2510,   9.2150],
        [ 37.1590, -22.9150,   7.6350],
        [ 37.4560, -21.4620,   7.8160],
        [ 39.1430, -21.0160,   7.4100],
        [ 40.0850, -22.4920,   7.8700],
        [     nan,      nan,      nan],
        [     nan,      nan,      nan],
        [     nan,      nan,      nan],
        [     nan,      nan,      nan],
        [     nan,      nan,      nan],
        [     nan,      nan,      nan]])
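
The output above can be read against the Met entry of `ATOM_NAMES`: the first eight rows are N, CA, C, O, CB, CG, SD and CE, and the NaN rows are padding. The sketch below pairs names with coordinates and drops the padding. The coordinates are copied from the output above, and `named_atoms` is a hypothetical helper written in NumPy for illustration.

```python
import numpy as np

MET_ATOMS = ("N", "CA", "C", "O", "CB", "CG", "SD", "CE")  # Met entry of ATOM_NAMES

def named_atoms(atom_names, coords):
    """Map atom names to xyz rows, skipping slots whose coordinates are NaN."""
    out = {}
    for name, row in zip(atom_names, coords):
        if not np.isnan(row).any():
            out[name] = row
    return out

# First residue of 1l30_A, copied from the printed tensor; padding rows stay NaN.
coords = np.full((14, 3), np.nan)
coords[:8] = [
    [36.5060, -24.8400, 8.9400],
    [36.8400, -23.4580, 9.0180],
    [35.5980, -22.7920, 9.5260],
    [34.5200, -23.2510, 9.2150],
    [37.1590, -22.9150, 7.6350],
    [37.4560, -21.4620, 7.8160],
    [39.1430, -21.0160, 7.4100],
    [40.0850, -22.4920, 7.8700],
]
atoms = named_atoms(MET_ATOMS, coords)
print(list(atoms))
print(np.linalg.norm(atoms["CA"] - atoms["N"]))  # N-CA bond length in angstroms
```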

In [3]:
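# Example: attempt to read the assembly transformation matrices
# (this attempt fails, see the KeyError below)

from Bio.PDB import MMCIFParser

# Load the structure
# Note: MMCIFParser expects an mmCIF file, but the path below points to a .pdb file,
# which is the likely cause of the KeyError.
parser = MMCIFParser()
structure = parser.get_structure("example", "/home/yunyao/em_angelo_test/EMD-8650/5VA1.pdb")

# Access assembly transformations
# Note: the 'biomoltrans' header key is an assumption here and may not be provided by Biopython.
assemblies = structure.header['biomoltrans']
for asmb_id, asmb in assemblies.items():
    for xform in asmb['transformations']:
        rotation = xform[:3, :3]  # Extract rotation matrix
        translation = xform[:3, 3]  # Extract translation vector
        print(f"Rotation:\n{rotation}\nTranslation: {translation}")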


KeyError: '_atom_site.id'
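
One possible workaround for PDB-format files is to read the `REMARK 350` `BIOMT` records directly, since those lines carry the rotation and translation of each assembly operator. The block below is a hedged sketch of that idea, run on a small synthetic header rather than on 5VA1. The function name `parse_biomt` is hypothetical, it is not a Biopython API, and it splits fields on whitespace, which assumes the numeric columns are not fused together.

```python
import numpy as np

def parse_biomt(pdb_text):
    """Collect REMARK 350 BIOMT rows into {biomolecule_id: array [n,4,4]}."""
    rows = {}        # (biomolecule, operator serial) -> {row index: 4 floats}
    biomol = "?"     # used if no BIOMOLECULE line precedes the BIOMT records
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        if body.startswith("BIOMOLECULE:"):
            biomol = body.split(":", 1)[1].strip()
        elif body.startswith("BIOMT"):
            tok = body.split()
            row = int(tok[0][5]) - 1           # BIOMT1/2/3 -> row 0/1/2
            serial = int(tok[1])               # operator number
            rows.setdefault((biomol, serial), {})[row] = [float(v) for v in tok[2:6]]
    grouped = {}
    for (bm, serial), r in sorted(rows.items()):
        m = np.eye(4)
        for i in range(3):
            m[i] = r[i]                        # 3 rotation entries + 1 translation entry
        grouped.setdefault(bm, []).append(m)
    return {bm: np.stack(ms) for bm, ms in grouped.items()}

# Synthetic header with two operators (identity and a 180-degree turn about z).
example = """\
REMARK 350 BIOMOLECULE: 1
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 350   BIOMT1   2 -1.000000  0.000000  0.000000       10.00000
REMARK 350   BIOMT2   2  0.000000 -1.000000  0.000000        0.00000
REMARK 350   BIOMT3   2  0.000000  0.000000  1.000000        0.00000
"""
xf = parse_biomt(example)
for xform in xf["1"]:
    rotation = xform[:3, :3]     # rotation matrix
    translation = xform[:3, 3]   # translation vector
    print(f"Rotation:\n{rotation}\nTranslation: {translation}")
```

The resulting arrays have the same `[n,4,4]` layout as the `asmb_xformIDX` fields described in the overview.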