We have modified some of our input files (e.g., mnt/diffuse-private/raw/sampleworks/initial_dataset_40_occ_sweeps/processed/4OLE/4OLE_single_001_density_input.cif) so that the two numberings, _atom_site.label_seq_id and _atom_site.auth_seq_id, are the same. Downstream models may or may not respect this numbering; Protenix, for instance, renumbers everything starting from chain A and residue 1. We should maintain the original labelings throughout, resetting to the numbering used by www.rcsb.org (or by a user's PDB-style mmCIF file, pre-deposition). That is, label_seq_id generally starts from 1, and auth_seq_id follows the numbering of the full biological protein.
In particular, this means we need to propagate additional fields through the pipeline, beyond those loaded automatically by atomworks.io.parse, atomworks.io.utils.io_utils.load_any, etc.
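One way to carry the original numbering through a model that renumbers its output is to keep the input residue IDs alongside the data and copy them back afterwards. The following is a minimal sketch, not the pipeline's actual code: the `Residue` class, the `restore_numbering` helper, and the residue numbers in the usage example are all hypothetical, and it assumes the model preserves residue order and count (which would need to be checked for each model).

```python
from dataclasses import dataclass, replace

# Hypothetical per-residue record; a real pipeline would carry these
# as per-atom annotations on whatever structure object it uses.
@dataclass(frozen=True)
class Residue:
    res_name: str
    label_asym_id: str
    label_seq_id: int
    auth_asym_id: str
    auth_seq_id: int
    occupancy: float = 1.0

ID_FIELDS = ("label_asym_id", "label_seq_id", "auth_asym_id", "auth_seq_id")

def restore_numbering(original, predicted):
    """Copy the original label/auth IDs onto model output.

    Assumption: the model keeps residues in the same order and does not
    add or drop any, so residues can be matched by position.
    """
    if len(original) != len(predicted):
        raise ValueError("residue count changed; cannot match by position")
    restored = []
    for ref, out in zip(original, predicted):
        if ref.res_name != out.res_name:
            raise ValueError(
                f"residue mismatch: {ref.res_name} vs {out.res_name}"
            )
        ids = {name: getattr(ref, name) for name in ID_FIELDS}
        # Keep everything the model produced except the ID fields.
        restored.append(replace(out, **ids))
    return restored

# Usage with made-up numbers: the model output restarts auth_seq_id at 1.
original = [
    Residue("MET", "A", 1, "A", 335),
    Residue("GLY", "A", 2, "A", 336),
]
predicted = [
    Residue("MET", "A", 1, "A", 1, occupancy=0.4),
    Residue("GLY", "A", 2, "A", 2, occupancy=0.6),
]
restored = restore_numbering(original, predicted)
```

Matching by position is the simplest option; if a model can drop or reorder residues, a sequence alignment between input and output would be needed instead.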
As a corollary, we should explicitly define which labeling (label or auth) the selection strings used in evaluation refer to.
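One option is to make the numbering scheme part of the selection itself, so that a selection can never be interpreted under the wrong labeling. The sketch below uses a hypothetical `scheme:chain:start-end` syntax and hypothetical `parse_selection` / `select` helpers; it is not the selection format currently used in evaluation, and residues are represented as plain dicts for illustration only.

```python
import re
from typing import NamedTuple

class Selection(NamedTuple):
    scheme: str  # "label" or "auth"
    chain: str
    start: int
    end: int

# Hypothetical syntax, e.g. "auth:A:335-340". Negative residue numbers
# are allowed because auth numbering can include them.
_SELECTION_RE = re.compile(r"^(label|auth):([^:]+):(-?\d+)-(-?\d+)$")

def parse_selection(text):
    """Parse a selection string that names its numbering scheme."""
    match = _SELECTION_RE.match(text)
    if match is None:
        raise ValueError(f"selection must be 'label|auth:chain:start-end': {text!r}")
    scheme, chain, start, end = match.groups()
    return Selection(scheme, chain, int(start), int(end))

def select(residues, sel):
    """Return residues in the inclusive range, using the chosen scheme."""
    chain_key = f"{sel.scheme}_asym_id"
    seq_key = f"{sel.scheme}_seq_id"
    return [
        r for r in residues
        if r[chain_key] == sel.chain and sel.start <= r[seq_key] <= sel.end
    ]

# Usage with made-up numbers: the same range means different things
# depending on the scheme.
residues = [
    {"label_asym_id": "A", "label_seq_id": 1, "auth_asym_id": "A", "auth_seq_id": 335},
    {"label_asym_id": "A", "label_seq_id": 2, "auth_asym_id": "A", "auth_seq_id": 336},
]
by_auth = select(residues, parse_selection("auth:A:335-336"))
by_label = select(residues, parse_selection("label:A:335-336"))
```

Rejecting strings without a scheme prefix forces every evaluation config to state which numbering it means, rather than relying on a default.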