CIF file outputs should maintain the original file(s)' label_seq_id, auth_seq_id numbers and chain letterings #214

@marcuscollins

Description

We have modified some of our input files (e.g., mnt/diffuse-private/raw/sampleworks/initial_dataset_40_occ_sweeps/processed/4OLE/4OLE_single_001_density_input.cif) so that the two numberings, _atom_site.label_seq_id and _atom_site.auth_seq_id, are the same. This numbering may or may not be respected by downstream models; Protenix, for instance, renumbers everything starting from chain A and residue 1. We should maintain the original labelings throughout, resetting to the original used by www.rcsb.org (or by a user's PDB-style mmCIF file pre-deposition). That is, label_seq_id generally starts from 1, and auth_seq_id is the numbering of the full biological protein.

In particular, this means we need to propagate additional fields through the pipeline, rather than just those loaded automatically by atomworks.io.parse, atomworks.io.utils.io_utils.load_any, etc.
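One possible way to carry the original identifiers through a model that renumbers its output is sketched below. It is a minimal illustration, not the pipeline's actual code: `ResidueId`, `build_id_map`, and `restore_ids` are hypothetical names, the residue values are toy data, and it assumes the model preserves residue order and count so that input and output residues can be paired positionally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueId:
    """Hypothetical container for the four identifiers we want to preserve."""
    label_asym_id: str
    label_seq_id: int
    auth_asym_id: str
    auth_seq_id: int


def build_id_map(original, renumbered):
    """Map each renumbered (chain, seq) key back to the original ResidueId.

    Assumption: the model keeps residue order and count, so pairing by
    position is valid. If that does not hold, this raises or mis-pairs.
    """
    if len(original) != len(renumbered):
        raise ValueError("residue counts differ; cannot pair by order")
    return {
        (new.label_asym_id, new.label_seq_id): old
        for old, new in zip(original, renumbered)
    }


def restore_ids(renumbered, id_map):
    """Look up the original identifiers for every renumbered residue."""
    return [id_map[(r.label_asym_id, r.label_seq_id)] for r in renumbered]


# Toy data: the input is chain B with auth numbering starting at 27;
# the model output has been reset to chain A, residue 1.
original = [ResidueId("B", i + 1, "B", 27 + i) for i in range(3)]
model_output = [ResidueId("A", i + 1, "A", i + 1) for i in range(3)]

id_map = build_id_map(original, model_output)
restored = restore_ids(model_output, id_map)
```

Storing the map next to the model output (rather than rewriting IDs in place before inference) would let the CIF writer reset both numberings at the very end, whatever the model did in between.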

As a corollary, we should make sure that we define which labeling we are using for selection strings used in evaluation.
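To illustrate why the choice matters, here is a small sketch in which the numbering scheme is an explicit argument of the selection. The function `select_residues` and the dict-based residue records are hypothetical and only stand in for whatever selection mechanism the evaluation code uses; the residue values are toy data.

```python
def select_residues(residues, chain, start, end, scheme):
    """Select residues in [start, end] on `chain` under an explicit scheme.

    `scheme` must be "label" or "auth"; there is deliberately no default,
    so every call site has to state which numbering it means.
    """
    if scheme not in ("label", "auth"):
        raise ValueError(f"unknown numbering scheme: {scheme!r}")
    chain_key = f"{scheme}_asym_id"
    seq_key = f"{scheme}_seq_id"
    return [
        r for r in residues
        if r[chain_key] == chain and start <= r[seq_key] <= end
    ]


# Toy chain: label_seq_id runs 1..5 while auth_seq_id runs 27..31.
residues = [
    {
        "label_asym_id": "A",
        "label_seq_id": i + 1,
        "auth_asym_id": "A",
        "auth_seq_id": 27 + i,
    }
    for i in range(5)
]

by_auth = select_residues(residues, "A", 27, 29, scheme="auth")
by_label = select_residues(residues, "A", 27, 29, scheme="label")
```

Here the same range "27 to 29" picks three residues under auth numbering and none under label numbering, which is the kind of silent mismatch an explicit scheme is meant to prevent.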

Metadata

Assignees

No one assigned

    Labels

    CIF issues: All issues related to the writing, reading, or parsing of CIF files or objects.
    engineering: Task that is best suited to software engineers, not research scientists
    enhancement: New feature or request
