Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new helper functions to PDBManager #322

Merged
merged 14 commits into from
May 25, 2023
17 changes: 9 additions & 8 deletions graphein/ml/datasets/pdb_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -1778,7 +1778,7 @@ def merge_pdb_chain_groups(self, group: DataFrameGroupBy) -> pd.DataFrame:

def select_pdb_by_criterion(
self, pdb: PandasPdb, field: str, field_values: List[Any]
) -> PandasPdb:
) -> Optional[PandasPdb]:
"""Filter a PDB using a field selection.

:param pdb: The PDB object to filter by a field.
Expand All @@ -1789,18 +1789,18 @@ def select_pdb_by_criterion(
the PDB.
:type field_values: List[Any]

:return: The filtered PDB object.
:rtype: PandasPdb
:return: The filtered PDB object or instead `None` to signify
that no atoms within the PDB object were found after filtering.
:rtype: Optional[PandasPdb], optional
"""
for key in pdb.df:
if field in pdb.df[key]:
filtered_pdb = pdb.df[key][
pdb.df[key][field].isin(field_values)
]
if "ATOM" in key:
assert (
len(filtered_pdb) > 0
), "Filtered DataFrame must contain atoms."
if "ATOM" in key and len(filtered_pdb) == 0:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be being a little pedantic here but I see two possible edge cases with this:

  1. Sometimes protein atoms are stored as HETATMs (typically modified residues but this kind of bad practice does happen as an abuse of the PDB format to suit some niche way to store structure data)
  2. Similarly, what if the desired selection is actually the HETATM data? Protein-nucleic acid complexes or protein-peptide complexes may store the ligand as a HETATM

Copy link
Contributor Author

@amorehead amorehead May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. Points 1 and 2 didn't come to my mind when I originally implemented this, and I kept expanding on it until now. In light of these points, I think it makes more sense to avoid skipping such DataFrames altogether. It will then be the user's responsibility to "vet" the exported PDB files for these kinds of edge cases in their selected PDBs. Better yet, we can simply issue a warning to users that no "standard" atoms were found post-filtering. However, we would then still export the PDB complex as requested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified this logic to only issue a warning in such an edge case now.

log.warning("Filtered DataFrame does not contain any atoms. Skipping DataFrame...")
return None
pdb.df[key] = filtered_pdb
return pdb

Expand Down Expand Up @@ -1891,7 +1891,8 @@ def write_out_pdb_chain_groups(
pdb, "chain_id", chains
)
# export selected chains within the same PDB file
pdb_chains.to_pdb(str(output_pdb_filepath))
if pdb_chains:
pdb_chains.to_pdb(str(output_pdb_filepath))

def write_df_pdbs(
self,
Expand Down