Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buggy graph encoding #205

Closed
Vincentx15 opened this issue Aug 19, 2022 · 14 comments
Closed

Buggy graph encoding #205

Vincentx15 opened this issue Aug 19, 2022 · 14 comments
Labels
bug Something isn't working High Priority

Comments

@Vincentx15
Copy link

Hi folks, thanks for this package !

I'm experiencing bugs when creating graphs from pdb. First of all, if no node_metadata_functions are given (comment below), it defaults to using meiler embeddings which use a file that cannot be found in : graphein/protein/features/nodes/meiler_embeddings.csv. Second of all, some nodes are empty, which raises errors when trying to add edges.

I enclose a small script that raises the errors for a random PDB (please download the pdb data/4f5s.ent.gz).

import graphein.protein as gp
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

new_funcs = {
    "granularity": 'CA',
    "keep_hets": [False],
    "edge_construction_functions":
        [
         gp.add_disulfide_interactions,
         gp.add_ionic_interactions,
         ],
    "node_metadata_functions": [amino_acid_one_hot]
}
config = ProteinGraphConfig(**new_funcs)
_ = construct_graph(config=config, pdb_path=f"data/4f5s.ent.gz", chain_selection="A")

This should return the following error :

Traceback (most recent call last):
  File "/home/vmallet/projects/foldmine/toto.py", line 48, in <module>
    _ = construct_graph(config=config, pdb_path=f"data/4f5s.ent.gz", chain_selection="A")
  File "/home/vmallet/anaconda3/envs/foldmine/lib/python3.10/site-packages/graphein/protein/graphs.py", line 719, in construct_graph
    g = compute_edges(
  File "/home/vmallet/anaconda3/envs/foldmine/lib/python3.10/site-packages/graphein/protein/graphs.py", line 587, in compute_edges
    func(G)
  File "/home/vmallet/anaconda3/envs/foldmine/lib/python3.10/site-packages/graphein/protein/edges/distance.py", line 315, in add_ionic_interactions
    and G.nodes[r2]["residue_name"] in NEG_AA
KeyError: 'residue_name'

I'm using Ubuntu 20, python 3.10, graphein 1.5.1

Thanks in advance !

@a-r-j
Copy link
Owner

a-r-j commented Aug 25, 2022

Hi @Vincentx15 thanks for the bug report - I can reproduce it. I’ll check it out this weekend.

@a-r-j a-r-j added the bug Something isn't working label Aug 25, 2022
@Vincentx15
Copy link
Author

Hello guys, do you have any updates on this ? Or maybe an idea about how long this issue would take to be solved ?

Thanks a lot for your work !

@a-r-j
Copy link
Owner

a-r-j commented Sep 5, 2022

Hey @Vincentx15 - apologies. It’s a super busy time for me. This is very much on my to do list. I will try to get at it later this week if not early next. Thanks for your patience.

@Vincentx15
Copy link
Author

Ok no problem, I understand ! Thanks a lot for the support :)

@1511878618
Copy link
Contributor

1511878618 commented Sep 8, 2022

It looks like the meiler_embeddings.csv should have been removed from the package in some recent update, and the code still loads it from the original location.
I found the file in the repository a-r-j/graphein/master/graphein/protein/features/nodes/meiler_embeddings.csv. I downloaded it and put it in my own path in the package's directory:Anaconda/envs/pyg/lib/python3.9/site-packages/graphein/protein/features/nodes and then it looks like works.

This file is supposed to be used to embed amino acids, and currently it looks like is from https://link.springer.com/article/10.1007/s008940100038.
Perhaps there could be an api or somethings to allow users to customize this embedding methods instead of using the official default embedding

@1511878618
Copy link
Contributor

ok, i find how to modif this embedding func.

class graphein.protein.config.ProteinGraphConfig(*, granularity: typing.Union[typing.Literal['N', 'CA', 'C', 'O', 'CB', 'OG', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OD1', 'ND2', 'CG1', 'CG2', 'CD', 'CE', 'NZ', 'OD2', 'OE1', 'NE2', 'OE2', 'OH', 'NE', 'NH1', 'NH2', 'OG1', 'SD', 'ND1', 'SG', 'NE1', 'CE3', 'CZ2', 'CZ3', 'CH2', 'OXT'], typing.Literal['atom', 'centroids']] = 'CA', keep_hets: bool = False, insertions: bool = False, pdb_dir: pathlib.Path = PosixPath('../examples/pdbs'), verbose: bool = False, exclude_waters: bool = True, deprotonate: bool = False, protein_df_processing_functions: typing.List[typing.Callable] = None, edge_construction_functions: typing.List[typing.Union[typing.Callable, str]] = [<function add_peptide_bonds>], node_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = [<function meiler_embedding>], edge_metadata_functions: typing.List[typing.Union[typing.Callable, str]] = None, graph_metadata_functions: typing.List[typing.Callable] = None, get_contacts_config: graphein.protein.config.GetContactsConfig = None, dssp_config: graphein.protein.config.DSSPConfig = None)[[source]](https://graphein.ai/_modules/graphein/protein/config.html#ProteinGraphConfig)

by setting node_metadata_functions value as embedding function.

@a-r-j
Copy link
Owner

a-r-j commented Sep 8, 2022

Thanks @1511878618 that's super helpful. Looks like the embedding file wasn't packaged in the tarball on PyPi. I'll try to reup with the missing file.

@1511878618
Copy link
Contributor

Ok, and i see some comments on the parameters of function like below
image
And it seems your example code on doc about this function is not work anymore, I'm wondering whether the older version of graphein is stable? and which version could be more stable........hmmm
image
image
Even, i remove these parameters, it still not works. Also i see some errors on your doc and is it Is it because of these comments happens on function define

@a-r-j
Copy link
Owner

a-r-j commented Sep 8, 2022

Hi @1511878618 could you open a separate issue for this. The docs do need to be updated for the Dataset utils and those comments removed. In the example you shared, the problem is that the graph_label_map arg has been replaced with graph_labels (which accepts a list of tensors rather than a dictionary).

@1511878618
Copy link
Contributor

ok i'll do it

@a-r-j
Copy link
Owner

a-r-j commented Sep 14, 2022

@Vincentx15 I figured out the cause of the problem.

The add_disulfide_interactions function isn't checking for node membership in the graph. Your protein seems to contain several inter-chain disulfide interactions which are being found but the B chain CYS residues are being added to the graph. This causes mismatch in shape with the distance matrix and adjacency matrix downstream.

You can see this here:

import graphein.protein as gp
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes.amino_acid import amino_acid_one_hot

new_funcs = {
    "granularity": 'CA',
    "keep_hets": [False],
    "edge_construction_functions":
        [  
         gp.add_ionic_interactions,
         ],
    "node_metadata_functions": [amino_acid_one_hot]
}
config = ProteinGraphConfig(**new_funcs)

g = construct_graph(config=config, pdb_code="4f5s", chain_selection="A")
print(len(g.nodes)) #583 nodes

gp.add_disulfide_interactions(g)
print(len(g.nodes)) #617 nodes
print(g.nodes) # here we can see the CYS residues on the B chain have been added to the graph.

It's a pretty straightforward fix & I'll push an update ASAP :)

@a-r-j
Copy link
Owner

a-r-j commented Sep 14, 2022

And until I get around to the PR, here's a fixed disulfide function. I've tested in the example you used above and it works as expected :)

import networkx as nx
import pandas as pd
from typing import Optional

from graphein.protein.resi_atoms import DISULFIDE_ATOMS, DISULFIDE_RESIS
from graphein.protein.edges.distance import add_interacting_resis, get_interacting_atoms, compute_distmat, filter_dataframe

import loguru as log

def add_disulfide_interactions(
    G: nx.Graph, rgroup_df: Optional[pd.DataFrame] = None
):
    """
    Find all disulfide interactions between CYS residues
    (:const:`~graphein.protein.resi_atoms.DISULFIDE_RESIS`,
    :const:`~graphein.protein.resi_atoms.DISULFIDE_ATOMS`).
    Criteria: sulfur atom pairs are within 2.2A of each other.
    :param G: networkx protein graph
    :type G: nx.Graph
    :param rgroup_df: pd.DataFrame containing rgroup data, defaults to ``None``,
        which retrieves the df from the provided nx graph.
    :type rgroup_df: pd.DataFrame, optional
    """
    # Check for existence of at least two Cysteine residues
    residues = [d["residue_name"] for _, d in G.nodes(data=True)]
    if residues.count("CYS") < 2:
        log.debug(
            f"{residues.count('CYS')} CYS residues found. Cannot add disulfide interactions with fewer than two CYS residues."
        )
        return

    if rgroup_df is None:
        rgroup_df = G.graph["rgroup_df"]
    disulfide_df = filter_dataframe(
        rgroup_df, "residue_name", DISULFIDE_RESIS, True
    )
    disulfide_df = filter_dataframe(
        disulfide_df, "atom_name", DISULFIDE_ATOMS, True
    )
    disulfide_df = filter_dataframe(disulfide_df, "node_id", list(G.nodes), True) # this line is the fix

    distmat = compute_distmat(disulfide_df)
    interacting_atoms = get_interacting_atoms(2.2, distmat)
    add_interacting_resis(G, interacting_atoms, disulfide_df, ["disulfide"])

@Vincentx15
Copy link
Author

Hey guys, thanks a lot for the fix. Splitting in chains solved it for me !

Best,
Vincent

a-r-j added a commit that referenced this issue Sep 14, 2022
a-r-j added a commit that referenced this issue Sep 16, 2022
* fix disulphide bond node filtering #205

* Local dataset (#208) (#211)

* add load local pdb_files to ProteinGraphDataset

* load local pdb_files from a list

* test and black and isort and add CHANGELOG.md

* docstring formatting

* comment jupyter; just run  and ssh or vscode to interact with container

* install dependency of jupyter nbextension

* Fixed Dockerfile and unit tests

* Update chage log

Co-authored-by: xutingfeng <xutingfeng@big.ac.cn>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>
Co-authored-by: ryan <ryan.greenhalgh@hotmail.co.uk>

Co-authored-by: Nicktf <49584439+1511878618@users.noreply.github.com>
Co-authored-by: xutingfeng <xutingfeng@big.ac.cn>
Co-authored-by: ryan <ryan.greenhalgh@hotmail.co.uk>

* fix typo

Co-authored-by: Nicktf <49584439+1511878618@users.noreply.github.com>
Co-authored-by: xutingfeng <xutingfeng@big.ac.cn>
Co-authored-by: ryan <ryan.greenhalgh@hotmail.co.uk>
a-r-j added a commit that referenced this issue Sep 19, 2022
@a-r-j a-r-j mentioned this issue Sep 19, 2022
a-r-j added a commit that referenced this issue Sep 19, 2022
* update version strings

* include node datafiles in manifest #205

* update changelog for 1.5.2

* isort

* typo

* rm conda env activation to speedup CI
@a-r-j
Copy link
Owner

a-r-j commented Sep 19, 2022

This should be fully resolved in 1.5.2 (pip install graphein==1.5.2). Closing the issue - let me know if you run into any trouble :)

@a-r-j a-r-j closed this as completed Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working High Priority
Projects
None yet
Development

No branches or pull requests

3 participants