
Fix and improve the utilities for PyG #234

Merged: 58 commits merged into a-r-j:master on Feb 10, 2023
Conversation

@anton-bushuiev (Contributor) commented on Nov 10, 2022

Reference Issues/PRs

No Reference Issues/PRs

What does this implement/fix? Explain your changes

This PR fixes bugs related to the processing of PyG data.

graphein.ml.conversion.convert_nx_to_pyg
  1. Fix the coordinates bug. Currently, the function creates a data.coords tensor with an extra dimension because coords is also a “graph-level feature”.
  2. Improve the conversion so that everything that can be converted to torch.Tensors is. This makes further usage much easier and the resulting data object much more PyG-like.
  3. Extend the conversion to produce multiple edge_indices for multiple edge types (see the sketch after this list).
graphein.ml.visualisation.plot_pyg_data
  1. Fix the arguments passed to the nested plotly_protein_structure_graph call. Currently, the positional value for node_size_feature is missing and the order of the following arguments is broken.
  2. Adapt the coords processing to change number 2 in convert_nx_to_pyg.
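
For concreteness, a rough sketch of the shape of the converted object after these changes (the edge-kind names and tensor values below are hypothetical, and this is not Graphein's actual conversion code):

import torch
from torch_geometric.data import Data

# Node coordinates as a [num_nodes, 3] float tensor, without the extra
# leading dimension that the old conversion produced.
num_nodes = 4
coords = torch.rand(num_nodes, 3)

# One edge index per edge kind (the kind names here are made up for illustration).
edge_index_peptide_bond = torch.tensor([[0, 1, 2],
                                        [1, 2, 3]])
edge_index_hbond = torch.tensor([[0],
                                 [3]])

data = Data(
    coords=coords,
    edge_index_peptide_bond=edge_index_peptide_bond,
    edge_index_hbond=edge_index_hbond,
    num_nodes=num_nodes,
)
# All features are torch.Tensors, so the object batches like any other PyG Data.
print(data)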

What testing did you do to verify the changes in this PR?

Pull Request Checklist

  • Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
  • Ran python -m py.test tests/ and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
  • Checked for style issues by running black . and isort .

@anton-bushuiev anton-bushuiev changed the base branch from master to anon November 10, 2022 20:37
@anton-bushuiev anton-bushuiev changed the base branch from anon to master November 10, 2022 20:37
@a-r-j (Owner) commented on Jan 28, 2023

I've made some changes to the CI/CD in #244, which I expect will resolve the test failures here.

@a-r-j (Owner) commented on Jan 30, 2023

It looks like we're running into an issue when it comes to collating the distance matrices into a batch.

I suppose we have three options:

  1. Flatten the distance matrices (e.g. n x n -> n^2 x 1).
  2. Add an additional dimension: 1 x n x n, which then becomes 1 x max(n) x max(n), where max(n) is the largest number of nodes in the batch and smaller proteins are padded appropriately.
  3. Drop it entirely, since it's easy to compute in torch (see the sketch below this list).
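
For reference, option 3 is roughly a one-liner once the coordinates live on the data object (a minimal sketch, assuming coords is a [num_nodes, 3] tensor):

import torch

coords = torch.rand(10, 3)              # [num_nodes, 3] node coordinates
dist_mat = torch.cdist(coords, coords)  # [num_nodes, num_nodes] pairwise Euclidean distances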

Do you have any strong opinions?

@anton-bushuiev (Contributor, Author) commented

That's a good question. I'll try it and let you know.

@anton-bushuiev (Contributor, Author) commented

Hi, @a-r-j!

From my perspective, (1) is much better than (2), but (3) is the best. I would not serialize distance matrices for several reasons:

  • it's easy to compute them in torch, as you wrote
  • they introduce an additional O(n^2) memory overhead
  • in my experience, they are typically not needed after data preparation.

What do you think about it?

@a-r-j (Owner) commented on Feb 3, 2023

Very good points, @anton-bushuiev.

I am not sure about not supporting it at all. While distance matrices are easy to compute, we can still run into scenarios where there are matrix features we may want to include (e.g. an Hbond map). Thus, I propose:

  • Distance matrices should not be stored by default, due to the memory overhead.
  • Users have to specifically add them to the columns arg if they want them included.
  • If matrices are to be stored, they should be stored in a sparse format (e.g. using this).
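
A minimal sketch of what the sparse-storage idea could look like in plain torch (the placeholder matrix and exact mechanism below are illustrative, not the final Graphein implementation):

import torch

dist_mat = torch.rand(5, 5)    # placeholder dense [num_nodes, num_nodes] matrix

sparse = dist_mat.to_sparse()  # store as a COO sparse tensor
restored = sparse.to_dense()   # convert back to dense when needed
assert torch.allclose(dist_mat, restored)

# Note: the memory saving only materialises if the matrix actually contains
# many zeros, which is rarely the case for full distance matrices.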

@anton-bushuiev (Contributor, Author) commented

I agree with the first two points, but I am not sure about the sparse format. Working with protein graphs, distance matrices are always dense, because physically the only zeros are on the diagonal. That's why I think a simple reshape would be best:

# Flatten before writing
data.dist_mat = data.dist_mat.reshape(data.num_nodes * data.num_nodes)
# Restore after reading
data.dist_mat = data.dist_mat.reshape((data.num_nodes, data.num_nodes))

Am I missing scenarios where they may be sparse? And what is the Hbond map (could you please send a link to its usage in Graphein)?

@a-r-j (Owner) left a review comment

LGTM, thanks for the contribution!

@sonarcloud (bot) commented on Feb 10, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No Coverage information
No Duplication information

@a-r-j a-r-j merged commit ea8be9f into a-r-j:master Feb 10, 2023
@rg314 (Contributor) commented on Mar 14, 2023

Hi @anton-bushuiev,

Your "Improve convert_nx_to_pyg" commit introduced some breaking changes. Edges do not necessarily have a kind, unless I'm misunderstanding something?

# Split edge index by edge kind
kind_strs = np.array(list(map(lambda x: "_".join(x), data["kind"])))
for kind in set(kind_strs):
    key = f"edge_index_{kind}"
    if key in self.columns:
        mask = kind_strs == kind
        data[key] = edge_index[:, mask]
if "kind" not in self.columns:
    del data["kind"]
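
For illustration only, a hypothetical guard could skip the split when edges carry no "kind" attribute. This assumes the same data, edge_index, np and self.columns as the snippet above, and is not the fix that was actually applied:

# Only split the edge index by kind when edge kinds are actually present.
if "kind" in data and len(data["kind"]) > 0:
    kind_strs = np.array(list(map(lambda x: "_".join(x), data["kind"])))
    for kind in set(kind_strs):
        key = f"edge_index_{kind}"
        if key in self.columns:
            mask = kind_strs == kind
            data[key] = edge_index[:, mask]
if "kind" in data and "kind" not in self.columns:
    del data["kind"]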
