
Fix and improve the utilities for PyG #234

Merged: 58 commits merged into a-r-j:master on Feb 10, 2023
Conversation

@anton-bushuiev (Contributor) commented on Nov 10, 2022

Reference Issues/PRs

No Reference Issues/PRs

What does this implement/fix? Explain your changes

This PR fixes bugs related to the processing of PyG data.

graphein.ml.conversion.convert_nx_to_pyg
  1. Fix the coordinates bug. Currently, the function creates a data.coords tensor with an extra dimension because coords is also a “graph-level feature”.
  2. Improve the conversion so that everything that can be converted to torch.Tensors is. This makes further usage much easier and the resulting data object much more PyG-like.
  3. Extend the conversion to produce multiple edge_indices for multiple edge types (see the sketch after this list).
graphein.ml.visualisation.plot_pyg_data
  1. Fix the arguments passed to the nested plotly_protein_structure_graph call. Currently, the positional value for node_size_feature is missing and the order of the following arguments is broken.
  2. Adapt the coords processing to change number 2 in convert_nx_to_pyg.
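
For concreteness, a rough sketch of the shape of the converted object after these changes (the edge-kind names and tensor values below are hypothetical, and this is not Graphein's actual conversion code):

import torch
from torch_geometric.data import Data

# Node coordinates as a [num_nodes, 3] float tensor, without the extra
# leading dimension that the old conversion produced.
num_nodes = 4
coords = torch.rand(num_nodes, 3)

# One edge index per edge kind (the kind names here are made up for illustration).
edge_index_peptide_bond = torch.tensor([[0, 1, 2],
                                        [1, 2, 3]])
edge_index_hbond = torch.tensor([[0],
                                 [3]])

data = Data(
    coords=coords,
    edge_index_peptide_bond=edge_index_peptide_bond,
    edge_index_hbond=edge_index_hbond,
    num_nodes=num_nodes,
)
# All features are torch.Tensors, so the object batches like any other PyG Data.
print(data)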

What testing did you do to verify the changes in this PR?

Pull Request Checklist

  • Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
  • Ran python -m py.test tests/ and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
  • Checked for style issues by running black . and isort .

@anton-bushuiev anton-bushuiev changed the base branch from master to anon November 10, 2022 20:37
@anton-bushuiev anton-bushuiev changed the base branch from anon to master November 10, 2022 20:37
@a-r-j (Owner) commented on Jan 28, 2023

I've made some changes to the CI/CD in #244, which I expect will resolve the test failures here.

@a-r-j (Owner) commented on Jan 30, 2023

It looks like we're running into an issue when it comes to collating the distance matrices into a batch.

I suppose we have three options:

  1. Flatten the distance matrices (e.g. n x n -> n^2 x 1).
  2. Add an additional dimension: 1 x n x n, which then becomes 1 x max(n) x max(n), where max(n) is the largest number of nodes in the batch and smaller proteins are padded appropriately.
  3. Drop it entirely, since it's easy to compute in torch (see the sketch below this list).
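
For reference, option 3 is roughly a one-liner once the coordinates live on the data object (a minimal sketch, assuming coords is a [num_nodes, 3] tensor):

import torch

coords = torch.rand(10, 3)              # [num_nodes, 3] node coordinates
dist_mat = torch.cdist(coords, coords)  # [num_nodes, num_nodes] pairwise Euclidean distances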

Do you have any strong opinions?

@anton-bushuiev (Contributor, Author) commented

That's a good question. I'll try it and let you know.

@anton-bushuiev (Contributor, Author) commented

Hi, @a-r-j!

From my perspective, (1) is much better than (2), but (3) is the best. I would not serialize distance matrices for several reasons:

  • it's easy to compute them in torch, as you wrote
  • they introduce an additional O(n^2) memory overhead
  • in my experience, they are typically not needed after data preparation.

What do you think about it?

@a-r-j (Owner) commented on Feb 3, 2023

Very good points, @anton-bushuiev.

I am not sure about not supporting it at all. While distance matrices are easy to compute, we can still run into scenarios where there are matrix features we may want to include (e.g. an Hbond map). Thus, I propose:

  • Distance matrices should not be stored by default, due to the memory overhead.
  • Users have to specifically add them to the columns arg if they want them included.
  • If matrices are to be stored, they should be stored in a sparse format (e.g. using this).
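
A minimal sketch of what the sparse-storage idea could look like in plain torch (the placeholder matrix and exact mechanism below are illustrative, not the final Graphein implementation):

import torch

dist_mat = torch.rand(5, 5)    # placeholder dense [num_nodes, num_nodes] matrix

sparse = dist_mat.to_sparse()  # store as a COO sparse tensor
restored = sparse.to_dense()   # convert back to dense when needed
assert torch.allclose(dist_mat, restored)

# Note: the memory saving only materialises if the matrix actually contains
# many zeros, which is rarely the case for full distance matrices.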

@anton-bushuiev (Contributor, Author) commented

I agree with the first two points, but I am not sure about the sparse format. Working with protein graphs, distance matrices are always dense, because physically the only zeros are on the diagonal. That's why I think a simple reshape would be best:

# Flatten before writing
data.dist_mat = data.dist_mat.reshape(data.num_nodes * data.num_nodes)
# Restore after reading
data.dist_mat = data.dist_mat.reshape((data.num_nodes, data.num_nodes))

Am I missing scenarios where they may be sparse? And what is the Hbond map (could you please send a link to its usage in Graphein)?

@a-r-j (Owner) left a review comment

LGTM, thanks for the contribution!

@sonarcloud (bot) commented on Feb 10, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No Coverage information
No Duplication information

@a-r-j a-r-j merged commit ea8be9f into a-r-j:master Feb 10, 2023
@rg314 (Contributor) commented on Mar 14, 2023

Hi @anton-bushuiev,

Your "Improve convert_nx_to_pyg" commit introduced some breaking changes. Edges do not necessarily have a kind, unless I'm misunderstanding something?

# Split edge index by edge kind
kind_strs = np.array(list(map(lambda x: "_".join(x), data["kind"])))
for kind in set(kind_strs):
    key = f"edge_index_{kind}"
    if key in self.columns:
        mask = kind_strs == kind
        data[key] = edge_index[:, mask]
if "kind" not in self.columns:
    del data["kind"]
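
For illustration only, a hypothetical guard could skip the split when edges carry no "kind" attribute. This assumes the same data, edge_index, np and self.columns as the snippet above, and is not the fix that was actually applied:

# Only split the edge index by kind when edge kinds are actually present.
if "kind" in data and len(data["kind"]) > 0:
    kind_strs = np.array(list(map(lambda x: "_".join(x), data["kind"])))
    for kind in set(kind_strs):
        key = f"edge_index_{kind}"
        if key in self.columns:
            mask = kind_strs == kind
            data[key] = edge_index[:, mask]
if "kind" in data and "kind" not in self.columns:
    del data["kind"]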
