change the pdb_paths working style and support for loading both local… #214

1511878618 · 2022-09-19T14:54:29Z

Reference Issues/PRs

Fixes #210

What does this implement/fix? Explain your changes

Adds support for loading local PDB files alongside downloaded ones; both are saved to self.raw_dir.

What testing did you do to verify the changes in this PR?

None so far, apart from running the dataset loader documentation notebook.

More changes and additions will follow tomorrow morning.

Pull Request Checklist

Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
Modify documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
Ran python -m py.test tests/ and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
Checked for style issues by running black . and isort .


review-notebook-app · 2022-09-19T14:54:32Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

a-r-j · 2022-09-19T14:56:49Z

Awesome work, thanks @1511878618 ! If you merge into #213 the CI pipeline should be much faster & I can push a release to PyPI tonight :)

1511878618 · 2022-09-19T15:33:58Z

> Awesome work, thanks @1511878618 ! If you merge into #213 the CI pipeline should be much faster & I can push a release to PyPI tonight :)

I'm not sure how to retarget this onto #213; I see that the master branch is associated with #213. Could you help me with that?

Also, I'm not sure whether there are any potential bugs in it.

a-r-j · 2022-09-19T19:36:06Z

tests/ml/test.ipynb

@@ -0,0 +1,247 @@
+{


Could you please switch out this notebook for unit tests :)

OK, I'll try to do that later this month.

graphein/protein/utils.py

a-r-j · 2022-09-19T19:37:09Z

graphein/ml/datasets/torch_geometric_dataset.py

-        graph_label_map: Optional[Dict[str, torch.Tensor]] = None,
-        node_label_map: Optional[Dict[str, torch.Tensor]] = None,
-        chain_selection_map: Optional[Dict[str, List[str]]] = None,
+        pdb_paths: Optional[List[str]] = [],


What's the reasoning for using empty lists as the default arg?

Empty lists can be concatenated even when they are empty, whereas None cannot. That lets us skip the separate if branches for whether the user passed pdb_paths, pdb_codes, or uniprot_ids and simply merge them into self.structures, which the process function uses much like os.listdir(self.raw_dir).

As for potential bugs, I'm really not sure whether using an empty list instead of None will cause any.

I think this should be None

https://stackoverflow.com/questions/366422/what-is-the-pythonic-way-to-avoid-default-parameters-that-are-empty-lists

If you want to retain the behaviour inside the object, you could do:

if working_list is None: working_list = []

Yes, I have actually done that in the latest commit.
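The pitfall behind this thread can be shown in a few lines. The sketch below is illustrative only: `collect_shared` and `collect_safe` are hypothetical helper names, not functions from graphein.

```python
from typing import List, Optional


def collect_shared(code: str, codes: List[str] = []) -> List[str]:
    # The default list is created once, when the function is defined,
    # so every call that omits `codes` mutates the same object.
    codes.append(code)
    return codes


def collect_safe(code: str, codes: Optional[List[str]] = None) -> List[str]:
    # A None sentinel gives each call its own fresh list.
    if codes is None:
        codes = []
    codes.append(code)
    return codes


first = collect_shared("10gs")
second = collect_shared("4hhb")
print(second)  # ['10gs', '4hhb']: state leaked between calls

print(collect_safe("10gs"))  # ['10gs']
print(collect_safe("4hhb"))  # ['4hhb']
```

Concatenating with `+` does not mutate the default, so the original code may well have behaved correctly, but any later in-place change to the list would be shared across calls.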

a-r-j · 2022-09-19T19:39:20Z

graphein/ml/datasets/torch_geometric_dataset.py

+        else:
+            self.pdb_paths_name = []
+
+        self.structures = list(set(self.pdb_codes + self.uniprot_ids + self.pdb_paths_name))  # remove some pdb_codes is in pdb_path and loaded repeately


I don't think this should be a set operation. With chain selections you may want to have e.g. 3eiy_A and 3eiy_B as different examples in your dataset.

Well, I guess it wouldn't make a difference for chain selection. The set operation is there to drop duplicates from pdb_codes + uniprot_ids + pdb_paths_name. For example, the local directory may contain a file such as 10gs.pdb; if pdb_codes also lists 10gs for download, self.structures would contain 10gs twice, and the final dataset object would hold duplicate Data objects.

It becomes a problem here though (L283), no?

def process(self):
    """Process structures into PyG format and save to disk."""
    # Read data into huge `Data` list.
    structure_files = [
        f"{self.raw_dir}/{pdb}.pdb" for pdb in self.structures
    ]

Well, I guess not. The code below is from tests/ml/test.ipynb:

import os
import os.path as osp

from graphein.ml.datasets import InMemoryProteinGraphDataset

local_dir = "../protein/test_data"
# Collect every local .pdb file in the test data directory.
pdb_paths = [
    osp.join(local_dir, pdb_file)
    for pdb_file in os.listdir(local_dir)
    if pdb_file.endswith(".pdb")
]
ds = InMemoryProteinGraphDataset(
    root="../protein/test_data/InMemoryProteinGraphDataset",
    name="InMemoryProteinGraphDataset_test",
    pdb_paths=pdb_paths,
    pdb_codes=["10gs"],
    uniprot_ids=["A0A6J1BG53", "A0A6P5Z5F7"],
    af_version=3,
)

and before running it:

then run it :

Could you see what happens with:

from graphein.ml.datasets import InMemoryProteinGraphDataset

ds = InMemoryProteinGraphDataset(
    root="../protein/test_data/InMemoryProteinGraphDataset",
    pdb_paths=pdb_paths,
    pdb_codes=["4hhb", "4hhb"],
    chain_selection=["A", "B"],
)

Well, I'll try it later.

> Could you see what happens with:
>
> from graphein.ml.datasets import InMemoryProteinGraphDataset
>
> ds = InMemoryProteinGraphDataset(
>     root="../protein/test_data/InMemoryProteinGraphDataset",
>     pdb_paths=pdb_paths,
>     pdb_codes=["4hhb", "4hhb"],
>     chain_selection=["A", "B"],
> )

and why i ['4hhs', '4hhs']

I guess this may need a lot of changes.
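One way to reconcile the two concerns in this thread might be to deduplicate on (structure, chain) pairs rather than on structure names alone, while preserving order so that list-indexed labels still line up. This is a sketch under assumptions: `unique_examples` is a hypothetical helper, and pairing each structure name with one chain selection by position is assumed here rather than taken from the PR.

```python
from typing import List, Tuple


def unique_examples(
    names: List[str], chains: List[str]
) -> List[Tuple[str, str]]:
    """Drop exact (name, chain) repeats while keeping first-seen order."""
    if len(names) != len(chains):
        raise ValueError("names and chains must have the same length")
    # dict keys keep insertion order, unlike set().
    return list(dict.fromkeys(zip(names, chains)))


names = ["4hhb", "4hhb", "10gs", "10gs"]
chains = ["A", "B", "A", "A"]

# Deduplicating on names alone collapses the two 4hhb chains into one entry.
print(sorted(set(names)))  # ['10gs', '4hhb']

# Deduplicating on pairs only removes the true repeat of 10gs chain A.
print(unique_examples(names, chains))
# [('4hhb', 'A'), ('4hhb', 'B'), ('10gs', 'A')]
```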

a-r-j · 2022-09-19T19:40:51Z

graphein/ml/datasets/torch_geometric_dataset.py

@@ -225,6 +251,7 @@ def download(self):
                for pdb in set(self.pdb_codes)
                if not os.path.exists(Path(self.raw_dir) / f"{pdb}.pdb")
            ]
+        print("downloading uniprotids")


Using the logger would be better than print here.

Oh, sorry about these print calls; I forgot to remove them. XD

I'll remove them today.
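A minimal sketch of the suggested change, using the standard-library `logging` module as a stand-in; which logger the project actually uses is not shown in this thread, and `missing_files` is a hypothetical helper modelled loosely on the diff above.

```python
import logging
from pathlib import Path
from typing import Iterable, List

log = logging.getLogger(__name__)


def missing_files(raw_dir: str, names: Iterable[str]) -> List[str]:
    """Return the names whose .pdb file is not yet present in raw_dir."""
    to_download = [
        name
        for name in sorted(set(names))
        if not (Path(raw_dir) / f"{name}.pdb").exists()
    ]
    # Unlike print, this message can be filtered by level or silenced.
    log.info("Downloading %d structures: %s", len(to_download), to_download)
    return to_download


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    missing_files("raw", ["10gs", "4hhb", "10gs"])
```

Callers who do not configure logging see nothing at the INFO level, which is usually the desired behaviour for a library.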

a-r-j · 2022-09-19T19:41:32Z

graphein/ml/datasets/torch_geometric_dataset.py

@@ -237,6 +264,7 @@ def download(self):
            ]

    def __len__(self) -> int:
+        """Returns length of data set (number of structures)."""


We should return the number of examples (not just the number of structures for the multiple chain reason I mentioned previously)
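To illustrate the distinction, here is a toy container (a hypothetical `ToyDataset`, not the graphein class) in which each example is a structure name paired with a chain selection, so the same structure can appear more than once.

```python
from typing import List, Optional


class ToyDataset:
    def __init__(
        self, names: List[str], chains: Optional[List[str]] = None
    ) -> None:
        self.names = names
        # Assumption: "all" means no chain filtering for that example.
        self.chains = chains if chains is not None else ["all"] * len(names)
        if len(self.names) != len(self.chains):
            raise ValueError("one chain selection is needed per example")

    def __len__(self) -> int:
        """Number of examples, not number of distinct structures."""
        return len(self.names)

    def num_structures(self) -> int:
        """Number of distinct structure names."""
        return len(set(self.names))


ds = ToyDataset(["3eiy", "3eiy", "10gs"], ["A", "B", "A"])
print(len(ds), ds.num_structures())  # 3 2
```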

a-r-j · 2022-09-19T19:43:49Z

graphein/ml/datasets/torch_geometric_dataset.py

@@ -492,8 +513,10 @@ def processed_file_names(self) -> List[str]:

    @property
    def raw_dir(self) -> str:
-        if self.pdb_paths is not None:
-            return self.pdb_path  # replace raw dir with user local pdb_path
+        if self.pdb_paths:


I actually think it would be useful to allow users to choose a path for raw_dir when initialising the Dataset objects.

Yes, I agree.
If we simply let users set self.raw_dir instead of passing self.pdb_paths (the former is a directory, the latter a list of PDB file paths), I guess we would use os.listdir to find the local PDB files.
The question is that if os.listdir is called inside the function, the order of self.structures may be hard to match to the order of graph_labels and node_labels, since I believe the labels are matched by list index.

I'm not sure about this, but I would prefer a dict keyed by name: {'10gs': 0} would be better than {0: 0}. We could then just set raw_dir, call os.listdir to get a list of PDB file paths covering both local and downloaded files, and assign each file its node_graph_label, chain_selection, or graph_label by name (stripping the root path and suffix, e.g. ./test/10gs.pdb -> 10gs) rather than by enumerated index, which I think makes it hard to match the labels to the PDB files in the right order when passing graph_labels.

This description is not very clear; I'll try to clarify it later.

If I have misunderstood something, please tell me 😄. I'm still reading and learning from your code. It's really Pythonic code, and I've learnt a lot 👍 👍

> If we simply change self.raw_dir instead of self.pdb_paths, where the former is a folder dir the latter is a list containing pdb_file dir, i guess we will use `os.listdir` to get local pdb files dir.

I don't think this is the best idea. I think being explicit about the paths users want to use is best. For instance, people may want to use only a subset of their dataset rather than everything in the directory (e.g. you may want to keep all your PDB files together but train and test on different subsets). Listing the directory also risks picking up hidden files like .DS_Store. You're also completely right about matching the list to node labels etc.
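The contrast can be sketched as follows. The directory contents are made up for the demonstration, and `explicit_paths` is a hypothetical helper rather than part of graphein.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def explicit_paths(directory: str, names: List[str]) -> List[str]:
    """Build paths in the caller's order so list-indexed labels stay aligned."""
    return [str(Path(directory) / f"{name}.pdb") for name in names]


with tempfile.TemporaryDirectory() as tmp:
    for filename in ["10gs.pdb", "3eiy.pdb", "4hhb.pdb", ".DS_Store"]:
        (Path(tmp) / filename).touch()

    # os.listdir returns every entry, hidden files included, in arbitrary order.
    listed = os.listdir(tmp)
    print(".DS_Store" in listed)  # True

    # An explicit list selects a subset and fixes the order.
    train = explicit_paths(tmp, ["4hhb", "10gs"])
    print([Path(p).name for p in train])  # ['4hhb.pdb', '10gs.pdb']
```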

> I'm not sure about this, i prefer to dict, which key is the names like {'10gs':0} would be better than {0:0}

This was my initial implementation. However, this ran into the problem where you may have different examples in your dataset drawn from different chains of the same PDB. E.g. imagine you have 3eiy_A and 3eiy_B with different labels. The current implementation allows for this, whereas indexing on the PDB name does not.
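A small illustration of why name-keyed labels break down in this case. The names, chains, and label values are invented for the example.

```python
from typing import Dict, List

names: List[str] = ["3eiy", "3eiy", "10gs"]
chains: List[str] = ["A", "B", "A"]
labels: List[int] = [0, 1, 1]

# Keyed by PDB name: the second 3eiy entry overwrites the first.
by_name: Dict[str, int] = dict(zip(names, labels))
print(by_name)  # {'3eiy': 1, '10gs': 1}

# Keyed by position: every example keeps its own label.
by_index: Dict[int, int] = dict(enumerate(labels))
for i, (name, chain) in enumerate(zip(names, chains)):
    print(f"{name}_{chain} -> {by_index[i]}")
```

The label 0 for chain A of 3eiy is lost in `by_name`, while `by_index` retains all three examples.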

> If something wrong in my understanding, please tell me 😄 , i'm still reading and learning your code lol. It's really a pythonic code, i learnt a lot 👍 👍

Thanks!! Me too!

codecov-commenter · 2022-09-22T15:07:29Z

Codecov Report

Base: 40.27% // Head: 48.00% // Increases project coverage by +7.73% 🎉

Coverage data is based on head (07cd92a) compared to base (8123f42).
Patch coverage: 51.68% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #214      +/-   ##
==========================================
+ Coverage   40.27%   48.00%   +7.73%     
==========================================
  Files          48       86      +38     
  Lines        2811     5458    +2647     
==========================================
+ Hits         1132     2620    +1488     
- Misses       1679     2838    +1159

Impacted Files	Coverage Δ
graphein/grn/parse_trrust.py	`37.77% <ø> (ø)`
graphein/ml/diffusion.py	`0.00% <0.00%> (ø)`
graphein/ml/transform.py	`0.00% <0.00%> (ø)`
graphein/ppi/edges.py	`100.00% <ø> (ø)`
graphein/ppi/graph_metadata.py	`0.00% <ø> (ø)`
graphein/ppi/graphs.py	`54.34% <ø> (ø)`
graphein/ppi/parse_biogrid.py	`75.00% <ø> (ø)`
graphein/ppi/visualisation.py	`0.00% <0.00%> (ø)`
graphein/protein/analysis.py	`0.00% <0.00%> (ø)`
graphein/protein/edges/intramolecular.py	`22.68% <0.00%> (ø)`
... and 79 more


sonarcloud · 2022-09-23T13:28:44Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
3 Code Smells

No Coverage information
27.0% Duplication

sonarcloud · 2022-10-23T17:01:21Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
3 Code Smells

No Coverage information
27.0% Duplication

change the pdb_paths working style and support for loading both local…

524beb2

… and downloading files

Delete AF-Q5VSL9-F1-model_v3.pdb

f5b017c

1511878618 added 4 commits September 20, 2022 11:10

remove empty list as default value in function and set to empty list …

b24bdeb

…in func

Merge branch 'dataloader_dev' of github.com:1511878618/graphein into …

ce6d36b

…dataset_pyg_dev merge from v1.5.2

update ml/visualisation plot_pyg_data with changing the positional pa…

f5da2c4

…rameters to Keyword Parameters since the former do not work and update the usage of it into dataloader_tutorial.ipynb also some little change in datasetload.py

add transform module and set a tutorial on dataloader_tutorial.ipynb;…

9a67631

… add first transform which turn attr of pyg Data to tensor

update comments of transform

07cd92a

add nbformat install to CI

a3dfff8

a-r-j added the enhancement (New feature or request) and 1 - Priority P1 (High Priority) labels on Nov 2, 2022
