Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change the pdb_paths working style and support for loading both local… #214

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

1511878618
Copy link
Contributor

Reference Issues/PRs

Fixes #210

What does this implement/fix? Explain your changes

support for loading both local and downloading pdb files, and save at self.raw_dir

What testing did you do to verify the changes in this PR?

Currently, none of this, just test on dataset loader doc notebook.

More changes and complements will be done at tomorrow morning.........

Pull Request Checklist

  • Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
  • Ran python -m py.test tests/ and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
  • Checked for style issues by running black . and isort .

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@a-r-j
Copy link
Owner

a-r-j commented Sep 19, 2022

Awesome work, thanks @1511878618 ! If you merge into #213 the CI pipeline should be much faster & I can push a release to PyPI tonight :)

@1511878618
Copy link
Contributor Author

Awesome work, thanks @1511878618 ! If you merge into #213 the CI pipeline should be much faster & I can push a release to PyPI tonight :)

I'm not sure about how to change into #213, and i see master branch is associated with #213. Or, maybe you can help me?

Also, I'm not sure whether are there some potential bugs in it~

@@ -0,0 +1,247 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please switch out this notebook for unit tests :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i'll try it later in this month

graphein/protein/utils.py Outdated Show resolved Hide resolved
graph_label_map: Optional[Dict[str, torch.Tensor]] = None,
node_label_map: Optional[Dict[str, torch.Tensor]] = None,
chain_selection_map: Optional[Dict[str, List[str]]] = None,
pdb_paths: Optional[List[str]] = [],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning for using empty lists as the default arg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty lists can add together even if they are empty, while None can't. So we can skip some if for different statements of the user pass pdb_paths or pdb_codes or uniprot_ids, and just merge them into self.structures, which is used at process func and it works like os.listdir(self.raw_dir).

As for some potential bugs, i'm really not sure would this will cause some bugs as i use empty list instead of None.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be None

https://stackoverflow.com/questions/366422/what-is-the-pythonic-way-to-avoid-default-parameters-that-are-empty-lists

If you want to retain the behaviour inside the object, you could do:

if working_list is None: working_list = []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and actually i have done that in the latest commit

else:
self.pdb_paths_name = []

self.structures = list(set(self.pdb_codes + self.uniprot_ids + self.pdb_paths_name)) # remove some pdb_codes is in pdb_path and loaded repeately
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a set operation. With chain selections you may want to have e.g. 3eiy_A and 3eiy_B as different examples in your dataset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, i guess it would'n make a difference at chain selection, this set operation is to drop duplicate in the result list of pdb_codes + uniprot_ids + paths_name. As you can see, local dir may have some pdb files like 10gs.pdb, and if pdb_codes also have 10gs to download, and self.structures would contain double 10gs and so the finial dataset object will have duplicate Data object.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It becomes a problem here though (L283), no?

    def process(self):
        """Process structures into PyG format and save to disk."""
        # Read data into huge `Data` list.
        structure_files = [
            f"{self.raw_dir}/{pdb}.pdb" for pdb in self.structures
        ]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, i guess not.
code like below in the tests/ml/test.ipynb

from graphein.ml.datasets import InMemoryProteinGraphDataset

local_dir = "../protein/test_data"
pdb_paths = [osp.join(local_dir, pdb_file) for pdb_file in os.listdir(local_dir) if pdb_file.endswith(".pdb")]

ds = InMemoryProteinGraphDataset(root = "../protein/test_data/InMemoryProteinGraphDataset",
                    name = "InMemoryProteinGraphDataset_test",
                    pdb_paths=pdb_paths,
                    pdb_codes=["10gs"],
                    uniprot_ids=["A0A6J1BG53", "A0A6P5Z5F7"],
                    af_version=3)

and before running it:
image

then run it :
image

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you see what happens with:

from graphein.ml.dataset import InMemoryProteinGraphDataset

ds = InMemoryProteinGraphDataset(root = ""../protein.test_data/InMemoryProteinGraphDataset", pdb_paths=pdb_paths, pdb_codes = ["4hhb", "4hhb"], chain_selection=["A","B"])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, i'll try later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you see what happens with:

from graphein.ml.dataset import InMemoryProteinGraphDataset

ds = InMemoryProteinGraphDataset(root = ""../protein.test_data/InMemoryProteinGraphDataset", pdb_paths=pdb_paths, pdb_codes = ["4hhb", "4hhb"], chain_selection=["A","B"])

image

and why i ['4hhs', '4hhs']

image

I guess this may need lots of change~

@@ -225,6 +251,7 @@ def download(self):
for pdb in set(self.pdb_codes)
if not os.path.exists(Path(self.raw_dir) / f"{pdb}.pdb")
]
print("downloading uniprotids")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using log would be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhhhhhhhhhh, too sry for these print, forget to remove them, XD.

I'll remove them today

@@ -237,6 +264,7 @@ def download(self):
]

def __len__(self) -> int:
"""Returns length of data set (number of structures)."""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return the number of examples (not just the number of structures for the multiple chain reason I mentioned previously)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@@ -492,8 +513,10 @@ def processed_file_names(self) -> List[str]:

@property
def raw_dir(self) -> str:
if self.pdb_paths is not None:
return self.pdb_path # replace raw dir with user local pdb_path
if self.pdb_paths:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think it would be useful to allow users to choose a path for raw_dir when initialising the Dataset objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree.
If we simply change self.raw_dir instead of self.pdb_paths, where the former is a folder dir the latter is a list containing pdb_file dir, i guess we will use os.listdir to get local pdb files dir.
And the question is if os.listdir in the func, and then the order of self.structure maybe hard to match the order of graph_labels and node_labels, since we match the labels by index of list, i guess.
image

image

image

I'm not sure about this, i prefer to dict, which key is the names like {'10gs':0} would be better than {0:0}. And then we could just change the raw_dir and os.listdir and get a list of pdb file dir containing both local and downloaded pdb files, and process and assign each pdb files with their node_graph_label or chain_selection or graph_label by their name (remove root path and suffix like ./test/10gs.pdb -> 10gs) not by the enumunated index (which i think it is hard to match the correct order with pdb files when passing the graph_labels)

This description is not very clear, i'll try to make it clear later...

If something wrong in my understanding, please tell me 😄 , i'm still reading and learning your code lol. It's really a pythonic code, i learnt a lot 👍 👍

Copy link
Owner

@a-r-j a-r-j Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we simply change self.raw_dir instead of self.pdb_paths, where the former is a folder dir the latter is a list containing pdb_file dir, i guess we will use `os.listdir`` to get local pdb files dir.

I don't think this is the best idea. I think being explicit about the paths users want to use is best. For instance, people may want to use only a subset of their dataset (rather than everything in the directory - e.g. imagine where you want to keep all your pdb files together but train/test on different subsets). It also has the potential problem with hidden files like .DS_Store etc. You're also completely right about the matching the list to node labels etc.

I'm not sure about this, i prefer to dict, which key is the names like {'10gs':0} would be better than {0:0}

This was my initial implementation. However, this ran into the problem where you may have different examples in your dataset drawn from different chains of the same PDB. E.g. imagine you have 3eiy_A and 3eiy_B with different labels. The current implementation allows for this, whereas indexing on the PDB name does not.

If something wrong in my understanding, please tell me 😄 , i'm still reading and learning your code lol. It's really a pythonic code, i learnt a lot 👍 👍

Thanks!! Me too!

…rameters to Keyword Parameters since the former do not work and update the usage of it into dataloader_tutorial.ipynb also some little change in datasetload.py
… add first transform which turn attr of pyg Data to tensor
@codecov-commenter
Copy link

codecov-commenter commented Sep 22, 2022

Codecov Report

Base: 40.27% // Head: 48.00% // Increases project coverage by +7.73% 🎉

Coverage data is based on head (07cd92a) compared to base (8123f42).
Patch coverage: 51.68% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #214      +/-   ##
==========================================
+ Coverage   40.27%   48.00%   +7.73%     
==========================================
  Files          48       86      +38     
  Lines        2811     5458    +2647     
==========================================
+ Hits         1132     2620    +1488     
- Misses       1679     2838    +1159     
Impacted Files Coverage Δ
graphein/grn/parse_trrust.py 37.77% <ø> (ø)
graphein/ml/diffusion.py 0.00% <0.00%> (ø)
graphein/ml/transform.py 0.00% <0.00%> (ø)
graphein/ppi/edges.py 100.00% <ø> (ø)
graphein/ppi/graph_metadata.py 0.00% <ø> (ø)
graphein/ppi/graphs.py 54.34% <ø> (ø)
graphein/ppi/parse_biogrid.py 75.00% <ø> (ø)
graphein/ppi/visualisation.py 0.00% <0.00%> (ø)
graphein/protein/analysis.py 0.00% <0.00%> (ø)
graphein/protein/edges/intramolecular.py 22.68% <0.00%> (ø)
... and 79 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@sonarcloud
Copy link

sonarcloud bot commented Sep 23, 2022

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
27.0% 27.0% Duplication

@sonarcloud
Copy link

sonarcloud bot commented Oct 23, 2022

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

No Coverage information No Coverage information
27.0% 27.0% Duplication

@a-r-j a-r-j added enhancement New feature or request 1 - Priority P1 High Priority labels Nov 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - Priority P1 High Priority enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using both local and downloaded PDBs in ML DataLoaders
3 participants