Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Development of PDBManager Class (WIP) #272

Merged
merged 131 commits into from
Mar 31, 2023
Merged

Conversation

amorehead
Copy link
Contributor

@amorehead amorehead commented Feb 25, 2023

Reference Issues/PRs

#270 @a-r-j

What does this implement/fix? Explain your changes

Adds a utility for creating selections of experimental PDB structures

What testing did you do to verify the changes in this PR?

WIP

Draws the following metadata:

id | pdb | chain | length | molecule_type | name | sequence | ligands | source | resolution | experiment_type
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
100d_A | 100d | A | 10 | na | DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP... | CCGGCGCCGG | [SPM] |   | 1.90 | diffraction
100d_B | 100d | B | 10 | na | DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP... | CCGGCGCCGG | [SPM] |   | 1.90 | diffraction
101d_A | 101d | A | 12 | na | DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C... | CGCGAATTCGCG | [CBR, MG, NT] |   | 2.25 | diffraction
101d_B | 101d | B | 12 | na | DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C... | CGCGAATTCGCG | [CBR, MG, NT] |   | 2.25 | diffraction
101m_A | 101m | A | 154 | protein | MYOGLOBIN | MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... | [HEM, NBN, SO4] | Physeter catodon | 2.07 | diffraction
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
9xia_A | 9xia | A | 388 | protein | XYLOSE ISOMERASE | MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL... | [DFR, MN] | Streptomyces rubiginosus | 1.90 | diffraction
9xim_A | 9xim | A | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_B | 9xim | B | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_C | 9xim | C | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_D | 9xim | D | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction

Currently missing:

  • Multiprocessing PDB download method.
  • Chain extraction logic for downloaded PDB files.
  • Initial support for providing an arbitrary number of dataset splits.
  • Initial support for sequence-based splits using the clustering method of MMSeqs.
  • Release date (i.e., time)-based dataset splits (e.g., for real-world prediction scenarios such as those in CASP).
  • Tracking to which split a selection of PDB chains belong (e.g., train, val, and test).
  • Being able to read in a sequence DataFrame from a FASTA file and to write out the current sequence DataFrame to either a FASTA file or a CSV file.
  • The ability to use our sequence DataFrame to output (i.e., collate) the PDB entries of multiple chains corresponding to the same PDB ID as a single PDB file (i.e., structure file curation for potentially multiple PDB chains).

Pull Request Checklist

  • Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
  • Ran python -m py.test tests/ and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
  • Checked for style issues by running black . and isort .

@amorehead amorehead mentioned this pull request Feb 25, 2023
7 tasks
@codecov-commenter
Copy link

codecov-commenter commented Feb 25, 2023

Codecov Report

Patch coverage: 39.87% and project coverage change: +3.60 🎉

Comparison is base (8123f42) 40.27% compared to head (cd4db9e) 43.87%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #272      +/-   ##
==========================================
+ Coverage   40.27%   43.87%   +3.60%     
==========================================
  Files          48      113      +65     
  Lines        2811     7718    +4907     
==========================================
+ Hits         1132     3386    +2254     
- Misses       1679     4332    +2653     
Impacted Files Coverage Δ
graphein/ml/datasets/foldcomp_dataset.py 0.00% <0.00%> (ø)
graphein/ml/diffusion.py 0.00% <0.00%> (ø)
graphein/ml/metrics/__init__.py 0.00% <0.00%> (ø)
graphein/ml/metrics/gdt.py 0.00% <0.00%> (ø)
graphein/ml/metrics/tm_score.py 0.00% <0.00%> (ø)
graphein/ppi/graph_metadata.py 0.00% <0.00%> (ø)
graphein/ppi/visualisation.py 0.00% <0.00%> (ø)
graphein/protein/analysis.py 0.00% <0.00%> (ø)
graphein/protein/features/utils.py 27.77% <0.00%> (ø)
graphein/protein/folding_utils.py 0.00% <0.00%> (ø)
... and 95 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@a-r-j
Copy link
Owner

a-r-j commented Feb 26, 2023

Wow, thanks for the refactor & additions! I added deposition dates for temporal splits.

I also had to rename the file as pdb was causing namespace clashes with the python debugger.

@amorehead
Copy link
Contributor Author

amorehead commented Feb 26, 2023

@a-r-j, no problem! Happy to do so. Funny enough, earlier today, I accidentally also implemented parsing of deposition dates (should have checked my GitHub notifications beforehand). However, to minimize development style friction, I've conformed my implementation to reuse most parts of your implementation. In addition, I've also added support for time-based (i.e., deposition date) splits, as well as new PDB filters in this commit.

@amorehead
Copy link
Contributor Author

One practical question about SonarCloud: Any ideas for how we can get around this security issue with this HTTP URL we have listed in our code? It looks like the HTTPS version of this URL does not exist, so I am unsure of how to proceed to work around this flag.
image

@a-r-j
Copy link
Owner

a-r-j commented Feb 26, 2023

Eep! Sorry, should have dropped a note first.

Re SonarCloud, it's not a problem; I can manually override the flag since there's no HTTPS alternative (AFAIK).

@amorehead
Copy link
Contributor Author

@a-r-j, apologies for my delayed response time here. I just ran the latest version of the PDBManager tutorial notebook, and I was successfully able to perform all time-based splitting and sequence-based clustering in succession. I did not encounter any freezing issues, and (very nicely) I was able to reproduce all outputs from your most recent execution of the tutorial notebook. This suggests that our entire data fetching and splitting procedure is truly deterministic and reproducible! Great work on putting this notebook together. The only small comment I have on it is that users may also like to know that they can compose multiple types of splitting operations together. For example, they can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits. Other than that, everything looks good to me!

@a-r-j
Copy link
Owner

a-r-j commented Mar 31, 2023

Great! It seems it's the clustering step with mmseqs that jupyter wasn't happy with.

The only small comment I have on it is that users may also like to know that they can compose multiple types of splitting operations together. For example, they can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits.

Good point!. Will add this.

@sonarcloud
Copy link

sonarcloud bot commented Mar 31, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@a-r-j a-r-j merged commit 83295a8 into a-r-j:master Mar 31, 2023
@a-r-j a-r-j mentioned this pull request May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants