Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 350 patch #351

Closed
wants to merge 2 commits into from
Closed

Issue 350 patch #351

wants to merge 2 commits into from

Conversation

RichardBruskiewich
Copy link
Collaborator

@RichardBruskiewich RichardBruskiewich commented Oct 27, 2021

Splitting out RDF/OWL test failure patch from PR #349 (see this other PR).

Richard Bruskiewich added 2 commits October 26, 2021 13:52
…rsion (this eliminates the designated_type error, but still has some other unit test failures)
…isting size of the cache as smaller than the CACHE_SIZE?
@balhoff
Copy link

balhoff commented Oct 27, 2021

@RichardBruskiewich I don't think I have enough knowledge of this project to review, sorry!

@RichardBruskiewich
Copy link
Collaborator Author

RichardBruskiewich commented Oct 27, 2021

@RichardBruskiewich I don't think I have enough knowledge of this project to review, sorry!

Hi @balhoff, (cc: @deepakunni3) maybe I can just get you to comment on the 'reified nodes' issue: what are we expecting as Predicates as typical KGX RDF NT input triples?

The KGX dereify() method ignores OBAN node properties like http://purl.org/oban/association_has_subject, http://purl.org/oban/association_has_predicate and http://purl.org/oban/association_has_object, which are used in the sample OBAN NT data file (how universal are those properties as RDF globally?).

I'm simply wondering whether or not the RDF source input code simply translate these predicates into the Biolink Model slot values ('subject', 'predicate' and 'object'), in addition to looking for those Biolink slot values as well (in the RDF, if users opt to use them instead of OBAN).

@deepakunni3
Copy link
Member

The RdfSource reads triples and maps each predicate to the biolink vocabulary before adding the s,p,o to the graph. That means, http://purl.org/oban/association_has_subject is mapped to biolink:subject (thanks to the exact mappings that we have curated in the Biolink Model YAML). Similarly for http://purl.org/oban/association_has_predicate and http://purl.org/oban/association_has_object.

The purpose of http://purl.org/oban/association_has_subject, http://purl.org/oban/association_has_predicate, http://purl.org/oban/association_has_object being ignored is separate from this. That was just a quick fix (read as hack) to ignore the rare scenario where the RDF triple looks like so:

<https://monarchinitiative.org/MONARCH_cf73e4ecff5a6bfdb95de0d3acbdf186c6f8> oban:association_has_subject <http://purl.org/oban/association_has_subject>
<https://monarchinitiative.org/MONARCH_cf73e4ecff5a6bfdb95de0d3acbdf186c6f8> oban:association_has_predicate <http://purl.org/oban/association_has_predicate>
<https://monarchinitiative.org/MONARCH_cf73e4ecff5a6bfdb95de0d3acbdf186c6f8> oban:association_has_object <http://purl.org/oban/association_has_object>

@@ -273,7 +273,7 @@ def triple(self, s: URIRef, p: URIRef, o: URIRef) -> None:
# treating predicate as an edge
self.add_edge(s, o, p)

if len(self.edge_cache) >= self.CACHE_SIZE:
if self.edge_cache and len(self.edge_cache) <= self.CACHE_SIZE:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind this if clause is to:

  • test if edge_cache is greater than CACHE_SIZE. If it is greater than CACHE_SIZE then flush the cache by processing records in edge_cache.
  • whatever is left behind in edge_cache is then flushed in the end right before the RdfSource finishes with its parsing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepakunni3, I understand the intent of the first bullet point. That said, given how the edge_cache is populated, and the size of CACHE_SIZE, I wonder if this block of code will ever be called? I'm also now sure if the second bullet point is always properly done. Is there also a DRY issue here in both places. Sorry... been a few days since I looked at all of this, but when I did, the test behaviour seemed unusual.

Maybe after we diagnose the real sources of the unit test failures: are they reproducible in your hands? If not, then I should perhaps reinstall all my dependencies from scratch in case there is something stale in my testing environment. If so, then what has changed in the code base (or the critical semantic inputs, e.g. LinkML, BMT, Biolink Model?) to break the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd bullet point is done in the parse method of RdfSource where after the entire file is read, any remaining records in node and edge cache are processed. It would be a major bug if the parse method is not getting to the records left behind in the cache.

Regarding the tests - All unit tests seem to work fine. Perhaps a fresh install (and virtual env) will help?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the tests - All unit tests seem to work fine. Perhaps a fresh install (and virtual env) will help?

@deepakunni3 (cc: @sierra-moxon) I cloned the master branch and built a fresh venv and pip install ., on my Mac OSX laptop, but I am still getting a small handful of RDF / OWL unit test errors.

============================================= short test summary info ==============
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl1 - KeyError: ‘name’
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl2 - KeyError: ‘name’
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl3 - KeyError: ‘name’
FAILED tests/unit/test_source/test_rdf_source.py::test_read_nt5 - assert 0 == 14
FAILED tests/unit/test_source/test_rdf_source.py::test_read_nt6 - assert 0 == 14
================================5 failed, 35 passed, 1 warning in 49.94s ===========

Please kindly advise.

@RichardBruskiewich
Copy link
Collaborator Author

The RdfSource reads triples and maps each predicate to the biolink vocabulary before adding the s,p,o to the graph. That means, http://purl.org/oban/association_has_subject is mapped to biolink:subject (thanks to the exact mappings that we have curated in the Biolink Model YAML).,,,

BTW, @deepakunni3, when I single stepped into the code for dereify, for some reason, the Purl predicates are still found in the RDF input stream, hence, the edge records get rejected as invalid and missing the associated mappings (e.g. even if http://purl.org/oban/association_has_subject is there, it is not captured as a mapping to biolink:subject, at least, in the dereified() method).

@sierra-moxon
Copy link
Member

tests are failing for me as well, closing this PR in favor of one that fixes the tests locally. Then we can come back to this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants