-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 350 patch #351
Issue 350 patch #351
Conversation
…rsion (this eliminates the designated_type error, but still has some other unit test failures)
…isting size of the cache as smaller than the CACHE_SIZE?
@RichardBruskiewich I don't think I have enough knowledge of this project to review, sorry! |
Hi @balhoff, (cc: @deepakunni3) maybe I can just get you to comment on the 'reified nodes' issue: what are we expecting as Predicates as typical KGX RDF NT input triples? The KGX dereify() method ignores OBAN node properties like I'm simply wondering whether or not the RDF source input code simply translate these predicates into the Biolink Model slot values ('subject', 'predicate' and 'object'), in addition to looking for those Biolink slot values as well (in the RDF, if users opt to use them instead of OBAN). |
The RdfSource reads triples and maps each predicate to the biolink vocabulary before adding the s,p,o to the graph. That means, The purpose of
|
@@ -273,7 +273,7 @@ def triple(self, s: URIRef, p: URIRef, o: URIRef) -> None: | |||
# treating predicate as an edge | |||
self.add_edge(s, o, p) | |||
|
|||
if len(self.edge_cache) >= self.CACHE_SIZE: | |||
if self.edge_cache and len(self.edge_cache) <= self.CACHE_SIZE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind this if clause is to:
- test if edge_cache is greater than CACHE_SIZE. If it is greater than CACHE_SIZE then flush the cache by processing records in edge_cache.
- whatever is left behind in edge_cache is then flushed in the end right before the RdfSource finishes with its parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deepakunni3, I understand the intent of the first bullet point. That said, given how the edge_cache
is populated, and the size of CACHE_SIZE
, I wonder if this block of code will ever be called? I'm also now sure if the second bullet point is always properly done. Is there also a DRY issue here in both places. Sorry... been a few days since I looked at all of this, but when I did, the test behaviour seemed unusual.
Maybe after we diagnose the real sources of the unit test failures: are they reproducible in your hands? If not, then I should perhaps reinstall all my dependencies from scratch in case there is something stale in my testing environment. If so, then what has changed in the code base (or the critical semantic inputs, e.g. LinkML, BMT, Biolink Model?) to break the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2nd bullet point is done in the parse
method of RdfSource where after the entire file is read, any remaining records in node and edge cache are processed. It would be a major bug if the parse method is not getting to the records left behind in the cache.
Regarding the tests - All unit tests seem to work fine. Perhaps a fresh install (and virtual env) will help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the tests - All unit tests seem to work fine. Perhaps a fresh install (and virtual env) will help?
@deepakunni3 (cc: @sierra-moxon) I cloned the master branch and built a fresh venv
and pip install .
, on my Mac OSX laptop, but I am still getting a small handful of RDF / OWL unit test errors.
============================================= short test summary info ==============
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl1 - KeyError: ‘name’
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl2 - KeyError: ‘name’
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl3 - KeyError: ‘name’
FAILED tests/unit/test_source/test_rdf_source.py::test_read_nt5 - assert 0 == 14
FAILED tests/unit/test_source/test_rdf_source.py::test_read_nt6 - assert 0 == 14
================================5 failed, 35 passed, 1 warning in 49.94s ===========
Please kindly advise.
BTW, @deepakunni3, when I single stepped into the code for dereify, for some reason, the Purl predicates are still found in the RDF input stream, hence, the edge records get rejected as invalid and missing the associated mappings (e.g. even if |
tests are failing for me as well, closing this PR in favor of one that fixes the tests locally. Then we can come back to this one. |
Splitting out RDF/OWL test failure patch from PR #349 (see this other PR).