Issue 350 patch #351

RichardBruskiewich · 2021-10-27T14:47:00Z

Splitting out RDF/OWL test failure patch from PR #349 (see this other PR).

…rsion (this eliminates the designated_type error, but still has some other unit test failures)

…isting size of the cache as smaller than the CACHE_SIZE?

balhoff · 2021-10-27T15:49:28Z

@RichardBruskiewich I don't think I have enough knowledge of this project to review, sorry!

RichardBruskiewich · 2021-10-27T17:10:05Z

> @RichardBruskiewich I don't think I have enough knowledge of this project to review, sorry!

Hi @balhoff (cc: @deepakunni3), maybe I can just get you to comment on the 'reified nodes' issue: which predicates should we expect in typical KGX RDF NT input triples?

The KGX dereify() method ignores OBAN node properties like http://purl.org/oban/association_has_subject, http://purl.org/oban/association_has_predicate and http://purl.org/oban/association_has_object, which are used in the sample OBAN NT data file (how widely are those properties used in RDF generally?).

I'm simply wondering whether the RDF source input code should translate these predicates into the Biolink Model slot values ('subject', 'predicate' and 'object'), while also looking for those Biolink slot values directly (in case users opt to use them in the RDF instead of OBAN).

deepakunni3 · 2021-11-03T16:23:50Z

The RdfSource reads triples and maps each predicate to the biolink vocabulary before adding the s,p,o to the graph. That means, http://purl.org/oban/association_has_subject is mapped to biolink:subject (thanks to the exact mappings that we have curated in the Biolink Model YAML). Similarly for http://purl.org/oban/association_has_predicate and http://purl.org/oban/association_has_object.

The reason http://purl.org/oban/association_has_subject, http://purl.org/oban/association_has_predicate and http://purl.org/oban/association_has_object are ignored is separate from this. That was just a quick fix (read: a hack) to skip the rare scenario where the RDF triples look like so:

<https://monarchinitiative.org/MONARCH_cf73e4ecff5a6bfdb95de0d3acbdf186c6f8> oban:association_has_subject <http://purl.org/oban/association_has_subject>
<https://monarchinitiative.org/MONARCH_cf73e4ecff5a6bfdb95de0d3acbdf186c6f8> oban:association_has_predicate <http://purl.org/oban/association_has_predicate>
<https://monarchinitiative.org/MONARCH_cf73e4ecff5a6bfdb95de0d3acbdf186c6f8> oban:association_has_object <http://purl.org/oban/association_has_object>
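The two behaviours described above (mapping OBAN predicates to Biolink slots, and skipping the rare triple whose object is the OBAN property itself) can be sketched roughly as follows. This is a minimal illustration, not the KGX code: `EXACT_MAPPINGS`, `map_predicate`, `is_self_referential` and `process` are hypothetical names, and the hard-coded table merely stands in for the exact mappings curated in the Biolink Model YAML.

```python
from typing import Optional, Tuple

OBAN = "http://purl.org/oban/"

# Hypothetical stand-in for the exact mappings curated in the Biolink Model YAML.
EXACT_MAPPINGS = {
    OBAN + "association_has_subject": "biolink:subject",
    OBAN + "association_has_predicate": "biolink:predicate",
    OBAN + "association_has_object": "biolink:object",
}


def map_predicate(p: str) -> str:
    """Return the Biolink slot for a predicate IRI, or the IRI itself if unmapped."""
    return EXACT_MAPPINGS.get(p, p)


def is_self_referential(p: str, o: str) -> bool:
    """True for the rare triple whose object is the OBAN property IRI itself."""
    return p in EXACT_MAPPINGS and o == p


def process(s: str, p: str, o: str) -> Optional[Tuple[str, str, str]]:
    """Skip self-referential OBAN triples; otherwise map the predicate."""
    if is_self_referential(p, o):
        return None
    return (s, map_predicate(p), o)
```

Under these assumptions, an ordinary OBAN triple comes out with `biolink:subject` as its predicate, while a triple shaped like the three above is dropped.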

deepakunni3 · 2021-11-03T16:34:11Z

kgx/source/rdf_source.py

@@ -273,7 +273,7 @@ def triple(self, s: URIRef, p: URIRef, o: URIRef) -> None:
            # treating predicate as an edge
            self.add_edge(s, o, p)

-        if len(self.edge_cache) >= self.CACHE_SIZE:
+        if self.edge_cache and len(self.edge_cache) <= self.CACHE_SIZE:


The idea behind this if clause is to:

test whether the size of edge_cache has reached CACHE_SIZE (the existing check is >=). If it has, flush the cache by processing the records in edge_cache.

whatever is left behind in edge_cache is then flushed at the end, right before the RdfSource finishes parsing.

@deepakunni3, I understand the intent of the first bullet point. That said, given how the edge_cache is populated, and the size of CACHE_SIZE, I wonder if this block of code will ever be called? I'm also not sure that the second bullet point is always properly done. Is there also a DRY issue here, in both places? Sorry... it has been a few days since I looked at all of this, but when I did, the test behaviour seemed unusual.

Maybe we can revisit this after we diagnose the real sources of the unit test failures: are they reproducible in your hands? If not, then I should perhaps reinstall all my dependencies from scratch in case there is something stale in my testing environment. If so, then what has changed in the code base (or the critical semantic inputs, e.g. LinkML, BMT, Biolink Model?) to break the tests?

The 2nd bullet point is done in the parse method of RdfSource where after the entire file is read, any remaining records in node and edge cache are processed. It would be a major bug if the parse method is not getting to the records left behind in the cache.
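The intended flush pattern from the two bullet points can be sketched as below. This is a simplified, hypothetical reader rather than the real RdfSource: the class name `CachedEdgeReader`, the `sink` callback and the tiny `CACHE_SIZE` are assumptions made purely for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Edge = Tuple[str, str, str]


class CachedEdgeReader:
    """Hypothetical reader that batches edges in a cache before processing them."""

    CACHE_SIZE = 3  # deliberately tiny so the behaviour is easy to see

    def __init__(self, sink: Callable[[List[Edge]], None]) -> None:
        self.edge_cache: List[Edge] = []
        self.sink = sink  # receives each flushed batch

    def _flush(self) -> None:
        """Process whatever is in the cache, then empty it."""
        if self.edge_cache:
            self.sink(list(self.edge_cache))
            self.edge_cache.clear()

    def triple(self, s: str, p: str, o: str) -> None:
        self.edge_cache.append((s, p, o))
        # First bullet: flush once the cache has reached CACHE_SIZE.
        if len(self.edge_cache) >= self.CACHE_SIZE:
            self._flush()

    def parse(self, triples: Iterable[Edge]) -> None:
        for s, p, o in triples:
            self.triple(s, p, o)
        # Second bullet: flush the leftovers after the whole input is read.
        self._flush()


batches: List[List[Edge]] = []
reader = CachedEdgeReader(batches.append)
reader.parse([(str(i), "p", "o") for i in range(7)])
batch_sizes = [len(b) for b in batches]
```

With seven input triples, this sketch produces two full batches and one final partial batch. In this sketch only, changing the comparison to `self.edge_cache and len(self.edge_cache) <= self.CACHE_SIZE` would flush after every single triple, since a cache holding one record already satisfies that condition.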

Regarding the tests - All unit tests seem to work fine. Perhaps a fresh install (and virtual env) will help?

> Regarding the tests - All unit tests seem to work fine. Perhaps a fresh install (and virtual env) will help?

@deepakunni3 (cc: @sierra-moxon) I cloned the master branch, built a fresh venv and ran pip install . on my Mac OSX laptop, but I am still getting a small handful of RDF / OWL unit test errors.

============================================= short test summary info ==============
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl1 - KeyError: 'name'
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl2 - KeyError: 'name'
FAILED tests/unit/test_source/test_owl_source.py::test_read_owl3 - KeyError: 'name'
FAILED tests/unit/test_source/test_rdf_source.py::test_read_nt5 - assert 0 == 14
FAILED tests/unit/test_source/test_rdf_source.py::test_read_nt6 - assert 0 == 14
================================ 5 failed, 35 passed, 1 warning in 49.94s ===========

Please kindly advise.

RichardBruskiewich · 2021-11-03T17:51:16Z

> The RdfSource reads triples and maps each predicate to the biolink vocabulary before adding the s,p,o to the graph. That means, http://purl.org/oban/association_has_subject is mapped to biolink:subject (thanks to the exact mappings that we have curated in the Biolink Model YAML)...

BTW, @deepakunni3, when I single-stepped into the code for dereify, for some reason the PURL predicates were still present in the RDF input stream. Hence, the edge records get rejected as invalid and missing the associated mappings (e.g. even if http://purl.org/oban/association_has_subject is there, it is not captured as a mapping to biolink:subject, at least in the dereify() method).
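The failure mode described here can be illustrated with a small sketch of dereification. It is not the KGX dereify() implementation: the function signature, `REQUIRED_SLOTS` and `BIOLINK_PREFIX` are assumptions for illustration. The point is only that a reified association node whose properties are still keyed by OBAN IRIs (rather than Biolink slots) lacks the required subject/predicate/object slots and is rejected.

```python
from typing import Dict, Optional, Tuple

BIOLINK_PREFIX = "biolink:"
REQUIRED_SLOTS = ("subject", "predicate", "object")


def dereify(properties: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Turn a reified association node into a (subject, predicate, object) edge.

    Only properties already expressed as Biolink slots are recognised;
    returns None when any required slot is missing.
    """
    slots: Dict[str, str] = {}
    for key, value in properties.items():
        if key.startswith(BIOLINK_PREFIX):
            slots[key[len(BIOLINK_PREFIX):]] = value
    if any(slot not in slots for slot in REQUIRED_SLOTS):
        return None
    return (slots["subject"], slots["predicate"], slots["object"])


# Properties already mapped to Biolink slots (illustrative values only).
mapped = {
    "biolink:subject": "HGNC:1",
    "biolink:predicate": "biolink:related_to",
    "biolink:object": "MONDO:1",
}

# The same association with the OBAN IRIs left unmapped.
unmapped = {
    "http://purl.org/oban/association_has_subject": "HGNC:1",
    "http://purl.org/oban/association_has_predicate": "biolink:related_to",
    "http://purl.org/oban/association_has_object": "MONDO:1",
}
```

In this sketch `dereify(mapped)` yields an edge, while `dereify(unmapped)` returns `None`, which matches the symptom of edge records being rejected when the predicate mapping has not happened before dereification.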

sierra-moxon · 2021-12-15T17:46:40Z

Tests are failing for me as well, so I am closing this PR in favor of one that fixes the tests locally. Then we can come back to this one.

Richard Bruskiewich added 2 commits October 26, 2021 13:52

Update requirements.txt of KGX to be more permissive of its linkml version (this eliminates the designated_type error, but still has some other unit test failures)

6cafd9c

cache size test seems relationally inverted: should be testing the existing size of the cache as smaller than the CACHE_SIZE?

1a195d1

RichardBruskiewich requested review from deepakunni3 and balhoff October 27, 2021 14:47

RichardBruskiewich mentioned this pull request Oct 27, 2021

Patch for issues 348 #349

Merged

RichardBruskiewich mentioned this pull request Oct 27, 2021

Switch from neo4jrestclient to neo4j for Neo4j 4.3 compatibility #343

Merged

deepakunni3 requested changes Nov 3, 2021

sierra-moxon closed this Dec 15, 2021
