Issue 338 enhanced kgx transform logging #365

Merged 62 commits on Jan 15, 2022
Changes from 61 commits
Commits
1fb6d39
Extend 'ValidationError' usage to "classical" KGX graph-summary
Nov 23, 2021
64960d1
Working to generalize error checking in graph-summary - initial commit
Nov 23, 2021
5c320af
Implemented new "ErrorDetecting" error logging class (first iteration…
Nov 26, 2021
99b46cf
Move general error reporting (super)classes to root KGX package pytho…
Nov 27, 2021
a898a6b
Start migrating Validator to have ErrorDetecting superclass (but need…
Nov 27, 2021
28c9ca7
Convert all static validation methods to instance validation methods;…
Nov 27, 2021
59cd5a5
Fixing and troubleshooting unit tests still using legacy model of err…
Nov 27, 2021
41c63a1
Finish cleaning up unit tests to reflect more recent Biolink Model re…
Nov 27, 2021
94f160b
report() renamed to 'get_errors' with original 'get_errors' renamed t…
Nov 27, 2021
ef220a8
Delete 'parse_error' replacing with 'log_error'; fixed python-dateuti…
Nov 27, 2021
3243ee7
Stray legacy self.errors List initialization removed
Nov 27, 2021
65f7d1a
Next iteration complete, removing redundancy in entities and error re…
Nov 27, 2021
06722d8
Implemented structured, relatively non-redundant, JSON error log outp…
Nov 27, 2021
128be58
Moved error detection code to its own module; fixed imports; some min…
Nov 28, 2021
e72215c
cleaned up inline comments
Nov 28, 2021
8b88af4
Introducing ErrorDetecting class into Transformer, and propagate Tran…
Nov 30, 2021
d0e887b
Embed some log_error reporting into Source.py
Nov 30, 2021
926dd3f
Replacing KGX Source code exception & logger errors/warnings plus som…
Dec 1, 2021
80288d3
ditto except for KGX Sink code
Dec 1, 2021
8ef1d8a
Simplify calls to log_error
Dec 1, 2021
84a254c
Extending 'source' tests to also test the Transformer 'ErrorDetectin…
Dec 2, 2021
a6fe9b5
Add ErrorDetecting tests added to summarize-graph and meta-knowledge-…
Dec 2, 2021
c175b6c
Adding error message testing in obograph unit test; renamed 'Error…
Dec 3, 2021
df97b7b
added a TSV test; renamed ErrorType.NO_EDGE_PREDICATE to MISSING_ED…
Dec 3, 2021
b69d0f2
Add a bit of README information about the new JSON KGX error reporting
Dec 3, 2021
e4e6f29
fix conflicts
sierra-moxon Jan 7, 2022
9cb60e7
fix conflicts
sierra-moxon Jan 7, 2022
68aa3fa
Remove unneeded @pytest.skip directives for non-Neo4j tests
Jan 13, 2022
f74702d
format tweak
Jan 13, 2022
7dcea65
Merge branch 'issue-338-categorization-of-KGX-transform-warnings-and-…
Jan 13, 2022
3f23231
Merge branch 'master' into issue-338-enhanced-kgx-transform-logging
sierra-moxon Jan 13, 2022
339f3b9
fixing tests
sierra-moxon Jan 13, 2022
4c733d5
fixing tests
sierra-moxon Jan 13, 2022
de413f9
fix neo4j param passing
sierra-moxon Jan 14, 2022
541056a
fixing tests
sierra-moxon Jan 14, 2022
541dfc4
fix merge conflict
sierra-moxon Jan 14, 2022
1762ecf
fix compile errors
sierra-moxon Jan 14, 2022
f4548a0
fixing infores tests
sierra-moxon Jan 14, 2022
c794cd7
fixing infores tests
sierra-moxon Jan 15, 2022
91d4aa3
fix test
sierra-moxon Jan 15, 2022
349d46a
test munging
sierra-moxon Jan 15, 2022
3c02c8b
code smells
sierra-moxon Jan 15, 2022
ea3949f
code smell
sierra-moxon Jan 15, 2022
0b63035
add test, remove print statements
sierra-moxon Jan 15, 2022
ae283a4
getting rid of stubbed code
sierra-moxon Jan 15, 2022
bbf6e9e
getting rid of stubbed code
sierra-moxon Jan 15, 2022
bf22645
fix tests and add tests to increase code coverage
sierra-moxon Jan 15, 2022
ca47e3a
add more tests for code coverage
sierra-moxon Jan 15, 2022
838ea7b
more tests for code coverage
sierra-moxon Jan 15, 2022
2af7ae2
more tests for code coverage
sierra-moxon Jan 15, 2022
26be2db
more coverage tests
sierra-moxon Jan 15, 2022
c5d361f
more coverage tests
sierra-moxon Jan 15, 2022
b746f8c
more coverage tests
sierra-moxon Jan 15, 2022
c4c8a63
more coverage tests
sierra-moxon Jan 15, 2022
621542f
code coverage test
sierra-moxon Jan 15, 2022
f7e5c82
fix wrapper test
sierra-moxon Jan 15, 2022
6dd8be8
more tests
sierra-moxon Jan 15, 2022
40944ee
another test
sierra-moxon Jan 15, 2022
30ff682
another test
sierra-moxon Jan 15, 2022
7726acf
enable test
sierra-moxon Jan 15, 2022
f1859db
more tests
sierra-moxon Jan 15, 2022
0732221
tests
sierra-moxon Jan 15, 2022
3 changes: 0 additions & 3 deletions .github/workflows/run_tests.yml
@@ -4,9 +4,6 @@ name: Run tests

# Controls when the action will run.
on:
# Triggers the workflow on push or pull request events but only for the master branch
push:
branches: [ master ]
pull_request:
types: [opened, synchronize, reopened]

4 changes: 1 addition & 3 deletions .github/workflows/run_tox.yml
@@ -2,10 +2,8 @@ name: Run Tox

on:
# Triggers the workflow on push or pull request events but only for the master branch
-  push:
-    branches: [ master ]
   pull_request:
-    types: [opened, synchronize, reopened]
+    types: [opened]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
80 changes: 79 additions & 1 deletion README.md
@@ -22,7 +22,6 @@ KGX allows conversion to and from:
* Reasoner Standard API format
* OBOGraph JSON format


KGX will also provide validation, to ensure the KGs are conformant to the Biolink Model: making sure nodes are
categorized using Biolink classes, edges are labeled using valid Biolink relationship types, and valid properties are used.

@@ -32,6 +31,85 @@ The structure of this graph is expected to conform to the Biolink Model standard

In addition to the main code-base, KGX also provides a series of [command line operations](https://kgx.readthedocs.io/en/latest/examples.html#using-kgx-cli).

### Error Detection and Reporting

Non-redundant JSON-formatted structured error logging is now provided in KGX Transformer, Validator, GraphSummary and MetaKnowledgeGraph operations. See the various unit tests for the general design pattern (using the Validator as an example here):

```python
from kgx.validator import Validator
from kgx.transformer import Transformer

Validator.set_biolink_model("2.11.0")

# Validator assumes the currently set Biolink Release
validator = Validator()

transformer = Transformer(stream=True)

transformer.transform(
input_args = {
"filename": [
"graph_nodes.tsv",
"graph_edges.tsv",
],
"format": "tsv",
},
output_args={
"format": "null"
},
inspector=validator,
)

# Both the Validator and the Transformer can independently capture errors

# The Validator, from the overall semantics of the graph...
# Here, we just report severe Errors from the Validator (no Warnings)
validator.write_report(open("validation_errors.json", "w"), "Error")

# The Transformer, from the syntax of the input files...
# Here, we catch *all* Errors and Warnings (by not providing a filter)
transformer.write_report(open("input_errors.json", "w"))
```

The JSON error outputs will look something like this:

```json
{
"ERROR": {
"MISSING_EDGE_PROPERTY": {
"Required edge property 'id' is missing": [
"A:123->X:1",
"B:456->Y:2"
],
"Required edge property 'object' is missing": [
"A:123->X:1"
],
"Required edge property 'predicate' is missing": [
"A:123->X:1"
],
"Required edge property 'subject' is missing": [
"A:123->X:1",
"B:456->Y:2"
]
}
},
"WARNING": {
"DUPLICATE_NODE": {
"Node 'id' duplicated in input data": [
"MONDO:0010011",
"REACT:R-HSA-5635838"
]
}
}
}

```

This system reduces the significant redundancies of earlier line-oriented KGX logging text output files, in that graph entities with the same class of error are simply aggregated in lists of names/identifiers at the leaf level of the JSON structure.

The top level JSON tags originate from the `MessageLevel` class and the second level tags from the `ErrorType` class in the [error_detection](kgx/error_detection.py) module, while the third level messages are hard coded as `log_error` method messages in the code.

It is likely that additional error conditions within KGX can be efficiently captured and reported in the future using this general framework.
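
The three-level layout described above (message level, then error type, then message, with a list of entity identifiers at the leaves) can be pictured as a nested dictionary. The following is a minimal sketch of that idea and not the KGX implementation: the `ErrorLog` class, the `Level` enum and the `to_json` method are hypothetical names chosen for illustration, and the `log_error` signature shown here is an assumption. Only the level/type/message nesting is taken from the text above.

```python
import json
from enum import Enum
from typing import Dict, List


class Level(Enum):
    """Hypothetical stand-in for the message levels used as top-level JSON tags."""
    ERROR = "ERROR"
    WARNING = "WARNING"


class ErrorLog:
    """Hypothetical aggregator: level -> error type -> message -> [entity ids]."""

    def __init__(self) -> None:
        self._log: Dict[str, Dict[str, Dict[str, List[str]]]] = {}

    def log_error(
        self,
        entity: str,
        error_type: str,
        message: str,
        level: Level = Level.ERROR,
    ) -> None:
        # Create each nesting level on demand, then record the entity once.
        entities = (
            self._log.setdefault(level.value, {})
            .setdefault(error_type, {})
            .setdefault(message, [])
        )
        if entity not in entities:
            entities.append(entity)

    def to_json(self) -> str:
        # Sorted keys keep the output stable between runs.
        return json.dumps(self._log, indent=4, sort_keys=True)


if __name__ == "__main__":
    log = ErrorLog()
    message = "Required edge property 'id' is missing"
    log.log_error("A:123->X:1", "MISSING_EDGE_PROPERTY", message)
    log.log_error("B:456->Y:2", "MISSING_EDGE_PROPERTY", message)
    log.log_error(
        "MONDO:0010011",
        "DUPLICATE_NODE",
        "Node 'id' duplicated in input data",
        Level.WARNING,
    )
    print(log.to_json())
```

Because each message text is stored once as a key, reporting the same problem for many entities adds only one identifier per entity rather than one full log line, which is the redundancy reduction the section describes.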

## Installation

3 changes: 3 additions & 0 deletions kgx/__init__.py
@@ -1 +1,4 @@
"""
KGX Package
"""
__version__ = "1.5.4"
19 changes: 9 additions & 10 deletions kgx/cli/__init__.py
@@ -1,5 +1,5 @@
from sys import exit
-from typing import List, Tuple, Optional
+from typing import List, Tuple, Optional, Dict
import click

import kgx
@@ -17,7 +17,6 @@
summary_report_types,
get_report_format_types,
)
-from kgx.validator import ValidationError, MessageLevel, ErrorType

log = get_logger()
config = get_config()
@@ -56,7 +55,7 @@ def cli():
"-r",
required=False,
type=str,
-    help=f"The summary report type. Must be one of {tuple(summary_report_types.keys())}",
+    help=f"The summary get_errors type. Must be one of {tuple(summary_report_types.keys())}",
default="kgx-map",
)
@click.option(
@@ -88,7 +87,7 @@ def cli():
"-l",
required=False,
type=click.Path(exists=False),
-    help='File within which to report graph data parsing errors (default: "stderr")',
+    help='File within which to get_errors graph data parsing errors (default: "stderr")',
)
def graph_summary_wrapper(
inputs: List[str],
@@ -118,9 +117,9 @@ def graph_summary_wrapper(
output: Optional[str]
Where to write the output (stdout, by default)
     report_type: str
-        The summary report type
+        The summary get_errors type
     report_format: Optional[str]
-        The summary report format file types: 'yaml' or 'json' (default is report_type specific)
+        The summary get_errors format file types: 'yaml' or 'json' (default is report_type specific)
stream: bool
Whether to parse input as a stream
graph_name: str
@@ -206,14 +205,14 @@ def validate_wrapper(
biolink_release: Optional[str]
SemVer version of Biolink Model Release used for validation (default: latest Biolink Model Toolkit version)
"""
-    errors: List[ValidationError] = []
+    errors = []
     try:
-        errors: List[ValidationError] = validate(
+        errors = validate(
             inputs, input_format, input_compression, output, stream, biolink_release
         )
     except Exception as ex:
-        ve = ValidationError("Graph", ErrorType.VALIDATION_SYSTEM_ERROR, str(ex), MessageLevel.ERROR)
-        errors.append(ve)
+        get_logger().error(str(ex))
+        exit(2)

     if errors:
         get_logger().error("kgx.validate() errors encountered... check the error log")
13 changes: 7 additions & 6 deletions kgx/cli/cli_utils.py
@@ -3,7 +3,7 @@
import os
from os.path import dirname, abspath

-import sys
+from sys import stdout
from multiprocessing import Pool
from typing import List, Tuple, Optional, Dict, Set, Any, Union
import yaml
@@ -163,7 +163,7 @@ def graph_summary(
with open(output, "w") as gsr:
inspector.save(gsr, file_format=report_format)
else:
-        inspector.save(sys.stdout, file_format=report_format)
+        inspector.save(stdout, file_format=report_format)

# ... Third, we directly return the graph statistics to the caller.
return inspector.get_graph_summary()
@@ -176,7 +176,7 @@ def validate(
output: Optional[str],
stream: bool,
biolink_release: Optional[str] = None,
-) -> List:
+) -> Dict:
"""
Run KGX validator on an input file to check for Biolink Model compliance.

@@ -194,10 +194,11 @@ def validate(
Whether to parse input as a stream.
biolink_release: Optional[str] = None
SemVer version of Biolink Model Release used for validation (default: latest Biolink Model Toolkit version)

Returns
-------
-    List
-        Returns a list of errors, if any
+    Dict
+        A dictionary of entities which have parse errors indexed by [message_level][error_type][message]

"""
# New design pattern enabling 'stream' processing of statistics on a small memory footprint
@@ -248,7 +249,7 @@ def validate(
if output:
validator.write_report(open(output, "w"))
else:
-        validator.write_report(sys.stdout)
+        validator.write_report(stdout)

# ... Third, we return directly any validation errors to the caller
return validator.get_errors()
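
With this change `validate()` returns a dictionary indexed by `[message_level][error_type][message]` instead of a flat list, so a caller has to walk three levels to reach individual entities. Below is a minimal sketch of how such a result might be consumed. `iter_errors` and `count_by_level` are hypothetical helpers that are not part of KGX, and the sample dictionary simply mirrors the example shown in the README section of this pull request.

```python
from typing import Dict, Iterator, List, Tuple

# Assumed shape of the returned value: level -> error type -> message -> [entity ids]
ErrorDict = Dict[str, Dict[str, Dict[str, List[str]]]]


def iter_errors(errors: ErrorDict) -> Iterator[Tuple[str, str, str, str]]:
    """Flatten the nested dictionary into (level, error_type, message, entity) tuples."""
    for level, by_type in errors.items():
        for error_type, by_message in by_type.items():
            for message, entities in by_message.items():
                for entity in entities:
                    yield level, error_type, message, entity


def count_by_level(errors: ErrorDict) -> Dict[str, int]:
    """Count the recorded entity entries under each message level."""
    counts: Dict[str, int] = {}
    for level, _error_type, _message, _entity in iter_errors(errors):
        counts[level] = counts.get(level, 0) + 1
    return counts


if __name__ == "__main__":
    sample: ErrorDict = {
        "ERROR": {
            "MISSING_EDGE_PROPERTY": {
                "Required edge property 'id' is missing": ["A:123->X:1", "B:456->Y:2"],
            }
        },
        "WARNING": {
            "DUPLICATE_NODE": {
                "Node 'id' duplicated in input data": ["MONDO:0010011"],
            }
        },
    }
    for row in iter_errors(sample):
        print(row)
    print(count_by_level(sample))
```

An empty dictionary is falsy, so a truthiness check such as `if errors:` in a caller behaves the same way for the new dictionary result as it did for an empty list.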