Merge pull request #365 from biolink/issue-338-enhanced-kgx-transform-logging

Issue 338 enhanced kgx transform logging
sierra-moxon committed Jan 15, 2022
2 parents 2e20cc5 + 0732221 commit e25c99d
Showing 67 changed files with 2,185 additions and 1,348 deletions.
3 changes: 0 additions & 3 deletions .github/workflows/run_tests.yml
@@ -4,9 +4,6 @@ name: Run tests

# Controls when the action will run.
on:
# Triggers the workflow on push or pull request events but only for the master branch
push:
branches: [ master ]
pull_request:
types: [opened, synchronize, reopened]

4 changes: 1 addition & 3 deletions .github/workflows/run_tox.yml
@@ -2,10 +2,8 @@ name: Run Tox

on:
# Triggers the workflow on push or pull request events but only for the master branch
push:
branches: [ master ]
pull_request:
types: [opened, synchronize, reopened]
types: [opened]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
80 changes: 79 additions & 1 deletion README.md
@@ -22,7 +22,6 @@ KGX allows conversion to and from:
* Reasoner Standard API format
* OBOGraph JSON format


KGX will also provide validation, to ensure the KGs are conformant to the Biolink Model: making sure nodes are
categorized using Biolink classes, edges are labeled using valid Biolink relationship types, and valid properties are used.

@@ -32,6 +31,85 @@ The structure of this graph is expected to conform to the Biolink Model standard

In addition to the main code-base, KGX also provides a series of [command line operations](https://kgx.readthedocs.io/en/latest/examples.html#using-kgx-cli).

### Error Detection and Reporting

KGX Transformer, Validator, GraphSummary and MetaKnowledgeGraph operations now provide non-redundant, structured error logging in JSON format. The unit tests illustrate the general design pattern; the example here uses the Validator:

```python
from kgx.validator import Validator
from kgx.transformer import Transformer

Validator.set_biolink_model("2.11.0")

# Validator assumes the currently set Biolink Release
validator = Validator()

transformer = Transformer(stream=True)

transformer.transform(
    input_args={
        "filename": [
            "graph_nodes.tsv",
            "graph_edges.tsv",
        ],
        "format": "tsv",
    },
    output_args={
        "format": "null"
    },
    inspector=validator,
)

# Both the Validator and the Transformer can independently capture errors

# The Validator, from the overall semantics of the graph...
# Here, we just report severe Errors from the Validator (no Warnings)
with open("validation_errors.json", "w") as report_file:
    validator.write_report(report_file, "Error")

# The Transformer, from the syntax of the input files...
# Here, we catch *all* Errors and Warnings (by not providing a filter)
with open("input_errors.json", "w") as report_file:
    transformer.write_report(report_file)
```

The JSON error outputs will look something like this:

```json
{
    "ERROR": {
        "MISSING_EDGE_PROPERTY": {
            "Required edge property 'id' is missing": [
                "A:123->X:1",
                "B:456->Y:2"
            ],
            "Required edge property 'object' is missing": [
                "A:123->X:1"
            ],
            "Required edge property 'predicate' is missing": [
                "A:123->X:1"
            ],
            "Required edge property 'subject' is missing": [
                "A:123->X:1",
                "B:456->Y:2"
            ]
        }
    },
    "WARNING": {
        "DUPLICATE_NODE": {
            "Node 'id' duplicated in input data": [
                "MONDO:0010011",
                "REACT:R-HSA-5635838"
            ]
        }
    }
}
```

This system reduces the significant redundancy of the earlier line-oriented KGX text logs: graph entities with the same class of error are aggregated into lists of names or identifiers at the leaf level of the JSON structure.
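
The aggregation idea can be illustrated with a small standalone sketch. This is not the KGX implementation: the `AggregatingErrorLog` class, its method signatures, and the `as_dict` helper are hypothetical names chosen for illustration, and only the standard library is used. It shows how logging the same message for several entities (including a repeated entity) collapses into a single list per message.

```python
import json
from collections import defaultdict


class AggregatingErrorLog:
    """Hypothetical stand-in for an aggregating error log (not KGX code)."""

    def __init__(self):
        # level -> error type -> message -> set of entity identifiers
        self._errors = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

    def log_error(self, entity, error_type, message, level="ERROR"):
        # A set keeps each entity at most once per message,
        # so repeated reports of the same problem add nothing.
        self._errors[level][error_type][message].add(entity)

    def as_dict(self):
        # Convert to plain dicts and sorted lists so the result is JSON-serializable.
        return {
            level: {
                error_type: {msg: sorted(ids) for msg, ids in messages.items()}
                for error_type, messages in types.items()
            }
            for level, types in self._errors.items()
        }


log = AggregatingErrorLog()
missing_id = "Required edge property 'id' is missing"
log.log_error("A:123->X:1", "MISSING_EDGE_PROPERTY", missing_id)
log.log_error("B:456->Y:2", "MISSING_EDGE_PROPERTY", missing_id)
log.log_error("A:123->X:1", "MISSING_EDGE_PROPERTY", missing_id)  # duplicate, ignored
log.log_error(
    "MONDO:0010011",
    "DUPLICATE_NODE",
    "Node 'id' duplicated in input data",
    level="WARNING",
)

report = log.as_dict()
print(json.dumps(report, indent=4))
```

A line-oriented log would have emitted one line per call; here the three `MISSING_EDGE_PROPERTY` calls produce one message key with two identifiers.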

The top-level JSON tags originate from the `MessageLevel` class and the second-level tags from the `ErrorType` class in the [error_detection](kgx/error_detection.py) module, while the third-level messages are hard-coded as `log_error` method messages in the code.
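
Because the report is plain nested JSON (level, then error type, then message, then a list of identifiers), it can be post-processed with the standard library alone. The following is a minimal sketch under that assumption; `summarize_report` and the inline `sample` data are hypothetical and not part of KGX. It counts the distinct entities affected under each level and error type.

```python
from typing import Dict, List, Tuple

# level -> error type -> message -> list of entity identifiers
Report = Dict[str, Dict[str, Dict[str, List[str]]]]


def summarize_report(report: Report) -> Dict[Tuple[str, str], int]:
    """Count distinct entities per (message level, error type) pair."""
    counts: Dict[Tuple[str, str], int] = {}
    for level, error_types in report.items():
        for error_type, messages in error_types.items():
            entities = set()
            for identifiers in messages.values():
                # The same entity may appear under several messages.
                entities.update(identifiers)
            counts[(level, error_type)] = len(entities)
    return counts


# Inline sample mirroring the report layout; in practice the dictionary
# would come from json.load() on a written report file.
sample: Report = {
    "ERROR": {
        "MISSING_EDGE_PROPERTY": {
            "Required edge property 'id' is missing": ["A:123->X:1", "B:456->Y:2"],
            "Required edge property 'object' is missing": ["A:123->X:1"],
        }
    },
    "WARNING": {
        "DUPLICATE_NODE": {
            "Node 'id' duplicated in input data": ["MONDO:0010011", "REACT:R-HSA-5635838"]
        }
    },
}

summary = summarize_report(sample)
for (level, error_type), count in sorted(summary.items()):
    print(f"{level} {error_type}: {count} distinct entities")
```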

It is likely that additional error conditions within KGX can be efficiently captured and reported in the future using this general framework.

## Installation

3 changes: 3 additions & 0 deletions kgx/__init__.py
@@ -1 +1,4 @@
"""
KGX Package
"""
__version__ = "1.5.4"
19 changes: 9 additions & 10 deletions kgx/cli/__init__.py
@@ -1,5 +1,5 @@
from sys import exit
from typing import List, Tuple, Optional
from typing import List, Tuple, Optional, Dict
import click

import kgx
@@ -17,7 +17,6 @@
summary_report_types,
get_report_format_types,
)
from kgx.validator import ValidationError, MessageLevel, ErrorType

log = get_logger()
config = get_config()
@@ -56,7 +55,7 @@ def cli():
"-r",
required=False,
type=str,
help=f"The summary report type. Must be one of {tuple(summary_report_types.keys())}",
help=f"The summary get_errors type. Must be one of {tuple(summary_report_types.keys())}",
default="kgx-map",
)
@click.option(
@@ -88,7 +87,7 @@ def cli():
"-l",
required=False,
type=click.Path(exists=False),
help='File within which to report graph data parsing errors (default: "stderr")',
help='File within which to get_errors graph data parsing errors (default: "stderr")',
)
def graph_summary_wrapper(
inputs: List[str],
@@ -118,9 +117,9 @@ def graph_summary_wrapper(
output: Optional[str]
Where to write the output (stdout, by default)
report_type: str
The summary report type
The summary get_errors type
report_format: Optional[str]
The summary report format file types: 'yaml' or 'json' (default is report_type specific)
The summary get_errors format file types: 'yaml' or 'json' (default is report_type specific)
stream: bool
Whether to parse input as a stream
graph_name: str
@@ -206,14 +205,14 @@ def validate_wrapper(
biolink_release: Optional[str]
SemVer version of Biolink Model Release used for validation (default: latest Biolink Model Toolkit version)
"""
errors: List[ValidationError] = []
errors = []
try:
errors: List[ValidationError] = validate(
errors = validate(
inputs, input_format, input_compression, output, stream, biolink_release
)
except Exception as ex:
ve = ValidationError("Graph", ErrorType.VALIDATION_SYSTEM_ERROR, str(ex), MessageLevel.ERROR)
errors.append(ve)
get_logger().error(str(ex))
exit(2)

if errors:
get_logger().error("kgx.validate() errors encountered... check the error log")
13 changes: 7 additions & 6 deletions kgx/cli/cli_utils.py
@@ -3,7 +3,7 @@
import os
from os.path import dirname, abspath

import sys
from sys import stdout
from multiprocessing import Pool
from typing import List, Tuple, Optional, Dict, Set, Any, Union
import yaml
@@ -163,7 +163,7 @@ def graph_summary(
with open(output, "w") as gsr:
inspector.save(gsr, file_format=report_format)
else:
inspector.save(sys.stdout, file_format=report_format)
inspector.save(stdout, file_format=report_format)

# ... Third, we directly return the graph statistics to the caller.
return inspector.get_graph_summary()
@@ -176,7 +176,7 @@ def validate(
output: Optional[str],
stream: bool,
biolink_release: Optional[str] = None,
) -> List:
) -> Dict:
"""
Run KGX validator on an input file to check for Biolink Model compliance.
@@ -194,10 +194,11 @@ def validate(
Whether to parse input as a stream.
biolink_release: Optional[str] = None
SemVer version of Biolink Model Release used for validation (default: latest Biolink Model Toolkit version)
Returns
-------
List
Returns a list of errors, if any
Dict
A dictionary of entities which have parse errors indexed by [message_level][error_type][message]
"""
# New design pattern enabling 'stream' processing of statistics on a small memory footprint
@@ -248,7 +249,7 @@ def validate(
if output:
validator.write_report(open(output, "w"))
else:
validator.write_report(sys.stdout)
validator.write_report(stdout)

# ... Third, we return directly any validation errors to the caller
return validator.get_errors()