
run results reporting is a mess #2493

Closed
beckjake opened this issue May 27, 2020 · 9 comments · Fixed by #2943
Labels: 1.0.0 (Issues related to the 1.0.0 release of dbt), artifacts, enhancement (New feature or request)

@beckjake (Contributor) commented May 27, 2020

Describe the feature

Right now, run results are an ugly, overloaded mess:

  • dbt run returns a string
    • it's the result of the get_status method on SQL adapters, which can be basically anything the db/underlying db API wants
    • on bigquery it's a carefully crafted string based on what we think the query did
  • dbt test returns an integer: the number of failed rows
  • other things just return 'OK', I think? Honestly, who even knows.
  • except on exceptions, when the status is 'ERROR'
    • not to be confused with test errors, where fail is true but status is still the number of failed rows

Runs set error and status (but they could set fail and warn if they wanted, those attributes exist!)
Tests set error, fail, warn, and status.
status is defined as Union[None, bool, int, str].

There's no way we need all of these for everyone! The problem is complicated a bit by the RPC server, which has its own set of responses, etc., but I believe this is solvable anyway.

I propose picking one unified key that supports a union of objects. If it's hard to give tests their own type from a type-system perspective, we can still make fail/warn/rows_failed available in run output and have them be None in that case.

Here's a revised proposal, with some sample type specifications in python and sample outputs:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Generic, List, Optional, TypeVar

# StrEnum, JsonSchemaMixin, TimingInfo, ParsedNode, ParsedSourceDefinition,
# and LogRecord are existing dbt internals.


class RunStatus(StrEnum):
    Success = "success"
    Error = "error"
    Skipped = "skipped"


class TestStatus(StrEnum):
    Success = "success"
    Error = "error"
    Fail = "fail"
    Warn = "warn"


class FreshnessStatus(StrEnum):
    Pass = "pass"
    Warning = "warn"    
    Error = "error"
    Runtime = "runtime error"


@dataclass
class BaseResult(JsonSchemaMixin):
    status: str
    cmd: str  # the name of the task (`parsed_args.which`)
    timing: List[TimingInfo]
    thread_id: str
    execution_time: float
    message: Optional[str]


@dataclass
class NodeResult(BaseResult):
    status: RunStatus
    node: ParsedNode  # the node this result is from


@dataclass
class TestResult(NodeResult):
    status: TestStatus


# these two end up in the source-freshness API result and sources.json
@dataclass
class ErrorFreshnessResult(BaseResult):
    status: FreshnessStatus = field(metadata={'restrict': [FreshnessStatus.Runtime]})


@dataclass
class FreshnessResult(BaseResult):
    status: FreshnessStatus = field(metadata={'restrict': [FreshnessStatus.Pass, FreshnessStatus.Warning, FreshnessStatus.Error]})
    node: ParsedSourceDefinition
    snapshotted_at: datetime
    max_loaded_at: datetime
    age: float


ResultType = TypeVar('ResultType')


@dataclass
class ResultsCollection(Generic[ResultType]):
    results: List[ResultType]
    generated_at: datetime
    elapsed_time: float  # time spent in the task

# and the RPC result is pretty similar, with an extra field
@dataclass
class RPCResult(ResultsCollection[ResultType]):
    logs: List[LogRecord]
    start: datetime
    end: datetime
    elapsed: float  # time spent in the whole API call
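The 'restrict' metadata in the freshness results above is meant to narrow which enum values a subclass's status may take. A toy illustration of how such a restriction could be enforced at construction time (every name here is a stand-in sketch, not actual dbt code):

```python
from dataclasses import dataclass, field, fields
from enum import Enum


class FreshnessStatus(str, Enum):
    Pass = "pass"
    Warning = "warn"
    Error = "error"
    Runtime = "runtime error"


@dataclass
class ErrorFreshnessResult:
    # 'restrict' metadata lists the only statuses this subclass may carry.
    status: FreshnessStatus = field(metadata={"restrict": [FreshnessStatus.Runtime]})

    def __post_init__(self):
        # Validate the status against the field's restrict metadata.
        allowed = fields(self)[0].metadata["restrict"]
        if self.status not in allowed:
            raise ValueError(f"status must be one of {allowed}")


ok = ErrorFreshnessResult(status=FreshnessStatus.Runtime)
assert ok.status is FreshnessStatus.Runtime
```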

JSON examples

# a single entry in "results" for a successful table materialization during "dbt run" (run_results.json, API)
{
    "status": "success",
    "cmd": "run",
    "thread_id": "Thread-1",
    "execution_time": 0.24340200424194336,
    "timing": [...],
    "message": "SELECT 1",
    "node": ...,  # omitted
}

# a single entry in "results" for a node that failed during "dbt run" due to an error (run_results.json, API)
{
    "status": "error",
    "cmd": "run",
    "timing": [],
    "thread_id": "Thread-1",
    "execution_time": 0.0777738094329834,
    "message": "Compilation Error in model debug (models/debug.sql)\n  blah\n  \n  > in model debug (models/debug.sql)\n  > called by model debug (models/debug.sql)",
    "node": ...,  # omitted
}
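With one unified status key like this, consumers can dispatch on the value instead of probing for which keys exist. A minimal sketch against abbreviated copies of the two example entries above (field values are illustrative):

```python
# Abbreviated copies of the two example "results" entries above.
results = [
    {"status": "success", "cmd": "run", "message": "SELECT 1"},
    {"status": "error", "cmd": "run", "message": "Compilation Error in model debug"},
]


def problems(entries):
    """Collect every entry whose status is not a success."""
    return [e for e in entries if e["status"] != "success"]


bad = problems(results)
assert len(bad) == 1 and bad[0]["status"] == "error"
```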



# a single entry in "results" for a snapshot-freshness that resulted in a warning (API)
{
    "status": "warn",
    "cmd": "snapshot-freshness",
    "thread_id": "Thread-1",
    "execution_time": 0.019963979721069336,
    "timing": [...],
    "message": null,
    "node": ...,  # omitted
    "max_loaded_at": "2019-01-02T01:02:03+00:00",
    "snapshotted_at": "2020-07-13T20:09:14.857209+00:00",
    "age": 48280031.857209,
}



# the full source-freshness result collection (API)
{
    "results": [...],  # see above
    "generated_at": "2020-07-13T20:09:14.857209+00:00",
    "elapsed_time": 1.3687610626220703,
    "elapsed": 2.45623,
    "start": "2020-07-13T20:09:12.542073Z",
    "end": "2020-07-13T20:09:14.998303Z",
}


# a source freshness result (sources.json)
{
    "meta": {
        "generated_at": "2020-07-13T20:09:14.857209+00:00",
        "elapsed_time": 1.3687610626220703,
        "sources": {
            "source.<project_name>.<source_name>.<table_name>": {
                "max_loaded_at": "2019-01-02T01:02:03+00:00",
                "snapshotted_at": "2020-07-13T20:09:14.857209+00:00",
                "max_loaded_at_time_ago_in_s": 48280031.857209,
                "status": "warn",
                "criteria": {
                    "warn_after": {"count": 100, "period": "day"},
                    "error_after": {"count": 1000, "period": "day"},
                },
            }
        }
    }
}

It might be nice to include a node unique ID explicitly everywhere this spec accepts node, with an eye to deprecating the "node" field. We can do that separately, as it'll require us either writing out the manifest a second time at run end (my preference) or moving the manifest-writing to the end of the run.

This will require some coordination with cloud; I assume it uses this value for displaying run results.

I don't think this issue is imminently urgent, but I would consider it a blocker for 1.0.

Describe alternatives you've considered

We can always not do this; it's just a cosmetic/display kind of issue, since the information is all there.

Who will this benefit?

Developers and consumers of the output JSON.

@beckjake added the "enhancement" and "1.0.0" labels May 27, 2020
@cmcarthur (Member)

@beckjake is this intended to apply a schema to the current run results format, or is this a breaking change?

as a consumer of this data, testing for the existence of a key to detect the result type is a bit painful -- it requires lots of conditionals in the consumer to do the right thing with the data. I'd prefer to standardize the result schema a bit more on the core fields (success/error/warn, status message, etc.) and extend it for types that have more data. IMO this would also be easier to extend with new fields in the future. What do you think about something like this?

class NodeResult(StrEnum):
    success = "success"
    error = "error"
    warn = "warn"

@dataclass
class NodeStatus(JsonSchemaMixin):
    result: NodeResult
    message: str  # arbitrary string, kind of like "status"; could just be OK or whatever the db returns

@dataclass
class RunStatus(NodeStatus):
    pass

@dataclass
class TestStatus(NodeStatus):
    rows_failed: int
    fail: bool  # implies warn
    warn: bool

@beckjake (Contributor, Author)

@cmcarthur This is a breaking change. We already have a schema for the results output, it's just not a useful one and consumers are confused by it.

The problem here, which I don't believe the above structure solves, is that dbt does a poor job of differentiating between "there was an error in execution" (an exception was raised) and "there was an error-severity test failure". Maybe we could add fail to the enum you suggested and get rid of the fail/warn booleans?
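Concretely, folding fail into the enum instead of carrying booleans might look like this -- just a sketch of the idea, not a settled design (stdlib Enum stands in for dbt's StrEnum):

```python
from enum import Enum


class NodeResult(str, Enum):
    success = "success"
    error = "error"  # an exception was raised during execution
    fail = "fail"    # an error-severity test failure
    warn = "warn"


# The two failure modes are now distinct values; no extra booleans needed.
assert NodeResult("error") != NodeResult("fail")
assert NodeResult.fail.value == "fail"
```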

@drewbanin added this to the Marian Anderson milestone Jun 11, 2020
@beckjake (Contributor, Author)

I updated and added some examples for a proposed spec. One thing we should keep in mind (and which we should maybe just fix/change permanently?) is that currently "docs generate" and "source freshness" treat their output artifacts as if they were run results internally. I'm not sure that's the best path going forward; they're really different things.

@cmcarthur (Member)

@beckjake these look totally reasonable to me.

the exception is: why do we need two different versions of source freshness results?

"a single entry in "results" for a snapshot-freshness that resulted in a warning (API)"
looks dramatically different from the contents of a single source in sources.json.

wherever possible I'd push to unify these specs. I know there's a practical reason for the top-level structure of sources.json, but it feels like individual nodes could be unified.

@beckjake (Contributor, Author)

@cmcarthur I think that's fair. I would like to avoid writing node information to freshness results, would it be reasonable for the RPC server to basically return the sources.json-style output, but as a list instead of a dict? I guess we could also make sources.json a list so it's more like the API result.

@cmcarthur (Member)

@beckjake like this?

[
    {
        "max_loaded_at": "2019-01-02T01:02:03+00:00",
        "snapshotted_at": "2020-07-13T20:09:14.857209+00:00",
        "max_loaded_at_time_ago_in_s": 48280031.857209,
        "status": "warn",
        "criteria": {
            "warn_after": {"count": 100, "period": "day"},
            "error_after": {"count": 1000, "period": "day"},
        }
    }
]

That sounds perfect. It does make me wonder how we'll map this back to nodes, though...

@beckjake (Contributor, Author)

I think we can just add a unique_id field to the result, so:

    {
        "max_loaded_at": "2019-01-02T01:02:03+00:00",
        "snapshotted_at": "2020-07-13T20:09:14.857209+00:00",
        "max_loaded_at_time_ago_in_s": 48280031.857209,
        "status": "warn",
        "criteria": {
            "warn_after": {"count": 100, "period": "day"},
            "error_after": {"count": 1000, "period": "day"},
        },
        "unique_id": "source.<project_name>.<source_name>.<table_name>"
    }
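Given that unique_id, joining results back to manifest nodes is a plain dictionary lookup. A rough sketch, where manifest_sources is a hypothetical stand-in for the manifest's sources section (names and values are illustrative):

```python
# Hypothetical manifest "sources" section, keyed by unique_id.
manifest_sources = {
    "source.my_project.raw.orders": {"name": "orders"},
}

freshness_results = [
    {"status": "warn", "unique_id": "source.my_project.raw.orders"},
]

# Attach the node name from the manifest to each freshness result.
joined = [
    {**r, "source_name": manifest_sources[r["unique_id"]]["name"]}
    for r in freshness_results
]
assert joined[0]["source_name"] == "orders"
```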

@prratek (Contributor) commented Aug 20, 2020

Quick note about field naming: the pre-hook and post-hook fields in the node config in run_results.json seem to be the only ones with hyphens instead of underscores. @jtcohen6 pointed out on Slack that dbt already accepts pre_hook and post_hook as config aliases, and it would be great to have the field names contain underscores instead, for two reasons:

  1. Consistency with other field names
  2. Compatibility with data warehouse column-name requirements: BigQuery, at least, doesn't permit hyphens in column names, and not having to rename fields would make loading run results into a table easier.
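Until the artifact itself changes, a consumer can normalize the hyphenated keys before loading. A small sketch (only the pre-hook/post-hook key names come from the current artifact; everything else is illustrative):

```python
def normalize_keys(obj):
    """Recursively replace hyphens with underscores in dict keys."""
    if isinstance(obj, dict):
        return {k.replace("-", "_"): normalize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(v) for v in obj]
    return obj


config = {"pre-hook": [], "post-hook": [], "materialized": "table"}
assert normalize_keys(config) == {"pre_hook": [], "post_hook": [], "materialized": "table"}
```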

@jtcohen6 (Contributor)

Additional results dict to support adapter-specific info like in #2747
