Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use mashumaro for serializing logging events #4505

Merged
merged 2 commits into from
Jan 27, 2022
Merged

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Dec 16, 2021

resolves #4504

Description

Use mashumaro to serialize json logging.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot cla-bot bot added the cla:yes label Dec 16, 2021
@gshank
Copy link
Contributor Author

gshank commented Dec 16, 2021

Mashumaro required that the Event class be a dataclass. Changing the Event class to a dataclass with attributes with defaults means that the attributes of subclasses all have to have defaults too (because non-default attributes can only come before attributes with defaults). In some cases Event subclasses were being constructed without keyword names, which caused the passed in value to be used for the first attribute (i.e. log_version) which didn't work, so all constructors needed to use keyword arguments.

Mashumaro also expects to serialize everything in the class, so it didn't work to have various node objects in the event classes. I switched to putting the node_info construction code in the nodes themselves and passing the node_info the event classes instead of passing in the various node objects. I would have liked to make _event_status and node_info into a separate class, but mypy really didn't like that. The stubs.py file was removed.

I created a separate core/dbt/events/serialization.py file for handling the mashumaro serialization, since these classes didn't need some of the complications in core/dbt/dataclass_schema.py. It's also a good place to put other special serialization code if we need it.

@nathaniel-may
Copy link
Contributor

Integration tests with json serialization on are hanging in the interop tests. I suspect there's an exception being raised in threads again.

@nathaniel-may
Copy link
Contributor

I get a few failures too when I run DBT_LOG_FORMAT=json make integration locally. But it's definitely hanging sometime after the 73% mark.

@gshank gshank force-pushed the logging_serialization branch 7 times, most recently from 83a2f95 to 6da0dec Compare December 17, 2021 23:50
Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of concerns with this approach:

  • mypy can no longer catch malformed events such as events that have no code field.
  • mypy can no longer catch when we fail to provide event attributes even when mypy runs everywhere
  • it looks like when an event takes a type that isn't serializable we are forced to convert it to a serializable type outside the event module which leaks a core abstraction of the event module back out to the rest of the code base

Since these warrant a larger discussion, I think we may way to make a new PR with the bug fixes you included here for the patch release, and consider the larger serialization change for 1.1.0.

CHANGELOG.md Outdated Show resolved Hide resolved
core/dbt/contracts/graph/parsed.py Outdated Show resolved Hide resolved
core/dbt/contracts/graph/parsed.py Outdated Show resolved Hide resolved
core/dbt/events/README.md Show resolved Hide resolved
core/dbt/events/base_types.py Outdated Show resolved Hide resolved
core/dbt/events/test_types.py Outdated Show resolved Hide resolved
core/dbt/events/types.py Outdated Show resolved Hide resolved
core/dbt/utils.py Show resolved Hide resolved
core/dbt/utils.py Show resolved Hide resolved
test/unit/test_events.py Show resolved Hide resolved
@gshank
Copy link
Contributor Author

gshank commented Dec 20, 2021

"it looks like when an event takes a type that isn't serializable we are forced to convert it to a serializable type outside the event module which leaks a core abstraction of the event module back out to the rest of the code base"

I'm not sure that I agree that it is the event module's job to enable serialization for outside classes. In any case, serializing classes that are not needed for the event message feels unnecessary, since it gets thrown away anyway. I think that we're better off providing only the information that is needed for the event/message.

@gshank
Copy link
Contributor Author

gshank commented Dec 20, 2021

"mypy can no longer catch when we fail to provide event attributes even when mypy runs everywhere"

Providing defaults for attributes of subclasses is something that the Python dataclasses module forces you to do. I think it's probably more important to consistently make the classes dataclasses than make a couple of methods abstractmethods, when executing those methods without an override will throw an error just like the abstract base class.

@nathaniel-may
Copy link
Contributor

The way I see it, the event module is tasked with translating dbt's internal representations into a structured output that is versioned as an API so that downstream applications can rely on it. Serialization and schema conformity are both major parts of creating this API to an enterprise quality. I want this code to be written in such a way that if you are not modifying the event module you do not need to know anything about how it was built or what it needs to continue to conform to the logging API version.

If using dataclasses forces us to weaken our development-time guarantees for this API, we should switch tactics and do something else.

Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These recent commits address all 3 of my major concerns from my last review. Mypy is still able to be helpful in the ways it was planned to and there's a way to override serialization in serialization.py. The following remarks are much smaller issues likely with much simpler solutions. Thanks for all the hard work here!

core/dbt/adapters/factory.py Outdated Show resolved Hide resolved
core/dbt/events/base_types.py Show resolved Hide resolved
core/dbt/events/base_types.py Outdated Show resolved Hide resolved
core/dbt/events/base_types.py Show resolved Hide resolved
core/dbt/events/functions.py Show resolved Hide resolved
core/dbt/events/types.py Show resolved Hide resolved
core/dbt/utils.py Show resolved Hide resolved
test/unit/test_events.py Show resolved Hide resolved
test/unit/test_events.py Show resolved Hide resolved
core/dbt/events/serialization.py Outdated Show resolved Hide resolved
@gshank gshank force-pushed the logging_serialization branch 2 times, most recently from 24c770d to fdef893 Compare January 6, 2022 18:22
Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some additional comments. There are still 5 unaddressed comments from my previous review.

core/dbt/events/serialization.py Outdated Show resolved Hide resolved
core/dbt/events/base_types.py Show resolved Hide resolved
@gshank gshank force-pushed the logging_serialization branch 4 times, most recently from 6572b66 to 775959f Compare January 12, 2022 22:58
@nathaniel-may
Copy link
Contributor

Just finished a quick sync with @gshank to go over the remaining comments. Here are the outcomes from that conversation:

  • The node_info field is going to move underneath data. This will require an API version increase.
  • The serialization strategy for datetime objects in serialization.py seems to be currently unused so it's going to be removed with a comment for now to avoid confusion between other datetime serializations. Cleaning up these across dbt-core is recorded in [CT-82] Standardize serialization of datetimes #4608
  • The "node_info" value is actually not being constructed here, it's just being passed through as a dictionary, so let's skip turning it into a nominal type here for now.
  • RunResult is a little tougher to serialize with mashumaro. Let's comment what makes that difficult and leave it as a dictionary value.

@gshank gshank force-pushed the logging_serialization branch 2 times, most recently from 9b3a17b to a3f91e5 Compare January 27, 2022 19:04
Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's just that one value test for the log version that still needs to be updated, but this looks good to me!

in the event dictionary and leave it in the event data dictionary.
Bump log version to 2.
@gshank gshank merged commit efb890d into main Jan 27, 2022
@gshank gshank deleted the logging_serialization branch January 27, 2022 19:43
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
automatic commit by git-black, original commits:
  efb890d
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
automatic commit by git-black, original commits:
  3cafc9e
  efb890d
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
automatic commit by git-black, original commits:
  b2aea11
  efb890d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-19] Logging serialization: use mashumaro for consistency
2 participants