Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EBT] Validate event&context payloads in dev-mode #135061

Merged
merged 12 commits into from
Jul 4, 2022

Conversation

afharo
Copy link
Member

@afharo afharo commented Jun 23, 2022

Summary

This PR uses the defined schemas in the event types' and context providers' registration steps to validate the actual payloads sent.

The validation only applies when running Kibana in dev-mode to highlight any errors developers may incur when iterating over the structure of the event and failing to keep the schema up-to-date. In production, the validation is disabled because the users can't do anything to fix it once a potential bug is released.

Resolves #132250.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@afharo afharo added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting labels Jun 23, 2022
Comment on lines 128 to 136
// debug-logging before checking the opt-in status to help during development
if (this.initContext.isDev) {
// TODO: In the future we may need to validate the eventData based on the eventType's registered schema (only if isDev)
this.initContext.logger.debug(`Report event "${eventType}"`);
// If in the browser, log the event as such (to make it easier to debug in the browser's console).
// It needs the casting because the logger types expect debug to be a string
// If in the server, stringify to avoid printing [object Object]
this.initContext.logger.debug(
global.window ? (event as unknown as string) : JSON.stringify(event)
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the original issue: I took advantage of this PR to add a debug logger that will print the events. It will help developers to check how their events look like when they implement them :)

*/
type: TelemetryCounterType;
type: keyof typeof TelemetryCounterType;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list the potential values from the enum so users can set the values if they want to without importing the enum. This reduces bundle size because there is no need to import the @kbn/analytics-client runtime code only for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely better to not use an enum ihmo. but the proper way is to change the TelemetryCounterType implementation to no longer be an enum instead of using the keyof typeof trick

type TelemetryCounterType = 'enqueued' | ... | ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose this approach mostly for documentation purposes. The type format is harder to have documentation for each potential value, while enum seems more straightforward.

But happy to remove the enum. I bet it will also save some bundle size 😇

Copy link
Member Author

@afharo afharo Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 51f6fb6 (#135061)

@@ -39,6 +40,7 @@ import { TelemetryCounterType } from '../events';
import { ShippersRegistry } from './shippers_registry';
import { OptInConfigService } from './opt_in_config';
import { ContextService } from './context_service';
import { schemaToIoTs, validateSchema } from '../schema/validation';
Copy link
Member Author

@afharo afharo Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried importing this conditionally and via await import (inside the constructor, not a top-level import 😅) and it still loads it in the initial chunk of core... Could it be that the current Bazel-based packaging system doesn't understand how to break packages into chunks when the dynamic import occurs inside a package?

cc @elastic/kibana-operations

In any case, I think we can apply this improvement to reduce bundle size in production in a follow up PR, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inside the constructor

How can you await from within a non-async method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let potentiallyUndefinedAPI: typeof API | undefined;
...
import(path).then((api) => {
  potentiallyUndefinedAPI = api;
});

@afharo afharo marked this pull request as ready for review June 24, 2022 08:50
@afharo afharo requested a review from a team as a code owner June 24, 2022 08:50
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@afharo afharo added the v8.4.0 label Jun 24, 2022
@afharo
Copy link
Member Author

afharo commented Jun 27, 2022

@elasticmachine merge upstream

@afharo
Copy link
Member Author

afharo commented Jun 27, 2022

@elasticmachine merge upstream

Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we don't want to enable validation and logging in prod? I see value in having failed validation log messages in prod but understand there might be other consequences, such as end users not being able to do anything about "buggy" events.

}
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean we return true regardless of the outcome from validation? If that's the case, all we're getting from a failed validation is an error log message but we still send the event. Am I reading this right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind the return false; inside the catch in line 54 😇


import * as t from 'io-ts';
import type { RootSchema, SchemaArray, SchemaObject, SchemaValue } from '../types';
import { excess } from './excess';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how we're supposed to interpret excess here? Just glancing at the file name doesn't give me an idea of what to expect. Is excess extra stuff/extra validation/data overflow? I know I'm being pedantic and naming is hard I'm just trying to future-proof ourselves in three years when we can't remember what was supposed to be in this file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I called it excess is because that was the name folks in the io-ts library referred to it in the original issue where I found the code I based my solution upon: gcanti/io-ts#322

Happy to rename it if you think it's worth it. What would that name be?


Apart from documentation, the schema is used to validate the payload during the dev-cycle.
This way it adds an extra layer of confidence over the data to be sent.
The validation, however, is disabled in production because users cannot do anything to fix the bug after it is released. Additionally, receiving _buggy_ events can be considered an additional insight into how our users use our products. For example: the buggy event can be caused by a user following an unexpected path in the UI.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using validation for two different purposes depending on if we're in dev or prod mode. In dev, it helps us ensure we're sending payloads we expect but in prod it seems that we're using validation failures for something else.

Are we sure we want to use validation for two distinct purposes?

If debug logging isn't already enabled in a cluster and there are issues with that cluster, how are we going to know to enable debug logs for the analytics logger? Are there other symptoms that would make one enable debug logging?

Copy link
Member Author

@afharo afharo Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention with this paragraph was to explain why we only want to validate events in dev mode and never in production.

In production, it's OK to get malformed events because that highlights an unexpected bug in our logic or a corner case we didn't account for and we should allow the event through so we know about it. A good example is the file upload you fixed a couple of months ago: if we had EBT back then, we'd report the filename and import options. The developer tests didn't try hitting the import button without selecting a file, so the EBT validation in dev-mode would have accepted filename as a string. However, in production, getting filename as undefined or null, would have highlighted the issue to us via telemetry. If we validate and skip the events in production, we miss those insights.

I'll see if I can reword it to make it clearer because it's obvious it's not 😇

context as Record<string, unknown>
);
} catch (validationError) {
this.logger.error(validationError);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not going to see these log messages in prod mode. Is that what we want? From the readme, we say that receiving a "buggy" event lets us interpret what happened and we can take that further by having access to these logs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agreement in the design phase was not to run validations in production because:

  1. We don't want the additional noise in the logs
  2. We don't want to add the validation overhead to production environments
  3. Invalid events are also a form of telemetry (they tell us there's an edge case we haven't considered in our testing).
  4. We don't collect browser logs, so even if we log them, we'd only benefit from those buggy events generated in the server and on Cloud. I don't think it's worth the overhead.

What do you think?

*/
type: TelemetryCounterType;
type: keyof typeof TelemetryCounterType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely better to not use an enum ihmo. but the proper way is to change the TelemetryCounterType implementation to no longer be an enum instead of using the keyof typeof trick

type TelemetryCounterType = 'enqueued' | ... | ...

@@ -39,6 +40,7 @@ import { TelemetryCounterType } from '../events';
import { ShippersRegistry } from './shippers_registry';
import { OptInConfigService } from './opt_in_config';
import { ContextService } from './context_service';
import { schemaToIoTs, validateSchema } from '../schema/validation';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inside the constructor

How can you await from within a non-async method?

Comment on lines 721 to 725
context$.next(undefined);
expect(logger.error).toHaveBeenCalledTimes(1);
expect((logger.error.mock.calls[0][0] as Error).message).toContain(
`Failed to validate payload coming from "Context Provider 'contextProviderA'"`
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ihmo this is bad testing isolation. The client tests should mock validateSchema and only asserts that the function is properly called and that the system behaves correctly depending on the validateSchema's response/behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move these tests to a new context_service.test.ts file. I think they make more sense there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +85 to +90
export const excess = <C extends t.HasProps>(
codec: C,
name: string = getExcessTypeName(codec)
): ExcessType<C> => {
const props: t.Props = getProps(codec);
return new ExcessType<C>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Are we sure there isn't a 3rd party lib for this?

If there's not, this should probably go to a kbn-io-ts-utils package of something (also I think other teams already have such excess implementation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Are we sure there isn't a 3rd party lib for this?

I couldn't find any 😢

If there's not, this should probably go to a kbn-io-ts-utils package of something (also I think other teams already have such excess implementation

We can't move it at the moment because that library imports @kbn/config-schema and we could leak joi to the browser 😞

@afharo
Copy link
Member Author

afharo commented Jun 28, 2022

@elasticmachine merge upstream

@afharo afharo force-pushed the ebt/validate-from-schema branch 2 times, most recently from e34085d to 16dcd7e Compare June 29, 2022 09:08
@afharo
Copy link
Member Author

afharo commented Jul 4, 2022

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

kibana-ci commented Jul 4, 2022

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
core 349 355 +6
telemetry 52 44 -8
total -2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
core 136.3KB 136.3KB +20.0B
telemetry 13.8KB 13.8KB -4.0B
total +16.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
core 298.8KB 331.1KB +32.3KB
telemetry 22.1KB 22.0KB -189.0B
total +32.1KB

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@afharo
Copy link
Member Author

afharo commented Jul 4, 2022

@TinaHeiligers I think I addressed your comments, so I'll go ahead and merge so this task doesn't fall to our next sprint as incomplete. Please, let me know if I missed anything and I'll fix it in a follow-up PR :)

@afharo afharo merged commit dd43c2d into elastic:main Jul 4, 2022
@afharo afharo deleted the ebt/validate-from-schema branch July 4, 2022 13:42
yakhinvadim pushed a commit to yakhinvadim/kibana that referenced this pull request Jul 5, 2022
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[EBT] Schema-based event&context validation (only in dev mode)
6 participants