Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduces field capabilities event loop block times by scaling linearly using hashes #75718

Merged
merged 3 commits into from
Aug 24, 2020

Conversation

FrankHassanabad
Copy link
Contributor

@FrankHassanabad FrankHassanabad commented Aug 22, 2020

Summary

On the security_solution project from different customers we have been getting reports of scaling issues and excessive NodeJS event blocking times. After in-depth tracing through some of the index and field capabilities calls we identified two of the "hot paths" running through field_capabilities to where it is using double looped arrays rather than hashes. By switching these two hot spots out for hashes we are now able to reduce the event loop block times by an order of magnitude.

Before this PR you can see event loop block times as high as:

field_cap: 575.131ms

And after this PR you will see event loop block times drop by an order of magnitude to:

field_cap: 31.783ms

when you're calling into indexes as large as filebeat-*. This number can be higher if you're concatenating several large indexes together trying to get capabilities from each one all at once. We already only call getFieldCapabilities with one index at a time to spread out event block times.

The fix is to use a hash within two key areas within these two files:

src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_capabilities.ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_caps_response.ts

This effect happens during the query of SourceQuery/IndexFields within security_solution but you should be able to trigger it with any application code who calls into those code paths with large index sizes such as filebeat-* anywhere in Kibana.

An explanation of how to see the block times for before and after

Add, console.time('field_cap'); and console.timeEnd('field_cap'); to where the synchronize code is for testing the optimizations of before and after.

For example around lines 45 with the original code:

  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce(concatIfUniq, [] as string[])
    .map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

And around lines 45 with this pull request:

  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce<{ names: string[]; hash: Record<string, string> }>(
      (agg, value) => {
        // This is intentionally using a "hash" and a "push" to be highly optimized with very large indexes
        if (agg.hash[value] != null) {
          return agg;
        } else {
          agg.hash[value] = value;
          agg.names.push(value);
          return agg;
        }
      },
      { names: [], hash: {} }
    )
    .names.map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

And then reload the security solutions application web page or generically anything that is going to call filebeat-* index or another large index or you could concatenate several indexes together as well to test out the performance. For security solutions we can just visit any page such as this one below which has a filebeat-* index:

http://localhost:5601/app/security/timelines/

Be sure to load it twice for testing as NodeJS will sometimes report better numbers the second time as it does optimizations after the first time it encounters some code paths.

You should begin to see numbers similar to this in the before:

field_cap: 575.131ms

This indicates that it is blocking the event loop for around half a second before this fix. If an application adds additional indexes on-top of filebeat, or if it tries to execute other code after this (which we do in security solutions) then the block times will climb even higher.

However, after this fix, the m^n are changed to use hashing so this only climb by some constant * n where n is your fields and for filebeat-* it will should very low around:

field_cap: 31.783ms

Checklist

Unit tests already present, so this shouldn't break anything 🤞 .

@FrankHassanabad FrankHassanabad requested a review from a team as a code owner August 22, 2020 06:29
@FrankHassanabad FrankHassanabad self-assigned this Aug 22, 2020
@FrankHassanabad FrankHassanabad changed the title Fixes field_capabilities to not block the NodeJS event loop by scaling linearly using hashes in two areas Fixes field_capabilities reduce NodeJS event block times by scaling linearly using hashes Aug 22, 2020
@FrankHassanabad FrankHassanabad changed the title Fixes field_capabilities reduce NodeJS event block times by scaling linearly using hashes Reduces field capabilities event loop block times by scaling linearly using hashes Aug 22, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/siem (Team:SIEM)

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@mattkime mattkime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiiiiiice!

@FrankHassanabad FrankHassanabad merged commit 9dfcde2 into elastic:master Aug 24, 2020
@FrankHassanabad FrankHassanabad deleted the node-block-fix-caps branch August 24, 2020 18:12
FrankHassanabad added a commit to FrankHassanabad/kibana that referenced this pull request Aug 24, 2020
… using hashes (elastic#75718)

## Summary

On the `security_solution` project from different customers we have been getting reports of scaling issues and excessive NodeJS event blocking times. After in-depth tracing through some of the index and field capabilities calls we identified two of the "hot paths" running through `field_capabilities` to where it is using double looped arrays rather than hashes. By switching these two hot spots out for hashes we are now able to reduce the event loop block times by an order of magnitude. 

Before this PR you can see event loop block times as high as:

```ts
field_cap: 575.131ms
```

And after this PR you will see event loop block times drop by an order of magnitude to:

```ts
field_cap: 31.783ms
```

when you're calling into indexes as large as `filebeat-*`. This number can be higher if you're concatenating several large indexes together trying to get capabilities from each one all at once. We already only call `getFieldCapabilities` with one index at a time to spread out event block times.

The fix is to use a hash within two key areas within these two files:

```ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_capabilities.ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_caps_response.ts
```

This effect happens during the query of `SourceQuery`/`IndexFields` within `security_solution` but you should be able to trigger it with any application code who calls into those code paths with large index sizes such as `filebeat-*` anywhere in Kibana.

An explanation of how to see the block times for before and after
---

Add, `console.time('field_cap');` and `console.timeEnd('field_cap');` to where the synchronize code is for testing the optimizations of before and after.

For example around lines 45 with the original code:
```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce(concatIfUniq, [] as string[])
    .map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And around lines 45 with this pull request:

```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce<{ names: string[]; hash: Record<string, string> }>(
      (agg, value) => {
        // This is intentionally using a "hash" and a "push" to be highly optimized with very large indexes
        if (agg.hash[value] != null) {
          return agg;
        } else {
          agg.hash[value] = value;
          agg.names.push(value);
          return agg;
        }
      },
      { names: [], hash: {} }
    )
    .names.map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And then reload the security solutions application web page or generically anything that is going to call filebeat-* index or another large index or you could concatenate several indexes together as well to test out the performance. For security solutions we can just visit any page such as this one below which has a filebeat-* index:

```
http://localhost:5601/app/security/timelines/
```

Be sure to load it _twice_ for testing as NodeJS will sometimes report better numbers the second time as it does optimizations after the first time it encounters some code paths.

You should begin to see numbers similar to this in the before:

```ts
field_cap: 575.131ms
```

This indicates that it is blocking the event loop for around half a second before this fix. If an application adds additional indexes on-top of `filebeat`, or if it tries to execute other code after this (which we do in security solutions) then the block times will climb even higher.


However, after this fix, the m^n are changed to use hashing so this only climb by some constant * n where n is your fields and for filebeat-* it will should very low around:

```ts
field_cap: 31.783ms
```

### Checklist

Unit tests already present, so this shouldn't break anything 🤞 .
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
FrankHassanabad added a commit to FrankHassanabad/kibana that referenced this pull request Aug 24, 2020
… using hashes (elastic#75718)

## Summary

On the `security_solution` project from different customers we have been getting reports of scaling issues and excessive NodeJS event blocking times. After in-depth tracing through some of the index and field capabilities calls we identified two of the "hot paths" running through `field_capabilities` to where it is using double looped arrays rather than hashes. By switching these two hot spots out for hashes we are now able to reduce the event loop block times by an order of magnitude. 

Before this PR you can see event loop block times as high as:

```ts
field_cap: 575.131ms
```

And after this PR you will see event loop block times drop by an order of magnitude to:

```ts
field_cap: 31.783ms
```

when you're calling into indexes as large as `filebeat-*`. This number can be higher if you're concatenating several large indexes together trying to get capabilities from each one all at once. We already only call `getFieldCapabilities` with one index at a time to spread out event block times.

The fix is to use a hash within two key areas within these two files:

```ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_capabilities.ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_caps_response.ts
```

This effect happens during the query of `SourceQuery`/`IndexFields` within `security_solution` but you should be able to trigger it with any application code who calls into those code paths with large index sizes such as `filebeat-*` anywhere in Kibana.

An explanation of how to see the block times for before and after
---

Add, `console.time('field_cap');` and `console.timeEnd('field_cap');` to where the synchronize code is for testing the optimizations of before and after.

For example around lines 45 with the original code:
```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce(concatIfUniq, [] as string[])
    .map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And around lines 45 with this pull request:

```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce<{ names: string[]; hash: Record<string, string> }>(
      (agg, value) => {
        // This is intentionally using a "hash" and a "push" to be highly optimized with very large indexes
        if (agg.hash[value] != null) {
          return agg;
        } else {
          agg.hash[value] = value;
          agg.names.push(value);
          return agg;
        }
      },
      { names: [], hash: {} }
    )
    .names.map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And then reload the security solutions application web page or generically anything that is going to call filebeat-* index or another large index or you could concatenate several indexes together as well to test out the performance. For security solutions we can just visit any page such as this one below which has a filebeat-* index:

```
http://localhost:5601/app/security/timelines/
```

Be sure to load it _twice_ for testing as NodeJS will sometimes report better numbers the second time as it does optimizations after the first time it encounters some code paths.

You should begin to see numbers similar to this in the before:

```ts
field_cap: 575.131ms
```

This indicates that it is blocking the event loop for around half a second before this fix. If an application adds additional indexes on-top of `filebeat`, or if it tries to execute other code after this (which we do in security solutions) then the block times will climb even higher.


However, after this fix, the m^n are changed to use hashing so this only climb by some constant * n where n is your fields and for filebeat-* it will should very low around:

```ts
field_cap: 31.783ms
```

### Checklist

Unit tests already present, so this shouldn't break anything 🤞 .
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
FrankHassanabad added a commit that referenced this pull request Aug 24, 2020
… using hashes (#75718) (#75805)

## Summary

On the `security_solution` project from different customers we have been getting reports of scaling issues and excessive NodeJS event blocking times. After in-depth tracing through some of the index and field capabilities calls we identified two of the "hot paths" running through `field_capabilities` to where it is using double looped arrays rather than hashes. By switching these two hot spots out for hashes we are now able to reduce the event loop block times by an order of magnitude. 

Before this PR you can see event loop block times as high as:

```ts
field_cap: 575.131ms
```

And after this PR you will see event loop block times drop by an order of magnitude to:

```ts
field_cap: 31.783ms
```

when you're calling into indexes as large as `filebeat-*`. This number can be higher if you're concatenating several large indexes together trying to get capabilities from each one all at once. We already only call `getFieldCapabilities` with one index at a time to spread out event block times.

The fix is to use a hash within two key areas within these two files:

```ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_capabilities.ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_caps_response.ts
```

This effect happens during the query of `SourceQuery`/`IndexFields` within `security_solution` but you should be able to trigger it with any application code who calls into those code paths with large index sizes such as `filebeat-*` anywhere in Kibana.

An explanation of how to see the block times for before and after
---

Add, `console.time('field_cap');` and `console.timeEnd('field_cap');` to where the synchronize code is for testing the optimizations of before and after.

For example around lines 45 with the original code:
```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce(concatIfUniq, [] as string[])
    .map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And around lines 45 with this pull request:

```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce<{ names: string[]; hash: Record<string, string> }>(
      (agg, value) => {
        // This is intentionally using a "hash" and a "push" to be highly optimized with very large indexes
        if (agg.hash[value] != null) {
          return agg;
        } else {
          agg.hash[value] = value;
          agg.names.push(value);
          return agg;
        }
      },
      { names: [], hash: {} }
    )
    .names.map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And then reload the security solutions application web page or generically anything that is going to call filebeat-* index or another large index or you could concatenate several indexes together as well to test out the performance. For security solutions we can just visit any page such as this one below which has a filebeat-* index:

```
http://localhost:5601/app/security/timelines/
```

Be sure to load it _twice_ for testing as NodeJS will sometimes report better numbers the second time as it does optimizations after the first time it encounters some code paths.

You should begin to see numbers similar to this in the before:

```ts
field_cap: 575.131ms
```

This indicates that it is blocking the event loop for around half a second before this fix. If an application adds additional indexes on-top of `filebeat`, or if it tries to execute other code after this (which we do in security solutions) then the block times will climb even higher.


However, after this fix, the m^n are changed to use hashing so this only climb by some constant * n where n is your fields and for filebeat-* it will should very low around:

```ts
field_cap: 31.783ms
```

### Checklist

Unit tests already present, so this shouldn't break anything 🤞 .
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
FrankHassanabad added a commit that referenced this pull request Aug 24, 2020
… using hashes (#75718) (#75804)

## Summary

On the `security_solution` project from different customers we have been getting reports of scaling issues and excessive NodeJS event blocking times. After in-depth tracing through some of the index and field capabilities calls we identified two of the "hot paths" running through `field_capabilities` to where it is using double looped arrays rather than hashes. By switching these two hot spots out for hashes we are now able to reduce the event loop block times by an order of magnitude. 

Before this PR you can see event loop block times as high as:

```ts
field_cap: 575.131ms
```

And after this PR you will see event loop block times drop by an order of magnitude to:

```ts
field_cap: 31.783ms
```

when you're calling into indexes as large as `filebeat-*`. This number can be higher if you're concatenating several large indexes together trying to get capabilities from each one all at once. We already only call `getFieldCapabilities` with one index at a time to spread out event block times.

The fix is to use a hash within two key areas within these two files:

```ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_capabilities.ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_caps_response.ts
```

This effect happens during the query of `SourceQuery`/`IndexFields` within `security_solution` but you should be able to trigger it with any application code who calls into those code paths with large index sizes such as `filebeat-*` anywhere in Kibana.

An explanation of how to see the block times for before and after
---

Add, `console.time('field_cap');` and `console.timeEnd('field_cap');` to where the synchronize code is for testing the optimizations of before and after.

For example around lines 45 with the original code:
```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce(concatIfUniq, [] as string[])
    .map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And around lines 45 with this pull request:

```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce<{ names: string[]; hash: Record<string, string> }>(
      (agg, value) => {
        // This is intentionally using a "hash" and a "push" to be highly optimized with very large indexes
        if (agg.hash[value] != null) {
          return agg;
        } else {
          agg.hash[value] = value;
          agg.names.push(value);
          return agg;
        }
      },
      { names: [], hash: {} }
    )
    .names.map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And then reload the security solutions application web page or generically anything that is going to call filebeat-* index or another large index or you could concatenate several indexes together as well to test out the performance. For security solutions we can just visit any page such as this one below which has a filebeat-* index:

```
http://localhost:5601/app/security/timelines/
```

Be sure to load it _twice_ for testing as NodeJS will sometimes report better numbers the second time as it does optimizations after the first time it encounters some code paths.

You should begin to see numbers similar to this in the before:

```ts
field_cap: 575.131ms
```

This indicates that it is blocking the event loop for around half a second before this fix. If an application adds additional indexes on-top of `filebeat`, or if it tries to execute other code after this (which we do in security solutions) then the block times will climb even higher.


However, after this fix, the m^n are changed to use hashing so this only climb by some constant * n where n is your fields and for filebeat-* it will should very low around:

```ts
field_cap: 31.783ms
```

### Checklist

Unit tests already present, so this shouldn't break anything 🤞 .
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
parkiino pushed a commit to parkiino/kibana that referenced this pull request Sep 1, 2020
… using hashes (elastic#75718)

## Summary

On the `security_solution` project from different customers we have been getting reports of scaling issues and excessive NodeJS event blocking times. After in-depth tracing through some of the index and field capabilities calls we identified two of the "hot paths" running through `field_capabilities` to where it is using double looped arrays rather than hashes. By switching these two hot spots out for hashes we are now able to reduce the event loop block times by an order of magnitude. 

Before this PR you can see event loop block times as high as:

```ts
field_cap: 575.131ms
```

And after this PR you will see event loop block times drop by an order of magnitude to:

```ts
field_cap: 31.783ms
```

when you're calling into indexes as large as `filebeat-*`. This number can be higher if you're concatenating several large indexes together trying to get capabilities from each one all at once. We already only call `getFieldCapabilities` with one index at a time to spread out event block times.

The fix is to use a hash within two key areas within these two files:

```ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_capabilities.ts
src/plugins/data/server/index_patterns/fetcher/lib/field_capabilities/field_caps_response.ts
```

This effect happens during the query of `SourceQuery`/`IndexFields` within `security_solution` but you should be able to trigger it with any application code who calls into those code paths with large index sizes such as `filebeat-*` anywhere in Kibana.

An explanation of how to see the block times for before and after
---

Add, `console.time('field_cap');` and `console.timeEnd('field_cap');` to where the synchronize code is for testing the optimizations of before and after.

For example around lines 45 with the original code:
```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce(concatIfUniq, [] as string[])
    .map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And around lines 45 with this pull request:

```ts
  const esFieldCaps: FieldCapsResponse = await callFieldCapsApi(callCluster, indices);
  console.time('field_cap'); // <--- start timer
  const fieldsFromFieldCapsByName = keyBy(readFieldCapsResponse(esFieldCaps), 'name');

  const allFieldsUnsorted = Object.keys(fieldsFromFieldCapsByName)
    .filter((name) => !name.startsWith('_'))
    .concat(metaFields)
    .reduce<{ names: string[]; hash: Record<string, string> }>(
      (agg, value) => {
        // This is intentionally using a "hash" and a "push" to be highly optimized with very large indexes
        if (agg.hash[value] != null) {
          return agg;
        } else {
          agg.hash[value] = value;
          agg.names.push(value);
          return agg;
        }
      },
      { names: [], hash: {} }
    )
    .names.map<FieldDescriptor>((name) =>
      defaults({}, fieldsFromFieldCapsByName[name], {
        name,
        type: 'string',
        searchable: false,
        aggregatable: false,
        readFromDocValues: false,
      })
    )
    .map(mergeOverrides);

  const sorted = sortBy(allFieldsUnsorted, 'name');
  console.timeEnd('field_cap'); // <--- outputs the end timer
  return sorted;

```

And then reload the security solutions application web page or generically anything that is going to call filebeat-* index or another large index or you could concatenate several indexes together as well to test out the performance. For security solutions we can just visit any page such as this one below which has a filebeat-* index:

```
http://localhost:5601/app/security/timelines/
```

Be sure to load it _twice_ for testing as NodeJS will sometimes report better numbers the second time as it does optimizations after the first time it encounters some code paths.

You should begin to see numbers similar to this in the before:

```ts
field_cap: 575.131ms
```

This indicates that it is blocking the event loop for around half a second before this fix. If an application adds additional indexes on-top of `filebeat`, or if it tries to execute other code after this (which we do in security solutions) then the block times will climb even higher.


However, after this fix, the m^n are changed to use hashing so this only climb by some constant * n where n is your fields and for filebeat-* it will should very low around:

```ts
field_cap: 31.783ms
```

### Checklist

Unit tests already present, so this shouldn't break anything 🤞 .
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 23, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:fix Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:SIEM v7.9.1 v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants