[ML] Switch from normal sampling to random sampler for Index data visualizer table #144646

qn895 · 2022-11-04T21:03:10Z

Summary

This PR switches the current sampling method for the Index data visualizer table from normal sampling to the random sampler. In the interest of speed, it also lowers the threshold at which sampling is applied.

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk	Probability	Severity	Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space.	Low	High	Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks.	High	Low	Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled.	Medium	High	Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

This was checked for breaking API changes and was labeled appropriately

elasticmachine · 2022-11-04T21:08:27Z

Pinging @elastic/ml-ui (:ml)

…r-part-2

benwtrent

Are there any safety checks for small data?

benwtrent · 2022-11-07T12:50:42Z

...izer/public/application/index_data_visualizer/search_strategy/requests/get_document_stats.ts

-      newProbability === Infinity ||
-      numSampled / initialDefaultProbability < 1e7
-    ) {
+    if (numSampled === 0 || newProbability === Infinity || numSampled < 5) {


Why is 5 chosen here? It seems to me it should be much bigger.

Also, numSampled === 0 already implies numSampled < 5, so the numSampled === 0 check is redundant.
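
A minimal sketch of the redundancy point, assuming the condition depends only on numSampled and newProbability as in the hunk above. The function names are hypothetical and do not come from the PR; the threshold 5 is copied from the diff.

```typescript
// Condition as written in the added line of the diff.
function originalCheck(numSampled: number, newProbability: number): boolean {
  return numSampled === 0 || newProbability === Infinity || numSampled < 5;
}

// Same condition with the zero check dropped, since 0 < 5 already covers it.
function simplifiedCheck(numSampled: number, newProbability: number): boolean {
  return newProbability === Infinity || numSampled < 5;
}

// Compare both forms over a small grid of inputs, including the boundaries.
function checksAgree(): boolean {
  const counts = [0, 1, 4, 5, 6, 1000];
  const probabilities = [0.001, 0.5, 1, Infinity];
  return counts.every((n) =>
    probabilities.every((p) => originalCheck(n, p) === simplifiedCheck(n, p))
  );
}
```

Both forms return the same result for every pair in the grid, so dropping the zero check would not change behavior.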

benwtrent · 2022-11-07T12:56:39Z

...public/application/index_data_visualizer/search_strategy/requests/get_numeric_field_stats.ts

            isTopValuesSampled:
              field.cardinality >= SAMPLER_TOP_TERMS_THRESHOLD || samplerShardSize > 0,


This logic seems wrong now? I don't think you use SAMPLER_TOP_TERMS_THRESHOLD any longer when determining whether a field is sampled.

…nges and overallStats is refetched

walterra · 2022-11-10T11:37:56Z

For total documents, we now recalculate an estimated number based on the sampling. Do we aim to do the same for all other data in the table, so that, for example, the document stats would also be estimated full numbers recalculated from the sampling probability?

jughosta

Data Discovery changes LGTM 👍
Stats in Sidebars' field popover and Field Statistics table match.

peteharverson · 2022-11-14T17:57:14Z

For total documents, we now recalculate an estimated number based on the sampling. Do we aim to do the same for all other data in the table, so that, for example, the document stats would also be estimated full numbers recalculated from the sampling probability?

Currently the counts in the expanded rows are not consistent:

The doc stats add up to the sampled doc count, whereas the top values add up to the total doc count. We should be consistent here and use either the total count or the sampled count in both places. The previous approach used the sampled total.
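
One way to make both places consistent is to scale every sampled count by the sampling probability. The sketch below is illustrative only: estimateTotalCount and its parameters are hypothetical names, and it assumes each document is included independently with the given probability, so the estimate is the sampled count divided by that probability.

```typescript
// Hypothetical helper: estimate a full-index count from a sampled count.
// `probability` is assumed to be the sampling probability in the range (0, 1].
function estimateTotalCount(sampledCount: number, probability: number): number {
  if (!(probability > 0 && probability <= 1)) {
    throw new RangeError("probability must be in (0, 1]");
  }
  // Each sampled document stands in for roughly 1 / probability documents.
  return Math.round(sampledCount / probability);
}

// Usage: with half the documents sampled, 250 sampled hits suggest about 500 in total.
const estimated = estimateTotalCount(250, 0.5);
```

If the sampled count is shown instead, the same rule applies in reverse: every figure in the expanded row should come from the sample without scaling.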

jgowdyelastic · 2022-11-14T18:18:02Z

...lication/common/components/stats_table/components/field_data_expanded_row/choropleth_map.tsx

@@ -97,13 +99,55 @@ interface Props {
 }

 export const ChoroplethMap: FC<Props> = ({ stats, suggestion }) => {
-  const { fieldName, isTopValuesSampled, topValues, topValuesSamplerShardSize } = stats!;
+  const {
+    services: { data },


Nit, could be

Suggested change

-    services: { data },
+    services: { data: { fieldFormats } },
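
A small sketch of what the suggested nested destructuring does, using a hypothetical stand-in object because the real shape of services is not shown in this thread. The nested form binds only fieldFormats, so data itself is no longer in scope.

```typescript
// Hypothetical types standing in for the real services object.
interface FieldFormatsLike {
  label: string;
}
interface Services {
  data: { fieldFormats: FieldFormatsLike };
}

function demo(): { nested: FieldFormatsLike; twoStep: FieldFormatsLike } {
  const context: { services: Services } = {
    services: { data: { fieldFormats: { label: "formats" } } },
  };

  // Nested destructuring: only `fieldFormats` becomes a local binding.
  const {
    services: {
      data: { fieldFormats },
    },
  } = context;

  // Two-step form: `data` stays in scope and is read afterwards.
  const {
    services: { data },
  } = context;

  return { nested: fieldFormats, twoStep: data.fieldFormats };
}
```

Both forms yield the same object. The nested one is shorter when data is needed only to reach fieldFormats.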