[ML] Switch from normal sampling to random sampler for Index data visualizer table #144646

Merged
merged 22 commits into from Nov 16, 2022

@qn895 qn895 commented Nov 4, 2022

Summary

This PR switches the current sampling method from normal sampling to the random sampler for the Index data visualizer table. In the interest of speed, it also lowers the threshold at which sampling is applied.

Screen Shot 2022-11-04 at 16 01 31

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces: unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. |
See more potential risk examples

For maintainers

@qn895 qn895 self-assigned this Nov 4, 2022
@qn895 qn895 requested a review from benwtrent November 4, 2022 21:04
@qn895 qn895 added :ml Feature:File and Index Data Viz ML file and index data visualizer labels Nov 4, 2022
@qn895 qn895 marked this pull request as ready for review November 4, 2022 21:08
@qn895 qn895 requested a review from a team as a code owner November 4, 2022 21:08
@elasticmachine
Contributor

Pinging @elastic/ml-ui (:ml)

Member

@benwtrent benwtrent left a comment

Are there any safety checks for small data?

newProbability === Infinity ||
numSampled / initialDefaultProbability < 1e7
) {
if (numSampled === 0 || newProbability === Infinity || numSampled < 5) {
Member

Why is 5 chosen here? It seems to me it should be much bigger.

Also, numSampled === 0 already implies numSampled < 5, so that equals 0 check is redundant.
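
The redundancy point can be illustrated with a minimal sketch. The names `MIN_SAMPLED_DOCS` and `shouldFallBackToFullSearch` are hypothetical and do not appear in the PR, and the threshold value is only a placeholder, since the review asks for something larger than 5 without naming a number.

```typescript
// Hypothetical threshold: the review suggests a value "much bigger" than 5
// but does not say which, so this number is an assumption for illustration.
const MIN_SAMPLED_DOCS = 1000;

// Returns true when the sampled result is too small to rely on.
// `numSampled < MIN_SAMPLED_DOCS` already covers `numSampled === 0`,
// so no separate equality check is needed.
function shouldFallBackToFullSearch(numSampled: number, newProbability: number): boolean {
  return newProbability === Infinity || numSampled < MIN_SAMPLED_DOCS;
}
```

With a single comparison, raising or lowering the threshold later only touches the constant.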

Comment on lines 153 to 154
isTopValuesSampled:
field.cardinality >= SAMPLER_TOP_TERMS_THRESHOLD || samplerShardSize > 0,
Member

This logic seems wrong now? I don't think you use SAMPLER_TOP_TERMS_THRESHOLD any longer when determining whether a thing is sampled or not.

@walterra
Contributor

For total documents, we now recalculate an estimated number based on the sampling. Do we aim to do the same for all other data in the table, so that, for example, the document stats would also be estimated full numbers recalculated from the sampling probability?
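
As a rough illustration of what "recalculated from the sampling probability" could mean, here is a minimal sketch. The helper name `estimateFullCount` is hypothetical, and the conversation does not show whether the PR computes its estimate exactly this way.

```typescript
// Hypothetical helper: scales a count observed in a random sample back up
// to an estimate for the full data set.
function estimateFullCount(sampledCount: number, probability: number): number {
  if (!(probability > 0 && probability <= 1)) {
    throw new RangeError(`probability must be in (0, 1], got ${probability}`);
  }
  // A sample drawn with probability p sees roughly p * N documents,
  // so N is estimated as sampledCount / p.
  return Math.round(sampledCount / probability);
}
```

The result is an estimate, not an exact count, so a UI using it would presumably want to label it as such.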

@qn895 qn895 requested a review from a team as a code owner November 10, 2022 23:33
Contributor

@jughosta jughosta left a comment

Data Discovery changes LGTM 👍
Stats in Sidebars' field popover and Field Statistics table match.

@peteharverson
Contributor

peteharverson commented Nov 14, 2022

For total documents, we now recalculate an estimated number based on the sampling. Do we aim to do the same for all other data in the table, so that, for example, the document stats would also be estimated full numbers recalculated from the sampling probability?

Currently the counts in the expanded rows are not consistent:

image

The doc stats add up to the sampled doc count, whereas the top values add up to the total doc count. We should be consistent here - either using the total count or the sampled count in both places. The previous approach used the sampled total.

image

@@ -97,13 +99,55 @@ interface Props {
}

export const ChoroplethMap: FC<Props> = ({ stats, suggestion }) => {
const { fieldName, isTopValuesSampled, topValues, topValuesSamplerShardSize } = stats!;
const {
services: { data },
Member

Nit, could be

Suggested change
services: { data },
services: { data: { fieldFormats } },

Member Author

Fixed here 80872e0 (#144646)

const docsPercent =
valueCount !== undefined && sampleCount !== undefined
? roundToDecimalPlace((valueCount / sampleCount) * 100)
: 0;
Member

There's quite a lot of logic here, a comment explaining what it's doing would be useful.

Do we definitely want to be showing 0% if valueCount or sampleCount is undefined? Should we instead not show this item at all?
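
One way to express the "don't show it at all" alternative is to return `undefined` rather than 0 when either count is missing. This is a sketch only: `getDocsPercent` is a hypothetical name, and the local `roundToDecimalPlace` is a stand-in whose signature is assumed, not the helper used in the PR.

```typescript
// Stand-in for the PR's rounding helper; its real signature is not shown
// in the conversation, so this one-decimal default is an assumption.
function roundToDecimalPlace(value: number, places: number = 1): number {
  const factor = 10 ** places;
  return Math.round(value * factor) / factor;
}

// Returns the percentage of sampled docs containing the field, or
// undefined when it cannot be computed (so the caller can hide the item).
function getDocsPercent(valueCount?: number, sampleCount?: number): number | undefined {
  if (valueCount === undefined || sampleCount === undefined || sampleCount === 0) {
    return undefined;
  }
  return roundToDecimalPlace((valueCount / sampleCount) * 100);
}
```

A caller can then render the percentage only when the result is not `undefined`, which keeps a genuine 0% (field absent from all sampled docs) distinct from "no data".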

Member Author

Fixed here 80872e0 (#144646)

const { stats } = config;
const {
services: { data },
Member

Nit,

Suggested change
services: { data },
services: { data: { fieldFormats } },

Member Author

Fixed here 80872e0 (#144646)


// If field exists in docs but we don't have count stats then don't show
// Otherwise if field doesn't appear in docs at all, show 0%
const docsCount =
const valueCount =
Member

I've just noticed this is the same logic as before; could it be moved to a function?

Member Author

Fixed here 80872e0 (#144646)

@@ -36,8 +36,7 @@ interface Props {
onAddFilter?: (field: DataViewField | string, value: string, type: '+' | '-') => void;
}

function getPercentLabel(docCount: number, topValuesSampleSize: number): string {
const percent = (100 * docCount) / topValuesSampleSize;
function _getPercentLabel(percent: number): string {
Member

This probably doesn't need an underscore prefix. We sometimes do this to signify a function is private when inside another function or class, but it is clear that this is not exported so will be private to the module.

Member Author

Fixed here 80872e0 (#144646)

const totalDocuments = stats.totalDocuments ?? 0;
const topValuesOtherCountPercent =
1 - (topValues ? topValues.reduce((acc, bucket) => acc + bucket.percent, 0) : 0);
const topValuesOtherCount = Math.floor(topValuesOtherCountPercent * (sampleCount || 0));
Member

Although it doesn't matter logically here, for consistency it might be better to still use ?? because the point of the check is to look for undefined.
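
For context, the two operators only diverge when the left side is a falsy value other than `null` or `undefined`. The sketch below uses hypothetical function names and a non-zero fallback purely to make the difference visible; with a fallback of 0, as in the reviewed line, both give the same result.

```typescript
// Illustrative fallback; the reviewed code falls back to 0, where the two
// operators happen to agree.
const FALLBACK = 100;

// `||` replaces every falsy value, including a legitimate 0.
function withOr(sampleCount?: number): number {
  return sampleCount || FALLBACK;
}

// `??` replaces only null and undefined, so 0 is preserved.
function withNullish(sampleCount?: number): number {
  return sampleCount ?? FALLBACK;
}
```

Using `??` states the intent ("only when the value is missing") directly, which is the consistency argument made in the comment.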

Member Author

Fixed here 80872e0 (#144646)

export type FieldStatsEmbeddableSamplerOption =
typeof EMBEDDABLE_SAMPLER_OPTION[keyof typeof EMBEDDABLE_SAMPLER_OPTION];

export function isRandomSamplingOption(arg: SamplingOption): arg is RandomSamplingOption {
Member

these guards might be better living next to the types in common/types/field_stats
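
To show what keeping a guard next to its types might look like, here is a minimal sketch. The shape of the option types (the `mode` discriminant and the other fields) is assumed for illustration; only the names `SamplingOption`, `RandomSamplingOption` and `isRandomSamplingOption` come from the diff.

```typescript
// Assumed shapes: the conversation shows the type names but not their fields.
interface RandomSamplingOption {
  mode: 'random_sampling';
  probability: number;
}

interface NormalSamplingOption {
  mode: 'normal_sampling';
  shardSize: number;
}

interface NoSamplingOption {
  mode: 'no_sampling';
}

type SamplingOption = RandomSamplingOption | NormalSamplingOption | NoSamplingOption;

// Type guard living beside the union it narrows.
function isRandomSamplingOption(arg: SamplingOption): arg is RandomSamplingOption {
  return arg.mode === 'random_sampling';
}

// Example use: `probability` is only accessible after narrowing.
function describeSampling(option: SamplingOption): string {
  return isRandomSamplingOption(option)
    ? `random sampler at probability ${option.probability}`
    : `mode ${option.mode}`;
}
```

Co-locating the guard with the union means a new variant and its guard change in the same file.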

Member Author

Fixed here 80872e0 (#144646)

}

export function buildAggregationWithSamplingOption(
aggs: any,
Member

can this any be replaced by a correct type?
maybe Record<string, estypes.AggregationsAggregationContainer>
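
A typed version of such a wrapper could look roughly like the sketch below. `AggContainer` is a local stand-in for `estypes.AggregationsAggregationContainer` so the snippet stays self-contained, the name `wrapInRandomSampler` is hypothetical, and returning the aggregations unwrapped at probability 1 is an assumption rather than something the conversation confirms.

```typescript
// Local stand-in for estypes.AggregationsAggregationContainer (assumption:
// the real type is much richer; this only keeps the example self-contained).
type AggContainer = Record<string, unknown>;
type Aggs = Record<string, AggContainer>;

// Wraps the supplied aggregations in a random_sampler aggregation.
function wrapInRandomSampler(aggs: Aggs, probability: number, seed?: number): Aggs {
  // Assumed behaviour: nothing to sample when every document is included.
  if (probability === 1) {
    return aggs;
  }
  return {
    sample: {
      random_sampler: { probability, ...(seed !== undefined ? { seed } : {}) },
      aggs,
    },
  };
}
```

Replacing `any` with a record type like this lets the compiler catch a caller that passes something other than a map of named aggregations.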

Member Author

Fixed here 80872e0 (#144646)

}

export function buildSamplerAggregation(
aggs: any,
Member

same with this any

Member Author

Fixed here 80872e0

* Wraps the supplied aggregations in a random sampler aggregation.
*/
export function buildRandomSamplerAggregation(
aggs: any,
Member

same with this any

Member Author

Fixed here 80872e0

@jgowdyelastic
Member

jgowdyelastic commented Nov 14, 2022

It would be good to add a debounce or delay to the sampling percentage slider.
It goes a bit crazy when moving it.


Also the default number in the tooltip is 100 when it should be 50
image

@qn895
Member Author

qn895 commented Nov 14, 2022

Currently the counts in the expanded rows are not consistent:

@peteharverson Thanks for catching that. The numbers were adding up to the sampled # of docs for the numerical and boolean top stats, but not for string/keyword top terms. This is because the buckets' doc_count values returned by Elasticsearch were not for the sampled values. I've added a fix that scales the numbers down to match the # of sampled docs (so bucket.doc_count * (# sampled docs / # total docs)). After the fix:
Screen Shot 2022-11-14 at 17 41 04

It would be good to add a debounce or delay to the sampling percentage slider.

@jgowdyelastic Good point! I've added debouncing here 33c645f (#144646)

if (setSamplingProbability) {
setSamplingProbability(closestProbability / 100);
}
updateSamplingProbability(closestProbability / 100);
Member

It would be more performant to put all of the logic in this onChange function inside the function that is wrapped in the debounce. There's no need to calculate closestProbability on every change if it's just going to be discarded.

Also there is a useful hook called useDebounce which might work well here.
You could put the e.currentTarget.value in a temporary state variable e.g. newProbability and then useDebounce could watch for changes in that variable.
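
The first suggestion (doing the closest-step calculation inside the debounced function) can be sketched without React. Everything here is illustrative: the `debounce` helper, the `PROBABILITY_STEPS` values, and `makeSliderHandler` are hypothetical and are not the PR's code or the `useDebounce` hook; the injectable `schedule` parameter exists only to make the behaviour easy to exercise.

```typescript
type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout/clearTimeout.
const timeoutSchedule: Schedule = (fn, ms) => {
  const id = setTimeout(fn, ms);
  return () => clearTimeout(id);
};

// Trailing-edge debounce: only the last call within `waitMs` runs.
function debounce<T>(
  fn: (arg: T) => void,
  waitMs: number,
  schedule: Schedule = timeoutSchedule
): (arg: T) => void {
  let cancel: Cancel | undefined;
  return (arg: T) => {
    if (cancel) cancel(); // drop the previously scheduled call
    cancel = schedule(() => {
      cancel = undefined;
      fn(arg);
    }, waitMs);
  };
}

// Hypothetical slider steps, in percent.
const PROBABILITY_STEPS = [1, 5, 10, 25, 50];

function closestStep(value: number, steps: number[]): number {
  return steps.reduce((best, step) =>
    Math.abs(step - value) < Math.abs(best - value) ? step : best
  );
}

// The closest-step lookup runs only for the value that survives debouncing.
function makeSliderHandler(
  updateSamplingProbability: (probability: number) => void,
  waitMs: number = 300,
  schedule: Schedule = timeoutSchedule
): (rawPercent: number) => void {
  return debounce(
    (rawPercent: number) =>
      updateSamplingProbability(closestStep(rawPercent, PROBABILITY_STEPS) / 100),
    waitMs,
    schedule
  );
}
```

In a component, the returned handler would be created once (for example via a memo) so that successive slider events share the same pending timer.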

id="xpack.dataVisualizer.dataGrid.field.topValues.calculatedFromSampleDescription"
defaultMessage="Calculated from sample of {topValuesSamplerShardSize} documents per shard"
id="xpack.dataVisualizer.dataGrid.field.topValues.calculatedFromSampleRecordsLabel"
defaultMessage="Calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."
Contributor

In the file data viz expanded rows, the doc count here always seems to be 0:

image

Member Author

Fixed here 7d7ba42 (#144646)

{isTopValuesSampled ? (
<FormattedMessage
id="xpack.dataVisualizer.dataGrid.fieldExpandedRow.choroplethMapTopValues.calculatedFromSampleRecordsLabel"
defaultMessage="Calculated from {sampledDocumentsFormatted} sample {sampledDocuments, plural, one {record} other {records}}."
Contributor

These counts are also 0 in the file data viz:

image

Member Author

Fixed here 7d7ba42 (#144646)

Contributor

@peteharverson peteharverson left a comment

Tested latest changes and LGTM.
Moving the random sampler controls cog icon to the right looks good - this is now consistent with the position of the settings control in Discover and prevents the control from jumping around as the slider is moved and the chart is reloaded.

Member

@jgowdyelastic jgowdyelastic left a comment

LGTM

@qn895 qn895 enabled auto-merge (squash) November 15, 2022 17:07
@qn895 qn895 disabled auto-merge November 15, 2022 17:12
@qn895
Member Author

qn895 commented Nov 15, 2022

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 318 | 309 | -9 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 537.8KB | 532.4KB | -5.4KB |
| discover | 414.0KB | 414.1KB | +61.0B |
| total | | | -5.3KB |

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 44 | 45 | +1 |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 59 | 65 | +6 |
| osquery | 108 | 113 | +5 |
| securitySolution | 441 | 447 | +6 |
| total | | | +20 |

References to deprecated APIs

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 23 | 25 | +2 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| dataVisualizer | 44 | 45 | +1 |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 67 | 73 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 518 | 524 | +6 |
| total | | | +21 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @qn895

@benwtrent benwtrent self-requested a review November 15, 2022 19:06
Comment on lines +135 to +136
const multiplier =
count > sampleCount ? get(aggregations, [...aggsPath, 'probability'], 1) : 1;
Member

same concerns here as above.

Comment on lines +88 to +91
// Sampler agg will yield doc_count that's bigger than the actual # of sampled records
// because it uses the stored _doc_count if available
// https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-doc-count-field.html
// therefore we need to correct it by multiplying by the sampled probability
Member

I am not sure this is doing what your comment says.

I think this is scaling the count back DOWN via the probability (random_sampler not only uses _doc_count, but scales sampled counting statistics by the inverted probability). In doing this, you ensure that doc count is near the sampledCount, but this has nothing to do with _doc_count.

const topValues = topValuesBuckets.map((bucket) => ({
...bucket,
doc_count: sampledCount
? Math.floor(bucket.doc_count * (sampledCount / realNumberOfDocuments))
Member

It would be good to have a comment here explaining sampledCount / realNumberOfDocuments: that this was done to "scale back down" doc_count to be lower than the sampled number of documents.

The sum of the bucket values is indeed to handle the _doc_count weirdness.
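
A commented version of the scaling step might read as follows. This is a sketch under assumptions: `scaleBucketsToSample` and the `Bucket` interface are hypothetical, and only the `Math.floor(bucket.doc_count * (sampledCount / realNumberOfDocuments))` expression is taken from the diff.

```typescript
interface Bucket {
  key: string;
  doc_count: number;
}

// Scales each bucket's doc_count back down so the top values are expressed
// relative to the number of sampled documents rather than the full index.
function scaleBucketsToSample(
  buckets: Bucket[],
  sampledCount: number,
  realNumberOfDocuments: number
): Bucket[] {
  // Without both counts there is no meaningful ratio; leave buckets as-is.
  if (!sampledCount || !realNumberOfDocuments) {
    return buckets;
  }
  // Fraction of the full document count that the sample represents.
  const ratio = sampledCount / realNumberOfDocuments;
  return buckets.map((bucket) => ({
    ...bucket,
    doc_count: Math.floor(bucket.doc_count * ratio),
  }));
}
```

For example, with 500 sampled documents out of 1000, a bucket reported with a doc_count of 600 is shown as 300, which stays below the sampled total.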

@qn895 qn895 merged commit 22d0fa7 into elastic:main Nov 16, 2022
@qn895 qn895 deleted the ml-dv-random-sampler-part-2 branch November 16, 2022 14:36
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Nov 16, 2022
benakansara pushed a commit to benakansara/kibana that referenced this pull request Nov 17, 2022
…ualizer table (elastic#144646)

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Labels
backport:skip This commit does not require backporting Feature:File and Index Data Viz ML file and index data visualizer :ml release_note:enhancement v8.6.0

9 participants