[AutoML] Reservoir sample dataset statistics #3778

@daholste

Currently, dataset statistics within AutoML are calculated from the first 1,000 rows of a dataset. Instead, we should calculate them from a random sample of 1,000 rows. (The first 1,000 rows could be biased if the dataset is sorted by label, by any other column, by time of collection, etc.) We can use reservoir sampling to obtain a uniform random sample of a fixed size in a single pass over the dataset.
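
For illustration, here is a minimal sketch of the idea in Python (not the ML.NET codebase's language, and not a proposed implementation). It uses the classic "Algorithm R" form of reservoir sampling; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int = 1000,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random in one pass.

    Hypothetical helper for illustration only; `rows` can be any
    iterable, so the full dataset never has to fit in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i] (inclusive); the new row replaces
            # a reservoir entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

Memory use stays at `k` rows regardless of dataset size, and a dataset with fewer than `k` rows is returned in full. Statistics would then be computed over the returned sample instead of the first 1,000 rows.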

Metadata

    Labels

    AutoML.NET: Automating various steps of the machine learning process
    P2: Priority of the issue for triage purpose: Needs to be fixed at some point.
    enhancement: New feature or request
