[AutoML] Reservoir sample dataset statistics #3778
Open
Labels
AutoML.NET (Automating various steps of the machine learning process), P2 (Priority of the issue for triage purpose: Needs to be fixed at some point.), enhancement (New feature or request)
Currently, dataset statistics within AutoML are calculated from the first 1,000 rows of a dataset. Instead, we should calculate statistics from a random sample of 1,000 rows. (The first 1,000 rows could be biased if the data is sorted by label, by any other column, by time of collection, etc.) We can use reservoir sampling to obtain a uniform random sample of a fixed size in a single pass over the dataset.
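
For illustration, here is a minimal sketch of the single-pass reservoir sampling idea (the classic "Algorithm R"). It is written in Python for brevity and is not AutoML or ML.NET code. The function name `reservoir_sample`, its parameters, and the use of a plain iterable of rows are assumptions made for this example only.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int = 1000,
                     seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k rows in one pass.

    Hypothetical helper for illustration; not part of any AutoML API.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i] (inclusive). The new row replaces an
            # existing one with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Usage: sample 1,000 rows from a stream of 10,000 without loading it all.
sample = reservoir_sample(range(10_000), k=1000, seed=0)
```

Every row ends up in the sample with equal probability, the total row count does not need to be known in advance, and memory use is bounded by `k`. If the dataset has fewer than `k` rows, all of them are returned.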