TimeSeriesImputer to use multiple imputation strategies #4442

Open
daholste opened this issue Nov 5, 2019 · 1 comment

@daholste daholste commented Nov 5, 2019

Feature request: when imputing rows, TimeSeriesImputer should support a different imputation strategy for each column. For instance, numeric feature columns could be imputed by median, date feature columns by forward fill, and the target column by median.
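To make the request concrete, here is a minimal sketch of per-column strategy selection in plain Python. It is not the TimeSeriesImputer API: the names `impute`, `STRATEGIES`, `median_fill`, and `forward_fill` are hypothetical, missing values are assumed to be `None`, and the median is taken over the whole column rather than a streaming window.

```python
from statistics import median


def forward_fill(values):
    """Replace each None with the most recent non-missing value (hypothetical helper)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        # A leading None stays None because nothing precedes it.
        filled.append(last)
    return filled


def median_fill(values):
    """Replace each None with the median of the observed values (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    m = median(observed)
    return [m if v is None else v for v in values]


# Assumed strategy names; a real imputer may expose different ones.
STRATEGIES = {"median": median_fill, "forward_fill": forward_fill}


def impute(columns, strategy_by_column):
    """Apply the strategy chosen for each column independently."""
    return {
        name: STRATEGIES[strategy_by_column[name]](values)
        for name, values in columns.items()
    }


table = {
    "temperature": [20.0, None, 24.0, 22.0],
    "date": ["2019-11-01", None, "2019-11-03", None],
    "target": [1.0, 3.0, None, 5.0],
}
result = impute(
    table,
    {"temperature": "median", "date": "forward_fill", "target": "median"},
)
```

Keeping the strategy in a per-column mapping means adding a new strategy only requires registering one more function.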

@justinormont justinormont commented Nov 5, 2019

We need an approximate median algorithm, since an exact median is not feasible when streaming over arbitrary data.

For approximate medians:

  • Reservoir sampling -- reduce data size then use standard median algorithm
  • Approx algos on the full dataset -- E.g. Greenwald-Khanna algorithm as used in Spark's approxQuantile() function
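As a rough illustration of the reservoir sampling option, the sketch below keeps a fixed-size uniform sample of the stream (Algorithm R) and takes the exact median of that sample. The function names, the default sample size `k=1000`, and the optional `rng` parameter are assumptions made for the example, not part of any existing API.

```python
import random
from statistics import median


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a single pass."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(x)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x
    return sample


def approx_median(stream, k=1000, rng=None):
    """Approximate the median of a stream from a size-k reservoir."""
    sample = reservoir_sample(stream, k, rng)
    if not sample:
        raise ValueError("stream is empty")
    return median(sample)
```

Memory use is bounded by `k` regardless of stream length, and the result is exact whenever the stream has at most `k` items. A larger `k` tightens the estimate at the cost of memory.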

An exact median can be computed when:

  • Fits in memory -- Normal median algos (e.g. Floyd–Rivest algorithm)
  • Low cardinality -- Can use radix/bucket sort counting
  • Beyond memory, high cardinality -- Generally not reasonable, but a binary search for the median is possible in about log(n) passes over the dataset
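The multi-pass idea in the last bullet can be sketched as follows, under some simplifying assumptions: values are integers, the dataset can be re-read from the start (modeled by a hypothetical `make_stream` callable), and the search bisects the value range, so the number of passes in this variant grows with the log of that range. It returns the lower median for even-length data.

```python
def multipass_median(make_stream):
    """Exact lower median of an integer dataset using O(1) memory.

    make_stream is a hypothetical zero-argument callable that returns a
    fresh iterator over the dataset each time it is called.
    """
    # Pass 1: find the row count and the value range.
    n, lo, hi = 0, None, None
    for x in make_stream():
        n += 1
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
    if n == 0:
        raise ValueError("dataset is empty")

    target = (n + 1) // 2  # 1-based rank of the lower median

    # Each further pass halves the candidate value range.
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(1 for x in make_stream() if x <= mid)
        if count >= target:
            hi = mid  # the median is at or below mid
        else:
            lo = mid + 1  # the median is above mid
    return lo


data = [5, 1, 3, 2, 4]
exact = multipass_median(lambda: iter(data))
```

Every pass re-reads the full dataset, which is why this approach is usually only worthwhile when neither sampling nor an in-memory sort is acceptable.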

We may also want to check if median is actually needed, or if we can use only mean/forward-fill instead.

@gvashishtha gvashishtha self-assigned this Nov 7, 2019