Update feature imputation strategy #201

Closed
Tracked by #54
bfhealy opened this issue Dec 20, 2022 · 1 comment · Fixed by #207
Labels
enhancement New feature or request

Comments

@bfhealy
Collaborator

bfhealy commented Dec 20, 2022

Missing features are currently zero-imputed for training and mean-imputed for inference. Aside from this inconsistency, there are better (albeit more costly) ways to perform imputation. The current plan is to:

  • Exclude the AllWISE W3 and W4 magnitude errors, which are missing from >75% of the training sample
  • Impute a value of zero for missing mean_ztf_alert_braai
  • Impute the median for missing magnitude errors (mainly PS1)
  • Use regression (e.g. KNN imputation) to impute missing magnitudes
  • Use regression (potentially on a class-by-class basis) to impute missing Gaia EDR3 parallaxes
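
The plan above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the column groupings, the error-column names (`AllWISE__w3sigmpro`, `AllWISE__w4sigmpro`, `PS1_DR1__gMeanPSFMagErr`) and the helper `impute_features` are assumptions made for the example, and the parallax is simply folded into the KNN step rather than regressed class by class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Column groupings are assumptions for illustration; the error-column
# names in particular are hypothetical and may not match the real features.
DROP_COLS = ["AllWISE__w3sigmpro", "AllWISE__w4sigmpro"]
ZERO_COLS = ["mean_ztf_alert_braai"]
MEDIAN_COLS = ["PS1_DR1__gMeanPSFMagErr"]
KNN_COLS = ["PS1_DR1__gMeanPSFMag", "PS1_DR1__rMeanPSFMag", "Gaia_EDR3__parallax"]


def impute_features(df, n_neighbors=5):
    """Apply one imputation rule per feature group and return a new frame."""
    # 1. Exclude features that are missing for most of the sample.
    out = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
    # 2. Zero-impute the selected feature(s).
    out[ZERO_COLS] = out[ZERO_COLS].fillna(0.0)
    # 3. Median-impute the magnitude errors, column by column.
    out[MEDIAN_COLS] = out[MEDIAN_COLS].fillna(out[MEDIAN_COLS].median())
    # 4. KNN-impute magnitudes (and, in this simplified sketch, parallax).
    imputer = KNNImputer(n_neighbors=n_neighbors)
    out[KNN_COLS] = imputer.fit_transform(out[KNN_COLS])
    return out


# Tiny synthetic example; the values are made up.
demo = pd.DataFrame({
    "AllWISE__w3sigmpro": [np.nan] * 5 + [0.3],
    "AllWISE__w4sigmpro": [np.nan] * 6,
    "mean_ztf_alert_braai": [0.9, np.nan, 0.8, 0.95, np.nan, 0.7],
    "PS1_DR1__gMeanPSFMagErr": [0.01, 0.02, np.nan, 0.03, 0.02, 0.01],
    "PS1_DR1__gMeanPSFMag": [18.1, np.nan, 17.5, 19.0, 18.4, 17.9],
    "PS1_DR1__rMeanPSFMag": [17.8, 17.2, np.nan, 18.6, 18.0, 17.5],
    "Gaia_EDR3__parallax": [0.5, 0.7, 0.6, np.nan, 0.4, 0.8],
})
imputed = impute_features(demo, n_neighbors=2)
```

Applying the same function to both the training and inference samples would also remove the training/inference inconsistency noted above, although in practice the medians and the KNN imputer would presumably be fit once on the training set and reused.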
@bfhealy bfhealy added the enhancement New feature or request label Dec 20, 2022
@bfhealy bfhealy mentioned this issue Dec 20, 2022
48 tasks
@bfhealy
Collaborator Author

bfhealy commented Jan 3, 2023

Using sklearn's KNNImputer (n_neighbors=5) takes ~2-3 minutes to run on 100,000 sources for the following feature subset:

['Gaia_EDR3__phot_bp_mean_mag',
 'Gaia_EDR3__phot_rp_mean_mag',
 'Gaia_EDR3__parallax',
 'PS1_DR1__gMeanPSFMag',
 'PS1_DR1__rMeanPSFMag',
 'PS1_DR1__iMeanPSFMag',
 'PS1_DR1__zMeanPSFMag',
 'PS1_DR1__yMeanPSFMag',
 'AllWISE__w1mpro',
 'AllWISE__w2mpro',
 'AllWISE__w3mpro',
 'AllWISE__w4mpro']

This is significantly slower than inference (~10 s) for the same number of sources, roughly 12-18 times longer, but it remains important to change our imputation strategy from its current state.
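
A timing check along these lines can be reproduced on synthetic data with something like the sketch below. The helper `time_knn_imputation`, the random Gaussian features, and the 20% missing fraction are all assumptions for illustration; the real feature table has a different missingness pattern, so the timings will not match the numbers quoted above.

```python
import time

import numpy as np
from sklearn.impute import KNNImputer


def time_knn_imputation(n_sources=2000, n_features=12, missing_frac=0.2, seed=0):
    """Time KNNImputer.fit_transform on a synthetic feature matrix."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_sources, n_features))
    # Knock out a random fraction of entries (assumed missingness pattern).
    X[rng.random(X.shape) < missing_frac] = np.nan

    imputer = KNNImputer(n_neighbors=5)
    start = time.perf_counter()
    X_imputed = imputer.fit_transform(X)
    elapsed = time.perf_counter() - start
    return X_imputed, elapsed


X_imputed, elapsed = time_knn_imputation()
print(f"Imputed {X_imputed.shape[0]} sources in {elapsed:.2f} s")
```

Because the imputer computes distances between samples, the runtime grows faster than linearly with the number of sources, so a small `n_sources` is used here to keep the check quick.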

@bfhealy bfhealy linked a pull request Jan 4, 2023 that will close this issue