
Add new example notebook for active learning #910

Draft: wants to merge 7 commits into base `main`
Conversation


@joelnkn joelnkn commented Jun 7, 2024

Adding a notebook for active learning following the checklist in #772 (and #556).

@joelnkn joelnkn self-assigned this Jun 7, 2024
@kevingreenman kevingreenman added this to the v2.1.0 milestone Jun 7, 2024
@kevingreenman kevingreenman linked an issue Jun 7, 2024 that may be closed by this pull request
@kevingreenman (Member) commented:

@joelnkn thanks for adding this! It looks like a great start to me at first glance. A couple notes:

  • You mentioned on Slack a concern about the error metrics getting worse with more active learning iterations. I agree that this would normally be concerning, but in this case it may be caused by the very small size of the dataset: with only 100 total points in the best case, the active learning progression is very noisy, and the model is apparently getting lucky with the training points it chooses in early iterations, which leads to better metrics. We could consider using a larger dataset for this notebook, but we'll have to think about what makes sense, since we don't want to unnecessarily bloat the GitHub repo with large files.
  • If we find a good dataset to use where the metrics decrease with more active learning iterations, it would be nice to make a plot at the end of the notebook to visualize this.
  • Right now I think random is a good choice for the priority function (aka acquisition function). Once we add the uncertainty functionality, I think people would also like to see an example of how using uncertainty-based sampling might improve results more efficiently than sampling randomly.
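For reference, the loop with a random priority function could be sketched like this. This is a minimal pure-Python sketch, not Chemprop's actual API; `run_active_learning`, `priority_fn`, and `al_batch_size` are hypothetical names, and the model (re)training step is elided:

```python
import random

def run_active_learning(pool, train, n_iterations, al_batch_size, priority_fn):
    """Generic active-learning loop: each iteration scores the unlabeled
    pool with priority_fn, moves the top al_batch_size points into the
    training set, and (in the real notebook) would retrain the model and
    record its error metrics."""
    history = []
    for _ in range(n_iterations):
        if not pool:
            break
        # Highest-priority points are acquired first.
        scored = sorted(pool, key=priority_fn, reverse=True)
        acquired, pool = scored[:al_batch_size], scored[al_batch_size:]
        train = train + acquired
        # ... retrain model on `train` and evaluate here ...
        history.append(len(train))
    return train, pool, history

# Random acquisition: every pool point gets a random priority score,
# so each iteration acquires a uniformly random batch.
random.seed(0)
random_priority = lambda x: random.random()

train, pool, history = run_active_learning(
    pool=list(range(90)), train=list(range(90, 100)),
    n_iterations=5, al_batch_size=10, priority_fn=random_priority)
```

An uncertainty-based strategy would then just swap `random_priority` for a function that scores each pool point by the model's predicted uncertainty, leaving the loop unchanged.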

@kevingreenman (Member) commented:

One other note: I suggest renaming your `batch_size` variable to something like `al_batch_size` to avoid confusion with the `batch_size` hyperparameter used when training the model without active learning.
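The rename might look like this in the notebook (values are hypothetical; `al_batch_size` is just the suggested variable name, not an existing Chemprop parameter):

```python
# Hypothetical values for illustration only.
al_batch_size = 10  # pool points acquired per active-learning iteration
batch_size = 64     # minibatch-size hyperparameter for model training
```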

Successfully merging this pull request may close these issues.

[TODO]: Add example notebooks to the docs