Core algorithms #87

Merged
merged 26 commits into from
Nov 26, 2022

@atroyn atroyn commented Nov 25, 2022

This PR introduces three core Chroma algorithms.

The change is in two parts:

  1. The algorithms module which contains the definitions of each algorithm
  2. Driver utilities to run the algorithms, store the results, then query the results to create a sample of the unlabeled data.

There are 3 core algorithms:

  • Activation uncertainty, which determines, per class, which unseen examples produce low network activations relative to the training data. This is a proxy for model uncertainty.
  • Boundary uncertainty, which determines which unseen examples might lie close to the decision boundary and so might be misclassified.
  • Representative sampling from unsupervised clustering, which finds points that are representative of the unseen data but not of the training data. This is a proxy for distribution shift. It also finds overall (difficult) outliers.
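
To make the three ideas concrete, here is a minimal sketch of one plausible score per algorithm. It is not the code in the algorithms module: the function names, the use of activation norms, the centroid-distance ratio, and the precomputed cluster assignments are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def activation_uncertainty(train_norms_by_class, unseen_norm, predicted_class):
    """Fraction of training activations for the predicted class that exceed the unseen one.

    Hypothetical scoring rule: values near 1.0 mean the unseen example activates
    the network less than almost every training example of that class.
    """
    reference = train_norms_by_class[predicted_class]
    return sum(1 for norm in reference if norm > unseen_norm) / len(reference)


def boundary_uncertainty(embedding, class_centroids):
    """Ratio of the nearest to the second-nearest class-centroid distance.

    Hypothetical scoring rule: values near 1.0 mean the example is almost
    equidistant from two classes, so it plausibly sits near the boundary.
    """
    distances = sorted(math.dist(embedding, c) for c in class_centroids.values())
    nearest, second = distances[0], distances[1]
    return nearest / second if second > 0 else 1.0


def unseen_cluster_share(cluster_ids, is_unseen):
    """Per cluster, the share of members that come from the unseen data.

    Assumes cluster assignments were already computed over training and unseen
    points together. A share near 1.0 hints at distribution shift.
    """
    totals = Counter(cluster_ids)
    unseen = Counter(c for c, flag in zip(cluster_ids, is_unseen) if flag)
    return {c: unseen[c] / totals[c] for c in totals}
```

In this sketch, higher values always mean "more interesting to label", which keeps the three scores comparable when candidates are ranked later.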

There is also a random sampler, which should make up some proportion of the next dataset so that 'every datum gets a chance'.

All 3 algorithms are run over the entire unlabeled (inference) dataset, so every example receives a score from each algorithm. Then, for a specified total number of samples, we collect the best candidates from each algorithm in proportion.
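
The proportional collection step could look roughly like the sketch below. The function `collect_samples`, its parameters, and the `random_share` default are hypothetical and are not taken from chroma_server/utils/sampling.py. The sketch assumes each algorithm has already produced a score per example id, with higher scores being better candidates.

```python
import random


def collect_samples(scores_by_algorithm, proportions, total, random_share=0.1, seed=0):
    """Pick up to `total` ids: a random share first, then top-scored ids per algorithm."""
    all_ids = sorted({i for scores in scores_by_algorithm.values() for i in scores})
    rng = random.Random(seed)

    # Random share: every datum gets a chance, regardless of its scores.
    n_random = min(int(total * random_share), len(all_ids))
    chosen = list(rng.sample(all_ids, n_random))
    seen = set(chosen)

    # Split the remaining budget across algorithms according to their weights.
    budget = total - n_random
    weight_sum = sum(proportions.values())
    for name, scores in scores_by_algorithm.items():
        quota = int(budget * proportions[name] / weight_sum)
        ranked = sorted(scores, key=scores.get, reverse=True)
        for item in ranked:
            if quota == 0:
                break
            if item not in seen:  # skip ids already picked by an earlier step
                chosen.append(item)
                seen.add(item)
                quota -= 1
    return chosen
```

Because quotas are rounded down and duplicates are skipped, this version can return fewer than `total` ids when the pool is small. A fuller implementation would need to redistribute the unused budget.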

Testing

Install the new requirements:
pip install -r requirements.txt
pip install -r requirements-dev.txt

There are two notebooks:

chroma_server/algorithms/core_algorithms_examples.ipynb shows how each algorithm works separately.
chroma_server/utils/sampling_examples.ipynb shows how the algorithms are run, and how to get the final result.

Future

There is a lot to do in the long term. The following are near-term items that we should address soon but that don't block shipping.

#91 - Fix random sampling to be over images, not embeddings.
#92 - Add telemetry
#88 - Fix class-based representative sampling
#89 - Address redundant computations to make algos more efficient.

TODOS:

  • Wire to the API (@jeffchuber)
    [x] Create Celery task for background processing (decided this is unimportant)
    • Create API endpoint to trigger processing synchronously
    • Create API endpoint to fetch results when they're ready
  • Fix DuckDB. It doesn't have a dataframe-based fetch, so we should wrap it.

@atroyn atroyn marked this pull request as ready for review November 26, 2022 00:38

@jeffchuber jeffchuber left a comment

great

Resolved review comments:

  • chroma-server/chroma_server/api.py (3 comments, outdated)
  • chroma-server/chroma_server/db/clickhouse.py
  • chroma-server/chroma_server/utils/sampling.py
@atroyn atroyn merged commit ea1d268 into main Nov 26, 2022
@atroyn atroyn deleted the anton/integrated-algorithms branch November 26, 2022 00:59