Core algorithms #87

atroyn · 2022-11-25T19:33:02Z

This PR introduces three core Chroma algorithms.

The change is in two parts:

- The algorithms module, which contains the definitions of each algorithm.
- Driver utilities to run the algorithms, store the results, then query the results to create a sample of the unlabeled data.

There are 3 core algorithms:

- Activation uncertainty, which determines (class-wise) which unseen examples result in low network activations relative to the training data. This is a proxy for model uncertainty.
- Boundary uncertainty, which determines which unseen examples might lie close to the decision boundary and so might be misclassified.
- Representative sampling from unsupervised clustering. This finds points which are representative of the unseen data, but not of the training data. This is a proxy for distribution shift. It also finds overall (difficult) outliers.
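
To make the activation uncertainty idea concrete, here is a minimal sketch. It is not the code in this PR: it assumes the activation magnitude is the L2 norm of an embedding vector and that "low relative to the training data" means a z-score against per-class training statistics. The function names (`classwise_norm_stats`, `activation_uncertainty`) are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Hashable, Sequence, Tuple


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean norm, used here as a stand-in for activation magnitude (assumption)."""
    return math.sqrt(sum(x * x for x in vector))


def classwise_norm_stats(
    train_vectors: Sequence[Sequence[float]],
    train_labels: Sequence[Hashable],
) -> Dict[Hashable, Tuple[float, float]]:
    """Return {label: (mean norm, population std of norms)} over the training data."""
    norms_by_class = defaultdict(list)
    for vector, label in zip(train_vectors, train_labels):
        norms_by_class[label].append(l2_norm(vector))
    return {
        label: (mean(norms), pstdev(norms))
        for label, norms in norms_by_class.items()
    }


def activation_uncertainty(
    vector: Sequence[float],
    predicted_label: Hashable,
    stats: Dict[Hashable, Tuple[float, float]],
    eps: float = 1e-12,
) -> float:
    """Score how far below its class's training mean this example's norm falls.

    Larger values mean lower activation relative to training data, which this
    sketch treats as higher uncertainty. `eps` guards against a zero std.
    """
    class_mean, class_std = stats[predicted_label]
    return (class_mean - l2_norm(vector)) / max(class_std, eps)
```

Scoring per predicted class, rather than globally, keeps a class with naturally small activations from dominating the uncertain set.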

There is also a random sampler, which should make up some proportion of the next dataset: 'every datum gets a chance'.

All 3 algorithms are run over the entire unlabeled (inference) dataset, effectively scoring every datum under each algorithm. Then, for a specified total number of samples, we collect the best candidates from each algorithm in proportion.
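
The proportional collection step can be sketched as follows. This is an illustration under stated assumptions, not the driver code in this PR: it assumes each algorithm yields a score per datum ID, that a higher score marks a better candidate, and that a datum already picked by one algorithm is skipped by the next. The name `proportional_sample` and its parameters are hypothetical.

```python
from typing import Dict, List


def proportional_sample(
    scores_by_algorithm: Dict[str, Dict[str, float]],
    proportions: Dict[str, float],
    n_samples: int,
) -> List[str]:
    """Pick up to `n_samples` datum IDs, giving each algorithm its share.

    `scores_by_algorithm` maps algorithm name -> {datum ID: score}.
    `proportions` maps algorithm name -> fraction of `n_samples` to draw from it.
    """
    selected: List[str] = []
    seen = set()
    for name, proportion in proportions.items():
        quota = int(round(proportion * n_samples))
        # Best candidates first (assumes higher score = better candidate).
        ranked = sorted(
            scores_by_algorithm[name].items(),
            key=lambda item: item[1],
            reverse=True,
        )
        taken = 0
        for datum_id, _score in ranked:
            if taken >= quota:
                break
            if datum_id in seen:
                continue  # already chosen by an earlier algorithm
            selected.append(datum_id)
            seen.add(datum_id)
            taken += 1
    return selected[:n_samples]
```

Skipping already-selected IDs means overlapping top candidates do not shrink the final sample; the next-best candidate from the later algorithm fills the slot instead.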

Testing

Install the new requirements:
pip install -r requirements.txt
pip install -r requirements-dev.txt

There are two notebooks:

- chroma_server/algorithms/core_algorithms_examples.ipynb shows how each algorithm works separately.
- chroma_server/utils/sampling_examples.ipynb shows how the algorithms are run, and how to get the final result.

Future

In the long term, there is a lot to do. The following are near-term items we should address soon, but they don't block shipping.

#91 - Fix random sampling to be over images, not embeddings.
#92 - Add telemetry
#88 - Fix class-based representative sampling
#89 - Address redundant computations to make algos more efficient.

TODOS:

Wire to the API (@jeffchuber)
- ~~[x] Create celery task for background processing~~ decided this is unimportant
- Create API endpoint to trigger processing synchronously
- Create API endpoint to fetch results when they're ready
Fix DuckDB. It doesn't have a dataframe based fetch so we should wrap it.

hook up to api

jeffchuber

great

chroma-server/chroma_server/api.py

chroma-server/chroma_server/db/clickhouse.py

chroma-server/chroma_server/utils/sampling.py

atroyn added 14 commits November 25, 2022 11:42

Clickhouse to return a pandas dataframe.

3bbb818

id indexing fixes in hnswlib

8f93e88

Core Chroma algorithms

401f830

Core algos and drivers

2f041cb

Write results to DB

9e70e6b

Retrieval and storage for sampling

f4727e0

Results by column

23538d8

Relative path

0755b79

Results by column

a20c798

Old file removed

b1854a0

Disabled class clusters for now

29bac20

Sampling and retrieval

fc1e9ff

Convenience notebook for loading parquet

691d16f

Fixed black

d4c6a59

atroyn force-pushed the anton/integrated-algorithms branch from bff8b76 to d4c6a59 Compare November 25, 2022 19:46

atroyn added 3 commits November 25, 2022 12:05

Correct where filter

2a73b41

Make tests pass

21e8c79

Results storage types for DDB compatibility

9f9314a

atroyn mentioned this pull request Nov 25, 2022

Fix class-based cluster representative sampling #88

Closed

atroyn requested review from jeffchuber and levand November 25, 2022 22:10

atroyn and others added 6 commits November 25, 2022 14:31

DuckDB compatibility

51cacc3

Make clickhouse and DDB friends

89d5564

hook up to api

aff5a1e

Random sampling only from the inference dataset

e64dc3b

remove commented out code

cd01427

Merge pull request #90 from chroma-core/jeff/hook-up

dc9a0ac

hook up to api

atroyn marked this pull request as ready for review November 26, 2022 00:38

atroyn added 2 commits November 25, 2022 16:39

Remove print

f874499

Removing dev config

d7234b4

jeffchuber approved these changes Nov 26, 2022


Responded to comments

8695eef

atroyn merged commit ea1d268 into main Nov 26, 2022

atroyn deleted the anton/integrated-algorithms branch November 26, 2022 00:59

levand mentioned this pull request Nov 29, 2022

Post-refactoring fixes for the FastAPI client+server API #93

Closed
