Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] URI Data Loader #1294

Merged
merged 20 commits into from
Nov 7, 2023
Merged

[ENH] URI Data Loader #1294

merged 20 commits into from
Nov 7, 2023

Conversation

atroyn
Copy link
Contributor

@atroyn atroyn commented Oct 26, 2023

Description of changes

This PR adds URIs and DataLoaders into Chroma.

  • DataLoader works like EmbeddingFunction, except it takes a URIs and outputs the specified datatype.
  • A DataLoader using pillow for image file loading.
  • Adds uris as a field on add, query, as well as an include field
  • Adds data as an include field

URIs specify a place where data can be loaded from, and can be used to load data for embedding, or as the result of retrieval.

This makes multimodal retrieval with data stored externally as files seamless and extensible.

This PR is stacked on #1293

Test

Integration tests pass.

A new unit test for data loaders: https://github.com/chroma-core/chroma/blob/c71c3efa15d2a9252db26470b9ff1cdb10a2b681/chromadb/test/data_loader/test_data_loader.py

Try the notebook: https://github.com/chroma-core/chroma/blob/c71c3efa15d2a9252db26470b9ff1cdb10a2b681/examples/multimodal/multimodal_retrieval.ipynb

Documentation

Documentation for this and #1293 chroma-core/docs#157

TODOs

  • Concurrent Loading
  • Tests
  • Wiring through FastAPI
  • Documentation

@github-actions
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

chromadb/api/types.py Outdated Show resolved Hide resolved
chromadb/api/segment.py Outdated Show resolved Hide resolved
@atroyn atroyn force-pushed the anton/uri-loader branch 3 times, most recently from 4bd1ab6 to 07480b6 Compare October 30, 2023 00:23
@atroyn atroyn mentioned this pull request Nov 6, 2023
5 tasks
@atroyn atroyn changed the title [DRAFT] URI Data Loader [ENH] URI Data Loader Nov 6, 2023
@atroyn atroyn marked this pull request as ready for review November 6, 2023 03:14
@HammadB
Copy link
Collaborator

HammadB commented Nov 6, 2023

Do we support the case where you want to add an image that you have preloaded but also give its uri so you can fetch it later? I think that will be a common usage pattern.

@atroyn
Copy link
Contributor Author

atroyn commented Nov 6, 2023

Do we support the case where you want to add an image that you have preloaded but also give its uri so you can fetch it later? I think that will be a common usage pattern.

Good catch. On the one hand, I can see that your image might be already loaded in memory and stored somewhere, so loading it again from the URI to embed it is redundant. We can avoid that by giving images priority over uris if both are present in terms of which we embed. Having that implicit ordering makes me a little uncomfortable.

Additionally, it might imply the semantics that we store your images if you provide us a URL to a file on add, and similarly that we update the contained image on update. I would like to avoid that confusion, so I think for now let's leave it this way and see if people run into sharp edges.

@atroyn atroyn force-pushed the anton/uri-loader branch 2 times, most recently from 0ca441c to f3b080f Compare November 7, 2023 06:18
atroyn added a commit that referenced this pull request Nov 7, 2023
## Description of changes

This PR introduces multi-modal embeddings into Chroma. 
- It adds the generic `EmbeddingFunction` which can take various data
types. Existing functions take the `Documents` type.
- Adds `Images` as a type (numpy NDArray taking ints or floats)
- Add `OpenCLIPEmbeddingFunction` which is an
`EmbeddingFunction[Union[Documents, Images]]`

## Test

Integration tests pass. 

A new test for multimodal embedding functions:
[chromadb/test/ef/test_multimodal_ef.py](https://github.com/chroma-core/chroma/blob/86a9e2620352ee0b2844bc3233f9e001cc4aa3d9/chromadb/test/ef/test_multimodal_ef.py)

## Documentation

See #1294

## TODOs
- [x] Tests
- [x] ~Wiring through FastAPI~ Nothing to wire through
- [x] Documentation
- [x] Telemetry
- [ ] ~JavaScript~
Base automatically changed from anton/multimodal-ef to main November 7, 2023 17:57
@atroyn atroyn mentioned this pull request Nov 7, 2023
5 tasks
@atroyn atroyn changed the base branch from main to anton/multimodal-ef November 7, 2023 18:03
@atroyn atroyn merged commit 8d0405e into anton/multimodal-ef Nov 7, 2023
95 checks passed
@atroyn atroyn deleted the anton/uri-loader branch November 7, 2023 18:12
atroyn added a commit that referenced this pull request Nov 7, 2023
## Description of changes

This PR introduces multi-modal embeddings into Chroma. 
- It adds the generic `EmbeddingFunction` which can take various data
types. Existing functions take the `Documents` type.
- Adds `Images` as a type (numpy NDArray taking ints or floats)
- Add `OpenCLIPEmbeddingFunction` which is an
`EmbeddingFunction[Union[Documents, Images]]`

## Test

Integration tests pass. 

A new test for multimodal embedding functions:
[chromadb/test/ef/test_multimodal_ef.py](https://github.com/chroma-core/chroma/blob/86a9e2620352ee0b2844bc3233f9e001cc4aa3d9/chromadb/test/ef/test_multimodal_ef.py)

## Documentation

See #1294

## TODOs
- [x] Tests
- [x] ~Wiring through FastAPI~ Nothing to wire through
- [x] Documentation
- [x] Telemetry
- [ ] JavaScript
baskaryan added a commit to langchain-ai/langchain that referenced this pull request Nov 10, 2023
Pending:
* chroma-core/chroma#1294
* chroma-core/chroma#1293

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
pprados pushed a commit to pprados/langchain that referenced this pull request Nov 20, 2023
Pending:
* chroma-core/chroma#1294
* chroma-core/chroma#1293

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
xieqihui pushed a commit to xieqihui/langchain that referenced this pull request Nov 21, 2023
Pending:
* chroma-core/chroma#1294
* chroma-core/chroma#1293

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants