Failure in faiss for short document? #84

Closed
b8591340 opened this issue Jan 28, 2024 · 2 comments

@b8591340

if __name__ == "__main__":
    from ragatouille import RAGPretrainedModel
    from time import time

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    RAG.index(
        collection=["This is a test."],
        document_ids=["test_document"],
        index_name=f"test_index_{time()}",
        split_documents=False,
    )

    results = RAG.search(query="What animation studio did Miyazaki found?", k=10)
    print(results)
[Jan 28, 19:18:40] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
.pyvenv/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(


[Jan 28, 19:18:41] #> Creating directory .ragatouille/colbert/indexes/test_index_1706465921.111713 


[Jan 28, 19:18:43] [0]           #> Encoding 1 passages..
  0%|                                                                                        | 0/1 [00:00<?, ?it/s].pyvenv/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]
[Jan 28, 19:18:44] [0]           avg_doclen_est = 7.0    len(local_sample) = 1
[Jan 28, 19:18:44] [0]           Creating 32 partitions.
[Jan 28, 19:18:44] [0]           *Estimated* 7 embeddings.
[Jan 28, 19:18:44] [0]           #> Saving the indexing plan to .ragatouille/colbert/indexes/test_index_1706465921.111713/plan.json ..
Traceback (most recent call last):
  File "Documents/Exploring/Playgrounds/explore-colbert-ragatouille/main.py", line 51, in <module>
    RAG.index(
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 183, in index
    return self.model.index(
           ^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 349, in index
    self.indexer.index(
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexer.py", line 78, in index
    self.__launch(collection)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexer.py", line 89, in __launch
    launcher.launch(self.config, collection, shared_lists, shared_queues, self.verbose)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/infra/launcher.py", line 34, in launch
    return_val = run_process_without_mp(self.callee, new_config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/infra/launcher.py", line 103, in run_process_without_mp
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 232, in train
    centroids = self._train_kmeans(sample, shared_lists)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 304, in _train_kmeans
    centroids = compute_faiss_kmeans(*args_)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 507, in compute_faiss_kmeans
    kmeans.train(sample)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/extra_wrappers.py", line 457, in train
    clus.train(x, self.index, weights)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/class_wrappers.py", line 85, in replacement_train
    self.train_c(n, swig_ptr(x), index)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/swigfaiss.py", line 2165, in train
    return _swigfaiss.Clustering_train(self, n, x, index, x_weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t *, const faiss::Index *, faiss::Index &, const float *)
at /Users/runner/work/faiss-wheels/faiss-wheels/faiss/faiss/Clustering.cpp:281:
Error: 'nx >= k' failed: Number of training points (7) should be at least as large as number of clusters (32)

This is on macOS 14.3 (M1) with Python 3.11, latest ragatouille.

The Miyazaki example works.
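
The 7 and the 32 in the error line up with the log lines "*Estimated* 7 embeddings." and "Creating 32 partitions." Below is a minimal sketch of that arithmetic. The partition rule in it (largest power of two not exceeding 16 * sqrt(estimated embeddings)) is an assumption that happens to reproduce the logged value, not something confirmed in this thread, and both function names are hypothetical.

```python
import math


def estimated_partitions(num_embeddings_est: float) -> int:
    """Assumed heuristic: largest power of two <= 16 * sqrt(estimated embeddings)."""
    return int(2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est))))


def kmeans_can_train(num_training_points: int, num_clusters: int) -> bool:
    """The condition quoted in the faiss error message: nx >= k."""
    return num_training_points >= num_clusters


embeddings = 7  # from the log: "*Estimated* 7 embeddings."
partitions = estimated_partitions(embeddings)

# Prints "32 False": 32 clusters requested, but only 7 training points available.
print(partitions, kmeans_can_train(embeddings, partitions))
```

Under that assumption, a single seven-token document cannot satisfy the k-means precondition, which matches the failure shown above.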

@bclavie
Collaborator

bclavie commented Jan 28, 2024

Hey, this is "normal" (as in, it makes sense given what ColBERT indexes are optimised for), but should be documented as it's obscure behaviour.

Basically, indexes are optimised specifically for performance. If you create an index with fewer than 32 total document tokens, it'll error out, since indexing wasn't built for that case. For use cases with very few documents, encode() and search_encoded_docs() (example here) make more sense.

I'll close this issue and add a note to the to-do list (soon to be on a GitHub Project, so it should be easier to see what's being worked on) to document this better!
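
For anyone landing here with a tiny collection, here is a rough sketch of the index-free route mentioned above. It assumes `encode()` accepts a list of document strings and `search_encoded_docs()` accepts `query` and `k`. The wrapper `search_without_index` is a hypothetical helper, and clamping `k` to the number of documents is a precaution rather than a documented requirement.

```python
def search_without_index(model, documents, query, k=10):
    """Encode documents in memory and search them, skipping index building.

    `model` is expected to expose encode() and search_encoded_docs(),
    as a RAGPretrainedModel does per the comment above.
    """
    # Keep the encoded documents in memory instead of writing an index to disk.
    model.encode(documents)
    # Precaution (assumption): do not ask for more results than documents exist.
    k = min(k, len(documents))
    return model.search_encoded_docs(query=query, k=k)


# Usage with RAGatouille (downloads the checkpoint, so left commented out):
# from ragatouille import RAGPretrainedModel
# RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# print(search_without_index(RAG, ["This is a test."],
#                            "What animation studio did Miyazaki found?"))
```

Passing the model in as an argument keeps the helper independent of how the checkpoint is loaded.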

bclavie closed this as completed Jan 28, 2024
@b8591340
Author

I see! Thanks Benjamin!
