Failure in faiss for short document? #84

b8591340 · 2024-01-28T18:24:56Z

if __name__ == "__main__":
    from ragatouille import RAGPretrainedModel
    from time import time

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    RAG.index(
        collection=["This is a test."],
        document_ids=["test_document"],
        index_name=f"test_index_{time()}",
        split_documents=False,
    )

    results = RAG.search(query="What animation studio did Miyazaki found?", k=10)
    print(results)

[Jan 28, 19:18:40] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
.pyvenv/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn(


[Jan 28, 19:18:41] #> Creating directory .ragatouille/colbert/indexes/test_index_1706465921.111713 


[Jan 28, 19:18:43] [0]           #> Encoding 1 passages..
  0%|                                                                                        | 0/1 [00:00<?, ?it/s].pyvenv/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]
[Jan 28, 19:18:44] [0]           avg_doclen_est = 7.0    len(local_sample) = 1
[Jan 28, 19:18:44] [0]           Creating 32 partitions.
[Jan 28, 19:18:44] [0]           *Estimated* 7 embeddings.
[Jan 28, 19:18:44] [0]           #> Saving the indexing plan to .ragatouille/colbert/indexes/test_index_1706465921.111713/plan.json ..

Traceback (most recent call last):
  File "Documents/Exploring/Playgrounds/explore-colbert-ragatouille/main.py", line 51, in <module>
    RAG.index(
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/ragatouille/RAGPretrainedModel.py", line 183, in index
    return self.model.index(
           ^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/ragatouille/models/colbert.py", line 349, in index
    self.indexer.index(
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexer.py", line 78, in index
    self.__launch(collection)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexer.py", line 89, in __launch
    launcher.launch(self.config, collection, shared_lists, shared_queues, self.verbose)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/infra/launcher.py", line 34, in launch
    return_val = run_process_without_mp(self.callee, new_config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/infra/launcher.py", line 103, in run_process_without_mp
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 232, in train
    centroids = self._train_kmeans(sample, shared_lists)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 304, in _train_kmeans
    centroids = compute_faiss_kmeans(*args_)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/colbert/indexing/collection_indexer.py", line 507, in compute_faiss_kmeans
    kmeans.train(sample)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/extra_wrappers.py", line 457, in train
    clus.train(x, self.index, weights)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/class_wrappers.py", line 85, in replacement_train
    self.train_c(n, swig_ptr(x), index)
  File ".pyvenv/ragatouille/lib/python3.11/site-packages/faiss/swigfaiss.py", line 2165, in train
    return _swigfaiss.Clustering_train(self, n, x, index, x_weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t *, const faiss::Index *, faiss::Index &, const float *)
at /Users/runner/work/faiss-wheels/faiss-wheels/faiss/faiss/Clustering.cpp:281:
Error: 'nx >= k' failed: Number of training points (7) should be at least as large as number of clusters (32)

This is on macOS 14.3 (M1) with Python 3.11, latest ragatouille.

The Miyazaki example works.

bclavie · 2024-01-28T18:30:14Z

Hey, this is "normal" (as in, it makes sense given what ColBERT indexes are optimised for), but it should be documented, as it's obscure behaviour.
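To make the numbers in the log above concrete, here is a minimal sketch of why 7 embeddings end up with 32 clusters. It assumes ColBERT picks the partition count as the largest power of two not above 16 * sqrt(estimated embeddings); that formula is an assumption, not something stated in this thread, although it does reproduce the "Creating 32 partitions." line. The function names are hypothetical.

```python
import math


def estimated_num_partitions(num_embeddings_est: float) -> int:
    """Assumed heuristic: largest power of two not above 16 * sqrt(n)."""
    return int(2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est))))


def kmeans_can_train(num_training_points: int, num_clusters: int) -> bool:
    """Same condition as the faiss check 'nx >= k' quoted in the traceback."""
    return num_training_points >= num_clusters


# Values taken from the log: 1 passage, avg_doclen_est = 7.0.
num_passages = 1
avg_doclen_est = 7.0
num_embeddings_est = num_passages * avg_doclen_est

num_partitions = estimated_num_partitions(num_embeddings_est)
print(num_partitions)  # 32, matching "Creating 32 partitions."
print(kmeans_can_train(int(num_embeddings_est), num_partitions))  # False
```

Under this assumption the cluster count never drops low enough for a handful of tokens, so k-means training is rejected before any index is written.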

Basically, indexes are specifically optimised for performance. If you try to create an index with fewer than 32 total document tokens, it'll error out, because it hasn't been built for that: k-means ends up with fewer training points than clusters (here, 7 points for 32 clusters). For use-cases with very few documents, encode() and search_encoded_docs() (example here) make more sense.
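As a rough sketch of the index-free route with encode() and search_encoded_docs() mentioned above: documents are encoded in memory and queried directly, so no k-means training takes place. The helper name search_small_collection is hypothetical, and the sketch assumes a RAGatouille version that ships both methods; k is kept at or below the number of documents to stay on the safe side.

```python
def search_small_collection(documents, query, k=1):
    """Encode a handful of documents in memory and search them without an index."""
    # Imported lazily so this module can be loaded without RAGatouille installed.
    from ragatouille import RAGPretrainedModel

    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    # encode() keeps the document embeddings in memory; nothing is clustered
    # or written to disk, so the faiss 'nx >= k' check is never reached.
    rag.encode(documents)
    # search_encoded_docs() scores the query against the in-memory embeddings.
    return rag.search_encoded_docs(query=query, k=k)
```

For the reproduction case in this issue, that would be called as search_small_collection(["This is a test."], "What animation studio did Miyazaki found?", k=1).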

I'll close this issue and add a note to the to-do list (soon to be a GitHub Project, so it should be easier to see what's being worked on) to document this better!

b8591340 · 2024-01-28T19:33:50Z

I see! Thanks Benjamin!

bclavie closed this as completed Jan 28, 2024
