if __name__ == "__main__":
    from ragatouille import RAGPretrainedModel
    from time import time

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    RAG.index(
        collection=["This is a test."],
        document_ids=["test_document"],
        index_name=f"test_index_{time()}",
        split_documents=False,
    )
    results = RAG.search(query="What animation studio did Miyazaki found?", k=10)
    print(results)
[Jan 28, 19:18:40] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
.pyvenv/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
[Jan 28, 19:18:41] #> Creating directory .ragatouille/colbert/indexes/test_index_1706465921.111713
[Jan 28, 19:18:43] [0] #> Encoding 1 passages..
0%| | 0/1 [00:00<?, ?it/s].pyvenv/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.16s/it]
[Jan 28, 19:18:44] [0] avg_doclen_est = 7.0 len(local_sample) = 1
[Jan 28, 19:18:44] [0] Creating 32 partitions.
[Jan 28, 19:18:44] [0] *Estimated* 7 embeddings.
[Jan 28, 19:18:44] [0] #> Saving the indexing plan to .ragatouille/colbert/indexes/test_index_1706465921.111713/plan.json ..
Hey, this is "normal" (as in, it makes sense given what ColBERT indexes are optimised for), but should be documented as it's obscure behaviour.
Basically, indexes are specifically optimised for performance. If you try to create an index with fewer than 32 total document tokens, it'll error out, as indexing hasn't been built for collections that small. For use cases with very few documents, encode() and search_encoded_docs() (example here) make more sense.
I'll close this issue and add a note to the to-do list (soon to be on a GitHub Project, so it should be easier to see what's being worked on) to document this better!
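For anyone landing here with a tiny collection, here is a minimal sketch of the index-free route mentioned above, using encode() and search_encoded_docs(). The helper names (rough_token_count, search_small_collection) and the MIN_INDEX_TOKENS constant are made up for illustration and are not part of ragatouille. The whitespace word count is only a crude stand-in for ColBERT's tokenizer, and the 32-token threshold is taken from the comment above, so treat both as approximate.

```python
# Threshold taken from the maintainer's comment; approximate, not an official constant.
MIN_INDEX_TOKENS = 32


def rough_token_count(collection):
    """Estimate total tokens by counting whitespace-separated words.

    This is a crude heuristic (assumption): ColBERT's tokenizer will
    generally produce a different count.
    """
    return sum(len(doc.split()) for doc in collection)


def search_small_collection(collection, query, k=10):
    """Encode documents in memory and search them without building an index."""
    # Imported lazily so the helpers above work without ragatouille installed.
    from ragatouille import RAGPretrainedModel

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    RAG.encode(collection)
    # Cap k at the number of encoded documents.
    return RAG.search_encoded_docs(query=query, k=min(k, len(collection)))


# Example usage (downloads the model on first run):
# docs = ["This is a test."]
# if rough_token_count(docs) < MIN_INDEX_TOKENS:
#     print(search_small_collection(docs, "What animation studio did Miyazaki found?"))
```

The check before searching is just a convenience for deciding which path to take; for anything larger than a handful of short documents, RAG.index() as in the original script is still the intended route.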
This is on macOS 14.3 (M1) with Python 3.11, latest ragatouille.
The Miyazaki example works.