Pytorch 2.1 on Runpod running Examples hangs with message #146

Closed
grahama1970 opened this issue Feb 19, 2024 · 6 comments
grahama1970 commented Feb 19, 2024

Hi. I'm trying to run RAGatouille examples on a PyTorch 2.1 template with 2x A6000 GPUs. On the basic_indexing_and_searching example, it hangs and produces the output below.
The examples do work without issues in a free Google Colab... just not on WSL or on the Runpod, yet :)
runpod.io
template: pytorch 2.1
runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04
2 x RTX A6000
63 vCPU 117 GB RAM
[screenshot omitted]

Any help is appreciated

#> Starting...
#> Starting...
nranks = 2 num_gpus = 2 device=1
nranks = 2 num_gpus = 2 device=0
[Feb 19, 15:15:52] [1] #> Encoding 40 passages..
[Feb 19, 15:15:52] [0] #> Encoding 41 passages..
[Feb 19, 15:15:53] [0] avg_doclen_est = 129.80091857910156 len(local_sample) = 41
[Feb 19, 15:15:53] [1] avg_doclen_est = 129.80091857910156 len(local_sample) = 40
[Feb 19, 15:15:53] [0] Creating 1,024 partitions.
[Feb 19, 15:15:53] [0] Estimated 10,513 embeddings.
[Feb 19, 15:15:53] [0] #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 9991 points in 128D to 1024 clusters, redo 1 times, 20 iterations
Preprocessing in 0.00 s
WARNING clustering 9991 points to 1024 centroids: please provide at least 39936 training points
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 128) x (1024, 128)' = (512, 1024) gemm params m 1024 n 512 k 128 trA T trB N lda 128 ldb 128 ldc 1024
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 128) x (1024, 128)' = (512, 1024) gemm params m 1024 n 512 k 128 trA T trB N lda 128 ldb 128 ldc 1024

bclavie commented Feb 20, 2024

Hey,

Thank you for the report! cc @Anmol6, this is another faiss+CUDA issue. Development/fixing for this family of issues is currently a bit slow (I'm stretched very thin at the moment). Since you've got a very low number of documents, you could bypass this stage entirely by using encode() with in-memory encodings, or try out the experimental (but functional & soon™️ to be released) full-vectors index approach in #137 by installing directly from that branch.
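A minimal sketch of the in-memory workaround mentioned above, based on the `encode()` / `search_encoded_docs()` API shown in the RAGatouille examples. The model name and exact signatures here are assumptions and may differ in your installed version:

```python
# Sketch: skip the PLAID/faiss indexing step (which triggers the cuBLAS
# assertion) by keeping document encodings in memory.
from typing import List


def search_in_memory(documents: List[str], query: str, k: int = 3):
    """Encode a small document set in memory and search it directly,
    without building an on-disk index."""
    # Deferred import: loading the model is heavy and needs a GPU/network.
    from ragatouille import RAGPretrainedModel

    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    rag.encode(documents)  # encodings stay in memory; no faiss clustering
    return rag.search_encoded_docs(query=query, k=k)
```

This avoids the k-means clustering path entirely, which is where the cuBLAS assertion fires, at the cost of not persisting an index to disk.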

grahama1970 commented Feb 20, 2024

On my end, I'm happy to test any patches/fixes and send back a bug report :)
Also, re documents: I'm only using your Miyazaki example, to try to get things working simply first.

I look forward to trying RAGatouille out! Question: do the same issues occur on ColBERTv2 without using RAGatouille? I haven't tried ColBERT by itself.

@grahama1970 (Author)
Question to all: is there a Runpod template that will run RAGatouille without error?

bclavie commented Feb 24, 2024

> On my end, I'm happy to test any patches/fixes and send back a bug report :) Also, re documents: I'm only using your Miyazaki example, to try to get things working simply first.
>
> I look forward to trying RAGatouille out! Question: do the same issues occur on ColBERTv2 without using RAGatouille? I haven't tried ColBERT by itself.

Sorry for the late response! This issue seems to be happening in upstream ColBERT code (specifically, the PLAID index code) rather than RAGatouille, so I think it'd also occur when using the base code.

I'm not currently using Runpod, but strangely, I did use it when training JaColBERT (with the PyTorch 2.1 template) and didn't hit any major faiss issues like the ones you're experiencing.

@grahama1970 (Author)
No worries. I'm happy to try a different Runpod template and send you the logs. At this point, I'm unable to test RAGatouille on WSL 2 (Windows) or on a Runpod. I look forward to getting RAGatouille working! BTW, it does work in a Google Colab, which is not part of my normal flow/pipeline.

bclavie commented Mar 18, 2024

This should be fixed in 0.0.8, as long as you are indexing fewer than ~100k documents: the new tentative faiss replacement just uses PyTorch to perform k-means.

bclavie closed this as completed on Mar 18, 2024.