Pytorch 2.1 on Runpod running Examples hangs with message #146

Closed
grahama1970 opened this issue Feb 19, 2024 · 6 comments
grahama1970 commented Feb 19, 2024

Hi. I'm trying to run RAGatouille examples on a PyTorch 2.1 template with 2x A6000 GPUs. On the basic_indexing_and_searching example, it hangs and produces the output below.
The examples do work without issues in a free Google Colab... just not on WSL or on the Runpod, yet :)
runpod.io
template: pytorch 2.1
runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04
2 x RTX A6000
63 vCPU 117 GB RAM
[screenshot omitted]

Any help is appreciated

#> Starting...
#> Starting...
nranks = 2 num_gpus = 2 device=1
nranks = 2 num_gpus = 2 device=0
[Feb 19, 15:15:52] [1] #> Encoding 40 passages..
[Feb 19, 15:15:52] [0] #> Encoding 41 passages..
[Feb 19, 15:15:53] [0] avg_doclen_est = 129.80091857910156 len(local_sample) = 41
[Feb 19, 15:15:53] [1] avg_doclen_est = 129.80091857910156 len(local_sample) = 40
[Feb 19, 15:15:53] [0] Creating 1,024 partitions.
[Feb 19, 15:15:53] [0] Estimated 10,513 embeddings.
[Feb 19, 15:15:53] [0] #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 9991 points in 128D to 1024 clusters, redo 1 times, 20 iterations
Preprocessing in 0.00 s
WARNING clustering 9991 points to 1024 centroids: please provide at least 39936 training points
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 128) x (1024, 128)' = (512, 1024) gemm params m 1024 n 512 k 128 trA T trB N lda 128 ldb 128 ldc 1024
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 128) x (1024, 128)' = (512, 1024) gemm params m 1024 n 512 k 128 trA T trB N lda 128 ldb 128 ldc 1024

bclavie commented Feb 20, 2024

Hey,

Thank you for the report! cc @Anmol6, this is another faiss+CUDA issue. Development/fixing for this family of issues is currently a bit slow (I'm stretched very thin at the moment). Since you've got a very low number of documents, you could bypass this stage entirely by using encode() with in-memory encodings, or try out the experimental (but functional & soon™️ to be released) full-vectors index approach in #137 by installing directly from that branch.
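A minimal sketch of the in-memory workaround mentioned above, based on the `encode()` / `search_encoded_docs()` API shown in the RAGatouille examples. The model name and exact signatures here are assumptions and may differ in your installed version:

```python
# Sketch: skip the PLAID/faiss indexing step (which triggers the cuBLAS
# assertion) by keeping document encodings in memory.
from typing import List


def search_in_memory(documents: List[str], query: str, k: int = 3):
    """Encode a small document set in memory and search it directly,
    without building an on-disk index."""
    # Deferred import: loading the model is heavy and needs a GPU/network.
    from ragatouille import RAGPretrainedModel

    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    rag.encode(documents)  # encodings stay in memory; no faiss clustering
    return rag.search_encoded_docs(query=query, k=k)
```

This avoids the k-means clustering path entirely, which is where the cuBLAS assertion fires, at the cost of not persisting an index to disk.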

grahama1970 commented Feb 20, 2024

On my end, I'm happy to test any patches/fixes and send back a bug report :)
Also, re documents: I'm only using your Miyazaki example, to try to get things working simply first.

I look forward to trying RAGatouille out! Question: do the same issues occur on ColBERTv2 without using RAGatouille? I haven't tried ColBERT by itself.

@grahama1970 (Author)
Question to all: is there a Runpod template that will run RAGatouille without error?

bclavie commented Feb 24, 2024

> On my end, I'm happy to test any patches/fixes and send back a bug report :) Also, re documents: I'm only using your Miyazaki example, to try to get things working simply first.
>
> I look forward to trying RAGatouille out! Question: do the same issues occur on ColBERTv2 without using RAGatouille? I haven't tried ColBERT by itself.

Sorry for the late response! This issue seems to be happening in upstream ColBERT code (specifically, the PLAID index code) rather than RAGatouille, so I think it'd also occur when using the base code.

I'm not currently using Runpod, but strangely, I did use it when training JaColBERT (with the PyTorch 2.1 template) and didn't hit any major faiss issues like the ones you're experiencing.

@grahama1970 (Author)
No worries. I'm happy to try a different Runpod template and send you the logs. At this point, I'm unable to test RAGatouille on WSL 2 (Windows) or on a Runpod. I look forward to getting RAGatouille working! BTW, it does work in a Google Colab, which is not part of my normal flow/pipeline.

bclavie commented Mar 18, 2024

This should be fixed in 0.0.8, as long as you are indexing fewer than ~100k documents: the new tentative faiss replacement just uses PyTorch to perform k-means.

bclavie closed this as completed on Mar 18, 2024.