PyTorch 2.1 on Runpod running examples hangs with message #146
Hey, thank you for the report! cc @Anmol6, this is another faiss+CUDA issue. Development/fixes for this family of issues are currently a bit slow (I'm stretched very thin at the moment). I see that you've got a very low number of documents, so you could bypass this stage by using …
On my end, I'm happy to test any patches/fixes and send back a bug report :) I look forward to trying RAGatouille out! Question: do the same issues occur on ColBERTv2 without using RAGatouille? I haven't tried ColBERT by itself.
Question to all: is there a Runpod template that will run RAGatouille without error?
Sorry for the late response! This issue seems to be happening in upstream ColBERT code (specifically, the PLAID index code) rather than in RAGatouille, so I think it'd also occur when using the base library. I'm not currently using Runpod, but strangely, I did use it when training JaColBERT (with the PyTorch 2.1 template) and there didn't seem to be any major …
No worries. I'm happy to try a different Runpod template and send you the logs. At this point, I'm unable to test RAGatouille on WSL 2 (Windows) or on a Runpod instance. I look forward to getting RAGatouille working! By the way, it does work in a Google Colab, which is not part of my normal flow/pipeline.
This should be fixed in 0.0.8, as long as you are indexing fewer than ~100k documents: the new tentative faiss replacement just uses PyTorch to perform k-means.
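For anyone curious what "using PyTorch to perform k-means" instead of faiss looks like, here is a minimal sketch of the idea: plain Lloyd's iterations built from `torch.cdist` and per-cluster means. This is not RAGatouille's actual implementation, just an illustration of the faiss-free approach (function name and defaults are made up for the example).

```python
import torch

def kmeans_pytorch(points: torch.Tensor, k: int, iters: int = 20) -> torch.Tensor:
    """Plain Lloyd's k-means using only PyTorch ops (no faiss).

    points: (n, d) float tensor of embeddings.
    Returns (k, d) centroids. Works on CPU or GPU depending on `points.device`.
    """
    # Initialise centroids from k random points
    idx = torch.randperm(points.size(0))[:k]
    centroids = points[idx].clone()
    for _ in range(iters):
        # Assign each point to its nearest centroid (n,) index tensor
        assignments = torch.cdist(points, centroids).argmin(dim=1)
        # Recompute each centroid as the mean of its assigned points
        for c in range(k):
            members = points[assignments == c]
            if members.size(0) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids
```

Since this never touches cuBLAS through faiss's GPU path, it sidesteps the assertion failure shown in the logs, at the cost of being slower at large scale (hence the ~100k-document caveat).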
Hi. I'm trying to run the RAGatouille examples off a PyTorch 2.1 template with 2 A6000 GPUs. On the basic_indexing_and_searching example, it hangs and produces the output below:
(screenshot of the hanging indexing run; the same output is reproduced as text below)
The examples do work without issues in a free Google Colab... just not on WSL or on the Runpod, yet :)
Environment (runpod.io):
- template: pytorch 2.1
- runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04
- 2 x RTX A6000
- 63 vCPU, 117 GB RAM
Any help is appreciated
```
#> Starting...
#> Starting...
nranks = 2 num_gpus = 2 device=1
nranks = 2 num_gpus = 2 device=0
[Feb 19, 15:15:52] [1] #> Encoding 40 passages..
[Feb 19, 15:15:52] [0] #> Encoding 41 passages..
[Feb 19, 15:15:53] [0] avg_doclen_est = 129.80091857910156 len(local_sample) = 41
[Feb 19, 15:15:53] [1] avg_doclen_est = 129.80091857910156 len(local_sample) = 40
[Feb 19, 15:15:53] [0] Creating 1,024 partitions.
[Feb 19, 15:15:53] [0] Estimated 10,513 embeddings.
[Feb 19, 15:15:53] [0] #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 9991 points in 128D to 1024 clusters, redo 1 times, 20 iterations
Preprocessing in 0.00 s
WARNING clustering 9991 points to 1024 centroids: please provide at least 39936 training points
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 128) x (1024, 128)' = (512, 1024) gemm params m 1024 n 512 k 128 trA T trB N lda 128 ldb 128 ldc 1024
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (512, 128) x (1024, 128)' = (512, 1024) gemm params m 1024 n 512 k 128 trA T trB N lda 128 ldb 128 ldc 1024
```