infinite loop when clustering #463

Closed
q423462798 opened this issue May 29, 2018 · 12 comments

@q423462798

q423462798 commented May 29, 2018

Hi, when we run Clustering::train to train a PQ slice, we sometimes end up in an infinite loop, as shown in the attached screenshot.

We think it may be because our Faiss version is too old (commit 5ca0521). We noticed that there have been many commits since then, and some of them may be related to this bug, but we are not sure, because some commit messages do not describe which bugs they fix (such as commit 250a3d3).

So, would you mind helping us figure out which bug may cause the problem?
Thanks very much!

@mdouze
Contributor

mdouze commented May 29, 2018

Hi,
This could happen if the vectors have a very bad distribution (e.g. all the same). Could you check this on the relevant sub-vector of your data, i.e. slice 18 out of 32?
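For instance, a minimal sketch of such a check (the file name is hypothetical, and the layout of 32 sub-vectors of 16 dimensions each is taken from the numbers reported later in this thread):

import numpy as np

# Hypothetical: x.bin is assumed to hold the full training set as raw float32,
# with M=32 sub-vectors of dsub=16 dimensions each per vector.
x = np.fromfile('x.bin', dtype='float32').reshape(-1, 32 * 16)

dsub = 16
m = 18                                    # the slice that gets stuck
xslice = x[:, m * dsub:(m + 1) * dsub]

# If the number of distinct sub-vectors is far below 256 (the number of
# centroids), or the per-dimension spread is near zero, the slice is degenerate.
print("distinct sub-vectors:", len({row.tobytes() for row in xslice}))
print("per-dimension std:", xslice.std(axis=0))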

@q423462798
Author

q423462798 commented May 30, 2018

Thanks for your reply!
But the data is large (16384*16), so how can we check it? The data is attached as the file below; it is the x_in for the function void Clustering::train (idx_t nx, const float *x_in, Index & index):
vector.txt

Also, after applying gdb to the core file generated by gcore on the stuck program, we think Faiss might be stuck in the function knn_L2sqr_blas. The following is the stack of the stuck thread:
#0 0x00007f9877cebdc7 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f9873a75595 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#2 0x00007f9873ba5769 in exec_blas () from /opt/OpenBLAS/lib/libopenblas.so.0
#3 0x00007f9873a75c33 in gemm_driver.constprop.0 () from /opt/OpenBLAS/lib/libopenblas.so.0
#4 0x00007f9873a75cd5 in sgemm_thread_tn () from /opt/OpenBLAS/lib/libopenblas.so.0
#5 0x00007f98739a5fdf in sgemm_ () from /opt/OpenBLAS/lib/libopenblas.so.0
#6 0x00007f9878048c0e in knn_L2sqr_blas<faiss::NopDistanceCorrection> (corr=..., res=0x7f982ed36880, ny=256, nx=16384, d=16, y=0x7f982402f640, x=0x7f982424bdd0)
at /gpu_search_core/deps/faiss/utils.cpp:869
#7 faiss::knn_L2sqr (x=0x7f982424bdd0, y=0x7f982402f640, d=16, nx=16384, ny=256, res=res@entry=0x7f982ed36880) at /gpu_search_core/deps/faiss/utils.cpp:944
#8 0x00007f9878051181 in faiss::IndexFlat::search (this=<optimized out>, n=<optimized out>, x=<optimized out>, k=<optimized out>, distances=<optimized out>, labels=0x7f982434bde0)
at /gpu_search_core/deps/faiss/IndexFlat.cpp:54
#9 0x00007f987804be0c in faiss::Clustering::train (this=this@entry=0x7f982ed36ac0, nx=nx@entry=16384, x_in=x_in@entry=0x7f982424bdd0, index=...) at /gpu_search_core/deps/faiss/Clustering.cpp:159
#10 0x00007f98780196fd in faiss::ProductQuantizer::train (this=0x7f982ed36c30, n=16384, x=0x7f9744ffd010) at /gpu_search_core/deps/faiss/ProductQuantizer.cpp:287
#11 0x00007f9878080c2a in faiss::gpu::GpuIndexIVFPQ::trainResidualQuantizer_(long, float const*) () from /libs/libgpu_search_core.so.0
#12 0x00007f9878080ee4 in faiss::gpu::GpuIndexIVFPQ::train(long, float const*) () from //libs/libgpu_search_core.so.0
#13 0x00007f9877ff9339 in gsc_train_gpu_index () from //libs/libgpu_search_core.so.0
#14 0x000000000098e26f in _cgo_17e8fee69989_Cfunc_gsc_train_gpu_index (v=0xc421a8d910) at cgo-gcc-prolog:333
#15 0x000000000045a320 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:624
#16 0x000000c420578180 in ?? ()
#17 0x00007f982ed36db0 in ?? ()
#18 0x0000000000455942 in runtime.(*mcache).nextFree.func1 () at /usr/local/go/src/runtime/malloc.go:557
#19 0x0000000000458b49 in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:344
#20 0x0000000000431a80 in ?? () at /usr/local/go/src/runtime/proc.go:1070
#21 0x000000c420031900 in ?? ()
#22 0x0000000000000000 in ?? ()

Finally, by the way, could bug #150 and the "#pragma omp parallel" in Clustering::train be related to our problem?
We ask because our version still contains bug #150, and you have since removed the "#pragma omp parallel" from Clustering::train.

@mdouze
Contributor

mdouze commented Jun 4, 2018

Hi
I cannot load vector.txt. Could you dump it in a machine-readable text or binary format?
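For example, a raw float32 dump is easy to read back with np.fromfile; a minimal sketch of the round trip (the file name and the use of NumPy on the writing side are assumptions, since the real data lives in the C++ program):

import numpy as np

# Stand-in for the real x_in (nx=16384 vectors of d=16 floats).
x = np.random.rand(16384, 16).astype('float32')

# Raw little-endian float32 with no header: exactly what np.fromfile expects.
x.tofile('x_in.f32')

# Reading it back for inspection.
x2 = np.fromfile('x_in.f32', dtype='float32').reshape(-1, 16)
assert (x == x2).all()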

@q423462798
Author

q423462798 commented Jun 5, 2018

Hi, thanks for your attention!
We reproduced the bug in another training run; this time the program got stuck on iteration 15 while training the 9th PQ sub-quantizer, as shown in the attached screenshot of the log.

And we collected more data in the following files (a loading sketch follows the list):
1. the float[n * dsub] xslice passed to clus.train (n, xslice, index) in void ProductQuantizer::train (int n, const float * x);
dsub: 16, ksub: 256, n: 16384, m: 8, M: 32
binary files:
xslice-dsub-16-ksub-256-n-16384-m-9-M-32.tar.gz

2. the const float *x in void Clustering::train (idx_t nx, const float *x_in, Index & index)
binary files:
x-clus-train-262144.tar.gz

3. the idx_t[nx] assign and float[nx] dis arrays in void Clustering::train (idx_t nx, const float *x_in, Index & index)
nx=16384
binary files:
dis and assign.tar.gz

4. the std::vector<float> cur_centroids in void Clustering::train (idx_t nx, const float *x_in, Index & index)
k=256, d=16
binary files:
cur_centroids-k-256-d-16.tar.gz

5. the obj member that stores the error in void Clustering::train (idx_t nx, const float *x_in, Index & index)
binary files:
obj-niter-25.tar.gz

6. the arguments x, y and the local variables x_norms, y_norms in static void knn_L2sqr_blas (const float * x, const float * y, size_t d, size_t nx, size_t ny, float_maxheap_array_t * res, const DistanceCorrection &corr)

binary files:
args_for_knn_L2sqr_blas.tar.gz
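For reference, a minimal sketch for loading such dumps (the unpacked file names are assumptions; the shapes come from the numbers above, and idx_t is a 64-bit integer in Faiss):

import numpy as np

# Assumed unpacked file names; adjust to whatever the archives actually contain.
xslice = np.fromfile('xslice-dsub-16-ksub-256-n-16384-m-9-M-32',
                     dtype='float32').reshape(16384, 16)
centroids = np.fromfile('cur_centroids-k-256-d-16',
                        dtype='float32').reshape(256, 16)
dis = np.fromfile('dis', dtype='float32')        # float[nx]
assign = np.fromfile('assign', dtype='int64')    # idx_t[nx]

print(xslice.shape, centroids.shape, dis.shape, assign.shape)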

And the thread stack obtained with gdb is as follows:
#0 0x00007fd517d50dc7 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fd513ada595 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#2 0x00007fd513c0a769 in exec_blas () from /opt/OpenBLAS/lib/libopenblas.so.0
#3 0x00007fd513adac33 in gemm_driver.constprop.0 () from /opt/OpenBLAS/lib/libopenblas.so.0
#4 0x00007fd513adacd5 in sgemm_thread_tn () from /opt/OpenBLAS/lib/libopenblas.so.0
#5 0x00007fd513a0afdf in sgemm_ () from /opt/OpenBLAS/lib/libopenblas.so.0
#6 0x00007fd5180a823e in knn_L2sqr_blas<faiss::NopDistanceCorrection> (corr=..., res=0x7fd4e6ffc800, ny=256, nx=16384, d=16, y=0x7fd4b39e53a0, x=0x7fd4b3b90060)
at /root/gpu_search_core/deps/faiss/utils.cpp:869
#7 faiss::knn_L2sqr (x=0x7fd4b3b90060, y=0x7fd4b39e53a0, d=16, nx=16384, ny=256, res=res@entry=0x7fd4e6ffc800) at /root/gpu_search_core/deps/faiss/utils.cpp:944
#8 0x00007fd5180b0b51 in faiss::IndexFlat::search (this=<optimized out>, n=<optimized out>, x=<optimized out>, k=<optimized out>, distances=<optimized out>, labels=0x7fd4b3a42a50)
at /root/gpu_search_core/deps/faiss/IndexFlat.cpp:54
#9 0x00007fd5180ab4a7 in faiss::Clustering::train (this=this@entry=0x7fd4e6ffca40, nx=nx@entry=16384, x_in=x_in@entry=0x7fd4b3b90060, index=...) at /root/gpu_search_core/deps/faiss/Clustering.cpp:164
#10 0x00007fd518078c5d in faiss::ProductQuantizer::train (this=0x7fd4e6ffcc10, n=16384, x=0x7fd39e7d0010) at /root/gpu_search_core/deps/faiss/ProductQuantizer.cpp:287
#11 0x00007fd5180e5067 in faiss::gpu::GpuIndexIVFPQ::trainResidualQuantizer_ (this=0x7fd4b37018a0, n=16384, x=0xc42e8cc000) at /root/gpu_search_core/deps/faiss/gpu/GpuIndexIVFPQ.cu:283
#12 0x00007fd5180e53f7 in faiss::gpu::GpuIndexIVFPQ::train (this=0x7fd4b37018a0, n=194811, x=0xc42e8cc000) at /root/gpu_search_core/deps/faiss/gpu/GpuIndexIVFPQ.cu:314
#13 0x00007fd518057f00 in gsc_train_gpu_index (index=0x7fd4b37018a0, n=194811, x=0xc42e8cc000) at /root/gpu_search_core/src/gpu_search_core.cpp:143
#14 0x000000000098e4ef in _cgo_17e8fee69989_Cfunc_gsc_train_gpu_index (v=0xc420daf888) at cgo-gcc-prolog:333
#15 0x000000000045a320 in runtime.asmcgocall () at /usr/local/go/src/runtime/asm_amd64.s:624
#16 0x000000c420394480 in ?? ()
#17 0x00007fd4e6ffcdb0 in ?? ()
#18 0x0000000000455942 in runtime.(*mcache).nextFree.func1 () at /usr/local/go/src/runtime/malloc.go:557
#19 0x0000000000458b49 in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:344
#20 0x0000000000431a80 in ?? () at /usr/local/go/src/runtime/proc.go:1070
#21 0x000000c42002cc00 in ?? ()
#22 0x0000000000000000 in ?? ()

@mdouze
Contributor

mdouze commented Jun 5, 2018

First, please stop posting screenshots. I will look into the datafiles.

@q423462798
Author

OK, I'm sorry about that. I just thought the data files would not show where the program got stuck, so...

@mdouze
Contributor

mdouze commented Jun 5, 2018

I loaded x-clus-train-262144 and xslice-dsub-16-ksub-256-n-16384-m-9-M-32, both cluster without problems.
Could you post a minimal reproduction program?

@mdouze
Contributor

mdouze commented Jun 5, 2018

For reference, this is what I do:

import numpy as np
import faiss

# x = np.fromfile('/tmp/xslice-dsub-16-ksub-256-n-16384-m-9-M-32', dtype='float32')
x = np.fromfile('/tmp/x-clus-train-262144', dtype='float32')

# interpret the flat dump as (n, 16) sub-vectors
xslice = x.reshape(-1, 16)
print("shape:", xslice.shape)

# count distinct sub-vectors to rule out a degenerate distribution
print("distinct vectors:", len(set(xi.tostring() for xi in xslice)))

# k-means with 256 centroids on 16-dim data, as in the PQ training
kmeans = faiss.Kmeans(16, 256, verbose=True)
kmeans.train(xslice)

@q423462798
Author

Errr... in fact, we still cannot reproduce the problem reliably. It sometimes happens when we call GpuIndexIVFPQ::train repeatedly, i.e. when clustering 20 or more IVFPQ indexes.
It seems that the program gets stuck inside sgemm_, so we are trying to reproduce that step in isolation.
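A minimal sketch of such a stress loop (the shapes follow the stack trace above: d=16, k=256, n=16384; the iteration count and the use of the CPU Kmeans wrapper instead of GpuIndexIVFPQ::train are assumptions):

import numpy as np
import faiss

# Repeatedly run a k-means training of the same shape as the stuck sgemm_ call,
# to see whether the hang can be triggered outside the full GPU index training.
for i in range(100):
    x = np.random.rand(16384, 16).astype('float32')
    kmeans = faiss.Kmeans(16, 256, verbose=False)
    kmeans.train(x)
    print("run", i, "finished")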

@mdouze
Contributor

mdouze commented Jun 6, 2018

This sounds like a memory corruption, probably on your side.

@q423462798
Author

q423462798 commented Jun 6, 2018

Oh, that sounds possible.
When the problem occurred, our memory usage no longer changed (5.3 GB / 8 GB), while the CPU load stayed high (400% on 4 cores).
Is there any way to check for memory corruption, e.g. with some tool?

@mdouze
Contributor

mdouze commented Jun 12, 2018

Closing issue. Feel free to re-open if the bug can be tracked down to Faiss.

@mdouze mdouze closed this as completed Jun 12, 2018