
FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422

Merged 2 commits on Oct 5, 2020
Conversation

lalitpagaria (Contributor)

  • Allow multiple write calls to an existing FAISS index.
  • Fix an issue where update_embeddings always creates a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory used by the old one, causing a memory leak.

@lalitpagaria lalitpagaria changed the title FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings WIP FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings Sep 23, 2020
tholor (Member) commented Sep 23, 2020

Looking good. Thanks for adding this, @lalitpagaria!

Let me know when it's ready for review.

lalitpagaria (Contributor, Author)

@tholor Technically the PR is ready, but there is one shortcoming; please read below:

  1. Calling write multiple times for docs without embeddings -> no logical issue.
  2. Calling write multiple times for docs with embeddings -> the FAISS store currently converts embeddings (L2 to IP) to allow inner product search, especially for the HNSWx index, and this conversion causes an issue.

Currently we follow this link to enable IP search on top of the L2 metric. The method requires phi to be computed up front and used to add an extra dimension to each embedding. So if we call write multiple times, the phi value will be different on each call, and the extra dimension supporting IP search will be computed inconsistently.
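A minimal sketch of why per-call phi values break the augmentation (illustrative only; the arrays and dimensions are made up):

import numpy as np

# Two separate write batches of 8-dim embeddings
batch1 = np.random.rand(5, 8).astype("float32")
batch2 = np.random.rand(5, 8).astype("float32")

# phi is the max squared norm of whichever batch it is computed on
phi1 = (batch1 ** 2).sum(axis=1).max()
phi2 = (batch2 ** 2).sum(axis=1).max()

# The extra dimension is sqrt(phi - ||x||^2), so augmenting each batch
# with its own phi yields vectors that are not comparable in one index.
print(phi1, phi2)  # almost surely different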

That's why in my PR #385 I abstracted this functionality out into a separate class, FaissIndexStore, leaving it up to the user to choose which metric to use, since FAISS also supports a few IP indexes such as IVF and IndexFlatIP.

An alternative is to expose the _get_phi and _get_hnsw_vectors functions via a utils class, so the caller (the retriever in our case) can precompute phi and add the extra dimension before writing the embeddings to the FAISS store, as sketched below.
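A hypothetical sketch of that alternative (the exposed names are assumed, mirroring the private helpers mentioned above):

import numpy as np

def get_phi(xb):
    # max squared norm, computed once over ALL embeddings by the caller
    return (xb ** 2).sum(axis=1).max()

def get_hnsw_vectors(xb, phi):
    # append the extra dimension sqrt(phi - ||x||^2) to each embedding
    norms = (xb ** 2).sum(axis=1)
    extra = np.sqrt(phi - norms)
    return np.hstack((xb, extra.reshape(-1, 1)))

# the caller (e.g. the retriever) precomputes phi over all embeddings,
# augments them, and only then writes them to the FAISS store
embeddings = np.random.rand(100, 32).astype("float32")
augmented = get_hnsw_vectors(embeddings, get_phi(embeddings))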

BTW, update_embeddings does not suffer from this issue, as it gets all embeddings together.

Please let me know what you think.

@tholor tholor self-assigned this Sep 23, 2020
@lalitpagaria lalitpagaria changed the title WIP FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings Sep 23, 2020
lalitpagaria (Contributor, Author) commented Sep 23, 2020

@tholor It seems I had a wrong assumption. I tested this scenario in a notebook and verified that multiple writes work perfectly. See this notebook: https://colab.research.google.com/drive/11e3zdFP6kg2xhJ8LvplVt08B8Nf2iEGn?usp=sharing

So yes, my existing PR is ready for review, and it does not have the issue I raised in my previous comment.

In case you are not able to open the notebook, try this:

!pip install faiss-gpu

import numpy as np
import faiss


# see http://ulrichpaquet.com/Papers/SpeedUp.pdf theorem 5

def get_phi(xb): 
    return (xb ** 2).sum(1).max()

def augment_xb(xb, phi=None): 
    norms = (xb ** 2).sum(1)
    if phi is None: 
        phi = norms.max()
    extracol = np.sqrt(phi - norms)
    return np.hstack((xb, extracol.reshape(-1, 1)))

def augment_xq(xq): 
    extracol = np.zeros(len(xq), dtype='float32')
    return np.hstack((xq, extracol.reshape(-1, 1)))


nq = 100
nb = 1000
d = 32

# Search vectors
xq = faiss.randn((nq, d))

# Embeddings
xb1 = faiss.randn((nq, d))
xb2 = faiss.randn((nq, d))
# concatenate for single write
xb = np.concatenate((xb1, xb2), axis=0)


# Scenario_1: Call add two times
# reference IP search via IP Index
k = 10
index = faiss.IndexFlatIP(d)
# Append two times
index.add(xb1)
index.add(xb2)
Dref, Iref = index.search(xq, k)


# reference IP search via L2 Index
k = 10
index = faiss.IndexFlatL2(d + 1)

# Append two times
index.add(augment_xb(xb1))
index.add(augment_xb(xb2))
D, I = index.search(augment_xq(xq), k)

# Check whether the results match
print("Scenario_1:", np.all(I == Iref))




# Scenario_2: Call add two times for FlatL2 and one time for FlatIP
# reference IP search via IP Index
k = 10
index = faiss.IndexFlatIP(d)
# Add the concatenated array once
index.add(xb)
Dref, Iref = index.search(xq, k)


# reference IP search via L2 Index
k = 10
index = faiss.IndexFlatL2(d + 1)

# Append two times
index.add(augment_xb(xb1))
index.add(augment_xb(xb2))
D, I = index.search(augment_xq(xq), k)

# Check whether the results match
print("Scenario_2:", np.all(I == Iref))



# Scenario_3: Call add one time for FlatL2 and two times for FlatIP
# reference IP search via IP Index
k = 10
index = faiss.IndexFlatIP(d)
# Append two times
index.add(xb1)
index.add(xb2)
Dref, Iref = index.search(xq, k)


# reference IP search via L2 Index
k = 10
index = faiss.IndexFlatL2(d + 1)

# Add the concatenated array once
index.add(augment_xb(xb))
D, I = index.search(augment_xq(xq), k)

# Check whether the results match
print("Scenario_3:", np.all(I == Iref))




# Scenario_4: Call add one time for both FlatL2 and FlatIP
# reference IP search via IP Index
k = 10
index = faiss.IndexFlatIP(d)
# Add the concatenated array once
index.add(xb)
Dref, Iref = index.search(xq, k)


# reference IP search via L2 Index
k = 10
index = faiss.IndexFlatL2(d + 1)

# Add the concatenated array once
index.add(augment_xb(xb))
D, I = index.search(augment_xq(xq), k)

# Check whether the results match
print("Scenario_4:", np.all(I == Iref))

tholor (Member) commented Sep 24, 2020

Thanks. The code example is very helpful. I will review it in the next few days. I want to investigate the transformation via phi a bit more deeply, as a bug here would be very hard to trace later and could impact performance a lot.

lalitpagaria (Contributor, Author)

@tholor I hope you find time to review this PR.

I just want to add one more point: this PR also fixes a potential memory leak in the update_embeddings function when it is called multiple times. update_embeddings always creates a new faiss_index, so the old index's memory may not be released when the faiss_index variable is overwritten. See facebookresearch/faiss#872 and facebookresearch/faiss#257.
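A minimal sketch of the idea behind the fix (attribute and parameter names are assumed, not necessarily those in the PR):

import faiss

class FaissStoreSketch:
    def __init__(self, d):
        self.faiss_index = faiss.IndexFlatL2(d)

    def update_embeddings(self, vectors):
        # Reuse the existing index instead of constructing a new one:
        # reset() drops all stored vectors but keeps the index object,
        # so no old allocation is left behind by rebinding the variable.
        self.faiss_index.reset()
        self.faiss_index.add(vectors)  # vectors: float32 numpy array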

tholor (Member) commented Sep 30, 2020

I had a look at your test script. Very helpful scenarios you have in there!

However, I found one major issue:
xq, xb1 and xb2 are all exactly the same arrays in this test. This is because you init them with faiss.randn(), which has a default seed argument.

When changing the init to:

# Search vectors
xq = faiss.randn((nq, d), seed=11)

# Embeddings
xb1 = faiss.randn((nq, d), seed=12)
xb2 = faiss.randn((nq, d), seed=13)
# concatenate for single write
xb = np.concatenate((xb1, xb2), axis=0)
print(f"Same array: {np.all(xb1 == xb2)}")

The output is

Same array: False
Scenario_1: False
Scenario_2: False
Scenario_3: True
Scenario_4: True

So it's only working in the scenarios with one write to the L2 index.
I can dig deeper here into the phi augmentation and try to find a workaround.

lalitpagaria (Contributor, Author)

@tholor Not sure why a test failed that is unrelated to this PR. Locally, no tests fail on my system. I just rebased onto the latest master. Is it possible to re-run the build?

- Fix issue where update_embeddings always creates a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory already in use, causing a memory leak.
lalitpagaria (Contributor, Author) commented Oct 2, 2020

@tholor Not sure if this is a valid scenario? Pardon my limited knowledge about embeddings.

I assume the same model will produce the same embedding for a text if none of its parameters change, even when called at different times. Not sure if a model updates its seed over time. Also, I presume any change in a model's seed would require running update_embeddings again for all documents.

Here I have a question relevant to the use case I am trying to solve: continuous fine-tuning.
Is it required to update the embeddings for all documents after every fine-tuning run?

Base Model  ------> Updated Model -----------> User  --------> Feedback -------> Fine tuning -------> Updated Model ------>

tholor (Member) commented Oct 2, 2020

I assume the same model will produce the same embedding for a text if none of its parameters change, even when called at different times. Not sure if a model updates its seed over time. Also, I presume any change in a model's seed would require running update_embeddings again for all documents.

One model will always produce the same embedding for one text. There is no randomness (and therefore no seed) in these models at inference time. Not sure why you are asking this, but if it's related to the seed in the above test scenarios, this is something different: if we don't vary the seed there we basically simulate indexing the same batch of documents twice. However, we usually have different docs in the two batches (xb1 and xb2) and therefore different embeddings.

Is it required to update the embeddings for all documents after every fine-tuning run?

Yes, every time we have a new model (e.g. after fine-tuning), we'll need to update all embeddings.

@tholor Not sure why a test failed that is unrelated to this PR. Locally, no tests fail on my system. I just rebased onto the latest master. Is it possible to re-run the build?

Yes, you can now re-run GitHub Actions. There is a button at the top right to do this when you are on the page for the failed test.

lalitpagaria (Contributor, Author)

Thanks for the explanation. I created this notebook to test the end-to-end pipeline and verify the impact of phi; see if it helps.

tholor (Member) commented Oct 5, 2020

After some further investigation, I think we can actually get rid of the phi normalization trick.
The HNSW index in FAISS now also supports the inner product metric directly. While I had problems initializing the index via the factory, the direct init seems to work well (facebookresearch/faiss#1434).
When comparing both approaches, L2 + phi normalization seemed a bit faster (0.009 sec/query for native IP vs. 0.006 for L2 + phi), but its accuracy was actually a bit worse on some random dummy vectors (~0.85 for native IP vs. 0.54 for L2 + phi).
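A minimal sketch of the direct init (the dimension and HNSW connectivity values are placeholders):

import faiss

d = 768  # embedding dimension (assumed)
M = 64   # HNSW connectivity parameter (assumed)

# Direct constructor call with the inner product metric; the factory
# string route was the one that caused problems (see faiss#1434).
index = faiss.IndexHNSWFlat(d, M, faiss.METRIC_INNER_PRODUCT)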

Therefore, I think we can merge this PR now and switch from Phi normalization to the native faiss implementation in a separate PR.
