# Converting a Pyserini FAISS index to a Fast-Forward index

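We'll use the prebuilt [`beir-v1.0.0-fiqa.contriever`](https://github.com/castorini/pyserini/blob/9db25847829a656d1c9eacb267bf745f7522dd14/pyserini/prebuilt_index_info.py#L3482) index from Pyserini, a flat FAISS index of Contriever embeddings for the BEIR FiQA corpus.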

First, download and extract the files:


In [None]:
!wget https://rgw.cs.uwaterloo.ca/pyserini/indexes/faiss/faiss-flat.beir-v1.0.0-fiqa.contriever.20230124.tar.gz
!tar xf faiss-flat.beir-v1.0.0-fiqa.contriever.20230124.tar.gz

Pyserini stores its dense retrieval indexes in FAISS format, so you'll need the [FAISS library](https://github.com/facebookresearch/faiss) to load them.


In [None]:
!pip install faiss-cpu

We can then load the index and the document IDs and reconstruct all vectors:

In [None]:
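import faiss

# Load the flat FAISS index and the document IDs (one ID per line, in index order).
index = faiss.read_index("faiss-flat.beir-v1.0.0-fiqa.contriever.20230124/index")
with open("faiss-flat.beir-v1.0.0-fiqa.contriever.20230124/docid") as fp:
    docids = fp.read().splitlines()

# Every vector needs exactly one ID.
assert index.ntotal == len(docids)

# Reconstruct all vectors as a NumPy array with one row per document.
vectors = index.reconstruct_n(0, len(docids))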
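
Reconstructing every vector in one call keeps the whole matrix in memory. For a larger index, one option is to reconstruct in batches instead. The following is a minimal sketch of the batching logic only; `batched_ranges` is a hypothetical helper, not part of FAISS or Fast-Forward, and the batch size is arbitrary. It yields `(start, count)` pairs, which match the two arguments of `reconstruct_n`.

```python
def batched_ranges(total, batch_size):
    """Yield (start, count) pairs that cover range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(batch_size, total - start)


# Intended use (not executed here, it needs the FAISS index loaded above):
# for start, count in batched_ranges(len(docids), 10_000):
#     batch = index.reconstruct_n(start, count)
#     batch_ids = docids[start : start + count]

ranges = list(batched_ranges(25, 10))
print(ranges)
```

Each batch can then be paired with the matching slice of `docids`, as in the commented loop.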

Now we have a NumPy array with all document representations and a list with the corresponding IDs. We can use those to create a Fast-Forward index:


In [None]:
!pip install fast-forward-indexes==0.2.0

In [None]:
from pathlib import Path
from fast_forward import OnDiskIndex

# 768 is the dimension of the Contriever vectors.
OnDiskIndex(Path("beir-v1.0.0-fiqa.contriever_ff.h5"), 768).add(vectors, doc_ids=docids)

# Using the index

The index we created is for the [Contriever](https://github.com/facebookresearch/contriever) encoder. The model is available on [Hugging Face](https://huggingface.co/facebook/contriever).

Since the model is based on a Transformer encoder, we can subclass `fast_forward.encoder.TransformerEncoder` to implement a Fast-Forward query encoder. The mean-pooling code is taken from the model card linked above.


In [None]:
from fast_forward.encoder import TransformerEncoder
import torch


class ContrieverEncoder(TransformerEncoder):
    def __call__(self, texts):
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        )
        with torch.no_grad():
            outputs = self.model(**inputs)

        def mean_pooling(token_embeddings, mask):
            token_embeddings = token_embeddings.masked_fill(
                ~mask[..., None].bool(), 0.0
            )
            sentence_embeddings = (
                token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
            )
            return sentence_embeddings

        # Convert to a NumPy array, the format the index stores its vectors in.
        return mean_pooling(outputs[0], inputs["attention_mask"]).cpu().numpy()
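
To make the pooling step concrete, here is a small worked example in plain Python. It is a sketch of the same idea for a single sequence, not the code the encoder runs; `masked_mean` and the toy values are assumptions for illustration. Padded positions (mask value 0) are ignored, and the remaining token vectors are averaged.

```python
def masked_mean(token_embeddings, mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    dim = len(token_embeddings[0])
    # Keep only the vectors whose mask entry is non-zero.
    kept = [vec for vec, m in zip(token_embeddings, mask) if m]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why the encoder zeroes masked positions before summing.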

Now we can load the index we just created and attach a query encoder:


In [None]:
from fast_forward import OnDiskIndex, Mode
from pathlib import Path

ff_index = OnDiskIndex.load(
    Path("beir-v1.0.0-fiqa.contriever_ff.h5"),
    ContrieverEncoder("facebook/contriever"),
    Mode.MAXP,
).to_memory()
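
`Mode.MAXP` scores each document by its highest-scoring vector. The sketch below illustrates that aggregation in plain Python; it is not the library's implementation, and `maxp`, `passage_to_doc`, and the toy scores are hypothetical. If every document ID has exactly one vector, as appears to be the case for this index, the MaxP score is simply the score of that vector.

```python
def maxp(passage_scores, passage_to_doc):
    """Map each document to the maximum score among its passages."""
    doc_scores = {}
    for passage_id, score in passage_scores.items():
        doc_id = passage_to_doc[passage_id]
        best_so_far = doc_scores.get(doc_id, float("-inf"))
        doc_scores[doc_id] = max(score, best_so_far)
    return doc_scores


# Toy data: document "d1" has two passages, "d2" has one.
passage_scores = {"p1": 0.5, "p2": 2.0, "p3": 1.0}
passage_to_doc = {"p1": "d1", "p2": "d1", "p3": "d2"}

doc_scores = maxp(passage_scores, passage_to_doc)
print(doc_scores)  # {'d1': 2.0, 'd2': 1.0}
```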

This index can be used, for example, in a PyTerrier pipeline.
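
In such a pipeline, the dense scores from the Fast-Forward index are typically interpolated with the scores of a first-stage retriever such as BM25. The following is a minimal sketch of that interpolation in plain Python. It does not use the PyTerrier or Fast-Forward API; the function name `interpolate`, the weight `alpha`, and the toy scores are assumptions for illustration.

```python
def interpolate(sparse_scores, dense_scores, alpha):
    """Combine two score dicts as alpha * sparse + (1 - alpha) * dense."""
    return {
        doc_id: alpha * sparse_scores[doc_id] + (1 - alpha) * dense_scores[doc_id]
        for doc_id in sparse_scores
    }


# Toy scores for two documents retrieved by the first stage.
sparse_scores = {"d1": 10.0, "d2": 8.0}
dense_scores = {"d1": 1.0, "d2": 4.0}

combined = interpolate(sparse_scores, dense_scores, alpha=0.5)
ranking = sorted(combined, key=combined.get, reverse=True)
print(combined)  # {'d1': 5.5, 'd2': 6.0}
print(ranking)   # ['d2', 'd1']
```

With these values the dense scores move `d2` ahead of `d1`, although the first stage ranked `d1` higher. In practice `alpha` would be tuned on validation data.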


