# Indexing

In practice, datasets often consist of thousands or millions of rows. Looping through the whole corpus to find the best answer to a query is far too slow. In this tutorial, we introduce how to use indexing to address this issue.

## Step 0: Setup

Install the dependencies in the environment.

In [None]:
%pip install -U FlagEmbedding datasets faiss-cpu

## Step 1: Load Dataset

Compared with the 10-sentence corpus in the quick-start, we now work with a larger, more realistic dataset.

First, let's download the FiQA dataset from Hugging Face.

In [14]:
from datasets import load_dataset

queries = load_dataset("BeIR/fiqa", "queries", trust_remote_code=True)['queries']
corpus = load_dataset("BeIR/fiqa", "corpus", trust_remote_code=True)['corpus']

FiQA is a dataset focused on finance-related topics, with 6.65k queries and 57.6k corpus documents.

The following print statements give a brief idea of what the items in queries and corpus look like.

In [15]:
print(queries[0])
print(corpus[0])

{'_id': '0', 'title': '', 'text': 'What is considered a business expense on a business trip?'}
{'_id': '3', 'title': '', 'text': "I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything."}


## Step 2: Text Embedding

Here, for the sake of speed, we just embed the first 500 docs in the corpus.

In [16]:
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the corpus
corpus_embeddings = model.encode(corpus[:500]['text'])

print("shape of the corpus embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)

Inference Embeddings: 100%|██████████| 2/2 [00:34<00:00, 17.13s/it]

shape of the corpus embeddings: (500, 768)
data type of the embeddings:  float32





Faiss only accepts float32 inputs.

If corpus_embeddings has a dtype other than float32, uncomment and run the following cell.

In [17]:
# import numpy as np
# corpus_embeddings = corpus_embeddings.astype(np.float32)

## Step 3: Indexing

In this step, we build an index and add the embedding vectors to it.

In [18]:
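import faiss
import numpy as np

# get the length of our embedding vectors; vectors from bge-base-en-v1.5 have length 768
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings in the vector space
index = faiss.index_factory(dim, 'IDMap,Flat', faiss.METRIC_INNER_PRODUCT)
# a Flat index needs no training, so this call is effectively a no-op here
index.train(corpus_embeddings)
# Faiss expects integer (int64) ids; FiQA stores them as strings, so convert explicitly
corpus_ids = np.array([int(i) for i in corpus[:500]["_id"]], dtype=np.int64)
index.add_with_ids(corpus_embeddings, corpus_ids)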

print(f"total number of vectors: {index.ntotal}")

total number of vectors: 500
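
To make the idea of an 'IDMap,Flat' index with an inner-product metric more concrete, here is a minimal pure-Python sketch of what such an index does conceptually: it stores every vector together with an external id, and at search time it scores all stored vectors against the query and keeps the top k. The class ToyFlatIPIndex and the toy vectors are hypothetical and exist only for illustration; this is not how Faiss is implemented internally.

```python
import heapq


class ToyFlatIPIndex:
    """Toy stand-in for an 'IDMap,Flat' inner-product index (illustration only)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []  # stored vectors, in insertion order
        self.ids = []      # external integer id of each stored vector

    @property
    def ntotal(self):
        return len(self.vectors)

    def add_with_ids(self, vectors, ids):
        for vec, ext_id in zip(vectors, ids):
            if len(vec) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
            self.vectors.append(list(vec))
            self.ids.append(int(ext_id))  # ids are kept as integers

    def search(self, query, k):
        # Exhaustive scan: compute the inner product with every stored vector.
        scored = (
            (sum(q * v for q, v in zip(query, vec)), ext_id)
            for vec, ext_id in zip(self.vectors, self.ids)
        )
        # Keep the k highest scores, best first.
        top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return [s for s, _ in top], [i for _, i in top]


toy_index = ToyFlatIPIndex(dim=2)
toy_index.add_with_ids([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], ["3", "31", "56"])
scores, ids = toy_index.search([1.0, 0.0], k=2)
print(toy_index.ntotal, scores, ids)
```

The search returns external ids rather than row positions, which mirrors why the tutorial passes the FiQA ids to add_with_ids.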


### Step 3.5 (Optional): Saving Faiss index

Once you have built the index with the embedding vectors, you can save it locally for future use.

In [19]:
# change the path to where you want to save the index
path = "./index.bin"
faiss.write_index(index, path)

If you already have an index stored in your local directory, you can load it with:

In [20]:
# index = faiss.read_index("./index.bin")

## Step 4: Find answers to the query

First, we select a query and get its embedding:

In [21]:
query_id = queries[912]['_id']
query = queries[912]['text']
print(f"id: {query_id}\nquery: {query}")
q_embedding = model.encode(query).reshape([1, -1])

id: 1927
query: How does a Non US citizen gain SEC Accredited Investor Status?


Then, use the Faiss index to run a k-nearest-neighbor (kNN) search in the vector space:

In [22]:
dists, ids = index.search(q_embedding, k=5)

The search returns the ids of the k best matches and their similarity scores to the query (inner products, so higher is better):

In [23]:
print(dists)
print(ids)

[[0.75860226 0.6303128  0.6199247  0.6176082  0.6170898 ]]
[[  63  417 4772 2809 1001]]
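
Because the index was created with faiss.METRIC_INNER_PRODUCT, the first array holds inner products, sorted from best to worst. If the embeddings are L2-normalized (an assumption here; check the settings of your embedding model), the inner product coincides with cosine similarity. The small sketch below checks that relationship on made-up two-dimensional vectors; the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the normalized vectors vs. cosine of the raw vectors.
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(round(ip_normalized, 6), round(cosine(a, b), 6))
```

If the vectors were not normalized, the inner product would also depend on vector length, so the scores would no longer be bounded by 1.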


Here, the system tells us that the best match in the indexed corpus is the document with id 63. We have already located it for you: it is the item at index 4 (the fifth item) in our corpus.

In [24]:
print(f"id:\t{corpus[4]['_id']}\ntext:\t\'{corpus[4]['text']}\'")

id:	63
text:	'Here are the SEC requirements: The federal securities laws define the term accredited investor in   Rule 501 of Regulation D as: a bank, insurance company, registered investment company, business development company, or small business investment company; an employee benefit plan, within the meaning of the Employee Retirement Income Security Act, if a bank, insurance company, or   registered investment adviser makes the investment decisions, or if   the plan has total assets in excess of $5 million; a charitable organization, corporation, or partnership with assets exceeding $5 million; a director, executive officer, or general partner of the company selling the securities; a business in which all the equity owners are accredited investors; a natural person who has individual net worth, or joint net worth with the person’s spouse, that exceeds $1 million at the time of the   purchase, excluding the value of the primary residence of such person; a natural person with income
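
Rather than locating a document by hand, one option is to build a lookup table from id to row position once and reuse it for every search result. The following is a minimal sketch on a made-up toy corpus shaped like the FiQA items; toy_corpus, id_to_row, and the texts are hypothetical.

```python
# Toy rows shaped like the FiQA corpus items (texts are made up).
toy_corpus = [
    {"_id": "3", "title": "", "text": "doc about training"},
    {"_id": "31", "title": "", "text": "doc about taxes"},
    {"_id": "63", "title": "", "text": "doc about accredited investors"},
]

# Map each integer id to its row position so lookups do not scan the list.
id_to_row = {int(item["_id"]): pos for pos, item in enumerate(toy_corpus)}

# Ids as a search would return them (integers, best match first).
retrieved_ids = [63, 3]
for doc_id in retrieved_ids:
    row = toy_corpus[id_to_row[doc_id]]
    print(doc_id, row["text"])
```

With the tutorial's variables, the same mapping could be built from corpus[:500]["_id"], so that each id returned by the search resolves to its row directly.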

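According to the ground truth of FiQA, the relevant corpus id for the query with id 1927 is 18850. Since we embedded only the first 500 documents, that document may not be in our index, so the top result here can differ from the ground truth.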
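
Comparing retrieved ids with the ground truth can be automated. The sketch below defines two hypothetical helpers, hit_at_k and recall_at_k, and applies them to the ids printed by the search above and the ground-truth id quoted in the text. It assumes the relevant ids for a query are available as a set of integers; how they are loaded is not covered here.

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """Return True if any relevant id appears among the first k retrieved ids."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids found among the first k retrieved ids."""
    if not relevant_ids:
        return 0.0
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


retrieved = [63, 417, 4772, 2809, 1001]  # ids printed by the search above
relevant = {18850}                        # ground-truth id quoted in the text

print(hit_at_k(retrieved, relevant, k=5), recall_at_k(retrieved, relevant, k=5))
```

Averaging recall_at_k over many queries gives a simple retrieval quality measure for an index.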

Congrats! You have successfully used Faiss for indexing! This will be the basis for many future information retrieval tasks.