# Create the embeddings for the query examples

## Prerequisites
- [Ollama](https://ollama.com/search?c=embedding) with the embedding models: `mxbai-embed-large`, `nomic-embed-text`, `all-minilm`
- A folder with the queries (`./../../data/idsm_queries`)

## Import the required modules

In [None]:
from tqdm import tqdm
import glob
from langchain_ollama import OllamaEmbeddings
from langchain_community.docstore import InMemoryDocstore
import faiss
from langchain_community.vectorstores import FAISS
import logging
from pathlib import Path
import os

logging.getLogger("httpx").propagate = False
logging.getLogger("httpx").setLevel("CRITICAL") 

## Prepare the embedding variables

We use Ollama embeddings with one of the following models: `mxbai-embed-large`, `nomic-embed-text`, or `all-minilm`.

In [None]:
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
)

We use FAISS as the vector store. The index dimension is taken from the length of a sample embedding.

In [None]:
vectorstore = FAISS(
    embedding_function=embeddings,
    docstore=InMemoryDocstore(),
    index=faiss.IndexFlatL2(len(embeddings.embed_query("hello world"))),
    index_to_docstore_id={}
)
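`faiss.IndexFlatL2` is an exact, brute-force index: it compares the query against every stored vector using squared Euclidean distance, which is why it only needs the embedding dimension at construction time. The pure-Python sketch below illustrates that idea on toy 2-dimensional vectors. The helper names `squared_l2` and `flat_l2_search` are illustrative and are not part of FAISS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return the k (distance, position) pairs closest to `query`.

    Every stored vector is scanned, mirroring what a flat index does.
    """
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:k]


# Toy "embeddings"; real ones have hundreds of dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
hits = flat_l2_search(vectors, [0.9, 0.1], k=2)
for distance, position in hits:
    print(position, distance)
```

A smaller distance means a closer match, so the results are ordered from most to least similar.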

We initialise the query directory and the saving/loading path. 

Note that all the embeddings are available at [this MyBox URL](https://mybox.inria.fr/d/24d9423c67d64f8284fa/). You can download them to skip the embedding step. The password is: `Kc8(-8aE`

In [None]:
query_directory = str(Path(os.getcwd()).parent.parent / 'data' / 'idsm_queries')
saving_path = Path(os.getcwd()).parent.parent / 'data' / 'faiss_embeddings' / 'idsm' / "query_v1_nomic_faiss_index"
documents = []

Prepare the documents to be ingested.

In [None]:
# Read every SPARQL query file (*.rq) into memory as one document per file
for filename in glob.glob(query_directory + '/*.rq'):
    with open(filename, mode="r", encoding="utf-8") as f:
        documents.append(f.read())
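As an alternative, the same loading step can be sketched with `pathlib` only, which also makes the file order deterministic. This is a minimal sketch: `load_queries` is a hypothetical helper, and the temporary directory with its sample files exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def load_queries(directory, pattern="*.rq"):
    """Return the text of every file matching `pattern`, sorted by path."""
    paths = sorted(Path(directory).glob(pattern))
    return [p.read_text(encoding="utf-8") for p in paths]


# Demo on a throwaway directory (stands in for the real query folder).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.rq").write_text("SELECT * WHERE { ?s ?p ?o } LIMIT 2", encoding="utf-8")
    Path(tmp, "a.rq").write_text("ASK { ?s ?p ?o }", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    demo_documents = load_queries(tmp)

print(len(demo_documents))
```

Sorting the paths keeps the document order stable across runs, which helps when comparing indexes built at different times.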

In [None]:
len(documents)

## Ingest the documents

In [None]:
db = None
with tqdm(total=len(documents), desc="Ingesting documents") as pbar:
    for d in documents:
        if db is not None:
            # Add to the existing store
            db.add_texts([d])
        else:
            # The first document creates the store
            db = FAISS.from_texts([d], embedding=embeddings)
        pbar.update(1)
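Ingesting one document at a time gives a fine-grained progress bar but issues one embedding call per document. If that turns out to be slow, the documents could be grouped into batches instead. The sketch below shows only the batching logic. `batched` is a hypothetical helper, and the batch size of 3 is an arbitrary choice for the demo.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


demo_documents = [f"query {i}" for i in range(7)]
batches = list(batched(demo_documents, 3))
print([len(b) for b in batches])

# Assumed usage with the objects defined in this notebook:
#   for batch in batched(documents, 32):
#       if db is None:
#           db = FAISS.from_texts(batch, embedding=embeddings)
#       else:
#           db.add_texts(batch)
```

Whether batching is actually faster depends on the embedding model and the Ollama setup, so it is worth timing both variants.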

## Save the embeddings locally

In [None]:
db.save_local(saving_path)

## Load the embeddings

In [None]:
# Loading unpickles the stored docstore, so only load index files you trust
db = FAISS.load_local(saving_path, embeddings=embeddings, allow_dangerous_deserialization=True)

In [None]:
db.index.ntotal

## Example of query selection

In [None]:
queries = [
    "What protein targets does donepezil (CHEBI_53289) inhibit with an IC50 less than 10 µM?",
    "What protein targets does (CHEBI_124758) inhibit with an PF5 less than 10 µM?",
    "protein targets donepezil (CHEBI_53289) inhibit with IC50",
    "protein targets donepezil (CHEBI_53289) IC50",
    "protein donepezil (CHEBI_53289) IC50",
    "donepezil (CHEBI_53289) IC50",
    "donepezil 53289 IC50"
    ]

query = queries[0]

# Retrieve the five most similar documents
retrieved_documents = db.similarity_search(query, k=5)

# Show the content of each retrieved document
for doc in retrieved_documents:
    print(f"{doc.page_content}\n\n-----------------------------------------\n")
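The `queries` list contains progressively shorter variants of the same question, so one way to use it is to check how many of the top results each variant shares with the full question. The sketch below is one possible approach and is not part of the original notebook. `overlap_with_reference` is a hypothetical helper, and `fake_search` stands in for the real vector store so the example runs on its own.

```python
def overlap_with_reference(search, queries, k=5):
    """Count how many top-k results each query shares with the first query.

    `search(query, k)` must return a list of hashable results (e.g. strings).
    """
    reference = set(search(queries[0], k))
    return {q: len(reference & set(search(q, k))) for q in queries}


# Stand-in for the vector store: fixed result lists per query.
fake_index = {
    "full question": ["q1", "q2", "q3"],
    "keywords only": ["q2", "q3", "q9"],
}


def fake_search(query, k):
    return fake_index[query][:k]


overlaps = overlap_with_reference(fake_search, ["full question", "keywords only"], k=3)
print(overlaps)

# Assumed usage with the notebook's objects:
#   search = lambda q, k: [d.page_content for d in db.similarity_search(q, k=k)]
#   overlap_with_reference(search, queries, k=5)
```

A variant that keeps a high overlap with the full question retrieves largely the same examples despite the shorter wording.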