In [3]:
import json
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

The next cell loads the preprocessed data and the models. I chose `all-MiniLM-L6-v2` for the embeddings because it balances computational efficiency and embedding quality. For response generation, I needed a larger model, so I chose GPT-2: it runs locally and can (somewhat) handle the task. Hardware limitations ruled out anything larger than GPT-2.

In [4]:
with open("preprocessed_data/preprocessed_results_with_metadata.json", "r") as f:
    data = json.load(f)

text_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

The next code block computes the text embeddings and adds them to a FAISS (Facebook AI Similarity Search) index. The dimensionality is 384 because `all-MiniLM-L6-v2` maps text to a 384-dimensional dense vector space. This value must be changed to match the output size if a different sentence transformer model is used.

In [5]:
index = faiss.IndexFlatL2(384)
embeddings = []
metadata = []

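for doc in data:
    # encode() returns a NumPy array here, which is the format FAISS expects
    # (calling .numpy() on a tensor fails when the model runs on a GPU)
    text_embedding = text_model.encode(doc["text"], convert_to_numpy=True)
    embeddings.append(text_embedding)
    metadata.append(doc)

# Stack the vectors into an (n_docs, 384) matrix and add them to the index
index.add(np.vstack(embeddings))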
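
To make clear what a flat L2 index computes, here is a minimal pure-Python sketch of exact (brute-force) nearest-neighbour search. `flat_l2_search` and the toy vectors are hypothetical names made up for illustration and are not part of FAISS. Like `IndexFlatL2`, the sketch ranks by squared Euclidean distance. It only illustrates the idea and is far slower than the real index.

```python
# Pure-Python sketch of exact L2 search (illustration only, not the FAISS API).

def flat_l2_search(vectors, query, top_k=5):
    """Return (distances, indices) of the top_k stored vectors closest to query."""
    if any(len(v) != len(query) for v in vectors):
        raise ValueError("all vectors must have the same dimension as the query")
    # Squared Euclidean distance from the query to every stored vector.
    scored = [
        (sum((a - b) ** 2 for a, b in zip(v, query)), i)
        for i, v in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first; ties are broken by index
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 3-dimensional "embeddings" standing in for the 384-dimensional ones.
toy_vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [0.9, 0.1, 0.0], top_k=2)
print(toy_indices)  # [1, 0]: vector 1 is closest, then vector 0
```

The dimension check mirrors why the index size has to match the embedding model: a query of a different length cannot be compared with the stored vectors.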

The following code handles search and response generation. The `search` function returns the top K most relevant documents, ranked by the distance between the query embedding and the document embeddings in the FAISS index. The response generation is mostly standard. `max_length` could take other values; here it is set to 250 for error handling. Note that `max_length` counts the prompt tokens as well as the generated ones. Since the requirements ask for a single response containing all the necessary information, `num_return_sequences` is set to 1. The remaining functions build the prompt and print the response.

In [13]:
def search(query, top_k=5):
    # Embed the query with the same model used for the documents
    embedding = text_model.encode(query, convert_to_numpy=True)
    # FAISS expects a 2D array of shape (n_queries, dim)
    distances, indices = index.search(embedding.reshape(1, -1), top_k)
    results = [metadata[i] for i in indices[0]]

    return results

def response_gen(prompt):
    inputs = gpt2_tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    outputs = gpt2_model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=250,
        num_return_sequences=1,
        pad_token_id=gpt2_tokenizer.eos_token_id
    )
    return gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True)

def querying(query):
    results = search(query)
    top_docs = "\n".join([f"{i['title']} - {i['link']}" for i in results])

    prompt = f"Based on the following documents, answer the query:\n{top_docs}\n\nQuery: {query}\n\nAnswer: "
    response = response_gen(prompt)

    return {
        "response": response,
        "top_documents": [{
            "title": i["title"],
            "link": i["link"],
            "snippet": i["text"][:200]
        } for i in results]
    }


def print_response(query):
    result = querying(query)
    print("Response:", result["response"])
    print("Top Documents:")
    for doc in result["top_documents"]:
        print(f"- {doc['title']} ({doc['link']})\n  Snippet: {doc['snippet']}\n")
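
For a decoder-only model such as GPT-2, `generate` returns the prompt tokens followed by the continuation, so the string returned by `response_gen` starts with the prompt itself. One possible way to keep only the answer is sketched below. `extract_answer` is a hypothetical helper that is not part of the notebook, and the example prompt and answer are placeholders.

```python
def extract_answer(generated_text, prompt):
    """Return only the text that follows the prompt (hypothetical helper)."""
    # Usual case: the decoded output begins with the exact prompt.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: decoding may alter whitespace, so split on the last
    # "Answer:" marker that the prompt template ends with.
    marker = "Answer:"
    if marker in generated_text:
        return generated_text.rsplit(marker, 1)[1].strip()
    return generated_text.strip()


# Placeholder prompt in the same shape as the one built in `querying`.
example_prompt = (
    "Based on the following documents, answer the query:\n"
    "Doc A - http://example.org/a\n\n"
    "Query: What is X?\n\n"
    "Answer: "
)
example_output = example_prompt + " X is a placeholder."
print(extract_answer(example_output, example_prompt))  # X is a placeholder.
```

An alternative, assuming a single unpadded prompt, would be to slice the prompt tokens off `outputs[0]` before decoding. The string-based version above avoids touching the generation code.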

The model can be queried as follows:

In [12]:
query = "What are the benefits of serological assays in COVID-19?"
print_response(query)

Response: Based on the following documents, answer the query:
Detection technologies and recent developments in the diagnosis of COVID-19 infection - https://pubmed.ncbi.nlm.nih.gov/33394144/
Effectiveness of COVID-19 diagnosis and management tools: A review - https://pubmed.ncbi.nlm.nih.gov/33008761/
Integrated control of COVID-19 in resource-poor countries - https://pubmed.ncbi.nlm.nih.gov/32916249/
Emerging COVID-19 variants and their impact on SARS-CoV-2 diagnosis, therapeutics and vaccines - https://pubmed.ncbi.nlm.nih.gov/35132910/
Advancements in detection of SARS-CoV-2 infection for confronting COVID-19 pandemics - https://pubmed.ncbi.nlm.nih.gov/34497366/

Query: What are the benefits of serological assays in COVID-19?

Answer:  The benefits of serological assays are not limited to the detection of COVID-19 infection
Top Documents:
- Detection technologies and recent developments in the diagnosis of COVID-19 infection (https://pubmed.ncbi.nlm.nih.gov/33394144/)
  Snippet: Appl

In [10]:
query = "COVID-19 Origins"
print_response(query)

Response: Based on the following documents, answer the query:
Emerging COVID-19 variants and their impact on SARS-CoV-2 diagnosis, therapeutics and vaccines - https://pubmed.ncbi.nlm.nih.gov/35132910/
Rapid SARS-CoV-2 antigen detection assay in comparison with real-time RT-PCR assay for laboratory diagnosis of COVID-19 in Thailand - https://pubmed.ncbi.nlm.nih.gov/33187528/
Detection technologies and recent developments in the diagnosis of COVID-19 infection - https://pubmed.ncbi.nlm.nih.gov/33394144/
Advances in Technology to Address COVID-19 - https://pubmed.ncbi.nlm.nih.gov/33215941/
COVID-19 diagnosis —A review of current methods - https://pubmed.ncbi.nlm.nih.gov/33126180/

Query: COVID-19 Origins

Answer:  COVID-19 is a new type of influenza virus that is not currently recognized as a human influenza virus
Top Documents:
- Emerging COVID-19 variants and their impact on SARS-CoV-2 diagnosis, therapeutics and vaccines (https://pubmed.ncbi.nlm.nih.gov/35132910/)
  Snippet: ANNALSO
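
The `search` function keeps only the indices and discards the distances returned by FAISS, which makes it hard to judge how close each hit is to the query. Below is a minimal sketch of pairing every retrieved document with its distance. `attach_distances` and the toy data are hypothetical and not part of the notebook. In the notebook it could be called as `attach_distances(distances[0], indices[0], metadata)` inside `search`.

```python
def attach_distances(distances_row, indices_row, docs):
    """Pair each retrieved document with its distance (hypothetical helper)."""
    hits = []
    for dist, idx in zip(distances_row, indices_row):
        # FAISS fills missing results with index -1 when fewer than
        # top_k vectors are available, so those entries are skipped.
        if idx < 0:
            continue
        hits.append({"title": docs[idx]["title"], "distance": float(dist)})
    return hits


# Toy data standing in for one row of FAISS output and the metadata list.
toy_docs = [{"title": "Doc A"}, {"title": "Doc B"}, {"title": "Doc C"}]
toy_hits = attach_distances([0.12, 0.85, 3.4e38], [2, 0, -1], toy_docs)
for hit in toy_hits:
    print(f"{hit['distance']:.2f}  {hit['title']}")
```

Smaller distances mean closer matches, so printing them next to the titles gives a rough indication of whether the retrieved documents are actually related to the query.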