#Select a text dataset with at least 5k relatively short documents (no more than 200-300 words) from which you want to extract information - it can be product reviews (there is a database of Kindle reviews for example), movie reviews, short news, movie plots, etc.


##Load the scientific papers dataset

In [None]:
!pip install datasets transformers

In [None]:
import torch
!nvidia-smi

Sat Dec 14 23:47:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()

In [None]:
torch.cuda.get_device_name(0)

'Tesla T4'

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("scientific_papers", "pubmed")

In [None]:
# Check dataset structure
print(dataset)

# Access train/validation/test sets
print(dataset['train'][0])  # View the first example in the training set

# Access specific fields like article text and abstracts
article = dataset['train'][0]['article']
abstract = dataset['train'][0]['abstract']

print(f"Article:\n{article[:500]}...\n")
print(f"Abstract:\n{abstract}")

DatasetDict({
    train: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 6658
    })
})
{'article': "a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries .\nin iran a study among 752 high school girls in sistan and baluchestan showed prevalence of 16.2% , 8.6% and 1.5% , for underweight , overweight and obesity , respectively .\nthe prevalence of malnutrition among elementary school aged children in tehran varied from 6% to 16% .\nanthropometric study of elementary school students in shiraz revealed that 16% of them suff

In [None]:
# Load only the training split
dataset = load_dataset("scientific_papers", "pubmed", split="train")

# Convert to Pandas DataFrame if needed
import pandas as pd
df = pd.DataFrame(dataset)
df = df.dropna(subset=["article", "abstract"])  # Remove rows with missing data
df.head()

Unnamed: 0,article,abstract,section_names
0,a recent systematic analysis showed that in 20...,background : the present study was carried ou...,INTRODUCTION\nMATERIALS AND METHODS\nParticipa...
1,it occurs in more than 50% of patients and may...,backgroundanemia in patients with cancer who ...,Introduction\nPatients and methods\nStudy desi...
2,"tardive dystonia ( td ) , a rarer side effect ...",tardive dystonia ( td ) is a serious side eff...,INTRODUCTION\nCASE REPORT\nDISCUSSION\nDeclara...
3,"lepidoptera include agricultural pests that , ...",many lepidopteran insects are agricultural pe...,1. Introduction\n2. Insect Immunity\n3. Signal...
4,syncope is caused by transient diffuse cerebra...,we present an unusual case of recurrent cough...,Introduction\nCase report\nDiscussion\nConflic...


##Pre-Process the data

In [None]:
import numpy as np
import spacy   # tokenizer and lemmatizer (e.g. "has" --> "have")
nlp = spacy.load('en_core_web_sm')
nlp.disable_pipes('parser', 'ner')

['parser', 'ner']

In [None]:
def nlp_processing(doc):  # adapted from indexing.ipynb
    tokens = nlp(doc)

    # Keep the lowercased lemmas of purely alphabetic tokens and drop stop words
    terms = [token.lemma_.lower() for token in tokens if token.is_alpha and not token.is_stop]

    return terms

In [None]:
# Pre-process articles and abstracts
#df["article"] = df["article"].apply(nlp_processing)
#df["abstract"] = df["abstract"].apply(nlp_processing)

In [None]:
# Filter articles related to heart disease
def filter_heart_disease(dataframe):
    keywords = ["heart disease", "cardiac", "cardiovascular", "myocardial", "heart attack"]
    pattern = "|".join(keywords)  # regex alternation; the keywords contain no special characters

    # Keep a row if any keyword occurs (case-insensitively) in the article or the abstract
    mask = (dataframe["article"].str.contains(pattern, case=False, na=False) |
            dataframe["abstract"].str.contains(pattern, case=False, na=False))

    return dataframe[mask]

In [None]:
# Filter the DataFrame
heart_disease_df = filter_heart_disease(df)

print(f"Number of articles about heart disease: {len(heart_disease_df)}")
print(heart_disease_df.head())

Number of articles about heart disease: 25687
                                              article  \
8   lipid apheresis provides a safe and effective ...   
9   agenesis of the inferior vena cava ( ivc ) as ...   
16  evans , using a mouse mutant for the lim homeo...   
18  in past years , numerous studies have describe...   
20  der p 1 was isolated from house dust mite feca...   

                                             abstract  \
8    lipid apheresis is used to treat patients wit...   
9    background : agenesis of the inferior vena ca...   
16   cardiac progenitor cells are multipotent stem...   
18   purpose : to investigate to what degree the p...   
20   the house dust mite dermatophagoides pteronys...   

                                        section_names  
8   1. Introduction\n2. Methods\n3. Results\n4. Di...  
9   Background:\nCase:\nConclusion:\nBackground\nC...  
16  Islet1 positive cells during heart development...  
18  Introduction\nMethods\nDesign\nStudy pop


#Embed all text documents using a sentence transformer (same as in Project 1 if it worked well) and make sure you can get from the text to the embedding and vice versa fast.

In [None]:
from sentence_transformers import SentenceTransformer
# Load https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1
embedding_model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1", device="cuda")

In [None]:
embedding_model.get_max_seq_length()

512

In [None]:
embedding_model.get_sentence_embedding_dimension()

768

In [None]:
embedding_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [None]:
# Generate embeddings for abstracts (optimized for dense retrieval)
abstracts_to_embed = heart_disease_df["abstract"].tolist()
abstract_embeddings = embedding_model.encode(abstracts_to_embed, show_progress_bar=True, batch_size=64)

Batches:   0%|          | 0/402 [00:00<?, ?it/s]

In [None]:
# L2-normalize the embeddings so that the inner product equals cosine similarity
abstract_embeddings = np.array(abstract_embeddings)
normalized_embeddings = abstract_embeddings / np.linalg.norm(abstract_embeddings, axis=1, keepdims=True)
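
The normalization step above can be checked on a toy example. This is a minimal, dependency-free sketch rather than the notebook's NumPy code; the helper names `l2_normalize`, `dot`, and `cosine`, and the `eps` guard against all-zero rows, are assumptions added for illustration.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps avoids dividing by zero for an all-zero row."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]

# After normalization, the inner product matches the cosine similarity.
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
print(dot(unit_a, unit_b), cosine(a, b))
```

The same identity is what allows an inner-product index to rank by cosine similarity once every stored vector and every query vector has unit length.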

In [None]:
# Save the text-to-embedding and embedding-to-text mappings
# so that we can go from text to embedding and vice versa fast

# Add embeddings back to the DataFrame. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice.
heart_disease_df = heart_disease_df.assign(abstract_embedding=list(normalized_embeddings))

text_to_embedding = {row['abstract']: row['abstract_embedding'] for _, row in heart_disease_df.iterrows()}
embedding_to_text = {tuple(row['abstract_embedding']): row['abstract'] for _, row in heart_disease_df.iterrows()}

# Verify the mapping
sample_text = abstracts_to_embed[0]
sample_embedding = text_to_embedding[sample_text]
retrieved_text = embedding_to_text[tuple(sample_embedding)]
print(f"Original Text:\n{sample_text[:500]}...\n")
print(f"Retrieved Text:\n{retrieved_text[:500]}...\n")

Original Text:
 lipid apheresis is used to treat patients with severe hyperlipidemia by reducing low - density lipoprotein cholesterol ( ldl - c ) . 
 this study examines the effect of apheresis on the lipid panel and cardiac event rates before and after apheresis . 
 an electronic health record screen of ambulatory patients identified 11 active patients undergoing lipid apheresis with 10/11 carrying a diagnosis of fh . 
 baseline demographics , pre- and postapheresis lipid levels , highest recorded ldl - c , ...

Retrieved Text:
 lipid apheresis is used to treat patients with severe hyperlipidemia by reducing low - density lipoprotein cholesterol ( ldl - c ) . 
 this study examines the effect of apheresis on the lipid panel and cardiac event rates before and after apheresis . 
 an electronic health record screen of ambulatory patients identified 11 active patients undergoing lipid apheresis with 10/11 carrying a diagnosis of fh . 
 baseline demographics , pre- and postapheresis lipid 
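
Using a tuple of floats as a dictionary key only works while the embedding values stay bit-for-bit identical, so a row-index mapping may be a more robust way to move between text and embedding. The sketch below is one possible alternative under that assumption, not the approach used in this notebook; `build_lookup`, `embedding_for`, and `text_for` are hypothetical names, and the toy lists stand in for the abstracts and their embedding rows.

```python
def build_lookup(texts):
    """Map each text to the row index of its embedding (the first occurrence wins)."""
    text_to_row = {}
    for row, text in enumerate(texts):
        text_to_row.setdefault(text, row)
    return text_to_row

# Toy stand-ins: row i of `embeddings` belongs to texts[i].
texts = ["abstract a", "abstract b", "abstract c"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

text_to_row = build_lookup(texts)

def embedding_for(text):
    """Text -> embedding: one dict lookup plus one list index."""
    return embeddings[text_to_row[text]]

def text_for(row):
    """Embedding row (for example an ID returned by a vector index) -> text."""
    return texts[row]

print(embedding_for("abstract c"), text_for(1))
```

Both directions are constant-time lookups, and the row index is the same ID that a vector index returns from a search.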


#Choose three questions - make them harder to be answered just by finding a relevant document. For example: What types of movies does actor X play in? This requires a summary of the movies in which an actor played, not just a list of movies. Embed the queries, and find the top 5 closest documents using the cosine distance.


##Questions

###(1)What are the most common risk factors for heart disease?

###(2)How do different types of heart disease, such as coronary artery disease and arrhythmias, affect the heart's function?

###(3)How do emerging treatments like gene therapy and personalized medicine address the underlying causes of specific heart diseases?

In [None]:
!pip install faiss-cpu

In [None]:
import faiss
# Build an exact inner-product index; on L2-normalized vectors the inner product is the cosine similarity
dimension = normalized_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(normalized_embeddings)

# Map FAISS IDs to text and metadata
id_to_text = {i: heart_disease_df.iloc[i]["abstract"] for i in range(len(heart_disease_df))}
id_to_metadata = {i: {"article": heart_disease_df.iloc[i]["article"], "index": i}
                  for i in range(len(heart_disease_df))}

In [None]:
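# Define a function to extract relevant sentences from documents
def extract_relevant_sentences(docs, question):
    # Use the whitespace-separated, lowercased question words as basic keywords.
    # Note: stop words such as "the" are included and matching is by substring,
    # so this filter keeps most sentences.
    keywords = [word.lower() for word in question.split()]
    relevant_sentences = []
    for doc in docs:
        # Naive sentence split; the period separators are dropped
        sentences = doc["text"].split(". ")
        for sentence in sentences:
            lowered = sentence.lower()
            if any(keyword in lowered for keyword in keywords):
                relevant_sentences.append(sentence.strip())
    return " ".join(relevant_sentences)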
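
Because the question words include stop words and are matched as substrings, the keyword filter above lets almost every sentence through. A stricter variant might look like the sketch below. It is an illustration only: the `STOP_WORDS` set is a small hand-picked assumption (not spaCy's list), and `question_keywords` and `select_sentences` are hypothetical names that do not appear elsewhere in the notebook.

```python
import re

# Small illustrative stop-word list (an assumption, not spaCy's list).
STOP_WORDS = {"what", "are", "the", "most", "for", "how", "do", "of",
              "such", "as", "and", "like", "a", "an", "in", "to"}

def question_keywords(question):
    """Lowercased content words of the question, without punctuation or stop words."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if w not in STOP_WORDS and len(w) > 2}

def select_sentences(text, question):
    """Keep sentences that share at least one whole word with the question keywords."""
    keywords = question_keywords(question)
    sentences = [s.strip(" .") for s in re.split(r"\.\s+", text)]
    kept = [s for s in sentences
            if s and keywords & set(re.findall(r"[a-z]+", s.lower()))]
    return ". ".join(kept)

text = ("smoking is a risk factor for heart disease . "
        "the cohort was recruited in 2010 . "
        "obesity raises the risk of stroke .")
question = "What are the most common risk factors for heart disease?"
print(select_sentences(text, question))
```

Matching on whole words keeps "the" from matching inside "other", and re-joining with ". " preserves sentence boundaries in the context handed to the language model.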

In [None]:
# Document Retrieval Function with Similarity Scores
def retrieve_documents(query, top_k=5):
    # Generate embedding for the input query using the embedding model
    query_embedding = embedding_model.encode(query, normalize_embeddings=True).reshape(1, -1)

    # Search the FAISS index for the top_k nearest embeddings; with IndexFlatIP on
    # normalized vectors the returned values are cosine similarities, not distances
    scores, indices = index.search(query_embedding, top_k)

    # Create a list of results with the text, metadata, and similarity score for each retrieved document
    results = [
        {"text": id_to_text[i], "metadata": id_to_metadata[i], "score": float(scores[0][j])}
        for j, i in enumerate(indices[0])
    ]
    return results  # Return the list of retrieved documents with metadata and similarity scores
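
`faiss.IndexFlatIP` performs an exhaustive inner-product search, so its results can be cross-checked on a small sample with a brute-force ranking. The following is a minimal pure-Python sketch for that kind of sanity check; `top_k_by_inner_product` is a hypothetical helper, and the toy vectors are assumed to be already L2-normalized.

```python
import heapq

def top_k_by_inner_product(query_vec, doc_vecs, k=5):
    """Return (row, score) pairs for the k highest inner products, best first."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), row)
        for row, vec in enumerate(doc_vecs)
    )
    best = heapq.nlargest(k, scored)
    return [(row, score) for score, row in best]

# Toy unit-length vectors standing in for normalized abstract embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.8, 0.6]

for row, score in top_k_by_inner_product(query_vec, doc_vecs, k=2):
    print(f"row {row}: similarity {score:.4f}")
```

If the rows and scores returned by the index differ from this ranking on the same vectors, the most likely causes are embeddings that were not normalized or an ID mapping that no longer matches the DataFrame order.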

In [None]:
# Generate answers based on the retrieved documents
def generate_answer(question, top_k=5):
    # Retrieve the top_k most relevant documents
    retrieved_docs = retrieve_documents(question, top_k=top_k)

    # Extract relevant sentences from the retrieved documents to build the context
    context = extract_relevant_sentences(retrieved_docs, question)

    # Construct the refined prompt
    prompt = f"""
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 100 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: {context}

    Question: {question}

    Answer:
    """

    # Generate an answer using the QA pipeline
    raw_answer = qa_pipeline(prompt, max_new_tokens=200, truncation=True)[0]["generated_text"]


    return raw_answer, retrieved_docs

In [None]:
# Generate answers based on the retrieved documents
def generate_answer2(question, top_k=5):
    # Retrieve the top_k most relevant documents
    retrieved_docs = retrieve_documents(question, top_k=top_k)

    # Extract relevant sentences from the retrieved documents to build the context
    context = extract_relevant_sentences(retrieved_docs, question)

    # Construct the refined prompt
    prompt = f"""
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 50 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: {context}

    Question: {question}

    Answer:
    """

    # Generate an answer using the QA pipeline
    raw_answer = qa_pipeline(prompt, max_new_tokens=200, truncation=True)[0]["generated_text"]


    return raw_answer, retrieved_docs

In [None]:
# Generate answers based on the retrieved documents
def generate_answer3(question, top_k=5):
    # Retrieve the top_k most relevant documents
    retrieved_docs = retrieve_documents(question, top_k=top_k)

    # Extract relevant sentences from the retrieved documents to build the context
    context = extract_relevant_sentences(retrieved_docs, question)

    # Construct the refined prompt
    prompt = f"""
    You are an expert in cardiovascular medicine.
    Based on the provided context, answer the question concisely and accurately in fewer than 80 words.
    If the context does not contain enough information, say \'I don't know\' rather than making up an answer..

    Context: {context}

    Question: {question}

    Answer:
    """

    # Generate an answer using the QA pipeline
    raw_answer = qa_pipeline(prompt, max_new_tokens=200, truncation=True)[0]["generated_text"]


    return raw_answer, retrieved_docs
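
The three functions above differ only in the wording of the prompt, so a single parameterized prompt builder could replace them. The sketch below shows one way to do this under stated assumptions: `build_prompt` and its parameters `word_limit`, `max_docs`, and `max_chars_per_doc` are hypothetical, and the character budget is a rough stand-in for a real token limit. It places the full text of the top documents in the context, which is what the assignment describes.

```python
def build_prompt(question, docs, word_limit=100, max_docs=5, max_chars_per_doc=1500):
    """Assemble one prompt from the question and the top retrieved documents."""
    context_parts = []
    for rank, doc in enumerate(docs[:max_docs], start=1):
        text = " ".join(doc["text"].split())  # collapse newlines and repeated spaces
        context_parts.append(f"[{rank}] {text[:max_chars_per_doc]}")
    context = "\n".join(context_parts)
    return (
        "You are an expert in cardiovascular medicine.\n"
        "Based on the following context, answer the question concisely and "
        f"accurately in {word_limit} words or less.\n"
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

# Toy documents in the same shape that retrieve_documents returns.
docs = [{"text": f"doc {i}\nbody"} for i in range(6)]
prompt = build_prompt("What are common risk factors?", docs, word_limit=50, max_docs=3)
print(prompt)
```

Dropping `max_docs` from 5 to 3 is the fallback the assignment suggests when the model's context window is too short for all five documents.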

In [None]:
questions = [
    "What are the most common risk factors for heart disease?",
    "How do different types of heart disease, such as coronary artery disease and arrhythmias, affect the heart's function?",
    "How do emerging treatments like gene therapy and personalized medicine address the underlying causes of specific heart diseases?"
]

In [None]:
from transformers import pipeline

# Initialize the QA model (FLAN-T5) on the GPU when one is available
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", device=device)
#summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


In [None]:
# Loop through the three prompt variants (first, second, third prompt)
for answer_fn in (generate_answer, generate_answer2, generate_answer3):
    # Loop through each question to generate answers and display the results
    for question in questions:
        # Generate the answer and retrieve metadata for the top documents
        answer, retrieved_docs = answer_fn(question)

        # Summarize the answer for conciseness
        #final_answer = summarizer(answer, max_length=100, min_length=30, do_sample=False)[0]["summary_text"]

        # Print the question and the final answer
        print(f"Q: {question}\nA: {answer}\n")

        # Print the top retrieved documents with their similarity scores
        print("Top Retrieved Documents with Similarity Scores:")
        for doc in retrieved_docs:
            print(f"- Index: {doc['metadata']['index']}, Similarity Score: {doc['score']:.4f}")
            print(f"  Document Excerpt: {doc['text'][:200]}...\n")

Q: What are the most common risk factors for heart disease?
A: cardiovascular disease

Top Retrieved Documents with Similarity Scores:
- Index: 191, Similarity Score: 0.6770
  Document Excerpt:  cardiovascular diseases ( cvds ) are the leading cause of mortality worldwide . coronary heart disease ( chd ) 
 is the main cause of mortality in heart patients following stroke , rheumatic heart di...

- Index: 951, Similarity Score: 0.6472
  Document Excerpt:  objectivesto correlate cardiovascular risk factors ( e.g. , hypertension , obesity , hypercholesterolemia , hypertriglyceridemia , hyperglycemia , sedentariness ) in childhood and adolescence with th...

- Index: 17675, Similarity Score: 0.6441
  Document Excerpt:  cardiovascular diseases ( cvds ) causes of worldwide preventable morbidity and mortality .   
 cvds are a leading cause of mortality and morbidity in developing countries , and rates are expected to ...

- Index: 4864, Similarity Score: 0.6438
  Document Excerpt:  cholestero


#Choose a language model that fits in the GPU of a colab notebook (look at LLama or Mistral 7B parameters). Form a prompt that includes the original query and the text of the top 5 closest documents (if the context length of the model is shorter - then just use top 3) - look in the Medium article for prompt examples. Store the answer generated for each query. Try different prompts, try repeating the same prompt and see what happens.

#Initialize a LLama Model

In [None]:
from huggingface_hub import login

# Replace the placeholder below with your own Hugging Face access token
login("INSERT TOKEN HERE")

# Initialize the LLaMA model on the GPU when one is available
qa_pipeline = pipeline("text-generation", model="meta-llama/Llama-3.2-1B", device=device)
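
A `text-generation` pipeline returns the prompt followed by the continuation by default, so `generated_text` begins with the prompt itself. One way to keep only the answer is to strip that prefix, as in the sketch below; `strip_prompt_echo` is a hypothetical helper. Passing `return_full_text=False` to the text-generation pipeline call should have a similar effect, but the helper also leaves output that does not start with the prompt unchanged, so it can be applied to both pipelines used here.

```python
def strip_prompt_echo(prompt, generated_text):
    """Return only the newly generated part when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

prompt = "Question: What raises blood pressure?\n\nAnswer:"

# Decoder-only models echo the prompt before the continuation.
echoed = prompt + " High salt intake and obesity."
# Text2text models return the answer alone.
plain = "High salt intake and obesity."

print(strip_prompt_echo(prompt, echoed))
print(strip_prompt_echo(prompt, plain))
```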


#Generate Answers from LLama

In [None]:
# Loop through the three prompt variants (first, second, third prompt)
for answer_fn in (generate_answer, generate_answer2, generate_answer3):
    # Loop through each question to generate answers and display the results
    for question in questions:
        # Generate the answer and retrieve metadata for the top documents
        answer, retrieved_docs = answer_fn(question)

        # Summarize the answer for conciseness
        #final_answer = summarizer(answer, max_length=100, min_length=30, do_sample=False)[0]["summary_text"]

        # Print the question and the final answer
        print(f"Q: {question}\nA: {answer}\n")

        # Print the top retrieved documents with their similarity scores
        print("Top Retrieved Documents with Similarity Scores:")
        for doc in retrieved_docs:
            print(f"- Index: {doc['metadata']['index']}, Similarity Score: {doc['score']:.4f}")
            print(f"  Document Excerpt: {doc['text'][:200]}...\n")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Q: What are the most common risk factors for heart disease?
A: 
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 100 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: cardiovascular diseases ( cvds ) are the leading cause of mortality worldwide coronary heart disease ( chd ) 
 is the main cause of mortality in heart patients following stroke , rheumatic heart disease and myocardial infarctions infectious diseases , human immunodeficiency , tuberculosis , malaria , high blood pressure or hypertension , obesity and overweight , and nutritional disorders including smoking , excessive alcohol consumption , high salt and sugar intake , as well as other factors are responsible for cvds and chds in young as well as elderly individuals the focus of the present review are recent epidemiological aspects of cvd and chd as well as the usefu


Q: How do different types of heart disease, such as coronary artery disease and arrhythmias, affect the heart's function?
A: 
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 100 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: cardiovascular complications are known to be the main determinants of reduced life expectancy and decreased quality of life in acromegaly patients our study aimed to provide insight into the cardiovascular changes that occur in acromegaly patients and to investigate the correlative risk factors a total of 108 patients definitively diagnosed with acromegaly and 108 controls matched for age and gender were recruited into study and control groups , respectively standard echocardiography was performed on all of the participants , and data were collected and analyzed all acromegaly patients presented with str


Q: How do emerging treatments like gene therapy and personalized medicine address the underlying causes of specific heart diseases?
A: 
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 100 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: genetics plays an important role in the pathophysiology of cardiovascular diseases , and is increasingly being integrated into clinical practice since 2008 , both capacity and cost - efficiency of mutation screening of dna have been increased magnificently due to the technological advancement obtained by next - generation sequencing hence , the discovery rate of genetic defects in cardiovascular genetics has grown rapidly and the financial threshold for gene diagnostics has been lowered , making large - scale dna sequencing broadly accessible in this review , 
 the genetic variants , mutations 


Q: What are the most common risk factors for heart disease?
A: 
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 50 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: cardiovascular diseases ( cvds ) are the leading cause of mortality worldwide coronary heart disease ( chd ) 
 is the main cause of mortality in heart patients following stroke , rheumatic heart disease and myocardial infarctions infectious diseases , human immunodeficiency , tuberculosis , malaria , high blood pressure or hypertension , obesity and overweight , and nutritional disorders including smoking , excessive alcohol consumption , high salt and sugar intake , as well as other factors are responsible for cvds and chds in young as well as elderly individuals the focus of the present review are recent epidemiological aspects of cvd and chd as well as the useful


Q: How do different types of heart disease, such as coronary artery disease and arrhythmias, affect the heart's function?
A: 
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 50 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: cardiovascular complications are known to be the main determinants of reduced life expectancy and decreased quality of life in acromegaly patients our study aimed to provide insight into the cardiovascular changes that occur in acromegaly patients and to investigate the correlative risk factors a total of 108 patients definitively diagnosed with acromegaly and 108 controls matched for age and gender were recruited into study and control groups , respectively standard echocardiography was performed on all of the participants , and data were collected and analyzed all acromegaly patients presented with stru


Q: How do emerging treatments like gene therapy and personalized medicine address the underlying causes of specific heart diseases?
A: 
    You are an expert in cardiovascular medicine.
    Based on the following context, answer the question concisely and accurately in 50 words or less.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: genetics plays an important role in the pathophysiology of cardiovascular diseases , and is increasingly being integrated into clinical practice since 2008 , both capacity and cost - efficiency of mutation screening of dna have been increased magnificently due to the technological advancement obtained by next - generation sequencing hence , the discovery rate of genetic defects in cardiovascular genetics has grown rapidly and the financial threshold for gene diagnostics has been lowered , making large - scale dna sequencing broadly accessible in this review , 
 the genetic variants , mutations a


Q: What are the most common risk factors for heart disease?
A: 
    You are an expert in cardiovascular medicine.
    Based on the provided context, answer the question concisely and accurately in fewer than 80 words.
    If the context does not contain enough information, say 'I don't know' rather than making up an answer..

    Context: cardiovascular diseases ( cvds ) are the leading cause of mortality worldwide coronary heart disease ( chd ) 
 is the main cause of mortality in heart patients following stroke , rheumatic heart disease and myocardial infarctions infectious diseases , human immunodeficiency , tuberculosis , malaria , high blood pressure or hypertension , obesity and overweight , and nutritional disorders including smoking , excessive alcohol consumption , high salt and sugar intake , as well as other factors are responsible for cvds and chds in young as well as elderly individuals the focus of the present review are recent epidemiological aspects of cvd and chd as wel


Q: How do different types of heart disease, such as coronary artery disease and arrhythmias, affect the heart's function?
A: 
    You are an expert in cardiovascular medicine.
    Based on the provided context, answer the question concisely and accurately in fewer than 80 words.
    If the context does not contain enough information, say 'I don't know' rather than making up an answer..

    Context: cardiovascular complications are known to be the main determinants of reduced life expectancy and decreased quality of life in acromegaly patients our study aimed to provide insight into the cardiovascular changes that occur in acromegaly patients and to investigate the correlative risk factors a total of 108 patients definitively diagnosed with acromegaly and 108 controls matched for age and gender were recruited into study and control groups , respectively standard echocardiography was performed on all of the participants , and data were collected and analyzed all acromegaly patients pres
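
The assignment asks to store the answer generated for each query and to see what happens when the same prompt is repeated. A small in-memory log keyed by question and prompt variant is one way to keep those runs side by side. This is a sketch only: the `AnswerLog` class, its method names, and the prompt label used below are hypothetical and not part of the notebook.

```python
import json

class AnswerLog:
    """Collect every generated answer per (question, prompt variant) pair."""

    def __init__(self):
        self._runs = {}

    def record(self, question, prompt_name, answer):
        """Append one generated answer to the runs for this question and prompt."""
        self._runs.setdefault((question, prompt_name), []).append(answer)

    def distinct_answers(self, question, prompt_name):
        """Sorted unique answers; more than one means repeated runs disagreed."""
        return sorted(set(self._runs.get((question, prompt_name), [])))

    def to_json(self):
        """Serialize all runs, for example for pasting into a results write-up."""
        rows = [
            {"question": q, "prompt": p, "answers": answers}
            for (q, p), answers in self._runs.items()
        ]
        return json.dumps(rows, indent=2)

log = AnswerLog()
# Toy answers standing in for three repeated generations of the same prompt.
for answer in ["hypertension", "hypertension", "obesity"]:
    log.record("What are the risk factors?", "prompt_100_words", answer)

print(log.distinct_answers("What are the risk factors?", "prompt_100_words"))
print(log.to_json())
```

With greedy decoding, repeated runs of the same prompt would normally produce a single distinct answer; sampling-based generation is where differences between runs are expected to show up.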

#Add to Answers.md the description of your dataset, what you did, the experiments you tried and their results. A list of references for your algorithm/code, etc. Show your queries, top k documents, the prompts, and the best/worst answers you got. Please be thorough with your writing. Comment on the results you got and what is the improvement compared to just using top k documents - simple retrieval.