<a href="https://colab.research.google.com/github/drfarooqgenai-lab/synthetic-patient-data-generator/blob/main/Copy_of_MEDFAQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install sentence-transformers faiss-cpu transformers torch pandas numpy rouge-score textstat streamlit



In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline
from rouge_score import rouge_scorer
import textstat

In [None]:
def sample_dataset():
    # Small built-in fallback so the notebook still runs without the CSV
    sample_data = {
        'question': [
            "What is (are) Prostate Cancer?",
            "What are the side effects of chemotherapy?",
            "How is breast cancer diagnosed?",
            "What is radiation therapy for cancer?",
            "Can cancer be prevented?"
        ],
        'answer': [
            "The body is made up of many types of cells. Normally, cells grow, divide, and produce more cells as needed to keep the body healthy and functioning properly. Sometimes, however, the process goes wrong -- cells become abnormal and form more cells in an uncontrolled way. These extra cells form a mass of tissue, called a tumor. Prostate cancer occurs when such tumors form in the prostate gland.",
            "Chemotherapy can cause nausea, hair loss, fatigue, and increased risk of infections. These vary by person and treatment type.",
            "Breast cancer is diagnosed through mammograms, ultrasounds, biopsies, and sometimes MRI scans to confirm the presence of cancerous cells.",
            "Radiation therapy uses high-energy rays to kill cancer cells and shrink tumors, often used after surgery to target remaining cancer cells.",
            "While not all cancers can be prevented, avoiding tobacco, maintaining a healthy diet, exercising, and regular screenings can reduce risk."
        ]
    }
    return pd.DataFrame(sample_data)

def load_dataset(file_path='/content/medquad_CANCER.csv'):
    try:
        df = pd.read_csv(file_path)
        # Normalize column names to lowercase for easier handling
        df.columns = df.columns.str.lower()

        # Check if the required columns exist after normalization
        if 'question' not in df.columns or 'answer' not in df.columns:
            raise ValueError("Dataset must contain 'question' and 'answer' columns (case-insensitive).")

        df = df[['question', 'answer']].dropna()
        print(f"Loaded {len(df)} Q&A pairs from {file_path}")
        return df
    except FileNotFoundError:
        print("File not found. Using sample dataset.")
    except ValueError as e:
        print(f"Error loading dataset: {e}")
        # Fallback to sample data if file exists but columns are wrong
        print("Using sample dataset due to incorrect columns or file format.")

    # Both failure paths share the same fallback
    df = sample_dataset()
    print(f"Using sample data with {len(df)} Q&A pairs.")
    return df

# Load the dataset
df = load_dataset()

Loaded 554 Q&A pairs from /content/medquad_CANCER.csv


In [None]:
def build_knowledge_base(df, model_name='all-MiniLM-L6-v2', top_k=3):
    embedder = SentenceTransformer(model_name)
    documents = [f"Question: {row['question']}\nAnswer: {row['answer']}" for _, row in df.iterrows()]
    embeddings = embedder.encode(documents, convert_to_tensor=False, show_progress_bar=True)
    embeddings = np.array(embeddings).astype('float32')

    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)
    faiss.normalize_L2(embeddings)
    index.add(embeddings)

    metadata = df.to_dict('records')
    print(f"Built knowledge base with {len(documents)} documents.")
    return embedder, index, documents, metadata, top_k

# Build the knowledge base
embedder, index, documents, metadata, top_k = build_knowledge_base(df)

Built knowledge base with 554 documents.
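
`build_knowledge_base` calls `faiss.normalize_L2` on the embeddings before adding them to an `IndexFlatIP`, so the inner-product scores the index returns are cosine similarities. The following is a minimal pure-Python sketch of that equivalence; the helper names and the toy 2-D vectors are illustrative assumptions and are not part of the notebook.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the per-row effect of L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def top_k_by_inner_product(query, docs, k):
    """Brute-force search over normalized vectors, mimicking a flat inner-product index."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(docs)]
    scored.sort(reverse=True)  # highest score first
    return scored[:k]

# Toy stand-ins for document and query embeddings
doc_vectors = [[1.0, 0.0], [3.0, 4.0], [-1.0, 1.0]]
query_vector = [2.0, 2.0]

hits = top_k_by_inner_product(query_vector, doc_vectors, k=2)
for score, doc_id in hits:
    # The two printed numbers agree: normalized inner product == cosine similarity
    print(doc_id, round(score, 3), round(cosine_similarity(query_vector, doc_vectors[doc_id]), 3))
```

Because the score depends only on direction, long and short answers are ranked on an equal footing, which is presumably why the index is built this way.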


In [None]:
def retrieve_documents(query, embedder, index, documents, metadata, top_k):
    query_embedding = embedder.encode([query], convert_to_tensor=False)
    query_embedding = np.array(query_embedding).astype('float32')
    faiss.normalize_L2(query_embedding)

    # Never ask for more neighbours than the index holds; FAISS pads missing hits with -1
    k = min(top_k, index.ntotal)
    # With IndexFlatIP on normalized vectors the returned scores are cosine similarities
    scores, indices = index.search(query_embedding, k)
    retrieved_docs = [documents[i] for i in indices[0]]
    retrieved_metadata = [metadata[i] for i in indices[0]]
    avg_similarity = float(np.mean(scores[0]))

    return retrieved_docs, retrieved_metadata, avg_similarity

In [None]:
from functools import lru_cache

@lru_cache(maxsize=2)
def get_generator(llm_model='distilgpt2'):
    # Build the pipeline once per model; rebuilding it on every query reloads the weights
    return pipeline('text-generation', model=llm_model, tokenizer=llm_model)

def generate_answer(query, retrieved_docs, llm_model='distilgpt2'):
    generator = get_generator(llm_model)

    # Truncate context so the prompt plus the generated tokens fit the model's context window
    context = "\n".join(retrieved_docs)[:1500]  # a few hundred tokens
    prompt = f"""You are a hospital assistant explaining cancer questions to patients in simple language.
Query: {query}
Relevant Information: {context}
Answer in a clear, patient-friendly way using only the provided information:"""

    # Limit output with max_new_tokens only; setting max_length as well is contradictory
    response = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.7,
                         pad_token_id=generator.tokenizer.eos_token_id,
                         return_full_text=False)[0]['generated_text'].strip()
    return response

In [None]:
def rag_query(query, embedder, index, documents, metadata, top_k):
    retrieved_docs, retrieved_metadata, similarity = retrieve_documents(query, embedder, index, documents, metadata, top_k)
    answer = generate_answer(query, retrieved_docs)
    return answer, retrieved_metadata, similarity

In [None]:
# Test query
query = "What is prostate cancer?"
answer, retrieved, similarity = rag_query(query, embedder, index, documents, metadata, top_k)
print(f"\nTest Query: {query}")
print(f"Generated Answer: {answer}")
print(f"Retrieval Similarity: {similarity:.3f}")
print("Top Retrieved Q&A:")
print(f"Q: {retrieved[0]['question']}\nA: {retrieved[0]['answer'][:200]}...")

Device set to use cpu



Test Query: What is prostate cancer?
Generated Answer: Question: What is prostate cancer?
Answer: The prostate is a benign, small gland. It is located in the urethra, and in front of the rectum. It is located in the urethra, and in front of the rectum. The prostate is a gland that surrounds the male urethra and helps produce semen, the fluid that carries sperm.  Early prostate cancer usually does not cause pain, and most affected men exhibit no noticeable symptoms. Men are often diagnosed
Retrieval Similarity: 0.750
Top Retrieved Q&A:
Q: What is (are) Prostate Cancer ?
A: The prostate is a male sex gland, about the size of a large walnut. It is located below the bladder and in front of the rectum. The prostate's main function is to make fluid for semen, a white substan...
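
The generated answer above echoes a `Question:` / `Answer:` header and repeats a sentence. One possible way to tidy such output before showing it to a patient is sketched below. `clean_generation` is a hypothetical helper, not part of the notebook, and the `raw` string is a made-up example of the same failure pattern.

```python
import re

def clean_generation(text):
    """Hypothetical post-processing: keep the first answer block and drop repeated sentences."""
    text = text.strip()
    # Drop an echoed "Question: ..." first line
    text = re.sub(r'^Question:[^\n]*\n', '', text)
    # Drop a leading "Answer:" label
    text = re.sub(r'^Answer:\s*', '', text)
    # Cut at the first follow-up "Question:" or "Answer:" label the model invents
    text = re.split(r'\n\s*(?:Question|Answer):', text, maxsplit=1)[0]
    # Split into sentences after ., ! or ? and keep only the first copy of each
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    seen, kept = set(), []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return ' '.join(kept)

# Made-up text imitating the echo-and-repeat pattern
raw = (
    "Question: What is a widget?\n"
    "Answer: A widget is a part. It is small. It is small. It is useful.\n"
    "Question: What else?\n"
    "Answer: Nothing"
)
cleaned = clean_generation(raw)
print(cleaned)
```

This only removes surface repetition. It does not check the text against the retrieved documents, so factual errors from the language model remain.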


In [None]:
def evaluate_rag(df, embedder, index, documents, metadata, top_k):
    test_queries = [
        "What is prostate cancer?",
        "What are the side effects of chemotherapy?",
        "How is breast cancer diagnosed?",
        "What is radiation therapy for cancer?",
        "Can cancer be prevented?"
    ]
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    results = []

    for query in test_queries:
        answer, retrieved, similarity = rag_query(query, embedder, index, documents, metadata, top_k)
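        # Look for a dataset question that contains the query text verbatim
        ground_truth = None
        for _, row in df.iterrows():
            if query.lower() in row['question'].lower():
                ground_truth = row['answer']
                break
        # No match: fall back to the top retrieved answer. ROUGE then measures
        # agreement with the retrieved text, not with an independent reference.
        if not ground_truth:
            ground_truth = retrieved[0]['answer']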

        rouge_scores = scorer.score(ground_truth, answer)
        readability = textstat.flesch_kincaid_grade(answer)

        results.append({
            'query': query,
            'answer': answer,
            'ground_truth': ground_truth,
            'similarity': similarity,
            'rouge1_f1': rouge_scores['rouge1'].fmeasure,
            'rougeL_f1': rouge_scores['rougeL'].fmeasure,
            'readability_grade': readability
        })
        print(f"\nQuery: {query}")
        print(f"Answer: {answer}")
        print(f"Ground Truth: {ground_truth[:200]}...")
        print(f"Retrieval Similarity: {similarity:.3f}")
        print(f"ROUGE-1 F1: {rouge_scores['rouge1'].fmeasure:.3f}")
        print(f"Readability (Flesch-Kincaid Grade): {readability:.1f}")

    return results
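
`evaluate_rag` reports ROUGE-1 F1, which measures unigram overlap between the generated answer and the reference. The sketch below shows the idea in plain Python. It deliberately skips the stemming that the notebook enables with `use_stemmer=True`, so its numbers will not match `rouge_score` exactly; `tokenize` and `rouge1_f1` are illustrative names.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs; no stemming in this sketch
    return re.findall(r'[a-z0-9]+', text.lower())

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate text."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

A score of 0 means the answer shares no words with the reference, while a high score only shows lexical overlap and says nothing about whether the answer is medically correct.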

In [None]:
# Run evaluation
print("\nRunning Evaluation...")
evaluation_results = evaluate_rag(df, embedder, index, documents, metadata, top_k)


Running Evaluation...


Query: What is prostate cancer?
Answer: Question: What is prostate cancer?
Answer: The prostate is very sensitive to stress and pain, the most common cause of prostate cancer. The prostate is known to produce an excess amount of fluid during intercourse, and is usually under control and/or with the help of the gland. The prostate contains a hormone called testosterone, called testosterone, called testosterone, which is used to control its production of the hormone.
Question: What is (are) prostate cancer?
Answer: The prostate is
Ground Truth: The prostate is a male sex gland, about the size of a large walnut. It is located below the bladder and in front of the rectum. The prostate's main function is to make fluid for semen, a white substan...
Retrieval Similarity: 0.750
ROUGE-1 F1: 0.413
Readability (Flesch-Kincaid Grade): 8.2


Query: What are the side effects of chemotherapy?
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Answer: Cervical Tissue
Ground Truth: Summary : Normally, your cells grow and die in a controlled way. Cancer cells keep forming without control. Chemotherapy is drug therapy that can kill these cells or stop them from multiplying. Howeve...
Retrieval Similarity: 0.612
ROUGE-1 F1: 0.000
Readability (Flesch-Kincaid Grade): 26.6


Query: How is breast cancer diagnosed?
Answer: This is the first time you have access to information that can be used to determine the most effective treatment.
Ground Truth: Key Points
                    - Breast cancer is a disease in which malignant (cancer) cells form in the tissues of the breast.    - Sometimes breast cancer occurs in women who are pregnant or have j...
Retrieval Similarity: 0.716
ROUGE-1 F1: 0.043
Readability (Flesch-Kincaid Grade): 9.3


Query: What is radiation therapy for cancer?
Answer: Question: Are there any cancer treatments for the breast or breast?
Answer: There are those.
Question: How many cancer treatments are being tested in clinical trials in the United States?
Answer: There are approximately three dozen.
Answer: There are over ten.
Question: Who is the subject of a study and how much radiation is being tested in the United States?
Answer: There are about three hundred.
Question: If the results are accurate, will there be a
Ground Truth: Radiation therapy uses high-energy x-rays or other types of radiation to kill cancer cells and shrink tumors. This therapy often follows a lumpectomy, and is sometimes used after mastectomy. During ra...
Retrieval Similarity: 0.658
ROUGE-1 F1: 0.211
Readability (Flesch-Kincaid Grade): 6.3


Query: Can cancer be prevented?
Answer: If you have a high risk of developing breast cancer, you should not use the information you have provided to increase your risk of getting cancer.
If you have a low risk of developing cancer, you should not use the information you have provided to increase your risk of getting cancer. If you have a high risk of developing cancer, you should not use the information you have provided to increase your risk of getting cancer. If you have a high risk of developing cancer, you should not use the information
Ground Truth: What Is Cancer Prevention? Cancer prevention is action taken to lower the chance of getting cancer. By preventing cancer, the number of new cases of cancer in a group or population is lowered. Hopeful...
Retrieval Similarity: 0.567
ROUGE-1 F1: 0.092
Readability (Flesch-Kincaid Grade): 10.7
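
The readability grade of 26.6 for the chemotherapy query stands out. The Flesch-Kincaid grade grows with the average sentence length, so output without sentence-ending punctuation, such as the repeated "Cervical Tissue" lines, is plausibly treated as one very long sentence. The sketch below applies the standard formula to hand-picked counts; it does not reproduce how `textstat` counts words, sentences, or syllables.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Same 40 words and 60 syllables, counted as one sentence versus four sentences
one_long_sentence = flesch_kincaid_grade(total_words=40, total_sentences=1, total_syllables=60)
four_sentences = flesch_kincaid_grade(total_words=40, total_sentences=4, total_syllables=60)
print(round(one_long_sentence, 2), round(four_sentences, 2))
```

The grade is therefore a reliable signal only when the generated text is well-formed prose; for degenerate output it mostly reflects missing punctuation.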


In [None]:
faiss.write_index(index, '/content/cancer_faq_index.faiss')
print("FAISS index saved as '/content/cancer_faq_index.faiss'")

FAISS index saved as '/content/cancer_faq_index.faiss'
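
The saved index holds only vectors. To answer queries in a later session, the Q&A records must be stored in the same row order, since `retrieve_documents` maps each returned row number back to `documents` and `metadata`. A minimal sketch using JSON follows; the helper names, the file name `cancer_faq_records.json`, and the toy records are assumptions for illustration. The index itself can be reloaded with `faiss.read_index`.

```python
import json
import os
import tempfile

def save_knowledge_base_records(records, path):
    """Write the Q&A records (e.g. df.to_dict('records')) next to the FAISS index."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False)

def load_knowledge_base_records(path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Toy records standing in for the notebook's metadata list
records = [
    {'question': 'Q1?', 'answer': 'A1.'},
    {'question': 'Q2?', 'answer': 'A2.'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    records_path = os.path.join(tmp_dir, 'cancer_faq_records.json')  # hypothetical file name
    save_knowledge_base_records(records, records_path)
    restored = load_knowledge_base_records(records_path)

# Row i of the saved index corresponds to restored[i]; rebuild the document strings
documents_restored = [f"Question: {r['question']}\nAnswer: {r['answer']}" for r in restored]
print(len(documents_restored))
```

In the notebook, `metadata` would be passed as `records` and the JSON file saved under `/content/` alongside the index.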
