# **Week 4: Retrieval-Augmented Generation (RAG)**

- **Topics:** RAG architecture and concepts, vector search concepts, similarity and distance metrics, indexing strategies, using vectors in document retrieval and LLMs.
- **Hands-on:** Building a basic RAG pipeline with pre-trained models, implementing a vector search mechanism with a document corpus.

## **Designing the RAG Pipeline for the Medical AI Assistant**

### **Objective**
The goal is to design a scalable and efficient RAG pipeline for the VA's Medical AI Assistant. This pipeline will:
1. Retrieve relevant clinical and medical knowledge (e.g., TIU notes, ICD codes, guidelines).
2. Generate accurate, contextually relevant, and actionable responses.
3. Support both experimentation (open-source) and production (Azure services) environments.

---

### **Core Components of the RAG Pipeline**

1. **Document Corpus**  
   - A centralized repository of all relevant medical knowledge, including:
     - TIU notes (real-world clinical context).
     - ICD codes (structured disease classifications).
     - Evidence-based guidelines (e.g., VA/DoD clinical practice guidelines).
     - General medical encyclopedias (e.g., MedlinePlus).

2. **Embedding Model**
   - Converts text into dense vector representations for similarity-based retrieval.
   - **Azure Option**: Azure OpenAI embeddings (`text-embedding-ada-002`).
   - **Open-Source Option**: SentenceTransformers (e.g., `all-MiniLM-L6-v2`).

3. **Vector Store**
   - A scalable repository to store and search embeddings.
   - **Azure Option**: Azure AI Search (with vector search enabled).
   - **Open-Source Option**: FAISS or Milvus for local/experimental setups.

4. **Metadata and Indexing**
   - Enables filtering and relevance ranking during retrieval.
   - Key metadata fields:
     - **Document Type**: TIU note, ICD code, guideline, etc.
     - **Specialty**: Cardiology, gastroenterology, mental health.
     - **Source**: EHR, MedlinePlus, PubMed, etc.
     - **Date**: Timeliness of the document.
   - **Azure Option**: Use Azure AI Search’s metadata and indexing features.
   - **Open-Source Option**: Store metadata in a database (e.g., MongoDB or PostgreSQL) alongside embeddings.

5. **Retrieval Mechanism**
   - Fetches top-k documents relevant to the query.
   - Supports similarity-based searches and metadata filtering.
   - **Azure Option**: Azure AI Search with semantic ranking.
   - **Open-Source Option**: FAISS/Milvus queries combined with metadata filtering logic.

6. **Context-Aware Generation**
   - Combines retrieved documents with the query to generate a final response.
   - **Azure Option**: Azure OpenAI Service (e.g., GPT-4).
   - **Open-Source Option**: Open-weight models served with Hugging Face Transformers. (The OpenAI API is a hosted, non-Azure alternative, but it is not open source.)
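
The topics for this week include similarity and distance metrics, which the embedding, vector store, and retrieval components above all rely on. As a minimal sketch in pure Python, the three metrics most vector stores offer can be written as below. The 3-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product: larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ranges from -1 to 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """L2 distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors, invented for illustration only.
query = [1.0, 0.0, 0.0]
doc_same_direction = [2.0, 0.0, 0.0]
doc_orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query, doc_same_direction))   # 1.0
print(cosine_similarity(query, doc_orthogonal))       # 0.0
print(euclidean_distance(query, doc_same_direction))  # 1.0
```

For unit-length vectors, ranking by cosine similarity, dot product, or Euclidean distance produces the same order, which is why many pipelines normalize embeddings once at indexing time.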

---

### **Step-by-Step Design Framework**

#### **Step 1: Data Collection and Preparation**
- **Sources**:
  - Clinical data: TIU notes, radiology reports, lab results (anonymized).
  - Structured data: ICD-10, SNOMED CT, CPT codes.
  - General knowledge: MedlinePlus, VA-specific guidelines.
  - Behavioral health: DSM-5 criteria, PTSD scales.

- **Processing**:
  - Clean and preprocess text data to remove noise.
  - Assign metadata tags for filtering and relevance ranking.

#### **Step 2: Embedding Strategy**
- **Azure Approach**:
  - Use Azure OpenAI embeddings for consistency with other Azure services.
- **Open-Source Approach**:
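  - Use a general-purpose SentenceTransformers model, or a biomedical encoder such as BioBERT (which needs a pooling step or sentence-level fine-tuning to produce useful sentence embeddings).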

#### **Step 3: Vector Store Design**
- **Indexes**:
  - Organize data into logical partitions or indexes:
    - Clinical Notes Index: TIU notes, radiology, lab reports.
    - Terminology Index: ICD, SNOMED CT, CPT.
    - General Knowledge Index: MedlinePlus, guidelines, FAQs.

- **Metadata Schema**:
  - Example:
    - Document Type: "TIU Note", "ICD Code".
    - Specialty: "Cardiology", "Gastroenterology".
    - Date: "2023-10-01".

- **Storage Options**:
  - **Azure**: Use Azure AI Search with vector and metadata capabilities.
  - **Open-Source**: Use FAISS with metadata stored separately, or Weaviate/Milvus, which store metadata alongside the vectors.
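
One way to picture the partitioning above is a small record type plus a routing table from document type to index. The sketch below is illustrative only: the `DocumentRecord` class, the index names, and the `route_to_index` helper are hypothetical and not part of any library.

```python
from dataclasses import dataclass

# Hypothetical mapping from document type to logical index name.
INDEX_BY_DOC_TYPE = {
    "TIU Note": "clinical-notes",
    "Radiology Report": "clinical-notes",
    "Lab Report": "clinical-notes",
    "ICD Code": "terminology",
    "SNOMED CT": "terminology",
    "CPT Code": "terminology",
    "Guideline": "general-knowledge",
    "MedlinePlus": "general-knowledge",
}

@dataclass
class DocumentRecord:
    doc_id: str
    text: str
    document_type: str
    specialty: str
    date: str  # ISO 8601, e.g. "2023-10-01"

def route_to_index(record: DocumentRecord) -> str:
    """Return the index a record belongs to; fail loudly on unknown types."""
    try:
        return INDEX_BY_DOC_TYPE[record.document_type]
    except KeyError:
        raise ValueError(f"No index configured for {record.document_type!r}")

note = DocumentRecord("n1", "Patient reports chest pain.", "TIU Note", "Cardiology", "2023-10-01")
print(route_to_index(note))  # clinical-notes
```

Failing on an unknown document type, rather than falling back to a default index, keeps mis-tagged documents from silently landing in the wrong partition.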

#### **Step 4: Retrieval Logic**
- Use similarity search (based on embeddings) combined with metadata filtering.
- **Query Examples**:
  - "What are the causes of chest pain?"
    - Retrieve relevant TIU notes, ICD codes, and guidelines tagged with "cardiology".
  - "What guidelines exist for PTSD management?"
    - Retrieve VA/DoD guidelines and DSM-5 criteria tagged with "behavioral health".
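
A minimal sketch of this filter-then-rank step, in pure Python with made-up 2-dimensional vectors. The corpus, the `search` function, and its parameters are hypothetical; a real deployment would delegate both steps to the vector store rather than scanning a list.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

# Toy corpus: vectors and tags are invented for illustration.
corpus = [
    {"id": "tiu-1", "specialty": "cardiology", "vector": [0.9, 0.1]},
    {"id": "icd-I20", "specialty": "cardiology", "vector": [0.8, 0.3]},
    {"id": "gl-ptsd", "specialty": "behavioral health", "vector": [0.1, 0.9]},
]

def search(query_vector, specialty=None, k=2):
    """Apply the metadata filter first, then rank survivors by cosine similarity."""
    candidates = [d for d in corpus if specialty is None or d["specialty"] == specialty]
    ranked = sorted(candidates, key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:k]]

print(search([1.0, 0.0], specialty="cardiology"))  # ['tiu-1', 'icd-I20']
```

Filtering before ranking guarantees that all `k` results satisfy the filter; filtering after ranking is simpler to bolt on but can return fewer than `k` results.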

#### **Step 5: Generation Logic**
- Combine retrieved documents with the query to construct a prompt.
- Generate a response using an LLM (e.g., GPT-4 or Claude).
- Prompt Example:
  ```
  Query: "What are the causes of chest pain?"
  Retrieved Context: [Document 1 content, Document 2 content]
  Instruction: Summarize the retrieved context and provide a medically accurate explanation.
  ```
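
The prompt template can be assembled programmatically. The following is a sketch under stated assumptions: the `build_prompt` helper, the `source`/`text` keys, and the instruction wording are all illustrative choices, and the two documents are invented placeholders rather than real retrieved content.

```python
def build_prompt(query, documents, max_chars_per_doc=500):
    """Assemble a grounded prompt from the query and retrieved documents."""
    context_lines = []
    for i, doc in enumerate(documents, start=1):
        snippet = doc["text"][:max_chars_per_doc]  # crude length cap per document
        context_lines.append(f"[{i}] ({doc['source']}) {snippet}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

docs = [
    {"source": "ICD-10", "text": "I20: Angina pectoris."},
    {"source": "MedlinePlus", "text": "GERD can cause chest pain."},
]
print(build_prompt("What are the causes of chest pain?", docs))
```

Numbering the documents gives the model something concrete to cite, which makes it easier to trace each claim in the answer back to a retrieved source during validation.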

#### **Step 6: System Validation**
- Test the pipeline with real-world queries:
  - Validate retrieval accuracy (Are the top-k results relevant?).
  - Assess generation quality (Are the responses clear and actionable?).
- Iteratively refine embeddings, metadata schema, and retrieval logic.
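
Retrieval accuracy can be quantified with precision@k and recall@k once a set of relevant documents has been judged for each test query. A minimal sketch follows; the document IDs and relevance judgments are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

# Hypothetical ranked results and judgments for one query.
retrieved = ["tiu-1", "gl-ptsd", "icd-I20", "mp-gerd"]
relevant = {"tiu-1", "icd-I20", "gl-chest-pain"}

print(precision_at_k(retrieved, relevant, k=3))  # 2 of the 3 retrieved are relevant
print(recall_at_k(retrieved, relevant, k=3))     # 2 of the 3 relevant were found
```

Averaging these values over a fixed query set gives a baseline to compare against after each change to embeddings, metadata, or retrieval logic.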

---

### **Key Considerations**

1. **Scalability**:
   - **Azure**: Managed scaling through Azure AI Search replicas and partitions, subject to Azure OpenAI quota limits.
   - **Open-Source**: Use FAISS for smaller-scale experiments, with a transition to Milvus or Weaviate for larger datasets.

2. **Security and Privacy**:
   - Ensure all clinical data (e.g., TIU notes) is anonymized.
   - Use secure storage options, especially for sensitive VA data.

3. **Future Expansion**:
   - Plan for adding new data sources (e.g., clinical trials, research papers).
   - Ensure that the system supports updates to medical terminologies (e.g., new ICD revisions).



## **Prioritizing Data Sources for the Vector Store**

To ensure the Medical AI Assistant is both relevant and effective, we need to focus on high-impact data sources. Below is a prioritized list of data sources based on **clinical relevance**, **retrieval utility**, and **ease of integration** into the vector store.

---

### **Tier 1: High-Priority Data Sources**
These data sources should be integrated first as they directly impact the Assistant's ability to address medical queries.

1. **TIU Notes (Text Integration Utilities Notes)**  
   - **Why**:  
     TIU notes are critical for understanding real-world clinical scenarios, providing the assistant with patient-specific contexts (after anonymization).  
   - **Examples**:  
     - SOAP notes (Subjective, Objective, Assessment, Plan).
     - Admission, discharge, and progress notes.
   - **Challenges**:  
     - Requires extensive preprocessing to anonymize patient data while preserving clinical relevance.

2. **ICD Codes (International Classification of Diseases)**  
   - **Why**:  
     Structured data that links symptoms and diagnoses, making it easier to map patient complaints to clinical terms.  
   - **Examples**:  
     - ICD-10 Codes: "I20" (Angina Pectoris), "K21" (GERD).
   - **Challenges**:  
     - Keeping up-to-date with revisions (e.g., ICD-11).

3. **VA/DoD Clinical Guidelines**  
   - **Why**:  
     Ensures responses align with evidence-based practices specific to the VA system.  
   - **Examples**:  
     - PTSD guidelines, diabetes management, hypertension treatment.
   - **Challenges**:  
     - Formatting variability and ensuring semantic consistency.

---

### **Tier 2: Medium-Priority Data Sources**
Once Tier 1 data sources are integrated, these enhance the Assistant’s knowledge and ability to handle broader queries.

4. **MedlinePlus Consumer Health Information**  
   - **Why**:  
     Provides plain-language explanations of conditions, treatments, and medications, helpful for patient-facing responses.  
   - **Examples**:  
     - Descriptions of GERD, treatments for hypertension.
   - **Challenges**:  
     - Balancing clinical depth with consumer-level simplicity.

5. **SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms)**  
   - **Why**:  
     Adds depth to the Assistant's knowledge of medical terms and relationships between concepts (e.g., "angina" vs. "unstable angina").  
   - **Examples**:  
     - Hierarchies: "Chest Pain" → "Cardiac Chest Pain" → "Angina".
   - **Challenges**:  
     - Large vocabulary requires efficient indexing and filtering.

6. **Drug Databases**  
   - **Why**:  
     Medication information is essential for addressing drug interactions, dosages, and treatment options.  
   - **Examples**:  
     - DailyMed (FDA-approved labels), Lexicomp.
   - **Challenges**:  
     - Licensing requirements for proprietary databases (e.g., Lexicomp).

---

### **Tier 3: Supplemental Data Sources**
These are useful for more advanced capabilities and specialized queries but are not essential for the initial vector store.

7. **DSM-5 Criteria (Mental Health)**  
   - **Why**:  
     Supports mental health queries, particularly for PTSD, depression, and anxiety—conditions prevalent in the VA population.  
   - **Examples**:  
     - PTSD diagnostic criteria, generalized anxiety disorder symptoms.
   - **Challenges**:  
     - Requires structured integration to avoid overwhelming the retrieval system.

8. **PubMed Abstracts (Biomedical Literature)**  
   - **Why**:  
     Offers cutting-edge research insights, useful for clinician-facing queries.  
   - **Examples**:  
     - Recent studies on new treatments for atrial fibrillation.
   - **Challenges**:  
     - Relevant results may require additional filtering by publication date and journal reputation.

9. **Radiology and Lab Reports**  
   - **Why**:  
     Adds context for diagnostic queries involving imaging or laboratory findings.  
   - **Examples**:  
     - Common findings: "Ground-glass opacity" (COVID-19), elevated troponin levels (MI).
   - **Challenges**:  
     - Requires sophisticated preprocessing to extract key terms.

---

### **Proposed Integration Timeline**
1. **Phase 1 (Essential)**:
   - TIU Notes (preprocessed and anonymized).
   - ICD Codes.
   - VA/DoD Clinical Guidelines.

2. **Phase 2 (Enhanced)**:
   - MedlinePlus.
   - SNOMED CT.
   - Drug Databases.

3. **Phase 3 (Advanced)**:
   - DSM-5 Criteria.
   - PubMed Abstracts.
   - Radiology and Lab Reports.

---

### **Design Considerations for Integration**

#### **1. Metadata Schema**
- **Fields**:
  - **Type**: TIU Note, ICD Code, Guideline, etc.
  - **Specialty**: Cardiology, Gastroenterology, Mental Health.
  - **Date**: Timestamp of document creation.
  - **Source**: EHR, MedlinePlus, PubMed, etc.
  - **Relevance**: Weighted score for prioritization.

#### **2. Storage and Partitioning**
- **Index Design**:
  - Create separate indexes for structured (ICD codes) vs. unstructured (TIU notes) data.
  - Use metadata for cross-index filtering (e.g., "retrieve only Cardiology TIU notes").
- **Scaling**:
  - Plan for incremental updates, especially for ICD revisions or new guidelines.
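
One way to support incremental updates is to fingerprint each document and re-embed only what changed. This is a sketch of that idea rather than a prescribed design: `content_hash` and `plan_update` are hypothetical helpers, and the code descriptions are short illustrative strings.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_update(indexed_hashes, incoming_docs):
    """Compare incoming documents with what is already indexed.

    indexed_hashes: dict of doc_id -> hash stored at the last indexing run.
    incoming_docs:  dict of doc_id -> current text.
    Returns (to_embed, to_delete) lists of document IDs.
    """
    to_embed = [
        doc_id for doc_id, text in incoming_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in incoming_docs]
    return to_embed, to_delete

indexed = {"I20": content_hash("Angina pectoris"), "K21": content_hash("GERD")}
incoming = {
    "I20": "Angina pectoris",                    # unchanged
    "K21": "Gastro-esophageal reflux disease",   # text changed
    "I21": "Acute myocardial infarction",        # new entry
}
print(plan_update(indexed, incoming))  # (['K21', 'I21'], [])
```

Storing the hash as a metadata field next to each embedding keeps the comparison cheap and avoids paying for embeddings of unchanged documents.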

#### **3. Embedding Strategy**
- Generate embeddings for all data sources using:
  - **Azure**: OpenAI embeddings for standardization across services.
  - **Open-Source**: Domain-specific models (e.g., ClinicalBERT).

#### **4. Retrieval Performance**
- Optimize for:
  - **Precision**: Retrieve only the most relevant results.
  - **Latency**: Minimize query processing time for real-time use.

---

### **Next Steps**
1. **Finalize Tier 1 Sources**:
   - Confirm the scope of TIU notes, ICD codes, and VA guidelines for initial integration.
2. **Design Metadata Schema**:
   - Outline specific fields and values for indexing and filtering.
3. **Embed and Index Sample Data**:
   - Test embedding and indexing with a subset of documents to validate design assumptions.



### **Openly Available Public Datasets for Medical AI Assistant**

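Here is a curated list of datasets and terminologies suitable for building a Medical AI Assistant, with descriptions, focus areas, and access links. Not all of them are open downloads: MIMIC and eICU require credentialed PhysioNet access and a data use agreement, and UMLS and SNOMED CT require a license.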

| **Dataset Name**               | **Description**                                                                                             | **Focus Area**                                  | **Link**                                                                 |
|--------------------------------|-------------------------------------------------------------------------------------------------------------|------------------------------------------------|-------------------------------------------------------------------------|
| **MIMIC-III**                  | A large database containing de-identified health data from critical care patients.                          | Clinical text, ICU data                        | [MIMIC-III Dataset](https://physionet.org/content/mimiciii/1.4/)        |
| **MIMIC-IV**                   | The successor to MIMIC-III with more recent data, enhanced granularity, and more structured formats.        | Clinical text, ICU data                        | [MIMIC-IV Dataset](https://physionet.org/content/mimiciv/2.2/)          |
| **eICU Collaborative Research**| A multi-center critical care database with de-identified patient data from ICUs across the United States.  | ICU data, clinical outcomes                   | [eICU Dataset](https://physionet.org/content/eicu-crd/2.0/)             |
| **BioASQ Dataset**             | Biomedical semantic indexing and question answering dataset for natural language processing.               | Biomedical text, QA systems                   | [BioASQ Dataset](http://bioasq.org/)                                    |
| **PubMed Central Open Access** | A large repository of free full-text biomedical and life sciences journal articles.                        | Biomedical research, text embeddings           | [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/)    |
| **MedlinePlus**                | Consumer-friendly health information about diseases, conditions, and wellness topics.                      | Patient education, health information          | [MedlinePlus](https://medlineplus.gov/)                                 |
| **SNOMED CT**                  | A comprehensive, multilingual clinical healthcare terminology standard.                                     | Clinical terminology, medical codes           | [SNOMED CT](https://www.nlm.nih.gov/healthit/snomedct/index.html)       |
| **ICD-10 Dataset**             | International Classification of Diseases, Tenth Revision, used for coding diagnoses (procedures are coded with separate systems such as ICD-10-PCS). | Clinical terminology, disease classification   | [ICD-10](https://www.who.int/standards/classifications/classification-of-diseases) |
| **COVID-19 Open Research Dataset (CORD-19)** | A dataset of scholarly articles about COVID-19 for text mining and natural language processing research. | Pandemic-related biomedical research           | [CORD-19 Dataset](https://www.semanticscholar.org/cord19)               |
| **Unified Medical Language System (UMLS)** | A collection of biomedical vocabularies integrated into a single framework for interoperability.      | Medical vocabularies, terminology             | [UMLS](https://www.nlm.nih.gov/research/umls/index.html)                |
| **Disease Ontology**           | A standardized ontology for human disease terms, their definitions, and their relationships.               | Disease classification, ontologies            | [Disease Ontology](http://www.disease-ontology.org/)                    |
| **Open-i Medical Image Dataset** | A collection of de-identified medical images with corresponding metadata and reports.                     | Medical imaging, radiology reports             | [Open-i Dataset](https://openi.nlm.nih.gov/)                            |
| **CheXpert**                   | A large dataset of chest X-rays labeled for various pathologies.                                           | Radiology, chest diseases                      | [CheXpert Dataset](https://stanfordmlgroup.github.io/competitions/chexpert/) |
| **PhysioNet**                  | Open access to complex physiological signals such as ECGs and EEGs.                                         | Physiological data, signal analysis           | [PhysioNet](https://physionet.org/)                                     |
| **OHDSI OMOP**                 | A common data model and standardized vocabularies for observational health data, maintained by the OHDSI collaboration; a schema rather than a dataset. | Clinical observations, research data           | [OHDSI Dataset](https://www.ohdsi.org/data-standardization/the-common-data-model/) |
| **RxNorm**                     | A normalized naming system for generic and branded drugs and their relationships.                          | Drug information, pharmacology                 | [RxNorm](https://www.nlm.nih.gov/research/umls/rxnorm/index.html)       |
| **DrugBank Open Data**         | A comprehensive database of drug and drug target information.                                               | Pharmacology, drug-target interactions         | [DrugBank Open Data](https://www.drugbank.com/releases/latest)          |
| **ClinicalTrials.gov**         | A database of privately and publicly funded clinical studies conducted around the world.                   | Clinical trials, research studies              | [ClinicalTrials.gov](https://clinicaltrials.gov/)                       |

---

### **How This Table Can Be Used**
1. **Building the Vector Store**:
   - Choose datasets relevant to your use case (e.g., de-identified clinical notes from MIMIC-III as a stand-in for TIU notes, disease definitions from Disease Ontology).
   - Use embeddings to represent the dataset text and store them for retrieval.

2. **Expanding the Knowledge Base**:
   - Incorporate data from medical vocabularies (e.g., UMLS, SNOMED CT) to enhance terminology understanding.
   - Include patient-facing resources (e.g., MedlinePlus) for consumer-level education.

3. **Supporting Specific Applications**:
   - Use imaging datasets (e.g., CheXpert) to complement textual data.
   - Leverage pharmacological datasets (e.g., DrugBank, RxNorm) for drug-related queries.



## **Focus: Metadata Schema Design**

Metadata is the backbone of effective document retrieval in a vector store. A well-designed schema ensures that documents are organized, searchable, and retrievable based on precise filters. For the Medical AI Assistant, we need a robust schema that supports diverse data types like TIU notes, ICD codes, and clinical guidelines.

---

### **1. Metadata Schema Design**

#### **Core Metadata Fields**
| **Field Name**      | **Description**                                                                 | **Example Values**                                  |
|----------------------|---------------------------------------------------------------------------------|----------------------------------------------------|
| **Document Type**    | The category of the document.                                                   | "TIU Note", "ICD Code", "Guideline"               |
| **Specialty**        | The medical specialty relevant to the document.                                | "Cardiology", "Gastroenterology", "Mental Health" |
| **Source**           | The origin of the document or data.                                             | "EHR", "MedlinePlus", "VA Guidelines"             |
| **Title**            | A brief title or summary for the document.                                     | "Managing GERD in Veterans"                       |
| **Keywords**         | Key terms associated with the document.                                        | "GERD", "angina", "chest pain"                    |
| **Date**             | The date the document was created or last updated.                             | "2024-11-15"                                       |
| **Relevance Score**  | A numeric weight indicating the document's importance for retrieval.            | 0.85, 0.92                                         |
| **Context Summary**  | A short summary or abstract of the document's content.                         | "Chest pain caused by GERD, angina, and anxiety." |
| **Clinical Tags**    | Specific tags for clinical features, diagnoses, or symptoms.                   | "Chest Pain", "MI", "Pulmonary Embolism"          |

---

### **2. Metadata Examples for Tier 1 Sources**

- **TIU Note Example**:
  ```json
  {
      "Document Type": "TIU Note",
      "Specialty": "Cardiology",
      "Source": "EHR",
      "Title": "Progress Note for Chest Pain",
      "Keywords": ["chest pain", "angina", "ECG"],
      "Date": "2024-10-10",
      "Relevance Score": 0.88,
      "Context Summary": "Patient presented with chest pain; ruled out myocardial infarction.",
      "Clinical Tags": ["Chest Pain", "Angina", "ECG"]
  }
  ```

- **ICD Code Example**:
  ```json
  {
      "Document Type": "ICD Code",
      "Specialty": "General Medicine",
      "Source": "ICD-10",
      "Title": "I20 - Angina Pectoris",
      "Keywords": ["angina", "ischemic heart disease"],
      "Date": "2024-01-01",
      "Relevance Score": 0.95,
      "Context Summary": "ICD-10 code for angina; includes subcategories such as unstable angina.",
      "Clinical Tags": ["Ischemic Heart Disease", "Angina"]
  }
  ```

- **VA Guideline Example**:
  ```json
  {
      "Document Type": "Guideline",
      "Specialty": "Mental Health",
      "Source": "VA/DoD",
      "Title": "PTSD Treatment Guidelines",
      "Keywords": ["PTSD", "mental health", "CBT"],
      "Date": "2023-06-15",
      "Relevance Score": 0.90,
      "Context Summary": "Best practices for diagnosing and managing PTSD in veterans.",
      "Clinical Tags": ["PTSD", "Mental Health", "CBT"]
  }
  ```
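
Records like these are worth validating before indexing, so that a missing field or malformed date is caught early rather than surfacing as a silent filter miss. A minimal sketch, assuming the field names above; the choice of which fields are required and the `validate_metadata` helper are illustrative.

```python
from datetime import date

# Assumed set of required fields and their expected Python types.
REQUIRED_FIELDS = {
    "Document Type": str,
    "Specialty": str,
    "Source": str,
    "Title": str,
    "Keywords": list,
    "Date": str,
    "Clinical Tags": list,
}

def validate_metadata(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    if isinstance(record.get("Date"), str):
        try:
            date.fromisoformat(record["Date"])
        except ValueError:
            problems.append("Date is not ISO 8601 (YYYY-MM-DD)")
    return problems

good = {
    "Document Type": "Guideline", "Specialty": "Mental Health", "Source": "VA/DoD",
    "Title": "PTSD Treatment Guidelines", "Keywords": ["PTSD"], "Date": "2023-06-15",
    "Clinical Tags": ["PTSD"],
}
print(validate_metadata(good))                            # []
print(validate_metadata({**good, "Date": "06/15/2023"}))  # ['Date is not ISO 8601 (YYYY-MM-DD)']
```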

---

### **3. Implementation Plan**

#### **Step 1: Define Schema**
- Finalize metadata fields and default values for missing information.

#### **Step 2: Map Schema to Data Sources**
- TIU Notes:
  - Extract fields like **Date**, **Keywords**, and **Clinical Tags** from structured sections.
  - Use NLP to summarize and tag.
- ICD Codes:
  - Predefined schema with codes, descriptions, and associated conditions.
- Guidelines:
  - Summarize key sections and tag with **Keywords** and **Clinical Tags**.

#### **Step 3: Store Metadata**
- **Azure Option**:
  - Use **Azure Cosmos DB** or **Azure Blob Storage** with metadata as JSON objects.
- **Open-Source Option**:
  - Store metadata in MongoDB or a relational database (e.g., PostgreSQL).

#### **Step 4: Integrate Metadata with Vector Store**
- Ensure metadata is associated with embeddings in the vector store.
- Enable filtering and sorting based on metadata fields during retrieval.
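
When the vector index stores only vectors and integer IDs (as FAISS does), metadata can live in a side table keyed by the same ID. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL or MongoDB; the table layout and the `add_metadata` and `hydrate` helpers are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_metadata ("
    "vector_id INTEGER PRIMARY KEY, specialty TEXT, doc_type TEXT, meta_json TEXT)"
)

def add_metadata(vector_id, meta):
    """Store metadata under the same integer ID used in the vector index."""
    conn.execute(
        "INSERT INTO doc_metadata VALUES (?, ?, ?, ?)",
        (vector_id, meta["Specialty"], meta["Document Type"], json.dumps(meta)),
    )

def hydrate(vector_ids, specialty=None):
    """Attach metadata to search hits, keeping rank order and optionally filtering."""
    hits = []
    for vid in vector_ids:
        row = conn.execute(
            "SELECT specialty, meta_json FROM doc_metadata WHERE vector_id = ?", (vid,)
        ).fetchone()
        if row and (specialty is None or row[0] == specialty):
            hits.append(json.loads(row[1]))
    return hits

add_metadata(0, {"Document Type": "TIU Note", "Specialty": "Cardiology",
                 "Title": "Progress Note for Chest Pain"})
add_metadata(1, {"Document Type": "Guideline", "Specialty": "Mental Health",
                 "Title": "PTSD Treatment Guidelines"})

# Pretend the vector index returned IDs [1, 0] for a query.
print([m["Title"] for m in hydrate([1, 0], specialty="Cardiology")])
# ['Progress Note for Chest Pain']
```

Because this filter runs after the vector search, it can return fewer results than requested; over-fetching from the index (for example, asking for several times `k`) is a common mitigation.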

---

### **Next Steps**

1. Expand this with **embedding strategies** for each data type.
2. Move on to **vector store indexing and retrieval design**.
3. Discuss **scalability and updating the vector store**.



## **Standardizing on a Single Embedding Model vs. Using Multiple Models**

Whether to use a single embedding model or multiple specialized models depends on your **use case**, **data diversity**, and **performance requirements**. Below is an analysis of the **pros and cons** of each approach to help you decide.

---

### **Option 1: Standardizing on a Single Embedding Model**

#### **Description**:
Use one embedding model for all data types, such as **Azure OpenAI’s `text-embedding-ada-002`** or an open-source alternative like **SentenceTransformers**.

#### **Pros**:
1. **Simplicity**:
   - Easier to manage and maintain embeddings in the pipeline.
   - Uniform embedding dimensionality simplifies vector store schema and search.

2. **Consistency**:
   - Avoids discrepancies when comparing embeddings generated by different models.
   - Ensures that similarity scores are computed on vectors from the same feature space.

3. **Efficiency**:
   - Reduces operational complexity by eliminating the need to switch between models.
   - Faster development and scaling, especially when embedding multiple data sources.

4. **Scalability**:
   - Optimized for cloud-native solutions (e.g., Azure AI Search), enabling seamless integration and scaling.

5. **Cost-Effective**:
   - Training, hosting, or calling a single model is often more cost-efficient than managing multiple specialized models.

#### **Cons**:
1. **Generalization**:
   - A single model may not perform optimally on domain-specific tasks (e.g., clinical text vs. scientific abstracts).
   - Loss of precision when embedding structured data (e.g., ICD codes).

2. **Compromise in Quality**:
   - The model may struggle to encode niche medical terms or specialized data with the same quality as a domain-specific model.

3. **Limited Fine-Tuning**:
   - Fine-tuning a single model for all data types might lead to overfitting for one type and underperformance for others.

---

### **Option 2: Using Multiple Specialized Models**

#### **Description**:
Employ different embedding models optimized for specific data types, such as ClinicalBERT for clinical notes or SciBERT for biomedical research.

#### **Pros**:
1. **Domain-Specific Accuracy**:
   - Specialized models excel in their respective areas (e.g., ClinicalBERT captures medical nuances better than general-purpose models).

2. **Task Optimization**:
   - Embeddings are tailored to the unique requirements of each data type (e.g., ICD codes benefit from lightweight sentence transformers, while TIU notes require context-aware embeddings).

3. **Flexibility**:
   - Allows experimenting with various models for different tasks, optimizing performance dynamically.

4. **State-of-the-Art Results**:
   - Leveraging the latest advancements in specific domains ensures higher relevance in retrieval tasks.

#### **Cons**:
1. **Complexity**:
   - Requires maintaining multiple embedding pipelines, increasing operational overhead.
   - Managing different dimensionalities for embeddings adds complexity to the vector store design.

2. **Inconsistent Feature Spaces**:
   - Embeddings from different models live in different vector spaces and cannot be compared directly. Each model needs its own index, and results must be merged at the score or rank level.

3. **Higher Costs**:
   - Running multiple models simultaneously can lead to higher compute and storage costs, especially in production.

4. **Integration Challenges**:
   - Integrating and optimizing workflows for multiple models requires additional engineering effort.
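
One common way to merge results from separate per-model indexes is reciprocal rank fusion (RRF), which uses only rank positions and so sidesteps the incomparable similarity scores. RRF is not prescribed by this material; it is shown here as one option. The hit lists are hypothetical, and `k=60` is the conventional default constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from two indexes built with different embedding models.
clinical_index_hits = ["tiu-1", "tiu-7", "gl-3"]
literature_index_hits = ["pm-42", "tiu-1", "gl-3"]

print(reciprocal_rank_fusion([clinical_index_hits, literature_index_hits]))
# ['tiu-1', 'gl-3', 'pm-42', 'tiu-7']
```

Documents that appear in both lists (`tiu-1`, `gl-3`) rise above documents that rank highly in only one, which is usually the desired behavior when the indexes cover overlapping content.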

---

### **Comparison Table**

| **Aspect**               | **Single Model**                          | **Multiple Models**                       |
|--------------------------|------------------------------------------|------------------------------------------|
| **Ease of Maintenance**   | High                                     | Low                                      |
| **Consistency**           | High                                     | Medium (varies by model feature space)   |
| **Embedding Quality**     | Generalized                              | Specialized                              |
| **Scalability**           | Easier to scale                          | Complex due to varied workflows          |
| **Cost**                  | Lower                                    | Higher                                   |
| **Performance**           | Generalized performance                  | Optimized for each data type             |
| **Flexibility**           | Lower (one-size-fits-all)                | High (task-specific optimization)        |

---

### **Recommendation for the Medical AI Assistant**

Given your work with **Azure services** and the **diverse nature of medical data**, here’s a practical approach:

#### **1. Start with a Single Model (Standardized Approach)**:
   - Use **Azure OpenAI’s `text-embedding-ada-002`** for initial integration.
   - Pros:
     - Simplifies pipeline development.
     - Ensures compatibility across all data types.
     - Aligns with existing Azure infrastructure.

#### **2. Evaluate Performance on Key Tasks**:
   - Test retrieval quality across data types (e.g., TIU notes, ICD codes, guidelines).
   - Identify gaps in performance or areas where embeddings are insufficient.

#### **3. Introduce Specialized Models as Needed**:
   - For clinical notes, consider using **ClinicalBERT**.
   - For PubMed abstracts, try **SciBERT**.
   - Maintain these models for high-priority or low-recall tasks only.

#### **4. Long-Term Scalability**:
   - Use **Azure AI Search** for scaling the vector store.
   - Integrate specialized models incrementally if retrieval performance requires domain-specific improvements.

---

### **Next Steps**

1. **Designing a single-model embedding pipeline**  
2. **Drafting a hybrid pipeline to test both approaches** (covered separately in the advanced examples)  
3. **Evaluating data quality and embedding strategies with example queries**

## **Designing a Single-Model Embedding Pipeline**

The single-model embedding pipeline focuses on using a **standardized embedding model** to process all types of data in a Retrieval-Augmented Generation (RAG) pipeline. This simplifies operations while ensuring compatibility and scalability.

---

### **Pipeline Components**

#### **1. Data Ingestion**
- **Purpose**: Collect and preprocess data from various sources (e.g., TIU notes, ICD codes, MedlinePlus).
- **Key Steps**:
  - Connect to data sources (e.g., EHR systems, APIs, local files).
  - Normalize the data (e.g., remove duplicates, format for readability).
  - Ensure sensitive data is de-identified (e.g., for TIU notes).

#### **2. Preprocessing**
- **Purpose**: Prepare data for embedding by cleaning, tokenizing, and splitting text.
- **Key Steps**:
  - **Text Cleaning**: Remove unnecessary punctuation, standardize casing, and handle special characters.
  - **Chunking**:
    - Divide long documents (e.g., TIU notes) into smaller, manageable chunks.
    - Overlap chunks slightly to retain context across boundaries.
  - **Metadata Enrichment**:
    - Add fields like `source`, `specialty`, `tags`, and `date` to enrich retrieval capabilities.
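
Chunking with overlap can be sketched as a sliding window. This version counts whitespace-separated words for simplicity; production pipelines usually count model tokens instead. The `chunk_text` helper and its default sizes are illustrative assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk should carry a copy of its parent document's metadata (plus a chunk index) so that retrieval filters still apply at the chunk level.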

#### **3. Embedding Generation**
- **Purpose**: Convert text into dense numerical vectors using a single embedding model.
- **Model Choice**:
  - **Azure Option**: `text-embedding-ada-002` (recommended for uniformity and scalability).
  - **Implementation**:
    ```python
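    import os
    from openai import AzureOpenAI  # requires openai>=1.0

    # Endpoint and key come from the environment; adjust api_version to your resource.
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )

    def generate_embedding(text):
        # `model` is the Azure deployment name, assumed here to match the model name.
        response = client.embeddings.create(input=text, model="text-embedding-ada-002")
        return response.data[0].embedding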
    ```
- **Batch Processing**:
  - Embed documents in batches to improve throughput.
  - Store embeddings alongside metadata in a scalable storage solution.
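
Batch processing can be sketched independently of any particular embedding service by passing the embedder in as a function. Everything here is an assumption for illustration: `batched`, `embed_corpus`, the `text`/`metadata` keys, and especially `fake_embed_batch`, which is a stand-in and not a real model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_corpus(chunks, embed_batch, batch_size=16):
    """Embed chunks batch by batch and pair each vector with its metadata.

    chunks:      list of dicts with "text" and "metadata" keys.
    embed_batch: callable taking a list of strings, returning one vector per string.
    """
    records = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch([c["text"] for c in batch])
        for chunk, vector in zip(batch, vectors):
            records.append({"vector": vector, "metadata": chunk["metadata"]})
    return records

# Stand-in embedder: NOT a real model, just text length as a 1-dimensional vector.
def fake_embed_batch(texts):
    return [[float(len(t))] for t in texts]

chunks = [
    {"text": "chest pain", "metadata": {"source": "EHR"}},
    {"text": "GERD", "metadata": {"source": "MedlinePlus"}},
    {"text": "angina", "metadata": {"source": "ICD-10"}},
]
records = embed_corpus(chunks, fake_embed_batch, batch_size=2)
print(len(records), records[1])
# 3 {'vector': [4.0], 'metadata': {'source': 'MedlinePlus'}}
```

In practice `embed_batch` would wrap a single API call that accepts a list of inputs, so the batch size trades request count against per-request payload limits.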

#### **4. Vector Store**
- **Purpose**: Store embeddings and metadata for efficient similarity search.
- **Options**:
  - **Azure AI Search**: Fully managed search service.
  - **Open-Source**: FAISS or Milvus for local or hybrid setups. (Pinecone is a managed, proprietary alternative.)

#### **5. Retrieval**
- **Purpose**: Fetch the most relevant documents based on user queries.
- **Process**:
  - Convert the user query into an embedding.
  - Perform similarity search in the vector store to retrieve the top results.
  - Return results with associated metadata for contextual understanding.

#### **6. Augmentation**
- **Purpose**: Combine retrieved data with the user query for enhanced model responses.
- **Process**:
  - Concatenate retrieved documents with the query.
  - Pass the combined text to the LLM for a final answer.

---

### **Pipeline Flow**

1. **Input**:
   - Data sources: TIU notes, ICD codes, guidelines.
   - Query: "What are the causes of chest pain?"

2. **Output**:
   - Top relevant documents (e.g., "TIU note mentioning angina and GERD").
   - LLM-enhanced response: "Common causes of chest pain include angina, GERD, and anxiety. Here are recommended diagnostic steps..."

---

### **Infrastructure and Containers**

**Container tooling** such as **Docker** (for packaging) and **Kubernetes** (for orchestration) can help build a robust infrastructure for this pipeline. Here's how:

#### **Advantages of Containers**:
1. **Consistency**:
   - Containers ensure that the pipeline runs consistently across different environments (e.g., development, staging, production).

2. **Scalability**:
   - Container orchestration (e.g., Kubernetes) allows scaling individual pipeline components based on demand (e.g., increase embedding throughput during peak ingestion periods).

3. **Isolation**:
   - Each component (e.g., preprocessing, embedding, vector search) can run in isolated containers to prevent dependency conflicts.

4. **Portability**:
   - Containers make it easy to deploy the pipeline across cloud providers (e.g., Azure, AWS) or hybrid setups.

5. **Integration with CI/CD**:
   - Automated deployment pipelines can be established to update and maintain the pipeline seamlessly.

---

### **Pipeline Design with Containers**

#### **Key Components and Their Containers**:
| **Component**           | **Description**                                       | **Container Functionality**                          |
|--------------------------|-------------------------------------------------------|-----------------------------------------------------|
| **Preprocessing**        | Cleans, chunks, and enriches metadata.                | Runs NLP preprocessing libraries (e.g., spaCy).     |
| **Embedding Generation** | Converts text to embeddings using a single model.     | Calls the embedding API (e.g., Azure OpenAI) or hosts a local model. |
| **Vector Store**         | Stores embeddings and metadata for retrieval.         | Hosts FAISS, Milvus, or integrates with Azure AI Search.      |
| **Query Handler**        | Handles user queries and performs similarity search.  | Performs search and formats results.               |
| **LLM Interface**        | Combines retrieved documents with user queries.       | Sends augmented input to the LLM API.              |

#### **Example Workflow**:
1. **Preprocessing Container**:
   - Reads TIU notes from EHR, cleans text, and extracts metadata.
2. **Embedding Container**:
   - Generates embeddings for processed text.
3. **Vector Store Container**:
   - Stores embeddings and metadata for later retrieval.
4. **Query Handler Container**:
   - Accepts user queries, retrieves relevant documents, and passes results to the LLM.
5. **LLM Container**:
   - Uses Azure OpenAI API to generate the final response.

---

### **Next Steps**
1. **Define Preprocessing Workflow**:
   - Standardize text cleaning and chunking logic.
2. **Detail Embedding Storage**:
   - Decide on storage backends (e.g., Azure AI Search vs. FAISS).
3. **Plan Containerization**:
   - Identify container dependencies and orchestrator requirements.



### **Simplified RAG Pipeline Design Using OpenAI Embedding Model and FAISS**

For a small-scale implementation, we’ll leverage the **OpenAI embedding model (`text-embedding-ada-002`)** and the open-source FAISS library for vector storage. This setup is lightweight, efficient, and ideal for prototyping or smaller deployments.

---

### **Pipeline Overview**

1. **Data Ingestion**:
   - Input: TIU notes, ICD codes, MedlinePlus text, and other relevant medical data.
   - Processed into standardized text chunks.

2. **Preprocessing**:
   - Clean, tokenize, and chunk text into manageable segments.
   - Add metadata for filtering and retrieval.

3. **Embedding Generation**:
   - Use OpenAI’s `text-embedding-ada-002` to convert text chunks into vector embeddings.

4. **Vector Store**:
   - Store embeddings and metadata in a FAISS index for similarity-based retrieval.

5. **Retrieval**:
   - Convert user queries into embeddings.
   - Use FAISS to retrieve the top `k` relevant chunks.

6. **Augmentation**:
   - Combine retrieved data with the user query.
   - Pass the combined input to an LLM for response generation.

---

### **Advantages of OpenAI + FAISS Setup**

1. **Ease of Use**:
   - OpenAI’s embedding API is simple to integrate and performs well across diverse kinds of text.
   - FAISS provides a fast, reliable solution for vector-based retrieval.

2. **Cost Efficiency**:
   - FAISS runs locally, so no managed vector database is needed; the only hosted dependency is the embedding API, which is billed per token.

3. **Customizability**:
   - FAISS allows for on-premise deployment, enabling full control over the vector store and its index configuration.

4. **Scalability**:
   - Suitable for datasets with millions of vectors, with indexing options like IVF and PQ for large-scale operations.
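
To make the scalability point concrete, the arithmetic below compares the memory needed for raw `float32` vectors in a flat index with the size of product-quantization (PQ) codes. It is a rough estimate under stated assumptions: one million vectors, 64 sub-quantizers at 8 bits each, and no accounting for codebooks, vector ids, or inverted-list overhead. The function names are hypothetical.

```python
def flat_index_bytes(num_vectors: int, dimension: int, bytes_per_float: int = 4) -> int:
    """Memory for storing every vector uncompressed as float32."""
    return num_vectors * dimension * bytes_per_float

def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Memory for PQ codes only: one small code per sub-quantizer per vector."""
    return num_vectors * num_subquantizers * bits_per_code // 8

num_vectors = 1_000_000
dimension = 1536          # text-embedding-ada-002
num_subquantizers = 64    # assumption; must divide the dimension evenly

flat = flat_index_bytes(num_vectors, dimension)
pq = pq_code_bytes(num_vectors, num_subquantizers)

print(f"flat: {flat / 1e9:.2f} GB, PQ codes: {pq / 1e6:.0f} MB, ratio: {flat // pq}x")
```

The compression comes at the cost of approximate distances, so retrieval quality should be re-measured after switching from a flat index to a quantized one.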

---

### **Pipeline Design Steps**

#### **Step 1: Preprocessing**
- **Objective**: Standardize and prepare data for embedding.
- **Workflow**:
  1. Clean raw text (e.g., remove special characters, normalize casing).
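  2. Chunk long text into manageable pieces (e.g., about 512 words with a small overlap; the sample code splits on whitespace, so word counts only approximate model tokens).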
  3. Add metadata fields (e.g., `source`, `category`, `tags`).

```python
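def preprocess_text(text, max_length=512, overlap=50):
    """
    Clean text and yield overlapping chunks of at most `max_length` words.
    Whitespace-separated words are used as a rough proxy for tokens.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")

    # Basic cleaning (extend as needed); split() also collapses newlines and extra spaces
    words = text.lower().split()

    # Chunking logic: slide a window of max_length words, stepping by max_length - overlap
    step = max_length - overlap
    for i in range(0, len(words), step):
        yield " ".join(words[i:i + max_length])
        if i + max_length >= len(words):
            break  # this window already reached the end of the text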
```
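
The chunker yields bare strings, so the metadata from step 3 of the workflow still has to be attached. A minimal sketch of one way to do that follows; the `ChunkRecord` class, the `attach_metadata` helper, the id format, and the sample values are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List

@dataclass
class ChunkRecord:
    """One text chunk plus the metadata used for filtering at query time."""
    chunk_id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def attach_metadata(
    doc_id: str,
    chunks: Iterable[str],
    source: str,
    category: str,
    tags: Iterable[str],
) -> List[ChunkRecord]:
    """Wrap each chunk of a document in a record that shares document-level metadata."""
    tag_list = list(tags)
    records = []
    for position, chunk in enumerate(chunks):
        records.append(ChunkRecord(
            chunk_id=f"{doc_id}-{position}",
            text=chunk,
            metadata={
                "source": source,
                "category": category,
                "tags": list(tag_list),  # copy so records do not share one list
                "position": position,    # order of the chunk within its document
            },
        ))
    return records

chunk_records = attach_metadata(
    doc_id="note-001",
    chunks=["first chunk of the note", "second chunk of the note"],
    source="EHR",
    category="TIU note",
    tags=["cardiology"],
)
print(chunk_records[1].chunk_id, chunk_records[1].metadata["position"])
```

Storing the chunk position alongside the document id makes it possible to fetch neighboring chunks later if a retrieved passage needs more surrounding context.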

---

#### **Step 2: Embedding and Indexing**
- **Objective**: Generate embeddings and store them in a FAISS index.
- **Workflow**:
  - Generate embeddings for each text chunk using OpenAI’s API.
  - Store the embeddings and associated metadata in a FAISS index.

```python
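import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment
import faiss
import numpy as np

# Initialize FAISS index (exact search on squared L2 distance)
dimension = 1536  # Dimensionality of `text-embedding-ada-002`
index = faiss.IndexFlatL2(dimension)

# Chunk texts in insertion order, so list position i matches FAISS id i
metadata = []

# Generate embeddings and store in FAISS
def generate_and_store_embeddings(text_chunks):
    text_chunks = list(text_chunks)  # accept generators such as preprocess_text()
    if not text_chunks:
        return metadata

    # One request for the whole list; split very large corpora into smaller batches
    response = openai.embeddings.create(
        input=text_chunks,
        model="text-embedding-ada-002"
    )
    # Each result carries the index of its input; sort to keep the order aligned
    ordered = sorted(response.data, key=lambda item: item.index)
    embeddings = [item.embedding for item in ordered]

    # Convert embeddings to numpy array and add to FAISS
    embeddings_np = np.array(embeddings, dtype="float32")
    index.add(embeddings_np)
    metadata.extend(text_chunks)  # Keep the chunk text for each stored vector

    return metadata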
```

---

#### **Step 3: Query and Retrieval**
- **Objective**: Retrieve the top `k` relevant chunks for a user query.
- **Workflow**:
  - Embed the user query.
  - Perform similarity search in the FAISS index.
  - Return the top matches and their metadata.

```python
def retrieve_similar_chunks(query, top_k=5):
    # Generate embedding for the query
    query_embedding = openai.embeddings.create(
        input=query,
        model="text-embedding-ada-002"
    ).data[0].embedding

    # Convert query to numpy array and search FAISS
    query_np = np.array([query_embedding], dtype="float32")
    distances, indices = index.search(query_np, top_k)

    # Retrieve corresponding chunks. `metadata` is the list returned by
    # generate_and_store_embeddings(). FAISS pads with -1 when fewer than
    # top_k vectors are indexed, so those slots are skipped.
    results = [
        (metadata[i], float(dist))
        for dist, i in zip(distances[0], indices[0])
        if i != -1
    ]
    return results
```
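
`IndexFlatL2` reports squared L2 distances, where smaller means more similar. Assuming the embeddings are unit-length (OpenAI documents its embeddings as normalized to length 1), the squared distance and the cosine similarity are tied by `d² = 2 - 2·cos`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks the identity on toy vectors; the helper names are hypothetical.

```python
import math
from typing import List

def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def squared_l2_to_cosine(distance_sq: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance_sq / 2.0

a = normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = normalize([1.0, 0.0])

d_sq = squared_l2(a, b)
cos = dot(a, b)
print(d_sq, 2 - 2 * cos, squared_l2_to_cosine(d_sq))
```

If the embeddings were not normalized, the two rankings could differ, and an inner-product index over normalized vectors would be the usual way to get cosine ranking.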

---

#### **Step 4: Augmentation and Response Generation**
- **Objective**: Use retrieved chunks to enhance the user query before passing it to an LLM.
- **Workflow**:
  - Combine retrieved text chunks with the user query.
  - Generate a final response using an LLM.

```python
def generate_response(query, top_k=5):
    # Retrieve similar chunks
    similar_chunks = retrieve_similar_chunks(query, top_k=top_k)

    # Combine query with retrieved context
    context = "\n".join(chunk[0] for chunk in similar_chunks)
    augmented_query = f"Context:\n{context}\n\nQuestion: {query}"

    # Use a chat model to generate the response
    # (`text-davinci-003` and the legacy Completion API have been retired)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": augmented_query}],
        max_tokens=200
    )
    return response.choices[0].message.content
```

---

### **Next Steps**

1. **Prepare Sample Data**:
   - Use a small corpus of TIU notes, ICD codes, and MedlinePlus data for testing.
   
2. **Run Pipeline End-to-End**:
   - Test the ingestion, embedding, and retrieval processes.

3. **Containerize the Pipeline**:
   - Package preprocessing, embedding, and retrieval steps in separate Docker containers.

4. **Evaluate Performance**:
   - Measure retrieval accuracy and response quality with sample queries.
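
Retrieval accuracy can be measured with a small set of queries whose relevant documents are known in advance. The sketch below computes two common metrics, recall@k and reciprocal rank, for a single query; the function names and the document ids are made up for illustration, and a real evaluation would average these values over many labeled queries.

```python
from typing import List, Set

def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical result list for one query, best match first
retrieved = ["d3", "d1", "d5"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, k=3))   # 0.5: one of two relevant docs found
print(reciprocal_rank(retrieved, relevant))    # 0.5: first relevant doc at rank 2
```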


In [4]:
#pip install sentence-transformers --break-system-packages

In [3]:
# pip install faiss-cpu --break-system-packages

In [None]:
# ======================================
# Initialization and Environment Setup
# ======================================

import os
import requests
import numpy as np
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import faiss
import openai

# Load environment variables from .env file
load_dotenv()

# Helper function to load environment variables
def get_env_var(var: str):
    value = os.getenv(var)
    if value is None:
        raise ValueError(f"{var} not found in environment variables. Make sure it is set in your .env file.")
    return value

# Load API keys
langchain_api_key = get_env_var("LANGCHAIN_API_KEY")
langchain_tracing_v2 = get_env_var("LANGCHAIN_TRACING_V2")
openai_api_key = get_env_var("OPENAI_API_KEY")
anthropic_api_key = get_env_var("ANTHROPIC_API_KEY")
grok_api_key = get_env_var("GROK_API_KEY")

# Set OpenAI API key
openai.api_key = openai_api_key

# ======================================
# Model Setup
# ======================================

# OpenAI GPT models
gpt4o_chat = ChatOpenAI(model="gpt-4o", temperature=0, openai_api_key=openai_api_key)
gpt35_chat = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0, openai_api_key=openai_api_key)

# Anthropic Claude models
claude = Anthropic(api_key=anthropic_api_key)
claude_chat = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0, anthropic_api_key=anthropic_api_key)

# Sentence Transformer model for embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# ======================================
# Grok API Integration (from existing code)
# ======================================

def query_grok(prompt: str, model="grok-beta", stream=False, temperature=0):
    """
    Query the Grok API with a user-provided prompt and return cleaned content.
    """
    # Define the Grok API endpoint and headers
    url = "https://api.x.ai/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {grok_api_key}"
    }

    # Define the payload
    payload = {
        "messages": [
            {"role": "system", "content": "You are Grok, a chatbot inspired by the Hitchhikers Guide to the Galaxy."},
            {"role": "user", "content": prompt}
        ],
        "model": model,
        "stream": stream,
        "temperature": temperature
    }

    try:
        # Send the request to the Grok API
        response = requests.post(url, headers=headers, json=payload, timeout=60)  # avoid hanging indefinitely
        response.raise_for_status()  # Raise an error if the request fails
        
        # Parse and return the relevant content
        response_json = response.json()
        choices = response_json.get("choices", [])
        if choices and "content" in choices[0].get("message", {}):
            return choices[0]["message"]["content"]  # Extract only the assistant's content
        else:
            return "No content returned"

    except requests.exceptions.RequestException as e:
        print("Error querying Grok API:", e)
        return "Error querying Grok API."


# ======================================
# RAG Pipeline Components
# ======================================

# Step 1: Index the Medical Corpus
def create_faiss_index(documents):
    """
    Create a FAISS index from the given corpus of documents.
    """
    embeddings = embedding_model.encode(documents)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(np.array(embeddings))
    return index, embeddings

# Step 2: Query the Index
def query_faiss_index(index, query, k=2):
    """
    Retrieve the top-k most relevant documents for a given query.
    """
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(np.array(query_embedding), k)
    return distances, indices

# Step 3: Generate a Response
def generate_response_with_context(context, query):
    """
    Use OpenAI's GPT-4o chat model to generate a response based on context.
    """
    prompt = f"""
    You are a Medical AI Assistant. Using the following medical information, answer the query provided by the clinician.

    Medical Information:
    {context}

    Clinician's Query:
    {query}

    Provide a concise response that addresses the query.
    """
    # openai>=1.0 chat API; the legacy Completion endpoint and
    # `text-davinci-003` are no longer available
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150
    )
    return response.choices[0].message.content.strip()


# ======================================
# Unified Response Comparison with Formatting and Step-specific Saving
# ======================================

def format_response(response: str) -> str:
    """
    Standardize the formatting of the model's response for consistent readability.
    """
    # Split the response into sections based on categories (if applicable)
    sections = response.split("\n")
    formatted_response = []
    for section in sections:
        # Add formatting for sections that appear as headers
        if section.strip().endswith(":"):
            formatted_response.append(f"\n### {section.strip()}")  # Add Markdown-style headers
        else:
            formatted_response.append(section.strip())  # Keep other lines as is
    return "\n".join(formatted_response)

def save_comparison_to_markdown(prompt: str, results: dict, filename: str):
    """
    Save the formatted output of compare_responses to a Markdown file.

    Parameters:
    - prompt: The prompt used for the comparison.
    - results: A dictionary containing model responses.
    - filename: The filename for the Markdown file.
    """
    try:
        with open(filename, "w", encoding="utf-8") as f:
            # Write the prompt
            f.write(f"# Prompt:\n\n{prompt}\n\n")
            f.write("=" * 80 + "\n\n")
            
            # Write each model's response
            for model, response in results.items():
                f.write(f"## {model} Response\n\n")
                f.write(f"{response}\n\n")
                f.write("-" * 80 + "\n\n")
        
        print(f"Saved comparison results to {filename}")
    except Exception as e:
        print(f"Error saving comparison to {filename}: {e}")

def compare_responses(prompt: str, step_name="initial", include_claude=True, include_gpt4=True, include_gpt35=True, include_grok=True, save_dir="outputs"):
    """
    Compare responses from different models for the same prompt, format the output, and save to step-specific Markdown files.

    Parameters:
    - prompt: The input prompt for comparison.
    - step_name: A unique identifier for the step (e.g., "initial", "step1").
    - include_claude: Include Claude model in the comparison.
    - include_gpt4: Include GPT-4 model in the comparison.
    - include_gpt35: Include GPT-3.5 model in the comparison.
    - include_grok: Include Grok model in the comparison.
    - save_dir: Directory to save the Markdown file.
    """
    results = {}

    # Collect responses from all included models
    if include_claude:
        claude_response = claude_chat.invoke(prompt)
        results["Claude-3.5-Sonnet"] = format_response(claude_response.content)

    if include_gpt4:
        gpt4_response = gpt4o_chat.invoke(prompt)
        results["GPT-4o"] = format_response(gpt4_response.content)

    if include_gpt35:
        gpt35_response = gpt35_chat.invoke(prompt)
        results["GPT-3.5"] = format_response(gpt35_response.content)

    if include_grok:
        grok_response = query_grok(prompt)
        results["Grok-Beta"] = format_response(grok_response)

    # Ensure output directory exists
    os.makedirs(save_dir, exist_ok=True)

    # Generate the step-specific filename
    output_filename = os.path.join(save_dir, f"{step_name}_comparison_responses.md")
    save_comparison_to_markdown(prompt, results, output_filename)

    # Display the formatted results in the console
    print(f"\nPrompt: {prompt}\n")
    print("=" * 80)
    for model, response in results.items():
        print(f"\n{model}:\n")
        print(response)
        print("-" * 80)

# ======================================
# Medical AI Use-Case: RAG Implementation
# ======================================

# Example medical corpus
medical_documents = [
    "TIU Note: Patient reported chest pain during exertion, described as pressure-like, radiating to the left arm. EKG shows ST-segment elevation, suggesting myocardial infarction. Immediate transfer to cardiology recommended.",
    "TIU Note: Veteran presents with sharp pleuritic chest pain worsening with deep breaths. Imaging confirmed pulmonary embolism. Initiated anticoagulation therapy.",
    "TIU Note: Veteran with a history of GERD reports chest discomfort after meals. Symptoms relieved by antacids. Endoscopy scheduled to rule out esophageal abnormalities.",
    "TIU Note: Anxiety-related chest tightness reported. Symptoms associated with episodes of hyperventilation during stressful situations. Referred to behavioral health for evaluation.",
    "TIU Note: Veteran complaining of acute chest pain and shortness of breath. Differential includes acute coronary syndrome, pneumonia, or musculoskeletal etiology. Labs and chest X-ray pending for further evaluation."
]


# Clinician's query
clinician_query = "What are the cardiovascular causes of chest pain?"

# Create FAISS index
faiss_index, _ = create_faiss_index(medical_documents)

# Query the index
distances, indices = query_faiss_index(faiss_index, clinician_query, k=2)

# Retrieve top documents
retrieved_docs = [medical_documents[idx] for idx in indices[0]]

# Combine context for generation
context = " ".join(retrieved_docs)

# Generate response
response = generate_response_with_context(context, clinician_query)

# Display output
print("Retrieved Documents:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"{i}. {doc}")

print("\nGenerated Response:")
print(response)
