# Health Data

## Health Data - Introduction

This lecture introduces the key players in healthcare and describes the life cycle of healthcare data. Here's a recap and some additional context to better understand each component.

### Major Players in Healthcare
1. **Patients**: At the heart of the healthcare system, patients interact with all other entities. They receive care, medications, and coverage, driving much of the data generation.
  
2. **Providers**: These are healthcare entities, including doctors, hospitals, clinics, and pharmacies. They deliver care to patients, generate electronic health records (EHRs), and manage patient interactions.

3. **Payers**: Payers are the organizations that cover the costs of healthcare services, such as private insurance companies and government programs (e.g., Medicare). They receive medical and pharmacy claims data and play a crucial role in reimbursement and managing healthcare costs.

4. **Pharmaceutical Companies**: They develop new treatments and drugs. These companies conduct clinical trials, a critical process that generates clinical trial data for evaluating the safety and effectiveness of new drugs.

5. **Government Agencies**: Agencies like the FDA (Food and Drug Administration), CDC (Centers for Disease Control and Prevention), and CMS (Centers for Medicare & Medicaid Services) regulate healthcare, ensure compliance, and oversee public health policies. They interact with other players to enforce standards and approve new drugs and treatments.

6. **Researchers**: They are involved across various organizations, from healthcare providers to pharmaceutical companies and payers. Researchers analyze healthcare data, conduct studies, and contribute to medical advancements. Their work often results in new publications, clinical guidelines, and further discoveries.

### Life Cycle of Healthcare Data
Healthcare data emerges from various interactions within the system. Here's a simplified flow of how data is generated and used:

1. **Patient-Provider Interaction**: When patients seek care from providers (e.g., hospitals or clinics), this interaction generates electronic health records (EHRs). These records capture patient information, diagnoses, treatment plans, and outcomes.

2. **Medical Claims**: After providing services, hospitals and providers submit medical claims to payers (insurance companies) to get reimbursed for the care provided. This generates another form of data—medical claims data, which includes billing codes and payment details.

3. **Pharmacy Claims**: When patients receive prescriptions, pharmacies process these medications and submit pharmacy claims to insurance companies. This transaction generates pharmacy claims data.

4. **Clinical Trials**: Pharmaceutical companies conduct clinical trials to test new drugs or treatments. They collect data on the participants, outcomes, and safety of the drugs. This data is crucial for the drug approval process and is submitted to regulatory agencies like the FDA.

5. **Medical Research and Publications**: Researchers across the healthcare ecosystem analyze the data generated in hospitals, insurance companies, clinical trials, and pharmacies. Their findings are often published as medical literature, contributing to guidelines and best practices in the medical field.

### Types of Healthcare Data
Different types of data are created and analyzed within this healthcare system:

1. **Electronic Health Record (EHR) Data**: This is patient-level data recorded by providers during healthcare encounters. It includes demographics, medical history, diagnoses, medications, and treatment plans.

2. **Medical Claims Data**: Claims data is generated when providers submit claims to insurers for reimbursement. It typically includes details about procedures, diagnoses, and billing codes.

3. **Clinical Notes**: These are unstructured text data from doctors and nurses that are part of EHRs. They include observations, medical assessments, and treatment plans that are often key for understanding the patient's condition.

4. **Medical Literature**: This includes published research papers, clinical guidelines, and reviews. These documents form the foundation of evidence-based practice in medicine.

5. **Continuous Signals**: Data from monitoring devices, especially in critical care settings like the ICU. These signals, such as heart rate or blood pressure readings, are continuously recorded to monitor patient status.

6. **Medical Imaging**: Imaging data includes X-rays, MRI scans, CT scans, and other visual data used for diagnosing and monitoring conditions.

7. **Medical Ontologies**: These are structured knowledge graphs that represent medical concepts and their relationships. Ontologies help standardize terminology and assist in data integration across systems.

8. **Clinical Trials Data**: This data includes information about the design, conduct, and results of clinical trials. It is essential for evaluating new drugs and treatments.

9. **Drug Discovery Data**: This includes molecular data and drug databases that support pharmaceutical research and development. It helps researchers identify potential drug candidates and understand their mechanisms.

### Conclusion
Understanding the key players in healthcare and the types of data they generate is crucial for leveraging healthcare data effectively. Each type of data serves a specific purpose and collectively supports patient care, research, and innovation in medicine. As we move forward, analyzing these data types in detail will deepen our understanding of their roles and applications.

## Health Data: EHR, Notes

### Electronic Health Record (EHR) Data Overview

**Adoption and Types of EHRs**:
- The adoption of EHR systems among U.S. hospitals has risen significantly and is approaching 100%. By 2016, over 90% of hospitals had adopted an EHR, either a basic EHR system or a certified EHR, which includes the key functionalities required by certification processes.
- EHR systems now serve as a digital alternative to paper charts, capturing comprehensive patient health data across multiple encounters.

**What Information is Captured in EHR Data?**
- **Demographic Information**: Basic patient details such as age, gender, and contact information.
- **Medications**: Records of medications prescribed and administered.
- **Doctor’s Notes**: Clinical notes containing observations, diagnoses, and treatment plans.
- **Medical Codes**: Diagnosis codes (e.g., ICD codes), procedure codes (e.g., CPT codes), and medication codes.
- **Medical Imaging**: X-rays, MRIs, CT scans, and other diagnostic images.
- **Monitoring Data**: Continuous data from devices monitoring vitals like ECG, EEG, etc.

**Structure of EHR Data**:
- EHR data is often **longitudinal**, meaning it captures multiple encounters (e.g., v1, v2, v3) over time, creating a timeline of the patient's health history.
  - **Point Events**: Outpatient visits (e.g., routine checkups, minor consultations) happening on a specific date (t1, t3).
  - **Interval Events**: Inpatient visits, spanning from admission (t2a) to discharge (t2b), capturing a more complex range of data including monitoring and imaging.

**EHR Hierarchical Structure**:
- **Top Level**: Patient
- **Next Level**: Multiple visits (v1, v2, ..., vt)
- **Next Level**: Each visit contains symptoms, diagnoses, medications, procedures, and tests.
  
This hierarchical structure reflects the complexity of EHR data, capturing patient history in detail across various healthcare interactions.
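
The patient, visit, and event levels described above can be sketched as nested records. This is a minimal illustration, not the schema of any real EHR system: the class names, field names, and the sample patient are all assumptions, and a missing discharge date is used here to mark a point event.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Visit:
    """One encounter; discharge stays None for a point event (outpatient)."""
    start: date
    discharge: Optional[date] = None
    diagnoses: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    procedures: List[str] = field(default_factory=list)

    @property
    def is_interval_event(self) -> bool:
        # Inpatient stays span admission to discharge.
        return self.discharge is not None


@dataclass
class Patient:
    patient_id: str
    visits: List[Visit] = field(default_factory=list)

    def timeline(self) -> List[Visit]:
        """Visits sorted by start date, giving the longitudinal view."""
        return sorted(self.visits, key=lambda v: v.start)


# Hypothetical patient with one inpatient stay and one outpatient visit.
patient = Patient("p001", [
    Visit(date(2020, 3, 1), date(2020, 3, 5), diagnoses=["250.01"]),
    Visit(date(2020, 1, 10), diagnoses=["487.1"]),
])
print([v.is_interval_event for v in patient.timeline()])  # [False, True]
```

Sorting by start date puts the outpatient visit first, so the printed flags follow the order of the timeline rather than the order of data entry.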

### EHR Data and Deep Learning
- Deep learning can convert raw EHR data into **embeddings**—vector representations of patients, visits, or medical codes.
- These embeddings can be used for various tasks, including:
  - **Predictive Modeling**: Forecasting patient outcomes or disease progression.
  - **Clustering**: Grouping similar patients or conditions for analysis.
  - **Data Augmentation**: Enhancing datasets for better model training.
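
To make the idea of an embedding concrete, the sketch below represents a visit as the average of the vectors of its medical codes. It is only an illustration under strong assumptions: the vectors are random stand-ins for what a trained model would learn, and the dimension, code list, and helper names are hypothetical.

```python
import math
import random

random.seed(0)
DIM = 4  # assumed embedding size, chosen only for readability
codes = ["250.01", "487.1", "733.6"]
# Random vectors stand in for learned code embeddings.
code_embedding = {c: [random.gauss(0, 1) for _ in range(DIM)] for c in codes}


def visit_embedding(visit_codes):
    """Average the vectors of the codes recorded in one visit."""
    vectors = [code_embedding[c] for c in visit_codes]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, a common way to compare two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


visit_a = visit_embedding(["250.01", "487.1"])
visit_b = visit_embedding(["250.01", "733.6"])
print(len(visit_a), cosine(visit_a, visit_b))
```

Similarity scores like this one are what clustering and nearest-neighbor style tasks operate on once the vectors come from a trained model.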

### Pros and Cons of EHR Data

#### Pros:
- **Rich Data**: EHR data includes both structured (e.g., codes, dates) and unstructured data (e.g., clinical notes).
- **Longitudinal Nature**: Captures patient history over time, offering a temporal view of health evolution.
  
#### Cons:
- **Complexity**: EHRs include a mix of data modalities—structured codes, numerical values, time-series data, medical images, and text (e.g., clinical notes), making it complex to process.
- **Siloed Data**: Data is often isolated within individual hospital systems, meaning a patient's complete medical history may be split across multiple hospitals and inaccessible from a single EHR system.
- **Sensitivity**: EHR data is highly sensitive due to privacy concerns, limiting access for research and requiring careful handling.

### Clinical Notes in EHR Data
- **Clinical Notes**: Free-text documentation by healthcare providers (e.g., doctors, nurses) containing detailed patient information. Notes vary based on the stage of care and provider type, including admission notes, progress notes, discharge summaries, and more.
  - **SOAP Structure** (for progress notes):
    - **Subjective**: Patient's own description of symptoms.
    - **Objective**: Doctor’s observations and test results.
    - **Assessment**: Diagnosis made by the provider.
    - **Plan**: Treatment plan, including prescribed medications or procedures.
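
When a progress note labels its sections explicitly, the SOAP structure can be recovered with a simple split. The sketch below assumes each section starts on its own line with a `Header:` prefix, which many real notes do not follow; the function name and the sample note are made up for illustration.

```python
import re

SOAP_HEADERS = ("Subjective", "Objective", "Assessment", "Plan")


def split_soap(note: str) -> dict:
    """Split a progress note into SOAP sections keyed by header name."""
    pattern = re.compile(r"^(%s):\s*" % "|".join(SOAP_HEADERS), re.MULTILINE)
    matches = list(pattern.finditer(note))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next header or the end of the note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[m.group(1)] = note[m.end():end].strip()
    return sections


note = """Subjective: Patient reports headache for two days.
Objective: BP 120/80, temperature normal.
Assessment: Tension headache.
Plan: Rest and fluids; follow up in one week."""
print(split_soap(note)["Assessment"])  # Tension headache.
```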

### Properties of Clinical Notes

#### Pros:
- **Detail-Oriented**: Captures in-depth information about a patient’s condition and treatment.
- **Universal**: Used across all clinical encounters, making it a common data source in EHR systems.
- **Flexible**: Can document various aspects of a patient’s health, from symptoms to lab results, in a free-text format.

#### Cons:
- **Unstructured Data**: Difficult for automated systems to process due to its narrative format.
- **Noisy Data**: Variability in quality due to time constraints on providers, use of acronyms, and possible typos.
- **Sensitive Information**: Free-text data can contain sensitive details, such as personally identifiable information (PII), increasing privacy risks.

### Conclusion
EHR data is essential for capturing comprehensive patient information across healthcare encounters, but its complexity, sensitivity, and siloed nature present challenges. Despite these challenges, deep learning techniques can help transform raw EHR data into actionable insights, enabling predictive models, clustering, and more.

## Health Data: Claims, Signals

### Claims Data (Administrative Data) Overview

Claims data, also known as administrative data, is the electronic transaction data that providers submit to payers after patient-provider interactions. Providers need these transactions to receive payment, and a claim serves as the proof an insurance company requires to process that payment.

### Types of Claims Data

Claims data can be categorized based on the type of insurance:
1. **Medicare Claims Data**:
   - **Part A**: Covers inpatient hospital visits, nursing facilities, and hospice care.
   - **Part B**: Covers outpatient services, physician services, and ambulance visits.
   - **Part C (Medicare Advantage)**: A capitated plan offered by private insurers as an alternative way to receive Part A and Part B benefits.
   - **Part D**: Covers prescription drugs.
   
2. **Commercial or Private Insurance Claims Data**:
   - Similar to Medicare claims, these include:
     - **Inpatient Claims**: Covers hospital stays, nursing home visits, and hospice care.
     - **Outpatient Claims**: Covers ambulatory visits, outpatient centers, and facility fees.
     - **Professional Claims**: Covers fees for doctors and other healthcare professionals.
     - **Pharmacy Claims**: Covers prescription drugs.

### Information Captured in Claims Data
Claims data contains various structured information, including:
- **Member Information**: Details about the patient (name, insurance ID, etc.).
- **Date and Place of Service**: When and where the service was provided.
- **Diagnosis Codes**: ICD codes representing the patient’s diagnosis.
- **Procedure Codes**: HCPCS/CPT codes representing the medical procedures performed.
- **Drug Information**: Documented using NDC (National Drug Code).
- **Financial Information**: Details of charges, payments, and expenditures.
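
The fields listed above map naturally onto a flat record. The sketch below is one possible layout, not a real claim format: the class, field names, member ID, and dollar amounts are all hypothetical, and the codes are reused from elsewhere in these notes purely as placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Claim:
    member_id: str                       # member information
    service_date: date                   # date of service
    place_of_service: str                # place of service
    diagnosis_codes: List[str] = field(default_factory=list)  # ICD codes
    procedure_codes: List[str] = field(default_factory=list)  # HCPCS/CPT codes
    ndc: Optional[str] = None            # drug information (pharmacy claims)
    charged: float = 0.0                 # financial information
    paid: float = 0.0


# Illustrative claims for one hypothetical member.
claims = [
    Claim("M1", date(2021, 1, 5), "office", ["J11.1"], ["99203"],
          charged=150.0, paid=90.0),
    Claim("M1", date(2021, 1, 6), "pharmacy", ndc="0093-7214-01",
          charged=12.0, paid=5.5),
]

# Aggregate payments per member, a typical utilization summary.
paid_by_member = defaultdict(float)
for claim in claims:
    paid_by_member[claim.member_id] += claim.paid
print(dict(paid_by_member))  # {'M1': 95.5}
```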

### Properties of Claims Data

#### Pros:
- **Large Volume**: Claims data is extensive, with a large amount of data available for analysis.
- **Holistic View**: Provides a comprehensive view of all patient interactions with healthcare providers during the covered period.
- **Medication Compliance**: Helps track whether patients fill their prescriptions, which can indicate medication adherence.

#### Cons:
- **Coding Errors**: Since claims data is primarily for billing purposes, coding inaccuracies can occur.
- **Not Clinical Documentation**: The data may not reflect the full clinical picture of the patient, as its primary purpose is to facilitate payment, not clinical care.
- **Time Lags**: Claims data can have delays of weeks or even months before it is fully processed, limiting its use for real-time decision-making.
- **Temporal Limitations**: If patients change insurance plans, tracking them over long periods becomes challenging. Claims data only covers the period the patient was with a specific insurer, resulting in potential gaps in the longitudinal view of a patient’s healthcare history.

### Comparison: Claims Data vs. EHR Data

- **Scope of Data**: Claims data captures all billed transactions between a patient and any provider during the covered period, offering a broad view of patient interactions. In contrast, EHR data captures only the care provided by specific providers and is often siloed within individual hospitals.
  
- **Medication Data**:
  - **Claims Data**: Shows that a prescription was filled, indicating that the patient likely received the medication, but it doesn’t confirm if the medication was taken.
  - **EHR Data**: Records prescriptions but doesn’t show if they were filled or taken.

- **Data Richness**:
  - **Claims Data**: Contains structured codes (e.g., diagnosis, procedure, and drug codes), offering limited detail.
  - **EHR Data**: Richer in content, including clinical notes, lab results, vital signs, problem lists, social history, and medical images, providing a more comprehensive view of the patient's health.

### Continuous Signals Data Overview

Continuous signals data is generated from real-time monitoring devices, particularly in settings like intensive care units (ICUs). These devices continuously monitor a patient’s vital signs and physiological data, generating large volumes of real-time information.

### Types of Continuous Signals Data
Common types of continuous monitoring include:
- **ECG (Electrocardiography)**: Tracks heart activity.
- **Oxygen Saturation**: Monitors blood oxygen levels.
- **Heart Rate**: Tracks the patient’s pulse rate.
- **Blood Pressure**: Measures the patient’s blood pressure over time.
- **EEG (Electroencephalography)**: Tracks brain wave activity.

### Properties of Continuous Signals Data

#### Pros:
- **Detailed Data**: Continuous monitoring produces highly detailed data, often sampled at high frequencies (e.g., 200 Hz), capturing fine-grained information about a patient’s physiological state.
- **Objective Measures**: The data is directly captured from sensors, minimizing subjective human interpretation and providing a more objective measure of the patient's condition.

#### Cons:
- **Noisy Data**: Sensors can produce noisy data due to issues like improper placement, interference, or patient movement.
- **Large Volume**: Continuous monitoring generates vast amounts of data, especially over long periods and across many patients, requiring significant storage and processing capabilities.
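
A quick calculation shows how fast a single signal accumulates at the 200 Hz example rate. The sample width and the single-channel setup below are assumptions made only for the arithmetic; real monitors differ.

```python
SAMPLING_RATE_HZ = 200   # example rate mentioned above
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples
CHANNELS = 1             # assumption: one signal channel

SECONDS_PER_DAY = 60 * 60 * 24
samples_per_day = SAMPLING_RATE_HZ * SECONDS_PER_DAY
bytes_per_day = samples_per_day * BYTES_PER_SAMPLE * CHANNELS

print(samples_per_day)       # 17280000 samples per patient per day
print(bytes_per_day / 1e6)   # 34.56 (megabytes per day, uncompressed)
```

Multiplying by the number of channels, patients, and days of monitoring gives a rough sense of the storage and processing load.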
  
### Conclusion
Both claims data and continuous signals data play crucial roles in healthcare, each offering unique advantages and challenges. Claims data provides a broad view of patient-provider interactions, valuable for understanding healthcare utilization and medication adherence, while continuous signals offer real-time, detailed physiological data crucial for immediate clinical decision-making. However, both types of data have limitations, such as potential inaccuracies, noise, and time lags, making it essential to use them carefully in research and clinical applications.

## Health Data: Images, Literature, Drugs

### Types of Healthcare Data and Their Properties

#### Imaging Data

**Types of Imaging Data:**
1. **X-ray**
2. **Computed Tomography (CT)**
3. **Magnetic Resonance Imaging (MRI)**
4. **Positron Emission Tomography (PET)**, often combined with CT as PET-CT

**Example Sizes:**
- **Full-body PET-CT**: ~8 GB
- **CT Cardiac**: ~36 GB
- **fMRI**: Up to ~300 GB

**Properties:**
- **Objective Measures**: Imaging data is captured directly from sensors, without human interpretation in the raw data.
- **Standardized Data**: Uses specific standards (e.g., DICOM) which aid in the generalizability of models trained on such data.
- **High Resolution**: Images are detailed and large, capturing intricate physiological details.

**Limitations:**
- **Insufficient Labels**: Annotating and labeling imaging data is labor-intensive, leading to potential gaps in high-quality labels, especially with large datasets.
- **High Dimensionality**: The large size and high resolution of images create challenges in storage and analysis.

#### Medical Literature

**Sources:**
- **PubMed**: A comprehensive search engine for medical publications.
- **Guideline Central**: Provides clinical guidelines and treatment protocols.

**Properties:**
- **High Quality**: Medical literature is well-written and peer-reviewed, providing reliable information.
- **Comprehensive**: Covers a wide range of medical conditions and treatments.

**Limitations:**
- **Difficult to Parse**: Documents are written for human experts and can be challenging to process with natural language processing algorithms.
- **Not Machine-Friendly**: Limited structure and design for human consumption make it harder for machines to utilize effectively.

#### Medical Ontologies

**Popular Ontologies:**
- **CPT (Current Procedural Terminology)**: For procedures.
- **RxNORM**: For drugs.
- **SNOMED CT**: General clinical terms.
- **MeSH (Medical Subject Headings)**: For literature.
- **ATC (Anatomical Therapeutic Chemical Classification System)**: For drugs.

**Properties:**
- **Machine-Readable**: Ontologies are structured as directed acyclic graphs, making them easy for machines to parse and integrate with other data sources.
- **Integration**: Facilitates integration with EHR and claims data.

**Limitations:**
- **Limited Coverage**: Constructing and updating ontologies is labor-intensive, leading to potential gaps in coverage.
- **Outdated Information**: Ontologies can become outdated due to the slow process of updating and maintaining them.

#### Clinical Trial Data

**Sources:**
- **ClinicalTrials.gov**: Contains protocols, eligibility criteria, and designs.
- **FDA Adverse Event Reporting System (FAERS)**: Collects safety reports.
- **Clinical Trial Management Systems (CTMS)**: Manages and reports trial results.

**Properties:**
- **Valuable**: Essential for drug approval and understanding treatment efficacy.
- **Heterogeneous**: Includes structured and unstructured information, often longitudinal.

**Limitations:**
- **Integration Challenges**: Difficulty in matching unstructured protocols with structured patient records.
- **Heterogeneous Sources**: The diversity of data sources and their uneven quality further complicate integration.

#### Drug-Related Data

**Sources:**
- **DrugBank**: Information about existing drugs.
- **ChEMBL**: Bioactivity database for drug discovery.
- **ZINC**: Chemical compound database for virtual screening.
- **QM9**: Quantum chemistry benchmark data.

**Properties:**
- **Standardized Formats**: Data formats are generally standardized and often freely available online.
- **Accessibility**: Many datasets are publicly accessible, supporting research and development.

**Limitations:**
- **Lack of 3D Structures**: Many datasets provide 2D representations rather than 3D structures.
- **Limited Novel Data**: Up-to-date chemical data is often proprietary and not available publicly.

### Summary
Each type of healthcare data serves unique purposes and comes with its own set of advantages and challenges. Imaging data provides detailed, high-resolution information but can be difficult to analyze due to its size and dimensionality. Medical literature offers comprehensive insights but is often challenging to parse for computational use. Medical ontologies facilitate machine-readability and integration but may suffer from limited coverage and outdated information. Clinical trial data is crucial for drug development but presents integration and data matching challenges. Finally, drug-related data supports research and discovery with standardized formats, though it often lacks up-to-date information and 3D structural data.

# Health Data Standards

## Health Data Standard: Intro & ICD

### Health Data Standards

Health data standards ensure consistency and interoperability in healthcare data, which is crucial for accurate diagnosis, treatment, and billing. Here's an overview of several important health data standards:

#### 1. **LOINC Code**
- **Definition**: Logical Observation Identifiers Names and Codes.
- **Use**: Primarily for lab tests and clinical observations.
- **Purpose**: Provides a universal code for identifying laboratory and clinical observations.

#### 2. **ICD Code**
- **Definition**: International Classification of Diseases.
- **Use**: For categorizing and coding diseases and conditions.
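- **Current Version**: ICD-10 is the version found in most coded data today (in the U.S., ICD-10-CM replaced ICD-9-CM in October 2015). WHO has since released ICD-11.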
  
**ICD-9:**
- **Structure**: 3-5 digits.
  - **First 3 Digits**: Category (e.g., 250 for diabetes).
  - **Next Digit**: Subcategory (e.g., 250.0 for diabetes without complication).
  - **Fifth Digit**: Sub-classification (e.g., 250.01 for Type 1 diabetes without complication).
  - **Supplementary Codes**: Letters like V or E for additional classifications (e.g., V85 for BMI).
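
The digit positions above can be pulled apart with a small parser. This is a format-only sketch: it assumes codes are written with a decimal point and that supplementary codes look like `V` plus two digits or `E` plus three digits, and it does not check whether a code actually exists in ICD-9.

```python
import re

# Assumed layout: 3-character category, optional one or two digits of detail.
ICD9_PATTERN = re.compile(
    r"^(?P<category>\d{3}|V\d{2}|E\d{3})(?:\.(?P<detail>\d{1,2}))?$"
)


def parse_icd9(code: str):
    """Return the parts of an ICD-9 formatted code, or None if malformed."""
    m = ICD9_PATTERN.match(code)
    if m is None:
        return None
    detail = m.group("detail") or ""
    return {
        "category": m.group("category"),
        "subcategory": detail[:1] or None,         # fourth digit
        "subclassification": detail[1:2] or None,  # fifth digit
    }


print(parse_icd9("250.01"))
print(parse_icd9("V85"))
print(parse_icd9("5A0.01"))  # None: a letter appears after the first character
```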

**ICD-10:**
- **Structure**: Up to 7 alphanumeric characters.
  - **First 3 Characters**: Disease category (e.g., E10 for Type 1 diabetes).
  - **Fourth Character**: Etiology (cause of disease).
  - **Fifth Character**: Body part.
  - **Sixth Character**: Severity or additional details.
  - **Seventh Character**: Extension (for increased specificity).
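
A similar format-only sketch works for ICD-10. It assumes the common written form (a letter, a digit, one more alphanumeric character, then an optional decimal point and up to four more characters) and groups characters four through six together instead of interpreting each one. It does not verify that a code exists.

```python
import re

# Assumed layout: 3-character category, then up to 4 more characters.
ICD10_PATTERN = re.compile(r"^[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?$")


def parse_icd10(code: str):
    """Split an ICD-10 formatted code into category, detail, and extension."""
    code = code.upper()
    if not ICD10_PATTERN.match(code):
        return None
    chars = code.replace(".", "")
    return {
        "category": chars[:3],            # characters 1-3
        "detail": chars[3:6] or None,     # characters 4-6
        "extension": chars[6:7] or None,  # character 7
    }


print(parse_icd10("E10"))
print(parse_icd10("M94.0"))
print(parse_icd10("250.01"))  # None: ICD-10 codes start with a letter
```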

**Mapping Between ICD-9 and ICD-10:**
- **General Approach**: Often one-to-many due to increased specificity in ICD-10.
  - Simple case: ICD-9 code 733.6 (Tietze syndrome) maps one-to-one to ICD-10 code M94.0.
  - Complex case: ICD-9 code 733.82 (nonunion of fracture) maps to a large number of ICD-10 codes.
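
Because one ICD-9 code can correspond to several ICD-10 codes, a crosswalk is easiest to hold as a mapping from a code to a list of targets. The table below is a toy sketch containing only code pairs that appear in these notes. The dictionary and function names are hypothetical, and a real crosswalk would be loaded from a published mapping file.

```python
# Toy crosswalk: each ICD-9 code maps to a list so one-to-many fits naturally.
ICD9_TO_ICD10 = {
    "733.6": ["M94.0"],  # Tietze syndrome, the one-to-one example above
    "487.1": ["J11.1"],  # influenza pair used in the quiz in this section
}


def map_icd9_to_icd10(code: str) -> list:
    """Return all ICD-10 targets for an ICD-9 code (empty if unknown)."""
    return ICD9_TO_ICD10.get(code, [])


def invert(crosswalk: dict) -> dict:
    """Build the reverse lookup from ICD-10 code to ICD-9 sources."""
    reverse = {}
    for source, targets in crosswalk.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse


print(map_icd9_to_icd10("733.6"))       # ['M94.0']
print(invert(ICD9_TO_ICD10)["J11.1"])   # ['487.1']
```

Returning a list even for one-to-one cases keeps calling code the same when a mapping fans out to many targets.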

#### 3. **CPT Code**
- **Definition**: Current Procedural Terminology.
- **Use**: To code medical procedures and services.
- **Purpose**: Standardizes the documentation of medical, surgical, and diagnostic services.

#### 4. **NDC Code**
- **Definition**: National Drug Code.
- **Use**: For identifying medications.
- **Purpose**: Provides a unique identifier for each drug and its formulation.

#### 5. **SNOMED CT**
- **Definition**: Systematized Nomenclature of Medicine Clinical Terms.
- **Use**: Comprehensive clinical terminology for health records.
- **Purpose**: Offers a detailed and structured vocabulary for clinical healthcare.

#### 6. **UMLS**
- **Definition**: Unified Medical Language System.
- **Use**: Integrates various health terminologies and knowledge sources.
- **Purpose**: Provides tools for understanding and processing different medical terminologies and ontologies.

### Importance of Health Data Standards

- **Insurance Claims**: Health data standards are crucial for processing insurance claims. They ensure that medical services are correctly coded and billed, which affects reimbursement and financial transactions.
- **Data Integration**: Standards facilitate the integration of data across different systems and platforms, improving interoperability and data sharing in healthcare settings.

### Quiz Questions on ICD Codes

1. **Which of the following are not ICD-9 codes?**
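   - **501**: Valid (ICD-9 code).
   - **U80.1**: Invalid (ICD-9 codes start with a digit or with the letters E or V, never U).
   - **A02.3**: Invalid (codes that start with A follow the ICD-10 format, not ICD-9).
   - **V70**: Valid (supplementary ICD-9 code).
   - **E82.0**: Treated as valid here because of its E prefix (supplementary ICD-9 code), although published E codes have three digits before the decimal point (e.g., E820.0).
   - **5A0.01**: Invalid (ICD-9 codes cannot have letters after the first character).

   **Answers**: U80.1, A02.3, and 5A0.01.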

2. **ICD Codes for Influenza:**
   - **ICD-9**: 487.1
   - **ICD-10**: J11.1

3. **ICD-10 Code for Coronavirus:**
   - **Answer**: B34.2

### Summary

Health data standards, including LOINC, ICD, CPT, NDC, SNOMED CT, and UMLS, are essential for maintaining consistency, interoperability, and accuracy in healthcare data. They facilitate everything from diagnosis and treatment to insurance billing and data integration.

## Health Data Standards: CPT, LOINC, NDC

### Overview of Health Data Standards

#### CPT Code (Current Procedural Terminology)

- **Definition**: CPT stands for Current Procedural Terminology.
- **Purpose**: Describes medical, surgical, and diagnostic services. It is used primarily for billing and reimbursement purposes.
- **Maintained by**: American Medical Association (AMA).

**Categories of CPT Codes**:
1. **Category I**: 
   - **Format**: 5 digits.
   - **Use**: Covers a broad range of procedures and services.
   - **Examples**: Evaluation and management codes (e.g., 99201-99499).

2. **Category II**: 
   - **Format**: 4 digits followed by an 'F'.
   - **Use**: Quality metrics and performance measures.
   - **Examples**: Composite measures (e.g., 0001F, a heart failure assessment composite that includes blood pressure measurement).

3. **Category III**: 
   - **Format**: 4 digits followed by a 'T'.
   - **Use**: Experimental and emerging technologies.
   - **Examples**: New procedures or technologies not yet widely accepted.
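
The three formats above are enough to tell the categories apart by shape alone. The sketch below follows only the formats listed here, so it is a simplification of real CPT rules. The function name is hypothetical, and `0042T` is used purely as a format example.

```python
import re


def cpt_category(code: str):
    """Classify a CPT code by its format; None if it fits no listed format."""
    if re.fullmatch(r"\d{5}", code):
        return "Category I"
    if re.fullmatch(r"\d{4}F", code):
        return "Category II"
    if re.fullmatch(r"\d{4}T", code):
        return "Category III"
    return None


for code in ["99201", "0001F", "0042T", "9920"]:
    print(code, cpt_category(code))
```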

**Example Quiz**:
- **Find the CPT code for a detailed office visit**:
  - **Answer**: Codes 99201-99205.

#### LOINC Code (Logical Observation Identifiers Names and Codes)

- **Definition**: LOINC is used to identify lab tests and clinical observations.
- **Created by**: Regenstrief Institute.
- **Purpose**: Standardizes the way lab results and clinical observations are recorded.

**LOINC Code Structure**:
- **Component Name**: What is being measured (e.g., Alpha 1 globulin).
- **Property**: The type of measurement (e.g., MCnc for mass concentration).
- **Time Aspect**: When the measurement is taken (e.g., Pt for point in time).
- **System (Sample Type)**: The type of sample (e.g., serum and plasma).
- **Scale**: Measurement scale (e.g., quantitative, ordinal, nominal).
- **Method**: Technique used (e.g., Electrophoresis).
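
The six parts above are conventionally written together as a colon-separated name. The sketch below assembles such a name from the example values in this list. The class and field names are hypothetical, and the short forms `Ser/Plas` and `Qn` are used on the assumption that they are the usual LOINC abbreviations for serum or plasma and quantitative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoincName:
    component: str    # what is measured
    prop: str         # kind of property, e.g. mass concentration
    time_aspect: str  # e.g. point in time
    system: str       # sample type
    scale: str        # e.g. quantitative
    method: Optional[str] = None  # omitted when the method is not specified

    def fully_specified_name(self) -> str:
        parts = [self.component, self.prop, self.time_aspect,
                 self.system, self.scale]
        if self.method:
            parts.append(self.method)
        return ":".join(parts)


name = LoincName("Alpha 1 globulin", "MCnc", "Pt", "Ser/Plas", "Qn",
                 "Electrophoresis")
print(name.fully_specified_name())
# Alpha 1 globulin:MCnc:Pt:Ser/Plas:Qn:Electrophoresis
```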

**Example Quiz**:
- **Find the LOINC code for creatinine**:
  - **Answer**: 2160-0.

#### NDC Code (National Drug Code)

- **Definition**: NDC is used to identify medications.
- **Maintained by**: Food and Drug Administration (FDA).
- **Purpose**: Tracks drugs throughout the supply chain, including manufacturing, distribution, and insurance.

**NDC Code Structure**:
1. **Labeler Code**: The company that produces the drug (4 or 5 digits).
2. **Product Code**: The specific drug product (e.g., 3105 for Prozac Capsule 20 mg).
3. **Package Code**: The packaging (e.g., 02 for a package of 100 pills).
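
A hyphenated NDC can be split into these three segments directly. The sketch below assumes the 10-digit hyphenated form with segment lengths of 4-4-2, 5-3-2, or 5-4-1. The function name is hypothetical, and the check is about format only, not whether the product exists.

```python
# Assumed segment layouts for a 10-digit hyphenated NDC.
ALLOWED_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}


def parse_ndc(ndc: str) -> dict:
    """Split an NDC into labeler, product, and package segments."""
    parts = ndc.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected three numeric segments: {ndc!r}")
    if tuple(len(p) for p in parts) not in ALLOWED_LAYOUTS:
        raise ValueError(f"unexpected segment lengths: {ndc!r}")
    return dict(zip(("labeler", "product", "package"), parts))


print(parse_ndc("0093-7214-01"))
# {'labeler': '0093', 'product': '7214', 'package': '01'}
```

Keeping the segments as strings preserves leading zeros, which would be lost if they were converted to integers.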

**Example Quiz**:
- **Find the NDC code for metformin hydrochloride 500 mg**:
  - **Answer**: 0093-7214-01.

### Summary

Health data standards like CPT, LOINC, and NDC codes play crucial roles in the healthcare industry. CPT codes ensure accurate billing and reimbursement for medical services, LOINC codes standardize lab and clinical test reporting, and NDC codes track and identify medications throughout the supply chain. Understanding and correctly using these standards is essential for effective healthcare management and administration.

## Health Data Standards: SNOMED

### SNOMED Overview

#### What is SNOMED?

- **Definition**: SNOMED stands for Systematized Nomenclature of Medicine.
- **Purpose**: It is a comprehensive medical ontology designed to capture health information and support effective clinical recording to improve patient care.
- **Maintained by**: International Health Terminology Standards Development Organisation (IHTSDO), a non-profit organization established in Denmark that now operates under the name SNOMED International.

#### SNOMED Life Cycle

1. **International Management**:
   - IHTSDO oversees international development, release, distribution, maintenance, and education for SNOMED.

2. **National Release Centers**:
   - Each member country may have its own release center and national reference set.
   - National editions may be updated at different times.

3. **Vendor Implementations**:
   - Vendors may implement subsets of the SNOMED reference set depending on regional or specific needs.

4. **Users**:
   - Clinicians, researchers, and data analysts use SNOMED for clinical documentation, semantic interoperability, clinical support, and data analysis.

#### SNOMED Structure

1. **Components**:
   - **Concepts**: Fundamental units, each with a unique machine-readable identifier.
   - **Descriptions**: Human-readable forms of concepts. There are two types:
     - **Fully Specified Names (FSN)**: Precise explanations of the concept.
     - **Synonyms**: Alternative terms or phrases used to describe the same concept.

   - **Relationships**:
     - **Is-a Relationship**: Indicates generalization from a specific concept to a more general one.
     - **Attribute Relationships**: Define attributes or properties of concepts.

**Example**:
   - **Concept ID**: 22298006
   - **Description**: Myocardial infarction (disorder) (FSN)
   - **Synonyms**: Heart attack, myocardial infarct, MI

2. **Hierarchy**:
   - **Is-a Relationship**: Establishes a hierarchical structure from specific to general concepts.
     - Example: "Cellulitis of the hand" and "Cellulitis of the foot" are subtypes of "Cellulitis".
     - "Cellulitis of the foot" also has an is-a relationship with "Disorder of foot", so a concept can have more than one parent.

   - **Attributes**:
     - Example: "Abscess of heart" has a morphology relationship with "Abscess".
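
The hierarchy and attribute examples above can be held in two small lookup tables. The sketch below uses concept names instead of real SNOMED identifiers, treats the link from "Cellulitis of foot" to "Disorder of foot" as an is-a link, and uses hypothetical table and function names throughout.

```python
# Child concept -> list of parents (is-a relationships).
IS_A = {
    "Cellulitis of hand": ["Cellulitis"],
    "Cellulitis of foot": ["Cellulitis", "Disorder of foot"],
    "Cellulitis": [],
    "Disorder of foot": [],
}

# Attribute relationships: concept -> {attribute: target concept}.
ATTRIBUTES = {
    "Abscess of heart": {"morphology": "Abscess"},
}


def ancestors(concept: str) -> set:
    """Collect every concept reachable by following is-a links upward."""
    found, stack = set(), list(IS_A.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(IS_A.get(parent, []))
    return found


print(sorted(ancestors("Cellulitis of foot")))
# ['Cellulitis', 'Disorder of foot']
print(ATTRIBUTES["Abscess of heart"]["morphology"])  # Abscess
```

Storing a list of parents per concept is what allows one concept to sit under several broader concepts at once.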

#### SNOMED Structure Summary

- **Hierarchy**: Concepts are organized hierarchically, ranging from specific findings to broader categories.
- **Relationships**:
  - **Is-a**: Hierarchical relationships.
  - **Attributes**: Describe additional properties of concepts.
- **Unique Identifier**: Each concept has a unique ID and multiple descriptions and relationships.

#### Quizzes

1. **Find the SNOMED code for chronic gouty arthritis disorder**:
   - **Answer**: 68451005

2. **What is the resulting structure of all IS-A relationships in SNOMED?**:
   - **Answer**: Directed graph without cycles (also known as a Directed Acyclic Graph or DAG).

## Health Data Standards: UMLS

### Unified Medical Language System (UMLS) Overview

#### What is UMLS?

- **Definition**: UMLS stands for Unified Medical Language System.
- **Maintained by**: US National Library of Medicine.
- **Purpose**: UMLS is a comprehensive thesaurus and ontology of biomedical concepts. It integrates multiple existing standards to provide a unified system for accessing diverse biomedical information.

#### Components of UMLS

1. **Metathesaurus**:
   - **Description**: A large, multilingual database containing over 1 million biomedical concepts from more than 100 sources.
   - **Organization**: Concepts in the Metathesaurus are organized hierarchically into four levels:
     - **Atoms**: The most granular level, each representing a unique occurrence of a term in a source vocabulary. Identified by a unique Atom Unique Identifier (AUI), starting with "A" followed by seven digits.
     - **Strings**: Each unique term or concept name maps to a String Unique Identifier (SUI), starting with "S" followed by seven digits.
     - **Lexical Unique Identifier (LUI)**: Links lexical variants of a term (e.g., singular and plural forms) detected by lexical variant generator programs. Identified by a unique LUI starting with "L" followed by seven digits.
     - **Concept Unique Identifier (CUI)**: Represents concepts that may have multiple names from different vocabularies. Identified by a unique CUI starting with "C" followed by seven digits.

   **Example**:
   - **Concept**: "Headache"
     - **Atoms**: Each occurrence of a name such as "headache" in a source vocabulary gets its own AUI.
     - **String Unique Identifiers**: Distinct strings such as "Headache", "Headaches", and "Cephalodynia" each get their own SUI.
     - **Lexical Unique Identifiers**: Lexical variants such as "Headache" and "Headaches" share one LUI, while synonyms such as "cephalalgia" have a different LUI but map to the same CUI.
     - **Concept Unique Identifier**: C0018681 for "Headache".

2. **Semantic Network**:
   - **Description**: Provides a structured organization of concepts (CUIs) into hierarchies and relationships. It adds structure to the terms in the Metathesaurus, which lacks an overall hierarchy.
   - **Components**:
     - **Semantic Types**: 135 broad categories used to classify concepts.
     - **Semantic Relationships**: 54 types of relationships between concepts, such as "is a" relationships, which define the hierarchy.

   **Example**:
   - **Hierarchy**: "Headache" is classified under the semantic type "Sign or Symptom", which sits below "Finding" in the type hierarchy.
   - **Relationships**: "Headache" might have relationships to related conditions or symptoms, providing context and connections.

3. **Specialist Lexicon**:
   - **Description**: Contains over 300,000 entries for common and biomedical terms in English. Includes information on part of speech, syntax, and morphology.
   - **Purpose**: Used with natural language processing (NLP) tools to process and analyze biomedical text.
   - **Popular Tools**:
     - **MMTx (MetaMap Transfer)**: A Java implementation of MetaMap for mapping text to UMLS concepts.
     - **MetaMap**: A tool for identifying UMLS concepts and relationships within biomedical text.
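
The Metathesaurus identifier prefixes described above make it easy to tell the four levels apart. The sketch below follows the letter-plus-seven-digits layout given in these notes. The function name is hypothetical, identifiers in practice may use more digits, and the check says nothing about whether an identifier exists in UMLS.

```python
import re

UMLS_ID_KINDS = {
    "A": "atom (AUI)",
    "S": "string (SUI)",
    "L": "lexical variant group (LUI)",
    "C": "concept (CUI)",
}


def umls_identifier_kind(identifier: str):
    """Name the Metathesaurus level of an identifier, or None if malformed."""
    m = re.fullmatch(r"([ASLC])\d{7}", identifier)
    return UMLS_ID_KINDS[m.group(1)] if m else None


print(umls_identifier_kind("C0018681"))  # concept (CUI)
print(umls_identifier_kind("X0000001"))  # None
```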

### Summary

- **UMLS** integrates various biomedical terminologies and ontologies into a unified system.
- **Metathesaurus**: Organizes concepts with unique identifiers at multiple levels (AUI, SUI, LUI, CUI).
- **Semantic Network**: Provides hierarchical and relational structure to concepts.
- **Specialist Lexicon**: Contains detailed lexical information and is used with NLP tools for text processing.

#### Quizzes

1. **What is the unique identifier for the concept "Headache" in UMLS?**
   - **Answer**: C0018681

2. **What are the three components of UMLS?**
   - **Answer**: Metathesaurus, Semantic Network, Specialist Lexicon

3. **What is the purpose of the Specialist Lexicon in UMLS?**
   - **Answer**: To provide lexical information for common and biomedical terms and to be used with NLP tools for processing biomedical text.