# Unstructured Data

## Introduction to unstructured data

**1. Unstructured Data Overview:**
   - **Types:**
     - **Text:** Rich and complex, often requiring advanced processing techniques.
     - **Images:** Includes medical imaging (e.g., X-rays, MRIs), which requires image processing techniques.
     - **Signals:** Includes physiological signals (e.g., ECG, EEG), which need signal processing techniques.
   - **Contrast with Structured Data:**
     - Structured data is organized in rectangular tables (e.g., databases).

**2. Focus on Clinical Text:**
   - **Characteristics:**
     - Contains detailed medical information (e.g., patient history, symptoms, treatments).
     - Rich in semantic information but often unstructured and complex.
   - **Utility as a Source of Features:**
     - Provides detailed insights that may not be captured in structured data.
     - Can be used to derive features for predictive models, such as identifying symptoms or treatments mentioned in patient notes.

**3. Challenges in Processing Clinical Text:**
   - **Complexity and Variability:**
     - Diverse formats and terminology.
     - Use of medical jargon, abbreviations, and non-standard language.
   - **Contextual Understanding:**
     - Understanding the context and meaning behind the text can be difficult.
   - **Data Volume:**
     - Large volumes of text data can be challenging to process and analyze.

**4. Methods for Extracting Features:**
   - **Without Understanding Content:**
     - **Natural Language Processing (NLP):** Techniques like tokenization, named entity recognition (NER), and part-of-speech tagging can be used to extract features.
     - **Text Mining:** Methods to identify patterns and relevant information from large text corpora.
     - **Term Frequency-Inverse Document Frequency (TF-IDF):** Measures the importance of words in a document relative to a corpus.
     - **Bag of Words (BoW) and N-Grams:** Techniques for representing text data by frequency of words or sequences of words.
   - **Pre-trained Models:** Use of models like BERT or GPT to extract features based on pre-trained embeddings.
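
The bag-of-words, n-gram, and TF-IDF representations listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the tokenizer is deliberately crude, the three notes are made-up examples, and the weighting uses the plain `tf * log(N / df)` form (real libraries often apply smoothing).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/digits (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: (count / sum(bag.values())) * math.log(n_docs / df[term])
         for term, count in bag.items()}
        for bag in bags
    ]


# Made-up example notes (assumption, not real patient text).
notes = [
    "pt denies chest pain",
    "chest pain radiating to left arm",
    "no fever no cough",
]
bags = [Counter(tokenize(n)) for n in notes]   # bag of words per note
bigrams = ngrams(tokenize(notes[0]), 2)        # 2-grams of the first note
weights = tf_idf(notes)                        # TF-IDF weights per note
```

In this toy corpus, "chest" appears in two of three notes and so receives a lower weight in the first note than "pt", which appears in only one.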

**5. Anonymization and De-identification:**
   - **Importance:**
     - Necessary for protecting patient privacy and complying with regulations (e.g., HIPAA).
   - **Techniques:**
     - **Anonymization:** Making the data impossible to link back to a specific person, which is difficult to guarantee in practice.
     - **De-identification:** Removing or masking the specific identifiers in the text, such as replacing names and other identifiers with pseudonyms or generalized terms.

This summary provides a foundation for understanding the key aspects of processing clinical text, including its potential, challenges, methods for feature extraction, and privacy considerations.

# Clinical Text

## What is clinical text?

### Understanding Clinical Text

**Definition:**
- **Clinical Text:**
  - Written by clinicians or healthcare providers.
  - Describes patient details, including their pathologies, personal, social, and medical histories, as well as findings from interviews and procedures.
  - Target audience includes other medical professionals rather than the general public.

**Characteristics:**
- **Different from Ordinary Natural Language:**
  - **Ordinary Text:** Found in books, articles, and scientific publications, written in grammatical English.
  - **Clinical Text:** Often not in full sentences, may use medical jargon, abbreviations, and shorthand.
  
- **Purpose and Usage:**
  - Not intended for publication or general audience.
  - Aims to document clinical encounters and provide information relevant to other medical professionals.
  
- **Example:**
  - Progress notes documenting patient visits may include:
    - Brief descriptions of patient conditions.
    - Observations made during consultations.
    - Medical procedures and outcomes.

**Processing Clinical Text:**
- **Special Techniques Required:**
  - Requires different processing approaches compared to standard natural language due to its specific structure and content.
  - Some researchers use **Natural Language Processing (NLP)** techniques, despite the challenges posed by the specialized nature of clinical text.

**Issues with NLP in Clinical Text:**
- **Challenges:**
  - Clinical text may lack grammatical completeness.
  - Use of non-standard language and terminology.
  - Requires adaptation of NLP tools to handle the unique aspects of clinical documentation.

## The value of clinical text

**1. Complementing Billing Codes:**
   - **Inaccuracy and Bias in Codes:**
     - Billing codes may be incomplete or influenced by billing practices.
   - **Enhancing Accuracy:**
     - Clinical text provides a detailed description of patient encounters, which can improve the accuracy of patient representation beyond what codes alone can offer.
   - **Example:**
     - **Urinary Incontinence After Prostate Surgery:** Clinical text identified five times more cases than diagnosis codes, highlighting its potential for capturing more nuanced information.

**2. Disease and Condition Representation:**
   - **Improved Identification:**
     - Some conditions are captured more accurately in clinical text than in codes, while for others codes work better.
     - **Examples:**
       - **Alzheimer's Disease, Multiple Sclerosis, Parkinson's Disease:** Identified more frequently through clinical notes than through diagnosis codes.
       - **Atrial Fibrillation (counterexample):** Better identified using diagnosis codes and prescribed medications than using text.

**3. Applications of Clinical Text:**
   - **Biosurveillance:**
     - Used to detect and monitor disease outbreaks by analyzing patterns in text.
   - **Improving Disease Terminology:**
     - Refines and expands the vocabulary used to refer to diseases.
   - **Clinical Decision Support:**
     - Helps in reminders and alerts for necessary procedures and treatments.
     - **Example:** 
       - **Spleen Removal:** If a splenectomy is not recorded in the problem list, the patient may miss a recommended pneumococcal vaccination. Analyzing clinical text can catch such cases.

**4. Adding Codes and Enhancing Records:**
   - **Code Addition:**
     - Clinical text helps in adding accurate codes to medical records for better querying and reporting.

**5. Facilitating Clinical Research:**
   - **Study Cohorts:**
     - Aids in constructing cohorts for research studies.
   - **Knowledge Discovery:**
     - Enables automated discovery of new knowledge by mining text for patterns.
     - **Example:**
       - Discovery of a potential treatment relationship between Raynaud's disease and fish oil through text analysis.

**Conclusion:**
Clinical text is a valuable healthcare data type due to its ability to complement billing codes, improve disease representation, support various applications in healthcare, and facilitate clinical research.

## What makes clinical text difficult to handle?

**1. Structural and Linguistic Complexity:**
   - **Ungrammatical Text:** Clinical text often lacks standard sentence structure.
   - **Misspellings and Abbreviations:** Frequent misspellings, acronyms, and abbreviations (which may have multiple interpretations) add complexity.
   - **Telegraphic Phrases:** Short, terse phrases are common and can be difficult to interpret.
   - **Varied Quality:** 
     - **Radiology Reports:** Dictated and composed deliberately for clarity.
     - **Progress Notes:** Often written for documentation, leading to variable quality.
   - **Pasted Content:** Long sets of lab results or vital signs may be copied from other parts of the medical record.
   - **Institution-Specific Patterns:** Templates used in EMR systems can result in idiosyncratic and institution-specific documentation styles.

**2. Negation:**
   - **Importance in Clinical Analysis:** 
     - Negation indicates the absence of a condition, which is crucial in clinical context (e.g., "no pneumonia" means the patient does *not* have pneumonia).
   - **Prevalence:** Roughly 40% of clinical text content involves negation, making it essential to detect and account for in analysis.

**3. Context Detection:**
   - **Temporal and Subject Context:** 
     - Clinical text may refer to past conditions or family history (e.g., "family history of heart attack" does not indicate that the patient had a heart attack).
   - **Contextual Challenges:**
     - **Patient vs. Others:** Does the condition refer to the patient or someone else (like a family member)?
     - **Time Frame:** Does the condition refer to the present or the past?
   - **Positive Mentions:** Analysts typically aim to identify **present positive mentions**, which indicate the presence of a condition in the patient at the time of writing. Negations and references to the past or others should not be counted in these mentions.

**4. Security, Privacy, and Data Access:**
   - **Privacy Concerns:** Misunderstanding and confusion around data privacy, anonymization, and de-identification have increased the difficulty of gaining access to clinical text.
   - **Increased Burden:** Fear surrounding security and privacy has led to more stringent requirements for accessing clinical text data, which can hinder research.

This summary highlights the major challenges in processing clinical text, including its structural variability, the need for handling negation and context, and issues related to security and privacy.

## Privacy and de-identification

### Handling Protected Health Information (PHI) in Text

**1. The Importance of Protecting PHI:**
   - **Risks of Leaking PHI:**
     - Can harm patients, damage the healthcare institution’s reputation, and result in large fines.
   - **Key Terms:**
     - **Anonymization:** Data cannot be linked to a specific person. However, complete anonymization is difficult to guarantee, as future data sources may enable re-linking.
     - **De-identification:** The removal of specific identifiers that constitute PHI. This approach focuses on removing or masking identifiable information.

**2. Methods for De-identification:**

   **A. Safe Harbor Method:**
   - **Overview:**
     - Requires the removal of 18 specific identifiers from the data. These include:
       - Name
       - Social Security number
       - Telephone number
       - Email address
       - Full face photographic images
       - Other identifying information (such as unique codes or characteristics).
     - The identifiers removed apply not only to the patient but also to their relatives, employers, and household members.
   - **Challenges:**
     - Depends on accurately identifying all PHI in the text. If an identifier is not recognized, it cannot be removed.

   **B. Statistical Method:**
   - **Definition:**
     - A statistician certifies that the statistical risk of re-identifying individuals is very small.
   - **Re-Identification Risk:**
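     - HIPAA does not put a number on "very small." A commonly cited hint is the "zip code rule" from the Safe Harbor provisions: the first three digits of a zip code may be revealed only if the area they cover contains more than 20,000 people.
     - By analogy, the re-identification risk should ideally be less than 1 in 20,000.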
   - **Applications:**
     - Commonly used in large insurance claims databases.
   - **More Information:** Detailed on the US Department of Health and Human Services (HHS) website.
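
The zip code rule mentioned above can be sketched as a small generalization function. This is an illustration under stated assumptions: `population_by_zip3` is a hypothetical lookup table that the caller must supply (for example from census data), and areas at or below the threshold are replaced with `"000"`.

```python
def generalize_zip(zip_code, population_by_zip3, threshold=20_000):
    """Keep only the first three digits of a zip code, and only when the
    three-digit area has more than `threshold` residents; otherwise
    return the placeholder "000".

    `population_by_zip3` is a hypothetical {zip3: population} table;
    unknown areas are treated as too small to reveal.
    """
    zip3 = zip_code[:3]
    if population_by_zip3.get(zip3, 0) > threshold:
        return zip3
    return "000"


# Made-up population figures, for illustration only.
populations = {"943": 500_000, "036": 15_000}

large_area = generalize_zip("94305", populations)
small_area = generalize_zip("03601", populations)
unknown_area = generalize_zip("99999", populations)
```

Treating unknown areas as small is a conservative design choice: a missing entry leads to less disclosure rather than more.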

   **C. Hiding in Plain Sight:**
   - **Method Overview:**
     - Acknowledges that both manual and automated de-identification processes may miss some PHI identifiers.
   - **Approach:**
     - Replaces identifiable information with realistic, synthetic surrogate data that looks plausible but is not real.
   - **Benefit:**
     - Even if some PHI items are missed, they blend in with the synthetic information, making it difficult to distinguish the real PHI from the surrogates.
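
A minimal sketch of the surrogate-replacement idea follows. It assumes a hypothetical list of already-detected names and a single phone number pattern; the surrogate names and the `555-` phone format are invented for illustration, and a real system would need far broader detectors.

```python
import random
import re

# Invented surrogate names (assumption); none of them is a real patient.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Rivera"]


def surrogate_phone(rng):
    """Generate a plausible-looking but fake phone number."""
    return f"555-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"


def hide_in_plain_sight(text, known_names, seed=0):
    """Replace detected names and phone numbers with realistic surrogates.

    `known_names` stands in for the output of a name detector; anything
    the detector missed is left in the text untouched.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    for name in known_names:
        text = re.sub(re.escape(name),
                      lambda m: rng.choice(SURROGATE_NAMES), text)
    text = re.sub(r"\b\d{3}-\d{3}-\d{4}\b",
                  lambda m: surrogate_phone(rng), text)
    return text


note = "Seen by Dr. Jane Smith; call 650-123-4567. Daughter Mary Jones present."
# The detector found "Jane Smith" but missed "Mary Jones".
masked = hide_in_plain_sight(note, known_names=["Jane Smith"])
```

After replacement, the missed name "Mary Jones" sits next to a surrogate name, so a reader cannot tell which of the two is real.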

**Conclusion:**
- Protecting PHI in clinical text requires careful consideration, with multiple methods available for de-identification. Each method has its own strengths and challenges, and the choice of method depends on the context and requirements of the data.

## A primer on Natural Language Processing

### Natural Language Processing (NLP) for Clinical Text

**Overview:**
- NLP is a branch of artificial intelligence focused on enabling computers to process human language.
- Though NLP is not always required for processing clinical text, understanding NLP provides a useful contrast to non-NLP methods.

**Steps in an NLP System:**

1. **Tokenization:**
   - Purpose: Identifying individual words and determining where sentences start and end.
   
2. **Parsing:**
   - Purpose: Determining the grammatical structure of each sentence.
   - Tags each word with its **part of speech** (e.g., noun, verb).

3. **Named Entity Recognition (NER):**
   - Identifies specific words or phrases of interest.
   - Assigns appropriate **category labels** (e.g., diseases, medical procedures).
   
4. **Section Detection:**
   - Purpose: Identifies boundaries between different sections of the document.
   - For clinical text, marking sections like **past medical history** is crucial. 
   - This step helps distinguish between a patient's history of a disease and their current condition.
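
The tokenization and section detection steps above can be sketched with regular expressions. This is a simplified illustration, not how any particular NLP toolkit works: the sentence splitter is naive (it would break on abbreviations such as "Dr."), and the section rule assumes headings are short phrases that end at the first colon on a line.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def tokenize(sentence):
    """Words (allowing internal hyphens/apostrophes) or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)


def split_sections(note):
    """Treat a short phrase before the first colon on a line as a section heading."""
    sections, current = {}, "PREAMBLE"
    for line in note.splitlines():
        heading, sep, rest = line.partition(":")
        if sep and heading.strip() and len(heading.split()) <= 4:
            current = heading.strip().upper()
            line = rest
        sections.setdefault(current, [])
        if line.strip():
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}


# Made-up note (assumption) with two sections.
note = """Past medical history: MI in 2015. Hypertension.
Assessment: No pneumonia on chest x-ray."""

sections = split_sections(note)
history_sentences = split_sentences(sections["PAST MEDICAL HISTORY"])
tokens = tokenize(sections["ASSESSMENT"])
```

Knowing that "MI in 2015" sits under the past medical history heading is what allows a later step to treat it as history rather than a current condition.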

**Adapting NLP for Clinical Text:**
- **Negation and Context Detection:** Clinical text requires specific adaptations to handle the unique challenges posed by negations (e.g., "no pneumonia") and context (e.g., family history vs. current patient condition).
  
In summary, while NLP tools can be used for clinical text, they require modifications to address the complexities of medical language, including the detection of negation and context.

## Practical approach to processing clinical text

**1. Goal:**  
- The goal is not to understand the full contents of the document but to identify **features** for downstream analysis that answer clinical research questions.

**2. Dealing with PHI:**
- **Two main approaches**:
  1. **Remove PHI**: Locate and remove protected health information (PHI) from the text.
  2. **Keep medical terms**: Extract only the medical terms found in a **knowledge graph** and discard everything else. This largely avoids PHI, since names and other identifiers are generally not in the medical dictionary.

**3. Knowledge Graphs for Processing:**
- Knowledge graphs help with:
  - **Identifying ambiguous terms**: For example, "MG" could mean **magnesium** (a drug) or **myasthenia gravis** (a disease). These ambiguous terms can be flagged and removed.
  - **Expanding terms**: By analyzing word patterns (like **Hearst patterns**), you can find similar terms to expand your dictionary (e.g., expanding 14 terms for "discomfort" to 31 terms).
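
As a rough illustration of term expansion, the sketch below implements one Hearst pattern ("X such as A, B and C") with a regular expression. It is a simplification: only a single-word hypernym directly before "such as" is captured, and the example sentence and seed dictionary are made up.

```python
import re

# One Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>"
SUCH_AS = re.compile(r"(\w+),? such as ([\w ,]+)", re.IGNORECASE)
# Split the hyponym list on commas and on "and"/"or" (including ", and").
LIST_SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def hearst_such_as(sentence):
    """Return (hypernym, hyponym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1).lower()
        for hyponym in LIST_SEPARATOR.split(match.group(2)):
            hyponym = hyponym.strip().lower()
            if hyponym:
                pairs.append((hypernym, hyponym))
    return pairs


# Made-up sentence and seed dictionary (assumptions).
sentence = "Patient reports discomfort such as soreness, aching and tightness."
pairs = hearst_such_as(sentence)

discomfort_terms = {"discomfort", "soreness"}
discomfort_terms |= {hypo for hyper, hypo in pairs if hyper == "discomfort"}
```

Candidate terms found this way would still need review before being added to a dictionary, since the pattern also fires on unrelated uses of "such as".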

**4. Handling Negation and Context:**
- Tools like **NegEx** and **ConText** detect negation (e.g., "no pneumonia") and context (e.g., "family history of heart attack").
- **Section headings** in clinical notes can also help distinguish sections, such as "medical/surgical history," to improve context detection. Section headings can be identified using formatting conventions like the first colon on a line.
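
The following sketch shows the general idea behind trigger-based negation and context detection. It is not the NegEx or ConText implementation: the trigger lists are tiny hypothetical samples, and the rule simply checks whether a trigger phrase appears within a few tokens before the term.

```python
import re

# Tiny hypothetical trigger lists; the real tools ship much larger ones.
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for", "no evidence of"]
FAMILY_TRIGGERS = ["family history of", "mother had", "father had"]
HISTORY_TRIGGERS = ["history of", "h/o", "status post"]


def _preceded_by(triggers, left_context, window=5):
    """True if any trigger occurs within the last `window` tokens before the term."""
    recent = " ".join(left_context.lower().split()[-window:])
    return any(
        re.search(rf"(?<![\w/]){re.escape(t)}(?![\w/])", recent)
        for t in triggers
    )


def classify_mention(sentence, term):
    """Label the first mention of `term` in `sentence`, or return None if absent."""
    idx = sentence.lower().find(term.lower())
    if idx < 0:
        return None
    left = sentence[:idx]
    if _preceded_by(FAMILY_TRIGGERS, left):
        return "other_person"
    if _preceded_by(NEGATION_TRIGGERS, left):
        return "negated"
    if _preceded_by(HISTORY_TRIGGERS, left):
        return "historical"
    return "present_positive"


labels = [
    classify_mention("No pneumonia on chest x-ray.", "pneumonia"),
    classify_mention("Family history of heart attack.", "heart attack"),
    classify_mention("History of stroke in 2010.", "stroke"),
    classify_mention("Patient has pneumonia.", "pneumonia"),
]
```

Family triggers are checked first so that "family history of" is not mistaken for the patient's own history.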

**5. Preprocessing Clinical Text:**
- **Preprocess** the entire set of clinical documents once, storing the processed results for future use.
- The preprocessing step aims to detect **indexed, positive, present mentions** of medical terms (e.g., diseases, drugs, procedures) using a **knowledge graph**:
  - **Indexed**: Records where and which strings were mentioned.
  - **Positive**: Negations are removed.
  - **Present**: Mentions that refer to the past or to someone else (e.g., family history) are excluded.
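
The "indexed" part of this step can be sketched as a dictionary lookup that records which term was found and where. The `TERM_TO_CONCEPT` table is a hypothetical miniature stand-in for a knowledge graph; filtering for positive and present mentions would be applied to the resulting records afterwards.

```python
import re

# Hypothetical miniature stand-in for a knowledge graph: string -> concept.
TERM_TO_CONCEPT = {
    "pneumonia": "Pneumonia",
    "heart attack": "Myocardial infarction",
    "myocardial infarction": "Myocardial infarction",
    "metformin": "Metformin",
}


def index_mentions(note_id, text):
    """Return one record per dictionary term found, with character offsets."""
    mentions = []
    lowered = text.lower()
    for term, concept in TERM_TO_CONCEPT.items():
        for m in re.finditer(rf"\b{re.escape(term)}\b", lowered):
            mentions.append({
                "note_id": note_id,
                "concept": concept,
                "term": term,
                "start": m.start(),
                "end": m.end(),
            })
    return sorted(mentions, key=lambda rec: rec["start"])


# Made-up note text (assumption).
found = index_mentions("n1", "Started metformin. No pneumonia.")
```

Because the index is computed once and stored, later research questions can be answered by querying these records instead of re-reading the notes.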

**6. Answering Clinical Research Questions:**
- Use the **knowledge graph** to identify relevant terms and their synonyms for your query.
- Address **ambiguities** (e.g., the word "stroke" might refer to either cerebrovascular stroke or heatstroke). Contextual clues like **age** and **season** can help resolve this.
  
**7. Constructing Patient-Feature Matrices:**
- **Temporal information** is crucial in distinguishing cases (e.g., if a drug is mentioned **before** or **after** an outcome).
- Count the **present, positive mentions** for a patient.
- Aggregate these features and use **temporal filters** to construct a **patient-feature matrix** for statistical analysis.
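
A minimal sketch of this construction, assuming the mentions have already been filtered to present, positive ones. The rows and outcome dates are made up, and the temporal filter shown (keep only mentions dated before the patient's outcome) is one possible choice.

```python
from collections import defaultdict
from datetime import date

# Made-up present, positive mentions: (patient_id, concept, note_date).
mentions = [
    ("p1", "Metformin", date(2020, 1, 5)),
    ("p1", "Metformin", date(2020, 3, 1)),
    ("p1", "Pneumonia", date(2020, 6, 1)),
    ("p2", "Pneumonia", date(2020, 2, 1)),
]
# Made-up outcome date per patient.
outcome_dates = {"p1": date(2020, 4, 1), "p2": date(2020, 5, 1)}


def build_matrix(mentions, outcome_dates):
    """Count mentions per patient and concept, keeping only those before the outcome."""
    counts = defaultdict(lambda: defaultdict(int))
    for patient, concept, when in mentions:
        if when < outcome_dates[patient]:  # temporal filter
            counts[patient][concept] += 1
    features = sorted({c for row in counts.values() for c in row})
    patients = sorted(outcome_dates)
    matrix = [[counts[p][f] for f in features] for p in patients]
    return patients, features, matrix


patients, features, matrix = build_matrix(mentions, outcome_dates)
```

Here the pneumonia mention for `p1` is dated after that patient's outcome, so it is excluded from the matrix.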

**8. Applications of Processed Clinical Text:**
- This method helps in constructing study cohorts and answering clinical research questions while efficiently handling clinical text without needing full document comprehension.

This approach ensures effective feature extraction, while addressing key challenges in processing clinical text like **PHI**, **ambiguous terms**, **negation**, and **context**.

# Images

## Overview and goals of medical imaging

Medical images are essential because they provide detailed insights into the anatomy and physiology of the human body, often without requiring invasive procedures. Here's why they are important:

### 1. **Diagnosis:**
   - **Detecting Diseases**: Medical images, such as X-rays or CT scans, can reveal lesions in the lungs due to tuberculosis or other diseases.
   - **Internal Bleeding**: They can show signs of internal bleeding or trauma.
   - **Identifying Foreign Bodies**: Images can locate foreign objects like bullets or surgical tools accidentally left in the body after procedures.
   - **Tumor Detection**: Early detection of tumors in various parts of the body is crucial for timely treatment.

### 2. **Disease Staging and Treatment Response:**
   - **Cancer Monitoring**: Medical imaging is essential for staging cancers by assessing the size and spread of tumors. This information helps determine appropriate treatments.
   - **Treatment Effectiveness**: Images are used to monitor how well a treatment is working, such as checking whether a tumor has shrunk or grown over time.

### 3. **Guiding Surgical Interventions:**
   - **Pre-surgical Planning**: Surgeons use medical images to locate tumors or foreign objects and to understand the surrounding anatomy.
   - **Assessing Proximity**: Understanding what vital structures (such as blood vessels or organs) are near the target area is critical for safe and successful surgery.
   - **Internal Bleeding Detection**: Images can also guide surgeons to areas of internal bleeding, helping them plan interventions more effectively.

In summary, medical images are vital for accurate diagnosis, staging diseases, monitoring treatment progress, and guiding surgical procedures, making them indispensable in modern medicine.

## Why are images important?

Medical images are vital for capturing detailed anatomical and physiological data, which serve several key purposes: 

1. **Diagnosis**: X-rays, CT scans, and MRIs can reveal conditions like infections, internal bleeding, tumors, or foreign objects in the body.
2. **Disease Staging and Treatment Response**: Especially for cancers, medical imaging helps track tumor size and spread, guiding appropriate treatment plans and assessing therapy effectiveness.
3. **Guiding Surgical Interventions**: Images assist surgeons in locating tumors or foreign bodies and identifying structures to avoid, as well as detecting internal bleeding.

However, for this course, it’s crucial to understand that these images are large and need interpretation—either by humans or machine learning models—to convert the low-level visual data into useful features for prediction and analysis tasks.

## What are images?

Medical images are created by converting physical energy (such as visible light, X-rays, ultrasonic sound waves, or magnetic resonance) into data, typically a two-dimensional rectangular array of signal intensities. A third dimension can represent either time (as in a video) or space (as in volumetric images like CT scans). Medical images take up significant storage and are often managed in dedicated archiving systems separate from electronic medical records (EMRs).
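
To make the array view and the storage cost concrete, the sketch below represents a tiny image as nested lists and estimates the uncompressed size of two larger ones. The dimensions and the 16-bit depth are illustrative assumptions, not properties of any specific scanner.

```python
import math

# A tiny 2 x 3 grayscale "image": each number is a signal intensity.
image = [
    [0, 120, 255],
    [30, 200, 90],
]
rows, cols = len(image), len(image[0])


def image_size_bytes(shape, bits_per_value):
    """Uncompressed size of an intensity array with the given shape."""
    return math.prod(shape) * bits_per_value // 8


# Assumed, illustrative dimensions at 16 bits per value.
xray_bytes = image_size_bytes((2048, 2048), 16)      # one 2D image
ct_bytes = image_size_bytes((512, 512, 300), 16)     # 300-slice volume

xray_mib = xray_bytes / 2**20
ct_mib = ct_bytes / 2**20
```

Under these assumptions a single volume is already on the order of a hundred megabytes, which is why images are typically kept in dedicated archives.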

### **Categories of Medical Images**
1. **Imaging Modality** (based on the type of physical energy):
   - **Visible Light**: Used in photographs of the retina, skin, and microscopic views of cells/tissues. Endoscope videos also fall under this category.
   - **X-rays**: High-frequency electromagnetic radiation is absorbed differently by various tissues, bones, and fluids. X-ray images can be two-dimensional (e.g., a chest X-ray) or three-dimensional, as in CT (Computed Tomography) scans, which reconstruct body volumes from multiple projections. A key disadvantage of X-rays and CT scans is the use of ionizing radiation, which can cause tissue damage.
   - **Ultrasound**: Uses high-frequency sound waves to visualize tissues and moving structures like the heart. It has the advantage of being non-invasive and harmless to tissues, making it popular for monitoring pregnancies.
   - **Magnetic Resonance (MR)**: Measures the magnetic properties of atomic nuclei under electromagnetic pulses. This technique can reveal tissue density or locate specific chemical tracers. Like ultrasound, magnetic resonance imaging (MRI) does not cause tissue damage.

2. **Structural vs. Functional**:
   - **Structural Imaging**: Provides a static view of the spatial arrangement of anatomical structures, such as bones or organs.
   - **Functional Imaging**: Captures dynamic processes, such as the beating of the heart, blood flow, or physiological activities (e.g., functional MRI, which shows brain activity).

These images are crucial for diagnosis, treatment planning, and guiding interventions, but their large size requires significant computational resources for analysis, whether by human experts or machine learning algorithms.

## A typical image management process

The life cycle of radiology images, which are the most commonly accessible for data mining, includes several key stages:

1. **Image Acquisition**: 
   - A camera, scanner, or sensor captures physical energy (e.g., X-rays) and converts it into a digital signal to produce the medical image.

2. **Image Storage and Management**:
   - The images are transferred from the acquisition device to a storage system. This system organizes the images and stores metadata, including the patient identifier, time and location of acquisition, and details about the equipment used.
   - The **DICOM** (Digital Imaging and Communications in Medicine) standard is widely used to manage image storage and transmission. It supports all imaging modalities and ensures standardization of image formats.
   - Images are typically stored in **PACS** (Picture Archiving and Communication Systems), which handle large image files and allow for compression to reduce storage space.

3. **Processing and Interpretation**:
   - Traditionally, radiologists or specialists interpret the content of medical images and produce a narrative report detailing the clinically relevant features.
   - This report is stored in the **EMR** (Electronic Medical Record).
   - Increasingly, machine learning, particularly **Deep Learning**, is being applied to automate routine image analysis tasks, identify important features, and even generate expert-level textual descriptions of the images.

Deep learning techniques have shown remarkable success in automatically identifying features from raw image data, helping radiologists and pathologists process images more efficiently.

# Signals

## Overview of biomedical signals

Biomedical signals are generated by equipment that measures physiological parameters and converts them into electrical signals, usually in the form of a voltage. This voltage is then digitized for computer analysis, enabling the processing and interpretation of the data. Common examples of these signals include:

- **Heart Rate**: Measures the number of heartbeats per minute.
- **Oxygen Saturation**: The percentage of hemoglobin in the blood that is carrying oxygen.
- **Electrocardiogram (ECG or EKG)**: Records the electrical activity of the heart over time, providing insight into heart rhythms and function.

These signals are essential for monitoring and diagnosing various medical conditions in real-time.

## Why are signals important?

Signals in healthcare, such as those captured by wearables, offer a unique opportunity for continuous, real-time monitoring of physiological states. This data can provide insights that are often unavailable from other health records, helping to create novel digital markers for health or disease predictions. These advancements, driven by improvements in sensor technology and the increasing affordability of devices, are shaping how health monitoring and prediction are integrated into both clinical practice and personal health management.

## What are signals?

Signals are sequences of measured values taken at regular intervals, determined by the sensor's sampling frequency. They provide crucial real-time data in contexts where continuous monitoring is vital, such as in intensive care units or with wearable devices like heart rate trackers.

The main research goals for using signals are akin to those for images: automatic feature detection and interpretation. For instance, some EKG machines and smartwatches can automatically analyze heart rhythm features and offer preliminary interpretations, such as detecting atrial fibrillation. The field of signal processing, while significant, is not the focus of this course, but it's worth noting that advancements in this area aim to improve automatic analysis and interpretation of physiological data.
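
As a simple illustration of sampling and feature extraction, the sketch below derives sample timestamps from a sampling frequency and computes two basic features from heartbeat times. The beat times are made up, and the irregularity measure (coefficient of variation of beat-to-beat intervals) is only a toy indicator, not a validated method for detecting atrial fibrillation.

```python
from statistics import mean, pstdev


def sample_times(n_samples, sampling_hz):
    """Timestamps (seconds) of samples taken at a fixed sampling frequency."""
    return [i / sampling_hz for i in range(n_samples)]


def beat_intervals(beat_times):
    """Time between consecutive heartbeats, in seconds."""
    return [b - a for a, b in zip(beat_times, beat_times[1:])]


def heart_rate_bpm(beat_times):
    """Average heart rate in beats per minute."""
    return 60.0 / mean(beat_intervals(beat_times))


def interval_variability(beat_times):
    """Coefficient of variation of beat intervals (0 means perfectly regular)."""
    intervals = beat_intervals(beat_times)
    return pstdev(intervals) / mean(intervals)


# Made-up beat times in seconds (assumption).
regular = [0.0, 0.8, 1.6, 2.4, 3.2]
irregular = [0.0, 0.5, 1.6, 2.0, 3.2]

times = sample_times(3, 250)   # three samples at 250 Hz
regular_bpm = heart_rate_bpm(regular)
regular_cv = interval_variability(regular)
irregular_cv = interval_variability(irregular)
```

Both example rhythms have the same average rate, which shows why a rate alone cannot distinguish a regular rhythm from an irregular one.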

## What are the major issues with using signals?

Using signals, especially from wearable devices, presents several major issues:

1. **Proprietary Algorithms**: Many commercial devices use proprietary algorithms for feature detection and interpretation. The inner workings of these algorithms are often not transparent, making it difficult to assess their accuracy and reliability.

2. **Accuracy and Reliability**: Raw measurements from devices, such as acceleration in different directions, need to be converted into higher-level features. This conversion can be imprecise. For instance, distinguishing between different activities like sleep, sitting, standing, walking, and running can vary in reliability. Similarly, sleep trackers might not always accurately measure sleep duration or stages, leading to potential inaccuracies.

3. **Variability in Measurements**: There can be significant variability between different sensors or even within the same device. For example, step counts from two different sensors on the same person can differ by up to 30%, which can lead to misleading health assessments.

4. **Potential for Misinterpretation**: Inaccurate readings can lead to false reassurances or unnecessary panic. For example, a device might suggest a health issue that turns out to be a false positive or may not detect a real health problem.

5. **Quantified Self Movement**: This movement promotes self-monitoring for health and productivity improvements. While it has led to valuable innovations and apps, it's not yet clear how much continuous signal collection and analysis actually benefit health. The true value of these devices is still being researched.

6. **Privacy Concerns**: Collecting and sharing personal data through wearables raises privacy issues. While data collection is often voluntary, there is a risk that individuals may feel pressured to share their data, leading to potential privacy invasions.

These issues highlight the need for careful consideration when using and interpreting data from wearable devices and other signal-based technologies.