# Creating features to analyze

## Turning clinical data into something you can analyze

The unit of analysis is the basic entity on which measurements are taken. For clinical data it is usually the individual patient: each row in the patient feature matrix represents a single patient, and each column represents a clinical feature or measurement for that patient, such as demographic details, lab results, diagnoses, and treatments.

The key considerations for constructing this matrix include:

1. **Feature Selection**: Deciding which clinical variables to include—such as comorbidities, vital signs, medications, lab values, or outcomes—is critical. Feature selection often depends on the clinical question. If there are too many features, dimensionality reduction methods, like principal component analysis (PCA) or feature selection algorithms, can help reduce complexity.

2. **Handling Missing Data**: Clinical data frequently has missing values. Strategies such as imputation, where missing values are estimated using statistical methods (e.g., mean, median, or predictive modeling), or more sophisticated methods like multiple imputation, are often used to address this issue.

3. **Normalization and Scaling**: Since clinical data may come in different units (e.g., weight in kg, glucose levels in mg/dL), standardization or normalization ensures that all features are on a comparable scale, reducing bias in models that depend on distance metrics (like k-NN or SVMs).

4. **Impact of Feature Transformation on Analysis**: The way data is transformed (e.g., through log transformations, binning, or one-hot encoding) can directly affect the outcome of the analysis. Certain transformations may emphasize different patterns or associations that answer clinical questions more accurately.

In terms of using curated biomedical knowledge, **Knowledge Graphs** and **ontologies** can be essential tools for feature extraction. They can help:

- **Map clinical data to standard concepts**: These graphs provide a structured way to categorize conditions, treatments, and outcomes. For example, different clinical codes (ICD-10, SNOMED) can be linked to broader concepts in a Knowledge Graph.
- **Support feature generation**: They can help derive new features by linking symptoms or diagnoses to higher-level disease categories or treatment pathways, enabling the use of clinically meaningful groupings.


## Defining the unit of analysis

#### **1. Unit of Observation and Analysis**
- **Patient (default unit)**: Most clinical studies focus on the patient as the unit of analysis, meaning each row in the data matrix represents one patient. 
- **Other units**: Depending on the clinical question, different units might be more appropriate. For example:
   - **Drug-Disease Pairs**: For research on off-label medication use, rows could represent pairs of drugs and diseases, with features capturing relationships such as:
     - Frequency of drug and disease co-occurrence.
     - Temporal proximity of drug mention to diagnosis.
   - **Events**: For temporal analyses, each row could represent a specific clinical event (e.g., hospital admission, medication prescription).
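
As a minimal sketch of a non-patient unit of analysis, the block below builds rows keyed by drug-disease pair and counts co-occurrence. The patient records and the helper name `cooccurrence_counts` are made up for illustration; temporal proximity is not modeled here.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records: each patient has a set of drugs and a set of diagnoses.
patients = {
    "p1": {"drugs": {"metformin"}, "diseases": {"type 2 diabetes"}},
    "p2": {"drugs": {"metformin", "atorvastatin"},
           "diseases": {"type 2 diabetes", "hyperlipidemia"}},
    "p3": {"drugs": {"atorvastatin"}, "diseases": {"hyperlipidemia"}},
}

def cooccurrence_counts(records):
    """Count, for each (drug, disease) pair, how many patients have both."""
    counts = Counter()
    for rec in records.values():
        for pair in product(sorted(rec["drugs"]), sorted(rec["diseases"])):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(patients)
for (drug, disease), n in sorted(pair_counts.items()):
    print(f"{drug:<13} {disease:<16} {n}")
```

Each key of `pair_counts` would become one row of a drug-disease matrix, with the count as one of its feature columns.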

---

#### **2. Feature Selection**
- **Types of features**:
   - **Demographics**: Age, sex, ethnicity.
   - **Clinical measurements**: Lab values, vital signs.
   - **Medications**: Drug names, dosages, duration.
   - **Diagnoses**: Disease codes (ICD-10, SNOMED), comorbidities.
   - **Outcomes**: Mortality, hospitalization, adverse events.
- **Dimensionality reduction**:
   - **Too many features** can overwhelm analysis and models. Use:
     - **Principal Component Analysis (PCA)**: Reduces feature space while retaining variability.
     - **Feature selection algorithms**: Use techniques like Lasso (L1 regularization) or tree-based feature selection (e.g., Random Forest) to retain only the most important features.

---

#### **3. Handling Missing Data**
- **Nature of missing data**: Often occurs due to incomplete records, data entry errors, or varying follow-up times.
- **Imputation strategies**:
   - **Mean/median imputation**: Replace missing values with average values.
   - **Predictive imputation**: Use models to predict missing values based on other features.
   - **Multiple imputation**: Create several datasets with different imputed values, analyze each one, and pool the results to account for the uncertainty in the imputed values.

---

#### **4. Data Transformation and Normalization**
- **Normalization**: Ensures all features have the same scale (important for models that depend on distance metrics). Techniques include:
   - **Min-Max scaling**: Scales data between 0 and 1.
   - **Z-score normalization**: Centers data around a mean of 0 with a standard deviation of 1.
- **Transformation methods**:
   - **Log transformation**: Compresses large differences between high values, useful for skewed data.
   - **Binning**: Group continuous variables into discrete categories (e.g., age groups).
   - **One-hot encoding**: Converts categorical variables into binary columns (e.g., yes/no for comorbidities).
- **Impact of transformation**: Data transformations can influence the analysis outcome. Some models are more sensitive to certain transformations, so consistency and interpretation must be considered.
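
The three transformation methods listed above can be sketched in plain Python as follows. The age cut points and category names are illustrative assumptions, and `log1p` (log of 1 + x) is used instead of a plain log so that zero values stay defined.

```python
import math

def log_transform(value):
    """log(1 + x): compresses large values and is defined at x = 0."""
    return math.log1p(value)

def bin_age(age):
    """Map age in years to a coarse group; the cut points are illustrative."""
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"

def one_hot(value, categories):
    """Return one 0/1 column per category."""
    return {f"is_{c}": int(value == c) for c in categories}

print(log_transform(0), round(log_transform(999), 3))
print(bin_age(42))
print(one_hot("female", ["female", "male", "unknown"]))
```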

---

#### **5. Role of Curated Biomedical Knowledge (Knowledge Graphs & Ontologies)**
- **Knowledge Graphs**: Structured frameworks that connect clinical terms, helping to organize and classify features.
   - **ICD-10** and **SNOMED** codes can be mapped to higher-level disease categories, improving interpretability.
- **Ontology-based feature generation**:
   - **Grouping concepts**: Conditions, symptoms, and treatments can be grouped under broader categories for simplified analysis (e.g., hypertension and hyperlipidemia under cardiovascular diseases).
   - **Hierarchical relationships**: Use ontologies to connect related conditions, symptoms, and treatments.

---

### **Important Notes for Exam**
- **Unit of observation**: Most studies use the patient, but alternative units (e.g., drug-disease pairs) should be considered based on the research question.
- **Feature selection**: Choose clinically relevant features; reduce dimensionality when necessary.
- **Missing data**: Common in clinical datasets; handle using imputation techniques.
- **Data transformation**: Apply normalization and encoding based on the type of analysis.
- **Knowledge Graphs**: Essential for mapping clinical data to standard concepts and extracting meaningful features.

## Using features and the presence of features

#### **1. Feature Inclusion**
- **Start with all features**: Begin by including all available features. Many modern machine learning methods (e.g., regularized regression or tree ensembles) can down-weight or ignore features that do not contribute to model accuracy.
- **Handling constraints**: If computing resources are limited, consider reducing the number of features based on computational needs.
- **Sensitive features**: In some cases, sensitive information like HIV status may need to be removed to protect patient privacy.

#### **2. Implicit Features & Metadata**
- **Metadata**: Data that refers to other data can be very informative. For example:
   - **Lab test orders vs. results**: The number of times a test (e.g., glucose test) is ordered can hint at a patient’s condition (e.g., diabetes) even if the results are unavailable.
   - **Indicator variables**: Use metadata like the occurrence of a test as an indicator of patient conditions (e.g., diabetic status based on frequent glucose-related tests).
   - **Example**: A patient having 80% of their lab tests related to glucose measurement might suggest they are diabetic. This feature is derived from the **count of lab test orders** rather than the test results.
- **Feature engineering**: Many features are crafted using **prior knowledge** of the disease and its indicators. For example:
   - Understanding that diabetes is related to glucose levels helps in crafting features related to glucose test frequency.
   - Features can also be **learned computationally** through advanced machine learning methods.
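
A rough sketch of the metadata idea above: the feature is derived from which tests were ordered, not from their results. The order log, the `GLUCOSE_RELATED` grouping, and the 0.8 cutoff (taken from the 80% example) are assumptions for illustration, not a validated rule.

```python
# Hypothetical lab-order log: (patient_id, test_name). No result values are needed.
lab_orders = [
    ("p1", "glucose"), ("p1", "hba1c"), ("p1", "glucose"),
    ("p1", "glucose"), ("p1", "sodium"),
    ("p2", "sodium"), ("p2", "creatinine"),
]
GLUCOSE_RELATED = {"glucose", "hba1c"}  # assumed grouping of glucose-related tests

def glucose_order_fraction(orders, patient_id):
    """Fraction of a patient's lab orders that are glucose-related."""
    tests = [test for pid, test in orders if pid == patient_id]
    if not tests:
        return 0.0
    return sum(test in GLUCOSE_RELATED for test in tests) / len(tests)

features = {}
for pid in ("p1", "p2"):
    fraction = glucose_order_fraction(lab_orders, pid)
    features[pid] = {
        "glucose_order_fraction": fraction,
        "likely_diabetic": int(fraction >= 0.8),  # illustrative cutoff
    }
print(features)
```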

#### **3. Feature Engineering in Practice**
- **Modern applications**: Among risk prediction models examined in a 2017 study, the median number of features used was 27, with a median sample size of 12,000. This suggests that real-world models often use far fewer features than big data discussions imply.
- **Creating advanced feature sets**: You can move beyond standard models by incorporating subtle metadata and implicit indicators to improve the predictive power of your analysis-ready datasets.


## How to create features from structured sources

#### **1. Structured Data in Healthcare**
- **Common Data Sources**: Healthcare data is often drawn from structured sources such as databases and tables.
- **Main Steps in Working with Structured Data**:
  - **Accessing Data**: Use SQL (Structured Query Language) to query databases and load results into programming languages like Python for analysis.
  - **Standardizing Features**: Ensure consistency in how features (e.g., lab results, diagnoses) are represented, using uniform formats and units.
  - **Handling Too Many Features**: If the number of features becomes overwhelming, reduce dimensionality through feature selection techniques.
  - **Dealing with Missing Data**: Address missing values using imputation or other methods.
  - **Constructing New Features**: Create new features using metadata or derived information (e.g., test order counts).

#### **2. Database Operations**
- **Joins**: Data from different tables can be linked using a "join" operation in SQL, aligning the data based on a unique identifier (e.g., patient ID).

#### **3. Data Reshaping**
- **Reshaping**: Data may need to be transformed or reshaped into a usable format for analysis, particularly when combining data from multiple sources.
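
The join and reshape steps can be sketched with Python's built-in `sqlite3` module. The table names, columns, and values below are invented for illustration; in practice a library such as pandas is often used for the reshaping step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE labs (patient_id TEXT, test TEXT, value REAL);
    INSERT INTO patients VALUES ('p1', 64), ('p2', 51);
    INSERT INTO labs VALUES ('p1', 'glucose', 148.0), ('p1', 'sodium', 139.0),
                            ('p2', 'glucose', 92.0);
""")

# LEFT JOIN keeps patients that have no lab rows at all.
rows = conn.execute("""
    SELECT p.patient_id, p.age, l.test, l.value
    FROM patients AS p
    LEFT JOIN labs AS l ON l.patient_id = p.patient_id
    ORDER BY p.patient_id, l.test
""").fetchall()
conn.close()

# Reshape from long (one row per lab result) to wide (one row per patient).
matrix = {}
for patient_id, age, test, value in rows:
    row = matrix.setdefault(patient_id, {"age": age})
    if test is not None:
        row[test] = value

print(matrix)
```

Note that `p2` has no sodium entry after reshaping, which is one way missing values enter a patient feature matrix.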

## Standardizing features

#### **1. Importance of Standardization**
- **Standardizing (or Normalizing)**: Transforms features so they share a common numerical range, so that no feature dominates an analysis simply because of the units it is measured in.
- **Why it matters**: When features have different scales, analysis methods that rely on distance metrics (e.g., k-NN, clustering) may produce biased or incorrect results.

#### **2. Common Standardization Techniques**
- **Min-Max Scaling**: Rescales each feature to a specific range, typically **0 to 1**. Formula:
  $$
  X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
  $$
- **Z-score Standardization**: Adjusts data so that each feature has an **arithmetic mean of 0** and a **standard deviation of 1**. Formula:
  $$
  Z = \frac{X - \mu}{\sigma}
  $$
  - Where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
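
Both formulas can be written directly in plain Python. This is a minimal sketch: the glucose values are made up, and using the population standard deviation (`pstdev`) rather than the sample one is a choice made here, not something the formulas above prescribe.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

glucose_mg_dl = [80.0, 100.0, 120.0, 200.0]  # made-up values
scaled = min_max_scale(glucose_mg_dl)
standardized = z_score(glucose_mg_dl)
print(scaled)
print(standardized)
```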

#### **3. Benefits of Standardization**
- **Facilitates analysis**: Helps models treat each feature equally, especially when working with algorithms that assume features are on a comparable scale.
- **Prevents bias**: By transforming the data, it avoids situations where large numbers in certain features disproportionately influence the results.


## Dealing with too many features

#### **1. Reasons for Not Using All Features**
- **Useless Features**: Features that don’t provide valuable information, such as the number of times a record was accessed.
- **Missingness**: Features that are missing for most patients can be problematic and may need removal.
- **Sparsity**: Features with many missing values for a given patient may be unhelpful.
- **Redundancy**: Highly correlated features (e.g., the same measurement recorded in two different units) can cause issues in analyses that don’t handle correlations well.
- **Speed of Analysis**: A large number of features can slow down analysis and may be constrained by computational resources.
- **Privacy**: More features can increase the risk of violating patient privacy, making it easier to identify individuals.

#### **2. Feature Removal Criteria**
- **Constant Features**: Features with the same value for all patients or those that are nearly constant over time should be removed.
- **Low Prevalence Features**: Features missing for most patients are challenging to infer and may be candidates for removal.
- **Domain Knowledge Aggregation**: Combine features using domain knowledge. For example:
  - **Drug Classification**: Aggregate specific drug names into broader categories like NSAIDs.
  - **Disease Aggregation**: Group diseases into categories to simplify analysis and improve cross-site comparisons.
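
The removal criteria above can be sketched as follows. The toy matrix, the 50% missingness threshold, and the `DRUG_CLASS` mapping are assumptions for illustration; a real mapping would come from a curated drug classification.

```python
# Hypothetical patient feature matrix; None marks a missing value.
matrix = {
    "p1": {"site": "A", "rare_test": None, "ibuprofen": 1, "naproxen": 0},
    "p2": {"site": "A", "rare_test": None, "ibuprofen": 0, "naproxen": 1},
    "p3": {"site": "A", "rare_test": 3.2, "ibuprofen": 0, "naproxen": 0},
}
DRUG_CLASS = {"ibuprofen": "nsaid", "naproxen": "nsaid"}  # assumed mapping

def columns_to_drop(rows, max_missing=0.5):
    """Flag columns that are constant or missing for more than max_missing of patients."""
    drop = set()
    for col in next(iter(rows.values())):
        values = [r[col] for r in rows.values()]
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > max_missing or len(set(present)) <= 1:
            drop.add(col)
    return drop

def aggregate_drugs(row):
    """Replace individual drug indicators with one indicator per drug class."""
    out = {k: v for k, v in row.items() if k not in DRUG_CLASS}
    for drug, cls in DRUG_CLASS.items():
        out[cls] = max(out.get(cls, 0), row.get(drug, 0))
    return out

drop = columns_to_drop(matrix)
reduced = {
    pid: aggregate_drugs({k: v for k, v in row.items() if k not in drop})
    for pid, row in matrix.items()
}
print(sorted(drop), reduced)
```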

#### **3. Methods for Feature Reduction**
- **Knowledge Graphs**: Use domain knowledge graphs to aggregate and simplify features, making cross-site comparisons easier. Accurate representation of medical knowledge is crucial for effective aggregation.
- **Mathematical Techniques**:
  - **Principal Component Analysis (PCA)**: A method that uses linear algebra to combine features and detect patterns, independent of specific medical knowledge. This can be useful but may lead to less interpretable derived features.
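
To make the linear algebra concrete, here is a dependency-free sketch that finds only the first principal component by power iteration on the covariance matrix. This is one simple way to compute it, not how statistical libraries typically implement PCA; the data values are made up, and the method assumes the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return column means and the unit vector of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec

def project(row, means, component):
    """Score of one patient on the component (a single derived feature)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Made-up values for two strongly correlated features.
data = [[-1.0, -0.9], [-0.5, -0.6], [0.0, 0.1], [0.5, 0.4], [1.0, 1.0]]
means, pc1 = first_principal_component(data)
scores = [project(r, means, pc1) for r in data]
print(pc1, scores)
```

The two original columns collapse into one score per patient, which illustrates the interpretability cost: the score is a weighted mix of both inputs rather than a clinical quantity.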
  
#### **4. Considerations for Feature Reduction**
- **Relevance to Research Question**: Assess if detailed distinctions (e.g., drug brand names) are necessary, or if broader categories (e.g., drug class) will suffice.
- **Model Flexibility**: Models like regression may benefit more from feature aggregation than models with higher flexibility (e.g., gradient boosting).
- **Computational Efficiency**: Reducing the number of features can improve computational efficiency and is beneficial regardless of the model used.

# Missing values

## The origins of missing values

#### **1. Understanding Missing Data**
- **Definition**: In a patient feature matrix, missing data can imply:
  - **Genuine Absence**: The value should have been recorded but was not.
  - **Artifact of Tabular Representation**: The absence of data may result from the structure of the matrix (e.g., disease codes columns being empty for patients without those diseases).
  - **Deliberate Non-Collection**: The value might not be relevant for all patients, indicating that missing data is intentional (e.g., a specific test not administered to all patients).

#### **2. Implications of Missing Data**
- **Data Reporting**:
  - **Special Markers**: Systems often use special values (e.g., NA, Null) to denote missing data. Numerical values outside the expected range (e.g., -999, 0) may also be used, but these need proper documentation.
  - **Interpretation**: Ensure that markers are clearly documented to distinguish between actual missing data and recorded zero values.

- **Impact on Analysis**:
  - **Bias from Removal**: Removing records with missing values can lead to biased results. For example, homeless diabetic patients might have more missing lab values but are also at higher risk for other conditions. Excluding these patients could under-represent important groups and skew results.
  - **Imputation**: Replacing missing values (imputation) needs to be handled carefully to avoid introducing biases or inaccuracies.
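
A small sketch of handling special markers when loading raw values. The marker set is an assumption; in a real project it should come from the dataset's documentation. Note that a recorded `0` is kept as a genuine value here.

```python
# Assumed markers; take the real list from the data dictionary.
MISSING_MARKERS = {"", "NA", "NULL", "-999"}

def parse_lab_value(raw):
    """Return a float, or None when the raw field holds a missing-value marker."""
    text = str(raw).strip()
    if text.upper() in MISSING_MARKERS:
        return None
    return float(text)

raw_values = ["5.4", "NA", "-999", "0", "null"]
parsed = [parse_lab_value(v) for v in raw_values]
print(parsed)
```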

#### **3. Strategies for Handling Missing Data**
- **Documentation**: Clearly document how missing data is represented and handled in your dataset to avoid misinterpretation.
- **Consider Impact**: Evaluate how missing data might affect your analysis, particularly regarding biases and under-representation of certain groups.
- **Imputation Techniques**:
  - Use statistical methods to estimate missing values, but ensure these methods are appropriate for your data and research question.
  - Consider domain knowledge to guide imputation strategies and avoid misleading conclusions.

## Dealing with missing values

#### **1. Imputation Overview**
- **Definition**: Imputation involves predicting and filling in missing values based on other information in the dataset, rather than simply removing records with missing data.

#### **2. Common Imputation Methods**

- **Column-Mean Imputation**:
  - **Description**: Replaces missing values with the mean (average) of the known values in the same column.
  - **Limitations**: Assumes that values in other rows of the same column provide relevant information about the missing value. This may not always be valid, especially in medical datasets where individual variations are significant.

- **Using Other Columns**:
  - **Description**: Improves imputation by considering values in other columns of the same patient. For instance, using known age and weight to estimate height.
  - **Advantage**: Accounts for correlations among different features of the same patient, providing a more informed estimate.

- **k-Nearest Neighbors (k-NN) Imputation**:
  - **Description**: Fills in missing values by finding patients similar to the current patient based on other features and using those patients' values to impute the missing one.
  - **Advantage**: Utilizes the similarity between patients to make more accurate predictions.

- **Multiple Imputation**:
  - **Description**: Repeatedly performs imputation to create several different versions of the dataset, each with different imputed values. Analyzing these multiple datasets helps estimate the variance or imprecision of the imputed values.
  - **Advantage**: Provides a measure of uncertainty around the imputed values, improving the robustness of the analysis.
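
Column-mean and k-NN imputation can be compared on a toy example. This is a simplified sketch: the rows are invented, the k-NN version assumes the other columns of the incomplete row are present, and it uses raw (unscaled) distances, whereas features would normally be standardized first.

```python
import math
from statistics import mean

# Hypothetical rows: [age in years, weight in kg, height in cm]; None is missing.
rows = [
    [30, 60.0, 165.0],
    [32, 62.0, 168.0],
    [70, 90.0, 180.0],
    [31, 61.0, None],
]

def column_mean_impute(data, col):
    """Fill missing entries in one column with the mean of the observed entries."""
    observed = mean(r[col] for r in data if r[col] is not None)
    return [r[:col] + [observed if r[col] is None else r[col]] + r[col + 1:]
            for r in data]

def knn_impute(data, col, k=2):
    """Fill missing entries with the mean over the k most similar complete rows."""
    other = [j for j in range(len(data[0])) if j != col]
    donors = [r for r in data if all(v is not None for v in r)]
    filled = []
    for r in data:
        if r[col] is None:
            target = [r[j] for j in other]
            nearest = sorted(
                donors, key=lambda d: math.dist(target, [d[j] for j in other])
            )[:k]
            r = r[:col] + [mean(d[col] for d in nearest)] + r[col + 1:]
        filled.append(r)
    return filled

mean_height = column_mean_impute(rows, 2)[3][2]
knn_height = knn_impute(rows, 2)[3][2]
print(mean_height, knn_height)
```

The column mean is pulled upward by the one tall, heavy patient, while the k-NN estimate uses only the two patients who resemble the incomplete row.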

#### **3. Considerations for Imputation**
- **Data Characteristics**: Choose an imputation method based on the nature of the data and the relationships among features.
- **Impact on Analysis**: Be aware of how imputation affects the results and interpret findings with an understanding of the imputation process.

## Summary recommendations for missing values

#### **1. General Guidance**
- **Imputation**:
  - **When to Consider**: If a variable has only a few missing values, imputation is often appropriate.
  - **Why**: Imputation can provide more information and maintain the integrity of the dataset.

- **Removal**:
  - **When to Consider**: If a variable has mostly missing values, removing it might be preferable.
  - **Why**: Imputing values for most patients in such cases can be complex and may not provide useful information.

- **Middle Ground**:
  - **Unclear Cases**: For variables with a moderate amount of missing data, there is no one-size-fits-all solution. Consulting with experts is advisable.
  - **Indicator Variables**: Some suggest adding indicator variables to denote imputed values, but this can significantly increase the feature space and is subject to debate.
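
The guidance above can be expressed as a rough rule of thumb. Both thresholds (5% and 60%) are illustrative assumptions, not recommendations from the source, and the HbA1c values are made up.

```python
def missing_fraction(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)

def suggest_action(values, low=0.05, high=0.60):
    """Rule of thumb only; both thresholds are illustrative assumptions."""
    fraction = missing_fraction(values)
    if fraction <= low:
        return "impute"
    if fraction >= high:
        return "consider removing"
    return "review with a domain expert"

def with_indicator(values, fill):
    """Return (filled value, was_imputed flag) pairs."""
    return [(fill if v is None else v, int(v is None)) for v in values]

hba1c = [6.1, None, 7.4, None, None, None, None, None, None, 8.2]
print(missing_fraction(hba1c), suggest_action(hba1c))
print(with_indicator([6.1, None], fill=6.9))
```

The `with_indicator` helper shows why indicator variables grow the feature space: every imputed column gains a companion 0/1 column.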

#### **2. Practical Considerations**
- **Imputation Challenges**:
  - **Complexity**: Imputation methods can be complex and make assumptions about the data that may not always hold true.
  - **Sparse Data**: High levels of missing data make imputation more challenging and less reliable.

- **Alternative Methods**:
  - **Analysis Methods**: Some methods, like classification trees, can handle missing data inherently, reducing the need for imputation.
  - **Practicality**: The benefits of imputation in terms of accuracy improvements should be weighed against the computational costs and complexity.

#### **3. Key Questions to Consider**
- **Importance of Feature**: Assess how crucial the feature with missing data is to your analysis. Can the research question be answered without it?
- **Imputation vs. Removal**: Determine if imputation adds value or if the feature can be dropped without significantly impacting the results.


Question 1
A colleague comes to you with a dataset of diabetic patients. The dataset contains each patient’s latest blood sugar level, the doses of their diabetes medications, and the number of diabetes-related visits the patient has had in the past year. She wants to use this data to predict patient survival, but she is concerned because 50% of the patients have no recorded value for blood sugar level. Another colleague recommends column-mean imputation.

Do you agree with that recommendation for this dataset?

- No, column mean imputation may not be appropriate for blood sugar levels, as individual variations in blood sugar levels are significant and may not be well represented by the mean value.

# Creating new features

## Constructing new features

#### **1. Feature Engineering Overview**
- **Definition**: Feature engineering involves creating new features from the original data by applying transformations or combining existing features. This can enhance model performance by making features more informative or robust.

#### **2. Benefits of Feature Engineering**
- **Improved Performance**: Well-engineered features can lead to better performance of models compared to using raw features.
- **Simplicity vs. Complexity**: Simple models with well-engineered features can sometimes outperform complex models with many raw features, particularly in clinical data.

#### **3. Example of Feature Engineering**
- **Body Mass Index (BMI)**: Calculate BMI from height and weight using a formula. This derived feature can be more useful than height and weight alone in some analyses.
- **Binary Variables**: Instead of using raw counts, convert them into binary indicators (e.g., whether a count is at least one). Binary variables are often more robust to outliers and can simplify analysis.

#### **4. Practical Considerations**
- **Distinguishing Counts from Indicators**: Decide if the count of occurrences (e.g., number of codes for a procedure) or a binary presence/absence indicator is more meaningful. For instance, does seeing two codes indicate disease severity or reflect a billing practice?
- **Simplicity and Robustness**: Simple feature engineering, such as converting counts to binary variables, can often be more robust and easier to interpret.
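
Both examples fit in a few lines. The BMI formula is the standard one (weight in kilograms divided by height in metres squared); treating "present" as a count of at least one is an assumption about where to put the threshold.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def has_any(count):
    """Binary presence indicator derived from a raw count."""
    return int(count >= 1)

print(round(bmi(70.0, 1.75), 1))
print([has_any(c) for c in [0, 1, 7]])
```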



## Examples of engineered features

#### **1. Clinical Scoring Systems**
- **Purpose**: Simple formulas that combine various values from electronic medical records (EMRs) to estimate the severity of a patient's condition.
- **Example**: Body Mass Index (BMI) - a straightforward scoring system that estimates whether a person is over- or underweight.

#### **2. Comorbidity Indices**
- **Purpose**: Quantify the overall burden of multiple diseases to account for the overall illness of a patient.
- **Examples**:
  - **Charlson Comorbidity Index**: Estimates the risk of mortality by considering various comorbid conditions.
  - **Elixhauser Comorbidity Index**: Used to assess the impact of comorbidities on hospital outcomes.

#### **3. Proxy Features**
- **Socioeconomic Status**: Can be inferred from a patient’s zip code and the number of EMR records, scaled by overall health measures.
- **Smoking Status**: Inferred from text analysis of keywords (e.g., "cigarette") in clinical notes.

#### **4. Specific Features from Clinical Knowledge**
- **Growth Rate of Wounds**: Defined as the change in wound size per unit time. This feature was created to predict the risk of a wound not healing within three months, significantly improving prediction accuracy.
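
One simple reading of "change in wound size per unit time" is the difference between the last and first measurement divided by the days between them. The sketch below assumes that reading; the measurements are invented, and the exact definition used in the original prediction work is not given here.

```python
from datetime import date

def growth_rate(measurements):
    """Change in wound area per day between the first and last measurement.

    measurements: list of (date, area_cm2) tuples; a negative rate means shrinking.
    """
    ordered = sorted(measurements)
    (first_day, first_area), (last_day, last_area) = ordered[0], ordered[-1]
    days = (last_day - first_day).days
    if days == 0:
        raise ValueError("need measurements on at least two different days")
    return (last_area - first_area) / days

visits = [
    (date(2024, 1, 1), 12.0),
    (date(2024, 1, 15), 9.2),
    (date(2024, 1, 29), 6.4),
]
rate = growth_rate(visits)
print(rate)
```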


## When to consider engineered features

#### **1. Identifying Gaps in Measurement**
- **Determine Important Unmeasured Features**: Consider what features might be relevant but are not directly measured. Construct proxies from existing features if necessary.
- **Example**: Wound healing rate as a proxy feature to estimate the risk of non-healing.

#### **2. Utilizing Pre-Validated Clinical Scoring Systems**
- **Advantages**: Use established scoring systems that have been validated for clinical problems, as they are likely to be effective and reliable.

#### **3. Approaches to Feature Construction**
- **Counts, Differences, and Ratios**: Consider creating features based on counts, differences between measurements, changes over time, and ratios.
- **Clinical Knowledge and Creativity**: Apply clinical insight and creativity to construct meaningful features. Let the downstream analysis help determine which features are most useful.

#### **4. Cost-Benefit Analysis**
- **Balance Effort and Benefit**: Evaluate the trade-off between the benefits of new features and the effort required to create them.

#### **5. Deep Learning Methods**
- **Raw Data Processing**: Deep learning methods can learn features directly from raw data without domain knowledge.
- **Data Requirements**: These methods require large datasets and are particularly effective for processing signals and images. The benefits for electronic medical records are more modest compared to their impact on signals and images.



## Main points about creating analysis ready datasets

#### **1. Transforming Structured Data**
- **Data Sources**: Structured data in database tables can be converted into an analysis-ready dataset, also known as the patient feature matrix.
- **Tools**: Use widely available programming tools to perform this transformation.

#### **2. Reducing the Number of Features**
- **Aggregation**: Reduce features by aggregating them using domain knowledge or mathematical techniques.
  - **Domain Knowledge**: Group or summarize features based on clinical insights.
  - **Mathematical Techniques**: Use methods like Principal Component Analysis (PCA) to handle high-dimensional data.

#### **3. Handling Missing Data**
- **Options**: Address missing data through removal or imputation.
  - **Removal**: Delete records or features with excessive missing values.
  - **Imputation**: Apply techniques with varying levels of sophistication to estimate missing values.

#### **4. Feature Engineering**
- **Creating New Features**: Construct additional features from the original data to enhance analysis.
  - **Techniques**: Utilize counts, differences, changes over time, and ratios to generate meaningful features.

#### **5. Enhancing Effectiveness**
- **Domain Knowledge**: Improve your ability to create and analyze datasets by learning about relevant biology and medicine related to your research question.

# Knowledge Graphs

## Structured knowledge graphs

#### **1. What is a Knowledge Graph?**
- **Definition**: A knowledge graph, also known as an ontology, is a structured representation of entities within a domain and the relationships among them.
- **Purpose**: It provides a comprehensive and digital form of domain knowledge that can be searched and processed by computers.
- **Construction**: Building a knowledge graph is labor-intensive but essential for representing complex relationships and concepts.

#### **2. Using Knowledge Graphs in Clinical Data**
- **Prior Knowledge Integration**: Knowledge graphs help integrate prior knowledge into data analysis. For instance:
  - **Example 1**: Using counts of glucose-related lab tests to infer diabetes status instead of relying on actual test results.
  - **Example 2**: Grouping drug brand names into broader categories like NSAIDs for better analysis.

#### **3. Understanding Domain Concepts**
- **Contextual Knowledge**: Knowledge graphs provide information about diseases, tests, and treatments. For example:
  - **Diabetes**: The knowledge graph links diabetes to glucose metabolism disorders.
  - **Drug Categorization**: Identifies that Aleve contains naproxen and belongs to the NSAID category.

#### **4. Addressing Synonyms and Variability**
- **Standardization**: Knowledge graphs help address variability in clinical data by representing synonyms and different expressions of the same concept.
- **Clarity**: They provide a unified view of related terms and entities, making it easier to standardize and interpret clinical data.

## So what exactly is in a knowledge graph

**Definition and Purpose:**
- **Knowledge Graph**: A digital representation of entities within a domain and their interrelationships. Often used interchangeably with the term "ontology."
- **Purpose**: Provides structured, comprehensive information about entities and their relationships, enhancing the ability to query and analyze clinical data.

**Components of Knowledge Graphs:**
1. **Entities**:
   - **Examples**: Symptoms, diseases, medications, treatments, body parts.
   - **Function**: Represent specific elements within the medical domain.

2. **Synonyms and Equivalent Names**:
   - **Purpose**: Ensures consistent referencing of entities.
   - **Examples**: "Heart attack" and "acute myocardial infarction"; brand names and generic names of medications.

3. **Relations Between Entities**:
   - **Common Relation**: "Is a kind of" or "kind of."
   - **Examples**: 
     - Diarrhea is a kind of gastrointestinal disease.
     - Lipitor is a kind of lipid-lowering drug.
   - **Inheritance**: Entities inherit properties from their parent entities (e.g., if Lipitor is a lipid-lowering drug, it shares properties of all lipid-lowering drugs).

4. **Links to Other Knowledge Graphs**:
   - **Function**: Connects different knowledge graphs to provide a unified reference framework across diverse data sources.
   - **Example**: Metformin is linked to its role in treating diabetes.

**Applications in Clinical Data Analysis**:
- **Enhanced Querying**: Facilitates finding patients with specific conditions or treatments by identifying various terms with the same meaning.
- **Example**: Identifying all patients with diabetes or those taking allergy medications by understanding synonyms and relationships.
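
Synonyms and "is a kind of" relations can be sketched with two dictionaries. Everything here is a toy: the entries are a tiny hand-written sample, and each entity has a single parent, whereas real ontologies usually allow several.

```python
# Tiny hypothetical knowledge graph; real ones hold thousands to millions of terms.
SYNONYMS = {
    "heart attack": "acute myocardial infarction",
    "lipitor": "atorvastatin",
}
IS_A = {  # child -> parent
    "atorvastatin": "lipid-lowering drug",
    "simvastatin": "lipid-lowering drug",
    "lipid-lowering drug": "drug",
}

def normalize(term):
    """Map a raw term to its preferred name."""
    term = term.lower().strip()
    return SYNONYMS.get(term, term)

def is_kind_of(term, ancestor):
    """Follow 'is a kind of' links upward from term until ancestor or the root."""
    node = normalize(term)
    while node is not None:
        if node == ancestor:
            return True
        node = IS_A.get(node)
    return False

medications = {"p1": ["Lipitor"], "p2": ["metformin"], "p3": ["simvastatin"]}
on_lipid_lowering = sorted(
    pid for pid, meds in medications.items()
    if any(is_kind_of(m, "lipid-lowering drug") for m in meds)
)
print(on_lipid_lowering)
```

The query finds `p1` even though the record says "Lipitor", because the synonym is resolved before the hierarchy is walked.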

**Benefits**:
- **Consistency**: Helps standardize terms and definitions across different datasets.
- **Comprehensiveness**: Provides a broad, structured view of medical knowledge, supporting better data analysis and integration.

**Challenges**:
- **Creation**: Building a knowledge graph is labor-intensive and requires expert knowledge.
- **Complexity**: Integrating and maintaining links between different knowledge graphs can be complex.


## What are important knowledge graphs

**1. International Classification of Diseases (ICD):**
   - **Purpose**: Originally created to record causes of death, now used to classify and code diagnoses.
   - **Versions**: 
     - **ICD-10**: Current version in the USA.
     - **ICD-9**: Older version, still relevant for historical data.
   - **Example Codes**:
     - **ICD-9**: 250.00 (Uncomplicated diabetes mellitus type 2).
     - **ICD-10**: E11.9 (Type 2 diabetes mellitus without complications).

**2. Current Procedural Terminology (CPT):**
   - **Purpose**: Categorizes medical procedures and services for billing purposes.
   - **Entities**: Procedures, services, and modifiers.
   - **Relationship**: Subclass of or kind of.

**3. RxNorm:**
   - **Purpose**: Provides naming systems for both generic and branded drugs.
   - **Features**: Represents synonyms for drugs.
   - **Tool**: RxNav, used to navigate RxNorm.

**4. Anatomical Therapeutic Chemical (ATC) Classification System:**
   - **Purpose**: Categorizes drugs based on their active ingredients, organ systems they act upon, therapeutic activity, and pharmacological properties.
   - **Maintainer**: World Health Organization.

**5. Logical Observation Identifiers Names and Codes (LOINC):**
   - **Purpose**: Provides identifiers for medical laboratory observations and measurements.
   - **Maintainer**: Regenstrief Institute.

**6. Unified Medical Language System (UMLS) Metathesaurus:**
   - **Purpose**: A comprehensive resource combining over 140 knowledge graphs, including ICD, CPT, RxNorm, and more.
   - **Features**: Contains declarations of relationships between various knowledge graphs.
   - **Components**: Includes the Semantic Network and Lexical Tools.
   - **Resource**: Tutorials and materials are available from the US National Library of Medicine.

**Additional Resource:**
- **BioPortal**: Hosted by the National Center for Biomedical Ontology at Stanford, provides access to various biomedical knowledge graphs.

**Applications:**
- **Creating Analysis Ready Datasets**: Use knowledge graphs to ensure consistency, identify synonyms, and integrate various data sources.
- **Clinical Data Mining**: Essential for querying and analyzing medical data accurately.


## How to choose which knowledge graph to use

**1. Key Questions for Evaluation:**
   - **Entities and Classification**: 
     - What entities are included in the knowledge graph?
     - What is the basis for classifying these entities?
     - Understand the meaning and implications of the relationships in the graph (e.g., "is a kind of").

   - **Naming and Synonyms**:
     - What terminology is used to name the entities?
     - Are there synonyms, alternative names, or different spellings included?
     - Assess how well the graph handles variations in terminology.

   - **Mappings and Connectivity**:
     - Is the knowledge graph mapped to other knowledge graphs?
     - Evaluate how well-connected the graph is with other resources.
     - Connectivity can enhance the integration and utility of the knowledge graph.

**2. Practical Approaches for Assessment:**
   - **Term Coverage**:
     - Analyze the number of terms from the knowledge graph mentioned in the textual data from electronic medical records (EMRs).
     - Determine the coverage and relevance of the knowledge graph for your dataset.

   - **Handling Large Graphs**:
     - For large knowledge graphs with millions of terms, use term occurrence counts to filter and select relevant terms.
     - Helps in managing and refining the scope of the knowledge graph to suit your needs.

   - **Updates and Currency**:
     - Knowledge graphs are periodically updated (e.g., UMLS is released twice a year).
     - Ensure the knowledge graph you use is current and reflects the latest medical knowledge and classifications.
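
Term coverage can be estimated by counting how many knowledge-graph terms actually appear in the notes. This sketch uses an invented term list and notes, and matches single words only; multi-word terms would need phrase matching.

```python
import re
from collections import Counter

KG_TERMS = {"diabetes", "metformin", "hypertension", "asthma"}  # hypothetical term list

notes = [
    "Patient with diabetes, started metformin.",
    "History of hypertension and diabetes.",
]

def term_counts(texts, terms):
    """Count how often each knowledge-graph term occurs as a whole word."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in terms:
                counts[token] += 1
    return counts

counts = term_counts(notes, KG_TERMS)
coverage = len(counts) / len(KG_TERMS)  # share of terms seen at least once
print(counts, coverage)
```

Terms with a count of zero (here, "asthma") are candidates for filtering out when a large graph needs to be trimmed to the dataset at hand.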

**3. Summary:**
   - **Knowledge Graphs**: Large, curated collections of medical entities, their synonyms, and relationships.
   - **Utility**: Essential for creating features and processing clinical texts in medical data mining.
   - **Recommendation**: If not already familiar with UMLS or other major knowledge graphs, explore these resources for comprehensive and accurate medical knowledge.
