
Code Sets

Tiffany J. Callahan edited this page Jun 5, 2019 · 11 revisions

Converting Phenotype Definitions to Computable Code Sets

An eMERGE phenotype usually includes a Word document or a PDF that describes the clinical logic used to identify patients who have a specific disease. In most cases, the author of the phenotype provides a list of clinical codes (e.g., ICD9, SNOMED, RxNorm) and some logic or rules that are used to identify specific sets of patients. In the absence of a code, a string or descriptive phrase is usually provided (e.g., "adderall", "history of appendicitis"). In these situations, the string must then be mapped onto a code in a clinical terminology. While this may initially seem like a trivial task, the lack of specificity requires that several important decisions be made.

For example, the table below shows two simple ways that one could try to map "adderall" to a concept_code in one of the vocabularies recognized by the OMOP common data model.

Input String | Query Approach | Results | Example Results
adderall | Wildcard match: WHEN lower(c.concept_name) LIKE "%adderall%" THEN "adderall" | 352 | SNOMED:15846811000001101 - Adderall XR 30mg capsules (Imported (United States)); NDC:54868367402 - Amphetamine aspartate 2.5 MG / Amphetamine Sulfate 2.5 MG / Dextroamphetamine saccharate 2.5 MG / Dextroamphetamine Sulfate 2.5 MG Oral Tablet [Adderall]
adderall | Exact match: WHEN lower(c.concept_name) LIKE "adderall" THEN "adderall" | 4 | SPL:2ea20fcc-3a59-4b9b-9b90-9da0d2bf219d - Adderall; RxNorm:84815 - Adderall
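
The difference between the two query approaches above can be reproduced on a toy table. The sketch below is only an illustration, not the project's actual query: it assumes an in-memory SQLite database, a cut-down stand-in for the OMOP CONCEPT table, and a hypothetical helper named match_codes. The first three rows reuse example results from the table above; the last row is a made-up non-matching entry. The counts it produces reflect the toy rows only, not the 352 and 4 reported above.

```python
import sqlite3

# Toy stand-in for the OMOP CONCEPT table (assumption: only three columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (vocabulary_id TEXT, concept_code TEXT, concept_name TEXT)"
)
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [
        ("SNOMED", "15846811000001101",
         "Adderall XR 30mg capsules (Imported (United States))"),
        ("SPL", "2ea20fcc-3a59-4b9b-9b90-9da0d2bf219d", "Adderall"),
        ("RxNorm", "84815", "Adderall"),
        # Hypothetical row that should never match "adderall".
        ("EXAMPLE", "HYPOTHETICAL-1", "unrelated concept (hypothetical)"),
    ],
)


def match_codes(pattern):
    """Return 'vocabulary:code' strings whose lower-cased name matches pattern."""
    rows = conn.execute(
        "SELECT vocabulary_id, concept_code FROM concept "
        "WHERE lower(concept_name) LIKE ? "
        "ORDER BY vocabulary_id, concept_code",
        (pattern,),
    )
    return [f"{vocabulary}:{code}" for vocabulary, code in rows]


# Wildcard match: any concept_name containing the input string.
wildcard = match_codes("%adderall%")
# Exact match: LIKE without wildcards only matches the whole name.
exact = match_codes("adderall")

print(len(wildcard), "wildcard matches;", len(exact), "exact matches")
```

On this toy data every exact match is also a wildcard match, which is why the wildcard approach returns the larger (and noisier) result set.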

To choose the best approach for mapping input source strings, we must understand which approaches are available and how and why they differ. This page documents our results from exploring these approaches. The different methods for mapping strings to codes are shown in Table 1. The phenotypes that contain input strings, organized by clinical domain, are shown below.

Code Set Queries
The queries used to generate the code sets described below can be found here.

Code Set Data
Data for all code sets by phenotype can be found here. This directory contains a sub-directory for each phenotype and within it are separate spreadsheets for each of the 24 code sets described above. A code set is only generated for those phenotypes and clinical domains included in the table below that have at least 1 valid occurrence in either database.


Code Set Information


Mapping Source Strings to Concept Codes

Table 1. Methods for deriving code sets from input source strings.

Approach Description
APPROACH 1 Naïve wildcard matching input strings to OMOP CDM concept_codes
Wildcard Match DEF: Wildcard match input strings to concept_names in the OMOP CONCEPT table.

  • EXAMPLE: the input string "%adderall%" matches concept_names like: "Adderall"; "Adderall Pill"; "Adderall Oral Product"; "Amphetamine 3.75 MG/ML / Dextroamphetamine 3.75 MG/ML [Adderall]".
Wildcard Match + Children DEF: Wildcard match input strings to concept_names in the OMOP CONCEPT table. The immediate children of the matched concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.
Wildcard Match + All Descendants DEF: Wildcard match input strings to concept_names in the OMOP CONCEPT table. All descendants of the matched concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.
Concept Synonym Wildcard Match (CSWM) DEF: Wildcard match input strings to synonym_concept_names in the OMOP CONCEPT_SYNONYM table and then retrieve wildcard matches between the input string and all synonyms to concept_names in the OMOP CONCEPT table.

  • EXAMPLE: the input string "%adderall%" matches concept_names like: "Adderall 10mg tablets (Imported (United States))" and "Amphetamine 2.5 MG / Dextroamphetamine 2.5 MG [Adderall XR]".
Concept Synonym Wildcard Match + Children (CSWM_Child) DEF: Wildcard match input strings to synonym_concept_names in the OMOP CONCEPT_SYNONYM table and then retrieve wildcard matches between the input string and all synonyms to concept_names in the OMOP CONCEPT table. The immediate children of the matched synonym_concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.

  • EXAMPLE: the input string "%adderall%" matches concept_names like: "Dextroamphetamine 10 MG Oral Tablet [Adderall]" and "Amphetamine / Dextroamphetamine Oral Tablet [Adderall]"
Concept Synonym Wildcard Match + All Descendants (CSWM_Desc) DEF: Wildcard match input strings to synonym_concept_names in the OMOP CONCEPT_SYNONYM table and then retrieve wildcard matches between the input string and all synonyms to concept_names in the OMOP CONCEPT table. All descendants of the matched synonym_concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.

  • EXAMPLE: the input string "%adderall%" matches concept_names like: "Amphetamine 2.5 MG / Dextroamphetamine 2.5 MG Extended Release Oral Capsule [Adderall XR] Box of 100"
APPROACH 2 Exact matching input strings to OMOP CDM concept_codes
Exact Match DEF: Exact match input strings to concept_names in the OMOP CONCEPT table.

  • EXAMPLE: the input string "adderall" matches concept_names like: "adderall"; "Adderall"; and "ADDERALL".
Exact Match + Children DEF: Exact match input strings to concept_names in the OMOP CONCEPT table. The immediate children of the matched concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.
Exact Match + All Descendants DEF: Exact match input strings to concept_names in the OMOP CONCEPT table. All descendants of the matched concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.
Concept Synonym Exact Match (CSEM) DEF: Exact match input strings to synonym_concept_names in the OMOP CONCEPT_SYNONYM table and then retrieve exact matches between the input string and all synonyms to concept_names in the OMOP CONCEPT table.

  • EXAMPLE: the input string "adderall" matches concept_names like: "adderall"; "Adderall"; and "ADDERALL".
Concept Synonym Exact Match + Children (CSEM_Child) DEF: Exact match input strings to synonym_concept_names in the OMOP CONCEPT_SYNONYM table and then retrieve exact matches between the input string and all synonyms to concept_names in the OMOP CONCEPT table. The immediate children of the matched synonym_concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.

  • EXAMPLE: the input string "adderall" matches concept_names like: "adderall"; "Adderall"; and "ADDERALL".
Concept Synonym Exact Match + All Descendants (CSEM_Desc) DEF: Exact match input strings to synonym_concept_names in the OMOP CONCEPT_SYNONYM table and then retrieve exact matches between the input string and all synonyms to concept_names in the OMOP CONCEPT table. All descendants of the matched synonym_concept_names are then retrieved from the OMOP CONCEPT_ANCESTOR table.

  • EXAMPLE: the input string "adderall" matches concept_names like:  "adderall"; "Adderall"; and "ADDERALL".
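
The variants in Table 1 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the project's queries: it uses an in-memory SQLite database, cut-down versions of the OMOP CONCEPT, CONCEPT_SYNONYM, and CONCEPT_ANCESTOR tables, made-up concept_ids and placeholder concept names, and hypothetical helper functions (name_match, synonym_match, expand). It also assumes that "+ Children" means min_levels_of_separation = 1, that "+ All Descendants" means min_levels_of_separation >= 1, and that the matched concepts themselves stay in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE concept_synonym (concept_id INTEGER, concept_synonym_name TEXT);
CREATE TABLE concept_ancestor (
    ancestor_concept_id INTEGER, descendant_concept_id INTEGER,
    min_levels_of_separation INTEGER, max_levels_of_separation INTEGER);
-- All ids and the '(hypothetical)' names are invented for this sketch.
INSERT INTO concept VALUES
    (1, 'Adderall'),
    (2, 'child concept (hypothetical)'),
    (3, 'grandchild concept (hypothetical)'),
    (4, 'synonym-only concept (hypothetical)');
INSERT INTO concept_synonym VALUES (4, 'Adderall Pill');
-- Hierarchy 1 -> 2 -> 3, including the self-referencing row for concept 1.
INSERT INTO concept_ancestor VALUES
    (1, 1, 0, 0), (1, 2, 1, 1), (1, 3, 2, 2), (2, 3, 1, 1);
""")


def ids(sql, params):
    """Run a query and collect the first column as a set of concept_ids."""
    return {row[0] for row in conn.execute(sql, params)}


def name_match(pattern):
    """Match the pattern against concept_name in CONCEPT."""
    return ids(
        "SELECT concept_id FROM concept WHERE lower(concept_name) LIKE ?",
        (pattern,),
    )


def synonym_match(pattern):
    """Match the pattern against concept_synonym_name in CONCEPT_SYNONYM."""
    return ids(
        "SELECT concept_id FROM concept_synonym "
        "WHERE lower(concept_synonym_name) LIKE ?",
        (pattern,),
    )


def expand(seed_ids, children_only):
    """Add immediate children (or all descendants) of each seed concept."""
    level_filter = "= 1" if children_only else ">= 1"
    found = set(seed_ids)
    for concept_id in seed_ids:
        found |= ids(
            "SELECT descendant_concept_id FROM concept_ancestor "
            f"WHERE ancestor_concept_id = ? AND min_levels_of_separation {level_filter}",
            (concept_id,),
        )
    return found


wildcard = name_match("%adderall%")                          # Wildcard Match
wildcard_children = expand(wildcard, children_only=True)     # + Children
wildcard_descendants = expand(wildcard, children_only=False) # + All Descendants
cswm = wildcard | synonym_match("%adderall%")                # CSWM
exact = name_match("adderall")                               # Exact Match
```

Passing "adderall" instead of "%adderall%" to the same helpers gives the Approach 2 (exact match) counterparts of each variant.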

Mapping Verification Task


As you might expect, some of the approaches described in Table 1 yield false positives that we'd like to exclude from the code set generation task. To account for this, we designed a verification task, which was completed by 2 pharmacists and 2 third-year medical students. The documents below were given to each of the students. An excerpt of the instructions provided in these documents is also included below.

  • Description of drug string to source code mapping task (here)
  • Description of condition, measurement, observation, and procedure strings to source code mapping task (here)

Verification Task
The goal of this task is to use your expertise to verify the diagnostic accuracy of phenotype-specific clinical code sets derived from several different query-based approaches. For this task, you will be asked to review code sets generated from definitions for 7 phenotypes. Note that you are only being asked to review the code sets generated when a source string is provided.

  • All of the phenotypes will also include additional code sets provided by the authors, but those do not require review.
  • The code type (i.e., source string or clinical code) is provided in the phenotype definition.

As you work through each phenotype, you should consider each source string and the mapped source codes and determine whether the resulting code could be included in the phenotype definition because the code is a close match to the source string and/or because the code makes sense clinically. Ideally, resulting codes should be both a good match to the phenotype definition AND make sense clinically. There are different conclusions that can be made when verifying a mapping between a source string and a source code:

  • “Exactly Matches Definition String + Clinically Relevant” - This conclusion is reached when the source code is a good match to the source string and when the source code is both relevant and a currently used treatment for the phenotype.
  • “Exactly Matches Definition String + NOT Clinically Relevant” - This conclusion is used when the source code is a good match to the source string, but the clinical concept represented by the source code lacks clinical relevance (e.g. discontinued drug, weird delivery route, deprecated lab test). If this conclusion is selected, a brief explanation should be provided as a comment.
  • “Partially (and Appropriately) Matches Definition String + Clinically Relevant” - This conclusion is used when the source code is a partial and appropriate match to the source string (e.g. source code represents a combo drug or a chemical compound that is part of the source string) and the source code is both relevant and a currently used treatment for the phenotype. If this conclusion is selected, a brief explanation should be provided as a comment.
  • “Partially (and Appropriately) Matches Definition String + NOT Clinically Relevant” - This conclusion is used when the source code is a partial and appropriate match to the source string (e.g. source code represents a combination drug or a chemical compound that is part of the source string), but the concept represented by the source code lacks clinical relevance (e.g. discontinued drug, weird delivery route, deprecated lab test). If this conclusion is selected, a brief explanation should be provided as a comment.
  • “Does NOT Match Definition + NOT Clinically Relevant” - This conclusion is used when the source code is neither an exact nor a partial (and appropriate) match to the source string and is not clinically relevant. If this conclusion is selected, a brief explanation should be provided as a comment.
  • “Control Group Criteria” - These choices are meant to be used when you are verifying the mappings for control patients. These options were added because the prior categories did not easily account for control patients: unlike with cases, we cannot possibly know all of the criteria that are used when selecting control patients. Thus, there are scenarios where it would be very difficult to judge the clinical relevance of a mapping.
    • “Control Group Criteria - Exactly Matches Definition String”
    • “Control Group Criteria - Partially (and Appropriately) Matches Definition String”
    • “Control Group Criteria - Does NOT Match Definition”
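
For anyone scripting against these categories, the conclusions listed above can be sketched as a small lookup. This is only an illustration: the Match enum and the conclusion function are hypothetical names, not part of the verification documents, and the sketch assumes that a non-matching code is never labelled clinically relevant because no such option appears in the list.

```python
from enum import Enum


class Match(Enum):
    """How well the source code's label matches the source string."""
    EXACT = "Exactly Matches Definition String"
    PARTIAL = "Partially (and Appropriately) Matches Definition String"
    NONE = "Does NOT Match Definition"


def conclusion(match, clinically_relevant=False, control_group=False):
    """Build the drop-down label for one source string to source code mapping."""
    if control_group:
        # Control-group options do not judge clinical relevance.
        return f"Control Group Criteria - {match.value}"
    if match is Match.NONE:
        if clinically_relevant:
            # Assumption: the list above has no option for this combination.
            raise ValueError("no option for a non-match that is clinically relevant")
        return f"{match.value} + NOT Clinically Relevant"
    relevance = "Clinically Relevant" if clinically_relevant else "NOT Clinically Relevant"
    return f"{match.value} + {relevance}"


print(conclusion(Match.EXACT, clinically_relevant=True))
print(conclusion(Match.NONE, control_group=True))
```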

Consider the examples below:
Phenotype: Peanut Allergy (Observations)
Input String: “peanut”
Source Code: SNOMED_91935009: Allergy to peanuts

  • This source code is a good match because its label, “Allergy to peanuts”, contains the input string.
  • This makes sense clinically as a realistic observation source code that would be assigned to a patient who had a diagnosed peanut allergy.
    Conclusion: “Exactly Matches Definition String + Clinically Relevant”

Phenotype: Peanut Allergy (Conditions)
Input String: “peanut”
Source Code: SNOMED_17212311000001107: Lifestyle gluten free peanut butter cookies (Ultrapharm Ltd)

  • This source code is a good match because its label, “Lifestyle gluten free peanut butter cookies (Ultrapharm Ltd)”, contains the input string.
  • This might NOT make sense clinically as a realistic diagnosis or condition code to use as evidence that a patient has been diagnosed with a peanut allergy.
    Conclusion: “Exactly Matches Definition String + NOT Clinically Relevant”

Phenotype: Systemic Lupus Erythematosus (Measurements)
Input String: “antinuclear ab”
Source Code: LOINC_11090-8: Smith extractable nuclear Ab [Units/volume] in Serum

  • This source code is a good partial (and appropriate) match to the phenotype definition because Smith extractable nuclear Ab is an autoantibody to the Smith antigen (PMID:19121207).
  • This code makes sense clinically as this lab test is specifically used to detect systemic lupus erythematosus (PMID: 5932578).
    Conclusion: “Partially (and Appropriately) Matches Definition String + Clinically Relevant”

Phenotype: Hypothyroidism (Measurements)
Input String: “free t4”
Source Code: LOINC_3022-1: Thyroxine free index in Serum or Plasma

  • This source code is a partial (and appropriate) match because its label, “Thyroxine free index in Serum or Plasma”, refers to free thyroxine (T4) but does not contain the input string verbatim.
  • This might NOT make sense clinically as this LOINC code has been deprecated and a different lab, LOINC_32215-6: Thyroxine (T4) free index in Serum or Plasma by calculation, is now preferred.
    Conclusion: “Partially (and Appropriately) Matches Definition String + NOT Clinically Relevant”

Phenotype: Crohn’s Disease (Conditions)
Input String: “SLE”
Source Code: CIEL_126366: Sleep-Related Bruxism

  • This source code would NOT be a good match because “Sleep-Related Bruxism” has no well-known connection to “SLE” and was only included because it contained the string “SLE”.
  • This would NOT make sense clinically as a diagnosis or condition code assigned to a Crohn’s disease patient, especially if it were being used as a way to rule out a secondary diagnosis of “SLE” or systemic lupus erythematosus.
    Conclusion: “Does NOT Match Definition + NOT Clinically Relevant”

Please keep the distinction between these different kinds of mappings in mind as you read the rest of this document.

Spreadsheet
A screenshot of the spreadsheet you will verify is shown in the figure below.

Figure 3. Verification spreadsheet example.

The spreadsheet includes the following columns:

  • source_string: A string containing a medication or drug name (e.g. amphetamine, wellbutrin)
  • source_code: A string containing a code to a drug from one of the clinical vocabularies included in the OMOP common data model.
  • source_name: The label or name that corresponds to the source_code listed in the prior column.
  • source_vocabulary: The vocabulary that the source_code belongs to.
  • standard_code: A string containing a code to a drug from one of the standard clinical vocabularies included in the OMOP common data model.
  • standard_name: The label or name that corresponds to the standard_code listed in the prior column.
  • standard_vocabulary: The vocabulary that the standard_code belongs to:
    • Conditions: SNOMED CT
    • Drugs: RxNorm
    • Observations: CPT4, HCPCS, SNOMED, LOINC, APC, DRG, CDT, PPI, SPL, MedDRA, MMI, MDC
    • Procedures: CPT4, HCPCS, ICD9Proc, CDT, ICD10PCS, OPCS4, SNOMED
    • Measurements: LOINC
  • CHCO_count: The count of standard_code occurrences in the Children’s Hospital of Colorado electronic health record.
  • MIMICIII_count: The count of standard_code occurrences in the OMOP-MIMIC III electronic health record.
  • Student Evaluator: This column contains a drop-down menu of the verification options described above and is to be used by the student.
  • Student Comment: If any option other than “Exactly Matches Definition String + Clinically Relevant” is selected, please provide a brief explanation. The “Other” option is also meant to be used when the student is unsure of their answer or if assistance from the Domain Expert is needed.
  • Domain Expert: This column contains a drop-down menu of the verification options described above and is to be used by the domain expert when assisting the student.
  • Domain Expert Comment: This column is used when the Domain Expert’s advice or opinion was needed by the student. All responses made by the Domain Expert should include a brief explanation.
  • Complete: The student should ONLY check this column once the verification of that row in the spreadsheet is complete.
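
The comment rule described for the Student Comment column can be checked mechanically. The sketch below is a hedged illustration only: it assumes the spreadsheet has been exported to CSV with the column names listed above, shows only a subset of those columns, and uses a hypothetical helper named rows_missing_comment. The three sample rows reuse mappings from the worked examples; the evaluator choices and the comment text are invented for the sketch.

```python
import csv
import io

# The only option that does not call for an explanatory comment.
NO_COMMENT_NEEDED = "Exactly Matches Definition String + Clinically Relevant"


def rows_missing_comment(csv_text):
    """Return the source_code of each reviewed row that lacks a needed comment."""
    missing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        choice = row["Student Evaluator"].strip()
        comment = row["Student Comment"].strip()
        # Rows with no choice yet are skipped; they are simply not reviewed.
        if choice and choice != NO_COMMENT_NEEDED and not comment:
            missing.append(row["source_code"])
    return missing


# Assumed CSV export containing a subset of the spreadsheet columns.
sample = """source_string,source_code,source_name,Student Evaluator,Student Comment
peanut,91935009,Allergy to peanuts,Exactly Matches Definition String + Clinically Relevant,
peanut,17212311000001107,Lifestyle gluten free peanut butter cookies (Ultrapharm Ltd),Exactly Matches Definition String + NOT Clinically Relevant,
SLE,126366,Sleep-Related Bruxism,Does NOT Match Definition + NOT Clinically Relevant,Only matched on the substring
"""

print(rows_missing_comment(sample))
```

A check like this could be run before a row is marked as complete, so that missing explanations are caught before the Domain Expert reviews the sheet.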

Resources
Each phenotype definition can be accessed by clicking on the hyper-linked phenotype name in the first column of the Student 1 Verification and Student 2 Verification tables, as well as through the project wiki.

Verification Instructions

  1. Obtain your student verification number from the project’s GitHub repository.
  2. Work through each of the hyper-linked spreadsheets listed in your table.
    1. Start by clicking on the phenotype name; this will lead you to documentation on the phenotype and the diagnostic logic or clinical decisions that were made when building the phenotype definition. Keep this in mind when verifying each of the source string-code mappings.
    2. Use the drop-down menu in the “Student Evaluator” column to indicate your response. Example uses of each drop-down option are provided in the Verification Task section.
      1. If you are unsure and need advice from the Domain Expert, choose “Other” and provide a brief description in the “Student Comment” column. Then, navigate to the project’s GitHub repository and create a new comment, assigned to the Domain Expert. A special training will be provided prior to beginning verification.
      2. The Domain Expert will respond in GitHub as well as in the spreadsheet via the “Domain Expert” and “Domain Expert Comment” columns.
    3. Check the box in the “Complete” column once you have finished your review.
    4. When a spreadsheet has been completed:
      1. Please type in the date of completion in the “Complete Date” column.
      2. Add any important comments in the “Comments” column of the table.
      3. Navigate to the project’s GitHub repository and create a new comment, assigned to callahan_tiff. I will then verify the spreadsheet and close the GitHub issue.
  3. If you experience any technical difficulties, find errors, or have questions, please contact Tiffany Callahan by clicking here.
