# UMLS and Belgian French SnomedCT to concept table

## Introduction

This notebook takes concepts from UMLS and the Belgian SnomedCT extension and generates a French-oriented concept table for biomedical entity linking on French medical corpora.

It outputs:
- A csv file that can be used with [MedCAT](https://medcat.readthedocs.io).
- A [BRAT Normalization](https://brat.nlplab.org/normalization.html) DB file.
- A csv file grouped by CUI, with terms separated by SEP tokens (`</s>`).
- A dataset of synonym pairs designed to be used to pretrain [sapBERT](https://aclanthology.org/2021.acl-short.72/).

All outputs include **French** and **English** terms from UMLS, **French** terms from the Belgian SnomedCT extension, and ATC and Belgian drug names. The sapBERT output also includes more terms, as well as other Latin languages (**Spanish**, **Portuguese** and **Italian**).

This notebook owes a lot to the [dutch-medical-concepts](https://github.com/umcu/dutch-medical-concepts) repository from the UMC Utrecht team, which generates concept tables for the Dutch language.

## Requirements

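The Python dependencies are pandas, beautifulsoup4, lxml and tqdm; they can be installed with `pip install pandas beautifulsoup4 lxml tqdm`. lxml is needed by BeautifulSoup's `"xml"` parser, which is used to read the SAM export.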

The notebook also requires as inputs:
- [UMLS](https://www.nlm.nih.gov/research/umls/index.html) (MRCONSO.RRF and MRSTY.RRF), either from the full release or from a custom subset generated using MetamorphoSys.
- The Belgian extension of the [Snomed CT ontology](https://mlds.ihtsdotools.org).
- SAM (Source Authentique des Médicaments) [full export (Samv2 v5)](https://www.vas.ehealth.fgov.be/websamcivics/samcivics/download/samv2.html) for Belgian drug names.
- A `umls_types.csv` file with `tui` and `label` columns, used to label semantic types in the BRAT output.
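
Since the inputs come from several separate downloads, it can help to check that each file is where the Paths section expects it before running the heavier cells. A minimal sketch, assuming a hypothetical helper `missing_inputs` that is not part of the original notebook:

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of `paths` that does not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical usage with the variables defined in the Paths section:
# missing = missing_inputs([mrconso_path, mrsty_path, snomedct_be_fr_path, sam_amp_file])
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```

Failing early this way avoids discovering a wrong path only after the SAM export has been parsed.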

## Imports

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import re
import itertools
import random

## Paths

### Inputs

In [None]:
# UMLS
mrconso_path = "input_files/UMLS/2022AB/META/MRCONSO.RRF"
mrsty_path = "input_files/UMLS/2022AB/META/MRSTY.RRF"

# Snomed CT
snomedct_be_fr_path = "input_files/SnomedCT_ManagedServiceBE_PRODUCTION_BE1000172_20221115T120000Z/Snapshot/Terminology/sct2_Description_Snapshot-fr_BE1000172_20221115.txt"

# SAM
sam_amp_file = "input_files/sam/AMP-1674307274162.xml"

# Languages to include for SAP pairs: English and Latin languages.
sap_lang = [
    "ENG",
    "SPA",
    "FRE",
    "POR",
    "ITA",
]

### Outputs

In [None]:
medcat_csv_path = "output_files/custom_umls_fr_en_medcat.csv"
brat_output_path = "output_files/custom_umls_fr_en_brat.txt"
grouped_output_path = "output_files/custom_umls_fr_en_grouped.txt"
sap_output_path = "output_files/custom_umls_sap.txt"

# UMLS

## Define the file structure (for reference) and column names

In [None]:
mrconso_structure = {
    "CUI": "Unique identifier for concept",
    "LAT": "Language of term",
    "TS": "Term status",
    "LUI": "Unique identifier for term",
    "STT": "String type",
    "SUI": "Unique identifier for string",
    "ISPREF": "Atom status - preferred (Y) or not (N) for this string within this concept",
    "AUI": "Unique identifier for atom - variable length field, 8 or 9 characters",
    "SAUI": "Source asserted atom identifier [optional]",
    "SCUI": "Source asserted concept identifier [optional]",
    "SDUI": "Source asserted descriptor identifier [optional]",
    "SAB": "Abbreviated source name (SAB). Maximum field length is 20 alphanumeric characters.",
    "TTY": "Abbreviation for term type in source vocabulary",
    "CODE": '''Most useful source asserted identifier (if the source vocabulary has more than one identifier), 
    or a Metathesaurus-generated source entry identifier (if the source vocabulary has none)''',
    "STR": "String",
    "SRL": "Source restriction level",
    "SUPPRESS": '''Suppressible flag. 
    Values = O (obsolete content), 
    E (Non-obsolete content marked suppressible by an editor), 
    Y (Non-obsolete content deemed suppressible during inversion), 
    or N (None of the above)''',
    "CVF": "Content View Flag. Bit field used to flag rows included in Content View.",
}
mrconso_columns_names = [key.lower() for key in mrconso_structure.keys()]

In [None]:
mrsty_structure = {
    "CUI": "Unique identifier of concept",
    "TUI": "Unique identifier of Semantic Type",
    "STN": "Semantic Type tree number",
    "STY": "Semantic Type. The valid values are defined in the Semantic Network.",
    "ATUI": "Unique identifier for attribute",
    "CVF": "Content View Flag. Bit field used to flag rows included in Content View.",
}
mrsty_columns_names = [key.lower() for key in mrsty_structure.keys()]

## Load from files

In [None]:
umls_full_df = pd.read_csv(mrconso_path, 
                      sep="|", 
                      header=None, 
                      names=mrconso_columns_names, 
                      index_col=False, 
                      dtype=str)

In [None]:
tui_df = pd.read_csv(mrsty_path, 
                     sep="|", 
                     header=None, 
                     names=mrsty_columns_names, 
                     index_col=False)
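
RRF files are pipe-delimited, have no header row, and end every line with a trailing `|`, which is why the column names are passed explicitly and `index_col=False` is set. The layout can be sketched with the standard library alone; `parse_rrf_line` is a hypothetical helper and the row below is made up for illustration, not taken from UMLS:

```python
# Column names in MRSTY.RRF order (same as mrsty_columns_names in the notebook)
MRSTY_COLUMNS = ["cui", "tui", "stn", "sty", "atui", "cvf"]


def parse_rrf_line(line, columns):
    """Split one RRF line into a dict, dropping the empty field after the trailing '|'."""
    fields = line.rstrip("\n").split("|")
    if fields and fields[-1] == "":
        fields = fields[:-1]
    if len(fields) != len(columns):
        raise ValueError(f"Expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


# Made-up row, for illustration only
example_line = "C0000000|T047|B2.2.1.2.1|Disease or Syndrome|AT00000000||\n"
record = parse_rrf_line(example_line, MRSTY_COLUMNS)
```

Here the last real column (`cvf`) is empty, so the line ends with two pipes: one closing the empty field and one trailing delimiter.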

## Filter

In [None]:
# Selected tty list
tty_selection = ['PT',  # Designated preferred name
                 'LLT', # Lower Level Term
                 'MH',  # Main heading
                 'SY',  # Designated synonym
                ]

# Selected sab list
selected_sab = ['MSHFRE',
                'SNOMEDCT_US',
                'DRUGBANK',
                'RXNORM',
                'MTH',
                'ICPC2ICD10ENG',
                'ICD10',
                'HPO',
                'MDRFRE',
                'ICPCFRE',
                'WHOFRE',
                'ATC',
                'MTHMSTFRE',
               ]

# Filter by tty
umls_df = umls_full_df.loc[umls_full_df.tty.isin(tty_selection)]

# Filter by sab
filtered_umls_df = umls_df.loc[umls_df.sab.isin(selected_sab)]

# Separate French concepts
#french_umls_df = umls_df[umls_df.lat == "FRE"]

## Define a cui to tui mapping (types for each cui)

In [None]:
#cui_tui_mapping = tui_df.groupby('cui')['tui'].apply(list).to_dict()

In [None]:
# Define tuis to remove (based on https://github.com/umcu/dutch-medical-concepts)
tuis_to_remove = [
    
    # Concepts & Ideas
    'T078', # Idea or Concept
    'T089', # Regulation or Law

    # Living beings
    'T011', # Amphibian
    'T008', # Animal
    'T012', # Bird
    'T013', # Fish
    'T015', # Mammal
    'T001', # Organism
    'T002', # Plant
    'T014', # Reptile
    'T010', # Vertebrate
    
    # Objects
    'T168', # Food
    
    # Organizations
    'T093', # Healthcare Related Organization
    
    # Geographic areas
    'T083', # Geographic Area
]

In [None]:
# Get type labels
types_df = pd.read_csv("umls_types.csv")

# Store the type labels in a dictionary (tui -> label)
type_dict = dict(zip(types_df["tui"], types_df["label"]))

# SnomedCT Belgian extension

## Load from files

In [None]:
# Load Belgian Snomed CT file
snomedct_be_fr_df = pd.read_csv(snomedct_be_fr_path, sep="\t")

# Keep only active concepts
snomedct_be_fr_df = snomedct_be_fr_df[snomedct_be_fr_df["active"] == 1]

# Rename tty and scui columns
snomedct_be_fr_df.rename({'typeId': 'tty', 'conceptId': 'scui'}, inplace=True, axis=1)

# convert to string
snomedct_be_fr_df['id'] = snomedct_be_fr_df['id'].astype('string')
snomedct_be_fr_df['scui'] = snomedct_be_fr_df['scui'].astype('string')
snomedct_be_fr_df['tty'] = snomedct_be_fr_df['tty'].astype('string')

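# Map description types to MedCAT's P (Preferred term) & A values:
# 900000000000003001 is the fully specified name, 900000000000013009 a synonym.
# Assign the result back instead of using a chained inplace replace.
snomedct_be_fr_df['tty'] = snomedct_be_fr_df['tty'].replace({'900000000000003001': 'P',
                                                             '900000000000013009': 'A'})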

## Generate Snomed CT (scui) to UMLS (cui) mapping

In [None]:
# Get Snomed CT to UMLS mapping
snomed_us = umls_df[umls_df['sab'] == "SNOMEDCT_US"]
raw_snomed_umls_mapping = snomed_us.groupby('scui')['cui'].apply(list).to_dict()

In [None]:
# Remove SnomedCT concepts mapping to multiple UMLS CUI
snomed_umls_mapping = {}
for key, value in raw_snomed_umls_mapping.items():
    value = set(value)
    # Keep only unambiguous cuis
    if len(value) == 1:
        snomed_umls_mapping[key] = value.pop()
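
The filtering step above keeps a SnomedCT code only when all of its UMLS rows point to the same CUI. The same idea on toy data, as a sketch; `unambiguous_mapping` is a hypothetical helper and the identifiers are made up:

```python
from collections import defaultdict


def unambiguous_mapping(pairs):
    """Map each source code to its CUI, keeping only codes linked to exactly one CUI."""
    grouped = defaultdict(set)
    for code, cui in pairs:
        grouped[code].add(cui)
    return {code: next(iter(cuis)) for code, cuis in grouped.items() if len(cuis) == 1}


# Made-up (scui, cui) pairs for illustration
toy_pairs = [
    ("111", "C0000001"),
    ("111", "C0000001"),  # duplicate rows collapse to a single CUI
    ("222", "C0000002"),
    ("222", "C0000003"),  # ambiguous: two different CUIs, so "222" is dropped
]
mapping = unambiguous_mapping(toy_pairs)
```

Dropping ambiguous codes trades coverage for precision: a Belgian term is only attached to a CUI when there is no doubt about which one.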

In [None]:
# Map Belgian SnomedCT to UMLS
umls_mapped_snomed_be_fr = []
for row in tqdm(snomedct_be_fr_df.itertuples(index=False), total=len(snomedct_be_fr_df), desc="Mapping SnomedCT Be to UMLS"):
    # Keep only descriptions whose concept maps to a single UMLS CUI
    if row.scui in snomed_umls_mapping:
        cui = snomed_umls_mapping[row.scui]
        umls_mapped_snomed_be_fr.append([cui, row.term, 'SNOMEDCT_BE_FR', row.tty])
umls_mapped_snomed_be_fr = pd.DataFrame(umls_mapped_snomed_be_fr, columns = ['cui', 'str', 'sab', 'tty'])

# Drug names

## Extract ATC and DRUGBANK

In [None]:
# Get drugs from UMLS
umls_drugs_df = umls_full_df[umls_full_df.sab.isin(['ATC', 'DRUGBANK'])]
atc_drugs_df = umls_full_df[umls_full_df.sab.isin(['ATC'])]

## Extract data from SAM 
<font color="red">Warning: loading SAM uses around 25 GB of RAM and can take a few minutes.</font>

In [None]:
# Parse SAM
with open(sam_amp_file, 'r') as f:
    sam_bf = BeautifulSoup(f, "xml")

In [None]:
# Find drugs
sam_drug_list = sam_bf.find("ns4:ExportActualMedicines").find_all("ns4:Amp", recursive=False)

In [None]:
# Create a dictionary to store names and atc codes
sam_atc_dict = {
    "name": [],
    "atc_code": []
}

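# Define regex rule to remove doses: keep everything before the first digit
# (raw string so that \d is not treated as an invalid string escape)
drug_regex = re.compile(r"^([^\d]*)")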

# Extract name and ATC code for each drug
for drug in tqdm(sam_drug_list, desc="Extracting names with ATC codes"):
    data_list = drug.find_all("ns4:Data", recursive=False)
    ampp_list = drug.find_all("ns4:Ampp", recursive=False)
    
    code = None
    for ampp in ampp_list:
        atc = ampp.find("ns4:Atc")
        if atc:
            if not code:
                code = atc['code']
                
            elif code != atc['code']:
                #print("Code not equal:", atc['code'], code)
                code = None
                break
    
    if code:
        name_list = []
        for data in data_list:
            if data.Name:
                cur_name = drug_regex.search(data.Name.find("ns2:Fr").contents[0]).group(0).split("(")[0].rstrip()
                # Skip empty names (e.g. names starting with a digit) and duplicates
                if cur_name and cur_name not in name_list:
                    name_list.append(cur_name)

        for name in name_list:
            sam_atc_dict['name'].append(name)
            sam_atc_dict['atc_code'].append(code)

# Structure drug's names with ATC codes in a dataframe
sam_atc_df = pd.DataFrame(sam_atc_dict).drop_duplicates()
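
The name cleaning in the loop above combines three steps: keep the text before the first digit, cut at the first `(`, and strip trailing spaces. A small sketch of that chain in isolation; `clean_drug_name` is a hypothetical helper and the strings are made up, not taken from the SAM export:

```python
import re

# Same pattern as the notebook, written as a raw string
drug_regex = re.compile(r"^([^\d]*)")


def clean_drug_name(raw_name):
    """Keep the text before the first digit, then drop anything after '(' and trailing spaces."""
    return drug_regex.search(raw_name).group(0).split("(")[0].rstrip()


# Illustrative strings only
examples = ["Exemplol 500 mg comprimé", "Exemplol Forte (générique) 400 mg", "5-Exemplol"]
cleaned = [clean_drug_name(name) for name in examples]
```

Note the third case: a name that starts with a digit is reduced to an empty string, which may be worth filtering out before mapping.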

## Create ATC to UMLS mapping

In [None]:
# Store UMLS CUI as string in the mapping and remove duplicates if any
raw_atc_umls_mapping = atc_drugs_df.groupby('code')['cui'].apply(list).to_dict()

atc_umls_mapping = {}
for key, value in raw_atc_umls_mapping.items():
    value = set(value)
    # Keep only unambiguous cuis
    if len(value) == 1:
        atc_umls_mapping[key] = value.pop().upper()

## Map SAM to UMLS

In [None]:
# Map SAM AMP to UMLS
umls_mapped_sam = []
for name, atc_code in tqdm(zip(sam_atc_df['name'], sam_atc_df['atc_code']), total=len(sam_atc_df), desc="Mapping SAM to UMLS"):
    # Normalize the code once so the membership test and the lookup use the same key
    atc_code = atc_code.upper()
    if atc_code in atc_umls_mapping:
        umls_mapped_sam.append([atc_umls_mapping[atc_code], name, 'SAM', 'A'])
umls_mapped_sam = pd.DataFrame(umls_mapped_sam, columns = ['cui', 'str', 'sab', 'tty'])

## Merge ATC, Drugbank and SAM

In [None]:
all_drugs_df = pd.concat([umls_drugs_df[['cui', 'str', 'sab', 'tty']], umls_mapped_sam])

# Drop duplicates
all_drugs_df = all_drugs_df.drop_duplicates()

# MedCAT output

## Merge UMLS terms with French Belgian SnomedCT terms and drug names

In [None]:
# Merge umls with mapped Belgian SnomedCT
medcat_df = pd.concat([filtered_umls_df[["cui", "str", "sab", "tty"]], umls_mapped_snomed_be_fr, all_drugs_df])

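# Replace tty for MedCAT (assign the result back instead of a chained inplace replace)
medcat_df['tty'] = medcat_df['tty'].replace({'PT': 'P',
                                             'LLT': 'A',
                                             'MH': 'A',
                                             'SY': 'A',
                                            })

# Lowercase title-cased words, leaving acronyms and mixed-case tokens untouched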
medcat_df['str'] = medcat_df['str'].apply(lambda name: " ".join([(part.lower() if part.istitle() else part) for part in str(name).split(' ')]))
medcat_df['str'] = medcat_df['str'].apply(lambda name: "-".join([(part.lower() if part.istitle() else part) for part in str(name).split('-')]))

# Add tui
medcat_df = medcat_df.merge(tui_df, how='left', on='cui')[["cui", "str", "tui", "sab", "tty"]]

# Remove unwanted tuis
medcat_df = medcat_df[~medcat_df.tui.isin(tuis_to_remove)]

# Drop duplicates
medcat_df = medcat_df.drop_duplicates()

# Merging tuis
medcat_df = medcat_df.groupby(['cui', 'str', 'tty', 'sab'])['tui'].apply('|'.join).reset_index()

# Rename columns to the names expected by MedCAT
medcat_df.rename(columns={'str': 'name', 'tty': 'name_status', 'sab': 'ontologies', 'tui': 'type_ids'}, inplace=True)

# Merging ontologies
medcat_df = medcat_df.groupby(['cui', 'name', 'type_ids'], as_index=False).agg({'ontologies' : lambda onto: '|'.join(list(set(onto))), 'name_status' : '|'.join}).copy()

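# Collapse merged statuses: 'P' if any merged row was preferred, otherwise 'A'.
# Splitting on '|' avoids matching the letter P inside other term types (e.g. 'RXN_PT').
medcat_df['name_status'] = medcat_df['name_status'].apply(lambda name_status: 'P' if 'P' in name_status.split('|') else 'A')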

# Sort by cui (preferred names first) and reset the index
medcat_df.sort_values(by=['cui', 'name_status'], ascending=[True, False], inplace=True)
medcat_df.reset_index(drop=True,inplace=True)

In [None]:
medcat_df.to_csv(medcat_csv_path, index=None)
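
The case normalization applied to `str` in this section only lowercases parts that are title-cased, first splitting on spaces and then on hyphens, so acronyms and mixed-case tokens survive. A standalone sketch of the two passes; `lower_title_parts` and `normalize_case` are hypothetical names, and the terms are illustrative:

```python
def lower_title_parts(name, sep):
    """Lowercase the title-cased parts of `name` when split on `sep`; leave other parts as-is."""
    return sep.join(part.lower() if part.istitle() else part for part in str(name).split(sep))


def normalize_case(name):
    """Apply the two passes used in the notebook: split on spaces, then on hyphens."""
    return lower_title_parts(lower_title_parts(name, " "), "-")


# Illustrative terms
normalized = [normalize_case(term) for term in ["Infarctus du Myocarde", "ADN polymérase", "pH-Mètre"]]
```

The second pass matters for tokens such as `pH-Mètre`: the whole token is not title-cased, but its `Mètre` part is.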

# BRAT Normalization output

In [None]:
# Group names by cui and type ids
brat_df = medcat_df.groupby(["cui", "type_ids"])["name"].apply(list).reset_index()

In [None]:
# Iterate over concepts. Generate one line per concept with all names and types labels
lines = []
for row in tqdm(brat_df.iloc, total=len(brat_df)):
    # Get cui for current line
    cur_line = row.cui
    
    # Add names with BRAT format
    for name in row['name']:
        cur_line += "\tname:Name:" + name
        
    # Split types in a list
    type_ids = row.type_ids.split("|")
    
    # Add types labels to the current line
    for tui in type_ids:
        cur_line += "\tattr:Type:" + tui + "|" + type_dict[tui]
        
    # Add current line to list of all lines
    lines.append(cur_line)

In [None]:
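# Save the BRAT normalization formatted ontology to file
# (explicit UTF-8 so accented French terms are written consistently across platforms)
with open(brat_output_path, "w", encoding="utf-8") as f: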
    for line in lines:
        f.write(line + "\n")
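
Each line written above is tab-separated: the CUI, one `name:Name:` entry per term, then one `attr:Type:` entry per semantic type. A compact sketch of the same construction; `brat_line` is a hypothetical helper and the concept is made up:

```python
def brat_line(cui, names, type_ids, type_labels):
    """Build one tab-separated line: cui, one name:Name entry per name, one attr:Type entry per type."""
    parts = [cui]
    parts += ["name:Name:" + name for name in names]
    parts += ["attr:Type:" + tui + "|" + type_labels[tui] for tui in type_ids.split("|")]
    return "\t".join(parts)


# Made-up concept for illustration
line = brat_line(
    "C0000001",
    ["infarctus du myocarde", "crise cardiaque"],
    "T047",
    {"T047": "Disease or Syndrome"},
)
```

As in the notebook, a type id that is missing from the label dictionary raises a `KeyError`, so the labels file needs to cover every type that remains after filtering.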

# CUI grouped CSV

In [None]:
grouped_df = brat_df[["cui", "name"]].copy()
grouped_df["name"] = grouped_df["name"].apply("</s>".join)

In [None]:
grouped_df.to_csv(grouped_output_path, index=None)

# sapBERT pairs output

## Get English and other-language terms referring to the same CUIs as the current dataset

In [None]:
# Using the process from sapBERT ( https://github.com/cambridgeltl/sapbert ) to generate pairs in the form
# label_id || entity_name_1 || entity_name_2

In [None]:
multi_lang_df = umls_df[umls_df["lat"].isin(sap_lang)]

In [None]:
# Get all UMLS terms for cui in custom ontology
english_custom_df = multi_lang_df[multi_lang_df['cui'].isin(medcat_df['cui'].tolist())]
english_custom_df = english_custom_df.rename(columns={'str': 'name'})

In [None]:
# Merge the multilingual UMLS terms with the custom ontology
sap_df = pd.concat([medcat_df, english_custom_df])[['cui', 'name']]
sap_df = sap_df.drop_duplicates()

## Generate positive pairs

In [None]:
def gen_pairs(input_list):
    return list(itertools.combinations(input_list, r=2))

In [None]:
# Group names as a list for each cui
cui_names = sap_df.groupby(['cui'])['name'].apply(list).reset_index()

In [None]:
df = cui_names['name'].apply(gen_pairs).reset_index()

In [None]:
pos_pairs = []

len_pairs = []

for row in tqdm(cui_names.iloc, total=len(cui_names), desc="Generating positive pairs"):
    name_list = row['name']
    cui = row['cui']
    pairs = gen_pairs(name_list)
    len_pairs.append(len(pairs))
    if len(pairs)>50: # if >50 pairs, then trim to 50 pairs
        pairs = random.sample(pairs, 50)
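    for cur_pair in pairs:
        try:
            line = cui + "||" + cur_pair[0] + "||" + cur_pair[1]
        except TypeError:
            # Report and skip pairs containing a non-string name (e.g. NaN),
            # instead of appending a stale or undefined line
            print(cur_pair)
            continue
        pos_pairs.append(line)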
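
The pair generation can also be written as a small function, which makes the 50-pair cap explicit and allows a seeded random generator so that the sampled pairs are repeatable between runs. This is a sketch under those assumptions, not the notebook's exact code; `positive_pairs`, `max_pairs` and `rng` are hypothetical names:

```python
import itertools
import random


def positive_pairs(cui, names, max_pairs=50, rng=None):
    """Return 'cui||name_1||name_2' lines for all name pairs, sampled down to max_pairs."""
    if rng is None:
        rng = random.Random()
    pairs = list(itertools.combinations(names, r=2))
    if len(pairs) > max_pairs:
        pairs = rng.sample(pairs, max_pairs)
    return [f"{cui}||{first}||{second}" for first, second in pairs]


# Made-up names for illustration; three names give three pairs, so no sampling happens
lines = positive_pairs("C0000001", ["a", "b", "c"], rng=random.Random(0))
```

A concept with n names yields n(n-1)/2 candidate pairs, so the cap mainly affects concepts with more than ten names.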

## Save positive pairs to file in sapBERT format

In [None]:
with open(sap_output_path, 'w', encoding='utf-8') as f:
    for line in pos_pairs:
        f.write(line + "\n")