# Notebook for Preprocessing of Raw Data

This notebook handles the preprocessing of the raw clinical notes of the MIMIC-III dataset. MIMIC-III is a publicly, available dataset containing data on over 40,000 ICU-patients retrieved from https://physionet.org/content/mimiciii/1.4/. 

This notebook creates to subsets of MIMIC-III
- MIMIC-III-Full: Contains hospital discharge summary notes associated with 52,722 unique identifiers and has a label space of 8,907 labels
- MIMIC-III-50: Contains hospital discharge summary notes associated with 11,368 unique identifiers with the 50 most frequent labels.

For the purpose of reproducibility we utilize existing information about the splits (i.e., training, validation, testing - https://github.com/jamesmullenbach/caml-mimic/tree/master/mimicdata/mimic3)  and the selection of the 50 most-frequent labels (https://github.com/jamesmullenbach/caml-mimic/blob/master/notebooks/dataproc_mimic_III.ipynb).


Click *Cell* --> *Run All* to execute preprocessing of the clinical documents. 

This may take some time. 

This notebook automatically runs the notebook with the name **notebook_preprocessing_icd9_code_descriptions** controlling the preprocessing of the ICD-9 code descriptions. 

# Load Libraries

In [1]:
# Built-in libraries
import os
import sys
import pickle
from typing import List
import re



# Set root path - it is expected to have the raw data stored in the following structure
# /root/data/raw/NOTESEVENTS.csv
# The processed data are saved to the following location
# /root/data/processed/Data.pkl
try:
    root = os.path.dirname(os.path.dirname(oa.path.abspath(__file__)))
except NameError:
    root = os.path.dirname(os.getcwd())
sys.path.append(root)

# Installed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from torchtext.vocab import build_vocab_from_iterator
from gensim.models import Word2Vec
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# custom libraries
from src.tools.utils_preprocess import rel2abs, preproc_clinical_notes
from src.tools.utils_preprocess import create_splits, get_class_type

[nltk_data] Downloading package punkt to /Users/cmetzner/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cmetzner/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/cmetzner/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
subsets = ['MimicFull', 'Mimic50']
# Set variables
SEED = 42

# Change paths to location of raw data (PROCEDURES_ICD.csv and DIAGNOSES_ICD.csv)
path_data_raw = os.path.join(root, 'data', 'raw')
# Change paths to location of processed data
path_data_proc = os.path.join(root, 'data', 'processed')
# Change paths to location where code descriptions are stored (CMS32_DESC_LONG_SHORT_DX.xlsx and CMS32_DESC_LONG_SHORT_SG.xlsx)
path_data_external = os.path.join(root, 'data', 'external')


# Create directories to store dataset specific data
for subset in subsets:
    if not os.path.exists(os.path.join(path_data_proc, f'data_{subset}/')):
        os.makedirs(os.path.join(path_data_proc, f'data_{subset}/'))

# Preprocess ICD-9 Procedure/Diagnoses Codes

In [3]:
# Load ICD-9 procedure and diagnosis codes
df_proc = pd.read_csv(os.path.join(path_data_raw, 'PROCEDURES_ICD.csv'))
df_diag = pd.read_csv(os.path.join(path_data_raw, 'DIAGNOSES_ICD.csv'))

# Remove all rows that have no ICD9-codes
df_proc = df_proc[~df_proc['ICD9_CODE'].isna()]
df_diag = df_diag[~df_diag['ICD9_CODE'].isna()]

print('PROCEDURE CODES:')
print(f'Shape of procedure data: {df_proc.shape}')
print(f"Number of unique patient ids (SUBJECT_ID): {df_proc.SUBJECT_ID.nunique()}")
print(f"Number of unique hospital admission ids (HADM_ID): {df_proc.HADM_ID.nunique()}")
print(f"Number of unique ICD-9 procedure codes: {df_proc.ICD9_CODE.nunique()}")
print()
print('DIAGNOSES CODES:')
print(f'Shape of diagnosis data: {df_diag.shape}')
print(f"Number of unique patient ids (SUBJECT_ID): {df_diag.SUBJECT_ID.nunique()}")
print(f"Number of unique hospital admission ids (HADM_ID): {df_diag.HADM_ID.nunique()}")
print(f"Number of unique ICD-9 diagnosis codes: {df_diag.ICD9_CODE.nunique()}")

print(f'\nTotal number of ICD-9 codes in raw data of MIMIC-III: {df_proc.ICD9_CODE.nunique() + df_diag.ICD9_CODE.nunique()}')

PROCEDURE CODES:
Shape of procedure data: (240095, 5)
Number of unique patient ids (SUBJECT_ID): 42214
Number of unique hospital admission ids (HADM_ID): 52243
Number of unique ICD-9 procedure codes: 2009

DIAGNOSES CODES:
Shape of diagnosis data: (651000, 5)
Number of unique patient ids (SUBJECT_ID): 46517
Number of unique hospital admission ids (HADM_ID): 58929
Number of unique ICD-9 diagnosis codes: 6984

Total number of ICD-9 codes in raw data of MIMIC-III: 8993


In [4]:
# Set ICD-9 codes to strings
df_proc['ICD9_CODE'] = df_proc['ICD9_CODE'].astype(str)
df_diag['ICD9_CODE'] = df_diag['ICD9_CODE'].astype(str)

# Remove all whitespace
df_proc['ICD9_CODE'] = df_proc.apply(lambda x: x.ICD9_CODE.strip(), axis=1)
df_diag['ICD9_CODE'] = df_diag.apply(lambda x: x.ICD9_CODE.strip(), axis=1)

# Transform ICD-9 code from relative to absolute represenation
df_proc['ABS_CODE'] = df_proc.apply(lambda x: rel2abs(x.ICD9_CODE, flag_proc=True), axis=1)
df_diag['ABS_CODE'] = df_diag.apply(lambda x: rel2abs(x.ICD9_CODE, flag_proc=False), axis=1)

# Add column to dataframe indicating type of ICD-9 code
df_proc['ICD9_TYPE'] = 'procedure'
df_diag['ICD9_TYPE'] = 'diagnosis'

# Concat dataframes containing procedure and diagnosis codes
df_codes = pd.concat([df_diag, df_proc])

# Show final df_codes
display(df_codes.head(3))

# Save codes as csv file at /root/data/processed/ALL_CODES.csv
df_codes.to_csv(os.path.join(path_data_proc, 'ALL_CODES.csv'), index=False,
                columns=['SUBJECT_ID', 'HADM_ID', 'ABS_CODE', 'ICD9_TYPE'],
                header=['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE', 'ICD9_TYPE'])

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE,ABS_CODE,ICD9_TYPE
0,1297,109,172335,1.0,40301,403.01,diagnosis
1,1298,109,172335,2.0,486,486.0,diagnosis
2,1299,109,172335,3.0,58281,582.81,diagnosis


In [5]:
# Load clinical notes dataset
notes = pd.read_csv(os.path.join(path_data_raw, 'NOTEEVENTS.csv'))

# We only consider notes of the category 'Discharge Summary'; these notes can include potential addenda
notes = notes.loc[notes.CATEGORY == 'Discharge summary']
notes = notes[['SUBJECT_ID', 'HADM_ID', 'TEXT']]

In [6]:
# Stores preprocessed clinical notes at /root/data/processed/CLEANED_NOTES.pkl
notes = preproc_clinical_notes(df_notes=notes,
                               path_data_proc=path_data_proc)

  0%|          | 0/52726 [00:00<?, ?it/s]

  0%|          | 0/52726 [00:00<?, ?it/s]

In [7]:
notes.head(3)

Unnamed: 0,SUBJECT_ID,HADM_ID,TEXT,TOKENS
0,22532,167853.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
1,13702,107527.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
2,13702,167118.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."


# Combine cleaned Notes and ICD-9 Labels

This section combines the cleaned clinical notes (i.e., hospital discharge summaries) and the ICD-9 procedure and diagnoses codes on unique hospital admission ids.



In [8]:
notes = pd.read_pickle(os.path.join(path_data_proc, 'CLEANED_NOTES.pkl'))
notes.head(3)

Unnamed: 0,SUBJECT_ID,HADM_ID,TEXT,TOKENS
0,22532,167853.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
1,13702,107527.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
2,13702,167118.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."


In [9]:
codes = pd.read_csv(os.path.join(path_data_proc, 'ALL_CODES.csv'), dtype={'ICD9_CODE': str})
codes.head(3)

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE,ICD9_TYPE
0,109,172335,403.01,diagnosis
1,109,172335,486.0,diagnosis
2,109,172335,582.81,diagnosis


In [10]:
# 1. Align HADM_IDS of codes and notes
# Some HADM_ID have no codes but no notes
# Some HADM_ID have notes but no codes (we already filtered out the HADM_ID without codes)

# Get all hospital admission ids that have notes
hadm_notes = notes.HADM_ID.unique().tolist()

# Retrieve codes for those hospital admission ids
codes = codes[codes['HADM_ID'].isin(hadm_notes)]

# Number of hospital admissions with clinical notes
print(f'Number of hospital admission ids with clinical notes: {notes.shape[0]}')

# Check number of hospital admission id's in codes dataset
print(f'Number of hospital admission ids that have codes and notes: {codes.HADM_ID.nunique()}')

# Apparently, we have 4 hospital admission id's that have notes but no codes
# Let's filter them out.
# Get hospital admission ids with notes but are not included in the codes dataframe
display(notes[~notes['HADM_ID'].isin(codes.HADM_ID.unique().tolist())])

hadm_notes_wo_codes = notes[~notes['HADM_ID'].isin(codes.HADM_ID.unique().tolist())].HADM_ID.tolist()
print(hadm_notes_wo_codes)

Number of hospital admission ids with clinical notes: 52726
Number of hospital admission ids that have codes and notes: 52722


Unnamed: 0,SUBJECT_ID,HADM_ID,TEXT,TOKENS
16238,13567,110220.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
21393,31866,182252.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
23115,24975,109963.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
32306,17796,142890.0,admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."


[110220.0, 182252.0, 109963.0, 142890.0]


In [11]:
notes = notes[~notes['HADM_ID'].isin(hadm_notes_wo_codes)]

In [12]:
# 1. Check if both dataframes have same number of hospital admission id's
print('Check if number of HADM_IDs are equal in both dataframes.')
print(f'Codes == Notes : {codes.HADM_ID.nunique()} == {notes.HADM_ID.nunique()}')

Check if number of HADM_IDs are equal in both dataframes.
Codes == Notes : 52722 == 52722


# Create MIMIC-III-Full and MIMIC-III-50 Subsets

In [13]:
def get_label_frequency_stats(df_codes: pd.DataFrame,
                              dataset: str,
                             labels: List[str]) -> None:
    
    label_counts = df_codes.ICD9_CODE.value_counts()
    arr= label_counts.values
    # Get percentiles at 25, 50, and 75
    q25 = np.quantile(arr, 0.25)
    q50 = np.quantile(arr, 0.5)
    q75 = np.quantile(arr, 0.75)
    
    print(f'Q1: {q25}')
    print(f'Median: {q50}')
    print(f'Q3: {q75}')
    
    # Create dictionary for storing the labels' affiliation to a certain quartile (Q1, Q2, Q3, or Q4)
    _d = {}
    
    for label, value in label_counts.items():
        _d[label] = {'freq': value}
        if value >= q75:
            _d[label]['quartile'] = 3  # (Q3, +inf)
        elif q50 <= value < q75:
            _d[label]['quartile'] = 2  # (Q2, Q3)
        elif q25 <= value < q50:
            _d[label]['quartile'] = 1  # (Q1, Q2)
        else:
            _d[label]['quartile'] = 0  # (-inf, Q1)
            
    l_codes_quartiles = []
    for label in labels:
        l_codes_quartiles.append(_d[label]["quartile"])
        
        
    with open(os.path.join(path_data_proc, f'data_{dataset}', f'l_codes_quartiles_{dataset}.pkl'), 'wb') as f:
        pickle.dump(l_codes_quartiles, f)

    with open(os.path.join(path_data_proc, f'data_{dataset}', f'l_codes_dict_{dataset}.pkl'), 'wb') as f:
        pickle.dump(_d, f)

## MIMIC-III-Full

## Dataset: 'full' set of codes in MIMIC-III:

In [14]:
codes_full = codes.copy()
# Get list of unique labels 
l_codes_full = sorted(codes_full.ICD9_CODE.unique().tolist())
print(f'Number of billable code labels: {len(l_codes_full)}')
with open(os.path.join(path_data_proc, f'data_MimicFull', 'l_codes_MimicFull.pkl'), 'wb') as f:
    pickle.dump(l_codes_full, f)

# Aggregate individual label mentions in lists
codes_full = codes_full.groupby(['SUBJECT_ID', 'HADM_ID']).agg({'ICD9_CODE': list}).reset_index()

# Remove redundant labels in lists
codes_full['ICD9_CODE'] = codes_full.apply(lambda x: sorted(list(set(x.ICD9_CODE))), axis=1)

# Generate full dataset
df_full = codes_full.merge(notes, how='left', on=['SUBJECT_ID', 'HADM_ID']).sort_values(['SUBJECT_ID', 'HADM_ID'])

# Save dataset
df_full.to_pickle(os.path.join(path_data_proc, 'data_MimicFull', 'DATA_MimicFull.pkl'))

Number of billable code labels: 8907


In [15]:
MIN_FREQ = 3
create_splits(subset='MimicFull', path_data_proc=path_data_proc, min_freq=MIN_FREQ)


Current subset: MimicFull
Current split: train
Vocab size for train: 120481
Current split: test
Current split: val


In [18]:
# Get statistics for MIMIC-III-50
hadm_ids = []
print('Data Description for MIMIC-III-Full')
for split in ['train', 'val', 'test']:
    print(f'Description for {split} split:')
    split_ids = pd.read_csv(os.path.join(path_data_proc, f'data_MimicFull', f'ids_MimicFull_{split}.csv')).HADM_ID.tolist()
    hadm_ids += split_ids
    
    df_full_split = df_full[df_full['HADM_ID'].isin(split_ids)]
    print(f'Number of unique patient ids: {df_full_split.SUBJECT_ID.nunique()}')
    print(f'Number of unique hospital admission ids in {split} dataset: {len(split_ids)}')
    

df_full = df_full[df_full['HADM_ID'].isin(hadm_ids)]
df_full_codes = df_full.explode('ICD9_CODE')
print(f'Number of unique codes: {len(df_full_codes.ICD9_CODE.unique())}')
print()
df_full['avg_words_doc'] = df_full.apply(lambda x: len(x.TOKENS), axis=1)
avg_words_docs = np.mean(df_full['avg_words_doc'])
std_words_docs = np.std(df_full['avg_words_doc'])
df_full['avg_labels_doc'] = df_full.apply(lambda x: len(x.ICD9_CODE), axis=1)
avg_labels_doc = np.mean(df_full['avg_labels_doc'])
print(f'Average Number of Words per Document: {avg_words_docs}')
print(f'Std of Words per Document: {std_words_docs}')
print(f'Average Number of Labels per Document: {avg_labels_doc}')

# Get frequency and quartile
print()
get_label_frequency_stats(df_full_codes, dataset='MimicFull', labels=l_codes_full)

Data Description for MIMIC-III-Full
Description for train split:
Number of unique patient ids: 36997
Number of unique hospital admission ids in train dataset: 47719
Description for val split:
Number of unique patient ids: 1374
Number of unique hospital admission ids in val dataset: 1631
Description for test split:
Number of unique patient ids: 2755
Number of unique hospital admission ids in test dataset: 3372
Number of unique codes: 8907

Average Number of Words per Document: 1861.5246766055916
Std of Words per Document: 948.6767511098549
Average Number of Labels per Document: 15.8828193164144

Q1: 2.0
Median: 6.0
Q3: 28.0


## MIMIC-III-50

In [19]:
# MIMIC-III-50: 50-most-frequent codes retrieved from https://github.com/jamesmullenbach/caml-mimic/blob/master/notebooks/dataproc_mimic_III.ipynb
l_codes_50 = ['038.9',
 '244.9',
 '250.00',
 '272.0',
 '272.4',
 '276.1',
 '276.2',
 '285.1',
 '285.9',
 '287.5',
 '305.1',
 '311',
 '33.24',
 '36.15',
 '37.22',
 '38.91',
 '38.93',
 '39.61',
 '39.95',
 '401.9',
 '403.90',
 '410.71',
 '412',
 '414.01',
 '424.0',
 '427.31',
 '428.0',
 '45.13',
 '486',
 '496',
 '507.0',
 '511.9',
 '518.81',
 '530.81',
 '584.9',
 '585.9',
 '599.0',
 '88.56',
 '88.72',
 '93.90',
 '96.04',
 '96.6',
 '96.71',
 '96.72',
 '99.04',
 '99.15',
 '995.92',
 'V15.82',
 'V45.81',
 'V58.61']

# Save
with open(os.path.join(path_data_proc, 'data_Mimic50', 'l_codes_Mimic50.pkl'), 'wb') as f:
    pickle.dump(l_codes_50, f)

In [20]:
codes_50 = codes.copy()
codes_50 = codes_50[codes_50['ICD9_CODE'].isin(l_codes_50)]

# Aggregate individual label mentions in lists
codes_50 = codes_50.groupby(['SUBJECT_ID', 'HADM_ID']).agg({'ICD9_CODE': list}).reset_index()

# Remove redundant labels in lists
codes_50['ICD9_CODE'] = codes_50.apply(lambda x: sorted(list(set(x.ICD9_CODE))), axis=1)

# Generate full dataset
df_50 = codes_50.merge(notes, how='left', on=['SUBJECT_ID', 'HADM_ID']).sort_values(['SUBJECT_ID', 'HADM_ID'])

# Save dataset
df_50.to_pickle(os.path.join(path_data_proc, 'data_Mimic50', 'DATA_Mimic50.pkl'))

In [21]:
MIN_FREQ = 3

create_splits(subset='Mimic50', path_data_proc=path_data_proc, min_freq=MIN_FREQ)


Current subset: Mimic50
Current split: train
Vocab size for train: 50176
Current split: test
Current split: val


In [25]:
# Get statistics for MIMIC-III-50
hadm_ids = []
print('Data Description for MIMIC-III-50')
for split in ['train', 'val', 'test']:
    print(f'Description for {split} split:')
    split_ids = pd.read_csv(os.path.join(path_data_proc, f'data_Mimic50', f'ids_Mimic50_{split}.csv')).HADM_ID.tolist()
    hadm_ids += split_ids
    
    df50_split = df_50[df_50['HADM_ID'].isin(split_ids)]
    print(f'Number of unique patient ids: {df50_split.SUBJECT_ID.nunique()}')
    print(f'Number of unique hospital admission ids in {split} dataset: {len(split_ids)}')
    

df50 = df_50[df_50['HADM_ID'].isin(hadm_ids)]
df_50_codes = df50.explode('ICD9_CODE')
print(f'Number of unique codes: {len(df_50_codes.ICD9_CODE.unique())}')
print()
df50['avg_words_doc'] = df50.apply(lambda x: len(x.TOKENS), axis=1)
avg_words_docs = np.mean(df50['avg_words_doc'])
std_words_docs = np.std(df50['avg_words_doc'])
df50['avg_labels_doc'] = df50.apply(lambda x: len(x.ICD9_CODE), axis=1)
avg_labels_doc = np.mean(df50['avg_labels_doc'])
print(f'Average Number of Words per Document: {avg_words_docs}')
print(f'Std of Words per Document: {std_words_docs}')
print(f'Average Number of Labels per Document: {avg_labels_doc}')

# Get frequency and quartile
print()
get_label_frequency_stats(df_50_codes, dataset='Mimic50', labels=l_codes_50)

Data Description for MIMIC-III-50
Description for train split:
Number of unique patient ids: 7501
Number of unique hospital admission ids in train dataset: 8066
Description for val split:
Number of unique patient ids: 1329
Number of unique hospital admission ids in val dataset: 1573
Description for test split:
Number of unique patient ids: 1526
Number of unique hospital admission ids in test dataset: 1729
Number of unique codes: 50

Average Number of Words per Document: 1989.295830401126
Std of Words per Document: 968.2481012006912
Average Number of Labels per Document: 5.770496129486277

Q1: 751.75
Median: 1021.0
Q3: 1531.5


# Preprocessing of Raw ICD-9 Code Descriptions

This notebook handles the preprocessing of the raw ICD-9 Code Descriptions related to the MIMIC-III dataset. The descriptions were taken from https://physionet.org/content/mimiciii/1.4/.

Click *Cell* --> *Run Cells* to retrieve code description embeddings

# Processing ICD-9 Codes

In [3]:
# Custom libraries
def rel2abs(x: str, flag_proc: bool = True) -> str:
    """ Function that transform relative ICD-9 code into absolute code

    Parameters
    ----------
    x : str
        relative ICD-9 code
    flag_proc : bool; default=True
        flag indicating if relative code is a procedure or diagnosis code

    Returns
    -------
    str
        absolute ICD-9 code, e.g., procedure: XX.XX / diagnosis: XXX.XX

    """
    if flag_proc:
        # Some codes are billable with only two or three digits, do not add period
        if len(x) == 2:  # Procedure codes
            return x
        else:
            return f'{x[:2]}.{x[2:]}'
    else:
        if len(x) == 3:  # Diagnosis codes
            return x
        else:
            if x[0] == 'E':
                if len(x) == 4:
                    return x
                else:
                    return f'{x[:4]}.{x[4:]}'
            else:
                return f'{x[:3]}.{x[3:]}'
            
            
def clean_desc(text: str) -> List[str]:
    """
    Function that preprocesses the text code/category descriptions by cleaning, tokenizing, and removing
    english stopwords.
    This function requires to have the following "nltk" packages installed:
        - nltk.download('punkt')
        - nltk.download('stopwords')
    
    Parameter
    ----------
    text : str
        Input code descriptions
    
    Return
    ------
    List[str]
        Preprocessed tokens for respective code description.
        
    """
    
    # 1. Transform text to be lowercase
    text = text.lower()
    # 2. Replace all newline characters to be spaces
    text = text.replace('\n', ' ')
    # 3. Remove excessive whitespace
    text = re.sub(' +', ' ', text)  
    # 4. Remove punctuations from string
    punc = ['.', '?', '!', ',', '#', ':', ';', '(', ')', '%', '/', '-', '+', '=', '&', '_']
    for p in punc:
        text = re.sub('\%s' % p, '', text)
    # 5. Tokenize string
    tokens = word_tokenize(text)
    # 6. Remove stopwords
    tokens = [word for word in tokens if not word in stopwords.words()]
    return tokens

# Preprocess ICD-9 Procedure and Diagnosis Code Descriptions


In [4]:
# DIAGNOSES CODES
diag_desc = pd.read_csv(os.path.join(root, 'data', 'external', 'D_ICD_DIAGNOSES.csv'), dtype={'ICD9_CODE': str})
diag_desc = diag_desc.drop(['ROW_ID', 'SHORT_TITLE'], axis=1)

# PROCEDURE CODES
proc_desc = pd.read_csv(os.path.join(root, 'data', 'external', 'D_ICD_PROCEDURES.csv'), dtype={'ICD9_CODE': str})
proc_desc = proc_desc.drop(['ROW_ID', 'SHORT_TITLE'], axis=1)


# Transform relative codes to absolute codes
diag_desc['ICD9_CODE'] = diag_desc.apply(lambda x: rel2abs(x.ICD9_CODE, flag_proc=False), axis=1)
proc_desc['ICD9_CODE'] = proc_desc.apply(lambda x: rel2abs(x.ICD9_CODE, flag_proc=True), axis=1)

display(diag_desc.head(5))
display(proc_desc.head(5))

Unnamed: 0,ICD9_CODE,LONG_TITLE
0,11.66,"Tuberculous pneumonia [any form], tubercle bac..."
1,11.7,"Tuberculous pneumothorax, unspecified"
2,11.71,"Tuberculous pneumothorax, bacteriological or h..."
3,11.72,"Tuberculous pneumothorax, bacteriological or h..."
4,11.73,"Tuberculous pneumothorax, tubercle bacilli fou..."


Unnamed: 0,ICD9_CODE,LONG_TITLE
0,8.51,Canthotomy
1,8.52,Blepharorrhaphy
2,8.59,Other adjustment of lid position
3,8.61,Reconstruction of eyelid with skin flap or graft
4,8.62,Reconstruction of eyelid with mucous membrane ...


# 2. Load MIMIC-III Data

In [5]:
# Load processed ICD-9 codes in complete MIMIC-III dataset
codes = pd.read_csv(os.path.join(root, 'data', 'processed', 'ALL_CODES.csv'), dtype={'ICD9_CODE': str})
display(codes.head(5))

# Load processed MimicFull dataset
MimicFull = pd.read_pickle(os.path.join(root, 'data', 'processed', 'data_MimicFull', 'DATA_MimicFull.pkl'))
display(MimicFull.head(5))

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE,ICD9_TYPE
0,109,172335,403.01,diagnosis
1,109,172335,486.0,diagnosis
2,109,172335,582.81,diagnosis
3,109,172335,585.5,diagnosis
4,109,172335,425.4,diagnosis


Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE,TEXT,TOKENS
0,3,145834,"[038.9, 263.9, 38.93, 410.71, 425.4, 427.5, 42...",admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
1,4,185777,"[041.11, 042, 136.3, 276.3, 33.23, 38.93, 571....",admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
2,6,107064,"[275.3, 276.6, 276.7, 285.9, 38.06, 39.57, 403...",admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
3,9,150750,"[276.5, 401.9, 428.0, 431, 507.0, 584.9, 96.04...",admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."
4,10,184167,"[765.15, 765.25, 774.2, 96.6, 99.15, 99.83, V2...",admission date : deidentified discharge date :...,"[admission, date, :, deidentified, discharge, ..."


In [6]:
# Next we want to make sure to only include codes that are actually associated with the hospital admission ids
# in the MimicFull dataset
MimicFull_hadm = MimicFull.HADM_ID.unique().tolist()

# Now use these hospital admission ids to subset the complete codes dataset
codes_full = codes[codes['HADM_ID'].isin(MimicFull_hadm)].copy()
codes_full = codes_full[['ICD9_CODE', 'ICD9_TYPE']]
codes_full = codes_full.drop_duplicates()

# Split this dataframe into diagnosis codes and procedure codes
codes_diag = codes_full[codes_full['ICD9_TYPE'] == 'diagnosis']
codes_proc = codes_full[codes_full['ICD9_TYPE'] == 'procedure']

# 3. Retrieve available code descriptions

In [7]:
# Retrieve all available code descriptions for the codes in the MimicFull dataset
diag_desc_full = diag_desc[diag_desc['ICD9_CODE'].isin(codes_diag.ICD9_CODE.unique().tolist())]
proc_desc_full = proc_desc[proc_desc['ICD9_CODE'].isin(codes_proc.ICD9_CODE.unique().tolist())]

print(f'Number of retrieved diagnosis code descriptions: {diag_desc_full.shape[0]}/{codes_diag.shape[0]}')
print(f'Number of retrieved procedure code descriptions: {proc_desc_full.shape[0]}/{codes_proc.shape[0]}')
print()
print('Retrieved sample diagnosis codes with description:')
display(diag_desc_full.sample(10))
print('Retrieved sample procedure codes with description:')
display(proc_desc_full.sample(10))

Number of retrieved diagnosis code descriptions: 6776/6918
Number of retrieved procedure code descriptions: 1817/1989

Retrieved sample diagnosis codes with description:


Unnamed: 0,ICD9_CODE,LONG_TITLE
2971,307.81,Tension headache
2884,304.72,Combinations of opioid type drug with any othe...
11939,721.2,Thoracic spondylosis without myelopathy
9853,V09.90,"Infection with drug-resistant microorganisms, ..."
2890,304.90,"Unspecified drug dependence, unspecified"
4185,379.99,Other ill-defined disorders of eye
5921,589.0,Unilateral small kidney
7595,747.22,Atresia and stenosis of aorta
7461,755.02,Polydactyly of toes
10632,E866.8,Accidental poisoning by other specified solid ...


Retrieved sample procedure codes with description:


Unnamed: 0,ICD9_CODE,LONG_TITLE
658,39.58,Repair of blood vessel with unspecified type o...
1700,45.22,Endoscopy of large intestine through artificia...
888,27.56,Other skin graft to lip and mouth
1610,37.66,Insertion of implantable heart assist system
1480,56.74,Ureteroneocystostomy
999,18.21,Excision of preauricular sinus
1902,83.03,Bursotomy
2515,81.85,Other repair of elbow
2500,81.63,Fusion or refusion of 4-8 vertebrae
2305,62.69,Other repair of testis


### Let's take a look at the ICD-9 codes in MIMIC-III where we were unable to retrieve a respective code description.

In [8]:
# Show ICD-9 codes not retrieved
diag_desc_full_not = codes_diag[~codes_diag['ICD9_CODE'].isin(diag_desc_full.ICD9_CODE.unique().tolist())]

print('Sample diagnosis codes unable to retrieve appropriate descriptions')
display(diag_desc_full_not.sort_values('ICD9_CODE').sample(10))

proc_desc_full_not = codes_proc[~codes_proc['ICD9_CODE'].isin(proc_desc_full.ICD9_CODE.unique().tolist())]

print('Sample procedure codes unable to retrieve appropriate descriptions')
display(proc_desc_full_not.sort_values('ICD9_CODE').sample(10))

Sample diagnosis codes unable to retrieve appropriate descriptions


Unnamed: 0,ICD9_CODE,ICD9_TYPE
164821,523.4,diagnosis
252527,616.8,diagnosis
536,V46.1,diagnosis
814,V17.4,diagnosis
122295,284.0,diagnosis
14543,357.8,diagnosis
172194,277.8,diagnosis
3977,780.9,diagnosis
115599,608.2,diagnosis
56316,282.4,diagnosis


Sample procedure codes unable to retrieve appropriate descriptions


Unnamed: 0,ICD9_CODE,ICD9_TYPE
724800,33.0,procedure
655314,45.8,procedure
848951,35.1,procedure
829101,76.4,procedure
729912,75.0,procedure
656596,11.8,procedure
815047,76.1,procedure
683872,45.0,procedure
720980,42.0,procedure
789162,33.2,procedure


### Let's cross-compare with original list of ICD-9 code descriptions (DIAGNOSES AND PROCEDURES) retrieved from (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes - Version 32 Full and Abbreviated Code Titles  – Effective October 1, 2014 (ZIP)) 

Reasons why code descriptions were not retrieved
- Assigned code (in MIMIC-III) was overly generic and not specific enough (Shi et al. 2017)
    - See for diagnosis codes: 771.8 vs 771.81 / 771.82 / 771.83 / 771.89 (https://www.aapc.com/codes/icd9-codes-range/105/)
    - See for procedure codes: 72 vs. 72.0 / 72.1 (https://www.aapc.com/codes/icd9-codes-vol3-range/15/)

In [15]:
# Load additional icd-9 code information
diag_desc2 = pd.read_excel(os.path.join(root, 'data', 'external', 'CMS32_DESC_LONG_SHORT_DX.xlsx'))
proc_desc2 = pd.read_excel(os.path.join(root, 'data', 'external', 'CMS32_DESC_LONG_SHORT_SG.xlsx'),
                          converters={'PROCEDURE CODE': str})

In [17]:
diag_desc2['ICD9_CODE'] = diag_desc2.apply(lambda x: rel2abs(x['DIAGNOSIS CODE'], flag_proc=False), axis=1)
diag_desc2 = diag_desc2.drop(['DIAGNOSIS CODE', 'SHORT DESCRIPTION'], axis=1)
diag_desc2

Unnamed: 0,LONG DESCRIPTION,ICD9_CODE
0,Cholera due to vibrio cholerae,001.0
1,Cholera due to vibrio cholerae el tor,001.1
2,"Cholera, unspecified",001.9
3,Typhoid fever,002.0
4,Paratyphoid fever A,002.1
...,...,...
14562,"Quadruplet gestation, unable to determine numb...",V91.29
14563,"Other specified multiple gestation, unspecifie...",V91.90
14564,"Other specified multiple gestation, with two o...",V91.91
14565,"Other specified multiple gestation, with two o...",V91.92


In [18]:
proc_desc2['ICD9_CODE'] = proc_desc2.apply(lambda x: rel2abs(x['PROCEDURE CODE'], flag_proc=True), axis=1)
proc_desc2 = proc_desc2.drop(['PROCEDURE CODE', 'SHORT DESCRIPTION'], axis=1)
display(proc_desc2.head(10))

Unnamed: 0,LONG DESCRIPTION,ICD9_CODE
0,Therapeutic ultrasound of vessels of head and ...,0.01
1,Therapeutic ultrasound of heart,0.02
2,Therapeutic ultrasound of peripheral vascular ...,0.03
3,Other therapeutic ultrasound,0.09
4,Implantation of chemotherapeutic agent,0.1
5,Infusion of drotrecogin alfa (activated),0.11
6,Administration of inhaled nitric oxide,0.12
7,Injection or infusion of nesiritide,0.13
8,Injection or infusion of oxazolidinone class o...,0.14
9,High-dose infusion interleukin-2 [IL-2],0.15


In [21]:
# Combine procedure and diagnosis label descriptions
descs = pd.concat([diag_desc_full, proc_desc_full]).sort_values('ICD9_CODE')

with open(os.path.join(root, 'data', 'processed', 'code_descriptions.pkl'), 'wb') as f:
    pickle.dump(descs, f)
    
display(descs.head(5))
descs.shape



Unnamed: 0,ICD9_CODE,LONG_TITLE
68,3.0,Salmonella gastroenteritis
69,3.1,Salmonella septicemia
76,3.8,Other specified salmonella infections
77,3.9,"Salmonella infection, unspecified"
79,4.1,Shigella flexneri


(8593, 2)

# Clean code descriptions

In [21]:

# Clean available code-descriptions
descs['desc_clean'] = descs.apply(lambda x: clean_desc(text=x.LONG_TITLE), axis=1)
descs = descs.drop(['LONG_TITLE'], axis=1)
descs['desc_len'] = descs.apply(lambda x: len(x.desc_clean), axis=1) 
max_len = descs.desc_len.max() 
print(f'Longest code description contains {max_len} tokens.')

Longest code description contains 20 tokens.


In [22]:
descs

Unnamed: 0,ICD9_CODE,desc_clean,desc_len
68,003.0,"[salmonella, gastroenteritis]",2
69,003.1,"[salmonella, septicemia]",2
76,003.8,"[specified, salmonella, infections]",3
77,003.9,"[salmonella, infection, unspecified]",3
79,004.1,"[shigella, flexneri]",2
...,...,...,...
9750,V90.10,"[retained, metal, fragments, unspecified]",4
9757,V90.39,"[retained, organic, fragments]",3
9758,V90.81,"[retained, glass, fragments]",3
9760,V90.89,"[specified, retained, foreign, body]",4


# Create Docs and Tags for Doc2Vec Algorithm
- Docs: Cleaned code descriptions
- Tags: Respective ICD-9 Code Descriptions

In [23]:
docs = descs['desc_clean'].tolist()  # corpus to learn code descriptions - one doc is one label description
tags = descs['ICD9_CODE'].tolist()  # assign labels as tags

In [24]:
# Examples of cleaned descriptions
docs[:3]

[['salmonella', 'gastroenteritis'],
 ['salmonella', 'septicemia'],
 ['specified', 'salmonella', 'infections']]

# Compute Code Description Embedding Matrices using Doc2Vec

In [25]:
documents = [TaggedDocument(doc, [tag]) for doc, tag in zip(docs, tags)]

In [26]:
documents[:3]

[TaggedDocument(words=['salmonella', 'gastroenteritis'], tags=['003.0']),
 TaggedDocument(words=['salmonella', 'septicemia'], tags=['003.1']),
 TaggedDocument(words=['specified', 'salmonella', 'infections'], tags=['003.8'])]

In [29]:
with open(os.path.join(root, 'data', 'processed', 'data_MimicFull', 'l_codes_MimicFull.pkl'), 'rb') as f:
    labels_full = pickle.load(f)
    
with open(os.path.join(root, 'data', 'processed', 'data_Mimic50', 'l_codes_Mimic50.pkl'), 'rb') as f:
    labels_50 = pickle.load(f)
    
# Compute pre-trained word embeddings for embedding layer using Word2Vec
documents = [TaggedDocument(doc, [tag]) for doc, tag in zip(docs, tags)]
embedding_dims = [100, 200, 300]


for subset, labels in zip(['MimicFull', 'Mimic50'], [labels_full, labels_50]):
    print(f'Create code description embedding matrix for {subset}')
    if not os.path.exists(os.path.join(path_data_proc, f'data_{subset}', 'code_embeddings')):
        os.makedirs(os.path.join(path_data_proc, f'data_{subset}', 'code_embeddings'))
    for dim in embedding_dims:
        print(f'Embedding dimension: {dim}')
        model = Doc2Vec(documents, vector_size=dim, window=2, min_count=1, workers=4)
        model_shape = model.dv.vectors.shape    
        label_embedding_matrix = np.zeros((len(labels), dim))    
        count = 0
        for i, label in enumerate(labels):
            try:
                dv = model.dv[label]
            except:
                count += 1
                # create randomly initialized code description embedding vector 
                dv = np.random.uniform(low=-0.05, high=0.05, size=(dim, ))

            label_embedding_matrix[i,:] = dv
        print(f'Shape of label_embedding_matrix: {label_embedding_matrix.shape}')
    
        print(f'Number of randomly initialized label description embedding vectors {count}.')
    
        with open(os.path.join(root, 'data', 'processed', f'data_{subset}', 'code_embeddings', f'code_embedding_matrix_{subset}_{dim}.pkl'), 'wb') as f:
            pickle.dump(label_embedding_matrix, f)

Create code description embedding matrix for MimicFull
Embedding dimension: 100
Shape of label_embedding_matrix: (8907, 100)
Number of randomly initialized label description embedding vectors 314.
Embedding dimension: 200
Shape of label_embedding_matrix: (8907, 200)
Number of randomly initialized label description embedding vectors 314.
Embedding dimension: 300
Shape of label_embedding_matrix: (8907, 300)
Number of randomly initialized label description embedding vectors 314.
Create code description embedding matrix for Mimic50
Embedding dimension: 100
Shape of label_embedding_matrix: (50, 100)
Number of randomly initialized label description embedding vectors 0.
Embedding dimension: 200
Shape of label_embedding_matrix: (50, 200)
Number of randomly initialized label description embedding vectors 0.
Embedding dimension: 300
Shape of label_embedding_matrix: (50, 300)
Number of randomly initialized label description embedding vectors 0.


# Preprocess High-Level ICD-9 Code Categories

## MIMIC-III-Full

In [30]:
# Dictionary containing ICD-9 diagnosis (3 digits + V/E) / procedure codes (2 digits)
# Diagnosis codes: http://www.icd9data.com/2015/Volume1/default.htm
# Procedure codes: http://www.icd9data.com/2015/Volume3/default.htm

d_cat_desc = {
    '001-139': 'Infectious And Parasitic Diseases',
    '140-239': 'Neoplasms',
    '240-279': 'Endocrine, Nutritional And Metabolic Diseases, And Immunity Disorders',
    '280-289': 'Diseases Of The Blood And Blood-Forming Organs',
    '290-319': 'Mental Disorders',
    '320-389': 'Diseases Of The Nervous System And Sense Organs',
    '390-459': 'Diseases Of The Circulatory System',
    '460-519': 'Diseases Of The Respiratory System',
    '520-579': 'Diseases Of The Digestive System',
    '580-629': 'Diseases Of The Genitourinary System',
    '630-679': 'Complications Of Pregnancy, Childbirth, And The Puerperium',
    '680-709': 'Diseases Of The Skin And Subcutaneous Tissue',
    '710-739': 'Diseases Of The Musculoskeletal System And Connective Tissue',
    '740-759': 'Congenital Anomalies',
    '760-779': 'Certain Conditions Originating In The Perinatal Period',
    '780-799': 'Symptoms, Signs, And Ill-Defined Conditions',
    '800-999': 'Injury And Poisoning',
    'V01-V91': 'Supplementary Classification Of Factors Influencing Health Status And Contact With Health Services',
    'E000-E999': 'Supplementary Classification Of External Causes Of Injury And Poisoning',
    '00-00': 'Procedures And Interventions Not Elsewhere Classified',
    '01-05': 'Operations On The Nervous System',
    '06-07': 'Operations On The Endocrine System',
    '08-16': 'Operations On The Eye',
    '17-17': 'Other Miscellaneous Diagnostic And Therapeutic Procedures',
    '18-20': 'Operations On The Ear',
    '21-29': 'Operations On The Nose, Mouth, And Pharynx',
    '30-34': 'Operations On The Respiratory System',
    '35-39': 'Operations On The Cardiovascular System',
    '40-41': 'Operations On The Hemic And Lymphatic System',
    '42-54': 'Operations On The Digestive System',
    '55-59': 'Operations On The Urinary System',
    '60-64': 'Operations On The Male Genital Organs',
    '65-71': 'Operations On The Female Genital Organs',
    '72-75': 'Obstetrical Procedures',
    '76-84': 'Operations On The Musculoskeletal System',
    '85-86': 'Operations On The Integumentary System',
    '87-99': 'Miscellaneous Diagnostic And Therapeutic Procedures'
}

In [31]:
df_cat_desc = pd.DataFrame.from_dict(data=d_cat_desc.items())
df_cat_desc.columns = ['categories', 'description']
df_cat_desc

df_cat_desc['desc_clean'] = df_cat_desc.apply(lambda x: clean_desc(text=x.description), axis=1)
display(df_cat_desc)


docs = df_cat_desc['desc_clean'].tolist()  # corpus to learn code descriptions - one doc is one label description
tags = df_cat_desc['categories'].tolist()  # assign labels as tags

Unnamed: 0,categories,description,desc_clean
0,001-139,Infectious And Parasitic Diseases,"[infectious, parasitic, diseases]"
1,140-239,Neoplasms,[neoplasms]
2,240-279,"Endocrine, Nutritional And Metabolic Diseases,...","[endocrine, nutritional, metabolic, diseases, ..."
3,280-289,Diseases Of The Blood And Blood-Forming Organs,"[diseases, blood, bloodforming, organs]"
4,290-319,Mental Disorders,"[mental, disorders]"
5,320-389,Diseases Of The Nervous System And Sense Organs,"[diseases, nervous, system, organs]"
6,390-459,Diseases Of The Circulatory System,"[diseases, circulatory, system]"
7,460-519,Diseases Of The Respiratory System,"[diseases, respiratory, system]"
8,520-579,Diseases Of The Digestive System,"[diseases, digestive, system]"
9,580-629,Diseases Of The Genitourinary System,"[diseases, genitourinary, system]"


In [37]:
# Compute pre-trained word embeddings for embedding layer using Word2Vec
documents = [TaggedDocument(doc, [tag]) for doc, tag in zip(docs, tags)]
embedding_dims = [100, 200, 300]
for dim in embedding_dims:
    print()
    print(f'Create word embedding matrices with dimenions: {dim}')
    model = Doc2Vec(documents, vector_size=dim, window=2, min_count=1, workers=4)
    model_shape = model.dv.vectors.shape
    print('Shape of document embedding matrix.')
    print(f'Number of document embeddings: {model_shape[0]}')
    print(f'Document embedding dimension: {model_shape[1]}')
        
    label_embedding_matrix = model.dv.vectors
    with open(os.path.join(root, 'data', 'processed', 'data_MimicFull', 'code_embeddings', f'cat_embedding_matrix_MimicFull_{dim}.pkl'), 'wb') as f:
        pickle.dump(label_embedding_matrix, f)
    print()


Create word embedding matrices with dimenions: 100
Shape of document embedding matrix.
Number of document embeddings: 37
Document embedding dimension: 100


Create word embedding matrices with dimenions: 200
Shape of document embedding matrix.
Number of document embeddings: 37
Document embedding dimension: 200


Create word embedding matrices with dimenions: 300
Shape of document embedding matrix.
Number of document embeddings: 37
Document embedding dimension: 300



### Retrieve Category to billable Code Mapping

In [38]:
df_cat = pd.DataFrame.from_dict(data=d_cat_desc.items())
df_cat.columns = ['categories', 'description']

cat_bounds = df_cat['categories'].str.split("-", n = 1, expand = True)
df_cat['cat_lower'] = cat_bounds[0]
df_cat['cat_upper'] = cat_bounds[1]
df_cat

Unnamed: 0,categories,description,cat_lower,cat_upper
0,001-139,Infectious And Parasitic Diseases,001,139
1,140-239,Neoplasms,140,239
2,240-279,"Endocrine, Nutritional And Metabolic Diseases,...",240,279
3,280-289,Diseases Of The Blood And Blood-Forming Organs,280,289
4,290-319,Mental Disorders,290,319
5,320-389,Diseases Of The Nervous System And Sense Organs,320,389
6,390-459,Diseases Of The Circulatory System,390,459
7,460-519,Diseases Of The Respiratory System,460,519
8,520-579,Diseases Of The Digestive System,520,579
9,580-629,Diseases Of The Genitourinary System,580,629


In [39]:
lower_bounds = list(cat_bounds[0])
upper_bounds = list(cat_bounds[1])

In [40]:
cat2label_mapping_full = []
for label in labels_full:
    for i, (lower, upper) in enumerate(zip(lower_bounds, upper_bounds)):
        if '.' in label:
            if label[2] == '.':  # procedure code
                if len(lower) == 2:
                    if lower <= label[:2] <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_full.append(i)
            else:  # diagnosis code
                if len(lower) == 3:
                    if lower <= label[:3] <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_full.append(i)
                elif len(lower) == 4:  # codes starting with 'E'
                    if lower <= label[:4] <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_full.append(i)
        else:
            if len(label) == 2:
                if len(lower) == 2:
                    if lower <= label <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_full.append(i)
            elif len(label) == 3:
                if len(lower) == 3:
                    if lower <= label <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_full.append(i)
            elif len(label) == 4:
                if len(lower) == 4:
                    if lower <= label <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_full.append(i)

Code: 003.0 -> Cat: 0
Code: 003.1 -> Cat: 0
Code: 003.8 -> Cat: 0
Code: 003.9 -> Cat: 0
Code: 004.1 -> Cat: 0
Code: 004.8 -> Cat: 0
Code: 004.9 -> Cat: 0
Code: 005.1 -> Cat: 0
Code: 005.81 -> Cat: 0
Code: 005.9 -> Cat: 0
Code: 007.1 -> Cat: 0
Code: 007.4 -> Cat: 0
Code: 008.04 -> Cat: 0
Code: 008.41 -> Cat: 0
Code: 008.43 -> Cat: 0
Code: 008.45 -> Cat: 0
Code: 008.47 -> Cat: 0
Code: 008.5 -> Cat: 0
Code: 008.61 -> Cat: 0
Code: 008.62 -> Cat: 0
Code: 008.63 -> Cat: 0
Code: 008.69 -> Cat: 0
Code: 008.8 -> Cat: 0
Code: 009.0 -> Cat: 0
Code: 009.1 -> Cat: 0
Code: 009.2 -> Cat: 0
Code: 009.3 -> Cat: 0
Code: 010.85 -> Cat: 0
Code: 011.23 -> Cat: 0
Code: 011.36 -> Cat: 0
Code: 011.64 -> Cat: 0
Code: 011.86 -> Cat: 0
Code: 011.90 -> Cat: 0
Code: 011.93 -> Cat: 0
Code: 011.94 -> Cat: 0
Code: 012.05 -> Cat: 0
Code: 012.15 -> Cat: 0
Code: 013.00 -> Cat: 0
Code: 013.04 -> Cat: 0
Code: 013.25 -> Cat: 0
Code: 013.30 -> Cat: 0
Code: 013.54 -> Cat: 0
Code: 014.02 -> Cat: 0
Code: 014.05 -> Cat: 0
Code:

In [48]:
with open(os.path.join(root, 'data', 'processed', 'data_MimicFull', 'code_embeddings', f'embedding_matrix_MimicFull_mapping.pkl'), 'wb') as f:
    pickle.dump(cat2label_mapping_full, f)

## MIMIC-III-50

In [49]:
# Retain all categories associated with MIMIC-III-50:
dict_cat_desc_50 = {
    '001-139': 'Infectious And Parasitic Diseases',
    '240-279': 'Endocrine, Nutritional And Metabolic Diseases, And Immunity Disorders',
    '280-289': 'Diseases Of The Blood And Blood-Forming Organs',
    '290-319': 'Mental Disorders',
    '390-459': 'Diseases Of The Circulatory System',
    '460-519': 'Diseases Of The Respiratory System',
    '520-579': 'Diseases Of The Digestive System',
    '580-629': 'Diseases Of The Genitourinary System',
    '800-999': 'Injury And Poisoning',
    'V01-V91': 'Supplementary Classification Of Factors Influencing Health Status And Contact With Health Services',
    '30-34': 'Operations On The Respiratory System',
    '35-39': 'Operations On The Cardiovascular System',
    '42-54': 'Operations On The Digestive System',
    '87-99': 'Miscellaneous Diagnostic And Therapeutic Procedures'
}

# Create a pandas dataframe
df_cat_desc_50 = pd.DataFrame.from_dict(data=dict_cat_desc_50.items())
df_cat_desc_50.columns = ['categories', 'description']
df_cat_desc_50

Unnamed: 0,categories,description
0,001-139,Infectious And Parasitic Diseases
1,240-279,"Endocrine, Nutritional And Metabolic Diseases,..."
2,280-289,Diseases Of The Blood And Blood-Forming Organs
3,290-319,Mental Disorders
4,390-459,Diseases Of The Circulatory System
5,460-519,Diseases Of The Respiratory System
6,520-579,Diseases Of The Digestive System
7,580-629,Diseases Of The Genitourinary System
8,800-999,Injury And Poisoning
9,V01-V91,Supplementary Classification Of Factors Influe...


In [50]:
df_cat_desc_50['desc_clean'] = df_cat_desc_50.apply(lambda x: clean_desc(text=x.description), axis=1)
display(df_cat_desc_50)


docs = df_cat_desc_50['desc_clean'].tolist()  # corpus to learn code descriptions - one doc is one label description
tags = df_cat_desc_50['categories'].tolist()  # assign labels as tags

Unnamed: 0,categories,description,desc_clean
0,001-139,Infectious And Parasitic Diseases,"[infectious, parasitic, diseases]"
1,240-279,"Endocrine, Nutritional And Metabolic Diseases,...","[endocrine, nutritional, metabolic, diseases, ..."
2,280-289,Diseases Of The Blood And Blood-Forming Organs,"[diseases, blood, bloodforming, organs]"
3,290-319,Mental Disorders,"[mental, disorders]"
4,390-459,Diseases Of The Circulatory System,"[diseases, circulatory, system]"
5,460-519,Diseases Of The Respiratory System,"[diseases, respiratory, system]"
6,520-579,Diseases Of The Digestive System,"[diseases, digestive, system]"
7,580-629,Diseases Of The Genitourinary System,"[diseases, genitourinary, system]"
8,800-999,Injury And Poisoning,"[injury, poisoning]"
9,V01-V91,Supplementary Classification Of Factors Influe...,"[supplementary, classification, factors, influ..."


In [51]:
# Compute pre-trained word embeddings for embedding layer using Word2Vec
documents = [TaggedDocument(doc, [tag]) for doc, tag in zip(docs, tags)]
embedding_dims = [100, 200, 300]
for dim in embedding_dims:
    print()
    print(f'Create word embedding matrices with dimenions: {dim}')
    model = Doc2Vec(documents, vector_size=dim, window=2, min_count=1, workers=4)
    model_shape = model.dv.vectors.shape
    print('Shape of document embedding matrix.')
    print(f'Number of sentence embeddings: {model_shape[0]}')
    print(f'Word embedding dimension: {model_shape[1]}')
    
    print()
    print('Create label embedding matrix for Mimic50:')
    label_embedding_matrix = model.dv.vectors
    with open(os.path.join(root, 'data', 'processed', 'data_Mimic50', 'code_embeddings', f'cat_embedding_matrix_Mimic50_{dim}.pkl'), 'wb') as f:
        pickle.dump(label_embedding_matrix, f)
        


Create word embedding matrices with dimenions: 100
Shape of document embedding matrix.
Number of sentence embeddings: 14
Word embedding dimension: 100

Create label embedding matrix for Mimic50:

Create word embedding matrices with dimenions: 200
Shape of document embedding matrix.
Number of sentence embeddings: 14
Word embedding dimension: 200

Create label embedding matrix for Mimic50:

Create word embedding matrices with dimenions: 300
Shape of document embedding matrix.
Number of sentence embeddings: 14
Word embedding dimension: 300

Create label embedding matrix for Mimic50:


### Retrieve Category to billable Code Mapping

In [52]:
df_cat_50 = pd.DataFrame.from_dict(data=dict_cat_desc_50.items())
df_cat_50.columns = ['categories', 'description']

cat_bounds_50 = df_cat_50['categories'].str.split("-", n = 1, expand = True)
df_cat_50['cat_lower'] = cat_bounds_50[0]
df_cat_50['cat_upper'] = cat_bounds_50[1]

lower_bounds_50 = list(cat_bounds_50[0])
upper_bounds_50 = list(cat_bounds_50[1])



In [53]:
cat2label_mapping_50 = []
for label in labels_50:
    for i, (lower, upper) in enumerate(zip(lower_bounds_50, upper_bounds_50)):
        if '.' in label:
            if label[2] == '.':  # procedure code
                if len(lower) == 2:
                    if lower <= label[:2] <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_50.append(i)
            else:  # diagnosis code
                if len(lower) == 3:
                    if lower <= label[:3] <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_50.append(i)
                elif len(lower) == 4:  # codes starting with 'E'
                    if lower <= label[:4] <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_50.append(i)
        else:
            if len(label) == 2:
                if len(lower) == 2:
                    if lower <= label <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_50.append(i)
            elif len(label) == 3:
                if len(lower) == 3:
                    if lower <= label <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_50.append(i)
            elif len(label) == 4:
                if len(lower) == 4:
                    if lower <= label <= upper:
                        print(f'Code: {label} -> Cat: {i}')
                        cat2label_mapping_50.append(i)

Code: 038.9 -> Cat: 0
Code: 244.9 -> Cat: 1
Code: 250.00 -> Cat: 1
Code: 272.0 -> Cat: 1
Code: 272.4 -> Cat: 1
Code: 276.1 -> Cat: 1
Code: 276.2 -> Cat: 1
Code: 285.1 -> Cat: 2
Code: 285.9 -> Cat: 2
Code: 287.5 -> Cat: 2
Code: 305.1 -> Cat: 3
Code: 311 -> Cat: 3
Code: 33.24 -> Cat: 10
Code: 36.15 -> Cat: 11
Code: 37.22 -> Cat: 11
Code: 38.91 -> Cat: 11
Code: 38.93 -> Cat: 11
Code: 39.61 -> Cat: 11
Code: 39.95 -> Cat: 11
Code: 401.9 -> Cat: 4
Code: 403.90 -> Cat: 4
Code: 410.71 -> Cat: 4
Code: 412 -> Cat: 4
Code: 414.01 -> Cat: 4
Code: 424.0 -> Cat: 4
Code: 427.31 -> Cat: 4
Code: 428.0 -> Cat: 4
Code: 45.13 -> Cat: 12
Code: 486 -> Cat: 5
Code: 496 -> Cat: 5
Code: 507.0 -> Cat: 5
Code: 511.9 -> Cat: 5
Code: 518.81 -> Cat: 5
Code: 530.81 -> Cat: 6
Code: 584.9 -> Cat: 7
Code: 585.9 -> Cat: 7
Code: 599.0 -> Cat: 7
Code: 88.56 -> Cat: 13
Code: 88.72 -> Cat: 13
Code: 93.90 -> Cat: 13
Code: 96.04 -> Cat: 13
Code: 96.6 -> Cat: 13
Code: 96.71 -> Cat: 13
Code: 96.72 -> Cat: 13
Code: 99.04 -> Cat:

In [54]:
with open(os.path.join(root, 'data', 'processed', 'data_Mimic50', 'code_embeddings', f'embedding_matrix_Mimic50_mapping.pkl'), 'wb') as f:
    pickle.dump(cat2label_mapping_50, f)