### Pre-processing input data
The input is a synthetic dataset obtained from medisyn.ai that follows the OMOP CDM (Common Data Model) format. We downloaded the Outpatient data files. The following files are used:
- person.csv
- visit_occurrence.csv
- condition_occurrence.csv

The goal of pre-processing is to produce three files that will be used as input for the BERT embedding:
- conditions.pkl: One record per patient containing the conditions (medical codes) for each visit.
- ages.pkl: The person's age at each observed condition.
- condition_codes.pkl: The list of condition codes used to build the code vocabulary.

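Following the BERT data format, each person's sequence starts with "CLS" and each visit ends with "SEP". The sequences built in this notebook contain only the "SEP" tokens; the sample outputs show no leading "CLS".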

Note: The produced files will be saved in pickle format.

In [43]:
import numpy as np
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

DATA_PATH = 'C:/Birhanu/Education/UIL/cs598/Final/data/'
FREQ_THRESHOLD = 3000
MIN_VISITS = 10


In [44]:
def csv2Pickle(in_file_name: str,
               usecols: list = None,
               dtypes: dict = None,
               converters: dict = None,
               column_mapper: dict = None,
               out_filename: str = None,
               delimiter: str = ","):
    """
    Converts a csv file to a pickle file.
    
    Parameters:
        in_file_name (str): The name of the csv file to be converted.
        usecols (list): The columns to read from the csv file; all others are dropped.
        dtypes (dict): A dictionary of column names and their data types.
        converters (dict): A dictionary of column names and their converters.
        column_mapper (dict): A dictionary of column names and their new names.
        out_filename (str): The name of the pickle file. Defaults to in_file_name
            with '.csv' replaced by '.pkl'.
        delimiter (str): The field delimiter of the csv file.
    """
    if (out_filename is None):
        out_filename = in_file_name.replace('.csv', '.pkl')

    csv = pd.read_csv(
        in_file_name,
        delimiter=delimiter,
        usecols=usecols,
        dtype=dtypes,
        converters=converters
    )

    if (column_mapper is not None and len(column_mapper) > 0):
        csv.rename(columns=column_mapper, inplace=True)

    csv.to_pickle(out_filename)

### Convert the person.csv to person.pkl file format.
Only keep the columns needed for this project:
- person_id -> pid
- gender_concept_id -> gender
- birth_datetime -> dob
- ethnicity_concept_id -> ethnicity

**Time estimate:** 1 sec

In [45]:
# 1.1 Convert the person data file to a pickle file
person_columns = [
    'person_id', 
    'gender_concept_id',
    'birth_datetime',
    'ethnicity_concept_id'
]
 
person_column_mapper = {
    'person_id'             : 'pid',
    'gender_concept_id'     : 'gender',
    'birth_datetime'        : 'dob',
    'ethnicity_concept_id'  : 'ethnicity'
}

person_dtypes = {
    'person_id'             : 'str',
    'gender_concept_id'     : 'str',
    'ethnicity_concept_id'  : 'str',
}

# No converter is applied: birth_datetime stays a string
# ('YYYY-MM-DD HH:MM:SS') and is parsed later in calculate_age().
person_converters = None

csv2Pickle(in_file_name=DATA_PATH + 'person.csv', 
           usecols=person_columns, 
           dtypes=person_dtypes,
           converters=person_converters, 
           column_mapper=person_column_mapper,
           out_filename=DATA_PATH + 'person.pkl'
    )


### Convert the visit_occurrence.csv to visit.pkl file format.
Only keep the columns needed for this project:
- visit_occurrence_id -> vid
- person_id -> pid
- visit_start_date -> visit_date

**Time estimate:** 8 min 40 sec

In [46]:
# Convert the visits data file to a pickle file
visit_columns = [
    'visit_occurrence_id',
    'person_id',
    'visit_start_date'
]
 
visit_column_mapper = {
    'visit_occurrence_id'   : 'vid',
    'person_id'             : 'pid',
    'visit_start_date'      : 'visit_date'
}

visit_dtypes = {
    'visit_occurrence_id'       : 'str',
    'person_id'                 : 'str',
}

visit_converters = {
    'visit_start_date': lambda x: pd.to_datetime(x, format='%Y-%m-%d')
}

csv2Pickle(in_file_name=DATA_PATH + 'visit_occurrence.csv', 
           usecols=visit_columns,
           dtypes=visit_dtypes,
           converters=visit_converters,
           column_mapper=visit_column_mapper,
           out_filename=DATA_PATH + 'visit.pkl'
    )
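The lambda converters above call `pd.to_datetime` once per value, which may contribute to the multi-minute run times. As a minimal sketch of an alternative, the date column can be parsed in one vectorized call after loading. The in-memory CSV and its values are made up for illustration; only the column names come from this notebook.

```python
import io

import pandas as pd

# Tiny stand-in for visit_occurrence.csv (values are illustrative only).
csv_text = """visit_occurrence_id,person_id,visit_start_date
1,176103,1992-12-23
2,176103,1993-02-09
"""

visits = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["visit_occurrence_id", "person_id", "visit_start_date"],
    dtype={"visit_occurrence_id": "str", "person_id": "str"},
)

# Parse the whole column at once instead of one pd.to_datetime call per row.
visits["visit_start_date"] = pd.to_datetime(
    visits["visit_start_date"], format="%Y-%m-%d"
)

visits = visits.rename(columns={
    "visit_occurrence_id": "vid",
    "person_id": "pid",
    "visit_start_date": "visit_date",
})
print(visits.dtypes)
```

Whether this is faster on the full files has not been measured here; it is worth timing on a sample before switching.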


### Convert the condition_occurrence.csv to condition_visit.pkl file format.
Only keep the columns needed for this project:
- person_id -> pid
- visit_occurrence_id -> vid
- condition_start_date -> event_date
- condition_concept_id -> condition

**Time estimate**: 13 minutes

In [47]:
# Convert the conditions occurrence data file to a pickle file
condition_columns = [
    'person_id',
    'visit_occurrence_id',
    'condition_start_date',
    'condition_concept_id',
]
 
condition_column_mapper = {
    'person_id'                 : 'pid',
    'visit_occurrence_id'       : 'vid',
    'condition_start_date'      : 'event_date',
    'condition_concept_id'      : 'condition'
}

condition_dtypes = {
    'person_id'                 : 'str',
    'visit_occurrence_id'       : 'str',
    'condition_concept_id'      : 'str',
}

condition_converters = {
    'condition_start_date'      : lambda x: pd.to_datetime(x, format='%Y-%m-%d')
}

csv2Pickle(in_file_name=DATA_PATH + 'condition_occurrence.csv',
           usecols=condition_columns,
           dtypes=condition_dtypes,
           column_mapper=condition_column_mapper,
           converters=condition_converters,
           out_filename=DATA_PATH + 'condition_visit.pkl'
    )

### Load and process the concepts file

Only keep the columns needed for this project:
- concept_id
- domain_id 
- concept_name

**Time estimate**: 1 sec

In [48]:
# Convert the concepts file to pickle format
concept_columns = [
    'concept_id',
    'concept_name',
    'domain_id'
]

concept_column_mapper = {
    'concept_id': 'concept_id',
    'concept_name': 'concept_name',
    'domain_id': 'domain_id'
}

concept_dtypes = {
    'concept_id': 'str',
    'concept_name': 'str',
    'domain_id' : 'str'
}

concept_converters = None

csv2Pickle(in_file_name=DATA_PATH + 'CONCEPT.csv',
           usecols=concept_columns,
           dtypes=concept_dtypes,
           column_mapper=concept_column_mapper,
           converters=concept_converters,
           out_filename=DATA_PATH + 'concept.pkl',
           delimiter='\t'
           )


### Format the conditions dataframe
In this step we remove patients with too few visits and infrequent conditions, then group the data by person ID to produce the visit sequence and the conditions for each visit.

Inputs:
- person.pkl: The person file after the unnecessary columns have been removed.
- visit.pkl: The visit file after the unnecessary columns have been removed.
- condition_visit.pkl: The condition occurrence file after the unnecessary columns have been removed.

Resulting dataframes:
- demographics = [pid, gender, dob, ethnicity]
- conditions = [pid, dob, visit_date, conditions = [condition codes]]

**Time estimate:** 3 min

In [49]:
# Load the pickled person, visit and condition data
demographics = pd.read_pickle(DATA_PATH + 'person.pkl')
visits = pd.read_pickle(DATA_PATH + 'visit.pkl')
conditions = pd.read_pickle(DATA_PATH + 'condition_visit.pkl')

# Remove patients with fewer than MIN_VISITS visits
initial_visits = visits.shape[0]

visit_counts = visits.groupby('pid').size().reset_index(name='visit_counts')
selected_pids = visit_counts[visit_counts['visit_counts'] >= MIN_VISITS].reset_index(drop=True)
visits = visits[visits['pid'].isin(selected_pids['pid'])]
print(f"Visits sized reduced by due number of visits ({MIN_VISITS}): {visits.shape[0] - initial_visits}")

# Get the frequency count of each condition across the dataset
freq_conditions = conditions.groupby(['condition']).size().reset_index(name='counts')

# Keep only the conditions that occur at least FREQ_THRESHOLD (3000) times
freq_conditions = freq_conditions[freq_conditions['counts'] >= FREQ_THRESHOLD].reset_index(drop=True)
# Sanity check specific to this dataset: condition 10851 occurs 4323 times
assert freq_conditions[freq_conditions['condition']=="10851"].counts.values[0] == 4323

# Filter the conditions data to only include the frequent conditions
conditions = conditions[conditions['condition'].isin(freq_conditions['condition'])]

# Join visit and conditions with demographics data to add the date of birth
conditions = conditions.merge(demographics,  on='pid', how='inner').reset_index(drop=True)
conditions = conditions.merge(visits, on=['pid','vid'], how='inner').reset_index(drop=True)

# Reshape to [pid, dob, visit_date, [conditions]]: one row per visit with its unique condition codes
conditions = conditions.groupby(['pid', 'dob', 'visit_date'])['condition'].unique().apply(list).reset_index()

# Rename the column to 'conditions' since it now holds a list of codes
conditions.rename(columns={'condition': 'conditions'}, inplace=True)


Visits sized reduced by due number of visits (10): -70527
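To make the two filters easier to follow, here is a minimal sketch on toy data. It uses `groupby(...).transform("size")` as an alternative to the `size()` plus `isin()` pattern in the cell above; the thresholds and rows are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

MIN_VISITS = 2       # toy threshold; the notebook uses 10
FREQ_THRESHOLD = 2   # toy threshold; the notebook uses 3000

visits = pd.DataFrame({
    "pid": ["a", "a", "b"],
    "vid": ["v1", "v2", "v3"],
})
conditions = pd.DataFrame({
    "pid": ["a", "a", "a", "b"],
    "vid": ["v1", "v1", "v2", "v3"],
    "condition": ["c1", "c2", "c1", "c3"],
})

# Keep only the visits of patients with at least MIN_VISITS visits.
visits_per_patient = visits.groupby("pid")["vid"].transform("size")
kept_visits = visits[visits_per_patient >= MIN_VISITS]

# Keep only the conditions that occur at least FREQ_THRESHOLD times overall.
occurrences = conditions.groupby("condition")["pid"].transform("size")
kept_conditions = conditions[occurrences >= FREQ_THRESHOLD]

# The inner join drops conditions that belong to removed visits.
result = kept_conditions.merge(kept_visits, on=["pid", "vid"], how="inner")
print(result)
```

Patient `b` has a single visit and conditions `c2` and `c3` occur only once, so only the two `c1` rows of patient `a` remain.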


### Add age for each visit
Compute the patient's age at each visit and add the computed column. 
Note: Age is calculated in months (total number of full months since birth)

Time: 2:30 min

In [57]:
# Add age to the conditions data
from datetime import datetime
from dateutil.relativedelta import relativedelta

def calculate_age(x):
    """
        Purpose: Calculate the age of a patient at the time of a visit
        Parameters: 
            x (pd.Series): A row of the conditions dataframe with
                dob (str): The date of birth, formatted as 'YYYY-MM-DD HH:MM:SS'
                visit_date (pd.Timestamp): The date of the visit
        Returns:
            age (str): The age of the patient in full months
    """
    dob = datetime.strptime(x["dob"], "%Y-%m-%d %H:%M:%S")
    visit_date = x["visit_date"].to_pydatetime()

    # Compute the difference once and convert years + months to total months
    delta = relativedelta(visit_date, dob)
    age = delta.years * 12 + delta.months
    return str(age)

conditions['age'] = conditions.apply(calculate_age, axis=1)


In [51]:
print(conditions.shape)
conditions[conditions['pid'] == '176103']


(794302, 5)


       pid                  dob  visit_date  conditions  age
50  176103  1937-07-29 00:00:00  1992-12-23   [1570669]  664
51  176103  1937-07-29 00:00:00  1993-02-09   [1570669]  666
52  176103  1937-07-29 00:00:00  1993-08-13  [35208969]  672
53  176103  1937-07-29 00:00:00  1994-08-21  [35208968]  684
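The ages in the table above can be reproduced with a small standalone sketch. The helper name `age_in_months` is hypothetical; the date of birth and visit dates are those shown for patient 176103.

```python
from datetime import datetime

from dateutil.relativedelta import relativedelta


def age_in_months(dob: datetime, visit_date: datetime) -> int:
    """Return the number of full months between dob and visit_date."""
    delta = relativedelta(visit_date, dob)
    return delta.years * 12 + delta.months


dob = datetime(1937, 7, 29)
visit_dates = [
    datetime(1992, 12, 23),
    datetime(1993, 2, 9),
    datetime(1993, 8, 13),
    datetime(1994, 8, 21),
]

ages = [age_in_months(dob, d) for d in visit_dates]
print(ages)  # [664, 666, 672, 684]
```

A month counts only once the day of birth has been reached, which is why the visit on 1992-12-23 yields 664 months (55 years and 4 months) rather than 665.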


### Add the condition names
Add the name of the condition (diagnosis name) to the dataframe. It will be used later in the 2D 

In [58]:
# Read the clean concepts file
concepts = pd.read_pickle(DATA_PATH + 'concept.pkl')
concepts = concepts[concepts["domain_id"] == "Condition"]

freq_conditions = freq_conditions.merge(concepts, how='inner', left_on='condition', right_on='concept_id') 
freq_conditions = freq_conditions[ ['condition', 'concept_name', 'counts'] ]

freq_conditions.rename(columns={'concept_name': 'condition_name'}, inplace=True)
freq_conditions.to_pickle(DATA_PATH + 'condition_codes.pkl')

print(freq_conditions.shape)
print(freq_conditions.head())


(296, 3)
  condition                                     condition_name  counts
0     10851  Sprain of joints and ligaments of other parts ...    4323
1     11803                Sprain of ligaments of lumbar spine    4293
2   1567286                                       Other sepsis    3217
3   1567391                Viral infection of unspecified site    4177
4   1567392                                    Dermatophytosis    4198
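As a rough illustration of how the condition codes could become a vocabulary, the sketch below maps tokens to integer indices. The special tokens other than CLS and SEP, and the names `token2idx`, `idx2token` and `encode`, are assumptions for illustration and are not defined anywhere in this notebook.

```python
import pandas as pd

# Toy stand-in for condition_codes.pkl, using the first rows printed above.
condition_codes = pd.DataFrame({
    "condition": ["10851", "11803", "1567286"],
    "counts": [4323, 4293, 3217],
})

# Hypothetical special tokens; only CLS and SEP are named in this notebook.
SPECIAL_TOKENS = ["PAD", "UNK", "CLS", "SEP", "MASK"]

# Special tokens come first, followed by the condition codes.
all_tokens = SPECIAL_TOKENS + condition_codes["condition"].tolist()
token2idx = {token: index for index, token in enumerate(all_tokens)}
idx2token = {index: token for token, index in token2idx.items()}


def encode(sequence):
    """Map tokens to indices, falling back to UNK for unseen codes."""
    return [token2idx.get(token, token2idx["UNK"]) for token in sequence]


encoded = encode(["CLS", "10851", "SEP", "99999", "SEP"])
print(encoded)  # [2, 5, 3, 1, 3]
```

The code `99999` is not in the toy vocabulary, so it is mapped to the index of `UNK`.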


### Produce the condition sequences for each patient and visit

The output will follow the format [pid, [code1, code2, SEP, code1, code3, code4, ..., SEP]].

Time estimate: 37 secs

In [53]:
### Format the condition codes as a sequence of conditions: code1,code2,SEP,code1,code3,code4,...,SEP
def concat_codes(x: pd.DataFrame):
    """
        Purpose: Concatenate the condition codes of all visits of one patient
        Parameters: 
            x (pd.DataFrame): The visits of one patient, with columns visit_date and conditions
        Returns:
            seqs (list): The condition codes of each visit, sorted by code, with 'SEP' after each visit
    """
    sep = 'SEP'
    seqs = []

    for visit_conditions in x['conditions']:
        if visit_conditions is not None:
            # Sort the codes within a visit and close the visit with SEP
            seqs.extend(sorted(visit_conditions))
            seqs.append(sep)
 
    return seqs

# Select the columns with a list (double brackets); tuple-style selection is deprecated in pandas
seq_codes = conditions.groupby(['pid'])[['visit_date', 'conditions']].apply(concat_codes).rename("conditions").reset_index()
print(seq_codes.iloc[2])
                                                                    



pid                                                      176103
conditions    [1570669, SEP, 1570669, SEP, 35208969, SEP, 35...
Name: 2, dtype: object
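The sequence construction can also be shown without pandas. This is a minimal sketch with a hypothetical helper `build_code_sequence`; the visits are those of patient 176103 from the table earlier in this notebook.

```python
SEP = "SEP"


def build_code_sequence(visits):
    """Flatten a list of visits (each a list of codes) into one sequence.

    Codes are de-duplicated and sorted within a visit, and every visit
    is closed with the SEP token.
    """
    sequence = []
    for codes in visits:
        sequence.extend(sorted(set(codes)))
        sequence.append(SEP)
    return sequence


# One list of condition codes per visit, in chronological order.
visits = [["1570669"], ["1570669"], ["35208969"], ["35208968"]]
code_seq = build_code_sequence(visits)
print(code_seq)
```

Each visit contributes its codes plus one `SEP`, so a patient with four single-code visits yields a sequence of eight tokens.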


### Produce the age sequence for each person for each condition

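The formatted output will be: [age1, age1, age1, age2, age2, ...], one row per person.
The age at a visit is repeated for each condition code of that visit and once more for the visit's SEP token, so the age sequence aligns position by position with the code sequence.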

Time estimate: 1:35 min

In [59]:

def concat_ages(x: pd.DataFrame):
    """
        Purpose: Concatenate the age of the patient for each condition code.
        
        Parameters:
            x (DataFrame): The visits of one patient, with the following columns:
                age (str): The age of the patient (in months) at the visit
                visit_date (datetime): The date of the visit
                conditions (list): A list of condition codes
        Returns:
            seqs (list): The age at each visit, repeated for each condition code and once for 'SEP'
    
    """
    seqs = []

    for i in range(len(x['conditions'])):
        conditions = x['conditions'].iloc[i]
        age = x['age'].iloc[i]

        # One age entry per condition code of the visit
        for _ in conditions:
            seqs.append(age)

        # One more entry to align with the visit's SEP token
        seqs.append(age)

    return seqs

seq_ages = conditions.groupby(["pid"])[["age", "visit_date", "conditions"]].apply(concat_ages).rename("ages").reset_index()

print(seq_ages.iloc[2])



pid                                       176103
ages    [664, 664, 666, 666, 672, 672, 684, 684]
Name: 2, dtype: object
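The alignment between ages and codes can be sketched in plain Python. The helper `build_age_sequence` is hypothetical; the ages and codes are those of patient 176103 shown above.

```python
def build_age_sequence(visit_ages, visit_codes):
    """Repeat each visit's age once per condition code plus once for SEP."""
    sequence = []
    for age, codes in zip(visit_ages, visit_codes):
        sequence.extend([age] * (len(codes) + 1))
    return sequence


# One entry per visit, in chronological order.
visit_codes = [["1570669"], ["1570669"], ["35208969"], ["35208968"]]
visit_ages = ["664", "666", "672", "684"]

age_seq = build_age_sequence(visit_ages, visit_codes)
print(age_seq)
```

Every visit here has one code, so each age appears twice: once for the code and once for the `SEP` position.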


### 2.6 Validate the code and age sequences
The code and age sequences of each patient must have the same length; the notebook cell that follows checks the first patient.
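A check that covers every patient, not only the first one, could look like the sketch below. The toy dataframes mimic the shape of `seq_codes` and `seq_ages`; their contents are made up for illustration.

```python
import pandas as pd

seq_codes = pd.DataFrame({
    "pid": ["p1", "p2"],
    "conditions": [["c1", "SEP", "c2", "c3", "SEP"], ["c1", "SEP"]],
})
seq_ages = pd.DataFrame({
    "pid": ["p1", "p2"],
    "ages": [["10", "10", "12", "12", "12"], ["30", "30"]],
})

# Join on pid so the comparison does not rely on row order.
merged = seq_codes.merge(seq_ages, on="pid", how="inner", validate="one_to_one")

# Compare the sequence lengths patient by patient.
length_mismatch = merged["conditions"].map(len) != merged["ages"].map(len)
mismatched_pids = merged.loc[length_mismatch, "pid"].tolist()

assert not mismatched_pids, f"Length mismatch for patients: {mismatched_pids}"
```

Reporting the offending patient IDs makes a failure easier to trace than a bare length comparison.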

In [60]:
assert len(seq_codes["conditions"].iloc[0]) == len(seq_ages["ages"].iloc[0]), \
    "The number of condition codes and ages do not match"


### 3.1 Write the updated data into the final files
Produce the files:
- "conditions.pkl" for the condition sequences
- "ages.pkl" for the age sequences
- "condition_codes.pkl" from freq_conditions (already written in the condition names step above)

Time estimate: 2 sec

In [61]:
# Save the data to pickle files

seq_codes.to_pickle(DATA_PATH + 'conditions.pkl' )
seq_ages.to_pickle(DATA_PATH + 'ages.pkl')
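As a quick sanity check, a pickled dataframe can be read back and compared with the original. The sketch below writes a toy dataframe to a temporary directory instead of `DATA_PATH`; the row content is illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for seq_codes with a single patient.
seq_codes = pd.DataFrame({
    "pid": ["176103"],
    "conditions": [["1570669", "SEP"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "conditions.pkl"
    seq_codes.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# The list-valued column survives the round trip unchanged.
print(restored["conditions"].iloc[0])
```

Pickle preserves Python lists inside cells, which a CSV export would flatten into strings.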
