# Algorithm for Unexpected Weight Loss phenotype: rule-based

12 July 2023

Updated 23 February 2024

---

## Description

Code that takes tables from Patron and creates a cohort of patients who likely have Unexpected Weight Loss (UWL).

## Methodology
- Transform each of the tables to clean datetimes and create flags for different types of 'weight-related' events: weight, weight loss, weight loss medication, weight loss inquiry
- Create an event log by concatenating all the tables
- Create uwl_flag for candidate patients based on a boolean combination of the flags
- Use rule-based classification algorithm to further filter the candidates to remove those that are unlikely to be UWL
- Create time-period flags: index date; time periods from index date; cancer diagnosis date
- Calculate the number of UWL patients and the proportion diagnosed with cancer within six months of the index date.
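
The candidate step in the list above can be sketched on a toy event log. The rows are invented purely for illustration; only the flag names come from this notebook, and the real pipeline builds the flags from the source tables rather than by hand.

```python
import pandas as pd

# Toy event log; every row here is made up for illustration.
events = pd.DataFrame({
    "usi": ["A", "A", "B", "C"],
    "weightloss_flag": [True, False, True, True],
    "weightloss_med_flag": [False, False, False, False],
    "weightloss_inquiry_flag": [False, False, True, False],
})

# An event counts towards UWL when it mentions weight loss and is neither
# linked to weight loss medication nor to a weight loss inquiry/program.
events["uwl_flag"] = (
    events["weightloss_flag"]
    & ~events["weightloss_med_flag"]
    & ~events["weightloss_inquiry_flag"]
)

# Candidate patients have at least one such event.
candidates = sorted(events.loc[events["uwl_flag"], "usi"].unique())
print(candidates)
```

Patient B drops out here because the only weight loss event is also flagged as an inquiry.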

## Patient flow
- Create the subset of patients at each step in the creation of the cohort

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
%matplotlib inline

In [2]:
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 100)

## 1. Data sources

### Patron

In [3]:
# Patron source folders
source_folder1_patron = "M:/Working/AL/Projects/weight_loss/outputs/patron_data_extracts"
source_folder2_patron = "M:/Working/AL/Projects/weight_loss/outputs"
source_folder3_patron = "M:/Working/DataAnalysis/CleanAndStructure/DataFiles"
source_folder4_patron = "M:/Working/DataAnalysis/RCOLD/DataFiles"

# patron files
filename_encounter_patron = "patron_encounters_310123.parquet"
filename_diagnosis_patron = "patron_condition_history_310123.parquet"
filename_prescription_patron = "patron_prescriptions_300123.parquet"
filename_patient_patron = "patron_patient_300123.parquet"
filename_pathology_patron = "pathology_patron_10_results_cleaned_BN_ranges_270423.parquet"
filename_hwbmi_patron = "patron_height_weight_observations_cleaned_300123.parquet"
# NOTE: this reassigns filename_pathology_patron, so the Vic_ranges file is the one that is loaded
filename_pathology_patron = "pathology_patron_10_results_cleaned_Vic_ranges_120523.parquet"

# VAED Procedures
filename_vaed_proc = 'VAED_AllProceduresLong_202112.parquet'
filename_vaed_proc_ah = 'VAED_AllProceduresLong_AH_202112.parquet'

filename_cancer = 'VAED_Cancers_coded.csv'

## 2. Regular expressions and inclusion / exclusion criteria

In [4]:
# weight-related terms
weight_spellings = ["WEIGHT", "WIEGHT", " WT", "WT ", 'WEIGT', 'WEGHT', 'WEIGH ', '^LOW$',
                    'WEIGTH', 'WGT', 'WGHT', 'WHT ', r'[0-9\s]+KG', 'CACHEXI', 'BMI']

initial_regex = r"|".join(weight_spellings)

# terms that relate to weight loss
loss_terms = ['LOSS', 'LOSSS', 'LOST', 'DECREASE', 'LOSING', 'LOSSING', 'LOW', 'LAST', 'CACHEXI']
loss_regex = r"|".join(loss_terms)

# exclusion terms that relate to a weight loss program or other enquiry
exclusion_terms = ["PROGRAM", "COUNSELLING", "MANAGEMENT", 
                     "ADVICE", "CLINIC", "DISCUSSION",
                     "REGIME", "MANAGMENT", "JENNY CRAIG", "CONSULTATION", "MEDICATION"]

exclusion_regex = r"|".join(exclusion_terms)
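
As a rough illustration of how the three expressions interact, the sketch below applies them to a few invented reason strings with `re.search`, which is the matching behaviour `Series.str.contains` uses. The `classify` helper and the sample strings are hypothetical and not part of the pipeline.

```python
import re

weight_spellings = ["WEIGHT", "WIEGHT", " WT", "WT ", "WEIGT", "WEGHT", "WEIGH ", "^LOW$",
                    "WEIGTH", "WGT", "WGHT", "WHT ", r"[0-9\s]+KG", "CACHEXI", "BMI"]
loss_terms = ["LOSS", "LOSSS", "LOST", "DECREASE", "LOSING", "LOSSING", "LOW", "LAST", "CACHEXI"]
exclusion_terms = ["PROGRAM", "COUNSELLING", "MANAGEMENT", "ADVICE", "CLINIC", "DISCUSSION",
                   "REGIME", "MANAGMENT", "JENNY CRAIG", "CONSULTATION", "MEDICATION"]

initial_regex = "|".join(weight_spellings)
loss_regex = "|".join(loss_terms)
exclusion_regex = "|".join(exclusion_terms)


def classify(reason):
    """Return (weight_flag, weightloss_flag, weightloss_inquiry_flag) for one string."""
    text = reason.upper().strip()
    weight = bool(re.search(initial_regex, text))
    loss = weight and bool(re.search(loss_regex, text))
    inquiry = loss and bool(re.search(exclusion_regex, text))
    return weight, loss, inquiry


for reason in ["Unexplained weight loss", "Weight loss program",
               "Weight check", "Lost appetite", "Follow up weight"]:
    print(reason, classify(reason))
```

The last string is flagged as weight loss because "FOLLOW" contains "LOW". Substring matching of this kind can over-match, which is one reason the candidates are filtered further by the rule-based step.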

In [5]:
columns_encounters = ["usi", 
                      "dte", 
                      "reason", 
                      "weight_flag", 
                      "weightloss_flag", 
                      "weightloss_inquiry_flag", 
                      "event_type"]

columns_diagnoses = ["usi", 
                     "dte", 
                     "reason", 
                     "weight_flag", 
                     "weightloss_flag", 
                     "weightloss_inquiry_flag", 
                     "event_type"]

In [6]:
# terms that identify bariatric surgery
bariatric_regex = r'BARIATRIC|LAP BAND|GASTRIC BAND|GASTRIC SLEEVE'

In [7]:
prescription_cols_nps = ['patientid', 
                     'usi',
                     'firstprescrip_dte', 
                     'lastprescrip_dte',
                     'medicine_active_ingredient', 
                     'prescription_reason']

# NOTE: this reassigns prescription_cols_nps; only this second list remains in effect
prescription_cols_nps = ['usi',  
                     'dte',  
                     'value', 
                     'value2', 
                     'event_type', 
                     'weightloss_med_flag']

medicine_wtloss_terms = ['PHENTERMINE', 
                         'SIBURTRAMINE', 
                         'ORLISTAT', 
                         'LIRAGLUTIDE', 
                         'TOPIRAMATE', 
                         'SEMAGLUTIDE', 
                         'SIBUTRAMINE', 
                         'DUROMINE']

# create regular expression to extract weight loss medications
medicine_wtloss_regex = r'|'.join(medicine_wtloss_terms)

In [8]:
# covers: replacing inequalities, removing non-numeric strings
weird_values_replace = {r'>\\S\\([0-9.]+)': r'\g<1>',
                        r'>\s?([0-9.]+)': r'\g<1>',
                        r'>\s*\^([0-9.]+)': r'\g<1>',
                        r'>\^([0-9.]+)': r'\g<1>',
                        r'>\s*([0-9.]+)': r'\g<1>',
                        r'>\s?\^([0-9.]+)': r'\g<1>',
                        r'<\\S\\([0-9.]+)': r'\g<1>',
                        r'<\s?([0-9.]+)': r'\g<1>',
                        r'<\s*\^([0-9.]+)': r'\g<1>',
                        r'<\^([0-9.]+)': r'\g<1>',
                        r'<\s*([0-9.]+)': r'\g<1>',
                        r'<\s?\^([0-9.]+)': r'\g<1>',
                        r'^([0-9]+)\+$': r'\g<1>',
                        r'^([0-9.]+)\+$': r'\g<1>',
                        r'^([0-9]+)\-$': r'\g<1>',
                        r'^([0-9.]+)\-$': r'\g<1>',
                        r'^([0-9]+)\`$': r'\g<1>',
                        r'^([0-9.]+)\`$': r'\g<1>',
                        r'^([0-9\.]+)[A-Za-z\.\s]+$': r'\g<1>', 
                        r'^[0]*\,([0-9]{2})$': r'\g<1>',
                        r'^[^0-9\.]+$': '', 
                        r'([0-9\.]{3}).+': r'\g<1>'}
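
This part of the notebook does not show how `weird_values_replace` is applied, so the sketch below is only an assumption: it runs a small subset of the rules one after another with `re.sub` to show the intended effect on invented raw values. The `clean_value` helper is hypothetical.

```python
import re

# A subset of the rules above, applied in insertion order (an assumption).
rules = {
    r">\s*([0-9.]+)": r"\g<1>",               # "> 90"  -> "90"
    r"<\s*([0-9.]+)": r"\g<1>",               # "<5.5"  -> "5.5"
    r"^([0-9.]+)\+$": r"\g<1>",               # "80+"   -> "80"
    r"^([0-9\.]+)[A-Za-z\.\s]+$": r"\g<1>",   # "72kg"  -> "72"
    r"^[^0-9\.]+$": "",                       # text with no digits -> ""
}


def clean_value(raw):
    """Apply each rule in turn to a single raw string."""
    out = raw
    for pattern, repl in rules.items():
        out = re.sub(pattern, repl, out)
    return out


for raw in ["> 90", "<5.5", "80+", "72kg", "72.5 kg", "unknown"]:
    print(repr(raw), "->", repr(clean_value(raw)))
```

The replacement strings are written as raw strings so that `\g<1>` reaches the regex engine unchanged.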

In [9]:
rename_cols_observations = {'observation_dte': 'dte', 
                            'observation_type': 'event_subtype', 
                            'observation_value_cleaned': 'value'}

select_cols_observations = ['usi', 'dte', 'value', 'event_type', 'event_subtype']

dte_cols_observations = ['observation_dte']

text_cols_observations = ['observation_type']

In [10]:
rename_cols_prescription = {'prescription_reason': 'value', 
                            'medicine_active_ingredient': 'value2', 
                            'medicine_name': 'value3', 
                            'firstprescrip_dte': 'dte'}

select_cols_prescription = ['usi', 'dte', 'value', 'value2', 'event_type', 'weightloss_med_flag']

dte_cols_prescription = ['firstprescrip_dte']

text_cols_prescription = ['prescription_reason', 'medicine_active_ingredient', 'medicine_name']

flag_cols_prescription = {'weightloss_med_flag': lambda df_: df_['medicine_active_ingredient'].str.contains(medicine_wtloss_regex, na=False)}

### Patron

In [11]:
standard_cols_prescription = {'medication_active_ingredient': 'medicine_active_ingredient', 
                              'dte': 'firstprescrip_dte'}

In [12]:
rename_cols_pathology = {'result_dte': 'dte', 
                         'result_name_standard': 'event_subtype', 
                         'result_value_cleaned': 'value', 
                         'units_standard': 'value2', 
                         'value_range': 'value3'}

select_cols_pathology = ['usi', 'dte', 'event_type', 'event_subtype', 'value', 'value2', 'value3']

flags_col_pathology = None

dte_cols_pathology = ['result_dte']

text_cols_pathology = ['result_name_standard']

In [13]:
flag_cols_encounters = {'weight_flag': lambda df_: df_.encounter_reason.str.contains(initial_regex, na=False), 
                        'weightloss_flag': lambda df_: df_.weight_flag & df_.encounter_reason.str.contains(loss_regex, na=False), 
                        'weightloss_inquiry_flag': lambda df_: df_.weightloss_flag & df_.encounter_reason.str.contains(exclusion_regex, na=False),
                        'bariatric_surgery_flag': lambda df_: df_.encounter_reason.str.contains(bariatric_regex, na=False)}

rename_cols_encounters = {'visit_dte': 'dte', 
                          'encounter_reason': 'value'}

select_cols_encounters = ['usi', 
                          'dte', 
                          'event_type', 
                          'value', 
                          'weight_flag', 
                          'weightloss_flag', 
                          'weightloss_inquiry_flag', 
                          'bariatric_surgery_flag']

dte_cols_encounters = ['visit_dte']

text_cols_encounters = ['encounter_reason']

In [14]:
flag_cols_diagnoses = {'weight_flag': lambda df_: df_.diagnosis_reason.str.contains(initial_regex, na=False), 
                       'weightloss_flag': lambda df_: df_.weight_flag & df_.diagnosis_reason.str.contains(loss_regex, na=False), 
                       'weightloss_inquiry_flag': lambda df_: df_.weightloss_flag & df_.diagnosis_reason.str.contains(exclusion_regex, na=False), 
                       'bariatric_surgery_flag': lambda df_: df_.diagnosis_reason.str.contains(bariatric_regex, na=False)}

rename_cols_diagnoses = {'diagnosis_dte': 'dte', 
                         'diagnosis_reason': 'value'}

select_cols_diagnoses = ['usi', 
                         'dte', 
                         'event_type', 
                         'value', 
                         'weight_flag', 
                         'weightloss_flag', 
                         'weightloss_inquiry_flag', 
                         'bariatric_surgery_flag']

dte_cols_diagnoses = ['diagnosis_dte']

text_cols_diagnoses = ['diagnosis_reason']

In [15]:
dt_formats = ['%Y-%m-%d', 
              '%d/%m/%Y',
              '%d/%m/%y', 
              '%d.%m.%Y', 
              '%d.%m.%y', 
              '%d %m %y', 
              '%d/%m%Y', 
              '%d%m%Y', 
              '%d.%m.%Y', 
              '%d.%m/%Y', 
              '%d %m %Y', 
              '%d-%m-%Y']

def format_dates(dates, date_formats=('%d/%m/%Y',)):
    """
    Standardize a pd.Series of datetimes based on a set of specified datetime formats.
    Formats are tried in order and the first one that parses a value is kept.
    
    Args
    ----
    
    dates (pd.Series): input of datetime values to convert
    date_formats (sequence of str): strings that specify all the possible formats that dates appear in
    
    Return
    ------
    dates_standard (pd.Series): standardized datetime values (NaT where no format matches)
    """
    dates_standard = pd.to_datetime(dates, format=date_formats[0], errors='coerce')
    
    # fall back to each remaining format for values that are still unparsed
    for date_format in date_formats[1:]:
        dates_standard = dates_standard.fillna(pd.to_datetime(dates, format=date_format, errors='coerce'))
    
    return dates_standard
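
A small self-contained sketch of the same fallback idea, run on invented strings. `parse_with_fallback` is a hypothetical stand-in for `format_dates`, so the example does not depend on the notebook state.

```python
import pandas as pd


def parse_with_fallback(dates, date_formats):
    """Try each format in order; keep the first successful parse per value."""
    parsed = pd.to_datetime(dates, format=date_formats[0], errors="coerce")
    for fmt in date_formats[1:]:
        parsed = parsed.fillna(pd.to_datetime(dates, format=fmt, errors="coerce"))
    return parsed


raw = pd.Series(["2021-03-04", "04/03/2021", "04.03.21", "not a date"])
parsed = parse_with_fallback(raw, ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%y"])
print(parsed)
```

Because earlier formats win, the order of the list matters whenever a string could be read by more than one format.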

## 3. Functions

In [16]:
def add_index_case_label(df, 
                         patientid='usi', 
                         date='date', 
                         case='uwl_flag'):
    """
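    Add a boolean index_case column that marks, for each patient, the first
    row where `case` is True (or the patient's first row if none is True).
    Rows are taken in their existing order, so sort by patient and date
    beforehand if the earliest event is wanted.
    
    Args
    ----
    df (pd.DataFrame): the input dataframe
    patientid (str): the column name of the patientid field
    date (str): the column name of the date field (currently unused)
    case (str): the column name of the case field
    
    Return
    ------
    
    df (pd.DataFrame): dataframe with the index_case column added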
    """
    df = df.copy().assign(index_case=False)
    df.loc[df.groupby(patientid)[case].idxmax().values, 'index_case'] = True
    
    return df
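
The `idxmax` idiom used here can be illustrated on a toy event log. The data are invented, and the explicit sort is an assumption about how the function is meant to be called, since `idxmax` returns the first maximum in row order rather than in date order.

```python
import pandas as pd

events = pd.DataFrame({
    "usi": ["A", "A", "A", "B", "B"],
    "dte": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                           "2020-01-15", "2020-04-01"]),
    "uwl_flag": [False, True, True, False, True],
})

# Sort so that "first True per patient" also means "earliest True per patient".
events = events.sort_values(["usi", "dte"]).reset_index(drop=True)

# idxmax on a boolean column gives the label of the first True in each group.
first_true = events.groupby("usi")["uwl_flag"].idxmax()

events["index_case"] = False
events.loc[first_true.values, "index_case"] = True
print(events)
```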

In [17]:
def events_within_time_period(df, 
                              patientid='usi', 
                              date='date',
                              index_date='index_case', 
                              time_period=(0, 31), 
                              time_period_label='one_month'):
    """
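    Create a column that, for each patient, flags all events within a given 
    time period relative to the patient's index date.
    
    The result is assigned back by position, so the dataframe is assumed to be
    sorted by patient and to have a default 0..n-1 index.
    
    Args
    ----
    df (pd.DataFrame): input dataframe
    patientid (str): column name for patient id field
    date (str): column name for date field
    index_date (str): column name of the boolean index case flag
    time_period (tuple): the time period (in days, inclusive) relative to the index date
    time_period_label (str): label of the time period, used as the new column name
    
    Return
    ------
    df (pd.DataFrame): dataframe with the boolean time period column added
    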
    """
    df = df.copy()
    
    results = (
        df
        .groupby(patientid)
        .apply(lambda df_: df_[date] - df_[df_[index_date] == True][date].values[0])
        .dt.days.between(*time_period)
        .reset_index()
        .loc[:, date]
    )
    
    df = df.assign(**{f'{time_period_label}': results})
    
    return df
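
An index-aligned variant of the same idea is sketched below for comparison. It is an alternative formulation, not the notebook's implementation: `flag_events_in_window` is a hypothetical name, the data are invented, and the index case column is assumed to be boolean.

```python
import pandas as pd


def flag_events_in_window(df, patientid="usi", date="dte", index_col="index_case",
                          time_period=(0, 31), label="one_month"):
    """Flag events whose offset from the patient's index date lies in time_period."""
    # One index date per patient, taken from the rows marked as index case.
    index_dates = df.loc[df[index_col]].groupby(patientid)[date].first()
    # Map the index date back onto every row, so alignment follows the row index.
    days = (df[date] - df[patientid].map(index_dates)).dt.days
    return df.assign(**{label: days.between(*time_period)})


events = pd.DataFrame(
    {
        "usi": ["A", "B", "A", "A", "B", "A"],
        "dte": pd.to_datetime(["2020-02-20", "2020-03-01", "2020-01-01",
                               "2020-02-01", "2020-03-31", "2020-06-01"]),
        "index_case": [False, True, False, True, False, False],
    },
    index=[10, 11, 12, 13, 14, 15],  # deliberately unsorted and not 0..n-1
)

flagged = flag_events_in_window(events)
print(flagged)
```

Mapping the index date onto each row avoids relying on row order, at the cost of returning False (rather than raising) for patients without an index case.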

In [18]:
def recent_weight_change(df):
    """
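    Calculate the weight change up until the index date.
    If there is no weight recording at the index date, or fewer than two
    weight recordings up to the index date, returns the sentinel value -1000.
    Rows are assumed to be sorted by date.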
    
    Args
    ----
    df (pd.DataFrame): event log for a given patient
    
    Return
    ------
    weight_change (float): difference in weight between the index date and
                           previous weight observation
    """
    index_date = df.query('index_case == 1')['dte'].values[0]
    
    weights_previous = (
        df
        .query(f'dte <= "{index_date}"')
        .query('event_subtype == "WEIGHT"')
    )
    
    # check that there is a weight recording at the index date
    if weights_previous.query('at_index_date == True').shape[0] == 0:
        return -1000
    else:
        # check that there are at least two weight recordings to calculate
        # difference
        if weights_previous.shape[0] >= 2:
            weights = weights_previous['value'].astype(float).values
            weight_change = weights[-1] - weights[-2]
            return weight_change
        else:
            return -1000
        
        
def weight_increase(df):
    """
    Create flag for whether there is a weight increase up to
    index date
    
    Args
    ----
    df (pd.DataFrame): input event log for a single patient
    
    Return
    ------
    df (pd.DataFrame): dataframe with flag added
    
    """
    df = df.copy()
    df = df.assign(weight_increase=lambda df_: recent_weight_change(df_) > 0)
    return df

In [19]:
def transform_table_standard(events, 
                             rename_cols, 
                             select_cols, 
                             flag_cols,
                             dte_cols=['visit_dte'], 
                             main_date_col = 'dte',
                             text_cols=['encounter_reason'], 
                             event_name='encounter'):
    """
    Description
    -----------
    Transform a given input table (such as encounters, diagnoses, prescriptions)
    so that datetime values are processed, basic cleaning has been done to text fields,
    columns have been renamed appropriately and subsetted, and additional columns added
    to flag relevant events
    
    Args
    ----
    events (pd.DataFrame): dataframe containing a given set of events
    rename_cols (dict): mapping from old column to new column names
    select_cols (list of str): columns to select from the dataframe
    flag_cols (dict): mapping of new column names to transformations that define flags 
    dte_cols (list of str): datetime columns
    main_date_col (str): name of the main date column (currently unused)
    text_cols (list of str): columns that contain text
    event_name (str): event name to describe this set of events
    
    Returns
    -------
    events_new (pd.DataFrame): transformed dataframe
    """
    
    # acceptable formats for datetime values
    dte_pattern1 = r'\d{1,2}\/\d{1,2}\/\d{4}'
    dte_pattern2 = r'\d{4}-\d{2}-\d{2}.*'
    dte_pattern3 = r'\d{2}\.\d{2}\.\d{4}'
    dte_pattern4 = r'\d{1,2}[\s\.\/]\d{1,2}[\s\.\/]\d{2}'

    dte_pattern_all = f'^(?!({dte_pattern1}|{dte_pattern2}|{dte_pattern3}|{dte_pattern4}))'

    # clean the datetime variables
    dt_clean = dict(
        zip(
            dte_cols, 
            [lambda df_, dte_col=dte_col: df_[dte_col].astype(str).fillna('').str.replace(dte_pattern_all, '', regex=True) for dte_col in dte_cols]
        )        
    )
    
    # standardize the datetime variables
    dt_changes = dict(
        zip(
            dte_cols, 
            [lambda df_, dte_col=dte_col: format_dates(df_[dte_col], dt_formats) for dte_col in dte_cols]
        )
    )
    
    # standardize the text variables
    text_changes = dict(
        zip(
            text_cols, 
            [lambda df_, text_col=text_col: df_[text_col].str.upper().str.strip() for text_col in text_cols]
        )
    )
    
    # query that removes rows with null dt values
    null_dt_filter = ' & '.join([f'({dte_col}.isnull() == False)' for dte_col in dte_cols])
    
    if flag_cols is None:
        flag_cols = {}
        
    if dt_changes is None:
        dt_changes = {}
        
    if text_changes is None:
        text_changes = {}

    events_new = (
        events
   #     .query('(usi in @zedmed_usi) == False')
        .query(null_dt_filter)
        .assign(**dt_clean) #
        .assign(**dt_changes)
        .assign(**text_changes)
        .assign(**flag_cols)
        .assign(event_type=event_name)
        .rename(columns=rename_cols)
        .query(f'dte.isnull() == False')
        .loc[:, select_cols]
        .drop_duplicates()
    )
    
    return events_new
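
The lambdas in `transform_table_standard` bind the column name through a default argument (`dte_col=dte_col`). The toy example below, with invented values, shows why: without the default, every lambda created in the comprehension sees the last value of the loop variable.

```python
cols = ["visit_dte", "diagnosis_dte"]
row = {"visit_dte": "01/02/2020", "diagnosis_dte": "03/04/2021"}

# Late binding: each lambda looks up `c` when it is called, after the loop ended.
late = {c: (lambda r: r[c]) for c in cols}

# Default argument: the current value of `c` is captured when the lambda is created.
bound = {c: (lambda r, c=c: r[c]) for c in cols}

print(late["visit_dte"](row))   # reads the diagnosis_dte value
print(bound["visit_dte"](row))  # reads the visit_dte value
```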

In [20]:
def transform_vaed(vaed_proc):
    """
    Filter VAED procedures to only those that correspond to bariatric surgery
    
    Args
    ----
    vaed_proc (pd.DataFrame): VAED procedures dataframe
    
    Return
    ------
    vaed_proc_events (pd.DataFrame): only those VAED procedures that match the criteria
    """
    vaed_proc_events = (
        vaed_proc
        .loc[vaed_proc.procedure_code.str.startswith('30511') | vaed_proc.procedure_code.str.startswith('30512'), :]
        .query('usi != ""')
        .loc[:, ['usi', 'effective_dte', 'procedure_code']]
        .rename(columns={'effective_dte': 'dte', 'procedure_code': 'value'})
        .assign(event_type='vaed_surgery')
        .assign(bariatric_surgery_flag=True)
    )
    
    return vaed_proc_events

In [22]:
def transform_tables(encounters, diagnoses, pathology, prescription, hwbmi):
    """
    Transform each of the event types, introducing relevant flags and renaming
    columns
    """
    # transform each of the source data tables, including adding flags
    encounters_new = transform_table_standard(encounters, 
                                              rename_cols_encounters, 
                                              select_cols_encounters, 
                                              flag_cols=flag_cols_encounters, 
                                              dte_cols=['visit_dte'], 
                                              main_date_col='visit_dte',
                                              text_cols=['encounter_reason'], 
                                              event_name='encounter')
    
    diagnoses_new = transform_table_standard(diagnoses, 
                                             rename_cols_diagnoses, 
                                             select_cols_diagnoses, 
                                             flag_cols=flag_cols_diagnoses, 
                                             dte_cols=['diagnosis_dte'], 
                                             main_date_col='diagnosis_dte',
                                             text_cols=['diagnosis_reason'], 
                                             event_name='diagnosis')
    
    # the basic cohort consisting of encounters and diagnoses
    # with patients who have at least one weight_flag = True event
    cohort = (
        pd.concat([encounters_new, diagnoses_new])
    )
    
    usi_weight = cohort.query('weight_flag == True').usi.unique()
    
    cohort_small = (
        cohort
        .query('usi in @usi_weight')
        .drop_duplicates()
        .reset_index()
        .iloc[:, 1:]
    )
    
    # get the patients from this cohort
    usi_basic = cohort_small.usi.unique()
    
    pathology_new = transform_table_standard(pathology.query('usi in @usi_basic'), 
                                             rename_cols_pathology, 
                                             select_cols_pathology, 
                                             flag_cols=None, 
                                             dte_cols=['result_dte'], 
                                             main_date_col='result_dte',
                                             text_cols=['result_name_standard'], 
                                             event_name='pathology')
    
    prescription_new = transform_table_standard(prescription.query('usi in @usi_basic'), 
                                                rename_cols_prescription, 
                                                select_cols_prescription, 
                                                flag_cols_prescription, 
                                                dte_cols_prescription, 
                                                main_date_col='firstprescrip_dte',
                                                text_cols=text_cols_prescription, 
                                                event_name='prescription')
    
    hwbmi_new = transform_table_standard(hwbmi.query('usi in @usi_basic'),
                                         rename_cols_observations, 
                                         select_cols_observations, 
                                         flag_cols=None,
                                         dte_cols=['observation_dte'], 
                                         main_date_col='observation_dte',
                                         text_cols=['observation_type'], 
                                         event_name='observation')
    
    return cohort_small, pathology_new, prescription_new, hwbmi_new


def filter_vaed_data(cohort, vaed_proc, cancer):
    """
    Given a cohort of patients, VAED procedures and cancer cases, 
    filter the VAED procedures to bariatric surgery and the cancer cases
    to patients in the cohort
    """    
    
    uwl_usi = cohort.usi.unique()
    
    # filter to VAED events related to bariatric surgery
    vaed_proc = (
        vaed_proc
        .loc[vaed_proc.procedure_code.str.startswith('30511') | vaed_proc.procedure_code.str.startswith('30512'), :]
        .query('usi != ""')
        .loc[:, ['usi', 'effective_dte', 'procedure_code']]
        .rename(columns={'effective_dte': 'dte', 'procedure_code': 'value'})
        .assign(event_type='vaed_surgery')
        .assign(bariatric_surgery_flag=True)
    )
    
    # cancer diagnoses for cohort patients, sorted by usi and date (earliest first)
    cancer_uwl_first = (
        cancer
        .assign(usi=cancer.usi.astype(str).str.rjust(10, '0'))
        .rename(columns={'incidence_dte': 'dte', 'diagnosis_icd10am_3cde': 'value'})
        .query('usi in @uwl_usi')
        .assign(dte=lambda df_: pd.to_datetime(df_.dte, format='%d%b%Y'))
        .assign(event_type='cancer_diagnosis')
        .loc[:, ['usi', 'dte', 'value', 'event_type']]
        .sort_values(by=['usi', 'dte'])
    )
    
    return vaed_proc, cancer_uwl_first

def filter_cohort(cohort):
    """
    Filter the cohort to only those with likely uwl encounters
    
    """
    # patients with at least one weight loss event that is neither
    # medication-related nor a weight loss inquiry
    uwl_candidates = (
        cohort
        .assign(dte=pd.to_datetime(cohort.dte))
        .assign(dte=lambda df_: df_.dte.dt.normalize())
        .assign(value=cohort.value.astype(str))
        .query('weightloss_flag == True')
        .query('weightloss_med_flag == False')
        .query('weightloss_inquiry_flag == False')
        .usi
        .unique()
    )

    cohort = (
        cohort
        .query('usi in @uwl_candidates')
        .assign(uwl_flag=False)
        .assign(uwl_flag=lambda df_: df_.uwl_flag.where(((df_.weightloss_flag == True) & 
                                                        (df_.weightloss_med_flag == False) & 
                                                        (df_.weightloss_inquiry_flag == False)) == False, True)) 
        .assign(usi_linked=lambda df_: df_.usi.str.startswith('G') == False, # flag for whether patient can be linked
                index_case=0)
    )
    
    return cohort

def add_cancer_dates(cohort):
    """
    Add flags to cohort describing events that are cancer diagnoses within 6 months of index date
    and whether patient is diagnosed with cancer before index date
    """
    # first cancer diagnosis date for each patient
    cancer_diagnosis_date = (
        cohort
        .query('event_type == "cancer_diagnosis"')
        .sort_values(by=['usi', 'dte'])
        .groupby('usi')
        .dte
        .first()
    )

    # index dates for each patient
    index_date = (
        cohort
        .query('at_index_date == True')
        .sort_values(by=['usi', 'dte'])
        .groupby('usi')
        .dte
        .first()
    )
    
    # those patients who have a cancer diagnosis before the index date
    cancer_before_index = (cancer_diagnosis_date - index_date) < pd.Timedelta(0)
    usi_cancer_before_index = cancer_before_index[cancer_before_index].index
    
    # those patients who have a cancer diagnosis no later than 6 months (183 days) after the index date
    # NOTE: there is no lower bound, so diagnoses before the index date are also flagged here
    time_delay1 = (cancer_diagnosis_date - index_date) <= pd.Timedelta(days=183)
    usi_cancer_within_6months = time_delay1[time_delay1].index

    cohort = (
        cohort
        .assign(cancer_before_index_date=lambda df_: df_.usi.isin(usi_cancer_before_index),
                cancer_within_6months_index_date=lambda df_: df_.usi.isin(usi_cancer_within_6months))
    )
    
    return cohort
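
The two timing comparisons in `add_cancer_dates` can be illustrated with three invented patients. The `bounded` variant is a hypothetical alternative, shown only to make the effect of a lower bound visible; it is not what the function computes.

```python
import pandas as pd

patients = ["A", "B", "C"]
index_date = pd.Series(pd.to_datetime(["2020-01-01"] * 3), index=patients)
cancer_date = pd.Series(
    pd.to_datetime(["2019-06-01", "2020-03-01", "2021-01-01"]), index=patients
)

delay = cancer_date - index_date

before = delay < pd.Timedelta(0)                # diagnosis precedes index date
upper_only = delay <= pd.Timedelta(days=183)    # upper bound only, as in the function
bounded = delay.between(pd.Timedelta(0), pd.Timedelta(days=183))  # hypothetical variant

print(pd.DataFrame({"before": before, "upper_only": upper_only, "bounded": bounded}))
```

Patient A, diagnosed before the index date, satisfies the upper-bound-only comparison, so the two flags need to be read together downstream.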

In [23]:
def filter_cohort_new(cohort):
    """
    Filter the cohort to only those with likely uwl encounters
    
    """
    # patients with at least one weight loss event that is neither
    # medication-related nor a weight loss inquiry
    uwl_candidates = (
        cohort
        .query('weightloss_flag == True')
        .query('weightloss_med_flag == False')
        .query('weightloss_inquiry_flag == False')
        .usi
        .unique()
    )

    cohort = (
        cohort
        .query('usi in @uwl_candidates')
        .assign(uwl_flag=False)
        .assign(uwl_flag=lambda df_: df_.uwl_flag.where(((df_.weightloss_flag == True) & 
                                                        (df_.weightloss_med_flag == False) & 
                                                        (df_.weightloss_inquiry_flag == False)) == False, True)) 
        .assign(usi_linked=lambda df_: df_.usi.str.startswith('G') == False, # flag for whether patient can be linked
                index_case=0)
    )
    
    return cohort

## 4. Tidy up

In [24]:
final_cols_new = ['usi',
                  'dte',
                  'event_type',
                  'event_subtype',
                  'value',
                  'value2',
                  'value3', 
                  'weight_flag',
                  'weightloss_flag',
                  'weightloss_inquiry_flag',
                  'weightloss_med_flag',
                  'bariatric_surgery_flag']
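
Not every source table carries every column in this list, so concatenating them leaves gaps that are later filled with False. A minimal sketch with two invented one-row tables:

```python
import pandas as pd

# An encounter row has flags but no observation value; an observation row is the reverse.
encounters = pd.DataFrame({"usi": ["A"], "event_type": ["encounter"],
                           "weightloss_flag": [True]})
observations = pd.DataFrame({"usi": ["A"], "event_type": ["observation"],
                             "value": ["72.5"]})

log = pd.concat([encounters, observations], ignore_index=True)
print(log["weightloss_flag"].isna().tolist())  # the observation row has no flag yet

# Missing flags mean "not flagged", so fill them with False and restore a bool dtype.
log["weightloss_flag"] = log["weightloss_flag"].fillna(False).astype(bool)
print(log)
```

The final `astype(bool)` is an addition for the example: after concatenation the flag column holds mixed values, and casting back keeps later boolean filters straightforward.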

In [25]:
def load_data(filename_encounter, 
              filename_diagnosis, 
              filename_prescription, 
              filename_patient, 
              filename_hwbmi,
              filename_pathology,
              filename_vaed_proc,
              filename_cancer, 
              source_folder1, 
              source_folder2, 
              source_folder3, 
              source_folder4,
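              remove_zedmed=False):
    """
    Load the Patron, VAED procedure and cancer tables, attach the usi to each
    Patron events table and optionally drop ZedMed patients.
    
    Returns the tuple (encounters, diagnoses, prescription, patient, hwbmi,
    pathology, vaed_proc, cancer).
    """
    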
    encounters_a1 = pd.read_parquet(f'{source_folder1}/{filename_encounter}')
    diagnoses_a1 = pd.read_parquet(f'{source_folder1}/{filename_diagnosis}')
    prescription_a1 = pd.read_parquet(f'{source_folder1}/{filename_prescription}').rename(columns=standard_cols_prescription)
    patient_a1 = pd.read_parquet(f'{source_folder1}/{filename_patient}')

    hwbmi_a1 = pd.read_parquet(f'{source_folder2}/{filename_hwbmi}')
    pathology_a1 = pd.read_parquet(f'{source_folder2}/{filename_pathology}')

    vaed_proc_a1 = pd.read_parquet(f'{source_folder3}/{filename_vaed_proc}')

    cancer_a1 = pd.read_csv(f'{source_folder4}/{filename_cancer}')

    # remove ZedMed patients from Patron
    zedmed_usi = pd.read_csv('M:/Working/DataAnalysis/CleanAndStructure/PAT012_ZM_USI.csv').Patient_USI.values

    # patientid to usi
    patientid_to_usi = dict(patient_a1[['patientid', 'usi']].values)

    # usi to patientid
    usi_to_patientid = dict(patient_a1[['usi', 'patientid']].values)

    pathology_a1.result_dte = pd.to_datetime(pathology_a1.result_dte, format='%d%b%Y', errors='coerce')

    # convert usi 
    patient_a1.usi = patient_a1.usi.astype(str)
    cancer_a1.usi = cancer_a1.usi.astype(str).str.rjust(10, '0')

    # include usi in all the events tables
    encounters_a1 = (
        encounters_a1
        .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )

    diagnoses_a1 = (
        diagnoses_a1
        .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )

    prescription_a1 = (
        prescription_a1
        .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )

    hwbmi_a1 = (
        hwbmi_a1
        .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )

    pathology_a1 = (
        pathology_a1
        .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )

    vaed_proc_a1 = (
        vaed_proc_a1
    #    .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )

    cancer_a1 = (
        cancer_a1
    #    .assign(usi=lambda df_: df_.patientid.map(patientid_to_usi))
    )
    
    # if we want to remove the zedmed usi
    if remove_zedmed:
        encounters_a1 = encounters_a1.query('(usi in @zedmed_usi) == False')
        diagnoses_a1 = diagnoses_a1.query('(usi in @zedmed_usi) == False')
        prescription_a1 = prescription_a1.query('(usi in @zedmed_usi) == False')
        hwbmi_a1 = hwbmi_a1.query('(usi in @zedmed_usi) == False')
        pathology_a1 = pathology_a1.query('(usi in @zedmed_usi) == False')
        vaed_proc_a1 = vaed_proc_a1.query('(usi in @zedmed_usi) == False')
        cancer_a1 = cancer_a1.query('(usi in @zedmed_usi) == False')
    
    return encounters_a1, diagnoses_a1, prescription_a1, patient_a1, hwbmi_a1, pathology_a1, vaed_proc_a1, cancer_a1

In [26]:
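def create_event_log(cohort_small, prescription, hwbmi, pathology, vaed, cancer):
    """
    Concatenate the cohort, prescription, observation, pathology, cancer and
    VAED tables into a single event log restricted to final_cols_new, with
    missing flags set to False and text values upper-cased and stripped.
    """
    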
    cohort_small_new = pd.concat([cohort_small, 
                                  prescription, 
                                  hwbmi, 
                                  pathology])

    # filter vaed data to potential uwl patients
    vaed_new, cancer_uwl = filter_vaed_data(cohort_small_new, vaed, cancer)

    # the event log
    cohort_new = pd.concat([cohort_small_new, 
                           prescription,
                           hwbmi, 
                           pathology,
                           cancer_uwl, 
                           vaed_new])[final_cols_new].drop_duplicates()

    # fill in missing flag values with False
    cohort_new = (
        cohort_new
        .assign(weight_flag=lambda df_: df_.weight_flag.fillna(False),
                weightloss_flag=lambda df_: df_.weightloss_flag.fillna(False),
                weightloss_inquiry_flag=lambda df_: df_.weightloss_inquiry_flag.fillna(False),
                weightloss_med_flag=lambda df_: df_.weightloss_med_flag.fillna(False), 
                bariatric_surgery_flag=lambda df_: df_.bariatric_surgery_flag.fillna(False))
    )    

    # upper-case the text values and strip surrounding whitespace
    cohort_new = (
        cohort_new
        .assign(value=lambda df_: df_.value.str.upper().str.strip())
        .assign(value2=lambda df_: df_.value2.str.upper().str.strip())
        .assign(value3=lambda df_: df_.value3.str.upper().str.strip())
    )
    
    return cohort_new

def tidy_event_log(events):
    """Clean the event log and keep only patients with at least one weight loss event."""
    cols = ['usi', 'dte', 'value', 'value2', 'value3', 'event_type', 'weight_flag', 'weightloss_flag',
           'weightloss_inquiry_flag', 'bariatric_surgery_flag', 
           'weightloss_med_flag', 'event_subtype']

    # the event log
    events_new = (
        events
        .loc[:, cols]
        .assign(weight_flag=lambda df_: df_.weight_flag.fillna(False),
                weightloss_flag=lambda df_: df_.weightloss_flag.fillna(False),
                weightloss_inquiry_flag=lambda df_: df_.weightloss_inquiry_flag.fillna(False), 
                weightloss_med_flag=lambda df_: df_.weightloss_med_flag.fillna(False), 
                bariatric_surgery_flag=lambda df_: df_.bariatric_surgery_flag.fillna(False))
        .query('usi != ""')
        .sort_values(by=['usi', 'dte'])
        .drop_duplicates()
    )

    # consider only those usi values that have at least one weightloss_flag == True 
    usi_wl_a2 = (
        events_new
        .query('weightloss_flag == True')
        .usi
        .unique()
    )

    events_final = (
        events_new
        .query('usi in @usi_wl_a2')
    )

    events_final = (
        events_final
        .assign(dte=lambda df_: pd.to_datetime(df_.dte, format='%Y-%m-%d'))
        .assign(dte=lambda df_: df_.dte.dt.normalize())
        .assign(value=lambda df_: df_.value.astype(str))
        .reset_index(drop=True)
    )
    
    return events_final

def create_final_cohort(cohort):
    """Apply the patient-level exclusions to produce the final UWL cohort."""
    uwl_candidates = (
        cohort
        .query('weightloss_flag == True')
        .query('weightloss_med_flag == False')
        .query('weightloss_inquiry_flag == False')
        .query('weight_increase == False')
        .query('bariatric_surgery_flag == False')
        .query('cancer_before_index_date == False')
        .loc[:, 'usi']
        .unique()
    )

    cohort_uwl = (
        cohort
        .query('usi in @uwl_candidates')
    )

    usi_previous_med = (
        cohort_uwl.query('weightloss_med_flag == True')
        .query('two_years_before == True')[['usi', 'usi_linked']]
        .drop_duplicates()
     #   .query('usi_linked == True')
    ).usi.values

    usi_previous_bariatric = (
        cohort_uwl.query('bariatric_surgery_flag == True')
        .query('six_months_before == True')[['usi', 'usi_linked']]
        .drop_duplicates()
      #  .query('usi_linked == True')
    ).usi.values

    cohort_uwl = (
        cohort_uwl
        .query('usi not in @usi_previous_bariatric')
        .query('usi not in @usi_previous_med')
    )
    
    return cohort_uwl
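
`create_final_cohort` works at patient level: one qualifying event row makes a patient a candidate, and one disqualifying row in the look-back window removes every row for that patient. The sketch below illustrates this two-stage pattern on an invented event log. The column subset and the values are assumptions chosen for illustration, not Patron data.

```python
import pandas as pd

# invented event log: three patients, a subset of the real columns
toy_cohort = pd.DataFrame({
    "usi": ["p1", "p1", "p2", "p2", "p3"],
    "weightloss_flag": [True, False, True, False, False],
    "weightloss_med_flag": [False, False, False, True, False],
    "two_years_before": [False, True, False, True, False],
})

# stage 1: a patient is a candidate if any row is a weight loss event
# that is not itself a weight loss medication event
toy_candidates = (
    toy_cohort
    .loc[toy_cohort.weightloss_flag & ~toy_cohort.weightloss_med_flag, "usi"]
    .unique()
)

# stage 2: find candidates with a medication event in the two years before
toy_previous_med = (
    toy_cohort
    .loc[toy_cohort.usi.isin(toy_candidates)
         & toy_cohort.weightloss_med_flag
         & toy_cohort.two_years_before, "usi"]
    .unique()
)

# keep all rows for candidates that were not excluded in stage 2
toy_final = toy_cohort.loc[
    toy_cohort.usi.isin(toy_candidates) & ~toy_cohort.usi.isin(toy_previous_med)
]
```

Here `p2` qualifies in stage 1 but is removed in stage 2, and `p3` never qualifies, so only the two rows for `p1` remain.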

## 5. Create the cohort

### Load and transform data to create event log

In [91]:
print("Loading data for patron")
(encounters_patron, diagnoses_patron, prescription_patron, patient_patron,
 hwbmi_patron, pathology_patron, vaed_proc_a1, cancer_a1) = load_data(
    filename_encounter_patron,
    filename_diagnosis_patron,
    filename_prescription_patron,
    filename_patient_patron,
    filename_hwbmi_patron,
    filename_pathology_patron,
    filename_vaed_proc,
    filename_cancer,
    source_folder1_patron,
    source_folder2_patron,
    source_folder3_patron,
    source_folder4_patron,
    remove_zedmed=True,
)

Loading data for patron


In [254]:
encounters_all = (
    encounters_patron
    .assign(dte=lambda df_: format_dates(df_.visit_dte, date_formats=['%Y-%m-%d']))
#    .query('dte.dt.year >= 2020')
 #   .query('dte.dt.year <= 2020')
    .reset_index()
 #   .loc[:, ['patientid', 'usi', 'visit_dte', 'encounter_reason']]
)

diagnoses_all = (
    diagnoses_patron
    .assign(dte=lambda df_: format_dates(df_.diagnosis_dte))
   # .query('dte.dt.year >= 2020')
   # .query('dte.dt.year <= 2020')
    .reset_index()
 #   .loc[:, ['patientid', 'usi', 'diagnosis_reason', 'diagnosis_dte', 'diagnosis_status_active_flg']]
)

In [256]:
enc_diag = pd.concat([encounters_all, diagnoses_all])

In [209]:
# prescriptions
prescription_all = prescription_patron

# patients
patient_all = patient_patron

# height, weight and BMI observations
hwbmi_all = hwbmi_patron

# pathology results
pathology_all = pathology_patron

In [2]:
# how many patients to start with?
num_patients_step1 = patient_all.usi.unique().shape[0]
num_patients_step1

In [211]:
print("Transforming tables")
# creates the flags for weight loss, weight loss inquiry, weight, and bariatric surgery.
cohort_small_a2, pathology_new_a2, prescription_new_a2, hwbmi_new_a2 = transform_tables(encounters_all, 
                                                                                        diagnoses_all, 
                                                                                        pathology_all, 
                                                                                        prescription_all,
                                                                                        hwbmi_all)

print("Creating event log")
# creates an event log with all the event types along with a flag for weight loss prescription
cohort_small_new_a2 = create_event_log(cohort_small_a2, 
                                       prescription_new_a2, 
                                       hwbmi_new_a2, 
                                       pathology_new_a2, 
                                       vaed_proc_a1, 
                                       cancer_a1)

print("Tidying event log")
# cleans datetimes, removes empty usi values and filters to usi values that have weightloss_flag == True
cohort_small_new_a2 = tidy_event_log(cohort_small_new_a2)

Transforming tables
Creating event log
Tidying event log


In [None]:
cohort_small_a2.usi.unique().shape

In [None]:
# how many patients have a weight_flag == True event, and how many rows are in the table overall?
cohort_small_a2.query('weight_flag == True').usi.unique().shape[0], cohort_small_a2.shape[0]

In [None]:
# how many have weightloss_flag == True event, i.e., have a weight regex and a loss regex?
usi_weightloss = cohort_small_a2.query('weightloss_flag == True').usi.unique()
usi_weightloss.shape[0], cohort_small_a2.query('usi in @usi_weightloss').shape[0]

In [None]:
# how many have weightloss_inquiry_flag == True, i.e., have weightloss_flag == True and an inquiry regex?
usi_weightloss_inquiry = cohort_small_a2.query('weightloss_inquiry_flag == True').usi.unique()
usi_weightloss_inquiry.shape[0], cohort_small_a2.query('usi in @usi_weightloss_inquiry').shape[0]

In [None]:
# how many have weightloss_flag == True and weightloss_inquiry_flag == False
usi_weightloss_no_inquiry = cohort_small_a2.query('weightloss_flag == True').query('weightloss_inquiry_flag == False').usi.unique()
usi_weightloss_no_inquiry.shape[0], cohort_small_a2.query('usi in @usi_weightloss_no_inquiry').shape[0]

In [None]:
# how many have bariatric_surgery_flag == True?
usi_bariatric = cohort_small_a2.query('bariatric_surgery_flag == True').usi.unique()
usi_bariatric.shape[0], cohort_small_a2.query('usi in @usi_bariatric').shape[0]

In [None]:
# how many are candidates at this stage? i.e., (weightloss_flag == True) & (weightloss_inquiry_flag == False) & (bariatric_surgery_flag == False)
usi_candidates_step2 = cohort_small_a2.query('weightloss_flag == True').query('weightloss_inquiry_flag == False').query('bariatric_surgery_flag == False').usi.unique()
usi_candidates_step2.shape[0], cohort_small_a2.query('usi in @usi_candidates_step2').shape[0]

In [None]:
cohort_small_new_a2.query('weightloss_flag == True').usi.unique().shape[0]

In [None]:
# number of patients in the tidied event log (each has at least one weightloss_flag == True event) and number of rows
cohort_small_new_a2.usi.unique().shape[0], cohort_small_new_a2.shape[0]
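
The cells above repeat one pattern: count the unique patients that match a flag condition, then count all event rows that belong to those patients. A small helper can make each patient-flow step a one-liner. This is a sketch only: `patient_flow_counts` is a hypothetical name that does not appear elsewhere in the notebook, and the toy frame is invented.

```python
import pandas as pd

def patient_flow_counts(events, condition, patientid="usi"):
    """Return (number of patients with a row matching condition,
    number of rows in events belonging to those patients)."""
    patients = events.query(condition)[patientid].unique()
    num_rows = events[patientid].isin(patients).sum()
    return len(patients), int(num_rows)

# invented events for three patients
toy_events = pd.DataFrame({
    "usi": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "weightloss_flag": [True, False, False, True, False, False],
    "weightloss_inquiry_flag": [False, False, False, True, False, False],
})

counts_weightloss = patient_flow_counts(toy_events, "weightloss_flag == True")
counts_no_inquiry = patient_flow_counts(
    toy_events, "weightloss_flag == True and weightloss_inquiry_flag == False"
)
```

On the toy data, two patients have a weight loss event and together account for five rows, while only one of them has a weight loss event that is not an inquiry.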

### Make predictions of UWL / not UWL based on the text

In [225]:
# text processing
weight_measurement_pattern2 = r'([0-9]+\s{0,1}kg)'
weight_measurement_pattern1 = r'([0-9]+\.[0-9]+\s{0,1}kg)'
weight_measurement_pattern = r'|'.join([weight_measurement_pattern1, weight_measurement_pattern2])

number_pattern = r'([0-9]+)'
months_pattern = r'([0-9]+)\s{0,1}\/\s{0,1}12'
weeks_pattern = r'([0-9]+)\s{0,1}\/\s{0,1}52'
days_pattern = r'([0-9]+)\s{0,1}\/\s{0,1}7'

token_pattern = r'([\!\/\?\+\<\>])'
remove_pattern = r'[\.\,\;\:\-\(\)]'

In [226]:
def text_processing(texts):
    """
    Preprocessing of strings from encounter reasons to ensure
    relevant non-alphanumeric characters are treated as tokens
    and others are removed.
    
    Args
    texts (pd.Series): input reasons for encounter texts 
    
    Return
    texts (pd.Series): the processed text after applying transformations
    """
    texts = (texts
             .str.lower()
             .str.replace(weeks_pattern, r'\g<1> weeks', regex=True)
             .str.replace(days_pattern, r'\g<1> days', regex=True)
             .str.replace(months_pattern, r'\g<1> months', regex=True)
             .str.replace(weight_measurement_pattern, '[weight_measurement]', regex=True)
             .str.replace(number_pattern, '[number_value]', regex=True)
             .str.replace(token_pattern, r' \g<1> ', regex=True)
             .str.replace(remove_pattern, ' ', regex=True)
             .str.replace('[ ]+', ' ', regex=True)
             .str.strip()
            )
    
    return texts
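
To see what these substitutions do to a single encounter reason, the sketch below replays the same steps with the standard-library `re` module on invented strings. It copies the patterns from the previous cell; `process_one_text` is a hypothetical helper for illustration and is not used elsewhere in the notebook.

```python
import re

# patterns copied from the cell above
weight_measurement_pattern = r'([0-9]+\.[0-9]+\s{0,1}kg)|([0-9]+\s{0,1}kg)'
number_pattern = r'([0-9]+)'
months_pattern = r'([0-9]+)\s{0,1}\/\s{0,1}12'
weeks_pattern = r'([0-9]+)\s{0,1}\/\s{0,1}52'
days_pattern = r'([0-9]+)\s{0,1}\/\s{0,1}7'
token_pattern = r'([\!\/\?\+\<\>])'
remove_pattern = r'[\.\,\;\:\-\(\)]'

def process_one_text(text):
    """Apply the text_processing steps, in the same order, to one string."""
    text = text.lower()
    text = re.sub(weeks_pattern, r'\g<1> weeks', text)
    text = re.sub(days_pattern, r'\g<1> days', text)
    text = re.sub(months_pattern, r'\g<1> months', text)
    text = re.sub(weight_measurement_pattern, '[weight_measurement]', text)
    text = re.sub(number_pattern, '[number_value]', text)
    text = re.sub(token_pattern, r' \g<1> ', text)
    text = re.sub(remove_pattern, ' ', text)
    text = re.sub('[ ]+', ' ', text)
    return text.strip()

example = process_one_text("Weight loss 5kg over 3/12, unintentional?")
# -> 'weight loss [weight_measurement] over [number_value] months unintentional ?'

abbreviation = process_one_text("Wt loss F.I.")
# -> 'wt loss f i'
```

The second example shows a side effect worth keeping in mind: because `remove_pattern` replaces full stops with spaces, an abbreviation such as `F.I.` reaches later steps as `f i`.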

In [None]:
# tokens that relate to positive sentiment (label 3)
features_group1 = ['excellent', 'great', 'good', 'vg', '!', 'success', 
                   'spectacular', 'well', 'improved', 'very']

# tokens that indicate actions to lose weight (label 3)
features_group2 = ['exercise', 'diet', 'start', 'restart', 'aim', 'target', 
                   'struggle', 'difficult', 'trouble']

# tokens that indicate some sort of weight loss treatment (label 3)
features_group3 = ['request', 'therapy', 'drug', 'treatment', 'medication', 'med', 'tab']

# tokens that indicate mood disorder (label 2)
features_group4 = ['depression', 'anxiety', 'anxious', 'depressed', 'mood']

# tokens that indicate patient seeking advice on how to lose weight (label 3)
features_group5 = ['talk', 'advise', 'advice', 'discuss', 'consider', 'strategy', 
                   'education', 'counsel', 'program', 'mx']

# tokens that also relate to weight loss being intentional or trying to lose weight (label 3)
features_group6 = ['holiday', 'surgery', 'overweight', 'gain']

# intention (label 1)
features_group7 = ['unintentional', 'unexplained', 'unintended', 'no cause']

# investigations (label 1)
features_group8 = ['fi', 'f.i.', 'f.i', 'f/i', 'ix', 'investigation']

# weight loss symptom (label 1)
features_group9 = ['cachexic', 'cachexia', 'anorexia']

# cancer symptom (label 1)
features_group10 = ['nausea', 'tiredness', 'cough', 'sweats', 'splenomegaly', 
                    'abdo', 'diarrhoea', 'appetite', 'lethargy', 'tired', 'anaemia', 'anemia']

# combine the tokens into negative (unlikely to be UWL) and positive (may be UWL) feature lists
features_negative = features_group1 + features_group2 + features_group3 + features_group5 + features_group6
features_positive = features_group4 + features_group7 + features_group8 + features_group9 + features_group10

len(set(features_negative)), len(set(features_positive))

In [228]:
# simplest classifier: check if there are positive or negative features and classify accordingly
regex_neg = r'|'.join(features_negative)
regex_pos = r'|'.join(features_positive)

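def uwl_classifier(text, uwl_terms, nonuwl_terms):
    """
    Classify a processed text as UWL (1) or not UWL (0) by counting
    which terms occur in it as substrings. Ties are classified as UWL.
    """
    num_uwl_terms = sum(term in text for term in uwl_terms)
    num_nonuwl_terms = sum(term in text for term in nonuwl_terms)
    
    # not UWL only when the non-UWL terms outnumber the UWL terms
    if num_nonuwl_terms > num_uwl_terms:
        return 0
    else:
        return 1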
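
As a worked example of the counting rule, the sketch below runs a copy of the classifier on invented, already-processed texts with shortened term lists. Because `term in text` is a substring test, short terms can match inside longer words (for example `med` inside `medical`). The whole-token variant shown alongside is an assumption about one possible refinement, not part of the original algorithm.

```python
def uwl_classifier(text, uwl_terms, nonuwl_terms):
    # same rule as the cell above: 0 only if non-UWL terms outnumber UWL terms
    num_uwl_terms = sum(term in text for term in uwl_terms)
    num_nonuwl_terms = sum(term in text for term in nonuwl_terms)
    return 0 if num_nonuwl_terms > num_uwl_terms else 1

def uwl_classifier_tokens(text, uwl_terms, nonuwl_terms):
    # hypothetical variant: compare against whitespace-separated tokens
    tokens = set(text.split())
    num_uwl_terms = sum(term in tokens for term in uwl_terms)
    num_nonuwl_terms = sum(term in tokens for term in nonuwl_terms)
    return 0 if num_nonuwl_terms > num_uwl_terms else 1

# shortened term lists, for illustration only
toy_positive = ['unintentional', 'appetite', 'ix']
toy_negative = ['diet', 'exercise', 'med']

pred_uwl = uwl_classifier('weight loss unintentional poor appetite', toy_positive, toy_negative)
pred_diet = uwl_classifier('weight loss diet and exercise', toy_positive, toy_negative)
pred_tie = uwl_classifier('weight loss', toy_positive, toy_negative)

# 'med' is a substring of 'medical', so the substring rule counts it
pred_substring = uwl_classifier('weight loss medical review', toy_positive, toy_negative)
pred_tokens = uwl_classifier_tokens('weight loss medical review', toy_positive, toy_negative)
```

The first text is classified as UWL, the second as not UWL, and the text with no matching terms falls through to UWL because ties favour the positive class. The last two predictions differ (0 for the substring rule, 1 for the token rule). A token-based rule has its own cost: multi-word terms such as `no cause` would need separate handling.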

In [229]:
print("Create cohort of UWL candidate patients")
cohort_final_new_a2 = (
    filter_cohort_new(cohort_small_new_a2)
    .assign(dte=lambda df_: pd.to_datetime(df_.dte))
    .query('dte.isnull() == False')
   # .pipe(add_index_case_label, patientid='usi')
    .reset_index(drop=True)
)

Create cohort of UWL candidate patients


In [230]:
cohort_final_new_a2.usi.unique().shape, cohort_final_new_a2.query('uwl_flag == True').query('usi_linked == True').usi.unique().shape

In [232]:
# classification algorithm on candidate texts
cohort_final_new_a2 = (
    cohort_final_new_a2
    .assign(text_processed=lambda df_: text_processing(df_.value))
)

uwl_candidate_texts = cohort_final_new_a2.text_processed.values
uwl_candidate_preds = np.array([uwl_classifier(text, features_positive, features_negative) for text in uwl_candidate_texts])

cohort_final_new_a2 = (
    cohort_final_new_a2
    .assign(uwl_prediction=uwl_candidate_preds)
)

# set those that have a prediction of 0 based on the classifier to have uwl_flag == False
cohort_final_new_a2.loc[cohort_final_new_a2.query('uwl_flag == True').query('uwl_prediction == 0').index, 'uwl_flag'] = False

In [None]:
# how many candidates are there at this stage?
usi_candidates_final_stage = cohort_final_new_a2.usi.unique()
usi_candidates_final_stage.shape[0], cohort_final_new_a2.query('usi in @usi_candidates_final_stage').shape[0]

### Create index date and time period flags

In [234]:
print("Create cohort of UWL candidate patients")
cohort_final_new_a2 = (
    cohort_final_new_a2
 #   filter_cohort_new(cohort_small_new_a2)
  #  .assign(dte=lambda df_: pd.to_datetime(df_.dte))
  #  .query('dte.isnull() == False')
    .pipe(add_index_case_label, patientid='usi')
    .reset_index(drop=True)
)

print("Adding time period flags")
cohort_final_new_a2 = (
    cohort_final_new_a2
    .pipe(events_within_time_period, date='dte', time_period=(0, 31), time_period_label='one_month_after')
    .pipe(events_within_time_period, date='dte', time_period=(0, 92), time_period_label='three_months_after')
    .pipe(events_within_time_period, date='dte', time_period=(0, 183), time_period_label='six_months_after')
    .pipe(events_within_time_period, date='dte', time_period=(0, 366), time_period_label='one_year_after')
    .pipe(events_within_time_period, date='dte', time_period=(0, 731), time_period_label='two_years_after')
    .pipe(events_within_time_period, date='dte', time_period=(0, 1828), time_period_label='five_years_after')
    .pipe(events_within_time_period, date='dte', time_period=(-31, 0), time_period_label='one_month_before')
    .pipe(events_within_time_period, date='dte', time_period=(-92, 0), time_period_label='three_months_before')
    .pipe(events_within_time_period, date='dte', time_period=(-183, 0), time_period_label='six_months_before')
    .pipe(events_within_time_period, date='dte', time_period=(-366, 0), time_period_label='one_year_before')
    .pipe(events_within_time_period, date='dte', time_period=(-731, 0), time_period_label='two_years_before')
    .pipe(events_within_time_period, date='dte', time_period=(-1828, 0), time_period_label='five_years_before')
    .pipe(events_within_time_period, date='dte', time_period=(0, 0), time_period_label='at_index_date')
)

print("Add dates of cancer diagnoses")
cohort_final_new_a2 = add_cancer_dates(cohort_final_new_a2)

print("Remove patients with a weight increase")
# flag patients who have an observed weight increase; the flag is used as an exclusion in create_final_cohort
cohort_final_new_a2 = (
    cohort_final_new_a2
    .query('dte.isnull() == False')
    .groupby('usi')
    .apply(lambda df_: weight_increase(df_))
)

print("Final cohort")  # remove patients with bariatric surgery in the previous six months or weight loss medication in the previous two years
cohort_uwl = create_final_cohort(cohort_final_new_a2)

Create cohort of UWL candidate patients
Adding time period flags
Add dates of cancer diagnoses
Remove patients with a weight increase
Final cohort
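
The time-period flags above are expressed as day offsets from each patient's index date. The sketch below shows one way such a flag could be computed. It assumes an `index_date` column holding each patient's index date and treats both window bounds as inclusive; `flag_within_window` is a hypothetical stand-in, and the notebook's own `events_within_time_period` is defined elsewhere and may differ in both respects.

```python
import pandas as pd

def flag_within_window(events, date, index_date, time_period, label):
    """Add a boolean column that is True when the event date falls within
    time_period = (start_days, end_days) of the index date, inclusive."""
    days_from_index = (events[date] - events[index_date]).dt.days
    start_days, end_days = time_period
    return events.assign(**{label: days_from_index.between(start_days, end_days)})

# invented events for one patient with an index date of 1 January 2020
toy_events = pd.DataFrame({
    "usi": ["p1", "p1", "p1", "p1"],
    "dte": pd.to_datetime(["2019-12-01", "2020-01-01", "2020-03-01", "2020-09-01"]),
    "index_date": pd.to_datetime(["2020-01-01"] * 4),
})

toy_flagged = (
    toy_events
    .pipe(flag_within_window, date="dte", index_date="index_date",
          time_period=(0, 183), label="six_months_after")
    .pipe(flag_within_window, date="dte", index_date="index_date",
          time_period=(-183, 0), label="six_months_before")
)
```

With inclusive bounds, the event on the index date itself is flagged in both the "before" and the "after" window, which matters when a flag such as `six_months_before` is used as an exclusion.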


In [None]:
usi_uwl_final = cohort_uwl.query('uwl_flag == True').usi.unique()
usi_uwl_final.shape[0], cohort_uwl.query('usi in @usi_uwl_final').shape[0]

In [None]:
cohort_uwl.usi.unique().shape[0], cohort_uwl.shape

In [114]:
# write the data
cohort_uwl.to_parquet("M:/Working/AL/projects/weight_loss/outputs/uwl_cohort_patron_010823.parquet")