# Summary stats of UWL patients

4 July 2022

Updated 5 August

Updated 8 February 2024

---

## Description

This notebook covers the generation of statistics from the cohort of Unexpected Weight Loss (UWL) patients. The statistics allow for comparison with the UK study (Nicholson et al.).

This notebook follows the other cohort statistics notebook, but here we remove patients whose encounter reason has been assigned either 'Intended' or 'Not relevant'.

Statistics generated include:
- Baseline characteristics of the UWL cohort
- Missingness of variables: pathology, observations, smoking, alcohol
- Weight and BMI distributions (overall and by age)
- Weight changes of UWL cohort patients up to index date
- Prevalence of recording of pathology test results (including over different time periods)
- Prevalence of abnormal pathology test results

## 0.1. Config

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import re

sns.set()
%matplotlib inline

In [96]:
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 100)

## 0.2. Load data

In [97]:
source_folder = "M:/AL/projects/weight_loss/outputs"
source_folder2 = "M:/MDAP_Collaboration/data/original/NPS_Complete"
source_folder4 = "M:/DataAnalysis/RCOLD/DataFiles"

source_folder = "M:/Working/AL/projects/weight_loss/outputs"
source_folder2 = "M:/Working/MDAP_Collaboration/data/original/NPS_Complete"
source_folder4 = "M:/Working/DataAnalysis/RCOLD/DataFiles"

filename_cohort = 'uwl_cohort_210422.parquet'
filename_cohort = 'uwl_cohort_130722b.parquet' # this includes the VAED cancer coding and UWL subclasses
filename_cohort = 'uwl_cohort_020822b.parquet'
filename_cohort = 'uwl_cohort_201122.parquet'

filename_cohort = 'uwl_cohort_201122.csv'
filename_patient = "NPS_Patient_202107.parquet"
filename_patient = "NPS_Patient_202107.csv"

#filename_cohort = 'cohort_patron_nps_with_labels_060323.parquet' 
#filename_patient = "patron_nps_with_age_groups_060323.parquet"

filename_cohort = 'cohort_patron_nps_with_labels_BN_ranges_210323.parquet'
filename_patient = "patron_nps_with_age_groups_BN_ranges_210323.parquet"

filename_cohort = 'nps_patron_cohort_020224.parquet'
filename_patient = 'nps_patron_with_age_groups_uk_ranges_020224.parquet'

filename_cancer = 'VAED_Cancers_coded.csv'

In [98]:
# the cohort of all weight related patients
#cohort_base = pd.read_parquet(f'{source_folder}/{filename_cohort}')
cohort_base = pd.read_parquet(f'{source_folder}/{filename_cohort}')

# patient table
#patient = pd.read_parquet(f'{source_folder2}/{filename_patient}')
patient = pd.read_parquet(f'{source_folder}/{filename_patient}')

# filter to those patients in the cohort who do not have a weight increase
#cohort = cohort_base.query('weight_increase == False')

# VAED cancer coding
#cancer = pd.read_csv(f'{source_folder4}/{filename_cancer}')

In [99]:
# filter to those patients in the cohort who do not have a weight increase
cohort = cohort_base.query('weight_increase == False')

# VAED cancer coding
# convert cancer usi to standard 10-digit format
cancer = (
    pd.read_csv(f'{source_folder4}/{filename_cancer}')
    .assign(usi=lambda df_: df_.usi.astype(str).str.pad(side='left', fillchar='0', width=10))
)

In [100]:
# standardize the linkable USI values to be 10 digit left zero-padded integers
patient = (
    patient
    .query('usi.isnull() == False')
    .query('usi.str.startswith("G") == False')
    .assign(usi=lambda df_: df_.usi
            .astype(float)
            .astype(int)
            .astype(str)
            .str
            .pad(width=10, side='left', fillchar='0'))
)
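To illustrate the standardisation step above, here is a minimal sketch on a made-up `usi` column (the values are hypothetical and not taken from the NPS data). It drops missing and `G`-prefixed identifiers, then left-pads the remainder to 10 digits.

```python
import pandas as pd

# hypothetical identifiers: float-like strings, a non-linkable "G" id and a missing value
toy = pd.DataFrame({"usi": ["12345.0", "G000017", None, "987654321"]})

toy_clean = (
    toy
    .dropna(subset=["usi"])                              # remove missing identifiers
    .loc[lambda df_: ~df_["usi"].str.startswith("G")]    # remove non-linkable "G" identifiers
    .assign(usi=lambda df_: df_["usi"]
            .astype(float)                               # "12345.0" -> 12345.0
            .astype(int)                                 # 12345.0 -> 12345
            .astype(str)
            .str.pad(width=10, side="left", fillchar="0"))
)

print(toy_clean["usi"].tolist())
```

Going through `float` is only safe while the identifiers stay well below 2**53, which holds for 10-digit values.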

In [101]:
# datetime conversions
cohort = (
    cohort
    .assign(dte=pd.to_datetime(cohort.dte))
  #  .assign(dte2=pd.to_datetime(cohort.dte2))
    .assign(dte=lambda df_: df_.dte.dt.normalize())
  #  .assign(dte2=lambda df_: df_.dte2.dt.normalize())
    .drop_duplicates()
)
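The reason for normalising before `drop_duplicates()` can be seen in a small sketch: two records that differ only in their time of day become identical once the time component is removed. The rows below are hypothetical.

```python
import pandas as pd

# hypothetical event log: the same weight recorded twice on one day
toy = pd.DataFrame({
    "usi": ["0000000001", "0000000001"],
    "dte": ["2021-03-04 09:15:00", "2021-03-04 16:40:00"],
    "event_subtype": ["WEIGHT", "WEIGHT"],
    "value": ["80", "80"],
})

toy_norm = (
    toy
    .assign(dte=lambda df_: pd.to_datetime(df_["dte"]).dt.normalize())  # set time to midnight
    .drop_duplicates()                                                  # rows are now identical
)

print(len(toy), len(toy_norm))
```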

## 0.3. Useful functions

In [102]:
from scipy.stats import norm

In [103]:
# for calculating Positive Predictive Values and confidence intervals
def z_value(alpha):
    return norm.ppf(1 - alpha/2)

def calculate_ppv(tp, fp):
    return tp / (tp + fp)

def calculate_seppv(tp, fp):
    ppv = calculate_ppv(tp, fp)
    se = np.sqrt(ppv * (1 - ppv) / (tp + fp))
    return se

def calculate_cippv(tp, fp, alpha):
    ppv = calculate_ppv(tp, fp)
    se = calculate_seppv(tp, fp)
    ci_l = ppv - z_value(alpha)*se
    ci_r = ppv + z_value(alpha)*se
    return ci_l, ci_r
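The interval returned by `calculate_cippv` is the normal-approximation (Wald) interval, PPV ± z × SE. When the PPV is close to 0 and the counts are small, its lower bound can fall below zero. The sketch below compares it with the Wilson score interval, a common alternative that stays within [0, 1]. This comparison is not part of the original analysis: the counts are hypothetical, the function names are made up for the example, and `statistics.NormalDist` stands in for `scipy.stats.norm` only so that the example runs with the standard library.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(tp, fp, alpha=0.05):
    """Normal-approximation interval, the same formula as calculate_cippv."""
    n = tp + fp
    ppv = tp / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(ppv * (1 - ppv) / n)
    return ppv - z * se, ppv + z * se


def wilson_interval(tp, fp, alpha=0.05):
    """Wilson score interval for a proportion."""
    n = tp + fp
    ppv = tp / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z**2 / n
    centre = (ppv + z**2 / (2 * n)) / denom
    half_width = z * sqrt(ppv * (1 - ppv) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# hypothetical counts: 2 cancers among 100 patients
wald = wald_interval(tp=2, fp=98)
wilson = wilson_interval(tp=2, fp=98)
print(wald)
print(wilson)
```

With these counts the Wald lower bound is negative, while the Wilson bounds remain between 0 and 1.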

In [104]:
def recent_weight_change(df, perc=False):
    """
    Calculate the change between the two most recent weights up to the index date
    
    Args
    df (pd.DataFrame): a patient event log for a given patient, sorted by date
    perc (bool, default=False): whether or not to calculate the change as a percentage
    
    Return
    weight_change (float): change between the previous weight and the latest weight
        recorded on or before the index date
    
    """
    index_date = df[df['index_case'] == 1]['dte'].values[0]
    df_recent = df[df['dte'] <= index_date]
    weights_recent = df_recent.query('event_subtype == "WEIGHT"')
    weights = weights_recent['value'].astype(float).values
    if perc:
        weight_change = (weights[-1] - weights[-2]) / weights[-2] * 100
    else:
        weight_change = weights[-1] - weights[-2]
    return weight_change


def weight_change_over_period(df, perc=False):
    """
    Calculate the change between the first and the latest weight up to the index date
    
    Args
    df (pd.DataFrame): a patient event log for a given patient, sorted by date
    perc (bool, default=False): whether or not to calculate the change as a percentage
    
    Return
    weight_change (float): change between the first recorded weight and the latest weight
        recorded on or before the index date
    
    """
    index_date = df[df['index_case'] == 1]['dte'].values[0]
    df_recent = df[df['dte'] <= index_date]
    weights_recent = df_recent.query('event_subtype == "WEIGHT"')
    weights = weights_recent['value'].astype(float).values
    if perc:
        weight_change = (weights[-1] - weights[0]) / weights[0] * 100
    else:
        weight_change = weights[-1] - weights[0]
    return weight_change
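Both functions above assume that the event log is already sorted by date and that the patient has at least two weights on or before the index date; otherwise `weights[-2]` raises an `IndexError`. The following is a minimal sketch of a guarded variant, not the notebook's own implementation. The name `weight_change_sketch` and the `baseline` parameter are hypothetical, and the event log is made up.

```python
import numpy as np
import pandas as pd


def weight_change_sketch(df, perc=False, baseline="previous"):
    """Change in weight up to the index date; NaN when fewer than two weights exist."""
    index_date = df.loc[df["index_case"] == 1, "dte"].iloc[0]
    weights = (
        df[(df["dte"] <= index_date) & (df["event_subtype"] == "WEIGHT")]
        .sort_values("dte")["value"]        # sort explicitly instead of assuming order
        .astype(float)
        .to_numpy()
    )
    if len(weights) < 2:
        return np.nan
    start = weights[-2] if baseline == "previous" else weights[0]
    change = weights[-1] - start
    return change / start * 100 if perc else change


# hypothetical, deliberately unsorted event log; the index date is 2020-06-01
log = pd.DataFrame({
    "dte": pd.to_datetime(["2020-06-01", "2019-01-01", "2020-01-01", "2020-09-01"]),
    "event_subtype": ["WEIGHT", "WEIGHT", "WEIGHT", "WEIGHT"],
    "value": ["76", "80", "78", "75"],
    "index_case": [1, 0, 0, 0],
})

recent = weight_change_sketch(log)                                  # 76 - 78
overall = weight_change_sketch(log, baseline="first")               # 76 - 80
overall_pct = weight_change_sketch(log, perc=True, baseline="first")
single = weight_change_sketch(log.iloc[[0]])                        # only one weight
print(recent, overall, overall_pct, single)
```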

In [105]:
def best_pathology_result(results):
    """
    Given a set of results, return the one that is either 
    within one month after the index date or if this does 
    not exist then the one that is the closest in the previous 
    3 months
    
    Args
    results (pd.DataFrame): dataframe with at least the following fields
        1month_after
        3months_before
        index_date_dt (used to order the results)
        dte
        result
        
    Return
    best_result (pd.DataFrame): most representative pathology results
    """
    results_after = results[results['1month_after'] == True]
    results_before = results[results['3months_before'] == True]
    
    if results_after.shape[0] > 0:
        best_result = results_after.sort_values(by='index_date_dt').iloc[0:1]
    elif results_before.shape[0] > 0:
        best_result = results_before.sort_values(by='index_date_dt').iloc[0:1]
    else:
        best_result = results_after
        
    return best_result


def results_in_pathology_group(results, pathology_group):
    """
    Given a set of results and list of pathology test names
    determine if the results contains them all
    
    Args
    results (pd.DataFrame): dataframe with pathology data
    pathology_group (list of str): pathology test names
    
    Returns
    result_in_group (bool): True if results contains all the 
        result names
    """
    results_overlap = set(results['event_subtype'].values).intersection(pathology_group)
    results_in_group = (len(results_overlap) == len(pathology_group))
    return results_in_group


def calculate_cohort_counts(patients, groups=['age_group', 'gender_code'], ascending_order=[True, False]):
    """
    Calculate the numbers of cancer and non-cancer patients
    
    Args
    patients (pd.DataFrame): has columns usi, cancer along with columns in groups
    groups (list of str): the columns that define the grouping of the data
    ascending_order (list of bool): ordering of the values for each of the columns in groups
    
    Returns
    cohort_counts (pd.DataFrame): num_patients, num_cancer and num_no_cancer for each group
    """
    cohort_counts = patients.groupby(groups)['cancer'].agg(num_patients=('size'), num_cancer=('sum'))
    
    # order the groups
    cohort_counts = cohort_counts.sort_values(by=groups, ascending=ascending_order)
    cohort_counts['num_no_cancer'] = cohort_counts['num_patients'] - cohort_counts['num_cancer']
    
    return cohort_counts
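As a usage illustration for the grouping logic in `calculate_cohort_counts`, the sketch below applies the same named aggregation to a hypothetical patient-level table (one row per patient, `cancer` coded 0/1). It uses `sort_index` rather than `sort_values` because the grouping columns are index levels after the aggregation; that substitution is a choice made for this example.

```python
import pandas as pd

# hypothetical patient-level table: one row per patient
toy_patients = pd.DataFrame({
    "usi": ["01", "02", "03", "04", "05"],
    "age_group": ["40-59", "40-59", "60-79", "60-79", "60-79"],
    "gender_code": ["F", "M", "F", "F", "M"],
    "cancer": [0, 1, 0, 1, 1],
})

toy_counts = (
    toy_patients
    .groupby(["age_group", "gender_code"])["cancer"]
    .agg(num_patients="size", num_cancer="sum")       # patients and cancers per group
    .sort_index(ascending=[True, False])              # age ascending, gender descending
    .assign(num_no_cancer=lambda df_: df_["num_patients"] - df_["num_cancer"])
)

print(toy_counts)
```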

## 1. Overall statistics

For each of the subcohorts (unintended weight loss, intent unknown, the two combined, and the full cohort), calculate:

- Number of patients
- Number of patients diagnosed with cancer within 6 months of index date
- Number of events
- Number of events before and after the index date
- Number of patients that can be linked

### 1.1. Number of patients

How many patients in each of the subcohorts:
- Unintended weight loss
- Intent unknown
- Either intent

In [1]:
# how many patients in cohort
patient['usi'].unique().shape

In [2]:
# how many of the USIs can be linked (i.e., do not start with "G")?
cohort.query('uwl_flag == 1')[['usi', 'usi_linked']].drop_duplicates()['usi_linked'].value_counts()

In [3]:
# only unintended
cohort.query('uwl_prediction == 1')['usi'].unique().shape

In [4]:
# how many of the USIs can be linked (i.e., do not start with "G")?
cohort.query('uwl_prediction == 1')[['usi', 'usi_linked']].drop_duplicates()['usi_linked'].value_counts()

In [5]:
# intent unknown
cohort.query('uwl_prediction == 2')[['usi', 'usi_linked']].drop_duplicates()['usi_linked'].value_counts()

In [208]:
# different subcohorts based on intent
cohort1 = cohort[cohort.usi.isin(cohort.query('uwl_prediction == 1').usi.unique())]
cohort2 = cohort[cohort.usi.isin(cohort.query('uwl_prediction == 2').usi.unique())]
cohort12 = cohort[cohort.usi.isin(cohort.query('uwl_prediction in [1, 2]').usi.unique())]
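The subcohorts above are selected at patient level: once any event of a patient carries the relevant `uwl_prediction`, all of that patient's events are kept. A minimal sketch on hypothetical rows shows how this differs from filtering the events directly (the value 0 for unflagged events is an assumption made for the example).

```python
import pandas as pd

# hypothetical event log for three patients
toy = pd.DataFrame({
    "usi": ["A", "A", "B", "B", "C"],
    "event_subtype": ["WEIGHT", "ENCOUNTER", "WEIGHT", "ENCOUNTER", "WEIGHT"],
    "uwl_prediction": [0, 1, 0, 2, 0],
})

# event-level filter: keeps only the flagged rows
flagged_rows = toy.query("uwl_prediction == 1")

# patient-level filter (as for cohort1): keeps every event of a flagged patient
usi_flagged = toy.query("uwl_prediction == 1")["usi"].unique()
toy_cohort1 = toy[toy["usi"].isin(usi_flagged)]

print(len(flagged_rows), len(toy_cohort1))
```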

In [210]:
cohort_uwl = cohort12.query('usi_linked == True')
#cohort_uwl.query('event_type == "pathology"')['event_subtype'].value_counts()

### 1.2. Number of events

How many different events in each of the subcohorts.

In [None]:
cohort_uwl['usi'].shape

In [271]:
usi_12_linked = cohort12.query('usi_linked == True').usi.unique()
usi_1_linked = cohort1.query('usi_linked == True').usi.unique()
usi_2_linked = cohort2.query('usi_linked == True').usi.unique()

In [272]:
cohort1_linkable = cohort1.query('usi in @usi_1_linked')
cohort2_linkable = cohort2.query('usi in @usi_2_linked')
cohort12_linkable = cohort12.query('usi in @usi_12_linked')

In [None]:
cohort1_linkable.shape, cohort2_linkable.shape, cohort12_linkable.shape

In [274]:
num_uwl1_linkable = cohort1_linkable['usi'].unique().shape[0]
num_uwl12_linkable = cohort12_linkable['usi'].unique().shape[0]

num_uwl1_linkable_cancer = cohort1_linkable.merge(cancer, 
                                                  left_on='usi', 
                                                  right_on='usi', 
                                                  how='inner')['usi'].unique().shape[0]
num_uwl12_linkable_cancer = cohort12_linkable.merge(cancer, 
                                                    left_on='usi', 
                                                    right_on='usi', 
                                                    how='inner')['usi'].unique().shape[0]
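The inner merge above is used only to count distinct patients who appear in the cancer table. As a sketch under that reading, the same count can be obtained with `isin`, which avoids building the intermediate many-to-many join. The identifiers below are hypothetical.

```python
import pandas as pd

# hypothetical event log (several rows per patient) and cancer table (several rows per patient)
toy_cohort = pd.DataFrame({"usi": ["0000000001", "0000000001", "0000000002", "0000000003"]})
toy_cancer = pd.DataFrame({"usi": ["0000000001", "0000000001", "0000000009"]})

# count via inner merge: every cohort row is paired with every matching cancer row
merged = toy_cohort.merge(toy_cancer, on="usi", how="inner")
n_via_merge = merged["usi"].nunique()

# count via membership test: no intermediate join
n_via_isin = toy_cohort.loc[toy_cohort["usi"].isin(toy_cancer["usi"]), "usi"].nunique()

print(len(merged), n_via_merge, n_via_isin)
```

Both approaches count any cancer record in the table, regardless of when it occurred relative to the index date.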

In [None]:
print(f'Percentage of linkable patients in the unintended subcohort who have a cancer diagnosis in VAED: {num_uwl1_linkable_cancer / num_uwl1_linkable * 100}')
print(f'Percentage of linkable patients in the combined UWL cohort who have a cancer diagnosis in VAED: {num_uwl12_linkable_cancer / num_uwl12_linkable * 100}')

In [276]:
#print(f'Proportion of patients in NPS who have a cancer diagnosis in VAED: {num_nps_linkable_cancer / num_nps_linkable * 100}')
#print(f'Proportion of patients in UWL cohort who have a cancer diagnosis in VAED: {num_uwl_linkable_cancer / num_uwl_linkable * 100}')

In [277]:
usi_uwl = cohort_uwl['usi'].unique()
patient_uwl = patient[patient['usi'].isin(usi_uwl)]

In [None]:
cohort_uwl['index_case'].value_counts()

In [None]:
cohort1['usi'].unique().shape, cohort2['usi'].unique().shape, cohort12['usi'].unique().shape

In [280]:
# ignore events that occur before 2000
# what is the time period for each patient? unintended
usi_dates = cohort1.query('dte > "2000-01-01"').groupby('usi')['dte'].agg(['min', 'max'])
# span between first and last event in years (365.25 days per year)
usi_time_period = (usi_dates['max'] - usi_dates['min']) / pd.Timedelta(days=365.25)

In [None]:
usi_time_period.agg(['min', 'max', 'median', 'mean'])

In [228]:
# ignore events that occur before 2000
# what is the time period for each patient? intent unknown subcohort
usi_dates = cohort2.query('dte > "2000-01-01"').groupby('usi')['dte'].agg(['min', 'max'])
# span between first and last event in years (365.25 days per year)
usi_time_period = (usi_dates['max'] - usi_dates['min']) / pd.Timedelta(days=365.25)

In [None]:
usi_time_period.agg(['min', 'max', 'median', 'mean'])

In [230]:
# ignore events that occur before 2000
# what is the time period for each patient? combined 12 subcohort
usi_dates = cohort12.query('dte > "2000-01-01"').groupby('usi')['dte'].agg(['min', 'max'])
# span between first and last event in years (365.25 days per year)
usi_time_period = (usi_dates['max'] - usi_dates['min']) / pd.Timedelta(days=365.25)

In [None]:
usi_time_period.agg(['min', 'max', 'median', 'mean'])
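The per-patient observation period computed in the cells above is the span between a patient's first and last event after 2000, expressed in years. A minimal sketch on hypothetical dates, assuming a year length of 365.25 days:

```python
import pandas as pd

# hypothetical event dates; the 1999 event is excluded by the date filter
toy = pd.DataFrame({
    "usi": ["A", "A", "A", "B", "B"],
    "dte": pd.to_datetime(["1999-05-01", "2010-01-01", "2012-01-01",
                           "2015-06-01", "2015-06-01"]),
})

span = toy.query('dte > "2000-01-01"').groupby("usi")["dte"].agg(["min", "max"])
years = (span["max"] - span["min"]) / pd.Timedelta(days=365.25)

print(years)
print(years.agg(["min", "max", "median", "mean"]))
```

Patients whose events all fall on a single day, like patient B here, get a period of zero.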

### 1.3. Cancer status

How many patients have cancer within 6 months of index date?

In [None]:
# in original cohort
num_cancer = cohort.query('cancer_within_6months_index_date == True')['usi'].unique().shape[0]; num_cancer

In [None]:
# unintended
num_cancer1 = cohort1.query('cancer_within_6months_index_date == True')['usi'].unique().shape[0]; num_cancer1

In [None]:
# intent unknown
num_cancer2 = cohort2.query('cancer_within_6months_index_date == True')['usi'].unique().shape[0]; num_cancer2

In [None]:
# intent unknown or unintended
num_cancer12 = cohort12.query('cancer_within_6months_index_date == True')['usi'].unique().shape[0]; num_cancer12

In [None]:
# fraction of patients who develop cancer (who can be linked)
# unintended WL
num_linked_usi1 = cohort1.query('usi_linked == True')['usi'].unique().shape[0]
num_cancer1 / num_linked_usi1 * 100

In [None]:
# fraction of patients who develop cancer (who can be linked)
# intent unknown
num_linked_usi2 = cohort2.query('usi_linked == True')['usi'].unique().shape[0]
num_cancer2 / num_linked_usi2 * 100

In [None]:
# fraction of patients (who can be linked)
num_linked_likely_unintended = cohort12.query('usi_linked == True')['usi'].unique().shape[0]
num_cancer12 / num_linked_likely_unintended * 100

In [None]:
num_linked_likely_unintended

Check that we have removed all patients who had a cancer diagnosis before the index date.


In [None]:
# patients who had a diagnosis of cancer before the index date
usi_cancer_before = cohort.query('cancer_before_index_date == True')['usi'].unique()
usi_cancer_before.shape[0]

In [None]:
# patients who have UWL and can be linked
num_uwl_patients = cohort12.query('usi_linked == True')['usi'].unique().shape[0]
num_uwl_patients, num_uwl_patients - num_cancer12

In [140]:
# filter to only UWL patients who can be linked
cohort_uwl = cohort12.query('usi_linked == True')

In [None]:
cohort_uwl.usi.unique().shape[0]

In [142]:
num_uwl_patients = cohort_uwl.usi.unique().shape[0]

In [143]:
cohort_1 = cohort1.query('usi_linked == True')

In [144]:
cohort_2 = cohort2.query('usi_linked == True')

In [None]:
num_uwl_patients

In [146]:
# convert to 10 digit 0-padded usi
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
cohort_uwl = cohort_uwl.assign(usi=lambda df_: df_['usi'].astype(str).str.pad(width=10, side='left', fillchar='0'))


## 2. Observations: Weight and BMI

How many weight and BMI measurements do these patients have, overall and within particular time periods?

In [None]:
cohort_uwl.usi_linked.value_counts()

In [236]:
# columns kept for each measurement: identifiers, value and the time-period flags
measurement_cols = ['usi', 
                    'dte', 
                    'value', 
                    'at_index_date', 
                    'five_years_before', 
                    'two_years_before', 
                    'one_year_before',
                    'six_months_before', 
                    'three_months_before', 
                    'one_month_before',
                    'one_month_after', 
                    'six_months_after']

# combined UWL cohort
weights = cohort_uwl.query('event_subtype == "WEIGHT"')[measurement_cols].drop_duplicates()
bmi = cohort_uwl.query('event_subtype == "BMI"')[measurement_cols].drop_duplicates()

# unintended weight loss subcohort
weights1 = cohort_1.query('event_subtype == "WEIGHT"')[measurement_cols].drop_duplicates()
bmi1 = cohort_1.query('event_subtype == "BMI"')[measurement_cols].drop_duplicates()

# intent unknown subcohort
weights2 = cohort_2.query('event_subtype == "WEIGHT"')[measurement_cols].drop_duplicates()
bmi2 = cohort_2.query('event_subtype == "BMI"')[measurement_cols].drop_duplicates()

In [None]:
# how many patients have at least one weight measurement?
weights['usi'].unique().shape, weights['usi'].unique().shape[0] / num_uwl_patients * 100

In [None]:
# how many patients have at least one weight measurement?
weights1['usi'].unique().shape, weights1['usi'].unique().shape[0] / num_linked_usi1 * 100

In [None]:
# how many patients have at least one weight measurement?
weights2['usi'].unique().shape, weights2['usi'].unique().shape[0] / num_linked_usi2 * 100

### 2.1. Weights

In [287]:
# convert to tidy form to make calculations easier
weights_tidy = pd.melt(weights, 
                       id_vars=['usi', 'dte', 'value'], 
                       value_vars=['at_index_date', 
                                   'five_years_before',
                                   'two_years_before',
                                   'one_year_before',
                                   'six_months_before',
                                   'three_months_before',
                                   'one_month_before',
                                   'one_month_after', 
                                   'six_months_after'], 
                       value_name='in_time_period', 
                       var_name='time_period')
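The melt above turns one row per measurement, with one boolean column per time period, into one row per measurement and time period. A minimal sketch with two hypothetical flag columns shows the resulting shape and how the tidy form supports counting by period.

```python
import pandas as pd

# hypothetical weights for one patient, with two of the time-period flags
toy_weights = pd.DataFrame({
    "usi": ["A", "A"],
    "dte": pd.to_datetime(["2020-01-10", "2020-06-01"]),
    "value": ["80", "76"],
    "at_index_date": [False, True],
    "six_months_before": [True, True],
})

toy_tidy = pd.melt(
    toy_weights,
    id_vars=["usi", "dte", "value"],
    value_vars=["at_index_date", "six_months_before"],   # each flag listed once
    value_name="in_time_period",
    var_name="time_period",
)

# number of measurements falling in each time period
toy_counts = toy_tidy.query("in_time_period == True").groupby("time_period").size()
print(toy_tidy)
print(toy_counts)
```

Each flag column should appear only once in `value_vars`; depending on the pandas version, a repeated name can duplicate the rows for that period and inflate the counts.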

In [288]:
# convert to tidy form to make calculations easier
weights_tidy1 = pd.melt(weights1, 
                       id_vars=['usi', 'dte', 'value'], 
                       value_vars=['at_index_date', 
                                   'five_years_before',
                                   'two_years_before',
                                   'one_year_before',
                                   'six_months_before',
                                   'three_months_before',
                                   'one_month_before',
                                   'one_month_after', 
                                   'six_months_after'], 
                       value_name='in_time_period', 
                       var_name='time_period')

In [289]:
# convert to tidy form to make calculations easier
weights_tidy2 = pd.melt(weights2, 
                       id_vars=['usi', 'dte', 'value'], 
                       value_vars=['at_index_date', 
                                   'five_years_before',
                                   'two_years_before',
                                   'one_year_before',
                                   'six_months_before',
                                   'three_months_before',
                                   'one_month_before',
                                   'one_month_after', 
                                   'six_months_after'], 
                       value_name='in_time_period', 
                       var_name='time_period')

In [290]:
# convert time period to categorical based on time ordering
weights_tidy['time_period'] = pd.Categorical(weights_tidy['time_period'], 
                                             ['five_years_before', 
                                              'two_years_before', 
                                              'one_year_before', 
                                              'six_months_before', 
                                              'three_months_before', 
                                              'one_month_before', 
                                              'at_index_date', 
                                              'one_month_after', 
                                              'six_months_after'])

# convert time period to categorical based on time ordering
weights_tidy1['time_period'] = pd.Categorical(weights_tidy1['time_period'], 
                                             ['five_years_before', 
                                              'two_years_before', 
                                              'one_year_before', 
                                              'six_months_before', 
                                              'three_months_before', 
                                              'one_month_before', 
                                              'at_index_date', 
                                              'one_month_after', 
                                              'six_months_after'])

# convert time period to categorical based on time ordering
weights_tidy2['time_period'] = pd.Categorical(weights_tidy2['time_period'], 
                                             ['five_years_before', 
                                              'two_years_before', 
                                              'one_year_before', 
                                              'six_months_before', 
                                              'three_months_before', 
                                              'one_month_before', 
                                              'at_index_date', 
                                              'one_month_after', 
                                              'six_months_after'])

In [None]:
# number of total weight measurements in cohort by time period
weights_tidy.query('in_time_period == True').groupby('time_period').size()

In [None]:
# for each usi and time period, how many weight measurements are there?
usi_time_period_counts = weights_tidy.query('in_time_period == True').groupby(['usi', 'time_period'], as_index=False).size()
usi_time_period_counts

# for each usi and time period, how many weight measurements are there?
usi_time_period_counts1 = weights_tidy1.query('in_time_period == True').groupby(['usi', 'time_period'], as_index=False).size()
usi_time_period_counts1

# for each usi and time period, how many weight measurements are there?
usi_time_period_counts2 = weights_tidy2.query('in_time_period == True').groupby(['usi', 'time_period'], as_index=False).size()
usi_time_period_counts2

In [293]:
# for each time period and size how many patients are there?
time_period_size_counts = usi_time_period_counts.groupby(['time_period', 'size'], as_index=False).count().query('size > 0')

# for each time period and size how many patients are there?
time_period_size_counts1 = usi_time_period_counts1.groupby(['time_period', 'size'], as_index=False).count().query('size > 0')

# for each time period and size how many patients are there?
time_period_size_counts2 = usi_time_period_counts2.groupby(['time_period', 'size'], as_index=False).count().query('size > 0')

In [294]:
# sort the data and calculate number of usi in time period with at least size n
time_period_size_counts['cumulative_count'] = time_period_size_counts.sort_values(by=['time_period', 'size'], ascending=[True, False]).groupby('time_period')['usi'].cumsum()

# sort the data and calculate number of usi in time period with at least size n
time_period_size_counts1['cumulative_count'] = time_period_size_counts1.sort_values(by=['time_period', 'size'], ascending=[True, False]).groupby('time_period')['usi'].cumsum()

# sort the data and calculate number of usi in time period with at least size n
time_period_size_counts2['cumulative_count'] = time_period_size_counts2.sort_values(by=['time_period', 'size'], ascending=[True, False]).groupby('time_period')['usi'].cumsum()
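The cumulative count works by sorting the measurement counts in descending order within each time period and accumulating the patient counts, so each row ends up holding the number of patients with at least that many measurements. A small sketch with hypothetical per-patient counts:

```python
import pandas as pd

# hypothetical per-patient measurement counts within one time period
per_patient = pd.DataFrame({
    "usi": ["A", "B", "C", "D"],
    "time_period": ["one_year_before"] * 4,
    "size": [1, 1, 2, 3],
})

# patients with exactly n measurements
exact = per_patient.groupby(["time_period", "size"], as_index=False)["usi"].count()

# accumulate from the largest n downwards: patients with at least n measurements
exact["cumulative_count"] = (
    exact.sort_values(by=["time_period", "size"], ascending=[True, False])
    .groupby("time_period")["usi"]
    .cumsum()
)

print(exact)
```

The assignment aligns on the row index, so the values land on the correct rows even though the cumulative sum was computed in sorted order.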

In [None]:
# pivot table showing number of patients with at least n weight measurements for each of the time periods (1 <= n <= 5)
time_period_count_pivot = time_period_size_counts[['time_period', 
                                                   'size', 
                                                   'cumulative_count']].query('size < 6').pivot(index='time_period', 
                                                                                                columns='size', 
                                                                                                values='cumulative_count').T

time_period_count_pivot

In [None]:
# pivot table showing number of patients with at least n weight measurements for each of the time periods (1 <= n <= 5)
time_period_count_pivot1 = time_period_size_counts1[['time_period', 
                                                   'size', 
                                                   'cumulative_count']].query('size < 6').pivot(index='time_period', 
                                                                                                columns='size', 
                                                                                                values='cumulative_count').T

time_period_count_pivot1

In [None]:
# pivot table showing number of patients with at least n weight measurements for each of the time periods (1 <= n <= 5)
time_period_count_pivot2 = time_period_size_counts2[['time_period', 
                                                   'size', 
                                                   'cumulative_count']].query('size < 6').pivot(index='time_period', 
                                                                                                columns='size', 
                                                                                                values='cumulative_count').T

time_period_count_pivot2
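The pivot step reshapes the long table of cumulative counts into a grid of measurement count by time period. The sketch below uses hypothetical counts and plain string period labels; it also shows that a combination with no patients becomes `NaN`, which turns the affected column into floats.

```python
import pandas as pd

# hypothetical long-format cumulative counts
toy_long = pd.DataFrame({
    "time_period": ["one_year_before"] * 3 + ["at_index_date"] * 2,
    "size": [1, 2, 3, 1, 2],
    "cumulative_count": [4, 2, 1, 3, 1],
})

toy_pivot = (
    toy_long
    .query("size < 6")
    .pivot(index="time_period", columns="size", values="cumulative_count")
    .T                                   # rows: number of measurements, columns: time period
)

print(toy_pivot)
```

Integer formatting in a heatmap annotation (`fmt='d'`) expects integer values, so gaps like this may need filling before plotting.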

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.heatmap(time_period_count_pivot.T, annot=True, fmt='d')
ax.set_xlabel('Number of measurements')
ax.set_ylabel('Time period')
ax.set_xticklabels([r'$\geq 1$', r'$\geq 2$', r'$\geq 3$', r'$\geq 4$', r'$\geq 5$'])
ax.set_yticklabels(['5 years before', '2 years before', '1 year before', 
                    '6 months before', '3 months before', '1 month before', 
                    'at index date', '1 month after', '6 months after'])
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.heatmap(time_period_count_pivot.T / num_uwl_patients * 100, annot=True)
ax.set_xlabel('Number of measurements')
ax.set_ylabel('Time period')
ax.set_xticklabels([r'$\geq 1$', r'$\geq 2$', r'$\geq 3$', r'$\geq 4$', r'$\geq 5$'])
ax.set_yticklabels(['5 years before', '2 years before', '1 year before', 
                    '6 months before', '3 months before', '1 month before', 
                    'at index date', '1 month after', '6 months after'])
plt.show()

In [None]:
# how many distinct weight measurements?
weights[['usi', 'dte', 'value']].drop_duplicates().shape[0]

### 2.2. Weight changes

For those patients who have at least two weight recordings in the 24 months up to the index date, what is the distribution of weight changes?
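
As an illustration of the quantity in question, the sketch below computes the change between a patient's earliest and latest weight on toy data. The helper `first_to_last_change` and the data are hypothetical; the notebook's own `recent_weight_change` is defined elsewhere and may differ, for example in which recordings it compares.

```python
import pandas as pd

# Toy event log: one row per weight recording (hypothetical data)
toy_weights = pd.DataFrame({
    "usi": ["a", "a", "a", "b", "b"],
    "dte": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01",
                           "2020-03-01", "2020-09-01"]),
    "value": [80.0, 78.0, 72.0, 65.0, 66.3],
})


def first_to_last_change(g, perc=False):
    """Change from the earliest to the latest weight in one patient's records.

    Hypothetical stand-in for illustration, not the notebook's recent_weight_change.
    """
    g = g.sort_values("dte")
    first, last = g["value"].iloc[0], g["value"].iloc[-1]
    change = last - first
    return change / first * 100 if perc else change


# One value per patient: absolute change and percentage change
toy_changes = pd.Series(
    {usi: first_to_last_change(g) for usi, g in toy_weights.groupby("usi")}
)
toy_changes_perc = pd.Series(
    {usi: first_to_last_change(g, perc=True) for usi, g in toy_weights.groupby("usi")}
)
```

Negative values indicate weight loss. The mean and median of `toy_changes` and `toy_changes_perc` correspond to the summary figures reported in the cells that follow.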

In [None]:
# filter to patients who have at least two weight recordings in the 2 years up to the index date
# (unique() so a patient matching both time periods is counted once)
usi_two_weights = usi_time_period_counts.query('size >= 2').query('time_period == ["two_years_before", "at_index_date"]')['usi'].unique()
usi_two_weights.shape

# filter to patients who have at least two weight recordings in the 2 years up to the index date
usi_two_weights1 = usi_time_period_counts1.query('size >= 2').query('time_period == ["two_years_before", "at_index_date"]')['usi'].unique()
usi_two_weights1.shape

# filter to patients who have at least two weight recordings in the 2 years up to the index date
usi_two_weights2 = usi_time_period_counts2.query('size >= 2').query('time_period == ["two_years_before", "at_index_date"]')['usi'].unique()
usi_two_weights2.shape

In [None]:
usi_two_weights.shape, usi_two_weights.shape[0] / num_uwl_patients * 100

In [None]:
usi_two_weights1.shape, usi_two_weights1.shape[0] / num_uwl1_linkable * 100

In [None]:
usi_two_weights2.shape, usi_two_weights2.shape[0] / num_linked_usi2 * 100

In [309]:
# the event log of patients who have at least two weight measurements in the time window
cohort_double_weights = cohort_uwl[cohort_uwl['usi'].isin(usi_two_weights)].sort_values(by=['usi', 'dte'])

# the event log of patients who have at least two weight measurements in the time window
cohort_double_weights1 = cohort_uwl[cohort_uwl['usi'].isin(usi_two_weights1)].sort_values(by=['usi', 'dte'])

# the event log of patients who have at least two weight measurements in the time window
cohort_double_weights2 = cohort_uwl[cohort_uwl['usi'].isin(usi_two_weights2)].sort_values(by=['usi', 'dte'])

In [310]:
# cohort_double_weights
weight_counts_temp = cohort_double_weights.reset_index().groupby('usi').apply(lambda g: recent_weight_change(g))

In [None]:
# calculate the weight changes
weight_changes = cohort_double_weights.reset_index().groupby('usi').apply(lambda g: recent_weight_change(g))
weight_changes_perc = cohort_double_weights.groupby('usi').apply(lambda g: recent_weight_change(g, perc=True))
weight_changes.mean(), weight_changes_perc.mean(), weight_changes.median(), weight_changes_perc.median()

In [None]:
# calculate the weight changes
weight_changes1 = cohort_double_weights1.reset_index().groupby('usi').apply(lambda g: recent_weight_change(g))
weight_changes_perc1 = cohort_double_weights1.groupby('usi').apply(lambda g: recent_weight_change(g, perc=True))
weight_changes1.mean(), weight_changes_perc1.mean(), weight_changes1.median(), weight_changes_perc1.median()

In [None]:
# calculate the weight changes
weight_changes2 = cohort_double_weights2.reset_index().groupby('usi').apply(lambda g: recent_weight_change(g))
weight_changes_perc2 = cohort_double_weights2.groupby('usi').apply(lambda g: recent_weight_change(g, perc=True))
weight_changes2.mean(), weight_changes_perc2.mean(), weight_changes2.median(), weight_changes_perc2.median()

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(x=weight_changes1)  # horizontal box plot so the x-axis label below applies
plt.xlabel('Weight change in 2 years up to index date')
plt.show()

### 2.3. BMI

In [177]:
# how many distinct BMI measurements?
bmi[['usi', 'dte', 'value']].drop_duplicates().shape[0]

25375

In [None]:
# convert to tidy form
bmi_tidy = pd.melt(bmi,
                   id_vars=['usi', 'dte', 'value'],
                   value_vars=['5years_before',
                               '24months_before',
                               '12months_before',
                               '6months_before',
                               '3months_before',
                               '1month_before',
                               'at_index_date',
                               '1month_after', 
                               '6months_after'], 
                   value_name='in_time_period', 
                   var_name='time_period')

In [None]:
# convert time period to categorical based on time ordering
bmi_tidy['time_period'] = pd.Categorical(bmi_tidy['time_period'],
                                         ['5years_before', '24months_before', '12months_before', 
                                          '6months_before', '3months_before', '1month_before', 
                                          'at_index_date', 
                                          '1month_after', '6months_after'])

In [None]:
# for each usi and time period, how many bmi measurements are there?
usi_time_period_counts = bmi_tidy.query('in_time_period == True').groupby(['usi', 'time_period'], as_index=False).size()

# for each time period and size how many patients are there?
time_period_size_counts = usi_time_period_counts.groupby(['time_period', 'size'], as_index=False).count().query('size > 0')

# sort the data and calculate number of usi in time period with at least size n
time_period_size_counts['cumulative_count'] = time_period_size_counts.sort_values(by=['time_period', 'size'], ascending=[True, False]).groupby('time_period')['usi'].cumsum()

# pivot table showing number of patients with at least n bmi measurements for each of the time periods (1 <= n <= 5)
time_period_count_pivot = time_period_size_counts[['time_period', 
                                                   'size', 
                                                   'cumulative_count']].query('size < 6').pivot(index='time_period', 
                                                                                                columns='size', 
                                                                                                values='cumulative_count').T

time_period_count_pivot
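
The "at least n" counts above come from a reverse cumulative sum: within each time period, sizes are sorted from largest to smallest, so the running total at size n covers every patient with n or more measurements. A minimal sketch on made-up counts (the numbers and the `toy_` names are hypothetical):

```python
import pandas as pd

# Hypothetical counts: `usi` holds the number of patients with exactly
# `size` measurements in the given time period
toy_counts = pd.DataFrame({
    "time_period": ["1month_before"] * 3 + ["at_index_date"] * 2,
    "size": [1, 2, 3, 1, 2],
    "usi": [10, 4, 1, 7, 2],
})

# Sort largest size first so the running total at size n includes all sizes >= n
ordered = toy_counts.sort_values(["time_period", "size"], ascending=[True, False])

# Assignment aligns on the original row index, so the row order is unchanged
toy_counts["cumulative_count"] = ordered.groupby("time_period")["usi"].cumsum()

# Rows are n, columns are time periods, cells are patients with >= n measurements
toy_pivot = toy_counts.pivot(index="size", columns="time_period",
                             values="cumulative_count")
```

Combinations that never occur (here, three measurements at the index date) appear as `NaN` in the pivot, which turns the affected column into floats.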

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.heatmap(time_period_count_pivot.T, annot=True, fmt='d')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.heatmap(time_period_count_pivot.T / num_uwl_patients * 100, annot=True)
plt.show()

## 3. Demographics

Calculate the baseline characteristics of the cohort.

In [None]:
usi_uwl = cohort_uwl['usi'].unique()
usi_uwl.shape

In [244]:
# index_date is converted to datetime below, once patient_uwl has been created

In [245]:
#cohort_linkable = (
#    cohort_linkable
#    .assign(usi=cohort_linkable.usi.astype(float).astype(int).astype(str).str.pad(width=10, side='left', fillchar='0'))
#)

In [246]:
# patient age at index date
usi_to_index_date = dict(cohort.query('index_case == True')[['usi', 'dte']].values)

# consider only those patients that are UWL and can be linked
patient_uwl = patient[patient['usi'].isin(usi_uwl)]
# copy so the column assignments below do not operate on a view of `patient`
patient_uwl = patient_uwl.query('usi != ""').copy()

patient_uwl['index_date'] = patient_uwl['usi'].apply(lambda u: usi_to_index_date[u])
patient_uwl['index_date'] = pd.to_datetime(patient_uwl['index_date'])
patient_uwl['age_at_index'] = patient_uwl['index_date'].dt.year - patient_uwl['year_of_birth']

In [247]:
# where duplicate ages recorded, select first
patient_uwl = patient_uwl.groupby('usi', as_index=False).first()

In [248]:
# create the age group variable
bins = [17, 39, 49, 59, 69, 79, 119]
labels = ['18 - 39', '40 - 49', '50 - 59', '60 - 69', '70 - 79', '80+']

patient_uwl['age_group'] = pd.cut(x=patient_uwl['age_at_index'], bins=bins, labels=labels)
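
By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins above are (17, 39], (39, 49], and so on up to (79, 119]. The short check below, on hypothetical ages, shows which boundary values fall in which group and which fall outside every bin.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([17, 18, 39, 40, 79, 80, 119, 120])

bins = [17, 39, 49, 59, 69, 79, 119]
labels = ['18 - 39', '40 - 49', '50 - 59', '60 - 69', '70 - 79', '80+']

# Ages outside (17, 119] are assigned NaN rather than a label
age_groups = pd.cut(x=ages, bins=bins, labels=labels)
```

Here 17 and 120 receive no group, while 39 is placed in '18 - 39' and 40 in '40 - 49'. Patients under 18 or over 119 at the index date would therefore be absent from any table grouped on `age_group`.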

In [249]:
# second age group variable with wider bands
# (used when calculating PPV values for pathology results, so that each group contains more patients)
bins = [39, 59, 79, 119]
labels = ['40 - 59', '60 - 79', '80+']

patient_uwl['age_group2'] = pd.cut(x=patient_uwl['age_at_index'], bins=bins, labels=labels)

In [250]:
# write this as a new patient file with the additional fields included
#patient_uwl.to_parquet(f'{source_folder}/NPS_Patient_202107_derived_fields_250123.parquet')
#patient_uwl.to_csv(f'{source_folder}/cohort_patient_202107_derived_fields_210224.csv', index=False)

### 3.1. Characteristics of baseline cohort

- Cohort defined as patients with at least one UWL event.
- Baseline values for age, gender and smoking status codes are taken from the patient table.

In [None]:
# consider only those patients who can be linked
num_uwl_patients = patient_uwl['usi'].unique().shape[0]
num_uwl_patients

In [None]:
patient_uwl[patient_uwl['usi'].isin(usi_uwl)].groupby(['age_group']).size()

In [None]:
patient_uwl[patient_uwl['usi'].isin(usi_uwl)].groupby(['age_group']).size() / num_uwl_patients * 100

In [None]:
patient_uwl.groupby('gender_code').size()

In [None]:
patient_uwl.groupby('gender_code').size() / num_uwl_patients * 100

In [None]:
patient_uwl.groupby('smoking_status_code').size()

In [None]:
patient_uwl.groupby('smoking_status_code').size() / num_uwl_patients * 100
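
The percentages above use the whole cohort as the denominator, so they need not sum to 100 when a code is missing for some patients. The sketch below contrasts this with percentages among recorded values only; the toy patient table is hypothetical.

```python
import pandas as pd

# Hypothetical patient table with one missing gender code
toy_patients = pd.DataFrame({
    "usi": ["a", "b", "c", "d"],
    "gender_code": ["F", "M", "F", None],
})

n_patients = toy_patients["usi"].nunique()

# Denominator is the whole cohort; groupby drops the missing code by default
pct_of_cohort = toy_patients.groupby("gender_code").size() / n_patients * 100

# Denominator is only the patients with a recorded code
pct_of_recorded = toy_patients["gender_code"].value_counts(normalize=True) * 100
```

The gap between 100 and `pct_of_cohort.sum()` is the share of patients with no recorded value, which is one way to read off missingness for a variable.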

In [266]:
#usi_to_age = dict(patient_uwl[['usi', 'age_at_index']].drop_duplicates().values)
#usi_to_sex = dict(patient_uwl[['usi', 'gender_code']].drop_duplicates().values)
#usi_to_irsad = dict(patient_uwl[['usi', 'irsad_decile']].drop_duplicates().values)

In [None]:
patient_uwl_small = patient_uwl[['usi', 'phn_code', 'remoteness_area_cde', 'sa3_code_2016', 
                                 'irsad_decile', 'gender_code', 'atsi_code', 'year_of_birth']].drop_duplicates()

In [None]:
uwl_patients_phn = patient_uwl_small[['usi', 'phn_code']].drop_duplicates().groupby('phn_code').size().reset_index()
uwl_patients_irsad = patient_uwl_small[['usi', 'irsad_decile']].drop_duplicates().groupby('irsad_decile').size().reset_index()
uwl_patients_gender = patient_uwl_small[['usi', 'gender_code']].drop_duplicates().groupby('gender_code').size().reset_index()

In [None]:
cohort_linkable.usi.unique().shape[0]

## 4. Pathology results

### 4.1. Prevalence of recording of path results

How many patients have at least one type of each pathology result ever recorded?

In [93]:
# how many patients have at least one of each type of pathology test?
(cohort_uwl.query('event_type == "pathology_test_result"').groupby(['usi', 'event_subtype']).size() >= 1).reset_index().groupby('event_subtype').size()

Series([], dtype: int64)

In [None]:
# percentages
(cohort_uwl.query('event_type == "pathology_test_result"').groupby(['usi', 'event_subtype']).size() >= 1).reset_index().groupby('event_subtype').size() / num_uwl_patients * 100
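
The expression above counts (patient, subtype) pairs per subtype; the `>= 1` comparison is always true for groups that exist, so it only serves to collapse repeated results. A sketch on hypothetical events showing that counting distinct patients per subtype gives the same numbers more directly:

```python
import pandas as pd

# Hypothetical event log; column names follow the notebook's cohort table
toy_events = pd.DataFrame({
    "usi": ["a", "a", "a", "b", "b", "c"],
    "event_type": ["pathology_test_result"] * 5 + ["observation"],
    "event_subtype": ["hb", "hb", "crp", "hb", "crp", "weight"],
})

path = toy_events.query('event_type == "pathology_test_result"')

# Pattern used above: one row per (usi, event_subtype) pair, then count pairs
via_pairs = (
    (path.groupby(["usi", "event_subtype"]).size() >= 1)
    .reset_index()
    .groupby("event_subtype")
    .size()
)

# Alternative: number of distinct patients with each subtype
via_nunique = path.groupby("event_subtype")["usi"].nunique()
```

Both results are indexed by `event_subtype`, so either can be divided by the cohort size to give the percentages.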

Filter the cohort to a 4-month window around the index date (3 months before to 1 month after).

In [97]:
cohort_within_4month_period = pd.concat([cohort_uwl[cohort_uwl['3months_before'] == True], 
                                         cohort_uwl[cohort_uwl['1month_after'] == True]])
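
Concatenating the two filtered frames keeps a row twice if both flags are true for it. Whether that can happen depends on how the window flags were built, for example if both windows include the index date, which is an assumption here. The sketch below uses hypothetical data to compare concatenation with a single boolean mask.

```python
import pandas as pd

# Hypothetical events; the third row is flagged in both windows
toy_cohort = pd.DataFrame({
    "usi": ["a", "a", "b", "c"],
    "3months_before": [True, False, True, False],
    "1month_after": [False, True, True, False],
})

# Approach used above: the doubly flagged row appears twice
concatenated = pd.concat([toy_cohort[toy_cohort["3months_before"] == True],
                          toy_cohort[toy_cohort["1month_after"] == True]])

# Single mask: each event appears at most once
combined = toy_cohort[toy_cohort["3months_before"] | toy_cohort["1month_after"]]
```

The per-patient prevalence counts in this section group by `usi` and `event_subtype`, so they are unaffected by repeated rows; the difference matters only for event-level counts.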

In [None]:
(cohort_within_4month_period.query('event_type == "pathology_test_result"').groupby(['usi', 'event_subtype']).size() >= 1).reset_index().groupby('event_subtype').size()

In [None]:
(cohort_within_4month_period.query('event_type == "pathology_test_result"').groupby(['usi', 'event_subtype']).size() >= 1).reset_index().groupby('event_subtype').size() / num_uwl_patients * 100

Cohort covering a 7-month window (6 months before to 1 month after).

In [100]:
cohort_within_7month_period = pd.concat([cohort_uwl[cohort_uwl['6months_before'] == True], 
                                         cohort_uwl[cohort_uwl['1month_after'] == True]])

In [None]:
(cohort_within_7month_period.query('event_type == "pathology_test_result"').groupby(['usi', 'event_subtype']).size() >= 1).reset_index().groupby('event_subtype').size() / num_uwl_patients * 100

### 4.2. Abnormal pathology results

In [None]:
cohort_uwl.query('event_type == "pathology_test_result"').groupby(['event_subtype', 'value3']).size()

In [103]:
# how many abnormal results 