# Cluster probabilities and outcomes

---

***Authors:***

- Daniel E. Coral

- Femke Smit

- Elena Santos

- Ali Farzaneh

---

In this second part of the analysis, we will examine how the clusters we have validated across cohorts are associated with prevalent diseases at the time of clustering. We will also assess whether they add significant information for the prediction of major adverse cardiovascular events (MACE) and diabetes progression on top of commonly used risk stratification tools.

## Libraries and functions

The libraries needed to run this analysis:

In [1]:
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(purrr)
library(survival)

And the functions we have prepared to facilitate some steps:

In [61]:
source("cross_sectional_FX2.R")

---

## Loading data needed

### Initial input table of biomarkers and basic covariates

The input table is the same table of 10 traits we had prior to running UMAP. Here is a description of this table:

|System targeted |Biomarker               |Units                        |Column name|
|----------------|------------------------|-----------------------------|-----------|
|                |                        |                             |           |
|Individual ID   |-                       |-                            |eid        |
|                |                        |                             |           |
|Blood pressure  |Systolic blood pressure |millimeters of mercury (mmHg)|sbp        |
|                |Diastolic blood pressure|millimeters of mercury (mmHg)|dbp        |
|                |                        |                             |           |
|Lipid fractions |High density lipoprotein|mmol/L                       |hdl        |
|                |Low density lipoprotein |mmol/L                       |ldl        |
|                |Triglycerides           |mmol/L                       |tg         |
|                |                        |                             |           |
|Glycemia        |Fasting glucose         |mmol/L                       |fg         |
|                |                        |                             |           |
|Liver metabolism|Alanine transaminase    |U/L                          |alt        |
|                |                        |                             |           |
|Fat distribution|Waist-to-hip ratio      |cm/cm                        |whr        |
|                |                        |                             |           |
|Kidney function |Serum creatinine        |umol/L                       |scr        |
|                |                        |                             |           |
|Inflammation    |C reactive protein      |mg/L                         |crp        |
|                |                        |                             |           |
|Basic covariates|Current smoking status  |1 if yes, 0 if not           |smoking    |
|                |Sex                     |String ("Female" or "Male")  |sex        |
|                |Age                     |Years                        |age        |

***Important note:*** All columns should be there in the units required, and the names should match, so that the functions we have prepared for the analyses work properly. This is true for this and all the following tables we require for our analysis.

This input table has been preprocessed by:

1. Filtering out values that are possible errors in measurement (>5 SD away from the mean in continuous variables).
2. Only including complete cases.
3. Stratifying by sex.
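
The three preprocessing steps above can be sketched as follows. This is only an illustration on made-up data: the helper `within_sd()` and the toy table are hypothetical, and we assume here that the mean and SD are computed on the whole sample before stratifying, which may differ from how your cohort was actually processed.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data standing in for the real biomarker table (all values are made up)
set.seed(1)
toy <- tibble(
    eid = 1:200,
    sex = rep(c("Female", "Male"), each = 100),
    sbp = c(rnorm(199, mean = 135, sd = 15), 400), # last value is an implausible outlier
    hdl = c(NA, rnorm(199, mean = 1.4, sd = 0.3))  # first value is missing
)

# Hypothetical helper: TRUE when x lies within `nsd` SDs of its mean (NA stays NA)
within_sd <- function(x, nsd = 5) {
    abs(x - mean(x, na.rm = TRUE)) <= nsd * sd(x, na.rm = TRUE)
}

clean <- toy %>%
    filter(if_all(c(sbp, hdl), within_sd)) %>% # 1. drop values more than 5 SD from the mean
    tidyr::drop_na()                           # 2. keep complete cases only

# 3. One data frame per sex, stored in a named list
strat_toy <- split(clean, clean$sex)
```

The resulting `strat_toy` has the same shape as the `strat_dat` object loaded below: a list with one element for each sex.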

Here is what the input table should look like: a list of two data frames, one for each sex:

In [3]:
load("../data/ukb/strat_dat.RData")

In [4]:
map(strat_dat, head)

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,hdl,tg,ldl,fg,smoking
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,1.972,0.591,2.252,4.395,0
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,1.236,2.037,3.686,5.214,0
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,1.601,1.988,4.551,4.266,0
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,1.453,2.829,3.491,5.876,0
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,2.185,0.722,3.584,5.212,0
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,2.346,0.395,3.072,4.649,0

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,hdl,tg,ldl,fg,smoking
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000039,44,Male,36.6959,0.9911504,124.5,64.5,34.97,93.0,3.6,1.158,2.8,3.956,5.427,0
1000071,67,Male,39.4807,0.8857143,179.5,103.0,46.74,68.7,9.41,1.372,1.127,2.311,7.079,0
1000088,60,Male,24.2786,0.8761905,152.0,89.0,13.14,80.6,1.2,0.983,1.59,4.2,5.401,0
1000096,41,Male,26.5744,0.9587629,143.0,90.0,30.32,80.1,6.13,1.041,2.713,4.029,4.239,0
1000109,62,Male,33.8719,1.0818182,156.5,104.5,16.26,89.3,14.42,0.89,2.437,3.525,6.1,0
1000125,66,Male,36.11,1.0625,155.0,102.5,25.59,88.7,1.91,1.061,1.32,2.538,4.531,1


### Table of validated clusters

The second thing needed is the set of clusters we have validated. We have stored them in an R data file, `validclusmod.RData`, which loads an object called `validclusmod`:

In [5]:
load("../data/validclusmod.RData")
print(validclusmod)

# A tibble: 2 x 3
  sex    residmod          clusmod
  <chr>  <list>            <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]>


This object contains, for each sex:
- `residmod`: The model to obtain residuals for each variable, i.e., the variability beyond what is explained by BMI, adjusting for age and smoking.
- `clusmod`: The clustering model to apply to the residuals.

### Table of pre-existing conditions and medications

The third thing we need is a table of pre-existing conditions and medications participants are currently taking:

In [6]:
covar_dat <- read_tsv("../data/covar_dat.tsv", show_col_types = FALSE)
head(covar_dat)

eid,HT,CHD,Stroke,PAD,CKD,LiverFailure,RA,T2D,T1D,T2Dage,Insulin,AntiDM,AntiHT,LipidLower
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000027,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000039,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000040,1,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000053,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000064,1,0,0,0,0,0,0,1,0,49.5,0,1,1,0
1000071,1,0,0,0,0,0,0,1,0,65.5,0,0,1,1


All the columns in this table are coded 1 or 0, representing a current diagnosis of a disease or whether the person is taking the specified medication. The exception is `T2Dage`, which is the age at onset of T2D. This is what each column represents:

|Group       |Column name |Meaning
|------------|------------|--------
|Diagnoses   |HT          |Hypertension
|            |CHD         |Coronary heart disease
|            |Stroke      |Stroke
|            |PAD         |Peripheral artery disease
|            |CKD         |Chronic kidney disease
|            |LiverFailure|Liver failure
|            |RA          |Rheumatoid arthritis
|            |T2D         |Type 2 diabetes
|            |T1D         |Type 1 diabetes
|Age at onset|T2Dage      |Age at onset of T2D - It is 0 if `T2D` is 0. Needed in SCORE2.
|Medication  |Insulin     |Taking insulin
|            |AntiDM      |Taking medication for diabetes other than insulin
|            |AntiHT      |Taking medication for hypertension
|            |LipidLower  |Taking lipid-lowering medication 

If any of the columns in this table are missing in your data, one option is to assume that none in your population had the disease, i.e., you should have a column with 0 for all individuals.
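
As a minimal sketch of that option, the snippet below adds any missing column as a column of zeros. The toy table and the object names `covar_toy`, `required_cols` and `missing_cols` are hypothetical; only the column names come from the table above.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy covariate table that lacks most of the required columns (made-up values)
covar_toy <- tibble(eid = c(1, 2, 3), HT = c(0, 1, 1), CHD = c(0, 0, 1))

# Column names expected by the analysis functions, as listed in the table above
required_cols <- c("HT", "CHD", "Stroke", "PAD", "CKD", "LiverFailure", "RA",
                   "T2D", "T1D", "T2Dage", "Insulin", "AntiDM", "AntiHT", "LipidLower")

# Add each missing column as all zeros, i.e. assume nobody has the condition
missing_cols <- setdiff(required_cols, names(covar_toy))
for (col in missing_cols) {
    covar_toy[[col]] <- 0
}

# Put the columns in the expected order
covar_toy <- covar_toy[c("eid", required_cols)]
```

Keep in mind that this is a strong assumption: prevalences and adjusted models for a condition filled in this way carry no information for your cohort.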

### Survival data

Lastly, we need survival data for MACE and diabetes progression. They should look like this:

In [7]:
survmacedat <- read_tsv("../data/survmacedat.tsv", show_col_types = FALSE)
head(survmacedat)

eid,outcome_value,outcome_timeyrs
<dbl>,<dbl>,<dbl>
1000071,0,10.001369
1000223,1,6.874743
1000324,1,3.101985
1000583,1,3.761807
1001175,1,4.539357
1001892,1,9.185489


In [8]:
survdmdat <- read_tsv("../data/survdmdat.tsv", show_col_types = FALSE)
head(survdmdat)

eid,outcome_value,outcome_timeyrs
<dbl>,<dbl>,<dbl>
1000109,1,3.600274
1000132,1,1.24846
1004267,1,4.550308
1006281,1,1.957563
1007454,0,9.423682
1010295,1,6.852841


These two tables should include individuals followed ***up to 10 years***. Any outcome after 10 years should be censored. `outcome_value` is 1 if the person experienced the event during the follow-up time and 0 if not. `outcome_timeyrs` is the time of follow-up in years, up to the first event or up to 10 years. 
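
If your raw follow-up extends beyond 10 years, the censoring rule described above can be sketched as below. The function `censor_at()` and the toy data are hypothetical and only illustrate the rule; they are not part of the functions we provide.

```r
# Toy follow-up data before administrative censoring (made-up values)
raw <- data.frame(
    eid = 1:4,
    event = c(1, 1, 0, 0),
    timeyrs = c(3.2, 12.5, 8.0, 14.1)
)

# Hypothetical helper: events after `horizon` years become non-events,
# and follow-up time is truncated at `horizon`
censor_at <- function(event, time, horizon = 10) {
    data.frame(
        outcome_value = ifelse(time > horizon, 0, event),
        outcome_timeyrs = pmin(time, horizon)
    )
}

surv_toy <- cbind(raw["eid"], censor_at(raw$event, raw$timeyrs, horizon = 10))
```

In this toy example the second individual had an event at 12.5 years, so it is recorded as a non-event with 10 years of follow-up.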

It is important that these tables ***do not include*** individuals who have already experienced the events we will study. In any case, we will make sure of this in the next step, when we combine all the data. For example, any individual in the `survmacedat` table with a value of 1 in the `CHD`, `Stroke` or `PAD` columns of the `covar_dat` table will be excluded from the analysis.

If your cohort does not have survival data, follow this guide only up to the end of the section titled "Prevalent diseases and medication" below.

---

## Calculation of cluster probabilities

With the data needed in place, we can start by calculating cluster allocation probabilities given the biomarker data. For that we will first add a new column called `data` to the `validclusmod` table where we will put the biomarker data for each sex:

In [9]:
clusterdfs <- clusterprobcalc(ClusModDf = validclusmod, StratDat = strat_dat)

In [10]:
print(clusterdfs)

# A tibble: 2 x 2
  sex    data
  <chr>  <list>
1 Female <tibble [77,207 x 21]>
2 Male   <tibble [67,904 x 20]>


Checking that the probabilities were calculated for each sex:

In [11]:
head(clusterdfs$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,tg,ldl,fg,smoking,probBC,probDHT,probDAL,probDLT,probDIS,probDHG
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,0.591,2.252,4.395,0,0.1008059,0.8982179,8.145289e-08,0.0006249365,0.0002629902,8.816286e-05
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,2.037,3.686,5.214,0,0.3995536,0.5942413,0.002669602,0.0014612974,0.0009994929,0.001074671
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,1.988,4.551,4.266,0,0.9885268,9.507371e-06,5.39449e-05,0.0110183762,0.0002941394,9.725173e-05
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,2.829,3.491,5.876,0,0.939154,0.0002297826,0.05226406,0.000423494,0.0001454109,0.007783232
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,0.722,3.584,5.212,0,0.7901985,0.2027605,5.046971e-07,0.001904094,0.0015270843,0.003609283
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,0.395,3.072,4.649,0,0.9827497,0.002825642,2.925853e-07,0.0033190879,0.0093330496,0.001772251


---

## Descriptive statistics

At this point we will recheck some of the characteristics of the clusters as we did in our previous script, weighting calculations by cluster probabilities.

The distribution of biomarkers per cluster:

In [12]:
markerdistribdf <- markerdistribfx(clusterdfs)

In [13]:
head(markerdistribdf)

sex,Variable,Cluster,Type,N,Summary1,Summary2
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>
Female,whr,BC,Numeric,58707.879,0.82 (0.07),0.81 (0.7 - 0.77 - 0.86 - 0.96)
Female,whr,DHT,Numeric,7483.777,0.79 (0.05),0.78 (0.69 - 0.75 - 0.82 - 0.9)
Female,whr,DAL,Numeric,3950.988,0.87 (0.06),0.86 (0.76 - 0.83 - 0.91 - 0.99)
Female,whr,DLT,Numeric,2835.984,0.84 (0.07),0.84 (0.71 - 0.79 - 0.89 - 0.98)
Female,whr,DIS,Numeric,2750.474,0.84 (0.07),0.83 (0.71 - 0.79 - 0.89 - 0.98)
Female,whr,DHG,Numeric,1477.897,0.85 (0.08),0.85 (0.71 - 0.79 - 0.91 - 1.02)
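
To make the weighting concrete, here is a minimal sketch of a probability-weighted mean and SD for one biomarker in one cluster. The numbers are made up, and we are assuming this is the general idea behind `markerdistribfx`, not its exact implementation. It also shows why `N` in the table above is not an integer: it is a sum of probabilities rather than a count of individuals.

```r
# Toy data: one biomarker and the probability of belonging to one cluster
whr  <- c(0.70, 0.80, 0.90, 1.00)
prob <- c(0.90, 0.60, 0.10, 0.40)

# Effective cluster size: the sum of the membership probabilities
n_eff <- sum(prob)

# Probability-weighted mean and SD (population form of the weighted variance)
w_mean <- sum(prob * whr) / n_eff
w_sd   <- sqrt(sum(prob * (whr - w_mean)^2) / n_eff)
```

The weighted mean agrees with base R's `weighted.mean(whr, prob)`.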


The effect of BMI on biomarkers specifically within each cluster, adjusted for age and smoking:

In [14]:
bmieffmarkerdf <- bmieffmarkerfx(clusterdfs)

In [15]:
head(bmieffmarkerdf)

sex,Variable,Cluster,term,estimate,se
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,whr,BC,(Intercept),0.57675356,0.00191741
Female,whr,BC,age,0.001289826,2.768274e-05
Female,whr,BC,smoking,0.02011465,0.0007707873
Female,whr,BC,bmi,0.006069039,4.309974e-05
Female,whr,DHT,(Intercept),0.569642086,0.001568987
Female,whr,DHT,age,0.001087705,2.096435e-05
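
A cluster-specific BMI effect of this kind can be sketched as a linear regression weighted by the cluster probability. The simulated data below are hypothetical, and using `weights = probBC` in `lm()` is our assumption about the general approach, not a copy of `bmieffmarkerfx`.

```r
# Simulated data standing in for one sex-specific table (made-up values)
set.seed(42)
n <- 500
toy <- data.frame(
    age = runif(n, 40, 70),
    smoking = rbinom(n, 1, 0.1),
    bmi = rnorm(n, 27, 4),
    probBC = runif(n) # probability of belonging to cluster BC
)
toy$whr <- 0.57 + 0.001 * toy$age + 0.02 * toy$smoking +
    0.006 * toy$bmi + rnorm(n, sd = 0.05)

# Linear model for the biomarker, weighting individuals by cluster probability
fit <- lm(whr ~ age + smoking + bmi, data = toy, weights = probBC)

# Estimates and standard errors, analogous to the `estimate` and `se` columns
est <- summary(fit)$coefficients[, c("Estimate", "Std. Error")]
```

Individuals who almost certainly belong to the cluster dominate the fit, while those with probabilities close to zero contribute very little.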


---

## Prevalent diseases and medication

To add covariate data to the `clusterdfs` table we will do the following:

In [16]:
clusterdfs <- addcovardat(X = clusterdfs, CovarDat = covar_dat)

In [17]:
print(clusterdfs)

# A tibble: 2 x 2
  sex    data
  <chr>  <list>
1 Female <tibble [77,151 x 35]>
2 Male   <tibble [67,848 x 34]>


Checking again if the columns were added as expected:

In [18]:
head(clusterdfs$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,CKD,LiverFailure,RA,T2D,T1D,T2Dage,Insulin,AntiDM,AntiHT,LipidLower
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,0,0,0,0,0,0,0,0,0,0
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,0,0,0,0,0,0,0,0,1,0
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,0,0,0,0,0,0,0,0,1,0
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,0,0,0,0,0,0,0,0,1,1
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,0,0,0,0,0,0,0,0,0,0
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,0,0,0,0,0,0,0,0,0,0


We will first count the number of individuals with disease in each cluster. Here we will also count the number of individuals taking each class of medications in each cluster.

In [19]:
countcovarsdf <- countcovarsfx(clusterdfs)

In [20]:
head(countcovarsdf)

sex,Cluster,Covariate,Nclus,Ncases,Nnoncases
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Female,probBC,HT,58673.33,12946.70991,45726.62
Female,probBC,CHD,58673.33,1659.0005,57014.33
Female,probBC,Stroke,58673.33,848.10719,57825.23
Female,probBC,PAD,58673.33,147.89695,58525.44
Female,probBC,CKD,58673.33,67.25413,58606.08
Female,probBC,LiverFailure,58673.33,50.16266,58623.17


We will use this table to calculate prevalences and compare prevalences across clusters.
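
As a small worked example, the prevalence in a cluster is the weighted number of cases divided by the weighted cluster size. The rows below are copied from the output above; the object names `counts` and `prev` are only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

# First rows of the count table shown above (Female, cluster BC)
counts <- tibble(
    Cluster = "probBC",
    Covariate = c("HT", "CHD", "Stroke"),
    Nclus = 58673.33,
    Ncases = c(12946.70991, 1659.0005, 848.10719)
)

# Prevalence = weighted cases / weighted cluster size
prev <- counts %>%
    mutate(Prevalence = Ncases / Nclus)
```

For instance, hypertension in this cluster has a prevalence of about 22%. The same ratio computed in the other clusters is what we compare across clusters.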

We are also interested in the proportion of individuals receiving medications in each cluster, stratified by each condition. This is obtained with the following function:

In [21]:
countdxmed <- countdxmedfx(clusterdfs)

In [22]:
head(countdxmed)

sex,Dx,Cluster,Med,Nnoncases,Ncases
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,CHD,probBC,NoMed,42000.949,206.108367
Female,CHD,probBC,Insulin,137.0,17.0
Female,CHD,probBC,AntiDM,615.0,67.0
Female,CHD,probBC,AntiHT,10480.0,1320.0
Female,CHD,probBC,LipidLower,8573.0,1206.0
Female,CHD,probDAL,NoMed,2708.061,9.496002


We will also formally test the association between cluster allocation and diseases using logistic regressions where the outcome is each disease and the predictors are the cluster allocations. We will have two models for each disease, one with only clusters, and a second one adjusting for medication.
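
The idea of the two models can be sketched on simulated data as follows. Everything here is hypothetical: the data are made up, only three clusters and one medication are used, and entering the probabilities directly with BC as the reference is an assumption for illustration. The actual parameterisation and covariate set used by `assocdxfx` may differ.

```r
# Simulated cluster probabilities that sum to 1 for each individual
set.seed(7)
n <- 2000
raw <- matrix(rexp(n * 3), ncol = 3)
probs <- raw / rowSums(raw)
colnames(probs) <- c("probBC", "probDHT", "probDAL")

toy <- as.data.frame(probs)
toy$AntiHT <- rbinom(n, 1, 0.2)
toy$HT <- rbinom(n, 1, plogis(-1.5 + 1.2 * toy$probDHT + 1.0 * toy$AntiHT))

# Model 1: clusters only (probBC is left out and acts as the reference)
only_clusters <- glm(HT ~ probDHT + probDAL, data = toy, family = binomial)

# Model 2: clusters plus medication
full_model <- glm(HT ~ probDHT + probDAL + AntiHT, data = toy, family = binomial)

# Coefficients and their covariance matrix, analogous to the
# `estimates` and `varcovmat` columns returned by the function
estimates <- coef(full_model)
varcovmat <- vcov(full_model)
```

Keeping the full covariance matrix, and not only standard errors, allows contrasts between any pair of clusters to be computed later from the summary statistics alone.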

In [23]:
assocdxdf <- assocdxfx(clusterdfs)

In [24]:
print(assocdxdf)

# A tibble: 36 x 5
   sex    Dx_name model        estimates  varcovmat
   <chr>  <chr>   <chr>        <list>     <list>
 1 Female HT      OnlyClusters <dbl [6]>  <dbl [6 x 6]>
 2 Female HT      FullModel    <dbl [29]> <dbl [29 x 29]>
 3 Female CHD     OnlyClusters <dbl [6]>  <dbl [6 x 6]>
 4 Female CHD     FullModel    <dbl [29]> <dbl [29 x 29]>
 5 Female Stroke  OnlyClusters <dbl [6]>  <dbl [6 x 6]>
 6 Female Stroke  FullModel    <dbl [29]> <dbl [29 x 29]>
 7 Female PAD     OnlyClusters <dbl [6]>  <dbl [6 x 6]>
 8 Female PAD     FullModel    <dbl [29]> <dbl [29 x 29]>
 9 Female CKD     OnlyClusters <dbl [6]>

---

## Adding survival data

As explained before, we want to be careful when adding survival data for analysis. We have therefore prepared a function that adds the survival data separately for each outcome and excludes individuals who have already experienced the events under study:

In [46]:
clustersurvdfs <- addsurvdat(X = clusterdfs, SurvMACEDf = survmacedat, SurvDMDf = survdmdat)

In [47]:
print(clustersurvdfs)

# A tibble: 4 x 3
  sex    outcome data
  <chr>  <chr>   <list>
1 Female MACE    <tibble [73,378 x 34]>
2 Male   MACE    <tibble [60,348 x 33]>
3 Female DM      <tibble [34,581 x 32]>
4 Male   DM      <tibble [29,006 x 31]>


`data` now contains the data necessary to run the survival analysis.

---

## Creating follow-up subsets

In each subset the follow-up will be censored at a specific point in time:

In [49]:
clustersurvdfs <- futsubsetsfx(clustersurvdfs)
print(clustersurvdfs)

# A tibble: 8 x 4
  sex    outcome data                     fut
  <chr>  <chr>   <list>                 <dbl>
1 Female MACE    <tibble [73,378 x 34]>     5
2 Female MACE    <tibble [73,378 x 34]>    10
3 Male   MACE    <tibble [60,348 x 33]>     5
4 Male   MACE    <tibble [60,348 x 33]>    10
5 Female DM      <tibble [34,581 x 32]>     5
6 Female DM      <tibble [34,581 x 32]>    10
7 Male   DM      <tibble [29,006 x 31]>     5
8 Male   DM      <tibble [29,006 x 31]>    10


## Summary of survival data

We need information on the data available for survival analysis. Here is the function:

In [52]:
survsum <- survsumfx(clustersurvdfs)
survsum

sex,outcome,fut,N,Ncases,TPT,timeq2.5,timeq25,timeq50,timeq75,timeq97.5
<chr>,<chr>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,MACE,5,73378,1349,363900.5,5.0,5.0,5.0,5.0,5
Female,MACE,10,73378,2926,644648.0,6.366667,8.104038,8.947296,9.607118,10
Male,MACE,5,60348,2506,295798.6,3.15948,5.0,5.0,5.0,5
Male,MACE,10,60348,4927,519200.7,3.15948,8.027379,8.895277,9.593429,10
Female,DM,5,34581,665,171313.8,5.0,5.0,5.0,5.0,5
Female,DM,10,34581,1159,303403.6,6.151951,8.202601,8.870637,9.604381,10
Male,DM,5,29006,938,142786.5,3.909993,5.0,5.0,5.0,5
Male,DM,10,29006,1578,251704.8,3.909993,8.109514,8.851472,9.587953,10


---

## Rates of outcomes by cluster

Similar to what was done in the cross-sectional setting, we will calculate the number of cases and the total follow-up in each cluster using the weighted approach:

In [53]:
ratesbyclus <- ratesclusfx(clustersurvdfs)

In [54]:
head(ratesbyclus)

sex,outcome,Cluster,Ncases,TPT,fut
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Female,MACE,probBC,956.80374,276917.2,5
Female,MACE,probDAL,92.19911,18711.26,5
Female,MACE,probDHG,50.83606,6323.39,5
Female,MACE,probDHT,130.3375,35971.92,5
Female,MACE,probDIS,72.2293,12668.5,5
Female,MACE,probDLT,46.59429,13308.2,5
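
From these two quantities, an incidence rate per cluster is simply the weighted number of cases divided by the weighted total follow-up. The sketch below uses the rows shown above (Female, MACE, 5 years); the columns `RatePer1000PY` and `RateRatioVsBC` are names we introduce here for illustration and are not produced by `ratesclusfx`.

```r
# Rows copied from the output above (Female, MACE, 5-year follow-up)
rates <- data.frame(
    Cluster = c("probBC", "probDAL", "probDHG", "probDHT", "probDIS", "probDLT"),
    Ncases = c(956.80374, 92.19911, 50.83606, 130.3375, 72.2293, 46.59429),
    TPT = c(276917.2, 18711.26, 6323.39, 35971.92, 12668.5, 13308.2)
)

# Incidence rate per 1000 person-years of follow-up
rates$RatePer1000PY <- 1000 * rates$Ncases / rates$TPT

# Crude rate ratio using cluster BC as the reference
ref_rate <- rates$RatePer1000PY[rates$Cluster == "probBC"]
rates$RateRatioVsBC <- rates$RatePer1000PY / ref_rate
```

These crude ratios are not adjusted for any covariate; the Cox models below are used for formal inference.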


We will also do this by medication status:

In [55]:
ratesbyclusmed <- ratesclusmedfx(clustersurvdfs)

In [56]:
head(ratesbyclusmed)

sex,outcome,Cluster,Med_name,Ncases,TPT,fut
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Female,MACE,probBC,AntiDM,32.929531,2863.6116,5
Female,MACE,probBC,AntiHT,365.702399,48673.5125,5
Female,MACE,probBC,Insulin,10.178344,630.9561,5
Female,MACE,probBC,LipidLower,247.432195,39067.3397,5
Female,MACE,probBC,NoMed,495.310186,207067.3974,5
Female,MACE,probDAL,AntiDM,2.868887,197.8934,5


---

## Cox models

To quantify the association of clusters with MACE, as well as their potential contribution to prediction, we will compare two models. The reference model will include all predictors that are part of SCORE2, the risk stratification tool for CVD recommended by the European Society of Cardiology <cite id="pzdxs"><a href="#zotero|10831815/ZY2CL5NC">(SCORE2 working group and ESC Cardiovascular risk collaboration, 2021)</a></cite>. We will use a version of this score that has been validated in diabetic populations and includes some additional clinically useful predictors <cite id="itp38"><a href="#zotero|10831815/FPUFQKFI">(SCORE2-Diabetes Working Group and the ESC Cardiovascular Risk Collaboration, 2023)</a></cite>. For completeness, we will also include some important pre-existing conditions and pharmacological treatments, such as hypertension and antihypertensives, as well as any predictor in our initial input table that is not part of SCORE2. We will then compare this reference model with one that also includes the cluster probabilities, and assess the ability of the two models to predict MACE.

Similarly, for diabetes we will also fit two models, one containing all biomarkers and another one containing the biomarkers plus the cluster probabilities.

The way we will introduce the cluster probabilities into the Cox models will be employing the log-ratio transformation <cite id="0lvet"><a href="#zotero|10831815/UYLBIXH3">(Coenders &#38; Pawlowsky-Glahn, 2020)</a></cite>.
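
To illustrate what a log-ratio transformation does, the sketch below applies an additive log-ratio with one cluster as the reference. This is an assumption for illustration: the exact log-ratio used inside `coxmodels` is not shown in this document, and the function `alr()` and the toy probabilities are hypothetical. What the sketch does share with the fitted models is that D cluster probabilities become D - 1 predictors, which is consistent with the number of coefficients reported below (30 versus 35 for women with six clusters, 30 versus 34 for men with five).

```r
# Cluster probabilities for two hypothetical individuals (each row sums to 1)
probs <- rbind(
    c(probBC = 0.70, probDHT = 0.20, probDAL = 0.10),
    c(probBC = 0.10, probDHT = 0.10, probDAL = 0.80)
)

# Additive log-ratio: log of each non-reference probability
# divided by the probability of the reference cluster
alr <- function(p, ref = 1) {
    log(p[, -ref, drop = FALSE] / p[, ref])
}

lr <- alr(probs, ref = 1)
```

Because probabilities sum to 1, they cannot all enter a regression together with an intercept. Log-ratios remove this redundancy and map the probabilities onto an unbounded scale.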

#### Fitting models

In [57]:
coxmoddf <- coxmodels(clustersurvdfs)

In [58]:
print(coxmoddf)

# A tibble: 8 x 8
  sex    outcome data       fut survdf   NullMod    mod_base mod_clus
  <chr>  <chr>   <list>   <dbl> <list>   <list>     <list>   <list>
1 Female MACE    <tibble>     5 <tibble> <cxph.nll> <coxph>  <coxph>
2 Female MACE    <tibble>    10 <tibble> <cxph.nll> <coxph>  <coxph>
3 Male   MACE    <tibble>     5 <tibble> <cxph.nll> <coxph>  <coxph>
4 Male   MACE    <tibble>    10 <tibble> <cxph.nll> <coxph>  <coxph>
5 Female DM      <tibble>     5 <tibble> <cxph.nll> <coxph>  <coxph>
6 Female DM

Here `mod_base` contains the baseline model, while `mod_clus` contains the baseline model plus clusters. `NullMod` contains the null model, which we will use to calculate our metrics.

#### Coefficient estimates

To properly calculate the effect of clusters, we need the coefficients estimated by the log-ratio models as well as their covariance, contained in the `estimates` and `varcovmat` columns. To properly calculate the expected risk for a given phenotype, we need first the means of all variables in the model, contained in the column `Means` and the parameters of the baseline hazard, contained in the `Afit` column.

In [62]:
survcoefs <- survcoefx(coxmoddf)

In [63]:
print(survcoefs)

# A tibble: 16 x 8
   sex    outcome   fut model estimates  varcovmat       Means      Afit
   <chr>  <chr>   <dbl> <chr> <list>     <list>          <list>     <list>
 1 Female MACE        5 base  <dbl [30]> <dbl [30 x 30]> <dbl [30]> <named list>
 2 Female MACE        5 clus  <dbl [35]> <dbl [35 x 35]> <dbl [35]> <named list>
 3 Female MACE       10 base  <dbl [30]> <dbl [30 x 30]> <dbl [30]> <named list>
 4 Female MACE       10 clus  <dbl [35]> <dbl [35 x 35]> <dbl [35]> <named list>
 5 Male   MACE        5 base  <dbl [30]> <dbl [30 x 30]> <dbl [30]> <named list>

#### Comparison of predictive ability

To assess the predictive ability of the two nested models, we will use the gold-standard method: the likelihood ratio test. Given the wide use of the c-statistic, we will also use this metric. However, comparing two c-statistics is not as powerful as the likelihood ratio test.

In [64]:
compmoddf <- comparemods(coxmoddf)

In [65]:
compmoddf

sex,outcome,fut,LL0,LLBase,NVBase,LLBaseCl,NVBaseCl,LRTstat,LRTdf,LRTp,AdeqInd,CBase,CseBase,CBaseCl,CseBaseCl,Cdiff,Cdiffse,Cdiffp
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,MACE,5,-15100.891,-14618.848,30,-14607.431,35,22.834384,5,0.0003630825,0.976863,0.7328223,0.006465895,0.7344978,0.00647368,0.0016754179,0.0009406441,0.07488974
Female,MACE,10,-32421.972,-31439.463,30,-31425.868,35,27.191486,5,5.234738e-05,0.9863511,0.7264155,0.00446397,0.7275248,0.00446529,0.0011092586,0.0005270384,0.03531727
Male,MACE,5,-27533.009,-26845.514,30,-26805.962,34,79.104025,4,2.696446e-16,0.9455991,0.703614,0.004698475,0.7088752,0.004684074,0.005261199,0.0012679722,3.334912e-05
Male,MACE,10,-53605.732,-52507.146,30,-52458.631,34,97.028555,4,4.219367e-20,0.957707,0.6854264,0.003528423,0.689491,0.003534826,0.0040646343,0.0008366927,1.185908e-06
Female,DM,5,-6943.529,-6165.961,22,-6165.128,27,1.665862,5,0.8931722,0.9989299,0.8723565,0.006691853,0.8722215,0.006714578,-0.0001349738,0.0002306181,0.5583663
Female,DM,10,-12041.251,-10901.812,22,-10896.164,27,11.296818,5,0.04580241,0.9950673,0.8543404,0.005291641,0.8548817,0.005284485,0.0005412843,0.0003872531,0.1621872
Male,DM,5,-9622.876,-8708.941,22,-8708.323,26,1.237698,4,0.8718537,0.9993233,0.841219,0.006265155,0.8413424,0.006251815,0.0001234357,0.0001595101,0.439024
Male,DM,10,-16105.258,-14836.238,22,-14831.444,26,9.588273,4,0.04796469,0.9962364,0.8210544,0.005080388,0.8222832,0.005046399,0.001228734,0.0003787754,0.001178782


Some details of these columns:

- `LRTp` is the p-value of the likelihood ratio test comparing models with or without cluster allocations.
- `AdeqInd` is the adequacy index: the likelihood ratio statistic of the baseline model divided by that of the model with clusters, each computed against the null model. One minus this value represents the fraction of information added by cluster allocation.
- `Cdiffp` is the p-value of the difference between the c-statistics of the two models.
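
As a worked example, the snippet below recomputes these quantities from the log-likelihoods in the first row of the table (Female, MACE, 5 years). The formulas are the standard ones for nested models and they reproduce the reported values up to rounding, but we present them as a sketch and not as the exact code inside `comparemods`. The normal approximation for the difference in c-statistics is likewise an assumption that happens to match the reported p-value.

```r
# Values copied from the Female / MACE / 5-year row of the table above
LL0 <- -15100.891      # log-likelihood of the null model
LLBase <- -14618.848   # log-likelihood of the baseline model
LLBaseCl <- -14607.431 # log-likelihood of the baseline + clusters model
NVBase <- 30           # number of coefficients in the baseline model
NVBaseCl <- 35         # number of coefficients in the model with clusters
Cdiff <- 0.0016754179
Cdiffse <- 0.0009406441

# Likelihood ratio test between the two nested models
LRTstat <- 2 * (LLBaseCl - LLBase)
LRTdf <- NVBaseCl - NVBase
LRTp <- pchisq(LRTstat, df = LRTdf, lower.tail = FALSE)

# Adequacy index: share of the full model's information already in the baseline
AdeqInd <- (LLBase - LL0) / (LLBaseCl - LL0)

# Two-sided z-test for the difference in c-statistics
Cdiffp <- 2 * pnorm(abs(Cdiff / Cdiffse), lower.tail = FALSE)
```

In this row the adequacy index is about 0.977, so the cluster probabilities account for roughly 2.3% of the information in the full model.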

While p-values of both the likelihood ratio tests and the difference between C-statistics show evidence of added value, the adequacy index quantifies how much additional information is obtained when cluster allocations are added on top of the baseline model.

#### Adequacy index by cluster

Next we quantify how much value the new model adds within each cluster. We do this by recalculating the log-likelihood of each model, this time weighting individuals by their cluster probabilities.

In [66]:
adeqindbyclus <- AdeqIndClusFx(coxmoddf)

In [67]:
adeqindbyclus

sex,outcome,fut,Cluster,AdeqInd
<chr>,<chr>,<dbl>,<chr>,<dbl>
Female,MACE,5,probBC,0.9721397
Female,MACE,5,probDHT,0.9640652
Female,MACE,5,probDAL,1.0
Female,MACE,5,probDLT,1.0
Female,MACE,5,probDIS,0.9516598
Female,MACE,5,probDHG,1.0
Female,MACE,10,probBC,0.980832
Female,MACE,10,probDHT,0.9785928
Female,MACE,10,probDAL,1.0
Female,MACE,10,probDLT,1.0


#### Adequacy index by MACE probability given by baseline model

We are also interested in how the more complex model behaves along the scale of MACE probabilities given by the baseline model:

In [68]:
adeqindbypre <- AdeqIndByPreFx(coxmoddf)

In [69]:
head(adeqindbypre)

sex,outcome,fut,threshold,AdeqInd
<chr>,<chr>,<dbl>,<dbl>,<dbl>
Female,MACE,5,0.0,0.976863
Female,MACE,5,0.01,0.9525982
Female,MACE,5,0.02,0.8928134
Female,MACE,5,0.03,0.8220271
Female,MACE,5,0.04,0.6485129
Female,MACE,5,0.05,0.7717624


#### Decision curve analysis

The last step in assessing clinical utility of clustering allocations is to perform a decision curve analysis. First we will assess the overall net benefit of both models:

In [70]:
dcares <- DCurvFx(coxmoddf)

In [71]:
head(dcares)

sex,outcome,fut,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,MACE,5,base,73378,0.0,0.01838426,0.018384257,0.98161574,0.0183842569,
Female,MACE,5,base,73378,0.01,0.01838426,0.0164627,0.61087792,0.010292216,0.1805037
Female,MACE,5,base,73378,0.02,0.01838426,0.012265257,0.3200278,0.005734077,0.3617569
Female,MACE,5,base,73378,0.03,0.01838426,0.008258606,0.16134264,0.0032686277,0.4928771
Female,MACE,5,base,73378,0.04,0.01838426,0.005260432,0.08452125,0.0017387137,0.5821227
Female,MACE,5,base,73378,0.05,0.01838426,0.003161711,0.04514977,0.0007854068,0.6472376
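
To show how the columns of this table relate to each other, the sketch below recomputes the net benefit and the net interventions avoided for the row with a threshold of 0.01. The formulas are the usual decision curve definitions and they match the values above up to rounding; we do not claim that this is the exact code inside `DCurvFx`.

```r
# Values copied from the threshold = 0.01 row of the table above
threshold <- 0.01
pos_rate <- 0.01838426 # proportion of individuals with the event
tp_rate <- 0.0164627   # true positives divided by n
fp_rate <- 0.61087792  # false positives divided by n

# Odds of the threshold: the weight given to a false positive
odds <- threshold / (1 - threshold)

# Net benefit of treating according to the model
net_benefit <- tp_rate - fp_rate * odds

# Net benefit of treating everyone, used as the comparison strategy
net_benefit_all <- pos_rate - (1 - pos_rate) * odds

# Net reduction in interventions per person, relative to treating everyone
net_intervention_avoided <- (net_benefit - net_benefit_all) / odds
```

At a threshold of 0 the odds are 0, so the net benefit equals the event rate and the last quantity is undefined, which is why the first row of the table has an empty value in that column.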


We will also calculate this by cluster:

In [72]:
dcaclusres <- DCurvbyClFx(coxmoddf)

In [73]:
head(dcaclusres)

sex,outcome,fut,Cluster,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,MACE,5,probBC,base,55802.39,0.0,0.01714629,0.017146286,0.98285371,0.0171462861,
Female,MACE,5,probBC,base,55802.39,0.01,0.01714629,0.015132583,0.58025347,0.0092714372,0.2032437
Female,MACE,5,probBC,base,55802.39,0.02,0.01714629,0.01087825,0.28771671,0.005006481,0.3880033
Female,MACE,5,probBC,base,55802.39,0.03,0.01714629,0.006964654,0.13813947,0.0026922995,0.5155081
Female,MACE,5,probBC,base,55802.39,0.04,0.01714629,0.00437357,0.06933774,0.0014844973,0.6069708
Female,MACE,5,probBC,base,55802.39,0.05,0.01714629,0.002497458,0.03464816,0.0006738702,0.6698778


## Interaction between clusters and medications

Finally, we will assess the interaction between certain medications and clusters:

In [74]:
interactmods <- interactmodfx(coxmoddf)

i In argument: `mods = purrr::map(...)`.
! Loglik converged before variable  7 ; coefficient may be infinite.


In [75]:
print(interactmods)

# A tibble: 48 x 9
   sex    outcome   fut Med_name   model estimates varcovmat  Means Afit
   <chr>  <chr>   <dbl> <chr>      <chr> <list>    <list>     <lis> <list>
 1 Female MACE        5 Insulin    Only~ <dbl>     <dbl[...]> <dbl> <named list>
 2 Female MACE        5 Insulin    Full~ <dbl>     <dbl[...]> <dbl> <named list>
 3 Female MACE        5 AntiDM     Only~ <dbl>     <dbl[...]> <dbl> <named list>
 4 Female MACE        5 AntiDM     Full~ <dbl>     <dbl[...]> <dbl> <named list>
 5 Female MACE        5 AntiHT     Only~ <dbl>     <dbl[...]> <dbl>

---

## Saving data

As done before, we will ask you to save an R file that does not contain any individual data, only summary statistics, as follows:

In [76]:
result_file2 <- list(
    MarkerDistrib = markerdistribdf,
    BMIeffOnMarker = bmieffmarkerdf,
    CountCovars = countcovarsdf,
    CountDXMeds = countdxmed,
    CrossSectAssoc = assocdxdf,
    SurvSum = survsum,
    RatesByClus = ratesbyclus,
    RatesByClusMeds = ratesbyclusmed,
    SurvCoefs = survcoefs,
    Comparison = compmoddf,
    AdeqIndByClus = adeqindbyclus,
    AdeqIndByPre = adeqindbypre,
    DCARes = dcares,
    DCAClusREs = dcaclusres,
    InteractMods = interactmods
)

In [77]:
save(
    result_file2,
    file = "../data/ukb/result_file2.RData"
)

This file should be uploaded to the respective folder of the cohort in Teams:

> CrossWP > Analyst working groups > WG1 > UMAP_project > *cohort_name* > data

---

## References

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|10831815/UYLBIXH3"></i>Coenders, G., &#38; Pawlowsky-Glahn, V. (2020). On interpretations of tests and effect sizes in regression models with a compositional predictor. <i>SORT-Statistics and Operations Research Transactions</i>, <i>44</i>(1), 201–220. https://doi.org/10.2436/20.8080.02.100</div>
  <div class="csl-entry"><i id="zotero|10831815/ZY2CL5NC"></i>SCORE2 working group and ESC Cardiovascular risk collaboration. (2021). SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. <i>European Heart Journal</i>, <i>42</i>(25), 2439–2454. https://doi.org/10.1093/eurheartj/ehab309</div>
  <div class="csl-entry"><i id="zotero|10831815/FPUFQKFI"></i>SCORE2-Diabetes Working Group and the ESC Cardiovascular Risk Collaboration. (2023). SCORE2-Diabetes: 10-year cardiovascular risk estimation in type 2 diabetes in Europe. <i>European Heart Journal</i>, ehad260. https://doi.org/10.1093/eurheartj/ehad260</div>
</div>
<!-- BIBLIOGRAPHY END -->