# Cluster probabilities and outcomes

---

***Authors:***

- Daniel E. Coral

- Femke Smit

- Elena Santos

- Ali Farzaneh

---

In this second part of the analysis, we examine how the clusters validated across cohorts are associated with prevalent diseases at the time of clustering. We also assess whether they add significant information for the prediction of major adverse cardiovascular events (MACE) and diabetes progression on top of commonly used risk stratification tools.

## Libraries and functions

The libraries needed to run this analysis:

In [1]:
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(purrr)
library(survival)

And the functions we have prepared to facilitate some steps:

In [2]:
source("cross_sectional_FX2.R")

---

## Loading data needed

### Initial input table of biomarkers and basic covariates

The input table is the same table of 10 traits we had prior to running UMAP. Here is a description of this table:

|System targeted |Biomarker               |Units                        |Column name|
|----------------|------------------------|-----------------------------|-----------|
|                |                        |                             |           |
|Individual ID   |-                       |-                            |eid        |
|                |                        |                             |           |
|Blood pressure  |Systolic blood pressure |millimeters of mercury (mmHg)|sbp        |
|                |Diastolic blood pressure|millimeters of mercury (mmHg)|dbp        |
|                |                        |                             |           |
|Lipid fractions |High density lipoprotein|mmol/L                       |hdl        |
|                |Low density lipoprotein |mmol/L                       |ldl        |
|                |Triglycerides           |mmol/L                       |tg         |
|                |                        |                             |           |
|Glycemia        |Fasting glucose         |mmol/L                       |fg         |
|                |                        |                             |           |
|Liver metabolism|Alanine transaminase    |U/L                          |alt        |
|                |                        |                             |           |
|Fat distribution|Waist-to-hip ratio      |cm/cm                        |whr        |
|                |                        |                             |           |
|Kidney function |Serum creatinine        |umol/L                       |scr        |
|                |                        |                             |           |
|Inflammation    |C reactive protein      |mg/L                         |crp        |
|                |                        |                             |           |
|Basic covariates|Current smoking status  |1 if yes, 0 if not           |smoking    |
|                |Sex                     |String ("Female" or "Male")  |sex        |
|                |Age                     |Years                        |age        |

***Important note:*** All columns must be present, in the required units and with matching names, so that the functions we have prepared for the analyses work properly. This applies to this table and to all the following tables required for our analysis.

This input table has been preprocessed by:

1. Filtering out values that are possible errors in measurement (>5 SD away from the mean in continuous variables).
2. Only including complete cases.
3. Stratifying by sex.
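
The three preprocessing steps above can be sketched as follows. This is only an illustration on simulated data, not the code used to build `strat_dat`: the object `raw_dat` and the helper `within_5sd()` are hypothetical, only two biomarkers are included for brevity, and computing the mean and SD on the whole table before stratifying is an assumption.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

# Simulated raw table (hypothetical); the real table has all columns listed above
set.seed(1)
raw_dat <- tibble(
    eid = 1:200,
    sex = rep(c("Female", "Male"), each = 100),
    sbp = rnorm(200, mean = 135, sd = 15),
    hdl = rnorm(200, mean = 1.4, sd = 0.3)
)
raw_dat$sbp[1] <- 900   # implausible value, a possible measurement error
raw_dat$hdl[2] <- NA    # incomplete case

# Hypothetical helper: TRUE if a value lies within 5 SD of the mean
within_5sd <- function(x) {
    abs(x - mean(x, na.rm = TRUE)) <= 5 * sd(x, na.rm = TRUE)
}

strat_example <- raw_dat %>%
    # 1. Drop values more than 5 SD away from the mean (missing values handled next)
    filter(if_all(c(sbp, hdl), ~ is.na(.x) | within_5sd(.x))) %>%
    # 2. Keep complete cases only
    drop_na() %>%
    # 3. Stratify by sex: a list with one data frame per sex
    split(.$sex)

sapply(strat_example, nrow)
```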

Here is what the input table should look like - a list of two data frames, one for each sex:

In [3]:
load("../data/ukb/strat_dat.RData")

In [4]:
map(strat_dat, head)

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,hdl,tg,ldl,fg,smoking
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,1.972,0.591,2.252,4.395,0
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,1.236,2.037,3.686,5.214,0
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,1.601,1.988,4.551,4.266,0
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,1.453,2.829,3.491,5.876,0
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,2.185,0.722,3.584,5.212,0
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,2.346,0.395,3.072,4.649,0

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,hdl,tg,ldl,fg,smoking
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000039,44,Male,36.6959,0.9911504,124.5,64.5,34.97,93.0,3.6,1.158,2.8,3.956,5.427,0
1000071,67,Male,39.4807,0.8857143,179.5,103.0,46.74,68.7,9.41,1.372,1.127,2.311,7.079,0
1000088,60,Male,24.2786,0.8761905,152.0,89.0,13.14,80.6,1.2,0.983,1.59,4.2,5.401,0
1000096,41,Male,26.5744,0.9587629,143.0,90.0,30.32,80.1,6.13,1.041,2.713,4.029,4.239,0
1000109,62,Male,33.8719,1.0818182,156.5,104.5,16.26,89.3,14.42,0.89,2.437,3.525,6.1,0
1000125,66,Male,36.11,1.0625,155.0,102.5,25.59,88.7,1.91,1.061,1.32,2.538,4.531,1


### Table of validated clusters

The second thing needed is the set of clusters we have validated. We have stored it in an R data file containing an object called `validclusmod`:

In [5]:
load("../data/validclusmod.RData")
print(validclusmod)

# A tibble: 2 x 3
  sex    residmod          clusmod
  <chr>  <list>            <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]>


This object contains, for each sex:
- `residmod`: The model to obtain residuals for each variable, i.e., the variability beyond what is explained by BMI, adjusting for age and smoking.
- `clusmod`: The clustering model to apply to the residuals.

### Table of pre-existing conditions and medications

The third thing we need is a table of pre-existing conditions and medications participants are currently taking:

In [6]:
covar_dat <- read_tsv("../data/covar_dat.tsv", show_col_types = FALSE)
head(covar_dat)

eid,HT,CHD,Stroke,PAD,CKD,LiverFailure,RA,T2D,T1D,T2Dage,Insulin,AntiDM,AntiHT,LipidLower
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000027,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000039,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000040,1,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000053,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000064,1,0,0,0,0,0,0,1,0,49.5,0,1,1,0
1000071,1,0,0,0,0,0,0,1,0,65.5,0,0,1,1


All the columns in this table are coded 1 or 0, representing a current diagnosis of a disease or whether the person is taking the specified medication. The exception is `T2Dage`, which is the age at onset of T2D. This is what each column represents:

|Group       |Column name |Meaning
|------------|------------|--------
|Diagnoses   |HT          |Hypertension
|            |CHD         |Coronary heart disease
|            |Stroke      |Stroke
|            |PAD         |Peripheral artery disease
|            |CKD         |Chronic kidney disease
|            |LiverFailure|Liver failure
|            |RA          |Rheumatoid arthritis
|            |T2D         |Type 2 diabetes
|            |T1D         |Type 1 diabetes
|Age at onset|T2Dage      |Age at onset of T2D - It is 0 if `T2D` is 0. Needed in SCORE2.
|Medication  |Insulin     |Taking insulin
|            |AntiDM      |Taking medication for diabetes other than insulin
|            |AntiHT      |Taking medication for hypertension
|            |LipidLower  |Taking lipid-lowering medication 

If any of the columns in this table is missing in your data, one option is to assume that none of the individuals in your population had the disease; that is, add the column filled with 0.
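
A minimal sketch of adding missing columns as all zeros is shown below. The table `covar_example` and the vector `expected_cols` are hypothetical and only cover a subset of the columns described above.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical covariate table lacking the PAD and RA columns
covar_example <- tibble(eid = c(1, 2, 3), HT = c(0, 1, 1), T2D = c(0, 0, 1))

# Columns the analysis expects (subset shown for brevity)
expected_cols <- c("HT", "PAD", "RA", "T2D")

# Add every missing column, filled with 0
missing_cols <- setdiff(expected_cols, names(covar_example))
for (col in missing_cols) {
    covar_example[[col]] <- 0
}

covar_example
```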

### Survival data

Lastly, we need survival data for MACE and diabetes progression. They should look like this:

In [7]:
survmacedat <- read_tsv("../data/survmacedat.tsv", show_col_types = FALSE)
head(survmacedat)

eid,outcome_value,outcome_timeyrs
<dbl>,<dbl>,<dbl>
1000071,0,10.001369
1000223,1,6.874743
1000324,1,3.101985
1000583,1,3.761807
1001175,1,4.539357
1001892,1,9.185489


In [8]:
survdmdat <- read_tsv("../data/survdmdat.tsv", show_col_types = FALSE)
head(survdmdat)

eid,outcome_value,outcome_timeyrs
<dbl>,<dbl>,<dbl>
1000109,1,3.600274
1000132,1,1.24846
1004267,1,4.550308
1006281,1,1.957563
1007454,0,9.423682
1010295,1,6.852841


These two tables include individuals followed ***up to 10 years***. This means that any outcome after 10 years is censored. `outcome_value` is 1 if the person experienced the event during the follow-up time and 0 if not. `outcome_timeyrs` is the time of follow-up in years, up to the first event or up to 10 years. 
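
As an illustration of this coding, the sketch below derives `outcome_value` and `outcome_timeyrs` from hypothetical follow-up records. The input columns `event_yrs` and `end_followup_yrs` are assumptions made for this example; your cohort may store follow-up information differently.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical records: years from baseline to the first event (NA if none)
# and years from baseline to the end of follow-up
followup <- tibble(
    eid = c(1, 2, 3),
    event_yrs = c(6.9, 12.3, NA),
    end_followup_yrs = c(15.0, 15.0, 8.2)
)

surv_example <- followup %>%
    mutate(
        # An event counts only if it occurs within the first 10 years
        outcome_value = as.numeric(!is.na(event_yrs) & event_yrs <= 10),
        # Follow-up stops at the first event, the end of follow-up, or 10 years
        outcome_timeyrs = pmin(event_yrs, end_followup_yrs, 10, na.rm = TRUE)
    ) %>%
    select(eid, outcome_value, outcome_timeyrs)

surv_example
```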

It is important that these tables ***do not include*** individuals who have already experienced the events we will study. In any case, we will make sure of this in the next step, when we combine all the data. For example, any individual in the `survmacedat` table with a value of 1 in the columns `CHD`, `Stroke` or `PAD` of the `covar_dat` table will be excluded from the analysis.
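
The exclusion rule for MACE can be sketched as follows. The internals of the functions sourced earlier are not shown in this guide, so this is only an illustration of the rule on hypothetical tables, not the actual implementation.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical covariate and survival tables
covar_example <- tibble(
    eid = c(1, 2, 3, 4),
    CHD = c(0, 1, 0, 0),
    Stroke = c(0, 0, 0, 1),
    PAD = c(0, 0, 0, 0)
)
surv_example <- tibble(
    eid = c(1, 2, 3, 4),
    outcome_value = c(0, 1, 1, 0),
    outcome_timeyrs = c(10, 2.5, 7.1, 10)
)

# Keep only individuals free of CHD, stroke and PAD at baseline
mace_example <- surv_example %>%
    inner_join(covar_example, by = "eid") %>%
    filter(CHD == 0, Stroke == 0, PAD == 0)

mace_example
```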

If your cohort does not have survival data, follow this guide up to and including the section entitled "Prevalent diseases and medication".

---

## Calculation of cluster probabilities

With the data needed in place, we can start by calculating cluster allocation probabilities given the biomarker data. For that we will first add a new column called `data` to the `validclusmod` table where we will put the biomarker data for each sex:

In [9]:
alldat <- mutate(
    validclusmod,
    data = map(sex, ~strat_dat[[.x]])
)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,207 x 15]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,904 x 15]>


Once we have this table, we can run the function to calculate cluster probabilities:

In [10]:
alldat <- clusterprobcalc(alldat)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,207 x 21]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,904 x 20]>


Checking that the probabilities were calculated for each sex:

In [11]:
head(alldat$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,tg,ldl,fg,smoking,probBC,probDHT,probDAL,probDLT,probDIS,probDHG
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,0.591,2.252,4.395,0,0.1008059,0.8982179,8.145289e-08,0.0006249365,0.0002629902,8.816286e-05
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,2.037,3.686,5.214,0,0.3995536,0.5942413,0.002669602,0.0014612974,0.0009994929,0.001074671
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,1.988,4.551,4.266,0,0.9885268,9.507371e-06,5.39449e-05,0.0110183762,0.0002941394,9.725173e-05
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,2.829,3.491,5.876,0,0.939154,0.0002297826,0.05226406,0.000423494,0.0001454109,0.007783232
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,0.722,3.584,5.212,0,0.7901985,0.2027605,5.046971e-07,0.001904094,0.0015270843,0.003609283
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,0.395,3.072,4.649,0,0.9827497,0.002825642,2.925853e-07,0.0033190879,0.0093330496,0.001772251


---

## Descriptive statistics

At this point we will recheck some of the characteristics of the clusters as we did in our previous script, weighting calculations by cluster probabilities.

The distribution of biomarkers per cluster:

In [12]:
markerdistribdf <- markerdistribfx(alldat)

In [13]:
head(markerdistribdf)

sex,Variable,Cluster,Type,N,Summary1,Summary2
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>
Female,whr,BC,Numeric,58707.879,0.82 (0.07),0.81 (0.7 - 0.77 - 0.86 - 0.96)
Female,whr,DHT,Numeric,7483.777,0.79 (0.05),0.78 (0.69 - 0.75 - 0.82 - 0.9)
Female,whr,DAL,Numeric,3950.988,0.87 (0.06),0.86 (0.76 - 0.83 - 0.91 - 0.99)
Female,whr,DLT,Numeric,2835.984,0.84 (0.07),0.84 (0.71 - 0.79 - 0.89 - 0.98)
Female,whr,DIS,Numeric,2750.474,0.84 (0.07),0.83 (0.71 - 0.79 - 0.89 - 0.98)
Female,whr,DHG,Numeric,1477.897,0.85 (0.08),0.85 (0.71 - 0.79 - 0.91 - 1.02)
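
Weighting by cluster probabilities means that each individual contributes to every cluster in proportion to the probability of belonging to it, which is consistent with `N` not being an integer in the table above. The sketch below shows one way to compute such weighted summaries. The helper `weighted_summary()` and the data are hypothetical, and the exact formulas used by `markerdistribfx()` may differ.

```r
# Hypothetical data: one biomarker and probabilities for two clusters
whr <- c(0.72, 0.84, 0.89, 0.78)
probBC <- c(0.9, 0.4, 0.1, 0.8)
probDHT <- 1 - probBC

weighted_summary <- function(x, w) {
    n <- sum(w)                          # effective cluster size
    m <- sum(w * x) / n                  # probability-weighted mean
    s <- sqrt(sum(w * (x - m)^2) / n)    # probability-weighted SD
    c(N = n, mean = m, sd = s)
}

weighted_summary(whr, probBC)
weighted_summary(whr, probDHT)
```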


The effect of BMI on biomarkers specifically within each cluster, adjusted for age and smoking:

In [14]:
bmieffmarkerdf <- bmieffmarkerfx(alldat)

In [15]:
head(bmieffmarkerdf)

sex,Variable,Cluster,term,estimate,se
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,whr,BC,(Intercept),0.57675356,0.00191741
Female,whr,BC,age,0.001289826,2.768274e-05
Female,whr,BC,smoking,0.02011465,0.0007707873
Female,whr,BC,bmi,0.006069039,4.309974e-05
Female,whr,DHT,(Intercept),0.569642086,0.001568987
Female,whr,DHT,age,0.001087705,2.096435e-05
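
One way to obtain cluster-specific estimates like these is a linear regression in which each individual is weighted by the probability of belonging to the cluster. The sketch below uses simulated data and assumes this weighting scheme; it is not necessarily what `bmieffmarkerfx()` does internally.

```r
# Simulated data (hypothetical), with a known BMI effect of 0.006 on WHR
set.seed(2)
n <- 500
example_dat <- data.frame(
    age = runif(n, 40, 70),
    smoking = rbinom(n, 1, 0.1),
    bmi = rnorm(n, 27, 4),
    probBC = runif(n)
)
example_dat$whr <- 0.57 + 0.001 * example_dat$age + 0.02 * example_dat$smoking +
    0.006 * example_dat$bmi + rnorm(n, sd = 0.05)

# Cluster-specific model: weights are the probabilities of belonging to the cluster
fit_bc <- lm(whr ~ age + smoking + bmi, data = example_dat, weights = probBC)

# Estimates and standard errors, in the same term order as the table above
coef(summary(fit_bc))[, c("Estimate", "Std. Error")]
```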


---

## Prevalent diseases and medication

To add covariate data to the `alldat` table we will do the following:

In [16]:
alldat <- mutate(
    alldat,
    data = map(data, inner_join, covar_dat, by = "eid")
)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,151 x 35]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,848 x 34]>


Checking again if the columns were added as expected:

In [17]:
head(alldat$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,CKD,LiverFailure,RA,T2D,T1D,T2Dage,Insulin,AntiDM,AntiHT,LipidLower
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,0,0,0,0,0,0,0,0,0,0
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,0,0,0,0,0,0,0,0,1,0
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,0,0,0,0,0,0,0,0,1,0
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,0,0,0,0,0,0,0,0,1,1
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,0,0,0,0,0,0,0,0,0,0
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,0,0,0,0,0,0,0,0,0,0


We will first count the number of individuals with disease in each cluster. Here we will also count the number of individuals taking each class of medications in each cluster.

In [18]:
countcovarsdf <- countcovarsfx(alldat)

In [19]:
head(countcovarsdf)

sex,Cluster,Covariate,Nclus,NclusDX
<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,BC,HT,58673.33,12946.70991
Female,BC,CHD,58673.33,1659.0005
Female,BC,Stroke,58673.33,848.10719
Female,BC,PAD,58673.33,147.89695
Female,BC,CKD,58673.33,67.25413
Female,BC,LiverFailure,58673.33,50.16266


We will use this table to calculate prevalences and compare prevalences across clusters.
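
As a sketch of how prevalences could be derived from this table, the example below divides the weighted count of individuals with each diagnosis (`NclusDX`) by the weighted cluster size (`Nclus`). The interpretation of these two columns is inferred from their names, and `counts_example` only reproduces the first rows printed above.

```r
library(dplyr, warn.conflicts = FALSE)

# First rows of the table printed above
counts_example <- tibble(
    sex = "Female",
    Cluster = "BC",
    Covariate = c("HT", "CHD", "Stroke"),
    Nclus = 58673.33,
    NclusDX = c(12946.70991, 1659.0005, 848.10719)
)

# Prevalence of each condition within the cluster
prev_example <- mutate(counts_example, Prevalence = NclusDX / Nclus)
prev_example
```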

We are also interested in looking at the following proportions in each cluster:

- Proportion of individuals with hypertension receiving antihypertensives.
- Proportion of individuals with CHD receiving lipid-lowering medication.
- Proportion of individuals with T2D taking insulin.
- Proportion of individuals with T2D taking insulin or any other anti-diabetic medication.

In [20]:
countspectxdf <- countspectxfx(alldat)

In [21]:
head(countspectxdf)

sex,DX,MED,Cluster,NclusDXM
<chr>,<chr>,<chr>,<chr>,<dbl>
Female,HT,AntiHT,BC,9868.8856
Female,HT,AntiHT,DHT,1366.1636
Female,HT,AntiHT,DAL,791.7766
Female,HT,AntiHT,DLT,654.6246
Female,HT,AntiHT,DIS,671.0381
Female,HT,AntiHT,DHG,550.5115


We will also formally test the association between cluster allocation and diseases using logistic regressions where the outcome is each disease and the predictors are the cluster allocations. We will have two models for each disease, one with only clusters, and a second one adjusting for medication.

In [22]:
assocdxdf <- assocdxfx(alldat)

In [23]:
head(assocdxdf)

sex,DX,model,term,estimate,se
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,HT,OnlyClusters,(Intercept),-1.34906659,0.010818331
Female,HT,OnlyClusters,probDHT,0.09715432,0.003340768
Female,HT,OnlyClusters,probDAL,0.05051826,0.004554858
Female,HT,OnlyClusters,probDLT,0.07656229,0.005173978
Female,HT,OnlyClusters,probDIS,0.07378141,0.004870324
Female,HT,OnlyClusters,probDHG,0.14847308,0.006453885
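
The sketch below illustrates the "only clusters" model on simulated data. Because cluster probabilities sum to 1 for each individual, one cluster has to be left out as the reference; the output above lists no `probBC` term, which suggests that BC plays that role, but this is an inference. The data and effect sizes are hypothetical, and any scaling of the probabilities applied by `assocdxfx()` is not reproduced here.

```r
# Simulated probabilities for three clusters; each row sums to 1
set.seed(3)
n <- 1000
raw <- matrix(rexp(n * 3), ncol = 3)
probs <- raw / rowSums(raw)
colnames(probs) <- c("probBC", "probDHT", "probDAL")
example_dat <- as.data.frame(probs)

# Simulated hypertension status, more likely with higher probDHT
example_dat$HT <- rbinom(n, 1, plogis(-1.3 + 1.0 * example_dat$probDHT))

# probBC is omitted, so the intercept represents the reference cluster
fit_ht <- glm(HT ~ probDHT + probDAL, family = binomial, data = example_dat)
coef(summary(fit_ht))[, c("Estimate", "Std. Error")]
```

A model adjusting for medication would add the relevant medication columns, for example `AntiHT`, to the right-hand side of the formula.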


---

## Adding survival data

As explained before, we want to be careful when adding survival data for analysis. We have therefore prepared a separate function for each outcome, which also makes sure that individuals who have already experienced the event under study are excluded:

In [24]:
alldat <- alldat %>%
    mutate(
        macedf = purrr::map(data, addsurvmacedat, SURVDATA = survmacedat),
        dmdf = purrr::map(data, addsurvdmdat, SURVDATA = survdmdat)
    )
print(alldat)

# A tibble: 2 x 6
  sex    residmod          clusmod          data     macedf   dmdf
  <chr>  <list>            <list>           <list>   <list>   <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble> <tibble> <tibble>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble> <tibble> <tibble>


`macedf` and `dmdf` now contain the data necessary to run survival analysis.

---

## Overall and cluster-specific Kaplan-Meier estimates

The first thing to do is to obtain overall and cluster-specific cumulative incidence rates using the Kaplan-Meier method:

In [25]:
kmestdf <- kmestfx(alldat)

In [26]:
head(kmestdf)

sex,Outcome,Cluster,risk,lower,upper
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Female,MACE,Overall,0.04651655,0.04468811,0.04834149
Female,MACE,probBC,0.04399324,0.04194391,0.04603818
Female,MACE,probDHT,0.04512474,0.03942353,0.05079211
Female,MACE,probDAL,0.06075845,0.05173248,0.06969851
Female,MACE,probDLT,0.04782438,0.03785284,0.05769258
Female,MACE,probDIS,0.06237728,0.05175902,0.07287664
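
A minimal sketch of overall and cluster-weighted Kaplan-Meier estimates of 10-year risk is shown below, using `survival::survfit()` on simulated data. Weighting by cluster probability is an assumption about how `kmestfx()` works, and the helper `risk_at_10()` is hypothetical.

```r
library(survival)

# Simulated follow-up (hypothetical), censored at 10 years
set.seed(4)
n <- 2000
example_dat <- data.frame(
    time = pmin(rexp(n, rate = 0.005), 10),
    probDAL = runif(n)
)
example_dat$event <- as.numeric(example_dat$time < 10)

# Overall estimate, and an estimate weighted by the probability of one cluster
km_all <- survfit(Surv(time, event) ~ 1, data = example_dat)
km_dal <- survfit(Surv(time, event) ~ 1, data = example_dat, weights = probDAL)

# 10-year cumulative incidence is 1 minus the survival probability at 10 years
risk_at_10 <- function(fit) {
    s <- summary(fit, times = 10, extend = TRUE)
    c(risk = 1 - s$surv, lower = 1 - s$upper, upper = 1 - s$lower)
}

risk_at_10(km_all)
risk_at_10(km_dal)
```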


---

## Cox models

### MACE

To quantify the association of clusters with MACE, as well as their potential contribution to prediction, we will compare two models. The reference model will include all predictors that are part of SCORE2, the risk stratification tool for CVD recommended by the European Society of Cardiology <cite id="pzdxs"><a href="#zotero|10831815/ZY2CL5NC">(SCORE2 working group and ESC Cardiovascular risk collaboration, 2021)</a></cite>. We will use a version of this score that has been validated in diabetic populations and includes some additional clinically useful predictors <cite id="itp38"><a href="#zotero|10831815/FPUFQKFI">(SCORE2-Diabetes Working Group and the ESC Cardiovascular Risk Collaboration, 2023)</a></cite>. Additionally, for the sake of completeness, the reference model will include some important pre-existing conditions and pharmacological treatments, such as hypertension and antihypertensives, as well as any predictor from our initial input table that is not part of SCORE2. The second model adds the cluster probabilities to the reference model, and we will then compare the ability of the two models to predict MACE.

#### Fitting models

In [27]:
coxmodmacedf <- coxmodelsmace(alldat)

In [28]:
print(coxmodmacedf)

# A tibble: 2 x 6
  sex    macedf                 score2   mod_null   mod_score2 mod_score2clus
  <chr>  <list>                 <list>   <list>     <list>     <list>
1 Female <tibble [73,378 x 37]> <tibble> <cxph.nll> <coxph>    <coxph>
2 Male   <tibble [60,348 x 36]> <tibble> <cxph.nll> <coxph>    <coxph>


Here `score2` contains the predictors used in SCORE2-Diabetes, plus other predictors that we have in our input table and cluster allocations. The two models are contained in the last two columns. `mod_null` contains the null model, which we will use to calculate our metrics.
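
To make the structure of the three models concrete, here is a sketch using `survival::coxph()` on simulated data. The predictors are a small hypothetical subset, with a single cluster probability, and the formulas are not those used inside `coxmodelsmace()`.

```r
library(survival)

# Simulated cohort (hypothetical predictors and effect sizes)
set.seed(5)
n <- 3000
example_dat <- data.frame(
    age = rnorm(n, 55, 8),
    sbp = rnorm(n, 135, 15),
    probDAL = runif(n)
)
lp <- 0.05 * (example_dat$age - 55) + 0.01 * (example_dat$sbp - 135) +
    0.8 * example_dat$probDAL
event_time <- rexp(n, rate = 0.01 * exp(lp))
example_dat$time <- pmin(event_time, 10)               # censor at 10 years
example_dat$event <- as.numeric(event_time <= 10)

# Null model: no predictors, used as the baseline log-likelihood
mod_null <- coxph(Surv(time, event) ~ 1, data = example_dat)
# Reference model: established predictors only
mod_ref <- coxph(Surv(time, event) ~ age + sbp, data = example_dat)
# Reference model plus cluster probability
mod_clus <- coxph(Surv(time, event) ~ age + sbp + probDAL, data = example_dat)

c(null = mod_null$loglik[1], ref = mod_ref$loglik[2], clus = mod_clus$loglik[2])
```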

#### Coefficient estimates

In [29]:
macesurvcoefs <- macesurvcoefx(coxmodmacedf)

In [30]:
head(macesurvcoefs)

sex,model,term,estimate,se
<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,score2,age,0.07172804,0.098409552
Female,score2,smoking,1.8449113,0.482558654
Female,score2,sbp,0.03008216,0.009867439
Female,score2,T2D,0.46023335,1.019902541
Female,score2,tchol,0.48286593,0.179341661
Female,score2,hdl,-1.89109301,0.541588011


#### Comparison of predictive ability

We can now assess the predictive ability of each model and compare them. As the models are nested, we will use the gold-standard method, the likelihood ratio test. Given the wide use of the c-statistic, we will also use this metric. However, comparing two c-statistics is not as powerful as the likelihood ratio test.
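
For the c-statistic part of this comparison, one possible approach is `survival::concordance()`, which accepts several fitted models and returns their joint covariance, so the difference can be tested with a contrast. The sketch below uses simulated data; whether `comparemodsmace()` uses this approach is an assumption.

```r
library(survival)

# Simulated data (hypothetical predictors and effect sizes)
set.seed(6)
n <- 3000
d <- data.frame(sbp = rnorm(n, 135, 15), probDAL = runif(n))
event_time <- rexp(n, rate = 0.01 * exp(0.02 * (d$sbp - 135) + 0.8 * d$probDAL))
d$time <- pmin(event_time, 10)
d$event <- as.numeric(event_time <= 10)

mod_ref <- coxph(Surv(time, event) ~ sbp, data = d)
mod_clus <- coxph(Surv(time, event) ~ sbp + probDAL, data = d)

# C-statistics of both models on the same data, with their joint covariance
cstat <- concordance(mod_ref, mod_clus)

# Difference: reference model minus model with cluster probability
contrast <- c(1, -1)
c_diff <- sum(contrast * coef(cstat))
c_diff_se <- sqrt(drop(contrast %*% vcov(cstat) %*% contrast))
c_diff_p <- 2 * pnorm(-abs(c_diff / c_diff_se))

c(Cdiff = c_diff, Cdiffse = c_diff_se, Cdiffp = c_diff_p)
```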

In [31]:
compmodmacedf <- comparemodsmace(coxmodmacedf)
compmodmacedf

sex,LL0,LLmod_score2,LLmod_score2clus,NVmod_score2,NVmod_score2clus,LRTstat,LRTdf,LRTp,AdeqInd,Cmod_score2,Cmodse_score2,Cmod_score2clus,Cmodse_score2clus,Cdiff,Cdiffse,Cdiffp
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,-32431.04,-31447.77,-31438.68,29,34,18.17871,5,0.002730575,0.9908407,0.7264359,1.992463e-05,0.7269904,1.992463e-05,-0.0005545383,0.0004228024,0.189662633
Male,-53605.73,-52508.64,-52487.42,29,33,42.45329,4,1.343569e-08,0.9810191,0.6852298,1.245548e-05,0.6868959,1.245548e-05,-0.0016661126,0.0005453441,0.002249415


Some details of these columns:

- `LRTp` is the p-value of the likelihood ratio test comparing models with or without cluster allocations.
- `AdeqInd` is the adequacy index: the ratio of the likelihood ratio statistic of the reference model to that of the model with cluster allocations. 1 minus this value represents the fraction of information added by cluster allocations.
- `Cdiffp` is the p-value of the difference between c-statistics of the two models.
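
As a worked example, the likelihood ratio test and the adequacy index can be recomputed from the log-likelihoods in the "Female" row of the table above. The results agree with the tabulated `LRTstat`, `LRTp` and `AdeqInd` up to the rounding of the printed log-likelihoods. The variable names are chosen for this illustration only.

```r
# Log-likelihoods and numbers of parameters from the "Female" row above
LL0 <- -32431.04       # null model
LL_ref <- -31447.77    # mod_score2
LL_clus <- -31438.68   # mod_score2clus
NV_ref <- 29
NV_clus <- 34

# Likelihood ratio test between the nested models
LRTstat <- 2 * (LL_clus - LL_ref)
LRTdf <- NV_clus - NV_ref
LRTp <- pchisq(LRTstat, df = LRTdf, lower.tail = FALSE)

# Adequacy index: likelihood ratio of the reference model over that of the full model
AdeqInd <- (LL_ref - LL0) / (LL_clus - LL0)

c(LRTstat = LRTstat, LRTdf = LRTdf, LRTp = LRTp, AdeqInd = AdeqInd)
```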

While the p-values of the likelihood ratio test and of the difference between C-statistics indicate whether there is evidence of added value, the adequacy index quantifies how much additional information is obtained when cluster allocations are added on top of the common biomarkers contained in SCORE2.

#### Adequacy index by cluster

Next, we quantify how much value the new model adds within each cluster. We do this by recalculating the log-likelihood of each model, this time weighting individuals by their cluster probabilities.

In [32]:
adeqindbyclusmace <- AdeqIndClusMACEFx(coxmodmacedf)

In [33]:
adeqindbyclusmace

sex,Cluster,AdeqInd
<chr>,<chr>,<dbl>
Female,probBC,0.9948141
Female,probDHT,0.9894939
Female,probDAL,1.0
Female,probDLT,0.9925278
Female,probDIS,0.9680502
Female,probDHG,0.9718811
Male,probBC,0.9920569
Male,probDAL,0.9501901
Male,probDLT,0.9795781
Male,probDIS,0.9077753


#### Adequacy index by SCORE2 probabilities

We will also assess the adequacy index in individuals over certain thresholds of 10-year probability of MACE calculated by SCORE2. This gives an idea of the utility of adding cluster allocation information across the SCORE2 risk scale.

In [34]:
adeqindbypremace <- AdeqIndByPreMACEFx(coxmodmacedf)

In [35]:
adeqindbypremace

sex,threshold,AdeqInd
<chr>,<chr>,<dbl>
Female,0.0,0.9908407
Female,0.05,0.954424
Female,0.1,0.7869291
Female,0.15,0.4383541
Female,0.2,0.7576286
Female,0.25,0.7010158
Male,0.0,0.9810191
Male,0.05,0.9491618
Male,0.1,0.8630751
Male,0.15,0.7697056


#### Decision curve analysis

The last step in assessing the clinical utility of cluster allocations is to perform a decision curve analysis. First we will assess the overall net benefit of both models:

In [36]:
dcamace <- DCurvMACEFx(coxmodmacedf)
head(dcamace)

sex,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,score2_y,73378,0.0,0.04662649,0.04662649,0.9533735,0.04662649,
Female,score2_y,73378,0.01,0.04662649,0.04589481,0.8266147,0.03754517,0.05432287
Female,score2_y,73378,0.02,0.04662649,0.04369676,0.6703592,0.03001596,0.13945778
Female,score2_y,73378,0.03,0.04662649,0.04079859,0.541147,0.02406208,0.22379113
Female,score2_y,73378,0.04,0.04662649,0.03676295,0.4268229,0.01897866,0.28982569
Female,score2_y,73378,0.05,0.04662649,0.03215872,0.3282218,0.01488389,0.35026414
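
The relationship between these columns can be checked with the standard decision curve formulas. The sketch below recomputes `net_benefit` and `net_intervention_avoided` for the row with threshold 0.01 and agrees with the tabulated values up to rounding. The variable names are chosen for this illustration; the column names resemble the output of the `dcurves` package, but that is an assumption.

```r
# Values from the "Female", threshold = 0.01 row above
threshold <- 0.01
pos_rate <- 0.04662649
tp_rate <- 0.04589481
fp_rate <- 0.8266147

# Odds of the threshold probability: the weight given to a false positive
odds <- threshold / (1 - threshold)

# Net benefit of treating those whose predicted risk exceeds the threshold
net_benefit <- tp_rate - fp_rate * odds

# Net benefit of treating everyone, used as the comparison strategy
net_benefit_all <- pos_rate - (1 - pos_rate) * odds

# Net interventions avoided, relative to treating everyone
net_intervention_avoided <- (net_benefit - net_benefit_all) / odds

c(net_benefit = net_benefit, net_intervention_avoided = net_intervention_avoided)
```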


We will also calculate this by cluster:

In [37]:
dcaclusmace <- DCurvMACEbyClFx(coxmodmacedf)

In [38]:
head(dcaclusmace)

sex,Cluster,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,probBC,score2_y,55802.39,0.0,0.04412461,0.04412461,0.9558754,0.04412461,
Female,probBC,score2_y,55802.39,0.01,0.04412461,0.04335271,0.8127846,0.03514277,0.06667229
Female,probBC,score2_y,55802.39,0.02,0.04412461,0.04095128,0.6476071,0.02773481,0.15277488
Female,probBC,score2_y,55802.39,0.03,0.04412461,0.03813883,0.5140207,0.02224128,0.2483143
Female,probBC,score2_y,55802.39,0.04,0.04412461,0.03393073,0.3977704,0.01735696,0.31345161
Female,probBC,score2_y,55802.39,0.05,0.04412461,0.02939634,0.3004757,0.01358183,0.37556248


### Diabetes

For diabetes progression, the reference model will include fasting glucose, in addition to all the components of the metabolic syndrome that we already have included in our input table. And as before, the second model will include cluster probabilities.

#### Fitting models

In [39]:
coxmoddmdf <- coxmodelsdm(alldat)

In [40]:
print(coxmoddmdf)

# A tibble: 2 x 6
  sex    dmdf                   baseclus mod_null   mod_base mod_baseclus
  <chr>  <list>                 <list>   <list>     <list>   <list>
1 Female <tibble [34,581 x 37]> <tibble> <cxph.nll> <coxph>  <coxph>
2 Male   <tibble [29,006 x 36]> <tibble> <cxph.nll> <coxph>  <coxph>


#### Coefficient estimates

In [41]:
dmsurvcoefs <- dmsurvcoefx(coxmoddmdf)

In [42]:
head(dmsurvcoefs)

sex,model,term,estimate,se
<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,base,whr,6.1386758297,0.448697839
Female,base,sbp,0.0051523055,0.002125225
Female,base,dbp,0.0052238998,0.004000151
Female,base,alt,0.0213193397,0.002098734
Female,base,scr,-0.0004285693,0.002460233
Female,base,crp,0.0662471889,0.006247497


#### Comparison of predictive ability

In [43]:
compmoddmdf <- comparemodsdm(coxmoddmdf)
compmoddmdf

sex,LL0,LLmod_base,LLmod_baseclus,NVmod_base,NVmod_baseclus,LRTstat,LRTdf,LRTp,AdeqInd,Cmod_base,Cmodse_base,Cmod_baseclus,Cmodse_baseclus,Cdiff,Cdiffse,Cdiffp
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,-12041.25,-10977.42,-10949.17,21,26,56.50309,5,6.401543e-11,0.9741306,0.8449716,2.89836e-05,0.8473597,2.89836e-05,-0.002388193,0.0009235356,0.0097118291
Male,-16105.26,-14928.39,-14913.04,21,25,30.7065,4,3.513763e-06,0.9871222,0.8106764,2.69697e-05,0.8130544,2.69697e-05,-0.002377953,0.0006716014,0.0003990354


#### Adequacy index by cluster

In [44]:
adeqindbyclusdm <- AdeqIndClusDMFx(coxmoddmdf)

In [45]:
adeqindbyclusdm

sex,Cluster,AdeqInd
<chr>,<chr>,<dbl>
Female,probBC,0.9844923
Female,probDHT,0.9629648
Female,probDAL,1.0
Female,probDLT,0.9627108
Female,probDIS,0.9504234
Female,probDHG,1.0
Male,probBC,0.9848789
Male,probDAL,0.9896813
Male,probDLT,1.0
Male,probDIS,1.0


#### Adequacy index by probabilities of base model

In [46]:
adeqindbypredm <- AdeqIndByPreDMFx(coxmoddmdf)

In [47]:
adeqindbypredm

sex,threshold,AdeqInd
<chr>,<chr>,<dbl>
Female,0.0,0.9741306
Female,0.05,0.8825301
Female,0.1,0.8424337
Female,0.15,0.8596839
Female,0.2,0.902096
Female,0.25,1.0
Male,0.0,0.9871222
Male,0.05,0.9675628
Male,0.1,0.9911908
Male,0.15,0.998863


#### Decision curve analysis

The last step in assessing the clinical utility of cluster allocations is to perform a decision curve analysis. First we will assess the overall net benefit of both models:

In [48]:
dcadm <- DCurvDMFx(coxmoddmdf)
head(dcadm)

sex,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,base_y,34581,0.0,0.03580024,0.03580024,0.9641998,0.03580024,
Female,base_y,34581,0.01,0.03580024,0.03446519,0.5931916,0.02847335,0.2388374
Female,base_y,34581,0.02,0.03580024,0.03221275,0.3839696,0.02437663,0.4044428
Female,base_y,34581,0.03,0.03580024,0.02939749,0.2699576,0.02104828,0.4872197
Female,base_y,34581,0.04,0.03580024,0.02651286,0.2035846,0.01803017,0.5377179
Female,base_y,34581,0.05,0.03580024,0.02445338,0.1594337,0.01606213,0.5891756


We will also calculate this by cluster:

In [49]:
dcaclusdm <- DCurvDMbyClFx(coxmoddmdf)

In [50]:
head(dcaclusdm)

sex,Cluster,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,probBC,base_y,26382.99,0.0,0.03050377,0.03050377,0.9694962,0.03050377,
Female,probBC,base_y,26382.99,0.01,0.03050377,0.02896888,0.5602301,0.02330999,0.2573115
Female,probBC,base_y,26382.99,0.02,0.03050377,0.02652416,0.3430397,0.01952335,0.4314557
Female,probBC,base_y,26382.99,0.03,0.03050377,0.02353238,0.2276125,0.01649282,0.5164753
Female,probBC,base_y,26382.99,0.04,0.03050377,0.02079057,0.1642921,0.01394506,0.5720872
Female,probBC,base_y,26382.99,0.05,0.03050377,0.01859647,0.1239074,0.01207502,0.61935


---

## Saving data

As done before, we will ask you to save an R file that does not contain any individual data, only summary statistics, as follows:

In [51]:
result_file2 <- list(
    MarkerDistrib = markerdistribdf,
    BMIeffOnMarker = bmieffmarkerdf,
    CountCovars = countcovarsdf,
    CountSpecDXMeds = countspectxdf,
    CrossSectAssoc = assocdxdf,
    KaplanMeierDF = kmestdf,
    MACESurvCoefs = macesurvcoefs,
    ComparisonMACE = compmodmacedf,
    AdeqIndByClusMACE = adeqindbyclusmace,
    AdeqIndByPreMACE = adeqindbypremace,
    DCAResMACE = dcamace,
    DCAREsClusMACE = dcaclusmace,
    DMSurvCoefs = dmsurvcoefs,
    ComparisonDM = compmoddmdf,
    AdeqIndByClusDM = adeqindbyclusdm,
    AdeqIndByPreDM = adeqindbypredm,
    DCAResDM = dcadm,
    DCAREsClusDM = dcaclusdm
)

In [52]:
save(
    result_file2,
    file = "../data/ukb/result_file2.RData"
)

This file should be uploaded to the respective folder of the cohort in Teams:

> CrossWP > Analyst working groups > WG1 > UMAP_project > *cohort_name* > data

---

## References

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|10831815/ZY2CL5NC"></i>SCORE2 working group and ESC Cardiovascular risk collaboration. (2021). SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. <i>European Heart Journal</i>, <i>42</i>(25), 2439–2454. https://doi.org/10.1093/eurheartj/ehab309</div>
  <div class="csl-entry"><i id="zotero|10831815/FPUFQKFI"></i>SCORE2-Diabetes Working Group and the ESC Cardiovascular Risk Collaboration. (2023). SCORE2-Diabetes: 10-year cardiovascular risk estimation in type 2 diabetes in Europe. <i>European Heart Journal</i>, ehad260. https://doi.org/10.1093/eurheartj/ehad260</div>
</div>
<!-- BIBLIOGRAPHY END -->