# Cluster probabilities and outcomes

---

***Authors:***

- Daniel E. Coral

- Femke Smit

- Elena Santos

- Ali Farzaneh

---

In this second part of the analysis, we will examine how the clusters we have validated across cohorts are associated with prevalent diseases at the time of clustering. We will also assess whether they add significant information for the prediction of major adverse cardiovascular events (MACE) and diabetes progression on top of commonly used risk stratification tools.

## Libraries and functions

The libraries needed to run this analysis:

In [1]:
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(purrr)
library(survival)

And the functions we have prepared to facilitate some steps:

In [2]:
source("cross_sectional_FX2.R")

---

## Loading data needed

### Initial input table of biomarkers and basic covariates

The input table is the same table of 10 traits we had prior to running UMAP. Here is a description of this table:

|System targeted |Biomarker               |Units                        |Column name|
|----------------|------------------------|-----------------------------|-----------|
|                |                        |                             |           |
|Individual ID   |-                       |-                            |eid        |
|                |                        |                             |           |
|Blood pressure  |Systolic blood pressure |millimeters of mercury (mmHg)|sbp        |
|                |Diastolic blood pressure|millimeters of mercury (mmHg)|dbp        |
|                |                        |                             |           |
|Lipid fractions |High density lipoprotein|mmol/L                       |hdl        |
|                |Low density lipoprotein |mmol/L                       |ldl        |
|                |Triglycerides           |mmol/L                       |tg         |
|                |                        |                             |           |
|Glycemia        |Fasting glucose         |mmol/L                       |fg         |
|                |                        |                             |           |
|Liver metabolism|Alanine transaminase    |U/L                          |alt        |
|                |                        |                             |           |
|Fat distribution|Waist-to-hip ratio      |cm/cm                        |whr        |
|                |                        |                             |           |
|Kidney function |Serum creatinine        |umol/L                       |scr        |
|                |                        |                             |           |
|Inflammation    |C reactive protein      |mg/L                         |crp        |
|                |                        |                             |           |
|Basic covariates|Current smoking status  |1 if yes, 0 if not           |smoking    |
|                |Sex                     |String ("Female" or "Male")  |sex        |
|                |Age                     |Years                        |age        |

***Important note:*** All columns should be there in the units required, and the names should match, so that the functions we have prepared for the analyses work properly. This is true for this and all the following tables we require for our analysis.

This input table has been preprocessed by:

1. Filtering out values that are possible errors in measurement (>5 SD away from the mean in continuous variables).
2. Only including complete cases.
3. Stratifying by sex.
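
The three preprocessing steps above can be sketched as follows. This is only an illustration in base R on made-up data, not the code used to produce `strat_dat`: the object names (`raw_dat`, `clean_dat`, `strat_example`) are hypothetical, and we assume here that the 5 SD rule is applied to the overall mean of each variable.

```r
# Hypothetical raw table; column names follow the input table above,
# the values are simulated for illustration only.
set.seed(1)
raw_dat <- data.frame(
    eid = 1:200,
    sex = rep(c("Female", "Male"), each = 100),
    sbp = rnorm(200, mean = 135, sd = 15),
    whr = rnorm(200, mean = 0.85, sd = 0.07)
)
raw_dat$sbp[1] <- 400   # implausible value, more than 5 SD from the mean
raw_dat$whr[2] <- NA    # incomplete case

num_cols <- c("sbp", "whr")

# 1. Set values more than 5 SD away from the mean to missing
for (col in num_cols) {
    z <- (raw_dat[[col]] - mean(raw_dat[[col]], na.rm = TRUE)) /
        sd(raw_dat[[col]], na.rm = TRUE)
    raw_dat[[col]][!is.na(z) & abs(z) > 5] <- NA
}

# 2. Keep complete cases only
clean_dat <- raw_dat[complete.cases(raw_dat), ]

# 3. Stratify by sex: a named list with one data frame per sex
strat_example <- split(clean_dat, clean_dat$sex)
sapply(strat_example, nrow)
```

The result is a named list with the elements `Female` and `Male`, which is the layout the functions in this guide expect.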

Here is what the input table should look like: a list of two data frames, one for each sex.

In [3]:
load("../data/ukb/strat_dat.RData")

In [4]:
map(strat_dat, head)

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,hdl,tg,ldl,fg,smoking
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,1.972,0.591,2.252,4.395,0
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,1.236,2.037,3.686,5.214,0
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,1.601,1.988,4.551,4.266,0
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,1.453,2.829,3.491,5.876,0
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,2.185,0.722,3.584,5.212,0
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,2.346,0.395,3.072,4.649,0

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,hdl,tg,ldl,fg,smoking
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000039,44,Male,36.6959,0.9911504,124.5,64.5,34.97,93.0,3.6,1.158,2.8,3.956,5.427,0
1000071,67,Male,39.4807,0.8857143,179.5,103.0,46.74,68.7,9.41,1.372,1.127,2.311,7.079,0
1000088,60,Male,24.2786,0.8761905,152.0,89.0,13.14,80.6,1.2,0.983,1.59,4.2,5.401,0
1000096,41,Male,26.5744,0.9587629,143.0,90.0,30.32,80.1,6.13,1.041,2.713,4.029,4.239,0
1000109,62,Male,33.8719,1.0818182,156.5,104.5,16.26,89.3,14.42,0.89,2.437,3.525,6.1,0
1000125,66,Male,36.11,1.0625,155.0,102.5,25.59,88.7,1.91,1.061,1.32,2.538,4.531,1


### Table of validated clusters

The second thing needed is the set of clusters we have validated. We have stored them in an R data file that contains an object called `validclusmod`:

In [5]:
load("../data/validclusmod.RData")
print(validclusmod)

# A tibble: 2 x 3
  sex    residmod          clusmod
  <chr>  <list>            <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]>


This object contains, for each sex:
- `residmod`: The model to obtain residuals for each variable, i.e., the variability beyond what is explained by BMI, adjusting for age and smoking.
- `clusmod`: The clustering model to apply to the residuals.

### Table of pre-existing conditions and medications

The third thing we need is a table of pre-existing conditions and medications participants are currently taking:

In [6]:
covar_dat <- read_tsv("../data/covar_dat.tsv", show_col_types = FALSE)
head(covar_dat)

eid,HT,CHD,Stroke,PAD,CKD,LiverFailure,RA,T2D,T1D,T2Dage,Insulin,AntiDM,AntiHT,LipidLower
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000027,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000039,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000040,1,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000053,0,0,0,0,0,0,0,0,0,0.0,0,0,0,0
1000064,1,0,0,0,0,0,0,1,0,49.5,0,1,1,0
1000071,1,0,0,0,0,0,0,1,0,65.5,0,0,1,1


All the columns in this table are coded 1 or 0, representing whether the person currently has a diagnosis of the disease or is taking the medication specified. The exception is `T2Dage`, which is the age at onset of T2D. This is what each column represents:

|Group       |Column name |Meaning
|------------|------------|--------
|Diagnoses   |HT          |Hypertension
|            |CHD         |Coronary heart disease
|            |Stroke      |Stroke
|            |PAD         |Peripheral artery disease
|            |CKD         |Chronic kidney disease
|            |LiverFailure|Liver failure
|            |RA          |Rheumatoid arthritis
|            |T2D         |Type 2 diabetes
|            |T1D         |Type 1 diabetes
|Age at onset|T2Dage      |Age at onset of T2D - It is 0 if `T2D` is 0. Needed in SCORE2.
|Medication  |Insulin     |Taking insulin
|            |AntiDM      |Taking medication for diabetes other than insulin
|            |AntiHT      |Taking medication for hypertension
|            |LipidLower  |Taking lipid-lowering medication 

If any of the columns in this table are missing in your data, one option is to assume that nobody in your population had the disease or was taking the medication, i.e. to set the column to 0 for everyone.
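
As a minimal sketch of that option, missing columns can be added and filled with 0. The table `covar_example` below is made up, and this is only one possible way of doing it; whether the assumption is reasonable depends on your cohort.

```r
# Column names expected in the covariate table (taken from the table above)
expected_cols <- c("HT", "CHD", "Stroke", "PAD", "CKD", "LiverFailure", "RA",
                   "T2D", "T1D", "T2Dage", "Insulin", "AntiDM", "AntiHT",
                   "LipidLower")

# Hypothetical cohort table lacking several of the expected columns
covar_example <- data.frame(
    eid = c(1, 2, 3),
    HT = c(0, 1, 1),
    T2D = c(0, 0, 1),
    T2Dage = c(0, 0, 52)
)

# Add every missing column as 0 (assumption: nobody had the condition
# or was taking the medication)
missing_cols <- setdiff(expected_cols, names(covar_example))
covar_example[missing_cols] <- 0

# Reorder columns to match the expected layout
covar_example <- covar_example[c("eid", expected_cols)]
covar_example
```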

### Survival data

Lastly, we need survival data for MACE and diabetes progression. They should look like this:

In [7]:
survmacedat <- read_tsv("../data/survmacedat.tsv", show_col_types = FALSE)
head(survmacedat)

eid,outcome_value,outcome_timeyrs
<dbl>,<dbl>,<dbl>
1000071,0,10.001369
1000223,1,6.874743
1000324,1,3.101985
1000583,1,3.761807
1001175,1,4.539357
1001892,1,9.185489


In [8]:
survdmdat <- read_tsv("../data/survdmdat.tsv", show_col_types = FALSE)
head(survdmdat)

eid,outcome_value,outcome_timeyrs
<dbl>,<dbl>,<dbl>
1000109,1,3.600274
1000132,1,1.24846
1004267,1,4.550308
1006281,1,1.957563
1007454,0,9.423682
1010295,1,6.852841


These two tables include individuals followed ***up to 10 years***. This means that any outcome after 10 years is censored. `outcome_value` is 1 if the person experienced the event during the follow-up time and 0 if not. `outcome_timeyrs` is the time of follow-up in years, up to the first event or up to 10 years. 

It is important that these tables ***do not include*** individuals who have already experienced the events we will study. In any case, we will make sure of this in the next steps, when we combine all the data. For example, any individual in the `survmacedat` table with a value of 1 in the columns `CHD`, `Stroke` or `PAD` of the `covar_dat` table will be excluded from the analysis.

If your cohort does not have survival data, follow this guide up to and including the section below entitled "Prevalent diseases and medication".

---

## Calculation of cluster probabilities

With the data needed in place, we can start by calculating cluster allocation probabilities given the biomarker data. For that we will first add a new column called `data` to the `validclusmod` table where we will put the biomarker data for each sex:

In [9]:
alldat <- mutate(
    validclusmod,
    data = map(sex, ~strat_dat[[.x]])
)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,207 x 15]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,904 x 15]>


Once we have this table, we can run the function to calculate cluster probabilities:

In [10]:
alldat <- clusterprobcalc(alldat)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,207 x 21]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,904 x 20]>


Checking that the probabilities were calculated for each sex:

In [11]:
head(alldat$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,tg,ldl,fg,smoking,probBC,probDHT,probDAL,probDLT,probDIS,probDHG
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,0.591,2.252,4.395,0,0.1008059,0.8982179,8.145289e-08,0.0006249365,0.0002629902,8.816286e-05
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,2.037,3.686,5.214,0,0.3995536,0.5942413,0.002669602,0.0014612974,0.0009994929,0.001074671
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,1.988,4.551,4.266,0,0.9885268,9.507371e-06,5.39449e-05,0.0110183762,0.0002941394,9.725173e-05
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,2.829,3.491,5.876,0,0.939154,0.0002297826,0.05226406,0.000423494,0.0001454109,0.007783232
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,0.722,3.584,5.212,0,0.7901985,0.2027605,5.046971e-07,0.001904094,0.0015270843,0.003609283
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,0.395,3.072,4.649,0,0.9827497,0.002825642,2.925853e-07,0.0033190879,0.0093330496,0.001772251


---

## Centered log-ratio transformation

Using cluster allocation probabilities directly as predictors in a regression is problematic, because the probabilities of each individual add up to 1 and are therefore perfectly collinear. For this reason we will use the centered log-ratio (CLR) transformation, a technique commonly used when dealing with compositional data.

The CLR transformation of the probability that an individual $i$ belongs to a cluster $c$ is the following:

$$
\text{CLR}(cluster_{c})_{i} = \log\left(\frac{P(cluster_{c})_{i}}{\exp\left(\frac{1}{K}\sum_{k=1}^{K}\log(P(cluster_{k})_{i})\right)}\right)
$$

The denominator in this formula is the geometric mean of the cluster allocation probabilities of individual $i$.
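
As a worked example of this formula, the sketch below applies it by hand to the six cluster probabilities of the first female participant shown earlier. It is not the `clrcalc` function itself, only an illustration of the calculation; the object names `p`, `gmean` and `clr` are ours.

```r
# Cluster probabilities of the first female participant shown above
p <- c(BC = 0.1008059, DHT = 0.8982179, DAL = 8.145289e-08,
       DLT = 0.0006249365, DIS = 0.0002629902, DHG = 8.816286e-05)

# Geometric mean of the probabilities (the denominator of the formula)
gmean <- exp(mean(log(p)))

# Centered log-ratio of each cluster probability
clr <- log(p / gmean)

round(clr, 3)

# By construction, the CLR values of one individual sum to zero
sum(clr)
```

For this participant the geometric mean is approximately 0.00069 and the CLR of the DHT cluster is approximately 7.17. Because the CLR values of an individual always sum to zero, one of them is redundant, which is why one cluster is left out of each regression.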

We chose this transformation not only because it enables us to use cluster allocations in regression analysis, but also because it makes the regression coefficients easier to interpret. In each regression we leave out the CLR of the reference cluster, which in our case is the cluster where biomarkers change concordantly with BMI. We have labeled this cluster "BC". Removing this CLR and using only the CLRs of the remaining clusters means that each regression coefficient can be interpreted as the expected change in the dependent variable when $P(cluster_{c})$ increases specifically at the expense of $P(cluster_{BC})$ <cite id="abe09"><a href="#zotero|10831815/UYLBIXH3">(1)</a></cite>. In other words, it captures the effect of deviating from the concordant cluster towards a specific discordant phenotype.

To calculate the transformation:

In [12]:
alldat <- clrcalc(alldat)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,207 x 27]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,904 x 25]>


Here is what the data look like now:

In [13]:
head(alldat$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,probDAL,probDLT,probDIS,probDHG,clrDHT,clrDAL,clrDLT,clrDIS,clrDHG,Gmean
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,8.145289e-08,0.0006249365,0.0002629902,8.816286e-05,7.1731109,-9.042788,-0.09740705,-0.9629401,-2.05587132,0.0006888731
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,0.002669602,0.0014612974,0.0009994929,0.001074671,4.085553,-1.319803,-1.92240781,-2.3022397,-2.22971806,0.0099914774
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,5.39449e-05,0.0110183762,0.0002941394,9.725173e-05,-4.3500455,-2.61415,2.70520675,-0.9180593,-2.02481018,0.0007366501
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,0.05226406,0.000423494,0.0001454109,0.007783232,-2.9031599,2.523771,-2.29175433,-3.3607304,0.61943315,0.0041893203
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,5.046971e-07,0.001904094,0.0015270843,0.003609283,4.1880868,-8.715491,-0.47993257,-0.7005787,0.15957011,0.0030769501
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,2.925853e-07,0.0033190879,0.0093330496,0.001772251,0.4057625,-8.769727,0.56671701,1.6005888,-0.06072239,0.001883201


---

## Descriptive statistics

At this point we will recheck some of the characteristics of the clusters as we did in our previous script, weighting calculations by cluster probabilities.

The distribution of biomarkers per cluster:

In [48]:
markerdistribdf <- markerdistribfx(alldat)

In [49]:
head(markerdistribdf)

sex,Variable,Cluster,Type,N,Summary1,Summary2
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>
Female,whr,BC,Numeric,58673.335,0.82 (0.07),0.81 (0.7 - 0.77 - 0.86 - 0.96)
Female,whr,DHT,Numeric,7482.132,0.79 (0.05),0.78 (0.69 - 0.75 - 0.82 - 0.9)
Female,whr,DAL,Numeric,3950.531,0.87 (0.06),0.86 (0.76 - 0.83 - 0.91 - 0.99)
Female,whr,DLT,Numeric,2835.848,0.84 (0.07),0.84 (0.71 - 0.79 - 0.89 - 0.98)
Female,whr,DIS,Numeric,2746.421,0.84 (0.07),0.83 (0.71 - 0.79 - 0.89 - 0.98)
Female,whr,DHG,Numeric,1462.734,0.85 (0.08),0.85 (0.71 - 0.79 - 0.91 - 1.02)


The effect of BMI on biomarkers specifically within each cluster, adjusted for age and smoking:

In [50]:
bmieffmarkerdf <- bmieffmarkerfx(alldat)

In [51]:
head(bmieffmarkerdf)

sex,Variable,Cluster,term,estimate,se
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,whr,BC,(Intercept),0.57683302,0.001918737
Female,whr,BC,age,0.001288071,2.768758e-05
Female,whr,BC,smoking,0.02011062,0.0007708985
Female,whr,BC,bmi,0.006069326,4.315308e-05
Female,whr,DHT,(Intercept),0.569734826,0.001569475
Female,whr,DHT,age,0.001087087,2.09694e-05
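
To illustrate what a cluster-specific BMI effect means, here is a minimal sketch on simulated data. We assume that "within each cluster" is implemented as a linear regression weighted by the cluster probability; this is our reading of the weighting described above, not a description of the internals of `bmieffmarkerfx`. All object names and values are made up.

```r
# Simulated data: one biomarker, the covariates, and one cluster probability
set.seed(3)
n <- 500
bmi_example <- data.frame(
    age = runif(n, 40, 70),
    smoking = rbinom(n, 1, 0.1),
    bmi = rnorm(n, mean = 27, sd = 4),
    probBC = runif(n)
)
bmi_example$whr <- 0.55 + 0.001 * bmi_example$age +
    0.02 * bmi_example$smoking + 0.006 * bmi_example$bmi +
    rnorm(n, sd = 0.05)

# Linear model of the biomarker on BMI, adjusted for age and smoking,
# weighting each individual by the probability of belonging to the cluster
fit_bc <- lm(whr ~ age + smoking + bmi, data = bmi_example, weights = probBC)

# Estimates and standard errors, one row per term
coef(summary(fit_bc))[, c("Estimate", "Std. Error")]
```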


---

## Prevalent diseases and medication

To add covariate data to the `alldat` table we will do the following:

In [41]:
alldat <- mutate(
    alldat,
    data = map(data, inner_join, covar_dat, by = "eid")
)
print(alldat)

# A tibble: 2 x 4
  sex    residmod          clusmod          data
  <chr>  <list>            <list>           <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble [77,151 x 41]>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble [67,848 x 39]>


Checking again if the columns were added as expected:

In [42]:
head(alldat$data[[1]])

eid,age,sex,bmi,whr,sbp,dbp,alt,scr,crp,⋯,CKD,LiverFailure,RA,T2D,T1D,T2Dage,Insulin,AntiDM,AntiHT,LipidLower
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1000117,47,Female,23.8408,0.7254902,147.5,84.0,14.07,61.0,0.24,⋯,0,0,0,0,0,0,0,0,0,0
1000132,43,Female,35.6559,0.8403361,137.0,100.5,18.89,60.5,4.31,⋯,0,0,0,0,0,0,0,0,1,0
1000176,69,Female,38.1271,0.8897638,137.5,93.5,36.39,68.9,3.69,⋯,0,0,0,0,0,0,0,0,1,0
1000223,63,Female,25.4603,0.7789474,163.0,94.0,6.1,67.1,1.29,⋯,0,0,0,0,0,0,0,0,1,1
1000282,48,Female,25.4297,0.7708333,135.5,89.0,9.63,46.2,0.16,⋯,0,0,0,0,0,0,0,0,0,0
1000367,42,Female,19.328,0.6777778,107.0,72.5,9.34,57.1,0.69,⋯,0,0,0,0,0,0,0,0,0,0


We will first count the number of individuals with disease in each cluster. Here we will also count the number of individuals taking each class of medications in each cluster.

In [53]:
countcovarsdf <- countcovarsfx(alldat)

In [54]:
head(countcovarsdf)

sex,Cluster,Covariate,Nclus,NclusDX
<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,Overall,HT,77151,18898
Female,Overall,CHD,77151,2177
Female,Overall,Stroke,77151,1122
Female,Overall,PAD,77151,194
Female,Overall,CKD,77151,83
Female,Overall,LiverFailure,77151,80
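
Within clusters, counts are weighted by cluster probabilities, so they are generally not whole numbers. The toy example below shows the idea with four made-up individuals and two clusters. It is a sketch of probability-weighted counting, not the code inside `countcovarsfx`.

```r
# Four hypothetical individuals: hypertension status and cluster probabilities
count_example <- data.frame(
    HT = c(1, 0, 1, 1),
    probBC = c(0.9, 0.2, 0.6, 0.1),
    probDHT = c(0.1, 0.8, 0.4, 0.9)
)
prob_cols <- c("probBC", "probDHT")

# Weighted cluster size: sum of the probabilities of belonging to each cluster
Nclus <- colSums(count_example[prob_cols])

# Weighted number of cases: probabilities summed over individuals with disease
NclusDX <- colSums(count_example[prob_cols] * count_example$HT)

# Weighted prevalence per cluster
NclusDX / Nclus
```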


We will use this table to calculate prevalences and compare prevalences across clusters.

We are also interested in the following proportions in each cluster:

- Proportion of individuals with hypertension receiving antihypertensives.
- Proportion of individuals with CHD receiving lipid-lowering medication.
- Proportion of individuals with T2D taking insulin.
- Proportion of individuals with T2D taking insulin or any other anti-diabetic medication.

In [59]:
countspectxdf <- countspectxfx(alldat)

In [60]:
head(countspectxdf)

sex,DX,MED,Cluster,NclusDXM
<chr>,<chr>,<chr>,<chr>,<dbl>
Female,HT,AntiHT,Overall,13903.0
Female,CHD,LipidLower,Overall,1590.0
Female,T2D,Insulin,Overall,216.0
Female,T2D,InsulinOrAntiDM,Overall,1247.0
Female,HT,AntiHT,BC,9868.886
Female,HT,AntiHT,DHT,1366.164


We will also formally test the association between cluster allocation and diseases using logistic regressions where the outcome is each disease and the predictors are the cluster allocations. We will have two models for each disease, one with only clusters, and a second one adjusting for medication.

In [90]:
assocdxdf <- assocdxfx(alldat)

In [92]:
head(assocdxdf)

sex,DX,model,term,estimate,se
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,HT,OnlyClusters,(Intercept),-1.8511322,0.035621448
Female,HT,OnlyClusters,clrDHT,-0.1137165,0.00551844
Female,HT,OnlyClusters,clrDAL,-0.1321959,0.005795165
Female,HT,OnlyClusters,clrDLT,-0.1733019,0.007581059
Female,HT,OnlyClusters,clrDIS,-0.1835971,0.007922777
Female,HT,OnlyClusters,clrDHG,-0.1091824,0.007320712
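
The two models per disease can be sketched as follows on simulated data. This is a simplified illustration with two CLR predictors and a single medication column; the exact covariates used by `assocdxfx` are not reproduced here, and all values are made up.

```r
# Simulated data: two CLR-transformed allocations, one medication, one disease
set.seed(7)
n <- 1000
assoc_example <- data.frame(
    clrDHT = rnorm(n),
    clrDAL = rnorm(n),
    AntiHT = rbinom(n, 1, 0.2)
)
# Hypertension simulated with higher odds at higher clrDHT
assoc_example$HT <- rbinom(n, 1, plogis(-1.5 + 0.5 * assoc_example$clrDHT))

# Model 1: cluster allocations only (the CLR of the reference cluster is left out)
mod_clusters <- glm(HT ~ clrDHT + clrDAL, data = assoc_example,
                    family = binomial)

# Model 2: additionally adjusting for medication
mod_adjusted <- update(mod_clusters, . ~ . + AntiHT)

# Estimates and standard errors on the log-odds scale
coef(summary(mod_clusters))[, c("Estimate", "Std. Error")]
```

With a logistic model, each estimate is a log odds ratio per unit increase in the corresponding CLR.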


---

## Adding survival data

As explained before, we want to be careful when adding survival data. We have therefore prepared a separate function for each outcome, which also makes sure that individuals who had already experienced the event under study are excluded:

In [409]:
alldat <- alldat %>%
    mutate(
        macedf = purrr::map(data, addsurvmacedat, SURVDATA = survmacedat),
        dmdf = purrr::map(data, addsurvdmdat, SURVDATA = survdmdat)
    )
print(alldat)

# A tibble: 2 x 6
  sex    residmod          clusmod          data     macedf   dmdf
  <chr>  <list>            <list>           <list>   <list>   <list>
1 Female <tibble [10 x 6]> <tibble [6 x 4]> <tibble> <tibble> <tibble>
2 Male   <tibble [10 x 6]> <tibble [5 x 4]> <tibble> <tibble> <tibble>


`macedf` and `dmdf` now contain the data necessary to run survival analysis.
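
The exclusion logic for MACE described earlier (dropping anyone with prevalent `CHD`, `Stroke` or `PAD`) can be sketched as below. `add_surv_sketch` is a hypothetical helper written for this illustration in base R; it is not the `addsurvmacedat` function, and the toy tables are made up.

```r
# Toy data: names follow the tables above, values are made up
data_example <- data.frame(
    eid = 1:5,
    CHD = c(0, 1, 0, 0, 0),
    Stroke = c(0, 0, 0, 1, 0),
    PAD = c(0, 0, 0, 0, 0)
)
surv_example <- data.frame(
    eid = c(1, 2, 3, 4),
    outcome_value = c(0, 1, 1, 0),
    outcome_timeyrs = c(10, 2.5, 7.1, 10)
)

# Hypothetical helper: exclude prevalent cases, then attach survival data
add_surv_sketch <- function(DATA, SURVDATA, prevalent_cols) {
    # Keep individuals without any of the prevalent conditions
    no_prevalent <- rowSums(DATA[prevalent_cols] == 1) == 0
    # Inner join: keep only individuals who also have survival data
    merge(DATA[no_prevalent, ], SURVDATA, by = "eid")
}

mace_example <- add_surv_sketch(data_example, surv_example,
                                prevalent_cols = c("CHD", "Stroke", "PAD"))
mace_example
```

In this toy example, individuals 2 and 4 are dropped because of prevalent disease and individual 5 because no survival data are available.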

---

## Overall and cluster-specific Kaplan-Meier estimates

The first thing to do is to obtain overall and cluster-specific cumulative incidence rates using the Kaplan-Meier method:

In [202]:
kmestdf <- kmestfx(alldat)

In [203]:
head(kmestdf)

sex,Outcome,Cluster,risk,lower,upper
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Female,MACE,Overall,0.04651655,0.04468811,0.04834149
Female,MACE,probBC,0.04399324,0.04194391,0.04603818
Female,MACE,probDHT,0.04512474,0.03942353,0.05079211
Female,MACE,probDAL,0.06075845,0.05173248,0.06969851
Female,MACE,probDLT,0.04782438,0.03785284,0.05769258
Female,MACE,probDIS,0.06237728,0.05175902,0.07287664
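
As an illustration of how such estimates can be obtained with the `survival` package, the sketch below fits an overall Kaplan-Meier curve and a curve weighted by one cluster probability on simulated data. We assume that cluster-specific estimates are probability-weighted; this is not the code of `kmestfx`, and `km_example` and `risk_at_10` are hypothetical names.

```r
library(survival)

# Simulated follow-up with administrative censoring at 10 years,
# plus a made-up cluster probability
set.seed(42)
n <- 500
km_example <- data.frame(
    time = pmin(rexp(n, rate = 0.01), 10),
    probBC = runif(n)
)
km_example$status <- as.integer(km_example$time < 10)

# Overall Kaplan-Meier fit
fit_all <- survfit(Surv(time, status) ~ 1, data = km_example)

# Cluster-specific fit: individuals weighted by their cluster probability
fit_bc <- survfit(Surv(time, status) ~ 1, data = km_example, weights = probBC)

# 10-year risk is 1 minus the survival probability at 10 years;
# the confidence limits swap when going from survival to risk
risk_at_10 <- function(fit) {
    s <- summary(fit, times = 10, extend = TRUE)
    c(risk = 1 - s$surv, lower = 1 - s$upper, upper = 1 - s$lower)
}

rbind(Overall = risk_at_10(fit_all), probBC = risk_at_10(fit_bc))
```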


---

## Cox models

### MACE

To quantify the association of clusters with MACE, as well as their potential contribution to prediction, we will compare two models. The reference model will include all predictors that are part of SCORE2, the risk stratification tool for CVD recommended by the European Society of Cardiology <cite id="pzdxs"><a href="#zotero|10831815/ZY2CL5NC">(2)</a></cite>. We will use a version of this score that has been validated in diabetic populations and includes some additional clinically useful predictors <cite id="itp38"><a href="#zotero|10831815/FPUFQKFI">(3)</a></cite>. Additionally, for the sake of completeness, the reference model will also include some important pre-existing conditions and pharmacological treatments, such as hypertension and antihypertensives, as well as any predictor in our initial input table that is not part of SCORE2. We will then fit a second model that also includes the CLR-transformed cluster probabilities and compare the ability of the two models to predict MACE.

#### Fitting models

In [289]:
coxmodmacedf <- coxmodelsmace(alldat)

In [290]:
print(coxmodmacedf)

# A tibble: 2 x 6
  sex    macedf                 score2   mod_null   mod_score2 mod_score2clr
  <chr>  <list>                 <list>   <list>     <list>     <list>
1 Female <tibble [73,378 x 43]> <tibble> <cxph.nll> <coxph>    <coxph>
2 Male   <tibble [60,348 x 41]> <tibble> <cxph.nll> <coxph>    <coxph>


Here `score2` contains the predictors used in SCORE2-Diabetes, plus other predictors that we have in our input table and the CLR-transformed cluster allocations. The two models are contained in the last two columns. `mod_null` contains the null model, which we will use to calculate our metrics.

#### Coefficient estimates

In [291]:
macesurvcoefs <- macesurvcoefx(coxmodmacedf)

In [292]:
head(macesurvcoefs)

sex,model,term,estimate,se
<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,score2,age,0.07170994,0.098415561
Female,score2,smoking,1.84493372,0.48256066
Female,score2,sbp,0.03007609,0.009876803
Female,score2,T2D,0.45975276,1.020463821
Female,score2,tchol,0.48281573,0.179377494
Female,score2,hdl,-1.89054317,0.542971021


#### Comparison of predictive ability

We can now assess the predictive ability of each model and compare them. As the models are nested, we will use the gold-standard method, the likelihood ratio test. Given the wide use of the c-statistic, we will also use this metric. However, comparing two c-statistics is not as powerful as the likelihood ratio test.

In [293]:
compmodmacedf <- comparemodsmace(coxmodmacedf)
compmodmacedf

sex,LL0,LLmod_score2,LLmod_score2clr,NVmod_score2,NVmod_score2clr,LRTstat,LRTdf,LRTp,AdeqInd,Cmod_score2,Cmodse_score2,Cmod_score2clr,Cmodse_score2clr,Cdiff,Cdiffse,Cdiffp
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,-32431.04,-31447.77,-31435.04,30,35,25.44914,5,0.0001140898,0.9872243,0.7264359,1.992483e-05,0.7275326,1.992483e-05,-0.001096737,0.0005008944,0.02855637
Male,-53605.73,-52507.15,-52464.38,30,34,85.52808,4,1.172008e-17,0.9625321,0.6854263,1.244977e-05,0.6889464,1.244977e-05,-0.003520119,0.0007853716,7.390979e-06


Some details of these columns:

- `LRTp` is the p-value of the likelihood ratio test comparing models with or without cluster allocations.
- `AdeqInd` is the adequacy index, which compares the likelihood ratios of the two models. 1 minus this value represents the fraction of information added by cluster allocation.
- `Cdiffp` is the p-value of the difference between the C-statistics of the two models.

While the p-values of both the likelihood ratio tests and the differences between C-statistics show evidence of added value, the adequacy index quantifies how much additional information is obtained when cluster allocations are added on top of the common biomarkers contained in SCORE2.
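
To make these quantities concrete, the sketch below recomputes the likelihood ratio test and the adequacy index from the log-likelihoods reported in the table for women. It only uses the rounded values printed above, so the results differ slightly from the table in the last digits. The variable names are ours.

```r
# Log-likelihoods and number of parameters reported above for women
LL0 <- -32431.04           # null model
LL_score2 <- -31447.77     # reference model
LL_score2clr <- -31435.04  # reference model plus CLR cluster allocations
NV_score2 <- 30
NV_score2clr <- 35

# Likelihood ratio test between the two nested models
LRTstat <- 2 * (LL_score2clr - LL_score2)
LRTdf <- NV_score2clr - NV_score2
LRTp <- pchisq(LRTstat, df = LRTdf, lower.tail = FALSE)

# Adequacy index: likelihood ratio of the reference model (versus the null)
# divided by the likelihood ratio of the extended model (versus the null)
LR_score2 <- 2 * (LL_score2 - LL0)
LR_score2clr <- 2 * (LL_score2clr - LL0)
AdeqInd <- LR_score2 / LR_score2clr

c(LRTstat = LRTstat, LRTdf = LRTdf, LRTp = LRTp, AdeqInd = AdeqInd)
```

An adequacy index of about 0.987 means that roughly 1.3% of the information in the extended model comes from the cluster allocations.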

#### Adequacy index by cluster

Next we quantify how much value the new model adds within each cluster. We do this by recalculating the log-likelihood of each model, this time weighting individuals by their cluster probabilities.

In [453]:
adeqindbyclusmace <- AdeqIndClusMACEFx(coxmodmacedf)

In [454]:
adeqindbyclusmace

sex,Cluster,AdeqInd
<chr>,<chr>,<dbl>
Female,probBC,0.9832994
Female,probDHT,0.9783044
Female,probDAL,1.0
Female,probDLT,1.0
Female,probDIS,0.9940754
Female,probDHG,1.0
Male,probBC,0.9639555
Male,probDAL,0.8891315
Male,probDLT,0.9998741
Male,probDIS,1.0


#### Adequacy index by SCORE2 probabilities

We will also assess the adequacy index in individuals over certain thresholds of 10-year probability of MACE calculated by SCORE2. This gives an idea of the utility of adding cluster allocation information across the scale of SCORE2.

In [455]:
adeqindbypremace <- AdeqIndByPreMACEFx(coxmodmacedf)

In [456]:
adeqindbypremace

sex,threshold,AdeqInd
<chr>,<chr>,<dbl>
Female,0.0,0.9872243
Female,0.05,0.9351968
Female,0.1,0.7424549
Female,0.15,0.5919734
Female,0.2,0.9629261
Female,0.25,1.0
Male,0.0,0.9625321
Male,0.05,0.9069405
Male,0.1,0.7664398
Male,0.15,0.6523507


#### Decision curve analysis

The last step in assessing the clinical utility of cluster allocations is to perform a decision curve analysis. First we will assess the overall net benefit of both models:

In [328]:
dcamace <- DCurvMACEFx(coxmodmacedf)
head(dcamace)

sex,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,score2_y,73378,0.0,0.04662649,0.04662649,0.9533735,0.04662649,
Female,score2_y,73378,0.01,0.04662649,0.045895,0.8266008,0.03754549,0.05435516
Female,score2_y,73378,0.02,0.04662649,0.04369664,0.6704002,0.03001501,0.13941108
Female,score2_y,73378,0.03,0.04662649,0.04078583,0.5411188,0.02405019,0.22340673
Female,score2_y,73378,0.04,0.04662649,0.03677727,0.4268085,0.01899358,0.29018374
Female,score2_y,73378,0.05,0.04662649,0.03215838,0.3282494,0.0148821,0.35023019
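
The net benefit columns follow from the true and false positive rates through the standard decision curve formulas. As a check, the sketch below recomputes them by hand for one row of the table above (women, `score2_y`, threshold 0.01). It is only an illustration of the arithmetic; the variable names are ours.

```r
# Values reported above for women, `score2_y`, threshold 0.01
threshold <- 0.01
pos_rate <- 0.04662649
tp_rate <- 0.045895
fp_rate <- 0.8266008

# Odds of the threshold: the weight given to a false positive
w <- threshold / (1 - threshold)

# Net benefit of using the model at this threshold
net_benefit <- tp_rate - fp_rate * w

# Net benefit of intervening on everyone
net_benefit_all <- pos_rate - (1 - pos_rate) * w

# Net interventions avoided, relative to intervening on everyone
net_intervention_avoided <- (net_benefit - net_benefit_all) / w

c(net_benefit = net_benefit,
  net_intervention_avoided = net_intervention_avoided)
```

Both values agree with the corresponding row of the table, up to rounding of the printed rates.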


We will also calculate this by cluster:

In [329]:
dcaclusmace <- DCurvMACEbyClFx(coxmodmacedf)

In [330]:
head(dcaclusmace)

sex,Cluster,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,probBC,score2_y,55802.39,0.0,0.04412461,0.04412461,0.9558754,0.04412461,
Female,probBC,score2_y,55802.39,0.01,0.04412461,0.04335303,0.8127737,0.0351432,0.06671509
Female,probBC,score2_y,55802.39,0.02,0.04412461,0.04095127,0.6476546,0.02773383,0.15272697
Female,probBC,score2_y,55802.39,0.03,0.04412461,0.03812193,0.5140035,0.02222492,0.2477852
Female,probBC,score2_y,55802.39,0.04,0.04412461,0.03393248,0.397764,0.01735898,0.31350023
Female,probBC,score2_y,55802.39,0.05,0.04412461,0.02939596,0.3005098,0.01357965,0.37552111


### Diabetes

For diabetes progression, the reference model will include fasting glucose, in addition to all the components of the metabolic syndrome that we already have included in our input table. And as before, the second model will include cluster probabilities.

#### Fitting models

In [425]:
coxmoddmdf <- coxmodelsdm(alldat)

In [426]:
print(coxmoddmdf)

# A tibble: 2 x 6
  sex    dmdf                   baseclr  mod_null   mod_base mod_baseclr
  <chr>  <list>                 <list>   <list>     <list>   <list>
1 Female <tibble [34,581 x 43]> <tibble> <cxph.nll> <coxph>  <coxph>
2 Male   <tibble [29,006 x 41]> <tibble> <cxph.nll> <coxph>  <coxph>


#### Coefficient estimates

In [427]:
dmsurvcoefs <- dmsurvcoefx(coxmoddmdf)

In [428]:
head(dmsurvcoefs)

sex,model,term,estimate,se
<chr>,<chr>,<chr>,<dbl>,<dbl>
Female,base,whr,4.958514083,0.460981153
Female,base,sbp,0.006630098,0.002118119
Female,base,dbp,-0.003712713,0.004070644
Female,base,alt,0.018543889,0.002148772
Female,base,scr,-0.00211532,0.002509501
Female,base,crp,0.038375581,0.007042164


#### Comparison of predictive ability

In [429]:
compmoddmdf <- comparemodsdm(coxmoddmdf)
compmoddmdf

sex,LL0,LLmod_base,LLmod_baseclr,NVmod_base,NVmod_baseclr,LRTstat,LRTdf,LRTp,AdeqInd,Cmod_base,Cmodse_base,Cmod_baseclr,Cmodse_baseclr,Cdiff,Cdiffse,Cdiffp
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,-12041.25,-10901.81,-10890.98,22,27,21.67037,5,0.0006047794,0.9905803,0.8543404,2.800147e-05,0.8559203,2.800147e-05,-0.001579897,0.0005280558,0.0027723552
Male,-16105.26,-14836.24,-14829.18,22,26,14.11503,4,0.0069368439,0.9944694,0.8210544,2.581034e-05,0.8229324,2.581034e-05,-0.001877981,0.0004913838,0.0001324694


#### Adequacy index by cluster

In [430]:
adeqindbyclusdm <- AdeqIndClusDMFx(coxmoddmdf)

In [431]:
adeqindbyclusdm

sex,Cluster,AdeqInd
<chr>,<chr>,<dbl>
Female,probBC,0.9903213
Female,probDHT,0.9981356
Female,probDAL,0.9851899
Female,probDLT,0.9867936
Female,probDIS,0.9899095
Female,probDHG,0.9831184
Male,probBC,0.9882301
Male,probDAL,0.9873052
Male,probDLT,0.9975297
Male,probDIS,1.0


#### Adequacy index by probabilities of base model

In [432]:
adeqindbypredm <- AdeqIndByPreDMFx(coxmoddmdf)

In [433]:
adeqindbypredm

sex,threshold,AdeqInd
<chr>,<chr>,<dbl>
Female,0.0,0.9905803
Female,0.05,0.9763545
Female,0.1,0.9561414
Female,0.15,0.928284
Female,0.2,0.8979685
Female,0.25,0.909008
Male,0.0,0.9944694
Male,0.05,0.99843
Male,0.1,1.0
Male,0.15,1.0


#### Decision curve analysis

The last step in assessing the clinical utility of cluster allocations is to perform a decision curve analysis. First we will assess the overall net benefit of both models:

In [435]:
dcadm <- DCurvDMFx(coxmoddmdf)
head(dcadm)

sex,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,base_y,34581,0.0,0.03580024,0.03580024,0.9641998,0.03580024,
Female,base_y,34581,0.01,0.03580024,0.03449615,0.5671059,0.02876781,0.2679891
Female,base_y,34581,0.02,0.03580024,0.03209059,0.3702402,0.02453467,0.4121866
Female,base_y,34581,0.03,0.03580024,0.0295217,0.2647729,0.02133285,0.4964205
Female,base_y,34581,0.04,0.03580024,0.02732197,0.2001151,0.01898384,0.5606061
Female,base_y,34581,0.05,0.03580024,0.02512055,0.1587666,0.01676442,0.602519
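
To make the columns of this output easier to interpret, the sketch below recomputes `net_benefit` and `net_intervention_avoided` for one row (Female, threshold 0.01) using the standard decision curve formulas. It assumes that `DCurvDMFx` follows these formulas; the values in the row are consistent with that, but the function itself is not shown here. Variable names are illustrative.

```r
# Values copied from the Female, base_y, threshold = 0.01 row
threshold <- 0.01
pos_rate  <- 0.03580024   # observed event rate
tp_rate   <- 0.03449615   # true positives as a share of all individuals
fp_rate   <- 0.5671059    # false positives as a share of all individuals

# Odds of the threshold: the weight given to a false positive
odds <- threshold / (1 - threshold)

# Net benefit of acting on the model at this threshold
net_benefit <- tp_rate - fp_rate * odds

# Net benefit of the "treat all" strategy, used as the reference
net_benefit_all <- pos_rate - (1 - pos_rate) * odds

# Net reduction in interventions relative to treating everyone
net_intervention_avoided <- (net_benefit - net_benefit_all) / odds

print(c(net_benefit = net_benefit,
        net_intervention_avoided = net_intervention_avoided))
```

At a threshold of 0 the odds term is 0, so the division is undefined, which would explain the empty `net_intervention_avoided` entry in the first row.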


We will also calculate the net benefit within each cluster:

In [440]:
dcaclusdm <- DCurvDMbyClFx(coxmoddmdf)

In [441]:
head(dcaclusdm)

sex,Cluster,pred,n,threshold,pos_rate,tp_rate,fp_rate,net_benefit,net_intervention_avoided
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,probBC,base_y,26382.99,0.0,0.03050377,0.03050377,0.9694962,0.03050377,
Female,probBC,base_y,26382.99,0.01,0.03050377,0.02906631,0.5393289,0.02361854,0.2878582
Female,probBC,base_y,26382.99,0.02,0.03050377,0.02649736,0.3372873,0.01961395,0.4358948
Female,probBC,base_y,26382.99,0.03,0.03050377,0.02402328,0.2315843,0.01686088,0.5283759
Female,probBC,base_y,26382.99,0.04,0.03050377,0.02181344,0.1702712,0.01471881,0.5906572
Female,probBC,base_y,26382.99,0.05,0.03050377,0.01972607,0.1321085,0.01277299,0.6326113


---

## Saving data

As before, please save an R data file that contains only summary statistics and no individual-level data, as follows:

In [458]:
result_file2 <- list(
    MarkerDistrib = markerdistribdf,
    BMIeffOnMarker = bmieffmarkerdf,
    CountCovars = countcovarsdf,
    CountSpecDXMeds = countspectxdf,
    CrossSectAssoc = assocdxdf,
    KaplanMeierDF = kmestdf,
    MACESurvCoefs = macesurvcoefs,
    ComparisonMACE = compmodmacedf,
    AdeqIndByClusMACE = adeqindbyclusmace,
    AdeqIndByPreMACE = adeqindbypremace,
    DCAResMACE = dcamace,
    DCAREsClusMACE = dcaclusmace,
    DMSurvCoefs = dmsurvcoefs,
    ComparisonDM = compmoddmdf,
    AdeqIndByClusDM = adeqindbyclusdm,
    AdeqIndByPreDM = adeqindbypredm,
    DCAResDM = dcadm,
    DCAREsClusDM = dcaclusdm
)

In [459]:
save(
    result_file2,
    file = "../data/ukb/result_file2.RData"
)
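
Before uploading, it may be worth reloading the saved file and checking that no table carries the individual identifier column `eid`. The sketch below is one possible check, not part of the prepared functions. It runs on a small hypothetical stand-in list (`toy_results`) written to a temporary file; in practice you would point `load()` at the path used above.

```r
# Hypothetical stand-in for result_file2, so the sketch runs on its own
toy_results <- list(
    ComparisonDM = data.frame(sex = c("Female", "Male"), LRTdf = c(5, 4)),
    AdeqIndByClusDM = data.frame(sex = "Female", Cluster = "probBC",
                                 AdeqInd = 0.99)
)
tmp_path <- tempfile(fileext = ".RData")
save(toy_results, file = tmp_path)

# Load into a separate environment so nothing in the session is overwritten
check_env <- new.env()
loaded_names <- load(tmp_path, envir = check_env)
reloaded <- check_env[[loaded_names]]

# Flag any element that contains an individual identifier column
has_eid <- vapply(reloaded, function(x) "eid" %in% names(x), logical(1))
print(has_eid)

# Stop if any individual-level identifier slipped into the results
stopifnot(!any(has_eid))
```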

Upload this file to your cohort's folder in Teams:

> CrossWP > Analyst working groups > WG1 > UMAP_project > *cohort_name* > data

---

## References

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|10831815/UYLBIXH3"></i>
    <div class="csl-left-margin">1. </div><div class="csl-right-inline">Coenders G, Pawlowsky-Glahn V. On interpretations of tests and effect sizes in regression models with a compositional predictor. SORT-Statistics and Operations Research Transactions [Internet]. 2020 Jun 26 [cited 2023 Jul 28];44(1):201–20. Available from: https://raco.cat/index.php/SORT/article/view/371189</div>
  </div>
  <div class="csl-entry"><i id="zotero|10831815/ZY2CL5NC"></i>
    <div class="csl-left-margin">2. </div><div class="csl-right-inline">SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. European Heart Journal [Internet]. 2021 Jul 1 [cited 2023 Jul 20];42(25):2439–54. Available from: https://doi.org/10.1093/eurheartj/ehab309</div>
  </div>
  <div class="csl-entry"><i id="zotero|10831815/FPUFQKFI"></i>
    <div class="csl-left-margin">3. </div><div class="csl-right-inline">SCORE2-Diabetes Working Group and the ESC Cardiovascular Risk Collaboration. SCORE2-Diabetes: 10-year cardiovascular risk estimation in type 2 diabetes in Europe. European Heart Journal [Internet]. 2023 May 29 [cited 2023 Jul 13];ehad260. Available from: https://doi.org/10.1093/eurheartj/ehad260</div>
  </div>
</div>
<!-- BIBLIOGRAPHY END -->