title | author | date | output | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Regression Analysis in Health and Medicine Using R |
Assoc Prof Kamarul Imran Musa |
2020-09-04 |
|
\newpage
My name is Kamarul Imran, but just call me KIM.
I hold the position of Associate Professor in Epidemiology and Statistics at the School of Medical Sciences, Universiti Sains Malaysia.
My academic profile is available here http://www.medic.usm.my/jpm/index.php/en/academic-information/587-prof-madya-dr-kamarul-imran-musa
Currently also a teaching associate (sessional lecturer) at Monash Universiti Malaysia. I teach business analytics course for a Master programme.
My research profile at Google Scholar is available here https://scholar.google.com/citations?user=XVf2_QcAAAAJ&hl=en&authuser=1.
My SCOPUS author ID is 57194536466
My research interests include medical epidemiology, statistical modelling and machine learning.
Recently, I was awarded with the FRGS grant (RM125,000) to understand the roles of mammography images and clinical conditions on the performance of machine learning and statistical models to predict breast cancer.
Email: drkamarul@usm.my
Twitter: @kamarul_imran
Personal page: https://myanalytics.com.my/
We conduct pur own regular R courses. But we also receive invitations to conduct training on R and on data and statistical analysis nationwide.
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables.
It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').
Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed.
Source: https://en.wikipedia.org/wiki/Regression_analysis
Analysis of data with the outcome variable as a continuous variable and the expected outcome follows Gaussian distribution (as a function of covariates).
Analysis of data with the outcome variable is a binary categorical variable and the expected outcome follows Bernoulli distribution (as a function of covariates).
Analysis of data with the outcome variable is a count variable and the expected outcome follows the Poisson distribution (as a function of covariates).
Analysis of data with the outcome variable is time-to-event variable, a Cox semi-parametric regression is the most regression.
We will be using RStudio Cloud.
I have prepared the environment for our workshop in RStudio Cloud.
Click this link http://bit.ly/Reg_in_med. (The link is dead already!). So use your own RStudio Space.
Please:
- Click on the link
- Register
- Log in
On its webpage, it is stated the THE MISSION as
We created RStudio Cloud to make it easy for professionals, hobbyists, trainers, teachers and students to do, share, teach and learn data science using R.
Code-along
Type and try to understand!
File -> New File -> R Script
You will get this
Notes :
- YOU WILL TYPE YOUR CODES INSIDE THIS BLANK R SCRIPT
- SAVE THE R SCRIPT AFTER YOU HAVE DONE
- REOPEN THE R SCRIPT, IF YOU NEED TO CONTINUE YOUR CODING
Steps:
- Types the codes below inside your R script
library(tidyverse)
library(here)
library(haven)
library(readxl)
library(kableExtra)
library(broom)
library(cdata)
library(corrplot)
library(survival)
-
Then to run each line of the code, either you
a. click the Run button, or b. click CNTRL + ENTER keys on your keyboards
Note:
If you received this error message Error in library(kableExtra) : there is no package called ‘kableExtra’
, then you need to install the package. Go to Package
, then click Install
, then type the package you want to install. Make sure you click install dependencies
. Then click install
.
Wait for a few seconds for RStudio to dowload and install the package.
In multiple regression model, we assume that a linear relationship exists between some variable
The independent variables are sometimes referred to as explanatory variables, because of their use in explaining the variation in
Generally, the equation of multiple linear regression model is:
This is the data that our research team collected among the Malaysian general population. It is part of a larger dataset.
We would like to understand the problem of Metabolic Syndrome among Malaysians.
There were more than 4000 participants.
We use readxl::read_xlsx() to read MS Excel datasets.
And then use dplyr::glimpse() to briefly view the data.
met <- read_xlsx(here('datasets', 'metab_syndrome.xlsx')) %>%
janitor::clean_names()
glimpse(met)
## Rows: 4,341
## Columns: 17
## $ id <chr> "A11", "B11", "C12", "D11", "E11", "F11", "G12", "H12", "I1...
## $ age <chr> "46", "47", "48", "63", "39", "74", "43", "55", "22", "42",...
## $ dmdx <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ height <chr> "1.6", "1.68", "1.47", "1.5", "1.51", "1.43", "1.77", "1.63...
## $ weight <dbl> 70.0, 52.0, 88.6, 81.0, 63.5, 50.0, 90.3, 69.0, 74.0, 53.2,...
## $ waist <dbl> 89.0, 89.0, 84.0, 125.0, 87.0, 85.0, 101.0, 94.0, 87.0, 72....
## $ neck <dbl> 38.0, 38.0, 32.0, 34.0, 40.0, 34.5, 39.0, 43.3, 34.0, 33.0,...
## $ hip <dbl> 97.0, 98.0, 94.0, 95.0, 105.0, 95.0, 112.0, 103.0, 106.0, 8...
## $ pulse <chr> "80", "83", "78", "94", "99", "96", "82", "89", "75", "78",...
## $ msbpr <dbl> 133.0, 163.0, 146.5, 206.5, 129.0, 190.5, 160.0, 138.0, 114...
## $ mdbpr <dbl> 83.5, 84.0, 93.5, 94.0, 70.0, 92.0, 101.0, 86.0, 63.5, 69.0...
## $ hba1c <chr> "5.3", "5.6", "5.7", "7.2", "5.4", "5.7", "5.09999999999999...
## $ fbs <chr> "5.82", "6.29", "8.2899999999999991", "8.39", "5.23", "6.45...
## $ mogtt2h <chr> "7.89", "6.05", "17.39", "#NULL!", "7.84", "14.88", "7.61",...
## $ totchol <chr> "5.27", "6.77", "5.87", "8.09", "5.55", "6.33", "7.48", "5....
## $ hdl <chr> "0.84", "1.45", "0.82", "1.79", "1.04", "1.62", "1.57", "1....
## $ ldl <chr> "3.45", "3.7", "3.96", "4.68", "4.33", "3.03", "4.59", "3.8...
You will see that there are a mix of variables
- character (correctly and wrongly assigned)
- double
That justify that MS Excel is not a good collection or data storage medium. You may want to use other alternatives like EpiData Entry or ODK.
Let us get the summary of the data. You can use summary() to provide you with a brief but insightful summary or descriptive statistics for your data.
You will notice that there is no summary statistics for variables of class character.
summary(met)
## id age dmdx height
## Length:4341 Length:4341 Min. :0.0000 Length:4341
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1083
## 3rd Qu.:0.0000
## Max. :1.0000
## NA's :1
## weight waist neck hip
## Min. : 30.00 Min. : 50.80 Min. :22.00 Min. : 61.00
## 1st Qu.: 53.80 1st Qu.: 77.00 1st Qu.:33.00 1st Qu.: 91.00
## Median : 62.00 Median : 86.00 Median :35.00 Median : 97.00
## Mean : 63.75 Mean : 86.32 Mean :35.38 Mean : 97.88
## 3rd Qu.: 71.95 3rd Qu.: 95.00 3rd Qu.:38.00 3rd Qu.:104.00
## Max. :187.80 Max. :154.50 Max. :50.00 Max. :160.00
## NA's :2 NA's :2 NA's :5 NA's :2
## pulse msbpr mdbpr hba1c
## Length:4341 Min. : 68.5 Min. : 41.50 Length:4341
## Class :character 1st Qu.:117.0 1st Qu.: 70.00 Class :character
## Mode :character Median :130.0 Median : 77.50 Mode :character
## Mean :133.5 Mean : 78.46
## 3rd Qu.:146.5 3rd Qu.: 86.00
## Max. :237.0 Max. :128.50
##
## fbs mogtt2h totchol hdl
## Length:4341 Length:4341 Length:4341 Length:4341
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ldl
## Length:4341
## Class :character
## Mode :character
##
##
##
##
We will convert the character variables (wrongly classed) to the correct numeric class variables.
We will use dplyr::mutate_at()
met <- met %>%
mutate_at(vars(-id), ~as.numeric(.))
## Warning: Problem with `mutate()` input `age`.
## i NAs introduced by coercion
## i Input `age` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `height`.
## i NAs introduced by coercion
## i Input `height` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `pulse`.
## i NAs introduced by coercion
## i Input `pulse` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `hba1c`.
## i NAs introduced by coercion
## i Input `hba1c` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `fbs`.
## i NAs introduced by coercion
## i Input `fbs` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `mogtt2h`.
## i NAs introduced by coercion
## i Input `mogtt2h` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `totchol`.
## i NAs introduced by coercion
## i Input `totchol` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `hdl`.
## i NAs introduced by coercion
## i Input `hdl` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
## Warning: Problem with `mutate()` input `ldl`.
## i NAs introduced by coercion
## i Input `ldl` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
## Warning in ~as.numeric(.): NAs introduced by coercion
Let us see the summary statistics again
summary(met)
## id age dmdx height
## Length:4341 Min. :18.00 Min. :0.0000 Min. :1.270
## Class :character 1st Qu.:38.00 1st Qu.:0.0000 1st Qu.:1.510
## Mode :character Median :48.00 Median :0.0000 Median :1.560
## Mean :47.84 Mean :0.1083 Mean :1.568
## 3rd Qu.:58.00 3rd Qu.:0.0000 3rd Qu.:1.630
## Max. :89.00 Max. :1.0000 Max. :1.960
## NA's :1 NA's :1 NA's :1
## weight waist neck hip
## Min. : 30.00 Min. : 50.80 Min. :22.00 Min. : 61.00
## 1st Qu.: 53.80 1st Qu.: 77.00 1st Qu.:33.00 1st Qu.: 91.00
## Median : 62.00 Median : 86.00 Median :35.00 Median : 97.00
## Mean : 63.75 Mean : 86.32 Mean :35.38 Mean : 97.88
## 3rd Qu.: 71.95 3rd Qu.: 95.00 3rd Qu.:38.00 3rd Qu.:104.00
## Max. :187.80 Max. :154.50 Max. :50.00 Max. :160.00
## NA's :2 NA's :2 NA's :5 NA's :2
## pulse msbpr mdbpr hba1c
## Min. : 24.00 Min. : 68.5 Min. : 41.50 Min. : 0.200
## 1st Qu.: 70.00 1st Qu.:117.0 1st Qu.: 70.00 1st Qu.: 5.100
## Median : 78.00 Median :130.0 Median : 77.50 Median : 5.400
## Mean : 79.27 Mean :133.5 Mean : 78.46 Mean : 5.805
## 3rd Qu.: 86.00 3rd Qu.:146.5 3rd Qu.: 86.00 3rd Qu.: 5.800
## Max. :975.00 Max. :237.0 Max. :128.50 Max. :15.000
## NA's :8 NA's :70
## fbs mogtt2h totchol hdl
## Min. : 0.160 Min. : 0.160 Min. : 0.180 Min. :0.080
## 1st Qu.: 4.400 1st Qu.: 5.150 1st Qu.: 4.970 1st Qu.:1.110
## Median : 5.150 Median : 6.600 Median : 5.700 Median :1.320
## Mean : 5.622 Mean : 7.343 Mean : 5.792 Mean :1.346
## 3rd Qu.: 5.982 3rd Qu.: 8.410 3rd Qu.: 6.530 3rd Qu.:1.540
## Max. :34.270 Max. :37.370 Max. :23.140 Max. :4.430
## NA's :117 NA's :608 NA's :54 NA's :52
## ldl
## Min. : 0.14
## 1st Qu.: 2.79
## Median : 3.46
## Mean : 3.55
## 3rd Qu.: 4.25
## Max. :10.56
## NA's :53
Look at variables for outliers and NA for variable PULSE, MOGTT2H, TOTCHOL, FBS.
For example, let us explore variable Hba1c
met %>%
ggplot(aes(x = hba1c)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 70 rows containing non-finite values (stat_bin).
Let us do some more data wrangling using
- filter()
- mutate()
met <- met %>%
filter(hba1c > 2.5, hba1c < 25.0,
ldl > 0.5, hdl > 0.2,
totchol > 2.0, totchol < 15.0,
fbs > 2, fbs < 20,
pulse < 140) %>%
mutate(bmi = weight/(height^2),
overweight = if_else(bmi >=25.0,'overwt','not_overwt'),
overweight = factor(overweight),
dmdx = factor(dmdx, labels = c('NoDM', 'HaveDM')))
Let us check the summary stat again
summary(met)
## id age dmdx height
## Length:4078 Min. :18.0 NoDM :3631 Min. :1.270
## Class :character 1st Qu.:38.0 HaveDM: 447 1st Qu.:1.510
## Mode :character Median :48.0 Median :1.560
## Mean :47.9 Mean :1.569
## 3rd Qu.:58.0 3rd Qu.:1.630
## Max. :89.0 Max. :1.960
## NA's :1
## weight waist neck hip
## Min. : 30.00 Min. : 50.80 Min. :22.00 Min. : 61.00
## 1st Qu.: 53.80 1st Qu.: 77.00 1st Qu.:33.00 1st Qu.: 91.00
## Median : 62.23 Median : 86.00 Median :35.00 Median : 97.00
## Mean : 63.89 Mean : 86.41 Mean :35.36 Mean : 97.89
## 3rd Qu.: 72.00 3rd Qu.: 95.00 3rd Qu.:38.00 3rd Qu.:104.00
## Max. :187.80 Max. :154.50 Max. :50.00 Max. :160.00
## NA's :2 NA's :1 NA's :3 NA's :1
## pulse msbpr mdbpr hba1c
## Min. : 24.00 Min. : 68.5 Min. : 41.50 Min. : 3.800
## 1st Qu.: 69.00 1st Qu.:117.1 1st Qu.: 70.00 1st Qu.: 5.100
## Median : 78.00 Median :130.0 Median : 78.00 Median : 5.400
## Mean : 78.29 Mean :133.8 Mean : 78.57 Mean : 5.803
## 3rd Qu.: 86.00 3rd Qu.:147.0 3rd Qu.: 86.00 3rd Qu.: 5.800
## Max. :135.00 Max. :237.0 Max. :128.50 Max. :15.000
##
## fbs mogtt2h totchol hdl
## Min. : 2.030 Min. : 0.160 Min. : 2.13 Min. :0.280
## 1st Qu.: 4.442 1st Qu.: 5.200 1st Qu.: 4.97 1st Qu.:1.110
## Median : 5.160 Median : 6.630 Median : 5.70 Median :1.320
## Mean : 5.628 Mean : 7.368 Mean : 5.79 Mean :1.354
## 3rd Qu.: 6.000 3rd Qu.: 8.430 3rd Qu.: 6.52 3rd Qu.:1.550
## Max. :19.340 Max. :37.370 Max. :14.91 Max. :4.430
## NA's :477
## ldl bmi overweight
## Min. : 0.51 Min. : 9.241 not_overwt:1922
## 1st Qu.: 2.81 1st Qu.:22.232 overwt :2154
## Median : 3.46 Median :25.402 NA's : 2
## Mean : 3.56 Mean :25.938
## 3rd Qu.: 4.23 3rd Qu.:28.870
## Max. :10.56 Max. :57.040
## NA's :2
What is happening to Hba1c?
met %>%
ggplot(aes(x = hba1c)) +
geom_histogram() +
facet_wrap(~ dmdx)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We could do correlational analysis to give us idea for possible multicollinearity issues.
Multicollinerity is the situation where one pair or more than one pairs of variables are higly correlated with each other.
If we include co-linear variables in the model (as covariates), the regression parameters will be biased (wrong).
met_num <- met %>%
select_if(is.numeric)
The results of correlation matrix are:
cor.met <- cor(met_num, use="complete.obs", method="pearson")
head(round(cor.met,2))
## age height weight waist neck hip pulse msbpr mdbpr hba1c fbs
## age 1.00 -0.21 -0.08 0.12 0.04 0.00 -0.15 0.45 0.23 0.15 0.17
## height -0.21 1.00 0.41 0.15 0.40 0.11 -0.19 -0.05 0.01 -0.04 -0.05
## weight -0.08 0.41 1.00 0.80 0.67 0.80 -0.03 0.18 0.27 0.17 0.13
## waist 0.12 0.15 0.80 1.00 0.60 0.61 0.00 0.27 0.29 0.23 0.16
## neck 0.04 0.40 0.67 0.60 1.00 0.51 -0.04 0.21 0.25 0.14 0.15
## hip 0.00 0.11 0.80 0.61 0.51 1.00 0.04 0.16 0.23 0.15 0.13
## mogtt2h totchol hdl ldl bmi
## age 0.20 0.31 0.17 0.26 0.02
## height -0.15 -0.08 -0.19 -0.07 -0.09
## weight 0.15 0.05 -0.23 0.10 0.87
## waist 0.22 0.12 -0.21 0.13 0.79
## neck 0.13 0.09 -0.29 0.10 0.51
## hip 0.18 0.08 -0.13 0.14 0.82
This the correlogram to represent the correlation matrix:
corrplot(cor.met, method="circle")
When we assume that the expected values for the outcome variable given the covariates are normally distributed, then we can perform linear regression.
In R, this can be done using lm()
. This is the model where the expected values of HBA1C is model as a function of FBS (fasting blood sugar).
met_hba1c <- lm(hba1c ~ fbs, data = met)
summary(met_hba1c)
##
## Call:
## lm(formula = hba1c ~ fbs, data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0666 -0.4776 -0.1052 0.3091 8.5293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.502193 0.040972 85.48 <2e-16 ***
## fbs 0.408802 0.006702 60.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.021 on 4076 degrees of freedom
## Multiple R-squared: 0.4772, Adjusted R-squared: 0.4771
## F-statistic: 3720 on 1 and 4076 DF, p-value: < 2.2e-16
Run another linear regression model with these covariates:
met_hba1c_mv <- lm(hba1c ~ fbs + age + msbpr + mdbpr, data = met)
summary(met_hba1c_mv)
##
## Call:
## lm(formula = hba1c ~ fbs + age + msbpr + mdbpr, data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0630 -0.4718 -0.1051 0.3098 8.6575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.009202 0.111352 27.024 < 2e-16 ***
## fbs 0.400576 0.006822 58.721 < 2e-16 ***
## age 0.008054 0.001259 6.396 1.77e-10 ***
## msbpr -0.004954 0.001080 -4.588 4.61e-06 ***
## mdbpr 0.010390 0.001928 5.390 7.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 4072 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.485, Adjusted R-squared: 0.4845
## F-statistic: 958.8 on 4 and 4072 DF, p-value: < 2.2e-16
Should we add Diabetes Status (DMDX)?
met_hba1c_mv2 <- lm(hba1c ~ fbs + age + msbpr + mdbpr + bmi + hdl + ldl + dmdx,
data = met)
summary(met_hba1c_mv2)
##
## Call:
## lm(formula = hba1c ~ fbs + age + msbpr + mdbpr + bmi + hdl +
## ldl + dmdx, data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0202 -0.4007 -0.0588 0.2985 8.9164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.170719 0.130759 24.249 < 2e-16 ***
## fbs 0.318074 0.007102 44.785 < 2e-16 ***
## age 0.003454 0.001210 2.854 0.00434 **
## msbpr -0.004441 0.001003 -4.425 9.88e-06 ***
## mdbpr 0.007240 0.001827 3.964 7.51e-05 ***
## bmi 0.013409 0.003019 4.441 9.18e-06 ***
## hdl -0.039540 0.044232 -0.894 0.37141
## ldl 0.073575 0.014573 5.049 4.64e-07 ***
## dmdxHaveDM 1.330114 0.053865 24.693 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9412 on 4066 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5569, Adjusted R-squared: 0.556
## F-statistic: 638.8 on 8 and 4066 DF, p-value: < 2.2e-16
We will add an interaction term (DMDX with AGE) in the covariates
met_hba1c_ia <- lm(hba1c ~ fbs + age + msbpr + mdbpr + bmi + hdl + ldl + dmdx +
dmdx*age, data = met)
summary(met_hba1c_ia)
##
## Call:
## lm(formula = hba1c ~ fbs + age + msbpr + mdbpr + bmi + hdl +
## ldl + dmdx + dmdx * age, data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0362 -0.4004 -0.0557 0.3003 8.9217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.168289 0.130759 24.230 < 2e-16 ***
## fbs 0.317246 0.007129 44.501 < 2e-16 ***
## age 0.003838 0.001244 3.085 0.002049 **
## msbpr -0.004355 0.001005 -4.332 1.52e-05 ***
## mdbpr 0.007038 0.001833 3.840 0.000125 ***
## bmi 0.013302 0.003020 4.405 1.09e-05 ***
## hdl -0.041075 0.044243 -0.928 0.353253
## ldl 0.073038 0.014577 5.010 5.66e-07 ***
## dmdxHaveDM 1.639626 0.239192 6.855 8.21e-12 ***
## age:dmdxHaveDM -0.005475 0.004123 -1.328 0.184220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9411 on 4065 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.5571, Adjusted R-squared: 0.5561
## F-statistic: 568.1 on 9 and 4065 DF, p-value: < 2.2e-16
We can take advantage of broom package to produce better outputs
tidy(met_hba1c_mv2, conf.int = TRUE)
## # A tibble: 9 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.17 0.131 24.2 1.96e-121 2.91 3.43
## 2 fbs 0.318 0.00710 44.8 0. 0.304 0.332
## 3 age 0.00345 0.00121 2.85 4.34e- 3 0.00108 0.00583
## 4 msbpr -0.00444 0.00100 -4.43 9.88e- 6 -0.00641 -0.00247
## 5 mdbpr 0.00724 0.00183 3.96 7.51e- 5 0.00366 0.0108
## 6 bmi 0.0134 0.00302 4.44 9.18e- 6 0.00749 0.0193
## 7 hdl -0.0395 0.0442 -0.894 3.71e- 1 -0.126 0.0472
## 8 ldl 0.0736 0.0146 5.05 4.64e- 7 0.0450 0.102
## 9 dmdxHaveDM 1.33 0.0539 24.7 1.47e-125 1.22 1.44
To get the predicted values
pred_met <- augment(met_hba1c_mv2)
head(pred_met)
## # A tibble: 6 x 16
## .rownames hba1c fbs age msbpr mdbpr bmi hdl ldl dmdx .fitted
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 1 5.3 5.82 46 133 83.5 27.3 0.84 3.45 NoDM 5.78
## 2 2 5.6 6.29 47 163 84 18.4 1.45 3.7 NoDM 5.68
## 3 3 5.7 8.29 48 146. 93.5 41.0 0.82 3.96 NoDM 6.81
## 4 4 7.2 8.39 63 206. 94 36 1.79 4.68 Have~ 7.91
## 5 5 5.4 5.23 39 129 70 27.8 1.04 4.33 NoDM 5.55
## 6 6 5.7 6.45 74 190. 92 24.5 1.62 3.03 NoDM 5.78
## # ... with 5 more variables: .resid <dbl>, .std.resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>
Remember the LINE assumptions
pred_met %>%
ggplot(aes(x = .resid)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next,
pred_met %>%
ggplot(aes(x = .fitted, y = .std.resid)) +
geom_point()
There are problems?
- outliers? Influential data?
- heteroscesdasticity?
Perhaps, we should do further treatment of our data.
pred_met %>%
filter(between(.std.resid, -3, 3)) %>%
ggplot(aes(x = .fitted, y = .std.resid)) +
geom_point()
\newpage
The data comes from a surgeon (Dr Najmi). The data contains variables from patients with peptic ulcer disease.
It is in excel form.
PUP2 <- read_excel(here('datasets', 'PUP2.xlsx'))
Quickly examine types of variables and summary statistics
glimpse(PUP2)
## Rows: 121
## Columns: 23
## $ no <dbl> 105, 107, 94, 103, 109, 111, 79, 85, 88, 89, 10...
## $ age <dbl> 42, 66, 67, 19, 77, 39, 62, 71, 69, 97, 52, 21,...
## $ gender <chr> "male", "female", "male", "male", "male", "male...
## $ epigastric_pain <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes"...
## $ onset_more_24_hrs <chr> "no", "no", "no", "yes", "yes", "yes", "yes", "...
## $ diabetes <chr> "no", "no", "no", "no", "no", "no", "yes", "no"...
## $ NSAIDS <chr> "no", "no", "yes", "no", "no", "no", "no", "no"...
## $ previous_OGDS <chr> "no", "no", "no", "yes", "no", "no", "no", "no"...
## $ ASA <dbl> 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2,...
## $ systolic <dbl> 141, 197, 126, 90, 147, 115, 103, 159, 145, 105...
## $ diastolic <dbl> 98, 88, 73, 40, 82, 86, 55, 68, 75, 65, 74, 50,...
## $ tenderness <chr> "generalized", "generalized", "generalized", "l...
## $ guarding <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes",...
## $ hemoglobin <dbl> 18.0, 12.0, 12.0, 12.0, 11.0, 18.0, 8.1, 13.3, ...
## $ twc <dbl> 6.0, 6.0, 13.0, 20.0, 21.0, 4.0, 5.0, 12.0, 6.0...
## $ platelet <dbl> 415, 292, 201, 432, 324, 260, 461, 210, 293, 59...
## $ PULP <dbl> 2, 3, 3, 2, 7, 1, 2, 5, 3, 4, 2, 3, 4, 3, 5, 5,...
## $ admission_to_op_hrs <dbl> 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6,...
## $ perforation <dbl> 0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 3.0, 1.5, 0.5, 1....
## $ degree_perforation <chr> "small", "small", "small", "small", "small", "s...
## $ SSSI <chr> "no", "no", "no", "no", "no", "no", "no", "no",...
## $ sepsis <chr> "no", "no", "no", "no", "no", "no", "yes", "no"...
## $ outcome <chr> "alive", "alive", "alive", "alive", "alive", "a...
summary(PUP2)
## no age gender epigastric_pain
## Min. : 1 Min. :19.00 Length:121 Length:121
## 1st Qu.: 31 1st Qu.:49.00 Class :character Class :character
## Median : 61 Median :64.00 Mode :character Mode :character
## Mean : 61 Mean :60.43
## 3rd Qu.: 91 3rd Qu.:75.00
## Max. :121 Max. :97.00
## onset_more_24_hrs diabetes NSAIDS previous_OGDS
## Length:121 Length:121 Length:121 Length:121
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## ASA systolic diastolic tenderness
## Min. :1.000 Min. : 67.0 Min. : 38.00 Length:121
## 1st Qu.:1.000 1st Qu.:112.0 1st Qu.: 63.00 Class :character
## Median :1.000 Median :128.0 Median : 71.00 Mode :character
## Mean :1.545 Mean :128.6 Mean : 72.07
## 3rd Qu.:2.000 3rd Qu.:143.0 3rd Qu.: 81.00
## Max. :3.000 Max. :197.0 Max. :116.00
## guarding hemoglobin twc platelet
## Length:121 Min. : 3.30 Min. : 2.00 Min. : 11.0
## Class :character 1st Qu.:10.00 1st Qu.: 9.00 1st Qu.:224.0
## Mode :character Median :12.00 Median :12.00 Median :308.0
## Mean :12.32 Mean :13.03 Mean :314.4
## 3rd Qu.:15.00 3rd Qu.:16.00 3rd Qu.:392.0
## Max. :19.40 Max. :37.00 Max. :798.0
## PULP admission_to_op_hrs perforation degree_perforation
## Min. :0.000 Min. : 1.00 Min. :0.300 Length:121
## 1st Qu.:2.000 1st Qu.: 5.00 1st Qu.:0.500 Class :character
## Median :3.000 Median : 8.00 Median :1.000 Mode :character
## Mean :3.529 Mean :10.07 Mean :1.225
## 3rd Qu.:5.000 3rd Qu.:12.00 3rd Qu.:1.500
## Max. :9.000 Max. :72.00 Max. :5.000
## SSSI sepsis outcome
## Length:121 Length:121 Length:121
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
You can use summarytools::descr to get a quick but more elaborate summary statistics
This is the summary statistics for all numerical variables.
library(summarytools)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
## For best results, restart R session and update pander using devtools:: or remotes::install_github('rapporter/pander')
##
## Attaching package: 'summarytools'
## The following object is masked from 'package:wrapr':
##
## view
## The following object is masked from 'package:tibble':
##
## view
descr(PUP2)
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Non-numerical variable(s) ignored: gender, epigastric_pain, onset_more_24_hrs, diabetes, NSAIDS, previous_OGDS, tenderness, guarding, degree_perforation, SSSI, sepsis, outcome
## Descriptive Statistics
##
## admission_to_op_hrs age ASA diastolic hemoglobin no
## ----------------- --------------------- -------- -------- ----------- ------------ --------
## Mean 10.07 60.43 1.55 72.07 12.32 61.00
## Std.Dev 9.44 18.05 0.62 13.99 3.33 35.07
## Min 1.00 19.00 1.00 38.00 3.30 1.00
## Q1 5.00 49.00 1.00 63.00 10.00 31.00
## Median 8.00 64.00 1.00 71.00 12.00 61.00
## Q3 12.00 75.00 2.00 81.00 15.00 91.00
## Max 72.00 97.00 3.00 116.00 19.40 121.00
## MAD 5.93 17.79 0.00 13.34 2.97 44.48
## IQR 7.00 26.00 1.00 18.00 5.00 60.00
## CV 0.94 0.30 0.40 0.19 0.27 0.57
## Skewness 3.36 -0.55 0.66 0.12 -0.22 0.00
## SE.Skewness 0.22 0.22 0.22 0.22 0.22 0.22
## Kurtosis 15.95 -0.53 -0.55 0.22 0.02 -1.23
## N.Valid 121.00 121.00 121.00 121.00 121.00 121.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00 100.00
##
## Table: Table continues below
##
##
##
## perforation platelet PULP systolic twc
## ----------------- ------------- ---------- -------- ---------- --------
## Mean 1.22 314.43 3.53 128.56 13.03
## Std.Dev 0.91 140.32 2.28 24.51 6.66
## Min 0.30 11.00 0.00 67.00 2.00
## Q1 0.50 224.00 2.00 112.00 9.00
## Median 1.00 308.00 3.00 128.00 12.00
## Q3 1.50 392.00 5.00 143.00 16.00
## Max 5.00 798.00 9.00 197.00 37.00
## MAD 0.74 124.54 2.97 22.24 5.93
## IQR 1.00 168.00 3.00 31.00 7.00
## CV 0.74 0.45 0.65 0.19 0.51
## Skewness 1.66 0.75 0.25 0.36 0.78
## SE.Skewness 0.22 0.22 0.22 0.22 0.22
## Kurtosis 3.34 1.02 -0.91 0.58 0.60
## N.Valid 121.00 121.00 121.00 121.00 121.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00
The outcome of interest is variable outcome. It is a character variable labelled as alive or dead.
Tidyverse prefers to use character variables rather than factor variables.
PUP2 %>% group_by(outcome) %>%
count()
## # A tibble: 2 x 2
## # Groups: outcome [2]
## outcome n
## <chr> <int>
## 1 alive 83
## 2 dead 38
descr give also provide summary statistics based on the outcome of interest
PUP2 %>% split(.$outcome) %>%
map(~descr(.x, stats = c('mean', 'sd', 'min', 'med', 'max')),
transpose = TRUE)
## x must either be a summarytools object created with freq(), descr(), or a list of summarytools objects created using by()
We convert the variable outcome from character to factor
PUP2 <- PUP2 %>%
mutate(oc2 = factor(outcome))
PUP2 %>%
ggplot(aes(x = oc2)) +
geom_bar() +
xlab('outcome') + ylab('freq')
How about variable ASA. It is not a numerical variable. Let us plot it against outcome.
PUP2 %>%
ggplot(aes(ASA)) +
geom_bar() +
facet_wrap(~oc2)
options(scipen = 999)
pup_uni <- PUP2 %>%
select(oc2, age, gender, epigastric_pain, onset_more_24_hrs, diabetes,
NSAIDS, previous_OGDS, systolic, diastolic, ASA, PULP, perforation,
admission_to_op_hrs, hemoglobin, twc) %>%
map_dfr(~tidy(glm(oc2 ~ .x, family = binomial, data = PUP2),
conf.int = T), .id = 'variable') %>%
filter(variable != 'oc2')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
kable(pup_uni) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
variable | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|
age | (Intercept) | -2.2466889 | 0.7843206 | -2.8645030 | 0.0041766 | -3.8954373 | -0.7936808 |
age | .x | 0.0236740 | 0.0119691 | 1.9779287 | 0.0479368 | 0.0010929 | 0.0483811 |
gender | (Intercept) | -0.0800427 | 0.4003204 | -0.1999466 | 0.8415223 | -0.8789758 | 0.7105500 |
gender | .xmale | -0.9103560 | 0.4615231 | -1.9725036 | 0.0485521 | -1.8225388 | 0.0007672 |
epigastric_pain | (Intercept) | -1.3862944 | 1.1180337 | -1.2399397 | 0.2149977 | -4.3598898 | 0.5246956 |
epigastric_pain | .xyes | 0.6277644 | 1.1356428 | 0.5527833 | 0.5804118 | -1.3256526 | 3.6198522 |
onset_more_24_hrs | (Intercept) | -1.0986123 | 0.3333330 | -3.2958397 | 0.0009813 | -1.7942261 | -0.4749176 |
onset_more_24_hrs | .xyes | 0.5065612 | 0.4133389 | 1.2255348 | 0.2203738 | -0.2889264 | 1.3414593 |
diabetes | (Intercept) | -0.9019020 | 0.2240691 | -4.0251058 | 0.0000569 | -1.3566078 | -0.4745240 |
diabetes | .xyes | 0.5654298 | 0.4707818 | 1.2010441 | 0.2297341 | -0.3772175 | 1.4849256 |
NSAIDS | (Intercept) | -0.9490806 | 0.2404072 | -3.9478044 | 0.0000789 | -1.4390465 | -0.4920164 |
NSAIDS | .xyes | 0.5436154 | 0.4205273 | 1.2926994 | 0.1961150 | -0.2904184 | 1.3672726 |
previous_OGDS | (Intercept) | -0.7753853 | 0.2042775 | -3.7957449 | 0.0001472 | -1.1870841 | -0.3835253 |
previous_OGDS | .xyes | -0.0719126 | 0.7196661 | -0.0999249 | 0.9204039 | -1.6508310 | 1.2715818 |
systolic | (Intercept) | 0.9192236 | 1.0873957 | 0.8453441 | 0.3979187 | -1.1798631 | 3.1196146 |
systolic | .x | -0.0133813 | 0.0085103 | -1.5723584 | 0.1158675 | -0.0308281 | 0.0028134 |
diastolic | (Intercept) | 0.6301161 | 1.0434752 | 0.6038630 | 0.5459347 | -1.4068184 | 2.7166343 |
diastolic | .x | -0.0197762 | 0.0144968 | -1.3641724 | 0.1725133 | -0.0491267 | 0.0081354 |
ASA | (Intercept) | -1.6065753 | 0.5455487 | -2.9448796 | 0.0032308 | -2.7150116 | -0.5629706 |
ASA | .x | 0.5224183 | 0.3154616 | 1.6560438 | 0.0977129 | -0.0941774 | 1.1523982 |
PULP | (Intercept) | -1.6645299 | 0.4098748 | -4.0610691 | 0.0000488 | -2.5155950 | -0.8987627 |
PULP | .x | 0.2361062 | 0.0906398 | 2.6048836 | 0.0091905 | 0.0624931 | 0.4201174 |
perforation | (Intercept) | -2.3438540 | 0.4324202 | -5.4203151 | 0.0000001 | -3.2517343 | -1.5466796 |
perforation | .x | 1.2032426 | 0.2843708 | 4.2312458 | 0.0000232 | 0.6827088 | 1.8029077 |
admission_to_op_hrs | (Intercept) | -0.8431923 | 0.2853226 | -2.9552246 | 0.0031244 | -1.4139903 | -0.2840466 |
admission_to_op_hrs | .x | 0.0060960 | 0.0202384 | 0.3012113 | 0.7632534 | -0.0373817 | 0.0459273 |
hemoglobin | (Intercept) | 0.7138662 | 0.7573475 | 0.9425874 | 0.3458920 | -0.7617317 | 2.2369643 |
hemoglobin | .x | -0.1237478 | 0.0615749 | -2.0097125 | 0.0444616 | -0.2493337 | -0.0057555 |
twc | (Intercept) | -1.0461722 | 0.4361081 | -2.3988830 | 0.0164452 | -1.9266672 | -0.2058904 |
twc | .x | 0.0200912 | 0.0291979 | 0.6881022 | 0.4913885 | -0.0379630 | 0.0776960 |
Get the odds ratios
crude_or_co <- PUP2 %>%
select(oc2, age, gender, epigastric_pain, onset_more_24_hrs, diabetes,
NSAIDS, previous_OGDS, systolic, diastolic, ASA, PULP, perforation,
admission_to_op_hrs, hemoglobin, twc) %>%
map(~glm(oc2 ~ .x, family = binomial, data = PUP2)) %>%
map_dfr(~tidy(., exponentiate = TRUE, conf.int = TRUE), .id = 'variable') %>%
filter(variable != 'oc2')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
kable(crude_or_co) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
variable | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|
age | (Intercept) | 0.1057488 | 0.7843206 | -2.8645030 | 0.0041766 | 0.0203345 | 0.4521773 |
age | .x | 1.0239565 | 0.0119691 | 1.9779287 | 0.0479368 | 1.0010935 | 1.0495706 |
gender | (Intercept) | 0.9230769 | 0.4003204 | -0.1999466 | 0.8415223 | 0.4152080 | 2.0351103 |
gender | .xmale | 0.4023810 | 0.4615231 | -1.9725036 | 0.0485521 | 0.1616149 | 1.0007675 |
epigastric_pain | (Intercept) | 0.2500000 | 1.1180337 | -1.2399397 | 0.2149977 | 0.0127798 | 1.6899444 |
epigastric_pain | .xyes | 1.8734177 | 1.1356428 | 0.5527833 | 0.5804118 | 0.2656295 | 37.3320493 |
onset_more_24_hrs | (Intercept) | 0.3333333 | 0.3333330 | -3.2958397 | 0.0009813 | 0.1662561 | 0.6219363 |
onset_more_24_hrs | .xyes | 1.6595745 | 0.4133389 | 1.2255348 | 0.2203738 | 0.7490674 | 3.8246207 |
diabetes | (Intercept) | 0.4057971 | 0.2240691 | -4.0251058 | 0.0000569 | 0.2575329 | 0.6221811 |
diabetes | .xyes | 1.7602041 | 0.4707818 | 1.2010441 | 0.2297341 | 0.6857669 | 4.4146368 |
NSAIDS | (Intercept) | 0.3870968 | 0.2404072 | -3.9478044 | 0.0000789 | 0.2371538 | 0.6113924 |
NSAIDS | .xyes | 1.7222222 | 0.4205273 | 1.2926994 | 0.1961150 | 0.7479506 | 3.9246320 |
previous_OGDS | (Intercept) | 0.4605263 | 0.2042775 | -3.7957449 | 0.0001472 | 0.3051097 | 0.6814549 |
previous_OGDS | .xyes | 0.9306122 | 0.7196661 | -0.0999249 | 0.9204039 | 0.1918904 | 3.5664897 |
systolic | (Intercept) | 2.5073428 | 1.0873957 | 0.8453441 | 0.3979187 | 0.3073208 | 22.6376538 |
systolic | .x | 0.9867078 | 0.0085103 | -1.5723584 | 0.1158675 | 0.9696422 | 1.0028173 |
diastolic | (Intercept) | 1.8778285 | 1.0434752 | 0.6038630 | 0.5459347 | 0.2449213 | 15.1293161 |
diastolic | .x | 0.9804181 | 0.0144968 | -1.3641724 | 0.1725133 | 0.9520605 | 1.0081686 |
ASA | (Intercept) | 0.2005733 | 0.5455487 | -2.9448796 | 0.0032308 | 0.0662042 | 0.5695148 |
ASA | .x | 1.6861001 | 0.3154616 | 1.6560438 | 0.0977129 | 0.9101213 | 3.1657760 |
PULP | (Intercept) | 0.1892796 | 0.4098748 | -4.0610691 | 0.0000488 | 0.0808148 | 0.4070730 |
PULP | .x | 1.2663088 | 0.0906398 | 2.6048836 | 0.0091905 | 1.0644871 | 1.5221402 |
perforation | (Intercept) | 0.0959571 | 0.4324202 | -5.4203151 | 0.0000001 | 0.0387070 | 0.2129539 |
perforation | .x | 3.3309004 | 0.2843708 | 4.2312458 | 0.0000232 | 1.9792317 | 6.0672637 |
admission_to_op_hrs | (Intercept) | 0.4303346 | 0.2853226 | -2.9552246 | 0.0031244 | 0.2431710 | 0.7527315 |
admission_to_op_hrs | .x | 1.0061147 | 0.0202384 | 0.3012113 | 0.7632534 | 0.9633084 | 1.0469983 |
hemoglobin | (Intercept) | 2.0418703 | 0.7573475 | 0.9425874 | 0.3458920 | 0.4668573 | 9.3648596 |
hemoglobin | .x | 0.8836027 | 0.0615749 | -2.0097125 | 0.0444616 | 0.7793198 | 0.9942611 |
twc | (Intercept) | 0.3512798 | 0.4361081 | -2.3988830 | 0.0164452 | 0.1456328 | 0.8139223 |
twc | .x | 1.0202943 | 0.0291979 | 0.6881022 | 0.4913885 | 0.9627485 | 1.0807940 |
Combine the results
model_uni <- bind_cols(pup_uni, crude_or_co) %>%
janitor::clean_names() %>%
filter(term_2 != '(Intercept)') %>%
select(-c(statistic_5, p_value_6)) %>%
rename(odds_ratio = estimate_3)
## New names:
## * variable -> variable...1
## * term -> term...2
## * estimate -> estimate...3
## * std.error -> std.error...4
## * statistic -> statistic...5
## * ...
kable(model_uni) %>%
kable_styling(bootstrap_options = c("striped", "hover",
"condensed", "responsive"))
variable_1 | term_2 | odds_ratio | std_error_4 | conf_low_7 | conf_high_8 | variable_9 | term_10 | estimate_11 | std_error_12 | statistic_13 | p_value_14 | conf_low_15 | conf_high_16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | .x | 0.0236740 | 0.0119691 | 0.0010929 | 0.0483811 | age | .x | 1.0239565 | 0.0119691 | 1.9779287 | 0.0479368 | 1.0010935 | 1.0495706 |
gender | .xmale | -0.9103560 | 0.4615231 | -1.8225388 | 0.0007672 | gender | .xmale | 0.4023810 | 0.4615231 | -1.9725036 | 0.0485521 | 0.1616149 | 1.0007675 |
epigastric_pain | .xyes | 0.6277644 | 1.1356428 | -1.3256526 | 3.6198522 | epigastric_pain | .xyes | 1.8734177 | 1.1356428 | 0.5527833 | 0.5804118 | 0.2656295 | 37.3320493 |
onset_more_24_hrs | .xyes | 0.5065612 | 0.4133389 | -0.2889264 | 1.3414593 | onset_more_24_hrs | .xyes | 1.6595745 | 0.4133389 | 1.2255348 | 0.2203738 | 0.7490674 | 3.8246207 |
diabetes | .xyes | 0.5654298 | 0.4707818 | -0.3772175 | 1.4849256 | diabetes | .xyes | 1.7602041 | 0.4707818 | 1.2010441 | 0.2297341 | 0.6857669 | 4.4146368 |
NSAIDS | .xyes | 0.5436154 | 0.4205273 | -0.2904184 | 1.3672726 | NSAIDS | .xyes | 1.7222222 | 0.4205273 | 1.2926994 | 0.1961150 | 0.7479506 | 3.9246320 |
previous_OGDS | .xyes | -0.0719126 | 0.7196661 | -1.6508310 | 1.2715818 | previous_OGDS | .xyes | 0.9306122 | 0.7196661 | -0.0999249 | 0.9204039 | 0.1918904 | 3.5664897 |
systolic | .x | -0.0133813 | 0.0085103 | -0.0308281 | 0.0028134 | systolic | .x | 0.9867078 | 0.0085103 | -1.5723584 | 0.1158675 | 0.9696422 | 1.0028173 |
diastolic | .x | -0.0197762 | 0.0144968 | -0.0491267 | 0.0081354 | diastolic | .x | 0.9804181 | 0.0144968 | -1.3641724 | 0.1725133 | 0.9520605 | 1.0081686 |
ASA | .x | 0.5224183 | 0.3154616 | -0.0941774 | 1.1523982 | ASA | .x | 1.6861001 | 0.3154616 | 1.6560438 | 0.0977129 | 0.9101213 | 3.1657760 |
PULP | .x | 0.2361062 | 0.0906398 | 0.0624931 | 0.4201174 | PULP | .x | 1.2663088 | 0.0906398 | 2.6048836 | 0.0091905 | 1.0644871 | 1.5221402 |
perforation | .x | 1.2032426 | 0.2843708 | 0.6827088 | 1.8029077 | perforation | .x | 3.3309004 | 0.2843708 | 4.2312458 | 0.0000232 | 1.9792317 | 6.0672637 |
admission_to_op_hrs | .x | 0.0060960 | 0.0202384 | -0.0373817 | 0.0459273 | admission_to_op_hrs | .x | 1.0061147 | 0.0202384 | 0.3012113 | 0.7632534 | 0.9633084 | 1.0469983 |
hemoglobin | .x | -0.1237478 | 0.0615749 | -0.2493337 | -0.0057555 | hemoglobin | .x | 0.8836027 | 0.0615749 | -2.0097125 | 0.0444616 | 0.7793198 | 0.9942611 |
twc | .x | 0.0200912 | 0.0291979 | -0.0379630 | 0.0776960 | twc | .x | 1.0202943 | 0.0291979 | 0.6881022 | 0.4913885 | 0.9627485 | 1.0807940 |
write_csv(model_uni, 'model_uni.csv')
We start with main effect models (no interaction)
Let us run this model with
- Outcome: oc2
- Primary variables: ASA, degree of perforation, PULP, NSAIDS, Hg, malena, onset more than 24
- Confounding variables: age, gender, diabetes, hypertension
Primary variables are variables of interest that would predict outcome of peptic ulcer. Confounding variables are variables that may alter the effect of the primary variables when they are in the model. This may happen due to different distribution of them at the beginning of the study.
Do not control variables after the study (refer to Andrew Gelman argument)
model_multivar <- glm(oc2 ~ age + gender + epigastric_pain + onset_more_24_hrs +
diabetes + NSAIDS + factor(ASA) + PULP + perforation +
hemoglobin,
family = binomial(link = 'logit'), data = PUP2)
model_multivar2 <- tidy(model_multivar, conf.int = T)
kable(model_multivar2) %>%
kable_styling(bootstrap_options = c("striped", "hover",
"condensed", "responsive"))
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -3.7719198 | 2.2578692 | -1.6705661 | 0.0948074 | -8.5384323 | 0.4634684 |
age | 0.0000431 | 0.0205982 | 0.0020928 | 0.9983302 | -0.0407516 | 0.0409729 |
gendermale | -0.8327136 | 0.5929182 | -1.4044325 | 0.1601901 | -2.0218586 | 0.3264799 |
epigastric_painyes | 0.5371919 | 1.2596323 | 0.4264672 | 0.6697674 | -1.6838093 | 3.6779167 |
onset_more_24_hrsyes | 0.1541497 | 0.5164268 | 0.2984929 | 0.7653270 | -0.8588498 | 1.1840249 |
diabetesyes | 0.0059755 | 0.6874532 | 0.0086922 | 0.9930647 | -1.3740242 | 1.3454276 |
NSAIDSyes | 0.2909611 | 0.4880827 | 0.5961307 | 0.5510879 | -0.6832510 | 1.2445234 |
factor(ASA)2 | 0.7622642 | 0.5989237 | 1.2727234 | 0.2031162 | -0.4011611 | 1.9676895 |
factor(ASA)3 | -0.0945293 | 1.1837434 | -0.0798562 | 0.9363516 | -2.6255024 | 2.1174561 |
PULP | 0.0750294 | 0.1683727 | 0.4456149 | 0.6558754 | -0.2564848 | 0.4087811 |
perforation | 1.1229983 | 0.3317340 | 3.3852377 | 0.0007112 | 0.5217920 | 1.8243413 |
hemoglobin | 0.0683182 | 0.0826810 | 0.8262859 | 0.4086419 | -0.0941807 | 0.2332461 |
With interaction model. Because, the effect of covariates on outcome maybe different according to different level of PULP and extent of perforation size.
The interaction effect between perforation and PULP.
model_ia <- glm(oc2 ~ age + gender + epigastric_pain + onset_more_24_hrs +
diabetes + NSAIDS + factor(ASA) + PULP + perforation +
hemoglobin + perforation:PULP,
family = binomial(link = 'logit'), data = PUP2)
model_ia <- tidy(model_ia, conf.int = T)
kable(model_ia) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -4.0998681 | 2.3260039 | -1.7626231 | 0.0779641 | -8.9861364 | 0.2724682 |
age | -0.0012018 | 0.0208996 | -0.0575036 | 0.9541440 | -0.0426999 | 0.0402279 |
gendermale | -0.8121256 | 0.5946346 | -1.3657556 | 0.1720157 | -2.0044230 | 0.3507217 |
epigastric_painyes | 0.5648047 | 1.2419028 | 0.4547898 | 0.6492604 | -1.6300740 | 3.6813556 |
onset_more_24_hrsyes | 0.1684374 | 0.5191416 | 0.3244537 | 0.7455946 | -0.8492422 | 1.2044757 |
diabetesyes | 0.0051572 | 0.6882091 | 0.0074937 | 0.9940210 | -1.3765049 | 1.3457319 |
NSAIDSyes | 0.2789421 | 0.4874188 | 0.5722843 | 0.5671294 | -0.6935094 | 1.2314276 |
factor(ASA)2 | 0.7395709 | 0.6010129 | 1.2305409 | 0.2184946 | -0.4272342 | 1.9492697 |
factor(ASA)3 | -0.0133695 | 1.1511487 | -0.0116140 | 0.9907336 | -2.5065349 | 2.1403235 |
PULP | 0.1863448 | 0.2503740 | 0.7442659 | 0.4567156 | -0.3041504 | 0.6860178 |
perforation | 1.4600565 | 0.6630358 | 2.2020779 | 0.0276598 | 0.2495542 | 2.8457067 |
hemoglobin | 0.0642286 | 0.0830514 | 0.7733604 | 0.4393091 | -0.0989351 | 0.2298803 |
PULP:perforation | -0.0812585 | 0.1343404 | -0.6048699 | 0.5452655 | -0.3456000 | 0.1862839 |
For ASA adjusting for baseline covariates (age, gender, diabetes, onset_more_24_hrs)
model_ASA <- glm(oc2 ~ factor(ASA) + age + gender + diabetes +
onset_more_24_hrs,
family = binomial(link = 'logit'), data = PUP2)
m_ASA <- tidy(model_ASA, exponentiate = TRUE, conf.int = TRUE)
For Perforation
model_perf <- glm(oc2 ~ degree_perforation + age + gender + diabetes +
onset_more_24_hrs,
family = binomial(link = 'logit'), data = PUP2)
m_perf <- tidy(model_perf, exponentiate = TRUE, conf.int = TRUE)
For PULP
model_PULP <- glm(oc2 ~ PULP + age + gender + diabetes + onset_more_24_hrs,
family = binomial(link = 'logit'), data = PUP2)
m_pulp <- tidy(model_PULP, exponentiate = TRUE, conf.int = TRUE)
For perforation
model_perf2 <- glm(oc2 ~ perforation + age + gender + diabetes +
onset_more_24_hrs,
family = binomial(link = 'logit'), data = PUP2)
m_perf2 <- tidy(model_perf2, exponentiate = TRUE, conf.int = TRUE)
For haemoglobin
model_hg <- glm(oc2 ~ hemoglobin + age + gender + diabetes +
onset_more_24_hrs,
family = binomial(link = 'logit'), data = PUP2)
m_hg <- tidy(model_hg, exponentiate = TRUE, conf.int = TRUE)
Results in combined nicer tidy format
multi_confirm <- bind_rows(m_ASA, m_perf, m_pulp, m_perf2, m_hg) %>%
filter(term %in% c("factor(ASA)2", "factor(ASA)3",
"degree_perforationmoderate", "degree_perforationsmall",
"PULP", "perforation", "hemoglobin")) %>%
rename(adjOR = estimate, lower_CI = conf.low, upper_CI = conf.high)
kable(multi_confirm) %>%
kable_styling(bootstrap_options = c("striped", "hover",
"condensed", "responsive"))
term | adjOR | std.error | statistic | p.value | lower_CI | upper_CI |
---|---|---|---|---|---|---|
factor(ASA)2 | 2.4756866 | 0.5103041 | 1.7764265 | 0.0756626 | 0.9196786 | 6.8907542 |
factor(ASA)3 | 1.1375937 | 0.9615300 | 0.1340730 | 0.8933448 | 0.1354452 | 6.8469894 |
degree_perforationmoderate | 0.2944452 | 0.6383464 | -1.9153591 | 0.0554467 | 0.0801986 | 1.0005652 |
degree_perforationsmall | 0.1388571 | 0.5347515 | -3.6920143 | 0.0002225 | 0.0467385 | 0.3864164 |
PULP | 1.2209041 | 0.1349680 | 1.4788077 | 0.1391917 | 0.9404310 | 1.6027977 |
perforation | 3.0238388 | 0.2988394 | 3.7027485 | 0.0002133 | 1.7672258 | 5.7287607 |
hemoglobin | 0.9504030 | 0.0711117 | -0.7153421 | 0.4743977 | 0.8247297 | 1.0928258 |
write_csv(multi_confirm, 'multivar_final.csv')
Making predictions
- log-odds
- probability
#augment(model_multivar, type = c('link', 'response'))
augment(model_multivar, type.predict = 'link')
## # A tibble: 121 x 17
## oc2 age gender epigastric_pain onset_more_24_h~ diabetes NSAIDS
## <fct> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 alive 42 male yes no no no
## 2 alive 66 female yes no no no
## 3 alive 67 male yes no no yes
## 4 alive 19 male yes yes no no
## 5 alive 77 male yes yes no no
## 6 alive 39 male yes yes no no
## 7 dead 62 female yes yes yes no
## 8 alive 71 female yes no no no
## 9 alive 69 male yes no yes yes
## 10 alive 97 female yes yes no yes
## # ... with 111 more rows, and 10 more variables: `factor(ASA)` <fct>,
## # PULP <dbl>, perforation <dbl>, hemoglobin <dbl>, .fitted <dbl>,
## # .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>
augment(model_multivar, type.predict = 'response')
## # A tibble: 121 x 17
## oc2 age gender epigastric_pain onset_more_24_h~ diabetes NSAIDS
## <fct> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 alive 42 male yes no no no
## 2 alive 66 female yes no no no
## 3 alive 67 male yes no no yes
## 4 alive 19 male yes yes no no
## 5 alive 77 male yes yes no no
## 6 alive 39 male yes yes no no
## 7 dead 62 female yes yes yes no
## 8 alive 71 female yes no no no
## 9 alive 69 male yes no yes yes
## 10 alive 97 female yes yes no yes
## # ... with 111 more rows, and 10 more variables: `factor(ASA)` <fct>,
## # PULP <dbl>, perforation <dbl>, hemoglobin <dbl>, .fitted <dbl>,
## # .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>
Others:
- confusion matrix
- overall model fitness (Hosmer Lemeshow test)
- Area under the ROC curve
- Model diagnostics
\newpage
Survival analysis is a common analysis in medicine. It also known as the time to event analysis or duration analysis. It is a type of regression.
In survival analysis, one of the most common modelling method is the Cox proportional hazard regression.
In survival analysis, the outcome variable (dependent variable) is TIME TO THE OCCURRENCE OF AN EVENT or shortly known as the time-to-event variable.
In survival analysis, we follow a subject of interest until a certain time (the last follow up). Different patients will have different follow-up times.
For example, we observe a group of subjects; with the outcome variable named as 'status' and the outcome coded as 'death' or 'alive'.
The status at the last follow up, can be an event either of 'death' or of other than death - 'non-death'.
Any other subjects who are under the same follow-up and still survive until the latest follow-up is known as a 'censor' case.
For our example, data comes from a group of stroke patients.
Data is in the stata format.
stroke <- read_dta(here('datasets', 'stroke_outcome.dta'))
Let's get a brief view
- class of variables
- observations and variables
str(stroke)
## tibble [226 x 9] (S3: tbl_df/tbl/data.frame)
## $ id : chr [1:226] "B179568" "B454143" "B221072" "B455158" ...
## ..- attr(*, "label")= chr "id"
## ..- attr(*, "format.stata")= chr "%8s"
## $ sex : dbl+lbl [1:226] 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, ...
## ..@ label : chr "sex"
## ..@ format.stata: chr "%6.0f"
## ..@ labels : Named num [1:2] 1 2
## .. ..- attr(*, "names")= chr [1:2] "male" "female"
## $ days : num [1:226] 4 3 16 23 5 4 22 14 4 2 ...
## ..- attr(*, "label")= chr "days"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ gcs : num [1:226] 15 15 15 11 15 7 5 13 15 15 ...
## ..- attr(*, "label")= chr "earliest Glasgow Coma Scale"
## ..- attr(*, "format.stata")= chr "%2.0f"
## $ sbp : num [1:226] 150 152 231 110 199 190 145 161 222 161 ...
## ..- attr(*, "label")= chr "earliest systolic BP (mmHg)"
## ..- attr(*, "format.stata")= chr "%3.0f"
## $ dbp : num [1:226] 87 108 117 79 134 101 102 96 129 107 ...
## ..- attr(*, "label")= chr "earliest diastolic BP (mmHg)"
## ..- attr(*, "format.stata")= chr "%3.0f"
## $ age : num [1:226] 69 64 63 67 66 78 55 65 67 56 ...
## ..- attr(*, "label")= chr "age in years"
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ outcome: dbl+lbl [1:226] 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, ...
## ..@ label : chr "status @discharge 1=dead, 0=alive"
## ..@ format.stata: chr "%9.0g"
## ..@ labels : Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "alive" "dead"
## $ icd10 : dbl+lbl [1:226] 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, ...
## ..@ label : chr "0=Cerebral Isch or others, 1=SAH, 2=Haemorrhagic"
## ..@ format.stata: chr "%22.0g"
## ..@ labels : Named num [1:3] 0 1 2
## .. ..- attr(*, "names")= chr [1:3] "CI,Others" "SAH" "ICB, Other Haemorrhage"
## - attr(*, "label")= chr "Data file created by EpiData based on DEADALIVEHUSM2.REC"
glimpse(stroke)
## Rows: 226
## Columns: 9
## $ id <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17...
## $ sex <dbl+lbl> 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1...
## $ days <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21...
## $ gcs <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15...
## $ sbp <dbl> 150, 152, 231, 110, 199, 190, 145, 161, 222, 161, 149, 153,...
## $ dbp <dbl> 87, 108, 117, 79, 134, 101, 102, 96, 129, 107, 90, 61, 95, ...
## $ age <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83,...
## $ outcome <dbl+lbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0...
## $ icd10 <dbl+lbl> 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0...
We can convert many variables; in this case all labelled variable to factors variables.
stroke2 <- stroke %>%
mutate_if(is.labelled, ~as_factor(.))
glimpse(stroke2)
## Rows: 226
## Columns: 9
## $ id <chr> "B179568", "B454143", "B221072", "B455158", "B099206", "A17...
## $ sex <fct> female, male, female, female, male, female, female, female,...
## $ days <dbl> 4, 3, 16, 23, 5, 4, 22, 14, 4, 2, 22, 2, 43, 2, 4, 4, 3, 21...
## $ gcs <dbl> 15, 15, 15, 11, 15, 7, 5, 13, 15, 15, 10, 15, 14, 9, 15, 15...
## $ sbp <dbl> 150, 152, 231, 110, 199, 190, 145, 161, 222, 161, 149, 153,...
## $ dbp <dbl> 87, 108, 117, 79, 134, 101, 102, 96, 129, 107, 90, 61, 95, ...
## $ age <dbl> 69, 64, 63, 67, 66, 78, 55, 65, 67, 56, 50, 78, 59, 83, 83,...
## $ outcome <fct> dead, alive, alive, alive, alive, dead, alive, dead, alive,...
## $ icd10 <fct> "CI,Others", "CI,Others", "ICB, Other Haemorrhage", "ICB, O...
Collapse stroke types (variable icd10) from 3 categories to 2 categories
table(stroke2$icd10)
##
## CI,Others SAH ICB, Other Haemorrhage
## 149 19 58
stroke2 <- stroke2 %>%
mutate(icd10 = recode(icd10, 'SAH' = 'HS', 'ICB, Other Haemorrhage' = 'HS'))
Remember, data must have at least
- duration taken to develop event of interest (time variable)
- event of interest (event variable)
Data can have censor observations too.
In medicine and epidemiology, the most used survival model uses the Cox proportional hazard regression.
It is a semi-parametric model.
This is because we do not specify the exact distribution of the baseline hazard. But the other covariates follow some form or assumed distribution.
The formula for Cox PH model
where
The null model contains no covariate.
cox.null <- coxph(Surv(time = days, event = outcome == 'dead') ~ 1,
data = stroke2)
summary(cox.null)
## Call: coxph(formula = Surv(time = days, event = outcome == "dead") ~
## 1, data = stroke2)
##
## Null model
## log likelihood= -228.691
## n= 226
Main effect model
Let us include Glasgow Coma Scale as the covariate
cox.gcs <- coxph(Surv(time = days, event = outcome == 'dead') ~ gcs,
data = stroke2)
summary(cox.gcs)
## Call:
## coxph(formula = Surv(time = days, event = outcome == "dead") ~
## gcs, data = stroke2)
##
## n= 225, number of events= 52
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gcs -0.19188 0.82541 0.03261 -5.883 0.00000000402 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gcs 0.8254 1.212 0.7743 0.8799
##
## Concordance= 0.787 (se = 0.035 )
## Likelihood ratio test= 35.11 on 1 df, p=0.000000003
## Wald test = 34.62 on 1 df, p=0.000000004
## Score (logrank) test = 40.65 on 1 df, p=0.0000000002
Now, model the risk for death as a function of Glasgow Coma Scale and age.
cox.gcs.age <- coxph(Surv(time = days, event = outcome == 'dead') ~ gcs +
age, data = stroke2)
summary(cox.gcs.age)
## Call:
## coxph(formula = Surv(time = days, event = outcome == "dead") ~
## gcs + age, data = stroke2)
##
## n= 225, number of events= 52
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gcs -0.20001 0.81872 0.03288 -6.082 0.00000000118 ***
## age 0.02982 1.03027 0.01129 2.642 0.00823 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gcs 0.8187 1.2214 0.7676 0.8732
## age 1.0303 0.9706 1.0077 1.0533
##
## Concordance= 0.814 (se = 0.03 )
## Likelihood ratio test= 42.2 on 2 df, p=0.0000000007
## Wald test = 41 on 2 df, p=0.000000001
## Score (logrank) test = 47.67 on 2 df, p=0.00000000004
Model with interaction
We add an interaction between gcs and age into our covariate.
cox.gcs.age.ia <- coxph(Surv(time = days, event = outcome == 'dead') ~ gcs +
age + gcs:age, data = stroke2)
summary(cox.gcs.age.ia)
## Call:
## coxph(formula = Surv(time = days, event = outcome == "dead") ~
## gcs + age + gcs:age, data = stroke2)
##
## n= 225, number of events= 52
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gcs -0.371286 0.689846 0.179678 -2.066 0.0388 *
## age 0.006273 1.006293 0.026607 0.236 0.8136
## gcs:age 0.002740 1.002744 0.002811 0.975 0.3297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gcs 0.6898 1.4496 0.4851 0.9811
## age 1.0063 0.9937 0.9552 1.0602
## gcs:age 1.0027 0.9973 0.9972 1.0083
##
## Concordance= 0.81 (se = 0.031 )
## Likelihood ratio test= 43.14 on 3 df, p=0.000000002
## Wald test = 38.07 on 3 df, p=0.00000003
## Score (logrank) test = 48.93 on 3 df, p=0.0000000001
cox.mv <- coxph(Surv(time = days, event = outcome == 'dead') ~ gcs + age +
icd10 + sex, data = stroke2)
summary(cox.mv)
## Call:
## coxph(formula = Surv(time = days, event = outcome == "dead") ~
## gcs + age + icd10 + sex, data = stroke2)
##
## n= 225, number of events= 52
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gcs -0.18619 0.83011 0.03463 -5.377 0.0000000757 ***
## age 0.03173 1.03224 0.01137 2.790 0.00527 **
## icd10HS 0.46930 1.59888 0.31615 1.484 0.13769
## sexfemale 0.12632 1.13464 0.31973 0.395 0.69279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gcs 0.8301 1.2047 0.7756 0.8884
## age 1.0322 0.9688 1.0095 1.0555
## icd10HS 1.5989 0.6254 0.8604 2.9711
## sexfemale 1.1346 0.8813 0.6063 2.1233
##
## Concordance= 0.819 (se = 0.03 )
## Likelihood ratio test= 44.56 on 4 df, p=0.000000005
## Wald test = 42.07 on 4 df, p=0.00000002
## Score (logrank) test = 49.78 on 4 df, p=0.0000000004
The problems when you have both GCS and stroke types (icd10)
cox.mv2 <- coxph(Surv(time = days, event = outcome == 'dead') ~ age +
icd10 + sex, data = stroke2)
summary(cox.mv2)
## Call:
## coxph(formula = Surv(time = days, event = outcome == "dead") ~
## age + icd10 + sex, data = stroke2)
##
## n= 226, number of events= 53
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.02865 1.02906 0.01141 2.510 0.01206 *
## icd10HS 0.89564 2.44889 0.31349 2.857 0.00428 **
## sexfemale 0.39415 1.48312 0.30736 1.282 0.19972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.029 0.9718 1.006 1.052
## icd10HS 2.449 0.4083 1.325 4.527
## sexfemale 1.483 0.6743 0.812 2.709
##
## Concordance= 0.719 (se = 0.045 )
## Likelihood ratio test= 15.29 on 3 df, p=0.002
## Wald test = 14.6 on 3 df, p=0.002
## Score (logrank) test = 14.77 on 3 df, p=0.002
Model 1
tidy(cox.mv, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 gcs 0.830 0.0346 -5.38 0.0000000757 0.776 0.888
## 2 age 1.03 0.0114 2.79 0.00527 1.01 1.06
## 3 icd10HS 1.60 0.316 1.48 0.138 0.860 2.97
## 4 sexfemale 1.13 0.320 0.395 0.693 0.606 2.12
Model 2
tidy(cox.mv2, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 1.03 0.0114 2.51 0.0121 1.01 1.05
## 2 icd10HS 2.45 0.313 2.86 0.00428 1.32 4.53
## 3 sexfemale 1.48 0.307 1.28 0.200 0.812 2.71
One of the most important assumption in Cox proportional hazars regression is the proportional hazard assumptions.
We can use survival::cox.zph()
prop.h <- cox.zph(cox.mv2, transform = 'km', global = TRUE)
prop.h
## chisq df p
## age 1.7882 1 0.1811
## icd10 7.9078 1 0.0049
## sex 0.0333 1 0.8552
## GLOBAL 9.6778 3 0.0215
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] summarytools_0.9.6 survival_3.2-3 corrplot_0.84 cdata_1.1.8
## [5] wrapr_2.0.2 broom_0.7.0 kableExtra_1.2.1 readxl_1.3.1
## [9] haven_2.3.1 here_0.1 forcats_0.5.0 stringr_1.4.0
## [13] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2
## [17] tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] matrixStats_0.56.0 fs_1.5.0 lubridate_1.7.9 webshot_0.5.2
## [5] httr_1.4.2 rprojroot_1.3-2 tools_4.0.2 backports_1.1.9
## [9] utf8_1.1.4 R6_2.4.1 DBI_1.1.0 colorspace_1.4-1
## [13] withr_2.2.0 tidyselect_1.1.0 compiler_4.0.2 cli_2.0.2
## [17] rvest_0.3.6 xml2_1.3.2 labeling_0.3 bookdown_0.20
## [21] scales_1.1.1 checkmate_2.0.0 rquery_1.4.5 digest_0.6.25
## [25] rmarkdown_2.3 rqdatatable_1.2.8 base64enc_0.1-3 pkgconfig_2.0.3
## [29] htmltools_0.5.0 dbplyr_1.4.4 highr_0.8 rlang_0.4.7
## [33] rstudioapi_0.11 pryr_0.1.4 farver_2.0.3 generics_0.0.2
## [37] jsonlite_1.7.0 magrittr_1.5 rapportools_1.0 Matrix_1.2-18
## [41] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0
## [45] stringi_1.4.6 yaml_2.2.1 snakecase_0.11.0 MASS_7.3-52
## [49] plyr_1.8.6 grid_4.0.2 blob_1.2.1 parallel_4.0.2
## [53] crayon_1.3.4 lattice_0.20-41 splines_4.0.2 pander_0.6.3
## [57] hms_0.5.3 magick_2.4.0 knitr_1.29 pillar_1.4.6
## [61] tcltk_4.0.2 codetools_0.2-16 reprex_0.3.0 glue_1.4.2
## [65] evaluate_0.14 data.table_1.13.0 modelr_0.1.8 vctrs_0.3.4
## [69] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.16
## [73] janitor_2.0.1 viridisLite_0.3.0 ellipsis_0.3.1