In this notebook, I check whether the independent variables (predictors) in the bleeding study data are independent of one another or correlated.

The continuous variables are: 
1. Age
2. Platelet Nadir while on Ibrutinib
3. Prior lines of therapy (using the continuous column of data) 

The continuous variables below need to be analyzed on their own, since both only have data at the time of bleed and are used in only one sub-analysis regression model. 
1. Platelet at the time of bleed 
2. Hb at the time of bleed 
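
For the continuous variables listed above, one simple check is a pairwise correlation matrix. The sketch below uses simulated data and hypothetical column names (`age`, `platelet_nadir`, `prior_lines`), not the study's columns. Spearman correlation is an assumption on my part, chosen because a count such as prior lines of therapy is unlikely to be normally distributed.

```r
# Simulated stand-in data; column names are hypothetical, not the study's.
set.seed(1)
n <- 50
toy <- data.frame(
  age            = round(rnorm(n, mean = 70, sd = 8)),
  platelet_nadir = round(rnorm(n, mean = 120, sd = 40)),
  prior_lines    = rpois(n, lambda = 1.5)
)

# Rank-based correlation matrix for all pairs of continuous variables.
cor_mat <- cor(toy, method = "spearman", use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Test for a single pair; exact = FALSE avoids the warning about ties.
pair_test <- cor.test(toy$age, toy$platelet_nadir,
                      method = "spearman", exact = FALSE)
print(pair_test$p.value)
```

The two "at the time of bleed" variables would be checked the same way, but only on the rows that have a bleed recorded.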

The categorical variables are: 
1. Gender (M/F) 
2. Platelets <50 (Y/N) 
3. Anemia (hb < 100) (Y/N) 
4. Anemia (hb < 110) (Y/N) - Ignored, since I'll use hb < 100; including both messes things up. 
5. HR Molecular/Cytogenetics (Y/N)
6. Anticoagulation (Y/N)
7. Anti-platelet (Y/N)
8. PMHx bleeding risk (Y/N)
9. Prior lines of therapy 1, 2, 3, 4, 6 (using the categorical columns of data) 
10. Major bleed and Minor bleed (the 2 categorical dummy variables)
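
One likely reason that including both anemia flags causes trouble (my reading, not something the notebook states) is that the two thresholds are nested: anyone with hb < 100 also has hb < 110. The combination "hb < 100 = yes, hb < 110 = no" can therefore never occur, which puts a structural zero in the contingency table. A minimal sketch with simulated haemoglobin values:

```r
# Simulated haemoglobin values (g/L); not study data.
set.seed(2)
hb <- round(rnorm(200, mean = 115, sd = 15))

# Fixing the levels guarantees a full 2x2 table even if a level is empty.
anemia_100 <- factor(as.integer(hb < 100), levels = 0:1)
anemia_110 <- factor(as.integer(hb < 110), levels = 0:1)

# Rows: hb < 100, columns: hb < 110.
tab <- table(anemia_100 = anemia_100, anemia_110 = anemia_110)
print(tab)

# The cell (hb < 100 = 1, hb < 110 = 0) is impossible by construction.
impossible_cell <- tab["1", "0"]
print(impossible_cell)
```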

Helpful links: 
- Implementing a log-linear model: https://data.library.virginia.edu/an-introduction-to-loglinear-models/
- Converting between a data frame of cases, a data frame of counts of each type of case, and a contingency table : http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/#cases-to-counts

Load necessary libraries here

In [1]:
#import here 

Load dataset

In [2]:
# This dataset did not remove the rows with unknown cytogenetics. With this dataset, will exclude cytogenetics from log linear analysis 
Bleed_data <- read.csv(file ='/Users/anthonyquint/Desktop/LHSC_Work_Folder/Mina/Bleeding_study/Ibrutinib Data Set, July 13,2021, de- identified data_cleaned_forSurvAnal.csv')

# This dataset did remove the rows with unknown cytogenetics. With this dataset, will include cytogenetics in log linear analysis
Bleed_data_cytocleaned <- read.csv(file = '/Users/anthonyquint/Desktop/LHSC_Work_Folder/Mina/Bleeding_study/Ibrutinib Data Set, July 13,2021, de- identified data_cleaned_forSurvAnal_CtyoCleaned.csv')

#head(Bleed_data)

Create new df with only categorical variables

In [3]:
# Note: Anemia..hb...110...Y.N. is selected here but is left out of the model formulas below, so the count tables are still cross-classified by it.
df_cat <- Bleed_data[,c("gender",'Platelets...50..Y.N.','Anemia..hb...100...Y.N.','Anemia..hb...110...Y.N.','anticoagulation..Y.N.', 'anti.platelet..Y.N.','PMHx.bleeding.risk..Y.N.','Prior.lines.of.therapy.1', 'Prior.lines.of.therapy.2','Prior.lines.of.therapy.3','Prior.lines.of.therapy.4','Prior.lines.of.therapy.6','Major.Bleed','Minor.Bleed')]

df_cat_cyto_cleaned <- Bleed_data_cytocleaned[,c("gender",'Platelets...50..Y.N.','Anemia..hb...100...Y.N.','Anemia..hb...110...Y.N.','HR.Molecular.Cytogenetics..Y.N.','anticoagulation..Y.N.', 'anti.platelet..Y.N.','PMHx.bleeding.risk..Y.N.','Prior.lines.of.therapy.1', 'Prior.lines.of.therapy.2','Prior.lines.of.therapy.3','Prior.lines.of.therapy.4','Prior.lines.of.therapy.6','Major.Bleed','Minor.Bleed')] 

#head(df_cat)

Converting dataframe of cases to a dataframe of counts

In [4]:
countdf1 <- as.data.frame(table(df_cat))
countdf2 <- as.data.frame(table(df_cat_cyto_cleaned))
#countdf1
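
To make the cases-to-counts step concrete, here is a small sketch on made-up data with three binary variables (the names `a`, `b`, `d` are hypothetical). `table()` cross-classifies every column, and `as.data.frame()` flattens the result to one row per cell with a `Freq` column, including cells that no case falls into.

```r
# Toy case-level data: one row per patient, names are hypothetical.
cases <- data.frame(
  a = factor(c(0, 0, 1, 1, 1, 0), levels = 0:1),
  b = factor(c(0, 1, 1, 1, 0, 0), levels = 0:1),
  d = factor(c(0, 0, 0, 1, 0, 0), levels = 0:1)
)

# One row per combination of levels, with its observed count in Freq.
counts <- as.data.frame(table(cases))
print(counts)

n_cells   <- nrow(counts)             # 2^3 combinations
prop_zero <- mean(counts$Freq == 0)   # share of empty cells
print(c(cells = n_cells, prop_zero = prop_zero))
```

The number of cells doubles with each binary variable, so 14 binary columns give 2^14 = 16384 cells. With a table that large most cells are probably empty, which may be related to the "fitted rates numerically 0" warnings further down.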

Without Cytogenetics: 
First model: Assuming all are independent 

In [5]:
#fitting the model and displaying the summary 
mod0 <- glm(Freq ~ gender + Platelets...50..Y.N. + Anemia..hb...100...Y.N. + anticoagulation..Y.N. + anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + Prior.lines.of.therapy.1 + Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + Prior.lines.of.therapy.4 + Prior.lines.of.therapy.6 + Major.Bleed + Minor.Bleed, 
            data = countdf1, family = poisson)
summary(mod0)

#displaying the p-value of the model
pchisq(deviance(mod0), df = df.residual(mod0), lower.tail = FALSE)
print("The null hypothesis of this test is that the expected frequencies satisfy the given loglinear model. Since p > 0.05, we fail to reject it.")


Call:
glm(formula = Freq ~ gender + Platelets...50..Y.N. + Anemia..hb...100...Y.N. + 
    anticoagulation..Y.N. + anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + 
    Prior.lines.of.therapy.1 + Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + 
    Prior.lines.of.therapy.4 + Prior.lines.of.therapy.6 + Major.Bleed + 
    Minor.Bleed, family = poisson, data = countdf1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9397  -0.0242  -0.0053  -0.0011   3.8538  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 0.9684     0.1838   5.268 1.38e-07 ***
gender1                     0.4951     0.1586   3.122   0.0018 ** 
Platelets...50..Y.N.1      -1.2376     0.1843  -6.717 1.86e-11 ***
Anemia..hb...100...Y.N.1   -0.1542     0.1543  -0.999   0.3178    
anticoagulation..Y.N.1     -1.4542     0.1963  -7.407 1.29e-13 ***
anti.platelet..Y.N.1       -1.4542     0.1963  -7.407 1.29e-13 ***
PMHx.bleeding.risk..Y.N.1 

[1] "The null hypothesis of this test is that the expected frequencies satisfy the given loglinear model. Since p > 0.05, we fail to reject it."
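
The deviance goodness-of-fit test used above can be seen on a table small enough to check by hand. This sketch fits the independence model to a made-up 2x2 table with an obvious association; the counts and the names `x` and `y` are hypothetical. The residual deviance is compared to a chi-squared distribution on the residual degrees of freedom (4 cells minus 3 parameters = 1).

```r
# Toy 2x2 table with a clear association; counts are made up.
toy_counts <- data.frame(
  x    = factor(c(0, 1, 0, 1)),
  y    = factor(c(0, 0, 1, 1)),
  Freq = c(40, 10, 10, 40)
)

# Independence model: main effects only, no interaction term.
indep <- glm(Freq ~ x + y, data = toy_counts, family = poisson)

# Upper-tail chi-squared probability of the residual deviance.
p_gof <- pchisq(deviance(indep), df = df.residual(indep), lower.tail = FALSE)
print(c(deviance = deviance(indep), df = df.residual(indep), p = p_gof))
```

Here the p-value is tiny, so independence is rejected. A large p-value only means there is no evidence against the model. The chi-squared approximation also becomes unreliable when many expected counts are small, so the result for a very sparse table should be read with caution.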


With Cytogenetics: 
First model: Assuming all are independent 

In [6]:
#fitting the model and displaying the summary 
mod00 <- glm(Freq ~ HR.Molecular.Cytogenetics..Y.N. + gender + Platelets...50..Y.N. + Anemia..hb...100...Y.N.  + anticoagulation..Y.N. + anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + Prior.lines.of.therapy.1 + Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + Prior.lines.of.therapy.4 + Prior.lines.of.therapy.6 + Major.Bleed + Minor.Bleed, 
            data = countdf2, family = poisson)
summary(mod00)

#displaying the p-value of the model
pchisq(deviance(mod00), df = df.residual(mod00), lower.tail = FALSE)
print("The null hypothesis of this test is that the expected frequencies satisfy the given loglinear model. Since p > 0.05, we fail to reject it.")


Call:
glm(formula = Freq ~ HR.Molecular.Cytogenetics..Y.N. + gender + 
    Platelets...50..Y.N. + Anemia..hb...100...Y.N. + anticoagulation..Y.N. + 
    anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + Prior.lines.of.therapy.1 + 
    Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + Prior.lines.of.therapy.4 + 
    Prior.lines.of.therapy.6 + Major.Bleed + Minor.Bleed, family = poisson, 
    data = countdf2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1011  -0.0144  -0.0030  -0.0006   4.4980  

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       0.23751    0.21119   1.125  0.26076    
HR.Molecular.Cytogenetics..Y.N.1 -0.08113    0.16453  -0.493  0.62197    
gender1                           0.55431    0.17075   3.246  0.00117 ** 
Platelets...50..Y.N.1            -1.13498    0.19159  -5.924 3.14e-09 ***
Anemia..hb...100...Y.N.1         -0.16252    0.16494  -0.985  0.32447    
anticoagulat

[1] "The null hypothesis of this test is that the expected frequencies satisfy the given loglinear model. Since p > 0.05, we fail to reject it."


Without cytogenetics: 
Compare fitted values to the observed values

In [7]:
cbind(mod0$data, fitted(mod0))
print("Evidently the fit could be better")

gender,Platelets...50..Y.N.,Anemia..hb...100...Y.N.,Anemia..hb...110...Y.N.,anticoagulation..Y.N.,anti.platelet..Y.N.,PMHx.bleeding.risk..Y.N.,Prior.lines.of.therapy.1,Prior.lines.of.therapy.2,Prior.lines.of.therapy.3,Prior.lines.of.therapy.4,Prior.lines.of.therapy.6,Major.Bleed,Minor.Bleed,Freq,fitted(mod0)
0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2.6336574
1,0,0,0,0,0,0,0,0,0,0,0,0,0,9,4.3208442
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.7639617
1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1.2533746
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2.2574206
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3.7035807
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0.6548243
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1.0743211
0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2.6336574
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,4.3208442


[1] "Evidently the fit could be better"
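
Rather than judging the fit by eye, the observed and fitted counts can be put side by side with Pearson residuals, (observed - fitted) / sqrt(fitted). The sketch below does this for a made-up 2x2 table. The cutoff of 2 in absolute value is a common rule of thumb that I am assuming here, not something taken from the notebook.

```r
# Toy 2x2 table; counts and variable names are made up.
toy_counts <- data.frame(
  x    = factor(c(0, 1, 0, 1)),
  y    = factor(c(0, 0, 1, 1)),
  Freq = c(40, 10, 10, 40)
)
indep <- glm(Freq ~ x + y, data = toy_counts, family = poisson)

# Observed counts, fitted counts and Pearson residuals in one data frame.
comparison <- cbind(
  toy_counts,
  fitted  = fitted(indep),
  pearson = residuals(indep, type = "pearson")
)
print(comparison)

# Cells whose Pearson residual exceeds 2 in absolute value.
poor <- comparison[abs(comparison$pearson) > 2, ]
print(nrow(poor))
```

On the real tables, sorting by the absolute residual would point to the combinations of variables that the independence model describes worst.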


With cytogenetics: Compare fitted values to the observed values

In [8]:
cbind(mod00$data, fitted(mod00))
print("Evidently the fit could be better")

gender,Platelets...50..Y.N.,Anemia..hb...100...Y.N.,Anemia..hb...110...Y.N.,HR.Molecular.Cytogenetics..Y.N.,anticoagulation..Y.N.,anti.platelet..Y.N.,PMHx.bleeding.risk..Y.N.,Prior.lines.of.therapy.1,Prior.lines.of.therapy.2,Prior.lines.of.therapy.3,Prior.lines.of.therapy.4,Prior.lines.of.therapy.6,Major.Bleed,Minor.Bleed,Freq,fitted(mod00)
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1.2680815
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.2074011
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.4075976
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.7095218
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0778693
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1.8762910
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.3464580
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.6030935
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1.2680815
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2.2074011


[1] "Evidently the fit could be better"


Without cytogenetics: Second model: homogeneous association ("pairwise interactions") 

In [9]:
#fitting the model and displaying the summary 
mod1 <- glm(Freq ~ (gender + Platelets...50..Y.N. + Anemia..hb...100...Y.N. + anticoagulation..Y.N. + anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + Prior.lines.of.therapy.1 + Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + Prior.lines.of.therapy.4 + Prior.lines.of.therapy.6 + Major.Bleed + Minor.Bleed)^2, 
            data = countdf1, family = poisson)
summary(mod1)


#displaying the p-value of the model
pchisq(deviance(mod1), df = df.residual(mod1), lower.tail = FALSE)
print("The high p-value says we have insufficient evidence to reject the null hypothesis that the expected frequencies satisfy our model.")

“glm.fit: fitted rates numerically 0 occurred”


Call:
glm(formula = Freq ~ (gender + Platelets...50..Y.N. + Anemia..hb...100...Y.N. + 
    anticoagulation..Y.N. + anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + 
    Prior.lines.of.therapy.1 + Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + 
    Prior.lines.of.therapy.4 + Prior.lines.of.therapy.6 + Major.Bleed + 
    Minor.Bleed)^2, family = poisson, data = countdf1)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.414   0.000   0.000   0.000   2.918  

Coefficients:
                                                      Estimate Std. Error
(Intercept)                                          1.203e+00  2.980e-01
gender1                                              5.600e-01  3.363e-01
Platelets...50..Y.N.1                               -3.594e+00  7.131e-01
Anemia..hb...100...Y.N.1                            -6.181e-01  3.809e-01
anticoagulation..Y.N.1                              -1.515e+00  4.969e-01
anti.platelet..Y.N.1                                -1.88

[1] "The high p-value says we have insufficient evidence to reject the null hypothesis that the expected frequencies satisfy our model."
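
Because the independence model is nested inside the pairwise-interaction model, the two can also be compared directly with a likelihood-ratio test: the drop in deviance is referred to a chi-squared distribution whose degrees of freedom equal the number of added interaction terms. This is a sketch on a made-up 2x2x2 table with hypothetical variable names, not a rerun of the models above.

```r
# Toy 2x2x2 table; counts and variable names are made up.
toy3 <- expand.grid(a = factor(0:1), b = factor(0:1), d = factor(0:1))
toy3$Freq <- c(30, 12, 14, 28, 22, 18, 9, 25)

# Mutual independence versus all pairwise interactions.
m_indep <- glm(Freq ~ a + b + d, data = toy3, family = poisson)
m_pair  <- glm(Freq ~ (a + b + d)^2, data = toy3, family = poisson)

# Likelihood-ratio test on the difference in deviance.
lrt <- anova(m_indep, m_pair, test = "Chisq")
print(lrt)
```

A small p-value in the second row would indicate that at least one pairwise association is needed, which is the question this notebook is asking about the predictors.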


With cytogenetics: Second model: homogeneous association ("pairwise interactions") 

In [10]:
#fitting the model and displaying the summary 
mod11 <- glm(Freq ~ (HR.Molecular.Cytogenetics..Y.N. + gender + Platelets...50..Y.N. + Anemia..hb...100...Y.N. + anticoagulation..Y.N. + anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + Prior.lines.of.therapy.1 + Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + Prior.lines.of.therapy.4 + Prior.lines.of.therapy.6 + Major.Bleed + Minor.Bleed)^2, 
            data = countdf2, family = poisson)
summary(mod11)


#displaying the p-value of the model
pchisq(deviance(mod11), df = df.residual(mod11), lower.tail = FALSE)
print("The high p-value says we have insufficient evidence to reject the null hypothesis that the expected frequencies satisfy our model.")

“glm.fit: fitted rates numerically 0 occurred”


Call:
glm(formula = Freq ~ (HR.Molecular.Cytogenetics..Y.N. + gender + 
    Platelets...50..Y.N. + Anemia..hb...100...Y.N. + anticoagulation..Y.N. + 
    anti.platelet..Y.N. + PMHx.bleeding.risk..Y.N. + Prior.lines.of.therapy.1 + 
    Prior.lines.of.therapy.2 + Prior.lines.of.therapy.3 + Prior.lines.of.therapy.4 + 
    Prior.lines.of.therapy.6 + Major.Bleed + Minor.Bleed)^2, 
    family = poisson, data = countdf2)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.236   0.000   0.000   0.000   3.394  

Coefficients:
                                                             Estimate
(Intercept)                                                -1.284e+00
HR.Molecular.Cytogenetics..Y.N.1                            2.398e+00
gender1                                                     7.630e-01
Platelets...50..Y.N.1                                      -4.293e+00
Anemia..hb...100...Y.N.1                                   -9.674e-01
anticoagulation..Y.N.1                    

[1] "The high p-value says we have insufficient evidence to reject the null hypothesis that the expected frequencies satisfy our model."


Without cytogenetics: Model 2: compare fitted to observed 

In [11]:
cbind(mod1$data, fitted(mod1))
print("Evidently the fit is better")

gender,Platelets...50..Y.N.,Anemia..hb...100...Y.N.,Anemia..hb...110...Y.N.,anticoagulation..Y.N.,anti.platelet..Y.N.,PMHx.bleeding.risk..Y.N.,Prior.lines.of.therapy.1,Prior.lines.of.therapy.2,Prior.lines.of.therapy.3,Prior.lines.of.therapy.4,Prior.lines.of.therapy.6,Major.Bleed,Minor.Bleed,Freq,fitted(mod1)
0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,3.329607737
1,0,0,0,0,0,0,0,0,0,0,0,0,0,9,5.829265739
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.091515437
1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0.350161043
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1.794630676
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2.426655833
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0.198416605
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0.586359274
0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,3.329607737
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5.829265739


[1] "Evidently the fit is better"


With cytogenetics: Model 2: compare fitted to observed 

In [12]:
cbind(mod11$data, fitted(mod11))
print("Evidently the fit is better")

gender,Platelets...50..Y.N.,Anemia..hb...100...Y.N.,Anemia..hb...110...Y.N.,HR.Molecular.Cytogenetics..Y.N.,anticoagulation..Y.N.,anti.platelet..Y.N.,PMHx.bleeding.risk..Y.N.,Prior.lines.of.therapy.1,Prior.lines.of.therapy.2,Prior.lines.of.therapy.3,Prior.lines.of.therapy.4,Prior.lines.of.therapy.6,Major.Bleed,Minor.Bleed,Freq,fitted(mod11)
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0.276792965
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.593636519
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.003783741
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.022211108
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.105204650
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.180673564
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.005658099
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.026595826
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0.276792965
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0.593636519


[1] "Evidently the fit is better"
