To analyze the effect of the independent variables on the outcome (dependent) variable, bleeding, holistically, I must perform a multivariate analysis. A logistic regression model is a suitable choice for a dataset with a binary outcome. I'll explore partitioning the outcome data in three different ways:

1. The binary outcome is Major vs. Minor bleeding
2. The binary outcome is one of:
   - Major vs. Minor + None bleeding
   - Major + Minor vs. None bleeding
   - Major vs. None
   - Minor vs. None
   - Major vs. Minor

3. The ternary outcome is Major vs. Minor vs. None (this requires a multinomial classification model: multinomial logistic regression or linear discriminant analysis)
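
The three partitions above can be sketched in code. This is a minimal illustration only: it assumes a hypothetical three-level series named `bleed` with the values "Major", "Minor", and "None", and the toy values are made up. The real dataset stores bleeding in columns such as `Major Bleed (Y/N)`, so the names here are placeholders.

```python
import pandas as pd

# Hypothetical three-level outcome; values are invented for illustration.
bleed = pd.Series(["Major", "Minor", "None", "Minor", "None", "Major"], name="bleed")

# Partition 1: Major vs. Minor (rows with no bleeding are dropped first).
major_vs_minor = (bleed[bleed != "None"] == "Major").astype(int)

# Partition 2, showing two of the listed binary options.
major_vs_rest = (bleed == "Major").astype(int)  # Major vs. Minor + None
any_vs_none = (bleed != "None").astype(int)     # Major + Minor vs. None

# Partition 3: keep all three levels as integer codes for a multinomial model.
levels = pd.CategoricalDtype(["None", "Minor", "Major"])
ternary = bleed.astype(levels).cat.codes

print(major_vs_minor.tolist())
print(major_vs_rest.tolist())
print(any_vs_none.tolist())
print(ternary.tolist())
```

The remaining binary options in item 2 (Major vs. None, Minor vs. None) follow the same pattern as partition 1: filter out the level that is not being compared, then test for the level of interest.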

The independent variables that will be included in this model are:
- age at diagnosis, gender, platelet count, anti-coagulation, antiplatelet, invasive procedure (which I believe is PMHx bleeding risk), (maybe INR later), molecular/cytogenetics, anemia, prior lines of therapy

Beginning with 1:

Import necessary libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("complete")

complete


Read dataset

In [2]:
df = pd.read_csv("/Users/anthonyquint/Desktop/LHSC_Work_Folder/Mina/Bleeding_study/Ibruntinib Data Set, June 10,2021 de- identified data.csv")
df.head()

Unnamed: 0,Age at diagnosis,gender,Diagnosis year,Plt at diagnosis,plt at start of ibrutinib,plt at the time of bleed,Plt Nadir while on Ibrutinib,Platelets < 50 (Y/N),hb at diognosis,hb at start of Ibrutinib,...,action?,post op bleed? /action,INR,past medical history,PMHx bleeding risk (Y/N),Ibrutinib Dose,Comments,other ibrutinib SE,Unnamed: 43,Unnamed: 44
0,48,f,2006,260,15,,15,Y,130,71,...,,,,"deppression, schwanoma of leg",Y,"420mg,",ITP at the time of starting ibrutinib,,,
1,66,m,2017,175,83,155.0,93,N,145,93,...,,,,"cryoglobinemia,MGUS,CAD,HTN,COPD",N,420mg,,,,
2,74,F,2016,189,200,,nl plts,N,116,87,...,,,,"dm2,htn,",N,"420mg,",reaction to first obino so switched to ibrutin...,,,
3,53,F,2002,237,67,,40,Y,135,118,...,,,,"HTN,B12 def,IDA",N,420mg,WAIHA,easy bruising,,
4,60,m,1999,198,85,70.0,49,Y,154,104,...,died,,1.1,"prostitis,mycosis,chronc sinusitis",N,ibrutinib dose reduced to 140 in oct 2015 for ...,"cutaneous oral mucosal involvement w CLL, als...",,,


Cleaning the dataset

In [3]:
#Removing all columns except the columns corresponding to our relevant 
#independent variables (indicated at top of notebook) and dependent variable (Major Bleed (Y/N))

df = df[["Age at diagnosis","gender","Platelets < 50 (Y/N)","Anemia (hb < 110) (Y/N)", "HR Molecular/Cytogenetics (Y/N)","Prior lines of therapy","anticoagulation (Y/N)","anti platelet (Y/N)","PMHx bleeding risk (Y/N)","Major Bleed (Y/N)"]]
#df = df[["Age at diagnosis","gender","Platelets < 50 (Y/N)","Anemia (hb < 110) (Y/N)","Prior lines of therapy","anticoagulation (Y/N)","anti platelet (Y/N)","PMHx bleeding risk (Y/N)","Major Bleed (Y/N)"]]


#Gender has inputs of F or M, but sometimes they are lowercase. Using "upper()" to ensure they are all uppercase

df['gender'] = df['gender'].str.upper()

#Removing all rows that contain NaN values. Since the independent variables are almost completely filled in,
#this mainly drops rows where the dependent variable is empty (no bleeding event), leaving behind only
#major and minor bleeding events

df = df.dropna().copy() 

# Removing rows that have "unknown" cytogenetics.
# This decreases the dataset size, which was already very small. However, skipping this step
# (keeping those rows) still doesn't make the coefficients significant.
df = df[~df['HR Molecular/Cytogenetics (Y/N)'].isin(['unknown'])] 

df.head()


Unnamed: 0,Age at diagnosis,gender,Platelets < 50 (Y/N),Anemia (hb < 110) (Y/N),HR Molecular/Cytogenetics (Y/N),Prior lines of therapy,anticoagulation (Y/N),anti platelet (Y/N),PMHx bleeding risk (Y/N),Major Bleed (Y/N)
1,66,M,N,Y,Y,0,N,Y,N,N
4,60,M,Y,Y,N,3,N,N,N,Y
6,64,M,N,Y,N,1,Y,N,N,N
8,66,M,N,Y,N,1,Y,Y,N,N
10,56,M,N,Y,Y,0,Y,N,Y,Y


Counting number of people who had major vs. minor bleed

In [4]:
df['Major Bleed (Y/N)'].value_counts(dropna=False)   #Counting number of people who had major vs. minor bleed

## should 0 and 1 appear roughly at equal frequencies? ##

N    24
Y    12
Name: Major Bleed (Y/N), dtype: int64
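
On the question in the cell above: as far as I know, logistic regression does not require the two classes to appear at equal frequencies. What matters more for a dataset this small is how many events there are in the rarer class relative to the number of predictors. The sketch below uses only the counts printed above and the nine predictors used in this notebook; the comparison against roughly 10 events per variable is a commonly cited rule of thumb, not a hard requirement.

```python
import pandas as pd

# Counts copied from the value_counts() output above (N = minor, Y = major bleed).
counts = pd.Series({"N": 24, "Y": 12}, name="Major Bleed (Y/N)")

# Share of each class in the cleaned dataset.
proportions = counts / counts.sum()

# Events per variable: size of the rarer class divided by the number of predictors.
n_predictors = 9  # number of independent variables kept in this notebook
events_per_variable = counts.min() / n_predictors

print(proportions.round(3).to_dict())
print(round(events_per_variable, 2))
```

With 12 major bleeds spread over 9 predictors, the ratio is well below that rule of thumb, which suggests the sample may be too small for this many variables.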

Converting categorical data into numerical representation 

In [5]:
number = LabelEncoder()
df['gender'] = number.fit_transform(df['gender'].astype('str'))
df['Platelets < 50 (Y/N)'] = number.fit_transform(df['Platelets < 50 (Y/N)'].astype('str'))
df['Anemia (hb < 110) (Y/N)'] = number.fit_transform(df['Anemia (hb < 110) (Y/N)'].astype('str'))
df['HR Molecular/Cytogenetics (Y/N)'] = number.fit_transform(df['HR Molecular/Cytogenetics (Y/N)'].astype('str'))
df['anticoagulation (Y/N)'] = number.fit_transform(df['anticoagulation (Y/N)'].astype('str'))
df['anti platelet (Y/N)'] = number.fit_transform(df['anti platelet (Y/N)'].astype('str'))
df['PMHx bleeding risk (Y/N)'] = number.fit_transform(df['PMHx bleeding risk (Y/N)'].astype('str'))
df['Major Bleed (Y/N)'] = number.fit_transform(df['Major Bleed (Y/N)'].astype('str'))


## How to determine how python numerically labelled the categories? ##

df.head()

Unnamed: 0,Age at diagnosis,gender,Platelets < 50 (Y/N),Anemia (hb < 110) (Y/N),HR Molecular/Cytogenetics (Y/N),Prior lines of therapy,anticoagulation (Y/N),anti platelet (Y/N),PMHx bleeding risk (Y/N),Major Bleed (Y/N)
1,66,1,0,1,1,0,0,1,0,0
4,60,1,1,1,0,3,0,0,0,1
6,64,1,0,1,0,1,1,0,0,0
8,66,1,0,1,0,1,1,1,0,0
10,56,1,0,1,1,0,1,0,1,1
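
On the question in the cell above: a fitted `LabelEncoder` stores the original labels in its `classes_` attribute, in sorted order, and a label's position in that array is the integer it was assigned. The sketch below uses a toy column standing in for one of the Y/N columns. Note that the notebook reuses a single encoder (`number`) for every column, so its `classes_` only reflects the last column it was fitted on; inspecting each mapping would need either a check right after each `fit_transform` or one encoder per column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for one of the Y/N columns above.
col = pd.Series(["N", "Y", "Y", "N"])

encoder = LabelEncoder()
codes = encoder.fit_transform(col)

# classes_ holds the original labels in sorted order; the index is the assigned code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels.
print(encoder.inverse_transform([0, 1]))
```

Because the labels are sorted, "N" maps to 0 and "Y" maps to 1, and for gender "F" maps to 0 and "M" maps to 1, which matches the encoded table above.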


Splitting data into independent and dependent variables, and then into training and testing set

In [14]:
clinical_features = ['Age at diagnosis','gender','Platelets < 50 (Y/N)','Anemia (hb < 110) (Y/N)','HR Molecular/Cytogenetics (Y/N)','Prior lines of therapy','anticoagulation (Y/N)','anti platelet (Y/N)','PMHx bleeding risk (Y/N)']
#clinical_features = ['Age at diagnosis','gender','Platelets < 50 (Y/N)','Anemia (hb < 110) (Y/N)','Prior lines of therapy','anticoagulation (Y/N)','anti platelet (Y/N)','PMHx bleeding risk (Y/N)']

X = df[clinical_features]   #Independent variables 
y = df['Major Bleed (Y/N)']  #Dependent variable

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0) #Splitting variables into training/testing set

np.asarray(X)

array([['66', 1, 0, 1, 1, 0, 0, 1, 0],
       ['60', 1, 1, 1, 0, 3, 0, 0, 0],
       ['64', 1, 0, 1, 0, 1, 1, 0, 0],
       ['66', 1, 0, 1, 0, 1, 1, 1, 0],
       ['56', 1, 0, 1, 1, 0, 1, 0, 1],
       ['72', 0, 0, 1, 0, 0, 0, 1, 0],
       ['56', 1, 1, 1, 1, 3, 1, 0, 0],
       ['70', 0, 0, 1, 0, 1, 0, 1, 0],
       ['72', 1, 0, 1, 0, 0, 1, 1, 0],
       ['62', 0, 0, 1, 0, 3, 1, 1, 0],
       ['66', 0, 0, 1, 0, 1, 1, 0, 0],
       ['82', 0, 0, 1, 1, 0, 0, 0, 0],
       ['59', 1, 1, 1, 1, 2, 0, 0, 0],
       ['53', 1, 0, 0, 0, 1, 0, 0, 0],
       ['61', 0, 1, 1, 1, 2, 0, 0, 0],
       ['72', 1, 0, 0, 0, 1, 0, 0, 0],
       ['80', 1, 1, 1, 0, 1, 0, 0, 0],
       ['78', 1, 0, 1, 1, 1, 0, 0, 0],
       ['41', 1, 1, 1, 0, 1, 0, 0, 0],
       ['69', 1, 1, 1, 0, 0, 0, 1, 0],
       ['68', 0, 0, 0, 1, 0, 1, 0, 0],
       ['66', 1, 1, 1, 0, 1, 0, 0, 1],
       ['38', 1, 0, 0, 0, 2, 1, 0, 0],
       ['68', 1, 0, 0, 1, 0, 0, 0, 0],
       ['79', 1, 0, 0, 1, 0, 1, 0, 1],
       ['50', 1, 0, 1, 0,
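
The quoted ages in the array output above indicate that `Age at diagnosis` is stored as strings rather than numbers. A minimal sketch of one way to handle this, together with a stratified split that keeps the major/minor ratio the same in the training and testing sets, is shown below. The names `X_demo` and `y_demo` are hypothetical and the outcome values are toy data, not the study data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: ages stored as strings, as in the array output above.
X_demo = pd.DataFrame({
    "Age at diagnosis": ["66", "60", "64", "66", "56", "72",
                         "56", "70", "72", "62", "66", "82"],
    "gender": [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})
# Toy outcome with 4 positives out of 12 (invented for illustration).
y_demo = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Convert to numbers; anything unparseable becomes NaN instead of raising an error.
X_demo["Age at diagnosis"] = pd.to_numeric(X_demo["Age at diagnosis"], errors="coerce")

# stratify=y_demo preserves the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_demo["Age at diagnosis"].dtype)
print(len(X_tr), len(X_te), int(y_tr.sum()), int(y_te.sum()))
```

Converting once, up front, avoids having to call `astype(float)` at every later step, and `errors="coerce"` makes any unexpected non-numeric entry visible as a NaN.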

Model Implementation 

In [8]:
# Fit on the full dataset (not only the training split) to inspect coefficients and p-values.
# Note: sm.Logit does not add an intercept automatically; wrap X in sm.add_constant(...) to include one.
logit_model = sm.Logit(y.astype(float), X.astype(float))
result = logit_model.fit()
print(result.summary())

print("None of these coefficients are statistically significant. This is most likely because the dataset is far too small for the number of independent variables. The coefficients remain insignificant even when I remove the cytogenetics column in an attempt to retain a larger dataset.")

Optimization terminated successfully.
         Current function value: 0.555055
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:      Major Bleed (Y/N)   No. Observations:                   36
Model:                          Logit   Df Residuals:                       27
Method:                           MLE   Df Model:                            8
Date:                Tue, 15 Jun 2021   Pseudo R-squ.:                  0.1280
Time:                        13:26:48   Log-Likelihood:                -19.982
converged:                       True   LL-Null:                       -22.915
Covariance Type:            nonrobust   LLR p-value:                    0.6623
                                      coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Age at diagnosis                   -0.0261      0.019     -1.397  