To analyze the effect of the independent variables on the outcome (dependent) variable, bleeding, holistically, I need to perform a multivariate analysis. A logistic regression model is a suitable choice for a dataset with a binary outcome. I'll explore partitioning the outcome data in three different ways:

1. The binary outcome is Major vs. Minor bleeding
2. The binary outcome is one of:
   - Major vs. Minor + None bleeding
   - Major + Minor vs. None bleeding
   - Major vs. None
   - Minor vs. None
   - Major vs. Minor
3. The ternary outcome is Major vs. Minor vs. None (this requires a multinomial classification model: multinomial logistic regression or linear discriminant analysis)
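
The partitions listed above can all be derived from the same three-level label. A minimal sketch, assuming the coding used later in this notebook ('Y' = major, 'N' = minor, 'M' = no bleed); the helper name `binary_outcome` and the toy labels are hypothetical and not part of the dataset:

```python
def binary_outcome(labels, positive, negative):
    """Map three-level labels to 1 (positive group) or 0 (negative group).

    Labels that belong to neither group are dropped, which is what
    comparisons such as Major vs. Minor need (the None rows are excluded).
    """
    out = []
    for lab in labels:
        if lab in positive:
            out.append(1)
        elif lab in negative:
            out.append(0)
    return out


# Toy labels for illustration only (hypothetical, not taken from the data)
toy = ['M', 'N', 'Y', 'M', 'N']

major_vs_rest = binary_outcome(toy, positive={'Y'}, negative={'N', 'M'})
any_vs_none = binary_outcome(toy, positive={'Y', 'N'}, negative={'M'})
major_vs_minor = binary_outcome(toy, positive={'Y'}, negative={'N'})

print(major_vs_rest)
print(any_vs_none)
print(major_vs_minor)
```

With pandas, the same idea would probably be a filter on the label column followed by `isin`; the plain-Python version is only meant to make the grouping explicit.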

The independent variables that will be included in this model are:
- age at diagnosis, gender, platelet count, anticoagulation, antiplatelet, invasive procedure (which I believe is PMHx bleeding risk), molecular/cytogenetics, anemia, prior lines of therapy (and maybe INR later)

Continuing with option 3, where I'll apply a multinomial logistic regression.

Import necessary libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
import imblearn
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print("complete")

complete


Read dataset

In [2]:
df = pd.read_csv("/Users/anthonyquint/Desktop/LHSC_Work_Folder/Mina/Bleeding_study/Ibrutinib Data Set, June 10,2021, de- identified data.csv")
df.head()

Unnamed: 0,Age at diagnosis,gender,Diagnosis year,Plt at diagnosis,plt at start of ibrutinib,plt at the time of bleed,Plt Nadir while on Ibrutinib,Platelets < 50 (Y/N),hb at diognosis,hb at start of Ibrutinib,...,action?,post op bleed? /action,INR,past medical history,PMHx bleeding risk (Y/N),Ibrutinib Dose,Comments,other ibrutinib SE,Unnamed: 43,Unnamed: 44
0,48,f,2006,260,15,,15,Y,130,71,...,,,,"deppression, schwanoma of leg",Y,"420mg,",ITP at the time of starting ibrutinib,,,
1,66,m,2017,175,83,155.0,93,N,145,93,...,,,,"cryoglobinemia,MGUS,CAD,HTN,COPD",N,420mg,,,,
2,74,F,2016,189,200,,nl plts,N,116,87,...,,,,"dm2,htn,",N,"420mg,",reaction to first obino so switched to ibrutin...,,,
3,53,F,2002,237,67,,40,Y,135,118,...,,,,"HTN,B12 def,IDA",N,420mg,WAIHA,easy bruising,,
4,60,m,1999,198,85,70.0,49,Y,154,104,...,died,,1.1,"prostitis,mycosis,chronc sinusitis",N,ibrutinib dose reduced to 140 in oct 2015 for ...,"cutaneous oral mucosal involvement w CLL, als...",,,


Cleaning the dataset

In [3]:
#Removing all columns except the columns corresponding to our relevant 
#independent variables (indicated at top of notebook) and dependent variable (Major Bleed (Y/N))

# .copy() gives an independent DataFrame, so the column assignments below do not raise SettingWithCopyWarning
df = df[["Age at diagnosis","gender","Platelets < 50 (Y/N)","Anemia (hb < 110) (Y/N)", "HR Molecular/Cytogenetics (Y/N)","Prior lines of therapy","anticoagulation (Y/N)","anti platelet (Y/N)","PMHx bleeding risk (Y/N)","Major Bleed (Y/N)"]].copy()

#Gender has inputs of F or M, but sometimes they are lowercase. Using "upper()" to ensure they are all uppercase

df['gender'] = df['gender'].str.upper()

# removing rows that have "Unknown" cytogenetics.  
# df = df[~df['HR Molecular/Cytogenetics (Y/N)'].isin(['unknown'])] 

#Changing NaN values in "Major Bleed (Y/N)" to M, i.e. Y = major bleed, N = minor bleed, M = no bleed
#(assigning the result back instead of using inplace=True on a selected column)
df['Major Bleed (Y/N)'] = df['Major Bleed (Y/N)'].fillna('M')

# Delete index 112 since it has a question mark for age 
df = df.drop(labels=112, axis=0)

df.head()

Unnamed: 0,Age at diagnosis,gender,Platelets < 50 (Y/N),Anemia (hb < 110) (Y/N),HR Molecular/Cytogenetics (Y/N),Prior lines of therapy,anticoagulation (Y/N),anti platelet (Y/N),PMHx bleeding risk (Y/N),Major Bleed (Y/N)
0,48,F,Y,Y,Y,1,N,N,Y,M
1,66,M,N,Y,Y,0,N,Y,N,N
2,74,F,N,Y,Y,1,N,N,N,M
3,53,F,Y,N,N,2,N,N,N,M
4,60,M,Y,Y,N,3,N,N,N,Y


Counting number of people who had major vs. minor vs. no bleed

In [4]:
df['Major Bleed (Y/N)'].value_counts(dropna=False)   #Counting number of people who had major vs. minor vs. no bleed

## should 0 and 1 appear roughly at equal frequencies? ##

M    127
N     25
Y     17
Name: Major Bleed (Y/N), dtype: int64
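
The counts above show a strong class imbalance, which matters when reading any accuracy figure later on. A small sketch that turns the printed counts into proportions; the numbers are copied from the output above, and the variable names are my own:

```python
# Counts copied from the value_counts output above
counts = {'M': 127, 'N': 25, 'Y': 17}   # M = no bleed, N = minor, Y = major

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, p in proportions.items():
    print(f"{label}: {p:.3f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Roughly three quarters of the patients fall in the no-bleed class, so a model that always predicts "no bleed" would already look fairly accurate.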

Converting categorical data into numerical representation

In [5]:
# LabelEncoder assigns integer codes in sorted label order, so the outcome becomes
# M (no bleed) -> 0, N (minor) -> 1, Y (major) -> 2
number = LabelEncoder()
categorical_columns = ['gender', 'Platelets < 50 (Y/N)', 'Anemia (hb < 110) (Y/N)',
                       'HR Molecular/Cytogenetics (Y/N)', 'anticoagulation (Y/N)',
                       'anti platelet (Y/N)', 'PMHx bleeding risk (Y/N)', 'Major Bleed (Y/N)']
for column in categorical_columns:
    df[column] = number.fit_transform(df[column].astype('str'))


df.head()

Unnamed: 0,Age at diagnosis,gender,Platelets < 50 (Y/N),Anemia (hb < 110) (Y/N),HR Molecular/Cytogenetics (Y/N),Prior lines of therapy,anticoagulation (Y/N),anti platelet (Y/N),PMHx bleeding risk (Y/N),Major Bleed (Y/N)
0,48,0,1,1,1,1,0,0,1,0
1,66,1,0,1,1,0,0,1,0,1
2,74,0,0,1,1,1,0,0,0,0
3,53,0,1,0,0,2,0,0,0,0
4,60,1,1,1,0,3,0,0,0,2
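
Because the integer codes determine which class is "0", "1" and "2" in everything that follows, it helps to make the mapping explicit. A minimal sketch that mimics the sorted-label behaviour of `LabelEncoder` in plain Python; `alphabetical_codes` is a hypothetical helper, not a scikit-learn function:

```python
def alphabetical_codes(values):
    """Mimic LabelEncoder's mapping: sorted unique labels -> 0, 1, 2, ..."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


bleed_codes = alphabetical_codes(['M', 'N', 'Y', 'M'])
gender_codes = alphabetical_codes(['F', 'M', 'M'])

print(bleed_codes)   # {'M': 0, 'N': 1, 'Y': 2}
print(gender_codes)  # {'F': 0, 'M': 1}
```

This agrees with the encoded table above: row 0 (M, no bleed) became 0, row 1 (N, minor) became 1, and row 4 (Y, major) became 2.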


Splitting data into independent and dependent variables, then into training and testing set, (and then upsample the training sets?)

In [6]:
clinical_features = ['Age at diagnosis','gender','Platelets < 50 (Y/N)','Anemia (hb < 110) (Y/N)','HR Molecular/Cytogenetics (Y/N)','Prior lines of therapy','anticoagulation (Y/N)','anti platelet (Y/N)','PMHx bleeding risk (Y/N)']
#clinical_features = ['Age at diagnosis','gender','Platelets < 50 (Y/N)','Anemia (hb < 110) (Y/N)','Prior lines of therapy','anticoagulation (Y/N)','anti platelet (Y/N)','PMHx bleeding risk (Y/N)']

X = df[clinical_features]   #Independent variables 
y = df['Major Bleed (Y/N)']  #Dependent variable (0 = no bleed, 1 = minor, 2 = major)

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0,stratify=y) #Splitting variables into training/testing set

#########

X_Y_train = pd.concat([X_train,y_train],axis=1)

def upsample_minority(df):
    # Resample each minority class (1 = minor, 2 = major) with replacement
    # until it matches the size of the majority class (0 = no bleed)
    df_majority = df.loc[df['Major Bleed (Y/N)'] == 0]
    df_minority_1 = df.loc[df['Major Bleed (Y/N)'] == 1]
    df_minority_2 = df.loc[df['Major Bleed (Y/N)'] == 2]

    df_minority_1_upsampled = resample(df_minority_1, replace=True, n_samples=len(df_majority), random_state=123)
    df_minority_2_upsampled = resample(df_minority_2, replace=True, n_samples=len(df_majority), random_state=123)

    return pd.concat([df_majority, df_minority_1_upsampled, df_minority_2_upsampled])

def smote(train_1, train_2):
    # Named "sampler" rather than "sm" so it does not shadow the statsmodels alias
    sampler = SMOTE(random_state=27)
    train_1, train_2 = sampler.fit_resample(train_1, train_2)

    return train_1, train_2

#upsampling the minority directly
X_Y_train_upsample = upsample_minority(X_Y_train)
y_train_upsample = X_Y_train_upsample['Major Bleed (Y/N)']
X_train_upsample = X_Y_train_upsample.drop('Major Bleed (Y/N)', axis=1)

df_upsample = upsample_minority(df)
y_upsample = df_upsample['Major Bleed (Y/N)']
X_upsample = df_upsample.drop('Major Bleed (Y/N)', axis=1)

#upsampling by generating synthetic samples (smote)
X_train_smote, y_train_smote = smote(X_train, y_train)

X_smote, y_smote = smote(X, y)

# Choosing which of the 2 sets of upsampled data I'll make the model with (upsampling vs. smote). 
# Upsampling produces better results
XX = X_train_upsample
yy = y_train_upsample

#X_Y_train_upsample['Major Bleed (Y/N)'].value_counts(dropna=False)


## https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
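
An alternative to resampling that may be worth comparing is class weighting. The sketch below computes the weights that scikit-learn's `class_weight='balanced'` option uses, n_samples / (n_classes * class_count). The training counts are an assumption: I inferred them by subtracting the test-set support (32, 7, 4) from the full counts (127, 25, 17), which is consistent with the 285 = 3 x 95 upsampled observations reported by the model summary.

```python
# Assumed training-set counts per encoded class (inferred, not printed in the notebook)
train_counts = {0: 95, 1: 18, 2: 13}   # 0 = no bleed, 1 = minor, 2 = major

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# Same formula as class_weight='balanced': rarer classes get larger weights
balanced_weights = {
    cls: n_samples / (n_classes * count)
    for cls, count in train_counts.items()
}

for cls, weight in balanced_weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

A dictionary like this could be passed as `class_weight=balanced_weights` to `LogisticRegression` on the original (not upsampled) training set; I have not checked whether it beats upsampling on this data.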



Removing uncorrelated independent variables (variables that are uncorrelated with both outcome classes, Major Bleed = 1 and Major Bleed = 2)

In [7]:
# For x y train upsample: 
#XX = XX.drop(['Age at diagnosis'],axis=1)
#XX = XX.drop(['Platelets < 50 (Y/N)'],axis=1)
#XX = XX.drop(['HR Molecular/Cytogenetics (Y/N)'],axis=1)
#XX = XX.drop(['Prior lines of therapy'],axis=1)
#XX = XX.drop(['anti platelet (Y/N)'],axis=1)
#XX = XX.drop(['PMHx bleeding risk (Y/N)'],axis=1)



#X_test = X_test.drop(['Age at diagnosis'],axis=1)
#X_test = X_test.drop(['Platelets < 50 (Y/N)'],axis=1)
#X_test = X_test.drop(['HR Molecular/Cytogenetics (Y/N)'],axis=1)
#X_test = X_test.drop(['Prior lines of therapy'],axis=1)
#X_test = X_test.drop(['anti platelet (Y/N)'],axis=1)
#X_test = X_test.drop(['PMHx bleeding risk (Y/N)'],axis=1)

# For x y train smote: 

#XX = XX.drop(['Anemia (hb < 110) (Y/N)'],axis=1)
#XX = XX.drop(['PMHx bleeding risk (Y/N)'],axis=1)

#X_test = X_test.drop(['Anemia (hb < 110) (Y/N)'],axis=1)
#X_test = X_test.drop(['PMHx bleeding risk (Y/N)'],axis=1)

Using statsmodels to assess coefficients

In [8]:
logit_model=sm.MNLogit(yy,sm.add_constant(XX.astype(float)))
result=logit_model.fit()
stats1=result.summary()
print(stats1)

Optimization terminated successfully.
         Current function value: 0.906312
         Iterations 6
                          MNLogit Regression Results                          
Dep. Variable:      Major Bleed (Y/N)   No. Observations:                  285
Model:                        MNLogit   Df Residuals:                      265
Method:                           MLE   Df Model:                           18
Date:                Fri, 02 Jul 2021   Pseudo R-squ.:                  0.1750
Time:                        17:28:18   Log-Likelihood:                -258.30
converged:                       True   LL-Null:                       -313.10
Covariance Type:            nonrobust   LLR p-value:                 3.718e-15
            Major Bleed (Y/N)=1       coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
const                              -3.8146      1.317     -2.896  
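
To read a row of the summary, the z value is the coefficient divided by its standard error, and the confidence interval is the coefficient plus or minus a normal quantile times the standard error. A small worked sketch using only the `const` row shown above (the rest of the table is cut off, so nothing else is filled in); the variable names are my own:

```python
import math
from statistics import NormalDist

# 'const' row for Major Bleed (Y/N)=1, copied from the summary above
coef, std_err = -3.8146, 1.317

z = coef / std_err                                  # Wald z statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
z_crit = NormalDist().inv_cdf(0.975)                # about 1.96 for a 95% interval
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

print(f"z = {z:.3f}")
print(f"p = {p_value:.4f}")
print(f"95% CI (log-odds): [{ci_low:.3f}, {ci_high:.3f}]")
print(f"exp(coef) = {math.exp(coef):.4f}")
```

The coefficients are on the log-odds scale relative to the reference class (class 0, no bleed, as far as I understand MNLogit's default), so exponentiating a predictor's coefficient gives the multiplicative change in the odds of that class versus no bleed per unit increase in the predictor.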

Implementing the model with LogisticRegression

In [9]:
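# Unpenalized multinomial fit. Note: newer scikit-learn releases expect penalty=None instead of the string 'none'
logreg = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg')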
model1 = logreg.fit(XX, yy)
preds = model1.predict(X_test)

Assessing the model

In [10]:
confmtrx = np.array(confusion_matrix(y_test, preds))
pd.DataFrame(confmtrx, index=['None','Minor', 'Major'],
columns=['predicted_None', 'predicted_Minor', 'predicted_Major']) 

Unnamed: 0,predicted_None,predicted_Minor,predicted_Major
None,14,11,7
Minor,2,4,1
Major,2,0,2


In [11]:
print('Accuracy Score:', metrics.accuracy_score(y_test, preds))

Accuracy Score: 0.46511627906976744
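
An accuracy of 0.47 is easier to judge next to a trivial baseline. The sketch below compares it with always predicting the most common class in the test set, using only the confusion matrix counts from above; the variable names are my own:

```python
# Row totals of the confusion matrix above = true class counts in the test set
test_support = {'None': 32, 'Minor': 7, 'Major': 4}
n_test = sum(test_support.values())

# Always predicting the most frequent class
majority_baseline = max(test_support.values()) / n_test

# Diagonal of the confusion matrix = correctly classified patients
model_accuracy = (14 + 4 + 2) / n_test

print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"model accuracy:          {model_accuracy:.3f}")
```

The model scores below this baseline on raw accuracy, which is expected to some extent after upsampling: it trades correct "None" predictions for more minor and major bleeds being caught, so per-class recall is probably the more informative number here.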


In [12]:
class_report=classification_report(y_test, preds)
print(class_report)

              precision    recall  f1-score   support

           0       0.78      0.44      0.56        32
           1       0.27      0.57      0.36         7
           2       0.20      0.50      0.29         4

    accuracy                           0.47        43
   macro avg       0.41      0.50      0.40        43
weighted avg       0.64      0.47      0.50        43
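
The numbers in the report can be reproduced by hand from the confusion matrix, which makes it clearer what they mean: recall divides the diagonal by the row total (true cases), precision divides it by the column total (predicted cases). A minimal sketch in plain Python, with the matrix copied from the output of cell 10:

```python
# Rows = true class, columns = predicted class (order: None, Minor, Major)
conf = [
    [14, 11, 7],   # true None
    [2, 4, 1],     # true Minor
    [2, 0, 2],     # true Major
]
names = ['None', 'Minor', 'Major']

precisions, recalls = [], []
for k, name in enumerate(names):
    true_positive = conf[k][k]
    recall = true_positive / sum(conf[k])                     # row sum
    precision = true_positive / sum(row[k] for row in conf)   # column sum
    precisions.append(precision)
    recalls.append(recall)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")

# Mean of the per-class recalls (the "macro avg" recall in the report)
balanced_accuracy = sum(recalls) / len(recalls)
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

Only 2 of the 10 patients predicted as major bleeds actually had one, and half of the true major bleeds (2 of 4) were found, so with a test set this small every estimate for the major class rests on a handful of patients.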



In [13]:
#ROC Curve
#y_pred_proba = logreg.predict_proba(X_test)[::,1]
#fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
#auc = metrics.roc_auc_score(y_test, y_pred_proba)
#plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
#plt.legend(loc=4)
#plt.show()
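
The commented-out ROC code above is written for a binary outcome (it takes a single probability column), so it would not work as-is with three classes. One option is a one-vs-rest AUC, which scikit-learn supports through the `multi_class` argument of `roc_auc_score`. The sketch below runs on synthetic data from `make_classification`, not on the bleeding dataset, and every name ending in `_demo` is hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data, standing in for the real features and outcome
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# One probability column per class; each row sums to 1
proba_demo = clf_demo.predict_proba(X_te)

# Macro-averaged one-vs-rest AUC: each class is scored against the other two
auc_ovr = roc_auc_score(y_te, proba_demo, multi_class='ovr', average='macro')
print(f"one-vs-rest macro AUC: {auc_ovr:.3f}")
```

For the notebook's model, the equivalent call would presumably be `roc_auc_score(y_test, model1.predict_proba(X_test), multi_class='ovr')`, though with only 4 major bleeds in the test set the estimate would be noisy.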