This notebook will train and validate a number of Machine Learning algorithms to classify reports as either positive or negative for fluid collection.

The first part of this will read in the data and transform it from free text into sparse vectors. The second part will train the algorithms and evaluate their performance.

# Data Processing

Right now the data is stored in a sqlite database. There are two main columns:
- ``text``: the free text of the radiology report
- ``doc_class``: a 1 if there is a fluid collection, 0 if there isn't.

They need to be represented as sparse vectors. We'll read them in, preprocess them, and convert them using ``sklearn``.

In [81]:
import os
import pandas as pd
import sqlite3 as sqlite

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score,accuracy_score, precision_score, recall_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.utils import shuffle

In [19]:
DATADIR = '../stats_data'
DB = os.path.join(DATADIR, 'Reference Standard', 'radiology_reports.sqlite')
os.path.exists(DB)

True

In [22]:
conn = sqlite.connect(DB)

# Training data
train_df = pd.read_sql("SELECT * FROM training_notes;", conn)
#Testing data
test_df = pd.read_sql("SELECT * FROM testing_notes;", conn)
conn.close()

In [23]:
train_df.head()

Unnamed: 0,rowid,name,text,referenceXML,doc_class,subject,HADM_ID,CHARTDATE
0,1,No_10792_131562_05-29-20,\n CT ABDOMEN W/CONTRAST; CT PELVIS W/CONTRAS...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,32,131562,05-29-20
1,2,No_11050_126785_11-03-33,\n CT CHEST W/CONTRAST; CT ABDOMEN W/CONTRAST...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,34,126785,11-03-33
2,3,No_11879_166554_06-22-37,\n CTA CHEST W&W/O C &RECONS; CT 100CC NON IO...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,35,166554,06-22-37
3,4,No_11879_166554_06-23-37,\n CT ABDOMEN W/O CONTRAST; CT PELVIS W/O CON...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,35,166554,06-23-37
4,5,No_11879_166554_07-02-37,\n CT CHEST W/O CONTRAST \n ~ Reason: r/o ste...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,35,166554,07-02-37


In [24]:
test_df.shape

(100, 8)

The `TfidfVectorizer` and `CountVectorizer` classes transform raw texts into matrices where the rows represent reports and the columns represents terms that are present in that report. `Tfidf` uses *Term-Freqeuncy Inverse-Document Frequency* weighting, while `Count` uses raw counts.

We'll use the vectorizer to preprocess the text, as well. We'll make the reports lowercase, remove all stopwords, and extracts ngrams from 1-3. We'll also set a minimum document threshold, stating that an ngram must appear in at least 10% of the documents to be included. This will help cut down on a lot of noise.

In [28]:
vectorizer = TfidfVectorizer(min_df=0.1, lowercase=True, stop_words='english',
                                     ngram_range=(1, 3))
X_train = list(train_df.text)
y_train = list(train_df.doc_class)

# Fit the vectorizer and transform the training notes into a vector
X_train = vectorizer.fit_transform(X_train)


# Now use this fitted vectorizer to transform the test data
X_test = list(test_df.text)
y_test = list(test_df.doc_class)
X_test = vectorizer.transform(X_test)

In [82]:
# Shuffle the training data
X_train, y_train = shuffle(X_train, y_train)

Our vocabulary consists of 682 ngrams. Our data has 545 training notes and 100 testing notes.

Now we can start training.

In [29]:
print(X_train.shape)
print(X_test.shape)

(545, 682)
(100, 682)


# Machine Learning

We'll consider these algorithms as document classifiers:
- Logistic Regression
- Random Forest
- Naive Bayes
- Linear SVM

For each of these, we'll also consider a number of hyper-parameters to try and find the optimal performance.

In [99]:
def train_and_evaluate_model(X_train, y_train, X_test, y_test, clf, model_name, results):
    print(model_name)
    clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    f1, p, r = (f1_score(y_train, y_pred_train,average='binary'),
               precision_score(y_train, y_pred_train, average='binary'),
               recall_score(y_train, y_pred_train,average='binary'))
    print("Train:")
    print(f1, p, r)
    results = results.append({'test/train':'train','name':model_name,
                              'F1':f1,'Precision':p,'Recall':r,
                              "Classifier": clf},ignore_index=True)

    # Now evaluate on testing data
    y_pred_test = clf.predict(X_test)
    f1, p, r = (f1_score(y_test, y_pred_test, average='binary'),
               precision_score(y_test, y_pred_test, average='binary'),
               recall_score(y_test, y_pred_test, average='binary'))
    print("Test:")
    print(f1, p, r)
    results = results.append({'test/train':'test', 
                              'name':model_name,
                              'F1':f1, 'Precision':p, 'Recall':r,
                             "Classifier": clf},
                             ignore_index=True)
    return results

In [100]:
# This dataframe will keep track of all of our scores
results = pd.DataFrame(columns=('test/train','name', 'F1',"Precision","Recall", "Classifier"))


### Logistic Regression

In [101]:
clf = LogisticRegression()
penalties = ['l1', 'l2']
Cs = [0.01, 0.1, 0.5, 1.0]

for penalty in penalties:
    for c in Cs:
        model_name = "Logistic Regression: penalty={}, C={}".format(penalty, c)
        clf = LogisticRegression(penalty=penalty, C=c)
        
        results = train_and_evaluate_model(X_train, y_train, X_test, y_test, clf, model_name, results)
print(results.tail())

Logistic Regression: penalty=l1, C=0.01
Train:
0.0 0.0 0.0
Test:
0.0 0.0 0.0
Logistic Regression: penalty=l1, C=0.1
Train:
0.0816326530612 0.909090909091 0.042735042735
Test:
0.176470588235 1.0 0.0967741935484
Logistic Regression: penalty=l1, C=0.5
Train:
0.735849056604 0.821052631579 0.666666666667
Test:
0.666666666667 0.782608695652 0.58064516129
Logistic Regression: penalty=l1, C=1.0
Train:
0.778801843318 0.845 0.722222222222
Test:
0.666666666667 0.782608695652 0.58064516129
Logistic Regression: penalty=l2, C=0.01
Train:
0.0 0.0 0.0
Test:
0.0 0.0 0.0
Logistic Regression: penalty=l2, C=0.1
Train:
0.614492753623 0.954954954955 0.452991452991
Test:
0.545454545455 0.923076923077 0.387096774194
Logistic Regression: penalty=l2, C=0.5
Train:
0.80378250591 0.899470899471 0.726495726496
Test:
0.777777777778 0.913043478261 0.677419354839
Logistic Regression: penalty=l2, C=1.0
Train:
0.837528604119 0.901477832512 0.782051282051
Test:
0.785714285714 0.88 0.709677419355
   test/train            

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


### Random Forest

In [102]:
n_estimators = [200,500,1000]
max_features = ['sqrt',None]
max_depth = [50,None]
for feat in max_features:
    for d in max_depth:
        for n in n_estimators:
            # Train and evaluate on training data
            model_name = 'Random Forest: n={n}, feat={f}, depth={d}'.format(n=n,f=feat,d=d)
            print(model_name)
            forest = RandomForestClassifier(n_estimators=n,max_depth=d,max_features=feat, n_jobs=-1)
            results = train_and_evaluate_model(X_train, y_train, X_test, y_test, forest, model_name, results)
        
results

Random Forest: n=200, feat=sqrt, depth=50
Random Forest: n=200, feat=sqrt, depth=50
Train:
1.0 1.0 1.0
Test:
0.779661016949 0.821428571429 0.741935483871
Random Forest: n=500, feat=sqrt, depth=50
Random Forest: n=500, feat=sqrt, depth=50
Train:
1.0 1.0 1.0
Test:
0.758620689655 0.814814814815 0.709677419355
Random Forest: n=1000, feat=sqrt, depth=50
Random Forest: n=1000, feat=sqrt, depth=50
Train:
1.0 1.0 1.0
Test:
0.779661016949 0.821428571429 0.741935483871
Random Forest: n=200, feat=sqrt, depth=None
Random Forest: n=200, feat=sqrt, depth=None
Train:
1.0 1.0 1.0
Test:
0.793103448276 0.851851851852 0.741935483871
Random Forest: n=500, feat=sqrt, depth=None
Random Forest: n=500, feat=sqrt, depth=None
Train:
1.0 1.0 1.0
Test:
0.793103448276 0.851851851852 0.741935483871
Random Forest: n=1000, feat=sqrt, depth=None
Random Forest: n=1000, feat=sqrt, depth=None
Train:
1.0 1.0 1.0
Test:
0.766666666667 0.793103448276 0.741935483871
Random Forest: n=200, feat=None, depth=50
Random Forest: n=2

Unnamed: 0,test/train,name,F1,Precision,Recall,Classifier
0,train,"Logistic Regression: penalty=l1, C=0.01",0.0,0.0,0.0,"LogisticRegression(C=0.01, class_weight=None, ..."
1,test,"Logistic Regression: penalty=l1, C=0.01",0.0,0.0,0.0,"LogisticRegression(C=0.01, class_weight=None, ..."
2,train,"Logistic Regression: penalty=l1, C=0.1",0.081633,0.909091,0.042735,"LogisticRegression(C=0.1, class_weight=None, d..."
3,test,"Logistic Regression: penalty=l1, C=0.1",0.176471,1.0,0.096774,"LogisticRegression(C=0.1, class_weight=None, d..."
4,train,"Logistic Regression: penalty=l1, C=0.5",0.735849,0.821053,0.666667,"LogisticRegression(C=0.5, class_weight=None, d..."
5,test,"Logistic Regression: penalty=l1, C=0.5",0.666667,0.782609,0.580645,"LogisticRegression(C=0.5, class_weight=None, d..."
6,train,"Logistic Regression: penalty=l1, C=1.0",0.778802,0.845,0.722222,"LogisticRegression(C=1.0, class_weight=None, d..."
7,test,"Logistic Regression: penalty=l1, C=1.0",0.666667,0.782609,0.580645,"LogisticRegression(C=1.0, class_weight=None, d..."
8,train,"Logistic Regression: penalty=l2, C=0.01",0.0,0.0,0.0,"LogisticRegression(C=0.01, class_weight=None, ..."
9,test,"Logistic Regression: penalty=l2, C=0.01",0.0,0.0,0.0,"LogisticRegression(C=0.01, class_weight=None, ..."


### Naive Bayes

In [103]:
# There are no hyperparemters for this algorithm
mnb = MultinomialNB()
results = train_and_evaluate_model(X_train, y_train, X_test, y_test, mnb, "Naive Bayes", results)
results.tail()

Naive Bayes
Train:
0.769574944072 0.807511737089 0.735042735043
Test:
0.779661016949 0.821428571429 0.741935483871


Unnamed: 0,test/train,name,F1,Precision,Recall,Classifier
37,test,"Random Forest: n=500, feat=None, depth=None",0.757576,0.714286,0.806452,"(DecisionTreeClassifier(class_weight=None, cri..."
38,train,"Random Forest: n=1000, feat=None, depth=None",1.0,1.0,1.0,"(DecisionTreeClassifier(class_weight=None, cri..."
39,test,"Random Forest: n=1000, feat=None, depth=None",0.746269,0.694444,0.806452,"(DecisionTreeClassifier(class_weight=None, cri..."
40,train,Naive Bayes,0.769575,0.807512,0.735043,"MultinomialNB(alpha=1.0, class_prior=None, fit..."
41,test,Naive Bayes,0.779661,0.821429,0.741935,"MultinomialNB(alpha=1.0, class_prior=None, fit..."


### Linear SVM

In [104]:
slacks = [2, 1, 0.1, 0.05, 0.01]
for slack in slacks:
    model_name = "SVM: slack={slack}".format(slack=slack)
    svc_clf = LinearSVC(C=float(slack),loss='hinge')
    results = train_and_evaluate_model(X_train, y_train, X_test, y_test, mnb, model_name, results)
print(results.tail())

SVM: slack=2
Train:
0.769574944072 0.807511737089 0.735042735043
Test:
0.779661016949 0.821428571429 0.741935483871
SVM: slack=1
Train:
0.769574944072 0.807511737089 0.735042735043
Test:
0.779661016949 0.821428571429 0.741935483871
SVM: slack=0.1
Train:
0.769574944072 0.807511737089 0.735042735043
Test:
0.779661016949 0.821428571429 0.741935483871
SVM: slack=0.05
Train:
0.769574944072 0.807511737089 0.735042735043
Test:
0.779661016949 0.821428571429 0.741935483871
SVM: slack=0.01
Train:
0.769574944072 0.807511737089 0.735042735043
Test:
0.779661016949 0.821428571429 0.741935483871
   test/train             name        F1  Precision    Recall  \
47       test   SVM: slack=0.1  0.779661   0.821429  0.741935   
48      train  SVM: slack=0.05  0.769575   0.807512  0.735043   
49       test  SVM: slack=0.05  0.779661   0.821429  0.741935   
50      train  SVM: slack=0.01  0.769575   0.807512  0.735043   
51       test  SVM: slack=0.01  0.779661   0.821429  0.741935   

                     

# Analysis

Now that we've evaluated a number of classifiers, we'll pick the best one, save its predictions, and do a further analysis on it.

In [105]:
results.head()

Unnamed: 0,test/train,name,F1,Precision,Recall,Classifier
0,train,"Logistic Regression: penalty=l1, C=0.01",0.0,0.0,0.0,"LogisticRegression(C=0.01, class_weight=None, ..."
1,test,"Logistic Regression: penalty=l1, C=0.01",0.0,0.0,0.0,"LogisticRegression(C=0.01, class_weight=None, ..."
2,train,"Logistic Regression: penalty=l1, C=0.1",0.081633,0.909091,0.042735,"LogisticRegression(C=0.1, class_weight=None, d..."
3,test,"Logistic Regression: penalty=l1, C=0.1",0.176471,1.0,0.096774,"LogisticRegression(C=0.1, class_weight=None, d..."
4,train,"Logistic Regression: penalty=l1, C=0.5",0.735849,0.821053,0.666667,"LogisticRegression(C=0.5, class_weight=None, d..."


In [106]:
# Filter to only look at test scores
# Then sort by F1
results[results['test/train'] == 'test'].sort_values(by='F1', ascending=False).head()

Unnamed: 0,test/train,name,F1,Precision,Recall,Classifier
25,test,"Random Forest: n=500, feat=sqrt, depth=None",0.793103,0.851852,0.741935,"(DecisionTreeClassifier(class_weight=None, cri..."
23,test,"Random Forest: n=200, feat=sqrt, depth=None",0.793103,0.851852,0.741935,"(DecisionTreeClassifier(class_weight=None, cri..."
15,test,"Logistic Regression: penalty=l2, C=1.0",0.785714,0.88,0.709677,"LogisticRegression(C=1.0, class_weight=None, d..."
51,test,SVM: slack=0.01,0.779661,0.821429,0.741935,"MultinomialNB(alpha=1.0, class_prior=None, fit..."
49,test,SVM: slack=0.05,0.779661,0.821429,0.741935,"MultinomialNB(alpha=1.0, class_prior=None, fit..."


In [111]:
results.iloc[23]

test/train                                                 test
name                Random Forest: n=200, feat=sqrt, depth=None
F1                                                     0.793103
Precision                                              0.851852
Recall                                                 0.741935
Classifier    (DecisionTreeClassifier(class_weight=None, cri...
Name: 23, dtype: object

Looks like our highest-performing model is a Random Forest with 200 estimators, the sqrt number of features, a max depth of 50. Let's now retrain that model and save its predictions:

In [120]:
clf = results.iloc[25].Classifier
y_pred_test = clf.predict(X_test)
y_pred_test

array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 1])

In [121]:
# Let's make sure we get the scores we expect
f1_score(y_test, y_pred_test, average='binary')

0.7931034482758621

In [122]:
test_df['pred'] = y_pred_test
test_df.head()

Unnamed: 0,rowid,name,text,referenceXML,doc_class,subject,HADM_ID,CHARTDATE,pred
0,1,No_1007_141227_06-18-95,\n CT CHEST W/CONTRAST \n ~ Reason: assess de...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,5,141227,06-18-95,0
1,2,No_12344_140694_08-13-21,"\n CTA CHEST W&W/O C&RECONS, NON-CORONARY; CT...","<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,37,140694,08-13-21,0
2,3,No_14176_126791_10-19-39,"\n CTA CHEST W&W/O C&RECONS, NON-CORONARY; CT...","<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,41,126791,10-19-39,0
3,4,No_15847_121459_06-07-77,\n CTA ABD W&W/O C & RECONS; CTA PELVIS W&W/O...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,45,121459,06-07-77,0
4,5,No_15847_121459_06-16-77,\n CT ABDOMEN W/O CONTRAST; CT PELVIS W/O CON...,"<?xml version=""1.0"" encoding=""UTF-8""?>\n<annot...",0,45,121459,06-16-77,0


In [123]:
print(classification_report(y_test, y_pred_test))

             precision    recall  f1-score   support

          0       0.89      0.94      0.92        69
          1       0.85      0.74      0.79        31

avg / total       0.88      0.88      0.88       100



In [None]:
# Now we'll save the dataframe that also has predictions
test_df.to_pickle('../stats_data/')