# Text Mining - Physician Notes

This data set is made available by the i2b2 challenge. It contains the physician notes of 398 patients, with one row per patient. The TEXT column holds the physician notes. The task is multi-class classification: predict the smoking status of a patient. This is an important task for many healthcare providers.

## Goal

Use the **smokers.csv** data set and build a model to predict **STATUS**.

# Read and Prepare the Data

In [1]:
import pandas as pd
import numpy as np

smokers = pd.read_csv('smokers.csv')

smokers.head(5)

Unnamed: 0,ID,TEXT,STATUS
0,641,977146916\nHLGMC\n2878891\n022690\n01/27/1997 ...,CURRENT SMOKER
1,643,026738007\nCMC\n15319689\n3/25/1998 12:00:00 A...,CURRENT SMOKER
2,681,071962960\nBH\n4236518\n417454\n12/10/2001 12:...,CURRENT SMOKER
3,704,418520250\nNVH\n61562872\n3/11/1995 12:00:00 A...,CURRENT SMOKER
4,757,301443520\nCTMC\n49020928\n448922\n1/11/1990 1...,CURRENT SMOKER


In [3]:
#Check counts of unique STATUS values
smokers['STATUS'].value_counts()

UNKNOWN           252
NON-SMOKER         66
PAST SMOKER        36
CURRENT SMOKER     35
SMOKER              9
Name: STATUS, dtype: int64

In [4]:
# replace SMOKER status with CURRENT SMOKER to consolidate values
smokers['STATUS'] = smokers['STATUS'].replace(['SMOKER'],'CURRENT SMOKER')

In [5]:
#Check counts of unique STATUS values
smokers['STATUS'].value_counts()

UNKNOWN           252
NON-SMOKER         66
CURRENT SMOKER     44
PAST SMOKER        36
Name: STATUS, dtype: int64

### Convert target variable to ordinal

In [6]:
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()

#add a column to df called target which is ordinal encoded STATUS
smokers['target'] = enc.fit_transform(smokers[['STATUS']])

smokers

Unnamed: 0,ID,TEXT,STATUS,target
0,641,977146916\nHLGMC\n2878891\n022690\n01/27/1997 ...,CURRENT SMOKER,0.0
1,643,026738007\nCMC\n15319689\n3/25/1998 12:00:00 A...,CURRENT SMOKER,0.0
2,681,071962960\nBH\n4236518\n417454\n12/10/2001 12:...,CURRENT SMOKER,0.0
3,704,418520250\nNVH\n61562872\n3/11/1995 12:00:00 A...,CURRENT SMOKER,0.0
4,757,301443520\nCTMC\n49020928\n448922\n1/11/1990 1...,CURRENT SMOKER,0.0
...,...,...,...,...
393,401,917989835 RWH\n5427551\n405831\n9660879\n01/09...,UNKNOWN,3.0
394,403,817406016 RWH\n3154334\n554691\n3547577\n7/6/2...,UNKNOWN,3.0
395,416,517502848 ELMVH\n18587541\n6634152\n12/12/2004...,UNKNOWN,3.0
396,417,895872725 ELMVH\n99080881\n979718\n5/25/2002 1...,UNKNOWN,3.0


In [7]:
target = smokers['target']
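
The encoded values are easier to read later (for example in the confusion matrices) if the code-to-label mapping is written down. A minimal sketch on a toy frame, assuming the encoder is fitted as above; `toy`, `toy_enc` and `code_to_status` are illustrative names, not part of the original notebook:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the four consolidated STATUS values (illustrative data only)
toy = pd.DataFrame({'STATUS': ['UNKNOWN', 'NON-SMOKER', 'CURRENT SMOKER',
                               'PAST SMOKER', 'UNKNOWN']})

toy_enc = OrdinalEncoder()
toy['target'] = toy_enc.fit_transform(toy[['STATUS']]).ravel()

# categories_ holds one array per encoded column; the position of a
# category in that array is its ordinal code (categories are sorted)
code_to_status = {float(code): status
                  for code, status in enumerate(toy_enc.categories_[0])}
print(code_to_status)
```

With the four statuses used here, the sorted order gives CURRENT SMOKER = 0, NON-SMOKER = 1, PAST SMOKER = 2 and UNKNOWN = 3, which matches the `target` column shown above.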

### Select the input variable

In [8]:
#Check for missing values

smokers[['TEXT']].isna().sum()

#no missing values

TEXT    0
dtype: int64

In [9]:
input_data = smokers['TEXT']

### Split data into train/test

In [10]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [11]:
train_set.shape, train_y.shape

((278,), (278,))

In [12]:
test_set.shape, test_y.shape

((120,), (120,))
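
Because the classes are imbalanced, one option (not used in this notebook) is to pass `stratify=` so that each split keeps roughly the overall class proportions. The sketch below uses synthetic labels with the same class counts as the consolidated STATUS column; `X_demo` and `y_demo` are placeholders, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts:
# 44 CURRENT SMOKER (0), 66 NON-SMOKER (1), 36 PAST SMOKER (2), 252 UNKNOWN (3)
y_demo = np.repeat([0.0, 1.0, 2.0, 3.0], [44, 66, 36, 252])
X_demo = np.arange(len(y_demo))  # placeholder inputs standing in for TEXT

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Class shares in the test split should track the overall shares
overall = np.bincount(y_demo.astype(int)) / len(y_demo)
in_test = np.bincount(y_te.astype(int)) / len(y_te)
print(np.round(overall, 3))
print(np.round(in_test, 3))
```

Switching the notebook to a stratified split would change the train and test sets, so the results reported below would no longer apply as-is.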

# Text Preparation

### Count Vectorizer

In [13]:
#Countvectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words='english')

#Fit_transform train
train_x_tr = count_vect.fit_transform(train_set)

#Transform test
test_x_tr = count_vect.transform(test_set)

train_x_tr, test_x_tr

(<278x12240 sparse matrix of type '<class 'numpy.int64'>'
 	with 71810 stored elements in Compressed Sparse Row format>,
 <120x12240 sparse matrix of type '<class 'numpy.int64'>'
 	with 31387 stored elements in Compressed Sparse Row format>)
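
To make the fit/transform split concrete, here is a small sketch on made-up notes (not taken from the data set). The vocabulary is learned from the training texts only, so words that appear only in the test texts are ignored, and both matrices share the same columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up notes for illustration only
train_notes = ["The patient quit smoking in 1995.",
               "Patient denies tobacco use."]
test_notes = ["The patient reports tobacco and alcohol use."]

demo_vect = CountVectorizer(stop_words='english')
demo_train = demo_vect.fit_transform(train_notes)  # learns the vocabulary
demo_test = demo_vect.transform(test_notes)        # reuses it; unseen words are dropped

print(sorted(demo_vect.vocabulary_))
print(demo_train.shape, demo_test.shape)
```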

### TF-IDF Transformer

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer()

#Fit_transform train
train_x_tfidf = tf_transformer.fit_transform(train_x_tr)

#Transform test
test_x_tfidf = tf_transformer.transform(test_x_tr)

train_x_tfidf.shape, test_x_tfidf.shape

((278, 12240), (120, 12240))
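
The two steps above (counting, then TF-IDF weighting) can also be done in one step with `TfidfVectorizer`, which scikit-learn documents as equivalent to `CountVectorizer` followed by `TfidfTransformer`. A minimal sketch on made-up documents that checks the equivalence; the variable names are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up documents for illustration only
docs = ["patient admitted with chest pain",
        "patient denies tobacco use",
        "discharge on aspirin 81 mg po daily"]

# Two-step route, as in the notebook
counts = CountVectorizer(stop_words='english').fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route
one_step = TfidfVectorizer(stop_words='english').fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```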

In [15]:
train_x_tfidf.toarray()

array([[0.0095233 , 0.01267037, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.02616653, 0.03481352, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01448694, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.03956983, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.06738015, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.02353552, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

### Singular Value Decomposition

In [16]:
from sklearn.decomposition import TruncatedSVD

#For Latent Semantic Analysis, the scikit-learn documentation recommends n_components=100;
#278 is used here, which equals the number of training rows
svd = TruncatedSVD(n_components=278, n_iter=10)

#Fit_transform train
train_x_lsa = svd.fit_transform(train_x_tfidf)

#Transform test
test_x_lsa = svd.transform(test_x_tfidf)

train_x_lsa.shape, test_x_lsa.shape

((278, 278), (120, 278))

Reduced columns from 12240 to 278.

#### Investigate SVDs

In [17]:
svd.explained_variance_.sum()

0.913337144862521

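The explained variances of the 278 components sum to about 0.913. Note that `explained_variance_` is an absolute variance, not a proportion, so this figure should not be read as 91.3% of the original data. The share of variance retained is given by `svd.explained_variance_ratio_.sum()`.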
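
As a sketch of how the retained share of variance could be checked, the example below fits `TruncatedSVD` on a random matrix standing in for the TF-IDF features and compares the absolute and relative attributes. The data and the names `X_demo` and `demo_svd` are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X_demo = rng.rand(50, 200)  # random stand-in for a TF-IDF matrix

demo_svd = TruncatedSVD(n_components=20, n_iter=10, random_state=0)
demo_svd.fit(X_demo)

# Absolute variance captured by the components (depends on the data's scale)
abs_var = demo_svd.explained_variance_.sum()

# Share of the total variance captured (between 0 and 1)
ratio = demo_svd.explained_variance_ratio_.sum()

# Cumulative share, useful for choosing n_components
cumulative = np.cumsum(demo_svd.explained_variance_ratio_)

print(abs_var, ratio)
```

The ratio is the absolute figure divided by the total variance of the input columns, so only the ratio can be read as a percentage.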

In [18]:
#These are all the components:
svd.components_.shape, svd.components_

((278, 12240), array([[ 9.51924155e-02,  2.25488789e-02,  2.35022916e-03, ...,
          7.07483170e-04,  7.76410807e-04,  1.61359431e-03],
        [ 9.85048104e-02, -1.34518367e-02,  4.78349375e-03, ...,
          5.85934053e-04, -1.83940502e-04,  7.16964445e-04],
        [ 8.84296448e-03, -3.73134168e-04,  2.12293663e-02, ...,
          9.13180719e-04,  3.69633462e-05,  2.00497791e-03],
        ...,
        [-6.58163454e-03,  4.62943408e-04, -5.98493644e-02, ...,
          2.81213266e-04,  6.66252811e-04,  1.70103977e-03],
        [-2.26376980e-02,  4.16516725e-03,  2.51525681e-02, ...,
         -1.81107211e-03,  4.52255630e-03,  3.59022933e-03],
        [ 3.27830762e-02, -1.84857966e-03, -1.87368082e-02, ...,
         -7.45520429e-04, -1.89975895e-03, -1.61297007e-03]]))

In [19]:
#Investigating the first component:
first_component = svd.components_[0,:]

indices = np.argsort(first_component).tolist()
indices

[5005, 9257, 2317, 11723, 5848, 11488, 2289, 3868, 1904, 10395,
 7292, 267, 2185, 10410, 50, 5876, 343, 5863, 1355, 508,
 6753, 9876, 3371, 11614, 7841, 1381, 2005, 5007, 8674, 8673,
 990, 10775, 10826, 11449, 8706, 5023, 140, 11967, 4204, 8140,
 2932, 4143, 1561, 2989, 4140, 11278, 2990, 4119, 8044, 495,
 6184, 1661, 9383, 1665, 4690, 4387, 11868, 7729, 188, 10253,
 6353, 10988, 671, 8215, 2455, 7871, 6771, 3312, 6008, 8043,
 11113, 5597, 855, 8992, 13, 4022, 774, 1776, 6204, 983,
 3425, 1751, 9867, 7557, 4814, 591, 1265, 7738, 1116, 6296,
 6347, 4107, 6130, 9002, 5158, 2583, 6200, 4322, 7684, 12101,
 4314, 7648, 8108, 7072, 5080, 5036, 10088, 7844, 10552, 3201,
 5638, 7359, 6160, 7337, 2206, 4649, 5511, 7216, 952, 10372,
 3908, 6772, 11681, 5453, 1442, 5418, 4439, 8280, 9083, 2333,
 10738, 10825, 3422, 5288, 4494, 11946, 6363, 1181, 3973, 4934,
 3827, 371, 3

In [20]:
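#Feature names from the count vectorizer
#(get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names())
feat_names = count_vect.get_feature_names_out()

#Print the last 10 terms (i.e., the 10 terms that have the highest weights)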
for index in indices[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])

pain 		weight = 0.10248823839931734
admission 		weight = 0.10253058753598993
day 		weight = 0.12040240049295393
right 		weight = 0.12692134203163782
po 		weight = 0.133892215288022
left 		weight = 0.1405772360530631
history 		weight = 0.1469565618560393
discharge 		weight = 0.1645097206449623
mg 		weight = 0.17571947228021093
patient 		weight = 0.3029343775413456


The tokens patient, mg, and discharge have the highest weights in the first component.

# Determine Baseline Accuracy

In [21]:
# Find the majority class:
test_y.value_counts()

3.0    71
1.0    23
2.0    15
0.0    11
Name: target, dtype: int64

In [22]:
#Find the percentage of the majority class:
test_y.value_counts()/len(test_y)

3.0    0.591667
1.0    0.191667
2.0    0.125000
0.0    0.091667
Name: target, dtype: float64

The test baseline is to predict every STATUS as 3.0 (which corresponds to UNKNOWN), which gives about 59% accuracy.
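
The same baseline can be reproduced with scikit-learn's `DummyClassifier`. The sketch below rebuilds label vectors from the class counts reported in this notebook (training counts are the overall counts minus the test counts). The feature matrices are zero placeholders, because the dummy model ignores its inputs:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

classes = [0.0, 1.0, 2.0, 3.0]
test_counts = [11, 23, 15, 71]     # from test_y.value_counts()
train_counts = [33, 43, 21, 181]   # overall counts (44, 66, 36, 252) minus test counts

y_train_demo = np.repeat(classes, train_counts)
y_test_demo = np.repeat(classes, test_counts)

# Placeholder features: DummyClassifier only looks at the labels
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, dummy.predict(X_test_demo))
print(baseline_acc)
```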

# Model 1 - Random Forest

In [23]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=4, n_jobs=-1) 

rnd_clf.fit(train_x_lsa, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=4,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [24]:
#Train accuracy
train_y_pred = rnd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.6510791366906474


In [25]:
#Test accuracy
test_y_pred = rnd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.5916666666666667


In [26]:
#Usually created on test set
confusion_matrix(test_y, test_y_pred)

array([[ 0,  0,  0, 11],
       [ 0,  0,  0, 23],
       [ 0,  0,  0, 15],
       [ 0,  0,  0, 71]], dtype=int64)

The random forest appears to be predicting all of the statuses as UNKNOWN, so it has the same accuracy as the baseline.
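
This can be confirmed from the confusion matrix itself: in scikit-learn's layout, rows are true labels and columns are predicted labels, so the column sums count how often each class was predicted. A short check using the matrix printed above (`rf_cm` is an illustrative name):

```python
import numpy as np

# Random forest confusion matrix copied from the output above
rf_cm = np.array([[0, 0, 0, 11],
                  [0, 0, 0, 23],
                  [0, 0, 0, 15],
                  [0, 0, 0, 71]])

# Column sums = number of test rows assigned to each predicted class
predicted_counts = rf_cm.sum(axis=0)

# Classes that were predicted at least once
predicted_classes = np.flatnonzero(predicted_counts)

print(predicted_counts)
print(predicted_classes)
```

Only class 3 (UNKNOWN) is ever predicted, for all 120 test rows.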

# Model 2 - SGD Classifier

In [42]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=200, eta0=0.2, learning_rate='adaptive', tol=1e-3)

sgd_clf.fit(train_x_lsa, train_y)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.2, fit_intercept=True,
              l1_ratio=0.15, learning_rate='adaptive', loss='hinge',
              max_iter=200, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [43]:
#Train accuracy
train_y_pred = sgd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 1.0


In [44]:
#Test accuracy
test_y_pred = sgd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.6333333333333333


In [45]:
#Usually created on test set
confusion_matrix(test_y, test_y_pred)

array([[ 2,  3,  0,  6],
       [ 0,  7,  0, 16],
       [ 1,  1,  2, 11],
       [ 2,  4,  0, 65]], dtype=int64)

Neither model performs much better than the 59% baseline accuracy, most likely because the data set is skewed toward observations marked 'UNKNOWN'. The Random Forest predicts every value as UNKNOWN and is therefore no better than the baseline. The SGD Classifier does marginally better, achieving 63.3% accuracy. However, since its train accuracy is 100%, there is significant overfitting.
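
Since accuracy is dominated by the UNKNOWN class, it may help to look at the mean per-class recall (often called balanced accuracy) as well. The sketch below computes both from the two confusion matrices printed above; the helper name `accuracy_and_balanced_accuracy` is hypothetical and not part of the original notebook:

```python
import numpy as np

def accuracy_and_balanced_accuracy(cm):
    """Return (accuracy, mean per-class recall) for a confusion matrix
    whose rows are true labels and whose columns are predicted labels."""
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)
    return accuracy, recall.mean()

# Confusion matrices copied from the outputs above
rf_cm = [[0, 0, 0, 11], [0, 0, 0, 23], [0, 0, 0, 15], [0, 0, 0, 71]]
sgd_cm = [[2, 3, 0, 6], [0, 7, 0, 16], [1, 1, 2, 11], [2, 4, 0, 65]]

rf_acc, rf_bal = accuracy_and_balanced_accuracy(rf_cm)
sgd_acc, sgd_bal = accuracy_and_balanced_accuracy(sgd_cm)

print('Random Forest: acc={:.3f}, balanced acc={:.3f}'.format(rf_acc, rf_bal))
print('SGD:           acc={:.3f}, balanced acc={:.3f}'.format(sgd_acc, sgd_bal))
```

On this measure the Random Forest scores 0.25, the value obtained by always predicting one of four classes. The SGD Classifier scores higher, but still recalls most of the three smoking-related classes poorly.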