# Intact Medical Data ML Model Using Naive Bayes Classification

### By: Daniyal, Hibah, Abhishek and Adam

In our Data Hackathon project, Intact gave us medical transcription data and asked us to predict which of the 30 provided medical specialties each transcription should be sent to. Submissions are judged on the macro F-score. Here are the steps we took to maximize it:

### Step 1: Import Libraries and read in the Dataset to train on

In [4]:
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import _stop_words
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

## Reading in our data
df = pd.read_csv("new_train.csv", index_col=0)
print("Dataset size with duplicates: ", len(df))


### Step 2: Pre-process our data

I think this is the most important step: the ML model is only as good as its dataset, so we gotta make sure it's squeaky clean.

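Most of the basic pre-processing is handled by the CountVectorizer, together with a custom preprocessor function that we pass to it. The tasks are:
- Tokenize (split the text into individual words)
- Remove stop-words ("the", "and", "to", "or", ...) and special characters
- Lemmatize (convert related word forms to their base form; eating, eats, ate => eat). CountVectorizer has no built-in lemmatizer, so this is what the WordNetLemmatizer in the custom preprocessor is meant to do.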

In [None]:
# Create labels/target values
y = df.labels
print("Label size: ", len(y))
y

Label size:  3969


0       0
1       1
2       1
3       2
4       0
       ..
3995    4
3996    1
3997    1
3998    5
3999    1
Name: labels, Length: 3969, dtype: int64

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df["transcription"], y, test_size=0.6, random_state=42)

# X_train: training data of features
print("X_train size: ", len(X_train))
# y_train: training data of label
print("y_train size: ", len(y_train))

# X_test: test data of features
print("X_test size: ", len(X_test))
# y_test: test data of label
print("y_test size: ", len(y_test))

# X_train
# y_train[:50]
# X_test
# y_test

X_train size:  1587
y_train size:  1587
X_test size:  2382
y_test size:  2382


In [None]:
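# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()


# Custom pre-processing function, applied to each full transcription before tokenization
def preprocess_data(text):
    text = text.lower()
    text = re.sub(r'\d|_', '', text)  # removes digits and '_'
    # Note: lemmatize() expects a single word, so calling it on the whole
    # transcription leaves the text essentially unchanged. Real lemmatization
    # would mean tokenizing first and lemmatizing each token.
    text = wordnet_lemmatizer.lemmatize(text)
    return text

# Initialize a CountVectorizer object:
# - max_df=0.2: drop terms that appear in more than 20% of the documents
# - min_df=25: drop terms that appear in fewer than 25 documents
# - ngram_range=(1, 2): use single words and two-word phrases as features
count_vectorizer = CountVectorizer(stop_words="english", preprocessor=preprocess_data, max_df=0.2, min_df=25, ngram_range=(1, 2))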

print(type(count_vectorizer))

<class 'sklearn.feature_extraction.text.CountVectorizer'>


### Step 3: Fit and Transform the Data

Specifically, we must fit AND transform the training features, but only transform the test features.
This is a preliminary step before training the model.

fit_transform() learns the vocabulary from the training data (after applying the stop-word, max_df and min_df filters) and then converts each training document into a vector of token counts (hence, transform). We only need transform() for the test data because it has to be encoded with the vocabulary learned from the training data; terms that appear only in the test data are ignored.

In [None]:
# Fit and transform the TRAINING data using only the 'transcription' column values
count_train = count_vectorizer.fit_transform(X_train.values)
# Transform the TEST data using only the 'transcription' column values
count_test = count_vectorizer.transform(X_test.values)


# Print the vocabulary size (single words and two-word phrases kept after filtering)
print("Number of words: ", len(count_vectorizer.get_feature_names_out()))
# Print the first 500 features (individual tokens) of the count_vectorizer
print(count_vectorizer.get_feature_names_out()[:500])



Number of words:  3315
['abc' 'abcd' 'abcd general' 'abdomen' 'abdomen pelvis' 'abdomen prepped'
 'abdomen soft' 'abdominal' 'abdominal cavity' 'abdominal pain'
 'abdominal wall' 'ability' 'able' 'abnormal' 'abnormalities'
 'abnormality' 'abscess' 'absent' 'abuse' 'ac' 'access' 'accident'
 'accommodate' 'accommodation' 'accompanied' 'accomplished' 'according'
 'ace' 'achieved' 'acid' 'active' 'activities' 'activity' 'actually'
 'acute' 'acute distress' 'adaptic' 'add' 'added' 'addition' 'additional'
 'additionally' 'adenocarcinoma' 'adenopathy' 'adequate'
 'adequate general' 'adequately' 'adhesions' 'adjacent' 'administered'
 'administered patient' 'administration' 'admission' 'admit' 'admitted'
 'admitted hospital' 'admitting' 'adnexal' 'adrenal' 'adult' 'advanced'
 'advised' 'afebrile' 'affect' 'african' 'african american' 'afternoon'
 'age' 'aggressive' 'ago' 'ago patient' 'agree' 'agreed' 'ahead' 'aid'
 'air' 'airway' 'albumin' 'albuterol' 'alcohol' 'alcohol use' 'alert'
 'alert or

### Step 5: Train our models here

From our previous work, Multinomial Naive Bayes was the most accurate of the models we tried for classifying our labels.

In [None]:
# Instantiate a Multinomial Naive Bayes classifier
nb_clf = MultinomialNB(alpha=0.01)
# Fit the classifier to the training data
nb_clf.fit(count_train, y_train)
# Create the predicted tags
pred = nb_clf.predict(count_test)

# Print the predictions for each row of the test set
print("Number of predictions: ", len(pred)) # Equal to the size of the test split
print(pred)

Number of predictions:  2382
[ 7  1 34 ...  9  3  1]


### Step 6: Evaluate the model

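We compute the accuracy score and print a per-class classification report; the confusion matrix call is left commented out. The macro F-score we are judged on is the unweighted mean of the per-class F1 scores, which the classification report lists in its "macro avg" row.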

Precision = TP/(TP + FP)

Recall = TP/(TP+FN)

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
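
Since the hackathon is scored on the macro F-score, here is a small sketch of how these formulas combine into it: F1 is computed per class and then averaged with equal weight per class, so rare specialties count as much as common ones. The labels below are made up, and `macro_f1` is a hypothetical helper written for illustration; `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` should give the same value.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        # Count true positives, false positives and false negatives for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Treat an undefined ratio (zero denominator) as 0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for illustration only (not from the hackathon dataset)
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 2, 2]
print(macro_f1(toy_true, toy_pred))
```

Because every class gets the same weight, a model can have decent accuracy and still a low macro F-score if it does badly on the small classes.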

In [None]:
# Calculate the accuracy score
score = metrics.accuracy_score(y_test, pred)
# Calculate the confusion matrix
# conf_matrix = metrics.confusion_matrix(y_test, pred)

print(score)
print(classification_report(y_test, pred))

0.3513853904282116
              precision    recall  f1-score   support

           0       0.19      0.07      0.10        42
           1       0.35      0.23      0.28       485
           2       0.38      0.37      0.37       119
           3       0.29      0.50      0.37        24
           4       0.35      0.38      0.37        95
           5       0.39      0.42      0.40       106
           6       0.44      0.39      0.41       174
           7       0.52      0.31      0.39       196
           8       0.13      0.08      0.10        38
           9       0.47      0.51      0.49        53
          10       0.18      0.28      0.22       132
          11       0.22      0.17      0.19        41
          12       0.26      0.36      0.30        14
          13       0.25      0.27      0.26        73
          14       0.27      0.43      0.33         7
          15       0.57      0.71      0.63        34
          16       0.31      0.33      0.32       254
        