# Intact Medical Data: Classifying Transcriptions with a Naive Bayes Model

### By: Daniyal, Hibah, Abhishek and Adam

In our CxC Data Hackathon project, Intact gave us medical transcription data, and our goal was to predict which of the 40 provided medical specialties each transcription should be assigned to. This is a multiclass classification problem. Submissions are judged on the macro F1-score. Here are the steps we took to maximize it:

### Step 1: Import Libraries and Read In the Training Dataset

In [32]:
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import _stop_words
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

## Reading in our data
df = pd.read_csv("./IntactInstructions/new_train.csv", index_col=0)
print("Dataset size with duplicates: ", len(df))

Dataset size with duplicates:  3969


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/daniyalmohammed/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/daniyalmohammed/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/daniyalmohammed/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/daniyalmohammed/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Step 2: Pre-process our data

This is one of the most important steps. The ML model is only as good as its dataset, so we're going to make sure it's clean.

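Most of the basic pre-processing is handled through the CountVectorizer and the custom preprocessor we pass to it. The tasks include:
- Tokenize (split the text into individual words; punctuation and other special characters are dropped at this stage)
- Remove stop-words (remove "the, and, to, or, ...")
- Lemmatize (convert similar words into their base form; eating, eats, ate => eat), which is the job of the custom preprocessor rather than of CountVectorizer itself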
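
The three steps listed above can be illustrated with a minimal plain-Python sketch. The stop-word set and the lemma table are hypothetical toy stand-ins, not the lists used by scikit-learn or NLTK, and the function names are made up for this example.

```python
import re

# Hypothetical, tiny stand-ins for a real stop-word list and a real lemmatizer.
STOP_WORDS = {"the", "and", "to", "or", "is", "a", "of"}
LEMMAS = {"eating": "eat", "eats": "eat", "ate": "eat"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens of 2+ letters."""
    return re.findall(r"[a-z]{2,}", text.lower())

def preprocess(text):
    """Tokenize, drop stop-words, then map each remaining token to its base form."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]

tokens = preprocess("The patient eats well and ate 2 meals.")
print(tokens)
```

Digits and punctuation disappear during tokenization, stop-words are filtered next, and only then are the surviving tokens reduced to a base form.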

In [33]:
# Create labels/target values
y = df.labels
print("Label size: ", len(y))
y

Label size:  3969


0       0
1       1
2       1
3       2
4       0
       ..
3995    4
3996    1
3997    1
3998    5
3999    1
Name: labels, Length: 3969, dtype: int64

In [34]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df["transcription"], y, test_size=0.6, random_state=42)

# X_train: training data of features
print("X_train size: ", len(X_train))
# y_train: training data of label
print("y_train size: ", len(y_train))

# X_test: test data of features
print("X_test size: ", len(X_test))
# y_test: test data of label
print("y_test size: ", len(y_test))

# X_train
# y_train[:50]
# X_test
# y_test

X_train size:  1587
y_train size:  1587
X_test size:  2382
y_test size:  2382


In [35]:
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()


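# Custom pre-processing function; CountVectorizer applies it to each raw document
def preprocess_data(text):
    text = text.lower()
    text = re.sub(r'\d|_', '', text) # removes digits and '_'
    # NOTE: lemmatize() expects a single word, so on a whole document it
    # returns the text essentially unchanged (individual tokens are not lemmatized)
    text = wordnet_lemmatizer.lemmatize(text)
    return text

# Initialize a CountVectorizer object:
#   max_df=0.2 drops tokens appearing in more than 20% of documents,
#   min_df=20 drops tokens appearing in fewer than 20 documents,
#   ngram_range=(1, 1) keeps single words only
count_vectorizer = CountVectorizer(stop_words="english", preprocessor=preprocess_data, max_df=0.2, min_df=20, ngram_range=(1, 1))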

print(type(count_vectorizer))

<class 'sklearn.feature_extraction.text.CountVectorizer'>
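
The `preprocess_data` function above passes the whole document to the lemmatizer, which is designed to work on one word at a time. A possible alternative, sketched here under assumptions, is to lemmatize each word separately. The helper `make_preprocessor` and the toy lemma table are hypothetical and exist only so the example runs without NLTK data; this is not the preprocessing used for the results in this notebook.

```python
import re

def make_preprocessor(lemmatize_word):
    """Build a CountVectorizer-style preprocessor from a word-level lemmatizer."""
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"\d|_", "", text)  # same digit/underscore removal as above
        # Lemmatize each alphabetic run separately; punctuation is left in place.
        return re.sub(r"[a-z]+", lambda m: lemmatize_word(m.group(0)), text)
    return preprocess

# Toy word-level lemmatizer (hypothetical), standing in for a real one.
toy_lemmas = {"abnormalities": "abnormality", "admits": "admit"}
preprocess = make_preprocessor(lambda word: toy_lemmas.get(word, word))

result = preprocess("Patient_1 admits abnormalities.")
print(result)
```

With NLTK, one could pass `WordNetLemmatizer().lemmatize` as `lemmatize_word`; by default it treats every word as a noun, so verb forms would need a part-of-speech argument. Swapping this in would change the learned vocabulary and therefore the scores reported in this notebook.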


### Step 3: Fit and Transform the Data

Specifically, we must fit AND transform the training features, but only transform the test features.
This is a preliminary step before training.

With a CountVectorizer, fit_transform() learns the vocabulary from the training data (applying the stop-word, max_df and min_df filters) and then converts each training document into a vector of token counts. We only need transform() for the test data because it must be encoded with the vocabulary learned from the training data: words that never appeared in training are ignored, and no information from the test set leaks into the features.


In [36]:
# Fit and transform the TRAINING data using only the 'transcription' column values
count_train = count_vectorizer.fit_transform(X_train.values)
# Transform the TEST data using only the 'transcription' column values
count_test = count_vectorizer.transform(X_test.values)


# Print the vocabulary size (number of distinct tokens kept)
print("Number of words: ", len(count_vectorizer.get_feature_names_out())) # learned from the training split
# Print the first 500 features (individual tokens) of the count_vectorizer
print(count_vectorizer.get_feature_names_out()[:500])

Number of words:  2451
['abc' 'abcd' 'abdominal' 'ability' 'able' 'abnormal' 'abnormalities'
 'abnormality' 'abscess' 'absent' 'abuse' 'ac' 'access' 'accident'
 'accommodate' 'accommodation' 'accompanied' 'accomplished' 'according'
 'ace' 'achieved' 'acid' 'active' 'activities' 'activity' 'actually'
 'acute' 'adaptic' 'add' 'added' 'addition' 'additional' 'additionally'
 'address' 'adenocarcinoma' 'adenopathy' 'adequate' 'adequately'
 'adhesions' 'adjacent' 'administered' 'administration' 'admission'
 'admit' 'admits' 'admitted' 'admitting' 'adnexal' 'adrenal' 'adult'
 'advance' 'advanced' 'advised' 'afebrile' 'affect' 'african' 'afternoon'
 'age' 'aggressive' 'ago' 'agree' 'agreed' 'ahead' 'aid' 'air' 'airway'
 'albumin' 'albuterol' 'alcohol' 'alert' 'alignment' 'alkaline' 'allergic'
 'allergies' 'allergy' 'allis' 'allograft' 'allow' 'allowed' 'allowing'
 'alt' 'alternative' 'alternatives' 'ambulate' 'ambulation' 'american'
 'amounts' 'amoxicillin' 'analysis' 'anastomosis' 'anatomic' 

### Step 4: Fine-tune Parameters and Choose the Best Classification Model

Fine-tuning parameters and selecting the best classification model matters because both choices can substantially change a model's performance.

There are typically many algorithms and parameter settings available for training a classifier. Different algorithms suit different types of data, and adjusting the parameters of a given algorithm can also have a significant impact on its performance.

Fine-tuning means adjusting the settings that control how the model learns and makes predictions, such as the learning rate, regularization strength, or the number of hidden layers in a neural network. Well-chosen settings help the model learn the underlying patterns in the data and generalize to new data.

Similarly, choosing the best classification model means selecting the algorithm best suited to the problem at hand. Some algorithms work better on binary problems, while others are more appropriate for multi-class tasks like ours. Below, we compare several candidates with a grid search over their hyperparameters, scored by cross-validated macro F1.


In [37]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
import warnings
warnings.filterwarnings("ignore")

# Load the CSV file into a pandas dataframe
df1 = pd.read_csv('./IntactInstructions/new_train.csv')

# Split the data into training and testing sets (using df1, the dataframe loaded above)
X_train1, X_test1, y_train1, y_test1 = train_test_split(df1['transcription'], df1['medical_specialty'], test_size=0.2, random_state=42)

# Create a CountVectorizer to convert the text into a bag-of-words representation
vectorizer1 = CountVectorizer(max_features=10000)
X_train_vectors1 = vectorizer1.fit_transform(X_train1)
X_test_vectors1 = vectorizer1.transform(X_test1)

# Define the hyperparameters to tune for each classifier
nb_params = {'alpha': [0.1, 0.5, 1.0]}
#svm_params = {'C': [0.1, 0.5, 1.0]}
dt_params = {'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
#gb_params = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15], 'learning_rate': [0.1, 0.5, 1.0]}
#rf_params = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
knn_params = {'n_neighbors': [3, 5, 7]}
#mlp_params = {'hidden_layer_sizes': [(100,), (100, 50), (100, 50, 25)], 'alpha': [0.0001, 0.001, 0.01]}
#xgb_params = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15], 'learning_rate': [0.1, 0.5, 1.0]}

# Train and evaluate each classifier using grid search and cross-validation
classifiers = [('Naive Bayes', MultinomialNB(), nb_params),
               #('Support Vector Machines', LinearSVC(), svm_params),
               ('Decision Tree', DecisionTreeClassifier(), dt_params),
               #('Gradient Boosting', GradientBoostingClassifier(), gb_params),
               #('Random Forest', RandomForestClassifier(), rf_params),
               ('K-Nearest Neighbors', KNeighborsClassifier(), knn_params),
               #('Multi-Layer Perceptron', MLPClassifier(), mlp_params),
               #('XGBoost', XGBClassifier(), xgb_params)
               ]

for name, clf, params in classifiers:
    print(f'starting... {name}')
    grid_search = GridSearchCV(clf, params, cv=5, n_jobs=-1, scoring='f1_macro')
    grid_search.fit(X_train_vectors1, y_train1)
    y_pred1 = grid_search.predict(X_test_vectors1)
    f1_macro = f1_score(y_test1, y_pred1, average='macro')
    print(f'Best parameters: {grid_search.best_params_}')
    print(f'Testing F1-macro score: {f1_macro:.3f}')
    print('-' * 80)


starting... Naive Bayes
Best parameters: {'alpha': 0.1}
Testing F1-macro score: 0.276
--------------------------------------------------------------------------------
starting... Decision Tree
Best parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 10}
Testing F1-macro score: 0.081
--------------------------------------------------------------------------------
starting... K-Nearest Neighbors
Best parameters: {'n_neighbors': 5}
Testing F1-macro score: 0.140
--------------------------------------------------------------------------------


Note that we have commented out some of the classifiers to keep the runtime manageable.

### Step 5: Train our models here

From the comparison above, Multinomial Naive Bayes achieved the highest macro F1-score of the classifiers we ran, so we use it to classify our labels.

In [38]:
# Instantiate a Multinomial Naive Bayes classifier
nb_clf = MultinomialNB(alpha=0.4)
# Fit the classifier to the training data
nb_clf.fit(count_train, y_train)
# Create the predicted tags
pred = nb_clf.predict(count_test)

# Print the predictions for each row of the held-out test split
print("Number of predictions: ", len(pred)) # Equal to the number of test data (when it got split)
print(pred)

Number of predictions:  2382
[ 7  6 34 ...  9  3 20]
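
The `alpha` argument of `MultinomialNB` controls additive smoothing of the per-class word probabilities. A minimal plain-Python sketch of that formula for a single class follows; the word counts and the helper name `smoothed_word_probs` are hypothetical and chosen only for illustration.

```python
def smoothed_word_probs(word_counts, alpha):
    """Estimate P(word | class) as (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    denom = total + alpha * vocab_size
    return {word: (count + alpha) / denom for word, count in word_counts.items()}

# Hypothetical word counts for one class; "knee" never occurred in it.
counts = {"cardiac": 3, "knee": 0, "pain": 1}
probs = smoothed_word_probs(counts, alpha=0.4)
print(probs)
```

Without smoothing (`alpha=0`), a word that never occurred in a class would get probability zero and rule that class out for any document containing it; a positive `alpha` keeps every probability above zero.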



### Step 6: Evaluate the model

We compute the accuracy score and print a per-class classification report (the confusion matrix call is left commented out).

Precision = TP/(TP + FP)

Recall = TP/(TP+FN)

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
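
Since the competition metric is the macro F1-score, it may help to see how the formulas above combine across classes: F1 is computed per class and then averaged with equal weight per class. The sketch below is a plain-Python illustration on made-up labels; the function names are assumptions, not library calls.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as 'positive' and all others as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)

# Made-up labels: class 0 is common, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case, five of six predictions are correct, yet the macro F1 is much lower than the accuracy because the single missed rare class counts as much as the common one. This is why a model can look acceptable on accuracy and still score poorly on macro F1 when classes are imbalanced.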

In [39]:
# Calculate the accuracy score
score = metrics.accuracy_score(y_test, pred)
# Calculate the confusion matrix
# conf_matrix = metrics.confusion_matrix(y_test, pred)

print(score)
print(classification_report(y_test, pred))

0.35684298908480266
              precision    recall  f1-score   support

           0       0.29      0.10      0.14        42
           1       0.36      0.20      0.25       485
           2       0.38      0.34      0.36       119
           3       0.30      0.54      0.39        24
           4       0.35      0.38      0.37        95
           5       0.41      0.47      0.44       106
           6       0.44      0.41      0.43       174
           7       0.47      0.34      0.39       196
           8       0.27      0.24      0.25        38
           9       0.48      0.57      0.52        53
          10       0.21      0.33      0.25       132
          11       0.21      0.17      0.19        41
          12       0.28      0.50      0.36        14
          13       0.23      0.25      0.24        73
          14       0.20      0.14      0.17         7
          15       0.60      0.74      0.66        34
          16       0.30      0.30      0.30       254
       

### Step 7: Try the Test Data and Get the Predictions

Thus, our final results indicate a macro average of about 0.35 on the held-out split (the accuracy printed above is 0.357). Given the noisy data set and the many duplicate entries, this is likely close to the best macro F1-score we could reach with this approach. A natural next step would be to clean the data further.

In [40]:
import pandas as pd

# Load the test data
test_data = pd.read_csv("./IntactInstructions/new_test.csv")

# Vectorize the test data with the vocabulary learned from the training data
test_counts = count_vectorizer.transform(test_data["transcription"])

# Make predictions on the test data
test_preds = nb_clf.predict(test_counts)

# Format the predictions as "row_index,predicted_label" lines
output = "".join(f"{i},{label}\n" for i, label in enumerate(test_preds))

# Write the output to a file
with open("predictions.csv", "w") as file:
    file.write(output)
