## Why Multinomial Naive Bayes
Naive Bayes is based on Bayes' theorem. "Naive" refers to the assumption that the features in the dataset are conditionally independent given the class, i.e. the occurrence of one feature does not affect the probability of occurrence of another feature.
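
Written out, the independence assumption turns the class posterior into a product of per-feature terms. The symbols here are assumptions introduced for illustration: $c$ is a class (a value of `Package`) and $x_1, \dots, x_n$ are the features of one document.

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$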

* Can outperform more complex techniques on smaller datasets
* Robust to irrelevant features
* Fast to train and often reasonably accurate
* Often performs well on text classification problems

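Multinomial Naive Bayes works on a feature vector in which each component is the frequency of a term in the document. Fractional weights such as TF-IDF, which this notebook uses, are also known to work well in practice.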
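
As a minimal sketch of the term-frequency idea, the example below fits `MultinomialNB` on raw counts from `CountVectorizer`. The corpus and the labels `"cardio"` and `"ortho"` are hypothetical and are not taken from the dataset used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "chest pain and shortness of breath",
    "knee pain after fall",
    "chest xray shows clear lungs",
    "knee replacement surgery",
]
labels = ["cardio", "ortho", "cardio", "ortho"]

# Each row is a document, each column a term, each value a term count.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(counts, labels)

# New text must go through the same fitted vocabulary before prediction.
new_doc_counts = vectorizer.transform(["knee pain"])
prediction = clf.predict(new_doc_counts)[0]
print(prediction)
```

Swapping `CountVectorizer` for `TfidfVectorizer` leaves the rest of the code unchanged, since both produce a sparse document-term matrix.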

In [1]:
import numpy as np
import pandas as pd

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, classification_report, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

**CONSTANTS**

In [3]:
NUM_SPLITS = 5
RANDOM_STATE = 42

**LOADING DATA**

In [4]:
df = pd.read_csv("../data/processed/processed_data.csv")

**MAPPING TARGETS**

In [5]:
df["Package"].value_counts()

 Surgery                          1088
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        371
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  224
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Urology                           156
 Obstetrics / Gynecology           155
 Discharge Summary                 108
 ENT - Otolaryngology               96
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    61
 Psychiatry / Psychology            53
 Office Notes                       50
 Podiatry                           47
 Dermatology                        29
 Dentistry                          27
 Cosmetic / Plastic Surge

In [6]:
strings = sorted(df["Package"].unique())
labels = list(range(len(strings)))
mapping = dict(zip(strings, labels))

In [7]:
df["Package"] = df["Package"].map(mapping)
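
Because the model now predicts integers, it can be handy to keep an inverse lookup for turning predictions back into category names. The sketch below uses a hypothetical miniature of the `Package` column; `inverse_mapping` is a name introduced here and does not appear elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical miniature of the "Package" column.
packages = pd.Series(["Surgery", "Radiology", "Surgery", "Neurology"])

# Same construction as above: sorted unique names mapped to 0..k-1.
strings = sorted(packages.unique())
mapping = dict(zip(strings, range(len(strings))))

# Inverse lookup: integer label back to the original category name.
inverse_mapping = {label: name for name, label in mapping.items()}

encoded = packages.map(mapping)
decoded = encoded.map(inverse_mapping)
print(decoded.tolist())
```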

**STRATIFIED K-FOLD**

In [8]:
kfold = StratifiedKFold(n_splits=NUM_SPLITS, shuffle=True, random_state=RANDOM_STATE)
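
Stratification matters here because the classes are heavily imbalanced. The sketch below, on hypothetical labels (80 samples of one class and 20 of another), checks that every validation fold keeps the same minority share as the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fraction of class 1 in each validation fold.
minority_shares = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]
print(minority_shares)
```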

**METRICS**

In [9]:
def print_metrics(y_true, y_pred):
    print(f"ACCURACY: {accuracy_score(y_true, y_pred)}")
    print(f"MCC: {matthews_corrcoef(y_true, y_pred)}")
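
One plausible reason to report MCC next to accuracy is class imbalance: a model that always predicts the largest class can look acceptable on accuracy while carrying no information. A small sketch on hypothetical labels illustrates the difference.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical labels: 90 samples of class 0, 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
mcc = matthews_corrcoef(y_true, y_majority)
print(f"ACCURACY: {acc}")
print(f"MCC: {mcc}")
```

Accuracy rewards the majority-class guess, while MCC stays at zero because the predictions do not vary with the true labels.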

**TRAINING**

In [10]:
# Out-of-fold predictions: each row is filled in by the fold model that did not train on it.
oof_preds = np.zeros(len(df))

In [11]:
for fold, (train_index, valid_index) in enumerate(kfold.split(df["Medical_Description"], df["Package"])):
    print("*" * 40)
    print("*" + " " * 16 + f"FOLD {fold + 1}" + " " * 16 + "*")
    print("*" * 40)

    # Split the frame into this fold's training and validation rows.
    X_train = df.iloc[train_index, :].reset_index(drop=True)
    X_valid = df.iloc[valid_index, :].reset_index(drop=True)

    y_train = X_train["Package"]
    y_valid = X_valid["Package"]

    # Fit TF-IDF on the training rows only, then reuse that vocabulary for validation.
    vec = TfidfVectorizer()
    train_term_doc = vec.fit_transform(X_train["Medical_Description"])
    valid_term_doc = vec.transform(X_valid["Medical_Description"])

    naive_bayes = MultinomialNB()
    naive_bayes.fit(train_term_doc, y_train)

    valid_preds = naive_bayes.predict(valid_term_doc)
    print_metrics(y_valid, valid_preds)

    # Store the predictions at their original row positions.
    oof_preds[valid_index] = valid_preds

    # Persist this fold's vectorizer and classifier (file names use the 0-based fold index).
    joblib.dump(vec, f"../pickles/tfidf_{fold}.joblib")
    joblib.dump(naive_bayes, f"../models/classifier_{fold}.joblib")

****************************************
*                FOLD 1                *
****************************************
ACCURACY: 0.3199195171026157
MCC: 0.2193310661692734
****************************************
*                FOLD 2                *
****************************************
ACCURACY: 0.32225579053373615
MCC: 0.22028884595015077
****************************************
*                FOLD 3                *
****************************************
ACCURACY: 0.3192346424974824
MCC: 0.2171551789249324
****************************************
*                FOLD 4                *
****************************************
ACCURACY: 0.32124874118831825
MCC: 0.21922927228272304
****************************************
*                FOLD 5                *
****************************************
ACCURACY: 0.32225579053373615
MCC: 0.22101595709728927
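
The saved vectorizer and classifier of a fold have to be loaded as a pair, since the classifier only understands the vocabulary its own vectorizer was fitted on. The following is a minimal round-trip sketch on hypothetical toy data, written to a temporary directory rather than to the `../pickles` and `../models` paths used above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for one training fold.
docs = ["chest pain", "knee pain", "chest xray", "knee surgery"]
labels = [0, 1, 0, 1]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(docs), labels)
original_preds = clf.predict(vec.transform(docs))

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, "tfidf_0.joblib")
    clf_path = os.path.join(tmp_dir, "classifier_0.joblib")

    # Save both artifacts, mirroring the per-fold naming used in the loop.
    joblib.dump(vec, vec_path)
    joblib.dump(clf, clf_path)

    # Load them back as a matching pair.
    loaded_vec = joblib.load(vec_path)
    loaded_clf = joblib.load(clf_path)

# New text is transformed with the loaded vectorizer, then classified.
reloaded_preds = loaded_clf.predict(loaded_vec.transform(docs))
print(reloaded_preds)
```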


In [12]:
print_metrics(df["Package"].tolist(), oof_preds)

ACCURACY: 0.32098268223922677
MCC: 0.2193265477989161
