# Introduction

There were 17 versions of the back-end portion of this project and 2 versions of the front-end. All of these drafts are located in the 'drafts' folder inside the 'Notebooks' folder. Below, I summarize the approach I took in each notebook and include some sample code. The sample code is taken out of its original context and is not meant to run on its own; it is included so that you don't have to open each notebook to see it.

Besides the front-end and back-end iterations, I also have a draft notebook for connecting to the UMLS API, which I used to retrieve the CPT descriptions shown in my web app. There is also a notebook for data exploration. I will do my best to cover the steps I took, but this is just an overview and is not exhaustive of everything I considered in the project. Most notebooks list the changes from the previous notebook at the top for easy reference.

# Notebook Iterations Explanation 

I put the accuracy score next to each notebook for quick reference. The accuracy gradually improved with each iteration. I summarize steps taken for each notebook below:
1. Imported the data and ran the initial Naive Bayes model. Filtered the dataset to just discharge summaries to improve accuracy. Ran the model on all CPT codes and got only 11% accuracy.
2. Limited the clinical notes to just those CPT codes with over 1,000 notes.
3. Limited the model to predicting just the 4 most frequent CPT codes.
4. Added a text-cleaning function to improve accuracy.
5. Compared the count vectorizer to the tf-idf vectorizer. The tf-idf vectorizer performed better.
6. Tried using a lemmatizer. It didn't seem to improve model performance, so I didn't use it. The code sample I used is below.
7. Label encoded the data and found that it doesn't affect performance, but it's good practice.
8. Did some hyperparameter tuning on the model and vectorizer.
9. Shuffled the data to improve accuracy and updated the stop words.
10. Limited the CPT codes to the first CPT code assigned.
11. Tried using sentence tokenization instead of word tokenization.
12. Tried using UMLS and cTAKES preprocessing for training. It didn't improve model accuracy.
13. Cleaned up the folder and the data.
14. Split the model out by CPT section and ran one model for each section.
15. Did more hyperparameter tuning.
16. Added diagnosis text to the clinical text to improve model performance by using more notes. There were diagnosis notes for only two CPT codes.
17. Combined the CPT and ICD notebooks into one and used grid search for hyperparameter tuning. An example is listed below.
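
The filtering and cleaning steps in the list above (first CPT code only, codes above a note-count threshold, the most frequent codes, and text cleaning) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from the notebooks: the records are made-up toy data, and `clean_text`, `first_code_only`, and `keep_frequent` are hypothetical helper names. The actual cleaning rules in the notebooks may differ.

```python
import re
from collections import Counter

# Hypothetical toy records: (note text, CPT codes assigned to the note)
records = [
    ("Admission Date: 2101-01-01  Pt c/o chest pain.", ["99291", "94003"]),
    ("Ventilator weaned over 2 days.", ["94003"]),
    ("Critical care provided for 45 min!", ["99291", "36556"]),
    ("Central line placed w/o complication.", ["36556"]),
    ("Vent settings adjusted overnight.", ["94003", "99232"]),
    ("Critical care, pt unstable.", ["99291"]),
]


def clean_text(text):
    """Lowercase, replace anything that is not a letter or whitespace, collapse spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def first_code_only(records):
    """Keep only the first CPT code listed for each note."""
    return [(text, codes[0]) for text, codes in records if codes]


def keep_frequent(pairs, min_notes=1, top_k=None):
    """Keep notes whose code has at least min_notes notes, optionally only the top_k codes."""
    counts = Counter(code for _, code in pairs)
    ranked = [code for code, n in counts.most_common() if n >= min_notes]
    keep = set(ranked[:top_k])  # ranked[:None] keeps every code
    return [(text, code) for text, code in pairs if code in keep]


pairs = first_code_only(records)
frequent = keep_frequent(pairs, min_notes=2)  # the notebooks kept codes with over 1000 notes
top_one = keep_frequent(pairs, top_k=1)       # the notebooks kept the 4 most frequent codes
cleaned = [(clean_text(text), code) for text, code in frequent]

print(Counter(code for _, code in pairs))
print(cleaned[0])
```

Reducing the label set this way trades coverage for accuracy: the model sees fewer classes, each with more examples.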

# Code Snippet from CPT Notebook Iteration #6

In [None]:
import nltk
from nltk import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant ('' if none)."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

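def my_tokenizer(text):
    """Tokenize the text and lemmatize each token using its part of speech."""
    lemmatizer = WordNetLemmatizer()

    tokens = word_tokenize(text)
    # Tag each token, then convert the Treebank tag to a WordNet POS
    tokens_tag = nltk.pos_tag(tokens)
    tokens_tag = [(i[0], get_wordnet_pos(i[1])) for i in tokens_tag]
    # Lemmatize tokens that have a WordNet POS; leave the rest unchanged
    final_tokens = [lemmatizer.lemmatize(i[0], i[1]) if i[1] != '' else i[0] for i in tokens_tag]
    return final_tokens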

# Run the stop words through my tokenizer so they match the lemmatized tokens
# (my_stop_words is defined elsewhere in the notebook)
my_stop_words_updated = [my_tokenizer(i)[0] for i in my_stop_words]

# Vectorize the data -----

# Build the tf-idf vectorizer with the lemmatizing tokenizer
tfidf_vectorizer = TfidfVectorizer(stop_words=my_stop_words_updated, max_df=0.7, tokenizer=my_tokenizer)

# Fit the vectorizer on the training data and transform it
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data
tfidf_test = tfidf_vectorizer.transform(X_test)
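
To make the tag mapping in `get_wordnet_pos` easier to follow in isolation, here is a small sketch that does the same lookup without NLTK. It assumes the WordNet constants are the single-letter strings NLTK uses (`wordnet.ADJ == 'a'`, `wordnet.VERB == 'v'`, `wordnet.NOUN == 'n'`, `wordnet.ADV == 'r'`); `WORDNET_POS` and `treebank_to_wordnet` are hypothetical names used only for this illustration.

```python
# WordNet POS constants written out as the single-letter strings NLTK uses.
# Keys are the first letter of the Penn Treebank tag (JJ, VBD, NNS, RB, ...).
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn Treebank tag, or '' if none applies."""
    return WORDNET_POS.get(tag[:1], "")


for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", repr(treebank_to_wordnet(tag)))
```

Tags with no WordNet equivalent, such as `IN` (preposition), map to the empty string. In the tokenizer above, those tokens are passed through without lemmatization.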

# Code Snippet from Notebook Iteration #17 - Hyperparameter Tuning

In [None]:
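from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define stop words
# The last list comes from my top 100 most predictive words that I wanted to remove
my_stop_words = list(set(stopwords.words('english'))) \
                + ['admission', 'date', 'sex'] \
                + ['needed', 'every', 'seen', 'weeks', 'please', 'ml', 'unit', 'small', 'year', 'old', 'cm', 'non', 'mm', 'however']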

# Taken from: https://stackoverflow.com/questions/44066264/how-to-choose-parameters-in-tfidfvectorizer-in-sklearn-during-unsupervised-clust/44080802
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=my_stop_words)),
    ('clf', LogisticRegression(random_state=123)),
])
parameters = {
    # Earlier search rounds, commented out, with the best value noted on each line
    # 'tfidf__max_df': (.15, .2, .25, .3),  # .2 is the best param
    # 'tfidf__sublinear_tf': (True, False),  # True
    # 'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],  # (1, 2) is best
    # 'clf__alpha': (0.001, .01),  # .001 is best

    # 'tfidf__max_df': (.2, .5, .7),  # .7 is the best param
    # 'tfidf__ngram_range': [(1, 1), (1, 2)],  # (1, 2) is best
    # 'tfidf__min_df': (1, 2, 3),  # 2 is the best
    # 'clf__alpha': (0.001, .01, .1, .5),  # .001 is best

    # Current search
    'clf__C': (.8, .9, 1),  # 1
    # 'clf__solver': ('liblinear', 'sag', 'saga'),  # sag
    'clf__max_iter': (25, 40, 50),  # 25
}

grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, verbose=3, scoring='accuracy')
grid_search_tune.fit(tt_dict['X_train_other'], tt_dict['y_train_other'])

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
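
For readers who want to try the grid search pattern without the clinical data, here is a minimal self-contained sketch. The texts and labels are made-up toy data, and the parameter grid is much smaller than the one above, so the selected values say nothing about the real project. It also shows `best_params_` and `best_score_`, which are often easier to read than the full `best_estimator_.steps` output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the clinical notes and their labels
texts = [
    "chest pain and shortness of breath",
    "chest xray shows clear lungs",
    "shortness of breath on exertion",
    "chest pain radiating to left arm",
    "fracture of the left wrist",
    "wrist splint applied after fall",
    "left ankle fracture after fall",
    "ankle splint and follow up",
]
labels = ["cardio"] * 4 + ["ortho"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=123)),
])

# Step name + double underscore + parameter name routes each value to the right step
parameters = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.8, 1.0],
}

search = GridSearchCV(pipeline, parameters, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

With `cv=2`, each parameter combination is scored on two folds, which keeps the search fast but makes the scores noisy on small data.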