# Medical Transcriptions Categorization
Freya Gray
CS39AA - Natural Language Processing
<br>
This project uses medical transcripts to try and classify the medical specialty that the transcript originated from. The dataset is from the Medical Transcriptions dataset on [Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions). 
This notebook will implement a baseline model for the classification problem

## Imports

In [29]:
import numpy as np 
import pandas as pd 
import spacy 
import nltk
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\freya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Import Dataset

In [30]:
df = pd.read_csv('mtsamples.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


## Clean Dataset

Remove nonessential columns and drop null entries

In [31]:
df.drop(['Unnamed: 0','description','sample_name','keywords'], axis = 1, inplace = True)
df.dropna(inplace = True)
df.reset_index(drop = True, inplace = True)
df.describe()

Unnamed: 0,medical_specialty,transcription
count,4966,4966
unique,40,2357
top,Surgery,"PREOPERATIVE DIAGNOSIS: , Low back pain.,POSTO..."
freq,1088,5


Convert transcripts to lowercase, lemmatize and remove stopwords

In [32]:
nlp = spacy.load('en_core_web_lg')

from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
import string
punct=string.punctuation

def text_data_cleaning(sentence):
    doc = nlp(sentence)
    
    tokens = []
    for token in doc:
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)
    
    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens

In [33]:
tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)
#classifier = svm.LinearSVC()
classifier = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

## Create Train and Test Sets

In [34]:

x_train, x_test, y_train, y_test = train_test_split(df['transcription'], df['medical_specialty'],test_size=0.3)

In [35]:
Encoder = LabelEncoder()
y_train = Encoder.fit_transform(y_train)
y_test = Encoder.fit_transform(y_test)

## Label Encoding


In [36]:
clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])

In [37]:
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       1.00      1.00      1.00         2
           2       0.00      0.00      0.00         6
           3       0.21      0.18      0.19       113
           4       0.00      0.00      0.00         3
           5       0.19      0.39      0.25       137
           6       0.00      0.00      0.00         8
           7       0.00      0.00      0.00         7
           8       0.00      0.00      0.00         7
           9       0.00      0.00      0.00         4
          10       0.26      0.35      0.30        34
          11       0.21      0.16      0.18        37
          12       0.00      0.00      0.00        26
          13       0.00      0.00      0.00         5
          14       0.06      0.03      0.04        75
          15       0.05      0.02      0.03        86
          16       0.06      0.04      0.05        26
          17       0.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [38]:
print(accuracy_score(y_test, y_pred))

0.20469798657718122


## Model 