# Glia.ConversationAI.NLP.TextClassifier
## Aliasgher Dalal
## 2023-11-20

This project was completed as a take-home assignment from Glia Inc., given as part of the assessment for the role of Data Scientist: Conversational AI.

## Libraries

In [None]:
import pandas as pd
import numpy as np
import csv
import os
import re

from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
cVectorize = CountVectorizer()

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import collections
from collections import defaultdict

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
# Import Keras through TensorFlow's public API only, to avoid mixing it with the standalone keras package
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import utils as np_utils

from tkinter import *
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
import scipy


## Business Understanding & Requirements

The business goal is to develop an application that categorizes a medical abstract into one of five conditions. Each class represents the current medical condition of a patient. The classes are:

1. neoplasms
2. digestive system diseases
3. nervous system diseases
4. cardiovascular diseases
5. general pathological conditions

To aid development, a dataset containing ~28k labelled abstracts is provided.

Initial analysis suggests that text classification, a supervised machine learning technique, fits this problem: given an abstract, the application returns the condition the abstract most likely refers to, and hence the condition the associated patient is currently afflicted with.
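
The mapping from a predicted label to a condition name can be sketched as follows. This is a minimal illustration: `CONDITIONS` and `label_to_condition` are hypothetical names, not part of the notebook, and the ids and names are the five classes listed above.

```python
# Hypothetical lookup from the dataset's integer labels to condition names.
CONDITIONS = {
    1: "neoplasms",
    2: "digestive system diseases",
    3: "nervous system diseases",
    4: "cardiovascular diseases",
    5: "general pathological conditions",
}


def label_to_condition(label):
    """Return the condition name for a predicted label (1-5)."""
    key = int(label)  # also accepts NumPy integer types
    if key not in CONDITIONS:
        raise ValueError(f"unknown condition label: {label!r}")
    return CONDITIONS[key]


print(label_to_condition(4))
```

A classifier's output (an integer in 1-5) would be passed through such a lookup before being shown to a user.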

## Data Understanding & Processing

This section ingests, analyses, and prepares the data for the next stage, machine learning modelling.

The data, the Kaggle medical abstracts dataset, is ingested and passed through several cleaning stages, including checking for missing data. The abstract text is then converted to lower case and split into word tokens. Stopwords are removed and the remaining tokens are lemmatized before being joined back into a string. Each resulting string is a processed medical abstract ready for vectorization.

Finally, the processed text is vectorized for input into the machine learning model(s).
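
The cleaning steps described above can be sketched on a single sentence. This is an illustration only: `clean_text`, `toy_stop_words`, and `toy_lemmas` are hypothetical stand-ins so that the example runs without NLTK data. The notebook itself uses NLTK's English stopword list and `WordNetLemmatizer`.

```python
def clean_text(text, stop_words, lemmatize):
    """Lower-case, split on whitespace, drop stopwords, lemmatize, re-join."""
    stop = set(stop_words)
    tokens = text.lower().split()
    tokens = [tok for tok in tokens if tok not in stop]
    tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)


# Toy stand-ins for the NLTK stopword list and lemmatizer (assumptions).
toy_stop_words = {"the", "of", "in", "were", "a"}
toy_lemmas = {"tumors": "tumor", "patients": "patient"}

cleaned = clean_text(
    "The tumors were found in patients",
    toy_stop_words,
    lambda word: toy_lemmas.get(word, word),
)
print(cleaned)  # tumor found patient
```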



### Data Ingestion
#### Source: https://www.kaggle.com/datasets/chaitanyakck/medical-text/

In [None]:
#Ingest data from csv into pandas dataframe
data_folder = "/Users/aliasgherdalal/Documents/Glia/dataset"
fLabel = data_folder + "/medical_tc_labels.csv"
fTrain = data_folder + "/medical_tc_train.csv"
fTest  = data_folder + "/medical_tc_test.csv"

dfLabel=pd.read_csv(fLabel)
dfTrain=pd.read_csv(fTrain)
dfTest=pd.read_csv(fTest)


In [None]:
print(dfLabel.head(),'\n',dfLabel.shape)

In [None]:
print(dfTrain.head(),'\n',dfTrain.shape)
dfTrain['condition_label'].value_counts().plot.bar()


In [None]:
print(dfTest.head(),'\n',dfTest.shape)
dfTest['condition_label'].value_counts().plot.bar()


### Data Processing

#### Basic Processing

In [None]:
#Missing Values

print('Training Set: \n',dfTrain.isna().sum(),'Test Set: \n',dfTest.isna().sum(),'Labels: \n',dfLabel.isna().sum())


#### Text processing

In [None]:
def listify(df):
    # Converts a pandas column (Series) into a plain list for processing.
    return df.tolist()

# Requires the NLTK 'stopwords' and 'wordnet' resources (see nltk.download).
# Build the stopword set and the lemmatizer once; both functions below share them.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def dataProcessorString(dfString):
    # Performs text processing on a text string (a medical abstract) and returns the processed string
    # for vectorizing, at training or inference time.
    tempDoc = dfString.lower().split()  # lower case, then split into words
    tempDoc = [word for word in tempDoc if word not in stop_words]  # remove stopwords
    tempDoc = [lemmatizer.lemmatize(word) for word in tempDoc]  # lemmatize
    return ' '.join(tempDoc)  # recreate the doc

def dataProcessor(dfList):
    # Performs text processing on a list (or pandas Series) of medical abstracts and returns
    # a list of processed strings for vectorizing.
    return [dataProcessorString(doc) for doc in dfList]

In [None]:
dfListTrain=dataProcessor(dfTrain['medical_abstract'])
dfListTest=dataProcessor(dfTest['medical_abstract'])


#### Vectorize Dataset

In [None]:
def vectorizerFit(dfX,dfY):
    dfX=cVectorize.fit_transform(dfX)
    dfY=dfY.tolist()
    dfY=np.array(dfY)
    return dfX,dfY
x_train_counts,condition_label_list_train=vectorizerFit(dfListTrain,dfTrain['condition_label'])
def vectorizerTransform(dfX,dfY):
    dfX = cVectorize.transform(dfX)
    dfY=dfY.tolist()
    dfY=np.array(dfY)
    return dfX,dfY
x_test_counts,condition_label_list_test=vectorizerTransform(dfListTest,dfTest['condition_label'])
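
To illustrate the fit/transform split used above, here is a pure-Python sketch of bag-of-words counting. It is a simplification, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helpers, and tokens are split on whitespace, whereas `CountVectorizer` applies its own token pattern. The vocabulary is learned from the training documents only, so tokens that appear only in the test set are ignored.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Turn each doc into a row of token counts; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_docs = ["tumor growth tumor", "heart failure"]
test_docs = ["heart tumor stroke"]  # "stroke" never occurs in training

vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)

print(vocab)         # {'failure': 0, 'growth': 1, 'heart': 2, 'tumor': 3}
print(train_counts)  # [[0, 1, 0, 2], [1, 0, 1, 0]]
print(test_counts)   # [[0, 0, 1, 1]]
```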

#### TF-IDF

In [None]:
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
x_test_tfidf = tfidf_transformer.transform(x_test_counts)
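
For reference, the weighting that `TfidfTransformer` applies with its default settings (smoothed IDF, L2 row normalization) can be sketched by hand. `fit_idf` and `apply_tfidf` are hypothetical helpers operating on plain lists rather than sparse matrices. As in the cell above, the IDF weights are learned from the training counts and reused for the test counts. Note that the models later in this notebook are fitted on the raw counts, not on these TF-IDF features.

```python
import math


def fit_idf(count_rows):
    """Smoothed IDF per column: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    return [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def apply_tfidf(count_rows, idf):
    """Weight counts by IDF, then scale each row to unit Euclidean length."""
    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


toy_train = [[2, 1, 0], [0, 1, 1]]  # column 1 occurs in every document
toy_test = [[1, 1, 0]]

idf = fit_idf(toy_train)
train_tfidf = apply_tfidf(toy_train, idf)
test_tfidf = apply_tfidf(toy_test, idf)
print(idf)
print(test_tfidf)
```

In the toy data the term in column 1 appears in every training document, so it receives the lowest IDF and contributes less to each row than the rarer terms do.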


## Modelling

### Model 1 - Naive Bayes

In [None]:
clf = MultinomialNB().fit(x_train_counts, condition_label_list_train)

y_score = clf.predict(x_test_counts)

# Share of test abstracts whose predicted label matches the true label
accuracyNB = metrics.accuracy_score(condition_label_list_test, y_score)
print("Accuracy: %.2f%%" % (accuracyNB * 100))
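
As a worked illustration of what a multinomial Naive Bayes classifier computes, the sketch below trains on four toy documents using add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. It is not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, and the toy documents and labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc.split() if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


toy_docs = ["tumor growth tumor", "tumor biopsy", "heart failure", "heart attack heart"]
toy_labels = [1, 1, 4, 4]  # 1 = neoplasms, 4 = cardiovascular diseases

prior, likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb("tumor biopsy growth", prior, likelihood))   # 1
print(predict_nb("heart failure stroke", prior, likelihood))  # 4
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for abstracts containing hundreds of tokens.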


### Model 2 - Logistic Regression

In [None]:
LR = LogisticRegression(solver="saga")
LR.fit(x_train_counts, condition_label_list_train)
pred_lr = LR.predict(x_test_counts)  # original note: an error was reported at this step

# Accuracy over the whole test set, without hard-coding its size (2888 rows)
accuracyLR = np.mean(pred_lr == condition_label_list_test)
print("Accuracy: %.2f%%" % (accuracyLR * 100))


### Model 3 - Deep Learning LSTM

In [None]:
dXTrain= dfListTrain
dXTest = dfListTest
dYTrain = dfTrain['condition_label'].to_list()
dYTest = dfTest['condition_label'].to_list()

num_classes = len(set(dYTrain)) # number of classes
max_words = 10000 # maximum number of words kept in the vocabulary
max_len = 250 # maximum length of each text, in words; longer texts are truncated, shorter ones padded
embedding_dim = 100 # dimension of word embeddings
lstm_units = 64 # number of units in the LSTM layer
epochs = 10
batch_size = 16

# Fit the tokenizer on the training set only, then convert to padded sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(dXTrain)
sequences = tokenizer.texts_to_sequences(dXTrain)
dXTrain = pad_sequences(sequences, maxlen=max_len)

# Convert the test set with the same tokenizer. Refitting on the test data can
# change the word index that the training sequences were built with.
sequences = tokenizer.texts_to_sequences(dXTest)
dXTest = pad_sequences(sequences, maxlen=max_len)


# Shift labels from 1-5 to 0-4; sparse_categorical_crossentropy expects zero-based class ids
dYTrain = np.array(dYTrain) - 1
dYTest = np.array(dYTest) - 1
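
The effect of `pad_sequences` with its default settings (padding and truncation both at the start of the sequence, pad value 0) can be sketched in plain Python, together with the label shift. `pad_pre` is a hypothetical helper written for this illustration, and the toy sequences are invented.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    trimmed = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


toy_sequences = [[5, 2, 9], [7, 1, 4, 3, 8, 6]]
padded = [pad_pre(seq, maxlen=4) for seq in toy_sequences]
print(padded)  # [[0, 5, 2, 9], [4, 3, 8, 6]]

# Labels in the dataset run from 1 to 5; the loss expects 0 to 4.
toy_labels = [1, 5, 3]
zero_based = [label - 1 for label in toy_labels]
print(zero_based)  # [0, 4, 2]
```

With `max_len = 250`, abstracts longer than 250 tokens therefore lose their opening words rather than their closing ones.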


#### Model 3.1

In [None]:
modelDL1 = Sequential()
modelDL1.add(Embedding(max_words, embedding_dim, input_length=max_len))
modelDL1.add(SpatialDropout1D(0.2))
modelDL1.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
modelDL1.add(Dense(num_classes, activation='softmax'))
modelDL1.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
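# Keep the returned History object in its own variable so that modelDL1 still refers to the model
historyDL1 = modelDL1.fit(dXTrain, dYTrain, epochs=epochs, batch_size=batch_size,
                          validation_data=(dXTest, dYTest),
                          callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])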
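
The stopping rule configured through `EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)` can be approximated as follows. This is a simplified sketch of the rule, not the Keras implementation: `stopping_epoch` is a hypothetical helper and the validation losses are invented. Training stops once the validation loss has failed to improve on its best value by more than `min_delta` for `patience` consecutive epochs.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


toy_val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65, 0.64]
print(stopping_epoch(toy_val_losses))      # 6
print(stopping_epoch([0.9, 0.8, 0.7, 0.6]))  # None
```

In the toy run the best loss (0.65) is reached at epoch 3. Epochs 4 to 6 do not beat it by more than `min_delta`, so training stops at epoch 6 and the later improvement at epoch 7 is never seen.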

#### Model 3.2

In [None]:
modelDL2 = Sequential()
modelDL2.add(Embedding(max_words, embedding_dim, input_length=max_len))
modelDL2.add(LSTM(lstm_units))
modelDL2.add(Dense(num_classes, activation='softmax'))
modelDL2.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
# Keep the returned History object in its own variable so that modelDL2 still refers to the model
historyDL2 = modelDL2.fit(dXTrain, dYTrain, epochs=epochs, batch_size=batch_size,
                          validation_data=(dXTest, dYTest),
                          callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

In [None]:
# The original cell referred to an undefined `model`; modelDL2, trained in the previous cell, is assumed
print(modelDL2.metrics_names)
modelDL2.summary()  # summary() prints the table itself and returns None


In [None]:
labelsT = np.array(dfTest['condition_label'].to_list()) - 1  # zero-based labels
sequences = tokenizer.texts_to_sequences(dfListTest)
xT = pad_sequences(sequences, maxlen=max_len)
modelDL2.evaluate(x=xT, y=labelsT)
#result=modelDL2.predict(xT)
#result,labelsT

## Evaluation

### Model 1 - Naive Bayes

In [None]:
# A confusion matrix is a table that is used to evaluate the performance of a classification model. Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
cnf_matrix = metrics.confusion_matrix(condition_label_list_test, y_score)
print("Confusion matrix\n",cnf_matrix)

print("Accuracy:",metrics.accuracy_score(condition_label_list_test, y_score))
print("Precision:",metrics.precision_score(condition_label_list_test, y_score,average='weighted'))
print("Recall:",metrics.recall_score(condition_label_list_test, y_score,average='weighted'))
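
To make the reported numbers concrete, the sketch below builds a confusion matrix and the support-weighted precision and recall by hand on toy labels. `toy_confusion_matrix` and `weighted_scores` are hypothetical helpers, not scikit-learn functions. Rows are true labels and columns are predicted labels, the same orientation as `metrics.confusion_matrix`. One consequence worth knowing: support-weighted recall always equals accuracy, so the two printed values above will coincide.

```python
def toy_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def weighted_scores(matrix):
    """Return (accuracy, weighted precision, weighted recall)."""
    total = sum(sum(row) for row in matrix)
    precision = recall = 0.0
    for k, row in enumerate(matrix):
        support = sum(row)                     # true instances of class k
        predicted = sum(r[k] for r in matrix)  # instances predicted as class k
        tp = row[k]
        weight = support / total
        precision += weight * (tp / predicted if predicted else 0.0)
        recall += weight * (tp / support if support else 0.0)
    accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total
    return accuracy, precision, recall


toy_true = [1, 1, 2, 2, 3, 3]
toy_pred = [1, 2, 2, 2, 3, 1]

toy_matrix = toy_confusion_matrix(toy_true, toy_pred, labels=[1, 2, 3])
toy_accuracy, toy_precision, toy_recall = weighted_scores(toy_matrix)
print(toy_matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(toy_accuracy, toy_precision, toy_recall)
```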


In [None]:
#Plotting the confusion matrix
plt.figure(figsize=(14,12))
sns.heatmap(cnf_matrix, annot=True, fmt='d')  # fmt='d' prints the counts as integers
plt.title('Confusion Matrix - Naive Bayes')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()


### Model 2 - Logistic Regression

In [None]:
# A confusion matrix is a table that is used to evaluate the performance of a classification model. Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
cnf_matrix = metrics.confusion_matrix(condition_label_list_test, pred_lr)
print("Confusion matrix\n",cnf_matrix)

print("Accuracy:",metrics.accuracy_score(condition_label_list_test, pred_lr))
print("Precision:",metrics.precision_score(condition_label_list_test, pred_lr,average='weighted'))
print("Recall:",metrics.recall_score(condition_label_list_test, pred_lr,average='weighted'))


In [None]:
#Plotting the confusion matrix
plt.figure(figsize=(14,12))
sns.heatmap(cnf_matrix, annot=True, fmt='d')  # fmt='d' prints the counts as integers
plt.title('Confusion Matrix - Logistic Regression')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()


## Saving the Model

### Model 1 - Naive Bayes

In [None]:
import pickle

# Save the fitted vectorizer and the Naive Bayes model as pickle files
model_pkl_fileNB = "NBTextClass.pkl"
vec_file = 'vectorizer.pickle'
with open(vec_file, 'wb') as file:
    pickle.dump(cVectorize, file)
with open(model_pkl_fileNB, 'wb') as file:
    pickle.dump(clf, file)

In [None]:
# load model from pickle file
with open(model_pkl_fileNB, 'rb') as file:  
    modelUploaded = pickle.load(file)
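
One way the saved artifacts might be used for inference is sketched below. `load_pickle` and `classify_abstract` are hypothetical helpers, not part of the notebook. The sketch assumes only that the vectorizer exposes `transform` and the model exposes `predict`, as the scikit-learn objects saved above do. Pickle files should be loaded only from trusted sources, since unpickling can execute arbitrary code.

```python
import pickle


def load_pickle(path):
    """Load a pickled object from a trusted file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def classify_abstract(text, vectorizer, model, preprocess=None):
    """Clean one abstract, vectorize it, and return the predicted label."""
    cleaned = preprocess(text) if preprocess else text
    features = vectorizer.transform([cleaned])  # a single-document batch
    return model.predict(features)[0]


# Intended usage with the files written above (left commented out because it
# needs those files and the notebook's dataProcessorString function):
# vectorizer = load_pickle("vectorizer.pickle")
# model = load_pickle("NBTextClass.pkl")
# label = classify_abstract(raw_text, vectorizer, model, preprocess=dataProcessorString)
```

The same preprocessing and the same fitted vectorizer must be used at inference time as at training time. Otherwise the feature columns will not line up with what the model learned.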

### Model 2 - Logistic Regression

In [None]:
# Save the logistic regression model as a pickle file; the vectorizer saved above is reused
model_pkl_fileLR = "LRTextClass.pkl"
with open(model_pkl_fileLR, 'wb') as file:
    pickle.dump(LR, file)

In [None]:
# load model from pickle file
with open(model_pkl_fileLR, 'rb') as file:  
    modelUploaded = pickle.load(file)

### Model 3 - DL LSTM

#### Model 3.1

In [None]:
# Save LSTM model 3.1 as a pickle file (Keras also provides model.save for this purpose)
model_pkl_fileDL1 = "DL1TextClass.pkl"
with open(model_pkl_fileDL1, 'wb') as file:
    pickle.dump(modelDL1, file)

#### Model 3.2

In [None]:
# Save LSTM model 3.2 as a pickle file (Keras also provides model.save for this purpose)
model_pkl_fileDL2 = "DL2TextClass.pkl"
with open(model_pkl_fileDL2, 'wb') as file:
    pickle.dump(modelDL2, file)

# ****************************************************************************************